Sitemaps are one of the most important sources of new URLs for search engines. Search engines scan the URLs listed in sitemaps regularly and index those that meet their requirements. However, if sitemaps contain broken links or pages that return 3xx or 4xx status codes, they waste your crawl budget. It is therefore wise to check the health of your sitemaps regularly.
With JetOctopus you can check all sitemaps or just a single sitemap.
There are two ways to check sitemaps in JetOctopus: as an additional source of URLs during a crawl, or in a sitemap-only audit.
If you need to compare or analyze data from the crawl and the sitemaps together, select the first mode. Among other benefits, it helps you find orphan pages.
To check only sitemaps, select the second mode.
Sitemaps as an additional source of URLs during the crawl
Go to “New Crawl” and activate the “Process sitemaps” checkbox in “Basic settings”. JetOctopus will look for your sitemaps on its own, for example at https://example.com/sitemap.xml.
If your sitemaps are located at a specific, non-default URL, you need to add it in addition to activating the “Process sitemaps” checkbox. To do this, go to “Advanced settings” and enter the absolute URLs of the sitemaps in the “Sitemaps” list. Each URL should be on a new line.
When JetOctopus finishes scanning the links on the pages, it will process the sitemaps from the list.
You can add sitemap index files: JetOctopus will process all sitemaps from the index file.
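To illustrate what processing an index file involves, here is a minimal Python sketch that reads a sitemap index and lists the sitemaps it points to. It is not how JetOctopus works internally; the function name child_sitemaps and the example file content are hypothetical, and the only thing taken as given is the XML namespace defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def child_sitemaps(index_xml: str) -> list[str]:
    """Return the <loc> value of every <sitemap> entry in a sitemap index."""
    root = ET.fromstring(index_xml)
    locs = []
    for loc in root.findall("sm:sitemap/sm:loc", NS):
        # Skip empty <loc> elements instead of returning blank URLs.
        if loc.text and loc.text.strip():
            locs.append(loc.text.strip())
    return locs


# Hypothetical index file content, for illustration only.
EXAMPLE_INDEX = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://example.com/sitemap-products.xml</loc></sitemap>"
    "<sitemap><loc>https://example.com/sitemap-blog.xml</loc></sitemap>"
    "</sitemapindex>"
)

if __name__ == "__main__":
    for url in child_sitemaps(EXAMPLE_INDEX):
        print(url)
```

Each URL returned this way is itself a sitemap that would then be fetched and read in turn.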
When the crawl is finished, you can select the URLs from the sitemaps for analysis with just one click. Go to the desired dataset. In the “Join Dataset” block, select “+Sitemap URLs”.
Next, select which URLs you want to analyze: those found only in the sitemaps, or those not present in the sitemaps.
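The logic behind these two selections can be sketched as a plain set comparison. The Python example below is only an illustration on exported URL lists, not the JetOctopus implementation; the function name compare_sources and the sample URLs are assumptions made for the example.

```python
def compare_sources(crawl_urls, sitemap_urls):
    """Split URLs by where they were found: crawl, sitemaps, or both."""
    crawl, sitemap = set(crawl_urls), set(sitemap_urls)
    return {
        # In the sitemaps but never reached by links: orphan page candidates.
        "only_in_sitemaps": sorted(sitemap - crawl),
        # Reached by the crawl but missing from the sitemaps.
        "not_in_sitemaps": sorted(crawl - sitemap),
        "in_both": sorted(crawl & sitemap),
    }


if __name__ == "__main__":
    # Hypothetical data, for illustration only.
    crawled = ["https://example.com/", "https://example.com/blog"]
    listed = ["https://example.com/", "https://example.com/landing"]
    for group, urls in compare_sources(crawled, listed).items():
        print(group, urls)
```

URLs in the "only_in_sitemaps" group are worth reviewing first, since no internal link leads to them.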
Only XML Sitemap audit
To check only URLs from sitemaps, select “Only Sitemap” mode in the “Basic settings”.
Next, go to the “Advanced settings” and specify your own list of absolute URLs of sitemaps in the “Sitemaps” field. You can also add sitemap index files to this list. JetOctopus will process all sitemaps from the index files.
JetOctopus has a separate dashboard for quick analysis of sitemaps.
Go to the menu “Crawler” – “Sitemaps”. Here you will find general information about the sitemaps.
You can also explore “Sitemaps problems by depth” and “URL Distribution by depth”. We recommend paying attention to the “URL Distribution” chart. Ideally, 100% of the URLs are found both in the web crawl and in the sitemaps.
Lists of orphan pages and non-200 pages are available for analysis. Sitemaps should only contain absolute URLs with 200 response codes. Everything else needs to be fixed.
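The rule above (absolute URLs with a 200 response code only) can be expressed as a short check. The following Python sketch assumes you already have each sitemap URL together with its status code, for example from an export; the function names and sample records are hypothetical.

```python
from urllib.parse import urlsplit


def is_absolute(url: str) -> bool:
    """True if the URL has an http(s) scheme and a host."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def sitemap_url_issues(records):
    """records: iterable of (url, status_code); returns (url, reason) pairs."""
    issues = []
    for url, status in records:
        if not is_absolute(url):
            issues.append((url, "not an absolute URL"))
        elif status != 200:
            issues.append((url, f"status {status}"))
    return issues


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    sample = [
        ("https://example.com/", 200),
        ("/about", 200),
        ("https://example.com/old", 301),
    ]
    for url, reason in sitemap_url_issues(sample):
        print(url, reason)
```

Any URL reported by such a check should either be fixed or removed from the sitemap.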
To check that all sitemap files are working properly, go to the data table and select “Sitemap files”. In this report, you will find a list of sitemaps processed by JetOctopus. Analyze the following metrics for each sitemap:
Number of URLs – cannot be more than 50,000;
Filesize – no more than 50 MB (uncompressed);
Status code – if the sitemap URL is non-200, it is not available for scanning by search engines.
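The three metrics above can also be verified outside the report. The Python sketch below is a rough illustration that assumes the sitemap has already been downloaded and decompressed; the function name check_sitemap_file is hypothetical, and 50 MB is interpreted as 52,428,800 bytes, the figure given by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

MAX_URLS = 50_000
MAX_BYTES = 50 * 1024 * 1024  # 52,428,800 bytes, uncompressed

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def check_sitemap_file(content: bytes, status_code: int) -> list[str]:
    """Return a list of problems found for one uncompressed sitemap file."""
    if status_code != 200:
        # An unavailable sitemap has no content worth inspecting.
        return [f"status code {status_code}"]
    problems = []
    if len(content) > MAX_BYTES:
        problems.append(f"file size {len(content)} exceeds limit {MAX_BYTES}")
    try:
        url_count = len(ET.fromstring(content).findall("sm:url", NS))
    except ET.ParseError:
        problems.append("not well-formed XML")
        return problems
    if url_count > MAX_URLS:
        problems.append(f"URL count {url_count} exceeds limit {MAX_URLS}")
    return problems
```

An empty list means the file passed all three checks under these assumptions.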
Configure additional columns to see the “Error Message” (the reason why a sitemap was unavailable), “Date lastmod” (the lastmod value from the sitemap) and “Date Crawled” (when JetOctopus crawled your sitemap).
All URLs found in the sitemaps are listed in a separate data table, “Sitemap URLs”.