If you have crawled the same website twice or more, you may notice that the results differ. In this article, we explain why this happens and what to pay attention to when analyzing differences between crawls.
Below, we describe point by point the possible reasons why crawl results differ.
The first reason why two crawls of the same website produce different results is a different crawl configuration. To check which settings were used for a crawl, go to your project, where you will find the list of crawls.
Click the information icon next to the desired crawl, and a pop-up window with the crawl settings will appear.
You can also find the crawl settings in the separate “Settings” report in the “Crawler” menu.
Note the following points:
More information: How to find orphan pages with JetOctopus.
Also pay attention to the “Respect robots rules/Ignore robots rules” setting. If you activated the “Respect robots rules” checkbox while setting up the crawl, our crawler follows all directives intended for search engines and will not scan URLs that are blocked, for example, by the robots.txt file. Therefore, crawls run with different robots rules will produce different results.
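To illustrate the effect of this setting (a simplified sketch, not JetOctopus code), here is how a crawler that respects robots rules filters its URL list with Python's standard `urllib.robotparser`; the domain and paths are made up for the example:

```python
# Sketch: how "Respect robots rules" vs. "Ignore robots rules" changes
# which discovered URLs actually get crawled. Hypothetical robots.txt.
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

discovered = [
    "https://example.com/",
    "https://example.com/private/page",
]

# Respect robots rules: disallowed URLs are skipped entirely.
respecting = [u for u in discovered if parser.can_fetch("*", u)]

# Ignore robots rules: every discovered URL is crawled.
ignoring = list(discovered)

print(respecting)  # ['https://example.com/']
print(ignoring)    # both URLs
```

With the same start URL, the two modes end up with different URL sets, and every page reachable only through a disallowed URL disappears from the "respect" crawl as well.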
Also check whether the crawl was set to cover only the main domain or subdomains as well, and whether a custom robots.txt file was used.
You will find all the settings that can affect the crawl results in the information window or in the “Settings” report in the crawl results.
Crawl results often differ because the website was updated between crawls. This is especially relevant for large and e-commerce websites. You can track whether SEO-critical elements have changed after an update by setting up alerts.
If you selected “Process sitemap” in the crawl settings and your sitemaps have been updated between crawls, the crawl results will differ.
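The reason is that a crawler which processes the sitemap seeds its queue from the sitemap's `<loc>` entries, so a changed sitemap means a changed starting URL set. A minimal sketch with Python's standard XML parser and two hypothetical sitemap versions:

```python
# Sketch: an updated sitemap changes the set of seed URLs for the crawl.
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Extract all <loc> URLs from a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]

sitemap_v1 = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
</urlset>"""

sitemap_v2 = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc></url>
  <url><loc>https://example.com/b</loc></url>
</urlset>"""

# Same crawl configuration, different seed sets between runs:
print(sitemap_urls(sitemap_v1))  # ['https://example.com/a']
print(sitemap_urls(sitemap_v2))  # ['https://example.com/a', 'https://example.com/b']
```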
If you have a large website with many URLs at different levels and a wide horizontal structure, the results may differ even with the same crawl configuration and the same page limit. For this reason, we recommend a complete crawl for websites of all sizes.
More information: Why Is Partial Crawling Bad for Big Websites? How Does It Impact SEO?
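Why a page limit causes this can be sketched with a toy breadth-first crawler (an illustration, not how our crawler is implemented): if a wide site exposes its links in a different order between runs, two partial crawls with the same limit cover different pages.

```python
# Sketch: with a page limit, the crawled set depends on discovery order.
from collections import deque

def crawl(graph, start, limit):
    """Breadth-first crawl that stops after `limit` pages."""
    seen, queue, crawled = {start}, deque([start]), []
    while queue and len(crawled) < limit:
        url = queue.popleft()
        crawled.append(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled

# Same site, same limit, but links served in a different order:
run_1 = crawl({"/": ["/a", "/b", "/c"]}, "/", limit=2)
run_2 = crawl({"/": ["/c", "/b", "/a"]}, "/", limit=2)
print(run_1)  # ['/', '/a']
print(run_2)  # ['/', '/c']
```

A complete crawl removes this effect, because every reachable URL is visited regardless of order.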
If some pages returned 5xx response codes during one of the crawls, our crawler could not access the page code and therefore could not extract the list of links from it. 5xx status codes are temporary and may indicate server overload, possibly caused by our crawler itself. Check the crawl results for 5xx pages; if they are present, the results of the crawls will differ.
More information: What if you run a crawl and your website returns the 503 response code?
Why are there 5xx pages in the crawl results and why are they not reproduced with the manual checking?
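To make the link-loss effect concrete, here is a simplified sketch (using Python's standard `html.parser`, not our crawler's actual code) of a crawl step: on a 5xx response there is no HTML body to parse, so zero outgoing links are discovered, and every page reachable only through that URL is missing from the crawl.

```python
# Sketch: a 5xx response yields no parsable body, hence no extracted links.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

def extract_links(status_code, body):
    if status_code >= 500:  # server error: nothing to parse
        return []
    parser = LinkExtractor()
    parser.feed(body)
    return parser.links

html = '<a href="/page-1">1</a><a href="/page-2">2</a>'
print(extract_links(200, html))  # ['/page-1', '/page-2']
print(extract_links(503, html))  # [] -- links behind this page are lost
```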