September 30, 2020
by Ann

Why your crawl can’t be finished.

Sometimes crawl fails because the JetOctopus bot is blocked by the webserver. 

If this problem occurs, you will the following information in your crawl project:

  • a lot of URLs returning a 401 or 403 status code;
  • the crawl slows down progressively, and appears to stop running completely;
  • crawler shows ‘crawl error’ status;
  • crawl shows that only 1 page was crawled.

The reason for your crawl being blocked is most likely one of the following:

– JetOctopus uses different cloud providers which may be blocked by default as they are also often used by scrapers.

– There is an automated system on the server which detects and blocks suspicious activities.

A manual block has been implemented by a server administrator, based on manual inspection of server activity, possibly triggered by a high load caused by the crawl, or a large number of crawl errors.

– The use of a Googlebot user agent has led to the failure of a reverse DNS lookup, appearing to be a scraper that is spoofing Googlebot.

We recommend trying the following solutions to blocked crawls:

– provided that you know the site, you can ask the server administrators to whitelist the IP that JetOctopus uses to crawl: 

54.36.123.8
54.38.81.9
147.135.5.64
139.99.131.40
198.244.200.110

– some websites will block requests which come from a Googlebot user-agent but do not originate from a Google IP address. In this scenario, selecting a different user agent often makes the crawl succeed;

– some websites attempt to use Javascript to block crawlers that do not execute the page. This type of block can normally be circumvented by enabling our Javascript renderer.


If you have any problems with the crawl,
feel free to send a message to support@jetoctopus.com.

About Ann
Ann is the Content Strategist in JetOctopus. She learned the dynamics of SEO from scratch by understanding the inner workings of logs and crawl reports data. She cross-examines actual SEO cases by studying clients’ websites and loves being an active member of the technical team. Because of her dedication, she has become technically savvy and always bases her definition of quality content on real data and not on speculation. You can find Ann on Twitter and LinkedIn.

Search

Categories

Get exclusive tech SEO insights
We are tech SEO geeks who believe that SEO is predictable and numeric. Don’t miss our insigths!