JetOctopus can crawl staging websites and websites closed to search engine crawlers. You can also crawl a website that requires authentication (login/password). To crawl a staging website, you only need to adjust a few crawl settings.
If a staging site is blocked by a robots.txt file, you need to add a custom robots.txt before starting a new crawl.
Click “New Crawl” to start setup.
More information: How to configure a crawl of your website.
Then go to “Advanced settings” and select “Custom robots.txt”. Enter the robots.txt rules that should apply on the production website.
The custom robots.txt file must contain at least one section with user-agent: *.
To crawl all URLs on your staging website, add the following directives:
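For example, a custom robots.txt that lets the crawler fetch every URL can be as simple as this (an empty Disallow directive means nothing is blocked):

```
User-agent: *
Disallow:
```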
Authentication is one of the most secure ways to close a website from crawling by search engines and from random visitors.
To scan an authenticated website, go to “Advanced settings”. Under HTTP Auth, enter the user (login) and password. This is completely safe: your credentials are not shared with anyone and are not saved for future crawls.
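This kind of protection is usually HTTP Basic Auth, where the crawler sends your login and password as a base64-encoded Authorization header with each request. A minimal Python sketch of what that header looks like (the credentials here are made up for illustration):

```python
import base64

def basic_auth_header(user: str, password: str) -> str:
    # HTTP Basic Auth: "Basic " followed by base64("user:password")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Authorization: Basic {token}"

print(basic_auth_header("staging_user", "s3cret"))
```

Note that base64 is an encoding, not encryption, which is why HTTP Auth should always be used over HTTPS.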
JetOctopus automatically determines the fields where the crawler needs to enter data for authentication, so you do not need to make additional settings.
Pay attention! If a staging website is additionally blocked by robots.txt, deactivate the “Respect robots rules” checkbox in “Basic settings” or add a custom robots.txt file (see above).
Some staging or production websites show updated content only when special cookies are sent. To crawl such a website, go to “Advanced settings” and add a list of cookies to be sent with every request. Enter one cookie per line, with no extra characters between lines.
You can also add Custom Request Headers in the field below.
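Behind the scenes, a crawler typically joins such a one-cookie-per-line list into a single Cookie request header, with pairs separated by "; " as described in RFC 6265. A short Python sketch of that transformation (the cookie names are hypothetical):

```python
def build_cookie_header(lines: str) -> str:
    # One cookie per line, e.g. "sessionid=abc123"; blank lines are ignored.
    cookies = [line.strip() for line in lines.splitlines() if line.strip()]
    # The Cookie header joins name=value pairs with "; ".
    return "Cookie: " + "; ".join(cookies)

print(build_cookie_header("sessionid=abc123\npreview_mode=1"))
```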
Staging sites are easier to overload than production sites, so crawl them carefully.
Be sure to set the minimum number of threads in the “Basic settings”.
You can also set a timeout between requests in “Advanced settings”.
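A low thread count combined with a timeout between requests is simply polite rate limiting. A minimal single-threaded Python sketch of the idea (the delay value and the fetch function are placeholders, not JetOctopus internals):

```python
import time

def crawl_politely(urls, fetch, delay_seconds=1.0):
    # One request at a time (the equivalent of a single thread),
    # pausing between requests so the staging server is never hit in bursts.
    results = []
    for url in urls:
        results.append(fetch(url))
        time.sleep(delay_seconds)
    return results
```

With N threads and a delay of D seconds, the server sees at most roughly N requests per D seconds, which is why both settings matter for fragile staging hosts.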
If your staging site is blocked by a robots.txt file, disable the “Respect robots rules” checkbox or add a custom robots.txt file.
Select “Follow links with rel="nofollow"” and “Follow links on pages with <meta name="robots" content="noindex,follow"/>” if your staging site is closed from indexing by meta robots tags.
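For reference, these two options correspond to the following markup (the link URL below is only an example):

```html
<!-- Page-level: the page is excluded from indexing, but its links may still be followed -->
<meta name="robots" content="noindex,follow"/>

<!-- Link-level: an individual link marked with rel="nofollow" -->
<a href="/staging-only-page" rel="nofollow">Staging page</a>
```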