8 Common Robots.txt Mistakes and How to Avoid Them


Jun 20, 2019

Yes, robots.txt can be ignored by bots, and yes, it isn't secure: anyone can see the content of this file by entering http://www.yourwebsite.com/robots.txt. Nevertheless, a well-considered robots.txt helps deliver your relevant content to search engine bots and keep low-priority pages out of the SERP. At first glance, writing the directives is an easy task, but any kind of management requires attention and forethought. Let's go through the most common robots.txt mistakes and find out how to establish a constructive dialogue with Googlebot.

Essentials

We don't want to sound like a broken record repeating the definitions and main rules of robots.txt. There are plenty of useful guides that describe the robots.txt basics in detail. We've gathered the most relevant sources in the clickable list below for your convenience:

  1. Google: What is a robots.txt file?
  2. Google: Create a robots.txt file
  3. Google: Test your robots.txt
  4. Google: Robots.txt Specifications
  5. Wikipedia: Robots Exclusion Protocol
  6. Moz: Robots.txt Basics
  7. Search Engine Journal: Best Practices for Setting Up Robots.txt

Now let's cut to the chase and reveal the most common robots.txt mistakes and the ways to avoid them.


    Mistakes in Robots.txt To Avoid

    1. Ignoring disallow directives for a specific user-agent block

    Suppose you have two categories that should be blocked for all crawlers and also one URL that should be available only for Googlebot:
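
    For instance, a first attempt often looks like this (the directory and URL names below are placeholders):

        User-agent: *
        Disallow: /category-1/
        Disallow: /category-2/

        User-agent: Googlebot
        Allow: /category-1/special-offer.html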

    This file asks Googlebot to scan the whole website. Remember that if you name a specific bot in robots.txt, it will only obey the directives addressed to it. To specify exceptions for Googlebot, you should repeat the disallow directives in each user-agent block, like this:
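
    A sketch of the corrected file, with the same placeholder names as above:

        User-agent: *
        Disallow: /category-1/
        Disallow: /category-2/

        User-agent: Googlebot
        Disallow: /category-1/
        Disallow: /category-2/
        Allow: /category-1/special-offer.html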

    2. One robots.txt file for different subdomains

    Remember that a subdomain is generally treated as a separate website and thus follows only its own robots.txt directives. Suppose your website has a few subdomains that serve different purposes. You may be tempted to take the easy way out and create one robots.txt file that is supposed to cover all your subdomains, for instance:
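
    One file at https://yourbrand.com/robots.txt that tries to reference a subdomain directly (the subdomain and directory names below are placeholders) might look like this:

        User-agent: *
        Disallow: /blog/drafts/
        Disallow: https://admin.yourbrand.com/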

    Well, it isn't that easy. You cannot specify a subdomain (or a domain) in a robots.txt file with a wave of a magic wand. Each subdomain should have its own robots.txt file. For instance, this file opens access to everything on the subdomain "admin":
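
    A sketch of such a file, assuming it is served at https://admin.yourbrand.com/robots.txt:

        User-agent: *
        Disallow: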

    Robots.txt works only if it is present in the root of its host. You need to upload a separate robots.txt to each subdomain, where it can be accessed by a search bot.

    3. Listing of secure directories

    Since robots.txt is easily accessible to users and harmful bots alike, don't list private data in it, as in the following example:
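
    A sketch of what not to do (the directory names below are purely illustrative):

        User-agent: *
        Disallow: /admin/
        Disallow: /user-credentials/
        Disallow: /clients-credit-cards/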

    We do not live in an ideal world where all competitors respect each other. When you try to disallow private data in robots.txt, you hand bad bots a shortcut to that information. It's like sticking a note with a polite request on your backpack:

    “Dear thieves! I have 1000 dollars in the left pocket. Please, don’t touch it! Thank you so much!”

    The only way to keep a directory hidden is to put it behind a password.

    Remember: secure data = password-protected data. That is the only reliable way to protect sensitive data such as customer credit card details or credentials.

    Also, note that Google can index pages blocked in robots.txt if Googlebot finds internal links pointing to these pages. In a scenario like this, Google will likely use a title from some of the internal links pointing to the URL, but the URL will rarely be displayed in the SERP because Google has very little information about it.

    Google Webmaster Central office-hours hangout, March 22, 2019, at the 00:25:05 mark

    4. Blocking relevant pages

    Very often webmasters accidentally block profitable pages in the following way: suppose you need to block everything in the folder https://yourbrand.com/key/. So you add the following lines to robots.txt:
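
    A sketch of the mistaken rule (note the missing trailing slash):

        User-agent: *
        Disallow: /key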

    Yes, now you've protected your clients' credentials, but you have also blocked relevant pages like https://yourbrand.com/keyrings-keychains/?ie=UTF8&node=29

    This blunder can harm your SEO efforts. With the exception of wildcards, the path is used to match the beginning of a URL (and any valid URLs that start with the same path).

    Two slashes, as in /key/, mark off a distinct directory of URLs: in our case we ask bots not to crawl any content in the /key/ directory. But if we put just /key after Disallow, we block every URL whose path begins with that combination of characters. The dollar sign ($) designates the end of the URL; it tells the search bot, "the path ends here."
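
    A sketch of the corrected rule, under the same assumptions:

        User-agent: *
        # Blocks the /key/ directory but leaves /keyrings-keychains/ crawlable
        Disallow: /key/

        # Alternatively, block exactly the URL path /key and nothing else:
        # Disallow: /key$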

    5. Forgetting to add directives for specific bots where needed

    Google's main crawler is called Googlebot. In addition, there are 12 more specific spiders, each of which has its own User-agent name and crawls a particular part of your website (for instance, Googlebot-News scans applicable content for inclusion in Google News, Googlebot-Image searches for photos and images, etc.).

    Some of the content you publish may not be applicable for inclusion in the SERP. For example, you want all your pages to appear in Google Search, but you don't want photos in your personal directory to be crawled. In such a case, use robots.txt to disallow the user-agent Googlebot-Image from crawling the files in that directory (while allowing Googlebot to scan all files):
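
    A sketch, assuming the photos live in a /personal/ directory:

        User-agent: Googlebot-Image
        Disallow: /personal/

        User-agent: Googlebot
        Disallow: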

    One more example: suppose you want ads on all your pages, but you don't want those pages to appear in the SERP. In that case, you'd prevent Googlebot from crawling them but allow the Mediapartners-Google bot:
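
    A sketch of such a file:

        User-agent: Googlebot
        Disallow: /

        User-agent: Mediapartners-Google
        Allow: /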

    6. Adding a relative path to the sitemap

    A sitemap helps crawlers index your pages faster, which is why it should be submitted for your website. It's also a good idea to leave a clue for search bots in robots.txt about where your sitemap is located.

    Note that bots cannot reach sitemap files via a relative path; the URL must be absolute.
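
    A sketch of the two variants (the sitemap location is a placeholder):

        # Won't work: a relative path
        Sitemap: /sitemap.xml

        # Works: an absolute URL
        Sitemap: https://yourbrand.com/sitemap.xml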

    7. Ignoring the slash in a Disallow field

    Search bots won't respect your robots.txt rules if you miss the slash in the corresponding Disallow field.
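
    For instance (the directory name below is a placeholder):

        User-agent: *
        # Likely ignored: the path does not start with a slash
        Disallow: blog

        # Correct: blocks the /blog/ directory
        Disallow: /blog/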

    8. Forgetting about case sensitivity

    The path value in robots.txt is used as a basis to determine whether or not a rule applies to a specific URL on a site. Directives in the robots.txt file are case-sensitive. Google says you can use just the first few characters instead of the full name in robots.txt directives. For example, instead of listing all upper- and lower-case permutations of /MyBlogException, you could add the permutations of /MyB, but only if no other crawlable URLs exist with those first characters.
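
    A short sketch of the idea (the path names are placeholders); each rule matches only its exact casing:

        User-agent: *
        Disallow: /MyB    # matches /MyBlogException but not /myblogexception
        Disallow: /myb    # covers the all-lower-case variant
        # ...and so on for the remaining upper/lower-case permutations of /MyB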

    Wrapping Up

    A little text file with directives for search bots in the root folder can greatly affect the crawlability of your URLs and even of your entire website. Robots.txt is not as easy as it may seem: one superfluous slash or missed wildcard can block your profitable pages or, vice versa, open access to duplicate or private content. The JetOctopus crawler reveals every webpage that is blocked by robots.txt, so that you can easily check whether your Disallow/Allow directives are correct.

    We have a client-oriented philosophy, so if you have any questions about technical SEO in general, and the robots.txt file in particular, feel free to drop us a line at serge@jetoctopus.com.


    ABOUT THE AUTHOR

    Ann Yaroshenko is a Content Marketing Strategist at JetOctopus. She holds a Master's diploma in publishing and editing and a Master's diploma in philology, and has two years of experience in Human Resources Management. Ann has been part of the JetOctopus team since 2018.
