what is a robots.txt file and how it affects search engine indexing

The robots.txt file is a text file that resides in the top-level folder of your web server. The purpose of the robots.txt file (or just robots file) is to give search engine crawlers some guidance on which content should be crawled and indexed, and which content should not be crawled.

The robots.txt file uses an exclusion standard, which means that you need it only if you have to explicitly exclude URLs, or if some URLs have specific requirements that differ from the default. It is very unlikely that a simple website or blog will have such requirements.

There are some advantages to such an approach. If there are a lot of files on your server, especially images and binary files, that you do not wish to be crawled, they can be explicitly excluded. This will reduce the load on your web servers from repeated crawling by search bots. You can also instruct different crawlers to follow different rules, which could improve crawling efficiency.

location of robots.txt

The robots.txt file should always reside in the topmost folder of your server. This folder is also known as the root folder or the document root of the web server. In other words, the robots.txt file should be accessible to everybody at the URL: http://<websitedomain>/robots.txt
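
As a minimal sketch of the rule above, the robots.txt URL for any page can be derived by keeping only the scheme and host of the page URL. The helper name robots_url is hypothetical, and only Python's standard urllib.parse module is assumed.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (with port, if any); replace the path
    # and drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("http://www.example.com/blog/2014/post.html?page=2"))
```

Whatever page you start from, the result points at the document root of the same host.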

how to create a robots file

The robots.txt file can be created using any method, as long as it is a text file that can be read and parsed using standard methods. You can use a text editor such as vi or notepad to create this file. You can also auto-generate the file using your website code, if that is required. The Google Webmaster Tools provides a web interface to create such a file as well.
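
If you do want to generate the file from your website code, a rough sketch could look like the following. The function name build_robots_txt and the shape of its arguments are assumptions made for illustration; the directives it writes are the ones described later in this post.

```python
def build_robots_txt(blocks, sitemaps=()):
    """Render robots.txt text.

    blocks   -- list of (user_agent_names, rules) pairs, where rules is a
                list of (directive, path) pairs such as ("disallow", "/x")
    sitemaps -- iterable of absolute sitemap URLs
    """
    lines = []
    for user_agents, rules in blocks:
        for agent in user_agents:
            lines.append(f"user-agent: {agent}")
        for directive, path in rules:
            lines.append(f"{directive}: {path}")
        lines.append("")  # blank line between blocks
    for url in sitemaps:
        lines.append(f"sitemap: {url}")
    # End the file with exactly one newline.
    return "\n".join(lines).rstrip("\n") + "\n"

text = build_robots_txt(
    [(["*"], [("disallow", "/thisfolder")])],
    sitemaps=["http://www.example.com/sitemaps/sitemap.xml"],
)
print(text)
```

The returned string can then be written to a robots.txt file in the document root, or served directly at /robots.txt.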

do you really need it?

No. It is not mandatory to have such a robots file. If there is no such file, then search engines will still crawl your web pages and index them normally. There are other ways to specify such rules for the search engines: using a robots meta tag in the page content or the X-Robots-Tag in the HTTP header will serve the same purpose. We will just consider the text file in this post.

robots.txt format and directives

So, if you decide that you need a robots.txt file, then you also need to know the robots file format. There are four valid directives in the file: sitemap, user-agent, allow and disallow. Each directive should be on its own line, and should be separated from its value with a colon (:). We will see some examples of the robots file format below.
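
To make the line format concrete, here is a small sketch that splits one line into its directive and value. The name parse_line is hypothetical, and the sketch only knows the four directives listed above.

```python
VALID_DIRECTIVES = {"sitemap", "user-agent", "allow", "disallow"}

def parse_line(line):
    """Split one robots.txt line into (directive, value), or return None."""
    if ":" not in line:
        return None
    # Split on the first colon only, so URLs in the value stay intact.
    name, value = line.split(":", 1)
    name = name.strip().lower()
    if name not in VALID_DIRECTIVES:
        return None
    return name, value.strip()

print(parse_line("sitemap: http://www.example.com/sitemaps/sitemap.xml"))
print(parse_line("Disallow: /thisfolder"))
```

Splitting on the first colon matters because a sitemap value is a URL that contains a colon of its own.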

sitemap directive

The sitemap directive instructs the crawler where the sitemap file is located. By default, the sitemap file is located in the root folder with the name sitemap.xml. If you want to use a different file or file location then you could use this directive to point the crawler to its location. You should use an absolute URL to specify the sitemap file.

sitemap: [absoluteURL]

An example where the sitemap file is inside a folder called sitemaps is:

sitemap: http://www.example.com/sitemaps/sitemap.xml

You can have multiple sitemap entries in the file. Also, the sitemap does not necessarily have to be in the same domain.
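
One way to check which sitemap entries a parser sees is Python's standard urllib.robotparser module, sketched below. The second sitemap URL (on cdn.example.net) is made up to illustrate an entry on a different domain, and site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

lines = [
    "sitemap: http://www.example.com/sitemaps/sitemap.xml",
    "sitemap: http://cdn.example.net/sitemap-images.xml",  # hypothetical second sitemap
    "",
    "user-agent: *",
    "disallow: /thisfolder",
]

parser = RobotFileParser()
parser.parse(lines)             # parse in-memory lines; no network access
sitemaps = parser.site_maps()   # list of URLs, or None if there are no entries
print(sitemaps)
```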

user-agent directive

You can have different rules for different search crawlers, as long as you know the name by which each crawler identifies itself. Every block inside the robots file should start with this directive (except the sitemap block). You can use wildcards to match the crawlers as well. Some examples of the user-agent directive are:

user-agent: googlebot-news
user-agent: google
user-agent: bing
user-agent: *

There are some things you should know when writing user-agent blocks.

The crawler maps to one and exactly one block.
The crawler maps to the most specific block or pattern in the list. For example, googlebot-news will match the googlebot-news block and googlebot will match the google block in the above example.
The order of the blocks in the file is not relevant.
You can have multiple user-agent directives in a single block.
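
The selection rules above can be sketched roughly as follows. This is only an illustration: the name select_block is hypothetical, and it assumes that a user-agent value matches when the crawler's name starts with it (case-insensitively). Real crawlers may match names differently.

```python
def select_block(crawler_name, block_agents):
    """Return the user-agent value of the single block that applies.

    Assumption: a value matches when the crawler name starts with it,
    and '*' matches every crawler as a last resort.
    """
    name = crawler_name.lower()
    matches = [a for a in block_agents
               if a != "*" and name.startswith(a.lower())]
    if matches:
        return max(matches, key=len)   # longest match is the most specific
    return "*" if "*" in block_agents else None

agents = ["googlebot-news", "google", "bing", "*"]
for crawler in ("googlebot-news", "googlebot", "bingbot", "otherbot"):
    print(crawler, "->", select_block(crawler, agents))
```

Because the choice depends only on the length of the match, reordering the list of blocks does not change the result.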

allow and disallow directives

The allow and disallow directives are only valid inside a user-agent block. They apply to the user-agent directive (or group of user-agent directives) that immediately precedes them. The allow directive can be used to override disallow directives.

All path directives in the file should be paths relative to the root of the server, that is, the document root. The only exception is the sitemap URL, which should be an absolute URL. A simple example looks something like this:

user-agent: google
disallow: /thisfolder
disallow: /newfolder/
disallow: /cfolder/*.php$

user-agent: google
user-agent: bing
disallow: /thisfolder
disallow: /cfolder/*.php$

The allow directive is similar to the disallow pattern in terms of the URLs or paths specified. There are several things to know when writing your path…

Always start the path value with a "/" to denote the root folder
Any URL or path that starts with the specified path is disallowed (or allowed). So, /thisfolder is equivalent to /thisfolder*
You can use * to match any sequence (0 or more) of valid characters (like a wildcard in a file name, rather than the * of regular expressions)
You can use $ to denote the end of a URL.
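
The path rules above can be sketched by translating a pattern into a regular expression. The function name path_matches is hypothetical, and this shows only one possible reading of the rules, not how any particular crawler implements them.

```python
import re

def path_matches(pattern, url_path):
    """Return True if url_path is covered by a robots.txt path pattern."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with ".*" so that
    # each "*" in the pattern matches any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"   # "$" pins the pattern to the end of the URL
    # re.match anchors at the start, which gives the "starts with" behavior.
    return re.match(regex, url_path) is not None

print(path_matches("/thisfolder", "/thisfolder/page.html"))
print(path_matches("/cfolder/*.php$", "/cfolder/admin/index.php"))
print(path_matches("/cfolder/*.php$", "/cfolder/index.php?id=1"))
```

The last call shows why the $ matters: with it, a URL that continues past .php (here with a query string) is no longer covered by the pattern.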

The allow directive is optional in the file. When it is not specified, files and URLs are allowed (that is, crawled) by default. The allow directive takes precedence over the disallow directive, which means you can use it to override disallow directives in the file.
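
The precedence described above can be sketched with plain prefix matching. The name is_allowed and the /thisfolder/public/ path are made up for illustration. Individual crawlers may resolve conflicts differently, for example by preferring the longest matching path, so treat this as a sketch of the rule as stated here.

```python
def is_allowed(url_path, allow=(), disallow=()):
    """Decide whether url_path may be crawled, using prefix matching only."""
    if any(url_path.startswith(prefix) for prefix in allow):
        return True    # an allow rule overrides any disallow rule
    if any(url_path.startswith(prefix) for prefix in disallow):
        return False
    return True        # nothing matched: crawling is allowed by default

rules = {"allow": ["/thisfolder/public/"], "disallow": ["/thisfolder"]}
print(is_allowed("/thisfolder/public/page.html", **rules))
print(is_allowed("/thisfolder/secret.html", **rules))
print(is_allowed("/elsewhere/page.html", **rules))
```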

what not to use in robots file

The robots.txt file is not the only option that you have available. If you have only a few webpages where the instructions differ from the overall set of rules, then you could implement them on the page using the meta tags. The X-Robots-Tag in the HTTP headers is a good option if you want to control the indexing of non-HTML content on the website. This is also a good option if you want to ensure that a particular page or piece of content is definitely not indexed.
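
As an illustration of the HTTP header method, the sketch below wraps a Python WSGI application so that responses for certain file types carry an X-Robots-Tag: noindex header. The wrapper name add_robots_header, the demo application and the list of extensions are all assumptions. In practice the header is often set in the web server configuration instead.

```python
def add_robots_header(app, extensions=(".pdf", ".doc")):
    """Wrap a WSGI app so matching responses carry 'X-Robots-Tag: noindex'."""
    def wrapped(environ, start_response):
        path = environ.get("PATH_INFO", "")

        def patched_start_response(status, headers, exc_info=None):
            # str.endswith accepts a tuple of suffixes.
            if path.lower().endswith(extensions):
                headers = list(headers) + [("X-Robots-Tag", "noindex")]
            return start_response(status, headers, exc_info)

        return app(environ, patched_start_response)
    return wrapped

# Tiny stand-in application and start_response, for demonstration only.
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "application/pdf")])
    return [b"%PDF-"]

def show(status, headers, exc_info=None):
    print(status, headers)

app = add_robots_header(demo_app)
app({"PATH_INFO": "/files/report.pdf"}, show)
```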

Google documentation explicitly states that you should not use the robots.txt methods to block private content or to handle URL canonicalization.

limitations of robots.txt

There are several limitations to the text file method that you should be aware of. Always remember that it is only guidance for the bots, and nobody is guaranteed to follow your instructions :-).

crawler inconsistency

All reputable web crawlers will adhere to the instructions specified in the robots file. It goes without saying that there may be crawlers and spam bots that will not even read, let alone follow, such instructions, but many if not most search engine crawlers do. It is also possible that different crawlers use or match your paths and directives in different ways.

disclosure of private files and folders

When writing directives in the robots.txt file you should be aware that this file is publicly available and can be read by anybody. For this reason it is a good idea not to include folder names or file names within the file, especially those of private folders and files. Such folders and files should be omitted from the file, and you should use the meta tag or HTTP header method for them instead. Also beware that the path directives in the file are case sensitive.
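
The case sensitivity of paths can be checked with Python's standard urllib.robotparser module, as in this small sketch. The crawler name examplebot is made up, and other parsers may behave differently in other respects.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "user-agent: *",
    "disallow: /thisfolder",
])

# Same folder name, different letter case.
lower = parser.can_fetch("examplebot", "http://www.example.com/thisfolder/page.html")
upper = parser.can_fetch("examplebot", "http://www.example.com/ThisFolder/page.html")
print(lower, upper)
```

Only the lower-case URL is refused here. The mixed-case variant does not match the disallow path, so it stays crawlable.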

wrong paths

Incorrect directives caused by spelling errors and the like are usually ignored by search bots. Parsing of the robots.txt file is extremely flexible and should not disrupt the crawling process. Having said that, you should make sure that the path directives are correct and that you are not accidentally preventing URLs from being crawled when they should be.

If you have misconfigured the robots.txt file in this way, it can have negative implications for your SEO. It can lower your quality scores and search rankings.
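
A quick check before publishing the file can catch the most common slips. The sketch below flags misspelled directives and paths that do not start with "/". The name find_suspect_lines is hypothetical, and the check only knows the four directives covered in this post.

```python
KNOWN_DIRECTIVES = {"sitemap", "user-agent", "allow", "disallow"}

def find_suspect_lines(text):
    """Return (line_number, line) pairs that look misspelled or malformed."""
    suspects = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue                          # blank lines separate blocks
        name, colon, value = line.partition(":")
        name = name.strip().lower()
        value = value.strip()
        if not colon or name not in KNOWN_DIRECTIVES:
            suspects.append((number, raw))    # unknown or malformed directive
        elif name in ("allow", "disallow") and value and not value.startswith("/"):
            suspects.append((number, raw))    # path should start with "/"
    return suspects

sample = "user-agent: *\ndissallow: /thisfolder\ndisallow: newfolder/\ndisallow: /cfolder/*.php$\n"
for number, line in find_suspect_lines(sample):
    print(number, line)
```

A check like this cannot tell whether a correctly spelled path is the one you intended, so it complements a manual review of the rules without replacing it.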

not immediate

Changes to your robots.txt file take a while to propagate through the search engine indexes. The effect is neither guaranteed nor immediate. It might be a while before excluded content completely disappears from the search engines. You should try other methods, such as a formal removal request, in addition to this.

You should also be aware that robots.txt directives cannot prevent other sites from linking to your pages. It is also possible that you are linking to the page from another page within your own site. If robots.txt is the only exclusion method you are using, then your excluded pages might still show up in search indexes when crawlers come across other links that point to them.