One of the challenges of search engine optimization is sorting through all the available advice and deciding what is right. That becomes harder when tips and techniques contradict each other, and robots.txt is an ideal example: many SEOs recommend using robots.txt to control which pages spiders crawl, while Google has suggested that many sites can do without it entirely. So first, what exactly is robots.txt, and who is right?
What is robots.txt?
Say you run one of those considerate websites that offers an aesthetically pleasing experience for viewers as well as a stripped-down version that is easy on the printer. Instead of printing your graphics, colors, and ads, users can print just the text content they want. You want the spiders to crawl the display page rather than the print page; better yet, you want to exclude the print page altogether so you are not penalized for duplicate content (we will come back to this scenario below).
This is just one example of where a website owner might opt to use robots.txt. It is a plain text file that tells search robots which pages or directories you do not wish them to crawl. The file has to be placed in the site's root directory so bots can find it.
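For example (example.com here is just an illustrative placeholder), the file for a site would sit at:

https://www.example.com/robots.txt

Bots look only in that one spot; a robots.txt tucked into a subdirectory will simply be ignored.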
Do the bots have to listen to your request? No; in fact, that's all it is: a request. You are asking, "Please, if you would, do not crawl these pages." It is not a firewall that prevents them from doing so, though reputable crawlers will usually comply. The typical structure for the file is:
User-agent: [name of bot, or * for all bots]
Disallow: [list of files and directories to be excluded]
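For instance, a file asking only Google's crawler to skip a single directory (the /drafts/ path here is made up purely for illustration) would read:

User-agent: Googlebot
Disallow: /drafts/

A bot that finds no User-agent line matching its own name falls back to the * group, or crawls freely if there is none.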
It is incredibly important to make sure the syntax and format are correct so the bots can read the text file accurately. Many sites create a robots.txt file that indicates that bots can crawl everything. This looks like:
User-agent: *
Disallow:
On the other hand, a single / makes a big difference. This slightly different file disallows the bots from crawling anything on the site.
User-agent: *
Disallow: /
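Coming back to the print-friendly pages from earlier: assuming, just for illustration, that the printable versions all live under a /print/ directory, you would ask every bot to skip them while leaving the rest of the site crawlable like so:

User-agent: *
Disallow: /print/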
Google’s John Mu recently posted on a webmaster forum that he would strongly consider “removing the robots.txt file completely.” He says, “The general idea behind blocking some of those pages from crawling is to prevent them from being indexed. However, that’s not really necessary – websites can still be crawled, indexed, and ranked fine with those pages like their terms of service or shipping information indexed (sometimes that’s even helpful to the user).”
If you have been adding robots.txt rules to keep these types of pages from lowering your ranking, you needn't bother; you will most likely see no change in the SERPs. Better still, the risk of misplacing a / or forgetting a hyphen is eliminated, and life gets a tad simpler.