Back around 3 B.G. (Before Google), AltaVista was the new search engine on the block. In an effort to show off the power of their minicomputers, the AltaVista team at Digital decided to crawl and index the entire web. At the time, this was a new concept. Many webmasters didn't relish the idea of a "robot" program accessing every page on their site, as this would add load to their web servers and increase their bandwidth costs. The Robots Exclusion Standard, first proposed in 1994 and written up as an Internet draft in 1996, addresses these webmaster concerns.
You can use a simple text file called robots.txt to keep search engines out of a directory. Here is a very simple example that will prevent all search engines (user-agents) from accessing the /images directory.
User-agent: *
Disallow: /images
Rules are matched as path prefixes, so blocking /images also blocks every path that begins with /images, including all subdirectories. For example, the directory /images/logos and the file /images.html will also be disallowed.
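The prefix behavior can be checked with Python's standard-library urllib.robotparser module. This is a minimal sketch, not a model of any particular search engine's crawler. The three paths tested are hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, one directive per line.
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /images",
])

# Hypothetical paths, chosen only to illustrate prefix matching.
for path in ["/images/logos/banner.png", "/images.html", "/index.html"]:
    verdict = "allowed" if rules.can_fetch("*", path) else "blocked"
    print(f"{path}: {verdict}")
```

With this parser, the first two paths are reported as blocked and /index.html as allowed. If you only want to block the directory itself, a rule of /images/ (with a trailing slash) avoids catching /images.html.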
Strangely enough, the first draft of this standard did not contain an "Allow" directive. It was added later, but there is no guarantee that every search engine supports it. In practice this means that anything not specifically disallowed should be regarded as a target for web crawlers.
If you choose to disallow access to your entire web site, you can use a robots.txt like this:
User-agent: *
Disallow: /
When the User-agent is *, the lines that follow apply to every search robot. By naming the signature of a particular web crawler in the User-agent line instead, you can give instructions to that robot alone.
User-agent: Googlebot
Disallow: /google-secrets
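To see that a named User-agent line affects only that crawler, the same rules can be fed to Python's urllib.robotparser. This is a sketch under the assumption that the crawlers identify themselves as "Googlebot" and "Slurp". The file name plan.html is hypothetical.

```python
from urllib.robotparser import RobotFileParser

rules = RobotFileParser()
rules.parse([
    "User-agent: Googlebot",
    "Disallow: /google-secrets",
])

# plan.html is a made-up file name used only for this check.
googlebot_ok = rules.can_fetch("Googlebot", "/google-secrets/plan.html")
slurp_ok = rules.can_fetch("Slurp", "/google-secrets/plan.html")

print("Googlebot may fetch:", googlebot_ok)
print("Slurp may fetch:", slurp_ok)
```

Because the file has no rules for any other agent, this parser treats the path as open to Slurp while refusing it to Googlebot.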
The protocol has evolved since the initial specification was put into place. Wildcards are one extension that has proved popular.
User-agent: Slurp
Disallow: /*.gif$
This prevents Yahoo! (whose web crawler is called Slurp) from indexing any files on your site that end with ".gif". Here * matches any sequence of characters and $ anchors the match at the end of the URL. Keep in mind that wildcard matches are not supported by all search engines, so you have to preface these lines with the appropriate User-agent line.
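How a crawler that supports the extension might interpret such a pattern can be sketched in a few lines of Python. The function name wildcard_rule_matches is hypothetical, and the translation to a regular expression is an assumption about the common interpretation (* as "any characters", a trailing $ as "end of URL"), not the actual code of any search engine. Python's own urllib.robotparser does not interpret these wildcards.

```python
import re

def wildcard_rule_matches(pattern: str, path: str) -> bool:
    """Return True if a robots.txt path pattern matches the given URL path.

    Hypothetical helper: '*' matches any run of characters and a trailing
    '$' pins the match to the end of the path. Without '$' the pattern
    behaves as a plain prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += r"\Z"
    # re.match anchors at the start of the path, as robots.txt rules do.
    return re.match(regex, path) is not None

# Hypothetical paths used only to exercise the rule from above.
print(wildcard_rule_matches("/*.gif$", "/images/logo.gif"))
print(wildcard_rule_matches("/*.gif$", "/images/logo.gif.bak"))
```

Under these assumptions the first path matches the rule and would be blocked, while the second does not match because it does not end in ".gif".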
You can combine several of the above techniques in one robots.txt file. Here's a theoretical example.
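One way such a combined file could look is sketched below, embedded in a Python string so that it can be checked with the standard library's urllib.robotparser. The rule set is an assumption put together from the earlier examples, and the file names are hypothetical. The wildcard rule is left out because this parser does not interpret wildcards.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical combined robots.txt: one group for Googlebot,
# one catch-all group for every other crawler.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /google-secrets

User-agent: *
Disallow: /images
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())

# Made-up paths, used only to show which group each crawler follows.
checks = [
    ("Googlebot", "/google-secrets/plan.html"),
    ("Googlebot", "/images/logo.gif"),
    ("Slurp", "/google-secrets/plan.html"),
    ("Slurp", "/images/logo.gif"),
]
for agent, path in checks:
    verdict = "allowed" if rules.can_fetch(agent, path) else "blocked"
    print(f"{agent} {path}: {verdict}")
```

Note that with this parser a crawler follows only the group that names it: Googlebot is kept out of /google-secrets but is not bound by the /images rule in the catch-all group. If a rule should apply to a named crawler as well, it has to be repeated inside that crawler's group.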
Computer applications work great when it comes to following well-defined instructions. The human brain, however, is less efficient at this, so the best advice is to keep things simple.
For us mortals there is a robots.txt analysis tool in Google's webmaster tools. Highly recommended. Another good resource for more information on the Robots Exclusion Standard is www.robotstxt.org.
Today, when companies are spending a lot of money to be included in search engine listings, the idea of excluding your content may seem quaint. But from a security perspective there are many valid reasons for limiting what a search engine indexes on your site.