Effectively Using Robots.txt Files

Controlling which pages search engines look at on your site is a powerful resource for search engine optimization and even privacy. If you have a photos section you don't want listed in Yahoo's photo search, or a section of your site for a youth group that you don't want showing up in Google, then effectively using a robots.txt file is the solution to your problem.

Note: What I am about to describe should not be used as a security measure. It will not stop visitors from clicking links and going to these sections of your site. It is simply a convention the major search engines honor to give site owners some control over what from their sites gets listed. It is a powerful tool for what it is, but don't expect it to be more than that.

Search Engines Looking At Your Site

When someone links to your site, or you submit your site to a search engine, a program called a bot (or crawler) comes to take a look at the page at the other end of the link. It then follows the links on that page to other pages on your site until it has looked at everything available. The information from these pages gets run through the search engine's algorithms and added to its listings.

But, before search engines like Google, Yahoo, Microsoft, and the others start indexing your site, they look for rules about what not to list in a file called robots.txt. So, if you have pages you don't want listed, this is the place to list them.

Robots.txt File

This file has a very simple syntax and is easy to use. To tell all bots not to list any of your pages, your robots.txt file would contain:

User-agent: *
Disallow: /

This does two things. First, the User-agent line says which bot you are talking to. The * here means all of them. You could replace it with the name of a specific bot. If you didn't want your site listed in Google but wanted it everywhere else, you would change the * to the name of Google's bot, Googlebot. In most cases, though, the * is what you want.
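For example, to keep only Google's crawler out (its User-agent is Googlebot) while leaving every other bot unrestricted, you could use:

User-agent: Googlebot
Disallow: /

Bots with no matching rules are allowed everything, so this blocks Google alone.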

Next we have the path we don't want listed in the search engine. The / used here means everything. If you had specific pages you didn't want listed you would instead do something like:

User-agent: *
Disallow: /test-page-1.html
Disallow: /mytest/test-page-2.html

This tells all search engines not to list these two pages; everything else will be listed. Now, let's look at the case where you want to disallow an entire directory or path:

User-agent: *
Disallow: /test/
Disallow: /mytest/test-page-2.html
Disallow: /yourtes

The /test/ here tells the search engines not to list anything at example.com/test/ or anywhere below it in your structure. That means /test/somepath, /test/mypage.html, or anything else under that path. The /yourtes line does something similar but broader: because Disallow values are matched as prefixes, /yourtest.html, /yourtest/someinfo, /yourtesting/anotherpath, and anything else that starts with /yourtes will not be indexed.
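You can check this prefix matching yourself with Python's standard-library urllib.robotparser, which implements the same rules. A small sketch using the example above:

```python
from urllib import robotparser

# The rules from the example above.
rules = """\
User-agent: *
Disallow: /test/
Disallow: /mytest/test-page-2.html
Disallow: /yourtes
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Anything under /test/ is blocked.
print(parser.can_fetch("*", "https://example.com/test/mypage.html"))         # False
# /yourtes matches as a prefix, so /yourtesting/... is blocked too.
print(parser.can_fetch("*", "https://example.com/yourtesting/anotherpath"))  # False
# Everything else is still allowed.
print(parser.can_fetch("*", "https://example.com/index.html"))               # True
```

The same parser can also fetch a live file with set_url() and read() instead of parse().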

There are some things you can't do here under the original standard. You can't use a User-agent pattern like *bot*, and you can't write wildcard paths like /test/*/somepath or *.png. (Some engines have since added wildcard support, but it is a non-standard extension.)

This robots.txt file must be placed at the root of your site. For example, the path to the file would be example.com/robots.txt. The filename is also case sensitive: it should be all lowercase.
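Because crawlers only ever look at the root of the host, the robots.txt URL can be derived from any page URL by discarding the path. A minimal sketch in Python (robots_url is a hypothetical helper name, not part of any library):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # Crawlers request robots.txt from the root of the host,
    # no matter which page they are about to fetch.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://example.com/mytest/test-page-2.html"))
# https://example.com/robots.txt
```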

Non-Standard Tags

There are some non-standard tags that some search engines support. Don't rely on them to work everywhere.

User-agent: *
Crawl-delay: 10
Disallow: /test/
Allow: /test/test.html

The Crawl-delay here tells the search engine how many seconds to wait between requests to your site. The Allow line makes an exception for that one page: everything under /test/ except /test/test.html would be kept out of the listings.
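Python's urllib.robotparser also understands Crawl-delay and Allow, so you can sketch the effect. One caveat of that particular parser: it applies the first rule that matches, so the Allow line is placed before the broader Disallow below (search engines that honor Allow generally prefer the most specific rule instead):

```python
from urllib import robotparser

# Allow comes before Disallow because this parser is first-match-wins.
rules = """\
User-agent: *
Crawl-delay: 10
Allow: /test/test.html
Disallow: /test/
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

print(parser.crawl_delay("*"))                                       # 10
print(parser.can_fetch("*", "https://example.com/test/test.html"))   # True
print(parser.can_fetch("*", "https://example.com/test/other.html"))  # False
```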

These are not standard and not all search engines look at them.

Extending the Standard

User-agent: *
Disallow: /test/
Request-rate: 1/60
Visit-time: 0100-0500

There has been some extending of the standard, but these directives are still not official. Here, Request-rate tells the bot to fetch at most 1 page every 60 seconds. Visit-time is just what it sounds like: it sets the time of day (here 01:00 to 05:00) during which the bot may scan the site.
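Python's urllib.robotparser (3.6 and later) happens to parse Request-rate, which gives one way to see what the directive encodes (Visit-time is not supported there):

```python
from urllib import robotparser

rules = """\
User-agent: *
Disallow: /test/
Request-rate: 1/60
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

rate = parser.request_rate("*")
print(rate.requests, rate.seconds)  # 1 60 -> at most 1 request per 60 seconds
```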

These, again, are not part of the official standard so they should not be relied on.

Conclusion

If you are going to use a robots.txt file, I suggest staying as close to the standard as you can, or even just following it completely. That will give you the most consistent control across search engines over what gets listed.

Resources