Robots.txt file basics
When a web crawler visits a website, it first reads a file named robots.txt, which is located at the root of the domain. In this file you can specify whether and how crawlers may visit the site, which makes it possible to deny specific search engines access to your website. Note that the robots.txt file cannot be used to keep files secret, because anyone can still open them in a browser.
Some search engines even show such "hidden" pages in their search results, just without descriptions.
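This mechanism can be sketched with Python's standard-library urllib.robotparser module. The domain, paths, and rules below are made-up placeholders; the rules are parsed from a string instead of being fetched over the network:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules -- parsed from a string here instead of being
# fetched from http://example.com/robots.txt over the network.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Public pages may be read ...
print(parser.can_fetch("*", "http://example.com/index.html"))      # True
# ... but crawlers are asked to stay out of /private/.
print(parser.can_fetch("*", "http://example.com/private/a.html"))  # False
```

Note that can_fetch() only reports what the rules say; nothing technically prevents a client from requesting the page anyway, which is exactly why robots.txt cannot keep files secret.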
Structure
The first line of a block names the crawler (User-agent) that the following rules are addressed to. There is no limit on the number of these "blocks". After reading a block that starts with User-agent: *, a crawler stops reading the file, so blocks for specific crawlers have to be placed at the beginning of the file.
You can add single-line comments to a robots.txt file; they simply start with #. Comments can be helpful to describe your settings in the robots.txt. They are ignored by crawlers.
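The block structure above can be sketched with Python's urllib.robotparser, which applies a crawler's own block when one exists and falls back to the * block otherwise (the crawler names and paths below are made-up examples):

```python
from urllib.robotparser import RobotFileParser

# Two blocks: one for Googlebot, one fallback for everybody else.
# Comment lines are ignored, just like in a real robots.txt.
rules = """\
# Block for Google's crawler only
User-agent: Googlebot
Disallow: /not-google/

# Block for all other crawlers
User-agent: *
Disallow: /images/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("Googlebot", "/not-google/a.html"))  # False (its own block)
print(parser.can_fetch("Googlebot", "/images/a.png"))       # True  (the * block does not apply)
print(parser.can_fetch("SomeBot", "/images/a.png"))         # False (falls back to *)
```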
Possible Statements
Statement | Example | Description |
---|---|---|
User-agent: | User-agent: * <br> User-agent: Googlebot | Specifies the web crawler the following rules apply to. * selects all crawlers. |
Disallow: | Disallow: / <br> Disallow: /images/ <br> Disallow: /test.html | Forbids reading of the given files or directories. |
Allow: | Allow: / <br> Allow: /free/ <br> Allow: /public.html | Allows reading of the given files or directories. |
Crawl-delay: | Crawl-delay: 100 | Sets the readout speed. In this example, the crawler may open a new page only every 100 seconds. |
Sitemap: | Sitemap: http://for-example.com/sitemap-url.xml | Tells crawlers where the sitemap can be found. This only works for the following crawlers: Googlebot, Yahoo! Slurp, msnbot, Ask.com. |
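How the Crawl-delay and Sitemap statements are exposed can also be sketched with urllib.robotparser; crawl_delay() and site_maps() are standard-library methods (site_maps() requires Python 3.8+), and the sitemap URL is the made-up one from the table:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Crawl-delay: 100
Sitemap: http://for-example.com/sitemap-url.xml
"""

parser = RobotFileParser()
parser.modified()  # mark the rules as loaded so crawl_delay() gives an answer
parser.parse(rules.splitlines())

print(parser.crawl_delay("*"))  # 100 -> wait 100 seconds between requests
print(parser.site_maps())       # ['http://for-example.com/sitemap-url.xml']
```

A polite crawler would sleep for the returned number of seconds between requests; crawlers that do not support these statements simply ignore them.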
Example
```
User-agent: Googlebot
Allow: /public/
Disallow: /not-google.html

User-agent: *
Disallow: /images/
Disallow: /privates/
Disallow: /intern_file.html
# I am a single-line comment
```
More Information
You'll find more information at http://www.robotstxt.org.