A robots.txt file is very important if you want good rankings on search engines, yet many websites do not provide one. A robots.txt file also helps keep out unwanted robots such as email harvesters and image strippers. It defines which paths are off limits for spiders to visit, which is useful if you want to keep personal information or private files out of search engines.
What is Robots.txt
A robots.txt file is a plain text file that always lives in your web server's root directory. It contains restrictions for web spiders, telling them where they have permission to crawl. In effect, robots.txt defines rules for search engine spiders (robots): what to follow and what not to. It should be noted that web robots are not required to respect robots.txt files, but most well-written web spiders follow the rules you define.
How to Create Robots.txt
The format of the robots.txt file is simple. It consists of records, and each record consists of two fields: a User-agent line and one or more Disallow: lines. Each line follows the format:
<field>: <value>
The robots.txt file should be created in Unix line-ending mode! Most good text editors have a Unix mode, or your FTP client *should* do the conversion for you. Do not attempt to create a robots.txt file with an HTML editor that does not specifically have a plain text mode.
User-agent
The User-agent line specifies which robot the record applies to. For example:
User-agent: googlebot
You may also use the wildcard character "*" to specify all robots:
User-agent: *
You can find user agent names in your own logs by checking for requests to
robots.txt. Most major search engines have short names for their spiders.
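For example, you can give one spider its own record and define a default for everyone else (googlebot is Google's real spider name; the paths here are hypothetical placeholders):
User-agent: googlebot
Disallow: /nogoogle/

User-agent: *
Disallow: /private/
Here googlebot is only kept out of /nogoogle/, while every other spider is kept out of /private/.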
Disallow
The second part of a record consists of Disallow: directive lines. These lines specify files and/or directories. For example, the following line instructs spiders not to download contactinfo.htm:
Disallow: contactinfo.htm
You may also specify directories:
Disallow: /cgi-bin/
This blocks spiders from your cgi-bin directory.
The Disallow directive has a wildcard (prefix-matching) nature. The standard dictates that /bob would disallow both /bob.html and /bob/index.html (both the file bob and files in the bob directory will not be indexed).
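In other words, a single line like this (bob is the placeholder name from the example above):
User-agent: *
Disallow: /bob
keeps spiders away from /bob.html, /bob/index.html, and anything else whose path begins with /bob.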
If you leave the Disallow line blank, it indicates that ALL files may be retrieved. At least one Disallow line must be present for each User-agent line, or the record is not valid. A completely empty robots.txt file is treated the same as if it were not present.
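Putting it all together, here is a sketch of a complete robots.txt file (the robot name badbot and the paths are hypothetical examples; googlebot is Google's real spider). Lines starting with # are comments:
# Allow googlebot to retrieve everything (a blank Disallow allows all)
User-agent: googlebot
Disallow:

# Block a misbehaving spider from the entire site
User-agent: badbot
Disallow: /

# All other robots: keep out of cgi-bin and the contact page
User-agent: *
Disallow: /cgi-bin/
Disallow: /contactinfo.htm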