Robots and Spiders
Introduction
World Wide Web robots (also called bots, crawlers, or spiders) are programs
that traverse many pages on the World Wide Web by
recursively retrieving linked pages.
For the names of some common robots,
see the Names of Robots section below.
In the past, robots have at times
visited web servers when and where they should not have.
Sometimes the problem was specific to a particular robot
(e.g. certain robots overloaded servers with rapid-fire
requests, or retrieved the same files repeatedly).
In other situations robots traversed parts of a server
that were not intended to be traversed (e.g. deep virtual trees,
duplicated pages, temporary pages, or
unsuitable CGI scripts).
These incidents showed the need for an established
standard by which web servers can tell robots which directories
and files on the server should not be accessed.
Robot Exclusion Method
The method used to exclude robots from a web server is to
create a file on the server that specifies the access policy for
robots.
This file is placed in the top-level HTML directory on the server under the name
robots.txt, so that robots can retrieve it as /robots.txt.
The format and semantics of the robots.txt file are as follows:
The file consists of one or more records separated by one or
more blank lines (terminated by CR, CR/NL, or NL). Each
record contains lines of the form
<field>:<optionalspace><value><optionalspace>.
The field name is case insensitive.
Comments can be included in the file using UNIX Bourne shell
conventions, i.e. the '#' character
indicates that any preceding space and the remainder of
the line up to the line termination are to be ignored.
Lines containing only a comment are ignored completely.
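The line format and comment rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name parse_line is hypothetical, and treating a line without a colon as carrying no field is a choice made here, not something the text specifies.

```python
def parse_line(line):
    """Return (field, value) for one robots.txt line, or None if it carries no field."""
    # Drop the comment: '#' and the rest of the line, plus any space before it.
    line = line.split("#", 1)[0].strip()
    if ":" not in line:
        # Blank, comment-only, or malformed line (assumption: all are skipped).
        return None
    field, value = line.split(":", 1)
    # Field names are case insensitive; optional space around the value is dropped.
    return field.strip().lower(), value.strip()


print(parse_line("Disallow: /tmp/   # scratch space"))
print(parse_line("# just a comment"))
```

The first call yields the pair ('disallow', '/tmp/'); the second yields None because the line contains only a comment.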
Each record starts with one or more User-agent
lines, followed by one or more Disallow lines,
as detailed below. Unrecognised headers are ignored.
- User-agent
The value of this field is the name of the robot for which the
record describes an access policy.
If more than one User-agent field is present, the record
describes an identical access policy for more
than one robot. At least one User-agent field must be present
per record.
The robot should be liberal in interpreting this field.
A case-insensitive substring match of the name without
version information is recommended.
If the value is *, the record describes
the default access policy for any robot that has not
matched any of the other records. The robots.txt file must
not contain more than one such record.
- Disallow
The value of this field specifies a partial URL that is not
to be visited. This can be a full path or a partial
path; any URL that starts with this value will not be
retrieved. For example, Disallow: /help
disallows both /help.html and
/help/index.html, whereas
Disallow: /help/ would disallow
/help/index.html
but allow /help.html.
An empty value indicates that all URLs can be
retrieved. At least one Disallow field must
be present in a record.
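The two matching rules above (liberal User-agent matching and Disallow prefix matching) might be sketched as follows. The function names agent_matches and is_disallowed are hypothetical, and two details are assumptions rather than part of the text: that version information follows a '/' in the robot's name, and that the record's value is looked for inside the robot's name rather than the other way round.

```python
def agent_matches(record_agent, robot_name):
    """Liberal match: case-insensitive substring, version information ignored."""
    if record_agent == "*":
        return True
    # Assumption: version information follows a '/', as in "Scooter/2.0".
    name = robot_name.split("/", 1)[0].lower()
    return record_agent.lower() in name


def is_disallowed(path, disallow_values):
    """True if path starts with any non-empty Disallow value."""
    # An empty value disallows nothing, so it is skipped.
    return any(value != "" and path.startswith(value) for value in disallow_values)


# The /help examples from the text:
print(is_disallowed("/help.html", ["/help"]))         # True
print(is_disallowed("/help/index.html", ["/help"]))   # True
print(is_disallowed("/help.html", ["/help/"]))        # False
print(agent_matches("scooter", "Scooter/2.0"))        # True
```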
An empty /robots.txt file
is treated as if it were not present at all:
all robots have access to all files and directories.
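Putting the format together, a robot could group lines into records and then pick the record that applies to it. The sketch below is one possible reading, not a reference implementation: parse_records and may_fetch are hypothetical names, starting a new record when a User-agent line follows a Disallow line is a tolerance added here, and the substring direction and '/' version separator are assumptions.

```python
def parse_records(text):
    """Split robots.txt text into records of User-agent and Disallow values."""
    records, current = [], None
    for raw in text.splitlines():  # handles CR, CR/NL, and NL terminators
        line = raw.split("#", 1)[0].strip()
        if not line:
            # A truly blank line ends the record; a comment-only line does not.
            if not raw.strip():
                current = None
            continue
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if current is None or current["disallow"]:
                current = {"agents": [], "disallow": []}
                records.append(current)
            current["agents"].append(value)
        elif field == "disallow" and current is not None:
            current["disallow"].append(value)
        # Any other field is unrecognised and ignored.
    return records


def may_fetch(records, robot_name, path):
    """Apply the record naming this robot, else the '*' record, else allow."""
    name = robot_name.split("/", 1)[0].lower()  # assumption: version follows '/'
    chosen = None
    for record in records:
        agents = [agent.lower() for agent in record["agents"]]
        if any(agent not in ("", "*") and agent in name for agent in agents):
            chosen = record
            break
        if "*" in agents and chosen is None:
            chosen = record
    if chosen is None:
        return True  # empty file or no applicable record: everything is allowed
    return not any(v != "" and path.startswith(v) for v in chosen["disallow"])


print(may_fetch(parse_records(""), "AnyBot", "/private/page.html"))
```

With an empty file no record applies, so the final call reports that the hypothetical robot AnyBot may fetch the page.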
The following example /robots.txt file specifies
that no robots should visit any URL starting with
/closeddir/test/ or
/tmp/:
User-agent: *
Disallow: /closeddir/test/
Disallow: /tmp/
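One way to check a file like this is with Python's standard urllib.robotparser module, which reads robots.txt rules and answers whether a given robot may fetch a path. This is a minimal sketch; the robot name ExampleBot and the page paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /closeddir/test/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# "ExampleBot" is a hypothetical robot name; any name falls under the '*' record.
print(parser.can_fetch("ExampleBot", "/tmp/scratch.html"))        # False
print(parser.can_fetch("ExampleBot", "/closeddir/test/a.html"))   # False
print(parser.can_fetch("ExampleBot", "/closeddir/index.html"))    # True
```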
This example /robots.txt file specifies
that no robots should visit any URL starting with
/closeddir/test/, except the robot called
scooter:
User-agent: *
Disallow: /closeddir/test/

User-agent: scooter
Disallow:
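The effect of the scooter record overriding the default record can be checked with Python's standard urllib.robotparser module. In this sketch the two records are separated by a blank line, as the format requires; ExampleBot is a hypothetical name standing in for any other robot.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /closeddir/test/

User-agent: scooter
Disallow:
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

page = "/closeddir/test/a.html"  # hypothetical page inside the closed directory
print(parser.can_fetch("scooter", page))      # True: the empty Disallow allows everything
print(parser.can_fetch("ExampleBot", page))   # False: the default record applies
```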
This example indicates that no robots should visit
any part of this site:
User-agent: *
Disallow: /
The following example lets the Infoseek robot visit the pages
within a directory specifically created and optimized to
achieve a top ranking within Infoseek, and lets the Northern
Light robot visit the pages within a directory specifically
created and optimized to achieve a top ranking within Northern
Light. At the same time it denies the Northern Light robot
access to the Infoseek-optimized directory, and denies the
Infoseek robot access to the Northern Light-optimized directory. Lastly, it
denies all other robots access to both directories:
# Infoseek robot (SideWinder) can visit Infoseek-optimized dir,
# but not Northern Light-optimized directory.
User-agent: sidewinder
Disallow: /northernlight_optimised_directory/

# Northern Light (Gulliver) can visit Northern Light-optimized dir,
# but not Infoseek-optimized directory.
User-agent: gulliver
Disallow: /infoseek_optimised_directory/

# All other robots denied access to both directories
User-agent: *
Disallow: /infoseek_optimised_directory/
Disallow: /northernlight_optimised_directory/
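To see which robot ends up with access to which directory, the rules can be fed to Python's standard urllib.robotparser module and queried for each combination. This is a sketch for illustration only: ExampleBot is a hypothetical name standing in for "all other robots", and index.html is an assumed page inside each directory.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: sidewinder
Disallow: /northernlight_optimised_directory/

User-agent: gulliver
Disallow: /infoseek_optimised_directory/

User-agent: *
Disallow: /infoseek_optimised_directory/
Disallow: /northernlight_optimised_directory/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

directories = (
    "/infoseek_optimised_directory/",
    "/northernlight_optimised_directory/",
)

# Build a small access table: (robot, directory) -> allowed or not.
access = {
    (robot, directory): parser.can_fetch(robot, directory + "index.html")
    for robot in ("sidewinder", "gulliver", "ExampleBot")
    for directory in directories
}

for (robot, directory), allowed in access.items():
    print(f"{robot:<11} {directory:<36} {'allowed' if allowed else 'denied'}")
```

Each named robot is allowed only into its own directory, and the stand-in robot is denied both.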
Names of Robots
AltaVista - Scooter
Excite - ArchitextSpider
HotBot - Slurp
Infoseek - Infoseek Sidewinder
Infoseek - Infoseek Robot
Lycos - Lycos
Northern Light - Gulliver
WebCrawler - Webcrawler