Robots and Spiders

Introduction

World Wide Web robots (also called bots, crawlers, or spiders) are programs that traverse many pages on the World Wide Web by recursively retrieving linked pages. For the names of some well-known robots, see the Names of Robots section below.

In the past, robots have sometimes visited web servers when and where they should not have. Some of these problems were robot specific (e.g. certain robots overloaded servers with rapid-fire requests, or retrieved the same files repeatedly). In other cases, robots traversed parts of a server that were not intended to be traversed (e.g. deep virtual trees, duplicated pages, temporary pages, or unsuitable CGI scripts).

These incidents showed the need for an established standard by which web servers can indicate to robots which directories and files on the server should not be accessed.

Robot Exclusion Method

Robots are excluded from a web server by creating a file on the server that specifies the access policy for robots. The file must be named robots.txt and placed in the server's top-level HTML directory, so that it can be retrieved as /robots.txt.
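
As a rough illustration of where a robot would look for this file, the sketch below derives the /robots.txt location from the address of any page on the same server. The function name robots_txt_url and the example.com address are assumptions made for this example only.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the address of robots.txt for the server that hosts page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host (including any port); replace the path and
    # drop the query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.example.com/help/index.html?topic=1"))
# prints http://www.example.com/robots.txt
```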

Exclusion File Format

The format and semantics of the robots.txt file are as follows:

The file consists of one or more records separated by one or more blank lines (terminated by CR, CR/NL, or NL). Each record contains lines of the form <field>:<optionalspace><value><optionalspace>. The field name is case insensitive.

Comments can be included in the file using UNIX Bourne shell conventions: the '#' character indicates that any preceding space and the remainder of the line up to the line termination are to be ignored. Lines containing only a comment are ignored completely.
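
The line format and comment rule described above can be sketched in a few lines of Python. This is only a minimal illustration, not a complete parser; the function name parse_line is an assumption made for this example.

```python
from typing import Optional, Tuple


def parse_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (field, value) for one robots.txt line, or None if it has no field."""
    # Drop the comment: everything from '#' to the end of the line,
    # together with any space that precedes it.
    content = line.split("#", 1)[0].strip()
    if not content or ":" not in content:
        return None
    field, value = content.split(":", 1)
    # Field names are case insensitive, so normalise them to lower case.
    return field.strip().lower(), value.strip()


print(parse_line("Disallow: /tmp/   # scratch files"))  # prints ('disallow', '/tmp/')
print(parse_line("# a line with only a comment"))       # prints None
```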

Each record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below. Unrecognised headers are ignored.
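
One possible way to split a robots.txt file into records is sketched below. It assumes each record can be held as a dictionary with a list of User-agent values and a list of Disallow values; the function name parse_records and that data layout are choices made for this example, not part of the file format.

```python
from typing import Dict, List


def parse_records(text: str) -> List[Dict[str, List[str]]]:
    """Group the lines of a robots.txt file into records."""
    records: List[Dict[str, List[str]]] = []
    current = None
    # splitlines() accepts CR, CR/NL and NL line terminations.
    for raw in text.splitlines():
        # Comment-only lines are ignored completely; they do not end a record.
        if raw.strip().startswith("#"):
            continue
        line = raw.split("#", 1)[0].strip()
        if not line:
            current = None  # a blank line separates records
            continue
        if ":" not in line:
            continue
        field, value = line.split(":", 1)
        field, value = field.strip().lower(), value.strip()
        if field not in ("user-agent", "disallow"):
            continue  # unrecognised headers are ignored
        if current is None:
            current = {"user-agent": [], "disallow": []}
            records.append(current)
        current[field].append(value)
    return records


sample = "User-agent: *\nDisallow: /tmp/\n\nUser-agent: scooter\nDisallow:\n"
for record in parse_records(sample):
    print(record)
```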

User-agent

The value of this field is the name of the robot whose access policy the record describes.

If more than one User-agent field is present, the record describes an identical access policy for more than one robot. At least one User-agent field must be present per record.

The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

If the value is *, the record describes the default access policy for any robot that has not matched any of the other records. The robots.txt file must not contain more than one such record.
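
The matching rules above might be implemented along the following lines. This sketch assumes that records are already available as dictionaries with a "user-agent" list, that version information follows a "/" in the robot's name (as in "Scooter/2.0"), and that a match means the User-agent value occurs somewhere inside the robot's name. The function name select_record and the version strings are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, List[str]]


def select_record(robot_name: str, records: List[Record]) -> Optional[Record]:
    """Pick the record that applies to robot_name, falling back to the * record."""
    # Strip version information and compare without regard to case.
    name = robot_name.split("/", 1)[0].strip().lower()
    default = None
    for record in records:
        for agent in record["user-agent"]:
            if agent == "*":
                default = record  # used only if no named record matches
            elif agent and agent.lower() in name:
                return record
    return default


records = [
    {"user-agent": ["*"], "disallow": ["/closeddir/test/"]},
    {"user-agent": ["scooter"], "disallow": [""]},
]
print(select_record("Scooter/2.0", records)["disallow"])   # prints ['']
print(select_record("Gulliver/1.2", records)["disallow"])  # prints ['/closeddir/test/']
```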

Disallow

The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.

An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
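
The prefix rule for Disallow values can be checked with a small sketch such as the one below, which reuses the /help examples from the text. The function name is_allowed is an assumption made for this example.

```python
from typing import List


def is_allowed(path: str, disallow_values: List[str]) -> bool:
    """Return True if path may be retrieved under the given Disallow values."""
    for prefix in disallow_values:
        # An empty value disallows nothing; any other value blocks
        # every URL path that starts with it.
        if prefix and path.startswith(prefix):
            return False
    return True


print(is_allowed("/help.html", ["/help"]))         # prints False
print(is_allowed("/help/index.html", ["/help"]))   # prints False
print(is_allowed("/help.html", ["/help/"]))        # prints True
print(is_allowed("/help/index.html", ["/help/"]))  # prints False
print(is_allowed("/anything.html", [""]))          # prints True
```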

An empty /robots.txt file is treated as if it were not present at all, and all robots have access to all files and directories.

Robots.txt Examples

The following example /robots.txt file specifies that no robots should visit any URL starting with /closeddir/test/ or /tmp/:

User-agent: *
Disallow: /closeddir/test/
Disallow: /tmp/

This example /robots.txt file specifies that no robots should visit any URL starting with /closeddir/test/, except the robot called scooter:

User-agent: *
Disallow: /closeddir/test/ 

User-agent: scooter
Disallow:

This example specifies that no robot should visit any URL on the site:

User-agent: *
Disallow: /

The following example lets the Infoseek robot visit the pages in a directory created and optimized specifically for a top ranking in Infoseek, and lets the Northern Light robot visit the pages in a directory created and optimized specifically for a top ranking in Northern Light. At the same time, it denies the Northern Light robot access to the Infoseek-optimized directory and denies the Infoseek robot access to the Northern Light-optimized directory. Lastly, it denies all other robots access to both directories:

# Infoseek robot (SideWinder) can visit Infoseek-optimized dir,
# but not Northern Light-optimized directory.
User-agent: sidewinder
Disallow: /northernlight_optimised_directory/

# Northern Light (Gulliver) can visit Northern Light-optimized dir,
# but not Infoseek-optimized directory.
User-agent: gulliver
Disallow: /infoseek_optimised_directory/

# All other robots denied access to both directories
User-agent: *
Disallow: /infoseek_optimised_directory/
Disallow: /northernlight_optimised_directory/
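
One way to check a file like this before publishing it is to run it through the urllib.robotparser module in Python's standard library, as sketched below. The page name page.html and the robot name SomeOtherBot are made up for the test, and the module's matching rules may differ in detail from those of any particular robot.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Infoseek robot (SideWinder) can visit Infoseek-optimized dir,
# but not Northern Light-optimized directory.
User-agent: sidewinder
Disallow: /northernlight_optimised_directory/

# Northern Light (Gulliver) can visit Northern Light-optimized dir,
# but not Infoseek-optimized directory.
User-agent: gulliver
Disallow: /infoseek_optimised_directory/

# All other robots denied access to both directories
User-agent: *
Disallow: /infoseek_optimised_directory/
Disallow: /northernlight_optimised_directory/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

INFOSEEK_PAGE = "/infoseek_optimised_directory/page.html"  # hypothetical page
NORTHERN_PAGE = "/northernlight_optimised_directory/page.html"  # hypothetical page

for robot in ("Infoseek Sidewinder", "Gulliver", "SomeOtherBot"):
    print(robot,
          parser.can_fetch(robot, INFOSEEK_PAGE),
          parser.can_fetch(robot, NORTHERN_PAGE))
```

Each printed line shows the robot name followed by whether it may fetch the Infoseek page and the Northern Light page, so an unexpected True or False stands out quickly.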

Names of Robots

AltaVista - Scooter
Excite - ArchitextSpider
HotBot - Slurp
Infoseek - Infoseek Sidewinder
Infoseek - Infoseek Robot
Lycos - Lycos
Northern Light - Gulliver
WebCrawler - Webcrawler

Email: support@supersoft-solutions.com
Phone: 1-860-432-4449
 
Copyright © 1997-2006
Web Page Designs by J. Robert Nelli
Superior Software Solutions
88 Old Farm Road
South Windsor, CT 06074