Fractle

Support for robots.txt

Specifications, Details, and Examples


Mandelbot is Fractle's web crawler. You can control which of your files it crawls by using a robots.txt file. This page describes Mandelbot's support for robots.txt, which is part of its support for the Robots Exclusion Protocol.


Location of robots.txt

Mandelbot looks for a file named robots.txt in the root directory of your website. For example, for the url http://www.example.com/dogs/barking.html, Mandelbot would look for a corresponding robots.txt file at http://www.example.com/robots.txt.

Different domains, sub-domains, protocols (http and https), and ports each require their own robots.txt file in their respective root directory. The robots.txt file at http://www.example.com/robots.txt only applies to urls starting with http://www.example.com/. It will not apply to any urls starting with https://www.example.com/, http://subdomain.example.com/, or http://www.example.com:8080/.

Mandelbot will not look for your robots.txt file in sub-directories or under any other name. The root directory is the only location Mandelbot checks; if the file is not there, as far as Mandelbot is concerned, it does not exist.
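The origin-matching rules above can be sketched in Python. This is a hypothetical helper for illustration, not part of any Fractle tooling: the robots.txt url is always derived from the page url's own scheme, host, and port.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(page_url):
    """Return the robots.txt url that governs the given page url.

    The scheme, host, and port must all match exactly, so the
    robots.txt url is always built from the page's own origin.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.example.com/dogs/barking.html"))
# http://www.example.com/robots.txt
print(robots_txt_url("http://www.example.com:8080/dogs/barking.html"))
# http://www.example.com:8080/robots.txt
```

Note that a non-default port stays in the result, matching the rule that each port requires its own robots.txt file.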


Changing robots.txt

If you create or change your robots.txt file, please allow at least 48 hours for Mandelbot to discover the changes; we cache the robots.txt file to avoid sending unnecessary requests to your server.


Temporarily Stop Crawling

To temporarily stop Mandelbot from crawling your site, return a response containing a 503 Service Unavailable HTTP Status Code to any request made by Mandelbot. When such a response is received, Mandelbot will stop crawling and resume at some later time.
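As a rough sketch of the server side, using Python's standard http.server (the handler name is made up for illustration), a site under maintenance might answer every request with 503 so that crawlers pause rather than treat the pages as gone:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class MaintenanceHandler(BaseHTTPRequestHandler):
    """Answers every GET with 503 Service Unavailable, signalling a
    temporary outage that crawlers should retry later."""

    def do_GET(self):
        self.send_response(503)
        self.send_header("Retry-After", "3600")  # suggest retrying in an hour
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet

# To run: HTTPServer(("127.0.0.1", 8000), MaintenanceHandler).serve_forever()
```

In practice you would configure this at the web server or load balancer level rather than in application code.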

Do not change your robots.txt file to temporarily stop Mandelbot from crawling; the change will not take effect until the cached copy expires, and it can negatively affect the indexing of your pages in Fractle.


HTTP Status Codes

When Mandelbot makes an HTTP request for robots.txt, the HTTP status code it receives determines how the response is interpreted.

If a 2xx Success status code is received, the response will be processed according to the rules on this page.

If a 3xx Redirection status code is received, the redirects will be followed and the end result treated as though it were the robots.txt response.

If a 4xx Client Error status code is received, it will be treated as though no robots.txt file exists and crawling of any url is allowed. This is true even for 401 Unauthorized and 403 Forbidden responses.

If a 5xx Server Error status code is received, it will be treated as a temporary server problem and the request will be retried later. If Mandelbot determines a server is incorrectly returning 5xx status codes instead of 404 status codes, it will treat 5xx errors as 404 errors.
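The four rules above can be summarised in a short sketch (the function name and return values are illustrative, and redirects are assumed to already be resolved by the HTTP client):

```python
def robots_fetch_policy(status):
    """Map the HTTP status code of a robots.txt fetch to the
    crawler behaviour described above."""
    if 200 <= status < 300:
        return "parse"        # process the body as robots.txt
    if 400 <= status < 500:
        return "allow-all"    # treated as if no robots.txt exists
    if 500 <= status < 600:
        return "retry-later"  # temporary server problem, retry later
    return "unknown"

print(robots_fetch_policy(200))  # parse
print(robots_fetch_policy(403))  # allow-all
print(robots_fetch_policy(503))  # retry-later
```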


Format of robots.txt

The robots.txt file should be a plain text file consisting of lines separated by carriage returns, line feeds, or both.

A valid line consists of several ordered elements: a field, a colon, a value, and an optional comment prefixed by a hash: <field>:<value>(#<comment>).

The <field> element is case-insensitive and any whitespace before or after any element is ignored.

Invalid lines and valid lines with an invalid <field> are ignored; the remaining valid lines are processed according to the rules on this page. The treatment of valid lines with a valid <field> and an invalid <value> is undefined.
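A minimal sketch of parsing a single line into its <field> and <value> elements, per the format above (this regex is an illustration of the format, not Mandelbot's actual parser):

```python
import re

# <field>:<value>(#<comment>) with optional whitespace around elements.
LINE_RE = re.compile(r"^\s*([A-Za-z-]+)\s*:\s*([^#]*?)\s*(?:#.*)?$")

def parse_line(line):
    """Return (field, value) for a valid line, or None for an invalid
    one. Fields are case-insensitive, so they are lowercased."""
    m = LINE_RE.match(line)
    if not m:
        return None
    return m.group(1).lower(), m.group(2)

print(parse_line("Disallow: /secret  # hidden"))  # ('disallow', '/secret')
print(parse_line("not a valid line"))             # None
```

A caller would then ignore any None results and any tuples whose field it does not recognise.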


User Agents and Directive Groups

A robots.txt file contains zero or more directive groups. A directive group consists of one or more user agent lines, followed by one or more directive lines.

A user agent line has <field> = User-agent and a case-insensitive <value> which represents the user agent's name. It is invalid to use the same user agent in multiple user agent lines within a robots.txt file.

Zero or one of the directive groups will apply to Mandelbot, which follows the directives for User-agent: Mandelbot and falls back to the directives for User-agent: * if there is no group specifically for Mandelbot.

In this example, Mandelbot follows its specific group:

User-agent: Mandelbot
Disallow: /private   # Mandelbot follows this directive

User-agent: *
Disallow: /secret    # Mandelbot ignores this directive

In this example, Mandelbot falls back to the default group:

User-agent: Anotherbot
Disallow: /private   # Mandelbot ignores this directive

User-agent: *
Disallow: /secret    # Mandelbot follows this directive

In this example, no group applies to Mandelbot:

User-agent: Anotherbot
Disallow: /private   # Mandelbot ignores this directive

In this example, two user-agents including Mandelbot share a group:

User-agent: Mandelbot
User-agent: Anotherbot
Disallow: /private   # Mandelbot follows this directive

User-agent: *
Disallow: /secret    # Mandelbot ignores this directive
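The selection logic in the examples above can be sketched as follows, representing groups as a dict from lowercased user-agent name to its directives (this representation is an assumption made for illustration):

```python
def select_group(groups, agent):
    """Pick the directive group that applies to a crawler, per the
    fallback rule above: an exact (case-insensitive) user-agent match
    wins, then the * group, then no group at all (None)."""
    return groups.get(agent.lower(), groups.get("*"))

groups = {"mandelbot": ["Disallow: /private"], "*": ["Disallow: /secret"]}
print(select_group(groups, "Mandelbot"))              # ['Disallow: /private']
print(select_group({"*": ["Disallow: /"]}, "Mandelbot"))  # ['Disallow: /']
print(select_group({"anotherbot": []}, "Mandelbot"))  # None
```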


Disallow Directive

A disallow directive is a type of directive line. It has <field> = Disallow and a <value> which is a relative path from the website root. The path must start with a forward slash, is case sensitive, and special characters must be percent encoded.

Mandelbot will not crawl any url that is prefixed by the path of any disallow directive in the applicable directive group.

In this example, Mandelbot and other robots are blocked from crawling any url:

User-agent: *
Disallow: /

In this example, Mandelbot is blocked from crawling any url:

User-agent: Mandelbot
Disallow: /

In this example, Mandelbot is blocked from crawling any url prefixed by /secret or /hidden (e.g. /secret/doc.html is blocked, but /private/secret/doc.html is not):

User-agent: Mandelbot
Disallow: /secret
Disallow: /hidden

A special type of disallow directive is one with no path, which means allow everything. It is used to override the default directive group. In this example, Mandelbot may crawl any url while other robots are blocked from crawling any url:

User-agent: Mandelbot
Disallow: 

User-agent: *
Disallow: /
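The prefix rule behind these examples can be sketched as (a simplified illustration that ignores wildcards and allow directives, which are covered below):

```python
def is_disallowed(url_path, disallow_paths):
    """A url is blocked when its path starts with the path of any
    disallow directive in the applicable group. An empty path
    (a Disallow line with no value) blocks nothing."""
    return any(url_path.startswith(p) for p in disallow_paths if p)

print(is_disallowed("/secret/doc.html", ["/secret", "/hidden"]))          # True
print(is_disallowed("/private/secret/doc.html", ["/secret", "/hidden"]))  # False
print(is_disallowed("/anything", [""]))                                   # False
```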


Allow Directive

An allow directive is a type of directive line. It has <field> = Allow and a <value> which is a relative path from the website root. The path must start with a forward slash, is case sensitive, and special characters must be percent encoded.

Allow directives are used to override disallow directives. By default, all urls are allowed, so allow directives are only necessary when a disallow directive's scope needs to be reduced.

Mandelbot will crawl any url that is prefixed by the path of any allow directive in the applicable directive group.

In this example, Mandelbot is blocked from crawling any url prefixed by /secret except those prefixed by /secret/readme.txt (e.g. /secret/doc.html is blocked, but /secret/readme.txt and /secret/readme.txt?v=1 are not):

User-agent: Mandelbot
Disallow: /secret
Allow: /secret/readme.txt


Wildcards

The disallow and allow directives support wildcards. The paths may use a * to represent zero or more characters and a $ as the last character to represent the end of a url.

In this example, Mandelbot is blocked from crawling the contents of any folder named private (e.g. /secret/private/doc.html is blocked, but /secret/private-stuff/doc.html is not):

User-agent: Mandelbot
Disallow: /*/private/

In this example, Mandelbot is blocked from crawling urls that end with a pdf file extension (e.g. /doc.pdf is blocked, but /doc.pdf?load=1 is not):

User-agent: Mandelbot
Disallow: /*.pdf$
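One way to implement this matching is to translate each path pattern into a regular expression, as in this hypothetical sketch (* becomes ".*" and a trailing $ becomes an end-of-string anchor):

```python
import re

def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex:
    * matches any run of characters, and a $ in the final position
    anchors the match at the end of the url."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    return re.compile(regex + ("$" if anchored else ""))

print(bool(pattern_to_regex("/*.pdf$").match("/doc.pdf")))         # True
print(bool(pattern_to_regex("/*.pdf$").match("/doc.pdf?load=1")))  # False
```

A $ anywhere other than the final position is treated as a literal character, which is what makes the /earn$* trick described later work.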


Precedence of Directives

To determine which directive applies, Mandelbot sorts the directives by path length in descending order, with allow directives taking precedence in the case of ties.

If the highest precedence directive that matches the url is an allow directive, Mandelbot will crawl the url; if it is a disallow directive, Mandelbot will not crawl the url; and if no directive matches, Mandelbot will crawl the url.

In this example, Mandelbot is blocked from crawling urls with a pdf file extension except those prefixed by /files (e.g. /doc.pdf is blocked, but /files.pdf is not). /files is the same length as /*.pdf, and allow directives take precedence in ties; /doc has no effect because it is shorter than /*.pdf, and longer directives take precedence:

User-agent: Mandelbot
Disallow: /*.pdf
Allow:    /files
Allow:    /doc
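A minimal sketch of the tie-breaking rule, using plain prefix matching for brevity (wildcard expansion is omitted, and path length is measured on the raw directive text, as in the example above):

```python
def crawl_allowed(path, allows, disallows):
    """Longest matching path wins; an allow beats a disallow of
    equal length; if nothing matches, crawling is allowed."""
    best = None  # (length, is_allow); tuple order gives allow the tie
    for is_allow, paths in ((True, allows), (False, disallows)):
        for p in paths:
            if p and path.startswith(p):
                key = (len(p), is_allow)
                if best is None or key > best:
                    best = key
    return True if best is None else best[1]

print(crawl_allowed("/secret/readme.txt", ["/secret/readme.txt"], ["/secret"]))  # True
print(crawl_allowed("/secret/doc.html", ["/secret/readme.txt"], ["/secret"]))    # False
```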


Exotic Usage of Wildcards

Placing a * at the end of a path is usually redundant, as a trailing * is implicit, but it has two use cases: allowing a regular dollar character at the end of a path, and changing the precedence of a directive.

In this example, Mandelbot is blocked from crawling the url /money and any urls prefixed by /earn$ because the * in /earn$* turns the dollar character into a regular character:

User-agent: Mandelbot
Disallow: /money$
Disallow: /earn$*

In this example, Mandelbot is blocked from crawling urls with a pdf file extension except those prefixed by /doc (e.g. /files.pdf is blocked, but /doc.pdf is not) as the extra * characters at the end of the paths change their length and therefore their precedence:

User-agent: Mandelbot
Disallow: /*.pdf*
Allow:    /files
Allow:    /doc****


Comments

Mandelbot ignores comments. Comments may be included after any valid line by prefixing the comment with a #. The hash and everything after it is ignored. A comment is also allowed on its own line.

In this example containing different types of comment, Mandelbot is blocked from crawling any url prefixed by /secret and other robots are blocked from crawling any url:

# A standalone comment on its own line

User-agent: Mandelbot
# Whitespace is allowed but not required between elements
Disallow: /secret# This is a comment

User-agent: * # For other robots
Disallow: /   # Block everything


Sitemaps

A sitemap, as defined on sitemaps.org, provides a way for you to inform Mandelbot about all the pages on your site. Mandelbot may use it to make crawling more efficient.

You specify sitemaps in your robots.txt file by creating lines with <field> = Sitemap and a <value> which is an absolute url to a valid sitemap file (e.g. Sitemap: http://www.example.com/sitemap.xml).

Sitemap lines exist outside directive groups. They usually appear at the start of the robots.txt file, before any directive groups, but may occur anywhere in the file. Sitemap lines are not associated with any specific user agent and may be used by any crawler. You can include multiple sitemap lines within your robots.txt file.
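Because sitemap lines are global, extracting them is a simple scan over the whole file, as in this sketch (comments on sitemap lines are not stripped here, for brevity):

```python
def sitemap_urls(robots_txt):
    """Collect the value of every Sitemap line in a robots.txt body,
    ignoring directive group boundaries."""
    urls = []
    for line in robots_txt.splitlines():
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls

example = """Sitemap: http://www.example.com/sitemap.xml
User-agent: *
Disallow: /secret
"""
print(sitemap_urls(example))  # ['http://www.example.com/sitemap.xml']
```

Note that str.partition splits only on the first colon, so the colons inside the url value are preserved.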


Crawl Delay

Mandelbot doesn't currently support Crawl-delay directives in robots.txt files, but we can manually set the Crawl-delay used by Mandelbot for your website if you contact us.


Robots Exclusion Protocol

Mandelbot's support for robots.txt is just part of its support for the Robots Exclusion Protocol. Mandelbot also supports Robot Tags, which provide control over which files are indexed.

For an overview of the protocol, additional information on Mandelbot's use of robots.txt, and details on interactions and conflicts between robots.txt and Robot Tags, read about the Robots Exclusion Protocol.