Robots Exclusion Protocol
The Robots Exclusion Protocol is the protocol for instructing search engines whether or not to index, archive or summarize any given page. These instructions are contained in a robots.txt file in the root folder of the site. The robots.txt definitions are advice to robots; they do not actually block access to data. Well-behaved robots will obey many of these directives.
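For example, a minimal robots.txt that asks every robot to stay out of a single directory (the directory name is only a placeholder):

    User-agent: *
    Disallow: /temp/

A robot reads the file from the top, applies the group whose User-agent line matches its own name, and skips any URL whose path begins with a Disallow prefix.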
Disallowing features
Many TikiWiki features are provided through specific PHP scripts, so crawling of those features can be stopped by blocking the scripts that serve them. If you are using Search Engine Friendly URLs (SEFURL), the friendly forms of those URLs should also be listed. Consider the characteristics of the search site and of your content; for example, images with only short descriptions are of little use to search engines that do not handle images well.
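As a sketch, a robots.txt section along these lines could be used; the script names below are typical TikiWiki programs chosen for illustration, so substitute the scripts that actually serve the features you want to keep robots out of:

    User-agent: *
    # per-feature scripts that robots gain nothing from crawling
    Disallow: /tiki-editpage.php
    Disallow: /tiki-print.php
    Disallow: /tiki-pagehistory.php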
Blocking duplicate access paths
Wildcards are a newer feature which many robots do not yet recognize, and robots that do may support only some of them. In a pattern, "*" matches any sequence of characters and "$" matches the end of the URL, while "?" and "&" are not wildcards: they match a literal question mark and ampersand.

If your site is not using SEFURL, many parts of the site must be accessed with at least one parameter, such as "?id=123", so you should not block every pattern containing a question mark. If you are using SEFURL, your URLs will contain fewer question marks.

For TikiWiki, "Disallow: /*&" could be used to disallow every URL with an ampersand, which keeps robots from examining variations of the same pages. However, examine your site before blocking all URLs with an ampersand rather than only specific parameters: some default URLs require at least one ampersand; for example, accessing a file's information requires both a gallery ID and a file ID. By adding a parameter name after the ampersand you can disallow just that parameter. For example, once a robot has fully crawled a file directory in the default order, there is no need for it to also follow "&sort_mode" URLs and view the same data in a different order.

The early-2009 version of the Cuil crawler, Twiceler, evidently crawls every identified variant of a URL; it is not known whether Twiceler obeys the "*" wildcard.
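As a sketch, both approaches look like this; the rules only help with robots that honor the "*" wildcard, and the aggressive variant is left commented out in favor of the narrower one (sort_mode is the parameter named above):

    User-agent: *
    # aggressive: skip every URL containing an ampersand
    # Disallow: /*&
    # narrower: skip only redundant re-sorted views of already-crawled lists
    Disallow: /*&sort_mode=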
Robots.txt Directives
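User-agent and Disallow are the two directives defined by the original protocol; Allow, Crawl-delay and Sitemap are later extensions that are widely, but not universally, supported. An annotated sketch (the site URL and paths are placeholders):

    User-agent: *                            # the rules below apply to all robots
    Disallow: /cgi-bin/                      # do not fetch URLs starting with this prefix
    Allow: /cgi-bin/public/                  # extension: exception to the Disallow above
    Crawl-delay: 10                          # extension: wait 10 seconds between requests
    Sitemap: http://example.com/sitemap.xml  # extension: where to find the XML sitemap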
HTML META Directives
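Where robots.txt controls which URLs a robot fetches, the robots META tag controls what it may do with a page it has already fetched; the common values are index/noindex, follow/nofollow, noarchive and nosnippet, matching the index, archive and summarize instructions described above. A sketch of a page header combining them:

    <head>
      <!-- do not index this page or keep a cached copy, but do follow its links -->
      <meta name="robots" content="noindex, follow, noarchive">
    </head>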