Google still uses a few.
The first is NOARCHIVE.
That controls the archived (cached) version of a website.
You can tell Google to disable it with:
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
It differs from the Protocol for Robots Exclusion in that you need no effort or permission from your Web Server Administrator.
Note: Currently only a few robots support this tag!
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>
...
Some examples:
<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">
Note that the "robots" name of the tag and the content are case-insensitive.
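The placement and case rules above can be checked with a small script. This is a minimal sketch using Python's standard html.parser module; the class name RobotsMetaFinder and the function find_robots_meta are hypothetical helpers introduced here for illustration, not part of any robot's actual implementation.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the content of every <meta name="robots"> tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not attribute values,
        # so the name and content values are lowercased here by hand.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = (attr_map.get("name") or "").lower()
        if name == "robots":
            self.contents.append((attr_map.get("content") or "").lower())


def find_robots_meta(html_text):
    """Return the lowercased content of each robots META tag found in html_text."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    return finder.contents


page = '<HTML><HEAD><META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW"><TITLE>...</TITLE></HEAD></HTML>'
print(find_robots_meta(page))
```

Because both the tag name and the content are compared in lowercase, the upper-case and lower-case spellings of the tag are treated the same way.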
You obviously should not specify conflicting or repeating directives such as:
<meta name="robots" content="INDEX,NOINDEX,NOFOLLOW,FOLLOW,FOLLOW">
A formal syntax for the Robots META tag content is:
content = all | none | directives
all = "ALL"
none = "NONE"
directives = directive ["," directives]
directive = index | follow
index = "INDEX" | "NOINDEX"
follow = "FOLLOW" | "NOFOLLOW"
Using a Robots Meta Tag
Brett Tabke
The Robots META tag tells a robot whether it is OK to index this page. It is also used to invite a spider to walk down through all your pages. It is growing in importance.
It is also useful if you don't have access to your server's root directory to control a robots.txt file.
Some search engines, such as Inktomi, now fully obey the Robots Meta Tag. Inktomi will crawl down through a site if the Index,Follow syntax is used.
Robots Meta Tag Format
The Robots META tag is placed in the HEAD section of your HTML document. The format is quite simple (case is not significant):
<HTML>
<HEAD>
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
<META NAME="DESCRIPTION" CONTENT="THIS PAGE ....">
<TITLE>...</TITLE>
</HEAD>
<BODY>
...
Robot Meta Tag Options
There are four directives that can be placed in a robots meta tag. The CONTENT section of the meta tag can contain index, noindex, follow, and nofollow, separated by commas.
At this point, only the following combinations make sense:
The INDEX directive tells the robot it is ok to index the page.
The FOLLOW directive tells the robot it is ok to follow the links found on this page. Some search engine articles on the Robots Meta tag say the predefined defaults are INDEX and FOLLOW, but that is not true with Inktomi, whose default is index,nofollow.
There are also two global directives that can specify both actions: ALL=INDEX,FOLLOW and NONE=NOINDEX,NOFOLLOW.
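The directives described above can be turned into a pair of yes/no answers with a few lines of code. This is a minimal sketch, not any search engine's real parser: the function name parse_robots_content is hypothetical, and two behaviours are assumptions made here, namely that a later directive overrides an earlier conflicting one and that unknown directives are skipped. The defaults are parameters because, as noted, engines differ on them.

```python
def parse_robots_content(content, default_index=True, default_follow=True):
    """Return (index, follow) booleans for a robots META content string."""
    index, follow = default_index, default_follow
    for token in content.split(","):
        directive = token.strip().upper()  # directives are case-insensitive
        if directive == "ALL":
            index, follow = True, True
        elif directive == "NONE":
            index, follow = False, False
        elif directive == "INDEX":
            index = True
        elif directive == "NOINDEX":
            index = False
        elif directive == "FOLLOW":
            follow = True
        elif directive == "NOFOLLOW":
            follow = False
        # Assumption: anything else is ignored rather than treated as an error.
    return index, follow


print(parse_robots_content("NOINDEX,FOLLOW"))
print(parse_robots_content("noindex", default_follow=False))
```

Passing default_follow=False models an engine whose default is index,nofollow, so a tag that only says noindex leaves following switched off.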
Robots Meta Tag Examples:
<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
Related
- Robots.txt Validator
- Robots.txt Tutorial Using robots.txt.
- Robots Exclusion Meta Tag Using robots metatags.
- Robots.txt : The Big Crawl
We recently spidered 2 million robots.txt files and found a surprising number of problems.
- Robots Exclusion Standard rfc4.
- Root of Robots Exclusion Standard directory with some interesting files.
- Search Indexing Robots and Robots.txt article at searchtools.com.
The 'robots' meta tag
Example:
<meta name="robots" content="noindex,nofollow">
This is one tag that is still widely respected by search engines. It is used to pass instructions to the search engines' robots - often referred to as spiders or crawlers. The default (i.e., if there is no robots meta tag) is for search engines to index the page and to follow links on the page - if this is your intention, you can omit the tag entirely.
| index | = index this page * |
| noindex | = don't index this page |
| follow | = follow the links from this page to get more pages * |
| nofollow | = don't follow the links from this page |
| all | = index this page and follow the links from it * |
| none | = don't index this page and don't follow the links |
* = default setting (no need for a tag)
In recognition of this problem, many Web Robots offer facilities for Web site administrators and content providers to limit what the robot does. This is achieved through two mechanisms:
| The Robots Exclusion Protocol | A Web site administrator can indicate which parts of the site should not be visited by a robot, by providing a specially formatted file on their site, in http://.../robots.txt. |
| The Robots META tag | A Web author can indicate if a page may or may not be indexed, or analysed for links, through the use of a special HTML META tag. |
Note that these methods rely on cooperation from the Robot, and are by no means guaranteed to work for every Robot. If you need stronger protection from robots and other agents, you should use alternative methods such as password protection.
In a nutshell, when a Robot visits a Web site, say http://www.foobar.com/, it first checks for http://www.foobar.com/robots.txt. If it can find this document, it will analyse its contents for records like:
User-agent: *
Disallow: /
to see if it is allowed to retrieve the document. The precise details on how these rules can be specified, and what they mean, can be found in:
Note that currently only a few robots implement this.
In this simple example:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
a robot should neither index this document, nor analyse it for links.
Full details on how this tag works are provided:
Robots META Tag
By Submit Corner
Overview: The Robots Tag declares to search engines what content to index and spider
Robots, also known as spiders, are automated programs that crawl your site to work out how to categorize the information you submitted to the search engine. Typically, a website owner submits the main page, and the robots visit the site and collect all subpages and related links from the main page. However, this tag enables you to control which pages you would like spidered and which to ignore. For instance, you may not want certain webpages and directories (e.g., CGI scripts) indexed in the search engines. Using the robots tag, you can define which pages to follow, which to index, and which to ignore completely.
META Tag Usage
| META Name: | "Robots" |
| Supported Types: | noindex | index | nofollow | follow |
| General Usage: | <META name="Robots" content="index,follow"> |
The Robots META
Tag is used by search engines as a means to indicate the level of
spidering a search engine should do. Most search engines look for this
META tag and will only index and/or spider the pages you want to be
indexed.
Recommended Usage: Suggested
Meta Robots: Controls search engine robots on a
per-page basis.
Example: <META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"> Robots may
traverse this page but not index it.
Recommendation: Do not use. If you need to control the search
engine robots, use a robots.txt file instead. It is more widely supported and is
not ignored as this tag sometimes is.
Controlling the Spiders
Really bad spiders go after directories
and files you specifically disallow. ...
There may be more than one spider identification line and there may be more than one directory/file name line.
Here is an example that tells the googlebot to stay away from all files in the /cgi-bin/ directory:
User-agent: googlebot
Disallow: /cgi-bin/
If you have instructions for other spiders, insert a blank line in the file then follow the same format — spiders first, then instructions.
Here is an example robots.txt file with 3 instruction blocks.
User-agent: googlebot
User-agent: Googlebot-Image
Disallow: /hotbotstuff/

User-agent: slurp
Disallow: /googlestuff/

User-agent: *
Disallow: /cgi-bin/
Disallow: /secret/
Disallow: /mypasswords.html
When a good spider reads the robots.txt file, it keeps the instructions at hand while it's spidering the web site.
Before the spider indexes a page, it consults the robots.txt file, omitting any instruction blocks that do not pertain to it. Under the Robots Exclusion Protocol, a spider that finds a block naming it follows that block, and the "*" block applies only to spiders that no other block names. In the above example, the googlebot would follow the first instruction block, and a spider not named anywhere in the file would follow the last one.
The spider reads from top to bottom, stopping whenever there's a directory or file name match. Thus, a spider consulting the above robots.txt file before spidering http://example.com/secret/index.html would see the
Disallow: /secret/
line and not index the /secret/index.html page.
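The top-to-bottom matching described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a full robots.txt implementation: the function name is_allowed is hypothetical, the rules are assumed to be the Disallow paths from the one block that applies to the spider, and matching is a plain prefix comparison.

```python
def is_allowed(path, disallowed_prefixes):
    """Return False as soon as one Disallow prefix matches the requested path."""
    for prefix in disallowed_prefixes:
        # An empty Disallow value excludes nothing, so it is skipped.
        if prefix and path.startswith(prefix):
            return False  # stop reading at the first match
    return True


# Disallow lines taken from the "User-agent: *" block in the example above.
rules = ["/cgi-bin/", "/secret/", "/mypasswords.html"]

print(is_allowed("/secret/index.html", rules))
print(is_allowed("/index.html", rules))
```

The first call hits the /secret/ rule and stops there; the second call runs through every rule without a match, so the page may be indexed.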
Some spiders will recognize an Allow: instruction in the robots.txt file. If you use an allow instruction to allow indexing certain directories or files, realize that those spiders that don't recognize it will simply ignore it.
An example of use would be to let Google index only the http://example.com/mostlysecret/googlefood.html page in that directory and disallow the entire /mostlysecret/ directory to all other robots. Because the robots.txt file is read from top to bottom, the allow should be in the file above the "all robots" disallow. Example:
User-agent: googlebot
User-agent: Googlebot-Image
Disallow: /hotbotstuff/
Allow: /mostlysecret/googlefood.html

User-agent: slurp
Disallow: /googlestuff/

User-agent: *
Disallow: /mostlysecret/
Disallow: /cgi-bin/
Disallow: /secret/
Disallow: /mypasswords.html
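One way to check a file like this before publishing it is Python's standard urllib.robotparser module. The sketch below feeds it the example file; the helper name build_parser and the user agent "otherbot" are made up for illustration. Python's parser is only one interpretation of the rules: it applies the first block that names the requesting robot and falls back to the "*" block when no block does, so real spiders may behave differently.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: googlebot
User-agent: Googlebot-Image
Disallow: /hotbotstuff/
Allow: /mostlysecret/googlefood.html

User-agent: slurp
Disallow: /googlestuff/

User-agent: *
Disallow: /mostlysecret/
Disallow: /cgi-bin/
Disallow: /secret/
Disallow: /mypasswords.html
"""


def build_parser(text):
    """Parse robots.txt text directly, without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
page = "http://example.com/mostlysecret/googlefood.html"

print(parser.can_fetch("googlebot", page))  # googlebot's block allows this page
print(parser.can_fetch("otherbot", page))   # falls back to the "*" block
```

Here googlebot is allowed to fetch the page, while a robot that is not named in the file is refused by the /mostlysecret/ rule in the "*" block.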
Robots.txt and the Robot Meta Tag
User-agent: *
Disallow: /images/
Disallow: /contact.html
Disallow: /privacy/privacy.html
The first line specifies which robots should ignore /images/, /contact.html and /privacy/privacy.html. The asterisk * is a wildcard - so all robots should ignore the directories and files listed below it. If I only wanted Googlebot to ignore those directories & files, I'd type "User-agent: Googlebot".
The second line refers to an entire directory. Nothing in that directory will be indexed.
The third line refers to a specific page in the root directory - in this case the contact.html file.
The fourth line refers to a specific file in a specific directory.
Creating and Using a robots.txt File
This line can be repeated for each directory or
file you want to exclude, ...
1. Exclude a file from an individual Search Engine
You have a file, privatefile.htm, in a directory called 'private' that you do not wish to be indexed by Google. You know that the spider that Google sends out is called 'Googlebot'. You would add these lines to your robots.txt file:
User-Agent: Googlebot
Disallow: /private/privatefile.htm
Now you want to keep Google away from those images. Google grabs these images with a separate bot from the one that indexes pages generally, called Googlebot-Image. You have a couple of choices here:
User-Agent: Googlebot-Image
Disallow: /images/
That will work if you are very organized and keep all your images strictly in the images folder.
User-Agent: Googlebot-Image
Disallow: /
This one will prevent the Google image bot from indexing any of your images, no matter where they are in your site.
Finally, you have two pages called content1.html and content2.html, which are optimized for Google and Lycos respectively. So, you want to hide content1.html from Lycos (The Lycos spider is called T-Rex):
User-Agent: T-Rex
Disallow: /content1.html
and content2.html from Google.
User-Agent: Googlebot
Disallow: /content2.html
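When a site has several per-spider rules like these, the file can be generated rather than typed by hand. The following is a minimal sketch; the function name build_robots_txt and the dictionary layout are assumptions made for this example. It writes one block per spider, so Googlebot's two exclusions from the examples above end up together under a single User-Agent line.

```python
def build_robots_txt(rules):
    """Build robots.txt text from a mapping of user-agent name to disallowed paths."""
    blocks = []
    for agent, paths in rules.items():
        lines = [f"User-Agent: {agent}"]
        lines += [f"Disallow: {path}" for path in paths]
        blocks.append("\n".join(lines))
    # Instruction blocks are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


print(build_robots_txt({
    "Googlebot": ["/private/privatefile.htm", "/content2.html"],
    "Googlebot-Image": ["/images/"],
    "T-Rex": ["/content1.html"],
}))
```

Keeping each spider's rules in one block avoids relying on how a given spider treats a user agent that is repeated in several places in the file.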
The created field contains the date the site section was originally published. This is used for site maintenance and archive operations.
| <meta name="created" content=" [date] "> |
| Variable | Description |
| [date] | The section created date in YYYYMMDD format to allow numeric sorting. |
Every page MUST have a creation date in the specified date format.
The created field MUST reflect the date the site section was originally published.
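The YYYYMMDD requirement can be checked automatically during site maintenance. This is a minimal sketch using Python's standard datetime module; the function name is_valid_created_date is hypothetical. Surrounding spaces are stripped first on the assumption that the value may be written with padding, as in the template above.

```python
from datetime import datetime


def is_valid_created_date(value):
    """Return True if value is a real calendar date written as YYYYMMDD."""
    value = value.strip()
    # Require exactly eight digits so that forms like 2004-02-29 are rejected.
    if len(value) != 8 or not value.isdigit():
        return False
    try:
        datetime.strptime(value, "%Y%m%d")
    except ValueError:
        return False  # e.g. month 13 or 30 February
    return True


print(is_valid_created_date(" 20040229 "))
print(is_valid_created_date("20040230"))
```

Because the format is fixed-width and starts with the year, valid values also sort chronologically when compared as numbers or as strings.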