Traffic-Meta Tags-Robots Meta
 

<META NAME="ROBOTS" CONTENT="ALL">
<meta name="robots" content="index, follow" />
<META NAME="robots" CONTENT="index,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="INDEX,NOINDEX,NOFOLLOW,FOLLOW,FOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"> Robots may traverse this page but not index it.
<meta name="robots" content="noindex,nofollow">

 


Google still honors a few additional robots directives.
The first is NOARCHIVE.
It controls the archived (cached) version of a page that Google keeps.
You can tell Google not to keep it with:
<META NAME="ROBOTS" CONTENT="NOARCHIVE">

 


 

HTML Author's Guide to the Robots META tag

The Robots META tag is a simple mechanism to indicate to visiting Web Robots if a page should be indexed, or links on the page should be followed.

It differs from the Protocol for Robots Exclusion in that you need no effort or permission from your Web Server Administrator.

Note: Currently only a few robots support this tag!

 

Where to put the Robots META tag

Like any META tag it should be placed in the HEAD section of an HTML page:
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>
...
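
To check which Robots META tag a page actually carries, a short script can pull it out of the markup. The following is a minimal sketch using Python's standard `html.parser`; the class name `RobotsMetaFinder` and the helper `find_robots_meta` are illustrative names, not part of any robot's implementation, and the sketch scans the whole document rather than only the HEAD section.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Collects the content of every <meta name="robots"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.contents = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases tag and attribute names; attribute values
        # keep their case, so the name value is compared case-insensitively.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            self.contents.append(attr_map.get("content", ""))


def find_robots_meta(html_text):
    """Return the content strings of all robots META tags in html_text."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    return finder.contents


page = """<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>
</body>
</html>"""

print(find_robots_meta(page))
```

For the sample page this prints a one-element list holding `noindex,nofollow`.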

What to put into the Robots META tag

The content of the Robots META tag contains directives separated by commas. The currently defined directives are [NO]INDEX and [NO]FOLLOW. The INDEX directive specifies if an indexing robot should index the page. The FOLLOW directive specifies if a robot is to follow links on the page. The defaults are INDEX and FOLLOW. The values ALL and NONE set all directives on or off: ALL=INDEX,FOLLOW and NONE=NOINDEX,NOFOLLOW.

Some examples:

<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">

Note that the "robots" name of the tag and the content are case-insensitive.

You obviously should not specify conflicting or repeating directives such as:

<meta name="robots" content="INDEX,NOINDEX,NOFOLLOW,FOLLOW,FOLLOW">

A formal syntax for the Robots META tag content is:

content    = all | none | directives
all        = "ALL"
none       = "NONE"
directives = directive ["," directives]
directive  = index | follow
index      = "INDEX" | "NOINDEX"
follow     = "FOLLOW" | "NOFOLLOW"
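
The grammar above is small enough to check by hand, but a sketch of a validator may help. The function below is a hypothetical helper (`parse_robots_content` is not from the guide). It assumes that whitespace around commas is tolerated, and it rejects the repeated or conflicting directives the guide warns against, even though the grammar itself does not forbid them.

```python
def parse_robots_content(content):
    """Return (index, follow) booleans for a Robots META content string.

    Raises ValueError for unknown, repeated, or conflicting directives.
    """
    # Case is not significant; tolerating spaces around commas is an assumption.
    tokens = [token.strip().upper() for token in content.split(",")]

    if tokens == ["ALL"]:
        return True, True
    if tokens == ["NONE"]:
        return False, False

    index = None
    follow = None
    for token in tokens:
        if token in ("INDEX", "NOINDEX"):
            if index is not None:
                raise ValueError("repeated or conflicting index directive")
            index = token == "INDEX"
        elif token in ("FOLLOW", "NOFOLLOW"):
            if follow is not None:
                raise ValueError("repeated or conflicting follow directive")
            follow = token == "FOLLOW"
        else:
            raise ValueError("unknown directive: %r" % token)

    # Directives that are not given fall back to the defaults INDEX and FOLLOW.
    return (True if index is None else index,
            True if follow is None else follow)


print(parse_robots_content("noindex,follow"))
print(parse_robots_content("NONE"))
```

Here `parse_robots_content("noindex,follow")` yields `(False, True)`, while the conflicting string `INDEX,NOINDEX,NOFOLLOW,FOLLOW,FOLLOW` raises `ValueError`.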

Using a Robots Meta Tag

Brett Tabke

The Robots META tag tells a robot whether it is OK to index a page. It can also invite a spider to crawl down through all your pages. It is growing in importance.

It is also useful if you don't have access to your server's root directory to control a robots.txt file.

Some search engines, such as Inktomi, now fully obey the Robots META tag. Inktomi will crawl down through a site if the index,follow syntax is used.

Robots Meta Tag Format

The Robots META tag is placed in the HEAD section of your HTML document. The format is quite simple (case is not significant):
 

<HTML>
<HEAD>
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
<META NAME="DESCRIPTION" CONTENT="THIS PAGE ....">
<TITLE>...</TITLE>
</HEAD>
<BODY>
...

Robot Meta Tag Options

There are four directives that can be placed in a Robots META tag: index, noindex, follow, and nofollow. They go in the CONTENT attribute of the tag, separated by commas.

At this point, only the following combinations make sense:
 

The INDEX directive tells the robot it is ok to index the page.

The FOLLOW directive tells the robot it is OK to follow the links found on this page. Some search engine articles on the Robots META tag say the predefined defaults are INDEX and FOLLOW; this is not true of Inktomi, whose default is index,nofollow.

There are also two global directives that specify both actions: ALL=INDEX,FOLLOW and NONE=NOINDEX,NOFOLLOW.

Robots Meta Tag Examples:

<META NAME="ROBOTS" CONTENT="INDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW">
<META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">


 

The 'robots' meta tag

Example:
<meta name="robots" content="noindex,nofollow">

This is one tag that is still widely respected by search engines. It is used to pass instructions to the search engines' robots - often referred to as spiders or crawlers. The default (i.e., if there is no robots meta tag) is for search engines to index the page and to follow links on the page - if this is your intention, you can omit the tag entirely.

 

index = index this page *
noindex = don't index this page
follow = follow the links from this page to get more pages *
nofollow = don't follow the links from this page
all = index this page and follow the links from it *
none = don't index this page and don't follow the links
 

* = default setting (no need for a tag)
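
Because the starred settings need no tag, a page template only has to emit the tag when it departs from the defaults. The sketch below illustrates that rule; `robots_meta_tag` is a hypothetical helper, and it assumes the index/follow defaults described above apply to the robots you care about.

```python
def robots_meta_tag(index=True, follow=True):
    """Build a robots META tag, or return "" when the defaults suffice."""
    if index and follow:
        # index,follow is the default setting, so no tag is needed.
        return ""
    content = ",".join([
        "index" if index else "noindex",
        "follow" if follow else "nofollow",
    ])
    return '<meta name="robots" content="%s">' % content


print(robots_meta_tag(index=False, follow=False))
print(repr(robots_meta_tag()))
```

The first call prints the same tag as the example above; the second prints an empty string.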
 

Robots Exclusion

Sometimes people find they have been indexed by an indexing robot, or that a resource discovery robot has visited part of a site that for some reason shouldn't be visited by robots.

In recognition of this problem, many Web Robots offer facilities for Web site administrators and content providers to limit what the robot does. This is achieved through two mechanisms:

 

The Robots Exclusion Protocol
 
A Web site administrator can indicate which parts of the site should not be visited by a robot, by providing a specially formatted file on their site, in http://.../robots.txt.

 

The Robots META tag
 
A Web author can indicate if a page may or may not be indexed, or analysed for links, through the use of a special HTML META tag.

 

The remainder of this page provides full details on these facilities.

Note that these methods rely on cooperation from the Robot, and are by no means guaranteed to work for every Robot. If you need stronger protection from robots and other agents, you should use alternative methods such as password protection.


The Robots Exclusion Protocol

The Robots Exclusion Protocol is a method that allows Web site administrators to indicate to visiting robots which parts of their site should not be visited by the robot.

In a nutshell, when a Robot visits a Web site, say http://www.foobar.com/, it first checks for http://www.foobar.com/robots.txt. If it can find this document, it will analyse its contents for records like:

User-agent: *
Disallow: /
to see if it is allowed to retrieve the document. The precise details on how these rules can be specified, and what they mean, can be found in:

 


 

The Robots META tag

The Robots META tag allows HTML authors to indicate to visiting robots if a document may be indexed, or used to harvest more links. No server administrator action is required.

Note that currently only a few robots implement this.

In this simple example:

<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
a robot should neither index this document, nor analyse it for links.

Full details on how this tag works are provided:

 

Robots META Tag
By Submit Corner

 

Overview: The Robots Tag declares to search engines what content to index and spider.

Robots, also known as spiders, are automated programs that crawl your site to work out how to categorize the information you submitted to the search engine. Typically, a website owner submits the main page, and the robots then visit the site and collect all subpages and related links from that page. This tag lets you control which pages are spidered and which are ignored. For instance, you may not want certain webpages and directories (e.g., CGI scripts) indexed by the search engines. Using the robots tag, you can define which pages to follow, which to index, and which to ignore completely.

META Tag Usage


 
META Name: "Robots"
Supported Types: noindex | index | nofollow | follow
General Usage: <META name="Robots" content="index,follow">

Search Engines Usage

Search engines read the Robots META tag to determine how much of a site they should spider. Most search engines look for this META tag and will only index and/or spider the pages you want to be indexed.

Recommended Usage: Suggested

 

Meta Robots: Controls search engine robots on a per-page basis.
Example: <META NAME="ROBOTS" CONTENT="NOINDEX,FOLLOW"> Robots may traverse this page but not index it.
Recommendation: Do not use. If you need to control the search engine robots, use a robots.txt file instead. It is more widely supported and is not ignored as this tag sometimes is.

 


Controlling the Spiders
Really bad spiders go after directories and files you specifically disallow. ...

There may be more than one spider identification line and there may be more than one directory/file name line.

Here is an example that tells the googlebot to stay away from all files in the /cgi-bin/ directory:

User-agent: googlebot
Disallow: /cgi-bin/

If you have instructions for other spiders, insert a blank line in the file then follow the same format — spiders first, then instructions.

Here is an example robots.txt file with 3 instruction blocks.

  1. Tell the googlebot and Googlebot-Image (Google's image indexing spider) to stay away from the /hotbotstuff/ directory.
  2. Tell HotBot's spider to stay away from the /googlestuff/ directory.
  3. Tell all spiders (indicated by an asterisk in place of the spider identification) to stay out of the /cgi-bin/ and /secret/ directories and out of the mypasswords.html file located in the document root.
User-agent: googlebot
User-agent: Googlebot-Image
Disallow: /hotbotstuff/

User-agent: slurp
Disallow: /googlestuff/

User-agent: *
Disallow: /cgi-bin/
Disallow: /secret/
Disallow: /mypasswords.html

When a good spider reads the robots.txt file, it keeps the instructions at hand while it's spidering the web site.

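Before the spider indexes a page, it consults the robots.txt file, skipping any instruction blocks that do not pertain to it. In the above example, the googlebot would skip the second block, which names only slurp. Be aware that the Robots Exclusion standard defines the "*" block as the default for robots that match no other block, so a spider that finds a block naming it may obey only that block and never apply the "*" rules. If a rule must bind a named spider, repeat it in that spider's block.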

The spider reads from top to bottom, stopping whenever there's a directory or file name match. Thus, a spider consulting the above robots.txt file before spidering http://example.com/secret/index.html would see the

Disallow: /secret/

line and not index the /secret/index.html page.
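
The lookup described above can be tried out with Python's standard `urllib.robotparser` module. This is a minimal sketch, not how any particular search engine works: `examplebot` is a made-up spider name that matches only the "*" block, and the URLs are placeholders on example.com.

```python
from urllib.robotparser import RobotFileParser

# The three-block robots.txt file from the example above.
ROBOTS_TXT = """\
User-agent: googlebot
User-agent: Googlebot-Image
Disallow: /hotbotstuff/

User-agent: slurp
Disallow: /googlestuff/

User-agent: *
Disallow: /cgi-bin/
Disallow: /secret/
Disallow: /mypasswords.html
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(agent, url):
    """Return True if the parsed rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


# "examplebot" (hypothetical) has no block of its own, so the "*" block applies.
print(allowed("examplebot", "http://example.com/secret/index.html"))
print(allowed("examplebot", "http://example.com/index.html"))
# slurp is kept out of /googlestuff/ by its own block.
print(allowed("slurp", "http://example.com/googlestuff/page.html"))
```

The three checks print `False`, `True`, and `False`. Note that Python's parser treats the "*" block as a fallback for agents that match no named block, so it would report /secret/index.html as allowed for `googlebot` here; other robots may behave differently.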

Some spiders will recognize an Allow: instruction in the robots.txt file. If you use an allow instruction to allow indexing certain directories or files, realize that those spiders that don't recognize it will simply ignore it.

An example of use would be to let Google index only the http://example.com/mostlysecret/googlefood.html page in that directory and disallow the entire /mostlysecret/ directory to all other robots. Because the robots.txt file is read from top to bottom, the allow should be in the file above the "all robots" disallow. Example:

User-agent: googlebot
User-agent: Googlebot-Image
Disallow: /hotbotstuff/
Allow: /mostlysecret/googlefood.html

User-agent: slurp
Disallow: /googlestuff/

User-agent: *
Disallow: /mostlysecret/
Disallow: /cgi-bin/
Disallow: /secret/
Disallow: /mypasswords.html
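
The effect of the Allow: line can be checked the same way. The sketch below relies on Python's standard `urllib.robotparser`, which is one parser that recognizes Allow:; `examplebot` is again a made-up name standing in for "all other robots".

```python
from urllib.robotparser import RobotFileParser

# The robots.txt file from the Allow: example above.
ROBOTS_WITH_ALLOW = """\
User-agent: googlebot
User-agent: Googlebot-Image
Disallow: /hotbotstuff/
Allow: /mostlysecret/googlefood.html

User-agent: slurp
Disallow: /googlestuff/

User-agent: *
Disallow: /mostlysecret/
Disallow: /cgi-bin/
Disallow: /secret/
Disallow: /mypasswords.html
"""

allow_parser = RobotFileParser()
allow_parser.parse(ROBOTS_WITH_ALLOW.splitlines())

GOOGLEFOOD = "http://example.com/mostlysecret/googlefood.html"

# googlebot matches its own block, where the Allow: line applies.
print(allow_parser.can_fetch("googlebot", GOOGLEFOOD))
# "examplebot" (hypothetical) falls through to the "*" block and is refused.
print(allow_parser.can_fetch("examplebot", GOOGLEFOOD))
```

This prints `True` for googlebot and `False` for the other robot.
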
Robots.txt and the Robot Meta Tag
 

The syntax of the robots.txt file

User-agent: *
Disallow: /images/
Disallow: /contact.html
Disallow: /privacy/privacy.html

The first line specifies which robots should ignore /images/, /contact.html and /privacy/privacy.html. The asterisk * is a wildcard - so all robots should ignore the directories and files listed below it. If I only wanted Googlebot to ignore those directories & files, I'd type "User-agent: Googlebot".

The second line refers to an entire directory. Nothing in that directory will be indexed.

The third line refers to a specific page in the root directory - in this case the contact.html file.

The fourth line refers to a specific file in a specific directory.
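
If the list of excluded paths lives in a script or configuration, the file can be generated rather than typed by hand. The following sketch writes the same line-by-line layout; `build_robots_txt` is a hypothetical helper, and it emits only User-agent and Disallow lines.

```python
def build_robots_txt(rules):
    """Build robots.txt text from (user_agents, disallowed_paths) pairs.

    Each pair becomes one block: the User-agent lines first, then one
    Disallow line per path. Blocks are separated by a blank line.
    """
    blocks = []
    for agents, paths in rules:
        lines = ["User-agent: %s" % agent for agent in agents]
        lines += ["Disallow: %s" % path for path in paths]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


print(build_robots_txt([
    (["*"], ["/images/", "/contact.html", "/privacy/privacy.html"]),
]))
```

Called with the single rule shown, it reproduces the four-line file above.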

 

Creating and Using a robots.txt File

 

1. Exclude a file from an individual Search Engine

You have a file, privatefile.htm, in a directory called 'private' that you do not wish to be indexed by Google. You know that the spider that Google sends out is called 'Googlebot'. You would add these lines to your robots.txt file:

User-Agent: Googlebot
Disallow: /private/privatefile.htm

 

Now you want to keep Google away from those images. Google grabs images with a separate bot, called Googlebot-Image, rather than the one that indexes pages generally. You have a couple of choices here:

User-Agent: Googlebot-Image
Disallow: /images/

That will work if you are very organized and keep all your images strictly in the images folder.

User-Agent: Googlebot-Image
Disallow: /

This one will prevent the Google image bot from indexing any of your images, no matter where they are in your site.

 

Finally, you have two pages called content1.html and content2.html, which are optimized for Google and Lycos respectively. So, you want to hide content1.html from Lycos (The Lycos spider is called T-Rex):

User-Agent: T-Rex
Disallow: /content1.html

and content2.html from Google. 

User-Agent: Googlebot
Disallow: /content2.html

 


2.3 Created

The created field contains the date the site section was originally published. This is used for site maintenance and archive operations.

<meta name="created" content=" [date] ">

 
Variable Description
[date] The section created date in YYYYMMDD format to allow numeric sorting.

Every page MUST have a creation date in the specified date format.

The created field MUST reflect the date the site section was originally published.
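
As a rough illustration of the YYYYMMDD requirement, the sketch below formats a date with Python's standard `datetime` module. The function name `created_meta_tag` and the sample dates are made up for the example, and the output omits the spaces that surround the [date] placeholder above, which is an assumption about the intended form.

```python
from datetime import date


def created_meta_tag(published):
    """Return a created META tag with the date in YYYYMMDD format."""
    return '<meta name="created" content="%s">' % published.strftime("%Y%m%d")


print(created_meta_tag(date(2004, 3, 9)))

# YYYYMMDD strings sort in date order, which is what allows numeric sorting.
stamps = [d.strftime("%Y%m%d") for d in (date(2004, 3, 9), date(2003, 12, 25))]
print(sorted(stamps))
```

The first line printed is `<meta name="created" content="20040309">`, and the sorted list puts 20031225 before 20040309.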