Google® recently launched a service called Google® SiteMaps to
optimize Googlebot's crawling of web sites. It allows webmasters to
submit new and modified URLs to Google's spider, Googlebot. Google
SiteMap submissions have no impact on rankings on Google's SERPs,
nor do they influence PageRank™ calculations, but they will most
probably help webmasters get their new content crawled by Googlebot
faster than before. Although we can't predict how, or how much, this
new service (which is still in beta) will help site owners, this
tutorial shows how to make use of Google SiteMaps.
[Update July 2005: It works like a charm, even better than expected.
It improves search engine visibility to a great degree, as long as
webmasters take care what they submit to Google.]
Google's sitemap program provides detailed crawler
reports which make it easy to fix issues like broken links,
conflicts with robots.txt exclusions, etc., in no time.
Web sites without an underlying database and many
smaller sites can't work with the dynamic approach to fully automate
the Google SiteMaps channel outlined here, so here you go ...
Usually Google's crawler
Googlebot will find each and every page on your web site, as
long as a link known by Google points to it. Once Googlebot has
spidered a page, it returns every once in a while to check for your
updates. Shortly after Googlebot's visit, Google updates its index
and changes the ranking of your pages. If you've changed the content
(visible text, image descriptions...) of a page, Google will deliver
your pages on its
SERPs for new keywords or it will rank your pages differently.
If you've added new pages and if you provide links to these
pages, Google includes them in its index. That's a simplified
picture, but to make use of Google's SiteMap service you don't
need to understand the details. Hire an SEO if you want your web
site ranked better.
This established procedure has its disadvantages, for Google and
for webmasters alike. It burns resources, and a high percentage of
the results are pretty much useless. Googlebot has to crawl zillions
of pages daily, just to find out that they weren't changed. Since
Googlebot is that busy fetching pages archived in the stone age of
the internet, it may find new content way too late. What was missing
is a communication channel between Google and site owners, dedicated
to adjusting the crawling process. Dealing with billions of pages on
the net, Google had no chance to communicate with webmasters on an
individual or per-site basis.
Google SiteMaps now opens this channel for everybody,
offering a method to exchange information on new and modified
content in a timely manner. Both sides can benefit: Google saves a
whole lot of machine time and bandwidth costs, site owners get
their new content onto Google's SERPs earlier and reduce their
server load, because Googlebot no longer spiders archived content
too frequently.
Due to the nature of the beast, this channel needs full automation
on both sides. Google itself, as operator of the service, took care
of its side, but a few million webmasters around the globe are
called to implement a suitable solution to fully automate the Google
SiteMap generation and maintenance, without reinventing the
wheel. Google offers a
Sitemap Generator requiring Python 2.2 or higher
installed on the web server. A couple of webmasters have posted
their scripts on the boards, a few blogs are offering solutions
written in PHP, ASP and other programming languages, and a handful
of tools are available from messages in
Google's SiteMap Group. Sooner or later all content
management systems will come with this functionality built in.
Fortunately, there is a lot of useful stuff out there.
Unfortunately, given the fast growing number of posts and code
snippets scattered over message boards, blogs, usenet groups and
search engine related web sites, a webmaster searching for an
adaptable solution fitting a particular web site's needs is looking
for a needle in a haystack. Assuming that a webmaster is enough of a
code monkey to customize a script, we don't offer just another
superfluous piece of code, but a tutorial on how to make use of
Google SiteMaps.
First of all, Google's SiteMap service does not
replace the established crawling procedure. It's offered as an
addition to the old fashioned spidering by following links. That
means, webmasters don't need to send each and every URL thru this
new channel. Googlebot will still find (all) pages, whether they are
listed in the web site's site map or not. Also, Google SiteMaps do
not make standard site maps obsolete. Googlebot will continue to
follow links from these site navigation elements.
This said, here is how
Google SiteMap works:
1. The webmaster compiles a list of useful URLs and adds a few
optional attributes (date of last modification, priority and change
frequency) to each URL entry. This list must be served as XML file
according to the
sitemap protocol defined by Google. Usually a file named
'sitemap.xml' gets placed on the web server's root directory. Google
accepts plain text files too, but processes sitemaps provided in XML
format with a higher priority.
2. The webmaster submits the URL of the sitemap to Google. Google
checks it for valid syntax and provides online stats showing the
submission state. For accepted sitemaps, Google schedules a crawl
using the information provided by the webmaster. Shortly after each
download of a sitemap, Googlebot visits the web site and fetches new
and modified content. From this point on, the established procedure
applies.
3. On every change of content, the webmaster updates the sitemap and
resubmits it to Google. -> 2.
It's that easy. Even the XML format requires neither
additional software nor any understanding of
XML. Once the initial Google SiteMap implementation
works, resubmits can be fully automated.
As said before, it's not necessary to put each URL
available from a web site into the sitemap, although Google
encourages webmasters to submit even images and movies, which makes
little sense without META data describing the content [1].
Google SiteMaps was launched to give webmasters an
opportunity to tell Google which pages they consider valuable for
search engine users. For example, if your contact page behaves
dynamically depending on the referring page, you don't need to
submit every permutation to Google. Also, don't bother submitting
URLs excluded in your
robots.txt. It should be a no-brainer, but don't submit doorway
pages, duplicated content and the like; chances are good that Google
will ignore your sitemaps after a while if you cheat.
Concentrate your efforts on pages which are hard to spider, for
example dynamic URLs having many arguments in the query string,
pages linked from dynamic pages, and pages deeply buried in your
linking hierarchy. If you're using session IDs, provide Google with
clean URLs (all randomly generated noise stripped). In the sitemap
you can use long dynamic URLs, as long as each one stays under 2,048
characters.
Mass submissions of URLs are not a new thing, but the possibility to
suggest how a search engine crawler should handle them is new and
pioneering. Google's sitemap protocol defines three optional
attributes of URLs: priority, change frequency and
last modification. If you can't provide a particular
attribute for a page (yet), skip it. The <url> tag is perfectly
valid containing the page location alone. Put in additional
information as you can, but don't try to populate these tags with
more or less useless values just because they are defined.
The most important tag is <lastmod>, telling Google when a
page was actually modified or created. This enables Googlebot to
pick up fresh content deliberately, probably a long time before it
would stumble upon the very first link pointing to it. By the way,
changes of this attribute in the underlying database should trigger
a sitemap resubmission. It is in the webmaster's best interest not
to abuse <lastmod>. Minor template changes affecting a bunch of
pages are no reason to submit all pages based on the altered
template as modified. Real modifications are different wording,
additional text information and brand new content.
The <priority> tag is meant as a hint to balance crawling
capacities. Say a sitemap contains 10,000 modified URLs, but
Googlebot's time slot scheduled for the web site in question would
allow the fetching of only 1,000 pages. Now Googlebot should extract
1,000 URLs ordered by priority and probably last modification from
the sitemap, fetch these pages and return later on to eat the 9,000
remaining pages.
Google says 'Search engines use this information when selecting
between URLs on the same site, so you can use this tag to increase
the likelihood that your more important pages are present in a
search index.' This statement made many site owners hope that they
could influence rankings on Google's SERPs. That's wishful
thinking. It simply means that Googlebot will possibly crawl
high-priority URLs before low-priority pages.
Assign reasonable priorities from 0.0 to 1.0 to your pages. For
example, a brand new article should get a higher priority assigned
than the more or less static home page. Given priorities are
interpreted relative to other pages on the same web site. The best
advice is: honestly assign high priorities to often changed pages
which are of a great interest for your users, and low priorities to
static stuff.
The <changefreq> tag seems to be meant as an educated guess,
just a hint to the crawler. The list of valid values is short:
"always", "hourly", "daily", "weekly", "monthly", "yearly" and
"never". Irregular changes are not covered, so assign your best
guess or skip the tag and rely on <lastmod>. "Never" stands for
archived content. Use "always" for frequently updated news feeds and
other stuff triggering content changes on (nearly) every page view.
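The advice on the three optional attributes can be sketched as a pair of small filters. This is only an illustration under our own assumptions: cleanChangeFrequency() and cleanPriority() are hypothetical names, not part of the sitemap protocol or any Google API. Each returns a usable value, or null when the attribute is better skipped than filled with a useless value.

```php
<?php
// Hypothetical helpers (names are assumptions made for this sketch).
// They return a cleaned value, or null when the tag should be omitted.

// Accept only the seven change frequencies listed in the text above.
function cleanChangeFrequency($value) {
    $valid = array('always', 'hourly', 'daily', 'weekly',
                   'monthly', 'yearly', 'never');
    $value = strtolower(trim((string) $value));
    return in_array($value, $valid, true) ? $value : null;
}

// Priorities must lie between 0.0 and 1.0; anything else is skipped.
function cleanPriority($value) {
    if (!is_numeric($value)) {
        return null;
    }
    $number = (float) $value;
    if ($number < 0.0 || $number > 1.0) {
        return null;
    }
    // One decimal place, always with a dot as decimal separator.
    return number_format($number, 1, '.', '');
}
```

Running values from a CMS form through filters like these before they reach the sitemap keeps typos such as "sometimes" or a priority of 5 out of the XML.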
1
META data describing non-textual content
means title/alt text in image elements, anchor text in
links, and surrounding text as well as META description
tags. HTML pages get crawled more frequently than images or
videos. Image/video-URIs harvested during regular crawls get
queued into the specific crawling schedules. Since there is
a relation between descriptive META data and non-textual
content, it makes sound sense to submit all kind of content
via sitemaps. It sure helps Google to make its
image/video-search more current.
Scheduling batch jobs to generate RSS feeds and
similar stuff like the sitemap.xml file is a way too complex
procedure for such a simple task, and this approach is
error-prone. Better implement your sitemap generator as a dynamic
XML file, that is a script reflecting the current state of your web
site on each request [1].
After submitting a sitemap to Google, you don't know when Googlebot
will find the time to crawl your web site. Most probably you'll
release a lot of content changes between the resubmit and
Googlebot's visit. Also, crawlers of other search engines may be
interested in your XML sitemap in the future. There are other
advantages too, so you really should ensure that your sitemap
reflects the current state of your web site every time a web robot
fetches it.
You can use any file name for your sitemap. Google accepts what
you submit, 'sitemap.xml' is just a default. So you can go for
'sitemap.php', 'sitemap.asp', 'mysitemap.xhtml' or whatever fits
the scripting language you prefer, as long as the content is valid
XML. However, there are good reasons to stick with the default
'sitemap.xml'. Here is an example for Apache/PHP:
Configure your web server to parse .xml files for PHP,
e.g. by adding this statement to your root's .htaccess
file:
AddType application/x-httpd-php .htm .xml .rss
Now you can use PHP in all .php, .htm, .xml
and .rss files. http://www.yourdomain.com/sitemap.xml behaves like
any other PHP script. Note: static XML files will now produce a PHP
error, because with short open tags enabled PHP tries to execute the
<?xml version header as code.
You don't need XML software to produce the pretty simple
XML of Google's sitemap protocol. The PHP example below should be
easy to understand, even if you prefer another programming language.
Error handling as well as elegant programming was omitted to make
the hierarchical XML structure transparent and understandable.
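function makeUrlTag ($url, $modifiedDateTime, $changeFrequency, $priority) {
    // NOTE: this header and the global imports are reconstructed from
    // the makeUrlTag() call shown further down. The tag fragments
    // ($urlOpen, $locOpen, ...) are assumed to be global strings
    // holding the opening and closing XML tags.
    global $urlOpen, $urlClose, $locOpen, $locClose, $lastmodOpen,
           $lastmodClose, $changefreqOpen, $changefreqClose,
           $priorityOpen, $priorityClose, $isoLastModifiedSite;

    $urlTag = $urlOpen;
    // the page location is the only mandatory part of a <url> entry
    $urlValue = $locOpen .makeUrlString("$url") .$locClose;
    if ($modifiedDateTime) {
        $urlValue .= $lastmodOpen
            .makeIso8601TimeStamp($modifiedDateTime) .$lastmodClose;
        if (!$isoLastModifiedSite) { // last modification of web site
            $isoLastModifiedSite =
                makeIso8601TimeStamp($modifiedDateTime);
        }
    }
    if ($changeFrequency) {
        $urlValue .= $changefreqOpen .$changeFrequency
            .$changefreqClose;
    }
    if ($priority) {
        $urlValue .= $priorityOpen .$priority .$priorityClose;
    }
    $urlTag .= $urlValue;
    $urlTag .= $urlClose;
    return $urlTag;
}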
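The function above relies on a set of tag strings and on two helpers, makeUrlString() and makeIso8601TimeStamp(), whose definitions are not shown. The following is a minimal sketch of what they might look like, not the original code: the tag strings, the escaping strategy and the assumption that database time stamps are stored as 'YYYY-MM-DD HH:MM:SS' in UTC are all ours.

```php
<?php
// Assumed tag fragments; tabs and line breaks are purely cosmetic.
$urlOpen         = "\t<url>\n";
$urlClose        = "\t</url>\n";
$locOpen         = "\t\t<loc>";
$locClose        = "</loc>\n";
$lastmodOpen     = "\t\t<lastmod>";
$lastmodClose    = "</lastmod>\n";
$changefreqOpen  = "\t\t<changefreq>";
$changefreqClose = "</changefreq>\n";
$priorityOpen    = "\t\t<priority>";
$priorityClose   = "</priority>\n";

// Escape the characters that must not appear verbatim in XML text,
// most importantly the ampersand in query strings.
function makeUrlString($urlString) {
    return htmlspecialchars($urlString, ENT_QUOTES, 'UTF-8');
}

// Turn 'YYYY-MM-DD HH:MM:SS' (assumed to be UTC) into an ISO 8601 stamp.
// Date-only input and already converted stamps pass through unchanged,
// so the result can safely be fed into the function a second time.
function makeIso8601TimeStamp($dateTime) {
    if (!$dateTime) {
        $dateTime = gmdate('Y-m-d H:i:s');
    }
    $date = substr($dateTime, 0, 10);
    $time = substr($dateTime, 11, 8);
    if (strlen($time) === 8 && is_numeric($time[0])) {
        return $date . 'T' . $time . '+00:00';
    }
    return $date;
}
```

If your database stores local time, replace the fixed '+00:00' offset with the offset of your server's time zone.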
Now fetch the URLs from your database. It's a good idea to have a
boolean attribute to exclude particular pages from the sitemap.
Also, you should have an indexed date-time attribute storing the
last modification. Your content management system should enable the
attributes ChangeFrequency, Priority,
PageInSitemap and perhaps even LastModified on the
user interface. Example query: "SELECT pageUrl, pageLastModified,
pagePriority, pageChangeFrequency from pages WHERE pages.pageSiteMap
= 1 AND pages.pageActive = 1 AND pages.pageOffsite <> 1 ORDER BY
pages.pageLastModified DESC". Loop:
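A loop over the query result could look roughly like the sketch below. To keep it runnable without a database, an array stands in for the result set, and a compact stand-in for makeUrlTag() is defined only if the real function isn't loaded. The column names follow the example query; the assumption that pageUrl holds a path relative to $rootUrl is ours.

```php
<?php
// Compact stand-in so this sketch runs on its own. It is skipped when the
// tutorial's own makeUrlTag() has already been loaded.
if (!function_exists('makeUrlTag')) {
    function makeUrlTag($url, $modifiedDateTime, $changeFrequency, $priority) {
        $tag = "\t<url>\n\t\t<loc>"
             . htmlspecialchars($url, ENT_QUOTES, 'UTF-8') . "</loc>\n";
        if ($modifiedDateTime) {
            $tag .= "\t\t<lastmod>" . substr($modifiedDateTime, 0, 10)
                  . "</lastmod>\n";
        }
        if ($changeFrequency) {
            $tag .= "\t\t<changefreq>$changeFrequency</changefreq>\n";
        }
        if ($priority) {
            $tag .= "\t\t<priority>$priority</priority>\n";
        }
        return $tag . "\t</url>\n";
    }
}

// Stand-in for the rows returned by the example query, newest page first.
// With a real database you would fetch these rows one by one instead.
$rows = array(
    array('pageUrl'             => '/articles/sitemaps.htm',
          'pageLastModified'    => '2005-06-04 21:41:00',
          'pagePriority'        => '0.8',
          'pageChangeFrequency' => 'weekly'),
    array('pageUrl'             => '/about.htm',
          'pageLastModified'    => '2005-01-15 09:00:00',
          'pagePriority'        => '',
          'pageChangeFrequency' => ''),
);

$rootUrl     = 'http://www.example.com'; // assumed site root, no trailing slash
$urlsetValue = '';
foreach ($rows as $row) {
    // Empty attributes are simply left out of the <url> entry.
    $urlsetValue .= makeUrlTag(
        $rootUrl . $row['pageUrl'],
        $row['pageLastModified'],
        $row['pageChangeFrequency'],
        $row['pagePriority']
    );
}
```

Because the query sorts by last modification in descending order, the first row processed carries the most recent change of the whole site.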
After the loop you can add a few templated pages/scripts, not stored
as content pages, which change on each page modification or not:
if (!$isoLastModifiedSite) { // last modification of web site
    $isoLastModifiedSite = makeIso8601TimeStamp(date('Y-m-d H:i:s'));
}
$urlsetValue .= makeUrlTag ("$rootUrl/what-is-new.htm",
    $isoLastModifiedSite, "daily", "1.0");
Now write the complete XML. Dealing with a larger amount of pages,
you should print the <url> tag on each iteration followed by a
flush(). If you publish tens of thousands of pages, you should
provide multiple sitemaps and a
sitemap index. Each sitemap file that you provide must have no
more than 50,000 URLs and must be no larger than 10MB.
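For larger sites, printing each entry as it is produced might be sketched as follows. writeSitemap() and the row keys 'loc' and 'lastmod' are hypothetical names chosen for this illustration, and the XML namespace is passed in as a parameter because it depends on the protocol version you target.

```php
<?php
// Stream a sitemap: XML header first, then one <url> entry per row,
// then the closing tag. $rows stands in for a database result set.
function writeSitemap(array $rows, $namespace) {
    if (!headers_sent()) {
        header('Content-Type: text/xml; charset=UTF-8');
    }
    echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    echo '<urlset xmlns="' . $namespace . '">' . "\n";
    foreach ($rows as $row) {
        echo "\t<url>\n";
        echo "\t\t<loc>"
           . htmlspecialchars($row['loc'], ENT_QUOTES, 'UTF-8')
           . "</loc>\n";
        if (!empty($row['lastmod'])) {
            echo "\t\t<lastmod>" . $row['lastmod'] . "</lastmod>\n";
        }
        echo "\t</url>\n";
        // Hand each entry to the web server right away instead of
        // collecting the whole document in memory.
        flush();
    }
    echo "</urlset>\n";
}
```

As far as we know, Google's 2005 protocol used the namespace http://www.google.com/schemas/sitemap/0.84, while the later sitemaps.org protocol uses http://www.sitemaps.org/schemas/sitemap/0.9. Check which one applies before calling writeSitemap($rows, $namespace).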
Google will process all <url> entries where the URL begins with the
URL of the sitemap file. If your website is distributed over many
domains, provide sitemaps per domain. Subdomains and the 'www
prefix' are treated as separate domains. URLs like
'http://www.domain.us/page' are not valid in a sitemap located on
'http://domain.us/'. The script's output should be a <urlset>
element containing one <url> entry per page.
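As an illustration of such output, here is a hand-written sample with two entries. The URLs and dates are placeholders, and the namespace shown is the one we believe Google's 2005 protocol used; treat it as an assumption and check the protocol documentation for the version you target.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
	<url>
		<loc>http://www.example.com/what-is-new.htm</loc>
		<lastmod>2005-06-04T21:41:00+00:00</lastmod>
		<changefreq>daily</changefreq>
		<priority>1.0</priority>
	</url>
	<url>
		<loc>http://www.example.com/catalog.php?item=12&amp;view=full</loc>
	</url>
</urlset>
```

The second entry shows that a <url> tag containing the location alone is valid, and that an ampersand in a query string must be written as &amp;amp; in the XML source.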
Feel free to use and customize the code above. If you do so, put
this comment into each source code file containing our stuff:
COPYRIGHT (C) 2005 BY SMART-IT-CONSULTING.COM
* Do not remove this header
* This program is provided AS IS
* Use this program at your own risk
* Don't publish this code, link to
http://www.smart-it-consulting.com/ instead
1
On large sites it may be a good idea to run
the script querying the database on another machine to avoid
web server slow downs. Also, using the sitemap index file
creatively can help: reserve one or more dynamic sitemap
files for fresh content and provide static sitemaps, updated
weekly or so, containing all URLs. The sitemap tag
of the sitemap index offers a lastmod tag to tell
Google which sitemaps were modified since the last download.
Use this tag to avoid downloads of unchanged static
sitemaps.
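The idea of combining dynamic and static sitemaps via an index file might be sketched like this. makeSitemapIndex() and the array keys are hypothetical names for this illustration, and the namespace is again a parameter because it depends on the protocol version.

```php
<?php
// Build a sitemap index from a list of sitemap files.
// Each entry: array('loc' => absolute URL, 'lastmod' => ISO 8601 stamp).
// The 'lastmod' key is optional and left out of the XML when empty.
function makeSitemapIndex(array $sitemaps, $namespace) {
    $xml  = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    $xml .= '<sitemapindex xmlns="' . $namespace . '">' . "\n";
    foreach ($sitemaps as $sitemap) {
        $xml .= "\t<sitemap>\n";
        $xml .= "\t\t<loc>"
              . htmlspecialchars($sitemap['loc'], ENT_QUOTES, 'UTF-8')
              . "</loc>\n";
        if (!empty($sitemap['lastmod'])) {
            $xml .= "\t\t<lastmod>" . $sitemap['lastmod'] . "</lastmod>\n";
        }
        $xml .= "\t</sitemap>\n";
    }
    return $xml . "</sitemapindex>\n";
}
```

A site could list one dynamic sitemap for fresh content with a current lastmod value, plus several static ones whose lastmod only changes when the weekly job has rewritten them.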
Ask Googlebot to Crawl New and Modified Pages on Your Web Site
Create a
Google Account, then go to the
Google SiteMap Submit Page. Enter your sitemap URL and wait for
the first download displayed on the
stats page. If the status is not 'Ok', correct the errors and
resubmit your sitemap until it's approved. Bookmark the stats page
and check back every once in a while (and after script changes!) to
track Googlebot's usage of your sitemap.
You don't need to resubmit your sitemap manually. Being a smart
webmaster, you'll automate the resubmits. The easiest way to
automate sitemap resubmits to Google is to trigger an HTTP request
whenever released content pages change. After updating your
database, call a function to ping Google. Since your dynamic sitemap
file is always up to date, you don't need to do more. A PHP example:
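The following is a minimal sketch of such a ping function, not the original listing. The function names are ours, and the ping endpoint is passed in as a parameter because it is an assumption: we believe Google documented http://www.google.com/webmasters/sitemaps/ping at launch, but verify that the address still exists before relying on it. The sketch uses PHP's get_headers(), which needs allow_url_fopen to be enabled.

```php
<?php
// Build the ping URL: the sitemap location goes URL-encoded into the
// 'sitemap' query parameter of the ping endpoint.
function makePingUrl($sitemapUrl, $pingEndpoint) {
    return $pingEndpoint . '?sitemap=' . urlencode($sitemapUrl);
}

// Send the ping and return the response headers joined into one string,
// or "ERROR" when the request could not be completed.
function pingGoogleSitemaps($sitemapUrl, $pingEndpoint) {
    $headers = @get_headers(makePingUrl($sitemapUrl, $pingEndpoint));
    if ($headers === false) {
        return 'ERROR';
    }
    return implode(' ', $headers);
}

// A response counts as successful when it contains the "200 OK" status.
function pingSucceeded($response) {
    return strpos($response, '200 OK') !== false;
}
```

Call pingGoogleSitemaps() right after the database update that releases a page, and log the response whenever pingSucceeded() returns false.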
This function returns something like "HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8 Content-Language: en
Cache-control: private Content-Length: 0 Date: Sat, 04 Jun 2005
21:41:00 GMT Server: GFE/1.3" or "ERROR" on failure. If the string
doesn't contain the return code "200 OK" something is fishy.
Resubmits via ping don't appear in your account's sitemap stats.
If your content changes frequently, you should set up a
cron job pinging Google twice a working day or so, instead of
bothering Google with a ping on each record change.
Before
Yahoo's Site Explorer goes live, Google provides advanced
statistics in the sitemap program. The 'lack of stats' has produced
hundreds of confused posts in the
Google Groups. When Google Sitemaps was announced on June 2, 2005,
Shiva Shivakumar stated "We are starting with some basic
reporting, showing the last time you've submitted a Sitemap and when
we last fetched it. We hope to enhance reporting over time, as we
understand what the webmasters will benefit from". Google's
Sitemaps team closely monitored the issues and questions brought up
by the webmasters, and since August 30, 2005 there are enhanced
stats. Here is how it works.
Google provides crawling reports for sites, where the sitemap was
submitted via a
Google account. To view the reports, a site's ownership must be
verified first. On the
Sitemap Stats Page each listed sitemap has a verify
link (which changes to a stats link after verification). On
the verification form Google provides a case sensitive
unique file name, which is assigned to the current Google account
and the sitemap's location.
Uploading and submitting an empty verification file with this name
per domain tells Google that it is indeed the site owner who
requests access to the site's crawler stats. If a sitemap is located
in the root directory, the verification file must be served from the
root too (detailed
instructions here). Do not delete the verification file after
the verification procedure, because Google checks for its existence
periodically. If you delete this file, you have to verify your site
again.
That means, for example, that bloggers on free hosts who submit
their RSS feed as a sitemap don't get access to their stats, because
they can't upload files to their subdomain's root directory. At the
time of writing, Google doesn't provide full stats for
mobile sitemaps [1];
that is, Google Sitemaps restricted to content made for WAP
devices (cell phones, PDAs and other handheld devices) get basic
reports only.
In the first example, crawling stats get enabled for all URIs in and
below the domain's root. In the second example, crawling stats get
enabled for all URIs in and below /directory/, but not in upper
levels. By the way, the crawler reports provide information on URIs
spidered from sitemaps and URIs found during
regular crawls by following links, regardless whether the URI is
listed in a sitemap or not.
Sounds pretty easy, but there is a pitfall. For security reasons,
Google will not verify a location if the web server's response to
requests for nonexistent pages is not a 404 [2]
(the error message says "We've detected that your 404 (file not
found) error page returns a status of 200 (OK) in the header.",
but it occurs on redirects too, e.g. 302). For example, if a site
makes use of customized error documents referenced by a full URL,
the HTTP response code is not 404 as required. The server redirects
to the custom error page and sends a 302 header, which cannot be
overwritten by the error page itself. Here is a .htaccess example:
ErrorDocument 404 http://www.domain.com/err.htm?errno=404
[302, doesn't work for verification purposes]
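To check your own server before starting the verification, you could request a page that certainly doesn't exist and look at the first status line of the response. The sketch below only contains the evaluation logic; statusCodeOf() and passes404Check() are hypothetical names, and the header array is expected in the format returned by PHP's get_headers(). On Apache, pointing ErrorDocument at a local path (for example /err.htm) instead of a full URL normally keeps the 404 status intact.

```php
<?php
// Extract the numeric status code from a status line such as
// "HTTP/1.1 302 Found"; returns 0 when the line cannot be parsed.
function statusCodeOf($statusLine) {
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $statusLine, $match)) {
        return (int) $match[1];
    }
    return 0;
}

// True only when the FIRST response to a request for a missing page is
// a 404. A redirect that later ends in a 404 or 200 does not count,
// which mirrors the behaviour described above.
function passes404Check($headers) {
    return is_array($headers)
        && isset($headers[0])
        && statusCodeOf($headers[0]) === 404;
}
```

A possible call is passes404Check(get_headers('http://www.example.com/no-such-page-12345.html')), with your own domain and any file name that is guaranteed not to exist.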
Hint: the verification process is a one time thingy
per domain.
Once the verification process is finished, the crawler stats (or
better, problem reports) are accessible from the sitemaps
stats page. At the moment those reports show all kinds of errors per
URI, but they say nothing about successful fetches. Because every
error status is linked to an explanation, this tool makes it pretty
easy to fix the issues [3].
I could think of enhancements though:
The error 'HTTP Error' doesn't tell the error code, it's linked to
the 'page not found' FAQ entry. However, 'HTTP Error' occurs on all
sorts of problems, for example crawling of URIs in password
protected areas, harvested from the toolbar or linked from outside.
Providing the HTTP response code and the date and time of crawling
as well would simplify debugging.
In case of invalid URIs found on foreign pages, it would be
extremely helpful to know which page contains the broken link.
Firing off an email to the other site's webmaster would make
everyone happy, including Google.
Well, probably I'm greedy. Google's crawler report
is a great tool, kudos to the sitemaps team! In combination with my
spider tracker,
sitemap generator and some other tools I've everything I need to
monitor and support the crawl process.
1
A mobile sitemap is a standard Google
compliant sitemap, populated with URIs of WAP pages of one
particular markup language (XHTML, WML...), submitted via
another form. Currently accepted markup languages are
XHTML mobile profile (WAP 2.0), WML (WAP 1.2)
and cHTML (iMode).
2
To check a site's 404 handling, Google
requests randomly generated files like /GOOGLE404probee4736a7e0e55f592.html
3
If you don't fix all reported issues with
URIs on your site, you'll miss out on some traffic at least.
So track down the errors. If you find broken links on your
pages, correct them. If you submit invalid URIs via sitemap,
change it. A few errors will stay in the section listing
unreachable URIs found during the regular crawl process. Try
to find the source of each invalid inbound link in your
referrer stats, 404 logs and such, and write the other
Webmaster a polite letter asking to edit the broken link. If
you can't track down the source, guess the inbound link's
target as best as you can. Then put up a simple script under
the invalid URI doing a permanent (301!) redirect, pointing
to the page on your site which is/could be the link's
destination. This way you don't waste any traffic nor the
ranking boost earned from those inbounds.
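Such a redirect script might be sketched as follows. The map of broken paths and both function names are hypothetical; the only PHP feature relied on is header() with an explicit 301 response code.

```php
<?php
// Look up the best-guess destination for a broken inbound path.
// Returns null when the path is not in the map.
function redirectTargetFor($requestPath, array $redirectMap) {
    return isset($redirectMap[$requestPath])
        ? $redirectMap[$requestPath]
        : null;
}

// Send a permanent (301) redirect and stop the script.
function sendPermanentRedirect($targetUrl) {
    header('Location: ' . $targetUrl, true, 301);
    exit;
}
```

A script placed under the invalid URI would call redirectTargetFor() with the requested path and a map such as array('/old-articel.htm' => 'http://www.example.com/old-article.htm'), then pass a non-null result to sendPermanentRedirect().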
A few weeks after Google's launch of SiteMaps, more
and more webmasters complain about their sites disappearing from
Google's index shortly after a sitemap submission. Did Google trick
innocent newbies and not so savvy webmasters into a very smart (but,
being a beta version, still erroneous) spammer and scraper trap?
Tired of countless attempts to abuse its search services, the
Google empire strikes back! Seriously, Google launched SiteMaps to
explore the 'hidden web', and to learn more about web site
structures, including widely used 'helpers' like feeder pages and
similar stuff.
Danny Sullivan asked Shiva Shivakumar, engineering director and
the technical lead for Google SiteMaps, "How will you prevent
people from using this to spam the index in bulk?" He said "We
are always developing new techniques to manage index spam. All those
techniques will continue to apply with the Google Sitemaps."
Analyzing a few of the disappeared web sites and their sitemaps, it
seems that the causes for disappearing from Google's index can
be found in Shiva Shivakumar's answer.
A few examples cannot prove, that there is no bug causing removals
of clean web sites in Google's new service. But before webmasters
complain, they should be sure that their sites do comply to
Google's guidelines. Shit happens. Even experienced webmasters
can fail.
One circumstance commonly applies to web sites wiped out from
Google's index after sitemap based deep crawls. The sitemaps were
generated by foreign tools, which spider for links and/or collect
URLs from the web server's file system. With large sitemaps, human
reviews are limited, especially if the file names don't follow human
readable naming conventions, and/or query strings are insignificant.
Google applies spam filters on unintentional spider food supplied in
sitemaps too. Some scenarios of unintentional cheating:
Huge assorted links pages from spider traps, which were very
popular in 1999/2000, were not deleted on the web server. The
webmaster has only removed the links from the home page.
A developer playing with a vendor's data feed many months ago
has generated zillions of interlinked product pages in a forgotten
directory, all linking to the domain's home page, which Google sees
as doorway pages.
A formerly spamming site was completely revamped and reindexed
on a reinclusion request. The webmaster switched the HTML file name
extension from .html to .htm in his development tool, kept the
directory structure, and forgot to delete the old stuff on the web
server. Unfortunately the sitemap generator submitted the spammy
.html files, packed with hidden links and invisible text.
A bunch of rarely crawled printer friendly pages without a
robots NOINDEX meta tag gets submitted via sitemap. The primary
versions of these pages were well ranked, caused by lots of deep
inbound links from other sites. For some odd reason the duplicate
content filter likes the more or less unlinked printer friendly
pages better. Those cannot be found by
site:+unique-word-appearing-in-every-bottom-line searches, because
the printer friendly pages lack a bottom line containing the search
term.
From reading the boards and feedback to this
tutorial, we'd like to add a few items to
Google's facts and fiction listing:
Fiction: Assigning a high priority in a Google SiteMap
increases the URL's PageRank™.
Fact: PageRank™ calculations have nothing to do with
sitemap priority. A high priority simply means that Googlebot will
possibly crawl that URL before low-priority pages.
Fiction: According to Google's TOS, commercial sites cannot
participate in the sitemap program.
Fact: Every web site can submit sitemaps to Google. You
don't need a Google account to participate. Even webmasters of
commercial sites may use their Google accounts to track sitemap
downloads.
Fiction: Participating web sites must have Python 2.2
installed on their web servers.
Fact: Only Google's free sitemap generator requires
Python. You can use whatever you have to create and submit the
sitemaps. Even notepad and a web browser will do the job, the
sitemap protocol is that simple.
Fiction: Google penalizes web sites for frequent submissions.
Fact: There are no such penalties. Google encourages sitemap
submissions on content changes. However, if your content changes
every minute, you should go for a reasonable submission frequency.
Static websites need another solution to
generate a valid Google SiteMaps XML file. Unfortunately, many
webmasters cannot use the free sitemap generator provided by Google
for various reasons. Not even a week after Google's announcement,
all search engine marketing related forums, blogs and usenet groups
provide links to more or less useful Google Sitemap tools. There is
a lot of crap floating around, thus we tried to collect a few
'nuggets'. We didn't evaluate the tools listed below, and we cannot
vouch for them, but they seem to be pretty decent. We don't link to
tools without positive webmaster feedback on the boards.
Do not download files if your client machine lacks
suitable protection, and never download executables.
Google SiteMaps Pal is an online service generating the
sitemap.xml file containing a maximum of 100 URLs, spidered from
a submitted URL.
Google SiteMap XML Validator is an online service validating
the XML structure of your Google SiteMap. It can submit your
sitemap.xml file to Google, if you don't want to use your Google
account.
Node Map is a web site packed with information on Google
SiteMaps, including tools for generation and validation of
sitemap XML files.
phpSitemap is a PHP script compiling the XML file from the
web server's file system, it does not (yet) spider a web site to
include dynamic links.
Simple Sitemaps is a PHP script generating a dynamic Google
XML Sitemap plus a pseudo-static HTML Site Map and a
RSS 2.0 site feed from a simple text file. Simple Sitemaps is
suitable for smaller Web sites with no more than 100 pages.
If you operate a forum, blog or similar kind of web site
based on foreign software, chances are good the software vendor
supplies a sitemap generator. Visit the vendor's web site before you
implement a hack.
Google Sitemaps Explained - How To Use Google Sitemaps
Three Ways To Index Your Site With Google Sitemaps
[Difficult, Hard, And Easy]
Google has recently implemented a program where any webmaster can
create a Sitemap of their Site and submit it for indexing by Google.
It is a quick and easy way for you to keep your site constantly
indexed and updated in Google.
The program is appropriately called Google Sitemaps.
In order for you to best use Sitemaps, you must have an XML
generated file on your site that will transmit or send any updates,
changes, and data to Google. XML (Extensible Markup Language) is
everywhere these days. You have probably seen the orange XML logo on
many web sites, and it's often associated with blogging because
blogs use XML/RSS feeds to syndicate their content.
Today RSS is known mostly as 'Really Simple Syndication', but the
acronym originally stood for 'Rich Site Summary'. XML is just simple
code like HTML, and it is used to syndicate your content to
all interested parties.
And the interested party in this case is Google. By creating
Sitemaps Google is really asking webmasters to take charge of the
indexing and updating of their sites. Basically, doing the
Googlebot's job!
This is a 'Good' thing! With the steady influx of new web sites
growing rapidly, indexing all this material will become a challenge,
even with the resources of Google. With Sitemaps, webmasters can
now take charge and make sure their site is crawled and indexed.
Please note, indexing your site with Sitemaps WON'T
improve your rankings in Google. You will still be competing with
the other sites in Google for top positions. But with Sitemaps you
can make sure all your pages are crawled and indexed quickly by
Google.
There are some other big advantages of using Google's Sitemaps -
mainly you have control over a few key variables, attributes or
tags. To explain this as simply as possible, your XML powered
sitemap file will have this simple code for each page of your site:
Along with 'urlset' tags at the beginning and end of your code,
and an XML version indication - that's basically your XML file! File
size will depend on the number of webpages you have.
Taking a closer look at this XML file:
location - http://www.yoursite.com - name of your webpage
priority - you set the priority you want Google to place
on that page in your site. You can prioritize your pages: 0.0 being
the least, 1.0 being the highest, 0.5 is in the middle. This is
ONLY relative to your site. It will not affect your rankings.
Why is this important? You have certain pages on your site that are
more important than others (home page, high profit page, opt-in
page, etc.). By placing a high priority on these pages, you tell
Google which of your pages you would like crawled first. As noted
above, this does not raise their rankings.
last modified - when you last modified that page, this
timestamp allows crawlers to avoid recrawling pages that haven't
changed.
change frequency - you can tell Google how often you
change that particular page. Never, weekly, daily, hourly, and so on
- if you frequently update your page this could be extremely
important.
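To make the four tags concrete, here is a minimal Python sketch that writes such a file. It is only an illustration, not Google's own generator: the function name build_sitemap and the sample pages are made up, the tag names (loc, lastmod, changefreq, priority) follow the sitemap protocol, and the namespace shown is the one used by the current sitemaps.org protocol, which is an assumption here (the 2005 beta used a Google-hosted schema URL).

```python
import xml.etree.ElementTree as ET

# Assumption: namespace of the current sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(pages):
    """Return sitemap XML for a list of page dicts (hypothetical helper)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page in pages:
        url = ET.SubElement(urlset, "url")
        # 'loc' is required; the other three tags are optional.
        ET.SubElement(url, "loc").text = page["loc"]
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in page:
                ET.SubElement(url, tag).text = str(page[tag])
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# Sample data, invented for illustration only.
sitemap_xml = build_sitemap([
    {"loc": "http://www.yoursite.com/", "lastmod": "2005-07-27",
     "changefreq": "daily", "priority": 1.0},
    {"loc": "http://www.yoursite.com/archive.html",
     "changefreq": "never", "priority": 0.1},
])
print(sitemap_xml)
```

Save the printed text as sitemap.xml and you have the kind of file the generators described here produce for you.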
Why do I need an XML Generator?
In order for this XML sitemap file on your site to be constantly
updated, you need a Generator that will spider your site, list all
the URLs and automatically feed them to Google, thus constantly
updating your site in Google's massive index or database. Keep in
mind, Google also gives you the option of submitting a simple
text file with all your URLs.
Now there is already a flood of these generators popping up,
offering different ways of generating your XML-powered sitemap file. More are
probably appearing as you read this. For your convenience, three
ways to generate your XML Sitemaps file are listed below:
Difficult - Google's Python Generator
That's a relative term. If you know your server like the back of
your hand and installing scripts doesn't scare the bejesus out of
you, you're probably smiling at the word difficult. Google supplies
a link to a generator which you can download and set up on your
server. It will cough up your sitemap XML file and automatically
feed it to Google.
Google XML Generator
In order for this Generator to work, Python version 2.2 must be
installed on your web server - many servers don't have this. If you
know what you're doing, this will probably be a good choice.
You don't need a Google Account to use Sitemaps, but it's
encouraged because you can track your sitemap's progress and view
diagnostic information. If you already have another Google Account
(Gmail, Google Alerts, etc.), just use that one to sign in and follow
the directions from there.
To submit your Sitemap using an HTTP request, issue your request
to the following URL:
This is a PHP generator that you can place on your server. This
generator will spider your site and produce your XML sitemap file.
Download phpSitemapNG and upload it to your server. Run the
generator to get your XML sitemap file and send it to Google.
PHP Generator
Again, this is only hard to do if you don't know your way around
PHP files or scripts.
Easy - Free Online Generator
These Generators are popping up everywhere, and Google now keeps
a list of these 'third party suppliers' of generators on their site.
Find them here:
Google's List of Third Party Generators
One of the easiest to use is
www.xm-sitemaps.com. You can index up to 500 pages with this
online Generator very quickly, and it will give you the sitemap XML
file Google needs to index your site. It will go into your site,
spider it and list all your pages in an XML sitemap of your site.
You can download this file, compressed or uncompressed, and make
minor changes such as setting the priority, change frequency, etc.
Then upload this file as sitemap.xml to the root
directory of your server, i.e. where you have your homepage. Then
notify Google Sitemaps of your XML file and you're in business.
Of course, there is one drawback: if you constantly add pages
to your site, you will also need to add these pages to your XML
sitemap file. This won't be much of a problem unless you're adding
pages to your site daily - then you will need something like the
PHP or Python generator to do all this for you automatically.
Google is still the major search engine on the web so getting
your pages indexed and updated quickly is the major reason to use
Google Sitemaps. If you want your site to remain competitive it's
probably the wisest route to take.
Google Sitemap Generator now supports up to 50,000 pages!
Updates and bug fixes found
here.
NOTE: You should see a graphic
above that says 'AuditMyPC.com Sitemap Builder'; If you do not
see the graphic, please check to make sure your browser or
firewall is not blocking ActiveX or Java and that you are using
the latest version of Java, found at
http://java.com/en/download/index.jsp
Feel free to contact me at
network-security-2@auditmypc.net
Why the Sitemap Generator
Google Sitemaps is a service that allows webmasters
to submit an XML map of their site containing information such
as change frequency, priority, date created and more. Google
suggests a great sitemap program to create the site map, but you
must have Python in order for it to run. For further reading on
Google's program, visit
Google Sitemaps FAQ.
I needed a way to accomplish this without installing Python
or extra software on my server and Java was the perfect
solution!
Free Site Map Builder
This free and easy to use site map builder is fast,
efficient and allows you to build a sitemap that can instantly
be submitted to Google. The only requirement is a popular
browser that supports java, such as Internet Explorer, Firefox
and others.
Sitemap Generator Overview
It's simple: just enter the website you would like to
create a site map from and click 'start crawling'. The
program will generate a sitemap using the default values, such
as a thread count of 5, include all files, etc. Care has been taken
to make sure the crawler does not follow external links :)
Sitemap Generator Details
This site map tool is loaded with tons of features and
consists of three dialogs:
Sitemap Options - these options allow you to specify
crawling behavior.
URL - Enter the website address you wish to create a
google sitemap from, such as
http://www.popupcheck.com for example. The address must
be in HTML format; the crawler will follow all links without
leaving the root server and will NOT follow external
links.
Include Filter - This is a list of path patterns,
asterisk (*) wildcard supported, case insensitive. When
the sitemap generator / crawler is about to process a
location, it is validated against all inclusion patterns.
If none match, the location will not be
processed and will not be included in the sitemap. If you
leave this area empty, it is assumed that you want
to include everything, so all locations are
processed (provided they conform to the rules listed
below).
Example:
/dir1/* - process all files located below (recursively)
directory "dir1"
*.html - process all HTML files at any location
Exclude Filter - Patterns included here will be
excluded from processing. When the sitemap generator
crawls a website, each location is validated against the
exclusion patterns. If any pattern matches, the location
will not be processed or included in the Google sitemap.
Load From - Provides you with the option to process
files from the entire server or only below the initial
directory which is specified by the URL parameter. For
example, if the URL parameter is a server address, this
option does not affect the behavior of the google
sitemap generator. However, if you enter a directory,
say for example
http://www.popupcheck.com/news/index.html, only files
below the /news directory will be processed, including any
subdirectories.
Thread Count - The number of simultaneous crawling
threads to run when creating the Google site map. Specifying
a large number of threads may significantly decrease
overall crawling time but will increase
bandwidth usage - so use with caution or just run with
the default.
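The Include and Exclude filter rules described above can be sketched in a few lines of Python. This is not the applet's actual code, just an illustration of the stated behavior: pattern_matches and should_process are made-up names, and the sketch assumes patterns are matched against the whole path, with '*' standing for any run of characters.

```python
import re

def pattern_matches(pattern, location):
    """True if location matches a path pattern where '*' means 'anything'."""
    # Escape the pattern literally, then turn each escaped '*' into '.*'.
    regex = re.escape(pattern).replace(r"\*", ".*")
    # Patterns are case insensitive, as the option description says.
    return re.fullmatch(regex, location, flags=re.IGNORECASE) is not None

def should_process(location, include_patterns, exclude_patterns):
    """Apply the include filter first, then the exclude filter."""
    # An empty include list means "include everything".
    if include_patterns and not any(
        pattern_matches(p, location) for p in include_patterns
    ):
        return False
    # Any matching exclusion pattern removes the location.
    return not any(pattern_matches(p, location) for p in exclude_patterns)

print(should_process("/dir1/sub/page.php", ["/dir1/*"], []))   # True
print(should_process("/news/Index.HTML", ["*.html"], []))      # True
print(should_process("/news/old.html", [], ["/news/*"]))       # False
```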
Crawling - Sitemap Generation is activated by pressing
the 'start crawling' button
Once you click on the button, crawling will begin
and you'll be presented with status indicators for
thread status, uri, values and more. All parameters are
self explanatory and 'Finished' will appear once
crawling is completed. You may stop the sitemap
generation at any time by pressing the 'cancel'
button.
Sitemap
The sitemap tab contains all the locations / files
that have been crawled.
You have the option to edit the 'Modified', 'Change
frequency' and 'Priority' cells for each row (or all
rows - we'll get to that in a moment).
Modified - This is the date the document was last
modified and uses the following formats:
dd.mm.yyyy
dd.mm.yyyy hh:mm
dd/mm/yyyy
dd/mm/yyyy hh:mm
To assign the same date for a group of selected cells,
simply enter the date modified once the cells are
highlighted. To select multiple cells, use SHIFT or CTRL
along with mouse or cursor key.
To assign the same date for every cell, simply click on
the column header and enter the date; all cells in same
column will receive the new value.
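If you ever need to handle these day-first date formats outside the tool, the following Python sketch parses all four of them. The names parse_modified and to_lastmod are hypothetical, and the sketch assumes you want the date-only YYYY-MM-DD form for the sitemap's last-modified value; how the tool itself converts the dates is not documented here.

```python
from datetime import datetime

# The four input formats listed above, day first.
MODIFIED_FORMATS = (
    "%d.%m.%Y %H:%M",
    "%d.%m.%Y",
    "%d/%m/%Y %H:%M",
    "%d/%m/%Y",
)

def parse_modified(text):
    """Parse a 'Modified' cell value into a datetime (hypothetical helper)."""
    for fmt in MODIFIED_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue  # try the next format
    raise ValueError("unrecognised date: %r" % text)

def to_lastmod(text):
    """Return the date part as YYYY-MM-DD (assumed output form)."""
    return parse_modified(text).strftime("%Y-%m-%d")

print(to_lastmod("27.07.2005"))        # 2005-07-27
print(to_lastmod("08/09/2005 14:30"))  # 2005-09-08
```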
Change frequency - Tells Google Sitemaps
how frequently the content of a particular URL is likely to
change. Your options are "always", "hourly", "daily",
"weekly", "monthly", "yearly" or "never". The value
"always" should be used to describe documents that
change each time they are accessed. The value "never"
should be used to describe archived URLs.
Priority - The priority of a particular URL relative
to other pages on your site. You may select between 0.0
and 1.0, where 0.0 identifies the lowest priority
page(s) on your website and 1.0 identifies the highest
priority page(s) on your website.
How To Create Google Sitemap
To create a Google Sitemap, simply press the
'save' button and a Google-compliant Sitemap will be created and
saved to your local computer. Google sitemaps can be loaded from a
file by clicking the 'open' button (the format is auto detected).
You may switch between PLAIN or XML formats. Please note,
that PLAIN format doesn't support "Modified", "Change frequency"
and "Priority" attributes.
Submit your Sitemap to Google
Once you have created the google sitemap file, place it on
your server and then visit
Google Sitemaps to submit your map.
How do I Resubmit my Sitemaps?
You can resubmit by logging into your Google Sitemap
account and clicking Resubmit or by sending
Google an HTTP request, also known as 'pinging Google'.
To resubmit your Sitemap using an HTTP request, you simply
type this URL in your browser:
http://www.google.com/webmasters/sitemaps/ping?sitemap=sitemap_url
As an example, if your Google Site Map is
located at http://www.auditmypc.com/sitemap.htm, you'll enter
the following website address into your browser:
www.google.com/webmasters/sitemaps/ping?sitemap=http://www.auditmypc.com/sitemap.htm
But you'll need to encode the URL first!
Encode your Google Sitemap
You must URL Encode everything after the /ping?sitemap=
So, you'll actually enter:
www.google.com/webmasters/sitemaps/ping?sitemap=http%3A%2F%2Fwww.auditmypc.com%2Fsitemap.htm
Notice how the ':' became '%3A' and each '/' became '%2F'. Sounds
confusing? No problem, simply use the URL Encoder on this page.
In summary, take the address of your sitemap, enter it into
the 'Site Map Location' area above and press the 'Encode URL'
button. Append the encoded result to the end of the Google sitemap
URL like this:
http://www.google.com/webmasters/sitemaps/ping?sitemap=[encoded
result]
To decode a Google Site Map address, simply enter the encoded
URL into the 'Encoded Result' area and press Decode URL.
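If you would rather script the encoding step, Python's standard library can do it. This is a small sketch, not part of the Sitemap Generator: build_ping_url is a made-up helper name, and the code only builds the ping address shown above without sending any request.

```python
from urllib.parse import quote, unquote

# The ping address quoted in the instructions above.
PING_ENDPOINT = "http://www.google.com/webmasters/sitemaps/ping?sitemap="

def build_ping_url(sitemap_url):
    """URL-encode the sitemap address and append it to the ping endpoint."""
    # safe="" forces ':' and '/' to be encoded as %3A and %2F.
    return PING_ENDPOINT + quote(sitemap_url, safe="")

ping_url = build_ping_url("http://www.auditmypc.com/sitemap.htm")
print(ping_url)
# http://www.google.com/webmasters/sitemaps/ping?sitemap=http%3A%2F%2Fwww.auditmypc.com%2Fsitemap.htm

# Decoding reverses the step.
print(unquote("http%3A%2F%2Fwww.auditmypc.com%2Fsitemap.htm"))
# http://www.auditmypc.com/sitemap.htm
```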
For more detail on how to resubmit the Google Sitemap, simply
visit
Google Sitemap.
Top Sitemap Builder Suggestions
Suggestions (Updated July 23rd, Thank You!) from users that
have been using the Sitemap Generator and that have been
incorporated into the application include:
Add a Google Sitemap URL Encoder / Decoder to help
submit sitemaps.
Make the URL list sortable by clicking the column title.
When generating a site map for larger sites, it can become
overwhelming.
The ability to save your settings when you create a
sitemap. The next time you generate a sitemap, you'll be
able to refer to these settings.
Include in the sitemap program the ability to honor
robots.txt files. When a site map is created, the program
will look at its own filters first and then look at the
website's robots.txt file if the option is selected.
Include in the sitemap builder the ability to honor meta
tag robot rules. When a site map is generated, the program
will look at its own filters first and then look at the
website's robots.txt file and meta tag if the options are
selected.
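For readers curious what 'honoring robots.txt' looks like in code, here is a rough Python sketch using the standard library's robots.txt parser. It is not how the applet is implemented: load_rules and allowed are hypothetical names, the sample rules and www.example.com are invented, and the honor_robots flag merely stands in for the program's checkbox.

```python
from urllib.robotparser import RobotFileParser

def load_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def allowed(parser, url, user_agent="*", honor_robots=True):
    """Return True if the crawler may add this URL to the sitemap."""
    if not honor_robots:
        # Option unchecked: ignore robots.txt entirely.
        return True
    return parser.can_fetch(user_agent, url)

# Sample rules, invented for illustration only.
rules = load_rules("User-agent: *\nDisallow: /private/\n")
print(allowed(rules, "http://www.example.com/index.html"))      # True
print(allowed(rules, "http://www.example.com/private/a.html"))  # False
```

A crawler would run its own include and exclude filters first and only then ask this check, matching the order described in the suggestion above.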
I want to thank everyone who has taken the time to suggest
improvements!
Sitemap Bugs and Fixes
Version 1.x
June 15th, 2005 - Identified and fixed a bug where, when crawling
over 3000 URLs, the Sitemap Generator would eventually drop down
to one thread and might stop creating a sitemap entirely.
June 16th, 2005 - Identified and Fixed:
Drop downs appeared to be missing from the sitemap
creator when the user changed format from 'Plain' and then
switched back to 'XML' format.
Sometimes crawling startup failed when trying to
generate a Google Sitemap.
When creating a Google sitemap from a website that
contains more than 4,000 pages, the applet started to
consume more system resources. The sitemap creator now
supports up to 50,000 pages!
Added the ability to resize the site map applet to fit
different size monitors.
July 24th, 2005 - Updated site map submission instructions.
July 27th, 2005 - Version 1.3 of Google Sitemap
Version 1.3 of the site map generator sports an entirely new
skin, is more compact, has improved code and fixes a number of
small bugs.
One of the biggest requests by users was greater detail when
errors were discovered. If your website has any errors you will
now have the ability to see exactly what caused the error when
attempting to build a site map.
The number of invalid pages is easily identifiable. Each
sitemap error can be expanded to see all related pages that
point to that particular problem. You also have the ability to
cut and paste Google Sitemap errors.
September 8th, 2005 - Fixed a bug where the sitemap generator
would hang if spaces were found in the url.
September 14th, 2005 - Version 1.4 release. No bugs, site map
cosmetic enhancements.
Common Sitemap Builder Problems
Problem: The sitemap generator misses a few or many files.
Solution: If you are having problems building a sitemap,
it may be due to your Robots.txt file or your Metatag. Try
unchecking the Follow robots.txt rules and/ or meta name robots
rules.
Problem: I can't see the graphic (see below), so I can't
start the test.
Solution: If that's the case, then your browser settings may be
preventing sitemap generation.
In IE, look under Tools, Internet Options, Security, Custom
Level, Scripting of Java applets and choose Prompt. Active
scripting should be enabled as well.
In Firefox, look under Tools, Options, Web Features and make
sure Enable Java and Enable JavaScript are selected.
If after trying these you still have a problem, please let me
know and I will do my best to get you up and running.
Please forward this article onto your colleagues,
assistants, partners and friends, BUT NOT
to your competition.
You have permission to publish this article electronically or in
print, free of charge, as long as the bylines, links and website
references are included. A courtesy copy of your publication
would be appreciated.
Your questions, comments, or suggestions are always appreciated.
Thank YOU,