Understanding the XML Sitemap and Why You Should Have One for Your Site

Sitemaps are an integral part of most modern websites. A sitemap is essentially a file that lists all the URLs that point to the pages of a website. It can be in any of several formats: text, HTML or XML.

The sitemap in XML format, usually referred to as the XML Sitemap, is this URL list in an XML file format that is easily understood by search engines. Having an XML sitemap brings certain advantages, most of which are related to SEO and web crawling.

The XML format allows several pieces of metadata to be attached to a URL or site link. For each URL in the file, you can specify a last modified date, the expected change frequency of the page and a crawl priority. The search engine may use any of these values to crawl the website more efficiently.
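To make the file format concrete, here is a minimal sketch that builds a small XML sitemap with Python's standard library. The namespace is the one defined by the sitemaps.org protocol. The function name build_sitemap, the entry dictionaries and the example.com URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return sitemap XML as a string for a list of entry dicts.

    Each entry needs a "loc" key; "lastmod", "changefreq" and
    "priority" are optional and are written only when present.
    """
    # Serialize the sitemap namespace as the default (no prefix).
    ET.register_namespace("", SITEMAP_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for entry in entries:
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = entry["loc"]
        for tag in ("lastmod", "changefreq", "priority"):
            if tag in entry:
                child = ET.SubElement(url, f"{{{SITEMAP_NS}}}{tag}")
                child.text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical pages, used only for illustration.
EXAMPLE_ENTRIES = [
    {"loc": "https://www.example.com/", "lastmod": "2024-01-15",
     "changefreq": "daily", "priority": 1.0},
    {"loc": "https://www.example.com/archive/", "changefreq": "monthly",
     "priority": 0.3},
]

sitemap_xml = build_sitemap(EXAMPLE_ENTRIES)
print(sitemap_xml)
```

A real sitemap file would normally also start with an XML declaration and be saved as UTF-8. Both are left out here to keep the sketch short.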

Advantages

Find All Pages

The search engine crawler parses the XML sitemap and identifies all or most of the webpages. This works as a starting point for the crawling process. The presence of the links in the file makes it easier for the crawler to find pages and crawl them quickly.

In the absence of a sitemap, the crawler needs to parse the contents of each web page to find the links. If you have any pages that are not explicitly linked from another indexed page, there is a possibility that those pages will not be indexed by the web crawler.
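The problem with unlinked pages can be illustrated with a toy crawler. The sketch below is only a simplified model, not how any real search engine works: it follows the links it finds in each page, starting from the home page, and reports the pages it never reaches. The SITE dictionary and its paths are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def reachable_pages(site, start):
    """Return the set of paths reachable from start by following links."""
    seen = set()
    queue = [start]
    while queue:
        path = queue.pop()
        if path in seen or path not in site:
            continue
        seen.add(path)
        parser = LinkCollector()
        parser.feed(site[path])
        queue.extend(parser.links)
    return seen


# Hypothetical site: path -> HTML body. "/blog/post-2" is not linked anywhere.
SITE = {
    "/": '<a href="/about">About</a> <a href="/blog">Blog</a>',
    "/about": '<a href="/">Home</a>',
    "/blog": '<a href="/blog/post-1">Post 1</a>',
    "/blog/post-1": "<p>First post</p>",
    "/blog/post-2": "<p>New post, not linked yet</p>",
}

found = reachable_pages(SITE, "/")
orphans = sorted(set(SITE) - found)
print(orphans)  # ['/blog/post-2']
```

Listing /blog/post-2 in a sitemap would give a crawler a way to find it without waiting for a link to appear.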

Faster Indexing

When you add new pages and content to your website, it is quite possible that a new page is not yet linked from any other page. This is especially true in the case of a blog. Adding the URL of the new content to the sitemap allows the search engines to quickly find the new web page and crawl it.

Otherwise, you will need to wait until another page that links to it, either external or internal, gets crawled by the search engine and the new URL is identified.

Faster Update of Modified Content

The XML sitemap has metadata that specifies how often a webpage might be updated. The crawlers can take advantage of this information and re-crawl the pages that are most likely to have changed. Search engines are not obliged to adhere closely to this hint. Most of the time, they use their own algorithms to identify and crawl updated content.

Meta Information

There is other metadata that can accompany your URLs in the XML Sitemap. The information that you can add may vary slightly by content type. All URLs can have three important pieces of information, as detailed below: Last Modified, Change Frequency and Priority.

Last Modified: This specifies the last modified date of the URL, written in W3C Datetime format (for example, YYYY-MM-DD). You can update this information when you update the content of the file or page.

Change Frequency: This specifies how often the content of the page is likely to change. The sitemap protocol accepts the values always, hourly, daily, weekly, monthly, yearly and never. This is a hint for the search engines rather than a command, and the search engines will use their own algorithms to decide the crawl frequency.

Priority: This denotes the priority of the current URL relative to the other pages of the same website, as a value from 0.0 to 1.0 (the default is 0.5). For example, the home page and blog posts would typically have a higher priority than the archive pages. You can specify your own priority depending on your website structure. Again, this is only a hint to the crawler on how the webpages should be crawled.
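As a rough sketch of how these three fields can be read back and sanity-checked, the snippet below parses a small sitemap and flags values that fall outside what the sitemaps.org protocol allows. The function name read_url_metadata and the sample document are assumptions made for illustration. The second entry is deliberately invalid.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Change frequency values allowed by the sitemaps.org protocol.
VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly",
                    "monthly", "yearly", "never"}


def read_url_metadata(sitemap_xml):
    """Return one dict per <url> entry, including any problems found."""
    root = ET.fromstring(sitemap_xml)
    results = []
    for url in root.findall("sm:url", NS):
        changefreq = url.findtext("sm:changefreq", namespaces=NS)
        priority_text = url.findtext("sm:priority", namespaces=NS)
        # The protocol's default priority is 0.5 when the tag is absent.
        priority = float(priority_text) if priority_text else 0.5
        problems = []
        if changefreq is not None and changefreq not in VALID_CHANGEFREQ:
            problems.append(f"unknown changefreq: {changefreq}")
        if not 0.0 <= priority <= 1.0:
            problems.append(f"priority out of range: {priority}")
        results.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "lastmod": url.findtext("sm:lastmod", namespaces=NS),
            "changefreq": changefreq,
            "priority": priority,
            "problems": problems,
        })
    return results


# Hypothetical sitemap used only for illustration.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://www.example.com/archive/</loc>
    <changefreq>sometimes</changefreq>
    <priority>1.5</priority>
  </url>
</urlset>"""

entries = read_url_metadata(SAMPLE)
for entry in entries:
    print(entry["loc"], entry["problems"])
```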

Content Types or Multimedia Metadata

You can specify additional content-type-specific metadata for your multimedia and image files. This allows the crawlers to find and identify content correctly, as well as to categorize it appropriately. The most common file types used in websites are image and video files. These are also the media types that are more difficult to parse and categorize based on semantics or content.

The XML Sitemap supports the following metadata for image files: a caption for the image, a title for the image, the geographical location of the image and the license information for reuse of the image. You can also specify this information in the image (img) HTML tag for Search Engine Optimization (SEO). Adding that information to the Sitemap does not hurt either.

For video and audio files, you need to specify a title, a description, the page URL, a thumbnail URL and the URL of the raw video file. The video metadata is an extension to the standard Sitemap protocol that was developed and is supported by Google. It will have only limited impact on other search engines and web crawlers.
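The sketch below shows roughly what a single sitemap entry with image and video metadata might look like when generated with Python's standard library. The two namespaces are the ones used by Google's image and video sitemap extensions, and the tag names follow the fields listed above. The helper names and all example.com URLs are hypothetical. Support for individual tags has changed over time, so check the search engine's current documentation before relying on any of them.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
# Namespaces of Google's image and video sitemap extensions.
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"
VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"

ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("image", IMAGE_NS)
ET.register_namespace("video", VIDEO_NS)


def add_children(parent, namespace, fields):
    """Append one child element per (tag, text) pair in fields."""
    for tag, text in fields.items():
        ET.SubElement(parent, f"{{{namespace}}}{tag}").text = text


def build_media_entry(page_url, image_fields, video_fields):
    """Return a one-entry sitemap with image and video metadata."""
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
    image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
    add_children(image, IMAGE_NS, image_fields)
    video = ET.SubElement(url, f"{{{VIDEO_NS}}}video")
    add_children(video, VIDEO_NS, video_fields)
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical page and media files, used only for illustration.
media_xml = build_media_entry(
    "https://www.example.com/videos/demo.html",
    image_fields={
        "loc": "https://www.example.com/images/demo.jpg",
        "caption": "Screenshot of the demo",
        "title": "Demo screenshot",
    },
    video_fields={
        "thumbnail_loc": "https://www.example.com/thumbs/demo.jpg",
        "title": "Product demo",
        "description": "A short walkthrough of the product.",
        "content_loc": "https://www.example.com/media/demo.mp4",
    },
)
print(media_xml)
```

The page URL goes in the regular loc tag, while the thumbnail and raw video file URLs go inside the video element.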

You should understand that a Sitemap is not a required feature of a website. It is more of a nice-to-have feature that helps web and search engine crawlers index your website efficiently and effectively. The need for a Sitemap depends on the structure of your website. Most search engines can index most of the pages without one. A Sitemap makes more sense when your website has any of the following characteristics:

Dynamic Content: Your website has large amounts of dynamic content. Many web pages and much of the content are created by AJAX, and there are images and multimedia files that are not well linked to pages.

Few Inbound Links: Your website is new or has very few inbound links from external websites. This makes it harder for the search engines to detect the existence of the website or its webpages.

Few Internal Links: Your website has large amounts of standalone content that is not well linked from other webpages in the website.

Whatever the reason, it is usually good to have an XML Sitemap for the website, as it helps search engines discover all your webpages with minimal effort on your part. Generating a Sitemap is not hard, and there are usually several automated methods to do it. You can even write one by hand for smaller websites.