How to create an XML sitemap using wget and a shell script

A sitemap (or site map) is a file that lists the pages of a website that are accessible to users and search engines. It can be in any format, as long as whoever reads it can understand that format. There are mainly two formats used when creating a sitemap: XML and HTML.

All websites should ideally have at least one form of sitemap, especially for search engine optimization (SEO). An XML sitemap is usually preferred for SEO because it can carry relevant metadata for each URL. Almost all search engines can read a properly formatted sitemap, which is then used to index the pages on the website.
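
To give you an idea of the target format, here is a minimal XML sitemap with a single URL entry (the domain is just a placeholder). This is essentially what the script at the end of this article will produce:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
  </url>
</urlset>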

If you build websites with a website building platform such as WordPress or Drupal, then there is already built-in functionality (or a plugin) that can automatically generate relevant sitemaps. But if you develop websites using web technologies such as HTML, CSS, JavaScript or PHP without the aid of a website building platform, then you will need to create the sitemap manually.

If and when you do have to create a sitemap manually, it is not that bad for a small website with just a few pages. But once the website has even as few as 30 or 40 pages, creating the XML sitemap by hand becomes a nightmare. It is also an ongoing maintenance issue if the website is constantly updated, with new pages regularly added and old pages deleted. It is quite easy to make silly errors, including spelling mistakes, or to miss pages altogether.

We will develop a simple shell script that crawls the website and generates a simple, workable XML sitemap. That makes it very easy to regenerate the sitemap regularly. We will use the wget utility in Linux to crawl the website.

I will take a step-by-step approach so that you can learn and better understand how the script is built. If that is not of much interest to you, just scroll to the bottom for the complete script.

We will assume that your website is running locally on localhost, which is usually the case while you are developing the site (though not always). We will crawl the website first using the simplest wget command.

$ wget http://localhost/mywebsite/

Now, we do not really care about saving the content of the web pages locally. Also, we need to recurse into the website hierarchy rather than fetch just the home page. Let's add the recursive and spider options to wget.

$ wget --spider --recursive http://localhost/mywebsite/

Now, wget by default only crawls to a depth of 5. We want to crawl the entire website no matter what the depth. We will set the depth to infinity. You can modify this to the depth level you want.

$ wget --spider --recursive --level=inf http://localhost/mywebsite/

We will store the output in a local file, which we can manipulate later. Also, we will use the --no-verbose option to reduce the logging. We just need the URL of each page being fetched, and nothing else. Keeping the log small will make it easier to parse. So, the command now looks like this:

$ wget --spider --recursive --level=inf --no-verbose --output-file=/home/tom/temp/linklist.txt http://localhost/mywebsite/

Now, the file linklist.txt contains all the URLs on the website, but not in the exact format we want. We will strip out just the URL part of this log file using grep and awk.

The log messages in the output file should look something like the example below. The URL that we are interested in is the text just after the string URL:, up to the next whitespace character. We will extract that text using awk, piping several awk commands together in steps to strip out exactly what we want from each line.

2015-08-07 15:21:59 URL:http://localhost/mywebsite/keynotes/feed/ [769] -> "localhost/mywebsite/keynotes/feed/" [1]

First, we get all the lines that we want from the file:

$ grep -i URL /home/tom/temp/linklist.txt

Now, we will split each line on the string URL: and keep the part after it, using awk. That should be simple enough:

awk -F 'URL:' '{print $2}'
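
Applied to the sample log line shown earlier, this step keeps everything that follows the URL: marker:

http://localhost/mywebsite/keynotes/feed/ [769] -> "localhost/mywebsite/keynotes/feed/" [1]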

Now, we can trim any extra whitespace from the line, as it might contain some leading spaces.

awk '{$1=$1};1'

The next step is to strip out just the URL, which is the first part of the string, up to the first whitespace character.

awk '{print $1}'
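
Applied to the same sample line, the three awk steps together reduce it to just the URL:

http://localhost/mywebsite/keynotes/feed/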

You can probably combine all of the above into a single awk command, but this keeps it easy to understand. Now we can sort the URLs and remove duplicates using the sort utility. We will also remove any blank lines using sed.

sort -u | sed '/^$/d'

Putting it all together, the entire command will look something like this:

grep -i URL /home/tom/temp/linklist.txt | awk -F 'URL:' '{print $2}' | awk '{$1=$1};1' | awk '{print $1}' | sort -u | sed '/^$/d' > sortedlinks.txt

There are several other ways to do exactly the same thing using sed or even tr. But I think the above set of commands is simple and modular enough that you can customize it further to match your requirements. You could also replace the domain name in the URLs, if needed, with a simple sed command.
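
For example, if you crawled the site on localhost but want the sitemap to list the production URLs, a sed substitution along these lines would do it (the production domain and the output file name here are just placeholders):

$ sed 's|http://localhost/mywebsite|http://www.example.com|g' sortedlinks.txt > productionlinks.txt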

The next step is to generate the sitemap XML file. We will generate the sitemap from a boilerplate template with preset values. We will create a very simple sitemap, suitable for simple static websites. We are not going to add any extra fields, as more sophisticated systems might, or deal with images.

First, we will loop through the links in the file and insert a url tag for each of the URLs we want. We will look only at URLs that end either with a slash (/) or with the extensions .html or .htm. We will add just one XML loc tag for each of these URLs.

Here is the entire bash shell script for the process.

#!/bin/bash

# Domain (or local URL) of the website to crawl
sitedomain=http://www.mywebsitedomain.com/

# Crawl the entire site and log the visited URLs
wget --spider --recursive --level=inf --no-verbose --output-file=/home/tom/temp/linklist.txt "$sitedomain"

# Extract the URLs from the log, then sort, de-duplicate and drop blank lines
grep -i URL /home/tom/temp/linklist.txt | awk -F 'URL:' '{print $2}' | awk '{$1=$1};1' | awk '{print $1}' | sort -u | sed '/^$/d' > /home/tom/temp/sortedurls.txt

# Boilerplate sitemap header
header='<?xml version="1.0" encoding="UTF-8"?><urlset
      xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
            http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">'
echo "$header" > sitemap.xml

# Add a url entry for every URL ending in /, .html or .htm
while read -r p; do
  case "$p" in
  */ | *.html | *.htm)
    echo "<url><loc>$p</loc></url>" >> sitemap.xml
    ;;
  *)
    ;;
  esac
done < /home/tom/temp/sortedurls.txt

# Close the root element
echo "</urlset>" >> sitemap.xml

You can add additional fields such as lastmod, changefreq or priority as needed. I have kept it simple for the most part. Your requirements will vary, and you can adapt and develop this script further to add more fields.
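
As a rough illustration, a url entry using those optional fields would look something like this (the path and values are just placeholders; see the sitemaps.org protocol for the allowed values):

<url>
  <loc>http://www.mywebsitedomain.com/about.html</loc>
  <lastmod>2015-08-07</lastmod>
  <changefreq>monthly</changefreq>
  <priority>0.8</priority>
</url>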

While this is a good method to generate a sitemap for local websites, there are definitely other ways to generate an XML sitemap, such as online sitemap generators.

A note of caution: it is not really a good idea to create XML files directly from shell scripts, especially if they are large and complex. You might be better off finding and using a Perl or Python library to create more sophisticated XML sitemap files. You can still use the intermediate files generated by this script as your input.