how to use wget to find broken links on your website

As websites grow and accumulate content, they can easily become a nightmare to maintain. This is true of any website, but especially of a blog, a forum or any other site with a constant influx of new content.

Over time, content on the website becomes outdated and gets removed, which can leave broken links in other pages and posts. Finding and correcting these broken links is often cumbersome, because they can be scattered across many parts of the site, including pages you no longer remember exist.

There are several tools that let you find broken links on a website. Many of them are online services, search engine tools (like Google Webmaster Tools) or standalone utilities that need to be installed. We will take a look at how you can use the wget utility in Linux to achieve the same thing, quickly and easily.

what is wget

wget is a software utility from the GNU Project that retrieves content from web servers. It supports the HTTP, HTTPS and FTP protocols. It can be used to spider and replicate websites, and it has some very desirable features such as recursive crawling, link rewriting and proxy support.

what is a broken link

Before we embark on finding broken links, we should define what a broken link is. A broken link, or broken URL, is a URL that does not return a valid web object: it points to a non-existent web resource. Broken links are sometimes also called dead links.

Typically, the web server returns an HTTP 404 status code when you try to access a broken link. There are several reasons why you might have broken links on your site. They usually come down to an error or oversight by a webmaster or designer, which makes them easy to fix once you know where they are.

Typing Errors or Spelling Mistakes: The URL may have been mistyped when it was coded. This is more common than you think.
Deleted Resources or Pages: The resource may have been deleted, perhaps even accidentally. This is especially likely for resources that have been around for a while. You might not have been able to track down all the references to the resource, or you may have missed some.
Renamed Resources: It is not uncommon for resources such as pages, images or CSS files to be renamed. It can be difficult to track down all the references and change them manually.
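
To check whether one particular URL is broken in the sense described above, wget's exit status can be used instead of reading its output. The following is a minimal sketch rather than a definitive tool: the function name check_url is made up for illustration, and it relies on wget's documented behaviour of exiting with status 8 when the server issues an error response such as 404.

```shell
#!/usr/bin/env bash
# check_url URL  (hypothetical helper, not part of wget)
# Prints "ok", "broken" or "unreachable" followed by the URL.
check_url() {
    local url="$1" status=0
    # --spider: only check that the resource exists, do not save it.
    # -q: suppress output; the exit status carries the result.
    wget --spider -q "$url" || status=$?
    case $status in
        0) echo "ok $url" ;;
        8) echo "broken $url" ;;   # 8 = server issued an error response (e.g. 404)
        *) echo "unreachable $url (wget exit status $status)" ;;
    esac
}
```

Calling check_url http://localhost/some-page would then print one line describing that URL. Other exit statuses (for example network failures) are reported separately, because they do not necessarily mean the link itself is wrong.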

issues due to broken links

Whatever their cause, broken links can have a negative impact on your website, so it is desirable to fix them as soon as possible.

User Perception and Engagement: Broken links reduce the perceived quality of the website. If users repeatedly see error pages when clicking on links, their perception of the website suffers, and user engagement tends to drop as well.
Search Engine and SEO: The same goes for search engines. When search engine crawlers see repeated errors on your website, they tend to lower the ranking of your website and its pages within the search engine index.

using wget to find broken links

There are several web-based crawlers that can crawl your website and find broken links. We will put the wget utility to the same task so that you can use it directly from the command prompt. We will work our way through the various wget options to arrive at the right set of command line options for finding broken links.

Let's say your website is currently hosted locally at localhost. The simplest form of the wget command is:

bash$ wget http://localhost/

recurse to crawl all pages

Now, we need to check all pages for broken links, not just the home page. Add wget's recursive option (-r or --recursive) to the command. This causes wget to follow links and retrieve files recursively, much like a spider.

bash$ wget -r http://localhost/

infinite depth

We have no way of knowing how deep the website structure is, so we will set the depth to infinity to make sure that all the pages are crawled. The command line option to set the depth is -l or --level, which we will set to inf.

bash$ wget -r -l inf http://localhost/

download all referenced resources

Now, we have to download not only the pages but also all the resources needed to display the HTML pages correctly, such as images and stylesheets. The -p or --page-requisites option does just that.

bash$ wget -r -p -l inf http://localhost/

time stamping

When crawling websites, it is often prudent to download only the files that have changed since the last crawl. This can save considerable time when your website is large. To do this, we will turn on wget's time stamping feature with the -N option.

bash$ wget -r -p -N -l inf http://localhost/

disable host directories (optional)

When you crawl recursively, the files are stored under host-prefixed directories. If you don't want the host directories to be created, use the -nH or --no-host-directories option.

bash$ wget -r -p -N -nH -l inf http://localhost/

specify directory prefix

If you have disabled the host directories, you might want to specify your own directory prefix. You can do so with the -P or --directory-prefix option (note the uppercase P; lowercase -p is the page requisites option).

bash$ wget -r -p -N -nH -P mywebsite -l inf http://localhost/

log file configuration

Now, you should save the output to an external log file so that you can analyze it later. You can specify a log file with the -o or --output-file command line option. In this example, we will save the log to a file called logfile.log in the current working directory.

bash$ wget -r -p -N -nH -o logfile.log -P mywebsite -l inf http://localhost/

reduce the load (optional)

You probably want to reduce the load on your server as you crawl, to keep it from crashing or suffering degraded performance. The option for that is -w or --wait, which takes the number of seconds to wait between requests. You can further randomize the interval with the --random-wait option.

bash$ wget -r -p -N -nH -w 5 --random-wait -o logfile.log -P mywebsite -l inf http://localhost/

This will crawl the entire website, save the files to a folder called mywebsite and write the crawl log to a file named logfile.log.

There are several other options that you can use depending on your requirements. wget is an extremely versatile and powerful program with options for almost every scenario, and the above command is probably one of the simplest. Check the complete wget documentation to further tune your settings.

Note: There is a command line option called -m or --mirror that is a shortcut for "-r -N -l inf --no-remove-listing", most of which we use in the examples here. You could choose to use this instead of the separate options.
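
As a sketch of what the mirror shortcut looks like in practice, the crawl can be wrapped in a small function that uses -m in place of the separate recursion options. The function name mirror_site and its argument order are assumptions made for this example; the wget options themselves are the ones discussed in this article.

```shell
#!/usr/bin/env bash
# mirror_site URL DIR LOGFILE  (hypothetical wrapper, not part of wget)
# Crawls URL into DIR and writes the crawl log to LOGFILE.
mirror_site() {
    # -m stands in for "-r -N -l inf" and also adds --no-remove-listing,
    # which only affects FTP retrievals.
    wget -m -p -nH -w 5 --random-wait -o "$3" -P "$2" "$1"
}
```

For the running example, the call would be mirror_site http://localhost/ mywebsite logfile.log.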

identify the broken links

Once you have crawled the website, you can analyze the log file to see whether there are any broken links or other errors. Errors that occurred during the crawl are marked as such in the log file, so you can quickly check for them by searching for the word error.

We will use the grep command to search through the log file.

bash$ grep -B1 -i "error" logfile.log

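The -B1 option prints the line just before each match, which helps identify the link that is broken so that you can fix it on your website. Depending on how verbose the log is, the requested URL may sit a few lines above the error line; if it does not show up, increase the context, for example with -B3.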
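
The same lookup can be sketched as a small shell function that pairs each error with the URL that triggered it. This is only an illustration: the name list_broken_links is made up, and the code assumes wget's default verbose log layout, in which every request starts with a line beginning with "--" and ending in the URL, and a failure is later reported on a line of the form "DATE TIME ERROR 404: Not Found.". If your log looks different, the patterns need adjusting.

```shell
#!/usr/bin/env bash
# list_broken_links LOGFILE  (hypothetical helper)
# Prints "STATUS URL" once for every URL in a wget log that ended in an error.
list_broken_links() {
    awk '
        /^--/ { url = $NF }              # request line: remember the URL (last field)
        $3 == "ERROR" && url != "" {     # error line belonging to that request
            code = $4
            sub(/:$/, "", code)          # strip the trailing colon from the status
            print code, url
        }
    ' "$1" | sort -u
}
```

Running list_broken_links logfile.log then yields one line per broken URL, which is easier to feed into further commands than the raw grep output.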

identify locations of broken links

The second part of fixing broken links is finding where they occur. You want to track down the pages on the website that contain such links. Most of the time you might already know, but sometimes it can be quite tricky to track them down.

You can again use the grep command to find these links in your website. Take the broken URL or link that you got from the previous grep command or from the log file, and use it to search the crawled version of the website.

bash$ grep -irnH "<broken-link>" mywebsite

This should print out all the pages or files where the broken link occurs on the website. Here, "<broken-link>" stands for the link that you found to be broken, and mywebsite is the folder where the crawled version of the website is stored.
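
When there are several broken links, the search can be repeated in a loop. The sketch below assumes a hypothetical helper named find_link_sources; it uses grep's -F option so that characters such as "." or "?" in a link are matched literally rather than as a pattern. Pages often reference a resource by a relative or site-absolute path rather than the full URL, so searching for the path portion of the link may find more matches.

```shell
#!/usr/bin/env bash
# find_link_sources SITE_DIR LINK...  (hypothetical helper)
# For each LINK, lists the crawled files under SITE_DIR that mention it.
find_link_sources() {
    local dir="$1" link
    shift
    for link in "$@"; do
        echo "== $link"
        # -r: recurse into SITE_DIR, -l: print file names only,
        # -F: treat the link as a fixed string, not a regular expression
        grep -rlF -- "$link" "$dir" || echo "   (no references found)"
    done
}
```

For example, find_link_sources mywebsite /old/page.html /img/logo.png prints, under each link, the files in the crawled copy that still point to it.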

Once you have this information, you can easily fix such links in the website. This is a quick and easy way to find and fix broken links locally, before publishing to the live or production servers.

Once you have figured out the commands that work best for your case, you should be able to automate the whole thing easily. Write a simple shell script that crawls the website and prints out the broken links. You can then run the script periodically, for example as a cron job.
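
One possible shape for such a script is sketched below. It is a starting point under stated assumptions, not a finished tool: the function name check_site is invented for this example, the crawl options are the ones built up in this article, and the search is for the uppercase word ERROR, which is how wget marks failed requests in its log and which avoids matching URLs that merely contain the word "error".

```shell
#!/usr/bin/env bash
# check_site URL DIR LOGFILE  (hypothetical helper)
# Crawls URL into DIR, then prints the error lines from LOGFILE with context.
# Returns 1 if any error was logged, 2 if no log was produced, 0 otherwise.
check_site() {
    local url="$1" dir="$2" log="$3"
    # wget exits non-zero when it meets broken links, so its exit status
    # is deliberately ignored and the log is inspected instead.
    wget -r -p -N -nH -w 5 --random-wait -o "$log" -P "$dir" -l inf "$url" || true
    if [ ! -f "$log" ]; then
        echo "No log file was written to $log" >&2
        return 2
    fi
    if grep -q "ERROR" "$log"; then
        echo "Broken links found while crawling $url:"
        grep -B3 "ERROR" "$log"      # -B3 shows the request lines above each error
        return 1
    fi
    echo "No broken links found while crawling $url."
}
```

Saved as, say, check_links.sh with a final line such as check_site http://localhost/ mywebsite logfile.log, it could be scheduled with a crontab entry like "0 3 * * 1 /path/to/check_links.sh" to run every Monday at 03:00. Both the file name and the path are placeholders.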