Finding Bad Links in Perl
Introduction
When you run a web site of any size, links are bound to go bad. External links may have gone away, images may have moved, and even internal files may have been deleted. Many commercial products can crawl a site and report on the bad links. But with the power of Perl, we can write a simple link checker to help keep a site up to date.
To check a site for bad links, we need to be able to do two things:
First, we need to be able to connect to locations on the web and determine whether each connection succeeds.
Second, if the connection is successful, we need to parse the contents of the page for more links. This lets us crawl a particular web site, searching for bad links.
If we head over to the Comprehensive Perl Archive Network (CPAN), we'll find the LWP suite of modules and the HTML::Parser module. The LWP modules do the dirty work of connecting to a web site and returning the results, while HTML::Parser handles the nitty-gritty task of parsing the HTML document. We'll also use the URI::URL module to build absolute URIs from the relative paths that sometimes appear in HREF and SRC attributes.
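To see how these three modules fit together, here is a minimal sketch of both tasks: fetch a page with LWP and check that the request succeeded, pull candidate URLs out of the HTML with HTML::Parser, and resolve relative paths with URI::URL before checking each link. The starting URL is just a placeholder, and checking links with HEAD requests is a simplification (some servers reject HEAD even when the page exists); the full link checker developed later is more thorough.

  #!/usr/bin/perl
  # A rough sketch: fetch one page, collect its links, and report which
  # ones respond successfully.  Not the finished link checker.
  use strict;
  use warnings;

  use LWP::UserAgent;
  use HTML::Parser;
  use URI::URL;

  my $base = 'http://www.example.com/';          # placeholder starting page
  my $ua   = LWP::UserAgent->new(timeout => 15);

  # Task one: connect to a location on the web and test for success.
  my $response = $ua->get($base);
  die "Can't fetch $base: ", $response->status_line, "\n"
      unless $response->is_success;

  # Task two: parse the page for more links.  This handler simply grabs
  # any HREF or SRC attribute it sees, whatever the tag.
  my @links;
  my $parser = HTML::Parser->new(
      api_version => 3,
      start_h     => [
          sub {
              my ($tagname, $attr) = @_;
              for my $name (qw(href src)) {
                  push @links, $attr->{$name} if defined $attr->{$name};
              }
          },
          'tagname, attr',
      ],
  );
  $parser->parse($response->decoded_content);
  $parser->eof;

  # Turn relative paths into absolute URLs and check each one.
  for my $link (@links) {
      my $url   = url($link, $base)->abs;        # URI::URL resolves relative links
      my $check = $ua->head($url);
      printf "%s %s\n", $check->is_success ? "OK " : "BAD", $url;
  }

Running this against a real page prints one line per link, marking each as OK or BAD, which is the core behavior we want the finished crawler to provide for an entire site.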