- Introduction
- Fetching the Site and Checking Links
- Conclusion
- Complete Listing
Fetching the Site and Checking Links
This script is simple and does the most essential task: checking for bad links. More verbose and detailed reporting is left as an exercise for the reader; this script gives the basic foundation. Let's begin!
01: #!/usr/bin/perl -w
02: use strict;
03: use HTML::Parser;
04: use LWP::UserAgent;
05: use URI::URL;
Lines 1 through 5 begin the script. Line 1 turns on warnings and line 2 enables the strict pragma to help make sure that the code is somewhat sane; these two things are essential for any program. Line 3 uses the HTML::Parser module, which parses the HTML we retrieve and gives us the links on each page. The LWP::UserAgent module on line 4 is used to do the actual fetching of web pages. Finally, line 5 brings in the URI::URL module to help turn any relative URLs into full URIs.
We now have all the tools we need.
06: my %LINKS;
07: my %GOOD_LINKS;
08: my %BAD_LINKS;
09: my $BASE;
10: my @TO_CHECK;
11: my $URL = $ARGV[0] || "http://mydomain.com";
Lines 6 to 11 define our global variables. The %LINKS hash is used as a container for the links found on each page we fetch. The %GOOD_LINKS and %BAD_LINKS hashes are used to keep track of which links are good and which are bad, respectively. $BASE holds the base URI that's being fetched. It will be used with the URI::URL module. The @TO_CHECK array holds all the links that need to be checked. This list grows as the program runs and parses more pages. The $URL variable is the URL from which we should start the crawl, and should be a base domain name. Since this is a command-line program, we can pass an argument for the URL, or use a default (http://mydomain.com).
We have the needed modules and our variables. Now on to the good stuff.
12: {
13: package GetLinks;
14: use base 'HTML::Parser';
Lines 12 to 14 begin a new block in which we create a new package, or basically a little module within our script. We name the package GetLinks, and make it a subclass of the HTML::Parser module. By making it a subclass of HTML::Parser, we can inherit all its functionality, as well as override its start() method. This will be better explained in a moment.
15: sub start {
16:     my $self = shift;
17:     my ($tag, $tag_attr) = @_;
18:     if ($tag eq 'a' and defined $tag_attr->{href}) {
19:         $LINKS{$tag_attr->{href}} = 0;
20:     }
21:     if ($tag eq 'img' and defined $tag_attr->{src}) {
22:         $LINKS{$tag_attr->{src}} = 0;
23:     }
24: }
Lines 15 to 24 make up the GetLinks::start() method. This is a callback used in HTML::Parser. Whenever HTML::Parser comes across the start of an HTML tag, this callback is invoked, which allows us to do something based on the tag that's being parsed. In this case, we're working on <A> and <IMG> tags. Line 18 checks whether the tag (which is passed as an argument to the start() method) is an <A> tag. If it is, we want to make sure that one of the attributes to this tag is HREF. Not all <A> tags have an HREF attribute; some only have a NAME attribute, and we don't want to concern ourselves with those.
If these conditions are met, line 19 adds the URI that's referenced in the HREF attribute to our %LINKS hash. The $tag_attr variable is a hash reference containing the attribute data for the tag being worked on, so its HREF key holds the URI. Lines 21 to 23 perform the same check for <IMG> tags, making sure there's a SRC attribute. You may wonder why a hash is used to store this information instead of a list. A hash is used so that more information about each link can be added easily when this script is expanded. It's a small thing to do to keep the script scalable and maintainable.
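For instance, if the script later grows to record how many times each link appears, the hash value can hold a reference to richer data instead of a plain zero. A minimal sketch of that idea (the tag and seen fields here are hypothetical, not part of this script):

if ($tag eq 'a' and defined $tag_attr->{href}) {
    $LINKS{$tag_attr->{href}} ||= { tag => 'a', seen => 0 };   # create a record the first time
    $LINKS{$tag_attr->{href}}{seen}++;                         # count repeat appearances
}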
25: }
Line 25 simply finishes off the block in which we created the GetLinks package. Now we can head back into the main section of the script.
26: my $ua = new LWP::UserAgent;
27: $ua->agent("LinkCheck/0.1");
Line 26 creates a new LWP::UserAgent object. The LWP::UserAgent module basically creates a web client for us to use. Line 27 gives our user agent a name, LinkCheck/0.1. This information generally is logged into a web server's access log, so name it something useful (or fun).
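If you want a site's administrators to be able to contact you about your crawler, LWP::UserAgent can also send a From header and enforce a timeout. A small optional addition (the address here is only a placeholder):

$ua->from('webmaster@mydomain.com');   # contact address sent in the From header
$ua->timeout(30);                      # give up on a request after 30 seconds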
28: print "Starting scan from $URL\n";
Line 28 just prints out a statement saying that the checking has begun.
29: my $req = new HTTP::Request('GET',$URL);
30: my $res = $ua->request($req);
Line 29 creates a new HTTP::Request object, which does all the necessary things to build an HTTP request for a server. In this case, we're making a GET request to the URL we provided when the script was executed. Line 30 uses our user agent's request() method to run the request. The response comes back as an HTTP::Response object, which is stored in the $res variable.
31: if (!$res->is_success) {
32:     die "Can't fetch $URL";
33: }
Line 31 checks whether the request for the page failed. The is_success() method returns a true value if the requested page was successfully found. If we don't get a true value, our program dies with a simple message. Of course, if we can't get to the original URL, we may as well stop there.
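If you'd like to know why the fetch failed, the HTTP::Response object can tell you: its status_line() method returns the numeric code and reason phrase. A slightly more informative version of this check might look like:

if (!$res->is_success) {
    die "Can't fetch $URL: " . $res->status_line . "\n";   # e.g. "404 Not Found"
}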
34: $BASE = $res->base;
Line 34 sets the $BASE variable to the base of our response. No real magic here, but it will be used later.
35: my $parser = GetLinks->new;
36: $parser->parse($res->content);
Line 35 creates a new instance of the GetLinks package we created in lines 12 to 25. Remember, the GetLinks package is a subclass of HTML::Parser, so it inherits the methods provided by HTML::Parser. We use the parse() method, inherited from HTML::Parser, on line 36. The argument given to the parse() method is the HTML content returned from our request to the web site. We access this content via the content() method of our response object ($res in this case). As you may guess, this content is the HTML of the URL we requested, and HTML::Parser does its magic to parse the HTML tags in this content. As the parsing is happening, the start() callback method is used, and our %LINKS hash is populated.
37: for my $link (keys %LINKS) {
38:     my $true_url = url($link, $BASE)->abs;
39:     push(@TO_CHECK, $true_url);
40: }
Lines 37 to 40 loop through the keys of the %LINKS hash. Each key is a URL for a web page or image, since that's what our start() callback looks for. Line 38 passes the link (the URL we're working on) and our base URL to url(), the constructor exported by URI::URL, and then calls the abs() method on the resulting object, so we can do all this in one shot. The returned value is stored in $true_url, and will be what we eventually check. If the abs() method sees that $link is already an absolute URI, it leaves it alone. If it sees a relative path, it resolves it against $BASE. For example, if $link is '/pages/foo.html' and $BASE is 'http://mydomain.com/', $true_url will be 'http://mydomain.com/pages/foo.html'.
Line 39 pushes this URL onto our @TO_CHECK array, which holds all the links that still need to be checked.
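To see the conversion by hand, you can run the example above as a tiny standalone snippet (using the same example values):

use URI::URL;
my $true_url = url('/pages/foo.html', 'http://mydomain.com/')->abs;
print "$true_url\n";   # prints http://mydomain.com/pages/foo.html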
41: while (my $url = shift @TO_CHECK) {
42:     next if exists $GOOD_LINKS{$url} or exists $BAD_LINKS{$url};
Line 41 starts looping through and picking URLs off the @TO_CHECK array. We're shifting the elements off because we want to alter the array and remove URLs we've checked. Later in this loop we'll add new URLs to the list, so we may as well remove them as we check them. Line 42 skips the URL if it's already in either the %GOOD_LINKS or %BAD_LINKS hash, which means we've checked it before. These hashes get populated later on in this loop.
43: $req = new HTTP::Request('GET', $url);
44: $res = $ua->request($req);
Lines 43 and 44 make an HTTP request to the current URL and put the response into the $res variable. This is the same thing we did with our beginning URL, in lines 29 and 30.
45: if ($res->is_success) {
Line 45 checks whether we connected successfully to the URL. The is_success() method returns a true value if the web server sent back a successful (2xx) response, such as 200. An error response such as a 404 or 500 results in a false value.
46: if ($res->content_type =~ /text\/html/i && $url =~ /$URL/i) {
Line 46 gets the content type of the page we've fetched (from the Content-Type header of the response). If the content type is 'text/html', we can expect to get back a page of HTML. We want to know if it's HTML because if it isn't we don't want to parse it for links. We wouldn't want to parse image data or a text file for hyperlinks. As well as checking for content type, we make sure that the URL we began with (http://mydomain.com/) is part of the current $url. If not, we don't want to scan it. If we did, we would end up crawling external sites, which we don't want to do!
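Note that a plain pattern match is only a rough filter; a URL such as http://othersite.com/page?ref=http://mydomain.com would also match it. If you wanted a slightly stricter test, one variation (not what this script does) is to require the current URL to begin with the starting URL:

# stricter variation of line 46: the page must be HTML and its URL
# must start with the URL we began crawling from
if ($res->content_type =~ /text\/html/i && index(lc $url, lc $URL) == 0) {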
47: my $parser = GetLinks->new;
48: $parser->parse($res->content);
At this point, we've connected successfully to the URL, and we know that the page we connected to is an HTML document. Lines 47 and 48 create a new GetLinks instance and pass the contents of the document to GetLinks to be parsed.
49: for my $link (keys %LINKS) {
Since the contents of the page have gone through our start() callback, the links found in the document have been added to the %LINKS hash. We want to cycle through these links and add them to the @TO_CHECK array. Line 49 begins this loop.
50: my $abs = url($link, $BASE)->abs;
Line 50 gets the absolute path of the hyperlink and puts the value in the $abs variable.
51: unless(exists $GOOD_LINKS{$abs} or exists $BAD_LINKS{$abs}) {
52:     push(@TO_CHECK, $abs);
53: }
Since we don't want to check the same link twice, we check whether the URL is in the %GOOD_LINKS or the %BAD_LINKS hash. If not, we put the URL onto the end of the @TO_CHECK array in order to process it later.
54: }
Line 54 ends the loop through the keys of the %LINKS hash.
55: }
Line 55 closes the conditional on line 46.
56: $GOOD_LINKS{$url}++;
57: } else {
58: $BAD_LINKS{$url}++;
59: }
60: }
Line 56 adds the URL to the %GOOD_LINKS hash. This line is reached if is_success() on line 45 returned a true value. If is_success() returned a false value, we end up on line 58 instead, where the bad link is recorded in the %BAD_LINKS hash. When all the links are checked and nothing is left in @TO_CHECK, we have a hash with all the good URLs, and one with all the bad ones. The only thing left to do is use this information.
61: print qq{Bad links\n};
62: print qq{$_\n} for keys %BAD_LINKS;
Lines 61 and 62 do some very basic display of the results. Since we're mainly concerned with the bad links, we loop through the keys in %BAD_LINKS and display all the bad links. That's it!
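As a small step toward the more verbose reporting mentioned at the start, you could also print a summary of what was checked. For example (a sketch, not part of the listing above):

my $good = scalar keys %GOOD_LINKS;
my $bad  = scalar keys %BAD_LINKS;
printf "Checked %d links: %d good, %d bad\n", $good + $bad, $good, $bad;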