Understanding the Vampire Bot
When you improvise code, you first need to write down the information changes. That typically isn't a problem if you're familiar with the task or have a lot of experience to draw on. But suppose you're writing a program, such as a vampire bot, that you've never written before, and you have no experience with similar programs. You need some techniques to help you uncover the information changes. Whenever you're writing a program that automates an activity you know how to do manually, one heuristic is to ask yourself, "How do I do this task manually?" In the case of the vampire bot, the question would be "What do I do when I download a number of images from a web page?" Typically, the information changes are implicit in the answer. The form of the information may be different, but the information itself is often the same.
For example, to download images manually from a web page, you would probably open your browser, type the URL of the web page, note the images you want to download, and then download each image to your computer (usually by right-clicking each image, choosing the Save As menu item, and entering a destination location and filename for the image). Based on this description, here's one possible set of information changes:
URL-->WebPage-->ImageList-->Files
But these changes aren't the same ones that your vampire bot uses. The main difference between downloading images manually from a web page and a vampire bot doing it automatically is that the vampire bot can't "see" the web page the same way you can. The vampire bot has to work with the raw HTML. Thus, the information changes for the vampire bot are as follows:
URL-->RawHTML-->ImageList-->Files
These are really the same set of changes as the previous ones: the web page and the raw HTML denote the same information, but in different forms.
Of course, there's still the matter of implementing the changes. Before you sit down and start coding, you need to at least understand the key changes. Every program has at least one change that distinguishes it from other programs. For a vampire bot, the key change is going from the raw HTML to a list of images; in other words,
RawHTML-->ImageList
To understand what we need to do to implement this change, it's helpful to go through an actual example.
Remember, the vampire bot can't see the web page the same way you can. It has to work with the raw HTML. Imagine that your bot has somehow downloaded the raw HTML for the sample web page at this address:
http://www.professorf.com/planets.html
This web page contains the HTML code shown in Listing 2.
Listing 2. HTML Source Code (Raw HTML) for planets.html
<HTML>
<HEAD>
<STYLE>
<!--
TD {font-family:verdana;font-size:9pt;font-weight:bold;color:white}
-->
</STYLE>
</HEAD>
<BODY BGCOLOR=BLACK>
<TABLE>
<TR><TD VALIGN=TOP>Sun </TD><TD><IMG SRC="planets/sun.gif"> </TD></TR>
<TR><TD VALIGN=TOP>Mercury</TD><TD><IMG SRC="planets/mercury.gif"></TD></TR>
<TR><TD VALIGN=TOP>Venus </TD><TD><IMG SRC="planets/venus.gif"> </TD></TR>
<TR><TD VALIGN=TOP>Earth </TD><TD><IMG SRC="planets/earth.gif"> </TD></TR>
<TR><TD VALIGN=TOP>Mars </TD><TD><IMG SRC="planets/mars.gif"> </TD></TR>
<TR><TD VALIGN=TOP>Jupiter</TD><TD><IMG SRC="planets/jupiter.gif"></TD></TR>
<TR><TD VALIGN=TOP>Saturn </TD><TD><IMG SRC="planets/saturn.gif"> </TD></TR>
<TR><TD VALIGN=TOP>Uranus </TD><TD><IMG SRC="planets/uranus.gif"> </TD></TR>
<TR><TD VALIGN=TOP>Neptune</TD><TD><IMG SRC="planets/neptune.gif"></TD></TR>
<TR><TD VALIGN=TOP>Pluto </TD><TD><IMG SRC="planets/pluto.gif"> </TD></TR>
</TABLE>
</BODY>
</HTML>
Our vampire bot has to extract the names of all the images, which are highlighted in Listing 3.
Listing 3. The Image Filenames
<TR><TD VALIGN=TOP>Sun </TD><TD><IMG SRC="planets/sun.gif"> </TD></TR>
                                          ^^^^^^^^^^^^^^^
<TR><TD VALIGN=TOP>Mercury</TD><TD><IMG SRC="planets/mercury.gif"></TD></TR>
                                             ^^^^^^^^^^^^^^^^^^^
<TR><TD VALIGN=TOP>Venus </TD><TD><IMG SRC="planets/venus.gif"> </TD></TR>
                                            ^^^^^^^^^^^^^^^^^
<TR><TD VALIGN=TOP>Earth </TD><TD><IMG SRC="planets/earth.gif"> </TD></TR>
                                            ^^^^^^^^^^^^^^^^^
<TR><TD VALIGN=TOP>Mars </TD><TD><IMG SRC="planets/mars.gif"> </TD></TR>
                                           ^^^^^^^^^^^^^^^^
<TR><TD VALIGN=TOP>Jupiter</TD><TD><IMG SRC="planets/jupiter.gif"></TD></TR>
                                             ^^^^^^^^^^^^^^^^^^^
<TR><TD VALIGN=TOP>Saturn </TD><TD><IMG SRC="planets/saturn.gif"> </TD></TR>
                                             ^^^^^^^^^^^^^^^^^^
<TR><TD VALIGN=TOP>Uranus </TD><TD><IMG SRC="planets/uranus.gif"> </TD></TR>
                                             ^^^^^^^^^^^^^^^^^^
<TR><TD VALIGN=TOP>Neptune</TD><TD><IMG SRC="planets/neptune.gif"></TD></TR>
                                             ^^^^^^^^^^^^^^^^^^^
<TR><TD VALIGN=TOP>Pluto </TD><TD><IMG SRC="planets/pluto.gif"> </TD></TR>
                                            ^^^^^^^^^^^^^^^^^
Let's first tackle the simpler problem of extracting one image filename, such as planets/sun.gif, from the raw HTML (Listing 4):
Listing 4. One Image Filename
<TR><TD VALIGN=TOP>Sun </TD><TD><IMG SRC="planets/sun.gif"> </TD></TR>
                                          ^^^^^^^^^^^^^^^
We'll then generalize our findings to extract all the images.
Step 1: Finding a Close Pattern
To extract planets/sun.gif, we first need to find a pattern that gets us close to the image. One good pattern is IMG SRC=, since we know that an image's filename always follows IMG SRC=. Another good pattern is .gif, because we know all the images in this example end in .gif. We'll use .gif as the pattern that gets us close (see Listing 5):
Listing 5. Finding a Close Pattern (patt)
<TR><TD VALIGN=TOP>Sun </TD><TD><IMG SRC="planets/sun.gif"> </TD></TR>
                                                     ^^^^
Assume that we've stored this close pattern in a string variable named patt, and its position in the raw HTML in an integer variable named ploc.
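As a rough illustration of Step 1, the following C# sketch locates the close pattern with String.IndexOf. The names patt and ploc follow the text; the class name, the helper FindClosePattern, the one-row HTML sample, and the case-insensitive comparison are assumptions made only for this example.

```csharp
using System;

public static class ClosePatternSketch
{
    // Returns the position of the first occurrence of patt in rawHtml,
    // or -1 when the pattern does not occur at all.
    public static int FindClosePattern(string rawHtml, string patt)
    {
        // Assumption: ignore case so that ".GIF" would also match.
        return rawHtml.IndexOf(patt, StringComparison.OrdinalIgnoreCase);
    }

    public static void Main()
    {
        // One row of planets.html, used here as stand-in raw HTML.
        string rawHtml =
            "<TR><TD VALIGN=TOP>Sun </TD><TD><IMG SRC=\"planets/sun.gif\"> </TD></TR>";
        string patt = ".gif";

        int ploc = FindClosePattern(rawHtml, patt);
        Console.WriteLine("ploc = " + ploc);
    }
}
```

A result of -1 means the raw HTML contains no close pattern, so there is nothing to extract.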
NOTE
As a matter of convention, in our code listings we'll depict C# keywords and code related to the Framework Base Classes in bold monospace type. We'll depict any code that you can vary, such as variable names and prompts, in italic monospace.
Step 2: Finding the Starting Pattern
Given the location of the pattern that gets us close to the filename (ploc), the second step is to search backward, looking for a pattern that tells us we're at the start of the image's filename. The double quote character (") is a good start-of-the-filename pattern (see Listing 6):
Listing 6. Double Quote as Starting Pattern (spat)
<TR><TD VALIGN=TOP>Sun </TD><TD><IMG SRC="planets/sun.gif"> </TD></TR>
                                         ^
Assume that we've stored the pattern that tells us we're at the start of the image's filename in a string variable named spat, and the location of spat in an integer variable named sloc. As you'll see later on, there are a number of built-in string functions that you can use to search for string patterns and find their locations.
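One way to sketch the backward search of Step 2 is with String.LastIndexOf, which scans from a given position toward the start of the string. The names spat, sloc, and ploc follow the text; the class name, the helper FindStart, and the sample row are assumptions made for illustration.

```csharp
using System;

public static class StartPatternSketch
{
    // Searches backward from ploc for spat and returns its position,
    // or -1 when ploc is out of range or no starting pattern precedes it.
    public static int FindStart(string rawHtml, int ploc, string spat)
    {
        if (ploc < 0 || ploc >= rawHtml.Length)
        {
            return -1;
        }
        return rawHtml.LastIndexOf(spat, ploc, StringComparison.Ordinal);
    }

    public static void Main()
    {
        string rawHtml =
            "<TR><TD VALIGN=TOP>Sun </TD><TD><IMG SRC=\"planets/sun.gif\"> </TD></TR>";
        string spat = "\"";

        // Step 1 result: position of the close pattern.
        int ploc = rawHtml.IndexOf(".gif", StringComparison.OrdinalIgnoreCase);

        // Step 2: walk backward from ploc to the opening double quote.
        int sloc = FindStart(rawHtml, ploc, spat);
        Console.WriteLine("sloc = " + sloc);
    }
}
```

The range check matters because LastIndexOf rejects a start position outside the string, which is what a failed Step 1 (ploc of -1) would otherwise pass in.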
Step 3: Finding the Ending Pattern
Given that we know where the image's filename starts (sloc), the third step is to search forward, looking for a pattern that tells us we're at the end of the filename. Again, a double quote character is a good end-of-the-filename pattern (see Listing 7):
Listing 7. Double Quote as Ending Pattern (epat)
<TR><TD VALIGN=TOP>Sun </TD><TD><IMG SRC="planets/sun.gif"> </TD></TR>
                                                         ^
Assume that we've stored the pattern signaling the end of the image's filename in a string variable named epat, and the location of epat in an integer variable named eloc.
At this point, we have the location of the starting pattern (sloc) and the location of the ending pattern (eloc), and we can extract the name that lies between them.
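The forward search of Step 3 might look like the following sketch, which uses the String.IndexOf overload that accepts a starting position. The names epat, eloc, and sloc follow the text; the class name, the helper FindEnd, and the sample row are assumptions for this example.

```csharp
using System;

public static class EndPatternSketch
{
    // Searches forward from just past sloc for epat and returns its position,
    // or -1 when sloc is out of range or no ending pattern follows it.
    public static int FindEnd(string rawHtml, int sloc, string epat)
    {
        if (sloc < 0 || sloc >= rawHtml.Length)
        {
            return -1;
        }
        // Start one character past sloc so the opening quote is not matched again.
        return rawHtml.IndexOf(epat, sloc + 1, StringComparison.Ordinal);
    }

    public static void Main()
    {
        string rawHtml =
            "<TR><TD VALIGN=TOP>Sun </TD><TD><IMG SRC=\"planets/sun.gif\"> </TD></TR>";

        int ploc = rawHtml.IndexOf(".gif", StringComparison.OrdinalIgnoreCase);
        int sloc = rawHtml.LastIndexOf("\"", ploc, StringComparison.Ordinal);

        // Step 3: find the closing double quote.
        int eloc = FindEnd(rawHtml, sloc, "\"");
        Console.WriteLine("sloc = " + sloc + ", eloc = " + eloc);
    }
}
```

Because the starting and ending patterns are the same character here, beginning the search at sloc + 1 is what keeps the two locations distinct.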
Step 4: Extracting the Image Name
To recap, our test image's filename, planets/sun.gif, is bounded by the starting and ending patterns, both of which happen to be double quotes (see Listing 8):
Listing 8. Starting and Ending Patterns
<TR><TD VALIGN=TOP>Sun </TD><TD><IMG SRC="planets/sun.gif"> </TD></TR>
                                         ^               ^
The locations of these starting and ending patterns are stored in the variables sloc and eloc. The final step is to extract the string between these locations (see Listing 9):
Listing 9. The Filename Between the Starting and Ending Patterns
<TR><TD VALIGN=TOP>Sun </TD><TD><IMG SRC="planets/sun.gif"> </TD></TR>
                                          ^^^^^^^^^^^^^^^
That string, of course, is the filename of the image we want the vampire bot to download. To generalize these steps to get a list of all the image filenames, we just need to repeat the procedure for each image in the raw HTML, looking for the next image where the last image left off; that is, start looking for the next image after eloc.
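Putting the four steps together, a minimal sketch of the repeated extraction could look like this. It uses String.Substring for Step 4 and resumes each search just past eloc, as described above. The class and method names, the two-row HTML sample, and the rule for skipping a close pattern that isn't enclosed by the starting and ending patterns are assumptions of this example, not necessarily how the finished bot handles them.

```csharp
using System;
using System.Collections;

public static class ImageListSketch
{
    // Builds a list of the strings that contain patt and are bounded
    // by spat (before) and epat (after). All patterns must be non-empty.
    public static ArrayList ExtractImages(
        string rawHtml, string patt, string spat, string epat)
    {
        if (patt.Length == 0 || spat.Length == 0 || epat.Length == 0)
        {
            throw new ArgumentException("Patterns must not be empty.");
        }

        ArrayList imageList = new ArrayList();
        int searchFrom = 0;

        while (searchFrom < rawHtml.Length)
        {
            // Step 1: find the close pattern at or after searchFrom.
            int ploc = rawHtml.IndexOf(
                patt, searchFrom, StringComparison.OrdinalIgnoreCase);
            if (ploc < 0)
            {
                break; // no more candidates
            }

            // Step 2: search backward for the starting pattern.
            int sloc = rawHtml.LastIndexOf(spat, ploc, StringComparison.Ordinal);

            // Step 3: search forward for the ending pattern.
            int eloc = sloc < 0
                ? -1
                : rawHtml.IndexOf(epat, sloc + spat.Length, StringComparison.Ordinal);

            // Assumption: ignore a match whose start lies before the current
            // search position or whose end precedes the close pattern.
            if (sloc < searchFrom || eloc < ploc)
            {
                searchFrom = ploc + patt.Length;
                continue;
            }

            // Step 4: extract the text between the two patterns.
            int nameStart = sloc + spat.Length;
            imageList.Add(rawHtml.Substring(nameStart, eloc - nameStart));

            // Look for the next image where this one left off.
            searchFrom = eloc + epat.Length;
        }
        return imageList;
    }

    public static void Main()
    {
        string rawHtml =
            "<TR><TD>Sun </TD><TD><IMG SRC=\"planets/sun.gif\"> </TD></TR>" +
            "<TR><TD>Mercury</TD><TD><IMG SRC=\"planets/mercury.gif\"></TD></TR>";

        foreach (string name in ExtractImages(rawHtml, ".gif", "\"", "\""))
        {
            Console.WriteLine(name);
        }
    }
}
```

ArrayList is used only to match the datatype named in this article; in current C#, a List<string> would be the more usual choice.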
To summarize, to implement a vampire bot you need to improvise code over the information changes shown in Listing 10:
Listing 10. Information Changes for a Basic Vampire Bot
URL-->RawHTML-->ImageList-->Files
Listing 10 shows that the vampire bot takes a web page address (URL) as input and gets the HTML source code (RawHTML) at that URL. From the RawHTML it builds a list of image files to download (ImageList). Finally, the bot downloads the list of images into Files on the user's computer.
If possible, you should determine the datatype for the information before you start coding. Listing 11 shows the information changes with datatypes.
Listing 11. Information Changes with Datatypes
string:URL-->string:RawHTML-->ArrayList:ImageList-->Disk:Files
This is a straightforward chart with plenty of latitude for improvising enhancements; for example, downloading other types of images. In the following sections we show you how to code these changes, starting with getting the URL as input.
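To make the datatypes concrete before any real coding, the chart can be written down as a typed skeleton. The sketch below is only one possible shape: the class name, the Run method, and the idea of passing each stage in as a delegate are assumptions made here so that the types are visible without committing to an implementation. The stand-in stages in Main touch neither the network nor the disk.

```csharp
using System;
using System.Collections;

public static class BotPipelineSketch
{
    // Runs the information changes in order and returns the number
    // of image names handed to the save stage.
    public static int Run(
        string url,                            // string:URL
        Func<string, string> getRawHtml,       // URL --> string:RawHTML
        Func<string, ArrayList> getImageList,  // RawHTML --> ArrayList:ImageList
        Action<string> saveFile)               // ImageList --> Disk:Files
    {
        string rawHtml = getRawHtml(url);
        ArrayList imageList = getImageList(rawHtml);

        foreach (string image in imageList)
        {
            saveFile(image);
        }
        return imageList.Count;
    }

    public static void Main()
    {
        // Stand-in stages that return canned data instead of doing real work.
        int processed = Run(
            "http://www.professorf.com/planets.html",
            url => "<IMG SRC=\"planets/sun.gif\">",
            rawHtml =>
            {
                ArrayList list = new ArrayList();
                list.Add("planets/sun.gif");
                return list;
            },
            image => Console.WriteLine("would save " + image));

        Console.WriteLine(processed + " image(s) processed");
    }
}
```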