2.8 Grokking the Site
Spam messages generally contain a method for the recipient to contact the spammer (or the business on whose behalf the spam is being sent). Some offer an email address and others a phone number, but most try to get the recipient to go to a website. To do this, spammers include a clickable web reference (URL) in the body of the spam email. But because spammers seek to keep their identities a secret, they generally try to disguise all web references. In this section we show some of the tricks they use.
First note that a web reference is usually specified as part of an HTML <a command:
<a href="http://www.example.com"> visible text </a>
The <a (followed by a space) begins the command. The <a is followed by one or more keywords particular to that command, all terminated by the > character. The href= keyword indicates a web reference (URL). Following the > is the actual text that will appear on the screen and that must be clicked to invoke the web reference. The </a> ends the command.
In the following, we examine the pieces of this web reference one item at a time.
- Section 2.8.1 examines the leading <a part.
- Section 2.8.2 explains case insensitivity.
- Section 2.8.3 examines the http: part.
- Section 2.8.4 shows how email addresses can mask the www.example.com part.
- Section 2.8.5 shows that IP numbers and hexadecimal representations of IP numbers are used by spammers to disguise host names.
- Section 2.8.6 shows how the www.example.com part can be hidden with redirects.
- Section 2.8.7 shows how the www.example.com part can be ridiculously stretched out to say aa.bb.cc.dd.ee.ff.gg.hh.ii.jj.kk.example.com.
- Section 2.8.8 shows how the www.example.com part can be hidden behind CNAME records.
- Section 2.8.9 shows that URLs can also be used as comments.
- Section 2.8.10 shows how URLs can be hidden with JavaScript.Encode.
One common method of using URLs to fight spam is to record the host names from those URLs in a database. Each time a new piece of email shows up, its URL is found and the host name in that URL is looked up in the database. If the name is found, the message is interpreted as spam. In addition to host names, a well-designed antispam database includes IP numbers.
To illustrate, consider a database that contains the host name spam.example.com and that host's IP number (192.168.33.44). When new mail arrives, the host name in the arriving mail is looked up in the database. If that new host also has the address 192.168.33.44, it too is rejected even though the new and old host names may be different.
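The lookup just described can be sketched in a few lines. This is only an illustration: the two sets stand in for a real database, and the names SPAM_HOSTS, SPAM_ADDRESSES, and is_listed are our own inventions. The caller is assumed to have already looked up the addresses for the host.

```python
# Hypothetical in-memory stand-in for an antispam database.
SPAM_HOSTS = {"spam.example.com"}
SPAM_ADDRESSES = {"192.168.33.44"}


def is_listed(host, addresses):
    """Return True if the host name or any of its addresses is listed.

    `addresses` holds the IP numbers the caller obtained by looking up
    `host`; no DNS lookup is performed here.
    """
    if host.lower() in SPAM_HOSTS:
        return True
    return any(addr in SPAM_ADDRESSES for addr in addresses)


# A different host name that shares the listed address is still caught.
print(is_listed("other.example.net", ["192.168.33.44"]))  # True
print(is_listed("other.example.net", ["192.168.1.1"]))    # False
```

Keeping names and addresses in separate sets means either kind of match is enough to reject the message.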
2.8.1 The HTML Keyword
The <a command always contains a web reference, but other HTML commands can also contain web references. The <a command indicates a web reference by using an href= expression. For example:
<a href="http://www.example.com">
But other HTML commands use different expressions to indicate the web reference. Table 2.1 lists the HTML commands that allow URLs and the expression used by each.
Table 2.1. HTML Commands That Reference URLs
Command | Expression | Description
<a | href= | Create a hyperlink (href=) or identifier (name=) in a document.
<applet | codebase= | Define an executable applet with a document.
<area | href= | Define a mouse-clickable area within a map.
<base | href= | The base for all URLs in the document.
<body | background= | Set background to a URL.
<del | cite= | Citation reference for deleted information.
<embed | src= | Embed an object in a document.
<form | action= | URL to use on submission of a form.
<frame | src= | Define a frame within a frameset.
<iframe | src= | Embed a frame inside a document.
<img | dynsrc= | Specify a video clip to play.
<img | lowsrc= | Specify a low-resolution image to preload.
<img | src= | Specify an image to load.
<img | usemap= | Coordinates list for a map.
<ins | cite= | Citation for inserted commentary.
<input | src= | Image to select for an input choice.
<isindex | action= | Create a searchable document.
<link | href= | Create an interdocument link.
<link | src= | Specify an external style sheet to use.
<meta | url= | Reference for an HTTP refresh.
<object | classid= | Identify the class of an object.
<object | codebase= | Source of the code base for the object.
<object | data= | Source of data for the object.
<object | name= | Source for the name of the object.
<object | usemap= | Specify the image map to use with the object.
<q | cite= | Citation for the enclosed quotation.
<script | src= | Source for external language code to run.
<table | background= | Source of background image to load.
<td | background= | Source of background image to load.
<th | background= | Source of background image to load.
<tr | background= | Source of background image to load.
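One way a scanner might put Table 2.1 to work is to drive an HTML parser from it. The following is a minimal sketch using Python's standard html.parser module. The URL_ATTRS dictionary copies only a few rows of the table, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# A few rows from Table 2.1; a real scanner would carry the whole table.
URL_ATTRS = {
    "a": {"href"},
    "area": {"href"},
    "base": {"href"},
    "body": {"background"},
    "form": {"action"},
    "iframe": {"src"},
    "img": {"src", "lowsrc", "dynsrc", "usemap"},
    "script": {"src"},
}


class UrlCollector(HTMLParser):
    """Collect (command, expression, value) for URL-bearing expressions."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser has already lowercased the tag and attribute names.
        wanted = URL_ATTRS.get(tag, set())
        for name, value in attrs:
            if name in wanted and value:
                self.found.append((tag, name, value))


def collect_urls(html):
    parser = UrlCollector()
    parser.feed(html)
    parser.close()
    return parser.found
```

For example, collect_urls('<A HREF="http://www.example.com">x</A>') yields one entry for the a command and its href expression.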
2.8.2 Just in Case
Before digging deeply into URLs, we first need to comment on one of the (sometimes overlooked) characteristics of HTML in general.
First, note that all HTML commands, and the keywords they use to reference URLs, are case insensitive. That is, all the following are the same:
<a href="www.example.com">
<A HREF="WWW.EXAMPLE.COM">
<a HrEf="wWw.ExAmPlE.CoM">
Note, too, that host and domain names are also case insensitive.
2.8.3 The Protocol Specification
A protocol can be thought of as a language used by programs to communicate over a network connection. [10] Usually a request is sent from the client software to the server software, and an answer (or reply or data) is returned. The most common protocols are http (Hypertext Transfer Protocol), https (HTTP with Secure Sockets Layer, or SSL), ftp (File Transfer Protocol), and file (for viewing local files).
In a URL, the expression that identifies the protocol prefixes the host or domain:
<a href="http://www.example.com" ... >
That prefix expression (here, the http://) is followed by a host or domain specification (or an IP number) and whatever additional information is needed:
<a href="http://www.example.com/cgi-bin/search?who=bob">
<a href="http://192.168.44.55">
Note that the protocol can be excluded, in which case it is automatically set to the document's default. The protocol, host or domain, and other information are normally enclosed in quotation marks to protect HTML-capable mail programs from stumbling over illegal characters. The quotation marks may be double or single, but whichever is used, they must pair up (a double quotation mark may not be mixed with a single quotation mark):
href="http://www.example.com"
href='http://www.example.com'
href="http://www.example.com' Won't work.
Note, however, that quotation marks can often be safely omitted, so you should not count on their presence when parsing spam email.
If a spam HTML message is in a language other than English, the quotes may be present but specified using the other language's encoding:
href=EF2Dhttp://www.example.comEF2D
Here, the EF2D is hexadecimal that represents two binary byte values (not four characters) that specify quotation marks appropriate to the language. So again it is better not to depend on quotation marks when parsing URLs.
Although the protocol, when present, is conventionally specified with a trailing ://, in actuality all that is really needed is the colon. [11] Thus, all three of the following produce the same URL result:
http://www.example.com
http:www.example.com
http:////////////www.example.com
Notice that the number of forward slashes is unimportant. The single required character is the colon.
Also note that there can be no space between the protocol and its colon, but the colon can be followed by arbitrary white space characters.
http :www.example.com Space before colon won't work.
http: www.example.com Space after the colon is OK.
http: A new line is OK.
www.example.com
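The rules above (a colon is required, no space may precede it, and any slashes or white space after it are ignored) can be captured in a single regular expression. This is a sketch under those assumptions only. The name split_protocol is hypothetical, and the pattern accepts only alphabetic protocol names such as http, https, ftp, and file.

```python
import re

# Protocol letters, then a colon with nothing before it, then any mix of
# white space and forward slashes, which are treated as insignificant.
_PROTO = re.compile(r"^([A-Za-z]+):[\s/]*")


def split_protocol(url):
    """Return (protocol, rest); protocol is None when none is given."""
    match = _PROTO.match(url)
    if match is None:
        return None, url
    return match.group(1).lower(), url[match.end():]


for sample in ("http://www.example.com",
               "http:www.example.com",
               "http:////////////www.example.com",
               "http: www.example.com"):
    print(split_protocol(sample))  # ('http', 'www.example.com') each time
```

A form such as http :www.example.com does not match, so the function reports no protocol for it.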
Note also that the protocol does not need to actually be present with each URL. A <base command (if present) sets a prefix that will precede all URLs that do not specify a protocol. The prefix is always terminated by a forward slash (/) even if one is omitted from the <base command.
<base href="http://www.example.com">
<img src="images/bob.jpg">
These two commands are the equivalent of the following (single) command:
<img src="http://www.example.com/images/bob.jpg">
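If you are writing your scanner in a language with a URL library, the combining step may already be done for you. As a sketch, Python's standard urllib.parse.urljoin applies a base to a later reference and supplies the separating forward slash. The wrapper name resolve is our own.

```python
from urllib.parse import urljoin


def resolve(base_href, reference):
    """Combine a <base href=...> value with a reference found later."""
    return urljoin(base_href, reference)


print(resolve("http://www.example.com", "images/bob.jpg"))
# http://www.example.com/images/bob.jpg
```

A reference that carries its own protocol and host is returned unchanged, so the base affects only references that lack them.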
If a <base is omitted and if the URL omits a protocol, the default is generally the http:// protocol:
<a href="www.example.com"> The protocol defaults to http://.
2.8.4 Email Addresses Mask URLs
To protect from inappropriate input, some HTML-compliant mail readers interpret an email address that is part of an http:// reference to be the same as a host or domain specification. [12] For example, the email address in the first line is interpreted (in the second line) as if the user part and the @ were omitted:
<a href="http://bob@www.example.com">
<a href="http://www.example.com">
The lesson is that whenever you are parsing a host or domain specification, you will need to start parsing over again when you encounter an @ character.
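Starting over at the @ can be as simple as keeping only what follows the last one. The sketch below assumes the caller has already isolated the text between the protocol prefix and the first forward slash. The function name host_after_at is hypothetical.

```python
def host_after_at(authority):
    """Return the host part of 'user:password@host', dropping the rest.

    `authority` is the text between the protocol prefix and the first
    forward slash, for example 'bob@www.example.com'.
    """
    # Everything up to and including the last @ is user information.
    return authority.rpartition("@")[2].lower()


print(host_after_at("bob@www.example.com"))  # www.example.com
print(host_after_at("www.example.com"))      # www.example.com
```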
2.8.5 IP Numbers Too
Spammers are aware that the domain part of the URL does not need to be expressed in host.domain form (as www.example.com), and they use that fact to help disguise the host's name. For example, the following replaces the host.domain form with the IP number of the host www.example.com:
<a href="http://192.168.22.33">
Here 192.168.22.33 is the IP number for www.example.com. So be aware, when parsing URLs, that the host.domain part can be expressed as an IP number, too.
Also note that an IP number can be expressed as a single decimal number, or as a single hexadecimal number when prefixed with a literal 0x, thus making it even harder to detect:
<a href="http://0xC0A81621"> IP number in hexadecimal
<a href="http://3232241185"> IP number in decimal
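Both single-number forms can be turned back into the familiar dotted form before a database lookup. The following sketch relies on Python's int() with base 0, which accepts a 0x prefix, and on the standard ipaddress module. The function name is our own, and only IPv4 is considered.

```python
import ipaddress


def numeric_host_to_dotted(host):
    """Return dotted-quad text for a single-number host, else None."""
    try:
        # Base 0 accepts both "3232241185" and "0xC0A81621".
        value = int(host, 0)
        return str(ipaddress.IPv4Address(value))
    except ValueError:
        # Not a single number, or outside the 32-bit IPv4 range.
        return None


print(numeric_host_to_dotted("0xC0A81621"))  # 192.168.22.33
print(numeric_host_to_dotted("3232241185"))  # 192.168.22.33
```

Ordinary host names and dotted addresses return None, so the caller can fall through to its usual handling.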
All this effort to disguise a host's name with a cryptic-looking hexadecimal address allows spam email to double as a means to accomplish fraud. Consider, for example, the following web reference and surrounding text:
Your ATM card PIN number has expired. For security reasons, connect to our <a href="https://www.ABCDE-Bank.com:Secure@0xC0A8162"> secure server</a> and select a new PIN as soon as possible.
Users who click on this URL are taken to the spammer's fraud site at 0xC0A8162, and not to the bank's site as they would expect. Even more dangerous, users see the literal link address, which begins with the bank's name, in the browser's link window, and so are further fooled into thinking that the link is legitimate.
This example shows why it is crucial to start over when you encounter an @ when parsing a URL and why you need to allow for host.domain, IP address, and hexadecimal forms of addresses when parsing the URL.
2.8.6 Dealing with Redirects
A redirecting site is a host that takes a web reference that points to itself, strips away the self-referencing part, and then issues an HTTP redirect command back to the user's browser with the remaining part of the reference. The effect for the user is to view the redirected-to site and not the redirecting host's site.
http://redirect.example.com/*http://www.real.host
http://redirect.example.com/*http://www.real.host
http://www.real.host
Here, redirect.example.com is a redirecting site. When a browser visits it with the full URL shown on the first line, it recognizes the self-referencing part (the leading http://redirect.example.com/* on the second line), strips that self-reference, and returns an HTTP redirect to the actual host (shown on the third line).
The presence of redirect sites increases the complexity of parsing an actual site from a URL in spam email. Currently it is sufficient to simply start parsing over again when you encounter an http: or an https: while parsing a URL. As spammers gain skill and experience, however, such a simple solution will become less effective.
There are many redirect sites on the Internet. One that spammers know well is rd.yahoo.com. [13] They all share the same characteristics. The URL for the redirecting host is listed first, followed by a forward slash or backslash, then a special character, and then the URL for the actual host.
<a href="http://rd.yahoo.com/xxxx/*http://real.site">
For the rd.yahoo.com family of servers, the special character is an asterisk. For others the special character varies. The one common characteristic among all redirect servers is that the special character is not one that would normally appear as part of a URL. [14]
The redirect site can be followed by an arbitrary amount of URL information. For example:
href="http://rd.yahoo.com/bypass/winkie/food/thumb*big/dairy/noisy/gyroscope/middle/fred/*http://real.site"
Recall that the special character must follow a forward slash. So here the * in thumb*big is not special. That is, it does not begin a URL for the real site.
Note that backslashes are the equivalent of forward slashes in the redirect portion of the URL because they are essentially ignored by the redirect server:
href="http://rd.yahoo.com\bypass/winkie/food\thumb*big/dairy/noisy/gyroscope/\middle/fred/http://www.chunky.example.com/acidrain/moose/jane/wheel\*http://real.site"
Here, the actual site (always last) is real.site. But beware: real.site is the spammer's site and subject to the spammer's rules. It is not unreasonable to expect spammer sites to employ new methods that will make the real.site appear perhaps second from the last or even third from the last:
href="http://rd.yahoo.com\bypass/winkie/food\thumb*big/dairy/noisy/gyroscope/\middle/fred/http://www.chunky.example.com/acidrain/moose/jane/wheel\*http://real.site?search=http://some.bogus.site&reference=http://another.bogus.site"
Just as it is important to begin parsing over again when you find an http: or https:, it is also necessary to stop parsing when a URL argument (the ?) appears. Clearly, spammers will go to great lengths to disguise the actual URL, and the examples we have shown here are only a hint of the techniques you will see in the future.
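The two rules from this section (start over at each http: or https:, and stop at the first ?) might be combined as follows. This is only a sketch of that simple approach and, as noted, spammers may defeat it. The name final_target is hypothetical.

```python
import re

_RESTART = re.compile(r"https?:", re.IGNORECASE)


def final_target(url):
    """Guess the real destination of a URL that may pass through redirects."""
    # Stop at the first ? so that URLs passed as arguments are ignored.
    url = url.split("?", 1)[0]
    # Start over at the last http: or https: that remains.
    starts = [m.start() for m in _RESTART.finditer(url)]
    return url[starts[-1]:] if starts else url


print(final_target("http://rd.yahoo.com/xxxx/*http://real.site"))
# http://real.site
```

Cutting at the ? first matters: if the order were reversed, a bogus URL in the argument list would be mistaken for the real site.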
2.8.7 Wildcard DNS Records
The host name part of a URL can be subject to the same random word masquerading as the body, making it difficult to detect. To illustrate, consider these two URLs:
href="http://bob.biff.bonny.bill.betty.boop.example.com"
href="http://andy.able.alex.annie.alice.boop.example.com"
If you look at only one of these, there is no way to know for certain where the randomizing ends. But looking closely at two, one might surmise that the host name is boop.example.com. But one might be wrong. The real host might actually be example.com. To find out which is right, we need to delve briefly into DNS records.
When an HTML-capable mail program needs to connect to a URL's site, it first must look up the address of that site. For both of the sample randomized host names, it would (for example) find the address 192.168.111.44. But look at what happens when we look up the two names we suspect to be the real host names for these URLs:
boop.example.com 192.168.1.23
example.com 192.168.111.44
Clearly, our presumption—that the boop.example.com host was the real host—was wrong. Instead, the real host is example.com because it has the address 192.168.111.44.
But a savvy spammer might anticipate this logic and use a host name that appears random but is actually the real host name.
boop.example.com some innocent's address
example.com some innocent's address
able.alex.andy.boop.example.com 192.168.111.44
Here, if you decide to use the address that you suppose is the real host, you might cause some innocent site's address to be interpreted as that of a spamming site. The correct way to record the spamming address for later use is to look up and record the full domain name in the reference, even if it appears to be random.
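In code, this advice amounts to resisting the urge to trim labels. The sketch below records the whole name exactly as it appeared in the URL. The database is a plain dictionary here, and lookup is any function that maps a name to an address (socket.gethostbyname would be one choice). Both are assumptions made for illustration.

```python
def record_spam_host(database, host, lookup):
    """Record the full host name from the URL together with its address.

    `lookup` is any callable that maps a host name to an address;
    `database` is a plain dict standing in for a real database.
    """
    # Deliberately no attempt to trim "random looking" leading labels.
    name = host.lower().rstrip(".")
    database[name] = lookup(name)
    return database[name]


# A fake lookup table so the example runs without a network.
fake_dns = {"able.alex.andy.boop.example.com": "192.168.111.44"}
db = {}
print(record_spam_host(db, "able.alex.andy.boop.example.com", fake_dns.get))
# 192.168.111.44
```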
2.8.8 CNAME Records and URLs
When an HTML-capable mail program attempts to look up a URL's host or domain, it may receive another host name in return rather than the expected address. That new host name comes from a CNAME (canonical name) record. Here's how it works:
- You find the URL http://example.com.
- You look up the host name example.com.
- You expect an address, such as 192.168.33.44, but instead you get the host name www.example.net.
When you look up a URL's host name and expect an address but instead get another host name in return, you need to do an additional lookup to find the actual address. [15]
example.com www.example.net
www.example.net 192.168.33.44
But CNAMEs can lead to other CNAMEs, thus creating a long thread of potential lookups, and CNAMEs can even form infinite loops:
example.com www.example.net
www.example.net www.example.com
www.example.com example.com An infinite loop!
When you combine the risk of CNAMEs with the need to decipher long host names, the thread of lookups might get ugly indeed:
able.alex.andy.boop.example.com www.example.net
www.example.net boop.example.com
boop.example.com 192.168.111.44
alex.andy.boop.example.com 192.168.22.4
andy.boop.example.com 10.9.4.2
boop.example.com 192.168.111.44
example.com example.net
example.net example.com
example.com example.net
example.net example.com
etc. in an infinite loop
When you write code to decipher a long host name, be sure to account for the possibility of infinite loops.
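One simple defense is to remember every name already visited and to cap the number of hops. The sketch below works on a dictionary of made-up records rather than live DNS, so the record format and the name resolve_chain are assumptions. A real milter would obtain each answer from a resolver library.

```python
def resolve_chain(name, records, max_hops=8):
    """Follow CNAME-style records until an address or a loop is found.

    `records` maps a name to ("CNAME", other_name) or ("A", address).
    Returns the address, or None on a loop, a dead end, or too many hops.
    """
    seen = set()
    for _ in range(max_hops + 1):
        if name in seen:
            return None              # we have been here before: a loop
        seen.add(name)
        kind, value = records.get(name, (None, None))
        if kind == "A":
            return value
        if kind != "CNAME":
            return None              # no record for this name
        name = value
    return None                      # chain longer than max_hops


looping = {
    "example.com": ("CNAME", "www.example.net"),
    "www.example.net": ("CNAME", "www.example.com"),
    "www.example.com": ("CNAME", "example.com"),
}
print(resolve_chain("example.com", looping))  # None
```

The hop limit is a second line of defense against chains that never repeat a name but are still unreasonably long.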
2.8.9 URLs Used as Comments
Just as non-HTML words can be used to create comments (see section 2.5), so can URLs. For example, consider the following:
V<a href="bob.example.com"></a>i<a href="jane.example.com"></a>a<a href="alice.example.com"></a>g<a href="dan.example.com"></a>ra
When viewed on an HTML-aware mail reader, the preceding would appear like this:
Viagra
The use of URLs as comments is intended to make it difficult to find the actual URL in the message. When you look for the URL, it is not enough to simply look for the </a> abutting the reference, because arbitrary nonprint HTML can also appear between the two:
V<a href="bob.example.com"><font size="+5"></font></a>i<a href="jane.example.com"><font color="white"></font></a>
Finding the URL when this technique is used requires your spam scanner to act almost as an actual HTML parser.
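Rather than writing such a parser by hand, a scanner can lean on an existing one. As a rough sketch, Python's standard html.parser can recover both the text a reader would see and every href along the way. The class and function names are hypothetical, and deciding which of the collected links is the real one is left to the caller.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Gather visible text and every href found in <a commands."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

    def handle_data(self, data):
        # Text between tags is what the reader actually sees.
        self.text.append(data)


def visible_text_and_links(html):
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    return "".join(parser.text), parser.links


sample = ('V<a href="bob.example.com"><font size="+5"></font></a>'
          'i<a href="jane.example.com"></a>agra')
print(visible_text_and_links(sample)[0])  # Viagra
```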
2.8.10 JavaScript.Encode URLs
Last here—but certainly not the last word in hiding the URL—is the technique of encoding a URL using JavaScript.Encode. The spammer's idea in this strategy is to wrap the URL in JavaScript so that it will be decoded by the browser. For example, consider the following obscured URL (wrapped to fit the page):
<script language="JScript.Encode">#@~^hQAAAA=3D=3D~@#@&[Km!:+ YcADbYn'E@> !(o"bHA~?"Z'r4OYa)Jz+!+ O, FF+R8*f&^kxV 4YhVr~qq9:C{!P_2&!C:'TPwI)\AAr"92"> '!,j/I}SdqHMxE WE@*@!&qwI)\A@*BbI@#@&AyIAAA==^#~@</script>
Here, the HTML tag, <script, tells the interpreter to decode what follows using the JavaScript.Encode protocol. That protocol will then decode everything between the leading <script and the ending </script>. When decoded, the URL becomes the following:
document.write('<IFRAME src="https://192.168.23.45/link.html" WIDTH=0 HEIGHT=0 FRAMEBORDER=0 SCROLLING=0>')
After decoding, it is clear that the encoded URL contains a web reference that the spammer wants to keep secret. But we can use it to record the URL of the spamming site. Here, that site is represented by the IP address 192.168.23.45, but in other JavaScript.Encoded URLs the reference may be a host name or may be further obscured by other means.
See http://www.virtualconspiracy.com/ for C language source examples of ways to decode JavaScript.Encode.