2.6 Attempting to Fool Signature Detectors
Signature detectors work by computing a value for a block (or chunk) of text in a message. For example, suppose a message contained this text:
We sell herbs at the lowest price you will ever find on the net.
A signature-generating program might reduce this text to a single value that represents it, such as the following:
244372015810742154622705
This "signature" is saved in a file or database. Later, when another message arrives, its chunks are also given signatures. Each signature is then looked up, and, if it is found, that serves as an indication that the message may have been seen before. A single match proves little, so several signatures must match before one message can be considered significantly like another.
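The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how chunks are cut or which function produces the signature, so the five-word chunk size, the truncated SHA-256 digest, and the names chunk_words, signature, and match_ratio are all hypothetical choices made for the example.

```python
import hashlib

def chunk_words(text, size=5):
    """Split text into consecutive chunks of `size` words (assumed chunking rule)."""
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def signature(chunk):
    """Stand-in signature: a truncated SHA-256 hex digest (an assumption)."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()[:24]

def signatures(text, size=5):
    """Return the set of signatures for every chunk of the text."""
    return {signature(chunk) for chunk in chunk_words(text, size)}

def match_ratio(text, known):
    """Fraction of this message's chunk signatures already stored in `known`."""
    sigs = signatures(text)
    return len(sigs & known) / len(sigs) if sigs else 0.0

# The "file or database" is modeled here as an in-memory set.
known = set()
known |= signatures("We sell herbs at the lowest price you will ever find on the net.")

seen_before = match_ratio(
    "We sell herbs at the lowest price you will ever find on the net.", known)
unrelated = match_ratio("Lunch is at noon on Friday in the main hall.", known)
```

A message would be flagged only when match_ratio exceeds some threshold, which reflects the requirement that several signatures match rather than just one.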
Related spam detection software recognizes phonemes (the distinct sound units that make up words) instead of chunks of words. Other software permutes the divided text to increase the number of signatures used, and still other software performs statistical analysis on individual words and then stores the resulting probabilities.
All these forms of spam detection, however, share one method: they examine the message's text. Spammers, recognizing that text analysis is being used, have responded by adding large chunks of random text to each message.
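To see why this padding works against chunk signatures, consider the following rough Python sketch. It assumes five-word chunks and SHA-256 digests, neither of which comes from the text, and the helper names chunk_signatures, overlap, and random_padding are hypothetical. The pitch itself still matches, but the random chunks dilute the overall match.

```python
import hashlib
import random

def chunk_signatures(text, size=5):
    """Signatures for consecutive `size`-word chunks (assumed chunking rule)."""
    words = text.lower().split()
    chunks = (" ".join(words[i:i + size]) for i in range(0, len(words), size))
    return {hashlib.sha256(c.encode("utf-8")).hexdigest() for c in chunks}

def overlap(message, known):
    """Fraction of the message's chunk signatures that are already known."""
    sigs = chunk_signatures(message)
    return len(sigs & known) / len(sigs) if sigs else 0.0

def random_padding(rng, n_words, length=8):
    """Build `n_words` random lowercase 'words', as a spammer's tool might."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    return " ".join(
        "".join(rng.choice(letters) for _ in range(length)) for _ in range(n_words)
    )

pitch = "We sell herbs at the lowest price you will ever find on the net today"
known = chunk_signatures(pitch)          # signatures saved from an earlier copy

rng = random.Random(0)                   # fixed seed so the run is repeatable
padded = pitch + " " + random_padding(rng, 45)

clean_score = overlap(pitch, known)      # every chunk matches
padded_score = overlap(padded, known)    # 3 matching chunks out of 12
```

With padding three times the length of the pitch, only a quarter of the signatures match, which may fall below whatever threshold the detector uses.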
Random text can be actual words and names in random order:
dissonant deanna heron aphasia restaurateur circulate controllable corporeal cranston giuliano helmholtz bertha albany shank eye asphyxiate commentary gaston aide filler chipboard prostheses perturb cryptographer atlantic bernice
Random text can also be random combinations of characters:
eyhydxre yaceyaxv gesmveu vmlpv wmgrxa drgcah mqbjneq wbfqzkmwr fdbkqogtgzwv lsunhut wuwnphivrkef dhdpfhcu ndowgkx cjxrofun yepjhxp rhbxag ncgvmv
Random text can also be a solid stream of random characters:
hyfaqjimgdalmrymmolaktivajvctikdhpfzaplgumufsvtjgu tccqenngjwtodktenkrvefpmkiherqymsccysqfbmapkkvxuo tauimesuijmivglyefqlgclxvyjsxfgsfadrhvnrhzacfncmssx awlzrjilipsbuuenbbdtievlmkpycivegidatnlccffyajnbmqw
Finally, random text can also be an excerpt from real text, with a different excerpt used in each message [9]:
Mother called me home that night with a shout that told me there was trouble. "Mom," I yelled, pounding up the back steps. "What's wrong, Mom?"
When you parse a message to detect spam, your goal is to find a way to skip such random text and to run signatures only on the portions of the message that do not change. Portions of a message that should be checked include the following:
- Common images
- Web references
- Email addresses
- Phone numbers
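One way to act on this list is to pull out only the unchanging items and compute signatures on those, ignoring everything else in the body. The Python sketch below is a simplified illustration, not a complete parser: the regular expressions are deliberately loose, the phone pattern assumes ten-digit North American numbers, images are covered only when they appear as absolute URLs, and the names stable_features and feature_signatures are hypothetical.

```python
import hashlib
import re

# Loose, illustrative patterns; real messages need more careful parsing.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")

def stable_features(body):
    """Collect normalized URLs, email addresses, and phone numbers from a body."""
    feats = set()
    # Lowercasing the whole URL is a simplification; paths can be case-sensitive.
    feats.update("url:" + u.lower().rstrip(".,)") for u in URL_RE.findall(body))
    feats.update("email:" + e.lower() for e in EMAIL_RE.findall(body))
    # Keep digits only so that differently punctuated numbers compare equal.
    feats.update("phone:" + re.sub(r"\D", "", p) for p in PHONE_RE.findall(body))
    return feats

def feature_signatures(body):
    """Signatures computed on the stable features only (truncated SHA-256)."""
    return {
        hashlib.sha256(f.encode("utf-8")).hexdigest()[:24]
        for f in stable_features(body)
    }

msg_a = ("dissonant deanna heron Visit http://herbs.example.com/buy now "
         "or call (555) 123-4567. aphasia Orders: sales@herbs.example.com")
msg_b = ("eyhydxre yaceyaxv Call 555.123.4567 today gesmveu "
         "http://herbs.example.com/buy vmlpv SALES@herbs.example.com")

same_campaign = feature_signatures(msg_a) == feature_signatures(msg_b)
```

The two sample messages carry entirely different random text, yet they yield the same signatures because the Web reference, email address, and phone number are the same in both.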