- 2.1 Representing Ordinary Strings
- 2.2 Representing Strings with Alternate Notations
- 2.3 Using Here-Documents
- 2.4 Finding the Length of a String
- 2.5 Processing a Line at a Time
- 2.6 Processing a Byte at a Time
- 2.7 Performing Specialized String Comparisons
- 2.8 Tokenizing a String
- 2.9 Formatting a String
- 2.10 Using Strings As IO Objects
- 2.11 Controlling Uppercase and Lowercase
- 2.12 Accessing and Assigning Substrings
- 2.13 Substituting in Strings
- 2.14 Searching a String
- 2.15 Converting Between Characters and ASCII Codes
- 2.16 Implicit and Explicit Conversion
- 2.17 Appending an Item Onto a String
- 2.18 Removing Trailing Newlines and Other Characters
- 2.19 Trimming Whitespace from a String
- 2.20 Repeating Strings
- 2.21 Embedding Expressions Within Strings
- 2.22 Delayed Interpolation of Strings
- 2.23 Parsing Comma-Separated Data
- 2.24 Converting Strings to Numbers (Decimal and Otherwise)
- 2.25 Encoding and Decoding rot13 Text
- 2.26 Encrypting Strings
- 2.27 Compressing Strings
- 2.28 Counting Characters in Strings
- 2.29 Reversing a String
- 2.30 Removing Duplicate Characters
- 2.31 Removing Specific Characters
- 2.32 Printing Special Characters
- 2.33 Generating Successive Strings
- 2.34 Calculating a 32-Bit CRC
- 2.35 Calculating the MD5 Hash of a String
- 2.36 Calculating the Levenshtein Distance Between Two Strings
- 2.37 Encoding and Decoding base64 Strings
- 2.38 Encoding and Decoding Strings (uuencode/uudecode)
- 2.39 Expanding and Compressing Tab Characters
- 2.40 Wrapping Lines of Text
- 2.41 Conclusion
2.36 Calculating the Levenshtein Distance Between Two Strings
The concept of distance between strings is important in inductive learning (AI), cryptography, proteins research, and in other areas.
The Levenshtein distance is the minimum number of modifications needed to change one string into another, using three basic modification operations: del(-etion), ins(-ertion), and sub(-stitution). A substitution is also considered to be a combination of a deletion and insertion (indel).
There are various approaches to this, but we will avoid getting too technical. Suffice it to say that this Ruby implementation (in Listing 2.2) allows you to provide optional parameters to set the cost for the three types of modification operations and defaults to a single indel cost basis (cost of insertion = cost of deletion).
Listing 2.2. The Levenshtein Distance
class String def levenshtein(other, ins=2, del=2, sub=1) # ins, del, sub are weighted costs return nil if self.nil? return nil if other.nil? dm = [] # distance matrix # Initialize first row values dm[0] = (0..self.length).collect { |i| i * ins } fill = [0] * (self.length - 1) # Initialize first column values for i in 1..other.length dm[i] = [i * del, fill.flatten] end # populate matrix for i in 1..other.length for j in 1..self.length # critical comparison dm[i][j] = [ dm[i-1][j-1] + (self[j-1] == other[i-1] ? 0 : sub), dm[i][j-1] + ins, dm[i-1][j] + del ].min end end # The last value in matrix is the # Levenshtein distance between the strings dm[other.length][self.length] end end s1 = "ACUGAUGUGA" s2 = "AUGGAA" d1 = s1.levenshtein(s2) # 9 s3 = "pennsylvania" s4 = "pencilvaneya" d2 = s3.levenshtein(s4) # 7 s5 = "abcd" s6 = "abcd" d3 = s5.levenshtein(s6) # 0
Now that we have the Levenshtein distance defined, it's conceivable that we could define a similar? method, giving it a threshold for similarity. For example:
class String def similar?(other, thresh=2) if self.levenshtein(other) < thresh true else false end end end if "polarity".similar?("hilarity") puts "Electricity is funny!" end
Of course, it would also be possible to pass in the three weighted costs to the similar? method so that they could in turn be passed into the levenshtein method. We have omitted these for simplicity.