Working with Strings
- 2.1 Representing Ordinary Strings
- 2.2 Representing Strings with Alternate Notations
- 2.3 Using Here-Documents
- 2.4 Finding the Length of a String
- 2.5 Processing a Line at a Time
- 2.6 Processing a Byte at a Time
- 2.7 Performing Specialized String Comparisons
- 2.8 Tokenizing a String
- 2.9 Formatting a String
- 2.10 Using Strings As IO Objects
- 2.11 Controlling Uppercase and Lowercase
- 2.12 Accessing and Assigning Substrings
- 2.13 Substituting in Strings
- 2.14 Searching a String
- 2.15 Converting Between Characters and ASCII Codes
- 2.16 Implicit and Explicit Conversion
- 2.17 Appending an Item Onto a String
- 2.18 Removing Trailing Newlines and Other Characters
- 2.19 Trimming Whitespace from a String
- 2.20 Repeating Strings
- 2.21 Embedding Expressions Within Strings
- 2.22 Delayed Interpolation of Strings
- 2.23 Parsing Comma-Separated Data
- 2.24 Converting Strings to Numbers (Decimal and Otherwise)
- 2.25 Encoding and Decoding rot13 Text
- 2.26 Encrypting Strings
- 2.27 Compressing Strings
- 2.28 Counting Characters in Strings
- 2.29 Reversing a String
- 2.30 Removing Duplicate Characters
- 2.31 Removing Specific Characters
- 2.32 Printing Special Characters
- 2.33 Generating Successive Strings
- 2.34 Calculating a 32-Bit CRC
- 2.35 Calculating the MD5 Hash of a String
- 2.36 Calculating the Levenshtein Distance Between Two Strings
- 2.37 Encoding and Decoding base64 Strings
- 2.38 Encoding and Decoding Strings (uuencode/uudecode)
- 2.39 Expanding and Compressing Tab Characters
- 2.40 Wrapping Lines of Text
- 2.41 Conclusion
Atoms were once thought to be fundamental, elementary building blocks of nature; protons were then thought to be fundamental, then quarks. Now we say the string is fundamental.
—David Gross, professor of theoretical physics, Princeton University
A computer science professor in the early 1980s started out his data structures class with a single question. He didn't introduce himself or state the name of the course; he didn't hand out a syllabus or give the name of the textbook. He walked to the front of the class and asked, "What is the most important data type?"
There were one or two guesses. Someone guessed "pointers," and he brightened but said no, that wasn't it. Then he offered his opinion: The most important data type was character data.
He had a valid point. Computers are supposed to be our servants, not our masters, and character data has the distinction of being human readable. (Some humans can read binary data easily, but we will ignore them.) The existence of characters (and thus strings) enables communication between humans and computers. Every kind of information we can imagine, including natural language text, can be encoded in character strings.
A string, as in other languages, is simply a sequence of characters. Like most entities in Ruby, strings are first-class objects. In everyday programming, we need to manipulate strings in many ways. We want to concatenate strings, tokenize them, analyze them, perform searches and substitutions, and more. Ruby makes most of these tasks easy.
Most of this chapter assumes that a byte is a character. In an internationalized environment, this is not really true. For issues involved with internationalization, refer to Chapter 4, "Internationalization in Ruby."
2.1 Representing Ordinary Strings
A string in Ruby is simply a sequence of 8-bit bytes. It is not null-terminated as in C, so it can contain null characters. It may contain bytes above 0x7F, but such strings are meaningful only if a particular character set (encoding) is assumed. (For more information on encodings, refer to Chapter 4.)
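As a minimal sketch of these two points, assuming Ruby 2.0 or later (for String#b), the following shows a string carrying an embedded null and a non-ASCII byte that becomes meaningful only once an encoding is assumed. The method names null_demo and latin1_to_utf8 are hypothetical, and ISO-8859-1 is chosen purely for illustration:

```ruby
# Length and fourth byte of a string with an embedded null.
# The null does not end the string, as it would in C.
def null_demo
  s = "abc\0def"
  [s.length, s.bytes[3]]
end

# Interprets the single byte 0xE9 as ISO-8859-1 and converts it to UTF-8.
# ISO-8859-1 is an assumption made for this example only.
def latin1_to_utf8
  "\xE9".b.force_encoding("ISO-8859-1").encode("UTF-8")
end

p null_demo                    # [7, 0]
p latin1_to_utf8 == "\u00E9"   # true
```

Under a different assumed encoding, the same byte would map to a different character or to none at all.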
The simplest string in Ruby is single-quoted. Such a string is taken absolutely literally; the only escape sequences recognized are the single quote (\') and the escaped backslash itself (\\):
s1 = 'This is a string'    # This is a string
s2 = 'Mrs. O\'Leary'       # Mrs. O'Leary
s3 = 'Look in C:\\TEMP'    # Look in C:\TEMP
A double-quoted string is more versatile. It allows many more escape sequences, such as backspace, tab, carriage return, and linefeed. It also allows control characters to be embedded as octal numbers:
s1 = "This is a tab: (\t)"
s2 = "Some backspaces: xyz\b\b\b"
s3 = "This is also a tab: \011"
s4 = "And these are both bells: \a \007"
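One way to see the difference between the two quoting styles is to compare lengths. The sketch below is illustrative only, and the method names quote_lengths and same_escapes? are hypothetical:

```ruby
# Lengths of the same source text, \t, written once in single quotes
# (a backslash followed by the letter t) and once in double quotes (one tab).
def quote_lengths
  ['\t'.length, "\t".length]
end

# Checks that the octal escapes match the named escapes for tab and bell.
def same_escapes?
  "\011" == "\t" && "\007" == "\a"
end

p quote_lengths   # [2, 1]
p same_escapes?   # true
```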
Double-quoted strings also allow expressions to be embedded inside them. See section 2.21, "Embedding Expressions Within Strings."
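As a brief preview, here is a minimal sketch of such embedding. The method names greeting and literal_text and the sample values are hypothetical:

```ruby
# Builds a string with two embedded expressions; each #{...} is
# evaluated and its result is inserted into the string.
def greeting(name, count)
  "Hello, #{name}! You have #{count + 1} messages."
end

# In a single-quoted string the same notation is left untouched.
def literal_text
  'Total: #{1 + 2}'
end

puts greeting("Matz", 2)   # Hello, Matz! You have 3 messages.
puts literal_text          # Total: #{1 + 2}
```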