Working with Strings in Ruby
- 2.1 Representing Ordinary Strings
- 2.2 Representing Strings with Alternate Notations
- 2.3 Using Here-Documents
- 2.4 Finding the Length of a String
- 2.5 Processing a Line at a Time
- 2.6 Processing a Character or Byte at a Time
- 2.7 Performing Specialized String Comparisons
- 2.8 Tokenizing a String
- 2.9 Formatting a String
- 2.10 Using Strings as IO Objects
- 2.11 Controlling Uppercase and Lowercase
- 2.12 Accessing and Assigning Substrings
- 2.13 Substituting in Strings
- 2.14 Searching a String
- 2.15 Converting Between Characters and ASCII Codes
- 2.16 Implicit and Explicit Conversion
- 2.17 Appending an Item onto a String
- 2.18 Removing Trailing Newlines and Other Characters
- 2.19 Trimming Whitespace from a String
- 2.20 Repeating Strings
- 2.21 Embedding Expressions within Strings
- 2.22 Delayed Interpolation of Strings
- 2.23 Parsing Comma-Separated Data
- 2.24 Converting Strings to Numbers (Decimal and Otherwise)
- 2.25 Encoding and Decoding rot13 Text
- 2.26 Encrypting Strings
- 2.27 Compressing Strings
- 2.28 Counting Characters in Strings
- 2.29 Reversing a String
- 2.30 Removing Duplicate Characters
- 2.31 Removing Specific Characters
- 2.32 Printing Special Characters
- 2.33 Generating Successive Strings
- 2.34 Calculating a 32-Bit CRC
- 2.35 Calculating the SHA-256 Hash of a String
- 2.36 Calculating the Levenshtein Distance Between Two Strings
- 2.37 Encoding and Decoding Base64 Strings
- 2.38 Expanding and Compressing Tab Characters
- 2.39 Wrapping Lines of Text
- 2.40 Conclusion
- Atoms were once thought to be fundamental, elementary building blocks of nature; protons were then thought to be fundamental, then quarks. Now we say the string is fundamental.
- —David Gross, professor of theoretical physics, Princeton University
A computer science professor in the early 1980s started out his data structures class with a single question. He didn’t introduce himself or state the name of the course; he didn’t hand out a syllabus or give the name of the textbook. He walked to the front of the class and asked, “What is the most important data type?”
There were one or two guesses. Someone guessed “pointers,” and he brightened but said no, that wasn’t it. Then he offered his opinion: The most important data type was character data.
He had a valid point. Computers are supposed to be our servants, not our masters, and character data has the distinction of being human readable. (Some humans can read binary data easily, but we will ignore them.) The existence of characters (and therefore strings) enables communication between humans and computers. Every kind of information we can imagine, including natural language text, can be encoded in character strings.
A string is simply a sequence of characters. Like most entities in Ruby, strings are first-class objects. In everyday programming, we need to manipulate strings in many ways. We want to concatenate strings, tokenize them, analyze them, perform searches and substitutions, and more. Ruby makes most of these tasks easy.
For much of the history of Ruby, a single byte was considered a character. That assumption does not hold for accented and other special characters, emoji, and most non-Latin scripts, which need more than one byte per character in encodings such as UTF-8. For a more detailed discussion of the ways that bytes and characters are often not the same, refer to Chapter 4, “Internationalization in Ruby.”
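As a rough illustration of the difference between bytes and characters, the sketch below compares the two counts for a few UTF-8 strings. The helper name char_and_byte_counts and the sample strings are assumptions made for this example, not part of the text.

```ruby
# Report how many characters and how many bytes a string holds.
# (char_and_byte_counts is a hypothetical helper for illustration.)
def char_and_byte_counts(str)
  { chars: str.length, bytes: str.bytesize }
end

p char_and_byte_counts("abc")        # plain ASCII: one byte per character
p char_and_byte_counts("caf\u00E9")  # the accented e takes two bytes in UTF-8
p char_and_byte_counts("\u2603")     # the snowman takes three bytes in UTF-8
```

Only the first string has equal counts; the other two show why a byte count cannot stand in for a character count.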
2.1 Representing Ordinary Strings
A string in Ruby is simply a sequence of 8-bit bytes. It is not null terminated as in C, so it may contain null characters. Strings containing bytes above 0x7F are always legal, but are only meaningful in non-ASCII encodings. By default, strings are assumed to use the UTF-8 encoding. Before Ruby 2.0, they were assumed to be simple ASCII. (For more information on encodings, refer to Chapter 4.)
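The following is a minimal sketch of these points, assuming a UTF-8 source file. It shows that a null byte is ordinary data and that the same high-bit bytes are meaningful under one encoding label but not under another. The helper hex_bytes is hypothetical and exists only for this example.

```ruby
# Render a string's raw bytes as two-digit hex.
# (hex_bytes is a hypothetical helper for illustration.)
def hex_bytes(str)
  str.bytes.map { |b| format("%02X", b) }.join(" ")
end

# A NUL byte is ordinary data, not a terminator.
embedded_nul = "abc\0def"
puts embedded_nul.length      # 7
puts hex_bytes(embedded_nul)  # 61 62 63 00 64 65 66

# The same three high-bit bytes, labeled with two different encodings.
raw   = "\xE2\x98\x83".b                           # binary copy of the bytes
utf8  = raw.dup.force_encoding(Encoding::UTF_8)
ascii = raw.dup.force_encoding(Encoding::US_ASCII)

puts utf8.valid_encoding?     # true: the bytes form one UTF-8 character
puts ascii.valid_encoding?    # false: US-ASCII has no bytes with the high bit set
puts utf8 == "\u2603"         # true
```

Note that force_encoding only relabels the bytes; it never changes them, which is why the string stays legal either way.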
The simplest string in Ruby is single quoted. Such a string is taken absolutely literally; the only escape sequences recognized are the single quote (\') and the escaped backslash itself (\\). Here are some examples:
    s1 = 'This is a string'    # This is a string
    s2 = 'Mrs. O\'Leary'       # Mrs. O'Leary
    s3 = 'Look in C:\\TEMP'    # Look in C:\TEMP
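To make "taken absolutely literally" concrete, here is a small sketch contrasting how a backslash sequence such as \n is treated in the two quoting styles. The sample strings are chosen for illustration only.

```ruby
# Single quotes: the backslash in \n stays a literal backslash.
puts 'one\ntwo'            # prints one\ntwo on a single line
puts 'one\ntwo'.length     # 8

# Double quotes, for contrast: \n becomes a single newline character.
puts "one\ntwo".length     # 7

# Only \' and \\ are special inside single quotes.
puts 'Mrs. O\'Leary' == "Mrs. O'Leary"   # true
puts 'C:\\TEMP' == 'C:\TEMP'             # true: \T is not an escape
```

The last line works only because a backslash before any other character is kept as is. Writing the doubled form is still the safer habit.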
A double-quoted string is more versatile. It allows many more escape sequences, such as backspace, tab, carriage return, and linefeed. It allows control characters to be embedded as octal numbers, and Unicode code points to be embedded via their hexadecimal reference number. Consider these examples:
    s1 = "This is a tab: (\t)"
    s2 = "Some backspaces: xyz\b\b\b"
    s3 = "This is also a tab: \011"
    s4 = "And these are both bells: \a \007"
    s5 = "This is the unicode snowman: \u2603"
Depending on the encodings in effect, non-ASCII characters may be shown “backslash escaped” when their string is inspected, but they will print normally. Double-quoted strings also allow expressions to be embedded inside them. See Section 2.21, “Embedding Expressions within Strings.”
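The escape sequences above can be checked with a short sketch like the following. It is only an illustration: it uses String#dump rather than inspect for the escaped form, because dump gives an ASCII-only result regardless of the encoding settings in effect.

```ruby
# Named escapes and octal escapes produce the same characters.
puts "\t" == "\011"          # true: both are a tab (code 9)
puts "\a" == "\007"          # true: both are a bell (code 7)

# \u takes a hexadecimal code point.
snowman = "\u2603"
puts snowman.ord.to_s(16)    # 2603

# puts writes the character itself; dump returns an escaped, ASCII-only form.
puts snowman
puts snowman.dump

# A brief taste of expression embedding (see Section 2.21).
puts "Code point in decimal: #{snowman.ord}"   # Code point in decimal: 9731
```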