- Introducing Regular Expressions
- Some Regular Expression Rules
- Programming Regular Expressions in C++
- Moving Forward
Some Regular Expression Rules
Since people are familiar with the * and possibly the ? patterns in filename matching, I’ll start with the equivalent in regular expressions. To match any single character in regular expressions (same as the ? in filename matching), you use a single period (.) character. In other words, a single period matches any single character. The following strings (or any other single character) will match the . pattern:
a b 1 2
Now consider this pattern:
.*
This pattern matches any number of characters, thus behaving similarly to the * pattern. The following strings, for example, will match this pattern:
abc abc123 123 Four score and seven years ago
As you can see, any string whatsoever will match this pattern. The idea is that the . character matches any single character, and when you tack on the * character, you’re saying that the . character can be repeated any number of times (including zero) for a match.
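To try the . and .* patterns in code, here is a minimal sketch. It assumes the C++11 standard library header <regex> with its default ECMAScript grammar, which may differ in small ways from the library this article uses later. The helper name matches_whole is hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns true only when the ENTIRE text matches
// the pattern (std::regex_match never accepts a partial match).
// Note: in this grammar, '.' matches any character except line breaks.
bool matches_whole(const std::string& pattern, const std::string& text) {
    // Compiling the regex on every call is slow but keeps the sketch short.
    return std::regex_match(text, std::regex(pattern));
}
```

Under these assumptions, matches_whole(".", "a") is true and matches_whole(".", "ab") is false, while matches_whole(".*", text) is true for any single-line text, including the empty string.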
In filename patterns, letters, numbers, and other non-pattern characters (such as everything but the two ? characters in the filename pattern abc??.txt) are called literals. The same is true in regular expressions. Thus, suppose you want a regular expression that works the same way as this filename pattern:
a??c
You would write it like this:
a..c
This pattern matches any string that is four characters long, starts with an a and ends with a c. Here are some strings that would match this pattern:
abcc a12c a  c
In the last one, there are two spaces between the a and the c.
Next, consider the following filename pattern:
abc*
This pattern matches any file starting with abc followed by anything whatsoever. Here’s the equivalent regular expression:
abc.*
The following example strings would match:
abc123 abcxyz abcz abc
Notice that any string starting with abc will match, even the string consisting of only abc.
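The translation from filename patterns to regular expressions described so far is mechanical enough to sketch in code. The following is only an illustration, assuming the standard <regex> header and handling just the ? and * wildcards (not bracket sets or other shell features). The names glob_to_regex and glob_matches are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: rewrites a filename pattern as a regular expression.
//   ?  becomes  .     (any single character)
//   *  becomes  .*    (any run of characters, possibly empty)
// Every other character is a literal; regex-special ones get a backslash.
std::string glob_to_regex(const std::string& glob) {
    const std::string special = "\\^$.|+()[]{}";
    std::string out;
    for (char ch : glob) {
        if (ch == '?') {
            out += '.';
        } else if (ch == '*') {
            out += ".*";
        } else {
            if (special.find(ch) != std::string::npos) {
                out += '\\';  // keep characters such as '.' literal
            }
            out += ch;
        }
    }
    return out;
}

// Hypothetical helper: true when the whole name matches the filename pattern.
bool glob_matches(const std::string& glob, const std::string& name) {
    return std::regex_match(name, std::regex(glob_to_regex(glob)));
}
```

With this sketch, glob_to_regex("a??c") yields a..c and glob_to_regex("abc*") yields abc.*, the same expressions worked out by hand above.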
Here’s another example. Consider this pattern:
a.*b
The following are examples of strings that will match this pattern:
ab a123b abbbb axyzb
Again, notice the first example, in which no characters are present between the a and the b. The .* pattern means any characters, including no characters.
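When you test a pattern like a.*b in code, it helps to know whether the library checks the whole string or looks for a match anywhere inside it. The sketch below assumes the standard <regex> header, where std::regex_match requires the whole string to match and std::regex_search accepts a match anywhere. The function names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: the entire text must match a.*b.
bool whole_string_matches(const std::string& text) {
    static const std::regex pattern("a.*b");
    return std::regex_match(text, pattern);
}

// Hypothetical helper: some substring of text must match a.*b.
bool contains_match(const std::string& text) {
    static const std::regex pattern("a.*b");
    return std::regex_search(text, pattern);
}
```

Under these assumptions, ab and a123b pass both checks, while a string such as xaby passes only contains_match, because the x and y fall outside the matched part.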
Repeating Matches
In the preceding patterns, you may have noticed that unlike with filename patterns, matching more than one character requires two characters in the pattern—the period followed by the asterisk (.*). That’s because the * character means that the preceding character can repeat any number of times (including zero or no times) to get a match.
Regular expressions use various characters to signify repeating matches. While the * character means that a character can match any number of times, including zero, the plus (+) character indicates that the preceding character must match one or more times (but not zero times). Take this pattern, for example:
ab+c
The following strings will match this pattern:
abc abbc abbbbbbbbc
But ac will not match because at least one b must be present between a and c. In the case of the pattern ab*c, all of the preceding examples will match, as will ac.
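The difference between + and * comes down to the zero-repetition case, which a short sketch can isolate. It assumes the standard <regex> header, and the helper names are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when text matches pattern from start to end.
bool matches_whole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// Hypothetical helper: true when text matches ab*c but not ab+c,
// which can only happen when there are zero b's between a and c.
bool only_star_matches(const std::string& text) {
    return matches_whole("ab*c", text) && !matches_whole("ab+c", text);
}
```

Here only_star_matches("ac") is true, while strings with one or more b's between the a and the c, such as abc or abbc, match both patterns and so return false.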
If you want to match a character either zero times or one time (but not more than one), use the ? character. Consider this pattern:
ab?c
Following are the only strings that will match this pattern:
ac abc
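One way to confirm which strings a pattern accepts is to run a list of candidates through it and keep the ones that match. The following sketch assumes the standard <regex> header, and filter_matches is a hypothetical name.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns the candidates that match the pattern
// in full, preserving their original order.
std::vector<std::string> filter_matches(
        const std::string& pattern,
        const std::vector<std::string>& candidates) {
    const std::regex re(pattern);  // compile once, reuse for every candidate
    std::vector<std::string> kept;
    for (const std::string& s : candidates) {
        if (std::regex_match(s, re)) {
            kept.push_back(s);
        }
    }
    return kept;
}
```

Calling filter_matches("ab?c", {"ac", "abc", "abbc", "ab"}) keeps only ac and abc, since the ? allows at most one b.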
If you want to be absolutely precise, you can specify how many times you allow a character to match. To do so, place opening and closing braces ({}) following the character, and put inside the braces either a number for an exact amount, or two numbers separated by a comma for a range. For example:
ab{10}c
matches b exactly 10 times, and thus only this string will match:
abbbbbbbbbbc
This pattern will match b two, three, four, or five times:
ab{2,5}c
Thus, only these strings will match:
abbc abbbc abbbbc abbbbbc
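Counting b's by eye gets tedious, so here is a sketch that builds the test strings programmatically and reports which repetition counts a pattern accepts. It assumes the standard <regex> header, and matching_b_counts is a hypothetical name.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: for n = 0..max_count, builds "a" + n b's + "c"
// and returns every n for which the whole string matches the pattern.
std::vector<int> matching_b_counts(const std::string& pattern, int max_count) {
    const std::regex re(pattern);
    std::vector<int> counts;
    for (int n = 0; n <= max_count; ++n) {
        const std::string candidate =
            "a" + std::string(static_cast<std::size_t>(n), 'b') + "c";
        if (std::regex_match(candidate, re)) {
            counts.push_back(n);
        }
    }
    return counts;
}
```

With this sketch, matching_b_counts("ab{2,5}c", 8) returns 2, 3, 4, and 5, and matching_b_counts("ab{10}c", 12) returns only 10.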
Treating Special Characters as Literals
What if you want to match an asterisk, a question mark, a period, or any of the special characters? Just put a backslash before it. For example, the following pattern:
a\*b
matches only one string:
a*b
And this pattern:
a\.b*
matches any of the following strings, for example:
a. a.b a.bb a.bbb
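One wrinkle when you write these patterns in C++ source: the backslash is also the escape character in ordinary string literals, so it must be doubled there, or you can use a C++11 raw string literal instead. The sketch below assumes the standard <regex> header. The names matches_whole, escaped_star, and escaped_dot are hypothetical.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when text matches pattern from start to end.
bool matches_whole(const std::string& pattern, const std::string& text) {
    return std::regex_match(text, std::regex(pattern));
}

// Raw string literal: the pattern appears exactly as written in the text.
const std::string escaped_star = R"(a\*b)";

// Ordinary string literal: the backslash must be doubled to reach the regex.
const std::string escaped_dot = "a\\.b*";
```

Under these assumptions, escaped_star matches only the string a*b, whereas the unescaped pattern a*b would match strings such as aab. Likewise, escaped_dot matches a. and a.bb but not axb, because the escaped period no longer stands for any character.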
Lots More About the Syntax
Perl-style regular expressions can do far more than I’ve described so far. If you want to learn more about the syntax, here are some places to go. First, check out the documentation for the Boost C++ library that I’m going to use in the next section. Next, check out the official Perl documentation on regular expressions. Finally, if you really want to get serious about regular expressions, most people agree that the best book available is Mastering Regular Expressions by Jeffrey Friedl (O’Reilly, 2002).
Now let’s look at some C++ code to implement regular expressions.