Regular Expressions 102: Text Translation in C++
- The Most Useful Regular-Expression Characters
- Escape Characters and Other Pitfalls
- Replacing Text: Some Simple Examples
- A Few More Words
If you've programmed in C++ or another popular programming language for a while, you probably know how to perform complex text-translation tasks, such as reformatting source code from one HTML format to another. You can always accomplish such tasks by examining one character at a time and making a series of decisions. But how would you like to perform complex text translation by specifying a couple of text patterns and then just calling a function? That's possible with the new Regular Expression library included in the C++0x specification (since published as C++11, where the library is provided by the <regex> header). In this article, I'll focus on text translation by search-and-replace.
The Most Useful Regular-Expression Characters
First let's do some review. My previous article "Regular Expressions 101" explained the basic use of the C++ Regular Expression library. The following table reviews some of those basics.
Special Character(s) | Matches
. | Any one character.
[range] | A range of characters such as [a-z], which matches any one lowercase letter, or [0-9], which matches any one digit. You can also build complex ranges, as in [abm-z0-9], which matches any one of the following: a, b, any lowercase letter from m to z inclusive, or a digit.
expression* | An expression repeated zero or more times. As I explain shortly, the asterisk (*) is an expression modifier, not a separate expression.
expression+ | An expression repeated one or more times.
(expression) | A group.
The last syntax element in the table, the group, is important as a way of specifying which characters to repeat. Groups are also essential to search-and-replace operations, which I'll delve into later in this article.
These five syntax elements are a relatively small subset of the full regular-expression syntax, but they're by far the most widely used. With these elements, you can specify a wide variety of patterns.
Subtleties of the Repeat Operators
The repeat operators, * and +, involve a subtlety that many manuals and articles on regular expressions gloss over or misstate. Consider the following regular expression:
ca*t
To understand this pattern, you should first note that, by default, a character without special meaning is interpreted literally. That is, usually a character "is what it is." In the regular expression ca*t, a regular-expression function first looks for c in the input string:
c
The function doesn't then try to match a—or at least, not exactly! Instead, the characters a* are taken together to form a sub-pattern that says "Match zero or more copies of the letter a." Therefore, the expression ca*t matches any of the following strings:
ct cat caat caaat
And so on. It might seem reasonable to say that a* means "Match a and then match zero or more copies of a," but that's not how it works: because zero copies are allowed, ct matches even though it contains no a at all.
Understanding Groups
Finally, let's review the purpose of groups. The repeat operators, * and +, apply to just one character—the character that precedes them—unless the preceding characters are in a range or in a group. For example, what does the following match?
bana(na)+
The final occurrence of na is in a group. As special characters, the parentheses are not matched literally, but rather are used to specify the group. So if you call a function that tries to match this pattern, the function tries to match characters in the input string in this order:
- Match bana exactly once.
- Match one or more occurrences of na. The use of the asterisk (*) would cause matching of zero or more copies instead.
So any of the following strings constitutes a match:
banana bananana banananana bananananana
and so on.