String Pattern-Matching
A common operation in computing is to determine whether a string of characters matches some desired pattern. Common patterns include URLs, phone numbers, identification numbers, source code in editors, and content in web pages. That is a wide variety of strings, and worse yet, a single application might need to match all of them.
You could write a whole bunch of code for each of the matching scenarios I just described. Or you could simply use the static IsMatch method of the Regex class, defined in the System.Text.RegularExpressions namespace. This is an excellent tool for validating text input. Visual Studio .NET provides a tremendous amount of help information; instead of competing with Visual Studio .NET as a reference for regular expressions, I'll just demonstrate a few regular expressions here, and encourage you to explore the help information for more examples.
The form of the invocation to IsMatch is as follows:
Regex.IsMatch(inputString, regularExpressionString)
IsMatch returns a Boolean indicating whether the input string matches the regular expression. Several examples are provided in the following sections.
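As a minimal sketch of that call, the following console program tests two sample strings against a trivial pattern. The class name, the ContainsDigit helper, and the sample inputs are illustrative assumptions and are not part of the .NET Framework.

```csharp
using System;
using System.Text.RegularExpressions;

public static class IsMatchDemo
{
    // Hypothetical helper: true if the input contains at least one digit.
    public static bool ContainsDigit(string input)
    {
        // First argument is the input string, second is the regular expression.
        return Regex.IsMatch(input, "[0-9]");
    }

    public static void Main()
    {
        Console.WriteLine(ContainsDigit("Room 101"));   // True
        Console.WriteLine(ContainsDigit("No numbers")); // False
    }
}
```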
Escape Characters and @
Regular expressions are composed of special and escaped characters and literal values. The special and escaped characters make up the regular expression grammar. Writing a regular expression is programming; the regular expression language is terse, but it's still programming.
In C#, when you include a backslash (\) followed by a character in a string literal, C# interprets the pair as an escape sequence. For example, \r\n is the carriage return and newline pair. If you want to use the literal values and prevent escaping, you precede the string with the "at" symbol (@), which creates a verbatim string literal, like this:
@"c:\temp\myfile.txt"
Without the @, C# would escape the \t in the line above, treating it as a tab. The \m is not a recognized escape sequence, so the compiler would reject the string. If you want to treat backslashes as literal backslashes, you use @ or escape the backslash (\\). This double backslash means to use the literal backslash character (\). Here's the same file path as above, written using the double backslash:
"c:\\temp\\myfile.txt"
It's important to be aware of this distinction when working with file paths and regular expressions. A large part of the regular expression grammar is composed of escaped characters, and occasionally you may want to use a literal backslash in your expression.
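The following sketch shows that the verbatim and escaped forms produce identical strings, for a file path and for a regular expression alike. The class and field names are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EscapeDemo
{
    // The same file path written as a verbatim literal and with escaped backslashes.
    public static readonly string VerbatimPath = @"c:\temp\myfile.txt";
    public static readonly string EscapedPath = "c:\\temp\\myfile.txt";

    // The same regular expression (two digits) written both ways.
    public static readonly string VerbatimPattern = @"\d\d";
    public static readonly string EscapedPattern = "\\d\\d";

    // In a regular expression, \\ matches one literal backslash.
    public static readonly string TempFolderPattern = @"\\temp\\";

    public static void Main()
    {
        Console.WriteLine(VerbatimPath == EscapedPath);                    // True
        Console.WriteLine(VerbatimPattern == EscapedPattern);              // True
        Console.WriteLine(Regex.IsMatch("42", VerbatimPattern));           // True
        Console.WriteLine(Regex.IsMatch(VerbatimPath, TempFolderPattern)); // True
    }
}
```

The verbatim form is usually easier to read for regular expressions, because each backslash in the pattern appears exactly once in the source code.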
Matching Repeating Digits
The easiest way to create a regular expression is to place a single literal or escape value for every value that you want to match. For example, \d represents a digit. Thus, you could use a number of \d values to match a string of digits in the input string. Here's an example of invoking the Regex.IsMatch static method to look for a pattern of digits that represents a U.S. telephone number (without the area code):
using System.Text.RegularExpressions;
Regex.IsMatch("555-1212", @"\d\d\d-\d\d\d\d");
If a sequence of digits anywhere in the input string matches the phone number, IsMatch returns true.
The preceding is a simple regular expression. Expressions can become very complex, but like code, should be only as complex as they need to be. You can shorten the local U.S. phone number by using the repeat quantifier syntax. A three-hyphen-four phone number sequence can be rewritten as follows:
@"\d{3}-\d{4}"
Any string containing three digits, a hyphen, and four more digits would return a true value for IsMatch.
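To compare the two spellings side by side, here is a small sketch that runs the long form and the quantifier form against the same inputs. The class, constant, and method names are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatDemo
{
    public const string LongForm = @"\d\d\d-\d\d\d\d";
    public const string ShortForm = @"\d{3}-\d{4}";

    // Hypothetical helper: true only if both forms agree that the input matches.
    public static bool ContainsLocalPhone(string input)
    {
        bool longResult = Regex.IsMatch(input, LongForm);
        bool shortResult = Regex.IsMatch(input, ShortForm);
        if (longResult != shortResult)
        {
            throw new InvalidOperationException("The two forms should be equivalent.");
        }
        return shortResult;
    }

    public static void Main()
    {
        Console.WriteLine(ContainsLocalPhone("555-1212"));            // True
        Console.WriteLine(ContainsLocalPhone("Call 555-1212 today")); // True
        Console.WriteLine(ContainsLocalPhone("55-1212"));             // False
    }
}
```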
You can combine the repeat quantifier with any valid literal or escape character. With minor adjustments, you can create a regular expression that will match U.S. phone number values, with or without an area code.
Using Groups
Regular expressions support a grouping syntax. The grouping syntax is represented by parentheses: (). You can combine parentheses, literal parenthetical characters, and OR logic to test for variations in expressions. For example, in the U.S. it's generally acceptable to represent a phone number with or without the area code, and the area code is commonly wrapped in parentheses. To validate phone number input, you could test an input string against a couple of variations:
(\(\d{3}\) \d{3}-\d{4})|(\d{3}-\d{4})
Table 1 decomposes the expression above.
Table 1 Decomposed Regular Expressions for U.S. Phone Numbers, Demonstrating Grouping and OR Logic
Character | Description
( | Starts the first group
\( | Matches a literal left parenthesis
\d{3} | Matches three digits
\) | Matches a literal right parenthesis
<space> | Matches a literal space
\d{3} | Matches three digits
- | Matches a literal hyphen
\d{4} | Matches four digits
) | Ends the first group
| (vertical bar) | OR logic, as in "match the first expression OR the second expression"
( | Starts the second group
\d{3} | Matches three digits
- | Matches a literal hyphen
\d{4} | Matches four digits
) | Ends the second group
Examples of strings of characters that would match include (555) 555-1212 or 555-1212. IsMatch returns true if these strings occur anywhere in the input string.
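A short sketch of the grouped expression in use follows. The class and method names are hypothetical; the pattern is the one decomposed in Table 1.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GroupDemo
{
    // First group: (###) ###-####   Second group: ###-####
    public const string PhonePattern = @"(\(\d{3}\) \d{3}-\d{4})|(\d{3}-\d{4})";

    // Hypothetical helper: true if either variation occurs anywhere in the input.
    public static bool ContainsPhone(string input)
    {
        return Regex.IsMatch(input, PhonePattern);
    }

    public static void Main()
    {
        Console.WriteLine(ContainsPhone("(555) 555-1212"));        // True
        Console.WriteLine(ContainsPhone("555-1212"));              // True
        Console.WriteLine(ContainsPhone("Phone: 555-1212 ext 9")); // True
        Console.WriteLine(ContainsPhone("(555) 555-121"));         // False
    }
}
```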
We'll look at boundary conditions in a moment. First, let's look at a way to use optional groups; this will allow us to shorten expressions that have redundant sub-parts.
Matching Optional Groups
Adding a question mark after a group (a parenthetical sub-expression) marks that group as optional. The following expression adds an optional group to a U.S. formatted postal code:
\d{5}(-\d{4})?
The preceding expression matches postal codes with five digits followed by an optional hyphen and four more digits.
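To see the optional group at work, this sketch uses Regex.Match to report the text that was actually matched, not just whether a match occurred. The class name, the MatchedText helper, and the sample postal codes are assumptions chosen for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ZipDemo
{
    public const string ZipPattern = @"\d{5}(-\d{4})?";

    // Hypothetical helper: returns the text the pattern matched, or "" if none.
    public static string MatchedText(string input)
    {
        return Regex.Match(input, ZipPattern).Value;
    }

    public static void Main()
    {
        Console.WriteLine(MatchedText("48864"));      // 48864
        Console.WriteLine(MatchedText("48864-1234")); // 48864-1234
        // The optional group needs four digits; with only two it is skipped.
        Console.WriteLine(MatchedText("48864-12"));   // 48864
        Console.WriteLine(Regex.IsMatch("4886", ZipPattern)); // False
    }
}
```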
The following regular expression consolidates the phone number expression by making the area code an optional group. I also included the beginning (^) and ending ($) delimiters (discussed in the next section), which means that the input string must match the expression in its entirety.
^(\(\d{3}\) )?(\d{3}-\d{4})$
If you want to match an entire string, you can add boundary conditions to delimit strings or look for word boundaries. See the next section for details.
Start of Line, End of Line, and Boundaries
There are several metacharacters that you can use to describe boundary conditions on your regular expressions. As indicated in the preceding example, you can use the caret (^) and the dollar sign ($) to indicate the part of the expression that must occur at the beginning of the input string and the end of the input string.
By combining the beginning metacharacter (^) in the phone number expression with the optional group, we're indicating that the input string must start with the area code or the first three digits of the phone number. The ending metacharacter ($) indicates that the input string must end with the final four digits of the phone number. Using the beginning and ending metacharacters, the entire input string must be in one of the following forms:
(###) ###-####
or
###-####
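The effect of the two anchors can be sketched by running the anchored phone expression against inputs that contain extra text. The class and method names are hypothetical. The last case shows one .NET detail worth knowing: by default, $ also matches just before a newline at the very end of the string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
    public const string AnchoredPhone = @"^(\(\d{3}\) )?(\d{3}-\d{4})$";

    // Hypothetical helper: true only if the whole input is a phone number.
    public static bool IsPhone(string input)
    {
        return Regex.IsMatch(input, AnchoredPhone);
    }

    public static void Main()
    {
        Console.WriteLine(IsPhone("(555) 555-1212")); // True
        Console.WriteLine(IsPhone("555-1212"));       // True
        Console.WriteLine(IsPhone("Call 555-1212"));  // False: extra text at the start
        Console.WriteLine(IsPhone("555-12123"));      // False: extra digit at the end
        Console.WriteLine(IsPhone("(555)555-1212"));  // False: space after area code is required
        Console.WriteLine(IsPhone("555-1212\n"));     // True: $ allows one trailing newline
    }
}
```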
Several whitespace-related metacharacters are also useful when describing where a match may begin or end, as shown in Table 2.
Table 2 Boundary Metacharacters
Metacharacter | Description
\s | Matches any whitespace character, including \f, \n, \r, \t, \v, \x85, and \p{Z}. (A smaller subset is represented if you use the RegexOptions.ECMAScript option.)
\f | Represents a form feed.
\n | Represents the newline character.
\r | Represents a carriage return.
\t | Represents a tab.
\v | Represents a vertical tab.
\x85 | Represents the Unicode next-line control character (NEL, U+0085).
\p | Matches a named character class, written \p{name}. For example, \p{Z} matches any Unicode separator character.
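As a rough illustration of \s, the following sketch checks that several different whitespace characters all satisfy the same expression. The class name, method name, and "hello world" inputs are assumptions made for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceDemo
{
    // Two words separated by exactly one whitespace character of any kind.
    public const string Pattern = @"^hello\sworld$";

    public static bool IsGreeting(string input)
    {
        return Regex.IsMatch(input, Pattern);
    }

    public static void Main()
    {
        Console.WriteLine(IsGreeting("hello world"));       // True: space
        Console.WriteLine(IsGreeting("hello\tworld"));      // True: tab
        Console.WriteLine(IsGreeting("hello\u0085world"));  // True: next line (\x85)
        Console.WriteLine(IsGreeting("helloworld"));        // False: no whitespace
        Console.WriteLine(IsGreeting("hello-world"));       // False: hyphen is not whitespace
    }
}
```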
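Other metacharacters serve other purposes. For example, \b indicates that the match must occur at a word boundary: a position between a word character (such as a letter, digit, or underscore) and a non-word character, or at the start or end of the string.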
Regular expressions really require a whole book of their own. .NET regular expressions are similar to the regular expression language in Perl 5, and you can download Dan Appleman's ebook on regular expressions from Amazon.com (Regular Expressions with .NET, PDF format, 75 pages, available for $15). Of course, you should finish reading this article to see if your answer is in here, for free.
Here's an example that demonstrates using the boundary metacharacter to match a U.S. Social Security number within a string:
\b\d{3}-\d{2}-\d{4}\b
The \b at each end requires the run of digits to start and end on a word boundary, and the text between the boundaries must have the format ###-##-####. Clearly, this is consistent with a U.S. Social Security number. The expression indicates whether the input string contains a succession of digits and hyphens in this format, but of course it won't indicate whether the Social Security number is valid.
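The difference the boundaries make can be sketched by comparing the expression with and without \b. The class, constant, and method names here are hypothetical, and the sample numbers are made up.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SsnDemo
{
    public const string WithBoundaries = @"\b\d{3}-\d{2}-\d{4}\b";
    public const string WithoutBoundaries = @"\d{3}-\d{2}-\d{4}";

    // Hypothetical helper: true if the input contains text shaped like ###-##-####
    // that is not part of a longer run of digits.
    public static bool ContainsSsnFormat(string input)
    {
        return Regex.IsMatch(input, WithBoundaries);
    }

    public static void Main()
    {
        Console.WriteLine(ContainsSsnFormat("SSN: 123-45-6789."));  // True
        Console.WriteLine(ContainsSsnFormat("1123-45-6789"));       // False: extra leading digit
        Console.WriteLine(ContainsSsnFormat("123-45-67890"));       // False: extra trailing digit
        // Without \b, the same digits are found inside the longer number.
        Console.WriteLine(Regex.IsMatch("1123-45-67890", WithoutBoundaries)); // True
    }
}
```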