Regular Expressions: Matching Sets of Characters
- Matching One of Several Characters
- Using Character Set Ranges
- "Anything But" Matching
- Summary
In this lesson you'll learn how to work with sets of characters. Unlike the ., which matches any single character (as you learned in the previous lesson), sets enable you to match specific characters and character ranges.
Matching One of Several Characters
As you learned in the previous lesson, . matches any one character (as does any literal character). In the final example in that lesson, .a was used to match both na and sa; the . matched both the n and the s. But what if there were also a file (containing Canadian sales data) named ca1.xls, and you still wanted to match only na and sa? The . would also match c, and so that filename would be matched too.
To find n or s, you would not want to match any character; you would want to match just those two characters. In regular expressions, a set of characters is defined using the metacharacters [ and ]. Everything between [ and ] is part of the set, and any one of the set members must match (but not all).
Here is a revised version of that example from the previous lesson:
sales1.xls orders3.xls sales2.xls sales3.xls apac1.xls europe2.xls na1.xls na2.xls sa1.xls ca1.xls
[ns]a.\.xls
sales1.xls orders3.xls sales2.xls sales3.xls apac1.xls europe2.xls na1.xls na2.xls sa1.xls ca1.xls
The regular expression used here starts with [ns]; this matches either n or s (but not c or any other character). [ and ] do not match any characters themselves; they define the set. The literal a matches a, . matches any character, \. matches the ., and the literal xls matches xls. When you use this pattern, only the three desired filenames (na1.xls, na2.xls, and sa1.xls) are matched.
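As a rough illustration, the same pattern can be tried in Python. The lesson itself is not tied to any language, so the use of Python's re module here is an assumption, and the filenames are the ones from the example above.

```python
import re

# Filenames taken from the example in this lesson.
filenames = [
    "sales1.xls", "orders3.xls", "sales2.xls", "sales3.xls", "apac1.xls",
    "europe2.xls", "na1.xls", "na2.xls", "sa1.xls", "ca1.xls",
]

# [ns] matches n or s, a is a literal, . matches any one character,
# and \. matches a literal period.
pattern = re.compile(r"[ns]a.\.xls")

# search() looks for the pattern anywhere inside each filename.
matched = [name for name in filenames if pattern.search(name)]
print(matched)  # ['na1.xls', 'na2.xls', 'sa1.xls']
```

Because ca1.xls starts with c, which is not a member of the set, it is left out.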
NOTE
Actually, [ns]a.\.xls is not quite right either. If a file named usa1.xls existed, it would match, too. The solution to this problem involves position matching, which will be covered in Lesson 6, "Position Matching."
TIP
As you can see, testing regular expressions can be tricky. Verifying that a pattern matches what you want is pretty easy. The real challenge is in verifying that you are not also getting matches that you don't want.
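One way to act on this advice is to keep two lists, names that should match and names that should not, and check both. The sketch below assumes Python's re module; the list names and the choice of test filenames are illustrative only.

```python
import re

pattern = re.compile(r"[ns]a.\.xls")

# Names the pattern is meant to find, and names it is meant to skip.
should_match = ["na1.xls", "na2.xls", "sa1.xls"]
should_not_match = ["ca1.xls", "sales1.xls", "usa1.xls"]

# Wanted names that the pattern fails to find.
missed = [name for name in should_match if not pattern.search(name)]

# Unwanted names that the pattern finds anyway.
unwanted = [name for name in should_not_match if pattern.search(name)]

print(missed)    # []
print(unwanted)  # ['usa1.xls']
```

The first check passes, but the second exposes usa1.xls, the unwanted match described in the note above.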
Character sets are frequently used to make searches (or specific parts thereof) not case sensitive. For example:
The phrase "regular expression" is often abbreviated as RegEx or regex.
[Rr]eg[Ee]x
The phrase "regular expression" is often abbreviated as RegEx or regex.
The pattern used here contains two character sets: [Rr] matches R and r, and [Ee] matches E and e. This way, RegEx and regex are both matched. REGEX, however, would not match.
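A minimal Python sketch of this example (Python is an assumption here; the sample sentence is the one used above):

```python
import re

text = 'The phrase "regular expression" is often abbreviated as RegEx or regex.'

# [Rr] accepts R or r, and [Ee] accepts E or e; the other letters are literals.
pattern = re.compile(r"[Rr]eg[Ee]x")

found = pattern.findall(text)
print(found)  # ['RegEx', 'regex']

# All-uppercase REGEX is not matched, because G and X are not in any set.
print(pattern.search("REGEX"))  # None
```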
TIP
If you are using matching that is not case sensitive, this technique is unnecessary. It is useful only when a search is case sensitive overall but part of the pattern must match regardless of case.
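How case-insensitive matching is switched on depends on the tool or language. As one example, and assuming Python, the re.IGNORECASE flag makes the character sets unnecessary. Note that it also widens what is matched; the sample string below is made up for illustration.

```python
import re

sample = "RegEx regex REGEX"

# Character sets: only the R/r and E/e positions may vary in case.
with_sets = re.compile(r"[Rr]eg[Ee]x")

# Case-insensitive flag: every letter may vary in case.
ignore_case = re.compile(r"regex", re.IGNORECASE)

sets_found = with_sets.findall(sample)
flag_found = ignore_case.findall(sample)

print(sets_found)  # ['RegEx', 'regex']
print(flag_found)  # ['RegEx', 'regex', 'REGEX']
```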