Using Character Set Ranges
Let’s take a look at the file list example again. The last pattern used, [ns]a.\.xls, has another problem. What if a file were named sam.xls? It, too, would be matched, because the . matches any character, not just digits.
Character sets can solve this problem as follows:
Text
sales1.xls orders3.xls sales2.xls sales3.xls apac1.xls europe2.xls sam.xls na1.xls na2.xls sa1.xls ca1.xls
RegEx
[ns]a[0123456789]\.xls
Result
sales1.xls orders3.xls sales2.xls sales3.xls apac1.xls europe2.xls sam.xls na1.xls na2.xls sa1.xls ca1.xls
Analysis
In this example, the pattern has been modified so that the first character would have to be either n or s, the second character would have to be a, and the third could be any digit (specified as [0123456789]). Notice that file sam.xls was not matched, because m did not match the list of allowed characters (the 10 digits).
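The difference between the two patterns can be checked with a short script. This is a minimal sketch that assumes Python's standard re module; the variable names (text, loose, strict) are only illustrative, and the file names are joined into a single string.

```python
import re

# File names from the example above, joined into one string.
text = ("sales1.xls orders3.xls sales2.xls sales3.xls apac1.xls "
        "europe2.xls sam.xls na1.xls na2.xls sa1.xls ca1.xls")

# Earlier pattern: "." accepts any character, so sam.xls slips through.
loose = re.findall(r"[ns]a.\.xls", text)

# Revised pattern: the third character must be one of the ten digits.
strict = re.findall(r"[ns]a[0123456789]\.xls", text)

print(loose)   # ['sam.xls', 'na1.xls', 'na2.xls', 'sa1.xls']
print(strict)  # ['na1.xls', 'na2.xls', 'sa1.xls']
```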
When working with regular expressions, you will find that you frequently specify ranges of characters (0 through 9, A through Z, and so on). To simplify working with character ranges, regex provides a special metacharacter: the - (hyphen), which specifies a range. The hyphen has this meaning only between [ and ]; outside a set it is a literal character.
Following is the same example, this time using a range:
Text
sales1.xls orders3.xls sales2.xls sales3.xls apac1.xls europe2.xls sam.xls na1.xls na2.xls sa1.xls ca1.xls
RegEx
[ns]a[0-9]\.xls
Result
sales1.xls orders3.xls sales2.xls sales3.xls apac1.xls europe2.xls sam.xls na1.xls na2.xls sa1.xls ca1.xls
Analysis
Pattern [0-9] is functionally equivalent to [0123456789], and so the results are identical to those in the previous example.
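One way to convince yourself of the equivalence is to test both sets against every printable ASCII character. The sketch below assumes Python's re and string modules; the names explicit and ranged are made up for the illustration.

```python
import re
import string

explicit = re.compile(r"[0123456789]")
ranged = re.compile(r"[0-9]")

# Collect any printable ASCII character on which the two sets disagree.
disagreements = [
    ch for ch in string.printable
    if bool(explicit.fullmatch(ch)) != bool(ranged.fullmatch(ch))
]

# Characters accepted by the range form.
accepted = [ch for ch in string.printable if ranged.fullmatch(ch)]

print(disagreements)        # []
print("".join(accepted))    # 0123456789
```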
Ranges are not limited to digits. The following are all valid ranges:
A-Z matches all uppercase characters from A to Z.
a-z matches all lowercase characters from a to z.
A-F matches only uppercase characters A to F.
A-z matches all characters between ASCII A and ASCII z (you should probably never use this pattern, because it also includes characters such as [ and ^, which fall between Z and a in the ASCII table).
Any two ASCII characters may be specified as the range start and end, as long as the start does not come after the end in ASCII order. In practice, however, ranges are usually made up of some or all digits and some or all alphabetic characters.
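The A-z pitfall is easy to see by listing the non-letters that the range lets through. The following is a sketch using Python's re module (an assumption; other engines report errors differently). It also shows that, in Python at least, a range whose start comes after its end is rejected when the pattern is compiled.

```python
import re
import string

letters = set(string.ascii_letters)

# ASCII characters accepted by [A-z] that are not letters.
extras = [chr(c) for c in range(128)
          if re.fullmatch(r"[A-z]", chr(c)) and chr(c) not in letters]
print(extras)  # ['[', '\\', ']', '^', '_', '`']

# A reversed range such as z-a is an error in Python's re module.
try:
    re.compile(r"[z-a]")
    reversed_accepted = True
except re.error:
    reversed_accepted = False
print(reversed_accepted)  # False
```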
Multiple ranges may be combined in a single set. For example, the following pattern matches any alphanumeric character, in uppercase or lowercase, and nothing else:
[A-Za-z0-9]
This pattern is shorthand for
[ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789]
As you can see, ranges make regex syntax much cleaner.
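As a quick check that the short and long forms behave the same, the sketch below builds the long set from Python's string constants and runs both against a sample string. The sample text and variable names are invented for the illustration.

```python
import re
import string

short_form = r"[A-Za-z0-9]"
# Long form spelled out: every uppercase letter, lowercase letter, and digit.
long_form = ("[" + string.ascii_uppercase
             + string.ascii_lowercase + string.digits + "]")

sample = "Order #42: OK_go!"  # made-up sample text
short_hits = re.findall(short_form, sample)
long_hits = re.findall(long_form, sample)

print(short_hits)               # ['O', 'r', 'd', 'e', 'r', '4', '2', 'O', 'K', 'g', 'o']
print(short_hits == long_hits)  # True
```

Note that the space, #, :, _, and ! in the sample are skipped by both forms, because none of them is a letter or a digit.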
Following is one more example, this time finding RGB values (colors specified in a hexadecimal notation representing the amount of red, green, and blue used to create the color). In Web pages, RGB values are specified as #000000 (black), #ffffff (white), #ff0000 (red), and so on. RGB values may be specified in uppercase or lowercase, and so #FF00ff (magenta) is legal, too. Here is an example taken from a CSS file:
Text
body { background-color: #fefbd8; } h1 { background-color: #0000ff; } div { background-color: #d0f4e6; } span { background-color: #f08970; }
RegEx
#[0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f][0-9A-Fa-f]
Result
body { background-color: #fefbd8; } h1 { background-color: #0000ff; } div { background-color: #d0f4e6; } span { background-color: #f08970; }
Analysis
The pattern used here contains # as literal text and then the character set [0-9A-Fa-f] repeated six times. This matches # followed by six characters, each of which must be a digit or A through F (in either uppercase or lowercase).
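The same search can be tried in code. This is a minimal sketch assuming Python's re module; the names css, hex_digit, and colors are illustrative. The pattern is built by repeating the set six times, exactly as written above.

```python
import re

# CSS text from the example above.
css = ("body { background-color: #fefbd8; } "
       "h1 { background-color: #0000ff; } "
       "div { background-color: #d0f4e6; } "
       "span { background-color: #f08970; }")

hex_digit = "[0-9A-Fa-f]"
pattern = "#" + hex_digit * 6   # "#" followed by the set repeated six times

colors = re.findall(pattern, css)
print(colors)  # ['#fefbd8', '#0000ff', '#d0f4e6', '#f08970']

# Mixed case is accepted too, because the set lists both A-F and a-f.
mixed = re.findall(pattern, "color: #FF00ff;")
print(mixed)   # ['#FF00ff']
```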