Characters
Other than the metacharacters, each character in an expression falls into one of the following groupings:
- Normal Characters
- Single Character Escape
- Multiple Character Escape
- Character Category
- Character Block
- XML Character References
The sections below provide one or two examples for each grouping. For tables that detail every character expression, see http://www.XMLSchemaReference.com/regularExpression.html. Note that tables on the reference website contain numerous examples, just like the other tables in this document.
Normal Characters
Normal characters are basically the Latin characters that many people expect: the characters 'e', 'g', 'n', and 'r', for example, might be arranged in an expression that looks like 'green'. The only possible surprise here is that a few characters that might otherwise be normal (such as '.' or '?') are actually metacharacters with special meanings.
The following two examples permit the string 'green' in a corresponding XML instance:
green
[egnr]+
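The difference between the two expressions can be sketched in Python. This is an approximation only: Python's re module is not the Schema regular expression engine, and the helper name permits is hypothetical. Because a Schema pattern applies to the entire value, the sketch uses re.fullmatch rather than re.search.

```python
import re

def permits(expression: str, value: str) -> bool:
    # Hypothetical helper: True when the WHOLE value matches the expression,
    # which approximates how a Schema pattern is applied to a value.
    return re.fullmatch(expression, value) is not None

print(permits("green", "green"))    # literal normal characters
print(permits("[egnr]+", "green"))  # one or more of e, g, n, r
print(permits("[egnr]+", "genre"))  # the character set also permits this value
print(permits("green", "genre"))    # the literal expression does not
```

The character set form is looser than the literal form: it permits any non-empty arrangement of the four letters, not just 'green'.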
Single Character Escape
A single character escape matches common hard-to-type characters. For example, '\t' matches the otherwise unreadable "tab" character. Single character escapes also permit the otherwise unmatchable metacharacters. For example, '\.' matches a period character, as opposed to '.', which is the wildcard.
The following expressions all permit, among other valid values, a tab, followed by 'green', followed by a period:
\t[egnr]+\.
\t.*\.
\tgreen\.
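As a rough check, the three expressions can be tried against the same value in Python. The sketch assumes that Python's re module treats '\t', '\.' and '.' the same way for this particular value; it is not a Schema validator, and the variable names are illustrative.

```python
import re

value = "\tgreen."  # a tab, then 'green', then a period

expressions = [r"\t[egnr]+\.", r"\t.*\.", r"\tgreen\."]

# Map each expression to whether it matches the entire value.
results = {e: re.fullmatch(e, value) is not None for e in expressions}
for expression, matched in results.items():
    print(expression, matched)

# The unescaped period is the wildcard; the escaped period is a literal period.
wildcard_matches_x = re.fullmatch(r".", "x") is not None
escaped_matches_x = re.fullmatch(r"\.", "x") is not None
print(wildcard_matches_x, escaped_matches_x)
```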
Here is the full table of single character escapes, along with numerous examples.
Multiple Character Escape
A multiple character escape matches common sets of characters. For example, '\s' matches any whitespace character (tab, space, newline or return); '\d' matches a decimal digit (such as 0-9). Multiple character escapes are frequently enhanced with cardinality modifiers, such as '\s*' to match zero or more whitespace characters. Note that all multiple character escape sequences have a negation, which is the uppercase equivalent. So, '\S' matches any character except one of the whitespace characters.
The following expression permits an unbounded number of non-negative integers delimited by whitespace:
(\s*\d+)+\s*
Note that at least one decimal number is required; leading and trailing whitespace is permitted.
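The behavior described above can be sketched in Python. This is an approximation: Python's '\s' accepts more whitespace characters than the four listed for the Schema '\s', and the names INTEGER_LIST and is_integer_list are hypothetical.

```python
import re

# The expression from the text, compiled once for reuse.
INTEGER_LIST = re.compile(r"(\s*\d+)+\s*")

def is_integer_list(value: str) -> bool:
    # fullmatch approximates a Schema pattern, which covers the whole value.
    return INTEGER_LIST.fullmatch(value) is not None

print(is_integer_list("7"))               # a single integer is enough
print(is_integer_list("  1 22\t333\n"))   # leading and trailing whitespace
print(is_integer_list(""))                # rejected: at least one digit is required
print(is_integer_list("1 -2"))            # rejected: '-' is neither digit nor whitespace
```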
A simpler form permits roughly similar values (as written, each digit must be preceded by exactly one whitespace character), but does not actually require any characters:
(\s\d)*
Here is the full table of multiple character escapes, along with numerous examples.
Character Category
Character categories are great, but also dangerous. For example, '\p{Lu}' matches any uppercase character. In practice, this probably works just fine. In theory, the regular expression might not be what you want, since this matches, say, uppercase German characters with umlauts. On the other hand, '\p{P}' matches any punctuation, which can be quite useful. Note that a character category specified with a lowercase 'p' matches that category; an uppercase 'P' specifies any character except those in that category. So '\P{P}' matches any character except punctuation.
The following expression matches a capitalized word in any combination of language character sets:
\p{Lu}\p{Ll}*
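Python's standard re module does not support the '\p{...}' syntax, so the sketch below emulates the expression with unicodedata.category, which reports the Unicode general category of a character. The function names are hypothetical, and the Unicode version bundled with Python may differ from the one a Schema processor uses.

```python
import unicodedata

def is_capitalized_word(value: str) -> bool:
    # Emulates: one uppercase letter (Lu) followed by zero or more
    # lowercase letters (Ll), applied to the whole value.
    if not value:
        return False
    first, rest = value[0], value[1:]
    return (unicodedata.category(first) == "Lu"
            and all(unicodedata.category(ch) == "Ll" for ch in rest))

def is_punctuation(ch: str) -> bool:
    # Emulates the P category: every punctuation subcategory starts with 'P'.
    return unicodedata.category(ch).startswith("P")

print(is_capitalized_word("Green"))       # Latin capitalized word
print(is_capitalized_word("\u00dcber"))   # U-umlaut is also category Lu
print(is_capitalized_word("green"))       # no leading uppercase letter
print(is_punctuation("!"), is_punctuation("a"))
```

The second call illustrates the caution above: an expression written with Latin text in mind also accepts uppercase letters from other scripts and accented forms.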
Here is the full table of character categories, along with numerous examples.
Character Block
The Unicode Standard supports character blocks. A block is a range of characters set aside for a specific purpose. Some examples of these blocks are the characters for a language (such as Greek), the Braille character set, and various drawing symbols.
The Schema Recommendation provides a regular expression mechanism for identifying characters that belong to a specific block of interest. The syntax for an expression that identifies a block is '\p{IsBlockName}', where 'BlockName' is a name from the table of block names. Like the character categories, an uppercase 'P' (as in '\P{IsBlockName}') excludes the characters in that block.
An expression which constrains the corresponding XML to just the Greek character set might look like:
\p{IsGreek}*
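Python's re module has no '\p{IsBlockName}' syntax, so a block can only be approximated there with an explicit code point range. The sketch below assumes the Greek block spans U+0370 through U+03FF; the name GREEK_ONLY is hypothetical.

```python
import re

# Assumption: the Greek block covers U+0370..U+03FF.
# The '*' permits zero or more characters from that range.
GREEK_ONLY = re.compile("[\u0370-\u03FF]*")

def is_greek_only(value: str) -> bool:
    return GREEK_ONLY.fullmatch(value) is not None

print(is_greek_only("\u03b1\u03b2\u03b3"))  # alpha, beta, gamma
print(is_greek_only(""))                    # the empty value is permitted by '*'
print(is_greek_only("abc"))                 # Latin letters are outside the block
```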
XML Character References
An expression may match a character by using the common XML character reference, which is a decimal number delimited by '&#' and ';', or a hex number delimited by '&#x' and ';'. For example, both '&#90;' and '&#x5A;' match the uppercase Latin 'Z'. The number embedded in an XML character reference corresponds directly to the characters documented in the Unicode Standard.
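A small Python sketch shows how an XML parser resolves both forms of reference to the same character before any expression is evaluated. The element name pattern is purely illustrative and is not taken from a real schema.

```python
import xml.etree.ElementTree as ET

# One decimal and one hexadecimal reference to the same code point (90 == 0x5A).
element = ET.fromstring("<pattern>&#90;&#x5A;</pattern>")
resolved = element.text
print(resolved)

# The embedded numbers are Unicode code points.
print(chr(90), chr(0x5A))
```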