15.8 Subexpressions
A complete expression can be embedded within another expression, creating a subexpression. The embedded expression is enclosed by parentheses, '(' and ')'. On its own, however, a subexpression has no effect on the complete pattern. The following two examples are functionally identical:
abcde
a(bcd)e
At least two features are supported by this concept. A subexpression allows a sequence to be optional or repeatable and allows branches to be inserted into the middle of a larger expression.
Quantified groups
One reason for using a group is to give the enclosed tokens a quantifier. The whole group may be optional or repeatable. The same techniques are used as for single atoms:
a(bcd)?e
a(bcd){5,9}e
Note that the first example above is not equivalent to the expression 'ab?c?d?e'. The difference is that, with the quantified group, the characters 'b', 'c' and 'd' must all be present (in that order) or must all be absent.
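The difference between a quantified group and individually quantified atoms can be checked with a short script. The sketch below uses Python's re module as a stand-in for a schema processor, which is an assumption: the two expression languages differ in places, although they agree on the constructs used here. The helper name matches is hypothetical, and re.fullmatch is used because a pattern must match the whole value.

```python
import re


def matches(pattern: str, value: str) -> bool:
    # A schema pattern applies to the entire value, so use fullmatch.
    return re.fullmatch(pattern, value) is not None


grouped = "a(bcd)?e"    # 'bcd' is present as a unit or absent as a unit
separate = "ab?c?d?e"   # each of 'b', 'c' and 'd' is optional on its own

print(matches("a(bcd)e", "abcde"))   # True: the group alone changes nothing
print(matches(grouped, "ae"))        # True
print(matches(grouped, "abe"))       # False: 'b' without 'cd'
print(matches(separate, "abe"))      # True

# Between five and nine repetitions of the whole group.
print(matches("a(bcd){5,9}e", "a" + "bcd" * 5 + "e"))   # True
print(matches("a(bcd){5,9}e", "a" + "bcd" * 4 + "e"))   # False
```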
An ISBN code might be allowed to be incomplete if the publisher part of the code can be implied:
<pattern value="(0-201-)?[0-9]{5}-[0-9x]" />
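As a rough check of this pattern, the following sketch applies it to a few illustrative values. Python's re module is assumed to behave like the schema pattern language for this expression, and the names ISBN_PATTERN and is_valid are hypothetical.

```python
import re

# Pattern copied from the <pattern> element above.
ISBN_PATTERN = re.compile(r"(0-201-)?[0-9]{5}-[0-9x]")


def is_valid(value: str) -> bool:
    # The whole value must match, as it would for a pattern facet.
    return ISBN_PATTERN.fullmatch(value) is not None


print(is_valid("0-201-77059-4"))   # True: full code
print(is_valid("77059-4"))         # True: publisher part implied
print(is_valid("201-77059-4"))     # False: the prefix is all or nothing
```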
Branching groups
A group is useful when several options are required at a particular location in the pattern, because a subexpression can contain branches. Consider the following example:
abc(1|2|3)d
This pattern matches the values 'abc1d', 'abc2d', and 'abc3d'. Of course, with only a single character in each branch, this is just an alternative for the more succinct pattern 'abc[123]d'. However, that much simpler technique cannot work for multicharacter scenarios. In the following example, the values allowed are 'abc111d', 'abc222d', and 'abc333d':
abc(111|222|333)d
Each branch is a complete expression, and may also contain subexpressions, though this is only needed when there are fixed characters before or after the embedded options:
...(...|aaa(...|...|...)zzz|...)...
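The branching behaviour described above, including a branch that contains its own subexpression, can be tried out as follows. This is only a sketch: Python's re module is assumed to agree with the schema pattern language for these expressions, and the nested pattern 'x(a|b(1|2)c|d)y' is an invented example rather than one taken from the text.

```python
import re


def matches(pattern: str, value: str) -> bool:
    # A schema pattern applies to the entire value, so use fullmatch.
    return re.fullmatch(pattern, value) is not None


# Single-character branches behave like a character class.
print(matches("abc(1|2|3)d", "abc2d"))   # True
print(matches("abc[123]d", "abc2d"))     # True

# Multicharacter branches have no character class equivalent.
print(matches("abc(111|222|333)d", "abc222d"))   # True
print(matches("abc(111|222|333)d", "abc123d"))   # False

# A branch with fixed characters around embedded options.
nested = "x(a|b(1|2)c|d)y"
candidates = ["xay", "xb1cy", "xb2cy", "xdy", "xbcy", "xb3cy"]
accepted = [value for value in candidates if matches(nested, value)]
print(accepted)   # ['xay', 'xb1cy', 'xb2cy', 'xdy']
```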
An ISBN code for any book published in France (area code '2') or Poland (area code '83') is quite straightforward to express. Unfortunately, the following formulation permits a missing or extra digit in the publisher or book code, and does not prevent the hyphen that should separate these two parts from actually occurring before or after them both:
<pattern value="(2|83)-[0-9-]{7,8}-[0-9x]" />
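The looseness of this formulation is easy to demonstrate. In the sketch below, Python's re module is assumed to treat the expression as a schema processor would, the name accepts is hypothetical, and the sample values are invented strings rather than real ISBN codes.

```python
import re

# Pattern copied from the <pattern> element above.
AREA_PATTERN = re.compile(r"(2|83)-[0-9-]{7,8}-[0-9x]")


def accepts(value: str) -> bool:
    # The whole value must match, as it would for a pattern facet.
    return AREA_PATTERN.fullmatch(value) is not None


print(accepts("83-123-4567-8"))   # True: the intended shape
print(accepts("83-12-4567-8"))    # True: a digit is missing
print(accepts("83--1234567-8"))   # True: hyphen before both parts
print(accepts("83-1234567--8"))   # True: hyphen after both parts
print(accepts("49-123-4567-8"))   # False: area code is neither '2' nor '83'
```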
15.9 Character Class Escapes
There are various categories of character class escape. The simplest kind, single character escape, has already been discussed. This is an escape sequence for a single character that has a significant role in the expression language, such as '\{' to represent the '{' character (they are listed and discussed in more detail above). The other escape types are
multicharacter escapes (such as '\S' (non-whitespace) and '.' (non-line-ending character));
general category escapes (such as '\p{L}' and '\p{Lu}') and complementary general category escapes (such as '\P{L}' and '\P{Lu}');
block category escapes (such as '\p{IsBasicLatin}' and '\p{IsTibetan}') and complementary block category escapes (such as '\P{IsBasicLatin}' and '\P{IsTibetan}').
Multicharacter escapes
For convenience, a number of short escape codes are provided, each representing a very common set of characters (hence the term 'multicharacter'), including
non-line-ending characters;
whitespace characters and non-whitespace characters;
initial XML name characters (and all characters except these characters);
subsequent XML name characters (and all characters except these characters);
decimal digits (and all characters except these digits).
The '.' character represents every character except a newline or carriage-return character. The sequence '.....' therefore represents a string of five characters that is not broken over lines. The simplest possible pattern for an ISBN code would be thirteen dots (ten digits and three hyphens):
<pattern value="............." />
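A thirteen-character pattern of this kind can be approximated in Python, with one caveat: by default the '.' in Python's re module excludes only the newline character and still matches a carriage return. The sketch below therefore substitutes the negated class '[^\n\r]' for each dot, which is an assumption made to mirror the definition given above. The names DOT, ISBN_SHAPE and has_isbn_shape are hypothetical.

```python
import re

# Stand-in for the schema '.', which excludes both line-ending characters.
DOT = r"[^\n\r]"

# Thirteen 'dots': ten digits plus three hyphens, but any character will do.
ISBN_SHAPE = re.compile(DOT * 13)


def has_isbn_shape(value: str) -> bool:
    return ISBN_SHAPE.fullmatch(value) is not None


print(has_isbn_shape("0-201-77059-4"))    # True
print(has_isbn_shape("not an ISBN!!"))    # True: only the length is checked
print(has_isbn_shape("0-201-7705\n-4"))   # False: broken over two lines
print(has_isbn_shape("0-201-7705\r-4"))   # False
```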
The remaining multicharacter escape characters are escaped in the normal way: by a '\' symbol. They are all defined in pairs, with a lowercase letter representing a particular common requirement, and the equivalent uppercase letter representing the opposite effect.
The escape sequence '\s' represents any whitespace character, including the space, tab, newline and carriage-return characters. The '\S' sequence therefore represents any non-whitespace character.
The escape sequence '\i' represents any XML initial name character ('_', ':', or a letter). The '\I' sequence therefore represents any character that cannot start an XML name. Similarly, the escape sequence '\c' represents any XML name character, and '\C' represents any character that is not an XML name character.
The escape sequence '\d' represents any decimal digit. It is equivalent to '\p{Nd}' (see below). The '\D' sequence therefore represents any other character. The ISBN examples can now be shortened still further. Note that an escape sequence can even be placed within a character class, in this case to indicate that the check digit may be a digit instead of the letter 'x'. Such escape sequences cannot, however, be used to indicate the start or end of a range of characters:
<pattern value="\d*-\d*-\d*-[\dx]" />
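The shortened pattern can be exercised in the same way as the earlier ones. In this sketch Python's re module is assumed to be an adequate stand-in; for text strings its '\d' also matches any Unicode decimal digit, which is in line with the '\p{Nd}' equivalence mentioned above. The name is_isbn_like is hypothetical and the values are illustrative.

```python
import re

# Pattern copied from the <pattern> element above.
ISBN_DIGITS = re.compile(r"\d*-\d*-\d*-[\dx]")


def is_isbn_like(value: str) -> bool:
    return ISBN_DIGITS.fullmatch(value) is not None


print(is_isbn_like("0-201-77059-4"))   # True
print(is_isbn_like("0-201-77059-x"))   # True: '\d' and 'x' share one class
print(is_isbn_like("---4"))            # True: '*' permits empty parts
print(is_isbn_like("0-201-77059-X"))   # False: uppercase 'X' is not listed

# U+0663 is ARABIC-INDIC DIGIT THREE, a decimal digit in category Nd.
print(re.fullmatch(r"\d", "\u0663") is not None)   # True
```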
The escape sequence '\w' represents all characters except punctuation, separators, and 'other' characters (using a mixture of techniques described above and below, this is equivalent to '[&#x0000;-&#x10FFFF;-[\p{P}\p{Z}\p{C}]]'), whereas the '\W' sequence represents only these characters.
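This definition of '\w' differs from the one used by many other regular expression languages. Python's own '\w', for example, accepts the underscore and rejects symbols such as '+', so it cannot be used directly as a stand-in. The sketch below instead emulates the rule stated above with the standard unicodedata module. The function name is_schema_word_char is hypothetical, and results for individual characters may depend on the Unicode version bundled with the Python installation.

```python
import unicodedata


def is_schema_word_char(ch: str) -> bool:
    # Everything is a word character except the major categories
    # P (punctuation), Z (separators) and C (others).
    major = unicodedata.category(ch)[0]
    return major not in "PZC"


for ch in ["a", "5", "+", "_", "-", " ", "\n"]:
    print(repr(ch), unicodedata.category(ch), is_schema_word_char(ch))
```

The letter, the digit, and the '+' symbol (category 'Sm') count as word characters here, while the underscore ('Pc'), the hyphen ('Pd'), the space ('Zs'), and the newline ('Cc') do not.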
Quantifiers can be used with these escape sequences. For example, '\d{5}' specifies that five decimal digits are required.
Category escapes
The escape sequence '\p' or '\P' introduces a category escape set. A category token is enclosed within curly brackets, '{' and '}'. These tokens represent predefined sets of characters, such as all uppercase letters (a general kind of category escape) or the Tibetan character set (a block from the Unicode character set).
General category escapes
A general category escape is a reference to a predefined set of characters, such as the uppercase letters, or all of the punctuation characters. These sets of characters have special names, such as 'Lu' for uppercase letters, and 'P' for all punctuation. For example, '\p{Lu}' represents all uppercase letters, and '\P{Lu}' represents all characters except uppercase letters.
Single letter codes are used for major groupings, such as 'L' for all letters (of which uppercase letters are just a subset). The full set of options is listed below:
Group | Code | Description
L | | All Letters
 | Lu | uppercase
 | Ll | lowercase
 | Lt | titlecase
 | Lm | modifier
 | Lo | other
M | | All Marks
 | Mn | nonspacing
 | Mc | spacing combination
 | Me | enclosing
N | | All Numbers
 | Nd | decimal digit
 | Nl | letter
 | No | other
P | | All Punctuation
 | Pc | connector
 | Pd | dash
 | Ps | open
 | Pe | close
 | Pi | initial quote
 | Pf | final quote
 | Po | other
Z | | All Separators
 | Zs | space
 | Zl | line
 | Zp | paragraph
S | | All Symbols
 | Sm | math
 | Sc | currency
 | Sk | modifier
 | So | other
C | | All Others
 | Cc | control
 | Cf | format
 | Co | private use
For details, see http://www.unicode.org/Public/3.1-Update/UnicodeCharacterDatabase-3.1.0.html
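Python's re module does not support '\p{...}' escapes, but the general category of a character can be looked up with the standard unicodedata module, which makes it possible to sketch how these escapes classify characters. The function names in_category and not_in_category are hypothetical, and the character database bundled with Python is newer than the Unicode 3.1 release referenced above, so the categories of some characters may differ.

```python
import unicodedata


def in_category(ch: str, code: str) -> bool:
    # Emulates \p{code}. A one-letter code such as 'L' covers every
    # two-letter category that starts with it ('Lu', 'Ll', 'Lt', ...).
    return unicodedata.category(ch).startswith(code)


def not_in_category(ch: str, code: str) -> bool:
    # Emulates the complementary escape \P{code}.
    return not in_category(ch, code)


print(in_category("A", "Lu"))       # True: uppercase letter
print(in_category("a", "Lu"))       # False: lowercase letter
print(in_category("a", "L"))        # True: still a letter
print(not_in_category("a", "Lu"))   # True
print(in_category("!", "P"))        # True: category 'Po'
print(in_category("7", "Nd"))       # True: decimal digit
```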
Block category escapes
The Unicode character set is divided into many significant groupings such as musical symbols, Braille characters, and Tibetan characters. A keyword is assigned to each group, for example, 'MusicalSymbols', 'BraillePatterns', and 'Tibetan'.
The following table lists the full set of keywords in alphabetical order:
AlphabeticPresentationForms | Hebrew
Arabic | HighPrivateUseSurrogates
ArabicPresentationForms-A | HighSurrogates
ArabicPresentationForms-B | Hiragana
Armenian | IdeographicDescriptionCharacters
Arrows | IPAExtensions
BasicLatin | Kanbun
Bengali | KangxiRadicals
BlockElements | Kannada
Bopomofo | Katakana
BopomofoExtended | Khmer
BoxDrawing | Lao
BraillePatterns | Latin-1Supplement
ByzantineMusicalSymbols | LatinExtended-A
Cherokee | LatinExtended-B
CJKCompatibility | LatinExtendedAdditional
CJKCompatibilityForms | LetterlikeSymbols
CJKCompatibilityIdeographs | LowSurrogates
CJKCompatibilityIdeographsSupplement | Malayalam
CJKRadicalsSupplement | MathematicalAlphanumericSymbols
CJKSymbolsandPunctuation | MathematicalOperators
CJKUnifiedIdeographs | MiscellaneousSymbols
CJKUnifiedIdeographsExtensionA | MiscellaneousTechnical
CJKUnifiedIdeographsExtensionB | Mongolian
CombiningDiacriticalMarks | MusicalSymbols
CombiningHalfMarks | Myanmar
CombiningMarksforSymbols | NumberForms
ControlPictures | Ogham
CurrencySymbols | OldItalic
Cyrillic | OpticalCharacterRecognition
Deseret | Oriya
Devanagari | PrivateUse (three separate sets)
Dingbats | Runic
EnclosedAlphanumerics | Sinhala
EnclosedCJKLettersandMonths | SmallFormVariants
Ethiopic | SpacingModifierLetters
GeneralPunctuation | Specials (two separate sets)
GeometricShapes | SuperscriptsandSubscripts
Georgian | Syriac
Gothic | Tags
Greek | Tamil
GreekExtended | Telugu
Gujarati | Thaana
Gurmukhi | Thai
HalfwidthandFullwidthForms | Tibetan
HangulCompatibilityJamo | UnifiedCanadianAboriginalSyllabics
HangulJamo | YiRadicals
HangulSyllables | YiSyllables
A reference to one of these blocks uses a keyword formed from the prefix 'Is' followed by a name from the list above, such as 'Tibetan'. For example, '\p{IsTibetan}' represents any Tibetan character and '\P{IsTibetan}' represents any character not from this set.
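Python's re module has no equivalent of the 'Is...' keywords, so the sketch below emulates two of them with explicit code point ranges. The ranges are those of the Unicode blocks concerned (BasicLatin is U+0000 to U+007F and Tibetan is U+0F00 to U+0FFF). The names BLOCKS and block_pattern are hypothetical, and a schema processor would of course accept the keywords directly.

```python
import re

# Code point ranges of two Unicode blocks, written as class contents.
BLOCKS = {
    "BasicLatin": r"\x00-\x7f",
    "Tibetan": r"\u0f00-\u0fff",
}


def block_pattern(name: str, negate: bool = False) -> re.Pattern:
    # negate=False emulates (\p{IsName})+, negate=True emulates (\P{IsName})+.
    opening = "[^" if negate else "["
    return re.compile(opening + BLOCKS[name] + "]+")


tibetan_word = "\u0f56\u0f7c\u0f51"   # three characters from the Tibetan block

print(block_pattern("Tibetan").fullmatch(tibetan_word) is not None)       # True
print(block_pattern("Tibetan").fullmatch("abc") is not None)              # False
print(block_pattern("Tibetan", negate=True).fullmatch("abc") is not None) # True
print(block_pattern("BasicLatin").fullmatch("caf\u00e9") is not None)     # False
```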