- Introduction
- Simple Templates
- Atoms
- Quantifiers and Quantities
- Escape Characters
- Character Classes
- Character Class Ranges
- Subexpressions
15.4 Quantifiers and Quantities
By default an atom represents a fragment of the pattern that must occur in a value and may not repeat:
<pattern value="a" />
<Code>a</Code> <!-- OK -->
<Code></Code> <!-- ERROR ('a' must be present) -->
<Code>aaa</Code> <!-- ERROR ('a' must not repeat) -->
The expression 'abc' specifies that there must be one 'a' character, followed by one 'b' character, followed by one 'c' character. The values 'ab', 'bc', and 'abcc' would not match this pattern.
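The default behaviour can be tried out with a short sketch. It assumes Python's `re` module as a stand-in for an XML Schema processor; the two pattern languages are not identical, but they agree for plain literal atoms like these. The `matches` helper is hypothetical and uses `re.fullmatch` so that the pattern must account for the entire value.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Hypothetical helper: True only if the pattern covers the whole value."""
    return re.fullmatch(pattern, value) is not None


# Each atom in 'abc' must occur exactly once, in order.
results = {value: matches("abc", value) for value in ["abc", "ab", "bc", "abcc"]}

for value, ok in results.items():
    print(f"{value!r}: {'OK' if ok else 'ERROR'}")
```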
But it is possible to state that an atom is optional or repeatable, or even to specify an allowed range of occurrences. The pattern language achieves this by allowing quantifier symbols to be placed after the atoms they relate to. The symbols '?', '+' and '*' are used for this purpose (and have meanings that will be unsurprising to those familiar with DTD content models). Alternatively, a quantity allows any number of occurrences to be precisely specified.
Optional quantifier
The '?' quantifier indicates that the atom before it is optional. For example:
ab?c
Legal values in this case include 'abc' and 'ac'.
Note that it is possible to have two identical optional tokens in sequence, such as 'a?a?b'. This is because, unlike the case with DTD and schema element models, look-ahead parsing is permitted. This means the value 'ab' can be matched to this pattern, as can 'aab' (and just 'b'), without causing any problems for the parser.
The level of violence and strength of language in a TV program could be indicated with star ratings: '*' (the minimum), '**', '***', '****' or '*****' (the five-star maximum). In the following example the letter 's' represents each star, because asterisks cannot be used without further complications that are explained later:
<pattern value="ss?s?s?s?" />
<Ratings Violence="ssss" StrongLanguage="ss" />
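The optional quantifier can be checked with the following sketch, which again assumes Python's `re` module as an approximation of a schema validator (the `matches` helper is hypothetical). It tests the 'a?a?b' pattern and the star-rating pattern against values of different lengths.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Hypothetical helper: True only if the pattern covers the whole value."""
    return re.fullmatch(pattern, value) is not None


# Two identical optional atoms in sequence cause no problems.
optional_pairs = {v: matches("a?a?b", v) for v in ["b", "ab", "aab", "aaab"]}

# One mandatory 's' followed by up to four optional ones: one to five stars.
ratings = {n: matches("ss?s?s?s?", "s" * n) for n in range(0, 7)}

print(optional_pairs)
print(ratings)
```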
Repeatable quantifier
The '+' quantifier signifies that the atom is repeatable. The atom must be present, but any number of further occurrences are allowed. For example:
ab+c
Legal values in this case include 'abc' and 'abbbbbbbbbc', but 'ac' would not be valid.
It is not ambiguous to create patterns such as 'b+b+' (though it would be pointless). The parser would not need to match a particular 'b' character to one atom or the other (except for the first and last 'b' in the sequence).
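A brief sketch of the repeatable quantifier, assuming Python's `re` module behaves like the pattern language for these simple cases (the `matches` helper is hypothetical). It also shows that 'b+b+' is simply a roundabout way of requiring at least two 'b' characters.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Hypothetical helper: True only if the pattern covers the whole value."""
    return re.fullmatch(pattern, value) is not None


# 'b' must occur at least once between 'a' and 'c'.
repeatable = {v: matches("ab+c", v) for v in ["abc", "abbbbbbbbbc", "ac"]}

# Each 'b+' needs at least one 'b', so two or more are required overall.
doubled = {n: matches("b+b+", "b" * n) for n in range(1, 5)}

print(repeatable)
print(doubled)
```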
Optional and repeatable quantifier
The '*' quantifier indicates that the atom is both optional and repeatable. This could be seen to be functionally equivalent to '?+' if such combinations were legal:
ab*c
This expression makes the 'b' atom optional and repeatable so legal values include 'ac', 'abc' and 'abbbbbbc'.
Again, it is not ambiguous to create patterns such as 'b*z?b*'. If the 'z' atom is absent, no attempt is made to decide whether a particular 'b' atom belongs to the first part of the pattern or to the last part.
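The same kind of check works for the '*' quantifier. This sketch assumes Python's `re` module as a stand-in for a schema processor, with a hypothetical `matches` helper. Because every atom in 'b*z?b*' is optional, even the empty value is accepted.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Hypothetical helper: True only if the pattern covers the whole value."""
    return re.fullmatch(pattern, value) is not None


# 'b' may be absent, present once, or repeated.
star = {v: matches("ab*c", v) for v in ["ac", "abc", "abbbbbbc", "abz"]}

# With 'z' absent, the 'b' characters need not be assigned to either 'b*' atom.
mixed = {v: matches("b*z?b*", v) for v in ["", "bbb", "bzb", "z", "zz"]}

print(star)
print(mixed)
```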
Greedy quantifiers and backtracking
When a single atom can be matched to multiple characters in a value, matching patterns to values can become quite complex, because a pattern may have several possible interpretations of which only one successfully matches. The perceived issue here is one of 'greed.' Consider the pattern 'a+b?a' (two or more 'a' characters, or one or more 'a' characters followed by a single 'b' character and a single 'a' character) and an attempt to validate the value 'aaa' against it. There are two possible interpretations of the pattern, and one of them would not report a match with the value.
In the first scenario, the first atom, 'a+', could reasonably match the entire value (it could be greedy). But the remainder of the pattern, 'b?a', could not then be matched to anything, so the value would be deemed to be invalid (the missing character 'b' is not a problem here, because it is optional, but the missing additional 'a' character would trigger a failed match).
In the second scenario, the initial atom is only matched to the first two characters of the value instead of all three. The final atom of the pattern, 'a', could then be successfully matched with the final 'a' of the value.
A successful match should be reported if either of the interpretations is applicable to the value. In fact, a pattern is matched to a value first by an attempt at the greedy approach, then, if this fails to match the value, by attempts at less greedy interpretations until (hopefully) a successful match can be made.
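The outcome of this greedy-then-backtrack strategy can be observed in a small sketch. It assumes Python's `re` module, which is itself a backtracking matcher; a schema processor is free to work differently internally. The capture groups are not part of the original pattern and are added only to reveal how the value ends up divided among the three atoms.

```python
import re


def split_among_atoms(value: str):
    """Hypothetical helper: show which characters each atom of 'a+b?a' consumed."""
    match = re.fullmatch(r"(a+)(b?)(a)", value)
    return match.groups() if match else None


# 'a+' first tries all three characters, fails, then gives one back.
split_aaa = split_among_atoms("aaa")
split_aba = split_among_atoms("aba")
split_a = split_among_atoms("a")  # no interpretation succeeds

print(split_aaa)
print(split_aba)
print(split_a)
```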
Readers familiar with the use of expression languages to find text strings should note that in XML an expression is always expected to apply to the content of an entire element or to a complete attribute value. Hence, there need be no concern that expressions could be crafted that might inadvertently (through sheer greed) find a false match spanning two real instances of a value (and everything between) within a single line of text.
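The difference between searching within text and validating a whole value can be sketched as follows. This is an illustration in Python's `re` module, not a description of any schema processor: `re.search` plays the role of a text-finding tool, while `re.fullmatch` approximates the whole-value matching that XML expects. The sample text is made up for the example.

```python
import re

text = "see abbc here"  # hypothetical line of text containing one instance

# A search tool looks for the pattern anywhere inside the text.
found = re.search("ab+c", text)
found_text = found.group() if found else None

# Whole-value matching rejects the same text, because the surrounding
# characters are not accounted for by the pattern.
whole_value_ok = re.fullmatch("ab+c", text) is not None
instance_ok = re.fullmatch("ab+c", "abbc") is not None

print(found_text, whole_value_ok, instance_ok)
```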
Complex example
The following example includes all three quantifiers, and all the following Code elements are valid according to this pattern:
<pattern value="a+b?c*" />
<Code>a</Code>
<Code>ab</Code>
<Code>ac</Code>
<Code>abc</Code>
<Code>aaa</Code>
<Code>aaab</Code>
<Code>aaabc</Code>
<Code>aaabccc</Code>
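The listed values can be verified mechanically. The sketch below assumes Python's `re` module as an approximation of the pattern language; the `matches` helper and the list of invalid values are additions for illustration and do not come from the example itself.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Hypothetical helper: True only if the pattern covers the whole value."""
    return re.fullmatch(pattern, value) is not None


PATTERN = "a+b?c*"

valid_values = ["a", "ab", "ac", "abc", "aaa", "aaab", "aaabc", "aaabccc"]
# Illustrative counter-examples: no 'a', repeated 'b', or atoms out of order.
invalid_values = ["", "b", "abb", "ca", "acb"]

all_valid = all(matches(PATTERN, v) for v in valid_values)
none_invalid = not any(matches(PATTERN, v) for v in invalid_values)

print(all_valid, none_invalid)
```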
Quantities
A quantity is a more finely tuned instrument for specifying occurrence options than the quantifiers described above. Instead of a single symbol, such as '+', a quantity involves either one or two integer values enclosed by curly braces ('{' and '}').
The simplest form of quantity involves a single integer value. This value specifies how many times the atom must occur. For example:
ab{3}c
This pattern specifies that the value must be 'abbbc'.
A quantity range involves two values, separated by a comma. The first value indicates the minimum number of occurrences allowed, and the second value indicates the maximum number of occurrences allowed. For example:
ab{3,5}c
This pattern specifies that the value must be 'abbbc', 'abbbbc', or 'abbbbbc'.
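Exact quantities and quantity ranges can be explored with a short sketch, assuming Python's `re` module, whose brace syntax coincides with the pattern language for these forms. The `matches` helper is hypothetical; each result is keyed by the number of 'b' characters in the tested value.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Hypothetical helper: True only if the pattern covers the whole value."""
    return re.fullmatch(pattern, value) is not None


# Values 'abbc' through 'abbbbbbc', keyed by their count of 'b' characters.
exact = {n: matches("ab{3}c", "a" + "b" * n + "c") for n in range(2, 7)}
ranged = {n: matches("ab{3,5}c", "a" + "b" * n + "c") for n in range(2, 7)}

print(exact)
print(ranged)
```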
It is also possible to specify just a minimum number of occurrences. If the second value is absent but the comma is still present, then only a minimum is being specified. The following pattern allows for everything from 'abbc' to 'abbbbbbbbbc' and beyond:
ab{2,}c
Note that it is not possible to specify just a maximum number of repetitions in this way. It is always necessary to supply a minimum value. However, a minimum value of '0' is allowed, so '{0,55}' achieves the aim of specifying a maximum of '55'.
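Open-ended minimums and the '{0,55}' idiom can be checked in the same way. The sketch assumes Python's `re` module as a stand-in for a schema processor, and the `matches` helper is hypothetical. The patterns 'ab{2,}c' and 'ab{0,55}c' are used so that the counts are easy to read off.

```python
import re


def matches(pattern: str, value: str) -> bool:
    """Hypothetical helper: True only if the pattern covers the whole value."""
    return re.fullmatch(pattern, value) is not None


def value_with(n: int) -> str:
    """Build 'a', then n 'b' characters, then 'c'."""
    return "a" + "b" * n + "c"


# Minimum only: two or more 'b' characters.
minimum_only = {n: matches("ab{2,}c", value_with(n)) for n in [1, 2, 9]}

# Maximum only, expressed with an explicit minimum of zero.
maximum_only = {n: matches("ab{0,55}c", value_with(n)) for n in [0, 55, 56]}

print(minimum_only)
print(maximum_only)
```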