XML Schema Regular Expressions
Introduction
XML Schema simple types ultimately constrain most of the actual data (as opposed to the structure) in the corresponding XML. XML Schema provides built-in datatypes to assist. For example, there are various date and numeric datatypes, such as dateTime and positiveInteger, which define a date-and-time format and a positive integer format, respectively. In addition to the built-in datatypes, there are constraining facets, such as minInclusive and maxInclusive, which can restrict the value of a date or number to a range.
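As a rough illustration of the facets named above, the following sketch restricts a decimal and a date to a range. The type names percentage and recentDate and the chosen bounds are hypothetical; they are only there to show where minInclusive and maxInclusive go.

```xml
<!-- Hypothetical type: a decimal between 0 and 100, inclusive -->
<xsd:simpleType name="percentage">
  <xsd:restriction base="xsd:decimal">
    <xsd:minInclusive value="0"/>
    <xsd:maxInclusive value="100"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- Hypothetical type: a date on or after 1 January 2000 -->
<xsd:simpleType name="recentDate">
  <xsd:restriction base="xsd:date">
    <xsd:minInclusive value="2000-01-01"/>
  </xsd:restriction>
</xsd:simpleType>
```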
While the built-in datatypes and constraining facets are a great start, they are often insufficient, especially for string values. Regular expressions provide a powerful mechanism for restricting data values in XML. A simple type specifies a regular expression constraining facet with a pattern element. The following example specifies that a part number consists of one uppercase letter (A through Z) followed by one or more decimal digits:
<xsd:simpleType name="partNumber">
  <xsd:restriction base="xsd:token">
    <xsd:pattern value="[A-Z]\d+"/>
  </xsd:restriction>
</xsd:simpleType>
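To make the effect of the pattern concrete, here is a sketch of instance values checked against partNumber. The element name part is an assumption made for this illustration. Keep in mind that an XML Schema pattern must match the entire value, not just a substring of it.

```xml
<!-- Hypothetical element declaration using the type above -->
<xsd:element name="part" type="partNumber"/>

<!-- Instance values that satisfy [A-Z]\d+ -->
<part>A123</part>
<part>Z7</part>

<!-- Instance values that do not -->
<part>a123</part>   <!-- lowercase letter -->
<part>A</part>      <!-- no digit after the letter -->
<part>AB12</part>   <!-- two letters before the digits -->
<part>A12-B</part>  <!-- extra characters after the digits -->
```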
Creating a regular expression is a serious exercise in logic. While simple pattern matching is quite straightforward, advanced features such as set subtraction and negation require serious thought and testing.
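For orientation, the two advanced features just mentioned might look like the following in a pattern facet. This is a minimal sketch; the type names lowercaseConsonants and noAsciiDigits are hypothetical.

```xml
<!-- Set subtraction: lowercase a-z with the vowels removed -->
<xsd:simpleType name="lowercaseConsonants">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="[a-z-[aeiou]]+"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- Negation: any characters except the digits 0 through 9 -->
<xsd:simpleType name="noAsciiDigits">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="[^0-9]*"/>
  </xsd:restriction>
</xsd:simpleType>
```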
The casual regular expression writer makes many assumptions. A typical American programmer, for example, probably assumes the Latin character set. This programmer also relies on simple metacharacters such as the wildcard character ('.') and cardinality modifiers (such as '*').
A simple pattern that matches any string containing 'match' might be:
.*match.*
More often than not, this actually works fine. However, will the recipient of the corresponding XML actually behave as expected, given that this pattern also allows German and Japanese characters?
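A sketch of the permissive pattern in use, with a few sample values, shows how much it lets through. The type name containsMatch and the sample values are made up for this illustration.

```xml
<xsd:simpleType name="containsMatch">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value=".*match.*"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- All of the following values are valid for containsMatch:
       rematching
       Das ist ein match für Sie
       これは match です
-->
```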
More likely the expression should be more constraining, such as:
\p{IsBasicLatin}*\s+match\s+\p{IsBasicLatin}*\s
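The strings that match this regular expression are constrained (roughly) to sentences written in Basic Latin characters (the Unicode block U+0000 through U+007F, that is, ASCII) that contain 'match' as a separate word delimited by whitespace. Note that the trailing \s also requires the value to end with a whitespace character.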
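The stricter pattern could be attached to a type as sketched below. The type name asciiMatchSentence and the sample values are hypothetical. The base is assumed to be xsd:string, which preserves whitespace; a base such as xsd:token collapses whitespace before the pattern is checked, so the trailing \s could never be satisfied there.

```xml
<xsd:simpleType name="asciiMatchSentence">
  <xsd:restriction base="xsd:string">
    <xsd:pattern value="\p{IsBasicLatin}*\s+match\s+\p{IsBasicLatin}*\s"/>
  </xsd:restriction>
</xsd:simpleType>

<!-- Valid (quotes added only to show the whitespace):
       "a match here "
     Not valid:
       "a match here"         no trailing whitespace
       "rematching now "      'match' is not preceded by whitespace
       "ein match für Sie "   'ü' is outside the Basic Latin block
-->
```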
The rest of this document demonstrates various aspects of constructing a regular expression.