- 11.1 Advantages and Disadvantages of Shredding
- 11.2 Shredding with the XMLTABLE Function
- 11.3 Shredding with Annotated XML Schemas
- 11.4 Summary
11.3 Shredding with Annotated XML Schemas
This section describes another approach to shredding XML documents into relational tables. The approach is called annotated schema shredding or annotated schema decomposition because it is based on annotations in an XML Schema. These annotations define how XML elements and attributes in your XML data map to columns in your relational tables.
To perform annotated schema shredding, take the following steps:
- Identify or create the relational target tables that will hold the shredded data.
- Annotate your XML Schema to define the mapping from XML to the relational tables.
- Register the XML Schema in the DB2 XML Schema Repository.
- Shred XML documents with Command Line Processor commands or built-in stored procedures.
Assuming you have defined the relational tables that you want to shred into, let's look at annotating an XML Schema.
11.3.1 Annotating an XML Schema
Schema annotations are additional elements and attributes in an XML Schema to provide mapping information. DB2 can use this information to shred XML documents to relational tables. The annotations do not change the semantics of the original XML Schema. If a document is valid for the annotated schema then it is also valid for the original schema, and vice versa. You can use an annotated schema to validate XML documents just like the original XML Schema. For an introduction to XML Schemas, see Chapter 16, Managing XML Schemas.
The following is one line from an XML Schema:
<xs:element name="street" type="xs:string" minOccurs="1"/>
This line defines an XML element called street and declares that its data type is xs:string and that this element has to occur at least once. You can add a simple annotation to this element definition to indicate that the element should be shredded into the column STREET of the table ADDRESS. The annotation consists of two additional attributes in the element definition, as follows:
<xs:element name="street" type="xs:string" minOccurs="1" db2-xdb:rowSet="ADDRESS" db2-xdb:column="STREET"/>
The same annotation can also be provided as schema elements instead of attributes, as shown next. You will see later in Figure 11.8 why this can be useful.
<xs:element name="street" type="xs:string" minOccurs="1"> <xs:annotation> <xs:appinfo> <db2-xdb:rowSetMapping> <db2-xdb:rowSet>ADDRESS</db2-xdb:rowSet> <db2-xdb:column>STREET</db2-xdb:column> </db2-xdb:rowSetMapping> </xs:appinfo> </xs:annotation> <xs:element/>
The prefix xs is used for all constructs that belong to the XML Schema language, and the prefix db2-xdb is used for all DB2-specific schema annotations. This provides a clear distinction and ensures that the annotated schema validates the same XML documents as the original schema.
There are 14 different types of annotations. They allow you to specify what to shred, where to shred to, how to filter or transform the shredded data, and in which order to execute inserts into the target tables. Table 11.4 provides an overview of the available annotations, broken down into logical groupings by user task. The individual annotations are further described in Table 11.5.
Table 11.4. Overview and Grouping of Schema Annotations
If You Want to |
Use This Annotation |
Specify the target tables to shred into |
db2-xdb:rowSet db2-xdb:column db2-xdb:SQLSchema db2-xdb:defaultSQLSchema |
Specify what to shred |
db2-xdb:contentHandling |
Transform data values while shredding |
db2-xdb:expression db2-xdb:normalization db2-xdb:truncate |
Filter data |
db2-xdb:condition db2-xdb:locationPath |
Map an element or attribute to multiple columns |
db2-xdb:rowSetMapping |
Map several elements or attributes to the same column |
db2-xdb:table |
Define the order in which rows are inserted into the target table, to avoid referential integrity violations |
db2-xdb:rowSetOperationOrder db2-xdb:order |
Table 11.5. XML Schema Annotations
Annotation |
Description |
db2-xdb:defaultSQLSchema |
The default relational schema for the target tables. |
db2-xdb:SQLSchema |
Overrides the default schema for individual tables. |
db2-xdb:rowSet |
The table name that the element or attribute is mapped to |
db2-xdb:column |
The column name that the element or attribute is mapped to. |
db2-xdb:contentHandling |
For an XML element, this annotation defines how to derive the value that will be inserted into the target column. You can chose the text value of just this element (text), the concatenation of this element's text and the text of all its descendant nodes (stringValue), or the serialized XML (including all tages) of this element and all descendants (serializeSubtree). If you omit this annotation, DB2 chooses an appropriate default based on the nature of the respective element. |
db2-xdb:truncate |
Specifies whether a value should be truncated if its length is greater than the length of the target column. |
db2-xdb:normalization |
Specifies how to treat whitespace—valid values are whitespaceStrip, canonical, and original |
db2-xdb:expression |
Specifies an expression that is to be applied to the data before insertion into the target table. |
db2-xdb:locationPath |
Filters based on the XML context. For example, if it is a customer address then shred to the cust table; if it is an employee address then shred to the employee table. |
db2-xdb:condition |
Specifies value conditions so that data is inserted into a target table only if all conditions are true. |
db2-xdb:rowSetMapping |
Enables users to specify multiple mappings, to the same or different tables, for an element or attribute. |
db2-xdb:table |
Maps multiple elements or attributes to a single column. |
db2-xdb:order |
Specifies the insertion order of rows among multiple tables. |
db2-xdb:rowSetOperationOrder |
Groups together multiple db2-xdb:order annotations. |
To demonstrate annotated schema decomposition we use the shredding scenario in Figure 11.1 as an example. Assume that the target tables have been defined as shown in Figure 11.1. An annotated schema that defines the desired mapping is provided in Figure 11.8. Let's look at the lines that are highlighted in bold font. The first bold line declares the namespace prefix db2-xdb, which is used throughout the schema to distinguish DB2-specific annotations from regular XML Schema tags. The first use of this prefix is in the annotation db2-xdb:defaultSQLSchema, which defines the relational schema of the target tables. The next annotation occurs in the definition of the element name. The two annotation attributes db2-xdb:rowSet="ADDRESS" and db2-xdb:column="NAME" define the target table and column for the name element. Similarly, the street and city elements are also mapped to respective columns of the ADDRESS table. The next two annotations map the phone number and the type attribute to columns in the PHONES table. The last block of annotations belongs to the XML Schema definition of the Cid attribute. Since the Cid attribute value becomes the join key between the ADDRESS and the PHONE table, it has to be mapped to both tables. Two row set mappings are necessary, which requires the use of annotation elements instead of annotation attributes. The first db2-xdb:rowSetMapping maps the Cid attribute to the CID column in the ADDRESS table. The second db2-xdb:rowSetMapping assigns the Cid attribute to the CID column in the PHONES table.
Figure 11.8. Annotated schema to implement the shredding in Figure 11.1
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified" xmlns:db2-xdb="http://www.ibm.com/xmlns/prod/db2/xdb1" > <xs:annotation> <xs:appinfo> <db2-xdb:defaultSQLSchema>db2admin</db2-xdb:defaultSQLSchema> </xs:appinfo> </xs:annotation> <xs:element name="customerinfo"> <xs:complexType> <xs:sequence> <xs:element name="name" type="xs:string" minOccurs="1" db2-xdb:rowSet="ADDRESS" db2-xdb:column="NAME"/> <xs:element name="addr" minOccurs="1" maxOccurs="unbounded"> <xs:complexType> <xs:sequence> <xs:element name="street" type="xs:string" minOccurs="1" db2-xdb:rowSet="ADDRESS" db2-xdb:column="STREET"/> <xs:element name="city" type="xs:string" minOccurs="1" db2-xdb:rowSet="ADDRESS" db2-xdb:column="CITY"/> <xs:element name="prov-state" type="xs:string" minOccurs="1" /> <xs:element name="pcode-zip" type="xs:string" minOccurs="1" /> </xs:sequence> <xs:attribute name="country" type="xs:string" /> </xs:complexType> </xs:element> <xs:element name="phone" minOccurs="0" maxOccurs="unbounded" db2-xdb:rowSet="PHONES" db2-xdb:column="PHONENUM"> <xs:complexType> <xs:simpleContent> <xs:extension base="xs:string"> <xs:attribute name="type" form="unqualified" type="xs:string" db2-xdb:rowSet="PHONES" db2-xdb:column="PHONETYPE"/> </xs:extension> </xs:simpleContent> </xs:complexType> </xs:element> </xs:sequence> <xs:attribute name="Cid" type="xs:integer"> <xs:annotation> <xs:appinfo> <db2-xdb:rowSetMapping> <db2-xdb:rowSet>ADDRESS</db2-xdb:rowSet> <db2-xdb:column>CID</db2-xdb:column> </db2-xdb:rowSetMapping> <db2-xdb:rowSetMapping> <db2-xdb:rowSet>PHONES</db2-xdb:rowSet> <db2-xdb:column>CID</db2-xdb:column> </db2-xdb:rowSetMapping> </xs:appinfo> </xs:annotation> </xs:attribute> </xs:complexType> </xs:element> </xs:schema> |
11.3.2 Defining Schema Annotations Visually in IBM Data Studio
You can add annotations to an XML Schema manually, using any text editor or XML Schema editor. Alternatively, you can use the Annotated XSD Mapping Editor in IBM Data Studio Developer. To invoke the editor, right-click on an XML Schema name and select Open With, Annotated XSD Mapping Editor. A screenshot of the mapping editor is shown in Figure 11.9. The left side of the editor shows the hierarchical document structure defined by the XML Schema (Source). The right side shows the tables and columns of the relational target schema (Target). You can add mapping relationships by connecting source items with target columns. There is also a discover function to find probable relationships. Mapped relationships are represented in the mapping editor by lines drawn between source elements and target columns.
Figure 11.9 Annotated XSD Mapping Editor in Data Studio Developer
11.3.3 Registering an Annotated Schema
After you have created your annotated XML Schema you need to register it in the XML Schema Repository of the database. DB2's XML Schema Repository is described in detail in Chapter 16, Managing XML Schemas. For the annotated schema in Figure 11.8 it is sufficient to issue the REGISTER XMLSCHEMA command with its COMPLETE and ENABLE DECOMPOSITION options as shown in Figure 11.10. In this example the XML Schema is assumed to reside in the file /xml/myschemas/cust2.xsd. Upon registration it is assigned the SQL identifier db2admin.cust2xsd. This identifier can be used to reference the schema later. The COMPLETE option of the command indicates that there are no additional XML Schema documents to be added. The option ENABLE DECOMPOSITION indicates that this XML Schema can be used not only for document validation but also for shredding.
Figure 11.10. Registering an annotated XML schema
REGISTER XMLSCHEMA 'http://pureXMLcookbook.org' FROM '/xml/myschemas/cust2.xsd' AS db2admin.cust2xsd COMPLETE ENABLE DECOMPOSITION; |
Figure 11.11 shows that you can query the DB2 catalog view syscat.xsrobjects to determine whether a registered schema is enabled for decomposition (Y) or not (N).
Figure 11.11. Checking the status of an annotated XML schema
SELECT SUBSTR(objectname,1,10) AS objectname, status, decomposition FROM syscat.xsrobjects ; OBJECTNAME STATUS DECOMPOSITION ---------- ------ ------------- CUST2XSD C Y |
The DECOMPOSITION status of an annotated schema is automatically changed to X (inoperative) and shredding is disabled, if any of the target tables are dropped or a target column is altered. No warning is issued when that happens and subsequent attempts to use the schema for shredding fail. You can also use the following commands to disable and enable an annotated schema for shredding:
ALTER XSROBJECT cust2xsd DISABLE DECOMPOSITION; ALTER XSROBJECT cust2xsd ENABLE DECOMPOSITION;
11.3.4 Decomposing One XML Document at a Time
After you have registered and enabled the annotated XML Schema you can decompose XML documents with the DECOMPOSE XML DOCUMENT command or with a built-in stored procedure. The DECOMPOSE XML DOCUMENT command is convenient to use in the DB2 Command Line Processor (CLP) while the stored procedure can be called from an application program or the CLP. The CLP command takes two parameters as input: the filename of the XML document that is to be shredded and the SQL identifier of the annotated schema, as in the following example:
DECOMPOSE XML DOCUMENT /xml/mydocuments/cust01.xml XMLSCHEMA db2admin.cust2xsd VALIDATE;
The keyword VALIDATE is optional and indicates whether XML documents should be validated against the schema as part of the shredding process. While shredding, DB2 traverses both the XML document and the annotated schema and detects fundamental schema violations even if the VALIDATE keyword is not specified. For example, the shredding process fails with an error if a mandatory element is missing, even if this element is not being shredded and the VALIDATE keyword is omitted. Similarly, extraneous elements or data type violations also cause the decomposition to fail. The reason is that the shredding process walks through the annotated XML Schema and the instance document in lockstep and therefore detects many schema violations "for free" even if the XML parser does not perform validation.
To decompose XML documents from an application program, use the stored procedure XDBDECOMPXML. The parameters of this stored procedure are shown in Figure 11.12 and described in Table 11.6.
Figure 11.12. Syntax and parameters of the stored procedure XDBDECOMPXML
>>-XDBDECOMPXML--(--rschema--,--xmlschemaname--,--xmldoc--,----> >--documentid--,--validation--,--reserved--,--reserved--,------> >--reserved--)------------------------------------------------>< |
Table 11.6. Description of the Parameters of the Stored Procedure XDBDECOMPXML
Parameter |
Description |
rschema |
The relational schema part of the two-part SQL identifier of the annotated XML Schema. For example, if the SQL identifier of the XML Schema is db2admin.cust2xsd, then you should pass the string 'db2admin' to this parameter. In DB2 for z/OS this value must be either 'SYSXSR' or NULL. |
xmlschemaname |
The second part of the two-part SQL identifier of the annotated XML Schema. If the SQL identifier of the XML Schema is db2admin.cust2xsd, then you pass the string 'cust2xsd' to this parameter. This value cannot be NULL. |
xmldoc |
In DB2 for Linux, UNIX, and Windows, this parameter is of type BLOB(1M) and takes the XML document to be decomposed. In DB2 for z/OS this parameter is of type CLOB AS LOCATOR. This parameter cannot be NULL. |
documentid |
A string that the caller can use to identify the input XML document. The value provided will be substituted for any use of $DECOMP_DOCUMENTID specified in the db2-xdb:expression or db2-xdb:condition annotations. |
validation |
Possible values are: 0 (no validation) and 1 (validation is performed). This parameter does not exist in DB2 for z/OS. |
reserved |
Parameters reserved for future use. The values passed for these arguments must be NULL. These parameters do not exist in DB2 for z/OS. |
A Java code snippet that calls the stored procedure using parameter markers is shown in Figure 11.13
Figure 11.13. Java code that invokes the stored procedure XDBDECOMPXML
CallableStatement callStmt = con.prepareCall( "call SYSPROC.XDBDECOMPXML(?,?,?,?,?, null, null, null)"); File xmldoc = new File("c:\mydoc.xml"); FileInputStream xmldocis = new FileInputStream(xmldoc); callStmt.setString(1, "db2admin" ); callStmt.setString(2, "cust2xsd" ); // document to be shredded: callStmt.setBinaryStream(3,xmldocis,(int)xmldoc.length() ); callStmt.setString(4, "mydocument26580" ); // no schema validation in this call: callStmt.setInt(5, 0); callStmt.execute(); |
While the input parameter for XML documents is of type CLOB AS LOCATOR in DB2 for z/OS, it is of type BLOB(1M) in DB2 for Linux, UNIX, and Windows. If you expect your XML documents to be larger than 1MB, use one of the stored procedures listed in Table 11.7. These stored procedures are all identical except for their name and the size of the input parameter xmldoc. When you call a stored procedure, DB2 allocates memory according to the declared size of the input parameters. For example, if all of your input documents are at most 10MB in size, the stored procedure XDBDECOMPXML10MB is a good choice to conserve memory.
Table 11.7. Stored Procedures for Different Document Sizes (DB2 for Linux, UNIX, and Windows)
Stored Procedure |
Document Size |
Supported since |
XDBDECOMPXML |
≤1MB |
DB2 9.1 |
XDBDECOMPXML10MB |
≤10MB |
DB2 9.1 |
XDBDECOMPXML25MB |
≤25MB |
DB2 9.1 |
XDBDECOMPXML50MB |
≤50MB |
DB2 9.1 |
XDBDECOMPXML75MB |
≤75MB |
DB2 9.1 |
XDBDECOMPXML100MB |
≤100MB |
DB2 9.1 |
XDBDECOMPXML500MB |
≤500MB |
DB2 9.5 FP3 |
XDBDECOMPXML1GB |
≤1GB |
DB2 9.5 FP3 |
XDBDECOMPXML1_5GB |
≤1.5GB |
DB2 9.7 |
XDBDECOMPXML2GB |
≤2GB |
DB2 9.7 |
For platform compatibility, DB2 for z/OS supports the procedure XDBDECOMPXML100MB with the same parameters as DB2 for Linux, UNIX, and Windows, including the parameter for validation.
11.3.5 Decomposing XML Documents in Bulk
DB2 9.7 for Linux, UNIX, and Windows introduces a new stored procedure called XDB_DECOMP_XML_FROM_QUERY. It uses an annotated schema to decompose one or multiple XML documents selected from a column of type XML, BLOB, or VARCHAR FOR BIT DATA. The main difference to the procedure XDBDECOMPXML is that XDB_DECOMP_XML_FROM_QUERY takes an SQL query as a parameter and executes it to obtain the input documents from a DB2 table. For a large number of documents, a LOAD operation followed by a "bulk decomp" can be more efficient than shredding these documents with a separate stored procedure call for each document. Figure 11.14 shows the parameters of this stored procedure. The parameters commit_count and allow_access are similar to the corresponding parameters of DB2's IMPORT utility. The parameters total_docs, num_docs_decomposed, and result_report are output parameters that provide information about the outcome of the bulk shredding process. All parameters are explained in Table 11.8.
Figure 11.14. The stored procedure XDB_DECOMP_XML_FROM_QUERY
>>--XDB_DECOMP_XML_FROM_QUERY--(--rschema--,--xmlschema--,--> >--query--,--validation--,--commit_count--,--allow_access--,----> >--reserved--,--reserved2--,--continue_on_error--,--------------> >--total_docs--,--num_docs_decomposed--,--result_report--)-->< |
Table 11.8. Parameters for XDB_DECOMP_XML_FROM_QUERY
Parameter |
Description |
rschema |
Same as for XDBDECOMPXML. |
xmlschema |
Same as xmlschemaname for XDBDECOMPXML. |
query |
A query string of type CLOB(1GB), which cannot be NULL. The query must be an SQL or SQL/XML SELECT statement and must return two columns. The first column must contain a unique document identifier for each XML document in the second column of the result set. The second column contains the XML documents to be shredded and must be of type XML, BLOB, VARCHAR FOR BIT DATA, or LONG VARCHAR FOR BIT DATA. |
validation |
Possible values are: 0 (no validation) and 1 (validation is performed). |
commit_count |
An integer value equal to or greater than 0. A value of 0 means the stored procedure does not perform any commits. A value of n means that a commit is performed after every n successful document decompositions. |
allow_access |
A value of 1 or 0. If the value is 0, then the stored procedure acquires an exclusive lock on all tables that are referenced in the annotated XML Schema. If the value is 1, then the stored procedure acquires a shared lock. |
reserved, reserved2 |
These parameters are reserved for future use and must be NULL. |
continue_on _error |
Can be 1 or 0. A value of 0 means the procedure stops upon the first document that cannot be decomposed; for example, if the document does not match the XML Schema. |
total_docs |
An output parameter that indicates the total number of documents that the procedure tried to decompose. |
num_docs_ decomposed |
An output parameter that indicates the number of documents that were successfully decomposed. |
result_report |
An output parameter of type BLOB(2GB). It contains an XML document that provides diagnostic information for each document that was not successfully decomposed. This report is not generated if all documents shredded successfully. The reason this is a BLOB field (rather than CLOB) is to avoid codepage conversion and potential truncation/data loss if the application code page is materially different from the database codepage. |
Figure 11.15 shows an invocation of the XDB_DECOMP_XML_FROM_QUERY stored procedure in the CLP. This stored procedure call reads all XML documents from the info column of the customer table and shreds them with the annotated XML Schema db2admin.cust2xsd. The procedure commits every 25 documents and does not stop if a document cannot be shredded.
Figure 11.15. Calling the procedure SYSPROC.XDB_DECOMP_XML_FROM_QUERY
call SYSPROC.XDB_DECOMP_XML_FROM_QUERY ('DB2ADMIN', 'CUST2XSD', 'SELECT cid, info FROM customer', 0, 25, 1, NULL, NULL, '1',?,?,?) ; Value of output parameters -------------------------- Parameter Name : TOTALDOCS Parameter Value : 100 Parameter Name : NUMDOCSDECOMPOSED Parameter Value : 100 Parameter Name : RESULTREPORT Parameter Value : x'' Return Status = 0 |
If you frequently perform bulk shredding in the CLP, use the command DECOMPOSE XML DOCUMENTS instead of the stored procedure. It is more convenient for command-line use and performs the same job as the stored procedure XDB_DECOMP_XML_FROM_QUERY. Figure 11.16 shows the syntax of the command. The various clauses and keywords of the command have the same meaning as the corresponding stored procedure parameters. For example, query is the SELECT statement that provides the input documents, and xml-schema-name is the two-part SQL identifier of the annotated XML Schema.
Figure 11.16. Syntax for the DECOMPOSE XML DOCUMENTS command
>>-DECOMPOSE XML DOCUMENTS IN----'query'----XMLSCHEMA-------> .-ALLOW NO ACCESS-. >--xml-schema-name--+----------+--+-----------------+-----------> '-VALIDATE-' '-ALLOW ACCESS----' >--+----------------------+--+-------------------+--------------> '-COMMITCOUNT--integer-' '-CONTINUE_ON_ERROR-' >--+--------------------------+-------------------------------->< '-MESSAGES--message-file-' |
Figure 11.17 illustrates the execution of the DECOMPOSE XML DOCUMENTS command in the DB2 Command Line Processor.
Figure 11.17. Example of the DECOMPOSE XML DOCUMENTS command
DECOMPOSE XML DOCUMENTS IN 'SELECT cid, info FROM customer' XMLSCHEMA db2admin.cust2xsd MESSAGES decomp_errors.xml ; DB216001I The DECOMPOSE XML DOCUMENTS command successfully decomposed all "100" documents. |
If you don't specify a message-file then the error report is written to standard output. Figure 11.18 shows a sample error report. For each document that failed to shred, the error report shows the document identifier (xdb:documentId). This identifier is obtained from the first column that is produced by the SQL statement in the DECOMPOSE XML DOCUMENTS command. The error report also contains the DB2 error message for each document that failed. Figure 11.18 reveals that document 1002 contains an unexpected XML attribute called status, and that document 1005 contains an element or attribute value abc that is invalid because the XML Schema expected to find a value of type xs:integer. If you need more detailed information on why a document is not valid for a given XML Schema, use the stored procedure XSR_GET_PARSING_DIAGNOSTICS, which we discuss in section 17.6, Diagnosing Validation and Parsing Errors.
Figure 11.18. Sample error report from bulk decomp
<?xml version='1.0' ?> <xdb:errorReport xmlns:xdb="http://www.ibm.com/xmlns/prod/db2/xdb1"> <xdb:document> <xdb:documentId>1002</xdb:documentId> <xdb:errorMsg>SQL16271N Unknown attribute "status" at or near line "1" in document "1002".</xdb:errorMsg> </xdb:document> <xdb:document> <xdb:documentId>1005</xdb:documentId> <xdb:errorMsg> SQL16267N An XML value "abc" at or near line "1" in document "1005" is not valid according to its declared XML schema type "xs:integer" or is outside the supported range of values for the XML schema type </xdb:errorMsg> </xdb:document> </xdb:errorReport> |