NoSQL for Mere Mortals: Designing for Document Databases
- Normalization, Denormalization, and the Search for Proper Balance
- Planning for Mutable Documents
- The Goldilocks Zone of Indexes
- Modeling Common Relations
- Summary
- Case Study: Customer Manifests
- Review Questions
- References
“Making good decisions is a crucial skill at every level.”
—PETER DRUCKER, AUTHOR AND MANAGEMENT CONSULTANT
Topics Covered In This Chapter
- Normalization, Denormalization, and the Search for Proper Balance
- Planning for Mutable Documents
- The Goldilocks Zone of Indexes
- Modeling Common Relations
- Case Study: Customer Manifests
Designers have many options when it comes to designing document databases. The flexible structure of JSON and XML documents is a key source of this freedom. If a designer wants to embed lists within lists within a document, she can. If another designer wants to create separate collections for different types of data, he can. This freedom should not be construed to mean all data models are equally good—they are not.
The goal of this chapter is to help you understand ways of assessing document database models and choosing the best techniques for your needs.
Relational database designers can reference rules of normalization to help them assess data models. A typical relational data model is designed to avoid data anomalies when inserts, updates, or deletes are performed. For example, if a database maintained multiple copies of a customer’s current address, it is possible that one or more of those addresses would be updated while others are not. In that case, which of the stored addresses is actually the current one?
In another case, if you do not store customer information separately from the customer’s orders, then all records of the customer could be deleted if all her orders are deleted. The rules for avoiding these anomalies are logical and easy to learn from example.
In this chapter, you learn about normalization and denormalization and how it applies to document database modeling. You also learn about the impact of updating documents, especially when the size of documents changes. Indexes can significantly improve query response times, but this must be balanced against the extra time that is needed to update indexes when documents are inserted or updated. Several design patterns have emerged in the practice of document database design. These are introduced and discussed toward the end of the chapter.
This chapter concludes with a case study covering the use of a document database for tracking the contents of shipments made by the fictitious transportation company introduced in earlier chapters.
Normalization, Denormalization, and the Search for Proper Balance
Unless you have worked with relational databases, you probably would not guess that normalization has to do with eliminating redundancy. Redundant data is considered a bad, or at least undesirable, thing in the theory of relational database design. Redundant data is the root of anomalies, such as two current addresses when only one is allowed.
In theory, a data modeler will want to eliminate redundancy to minimize the chance of introducing anomalies. As Albert Einstein observed, “In theory, theory and practice are the same. In practice, they are not.” There are times when performance in relational databases suffers because of the normalized model. Consider the data model shown in Figure 8.1.
Figure 8.1 Normalized databases have separate tables for entities. Data about entities is isolated and redundant data is avoided.
Figure 8.1 depicts a simple normalized model of customers, orders, and products. Even this simple model requires ten tables to capture a basic set of data about the entities. These include the following:
- Customers table with fields such as name, customer ID, and so on
- Loyalty Program Members, with fields such as date joined, amount spent since joining, and customer ID
- Customer Addresses, with fields such as street, city, state, start date, end date, and customer ID
- Customer Credit Histories, with fields such as credit category, start date, end date, and customer ID
- Orders, with fields such as order ID, customer ID, ship date, and so on
- Order Items, with fields such as order ID, order item ID, product ID, quantity, cost, and so on
- Products, with fields such as product ID, product name, product description, and so on
- Daily Inventory Levels, with fields such as product ID, date, quantity available, and so on
- Promotions, with fields such as promotion ID, promotion description, start date, and so on
- Promotion to Customers, with fields such as promotion ID and customer ID
Each box in Figure 8.1 represents an entity in the data model. The lines between entities indicate the kind of relationship between the entities.
One-to-Many Relations
When a single line ends at an entity, at most one row of that entity participates in a given instance of the relationship. When three branching lines (a crow’s foot) end at an entity, one or more rows of that entity can participate. For example, the relation between Customers and Orders indicates that a customer can have one or more orders, but only one customer is associated with each order.
This kind of relation is called a one-to-many relationship.
Many-to-Many Relations
Now consider the relation between Customers and Promotions. There are branching lines at both ends of the relationship. This indicates that customers can have many promotions associated with them. It also means that promotions can have many customers related to them. For example, a customer might receive promotions that are targeted to all customers in their geographic area as well as promotions targeted to the types of products the customer buys most frequently.
Similarly, a promotion will likely target many customers. The sales and marketing team might create promotions designed to improve the sale of headphones by targeting all customers who bought new phones or tablets in the past three months. The team might have a special offer on Bluetooth speakers for anyone who bought a laptop or desktop computer in the last year. Again, there will be many customers in this category (at least the sales team hopes so), so there will be many customers associated with this promotion.
These types of relations are known as many-to-many relationships.
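For illustration, relational designs typically realize a many-to-many relationship with a junction table such as the Promotion to Customers table from Figure 8.1. Each row pairs one promotion with one customer, so a promotion can appear in many rows and so can a customer. The sample rows below use invented IDs:

    # Hypothetical rows from the Promotion to Customers junction table.
    promotion_to_customers = [
        {"promotion_ID": 101, "customer_ID": 5001},
        {"promotion_ID": 101, "customer_ID": 5002},  # same promotion, another customer
        {"promotion_ID": 202, "customer_ID": 5001},  # same customer, another promotion
    ]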
The Need for Joins
Developers of applications using relational databases often have to work with data from multiple tables. Consider the Order Items and Products entities shown in Figure 8.2.
Figure 8.2 Products and Order Items are in a one-to-many relationship. To retrieve Product data about an Order Item, the two tables need to share an attribute that serves as a common reference. In this case, Product_ID is the shared attribute.
If you were designing a report that lists an order with all the items on the order, you would probably need to include attributes such as the name of the product, the cost per unit, and the quantity. The name of the product is in the Product table, and the other two attributes are in the Order Items table (see Figure 8.3).
Figure 8.3 To be joined, tables must share a common value known as a foreign key.
In relational databases, modelers often start with designs like the one you saw earlier in Figure 8.1. Normalized models such as this minimize redundant data and avoid the potential for data anomalies. Document database designers, however, often try to store related data together in the same document. This would be equivalent to storing related data in one table of a relational database. You might wonder why data modelers choose different approaches to their design. It has to do with the trade-offs between performance and potential data anomalies.
To understand why normalizing data models can adversely affect performance, let’s look at an example with multiple joins.
Executing Joins: The Heavy Lifting of Relational Databases
Imagine you are an analyst and you have decided to develop a promotion for customers who have bought electronic accessories in the past 12 months. The first thing you want to do is understand who those customers are, where they live, and how often they buy from your business. You can do this by querying the Customer table.
You do not want all customers, though—just those who have bought electronic accessories. That information is not stored in the Customer table, so you look to the Orders table. The Orders table has some information you need, such as the date of purchase. This enables you to filter for only orders made in the past 12 months.
The Orders table, however, does not have information on electronic accessories, so you look to the Order Items table. This does not have the information you are looking for, so you turn to the Products table. Here, you find the information you need. The Products table has a column called Product_Category, which indicates if a product is an electronic accessory or some other product category. You can use this column to filter for electronic accessory items.
At this point, you have all the data you need. The Customer table has information about customers, such as their names and customer IDs. The Orders table has order date information, so you can select only orders from the past 12 months. It also allows you to join to the Order_Items table, which can tell you which orders contained products in the electronic accessories category. The category information is not directly available in the Order_Items table, but you can join the Order_Items table to the Products table to get the product category (see Figure 8.4).
Figure 8.4 Analyzing customers who bought a particular type of product requires three joins between four tables.
To get a sense of how much work is involved in joining tables, let’s consider pseudocode for printing the names of customers who have purchased electronic accessories in the last 12 months:

    customer_set = empty_set()
    for cust in get_customers():
        for order in get_customer_orders(cust.customer_id):
            if today() - 365 <= order.order_date:
                for order_item in get_order_items(order.order_id):
                    if get_product_category(order_item.product_id) == 'electronic accessories':
                        customer_set = add_item(customer_set, cust.name)

    for customer_name in customer_set:
        print customer_name
In this example, the functions get_customers, get_customer_orders, and get_order_items return a list of rows. In the case of get_customers(), all customers are returned.
Each time get_customer_orders is called, it is given a customer_id. Only orders with that customer ID are returned. Each time get_order_items is called, it is given an order_id. Only order items with that order_id are returned.
The dot notation indicates a field in the row returned. For example, order.order_date returns the order_date on a particular order. Similarly, cust.name returns the name of the customer currently referenced by the cust variable.
Executing Joins Example
Now to really see how much work is involved, let’s walk through an example. Let’s assume there are 10,000 customers in the database. The first for loop will execute 10,000 times. Each time it executes, it will look up all orders for the customer. If each of the 10,000 customers has, on average, 10 orders, then the for order loop will execute 100,000 times. Each time it executes, it will check the order date.
Let’s say 20,000 of those orders were placed in the last year. The for order_item loop will then run once for each of those 20,000 orders, iterating over each order’s items. For each item, the code checks the product category and adds the customer’s name to the set of customer names if the item is an electronic accessory.
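A quick tally of that work, using the numbers from this walkthrough (a rough sketch; the per-item counts are omitted because they depend on order size):

    # Rough operation counts for the nested-loop join above.
    customers = 10_000
    avg_orders = 10
    date_checks = customers * avg_orders   # 100,000 order-date comparisons
    recent_orders = 20_000                 # orders placed in the last year
    item_lookups = recent_orders           # one get_order_items call per order
    print(date_checks, item_lookups)       # 100000 20000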
Looping through rows of tables and looking for matches is one—rather inefficient—way of performing joins. The performance of this join could be improved. For example, indexes could be used to more quickly find all orders placed within the last year. Similarly, indexes could be used to find the products that are in the electronic accessory category.
Databases implement query optimizers to come up with the best way of fetching and joining data. In addition to using indexes to narrow down the number of rows they have to work with, they may use other techniques to match rows. They could, for example, calculate hash values of foreign keys to quickly determine which rows have matching values.
The query optimizer may also sort rows first and then merge rows from multiple tables more efficiently than if the rows were not sorted. These techniques can work well in some cases and not in others. Database researchers and vendors have made advances in query optimization techniques, but executing joins on large data sets can still be time consuming and resource intensive.
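For instance, the hash-based technique can be sketched in a few lines of Python. This is a simplified, in-memory illustration of a hash join, not how any particular database engine implements it; all names and sample rows are invented:

    from collections import defaultdict

    def hash_join(left_rows, right_rows, left_key, right_key):
        # Build phase: index the left rows by their join-key value.
        index = defaultdict(list)
        for row in left_rows:
            index[row[left_key]].append(row)
        # Probe phase: for each right row, emit every matching left row.
        for row in right_rows:
            for match in index.get(row[right_key], []):
                yield {**match, **row}

    products = [{"product_id": 3648, "product_name": "Eco-friendly Printer Paper"}]
    order_items = [{"order_item_id": 834838, "product_id": 3648, "quantity": 3}]

    for joined in hash_join(products, order_items, "product_id", "product_id"):
        print(joined)

Because each probe is a hash lookup rather than a scan, the cost grows roughly with the number of rows in the two tables rather than with their product.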
What Would a Document Database Modeler Do?
Document data modelers have a different approach to data modeling than most relational database modelers. Document database modelers and application developers are probably using a document database for its scalability, its flexibility, or both. For those using document databases, avoiding data anomalies is still important, but they are willing to assume more responsibility to prevent them in return for scalability and flexibility.
For example, if there are redundant copies of customer addresses in the database, an application developer could implement a customer address update function that updates all copies of an address. She would always use that function to update an address to avoid introducing a data anomaly. As you can see, developers will write more code to avoid anomalies in a document database, but will have less need for database tuning and query optimization in the future.
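A sketch of what such a function might look like, assuming a MongoDB-style Python driver and hypothetical collection and field names:

    def update_customer_address(db, customer_id, new_address):
        # Update the master copy in the customers collection.
        db.customers.update_one(
            {"customer_id": customer_id},
            {"$set": {"address": new_address}},
        )
        # Update every denormalized copy embedded in the customer's orders.
        db.orders.update_many(
            {"customer_id": customer_id},
            {"$set": {"shipping_address": new_address}},
        )

As long as every part of the application routes address changes through this one function, the redundant copies cannot drift apart.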
So how do document data modelers and application developers get better performance? They minimize the need for joins. This process is known as denormalization. The basic idea is that data models should store data that is used together in a single data structure, such as a table in a relational database or a document in a document database.
The Joy of Denormalization
To see the benefits of denormalization, let’s start with a simple example: order items and products. Recall that the Order_Items entity had the following attributes:
- order_item_ID
- order_id
- quantity
- cost_per_unit
- product_id
The Products entity has the following attributes:
- product_ID
- product_description
- product_name
- product_category
- list_price
An example of an order items document is
    {
        order_item_ID: 834838,
        order_ID: 8827,
        quantity: 3,
        cost_per_unit: 8.50,
        product_ID: 3648
    }
An example of a product document is
    {
        product_ID: 3648,
        product_description: "1 package laser printer paper. 100% recycled.",
        product_name: "Eco-friendly Printer Paper",
        product_category: "office supplies",
        list_price: 9.00
    }
If you implemented two collections and maintained these separate documents, then you would have to query the order items collection for the order item you were interested in and then query the products collection for the document describing product_ID 3648. You would perform two lookups to get the information you need about one order item.
By denormalizing the design, you could create a collection of documents that would require only one lookup operation. A denormalized version of the order item collection would have, for example:
    {
        order_item_ID: 834838,
        order_ID: 8827,
        quantity: 3,
        cost_per_unit: 8.50,
        product: {
            product_description: "1 package laser printer paper. 100% recycled.",
            product_name: "Eco-friendly Printer Paper",
            product_category: "office supplies",
            list_price: 9.00
        }
    }
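A minimal sketch of the difference, assuming a MongoDB-style Python driver and the documents above:

    # Normalized design: two lookups to assemble one order item's details.
    item = db.order_items.find_one({"order_item_ID": 834838})
    product = db.products.find_one({"product_ID": item["product_ID"]})

    # Denormalized design: a single lookup returns everything.
    item = db.order_items.find_one({"order_item_ID": 834838})
    print(item["product"]["product_name"])   # "Eco-friendly Printer Paper"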
Avoid Overusing Denormalization
Denormalization, like all good things, can be used in excess. The goal is to keep data that is frequently used together in the document. This allows the document database to minimize the number of times it must read from persistent storage, a relatively slow process even when using solid state devices (SSDs). At the same time, you do not want to allow extraneous information to creep into your denormalized collection (see Figure 8.5).
Figure 8.5 Large documents can lead to fewer documents retrieved when a block of data is read from persistent storage. This can increase the total number of data block reads to retrieve a collection or subset of collections.
To answer the question “how much denormalization is too much?” you should consider the queries your application will issue to the document database.
Let’s assume you will use two types of queries: one to generate invoices and packing slips for customers and one to generate management reports. Also, assume that 95% of the queries will be in the invoice and packing slip category and 5% of the queries will be for management reports.
Invoices and packing slips should include, among other fields, the following:
- order_ID
- quantity
- cost_per_unit
- product_name
Management reports tend to aggregate information across groups or categories. For these reports, queries would include product category information along with aggregate measures, such as total number sold. A management report showing the top 25 selling products would likely include a product description.
Based on these query requirements, you might decide it is better not to store product description, list price, and product category in the Order_Items collection. The next version of the Order_Items document would then look like this:
    {
        order_item_ID: 834838,
        order_ID: 8827,
        quantity: 3,
        cost_per_unit: 8.50,
        product_name: "Eco-friendly Printer Paper"
    }
and we would maintain a Products collection with all the relevant product details; for example:
    {
        product_description: "1 package laser printer paper. 100% recycled.",
        product_name: "Eco-friendly Printer Paper",
        product_category: "office supplies",
        list_price: 9.00
    }
Product_name is stored redundantly in both the Order_Items collection and in the Products collection. This model uses slightly more storage but allows application developers to retrieve information for the bulk of their queries in a single lookup operation.
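For example, generating an invoice line now takes a single lookup per item (again assuming a hypothetical MongoDB-style Python driver and the field names above):

    # Every field an invoice line needs comes from one Order_Items query.
    for item in db.order_items.find({"order_ID": 8827}):
        line_total = item["quantity"] * item["cost_per_unit"]
        print(item["product_name"], item["quantity"], line_total)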
Just Say No to Joins, Sometimes
Never say never when designing NoSQL models. There are best practices, guidelines, and design patterns that will help you build scalable and maintainable applications. None of them should be followed dogmatically, especially in the presence of evidence that breaking those best practices, guidelines, or design patterns will give your application better performance, more functionality, or greater maintainability.
If your application requirements are such that storing related information in two or more collections is an optimal design choice, then make that choice. You can implement joins in your application code. A worst-case scenario is joining two large collections with two for loops, such as
    for doc1 in collection1:
        for doc2 in collection2:
            <do something with both documents>
If there are N documents in collection1 and M documents in collection2, the inner statement would execute N × M times. The execution time for such loops grows quickly. If the first collection has 100,000 documents and the second has 500,000 documents, the statement would execute 50,000,000,000 (5 × 10¹⁰) times. If you are dealing with collections this large, you will want to use indexes, filtering, and, in some cases, sorting to optimize your join by reducing the total number of operations performed (see Figure 8.6).
Figure 8.6 Simple join operations that compare all documents in one collection to all documents in another collection can lead to poor performance on large collections. Joins such as this can be improved by using indexes, filtering, and, in some cases, sorting.
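A sketch of that kind of improvement, assuming a MongoDB-style Python driver, indexes on order_date and order_ID, and a hypothetical process function:

    from datetime import datetime, timedelta

    one_year_ago = datetime.utcnow() - timedelta(days=365)

    # Filter first: only orders from the past year (uses the order_date index).
    for order in db.orders.find({"order_date": {"$gte": one_year_ago}}):
        # An indexed lookup on order_ID replaces a scan of the whole collection.
        for item in db.order_items.find({"order_ID": order["order_ID"]}):
            process(order, item)   # hypothetical application logic

Instead of comparing every order to every order item, the application touches only the orders that satisfy the date filter and, for each of those, only its own items.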
Normalization is a useful technique for reducing the chances of introducing data anomalies. Denormalization is also useful, but for (obviously) different reasons. Specifically, denormalization is employed to improve query performance. When using document databases, data modelers and developers often employ denormalization as readily as relational data modelers employ normalization.
There is another less-obvious consideration to keep in mind when designing documents and collections: the potential for documents to change size. Documents that are likely to change size are known as mutable documents.