SQL Server Reference Guide

Mar 28, 2003

␡

⎙ Print

< Back Page 317 of 365 Next >

In the previous article in this series, I explained a little about the latest IT buzzword: NoSQL. I described where it came from, and why the data professional, specifically the SQL Server professional needs to care about it.

No doubt your developers are already using a NoSQL offering, even if they are only testing it out. It's best for you to understand where it does and does not fit. In that last article I explained its primary strengths and weaknesses. Most NoSQL excels at ease of programming for the developer, and does a good job of horizontally scaling with distributed systems. Where NoSQL has some disadvantages is in the ACID properties it maintains and the Programmatic Referential Integrity requirements. Again, for many developers this is acceptable.

I'll have to start out this article as I did the last one — with a caveat. There are so many NoSQL implementations and the term has such a loose meaning that the moment this article is published it will be out of date. Rather than let that happen, I'll point to a location that keeps up on the latest implementations for a more authoritative list. What I will do here is break the NoSQL offerings into the largest, most popular categories — at least as of this writing. You can take the information here and use it to apply to the latest products that exist in the NoSQL World.

As before, since there is no official governing body for the NoSQL products, each is left to define itself as NoSQL or something else. That means some of the information in this overview might not be applicable to a given product, but the concepts should apply at least broadly.

NoSQL Uses and Implementations

NoSQL databases, because of the distributed nature of their architectures (see my last article in this series) do well in column-style heavy read operations involving a search. They can distribute the read operations across multiple nodes in the system and filter the data accordingly.

Also, some kinds of writes — especially small, few-columned writes (such as a Tweet in Twitter) do well with a NoSQL implementation.

Although some folks throw around the term "Large" data as a use-case for NoSQL, I think that requires a definition. What is considered "large" data one year is a normal size of data in many organizations the next. And SQL Server, Oracle and other RDBMS systems routinely store Petabytes of data, which at the moment most folks consider a large data set.

The large data argument normally involves the locking mechanism and transactions. In an RDBMS, to ensure that someone isn't selling inventory while another person is selling the same item, locks are enforced by the RDBMS platform. In NoSQL implementations — some of them, anyway — locks are relaxed, or managed at the program level. This allows greater scale, but can also create "Eventual Consistency", meaning that a time-stamp is used to overwrite the latest datum. Sometimes the distributed nature means that the system might need to "catch up" as the reads happen faster than the writes. Again, in narrower columns and quick writes, this isn't always an issue.

NoSQL implementations handle indexing differently, depending on the product. In some cases the indexes are easier to design and deal with, and in others they are more difficult.

At the risk of over-generalizing, I'll categorize the NoSQL products into three large groups: Key/Value, Document, and Graph-Oriented databases. I'll mention a fourth which is certainly not an SQL implementation, but not always talked about as a NoSQL offering.

For the comparison, it works well for me to think about the storage arrangements of data. In a Relational system, a simple set of table relationships might look something like this for a names-to-addresses setup:

Name NameID, PK, INT FName, VARCHAR(50) LName, VARCHAR(50

Address AddressID, PK, INT NameID, FK, INT Street, VARCHAR(150) City, VARCHAR(50) State, CHAR(2) Zip, VARCHAR(12) DateStart, DATETIME

In an RDBMS we would set up a foreign key from Address that points to the Primary Key in Name, allowing us to have multiple addresses for a single person, in this case, like this:

Name 1234 Buck Woody

Address 1 1234 784 Overton Lane Seattle WA 98052 07/07/2005

2 1234 453 Bellevue Square Melbourne FL 32935 06/13/2000

Using the second value as the "key" that points back, we relate the data, forming the heart of an RDBMS system. I'll use that same example as we go.

Key/Value Pair Storage — Columnar Databases

This type of NoSQL offering is exactly what its name implies — there is a value used as a "Key" (like our Primary Keys), but a generic container called the "Value" holds everything else. It's as if there are only two columns in a database, and data is available only for one.

The interesting thing is that the "Key" serves not only as a unique indicator of some kind, but often is the way the data is physically partitioned or laid out on disk. That means you can get incredible performance by merely using a lot of Key values — hundreds or thousands of them.

So one possible structure for the data above is:

NameTable Key: 1234 Value: ["Buck Woody", "784 Overton Lane, Seattle, WA 98052, 07/07/2005'] [453 Bellevue Square, Melbourne, FL, 32935, 06/13/2000]

Sometimes the items I show separated here by commas have a prefix format and even other mechanisms. In most cases the programming language used to obtain the data defines the data store format.

Although a formal "JOIN" operation isn't always that common in a NoSQL database, it is possible. I've worked with some that defined the data above this way:

Key: 1234 Value: ["Buck Woody", "784 Overton Lane, Seattle, WA 98052, 06/13/2000] Key: 1456 Value: ["Buck Woody", 453 Bing Place, 07/12/2005]

Using functions found in some language constructs, the two data sets are brought back using an associative array. I this case, it used the name (a bad choice, in my opinion) to show that Buck Woody moved from one street to another in Seattle on the 12th of July in 2005.

Using this construct, you would have something that looks more like a Dimension Table in a Star Schema. If you used multiple values similar to the first example, it feels more like a de-normalized table in SQL Server. And of course other structures are possible.

I'll show you where to find some examples of Key/Value NoSQL databases at the end of this article.

Document Structures

http://en.wikipedia.org/wiki/Document-oriented_database

Document Databases deal heavily with semi-structured data. Rather than working with small, discrete columns, these databases can also deal with larger sets of even binary information. But they shine best as a place to store differing "columns" within multiple "rows" — even when they are not structured that way. In fact, XML Databases are often referred to as "Document-Oriented Databases".

The structure in this example might look something like this:

Name="BuckWoody", Address="784 Overton Lane, Seattle, WA 98052", Profession=[{Name:"Computer Technology",YearsExperience:25}, {Name:"Marjorie", YearsExperience:15}, {Name:"Christina Woody", Address="784 Overton Lane, Seattle, WA 98052"}, {Name:"Bill Wilson", Age:27}

Notice how the structure of the data "wanders" all over the place, even in line with a single element. This is probably the most confusing structure to an RDBMS data professional.

Most often these databases are accessed using JSON. More on that here.

Graph Oriented Data

http://en.wikipedia.org/wiki/Graph_database

While an RDBMS follows a relational calculus from math, Graph databases follow a different mathematical model. This model deals with structures called Nodes or sometimes Vertices, and the connections between them are called Edges.

A Graph database is useful for a quick routing lookup structure, especially when data has multiple relationships to each other. The interesting thing is that you can quickly query via the relationship or the data. In other words, you're able to quickly find out how many relationships a datum has, or how much data is attached to a relationship. Although not a perfect comparison, you can think of the functions you can perform on a HierarchyID data type in SQL Server. In fact, Graph database Nodes and Edges are very similar to the HierarchyID examples.

You'll deal much more with relationships in this type of NoSQL offering, but not in the same way you would with an RDBMS. For our example, the structure might look like this:

GraphObject Name = createGraphObject(); // create a node called GraphObject GraphObject Address = createGraphObject(); // create second node Name.relate( Address ); // creates the edge or relationship between nodes

Now to work with this data, the pseudo-code looks something like this:

Name.traverseToNeighbors().getSingleMember(); // returns a single address Address.traverseToNeighbors().getSingleMember(); // returns a single name

Working with Graph databases you not only get good relationship management, you can also have a better control over integrity in the data, since the functions handle the transaction-like processing.

Object-Oriented Databases

Most research I've done does not place an Object-Oriented Database (OODB) in the NoSQL camp, and I tend to agree with that categorization. The reason is that for the most part, an OODB can be implemented by simply combining Object-Oriented Programming (OOP) over almost any data store or structure. There are full OODB systems in use, but they are not ubiquitous. I mention it here only for completeness, and to show that I don't think it belongs in the NoSQL category. You can learn more about OODB's here, so I won't cover the concepts in great depth.

You'll find that in an OODB the data structure is similar (sometimes identical) to programming classes. For that reason, developers can usually navigate and work with these structures more quickly and easily than having to map out the structures from an RDBMS.

One of the most interesting things I've found in an OODB is the locking mechanisms. Because each data element is an object, they can be grouped in multiple ways, and then locked accordingly. This allows a great mix of low and high level locking based on the need of the program, but can also cause clashes between program elements requiring access to the data. For those reasons, and because of the increased programmatic complexity, I sometimes see a much more relaxed level of "Dirty Reads" — meaning that developers sometimes only lock an object or series of objects for writes. This is something to be aware of when your developers look into this paradigm, and one of the reasons I include OODB in this article.

NoSQL Resources

I'm leery of using Wikipedia as a standard resource tool, but in this case it is useful. You can start here: http://en.wikipedia.org/wiki/NoSQL

In as much as there is a standard body of work around the NoSQL offerings, this one seems to be the most up to date: http://nosql-database.org/

Even Windows Azure Table Storage can be used in a NoSQL like fashion. More on that here: http://blog.smarx.com/posts/sample-code-for-batch-transactions-in-windows-azure-tables

< Back Page 317 of 365 Next >

🔖 Save To Your Account

InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

Email Address