Data Layer
The data layer is the foundation of your application: it stores all of the application's dynamic information. In most applications, it is split into two parts. The first part is the large, slow storage used for file-like objects and for any data that is too large for a smaller storage system. This is typically a network-attached-storage type of system provided by your cloud hosting provider. In Amazon Web Services, it is called Simple Storage Service, or S3.
The second part of this layer is the small, fast, queryable information. In most systems, this is handled by a database. Cloud-based applications are no different, except in how you host this database.
Introducing the AWS Databases
In Amazon Web Services, you have two ways to host this database. One option is a nonrelational database known as SimpleDB, or SDB, which can be confusing to grasp at first but is generally much cheaper to run and scales automatically. It is currently the cheapest and easiest-to-scale database that Amazon Web Services provides, because you pay only for what you actually use. As such, it can be considered a true cloud service rather than an adaptation built on top of existing cloud services. It scales automatically up to one billion key-value pairs per domain, and you don't have to worry about overusing it because it is built on the same architecture as S3. SDB stores and retrieves data efficiently if you build your application around it, but it does not handle complex queries well. If you can think of your application in simple terms that relate directly to objects, you can most likely use SDB. If, however, you need something more complex, you need to use a Relational DB (RDB).
RDB is Amazon's solution for applications that cannot be built on SDB because they place complex requirements on their database, such as complex reporting, transactions, or stored procedures. If your application needs server-based reports that use complex select queries joining multiple objects, or needs transactions or stored procedures, you probably need to use RDB. This new service, which Amazon offers as the Relational Database Service (RDS), is Amazon's answer to running your own MySQL database in the cloud, and it is nothing more than an Amazon-managed MySQL installation. You can use it if you're comfortable with MySQL, because Amazon manages the database for you and you don't have to worry about any of the IT-level details. It supports cloning, backups, and restores based on snapshots or points in time. In the near future, Amazon is expected to add support for more database engines and to expand the service to support high availability (write clustering) and read-only clustering.
If you can't figure out which solution you need, you can always use both. If you need the flexibility and power of SDB, use it for creating your objects, and then run scripts to push that data to MySQL for reporting purposes. In general, if you can use SDB, you probably should, because it is a lot easier to use.
SDB is organized as a simple three-level hierarchy of domains, items, and key-value pairs. A domain is almost identical to a "database" in a typical relational DB. An item can be thought of as a row in a table that doesn't require any schema, and each item may have multiple key-value pairs below it, which can be thought of as the columns and values of that item. Because SDB is schema-less, it doesn't require you to predefine the keys that can appear under each item, so you can store multiple item types in the same domain. Figure 2.1 illustrates the relation between the three levels.
Figure 2.1 The SDB hierarchy.
In Figure 2.1, the relation between an item and its key-value pairs is one-to-many, so each item can have multiple key-value pairs. Additionally, the keys are not unique, so you can have multiple key-value pairs with the same key, which is essentially the same thing as a key having multiple values.
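The three levels can be pictured with plain Python containers. The following is only an in-memory sketch for illustration and does not talk to SDB; the names domain and put_attribute, and the sample items, are hypothetical.

```python
from collections import defaultdict

# domain -> item name -> key -> list of values (a key may hold several values)
domain = defaultdict(lambda: defaultdict(list))


def put_attribute(item_name, key, value):
    """Add one key-value pair to an item; repeating a key adds another value."""
    domain[item_name][key].append(value)


put_attribute("item_1", "name", "My Object Name")
put_attribute("item_1", "tag", "foo")
put_attribute("item_1", "tag", "bar")      # same key, second value
put_attribute("item_2", "color", "red")    # different keys, same domain

print(dict(domain["item_1"]))
```

Nothing in this sketch declares which keys an item may carry, which is the schema-less behavior described above.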
Connecting to SDB
Connecting to SDB is quite easy using the boto communication library. Assuming you already have your boto configuration environment set up, all you need to do is use the proper connection methods:
>>> import boto
>>> sdb = boto.connect_sdb()
>>> db = sdb.get_domain("my_domain_name")
>>> db.get_item("item_name")
This returns a single item by its name, which is logically equivalent to selecting all attributes by an ID from a standard database. You can also perform simple queries on the database, as shown here:
>>> db.select("SELECT * FROM `my_domain_name` WHERE `name` LIKE '%foo%' ORDER BY `name` DESC")
The preceding example works like a standard relational DB query: it returns all attributes of every item whose name attribute contains foo anywhere in its value, sorted by name in descending order. SDB sorts and compares lexicographically and handles only string values, so it doesn't understand that -2 is less than -1. The SDB documentation provides more details on this query language for more complex requests.
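The effect of lexicographical comparison is easy to reproduce with ordinary Python strings. This is a small illustration only; it does not query SDB, and the padding width of 5 is an arbitrary choice made for the example.

```python
# Strings compare character by character, not by numeric value.
raw = ["10", "9", "2"]
print(sorted(raw))

# As strings, "-2" does not sort before "-1".
print("-2" < "-1")

# Zero-padding to a fixed width makes string order match numeric order
# for non-negative integers (the width of 5 is arbitrary here).
padded = [s.zfill(5) for s in raw]
print(sorted(padded))
```

Padding alone does not help with negative numbers, which is why a value-translating layer is useful.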
Using an Object Relational Mapping
boto also provides a simple persistence layer to translate all values so that they can be lexicographically sorted and searched for properly. This persistence layer operates much like the DB layer of Django, which it's based on. Designing an object is quite simple; you can read more about it in the boto documentation, but the basics can be seen here:
from boto.sdb.db.model import Model
from boto.sdb.db.property import StringProperty, IntegerProperty, ReferenceProperty, ListProperty

class SimpleObject(Model):
    """A simple object to show how SDB Persistence works in boto"""
    name = StringProperty()
    some_number = IntegerProperty()
    multi_value_property = ListProperty(str)

class AnotherObject(Model):
    """A second SDB object used to show how references work"""
    name = StringProperty()
    object_link = ReferenceProperty(SimpleObject, collection_name="other_objects")
This code creates two classes, which can be thought of like tables. SimpleObject contains a name, a number, and a multivalued property of strings; AnotherObject contains a name and a reference to a SimpleObject. The number is converted automatically: a fixed offset is added before the value is stored, and the same offset is subtracted when the value is loaded back. This conversion ensures that the number stored in SDB is never negative, so lexicographical sorting and comparison always work. The multivalue property acts just like a standard Python list, enabling you to store multiple values in it and even remove values. Each time you save the object, everything previously stored there is overwritten. Each object also has an id property by default, which is actually the name of the item because that is a unique ID. If you don't set the ID manually, it is generated automatically using Python's uuid module. This module generates random strings that are unique for all practical purposes, so you don't rely on a single point of failure to generate sequential numbers. The collection_name attribute on the object_link property of AnotherObject is optional, but it enables you to specify the name of the property that is automatically created on SimpleObject. This reverse reference is generated for you automatically when you import the second object.
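The number conversion can be sketched in plain Python. The offset of 2**31 and the 10-digit width are assumptions chosen for this sketch so that it covers signed 32-bit integers, and the names encode_int and decode_int are illustrative; boto's own converter may differ in its details.

```python
# Assumed constants for this sketch: shift by 2**31 and pad to 10 digits,
# which covers the signed 32-bit range. boto's real converter may differ.
OFFSET = 2 ** 31
WIDTH = 10


def encode_int(value):
    """Shift the value so it is never negative, then zero-pad it."""
    return str(value + OFFSET).zfill(WIDTH)


def decode_int(text):
    """Reverse encode_int by subtracting the same offset."""
    return int(text) - OFFSET


numbers = [1234, -2, -1, 0, 50]
encoded = [encode_int(n) for n in numbers]

# Sorting the encoded strings gives the same order as sorting the numbers.
print([decode_int(s) for s in sorted(encoded)])
```

Because every encoded value has the same width and no sign, comparing the strings character by character agrees with comparing the original numbers.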
boto enables you to create and query these objects in the database in another simple manner. It provides a few methods that use the values available in boto's SDB connection objects for you, so you don't have to worry about building your query. To create an object, you can use the following code:
>>> my_obj = SimpleObject("object_id")
>>> my_obj.name = "My Object Name"
>>> my_obj.some_number = 1234
>>> my_obj.multi_value_property = ["foo", "bar"]
>>> my_obj.put()
>>> my_second_obj = AnotherObject()
>>> my_second_obj.name = "Second Object"
>>> my_second_obj.object_link = my_obj
>>> my_second_obj.put()
To create the link from the second object, you have to save the first object first, unless you specify its ID manually. If you don't specify an ID, one is set automatically when you call the put method. In this example, the ID of the first object is set manually, but the ID of the second object is not.
To select an object given an ID, you can use the following code:
>>> my_obj = SimpleObject.get_by_id("object_id")
This call returns an instance of the object and enables you to retrieve any of the attributes contained in it. There is also a "lazy" reference to the second object, which is not actually fetched until you specifically request it:
>>> my_obj.name
u'My Object Name'
>>> my_obj.some_number
1234
>>> my_obj.multi_value_property
[u'foo', u'bar']
>>> my_obj.other_objects.next().name
u'Second Object'
You call next() on the other_objects property because what's returned is actually a Query object. This object operates exactly like a generator and only performs the SDB query if you actually iterate over it. Because of this, you can't do something like this:
>>> my_obj.other_objects[0]
This feature is implemented for performance reasons: the query could match thousands of records, and performing an SDB request would consume a lot of resources unnecessarily unless you're actually looking for that property. Additionally, because it is a query, you can filter on it just like any other query:
>>> query = my_obj.other_objects
>>> query.filter("name like", "%Other")
>>> query.order("-name")
>>> for obj in query:
...
In the preceding code, the loop visits each object whose name ends with Other, sorted in descending order by name. After all matching results have been returned, a StopIteration exception is raised, which terminates the loop.
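The lazy behavior of such a query can be sketched with a small class that records its filter and ordering and contacts its backend only when iterated. This is an illustration under stated assumptions, not boto's Query class: LazyQuery, its method names, and fake_backend are all hypothetical.

```python
class LazyQuery(object):
    """Toy query: records filters and ordering, fetches only when iterated."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable standing in for the SDB request
        self._suffix = None
        self._reverse = False
        self.fetch_count = 0     # how many times the backend was hit

    def filter_name_endswith(self, suffix):
        self._suffix = suffix
        return self

    def order_by_name(self, descending=False):
        self._reverse = descending
        return self

    def __iter__(self):
        # The backend is contacted here, not when the query is built.
        self.fetch_count += 1
        rows = self._fetch()
        if self._suffix is not None:
            rows = [r for r in rows if r["name"].endswith(self._suffix)]
        return iter(sorted(rows, key=lambda r: r["name"], reverse=self._reverse))


def fake_backend():
    """Hypothetical data source used in place of a real SDB domain."""
    return [{"name": "An Other"}, {"name": "Second Object"}, {"name": "The Other"}]


query = LazyQuery(fake_backend)
query.filter_name_endswith("Other").order_by_name(descending=True)
print(query.fetch_count)             # no request has been made yet
names = [obj["name"] for obj in query]
print(names, query.fetch_count)
```

Because the class defines iteration but no indexing, an expression such as query[0] fails with a TypeError, which mirrors the limitation described earlier.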