Cloud Computing Reference Architecture: Data Ingress and Storage Access
Windows Azure, like most cloud computing options, is a large set of components that can be arranged in many ways to solve a problem or create a new capability. As such, it's all about options: there are many ways to do the same thing. The Windows Azure platform lets you run software you already have on Windows or Linux operating systems (Infrastructure-as-a-Service, or IaaS), write code independent of an operating system and deliver that code at scale (Platform-as-a-Service, or PaaS), or use complete solutions such as Hadoop or Media Streaming (Software-as-a-Service, or SaaS). While I'll use Windows Azure for the specifics in this article, the concepts hold for other vendors and even for hosted solutions.
In this article, I'll set up a situation that represents a problem or a desired capability I've discussed with clients. The information here doesn't come from any specific customer (I wouldn't do that even if they were OK with it) but from an amalgam of requests I see come up again and again. I abstract the basic concepts from many of those inquiries and designs, and then put together one possible solution.
Keep in mind that this is merely one possible way of solving the problem or creating the capability. There are many other ways to do the same thing, depending on variables such as performance, cost, security and more. No one design fits all, and there is no "solution in a box" for these types of things if you want the flexibility to change them based on your requirements and constraints.
In this example, the client has a current system. I’ll use that as the design pattern. Another option would be to create an entirely new application, which requires that we focus on requirements more than the current design. Using a current design does have advantages; you can migrate only one component, test to see the viability of the solution, and simply move that component to something else if it doesn’t work out.
Of course, using a current design means that you might not take advantage of what Windows Azure (or any other platform for that matter) has to offer. Usually a new platform brings with it a new way of doing things, which means redesigning an application to take advantage of it. You have to look at the cost/benefit analysis for a project to see if the new platform offers things that make the redesign worthwhile.
Current State
In this example, an on-premises relational database management system (RDBMS) takes data from multiple sources using SQL Server Integration Services (SSIS). With some clients, I've seen this implemented as a "pull" from one or more systems; with others, external systems "feed" a file location that SSIS sweeps on a timed basis to ingress the data.
Assume that there is minor data cleansing that happens in this stage—it’s the beginnings of a classic data warehouse, as in Extract, Transform and Load (ETL) processes. A data warehouse, at least as I use the term here, is a place where data sits primarily for reporting or further processing into an Enterprise Data Warehouse or downstream into a data mart, which has a more granular level of information for a department or function (see Figure 1). Feel free to quibble with my definitions here; I just wanted to make sure you understood the terms as I used them.
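To make that ingress step concrete, here's a minimal sketch of the kind of timed sweep-and-cleanse process I'm describing. It's written in Python purely for illustration; in the environment above this work is done by SSIS packages, and the drop-folder path, the cleansing rule and the load_to_warehouse stub are all assumptions on my part.

```python
# A hypothetical stand-in for the SSIS sweep: pick up files from a drop
# folder on a timer, lightly cleanse each row and hand it to a load step.
import csv
import pathlib
import time

DROP_FOLDER = pathlib.Path(r"C:\drops")   # assumed location fed by external systems
PROCESSED = DROP_FOLDER / "processed"

def cleanse(row: dict) -> dict:
    """Light cleansing only: trim whitespace and turn empty strings into None."""
    return {key: ((value or "").strip() or None) for key, value in row.items()}

def load_to_warehouse(rows: list) -> None:
    """Placeholder for the real load into the warehouse staging tables."""
    print(f"loading {len(rows)} rows")

def sweep_once() -> None:
    PROCESSED.mkdir(exist_ok=True)
    for path in DROP_FOLDER.glob("*.csv"):
        with path.open(newline="") as handle:
            rows = [cleanse(row) for row in csv.DictReader(handle)]
        load_to_warehouse(rows)
        path.rename(PROCESSED / path.name)   # move the file so it isn't ingested twice

if __name__ == "__main__":
    while True:
        sweep_once()
        time.sleep(300)   # sweep the drop folder every five minutes
```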
The larger point is that the system is a place to centralize data; the visualization and further processing happen more on the client side than centrally.
While there are mechanisms to run full-up Business Intelligence systems (such as SQL Server in an IaaS role in Windows Azure), this requirement focuses on the "data trough" kind of scenario, because that fits many customers' use cases. I'll reference this article in the future when I discuss the options you have for further processing in a Business Intelligence scenario.
Desired Improvements
This environment is a perfect fit for an on-premises system. In fact, this is one of the earliest computer development patterns: taking data from multiple locations, lightly “cleaning” it, and allowing access to the data from one or more systems downstream.
And yet keeping this environment on-premises proves to be an issue. The instant data is in a given location, it's needed in another. In most cases that other location is within the four walls of a company, or at least on its owned network, but this is no longer a given. Multiple global locations, even for smaller companies, are common. Remote workers and integration with remote systems mean that the system must be open to outside connections.
In addition, security is harder to manage when you have to control everything yourself. With an on-premises solution, you must create and manage the access to the data. This is normally done at the operating system level, which becomes unwieldy at scale when you include external entities. You have to create and manage accounts and passwords for many users, some of them very temporary.
In some cases, clients don’t want an on-premises solution at all. They might be hosting a system for their customers, or perhaps the solution they sell needs to be available globally.
In any case, in this architecture the desire is for an off-premises data store that can be accessed by multiple clients for both data ingress and egress. The data access must be secure, and it must scale based on load from the ingress or egress (see Figure 2).
Proposed Architecture
Windows Azure provides multiple features to address this type of requirement. As mentioned, other cloud vendors have methods to handle this pattern as well, but for completeness I'll represent the solution using Microsoft components here; you can substitute other vendors' components to address the pattern in another way.
One proposed solution is represented in Figure 3.
In the two sections that follow, I'll explain this diagram more fully, describe how its components fulfill the requirements, and cover the benefits of this approach.
Components
First, I've represented the source and destination objects as a single entity each, because with this design they become highly scalable and location-independent. To be sure, there are considerations here; if the intent is to move terabytes of data in each direction constantly, a cloud solution won't be optimal, because there simply isn't the bandwidth to handle that scenario.
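To put a rough number on that: a single terabyte is about 8,000,000 megabits, so even at a sustained 100 Mbps of usable bandwidth it takes roughly 80,000 seconds, or about 22 hours, to move. A workload that has to push multiple terabytes in each direction every day simply runs out of hours, no matter where the data store lives.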
The source and destination systems now have multiple access methods. One might involve a "rich" interface, represented here as a Windows Azure Web Role or even a Windows Azure web site. The client would be able to click buttons, use a "drag and drop" interface and sign in with multiple methods including Active Directory, Google, Facebook, LiveID and more (using the Windows Azure Access Control Service), or even get smartphone and tablet access using Windows Azure Mobile Services.
The heart of the system is a “Data API” system represented here by a Worker Role, which can scale programmatically in a very linear fashion based on demand. This API can include FTP, SFTP, streaming interfaces, or any upload/download method desired, all secured to the proper level. In fact, all data access passes through this component—even the Web Role or a rich client. This allows for complete client-side flexibility.
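To show the shape of that component, here's a minimal sketch of an HTTP ingress and egress endpoint sitting in front of blob storage. I'm assuming Flask and the azure-storage-blob Python package here; the route, container name and environment variable are placeholders, and the real Data API would add authentication, the FTP/SFTP listeners and programmatic scale-out.

```python
# A hypothetical "Data API" front end: accept uploads over HTTP PUT and serve
# downloads over HTTP GET, persisting everything to Windows Azure blob storage.
import os

from azure.storage.blob import BlobServiceClient
from flask import Flask, request

app = Flask(__name__)
blob_service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]   # placeholder setting
)

@app.route("/data/<path:name>", methods=["PUT"])
def ingress(name: str):
    """Store an uploaded object from a source system."""
    blob = blob_service.get_blob_client(container="ingress", blob=name)
    blob.upload_blob(request.get_data(), overwrite=True)
    return {"stored": name}, 201

@app.route("/data/<path:name>", methods=["GET"])
def egress(name: str):
    """Return a previously stored object to a destination system."""
    blob = blob_service.get_blob_client(container="ingress", blob=name)
    return blob.download_blob().readall(), 200

if __name__ == "__main__":
    app.run()
```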
From there, the data is stored in the most efficient, cost-effective location: Windows Azure Table storage, blobs, or more highly structured storage using Windows Azure SQL Database or any number of relational systems in an IaaS role.
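As one hedged illustration of that choice, structured records can be routed to Windows Azure Table storage while opaque files land in blob storage. This sketch assumes the azure-data-tables and azure-storage-blob packages; the table name, container name and connection string are placeholders.

```python
# Pick the storage that fits the shape of the data and return a locator for it.
import os
import uuid

from azure.data.tables import TableServiceClient
from azure.storage.blob import BlobServiceClient

CONN = os.environ["AZURE_STORAGE_CONNECTION_STRING"]   # placeholder setting
table_service = TableServiceClient.from_connection_string(CONN)
blob_service = BlobServiceClient.from_connection_string(CONN)

def store(payload, source: str) -> str:
    """Route a structured record to Table storage, anything else to blob storage."""
    if isinstance(payload, dict):
        # Table storage entities need a PartitionKey and a RowKey.
        table = table_service.create_table_if_not_exists("ingress")
        entity = {"PartitionKey": source, "RowKey": str(uuid.uuid4()), **payload}
        table.create_entity(entity)
        return f"table://ingress/{entity['RowKey']}"
    # Opaque bytes go to blob storage (the "files" container is assumed to exist).
    name = f"{source}/{uuid.uuid4()}"
    blob_service.get_blob_client(container="files", blob=name).upload_blob(payload)
    return f"blob://files/{name}"
```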
Advantages
Other than answering the requirement of a centralized, off-premises data store, this architecture provides several benefits.
Because it includes a Software-as-a-Service (SaaS) approach for data along with a Platform-as-a-Service (PaaS) option and even an Infrastructure-as-a-Service (IaaS) offering, the data can be stored based on the security, consistency and other properties desired.
Because access is provided through a single (but scalable) interface, there is ultimate flexibility in the type of storage access: the system can automatically detect what the data is and where it should be stored, apply the appropriate security, and perhaps even transform the data in transit into a desired shape or format.
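As one example on the security side, the Data API can hand a consumer a scoped, time-limited token to a single object rather than creating and managing an account for that consumer. Here's a minimal sketch, again assuming the azure-storage-blob package; the account name, key, container and blob names are all placeholders.

```python
# Issue a read-only shared access signature that expires after an hour, so an
# external party can pull one blob without any standing account or password.
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobSasPermissions, generate_blob_sas

sas = generate_blob_sas(
    account_name="exampleaccount",           # placeholder storage account
    container_name="files",
    blob_name="reports/daily.csv",
    account_key="<storage-account-key>",     # placeholder key
    permission=BlobSasPermissions(read=True),
    expiry=datetime.now(timezone.utc) + timedelta(hours=1),
)

url = "https://exampleaccount.blob.core.windows.net/files/reports/daily.csv?" + sas
print(url)   # hand this URL to the consumer; it stops working after an hour
```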
The overall benefits revolve around abstraction and options, both of which help future-proof the solution. With proper planning, any part of the system can be brought back "in-house" to avoid vendor lock-in, to provide higher security or for other reasons.
The key is to think about the data first: what its shape is, its security requirements, its consistency model and other factors.
It’s all about flexibility and options. And keeping yours—and your customer’s—open.