Gathering Data for a Data Warehouse
Compiling the data that will be part of your data warehouse (DW) is one of the first and most difficult steps of building a DW. This article will give you an overview of what's involved in extracting data and some tips on how to complete this task successfully.
Let's imagine that your company or client decides they need a DW. What's the first thing you do? Perhaps the same thing as in any other project: gather the requirements. Believe it or not, the users of your data warehouse will be the ones that do most of the work. As a database architect, you're there to provide the technical expertise, but you can't teach the business users how to do their job, at least not before you learn their business.
By the time the DW project is completed, you'll probably know more about the business than most of the business owners themselves (whether you want to or not). Some companies have a luxury of having their own Business Analysts (BAs) or hiring consultants to do BA work. Either way, as a database architect and administrator you need to know the business as well as any BA.
Be sure to determine who your DW users will be. Most of the time, the users of the DW will be the upper-level managers, who tend to be difficult and intimidating to work with. Indeed, be prepared to get the door shut in front of your nose at least a few times. However, don't worry: After you can demonstrate the power and capabilities of the DW, you'll get a warm welcome and perhaps even a smile from the same managers. If they're convinced that you can generate better reports in 20 seconds than the ones they currently use and take four days to run, the managers will gladly allocate a good portion of their precious time to conversing with you.
The type of questions you ask the business users will vary from one project to the next. However, you'll invariably need to find out the following:
What are all the transactional systems used by the company? In other words, how does the company collect the data? Whether the management wishes to use any of these systems for the DW effort does not matter. Often, they don't realize where the most valuable data of the company reside. Nor should theythat's your job.
Who owns the data? This is a tough question because some of the transactional systems could have been purchased from a third-party vendor who does not wish to give out its data structure or the data format. Besides, there might be several people responsible for data upkeep within an organization. When they see a data warehouse architect, the first thing they do is update their resumes; they wonder if you're a sign of things to come. In any case, this is a very important step, so don't take it lightly. Be sure to get not only the names, but also the job titles and duties of each person or organizational unit responsible for data.
What types of reports are expected to be derived from a data warehouse? This could be by far the most exciting process in building a DW. This is where you get to interview all the future users of the DW and get to know all of their expectations. Do not make the mistake of telling them what a DW is supposed to do for them! You're just now gathering the requirements, so the final deliverable of the project has not been determined yet. Be sure to document all requirements with the name and title of the person who requested the report.
After you have documented the data sources and types of reports you need, it's time to go after the data. At this point, it's a good idea to generate a list of all the data fields (columns) that you will need to satisfy reporting requirements. After you have a list of fields, look at the source systems.
Fortunately, most of the database systems provide some form of data security. On the other hand, the secure data means that you have to get the permissions to look at it. Don't expect your client or supervisor to grant you access. They probably don't have a clue of what database permissions mean. Instead, grab your notes from Question 2 and speak with the data ownersthe folks responsible for daily operation of the transactional databases. How easy it is to get access to the data depends on the organization. The larger the organization, (usually) the tighter the security control, and therefore the longer it will take to get appropriate permissions. In smaller companies, the security might not be as tight. Even so, be sure to inform the data owners of your intentions prior to digging into their data. After you explain your mission to the data owners, they should gladly grant you data-reading permissions. If that's not the case, you might have to get an approval from the upper-level managers, who happen to be your customers.
After you glance at the source systems, you're likely to find out that certain fields requested on the reports do not appear in any of the source systems. Now what? Time to go back to the business users and get the name of the missing data source. If they asked for a data field, it's either available directly from one of the systems or is a result of a calculation. The business owners might not know exactly which system contains the missing fields, but they should know all the systems they work with.
After you know all the systems involved, it's a good idea to build a Data Flow Diagram (DFD), and present it to the users for their approval. The DFD does not have to be very detailed, but it helps to know how the data moves from one system to the next and what kind of manipulations happen during such data exchange. That way, if the same data is contained in multiple systems, you only have to extract it from the single most appropriate source.
It's a good practice to get at least a small sample from each data source to analyze the data format and shape. That's where you find out if you need to perform any data cleansing prior to transforming it into the dimensional model. The data format will also give you an idea of what's involved in writing the transformation routines.
Data cleansing might involve removing invalid characters, deleting invalid records, and much more. Sometimes, it's more cost-effective to assign a data entry clerk to clean the data, rather than developing data cleansing code. For instance, no code will tell you that "Dr. Anderson R Starkey" and "Andy Starkey, M.D." are one and the same. However, the clerk that has entered such data for the past five years can match such anomalies in her sleep.
Next, you will have to relate data in multiple systems, and try to come up with some sort of mapping of data fields. For instance, the rental information might be kept in one source, and the inventory data might be in another data store. If the company needs to analyze their expenses against their revenues, you'll have to grab data from both systems and come up with a way to compare inventory costs to the revenue that was generated against that cost.
When trying to relate the data from various sources, a useful exercise is building an Entity Relationship Diagram (ERD). You might be familiar with the ERD concept if you've modeled any relational databases. A dimensional ERD isn't concerned with Primary and Foreign Keys as much as it is concerned with the cardinality rules. For example, you might have a store that gets rental vehicles from multiple suppliers; on the other hand each supplier might work with multiple stores. Therefore, you have a many-to-many relationship between the entities of supplier and store.
Getting familiar with all the data sources and their format may take from a few weeks to a few months, depending on the size of the project. After that, you're ready to build your data extraction utilities.
In some cases, if you're lucky, extracting data will be a piece of cakeall you have to do is some formatting and the data is ready to be loaded into your staging area. But more than likely, this won't be the case. If you have text files with mainframe extracts or a relational data that can be easily transferred to your staging area, consider yourself lucky. More often, you'll find yourself in need of interfacing with some proprietary tool developed by a third party. If you don't know programming languages other than SQL, it's time to call your fellow programmers or hire a consultant having prior experience with the proprietary system(s).
Let me stress again that gathering requirements thoroughly is extremely important. The data warehouse needs only those columns that need to appear on the reports. If you import every column from each source system, you might run out of storage space on your server. If you don't have enough columns from each system, your users won't be happy with the DW. Therefore, be sure to get all the requirements ironed out prior to developing any data extraction routines.
After you have extracted all the data elements needed, you'll have to place such data in the staging area. The staging database does not have to be normalized. In fact, it might be beneficial to denormalize the staging database because it'll be used for querying and populating the dimensional database. Nor should your users attempt to get the reports from the staging database, although this idea might be tempting. You should still set up all the appropriate keys and indexes in the staging database to ensure the efficient execution of queries executed against this database.
Some IT professionals prefer generating Analysis Services cubes straight from the staging database. This is a bad practice because the staging area isn't optimized for reporting purposes. The staging database contains data in the same or similar format as it is in the source systems. Instead, the staging database should be used only to get the data extracts from various sources. Later, you transform these data and load them into the dimensional model, which is optimized for building cubes.
Although not necessary, it's a good idea to place the staging database on the same server and the same database engine where the dimensional database will reside. Although you could set up ODBC connections from the dimensional database to the staging area, it'll be more efficient to have both databases on the same server and in the same format.
I recommend using Microsoft SQL Server 2000 for all your data warehousing needs for a couple of reasons. First and foremost, SQL Server makes data warehousing affordable for many small to midsize companies that cannot afford the cost of maintaining other database management systems. When you purchase Microsoft SQL Server, the data warehousing tools are provided to you at no additional cost. In addition, SQL Server is much easier to use than any other major database engine on the market.
Summary
In this article, I gave you an overview of the first step of your data warehousing projectscompiling the data that will construct the DW. I also gave you some tips on how to be successful in extracting data and populating the staging area.