Savings Mart
Savings Mart is a fictitious discount retailer operating in a single American state. It has been in business since 2001 and hopes to open new stores by achieving greater operational efficiencies. Since its inception, Savings Mart has relied on a system of adjusting product inventory thresholds to determine when shipments will be made to stores. Every time a product is purchased, the quantity for that product and store is updated. When the quantity dips below the minimum threshold allowed for that product and store, an order is automatically generated and a shipment is made three days later.
Although this process seemed like a good way to ensure that the stores were well stocked, it resulted in shipments being made to each store almost every day. This resulted in high overhead costs for each store. Management now wishes to replace the order/shipment strategy with a system designed to predict shipment dates rather than rely on adjustable thresholds.
A sample application presented in this chapter allows the reader to create a training dataset for Savings Mart based on randomly generated purchases. The reader can then step through the process of creating a mining model based on the dataset.
Loading the Savings Mart Database
To execute the sample code with SQL Server, you will need to create a database using a script file available on the book’s Web site. The installation steps are as follows:
Open SQL Server’s Query Analyzer and connect to the server where you wish to install the database.
Click File and Open...
From the Open Query File dialog, browse to location of the InstallDB.sql file available on the book’s Web site. Once selected, click OK.
Click Query and Execute or hit F5.
Once the script has completed executing, a new database named SavingsMart will be visible in the drop-down box on the toolbar.
Figure 5.1 is a diagram of the SavingsMart database. The Products table contains a record for every product available in the stores. Each product is associated with a certain vendor and assigned to a product type, such as beverage or medicine. A field named DiscontinuedDate will contain either a null, meaning it is not discontinued, or the date that it should no longer be available for order. Every product should have a UnitQuantity of one or greater, which indicates the number of items packaged with that product. The products UnitPrice represents the retail price before any discounts are applied. The UnitType and UnitAmount fields may or may not contain values, depending on the product. For instance, a bottled water product will have a UnitType of "oz" and a Unit- Amount of 16.4, indicating that it weighs 16.4 fluid ounces. It is not necessary to record the weight of a mop, so for this product these values would be null.
Figure 5.1 The SavingsMart database. The sample database contains five hundred products and five stores. Stores are stocked with all products according to threshold values contained in the ProductThresholds table. Each time a product is ordered and a shipment completed, the quantity field in ProductQty is updated.
The Purchases table is written to every time a customer makes a single purchase. It records information common to all purchases, such as when the purchase took place, what store it was made in, and what employee rang up the purchase. Purchases are made for products available within a particular store. Availability is determined by examining the Quantity field in the table ProductQty. A purchase can include multiple products and more than one unit of each product. The ProductDetail table is a child of products, and it contains a record for each product associated with a single purchase. If the product purchased is on sale during the time of purchase, then the discount percentage, specified in the ProductDiscounts table is applied.
Once a product is sold beneath the minimum threshold allowed for the store, as indicated by the ProductThresholds table, an order is automatically placed. The quantity for the order is based on the maximum amount found in the ProductThresholds table. Each shipment is the direct result of an order and is typically completed three days after the order is placed.
Populating the Database
Once the SavingsMart database is created, the next step is to populate the database. Unlike the sample databases in Chapters 2 and 3, the SavingsMart database needs to be populated with a large quantity of data. To facilitate this process and provide a method for generating unique training datasets, a sample data-loading application is included on the book’s Web site at http://www.awprofessional.com/title/0321246268. The sample Windows application, named LoadSampleData, will allow you to simulate random purchases as well as to initiate orders and shipments needed to restock products.
Utilizing the LoadSampleData program ensures that you are dealing with a clean database. Very often, the most difficult and time-consuming part of successful data mining involves cleaning the historical database to remove or replace records holding invalid values. Refer to the next section, "Cleaning the Database," for more information about this.
The LoadSampleData program also gives you an opportunity to adjust factors affecting the mining model and therefore produce different results. For instance, the program allows you to select certain product types and vendor names that will be purchased more often.
The LoadSampleData application consists of one form, Form1.vb (see Figure 5.2). It utilizes the Microsoft Data Application block to handle data access and the Microsoft Exception Application block to handle writing exceptions to the event log. The Load Data button is used to populate tables with values from text files, available for download from the book’s Web site. The following is a list of these text files along with a brief description of what they contain:
Stores.txt—Data for a total of five stores.
Employees.txt—Assigns three employees to each store.
Vendors.txt—Data for a total of thirty-four vendors or product brands.
ProductTypes.txt—A total of fourteen product types, including such items as Beverages and Kitchen Supplies.
Products.txt—A total of five hundred products representing each of the product types and vendors.
Figure 5.2 Screenshot of the main form used in the LoadSampleData program. This program will be used to load initial data values into the SavingsMart database. It will also allow the reader to simulate random customer purchases in order to populate a large historical dataset.
The LoadData routine is also used to populate the ProductThresholds table with set values for minimum and maximum threshold amounts. The minimum field represents the minimum quantity of product that should be available in a certain store. The maximum field is the quantity that will be ordered for that store when the minimum threshold is broken. Initially, the minimum and maximum values will be set at ten and two hundred respectively. This will be the case for each product and each store, resulting in a total of twenty-five hundred records (500 products x 5 stores = 2500 records).
Finally, the LoadData routine will generate orders and shipments for each of the five stores. The initial orders will stock the stores with the maximum threshold for all five hundred products. The shipment date will occur three days after the order date to ensure that all stores are fully stocked on the first day of the year.
To begin loading data, execute the following steps:
Copy the contents of the LoadSampleData directory, available for download from the book’s Web site, to a location accessible by your development machine. Note the location because it will be used to set the sPath variable in step 4.
Open the LoadSampleData project file with Visual Studio .NET.
From Solution Explorer, right-click Form1 and select View Code.
The top of the form contains two variables that will be unique to your installation. Ensure that the sConn and sPath variables are set correctly. sConn is a string variable containing the connection string used to connect to the SavingsMart database on SQL Server . sPath is a string variable containing the file path to the text files. The text files reside in a subdirectory name TextFiles. This subdirectory is located inside the LoadSampleData directory (created in step 1).
Execute the application by selecting Start from the Debug Menu. Figure 5.2 is a screenshot of form1.
To begin, click the Load Data button and ensure that the message box "Initial Data Load is complete" appears. Note that the form contains several combo and textboxes that will determine what and how data is loaded.
'Maximum # of purchases per day Dim nMaxPurchases As Int16 = txtMaxPurchases.Text 'Maximum # of products per purchase Dim nMaxProducts As Int16 = txtMaxProducts.Text 'Maximum value of quantity per product Dim nMaxQuantity As Int16 = txtMaxQuantity.Text Dim sYear As String = cboYear.Text Dim nStoreID As Int16 = cboStoreID.Text 'These are the Product Types in which there is an 'increased chance of product selection 'The default of 1,2 represents snack foods and beverages Dim sProductTypeAdj As String = txtProductTypeAdj.Text 'These are the Vendors in which there is an increased 'chance of product selection 'The default of 20,24,27 represents Kamp, Notch, and PNuts as Vendors Dim sVendorAdj As String = txtVendorAdj.Text ProgressBar1.Minimum = 1 ProgressBar1.Maximum = 366 Try Dim params(2) As SqlParameter params(0) = New _ SqlParameter("@ID", SqlDbType.Int) params(1) = New _ SqlParameter("@ProdTypeAdj", SqlDbType.VarChar, 50) params(2) = New _ SqlParameter("@VendorAdj", SqlDbType.VarChar, 50) params(0).Value = nStoreID params(1).Value = sProductTypeAdj params(2).Value = sVendorAdj Dim ds As DataSet = _ SqlHelper.ExecuteDataset(sConn, _ CommandType.StoredProcedure, "GetProductIDS", params) Dim i As Int16 = 1 Dim dtDate As DateTime dtDate = Convert.ToDateTime("01/01/" + sYear) 'Loop through everyday of the year 'We assume the store is open every day Randomize() Do Until i = 366 'First thing is check to see if orders needs to 'be fulfilled for this day and store 'We assume that all orders are shipped 3 days 'after the orderdate in one shipment Dim params1(1) As SqlParameter params1(0) = New _ SqlParameter("@StoreID", SqlDbType.Int) params1(1) = New _ SqlParameter("@Date", SqlDbType.SmallDateTime) params1(0).Value = nStoreID 'order was placed 3 days ago params1(1).Value = dtDate.AddDays(-3) SqlHelper.ExecuteReader(sConn, _ CommandType.StoredProcedure, "InsertShipment", params1) Dim x As Int16 = 1 'This will be the total number of purchases for this day Dim nPurchases As Int16 nPurchases = CInt(Int((nMaxPurchases * Rnd()) + 1)) Do Until x = nPurchases + 1 Dim y As Int16 = 1 Dim nEmployeePos As Int16 nEmployeePos = CInt(Int((ds.Tables(1).Rows.Count * Rnd()))) Dim nEmployeeID As Integer = _ Convert.ToInt32(ds.Tables(1).Rows(nEmployeePos).Item(0)) Dim params2(2) As SqlParameter params2(0) = New SqlParameter("@ID1", SqlDbType.Int) params2(1) = New _ SqlParameter("@Date", SqlDbType.SmallDateTime) params2(2) = New SqlParameter("@ID2", SqlDbType.Int) params2(0).Value = nStoreID params2(1).Value = dtDate params2(2).Value = nEmployeeID Dim nPurchaseID As Integer = _ SqlHelper.ExecuteScalar(sConn, _ CommandType.StoredProcedure, "InsertPurchase", params2) 'This is total number of products for this purchase Dim nProducts As Int16 = _ CInt(Int((nMaxProducts * Rnd()) + 1)) Do Until y = nProducts + 1 'This is quantity for this purchase Dim nQty As Int16 = _ CInt(Int((nMaxQuantity * Rnd()) + 1)) 'This is the product for this detail record Dim nProductPos As Int16 = _ CInt(Int((ds.Tables(0).Rows.Count * Rnd()))) Dim nProductID As Integer = _ Convert.ToInt32(ds.Tables(0).Rows(nProductPos).Item(0)) 'Generate the detail record Dim params3(4) As SqlParameter params3(0) = New SqlParameter("@StoreID", SqlDbType.Int) params3(1) = New _ SqlParameter("@ProductID", SqlDbType.Int) params3(2) = New _ SqlParameter("@PurchaseID", SqlDbType.Int) params3(3) = New SqlParameter("@Qty", SqlDbType.Int) params3(4) = New _ SqlParameter("@Date", SqlDbType.SmallDateTime) params3(0).Value = nStoreID params3(1).Value = nProductID params3(2).Value = nPurchaseID params3(3).Value = nQty params3(4).Value = dtDate SqlHelper.ExecuteScalar(sConn, _ CommandType.StoredProcedure, _ "InsertPurchaseDetail", params3) y += 1 Loop x += 1 Loop i += 1 ProgressBar1.Value = i 'Go to the next day dtDate = dtDate.AddDays(1) Loop MessageBox.Show("Purchases for store " _ + Convert.ToString(cboStoreID.Text) + _ " were generated successfully") Catch ex As Exception MessageBox.Show(ex.Message) ExceptionManager.Publish(ex) End Try
The next step is to generate purchases for each of the five stores. Data mining is most effective in dealing with large datasets. Therefore, the GeneratePurchases routine, seen as follows, will insert approximately 100,000 records in the PurchaseDetail table for each store and calendar year. Purchases are generated for one store—one year at a time.
The amount of records is approximate because the routine utilizes a random number generator to determine the number of purchases per day along with the number of products per purchase. The number of records also varies depending on what input variables are chosen on form1.
The program utilizes default values specifying that purchases will be generated for Store 1 in the year 2001. The GeneratePurchases routine contains a main loop that will execute 365 times for each day of one calendar year. The variable max purchases defaults to 80 and is used to provide the maximum value for the random number generator when determining how many purchases will be generated for a single day.
The variable max products determines the number of distinct products that will be used for a single purchase. Max quantity is used to determine the quantity used in a single purchase detail record. By utilizing the random number generator and then adjusting these values for each store that is processed, we can simulate random purchase activity. In the section titled "Interpreting the Results," we will examine the results of one mining model. To ensure that your results are consistent with the explanations in this section, use the values in Table 5.2 when loading your database.
Table 5.2 Values to be used in the LoadSampleData application when generating purchases for all five stores.
Store |
Field Caption |
Value (only use the number values and not the literal values in parentheses) |
1 |
Processing Year |
2001 |
|
Max Purchases |
80 |
|
Max Products |
20 |
|
Max Quantity |
5 |
|
Product Type Adjustment |
1, 2 (Snack Foods and Beverages) |
|
Vendor Adjustment |
20, 24, 27 (Kamp, Notch, and Pnuts) |
2 |
Processing Year |
2001 |
|
Max Purchases |
60 |
|
Max Products |
12 |
|
Max Quantity |
7 |
|
Product Type Adjustment |
2, 6 (Beverages and Baking Goods) |
|
Vendor Adjustment |
13, 18 (Gombers and Joe’s) |
3 |
Processing Year |
2001 |
|
Max Purchases |
70 |
|
Max Products |
12 |
|
Max Quantity |
7 |
|
Product Type Adjustment |
2 (Beverages) |
|
Vendor Adjustment |
34 (Store Brand) |
4 |
Processing Year |
2001 |
|
Max Purchases |
100 |
|
Max Products |
5 |
|
Max Quantity |
2 |
|
Product Type Adjustment |
(leave blank) |
|
Vendor Adjustment |
34, 18 (Store Brand and Joe’s) |
5 |
Processing Year |
2001 |
|
Max Purchases |
50 |
|
Max Products |
15 |
|
Max Quantity |
8 |
|
Product Type Adjustment |
6 (Baking Goods) |
|
Vendor Adjustment |
-24, 27, 34 (Notch, Pnuts, and Store Brand) |
Of course, since we are using a random number generator, the purchases generated will represent all products equally well over the long run. The Product Type Adjustment and Vendor Adjustment variables are introduced because equal distribution of product purchases is not realistic. These variables contain a comma-delimited list of ProductTypeID and VendorID values. The stored procedure GetProductIDs uses these values when returning the dataset of available ProductID’s. If a ProductTypeID is specified, then every product that relates to that product type will be included in the list of available product id’s more than once. This will increase the chances that the product will be selected for the PurchaseDetail record. The Vendor Adjustment works similarly in that for each VendorID specified, all products assigned to that vendor will appear in the available product list more than once.
If you do not alter the values on form1, the GeneratePurchases routine will take approximately twenty minutes to load data for each store and calendar year. A progress bar is used to indicate the status of the data load because the process is somewhat time-consuming.
Cleaning the Database
Cleaning the database is one of the most important tasks in successful data mining. Databases to be mined are often constructed from multiple data sources. These data sources often involve data that is prone to a variety of errors that can destroy any chance you have of making useful predictions. Most everyone has heard or used the phrase "Garbage in, Garbage out." This phrase applies more than ever to data mining.
Possible errors include records with impossible values or values that fall outside the expected range. Also, records that include null values where nulls will negatively impact the decision-making process. Most of these errors could be prevented with database restrictions or application code, but this does not always happen. Since our sample database was artificially created, we can be reasonably sure that these errors do not exist. The following is a list of the errors that could have occurred if our database existed in the real world:
Store sale in which the ending data occurs before the starting date.
A purchase handled by a store employee before their hire date or for a store they were not assigned to.
An order made for a product that was discontinued before the order date.
A shipment or order date that is invalid or outside the days of operation for the concerned store.
A negative quantity in either PurchaseDetail, OrderDetail, ShipmentDetail, or ProductQty.
A product not associated with a vendor or a purchase with no purchase date and quantity.
A maximum amount that is greater than the minimum.
The methods used to clean a database can vary. Often, values can be corrected with a few update queries made in Query Analyzer. The hardest part is determining what the correct values should be. More than likely, outside help will be needed from people intimately familiar with the data, such as a store manager.
Creating Views
In order to ease the process of building a mining model, a special view will be created. The view, vw_Shipments, combines fields from five different tables and will be used in the next section to create the mining model. The view utilizes the function fn_GetLastShipmentDate to calculate the number of days between shipments. The Transact-SQL code (viewable from the User Defined Function tab in Enterprise Manager) for this function is as follows:
CREATE FUNCTION fn_GetLastShipmentDate ( @ShipmentID int, @ProductID int ) RETURNS datetime AS BEGIN DECLARE @ShippedDate smalldatetime, @TShipmentID int, @ret smalldatetime DECLARE cursor1 CURSOR SCROLL FOR select shippeddate, s.shipmentid from shipments s left join shipmentdetails sd on s.shipmentid = sd.shipmentid where s.storeid IN (SELECT StoreID FROM Shipments WHERE shipmentid = @Shipmentid) AND sd.productid = @ProductID ORDER BY shippeddate OPEN cursor1 FETCH NEXT FROM cursor1 INTO @ShippedDate, @TShipmentID WHILE @@FETCH_STATUS = 0 BEGIN IF @ShipmentID = @TShipmentID BEGIN FETCH PRIOR FROM cursor1 INTO @ShippedDate, @TShipmentID SET @ret = @ShippedDate GOTO Close_Cursor END FETCH NEXT FROM cursor1 INTO @ShippedDate, @TShipmentID END Close_Cursor: CLOSE cursor1 DEALLOCATE cursor1 RETURN(@ret) END
The function accepts @ShipmentID and @ProductID as input variables. It then opens a scrollable cursor (similar to an ADO resultset) based on the following SQL statement:
SELECT shippeddate, s.shipmentid FROM shipments s LEFT JOIN shipmentdetails sd ON s.shipmentid = sd.shipmentid WHERE s.storeid IN (SELECT StoreID FROM Shipments WHERE shipmentid = @Shipmentid) AND sd.productid = @ProductID ORDER BY shippeddate
The function loops through the cursor results until it locates the ShipmentID supplied as an input variable. Once located, it moves to the preceding record and returns that shipment date. This shipment date is used as the first variable for the built-in SQL function DATEDIFF. The resulting variable, DaysSinceLastShipped, will be an important column for the Analyze Shipments mining model.