- What Is Data Mining?
- What Data Mining Is Not
- The Most Common Data Mining Applications
- What Kinds of Patterns Can Data Mining Discover?
- Popular Data Mining Tools
- The Dark Side of Data Mining: Privacy Concerns
- Summary
- References
The Dark Side of Data Mining: Privacy Concerns
Data that is collected, stored, and analyzed in data mining often contains information about real people. Such information may include identification data (e.g., name, address, Social Security number, driver’s license number, employee number), demographic data (e.g., age, sex, ethnicity, marital status, number of children), financial data (e.g., salary, gross family income, checking or savings account balance, home ownership, mortgage or loan account specifics, credit card limits and balances, investment account specifics), purchase history (i.e., what is bought, from where, and when, drawn either from vendors’ transaction records or from credit card transaction details), and other personal data (e.g., anniversary, pregnancy, illness, loss in the family, bankruptcy filings). Much of this data can be obtained from third-party data providers.
The main issue here is the privacy of the person to whom the data belongs. To preserve that privacy and protect individuals’ rights, data mining professionals have ethical and often legal obligations. One way to handle private data ethically is to de-identify customer records before mining them, so that the records cannot be traced back to an individual. Many publicly available data sources (e.g., CDC data, SEER data, UNOS data) are already de-identified. Before accessing these data sources, users are typically asked to agree not to attempt to identify the individuals behind the records.
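The mechanics of de-identification vary by data source and by regulation, but the basic idea can be illustrated with a minimal Python sketch. The table, column names, and generalization rules below are hypothetical and only meant to show the common moves: drop direct identifiers, coarsen quasi-identifiers, and keep linkability (if needed) only through a keyed one-way hash.

```python
import hashlib
import pandas as pd

# Hypothetical customer table; column names are illustrative only.
customers = pd.DataFrame({
    "name": ["Jane Doe", "John Roe"],
    "ssn": ["123-45-6789", "987-65-4321"],
    "zip_code": ["55401", "55402"],
    "age": [34, 61],
    "monthly_spend": [412.50, 187.20],
})

# Direct identifiers are dropped outright rather than masked.
deidentified = customers.drop(columns=["name", "ssn"])

# Quasi-identifiers are generalized so that combinations of them are less
# likely to single out one person (5-digit ZIP -> 3-digit prefix,
# exact age -> 10-year band).
deidentified["zip_code"] = deidentified["zip_code"].str[:3] + "**"
deidentified["age"] = (deidentified["age"] // 10 * 10).astype(str) + "s"

# If records must remain linkable across tables, replace the identity with
# a keyed one-way hash instead of carrying the raw values forward.
salt = "project-specific-secret"  # assumed to be stored separately from the data
deidentified["record_key"] = [
    hashlib.sha256((salt + n).encode()).hexdigest()[:16] for n in customers["name"]
]

print(deidentified)
```

Even a sketch like this is only a first step; whether the result is truly non-identifiable depends on what other data an adversary could link it with, which is exactly why the usage agreements mentioned above also prohibit re-identification attempts.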
In a number of recent instances, companies have shared their customer data with others without seeking their customers’ explicit consent. For instance, in 2003, JetBlue Airways provided more than a million passenger records to Torch Concepts, a U.S. government contractor. Torch then augmented the passenger data with additional information, such as family sizes and Social Security numbers, purchased from the data broker Acxiom. The consolidated personal database was intended for use in a data mining project to develop potential terrorist profiles. All of this was done without notifying the passengers or obtaining their consent. When news of the activity got out, dozens of privacy lawsuits were filed against JetBlue, Torch, and Acxiom, and several U.S. senators called for an investigation into the incident (Wald, 2004). Similar, though less dramatic, privacy-related news has since emerged about popular social networking companies allegedly selling customer-specific data to other companies for personalized, targeted marketing.
A peculiar story about data mining and privacy concerns made headlines in 2012. In this instance, the company did not use any private or personal data and, legally speaking, violated no laws. The story concerns Target. In early 2012, news broke about Target’s practice of using predictive analytics: a teenage girl was being sent advertising flyers and coupons for the kinds of things a new mother-to-be would buy. An angry man went into a Target outside Minneapolis, demanding to talk to a manager: “My daughter got this in the mail!” he said. “She’s still in high school, and you’re sending her coupons for baby clothes and cribs? Are you trying to encourage her to get pregnant?” The manager didn’t have any idea what the man was talking about. He looked at the mailer. Sure enough, it was addressed to the man’s daughter and contained advertisements for maternity clothing, nursery furniture, and pictures of smiling infants. The manager apologized and then called a few days later to apologize again. On the phone, though, the father was somewhat abashed. “I had a talk with my daughter,” he said. “It turns out there’s been some activities in my house I haven’t been completely aware of. She’s due in August. I owe you an apology.”
As it turns out, Target figured out that a teen girl was pregnant before her father did. Here is how it did it. Target assigns every customer a guest ID number (tied to the person’s credit card, name, or email address) that serves as a placeholder for a history of everything the person has bought. Target augments this data with any demographic information it has collected from the customer or bought from other sources. Using this information, Target looked at historical buying data for all the women who had signed up for its baby registries in the past. The company analyzed the data from all directions, and some useful patterns emerged. Lotions and special vitamins, for example, showed interesting purchase patterns. Lots of people buy lotion, but Target noticed that women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester. Analysts also noted that sometime in the first 20 weeks, pregnant women loaded up on supplements such as calcium, magnesium, and zinc. Many shoppers purchase soap and cotton balls, but when someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals that she could be getting close to her delivery date. Target identified about 25 products that, when analyzed together, allowed it to assign each shopper a “pregnancy prediction” score. Target could even estimate a shopper’s due date to within a small window, so the company could send coupons timed to very specific stages of her pregnancy.
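Target has never published its actual model, but the general approach described here (combining purchase signals for a couple of dozen products into a single score) can be sketched with a simple classifier. The data, product features, and model choice below are all assumptions made for illustration; the label stands in for a known historical outcome, such as a later baby-registry signup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative feature matrix: one row per shopper, one column per product
# category (unscented lotion, calcium supplements, scent-free soap, cotton
# balls, ...). Values are recent purchase quantities. All data is simulated.
rng = np.random.default_rng(0)
n_shoppers, n_products = 1000, 25
X = rng.poisson(lam=1.0, size=(n_shoppers, n_products)).astype(float)

# Labels come from a known historical outcome (e.g., the shopper later
# signed up for a baby registry). Here they are simulated from a hidden
# signal so that both classes are present.
true_weights = rng.normal(size=n_products)
signal = X @ true_weights + rng.normal(size=n_shoppers)
y = (signal > np.median(signal)).astype(int)

# Fit a simple classifier on historical shoppers...
model = LogisticRegression(max_iter=1000).fit(X, y)

# ...then score current shoppers: the predicted probability plays the role
# of a "pregnancy prediction" score that could drive targeted mailings.
new_baskets = rng.poisson(lam=1.0, size=(5, n_products)).astype(float)
scores = model.predict_proba(new_baskets)[:, 1]
print(np.round(scores, 3))
```

The point of the sketch is that nothing in it requires sensitive inputs: ordinary transaction histories, combined with one known outcome to learn from, are enough to produce a score about a very personal condition.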
Viewed from a legal perspective, Target did not use any information that violated its customers’ privacy rights; rather, it used the kind of transactional data that almost every other retail chain collects and stores (and perhaps analyzes) about its customers. What was disturbing in this scenario was perhaps that pregnancy was the concept being targeted. People tend to believe that certain events or conditions, such as terminal illness, divorce, and bankruptcy, should be off-limits or treated with extreme caution.