This eBook includes the following formats, accessible from your Account page after purchase:
EPUB The open industry format known for its reflowable content and usability on supported mobile devices.
PDF The popular standard, used most often with the free Acrobat® Reader® software.
This eBook requires no passwords or activation to read. We customize your eBook by discreetly watermarking it with your name, making it uniquely yours.
Master Data Analytics Hands-On by Solving Fascinating Problems You’ll Actually Enjoy!
Harvard Business Review recently called data science “The Sexiest Job of the 21st Century.” It’s not just sexy: For millions of managers, analysts, and students who need to solve real business problems, it’s indispensable. Unfortunately, there’s been nothing easy about learning data science, until now.
Getting Started with Data Science takes its inspiration from worldwide best-sellers like Freakonomics and Malcolm Gladwell’s Outliers: It teaches through a powerful narrative packed with unforgettable stories.
Murtaza Haider offers informative, jargon-free coverage of basic theory and technique, backed with plenty of vivid examples and hands-on practice opportunities. Everything’s software and platform agnostic, so you can learn data science whether you work with R, Stata, SPSS, or SAS. Best of all, Haider teaches a crucial skillset most data science books ignore: how to tell powerful stories using graphics and tables. Every chapter is built around real research challenges, so you’ll always know why you’re doing what you’re doing.
You’ll master data science by answering fascinating questions, such as:
• Are religious individuals more or less likely to have extramarital affairs?
• Do attractive professors get better teaching evaluations?
• Does the higher price of cigarettes deter smoking?
• What determines housing prices more: lot size or the number of bedrooms?
• How do teenagers and older people differ in the way they use social media?
• Who is more likely to use online dating services?
• Why do some people purchase iPhones and others BlackBerry devices?
• Does the presence of children influence a family’s spending on alcohol?
For each problem, you’ll walk through defining your question and the answers you’ll need; exploring how others have approached similar challenges; selecting your data and methods; generating your statistics; organizing your report; and telling your story. Throughout, the focus is squarely on what matters most: transforming data into insights that are clear, accurate, and actionable.
The book’s website (www.ibmpressbooks.com/title/9780133991024) offers additional pages and code that illustrate every method from the book in R, SPSS, Stata, and SAS. The additional content and code files will be available for download by 1/26.
Download the code files:
Chapter 4 (444 KB .zip)
Chapter 5 (2.37 MB .zip)
Chapter 6 (769 KB .zip)
Chapter 7 (813 KB .zip)
Chapter 8 (909 KB .zip)
Chapter 9 (3.51 MB .zip)
Chapter 10 (1.89 KB .zip)
Chapter 11 (294 KB .zip)
Chapter 12 (43 KB .zip)
Preface
Chapter 1 The Bazaar of Storytellers
Data Science: The Sexiest Job in the 21st Century
Storytelling at Google and Walmart
Getting Started with Data Science
Do We Need Another Book on Analytics?
Repeat, Repeat, Repeat, and Simplify
Chapters’ Structure and Features
Analytics Software Used
What Makes Someone a Data Scientist?
Existential Angst of a Data Scientist
Data Scientists: Rarer Than Unicorns
Beyond the Big Data Hype
Big Data: Beyond Cheerleading
Big Data Hubris
Leading by Miles
Predicting Pregnancies, Missing Abortions
What’s Beyond This Book?
Summary
Endnotes
Chapter 2 Data in the 24/7 Connected World
The Liberated Data: The Open Data
The Caged Data
Big Data Is Big News
It’s Not the Size of Big Data; It’s What You Do with It
Free Data as in Free Lunch
FRED
Quandl
U.S. Census Bureau and Other National Statistical Agencies
Search-Based Internet Data
Google Trends
Google Correlate
Survey Data
PEW Surveys
ICPSR
Summary
Endnotes
Chapter 3 The Deliverable
The Final Deliverable
What Is the Research Question?
What Answers Are Needed?
How Have Others Researched the Same Question in the Past?
What Information Do You Need to Answer the Question?
What Analytical Techniques/Methods Do You Need?
The Narrative
The Report Structure
Have You Done Your Job as a Writer?
Building Narratives with Data
“Big Data, Big Analytics, Big Opportunity”
Urban Transport and Housing Challenges
Human Development in South Asia
The Big Move
Summary
Endnotes
Chapter 4 Serving Tables
2014: The Year of Soccer and Brazil
Using Percentages Is Better Than Using Raw Numbers
Data Cleaning
Weighted Data
Cross Tabulations
Going Beyond the Basics in Tables
Seeing Whether Beauty Pays
Data Set
What Determines Teaching Evaluations?
Does Beauty Affect Teaching Evaluations?
Putting It All on (in) a Table
Generating Output with Stata
Summary Statistics Using Built-In Stata
Using Descriptive Statistics
Weighted Statistics
Correlation Matrix
Reproducing the Results for the Hamermesh and Parker Paper
Statistical Analysis Using Custom Tables
Summary
Endnotes
Chapter 5 Graphic Details
Telling Stories with Figures
Data Types
Teaching Ratings
The Congested Lives in Big Cities
Summary
Endnotes
Chapter 6 Hypothetically Speaking
Random Numbers and Probability Distributions
Casino Royale: Roll the Dice
Normal Distribution
The Student Who Taught Everyone Else
Statistical Distributions in Action
Z-Transformation
Probability of Getting a High or Low Course Evaluation
Probabilities with Standard Normal Table
Hypothetically Yours
Consistently Better or Happenstance
Mean and Not So Mean Differences
Handling Rejections
The Mean and Kind Differences
Comparing a Sample Mean When the Population SD Is Known
Left Tail Between the Legs
Comparing Means with Unknown Population SD
Comparing Two Means with Unequal Variances
Comparing Two Means with Equal Variances
Worked-Out Examples of Hypothesis Testing
Best Buy–Apple Store Comparison
Assuming Equal Variances
Exercises for Comparison of Means
Regression for Hypothesis Testing
Analysis of Variance
Significantly Correlated
Summary
Endnotes
Chapter 7 Why Tall Parents Don’t Have Even Taller Children
The Department of Obvious Conclusions
Why Regress?
Introducing Regression Models
All Else Being Equal
Holding Other Factors Constant
Spuriously Correlated
A Step-By-Step Approach to Regression
Learning to Speak Regression
The Math Behind Regression
Ordinary Least Squares Method
Regression in Action
This Just In: Bigger Homes Sell for More
Does Beauty Pay? Ask the Students
Survey Data, Weights, and Independence of Observations
What Determines Household Spending on Alcohol and Food
What Influences Household Spending on Food?
Advanced Topics
Homoskedasticity
Multicollinearity
Summary
Endnotes
Chapter 8 To Be or Not to Be
To Smoke or Not to Smoke: That Is the Question
Binary Outcomes
Binary Dependent Variables
Let’s Question the Decision to Smoke or Not
Smoking Data Set
Exploratory Data Analysis
What Makes People Smoke: Asking Regression for Answers
Ordinary Least Squares Regression
Interpreting Models at the Margins
The Logit Model
Interpreting Odds in a Logit Model
Probit Model
Interpreting the Probit Model
Using Zelig for Estimation and Post-Estimation Strategies
Estimating Logit Models for Grouped Data
Using SPSS to Explore the Smoking Data Set
Regression Analysis in SPSS
Estimating Logit and Probit Models in SPSS
Summary
Endnotes
Chapter 9 Categorically Speaking About Categorical Data
What Is Categorical Data?
Analyzing Categorical Data
Econometric Models of Binomial Data
Estimation of Binary Logit Models
Odds Ratio
Log of Odds Ratio
Interpreting Binary Logit Models
Statistical Inference of Binary Logit Models
How I Met Your Mother? Analyzing Survey Data
A Blind Date with the Pew Online Dating Data Set
Demographics of Affection
High-Techies
Romancing the Internet
Dating Models
Multinomial Logit Models
Interpreting Multinomial Logit Models
Choosing an Online Dating Service
Pew Phone Type Model
Why Some Women Work Full-Time and Others Don’t
Conditional Logit Models
Random Utility Model
Independence From Irrelevant Alternatives
Interpretation of Conditional Logit Models
Estimating Logit Models in SPSS
Summary
Endnotes
Chapter 10 Spatial Data Analytics
Fundamentals of GIS
GIS Platforms
Freeware GIS
GIS Data Structure
GIS Applications in Business Research
Retail Research
Hospitality and Tourism Research
Lifestyle Data: Consumer Health Profiling
Competitor Location Analysis
Market Segmentation
Spatial Analysis of Urban Challenges
The Hard Truths About Public Transit in North America
Toronto Is a City Divided into the Haves, Will Haves, and Have Nots
Income Disparities in Urban Canada
Where Is Toronto’s Missing Middle Class? It Has Suburbanized Out of Toronto
Adding Spatial Analytics to Data Science
Race and Space in Chicago
Developing Research Questions
Race, Space, and Poverty
Race, Space, and Commuting
Regression with Spatial Lags
Summary
Endnotes
Chapter 11 Doing Serious Time with Time Series
Introducing Time Series Data and How to Visualize It
How Is Time Series Data Different?
Starting with Basic Regression Models
What Is Wrong with Using OLS Models for Time Series Data?
Newey–West Standard Errors
Regressing Prices with Robust Standard Errors
Time Series Econometrics
Stationary Time Series
Autocorrelation Function (ACF)
Partial Autocorrelation Function (PCF)
White Noise Tests
Augmented Dickey Fuller Test
Econometric Models for Time Series Data
Correlation Diagnostics
Invertible Time Series and Lag Operators
The ARMA Model
ARIMA Models
Distributed Lag and VAR Models
Applying Time Series Tools to Housing Construction
Macro-Economic and Socio-Demographic Variables Influencing Housing Starts
Estimating Time Series Models to Forecast New Housing Construction
OLS Models
Distributed Lag Model
Out-of-Sample Forecasting with Vector Autoregressive Models
ARIMA Models
Summary
Endnotes
Chapter 12 Data Mining for Gold
Can Cheating on Your Spouse Kill You?
Are Cheating Men Alpha Males?
UnFair Comments: New Evidence Critiques Fair’s Research
Data Mining: An Introduction
Seven Steps Down the Data Mine
Establishing Data Mining Goals
Selecting Data
Preprocessing Data
Transforming Data
Storing Data
Mining Data
Evaluating Mining Results
Rattle Your Data
What Does Religiosity Have to Do with Extramarital Affairs?
The Principal Components of an Extramarital Affair
Will It Rain Tomorrow? Using PCA For Weather Forecasting
Do Men Have More Affairs Than Females?
Two Kinds of People: Those Who Have Affairs, and Those Who Don’t
Models to Mine Data with Rattle
Summary
Endnotes
Index