This eBook includes the following formats, accessible from your Account page after purchase:
EPUB: The open industry format known for its reflowable content and usability on supported mobile devices.
PDF: The popular standard, used most often with the free Acrobat® Reader® software.
This eBook requires no passwords or activation to read. We customize your eBook by discreetly watermarking it with your name, making it uniquely yours.
Solve Data Analytics Problems with Spark, PySpark, and Related Open Source Tools
Spark is at the heart of today’s Big Data revolution, helping data professionals supercharge efficiency and performance in a wide range of data processing and analytics tasks. In this guide, Big Data expert Jeffrey Aven covers all you need to know to leverage Spark, together with its extensions, subprojects, and wider ecosystem.
Aven combines a language-agnostic introduction to foundational Spark concepts with extensive programming examples that use the popular and intuitive PySpark development environment. The guide’s focus on Python makes it accessible to a wide audience of data professionals, analysts, and developers, even those with little Hadoop or Spark experience.
Aven’s broad coverage ranges from basic to advanced Spark programming, and from Spark SQL to machine learning. You’ll learn how to efficiently manage all forms of data with Spark: streaming, structured, semi-structured, and unstructured. Throughout, concise topic overviews quickly get you up to speed, and extensive hands-on exercises prepare you to solve real problems.
You will learn how to:
• Understand Spark’s evolving role in the Big Data and Hadoop ecosystems
• Create Spark clusters using various deployment modes
• Control and optimize the operation of Spark clusters and applications
• Master Spark Core RDD API programming techniques
• Extend, accelerate, and optimize Spark routines with advanced API platform constructs, including shared variables, RDD storage, and partitioning
• Efficiently integrate Spark with both SQL and nonrelational data stores
• Perform stream processing and messaging with Spark Streaming and Apache Kafka
• Implement predictive modeling with SparkR and Spark MLlib
Preface
Introduction
PART I: SPARK FOUNDATIONS
Chapter 1 Introducing Big Data, Hadoop, and Spark
Introduction to Big Data, Distributed Computing, and Hadoop
A Brief History of Big Data and Hadoop
Hadoop Explained
Introduction to Apache Spark
Apache Spark Background
Uses for Spark
Programming Interfaces to Spark
Submission Types for Spark Programs
Input/Output Types for Spark Applications
The Spark RDD
Spark and Hadoop
Functional Programming Using Python
Data Structures Used in Functional Python Programming
Python Object Serialization
Python Functional Programming Basics
Summary
Chapter 2 Deploying Spark
Spark Deployment Modes
Local Mode
Spark Standalone
Spark on YARN
Spark on Mesos
Preparing to Install Spark
Getting Spark
Installing Spark on Linux or Mac OS X
Installing Spark on Windows
Exploring the Spark Installation
Deploying a Multi-Node Spark Standalone Cluster
Deploying Spark in the Cloud
Amazon Web Services (AWS)
Google Cloud Platform (GCP)
Databricks
Summary
Chapter 3 Understanding the Spark Cluster Architecture
Anatomy of a Spark Application
Spark Driver
Spark Workers and Executors
The Spark Master and Cluster Manager
Spark Applications Using the Standalone Scheduler
Spark Applications Running on YARN
Deployment Modes for Spark Applications Running on YARN
Client Mode
Cluster Mode
Local Mode Revisited
Summary
Chapter 4 Learning Spark Programming Basics
Introduction to RDDs
Loading Data into RDDs
Creating an RDD from a File or Files
Methods for Creating RDDs from a Text File or Files
Creating an RDD from an Object File
Creating an RDD from a Data Source
Creating RDDs from JSON Files
Creating an RDD Programmatically
Operations on RDDs
Key RDD Concepts
Basic RDD Transformations
Basic RDD Actions
Transformations on PairRDDs
MapReduce and Word Count Exercise
Join Transformations
Joining Datasets in Spark
Transformations on Sets
Transformations on Numeric RDDs
Summary
PART II: BEYOND THE BASICS
Chapter 5 Advanced Programming Using the Spark Core API
Shared Variables in Spark
Broadcast Variables
Accumulators
Exercise: Using Broadcast Variables and Accumulators
Partitioning Data in Spark
Partitioning Overview
Controlling Partitions
Repartitioning Functions
Partition-Specific or Partition-Aware API Methods
RDD Storage Options
RDD Lineage Revisited
RDD Storage Options
RDD Caching
Persisting RDDs
Choosing When to Persist or Cache RDDs
Checkpointing RDDs
Exercise: Checkpointing RDDs
Processing RDDs with External Programs
Data Sampling with Spark
Understanding Spark Application and Cluster Configuration
Spark Environment Variables
Spark Configuration Properties
Optimizing Spark
Filter Early, Filter Often
Optimizing Associative Operations
Understanding the Impact of Functions and Closures
Considerations for Collecting Data
Configuration Parameters for Tuning and Optimizing Applications
Avoiding Inefficient Partitioning
Diagnosing Application Performance Issues
Summary
Chapter 6 SQL and NoSQL Programming with Spark
Introduction to Spark SQL
Introduction to Hive
Spark SQL Architecture
Getting Started with DataFrames
Using DataFrames
Caching, Persisting, and Repartitioning DataFrames
Saving DataFrame Output
Accessing Spark SQL
Exercise: Using Spark SQL
Using Spark with NoSQL Systems
Introduction to NoSQL
Using Spark with HBase
Exercise: Using Spark with HBase
Using Spark with Cassandra
Using Spark with DynamoDB
Other NoSQL Platforms
Summary
Chapter 7 Stream Processing and Messaging Using Spark
Introducing Spark Streaming
Spark Streaming Architecture
Introduction to DStreams
Exercise: Getting Started with Spark Streaming
State Operations
Sliding Window Operations
Structured Streaming
Structured Streaming Data Sources
Structured Streaming Data Sinks
Output Modes
Structured Streaming Operations
Using Spark with Messaging Platforms
Apache Kafka
Exercise: Using Spark with Kafka
Amazon Kinesis
Summary
Chapter 8 Introduction to Data Science and Machine Learning Using Spark
Spark and R
Introduction to R
Using Spark with R
Exercise: Using RStudio with SparkR
Machine Learning with Spark
Machine Learning Primer
Machine Learning Using Spark MLlib
Exercise: Implementing a Recommender Using Spark MLlib
Machine Learning Using Spark ML
Using Notebooks with Spark
Using Jupyter (IPython) Notebooks with Spark
Using Apache Zeppelin Notebooks with Spark
Summary
Index