9+ Hours of Video Instruction
The perfect (and fast) way to get started with Hadoop and Spark
Hadoop and Spark Fundamentals LiveLessons provides more than nine hours of video introduction to the Apache Hadoop Big Data ecosystem. The tutorial includes background information and explains the core components of Hadoop, including the Hadoop Distributed File System (HDFS), MapReduce, the YARN resource manager, and YARN frameworks. It also demonstrates how to use Hadoop at several levels, including the native Java interface, C++ Pipes, and the universal streaming program interface. Examples show how to use benchmarks and high-level tools, including the Apache Pig scripting language, the Apache Hive "SQL-like" interface, Apache Flume for streaming input, Apache Sqoop for import and export of relational data, and Apache Oozie for Hadoop workflow management. There is also comprehensive coverage of Spark, PySpark, and the Zeppelin web GUI. The steps for easily installing a working Hadoop/Spark system on a desktop/laptop and on a local stand-alone cluster using the powerful Ambari GUI are also included. All software used in these LiveLessons is open source and freely available for your use and experimentation. A bonus lesson includes a quick primer on the Linux command line as used with Hadoop and Spark.
Lesson 1: Background Concepts
This lesson introduces Hadoop and Spark along with the many aspects and features that enable the analysis of large unstructured data sets. Many discussions of Hadoop ignore the fundamental change it brings to data management. Doug explains this key point using the data lake metaphor, and then provides background on how the Hadoop data platform, MapReduce, and Spark fit into the data analytics landscape. A bonus lesson for new Linux users covers the basics of the command line interface used throughout these lessons.
Lesson 2: Running Hadoop on a Desktop or Laptop
A real Hadoop installation, whether a local cluster or in the cloud, can be difficult to configure and a possibly expensive proposition. To make the examples in this tutorial more accessible, you learn how to install the Hortonworks HDP Sandbox on a desktop or laptop. The "Sandbox" is a freely available Hadoop virtual machine that provides a full Hadoop environment (including Spark). You can use this environment to try most of the examples in this tutorial. If you would rather learn about Hadoop and Spark installation details, we will also do a direct single (Linux) machine install using the latest Hadoop and Spark binary code.
Lesson 3: The Hadoop Distributed File System
The backbone of Hadoop is the Hadoop Distributed File System or HDFS. In this lesson you learn the basics of HDFS and how it is different from many standard file systems used today. In particular, Doug explains why various design trade-offs provide HDFS with a performance edge in big data applications. You also learn how to navigate HDFS using the Hadoop tools and how to use HDFS in user programs. Finally, Doug presents some of the new features available in HDFS, including high availability, federation, snapshots, and NFS access.
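The everyday HDFS navigation the lesson covers is done with the `hdfs dfs` command line tool. A minimal sketch of the typical commands follows, wrapped in Python and guarded so it is harmless on a machine without a Hadoop client; the paths and file names are invented for illustration.

```python
import shutil
import subprocess

# Typical HDFS operations, issued through the `hdfs dfs` client tool.
# The /user/demo path and local.txt file are invented examples.
commands = [
    ["hdfs", "dfs", "-mkdir", "-p", "/user/demo"],        # create a directory
    ["hdfs", "dfs", "-put", "local.txt", "/user/demo"],   # copy a local file in
    ["hdfs", "dfs", "-ls", "/user/demo"],                 # list directory contents
    ["hdfs", "dfs", "-cat", "/user/demo/local.txt"],      # print file contents
]

if shutil.which("hdfs"):
    for cmd in commands:
        subprocess.run(cmd, check=True)
else:
    print("hdfs client not found; commands shown for reference only")
```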
Lesson 4: Hadoop MapReduce
If the Hadoop Distributed File System is the backbone of Hadoop, then MapReduce is the muscle that operates on big data. In this lesson, Doug shows you how MapReduce compares to a traditional search approach. From there, he shows you how to compile and run a Java MapReduce application. Deeper background on how MapReduce works is presented along with how to use MapReduce with other languages and how to do simple debugging of a MapReduce program.
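The map-shuffle-reduce flow described above can be sketched in plain Python. This is not the course's Java example, just a minimal stand-in that shows how the three phases of the classic word-count problem fit together:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for each word in a line of input
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle phase: group values by key, as Hadoop does between map and reduce
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    # Reduce phase: combine all counts for one word into a total
    return key, sum(values)

def word_count(lines):
    pairs = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(k, v) for k, v in shuffle(pairs).items())

print(word_count(["the quick brown fox", "the lazy dog"]))
```

In a real Hadoop job, the shuffle step is performed by the framework between cluster nodes; only the mapper and reducer are written by the user.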
Lesson 5: Hadoop MapReduce Examples
This lesson continues with MapReduce examples. Doug first shows you a multifile word count program, and then moves on to a more practical log file analysis. From there, he demonstrates how to use a really large text file, like Wikipedia. The lesson concludes with some examples of running MapReduce benchmarks and using the YARN job browser.
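The streaming interface covered in this lesson lets any language act as a mapper or reducer by reading stdin and writing tab-separated key/value lines. A minimal Python sketch of that contract (the script names in the comment are invented placeholders):

```python
def stream_map(lines):
    # Streaming mapper: read raw text, emit tab-separated "key\t1" lines
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def stream_reduce(sorted_lines):
    # Streaming reducer: input arrives sorted by key, so runs of the
    # same key can be summed with a single pass
    current, total = None, 0
    for line in sorted_lines:
        word, count = line.split("\t")
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

if __name__ == "__main__":
    # In a real job these would be two separate scripts launched with the
    # Hadoop streaming jar (e.g. -mapper mapper.py -reducer reducer.py);
    # here the sort stands in for Hadoop's shuffle.
    mapped = sorted(stream_map(["the cat sat", "the dog sat"]))
    print(list(stream_reduce(mapped)))
```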
Lesson 6: Higher Level Tools
While Hadoop is very effective at presenting a basic scalable MapReduce model, some higher-level approaches have been developed. In this lesson, Doug teaches you how to use Apache Pig, a Hadoop scripting language that simplifies using MapReduce. In addition, he shows you how to use Apache Hive QL, an SQL-like language that enables higher-level "ad hoc" queries using MapReduce and HDFS. Finally, the Oozie workflow manager is presented.
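To give a feel for the "SQL-like" queries mentioned above, the sketch below runs an analogous ad hoc query with Python's built-in SQLite. Hive QL executes over data in HDFS rather than a local database, and the table name and rows here are invented for illustration:

```python
import sqlite3

# Hive QL runs over HDFS data, but an ad hoc query reads much like
# standard SQL. The web_logs table and its rows are made-up examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE web_logs (url TEXT, status INTEGER)")
conn.executemany("INSERT INTO web_logs VALUES (?, ?)",
                 [("/home", 200), ("/home", 404), ("/about", 200)])

# A comparable Hive QL query: SELECT url, COUNT(*) FROM web_logs GROUP BY url;
rows = conn.execute(
    "SELECT url, COUNT(*) FROM web_logs GROUP BY url ORDER BY url").fetchall()
print(rows)  # [('/about', 1), ('/home', 2)]
```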
Lesson 7: Using the Spark Language
Spark has become a popular tool for data analytics. In this lesson, Doug covers the basics of the Spark language and demonstrates the Python-Spark interface, PySpark, with a simple command line example. Additional aspects of the Spark language are also used in the next two lessons.
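PySpark expresses analytics as chained transformations over distributed collections (RDDs). A plain-Python mimic of the word-count pipeline follows, with the PySpark equivalent shown in a comment; the `data.txt` file name is an invented placeholder:

```python
from itertools import groupby

# The PySpark form of this pipeline looks roughly like:
#   sc.textFile("data.txt").flatMap(lambda l: l.split()) \
#     .map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
# Below, plain Python built-ins stand in for the RDD operations.
lines = ["spark makes analytics fast", "spark scales out"]
words = [w for line in lines for w in line.split()]          # flatMap
pairs = [(w, 1) for w in words]                              # map
pairs.sort(key=lambda kv: kv[0])                             # shuffle
counts = {k: sum(v for _, v in g)                            # reduceByKey
          for k, g in groupby(pairs, key=lambda kv: kv[0])}
print(counts["spark"])  # 2
```

The key difference, covered in the lesson, is that Spark evaluates these transformations lazily and in parallel across a cluster, while this sketch runs eagerly on one machine.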
Lesson 8: Getting Data into Hadoop HDFS
The first, and often overlooked, step in data analytics is "data ingest." As was demonstrated in Lesson 3, files can be simply copied into HDFS. However, there are methods that can preserve and import structure that could be lost with simple copying. In this lesson, Doug demonstrates how to import data into Hive tables and use Spark to import data into HDFS. He also demonstrates importing log and other streaming data directly into HDFS using Apache Flume. Finally, a complete example of using Apache Sqoop to import and export a relational database into and out of HDFS is presented.
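The Sqoop import direction described above, pulling table rows out of a relational database and writing them as delimited records, can be mimicked in miniature with SQLite and an in-memory buffer standing in for an HDFS file. All names here are invented for illustration:

```python
import csv
import io
import sqlite3

# Sqoop moves whole tables between a relational database and HDFS.
# This sketch mimics the import direction: the customers table and the
# CSV buffer (standing in for an HDFS file) are made-up examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])

buffer = io.StringIO()
writer = csv.writer(buffer)
for row in conn.execute("SELECT id, name FROM customers ORDER BY id"):
    writer.writerow(row)  # Sqoop similarly writes delimited records to HDFS

# A real import would be something along the lines of:
#   sqoop import --connect jdbc:mysql://host/db --table customers \
#                --target-dir /data/customers
print(buffer.getvalue())
```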
Lesson 9: Using the Zeppelin Web Interface
Although many early Hadoop applications were developed using the command line interface, newer web-based GUI tools such as Apache Zeppelin offer a more user-friendly approach to application development. In this lesson, a walk-through of the Zeppelin interface is provided and includes an example of how to create an interactive Zeppelin notebook for a simple Spark application.
Lesson 10: Learning Basic Hadoop Installation and Administration
One of the challenges facing Hadoop users and administrators is setting up a real cluster for production use. In this lesson, Doug teaches you how to use the Ambari web GUI to install, monitor, and administer a full Hadoop installation. He also provides a few important command line tools that will help with basic administration. Finally, some additional HDFS features such as snapshots and NFSv3 mounts are demonstrated.
About Pearson Video Training
Pearson publishes expert-led video tutorials covering a wide selection of technology topics designed to teach you the skills you need to succeed. These professional and personal technology videos feature world-leading author instructors published by your trusted technology brands: Addison-Wesley, Cisco Press, Pearson IT Certification, Prentice Hall, Sams, and Que. Topics include IT Certification, Network Security, Cisco Technology, Programming, Web Development, Mobile Development, and more. Learn more about Pearson Video training at http://www.informit.com/video.
The companion materials for this LiveLesson can be downloaded from https://www.clustermonkey.net/download/LiveLessons/Hadoop_Fundamentals/.
Introduction
Lesson 1: Background Concepts
Learning objectives
Lesson 1.1 Understand Big Data and analytics
Lesson 1.2 Understand Hadoop as a data platform
Lesson 1.3 Understand Hadoop MapReduce basics
Lesson 1.4 Understand Spark language basics
Lesson 1.5 Learn the Linux command line features
Lesson 2: Running Hadoop on a Desktop or Laptop
Learning objectives
Lesson 2.1 Install Hortonworks Hadoop and Spark HDP Sandbox
Lesson 2.2 Install from Hadoop sources
Lesson 2.3 Install from Spark sources
Lesson 3: The Hadoop Distributed File System
Learning objectives
Lesson 3.1 Understand HDFS basics
Lesson 3.2 Use HDFS tools and do administration
Lesson 3.3 Use HDFS in programs
Lesson 3.4 Utilize additional features of HDFS
Lesson 4: Hadoop MapReduce
Learning objectives
Lesson 4.1 Understand the MapReduce paradigm
Lesson 4.2 Develop and run a Java MapReduce application
Lesson 4.3 Understand how MapReduce works
Lesson 5: Hadoop MapReduce Examples
Learning objectives
Lesson 5.1 Use the Streaming Interface
Lesson 5.2 Use the Pipes interface
Lesson 5.3 Run the Hadoop grep example
Lesson 5.4 Debugging MapReduce
Lesson 5.5 Understand Hadoop Version 2 MapReduce
Lesson 5.6 Use Hadoop Version 2 features Part 1
Lesson 5.6 Use Hadoop Version 2 features Part 2
Lesson 6: Higher Level Tools
Learning objectives
Lesson 6.1 Demonstrate a Pig example
Lesson 6.2 Demonstrate a Hive example
Lesson 6.3 Demonstrate an Oozie example Part 1
Lesson 6.3 Demonstrate an Oozie example Part 2
Lesson 7: Using the Spark Language
Learning objectives
Lesson 7.1 Learn Spark language basics
Lesson 7.2 Demonstrate a PySpark command line example
Lesson 8: Getting Data into Hadoop HDFS
Learning objectives
Lesson 8.1 Import data into Hive tables
Lesson 8.2 Use Spark to import data into HDFS
Lesson 8.3 Demonstrate a Flume Example Part 1
Lesson 8.3 Demonstrate a Flume Example Part 2
Lesson 8.4 Demonstrate a Sqoop Example Part 1
Lesson 8.4 Demonstrate a Sqoop Example Part 2
Lesson 9: Using the Zeppelin Web Interface
Learning objectives
Lesson 9.1 Understand Zeppelin features
Lesson 9.2 Create a PySpark example in Zeppelin
Lesson 10: Learning Basic Hadoop Installation and Administration
Learning objectives
Lesson 10.1 Install and configure Hadoop using Ambari
Lesson 10.2 Perform simple administration and monitoring with Ambari
Lesson 10.3 Perform simple administration and monitoring from the command line
Lesson 10.4 Utilize additional features of HDFS