Virtualizing Hadoop: How to Install, Deploy, and Optimize Hadoop in a Virtualized Architecture

By George Trujillo, Charles Kim, Steve Jones, Rommel Garcia, Justin Murray
Published Jul 20, 2015 by VMware Press. Part of the VMware Press Technology series.

Book

Sorry, this book is no longer in print.

About

Features

Shows how data works and moves in Hadoop clusters, and how to integrate Hadoop into enterprise data architecture
Gives "golden image templates" for deploying Hadoop smoothly, quickly, and consistently
Teaches how to avoid pitfalls and mitigate risks associated with virtualizing Hadoop
By pioneering Big Data and virtualization experts George Trujillo (Hortonworks), Charles Kim (Oracle ACE Director, VMware vExpert), and Steven Jones (VMware)



Description

Copyright 2016
Pages: 480
Edition: 1st

Book
ISBN-10: 0-13-381102-6
ISBN-13: 978-0-13-381102-5

Plan and Implement Hadoop Virtualization for Maximum Performance, Scalability, and Business Agility

Enterprises running Hadoop must absorb rapid changes in big data ecosystems, frameworks, products, and workloads. Virtualized approaches can offer important advantages in speed, flexibility, and elasticity. Now, a world-class team of enterprise virtualization and big data experts guides you through the choices, considerations, and tradeoffs surrounding Hadoop virtualization. The authors help you decide whether to virtualize Hadoop, deploy Hadoop in the cloud, or integrate conventional and virtualized approaches in a blended solution.

First, Virtualizing Hadoop reviews big data and Hadoop from the standpoint of the virtualization specialist. The authors demystify MapReduce, YARN, and HDFS and guide you through each stage of Hadoop data management. Next, they turn the tables, introducing big data experts to modern virtualization concepts and best practices.

Finally, they bring Hadoop and virtualization together, guiding you through the decisions you’ll face in planning, deploying, provisioning, and managing virtualized Hadoop. From security to multitenancy to day-to-day management, you’ll find reliable answers for choosing your best Hadoop strategy and executing it.

Coverage includes the following:

• Reviewing the frameworks, products, distributions, use cases, and roles associated with Hadoop

• Understanding YARN resource management, HDFS storage, and I/O

• Designing data ingestion, movement, and organization for modern enterprise data platforms

• Defining SQL engine strategies to meet strict SLAs

• Considering security, data isolation, and scheduling for multitenant environments

• Deploying Hadoop as a service in the cloud

• Reviewing the essential concepts, capabilities, and terminology of virtualization

• Applying current best practices, guidelines, and key metrics for Hadoop virtualization

• Managing multiple Hadoop frameworks and products as one unified system

• Virtualizing master and worker nodes to maximize availability and performance

• Installing and configuring Linux for a Hadoop environment



Sample Content

Foreword

Preface

Part I: Introduction to Hadoop

Chapter 1 Understanding the Big Data World

The Data Revolution

Traditional Data Systems

Semi-Structured and Unstructured Data

Causation and Correlation

Data Challenges

The Modern Data Architecture

Organizational Transformations

Industry Transformation

Summary

Chapter 2 Hadoop Fundamental Concepts

Types of Data in Hadoop

Use Cases

What Is Hadoop?

Hadoop Distributions

Hadoop Frameworks

NoSQL Databases

What Is NoSQL?

A Hadoop Cluster

Hadoop Software Processes

Hadoop Hardware Profiles

Roles in the Hadoop Environment

Summary

Chapter 3 YARN and HDFS

A Hadoop Cluster Is Distributed

Hadoop Directory Layouts

Hadoop Operating System Users

The Hadoop Distributed File System

YARN Logging

The NameNode

The DataNode

Block Placement

NameNode Configurations and Managing Metadata

Rack Awareness

Block Management

The Balancer

Maintaining Data Integrity in the Cluster

Quotas and Trash

YARN and the YARN Processing Model

Running Applications on YARN

Resource Schedulers

Benchmarking

TeraSort Benchmarking Suite

Summary

Chapter 4 The Modern Data Platform

Designing a Hadoop Cluster

Enterprise Data Movement

Summary

Chapter 5 Data Ingestion

Extraction, Loading, and Transformation (ELT)

Sqoop: Data Movement with SQL Sources

Flume: Streaming Data

Oozie: Scheduling and Workflow

Falcon: Data Lifecycle Management

Kafka: Real-time Data Streaming

Summary

Chapter 6 Hadoop SQL Engines

Where SQL Was Born

SQL in Hadoop

Hadoop SQL Engines

Selecting the SQL Tool For Hadoop

Now Getting Groovy with Hive and Pig

Hive

HCatalog

Pig

Summary

Chapter 7 Multitenancy in Hadoop

Securing the Access

Authentication

Auditing

Authorization

Data Protection

Isolating the Data

Isolating the Process

Summary

Part II: Introduction to Virtualization

Chapter 8 Virtualization Fundamentals

Why Virtualize Hadoop?

Introduction to Virtualization

Summary

References

Chapter 9 Best Practices for Virtualizing Hadoop

Running Virtualized Hadoop with Purpose and Discipline

The Discipline of Purpose Starts with a Clear Target

Virtualizing Different Tiers of Hadoop

Industry Best Practices

Summary

Part III: Virtualizing Hadoop

Chapter 10 Virtualizing Hadoop

How Are Hadoop Ecosystems Going to Be Managed?

Building an Enterprise Hadoop Platform That Is Agile and Flexible

Clarification of Terms

The Journey from Bare-Metal to Virtualization

Why Consider Virtualizing Hadoop?

Benefits of Virtualizing Hadoop

Virtualized Hadoop Can Run as Fast or Faster Than Native

Coordination and Cross-Purpose Specialization Is the Future

Barriers Can Be Organizational

Virtualization Is Not an All or Nothing Option

Rapid Provisioning and Improving Quality of Development and Test Environments

Improve High Availability with Virtualization

Use Virtualization to Leverage Hadoop Workloads

Hadoop in the Cloud

Big Data Extensions

The Path to Virtualization

The Software-Defined Data Center

Virtualizing the Network

vRealize Suite

Summary

References

Chapter 11 Virtualizing Hadoop Master Servers

Virtualizing Servers in a Hadoop Cluster

Virtualizing the Environment Around Hadoop

Virtualizing the Master Hadoop Servers

Virtualizing Without the SAN

Summary

Chapter 12 Virtualizing the Hadoop Worker Nodes

A Brief Introduction to the Worker Nodes in Hadoop

Deployment Models for Hadoop Clusters

The Combined Model

The Separated Model

Network Effects of the Data-Compute Separation

The Shared-Storage Approach to the Data-Compute Separated Model

Local Disks for the Application’s Temporary Data

The Shared Storage Architecture Model Using Network-Attached Storage (NAS)

Deployment Model Summary

Best Practices for Virtualizing Hadoop Workers

Disk I/O

The Hadoop Virtualization Extensions (HVE)

Summary

References

Resources

Chapter 13 Deploying Hadoop as a Service in the Private Cloud

The Cloud Context

Stakeholders for Hadoop

Overview of the Solution Architecture

Summary

References

Chapter 14 Understanding the Installation of Hadoop

Map the Right Solutions to the Right Use Case

Thoughts About Installing Hadoop

Configuring Repositories

Installing HDP 2.2

Environment Preparation

Setting Up the Hadoop Configuration

Starting HDFS and YARN

Start YARN

Verifying MapReduce Functionality

Installing and Configuring Hive

Installing and Configuring MySQL Database

Installing and Configuring Hive and HCatalog

Summary

Chapter 15 Configuring Linux for Hadoop

Supported Linux Platforms

Different Deployment Models

Linux Golden Templates

Building a Linux Enterprise Hadoop Platform

Selecting the Linux Distribution

Optimal Linux Kernel Parameters and System Settings

epoll

Disable Swap Space

Disable Security During Install

IO Scheduler Tuning

Check Transparent Huge Pages Configuration

Limits.conf

Partition Alignment for RDMs

File System Considerations

Lazy Count Parameter for XFS

Mount Options

I/O Scheduler

Disk Read and Write Options

Storage Benchmarking

Java Version

Set Up NTP

Enable Jumbo Frames

Additional Network Considerations

Summary

Appendix A Hadoop Cluster Creation: A Prerequisite Checklist

Appendix B Big Data/Hadoop on VMware vSphere Reference Materials

Deployment Guides

Reference Architectures

Customer Case Studies

Performance

vSphere Big Data Extensions (BDE)

Other vSphere Features and Big Data



