Apache Hadoop YARN: A Brief History and Rationale
In this chapter we provide a historical account of why and how Apache Hadoop YARN came about. YARN’s requirements emerged and evolved from the practical needs of long-existing cluster deployments of Hadoop, both small and large, and we discuss how each of these requirements ultimately shaped YARN.
YARN’s architecture addresses many of these long-standing requirements, based on experience evolving the MapReduce platform. By understanding this historical context, readers can appreciate most of the design decisions that were made with YARN. These design decisions will repeatedly appear in Chapter 4, “Functional Overview of YARN Components,” and Chapter 7, “Apache Hadoop YARN Architecture Guide.”
Introduction
Several different problems need to be tackled when building a shared compute platform. Scalability is the foremost concern: a platform that scales avoids the need to rewrite software every time demand outgrows the current version. The desire to share physical resources brings up issues of multitenancy, isolation, and security. Users interacting with a Hadoop cluster that runs as a long-lived service inside an organization will come to depend on its reliable and highly available operation. Because operators and administrators must keep managing user workloads with minimal disruption, serviceability of the platform is a principal concern for them. Abstracting the intricacies of a distributed system and exposing clean but varied application-level paradigms are growing necessities for any compute platform.
Hadoop’s compute layer has seen all of this and much more over its long, continuous development. Its architecture went through multiple evolutionary phases. We highlight the “Big Four” of these phases in the remainder of this chapter.
- “Phase 0: The Era of Ad Hoc Clusters” signaled the beginning of Hadoop clusters that were set up in an ad hoc, per-user manner.
- “Phase 1: Hadoop on Demand” was the next step in the evolution: a common system for provisioning and managing private Hadoop MapReduce and HDFS instances on a shared cluster of commodity hardware.
- “Phase 2: Dawn of the Shared Compute Clusters” began when the majority of Hadoop installations moved to a model of a shared MapReduce cluster together with shared HDFS instances.
- “Phase 3: Emergence of YARN”—the main subject of this book—arose to address the demands and shortcomings of the previous architectures.
As the reader follows the journey through these phases, it will become apparent how the requirements of YARN unfolded over time. As the architecture evolved, existing problems were solved and new use cases emerged, driving further stages of advancement.
We’ll now tour the stages of evolution one after another, in chronological order. For each phase, we first describe what the architecture looked like and how it advanced over the previous generation, and then close with its limitations, which set the stage for the next phase.