New Integration Architecture
The key element of our architecture is that it moves the responsibility for launching processes from the parallel environment to the RM, while maintaining the infrastructure of the environment. This solves the fundamental problem of giving the RM visibility into the parallel processes, and enables the proper monitoring found in a tight integration. Our approach requires that the RM export three basic capabilities (sketched in code after this list):
Ability to list the set of allocated resources (hosts). This allows the parallel environment to start its own daemon infrastructure on the proper set of hosts.
Ability to launch a new process under RM control. The parallel environment calls the RM rather than forking a new process.
Ability for that process to identify its containing job. The launched process must rendezvous with the parallel environment, and needs a means of identifying itself to the environment.
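Conceptually, these three capabilities amount to a small function table that each RM must supply. The sketch below is illustrative only and uses hypothetical names; the actual interface is the one specified in the xrm.3 manual page.

    #include <stddef.h>

    /*
     * Illustrative sketch of the three capabilities as a C function table.
     * All names here are hypothetical; the real interface is defined in
     * the xrm.3 manual page.
     */
    typedef struct rm_ops {
        /* 1. List the hosts the RM has allocated to this job. */
        int (*get_hosts)(char ***hosts, int *nhosts);

        /* 2. Ask the RM to launch argv[] on a given host, under RM control. */
        int (*spawn)(const char *host, char *const argv[], char *const envp[]);

        /* 3. Return the RM's identifier for the containing job, so the new
         *    process can rendezvous with the parallel environment. */
        int (*get_jobid)(char *buf, size_t buflen);
    } rm_ops_t;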
We find these capabilities are ubiquitous amongst distributed resource management systems. TABLE 1 shows the interfaces that export these capabilities for our target systems [4, 5, 6, 7].
TABLE 1 Resource Manager Interfaces
Resource Manager | Host List | Job ID | Spawn Method
LSF | LSB_HOSTS | LSB_JOBID | pam
Sun Grid Engine | PE_HOSTFILE | JOB_ID | qrsh -inherit
PBS | PBS_NODEFILE | PBS_JOBID | tm_spawn()
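The host-list and job-ID entries in TABLE 1 are simple to consume. Under PBS, for example, the job ID is the value of the PBS_JOBID environment variable and the allocated hosts are listed, one per line, in the file named by PBS_NODEFILE (under LSF, LSB_HOSTS holds the host names directly). A minimal sketch of reading the PBS entries:

    #include <stdio.h>
    #include <stdlib.h>

    /* Read the PBS columns of TABLE 1: the containing job's ID from
     * PBS_JOBID and the allocated hosts from the file named by PBS_NODEFILE. */
    int main(void)
    {
        const char *jobid    = getenv("PBS_JOBID");
        const char *nodefile = getenv("PBS_NODEFILE");
        char host[256];
        FILE *fp;

        if (jobid == NULL || nodefile == NULL) {
            fprintf(stderr, "not running under PBS\n");
            return 1;
        }

        printf("hosts allocated to PBS job %s:\n", jobid);
        if ((fp = fopen(nodefile, "r")) == NULL)
            return 1;
        while (fscanf(fp, "%255s", host) == 1)
            printf("  %s\n", host);
        fclose(fp);
        return 0;
    }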
We have defined abstract function call interfaces for each of these capabilities; they are called during job startup, as we describe shortly. The interfaces for a particular RM are collected into a single plug-in library that the parallel environment loads dynamically with dlopen. Thus, the environment is insulated from the implementation details of the RM. The plug-in library is small, requiring fewer than 250 lines of source code for each RM that we implement. The plug-in architecture allows customers and other third parties to implement tight integrations with additional RMs without modifying the source code of Sun CRE. The plug-in interface specification is described in the xrm.3 manual page in the HPC ClusterTools 5 software documentation.
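The sketch below shows how such a plug-in might be opened with dlopen and a single entry point resolved with dlsym. The library name and symbol are hypothetical; the real entry points are those documented in xrm.3.

    #include <dlfcn.h>
    #include <stdio.h>

    /* Hypothetical signature of one plug-in entry point (job-ID lookup). */
    typedef int (*rm_get_jobid_fn)(char *buf, int buflen);

    int main(void)
    {
        /* Library name is illustrative; each RM supplies its own plug-in. */
        void *handle = dlopen("rm_pbs.so", RTLD_NOW);
        if (handle == NULL) {
            fprintf(stderr, "dlopen: %s\n", dlerror());
            return 1;
        }

        /* Resolve one of the plug-in's entry points (hypothetical symbol). */
        rm_get_jobid_fn get_jobid =
            (rm_get_jobid_fn)dlsym(handle, "rm_get_jobid");
        if (get_jobid == NULL) {
            fprintf(stderr, "dlsym: %s\n", dlerror());
            dlclose(handle);
            return 1;
        }

        char jobid[64];
        if (get_jobid(jobid, sizeof(jobid)) == 0)
            printf("containing job: %s\n", jobid);

        dlclose(handle);
        return 0;
    }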
The interaction between Sun CRE and PBS serves to illustrate the specifics of this architecture, and is shown in FIGURE 2. (The interaction with other RM systems is similar.)
FIGURE 2 Interaction Between Sun CRE and PBS During Job Startup
The user submits a job with the qsub command of PBS, which contacts the pbsmom daemon in step 1. The pbsmom daemon forks the job script myjob.sh (step 2), which invokes an mprun command (step 3). In step 4, mprun reads the list of allocated hosts and performs the Sun CRE daemon setup on these hosts as shown in FIGURE 1 (these steps are elided in FIGURE 2 for clarity). Unlike in FIGURE 1, iod no longer forks a.out. Instead, in step 5, mprun invokes a launcher program that calls the PBS tm_spawn function (step 6), which contacts another pbsmom daemon; that pbsmom daemon forks the a.out process in step 7. Lastly, the process determines its job ID, finds the iod associated with this job, and establishes a socket connection to that iod. This connection becomes the query socket over which the process obtains services from the parallel environment. Although our PBS integration calls the launcher from the single mprun process, we could instead make simultaneous calls to the launcher from multiple iod instances, which yields better startup scalability; this is what our Sun Grid Engine integration does. Whether an RM can handle such a distributed launch is expressed as a property in our plug-in interface.
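The launcher's use of the PBS Task Management interface in steps 5 through 7 might look roughly like the sketch below, which spawns a.out on the first allocated node. Error handling is abbreviated, and the exact types and return codes are those of the PBS tm(3) API; this is an illustration, not the actual launcher source.

    #include <stdio.h>
    #include <tm.h>    /* PBS Task Management API */

    int main(void)
    {
        struct tm_roots roots;
        tm_node_id *nodes;
        int nnodes, tm_errno;
        tm_task_id tid;
        tm_event_t spawn_event, done_event;
        char *child_argv[] = { "a.out", NULL };

        /* Attach to the containing PBS job. */
        if (tm_init(NULL, &roots) != TM_SUCCESS)
            return 1;

        /* The nodes allocated to this job. */
        tm_nodeinfo(&nodes, &nnodes);

        /* Step 6: ask the pbsmom on nodes[0] to fork a.out (step 7),
         * so the new process is a child of the RM's own daemon. */
        tm_spawn(1, child_argv, NULL, nodes[0], &tid, &spawn_event);

        /* Wait for the spawn request to be acknowledged. */
        tm_poll(TM_NULL_EVENT, &done_event, 1, &tm_errno);

        tm_finalize();
        return 0;
    }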
Our new architecture has a number of advantages:
Achieves a tight integration with multiple RMs in a uniform way. Differences between RM systems are abstracted in a plug-in library.
Allows the RM to have a parent-child relationship with the processes it manages.
Allows fast parallel startup of jobs.
Requires no modifications to the resource management software.
Gives the parallel environment and the RM visibility into a job. Commands of both systems can be used to view and signal jobs.
Makes the complete infrastructure of the parallel environment available to parallel jobs and tools.