Managing a Cluster Grid
In many cluster grid environments, large numbers of identically configured systems exist, providing opportunities to minimize administration time by taking advantage of diagnostic and automated installation tools.
SunVTS Software
SunVTS software can be used to perform hardware testing on a new node before the node is put into production. It can also be used for routine hardware checkups or to investigate a malfunctioning node. The SunVTS application includes many separate tests, each of which runs as a process separate from the SunVTS kernel. Tests are provided for processor, memory, network, communication, storage, and peripheral devices.
When SunVTS software is started, the SunVTS kernel automatically probes the system kernel to identify which hardware devices are installed, and displays the testable devices in the SunVTS UI. This provides a quick check of the hardware configuration, and only those tests applicable to that system are displayed. During testing, the hardware tests send the test status and messages to the SunVTS kernel through interprocess communication protocols. The kernel passes the status to the user interface and logs the messages.
Cluster Console Manager
In the compute farm environment, tasks that require simultaneous command line input to multiple clients, such as patch installs and software upgrades, can be performed using the Cluster Console tool set that is provided with Sun HPC ClusterTools software. The Cluster Console Manager (CCM) enables you to issue commands to all nodes in a cluster simultaneously through a graphical user interface.
The CCM offers three modes of operation: cconsole, ctelnet, and crlogin. The cconsole mode is particularly useful because it provides access to each node's console port through terminal concentrator links. To use it, the cluster nodes must be connected to terminal concentrator ports, and these node/port connections must be defined in the hpc_config file. Operations performed at the ok prompt, such as configuring boot PROM parameters, booting, and starting operating system installations, can be carried out with this tool.
The ctelnet and crlogin modes use telnet and rlogin, respectively, to log you in to every node in the cluster. Each of the three modes creates a command entry window, called the common window, and a separate console window, called a term window, for each node. Each command typed in the common window is echoed in all term windows (but not in the common window). Every term window displays the commands that you issue as well as the system messages logged by its node.
Solaris Jumpstart and Flash Software
Solaris Jumpstart software should be used for installations, at least in the compute tier of the cluster grid. When new hosts are added to the compute tier, or reinstalls are necessary, a well-configured Jumpstart environment greatly reduces the management time for these tasks. The Jumpstart environment allows the administrator to set the Solaris install type according to the characteristics of the Jumpstart client.
Post-install scripts can further speed the install by performing the required system configuration tasks automatically after the Solaris install. This is particularly useful for compute tier servers, which usually have relatively simple configurations and are deployed in large numbers.
For a compute host that is to be integrated into an existing Sun Grid Engine environment, the following tasks are required (in addition to the Solaris installation):
Performing simple configuration tasks such as populating files like /etc/hosts or /etc/system, adding the SGE administrator user, and so forth.
Mounting directories for Sun Grid Engine binaries, user home directories, libraries, and executables.
Executing setup scripts such as install_execd for a Grid Engine execution host. In this case, the new host must be registered as an admin host with the master node before the execution host is installed.
In addition, it may be useful to perform some testing, in which case the installation of SunVTS, possibly followed by the automated execution of selected hardware testing scripts, can be performed.
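The configuration steps listed above can be sketched in a short script. The following Python example is only an illustration under stated assumptions, not the procedure shipped with Jumpstart or Sun Grid Engine: the function names, the /gridware/sge default for SGE_ROOT, and the sample addresses are hypothetical. It shows two ideas: adding host entries idempotently, so that the script can be rerun safely on a reinstall, and keeping the admin host registration (assumed here to be done with qconf -ah on the master) ahead of install_execd.

```python
def merge_hosts(existing_text, entries):
    """Return hosts-file text with any missing (address, name) entries appended.

    existing_text is the current file content; entries is a list of
    (address, hostname) tuples. Names already present are left alone.
    """
    known = set()
    for line in existing_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop trailing comments
        known.update(fields[1:])                # names and aliases after the address
    out = existing_text
    if out and not out.endswith("\n"):
        out += "\n"
    for address, name in entries:
        if name not in known:
            out += f"{address}\t{name}\n"
            known.add(name)
    return out


def exec_host_plan(new_host, sge_root="/gridware/sge"):
    """Return the ordered (where, command) steps for adding an execution host.

    sge_root is a hypothetical default; substitute the site's SGE_ROOT.
    The admin host registration must come before install_execd.
    """
    return [
        ("master", ["qconf", "-ah", new_host]),      # register as admin host
        (new_host, [f"{sge_root}/install_execd"]),   # then install the exec host
    ]


if __name__ == "__main__":
    print(merge_hosts("127.0.0.1 localhost\n", [("10.0.0.5", "node05")]), end="")
    for where, command in exec_host_plan("node05"):
        print(where, " ".join(command))
```

The plan is returned as data rather than executed, so a site can review it or pass each step to whatever remote execution mechanism it already uses.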
Sun Grid Engine Software
The built-in logging and accounting capabilities of Sun Grid Engine allow administrators to keep track of compute jobs that run on the cluster grid. The software keeps a record of every job that SGE has run, including details such as start time, duration, CPU, memory, and I/O consumption, user statistics, and, in the case of SGEEE, project and department information. The built-in SGE command qacct displays summary information on resource consumption based on criteria such as user, project, resources requested, and time period. For more intricate sort criteria, or for use with third-party accounting or analysis tools, you can access the accounting records directly. By default, they are kept in a flat file named accounting in the $SGE_ROOT/$SGE_CELL/common directory (the accounting man page describes the file format). A how-to on the Grid Engine Project web site discusses how to rotate this file using a utility script provided with the software.
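As an illustration of reading the accounting records directly, the following Python sketch totals wall-clock time per job owner. It assumes that each record is one line of colon-separated fields and that the leading fields appear in the order listed in FIELDS; that order is an assumption to be checked against the accounting man page for the installed release. The function names and the sample records are hypothetical, and the sketch does not handle field values that themselves contain colons.

```python
from collections import defaultdict

# Assumed order of the leading fields in one accounting record.
FIELDS = [
    "qname", "hostname", "group", "owner", "job_name", "job_number",
    "account", "priority", "submission_time", "start_time", "end_time",
    "failed", "exit_status", "ru_wallclock",
]


def parse_record(line):
    """Return a dict of the leading fields, or None for comments and short lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split(":")
    if len(parts) < len(FIELDS):
        return None
    return dict(zip(FIELDS, parts))  # any trailing fields are ignored


def wallclock_by_owner(lines):
    """Sum the ru_wallclock field (seconds) per owner over an iterable of lines."""
    totals = defaultdict(float)
    for line in lines:
        record = parse_record(line)
        if record is None:
            continue
        try:
            seconds = float(record["ru_wallclock"])
        except ValueError:
            continue  # skip records with a malformed number
        totals[record["owner"]] += seconds
    return dict(totals)


# Made-up sample records for demonstration only.
SAMPLE = [
    "# sample accounting data",
    "compute.q:node01:staff:alice:sim1:101:sge:0:1000:1010:1070:0:0:60",
    "compute.q:node02:staff:bob:render:102:sge:0:1005:1020:1140:0:0:120",
    "compute.q:node01:staff:alice:sim2:103:sge:0:1100:1110:1140:0:0:30",
]

if __name__ == "__main__":
    for owner, seconds in sorted(wallclock_by_owner(SAMPLE).items()):
        print(owner, seconds)
```

Against a live installation, the same function can be given an open file object for the accounting file instead of the sample list.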