HPC Administration and Usage Tips
This section gives Sun HPC system users and administrators recommendations for configuring and using the Sun HPC software installation. The following list addresses the most frequently encountered issues:
Do not mix different versions of the hostname syntax for the cluster nodes, because doing so can prevent an HPC installation from completing successfully.
Provide a superuser (root) readable and writeable directory for synchronization.
Change the permissions to 0600 and the ownership to root on the /.rhosts and /etc/sunhpc_rhosts authentication files.
Refresh the resource database.
Try to avoid NFS-type installations.
Do not remove the /tmp/CRE-ctblfile file because it is needed by the CRE software.
Use the -t scale_factor option with the mprun(1) command to increase the timeout period.
Use the output of the hostname command in the hpc_config file. Make sure the same syntax is used in the /etc/hpc_system and the /.rhosts or /etc/sunhpc_rhosts files.
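The consistency requirement above can be checked mechanically. The following is a minimal sketch, not part of the Sun HPC software: it assumes the node names have already been collected into a plain list (the file formats are not described here, so no parsing is attempted), and it treats any name containing a dot as fully qualified. The function names and node names are hypothetical.

```python
def hostname_styles(names):
    """Group hostnames by style: 'short' (no dot) or 'qualified' (contains a dot)."""
    styles = {"short": [], "qualified": []}
    for name in names:
        name = name.strip()
        if not name:
            continue  # ignore blank entries
        styles["qualified" if "." in name else "short"].append(name)
    return styles


def is_consistent(names):
    """Return True when every hostname uses the same style."""
    styles = hostname_styles(names)
    return not (styles["short"] and styles["qualified"])


# Hypothetical node names, as they might be collected from the files above.
print(is_consistent(["node1", "node2"]))              # True
print(is_consistent(["node1", "node2.example.com"]))  # False
```

A mixed result suggests comparing each entry against the output of the hostname command on that node before running the installation.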
Usually the hpc_config file is saved in the synchronization directory, and all of the SYNC files used by the install script are created there. Check that this directory is readable and writeable by the superuser (root) before starting the installation.
Make sure that the authentication files contain the hostnames of all the cluster nodes, including the host on which the files reside.
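The ownership and permission settings recommended for the authentication files can be verified with a short script. This is a sketch under the assumption of a POSIX system where root has uid 0; the function name check_auth_file is hypothetical, and the script only reports problems rather than changing anything.

```python
import os
import stat


def check_auth_file(path, expected_uid=0, expected_mode=0o600):
    """Return a list of problems found with the file's owner and permissions."""
    info = os.stat(path)
    problems = []
    mode = stat.S_IMODE(info.st_mode)  # permission bits only
    if mode != expected_mode:
        problems.append(f"{path}: mode is {mode:04o}, expected {expected_mode:04o}")
    if info.st_uid != expected_uid:
        problems.append(f"{path}: owner uid is {info.st_uid}, expected {expected_uid}")
    return problems


if __name__ == "__main__":
    # The two authentication files named in this section.
    for auth_file in ("/.rhosts", "/etc/sunhpc_rhosts"):
        if os.path.exists(auth_file):
            for problem in check_auth_file(auth_file):
                print(problem)
```

An empty result for both files means that the mode is 0600 and the owner is root.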
The resource database sometimes gets out of sync and needs to be refreshed. The symptom is incorrect or unexpected output from the CRE commands. To refresh the database, stop the CRE daemons, remove the /var/rdb-* files, and restart the CRE daemons.
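The file removal step of the refresh can be sketched as follows. This is an illustrative helper only: the function name remove_rdb_files is hypothetical, it defaults to a dry run that lists the matching files without deleting them, and it deliberately leaves stopping and restarting the CRE daemons to the administrator, because the exact commands depend on the installation.

```python
import glob
import os


def remove_rdb_files(var_dir="/var", dry_run=True):
    """List the rdb-* files under var_dir and delete them unless dry_run is set."""
    targets = sorted(glob.glob(os.path.join(var_dir, "rdb-*")))
    if not dry_run:
        for path in targets:
            os.remove(path)
    return targets


if __name__ == "__main__":
    # Dry run: show what would be removed. Stop the CRE daemons before
    # calling this with dry_run=False, and restart them afterwards.
    for path in remove_rdb_files():
        print("would remove", path)
```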
Use SMP-local or cluster-local installations only. Local installations have completed cleanly more often than NFS installations, which are sensitive to the wide variation in network configurations.
A job spawned by the Sun CRE is closely tied to the /tmp/CRE-ctblfile file, which lives as long as the CRE daemons are running. Most computer sites have scripts that regularly clean up the /tmp directory. There have been instances where long-running jobs that take days to complete have failed because the /tmp/CRE-ctblfile file unexpectedly disappeared.
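A site cleanup script can avoid this failure by excluding the file explicitly. The following sketch shows one way to select stale files in /tmp while skipping /tmp/CRE-ctblfile. The function name, the seven-day age limit, and the decision to look only at regular files in the top level of the directory are all assumptions made for illustration.

```python
import os
import time

# File names that the cleanup must never touch.
PROTECTED = {"CRE-ctblfile"}


def stale_files(tmp_dir="/tmp", max_age_days=7, protected=PROTECTED, now=None):
    """Return regular files in tmp_dir older than max_age_days, skipping protected names."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for entry in os.scandir(tmp_dir):
        if entry.name in protected:
            continue  # needed by the CRE software while the daemons run
        if not entry.is_file(follow_symlinks=False):
            continue
        if entry.stat(follow_symlinks=False).st_mtime < cutoff:
            stale.append(entry.path)
    return sorted(stale)
```

The function only returns candidates. Whether and how to delete them remains a site policy decision.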
Jobs that spawn a large number of processes may, on rare occasions, fail with the following message:
mprun: tmrte_proc_spawn: select: Operation timed out: Operation timed out
This is due to the default timeout value used by the Sun CRE to spawn all of the processes of the job.
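A wrapper script can recognize this failure and rebuild the command line with the -t option. The sketch below is illustrative only: the helper names are hypothetical, the scale factor of 2 is an arbitrary example value rather than a recommendation, and the job arguments are placeholders for whatever the job normally passes to mprun.

```python
def is_spawn_timeout(stderr_text):
    """Return True if the text contains the spawn timeout message shown above."""
    return "tmrte_proc_spawn" in stderr_text and "Operation timed out" in stderr_text


def mprun_command(job_args, scale_factor=None):
    """Build an mprun argument list, adding -t scale_factor when one is given."""
    cmd = ["mprun"]
    if scale_factor is not None:
        cmd += ["-t", str(scale_factor)]
    return cmd + list(job_args)


# Placeholder job arguments; replace them with the real job's arguments.
message = "mprun: tmrte_proc_spawn: select: Operation timed out: Operation timed out"
if is_spawn_timeout(message):
    print(" ".join(mprun_command(["./a.out"], scale_factor=2)))
```

See the mprun(1) man page for how the scale factor is applied to the default timeout.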
Configure a large /tmp swap partition because the MPI programs running on a particular node use shared memory files that are mapped to the /tmp area.
TABLE 1 contains two examples of shared memory sizes with respect to the number of processes running on the same SMP:
TABLE 1 Shared Memory Size by Number of Processes
Processes per Job | Required Shared Memory |
2 | 35 Mbytes |
16 | 85 Mbytes |
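The figures in TABLE 1 can be compared against the space currently available in /tmp before a job is started. This is a rough sketch under stated assumptions: it uses the free space reported for the file system that holds /tmp as a stand-in for available shared memory, it supports only the two process counts given in the table, and the function names are hypothetical.

```python
import shutil

# Values copied from TABLE 1: processes per job -> required shared memory (Mbytes).
TABLE_1 = {2: 35, 16: 85}


def free_mbytes(path="/tmp"):
    """Return the free space, in Mbytes, of the file system that holds path."""
    return shutil.disk_usage(path).free / (1024 * 1024)


def has_room(processes, path="/tmp"):
    """Compare free space with the TABLE 1 value; only tabulated counts are supported."""
    return free_mbytes(path) >= TABLE_1[processes]


if __name__ == "__main__":
    for count in sorted(TABLE_1):
        print(count, "processes:", "ok" if has_room(count) else "not enough space in /tmp")
```

The check does not account for several jobs sharing the same node, so treat a positive result as a minimum condition only.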
Keep MPI network traffic separate from administrative and other network traffic to improve MPI application performance.
The above tips and recommendations frequently reappear on the support forums. See the Frequently Asked Questions at the following site:
http://supportforum.sun.com/clustertools
Read the Sun HPC ClusterTools 4 User's Guide and the Sun HPC ClusterTools 4 Product Notes at the following site:
Use the Sun Cluster Support forum at:
http://supportforum.sun.com/clustertools
There are several forums at this site that users and experts use to discuss issues that pertain to the Sun HPC ClusterTools and the Sun Grid Engine products.
Appendix
This appendix contains a copy of the /etc/sudoers file. It includes the necessary changes to administer the Sun HPC ClusterTools 4 software.
-------------Start of /etc/sudoers file-------------------
... snip ...
#
Host_Alias HPCHOSTS=<hostname>,<hostname>,...
#
User_Alias HPCUSERS=<username>,<username>,...
# Used for HPC CT 3.1
Cmnd_Alias HPCCMNDS=/opt/SUNWhpc/bin/*,/etc/init.d/sunhpc*,/opt/SUNWhpc/etc/*,/opt/SUNWhpc/etc/isa.h,/opt/SUNWhpc/etc/sparc*/*,/opt/SUNWhpc/etc/pfs/sparc*/*,/opt/SUNWhpc/bin/Install_Utilities/*
HPCUSERS HPCHOSTS=HPCCMNDS
#
... snip ...
-------------End of /etc/sudoers file-------------------
Acknowledgements
I would like to thank all of my colleagues from the many HPC-related groups for reviewing the original SUPerG white paper and offering their valuable feedback and suggestions.
References
This section contains the references used in this article.
Sun HPC ClusterTools 4 documentation set at:
The Sudo software at:
The Platform software at:
The Condor project at:
The Sun Gridware software at:
The Sun HPC ClusterTools 4 software support forum at:
http://supportforum.sun.com/clustertools
The Sun GE software at:
http://gridengine.sunsource.net
"HPC Best Practices" presented at the SUPerG conference in Paris, France, April 2000
Code examples on the Sun GE software site:
http://gridengine.sunsource.net/project/gridengine/howto/condorckpt.html
Ordering Sun Documents
The SunDocs(SM) program provides more than 250 manuals from Sun Microsystems, Inc. If you live in the United States, Canada, Europe, or Japan, you can purchase documentation sets or individual manuals through this program.
Accessing Sun Documentation Online
The docs.sun.com web site enables you to access Sun technical documentation online. You can browse the docs.sun.com archive or search for a specific book title or subject. The URL is http://docs.sun.com/
To reference Sun BluePrints OnLine articles, visit the Sun BluePrints OnLine Web site at: http://www.sun.com/blueprints/online.html