14.3 A Survey of Some Open-Source Parallel File Systems
This section provides an overview of some of the available parallel file systems. This is not an exhaustive list, just some of the options that I have come across in my research and experience. I cover several open-source parallel file systems and one commercially available product in these overviews.
A common issue with open-source parallel file systems is that they may require kernel reconfiguration expertise to install and use. Some of the functionality may be available in the form of premodified kernels and modules, precompiled kernel modules for a particular kernel version, or a do-it-yourself kit in the form of patches. If you are not running a production environment, or are open to experimentation and have the necessary kernel hacking skills, several of the file systems covered here might be options for you. Otherwise, a commercially available and supported solution is a better choice.
Many of the possible choices will require you to work with minimal documentation beyond the source code itself. This is a normal occurrence in the Linux and open-source world. The projects that we are examining are still works in progress, although some of their functionality is being included in the upcoming 2.6 release of the Linux kernel.
You should note that most of the parallel file systems we are examining use TCP/IP as a transport. This means that the parallel file systems may be run over Ethernet, but there are other choices. Recall that HSIs like Myrinet and Quadrics can carry TCP/IP alongside the message-passing traffic used by libraries such as MPI. This means that very fast transport is available for the parallel file system, provided that your cluster has an HSI in place.
14.3.1 The Parallel Virtual File System (PVFS)
The first cluster file system we will examine is the parallel virtual file system, or PVFS. There are currently two versions of this file system, PVFS1 and PVFS2. The newest version, PVFS2, requires the Linux 2.6 kernel, so for the purposes of this introduction, we will concentrate on PVFS1. Both versions are currently being maintained.
The PVFS1 software is available from http://www.parl.clemson.edu/pvfs and the PVFS2 software is available from http://www.pvfs.org/pvfs2. I chose the PVFS1 software as the first example because it requires no modifications to the kernel, just compilation of the sources. The architecture of PVFS is shown in Figure 14-4.
Figure 14-4 Architecture of the PVFS
Figure 14-4 shows an application talking to PVFS through a dynamically loaded kernel module that is supplied as part of the PVFS software. This module is provided to allow normal interaction between the application, the Linux virtual file system (VFS) layer, and the PVFS daemon, which performs I/O on behalf of the application. Applications may be recompiled to use a PVFS library that communicates directly with the meta-data server and the I/O daemons on the server nodes, thus eliminating the kernel communication (and one data copy).
The server nodes store data in an existing physical file system, such as ext3, to take advantage of existing journaling and recovery mechanisms. The PVFS iod process, or I/O daemon, performs file operations on behalf of the PVFS client system. These operations take place on files striped across the PVFS storage servers in a user-defined manner.
You may download the source for PVFS1 from ftp://ftp.parl.clemson.edu/pub/pvfs and perform the following steps:
# tar xvzf pvfs-1.6.2.tgz
# cd pvfs-1.6.2
# ./configure
# make
# make install
The installation places /usr/local/sbin/iod, /usr/local/sbin/mgr, the application library /usr/local/lib/pvfs.a, and a host of utilities in /usr/local/bin, such as pvfs-ping and pvfs-mkdir. The default installation also includes auxiliary information, such as the man pages, and places the header files for the development library under /usr/include. The entire operation takes approximately three minutes.
If you wish to use the kernel module to allow access to the file system for existing programs, you may download the pvfs-kernel-1.6.2-linux-2.4.tgz file and build the contents:
# tar xvzf pvfs-kernel-1.6.2-linux-2.4.tgz
# cd pvfs-kernel-1.6.2-linux-2.4
# ./configure
# make
# make install
# cp mount.pvfs /sbin/mount.pvfs
# mkdir /lib/modules/2.4.18-24.8.0/kernel/fs/pvfs
# cp pvfs.o /lib/modules/2.4.18-24.8.0/kernel/fs/pvfs
# depmod -a
Let's install the software package on five nodes: one meta-data server and four data servers. For this example, let's install the kernel package on a single node that is to be the PVFS client. For the meta-data server, install and configure the software, then start the meta-data manager daemon, mgr:
# mkdir /pvfs-meta
# cd /pvfs-meta
# mkmgrconf
This script will make the .iodtab and .pvfsdir files
in the metadata directory of a PVFS file system.
Enter the root directory (metadata directory):
/pvfs-meta
Enter the user id of directory:
root
Enter the group id of directory:
root
Enter the mode of the root directory:
777
Enter the hostname that will run the manager:
cs01
Searching for host...success
Enter the port number on the host for manager:
(Port number 3000 is the default)
<CR>
Enter the I/O nodes: (can use form node1, node2, ... or
nodename{#-#,#,#})
cs02, cs03, cs04, cs05
Searching for hosts...success
I/O nodes: cs02 cs03 cs04 cs05
Enter the port number for the iods:
(Port number 7000 is the default)
<CR>
Done!
# /usr/local/sbin/mgr
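Before moving on to the I/O servers, it is worth confirming that mgr is actually listening on its port. This is just a standard netstat check rather than a PVFS-specific tool; the port is the default 3000 accepted above:

# netstat -tlpn | grep 3000

You should see a line in the LISTEN state owned by the mgr process; if you do not, something went wrong with the manager startup and is worth investigating before continuing.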
On each of the data servers, configure and install the I/O server daemon, iod:
# cp /tmp/pvfs-1.6.2/system/iod.conf /etc/iod.conf
# mkdir /pvfs-data
# chmod 700 /pvfs-data
# chown nobody.nobody /pvfs-data
# /usr/local/sbin/iod
Then test to make sure that all the pieces are functioning. A very useful command for this is pvfs-ping, which contacts all the participating systems:
# /usr/local/bin/pvfs-ping -h cs01 -f /pvfs-meta
mgr (cs01:3000) is responding.
iod 0 (192.168.0.102:7000) is responding.
iod 1 (192.168.0.103:7000) is responding.
iod 2 (192.168.0.104:7000) is responding.
iod 3 (192.168.0.105:7000) is responding.
pvfs file system /pvfs-meta is fully operational.
Wahoo! We can now turn our attention to the client. There are a few minor steps before we can mount the file system there. First we need to create the /etc/pvfstab file, which contains the mounting information:
cs01:/pvfs-meta /mnt/pvfs pvfs port=3000 0 0
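The entry follows the familiar /etc/fstab layout. Broken out field by field (the annotations are mine, not part of the file; all six fields go on a single line as shown above):

cs01:/pvfs-meta    meta-data manager host and its metadata directory
/mnt/pvfs          local mount point on the client
pvfs               file system type
port=3000          meta-data manager port (the default chosen earlier)
0 0                dump and fsck-order fields, unused here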
Next, we can actually mount the parallel file system, which involves loading the kernel module and starting the pvfsd daemon to handle communication with "normal" programs and commands:
# insmod pvfs
# /usr/local/sbin/pvfsd
# mkdir /mnt/pvfs
# /sbin/mount.pvfs cs01:/pvfs-meta /mnt/pvfs
# /usr/local/bin/pvfs-statfs /mnt/pvfs
blksz = 65536, total (bytes) = 388307550208, free (bytes) = 11486298112
We are now ready to start using the file system. Let's load a copy of iozone, which is a handy file system performance tool, available from http://www.iozone.org, and start running tests. Things should work just fine within the limits of your test network and systems. For performance and scaling information, see the PVFS Web site.
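As a quick sanity check, something along the following lines exercises the new mount point. The file name and sizes are arbitrary choices for illustration, not values from the PVFS documentation; -a selects iozone's automatic mode, -i 0 and -i 1 select the write and read tests, -s and -r set the file and record sizes, and -f points iozone at a file on the PVFS mount:

# iozone -a -f /mnt/pvfs/iozone.tmp
# iozone -i 0 -i 1 -s 256m -r 64k -f /mnt/pvfs/iozone.tmp

The first form runs a spread of file and record sizes; the second runs a single, more targeted write and read pass.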
Although PVFS1, the version we are testing here, is still maintained, it is likely to be superseded by PVFS2 when the Linux 2.6 kernel becomes generally available. PVFS2 brings a number of noticeable improvements in the tools for configuring and managing the file system components, along with more documentation. Download and try PVFS1 for your cluster, but keep your eyes on PVFS2.
14.3.2 The Open Global File System (OpenGFS)
The Open Global File System, also known as OpenGFS and OGFS, is another file system that provides direct client access to shared storage resources. The systems that use OpenGFS access the storage devices through a Fibre-Channel SAN (Figure 14-5). Some of the features of OpenGFS are
- Individual journals for separate systems
- Pluggable locking services based on LAN connections
- Management of cluster "participation" to manage journaling and locking actions
- "Fencing" of storage access to prevent corruption by failed nodes [1]
- Support for volume manager aggregation of multiple devices, with volume managers like the enterprise volume management system (EVMS; available at http://evms.sourceforge.net/docs.php)
Figure 14-5 OpenGFS architecture
The OpenGFS software requires compilation and patching of the kernel as part of the installation process. This means that only certain versions of the kernel are acceptable for use with OpenGFS. If you choose to use EVMS in conjunction with OpenGFS, additional patching is required. The current release version, as of this writing, is 0.3.0.
Because the level of modification to the kernel requires specific software development and kernel-patching expertise, it is beyond the scope of this discussion. If you are interested in experimenting with OpenGFS, there is plenty of documentation available for obtaining the software, compiling it, patching the kernel, and installing it (go to http://opengfs.sourceforge.net/docs.php). Among the HOW-TO information on this site are documents on using OpenGFS with iSCSI, IEEE-1394 (Firewire), and Fibre-Channel storage devices.
14.3.3 The Lustre File System
The name Lustre is a melding of the words Linux and cluster, with a tweak of the last two letters. Although the vision of the final Lustre file system is far from realized, version 1.0 (actually version 1.0.4 as of this writing) is now available for download and demonstration purposes. Information on Lustre is available from http://www.lustre.org. Before we actually try Lustre, let's enumerate some of the target design attributes:
- Object-oriented storage access (hundreds of object storage servers [OSSs])
- Scalable to large numbers of clients (tens of thousands)
- Scalable to huge amounts of storage (petabytes)
- Scalable to large amounts of aggregated I/O (hundreds of gigabytes per second)
- Manageable and immune to failures
As you can tell from the descriptive items, the word scalable comes up a lot with reference to Lustre. So do the words huge and large. Although the end goal is a parallel file system that scales for use in very large clusters, there is nothing to prevent us from using Lustre for our own, albeit smaller, clusters.
Storage in Lustre is allocated at a higher level than the physical hardware blocks that are associated with disk sectors. File objects may be spread over multiple "object storage servers," with a variety of flexible allocation and access schemes. The internal architecture of Lustre is quite complex, and to start even a high-level description we should examine a diagram. Figure 14-6 shows a high-level collection of Lustre components.
Figure 14-6 Lustre high-level architecture
Lustre clients access meta-data servers (MDSs) and OSSs using a message-passing library called "Portals" from Sandia National Laboratories (see http://www.sandiaportals.org and http://sourceforge.net/projects/sandiaportals). This library, among other things, provides a network abstraction layer that allows message passing over a variety of physical transports, including TCP, InfiniBand, SCI, Myrinet, and Quadrics ELAN3 and ELAN4. The specification includes provisions for RDMA and operating system bypass to improve efficiency.
Lustre clients create files from objects allocated on multiple OSS systems. The file contents are kept on the OSS systems, but the meta-data is stored and maintained by a separate meta-data server. Later versions of Lustre are intended to allow many MDS systems to exist, to scale the meta-data access, but the current implementation allows for only one MDS (and possibly a fail-over copy, although the documentation is unclear on this point).
Lustre OSS systems use the physical file systems of a Linux host to store the file objects. This provides journaling and other recovery mechanisms that would otherwise need to be implemented "from scratch" on top of raw devices. It is entirely possible that manufacturers will create special OSS hardware with the Lustre behavior captured in firmware, thus providing an OSS appliance. This does not appear to have happened yet.
To try Lustre, you can download two packages from the Web site:
kernel-smp-2.4.20-28.9_lustre.1.0.4.i586.rpm
lustre-lite-utils-2.4.20-28.9_lustre.1.0.4.i586.rpm
There is also a kernel source package and an all-inclusive package that contains the kernel and the utility sources. The precompiled packages are easy enough for a quick trial. As you can see by the names, the packages are based on a version of the kernel from Red Hat 9.0. The two packages and the HOW-TO at https://wiki.clusterfs.com/lustre/lustrehowto can help us get started.
First, install the two packages. Make sure that you use the install option to the rpm command, not the update option, or you might replace your current kernel. We want to install the software in addition to the current kernel:
# rpm -ivh kernel-smp-2.4.20-28.9_lustre.1.0.4.i586.rpm \
      lustre-lite-utils-2.4.20-28.9_lustre.1.0.4.i586.rpm
This will place the kernel in /boot, update the bootloader menu, and install the required modules in /lib/modules/2.4.20-28.9_lustre.1.0.4smp. When you reboot your system, you should see a menu item for 2.4.20-28.9_lustre.1.0.4smp. Select it and continue booting the system.
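Once the system is back up, a quick check with uname confirms that the Lustre kernel is the one actually running; the release string should match the module directory installed by the package:

# uname -r
2.4.20-28.9_lustre.1.0.4smp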
The configuration information for Lustre is kept in a single file that is used by all members of the cluster. This file is in XML format and is produced and operated on by three main configuration utilities in /usr/sbin: lmc, lconf, and lctl. There is a rather large PDF document (422 pages) describing the architecture and setup of Lustre at http://www.lustre.org/docs/lustre.pdf. It is not necessary to understand the whole picture to try out Lustre on your system. There are example scripts that will create a client, OSS, and MDS on the same system, using the loop-back file system.
The demonstration scripts are located in /usr/lib/lustre/examples. There are two possible demonstrations. The one we are using activates local.sh as part of the configuration:
# NAME=local.sh; ./llmount.sh
loading module: portals srcdir None devdir libcfs
loading module: ksocknal srcdir None devdir knals/socknal
loading module: lvfs srcdir None devdir lvfs
loading module: obdclass srcdir None devdir obdclass
loading module: ptlrpc srcdir None devdir ptlrpc
loading module: ost srcdir None devdir ost
loading module: fsfilt_ext3 srcdir None devdir lvfs
loading module: obdfilter srcdir None devdir obdfilter
loading module: mdc srcdir None devdir mdc
loading module: osc srcdir None devdir osc
loading module: lov srcdir None devdir lov
loading module: mds srcdir None devdir mds
loading module: llite srcdir None devdir llite
NETWORK: NET_localhost_tcp NET_localhost_tcp_UUID tcp cs01 988
OSD: OST_localhost OST_localhost_UUID obdfilter /tmp/ost1-cs01 200000 ext3 no 0 0
MDSDEV: mds1 mds1_UUID /tmp/mds1-cs01 ext3 no
recording clients for filesystem: FS_fsname_UUID
Recording log mds1 on mds1
OSC: OSC_cs01_OST_localhost_mds1 2a3da_lov_mds1_7fe101b48f OST_localhost_UUID
LOV: lov_mds1 2a3da_lov_mds1_7fe101b48f mds1_UUID 0 65536 0 0 [u'OST_localhost_UUID'] mds1
End recording log mds1 on mds1
Recording log mds1-clean on mds1
LOV: lov_mds1 2a3da_lov_mds1_7fe101b48f
OSC: OSC_cs011_OST_localhost_mds1 2a3da_lov_mds1_7fe101b48f
End recording log mds1-clean on mds1
MDSDEV: mds1 mds1_UUID /tmp/mds1-cs01 ext3 100000 no
OSC: OSC_cs01_OST_localhost_MNT_localhost 986a7_lov1_d9a476ab41 OST_localhost_UUID
LOV: lov1 986a7_lov1_d9a476ab41 mds1_UUID 0 65536 0 0 [u'OST_localhost_UUID'] mds1
MDC: MDC_cs01_mds1_MNT_localhost 91f4c_MNT_localhost_bb96ccc3fb mds1_UUID
MTPT: MNT_localhost MNT_localhost_UUID /mnt/lustre mds1_UUID lov1_UUID
There are now a whole lot more kernel modules loaded than there were before we started. A quick look with the lsmod command yields
Module                 Size  Used by    Not tainted
loop                  11888   6 (autoclean)
llite                385000   1
mds                  389876   2
lov                  198472   2
osc                  137024   2
mdc                  110904   1 [llite]
obdfilter            180416   1
fsfilt_ext3           25924   2
ost                   90236   1
ptlrpc               782716   0 [llite mds lov osc mdc obdfilter ost]
obdclass             558688   0 [llite mds lov osc mdc obdfilter fsfilt_ext3 ost ptlrpc]
lvfs                  27300   1 [mds obdfilter fsfilt_ext3 ptlrpc obdclass]
ksocknal              81004   5
portals              144224   1 [llite mds lov osc mdc obdfilter fsfilt_ext3 ost ptlrpc obdclass lvfs ksocknal]
These are the Lustre modules that are loaded in addition to the normal system modules. If you examine some of the Lustre architecture documents, you will recognize some of these module names as being associated with Lustre components. Names like ost and obdclass become familiar after a while. Notice the number of modules that use the portals module.
The output of the mount command shows that we do indeed have a Lustre file system mounted:
local on /mnt/lustre type lustre_lite (rw,osc=lov1,mdc=MDC_cs01_mds1_MNT_localhost)
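If you want to poke at the running configuration, the lctl utility mentioned earlier can list the devices that lconf set up, and ordinary tools such as df treat the mount point like any other file system. The exact lctl output format varies between Lustre releases, so only the commands are shown here:

# lctl dl
# df -h /mnt/lustre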
There are also a lot of files under the /proc/fs/lustre directories that contain state and status information for the file system components. Taking a look at the local.xml configuration file created by the example, we can see lots of Lustre configuration information required to start up the components (I have used bold type for the major matching XML tags in the output):
<?xml version='1.0' encoding='UTF-8'?>
<lustre version='2003070801'>
  <ldlm name='ldlm' uuid='ldlm_UUID'/>
  <node uuid='localhost_UUID' name='localhost'>
    <profile_ref uuidref='PROFILE_localhost_UUID'/>
    <network uuid='NET_localhost_tcp_UUID' nettype='tcp' name='NET_localhost_tcp'>
      <nid>hpepc1</nid>
      <clusterid>0</clusterid>
      <port>988</port>
    </network>
  </node>
  <profile uuid='PROFILE_localhost_UUID' name='PROFILE_localhost'>
    <ldlm_ref uuidref='ldlm_UUID'/>
    <network_ref uuidref='NET_localhost_tcp_UUID'/>
    <mdsdev_ref uuidref='MDD_mds1_localhost_UUID'/>
    <osd_ref uuidref='SD_OST_localhost_localhost_UUID'/>
    <mountpoint_ref uuidref='MNT_localhost_UUID'/>
  </profile>
  <mds uuid='mds1_UUID' name='mds1'>
    <active_ref uuidref='MDD_mds1_localhost_UUID'/>
    <lovconfig_ref uuidref='LVCFG_lov1_UUID'/>
    <filesystem_ref uuidref='FS_fsname_UUID'/>
  </mds>
  <mdsdev uuid='MDD_mds1_localhost_UUID' name='MDD_mds1_localhost'>
    <fstype>ext3</fstype>
    <devpath>/tmp/mds1-cs01</devpath>
    <autoformat>no</autoformat>
    <devsize>100000</devsize>
    <journalsize>0</journalsize>
    <inodesize>0</inodesize>
    <nspath>/mnt/mds_ns</nspath>
    <mkfsoptions>-I 128</mkfsoptions>
    <node_ref uuidref='localhost_UUID'/>
    <target_ref uuidref='mds1_UUID'/>
  </mdsdev>
  <lov uuid='lov1_UUID' stripesize='65536' stripecount='0' stripepattern='0' name='lov1'>
    <mds_ref uuidref='mds1_UUID'/>
    <obd_ref uuidref='OST_localhost_UUID'/>
  </lov>
  <lovconfig uuid='LVCFG_lov1_UUID' name='LVCFG_lov1'>
    <lov_ref uuidref='lov1_UUID'/>
  </lovconfig>
  <ost uuid='OST_localhost_UUID' name='OST_localhost'>
    <active_ref uuidref='SD_OST_localhost_localhost_UUID'/>
  </ost>
  <osd osdtype='obdfilter' uuid='SD_OST_localhost_localhost_UUID' name='OSD_OST_localhost_localhost'>
    <target_ref uuidref='OST_localhost_UUID'/>
    <node_ref uuidref='localhost_UUID'/>
    <fstype>ext3</fstype>
    <devpath>/tmp/ost1-cs01</devpath>
    <autoformat>no</autoformat>
    <devsize>200000</devsize>
    <journalsize>0</journalsize>
    <inodesize>0</inodesize>
    <nspath>/mnt/ost_ns</nspath>
  </osd>
  <filesystem uuid='FS_fsname_UUID' name='FS_fsname'>
    <mds_ref uuidref='mds1_UUID'/>
    <obd_ref uuidref='lov1_UUID'/>
  </filesystem>
  <mountpoint uuid='MNT_localhost_UUID' name='MNT_localhost'>
    <filesystem_ref uuidref='FS_fsname_UUID'/>
    <path>/mnt/lustre</path>
  </mountpoint>
</lustre>
I believe we are seeing the future of configuration information, for better or worse. Although XML might make a good, easy-to-parse, universal "minilanguage" for software configuration, it certainly lacks human readability. If this means we get nifty tools to create the configuration, and nothing ever goes wrong to force us to read this stuff directly, I guess I will be happy.
The example script calls lmc to operate on the XML configuration file, followed by lconf to load the modules and activate the components. As you can see from the configuration, an MDS and object storage target (OST) are created along with a logical volume (LOV) that is mounted by the client "node." Although this example all takes place on the local host, it is possible to experiment and actually distribute the components onto other systems in the network.
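For reference, local.sh builds this XML with a short series of lmc invocations and then hands the file to lconf. The sketch below is loosely based on the example scripts shipped with Lustre 1.0.x and on the values visible in the XML above; the exact option names (particularly the stripe-related ones) can differ between releases, so treat it as an outline and check the script included with your version rather than copying it verbatim:

# lmc -o local.xml --add node --node localhost
# lmc -m local.xml --add net --node localhost --nid `hostname` --nettype tcp
# lmc -m local.xml --add mds --node localhost --mds mds1 --fstype ext3 \
      --dev /tmp/mds1-`hostname` --size 100000
# lmc -m local.xml --add lov --lov lov1 --mds mds1 --stripe_sz 65536 \
      --stripe_cnt 0 --stripe_pattern 0
# lmc -m local.xml --add ost --node localhost --lov lov1 --fstype ext3 \
      --dev /tmp/ost1-`hostname` --size 200000
# lmc -m local.xml --add mtpt --node localhost --path /mnt/lustre \
      --mds mds1 --lov lov1
# lconf --reformat local.xml

Each lmc call adds one element (node, network, MDS, LOV, OST, or mount point) to the XML file, and lconf then loads the modules and starts the pieces described in the node's profile.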
Cleaning up after ourselves, we perform the following:
# NAME=local.sh; ./llmountcleanup.sh
MTPT: MNT_localhost MNT_localhost_UUID /mnt/lustre mds1_UUID lov1_UUID
MDC: MDC_cs01_mds1_MNT_localhost 21457_MNT_localhost_4aafe7d139
LOV: lov1 d0db0_lov1_aa2c6e8c14
OSC: OSC_cs01_OST_localhost_MNT_localhost d0db0_lov1_aa2c6e8c14
MDSDEV: mds1 mds1_UUID
OSD: OST_localhost OST_localhost_UUID
NETWORK: NET_localhost_tcp NET_localhost_tcp_UUID tcp cs01 988
killing process 3166
removing stale pidfile: /var/run/acceptor-988.pid
unloading module: llite
unloading module: mdc
unloading module: lov
unloading module: osc
unloading module: fsfilt_ext3
unloading module: mds
unloading module: obdfilter
unloading module: ost
unloading module: ptlrpc
unloading module: obdclass
unloading module: lvfs
unloading module: ksocknal
unloading module: portals
lconf DONE
To try the multinode example, you will need to install the Lustre kernel on multiple nodes, reboot, and feed a common XML configuration file to the lconf command on each node. Because this is relatively intrusive, we will not attempt the nonlocal demonstration here. The multinode example is left as an exercise for you, when you feel ambitious.
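If you do decide to try it, the general shape of the procedure is straightforward: build one cluster-wide XML file with lmc (adding a net entry for every node, the MDS and OSTs on their respective server nodes, and a mount point for the clients), copy that file to every node, and run lconf against it on each node, servers before clients. The host names, file name, and directory below are placeholders of my own choosing, and the lconf --node option (used to select a node's profile from the shared file) should be confirmed against the HOW-TO for your release:

# scp cluster.xml cs01:/etc/lustre/cluster.xml
# scp cluster.xml cs02:/etc/lustre/cluster.xml
# ssh cs01 lconf --reformat --node cs01 /etc/lustre/cluster.xml
# ssh cs02 lconf --node cs02 /etc/lustre/cluster.xml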
One of the interesting things about Lustre is the abstraction of the file access. There are what might be called "pseudo" drivers that implement access patterns (striping, RAID, and so on) across the multiple OSS devices. This architecture appears to be substantially different from some of the other parallel file systems that we will encounter in terms of the "back-end" storage and SAN configurations. The I/O requests (object accesses) may be spread across multiple SANs or physically attached devices on the OSS systems to produce the promised level of scaling.
Lustre development continues today, and many of the architectural promises have not yet been realized. The level of support for this project from major corporations like Hewlett-Packard, however, shows a potential for realizing the goals, and the project is on a stepwise approach to the full vision (what we have today is "Lustre Lite"). We need to keep our eyes on Lustre as it matures and extends its features.