Networking 101 for Clusters
Networking is an essential part of clustering. Computers just can't talk to each other on their own; that would be creepy. Parallel clusters need a dedicated networking environment, whereas quite a few load-balanced, distributed, and even some HA solutions are designed to run over a WAN. To better understand the relationship between the network and the cluster, we need to understand network issues and how they affect cluster communications. Although we won't get into a detailed explanation of TCP/IP and networking, a cursory examination is provided here.
The OSI Networking Model
The International Organization for Standardization (ISO), an international body composed of national standards bodies from more than 75 countries, created the Open Systems Interconnection (OSI) model so that different vendors could design networks that would be able to talk to each other. The OSI model, finally standardized in 1984, is basically a reference that serves as a general framework for networking.
About 30 years ago, it wasn't uncommon to find vendors whose computers couldn't talk to other vendors' machines. Operating systems and mainframes were rather proprietary, and the communications of the time were mostly proprietary as well. In 1977, the British Standards Institution proposed to the ISO that an international standard for distributed processing be created. The American National Standards Institute (ANSI) was charged with presenting a framework for networking, and they came up with a Seven Layer model (The Origins of OSI; http://williamstallings.com/Extras/OSI.html). This model is shown in Figure 1.3.
The top layer, Layer Seven, is the Application Layer and is where the end user actually interfaces with the computer itself. Layer Seven encompasses such applications as the web, telnet, SSH, and email.
Layer Six, the Presentation Layer, presents data to the Application Layer. Computers take generic data and turn it into formats such as text, images, sound, and video at this layer. Data translation, compression, and encryption are handled here.
Layer Five, the Session Layer, sets up, manages, and tears down the sessions between applications, which gives the Presentation Layer an orderly channel for its data. Examples of applications that utilize the Session Layer include X Window, NFS, AppleTalk, and RPC.
Figure 1.3 OSI Network Layer diagram.
The Transport Layer, Layer Four, provides error detection and control, flow control, and the multiplexing of transport connections onto data connections (multiplexing allows the data from different applications to share one data stream). It then passes the data on to the Network Layer.
Layer Three, the Network Layer, is responsible for transmitting data across networks. Two types of packets are used at this level: data packets and route updates. Routers, which work at this level, keep data about network addresses, routing tables, and the distance to remote networks.
Layer Two, the Data Link Layer, translates messages from the Network Layer into bits for the Physical Layer. This layer handles error control, reliability, and integrity issues. The Data Link Layer packages the data into frames and adds a customized header containing the source and destination hardware addresses. The layer identifies each device on the network as well.
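The framing step can be sketched in a few lines of Python. This is only an illustration, assuming a header layout that loosely follows an Ethernet II frame (destination address, source address, type field) and a CRC-32 trailer computed with zlib; the names build_frame and parse_frame are hypothetical, and a real network driver handles details that are ignored here.

```python
import struct
import zlib

# Assumed header layout: 6-byte destination, 6-byte source, 2-byte type field.
HEADER = struct.Struct("!6s6sH")


def mac_to_bytes(mac):
    """Convert a string such as '02:00:00:00:00:01' into six raw bytes."""
    raw = bytes(int(part, 16) for part in mac.split(":"))
    if len(raw) != 6:
        raise ValueError("not a hardware address: %r" % mac)
    return raw


def build_frame(dst, src, ethertype, payload):
    """Wrap a payload in a header and append a CRC-32 checksum trailer."""
    body = HEADER.pack(mac_to_bytes(dst), mac_to_bytes(src), ethertype) + payload
    return body + struct.pack("!I", zlib.crc32(body))


def parse_frame(frame):
    """Verify the checksum, then split the frame back into its fields."""
    body, trailer = frame[:-4], frame[-4:]
    (expected,) = struct.unpack("!I", trailer)
    if zlib.crc32(body) != expected:
        raise ValueError("checksum mismatch: frame was corrupted")
    dst, src, ethertype = HEADER.unpack(body[:HEADER.size])
    return dst, src, ethertype, body[HEADER.size:]
```

The checksum is what lets the receiving side detect a damaged frame: parse_frame raises an error instead of handing corrupted data up to the Network Layer.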
The bottom layer, Layer One, or the Physical Layer, sends and receives information in the form of ones and zeros. The characteristics of this layer include specifications for signal voltages, wire width and length, and signaling.
Different devices operate at different layers. The hub, when cabled, only amplifies or retransmits data through all its ports. In this way, it operates only on Layer One. A switch, however, operates on Layers One and Two, and a router operates on Layers One, Two, and Three. The end user's workstation would typically handle Layers Five, Six, and Seven (although the versatility of Linux could allow it to handle much more).
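The device-to-layer relationships just described can be captured in a small lookup table. The following Python sketch simply restates the text as data; the names OSI_LAYERS, DEVICE_LAYERS, layers_for, and devices_at are made up for this example.

```python
# Layer numbers and names as given in the OSI model.
OSI_LAYERS = {
    1: "Physical",
    2: "Data Link",
    3: "Network",
    4: "Transport",
    5: "Session",
    6: "Presentation",
    7: "Application",
}

# Which layers each kind of device typically works on, per the text above.
DEVICE_LAYERS = {
    "hub": {1},
    "switch": {1, 2},
    "router": {1, 2, 3},
    "workstation": {5, 6, 7},
}


def layers_for(device):
    """Return the layer names a device operates on, lowest layer first."""
    return [OSI_LAYERS[number] for number in sorted(DEVICE_LAYERS[device])]


def devices_at(layer):
    """Return the devices that operate on the given layer number."""
    return sorted(name for name, layers in DEVICE_LAYERS.items() if layer in layers)
```

For example, devices_at(2) lists the switch and the router but not the hub, which is why a hub cannot make decisions based on hardware addresses.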
The point of having such a layered approach is so that the different aspects of networking can work with each other yet remain independent. This allows application developers the freedom to work on one aspect of a layer while expecting the other layers to work as planned. Without these layers, the developer would have to make sure that every aspect of the application included support for each layer. In other words, instead of just coding a simple game, the development team would have to code not only the game itself, but also the picture formats and the TCP/IP stack, and would even have to develop the router to transmit the information.
Why is learning this so important for clustering? First of all, it aids in troubleshooting. Knowing where the problem lies is the most important step to solving the problem. Every troubleshooting method needs to start somewhere, and by going over the OSI model, you can easily diagnose where the problem lies. By isolating each layer, you can track down problems and resolve them.
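One common way to apply this is to test the layers from the bottom up and stop at the first one that fails. The sketch below shows only that control flow; the function name first_failing_layer is an assumption, and the probes are placeholders for whatever checks fit your environment (a link light, an ARP lookup, a ping, and so on).

```python
def first_failing_layer(checks):
    """Run probes from the lowest layer upward and report the first failure.

    checks is a list of (layer_number, layer_name, probe) tuples in any
    order, where probe is a function that returns True when the layer
    looks healthy. Returns (layer_number, layer_name) for the first
    failing layer, or None if every probe passes.
    """
    for number, name, probe in sorted(checks, key=lambda check: check[0]):
        if not probe():
            return number, name
    return None


# Placeholder probes; replace the lambdas with real checks.
checks = [
    (3, "Network", lambda: False),   # e.g. the gateway does not answer a ping
    (1, "Physical", lambda: True),   # e.g. the link light is on
    (2, "Data Link", lambda: True),  # e.g. the neighbor's hardware address resolves
]
print(first_failing_layer(checks))
```

Here the two lowest layers pass, so the function points at the Network Layer as the place to start looking.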
Network Topology
Different types of clusters need different types of network topologies depending on the framework involved. An HA network might need more attention to detail regarding security to maintain uptime than a distributed computing environment or a parallel clustering scenario.
Picture the following HA scenario, if you will, as shown in Figure 1.4. A nationwide bank has Points of Presence (POPs) in three cities across the United States. Each city has its own cluster with its own database, and each is connected directly to the Internet from its own city, yet each city has to have access to the others' data. How is this best achieved? Topologies have to be thought through clearly in advance. Consider the topology of clusters spread across a WAN, for instance.
In this scenario, three sites are connected over the public Internet, with a firewall for security. This isn't the most effective method of achieving high availability, but it's a common scenario.
Figure 1.4 High availability across the WAN.
One way to go about redesigning this scenario is to remove two satellite cities from the Internet and connect each of these cities through direct frame relay to the internal network, thereby dropping the firewall from two of the satellite offices. Another approach would be to implement a load balanced network. POPs could easily be placed in key locations across the country so that customers could have relatively local access to the financial data. But because the data has to be synchronized between three cities in real time, the bandwidth involved would be tremendous.
A parallel cluster still needs a network design, although one that is much simpler in scope. Before Fast Ethernet to the desktop and Gigabit Ethernet for servers became the common standard, hypercubes were the primary means for designing high-performance clusters. A hypercube is a specific method to lay out a parallel cluster, usually using regular Ethernet. The trick to designing a hypercube was that each computer would have to have a direct network connection to each of its neighboring nodes in the cube; a cube of dimension d holds 2^d nodes, and each node needs d links. This worked well with smaller cubes, although the size of the cube was somewhat limited by the number of network cards that could fit in any one computer. Larger clusters required meshed designs with hubs to support each node and the requisite multiple connections. Needless to say, they're quite messy because of all the intermeshed cabling. With the advent of Fast Ethernet and Gigabit Ethernet, a simple managed or unmanaged switch will take care of most of the bandwidth problems. Of course, "the faster, the better" is the general motto, so be sure to consider the network when budgeting your cluster.
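To see why the number of network cards limits the size of the cube, it helps to count links. In the standard hypercube layout, nodes are numbered in binary and each node links to the nodes whose numbers differ from its own in exactly one bit. The Python sketch below works through that arithmetic and compares it with wiring every node directly to every other node; the function names are made up for this example.

```python
def hypercube_neighbors(node, dimension):
    """Return the nodes directly cabled to `node` in a cube of 2**dimension nodes."""
    if not 0 <= node < 2 ** dimension:
        raise ValueError("node number out of range for this cube")
    # Flipping each bit of the node number in turn yields one neighbor per dimension.
    return [node ^ (1 << bit) for bit in range(dimension)]


def hypercube_link_count(dimension):
    """Total cables: each of the 2**d nodes has d links, each shared by two nodes."""
    return dimension * 2 ** dimension // 2


def full_mesh_link_count(nodes):
    """Total cables if every node is wired directly to every other node."""
    return nodes * (nodes - 1) // 2
```

An 8-node cube (dimension 3) needs three network cards per node and 12 cables in total, whereas wiring the same 8 nodes as a full mesh would need seven cards per node and 28 cables.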
Services to Consider
Along with the physical development of the cluster and the network topology, deciding which services to enable is the next step toward a finished clustering solution. Although most Linux distributions offer access to all the standard services listed in /etc/services, the system administrator has to determine if those services are applicable to their environment.
In a high-security environment, the system administrator might have no choice but to tighten down these resources. The most common ways of disabling services include restricting access to them through the firewall, setting up an internal network, or utilizing /etc/hosts.deny and /etc/hosts.allow. Those services that are started through inetd can be commented out in /etc/inetd.conf.
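Commenting out a service by hand is easy enough, but across many nodes it is worth scripting. The following Python sketch works on the text of an inetd.conf-style file, in which each active line begins with the service name; the function name disable_inetd_service is an assumption, and the sketch deliberately returns new text rather than writing to /etc/inetd.conf itself.

```python
def disable_inetd_service(config_text, service):
    """Return config_text with every active entry for `service` commented out.

    Lines that are blank, already commented, or belong to another service
    are passed through unchanged.
    """
    result = []
    for line in config_text.splitlines():
        fields = line.split()
        is_comment = line.lstrip().startswith("#")
        if fields and not is_comment and fields[0] == service:
            result.append("#" + line)
        else:
            result.append(line)
    return "\n".join(result) + "\n"


sample = (
    "ftp     stream  tcp  nowait  root  /usr/sbin/tcpd  in.ftpd\n"
    "telnet  stream  tcp  nowait  root  /usr/sbin/tcpd  in.telnetd\n"
)
print(disable_inetd_service(sample, "telnet"))
```

After the real file is changed, inetd has to be told to reread its configuration (typically by sending it a hangup signal) before the service actually stops answering.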
It's no surprise that, to enable web services, you have to keep access to port 80 open, along with port 443 if you're going to enable web support over Secure Sockets Layer (SSL). When designing your cluster, you also have to keep in mind access to the backup devices and whether to allow SSH, telnet, ftp, and so on. In fact, it's a good idea to disallow telnet entirely across the environment and replace it with secure shell (SSH). Not only does this allow secure logins, but it also allows secure file transfers for internal office use. FTP servers should still use regular ftp, of course, but only on dedicated servers.
Keeping Your Services Off the Public Network
After you decide which services to keep on the public network, it's a wise idea to make a nonroutable network strictly for management purposes. This management network serves two purposes. First, it enables redundancy: the administrator has the ability to gain access to the box if the public network goes down or is totally saturated from public use. Second, it provides the opportunity for a dedicated backup network, because the traffic from nightly backups has the potential to saturate a public network.
Realistically, you don't want to keep your parallel cluster on a public network. A private network lessens the chance of compromised data and machines. Unless you intend for anyone to be able to run jobs on your cluster, the benefits of a nonroutable network greatly outweigh the inconvenience.
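A quick sanity check on a node's addressing can catch a management interface that was accidentally given a public address. The sketch below assumes that "nonroutable" means the private IPv4 ranges reserved by RFC 1918 (10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16); the function names and the interface names eth0 and eth1 are hypothetical.

```python
import ipaddress

# The three private IPv4 ranges set aside by RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_nonroutable(address):
    """Return True if the address falls inside one of the RFC 1918 ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


def public_interfaces(interfaces):
    """Given {interface_name: address}, return the names with routable addresses."""
    return sorted(name for name, addr in interfaces.items() if not is_nonroutable(addr))


# Hypothetical node: eth0 faces the public network, eth1 is for management.
node = {"eth0": "203.0.113.10", "eth1": "192.168.10.5"}
print(public_interfaces(node))
```

On a node that belongs only to the private cluster network, public_interfaces should come back empty.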