Simplifying Cluster (PVM and MPI) Application Programming Using Interface Classes, Part 3: The Pthreads Connection
In part 3 of this series, we take a closer look at a cluster-based text-file analysis utility that has been implemented using both the Message Passing Interface (MPI) system and the Parallel Virtual Machine (PVM) library. The utility is a simple cluster application that sees the cluster as a single computer with multiple processors. The multiple processors allow multiple text files to be processed in parallel, and separate steps of the text analysis to be performed in parallel. This is useful when the number of text files to be analyzed is large or when the text file to be processed is large.
Text file analysis touches a wide range of activities, including everything from parsing and tokenization to machine learning and data mining. The opportunities and demands of text file analysis are increasing. We routinely measure file storage in terms of gigabytes, but it’s not uncommon to have email and web server configurations with terabytes of storage, and very large data stores are currently measured in petabytes (1,000 terabytes, or 1 quadrillion bytes). Data-mining and information-extraction applications often process hundreds of thousands and in some cases millions of text files. Cluster-based text-file analysis is a good choice under these kinds of circumstances.
The simple text-file analyzer that we present is designed to provide interactive real-time analysis of text files. That is, the user presents files to be considered and the results are expected immediately. Not only does the cluster have a lot of work to do; it has to do it in real time and in an interactive fashion. This is in contrast to cluster applications that operate as batch jobs or as background processing. Our text file analysis checks text files for the presence of certain tokens, the absence of certain tokens, and the frequency of certain tokens. It compares text files for similarities based on the percentage of tokens that the files have in common. It compares all tokens found against a lexicon and thesaurus and returns the result of the analysis to the user. For purposes of this article, a token is a piece of text in a file that is demarcated by whitespace. The text file analyzer that we present in this article is a scaled-down version of one that we use at Ctest Laboratories. We provide just enough detail to demonstrate how interface classes can be used to simplify MPI or PVM processing for cluster applications.
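The token checks described above can be illustrated with a minimal sketch. It assumes a token is any run of non-whitespace characters, as defined in this article, and it assumes one possible similarity measure: the number of distinct tokens shared by two files, divided by the number of distinct tokens found in either file. The function names tokenize, token_frequency, and similarity_percent are hypothetical and are not taken from the analyzer used at Ctest Laboratories.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split text into tokens. A token is any run of characters
// demarcated by whitespace.
std::vector<std::string> tokenize(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> tokens;
    std::string tok;
    while (in >> tok) {
        tokens.push_back(tok);
    }
    return tokens;
}

// Count how often each token occurs. A token that is absent from
// the text simply has no entry in the returned map.
std::map<std::string, std::size_t>
token_frequency(const std::vector<std::string>& tokens) {
    std::map<std::string, std::size_t> freq;
    for (const auto& t : tokens) {
        ++freq[t];
    }
    return freq;
}

// ASSUMPTION: similarity is the count of distinct tokens common to
// both files divided by the count of distinct tokens in either file,
// expressed as a percentage. The article does not define the measure.
double similarity_percent(const std::vector<std::string>& a,
                          const std::vector<std::string>& b) {
    std::set<std::string> sa(a.begin(), a.end());
    std::set<std::string> sb(b.begin(), b.end());
    if (sa.empty() && sb.empty()) {
        return 100.0;  // two empty files are treated as identical
    }
    std::size_t shared = 0;
    for (const auto& t : sa) {
        if (sb.count(t) != 0) {
            ++shared;
        }
    }
    const std::size_t total = sa.size() + sb.size() - shared;
    return 100.0 * static_cast<double>(shared) / static_cast<double>(total);
}
```

Presence and absence checks fall out of the frequency map: a token is present if it has an entry in the map and absent otherwise.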
How It Works
The utility is given some general search criteria, which are used to distinguish the files that need to be analyzed from the files that don’t. In this case, we select files that include txt in the filename. This gives us any files that have .txt suffixes, txt. prefixes, .txt followed by other suffixes, or any other filename that includes txt. We assume that txt in the name implies that the file contains simple text. Although this assumption will not hold in every case, for purposes of this example the txt search criterion is sufficient: most of the files found will be text files. This filename match serves as a preliminary analysis that determines which files have to be analyzed.
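The filename test can be sketched in a few lines. This is only an illustration of the rule "the name contains txt"; the function names looks_like_text_file and select_candidates are hypothetical, and the sketch works on a list of names rather than walking a real directory.

```cpp
#include <string>
#include <vector>

// Return true if the filename contains "txt" anywhere. This matches
// .txt suffixes, txt. prefixes, and .txt followed by other suffixes.
bool looks_like_text_file(const std::string& filename) {
    return filename.find("txt") != std::string::npos;
}

// Keep only the filenames that pass the txt test, preserving order.
std::vector<std::string>
select_candidates(const std::vector<std::string>& filenames) {
    std::vector<std::string> selected;
    for (const auto& name : filenames) {
        if (looks_like_text_file(name)) {
            selected.push_back(name);
        }
    }
    return selected;
}
```

Because the test is a plain substring match, it is case-sensitive and will also accept names in which txt appears by coincidence, which is the imprecision acknowledged above.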
Once the files are located, they’re parsed and the tokens are placed into containers. At this point, the application changes from a single process executing on a single machine to a cluster application consisting of a manager node and N worker nodes. The manager node is responsible for distributing the work. As each container is created and filled with tokens, the manager node sends that container to some worker node, using either MPI or PVM. In parts 1 and 2 of this series, we discussed the pvm_stream and mpi_stream that are used to provide interface classes for the PVM and MPI routines. The worker node performs the text analysis, which is CPU-intensive. Once the analysis is complete, each worker node sends its results back to the manager node, which reports the results to the user. Figure 1 shows a simple flow of control for the text analysis utility.
Figure 1 Flowchart of the text analysis utility.
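The manager's distribution step can be sketched without any message-passing calls. The article says only that each container is sent to "some worker node", so the round-robin policy below is an assumption, as is the numbering of workers from 1 to N with 0 reserved for the manager. The names Assignment and assign_round_robin are hypothetical. In the real utility, each planned assignment would correspond to a send through the pvm_stream or mpi_stream interface classes, which this sketch does not attempt to reproduce.

```cpp
#include <cstddef>
#include <vector>

// One planned transfer: which container goes to which worker.
struct Assignment {
    int worker;             // ASSUMPTION: workers are numbered 1..N, manager is 0
    std::size_t container;  // index of the token container to send
};

// ASSUMPTION: round-robin distribution. Container i goes to worker
// 1 + (i mod N). Returns an empty plan if there are no workers.
std::vector<Assignment> assign_round_robin(std::size_t container_count,
                                           int worker_count) {
    std::vector<Assignment> plan;
    if (worker_count <= 0) {
        return plan;
    }
    plan.reserve(container_count);
    const std::size_t n = static_cast<std::size_t>(worker_count);
    for (std::size_t i = 0; i < container_count; ++i) {
        const int worker = 1 + static_cast<int>(i % n);
        plan.push_back(Assignment{worker, i});
    }
    return plan;
}
```

Round-robin keeps the sketch short. A manager that handed each new container to whichever worker finished first would balance uneven file sizes better, at the cost of tracking worker state.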