Analysis
The major processing is done in the worker nodes by the analyzeTokens() function, an excerpt of which is shown in Listing 5. Each node, including the manager node, is implemented as a PVM or MPI process. Parallel processing follows the Single Instruction Multiple Data (SIMD) model: each worker node executes the same code (Single Instruction) but on different data (Multiple Data). For the PVM version of the utility, the worker nodes were created and assigned work by calling the pvm_spawn function inside a C++ program:
TaskNum = pvm_spawn("analysis",NULL,PvmTaskDefault,NULL,NumTasks,Tid);
For the MPI version of the utility, the worker nodes were created and assigned work by calling mpirun at the command line:
mpirun -np 16 $HOME/mpi/analysis
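The "same code, different data" idea can be sketched without any PVM or MPI calls. The fragment below is only an illustration under stated assumptions: filesForWorker() is a hypothetical helper, and the round-robin split it performs stands in for the manager's distributeWork() function, whose actual partitioning scheme is not shown here. Every worker would run the identical function; only the rank it passes in differs.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of the utility): picks the files that the
// worker with the given rank should analyze. Assumes rank >= 0 and
// numWorkers > 0. Every worker calls the same function with its own rank,
// so the code is shared while the data differs.
std::vector<std::string> filesForWorker(int rank, int numWorkers,
                                        const std::vector<std::string>& files)
{
    std::vector<std::string> mine;
    const std::size_t step = static_cast<std::size_t>(numWorkers);
    // Round-robin assignment: worker r gets files r, r + step, r + 2*step, ...
    for (std::size_t i = static_cast<std::size_t>(rank); i < files.size(); i += step) {
        mine.push_back(files[i]);
    }
    return mine;
}
```

With five files and two workers, the worker with rank 0 would receive the first, third, and fifth files, and the worker with rank 1 the second and fourth.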
Listing 5 Excerpt from the analyzeTokens() function.
#include <set>
#include <algorithm>
#include <iterator>   // inserter
#include <string>
using namespace std;

multiset<string> Tokens;         // tokens of the document being analyzed
set<string> SetA;                // positive tokens to look for
set<string> SetB;                // negative tokens to look for
set<string> SetC;                // mandatory tokens
multiset<string> SetAAnalysis;   // SetA tokens found in Tokens
multiset<string> SetBAnalysis;   // SetB tokens found in Tokens
multiset<string> SetCAnalysis;   // SetC tokens absent from Tokens

// ...

void analyzeTokens(string FileName)
{
    // ...

    set_intersection(SetA.begin(), SetA.end(), Tokens.begin(), Tokens.end(),
                     inserter(SetAAnalysis, SetAAnalysis.begin()));
    set_intersection(SetB.begin(), SetB.end(), Tokens.begin(), Tokens.end(),
                     inserter(SetBAnalysis, SetBAnalysis.begin()));
    set_difference(SetC.begin(), SetC.end(), Tokens.begin(), Tokens.end(),
                   inserter(SetCAnalysis, SetCAnalysis.begin()));
    // Result is not declared in this excerpt.
    copy(SetBAnalysis.begin(), SetBAnalysis.end(),
         inserter(Result.MonitoredTokens, Result.MonitoredTokens.begin()));
    copy(SetCAnalysis.begin(), SetCAnalysis.end(),
         inserter(Result.MissingTokens, Result.MissingTokens.begin()));

    // ...
}
The text document received from the distributeWork() function is placed into the multiset<string> Tokens. The analysis is done with the set_intersection and set_difference algorithms. SetA contains positive tokens that we're looking for in a text file. SetB contains negative tokens that we're looking for in a text file. SetC contains mandatory tokens that should be in each text file we analyze. The analyzeTokens() function performs a set intersection between SetA and Tokens and between SetB and Tokens, which yields the positive and negative tokens that actually occur in the document. It then performs a set difference between SetC and Tokens, which yields the mandatory tokens that are missing from the document.
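The three set operations described above can be tried in isolation. The following is a minimal sketch rather than the utility's own code: the TokenReport struct and the analyze() function are hypothetical names introduced for illustration, and the sets are passed as parameters instead of being globals. Only set_intersection, set_difference, and inserter come from the standard library. Both algorithms require sorted input ranges, which set and multiset guarantee.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

// Hypothetical result type; the Result object of the original listing
// is not shown in the excerpt.
struct TokenReport {
    std::multiset<std::string> positive;   // SetA tokens found in the document
    std::multiset<std::string> monitored;  // SetB tokens found in the document
    std::multiset<std::string> missing;    // SetC tokens absent from the document
};

// Hypothetical stand-alone version of the set processing.
TokenReport analyze(const std::multiset<std::string>& tokens,
                    const std::set<std::string>& setA,
                    const std::set<std::string>& setB,
                    const std::set<std::string>& setC)
{
    TokenReport report;
    // Positive tokens that occur in the document.
    std::set_intersection(setA.begin(), setA.end(), tokens.begin(), tokens.end(),
                          std::inserter(report.positive, report.positive.begin()));
    // Negative tokens that occur in the document.
    std::set_intersection(setB.begin(), setB.end(), tokens.begin(), tokens.end(),
                          std::inserter(report.monitored, report.monitored.begin()));
    // Mandatory tokens that do not occur in the document.
    std::set_difference(setC.begin(), setC.end(), tokens.begin(), tokens.end(),
                        std::inserter(report.missing, report.missing.begin()));
    return report;
}
```

Because the first range in each call is a set, a token is reported at most once even if it appears many times in the document.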
The results of the analysis are stored in an analysis object and sent back to the manager node. This type of set processing is CPU-intensive. The size and number of the text files on which we have to run it make sequential processing on a single processor unacceptable. Multithreading the initial file location and then using multiple nodes to perform the set processing concurrently, on the other hand, gave us a useful analysis utility. The initial problem with the utility was the cumbersome nature of passing complex objects between MPI or PVM clients. This was solved with interface classes.
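The interface classes themselves are not shown here, so the following is only a sketch of the general idea under stated assumptions. TokenPacket is a hypothetical class that flattens a multiset<string> into a single newline-delimited string and rebuilds it on the receiving side. It assumes that no token contains a newline character. A flat character buffer of this kind is the sort of data that a message-passing call can transmit directly, whereas a container object cannot be sent as-is.

```cpp
#include <set>
#include <sstream>
#include <string>

// Hypothetical interface class (not the one used by the utility): hides the
// conversion between a token container and a flat character buffer.
// Assumption: tokens never contain '\n'.
class TokenPacket {
public:
    explicit TokenPacket(const std::multiset<std::string>& tokens)
        : tokens_(tokens) {}

    // Flatten the container into one newline-delimited string.
    std::string pack() const {
        std::string buffer;
        for (const std::string& token : tokens_) {
            buffer += token;
            buffer += '\n';
        }
        return buffer;
    }

    // Rebuild a packet from a buffer produced by pack().
    static TokenPacket unpack(const std::string& buffer) {
        std::multiset<std::string> tokens;
        std::istringstream in(buffer);
        std::string line;
        while (std::getline(in, line)) {
            tokens.insert(line);
        }
        return TokenPacket(tokens);
    }

    const std::multiset<std::string>& tokens() const { return tokens_; }

private:
    std::multiset<std::string> tokens_;
};
```

A sender would call pack() and transmit the resulting characters, and the receiver would pass the received characters to TokenPacket::unpack() to recover an equal multiset.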