- <tt>stopwatch</tt>
- <tt>date</tt>
- <tt>time</tt>
- <tt>clock</tt>
- <tt>gettimeofday</tt>
- Performance Tuning Using GNU <tt>gprof</tt>
- gcc Option Needed for <tt>gprof</tt>
- <tt>kprof</tt>
- Summary
- Web Resources for Profiling
gcc Option Needed for gprof
Before programs can be profiled using gprof, they must be compiled with the -pg gcc option. To get complete information about gprof, you can use the command info gprof or man gprof.
Listing 1.5 shows the benefits that profiling can have on a small program. The sample1 program prints the prime numbers up to 50,000. You can use the output from gprof to increase this program’s performance by changing the program to sample2, shown later in Listing 1.8.
Listing 1.5 sample1.c
1 #include <stdlib.h> 2 #include <stdio.h> 3 4 int prime (int num); 5 6 int main() 7 { 8 int i; 9 int colcnt = 0; 10 for (i=2; i <= 50000; i++) 11 if (prime(i)) { 12 colcnt++; 13 if (colcnt%9 == 0) { 14 printf("%5d\n",i); 15 colcnt = 0; 16 } 17 else 18 printf("%5d ", i); 19 } 20 putchar(’\n’); 21 return 0; 22 } 23 24 int prime (int num) { 25 /* check to see if the number is a prime? */ 26 int i; 27 for (i=2; i < num; i++) 28 if (num %i == 0) 29 return 0; 30 return 1; 31 }
Building the sample1 Program and Using gprof
The sample1.c program needs to be compiled with the option -pg to have profile data generated, as shown in Figure 1.11.
Figure 1.11 Building and running sample1.
When the sample1 program is run, the gmon.out file is created.
To view the profiling data, the gprof utility must be on your system. If your system is rpm-based, the rpm command shows the version of gprof, as shown in Figure 1.12.
Figure 1.12 The version of gprof.
gprof is in the binutils package. For you to use the utility, the package must be installed on your system. One useful gprof option is -b. The -b option eliminates the text output that explains the data output provided by gprof:
# gprof -b ./sample1
The output shown in Listing 1.6 from gprof gives some high-level information like the total running time, which is 103.74 seconds. The main routine running time is 0.07 seconds, and the prime routine running time is 103.67 seconds. The prime routine is called 49,999 times.
Listing 1.6 Output from gprof for sample1
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 99.93 103.67 103.67 49999 2.07 2.07 prime 0.07 103.74 0.07 main Call graph granularity: each sample hit covers 4 byte(s) for 0.01% of 103.74 seconds index % time self children called name <spontaneous> [1] 100.0 0.07 103.67 main [1] 103.67 0.00 49999/49999 prime [2] --------------------------------------------------------------------------- 103.67 0.00 49999/49999 main [1] [2] 99.9 103.67 0.00 49999 prime [2] --------------------------------------------------------------------------- Index by function name [1] main [2] prime
Next we can use the gcov program to look at the actual number of times each line of the program was executed. (See Chapter 2, "Code Coverage," for more about gcov.)
We will build the sample1 program with two additional options—-fprofile-arcs and -ftest-coverage, as shown in Figure 1.13. These options let you look at the program using gcov, as shown in Figure 1.14.
Figure 1.13 Building sample1 with gcov options.
Figure 1.14 Running sample1 and creating gcov output.
Running gcov on the source code produces the file sample1.c.gcov. It shows the actual number of times each line of the program was executed. Listing 1.7 is the output of gcov on sample1.
Listing 1.7 Output from gcov for sample1
-: 0:Source:sample1.c -: 0:Graph:sample1.bbg -: 0:Data:sample1.da -: 1:#include <stdlib.h> -: 2:#include <stdio.h> -: 3: -: 4:int prime (int num); -: 5: -: 6:int main() 1: 7: { 1: 8: int i; 1: 9: int colcnt = 0; 50000: 10: for (i=2; i <= 50000; i++) 49999: 11: if (prime(i)) { 5133: 12: colcnt++; 5133: 13: if (colcnt%9 == 0) { 570: 14: printf("%5d\n",i); 570: 15: colcnt = 0; -: 16: } -: 17: else 4563: 18: printf("%5d ", i); -: 19: } 1: 20: putchar(’\n’); 1: 21: return 0; -: 22: } -: 23: 49999: 24:int prime (int num) { -: 25: /* check to see if the number is a prime? */ 49999: 26: int i; 121337004: 27: for (i=2; i < num; i++) 121331871: 28: if (num %i == 0) 44866: 29: return 0; 5133: 30: return 1; -: 31: } -: 32:
There are 5,133 prime numbers. The expensive operations in the routine prime are the for loop (line 27) and the if statement (line 28). The "hot spots" are the loop and the if test inside the prime routine. This is where we will work to increase the program’s performance. One change that will help this program is to use the sqrt() function, which returns the nonnegative square root function of the number passed in. sample2, shown inListing 1.8, has been changed to use the sqrt function in the newly created function called faster.
Listing 1.8 sample2.c
1 #include <stdlib.h> 2 #include <stdio.h> 3 #include <math.h> 4 5 int prime (int num); 6 int faster (int num); 7 8 int main() 9 { 10 int i; 11 int colcnt = 0; 12 for (i=2; i <= 50000; i++) 13 if (prime(i)) { 14 colcnt++; 15 if (colcnt%9 == 0) { 16 printf("%5d\n",i); 17 colcnt = 0; 18 } 19 else 20 printf("%5d ", i); 21 } 22 putchar(’\n’); 23 return 0; 24 } 25 26 int prime (int num) { 27 /* check to see if the number is a prime? */ 28 int i; 29 for (i=2; i <= faster(num); i++) 30 if (num %i == 0) 31 return 0; 32 return 1; 33 } 34 35 int faster (int num) 36 { 37 return (int) sqrt( (float) num); 38 }
Now you can build the sample2 program (see Figure 1.15) and use gprof to check how long the program will take to run (see Figure 1.16). Also, the gcov output shows the reduced number of times each line needs to be executed. In Listing 1.9, the total running time has been reduced from 103.74 seconds to 2.80 seconds.
Listing 1.9 shows the output of gprof for the sample2 program.
Listing 1.9 Output from gprof for sample2
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls us/call us/call name 52.68 1.48 1.48 1061109 1.39 1.39 faster 46.61 2.78 1.30 49999 26.10 55.60 prime 0.71 2.80 0.02 main Call graph granularity: each sample hit covers 4 byte(s) for 0.36% of 2.80 seconds index % time self children called name <spontaneous> [1] 100.0 0.02 2.78 main [1] 1.30 1.48 49999/49999 prime [2] ------------------------------------------------------------- 1.30 1.48 49999/49999 main [1] [2] 99.3 1.30 1.48 49999 prime [2] 1.48 0.00 1061109/1061109 faster [3] ------------------------------------------------------------- 1.48 0.00 1061109/1061109 prime [2] [3] 52.7 1.48 0.00 1061109 faster [3] ------------------------------------------------------------- Index by function name [3] faster [1] main [2] prime
Figure 1.15 Building and running sample2.
Figure 1.16 Using gprof on sample2.
Now we’ll run gcov on the sample2 program, as shown in Figures 1.17 and 1.18.
Figure 1.17 Building sample2 with gcov and running sample2.
Figure 1.18 Running sample2 and getting gcov output.
Listing 1.10 shows gcov output for the sample2 program.
Listing 1.10 Output of sample2.c.gcov
-: 0:Source:sample2.c -: 0:Graph:sample2.bbg -: 0:Data:sample2.da -: 1:#include <stdlib.h> -: 2:#include <stdio.h> -: 3:#include <math.h> -: 4: -: 5:int prime (int num); -: 6:int faster (int num); -: 7: -: 8:int main() 1: 9:{ 1: 10: int i; 1: 11: int colcnt = 0; 50000: 12: for (i=2; i <= 50000; i++) 49999: 13: if (prime(i)) { 5133: 14: colcnt++; 5133: 15: if (colcnt%9 == 0) { 570: 16:printf("%5d\n",i); 570: 17:colcnt = 0; -: 18: } -: 19: else 4563: 20: printf("%5d ", i); -: 21: } 1: 22: putchar(’\n’); 1: 23: return 0; -: 24: } -: 25: 49999: 26:int prime (int num) { -: 27: /* check to see if the number is a prime? */ 49999: 28: int i; 1061109: 29: for (i=2; i <= faster(num); i++) 1055976: 30: if (num %i == 0) 44866: 31: return 0; 5133: 32: return 1; -: 33: } -: 34: -: 35:int faster (int num) 1061109: 36: { 1061109: 37: return (int) sqrt( (float) num); -: 38: } -: 39:
The for loop in the prime routine has been reduced from 121 million executions to 1 million executions. Therefore, the total time has been reduced from 103.74 seconds to 2.80 seconds.
The tools gprof and gcov helped find the "hot spots" in this sample program. After the "hot spots" were found, the program was changed to increase its overall performance. It is interesting how changing a few lines of code can have a great impact on a program’s performance.
Listing 1.11, sample3.cpp, has three different functions (1, 2, and 3). It shows a more complex use of profiling, with both flat and graphic profiles. We’ll also use kprof, which can use gprof output. It presents the information in list or tree views, which make the information easier to understand when programs are more complicated. Let’s start by building the sample3.cpp program and displaying the flat and graphic profiles and then displaying the data using kprof.
Listing 1.11 sample3.cpp
1 #include <iostream> 2 3 void function1(){ 4 for(int i=0;i<1000000;i++); 5 } 6 7 void function2(){ 8 function1(); 9 for (int i=0;i<2000000;i++); 10 } 11 12 void function3(){ 13 function1(); 14 function2(); 15 for (int i=0;i<3000000;i++); 16 function1(); 17 } 18 19 int main(){ 20 for(int i=0;i<10;i++) 21 function1(); 22 23 for (int i=0;i<5000000;i++); 24 25 for(int i=0;i<10;i++) 26 function2(); 27 for(int i=0; i<13;i++); 28 { 29 function3(); 30 function2(); 31 function1(); 32 } 33 }
Figure 1.19 shows the commands used to build and run the sample3 program. gprof is also run on sample3 to get the profile data from sample3.
Figure 1.19 Building and capturing gprof output for sample3.
We won’t use the -b option on the gprof output on the sample3 program so that we can see all the descriptive information that gprof can display.
The sample3.gprof should look similar to this:
Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls ms/call ms/call name 43.36 4.21 4.21 12 0.35 0.52 function2() 42.84 8.37 4.16 25 0.17 0.17 function1() 8.65 9.21 0.84 main 5.15 9.71 0.50 1 0.50 1.35 function3() 0.00 9.71 0.00 1 0.00 0.00 global constructors keyed to function1() 0.00 9.71 0.00 1 0.00 0.00 __static_initialization_and_destruction_0(int, int)
Field |
Description |
---|---|
% time |
The percentage of the program’s total running time used by this function. |
cumulative seconds |
A running sum of the number of seconds accounted for by this function and those listed above it. |
self seconds |
The number of seconds accounted for by this function alone. This is the major sort for this listing. |
calls |
The number of times this function was invoked if this function is profiled; otherwise, it is blank. |
self ms/call |
The average number of milliseconds spent in this function per call if this function is profiled; otherwise, it is blank. |
total ms/call |
The average number of milliseconds spent in this function and its descendents per call if this function is profiled; otherwise, it is blank. |
name |
The function’s name. This is the minor sort for this listing. The index shows the location of the function in the gprof listing. If the index is in parentheses, it shows where it would appear in the gprof listing if it were to be printed. |
Call graph (explanation follows) granularity: each sample hit covers 4 byte(s) for 0.10% of 9.71 seconds index % time self children called name <spontaneous> [1] 100.0 0.84 8.87 main [1] 3.86 1.83 11/12 function2() [2] 1.83 0.00 11/25 function1() [3] 0.50 0.85 1/1 function3() [4] ----------------------------------- 0.35 0.17 1/12 function3() [4] 3.86 1.83 11/12 main [1] [2] 63.9 4.21 2.00 12 function2() [2] 2.00 0.00 12/25 function1() [3] ----------------------------------- 0.33 0.00 2/25 function3() [4] 1.83 0.00 11/25 main [1] 2.00 0.00 12/25 function2() [2] [3] 42.8 4.16 0.00 25 function1() [3] ----------------------------------- 0.50 0.85 1/1 main [1] [4] 13.9 0.50 0.85 1 function3() [4] 0.35 0.17 1/12 function2() [2] 0.33 0.00 2/25 function1() [3] ----------------------------------- 0.00 0.00 1/1 __do_global_ctors_aux [13] [11] 0.0 0.00 0.00 1 global constructors keyed to function1() [11] 0.00 0.00 1/1 __static_initialization_and_destruction_0(int, int) [12] ----------------------------------- 0.00 0.00 1/1 global constructors keyed to function1() [11] [12] 0.0 0.00 0.00 1 __static_initialization_and_destruction_0(int, int) [12] -----------------------------------
This table describes the program’s call tree. It is sorted by the total amount of time spent in each function and its children.
Each entry in this table consists of several lines. The line with the index number at the left margin lists the current function. The lines above it list the functions that called this function, and the lines below it list the functions this one called.
You see the following:
Field |
Description |
---|---|
index |
A unique number given to each element of the table. Index numbers are sorted numerically. The index number is printed next to every function name so that it is easier to look up the function in the table. |
% time |
The percentage of the total time that was spent in this function and its children. Note that due to different viewpoints, functions excluded by options, and so on, these numbers do not add up to 100%.<\p> |
self<\p> |
The total amount of time spent in this function.<\p> |
children<\p> |
The total amount of time propagated into this function by its children.<\p> |
called<\p> |
The number of times the function was called. If the function called itself recursively, the number includes only nonrecursive calls and is followed by a + and the number of recursive calls.<\p> |
name |
The name of the current function. The index number is printed after it. If the function is a member of a cycle, the cycle number is printed between the function’s name and the index number. |
For the function’s parents, the fields have the following meanings:
Field |
Description |
---|---|
self |
The amount of time that was propagated directly from the function into this parent. |
children<\p> |
The amount of time that was propagated from the function’s children into this parent.<\p> |
called<\p> |
The number of times this parent called the function and the total number of times the function was called. Recursive calls to the function are not included in the number after the /.<\p> |
name |
The parent’s name. The parent’s index number is printed after it. If the parent is a member of a cycle, the cycle number is printed between the name and the index number. |
If the function’s parents cannot be determined, the word <spontaneous> is printed in the name field, and all the other fields are blank.
For the function’s children, the fields have the following meanings:
Field |
Description |
---|---|
self |
The amount of time that was propagated directly from the child into the function. |
children<\p> |
The amount of time that was propagated from the child’s children to the function.<\p> |
called<\p> |
The number of times the function called this child and the total number of times the child was called. Recursive calls by the child are not listed in the number after the /.<\p> |
name |
The child’s name. The child’s index number is printed after it. If the child is a member of a cycle, the cycle number is printed between the name and the index number. |
If the call graph has any cycles (circles), there is an entry for the cycle as a whole. This entry shows who called the cycle (as parents) and the members of the cycle (as children). The + recursive calls entry shows how many function calls were internal to the cycle. The calls entry for each member shows, for that member, how many times it was called from other members of the cycle.
Index by function name [11] global constructors keyed to function1() [3] function1() [4] function3() [12] __static_initialization_and_destruction_0(int, int) [2] function2() [1] main