Measuring and Improving Application Performance with PerfSuite

Customizing Your Performance Analysis

psrun determines which performance events to measure by consulting a configuration file, an XML document that describes the measurements to be taken. If you don't supply a configuration file, a default is used (the output shown in Examples 2 and 3 used the default). Because the configuration file is XML, it is straightforward to read and modify. For example, to obtain the raw events required to calculate the CPI metric discussed earlier, you'd ask psrun to measure the total number of graduated instructions and the total number of cycles. These events are predefined in PAPI as PAPI_TOT_INS and PAPI_TOT_CYC, respectively. Example 4 shows a PerfSuite XML configuration file that measures these events. To use it with psrun, supply the -c option along with the name of your custom configuration file, and run as usual.

Example 4. An Example PerfSuite XML Configuration Document

<?xml version="1.0" encoding="UTF-8" ?>
<ps_hwpc_eventlist class="PAPI">
  <!-- ==================================================
       Configuration file to measure graduated instructions
       and total cycles.
       ================================================== -->
  <ps_hwpc_event type="preset" name="PAPI_TOT_INS" />
  <ps_hwpc_event type="preset" name="PAPI_TOT_CYC" />
</ps_hwpc_eventlist>
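As a minimal sketch of how this fits together, the Python below writes an event list shaped like Example 4 for any set of PAPI preset names, then reduces the two resulting counts to CPI (cycles divided by instructions). The function names build_event_config and cycles_per_instruction are hypothetical helpers, not part of PerfSuite, and the counts are invented for illustration rather than taken from a real run.

```python
import xml.etree.ElementTree as ET


def build_event_config(event_names):
    """Return XML text for an event list laid out like Example 4."""
    root = ET.Element("ps_hwpc_eventlist", {"class": "PAPI"})
    for name in event_names:
        # Every event is assumed to be a PAPI preset, as in Example 4.
        ET.SubElement(root, "ps_hwpc_event", {"type": "preset", "name": name})
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8" ?>\n' + body


def cycles_per_instruction(total_cycles, total_instructions):
    """Return CPI, that is, PAPI_TOT_CYC divided by PAPI_TOT_INS."""
    if total_instructions <= 0:
        raise ValueError("instruction count must be positive")
    return total_cycles / total_instructions


config_text = build_event_config(["PAPI_TOT_INS", "PAPI_TOT_CYC"])

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
counts = {"PAPI_TOT_CYC": 3_000_000, "PAPI_TOT_INS": 2_000_000}
cpi = cycles_per_instruction(counts["PAPI_TOT_CYC"], counts["PAPI_TOT_INS"])
print(f"CPI = {cpi:.2f}")  # prints "CPI = 1.50"
```

Saving config_text to a file and passing that file to psrun with -c would then follow the procedure described above.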

The measurements described so far have been in aggregate counting mode, where the total count of one or more performance events is measured and reported over the total runtime of your application. PerfSuite provides an additional way of looking at your application's performance. Let's say you are interested in finding out where in your application the level 2 cache misses occur so that you can focus your optimization work there. In other words, you'd like a profile similar to gprof's time-based profile, but based on level 2 cache misses instead. This can be done rather easily with psrun by specifying a configuration file tailored for profiling rather than aggregate counting. The PerfSuite distribution includes a number of similar alternative configuration files that you can tailor as needed. Here's an example of how you would ask for a profiling experiment rather than the default total count of events:

8% psrun -c /usr/local/share/perfsuite/xml/pshwpc/profile.xml solver

9% psprocess -e solver psrun.24135.xml

In profiling mode, the psprocess tool also needs the name of your executable (solver) to do its work. It uses the executable to extract symbol information so that program addresses can be mapped to source code lines.
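When the two steps are scripted, the output file name has to be discovered after the run. The sketch below assumes that psrun writes files matching psrun.*.xml in the working directory, a pattern inferred only from the single name psrun.24135.xml shown above. The helpers newest_psrun_output and psprocess_command are hypothetical. They only build the argument list for the psprocess call and do not run anything.

```python
from pathlib import Path


def newest_psrun_output(directory="."):
    """Return the most recently modified psrun.*.xml file, or None."""
    # Assumption: output files follow the psrun.<number>.xml pattern.
    candidates = list(Path(directory).glob("psrun.*.xml"))
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


def psprocess_command(executable, directory="."):
    """Build the argument list for 'psprocess -e <executable> <file>'."""
    output = newest_psrun_output(directory)
    if output is None:
        raise FileNotFoundError("no psrun.*.xml file found in " + str(directory))
    return ["psprocess", "-e", executable, str(output)]
```

The returned list could be handed to subprocess.run once a profiling run has produced its XML file.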

Example 5 shows the output of a profiling run obtained in this way. The report covers not only the application itself (solver) but also any shared libraries it used in which samples were recorded. The combination of overall performance counting and profiling can be a powerful tool for learning about bottlenecks in your software, and it can help you quickly isolate the areas of your application most in need of attention.

Example 5. A source code profile generated by psprocess based on level 2 cache misses (output has been truncated to fit in available space).

PerfSuite Hardware Performance Summary Report


Profile Information
===================================================
Class          : PAPI
Event          : PAPI_L2_TCM (Level 2 cache misses)
Period         : 10000
Samples        : 16132
Domain         : user
Run Time       : 319.72 (seconds)
Min Self %     : (all)

Module Summary
---------------------------------------------------
Samples   Self %  Total %  Module

   16131   99.99%   99.99%  /home/nobody/solver/sol
       1    0.01%  100.00%  /lib/libc-2.2.4.so

File Summary
---------------------------------------------------
Samples   Self %  Total %  File
    5093   31.57%   31.57%  matxvec2d_blk3.f
    5015   31.09%   62.66%  cg3_blk.f
    4162   25.80%   88.46%  pc_jac2d_blk3.f
    1407    8.72%   97.18%  dot_prod2d_blk3.f
     429    2.66%   99.84%  add_exchange2d_blk3.f
      20    0.12%   99.96%  glibc-2.2.4/csu/init.c
       4    0.02%   99.99%  main3.f
       1    0.01%   99.99%  linuxthreads/weaks.c
       1    0.01%  100.00%  cs_jac2d_blk3.f

Function Summary
---------------------------------------------------
Samples   Self %  Total %  Function

    5093   31.57%   31.57%  matxvec2d_blk3
    5015   31.09%   62.66%  cg3_blk
    4162   25.80%   88.46%  pc_jac2d_blk3
    1407    8.72%   97.18%  dot_prod2d_blk3
     429    2.66%   99.84%  add_exchange2d_blk3
      20    0.12%   99.96%  ?
       4    0.02%   99.99%  main3
       1    0.01%   99.99%  __pthread_return_0
       1    0.01%  100.00%  cs_jac2d_blk3

File:Line Summary
---------------------------------------------------
Samples   Self %  Total %  File:Line

    5089   31.55%   31.55%  matxvec2d_blk3.f:19
    4125   25.57%   57.12%  pc_jac2d_blk3.f:20
    2763   17.13%   74.24%  cg3_blk.f:206
    1346    8.34%   82.59%  cg3_blk.f:346
     576    3.57%   86.16%  dot_prod2d_blk3.f:24
     524    3.25%   89.41%  cg3_blk.f:278
     489    3.03%   92.44%  dot_prod2d_blk3.f:23
     332    2.06%   94.50%  dot_prod2d_blk3.f:25
     197    1.22%   95.72%  cg3_blk.f:279
     176    1.09%   96.81%  add_exchange2d_blk3.f:29
      99    0.61%   97.42%  add_exchange2d_blk3.f:50
      71    0.44%   97.86%  add_exchange2d_blk3.f:30
      71    0.44%   98.30%  add_exchange2d_blk3.f:51
      55    0.34%   98.64%  cg3_blk.f:55
      38    0.24%   98.88%  cg3_blk.f:207
      34    0.21%   99.09%  cg3_blk.f:218
      31    0.19%   99.28%  pc_jac2d_blk3.f:27
      24    0.15%   99.43%  cg3_blk.f:139
      20    0.12%   99.55%  init.c:0
       8    0.05%   99.60%  dot_prod2d_blk3.f:22
       5    0.03%   99.63%  add_exchange2d_blk3.f:44
       4    0.02%   99.66%  matxvec2d_blk3.f:17
       4    0.02%   99.68%  cg3_blk.f:140
       3    0.02%   99.70%  cg3_blk.f:347
       3    0.02%   99.72%  cg3_blk.f:268
       3    0.02%   99.74%  cg3_blk.f:280
       3    0.02%   99.76%  pc_jac2d_blk3.f:18
       3    0.02%   99.78%  cg3_blk:/home/nobody/solver/cg3_blk.f:174
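The percentage columns in Example 5 can be reproduced from the sample counts alone. The sketch below takes the File Summary counts from the report and recomputes Self % (samples divided by total samples) and Total % (the running sum divided by total samples). It also estimates the number of level 2 cache misses as samples times period, on the assumption that Period means one sample is recorded per 10000 events. The function summarize is a hypothetical helper and not part of PerfSuite.

```python
def summarize(samples_by_name, period):
    """Return (rows, estimated_events) for a table of sample counts.

    Each row is (name, samples, self_percent, cumulative_percent).
    """
    total = sum(samples_by_name.values())
    ordered = sorted(samples_by_name.items(), key=lambda kv: kv[1], reverse=True)
    rows = []
    running = 0
    for name, samples in ordered:
        running += samples
        self_pct = round(100 * samples / total, 2)
        total_pct = round(100 * running / total, 2)
        rows.append((name, samples, self_pct, total_pct))
    # Assumption: one sample is taken every `period` occurrences of the event.
    return rows, total * period


# Sample counts copied from the File Summary in Example 5.
file_samples = {
    "matxvec2d_blk3.f": 5093,
    "cg3_blk.f": 5015,
    "pc_jac2d_blk3.f": 4162,
    "dot_prod2d_blk3.f": 1407,
    "add_exchange2d_blk3.f": 429,
    "glibc-2.2.4/csu/init.c": 20,
    "main3.f": 4,
    "linuxthreads/weaks.c": 1,
    "cs_jac2d_blk3.f": 1,
}

rows, estimated_misses = summarize(file_samples, period=10000)
print(rows[1])           # prints ('cg3_blk.f', 5015, 31.09, 62.66)
print(estimated_misses)  # prints 161320000
```

The first three files account for about 88% of the samples, which matches the Total % column in the report and suggests where optimization effort is likely to pay off first.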