Measuring and Improving Application Performance with PerfSuite
psrun determines which performance events to measure by consulting a configuration file, an XML document that describes the measurements to be taken. If you don't supply a configuration file, a default is used (the output shown in Examples 2 and 3 used the default). Because it is an XML document, the configuration file is straightforward to read and modify. For example, to obtain the raw events required to calculate the CPI metric discussed earlier, you'd ask psrun to measure the total number of graduated instructions and the total number of cycles. These events are predefined in PAPI as PAPI_TOT_INS and PAPI_TOT_CYC, respectively. Example 4 shows a PerfSuite XML configuration file that measures these events. To use this configuration file with psrun, supply the -c option along with the name of your custom configuration file and run as usual.
Example 4. An Example PerfSuite XML Configuration Document
<?xml version="1.0" encoding="UTF-8" ?>
<ps_hwpc_eventlist class="PAPI">
  <!-- ==================================================
       Configuration file to measure graduated
       instructions and total cycles.
       ================================================== -->
  <ps_hwpc_event type="preset" name="PAPI_TOT_INS" />
  <ps_hwpc_event type="preset" name="PAPI_TOT_CYC" />
</ps_hwpc_eventlist>
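If you find yourself writing several of these files, a few lines of Python can generate them. The following is only a sketch: the element and attribute names are copied from Example 4, every event is assumed to be a PAPI preset, and the helper names build_event_config and cpi are made up for illustration rather than being part of PerfSuite. The cpi helper simply divides total cycles by total instructions, and the counts passed to it are invented to show the arithmetic.

```python
import xml.etree.ElementTree as ET


def build_event_config(event_names, event_class="PAPI"):
    """Return an event list document shaped like Example 4, as a string.

    Assumption: every event is a PAPI preset (type="preset").
    """
    root = ET.Element("ps_hwpc_eventlist", {"class": event_class})
    for name in event_names:
        ET.SubElement(root, "ps_hwpc_event", {"type": "preset", "name": name})
    ET.indent(root)  # pretty-print the tree; needs Python 3.9 or later
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8" ?>\n' + body + "\n"


def cpi(total_cycles, total_instructions):
    """Cycles per instruction from PAPI_TOT_CYC and PAPI_TOT_INS counts."""
    if total_instructions <= 0:
        raise ValueError("instruction count must be positive")
    return total_cycles / total_instructions


if __name__ == "__main__":
    # Print a configuration equivalent to Example 4 (without the comment).
    print(build_event_config(["PAPI_TOT_INS", "PAPI_TOT_CYC"]))
    # Invented counts, used only to demonstrate the calculation.
    print(cpi(total_cycles=3_000_000, total_instructions=2_000_000))
```

You could redirect the printed document to a file and pass that file to psrun with the -c option described above.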
The measurements described so far have been in aggregate counting mode, where the total count of one or more performance events is measured and reported over the total runtime of your application. PerfSuite provides an additional way of looking at your application's performance. Let's say you want to find out where in your application the level 2 cache misses occur so that you can focus your optimization work there. In other words, you'd like a profile similar to gprof's time-based profile, but based on level 2 cache misses instead. You can do this with psrun by specifying a configuration file tailored for profiling rather than aggregate counting. The PerfSuite distribution includes a number of alternative configuration files of this kind that you can tailor as needed. Here's an example of how you would ask for a profiling experiment rather than the default total count of events:
8% psrun -c /usr/local/share/perfsuite/xml/pshwpc/profile.xml solver
9% psprocess -e solver psrun.24135.xml
In profiling mode, the psprocess tool also needs the name of your executable (solver) to do its work. It uses the executable to extract symbol information so that program addresses can be mapped to source code lines.
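To make the idea of mapping addresses to source lines concrete, here is a small Python sketch. It is not how psprocess is implemented; it only illustrates the general technique of looking up a sampled address in a sorted table of start addresses. The table contents, the addresses and the names LINE_TABLE, END_ADDRESS and lookup are all hypothetical, whereas a real tool would read this information from the debugging data in the executable.

```python
import bisect

# Hypothetical line table: (start address, source file, line number),
# sorted by start address. The addresses are invented for illustration.
LINE_TABLE = [
    (0x1000, "main3.f", 12),
    (0x1040, "cg3_blk.f", 206),
    (0x10A0, "cg3_blk.f", 207),
    (0x1100, "matxvec2d_blk3.f", 19),
]
END_ADDRESS = 0x1180  # first address past the last entry (also invented)

_STARTS = [entry[0] for entry in LINE_TABLE]


def lookup(address):
    """Return (file, line) for a sampled address, or None if it is unmapped."""
    if address < _STARTS[0] or address >= END_ADDRESS:
        return None
    # Find the last entry whose start address is <= the sampled address.
    index = bisect.bisect_right(_STARTS, address) - 1
    _, filename, line = LINE_TABLE[index]
    return filename, line


if __name__ == "__main__":
    for sample in (0x1044, 0x1100, 0x2000):
        print(hex(sample), lookup(sample))
```

An address that falls outside the table returns None, which is analogous to the "?" entry that appears in a profile when no symbol information is available for a sample.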
Example 5 shows the output of a profiling run obtained in this way. The report covers not only the application (solver) but also the shared libraries used by the application in which samples were recorded. The combination of overall performance counting and profiling can be a powerful tool for learning about bottlenecks in your software, and it can help you quickly isolate the areas of your application most in need of attention.
Example 5. A source code profile generated by psprocess based on level 2 cache misses (output has been truncated to fit in available space).
PerfSuite Hardware Performance Summary Report

Profile Information
===================================================
Class       : PAPI
Event       : PAPI_L2_TCM (Level 2 cache misses)
Period      : 10000
Samples     : 16132
Domain      : user
Run Time    : 319.72 (seconds)
Min Self %  : (all)

Module Summary
---------------------------------------------------
Samples   Self %   Total %  Module
  16131   99.99%    99.99%  /home/nobody/solver/sol
      1    0.01%   100.00%  /lib/libc-2.2.4.so

File Summary
---------------------------------------------------
Samples   Self %   Total %  File
   5093   31.57%    31.57%  matxvec2d_blk3.f
   5015   31.09%    62.66%  cg3_blk.f
   4162   25.80%    88.46%  pc_jac2d_blk3.f
   1407    8.72%    97.18%  dot_prod2d_blk3.f
    429    2.66%    99.84%  add_exchange2d_blk3.f
     20    0.12%    99.96%  glibc-2.2.4/csu/init.c
      4    0.02%    99.99%  main3.f
      1    0.01%    99.99%  linuxthreads/weaks.c
      1    0.01%   100.00%  cs_jac2d_blk3.f

Function Summary
---------------------------------------------------
Samples   Self %   Total %  Function
   5093   31.57%    31.57%  matxvec2d_blk3
   5015   31.09%    62.66%  cg3_blk
   4162   25.80%    88.46%  pc_jac2d_blk3
   1407    8.72%    97.18%  dot_prod2d_blk3
    429    2.66%    99.84%  add_exchange2d_blk3
     20    0.12%    99.96%  ?
      4    0.02%    99.99%  main3
      1    0.01%    99.99%  __pthread_return_0
      1    0.01%   100.00%  cs_jac2d_blk3

File:Line Summary
---------------------------------------------------
Samples   Self %   Total %  File:Line
   5089   31.55%    31.55%  matxvec2d_blk3.f:19
   4125   25.57%    57.12%  pc_jac2d_blk3.f:20
   2763   17.13%    74.24%  cg3_blk.f:206
   1346    8.34%    82.59%  cg3_blk.f:346
    576    3.57%    86.16%  dot_prod2d_blk3.f:24
    524    3.25%    89.41%  cg3_blk.f:278
    489    3.03%    92.44%  dot_prod2d_blk3.f:23
    332    2.06%    94.50%  dot_prod2d_blk3.f:25
    197    1.22%    95.72%  cg3_blk.f:279
    176    1.09%    96.81%  add_exchange2d_blk3.f:29
     99    0.61%    97.42%  add_exchange2d_blk3.f:50
     71    0.44%    97.86%  add_exchange2d_blk3.f:30
     71    0.44%    98.30%  add_exchange2d_blk3.f:51
     55    0.34%    98.64%  cg3_blk.f:55
     38    0.24%    98.88%  cg3_blk.f:207
     34    0.21%    99.09%  cg3_blk.f:218
     31    0.19%    99.28%  pc_jac2d_blk3.f:27
     24    0.15%    99.43%  cg3_blk.f:139
     20    0.12%    99.55%  init.c:0
      8    0.05%    99.60%  dot_prod2d_blk3.f:22
      5    0.03%    99.63%  add_exchange2d_blk3.f:44
      4    0.02%    99.66%  matxvec2d_blk3.f:17
      4    0.02%    99.68%  cg3_blk.f:140
      3    0.02%    99.70%  cg3_blk.f:347
      3    0.02%    99.72%  cg3_blk.f:268
      3    0.02%    99.74%  cg3_blk.f:280
      3    0.02%    99.76%  pc_jac2d_blk3.f:18
      3    0.02%    99.78%  cg3_blk:/home/nobody/solver/cg3_blk.f:174
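If you'd like to check how the Self % and Total % columns in Example 5 relate to the raw sample counts, the following Python sketch recomputes them for the File Summary. It assumes that Self % is a row's share of all samples, that Total % is the running sum of Self %, and that Period means one sample is recorded for every 10000 occurrences of the event. Those are interpretations of the report, not statements taken from PerfSuite's documentation, and the function names summarize and estimated_events are made up for this example.

```python
# Sample counts copied from the File Summary in Example 5.
FILE_SAMPLES = {
    "matxvec2d_blk3.f": 5093,
    "cg3_blk.f": 5015,
    "pc_jac2d_blk3.f": 4162,
    "dot_prod2d_blk3.f": 1407,
    "add_exchange2d_blk3.f": 429,
    "glibc-2.2.4/csu/init.c": 20,
    "main3.f": 4,
    "linuxthreads/weaks.c": 1,
    "cs_jac2d_blk3.f": 1,
}


def summarize(sample_counts):
    """Return rows of (samples, self %, cumulative %, label), largest first."""
    total = sum(sample_counts.values())
    ordered = sorted(sample_counts.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0
    for label, count in ordered:
        running += count
        rows.append((count, 100.0 * count / total, 100.0 * running / total, label))
    return rows


def estimated_events(sample_count, period):
    """Approximate event total, assuming one sample per `period` events."""
    return sample_count * period


if __name__ == "__main__":
    for count, self_pct, total_pct, label in summarize(FILE_SAMPLES):
        print(f"{count:7d}  {self_pct:6.2f}%  {total_pct:7.2f}%  {label}")
    print("Estimated L2 misses:", estimated_events(16132, 10000))
```

Under these assumptions, the percentages for the largest rows round to the values printed in the report, which suggests the interpretation is at least consistent with the output.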