Measuring and Improving Application Performance with PerfSuite
The current release of PerfSuite includes the following four tools for accessing and working with performance data:
psrun: a utility for hardware performance event counting and profiling of single-threaded, POSIX threads-based and MPI applications.
psprocess: a utility that assists with a number of common tasks related to pre- and post-processing of performance measurements.
psinv: a utility that provides access to information about the characteristics of a machine (e.g., processor type, cache information, available performance counters).
psconfig: a graphical tool for easy creation and management of PerfSuite configuration files.
This section demonstrates the two commands psrun and psprocess. Visit the PerfSuite Web site for more information about psinv and psconfig, including examples of their use.
The easiest way to learn to use the basic PerfSuite tools is to try them out on your own programs. Here is a sequence of commands you might enter to run the simple cache example discussed earlier with performance measurement enabled. The contents of the directory are listed after each psrun invocation to show that each run creates an XML document:
    1% ls
    badcache  goodcache
    2% psrun badcache
    3% ls
    badcache  badcache.22865.xml  goodcache
    4% psrun goodcache
    5% ls
    badcache  badcache.22865.xml  goodcache  goodcache.22932.xml
    6% psprocess badcache.22865.xml
    7% psprocess goodcache.22932.xml
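The same measure-then-report sequence could be scripted. The following is only a minimal Python sketch, not part of PerfSuite: it assumes psrun and psprocess are on the PATH and are invoked exactly as in the session above, and it assumes output files are named `<program>.<number>.xml`, a pattern inferred from the listing rather than from any documentation. The helper names `xml_outputs` and `measure` are hypothetical.

```python
import glob
import os
import subprocess


def xml_outputs(program, directory="."):
    """Return files in `directory` named <program>.<digits>.xml, sorted.

    The naming pattern is an assumption based on the listing above.
    """
    pattern = os.path.join(directory, program + ".*.xml")
    matches = []
    for path in glob.glob(pattern):
        # Text between "<program>." and ".xml", e.g. "22865".
        middle = os.path.basename(path)[len(program) + 1:-len(".xml")]
        if middle.isdigit():
            matches.append(path)
    return sorted(matches)


def measure(program, directory="."):
    """Run `psrun <program>`, then `psprocess` on each new XML document.

    Returns a dict mapping each new XML path to the report text.
    """
    before = set(xml_outputs(program, directory))
    subprocess.run(["psrun", program], cwd=directory, check=True)
    created = sorted(set(xml_outputs(program, directory)) - before)
    reports = {}
    for path in created:
        result = subprocess.run(
            ["psprocess", os.path.basename(path)],
            cwd=directory, check=True, capture_output=True, text=True,
        )
        reports[path] = result.stdout
    return reports
```

Calling `measure("badcache")` in the directory shown above would correspond to commands 2 and 6 of the session.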
Examples 2 and 3 show the output of the psprocess command for the unoptimized and optimized versions of the test program; these listings have been edited slightly to fit in the available space. As you can see, a substantial amount of information is gathered during the measurement. The report includes not only the raw event counts measured using PAPI but also a series of metrics derived from those counts.
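To make the idea of derived metrics concrete, the following Python sketch recomputes a few of them from raw counter values copied from the Example 2 report. The formulas are assumptions inferred from the metric names (with "graduated" taken to mean "Instructions completed") and checked against the reported figures; they are not taken from the PerfSuite sources.

```python
# Raw counter values copied from the Example 2 report (cache-unfriendly run).
counts = {
    "fp_ins": 208155552,       # Floating point instructions
    "cycles": 21407855039,     # Total cycles
    "issued": 2010041200,      # Instructions issued
    "completed": 624104056,    # Instructions completed
    "l1d_access": 678945043,   # Level 1 data cache accesses
    "l1d_miss": 244760094,     # Level 1 data cache misses
    "stalled": 21057880641,    # Cycles stalled on any resource
}
CLOCK_MHZ = 997.173            # Clock (MHz) from the report header


def derive(c, clock_mhz):
    """Compute a handful of derived metrics from raw event counts."""
    cpu_seconds = c["cycles"] / (clock_mhz * 1e6)
    return {
        "graduated_ins_per_cycle": c["completed"] / c["cycles"],
        "issued_ins_per_cycle": c["issued"] / c["cycles"],
        "graduated_ins_per_issued_ins": c["completed"] / c["issued"],
        "fp_ins_per_graduated_ins": c["fp_ins"] / c["completed"],
        "l1_data_miss_ratio": c["l1d_miss"] / c["l1d_access"],
        "pct_cycles_stalled": 100.0 * c["stalled"] / c["cycles"],
        "cpu_seconds": cpu_seconds,
        "mflips_cycles": c["fp_ins"] / cpu_seconds / 1e6,
        "mips_cycles": c["completed"] / cpu_seconds / 1e6,
    }


metrics = derive(counts, CLOCK_MHZ)
for name, value in metrics.items():
    print(f"{name:32s} {value:10.3f}")
```

With these inputs the results agree with the report to the precision it prints: about 0.029 graduated instructions per cycle, 98.4% of cycles stalled, and roughly 9.7 MFLIPS.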
Example 2. psprocess Output from the Cache-Unfriendly Version of the Loop
    PerfSuite Hardware Performance Summary Report

    Version    : 1.0
    Created    : Thu Feb 19 22:43:01 2004
    Generator  : psprocess 0.2
    XML Source : badcache.22865.xml

    Processor and System Information
    ====================================================
    Node CPUs     : 2
    Vendor        : Intel
    Family        : Pentium Pro (P6)
    CPU Revision  : 6
    Clock (MHz)   : 997.173
    Memory (MB)   : 1510.82
    Pagesize (KB) : 4

    Cache Information
    ====================================================
    Cache levels  : 2
    --------------------------------
    Level 1
    Type          : instruction
    Size (KB)     : 16
    Linesize (B)  : 32
    Assoc         : 4

    Type          : data
    Size (KB)     : 16
    Linesize (B)  : 32
    Assoc         : 4
    --------------------------------
    Level 2
    Type          : unified
    Size (KB)     : 256
    Linesize (B)  : 32
    Assoc         : 8

    Index Description                     Counter Value
    ===================================================
     1 Conditional branch instructions........ 52663367
     2 Branch instructions.................... 52650952
     3 Conditional branch ins mispredicted...... 112009
     4 Conditional branch instructions taken.. 52610596
     5 Branch target address cache misses........ 31020
     6 Requests for excl acc to clean cache line.. 1165
     7 Requests for cache line invalidation.......... 0
     8 Requests for cache line intervention...... 32801
     9 Requests for excl acc to shared cache ln.. 26537
    10 Floating point multiply instructions.......... 0
    11 Floating point divide instructions............ 0
    12 Floating point instructions........... 208155552
    13 Hardware interrupts....................... 22134
    14 Total cycles........................ 21407855039
    15 Instructions issued.................. 2010041200
    16 Instructions completed................ 624104056
    17 Vector/SIMD instructions...................... 0
    18 Level 1 data cache accesses........... 678945043
    19 Level 1 data cache misses............. 244760094
    20 Level 1 instruction cache accesses.. 21332388384
    21 Level 1 instruction cache misses.......... 22546
    22 Level 1 instruction cache reads..... 21309322857
    23 Level 1 load misses................... 244318153
    24 Level 1 store misses....................... 9852
    25 Level 1 cache misses.................. 243826788
    26 Level 2 data cache reads.............. 243745402
    27 Level 2 data cache writes................. 10317
    28 Level 2 instruction cache accesses........ 24335
    29 Level 2 instruction cache reads........... 21362
    30 Level 2 cache misses.................. 212665026
    31 Cycles stalled on any resource...... 21057880641
    32 Instruction TLB misses....................... 64

    Statistics
    ===================================================
    Counting domain............................... user
    Multiplexed.................................... yes
    Graduated floating point ins. per cycle...... 0.010
    Vector ins. per cycle........................ 0.000
    Floating point ins per graduated ins ........ 0.334
    Vector ins per graduated ins ................ 0.000
    Floating point ins per L1 data cache access.. 0.307
    Graduated ins per cycle...................... 0.029
    Issued ins per cycle......................... 0.094
    Graduated ins per issued ins................. 0.310
    Issued ins per L1 ins cache miss......... 89152.896
    Graduated ins per L1 ins cache miss...... 27681.365
    Level 1 ins cache miss ratio................. 0.000
    Level 1 data cache access per graduated ins.. 1.088
    % floating point ins of all graduated ins... 33.353
    % cycles stalled on any resource............ 98.365
    Level 1 ins cache misses per issued ins...... 0.000
    Level 1 cache read miss ratio (instruction).. 0.000
    Level 1 cache miss ratio (data).............. 0.361
    Level 1 cache miss ratio (instruction)....... 0.000
    Bandwidth used to level 1 cache (MB/s)..... 363.437
    Bandwidth used to level 2 cache (MB/s)..... 316.988
    MFLIPS (cycles).............................. 9.696
    MFLIPS (wall clock).......................... 9.530
    MVOPS (cycles)............................... 0.000
    MVOPS (wall clock)........................... 0.000
    MIPS (cycles)............................... 29.071
    MIPS (wall clock)........................... 28.572
    CPU time (seconds).......................... 21.469
    Wall clock time (seconds)................... 21.843
    % CPU utilization........................... 98.285
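Because the report is plain text with dot leaders between each description and its value, it is straightforward to pull the numbers into a program for further analysis. The sketch below is one hypothetical way to do that in Python; `parse_report` is not a PerfSuite function, and the pattern assumes each data line has one entry with at least two leader dots, so header lines such as "Version : 1.0" and entries with a single leader dot are skipped.

```python
import re

# Optional index, a description, two or more leader dots, then the value.
LINE = re.compile(
    r"^\s*(?:\d+\s+)?(?P<name>.+?)\s*\.{2,}\s*(?P<value>\S+)\s*$"
)


def parse_report(text):
    """Map each description in a psprocess text report to its value.

    Values become int or float where possible, otherwise stay strings.
    """
    values = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue  # separators, headings, "key : value" header lines
        raw = match.group("value")
        try:
            value = int(raw)
        except ValueError:
            try:
                value = float(raw)
            except ValueError:
                value = raw  # e.g. "user" or "yes"
        values[match.group("name")] = value
    return values


sample = """\
Version    : 1.0
14 Total cycles........................ 21407855039
16 Instructions completed................ 624104056
Counting domain............................... user
Graduated floating point ins. per cycle...... 0.010
% cycles stalled on any resource............ 98.365
"""
report = parse_report(sample)
print(report["Total cycles"], report["% cycles stalled on any resource"])
```

Keeping the parser line-oriented means it does not depend on the exact width of the dot leaders, which differ between the two listings in this section.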
Example 3. Part of the psprocess output from the optimized version of the loop. The Processor and System Information and Cache Information sections are the same.
    Index Description                     Counter Value
    ===================================================
     1 Conditional branch instructions........ 49627213
     2 Branch instructions.................... 49971420
     3 Conditional branch ins mispredicted....... 97630
     4 Conditional branch ins taken........... 49089592
     5 Branch target address cache misses......... 3816
     6 Requests for excl access to clean cache ln. 820
     7 Requests for cache line invalidation.......... 0
     8 Requests for cache line intervention....... 2796
     9 Requests for excl access to shared cache ln. 494
    10 Floating point multiply instructions.......... 0
    11 Floating point divide instructions............ 0
    12 Floating point instructions........... 189564951
    13 Hardware interrupts........................ 2577
    14 Total cycles......................... 2471179766
    15 Instructions issued................... 513936102
    16 Instructions completed................ 509580537
    17 Vector/SIMD instructions...................... 0
    18 Level 1 data cache accesses........... 372965600
    19 Level 1 data cache misses.............. 23010188
    20 Level 1 instruction cache accesses... 2769671237
    21 Level 1 instruction cache misses........... 2369
    22 Level 1 instruction cache reads...... 2746595553
    23 Level 1 load misses.................... 25980065
    24 Level 1 store misses........................ 995
    25 Level 1 cache misses................... 25772544
    26 Level 2 data cache reads............... 25617201
    27 Level 2 data cache writes................... 935
    28 Level 2 instruction cache accesses......... 2405
    29 Level 2 instruction cache reads............ 2652
    30 Level 2 cache misses................... 25287572
    31 Cycles stalled on any resource....... 2199590592
    32 Instruction TLB misses........................ 0

    Statistics
    ==================================================
    Counting domain.............................. user
    Multiplexed................................... yes
    Graduated floating point ins per cycle...... 0.077
    Vector ins per cycle........................ 0.000
    Floating point ins per graduated ins........ 0.372
    Vector ins per graduated ins................ 0.000
    Floating point ins per L1 data cache access. 0.508
    Graduated ins per cycle..................... 0.206
    Issued ins per cycle........................ 0.208
    Graduated ins per issued ins................ 0.992
    Issued ins per L1 ins cache miss....... 216942.213
    Graduated ins per L1 ins cache miss.... 215103.646
    Level 1 ins cache miss ratio................ 0.000
    Level 1 data cache access per graduated ins. 0.732
    % floating point ins of all graduated ins.. 37.200
    % cycles stalled on any resource........... 89.010
    Level 1 ins cache misses per issued ins..... 0.000
    Level 1 cache read miss ratio (instruction). 0.000
    Level 1 cache miss ratio (data)............. 0.062
    Level 1 cache miss ratio (instruction)...... 0.000
    Bandwidth used to level 1 cache (MB/s).... 332.792
    Bandwidth used to level 2 cache (MB/s).... 326.530
    MFLIPS (cycles)............................ 76.493
    MFLIPS (wall clock)........................ 66.787
    MVOPS (cycles).............................. 0.000
    MVOPS (wall clock).......................... 0.000
    MIPS (cycles)............................. 205.626
    MIPS (wall clock)......................... 179.533
    CPU time (seconds).......................... 2.478
    Wall clock time (seconds)................... 2.838
    % CPU utilization.......................... 87.310
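Placing a few figures from the two reports side by side shows how much the optimized loop gained. The short Python sketch below does only that arithmetic; the values are copied from Examples 2 and 3, and the dictionary keys and the `improvement` helper are names chosen for illustration.

```python
# Selected values copied from the Example 2 (unoptimized) report.
bad = {
    "wall_seconds": 21.843,
    "cpu_seconds": 21.469,
    "total_cycles": 21407855039,
    "l1_data_misses": 244760094,
    "l2_misses": 212665026,
}
# The same entries copied from the Example 3 (optimized) report.
good = {
    "wall_seconds": 2.838,
    "cpu_seconds": 2.478,
    "total_cycles": 2471179766,
    "l1_data_misses": 23010188,
    "l2_misses": 25287572,
}


def improvement(before, after):
    """Return before/after for every key: values above 1 mean a reduction."""
    return {key: before[key] / after[key] for key in before}


ratios = improvement(bad, good)
for key, factor in ratios.items():
    print(f"{key:16s} reduced by a factor of {factor:.2f}")
```

Wall clock time drops by a factor of about 7.7, in step with the reduction in level 1 and level 2 cache misses.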