Measuring and Improving Application Performance with PerfSuite

PerfSuite Basics

The current release of PerfSuite includes the following four tools for accessing and working with performance data:

psrun: a utility for hardware performance event counting and profiling of single-threaded, POSIX threads-based and MPI applications.
psprocess: a utility that assists with a number of common tasks related to pre- and post-processing of performance measurements.
psinv: a utility that provides access to information about the characteristics of a machine (e.g., processor type, cache information, available performance counters).
psconfig: a graphical tool for easy creation and management of PerfSuite configuration files.

This section demonstrates two of these commands, psrun and psprocess. Visit the PerfSuite Web site for more information about psinv and psconfig, including examples of their use.

The easiest way to learn to use the basic PerfSuite tools is to try them out on your own programs. Here is a sequence of commands you might enter to run the simple cache example discussed earlier with performance measurement enabled. The directory contents are listed after each psrun invocation to show that each run creates an XML document:

1% ls
badcache
goodcache

2% psrun badcache

3% ls
badcache
badcache.22865.xml
goodcache

4% psrun goodcache

5% ls
badcache
badcache.22865.xml
goodcache
goodcache.22932.xml

6% psprocess badcache.22865.xml
7% psprocess goodcache.22932.xml
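
The same sequence can be scripted. The following is a minimal Python sketch, not part of PerfSuite: the helper names reports_for and measure are hypothetical, the file-name pattern (program name, a number, then .xml) is inferred only from the listing above, and the sketch assumes that psrun and psprocess are on the PATH and that psprocess writes its report to standard output.

```python
import os
import re
import subprocess


def reports_for(program, filenames):
    """Return the names that look like <program>.<digits>.xml.

    The pattern is an assumption based on the directory listing above.
    """
    pattern = re.compile(re.escape(program) + r"\.\d+\.xml")
    return sorted(name for name in filenames if pattern.fullmatch(name))


def measure(program, directory="."):
    """Run psrun on a program, then psprocess on each new XML document.

    Returns a dict mapping each new XML file name to the psprocess text.
    """
    before = set(reports_for(program, os.listdir(directory)))
    subprocess.run(["psrun", program], cwd=directory, check=True)
    after = set(reports_for(program, os.listdir(directory)))

    summaries = {}
    for name in sorted(after - before):
        result = subprocess.run(
            ["psprocess", name],
            cwd=directory,
            check=True,
            capture_output=True,  # assumes the report goes to stdout
            text=True,
        )
        summaries[name] = result.stdout
    return summaries
```

Calling measure("badcache") and then measure("goodcache") would correspond to commands 2 through 7 above. Comparing the directory contents before and after the run avoids picking up XML documents left over from earlier measurements.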

Examples 2 and 3 show the output of the psprocess command for the unoptimized and optimized versions of the test program; these listings have been edited slightly to fit in the available space. As you can see, a substantial amount of information is gathered during the measurement. The report includes not only the raw event counts measured using PAPI but also a series of metrics derived from those counts.

Example 2. psprocess Output from the Cache-Unfriendly Version of the Loop

PerfSuite Hardware Performance Summary Report

Version                  : 1.0
Created                  : Thu Feb 19 22:43:01 2004
Generator                : psprocess 0.2
XML Source               : badcache.22865.xml

Processor and System Information
====================================================
Node CPUs                : 2
Vendor                   : Intel
Family                   : Pentium Pro (P6)
CPU Revision             : 6
Clock (MHz)              : 997.173
Memory (MB)              : 1510.82
Pagesize (KB)            : 4

Cache Information
====================================================
Cache levels             : 2
--------------------------------
Level 1
Type                     : instruction
Size (KB)                : 16
Linesize (B)             : 32
Assoc                    : 4
Type                     : data
Size (KB)                : 16
Linesize (B)             : 32
Assoc                    : 4
--------------------------------
Level 2
Type                     : unified
Size (KB)                : 256
Linesize (B)             : 32
Assoc                    : 8

Index Description                     Counter Value
===================================================
 1 Conditional branch instructions........ 52663367
 2 Branch instructions.................... 52650952
 3 Conditional branch ins mispredicted...... 112009
 4 Conditional branch instructions taken.. 52610596
 5 Branch target address cache misses........ 31020
 6 Requests for excl acc to clean cache line.. 1165
 7 Requests for cache line invalidation.......... 0
 8 Requests for cache line intervention...... 32801
 9 Requests for excl acc to shared cache ln.. 26537
10 Floating point multiply instructions.......... 0
11 Floating point divide instructions............ 0
12 Floating point instructions........... 208155552
13 Hardware interrupts....................... 22134
14 Total cycles........................ 21407855039
15 Instructions issued.................. 2010041200
16 Instructions completed................ 624104056
17 Vector/SIMD instructions...................... 0
18 Level 1 data cache accesses........... 678945043
19 Level 1 data cache misses............. 244760094
20 Level 1 instruction cache accesses.. 21332388384
21 Level 1 instruction cache misses.......... 22546
22 Level 1 instruction cache reads..... 21309322857
23 Level 1 load misses................... 244318153
24 Level 1 store misses....................... 9852
25 Level 1 cache misses.................. 243826788
26 Level 2 data cache reads.............. 243745402
27 Level 2 data cache writes................. 10317
28 Level 2 instruction cache accesses........ 24335
29 Level 2 instruction cache reads........... 21362
30 Level 2 cache misses.................. 212665026
31 Cycles stalled on any resource...... 21057880641
32 Instruction TLB misses....................... 64

Statistics
===================================================
Counting domain............................... user
Multiplexed.................................... yes
Graduated floating point ins. per cycle...... 0.010
Vector ins. per cycle........................ 0.000
Floating point ins per graduated ins ........ 0.334
Vector ins per graduated ins ................ 0.000
Floating point ins per L1 data cache access.. 0.307
Graduated ins per cycle...................... 0.029
Issued ins per cycle......................... 0.094
Graduated ins per issued ins................. 0.310
Issued ins per L1 ins cache miss......... 89152.896
Graduated ins per L1 ins cache miss...... 27681.365
Level 1 ins cache miss ratio................. 0.000
Level 1 data cache access per graduated ins.. 1.088
% floating point ins of all graduated ins... 33.353
% cycles stalled on any resource............ 98.365
Level 1 ins cache misses per issued ins...... 0.000
Level 1 cache read miss ratio (instruction).. 0.000
Level 1 cache miss ratio (data).............. 0.361
Level 1 cache miss ratio (instruction)....... 0.000
Bandwidth used to level 1 cache (MB/s)..... 363.437
Bandwidth used to level 2 cache (MB/s)..... 316.988
MFLIPS (cycles).............................. 9.696
MFLIPS (wall clock).......................... 9.530
MVOPS (cycles)............................... 0.000
MVOPS (wall clock)........................... 0.000
MIPS (cycles)............................... 29.071
MIPS (wall clock)........................... 28.572
CPU time (seconds).......................... 21.469
Wall clock time (seconds)................... 21.843
% CPU utilization........................... 98.285
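
Several of the statistics in Example 2 can be reproduced from the raw counts in the same report. The Python sketch below does so for a handful of them. The formulas are inferred from the metric names and happen to agree with the reported values to three decimal places; they are not taken from the PerfSuite source, and the dictionary keys are names chosen only for this illustration.

```python
# Raw event counts copied from Example 2 (the cache-unfriendly run).
counts = {
    "fp_ins": 208155552,         # 12 Floating point instructions
    "cycles": 21407855039,       # 14 Total cycles
    "ins_issued": 2010041200,    # 15 Instructions issued
    "ins_completed": 624104056,  # 16 Instructions completed
    "l1d_accesses": 678945043,   # 18 Level 1 data cache accesses
    "l1d_misses": 244760094,     # 19 Level 1 data cache misses
    "stall_cycles": 21057880641, # 31 Cycles stalled on any resource
}
clock_mhz = 997.173  # "Clock (MHz)" from the same report


def derive(c, clock_mhz):
    """Recompute a few statistics; the formulas are inferred, not official."""
    cpu_seconds = c["cycles"] / (clock_mhz * 1e6)
    return {
        "Graduated ins per cycle": c["ins_completed"] / c["cycles"],
        "Issued ins per cycle": c["ins_issued"] / c["cycles"],
        "Graduated ins per issued ins": c["ins_completed"] / c["ins_issued"],
        "Level 1 cache miss ratio (data)": c["l1d_misses"] / c["l1d_accesses"],
        "% cycles stalled on any resource": 100.0 * c["stall_cycles"] / c["cycles"],
        "CPU time (seconds)": cpu_seconds,
        "MFLIPS (cycles)": c["fp_ins"] / cpu_seconds / 1e6,
        "MIPS (cycles)": c["ins_completed"] / cpu_seconds / 1e6,
    }


metrics = derive(counts, clock_mhz)
for name, value in metrics.items():
    print(f"{name:<34} {value:10.3f}")
```

Reading the statistics this way makes the cost of the cache-unfriendly loop concrete: more than a third of level 1 data cache accesses miss, and the processor is stalled for about 98% of its cycles.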

Example 3. Part of the psprocess output from the optimized version of the loop. The Processor and System Information and Cache Information sections are the same.

Index Description                     Counter Value
===================================================
 1 Conditional branch instructions........ 49627213
 2 Branch instructions.................... 49971420
 3 Conditional branch ins mispredicted....... 97630
 4 Conditional branch ins taken........... 49089592
 5 Branch target address cache misses......... 3816
 6 Requests for excl access to clean cache ln.  820
 7 Requests for cache line invalidation.......... 0
 8 Requests for cache line intervention....... 2796
 9 Requests for excl access to shared cache ln. 494
10 Floating point multiply instructions.......... 0
11 Floating point divide instructions............ 0
12 Floating point instructions........... 189564951
13 Hardware interrupts........................ 2577
14 Total cycles......................... 2471179766
15 Instructions issued................... 513936102
16 Instructions completed................ 509580537
17 Vector/SIMD instructions...................... 0
18 Level 1 data cache accesses........... 372965600
19 Level 1 data cache misses.............. 23010188
20 Level 1 instruction cache accesses... 2769671237
21 Level 1 instruction cache misses........... 2369
22 Level 1 instruction cache reads...... 2746595553
23 Level 1 load misses.................... 25980065
24 Level 1 store misses........................ 995
25 Level 1 cache misses................... 25772544
26 Level 2 data cache reads............... 25617201
27 Level 2 data cache writes................... 935
28 Level 2 instruction cache accesses......... 2405
29 Level 2 instruction cache reads............ 2652
30 Level 2 cache misses................... 25287572
31 Cycles stalled on any resource....... 2199590592
32 Instruction TLB misses........................ 0


Statistics
==================================================
Counting domain.............................. user
Multiplexed................................... yes
Graduated floating point ins per cycle...... 0.077
Vector ins per cycle........................ 0.000
Floating point ins per graduated ins........ 0.372
Vector ins per graduated ins................ 0.000
Floating point ins per L1 data cache access. 0.508
Graduated ins per cycle..................... 0.206
Issued ins per cycle........................ 0.208
Graduated ins per issued ins................ 0.992
Issued ins per L1 ins cache miss....... 216942.213
Graduated ins per L1 ins cache miss.... 215103.646
Level 1 ins cache miss ratio................ 0.000
Level 1 data cache access per graduated ins. 0.732
% floating point ins of all graduated ins.. 37.200
% cycles stalled on any resource........... 89.010
Level 1 ins cache misses per issued ins..... 0.000
Level 1 cache read miss ratio (instruction). 0.000
Level 1 cache miss ratio (data)............. 0.062
Level 1 cache miss ratio (instruction)...... 0.000
Bandwidth used to level 1 cache (MB/s).... 332.792
Bandwidth used to level 2 cache (MB/s).... 326.530
MFLIPS (cycles)............................ 76.493
MFLIPS (wall clock)........................ 66.787
MVOPS (cycles).............................. 0.000
MVOPS (wall clock).......................... 0.000
MIPS (cycles)............................. 205.626
MIPS (wall clock)......................... 179.533
CPU time (seconds).......................... 2.478
Wall clock time (seconds)................... 2.838
% CPU utilization.......................... 87.310
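
To put the two reports side by side, the sketch below computes a few ratios between the unoptimized and optimized runs, using only values copied from Examples 2 and 3. The choice of which ratios to compute, and the names used for them, are assumptions made for illustration. Because both reports mark the counters as multiplexed, the results are best read as approximate.

```python
# Values copied from Example 2 (unoptimized) and Example 3 (optimized).
bad = {
    "cycles": 21407855039,
    "l1d_accesses": 678945043,
    "l1d_misses": 244760094,
    "l2_misses": 212665026,
    "cpu_seconds": 21.469,
}
good = {
    "cycles": 2471179766,
    "l1d_accesses": 372965600,
    "l1d_misses": 23010188,
    "l2_misses": 25287572,
    "cpu_seconds": 2.478,
}


def compare(before, after):
    """Return improvement factors (before / after) and both L1 miss ratios."""
    return {
        "CPU time speedup": before["cpu_seconds"] / after["cpu_seconds"],
        "Cycle reduction factor": before["cycles"] / after["cycles"],
        "L1 data miss reduction factor": before["l1d_misses"] / after["l1d_misses"],
        "L2 miss reduction factor": before["l2_misses"] / after["l2_misses"],
        "L1 data miss ratio (before)": before["l1d_misses"] / before["l1d_accesses"],
        "L1 data miss ratio (after)": after["l1d_misses"] / after["l1d_accesses"],
    }


summary = compare(bad, good)
for name, value in summary.items():
    print(f"{name:<32} {value:8.3f}")
```

The drop in CPU time tracks the drop in total cycles closely, and the level 1 data cache miss ratio falls from roughly 0.36 to roughly 0.06, which matches the ratios printed in the two Statistics sections.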
