libpshwpc, the PerfSuite supporting software library for hardware performance event counting, contains a small number of routines that are used to collect hardware performance event data for use within your program or by the PerfSuite graphical, command-line, or web-based tools.
libpshwpc supports both single-threaded programs and programs that use the POSIX threads standard (pthreads) for multithreaded execution. For pthreads programs, each thread will maintain copies of its own performance counter data.
The library is targeted for Linux-Intel/AMD (x86/x86-64/ia64) platforms.
The routines within libpshwpc provide output and functionality that are useful independently of the graphical tools, so you can also use the library entirely on its own.
The routines currently contained in the library are summarized on
this web page. All PerfSuite libpshwpc
library routines begin with the prefix "ps_hwpc"
(for C) and "PSF_hwpc"
(for Fortran).
Here's an example Fortran matrix-multiply loop that uses libpshwpc for (possibly multiplexed) hardware performance counting. The additions necessary to use libpshwpc are the include line and the PSF_hwpc_* calls with their error checks:
      program mxm
      include 'fperfsuite.h'
      (... declare and initialize arrays ...)

c     Initialize libpshwpc
      call PSF_hwpc_init(ierr)
      if (ierr.ne.0) then
         print*, 'Error initializing libpshwpc!'
         stop
      endif

c     Start performance counting using libpshwpc
      call PSF_hwpc_start(ierr)
      if (ierr.ne.0) then
         print*, 'Error starting performance counting!'
         stop
      endif

c     Do the matrix multiply
      do j = 1, n
         do i = 1, m
            do k = 1, l
               c(i,j) = c(i,j) + a(i,k)*b(k,j)
            end do
         end do
      end do

c     Stop hardware performance counting and write the
c     results to a file named 'perf.PID.xml' (PID will be
c     replaced by the process ID of the program)
      call PSF_hwpc_stop('perf', ierr)
      if (ierr.ne.0) then
         print*, 'Error stopping hardware performance counting!'
         stop
      endif

c     Shut down use of libpshwpc and the underlying libraries
      call PSF_hwpc_shutdown(ierr)
      if (ierr.ne.0) then
         print*, 'Error terminating libpshwpc!'
         stop
      endif

      end
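The comments in the example above say that the results are written to a file named 'perf.PID.xml', where PID is the process ID. The following is a minimal sketch of that naming rule only. format_output_name is a hypothetical helper introduced for illustration; it is not part of libpshwpc, and the library's own naming code may differ (for example, in threaded programs).

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a libpshwpc routine): build "<prefix>.<pid>.xml"
 * in buf. Returns 0 on success, or -1 if buf is too small for the name.
 */
int format_output_name(char *buf, size_t len, const char *prefix, long pid)
{
    int n = snprintf(buf, len, "%s.%ld.xml", prefix, pid);

    /* snprintf reports the length it needed; reject errors and truncation */
    return (n < 0 || (size_t) n >= len) ? -1 : 0;
}
```

With the prefix "perf" and a process ID of 1234, the helper produces the name perf.1234.xml. In a real program the process ID would come from getpid().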
The output generated from a program that uses the libpshwpc library is an XML document in a standard PerfSuite format. Because it is based on the XML standard for data representation, there are many possibilities for working with this document to obtain insight into the behavior of your application.
The PerfSuite command-line tool psprocess is a convenient utility for post-processing the results; for some examples and suggestions, you can refer to the documentation for psrun.
These instructions are specific to the PerfSuite installation
at NCSA, which is rooted at the directory
/usr/apps/tools/perfsuite. For other installations,
you should substitute the local PerfSuite top-level directory instead.
All C/C++-based applications should include the main PerfSuite
header file <perfsuite.h> and also
the libpshwpc header file <pshwpc.h>.
Fortran-based applications should include <fperfsuite.h>.
No other header files are necessary to use these routines.
When you compile your program, include the flag:
-I/usr/apps/tools/perfsuite/include
When you link your program, include the flags:
-L/usr/apps/tools/perfsuite/lib -lpshwpc
Programs that use POSIX threads should instead link with the threaded version of libpshwpc, as follows:
-L/usr/apps/tools/perfsuite/lib -lpshwpc_r
The libpshwpc shared library automatically links the necessary
low-level hardware performance counter support library (default: PAPI).
You'll still have to add the directory
/usr/apps/tools/perfsuite/lib to your
LD_LIBRARY_PATH environment variable in order
for your program to locate the PerfSuite shared libraries
(or use other link-time options).
If you link statically, you'll have to specify the PerfSuite core, PAPI, and Expat XML parser libraries also, as follows:
-L/usr/apps/tools/perfsuite/lib -L/usr/apps/tools/papi/lib \
-lpshwpc -lperfsuite -lpapi -lexpat
Also note that a static link will remove the requirement to set your LD_LIBRARY_PATH environment variable.
More complete information about PAPI and its installation at NCSA can be found on the PAPI at NCSA web page.
Assuming that you've successfully compiled and linked your program as described above and that you've set your LD_LIBRARY_PATH environment variable if necessary, just run your program as you normally would, possibly setting run-specific environment variables (described next).
Note: you should not run a program linked with libpshwpc with psrun (or any other software that would simultaneously attempt to access the hardware performance counters). Doing so will result in a conflict and a run-time error.
libpshwpc recognizes the following environment variables:
PS_HWPC
Controls the run-time behavior of the library. If set to "off" or "no" (case is not significant), then no hardware performance counting will take place and all libpshwpc routines will return a success status without actually doing anything.
PS_HWPC_ANNOTATION
This variable allows you to provide an optional annotation string
to the resulting XML output file (you may want to use this to
keep track of specific information regarding a particular run).
The value of this variable is copied verbatim as the text
associated with the element <annotation>.
PS_HWPC_CONFIG
Specifies an event configuration file that is used to determine which hardware performance events will be counted. The file name may be absolute or relative to the current working directory. See Selecting Events to Count for more information about the configuration file.
PS_HWPC_DOMAIN
Specifies the "counting domain" at which measurement will take place. Recognized values (case is not significant) are "user" (default), "kernel", or "all".
PS_HWPC_FILE
Specifies the base prefix to be used for the resulting XML
output document (see the routine ps_hwpc_stop()
below for details).
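Two of the variables above take values whose letter case is not significant: the on/off control ("off" or "no" disables counting) and the counting domain ("user", "kernel", or "all", defaulting to "user"). The sketch below shows one way such values could be interpreted. The function names and the enum are hypothetical and are not taken from the libpshwpc source; the treatment of unrecognized domain values is also an assumption.

```c
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical result type for the counting domain setting. */
enum count_domain { DOMAIN_USER, DOMAIN_KERNEL, DOMAIN_ALL, DOMAIN_INVALID };

/*
 * Returns 0 when value is "off" or "no" in any letter case,
 * and 1 otherwise (including when the variable is unset).
 */
int counting_enabled(const char *value)
{
    if (value == NULL)
        return 1;
    return !(strcasecmp(value, "off") == 0 || strcasecmp(value, "no") == 0);
}

/*
 * Maps a domain string to an enum value, ignoring case. An unset
 * variable selects the documented default, "user". Reporting other
 * strings as DOMAIN_INVALID is an assumption made for this sketch.
 */
enum count_domain parse_domain(const char *value)
{
    if (value == NULL || strcasecmp(value, "user") == 0)
        return DOMAIN_USER;
    if (strcasecmp(value, "kernel") == 0)
        return DOMAIN_KERNEL;
    if (strcasecmp(value, "all") == 0)
        return DOMAIN_ALL;
    return DOMAIN_INVALID;
}
```

In use, each function would be given the result of getenv() for the corresponding variable; getenv() returns NULL when the variable is not set, which both functions treat as "use the default".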
libpshwpc uses an event configuration file to decide at runtime what performance events should be counted. This file is an XML document with a very simple syntax that can be modified with any text editor.
Here's a sample event configuration file and instructions for creating your own custom configuration file.
PerfSuite provides several default configuration files,
each targeted to a different architecture, that are located in
the directory share/perfsuite/xml/pshwpc (relative to the
PerfSuite top-level installation directory). For Pentium (except
Pentium 4/Xeon) and Itanium machines, these files are named
papi2_p6.xml
and papi2_itanium.xml, respectively. You can
use these default files as a basis for creating your own
desired configuration (just copy them to a private location
and modify appropriately).
There's also a "do-nothing"
configuration file called null.xml that
can be used to obtain general run information without using
performance counters at all. In this case,
the resulting XML output will contain
information about the machine and the date and
wall clock time elapsed between the call to ps_hwpc_init()
and ps_hwpc_stop(). See the C/Fortran API
section for more details on these routines.
You can also use the graphical tool PSConfig to create or modify event configuration files. This tool provides a convenient point-and-click interface along with several other features to make it easy to work with event selection.
You cannot control the event selection programmatically. The only way to specify events other than the default is through the environment variable PS_HWPC_CONFIG and a custom configuration file.
Note that libpshwpc will not accept the PAPI 2 "rate events" (PAPI_FLOPS and PAPI_IPS) because they are not true events but derived metrics. These metrics are provided anyway if you post-process your results with the PerfSuite utility psprocess (described in the documentation for psrun), as long as your event configuration includes the underlying raw events: PAPI_FP_INS, PAPI_TOT_INS, and PAPI_TOT_CYC.
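To illustrate why rate metrics can be derived from raw counts, here is a rough sketch that turns an event count and a cycle count into events per second. It assumes a known, constant clock frequency (the clock_hz parameter), which is an assumption of this example. The function name is hypothetical, and psprocess may compute its derived metrics differently.

```c
/*
 * Hypothetical helper: convert a raw event count (for example a
 * PAPI_FP_INS or PAPI_TOT_INS total) into a rate, using the cycle
 * count (PAPI_TOT_CYC) and an assumed constant clock rate in Hz.
 * Returns 0.0 when the inputs cannot yield a meaningful rate.
 */
double events_per_second(long long events, long long tot_cyc, double clock_hz)
{
    double seconds;

    if (tot_cyc <= 0 || clock_hz <= 0.0)
        return 0.0;

    /* elapsed time implied by the cycle count */
    seconds = (double) tot_cyc / clock_hz;
    return (double) events / seconds;
}
```

For example, 2,000,000,000 floating-point instructions counted over 1,000,000,000 cycles on a 1 GHz clock corresponds to one second of execution, so the rate is 2,000,000,000 instructions per second.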
Processors typically have a limited set of registers for use in hardware performance counting. One technique for counting more performance events simultaneously than would otherwise be possible with the number of available registers is to use multiplexing, which causes the available physical counters to be time-shared over the desired events.
At the end of the measurement, the number of events read during each time-slice is then adjusted according to the total run time over all measurements to provide a statistical estimate of the actual number of events that would have been observed if the register had been devoted to a single event. This is a very convenient method of measuring a large number of events when only a few performance counters are available, especially if it's not convenient to make multiple non-multiplexed runs of the program.
libpshwpc uses PAPI as its default performance counter access method. PAPI implements multiplexing based on John May's MPX software, so libpshwpc supports multiplexing of the counters through PAPI. This is done automatically for you and is noted in the final output file. Multiplexing will only be enabled if required (i.e., the software detects more requested events than can be counted on the available counters).
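The scaling step described above can be sketched as a simple proportion: the count observed while an event actually occupied a counter is scaled by the ratio of total measurement time to that active time. This is an illustration of the idea only. The function name is hypothetical, and the estimator used inside PAPI's multiplexing code may differ in detail.

```c
/*
 * Hypothetical helper: estimate the full-run count of a multiplexed
 * event. raw_count is what was observed while the event held a
 * physical counter, active_time is how long it held the counter, and
 * total_time is the length of the whole measurement (same time unit).
 */
double mpx_estimate(long long raw_count, double active_time, double total_time)
{
    if (active_time <= 0.0)
        return 0.0;   /* the event was never scheduled: no estimate */

    return (double) raw_count * (total_time / active_time);
}
```

For example, an event that was counted for 2.5 seconds of a 10 second run and recorded 1000 occurrences is estimated at 4000 occurrences for the whole run. The estimate is only as good as the assumption that the event rate during the sampled slices is representative of the entire run.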
The libpshwpc C/Fortran API allows you to insert calls to the library into your application, enabling you to control the collection of hardware event data. The API is intentionally simple but is sufficient for the needs of many people doing performance analysis in practice. More complex needs (e.g., writing tools or profilers) are probably better served by one of the many academic, research, and commercial products that are available.
The API consists of 5 "core" routines that allow you
to control and configure hardware performance measurement.
Additionally, two convenience routines (ps_hwpc_PAPI_write
and ps_hwpc_PAPI_hl_write)
are provided that are intended
for applications that already use PAPI directly.
These routines convert
an existing PAPI event set and associated counter values to
PerfSuite's XML format and write the XML document to a disk
file. The document can then be used by other PerfSuite
command-line, graphical, or Web-based tools.
The following diagram shows the typical sequence of calls
to the routines in libpshwpc.
libpshwpc routines displayed in
a red font may only be called by
a single thread in a program (note: this need not
be the same thread).
libpshwpc routines displayed in a blue
font may be called repeatedly by any thread
(the other libpshwpc
routines should be called exactly once). Dashed lines indicate
the typical path for threads created with pthread_create().

Note that some variations are possible (for example, a thread may
create other threads after having already started performance
counting), but this diagram covers the most common case. The main
things to keep in mind are:
Do not call ps_hwpc_shutdown() until all threads have
finished with performance counting (if this is a serious problem,
then another option is to not call ps_hwpc_shutdown()
at all).
One way of using the libpshwpc routines is as follows:
Insert a call to ps_hwpc_init() at the very beginning of
your program. Insert calls to ps_hwpc_stop() and (optionally)
ps_hwpc_shutdown() at the place(s) where your program
normally exits. Put any string you like as the argument to
ps_hwpc_stop() (you can always change it at runtime
via the environment variable PS_HWPC_FILE).
Insert ps_hwpc_start() and
ps_hwpc_suspend()
calls in order to "zoom in" on the region of interest.
Try to keep these calls at the outermost level possible.
Minimize the perturbation
of your program's execution by only concentrating on one
region at a time and minimize the calls you make to
libpshwpc.
The key phrase is: keep it simple. Focus on where the time is spent, and see what's happening in those portions of your application.
Here's a complete POSIX threads program that uses libpshwpc. The modifications required to use the library are the two PerfSuite header includes and the ps_hwpc_* calls with their error checks. We won't describe the program here, but it is also used as an example in the documentation for the PerfSuite psrun tool (see the next section), so you can read about it in more depth in that document.
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <perfsuite.h>
#include <pshwpc.h>

pthread_mutex_t reduction_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_t *tid;
int n, num_threads;
double pi, w;

/* integrand: the integral of f over [0,1] is pi */
double
f(double a)
{
    return ( 4.0 / (1.0 + a*a) );
}

void *
PIworker(void *arg)
{
    int i, myid;
    double sum, mypi, x;

    /* thread index (starting at 0) is passed in by main(); arithmetic
       on pthread_t values is not portable */
    myid = (int) (intptr_t) arg;

    if (ps_hwpc_start() != 0) {
        fprintf(stderr, "Error starting performance counting!\n");
        pthread_exit(NULL);
    }

    /* integrate function */
    sum = 0.0;
    for (i=myid+1; i<=n; i+=num_threads) {
        x = w * ((double) i - 0.5);
        sum += f(x);
    }

    if (ps_hwpc_stop("PIworker") != 0) {
        fprintf(stderr, "Error stopping performance counting!\n");
        pthread_exit(NULL);
    }

    mypi = w*sum;

    /* reduce value */
    pthread_mutex_lock(&reduction_mutex);
    pi += mypi;
    pthread_mutex_unlock(&reduction_mutex);

    pthread_exit(NULL);
}

int
main(int argc, char **argv)
{
    int i;

    /* check command line */
    if (argc != 3) {
        printf("Usage: %s num-intervals num-threads\n", argv[0]);
        exit(0);
    }

    /* get num intervals and num threads from command line */
    n = atoi(argv[1]);
    num_threads = atoi(argv[2]);
    w = 1.0 / (double) n;
    pi = 0.0;
    tid = (pthread_t *) calloc(num_threads, sizeof(pthread_t));

    if (ps_hwpc_init() != 0) {
        fprintf(stderr, "Error initializing libpshwpc!\n");
        exit(1);
    }

    if (ps_hwpc_start() != 0) {
        fprintf(stderr, "Error starting performance counting!\n");
        exit(1);
    }

    /* create the threads, passing each one its index */
    for (i=0; i<num_threads; i++) {
        if (pthread_create(&tid[i], NULL, PIworker, (void *) (intptr_t) i)) {
            fprintf(stderr, "Cannot create thread %d\n", i);
        }
    }

    /* join threads */
    for (i=0; i<num_threads; i++) {
        pthread_join(tid[i], NULL);
    }

    printf("computed pi = %.16f\n", pi);

    if (ps_hwpc_stop("PImaster") != 0) {
        fprintf(stderr, "Error stopping performance counting!\n");
        exit(1);
    }

    if (ps_hwpc_shutdown() != 0) {
        fprintf(stderr, "Error terminating libpshwpc!\n");
        exit(1);
    }

    return(0);
}
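The worker loop in the program above hands interval i to a thread when (i - 1) modulo num_threads equals that thread's index, and each thread applies the midpoint rule to its share. The following serial sketch reproduces that decomposition without threads or libpshwpc, so the arithmetic can be checked on its own. The function names are introduced here for illustration and do not appear in the original program.

```c
/* Integrand from the example: its integral over [0,1] equals pi. */
double integrand(double a)
{
    return 4.0 / (1.0 + a * a);
}

/*
 * Contribution of worker `myid` (0-based) out of `num_threads`,
 * using n midpoint-rule intervals in a cyclic distribution.
 */
double partial_pi(int myid, int num_threads, int n)
{
    double w = 1.0 / (double) n;   /* interval width */
    double sum = 0.0;
    int i;

    for (i = myid + 1; i <= n; i += num_threads) {
        double x = w * ((double) i - 0.5);   /* interval midpoint */
        sum += integrand(x);
    }
    return w * sum;
}

/* Serial stand-in for the mutex-protected reduction in the program. */
double total_pi(int num_threads, int n)
{
    double pi = 0.0;
    int t;

    for (t = 0; t < num_threads; t++)
        pi += partial_pi(t, num_threads, n);
    return pi;
}
```

Because every interval is assigned to exactly one worker, the sum of the partial results does not depend on the number of workers apart from floating-point rounding.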
If you'd rather not modify your application's source code or relink your program, you can instead cause hardware performance counting based on libpshwpc to be enabled automatically for your program by using the PerfSuite command-line utility psrun. This utility is very convenient and simple to use, and arranges for performance counter measurement to be enabled just before your main program begins execution and to be reported when your program terminates (i.e., the entire application is monitored). psrun can be used with an unmodified executable and also supports POSIX threads.