libpspmpi
PerfSuite Profiling Library for MPI

Introduction

PSPMPI (libpspmpi) is a library that can be linked with MPI-based programs to measure message-passing performance. It was inspired by the FPMPI library from Argonne National Laboratory. PSPMPI writes its output as an XML file. In addition to measuring message passing, PSPMPI is being developed to interface with the profiling framework of VMI 2.0.

The following sections describe PSPMPI's features, usage, functionality, BIN file format, output, and related topics.

Features


Features already developed:

Features under development:

Usage


The following example is somewhat specific to NCSA clusters. Compile your MPI program as usual. When linking it with the MPICH/VMI libraries, link libpspmpi before libmpich and the other MPI libraries.

Instead of:

ecc  -o test-program test-program.o -L/usr/local/vmi/mpich/lib/gcc -lmpich -lvmi -lpthread -ldl

use:

ecc  -o test-program test-program.o -L/usr/local/vmi/mpich/lib/gcc -L/usr/apps/tools/PerfSuite/pspmpi/lib/gcc -lpspmpi -lpmpich -lmpich -lvmi -lpthread -ldl

For Fortran programs:

efc -o swim swim.o -L/usr/local/vmi/mpich/lib/intel -L/u/ncsa/raghuram/PSPMPI/lib/intel -lpspmpi -lpmpich -lfmpich -lmpich -lvmi -ldl -lpthread
 

The library installed on the Titan and Platinum clusters has been compiled using Intel's C++ compiler.
 

Functionality


PSPMPI provides most of the features of FPMPI, such as statistics for individual MPI calls and a summary of communication statistics. It also adds features of its own. One of these is user-specified 'bins', where a 'bin' is a range of message sizes. PSPMPI measures the execution time, number of calls, synchronization time, and so on for MPI routines, and classifies these statistics into bins based on the size of the messages the routines transfer. For example, if an MPI program performs point-to-point message passing between 3 processes with messages of 10 bytes, 100 bytes, and 10000 bytes, the user can instruct PSPMPI to collect performance data separately for each of these message sizes. The File Format section gives more details.
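The idea of size-based binning can be sketched in a few lines of C. This is only an illustration of the concept described above, not PSPMPI's implementation: the bin_stats structure and the find_bin and record_call functions are hypothetical names introduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-bin statistics; not PSPMPI's internal layout. */
typedef struct {
    long   lower;    /* smallest message size (bytes) that falls in this bin */
    long   calls;    /* number of MPI calls recorded in this bin */
    double seconds;  /* accumulated execution time of those calls */
} bin_stats;

/* Return the index of the bin whose range contains `bytes`.
 * `bins` must be sorted by ascending `lower`, and bins[0].lower must be 0,
 * so that every non-negative size belongs to exactly one bin. */
size_t find_bin(const bin_stats *bins, size_t nbins, long bytes)
{
    assert(nbins > 0 && bins[0].lower == 0 && bytes >= 0);
    size_t i = nbins - 1;
    while (bins[i].lower > bytes)   /* stops at bin 0 at the latest */
        --i;
    return i;
}

/* Record one call that transferred `bytes` and took `seconds`. */
void record_call(bin_stats *bins, size_t nbins, long bytes, double seconds)
{
    size_t i = find_bin(bins, nbins, bytes);
    bins[i].calls += 1;
    bins[i].seconds += seconds;
}
```

With three bins whose lower bounds are, say, 0, 64, and 1024 bytes (values chosen only for this illustration), the 10-byte, 100-byte, and 10000-byte messages from the example each land in a different bin, so their statistics stay separate.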

MPI programs can consist of multiple phases. To let the user examine the performance of each phase individually, PSPMPI provides an API for profiling phases separately.

Profiling program phases:

The following is the API:

#include <pspmpi.h>

PS_Err ps_pmpi_profilePhase(const char *name);

This call creates a new profile, with 'name' serving as its unique identifier. A profile is in one of three states: ACTIVE, SUSPENDED, or CLOSED. A newly created profile is SUSPENDED by default, which simply means that profiling for this phase is not yet turned on. To activate it, use:

PS_Err ps_pmpi_startProfile(const char *name);

where 'name' is the same string that was used to create the profile. After a profile is activated, the performance of MPI routines called within that phase is logged to this profile. When the program finishes, the output XML file therefore contains message-passing information for the phase measured by the profile. It also contains an overall profile, corresponding to "PS_Default", a default profile that PSPMPI creates itself and that is always ACTIVE. Each profile can have its own bin sizes, which are specified in a file. Thus, to profile a phase with the identifier "XYZ" using specific bin sizes (rather than the default bin sizes), the user specifies them in a file named "BINS.XYZ".

Profiling for a profile can be suspended by using:

PS_Err ps_pmpi_suspendProfile(const char *name);

After a profile is SUSPENDED, performance data for MPI routines in that phase will not be logged to that profile. Profiling can be resumed by:

PS_Err ps_pmpi_resumeProfile(const char *name);

A profile can be CLOSED by using:

PS_Err ps_pmpi_closeProfile(const char *name);

Once a profile is closed, it cannot be reactivated. All profiling for this profile is stopped.

Three additional functions are provided to resume, suspend, and close all profiles (except those that have already been closed explicitly):

void ps_pmpi_resumeAllProfiles();
void ps_pmpi_suspendAllProfiles();
void ps_pmpi_closeAllProfiles();
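
The profile states and transitions described above can be summarized as a small state machine. The following C model is a sketch for illustration only and does not call PSPMPI: the type and function names are hypothetical, and it assumes that starting and resuming a profile have the same effect. It does not model "PS_Default", which is always ACTIVE.

```c
#include <assert.h>
#include <stddef.h>

/* The three profile states named in the text. */
typedef enum { PROFILE_SUSPENDED, PROFILE_ACTIVE, PROFILE_CLOSED } profile_state;

/* Operations that can be requested on a profile (hypothetical names). */
typedef enum { OP_START, OP_SUSPEND, OP_RESUME, OP_CLOSE } profile_op;

/* Return the state a profile is in after `op` is applied to it. */
profile_state apply_op(profile_state current, profile_op op)
{
    if (current == PROFILE_CLOSED)
        return PROFILE_CLOSED;       /* a closed profile cannot be reactivated */
    switch (op) {
    case OP_START:
    case OP_RESUME:  return PROFILE_ACTIVE;
    case OP_SUSPEND: return PROFILE_SUSPENDED;
    case OP_CLOSE:   return PROFILE_CLOSED;
    }
    return current;                  /* not reached for valid operations */
}

/* Apply one operation to every profile; closed profiles stay closed,
 * mirroring the description of the *AllProfiles functions. */
void apply_op_to_all(profile_state *profiles, int count, profile_op op)
{
    assert(profiles != NULL || count == 0);
    for (int i = 0; i < count; ++i)
        profiles[i] = apply_op(profiles[i], op);
}
```

A newly created profile would start in PROFILE_SUSPENDED in this model, and any operation on a PROFILE_CLOSED profile leaves it unchanged.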

The following example makes the use of profiles clearer. Consider the skeleton of an MPI program with three phases:

Initialization
Point-to-point communication
Collective Operations

To profile these phases individually (in addition to obtaining a combined profile), the following calls are used:

ps_pmpi_profilePhase("initialize");        // creates a profile to measure the initialization phase
ps_pmpi_profilePhase("p2p");                //  for point-to-point messaging phase
ps_pmpi_profilePhase("collective");     //  for collective operations phase
     ...

ps_pmpi_startProfile("initialize");    // activate the profile for initialization
...
... Initialization phase
...
ps_pmpi_closeProfile("initialize");        // close the profile for initialization
     ...

ps_pmpi_resumeProfile("p2p");    // you can also use ps_pmpi_startProfile
...
... Point-to-point communication phase
...
ps_pmpi_closeProfile("p2p");
     ...

ps_pmpi_resumeProfile("collective");
...
... Collective communication phase
...
ps_pmpi_closeProfile("collective");
 

The user can also use the suspend feature to temporarily suspend profiling for any profile except "PS_Default", which is always active. The output of PSPMPI is an XML file, which in this case contains profile data for 4 profiles: "PS_Default" and the 3 user-defined profiles (identified by "initialize", "p2p" and "collective").

NOTE: PSPMPI assumes that profiles are created in SPMD style. That is, every process involved in message passing for a profile must create that profile, and all processes must create their profiles in the same program order. Non-conformance may cause undefined results.
 

Profiling communication with user-defined communicators:
 

Most MPI programs use the MPI_COMM_WORLD communicator for communication. However, MPI programs are often written with user-defined communicators to structure communication systematically. PSPMPI can profile MPI routines that communicate within a user-defined intra-communicator. Suppose the user has created a communicator called "comm_worker" and wants to know the performance of the MPI routines that transfer data within it. To do this, the user creates a PSPMPI profile associated with "comm_worker" by making the following API call:

PS_Err ps_pmpi_profileComm(MPI_Comm comm, const char *name );

where "comm" is the communicator that the user wants to profile (for example, "comm_worker") and "name" is the identifier to associate with it. This identifier appears in the generated XML file. Profiles for communicators can be suspended and resumed in the same way as profiles for program phases.
 

File Format


Users specify the bin sizes they want in a BIN file. The following is an example of a user-specified BIN file.

 samples$> cat BINS.i
 16 32
 64 128
 500 1024
 1K 4K
 samples$>

The above BIN file creates the following bins:

1. 1st bin: stores performance data for messages that are 0 to 15 bytes long
2. 2nd bin: 16 to 31 bytes
3. 3rd bin: 32 to 63 bytes
4. 4th bin: 64 to 127 bytes
5. 5th bin: 128 to 499 bytes
6. 6th bin: 500 to 1023 bytes
7. 7th bin: 1024 to 4095 bytes
8. 8th bin: 4096 to 2^31 - 1 bytes

Thus, PSPMPI lets users specify bins with gaps between them, and automatically creates bins to fill the gaps. In addition, the abbreviations K (kilobyte) and M (megabyte) are allowed; both must be uppercase. The bins specified must be non-overlapping and increasing in size. Further, each bin must be specified on its own line (end each line with the 'Enter' key).
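
To make the gap-filling rule concrete, here is a minimal C sketch that turns the text of a BIN file into the list of bin lower bounds. It is not PSPMPI's parser: parse_size and build_bins are hypothetical names. K is taken as 1024 bytes, as the example above implies, and M as 1024 * 1024 bytes, which is an assumption. For brevity the sketch reads whitespace-separated pairs and does not enforce the one-bin-per-line rule.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse "500", "1K" or "4M" into a byte count; return -1 if malformed.
 * Only uppercase K and M suffixes are accepted. */
long parse_size(const char *text)
{
    assert(text != NULL);
    char *end = NULL;
    long value = strtol(text, &end, 10);
    if (end == text || value < 0)
        return -1;
    if (*end == 'K') { value *= 1024L; ++end; }
    else if (*end == 'M') { value *= 1024L * 1024L; ++end; }
    return (*end == '\0') ? value : -1;
}

/* Read "low high" pairs from `text` and store the lower bound of every
 * resulting bin (specified bins plus automatically added gap bins) in
 * `lower`. Returns the number of bins, or -1 if a pair is malformed,
 * overlaps an earlier bin, is out of order, or `max` is too small. */
int build_bins(const char *text, long *lower, int max)
{
    char lo_text[32], hi_text[32];
    long next = 0;      /* first message size not yet covered by a bin */
    int n = 0, used = 0;

    while (sscanf(text, "%31s %31s%n", lo_text, hi_text, &used) == 2) {
        long lo = parse_size(lo_text), hi = parse_size(hi_text);
        text += used;
        if (lo < 0 || hi <= lo || lo < next)
            return -1;
        if (lo > next) {            /* gap before this bin: add a filler bin */
            if (n == max) return -1;
            lower[n++] = next;
        }
        if (n == max) return -1;
        lower[n++] = lo;
        next = hi;
    }
    if (n == max) return -1;
    lower[n++] = next;              /* last bin extends to the largest size */
    return n;
}
```

Applied to the four lines of the example BIN file, this sketch produces the lower bounds 0, 16, 32, 64, 128, 500, 1024, and 4096, matching the eight bins listed above.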
 

Output Generated


PSPMPI generates its output as an XML file. To make the XML easier to read, we provide a basic XSL transformation that creates an HTML document from it.

You can see an example at: http://perfsuite.ncsa.uiuc.edu/PerfExchange/pspmpi.html

To apply the same transformation to your own PSPMPI XML output, run the following command (assuming $PSDIR is set to /usr/apps/tools/perfsuite):

$PSDIR/ext/bin/xt profile.xml $PSDIR/xml/xsl/pspmpi.xsl profile.html

If that works, you can open "profile.html" in a browser. If xt reports that it cannot find "pspmpi.dtd", remove the line in profile.xml that looks like this:

<!DOCTYPE pspmpi SYSTEM "pspmpi.dtd">

Then run xt again.

The DTD for XML files generated by PSPMPI is in the xml subdirectory. A graphical viewer for the output generated is currently under development.
 

Miscellaneous

Environment Variables

The following shell environment variables are recognised by PSPMPI. They are usually set in the batch script that starts the MPI program (usually a PBS script).

PSPMPI_BIN_FILE: name of the BIN file (used for the PS_Default profile and for all user-defined profiles that do not have a corresponding bin file).
PSPMPI_XML_FILE: name of the XML output file.
PSPMPI_USE_DEFAULT_BINS: use the default bin sizes even if the default bin file (BINS.i) or a user-defined BIN file exists.

PSPMPI tries the user-specified bin file first, then the default bin file (BINS.i). If neither exists, it uses the default bin sizes (defined in include/profiler.h).
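
The lookup order can be written down as a small decision function. The C sketch below is one reading of the rules in this section, not PSPMPI's code: bin_file_name and choose_bin_source are hypothetical names, and whether a file exists is passed in as a flag so that the logic stays independent of the file system. It assumes that PSPMPI_USE_DEFAULT_BINS takes precedence over any bin file.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Where the bin sizes for a profile come from. */
typedef enum { BINS_FROM_USER_FILE, BINS_FROM_DEFAULT_FILE, BINS_BUILT_IN } bin_source;

/* Build the per-profile BIN file name, e.g. "XYZ" -> "BINS.XYZ".
 * Returns 0 on success, -1 if `out` is too small. */
int bin_file_name(const char *profile, char *out, size_t out_size)
{
    assert(profile != NULL && out != NULL);
    int n = snprintf(out, out_size, "BINS.%s", profile);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}

/* Pick the bin source: built-in sizes if they are forced, otherwise the
 * user-specified file, then the default file (BINS.i), then built-in sizes. */
bin_source choose_bin_source(bool use_default_bins,
                             bool user_file_exists,
                             bool default_file_exists)
{
    if (use_default_bins)
        return BINS_BUILT_IN;
    if (user_file_exists)
        return BINS_FROM_USER_FILE;
    if (default_file_exists)
        return BINS_FROM_DEFAULT_FILE;
    return BINS_BUILT_IN;
}
```

In a real run, the flags would come from checking the environment (for example with getenv) and testing whether the candidate files can be opened.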

Feedback

Please send comments, suggestions and bug reports to: perfsuite@ncsa.uiuc.edu
 

Examples


We provide you with the following examples and the results:

Known Issues

Appendix

List of MPI Routines that are profiled/hooked:

MPI_Allgather
MPI_Allgatherv
MPI_Allreduce
MPI_Alltoall
MPI_Alltoallv
MPI_Barrier
MPI_Bcast
MPI_Bsend
MPI_Cancel
MPI_Gather
MPI_Gatherv
MPI_Ibsend
MPI_Irecv
MPI_Irsend
MPI_Isend
MPI_Issend
MPI_Pack
MPI_Recv
MPI_Reduce
MPI_Reduce_scatter
MPI_Rsend
MPI_Scan
MPI_Scatter
MPI_Scatterv
MPI_Send
MPI_Sendrecv
MPI_Sendrecv_replace
MPI_Ssend
MPI_Test (only hooked)
MPI_Testall (only hooked)
MPI_Testany (only hooked)
MPI_Testsome (only hooked)
MPI_Unpack
MPI_Wait
MPI_Waitall
MPI_Waitany
MPI_Waitsome
 
 
 

PerfSuite
Email: perfsuite (at) ncsa.uiuc.edu
National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign