CrayPat
Introduction¶
CrayPat (Cray Performance Measurement and Analysis toolset) is Cray’s performance analysis tool offered by Cray. Since CrayPat is only available on Cray systems, let us give a brief step-by-step guidance on how to use it on Perlmutter.
Step-by-Step Guide¶
Here we will take Triangle Counting selector as an example:
Step 0: Load compilers¶
It is important to load compiler modules before Step 1. In our case, source
perlmutter_setup.sh:
Step 1: Unload/Load required modules¶
Step 2: Build the application as normal but keep .o
files¶
Go to the test
directory to separately create an object file and executable for the triangle counting code:
CC -g -O3 -std=c++11 -DUSE_SHMEM=1 -I$HCLIB_ROOT/include -I$BALE_INSTALL/include -I$HCLIB_ROOT/../modules/bale_actor/inc -L$HCLIB_ROOT/lib -L$BALE_INSTALL/lib -L$HCLIB_ROOT/../modules/bale_actor/lib -c -o triangle_selector.o triangle_selector.cpp -lhclib -lrt -ldl -lspmat -lconvey -lexstack -llibgetput -lhclib_bale_actor -lm
CC -g -O3 -std=c++11 -DUSE_SHMEM=1 -I$HCLIB_ROOT/include -I$BALE_INSTALL/include -I$HCLIB_ROOT/../modules/bale_actor/inc -L$HCLIB_ROOT/lib -L$BALE_INSTALL/lib -L$HCLIB_ROOT/../modules/bale_actor/lib -o triangle_selector triangle_selector.o -lhclib -lrt -ldl -lspmat -lconvey -lexstack -llibgetput -lhclib_bale_actor -lm
Tip
It is recommended to do make triangle_selector
first to see the full compilation command.
- First, copy and paste the full command and change
-o triangle_selector triangle_selector.cpp
to-c triangle_selector triangle_selector.cpp
to create an object file (.o file). - Second, copy and paste the command again and change
-o triangle_selector triangle_selector.cpp
to-o triangle_selector triangle_selector.o
to create an executable.
Step 3: Instrument the application using pat_build
¶
Generate a CrayPat instrumented executable using pat_build
:
Note
By default, triangle_selector+pat
is generated.
Note
pat_build
has different option that can trace a specific function(s)
-g
: trace Cray-provided library function group such as MPI and OpenSHMEM-u
: trace all user functions routine by routine-w
: flag that enables tracing-T -w
: trace user-defined functions- e.g,
pat_build -w -T selector_function
- e.g,
Step 4: Run the instrumented executable to get performance data¶
Option 1: Run the executable with an interactive batch job¶
Allocating resources:
Run the excutable to generate a directory (e.g.,triangle_selector+pat+174621-8716327t
) containing performance data files with the.xf
suffix:
Option 2: Run the executable with sbatch script¶
#!/bin/bash
#SBATCH -q regular
#SBATCH -N 1
#SBATCH -C cpu
#SBATCH -t 0:05:00
#SBATCH -ooshmem_%j_tri_cout.out
source ./oshmem-perlmutter.sh
module unload darshan
module load perftools-base perftools
export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
cd $HOME/hclib/modules/bale_actor/test
echo "--------------------------------------------"
srun -n 128 --cpu-bind=cores ./triangle_selector+pat
echo "--------------------------------------------"
Step 5: Generate human-readable content with pat_report
¶
Run pat_report
with the generated directory name, which will output a text report on the terminal and creates files with different suffices, .ap2
and .apa
inside the directory:
Exemplar text report¶
CrayPat/X: Version 23.03.0 Revision 46f710008 02/13/23 20:24:04
Number of PEs (MPI ranks): 128
Numbers of PEs per Node: 128
Numbers of Threads per PE: 1
Number of Cores per Socket: 64
Execution start time: Fri Apr 14 13:44:36 2023
System name and speed: nid004675 2.671 GHz (nominal)
AMD Milan CPU Family: 25 Model: 1 Stepping: 1
Core Performance Boost: All 128 PEs have CPB capability
Current path to data file:
/Users/joseph/triangle_selector+pat+197953-8716330t (RTS, 128 data files)
Notes for table 1:
This table shows functions that have significant exclusive time,
averaged across ranks.
For further explanation, see the "General table notes" below, or
use: pat_report -v -O profile ...
Table 1: Profile by Function Group and Function
Time% | Time | Imb. | Imb. | Calls | Group
| | Time | Time% | | Function
| | | | | PE=HIDE
100.0% | 6.079316 | -- | -- | 404.0 | Total
|-------------------------------------------------------------------
| 95.2% | 5.789928 | -- | -- | 2.0 | USER
||------------------------------------------------------------------
|| 47.7% | 2.901679 | 0.292158 | 9.2% | 1.0 | main
|| 47.5% | 2.888249 | 0.000057 | 0.0% | 1.0 | #1.selector_function
||==================================================================
| 4.8% | 0.289263 | 0.100863 | 26.1% | 2.0 | DL
||------------------------------------------------------------------
|| 4.8% | 0.289263 | 0.100863 | 26.1% | 2.0 | dlopen
|===================================================================
Notes for table 2:
This table shows functions that have the most significant exclusive
time, taking the maximum time across ranks and threads.
For further explanation, see the "General table notes" below, or
use: pat_report -v -O profile_max ...
Table 2: Profile of maximum function times
Time% | Time | Imb. | Imb. | Function
| | Time | Time% | PE=[max,min]
|-----------------------------------------------------------
| 100.0% | 3.193838 | 0.292158 | 9.2% | main
...
Note
.ap2
is used to view performance data graphically with the Cray Apprentice2 tool.
.apa
is for suggested pat_build
options for more detailed tracing experiments.
Using Apprentice2 for analyzing results¶
Cray Apprentice2 is a GUI-based analysis tool that can be used to visualize performance data instrumented with the CrayPat tool. Cray offers a desktop version of the Cray Apprentice2 visualizer so you can do your analysis locally.
To install a desktop version, you can find the installer on Perlmutter as below:
$CRAYPAT_ROOT/share/desktop_installers
scp
an appropriate installer to your local machine and install it. After that, you will be able to open .ap2
file with Apprentice2.
Tips
If you encounter this error: /some/path/./a.out: error while loading shared libraries: pat.so: cannot open shared object file: No such file or directory
Try this: export LD_LIBRARY_PATH=$CRAY_LD_LIBRARY_PATH:$LD_LIBRARY_PATH
Specifying profiling region¶
CrayPat performance API can be use to identify the region of interest (ROG) for analysis.
PAT_record(int state)
- Setting the recording state to PAT_STATE_ON or PAT_STATE_OFF
- Needs to be inserted before the ROG
PAT_region_begin(int id, char *label)
- Defines the boundaries of a region
- Needs to be inserted at the start the ROG
PAT_region_end(int id);
- regions must be either separate or nested
- Needs to be inserted at the end of ROG
Example¶
#include <pat_api.h>
...
PAT_record(PAT_STATE_ON);
PAT_region_begin(1,"selector_function");
triangle_selector();
PAT_region_end(1);
...
pat_build
and run it, CrayPat will generate a trace of ROG in the performance data which can be found in the pat_report
Exemplar output
Table 1: Profile by Function Group and Function
Time% | Time | Imb. | Imb. | Calls | Group
...
|| 47.5% | 2.888249 | 0.000057 | 0.0% | 1.0 | #1.selector_function
||==================================================================
...
Note
Please be cautious that lgp_barrier()
will affect the load blance result when analysing with CrayPat region API. Therefore, it is recommended to insert the region before the lgp_barrier()
function.
Collecting hardware performance counters (HWPC)¶
The Performance Application Programming Interface (PAPI) allows you to programmatically collect hardware performance counters (HWPC) in your code. While the user is supposed to manually insert PAPI routines to specify what HWPCs are measured and when to start/stop measuing them, CrayPat dramatically facilitate that process. Specifically, all the user has to do is to just specify HWPC name(s) in an environment variable. Here are the steps to collect HWPCs with CrayPat:
Step 1: Find available hardware counters¶
Available hardware counters can be find with papi_avail
.
Available hardware counters on Perlmutter
PAPI_L1_DCM Level 1 data cache misses
PAPI_L2_DCM Level 2 data cache misses
PAPI_L2_ICM Level 2 instruction cache misses
PAPI_TLB_DM Data translation lookaside buffer misses
PAPI_TLB_IM Instruction translation lookaside buffer misses
PAPI_BR_MSP Conditional branch instructions mispredicted
PAPI_TOT_INS Instructions completed
PAPI_FP_INS Floating point instructions
PAPI_BR_INS Branch instructions
PAPI_VEC_INS Vector/SIMD instructions (could include integer)
PAPI_TOT_CYC Total cycles
PAPI_L2_DCH Level 2 data cache hits
PAPI_L1_DCA Level 1 data cache accesses
PAPI_L2_DCR Level 2 data cache reads
PAPI_L2_ICH Level 2 instruction cache hits
PAPI_L2_ICA Level 2 instruction cache accesses
PAPI_L2_ICR Level 2 instruction cache reads
PAPI_FML_INS Floating point multiply instructions
PAPI_FAD_INS Floating point add instructions
PAPI_FDV_INS Floating point divide instructions
PAPI_FSQ_INS Floating point square root instructions
PAPI_FP_OPS Floating point operations
Step 2: Selecting the hardware counters¶
Setting the environment variable PAT_RT_PERFCTR
to specific events/group:
- Predefined Counter Groups, e.g.,
export PAT_RT_PERFCTR=0
. - Specify individual events (maximum of 4 event at a time), e.g.
export PAT_RT_PERFCTR="PAPI_L2_DCM,PAPI_L2_ICM,PAPI_L2_DCH,PAPI_L2_DCR"
.
Note
Predefined Counter Groups can be found on slides 38 of NERSC PerformanceTools
Step 3: Run the instrumented excutable with CrayPat¶
After exporting the hardware counter variable we just need to run the instrumented excutable and generate human-readable content with pat_report
. A summary for the trace of ROG can be found in the text output of pat_report
Exemplar output
...
Table 3: Profile by Function Group and Function
Group / Function / PE=HIDE
==============================================================================
Total
------------------------------------------------------------------------------
Time% 100.0%
Time 6.079316 secs
Imb. Time -- secs
Imb. Time% --
Calls 66.455 /sec 404.0 calls
PAPI_L2_DCM 0.062M/sec 378,368 misses
PAPI_L2_ICM 0.051M/sec 310,119 misses
PAPI_L2_DCH 0.001G/sec 6,453,273 hits
PAPI_L2_DCR 0.001G/sec 6,929,521 ops
Average Time per Call 0.015048 secs
CrayPat Overhead : Time 0.0%
...