Using the Sequoia and Vulcan BG/Q Systems

This tutorial is intended for users of Livermore Computing's Sequoia BlueGene/Q systems. It begins with a brief history leading up to the BG/Q architecture. Configuration information for the LC's BG/Q systems is presented, followed by detailed information on the BG/Q hardware architecture, including the PowerPC A2 processor, quad FPU, compute, I/O, login and service nodes, midplanes, racks and the 5D Torus network. Topics relating to the software development environment are covered, followed by detailed usage information for BG/Q compilers, MPI, OpenMP and Pthreads. Math libraries, environment variables, transactional memory, speculative execution, system configuration information, and specifics on running both batch and interactive jobs are presented. The tutorial concludes with a discussion on BG/Q debugging and performance analysis tools.

Level/Prerequisites: Intended for those who are new to developing parallel programs in the IBM BG/Q environment. A basic understanding of parallel programming in C or Fortran is required. Familiarity with MPI and OpenMP is desirable. The material covered by EC3501 - Introduction to Livermore Computing Resources would also be useful.

Static vs. Dynamically Linked Libraries:

STAT for Corefiles:
  • The Stack Trace Analysis Tool (STAT), discussed in more detail in the STAT debugging section, can also be used to debug BG/Q lightweight core files.
  • Usage:
    • Use the core_stack_merge command to merge the lightweight core files produced by a crashed application into STAT .dot format files. For example: core_stack_merge -x myapplication -c core.
    • Two output files in .dot format will be produced: one with source line number information and one without.
    • Then use the stat-view command on the .dot file to view the call graph prefix tree. For example: stat-view
    • The application's call graph tree represents the global state of the crashed program.
    • Note: the .dot file without line number information can also be used with the stat-view command.
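    • A hypothetical end-to-end sequence (the lightweight core file pattern and the .dot output name are illustrative; actual output names are based on your executable):

        core_stack_merge -x myapplication -c core.*
        stat-view myapplication.dot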
  • If your job is hung, and it doesn't use a built-in signal handler to catch SIGSEGV signals, you can force it to terminate and dump core files by using the kill_job command to send a SIGSEGV signal to it. For example: /bgsys/drivers/ppcfloor/hlcs/bin/kill_job --id bg_jobid -s SIGSEGV where bg_jobid is the BG/Q jobid - not the SLURM jobid.
  • How to determine the BG/Q jobid:
    • Include --runjob-opts="--verbose INFO" as an option to your srun command when you start the job.
    • Otherwise, you will need to contact the LC Hotline and request that a BG/Q system admin use a DB2 query to get the BG/Q jobid.
  • Additional information about using STAT to debug BG/Q lightweight corefiles can be found on the LC internal (requires authentication) wiki at:


TotalView:

  • Important: This section only covers the very basics on getting TotalView started on Blue Gene systems. Please see the following also:
  • The TotalView parallel debugger is available on all LC Blue Gene systems.
  • LC's license allows users to run TotalView up to the full size of the system. However, practically speaking, the usable limit is currently between 4K and 8K compute nodes.
  • Most of the relevant commands should already be in your path from /usr/local/bin:
    totalview - TotalView Graphical User Interface
    totalviewcli - TotalView Command Line Interface
    srun - MPI job launch command. See the srun section for details.
    mxterm - A script that pops open an xterm window from a batch job so that a user can perform debug sessions interactively. Simply type mxterm for usage information.
  • Debugging batch partition jobs: you need to use the mxterm script (or something equivalent)
  • mxterm syntax:
    mxterm [#nodes] [#tasks] [#minutes] [msub arguments ...]
  • mxterm usage:
    • Login to a front-end login node and make sure that your Xwindows environment is set up properly. You can verify this by launching a simple X application such as xclock or xterm.
    • Issue the mxterm command with your specific parameters. Note that the #tasks argument is ignored, but you still need to enter a dummy value. If you're not sure of the syntax, just enter the mxterm command without arguments and hit return; a usage summary will be displayed.
    • mxterm will then automatically generate and submit a batch script for you (in the background).
    • You will be provided with the usual job id# which you can then use to monitor your job in the queue.
    • When your job begins executing, an xterm window will appear on your desktop machine.
    • In your new xterm window, launch your parallel job under TotalView's control. For example:

      totalview srun -a -N8 -n128 a.out

    • Once TotalView's opening windows appear, you can then begin your debug session.
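    • For illustration, a hypothetical mxterm request for an 8-node, 30-minute interactive session (the task-count argument is only a placeholder, as noted above) might look like:

        mxterm 8 128 30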
  • Attaching to an already running/hung batch job:
    1. You must first find where your job's srun process is running. It will be on one of the front-end nodes - but most likely NOT the front-end node you are logged into. Two easy ways to do this are shown below:
      Method 1: Shows the jobid, user and front-end node where each job's srun process is running:

        % squeue -o "%i %u %B"
        JOBID USER EXEC_HOST
        22030 joeuser vulcanlac5
        22040 swltest vulcanlac5
        22039 swltest vulcanlac6

      Method 2: If you know the jobid, you can use this variation of the scontrol command and grep the output for the front-end node where your srun process is running:

        % scontrol show job 73963 | grep BatchHost
        BatchHost=vulcanlac5
    2. Assuming that your Xwindows environment is set up correctly, launch totalview: totalview &
    3. After TotalView's opening windows appear, select the New Program window
      • Click on the Attach to process button.
      • Click on the Add Host... button
      • In the Add New Host dialog box, enter the name of the front-end node where your srun process is located, and click the OK button.
    4. You should now see a list of your processes - select the parent srun process and then click OK.
    5. TotalView will then attach to your srun process. You will probably need to Halt your srun process. After it stops, you can proceed to debug as normal.

Floating Point Exception Debugging:

  • TotalView supports floating point exception debugging on Blue Gene.
  • With the -qflttrap IBM compiler option, an offending task will generate a SIGFPE UNIX signal when it hits a specified exception.
  • Under TotalView a SIGFPE signal will cause your job to stop immediately. You can then use TotalView to perform a root cause analysis.
  • The syntax of the XL compilers' floating point exception trap option is -qflttrap=suboption[:suboption ...] (see the compiler man page and IBM compiler documentation for more information).
  • The suboptions determine what types of floating-point exception conditions to detect at run time. The suboptions are:
    enable Turn on checking for the specified exceptions
    inexact Detect and trap on floating-point inexact, if exception checking is enabled
    invalid Detect and trap on floating-point invalid operations
    nanq Detect and trap all quiet not-a-number (NaN) values
    overflow Detect and trap on floating-point overflow
    qpxstore Detect and trap Not a Number (NaN) or infinity values in Quad Processing eXtension (QPX) vectors.
    underflow Detect and trap on floating-point underflow
    zerodivide Detect and trap on floating-point division by zero
  • For example:
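    A representative compile line (a sketch only; the mpixlc wrapper mirrors the other examples in this tutorial, and the suboption list should be adjusted to your needs):

      mpixlc -g -qflttrap=overflow:zerodivide:invalid:enable -o myprog myprog.c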
  • Note: Enabling floating point exception trapping can significantly degrade program performance. For more information, see the discussion on "Performance Impact" (requires authentication).

TotalView Scalable Early Access (SEA) Program:

  • The DOE Tri-labs and Rogue Wave Software (TotalView vendor) are engaged in a collaboration to produce a more scalable, commercial grade, parallel debugger for use by DOE Tri-Lab users.
  • The purpose of this program is two-fold:
    • To assist Tri-Lab users in debugging those errors that emerge at a large process count (i.e., a scale that current production versions of TotalView cannot comfortably handle);
    • To gather early customer feedback on the project's direction before the improvements are folded into the production line of TotalView.
  • TotalView developers are very interested in collecting early end-user experiences, such as usability concerns and TotalView scalability, and performance realized on real-world field problems.
  • Users who wish to try out this version of TotalView should see the documentation located at: (internal wiki - requires OTP authentication).
  • Contact: Dong Ahn


Stack Trace Analysis Tool (STAT):

  • The Stack Trace Analysis Tool (STAT) gathers and merges stack traces from a parallel application's processes.
  • Primarily intended to attach to a hung job, and quickly identify where the job is hung.
  • The output from STAT consists of 2D spatial and 3D spatial-temporal graphs. These graphs encode the calling behavior of the application processes in the form of a prefix tree.
  • Graph nodes are labeled by function names. The directed edges show the calling sequence from caller to callee, and are labeled by the set of tasks that follow that call path. Nodes that are visited by the same set of tasks are assigned the same color.
  • STAT is also capable of gathering stack traces with more fine-grained information, such as the program counter or the source file and line number of each frame.
  • A GUI is provided for viewing and analyzing the STAT output graphs.
  • Location:
    • /usr/local/bin/stat-gui - GUI
    • /usr/local/bin/stat-cl - command line
    • /usr/local/bin/stat-view - viewer for DOT format output files
    • /usr/local/tools/stat - install directory, documentation
  • Using the STAT GUI for parallel jobs:
    • Assuming that you have a running job that is hung, and that you are logged into a BG/Q front-end "lac" node, use the stat-gui command to start the STAT GUI.
    • After it appears, it will display your srun processes on the node you are logged into. By default, it selects the parent srun process. Click the "Attach" button if this is correct.
    • If you don't see any srun processes, they are running on the "other" lac login node. Just type the other login node's name in the STAT GUI's "Search Remote Host" box.
    • After a few moments, a graph depicting the state of your job will appear, allowing you to determine where your job is hung.
    • Additional functionality for STAT can be found by consulting the "More information" links below.
  • More information:
    • Website:
    • User Guide:
    • "Running STAT on BG/Q" internal web page (requires authentication):
    • STAT command line man page: /usr/local/tools/stat/man
    • Developer website:

Performance Analysis Tools

What's Available?

The following performance analysis tools are available on LC's BG/Q platforms. These tools cover the full range of performance tuning: tracing, profiling, MPI, threads, and hardware event counters. Each is discussed in more detail in the following sections.

gprof: Standard Unix profiling utility that includes an application's routine call graph.
HPCToolkit (Rice University): Comprehensive, integrated suite of tools for parallel program performance analysis. Based on sampling for lower overhead. Serial, multithreaded and/or multiprocess codes.
IBM HPC Toolkit: Includes several components that can be used to trace and profile MPI programs, capture hardware events, and graphically visualize results. Serial, MPI, threaded, and hybrid applications.
mpitrace: Lightweight profiling and tracing library for MPI applications. Includes Hardware Performance Monitoring (HPM) statistics.
memP: Lightweight memory profiling library for MPI applications.
mpiP: Lightweight profiling library for MPI routines in applications.
Open|SpeedShop: An open source performance analysis tool framework that includes the most common performance analysis steps in one integrated tool. Comprehensive performance analysis for sequential, multithreaded, and MPI applications.
PAPI: Performance Application Programming Interface (PAPI). A standardized, cross-platform API for obtaining hardware counter statistics.
TAU: The TAU Performance System is an integrated, portable suite of performance analysis tools for the analysis of large-scale parallel applications.
Vampir / VampirTrace: VampirTrace is an open source tracing library. It can generate Open Trace Format (OTF) trace files and profiling data for MPI, OpenMP, Pthreads and PAPI events. Vampir is an OTF trace file viewer.
Valgrind: Valgrind is a suite of simulation-based debugging and profiling tools. The Memcheck tool detects a comprehensive set of memory errors, including reads and writes of unallocated or freed memory and memory leaks.

Performance Analysis Tools

gprof

  • Standard, text-based, Unix profiling utility that includes an application's routine call graph.
  • Can be used with C/C++ and Fortran.
  • gprof displays the following information:
    • The parent of each procedure.
    • An index number for each procedure.
    • The percentage of CPU time taken by that procedure and all procedures it calls (the calling tree).
    • A breakdown of time used by the procedure and its descendants.
    • The number of times the procedure was called.
    • The direct descendants of each procedure.
  • Example:
  • Location: /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gprof. The one in /usr/bin is for the front-end nodes.
  • Using gprof
    • Compile your program with the -pg option. If your compilation includes the -c option (to produce a .o file), then you will also need to include -pg during the link/load step.
    • Run the program. When it completes, you should have a file called gmon.out which contains runtime statistics. If you are running a parallel program, you will have multiple files differentiated by the process id which created them, such as gmon.out.0, gmon.out.1, gmon.out.2, etc.
    • For serial users, view the profile statistics with gprof by typing gprof at the shell prompt in the same directory that you ran the program. By default, gprof will look for a file called gmon.out and display the statistics contained in it.
    • For parallel users, view the profile statistics with gprof by typing gprof followed by the name of your executable and the gmon.out.X files you wish to view. You may view any single file or any combination.
    • Examples:
      gprof myprog gmon.out.0
      gprof myprog gmon.out.0 gmon.out.1 gmon.out.2
      gprof myprog gmon.out.
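      Putting the steps together, a hypothetical compile-and-run sequence for a parallel code (the compiler wrapper and srun options are illustrative; the BG/Q gprof path is from the Location note above):

        mpixlc -g -pg -o myprog myprog.c
        srun -N8 -n128 ./myprog
        /bgsys/drivers/ppcfloor/gnu-linux/bin/powerpc64-bgq-linux-gprof myprog gmon.out.0 > output.txt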
  • Notes:
    • When more than one gmon.out input file is specified, the resulting gprof report is a merge of the multiple inputs.
    • In most cases, you will want to redirect the output of gprof from stdout to a file, for example: gprof myprog gmon.out.12 > output.txt
  • More information: gprof man page

Performance Analysis Tools

HPCToolkit

  • HPCToolkit is an integrated suite of tools for measurement and analysis of program performance on computers ranging from multicore desktop systems to the largest supercomputers.
  • Uses low overhead statistical sampling of timers and hardware performance counters to collect accurate measurements of a program's work, resource consumption, and inefficiency, and attributes them to the full calling context in which they occur.
  • Works with C/C++ and Fortran applications that are either statically or dynamically linked.
  • Supports measurement and analysis of serial codes, threaded codes (pthreads, OpenMP), MPI, and hybrid (MPI + threads) parallel codes.
  • Primary components and their relationships:
    • hpcrun: collects accurate and precise calling-context-sensitive performance measurements for unmodified fully optimized applications at very low overhead (1-5%). It uses asynchronous sampling triggered by system timers and performance monitoring unit events to drive collection of call path profiles and optionally traces.
    • hpcstruct: To associate calling-context-sensitive measurements with source code structure, hpcstruct analyzes fully optimized application binaries and recovers information about their relationship to source code. In particular, hpcstruct relates object code to source code files, procedures, loop nests, and identifies inlined code.
    • hpcprof: overlays call path profiles and traces with program structure computed by hpcstruct and correlates the result with source code. hpcprof/mpi handles thousands of profiles from a parallel execution by performing this correlation in parallel. hpcprof and hpcprof/mpi generate a performance database that can be explored using the hpcviewer and hpctraceviewer user interfaces.
    • hpcviewer: a graphical user interface that interactively presents performance data in three complementary code-centric views (top-down, bottom-up, and flat), as well as a graphical view that enables one to assess performance variability across threads and processes. hpcviewer is designed to facilitate rapid top-down analysis using derived metrics that highlight scalability losses and inefficiency rather than focusing exclusively on program hot spots.
    • hpctraceviewer: a graphical user interface that presents a hierarchical, time-centric view of a program execution. The tool can rapidly render graphical views of trace lines for thousands of processors for an execution tens of minutes long, even on a laptop. hpctraceviewer's hierarchical graphical presentation is quite different from that of other tools - it renders execution traces at multiple levels of abstraction by showing activity over time at different call stack depths.
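  • As a rough illustration of how the components fit together, a generic measure-analyze-view sequence might look like the following (file and directory names are illustrative, and as noted below some steps are done differently on Blue Gene):

      hpcrun ./myprog                                                # measurement (launched under srun for parallel runs)
      hpcstruct myprog                                               # recover program structure from the binary
      hpcprof -S myprog.hpcstruct hpctoolkit-myprog-measurements     # correlate measurements with structure
      hpcviewer hpctoolkit-myprog-database                           # explore the resulting database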
  • Location: /usr/global/tools/hpctoolkit/bgqos_0
  • Using HPCToolkit:
    • Due to its multi-component and sophisticated nature, usage instructions for HPCToolkit are beyond the scope of this document. A few hints are provided below.
    • Be sure to use the LC dotkit package for HPCToolkit. The command use -l will list all available packages. Find the one of interest and then load it - for example: use hpctoolkit
    • Consult the User's Manual and other HPCToolkit documentation - links provided below.
    • Note the Blue Gene instructions, where applicable, in the documentation, as some things are done differently for these architectures.
  • More information:
    • HPCToolkit Documentation webpage:
    • HPCToolkit User's Manual:
    • HPCToolkit Man Pages (html format):
    • HPCToolkit Presentation:

Performance Analysis Tools

IBM HPC Toolkit

NOTE: IBM's HPC Toolkit is not currently available on LC's BG/Q clusters. Usage information will be added here when/if it becomes available.

Performance Analysis Tools

mpitrace

  • The mpitrace library can be used to profile an application and report:
    • MPI routines called - number of calls, average message size (bytes) and aggregate time spent
    • MPI routine call sites - the address where MPI routines are called, number of calls and aggregate time spent
    • The number of torus network hops from sender to destination for each message
    • Hardware Performance Monitor (HPM) counts - for selected hardware events
    • The application's heap memory footprint
    • Time spent at each source code statement - the number of "ticks" per line
  • The mpitrace library can also be used to trace MPI events during a program's execution. All or selected MPI events are saved to a binary events.trc file for later viewing with the traceview viewer.
  • Implemented via wrappers around MPI calls
  • Note: this useful library is an internal (non-product) tool provided to LC by Bob Walkup from IBM.
  • Locations:
    • /usr/local/tools/mpitrace/lib/ - Main version with full functionality
    • /usr/local/tools/mpitrace/lite/ - Lite version with a smaller memory footprint, but reduced functionality
    • /usr/local/tools/mpitrace/pthreads/ - HPM version for applications that use Pthreads
    • /usr/local/tools/traceview/ - traceview GUI source code. The traceview executable in this directory is provided for MS Windows platforms.
  • Compiling and linking:
    • Compile as usual, but using the -g flag is required for reporting call sites and source line ticks statistics.
    • Then link as shown below. Note that the examples shown are for using the main version. Use the paths shown above for the Lite and Pthreads versions.
      Basic MPI profiling and tracing:
        -L/usr/local/tools/mpitrace/lib -lmpitrace -L/bgsys/drivers/ppcfloor/bgpm/lib -lbgpm
      MPI profiling, tracing + HPM profiling:
        -L/usr/local/tools/mpitrace/lib -lmpihpm -L/bgsys/drivers/ppcfloor/bgpm/lib -lbgpm
      MPI + OpenMP profiling, tracing + HPM profiling:
        -L/usr/local/tools/mpitrace/lib -lmpihpm_smp -L/bgsys/drivers/ppcfloor/bgpm/lib -lbgpm
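      For instance, a complete compile-and-link line for an MPI C code using the main library might look like this (the mpixlc wrapper and file names mirror the other examples in this tutorial):

        mpixlc -g -o myprog myprog.c -L/usr/local/tools/mpitrace/lib -lmpitrace -L/bgsys/drivers/ppcfloor/bgpm/lib -lbgpm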
  • Instrumented/selective tracing: Only trace those parts of the program contained within trace start/stop calls. Syntax for the trace start/stop calls:
      Fortran:
        call trace_start()
        ! do work + mpi ...
        call trace_stop()
      C:
        void trace_start(void);
        void trace_stop(void);
        trace_start();
        /* do work + mpi ... */
        trace_stop();
      C++:
        extern "C" void trace_start(void);
        extern "C" void trace_stop(void);
        trace_start();
        // do work + mpi ...
        trace_stop();
  • Instrumented/selective profiling and source line "ticks" profiling: See the mpitrace documentation for instructions.
  • Running:
    • A number of environment variables control how profiling and tracing is performed. Some of these are described in the list below. Please consult the mpitrace documentation for additional details not covered here.
      PROFILE_BY_CALL_SITE - Set to yes to obtain the call site for every MPI function call. Requires compiling with the -g flag. Default: no
      TRACE_SEND_PATTERN - Set to yes to collect information about the number of hops for point-to-point communication on the torus network. Default: no
      SAVE_ALL_TASKS - Set to yes to produce an output file for every MPI rank. By default, output files are only produced for MPI rank 0 and the ranks having the minimum, median, and maximum times in MPI. Default: no
      SAVE_LIST - Specify a list of MPI ranks that will produce an output file. By default, output files are only produced for MPI rank 0 and the ranks having the minimum, median, and maximum times in MPI. Example: setenv SAVE_LIST 0,32,64,128,256,512. Default: unset
      TRACEBACK_LEVEL - In cases where there are deeply nested layers on top of MPI, you may want to profile higher up the call chain. Set this environment variable to an integer value above zero indicating how many levels above the MPI calls profiling should take place. Default: 0
      TRACE_DIR - Specify the directory where output files should be written. Default: working directory
      HPM_GROUP - Set to an integer value indicating which predefined hardware counter group to use. Hardware counter groups are listed in the file /usr/local/tools/mpitrace/CounterGroups. Default: 0
      HPM_PROFILE - Set to yes to turn on HPM profiling. The executable needs to have been linked with an HPM library. Default: unset
      HPM_SCOPE - Set to process or thread to aggregate hardware counter statistics at the process or thread level. See the documentation for explanation. Default: node
      TRACE_ALL_TASKS - For jobs that have more than 256 tasks, setting this to yes will cause all tasks to be traced. Can cause problems for large, long running jobs (too much data). Default: no
      TRACE_ALL_EVENTS - Set to yes to trace all MPI events. This is used if you don't explicitly instrument your source code with trace start/stop routine calls yourself. Default: no
      TRACE_MAX_RANK - Specifies the maximum task rank that should be traced. Can be used to override the default of 255 (256 tasks). Default: 255
      SWAP_BYTES - The event trace file is binary, and therefore sensitive to byte order. Trace files are written in little endian format by default. Setting this environment variable to yes will produce a big endian binary trace output file. Default: no
  • Output:
    • MPI profiling: The default is to produce plain text files of MPI data for MPI rank 0, and the ranks that had the minimum, median, and maximum times in MPI. Files are named mpi_profile.#.rank where # is a unique number for each job. The file for MPI rank 0 also contains a summary of data from all other MPI ranks.
    • HPM profiling: similar to MPI profiling, except the files are named hpm_process_summary.#.rank
    • MPI tracing: A single binary trace data file called events.trc is produced. Intended to be viewed with the traceview GUI utility.
    • The number of profiling output files produced, and the data they contain, can be modified by setting the environment variables in the above table.
    • Examples:
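      A hypothetical run that traces all MPI events and writes an output file for every rank (variable names are from the list above; the csh setenv syntax and srun options mirror the tutorial's other examples):

        setenv TRACE_ALL_EVENTS yes
        setenv SAVE_ALL_TASKS yes
        srun -N8 -n128 ./myprog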
  • Tracing caveats:
    • Tracing large, long running executables can generate a huge output file, even to the point of being useless.
    • Tracing incurs overhead and increases a job's runtime.
    • Optimized code may produce misleading or erroneous trace results.
  • Documentation:
    • Located in /usr/local/tools/mpitrace on BG/Q systems
    • MPI_Wrappers_for_BGQ.pdf - Main documentation
    • README.mpitrace - LC specific notes
    • CounterGroups - HPM counter groups

Performance Analysis Tools

memP

  • memP is a locally developed, lightweight, parallel heap profiling library based on the mpiP MPI profiling tool.
  • Primary feature is to identify the heap allocation that causes an MPI task to reach its memory in use high water mark (HWM).
  • Two types of memP reports:
    1. Summary Report: Generated from within MPI_Finalize, this report describes the memory HWM of each task over the run of the application. This can be used to determine which task allocates the most memory and how this compares to the memory of other tasks.
    2. Task Report: Based on specific criteria, a report can be generated for each task that provides a snapshot of the heap memory currently in use, including the amount allocated at specific call sites.
  • Location: /usr/local/tools/memP
  • Using memP:
    • Load the memP dotkit package with the command use memp
    • Compile with the recommended BG/Q flags and link your application with the required libraries: -Wl,-zmuldefs -L/usr/local/tools/memp/lib -lmemP
    • Examples:
      mpixlc -g -Wl,-zmuldefs -o myprog myprog.c -L/usr/local/tools/memP/lib -lmemP
      mpixlf77 -g -Wl,-zmuldefs -o myprog myprog.f -L/usr/local/tools/memP/lib -lmemP
    • Optional: set the MEMP environment variable to specify the type of output you desire, if other than the default, summary text file. See the "Output Options" discussion below.
    • Then run your MPI application as usual. You can verify that memP is working by the header and trailer output it sends to stdout, and by the output file(s) generated following execution.
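    • A hypothetical end-to-end sequence for an MPI C code (csh syntax; the srun options are placeholders):

        use memp
        mpixlc -g -Wl,-zmuldefs -o myprog myprog.c -L/usr/local/tools/memP/lib -lmemP
        srun -N8 -n128 ./myprog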
  • Output Options:
    • By default, a single text summary report showing the top HWM tasks will be produced:
    • Other options exist to produce reports on a per task basis, display call sites where the HWM is reached, set HWM thresholds, generate stack traces, and more.
    • XML format reports that can be viewed via an LC utility are also an option:
    • For details, see the "More information" link below.
  • More information:

Performance Analysis Tools

mpiP

  • mpiP is a lightweight profiling library for MPI applications.
    • Software developed by LLNL
    • Collects only statistical information about MPI routines, generating much less data than tracing tools
    • Captures and stores information local to each task (local memory and disk)
    • Uses communication only at the end of the application to merge results from all tasks into one output file
  • mpiP provides statistical information about a program's MPI calls:
    • Percent of a task's time attributed to MPI calls
    • Where each MPI call is made within the program (callsites)
    • Top 20 callsites
    • Callsite statistics (for all callsites)
  • Location: /usr/local/tools/mpip
  • Using mpiP:
    • Involves little more than compiling with the -g flag and linking with the mpiP library.
    • Examples:
      mpixlc -g -o myprog myprog.c -L/usr/local/tools/mpip/lib -lmpiP
      mpixlf77 -g -o myprog myprog.f -L/usr/local/tools/mpip/lib -lmpiP
    • After compiling, run your application as usual. You can verify that mpiP is working by the header and trailer output it sends to stdout, and the creation of a single output file (see "Output" below).
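    • For instance, a hypothetical run of the C example above (srun options are placeholders) produces a single output file such as myprog.128.XXXXX.mpiP, as described under Output below:

        srun -N8 -n128 ./myprog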
  • Output:
    • After your application completes, mpiP will write its output file to the current directory. The output file name will have the format of myprog.N.XXXXX.mpiP where N=#MPI tasks and XXXXX=collector task process id.
    • mpiP's output file is divided into 5 sections:
      1. Environment Information
      2. MPI Time Per Task
      3. Callsites
      4. Aggregate Times of Top 20 Callsites
      5. Callsite Statistics
    • Example:
  • More information:

Performance Analysis Tools

Open|SpeedShop

  • Open|SpeedShop is an open source performance analysis tool framework that integrates the most common performance analysis steps all in one tool.

  • Primary functionality:
    • Sampling Experiments
    • Support for Callstack Analysis
    • Hardware Performance Counters
    • MPI Profiling and Tracing
    • I/O Profiling and Tracing
    • Floating Point Exception Analysis
  • Instrumentation options include:
    • Unmodified application binaries
    • Offline and online data collection
    • Attach to running applications
  • Four user interface options:
    • Graphical user interface
    • Command line
    • Batch
    • Python scripting API
  • Designed to be modular and extensible. Supports several levels of plug-ins which allow users to add their own performance experiments.
  • Linux based platforms - currently IA64, IA32, EM64T, AMD64, IBM Power PC, Cray XT/XE and IBM Blue Gene.
  • Open|SpeedShop development is hosted by the Krell Institute. The infrastructure and base components are released as open source code primarily under LGPL.
  • Location: /usr/global/tools/openspeedshop/
  • Using Open|SpeedShop:
    • Due to its multi-component and sophisticated nature, usage instructions for Open|SpeedShop are beyond the scope of this document. A few hints are provided below.
    • Be sure to use the LC dotkit package for Open|SpeedShop. The command use -l will list all available packages. Find the one of interest and then load it - for example: use openss.
    • Consult the User's Guide and other Open|SpeedShop documentation - links provided below.
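    • As one illustration, a sampling experiment is typically launched by wrapping the normal run command with one of the O|SS convenience scripts (the script name and srun options here are illustrative; see the documentation for the experiments actually supported on BG/Q):

        osspcsamp "srun -N8 -n128 ./myprog"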
  • More information:
    • Open|SpeedShop website:
    • Open|SpeedShop documentation:
    • Quick Start Guide (2 pages):
      Local copy: Available Here
    • SGI Open|SpeedShop documentation:

Performance Analysis Tools

PAPI

  • Performance Application Programming Interface (PAPI) is an industry-standard, cross-platform API for obtaining hardware performance counter statistics, such as:
    • Branching, conditional, unconditional
    • Cache requests, hits, misses, L1, L2, L3
    • Stores, conditional, success, fail
    • Instruction counting
    • Loads, prefetches
    • Cycle stalls
    • Floating point operations
    • TLB operations
    • Hardware interrupts
  • Hardware events are recorded by making calls to the PAPI API routines for the events of interest.
  • There are two groups of events:
    • Preset Events: Standard API set of over 100 CPU events for application performance tuning. Application developers can access these events through the PAPI high-level API. A list of these events is available in the PAPI documentation.
    • Native Events: Platform specific events that extend beyond the Preset Event set. Require using PAPI's low-level API - generally intended for experienced programmers and tool developers.
  • Originally, the API focused on CPU events, but the more recent PAPI-C (PAPI Component) API includes other machine components such as network interface cards, power monitors and I/O units.
  • Both C and Fortran calling interfaces are provided.
  • On BG/Q, PAPI interfaces to a subset of IBM's BGPM (Blue Gene Performance Monitoring) API. BGPM includes over 400 events grouped into 5 categories, which map to the hardware unit where they are counted:
    • Processor Unit
    • L2 Unit
    • I/O Unit
    • Network Unit
    • CNK (compute node kernel) Unit
  • Location: /usr/local/tools/papi
  • Using PAPI:
    • Using PAPI in an application typically requires a few simple steps: include the event definitions, initialize the PAPI library, set up event counters, and link with the PAPI library.
    • Documentation on how to use the API can be found on the PAPI website - see its Documentation section.
    • A useful PAPI getting started tutorial is also available there.
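    • A minimal C sketch of the steps described above, counting two common preset events around a region of interest (preset event availability varies by platform, and the include/library paths in the comment are assumptions based on the Location above):

      /* papi_sketch.c - compile with something like:
         mpixlc -o papi_sketch papi_sketch.c -I/usr/local/tools/papi/include -L/usr/local/tools/papi/lib -lpapi */
      #include <stdio.h>
      #include <papi.h>

      int main(void)
      {
          int eventset = PAPI_NULL;
          long long counts[2];

          /* Initialize the PAPI library */
          if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT)
              return 1;

          /* Create an event set and add two preset events */
          PAPI_create_eventset(&eventset);
          PAPI_add_event(eventset, PAPI_TOT_INS);   /* total instructions completed */
          PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles */

          PAPI_start(eventset);
          /* ... code region to measure ... */
          PAPI_stop(eventset, counts);

          printf("instructions = %lld  cycles = %lld\n", counts[0], counts[1]);
          return 0;
      }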
  • More information:
    • PAPI website:
    • IBM BGPM API documentation - see the install directory at /bgsys/drivers/ppcfloor/bgpm/docs/html/index.html.
    • BG/Q native events (from the installation documentation):

Performance Analysis Tools

TAU

  • The TAU (Tuning and Analysis Utilities) Performance System is an integrated, portable profiling and tracing toolkit for performance analysis of parallel programs written in Fortran, C, C++, Java, and Python. From the Performance Research Lab, University of Oregon.

  • Profiling: shows how much time was spent in each routine
  • Tracing: when and where events take place
  • TAU instrumentation is used to accomplish both profiling and tracing. Three different methods:
    • Binary "rewriting"
    • Compiler directed
    • Source transformation (both automatic and selective)
  • Ability to use PAPI hardware counters
  • Graphical representation of profiling/tracing via TAU's ParaProf GUI tool.
  • Location: /usr/global/tools/tau/bgqos_0
  • Quickstart for TAU Profiling:
    1. Load the TAU environment: use tau
    2. Decide what you want to instrument by selecting the appropriate TAU stub makefile. These are named according to the metrics they record, and they are located in the bgq/lib subdirectory of your TAU installation. For example, in /usr/global/tools/tau/bgqos_0/tau-2.21.3/bgq/lib you will see makefile stubs such as:
       Makefile.tau-bgqtimers-mpi-pdt
       Makefile.tau-bgqtimers-pdt
       Makefile.tau-bgqtimers-mpi-pdt-openmp-opari
       Makefile.tau-bgqtimers-pdt-openmp-opari
       Makefile.tau-bgqtimers-papi-mpi-pdt
       Makefile.tau-bgqtimers-pthread-pdt
       Makefile.tau-bgqtimers-papi-mpi-pdt-openmp-opari
       Makefile.tau-depthlimit-bgqtimers-mpi-pdt
       Makefile.tau-bgqtimers-papi-pdt
       Makefile.tau-param-bgqtimers-mpi-pdt
       Makefile.tau-bgqtimers-papi-pdt-openmp-opari
       Makefile.tau-phase-bgqtimers-papi-mpi-pdt
       Makefile.tau-bgqtimers-papi-pthread-pdt
    3. Set TAU_MAKEFILE to the full pathname of the makefile stub you choose. For example: setenv TAU_MAKEFILE /usr/global/tools/tau/bgqos_0/tau-2.21.3/bgq/lib/Makefile.tau-bgqtimers-mpi-pdt
    4. Compile your program substituting the appropriate TAU compiler script for your usual compiler: tau_cxx.sh (C++), tau_cc.sh (C), tau_f90.sh (F90), tau_f77.sh (F77). These scripts should be in your path after following the setup instructions above.
    5. Run your program
    6. When finished, you should have files called profile.NNN, one per MPI task.
    7. View the output using the pprof (text) or paraprof (GUI) tools. Simply issue the pprof or paraprof command and it will look for the relevant profile.NNN files to display.
      Example output - simple 8 task MPI program:
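    A hypothetical sequence tying the quickstart steps together for an MPI C code (csh syntax; the stub makefile, wrapper script invocation, and srun options are illustrative):

      use tau
      setenv TAU_MAKEFILE /usr/global/tools/tau/bgqos_0/tau-2.21.3/bgq/lib/Makefile.tau-bgqtimers-mpi-pdt
      tau_cc.sh -o myprog myprog.c
      srun -N8 -n128 ./myprog
      pprof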
  • Note: TAU is a very full featured toolkit that cannot be covered here adequately. See the more information links below to get a better feel for what this performance analysis package can do.
  • More information:
    • TAU website:
    • Documentation including video demos:

Performance Analysis Tools

VampirTrace / Vampir

  • VampirTrace is an open source, performance analysis tool set and library used to instrument, trace and profile parallel applications. Developed at TU-Dresden, in collaboration with the KOJAK project at JSC/FZ Julich.
  • Supports applications using:
    • MPI
    • OpenMP
    • Pthreads
    • GPU accelerators
  • Trace events can include:
    • Application's routine/function calls
    • MPI calls
    • User defined events
    • PAPI performance counters
    • I/O
    • Memory allocations
  • Instrumentation options include:
    • Fully automatic - performed via compiler wrappers
    • Manual using the VampirTrace API
    • Fully automatic using the TAU instrumentor
    • Runtime binary instrumentation using Dyninst
  • Vampir is a proprietary trace visualizer developed by the Center for Information Services and High Performance Computing (ZIH) at TU Dresden. It is used to graphically display the Open Trace Format (OTF) output produced by VampirTrace.
  • Locations:
    • VampirTrace: /usr/global/tools/vampirtrace/bgqos_0
  • Quickstart for basic usage at LC:
    1. Load the VampirTrace environment: use vampirtrace-bgq
    2. Compile / link your code using one of the VampirTrace compiler wrappers: vtCC, vtc++, vtcc, vtcxx, vtf77 or vtf90. You need to tell the wrapper which native compiler you prefer. For example:

      vtcc -vt:cc mpixlc -o hello mpi_hello.c

    3. Set desired environment variables - there are many choices. For example, to do both profiling and tracing, and to prefix the output files with the name of the code:

      setenv VT_MODE STAT:TRACE
      setenv VT_FILE_PREFIX hello

    4. Run the executable
    5. View the output using the Vampir GUI:

      use vampir
      vampir myfile.otf

      NOTE: As of December, 2012, Vampir is only installed on the following LC systems: cab, edge, hera, sierra, rzmerl, rzzeus.

  • Output:
    • Profile data is written to a plain text file. Use the VT_FILE_PREFIX environment variable to give it a different name. Example:

        excl. time  incl. time  calls    excl. time/call  incl. time/call  name
        0.186s      0.186s      1        0.186s           0.186s           MPI_Finalize
        0.123s      0.123s      4033.75  30.459us         30.459us         MPI_Recv
        94.592ms    0.687s      1        94.592ms         0.687s           main
        53.345ms    53.345ms    2000     26.672us         26.672us         MPI_Ssend
        51.888ms    51.888ms    2000     25.944us         25.944us         MPI_Waitall
        47.471ms    47.471ms    2033.75  23.341us         23.341us         MPI_Send
        32.281ms    32.281ms    4000     8.070us          8.070us          MPI_Irecv
        29.833ms    29.833ms    1000     29.832us         29.832us         MPI_Sendrecv
    • Tracing data is written to an Open Trace Format (OTF) file named a.otf by default. Use the VT_FILE_PREFIX environment variable to name it something different.
    • Note: VampirTrace may create other output files that are not of viewing interest, particularly if the files are not merged into the two default files mentioned above.
  • Note: VampirTrace and Vampir are full featured tools that cannot be covered here adequately. See the more information links below to get a better feel for what they can do.
  • More information:
    • VampirTrace User Manual located under: /usr/global/tools/vampirtrace/doc
    • VampirTrace User Manual located on the TU-Dresden website
    • Vampir website located at
    • LC internal wiki page located at:
    • LC presentation materials located at:

Performance Analysis Tools

Valgrind

  • NOTE: Valgrind is not currently available on the LC BG/Q systems. Usage information will be added here when/if it becomes available.

  • The Valgrind tool suite provides a number of debugging and profiling tools that help you make your programs faster and more correct.
  • The Valgrind distribution currently includes the following tools:
    • Memcheck: a memory error detector. It helps you make your programs, particularly those written in C and C++, more correct.
    • Cachegrind: a cache and branch-prediction profiler. It helps you make your programs run faster.
    • Callgrind: a call-graph generating cache profiler. It has some overlap with Cachegrind, but also gathers some information that Cachegrind does not.
    • Helgrind: a thread error detector. It helps you make your multi-threaded programs more correct.
    • DRD: also a thread error detector. It is similar to Helgrind but uses different analysis techniques and so may find different problems.
    • Massif: a heap profiler. It helps you make your programs use less memory.
    • DHAT: a different kind of heap profiler. It helps you understand issues of block lifetimes, block utilisation, and layout inefficiencies.
    • SGcheck: an experimental tool that can detect overruns of stack and global arrays. Its functionality is complementary to that of Memcheck: SGcheck finds problems that Memcheck can't, and vice versa.
    • BBV: an experimental SimPoint basic block vector generator. It is useful to people doing computer architecture research and development.
  • Valgrind is also an instrumentation framework for building dynamic analysis tools - you can use it to build your own tools.
  • For more information visit

Documentation, Help and References

Local Documentation - General and BG/Q Specific:

  • Livermore Computing user web pages:
  • MyLC Livermore Computing user portal:
  • Livermore Computing tutorials:
  • Livermore Computing internal (requires authentication) wiki BG/Q web pages:
  • Sequoia Summary/Cheat Sheet
    Available Here
  • Known Problems List on the Livermore Computing internal (requires authentication) wiki BG/Q web pages:
  • Online under /usr/local/docs:
    • text: /usr/local/docs/rzuseq.basics
    • PDF: /usr/local/docs/rzuseq.basics.pdf. PDF files may be viewed using evince.
    • html: /usr/local/docs/BGQ. Note that in addition to firefox, the text-based elinks browser is also available. Note: If X-Windows applications such as firefox or [x]emacs crash, you need to update the X-server on your desktop.
    • IBM BG/Q Compiler manuals
    • Other IBM BG/Q documents

Help - LC Hotline:

  • The LC Hotline staff provide walk-in, phone and email assistance weekdays 8:00am - noon, 1:00pm - 4:45pm.
  • Walk-in Consulting
    • On-site users can visit the LC help desk consultants in Building 453, Room 1103. Note that this is a Q-clearance area.
  • Phone:
    • (925) 422-4531 - Main number
    • 422-4532 - Direct phone line for technical consulting help
    • 422-4533 - Direct phone line for support help (accounts, passwords, forms, etc)
  • Email
    • Technical Help:
    • Support:

Help - BG/Q Specific:

  • Sequoia Users Meeting: third Thursday each month. Held in B451 White Room from 3:00-4:00pm. Web conference and phone dial-in numbers available - contact the LC Hotline.
  • "BG/Q Virtual Water Cooler" telecon every Thursday (except 3rd) from 3:00- 4:00pm. Intended to be an open user forum discussion regarding the Seq/Vulcan/rzuseq systems. Available for consulting with domain experts on topics such as porting codes, system status, jobs scheduling, file systems, etc.


  • Author: Blaise Barney, Livermore Computing.
  • 1 "The Blue Gene/Q Compute Chip", Ruud Haring, IBM BlueGene Team, IBM
    Presentation at Hot Chips 23 Conference, Mar 18, 2011
  • 2 "Blue Gene/Q"
    IBM Presentation at SC11 Conference, Nov 12-18, 2011. Seattle, WA
  • 3 "IBM uncloaks 20 petaflops BlueGene/Q super", The Register, Nov 2010
  • 4 "IBM System Blue Gene Solution: Blue Gene/Q Hardware Overview and Installation Planning". IBM Redbook SG24-7872-00.
  • "IBM System Blue Gene Solution: Blue Gene/Q Application Development" (SG247948)
  • Compilers
    • IBM C/C++ Compiler Documentation for BG/Q:
    • IBM Fortran Compiler Documentation for BG/Q:
    • GNU Compilers Documentation
  • "QPX Architecture, Quad Processing eXtension to the Power ISA"
    IBM, Thomas Fox, May 9, 2012
  • "A2 Processor User's Manual for Blue Gene/Q". IBM. Oct. 2012
  • ESSL documentation:
  • The Mathematical Acceleration Subsystem (MASS) and MASSV libraries
  • "IBM System Blue Gene Solution: Blue Gene/Q System Administration" (SG247869)
  • 5 "IBM Blue Gene/Q Overview - PRACE Winter School". Pascal Vezolle, February 6-10, 2012.
  • BG/Q web pages on IBM's web site:
  • Argonne Leadership Computing Facility (ALCF) BG/Q Presentation Materials
  • MPI and Message Passing Programming Paradigms
    • MPICH2:
    • Aggregate Remote Memory Copy Interface (ARMCI)
    • Charm++
    • Global Arrays
    • Berkeley Unified Parallel C
    • Global Address Space Networking (GASNet)
  • Sequoia Scalable Applications Preparation (SAP) Project
  • Photos/Graphics: Permission to use IBM photos/graphics has been obtained by the author and is on file. Other photos/graphics have been created by the author, created by other LLNL employees, obtained from non-copyrighted sources, or used with the permission of authors from other presentations and web pages.

This completes the tutorial.

Evaluation Form       Please complete the online evaluation form - unless you are doing the exercise, in which case please complete it at the end of the exercise.

Where would you like to go now?

  • Exercise
  • Agenda
  • Back to the top
