Blue Gene/Q


System Overview

The Blue Gene/Q is a 5-rack, 5,120-node IBM Blue Gene/Q system. Each node has a 16-core, 1.6 GHz IBM A2 processor and 16 GB of DDR3 memory.

Note: The A2 processor is bi-endian but defaults to big endian. If you are migrating binary data between x86 (little endian) and the Blue Gene, you must account for the difference in byte order.
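
If raw binary data must be shared between the two architectures, one option is explicit byte swapping when the data is read. The sketch below illustrates the idea for 32-bit integers; the swap32 helper is illustrative only, and alternatives include always writing data in a fixed byte order (for example with htonl/ntohl) or using a serialization library.

#include <cstdint>
#include <cstdio>

// Illustrative helper: byte-swap a 32-bit value written on a little-endian x86 host.
static uint32_t swap32(uint32_t v) {
  return ((v & 0x000000FFu) << 24) | ((v & 0x0000FF00u) << 8) |
         ((v & 0x00FF0000u) >> 8)  | ((v & 0xFF000000u) >> 24);
}

int main() {
  uint32_t x86_value = 0x11223344u;   // bytes as they would appear from the x86 side
  printf("0x%08x -> 0x%08x\n", x86_value, swap32(x86_value));
  return 0;
}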

Access

The primary front-end node is named amos or q (ssh q or ssh amos), and it can be accessed with any authorized CCI account from the landing pads. A second front-end node named q2 is also available (ssh q2).

The main CCI GPFS File System is available on the Blue Gene/Q.

Compiler Environments

The BG/Q uses modules to coordinate the different available compilers and environment options. The module command manages your $PATH and $LD_LIBRARY_PATH environment variables for you.

The xl and gnu modules set up your environment for building applications that will run on the compute nodes. The BG/Q GNU toolchain is indicated by the "powerpc64-bgq-linux-" target prefix. If you want to build a tool that will only run on the front-end node, see Building for the Front-End Node.

To begin, you can use

module avail

to list the currently available modules. (You can also write and use your own modules; ask if you're interested in setting this up.)

To list the currently loaded modules:

module list

To load the IBM XL compiler set, and mpicc wrapper scripts, you can run:

module load xl

GNU compilers:

module load gnu

To unload a single module:

module unload <module>

To remove all loaded modules:

module clear

Compilers

Build Systems

Quirks

Some build systems do not recognize the MPICC environment variable set by the modules. If you receive errors of the form "undefined reference" related to missing MPI, PAMI, or other communication libraries, please ensure your build system or Makefile is using the MPI wrapper specified in the MPICC environment variable.

You may add CC=$MPICC as a quick-fix to point the build system at the MPI wrapper instead of directly at the C compiler.
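
For example, a sketch of a typical configure- or Makefile-based build (the configure options and project are placeholders):

 module load xl
 ./configure CC=$MPICC      # or: make CC=$MPICC for a plain Makefile project
 make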

CMake

/usr/bin/cmake points to version 2.6.

/usr/bin/cmake28 points to version 2.8.

Version 3.0 is available by loading the cmake module:

 module load cmake

Various CMake toolchain files are available through the environment variables CMAKE_GNU_TOOLCHAIN, CMAKE_GNU47_TOOLCHAIN, and CMAKE_XL_TOOLCHAIN. They can be used with a CMake build by setting the cmake variable:

 -DCMAKE_TOOLCHAIN_FILE=$CMAKE_XYZ_TOOLCHAIN
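
For example, a sketch of an out-of-source build with the XL toolchain file (the source path is a placeholder):

 module load cmake xl
 cmake -DCMAKE_TOOLCHAIN_FILE=$CMAKE_XL_TOOLCHAIN /path/to/source
 make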

GNU Autotools

automake, autoreconf, libtool and friends are available in /usr/bin.

imake

(Deprecated) imake, specifically makedepend, is available; however, imake has been deprecated since 2005 and dormant since 2009. Moving to either CMake or GNU Autotools is highly recommended.

Libraries

IBM's ESSL (Engineering and Scientific Subroutine Library, an accelerated math library), version 5.1, is installed under /bgsys/ibm_essl/prod/opt/ibmmath/.
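
A hedged sketch of linking against ESSL with the XL MPI wrapper; the lib64 subdirectory and the esslbg library name are assumptions, so check the installation directory for the exact names, and some ESSL routines may also require the XL Fortran runtime libraries:

 module load xl
 mpicc myprog.c -o myprog -L/bgsys/ibm_essl/prod/opt/ibmmath/lib64 -lesslbg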

Running

Slurm

Slurm is used for scheduling.

The job queue can be viewed here: https://secure.cci.rpi.edu

System-wide limits

By default a single user may submit a maximum of 128 jobs to the queue.

Partitions

Partition Name     Max Runtime (hours)     Job Size (Nodes)
debug (default)    1                       1, 2, 4, ... 32
small              24                      1, 2, 4, ... 32, 64
medium             12                      128, 256, 512
large              6                       1024, 2048
verylarge          6                       3072, 4096

Other job sizes are not possible due to hardware constraints.

You can use salloc to request a partition to run on. For example,

salloc --nodes 1024 --time 60 --partition large

will request the entire rack as one partition and drop you into a shell when successful. Please make sure to exit that shell once you are done. The salloc method is useful mainly for testing; submitting a batch script is preferred. It is not necessary to use salloc to request an allocation before starting a job with srun or sbatch.

Jobs are started on the BG/Q hardware with srun. To run on all 1024 nodes of a rack, with one MPI task per core, you would launch:

srun --partition large --time 60 --ntasks 16384 /path/to/executable

sbatch can also be used to run batch jobs. An example sbatch script (equivalent to the testjob.sh script from the BG/L) is:

#!/bin/sh
#SBATCH --job-name=TESTING
#SBATCH -t 04:00:00
#SBATCH -D /gpfs/u/<home or barn or scratch>/<project>/<user>
#
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email>

srun -o testing.log ./my-bgq-executable <options>
# additional calls to srun may be placed in the script, they will all use the same partition 

Note that you can launch up to 64 tasks per node. (4 threads per core * 16 cores = 64 threads per node.) Whether this is the most efficient way to run your code is up to you to decide.
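
For example, a sketch of a fully subscribed 32-node debug run (32 nodes × 64 tasks per node = 2048 tasks; the executable path is a placeholder):

 srun --partition debug --time 60 --nodes 32 --ntasks 2048 /path/to/executable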

It is suggested to begin with 32 nodes and then scale up. Note that 32 BG/Q nodes have more raw power than an entire BG/L rack; the full 1024-node system is twice as fast as the entire 16-rack BG/L.

Time limit requirement

It is necessary in all cases to specify a time limit or runtime estimate when requesting an allocation. This can be done with the -t/--time command-line option. If this option is not present, job submission will fail with the message: "error: Unable to allocate resources: Missing time limit". The time limit must be within the limits of the selected partition.
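
For example, both of the following request a one-hour limit (myscript.sh is a placeholder):

 salloc --nodes 32 --time 60 --partition small
 sbatch -t 01:00:00 myscript.sh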

Job arrays/Many small jobs

When running many small jobs simultaneously, it is better to submit one large job rather than many small jobs; doing otherwise leads to resource fragmentation and poor scheduler performance. A group of small jobs that collectively use 256 nodes or more should be submitted together as one job. Example:

#!/bin/sh
#SBATCH --job-name=TESTING
#SBATCH -t 04:00:00
#SBATCH -D /gpfs/u/<home or barn or scratch>/<project>/<user>
#
#SBATCH --mail-type=ALL
#SBATCH --mail-user=<email>

srun -N8 -o testing.log ./my-bgq-executable <options> &
srun -N8 -o testing2.log ./my-bgq-executable <options> &
srun -N8 -o testing3.log ./my-bgq-executable <options> &
srun -N8 -o testing4.log ./my-bgq-executable <options> &
wait

The differences are the addition of an ampersand (&) at the end of each srun command and the wait command at the end of the script. This runs all four srun invocations in parallel within the allocation and waits until all four are complete. For this example, the batch script should be submitted as sbatch -N32 <script> to ensure enough nodes are allocated for everything that will run in parallel.

Runjob options

Since users interact with Slurm rather than runjob as documented in the IBM manuals, the availability of some runjob options is not apparent. However, Slurm actually wraps the runjob command, and additional runjob-specific options that Slurm does not directly support are still available through the srun/sbatch option --runjob-opts. This switch accepts a string that is passed directly to the underlying runjob command. This is useful for changing torus mappings and for debugging.

Torus mapping

Torus mappings can be changed by utilizing the --runjob-opts switch on srun/sbatch. Example:

srun --runjob-opts="--mapping TEDCBA"

Memory Organization

Some applications with few processes per node that also run close to the memory limit of each node (16 GB) may crash as they approach the limit. This is due to the default memory organization on the compute node. A work-around is to set BG_MAPCOMMONHEAP=1 in the environment when submitting a job. Caution: this causes all heap memory to be allocated in one global memory space, which can lead to memory corruption or crashes if a process does not handle memory correctly and/or touches another process's memory space.
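
For example, following the --envs pattern used elsewhere on this page (the executable path is a placeholder):

 srun --runjob-opts="--envs BG_MAPCOMMONHEAP=1" /path/to/executable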

OpenMP "Optimizations"

Setting OMP_WAIT_POLICY=ACTIVE in the environment when running an OpenMP application can lead to a significant speedup. However, it also causes OpenMP threads to spin instead of yielding when they need to wait, which can degrade performance in non-OpenMP portions of the application. This degradation can be reduced by setting BG_SMP_FAST_WAKEUP=YES. If you intend to oversubscribe the hardware threads while using BG_SMP_FAST_WAKEUP=YES, you must also set OMP_PROC_BIND=TRUE to prevent deadlocks.
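
For example, these variables can be passed to the compute nodes with the same --envs mechanism shown above (the executable path is a placeholder):

 srun --runjob-opts="--envs OMP_WAIT_POLICY=ACTIVE BG_SMP_FAST_WAKEUP=YES" /path/to/executable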

MPI_THREAD_MULTIPLE

Set the runjob environment variable PAMID_ASYNC_PROGRESS (and the other associated variables) to disable the PAMI communication threads. Having these threads active has been observed to reduce performance when running 64 user threads or processes per node. See the BG/Q Application Development Redbook for more details. Alternatively, the XL Legacy compiler and wrappers can be used.

MY_RUNJOB_OPTS="PAMI_MAX_COMMTHREADS=0 PAMID_ASYNC_PROGRESS=0 PAMID_CONTEXT_POST=1 BG_THREADMODEL=1"
srun -n <processes> --runjob-opts="--envs $MY_RUNJOB_OPTS"

L1P Prefetcher Tuning

Each of the 16 cores on the A2 processor has a 16 KB L1 data cache. An additional L1P (level-1 prefetch) unit controls prefetching into the L1 cache. Details on the L1P unit are in section 4.8 of the Application Development Redbook.

Code

The following code demonstrates how iteration over a memory-fragmented data structure can be improved using the 'perfect prefetcher' mode of the L1P unit.

#include <cstdlib>
#include <cstdio>
#include <set>
#include <mpi.h>
#include <spi/include/l1p/pprefetch.h>

#define COUNT 4*1000*1000
#define ITERS 8
#define SEED 42
#define LIST_SIZE 10*1024*1024 

void printL1PStatus() {
  L1P_Status_t status;
  L1P_PatternStatus(&status);
  fprintf(stdout,"L1P_Status_t finished %d abandoned %d maximum %d\n",
    (int)status.s.finished,
    (int)status.s.abandoned,
    (int)status.s.maximum);
}

int main(int argc, char** argv) {
  MPI_Init(&argc,&argv);

  srand(SEED);
  fprintf(stdout,"count %d iters %d seed %d list_size %d\n",COUNT,ITERS,SEED,LIST_SIZE);

  double t0 = MPI_Wtime();
  std::set<int> s;
  for (int i=0; i < COUNT; ++i)
    s.insert(rand());
  double t1 = MPI_Wtime();

  L1P_PatternConfigure(LIST_SIZE);
  printL1PStatus();

  double t2 = MPI_Wtime();
  size_t sum=0;
  int cache_max[ITERS];
  for (int i=0; i < ITERS; ++i) {
    //std::set<int> s2;
    L1P_PatternStart(!i);
    for (std::set<int>::iterator it = s.begin();
         it != s.end();
         ++it) {
      //s2.insert(*it);  // uncomment this for *huge* performance decrease
      sum += *it;
    }
    L1P_PatternStop();

    L1P_Status_t status;
    L1P_PatternStatus(&status);
    cache_max[i] = status.s.maximum;
  }
  double t3 = MPI_Wtime();

  printf("construction time %f seconds\n",t1-t0);
  printf("iteration time %f seconds\n",t3-t2);
  printf("L1P_Status_t maximum[0:ITERS-1]: ");
  for (int i=0; i < ITERS; ++i)
    printf("%d ", cache_max[i]);
  printf("\n");
  printL1PStatus();

  printf("sum: %d\n",sum);
  MPI_Finalize();
  return 0;
}

Build

 module load xl
 mpicxx -O3 fragment.cc -lSPI_l1p -o fragment.xl

Run

run.sh

 #!/bin/bash
 module load xl
 srun ./fragment.xl

submit

 sbatch -t 5 -n 1 run.sh

Results

Using L1P perfect prefetching yields a 3.38x decrease in iteration time vs. default prefetching. With L1P perfect prefetching enabled, the iteration time matches that of an x86_64 GNU/Linux machine.

Without prefetching - SCOREC romulus, GCC compiler

 count 4000000 iters 8 seed 42 list_size 10485760
 construction time 4.072692 seconds
 iteration time 3.760265 seconds
 sum: -257633432

Without prefetching - BGQ, XL compiler

 count 4000000 iters 8 seed 42 list_size 10485760
 construction time 16.347557 seconds
 iteration time 12.913008 seconds
 sum: -257633432

With prefetching - BGQ, XL compiler

 count 4000000 iters 8 seed 42 list_size 10485760
 L1P_Status_t finished 1 abandoned 0 maximum 0
 construction time 16.357077 seconds
 iteration time 3.814678 seconds
 L1P_Status_t maximum[0:ITERS-1]: 0 0 0 0 0 0 0 0 
 L1P_Status_t finished 1 abandoned 0 maximum 0
 sum: -257633432

With prefetching - BGQ, GNU compiler

 count 4000000 iters 8 seed 42 list_size 10485760
 L1P_Status_t finished 1 abandoned 0 maximum 0
 construction time 16.203492 seconds
 iteration time 3.796874 seconds
 L1P_Status_t maximum[0:ITERS-1]: 0 0 0 0 0 0 0 0 
 L1P_Status_t finished 1 abandoned 0 maximum 0
 sum: -257633432

Debugging

Totalview

Totalview is available for academic use by loading the totalview module:

 module load proprietary/totalview

To run Totalview first obtain an interactive allocation:

 salloc -n <# Processes> -t <Max Wall Time>

Then run your executable within Totalview

 totalview --args srun <Your Executable> <Your Executables Arguments>

An interesting PDF on how Totalview works:

http://www.alcf.anl.gov/sites/www.alcf.anl.gov/files/L2P_TotalView_0.pdf

GDB

Instructions for running GDB on compute-node jobs are located in the Blue Gene/Q Application Development Redbook on page 105.

Debugging from startup

The following commands are for debugging a 2-process job from startup.

  1. Create an interactive allocation
     salloc -p debug -n 2 -t 60 
  2. Load the gdb module
     module load gdb 
  3. Run the executable within gdbtool
     srun --runjob-opts="--start-tool `which gdbtool` " --ntasks <number of MPI ranks> <executable> <executable arguments> 
    • The following should be output:
       Enter a rank to see its associated I/O node's IP address, or press enter to start the job:
    • Do not hit enter.
  4. Open a new terminal on the BGQ front end node and connect gdb to the rank 0 process
    • Load the gdb module
       module load gdb 
    • Attach GDB to the rank 0 process:
       pgdb 0 
  5. Repeat step 4 for connecting gdb to the rank 1 process by passing '1' to pgdb instead of '0'.
  6. In the terminal where the 'srun' command was executed press enter to start running your application.
  7. The application can now be debugged from the two terminals running gdb.

There are two critical steps for using gdb on BGQ: running the gdb server via gdbtool, and attaching the gdb client via pgdb. A maximum of four ranks can be debugged.

Disconnect GDB

To disconnect GDB from a process enter the following at the gdb prompt:

 disconnect

Debugging Hung Processes

Attaching GDB to a running process

If a job appears to be hanging you can use the following commands to attach gdb to the processes:

  1. Load the gdb module
     module load experimental/gdb 
  2. Attach GDB to an MPI rank:
     pgdb -t <mpi rank> 

Note that the same limits apply when running GDB this way: a maximum of four ranks can be debugged simultaneously.

Forcing core files to be written

The following assumes you have only one job currently running on AMOS. First, run the following to get the job ID. Note that this is not the same ID that Slurm uses.

 list_jobs -l | awk '/ID/ {print $2}'

Run the following command to send the segfault signal (signal 11) to the job.

 kill_job -s 11 --id <output of the first command>

Alternatively, using Slurm:

 scancel --signal=ABRT <slurm job id>

Core Files

Add the following option to the srun command to enable the output of core files.

 --runjob-opts="--envs BG_COREDUMPDISABLED=0 BG_COREDUMPONEXIT=1 "

Core files can be viewed with the coreprocessor tool.

 /bgsys/drivers/ppcfloor/coreprocessor/bin/coreprocessor.pl -b=<executable> -c=/path/to/directory/with/core/files

More info on the coreprocessor tool is in the Administration Redbook.

Text based core processing tool

Create a file named 'getStack.sh' with the following contents:

#!/bin/bash -e
# Extract a human-readable stack trace from a BG/Q lightweight core file.
expectedArgs=2
if [ $# -ne $expectedArgs ]; then
  echo "Usage: $0 <corefile> <exe>"
  exit 1
fi
corefile=$1
stackfile=stack${corefile##core}
exe=$2
echo input: $corefile
echo output: $stackfile
# Locate the lines that bound the STACK section of the core file.
grep -n STACK $corefile | awk -F : '{print $1}' > lines
let s=`head -n 1 lines`+2
let f=`tail -n -1 lines`-1
# Pull out the saved addresses and rewrite them with a 0x prefix.
sed -n ${s},${f}p $corefile | awk '{print $2}' | perl -pe 's/000000000/0x/g' > core.addy
# Translate the addresses to source locations using the executable's debug info.
addr2line -e $exe < core.addy > $stackfile
rm lines
rm core.addy

Make it executable:

chmod +x getStack.sh

To generate a file with the stack trace from a given core file run:

./getStack.sh core.##### /path/to/executable/that/generated/core/files/foo.exe

Building for the Front-End Node

Compilers in /usr/bin are meant for the service/front-end nodes. As a rule of thumb, anything under /bgsys is intended for the BG/Q hardware (with /bgsys/linux/ specifically for the I/O nodes), and anything elsewhere is for the service/front-end nodes.
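
For example, a sketch of building the same source for each target, assuming the gnu module puts the cross compiler on your PATH (file names are placeholders):

 # Front-end/service node binary (runs only on the front-end nodes)
 /usr/bin/gcc -O2 mytool.c -o mytool.fen

 # Compute-node binary via the BG/Q GNU cross toolchain
 module load gnu
 powerpc64-bgq-linux-gcc -O2 mytool.c -o mytool.bgq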

Useful Links