DCS Supercomputer

This page is a guide for CCI users wishing to use the IBM DCS supercomputer.

Users may connect to dcsfen01 and dcsfen02 to build and submit jobs via Slurm.

Note: dcsfen02 acts as a front-end node and is also available to run debug jobs. It is prioritized over other nodes to limit resource fragmentation and improve system utilization. This has implications for performance and resource availability on the node, particularly for GPUs. For example, if a debug job has requested GPUs in exclusive mode, they may not be available to run other code.

System information

16 nodes each housing:

  • 2x IBM Power 9 processors clocked at 3.15 GHz. Each processor contains 18 cores, each with 4 hardware threads (144 logical processors per node).
  • 4x NVIDIA Tesla V100 GPUs with 16 GiB of memory each
  • 512 GB RAM
  • 1.6 TB Samsung NVMe Flash

The first 2 nodes also contain a Nallatech FPGA.

All nodes are connected with EDR InfiniBand and connect to the unified File System.

Building software

Many packages for building software are available as modules. However, some tools are also available without loading any modules, and a subset of those can be overridden by versions from modules. Please pay careful attention to which modules you have loaded; a short example follows the module list below.

Build systems/tools:

  • ninja 1.7.2
  • cmake 3.13.4 (cmake3)
  • autoconf 2.69
  • automake 1.13.4

Compilers:

  • gcc 4.8.5
  • clang/llvm 3.4.2

Currently the following are available as modules:

  • automake 1.16.1
  • bazel 0.17.2, 0.18.0, 0.18.1, 0.21.0
  • ccache 3.5
  • cmake 3.12.2, 3.12.3
  • gcc 6.4.0, 6.5.0, 7.4.0, 8.1.0, 8.2.0
  • xl/xl_r (xlC and xlf) 13.1.6, 16.1.0
  • MPICH 3.2.1 (mpich module, built with XL compiler)
  • CUDA 9.1, 10.0
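
For example, to see what is available and load a newer toolchain (a minimal sketch; the exact module names and versions are assumptions, so check module avail for the current list):

  # List everything provided through the module system.
  module avail

  # Load a newer GCC and CMake; loaded modules take precedence over the
  # system-default tools listed earlier.
  module load gcc/8.2.0 cmake/3.12.3

  # Confirm which compiler is now first on the PATH.
  which gcc && gcc --version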

CUDA

When mixing CUDA and MPI, please make sure an xl module is loaded and that nvcc is called with -ccbin $(CXX); otherwise linking will fail.

CUDA code should be compiled with -arch=sm_70 for the Volta V100 GPUs.
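
A minimal sketch of building a mixed CUDA/MPI program with the XL-built MPICH is shown below (the module names, the $CXX and $CUDA_HOME variables, and the file names are assumptions; adjust them to your environment):

  # Load the XL compiler, the XL-built MPICH, and CUDA (exact module names
  # and versions are assumptions; check module avail).
  module load xl mpich cuda

  # Compile device code for the V100 (compute capability 7.0), pointing nvcc
  # at the XL C++ compiler as its host compiler so the objects link cleanly
  # against the XL-built MPICH. $CXX is assumed to name the XL C++ compiler
  # (e.g. xlC).
  nvcc -arch=sm_70 -ccbin "$CXX" -c kernels.cu -o kernels.o

  # Compile the MPI host code and link with the MPICH wrapper, adding the
  # CUDA runtime library ($CUDA_HOME being set by the cuda module is assumed).
  mpicxx -c main.cpp -o main.o
  mpicxx main.o kernels.o -o app -L"$CUDA_HOME/lib64" -lcudart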

XL MPICH Compiler Wrapper Flags

The default MPICH compiler wrapper flags, -O3 -qipa -qhot, enable aggressive optimizations that can alter the semantics of your program. If the compiler applies such an optimization, the following warning message is displayed:

1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program. Please refer to documentation on the STRICT/NOSTRICT option for more information.

Specifying flags on the command line will override these defaults. For example, the following flags will respectively reduce the optimization level, add debug symbols, and block semantics-changing optimizations: -O2 -g -qstrict.
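
For example (the wrapper invocation and file names here are illustrative):

  # Override the default -O3 -qipa -qhot: drop to -O2, keep debug symbols,
  # and forbid optimizations that may change program semantics.
  mpicc -O2 -g -qstrict -c solver.c -o solver.o
  mpicc -O2 -g -qstrict solver.o -o solver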

More information on the XL compiler options is here:

https://www.ibm.com/support/knowledgecenter/SSXVZZ_16.1.0/com.ibm.xlcpp161.lelinux.doc/compiler_ref/tuspecop.html

Submitting jobs

Jobs are submitted via Slurm to one of the following partitions: debug, dcs.

The debug partition is limited to single-node jobs running for up to 30 minutes and using a maximum of 128 GB of memory.

The dcs partition makes nodes and all their resources available for up to 6 hours.

See the Spectrum MPI section of the SLURM page for an example job script.
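
As a minimal illustration, a single-node debug job might be submitted with a script along these lines (the directives and script contents are assumptions; start from the SLURM page example for real work):

  #!/bin/bash
  # Single-node debug job: the debug partition allows up to 30 minutes and
  # up to 128 GB of memory.
  #SBATCH --partition=debug
  #SBATCH --nodes=1
  #SBATCH --time=00:30:00
  # Request only the memory the job actually needs.
  #SBATCH --mem=16G

  srun ./app

Submit the script with sbatch job.sh and monitor it with squeue -u $USER.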

Using GPUs

When submitting GPU/CUDA jobs via Slurm, users must request GPUs with --gres=gpu:#, where # is the number of GPUs desired per node. If GPUs are not requested as part of the job, they will not be accessible.

Note: The system currently forces node sharing to improve GPU utilization. Please make sure you specify all the resources you need as part of your job (memory, GPUs, CPU cores, etc.), not just the number of nodes and/or tasks. If an oversubscribed node causes an issue, please contact support.
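
For example, the request below asks for two GPUs plus a matching amount of CPU and memory on a single dcs node (the specific values are illustrative; size them to what the job actually needs):

  # Two GPUs, two tasks with two cores each, 64G of memory, one hour.
  # Unrequested resources may be allocated to other jobs sharing the node.
  sbatch --partition=dcs --nodes=1 --ntasks=2 --cpus-per-task=2 \
         --gres=gpu:2 --mem=64G --time=01:00:00 job.sh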

Using GPUs in exclusive mode

Slurm will set each GPU in an allocation to the CUDA "exclusive process" compute mode when the "cuda-mode-exclusive" feature/constraint is requested, e.g. salloc --gres=gpu:2 -C cuda-mode-exclusive. For applications that use one process per GPU, this mode can serve as a safeguard to ensure that GPUs are not oversubscribed. It is also recommended when running MPS. Note: this is not to be confused with exclusive user access to the GPU; only one user may access a given GPU regardless of the compute mode.
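
The same constraint can be requested in a batch submission (the batch form is assumed to mirror the salloc example above):

  # Two GPUs per node, each placed in the CUDA "exclusive process" compute mode.
  sbatch --gres=gpu:2 -C cuda-mode-exclusive job.sh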

GPU-Direct

Spectrum MPI disables GPU-Direct by default. See the SLURM page for the syntax to enable GPU-Direct ('CUDA-aware MPI').

Setting GPU-process Affinity

Use a CUDA runtime API call, such as cudaSetDevice, to set process-to-device affinity.
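
If the application does not manage device selection itself, a job-script-level alternative (a sketch of a different technique than the API call described above; the wrapper name and launch line are illustrative) is to mask the devices each task can see:

  #!/bin/bash
  # wrapper.sh: give each srun-launched task its own GPU by restricting the
  # devices it can see. SLURM_LOCALID is the task's local rank on the node;
  # this assumes one task per GPU (e.g. 4 tasks per node with --gres=gpu:4).
  export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
  exec "$@"

  # Usage: srun --ntasks-per-node=4 --gres=gpu:4 ./wrapper.sh ./app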

Using NVMe storage

To use the NVMe storage in a node, request it as part of the job specification with --gres=nvme (this can be combined with other requests, such as GPUs). When the first job step starts, the system will initialize the storage and create the path /mnt/nvme/uid_${SLURM_JOB_UID}/job_${SLURM_JOB_ID}.

The storage is not persistent between allocations; it may, however, be used and shared by multiple job steps within a single allocation. (Note that the tasks of a Slurm job array run as separate allocations and therefore do not share this storage.)
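
A sketch of a job script that stages data through the node-local NVMe is shown below (combining gres requests with a comma is standard Slurm syntax but should be verified locally; the file names are illustrative, and the scratch path follows the pattern above using id -u for the numeric user ID):

  #!/bin/bash
  # Request a dcs node with one GPU and the node-local NVMe storage.
  #SBATCH --partition=dcs
  #SBATCH --nodes=1
  #SBATCH --time=01:00:00
  #SBATCH --gres=gpu:1,nvme

  # The scratch directory is created when the first job step starts.
  SCRATCH=/mnt/nvme/uid_$(id -u)/job_${SLURM_JOB_ID}

  srun cp input.dat "$SCRATCH/"              # stage input onto fast local storage
  srun ./app "$SCRATCH/input.dat"            # run against the local copy
  srun cp "$SCRATCH/results.dat" "$HOME/"    # copy results off before the job ends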

CCI is exploring other configurations in which the NVMe storage can be operated. Please email any suggestions to support.

Profiling

One method for profiling is reading the time base registers (mftb, mftbu). An example of this is found in the FFTW cycle header.

The time base frequency for the Power 9 processor is 512,000,000 ticks per second (512 MHz); dividing a difference of time base readings by this value converts it to seconds.

Documentation

  • IBM XL
  • CUDA and GPU programming

See also