This page is a guide for CCI users who wish to use the IBM DCS supercomputer. Users may connect to dcsfen02 to build software and submit jobs via Slurm.
Note: dcsfen02 runs debug jobs in addition to acting as a front-end node. Debug jobs are prioritized onto it over other nodes to limit resource fragmentation and improve system utilization. This has implications for performance and resource availability on the node, particularly for GPUs. For example, if a debug job has requested GPUs in exclusive mode, those GPUs may not be available to run other code.
System information

The system comprises 16 nodes, each housing:
- 2x IBM Power 9 processors clocked at 3.15 GHz. Each processor contains 18 cores with 4 hardware threads (144 logical processors per node).
- 4x NVIDIA Tesla V100 GPUs with 16 GiB of memory each
- 512 GB RAM
- 1.6 TB Samsung NVMe Flash
The first 2 nodes also contain a Nallatech FPGA.
All nodes are connected with EDR InfiniBand and connect to the unified File System.
Building software

Many packages for building software are available as modules. Some tools are also available without loading any modules, and a subset of those can be overridden by modules, so pay careful attention to which modules you have loaded. The following are available without loading any modules:
- ninja 1.7.2
- cmake 3.13.4 (cmake3)
- autoconf 2.69
- automake 1.13.4
- gcc 4.8.5
- clang/llvm 3.4.2
Currently the following are available as modules:
- automake 1.16.1
- bazel 0.17.2, 0.18.0, 0.18.1, 0.21.0
- ccache 3.5
- cmake 3.12.2, 3.12.3
- gcc 6.4.0, 6.5.0, 7.4.0, 8.1.0, 8.2.0
- xl/xl_r (xlC and xlf) 13.1.6, 16.1.0
- MPICH 3.2.1 (mpich module, built with XL compiler)
- CUDA 9.1, 10.0
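The modules above can be listed and loaded with the usual environment-modules commands; the specific version shown below is only an example from the list:

```shell
module avail           # list all available modules
module load gcc/8.2.0  # load a specific version (example from the list above)
module list            # confirm which modules are currently loaded
```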
When mixing CUDA and MPI, please make sure an xl module is loaded and nvcc is called with -ccbin $(CXX); otherwise linking will fail.
CUDA code should be compiled with -arch=sm_70 for the Volta V100 GPUs.
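Putting these together, a mixed CUDA/MPI build might look like the following sketch (app.cu and the use of the mpicxx wrapper as nvcc's host compiler are illustrative assumptions):

```shell
# Load the XL compilers, the XL-built MPICH, and CUDA
module load xl mpich cuda

# Compile for the V100 (compute capability 7.0), using the MPI C++
# wrapper as nvcc's host compiler so MPI linking succeeds
nvcc -arch=sm_70 -ccbin mpicxx -o app app.cu
```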
XL MPICH Compiler Wrapper Flags
The default MPICH compiler wrapper flags -O3 -qipa -qhot perform aggressive optimizations that could alter the semantics of your program. If the compiler applies such an optimization, the following warning message is displayed:
1500-036: (I) The NOSTRICT option (default at OPT(3)) has the potential to alter the semantics of a program. Please refer to documentation on the STRICT/NOSTRICT option for more information.
Specifying flags on the command line overrides these defaults. For example, the following flags respectively reduce the optimization level, add debug symbols, and block semantics-changing optimizations: -O2 -g -qstrict.
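As a sketch (solver.c is a hypothetical source file), overriding the wrapper's defaults looks like:

```shell
# The wrapper's defaults (-O3 -qipa -qhot) permit semantics-changing
# optimizations; flags given on the command line override them.
mpicc -O2 -g -qstrict -o solver solver.c
```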
More information on the XL compiler options is available in the IBM XL compiler documentation.
Submitting jobs

Jobs are submitted via Slurm to one of the following partitions: debug, dcs.
The debug partition is limited to single-node jobs running up to 30 minutes, and may use a maximum of 128 GB of memory.
The dcs partition makes nodes and all their resources available for up to 6 hours.
See the Spectrum MPI section of the SLURM page for an example job script.
When submitting GPU/CUDA jobs via Slurm, users must pass --gres=gpu:# to specify the number of GPUs desired per node. If GPUs are not requested with a job, they will not be accessible.
Note: The system currently forces node sharing to improve GPU utilization. Please make sure you specify the resources you need as part of your job (memory, GPUs, CPU cores, etc), not just the number of nodes and/or tasks. If an oversubscribed node causes an issue please contact support.
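For example, a batch script following this advice might look like the sketch below (the resource values and my_gpu_app are placeholders, not recommendations):

```shell
#!/bin/bash
# Request exactly the resources needed, not just nodes/tasks:
#SBATCH --partition=dcs        # or "debug" for short test runs
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=2    # one task per GPU
#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:2           # GPUs must be requested explicitly
#SBATCH --mem=64G
#SBATCH --time=01:00:00

srun ./my_gpu_app
```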
Using GPUs in exclusive mode
Slurm will set each GPU in an allocation to the CUDA "exclusive process" mode when the cuda-mode-exclusive feature/constraint is requested, e.g. salloc --gres=gpu:2 -C cuda-mode-exclusive. For applications using one process per GPU, this mode may be used as a safeguard to ensure that GPUs are not oversubscribed. It is also recommended when running MPS. Note: this is not to be confused with exclusive user access to the GPU; only one user may access a given GPU regardless of the mode.
Spectrum MPI disables GPU-Direct by default. See the SLURM page for the syntax to enable GPU-Direct ('CUDA aware MPI').
Setting GPU-process Affinity
Use a CUDA runtime API call, such as cudaSetDevice, to set process-to-device affinity.
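A minimal sketch of the one-process-per-GPU pattern, assuming an MPI program where each rank selects a device by its rank (the round-robin mapping is an illustrative choice, not a site requirement):

```c
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    /* Bind this process to one of the node's GPUs (round-robin by rank). */
    cudaSetDevice(rank % ndev);

    /* ... kernels launched by this process now run on the selected GPU ... */

    MPI_Finalize();
    return 0;
}
```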
Using NVMe storage
To use the NVMe storage in a node, request it along with the job specification: --gres=nvme (this can be combined with other requests, such as GPUs). When the first job step starts, the system will initialize the storage and create a path for the job to use.
The storage is not persistent between allocations. However, it may be used/shared by multiple job steps within an allocation, see Slurm job arrays.
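Combining the NVMe request with other resources is a matter of listing multiple gres entries; the values below are illustrative:

```shell
# Request 4 GPUs plus local NVMe scratch on one dcs node
sbatch --partition=dcs --nodes=1 --gres=gpu:4,nvme --time=04:00:00 job.sh
```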
CCI is exploring other configurations in which the NVMe storage can be operated. Please email any suggestions to support.
Profiling

One method for profiling is reading the time base register via the mftb and mftbu instructions. An example of this is found in the FFTW cycle header.
The time base frequency for the POWER9 processor is 512,000,000 ticks per second (512 MHz).
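A sketch of this approach (the inline assembly compiles only on powerpc64, and the timed region is a placeholder):

```c
#include <stdint.h>
#include <stdio.h>

/* Read the 64-bit time base register (POWER only). */
static inline uint64_t read_timebase(void) {
#if defined(__powerpc64__)
    uint64_t tb;
    __asm__ volatile("mftb %0" : "=r"(tb));
    return tb;
#else
    return 0; /* not a POWER processor */
#endif
}

int main(void) {
    const double tb_hz = 512000000.0; /* POWER9 time base ticks per second */

    uint64_t start = read_timebase();
    /* ... code to be timed ... */
    uint64_t end = read_timebase();

    printf("elapsed: %f s\n", (double)(end - start) / tb_hz);
    return 0;
}
```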