ERP Cluster


Specifications

The cluster consists of 14 nodes connected via InfiniBand. Each node has two 24-core 2.3 GHz AMD EPYC 7451 processors and 128 GB of system memory. erp13 and erp14 are higher-memory nodes with 512 GB of memory each.

Performance

An ERP node with two AMD EPYC 7451 processors can theoretically perform

2 sockets * 24 cores/socket * 8 FLOPs/cycle/core * 2.9 Giga-cycles/second ≈ 1.1 TeraFLOPS double precision

using the 8 FLOPs/cycle/core rate described here and a 2.9 GHz all-core boost clock rate.

Each ERP node achieves 200 GB/s of memory bandwidth running the STREAM triad benchmark. This result is from testing on a similar system.
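
A minimal sketch for reproducing this measurement with the STREAM benchmark built with OpenMP is shown below; the source URL, array size, and thread settings are assumptions rather than site defaults and should be adjusted for your environment.

 # Build and run the STREAM triad benchmark on a compute node (sketch).
 # Array size and thread settings are illustrative, not tuned values.
 wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
 gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=80000000 stream.c -o stream
 OMP_NUM_THREADS=48 OMP_PROC_BIND=spread ./stream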

Documentation

The following materials provide cluster users with details on architecture and performance tuning.

Dell EPYC performance study - discusses NUMA effects and socket locality to network interface

AMD EPYC Cluster Tuning

Accessing the System

Note: Not all projects have access to the cluster. Job submissions to Slurm may be rejected even if access to the front-end node is authorized.

Running on the cluster first requires connecting to one of its front-end nodes, erpfen01. These machines are accessible from the landing pads.
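
For example, from a landing pad session (assuming the short hostname resolves from the landing pads):

 # Connect to the ERP front end from a landing pad.
 ssh erpfen01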

Compiling

The erpfen01 front-end node is virtualized and will be set up for compiling software for the compute nodes. Until that is configured, please allocate a himem node:

  salloc -p himem -N 1 -t 30

then ssh to the allocated node to build your software.
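
A sketch of the full sequence is below; the node name erp13 is illustrative, and SLURM_JOB_NODELIST is the standard Slurm variable reporting the allocation.

 salloc -p himem -N 1 -t 30
 echo $SLURM_JOB_NODELIST   # prints the allocated node name, e.g. erp13
 ssh erp13                  # substitute the node name reported above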

Modules

On erpfen01:

  module use /gpfs/u/software/erp-spack-install/lmod/linux-centos7-x86_64/Core/
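
After adding the path, packages can be listed and loaded in the usual way; gcc and openmpi are the modules used by the example job script later on this page.

 module avail            # list packages provided by the ERP Spack install
 module load gcc
 module load openmpi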

SMT

Currently, SMT is not enabled on the cluster nodes. By default, Slurm will assign 48 processes to each node, matching the 48 physical cores available.

Submitting and Managing Jobs

The ERP cluster, unlike other clusters at CCI, uses consumable resources: CPU cores and memory are scheduled independently rather than assuming a job will utilize whole nodes. Users should note the following when submitting jobs:

  • Jobs should include a per-node memory constraint (--mem) or a per-CPU memory constraint (--mem-per-cpu) that will be sufficient for the highest demand process. (Non-homogeneous jobs may have different memory requirements across the job. The highest requirement should be used as the constraint.)
  • Processes that utilize threads will need CPU count constraints. By default, the allocation method is 1 CPU per process or task. Jobs that require multiple CPUs per process or task (for threads, OpenMP, etc.) should add a --cpus-per-task constraint to allocate additional CPU cores. Explicit binding may also be necessary depending on the application (see --cpu_bind in the srun man page). A sketch of a script combining these constraints follows this list.
  • Parallel job arrays will require some additional parameters to subdivide resources assigned to a job. See the Job arrays/Many small jobs section for more information.
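
A minimal sketch of a hybrid MPI+OpenMP batch script applying these constraints is below; the partition, time limit, task counts, memory values, and executable path are illustrative and should be adjusted for your job.

 #!/bin/bash
 #SBATCH -p erp                # partition (see Partitions below)
 #SBATCH -t 60                 # time limit in minutes
 #SBATCH -N 2                  # nodes
 #SBATCH -n 16                 # tasks (MPI ranks)
 #SBATCH --cpus-per-task=6     # CPU cores per task, used for OpenMP threads
 #SBATCH --mem-per-cpu=2G      # memory per allocated CPU core
 module use /gpfs/u/software/erp-spack-install/lmod/linux-centos7-x86_64/Core/
 module load gcc
 module load openmpi
 export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
 srun --mpi=pmi2 --cpu_bind=cores /path/to/executable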

Partitions

Name    Time Limit (hr)    Max Nodes
debug   1                  14
erp     6                  12
himem   4                  2

Example job submission scripts

Please see Slurm for more info.

The work distribution, communication patterns, and programming model will guide the selection of a process binding. Users seeking maximum performance or efficiency should review the Slurm and EPYC architecture documentation to determine a suitable process binding.
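
To inspect the NUMA layout when choosing a binding, standard Linux tools can be run on an allocated compute node (a sketch; tool availability on the nodes is an assumption):

 numactl --hardware    # list NUMA domains and their CPUs and memory
 lstopo-no-graphics    # hwloc view of sockets, NUMA domains, and the InfiniBand HCA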

Exclusive node allocation, MPI (OpenMPI) Only

The following job script and submission command will allocate all the resources on each of <number of nodes> nodes and bind processes to the NUMA domains in order of increasing distance from the domain (4) physically closest to the InfiniBand interface.

Create a file named run.sh with the following contents:

 #!/bin/bash
 module use /gpfs/u/software/erp-spack-install/lmod/linux-centos7-x86_64/Core/
 module load gcc
 module load openmpi
 bindArg="--cpu_bind=verbose,map_ldom=4,4,4,4,4,4,5,5,5,5,5,5,6,6,6,6,6,6,7,7,7,7,7,7,0,0,0,0,0,0,1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3"
 srun --mpi=pmi2 ${bindArg} /path/to/executable

Submission command:

 sbatch -p <partition> -t <minutes> -N <number of nodes> -n <processes> --mincpus=48 ./run.sh
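
For example, to run 96 MPI ranks across two fully allocated erp nodes (48 ranks per node, matching the 48-entry binding map; the values are illustrative):

 sbatch -p erp -t 60 -N 2 -n 96 --mincpus=48 ./run.sh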