Known issue
In some cases GROMACS jobs can run with very low performance. We found that this happens when a job is given CPU cores from both CPU sockets: GROMACS then does not use all of the allocated cores, and performance drops dramatically (about 20 times slower in the example below).
Example server idun-06-16:
# lscpu
CPU(s): 36
. . .
Socket(s): 2
. . .
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35
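Given the NUMA layout above, a quick way to check whether a running GROMACS process has been given cores from both sockets is to compare its CPU affinity list with the two NUMA node CPU lists. This is only a sketch; PROCESS_ID is a placeholder for the process ID (found for example with pgrep -u $USER gmx_mpi):
$ taskset -cp PROCESS_ID
$ lscpu | grep "NUMA node"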
Starting job gromacs-1 on 14 CPU cores:
$ cat md.sh
#!/bin/sh
#SBATCH --partition=GPUQ
#SBATCH --gres=gpu:1
#SBATCH --account=support
#SBATCH --time=36:00:00
#SBATCH --nodes=1 # 1 compute node
#SBATCH --ntasks=1 # Total number of MPI ranks
#SBATCH --ntasks-per-node=1 # 1 task per node
#SBATCH --cpus-per-task=14 # 14 cores per MPI rank for GROMACS
#SBATCH --mem=30G # 30GB memory
#SBATCH --job-name=gromacs-1
#SBATCH --output=test-srun.out
module purge
module load GROMACS/2023.1-foss-2022a-CUDA-11.7.0
module list
srun gmx_mpi mdrun -s md_10ns.tpr -v
The job started on CPU cores 9,11,13,15,17,19,21,23,25,27,29,31,33,35 (checked with the top command), all on the same processor. This is confirmed by the md.log file:
. . .
Hardware topology: Basic
Packages, cores, and logical processors:
[indices refer to OS logical processors]
Package 1: [ 9] [ 11] [ 13] [ 15] [ 17] [ 19] [ 21] [ 23] [ 25] [ 27] [ 29] [ 31] [ 33] [ 35]
CPU limit set by OS: -1 Recommended max number of threads: 14
GPU info:
Number of GPUs detected: 1
#0: NVIDIA NVIDIA A100-PCIE-40GB, compute cap.: 8.0, ECC: yes, stat: compatible
. . .
The file test-srun.out predicts that the job will finish in about 10 hours.
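With the verbose option, mdrun periodically reports an estimated finish time ("... will finish <date>"). A sketch for pulling out the latest estimate from the output file, assuming the output file name test-srun.out from the script above (the progress lines are usually separated by carriage returns, hence the tr step; the exact wording can differ between GROMACS versions):
$ tr '\r' '\n' < test-srun.out | grep "will finish" | tail -n 1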
Starting job gromacs-2 on 12 CPU cores (a copy of the same job on the same server):
$ cat md.sh
#!/bin/sh
#SBATCH --partition=GPUQ
#SBATCH --gres=gpu:1
#SBATCH --account=support
#SBATCH --time=36:00:00
#SBATCH --nodes=1 # 1 compute node
#SBATCH --ntasks=1 # Total number of MPI ranks
#SBATCH --ntasks-per-node=1 # 1 task per node
#SBATCH --cpus-per-task=12 # 12 cores per MPI rank for GROMACS
#SBATCH --mem=30G # 30GB memory
#SBATCH --job-name=gromacs-2
#SBATCH --output=test-srun.out
module purge
module load GROMACS/2023.1-foss-2022a-CUDA-11.7.0
module list
srun gmx_mpi mdrun -s md_10ns.tpr -v
The job runs only on CPU cores 1,3,5,7; the other 8 allocated cores, which are on the second processor, stay idle. The md.log file shows that all allocated CPU cores are visible to GROMACS, but it does not use the cores from Package 0:
. . .
Hardware topology: Basic
Packages, cores, and logical processors:
[indices refer to OS logical processors]
Package 1: [ 1] [ 3] [ 5] [ 7]
Package 0: [ 20] [ 22] [ 24] [ 26] [ 28] [ 30] [ 32] [ 34]
CPU limit set by OS: -1 Recommended max number of threads: 12
GPU info:
Number of GPUs detected: 1
#0: NVIDIA NVIDIA A100-PCIE-40GB, compute cap.: 8.0, ECC: yes, stat: compatible
. . .
The file test-srun.out predicts that the job will finish in about 10 days.
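The socket split is also easy to spot directly in md.log: two "Package" lines under "Hardware topology", as in the excerpt above, mean the job straddles both sockets. A minimal check, assuming md.log is in the job's working directory:
$ grep "Package" md.log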
How to mitigate this issue?
- Check the job shortly after it starts: is the predicted finish time reasonable?
- Use fewer CPU cores than one socket has (see the sketch after this list). To inspect CPU core usage, ssh to the server and run: top -H -p PROCESS_ID, then press f and enable the "Last Used Cpu" column.
- Preferably start the job on an idle node. (If you see an idle node, you can add the line #SBATCH --nodelist=SERVER_NAME to the job script.)
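As a sketch of the second point, the job script below keeps the whole job on one socket and lets GROMACS pin its threads. The --sockets-per-node option and the -pin/-ntomp flags are assumptions to verify against the Slurm configuration and GROMACS version on the cluster, not a confirmed fix:
#!/bin/sh
#SBATCH --partition=GPUQ
#SBATCH --gres=gpu:1
#SBATCH --account=support
#SBATCH --time=36:00:00
#SBATCH --nodes=1                 # 1 compute node
#SBATCH --ntasks=1                # Total number of MPI ranks
#SBATCH --ntasks-per-node=1       # 1 task per node
#SBATCH --cpus-per-task=14        # no more cores than one socket has (18 on this node)
#SBATCH --sockets-per-node=1      # assumption: ask Slurm for cores from a single socket
#SBATCH --mem=30G                 # 30GB memory
#SBATCH --job-name=gromacs-1
#SBATCH --output=test-srun.out
module purge
module load GROMACS/2023.1-foss-2022a-CUDA-11.7.0
module list
# -pin on asks GROMACS to pin its OpenMP threads to the allocated cores
srun gmx_mpi mdrun -s md_10ns.tpr -v -pin on -ntomp ${SLURM_CPUS_PER_TASK}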