Known issue
In some cases GROMACS jobs can run with very low performance. We found that this happens when a job is given CPU cores from both CPU sockets: GROMACS then does not use all of the allocated cores, and performance drops dramatically (about 20 times slower in the example below).
Example server idun-06-16:
# lscpu
CPU(s): 36
. . .
Socket(s): 2
. . .
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35
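Given the NUMA layout above, a quick way to check whether a running GROMACS process has been given cores from both sockets is to compare its CPU affinity list with the two NUMA node CPU lists. This is only a sketch; PROCESS_ID is a placeholder for the process ID (found for example with pgrep -u $USER gmx_mpi):
$ taskset -cp PROCESS_ID
$ lscpu | grep "NUMA node"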
Starting job gromacs-1 on 14 CPU cores:
$ cat md.sh
#!/bin/sh
#SBATCH --partition=GPUQ
#SBATCH --gres=gpu:1
#SBATCH --account=support
#SBATCH --time=36:00:00
#SBATCH --nodes=1 # 1 compute node
#SBATCH --ntasks=1 # Total number of MPI ranks
#SBATCH --ntasks-per-node=1 # 1 task per node
#SBATCH --cpus-per-task=14 # 14 cores per MPI rank for GROMACS
#SBATCH --mem=30G # 30GB memory
#SBATCH --job-name=gromacs-1
#SBATCH --output=test-srun.out
module purge
module load GROMACS/2023.1-foss-2022a-CUDA-11.7.0
module list
srun gmx_mpi mdrun -s md_10ns.tpr -v
The job started on CPU cores 9,11,13,15,17,19,21,23,25,27,29,31,33,35 (checked with the top command), all on the same processor. This is confirmed by the md.log file:
. . .
Hardware topology: Basic
Packages, cores, and logical processors:
[indices refer to OS logical processors]
Package 1: [ 9] [ 11] [ 13] [ 15] [ 17] [ 19] [ 21] [ 23] [ 25] [ 27] [ 29] [ 31] [ 33] [ 35]
CPU limit set by OS: -1 Recommended max number of threads: 14
GPU info:
Number of GPUs detected: 1
#0: NVIDIA NVIDIA A100-PCIE-40GB, compute cap.: 8.0, ECC: yes, stat: compatible
. . .
The file test-srun.out predicts that the job will finish in about 10 hours.
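With the verbose option, mdrun periodically reports an estimated finish time ("... will finish <date>"). A sketch for pulling out the latest estimate from the output file, assuming the output file name test-srun.out from the script above (the progress lines are usually separated by carriage returns, hence the tr step; the exact wording can differ between GROMACS versions):
$ tr '\r' '\n' < test-srun.out | grep "will finish" | tail -n 1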
Starting job gromacs-2 on 12 CPU cores (a copy of the same job on the same server):
$ cat md.sh
#!/bin/sh
#SBATCH --partition=GPUQ
#SBATCH --gres=gpu:1
#SBATCH --account=support
#SBATCH --time=36:00:00
#SBATCH --nodes=1 # 1 compute node
#SBATCH --ntasks=1 # Total number of MPI ranks
#SBATCH --ntasks-per-node=1 # 1 task per node
#SBATCH --cpus-per-task=12 # 12 cores per MPI rank for GROMACS
#SBATCH --mem=30G # 30GB memory
#SBATCH --job-name=gromacs-2
#SBATCH --output=test-srun.out
module purge
module load GROMACS/2023.1-foss-2022a-CUDA-11.7.0
module list
srun gmx_mpi mdrun -s md_10ns.tpr -v
The job runs only on CPU cores 1,3,5,7; the other 8 allocated cores, which are on the second processor, stay idle. The md.log file shows that all allocated CPU cores are visible to GROMACS, but it does not use the cores from Package 0:
. . .
Hardware topology: Basic
Packages, cores, and logical processors:
[indices refer to OS logical processors]
Package 1: [ 1] [ 3] [ 5] [ 7]
Package 0: [ 20] [ 22] [ 24] [ 26] [ 28] [ 30] [ 32] [ 34]
CPU limit set by OS: -1 Recommended max number of threads: 12
GPU info:
Number of GPUs detected: 1
#0: NVIDIA NVIDIA A100-PCIE-40GB, compute cap.: 8.0, ECC: yes, stat: compatible
. . .
The file test-srun.out predicts that the job will finish in about 10 days.
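The socket split is also easy to spot directly in md.log: two "Package" lines under "Hardware topology", as in the excerpt above, mean the job straddles both sockets. A minimal check, assuming md.log is in the job's working directory:
$ grep "Package" md.log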
How to mitigate this issue?
- Check the job shortly after it starts: is the predicted finish time reasonable?
- Use fewer CPU cores than one socket has (see the sketch after this list). To inspect CPU core usage, ssh to the server and run: top -H -p PROCESS_ID, then press f and enable the "Last Used Cpu" column.
- Preferably start the job on an idle node. (If you see an idle node, you can add the line #SBATCH --nodelist=SERVER_NAME to the job script.)
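As a sketch of the second point, the job script below keeps the whole job on one socket and lets GROMACS pin its threads. The --sockets-per-node option and the -pin/-ntomp flags are assumptions to verify against the Slurm configuration and GROMACS version on the cluster, not a confirmed fix:
#!/bin/sh
#SBATCH --partition=GPUQ
#SBATCH --gres=gpu:1
#SBATCH --account=support
#SBATCH --time=36:00:00
#SBATCH --nodes=1                 # 1 compute node
#SBATCH --ntasks=1                # Total number of MPI ranks
#SBATCH --ntasks-per-node=1       # 1 task per node
#SBATCH --cpus-per-task=14        # no more cores than one socket has (18 on this node)
#SBATCH --sockets-per-node=1      # assumption: ask Slurm for cores from a single socket
#SBATCH --mem=30G                 # 30GB memory
#SBATCH --job-name=gromacs-1
#SBATCH --output=test-srun.out
module purge
module load GROMACS/2023.1-foss-2022a-CUDA-11.7.0
module list
# -pin on asks GROMACS to pin its OpenMP threads to the allocated cores
srun gmx_mpi mdrun -s md_10ns.tpr -v -pin on -ntomp ${SLURM_CPUS_PER_TASK}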