1. Check resource usage for a COMPLETED job with the "sacct" command.
Example with Job ID 19361471:
sacct -j 19361471 --format="JobID,JobName%30,Start,Elapsed,ReqTRES%45,TRESUsageInMax%110,State"
The output is long. The relevant fields are:
- The job requested 1 GPU: ReqTRES ( gres/gpu=1 )
But GPU utilization was zero: TRESUsageInMax ( gres/gpumem=0,gres/gpuutil=0 )
- The job requested 300G of memory: ReqTRES ( mem=300G )
But the maximum used was about 83G: TRESUsageInMax ( mem=86980388K )
- CPU utilization for this job was close to the maximum:
The job requested 3 CPU cores: ReqTRES ( cpu=3 )
The job ran for: Elapsed ( 2-10:00:08 )
The job used: TRESUsageInMax ( cpu=6-17:14:38 )
With 3 cores, the available CPU time was 3 x 2-10:00:08 = 7-06:00:24, so the job used about 93% of it.
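The arithmetic behind these comparisons can be sketched in Python. This is only an illustration: the helper names (slurm_time_to_seconds, slurm_mem_to_gib, cpu_efficiency) are made up for this example and are not part of Slurm, and the input values are copied by hand from the sacct output above.

```python
def slurm_time_to_seconds(value):
    """Convert a Slurm duration such as '2-10:00:08', '10:00:08' or '00:08' to seconds."""
    days, _, clock = value.rpartition("-")      # days is "" when there is no day part
    fields = [float(f) for f in clock.split(":")]
    while len(fields) < 3:                      # pad missing hours/minutes with zeros
        fields.insert(0, 0.0)
    hours, minutes, seconds = fields
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


# Slurm memory suffixes are treated here as binary units (K = KiB, and so on).
_TO_GIB = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}


def slurm_mem_to_gib(value):
    """Convert a value such as '86980388K' or '300G' to GiB."""
    return float(value[:-1]) * _TO_GIB[value[-1].upper()]


def cpu_efficiency(cpu_time, elapsed, ncpus):
    """Fraction of the available CPU time (elapsed x cores) that was actually used."""
    return slurm_time_to_seconds(cpu_time) / (slurm_time_to_seconds(elapsed) * ncpus)


# Values taken from the example job 19361471 above.
cpu_frac = cpu_efficiency("6-17:14:38", "2-10:00:08", 3)
mem_frac = slurm_mem_to_gib("86980388K") / slurm_mem_to_gib("300G")
print(f"CPU efficiency: {cpu_frac:.0%}, memory used: {mem_frac:.0%} of requested")
```

For the example job this gives a CPU efficiency of about 93% and a memory usage of about 28% of the request, which suggests a much smaller memory request would be enough for a similar job.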
You can see all collected information about this job with this command:
sacct -j 19361471 --format="ALL"
Some fields are long, so you may need to set the column width with a percent sign. For example, %150 sets a width of 150 characters:
sacct -j 19361471 --format="ALL%150"
2. Check resource usage for a running job.
Example: user hpcuser has a running job:
$ squeue -u hpcuser
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
19380572 GPUQ my_job01 hpcuser R 1-14:49:39 1 idun-04-07
Job ID 19380572 requested: 3 CPU cores, 100G memory and 1 GPU.
$ scontrol show job 19380572 | grep ReqTRES
ReqTRES=cpu=3,mem=100G,node=1,billing=3,gres/gpu=1
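If you want to compare the request against measured usage in a script, the ReqTRES line can be split into key/value pairs. A minimal sketch, assuming the line looks exactly like the one above; parse_tres is a hypothetical helper name, not a Slurm tool.

```python
def parse_tres(text):
    """Split a TRES string such as 'ReqTRES=cpu=3,mem=100G,...' into a dict."""
    text = text.strip()
    if text.startswith("ReqTRES="):
        text = text[len("ReqTRES="):]
    # Each item is 'name=value'; split only on the first '=' of each item.
    return dict(item.split("=", 1) for item in text.split(",") if item)


# The line printed by "scontrol show job 19380572 | grep ReqTRES" above.
requested = parse_tres("ReqTRES=cpu=3,mem=100G,node=1,billing=3,gres/gpu=1")
print(requested["cpu"], requested["mem"], requested["gres/gpu"])
```

The values are kept as strings so that units such as the G in 100G are not lost.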
You can log in to the compute node via ssh and check how the job is running:
[hpcuser@idun-login2 ~]$ ssh idun-04-07
[hpcuser@idun-04-07 ~]$ top -u hpcuser
. . .
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2454197 hpcuser 20 0 2396196 2.0g 888252 R 100.0 0.1 1370:12 python3
2756123 hpcuser 20 0 2396200 2.0g 888236 R 100.0 0.1 402:48.31 python3
2795017 hpcuser 20 0 20560 5348 3920 R 6.2 0.0 0:00.02 top
571974 hpcuser 20 0 7696 4016 3216 S 0.0 0.0 0:00.00 slurm_script
576324 hpcuser 20 0 9065364 5.0g 3260 R 0.0 0.3 2:12.09 python3
582463 hpcuser 20 0 2719500 29016 2544 S 0.0 0.0 0:00.20 python3
582517 hpcuser 20 0 30004 18856 2212 S 0.0 0.0 0:00.03 python3
2794958 hpcuser 20 0 48652 7004 4716 S 0.0 0.0 0:00.00 sshd
2794959 hpcuser 20 0 17048 4980 3872 S 0.0 0.0 0:00.00 bash
Two processes are each using 100% of a CPU core, but together they use only about 4G of memory (RES 2.0g each), far below the 100G requested.
We can check the peak virtual memory size (VmPeak) of these processes since they started:
[hpcuser@idun-04-07 ~]$ grep VmPeak /proc/2454197/status
VmPeak: 2682100 kB
[hpcuser@idun-04-07 ~]$ grep VmPeak /proc/2756123/status
VmPeak: 2682096 kB
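The same check can be sketched in Python by reading /proc/PID/status directly. This is an illustration only: read_vm_fields and read_vm_fields_for_pid are hypothetical helper names. Note that VmPeak is the peak virtual memory size; the VmHWM field in the same file reports the peak resident set size, which is usually closer to what the job really occupied in RAM.

```python
from pathlib import Path


def read_vm_fields(status_text, fields=("VmPeak", "VmHWM")):
    """Extract selected memory fields from /proc/PID/status text, converted to GiB."""
    result = {}
    for line in status_text.splitlines():
        name, _, rest = line.partition(":")
        if name in fields and rest.split():
            # Values in /proc/PID/status are reported in kB (units of 1024 bytes).
            result[name] = int(rest.split()[0]) / 1024**2
    return result


def read_vm_fields_for_pid(pid):
    """Read the fields for a live process; works only on the node where it runs."""
    return read_vm_fields(Path(f"/proc/{pid}/status").read_text())


# Value taken from the grep output for PID 2454197 above.
sample = "VmPeak:\t 2682100 kB\n"
print({name: f"{gib:.2f} GiB" for name, gib in read_vm_fields(sample).items()})
```

For PID 2454197 this gives a peak virtual size of roughly 2.56 GiB.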
Check GPU utilization with the "nvidia-smi" or "nvtop" command:
[hpcuser@idun-04-07 ~]$ nvidia-smi
Tue May 7 09:21:07 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06 Driver Version: 545.23.06 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA A100-PCIE-40GB On | 00000000:89:00.0 Off | 0 |
| N/A 26C P0 32W / 250W | 4MiB / 40960MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
The GPU is idle: utilization is 0%, only 4MiB of 40960MiB memory is in use, and no processes are running on it.
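For repeated checks, the CSV query mode of nvidia-smi is easier to read from a script than the table above. A minimal sketch in Python: the function names and the 5% idle threshold are assumptions made for this example, and the sample line simply restates the values from the table (0% utilization, 4 MiB of 40960 MiB).

```python
import subprocess

# nvidia-smi query mode: one CSV line per GPU, no header, no unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_csv(text):
    """Parse lines such as '0, 4, 40960' into one dict per GPU."""
    gpus = []
    for line in text.strip().splitlines():
        util, used, total = (int(field) for field in line.split(","))
        gpus.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return gpus


def is_idle(gpu, util_threshold=5):
    """Treat a GPU as idle at or below the threshold (5% is an arbitrary choice)."""
    return gpu["util_pct"] <= util_threshold


def query_gpus():
    """Run nvidia-smi on the current node; requires an NVIDIA driver to be present."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


# Sample line matching the nvidia-smi table above.
for gpu in parse_gpu_csv("0, 4, 40960\n"):
    print(gpu, "idle" if is_idle(gpu) else "busy")
```

On a compute node, query_gpus() would be called instead of parsing the sample line. A single reading is only a snapshot, so it is worth sampling several times before concluding that a job never uses its GPU.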