GPU compute server rosenblatt
The GPU compute server rosenblatt has been available since February 2022 for computations that benefit from graphics cards. It was jointly financed by the professorships of Stochastics, Harmonic Analysis, Numerical Mathematics and Scientific Computing (one GPU each) as well as by faculty funds (base machine).
About the name rosenblatt
Frank Rosenblatt was a pioneer of neural networks:
- https://news.cornell.edu/stories/2019/09/professors-perceptron-paved-way-ai-60-years-too-soon
- https://en.wikipedia.org/wiki/Frank_Rosenblatt
Hardware
- CPU: 2 × AMD EPYC 7313 16-Core Processor
- 256 GB RAM
- 250 GB SSD system and swap partition
- 3.5 TB SSD data partition
- GPU: 4 × NVIDIA A40, Ampere architecture, CUDA® processing units
- Ubuntu 22.04
CUDA installation and usage
An installation of the CUDA environment is available under /usr/local/cuda-12.2. To use it, a few environment variables need to be set. This is most easily done by sourcing the prepared script:
source /usr/local/cuda-12.2/cudaenv.sh
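The contents of cudaenv.sh are not reproduced on this page. As a rough sketch, assuming the script does what such environment scripts usually do, it would put the CUDA tools and libraries on the search paths. The variable name CUDA_HOME and the exact lines below are assumptions, not a copy of the real script:

```shell
# Hypothetical sketch of what cudaenv.sh might set (assumption, not the real file).
CUDA_HOME=/usr/local/cuda-12.2
export CUDA_HOME

# Put nvcc and the other CUDA tools first on the command search path.
export PATH="$CUDA_HOME/bin${PATH:+:$PATH}"

# Let the dynamic linker find the CUDA runtime libraries.
export LD_LIBRARY_PATH="$CUDA_HOME/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Because the variables have to end up in the current shell, a script of this kind must be loaded with source and not executed as a separate process.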
After sourcing the script, querying the version of the nvcc compiler reports release 12.2:
rosenblatt:~$ source /usr/local/cuda-12.2/cudaenv.sh
rosenblatt:~$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Jun_13_19:16:58_PDT_2023
Cuda compilation tools, release 12.2, V12.2.91
Build cuda_12.2.r12.2/compiler.32965470_0
You can now start compiling your own programs.
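A compile step could look like the following minimal sketch. The function name compile_cuda and the file name used in the note below are made up for illustration; the option -arch=sm_86 targets compute capability 8.6, which is what the A40 cards provide.

```shell
# Hypothetical helper: compile one .cu file into an executable of the same base name.
compile_cuda() {
    src=$1
    # sm_86 matches the compute capability (8.6) of the NVIDIA A40.
    nvcc -arch=sm_86 -O2 -o "${src%.cu}" "$src"
}
```

With the CUDA environment loaded, calling compile_cuda saxpy.cu would run nvcc on saxpy.cu and write the executable saxpy into the current directory.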
The tool nvidia-smi is available for monitoring the GPU load. A simple call returns, for example:
rosenblatt:~$ nvidia-smi
Fri Jul  7 10:12:51 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A40                     On  | 00000000:01:00.0 Off |                    0 |
|  0%   34C    P0              80W / 300W |    274MiB / 46068MiB |    100%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA A40                     On  | 00000000:61:00.0 Off |                    0 |
|  0%   23C    P8              12W / 300W |      7MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA A40                     On  | 00000000:A1:00.0 Off |                    0 |
|  0%   23C    P8              12W / 300W |      7MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA A40                     On  | 00000000:C1:00.0 Off |                    0 |
|  0%   23C    P8              12W / 300W |      7MiB / 46068MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A    146064      C   ./a.out                                     262MiB |
+---------------------------------------------------------------------------------------+
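In the example output only GPU 0 is busy. The following sketch shows one way to find the idle cards automatically. The helper reads lines of the form "index, utilization", as produced by nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits, and prints the indices at or below a threshold. The function name idle_gpus and the default threshold of 0 % are assumptions for illustration.

```shell
# Hypothetical helper: print the indices of GPUs whose utilization is at or
# below a threshold (default 0 %). Expects "index, utilization" lines on stdin.
idle_gpus() {
    awk -F', *' -v max="${1:-0}" 'NF >= 2 && $2 + 0 <= max + 0 { print $1 }'
}

# Intended use on the server (not executed here):
#   nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits | idle_gpus
```

One of the printed indices can then be exported as CUDA_VISIBLE_DEVICES before starting a program, so that the CUDA runtime only sees that card.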
For continuous monitoring, a repeating output is more suitable. This can be achieved with
nvidia-smi -i 0 -lms 300 --format=csv --query-gpu=power.draw,utilization.gpu,fan.speed,temperature.gpu,memory.used
The parameters mean:
- -i 0 ... show GPU 0; the other three GPUs are selected analogously with -i 1, -i 2, -i 3
- -lms 300 ... print a new line every 300 milliseconds
rosenblatt:~$ nvidia-smi -i 0 -lms 300 --format=csv --query-gpu=power.draw,utilization.gpu,fan.speed,temperature.gpu,memory.used
power.draw [W], utilization.gpu [%], fan.speed [%], temperature.gpu, memory.used [MiB]
74.75 W, 0 %, 0 %, 36, 0 MiB
74.69 W, 0 %, 0 %, 36, 0 MiB
74.72 W, 0 %, 0 %, 36, 0 MiB
74.78 W, 0 %, 0 %, 36, 0 MiB
74.71 W, 0 %, 0 %, 36, 0 MiB
74.71 W, 0 %, 0 %, 36, 0 MiB
74.76 W, 0 %, 0 %, 36, 0 MiB
74.73 W, 0 %, 0 %, 36, 0 MiB
74.73 W, 0 %, 0 %, 36, 0 MiB
74.78 W, 0 %, 0 %, 36, 0 MiB
74.78 W, 0 %, 0 %, 36, 0 MiB
74.77 W, 0 %, 0 %, 36, 2 MiB
90.46 W, 5 %, 0 %, 40, 455 MiB
137.43 W, 100 %, 0 %, 43, 455 MiB
184.72 W, 100 %, 0 %, 44, 455 MiB
232.18 W, 100 %, 0 %, 45, 455 MiB
249.01 W, 100 %, 0 %, 46, 455 MiB
250.37 W, 100 %, 0 %, 46, 455 MiB
251.33 W, 100 %, 0 %, 47, 455 MiB
252.26 W, 100 %, 0 %, 47, 455 MiB
252.45 W, 100 %, 0 %, 47, 455 MiB
...
Here you can see clearly how power draw, GPU utilization, temperature and memory usage rise when the program starts.
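To condense such a log into a single figure, the CSV output can be post-processed. The following sketch, with the made-up function name peak_power, reads output in the format shown above from standard input and prints the highest power draw in watts. It relies on awk using the leading number of a field such as "252.45 W" in arithmetic.

```shell
# Hypothetical helper: print the maximum of the first CSV column (power.draw)
# from nvidia-smi --format=csv output read on stdin.
peak_power() {
    awk -F', *' '
        NR == 1 { next }                  # skip the CSV header line
        $1 + 0 > max { max = $1 + 0 }     # "252.45 W" + 0 evaluates to 252.45
        END { printf "%.2f\n", max }
    '
}
```

If the monitoring output was redirected into a file, here hypothetically named gpu0.csv, then peak_power < gpu0.csv prints the peak value of the run.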