Troubleshooting ROCm errors in the PyTorch environment on Setonix


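Setonix is the most powerful supercomputer in the Southern Hemisphere, and also the least stable supercomputer I have ever used. It breaks about once a month on average: a firmware upgrade goes wrong, the Lustre filesystem falls over, and so on.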

Because it uses AMD Instinct MI250X GPUs, the PyTorch backend is ROCm, and this environment has all kinds of strange problems.



First, load the module and create a virtual environment:

module load pytorch/2.2.0-rocm5.7.3
python3 -m venv venv

The current directory now contains a venv directory. Running ls -lh in venv/bin shows that python3 is linked to the wrong file:

liyumin@setonix-05:~/software/venv/bin> ls -lh
total 36K
-rw-r--r-- 1 liyumin pawsey1001 2.0K  3月 17 12:02 activate
-rw-r--r-- 1 liyumin pawsey1001  935  3月 17 12:02 activate.csh
-rw-r--r-- 1 liyumin pawsey1001 2.2K  3月 17 12:02
-rw-r--r-- 1 liyumin pawsey1001 8.9K  3月 17 12:02 Activate.ps1
-rwxr-xr-x 1 liyumin pawsey1001  305  3月 17 12:08 pip
-rwxr-xr-x 1 liyumin pawsey1001  305  3月 17 12:08 pip3
lrwxrwxrwx 1 liyumin pawsey1001    7  3月 17 12:03 python -> python3
lrwxrwxrwx 1 liyumin pawsey1001  100  3月 17 12:03 python3 -> /usr/bin/python3
lrwxrwxrwx 1 liyumin pawsey1001  100  3月 17 12:03 python3.10 -> python3
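
The same check can be scripted instead of read off the ls output. This is a minimal sketch; the helper name resolve_interpreter is hypothetical and not part of the original setup. It reports what bin/python3 points to directly and which file is reached after following the whole link chain.

```python
import os
from pathlib import Path


def resolve_interpreter(venv_dir):
    """Return (link_target, final_path) for <venv_dir>/bin/python3.

    link_target is what the symlink points to directly (None if python3
    is not a symlink); final_path is the path reached after following
    every link in the chain.
    """
    python3 = Path(venv_dir) / "bin" / "python3"
    link_target = os.readlink(python3) if python3.is_symlink() else None
    final_path = os.path.realpath(python3)
    return link_target, final_path
```

Calling resolve_interpreter("venv") right after creating the environment should show /usr/bin/python3 as the link target if the environment has the problem described above.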

Use the commands below to fix this:

rm python3
ln -s /software/setonix/2023.08/containers/modules-long/ python3

Now it is corrected:

total 36K
-rw-r--r-- 1 liyumin pawsey1001 2.0K  3月 17 12:02 activate
-rw-r--r-- 1 liyumin pawsey1001  935  3月 17 12:02 activate.csh
-rw-r--r-- 1 liyumin pawsey1001 2.2K  3月 17 12:02
-rw-r--r-- 1 liyumin pawsey1001 8.9K  3月 17 12:02 Activate.ps1
-rwxr-xr-x 1 liyumin pawsey1001  305  3月 17 12:08 pip
-rwxr-xr-x 1 liyumin pawsey1001  305  3月 17 12:08 pip3
lrwxrwxrwx 1 liyumin pawsey1001    7  3月 17 12:03 python -> python3
lrwxrwxrwx 1 liyumin pawsey1001  100  3月 17 12:03 python3 -> /software/setonix/2023.08/containers/modules-long/
lrwxrwxrwx 1 liyumin pawsey1001  100  3月 17 12:03 python3.10 -> python3

pip is missing from this environment, so we have to install it:

source ./activate
curl -o

Then we can install JupyterLab and other packages:

pip install jupyterlab

If the quota on the home partition is exhausted, move the .local directory to the software partition and link it back:

mv ~/.local /software/projects/pawsey1001/`whoami`/.local/
ln -s /software/projects/pawsey1001/`whoami`/.local/ ~/.local
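
The two commands above can also be wrapped in a small helper that refuses to run when the destination already exists. With plain mv, an existing destination directory would receive a nested .local instead. This is only a sketch: the function name relocate_with_symlink is hypothetical, and the paths you pass in are your own.

```python
import os
import shutil


def relocate_with_symlink(src, dst):
    """Move directory src to dst, then leave a symlink at src pointing to dst."""
    src = os.path.abspath(os.path.expanduser(src))
    dst = os.path.abspath(os.path.expanduser(dst))
    if os.path.islink(src):
        raise FileExistsError(f"{src} is already a symlink; nothing to move")
    if os.path.exists(dst):
        raise FileExistsError(f"{dst} already exists; refusing to overwrite it")
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.move(src, dst)  # falls back to copy + delete across filesystems
    os.symlink(dst, src, target_is_directory=True)
    return dst
```

For the case in this post, src would be ~/.local and dst the .local path under the project's software directory.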

After installing Jupyter, add it to PATH by putting the line below into ~/.bashrc:

export PATH=$PATH:~/.local/bin

Jupyter still does not work at this point. It fails with:

File "/home/liyumin/.local/bin/jupyter", line 5, in <module>
  from jupyter_core.command import main
ModuleNotFoundError: No module named 'jupyter_core'

Open ~/.local/bin/jupyter in any text editor and change the first line (the shebang) so that it points to the Python interpreter of your virtual environment, like this:

# -*- coding: utf-8 -*-
import re
import sys
from jupyter_core.command import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())

JupyterLab now works.
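
Other launchers installed under ~/.local/bin may carry the same stale first line, so the edit can be scripted. The sketch below assumes the launcher is a text file that starts with a #! line; the helper name rewrite_shebang is hypothetical, and the interpreter path you pass in is your own venv's python3.

```python
from pathlib import Path


def rewrite_shebang(script_path, interpreter):
    """Point the '#!' line of script_path at interpreter; return the old first line."""
    path = Path(script_path)
    lines = path.read_text(encoding="utf-8").splitlines(keepends=True)
    if not lines or not lines[0].startswith("#!"):
        raise ValueError(f"{path} does not start with a shebang line")
    old = lines[0].rstrip("\n")
    lines[0] = f"#!{interpreter}\n"
    # Writing in place keeps the file's existing permission bits.
    path.write_text("".join(lines), encoding="utf-8")
    return old
```

The function returns the previous first line, which makes it easy to log what was changed or to undo it.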

Here is my batch script; save it as batch_script:

#!/bin/bash -l
# Allocate slurm resources, edit as necessary
#SBATCH --account=pawsey1001-gpu
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu
#SBATCH --time=12:00:00
#SBATCH --job-name=jupyter_notebook

# Get the hostname
# We'll set up an SSH tunnel to connect to the Jupyter notebook server
host=$(hostname)

# Set the port for the SSH tunnel
# This part of the script uses a loop to search for available ports on the node;
# this will allow multiple instances of GUI servers to be run from the same host node
port=8888   # starting port (8888 is the port that appears in the job output)
pfound=0
while [ $port -lt 65535 ] ; do
  check=$( ss -tuna | awk '{print $4}' | grep ":$port *" )
  if [ "$check" == "" ] ; then
    pfound=1
    break
  fi
  : $((++port))
done
if [ $pfound -eq 0 ] ; then
  echo "No available communication port found to establish the SSH tunnel."
  echo "Try again later. Exiting."
  exit 1
fi

echo "*****************************************************"
echo "Setup - from your laptop do:"
echo "ssh -L ${port}:${host}:${port} $USER@$"
echo "*****"
echo "The launch directory is: $dir"
echo "*****************************************************"
echo ""

#Launch the notebook
module load pytorch/2.2.0-rocm5.7.3
source /home/liyumin/software/venv/bin/activate

srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest jupyter lab \
  --no-browser \
  --port=${port} --ip= \

Remember to change the virtual environment path to your own: /home/liyumin/software/venv/bin/activate
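
The port search in the batch script parses the output of ss. The same idea can be sketched in Python with a different technique: try to bind each candidate port and take the first one that succeeds. The function name find_free_port and its defaults are assumptions for illustration; the batch script itself does not use this code.

```python
import socket


def find_free_port(start=8888, end=65535, host=""):
    """Return the first port in [start, end) that can be bound on this machine."""
    for port in range(start, end):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind((host, port))
            except OSError:
                continue  # port is taken (or not permitted); try the next one
            return port  # the socket is closed when the with-block exits
    raise RuntimeError("No available communication port found")
```

As with the shell loop, another process could still grab the port between the check and the moment Jupyter starts, so this narrows the race rather than removing it.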

Then submit the job: sbatch batch_script

Wait until the job status is running: squeue --me

Slurm then writes a file named slurm-{task_id}.out

This file contains the connection details:

Setup - from your laptop do:
ssh -L 8888:nid002240:8888
The launch directory is: 

[I 2024-04-07 22:55:07.485 ServerApp] jupyter_lsp | extension was successfully linked.
[I 2024-04-07 22:55:07.488 ServerApp] jupyter_server_terminals | extension was successfully linked.
[I 2024-04-07 22:55:07.492 ServerApp] jupyterlab | extension was successfully linked.
[I 2024-04-07 22:55:07.844 ServerApp] notebook_shim | extension was successfully linked.
[I 2024-04-07 22:55:07.942 ServerApp] notebook_shim | extension was successfully loaded.
[I 2024-04-07 22:55:07.944 ServerApp] jupyter_lsp | extension was successfully loaded.
[I 2024-04-07 22:55:07.946 ServerApp] jupyter_server_terminals | extension was successfully loaded.
[I 2024-04-07 22:55:07.954 LabApp] JupyterLab extension loaded from /home/liyumin/.local/lib/python3.10/site-packages/jupyterlab
[I 2024-04-07 22:55:07.954 LabApp] JupyterLab application directory is /software/projects/pawsey1001/liyumin/.local/share/jupyter/lab
[I 2024-04-07 22:55:07.955 LabApp] Extension Manager is 'pypi'.
[I 2024-04-07 22:55:08.001 ServerApp] jupyterlab | extension was successfully loaded.
[I 2024-04-07 22:55:08.001 ServerApp] Serving notebooks from local directory: 
[I 2024-04-07 22:55:08.001 ServerApp] Jupyter Server 2.13.0 is running at:
[I 2024-04-07 22:55:08.001 ServerApp] http://nid002240:8888/lab?token=d8a8b968d5b8a8af2cc4f5adb42674d2368bcb9fd657c370
[I 2024-04-07 22:55:08.001 ServerApp]
[I 2024-04-07 22:55:08.001 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2024-04-07 22:55:08.008 ServerApp] 
    To access the server, open this file in a browser:
    Or copy and paste one of these URLs:

Run that ssh -L command on your local machine, then open the link in your browser.
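
Picking the tunnel command and the token URL out of the job output by eye gets tedious, so here is a rough sketch that extracts them. It assumes the output contains lines shaped like the ones shown above; the function name parse_jupyter_log and the returned keys are hypothetical.

```python
import re

TUNNEL_RE = re.compile(r"ssh -L (\d+):([\w.-]+):(\d+)")
URL_RE = re.compile(r"https?://[\w.-]+:\d+/lab\?token=[0-9a-f]+")


def parse_jupyter_log(text):
    """Extract tunnel details and the tokenised URL from the Slurm job output."""
    tunnel = TUNNEL_RE.search(text)
    url = URL_RE.search(text)
    if tunnel is None or url is None:
        return None  # the server has probably not finished starting yet
    local_port, node, remote_port = tunnel.groups()
    # Through the tunnel the server is reached via localhost, not the node name.
    local_url = url.group(0).replace(
        f"{node}:{remote_port}", f"localhost:{local_port}", 1
    )
    return {"node": node, "port": int(remote_port), "local_url": local_url}
```

Feed it the contents of slurm-{task_id}.out; a return value of None usually just means the file should be read again a few seconds later.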

We are all done!

To stop consuming SUs, cancel the job with scancel {task_id} or use File > Shut Down in JupyterLab.



MIOpen Error: /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/MLOpen/src/include/miopen/kern_db.hpp:180: Internal error while accessing SQLite database: attempt to write a readonly database
Traceback (most recent call last):
RuntimeError: miopenStatusInternalError
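
The error above says that MIOpen could not write to its SQLite database. One workaround that may help, offered here as an assumption rather than a confirmed fix for this failure on Setonix, is to point MIOpen's user database and kernel cache at a directory the job can write to, using the MIOPEN_USER_DB_PATH and MIOPEN_CUSTOM_CACHE_DIR environment variables. The helper name configure_miopen_cache and the directory layout are hypothetical.

```python
import os
import tempfile


def configure_miopen_cache(base_dir=None):
    """Point MIOpen's user database and kernel cache at a writable directory.

    Call this before importing torch so the variables are already set
    when MIOpen is loaded.
    """
    if base_dir is None:
        # Assumption: one cache directory per Slurm job under the temp dir.
        job_id = os.environ.get("SLURM_JOB_ID", "nojob")
        base_dir = os.path.join(tempfile.gettempdir(), f"miopen-cache-{job_id}")
    os.makedirs(base_dir, exist_ok=True)
    os.environ["MIOPEN_USER_DB_PATH"] = base_dir
    os.environ["MIOPEN_CUSTOM_CACHE_DIR"] = base_dir
    return base_dir
```

The same two variables can be exported in the batch script before the srun line instead of being set from Python.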




sbatch --exclude=nid00[2056,2112,2826,2860,2868,2872,2928,2930,2932,2942,2946,2948,2984,2986,2988,2990,2994,3000] batch_script
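
The sbatch command above skips a fixed set of nodes. If that list keeps changing, the --exclude argument can be generated from a plain list of node numbers. This is a small sketch: build_exclude_arg is a hypothetical helper, and the nid00 prefix is taken from the command above.

```python
def build_exclude_arg(node_ids, prefix="nid00"):
    """Format node numbers as a Slurm hostlist for `sbatch --exclude`."""
    if not node_ids:
        return ""  # nothing to exclude, so no argument is needed
    ids = ",".join(str(n) for n in sorted(set(node_ids)))
    return f"--exclude={prefix}[{ids}]"
```

The result is meant to be placed between sbatch and the script name, exactly where the hand-written list sits in the command above.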

Fine on NVIDIA cards, inexplicable NaN loss on AMD cards

I was testing an open-source model. After I added an nn.Linear layer, the loss blew up and became NaN.



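I tried reducing lr and batch_size, and found that the values come out correct when batch_size <= 2.

I was baffled, until I noticed that their code enables AMP. I tried turning AMP off, and the results went back to normal.

AMD cards do support AMP, but I have no idea why this mysterious problem occurs. From now on I will not dare to enable AMP on AMD cards.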
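
To make this kind of comparison easy to repeat, AMP can sit behind a single flag so the same training step runs with and without it. This is a minimal sketch and not the code of the model I was testing: the model, the MSE loss and the names train_step and make_scaler are placeholders. It uses torch.autocast and torch.cuda.amp.GradScaler, both of which accept an enabled argument.

```python
import torch
from torch import nn


def make_scaler(use_amp):
    # With enabled=False the scaler is a pass-through, so the loop stays identical.
    return torch.cuda.amp.GradScaler(enabled=use_amp)


def train_step(model, optimizer, scaler, x, y, use_amp):
    """Run one optimisation step, with AMP switched on or off by a flag."""
    optimizer.zero_grad(set_to_none=True)
    # ROCm builds of PyTorch expose AMD GPUs under the "cuda" device type.
    with torch.autocast(device_type=x.device.type, enabled=use_amp):
        loss = nn.functional.mse_loss(model(x), y)
    if not torch.isfinite(loss):
        # Fail at the first bad step instead of training on NaN.
        raise FloatingPointError(f"non-finite loss {loss.item()} (use_amp={use_amp})")
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

Running the same batches once with use_amp=True and once with use_amp=False shows at which step the two runs start to diverge.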