Troubleshooting ROCm errors in the PyTorch environment on Setonix

Preface

Setonix is the most powerful supercomputer in the Southern Hemisphere, and also the most unstable one I have ever used. On average something breaks about once a month: a firmware upgrade goes wrong, the Lustre filesystem falls over, and so on.

Because it uses AMD Instinct MI250X GPUs, the PyTorch backend is ROCm, and this environment has all sorts of strange problems.

My advice: keep your deep learning away from AMD cards. Training on AMD GPUs has ruined my youth!

Cannot install packages in the PyTorch environment

First, load the module and create a virtual environment:

module load pytorch/2.2.0-rocm5.7.3
python3 -m venv venv

There will now be a venv directory in the current directory. Go to venv/bin and run ls -l; we can see that python3 is linked to the wrong file:

liyumin@setonix-05:~/software/venv/bin> ls -lh
total 36K
-rw-r--r-- 1 liyumin pawsey1001 2.0K  3月 17 12:02 activate
-rw-r--r-- 1 liyumin pawsey1001  935  3月 17 12:02 activate.csh
-rw-r--r-- 1 liyumin pawsey1001 2.2K  3月 17 12:02 activate.fish
-rw-r--r-- 1 liyumin pawsey1001 8.9K  3月 17 12:02 Activate.ps1
-rwxr-xr-x 1 liyumin pawsey1001  305  3月 17 12:08 pip
-rwxr-xr-x 1 liyumin pawsey1001  305  3月 17 12:08 pip3
lrwxrwxrwx 1 liyumin pawsey1001    7  3月 17 12:03 python -> python3
lrwxrwxrwx 1 liyumin pawsey1001  100  3月 17 12:03 python3 -> /usr/bin/python3
lrwxrwxrwx 1 liyumin pawsey1001  100  3月 17 12:03 python3.10 -> python3

Use the commands below to fix this:

rm python3
ln -s /software/setonix/2023.08/containers/modules-long/quay.io/pawsey/pytorch/2.2.0-rocm5.7.3/bin/python3 python3

Now it is corrected:

total 36K
-rw-r--r-- 1 liyumin pawsey1001 2.0K  3月 17 12:02 activate
-rw-r--r-- 1 liyumin pawsey1001  935  3月 17 12:02 activate.csh
-rw-r--r-- 1 liyumin pawsey1001 2.2K  3月 17 12:02 activate.fish
-rw-r--r-- 1 liyumin pawsey1001 8.9K  3月 17 12:02 Activate.ps1
-rwxr-xr-x 1 liyumin pawsey1001  305  3月 17 12:08 pip
-rwxr-xr-x 1 liyumin pawsey1001  305  3月 17 12:08 pip3
lrwxrwxrwx 1 liyumin pawsey1001    7  3月 17 12:03 python -> python3
lrwxrwxrwx 1 liyumin pawsey1001  100  3月 17 12:03 python3 -> /software/setonix/2023.08/containers/modules-long/quay.io/pawsey/pytorch/2.2.0-rocm5.7.3/bin/python3
lrwxrwxrwx 1 liyumin pawsey1001  100  3月 17 12:03 python3.10 -> python3

pip is missing from this environment, so we have to install it:

source ./activate
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3 get-pip.py

Then we can install JupyterLab and other packages:

pip install jupyterlab

If the home partition quota is full, we have to move our .local directory to the software partition:

mv ~/.local /software/projects/pawsey1001/$(whoami)/.local
ln -s /software/projects/pawsey1001/$(whoami)/.local ~/.local

After installing Jupyter, we should add it to PATH by putting the line below into ~/.bashrc:

export PATH=$PATH:~/.local/bin

Jupyter still does not work at this point; there is one more thing to do:

File "/home/liyumin/.local/bin/jupyter", line 5, in <module>
  from jupyter_core.command import main
ModuleNotFoundError: No module named 'jupyter_core'

Open ~/.local/bin/jupyter in any text editor and modify the first line: change the python path to your virtual environment's python path, like this:

#!/software/projects/pawsey1001/liyumin/venv/bin/python3
# -*- coding: utf-8 -*-
import re
import sys
from jupyter_core.command import main
if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(main())

Now jupyter lab is working.

Here is my batch script; save it as batch_script:

#!/bin/bash -l
# Allocate slurm resources, edit as necessary
#SBATCH --account=pawsey1001-gpu
#SBATCH --gres=gpu:1
#SBATCH --partition=gpu
#SBATCH --time=12:00:00
#SBATCH --job-name=jupyter_notebook

# Get the hostname
# We'll set up an SSH tunnel to connect to the Jupyter notebook server
host=$(hostname)

# Directory to serve notebooks from; edit as necessary
dir=$(pwd)

# Set the port for the SSH tunnel
# This part of the script uses a loop to search for available ports on the node;
# this will allow multiple instances of GUI servers to be run from the same host node
port="8888"
pfound="0"
while [ $port -lt 65535 ] ; do
  check=$( ss -tuna | awk '{print $4}' | grep ":$port *" )
  if [ "$check" == "" ] ; then
    pfound="1"
    break
  fi
  : $((++port))
done
if [ $pfound -eq 0 ] ; then
  echo "No available communication port found to establish the SSH tunnel."
  echo "Try again later. Exiting."
  exit
fi

echo "*****************************************************"
echo "Setup - from your laptop do:"
echo "ssh -L ${port}:${host}:${port} $USER@$PAWSEY_CLUSTER.pawsey.org.au"
echo "*****"
echo "The launch directory is: $dir"
echo "*****************************************************"
echo ""

#Launch the notebook
export OMP_NUM_THREADS=1
module load pytorch/2.2.0-rocm5.7.3
source /home/liyumin/software/venv/bin/activate

srun -N 1 -n 1 -c 8 --gres=gpu:1 --gpus-per-task=1 --gpu-bind=closest jupyter lab \
  --no-browser \
  --port=${port} --ip=0.0.0.0 \
  --notebook-dir=${dir}

Remember to change the virtual environment path to your own: /home/liyumin/software/venv/bin/activate

And submit the task: sbatch batch_script

Wait until the task status is running: squeue --me

Then we will have a file like slurm-{task_id}.out

There will be information in this file:

*****************************************************
Setup - from your laptop do:
ssh -L 8888:nid002240:8888 liyumin@setonix.pawsey.org.au
*****
The launch directory is: 
*****************************************************

[I 2024-04-07 22:55:07.485 ServerApp] jupyter_lsp | extension was successfully linked.
[I 2024-04-07 22:55:07.488 ServerApp] jupyter_server_terminals | extension was successfully linked.
[I 2024-04-07 22:55:07.492 ServerApp] jupyterlab | extension was successfully linked.
[I 2024-04-07 22:55:07.844 ServerApp] notebook_shim | extension was successfully linked.
[I 2024-04-07 22:55:07.942 ServerApp] notebook_shim | extension was successfully loaded.
[I 2024-04-07 22:55:07.944 ServerApp] jupyter_lsp | extension was successfully loaded.
[I 2024-04-07 22:55:07.946 ServerApp] jupyter_server_terminals | extension was successfully loaded.
[I 2024-04-07 22:55:07.954 LabApp] JupyterLab extension loaded from /home/liyumin/.local/lib/python3.10/site-packages/jupyterlab
[I 2024-04-07 22:55:07.954 LabApp] JupyterLab application directory is /software/projects/pawsey1001/liyumin/.local/share/jupyter/lab
[I 2024-04-07 22:55:07.955 LabApp] Extension Manager is 'pypi'.
[I 2024-04-07 22:55:08.001 ServerApp] jupyterlab | extension was successfully loaded.
[I 2024-04-07 22:55:08.001 ServerApp] Serving notebooks from local directory: 
[I 2024-04-07 22:55:08.001 ServerApp] Jupyter Server 2.13.0 is running at:
[I 2024-04-07 22:55:08.001 ServerApp] http://nid002240:8888/lab?token=d8a8b968d5b8a8af2cc4f5adb42674d2368bcb9fd657c370
[I 2024-04-07 22:55:08.001 ServerApp]     http://127.0.0.1:8888/lab?token=d8a8b968d5b8a8af2cc4f5adb42674d2368bcb9fd657c370
[I 2024-04-07 22:55:08.001 ServerApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 2024-04-07 22:55:08.008 ServerApp] 
    
    To access the server, open this file in a browser:
        file:///home/liyumin/.local/share/jupyter/runtime/jpserver-67643-open.html
    Or copy and paste one of these URLs:
        http://nid002240:8888/lab?token=d8a8b968d5b8a8af2cc4f5adb42674d2368bcb9fd657c370
        http://127.0.0.1:8888/lab?token=d8a8b968d5b8a8af2cc4f5adb42674d2368bcb9fd657c370

Run that ssh -L command on your local machine and open the link in the browser: http://127.0.0.1:8888/lab?token=xxxxxx
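Before starting real work, it is worth running a quick sanity check in a notebook cell to confirm that PyTorch actually sees the MI250X through ROCm. A minimal sketch (note that ROCm builds of PyTorch reuse the torch.cuda API names):

import torch

# ROCm builds of PyTorch still expose the CUDA API names
print(torch.__version__)              # should show a +rocm build
print(torch.cuda.is_available())      # True if the GPU is visible
print(torch.cuda.get_device_name(0))  # should report the AMD Instinct device

# tiny end-to-end check: run a matmul on the GPU
x = torch.randn(4, 4, device="cuda")
print((x @ x).sum().item())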

We are all done!

To cancel the job, use scancel {task_id}, or File > Shut Down in JupyterLab, so it stops consuming SUs.

You might ask why this part is in English: it is actually the content of an email I sent to someone else, and I wrote it in English at the time. Too lazy to translate it.

Errors as soon as a model runs

MIOpen Error: /long_pathname_so_that_rpms_can_package_the_debug_info/src/extlibs/MLOpen/src/include/miopen/kern_db.hpp:180: Internal error while accessing SQLite database: attempt to write a readonly database
Traceback (most recent call last):
  ...
RuntimeError: miopenStatusInternalError

I could not find a solution online. Filing a support ticket got no reply either, which was frustrating, so I had to solve it myself.

After some random trial and error, I found that switching to a different node sometimes makes the problem go away. Further checking showed that only some nodes have this issue, so I put it down to an environment problem on those nodes.

When running sbatch, we can use the --exclude option to skip the problematic nodes:

sbatch --exclude=nid00[2056,2112,2826,2860,2868,2872,2928,2930,2932,2942,2946,2948,2984,2986,2988,2990,2994,3000] batch_script
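If excluding nodes is not enough, another workaround that is sometimes suggested for this kind of MIOpen "readonly database" error is to point MIOpen's user database and cache at a writable per-user directory before the first convolution runs. I have not verified this on Setonix, and the scratch path below is just an assumption, so treat it as a sketch; in a notebook it can be set from Python:

import os

# Untested sketch: redirect MIOpen's user DB and cache to a writable location.
# Must run before the first convolution kernel is compiled.
# The path is an assumption; use whatever writable directory your project has.
cache_dir = os.path.expandvars("/scratch/pawsey1001/$USER/miopen-cache")
os.makedirs(cache_dir, exist_ok=True)
os.environ["MIOPEN_USER_DB_PATH"] = cache_dir
os.environ["MIOPEN_CUSTOM_CACHE_DIR"] = cache_dir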

Fine on NVIDIA cards, inexplicable NaN loss on AMD cards

I was testing an open-source model, and after I added one extra nn.Linear, the loss blew up straight to NaN.

The same code runs fine on NVIDIA cards.

After careful investigation, manually setting this layer's weights, printing the values, and checking them against a hand calculation, I found that this thing can actually get the arithmetic wrong: the values coming out of the nn.Linear did not match the theoretical ones.
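To illustrate the kind of check I did (a simplified sketch, not the actual model code): run a batch through the nn.Linear and compare it against the same matmul written out by hand.

import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Linear(16, 16).cuda()
x = torch.randn(8, 16, device="cuda")       # batch_size = 8

out = layer(x)                              # what the layer actually returns
ref = x @ layer.weight.T + layer.bias       # the same computation done by hand

print(torch.allclose(out, ref, atol=1e-5))  # a mismatch here means the layer computed wrong numbers
print((out - ref).abs().max().item())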

I tried reducing lr and batch_size, and found that the numbers only come out correct when batch_size <= 2.

I was baffled, until I noticed that their code had AMP enabled. I tried turning AMP off, and everything went back to normal.
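For reference, "turning AMP off" just means dropping (or disabling) the autocast/GradScaler wrappers in the training step. A minimal self-contained sketch with a placeholder model, not their actual code:

import torch
import torch.nn as nn

use_amp = False   # flipping this to False is what made the loss sane again for me

# placeholder model and data, only to make the sketch runnable
model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(16, 32, device="cuda")
y = torch.randn(16, 1, device="cuda")

optimizer.zero_grad()
with torch.autocast(device_type="cuda", enabled=use_amp):
    loss = loss_fn(model(x), y)
scaler.scale(loss).backward()   # with enabled=False this is effectively plain fp32 training
scaler.step(optimizer)
scaler.update()
print(loss.item())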

AMD cards do support AMP, but I have no idea why this kind of voodoo problem happens. From now on I will not dare to enable AMP on AMD cards.