Robotics Study Notes

Posted on: 2024-11-09 Category: Study

Problem Framework

Markov Decision Process (MDP)

Time steps are discrete; the state and action spaces can be continuous
The exact outcome of an action is not known before the action is performed
Once the action has been performed, we know exactly what happened
The agent's state is known (fully observed); the observation and the state are the same here

Formally defined as a 4-tuple (S, A, T, R):

S: State Space
A: Action Space
T: Transition Function
R: Reward Function
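
The 4-tuple above can be sketched as a small data structure. This is only an illustrative sketch: the two-state toy problem, the names `MDP`, `toy_transition`, and `toy_reward`, and the choice of a reward of the form R(s, a, s') are assumptions made for the example, not part of the notes.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = str
Action = str


@dataclass
class MDP:
    states: List[State]    # S: state space
    actions: List[Action]  # A: action space
    # T: maps (s, a) to a probability distribution over next states
    transition: Callable[[State, Action], Dict[State, float]]
    # R: reward for moving from s to s' under a (assumed form R(s, a, s'))
    reward: Callable[[State, Action, State], float]

    def step(self, state: State, action: Action,
             rng: random.Random) -> Tuple[State, float]:
        """Sample one next state from T and return it with its reward."""
        dist = self.transition(state, action)
        next_states = list(dist)
        weights = [dist[s] for s in next_states]
        next_state = rng.choices(next_states, weights=weights, k=1)[0]
        return next_state, self.reward(state, action, next_state)


def toy_transition(state: State, action: Action) -> Dict[State, float]:
    # Hypothetical dynamics: "move" succeeds with probability 0.8.
    if action == "stay":
        return {state: 1.0}
    other = "right" if state == "left" else "left"
    return {other: 0.8, state: 0.2}


def toy_reward(state: State, action: Action, next_state: State) -> float:
    # Hypothetical reward: 1 for ending up in "right", 0 otherwise.
    return 1.0 if next_state == "right" else 0.0


toy_mdp = MDP(["left", "right"], ["stay", "move"], toy_transition, toy_reward)
rng = random.Random(0)
# The outcome is uncertain before acting, but fully observed afterwards.
next_state, r = toy_mdp.step("left", "move", rng)
print(next_state, r)
```

The outcome of `step` is random before the call and exactly known after it, which mirrors the two properties listed for an MDP.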

Partially Observable Markov Decision Process (POMDP)

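Almost the same as an MDP. The effects of an action are still not known exactly before the action is performed (non-deterministic action effects); the difference is that the state is only partially observed, so the agent may not know exactly what happened even after acting.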

Troubleshooting High Soft Interrupts on an Azure B-series Virtual Machine

Posted on: 2024-11-07 Category: Tinkering

Preface

I have a B1s-series virtual machine on Azure. It recently became very sluggish, and the top command showed that both sy and si were high:

%Cpu(s): 52.1 us, 21.9 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi, 26.0 si,  0.0 st
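
To make the breakdown easier to read, the summary line above can be split into its fields: us (user), sy (system), ni (nice), id (idle), wa (I/O wait), hi (hardware interrupts), si (software interrupts), and st (steal). The following is a minimal sketch; the helper name `parse_top_cpu_line` is made up for this example.

```python
import re


def parse_top_cpu_line(line):
    """Parse a top '%Cpu(s):' summary line into {field: percent}."""
    _, _, body = line.partition(":")
    pairs = re.findall(r"([\d.]+)\s+([a-z]+)", body)
    return {name: float(value) for value, name in pairs}


line = ("%Cpu(s): 52.1 us, 21.9 sy,  0.0 ni,  0.0 id,  "
        "0.0 wa,  0.0 hi, 26.0 si,  0.0 st")
cpu = parse_top_cpu_line(line)

# Share of CPU time spent in the kernel and in soft interrupt handling.
kernel_side = cpu["sy"] + cpu["si"]
print(cpu)
print(round(kernel_side, 1))
```

In this sample the idle share is 0.0, and sy plus si account for close to half of the CPU time.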

Troubleshooting

Since tracking down sy is likely to be more involved, I started with the simpler si. I ran watch -n1 -d cat /proc/interrupts to look into the soft interrupts:

Azure Linux Server Rebooting Automatically

Posted on: 2024-09-15 Category: Tinkering

Problem Description

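One day I logged in to the server and found that the job I had left running in screen was gone. Checking uptime showed that the system boot time was not what I expected, so I logged in to Azure and looked at the virtual machine's Activity log, where I found: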

Install OS update patches on virtual machine | Succeeded | 23 hours ago

The time of this log entry matched the server's reboot time exactly.
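
The comparison between the boot time and a log timestamp can be sketched as follows. On Linux, the first field of /proc/uptime is the number of seconds since boot. The helper names, the 30-minute tolerance, and the timestamps in the example are all hypothetical; they are not taken from the actual incident.

```python
from datetime import datetime, timedelta, timezone


def read_uptime_seconds(path="/proc/uptime"):
    """Return seconds since boot (first field of /proc/uptime, Linux only)."""
    with open(path) as f:
        return float(f.read().split()[0])


def boot_time(now, uptime_seconds):
    """Derive the boot time from the current time and the uptime."""
    return now - timedelta(seconds=uptime_seconds)


def rebooted_near(boot, event, tolerance=timedelta(minutes=30)):
    """True if the boot time falls within `tolerance` of a logged event."""
    return abs(boot - event) <= tolerance


# Hypothetical values for illustration only.
now = datetime(2024, 9, 15, 12, 0, tzinfo=timezone.utc)
boot = boot_time(now, uptime_seconds=23 * 3600)
patch_event = now - timedelta(hours=23, minutes=5)
print(boot.isoformat(), rebooted_near(boot, patch_event))
```

On a real machine, `boot_time(datetime.now(timezone.utc), read_uptime_seconds())` would give the value to compare against the Activity log entry.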

Troubleshooting a Neural Network Whose Training Accuracy Starts High and Then Gradually Drops

Posted on: 2024-08-18 Category: Study

Symptoms

When training the neural network, the accuracy is very high at first and then gradually drops, as shown below:

Epoch 	 Time 	 Train Loss 	 Train ACC 	 Val Loss 	 Val ACC 	 Test Loss 	 Test ACC 	 LR
1	 197.8234 	 0.0053 	 0.8645 	 0.0412 	 0.1443 	 0.0412 	 0.1443 	 0.0100
2	 108.6638 	 0.0084 	 0.7311 	 0.0272 	 0.1443 	 0.0272 	 0.1443 	 0.0100
3	 108.4892 	 0.0095 	 0.6777 	 0.0267 	 0.1443 	 0.0267 	 0.1443 	 0.0100
4	 108.8819 	 0.0087 	 0.7102 	 0.0269 	 0.1443 	 0.0269 	 0.1443 	 0.0100
5	 108.8337 	 0.0065 	 0.7712 	 0.0504 	 0.1443 	 0.0504 	 0.1443 	 0.0100
6	 109.4179 	 0.0061 	 0.8071 	 0.0624 	 0.1443 	 0.0624 	 0.1443 	 0.0100
7	 109.2300 	 0.0057 	 0.8349 	 0.0762 	 0.1443 	 0.0762 	 0.1443 	 0.0075
8	 109.2820 	 0.0101 	 0.6432 	 0.0245 	 0.1443 	 0.0245 	 0.1443 	 0.0075

Specifically, Train ACC is very high at the start while Val ACC is very low. As the epochs go on, Train ACC starts to drop and Val ACC stays almost unchanged.
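
The pattern in the log can be checked mechanically. The sketch below copies the eight rows from the table above and parses them; the column names in `COLUMNS` are just Python-friendly renamings of the table headers. It only reports what is in the log and does not attempt to identify the cause.

```python
LOG = """\
1 197.8234 0.0053 0.8645 0.0412 0.1443 0.0412 0.1443 0.0100
2 108.6638 0.0084 0.7311 0.0272 0.1443 0.0272 0.1443 0.0100
3 108.4892 0.0095 0.6777 0.0267 0.1443 0.0267 0.1443 0.0100
4 108.8819 0.0087 0.7102 0.0269 0.1443 0.0269 0.1443 0.0100
5 108.8337 0.0065 0.7712 0.0504 0.1443 0.0504 0.1443 0.0100
6 109.4179 0.0061 0.8071 0.0624 0.1443 0.0624 0.1443 0.0100
7 109.2300 0.0057 0.8349 0.0762 0.1443 0.0762 0.1443 0.0075
8 109.2820 0.0101 0.6432 0.0245 0.1443 0.0245 0.1443 0.0075
"""

COLUMNS = ["epoch", "time", "train_loss", "train_acc",
           "val_loss", "val_acc", "test_loss", "test_acc", "lr"]

# One dict per epoch, with every field parsed as a float.
rows = [dict(zip(COLUMNS, map(float, line.split())))
        for line in LOG.strip().splitlines()]

# Distinct validation accuracies seen across all epochs.
val_accs = {row["val_acc"] for row in rows}

# Whether the validation and test columns are identical in every row.
val_matches_test = all(
    row["val_loss"] == row["test_loss"] and row["val_acc"] == row["test_acc"]
    for row in rows
)

best_train = max(rows, key=lambda row: row["train_acc"])
print(val_accs, val_matches_test, int(best_train["epoch"]))
```

In this log, Val ACC takes a single value in all eight epochs even though Val Loss changes, the validation and test columns are identical in every row, and the highest Train ACC occurs in the first epoch.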

Troubleshooting ROCm Errors in the PyTorch Environment on Setonix

Posted on: 2024-07-12 Updated on: 2024-07-12 Category: Study

Preface

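Setonix is the most powerful supercomputer in the Southern Hemisphere, and also the least stable supercomputer I have ever used. On average it breaks about once a month: for example, a firmware upgrade went wrong, the Lustre file system went down, and so on.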

Because it uses AMD Instinct MI250X GPUs, the PyTorch backend is ROCm. This environment has all sorts of strange problems.
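
Before digging into errors in such an environment, it can help to confirm which backend the installed PyTorch build actually targets. The following is a minimal sketch, not the procedure used on Setonix; the function name `describe_torch_backend` is made up for this example. On ROCm builds, PyTorch reuses the `torch.cuda` API for AMD GPUs, and `torch.version.hip` holds the HIP version string (it is `None` on non-ROCm builds).

```python
def describe_torch_backend():
    """Summarize the installed PyTorch build and the GPUs it can see."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_installed": False}
    return {
        "torch_installed": True,
        "torch_version": torch.__version__,
        # HIP version string on ROCm builds, None otherwise.
        "hip_version": getattr(torch.version, "hip", None),
        # On ROCm builds these calls report AMD GPUs.
        "gpu_available": torch.cuda.is_available(),
        "gpu_count": torch.cuda.device_count(),
    }


info = describe_torch_backend()
print(info)
```

If `hip_version` is `None` while the machine has AMD GPUs, the installed wheel is probably not a ROCm build.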