type: Post
date: Jun 17, 2025
summary: torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
category: Practical Tips
tags: PD Distributed Training
Property: Jun 20, 2025 01:48 AM
Error Log
Traceback (most recent call last):
File "/nfsdata/DataSynthesis_/src/synthetic_data.py", line 2455, in test
integrated(table)
File "/nfsdata/DataSynthesis_/src/synthetic_data.py", line 2276, in integrated
raise e  # re-raise the exception so the parent process can catch it
File "/nfsdata/DataSynthesis_/src/synthetic_data.py", line 2270, in integrated
synthesizer.fit() # 训练模型
File "/root/miniconda3/envs/sh_data_synthesis/lib/python3.10/site-packages/sdgx/synthesizer.py", line 327, in fit
self.model.fit(metadata, processed_dataloader, **(model_fit_kwargs or {}))
File "/nfsdata/DataSynthesis_/src/synthetic_data.py", line 284, in fit
return self._fit_multi_gpu(metadata, dataloader, epochs, *args, **kwargs)
File "/nfsdata/DataSynthesis_/src/synthetic_data.py", line 328, in _fit_multi_gpu
mp.spawn(
File "/root/miniconda3/envs/sh_data_synthesis/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/root/miniconda3/envs/sh_data_synthesis/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
while not context.join():
File "/root/miniconda3/envs/sh_data_synthesis/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 184, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV
This means that during multi-process (multi-GPU) training, child process 1 crashed with SIGSEGV (a segmentation fault), which brought down the entire mp.spawn call.
SIGSEGV errors most commonly come from PyTorch's CUDA-related operations.
Troubleshooting
1. Do the objects passed to the child processes include anything that cannot be pickled?
Log at the very top of the worker function to confirm that each process starts and receives its GPU assignment (a minimal sketch follows the log lines below). The logs printed normally, so the child processes started successfully and the passed-in objects are not the problem. Check passed.
2025-06-17 07:53:11.333 | INFO | __mp_main__:_train_worker:339 - Worker 0 started, pid=62587
2025-06-17 07:53:13.600 | INFO | __mp_main__:_train_worker:339 - Worker 1 started, pid=62814
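A minimal sketch of this kind of early logging, assuming loguru (which matches the log format above) and a worker signature of (rank, gpu_ids); the names are illustrative, not the project's actual code:
import os
import torch
from loguru import logger

def _train_worker(rank, gpu_ids):
    # Log before touching CUDA: if this line appears, the process spawned
    # successfully and its arguments survived pickling.
    logger.info(f"Worker {rank} started, pid={os.getpid()}")
    torch.cuda.set_device(gpu_ids[rank])
    # ... training logic ...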
2. Memory issues: check GPU memory and shared memory
- Check shared memory: run df -h /dev/shm to see its size and usage (a programmatic check follows the df output below). Check passed.
- If the space is small (e.g., only a few GB), consider enlarging it by editing /etc/fstab or remounting with mount; jobs that move a lot of data should get several tens of GB. To grow it to 64G temporarily: sudo mount -o remount,size=64G /dev/shm
Filesystem Size Used Avail Use% Mounted on
tmpfs            32G  1.2M   32G   1% /dev/shm
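The same check can be done from Python as a guard before spawning workers; a sketch, where the 8 GiB threshold is an assumption rather than a measured requirement:
import shutil

# DataLoader workers hand tensors to the parent through /dev/shm, and a
# tiny tmpfs is a classic cause of worker crashes, so fail fast here.
total, used, free = shutil.disk_usage("/dev/shm")
assert free > 8 * 2**30, f"/dev/shm has only {free / 2**30:.1f} GiB free"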
- Check GPU memory: run nvidia-smi and inspect usage. Check passed.
- If very little GPU memory is free, resource contention may be the cause; here the code pins training to GPUs [0, 2] (see the device-selection sketch after the nvidia-smi output below).
| GPU Name Persistence-M | ... | Memory-Usage | GPU-Util Compute M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | ... | 393MiB / 24564MiB | 0% Default |
| 1 NVIDIA GeForce RTX 4090 Off | ... | 21238MiB / 24564MiB | 0% Default |
| 2 NVIDIA GeForce RTX 4090 Off | ... | 393MiB / 24564MiB | 0% Default |
| 3 NVIDIA GeForce RTX 4090 Off | ... | 20442MiB / 24564MiB | 0% Default |
...
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
|=========================================================================================|
| 0 N/A N/A 5619 C ...lama-box/llama-box-rpc-server 384MiB |
| 1 N/A N/A 5621 C ...lama-box/llama-box-rpc-server 384MiB |
| 1 N/A N/A 30802 C Model: ROOT-qwen1.5-0.5b-0 20828MiB |
| 2 N/A N/A 5620 C ...lama-box/llama-box-rpc-server 384MiB |
| 3 N/A N/A 5618 C ...lama-box/llama-box-rpc-server 384MiB |
|    3   N/A  N/A     26237      C   ...ipx/venvs/gpustack/bin/python   20044MiB |
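A hypothetical sketch of pinning a run to physical GPUs 0 and 2: CUDA_VISIBLE_DEVICES must be set before torch initializes CUDA, after which the two devices appear inside the workers as cuda:0 and cuda:1. The worker name and arguments are illustrative.
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,2"  # set before any CUDA initialization

import torch
import torch.multiprocessing as mp

def _train_worker(rank, world_size):
    # rank indexes into the visible-device list, not the physical GPU IDs
    torch.cuda.set_device(rank)
    # ... training logic ...

if __name__ == "__main__":
    mp.spawn(_train_worker, args=(2,), nprocs=2, join=True)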
3. Single-GPU test: check whether the code runs correctly on a single GPU. Test passed.
4. NCCL and multi-GPU communication issues
- Check GPU connectivity: run nvidia-smi topo -m to see how the GPUs are linked. Entries such as PIX, PXB, or PHB all indicate a routable PCIe path (the matrix below shows PHB everywhere, i.e., traffic crosses a PCIe Host Bridge). Check passed.
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB PHB PHB 0-31 0 N/A
GPU1 PHB X PHB PHB 0-31 0 N/A
GPU2 PHB PHB X PHB 0-31 0 N/A
GPU3 PHB PHB PHB X 0-31 0 N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks
- NCCL log check: set the NCCL environment variables below to get verbose logs (a sketch for setting them from Python follows the export lines); this produced no useful information.
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1
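Because spawned workers inherit the parent's environment, these real NCCL knobs can also be set programmatically; a sketch, with the placement (before any NCCL call) being the only assumption:
import os

# Must run before dist.init_process_group / the first NCCL operation.
os.environ.setdefault("NCCL_DEBUG", "INFO")       # verbose NCCL logging
os.environ.setdefault("NCCL_P2P_DISABLE", "1")    # rule out peer-to-peer transport
os.environ.setdefault("NCCL_IB_DISABLE", "1")     # rule out InfiniBand transport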
- NCCL compatibility check:
- Reinstall torch and NCCL built against a CUDA version the driver supports (12.4).
Current combination: CUDA 12.8 + NCCL 2.21.5 + RTX 4090 + torch 2.5.1
Previously working combination: CUDA 12.1 + NCCL 2.21.5 + T4 + torch 2.5.1
Suspicion: the CUDA version and the NCCL version are incompatible.
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
- Switch backend='gloo' (it was 'nccl'). This pinpointed the problem: with gloo the program ran successfully, confirming an NCCL compatibility issue (a configurable-backend sketch follows the snippet below). dist.init_process_group(
backend='gloo',
init_method='env://',
world_size=len(self.gpu_ids),
rank=rank
)
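A sketch of making the backend switchable via an environment variable so nccl can be restored later without editing code; DDP_BACKEND is a made-up variable name, and the function wrapper is illustrative:
import os
import torch.distributed as dist

def init_distributed(rank, world_size):
    # export DDP_BACKEND=gloo to fall back while debugging NCCL
    backend = os.environ.get("DDP_BACKEND", "nccl")
    dist.init_process_group(
        backend=backend,
        init_method="env://",
        world_size=world_size,
        rank=rank,
    )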
Fixing the Problem
Although the code runs under gloo, gloo is far slower than nccl, so nccl is still the right backend when performance matters.
Final approach:
- Adjust the CUDA and NCCL versions. The current CUDA 12.8 is too new; try downgrading to 12.4 or 12.1.
- Because NCCL ships bundled with torch, changing the NCCL version really means changing the torch version; the PyTorch website lists install commands for CUDA 12.1 and 12.4 (link). An example follows this list.
- Alternatively, try PyTorch 2.2.x + CUDA 11.8 (the most stable combination).
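For example, the cu124 command used earlier has a cu121 counterpart on the same wheel index (this follows PyTorch's standard wheel URL scheme; verify exact versions on the official selector):
pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu121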