type

Post

date

Jun 17, 2025

summary

torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV

category

Practical Tips

tags

PD

Distributed Training

password

URL

Property

Jun 20, 2025 01:48 AM

Error Log

Traceback (most recent call last):
  File "/nfsdata/DataSynthesis_/src/synthetic_data.py", line 2455, in test
    integrated(table)
  File "/nfsdata/DataSynthesis_/src/synthetic_data.py", line 2276, in integrated
    raise e  # re-raise so the parent process can catch the error
  File "/nfsdata/DataSynthesis_/src/synthetic_data.py", line 2270, in integrated
    synthesizer.fit()                                 # 训练模型
  File "/root/miniconda3/envs/sh_data_synthesis/lib/python3.10/site-packages/sdgx/synthesizer.py", line 327, in fit
    self.model.fit(metadata, processed_dataloader, **(model_fit_kwargs or {}))
  File "/nfsdata/DataSynthesis_/src/synthetic_data.py", line 284, in fit
    return self._fit_multi_gpu(metadata, dataloader, epochs, *args, **kwargs)
  File "/nfsdata/DataSynthesis_/src/synthetic_data.py", line 328, in _fit_multi_gpu
    mp.spawn(
  File "/root/miniconda3/envs/sh_data_synthesis/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 328, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/root/miniconda3/envs/sh_data_synthesis/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 284, in start_processes
    while not context.join():
  File "/root/miniconda3/envs/sh_data_synthesis/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 184, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGSEGV

This means that during multi-process (multi-GPU) training, child process 1 crashed with SIGSEGV (a segmentation fault), which caused the whole mp.spawn call to fail.

SIGSEGV errors are common in PyTorch's CUDA-related operations.
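Before digging into hardware, it helps to get a Python-level traceback out of the crashing worker, since ProcessExitedException only reports the signal. A minimal sketch (the helper name is mine, not from the project): enable the stdlib faulthandler as the first thing in each spawned worker so a SIGSEGV dumps the Python stack to stderr.

```python
import faulthandler
import sys

def enable_crash_tracebacks() -> None:
    # On SIGSEGV/SIGFPE/SIGABRT, dump the Python stack of every thread to
    # stderr; call this first thing inside each spawned worker so the crash
    # site is visible instead of just "terminated with signal SIGSEGV".
    faulthandler.enable(file=sys.stderr, all_threads=True)

enable_crash_tracebacks()
```

Note that if the fault happens deep inside a C/CUDA library, the dumped Python frame only shows which Python call triggered it, not the native stack.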

Troubleshooting

1. Do the objects passed to the child processes contain anything unpicklable?

Log at the very beginning of the child-process function to confirm that each worker starts and which GPU it is assigned. If the log lines appear, the child processes spawned successfully and the passed-in objects are not the problem. Check passed.

2025-06-17 07:53:11.333 | INFO     | __mp_main__:_train_worker:339 - Worker 0 started, pid=62587
2025-06-17 07:53:13.600 | INFO     | __mp_main__:_train_worker:339 - Worker 1 started, pid=62814

2. Memory issues: check GPU memory and shared memory

  • Check shared memory: df -h /dev/shm shows its size and usage. Check passed.
    • If it is small (e.g. only a few GB), consider enlarging it by editing /etc/fstab or remounting with mount. For data-heavy jobs, tens of GB is recommended. To grow it to 64G temporarily: sudo mount -o remount,size=64G /dev/shm
    • Filesystem      Size  Used Avail Use% Mounted on
      tmpfs            32G  1.2M   32G   1% /dev/shm
  • Check GPU memory: nvidia-smi shows the usage. Check passed.
    • If very little free memory remains, resource contention could be the cause; the code pins training to GPUs [0, 2].
    • | GPU  Name                 Persistence-M | ... |           Memory-Usage | GPU-Util  Compute M. |
      |=========================================+========================+======================|
      |   0  NVIDIA GeForce RTX 4090        Off | ... |     393MiB /  24564MiB |      0%      Default |
      |   1  NVIDIA GeForce RTX 4090        Off | ... |   21238MiB /  24564MiB |      0%      Default |
      |   2  NVIDIA GeForce RTX 4090        Off | ... |     393MiB /  24564MiB |      0%      Default |
      |   3  NVIDIA GeForce RTX 4090        Off | ... |   20442MiB /  24564MiB |      0%      Default |
      ...
      | Processes:                                                                              |
      |  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
      |=========================================================================================|
      |    0   N/A  N/A            5619      C   ...lama-box/llama-box-rpc-server        384MiB |
      |    1   N/A  N/A            5621      C   ...lama-box/llama-box-rpc-server        384MiB |
      |    1   N/A  N/A           30802      C   Model: ROOT-qwen1.5-0.5b-0            20828MiB |
      |    2   N/A  N/A            5620      C   ...lama-box/llama-box-rpc-server        384MiB |
      |    3   N/A  N/A            5618      C   ...lama-box/llama-box-rpc-server        384MiB |
      |    3   N/A  N/A           26237      C   ...ipx/venvs/gpustack/bin/python      20044MiB |
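The shared-memory check can also be done from inside the worker itself, which lets the job fail fast with a clear message instead of segfaulting mid-epoch. A stdlib-only sketch (the helper name is mine):

```python
import os
import shutil

def shm_free_gb(path: str = "/dev/shm") -> float:
    # /dev/shm is the tmpfs that PyTorch DataLoader workers use to share
    # tensors between processes; exhausting it is a classic SIGSEGV cause.
    usage = shutil.disk_usage(path)
    return usage.free / 1024**3

# Fall back to "/" so the sketch also runs on systems without /dev/shm.
path = "/dev/shm" if os.path.isdir("/dev/shm") else "/"
print(f"{path}: {shm_free_gb(path):.1f} GiB free")
```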

3. Single-GPU test: check whether the code runs correctly on one GPU. Test passed.

4. NCCL and multi-GPU communication issues

  • Check GPU connectivity: run nvidia-smi topo -m to inspect the links between GPUs. Link types such as PIX, PXB, or PHB all indicate a working PCIe path (here every pair shows PHB). Check passed.
    •         GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
      GPU0     X      PHB     PHB     PHB     0-31    0               N/A
      GPU1    PHB      X      PHB     PHB     0-31    0               N/A
      GPU2    PHB     PHB      X      PHB     0-31    0               N/A
      GPU3    PHB     PHB     PHB      X      0-31    0               N/A
      
      Legend:
      
        X    = Self
        SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
        NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
        PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
        PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
        PIX  = Connection traversing at most a single PCIe bridge
        NV#  = Connection traversing a bonded set of # NVLinks
  • NCCL log check: set the NCCL environment variables below to get verbose logs; they produced no useful information.
    • export NCCL_DEBUG=INFO
      export NCCL_P2P_DISABLE=1
      export NCCL_IB_DISABLE=1
  • NCCL compatibility check:
    • Current stack: CUDA 12.8 + NCCL 2.21.5 + RTX 4090 + torch 2.5.1

      Previously working stack: CUDA 12.1 + NCCL 2.21.5 + T4 + torch 2.5.1

      Suspicion: the CUDA version and the NCCL version are incompatible.

    • Reinstall torch and NCCL built for the latest supported CUDA driver (12.4):
      • pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
    • Setting backend='gloo' (originally 'nccl') in init_process_group pinpointed the problem: the program ran successfully with gloo, confirming an NCCL compatibility issue.
      •         dist.init_process_group(
                    backend='gloo',
                    init_method='env://',
                    world_size=len(self.gpu_ids),
                    rank=rank
                )
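The backend switch can be wrapped so the same code path tries nccl and falls back to gloo automatically. This is a sketch under my own names (pick_backend, init_group), not the project's code:

```python
def pick_backend(prefer_nccl: bool = True) -> str:
    """Return 'nccl' when CUDA and NCCL look usable, else fall back to 'gloo'."""
    if prefer_nccl:
        try:
            import torch
            import torch.distributed as dist
            if torch.cuda.is_available() and dist.is_nccl_available():
                return "nccl"
        except ImportError:
            pass  # no torch at all: gloo is the only sensible answer
    return "gloo"

def init_group(rank: int, world_size: int) -> None:
    # Mirrors the init_process_group call above, with the backend chosen
    # at runtime instead of hard-coded.
    import torch.distributed as dist
    dist.init_process_group(
        backend=pick_backend(),
        init_method="env://",
        world_size=world_size,
        rank=rank,
    )

print(pick_backend())
```

Keep in mind the fallback only masks the symptom: if nccl itself is broken by a version mismatch, pick_backend will still report it as available, so the version fix below is still needed.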

Resolution

Although the code runs correctly under gloo, gloo is much slower than nccl, so nccl is still the backend to use when performance matters.

Final approach:

  • Adjust the CUDA and NCCL versions. The current CUDA 12.8 is too new; try downgrading to 12.4 or 12.1.
  • Because NCCL usually ships with torch, changing the NCCL version really means changing the torch version. The PyTorch website provides install commands for CUDA 12.1 and 12.4 (link).
  • Alternatively, try PyTorch 2.2.x + CUDA 11.8 (a very stable combination).
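As a launch-time sanity check, the installed torch/CUDA pair can be compared against the combinations discussed above. The helper names and the KNOWN_GOOD set are mine, with the tuples taken from this post:

```python
# Combinations this post treats as safe targets (assumption: taken from the
# downgrade advice above; PyTorch 2.2.x + CUDA 11.8 is a further option).
KNOWN_GOOD = {("2.5.1", "12.4"), ("2.5.1", "12.1")}

def runtime_versions():
    """(torch version, CUDA version the wheel targets), or None without torch."""
    try:
        import torch
    except ImportError:
        return None
    return torch.__version__.split("+")[0], torch.version.cuda

def is_known_good(pair) -> bool:
    return pair in KNOWN_GOOD
```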