std::bodun::blog
PhD student at University of Texas at Austin 🤘. Doing systems for ML.
Set up Slurm across Multiple Machines
Installing Slurm requires admin (root) access on every machine. This post explains how I got Slurm running across multiple Linux servers, all of which run Ubuntu 18.04 LTS.
Set up Munge
First, we need to make sure the clocks, users, and groups (UIDs and GIDs) are synchronized across the cluster. We also need two users, slurm and munge, to exist on every server with the same UID and GID everywhere.
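As a minimal sketch of creating the two accounts with matching IDs, the helper below only prints the `groupadd` and `useradd` commands so they can be reviewed first. The IDs 1001 and 1002 and the helper name `user_cmds` are assumptions for illustration, not values from this setup; pick IDs that are unused on all of your servers.

```shell
# Hypothetical IDs: choose values that are free on every server.
MUNGE_ID=1001
SLURM_ID=1002

# Print the commands that create a group and a user sharing one fixed ID.
# Nothing is changed on the system; this is a dry run.
user_cmds() {
    name=$1
    id=$2
    echo "groupadd -g $id $name"
    echo "useradd -u $id -g $name -s /usr/sbin/nologin $name"
}

user_cmds munge "$MUNGE_ID"
user_cmds slurm "$SLURM_ID"
```

Once the output looks right, it can be piped to a root shell on each server, for example `user_cmds munge 1001 | sudo sh`. Running `id munge` and `id slurm` on each server afterwards should then report the same numbers everywhere.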
Then, we install Munge for authentication:
$ apt install munge libmunge2 libmunge-dev
To check that munge was installed successfully:
$ munge -n | unmunge | grep STATUS
STATUS: Success (0)
Next, we create a munge authentication key on one of the servers:
$ /usr/sbin/create-munge-key
After generating the key, we copy /etc/munge/munge.key from that server to every other server, overwriting any existing /etc/munge/munge.key there.
We then need to set the ownership and permissions for munge on every server:
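One way to distribute the key is `scp`. The sketch below is a dry run that prints one copy command per server. The hostnames `node2` and `node3`, the helper name `copy_key_cmds`, and root SSH access to the other servers are all assumptions; adapt them to however you normally reach your machines.

```shell
# Hypothetical hostnames; replace with the other servers in your cluster.
NODES="node2 node3"
KEY=/etc/munge/munge.key

# Print one scp command per host. -p preserves the key's mode and timestamps.
copy_key_cmds() {
    for host in "$@"; do
        echo "scp -p $KEY root@$host:$KEY"
    done
}

# Dry run: $NODES is left unquoted on purpose so it splits into one host per word.
copy_key_cmds $NODES
```

After reviewing the output, pipe it to `sh` on the server that holds the key. Comparing `sha256sum /etc/munge/munge.key` on each server afterwards is a quick way to confirm that every copy is identical.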
$ chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
$ chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
$ chmod 0755 /run/munge/
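To double-check the result on each server, something like the following could be used. It is a sketch that relies on GNU `stat` (as shipped with Ubuntu); the helper name `check_mode` is hypothetical, and the expected modes are simply the ones from the chmod commands above.

```shell
# Report whether a path has the expected octal mode (GNU stat syntax).
check_mode() {
    path=$1
    want=$2
    have=$(stat -c '%a' "$path") || return 1
    if [ "$have" = "$want" ]; then
        echo "ok: $path is $have"
    else
        echo "MISMATCH: $path is $have, expected $want"
        return 1
    fi
}

# Check each munge directory and remember whether anything was off.
status=0
for d in /etc/munge /var/log/munge /var/lib/munge; do
    check_mode "$d" 700 || status=1
done
check_mode /run/munge 755 || status=1

if [ "$status" -eq 0 ]; then
    echo "all munge directories have the expected modes"
fi
```

Ownership can be inspected the same way with `stat -c '%U:%G'`, which should print `munge:munge` for each of these directories after the chown above.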
Then we enable and start the munge service (remember not to use sudo when running munge):
$ systemctl enable munge
$ systemctl start munge
You can then test whether munge works...