std::bodun::blog

PhD student at University of Texas at Austin 🤘. Doing systems for ML.

马上订阅 std::bodun::blog RSS 更新: https://www.bodunhu.com/blog/index.xml

Set up Slurm across Multiple Machines

2021年11月16日 08:00

To install Slurm, we need to have admin access to the machine. This post explains how I got Slurm running in multiple Linux servers. All servers are running on Ubuntu 18.04 LTS.

Setup Munge

First, we need to make sure the clocks, users and groups (UIDs and GIDs) are synchronized across the cluster. We need to create two users: slurm and munge across all servers. z

Then, we install Munge for authentication:

$ apt install munge libmunge2 libmunge-dev

To test if munge is installed successfully:

$ munge -n | unmunge | grep STATUS
STATUS: Success (0)

Next, we create a munge authentication key on one of the servers:

$ /usr/sbin/create-munge-key

After we generate munge authentication key, we copy the key /etc/munge/munge.key on that server to all other servers (overwrite the /etc/munge/munge.key on all other servers).

We need to setup the rights for munge accordingly on every server:

$ chown -R munge: /etc/munge/ /var/log/munge/ /var/lib/munge/ /run/munge/
$ chmod 0700 /etc/munge/ /var/log/munge/ /var/lib/munge/
$ chmod 0755 /run/munge/

Then, we enable and start the munge service with (remember to not use sudo when running munge):

$ systemctl enable munge
$ systemctl start munge

You can then test whether munge works...

剩余内容已隐藏

查看完整文章以阅读更多