Scheduling jobs on HPC clusters for modern ML nerds
In Soviet Russia, job puts YOU in queue
March 9, 2018 — April 28, 2023
Doing stuff on classic HPC clusters.
Slurm, Torque, and Platform LSF all implement a similar API providing concurrency guarantees specified by the famous Byzantine committee-designed greasy-totem-pole priority system. Empirical observation: the IT department for any given cluster often seems reluctant to document which one they are using. Typically a campus cluster will come with some gruff example commands that worked for that guy that time, but not much more. Usually that guy that time was running a molecular simulation package written in some language I have never heard of, or alternatively one I wish I could forget having heard of. Presumably this is often a combination of the understandable desire not to write documentation for all the bizarre idiosyncratic use cases, and a kind of availability-through-obscurity demand management. They are typically less eager to allocate GPUs, slightly confused by all this modern neural-network stuff, and downright flabbergasted by containers. I've lived through this transition, from classic compute to GPU-everything; modern sysadmins know all about GPUs.
To investigate: apparently there is a modern programmatic API to some of the classic schedulers, DRMAA (Distributed Resource Management Application API), which allows fairly generic job definition and which works on my local cluster, although they have not documented how.
Anyway, here are some methods for getting stuff done that work well for my use cases, which tend towards statistical inference, neural nets, etc.
1 submitit
My current go-to option for Python. I use this so much that I made a submitit notebook. Go there.
See also hydra ML.
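For flavour, here is a minimal sketch of what submitit usage looks like; the partition name and resource numbers are placeholders for whatever your site provides.
import submitit

def add(a, b):
    return a + b

# AutoExecutor writes the sbatch script and submits it for us;
# logs and pickled results land in the given folder.
executor = submitit.AutoExecutor(folder="submitit_logs")
executor.update_parameters(
    timeout_min=60,
    cpus_per_task=4,
    slurm_partition="work",  # hypothetical partition name
)

job = executor.submit(add, 5, 7)  # runs add(5, 7) as a Slurm job
print(job.result())               # blocks until the job finishes -> 12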
What follows are some other options I do not frequently use.
2 The here-document trick
#!/usr/bin/env sh
#SBATCH -N 10         ## request 10 nodes
#SBATCH -n 8          ## and 8 tasks in total
#SBATCH -o %x-%j.out  ## log to <job name>-<job id>.out
module load julia/1.6.1 ## I have to load julia before calling julia
julia << EOF
using SomePackage
do_julia_stuff
EOF
Did you see what happened? We invoked our preferred programming language from inside the job-submission shell script, passing it the program as a here-document.
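Save it as, say, job.sh and submit it with sbatch job.sh; everything between julia << EOF and the closing EOF is fed to julia on standard input once the scheduler starts the job.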
3 Request a multicore job from the scheduler and manage it like a mini cluster in Python
Dask.distributed apparently works well for multi-machine jobs on the cluster, and (via the companion dask-jobqueue package) will even spawn the Slurm jobs for you.
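A minimal sketch of that pattern using dask-jobqueue's SLURMCluster; the partition name and resource sizes here are assumptions.
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Each "job" is one Slurm allocation; dask-jobqueue writes and
# submits the batch scripts behind the scenes.
cluster = SLURMCluster(
    cores=8,             # cores per job
    memory="16GB",       # memory per job
    walltime="01:00:00",
    queue="work",        # hypothetical partition name
)
cluster.scale(jobs=4)    # ask Slurm for 4 such allocations

client = Client(cluster)
futures = client.map(lambda x: x ** 2, range(100))
print(sum(client.gather(futures)))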
Easily distributing a parallel IPython Notebook on a cluster:
Have you ever asked yourself: “Do I want to spend 2 days adjusting this analysis to run on the cluster and wait 2 days for the jobs to finish, or do I just run it locally with no extra work and just wait a week?”
ipython-cluster-helper automates that.
“Quickly and easily parallelize Python functions using IPython on a cluster, supporting multiple schedulers. Optimizes IPython defaults to handle larger clusters and simultaneous processes.” […]
ipython-cluster-helper creates a throwaway parallel IPython profile, launches a cluster and returns a view. On program exit it shuts the cluster down and deletes the throwaway profile.
Works on Platform LSF, Sun Grid Engine, Torque and SLURM. Strictly Python.
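A minimal sketch following the pattern in the ipython-cluster-helper README; the queue name is an assumption for your site.
from cluster_helper.cluster import cluster_view

def long_computation(x):
    return x ** 2

# cluster_view launches an IPython cluster as scheduler jobs, hands us
# a view onto it, and tears everything down again on exit.
with cluster_view(scheduler="slurm", queue="work", num_jobs=10) as view:
    results = view.map(long_computation, range(100))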
4 test-tube
This one seems to be discontinued.
An alternative option for many use cases is test-tube, a “Python library to easily log experiments and parallelize hyperparameter search for neural networks”. AFAICT there is nothing neural-network-specific in this, and it will happily schedule a whole bunch of useful types of task, generating the necessary scripts and keeping track of what is going on. This functionality is not obvious from the front-page description of the library, but see test-tube/SlurmCluster.md. (Thanks for pointing me to this, Chris Jackett.)
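For flavour, a sketch adapted from that SlurmCluster.md document; the log path, training function and hyperparameter range are all placeholders, so treat this as a shape rather than a recipe.
from test_tube import HyperOptArgumentParser
from test_tube.hpc import SlurmCluster

def train(hparams, *args):
    # your actual experiment goes here
    print("training with lr =", hparams.learning_rate)

parser = HyperOptArgumentParser(strategy="random_search")
parser.opt_range("--learning_rate", default=0.001, type=float,
                 tunable=True, low=1e-4, high=1e-1, nb_samples=8)
hyperparams = parser.parse_args()

cluster = SlurmCluster(
    hyperparam_optimizer=hyperparams,
    log_path="/scratch/me/tt_logs",  # hypothetical path
    python_cmd="python3",
)
cluster.per_experiment_nb_gpus = 1
cluster.per_experiment_nb_nodes = 1

# generates one sbatch script per trial and submits the lot
cluster.optimize_parallel_cluster_gpu(train, nb_trials=8, job_name="tt_sweep")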
5 Misc Python
See also DRMAA Python, which is a Python wrapper around the DRMAA API.
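A minimal sketch of drmaa-python usage, assuming the cluster's DRMAA library is discoverable (e.g. via the DRMAA_LIBRARY_PATH environment variable); the job script path is a placeholder.
import drmaa

s = drmaa.Session()
s.initialize()

jt = s.createJobTemplate()
jt.remoteCommand = "/home/me/myscript.sh"  # hypothetical job script
jt.args = ["--seed", "42"]
jt.joinFiles = True  # merge stdout and stderr

job_id = s.runJob(jt)
info = s.wait(job_id, drmaa.Session.TIMEOUT_WAIT_FOREVER)
print("job", job_id, "exited with status", info.exitStatus)

s.deleteJobTemplate(jt)
s.exit()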
Other ones I looked at: Andrea Zonca wrote a script that allows spawning jobs on a cluster from a Jupyter notebook. After several iterations and improvements it is now called batchspawner.
snakemake supports make-like build workflows for clusters. Seems general and powerful, but complicated.
6 Hadoop on the cluster
hanythingondemand provides a set of scripts to easily set up an ad-hoc Hadoop cluster through PBS jobs.
7 Misc Julia
In Julia there is a rather fancy system, JuliaParallel/ClusterManagers.jl, which supports many of the major HPC job managers automatically.
There is also a bare-bones option, cth/QsubCmds.jl: run Julia external (shell) commands on an HPC cluster.
8 Luxury parallelism with pipelines and coordination
More modern tools facilitate very sophisticated workflows, with execution graphs and pipelines and such. One that was briefly pitched to us, but which I did not ultimately use: nextflow
Nextflow enables scalable and reproducible scientific workflows using software containers. It allows the adaptation of pipelines written in the most common scripting languages.
Its fluent DSL simplifies the implementation and the deployment of complex parallel and reactive workflows on clouds and clusters.
Nextflow supports Docker and Singularity containers technology.
This, along with the integration of the GitHub code sharing platform, allows you to write self-contained pipelines, manage versions and to rapidly reproduce any former configuration.
It provides out of the box executors for SGE, LSF, SLURM, PBS and HTCondor batch schedulers and for Kubernetes, Amazon AWS and Google Cloud platforms.
I think that we are actually going to be given D2iQ Kaptain: End-to-End Machine Learning Platform instead? TBC.