submitit

A less horrible way to parallelize on HPC clusters

March 9, 2018 — November 20, 2024

computers are awful
computers are awful together
concurrency hell
distributed
premature optimization

My current go-to option for Python HPC jobs. If anyone has ever told you to run some code by creating a bash script full of esoteric cruft like #SBATCH -N 10, then you have experienced how user-hostile classic computer clusters are.

submitit makes all that nonsense go away and keeps the process of executing code on the cluster much more like classic parallel code execution on Python, which is to say, annoying but tolerable. Which is better than the wading-through-punctuation-swamp vibe of submitting jobs by vanilla SLURM or TORQUE.

1 Examples

It looks like this in basic form:

import submitit

def add(a, b):
    return a + b

# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="log_test")
# set timeout in min, and partition for running the job
executor.update_parameters(
    timeout_min=1, slurm_partition="dev",
    tasks_per_node=4  # number of tasks per node
)
job = executor.submit(add, 5, 7)  # will compute add(5, 7)
print(job.job_id)  # ID of your job

output = job.result()  # waits for completion and returns output
assert output == 12  # 5 + 7 = 12... your addition was computed in the cluster
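
Since tasks_per_node is 4 above, the function is (as far as I can tell) launched once per task; here is a small sketch of collecting every task’s output rather than just the first:

# job.result() returns the first task's output;
# job.results() (note the plural) should return a list with one entry per task
all_outputs = job.results()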

The docs could be better; usage is mostly learned by example.

Here is a more advanced pattern I use when running a bunch of experiments via submitit:

import bz2
import cloudpickle
import submitit

job_name = "my_cool_job"

def expensive_calc(a):
    return a + 1

# executor is the submission interface (logs are dumped in the folder)
executor = submitit.AutoExecutor(folder="log_test")
# set timeout in min, and partition for running the job
executor.update_parameters(
    timeout_min=1,
    slurm_partition="dev", #misc slurm args
    tasks_per_node=4,  # number of cores
    mem=8,  # memory in GB I think
)

#submit 10 jobs:
jobs = [executor.submit(expensive_calc, num) for num in range(10)]

# but actually maybe we want to come back to those jobs later, so let’s save them to disk
with bz2.open(job_name + ".job.pkl.bz2", "wb") as f:
    cloudpickle.dump(jobs, f)

# We can quit this python session and do something else now
# Resume after quitting
with bz2.open(job_name + ".job.pkl.bz2", "rb") as f:
    jobs = cloudpickle.load(f)

# wait for all jobs to finish
for job in jobs:
    job.wait()

res_list = []
# # Alternatively, append these results to those of a previous run:
# with bz2.open(job_name + ".result.pkl.bz2", "rb") as f:
#     res_list.extend(cloudpickle.load(f))

# Examine job outputs for failures etc
fail_ids = [job.job_id for job in jobs if job.state not in ('DONE', 'COMPLETED')]
res_list.extend([job.result() for job in jobs if job.job_id not in fail_ids])
failures = [job for job in jobs if job.job_id in fail_ids]

if failures:
    print("failures")
    print("===")
    for job in failures:
        print(job.state, job.stderr())

# Save results to disk
with bz2.open(job_name + ".result.pkl.bz2", "wb") as f:
    cloudpickle.dump(res_list, f)
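
Aside: when the jobs are all the same function mapped over different inputs, as here, submitit can also batch them into a single SLURM job array with executor.map_array, which tends to be kinder to the scheduler. A minimal sketch, reusing the executor above:

# Equivalent to the ten individual submit() calls, but as one SLURM job array
jobs = executor.map_array(expensive_calc, range(10))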

Cool feature: the spawning script only needs to survive as long as it takes to put jobs on the queue, and then it can die. Later on we can reload those jobs from disk as if it all happened in the same session.
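
In that later session it is handy to peek at progress without blocking. A small sketch, using the job attributes that already appear above plus Job.done():

# Non-blocking status check after reloading `jobs` from the pickle;
# job.done() returns immediately, unlike job.wait() or job.result()
for job in jobs:
    print(job.job_id, job.state, "finished" if job.done() else "pending/running")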

2 Gotchas

2.1 DebugExecutor

When developing a function with submitit, it might seem convenient to use DebugExecutor to work out why things are crashing. But DebugExecutor is a very different execution environment: code runs sequentially, in-process, so any failure or complexity that arises from concurrency will look quite different under DebugExecutor.

If we don’t want to debug concurrency problems, but just the function itself, it is often easier to execute the function in the normal way. executor = DebugExecutor(folder); executor.submit(func, *args, **kwargs) does not get us much that we do not already get from running func(*args, **kwargs) directly, except a drop-in replacement for submitit.AutoExecutor.
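
To make that comparison concrete, here is a minimal sketch (the toy function and folder name are mine, not from the submitit docs):

import submitit

def might_crash(x):
    return 1 / x

# Running might_crash(0) directly gives an ordinary traceback, and we can
# attach whatever debugger we like. The DebugExecutor route below is the same
# in-process, sequential execution, just behind the executor interface:
executor = submitit.DebugExecutor(folder="log_test")
job = executor.submit(might_crash, 0)
job.result()  # the error surfaces here, and (see below) pdb/ipdb opens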

What DebugExecutor does get us is one small extra thing that I do not want: it always invokes pdb (or ipdb), the interactive Python debuggers, upon errors. As such, DebugExecutor cannot be run from within VS Code, because VS Code does not support those interactive debuggers. I dislike this feature strongly; much as I love debuggers, I find it best to invoke them myself, thank you very much.
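
If the point is to exercise code off-cluster without the forced debugger, submitit also ships a LocalExecutor, which runs each job in a separate local process. A minimal sketch, reusing expensive_calc from the example above:

import submitit

def expensive_calc(a):
    return a + 1

# LocalExecutor runs the job in a separate local process: no SLURM,
# no forced pdb, but still job objects and logs in the folder
executor = submitit.LocalExecutor(folder="log_test")
job = executor.submit(expensive_calc, 1)
print(job.result())  # -> 2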

2.2 Weird argument names

submitit believes that the slurm_mem argument requisitions memory in GB, but SLURM interprets it as MB by default (see --mem). There is an alternative argument, mem_gb, which does allocate in GB.
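
A sketch of the two less ambiguous spellings (reusing the executor from the examples above; the exact string passed to slurm_mem is my assumption, but SLURM’s --mem accepts unit suffixes):

# Generic submitit parameter, documented in gigabytes
executor.update_parameters(mem_gb=8)
# Or pass straight through to SLURM's --mem with an explicit unit suffix
executor.update_parameters(slurm_mem="8G")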

3 hydra

Pro tip: Submitit integrates with hydra via the hydra-submitit-launcher plugin, so a hydra --multirun sweep can launch each configuration as its own SLURM job.