Configuring machine learning experiments

October 20, 2021 — February 16, 2025

computers are awful
faster pussycat
how do science
premature optimization
provenance

A dual problem to experiment tracking is experiment configuration: how can I nicely define experiments and the parameters that make them go?


1 Hydra

If you are working in Python, this does more or less everything. See my Hydra page. Too heavy for some uses.

2 Gin

gin-config configures default parameters in a useful way for ML experiments. It is made by Googlers, as opposed to Hydra, which is made by Facebookers. It is more limited than Hydra, but lighter.

3 Spock

Also looks nifty. However, the code has been untouched since 2023.

spock is a framework that helps users easily define, manage, and use complex parameter configurations within Python applications. It lets you focus on the code you need to write instead of re-implementing boilerplate code such as creating ArgParsers, reading configuration files, handling dependencies, implementing type validation, maintaining traceability, etc.

spock configurations are normal Python classes that are decorated with @spock. It supports inheritance, dynamic class dependencies, loading/saving configurations from/to multiple markdown formats, automatically generating CLI arguments, and hierarchical configuration by composition.

4 💥 Why You Should Use Spock 💥

  • Simple, organised parameter definitions (i.e. a single line)
  • Type checked (static-esque) & frozen parameters (i.e. fail early during long ML training runs)
  • Complex parameter dependencies made simple (i.e. @spock class with a parameter that is an Enum of other @spock classes)
  • Fully serialisable parameter state(s) (i.e. exactly reproduce prior runtime parameter configurations)
  • Automatic type checked CLI generation without ArgParser boilerplate (i.e. click and/or typer for free!)
  • Easily maintain parity between CLIs and Python APIs (i.e. single line changes between CLI and Python API definitions)
  • Unified hyper-parameter definitions and interface (i.e. don’t write different definitions for Ax or Optuna)
  • Resolver that supports value definitions from reference to other defined variables, environmental variables, dynamic template re-injection, and encryption of sensitive values

5 Key features

  • Simple Declaration: Type checked parameters are defined within a @spock decorated class. Supports required/optional and automatic defaults.
  • Easily Managed Parameter Groups: Each class automatically generates its own object within a single namespace.
  • Parameter Inheritance: Classes support inheritance (with lazy evaluation of inheritance/dependencies) allowing for complex configurations derived from a common base set of parameters.
  • Complex Types: Nested Lists/Tuples, List/Tuples of Enum of @spock classes, List of repeated @spock classes
  • Multiple Configuration File Types: Configurations are specified from YAML, TOML, or JSON files.
  • Hierarchical Configuration: Compose from multiple configuration files via simple include statements.
  • Command-Line Overrides: Quickly experiment by overriding a value with automatically generated command line arguments.
  • Immutable: All classes are frozen preventing any misuse or accidental overwrites (to the extent they can be in Python).
  • Tractability and Reproducibility: Save runtime parameter configuration to YAML, TOML, or JSON with a single chained command (with extra runtime info such as Git info, Python version, machine FQDN, etc). The saved markdown file can be used as the configuration input to reproduce prior runtime configurations.
  • Hyper-Parameter Tuner Addon: Provides a unified interface for defining hyper-parameters (via @spockTuner decorator) that supports various tuning/algorithm backends (currently: Optuna, Ax)

6 argbind

Looks simple.

7 Pyrallis

eladrich/pyrallis seems nice, and I am exploring it as an alternative to Hydra. It seems a little less opinionated, which is relaxing: the user faces fewer choices about how to map the configuration onto the code.

The major trick is treating Python dataclasses (introduced in Python 3.7) as first-class citizens. It looks elegant but is not very actively maintained.

8 ml-metadata

ML Metadata (MLMD) is a library for recording and retrieving metadata associated with ML developer and data scientist workflows. MLMD is an integral part of TensorFlow Extended (TFX), but designed so that it can be used independently.

Every run of a production ML pipeline generates metadata containing information about the various pipeline components, their executions (e.g. training runs), and resulting artifacts (e.g. trained models). In the event of unexpected pipeline behaviour or errors, this metadata can be leveraged to analyse the lineage of pipeline components and debug issues. Think of this metadata as the equivalent of logging in software development.

MLMD helps you understand and analyse all the interconnected parts of your ML pipeline instead of analysing them in isolation and can help you answer questions about your ML pipeline such as:

  • Which dataset did the model train on?
  • What were the hyperparameters used to train the model?
  • Which pipeline run created the model?
  • Which training run led to this model?

See MLMD guide.

9 DrWatson.jl

As mentioned under experiment tracking, DrWatson automatically attaches code versions to simulations and does some other work to keep simulations tracked and reproducible. Special feature: Works with Julia, which is my other major language.

10 Configuration.jl

Another Julia entrant.

11 Allennlp Param

Allennlp’s Param system is a kind of introductory training-wheels configuration system, but not recommended in practice. It comes with a lot of baggage — installing it will bring in many fragile and fussy dependencies for language parsing. Once I had used it for a while, I realised all the reasons I would want a better system, which is provided by Hydra.

12 DIY

Why use an external library for this? I could, of course, roll my own. I have done that quite a few times. It is a surprisingly large amount of work, remarkably easy to get wrong, and there are perfectly good tools to do it already.
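For a sense of why rolling your own is more work than it looks, here is a stdlib-only sketch of the usual minimum: dataclass defaults, JSON-file overrides, then CLI overrides, with type casting and fail-early validation. All names are invented for illustration, and this toy already glosses over nesting, enums, and provenance — the work the libraries above have done:

```python
import argparse
import dataclasses
import json
from pathlib import Path


@dataclasses.dataclass(frozen=True)
class Config:
    learning_rate: float = 1e-3
    batch_size: int = 32
    run_name: str = "debug"


def load_config(argv=None) -> Config:
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", type=Path, default=None)
    # One CLI flag per field, reusing the field's type as the caster.
    for field in dataclasses.fields(Config):
        parser.add_argument(f"--{field.name}", type=field.type, default=None)
    args = parser.parse_args(argv)

    values = {}
    if args.config is not None:  # file overrides dataclass defaults
        values.update(json.loads(args.config.read_text()))
    for field in dataclasses.fields(Config):  # CLI overrides file
        cli_value = getattr(args, field.name)
        if cli_value is not None:
            values[field.name] = cli_value
    return Config(**values)  # unknown keys raise TypeError: fail early
```

Even this handles none of the hard parts — hierarchical composition, serialising the resolved config back out, or keeping CLI and API definitions in sync.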