Python Pickling for Parallel Computing and Large Numerical Arrays

August 23, 2016 — February 18, 2025

computers are awful
concurrency hell
distributed
number crunching
premature optimization
workflow

Pickling is the process of converting Python objects into byte streams, allowing them to be stored or transmitted. This is particularly useful for saving complex data structures like machine learning models or game states. In the context of parallel computing, pickling is how we transfer objects between processes.
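As a quick illustration of that last point, the multiprocessing module pickles arguments and return values behind the scenes whenever it ships work to a worker process. A minimal sketch (the function here is arbitrary):

import multiprocessing as mp

def square(x):
    # Arguments and return values are pickled when they cross
    # the process boundary.
    return x * x

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:
        results = pool.map(square, range(10))
    print(results)  # [0, 1, 4, 9, ..., 81]

Note that the function itself is sent by reference (its module and qualified name), which is why interactively defined functions need the libraries in section 3.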


1 Basic Python Pickle

1.1 Using the pickle Module

Here’s a simple example of how to serialize and deserialize a Python object using pickle:

import pickle

# Example object
data = {'name': 'John', 'age': 30}

# Serialize the object
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f)

# Deserialize the object
with open('data.pkl', 'rb') as f:
    loaded_data = pickle.load(f)

print(loaded_data)  # Output: {'name': 'John', 'age': 30}

1.2 Performance Considerations

Pickling can be slow for large objects. In Python 2 you could switch to cPickle for a C-speed implementation; in Python 3 that C implementation (_pickle) is used automatically, so the main remaining lever is choosing a newer protocol. Here’s how to use the highest available protocol:

import pickle

# Serialize using the highest protocol
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

2 Handling Large Numerical Arrays

2.1 Efficient Pickling of Numpy Arrays

Libraries like joblib optimize the pickling of numpy arrays by storing them in separate files, reducing memory usage and improving speed. Here’s how I use joblib for efficient array storage:

import numpy as np
from joblib import dump, load

# Create a large numpy array
large_array = np.random.rand(1000, 1000)

# Serialize the array using joblib
dump(large_array, 'large_array.joblib')

# Deserialize the array
loaded_array = load('large_array.joblib')

Alternatively, we can use numpy.save() and numpy.load() for efficient array storage and retrieval:

import numpy as np

# Serialize the array
np.save('large_array.npy', large_array)

# Deserialize the array
loaded_array = np.load('large_array.npy')

This is not usually amazing, merely convenient. For serious data storage, think about using a dedicated numerical data file format.
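
For example, HDF5 (via the h5py package) gives chunked, compressed, partially readable storage. A minimal sketch, assuming h5py is installed and reusing large_array from above:

import h5py
import numpy as np

large_array = np.random.rand(1000, 1000)

# Write with transparent gzip compression
with h5py.File('large_array.h5', 'w') as f:
    f.create_dataset('data', data=large_array, compression='gzip')

# Read back a slice without loading the whole array into memory
with h5py.File('large_array.h5', 'r') as f:
    first_rows = f['data'][:10]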

3 Advanced Pickling Libraries

3.1 Cloudpickle

Cloudpickle extends Python’s pickling capabilities to serialize interactively defined functions (lambdas, closures, and functions defined in __main__), which standard pickle handles only by reference and therefore cannot ship to worker processes. That makes it useful in distributed computing environments. It is most useful for me in submitit, so I use it all the time. It also works on Windows, which makes it portable across operating systems.
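
For instance, plain pickle refuses to serialize a lambda, while cloudpickle serializes it by value. A minimal sketch (the variable names are arbitrary):

import cloudpickle

# A lambda defined on the fly; plain pickle would raise PicklingError here
transform = lambda x: 2 * x + 1

# Serialize to bytes, e.g. to ship to a remote worker
payload = cloudpickle.dumps(transform)

# Deserialize on the other side (ordinary pickle.loads also understands the bytes)
restored = cloudpickle.loads(payload)
print(restored(10))  # Output: 21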

3.2 Dill

dill can also pickle almost any Python object. TBH I cannot tell what the relative feature matrix between this and cloudpickle looks like; they seem much the same, though cloudpickle appears to be more actively maintained. dill is particularly useful in parallel computing frameworks such as ipyparallel and pathos (Dask, by contrast, uses cloudpickle).
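
The API mirrors the standard library, so switching is mostly a matter of swapping the import. A minimal sketch:

import dill

def make_adder(n):
    # Returns a closure, which plain pickle cannot serialize
    return lambda x: x + n

add_five = make_adder(5)

payload = dill.dumps(add_five)
restored = dill.loads(payload)
print(restored(3))  # Output: 8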

4 Compress pickles, though

When working with large datasets in parallel computing, compression can significantly reduce I/O overhead and network transfer times:

In the simplest case, we can use a standard-library compression module like bz2:

import bz2
import pickle

with bz2.open('data.pkl.bz2', 'wb') as f:
    pickle.dump(data, f, protocol=5)

Joblib accepts a compression parameter:

from joblib import dump, load

# Compress numpy arrays with zlib
dump(large_array, 'array.joblib.z', compress=('zlib', 3))

# High-compression LZMA (slow but compact)
dump(large_array, 'array.joblib.xz', compress=('lzma', 9))

Compression also combines with cloudpickle, for objects that plain pickle cannot serialize at all:

import cloudpickle
import lzma

# Create an object with non-trivial serialization requirements
data = {
    'model': lambda x: x**2 + 3*x + 5,  # Cloudpickle can serialize lambdas
    'config': {'batch_size': 256, 'learning_rate': 0.001}
}

# Serialize with LZMA compression (level 6 for balance)
with lzma.open('model_data.xz', 'wb',
               preset=6, format=lzma.FORMAT_XZ) as f:
    cloudpickle.dump(data, f)

# Deserialize later
with lzma.open('model_data.xz', 'rb') as f:
    loaded_data = cloudpickle.load(f)

print(loaded_data['model'](2))  # Output: 2² + 3*2 + 5 = 15

I got perplexity.ai to summarise recommendations about which compression algorithm to use:

Algorithm   Speed    Compression ratio   Best use case
LZ4         ★★★★★    ★★☆☆☆               Fast inter-process sharing
gzip        ★★★☆☆    ★★★☆☆               General purpose
lzma        ★★☆☆☆    ★★★★★               Long-term storage
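
For the LZ4 row, here is a hedged sketch using the third-party lz4 package (as far as I can tell, joblib’s dump also accepts compress=('lz4', 3) when that package is installed):

import pickle

import lz4.frame  # third-party python-lz4 package
import numpy as np

large_array = np.random.rand(1000, 1000)

# LZ4: much faster than gzip or lzma, at the cost of a weaker compression ratio
with lz4.frame.open('large_array.pkl.lz4', 'wb') as f:
    pickle.dump(large_array, f, protocol=pickle.HIGHEST_PROTOCOL)

with lz4.frame.open('large_array.pkl.lz4', 'rb') as f:
    restored = pickle.load(f)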