Python Pickling for Parallel Computing and Large Numerical Arrays

August 23, 2016 — February 18, 2025

computers are awful
concurrency hell
distributed
number crunching
premature optimization
workflow

Pickling is the process of converting Python objects into byte streams, allowing them to be stored or transmitted. This is particularly useful for saving complex data structures like machine learning models or game states. In the context of parallel computing, pickling is how we transfer objects between processes.
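As a quick illustration of that last point, the multiprocessing module pickles arguments and return values behind the scenes whenever it ships work to a worker process. A minimal sketch (the function here is arbitrary):

import multiprocessing as mp

def square(x):
    # Arguments and return values are pickled when they cross
    # the process boundary.
    return x * x

if __name__ == '__main__':
    with mp.Pool(processes=4) as pool:
        results = pool.map(square, range(10))
    print(results)  # [0, 1, 4, 9, ..., 81]

Note that the function itself is sent by reference (its module and qualified name), which is why interactively defined functions need the libraries in section 3.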


1 Basic Python Pickle

1.1 Using the pickle Module

Here’s a simple example of how to serialize and deserialize a Python object using pickle:

import pickle

# Example object
data = {'name': 'John', 'age': 30}

# Serialize the object
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f)

# Deserialize the object
with open('data.pkl', 'rb') as f:
    loaded_data = pickle.load(f)

print(loaded_data)  # Output: {'name': 'John', 'age': 30}

1.2 Performance Considerations

Pickling can be slow for large objects. In Python 2 you could switch to cPickle for a C-speed implementation; in Python 3 that C implementation (_pickle) is used automatically, so the main remaining lever is choosing a newer protocol. Here’s how to use the highest available protocol:

import pickle

# Serialize using the highest protocol
with open('data.pkl', 'wb') as f:
    pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)

2 Handling Large Numerical Arrays

2.1 Efficient Pickling of Numpy Arrays

Libraries like joblib optimize the pickling of numpy arrays by storing them in separate files, reducing memory usage and improving speed. Here’s how I use joblib for efficient array storage:

import numpy as np
from joblib import dump, load

# Create a large numpy array
large_array = np.random.rand(1000, 1000)

# Serialize the array using joblib
dump(large_array, 'large_array.joblib')

# Deserialize the array
loaded_array = load('large_array.joblib')

Alternatively, we can use numpy.save() and numpy.load() for efficient array storage and retrieval:

import numpy as np

# Serialize the array
np.save('large_array.npy', large_array)

# Deserialize the array
loaded_array = np.load('large_array.npy')

This is not usually amazing, merely convenient. For serious data storage, think about using a dedicated numerical data file format.
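
For example, HDF5 (via the h5py package) gives chunked, compressed, partially readable storage. A minimal sketch, assuming h5py is installed and reusing large_array from above:

import h5py
import numpy as np

large_array = np.random.rand(1000, 1000)

# Write with transparent gzip compression
with h5py.File('large_array.h5', 'w') as f:
    f.create_dataset('data', data=large_array, compression='gzip')

# Read back a slice without loading the whole array into memory
with h5py.File('large_array.h5', 'r') as f:
    first_rows = f['data'][:10]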

3 Advanced Pickling Libraries

3.1 Cloudpickle

Cloudpickle extends Python’s pickling capabilities to serialize interactively defined functions (lambdas, closures, and functions defined in __main__), which standard pickle handles only by reference and therefore cannot ship to worker processes. That makes it useful in distributed computing environments. It is most useful for me in submitit, so I use it all the time. It also works on Windows, which makes it portable across operating systems.
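
For instance, plain pickle refuses to serialize a lambda, while cloudpickle serializes it by value. A minimal sketch (the variable names are arbitrary):

import cloudpickle

# A lambda defined on the fly; plain pickle would raise PicklingError here
transform = lambda x: 2 * x + 1

# Serialize to bytes, e.g. to ship to a remote worker
payload = cloudpickle.dumps(transform)

# Deserialize on the other side (ordinary pickle.loads also understands the bytes)
restored = cloudpickle.loads(payload)
print(restored(10))  # Output: 21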

3.2 Dill

dill can also pickle almost any Python object. TBH I cannot tell what the relative feature matrix between this and cloudpickle looks like; they seem much the same, though cloudpickle appears to be more actively maintained. dill is particularly useful in parallel computing frameworks such as ipyparallel and pathos (Dask, by contrast, uses cloudpickle).
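
The API mirrors the standard library, so switching is mostly a matter of swapping the import. A minimal sketch:

import dill

def make_adder(n):
    # Returns a closure, which plain pickle cannot serialize
    return lambda x: x + n

add_five = make_adder(5)

payload = dill.dumps(add_five)
restored = dill.loads(payload)
print(restored(3))  # Output: 8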

4 Compress pickles, though

When working with large datasets in parallel computing, compression can significantly reduce I/O overhead and network transfer times:

In the simplest case, we can use a standard-library compression module like bz2:

import bz2
import pickle

with bz2.open('data.pkl.bz2', 'wb') as f:
    pickle.dump(data, f, protocol=5)

Joblib accepts a compression parameter:

from joblib import dump, load

# Compress numpy arrays with zlib
dump(large_array, 'array.joblib.z', compress=('zlib', 3))

# High-compression LZMA (slow but compact)
dump(large_array, 'array.joblib.xz', compress=('lzma', 9))

Compression also combines with cloudpickle, for objects that plain pickle cannot serialize at all:

import cloudpickle
import lzma

# Create an object with non-trivial serialization requirements
data = {
    'model': lambda x: x**2 + 3*x + 5,  # Cloudpickle can serialize lambdas
    'config': {'batch_size': 256, 'learning_rate': 0.001}
}

# Serialize with LZMA compression (level 6 for balance)
with lzma.open('model_data.xz', 'wb',
               preset=6, format=lzma.FORMAT_XZ) as f:
    cloudpickle.dump(data, f)

# Deserialize later
with lzma.open('model_data.xz', 'rb') as f:
    loaded_data = cloudpickle.load(f)

print(loaded_data['model'](2))  # Output: 2² + 3*2 + 5 = 15

I got perplexity.ai to summarise recommendations about which compression algorithm to use:

Algorithm   Speed    Compression ratio   Best use case
LZ4         ★★★★★    ★★☆☆☆               Fast inter-process sharing
gzip        ★★★☆☆    ★★★☆☆               General purpose
lzma        ★★☆☆☆    ★★★★★               Long-term storage
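
For the LZ4 row, here is a hedged sketch using the third-party lz4 package (as far as I can tell, joblib’s dump also accepts compress=('lz4', 3) when that package is installed):

import pickle

import lz4.frame  # third-party python-lz4 package
import numpy as np

large_array = np.random.rand(1000, 1000)

# LZ4: much faster than gzip or lzma, at the cost of a weaker compression ratio
with lz4.frame.open('large_array.pkl.lz4', 'wb') as f:
    pickle.dump(large_array, f, protocol=pickle.HIGHEST_PROTOCOL)

with lz4.frame.open('large_array.pkl.lz4', 'rb') as f:
    restored = pickle.load(f)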