Python Pickling for Parallel Computing and Large Numerical Arrays
August 23, 2016 — February 18, 2025
Pickling is the process of converting Python objects into byte streams, allowing them to be stored or transmitted. This is particularly useful for saving complex data structures like machine learning models or game states. In the context of parallel computing, pickling is how we transfer objects between processes.
1 Basic Python Pickle
1.1 Using the pickle Module
Here’s a simple example of how to serialize and deserialize a Python object using pickle:
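A minimal sketch; the dict and the variable names here are just illustrative:

```python
import pickle

# Any picklable Python object: here, a nested dict standing in for a game state
game_state = {'level': 3, 'score': 4200, 'inventory': ['sword', 'potion']}

# Serialize to a byte string
payload = pickle.dumps(game_state)

# Deserialize back into an equivalent object
restored = pickle.loads(payload)
assert restored == game_state
```

`pickle.dump` / `pickle.load` do the same thing against an open file object instead of a byte string.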
1.2 Performance Considerations
Pickling can be slow for large objects. Using the highest protocol can improve performance; the old Python 2 cPickle module was folded into Python 3 as _pickle, which the pickle module now uses automatically. Here’s how to use the latest protocol:
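For example, pickle.HIGHEST_PROTOCOL selects the newest protocol the running interpreter supports (the exact protocol number and byte sizes printed will vary by Python version):

```python
import pickle

data = list(range(100_000))

# Default protocol vs. the highest available one
default_bytes = pickle.dumps(data)
fast_bytes = pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL)

print(pickle.HIGHEST_PROTOCOL)
print(len(default_bytes), len(fast_bytes))

# Both round-trip to the same object
assert pickle.loads(fast_bytes) == data
```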
2 Handling Large Numerical Arrays
2.1 Efficient Pickling of Numpy Arrays
Libraries like joblib optimize the pickling of numpy arrays by storing them in separate files, reducing memory usage and improving speed. Here’s how I use joblib for efficient array storage:
```python
import numpy as np
from joblib import dump, load

# Create a large numpy array
large_array = np.random.rand(1000, 1000)

# Serialize the array using joblib
dump(large_array, 'large_array.joblib')

# Deserialize the array
loaded_array = load('large_array.joblib')
```
Alternatively, we can use numpy.save() and numpy.load() for efficient array storage and retrieval:
```python
import numpy as np

# Serialize the array
np.save('large_array.npy', large_array)

# Deserialize the array
loaded_array = np.load('large_array.npy')
```
This is not usually amazing, merely convenient. For serious data storage, think about using a purpose-built numerical data file format.
3 Advanced Pickling Libraries
3.1 Cloudpickle
Cloudpickle extends Python’s pickling capabilities to serialize interactively defined functions, which is useful in distributed computing environments. I mostly encounter it via submitit, so I use it all the time. It also works on Windows, making it versatile across operating systems.
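To see why this matters, here is the failure mode cloudpickle works around: the standard-library pickle serializes functions by reference (module plus qualified name), so an anonymous lambda cannot be looked up again and pickling fails. A stdlib-only sketch of that failure:

```python
import pickle

square = lambda x: x ** 2

# Stdlib pickle stores functions by reference, not by value, so a
# lambda (qualified name "<lambda>") cannot be resolved on reload.
try:
    pickle.dumps(square)
    stdlib_ok = True
except Exception as e:
    stdlib_ok = False
    print(f"stdlib pickle failed: {e!r}")

# cloudpickle, by contrast, serializes the function body itself, so
# the equivalent cloudpickle.dumps(square) call succeeds.
```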
3.2 Dill
dill can also pickle almost any Python object. TBH I cannot tell what the relative feature matrix between this and cloudpickle is. They look the same? I think cloudpickle might be more actively developed. dill is particularly useful in parallel computing frameworks like Dask and ipyparallel.
4 Compress pickles, though
When working with large datasets in parallel computing, compression can significantly reduce I/O overhead and network transfer times:
In the simplest case, we can use a standard-library compression module like bz2:
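For example (the filename is illustrative; gzip and lzma expose the same open() interface, so they slot in the same way):

```python
import bz2
import pickle

data = {'samples': list(range(10_000)), 'label': 'calibration run'}

# Write a bz2-compressed pickle; bz2.open returns a file-like object
# that pickle can write to directly
with bz2.open('data.pkl.bz2', 'wb') as f:
    pickle.dump(data, f)

# Read it back, decompressing transparently
with bz2.open('data.pkl.bz2', 'rb') as f:
    restored = pickle.load(f)

assert restored == data
```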
Joblib accepts a compression parameter:
```python
from joblib import dump, load

# Compress numpy arrays with zlib (level 3: decent ratio, still fast)
dump(large_array, 'array.joblib.z', compress=('zlib', 3))

# High-compression LZMA (slow but compact)
dump(large_array, 'array.joblib.xz', compress=('lzma', 9))
```
cloudpickle composes with the standard-library compressors too:

```python
import cloudpickle
import lzma

# Create an object with non-trivial serialization requirements
data = {
    'model': lambda x: x**2 + 3*x + 5,  # Cloudpickle can serialize lambdas
    'config': {'batch_size': 256, 'learning_rate': 0.001}
}

# Serialize with LZMA compression (level 6 for balance)
with lzma.open('model_data.xz', 'wb',
               preset=6, format=lzma.FORMAT_XZ) as f:
    cloudpickle.dump(data, f)

# Deserialize later
with lzma.open('model_data.xz', 'rb') as f:
    loaded_data = cloudpickle.load(f)

print(loaded_data['model'](2))  # Output: 2**2 + 3*2 + 5 = 15
```
I got perplexity.ai to summarise recommendations about which compression algorithm to use:
| Algorithm | Speed | Ratio | Best Use Case |
|---|---|---|---|
| LZ4 | ★★★★★ | ★★☆☆☆ | Fast inter-process sharing |
| gzip | ★★★☆☆ | ★★★☆☆ | General purpose |
| lzma | ★★☆☆☆ | ★★★★★ | Long-term storage |
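As a rough stdlib-only illustration of the size side of that trade-off (a deliberately compressible toy payload; real ratios depend entirely on your data):

```python
import gzip
import lzma
import pickle

# Highly compressible toy payload
data = [0.0] * 100_000
raw = pickle.dumps(data)

gz = gzip.compress(raw)             # general-purpose middle ground
xz = lzma.compress(raw, preset=9)   # slow, best ratio

print(f"raw:  {len(raw):>8} bytes")
print(f"gzip: {len(gz):>8} bytes")
print(f"lzma: {len(xz):>8} bytes")
```

LZ4 is not in the standard library, but the third-party lz4 package (and joblib’s `compress='lz4'` option, when lz4 is installed) fills the fast end of the table.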