Running Python tasks in parallel
There are many options for running Python tasks in parallel. This post gives a brief description of some of them.
threading and multiprocessing are part of the standard library. threading is used for thread-based parallelism, while multiprocessing is used for process-based parallelism. If your tasks involve Python objects that hold the Global Interpreter Lock (GIL), then threading does not provide much parallelism, and you should opt for multiprocessing. On the other hand, if your tasks involve NumPy objects that release the GIL, then threading is a better choice. In general, threading has less overhead than multiprocessing.
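As a minimal sketch (using a toy squaring function of my own), thread-based parallelism with threading looks like this. Note that because the pure-Python work below holds the GIL, the threads run concurrently but not truly in parallel; I/O or NumPy calls that release the GIL would:

```python
import threading

def work(i, results):
    # Pure-Python computation holds the GIL, so these threads interleave
    # rather than run simultaneously on multiple cores.
    results[i] = i * i

results = [None] * 4
threads = [threading.Thread(target=work, args=(i, results)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)  # [0, 1, 4, 9]
```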
concurrent.futures is also part of the standard library. It provides a high-level interface for asynchronous parallelism (using Future objects). You can easily choose between a thread pool or a process pool by using ThreadPoolExecutor or ProcessPoolExecutor. If you are running user-level code, I believe concurrent.futures is usually the best option. But if you are writing more low-level code, I believe you still want to choose between threading and multiprocessing. Note that you can still use a pool of threads from multiprocessing by using multiprocessing.pool.ThreadPool.
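To make the Future-based interface concrete, here is a small sketch (with a toy square function of my own) that submits tasks to a ThreadPoolExecutor and collects results as they finish:

```python
import concurrent.futures

def square(x):
    return x * x

results = {}
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns a Future immediately; as_completed() yields each
    # Future as soon as its result is ready, in completion order.
    futures = {executor.submit(square, i): i for i in range(10)}
    for fut in concurrent.futures.as_completed(futures):
        results[futures[fut]] = fut.result()

print(results[3])  # 9
```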
dask is a powerful library that helps with the common pains of dealing with large data and parallel computing (e.g. delayed evaluation, lazy loading, array chunking, distributed computing). There are two kinds of schedulers: the single-machine scheduler (default) and the more advanced dask.distributed. The advanced dask.distributed provides asynchronous parallelism similar to concurrent.futures and can be used on a cluster. But if you are just running code on a single machine, the default scheduler should suffice (it requires zero setup). You can set the scheduler to “threads”, “processes” or “single-threaded”. The Best Practices pages (1, 2, 3) contain some very useful examples.
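A minimal sketch of the default single-machine scheduler, assuming dask is installed and using a toy square function:

```python
import dask

@dask.delayed
def square(x):
    # dask.delayed makes the call lazy; nothing runs until compute()
    return x * x

tasks = [square(i) for i in range(10)]
# Pick the scheduler explicitly: "threads", "processes" or "single-threaded"
result = dask.compute(*tasks, scheduler="threads")
```

dask.compute returns a tuple with one entry per delayed task.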
joblib is a lightweight library that also provides lazy evaluation and parallel computing. For process-based parallelism, it uses the alternative serialization library cloudpickle, which allows you to serialize more things than the standard pickle.
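A minimal joblib sketch, assuming joblib is installed and using a toy square function; the Parallel/delayed pair is the core of its API:

```python
from joblib import Parallel, delayed

def square(x):
    return x * x

# prefer="threads" asks for the thread-based backend; drop it to let
# joblib pick its process-based default (loky).
result = Parallel(n_jobs=2, prefer="threads")(
    delayed(square)(i) for i in range(10)
)
```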
I want to mention that Keras and TensorFlow also have built-in parallelism. So, if you are dealing with ML stuff, you should be able to use what is offered by Keras/TensorFlow (e.g. keras.Model.fit). However, I think the documentation is kind of difficult to go through.
Simple code snippets:
```python
def fun(x):
    return x * x

# No parallelism
result = map(fun, range(10))

# Using concurrent.futures
import concurrent.futures
with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
    result = executor.map(fun, range(10))

# Using multiprocessing
import multiprocessing
with multiprocessing.Pool(processes=4) as pool:
    result = pool.imap(fun, range(10))

# Using threads from multiprocessing
import multiprocessing.pool
with multiprocessing.pool.ThreadPool(processes=4) as pool:
    result = pool.imap(fun, range(10))
```