The Schwimmbad
schwimmbad provides a uniform interface to parallel processing pools, making it
easy to switch between local development (e.g., serial processing or the
multiprocessing package) and deployment on a cluster or supercomputer
(via, e.g., MPI or JobLib).
Getting started
The utilities provided by schwimmbad require that your tasks or data can be
“chunked” and that your code can be “mapped” onto the chunked tasks. A very
common use case is executing the same function on every datum in a large data
set:
data = list(range(10000))

def do_the_processing(x):
    # Do something with each value in the data set!
    # For example, here we just square each value
    return x**2

values = []
for x in data:
    values.append(do_the_processing(x))
In the example above, instead of looping over each item in data, we could
equally well have used the Python built-in map() function:
values = list(map(do_the_processing, data))
This applies the function (passed as the first argument) to each element in the
iterable (second argument). In Python 3, map() returns a lazy iterator rather
than a list, so we call list() on it to get the values.
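As a small illustration of that laziness, the sketch below uses only the built-in map() on a short data set. The names lazy, first, and rest are chosen here for illustration and do not come from schwimmbad.

```python
def do_the_processing(x):
    # Same toy function as above: square each value
    return x**2

data = list(range(5))

# Nothing is computed yet; map() only sets up the iteration
lazy = map(do_the_processing, data)

# Pulling one item runs the function on the first element only
first = next(lazy)

# list() consumes whatever is left in the iterator
rest = list(lazy)

# The iterator is now exhausted, so a second list() call gives an empty list
leftover = list(lazy)
```

Because the iterator can only be consumed once, it is usually simplest to wrap the map() call in list() right away, as in the example above.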
The map() example above executes the function serially over each element in
data. That is, it goes through the data object one item at a time, executes the
function, returns, and carries on, all on the same processor core. If we can
write our code in this style (using map()), we can easily swap in the Pool
classes provided by schwimmbad to switch between various parallel processing
frameworks. The easiest to understand is the SerialPool, as this is just a
class-based wrapper of the built-in (serial) map() function. The same example
can be run with SerialPool as follows:
from schwimmbad import SerialPool
pool = SerialPool()
values = list(pool.map(do_the_processing, data))
The only difference here is that we call the .map() method of the instantiated
pool object instead of the built-in map() function.
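To make the idea of a class-based wrapper around map() concrete, here is a minimal sketch of what such a class could look like. TinySerialPool is a hypothetical name used only for illustration; this is not schwimmbad's actual SerialPool implementation, which may differ in its details.

```python
class TinySerialPool:
    """Hypothetical stand-in, not schwimmbad's actual SerialPool."""

    def map(self, func, iterable):
        # Delegate to the built-in map(): one item at a time, same core
        return map(func, iterable)

    def close(self):
        # A serial pool holds no worker processes, so there is nothing to release
        pass

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()


def do_the_processing(x):
    return x**2


with TinySerialPool() as pool:
    values = list(pool.map(do_the_processing, range(5)))
```

Since the calling code only relies on a .map() method, any object that offers one with the same call pattern can be dropped in without touching the processing line.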
If we want to switch to parallel processing, for example using the
multiprocessing package from the Python standard library to utilize multiple
cores on the same machine, we can swap out the pool definition to use the
MultiPool class instead:
from schwimmbad import MultiPool
with MultiPool() as pool:
    values = list(pool.map(do_the_processing, data))
Note that the core processing line (values = ...) is identical to the one
above; we’ve only changed the definition of the pool object. In detail, we’re
now using a context manager (a Python with statement) to handle creating and
closing the multiprocessing pool. With the exception of the SerialPool, all
pool classes need to be explicitly closed after processing. We could also have
written:
pool = MultiPool()
values = list(pool.map(do_the_processing, data))
pool.close()
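One caveat of the explicit form is that pool.close() is skipped if the processing raises an exception. When a with statement is not an option, a try/finally block is one way to guard against that. The sketch below assumes only that the pool has .map() and .close() methods; RecordingPool and run_and_close are hypothetical names introduced here so the example runs without schwimmbad installed.

```python
class RecordingPool:
    """Hypothetical stand-in pool that records whether close() was called."""

    def __init__(self):
        self.closed = False

    def map(self, func, iterable):
        return map(func, iterable)

    def close(self):
        self.closed = True


def run_and_close(pool, func, data):
    # Hypothetical helper: close the pool even if func raises
    try:
        return list(pool.map(func, data))
    finally:
        pool.close()


def do_the_processing(x):
    return x**2


pool = RecordingPool()
values = run_and_close(pool, do_the_processing, range(4))
```

The same helper should work unchanged with any pool object exposing these two methods, since it never refers to a specific pool class.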
See the examples listed below for demonstrations of using the MPIPool and
JoblibPool.