Multiprocessing support for data sources

As discussed in today's debugging session with @samuel.gaist and @amohammadi, using a DataLoader object in a multiprocessing context is currently error-prone:

  1. Typically, the underlying DataSource's fileobjs are already open by the time the process is forked.
  2. Deep copying the object (which goes through pickling and unpickling it) does not properly reset the underlying fileobj pointers, so multiple processes end up accessing the same OS-level file handle, causing unwanted behaviour (illustrated below).
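
For illustration, here is a minimal, self-contained sketch of the shared-offset behaviour on POSIX systems with the `fork` start method. It does not use this package's API; the file name `shared.bin` and the helper `read_chunk()` are made up for the demonstration:

```python
import multiprocessing
import os


def read_chunk(fileobj):
    # Each read advances the file offset shared with the sibling process,
    # because fork() duplicates the file descriptor but both descriptors
    # still point to the same open file description.
    print(os.getpid(), fileobj.read(4))


if __name__ == "__main__":
    # Hypothetical sample file, written only for this demonstration
    with open("shared.bin", "wb") as f:
        f.write(b"abcdefgh")

    multiprocessing.set_start_method("fork")  # the scenario from the issue (POSIX only)

    # buffering=0 keeps Python from pre-reading the whole file, so the
    # shared OS-level offset is visible in the output
    with open("shared.bin", "rb", buffering=0) as fileobj:  # opened *before* forking
        children = [
            multiprocessing.Process(target=read_chunk, args=(fileobj,))
            for _ in range(2)
        ]
        for child in children:
            child.start()
        for child in children:
            child.join()
    # One child prints b"abcd" and the other b"efgh": neither sees the whole
    # file, which is the "same OS-level file handle" problem described above.
```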

To sort this out, we discussed two possible additions to this package:

  1. DataLoader should have a reset() method that resets all files opened by the underlying DataSources, so that they can be correctly copied across multiple processes (e.g. in the event of a fork()). It should be relatively easy to call reset() on all inputs of a user algorithm, to ensure every data source is properly reset before an eventual user-guided fork().
  2. The underlying DataSource should have its pickle/unpickle behaviour patched (by overriding the __setstate__ slot of DataSource, see reference below), so that unpickling a data source (e.g. indirectly via a data loader deep copy) calls self.reset() after its state is restored. This would allow a DataLoader object to be sent transparently over standard inter-process communication mechanisms (e.g. MPI or a multiprocessing.Queue). A sketch of both additions follows this list.
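
As a rough sketch of how both additions could fit together (the class layout and the `path`, `fileobj` and `sources` attributes are assumptions made for illustration, not this package's actual implementation):

```python
class DataSource:
    """Simplified stand-in for the real DataSource."""

    def __init__(self, path):
        self.path = path
        self.fileobj = None

    def open(self):
        # Lazily (re)open the file inside the process that actually uses it
        if self.fileobj is None:
            self.fileobj = open(self.path, "rb")

    def reset(self):
        # Proposal 1 (per-source part): drop the OS-level file handle so the
        # next access reopens it in the current process
        if self.fileobj is not None:
            self.fileobj.close()
        self.fileobj = None

    def __getstate__(self):
        # Open file objects cannot be pickled, so exclude them from the state
        state = self.__dict__.copy()
        state["fileobj"] = None
        return state

    def __setstate__(self, state):
        # Proposal 2: unpickling (e.g. via a deep copy or a
        # multiprocessing.Queue) always yields a source with no stale handle
        self.__dict__.update(state)
        self.reset()


class DataLoader:
    """Simplified stand-in for the real DataLoader."""

    def __init__(self, sources):
        self.sources = sources

    def reset(self):
        # Proposal 1: reset every underlying data source, e.g. right before a
        # user-guided fork()
        for source in self.sources:
            source.reset()
```

With `__setstate__` patched this way, `copy.deepcopy(loader)` or pushing the loader through a `multiprocessing.Queue` would hand each process data sources whose files are closed, and each process would reopen them independently on first access.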

References: