Multiprocessing support for data sources
As discussed in today's debugging session with @samuel.gaist and @amohammadi, using a `DataLoader` object in a multiprocessing context is hard:
- Typically, the underlying `DataSource`'s `fileobj`s are already open by the time the process is forked.
- Deep copying the object (which goes through pickling and unpickling it) does not properly reset the underlying `fileobj` pointers, which makes multiple processes access the same underlying OS-level file handle, causing unwanted behaviour (a minimal demonstration follows this list).
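The following standalone snippet (not code from this package, POSIX-only since it uses `os.fork()`) illustrates the shared-offset problem: after a fork, parent and child share the same OS-level file description, so a read in one process moves the offset seen by the other.

```python
import os
import tempfile

# Create a small scratch file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"0123456789")
    path = tmp.name

# buffering=0 disables Python's read-ahead buffering, so the effect on the
# OS-level file offset is easy to observe.
with open(path, "rb", buffering=0) as f:
    pid = os.fork()
    if pid == 0:
        f.read(4)            # child: advances the *shared* file offset
        os._exit(0)
    os.waitpid(pid, 0)       # parent: wait for the child to finish
    print(f.read(4))         # prints b'4567', not b'0123'

os.unlink(path)
```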
To sort this out, we discussed two possible additions to this package:
- `DataLoader` should have a `reset()` method that resets all files opened by the underlying `DataSource`s, so that they can be correctly copied across multiple processes (e.g. in the event of a `fork()`). It should be relatively easy to run a `reset()` operation across all inputs of a user algorithm, to ensure all data sources are properly reset before an eventual user-guided `fork()`.
- The underlying `DataSource` should have its pickle/unpickle behaviour patched (by overriding the `__setstate__` slot of `DataSource`, see reference below), so that unpickling a data source (e.g. indirectly via a data loader deep copy) calls `self.reset()` after its state is unpickled. This would allow a `DataLoader` object to be sent transparently over current mechanisms for inter-process communication (e.g. MPI or a `multiprocessing.Queue`). A sketch of both additions follows this list.
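A rough sketch of what the two additions could look like; attribute names such as `filename`, `fileobj` and `data_sources` are placeholders, not the package's actual internals:

```python
class DataSource:
    def __init__(self, filename):
        self.filename = filename
        self.fileobj = None

    def open(self):
        # (Re)open the underlying file lazily, in the current process.
        if self.fileobj is None:
            self.fileobj = open(self.filename, "rb")

    def reset(self):
        # Close the underlying file object so the next access reopens it
        # with a fresh, process-local OS file handle.
        if self.fileobj is not None:
            self.fileobj.close()
            self.fileobj = None

    def __getstate__(self):
        # Never pickle the open file object itself.
        state = self.__dict__.copy()
        state["fileobj"] = None
        return state

    def __setstate__(self, state):
        # Restore attributes, then make sure no stale handle survives the
        # unpickling (e.g. when a DataLoader is deep-copied or sent over a
        # multiprocessing.Queue).
        self.__dict__.update(state)
        self.reset()


class DataLoader:
    def __init__(self, data_sources):
        self.data_sources = list(data_sources)

    def reset(self):
        # Propagate the reset to every underlying DataSource, so the loader
        # can be safely copied across processes (e.g. before a fork()).
        for source in self.data_sources:
            source.reset()
```

With both hooks in place, `copy.deepcopy(loader)` (or pushing the loader through a `multiprocessing.Queue`) would yield copies whose sources hold no stale file handles and reopen their files independently in the receiving process.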
References:
- Python `fileobj` handling: https://stackoverflow.com/questions/1834556/does-a-file-object-automatically-close-when-its-reference-count-hits-zero
- Pickle user guide (see in particular `__getstate__` and `__setstate__` on how to override the pickle/unpickle actions): https://docs.python.org/3/library/pickle.html#object.__getstate__
- On sharing (opened) file pointers in a POSIX system after a `fork()` is issued: https://stackoverflow.com/questions/33899548/file-pointers-after-returning-from-a-forked-child-process