Multiprocessing support for data sources
As discussed in today's debugging session with @samuel.gaist and @amohammadi, using a `DataLoader` object in a multiprocessing context is hard:
- Typically, the underlying `DataSource`'s `fileobj`'s are opened by the time the process is forked.
- Deep copying the object (which goes through pickling and unpickling it) does not properly reset the underlying `fileobj` pointers, which makes multiple processes access the same underlying OS-level file handle, causing unwanted behaviour (see the sketch after this list).
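To make the second point concrete, here is a minimal, self-contained sketch of the underlying issue (not package code; the temporary file and the byte counts are arbitrary). A file object opened before `fork()` shares its OS-level file description, including the read offset, between parent and child:

```python
import os
import tempfile

# Write a small temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"0123456789")
    path = tmp.name

# Open it unbuffered *before* forking; parent and child now share the
# same OS-level file description, including the read offset.
f = open(path, "rb", buffering=0)
pid = os.fork()
if pid == 0:            # child: read 4 bytes and exit without cleanup
    f.read(4)
    os._exit(0)
os.waitpid(pid, 0)      # parent: wait for the child, then inspect the offset
print(f.tell())         # prints 4: the child's read moved the shared offset
f.close()
os.unlink(path)
```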
To sort this out, we discussed two possible additions to this package:
- `DataLoader` should have a `reset()` method that resets all of the underlying `DataSource` opened files, so that they can be correctly copied across multiple processes (e.g. in the event of a `fork()`). It should be relatively easy to run a `reset()` operation across all inputs of a user algorithm, to ensure all data sources are properly reset before an eventual user-guided `fork()`.
- The underlying `DataSource` should have its pickle/unpickle behaviour patched (by overwriting the `__setstate__` slot of `DataSource`, see reference below), so that unpickling a data source (e.g. indirectly via a data loader deep copy) calls `self.reset()` after its state is restored. This would allow a `DataLoader` object to be transparently sent over current mechanisms for inter-process communication (e.g. MPI or `multiprocessing.Queue`). A sketch combining both ideas follows this list.
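A minimal sketch of how the two additions could fit together is below. `DataSource` and `DataLoader` here are simplified stand-ins for the package's classes (attribute names such as `path` and `_fileobj` are assumptions, not the actual implementation). The sketch also drops the open handle in `__getstate__`, since an open file object cannot be pickled directly:

```python
import pickle


class DataSource:
    """Reads data from a file on disk; keeps the file object lazily opened."""

    def __init__(self, path):
        self.path = path
        self._fileobj = None

    def _ensure_open(self):
        if self._fileobj is None:
            self._fileobj = open(self.path, "rb")
        return self._fileobj

    def read(self, size=-1):
        return self._ensure_open().read(size)

    def reset(self):
        """Close the underlying file object so the next access reopens it."""
        if self._fileobj is not None:
            self._fileobj.close()
            self._fileobj = None

    def __getstate__(self):
        # Never pickle the open file object, only the information needed
        # to reopen it on the other side.
        state = self.__dict__.copy()
        state["_fileobj"] = None
        return state

    def __setstate__(self, state):
        # After unpickling (e.g. through a DataLoader deep copy or an
        # inter-process transfer), make sure no stale file handle survives.
        self.__dict__.update(state)
        self.reset()


class DataLoader:
    """Aggregates several data sources; only reset() is sketched here."""

    def __init__(self, sources):
        self.sources = list(sources)

    def reset(self):
        # Reset every underlying DataSource, e.g. right before a fork().
        for source in self.sources:
            source.reset()
```

With something like this in place, `copy.deepcopy(loader)` or pushing the loader through a `multiprocessing.Queue` would yield data sources whose files are reopened on first use in the receiving process, and a manual `loader.reset()` could still be issued right before a user-guided `fork()`.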
References:
- Python `fileobj` handling: https://stackoverflow.com/questions/1834556/does-a-file-object-automatically-close-when-its-reference-count-hits-zero
- Pickle user guide (see in particular `__getstate__` and `__setstate__` on how to overwrite the pickle/unpickle actions): https://docs.python.org/3/library/pickle.html#object.__getstate__
- On sharing (opened) file pointers in a POSIX system after a `fork()` is issued: https://stackoverflow.com/questions/33899548/file-pointers-after-returning-from-a-forked-child-process