Opened May 13, 2020 by André Anjos (@andre.anjos)

Multiprocessing support for data sources

As discussed in today's debugging session with @samuel.gaist and @amohammadi, using a DataLoader object in a multiprocessing context is hard:

  1. Typically, the underlying DataSource file objects are already open by the time the process is forked.
  2. Deep copying the object (which goes through pickling and unpickling it) does not properly reset the underlying file object pointers, so multiple processes end up accessing the same underlying OS-level file handle, causing unwanted behaviour (see the sketch just after this list).
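
For illustration, here is a minimal, self-contained sketch of the second point. It does not use any beat code, only a temporary file and a raw fork(); it shows how the shared OS-level file description makes the parent lose its position once the child reads:

```python
import os
import tempfile

# Hypothetical, stand-alone illustration (not beat code): after a fork(), the
# parent and the child share the same OS-level open file description, so a
# read() in one process moves the offset seen by the other (POSIX only).
with tempfile.TemporaryFile("w+b", buffering=0) as f:
    f.write(b"0123456789")
    f.seek(0)
    pid = os.fork()
    if pid == 0:                   # child: consumes the first five bytes
        print("child :", f.read(5))
        os._exit(0)
    os.waitpid(pid, 0)
    print("parent:", f.read(5))    # b'56789' -- the parent no longer starts at 0
```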

To sort this out, we discussed two possible additions to this package:

  1. DataLoader should have a reset() method that resets all underlying DataSource opened files, so that the loader can be correctly copied across multiple processes (e.g. in the event of a fork()). It should be relatively easy to call reset() across all inputs of a user algorithm, ensuring all data sources are properly reset before an eventual user-guided fork().
  2. The underlying DataSource should have its pickle/unpickle behaviour patched (by overriding the __setstate__ slot of DataSource, see the references below), so that unpickling a data source (e.g. indirectly, via a deep copy of a data loader) calls self.reset() after its state is restored. This would allow a DataLoader object to be sent transparently over current inter-process communication mechanisms (e.g. MPI or multiprocessing.Queue). A sketch of both additions follows this list.
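
A rough sketch of what the two additions could look like. The attribute names (filename, fileobj, data_sources) and the lazy-reopen assumption are purely illustrative and do not reflect the actual beat.backend.python API:

```python
class DataSource:
    """Hypothetical sketch -- not the real beat.backend.python DataSource."""

    def __init__(self, filename):
        self.filename = filename
        self.fileobj = None  # assumed to be opened lazily on first access

    def reset(self):
        """Proposal 1: closes the underlying file so a later access reopens
        a process-local handle instead of sharing one across a fork()."""
        if self.fileobj is not None:
            self.fileobj.close()
            self.fileobj = None

    def __getstate__(self):
        # Open file handles cannot (and should not) be pickled
        state = self.__dict__.copy()
        state["fileobj"] = None
        return state

    def __setstate__(self, state):
        # Proposal 2: unpickling (e.g. via copy.deepcopy of a DataLoader or a
        # multiprocessing.Queue transfer) restores the state and then resets
        # the file pointers so each process reopens its own files.
        self.__dict__.update(state)
        self.reset()


class DataLoader:
    """Hypothetical sketch of the reset() forwarding on the loader side."""

    def __init__(self, data_sources):
        self.data_sources = list(data_sources)

    def reset(self):
        # Proposal 1: reset every underlying data source before a
        # user-guided fork() or before handing the loader to another process
        for source in self.data_sources:
            source.reset()
```

With such a layout, deep copying a DataLoader or pushing it through a multiprocessing.Queue would pickle it with all file handles stripped, and each receiving process would reopen its own files on first use.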

References:

  • Python fileobj handling: https://stackoverflow.com/questions/1834556/does-a-file-object-automatically-close-when-its-reference-count-hits-zero
  • Pickle user guide (see in particular __getstate__ and __setstate__ on how to override the pickle/unpickle actions): https://docs.python.org/3/library/pickle.html#object.__getstate__
  • On sharing (opened) file pointers in a POSIX system after a fork() is issued: https://stackoverflow.com/questions/33899548/file-pointers-after-returning-from-a-forked-child-process
Reference: beat/beat.backend.python#32