
Make scikit operations daskable

Merged Tiago de Freitas Pereira requested to merge dask-mixin into master Mar 09, 2020

Hi guys, this is the result of our discussion today. I just copy/pasted the print from @amohammadi into this MR so we can iterate on it.

Well, I don't know if we can solve this with ONLY Mixins. I will try to: 1) formalize the problem; 2) present all possible use cases; and 3) propose directions with dask operations.

Problem Statements:

  1. Can we make stateless (only estimator.transform) and stateful (both estimator.fit and estimator.transform enabled) scikit estimators automatically daskable by just wrapping them with some magic Mixin?
  2. Can we make WHOLE pipelines (a stack of these estimators) daskable with some magic Mixin?

Formalization of the problem:

Boundaries

  1. Here I will consider ONLY cases where estimators are stacked objects (a.k.a. a pipeline), because this is what real life looks like. Hence, a pipeline can be defined as: pipeline = [estimator_[1], estimator_[2], ..., estimator_[n]]
  2. Whole pipelines can be either fittable/transformable (stateful) or only transformable (stateless) (see https://gitlab.idiap.ch/bob/bob.pipelines/blob/master/bob/pipelines/test/test_processor.py#L230).
  3. pipeline.transform HAS to be called through dask.bag.map so we can enjoy parallelization.
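
To make these boundaries concrete, here is a minimal sketch in plain Python, without dask. The classes `Scale` and `MeanCenter` and the helpers `pipeline_fit` and `pipeline_transform` are hypothetical names made up for illustration; they are not part of bob.pipelines or scikit-learn. The sketch assumes a sample is just a float and that stateless estimators still expose a no-op `fit`, as is the scikit-learn convention.

```python
class Scale:
    """Stateless estimator (hypothetical): transform needs no fitted state."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, samples):
        # Nothing to learn; returning self follows the scikit-learn convention.
        return self

    def transform(self, samples):
        return [s * self.factor for s in samples]


class MeanCenter:
    """Stateful estimator (hypothetical): fit must see the samples first."""

    def fit(self, samples):
        self.mean_ = sum(samples) / len(samples)
        return self

    def transform(self, samples):
        return [s - self.mean_ for s in samples]


def pipeline_fit(pipeline, samples):
    """Fit each estimator on the output of the previous one."""
    for estimator in pipeline:
        samples = estimator.fit(samples).transform(samples)
    return pipeline


def pipeline_transform(pipeline, samples):
    """Apply every estimator's transform in order."""
    for estimator in pipeline:
        samples = estimator.transform(samples)
    return samples


# pipeline = [estimator_[1], estimator_[2]]
pipeline = [Scale(2.0), MeanCenter()]
pipeline_fit(pipeline, [1.0, 2.0, 3.0])
centered = pipeline_transform(pipeline, [1.0, 2.0, 3.0])
print(centered)
```

A pipeline made only of `Scale`-like estimators would be stateless as a whole; adding a single `MeanCenter` makes the whole pipeline stateful.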

Cardinality of the operations

Case A: pipeline.transform

  1. pipeline.transform is a 1:1 operation. Hence, two or more estimator_[n].transform calls can be dasked in one shot with: dask.bag.from_sequence([sample_set]).map_partitions(pipeline.transform). Easy. We already enjoy that in the vanilla pipeline.
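
Case A can be sketched as follows. This is only an illustration under assumptions: `Scale`, `Shift` and `StatelessPipeline` are hypothetical stand-ins for real estimators and for the vanilla pipeline, samples are plain floats, and the synchronous scheduler is chosen only to keep the example deterministic and easy to run.

```python
import dask.bag


class Scale:
    """Hypothetical stateless estimator."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, samples):
        return [s * self.factor for s in samples]


class Shift:
    """Hypothetical stateless estimator."""

    def __init__(self, offset):
        self.offset = offset

    def transform(self, samples):
        return [s + self.offset for s in samples]


class StatelessPipeline:
    """Hypothetical stack of estimators exposing a single transform."""

    def __init__(self, estimators):
        self.estimators = estimators

    def transform(self, samples):
        for estimator in self.estimators:
            samples = estimator.transform(samples)
        return samples


pipeline = StatelessPipeline([Scale(2.0), Shift(1.0)])
samples = [1.0, 2.0, 3.0, 4.0]

# Each partition goes through the WHOLE pipeline in a single task (1:1).
bag = dask.bag.from_sequence(samples, npartitions=2)
transformed = bag.map_partitions(pipeline.transform).compute(
    scheduler="synchronous"
)
print(transformed)
```

Because no partition depends on another one, both estimators run inside the same task and no intermediate result has to be gathered.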

Case B: pipeline.fit

Here we have 2 situations:

  1. estimator_[n].fit followed by estimator_[n].transform is a 1:N operation. Once an estimator_[n].fit is done, we need to be able to take all the samples used in this operation and map them again into estimator_[n].transform so we can enjoy parallelization. The only way I see this working is by making the samples an input of the Mixin class (in its __init__). In that case we could call dask.delayed(estimator_[n].fit)([sample_set]) and dask.bag.from_sequence([sample_set]).map_partitions(estimator_[n].transform). This would basically break the scikit API :-(

  2. estimator_[n].transform followed by estimator_[n+1].fit is an N:1 operation. We need to be able to concatenate the samples coming out of estimator_[n].transform to pass them to the following estimator_[n+1].fit. I don't see how this can work ONLY WITH MIXINS. We need some higher-level entity (possibly an extension of the scikit.pipelines that I wrote in the last MR) to orchestrate this.
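
One possible shape for such a higher-level entity is sketched below. It is not the implementation from this MR: `fit_transform_dasked`, `_fit`, `_transform`, `MeanCenter` and `MaxAbsScale` are hypothetical names, samples are assumed to be plain floats, and the synchronous scheduler is used only to keep the example simple. The idea is to gather all partitions into one delayed fit task (N:1) and then hand the fitted estimator back to one transform task per partition (1:N).

```python
import dask
import dask.bag


class MeanCenter:
    """Hypothetical stateful estimator: learns the mean of the samples."""

    def fit(self, samples):
        self.mean_ = sum(samples) / len(samples)
        return self

    def transform(self, samples):
        return [s - self.mean_ for s in samples]


class MaxAbsScale:
    """Hypothetical stateful estimator: learns the largest absolute value."""

    def fit(self, samples):
        self.scale_ = max(abs(s) for s in samples)
        return self

    def transform(self, samples):
        return [s / self.scale_ for s in samples]


def _fit(estimator, partitions):
    # N:1 step: concatenate every partition before fitting.
    samples = [s for partition in partitions for s in partition]
    return estimator.fit(samples)


def _transform(fitted, partition):
    return fitted.transform(partition)


def fit_transform_dasked(estimators, bag):
    """Build a lazy graph that fits and applies each estimator in turn."""
    fitted = []
    for estimator in estimators:
        partitions = bag.to_delayed()
        # One fit task that depends on all partitions (N:1).
        model = dask.delayed(_fit)(estimator, partitions)
        # One transform task per partition, all sharing the fitted model (1:N).
        bag = dask.bag.from_delayed(
            [dask.delayed(_transform)(model, part) for part in partitions]
        )
        fitted.append(model)
    return fitted, bag


bag = dask.bag.from_sequence([1.0, 2.0, 3.0, 4.0], npartitions=2)
fitted, out = fit_transform_dasked([MeanCenter(), MaxAbsScale()], bag)
result = out.compute(scheduler="synchronous")
print(result)
```

The estimators keep the usual fit/transform signatures here; the orchestration lives entirely in the outer function, which is the part a Mixin alone cannot express.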

Well, that's it. I hope I provided enough details for discussion.

ping @andre.anjos @amohammadi

thanks

Edited Mar 14, 2020 by Tiago de Freitas Pereira
Source branch: dask-mixin