bob.pipelines
!5

Make scikit operations daskable


Merged Tiago de Freitas Pereira requested to merge dask-mixin into master 5 years ago

Hi guys, this is the result of our discussion today. I just copy/pasted the print from @amohammadi into this MR so we can iterate over it.

Well, I don't know if we can solve this with ONLY mixins. I will try to: 1) formalize the problem; 2) present all possible use cases; and 3) propose directions with dask operations.

Problem Statements:

Can we make stateless (only estimator.transform) and stateful (both estimator.fit and estimator.transform enabled) scikit estimators automatically daskable by just wrapping them with some magic Mixin?
Can we make WHOLE pipelines (a stack of these estimators) daskable with some magic Mixin?
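
To make the first question concrete, here is a minimal sketch of what such a Mixin could look like for the stateless case. `Scale`, `DaskBagMixin` and `DaskScale` are hypothetical names made up for illustration (they are not part of bob.pipelines or scikit-learn), and the synchronous scheduler is used only to keep the sketch deterministic.

```python
import dask.bag


class Scale:
    """Hypothetical stateless estimator with a scikit-style API."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, X, y=None):
        # Nothing to learn: the estimator is stateless.
        return self

    def transform(self, X):
        return [self.factor * x for x in X]


class DaskBagMixin:
    """Hypothetical mixin: map transform over partitions when X is a dask bag."""

    def transform(self, X):
        plain_transform = super().transform
        if isinstance(X, dask.bag.Bag):
            # Lazy: returns a new bag, nothing is computed yet.
            return X.map_partitions(plain_transform)
        return plain_transform(X)


class DaskScale(DaskBagMixin, Scale):
    pass


estimator = DaskScale(factor=3)

# Plain sequences still go through the original transform.
eager = estimator.transform([1, 2, 3])

# Bags are transformed partition by partition.
lazy = estimator.transform(dask.bag.from_sequence([1, 2, 3, 4], npartitions=2))
result = lazy.compute(scheduler="synchronous")
```

This only covers `transform`; it says nothing yet about `fit`, which is where the cardinality problems discussed in this MR show up.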

Formalization of the problem:

Boundaries

Here I will consider ONLY cases where estimators are stacked objects (a.k.a. a pipeline), because this is what real life looks like. Hence, a pipeline can be defined as: pipeline=[estimator_[1], estimator_[2], ..., estimator_[n]]
Whole pipelines can be either fittable/transformable (stateful) or only transformable (stateless) (look at https://gitlab.idiap.ch/bob/bob.pipelines/blob/master/bob/pipelines/test/test_processor.py#L230).
pipeline.transform HAS to be called as a result of dask.bag.map so we can enjoy parallelization.

Cardinality of the operations

Case A: `pipeline.transform`

pipeline.transform is a 1:1 operation. Hence, 2 or more estimator_[n].transform calls can be dasked in one shot with: dask.bag.from_sequence([sample_set]).map_partitions(pipeline.transform). Easy. We already enjoy that in the vanilla-pipeline.
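
A minimal sketch of Case A, assuming toy estimators that work on plain lists of numbers. `Doubler`, `AddOne` and `ToyPipeline` are hypothetical stand-ins for real estimators and for the pipeline object; only `dask.bag.from_sequence` and `map_partitions` come from dask.

```python
import dask.bag


class Doubler:
    """Hypothetical stateless estimator."""

    def fit(self, samples, y=None):
        return self

    def transform(self, samples):
        return [2 * s for s in samples]


class AddOne:
    """Hypothetical stateless estimator."""

    def fit(self, samples, y=None):
        return self

    def transform(self, samples):
        return [s + 1 for s in samples]


class ToyPipeline:
    """Hypothetical stand-in for a stack of estimators."""

    def __init__(self, steps):
        self.steps = steps

    def transform(self, samples):
        # Chain every estimator's transform inside one call.
        for step in self.steps:
            samples = step.transform(samples)
        return samples


pipeline = ToyPipeline([Doubler(), AddOne()])
samples = list(range(10))

# Each partition goes through the whole stack in a single task.
bag = dask.bag.from_sequence(samples, npartitions=3)
transformed = bag.map_partitions(pipeline.transform).compute(scheduler="synchronous")
```

Since every estimator is 1:1, the whole stack fits in one `map_partitions` call and no step needs to see samples from another partition.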

Case B: `pipeline.fit`

Here we have 2 situations:

estimator_[n].fit followed by estimator_[n].transform is a 1:N operation. Once an estimator_[n].fit is done, we need to be able to take all the samples used in this operation and map them again into estimator_[n].transform so we can enjoy parallelization. The only way I see this working is by making the samples an input of the Mixin class (in the init). With this we could do dask.delayed(estimator_[n].fit)([sample_set]) and dask.bag.from_sequence([sample_set]).map_partitions(estimator_[n].transform). This basically would break the scikit API :-(
estimator_[n].transform followed by estimator_[n+1].fit is an N:1 operation. We need to be able to concatenate the samples from estimator_[n].transform to pass them to the following estimator_[n+1].fit. I don't see how this can work ONLY WITH MIXINS. We need to have some higher level entity (possibly an extension of the scikit.pipelines that I wrote in the last MR) to orchestrate this.
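
The two situations above can be sketched with a small orchestrator that lives outside the estimators. This is only an illustration under assumptions: `MeanCenterer`, `MaxScaler`, `fit_on_partitions`, `transform_partition` and `fit_transform_stack` are hypothetical names, not the extension proposed in this MR. The dask calls used are `dask.delayed`, `Bag.to_delayed` and `dask.bag.from_delayed`.

```python
import dask
import dask.bag


class MeanCenterer:
    """Hypothetical stateful estimator: fit learns a mean, transform removes it."""

    def fit(self, samples, y=None):
        self.mean_ = sum(samples) / len(samples)
        return self

    def transform(self, samples):
        return [s - self.mean_ for s in samples]


class MaxScaler:
    """Hypothetical stateful estimator: fit learns the largest magnitude."""

    def fit(self, samples, y=None):
        self.scale_ = max(abs(s) for s in samples)
        return self

    def transform(self, samples):
        return [s / self.scale_ for s in samples]


def fit_on_partitions(estimator, partitions):
    # N:1 -- concatenate every partition so fit sees all the samples at once.
    samples = [s for partition in partitions for s in partition]
    return estimator.fit(samples)


def transform_partition(fitted, partition):
    # 1:N -- the single fitted estimator is reused on each partition.
    return fitted.transform(partition)


def fit_transform_stack(estimators, samples, npartitions=2):
    """Hypothetical orchestrator: builds the task graph, computes nothing."""
    partitions = dask.bag.from_sequence(samples, npartitions=npartitions).to_delayed()
    for estimator in estimators:
        fitted = dask.delayed(fit_on_partitions)(estimator, partitions)
        partitions = [
            dask.delayed(transform_partition)(fitted, p) for p in partitions
        ]
    return dask.bag.from_delayed(partitions)


bag = fit_transform_stack([MeanCenterer(), MaxScaler()], [1.0, 2.0, 3.0, 4.0, 5.0])
result = bag.compute(scheduler="synchronous")
```

Each `fit` becomes a single task that depends on all partitions, and each `transform` fans back out to one task per partition. The estimators keep the plain scikit-style `fit`/`transform` signatures; the fan-in and fan-out logic sits in the orchestrator.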

Well, that's it. I hope I provided enough details for discussion.

ping @andre.anjos @amohammadi

thanks

Edited 5 years ago by Tiago de Freitas Pereira
