Make scikit operations daskable

Tiago de Freitas Pereira requested to merge dask-mixin into master

Hi guys, this is the result of our discussion today. I just copy/pasted the print from @amohammadi into this MR so we can iterate over it.

Well, I don't know if we can solve this with ONLY Mixins. I will try to: 1) formalize the problem; 2) present all possible use cases; and 3) propose directions with dask operations.

Problem Statements:

  1. Can we make stateless (only estimator.transform) and stateful (both estimator.fit and estimator.transform) scikit estimators automatically daskable just by wrapping them with some magic Mixin?
  2. Can we make WHOLE pipelines (a stack of these estimators) daskable with some magic Mixin?
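
To make question 1 concrete for the stateless case, here is a minimal sketch of what such a Mixin could look like. `DaskBagMixin`, `Scale` and `DaskScale` are hypothetical names used only for illustration (they are not part of bob.pipelines or scikit-learn), and the sketch assumes the caller is willing to pass a dask bag instead of a plain sequence to `transform`.

```python
import dask.bag as db


class Scale:
    """Hypothetical stateless estimator following the fit/transform convention."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, X, y=None):
        # Nothing to learn: the estimator is stateless.
        return self

    def transform(self, X):
        return [x * self.factor for x in X]


class DaskBagMixin:
    """Hypothetical Mixin: run the wrapped transform once per bag partition."""

    def transform(self, bag):
        # super().transform resolves to the next class in the MRO (Scale here).
        return bag.map_partitions(super().transform)


class DaskScale(DaskBagMixin, Scale):
    pass


bag = db.from_sequence([1, 2, 3, 4], npartitions=2)
# The synchronous scheduler keeps the sketch deterministic; any dask
# scheduler could be used instead.
scaled = DaskScale(3).transform(bag).compute(scheduler="synchronous")
print(scaled)  # [3, 6, 9, 12]
```

Note that this only covers transform; it says nothing yet about fit, which is where the difficulties discussed below start.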

Formalization of the problem:

Boundaries

  1. Here I will consider ONLY cases where estimators are stacked (a.k.a. a pipeline), because this is what real life looks like. Hence, a pipeline can be defined as: pipeline=[estimator_[1], estimator_[2], ..., estimator_[n]]
  2. Whole pipelines can be either fittable/transformable (stateful) or only transformable (stateless) (look at https://gitlab.idiap.ch/bob/bob.pipelines/blob/master/bob/pipelines/test/test_processor.py#L230).
  3. pipeline.transform HAS to be called via dask.bag.map so we can enjoy parallelization.

Cardinality of the operations

Case A: pipeline.transform

  1. pipeline.transform is a 1:1 operation. Hence, two or more estimator_[n].transform calls can be dasked in one shot with: dask.bag.from_sequence([sample_set]).map_partitions(pipeline.transform). Easy. We already enjoy that in the vanilla-pipeline.
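
Case A can be sketched as follows. `Scale`, `Offset` and `TinyPipeline` are hypothetical stand-ins for real scikit estimators and for the pipeline object (they only mimic the transform convention), and the sample set is just a list of integers.

```python
import dask.bag as db


class Scale:
    """Hypothetical stateless estimator."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, X):
        return [x * self.factor for x in X]


class Offset:
    """Hypothetical stateless estimator."""

    def __init__(self, offset):
        self.offset = offset

    def transform(self, X):
        return [x + self.offset for x in X]


class TinyPipeline:
    """Hypothetical pipeline: applies each estimator's transform in order."""

    def __init__(self, steps):
        self.steps = steps

    def transform(self, X):
        for step in self.steps:
            X = step.transform(X)
        return X


pipeline = TinyPipeline([Scale(2), Offset(1)])
sample_set = list(range(8))

# One map_partitions call covers every estimator in the stack, because
# transform is 1:1 and no step needs to see the other partitions.
bag = db.from_sequence(sample_set, npartitions=4)
transformed = bag.map_partitions(pipeline.transform).compute(scheduler="synchronous")
print(transformed)  # [1, 3, 5, 7, 9, 11, 13, 15]
```

The synchronous scheduler is used here only to keep the sketch deterministic; the same graph runs unchanged on a threaded or distributed scheduler.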

Case B: pipeline.fit

Here we have 2 situations:

  1. estimator_[n].fit followed by estimator_[n].transform is a 1:N operation. Once an estimator_[n].fit is done, we need to be able to take all the samples used in this operation and map them again into estimator_[n].transform so we can enjoy parallelization. The only way I see this working is by making the samples an input of the Mixin class (in the init). With this we could do dask.delayed(estimator_[n].fit)([sample_set]) and dask.bag.from_sequence([sample_set]).map_partitions(estimator_[n].transform). This basically would break the scikit API :-(

  2. estimator_[n].transform followed by estimator_[n+1].fit is an N:1 operation. We need to be able to concatenate the samples from estimator_[n].transform to pass them to the following estimator_[n+1].fit. I don't see how this can work ONLY WITH MIXINS. We need some higher-level entity (possibly an extension of the scikit.pipelines that I wrote in the last MR) to orchestrate this.
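
The two situations above can be sketched with a small orchestrator that sits above the estimators. This is only an illustration under assumptions: `build_fit_transform_graph`, `fit_on_parts`, `transform_part`, `Scale` and `Center` are hypothetical names, not the pipeline extension from the last MR. The fit step gathers every partition into one delayed call (N:1), and the fitted estimator is then fanned out to one transform per partition (1:N).

```python
import dask.bag as db
from dask import delayed


class Scale:
    """Hypothetical stateless estimator."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [x * self.factor for x in X]


class Center:
    """Hypothetical stateful estimator: learns the mean, then subtracts it."""

    def fit(self, X, y=None):
        X = list(X)
        self.mean_ = sum(X) / len(X)
        return self

    def transform(self, X):
        return [x - self.mean_ for x in X]


def fit_on_parts(estimator, parts):
    # N:1 step: concatenate all partitions before fitting.
    flat = [x for part in parts for x in part]
    return estimator.fit(flat)


def transform_part(fitted, part):
    # 1:N step: the same fitted estimator is applied to each partition.
    return fitted.transform(part)


def build_fit_transform_graph(steps, bag):
    parts = bag.to_delayed()  # one Delayed object per partition
    for estimator in steps:
        fitted = delayed(fit_on_parts)(estimator, parts)
        parts = [delayed(transform_part)(fitted, part) for part in parts]
    return db.from_delayed(parts)


sample_set = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
bag = db.from_sequence(sample_set, npartitions=3)
out = build_fit_transform_graph([Scale(2.0), Center()], bag)
result = out.compute(scheduler="synchronous")
print(result)  # [-5.0, -3.0, -1.0, 1.0, 3.0, 5.0]
```

The estimators themselves stay untouched here: the samples are handed to the orchestrator rather than to an estimator's init, so the fit/transform signatures are left as they are.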

Well, that's it. I hope I provided enough details for discussion.

ping @andre.anjos @amohammadi

thanks

Edited by Tiago de Freitas Pereira