Make scikit operations daskable
Hi guys, this is the result of our discussion today. I just copy/pasted the print from @amohammadi into this MR so we can iterate over it.
Well, I don't know if we can solve this with Mixins ONLY. I will try to: 1) formalize the problem; 2) present all possible use cases; and 3) propose directions with dask operations.
Problem Statements:
- Can we make stateless (only `estimator.transform`) and stateful (`estimator.fit`/`estimator.transform` enabled) scikit estimators automatically daskable by just wrapping them with some magic Mixin?
- Can we make WHOLE pipelines (a stack of these estimators) daskable with some magic Mixin?
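To make the first question concrete, here is a minimal sketch of what such a Mixin could look like for the stateless case. Everything in it is an assumption for discussion: `Scale`, `DaskTransformMixin` and `DaskScale` are hypothetical names, the estimator is a plain duck-typed class rather than a real scikit one, and letting `transform` accept a `dask.bag.Bag` is itself a choice that the scikit API does not define.

```python
import dask.bag


class Scale:
    """Hypothetical stateless estimator: only `transform` does real work."""

    def __init__(self, factor):
        self.factor = factor

    def fit(self, samples, y=None):
        return self  # nothing to learn

    def transform(self, samples):
        return [s * self.factor for s in samples]


class DaskTransformMixin:
    """Hypothetical mixin: route `transform` through dask when given a bag."""

    def transform(self, samples):
        if isinstance(samples, dask.bag.Bag):
            # Each partition is handed to the wrapped transform as a list,
            # so the estimator sees the same kind of input as without dask.
            return samples.map_partitions(super().transform)
        return super().transform(samples)


class DaskScale(DaskTransformMixin, Scale):
    pass


bag = dask.bag.from_sequence([1, 2, 3, 4], npartitions=2)
lazy = DaskScale(factor=10).transform(bag)  # still a lazy dask bag

# The synchronous scheduler is used only to keep the sketch simple.
result = lazy.compute(scheduler="synchronous")
```

This only covers `transform`. It says nothing yet about `fit`, which is where the cardinality problems discussed below show up.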
Formalization of the problem:
Boundaries
- Here I will consider ONLY cases where estimators are stacked objects (a.k.a. a pipeline), because this is what real life looks like. Hence, a pipeline can be defined as:
`pipeline = [estimator_[1], estimator_[2], ..., estimator_[n]]`
- Whole pipelines can be either fittable/transformable (stateful) or transformable only (stateless) (see https://gitlab.idiap.ch/bob/bob.pipelines/blob/master/bob/pipelines/test/test_processor.py#L230).
- `pipeline.transform` HAS to be called as a result of `dask.bag.map` so we can enjoy parallelization.
Cardinality of the operations
`pipeline.transform`
Case A:
- `pipeline.transform` is a 1:1 operation. Hence, 2 or more `estimator_[n].transform` calls can be dasked in one shot with `dask.bag.from_sequence([sample_set]).map_partitions(pipeline.transform)`. Easy. We already enjoy that in the vanilla-pipeline.
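A small sketch of Case A, under stated assumptions: `AddOne`, `Double` and `Pipeline` are hypothetical stand-ins (not the vanilla-pipeline code), and the bag is built from the flat `sample_set` so that each partition is a list of samples, which differs slightly from the `[sample_set]` form quoted above.

```python
import dask.bag


class AddOne:
    """Hypothetical stateless estimator."""

    def transform(self, samples):
        return [s + 1 for s in samples]


class Double:
    """Hypothetical stateless estimator."""

    def transform(self, samples):
        return [s * 2 for s in samples]


class Pipeline:
    """Hypothetical stand-in for a stack of estimators."""

    def __init__(self, estimators):
        self.estimators = estimators

    def transform(self, samples):
        # 1:1 operation: every estimator maps a list of samples to a list
        # of the same length, so the whole stack can run per partition.
        for estimator in self.estimators:
            samples = estimator.transform(samples)
        return samples


pipeline = Pipeline([AddOne(), Double()])
sample_set = [1, 2, 3, 4, 5, 6]

bag = dask.bag.from_sequence(sample_set, npartitions=3)
# One dask task per partition runs the complete stack of transforms.
transformed = bag.map_partitions(pipeline.transform).compute(
    scheduler="synchronous"  # chosen only to keep the sketch simple
)
```

Because no estimator needs to see more than its own partition, the result matches a plain `pipeline.transform(sample_set)` call.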
`pipeline.fit`
Case B: Here we have 2 situations:
- `estimator_[n].fit` followed by `estimator_[n].transform` is a 1:N operation. Once an `estimator_[n].fit` is done, we need to be able to take all the samples used in this operation and `map` them again into an `estimator_[n].transform` so we can enjoy parallelization. The only way I see this working is by making the samples an input of the Mixin class (in the `__init__`). This way we could `dask.delayed(estimator_[n].fit)([sample_set])` and `dask.bag.from_sequence([sample_set]).map_partitions(estimator_[n].transform)`. This would basically break the scikit API :-(
- `estimator_[n].transform` followed by `estimator_[n+1].fit` is an N:1 operation. We need to be able to concatenate the samples from `estimator_[n].transform` to pass them to the following `estimator_[n+1].fit`. I don't see how this can work ONLY WITH MIXINS. We need some higher-level entity (possibly an extension of the scikit.pipelines that I wrote in the last MR) to orchestrate this.
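To make the two situations of Case B concrete, here is one possible shape for such a higher-level entity. It is a sketch for discussion, not a proposal for the final API: `Center`, `MaxScale`, `_fit`, `_transform` and `fit_transform_pipeline` are hypothetical names, and the estimators are plain duck-typed classes instead of real scikit ones.

```python
import dask
import dask.bag


class Center:
    """Hypothetical stateful estimator: learns a mean, then subtracts it."""

    def fit(self, samples, y=None):
        self.mean_ = sum(samples) / len(samples)
        return self

    def transform(self, samples):
        return [s - self.mean_ for s in samples]


class MaxScale:
    """Hypothetical stateful estimator: learns the largest magnitude."""

    def fit(self, samples, y=None):
        self.scale_ = max(abs(s) for s in samples)
        return self

    def transform(self, samples):
        return [s / self.scale_ for s in samples]


def _fit(estimator, partitions):
    # N:1 step: concatenate all partitions into one list before fitting.
    samples = [s for partition in partitions for s in partition]
    return estimator.fit(samples)


def _transform(partition, fitted):
    # 1:N step: the single fitted estimator is applied to each partition.
    return fitted.transform(partition)


def fit_transform_pipeline(estimators, sample_set, npartitions=2):
    """Build a lazy graph that fits and applies each estimator in turn."""
    bag = dask.bag.from_sequence(sample_set, npartitions=npartitions)
    fitted = []
    for estimator in estimators:
        model = dask.delayed(_fit)(estimator, bag.to_delayed())
        bag = bag.map_partitions(_transform, model)
        fitted.append(model)
    return fitted, bag


fitted, bag = fit_transform_pipeline([Center(), MaxScale()], [1.0, 2.0, 3.0, 4.0])
# The synchronous scheduler is used only to keep the sketch simple.
models, out = dask.compute(fitted, bag, scheduler="synchronous")
```

The loop in `fit_transform_pipeline` is the orchestration that a Mixin alone cannot see: it knows both the output bag of `estimator_[n]` and the `fit` of `estimator_[n+1]`. The estimators themselves keep the usual `fit`/`transform` signatures.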
Well, that's it. I hope I provided enough details for discussion.
ping @andre.anjos @amohammadi
thanks