Make scikit operations daskable
Hi guys, this is the result of our discussion today. I just "copy/pasted" the print from @amohammadi into this MR so we can iterate over it.
Well, I don't know if we can solve this with ONLY Mixins. I will try to: 1) formalize the problem; 2) present all possible use cases; and 3) propose directions with dask operations.
- Can we make stateless (only `estimator.transform`) and stateful (`estimator.fit` and `estimator.transform` enabled) scikit estimators automatically daskable by just wrapping them with some magic Mixin?
- Can we make WHOLE pipelines (a stack of these estimators) daskable with some magic Mixin?
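To make the first question concrete, here is a minimal sketch of what such a Mixin might look like for the stateless case only. The names `DaskTransformMixin`, `transform_bag`, `Doubler` and `DaskableDoubler` are hypothetical and exist only for illustration; nothing here is an existing bob.pipelines or scikit-learn API, and the sketch assumes each sample is a 1-D numpy feature vector.

```python
import dask.bag as db
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class DaskTransformMixin:
    """Hypothetical mixin: adds a bag-aware transform to a scikit-learn transformer."""

    def transform_bag(self, bag):
        def _transform_partition(partition):
            partition = list(partition)
            if not partition:
                return []
            # Stack the samples of one partition, transform them in one call,
            # and return one output per input sample.
            return list(self.transform(np.vstack(partition)))

        # Lazy: nothing runs until .compute() is called on the returned bag.
        return bag.map_partitions(_transform_partition)


class Doubler(TransformerMixin, BaseEstimator):
    """Toy stateless transformer used only for this illustration."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return 2 * np.asarray(X)


class DaskableDoubler(DaskTransformMixin, Doubler):
    pass


samples = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
bag = db.from_sequence(samples, npartitions=2)
doubled = DaskableDoubler().transform_bag(bag).compute(scheduler="synchronous")
```

This only covers `transform`; it says nothing yet about how `fit` would get its samples, which is the hard part discussed further down.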
Formalization of the problem:
- Here I will consider ONLY cases where estimators are stacked objects (a.k.a. a pipeline), because this is what real life looks like. Hence, a pipeline can be defined as:
- Whole pipelines can be either fittable/transformable (stateful) or transformable only (stateless) (see https://gitlab.idiap.ch/bob/bob.pipelines/blob/master/bob/pipelines/test/test_processor.py#L230).
`pipeline.transform` HAS to be called as a result of `dask.bag.map` so we can enjoy parallelization.
Cardinality of the operations
`pipeline.transform` is a 1:1 operation. Hence, 2 or more `estimator_n.transform` calls can be dasked in one shot with: `dask.bag.from_sequence([sample_set]).map_partitions(pipeline.transform)`. Easy. We already enjoy that in the vanilla-pipeline.
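The 1:1 case can be sketched as follows. This is an illustration under assumptions, not the vanilla-pipeline code: the two `FunctionTransformer` steps, the sample values and the helper `transform_partition` are made up for the example, and each sample is assumed to be a 1-D numpy feature vector.

```python
import dask.bag as db
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Two stateless steps stacked into one pipeline (hypothetical example steps).
pipeline = make_pipeline(
    FunctionTransformer(np.abs),
    FunctionTransformer(np.sqrt),
)
# fit learns nothing for these stateless steps; it is called only because
# newer scikit-learn versions may warn or raise on an unfitted Pipeline.
pipeline.fit(np.zeros((1, 2)))

# Hypothetical samples: one feature vector per sample.
samples = [
    np.array([-4.0, 9.0]),
    np.array([16.0, -25.0]),
    np.array([1.0, 0.0]),
]


def transform_partition(partition):
    partition = list(partition)
    if not partition:
        return []
    # One pipeline.transform call per partition runs every stacked step,
    # and yields exactly one output per input sample (1:1).
    return list(pipeline.transform(np.vstack(partition)))


bag = db.from_sequence(samples, npartitions=2)
transformed = bag.map_partitions(transform_partition).compute(scheduler="synchronous")
```

The synchronous scheduler is used here only to keep the example deterministic and easy to debug; any other dask scheduler should work the same way as long as the pipeline can be serialized.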
Here we have 2 situations:
- `estimator_[n].transform` is a 1:N operation. Once an `estimator_[n].fit` is done, we need to be able to take all the samples used in this operation and `map` them again into `estimator_[n].transform` so we can enjoy parallelization. The only way I see this working is by making the samples an input of the Mixin class (in its `__init__`). In that case we could run `dask.bag.from_sequence([sample_set]).map_partitions(estimator_[n].transform)`. This basically would break the scikit
- `estimator_[n+1].fit` is an N:1 operation. We need to be able to concatenate the samples coming out of `estimator_[n].transform` and pass them to the following `estimator_[n+1].fit`. I don't see how this can work ONLY WITH MIXINS. We need some higher-level entity (possibly an extension of the scikit.pipelines that I wrote in the last MR) to orchestrate this.
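To illustrate the two situations together, here is a rough sketch of what such a higher-level entity could do: gather all samples for each `fit` (N:1), then map the fitted `transform` back over the bag (1:N). The function `fit_transform_stack` and the helper `_transform_partition` are hypothetical names, the `StandardScaler` and `PCA` steps are arbitrary stand-ins, and samples are assumed to be 1-D numpy feature vectors. It is not the scikit.pipelines extension mentioned above.

```python
from functools import partial

import dask.bag as db
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def _transform_partition(estimator, partition):
    partition = list(partition)
    if not partition:
        return []
    return list(estimator.transform(np.vstack(partition)))


def fit_transform_stack(estimators, samples, npartitions=2):
    """Hypothetical orchestrator for a stack of stateful estimators."""
    bag = db.from_sequence(samples, npartitions=npartitions)
    for estimator in estimators:
        # N:1 step: fit needs every sample at once, so the outputs of the
        # previous step are gathered and concatenated here.
        gathered = np.vstack(bag.compute(scheduler="synchronous"))
        estimator.fit(gathered)
        # 1:N step: the fitted transform is mapped back over the partitions.
        bag = bag.map_partitions(partial(_transform_partition, estimator))
    return bag


rng = np.random.default_rng(0)
samples = [rng.normal(size=3) for _ in range(8)]
steps = [StandardScaler(), PCA(n_components=2)]
projected = fit_transform_stack(steps, samples).compute(scheduler="synchronous")
```

Note that this naive version recomputes the earlier transforms every time a later step is fitted, since each `compute` replays the whole graph. A real orchestrator would probably want to persist or cache intermediate results.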
Well, that's it. I hope I provided enough details for discussion.