Hi guys, this is the result of our discussion today. I just copy/pasted the print from @amohammadi in this MR so we can iterate over it.
Well, I don't know if we can solve this with ONLY Mixins. I will try to: 1) formalize the problem; 2) present all possible use cases; and 3) propose directions with dask operations.
The problem: can we make stateless (method `estimator.transform` only) and stateful (methods `estimator.fit`/`estimator.transform` enabled) scikit estimators automatically daskable by just wrapping them with some magic Mixin?

Consider a pipeline `pipeline = [estimator_[1], estimator_[2], ..., estimator_[n]]`.
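To make the discussion concrete, here is a minimal sketch of mapping a pipeline's `transform` over a dask bag. The `Estimator` class, the `pipeline_transform` helper, and the data are made up for illustration; nothing here is an existing scikit or bob API:

```python
# Toy sketch: chaining scikit-style transforms over a dask bag.
# ``Estimator`` and ``pipeline_transform`` are hypothetical names.
import dask.bag

class Estimator:
    """A minimal stateless scikit-style estimator."""
    def __init__(self, offset):
        self.offset = offset

    def transform(self, samples):
        return [s + self.offset for s in samples]

def pipeline_transform(samples, pipeline):
    # Chain transforms: the output of one estimator feeds the next.
    for estimator in pipeline:
        samples = estimator.transform(samples)
    return samples

pipeline = [Estimator(1), Estimator(10)]
sample_set = [0, 1, 2, 3]

# Each partition is transformed independently, hence in parallel.
bag = dask.bag.from_sequence(sample_set, npartitions=2)
result = bag.map_partitions(pipeline_transform, pipeline)
print(result.compute(scheduler="synchronous"))  # [11, 12, 13, 14]
```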
`pipeline.transform` HAS to be called as a result of `dask.bag.map` so we can enjoy parallelization.

`pipeline.transform` is a 1:1 operation. Hence, 2 or more `estimator_[n].transform` can be dasked in one shot with `dask.bag.from_sequence([sample_set]).map_partitions(pipeline.transform)`. Easy. We already enjoy that in the vanilla pipeline.

`pipeline.fit`

Here we have 2 situations:
1) `estimator_[n].fit` followed by `estimator_[n].transform` is a 1:N operation. Once an `estimator_[n].fit` is done, we need to be able to take all the samples used in this operation and `map` them again into `estimator_[n].transform` so we can enjoy parallelization. The only way I see this working is by making the samples an input of the Mixin class (in the `__init__`). In this case we could do `dask.delayed(estimator_[n].fit)([sample_set])` and `dask.bag.from_sequence([sample_set]).map_partitions(estimator_[n].transform)`. This basically would break the scikit API :-(
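A minimal sketch of this 1:N pattern, with a toy stateful estimator (`MeanCenter` is a hypothetical name, and passing a `Delayed` estimator as an argument to `map_partitions` relies on dask's documented support for delayed arguments):

```python
# 1:N case: one delayed ``fit`` over the whole sample set, then a
# partition-wise ``transform`` over the same samples.
import dask
import dask.bag

class MeanCenter:
    """Toy stateful estimator: fit computes a mean, transform subtracts it."""
    def fit(self, samples):
        self.mean_ = sum(samples) / len(samples)
        return self

    def transform(self, samples):
        return [s - self.mean_ for s in samples]

sample_set = [1.0, 2.0, 3.0, 4.0]

# fit reduces over ALL samples, so it is a single delayed task
fitted = dask.delayed(MeanCenter().fit)(sample_set)

# transform is 1:1, so it maps partition-wise over the same samples;
# the delayed fitted estimator is resolved inside the same graph
bag = dask.bag.from_sequence(sample_set, npartitions=2)
result = bag.map_partitions(lambda part, est: est.transform(part), fitted)
print(result.compute(scheduler="synchronous"))  # [-1.5, -0.5, 0.5, 1.5]
```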
2) `estimator_[n].transform` followed by `estimator_[n+1].fit` is an N:1 operation. We need to be able to concatenate the samples from `estimator_[n].transform` to pass them to the following `estimator_[n+1].fit`. I don't see how this can work ONLY WITH MIXINS. We need to have some higher-level entity (possibly an extension of the scikit pipelines that I wrote in the last MR) to orchestrate this.
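For reference, the N:1 gather itself can be expressed in dask by feeding the transformed bag into a delayed `fit` (dask collections passed to `dask.delayed` calls are materialized into the graph). The `Scale` and `MaxNorm` classes are toy stand-ins; the open question above is who orchestrates this wiring, not whether dask can express it:

```python
# N:1 case: partition-wise transform, then all outputs concatenated
# into one list and handed to the next estimator's fit.
import dask
import dask.bag

class Scale:
    """Stateless step: transform doubles each sample."""
    def transform(self, samples):
        return [2 * s for s in samples]

class MaxNorm:
    """Stateful step: fit records the maximum over ALL samples."""
    def fit(self, samples):
        self.max_ = max(samples)
        return self

sample_set = [1, 2, 3, 4]
bag = dask.bag.from_sequence(sample_set, npartitions=2)

# 1:1 part: transform runs partition-wise in parallel
transformed = bag.map_partitions(Scale().transform)

# N:1 part: gather every transformed sample into one delayed list...
gathered = dask.delayed(list)(transformed)

# ...and feed that list to the next estimator's fit, as one task
fitted = dask.delayed(MaxNorm().fit)(gathered)
print(fitted.compute(scheduler="synchronous").max_)  # 8
```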
Well, that's it. I hope I provided enough details for discussion.
ping @andre.anjos @amohammadi
thanks