Vanilla-biometrics: defining partition size or number of partitions
I have some issues with the way data is partitioned in vanilla-biometrics with Dask.
Actual behavior:
- "automatic": takes
max(len(background_model_samples), len(reference_samples), len(probes_samples))
, then computes a partition size according to that and the number of worker available. - user-set partition size (
-s
option): the size of partitions is fixed by the user.
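For context, the automatic mode behaves roughly like the sketch below. The function name and the exact scaling are my own illustration (the real formula may differ); only the `max(...)` over the three sample sets is what the description above states.

```python
def automatic_partition_size(background_model_samples, reference_samples,
                             probes_samples, n_workers):
    # The pipeline looks at the largest of the three sample sets...
    largest = max(len(background_model_samples),
                  len(reference_samples),
                  len(probes_samples))
    # ...and derives a single partition size from it and the worker count
    # (exact formula assumed here). That one size is then applied to every
    # set, which is the root of the issue described below.
    return max(1, largest // n_workers)
```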
My issue:
I have a big training set and a small enrollment set.
Using the automatic mode, the partition size is derived from the size of `background_model_samples` (say 3000 elements, giving a partition size of 300 with 100 workers). But when processing my small set of `reference_samples` (10 elements), the whole set fits in a single partition (of size 300) and is therefore computed on one worker by Dask. The enrollment step takes time and is done one reference at a time.
Setting the partition size manually (with `-s`) is no good either: I would set it to 1 to split my 10 enrollment tasks as much as possible, but that creates 3000 tasks when training on `background_model_samples` (too many tasks for Dask, lots of transfer time).
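Both failure modes are easy to reproduce with plain `dask.bag` (illustration only, not the vanilla-biometrics code itself; the numbers match the example above):

```python
import dask.bag as db

train = list(range(3000))  # stand-in for background_model_samples
refs = list(range(10))     # stand-in for reference_samples

# Automatic mode: one partition size (here 300) is used for every set.
print(db.from_sequence(train, partition_size=300).npartitions)  # 10
print(db.from_sequence(refs, partition_size=300).npartitions)   # 1 partition:
# all enrollment lands on a single worker.

# Manual -s 1: enrollment parallelizes, but training explodes into tasks.
print(db.from_sequence(train, partition_size=1).npartitions)    # 3000
print(db.from_sequence(refs, partition_size=1).npartitions)     # 10
```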
A solution:
Split the data not in terms of partition size but in terms of number of partitions. `ToDaskBag` supports setting `npartitions` instead of `partition_size` and could easily be used that way. The number of partitions could be the number of available workers.
In that case, the 3000 `background_model_samples` would be split into 100 partitions (given 100 available workers) and the 10 `reference_samples` would be split into 10 partitions of size 1.
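Assuming `ToDaskBag` forwards its arguments to `dask.bag.from_sequence`, the effect is easy to check with plain `dask.bag`:

```python
import dask.bag as db

n_workers = 100

train_bag = db.from_sequence(range(3000), npartitions=n_workers)
refs_bag = db.from_sequence(range(10), npartitions=n_workers)

# dask caps the partition count at the number of elements:
print(train_bag.npartitions)  # 100 (partitions of 30 samples each)
print(refs_bag.npartitions)   # 10 (partitions of size 1)
```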
Am I missing a reason why it was not done this way, @tiago.pereira?
Another solution would be to allow the user to set the number of partitions manually (similar to the existing partition-size option); a sketch follows.
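That could look like a counterpart to `-s`, e.g. a hypothetical `-n` option (sketch only; the option names and plumbing below are made up, not the actual CLI):

```python
import click

@click.command()
@click.option("-s", "--partition-size", type=int, default=None,
              help="Existing behavior: fixed partition size.")
@click.option("-n", "--n-partitions", type=int, default=None,
              help="Hypothetical new option: fixed number of partitions, "
                   "forwarded to ToDaskBag(npartitions=...).")
def pipeline(partition_size, n_partitions):
    # Exactly one of the two would be forwarded to the dask wrapper;
    # with neither set, fall back to the automatic heuristic.
    click.echo(f"partition_size={partition_size} n_partitions={n_partitions}")

if __name__ == "__main__":
    pipeline()
```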