Commit 95ee08d1 authored by Yannick DAYER

documentation on choosing partitions count or size

parent cbb1fa4a
1 merge request: !290 PipelineSimple partitioning fixes
Pipeline #60903 failed
@@ -211,6 +211,7 @@ def execute_pipeline_simple(
"Splitting data according to the number of available workers."
)
n_jobs = dask_client.cluster.sge_job_spec["default"]["max_jobs"]
+ logger.debug(f"{n_jobs} partitions will be created.")
pipeline = dask_pipeline_simple(pipeline, npartitions=n_jobs)
logger.info(f"Running the PipelineSimple for group {group}")
@@ -129,8 +129,8 @@ It is possible to do it via configuration file
@click.option(
"--dask-partition-size",
"-s",
help="If using Dask, this option defines the max size of each dask.bag.partition."
"Use this option if the current heuristic that sets this value doesn't suit your experiment."
help="If using Dask, this option defines the max size of each dask.bag.partition. "
"Use this option if the current heuristic that sets this value doesn't suit your experiment. "
"(https://docs.dask.org/en/latest/bag-api.html?highlight=partition_size#dask.bag.from_sequence).",
default=None,
type=click.INT,
@@ -150,8 +150,8 @@ It is possible to do it via configuration file
@click.option(
"--dask-n-workers",
"-w",
help="If using Dask, this option defines the number of workers to start your experiment."
"Dask automatically scales up/down the number of workers due to the current load of tasks to be solved."
help="If using Dask, this option defines the number of workers to start your experiment. "
"Dask automatically scales up/down the number of workers due to the current load of tasks to be solved. "
"Use this option if the current amount of workers set to start an experiment doesn't suit you.",
default=None,
type=click.INT,
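The `--dask-partition-size` option above corresponds to the `partition_size` argument of `dask.bag.from_sequence` (the Dask documentation linked in its help text), and the `--dask-n-partitions` option mentioned in the docstring added below corresponds to `npartitions`. A minimal sketch of the difference in plain dask.bag, with made-up numbers:

    import dask.bag as db

    samples = list(range(1000))  # stand-in for the samples of an experiment

    # Fixed number of partitions: 10 partitions of roughly 100 samples each.
    by_count = db.from_sequence(samples, npartitions=10)

    # Fixed partition size: 50 samples per partition, so 20 partitions are created.
    by_size = db.from_sequence(samples, partition_size=50)

    print(by_count.npartitions, by_size.npartitions)  # prints: 10 20

Larger partitions mean fewer tasks for the scheduler to track but more data held in memory by each worker at once; smaller partitions trade that the other way around.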
@@ -241,6 +241,30 @@ def pipeline_simple(
:py:func:`bob.bio.base.pipelines.execute_pipeline_simple`
instead.
+ Using Dask
+ ----------
+
+ Vanilla-biometrics is intended to work with Dask to split the workload between
+ processes on a machine or between workers on a distributed grid system. By
+ default, the local machine is used in single-threaded mode. Passing the
+ `--dask-client` option selects a Dask client and enables distributed execution.
+
+ When using multiple workers, two quantities have to be considered:
+
+ - The number of partitions in the data.
+ - The number of workers available to process the data.
+
+ Ideally (and this is the default behavior), the data is split across all the
+ available workers so that every worker is busy at the same time. However, the
+ number of workers may be limited, or a single partition of data may not fit in
+ a worker's memory. Conversely, splitting the data into many small partitions is
+ not recommended either, as the scheduler will then spend more time organizing
+ and communicating with the workers.
+
+ To solve speed or memory issues, options are available to split the data
+ differently (`--dask-n-partitions` or `--dask-partition-size`). If you run into
+ memory issues on a worker, try increasing the number of partitions; if the
+ scheduler is not keeping up, try reducing it.
"""
if no_dask:
dask_client = None
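To make the tuning advice in the new docstring concrete, here is a minimal sketch using plain dask.bag and dask.distributed; the local cluster and the numbers are illustrative stand-ins and not part of this code base:

    import dask.bag as db
    from dask.distributed import Client, LocalCluster

    # A small local cluster standing in for whatever --dask-client selects.
    client = Client(LocalCluster(n_workers=2, threads_per_worker=1))

    # Mirror of the default heuristic above: one partition per available worker.
    n_workers = len(client.scheduler_info()["workers"])
    samples = list(range(100_000))
    bag = db.from_sequence(samples, npartitions=n_workers)

    # A worker runs out of memory: ask for more (and therefore smaller) partitions.
    bag_finer = db.from_sequence(samples, npartitions=8 * n_workers)

    # The scheduler is drowning in tiny tasks: use fewer, larger partitions, for
    # instance by fixing the size of each one (the role of --dask-partition-size).
    bag_coarser = db.from_sequence(samples, partition_size=25_000)

    client.close()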