bob / bob.bio.base · Commit 95ee08d1
Authored 3 years ago by Yannick DAYER

documentation on choosing partitions count or size

Parent: cbb1fa4a
1 merge request: !290 PipelineSimple partitioning fixes
Pipeline #60903 failed 3 years ago (stage: build)
Showing 2 changed files with 29 additions and 4 deletions:

- bob/bio/base/pipelines/entry_points.py (+1, −0)
- bob/bio/base/script/pipeline_simple.py (+28, −4)
bob/bio/base/pipelines/entry_points.py (+1, −0)

@@ -211,6 +211,7 @@ def execute_pipeline_simple(
            "Splitting data according to the number of available workers."
        )
        n_jobs = dask_client.cluster.sge_job_spec["default"]["max_jobs"]
        logger.debug(f"{n_jobs} partitions will be created.")
        pipeline = dask_pipeline_simple(pipeline, npartitions=n_jobs)

    logger.info(f"Running the PipelineSimple for group {group}")
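The heuristic in this hunk sizes the data to the cluster: the partition count is taken from the maximum number of SGE jobs the client's cluster may scale to. A minimal sketch of the same idea with plain dask.bag (illustrative only, not the project's dask_pipeline_simple code; the sample data and max_jobs value are made up):

import dask.bag as db

def split_for_workers(samples, max_jobs):
    # One partition per potential worker: every worker can receive data,
    # without flooding the scheduler with many tiny tasks.
    return db.from_sequence(samples, npartitions=max_jobs)

bag = split_for_workers(list(range(1000)), max_jobs=50)
print(bag.npartitions)  # 50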
bob/bio/base/script/pipeline_simple.py (+28, −4)

@@ -129,8 +129,8 @@ It is possible to do it via configuration file
@click.option(
    "--dask-partition-size",
    "-s",
    help="If using Dask, this option defines the max size of each dask.bag.partition."
    "Use this option if the current heuristic that sets this value doesn't suit your experiment."
    "(https://docs.dask.org/en/latest/bag-api.html?highlight=partition_size#dask.bag.from_sequence).",
    default=None,
    type=click.INT,
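The linked dask.bag.from_sequence documentation is where partition_size ultimately takes effect; a small sketch of the trade-off the help text describes (the numbers are arbitrary, and the pipeline's actual bag construction may differ):

import dask.bag as db

samples = list(range(10_000))

# Smaller partitions ease per-worker memory pressure but create more tasks
# for the scheduler; larger partitions do the opposite.
small = db.from_sequence(samples, partition_size=200)
large = db.from_sequence(samples, partition_size=2000)
print(small.npartitions, large.npartitions)  # 50 5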
@@ -150,8 +150,8 @@ It is possible to do it via configuration file
@click.option(
    "--dask-n-workers",
    "-w",
    help="If using Dask, this option defines the number of workers to start your experiment."
    "Dask automatically scales up/down the number of workers due to the current load of tasks to be solved."
    "Use this option if the current amount of workers set to start an experiment doesn't suit you.",
    default=None,
    type=click.INT,
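The scaling behavior mentioned in this help text corresponds to Dask's adaptive deployments. A rough sketch with a local cluster (the cluster type and the bounds are illustrative, not the configuration installed by --dask-client):

from dask.distributed import Client, LocalCluster

# Illustrative only: Dask scales the worker count up and down with the task
# load between these bounds; --dask-n-workers sets the starting amount.
cluster = LocalCluster(n_workers=1, threads_per_worker=1)
cluster.adapt(minimum=1, maximum=8)
client = Client(cluster)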
@@ -241,6 +241,30 @@ def pipeline_simple(
    :py:func:`bob.bio.base.pipelines.execute_pipeline_simple` instead.

    Using Dask
    ----------

    Vanilla-biometrics is intended to work with Dask to split the load of work between
    processes on a machine or workers on a distributed grid system. By default, the
    local machine is used in single-threaded mode; the `--dask-client` option selects
    a Dask client instead.

    When using multiple workers, a few things have to be considered:

    - The number of partitions in the data.
    - The number of workers to process the data.

    Ideally (and this is the default behavior), you want to split the data between all
    available workers so that every worker is busy at the same time. But the number of
    workers may be limited, or a single partition may fill the memory of a worker. On
    the other hand, having many small tasks (by splitting the data into many
    partitions) is not recommended either, as the scheduler then spends more time
    organizing and communicating with the workers.

    To solve speed or memory issues, options are available to split the data
    differently (`--dask-n-partitions` or `--dask-partition-size`). If you run into
    memory issues on a worker, try increasing the number of partitions; if the
    scheduler is not keeping up, try reducing that number.
    """
    if no_dask:
        dask_client = None
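As a back-of-the-envelope illustration of the speed/memory trade-off described in this docstring (the sizing numbers and the helper itself are hypothetical, not part of the pipeline):

def choose_partitioning(n_samples, n_workers, bytes_per_sample, worker_memory_bytes):
    # Cap the partition size so one partition fits in a worker's memory...
    max_partition_size = max(1, worker_memory_bytes // bytes_per_sample)
    # ...then use at least one partition per worker, and only more when memory
    # forces smaller partitions (too many partitions slow the scheduler down).
    n_partitions = max(n_workers, -(-n_samples // max_partition_size))  # ceil division
    return n_partitions

# 100k samples, 16 workers, ~2 MiB per sample, 4 GiB of memory per worker:
print(choose_partitioning(100_000, 16, 2 * 1024**2, 4 * 1024**3))  # 49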