bob.pipelines
!75

Add worker Time To Live limitation

Merged Laurent COLBOIS requested to merge dead-workers into master 3 years ago

Hello, I am regularly annoyed by Dask runs that hang indefinitely because some workers get disconnected from the scheduler. When this happens, the scheduler assumes the worker is still doing its job, so it does not reassign the task, and the run stays completely blocked until it is interrupted by hand. This typically happens on very heavy experiments, e.g. on IJBC or FRGC.

From what I understand, this can be handled with the scheduler's worker_ttl parameter, which limits how long a worker can go unseen by the scheduler before it is killed and its task is reassigned. It is None by default. I have been working for a while on a local branch where I set the default to 60s, and it helped quite a lot.
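
For illustration only, here is a minimal sketch of one way to apply such a limit, assuming it is set through Dask's `distributed.scheduler.worker-ttl` configuration key rather than through the code actually changed in this merge request. The helper name `worker_ttl_config` is hypothetical, and the 60s value is the one mentioned above.

```python
import dask

# Value taken from the description above; tune it to your own workloads.
WORKER_TTL = "60s"


def worker_ttl_config(ttl=WORKER_TTL):
    """Hypothetical helper: return a context manager that sets the worker TTL.

    Assumption: the scheduler reads `distributed.scheduler.worker-ttl` when it
    is created, so the setting must be active before the scheduler starts.
    """
    return dask.config.set({"distributed.scheduler.worker-ttl": ttl})


with worker_ttl_config():
    # A scheduler or cluster created inside this block should see the TTL.
    active_ttl = dask.config.get("distributed.scheduler.worker-ttl")
    print("worker TTL in effect:", active_ttl)
```

The setting only applies in the process that creates the scheduler, so for a scheduler launched elsewhere the same key would have to go into that process's Dask configuration.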

I am proposing to merge this change, but I wanted to know what you think of it first. My main concern is that it might hide some underlying issue (why do the workers actually disconnect?), so I am not 100% sure it is a good change to make.

ping @tiago.pereira @amohammadi
