bob / bob.pipelines / Merge requests / !75

Add worker Time To Live limitation

Merged Laurent COLBOIS requested to merge dead-workers into master Sep 06, 2021

Hello, I have regularly been annoyed by Dask runs that hang indefinitely because some workers get disconnected from the scheduler. When this happens, the scheduler assumes the worker is still doing its job, so it does not reassign the task; the run is then completely blocked and has to be interrupted by hand. This typically happens on very heavy experiments, e.g. on IJBC or FRGC.

From what I understand, this can be handled with the worker_ttl parameter of the scheduler, which limits how long a worker can go unseen by the scheduler before it is killed and its task is reassigned. It is None by default. I have been working for a while on a local branch where I set the default to 60 s, and it helped quite a lot.
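
To make the mechanism concrete, here is a minimal toy sketch of what a worker time-to-live does. It is not Dask code and not the diff of this merge request: `ToyScheduler`, `heartbeat`, `assign` and `check_ttl` are hypothetical names made up for illustration, and time is passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ToyScheduler:
    """Toy model of a scheduler with a worker TTL (hypothetical, not Dask's implementation)."""

    worker_ttl: Optional[float] = None  # seconds; None disables the check
    last_seen: Dict[str, float] = field(default_factory=dict)   # worker -> last heartbeat time
    assigned: Dict[str, List[str]] = field(default_factory=dict)  # worker -> tasks it holds
    pending: List[str] = field(default_factory=list)             # tasks waiting for a worker

    def heartbeat(self, worker: str, now: float) -> None:
        # Record that the worker was seen at time `now`.
        self.last_seen[worker] = now
        self.assigned.setdefault(worker, [])

    def assign(self, worker: str, task: str) -> None:
        self.assigned[worker].append(task)

    def check_ttl(self, now: float) -> List[str]:
        # With no TTL, a silent worker is never dropped and its tasks stay stuck.
        if self.worker_ttl is None:
            return []
        dead = [w for w, t in self.last_seen.items() if now - t > self.worker_ttl]
        for w in dead:
            # Drop the silent worker and put its tasks back in the queue.
            self.pending.extend(self.assigned.pop(w))
            del self.last_seen[w]
        return dead


# Two workers register at t=0; only w2 keeps sending heartbeats.
sched = ToyScheduler(worker_ttl=60.0)
sched.heartbeat("w1", now=0.0)
sched.heartbeat("w2", now=0.0)
sched.assign("w1", "task-a")
sched.heartbeat("w2", now=50.0)
dead = sched.check_ttl(now=70.0)

# Same scenario without a TTL: the task stays with the silent worker.
no_ttl = ToyScheduler(worker_ttl=None)
no_ttl.heartbeat("w1", now=0.0)
no_ttl.assign("w1", "task-a")
dead_no_ttl = no_ttl.check_ttl(now=1000.0)
```

In the toy run above, `w1` has been silent for 70 s against a 60 s limit, so it is dropped and `task-a` goes back to the pending queue, while the `worker_ttl=None` case leaves the task stuck. As far as I know, the matching Dask setting is also exposed as the `distributed.scheduler.worker-ttl` configuration key.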

I am proposing to merge this change, but I would like to know what you think of it first. My main concern is that it might hide an underlying issue (why do the workers actually disconnect?), so I am not 100% sure it is a good change to make.

ping @tiago.pereira @amohammadi
