Hello, I have regularly been annoyed by Dask runs that hang indefinitely because of some workers being disconnected from the scheduler. In this case, the scheduler actually assumes the worker must still be doing its job so it doesn't reassign the task, leading to a completely blocked run that needs to be interrupted by hand. This typically happens on very heavy experiments e.g. on IJBC, FRGC.
From what I understand this can be handled using the
worker_ttl parameter of the scheduler, which puts a limit on how long a worker can be unseen by the scheduler before being killed and reassigning its task. It is
None by default, I have been working for a while on a local branch where I set the default to 60s, it helped quite a lot.
I am proposing to merge this change, however I wanted to know what you think of it. My main concern is that it might be hiding some underlying issue (why do the workers actually disconnect ?), so I am not 100% sure it's a good change to make.