MemoryError during serialization of large objects
This is an issue I've been facing for a while.
Now that we are running our pipelines in large-scale experiments (several thousand images), the lists of `SampleSet`s that we generate during
`pipeline.transform` are getting big (>1 GB), and this raises `MemoryError` exceptions during serialization (even when we have enough memory).
This is very annoying; basically, I can't work with large datasets.
I managed to put together a very simple example reproducing this issue here: https://github.com/dask/distributed/issues/3806
I know we can change the serializer that
dask-distributed uses (https://distributed.dask.org/en/latest/serialization.html#use), but I'm not sure that this is the real problem.
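For reference, selecting serializers as described in the linked docs would look roughly like this (the scheduler address is a placeholder):

```python
from dask.distributed import Client

# Placeholder scheduler address; the serializer/deserializer lists follow
# the distributed docs linked above (try 'dask' first, fall back to pickle).
client = Client(
    "tcp://scheduler-address:8786",
    serializers=["dask", "pickle"],
    deserializers=["dask", "msgpack", "pickle"],
)
```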
However, I would like to propose a workaround that slows down the execution of experiments a bit, but at least keeps the code from crashing: change the serialization behavior of `DelayedSample` to the following.
```python
class DelayedSample(_ReprMixin):
    def __init__(self, load, parent=None, **kwargs):
        self.load = load
        if parent is not None:
            _copy_attributes(self, parent.__dict__)
        _copy_attributes(self, kwargs)
        self._data = None

    @property
    def data(self):
        """Loads the data from the disk file."""
        if self._data is None:
            self._data = self.load()
        return self._data

    def __getstate__(self):
        # Drop the loaded data before pickling; it can be reloaded
        # lazily (via `load`) after deserialization.
        self._data = None
        d = dict(self.__dict__)
        return d
```
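A minimal sketch of what this buys us, assuming the class above (with `_ReprMixin` and `_copy_attributes` in scope) and a hypothetical module-level loader `load_image` standing in for a disk read:

```python
import pickle

def load_image():
    # Hypothetical stand-in for loading one large image from disk.
    return [0.0] * 1_000_000

# `key` is just illustrative metadata copied onto the sample via kwargs.
sample = DelayedSample(load_image, key="subject-001")
_ = sample.data                  # materializes the large payload in memory

blob = pickle.dumps(sample)      # __getstate__ resets _data to None first,
                                 # so only `load` and the metadata are pickled
restored = pickle.loads(blob)

assert restored._data is None    # the payload did not travel with the pickle
_ = restored.data                # lazily reloaded on first access
```

The trade-off is exactly the slowdown mentioned above: after deserialization, the first access to `data` has to hit the disk again on the receiving worker.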