MemoryError during serialization of large objects
This is an issue I've been facing for a while.
Now that we are running our pipelines in large-scale experiments (several thousand images), the lists of `SampleSet`s that we generate during
`pipeline.transform` are getting big (>1 GB), and this raises `MemoryError` exceptions during serialization (even when we have enough memory).
This is very annoying; basically, I can't work with large datasets.
I managed to put together a very simple example reproducing this issue here: https://github.com/dask/distributed/issues/3806
I know we can change the serializer that
dask-distributed uses (https://distributed.dask.org/en/latest/serialization.html#use), but I'm not sure that this is the real problem.
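For reference, selecting serializers as described in the linked docs would look roughly like this (the scheduler address is a placeholder):

```python
from dask.distributed import Client

# Placeholder scheduler address; the serializer/deserializer lists follow
# the distributed docs linked above (try 'dask' first, fall back to pickle).
client = Client(
    "tcp://scheduler-address:8786",
    serializers=["dask", "pickle"],
    deserializers=["dask", "msgpack", "pickle"],
)
```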
However, I would like to propose a workaround that slows down the execution of experiments a bit, but at least keeps the code from crashing: change the serialization behavior of `DelayedSample` to the following.
```python
class DelayedSample(_ReprMixin):
    def __init__(self, load, parent=None, **kwargs):
        self.load = load
        if parent is not None:
            _copy_attributes(self, parent.__dict__)
        _copy_attributes(self, kwargs)
        self._data = None

    @property
    def data(self):
        """Loads the data from the disk file."""
        if self._data is None:
            self._data = self.load()
        return self._data

    def __getstate__(self):
        # Drop the loaded data before pickling; it can be reloaded
        # lazily (via `load`) after deserialization.
        self._data = None
        d = dict(self.__dict__)
        return d
```
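A minimal sketch of what this buys us, assuming the class above (with `_ReprMixin` and `_copy_attributes` in scope) and a hypothetical module-level loader `load_image` standing in for a disk read:

```python
import pickle

def load_image():
    # Hypothetical stand-in for loading one large image from disk.
    return [0.0] * 1_000_000

# `key` is just illustrative metadata copied onto the sample via kwargs.
sample = DelayedSample(load_image, key="subject-001")
_ = sample.data                  # materializes the large payload in memory

blob = pickle.dumps(sample)      # __getstate__ resets _data to None first,
                                 # so only `load` and the metadata are pickled
restored = pickle.loads(blob)

assert restored._data is None    # the payload did not travel with the pickle
_ = restored.data                # lazily reloaded on first access
```

The trade-off is exactly the slowdown mentioned above: after deserialization, the first access to `data` has to hit the disk again on the receiving worker.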