Do not cache data in DelayedSample

This is important because loading DelayedSamples and stacking them in a SampleBatch copies the data, so anything cached inside the samples ends up being kept in memory twice. For example:

import bob.pipelines as mario
import numpy as np
from functools import partial

a = np.zeros((1000, 1000))

def load(i):
    # normally we load an array from disk
    return a[i]

samples = [mario.DelayedSample(partial(load, i=i)) for i in range(len(a))]
samples[:2]
# [DelayedSample(load=functools.partial(<function load at 0x7fb1c90250d0>, i=0)),
# DelayedSample(load=functools.partial(<function load at 0x7fb1c90250d0>, i=1))]

a2 = np.array(mario.SampleBatch(samples))
np.shares_memory(a, a2)
# False

So you can see that SampleBatch always leads to a copy of the data, and caching data in DelayedSamples therefore always leads to double memory usage.
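The same effect can be sketched with NumPy alone, which may help to see where the second copy comes from. This is only an illustration under assumptions: `load` is a hypothetical stand-in for reading one sample from disk, the `cached` list plays the role of samples that keep their loaded data, and `np.stack` stands in for the stacking that SampleBatch performs. It does not use bob.pipelines itself.

```python
import numpy as np

N_SAMPLES, SAMPLE_SIZE = 100, 1000


def load(i):
    # Hypothetical loader: stands in for reading one sample from disk.
    return np.full(SAMPLE_SIZE, i, dtype=np.float64)


# With caching: every loaded sample stays referenced after stacking.
cached = [load(i) for i in range(N_SAMPLES)]
stacked = np.stack(cached)  # np.stack always allocates a new array

# The stacked array does not share memory with any cached sample ...
shares = any(np.shares_memory(stacked, c) for c in cached)

# ... so the data now exists twice: once in the cache, once in the stack.
cached_bytes = sum(c.nbytes for c in cached)
total_with_cache = cached_bytes + stacked.nbytes

# Without caching: the per-sample arrays are temporaries that can be
# garbage-collected once stacking is done, leaving a single copy.
stacked_lazy = np.stack([load(i) for i in range(N_SAMPLES)])

print(shares)
print(total_with_cache // stacked.nbytes)
```

In the uncached variant only `stacked_lazy` keeps the data alive, which is the behaviour this issue asks for in DelayedSample.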

Edited by Amir MOHAMMADI