Load score more efficiently for negatives and positives
Fixes #19 (closed)
@mguenther I think this is better than what you are trying to do. What do you think?
I will give this a try on Monday.
But, as I mentioned in #19 (closed), this is not the most memory-efficient way to load scores: it still requires keeping both `real_id` and `claimed_id` of all score pairs in memory at the same time.

OK, I have done some profiling of the three alternatives for loading and splitting score files. I have used https://pypi.python.org/pypi/memory_profiler to do the memory and time profiling.
I have a larger score file with eight million scores (219 MB raw file):

```
$ wc -l scores-dev
8012444 scores-dev
```
I have written a short script to load the score file:

```python
import bob.measure

negatives, positives = bob.measure.load.split_four_column(score_file)
```
I have run the script with all three branches, `master`, `minimal_load`, and `19-load_scores-extremely-memory-hungry`, using:

```
$ bin/mprof run -T 1 ./bin/python load_scores.py
$ bin/mprof plot -no {master,minimal,mine}.pdf
```
The resulting plots are attached: master.pdf, mine.pdf, minimal.pdf
As you can see, the `minimal_load` branch and the `master` branch need approximately the same amount of memory (3 GB vs. 3.5 GB), while mine takes 300 MB. Also note the time differences (x-axes): `master`: 180 sec., `minimal_load`: 140 sec., mine: 80 sec. Note that all experiments were run on a local disk, i.e., to avoid network latencies in loading the score file.

@amohammadi Do you now agree that my version works better? Can we close this PR?
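For a quick in-process cross-check of such numbers without external tooling, Python's standard-library tracemalloc can report the peak allocation of a single loading call. This is only a sketch; `loader` stands in for whichever split function is being compared:

```python
import tracemalloc


def peak_memory_mb(loader, *args):
    """Run loader(*args) and return (result, peak Python-level allocation in MB)."""
    tracemalloc.start()
    try:
        result = loader(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)
```

Note that tracemalloc only tracks allocations made through Python's allocators, so memory held in numpy data buffers may be under-reported compared to mprof's process-level (RSS) numbers.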
OK @mguenther, go ahead and create a pull request for your implementation, please. But, as I said before, I make use of these methods elsewhere, so I don't want their API (their call signature and their return value format) to be changed.
mentioned in merge request bob.bio.base!250 (merged)