Load score more efficiently for negatives and positives
Fixes #19 (closed)
@mguenther I think this is better than what you are trying to do. What do you think?
I will give this a try on Monday.
But, as I mentioned in #19 (closed), this is not the most memory-efficient way to load scores: it still requires keeping both `real_id` and `claimed_id` of all score pairs in memory at the same time.

OK, I have done some profiling of the three alternatives for loading and splitting score files. I have used https://pypi.python.org/pypi/memory_profiler to do the memory and time profiling.
I have a larger score file with eight million scores (219 MB raw file):

```
$ wc -l scores-dev
8012444 scores-dev
```
I have written a short script to load the score file:

```python
import bob.measure

negatives, positives = bob.measure.load.split_four_column(score_file)
```
I have run the script with all three branches, `master`, `minimal_load`, and `19-load_scores-extremely-memory-hungry`, using:

```
$ bin/mprof run -T 1 ./bin/python load_scores.py
$ bin/mprof plot -no {master,minimal,mine}.pdf
```
The resulting plots are attached: master.pdf, mine.pdf, minimal.pdf
As you can see, the `minimal_load` branch and the `master` branch need approximately the same amount of memory (3 GB vs. 3.5 GB), while mine takes 300 MB. Also note the time differences (x-axes): `master`: 180 sec., `minimal_load`: 140 sec., mine: 80 sec. Note that all experiments were run on a local disk, i.e., to avoid network latencies in loading the score file.

@amohammadi Do you now agree that my version works better? Can we close this PR?
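For a quick in-process cross-check of such numbers without external tooling, Python's standard-library tracemalloc can report the peak allocation of a single loading call. This is only a sketch; `loader` stands in for whichever split function is being compared:

```python
import tracemalloc


def peak_memory_mb(loader, *args):
    """Run loader(*args) and return (result, peak Python-level allocation in MB)."""
    tracemalloc.start()
    try:
        result = loader(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / (1024 * 1024)
```

Note that tracemalloc only tracks allocations made through Python's allocators, so memory held in numpy data buffers may be under-reported compared to mprof's process-level (RSS) numbers.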
OK @mguenther, go ahead and create a pull request for your implementation, please. But, as I said before, I make use of these methods elsewhere, so I don't want their API (their call signature and their return value format) to be changed.
mentioned in merge request bob.bio.base!250 (merged)