General storage and external data mechanism for Bob
I think we need some kind of storage mechanism for Bob. Our external data usually boils down to three kinds:
- Data that changes with the package and must be versioned with the package.
- Data that does not change and can be automatically downloaded. Possibly this data is very large and does not make sense to have multiple versions of it in each conda environment.
- Data that cannot be automatically downloaded.
Case 1 is like our `.sql3` files for databases: we keep the latest version on our servers, and the versioned copy goes into the conda package. There could be a script to fetch the latest version from our servers, similar to `bob_dbmanage`.
Case 2 is data like DNN models (e.g. bob.ip.tensorflow_extractor!6 (merged)) which do not change and can be downloaded automatically. These DNN models are huge, and ideally we do not want them installed in each conda environment separately. This is where we need a single folder to store this kind of data.
Here's the procedure I am suggesting (for bob.ip.tensorflow_extractor for example):
- Look into `rc["bob.ip.tensorflow_extractor.drgan_modelpath"]` to see where it is.
- If it's not there, look into `~/.bob/bob.ip.tensorflow_extractor/drgan_modelpath` for the model.
- If it's not there, download it and put it there.
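The lookup above could be sketched roughly as follows. This is only an illustration: the dictionary-like `rc` object, the `find_or_download` function, and the key names are assumptions here, not the actual Bob API.

```python
import os
import urllib.request

# Hypothetical stand-in for Bob's global configuration; the real rc
# object and key names may differ.
rc = {}
DEFAULT_STORAGE = os.path.expanduser("~/.bob")

def find_or_download(package, name, url):
    """Return a local path for ``name``, downloading it if necessary."""
    # 1. An explicit path in the configuration wins.
    path = rc.get(f"{package}.{name}")
    if path and os.path.exists(path):
        return path
    # 2. Otherwise, fall back to the shared storage location.
    storage = rc.get("storage", DEFAULT_STORAGE)
    path = os.path.join(storage, package, name)
    if not os.path.exists(path):
        # 3. Not there either: download once, share across environments.
        os.makedirs(os.path.dirname(path), exist_ok=True)
        urllib.request.urlretrieve(url, path)
    return path
```

This way the model is fetched at most once per machine instead of once per conda environment.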
The general storage location (`~/.bob`) can of course be customized using `bob config`. Something like `bob config set storage ~/user/bob`.
Data for case 3 (data that cannot be downloaded automatically) should be documented as requiring a manual download, and users are expected to expose its location using the `bob config` command.
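For case 3, packages could then just read the configured path and fail with an actionable message when it is missing. A minimal sketch, again assuming a dictionary-like `rc` object and an illustrative key name:

```python
import os

# Hypothetical stand-in for Bob's configuration; key names are illustrative.
rc = {}

def get_manual_data(key):
    """Return the user-provided path for data that cannot be auto-downloaded."""
    path = rc.get(key)
    if path is None or not os.path.exists(path):
        # Point the user at the fix instead of failing later with an
        # obscure traceback.
        raise RuntimeError(
            "This data must be downloaded manually; register it with: "
            f"bob config set {key} /path/to/data"
        )
    return path
```

The error message doubles as documentation, telling users exactly which `bob config` command to run.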