Integration of MLflow to keep models, parameters and metrics together with this repository
Here are some thoughts about a potential integration of MLflow model registration with GitLab, via GitLab's MLflow client compatibility. GitLab, in this case, acts as an MLflow data repository.
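As far as I understand GitLab's MLflow client compatibility, pointing the standard MLflow client at a project boils down to two environment variables. A minimal sketch follows; the host, project ID, token and experiment name are placeholders:

```python
import os
import mlflow

# Placeholders: replace host, numeric project ID and access token as needed.
os.environ["MLFLOW_TRACKING_URI"] = (
    "https://gitlab.example.com/api/v4/projects/<project-id>/ml/mlflow"
)
os.environ["MLFLOW_TRACKING_TOKEN"] = "<access-token-with-api-scope>"

# From here on, regular MLflow calls talk to the GitLab-backed registry.
mlflow.set_experiment("mednet")
```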
Model organisation
MLflow organises data related to a model "type" into a "Model" object. We should create one such object every time a new use case is presented. So far, I can see these use cases:
- All models used to generate documentation results - for these, we should have 1 MLflow "Model" object per result contained in the documentation.
- All models used for testing purposes - 1 MLflow "Model" per actual model used in the tests
I'm explicitly not including models from papers here, since I think those would better live in their own paper-oriented repository.
Model naming convention
To avoid a mess, we should have a system for organising the model names. Something like "doc-..." for documentation models and "test-..." for test models sounds OK to me.
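As a hedged illustration (the names are just the examples from this note), creating one such "Model" per use case via the MLflow client could look like this:

```python
from mlflow import MlflowClient

client = MlflowClient()  # assumes the tracking/registry URI is already configured
for name in ("doc-usage-table", "test-model-analysis"):
    client.create_registered_model(
        name, description=f"All versions of the '{name}' model"
    )
```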
Model versions
Each MLflow "Model" may have many versions. We should keep updating each stored model with major changes in software, or changes in parameters (i.e. standard configuration files shipped with this package) used to train said model. To mind, it would be good to store the following parameters to keep track of model information, with each model version:
- The GitLab commit ID (possibly as a URL), pointing at the precise version of the code that generated the model
- The date the model was generated at (I suspect this may come for free)
- The command line used to generate the model, which shall include the model configuration file and database used (please note that the use of the GitLab commit ID fixes the versions of said configuration files)
- Relevant performance metrics (tables, via `log_text()`?, and figures, via `log_figure()`?), such as those produced by the evaluation script
- We should also log the "test" set results with something like `log_metrics()`
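To make the list above concrete, here is a minimal sketch of what logging one model version could look like. All parameter and metric values, the run name and the command line are placeholders, and `mlflow.pytorch.log_model()` is only one possible way of registering the weights:

```python
import mlflow
import torch

model = torch.nn.Linear(10, 1)  # stand-in for the actual trained model

with mlflow.start_run(run_name="doc-usage-table"):
    # Provenance: commit ID (as a URL) and the command line used for training
    mlflow.log_params(
        {
            "git_commit": "https://gitlab.example.com/biosignal/software/mednet/-/commit/<sha>",
            "command_line": "mednet train <config> ...",  # hypothetical CLI
        }
    )
    # Evaluation outputs: "test" set metrics, plus tables and figures
    mlflow.log_metrics({"test_auc": 0.0, "test_accuracy": 0.0})  # placeholder values
    mlflow.log_text("...evaluation table as produced by the evaluation script...", "evaluation.txt")
    # mlflow.log_figure(fig, "roc.png")  # for matplotlib figures

    # Register the weights as a new version of the MLflow "Model"
    mlflow.pytorch.log_model(model, "model", registered_model_name="doc-usage-table")
```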
Interconnection with `MLflow.client`
- The uploading of models to the registry should appear as an option of the evaluation script. This option can be, for example, the (partial) path of a GitLab project to which to upload the model and associated data (e.g. "biosignal/software/mednet"). The user must also provide the name of the MLflow "Model" (e.g. "doc-usage-table" or "test-model-analysis") under which to store the new model version. If these options are set, then we load potential GitLab credentials from `~/.python-gitlab.cfg`, find the project ID by querying the GitLab API if necessary, and upload the model with all the information above (see the sketch after this list).
- It should be possible to load a model from a GitLab repository in all places where a model may be input to our command-line applications. Examples of this are the prediction and evaluation scripts, but also all scripts in the saliency map submodule.
- Once the functions for loading a model from a GitLab repository are in place, we should be able to load models required during testing directly from GitLab. Some sort of model caching could be put in place to avoid excessive re-downloading.
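A hedged sketch of the GitLab side of this, assuming `python-gitlab` for credential handling and project-ID resolution, plus a naive local cache for downloads. The configuration section name, cache location and helper names are hypothetical:

```python
import os
import pathlib

import gitlab  # python-gitlab
import mlflow


def configure_registry(project_path: str = "biosignal/software/mednet") -> None:
    """Points MLflow at the GitLab project's MLflow-compatible endpoint.

    Credentials come from ``~/.python-gitlab.cfg``; the section name below
    ("idiap") is an assumption.
    """
    gl = gitlab.Gitlab.from_config(
        "idiap", config_files=[os.path.expanduser("~/.python-gitlab.cfg")]
    )
    project = gl.projects.get(project_path)  # resolves the numeric project ID
    os.environ["MLFLOW_TRACKING_URI"] = (
        f"{gl.url}/api/v4/projects/{project.id}/ml/mlflow"
    )
    os.environ["MLFLOW_TRACKING_TOKEN"] = gl.private_token


def cached_model(name: str, version: str, cache: str = "~/.cache/mednet/models") -> pathlib.Path:
    """Downloads a registered model version once, re-using the local copy afterwards."""
    dst = pathlib.Path(cache).expanduser() / name / version
    if not dst.exists():  # naive caching: only re-download if missing
        dst.mkdir(parents=True)
        mlflow.artifacts.download_artifacts(
            artifact_uri=f"models:/{name}/{version}", dst_path=str(dst)
        )
    return dst


# e.g., during testing:
# configure_registry()
# weights_dir = cached_model("test-model-analysis", "1")
```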