Generalisation: Enable validation -> test operating-mode evaluation
We currently only evaluate solutions on a single set (potentially the test set?).
It would be useful if this library could evaluate the same model on two subsets: a validation set and a test set. The library would then determine, on the validation set, which combinations of threshold + (sub-)system define the Pareto front estimate, and apply those same combinations to the separate test set.
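To make the intended flow concrete, here is a minimal sketch in plain Python. It does not use this library's API: the function names (`error_rates`, `pareto_points`, `select_and_apply`), the choice of false-positive and false-negative rates as the two objectives, and binary 0/1 labels are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[str, float]  # (sub-system name, threshold); hypothetical type


def error_rates(scores: Sequence[float], labels: Sequence[int],
                threshold: float) -> Tuple[float, float]:
    """Return (false-positive rate, false-negative rate) at one threshold.

    Assumes binary labels (0 = negative, 1 = positive), at least one
    sample of each class, and that a score >= threshold predicts positive.
    """
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pos = [s for s, y in zip(scores, labels) if y == 1]
    fpr = sum(s >= threshold for s in neg) / len(neg)
    fnr = sum(s < threshold for s in pos) / len(pos)
    return fpr, fnr


def pareto_points(costs: Dict[Point, Tuple[float, float]]) -> List[Point]:
    """Keep the points that no other point dominates (lower is better)."""
    def dominates(a, b):
        return all(x <= y for x, y in zip(a, b)) and a != b

    return sorted(p for p, c in costs.items()
                  if not any(dominates(o, c) for o in costs.values()))


def select_and_apply(val_scores: Dict[str, Sequence[float]],
                     val_labels: Sequence[int],
                     test_scores: Dict[str, Sequence[float]],
                     test_labels: Sequence[int],
                     thresholds: Sequence[float]):
    """Pick operating points on validation, then evaluate them on test."""
    # 1. Evaluate every (sub-system, threshold) pair on the validation set.
    val_costs = {(name, t): error_rates(s, val_labels, t)
                 for name, s in val_scores.items() for t in thresholds}
    # 2. Keep only the pairs on the estimated Pareto front.
    selected = pareto_points(val_costs)
    # 3. Re-evaluate exactly those pairs, unchanged, on the test set.
    return {(name, t): error_rates(test_scores[name], test_labels, t)
            for name, t in selected}
```

The key design point is step 3: thresholds and sub-systems are frozen after the validation step, so the test-set numbers are not biased by selecting operating points on the test data itself.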
A separate CLI tool could then print the combinations of threshold + (sub-)system that actually define the NDS on the estimated Pareto front.
We propose changing the "Scores" representation from a list of lists of scores to a dictionary mapping a string to a list of scores. That way, the library can report both the threshold and a meaningful, user-provided string identifying the sub-system of interest.
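One possible shape for the proposed representation is sketched below. The class layout, the `by_system` field, the `from_lists` migration helper, and the `describe` method are hypothetical and only illustrate the idea; the library's actual "Scores" class may look different.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


@dataclass
class Scores:
    """Hypothetical sketch: scores keyed by a user-provided sub-system name."""
    by_system: Dict[str, List[float]]

    @classmethod
    def from_lists(cls, lists: Sequence[Sequence[float]],
                   names: Optional[Sequence[str]] = None) -> "Scores":
        """Convert the old list-of-lists layout to the dictionary layout.

        If no names are given, fall back to index-based placeholder names
        so that existing callers keep working.
        """
        if names is None:
            names = [f"system-{i}" for i in range(len(lists))]
        if len(names) != len(lists) or len(set(names)) != len(names):
            raise ValueError("need exactly one unique name per score list")
        return cls({name: list(scores) for name, scores in zip(names, lists)})

    def describe(self, system: str, threshold: float) -> str:
        """Format an operating point as a human-readable string."""
        if system not in self.by_system:
            raise KeyError(f"unknown sub-system: {system!r}")
        return f"{system} @ threshold={threshold:g}"
```

With string keys, a report line can read `cnn @ threshold=0.5` instead of referring to a positional index, and a helper like `from_lists` would offer a migration path for code that still passes a list of lists.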