Generalisation: Enable validation -> test operating-mode evaluation

We currently only evaluate solutions on a single set (potentially the test set?).

It would be nice if this library allowed one to evaluate two subsets for the same model: a validation set and a test set. The library should then be able to decide, on the validation set, which combinations of threshold + (sub) system define the Pareto front estimate, and apply those same combinations to a separate test set.
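As a rough illustration of the validation -> test idea, here is a minimal sketch. It is not this library's API: the names `error_rates`, `non_dominated` and `select_and_apply` are hypothetical, and it assumes binary labels and two objectives to minimise (false positive rate and false negative rate). The real objectives may well differ.

```python
from typing import Dict, List, Tuple

# Hypothetical types: a (system name, threshold) pair is one operating mode.
Mode = Tuple[str, float]
# Assumption: each sample is (score, label), label 1 = positive, 0 = negative.
Samples = List[Tuple[float, int]]


def error_rates(samples: Samples, threshold: float) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate) at ``threshold``."""
    negatives = [s for s, y in samples if y == 0]
    positives = [s for s, y in samples if y == 1]
    fpr = sum(s >= threshold for s in negatives) / len(negatives)
    fnr = sum(s < threshold for s in positives) / len(positives)
    return fpr, fnr


def non_dominated(points: Dict[Mode, Tuple[float, float]]) -> List[Mode]:
    """Keep the modes that no other mode dominates (both objectives minimised)."""

    def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
        return all(x <= y for x, y in zip(a, b)) and a != b

    return sorted(
        mode
        for mode, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    )


def select_and_apply(
    validation: Dict[str, Samples],
    test: Dict[str, Samples],
    thresholds: List[float],
) -> Dict[Mode, Tuple[float, float]]:
    """Pick operating modes on validation, then measure them on test."""
    val_points = {
        (name, t): error_rates(samples, t)
        for name, samples in validation.items()
        for t in thresholds
    }
    selected = non_dominated(val_points)
    # The thresholds are frozen here: nothing is re-tuned on the test set.
    return {(name, t): error_rates(test[name], t) for name, t in selected}
```

The point of the sketch is that the test set is only ever used to measure the modes chosen on the validation set, never to choose them.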

A separate CLI tool could be used to print out the combinations of threshold + (sub) system that actually define the NDS at the estimated Pareto front.

We propose to modify the "Scores" representation, replacing the list of lists of scores with a dictionary mapping a string to a list of scores. That way, the library can report both the threshold and a meaningful, user-provided string identifying the sub-system of interest.
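To make the proposed change concrete, here is a small sketch of the two layouts side by side. The variable names, the system names and the helpers `from_lists` and `describe` are all hypothetical, and the actual "Scores" type in the library may carry more than plain floats.

```python
from typing import Dict, List

# Current layout (as described above): sub-systems are identified by position only.
scores_as_lists: List[List[float]] = [[0.1, 0.8, 0.4], [0.3, 0.9, 0.2]]

# Proposed layout: each sub-system is keyed by a user-provided name.
scores_by_system: Dict[str, List[float]] = {
    "system-a": [0.1, 0.8, 0.4],
    "system-b": [0.3, 0.9, 0.2],
}


def from_lists(lists: List[List[float]], prefix: str = "system") -> Dict[str, List[float]]:
    """Hypothetical migration helper: derive names from list positions."""
    return {f"{prefix}-{i}": list(scores) for i, scores in enumerate(lists)}


def describe(system: str, threshold: float) -> str:
    """Format one operating mode so a report can name the sub-system."""
    return f"system={system!r} threshold={threshold:.3f}"
```

With named keys, a report line such as `describe("system-a", 0.5)` identifies the sub-system directly, instead of relying on an index into the outer list.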

Edited Jul 22, 2025 by André Anjos