Picking a "single" threshold during evaluation is hard

Ideally, we should have a flexible threshold selection mechanism:

- If the user provides a floating-point number, we apply it as the threshold to all splits.
- If the user provides a split name, we calculate the threshold a priori on that split, then apply it to all the other splits.
- If the user provides no input concerning thresholds, then the strategy should be this:
  - If there is a split named train, then its threshold is calculated on that split itself, and always applied a posteriori.
  - If there is a split named validation, then its threshold is calculated on that split itself, and always applied a posteriori.
  - For all other splits, we use the threshold that was calculated on the validation split. If no validation split is present, we default to the midpoint between min(labels) and max(labels).
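
The rules above could be sketched roughly as follows. This is only an illustration, not an existing API: the names `select_thresholds` and `best_f1_threshold`, the `(scores, labels)` layout of each split, and the use of F1 maximization as the way a threshold is "calculated" are all assumptions made for the example. In the split-name case, the sketch also reuses the threshold on the named split itself.

```python
from typing import Callable, Dict, Mapping, Optional, Sequence, Tuple, Union

# Hypothetical layout: each split is a pair (scores, binary labels).
Split = Tuple[Sequence[float], Sequence[int]]


def best_f1_threshold(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Assumed criterion: the observed score that maximizes F1 (score >= t is positive)."""
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # ties keep the lowest threshold
            best_t, best_f1 = t, f1
    return best_t


def select_thresholds(
    splits: Mapping[str, Split],
    threshold: Optional[Union[float, str]] = None,
    criterion: Callable[[Sequence[float], Sequence[int]], float] = best_f1_threshold,
) -> Dict[str, float]:
    """Return one threshold per split name."""
    if isinstance(threshold, (int, float)):
        # A number: the same fixed value for every split.
        return {name: float(threshold) for name in splits}

    if isinstance(threshold, str):
        # A split name: calculated a priori on that split, reused everywhere.
        value = criterion(*splits[threshold])
        return {name: value for name in splits}

    # No user input: train and validation each get their own a posteriori threshold.
    result: Dict[str, float] = {}
    for name in ("train", "validation"):
        if name in splits:
            result[name] = criterion(*splits[name])

    for name, (_, labels) in splits.items():
        if name in result:
            continue
        if "validation" in result:
            # Every other split reuses the validation threshold.
            result[name] = result["validation"]
        else:
            # Fallback: midpoint between the smallest and largest label.
            result[name] = (min(labels) + max(labels)) / 2
    return result
```

Keeping the criterion as a parameter means the selection logic stays the same if another way of calculating the threshold is preferred.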