The current score normalization pipeline has a bug when running experiments with separate dev/eval groups. Once the normalization statistics are computed on the dev set, their checkpoint is saved under a name indistinguishable from the evaluation set's. Consequently, the evaluation pipeline loads the dev-set statistics instead of computing its own, which breaks the run.
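One way to avoid this kind of collision is to make the group identifier part of the checkpoint name. The sketch below is only an illustration (the function and parameter names `stats_checkpoint_name`, `group`, and `config` are hypothetical, not the pipeline's actual API): it derives a filename from both the group and a hash of the configuration, so dev and eval statistics can never shadow each other.

```python
import hashlib
import json

def stats_checkpoint_name(group: str, config: dict) -> str:
    """Build a checkpoint filename that is unique per dev/eval group.

    The group identifier is embedded in the name, so statistics
    computed on the dev set cannot be picked up by the eval pipeline.
    A short hash of the config keeps runs with different settings apart.
    """
    digest = hashlib.sha1(
        json.dumps(config, sort_keys=True).encode()
    ).hexdigest()[:8]
    return f"norm_stats_{group}_{digest}.pkl"

# Identical configs still yield distinct files per group:
cfg = {"method": "s-norm", "cohort_size": 200}
dev_name = stats_checkpoint_name("dev", cfg)
eval_name = stats_checkpoint_name("eval", cfg)
assert dev_name != eval_name
```

Any scheme that guarantees the group name ends up in the stored path would do; the hash is just a convenience to separate runs with different settings as well.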
This bug was introduced when the score normalization pipeline was refactored. In the future it would be good to have a small dummy dataset to test all of this, even if it costs extra hours on the CI.