RFW dataset: overlap and mis-labelling between training and testing sets
Based on the datasets we received from Wang et al., two problems occurred during our experiments when using the z-samples or t-samples defined at https://gitlab.idiap.ch/bob/bob.bio.face/-/blob/master/bob/bio/face/database/rfw.py#L242.
- There are 44 subjects classified as Caucasian in the training set but as Indian in the testing set (e.g. m.0c96fs, m.08y5xt, etc.).
- When we take 2500 z-samples from each race as the cohort, we detect more than 6000 cross-set pairs of subjects (one from training, one from testing) with very high similarity scores (-0.5 to -0.1). After manually checking some of them, those samples clearly belong to the same person, i.e. these are not imposter scores. So there is overlap between the training and testing sets, which should not be the case.
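The two checks above can be sketched as set operations over per-set subject-to-race mappings. This is a minimal illustration, not the actual protocol code: the helper names and the toy data are hypothetical, and in practice the mappings would be built from the RFW protocol files or the `bob.bio.face` database interface.

```python
# Hypothetical consistency checks: each set is reduced to a mapping
# {subject_id: race}, then we look for label mismatches and overlap.

def find_race_mismatches(train, test):
    """Subjects present in both sets but labelled with different races."""
    return {s: (train[s], test[s])
            for s in train.keys() & test.keys()
            if train[s] != test[s]}

def find_overlap(train, test):
    """Subject IDs that appear in both the training and the testing set."""
    return sorted(train.keys() & test.keys())

if __name__ == "__main__":
    # Toy data mimicking the reported problem: m.0c96fs and m.08y5xt are
    # Caucasian in training but Indian in testing.
    train = {"m.0c96fs": "Caucasian", "m.08y5xt": "Caucasian"}
    test = {"m.0c96fs": "Indian", "m.08y5xt": "Indian"}
    print(find_race_mismatches(train, test))
    print(find_overlap(train, test))
```

Note that this only catches overlap when the same subject ID is reused across sets; the score-based check in the second bullet is still needed for subjects that overlap under different IDs.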
This bug report serves as a record of these problems. I'm not sure whether they only happen to us because of a different version of the datasets; we could discuss it at a later stage. One option would be to use another BUPT dataset such as BUPT-Balanced as the training set, since Wang et al. stated there is no overlap between BUPT-Balanced and RFW. In that case face detection might be necessary, since no landmarks are given for that dataset.
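For the second problem above, the suspicious cross-set pairs could be collected with a simple threshold on the cohort similarity scores. This is a hedged sketch only: the function name, the tuple layout, and the threshold value are illustrative assumptions (chosen from the reported -0.5 to -0.1 range for genuine-looking pairs), not part of the actual scoring pipeline.

```python
# Hypothetical filter: given cross-set (train_subject, test_subject, score)
# triples, keep pairs whose similarity is high enough to warrant manual
# inspection as potential train/test overlap.

def flag_suspicious_pairs(scores, threshold=-0.5):
    """Return cross-set pairs scoring at or above the threshold, sorted."""
    return sorted((t, q, s) for t, q, s in scores if s >= threshold)

if __name__ == "__main__":
    pairs = [
        ("m.0c96fs", "m.0c96fs", -0.21),  # in the reported "same person" range
        ("m.0aaa", "m.0bbb", -0.93),      # typical imposter-level score
    ]
    print(flag_suspicious_pairs(pairs))
```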