Database interface
Implemented a simple filelist database interface for VanillaBiometrics based on CSV files.
The CSVDatasetDevEval needs to have the following format:
my_dataset/
my_dataset/my_protocol/
my_dataset/my_protocol/train.csv
my_dataset/my_protocol/dev_enroll.csv
my_dataset/my_protocol/dev_probe.csv
my_dataset/my_protocol/eval_enroll.csv
my_dataset/my_protocol/eval_probe.csv
...
where each CSV file needs to have the following format:
PATH,SUBJECT
path_1,subject_1
path_2,subject_2
path_i,subject_j
This format allows the usage of metadata by following the pattern below:
PATH,SUBJECT,METADATA_1,METADATA_2,METADATA_k
path_1,subject_1,A,B,C
path_2,subject_2,A,B,1
path_i,subject_j,2,3,4
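As a rough illustration of the CSV format above, here is a minimal sketch that reads such a file with the standard library only. It is not the code from this MR: the function name load_samples and the returned dictionary layout are assumptions made for the example.

```python
import csv
import io


def load_samples(csv_file):
    """Read a PATH,SUBJECT[,METADATA...] CSV and return one dict per row.

    Hypothetical helper, not part of bob.bio.base.
    """
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    if [h.upper() for h in header[:2]] != ["PATH", "SUBJECT"]:
        raise ValueError("the first two columns must be PATH and SUBJECT")
    samples = []
    for row in reader:
        samples.append(
            {
                "path": row[header[0]],
                "subject": row[header[1]],
                # every remaining column is kept as free-form metadata
                "metadata": {key: row[key] for key in header[2:]},
            }
        )
    return samples


# Small in-memory file standing in for e.g. dev_enroll.csv
example = io.StringIO(
    "PATH,SUBJECT,METADATA_1,METADATA_2\n"
    "path_1,subject_1,A,B\n"
    "path_2,subject_2,A,1\n"
)
samples = load_samples(example)
```

All values come back as strings; converting metadata to other types would be up to the caller.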
We can imagine other implementations of this.
For instance, a CSVDatasetCrossValidation that, given a single CSV file, splits the data "on the fly" into sets for enrolling, probing and training.
Or a CSVDatasetWithEyesAnnotation that handles annotations for face recognition pipelines.
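To make the cross-validation idea more concrete, here is a minimal sketch of a subject-disjoint split followed by an enroll/probe split. The function names, the train_fraction and n_enroll parameters and the seeding are all assumptions for illustration; the actual CSVDatasetCrossValidation may work differently.

```python
import random
from collections import defaultdict


def split_by_subject(rows, train_fraction=0.5, seed=0):
    """Split (path, subject) rows so that no subject is in both sets."""
    subjects = sorted({subject for _, subject in rows})
    random.Random(seed).shuffle(subjects)
    n_train = int(len(subjects) * train_fraction)
    train_subjects = set(subjects[:n_train])
    train = [(p, s) for p, s in rows if s in train_subjects]
    dev = [(p, s) for p, s in rows if s not in train_subjects]
    return train, dev


def split_enroll_probe(dev_rows, n_enroll=1):
    """Use the first n_enroll samples of each subject for enrollment."""
    seen = defaultdict(int)
    enroll, probe = [], []
    for path, subject in dev_rows:
        if seen[subject] < n_enroll:
            enroll.append((path, subject))
        else:
            probe.append((path, subject))
        seen[subject] += 1
    return enroll, probe


# 12 samples, 4 subjects, 3 samples per subject
rows = [(f"path_{i}", f"subject_{i % 4}") for i in range(12)]
train, dev = split_by_subject(rows)
enroll, probe = split_enroll_probe(dev)
```

Splitting on subjects rather than on samples keeps the training identities out of the dev set, which is what "split by subject" suggests here.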
I still need to implement a mechanism that takes zip files as input to CSVDatasetDevEval.
That way we can ship databases as simple zip files.
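The zip mechanism is not implemented yet, so the following is only a sketch of one way it could look, using the standard zipfile module. The helper name read_csv_from_zip and the idea of addressing a CSV by its member path inside the archive are assumptions.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_source, member):
    """Return the rows of one CSV file stored inside a zip archive.

    zip_source can be a path or a binary file-like object.
    Hypothetical helper, not part of bob.bio.base.
    """
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            # archive.open yields bytes; wrap it so csv gets text
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Build a tiny archive in memory to stand in for a shipped database
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "my_dataset/my_protocol/train.csv",
        "PATH,SUBJECT\npath_1,subject_1\n",
    )
buffer.seek(0)

rows = read_csv_from_zip(buffer, "my_dataset/my_protocol/train.csv")
```

Reading members directly avoids extracting the archive to disk, which may or may not be what the final implementation wants.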
ping @ydayer @amohammadi
I'll merge this tomorrow.
I need this to support the efforts on bob.bio.vein.
Activity
- The one I'm proposing is a "real" CSV file (comma-separated and with a header)
- This format will allow us to ship metadata within the CSV file (which is not the case with the current one)
- Another motivation to move away from this format is the three-column format that has to be defined in for_models.lst. In almost all cases client_id = model_id. I can only remember one case where this is not true, and in that case (IJB-C) the database doesn't use our dataset interface
This can be an opportunity to address these issues with the interface, as we are already addressing other issues
Edited by Tiago de Freitas Pereira
@tiago.pereira I think with a few minor modifications (while keeping backward compatibility) you can use our old format https://www.idiap.ch/software/bob/docs/bob/docs/master/bob/bob.bio.base/doc/filelist-guide.html
We shouldn't keep backward compatibility IMO.
As a last resort, people can wrap their file-based interface with the DatabaseConnector (https://gitlab.idiap.ch/bob/bob.bio.base/-/blob/database-interface/bob/bio/base/pipelines/vanilla_biometrics/legacy.py#L39)
If you create everything new, you should call it new too. Don't call it Bob. Otherwise, how can we use the next release of Bob in our current experiments? And while you are at changing everything without backward compatibility, why don't you re-write all code in Lua or what you have now? See how many people will use your code then.
I totally agree with Amir and Pavel, you should think about backward compatibility. Otherwise, the new interface will break all experiments that are currently running with the old-style file lists. There are two solutions to that:
- Implement both behaviors into the current interface, and leave the old one as the default.
- Implement the new interface in a different class, and leave both next to each other.
added 2 commits
I've just implemented another FileList dataset interface based on CSV files. The CSVDatasetCrossValidation, as the name says, handles cross-validation. It basically takes one CSV file as input and gives you a Dataset object compatible with VanillaBiometrics, with the training and dev sets split by subject
you're removing the old interface here, how can they load the old .lst files?
Thanks for your comments
As I pointed out in the items above, this new dataset interface better addresses our new needs and gives us more flexibility to extend it if necessary (e.g. a CSVToSampleLoader that handles some specific annotations that are available in a CSV file).
If the FileList interface that we currently have on master is as vital as you are claiming, I can put it back into this branch. However, we should encourage the usage of this one, which is cleaner IMHO.
We are doing our best to keep backward compatibility by:
- Making sure that we have the necessary adaptors that convert Preprocessors, Extractors, and Algorithms to Transformers
- Creating adaptors to our current bob.dbs.*
- Making tests to make sure that everything works together, with/without checkpoints and with Dask as well
- Making our C++ Python bindings picklable so we can seamlessly use them under Dask.
In this work, we are also considering deprecating some stuff, and this BREAKS backward compatibility (https://gitlab.idiap.ch/search?group_id=373&project_id=&repository_ref=&scope=issues&search=DEPRECATION). We shouldn't keep backward compatibility with stuff we don't use.
It's a waste of energy. My energy.
Again, if the FileList interface that we currently have on master is so vital as you are claiming, I can put it back to this branch.
Edited by Tiago de Freitas Pereira
I will not respond to mockery. Have some respect for the work of other people.
Take it easy. I didn't mean to mock your work, I was just making my point in a more colorful language. Is this a cemetery where everyone needs to be dressed in black and whisper politely? I don't understand why you are taking it so seriously.
But yes, the FileList interface is very widely used and it should remain available for the foreseeable future. We can slowly start migrating to a new interface, sure, but as life shows, it may actually take years.
@tiago.pereira I understand that most of the interfaces used at Idiap are not based on the filelist database, because you know how to design a proper database. People outside might not have that deep knowledge and rather rely on simple things. While they might switch to the new database interface that you have implemented (thank you for your effort!), it might take a while for them to do so.
When I am implementing interfaces and protocols for new databases, I almost always use the filelist database, just because of its simplicity. Only when the databases get larger or already come with a protocol do I think about implementing a proper interface.
You are right that most cases have client_id = model_id, but not all of them do. In addition to the IJB datasets that you mentioned, the FRGC and the GBU datasets also rely on client_id != model_id (but they do not use the filelist database). I recently had a case where I also required this for the filelist database, since I had defined several models per client. Thus, this feature is not as useless as you think. If you really want to implement the client_id = model_id idea, you can add functionality that allows the for_models file to contain only two columns. However, make sure that the three-column way still works.
Thanks for the feedback @mguenther I'm glad to hear from you more often :-)
The current database interface is back on for backward compatibility.
People outside might not have that deep knowledge and rather rely on simple things.
Yes, I do agree and we observe this quite often. That's why, for instance, I'm pushing the CSVDatasetCrossValidation. In this interface, you have to provide only one two-column CSV file (or more columns if you want to ship some metadata). How hard is that?
If you want something more sophisticated, you can go for CSVDatasetDevEval, where you can play with dev and eval sets with a minimal two-column file (again, metadata are allowed, so you can extend this file to more columns). Worth noting that client_id <> model_id is allowed with this format. It's just a matter of adding more columns to your CSV files ([dev-eval]_enroll.csv).
If you still need something more sophisticated, you can extend this FileList-based interface yourself to fit your needs.
Thanks again for the feedback
Cheers
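As a hedged illustration of the client_id <> model_id remark above, here is a sketch of how an extra column in an enroll CSV could carry a separate model identifier. The column name REFERENCE_ID and the helper group_enroll_rows are hypothetical and chosen only for this example; when the column is absent, the sketch falls back to client_id = model_id.

```python
import csv
import io
from collections import defaultdict


def group_enroll_rows(csv_file, model_column="REFERENCE_ID"):
    """Group enrollment paths per model.

    Returns (paths per model id, subject per model id).
    REFERENCE_ID is an assumed column name, not a documented one.
    """
    models = defaultdict(list)
    subjects = {}
    for row in csv.DictReader(csv_file):
        # Without the extra column, the subject doubles as the model id
        model_id = row.get(model_column) or row["SUBJECT"]
        models[model_id].append(row["PATH"])
        subjects[model_id] = row["SUBJECT"]
    return dict(models), subjects


# Two models defined for the same client
example = io.StringIO(
    "PATH,SUBJECT,REFERENCE_ID\n"
    "path_1,subject_1,model_a\n"
    "path_2,subject_1,model_a\n"
    "path_3,subject_1,model_b\n"
)
models, subjects = group_enroll_rows(example)
```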
changed milestone to %Bob 9.0.0
mentioned in commit 5487e057