Database interface

marked as a Work In Progress

changed the description

wait why is this different from the filelist format that we already have?

The one I'm proposing is a "real" CSV file (comma-separated and with a header).
This format will allow us to ship metadata within the CSV file, which is not possible with the current one.
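
To illustrate the idea of a CSV file with a header that also carries metadata, here is a minimal sketch using only the Python standard library. The column names "path" and "subject", the sample rows, and the helper load_samples are assumptions made for this example; they are not taken from the merge request and are not the bob.bio.base API.

```python
import csv
import io

# Hypothetical file content. The column names "path" and "subject" are assumed.
CSV_TEXT = """path,subject,gender,age
s1/img_001,s1,f,31
s2/img_002,s2,m,45
"""

# Columns every file must provide; any other column is treated as metadata.
REQUIRED = ("path", "subject")


def load_samples(handle):
    """Return one dict per row, with non-required columns grouped as metadata."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    samples = []
    for row in reader:
        metadata = {k: v for k, v in row.items() if k not in REQUIRED}
        samples.append(
            {"path": row["path"], "subject": row["subject"], "metadata": metadata}
        )
    return samples


samples = load_samples(io.StringIO(CSV_TEXT))
```

Because the header names the columns, adding a metadata column does not change how the required columns are read.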

Still, I don't understand why you're moving away from the old columns and folder structure. If you want to add metadata, you can just add more columns.

Another motivation for moving away from this format is the three-column layout required in for_models.lst. In almost all cases, client_id = model_id. I can remember only one case where this is not true (IJB-C), and in that case the database doesn't use our dataset interface.

This can be an opportunity to address these issues with the interface, as we are already addressing other issues.

And I see everyone setting client_id = model_id unnecessarily, just because it is mandatory.

You may make the model_id optional in the new implementation.

added 1 commit

34bd50fb - Improved test cases

@tiago.pereira I think with a few minor modifications (while keeping backward compatibility) you can use our old format https://www.idiap.ch/software/bob/docs/bob/docs/master/bob/bob.bio.base/doc/filelist-guide.html

We shouldn't keep backward compatibility IMO.

As a last resort, people can wrap their file-based interface with the DatabaseConnector (https://gitlab.idiap.ch/bob/bob.bio.base/-/blob/database-interface/bob/bio/base/pipelines/vanilla_biometrics/legacy.py#L39)

you're removing the old interface here, how can they load the old .lst files?

If you create everything new, you should call it new too. Don't call it Bob. Otherwise, how can we use the next release of Bob in our current experiments? And while you are at changing everything without backward compatibility, why don't you re-write all code in Lua or what you have now? See how many people will use your code then.

I totally agree with Amir and Pavel, you should think about backward compatibility. Otherwise, the new interface will break all experiments that are currently running with the old-style file lists. There are two solutions to that:

Implement both behaviors into the current interface, and leave the old one as the default.
Implement the new interface in a different class, and leave both next to each other.

added 2 commits

4de7cc3b - Implemented CrossValidation Filelist dataset
9f6da96e - Created new test protocol

I've just implemented another FileList dataset interface based on CSV files. CSVDatasetCrossValidation, as the name says, handles cross-validation. It takes one CSV file as input and gives you a Dataset object compatible with VanillaBiometrics, with the training and dev sets split by subject.
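
A subject-wise split of this kind can be sketched as follows. This is not the CSVDatasetCrossValidation code; the function name split_by_subject, the dev_fraction and seed parameters, and the dict-based sample layout are all assumptions made for illustration.

```python
import random


def split_by_subject(samples, dev_fraction=0.2, seed=0):
    """Split samples into train and dev lists so that no subject is in both.

    `samples` is a list of dicts with a "subject" key (an assumed layout).
    """
    subjects = sorted({s["subject"] for s in samples})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(subjects)
    # Keep at least one subject in the dev set.
    n_dev = max(1, int(round(len(subjects) * dev_fraction)))
    dev_subjects = set(subjects[:n_dev])
    train = [s for s in samples if s["subject"] not in dev_subjects]
    dev = [s for s in samples if s["subject"] in dev_subjects]
    return train, dev


# Five hypothetical subjects with two samples each.
example = [
    {"path": f"s{i}/img_{j}", "subject": f"s{i}"}
    for i in range(5)
    for j in range(2)
]
train, dev = split_by_subject(example, dev_fraction=0.2, seed=0)
```

Splitting on the subject rather than on the individual sample keeps every identity entirely in one set, which is the property the comment above describes.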

you're removing the old interface here, how can they load the old .lst files?

I totally agree with Amir and Pavel, you should think about backward compatibility. Otherwise, the new interface will break all experiments that are currently running with the old-style file lists. There are two solutions to that:

Implement both behaviors into the current interface, and leave the old one as the default.

Implement the new interface in a different class, and leave both next to each other.

Thanks for your comments

As I pointed out in the items above, this new dataset interface better addresses our new needs and gives us more flexibility to extend it if necessary (e.g., a CSVToSampleLoader that handles specific annotations available in a CSV file).

If the FileList interface that we currently have on master is as vital as you claim, I can put it back in this branch. However, we should encourage the use of this one, which is cleaner IMHO.

We are doing our best to keep backward compatibility by:

Making sure that we have the necessary adaptors that convert Preprocessors, Extractors, and Algorithms to Transformers
Creating adaptors to our current bob.dbs.*
Making tests to make sure that everything works together, with/without checkpoints and with dask as well
Making our C++ Python bindings picklable so we can flawlessly use them under Dask.

In this work, we are also considering deprecating some stuff, and this BREAKS backward compatibility (https://gitlab.idiap.ch/search?group_id=373&project_id=&repository_ref=&scope=issues&search=DEPRECATION). We shouldn't keep backward compatibility with stuff we don't use.
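
The first item in the list above, adapting an old-style component to a Transformer, can be sketched like this. Both class names are hypothetical and the fit/transform convention is assumed to follow the usual scikit-learn style; this is not the adaptor shipped in bob.bio.base.

```python
class LegacyPreprocessor:
    """Stand-in for an old-style component that is called on one sample at a time."""

    def __call__(self, data):
        return [x * 2 for x in data]


class TransformerAdaptor:
    """Hypothetical wrapper exposing a per-sample callable through fit/transform."""

    def __init__(self, legacy_callable):
        self.legacy_callable = legacy_callable

    def fit(self, X, y=None):
        # Nothing to learn in this sketch; return self so calls can be chained.
        return self

    def transform(self, X):
        # Apply the wrapped callable to every sample in the batch.
        return [self.legacy_callable(x) for x in X]


adaptor = TransformerAdaptor(LegacyPreprocessor())
result = adaptor.fit([[1]]).transform([[1, 2], [3]])
```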

It's a waste of energy. My energy.

Again, if the FileList interface that we currently have on master is as vital as you claim, I can put it back in this branch.

If you create everything new, you should call it new too. Don't call it Bob. Otherwise, how can we use the next release of Bob in our current experiments? And while you are at changing everything without backward compatibility, why don't you re-write all code in Lua or what you have now? See how many people will use your code then.

I will not respond to mockery. Have some respect for the work of other people.

Take it easy. I didn't mean to mock your work, I was just making my point in a more colorful language. Is this a cemetery where everyone needs to be dressed in black and whisper politely? I don't understand why you are taking it so seriously.

But yes, the FileList interface is very widely used and it should remain available for the foreseeable future. We can slowly start migrating to a new interface, sure, but as life shows, it may actually take years.

added 1 commit

77951f3b - Moved back the Current FileList Interface

I just put back the current FileList interface. All tests are passing.

unmarked as a Work In Progress

@tiago.pereira I understand that most of the interfaces used at Idiap are not based on the filelist database, because you know how to design a proper database. People outside might not have that deep knowledge and rather rely on simple things. While they might switch to the new database interface that you have implemented (thank you for your effort!), it might take a while for them to do so.

When I am implementing interfaces and protocols for new databases, I almost always use the filelist database, just because of its simplicity. Only when the databases get larger or already come with a protocol do I think about implementing a proper interface.

You are right that most cases have client_id = model_id, but not all of them do. In addition to the IJB datasets that you mentioned, the FRGC and the GBU datasets also rely on client_id != model_id (but they do not use the filelist database). I recently had a case where I also required this for the filelist database, since I had defined several models per client. Thus, this feature is not as useless as you think. If you really want to implement the client_id = model_id idea, you can add functionality that allows the for_models file to contain only two columns. However, make sure that the three-column way still works.
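
The suggestion above, accepting a two-column for_models file while keeping the three-column one working, could look roughly like the sketch below. The function parse_for_models is hypothetical, and the three-column order (filename, model_id, client_id) is an assumption that should be checked against the filelist guide.

```python
def parse_for_models(lines):
    """Parse whitespace-separated for_models entries with two or three columns.

    Two columns:   filename client_id           (model_id defaults to client_id)
    Three columns: filename model_id client_id  (assumed column order)
    """
    entries = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) == 2:
            filename, client_id = fields
            model_id = client_id
        elif len(fields) == 3:
            filename, model_id, client_id = fields
        else:
            raise ValueError(
                f"line {number}: expected 2 or 3 columns, got {len(fields)}"
            )
        entries.append((filename, model_id, client_id))
    return entries


two_col = parse_for_models(["data/a_01 clientA", "data/b_01 clientB"])
three_col = parse_for_models(
    ["data/a_01 modelA1 clientA", "data/a_02 modelA2 clientA"]
)
```

With this approach the several-models-per-client case stays expressible, while the common case no longer has to repeat the same identifier in two columns.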

Thanks for the feedback @mguenther I'm glad to hear from you more often :-)

The current database interface is back, for backward compatibility.

People outside might not have that deep knowledge and rather rely on simple things.

Yes, I do agree and we observe this quite often. That's why, for instance, I'm pushing the CSVDatasetCrossValidation. In this interface, you have to provide only one two-column CSV file (or more columns if you want to ship some metadata). How hard is that?

If you want something more sophisticated, you can go for CSVDatasetDevEval, where you can work with dev and eval sets using a minimal two-column file (again, metadata are allowed, so you can extend this file to more columns). It is worth noting that client_id <> model_id is allowed with this format. It's just a matter of adding more columns to your CSV files ([dev-eval]_enroll.csv).

If you still need something more sophisticated, you can extend this FileList-based interface yourself to fit your needs.

Thanks again for the feedback

Cheers

changed milestone to %Bob 9.0.0

merged

mentioned in commit 5487e057
