
Database interface

Merged Tiago de Freitas Pereira requested to merge database-interface into dask-pipelines
2 unresolved threads

Implemented a simple file-list database interface for VanillaBiometrics based on CSV files.

CSVDatasetDevEval expects the following directory layout:

       my_dataset/
       my_dataset/my_protocol/
       my_dataset/my_protocol/train.csv
       my_dataset/my_protocol/dev_enroll.csv
       my_dataset/my_protocol/dev_probe.csv
       my_dataset/my_protocol/eval_enroll.csv
       my_dataset/my_protocol/eval_probe.csv
       ...

where each CSV file needs to have the following format:

       PATH,SUBJECT
       path_1,subject_1
       path_2,subject_2
       path_i,subject_j

This format also allows metadata to be attached as extra columns, following the pattern below:

       PATH,SUBJECT,METADATA_1,METADATA_2,METADATA_k
       path_1,subject_1,A,B,C
       path_2,subject_2,A,B,1
       path_i,subject_j,2,3,4
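
A minimal sketch of how a file in this format could be read, assuming only the Python standard library. The helper name `load_samples` and the output dictionary keys are hypothetical and are not part of this merge request; the point is only that the two mandatory columns are read by name and every remaining column is treated as metadata.

```python
import csv
import io


def load_samples(csv_file):
    """Read PATH,SUBJECT rows; any extra columns are kept as metadata.

    Hypothetical helper for illustration, not the interface in this MR.
    """
    samples = []
    for row in csv.DictReader(csv_file):
        # Pop the two mandatory columns; whatever remains is metadata.
        path = row.pop("PATH")
        subject = row.pop("SUBJECT")
        samples.append({"path": path, "subject": subject, "metadata": dict(row)})
    return samples


# In-memory example using the header names from the description above.
example = io.StringIO(
    "PATH,SUBJECT,METADATA_1,METADATA_2\n"
    "path_1,subject_1,A,B\n"
    "path_2,subject_2,A,1\n"
)
samples = load_samples(example)
```

Note that `csv.DictReader` returns every value as a string, so numeric metadata would need an explicit conversion.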

We can imagine other implementations of this. For instance, a CSVDatasetCrossValidation that, given a single CSV file, splits the data "on the fly" into enrollment, probing, and training sets. Or a CSVDatasetWithEyesAnnotation that handles annotations for face recognition pipelines.

I still need to implement a mechanism that takes zip files as input to CSVDatasetDevEval. That way we can ship databases as simple zip files.
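
One way the zip input could work, sketched with the standard `zipfile` module only. This mechanism is not implemented in this merge request; the helper name `read_csv_from_zip` is hypothetical, and the sketch assumes the CSV files are stored in the archive as UTF-8 text.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_source, member):
    """Return the rows of one CSV file stored inside a zip archive.

    Hypothetical helper; `zip_source` is a path or a binary file object.
    """
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as raw:
            # ZipFile.open yields bytes; wrap it so csv can read text.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.DictReader(text))


# Build a tiny in-memory archive to exercise the helper.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "my_dataset/my_protocol/train.csv",
        "PATH,SUBJECT\npath_1,subject_1\n",
    )
buffer.seek(0)
rows = read_csv_from_zip(buffer, "my_dataset/my_protocol/train.csv")
```

Reading members directly from the archive avoids unpacking the database to disk first.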

ping @ydayer @amohammadi

I'll merge this tomorrow. I need this to support the efforts on bob.bio.vein.

Edited by Tiago de Freitas Pereira


Activity

  • Tiago de Freitas Pereira marked as a Work In Progress


  • Tiago de Freitas Pereira changed the description


  • added 1 commit


  • added 2 commits

    • 4de7cc3b - Implemented CrossValidation Filelist dataset
    • 9f6da96e - Created new test protocol


  • I've just implemented another FileList dataset interface based on CSV files. CSVDatasetCrossValidation, as the name says, handles cross-validation. It takes one CSV file as input and gives you a Dataset object compatible with VanillaBiometrics, with the training and dev sets split by subject.
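
    A minimal sketch of what "split by subject" means, assuming a seeded random partition of the subjects. The function `split_by_subject` and its parameters are hypothetical and do not reproduce the code of CSVDatasetCrossValidation; the sketch only illustrates that all samples of a subject land in the same set.

```python
import random
from collections import defaultdict


def split_by_subject(rows, dev_fraction=0.5, seed=0):
    """Split (path, subject) rows so that no subject appears in both sets.

    Hypothetical illustration, not the implementation in this MR.
    """
    by_subject = defaultdict(list)
    for path, subject in rows:
        by_subject[subject].append(path)

    # Shuffle the subjects (not the samples) with a fixed seed.
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)

    n_dev = int(len(subjects) * dev_fraction)
    dev_subjects = set(subjects[:n_dev])

    train = [(p, s) for p, s in rows if s not in dev_subjects]
    dev = [(p, s) for p, s in rows if s in dev_subjects]
    return train, dev


# Eight samples spread over four subjects.
rows = [(f"path_{i}", f"subject_{i % 4}") for i in range(8)]
train, dev = split_by_subject(rows)
```

    Partitioning at the subject level keeps identities in the dev set unseen during training.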

  • You're removing the old interface here. How can they load the old .lst files?

    I totally agree with Amir and Pavel: you should think about backward compatibility. Otherwise, the new interface will break all experiments that are currently running with the old-style file lists. There are two solutions to that:

    1. Implement both behaviors into the current interface, and leave the old one as the default.
    2. Implement the new interface in a different class, and leave both next to each other.

    Thanks for your comments.

    As I pointed out in the items above, this new dataset interface better addresses our new needs and gives us more flexibility to extend it if necessary (e.g., a CSVToSampleLoader that handles specific annotations available in a CSV file).

    If the FileList interface that we currently have on master is as vital as you claim, I can put it back on this branch. However, we should encourage the usage of the new one, which is cleaner IMHO.

    We are doing our best to keep backward compatibility by:

    • Making sure that we have the necessary adaptors that convert Preprocessors, Extractors, and Algorithms to Transformers
    • Creating adaptors to our current bob.dbs.*
    • Making tests to make sure that everything works together, with/without checkpoints and with dask as well
    • Making our C++ Python bindings picklable so we can use them seamlessly under Dask.

    In this work, we are also considering deprecating some things, and this BREAKS backward compatibility (https://gitlab.idiap.ch/search?group_id=373&project_id=&repository_ref=&scope=issues&search=DEPRECATION). We shouldn't keep backward compatibility with things we don't use.

    It's a waste of energy. My energy.

    Again, if the FileList interface that we currently have on master is as vital as you claim, I can put it back on this branch.

    Edited by Tiago de Freitas Pereira
  • If you create everything new, you should call it new too. Don't call it Bob. Otherwise, how can we use the next release of Bob in our current experiments? And while you are at it, changing everything without backward compatibility, why don't you rewrite all the code in Lua or whatever you have now? See how many people will use your code then.

    I will not respond to mockery. Have some respect for the work of other people.

  • Take it easy. I didn't mean to mock your work, I was just making my point in a more colorful language. Is this a cemetery where everyone needs to be dressed in black and whisper politely? I don't understand why you are taking it so seriously.

    But yes, the FileList interface is very widely used and it should remain available for the foreseeable future. We can slowly start migrating to a new interface, sure, but as life shows, it may actually take years.

  • added 1 commit

    • 77951f3b - Moved back the Current FileList Interface


  • I just put the current FileList interface back. All tests are passing.

  • Tiago de Freitas Pereira unmarked as a Work In Progress


  • @tiago.pereira I understand that most of the interfaces used at Idiap are not based on the filelist database, because you know how to design a proper database. People outside might not have that deep knowledge and rather rely on simple things. While they might switch to the new database interface that you have implemented (thank you for your effort!), it might take a while for them to do so.

    When I am implementing interfaces and protocols for new databases, I almost always use the filelist database, just because of its simplicity. Only when the databases get larger or already come with a protocol do I think about implementing a proper interface.

    You are right that most cases have client_id = model_id, but not all of them do. In addition to the IJB datasets that you mentioned, the FRGC and the GBU datasets also rely on client_id != model_id (but they do not use the filelist database). I recently had a case where I also required this for the filelist database, since I had defined several models per client. Thus, this feature is not as useless as you think. If you really want to implement the client_id = model_id idea, you can add functionality that allows the for_models file to contain only two columns. However, make sure that the three-column way still works.


    Thanks for the feedback, @mguenther. I'm glad to hear from you more often :-)

    The current database interface is back for backward compatibility.

    "People outside might not have that deep knowledge and rather rely on simple things."

    Yes, I do agree and we observe this quite often. That's why, for instance, I'm pushing the CSVDatasetCrossValidation. In this interface, you have to provide only one two-column CSV file (or more columns if you want to ship some metadata). How hard is that?

    If you want something more sophisticated, you can go for CSVDatasetDevEval, where you can play with train, dev, and eval sets using a minimal two-column file (again, metadata are allowed, so you can extend this file with more columns). It is worth noting that client_id <> model_id is allowed with this format. It's just a matter of adding more columns to your CSV files ([dev-eval]_enroll.csv).
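
    As a rough sketch of the extra-column idea, assuming a hypothetical MODEL_ID column in an enrollment file (the column name and the helper `group_enroll_rows` are illustrative and not defined by this interface), several models per subject could be grouped like this:

```python
import csv
import io
from collections import defaultdict


def group_enroll_rows(csv_file, model_column="MODEL_ID"):
    """Group enrollment paths by model, allowing several models per subject.

    `model_column` is a hypothetical metadata column used for illustration.
    """
    models = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # Fall back to SUBJECT when the model column is absent or empty,
        # which reproduces the client_id = model_id case.
        model_id = row.get(model_column) or row["SUBJECT"]
        models[model_id].append(row["PATH"])
    return dict(models)


# subject_1 owns two distinct models here.
enroll = io.StringIO(
    "PATH,SUBJECT,MODEL_ID\n"
    "path_1,subject_1,model_a\n"
    "path_2,subject_1,model_b\n"
    "path_3,subject_2,model_c\n"
)
models = group_enroll_rows(enroll)
```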

    If you still need something more sophisticated, you can extend this FileList-based interface yourself to fit your needs.

    Thanks again for the feedback

    Cheers

  • changed milestone to %Bob 9.0.0

  • mentioned in commit 5487e057
