Using Bob as a library: Don't force HDF5 serialization
There are many, many places in bob.bio.base & the associated ecosystem where it is assumed the user wants to serialize information to an HDF5 file (for example, bob.bio.base.PCA's train_projector() always writes to an HDF5 file). This is an issue when using Bob tools in different use-cases & environments, as there's no guarantee that a user wants to write to an HDF5 file. Sometimes the user can't write to files, such as in BEAT, which is the specific use-case that concerns me.
(Disk) serialization should at least be opt-in, and the data that was previously saved to disk by default should be returned by the function instead. For the above PCA example, this would change train_projector() to return the variances by default, and optionally write them to disk. Changes like this are the bare minimum needed to use these Bob tools in BEAT.
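To make the opt-in idea concrete, here is a rough sketch of what I mean. It is not the actual bob.bio.base signature: the function body, the projector_file parameter, and the returned dict layout are all assumptions for illustration. Per-feature mean/variance stands in for the real PCA training, and JSON stands in for HDF5 so the sketch has no dependencies.

```python
import json
import statistics
from typing import Dict, List, Optional


def train_projector(training_features: List[List[float]],
                    projector_file: Optional[str] = None) -> Dict[str, List[float]]:
    """Hypothetical opt-in variant; NOT the real bob.bio.base API."""
    # Transpose rows of samples into columns of features.
    columns = list(zip(*training_features))
    # Stand-in for the real PCA training: per-feature mean and variance.
    result = {
        "mean": [statistics.fmean(c) for c in columns],
        "variances": [statistics.pvariance(c) for c in columns],
    }
    # Disk serialization happens only when the caller asks for it.
    if projector_file is not None:
        with open(projector_file, "w") as f:
            json.dump(result, f)
    # The trained data is always handed back to the caller.
    return result


# No file path given, so nothing touches the disk.
trained = train_projector([[1.0, 2.0], [3.0, 6.0]])
```

With something like this, a BEAT block could call the function and pass the returned data along however it likes, while existing users keep the old behaviour by passing a path.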
Honestly, though, serialization endpoints (disk, network, whatever) in general should be separated from individual Bob tools. A preprocessor/extractor/algorithm/whatever should have a method for general serialization as well as a method for rehydrating the instance using this data (this is already present in many places, but is just hard-coded to write to an HDF5 file). Some bob.serialization package could handle writing this data to disks/caches/networks/whatever.
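A minimal sketch of that separation, assuming a tool exposes its state as a plain dict. Every name here (Extractor, serialize, deserialize, save_to_stream, load_from_stream) is hypothetical and not an existing Bob API; the two stream helpers are the kind of thing a bob.serialization package might provide, with JSON again used only to keep the example dependency-free.

```python
import io
import json
from typing import Any, Dict, TextIO


class Extractor:
    """Hypothetical tool; stands in for any preprocessor/extractor/algorithm."""

    def __init__(self, scale: float = 1.0):
        self.scale = scale

    def serialize(self) -> Dict[str, Any]:
        # The tool only describes its state; it does not know where it goes.
        return {"scale": self.scale}

    @classmethod
    def deserialize(cls, state: Dict[str, Any]) -> "Extractor":
        # Rehydrate an instance from previously serialized state.
        return cls(scale=state["scale"])


def save_to_stream(state: Dict[str, Any], stream: TextIO) -> None:
    # Endpoint-specific code lives outside the tool (file, socket, cache...).
    json.dump(state, stream)


def load_from_stream(stream: TextIO) -> Dict[str, Any]:
    return json.load(stream)


# Round trip through an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
save_to_stream(Extractor(2.5).serialize(), buffer)
buffer.seek(0)
restored = Extractor.deserialize(load_from_stream(buffer))
```

The point is that swapping the in-memory buffer for an HDF5 file, a network socket, or a BEAT output would not require touching the tool itself.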
What does everyone think?