Some databases may contain more than one sample per file (e.g., videos, or audio files with two channels); while I understand that this is handled in bob.bio.video for video files, it is not clear how this can be handled for audio files with two channels in them.
If the preprocessor called the load method of the File (BioFile) class, we could use logical paths for File.path instead of the actual path and handle this in the load method. For example, File.path would be origpath_A or origpath_B depending on the channel, and the load method would then return only channel A or B, depending on which logical path was requested.
The File.load function (which, AFAIR, only exists in the bob.db.verification.utils.File class, not in bob.bio.base.database.File) is not used to load the original data -- at least that was not the intention of this function. It rather exists for compatibility purposes, as the File class in the old xbob.db databases contained such a function. In that implementation, it was supposed to read and write any kind of data -- usually using HDF5.
However, if the channels are stored in the same physical file, you could implement a specific audio preprocessor that returns one channel or the other based on the preprocessor configuration.
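A minimal sketch of such a preprocessor; the ChannelSelector name, the channel parameter, and the use of scipy.io.wavfile are illustrative assumptions, not actual bob.bio code:

```python
# Hypothetical sketch -- names and parameters are assumptions, not bob.bio code.
import scipy.io.wavfile
import bob.bio.base

class ChannelSelector(bob.bio.base.preprocessor.Preprocessor):
    """Reads a two-channel audio file and keeps only the configured channel."""

    def __init__(self, channel=0, **kwargs):
        super(ChannelSelector, self).__init__(**kwargs)
        self.channel = channel  # 0 for channel A, 1 for channel B

    def read_original_data(self, original_file_name):
        rate, audio = scipy.io.wavfile.read(original_file_name)
        # stereo data has shape (samples, channels); keep one channel only
        return rate, audio[:, self.channel]
```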
In fact, if you have a protocol that requires both channels, you should have two File instances for the same physical file, one for each channel. Then you can use the File.make_path function to add the channel name to each file name (i.e., when a unique filename is required) and read both channels (possibly letting the preprocessor decide which data to use) when reading the original data. With that I mean:
Add an add_channel parameter to the File.make_path function:
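A minimal sketch of this idea; the TwoChannelFile name and the channel attribute are illustrative assumptions:

```python
import os

# Hypothetical sketch; assumes each File instance carries a ``channel`` attribute
class TwoChannelFile(File):

    def make_path(self, directory=None, extension=None, add_channel=True):
        path = self.path
        if add_channel:
            # append the channel name to obtain a unique (logical) file name
            path = path + '-' + self.channel
        return str(os.path.join(directory or '', path + (extension or '')))
```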
and call this function with add_channel=False in the Database.original_file_name function. This allows reading the same physical file for two different File objects. Now, to tell the preprocessor which channel to use, you can return the channel as an annotation of the file (inside Database.annotations), such as:
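A minimal sketch, where the method override and the channel attribute are assumptions:

```python
# Hypothetical sketch; method and attribute names are assumptions
class TwoChannelDatabase(Database):

    def annotations(self, file):
        annots = super(TwoChannelDatabase, self).annotations(file) or {}
        annots['channel'] = file.channel  # tell the preprocessor which channel to load
        return annots
```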
and let the preprocessor decide what to do with this information (usually, load only this channel).
I have implemented a similar solution for image databases, where two or more people in the same physical image were annotated. There, I used the client_id instead of the channel in order to create a unique filename in the File.make_path function.
Anyways, this was only my solution to this specific problem. If you find a better solution (which might include revitalizing the File.load function -- I would rather call it File.read, to be consistent with the usage of the terms load and read throughout bob.bio), I would be happy to review a pull request.
I don't think this is a good idea. Loading of data should be done by the preprocessor, as only it knows how it wants to load the data. Your solution might work for any preprocessor that is able to process numpy arrays, but what about Bob-external preprocessors that use a different way of loading data, e.g.:
https://github.com/bioidiap/bob.bio.csu/blob/master/bob/bio/csu/preprocessor/LDAIR.py#L144
or even more tricky, when the preprocessor does not understand Python at all (such as in extensions using external/commercial software)?
I would prefer to go the way of annotations. However, I don't exactly know your problem, so it is difficult for me to come up with a solution. Maybe you can be more precise in what you want to achieve, and why it does not work with the current implementation...
We have been through similar discussions in the past. I think the real issue here is that we're talking about "files" while we should be discussing in terms of "samples". Sometimes, these samples are available on disk and can be cleanly loaded. Sometimes they simply can't.
I understand this is why Amir is suggesting loading data from a bob.db interface should be handled by the interface itself and not by a specific framework. In specific cases, these frameworks (like bob.bio.base) cannot handle this efficiently unless a lot of information is provided by the interface. As an example, imagine what you would have to do to properly handle loading data from a SQL database.
Let me gather the use-cases from the discussions above:
- Samples in a database may not be stored as 1 sample per file (e.g., samples stored in a SQL database, or multiple samples stored in a single file)
- Samples in a database may have to be loaded differently depending on the sample (databases with a mixture of PNG and JPEG images, for instance)
- Data may be fed into an external binary that can only handle files (maybe pipes as well)
Here is one solution that may work for all the cases above:
a. The database designer defines the concept of a Sample (N.B.: as long as 1 file = 1 sample, the Sample can just as well be our existing File, which is convenient)
b. The database designer defines a load() method that can load the said sample with all its weird specifications, from an original_directory.
c. The database designer defines a make_path() method that can return a unique path for a given Sample in the database interface.
Then, we do the following (in pseudo-code, details omitted):
```python
# Specific Preprocessor __call__ implementation that can handle numpy arrays
def __call__(self, sample):
    data = sample.load(original_directory)  # don't know where sample comes from
    # do your thing, setup ``result``
    return result
```
The temporary result may then be saved into a file using the Sample's make_path(), which should return a unique path for each sample in the database.
For case c. above, we can do it like this, in the specific preprocessor implementation (which is anyway the only entity that should know how to run the preprocessor properly):
```python
# Specific Preprocessor __call__ implementation that can only handle files
def __call__(self, sample):
    use_path = sample.make_path(original_directory)
    if not os.path.exists(use_path):
        # database is weird, save to temp file
        use_path = tempfile.NamedTemporaryFile()  # auto-deletes after file is closed
        data = sample.load()  # don't know where sample comes from
        # save ``data`` on ``use_path`` handle
    # do your thing, setup ``result`` from ``use_path``
    return result
```
If you wish, you may abstract this on a base class "FilePreprocessor".
Sorry, I just remembered that the old repo on GitHub is outdated now, so I re-post my reply here...
So far, I have used the File instance as what you would call a Sample. In most cases, I had 1 file = 1 sample, but for video databases, this is clearly not the case. There, we had one sample per video, which in case of the YouTube dataset is split over several image files.
Although I kind-of like the approach that you are proposing, so far I have tried to keep the interface to the tools (preprocessor, extractor, algorithm) as simple as possible, so that it is easy to implement them independently of any (specific) Bob database. Instead, all parameters to the functions of the tools are simple data types -- such as strings for file names. Your approach violates that concept by introducing a complex data structure as a parameter to a tool (a preprocessor). This might be difficult to understand for people trying to implement a new preprocessor. That's why I am thinking of different approaches that keep the interface as simple as possible -- while possibly making the backend more complicated.
In fact, the proper place to implement a custom solution for reading the data is the read_original_data function, not the __call__ function. The former function is specifically designed to handle cases like this, where you have more complicated structures. So far, this function receives an image filename and loads the original data using bob.io.base.load. As this function has a proper default implementation in bob.bio.base.Preprocessor:
https://github.com/bioidiap/bob.bio.base/blob/master/bob/bio/base/preprocessor/Preprocessor.py#L87
a modification of the default implementation could make sense there. In there, you might want to use an instance of a File to load the data. But please do not change the interface of functions that everyone needs to implement.
So, in the end, this modification would be similar to what @amohammadi proposed. However, I wouldn't want the modification in the tools.preprocessor function (since this is global to every preprocessor), but in the Preprocessor base class, so that custom solutions are still possible.
The temporary solution that I proposed got pushed to master during our refactoring process. The solution works; however, please feel free to open a new merge request to move the code into the Preprocessor base class, and we can discuss this further in that merge request.
Amir, could you please point me to the files that were actually updated during the refactoring? I cannot see where your modifications were implemented... I have checked several files but cannot find any modification...
I'm not sure why people keep associating two ideas which are not necessarily related: the File.load() method does not have to return a numpy.ndarray. It can return any complex structure needed. If that does not match the input required by a particular preprocessor, the load method can still be overloaded inside the bob.bio.db interface.
I think you still didn't get it, although I have now repeated it three times:
Only the preprocessor knows what the preprocessor needs. Modifying the bob.bio.db interface would change the database, so you would make modifications for all preprocessors (that can use this database).
Jeez... I think I tried to reply to your worries with past e-mail/issue exchanges. I understand you can do things in the current way. What we're trying to move towards, though, is to make it flexible enough to handle other things it can't do at the moment. Here is another go:
In this example, an image of the database may be served mirrored or not depending on the protocol. If you move the readout of the data up to the preprocessor and do the flipping of the image there, then all frameworks using this database would have to copy the preprocessor implementation. To avoid this copying, we have to move the flipping somewhere else. The most obvious place so far is the database package itself. This also fits well with the programming model, putting the responsibility for proper data readout on the person writing the database API. Loading tests should also be implemented, which ensure that the database programmer can effectively read the raw data.
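A minimal sketch of this idea on the database side; the MirroredFile name and the mirrored attribute are illustrative assumptions:

```python
import bob.io.base

# Hypothetical sketch; class name and ``mirrored`` attribute are assumptions
class MirroredFile(File):

    def __init__(self, mirrored=False, **kwargs):
        super(MirroredFile, self).__init__(**kwargs)
        self.mirrored = mirrored  # set according to the protocol

    def load(self, directory=None, extension=None):
        image = bob.io.base.load(self.make_path(directory, extension))
        # flip along the width axis when the protocol asks for the mirrored sample
        return image[..., ::-1] if self.mirrored else image
```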
We have discussed other examples in the past as well: databases in which there is more than one sample in a single file, for example. Some of these databases can be used in two different frameworks: bob.bio.base and another framework that will be created for speech analysis.
Now let me get back to your points:
Only the preprocessor knows what it needs: this is true, of course. Note, however, that a given preprocessor in a given framework (say bob.bio.base) is most probably never going to be rewritten because of database API changes. Most database packages provide a perfect-fit File.load() implementation and, when this is not the case, you can still patch it up in the bob.bio.db interface so as to make it conform to the preprocessor. This is OK: the preprocessor may have picky requirements and therefore require special code. I don't foresee, though, that this will be the general case. Normally, it should all fit together as of today. However, if it doesn't fit, as in the examples above (image modifications required by the protocol, or more than one sample per file), you still have a way out: patch close to the preprocessor to make one fit into the other. A bob.db like that can be used freely for any purpose and adapted as required to fit a particular task.
"Modifying the bob.bio.db interface would change the database, so you would make modifications for all preprocessors (that can use these database).": The only reason you would make a modification on the database high-level interface inside bob.bio.db would be to make the value provided by a bob.db.xyz.File.load() which would normally not fit to the preprocessor, fit it nicely. In this way, I don't understand your argument.
If a database normally provides "sensible" information via the File.load() method, then this should be what the preprocessors process by default. Examples:
- For face, vein, or fingerprint recognition: images (numpy.ndarray)
- For speaker recognition: audio signal(s) + sampling rate
- For video processing: 4D numpy.ndarray?
- etc.
I hope this sheds a bit more light on what is going on in my brain. Let me know otherwise.
I agree that the normal case should be to load the default data type for the given database. This can be done inside the File.load() method.
OK, you are saying that we should be able to provide a custom File.load method for loading data in a specific way required by the preprocessor. My question is how to do that in a generic way, i.e., without needing to replicate it for each of the bob.db.xyz.File classes for all databases. Hence, I was thinking about a way to solve this issue generically. But now I understand that -- most probably -- there is a strong correlation between the database and the preprocessor.
As far as I understand, we are currently stuck in a discussion about what should depend on what. You want the database to depend on the preprocessor, and I want it the other way round. However, there is a way to get rid of the dependency altogether. Let the user who designs the experiment decide which IO (s)he wants to have, by making the IO function a parameter of the preprocessor, and provide default implementations for that, e.g.:
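A minimal sketch of this idea; default_read_function and the read_function parameter are illustrative names, and the class mirrors what the base class could do:

```python
import bob.io.base

# Hypothetical sketch; ``default_read_function`` and ``read_function`` are assumed names
def default_read_function(bio_file, directory, extension):
    # the default behaves as today: load the data with bob.io.base
    return bob.io.base.load(bio_file.make_path(directory, extension))

class Preprocessor(object):

    def __init__(self, read_function=default_read_function, **kwargs):
        self.read_function = read_function

    def read_original_data(self, bio_file, directory, extension):
        # delegate the IO to the user-provided (or default) function
        return self.read_function(bio_file, directory, extension)
```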
Then, we can provide a different default for bob.bio.spear, which returns the sample rate and the data (or you can modify the File.load function to return those). Now, when the user defines an experiment, (s)he can provide a different implementation of the IO inside the configuration file, without needing to change either the database or the preprocessor, e.g., to extract the 11th frame of a video:
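A minimal sketch of such a configuration file; MyImagePreprocessor is an illustrative name, and bob.io.video.reader is used to iterate over the frames:

```python
# Hypothetical configuration file; MyImagePreprocessor is an assumed class name
import bob.io.video

def read_11th_frame(bio_file, directory, extension):
    video = bob.io.video.reader(bio_file.make_path(directory, extension))
    for index, frame in enumerate(video):
        if index == 10:  # frames are 0-indexed, so index 10 is the 11th frame
            return frame

preprocessor = MyImagePreprocessor(read_function=read_11th_frame)
```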
We only need to make sure that in all preprocessors, we hand down the read_function to the base class. The parameters for the read_function are discussable, but once decided, need to be well documented.
What do you think about this solution? This leaves us the freedom to:
- use the File.load function, independent of the preprocessor
- define a custom read_function in a configuration file that combines database and preprocessor, without modifying either of them
- overload the read_original_data function if we want something completely different (for example, the Filename class), independent of the database
By the way, the current implementation would not be very efficient for bob.bio.video. There, I have a FrameSelector that selects only some frames of the video, and I can exploit the fact that I don't have a fixed data structure (like a 4D numpy.ndarray). For example, the bob.db.youtube database provides frames as images; hence, I can read only those frames that I am interested in, reducing the image IO substantially.
Note that -- in order to do that -- I have a custom implementation of read_original_data for videos. Hence, the described approach would work without issues.
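For illustration, such a custom read_original_data might look roughly like this; the FrameSelectingPreprocessor name, the frame_indices parameter, and the one-image-per-frame directory layout are assumptions, not the actual bob.bio.video code:

```python
import os
import bob.io.base
import bob.bio.base

# Hypothetical sketch -- not the actual bob.bio.video code
class FrameSelectingPreprocessor(bob.bio.base.preprocessor.Preprocessor):

    def __init__(self, frame_indices=(10, 20, 30), **kwargs):
        super(FrameSelectingPreprocessor, self).__init__(**kwargs)
        self.frame_indices = frame_indices

    def read_original_data(self, original_file_name):
        # assume the database stores one image file per frame in a directory,
        # as bob.db.youtube does; read only the frames we are interested in
        frame_files = sorted(os.listdir(original_file_name))
        return [bob.io.base.load(os.path.join(original_file_name, frame_files[i]))
                for i in self.frame_indices]
```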
You can do all with a proper File and Database abstraction and it will be as efficient as you want.
To avoid code replication and improve DRY, you do inheritance.
As of today, all Files handled by bob.bio.db should be of the class bob.bio.db.File, and similarly for the Database type. You can imagine another class that inherits from this one and extends it to handle, as in your example, video frames more efficiently, for a specific purpose or to fit a preprocessor:
```python
class VideoFrameFile(bob.bio.db.File):
    def __init__(self, frames, ...):
        self.frames = frames
        ...

    def load(self, directory):
        # implementation of super-efficient video file reading
        ...

class VideoFrameDatabase(bob.bio.db.Database):
    def __init__(self, frames, ...):
        self.frames = frames

    def objects(...):
        return [VideoFrameFile(self.frames, k) for k in self.__db.objects(...)]

class MyDatabase(VideoFrameDatabase):
    def __init__(self, ...):
        super(MyDatabase, self).__init__(frames=[10, 20, 30])
        ...

# I hope you get the idea
```
In this design, the "selector" is relieved of a lot of burden, namely knowing about all the possibilities to read databases. The database designer, or the bob.bio.db database adaptor designer, carries all the burden of efficiently reading the original data. IOW, the "selector" job is built into the database interface.
With the current work you did on the config_file branch of bob.bio.base (thanks, by the way!), the user retains all configuration possibilities from the configuration script, by conveniently configuring the database on the spot (or not, if great defaults are provided). No more fiddling with the file selector or whatever selector; that remains simple and to the point -- which is to select the correct files and implement the loops inside the framework -- not to know how to read the raw data from every existing database and its protocol variations.
I don't think that any solution that removes the read_original_data function of the preprocessor will be flexible enough. This function was designed and implemented exactly for the purpose of flexibility.
Let me give another example (one that I have already implemented in https://gitlab.idiap.ch/bob/xfacereclib-extension-facevacs):
Someone wants to implement an algorithm that does not rely on Python at all. He has no idea what the algorithms look like; he only knows a command-line interface (or, in our case, a C++ interface) that handles everything based on filenames. The only thing that this guy (me in this case) needs to implement is a proper wrapper for the library, including a custom read_original_data function: https://gitlab.idiap.ch/bob/xfacereclib-extension-facevacs/blob/master/xfacereclib/extension/FaceVACS/FaceVACS.py#L47, and then it is ready to run on all existing image databases, and, using the current video extension, also on all currently implemented video databases.
Once again I repeat: ONLY THE PREPROCESSOR KNOWS WHAT THE PREPROCESSOR NEEDS! Yes, it might be possible to implement specific Files and Databases for each case, but my problem is that I don't want to touch the database just to implement a new preprocessor.
Anyways, if you guys think that you can handle everything with your database implementation, go ahead, and I am out of this discussion. But then, please, you handle all the requests that ask "how can I use my algorithm on your database?".
1. Since you now know that the images loaded by all databases are 3-dimensional numpy arrays (even though the databases load the files, they should return a specific format so that they can be used in the bob.bio framework), you can save the data to a temporary file and feed it to your custom preprocessor. Think of it like the BEAT platform: the databases always output numpy arrays, and it is up to you how to deal with them.
2. Instead of giving the loaded data to the preprocessor, we could give the BioFile object to the preprocessor and let the preprocessor decide whether to use the load method or the filename. But this will make some preprocessors that use the filename directly incompatible with databases that output different things for load and filename.
I had already thought about your first proposal, and I rejected the idea, as it would basically double the space and triple the I/O time (read, write, read again) for the preprocessors.
The second proposal might be feasible (and I think that was basically the first thing that I proposed to solve our issue, see my third post here: https://groups.google.com/forum/#!topic/bob-devel/ql-u7NMj4-M), but it was rejected by Andre -- though I still don't quite understand why.
I have already proposed a different solution, i.e., having an external load function that takes a BioFile as input and returns data in a specific format. This function should be passed to the preprocessor's constructor and used either by the read_original_data function, or even directly inside the preprocess function: https://gitlab.idiap.ch/bob/bob.bio.base/blob/master/bob/bio/base/tools/preprocessor.py#L72, replacing the current if-else statement.
By default, this function might be just BioFile.load or BioFile.make_path, but other variants might be possible. Anyways, although this would solve all our issues, I understand that it might be a bit more difficult to use. I can try to push some code to the new branch in order to make it clearer, and then we can decide whether we want to follow this path.
In bob.bio.spear, we could solve our issue in two ways: either we provide a custom load_function, or we have the corresponding BioFile.load functions return both the sample rate and the data.
Now, the user has a simple way of combining different (non-standard) preprocessors with a non-standard database by, e.g., writing a custom load function and setting this function as a parameter of the preprocessor. This can be done either in the constructor (given that the derived Preprocessor class hands the function over to the base class), or by overwriting the read_original_data function in the configuration file:
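A minimal sketch of such a configuration file; MyPreprocessor and load_with_rate are illustrative names:

```python
# Hypothetical configuration file; MyPreprocessor and load_with_rate are assumed names
import scipy.io.wavfile

def load_with_rate(bio_file, directory, extension):
    # return both the sample rate and the data, as bob.bio.spear needs
    return scipy.io.wavfile.read(bio_file.make_path(directory, extension))

preprocessor = MyPreprocessor()
preprocessor.read_original_data = load_with_rate  # shadows the bound method
```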
However, this should be used only rarely, as most preprocessors and databases use the default functionality. Also, the user must be careful to know which database and preprocessor combination (s)he uses, as not all combinations make sense. Anyways, there is now enough flexibility to combine things that were not combinable before.
BTW: In the preprocessor base implementation, I had to use an instance of bob.db.base.File instead of bob.bio.db.File, as the latter was raising exceptions during the tests. I have yet to figure out why this is the case.
Hi @mguenther and @amohammadi , thanks for the commits.
The solution from Manuel looks cleaner and it fits the needs of everyone.
For bob.bio.spear, I think it is cleaner to have this implementation in a super class in bob.bio.db.
Since spear has a particular input for all the preprocessors, we can create something like the code below.
```python
import numpy
import scipy.io.wavfile

class AudioBioFile(BioFile):
    def __init__(self, f):
        super(AudioBioFile, self).__init__(client_id=f.client_id, path=f.path, file_id=f.id)
        self.__f = f

    def load(self, directory=None, extension='.wav'):
        rate, audio = scipy.io.wavfile.read(self.make_path(directory, extension))
        # We consider there is only 1 channel in the audio file => data[0]
        data = numpy.cast['float'](audio)
        return rate, data
```
I agree. I think, Amir will create a branch in bob.nightlies, so that we can test all the packages together and make sure that they'll work in combination.
Now I see why the tests failed when I used a bob.bio.db.BioFile. The reason is that bob.bio.db.BioFileSet does not derive from bob.bio.db.BioFile. I will change that and push the modifications to bob.bio.db. As they don't affect the current working set, I assume that I can push them to the master branch...
I pushed some modifications to the current branch, which take care of some of Amir's requests here: !33 (closed)
When the new branch in bob.nightlies is established, we can start implementing the modifications for the other bob.bio packages as well. I can take the lead on that, if required.