filelist-based db for bio and pad exper. Added docs

156cdbf8 · Pavel KORSHUNOV · Amir MOHAMMADI · fbc33ead · 156cdbf8 · 156cdbf8
Commit 156cdbf8 authored 7 years ago by Pavel KORSHUNOV Committed by Amir MOHAMMADI 7 years ago
--- a/bob/pad/base/database/PadBioFileDB.py
+++ b/bob/pad/base/database/PadBioFileDB.py
+"""
+Implementation of high-level interfaces for FileList-based databases that can be 
+used by both verification and PAD experiments.
+"""
+
+from bob.pad.base.database import PadFile
+from bob.pad.base.database import FileListPadDatabase
+
+from bob.bio.base.database import BioDatabase
+from bob.bio.base.database.file import BioFile
+
+import bob.io.base
+
+import numpy
+import scipy.io.wavfile
+
+
+class HighPadFile(PadFile):
+    """
+    A simple base class that defines the basic properties of a File object for use in PAD experiments.
+    Replace this class with one tailored to the specific database.
+    """
+
+    def __init__(self, client_id, path, attack_type=None, file_id=None):
+        """**Constructor Documentation**
+
+        Initialize the Voice File object that can read WAV files.
+
+        Parameters:
+
+        For client_id, path, attack_type, and file_id, please refer
+        to :py:class:`bob.pad.base.database.PadFile` constructor
+
+        """
+
+        super(HighPadFile, self).__init__(client_id, path, attack_type, file_id)
+
+    def load(self, directory=None, extension='.wav'):
+        path = self.make_path(directory, extension)
+        # read audio
+        if extension == '.wav':
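+            rate, audio = scipy.io.wavfile.read(path)
+            # The file is assumed to hold a single channel; samples are converted to float64
+            # (``numpy.cast`` is no longer available in NumPy 2.0, so ``astype`` is used instead)
+            return rate, audio.astype(numpy.float64)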
+        elif extension == '.avi':
+            return bob.io.base.load(path)
+
+
+class HighPadDatabase(FileListPadDatabase):
+    def __init__(self,
+                 original_directory="[DB_DATA_DIRECTORY]",
+                 original_extension=".wav",
+                 db_name='',
+                 **kwargs):
+        # call base class constructor
+        from pkg_resources import resource_filename
+        folder = resource_filename(__name__, '../lists/' + db_name)
+        super(HighPadDatabase, self).__init__(folder, db_name, pad_file_class=HighPadFile,
+                                              original_directory=original_directory,
+                                              original_extension=original_extension,
+                                              **kwargs)
+
+
+class HighBioFile(BioFile):
+    def __init__(self, f):
+        """
+        Initializes this File object with a File equivalent from the underlying SQL-based interface for
+        the database. Replace this class with one tailored to the specific database.
+        """
+        super(HighBioFile, self).__init__(client_id=f.client_id, path=f.path, file_id=f.id)
+
+        self.__f = f
+
+    def load(self, directory=None, extension='.wav'):
+        path = self.make_path(directory, extension)
+        if extension == '.wav':
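+            rate, audio = scipy.io.wavfile.read(path)
+            # The file is assumed to hold a single channel; samples are converted to float64
+            # (``numpy.cast`` is no longer available in NumPy 2.0, so ``astype`` is used instead)
+            return rate, audio.astype(numpy.float64)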
+        elif extension == '.avi':
+            return bob.io.base.load(path)
+
+
+class HighBioDatabase(BioDatabase):
+    """
+    Implements the verification API for querying the High database.
+    """
+
+    def __init__(self,
+                 original_directory="[DB_DATA_DIRECTORY]",
+                 original_extension=".wav",
+                 db_name='',
+                 **kwargs):
+        # call base class constructors to open a session to the database
+        super(HighBioDatabase, self).__init__(name=db_name,
+                                              original_directory=original_directory,
+                                              original_extension=original_extension, **kwargs)
+
+        self.__db = HighPadDatabase(db_name=db_name,
+                                    original_directory=original_directory,
+                                    original_extension=original_extension,
+                                    **kwargs)
+
+        self.low_level_group_names = ('train', 'dev', 'eval')
+        self.high_level_group_names = ('world', 'dev', 'eval')
+
+    def model_ids_with_protocol(self, groups=None, protocol=None, **kwargs):
+        groups = self.convert_names_to_lowlevel(groups, self.low_level_group_names, self.high_level_group_names)
+
+        return [client.id for client in self.__db.clients(groups=groups, **kwargs)]
+
+    def objects(self, protocol=None, purposes=None, model_ids=None, groups=None, **kwargs):
+        """
+        Maps the objects method of PAD databases onto the objects method of a verification database.
+
+        :param protocol: To distinguish the two vulnerability scenarios, the protocol name should have either
+        '-licit' or '-spoof' appended to it. For instance, if the DB has a protocol 'general', the name passed to
+        this method should be 'general-licit' to run verification experiments on bona fide data only, or
+        'general-spoof' to run them for the spoof scenario (the probes are attacks).
+
+        :param purposes: This parameter is passed by the ``bob.bio.base`` verification experiment
+
+        :param model_ids: This parameter is passed by the ``bob.bio.base`` verification experiment
+
+        :param groups: We map the groups from ('world', 'dev', 'eval') used in verification experiments to
+        ('train', 'dev', 'eval')
+
+        :param kwargs: The rest of the parameters valid for a given database
+
+        :return: Set of BioFiles that verification experiments expect.
+
+        """
+        # convert group names from the conventional names in verification experiments to the internal database names
+        if groups is None:  # all groups are assumed
+            groups = self.high_level_group_names
+        matched_groups = self.convert_names_to_lowlevel(groups, self.low_level_group_names, self.high_level_group_names)
+
+        # Converting a protocol name with an appended '-licit' or '-spoof' is a hack for verification experiments.
+        # To adapt spoofing databases to verification experiments, we need to split a given protocol
+        # into two scenarios: licit (only real/genuine data is used) and spoof
+        # (attacks are used as probes instead of real data).
+        # Hence, we append '-licit' or '-spoof' to the
+        # protocol name to distinguish these two scenarios.
+        # By default, if nothing is appended, we assume the licit protocol.
+        # The distinction between licit and spoof is expressed via the purposes parameter, but
+        # the difference is in the terminology only.
+
+        # let's check if we have an appendix to the protocol name
+        appendix = None
+        if protocol:
+            appendix = protocol.split('-')[-1]
+
+        # if protocol was empty or there was no correct appendix, we just assume the 'licit' option
+        if not (appendix == 'licit' or appendix == 'spoof'):
+            appendix = 'licit'
+        else:
+            # put back everything except the appendix into the protocol
+            protocol = '-'.join(protocol.split('-')[:-1])
+
+        # if protocol was empty, we set it to None
+        if not protocol:
+            protocol = None
+
+        correct_purposes = purposes
+        # licit protocol is for real access data only
+        if appendix == 'licit':
+            # by default we assume all real data, since this database has no enroll data
+            if purposes is None:
+                correct_purposes = ('real',)
+
+        # spoof protocol uses real data for enrollment and spoofed data for probe
+        # so, probe set is the same as attack set
+        if appendix == 'spoof':
+            # we return attack data only, since this database does not have explicit enroll data
+            if purposes is None:
+                correct_purposes = ('attack',)
+            # otherwise replace 'probe' with 'attack'
+            elif isinstance(purposes, (tuple, list)):
+                correct_purposes = []
+                for purpose in purposes:
+                    if purpose == 'probe':
+                        correct_purposes += ['attack']
+                    else:
+                        correct_purposes += [purpose]
+            elif purposes == 'probe':
+                correct_purposes = ('attack',)
+
+        # now, query the underlying PAD database
+        objects = self.__db.objects(protocol=protocol, groups=matched_groups, purposes=correct_purposes, **kwargs)
+
+        # make sure to return BioFile representation of a file, not the database one
+        return [HighBioFile(f) for f in objects]
+
+    def annotations(self, file):
+        pass
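
The protocol-suffix and purpose handling inside `HighBioDatabase.objects` can be read in isolation. The sketch below restates that logic as two standalone functions so it can be checked without a database. The names `split_protocol` and `map_purposes` are hypothetical and are not part of this commit or of `bob.pad.base`.

```python
def split_protocol(protocol):
    """Split 'name-licit' / 'name-spoof' into (name or None, scenario).

    Hypothetical helper mirroring the suffix handling in the diff above.
    """
    scenario = "licit"  # default when no recognised suffix is present
    parts = protocol.split("-") if protocol else []
    if parts and parts[-1] in ("licit", "spoof"):
        scenario = parts[-1]
        # put back everything except the suffix
        protocol = "-".join(parts[:-1])
    return (protocol or None), scenario


def map_purposes(purposes, scenario):
    """Translate verification purposes into PAD purposes for one scenario."""
    if purposes is None:
        # no explicit enrollment data: real data for licit, attacks for spoof
        return ("real",) if scenario == "licit" else ("attack",)
    if scenario == "licit":
        return purposes
    if isinstance(purposes, (tuple, list)):
        # in the spoof scenario, probes are the attacks
        return ["attack" if p == "probe" else p for p in purposes]
    return ("attack",) if purposes == "probe" else purposes


name, scenario = split_protocol("general-spoof")
pad_purposes = map_purposes(["enroll", "probe"], scenario)
```

Keeping the two steps separate makes the default explicit: a protocol name without a recognised suffix is treated as licit and passed through unchanged.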
--- a/bob/pad/base/database/__init__.py
+++ b/bob/pad/base/database/__init__.py
@@ -2,7 +2,7 @@ from .file import PadFile
 from .database import PadDatabase
 from .filelist.query import FileListPadDatabase
 from .filelist.models import Client
-from . import filelist
+from .PadBioFileDB import HighBioDatabase, HighPadDatabase

 # gets sphinx autodoc done right - don't remove it
 def __appropriate__(*args):
@@ -25,5 +25,7 @@ __appropriate__(
    PadDatabase,
    FileListPadDatabase,
    Client,
+    HighBioDatabase,
+    HighPadDatabase
 )
 __all__ = [_ for _ in dir() if not _.startswith('_')]
--- a/doc/experiments.rst
+++ b/doc/experiments.rst
@@ -6,9 +6,9 @@
 .. _bob.pad.base.experiments:


-=================================================
-Running Presentation Attack Detection Experiments
-=================================================
+===================================================
+ Running Presentation Attack Detection Experiments
+===================================================

 Now, you are almost ready to run a presentation attack detection (PAD) experiment.


--- a/doc/extra-intersphinx.txt
+++ b/doc/extra-intersphinx.txt
 python
 numpy
-bob.bio.spear
 gridtk
+bob.io.base
 bob.db.base
 bob.db.avspoof
 bob.bio.base
 bob.pad.voice
-bob.db.voicepa
\ No newline at end of file
+bob.db.voicepa
+bob.bio.spear
+bob.bio.face
+bob.pad.face
\ No newline at end of file
--- a/doc/filedb_guide.rst
+++ b/doc/filedb_guide.rst
+.. vim: set fileencoding=utf-8 :
+.. @author: Manuel Guenther <manuel.guenther@idiap.ch>
+.. author: Pavel Korshunov <pavel.korshunov@idiap.ch>
+.. date: Wed Apr 27 14:58:21 CEST 2016
+
+====================================
+ User's Guide for PAD File List API
+====================================
+
+The low-level Database Interface
+--------------------------------
+
+The :py:class:`bob.pad.base.database.FileListPadDatabase` complies with the standard PAD database interface described in :ref:`bob.pad.base`.
+All functions defined in that interface become usable as soon as the user provides the required file lists.
+
+Creating File Lists
+-------------------
+
+The first step in using this package is to provide file lists specifying the ``'train'`` (training), ``'dev'`` (development) and ``'eval'`` (evaluation) sets to be used by the PAD algorithm.
+The complete structure of the list base directory (here denoted as ``basedir``) that contains all the file lists should look like this::
+
+  basedir -- train -- for_real.lst
+         |       |-- for_attack.lst
+         |
+         |-- dev -- for_real.lst
+         |      |-- for_attack.lst
+         |
+         |-- eval -- for_real.lst
+                 |-- for_attack.lst
+
+
+The file lists should contain the following information for PAD experiments to run properly:
+
+* ``filename``: The name of the data file, **relative** to the common root of all data files, and **without** file name extension.
+* ``client_id``: The name or ID of the subject whose biometric traces are contained in the data file.
+  These names are handled as :py:class:`str` objects, so ``001`` is different from ``1``.
+* ``attack_type``: The type of attack (a :py:class:`str` object).
+  It is present only in ``for_attack.lst`` files, not in ``for_real.lst`` files.
+
+
+The following list files need to be created:
+
+- **For real**:
+
+  * *real file*, with default name ``for_real.lst``, in the default sub-directories ``train``, ``dev`` and ``eval``, respectively.
+    It is a 2-column file with format:
+
+    .. code-block:: text
+
+      filename client_id
+
+- **For attack**:
+
+  * *attack file*, with default name ``for_attack.lst``, in the default sub-directories ``train``, ``dev`` and ``eval``, respectively.
+    It is a 3-column file with format:
+
+    .. code-block:: text
+
+      filename client_id attack_type
+
+
+.. note:: If the database does not provide an evaluation set, the ``eval`` files can be omitted.
+
+
+Protocols and File Lists
+------------------------
+
+When you instantiate a database, you have to specify the base directory that contains the file lists.
+If you have only a single protocol, you can specify the full path to the file lists described above:
+
+.. code-block:: python
+
+  >>> import bob.pad.base.database
+  >>> db = bob.pad.base.database.FileListPadDatabase('basedir/protocol')
+
+Then query the data **without** specifying a protocol:
+
+.. code-block:: python
+
+  >>> db.objects()
+
+Alternatively, if you have several protocols, you can do the following:
+
+.. code-block:: python
+
+  >>> db = bob.pad.base.database.FileListPadDatabase('basedir')
+  >>> db.objects(protocol='protocol')
+
+When a protocol is specified, its name is appended to the base directory that contains the file lists.
+This allows several protocols stored in the same base directory to be used without instantiating a new database.
+For instance, given two protocols 'P1' and 'P2' (with file lists contained in 'basedir/P1' and 'basedir/P2', respectively), the following would work:
+
+.. code-block:: python
+
+  >>> db = bob.pad.base.database.FileListPadDatabase('basedir')
+  >>> db.objects(protocol='P1') # Get the objects for the protocol P1
+  >>> db.objects(protocol='P2') # Get the objects for the protocol P2
+
+
+The high-level Database Interface
+---------------------------------
+
+The low-level FileList database interface is extended so that filelist databases can be used to run both types of experiments:
+vulnerability analysis experiments using the :ref:`bob.bio.base <bob.bio.base>` verification framework
+and PAD experiments using the ``bob.pad.base`` framework.
+
+For instance, provided that the file lists for a database ``example_db`` are in the correct format and located
+inside the ``lists`` directory (i.e., inside ``lists/example_db``), the PAD and verification versions of the
+database can be created as follows:
+
+.. code-block:: python
+
+  >>> from bob.pad.base.database import HighBioDatabase, HighPadDatabase
+  >>> pad_db = HighPadDatabase(db_name='example_db')
+  >>> bio_db = HighBioDatabase(db_name='example_db')
+
+
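
The list format described in `doc/filedb_guide.rst` (two columns in `for_real.lst`, three in `for_attack.lst`) can be illustrated with a small parser. This is a minimal sketch, assuming whitespace-separated columns and that blank lines may be skipped. `ListEntry`, `parse_list_lines` and the sample paths are hypothetical, and this is not the parser that `FileListPadDatabase` itself uses.

```python
from collections import namedtuple

# One row of a file list; attack_type stays None for real-access files.
ListEntry = namedtuple("ListEntry", ["filename", "client_id", "attack_type"])


def parse_list_lines(lines, is_attack):
    """Parse rows of a for_real.lst (2 columns) or for_attack.lst (3 columns)."""
    expected = 3 if is_attack else 2
    entries = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:  # skip blank lines (an assumption of this sketch)
            continue
        if len(fields) != expected:
            raise ValueError(
                "line %d: expected %d columns, got %d"
                % (number, expected, len(fields))
            )
        attack_type = fields[2] if is_attack else None
        # client_id is kept as a string, so '001' stays distinct from '1'
        entries.append(ListEntry(fields[0], fields[1], attack_type))
    return entries


real = parse_list_lines(["real/001_session1 001"], is_attack=False)
attack = parse_list_lines(["attack/001_replay1 001 replay"], is_attack=True)
```

Rejecting rows with the wrong column count surfaces a `for_attack.lst` that is missing its `attack_type` column early, rather than at experiment time.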
+
--- a/doc/high_level_db_interface_guide.rst
+++ b/doc/high_level_db_interface_guide.rst
@@ -3,9 +3,9 @@
 .. @date:   May 2017


-=============================================
-High Level Database Interface How-To Guide
-=============================================
+============================================
+ High Level Database Interface How-To Guide
+============================================

 The *high level database interface* (HLDI) is needed to run biometric experiments using non-filelist databases (e.g. if one wants to use SQL-based database package).

@@ -36,7 +36,7 @@ the ``bob.pad.face`` framework are: ``ReplayPadFile`` and
 ``ReplayPadDatabase``.

 Implementation of the ``*File`` class
---------------------------------------------------
+-------------------------------------

 First of all, the ``*File`` class must inherit from the **base file
 class** of the corresponding biometric framework. An example:
@@ -106,7 +106,7 @@ type of input. With this, we are done configuring the high level
 implementation of the ``*File`` class.

 Implementation of the ``*Database`` class
---------------------------------------------------
+-----------------------------------------

 The second unit to be implemented in HLDI is the ``*Database`` class.
 First of all the ``*Database`` class must inherit from the **base

--- a/doc/implementation.rst
+++ b/doc/implementation.rst
@@ -3,9 +3,9 @@
 .. author: Pavel Korshunov <pavel.korshunov@idiap.ch>
 .. date: Wed Apr 27 14:58:21 CEST 2016

-======================
-Implementation Details
-======================
+========================
+ Implementation Details
+========================

 The ``bob.pad`` set of modules are specifically designed to be as flexible as possible while trying to keep things simple.
 Therefore, python is used to implement tools such as preprocessors, feature extractors and the PAD algorithms.

--- a/doc/implemented.rst
+++ b/doc/implemented.rst
 .. _bob.pad.base.implemented:

-=================================
-Tools implemented in bob.pad.base
-=================================
+===================================
+ Tools implemented in bob.pad.base
+===================================

 Please note that some parts of the code in this package depend on, and are reused from, the :ref:`bob.bio.base <bob.bio.base>` package.


--- a/doc/index.rst
+++ b/doc/index.rst
@@ -5,9 +5,9 @@

 .. _bob.pad.base:

-=================================================
-Running Presentation Attack Detection Experiments
-=================================================
+===================================================
+ Running Presentation Attack Detection Experiments
+===================================================

 The ``bob.pad`` packages provide open source tools to run comparable and reproducible presentation attack detection (PAD) experiments.
 To design such experiment, one has to choose:
@@ -28,20 +28,21 @@ But it is also possible to use your own database, preprocessor, feature extracto
    The ``bob.pad.*`` packages are derived from the `bob.bio.* <http://pypi.python.org/pypi/bob.bio.base>`__, packages that are designed for biometric recognition experiments.

 This package :py:mod:`bob.pad.base` includes the basic definition of a PAD experiment, as well as a generic script, which can execute the full experiment in a single command line.
-Changing the employed tolls such as the database, protocol, preprocessor, feature extractor or a PAD algorithm is as simple as changing a command line parameter.
+Changing the employed tools such as the database, protocol, preprocessor, feature extractor or a PAD algorithm is as simple as changing a command line parameter.

 The implementation of (most of) the tools is separated into other packages in the ``bob.pad`` namespace.
 All these packages can be easily combined.
 Here is a growing list of derived packages:

 * `bob.pad.voice <http://pypi.python.org/pypi/bob.pad.voice>`__ Tools to run presentation attack detection experiments for speech, including several Cepstral-based features and LBP-based feature extraction, GMM-based and logistic regression based algorithms, as well as plot and score fusion scripts.
+* `bob.pad.face <http://pypi.python.org/pypi/bob.pad.face>`__ Tools to run presentation attack detection experiments for face, including face-related feature extraction, GMM, SVM, and logistic regression based algorithms, as well as plotting scripts.

 If you are interested, please continue reading:


-===========
-Users Guide
-===========
+=============
+ Users Guide
+=============

 .. toctree::
    :maxdepth: 2
@@ -50,10 +51,12 @@ Users Guide
    experiments
    implementation
    high_level_db_interface_guide
+    filedb_guide

-================
-Reference Manual
-================
+
+==================
+ Reference Manual
+==================

 .. toctree::
    :maxdepth: 2
@@ -62,9 +65,9 @@ Reference Manual
    py_api


-==================
-Indices and tables
-==================
+====================
+ Indices and tables
+====================

 * :ref:`genindex`
 * :ref:`modindex`