.. vim: set fileencoding=utf-8 :

.. Copyright (c) 2016 Idiap Research Institute, http://www.idiap.ch/          ..
.. Contact: beat.support@idiap.ch                                             ..
..                                                                            ..
.. This file is part of the beat.core module of the BEAT platform.            ..
..                                                                            ..
.. Commercial License Usage                                                   ..
.. Licensees holding valid commercial BEAT licenses may use this file in      ..
.. accordance with the terms contained in a written agreement between you     ..
.. and Idiap. For further information contact tto@idiap.ch                    ..
..                                                                            ..
.. Alternatively, this file may be used under the terms of the GNU Affero     ..
.. Public License version 3 as published by the Free Software and appearing   ..
.. in the file LICENSE.AGPL included in the packaging of this file.           ..
.. The BEAT platform is distributed in the hope that it will be useful, but   ..
.. WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY ..
.. or FITNESS FOR A PARTICULAR PURPOSE.                                       ..
..                                                                            ..
.. You should have received a copy of the GNU Affero Public License along     ..
.. with the BEAT platform. If not, see http://www.gnu.org/licenses/.          ..


===========
 Databases
===========

A database is a collection of data files, one for each output of the database.
These data are the inputs to the BEAT toolchains. It is therefore important to
define evaluation protocols, which describe how a specific system must use the
raw data of a given database.

For instance, a recognition system will typically use a subset of the data to
train a recognition `model`, while another subset of data will be used to
evaluate the performance of this model.


Structure of a database
=======================

A database has the following structure on disk::

    database_name/
        output1_name.data
        output2_name.data
        ...
        outputN_name.data

For a given database, BEAT typically stores the path to the root folder
containing this raw data, together with a description of the database.

.. _beat-system-databases-protocols:


Evaluation protocols
====================

A BEAT evaluation protocol consists of several ``datasets``, each dataset
having several ``outputs`` with well-defined data formats. In practice,
each dataset will typically be used for a different purpose.

For instance, in the case of a simple face recognition protocol, the
database may be split into three datasets: one for training, one for enrolling
client-specific models, and one for testing these models.
The training dataset may have two outputs: grayscale images as two-dimensional
arrays of type `uint8` and client ids as `uint64` integers.

BEAT is data-driven, which means that all the outputs of a given
dataset are synchronized. The way the data is generated by each template
is defined in a piece of code called the ``database view``. It is important
that a database view has a deterministic behavior for reproducibility
purposes.

Like the other building blocks in BEAT, a database consists of two main components: a JSON declaration and source code (the ``database view``, written in Python). Each component is described below.


.. _beat-system-databases-protocols-json:

JSON declaration
----------------

Each database has a JSON_ declaration. This file describes the protocols, the datasets included in each protocol, the ``database view`` used by each dataset, and more. Here is an example of the JSON_ declaration file for the ``atnt`` database, which has a single protocol named "idiap". This protocol is used for a simple face recognition system and has three datasets: "train", "templates", and "probes".

.. code-block:: javascript

	{
	    "description": "The AT&T Database of Faces",
	    "protocols": [
	        {
	            "name": "idiap",
	            "sets": [
	                {
	                    "name": "train",
	                    "outputs": {
	                        "client_id": "system/uint64/1",
	                        "file_id": "system/uint64/1",
	                        "image": "system/array_2d_uint8/1"
	                    },
	                    "parameters": {},
	                    "template": "train",
	                    "view": "Train"
	                },
	                {
	                    "name": "templates",
	                    "outputs": {
	                        "client_id": "system/uint64/1",
	                        "file_id": "system/uint64/1",
	                        "image": "system/array_2d_uint8/1",
	                        "template_id": "system/uint64/1"
	                    },
	                    "parameters": {},
	                    "template": "templates",
	                    "view": "Templates"
	                },
	                {
	                    "name": "probes",
	                    "outputs": {
	                        "client_id": "system/uint64/1",
	                        "file_id": "system/uint64/1",
	                        "image": "system/array_2d_uint8/1",
	                        "probe_id": "system/uint64/1",
	                        "template_ids": "system/array_1d_uint64/1"
	                    },
	                    "parameters": {},
	                    "template": "probes",
	                    "view": "Probes"
	                }
	            ],
	            "template": "simple_face_recognition"
	        }
	    ],
	    "root_folder": "/path_to_db_folder/att_faces"
	}

The JSON_ file for a database has three main fields:

* **description:** A short description of the database.
* **protocols:** A list of protocols defined for the database.
* **root_folder:** The path to the directory where the data is stored.

The "protocols" field is where the datasets for each protocol are defined. In the example above, only one protocol is defined. Implementing a new protocol means adding a new entry to the list of protocols. Each protocol has three main components:

* **name:** The name of the protocol, which is "idiap" in this case.
* **sets:** The datasets included in this protocol. In this case, the "idiap" protocol consists of three datasets: "train", "templates", and "probes".
* **template:** A template describes the number of sets and the set template used for each set. Different protocols can use the same template, which means they can be used in any application that accepts that structure. However, each set may use a different ``database view``, which is what ultimately distinguishes the protocols.

Each set in the "sets" list above is a dataset used for a particular purpose. For example, in simple face recognition, the "train" dataset is used for training a model, "templates" is used for building a template for each identity, and "probes" is used to measure the performance of the system. Each set has the following components:

* **name:** The name of the set.
* **outputs:** The outputs provided by the set. Each output has a name and a specific data format, which must be taken into account when using the data.
* **parameters:** Extra parameters that are passed to the ``index()`` method of a ``database view`` and can be used to further specify the data fed to the system. For example, two datasets can use the same ``database view`` with a different parameter value (e.g. "group": "train"), so that only the data in that group is available in the outputs of the dataset.
* **template:** The set template, which defines the number of outputs and their names.
* **view:** The ``database view`` used to provide the data samples to the system. More information about the implementation of a ``database view`` is given in :ref:`beat-system-databases-protocols-view`.
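As an illustration of this structure, a declaration can be inspected with Python's standard ``json`` module. The sketch below is not part of the BEAT API: the helper name ``summarize_declaration`` is an assumption made for this example, and the declaration is an abbreviated copy of the one above (one set only).

.. code-block:: python

    import json

    # Abbreviated copy of the declaration shown above, used as sample input
    DECLARATION = """
    {
        "description": "The AT&T Database of Faces",
        "protocols": [
            {
                "name": "idiap",
                "sets": [
                    {
                        "name": "train",
                        "outputs": {
                            "client_id": "system/uint64/1",
                            "file_id": "system/uint64/1",
                            "image": "system/array_2d_uint8/1"
                        },
                        "parameters": {},
                        "template": "train",
                        "view": "Train"
                    }
                ],
                "template": "simple_face_recognition"
            }
        ],
        "root_folder": "/path_to_db_folder/att_faces"
    }
    """


    def summarize_declaration(text):
        """Return (protocol, set, view, output names) tuples (hypothetical helper)."""
        declaration = json.loads(text)
        # Check the three top-level fields described above
        for field in ("description", "protocols", "root_folder"):
            if field not in declaration:
                raise KeyError("missing field: " + field)
        summary = []
        for protocol in declaration["protocols"]:
            for dataset in protocol["sets"]:
                summary.append((protocol["name"], dataset["name"],
                                dataset["view"], sorted(dataset["outputs"])))
        return summary


    summary = summarize_declaration(DECLARATION)

Each tuple in ``summary`` names one dataset together with the view that serves it, which is the information needed to find the matching Python class.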

.. _beat-system-databases-protocols-view:


Database View
-------------

A ``database view`` is a piece of code that defines how the raw data should be fed
to the system based on the defined protocols. Each database view is a class that
inherits from ``beat.backend.python.database.View`` and implements two methods:
``index()`` and ``get()``. Each database block in an experiment is assigned
a database view.


The ``index()`` method is only used when the system is indexing the raw data. This means
that the system makes a list of available raw data objects. Here is an example
of an ``index()`` method:

.. code-block:: python

    def index(self, root_folder, parameters):
        """
        This function returns a list of (named) tuples describing the data
        provided by the view. The ordering of values inside the tuples is free,
        but it is expected that the list is ordered in a consistent manner
        (i.e. all train images of person A, then all train images of person B, ...).

        For instance, assuming a view providing that kind of data:

        ----------- ----------- ----------- ----------- ----------- -----------
        |  image  | |  image  | |  image  | |  image  | |  image  | |  image  |
        ----------- ----------- ----------- ----------- ----------- -----------
        ----------- ----------- ----------- ----------- ----------- -----------
        | file_id | | file_id | | file_id | | file_id | | file_id | | file_id |
        ----------- ----------- ----------- ----------- ----------- -----------
        ----------------------------------- -----------------------------------
        |             client_id           | |             client_id           |
        ----------------------------------- -----------------------------------

        a list like the following should be generated:

        [
            (client_id=1, file_id=1, image=filename1),
            (client_id=1, file_id=2, image=filename2),
            (client_id=1, file_id=3, image=filename3),
            (client_id=2, file_id=4, image=filename4),
            (client_id=2, file_id=5, image=filename5),
            (client_id=2, file_id=6, image=filename6),
            ...
        ]

        DO NOT store images, sound files or data loadable from a file in the list!
        Store the path of the file to load instead.
        """
        Entry = namedtuple('Entry', ['client_id', 'file_id', 'image'])

        # Open the database and load the objects to provide via the outputs
        db = bob.db.atnt.Database()
        objs = sorted(db.objects(groups='world', purposes=None),
                      key=lambda x: (x.client_id, x.id))

        return [Entry(x.client_id, x.id, x.make_path(root_folder, '.pgm')) for x in objs]

The database views available in the BEAT platform use `bob`_ database packages,
which have well-defined protocols and datasets (e.g. train/dev/test). For more information, see `database interfaces`_. Some examples:

   * https://pypi.python.org/pypi/bob.db.atvskeystroke
   * https://pypi.python.org/pypi/bob.db.gbu
   * https://pypi.python.org/pypi/bob.db.mobio


However, new database views are not limited to using such packages.
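As a sketch of a view that does not depend on a `bob`_ package, an index can be built by scanning the root folder directly. The folder layout ``<root_folder>/<group>/<client_id>/<name>.pgm``, the ``group`` parameter, and the function name ``index_from_folder`` are all assumptions made for this illustration; they are not part of BEAT or of the AT&T database.

.. code-block:: python

    import os
    import tempfile
    from collections import namedtuple

    Entry = namedtuple("Entry", ["client_id", "file_id", "image"])


    def index_from_folder(root_folder, parameters):
        """Build an index from a hypothetical <group>/<client_id>/<name>.pgm layout."""
        group_folder = os.path.join(root_folder, parameters["group"])
        entries = []
        file_id = 0
        # Sort clients numerically and files by name so the result is deterministic
        for client in sorted(os.listdir(group_folder), key=int):
            client_folder = os.path.join(group_folder, client)
            for name in sorted(os.listdir(client_folder)):
                if name.endswith(".pgm"):
                    file_id += 1
                    # Store only the path; the file itself would be loaded in get()
                    entries.append(Entry(int(client), file_id,
                                         os.path.join(client_folder, name)))
        return entries


    # Usage on a throw-away folder containing three empty files
    with tempfile.TemporaryDirectory() as root:
        for client, name in [("2", "a.pgm"), ("1", "b.pgm"), ("1", "a.pgm")]:
            folder = os.path.join(root, "train", client)
            os.makedirs(folder, exist_ok=True)
            open(os.path.join(folder, name), "wb").close()
        entries = index_from_folder(root, {"group": "train"})

    ids = [(e.client_id, e.file_id, os.path.basename(e.image)) for e in entries]

The body of this function could serve as the body of an ``index()`` method. Sorting at both levels keeps the entries grouped by ``client_id`` and makes repeated runs return the same list.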


The ``get()`` method is used every time a block fetches raw data from the database.
The data format of each database output is defined in this method. For example:

.. code-block:: python

    def get(self, output, index):
        """
        This function returns the data at the provided index for the given
        output, using the list of (named) tuples built by the index() method.
        The full index is available as ``self.objs``.
        """

        obj = self.objs[index]

        if output == 'client_id':
            return {
                'value': np.uint64(obj.client_id)
            }

        elif output == 'file_id':
            return {
                'value': np.uint64(obj.file_id)
            }

        elif output == 'image':
            return {
                'value': bob.io.base.load(obj.image)
            }

For more details about the underlying source code of these two methods, see the `beat.backend.python database module <https://gitlab.idiap.ch/beat/beat.backend.python/blob/master/beat/backend/python/database.py>`_.
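To see how the two methods work together, the sketch below uses a small stand-in class instead of the real ``beat.backend.python.database.View`` base class, and plain Python values instead of NumPy types and image files. The class ``ToyView`` and the driver loop are assumptions made for illustration; the actual platform code differs in its details.

.. code-block:: python

    from collections import namedtuple


    class ToyView:
        """Stand-in for a database view; not the real BEAT base class."""

        def index(self, root_folder, parameters):
            Entry = namedtuple("Entry", ["client_id", "file_id"])
            # A fixed, ordered list keeps the view deterministic
            return [Entry(1, 1), Entry(1, 2), Entry(2, 3)]

        def get(self, output, index):
            obj = self.objs[index]
            # Each output is returned as a dictionary with a 'value' key
            return {"value": getattr(obj, output)}


    view = ToyView()
    # Indexing happens once; the result is then available as ``view.objs``
    view.objs = view.index("/path_to_db_folder", {})

    # Fetching data: one get() call per output and per index
    rows = [
        (view.get("client_id", i)["value"], view.get("file_id", i)["value"])
        for i in range(len(view.objs))
    ]

The point of the split is that ``index()`` runs once and produces lightweight entries, while ``get()`` is called repeatedly and only needs the entry stored at the requested position.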



The following example shows a database view used by a subset of the ``atnt`` database:

.. code-block:: python

    from collections import namedtuple

    import numpy as np

    import bob.db.atnt
    import bob.io.base
    from beat.backend.python.database import View


    class Train(View):
        """Outputs:
            - image: "system/array_2d_uint8/1"
            - file_id: "system/uint64/1"
            - client_id: "system/uint64/1"

        One "file_id" is associated with a given "image".
        Several "image" are associated with a given "client_id".

        --------------- --------------- --------------- --------------- --------------- ---------------
        |    image    | |    image    | |    image    | |    image    | |    image    | |    image    |
        --------------- --------------- --------------- --------------- --------------- ---------------
        --------------- --------------- --------------- --------------- --------------- ---------------
        |   file_id   | |   file_id   | |   file_id   | |   file_id   | |   file_id   | |   file_id   |
        --------------- --------------- --------------- --------------- --------------- ---------------
        ----------------------------------------------- -----------------------------------------------
        |                   client_id                 | |                   client_id                 |
        ----------------------------------------------- -----------------------------------------------
        """

        def index(self, root_folder, parameters):
            Entry = namedtuple('Entry', ['client_id', 'file_id', 'image'])

            # Open the database and load the objects to provide via the outputs
            db = bob.db.atnt.Database()
            objs = sorted(db.objects(groups='world', purposes=None),
                          key=lambda x: (x.client_id, x.id))

            # Store only the file path; the image is loaded later in get()
            return [Entry(x.client_id, x.id, x.make_path(root_folder, '.pgm')) for x in objs]

        def get(self, output, index):
            # The list returned by index() is available as self.objs
            obj = self.objs[index]

            if output == 'client_id':
                return {
                    'value': np.uint64(obj.client_id)
                }

            elif output == 'file_id':
                return {
                    'value': np.uint64(obj.file_id)
                }

            elif output == 'image':
                return {
                    'value': bob.io.base.load(obj.image)
                }



.. note::

    Each view comes with documentation describing how its different outputs are synchronized.


In the example above, if there are 10000 images in the dataset, there will be 10000 entries in the list
returned from the ``index()`` method. The BEAT platform uses this information to efficiently split
the jobs across several machines during the experiment. The list is expected
to be in a logical order (here, entries are grouped by ``client_id``).
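The exact way the platform distributes the work is not described here. The following sketch only illustrates the idea of cutting an index of ``N`` entries into contiguous ranges, one per job; the function name ``split_index`` and the even-split policy are assumptions made for illustration.

.. code-block:: python

    def split_index(n_entries, n_jobs):
        """Return (start, end) index ranges, end inclusive, one per non-empty job."""
        base, extra = divmod(n_entries, n_jobs)
        ranges = []
        start = 0
        for job in range(n_jobs):
            # The first `extra` jobs receive one additional entry
            size = base + (1 if job < extra else 0)
            if size == 0:
                continue
            ranges.append((start, start + size - 1))
            start += size
        return ranges


    ranges = split_index(10000, 3)

Because each range is contiguous, an index that is grouped by ``client_id`` keeps most entries of a given client within the same job.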


For each entry in the dataset (represented as a named tuple), all the necessary data is
provided by ``index()``. For performance reasons, the ``get()`` method is expected not to instantiate ``bob.db.atnt.Database()`` again. The user can put any information in the index, except under names that Python does not accept as named tuple fields, such as the keyword ``class``. To use such a name, the user should add it to a dictionary before defining the ``index()`` method.

.. code-block:: python

    def __init__(self):
        super().__init__()
        self.output_member_map = {'class': 'cls'}

Some information from the database can be stored directly in the index
(in the given example, ``client_id`` and ``file_id``). For other data that
require opening a file, only the filename should be stored by ``index()``, and the file
should be processed later in ``get()``.
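The restriction on field names mentioned above comes from Python itself: ``collections.namedtuple`` rejects keywords such as ``class``. The short check below shows this behaviour and why an alternative name such as ``cls`` (the value used in the ``output_member_map`` example) is needed. How BEAT applies the mapping internally is not shown here.

.. code-block:: python

    from collections import namedtuple

    try:
        namedtuple("Entry", ["class", "image"])
        keyword_accepted = True
    except ValueError:
        # Python refuses keywords as named tuple field names
        keyword_accepted = False

    # Using a non-keyword alternative works as usual
    Entry = namedtuple("Entry", ["cls", "image"])
    entry = Entry(cls=3, image="path/to/file.pgm")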

Once the ``database view`` is written, the user must index the database with the command-line tool:

.. code-block:: sh

	beat database index mydatabase/1

The user can index the content of a protocol::

	$ beat database index mydatabase/1/protocolname

Or the content of a set in each protocol::

	$ beat database index mydatabase/1/protocolname/setname


Database set templates
----------------------

In practice, different databases used for the same purpose may have the exact
same datasets with the exact same outputs (and attached data formats). In this
case, it is useful to abstract the definition of the database sets from
a given database. BEAT defines ``database set templates`` for this purpose.

For instance, the simple face recognition evaluation protocol described above,
which consists of three datasets and a few inputs, may be abstracted in a
database set template. This template defines the datasets, their outputs,
and their corresponding data formats. Then, if several databases
implement such a protocol, they may rely on the same ``database set template``.
Similarly, evaluation protocols testing different conditions (such as
enrolling on clean and testing on clean data vs. enrolling on clean and
testing on noisy data) may rely on the same database set template.

In practice, this reduces the amount of work needed to integrate new databases and/or
new evaluation protocols into the platform. Moreover, at the experiment level,
it allows a toolchain to be reused on a different database with almost no
configuration changes from the user.
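The idea can be sketched as a check that the sets of a protocol match a template. The dictionary layout used for the template, the helper ``conforms_to``, and the view name ``TrainNoisy`` are hypothetical; they only mirror the JSON declaration shown earlier and are not the format BEAT uses to store templates.

.. code-block:: python

    TRAIN_OUTPUTS = {
        "client_id": "system/uint64/1",
        "file_id": "system/uint64/1",
        "image": "system/array_2d_uint8/1",
    }

    # Hypothetical template: set name -> {output name: data format}
    TEMPLATE = {"train": TRAIN_OUTPUTS}


    def conforms_to(protocol, template):
        """Return True if the protocol provides exactly the template's sets and outputs."""
        provided = {s["name"]: s["outputs"] for s in protocol["sets"]}
        return provided == template


    # Two protocols that differ only in the view serving the data
    clean = {"sets": [{"name": "train", "outputs": dict(TRAIN_OUTPUTS), "view": "Train"}]}
    noisy = {"sets": [{"name": "train", "outputs": dict(TRAIN_OUTPUTS), "view": "TrainNoisy"}]}
    # A protocol missing one output does not match the template
    broken = {"sets": [{"name": "train", "outputs": {"image": "system/array_2d_uint8/1"},
                        "view": "Train"}]}

    results = [conforms_to(p, TEMPLATE) for p in (clean, noisy, broken)]

The view is deliberately ignored by the comparison: two protocols can share a template while serving different data through different views.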

.. include:: links.rst