Commit a4efc976 authored by Zohreh MOSTAANI

[general][doc] added making new database views to the database

parent 33ae2350
Pipeline #24579 passed with stages
in 5 minutes and 47 seconds
......@@ -21,9 +21,9 @@
.. with the BEAT platform. If not, see http://www.gnu.org/licenses/. ..
==========
Databases
==========
===========
Databases
===========
A database is a collection of data files, one for each output of the database.
These data are inputs to the BEAT toolchains. Therefore, it is important to
......@@ -46,7 +46,7 @@ A database has the following structure on disk::
...
outputN_name.data
For a given database, the BEAT platform will typically stores information
For a given database, the BEAT system will typically store information
about the root folder containing this raw data as well as a description of
it.
......@@ -64,13 +64,197 @@ client-specific model, and one for testing these models.
The training dataset may have two outputs: grayscale images as two-dimensional
arrays of type `uint8` and client ids as `uint64` integers.
The BEAT platform is data-driven, which means that all the outputs of a given
The BEAT system is data-driven, which means that all the outputs of a given
dataset are synchronized. The way the data is generated by each template
is defined in a piece of code called the ``database view``. It is important
that a database view has a deterministic behavior for reproducibility
purposes.
Creating new database views in BEAT
-----------------------------------
A ``database view`` is a piece of code that defines how the raw data is fed
to the system according to the defined protocols. Each database view is a class
that inherits from ``beat.backend.python.database.View`` and implements two
methods: ``index()`` and ``get()``. Each database block in an experiment is
assigned to a database view.

The ``index()`` method is only used when the system indexes the raw data, that
is, when it builds the list of available raw data objects. Here is an example
of an ``index()`` method:

.. code-block:: python

   def index(self, root_folder, parameters):
       """
       This function returns a list of (named) tuples describing the data
       provided by the view.

       The ordering of values inside the tuples is free, but it is expected
       that the list is ordered in a consistent manner (i.e. all train images
       of person A, then all train images of person B, ...).

       For instance, assuming a view providing that kind of data:

       ----------- ----------- ----------- ----------- ----------- -----------
       |  image  | |  image  | |  image  | |  image  | |  image  | |  image  |
       ----------- ----------- ----------- ----------- ----------- -----------
       ----------- ----------- ----------- ----------- ----------- -----------
       | file_id | | file_id | | file_id | | file_id | | file_id | | file_id |
       ----------- ----------- ----------- ----------- ----------- -----------
       ----------------------------------- -----------------------------------
       |            client_id            | |            client_id            |
       ----------------------------------- -----------------------------------

       a list like the following should be generated:

       [
           (client_id=1, file_id=1, image=filename1),
           (client_id=1, file_id=2, image=filename2),
           (client_id=1, file_id=3, image=filename3),
           (client_id=2, file_id=4, image=filename4),
           (client_id=2, file_id=5, image=filename5),
           (client_id=2, file_id=6, image=filename6),
           ...
       ]

       DO NOT store images, sound files or data loadable from a file in the
       list! Store the path of the file to load instead.
       """

       Entry = namedtuple('Entry', ['client_id', 'file_id', 'image'])

       # Open the database and load the objects to provide via the outputs
       db = bob.db.atnt.Database()
       objs = sorted(db.objects(groups='world', purposes=None),
                     key=lambda x: (x.client_id, x.id))

       # One entry per image: ids are stored directly, the image as a path
       return [Entry(x.client_id, x.id, x.make_path(root_folder, '.pgm'))
               for x in objs]

The database views available in the BEAT platform use `bob`_ database packages,
which have well-defined protocols. However, defining new database views is not
limited to using such packages.
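
As a minimal sketch of an index built without a ``bob`` database package, the
function below scans the file system directly. The directory layout
(``<root_folder>/<client_id>/<name>.pgm``), the running ``file_id`` counter and
the standalone function form are assumptions made for illustration only; in a
real view this logic would live in the ``index()`` method of the ``View``
subclass.

.. code-block:: python

   import os
   from collections import namedtuple

   Entry = namedtuple('Entry', ['client_id', 'file_id', 'image'])


   def index(root_folder, parameters=None):
       """Index a hypothetical <root_folder>/<client_id>/<name>.pgm layout."""
       # Keep only sub-directories whose name is a numeric client id
       clients = [d for d in os.listdir(root_folder)
                  if d.isdigit() and os.path.isdir(os.path.join(root_folder, d))]

       entries = []
       file_id = 1
       # Sorting makes the ordering deterministic and groups entries by client
       for client in sorted(clients, key=int):
           client_dir = os.path.join(root_folder, client)
           for name in sorted(os.listdir(client_dir)):
               if name.endswith('.pgm'):
                   # Store the path only; the image is loaded later in get()
                   path = os.path.join(client_dir, name)
                   entries.append(Entry(int(client), file_id, path))
                   file_id += 1
       return entries

Sorting both the client directories and the file names is what keeps the result
reproducible, since ``os.listdir`` returns names in arbitrary order.
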

The ``get()`` method is called every time a block fetches raw data from the
database. The data format of the database outputs is defined in this method.
For example:

.. code-block:: python

   def get(self, output, index):
       """
       This function returns the data of the given output for the entry at the
       provided index in the list of (named) tuples built by the index()
       method. The full index is available as ``self.objs``.
       """

       obj = self.objs[index]

       if output == 'client_id':
           return {
               'value': np.uint64(obj.client_id)
           }

       elif output == 'file_id':
           return {
               'value': np.uint64(obj.file_id)
           }

       elif output == 'image':
           # The image file is only opened here, not in index()
           return {
               'value': bob.io.base.load(obj.image)
           }

More information about the implementation of these two methods can be found `here <https://gitlab.idiap.ch/beat/beat.backend.python/blob/master/beat/backend/python/database.py>`_.

The following example shows a complete database view, used for a subset of the
``atnt`` database:

.. code-block:: python

   from collections import namedtuple

   import numpy as np
   import bob.db.atnt
   import bob.io.base
   from beat.backend.python.database import View


   class Train(View):
       """Outputs:
           - image: "system/array_2d_uint8/1"
           - file_id: "system/uint64/1"
           - client_id: "system/uint64/1"

       One "file_id" is associated with a given "image".
       Several "image" are associated with a given "client_id".

       --------------- --------------- --------------- --------------- --------------- ---------------
       |    image    | |    image    | |    image    | |    image    | |    image    | |    image    |
       --------------- --------------- --------------- --------------- --------------- ---------------
       --------------- --------------- --------------- --------------- --------------- ---------------
       |   file_id   | |   file_id   | |   file_id   | |   file_id   | |   file_id   | |   file_id   |
       --------------- --------------- --------------- --------------- --------------- ---------------
       ----------------------------------------------- -----------------------------------------------
       |                  client_id                  | |                  client_id                  |
       ----------------------------------------------- -----------------------------------------------
       """

       def index(self, root_folder, parameters):
           Entry = namedtuple('Entry', ['client_id', 'file_id', 'image'])

           # Open the database and load the objects to provide via the outputs
           db = bob.db.atnt.Database()
           objs = sorted(db.objects(groups='world', purposes=None),
                         key=lambda x: (x.client_id, x.id))

           return [Entry(x.client_id, x.id, x.make_path(root_folder, '.pgm'))
                   for x in objs]

       def get(self, output, index):
           obj = self.objs[index]

           if output == 'client_id':
               return {
                   'value': np.uint64(obj.client_id)
               }

           elif output == 'file_id':
               return {
                   'value': np.uint64(obj.file_id)
               }

           elif output == 'image':
               return {
                   'value': bob.io.base.load(obj.image)
               }

.. note::

   Each view comes with documentation describing how its different outputs are
   synchronised with each other.

In the example above, if there are 10000 images in the dataset, the list
returned by the ``index()`` method has 10000 entries. The BEAT platform uses
this information to split the jobs efficiently across several machines during
the experiment. The list is expected to be in a logical order (here: entries
are grouped by ``client_id``).

For each entry in the dataset (represented as a named tuple), all the necessary
data is provided by ``index()``. For performance reasons, the ``get()`` method
should not need to instantiate ``bob.db.atnt.Database()`` again. The user can
store any information in the index, except under field names that Python named
tuples reject, such as the keyword ``class``. To use such a name for an output,
the user should first map it to a valid field name in a dictionary, before
defining the ``index()`` method:


.. code-block:: python

   def __init__(self):
       super(All, self).__init__()
       self.output_member_map = {'class': 'cls'}

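
The restriction comes from Python itself, as the following sketch shows:
``namedtuple`` raises a ``ValueError`` when a field name is a keyword. The
``field_for`` helper is hypothetical and only illustrates how an
output-to-field dictionary can be consulted; it is not claimed to be the
lookup code used by the platform.

.. code-block:: python

   from collections import namedtuple

   # 'class' is a Python keyword, so namedtuple rejects it as a field name
   try:
       namedtuple('Entry', ['file_id', 'class'])
       keyword_rejected = False
   except ValueError:
       keyword_rejected = True

   # Use a valid field name instead and record the output-to-field mapping
   Entry = namedtuple('Entry', ['file_id', 'cls'])
   output_member_map = {'class': 'cls'}


   def field_for(output):
       """Return the tuple field backing an output (hypothetical helper)."""
       # Outputs without a mapping use a field of the same name
       return output_member_map.get(output, output)


   entry = Entry(file_id=1, cls=3)
   value = getattr(entry, field_for('class'))  # reads entry.cls
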
Some information from the database can be stored directly in the index built by
``index()`` (in the given example: ``client_id`` and ``file_id``). For
information that requires opening a file, only the filename should be stored by
``index()``, and the file should be processed later in ``get()``.
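
This division of work can be sketched with a small stand-in class. It does not
inherit from the real ``View`` class, the ``SketchView`` name is hypothetical,
and the file is read as raw bytes rather than decoded as an image, so it only
illustrates the idea: cheap values are read from the index entry, while the
file is opened only when its output is requested.

.. code-block:: python

   import os
   import tempfile
   from collections import namedtuple

   Entry = namedtuple('Entry', ['client_id', 'file_id', 'image'])


   class SketchView(object):
       """Stand-in for a database view, for illustration only."""

       def __init__(self, objs):
           # The list produced by index(), kept available as self.objs
           self.objs = objs

       def get(self, output, index):
           obj = self.objs[index]
           if output == 'image':
               # The file is opened only now; the index holds just its path
               with open(obj.image, 'rb') as f:
                   return {'value': f.read()}
           # Cheap values come straight from the index entry
           return {'value': getattr(obj, output)}


   # Usage: write a tiny placeholder file, index it by path, then fetch it
   path = os.path.join(tempfile.mkdtemp(), 'sample.pgm')
   with open(path, 'wb') as f:
       f.write(b'P5 1 1 255 \x00')

   view = SketchView([Entry(client_id=1, file_id=1, image=path)])
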

Once the ``database view`` is written, the user must index the database with the command-line tool:

.. code-block:: sh

   beat db index mydatabase/1/myview

Database set templates
----------------------
......@@ -92,3 +276,5 @@ In practice, this reduces the amount of work to integrate new databases and/or
new evaluation protocols into the platform. Besides, at the experiment level,
this allows to re-use a toolchain on a different database, with almost no
configuration changes from the user.
.. include:: links.rst
\ No newline at end of file
......@@ -17,3 +17,4 @@
.. _restructuredtext: http://docutils.sourceforge.net/rst.html
.. _conda: https://conda.io/
.. _beat editor: https://www.idiap.ch/software/beat/docs/beat/docs/new/beat.editor/doc/index.html
.. _bob: https://www.idiap.ch/software/bob/docs/bob/docs/stable/bob/doc/index.html