Merge branch 'fix_doc' into '1.5.x'

Documentation for creating new database views in beat

See merge request !15

Commit 67abe074 authored 6 years ago by André Anjos
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -40,6 +40,163 @@ for scientific reports.
 This package defines a backend to execute algorithms written in the Python
 programming language.

+Creating new database views in beat
+===================================
+
+To implement a view, one needs to write a class that inherits from
+``beat.backend.python.database.View``, and implement two methods: ``index()`` and ``get()``.
+
+Here is the `documentation <https://gitlab.idiap.ch/beat/beat.backend.python/blob/master/beat/backend/python/database.py>`_ of those methods:
+
+The ``index()`` function:
+
+.. code-block:: python
+
+    def index(self, root_folder, parameters):
+        """Returns a list of (named) tuples describing the data provided by the view.
+
+        The ordering of values inside the tuples is free, but it is expected
+        that the list is ordered in a consistent manner (ie. all train images of
+        person A, then all train images of person B, ...).
+
+        For instance, assuming a view providing that kind of data:
+
+        ----------- ----------- ----------- ----------- ----------- -----------
+        |  image  | |  image  | |  image  | |  image  | |  image  | |  image  |
+        ----------- ----------- ----------- ----------- ----------- -----------
+        ----------- ----------- ----------- ----------- ----------- -----------
+        | file_id | | file_id | | file_id | | file_id | | file_id | | file_id |
+        ----------- ----------- ----------- ----------- ----------- -----------
+        ----------------------------------- -----------------------------------
+        |             client_id           | |             client_id           |
+        ----------------------------------- -----------------------------------
+
+        a list like the following should be generated:
+
+        [
+            (client_id=1, file_id=1, image=filename1),
+            (client_id=1, file_id=2, image=filename2),
+            (client_id=1, file_id=3, image=filename3),
+            (client_id=2, file_id=4, image=filename4),
+            (client_id=2, file_id=5, image=filename5),
+            (client_id=2, file_id=6, image=filename6),
+            ...
+        ]
+
+        DO NOT store images, sound files or data loadable from a file in the list!
+        Store the path of the file to load instead.
+        """
+
+The ``get()`` function:
+
+.. code-block:: python
+
+    def get(self, output, index):
+        """Returns the data of the provided output at the provided index in the list
+        of (named) tuples describing the data provided by the view (accessible at
+        self.objs)"""
+
+
+
+Taking the ``atnt/5`` database as an example, the view named ``Train`` is implemented as follows
+(note that each view comes with documentation describing how its outputs are synchronised with each other):
+
+.. code-block:: python
+
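+	from collections import namedtuple
+
+	import numpy as np
+	import bob.db.atnt
+	import bob.io.base
+	from beat.backend.python.database import View
+
+
+	class Train(View):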
+	    """Outputs:
+	        - image: "{{ system_user.username }}/array_2d_uint8/1"
+	        - file_id: "{{ system_user.username }}/uint64/1"
+	        - client_id: "{{ system_user.username }}/uint64/1"
+
+	    One "file_id" is associated with a given "image".
+	    Several "image" are associated with a given "client_id".
+
+	    --------------- --------------- --------------- --------------- --------------- ---------------
+	    |    image    | |    image    | |    image    | |    image    | |    image    | |    image    |
+	    --------------- --------------- --------------- --------------- --------------- ---------------
+	    --------------- --------------- --------------- --------------- --------------- ---------------
+	    |   file_id   | |   file_id   | |   file_id   | |   file_id   | |   file_id   | |   file_id   |
+	    --------------- --------------- --------------- --------------- --------------- ---------------
+	    ----------------------------------------------- -----------------------------------------------
+	    |                   client_id                 | |                   client_id                 |
+	    ----------------------------------------------- -----------------------------------------------
+	    """
+
+	    def index(self, root_folder, parameters):
+	        Entry = namedtuple('Entry', ['client_id', 'file_id', 'image'])
+
+	        # Open the database and load the objects to provide via the outputs
+	        db = bob.db.atnt.Database()
+	        objs = sorted(db.objects(groups='world', purposes=None),
+	                      key=lambda x: (x.client_id, x.id))
+
+	        return [ Entry(x.client_id, x.id, x.make_path(root_folder, '.pgm')) for x in objs ]
+
+
+	    def get(self, output, index):
+	        obj = self.objs[index]
+
+	        if output == 'client_id':
+	            return {
+	                'value': np.uint64(obj.client_id)
+	            }
+
+	        elif output == 'file_id':
+	            return {
+	                'value': np.uint64(obj.file_id)
+	            }
+
+	        elif output == 'image':
+	            return {
+	                'value': bob.io.base.load(obj.image)
+	            }
+
+
+Note that:
+
+1) This view exactly matches the example from the documentation of the View class. In particular, ``index()``
+returns a list looking like:
+
+.. code-block:: python
+
+        [
+            (client_id=1, file_id=1, image=filename1),
+            (client_id=1, file_id=2, image=filename2),
+            (client_id=1, file_id=3, image=filename3),
+            (client_id=2, file_id=4, image=filename4),
+            (client_id=2, file_id=5, image=filename5),
+            (client_id=2, file_id=6, image=filename6),
+            ...
+            (client_id=100, file_id=10000, image=filename10000),
+        ]
+
+If there are 10000 images in the dataset, there will be 10000 entries in that list. The platform uses this
+information to split the jobs efficiently across several machines during the experiment. The list is expected
+to follow a logical order (here: entries are grouped by ``client_id``).
+
+2) For each entry in the dataset (represented as a named tuple), all the necessary data is provided by ``index()``.
+For performance reasons, ``get()`` is expected to work from the index alone, without instantiating ``bob.db.atnt.Database()`` again.
+
+3) You’re free to put any information in the index, under whatever field names you like (here, for simplicity, the tuple has one field
+per output of the view, with the same name). The platform doesn’t care.
+
+4) Some data from the database can be stored directly in the ``index`` (here: ``client_id`` and ``file_id``). For data that requires
+opening a file, put the filename in the ``index`` and process the file later in ``get()``.
+
+5) The implementation of ``get()`` is straightforward: the full index is available as ``self.objs``, so just return the data
+corresponding to the requested output at the given index.
+
+
+As for the JSON file describing the database, the format hasn’t changed. For an example of the usage of the parameters defined in the 
+JSON file and given to ``index()``, you can look at ``mnist/4``.
+
+Once the view is written, you must index the database with the command-line tool, for example:
+
+.. code-block:: sh
+
+	./bin/beat --prefix=… db index mydatabase/1/myview
+
 .. toctree::

   api
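
The ``index()``/``get()`` contract documented in this change can be tried out without beat or bob installed. The sketch below is an illustration only: ``StubView``, its ``setup()`` helper, ``ToyView``, the parameter names and the file paths are all hypothetical stand-ins and not part of ``beat.backend.python``. It assumes that the platform stores the result of ``index()`` in ``self.objs`` before calling ``get()``, as the docstring of ``get()`` suggests. It returns plain Python integers where a real view would return ``np.uint64`` values and loaded images.

```python
from collections import namedtuple
from itertools import groupby
import os


class StubView:
    """Hypothetical stand-in for beat.backend.python.database.View."""

    def setup(self, root_folder, parameters):
        # Assumption: the platform keeps the result of index() in self.objs.
        self.objs = self.index(root_folder, parameters)


class ToyView(StubView):
    """Outputs: image (here only the path), file_id, client_id."""

    def index(self, root_folder, parameters):
        Entry = namedtuple("Entry", ["client_id", "file_id", "image"])
        n_clients = parameters.get("clients", 2)
        per_client = parameters.get("images_per_client", 3)

        entries = []
        file_id = 1
        for client_id in range(1, n_clients + 1):
            for _ in range(per_client):
                # Store only the path; the file would be read later in get().
                path = os.path.join(root_folder, "s%d" % client_id, "%d.pgm" % file_id)
                entries.append(Entry(client_id, file_id, path))
                file_id += 1
        return entries

    def get(self, output, index):
        obj = self.objs[index]
        if output == "client_id":
            return {"value": obj.client_id}
        elif output == "file_id":
            return {"value": obj.file_id}
        elif output == "image":
            # A real view would load the file here, e.g. with bob.io.base.load().
            return {"value": obj.image}


view = ToyView()
view.setup("/data/toy", {"clients": 2, "images_per_client": 3})

# Entries are ordered by client_id, so each client forms one contiguous block.
blocks = [(cid, len(list(group)))
          for cid, group in groupby(view.objs, key=lambda e: e.client_id)]
print(blocks)
print(view.get("file_id", 4))
```

Because the entries are grouped by ``client_id``, each client occupies one contiguous range of the index. That is the property the documentation asks for when it says the list should be ordered consistently.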