diff --git a/doc/index.rst b/doc/index.rst index 0a9f4ea723981a5fa929fdc9d82cf3a9ccb4ec99..59d71aeff0c8d40c27afbf77dd070c6ce091d760 100644 --- a/doc/index.rst +++ b/doc/index.rst @@ -26,179 +26,22 @@ ========================= Python Backend for BEAT ========================= - -The BEAT platform is a web-based system for certifying results for -software-based data-driven workflows that can be sub-divided functionally (into -processing blocks). The platform takes all burden of hosting data and software -away from users by providing a capable computing farm that handles both aspects -graciously. Data is kept sequestered inside the platform. The user provides the +In order to run an experiment in BEAT, the user provides the description of data formats, algorithms, data flows (also known as toolchains) -and experimental details (parameters), which are mashed inside the platform to -produce beautiful results, easily exportable into computer graphics or tables -for scientific reports. - -This package defines a backend to execute algorithms written in the Python -programming language. - -Creating new database views in beat -=================================== - -To implement a view, one needs to write a class that inherits from -``beat.backend.python.database.View``, and implement two methods: ``index()`` and ``get()``. - -Here are the `documentation `_ of those methods: - -The ``index()`` function: - -.. code-block:: python - - def index(self, root_folder, parameters): - """Returns a list of (named) tuples describing the data provided by the view. - - The ordering of values inside the tuples is free, but it is expected - that the list is ordered in a consistent manner (ie. all train images of - person A, then all train images of person B, ...). 
- - For instance, assuming a view providing that kind of data: - - ----------- ----------- ----------- ----------- ----------- ----------- - | image | | image | | image | | image | | image | | image | - ----------- ----------- ----------- ----------- ----------- ----------- - ----------- ----------- ----------- ----------- ----------- ----------- - | file_id | | file_id | | file_id | | file_id | | file_id | | file_id | - ----------- ----------- ----------- ----------- ----------- ----------- - ----------------------------------- ----------------------------------- - | client_id | | client_id | - ----------------------------------- ----------------------------------- - - a list like the following should be generated: - - [ - (client_id=1, file_id=1, image=filename1), - (client_id=1, file_id=2, image=filename2), - (client_id=1, file_id=3, image=filename3), - (client_id=2, file_id=4, image=filename4), - (client_id=2, file_id=5, image=filename5), - (client_id=2, file_id=6, image=filename6), - ... - ] - - DO NOT store images, sound files or data loadable from a file in the list! - Store the path of the file to load instead. - """ - -The ``get()`` function: - -.. code-block:: python - - def get(self, output, index): - """Returns the data of the provided output at the provided index in the list - of (named) tuples describing the data provided by the view (accessible at - self.objs)""" - - - -So if we take as an example the ``atnt/5 database``, the view named ``“Train”`` is implemented like this way -(note that each view comes with a documentation describing the way the different outputs are synchronised together): +and experimental details (parameters); BEAT then schedules and runs the provided recipe to produce displayable results. The algorithms can be written in Python or C++. -.. 
code-block:: python - class Train(View): - """Outputs: - - image: "{{ system_user.username }}/array_2d_uint8/1" - - file_id: "{{ system_user.username }}/uint64/1" - - client_id: "{{ system_user.username }}/uint64/1" - - One "file_id" is associated with a given "image". - Several "image" are associated with a given "client_id". - - --------------- --------------- --------------- --------------- --------------- --------------- - | image | | image | | image | | image | | image | | image | - --------------- --------------- --------------- --------------- --------------- --------------- - --------------- --------------- --------------- --------------- --------------- --------------- - | file_id | | file_id | | file_id | | file_id | | file_id | | file_id | - --------------- --------------- --------------- --------------- --------------- --------------- - ----------------------------------------------- ----------------------------------------------- - | client_id | | client_id | - ----------------------------------------------- ----------------------------------------------- - """ - - def index(self, root_folder, parameters): - Entry = namedtuple('Entry', ['client_id', 'file_id', 'image']) - - # Open the database and load the objects to provide via the outputs - db = bob.db.atnt.Database() - objs = sorted(db.objects(groups='world', purposes=None), - key=lambda x: (x.client_id, x.id)) - - return [ Entry(x.client_id, x.id, x.make_path(root_folder, '.pgm')) for x in objs ] - - - def get(self, output, index): - obj = self.objs[index] - - if output == 'client_id': - return { - 'value': np.uint64(obj.client_id) - } - - elif output == 'file_id': - return { - 'value': np.uint64(obj.file_id) - } - - elif output == 'image': - return { - 'value': bob.io.base.load(obj.image) - } - - -Note that: - -1) This view exactly matches the example from the documentation of the View class. In particular, ``index()`` -returns a list looking like: - -.. 
code-block:: python - - [ - (client_id=1, file_id=1, image=filename1), - (client_id=1, file_id=2, image=filename2), - (client_id=1, file_id=3, image=filename3), - (client_id=2, file_id=4, image=filename4), - (client_id=2, file_id=5, image=filename5), - (client_id=2, file_id=6, image=filename6), - ... - (client_id=100, file_id=10000, image=filename10000), - ] - -If there are 10000 images in the dataset, there will be 10000 entries in that list. The platform will use this -information to efficiently split the jobs on several machines during the experiment. It is expected that the list -is ordered in a logical order (here: entries are grouped by ``client_id``). - -2) For each entry in the dataset (represented as a named tuple), all the necessary data is provided by ``index()``. -For performance reasons, it is expected that we don’t need to instantiate ``bob.db.atnt.Database()`` anymore in the ``get()`` method. - -3) You’re free to put any info in the index, with the names you want for the field (here for simplicity, we have one field in the tuple -per output of the view, with the same name). The platform doesn’t care. - -4) Some data from the database can be stored directly in the ``index`` (here: ``client_id`` and ``file_id``). For others, that require -opening a file, put the filename in the ``index`` and process the file later in ``get()`` - -5) The implementation of ``get()`` is straightforward: the full index is available as ``“self.objs”``, just return the data -corresponding to the provided output at the given index. +This package defines a backend to execute algorithms written in the Python +programming language. -As for the JSON file describing the database, the format hasn’t changed. For an example of the usage of the parameters defined in the -JSON file and given to ``index()``, you can look at ``mnist/4``. -Once the view is written, you must index the database with the command-line tool, something like this: -.. 
code-block:: sh - ./bin/beat --prefix=… db index mydatabase/1/myview .. toctree:: + object_representation api diff --git a/doc/object_representation.rst b/doc/object_representation.rst new file mode 100644 index 0000000000000000000000000000000000000000..78f42f8ad535666282fbb9278e8f63645e14389d --- /dev/null +++ b/doc/object_representation.rst @@ -0,0 +1,91 @@ +.. _python-object-representation: + +======================= + Object Representation +======================= + +As mentioned in the `Algorithms `_ section, data is made available to user +algorithms via our backend API. In Python, the |project| platform uses NumPy +data types to pass data to and from algorithms. For example, when the +algorithm reads data whose format is defined as: + +.. code-block:: javascript + + { + "value": "float64" + } + + +The field ``value`` of an instance named ``object`` of this format is +accessible as ``object.value`` and will have the type ``numpy.float64``. If the +format were instead: + +.. code-block:: javascript + + { + "value": [0, 0, "float64"] + } + + +it would be accessed in the same way (i.e., via ``object.value``), except that +the type would be ``numpy.ndarray`` and ``object.value.dtype`` would be +``numpy.float64``. Similarly, objects that are instances of a format like +this: + +.. code-block:: javascript + + { + "x": "int32", + "y": "int32" + } + + +could be accessed as ``object.x`` for the ``x`` value and ``object.y`` for +the ``y`` value. Both ``object.x`` and ``object.y`` would have the type +``numpy.int32``. + +Conversely, if you *write* output data in an algorithm, the types of the output +objects are checked for compatibility with the types declared in the +format. For example, this would be a valid use of the format above, in Python: + +.. 
code-block:: python + + import numpy + + class Algorithm: + + def process(self, inputs, dataloaders, outputs): + + # read data + + # prepare the object to be written + myobj = {"x": numpy.int32(4), "y": numpy.int32(6)} + + # write it + outputs["point"].write(myobj) # OK! + + +If you try to write a ``float64`` object into a field that is supposed to be +of type ``int32``, an exception will be raised. For example: + + +.. code-block:: python + + import numpy + + class Algorithm: + + def process(self, inputs, dataloaders, outputs): + + # read data + + # prepare the object to be written + myobj = {"x": numpy.int32(4), "y": numpy.float64(3.14)} + + # write it + outputs["point"].write(myobj) # Error: cannot downcast! + + +The bottom line is: **all type casting in the platform must be explicit**. The +platform will not automatically downcast or upcast objects for you, so as to avoid +unexpected precision loss leading to errors. \ No newline at end of file
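The explicit-casting rule that closes the new ``object_representation.rst`` can be tried outside the platform with plain NumPy. The sketch below is only an illustration of the idea, not the BEAT backend's validation code: ``POINT_FORMAT`` and ``check_strict`` are hypothetical names introduced here, and the check assumes that a field is valid only when its value has exactly the declared NumPy scalar type.

```python
import numpy

# Hypothetical declaration mirroring the "point" format: field name -> NumPy type.
POINT_FORMAT = {"x": numpy.int32, "y": numpy.int32}


def check_strict(obj, fmt):
    """Raise TypeError unless every field has exactly the declared NumPy type.

    Illustrative only: this is not how the BEAT backend is implemented.
    """
    for name, expected in fmt.items():
        value = obj[name]
        if type(value) is not expected:
            raise TypeError(
                "field %r: expected %s, got %s"
                % (name, expected.__name__, type(value).__name__)
            )
    return obj


# Explicit cast: the caller knowingly truncates the float64 value to an int32.
good = {"x": numpy.int32(4), "y": numpy.int32(numpy.float64(3.14))}
check_strict(good, POINT_FORMAT)

# No implicit cast: the float64 value is rejected instead of being converted.
bad = {"x": numpy.int32(4), "y": numpy.float64(3.14)}
try:
    check_strict(bad, POINT_FORMAT)
    rejected = False
except TypeError:
    rejected = True
```

Under these assumptions the precision loss can only happen where the algorithm author writes the cast, which is the behaviour the documentation describes for ``outputs["point"].write(...)``.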