.. vim: set fileencoding=utf-8 :

.. Copyright (c) 2016 Idiap Research Institute, http://www.idiap.ch/          ..
.. Contact: beat.support@idiap.ch                                             ..
..                                                                            ..
.. This file is part of the beat.core module of the BEAT platform.            ..
..                                                                            ..
.. Commercial License Usage                                                   ..
.. Licensees holding valid commercial BEAT licenses may use this file in      ..
.. accordance with the terms contained in a written agreement between you     ..
.. and Idiap. For further information contact tto@idiap.ch                    ..
..                                                                            ..
.. Alternatively, this file may be used under the terms of the GNU Affero     ..
.. Public License version 3 as published by the Free Software and appearing   ..
.. in the file LICENSE.AGPL included in the packaging of this file.           ..
.. The BEAT platform is distributed in the hope that it will be useful, but   ..
.. WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY ..
.. or FITNESS FOR A PARTICULAR PURPOSE.                                       ..
..                                                                            ..
.. You should have received a copy of the GNU Affero Public License along     ..
.. with the BEAT platform. If not, see http://www.gnu.org/licenses/.          ..


.. _beat-introduction:

=============
Introduction
=============

The BEAT platform is a web-based system for certifying results of
software-based, data-driven workflows that can be functionally subdivided into
processing blocks. The platform takes the burden of hosting data and software
away from users by providing a computing farm that handles both aspects
gracefully. Data is kept sequestered inside the platform. The user provides the
description of data formats, algorithms, data flows (also known as toolchains)
and experimental details (parameters), which are combined inside the platform
to produce results that are easily exportable as graphics or tables for
scientific reports.

It is intended as a fundamental building block in `Reproducible Research`_,
allowing academic and industrial parties to prescribe system behavior and keep
it reproducible across generations of software, hardware and staff. Here are
some known applications:

* Challenges and competitions on defined data, protocols and workflow
  components;
* Study group exercises and exams;
* Support for publication submission;
* System and algorithm performance optimization;
* Reproduction of experiments across communities;
* Support for industry-academia relationships.

This package, in particular, defines a set of core components useful for the
whole platform: the building blocks used by all other packages in the BEAT
software suite. These are:

* **Data formats**: the specification of data which is transmitted between
  blocks of a toolchain;
* **Libraries**: routines (source-code or binaries) that can be incorporated
  into other libraries or into the user code of algorithms;
* **Algorithms**: the program (source-code or binaries) that defines the user
  algorithm to be run within the blocks of a toolchain;
* **Databases** and **Datasets**: means to read raw data from disk and feed it
  into a toolchain, respecting a certain usage protocol;
* **Toolchain**: the definition of the data flow in an experiment;
* **Experiment**: the combination of algorithms, datasets, a toolchain and
  parameters that allows the platform to schedule and run the prescribed recipe
  to produce displayable results.


.. _beat-core-introduction-example:

A Simple Example
----------------

The next figure shows a representation of a very simple toolchain, composed of
only a few color-coded components:

* To the left, the reader can identify two datasets, named ``set`` and ``set2``
  respectively. They emit data (of, at this point, an unspecified type) into
  the following processing blocks;
* Following the datasets, two processing blocks named ``echo1`` and ``echo2``
  receive the input from the dataset and emit data into a third block, named
  ``echo3``;
* The final component, called ``analysis``, receives the data emitted by
  ``echo3``. Because this block has no output, it is considered a final block,
  from which the BEAT platform expects to collect experiment results (which, at
  this point, are also unspecified).

.. Simple toolchain representation for the BEAT platform
.. graphviz:: img/toolchain-triangle.dot

The toolchain only defines the basic data flow and connections that must be
respected by experiments. It does not define the type of data produced or
consumed by any of its blocks, nor the algorithms, databases and protocols to
use. From the toolchain description, it is possible to derive a valid
execution order by taking the imposed data flow into consideration. In this
simple example, the datasets called ``set`` and ``set2`` may yield data in
parallel, allowing the execution of blocks ``echo1`` and ``echo2``. Block
``echo3`` must come next, before the ``analysis`` block, which comes last.
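
As an illustration only, the execution order described above can be derived
with a topological sort over the toolchain connections. The sketch below is not
the platform's implementation: the ``CONNECTIONS`` list and the
``execution_stages`` function are hypothetical names introduced for this
example, and the connections are transcribed from the textual description of
the toolchain.

.. code-block:: python

   # Hypothetical sketch, not part of the BEAT API.
   # Connections of the example toolchain as (source, destination) pairs.
   CONNECTIONS = [
       ("set", "echo1"),
       ("set2", "echo2"),
       ("echo1", "echo3"),
       ("echo2", "echo3"),
       ("echo3", "analysis"),
   ]


   def execution_stages(connections):
       """Group components into stages; a stage's members may run in parallel."""
       successors = {}  # component name -> names of the components it feeds
       pending = {}     # component name -> number of unfinished predecessors
       for source, destination in connections:
           successors.setdefault(source, []).append(destination)
           successors.setdefault(destination, [])
           pending.setdefault(source, 0)
           pending[destination] = pending.get(destination, 0) + 1

       stages = []
       ready = sorted(name for name, count in pending.items() if count == 0)
       while ready:
           stages.append(ready)
           following = []
           for name in ready:
               for successor in successors[name]:
                   pending[successor] -= 1
                   if pending[successor] == 0:
                       following.append(successor)
           ready = sorted(following)

       # Any component left unscheduled is part of a cycle.
       if sum(len(stage) for stage in stages) != len(pending):
           raise ValueError("the connections contain a cycle")
       return stages


   for number, stage in enumerate(execution_stages(CONNECTIONS), start=1):
       print(number, ", ".join(stage))

Each printed line lists the components that may run at the same step: the two
datasets first, then ``echo1`` and ``echo2``, then ``echo3`` and finally
``analysis``.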

In typical problems that can be implemented in the BEAT platform, datasets are
composed of multiple instances of raw data. For example, these could be images
for an object recognition problem, speech sequences for a speech recognition
task or model data for biometric recognition tasks. Computing blocks must
process these data by looping over these atomic data samples. The color-coding
in the figure conveys this extra data-flow information: for each dataset in the
drawing, it indicates which blocks loop over its atomic data. For the proposed
toolchain, we can observe that blocks ``echo1``, ``echo3`` and ``analysis``
loop over the "raw" data samples from ``set``, while ``echo2`` loops over the
samples from ``set2``.

The next figure shows a complete experimental setup for the above toolchain.
The input blocks use a given database, called ``simple/1`` (the name is
``simple`` and the version is ``1``), using one of its protocols called
``protocol``. Each block is set to a specific data set inside the
database/protocol combination. Both datasets on this database/protocol yield
objects of type ``beat/integer/1`` (a format called ``integer`` from user
``beat``, version ``1``), which are consumed by algorithms running on the next
blocks. The block ``echo1`` uses the algorithm ``user/integers_echo/1`` (an
algorithm called ``integers_echo`` from user ``user``, version ``1``) and
also yields ``beat/integer/1`` objects. The same is valid for the algorithm
running on block ``echo2``.

The algorithm for block ``echo3`` cannot possibly be the same: it must deal
with two inputs, generated by blocks looping over different raw data. The
conceptual differences involved in writing algorithms that are not synchronized
with all of their inputs are detailed later. For this introduction, it suffices
to understand that the organization of algorithms in an experiment is
constrained by the requirements of neighboring blocks as well as by the input
and output data flows determined for a given block.

Block ``echo3`` yields elements to the algorithm on the ``analysis`` block,
called ``user/integers_echo_analyzer/1``. It produces a single result named
``out_data`` of type ``int32`` (that is, a signed 32-bit integer). Algorithms
that do not communicate with other algorithms are typically called
``analyzers``. They are set up at the end of experiments to produce
quantifiable results you can use to measure the performance of your
experimental setup.

.. Simple experiment representation for the BEAT platform
.. graphviz:: img/experiment-triangle.dot


.. _beat-core-introduction-design:

Design
------

The next figure shows a UML representation of the main BEAT components, showing
some of their interactions and interdependencies. Experiments use algorithms,
data sets and a toolchain in order to define a complete runnable setup. Data
sets are grouped into protocols which are, in turn, grouped into databases.
Algorithms use data formats to define input and output patterns. Most objects
are subject to versioning, possess a name and belong to a specific user. By
concatenating those markers, it is possible to define unique identifiers for
all objects in the platform. The example above contains several such
identifiers, for instance ``beat/integer/1`` and ``user/integers_echo/1``.
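
The identifier convention can be illustrated with a small parser. This is a
minimal sketch and not the way the platform handles identifiers internally: the
``Identifier`` type and the ``parse_identifier`` function are hypothetical
names introduced here. It assumes only the two forms that appear in the example
above, ``user/name/version`` (such as ``beat/integer/1``) and ``name/version``
(such as the database ``simple/1``).

.. code-block:: python

   # Hypothetical sketch, not part of the BEAT API.
   from typing import NamedTuple, Optional


   class Identifier(NamedTuple):
       user: Optional[str]  # None for objects without an owner, e.g. databases
       name: str
       version: int


   def parse_identifier(text):
       """Split ``user/name/version`` or ``name/version`` into its markers."""
       parts = text.split("/")
       if not all(parts):
           raise ValueError("empty component in identifier: %r" % text)
       if len(parts) == 3:
           user, name, version = parts
       elif len(parts) == 2:
           user = None
           name, version = parts
       else:
           raise ValueError("unexpected identifier: %r" % text)
       # int() raises ValueError when the version is not a number.
       return Identifier(user, name, int(version))


   print(parse_identifier("beat/integer/1"))
   print(parse_identifier("simple/1"))

Keeping the version as an integer makes it straightforward to compare two
versions of the same object.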

.. High-level component interaction in the BEAT platform core
.. graphviz::

   digraph hierarchy {
     graph [fontname="helvetica", compound=true, splines=polyline]
     node [fontname="helvetica", shape=record, style=filled, fillcolor=gray95]
     edge [fontname="helvetica"]

     subgraph "algorithm_cluster" {
       1[label = "{Dataformat|...|+user\n+name\n+version}"]
       2[label = "{Algorithm|...|+user\n+name\n+version\n+code\n+language}"]
       6[label = "{Library|...|+user\n+name\n+version\n+code\n+language}"]
     }
     subgraph "database_cluster" {
       graph [label=datasets]
       3[label = "{Database|...|+name\n+version}"]
       4[label = "{Protocol|...|+template}"]
       5[label = "Set"]
     }
     subgraph "experiment_cluster" {
       graph [label=experiments]
       7[label = "{Toolchain|+execution_order()|+user\n+name\n+version}"]
       8[label = "{Experiment|...|+user\n+label}"]
     }

     1->1 [label = "0..*", arrowhead=empty]
     2->1 [label = "1..*", arrowhead=empty]
     2->6 [label = "0..*", arrowhead=empty]
     6->6 [label = "0..*", arrowhead=empty]
     4->3 [label = "1..*", arrowhead=odiamond]
     5->4 [label = "1..*", arrowhead=odiamond]
     5->1 [label = "1..*", arrowhead=empty]
     8->7 [label = "1..1", arrowhead=empty]
     8->2 [label = "1..*", arrowhead=empty]
     8->5 [label = "1..*", arrowhead=empty]

   }


The BEAT platform provides a graphical user interface so that you can program
data formats, algorithms and toolchains, and define experiments, rather
intuitively. This package provides the core building blocks of the BEAT
platform. For expert users, we provide a command-line interface to the
platform, allowing them to create, modify and delete such objects using their
own editors. For developers and programmers, the rest of this guide details
each of those building blocks, their relationships and how to use the
command-line interface to interact with the platform efficiently.


.. include:: links.rst