diff --git a/doc/backend_api.rst b/doc/backend_api.rst index 060a0b198056f4f6c12f1051b9f8d6059ae3fa3d..7ac828bb3c674a51b2cea949be2e1267f6cccdaa 100644 --- a/doc/backend_api.rst +++ b/doc/backend_api.rst @@ -267,170 +267,59 @@ In the remainder of this section, we describe the various commands, which are supported by this communication protocol. -Command: has-more-data (hmd) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Command: information (ifo) +~~~~~~~~~~~~~~~~~~~~~~~~~~ (User Process -> Infrastructure) -This command asks the infrastructure, whether there is more input data in a -given input or not. The format of this command is: +This command asks the infrastructure to return information about the remote +data sources queried. The format of this command is: .. code-block:: text - "hmd channel [name]" - -where ``name`` refers to the input name and is optional. - -The BEAT infrastructure will answer by writing the following into the input -pipe. - -.. code-block:: text - - "boolean" - -where `boolean` is a boolean indicating whether there is more data to process -or not. The values may be ``tru``, for True and ``fal``, for False. - - -Command: is-dataunit-done (idd) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -(User Process -> Infrastructure) - - -This command asks the infrastructure, whether the current chunk of data is -going to change or not, next time the current data index is increased. The -format of this command is: - -.. code-block:: text - - "idd channel name" + "ifo name\n" The infrastructure will answer by writing the following into the input pipe. .. code-block:: text - "boolean" - -where `boolean` is a boolean indicating whether there is more data to process -or not. The values may be ``tru``, for True and ``fal``, for False. - - -Command: next (nxt) -~~~~~~~~~~~~~~~~~~~ - -(User Process -> Infrastructure) - - -This command asks the infrastructure to provide the next data on a given -channel and/or input. If no input ``name`` is provided, then the infrastructure -will return the data on all inputs of the given ``channel``. The format of this -command is: - -.. code-block:: text - - "nxt channel [name]" - -where ``name`` refers to the input name and is optional. - -If an input ``name`` is provided, the infrastructure will answer by writing the -following into the input pipe: - -.. code-block:: text - - "dat N name1 <bin1> .. nameN-1 <binN-1>" - -where ``N`` refers to the number of data chunks in the reply, ``name`` is the -name of a given channel, and ``<bin>`` corresponds to the binary representation -of the data contents. The binary data format uses the same format for disk -storage used by the infrastructure so as to optimize I/O performance. Our -reference backend implementation at `beat.backend.python`_ contains -implementation details about the binary data format. As a backend developer, -you must ensure your backend is fully capable of **correctly** interpreting the -contents of the binary stream given the data formats associated with each input -and output of the algorithm. - - -Command: write (wrt) -~~~~~~~~~~~~~~~~~~~~ - -(User Process -> Infrastructure) - - -This command asks the infrastructure to write data on a given output. The -format of this command is: - -.. code-block:: text - - "wrt name <bin>" - -The ``name`` identifies the output on which to write the data. The message -`<bin>` is the raw data to write on the output, pre-encoded in the same binary -data format as the one used for the input. The contents of the binary stream -sent to the infrastructure will be checked again before being written to disk. -In case of problems, a system error will be issued and the processing will -stop. It is the task of the backend developer to insure conformity in order to -avoid such errors. - -The infrastructure acknowledges with: - -.. code-block:: text - - "ack" + "X" + "Start0" + "End0" + ... + "StartX-1" + "EndX-1" -In case all is OK or with the keyword ``err``, if there was an error -interpreting the binary data back. In such cases, the UP is expected to stop -processing and issue a ``done`` command, waiting to be gracefully stopped by -the infrastructure. If the UP does not issue the ``done`` command next, the UP -is forcibly terminated with a operating system level signal. +where `X` is the length of the data source and the `StartX` and `EndX` are the +start and end indexes available through that data source. -Command: is-data-missing (idm) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Command: get data (get) +~~~~~~~~~~~~~~~~~~~~~~~ (User Process -> Infrastructure) -This command asks the infrastructure whether there is missing data on a given -output, by looking at the synchronization information. The format of this -command is: - -.. code-block:: text - - "idm name" - -The infrastructure will answer by writing the following into the input pipe. +This command asks the infrastructure to return the data at the given index. +The format of this command is: .. code-block:: text - "boolean" - -where `boolean` is a boolean indicating whether there is more data to process -or not. The values may be ``tru``, for True and ``fal``, for False. - - -Command: output-is-connected (oic) -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -(User Process -> Infrastructure) - - -This command asks the infrastructure, whether a given output is connected or -not. The format of this command is: - -.. code-block:: text + "get X" - "ict name\n" +where X is the index of the data in the data source. The infrastructure will answer by writing the following into the input pipe. .. code-block:: text - "boolean" + "StartX" + "EndX" + "data" -where `boolean` is a boolean indicating whether there is more data to process -or not. The values may be ``tru``, for True and ``fal``, for False. +where `StartX` and `EndX` and the start and end indexes in the data sources and +`data` is the packed data that the data sources provides. Command: done (don) diff --git a/doc/conf.py b/doc/conf.py index eb6826d563b4a07b7db15e87f37d5c3a6d03f688..18252a3c0e3f68f114bbaca3029413c7f111dddf 100644 --- a/doc/conf.py +++ b/doc/conf.py @@ -32,8 +32,15 @@ # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. # # # ################################################################################### + + +import time import os import pkg_resources +import sphinx_rtd_theme + +# For inter-documentation mapping: +from bob.extension.utils import link_documentation, load_requirements # -- General configuration ----------------------------------------------------- @@ -54,7 +61,6 @@ extensions = [ "sphinx.ext.napoleon", "sphinx.ext.viewcode", "sphinx.ext.mathjax", - #'matplotlib.sphinxext.plot_directive' ] # Be picky about warnings @@ -104,7 +110,6 @@ master_doc = "index" # General information about the project. project = u"beat.core" -import time copyright = u"%s, Idiap Research Institute" % time.strftime("%Y") @@ -164,7 +169,6 @@ owner = [u"Idiap Research Institute"] # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. -import sphinx_rtd_theme html_theme = "sphinx_rtd_theme" @@ -258,16 +262,13 @@ autoclass_content = "class" autodoc_member_order = "bysource" autodoc_default_flags = ["members", "undoc-members", "show-inheritance"] -if not "BOB_DOCUMENTATION_SERVER" in os.environ: +if "BOB_DOCUMENTATION_SERVER" not in os.environ: # notice we need to overwrite this for BEAT projects - defaults from Bob are # not OK os.environ[ "BOB_DOCUMENTATION_SERVER" ] = "https://www.idiap.ch/software/beat/docs/beat/%(name)s/%(version)s/|https://www.idiap.ch/software/beat/docs/beat/%(name)s/master/" -# For inter-documentation mapping: -from bob.extension.utils import link_documentation, load_requirements - sphinx_requirements = "extra-intersphinx.txt" if os.path.exists(sphinx_requirements): intersphinx_mapping = link_documentation( diff --git a/doc/index.rst b/doc/index.rst index b4ef6423c3ac2016cfadb99daf9541d9a890bc46..5749322965edbfa079eba928ab74408a0e5d7a02 100644 --- a/doc/index.rst +++ b/doc/index.rst @@ -46,6 +46,7 @@ This package provides the core components of BEAT ecosystem. These core componen backend_api develop api + zmq_architecture Indices and tables diff --git a/doc/links.rst b/doc/links.rst index 477b357a0e50356b920444fa39378f129cc470cd..f3d8af7f7251fad5dc469391ef1aebd34fc34e92 100644 --- a/doc/links.rst +++ b/doc/links.rst @@ -43,6 +43,8 @@ .. _python 2.7: http://www.python.org .. _zero message queue: http://zeromq.org .. _zmq: http://zeromq.org +.. _ZeroMQ book: http://shop.oreilly.com/product/0636920026136.do +.. _Majordomo Protocol: https://rfc.zeromq.org/spec:18/MDP/ .. _language bindings: http://zeromq.org/bindings:_start .. _python bindings: http://zeromq.org/bindings:python .. _markdown: http://daringfireball.net/projects/markdown/ diff --git a/doc/zmq_architecture.rst b/doc/zmq_architecture.rst new file mode 100644 index 0000000000000000000000000000000000000000..7595b1cbd5b2ce80d597e95569fde570b0f4bef7 --- /dev/null +++ b/doc/zmq_architecture.rst @@ -0,0 +1,129 @@ +.. vim: set fileencoding=utf-8 : + +.. Copyright (c) 2019 Idiap Research Institute, http://www.idiap.ch/ .. +.. Contact: beat.support@idiap.ch .. +.. .. +.. This file is part of the beat.backend.python module of the BEAT platform. .. +.. .. +.. Redistribution and use in source and binary forms, with or without +.. modification, are permitted provided that the following conditions are met: + +.. 1. Redistributions of source code must retain the above copyright notice, this +.. list of conditions and the following disclaimer. + +.. 2. Redistributions in binary form must reproduce the above copyright notice, +.. this list of conditions and the following disclaimer in the documentation +.. and/or other materials provided with the distribution. + +.. 3. Neither the name of the copyright holder nor the names of its contributors +.. may be used to endorse or promote products derived from this software without +.. specific prior written permission. + +.. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND +.. ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED +.. WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE +.. DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE +.. FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL +.. DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR +.. SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER +.. CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +.. OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +.. OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. + + +.. _zmq_architecture: + +================================== +ZMQ Architecture for task handling +================================== + +Introduction +------------ + +The ZMQ architecture implemented in beat.core is based on the `Majordomo +Protocol`_ as described in the `ZeroMQ book`_. + +There are however some subtle differences: + +- We have one client: the scheduler +- We have unique workers +- We currently don't have "system commands" + +Due to these differences and their implementation, the protocol has been +renamed: "BEAT Computation Protocol" or BCP for short. + +The system is based on these three components: + +- The client +- The broker +- The worker(s) + +In BEAT, the client will be the scheduler which will send the tasks to the +broker which will be responsible for forwarding them to the appropriate worker +requested by the scheduler. + +Once the task has been completed, the worker will send back a message to the +scheduler through the broker. + +The whole messaging system is asynchronous except when starting an actual task. +The worker will send back a confirmation as soon as the runner was properly +started. + +Why this design ? +----------------- + +The original design was a bit simpler: + +- One scheduler +- Many workers + +The scheduler was responsible for both task scheduling and worker communication +handling. One issue that arose from time to time was that with very low volume +of network activity, the connection between one or more workers and the +scheduler would get cut and nobody would notice. The result was that new tasks +would be sent but silently dropped and thus experiment would stay in a running +state while not doing anything. And if canceled, the state would stay in +canceling as again the command would be silently dropped. + +Thus the rational behind choosing this new design was to avoid these connection +loss and therefore platform paralysis. + +Now, the broker and the workers implement a bidirectional heartbeat. This has a +twofold benefit: + +- The heartbeat itself should generate enough network activity to avoid the + connection to be cut. +- If a worker goes missing, it will be detected by the broker that will act as + configured to. + +BCP Schema +---------- + +The figure below shows how the system is working. + +:: + + -------------- + | | + | client | + | | + -------------- + | + -------------- + | | + | broker | + | | + -------------- + | | | | + | | | | + /----------------------/ | | \-----------------------\ + | | | | + | /-------/ \-------\ | + | | | | + --------------- --------------- --------------- --------------- + | | | | | | | | + | worker1 | | worker2 | | worker3 | | worker4 | + | | | | | | | | + --------------- --------------- --------------- --------------- + +.. include:: links.rst