.. vim: set fileencoding=utf-8 :


.. _gridtk.generate:

=====================================
 Script Generation for Grid Searches
=====================================

Scientific discovery sometimes requires running many experiments before you
can reach a reasonable conclusion. These experiments often differ only by
minor variations in their configuration and submission, possibly to an
SGE-enabled facility for processing.

This guide explains how to use the script ``jgen``, which helps you generate
multiple experiment configurations for your grid searches. The system assumes
that a single experiment is defined in a single file, while multiple
experiments can be run by executing sequences of these individual
configuration files in whichever way suits you.

In its simplest form, the script ``jgen`` takes three parameters:

* A YAML_ file listing the "combinations" of variables to scan in the search
* A Jinja2_ template file that describes the setup of each experiment
* An output filename template that states where to save each experiment
  configuration generated by combining the parameters in your YAML_ file with
  the template

Let's look at each of these inputs in turn.


YAML Input
----------

The YAML_ input file describes all possible combinations of parameters you want
to scan. All root keys whose values are lists are combined in all possible
ways, and each combination produces a "configuration set".

A configuration set corresponds to settings for **all** variables in the input
template that need replacing. For example, if your template mentions the
variables ``name`` and ``version``, then each configuration set should yield
values for both ``name`` and ``version``.

For example:

.. code-block:: yaml

   name: [john, lisa]
   version: [v1, v2]


This yields the following configuration sets:

.. code-block:: python

   [
     {'name': 'john', 'version': 'v1'},
     {'name': 'john', 'version': 'v2'},
     {'name': 'lisa', 'version': 'v1'},
     {'name': 'lisa', 'version': 'v2'},
   ]


Each key in the input file should correspond to either a YAML list or some
other object. If the value is a list, then ``jgen`` iterates over it to build
every possible combination of elements across the lists. If the value is not a
list, then it is considered unique and is repeated in each generated
configuration set. Example:

.. code-block:: yaml

   name: [john, lisa]
   version: [v1, v2]
   text: >-
      hello,
      world!

This yields the following configuration sets:

.. code-block:: python

   [
     {'name': 'john', 'version': 'v1', 'text': 'hello, world!'},
     {'name': 'john', 'version': 'v2', 'text': 'hello, world!'},
     {'name': 'lisa', 'version': 'v1', 'text': 'hello, world!'},
     {'name': 'lisa', 'version': 'v2', 'text': 'hello, world!'},
   ]

Keys starting with an underscore (``_``) are also treated as "unique" objects,
even when their value is a list. Example:

.. code-block:: yaml

   name: [john, lisa]
   version: [v1, v2]
   _unique: [i1, i2]

This yields the following configuration sets:

.. code-block:: python

   [
     {'name': 'john', 'version': 'v1', '_unique': ['i1', 'i2']},
     {'name': 'john', 'version': 'v2', '_unique': ['i1', 'i2']},
     {'name': 'lisa', 'version': 'v1', '_unique': ['i1', 'i2']},
     {'name': 'lisa', 'version': 'v2', '_unique': ['i1', 'i2']},
   ]

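The expansion rules above can be sketched in a few lines of Python. This is
only an illustration of the documented behaviour, not the actual ``jgen``
implementation: the function name ``expand`` is hypothetical, and its input is
assumed to be the dictionary obtained after parsing the YAML_ file:

.. code-block:: python

   import itertools


   def expand(variables):
       """Return one dict (a "configuration set") per combination of lists.

       ``expand`` is a hypothetical helper, not part of ``jgen`` itself.
       """
       # Keys holding lists are scanned, unless their name starts with "_".
       scanned = {
           key: value
           for key, value in variables.items()
           if isinstance(value, list) and not key.startswith("_")
       }
       # Everything else is "unique" and copied as-is into every set.
       unique = {k: v for k, v in variables.items() if k not in scanned}

       cfgsets = []
       # itertools.product varies the last key fastest, as in the listings.
       for combination in itertools.product(*scanned.values()):
           cfgset = dict(zip(scanned.keys(), combination))
           cfgset.update(unique)
           cfgsets.append(cfgset)
       return cfgsets


   variables = {
       "name": ["john", "lisa"],
       "version": ["v1", "v2"],
       "_unique": ["i1", "i2"],
   }
   for cfgset in expand(variables):
       print(cfgset)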

Jinja2 Template
---------------

The template is a file in which variables are replaced for each of the
configuration sets generated from your YAML_ file. For example, suppose your
template is a Python file that uses the variables this way:

.. code-block:: text

   #!/usr/bin/env python

   print('My name is {{ name }}')
   print('This is {{ version }}')


Then ``jgen`` generates 4 output files, one for each combination of ``name``
and ``version`` as explained above.


Output filename template
------------------------

The output filename template follows the same rules as the Jinja2_ template,
but it is just a string. It describes the path where each configuration set,
once combined with the template, is saved. Considering our example above, it
may look like this:

.. code-block:: text

   output-dir/{{ name }}-{{ version }}.py


With all those inputs, the ``jgen`` command will look like this:

.. code-block:: sh

   $ jgen variables.yaml template.py 'output-dir/{{ name }}-{{ version }}.py'

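To make the command above more concrete, the following sketch mimics what
happens for each configuration set: both the template and the output filename
template are rendered with the same values. It is an illustration under stated
assumptions, not the code of ``jgen``: the helper ``render`` is hypothetical
and is a simplistic regular-expression stand-in for Jinja2_ that only
understands plain ``{{ variable }}`` placeholders, and the generated files are
kept in a dictionary instead of being written to disk:

.. code-block:: python

   import itertools
   import re


   def render(template, values):
       """Replace each ``{{ key }}`` placeholder by ``values[key]``.

       Hypothetical stand-in for Jinja2: no loops, filters or attributes.
       """
       pattern = r"\{\{\s*(\w+)\s*\}\}"
       return re.sub(pattern, lambda m: str(values[m.group(1)]), template)


   variables = {"name": ["john", "lisa"], "version": ["v1", "v2"]}
   template = "print('My name is {{ name }}')\nprint('This is {{ version }}')\n"
   output = "output-dir/{{ name }}-{{ version }}.py"

   generated = {}  # maps each output path to the rendered file contents
   for combination in itertools.product(*variables.values()):
       cfgset = dict(zip(variables, combination))
       generated[render(output, cfgset)] = render(template, cfgset)

   for path in generated:
       print(path)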

Generating Aggregations
-----------------------

When you generate many files that you need to run, it is often practical to
also generate an "aggregation" script that makes running all configurations
easy. For example, one could think of a bash script that runs all of the above
generated Python scripts. We call such scripts "aggregations". When
aggregating, you iterate over a special variable called ``cfgset``, which
contains the dictionary for each configuration set. For example, an
aggregation template would look like this:

.. code-block:: sh

   #!/usr/bin/env bash

   {% for k in cfgset %}
   python output-dir/{{ k.name }}-{{ k.version }}.py
   {% endfor %}


Which would then generate:

.. code-block:: sh

   #!/usr/bin/env bash

   python output-dir/john-v1.py
   python output-dir/john-v2.py
   python output-dir/lisa-v1.py
   python output-dir/lisa-v2.py

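The ``for`` loop in the aggregation template behaves much like a plain Python
loop over the list of configuration sets. The sketch below rebuilds the
command lines of the generated script by hand, assuming ``cfgset`` holds the
four dictionaries from the example. It is an illustration only, not how
``jgen`` renders aggregations internally:

.. code-block:: python

   # Assumed contents of ``cfgset`` for the running example.
   cfgset = [
       {"name": "john", "version": "v1"},
       {"name": "john", "version": "v2"},
       {"name": "lisa", "version": "v1"},
       {"name": "lisa", "version": "v2"},
   ]

   commands = []
   for k in cfgset:
       # Mirrors the template line:
       #   python output-dir/{{ k.name }}-{{ k.version }}.py
       commands.append(
           "python output-dir/{}-{}.py".format(k["name"], k["version"])
       )

   print("\n".join(commands))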

With this generated bash script, you could run all configuration sets from a
single command line.

The final ``jgen`` command line, which generates both the individual
configuration files and the aggregation, looks like this:

.. code-block:: sh

   $ jgen variables.yaml template.py 'output-dir/{{ name }}-{{ version }}.py' run.sh 'output-dir/run.sh'


.. Place your references here:
.. _yaml: https://en.wikipedia.org/wiki/YAML
.. _jinja2: http://jinja.pocoo.org/docs/