Commit 8f89c4b9 authored by André Anjos

Merge branch 'generator' into 'master'

Add generator app for grid searches

See merge request !8
parents 4b48dc44 e6201f9d
.. vim: set fileencoding=utf-8 :
.. _gridtk.generate:
=====================================
Script Generation for Grid Searches
=====================================
Scientific discovery often requires running many experiments before a
reasonable conclusion can be reached. These experiments typically differ only
by minor variations in their configuration and are submitted, possibly to an
SGE-enabled facility, for processing.
This guide explains how to use the script ``jgen``, which helps you generate
multiple experiment configurations for your grid searches. The system assumes
that a single experiment is defined in a single file, and that multiple
experiments can be run by executing these individual configuration files in
sequence.
The script ``jgen`` takes, in its simplest form, three parameters:

* A YAML_ file describing the "combinations" of variables that one needs to
  scan for a search
* A Jinja2_ template file that describes the setup of each experiment
* An output filename template that says where each experiment configuration,
  generated by combining the parameters in your YAML_ file with the template,
  is saved

Let's go through each of these inputs.

YAML Input
----------
The YAML_ input file describes all possible combinations of parameters you want
to scan. All root keys that hold lists are combined in all possible ways, and
each combination produces a "configuration set".

A configuration set corresponds to settings for **all** variables in the input
template that need replacing. For example, if your template mentions the
variables ``name`` and ``version``, then each configuration set should yield
values for both ``name`` and ``version``.

For example:

.. code-block:: yaml

   name: [john, lisa]
   version: [v1, v2]

This should yield the following configuration sets:

.. code-block:: python

   [
     {'name': 'john', 'version': 'v1'},
     {'name': 'john', 'version': 'v2'},
     {'name': 'lisa', 'version': 'v1'},
     {'name': 'lisa', 'version': 'v2'},
   ]

Each key in the input file should correspond to either an object or a YAML
list. If the value is a list, then we'll iterate over it for every possible
combination of elements in the lists. If the value is not a list, then it is
considered unique and repeated in each generated configuration set. Example:

.. code-block:: yaml

   name: [john, lisa]
   version: [v1, v2]
   text: >
      hello,
      world!

This should yield the following configuration sets:

.. code-block:: python

   [
     {'name': 'john', 'version': 'v1', 'text': 'hello, world!'},
     {'name': 'john', 'version': 'v2', 'text': 'hello, world!'},
     {'name': 'lisa', 'version': 'v1', 'text': 'hello, world!'},
     {'name': 'lisa', 'version': 'v2', 'text': 'hello, world!'},
   ]

Keys starting with an underscore (``_``) are treated as "unique" objects as
well, even when their value is a list. Example:

.. code-block:: yaml

   name: [john, lisa]
   version: [v1, v2]
   _unique: [i1, i2]

This should yield the following configuration sets:

.. code-block:: python

   [
     {'name': 'john', 'version': 'v1', '_unique': ['i1', 'i2']},
     {'name': 'john', 'version': 'v2', '_unique': ['i1', 'i2']},
     {'name': 'lisa', 'version': 'v1', '_unique': ['i1', 'i2']},
     {'name': 'lisa', 'version': 'v2', '_unique': ['i1', 'i2']},
   ]

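The expansion rules above can be sketched with the standard library alone. This
is a minimal illustration, not the package's implementation: the helper name
``expand_mapping`` is hypothetical, and it takes an already-parsed mapping
instead of a YAML stream so that it does not depend on a YAML parser.

```python
import itertools


def expand_mapping(data):
    """Yield one configuration set per combination of list-valued keys.

    Hypothetical helper: ``data`` is a plain dict, not a YAML stream.
    """
    iterables = {}
    unique = {}
    for key, value in data.items():
        # lists are scanned, unless the key starts with an underscore
        if isinstance(value, list) and not key.startswith('_'):
            iterables[key] = value
        else:
            unique[key] = value

    keys = list(iterables)
    # Cartesian product of all scanned lists, in declaration order
    for values in itertools.product(*iterables.values()):
        cfg = dict(unique)
        cfg.update(zip(keys, values))
        yield cfg


configs = list(expand_mapping({
    'name': ['john', 'lisa'],
    'version': ['v1', 'v2'],
    '_unique': ['i1', 'i2'],
}))
print(len(configs))
```

Two lists of two elements each give four configuration sets, and the
``_unique`` list is copied unchanged into every one of them.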
Jinja2 Template
---------------
This corresponds to a file that will have variables replaced for each of the
configuration sets generated by your YAML_ file. For example, if your template
is a Python file that uses the variables this way:

.. code-block:: text

   #!/usr/bin/env python
   print('My name is {{ name }}')
   print('This is {{ version }}')

Then ``jgen`` will generate 4 output files, one for each combination of
``name`` and ``version`` explained above.

Output filename template
------------------------
This follows the same rules as the Jinja2_ template, but it is just a string
describing the path where each configuration, once combined with the template,
will be saved. Considering our example above, it may look like this:

.. code-block:: text

   output-dir/{{ name }}-{{ version }}.py

With all those inputs, the ``jgen`` command will look like this:

.. code-block:: sh

   $ jgen variables.yaml template.py 'output-dir/{{ name }}-{{ version }}.py'

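To make the pairing between file contents and file names concrete, here is a
small sketch that renders both templates for each configuration set using the
``jinja2`` package directly. The variable names (``content_template``,
``filename_template``), the hard-coded configuration sets and the temporary
base directory are assumptions made for illustration; ``jgen`` itself reads
these inputs from the command line.

```python
import os
import tempfile

import jinja2

env = jinja2.Environment()

# illustrative inputs, mirroring the examples in this guide
content_template = "print('My name is {{ name }}')\nprint('This is {{ version }}')"
filename_template = 'output-dir/{{ name }}-{{ version }}.py'
config_sets = [
    {'name': 'john', 'version': 'v1'},
    {'name': 'john', 'version': 'v2'},
    {'name': 'lisa', 'version': 'v1'},
    {'name': 'lisa', 'version': 'v2'},
]

base = tempfile.mkdtemp()  # keeps the example from touching the current directory
written = []
for cfg in config_sets:
    # the same configuration set resolves both the path and the contents
    relpath = env.from_string(filename_template).render(cfg)
    path = os.path.join(base, relpath)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, 'wt') as f:
        f.write(env.from_string(content_template).render(cfg))
    written.append(relpath)

print(written)
```

Because each configuration set drives both renderings, a file name can never
get out of step with the contents stored in it.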
Generating Aggregations
-----------------------
When you generate many files to run, it is often practical to also generate an
"aggregation" script that makes running all configurations easy. For example,
one could think of a bash script that runs all of the Python scripts generated
above. We call those "aggregations". When aggregating, you iterate over a
specific variable called ``cfgset``, which contains the dictionaries for each
configuration set. For example, an aggregation template would look like this:

.. code-block:: sh

   #!/usr/bin/env bash
   {% for k in cfgset %}
   python output-dir/{{ k.name }}-{{ k.version }}.py
   {% endfor %}

Which would then generate:

.. code-block:: sh

   #!/usr/bin/env bash
   python output-dir/john-v1.py
   python output-dir/john-v2.py
   python output-dir/lisa-v1.py
   python output-dir/lisa-v2.py

With this generated bash script, you could run all configuration sets from a
single command line.
The final command line for ``jgen``, including the generation of specific
configuration files and the aggregation would look like the following:

.. code-block:: sh

   $ jgen variables.yaml template.py 'output-dir/{{ name }}-{{ version }}.py' run.sh 'output-dir/run.sh'

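The aggregation step can be tried out in isolation with ``jinja2``. The sketch
below is an illustration under assumptions: the configuration sets are
hard-coded rather than expanded from a YAML file, and the ``for`` tags are
written inline (as in the package's tests) because Jinja2, by default, keeps
the newlines that surround block tags placed on their own lines.

```python
import jinja2

# block tags are kept inline so that no blank lines end up in the output
aggregation_template = (
    "#!/usr/bin/env bash\n"
    "{% for k in cfgset %}"
    "python output-dir/{{ k.name }}-{{ k.version }}.py\n"
    "{% endfor %}"
)

# illustrative configuration sets; jgen derives these from the YAML file
config_sets = [
    {'name': 'john', 'version': 'v1'},
    {'name': 'john', 'version': 'v2'},
    {'name': 'lisa', 'version': 'v1'},
    {'name': 'lisa', 'version': 'v2'},
]

# the whole list is exposed to the template under the name ``cfgset``
script = jinja2.Environment().from_string(aggregation_template).render(
    cfgset=config_sets)
print(script)
```

Unlike the per-configuration templates, the aggregation template is rendered
only once and sees all configuration sets at the same time.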
.. Place your references here:
.. _yaml: https://en.wikipedia.org/wiki/YAML
.. _jinja2: http://jinja.pocoo.org/docs/
.. vim: set fileencoding=utf-8 :
.. author: Manuel Günther <manuel.guenther@idiap.ch>
.. date: Fri Aug 30 14:31:49 CEST 2013
.. _gridtk:
......@@ -25,6 +23,8 @@ Contents:
manual
program
generate
Indices and tables
==================
......
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
'''Utilities for generating configurations for running experiments in batch'''
import collections
import itertools
import yaml
import jinja2
def _ordered_load(stream, Loader=yaml.Loader,
    object_pairs_hook=collections.OrderedDict):
  '''Loads the contents of the YAML stream into :py:class:`collections.OrderedDict`'s

  See: https://stackoverflow.com/questions/5121931/in-python-how-can-you-load-yaml-mappings-as-ordereddicts

  '''

  class OrderedLoader(Loader): pass

  def construct_mapping(loader, node):
    loader.flatten_mapping(node)
    return object_pairs_hook(loader.construct_pairs(node))

  OrderedLoader.add_constructor(yaml.resolver.BaseResolver.DEFAULT_MAPPING_TAG,
      construct_mapping)

  return yaml.load(stream, OrderedLoader)

def expand(data):
  '''Generates configuration sets based on the YAML input contents

  For an introduction to the YAML mark-up, just search the net. Here is one of
  its references: https://en.wikipedia.org/wiki/YAML

  A configuration set corresponds to settings for **all** variables in the
  input template that need replacing. For example, if your template mentions
  the variables ``name`` and ``version``, then each configuration set should
  yield values for both ``name`` and ``version``.

  For example:

  .. code-block:: yaml

     name: [john, lisa]
     version: [v1, v2]

  This should yield the following configuration sets:

  .. code-block:: python

     [
       {'name': 'john', 'version': 'v1'},
       {'name': 'john', 'version': 'v2'},
       {'name': 'lisa', 'version': 'v1'},
       {'name': 'lisa', 'version': 'v2'},
     ]

  Each key in the input file should correspond to either an object or a YAML
  array. If the value is a list, then we'll iterate over it for every possible
  combination of elements in the lists. If the value is not a list, then it is
  considered unique and repeated in each yielded configuration set. Example:

  .. code-block:: yaml

     name: [john, lisa]
     version: [v1, v2]
     text: >
        hello,
        world!

  This should yield the following configuration sets:

  .. code-block:: python

     [
       {'name': 'john', 'version': 'v1', 'text': 'hello, world!'},
       {'name': 'john', 'version': 'v2', 'text': 'hello, world!'},
       {'name': 'lisa', 'version': 'v1', 'text': 'hello, world!'},
       {'name': 'lisa', 'version': 'v2', 'text': 'hello, world!'},
     ]

  Keys starting with an underscore (``_``) are treated as "unique" objects as
  well, even when their value is a list. Example:

  .. code-block:: yaml

     name: [john, lisa]
     version: [v1, v2]
     _unique: [i1, i2]

  This should yield the following configuration sets:

  .. code-block:: python

     [
       {'name': 'john', 'version': 'v1', '_unique': ['i1', 'i2']},
       {'name': 'john', 'version': 'v2', '_unique': ['i1', 'i2']},
       {'name': 'lisa', 'version': 'v1', '_unique': ['i1', 'i2']},
       {'name': 'lisa', 'version': 'v2', '_unique': ['i1', 'i2']},
     ]


  Parameters:

    data (str): YAML data to be parsed


  Yields:

    dict: A dictionary of key-value pairs for building the templates

  '''

  data = _ordered_load(data, yaml.SafeLoader)

  # separates "unique" objects from the ones we have to iterate
  # pre-assemble return dictionary
  iterables = collections.OrderedDict()
  unique = collections.OrderedDict()
  for key, value in data.items():
    if isinstance(value, list) and not key.startswith('_'):
      iterables[key] = value
    else:
      unique[key] = value

  # generates all possible combinations of iterables
  for values in itertools.product(*iterables.values()):
    retval = collections.OrderedDict(unique)
    keys = list(iterables.keys())
    retval.update(dict(zip(keys, values)))
    yield retval

def generate(variables, template):
  '''Yields a resolved "template" for each config set

  This function will extrapolate the ``template`` using the contents of
  ``variables`` and yield one (extrapolated, expanded) string per configuration
  set. Saving the results to files is left to the caller.


  Parameters:

    variables (str): A string stream containing the variables to parse, in YAML
      format as explained on :py:func:`expand`.

    template (str): A string stream containing the template to extrapolate


  Yields:

    str: A generated template you can save

  '''

  env = jinja2.Environment()
  for c in expand(variables):
    yield env.from_string(template).render(c)

def aggregate(variables, template):
  '''Generates a resolved "template" for **all** config sets and returns it

  This function will extrapolate the ``template`` using the contents of
  ``variables`` and will return a single (extrapolated, expanded) string.


  Parameters:

    variables (str): A string stream containing the variables to parse, in YAML
      format as explained on :py:func:`expand`.

    template (str): A string stream containing the template to extrapolate


  Returns:

    str: A generated template you can save

  '''

  env = jinja2.Environment()
  # all configuration sets are exposed to the template as ``cfgset``
  d = {'cfgset': list(expand(variables))}
  return env.from_string(template).render(d)

#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
# Andre Anjos <andre.anjos@idiap.ch>
# Wed 27 Jul 2011 14:36:06 CEST
"""Executes a given command within the context of a shell script that has its
environment set like Idiap's 'SETSHELL grid' does."""
......@@ -25,7 +23,8 @@ def main():
# act as before
if len(sys.argv) < 2:
print(__doc__)
print("usage: %s <command> [arg [arg ...]]" % os.path.basename(sys.argv[0]))
print("usage: %s <command> [arg [arg ...]]" % \
os.path.basename(sys.argv[0]))
return 1
replace('grid', sys.argv[1:])
......
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
"""Script generator for grid jobs
This script can generate multiple output files based on a template and a set of
variables explained in a YAML file. It can also, optionally, generate a single
aggregated file for all possible configuration sets in the YAML file. It can be
used to:
1. Generate a set of runnable experiment configurations from a single
template
2. Generate a single script to launch all runnable experiments
"""
__epilog__ = """\
To generate a configuration for running experiments and an aggregation script,
do the following:
$ %(prog)s vars.yaml config.py 'out/cfg-{{ name }}-.py' run.sh out/run.sh
In this example, the user dumps all output in a directory called "out". The
name of each output file uses variable expansion from the file "vars.yaml" to
create a new file for each configuration set defined inside. In this example,
we assume it defines at least the variable "name", with multiple values to
iterate over. The file "run.sh" represents a template for the aggregation and
the extrapolated template will be saved at 'out/run.sh'. For more information
about how to structure these files, read the GridTK manual.
To only generate the configurations and not the aggregation, omit the last
two parameters:
$ %(prog)s vars.yaml config.py 'out/cfg-{{ name }}-.py'
"""
import os
import sys
import argparse
import logging
from .. import generator
from .. import tools
def _setup_logger(verbosity):

  if verbosity > 3: verbosity = 3

  # set up the verbosity level of the logging system
  log_level = {
      0: logging.ERROR,
      1: logging.WARNING,
      2: logging.INFO,
      3: logging.DEBUG
    }[verbosity]

  handler = logging.StreamHandler()
  handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
  logger = logging.getLogger('gridtk')
  logger.addHandler(handler)
  logger.setLevel(log_level)

  return logger

def main(command_line_options = None):

  from ..config import __version__

  basename = os.path.basename(sys.argv[0])
  epilog = __epilog__ % dict(prog=basename)

  formatter = argparse.ArgumentDefaultsHelpFormatter
  parser = argparse.ArgumentParser(description=__doc__, epilog=epilog,
      formatter_class=formatter)

  parser.add_argument('variables', type=str, help="Text file containing the variables in YAML format")
  parser.add_argument('gentmpl', type=str, help="Text file containing the template for generating multiple outputs, one for each configuration set")
  parser.add_argument('genout', type=str, help="Template for generating the output filenames")
  parser.add_argument('aggtmpl', type=str, nargs='?', help="Text file containing the template for generating one single output out of all configuration sets")
  parser.add_argument('aggout', type=str, nargs='?', help="Name of the output aggregation file")
  parser.add_argument('-v', '--verbose', action = 'count', default = 0,
      help = "Increase the verbosity level from 0 (only error messages) to 1 (warnings), 2 (log messages), 3 (debug information) by adding the --verbose option as often as desired (e.g. '-vvv' for debug).")
  parser.add_argument('-V', '--version', action='version',
      version='GridTk version %s' % __version__)

  # parse
  if command_line_options:
    args = parser.parse_args(command_line_options[1:])
    args.wrapper_script = command_line_options[0]
  else:
    args = parser.parse_args()
    args.wrapper_script = sys.argv[0]

  # setup logging first
  logger = _setup_logger(args.verbose)

  # check
  if args.aggtmpl and not args.aggout:
    logger.error('Missing aggregate output name')
    sys.exit(1)

  # do all configurations and store
  with open(args.variables, 'rt') as f:
    args.variables = f.read()

  with open(args.gentmpl, 'rt') as f:
    args.gentmpl = f.read()

  gdata = generator.generate(args.variables, args.gentmpl)
  gname = generator.generate(args.variables, args.genout)
  for fname, data in zip(gname, gdata):
    dirname = os.path.dirname(fname)
    if dirname: tools.makedirs_safe(dirname)
    with open(fname, 'wt') as f: f.write(data)

  # if user passed aggregator, do it as well
  if args.aggtmpl and args.aggout:
    with open(args.aggtmpl, 'rt') as f:
      args.aggtmpl = f.read()
    data = generator.aggregate(args.variables, args.aggtmpl)
    dirname = os.path.dirname(args.aggout)
    if dirname: tools.makedirs_safe(dirname)
    with open(args.aggout, 'wt') as f: f.write(data)

  return 0

#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
# Andre Anjos <andre.anjos@idiap.ch>
# Wed 24 Aug 2011 16:13:31 CEST
from __future__ import print_function
......
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
'''Test for the grid-search generator'''
import os
import shutil
import tempfile
import nose.tools
from ..generator import expand, generate, aggregate
from ..script import jgen
def test_simple():

  data = \
      'name: [john, lisa]\n' \
      'version: [v1, v2]'

  result = list(expand(data))
  expected = [
      {'name': 'john', 'version': 'v1'},
      {'name': 'john', 'version': 'v2'},
      {'name': 'lisa', 'version': 'v1'},
      {'name': 'lisa', 'version': 'v2'},
      ]

  nose.tools.eq_(result, expected)

def test_unique():

  data = \
      'name: [john, lisa]\n' \
      'version: [v1, v2]\n' \
      'text: >\n' \
      '   hello,\n' \
      '   world!'

  result = list(expand(data))
  expected = [
      {'name': 'john', 'version': 'v1', 'text': 'hello, world!'},
      {'name': 'john', 'version': 'v2', 'text': 'hello, world!'},
      {'name': 'lisa', 'version': 'v1', 'text': 'hello, world!'},
      {'name': 'lisa', 'version': 'v2', 'text': 'hello, world!'},
      ]

  nose.tools.eq_(result, expected)

def test_ignore():

  data = \
      'name: [john, lisa]\n' \
      'version: [v1, v2]\n' \
      '_unique: [i1, i2]'

  result = list(expand(data))
  expected = [
      {'name': 'john', 'version': 'v1', '_unique': ['i1', 'i2']},
      {'name': 'john', 'version': 'v2', '_unique': ['i1', 'i2']},
      {'name': 'lisa', 'version': 'v1', '_unique': ['i1', 'i2']},
      {'name': 'lisa', 'version': 'v2', '_unique': ['i1', 'i2']},
      ]

  nose.tools.eq_(result, expected)

def test_generation():

  data = \
      'name: [john, lisa]\n' \
      'version: [v1, v2]'

  template = '{{ name }} - {{ version }}'

  expected = [
      'john - v1',
      'john - v2',
      'lisa - v1',
      'lisa - v2',
      ]

  result = list(generate(data, template))
  nose.tools.eq_(result, expected)

def test_aggregation():

  data = \
      'name: [john, lisa]\n' \
      'version: [v1, v2]'

  template = '{% for k in cfgset %}{{ k.name }} - {{ k.version }}\n{% endfor %}'

  expected = '\n'.join([
      'john - v1',
      'john - v2',
      'lisa - v1',
      'lisa - v2\n',
      ])

  result = aggregate(data, template)
  nose.tools.eq_(result, expected)

def test_cmdline_generation():

  data = \
      'name: [john, lisa]\n' \
      'version: [v1, v2]'

  template = '{{ name }}-{{ version }}'

  expected = [
      'john-v1',
      'john-v2',
      'lisa-v1',
      'lisa-v2',
      ]

  tmpdir = tempfile.mkdtemp()

  try:
    variables = os.path.join(tmpdir, 'variables.yaml')
    with open(variables, 'wt') as f: f.write(data)
    gentmpl = os.path.join(tmpdir, 'gentmpl.txt')
    with open(gentmpl, 'wt') as f: f.write(template)
    genout = os.path.join(tmpdir, 'out', '{{ name }}-{{ version }}.txt')
    nose.tools.eq_(jgen.main(['-vv', variables, gentmpl, genout]), 0)

    # check all files are there and correspond to the expected output
    outdir = os.path.dirname(genout)
    for k in expected:
      ofile = os.path.join(outdir, k + '.txt')
      assert os.path.exists(ofile)
      with open(ofile, 'rt') as f: contents = f.read()
      nose.tools.eq_(contents, k)

  finally:
    shutil.rmtree(tmpdir)

def test_cmdline_aggregation():

  data = \
      'name: [john, lisa]\n' \
      'version: [v1, v2]'

  template = '{{ name }}-{{ version }}'

  aggtmpl = '{% for k in cfgset %}{{ k.name }}-{{ k.version }}\n{% endfor %}'

  gen_expected = [
      'john-v1',
      'john-v2',
      'lisa-v1',
      'lisa-v2',
      ]

  agg_expected = '\n'.join([
      'john-v1',
      'john-v2',
      'lisa-v1',
      'lisa-v2\n',
      ])

  tmpdir = tempfile.mkdtemp()

  try:
    variables = os.path.join(tmpdir, 'variables.yaml')
    with open(variables, 'wt') as f: f.write(data)
    gentmpl = os.path.join(tmpdir, 'gentmpl.txt')
    with open(gentmpl, 'wt') as f: f.write(template)
    genout = os.path.join(tmpdir, 'out', '{{ name }}-{{ version }}.txt')

    aggtmpl_file = os.path.join(tmpdir, 'agg.txt')
    with open(aggtmpl_file, 'wt') as f: f.write(aggtmpl)
    aggout = os.path.join(tmpdir, 'out', 'agg.txt')

    nose.tools.eq_(jgen.main(['-vv', variables, gentmpl, genout, aggtmpl_file,
      aggout]), 0)

    # check all files are there and correspond to the expected output
    outdir = os.path.dirname(genout)
    for k in gen_expected:
      ofile = os.path.join(outdir, k + '.txt')
      assert os.path.exists(ofile)
      with open(ofile, 'rt') as f: contents = f.read()
      nose.tools.eq_(contents, k)
    assert os.path.exists(aggout)
    with open(aggout, 'rt') as f: contents = f.read()