Commit 785e457b authored by Manuel Günther's avatar Manuel Günther

Corrected and adapted the documentation.

parent ca3f45c5
......@@ -4,7 +4,6 @@
*.swp
build
dist
MANIFEST
.installed.cfg
develop-eggs
eggs
......@@ -13,4 +12,6 @@ sphinx
.project
.pydevproject
.settings
submitted.sql3
logs
......@@ -9,7 +9,30 @@ Currently, it is set up to work with the SGE grid at Idiap, but it is also possi
Since version 1.0, a local submission system is also included.
Instead of sending jobs to the SGE grid, it executes them in parallel processes on the local machine, using a simple scheduling system.
.. warning::
The new version of gridtk was completely rewritten and is no longer compatible with older versions of gridtk.
In particular, the database type has changed.
If you still have old ``submitted.db``, ``success.db`` or ``failure.db`` databases, please use an older version of gridtk to handle them.
.. warning::
Though tested thoroughly, this version might still be unstable and the reported statuses of the grid jobs might be incorrect.
If you are in doubt that the status is correct, please double-check with other grid utilities (like ``bin/grid qmon``).
If you find any problem, please report it using the `bug reporting system <http://github.com/idiap/gridtk/issues>`_.
.. note::
In the current version, gridtk is compatible with Python 3.
However, due to limitations of the working environment, the grid functionality is not tested with Python 3.
With Python 2.7, everything should work out fine.
This package uses the Buildout system for installation.
Please call::
$ python bootstrap.py
$ bin/buildout
$ bin/sphinx-build docs sphinx
$ firefox sphinx/index.html
to build and open the documentation, which contains even more information than this README.
Submitting jobs to the SGE grid
+++++++++++++++++++++++++++++++
......
......@@ -66,9 +66,9 @@ copyright = u'%s, Idiap Research Institute' % time.strftime('%Y')
# built documents.
#
# The short X.Y version.
version = '0.2'
version = '1.0'
# The full version, including alpha/beta/rc tags.
release = '0.2'
release = '1.0.0'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
......@@ -138,7 +138,7 @@ html_theme = 'default'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# html_static_path = ['_static']
# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
......
.. GridTk documentation master file, created by
sphinx-quickstart on Thu Aug 25 15:51:27 2011.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
.. vim: set fileencoding=utf-8 :
.. author: Manuel Günther <manuel.guenther@idiap.ch>
.. date: Fri Aug 30 14:31:49 CEST 2013
Welcome to GridTk's documentation!
==================================
The GridTK serves as a tool to submit jobs and keep track of their dependencies and their current statuses.
These jobs can either be submitted to an SGE grid, or run in parallel on the local machine.
There are two main ways to interact with the GridTK.
The easiest way is surely to use the command line interface; for details, please read the :ref:`command_line` section.
It is also possible to use the GridTK inside a program; the developer interface is described in the :ref:`developer` section.
Contents:
.. toctree::
......
=================
SGE Job Manager
=================
.. vim: set fileencoding=utf-8 :
.. author: Manuel Günther <manuel.guenther@idiap.ch>
.. date: Fri Aug 30 14:31:49 CEST 2013
The Job Manager is a Python wrapper around SGE utilities like ``qsub``, ``qstat``
and ``qdel``. It interacts with these tools to submit and manage grid jobs,
making up a complete workflow ecosystem.
.. _command_line:
Every time you interact with the Job Manager, a local database file (normally
named ``submitted.db``) is read or written so it preserves its state during
decoupled calls. The database contains all the information about jobs that is
required for the Job Manager to:
==========================
The Command Line Interface
==========================
* submit jobs (includes wrapped python jobs or Torch5spro specific jobs)
The command line interface requires the package to be installed properly.
Fortunately, this is easy to do using the Buildout tools in the main directory of GridTK:
.. code-block:: sh
$ python bootstrap.py
$ ./bin/buildout
These two commands will download all required dependencies and create a ``bin`` directory containing all the command line utilities that we will need in this section.
To verify the installation, you can run our nose tests:
.. code-block:: sh
$ ./bin/nosetests -v
The Job Manager
===============
The most important utility is the Job Manager ``bin/jman``.
This Job Manager can be used to:
* submit jobs
* probe for submitted jobs
* query SGE for submitted jobs
* identify problems with submitted jobs
* clean up logs from submitted jobs
* easily re-submit jobs if problems occur
* support for parametric (array) jobs
Many of these features are also achievable using the stock SGE utilities; the
Job Manager just makes them dead simple.
The Job Manager has a common set of parameters, which will be explained in the next section.
Additionally, several commands can be issued, each of which has its own set of options.
These commands will be explained afterwards.
Submitting a job
----------------
Basic Job Manager Parameters
----------------------------
There are two versions of the Job Manager: one that submits jobs to the SGE grid, and one that runs jobs in parallel processes on the local machine.
By default, the SGE manager is engaged.
If you don't have access to the SGE grid, or if you want to run your jobs locally, please use the ``--local`` option (short: ``-l``) of ``bin/jman``.
To interact with the Job Manager we use the ``jman`` utility. Make sure to have
your shell environment set up to reach it without having to type the full
path. The first task you may need to pursue is to submit jobs. Here is how:
To keep track of the submitted jobs, an SQL3 database is written.
This database is by default called ``submitted.sql3`` and put in the current directory, but this can be changed using the ``bin/jman --database`` (``bin/jman -d``) flag.
.. code-block:: sh
Normally, the Job Manager acts silently, and only error messages are reported.
To make the Job Manager more verbose, you can use the ``--verbose`` (``-v``) option several times, to increase the verbosity level to 1) WARNING, 2) INFO, 3) DEBUG.
$ jman submit myscript.py --help
Submitted 6151645 @all.q (0 seconds ago) -S /usr/bin/python myscript.py --help
.. note::
Submitting Jobs
---------------
To submit a job, the ``bin/jman submit`` command is used.
The simplest way to submit a job to be run in the SGE grid is:
The command `submit` of the Job Manager will submit a job that will run in
a python environment. It is not the only way to submit a job using the Job
Manager. You can also use `submit`, that considers the command as a self
sufficient application. Read the full help message of ``jman`` for details and
instructions.
.. code-block:: sh
$ bin/jman -vv submit myscript.py
This command will create an SQL3 database, submit the job to the grid and register it in the database.
To distinguish it more easily from other jobs in the database, you can give your job a name:
Submitting a parametric job
---------------------------
.. code-block:: sh
Parametric or array jobs are jobs that execute the same way, except for the
environment variable ``SGE_TASK_ID``, which changes for every job. This way,
your program controls which bit of the full job has to be executed in each
(parallel) instance. It is great for forking thousands of jobs into the grid.
$ bin/jman -vv submit -n [name] myscript.py
The next example sends 10 copies of the ``myscript.py`` job to the grid with
the same parameters. Only the variable ``SGE_TASK_ID`` changes between them:
If the job requires certain machine specifications, you can add these (please see the SGE manual for possible specifications of [key] and [value] pairs).
Please note the ``--`` option that separates specifications from the command:
.. code-block:: sh
$ jman submit -t 10 myscript.py --help
Submitted 6151645 @all.q (0 seconds ago) -S /usr/bin/python myscript.py --help
$ bin/jman -vv submit -q [queue-name] -m [memory] --io-big -s [key1]=[value1] [key2]=[value2] -- myscript.py
The ``-t`` option in ``jman`` accepts different kinds of job array
descriptions. Have a look at the help documentation for details with ``jman
--help``.
To have jobs run in parallel, you can submit a parametric job.
Simply call:
Probing for jobs
----------------
.. code-block:: sh
Once the job has been submitted, you will notice that a database file (by default
called ``submitted.db``) has been created in the current working directory. It
contains the information for the job you just submitted:
$ bin/jman -vv submit -t 10 myscript.py
to run ``myscript.py`` 10 times in parallel.
Each of the parallel jobs will get a different value of the environment variable ``SGE_TASK_ID``, ranging from 1 to 10 in this case.
If your script can handle this environment variable, it can actually execute 10 different tasks.
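For illustration, here is a minimal sketch (not part of gridtk) of how such a script could use this variable to pick its share of the work; the task list is a made-up assumption:

.. code-block:: python

   import os

   # Ten independent pieces of work; purely illustrative.
   tasks = ['task-%d' % i for i in range(1, 11)]

   # For an array job submitted with ``-t 10``, SGE sets SGE_TASK_ID to 1..10.
   # Fall back to 1 so that the script also runs outside the grid.
   task_id = int(os.environ.get('SGE_TASK_ID', '1'))

   print('Processing %s' % tasks[task_id - 1])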
Also, jobs with dependencies can be submitted.
When submitted to the grid, each job has its own job id.
These job ids can be used to create dependencies between the jobs (i.e., one job needs to finish before the next one can be started):
.. code-block:: sh
$ jman list
job-id queue age arguments
======== ===== === =======================================================
6151645 all.q 2m -S /usr/bin/python myscript.py --help
$ bin/jman -vv submit -x [job_id_1] [job_id_2] -- myscript.py
In case the first job fails, the depending jobs can automatically be stopped from being executed.
Just submit the jobs with the ``--stop-on-failure`` option.
From this dump you can see the SGE job identifier, the queue the job has been
submitted to and the command that was given to ``qsub``. The ``list`` command
from ``jman`` only lists the contents of the database, it does **not** update
it.
.. note::
The ``--stop-on-failure`` option is under development and might not work properly.
Use this option with care.
Refreshing the list
-------------------
You may instruct the job manager to probe SGE and update the status of the jobs
it is monitoring. Finished jobs will be reported to the screen and removed from
the job manager database and placed on a second database (actually two)
containing jobs that failed and jobs that succeeded.
While the jobs run, the output and error streams are captured in log files, which are written into a ``logs`` directory.
This directory can be changed by specifying:
.. code-block:: sh
$ jman refresh
These jobs require attention:
6151645 @all.q (30 minutes ago) -S /usr/bin/python myscript.py --help
$ bin/jman -vv submit -l [log_dir]
.. note::
When submitting jobs locally, by default the output and error streams are written to console and no log directory is created.
To get back the SGE grid logging behavior, please specify the log directory.
In this case, output and error streams are written into the log files **after** the job has finished.
Detection of success or failure is based on the length of the standard error
output of the job. If it is greater than zero, it is considered a failure.
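The following sketch only illustrates this rule (it is not gridtk's actual implementation, and the log file name is hypothetical):

.. code-block:: python

   import os

   def job_succeeded(stderr_log):
       """Returns True if the error log of the job is missing or empty."""
       return (not os.path.exists(stderr_log)) or os.path.getsize(stderr_log) == 0

   # hypothetical error log file written for job 6151645
   print(job_succeeded('logs/myscript.py.e6151645'))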
Inspecting log files
Running Jobs Locally
--------------------
When jobs are submitted to the SGE grid, SGE takes care of running them.
However, when jobs are submitted locally (using the ``--local`` option, see above), a local scheduler needs to be run.
This is achieved by issuing the command:
As can be seen, the job we submitted just failed. The job manager says it
requires attention. If jobs fail, they are moved to a database named
``failure.db`` in the current directory. Otherwise, they are moved to
``success.db``. You can inspect the job log files like this:
.. code-block:: sh
$ bin/jman -vv run-scheduler -p [parallel_jobs] -s [sleep_time]
This will start the scheduler in daemon mode.
It constantly monitors the SQL3 database and executes jobs after submission, checking for jobs to be run every ``[sleep_time]`` seconds.
Use ``Ctrl-C`` to stop the scheduler (if jobs are still running locally, they will automatically be stopped).
If you want to submit a list of jobs and have the scheduler run the jobs and stop afterwards, simply use the ``--die-when-finished`` option.
Also, it is possible to run only specific jobs (and array jobs), which can be selected with the ``-j`` and ``-a`` option, respectively.
Probing for Jobs
----------------
To list the contents of the job database, you can use the ``jman list`` command.
This will show you the job-id, the queue, the current status, the name and the command line of each job.
Since the database is automatically updated when jobs finish, you can run ``jman list`` again after some time.
Normally, long command lines are cut so that each job is listed in a single line.
To get the full command line, please use the ``-vv`` option:
.. code-block:: sh
$ bin/jman -vv list
By default, array jobs are not listed, but the ``-a`` option changes this behavior.
Usually, it is a good idea to combine the ``-a`` option with ``-j``, which will list only the jobs of the given job id(s):
.. code-block:: sh
$ jman explain failure.db
Job 6151645 @all.q (34 minutes ago) -S /usr/bin/python myscript.py --help
Command line: (['-S', '/usr/bin/python', '--', 'myscript.py', '--help'],) {'deps': [], 'stderr': 'logs', 'stdout': 'logs', 'queue': 'all.q', 'cwd': True, 'name': None}
$ bin/jman -vv list -a -j [job_id_1] [job_id_2]
6151645 stdout (/remote/filer.gx/user.active/aanjos/work/spoofing/idiap-gridtk/logs/shell.py.o6151645)
Inspecting log files
--------------------
When a job fails, its status will be ``failure``.
In this case, you might want to know what happened.
As a first indicator, the exit code of the program is reported as well.
Also, the output and error streams of the job have been recorded and can be inspected using the ``report`` command.
E.g.:
.. code-block:: sh
$ bin/jman -vv report -j [job_id] -a [array_id]
6151645 stderr (/remote/filer.gx/user.active/aanjos/work/spoofing/idiap-gridtk/logs/shell.py.e6151645)
Traceback (most recent call last):
...
will print the contents of the output and error log file from the job with the desired ID (and only the array job with the given ID).
To report only the output or only the error logs, you can use the ``-o`` or ``-e`` option, respectively.
Hopefully, that helps in debugging the problem!
Re-submitting the job
---------------------
If you are convinced the job did not work because of external conditions (e.g.
temporary network outage), you may re-submit it, *exactly* like it was
submitted the first time:
After correcting your code you might want to submit the same command line again.
For this purpose, the ``bin/jman resubmit`` command exists.
Simply specify the job id(s) that you want to resubmit:
.. code-block:: sh
$ jman resubmit --clean failure.db
Re-submitted job 6151663 @all.q (1 second ago) -S /usr/bin/python myscript.py --help
removed `logs/myscript.py.o6151645'
removed `logs/myscript.py.e6151645'
deleted job 6151645 from database
$ bin/jman -vv resubmit -j [job_id_1] [job_id_2]
This will clean up the old log files (unless you specified the ``--keep-logs`` option) and re-submit the job.
If the job is submitted to the grid, the job id(s) will change during this process.
The ``--clean`` flag tells the job manager to clean-up the old failure and the
log files as it re-submits the new job. Notice the new job identifier has
changed as expected.
Cleaning-up
Cleaning up
-----------
After the job has finished (successfully or not), you should clean up the database using the ``bin/jman delete`` command.
If not specified otherwise (i.e., using the ``--keep-logs`` option), this command will delete all jobs from the database, delete the log files (including the log directory in case it is empty), and remove the database as well.
If the job in question will not work no matter how many times we re-submit it,
you may just want to clean it up and do something else. The job manager is
here for you again:
Again, job ids and array ids can be specified to limit the deleted jobs with the ``-j`` and ``-a`` option, respectively.
It is also possible to clean up only those jobs (and array jobs) with a certain status.
E.g. use:
.. code-block:: sh
$ jman cleanup --remove-job failure.db
Cleaning-up logs for job 6151663 @all.q (5 minutes ago) -S /usr/bin/python myscript.py --help
removed `logs/myscript.py.o6151663'
removed `logs/myscript.py.e6151663'
deleted job 6151663 from database
$ bin/jman -vv delete -s success
to delete all successfully finished jobs, including their log files, from the database.
Other command line tools
========================
For convenience, we also provide additional command line tools, which are mainly useful at Idiap.
These tools are:
- ``bin/qstat.py``: writes the statuses of the jobs that are currently running in the SGE grid
- ``bin/qsub.py``: submits jobs to the SGE grid without logging them into the database
- ``bin/qdel.py``: deletes jobs from the SGE grid without logging them into the database
- ``bin/grid``: executes the command in a grid environment (i.e., as if a ``SETSHELL grid`` command had been issued before)
Inspecting the current directory will now show you that everything concerning the
said job is gone.
.. vim: set fileencoding=utf-8 :
.. Andre Anjos <andre.anjos@idiap.ch>
.. Thu 25 Aug 2011 15:58:21 CEST
=======================
The GridTk User Guide
=======================
.. _developer:
The ``gridtk`` framework is a Python library to help submitting, tracking and
querying SGE. Here is a quick example of how to use the ``gridtk`` framework to
submit a python script:
=====================
The GridTk User Guide
=====================
The ``gridtk`` framework is a Python library to help submitting, tracking and querying SGE.
Here is a quick example of how to use the ``gridtk`` framework to submit a Python script:
.. code-block:: python
import sys
from gridtk.manager import JobManager
from gridtk.sge import JobManager
from gridtk.tools import make_shell
manager = JobManager()
......@@ -23,15 +24,28 @@ submit a python script:
You can do, programmatically, everything you can do with the job manager - just
browse the help messages and the ``jman`` script for more information.
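If you only need a thin programmatic layer, one option (a sketch, not part of gridtk itself) is to drive the ``jman`` utility from Python; the commands used below are the ones documented in this manual, while the script name is a made-up assumption:

.. code-block:: python

   import subprocess

   def jman(*args):
       """Calls the ``bin/jman`` command line tool with the given arguments."""
       return subprocess.call(['bin/jman', '-vv'] + list(args))

   # submit a (hypothetical) script and list the contents of the job database
   jman('submit', '-n', 'example-job', 'myscript.py')
   jman('list')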
Reference Manual
----------------
API to the Job Manager
======================
API to the Job Managers
=======================
.. automodule:: gridtk.manager
:members:
.. automodule:: gridtk.sge
:members:
.. automodule:: gridtk.local
:members:
The Models of the SQL3 Databases
================================
.. automodule:: gridtk.models
:members:
Middleware
==========
......