Commit c0366d56 authored by André Anjos

Updated manual

parent 3c3f1a36
......@@ -2,12 +2,12 @@
SGE Job Manager
=================
The Job Manager is a Python wrapper around SGE utilities like `qsub`, `qstat` and
`qdel`. It interacts with these tools to submit and manage grid jobs making up
a complete workflow ecosystem.
The Job Manager is a Python wrapper around SGE utilities like ``qsub``, ``qstat``
and ``qdel``. It interacts with these tools to submit and manage grid jobs
making up a complete workflow ecosystem.
Every time you interact with the Job Manager, a local database file (normally
named `.jobmanager.db`) is read or written so it preserves its state during
named ``submitted.db``) is read or written so it preserves its state during
decoupled calls. The database contains all the information about jobs that is
required for the Job Manager to:
......@@ -25,39 +25,48 @@ Job Manager only makes it dead simple.
Submitting a job
----------------
To interact with the Job Manager we use the `jman` utility. Make sure to have
To interact with the Job Manager we use the ``jman`` utility. Make sure to have
your shell environment set up to reach it without having to type in the full
path. The first task you may need to pursue is to submit jobs. Here is how:
.. code-block:: sh
$ jman torch -- dbmanage.py --help
Submitted (torch'd) 6151645 @all.q (0 seconds ago) -S /usr/bin/python /idiap/group/torch5spro/nightlies/last/bin/shell.py -- dbmanage.py --help
Notice that we require the double dash (`--`) separating the command one wants
to submit. This tells `jman` to stop reading its own options from this point
and consider all remaining arguments as part of the command to be submitted.
$ jman pysubmit myscript.py --help
Submitted 6151645 @all.q (0 seconds ago) -S /usr/bin/python myscript.py --help
.. note::
The command `torch` of the Job Manager will submit a job that will run in an
environment that is created by Torch5spro's `shell.py`. It is not the only
way to submit a job using the Job Manager. You can use either `submit` or
`wrapper`. Read the full help message of `jman` for details and instructions.
The command `pysubmit` of the Job Manager will submit a job that will run in
a python environment. It is not the only way to submit a job using the Job
Manager. You can also use `submit`, that considers the command as a self
sufficient application. Read the full help message of ``jman`` for details and
instructions.
Submitting a parametric job
---------------------------
Parametric or array jobs are jobs that execute the same way, except for the
environment variable ``SGE_TASK_ID``, which changes for every job. This way,
your program controls which bit of the full job has to be executed in each
(parallel) instance. It is great for forking thousands of jobs into the grid.
The next example sends 10 copies of the ``myscript.py`` job to the grid with
the same parameters. Only the variable ``SGE_TASK_ID`` changes between them:
.. code-block:: sh
$ jman torch -t 1-5:2 -- dbmanage.py --help
Submitted (torch'd) 6151645 @all.q (0 seconds ago) -S /usr/bin/python /idiap/group/torch5spro/nightlies/last/bin/shell.py -- dbmanage.py --help
$ jman pysubmit -t 10 myscript.py --help
Submitted 6151645 @all.q (0 seconds ago) -S /usr/bin/python myscript.py --help
The ``-t`` option in ``jman`` accepts different kinds of job array
descriptions. Have a look at the help documentation for details with ``jman
--help``.
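As an illustration of how a script can use ``SGE_TASK_ID`` to decide which part of the work to do, here is a minimal sketch. The helper names ``current_task_id`` and ``task_slice`` are hypothetical and are not part of ``gridtk``; the sketch also assumes that the variable may hold a non-numeric placeholder (such as ``undefined``) when the script runs outside an array job, in which case it falls back to task 1.

```python
import os


def current_task_id(environ=os.environ, default=1):
    """Return the 1-based array task id, or `default` outside an array job."""
    value = environ.get("SGE_TASK_ID", "")
    # Assumption: a missing or non-numeric value means "not an array job".
    return int(value) if value.isdigit() else default


def task_slice(items, task_id, task_count):
    """Return the items assigned to a 1-based task using a round-robin split."""
    return items[task_id - 1::task_count]


if __name__ == "__main__":
    # Hypothetical workload: 25 input files shared among 10 array tasks.
    files = ["file%02d.dat" % i for i in range(25)]
    for name in task_slice(files, current_task_id(), task_count=10):
        print("processing", name)
```

Each of the 10 instances would then process a disjoint subset of the inputs, and together they would cover all of them.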
Probing for jobs
----------------
Once the job has been submitted you will notice a database file (by default
called `.jobmanager.db`) has been created in the current working directory. It
called ``submitted.db``) has been created in the current working directory. It
contains the information for the job you just submitted:
.. code-block:: sh
......@@ -65,11 +74,12 @@ contains the information for the job you just submitted:
$ jman list
job-id queue age arguments
======== ===== === =======================================================
6151645 all.q 2m -S /usr/bin/python /idiap/group/torch5spro/nightlies/last/bin/shell.py -- dbmanage.py --help
6151645 all.q 2m -S /usr/bin/python myscript.py --help
From this dump you can see the SGE job identifier, the queue the job has been
submitted to and the command that was given to `qsub`. The `list` command from
`jman` only lists the contents of the database, it does **not** update it.
submitted to and the command that was given to ``qsub``. The ``list`` command
from ``jman`` only lists the contents of the database, it does **not** update
it.
Refreshing the list
-------------------
......@@ -83,7 +93,7 @@ containing jobs that failed and jobs that succeeded.
$ jman refresh
These jobs require attention:
6151645 @all.q (30 minutes ago) -S /usr/bin/python /idiap/group/torch5spro/nightlies/last/bin/shell.py -- dbmanage.py --help
6151645 @all.q (30 minutes ago) -S /usr/bin/python myscript.py --help
.. note::
......@@ -94,24 +104,22 @@ Inspecting log files
--------------------
As can be seen the job we submitted just failed. The job manager says it
requires attention. If jobs fail, they are copied to a database named
`failure.db` in the current directory. Otherwise, they are copied to
`success.db`. You can inspect the job log files like this:
requires attention. If jobs fail, they are moved to a database named
``failure.db`` in the current directory. Otherwise, they are moved to
``success.db``. You can inspect the job log files like this:
.. code-block:: sh
$ jman explain failure.db
Job 6151645 @all.q (34 minutes ago) -S /usr/bin/python /idiap/group/torch5spro/nightlies/last/bin/shell.py -- dbmanage.py --help
Command line: (['-S', '/usr/bin/python', '/idiap/group/torch5spro/nightlies/last/bin/shell.py', '--', 'dbmanage.py', '--help'],) {'deps': [], 'stderr': 'logs', 'stdout': 'logs', 'queue': 'all.q', 'env': ['OVERWRITE_TORCH5SPRO_BINROOT=/idiap/group/torch5spro/nightlies/last/bin'], 'cwd': True, 'name': None}
Job 6151645 @all.q (34 minutes ago) -S /usr/bin/python myscript.py --help
Command line: (['-S', '/usr/bin/python', '--', 'myscript.py', '--help'],) {'deps': [], 'stderr': 'logs', 'stdout': 'logs', 'queue': 'all.q', 'cwd': True, 'name': None}
6151645 stdout (/remote/filer.gx/user.active/aanjos/work/spoofing/idiap-gridtk/logs/shell.py.o6151645)
6151645 stderr (/remote/filer.gx/user.active/aanjos/work/spoofing/idiap-gridtk/logs/shell.py.e6151645)
Traceback (most recent call last):
File "/idiap/resource/software/sge/6.2u5/grid/spool/beaufix30/job_scripts/6151645", line 12, in <module>
import adm
ImportError: No module named adm
...
Hopefully, that helps in debugging the problem!
......@@ -125,30 +133,29 @@ submitted the first time:
.. code-block:: sh
$ jman resubmit --clean failure.db
Re-submitted job 6151663 @all.q (1 second ago) -S /usr/bin/python /idiap/group/torch5spro/nightlies/last/bin/shell.py -- dbmanage.py --help
removed `/remote/filer.gx/user.active/aanjos/work/spoofing/idiap-gridtk/logs/shell.py.o6151645'
removed `/remote/filer.gx/user.active/aanjos/work/spoofing/idiap-gridtk/logs/shell.py.e6151645'
Re-submitted job 6151663 @all.q (1 second ago) -S /usr/bin/python myscript.py --help
removed `logs/myscript.py.o6151645'
removed `logs/myscript.py.e6151645'
deleted job 6151645 from database
The `--clean` flag tells the job manager to clean-up the old failure and the
The ``--clean`` flag tells the job manager to clean-up the old failure and the
log files as it re-submits the new job. Notice the new job identifier has
changed as expected.
Cleaning-up
-----------
The job in question will not work no matter how many times we re-submit it. It
is not a temporary error. In these circumstances, I may just want to clean the
job and do something else. The job manager is here for you again:
If the job in question will not work no matter how many times we re-submit it,
you may just want to clean it up and do something else. The job manager is
here for you again:
.. code-block:: sh
$ jman cleanup --remove-job failure.db
Cleaning-up logs for job 6151663 @all.q (5 minutes ago) -S /usr/bin/python /idiap/group/torch5spro/nightlies/last/bin/shell.py -- dbmanage.py --help
removed `/remote/filer.gx/user.active/aanjos/work/spoofing/idiap-gridtk/logs/shell.py.o6151663'
removed `/remote/filer.gx/user.active/aanjos/work/spoofing/idiap-gridtk/logs/shell.py.e6151663'
Cleaning-up logs for job 6151663 @all.q (5 minutes ago) -S /usr/bin/python myscript.py --help
removed `logs/myscript.py.o6151663'
removed `logs/myscript.py.e6151663'
deleted job 6151663 from database
Inspection on the current directory will now show you everything concerning the
said job is gone.
......@@ -6,33 +6,22 @@
The GridTk User Guide
=======================
The `gridtk` framework is a python library to help submitting, tracking and
querying SGE. Here is quick example on how to use the `gridtk` framework:
The ``gridtk`` framework is a python library to help submitting, tracking and
querying SGE. Here is quick example on how to use the ``gridtk`` framework to
submit a python script:
.. code-block:: python
# This variable points to the torch5spro root directory you want to use
TORCH = '/idiap/group/torch5spro/nightlies/last'
import sys
from gridtk.manager import JobManager
from gridtk.tools import make_shell
# This helps constructing the command line with bracket'ed by Torch
from gridtk.tools import make_torch_wrapper
man = JobManager()
command = ['dbmange.py', '--help']
command, kwargs = make_torch_wrapper(TORCH, False, command, kwargs)
# For more options look do help(gridtk.qsub)
job = man.submit(command, cwd=True, stdout='logs', name='testjob')
manager = JobManager()
command = make_shell(sys.executable, ['myscript.py', '--help'])
job = manager.submit(command)
You can do, programmatically, everything you can do with the job manager - just
browse the help messages and the `jman` script for more information.
.. note::
To be able to import the `gridtk` library, you must have it on your
PYTHONPATH.
browse the help messages and the ``jman`` script for more information.
Reference Manual
----------------
......