.. vim: set fileencoding=utf-8 :
.. author: Manuel Günther <manuel.guenther@idiap.ch>
.. date: Fri Aug 30 14:31:49 CEST 2013

.. _command_line:

============================
 The Command Line Interface
============================

The command line interface requires the package to be installed properly.
Fortunately, this is easily done using the Buildout tools in the main directory of GridTK:

.. code-block:: sh

  $ python bootstrap.py
  $ ./bin/buildout

These two commands will download all required dependencies and create a ``bin``
directory containing all the command line utilities that we will need in this
section.  To verify the installation, you can run our nose tests:

.. code-block:: sh

  $ ./bin/nosetests -sv

To build the package documentation, do:

.. code-block:: sh

  $ ./bin/sphinx-build docs sphinx


The Job Manager
===============

The most important utility is the Job Manager ``bin/jman``. This Job Manager
can be used to:

* submit jobs
* probe for submitted jobs
* identify problems with submitted jobs
* cleanup logs from submitted jobs
* easily re-submit jobs if problems occur
* submit parametric (array) jobs

The Job Manager has a common set of parameters, which will be explained in the
next section.  Additionally, several commands can be issued, each of which has
its own set of options.  These commands will be explained afterwards.


Basic Job Manager Parameters
----------------------------

There are two versions of the Job Manager: one that submits jobs to the SGE
grid, and one that runs them in parallel on the local machine.  By default, the
SGE manager is used.  If you don't have access to the SGE grid, or you want to
run your jobs locally, add the ``--local`` option (short: ``-l``) to your
``bin/jman`` command.

To keep track of the submitted jobs, an SQL3 database is written.  By default,
this database is called ``submitted.sql3`` and placed in the current directory,
but this can be changed using the ``--database`` (``-d``) flag.
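
For example, to use the local manager together with a database in a custom
location, you could combine both options; the database file name below is only
an illustration:

.. code-block:: sh

   $ bin/jman --local --database /path/to/my-jobs.sql3 -vv submit myscript.py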

Normally, the Job Manager acts silently, and only error messages are reported.
To make the Job Manager more verbose, you can use the ``--verbose`` (``-v``)
option several times, to increase the verbosity level to 1) WARNING, 2) INFO,
3) DEBUG.


Submitting Jobs
---------------

To submit a job, the ``bin/jman submit`` command is used.
The simplest way to submit a job to be run in the SGE grid is:

.. code-block:: sh

   $ bin/jman -vv submit myscript.py

This command will create an SQL3 database, submit the job to the grid and register it in the database.
To make the job easier to distinguish from other jobs in the database, you can give it a name:

.. code-block:: sh

   $ bin/jman -vv submit -n [name] myscript.py

If the job requires certain machine specifications, you can add these as well (please see the SGE manual for possible ``[key]`` and ``[value]`` pairs).
Please note the ``--`` that separates the specifications from the command:

.. code-block:: sh

   $ bin/jman -vv submit -q [queue-name] -m [memory] --io-big -s [key1]=[value1] [key2]=[value2] -- myscript.py

To have jobs run in parallel, you can submit a parametric job.  Simply call:

.. code-block:: sh

   $ bin/jman -vv submit -t 10 myscript.py

to run ``myscript.py`` 10 times in parallel.  Each of the parallel jobs will
have a different environment variable called ``SGE_TASK_ID``, which will range
from 1 to 10 in this case.  If your script can handle this environment
variable, it can actually execute 10 different tasks, as sketched below.
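
For illustration, a minimal (hypothetical) wrapper script could use the task id
to select a different input file per task; the ``--input`` option of
``myscript.py`` and the file naming scheme are just assumptions:

.. code-block:: sh

   #!/bin/sh
   # hypothetical example: task 1 processes data_1.txt, task 2 data_2.txt, etc.
   python myscript.py --input data_${SGE_TASK_ID}.txt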

Also, jobs with dependencies can be submitted.  When submitted to the grid,
each job has its own job id.  These job ids can be used to create dependencies
between the jobs (i.e., one job needs to finish before the next one can be
started):

.. code-block:: sh

  $ bin/jman -vv submit -x [job_id_1] [job_id_2] -- myscript.py

In case the first job fails, it can automatically stop the dependent jobs from
being executed.  Just submit the jobs with the ``--stop-on-failure`` option.

.. note::

   The ``--stop-on-failure`` option is under development and might not work
   properly. Use this option with care.


While the jobs run, their output and error streams are captured in log files, which are written into a ``logs`` directory.
This directory can be changed by specifying:

.. code-block:: sh

  $ bin/jman -vv submit -l [log_dir]

.. note::

   When submitting jobs locally, by default the output and error streams are
   written to the console and no log directory is created.  To get back the SGE
   grid logging behavior, please specify the log directory (see the example
   below).  In this case, output and error streams are written into the log
   files **after** the job has finished.
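
For example, to get the SGE-like logging behavior for a locally submitted job,
you could combine the ``--local`` and ``-l`` options described above; the log
directory name here is just an illustration:

.. code-block:: sh

   $ bin/jman --local -vv submit -l logs myscript.py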


Running Jobs Locally
--------------------

When jobs are submitted to the SGE grid, they are executed by the SGE scheduler
automatically.  However, when jobs are submitted locally (using the ``--local``
option, see above), a local scheduler needs to be run.  This is achieved by
issuing the command:

.. code-block:: sh

   $ bin/jman -vv run-scheduler -p [parallel_jobs] -s [sleep_time]

This will start the scheduler in daemon mode.  It constantly monitors the SQL3
database and executes newly submitted jobs, checking for new jobs every
``[sleep_time]`` seconds.  Use ``Ctrl-C`` to stop the scheduler (jobs that are
still running locally will automatically be stopped).

If you want to submit a list of jobs and have the scheduler run them and stop
afterwards, simply use the ``--die-when-finished`` option.  Also, it is
possible to run only specific jobs (and array jobs), which can be selected with
the ``-j`` and ``-a`` options, respectively, as in the sketch below.
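
A minimal sketch of such a local session; the job ids are placeholders, and the
particular combination of options is only an illustration:

.. code-block:: sh

   $ bin/jman --local -vv run-scheduler --die-when-finished -p 2 -j [job_id_1] [job_id_2]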


Probing for Jobs
----------------

To list the contents of the job database, you can use the ``jman list``
command.  This will show you the job id, the queue, the current status, the
name and the command line of each job.  Since the database is automatically
updated when jobs finish, you can run ``jman list`` again after some time.

Normally, long command lines are cut so that each job is listed in a single
line.  To get the full command line, please use the ``-vv`` option:

.. code-block:: sh

   $ bin/jman -vv list

By default, array jobs are not listed, but the ``-a`` option changes this
behavior.  Usually, it is a good idea to combine the ``-a`` option with ``-j``,
which will list only the jobs of the given job id(s):

.. code-block:: sh

   $ bin/jman -vv list -a -j [job_id_1] [job_id_2]

Note that the ``-j`` option is in general relatively smart.  You can use it to
select a range of job ids, e.g., ``-j 1-4 6-8``.  In this case, please make
sure that there are no spaces between the job ids and the ``-`` separator.  Any
specified job id that is not available in the database, including ids inside
the given ranges, will simply be ignored.

Since version 1.3.0, GridTK also saves timing information about jobs, i.e.,
time stamps of when jobs were submitted, started and finished.  You can use the
``-t`` option of ``jman list`` to add these time stamps to the listing; they
are written both for jobs and for parametric jobs (i.e., when using the ``-a``
option), as shown below.
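
For example, to list a specific job and its array jobs together with their time
stamps (the job id is a placeholder):

.. code-block:: sh

   $ bin/jman -vv list -t -a -j [job_id]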


Submitting dependent jobs
-------------------------

Sometimes, the execution of one job might depend on the execution of another
job. The JobManager can take care of this, simply by adding the id of the job
that we have to wait for:

.. code-block:: sh

   $ jman -vv submit --dependencies 6151645 -- /usr/bin/python myscript.py --help
   ... Added job '<Job: 3> : submitted -- /usr/bin/python myscript.py --help' to the database
   ... Submitted job '<Job: 6151647> : queued -- /usr/bin/python myscript.py --help' to the SGE grid.

Now, the new job will only be run after the first one has finished.

.. note::

   Note the ``--`` between the list of dependencies and the command.


Inspecting log files
--------------------

When a job fails, its status will be ``failure``.  In this case, you might want
to know what happened.  As a first indicator, the exit code of the program is
reported as well.  Additionally, the output and error streams of the job have
been recorded and can be inspected using the ``bin/jman report`` command, e.g.:

.. code-block:: sh

   $ bin/jman -vv report -j [job_id] -a [array_id]

will print the contents of the output and error log files of the job with the
desired id (and only of the array job with the given id).

To report only the output or only the error logs, you can use the ``-o`` or
``-e`` option, respectively (see the example below).  Hopefully, that helps in
debugging the problem!
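
For instance, to inspect only the error log of a given job (the job id is a
placeholder):

.. code-block:: sh

   $ bin/jman -vv report -e -j [job_id]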


Re-submitting the job
---------------------

After correcting your code, you might want to submit the same command line
again.  For this purpose, the ``bin/jman resubmit`` command exists.  Simply
specify the job id(s) that you want to resubmit:

.. code-block:: sh

   $ bin/jman -vv resubmit -j [job_id_1] [job_id_2]

This will clean up the old log files (if you didn't specify the ``--keep-logs``
option) and re-submit the job.  If the job is submitted to the grid, the job
id(s) will change during this process.


Stopping a grid job
-------------------

In case you find an error in the code of a grid job that is currently
executing, you might want to kill the job in the grid.  For this purpose, you
can use the command:

.. code-block:: sh

   $ jman stop

The job is removed from the grid, but all log files are still available.  A
common use case is to stop the grid job, fix the bugs, and re-submit it.
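
Such a stop/fix/re-submit cycle could, for example, look like this (the job id
is a placeholder):

.. code-block:: sh

   $ jman stop
   # ... fix the bug in your code ...
   $ bin/jman -vv resubmit -j [job_id]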


Note about verbosity and time stamps
------------------------------------

For some jobs, it might be interesting to get the time stamps when the job has
started and when it has finished.  These time stamps are added to the log files
(usually the error log file) automatically when you use the ``-vv`` option: one
when the process starts and one when it finishes.  However, there is a
difference between the ``SGE`` operation and the ``--local`` operation.  For
the ``SGE`` operation, you need to use the ``-vv`` option during the submission
or re-submission of a job.  In ``--local`` mode, the ``-vv`` flag during
execution (i.e., when calling ``run-scheduler``) is used instead.

.. note::

   Why are info logs written to the error log file, and not to the default
   output log file?  This is the default behavior of Python's logging module:
   all logs, independent of whether they are error, warning, info or debug
   logs, are written to ``sys.stderr``, which in turn is captured in the error
   log files.


Cleaning up
-----------

After the jobs have finished (successfully or not), you should clean up the
database using the ``bin/jman delete`` command.  Unless specified otherwise
(i.e., using the ``--keep-logs`` option), this command will delete all jobs
from the database, delete the log files (including the log directory in case it
is empty afterwards), and remove the database as well.

Again, job ids and array ids can be specified to limit the deleted jobs with
the ``-j`` and ``-a`` options, respectively.  It is also possible to clean up
only those jobs (and array jobs) with a certain status.  E.g., use:

.. code-block:: sh

  $ bin/jman -vv delete -s success -j 10-20

to delete all successfully finished jobs with job ids from 10 to 20, together
with their log files, from the database.


Other command line tools
========================

For convenience, we also provide additional command line tools, which are
mainly useful at Idiap. These tools are:

- ``bin/qstat.py``: writes the statuses of the jobs that are currently running
  in the SGE grid
- ``bin/qsub.py``: submits jobs to the SGE grid without logging them into the
  database
- ``bin/qdel.py``: deletes jobs from the SGE grid without logging them into the
  database
- ``bin/grid``: executes the given command in a grid environment (i.e., as if a
  ``SETSHELL grid`` command had been issued before)