.. vim: set fileencoding=utf-8 :
.. author: Manuel Günther <manuel.guenther@idiap.ch>
.. date: Fri Aug 30 14:31:49 CEST 2013

.. _command_line:

============================
 The Command Line Interface
============================

The Job Manager
===============

The most important utility is the Job Manager, ``jman``.  It can be used to:

* submit jobs
* probe for submitted jobs
* identify problems with submitted jobs
* clean up logs from submitted jobs
* easily re-submit jobs if problems occur
* submit parametric (array) jobs

The Job Manager has a common set of parameters, which are explained in the
next section.  Additionally, several commands can be issued, each of which has
its own set of options.  These commands are explained afterwards.

Basic Job Manager Parameters
----------------------------

There are two versions of the Job Manager: one that submits jobs to the SGE
grid, and one that runs jobs in parallel on the local machine.  By default,
the SGE manager is used.  If you don't have access to the SGE grid, or you
want to run jobs locally, use the ``jman --local`` (or, for short,
``jman -l``) option.

To keep track of the submitted jobs, an SQL3 database is written.  By default,
this database is called ``submitted.sql3`` and is put in the current
directory, but this can be changed using the ``jman --database``
(``jman -d``) option.

Normally, the Job Manager acts silently, and only error messages are reported.
To make the Job Manager more verbose, use the ``--verbose`` (``-v``) option
one or more times to raise the verbosity level to 1) WARNING, 2) INFO, or
3) DEBUG.
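
The same counting convention can be mirrored in your own scripts.  The
following is only a sketch of the mapping described above, not the Job
Manager's actual implementation.  The names ``LEVELS``,
``verbosity_to_level`` and ``parse_level``, and the choice to treat more than
three ``-v`` flags as DEBUG, are assumptions made for illustration:

.. code-block:: python

   import argparse
   import logging

   # Index = number of -v flags; without any flag only errors are reported.
   LEVELS = [logging.ERROR, logging.WARNING, logging.INFO, logging.DEBUG]


   def verbosity_to_level(count):
       """Map a ``-v`` count to a logging level (assumption: capped at DEBUG)."""
       return LEVELS[min(count, len(LEVELS) - 1)]


   def parse_level(argv):
       """Count the ``-v`` flags in ``argv`` and return the logging level."""
       parser = argparse.ArgumentParser()
       parser.add_argument("-v", "--verbose", action="count", default=0)
       args = parser.parse_args(argv)
       return verbosity_to_level(args.verbose)

For example, ``parse_level(["-vv"])`` returns ``logging.INFO``.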


Submitting Jobs
---------------

To submit a job, use the ``jman submit`` command.
The simplest way to submit a job to be run in the SGE grid is:

.. code-block:: sh

   $ jman -vv submit myscript.py

This command creates an SQL3 database, submits the job to the grid, and registers it in the database.
To make the job easier to distinguish from other jobs in the database, you can give it a name:

.. code-block:: sh

   $ jman -vv submit -n [name] myscript.py

If the job requires certain machine specifications, you can add these (please see the SGE manual for possible [key] and [value] pairs).
Please note the ``--`` option that separates the specifications from the command:

.. code-block:: sh

   $ jman -vv submit -q [queue-name] -m [memory] --io-big -s [key1]=[value1] [key2]=[value2] -- myscript.py

To have jobs run in parallel, you can submit a parametric job.  Simply call:

.. code-block:: sh

   $ jman -vv submit -t 10 myscript.py

to run ``myscript.py`` 10 times in parallel.  Each of the parallel jobs gets
a different value of the ``SGE_TASK_ID`` environment variable, which ranges
from 1 to 10 in this case.  If your script reads this environment variable,
it can execute 10 different tasks.
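
As an illustration, ``myscript.py`` could use the task id to pick its share
of the work.  This is a minimal sketch, not part of the Job Manager: the
``INPUTS`` list, the file names in it and the ``select_input`` helper are
hypothetical, and the fallback to task 1 when the variable is unset or not a
number is a choice made here so that the script can also be run on its own:

.. code-block:: python

   import os

   # Hypothetical work items, one per array task (task ids are 1-based).
   INPUTS = ["part-%02d.dat" % i for i in range(1, 11)]


   def select_input(environ=os.environ):
       """Return the work item assigned to this task via SGE_TASK_ID."""
       raw = environ.get("SGE_TASK_ID", "")
       # Fall back to the first task if the variable is unset or not numeric.
       task_id = int(raw) if raw.isdigit() else 1
       return INPUTS[task_id - 1]


   if __name__ == "__main__":
       print("processing", select_input())

With ``SGE_TASK_ID=3`` in the environment, ``select_input()`` returns
``part-03.dat``.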

Jobs with dependencies can also be submitted.  When submitted to the grid,
each job gets its own job id.  These job ids can be used to create
dependencies between jobs (i.e., one job needs to finish before the next one
can start):

.. code-block:: sh

  $ jman -vv submit -x [job_id_1] [job_id_2] -- myscript.py

If the first job fails, the Job Manager can automatically stop the dependent
jobs from being executed.  Just submit the jobs with the
``--stop-on-failure`` option.

.. note::

   The ``--stop-on-failure`` option is under development and might not work
   properly. Use this option with care.

You can also submit the same job several times in such a way that each
submission depends on the previous one.  This is useful, for example, for GPU
training, when your job gets killed because it runs out of time and you want
to submit the same job again.

.. code-block:: sh

  $ jman submit --repeat 5 -- myscript.py
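
Repeating a job typically only helps if the script can continue where the
previous run stopped.  How it does so is up to the script.  The following
sketch assumes a hypothetical JSON checkpoint file and a hypothetical
``run_resumable`` helper, neither of which is provided by the Job Manager:

.. code-block:: python

   import json
   import os


   def run_resumable(total_steps, budget, checkpoint="checkpoint.json"):
       """Run at most ``budget`` steps, resuming from ``checkpoint`` if present."""
       step = 0
       if os.path.exists(checkpoint):
           with open(checkpoint) as f:
               step = json.load(f)["step"]
       last = min(total_steps, step + budget)
       while step < last:
           step += 1  # placeholder for one unit of real work
           # Record progress so that the next repetition can pick up here.
           with open(checkpoint, "w") as f:
               json.dump({"step": step}, f)
       return step

Each repetition then advances the work by up to ``budget`` steps, and a
repetition that finds the work already complete returns immediately.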


While the jobs run, the output and error streams are captured in log files, which are written into a ``logs`` directory.
This directory can be changed by specifying:

.. code-block:: sh

  $ jman -vv submit -l [log_dir] myscript.py

.. note::

   When submitting jobs locally, by default the output and error streams are
   written to the console and no log directory is created.  To get the same
   logging behavior as on the SGE grid, specify the log directory.  In this
   case, the output and error streams are written into the log files
   **after** the job has finished.


Running Jobs Locally
--------------------

When jobs are submitted to the SGE grid, they are run immediately. However,
when jobs are submitted locally (using the ``--local`` option, see above), a
local scheduler needs to be run.  This is achieved by issuing the command:

.. code-block:: sh

   $ jman -vv run-scheduler -p [parallel_jobs] -s [sleep_time]

This starts the scheduler in daemon mode.  The scheduler constantly monitors
the SQL3 database and executes jobs after submission, starting new jobs every
``[sleep_time]`` seconds.  Use ``Ctrl-C`` to stop the scheduler (if jobs are
still running locally, they will automatically be stopped).

If you want to submit a list of jobs and have the scheduler run them and stop
afterwards, simply use the ``--die-when-finished`` option.  It is also
possible to run only specific jobs (and array jobs), which can be specified
with the ``-j`` and ``-a`` options, respectively.


Probing for Jobs
----------------

To list the contents of the job database, you can use the ``jman list``
command.  This shows the job id, the queue, the current status, the name and
the command line of each job.  Since the database is automatically updated
when jobs finish, you can run ``jman list`` again after some time to see the
updated status.

Normally, long command lines are cut so that each job is listed in a single
line.  To get the full command line, please use the ``-vv`` option:

.. code-block:: sh

   $ jman -vv list

By default, array jobs are not listed, but the ``-a`` option changes this
behavior.  Usually, it is a good idea to combine the ``-a`` option with ``-j``,
which will list only the jobs of the given job id(s):

.. code-block:: sh

   $ jman -vv list -a -j [job_id_1] [job_id_2]

Note that the ``-j`` option is relatively flexible.  You can use it to select
ranges of job ids, e.g., ``-j 1-4 6-8 10+2`` is the same as
``-j 1 2 3 4 6 7 8 10 11 12``.  In this case, make sure that there are no
spaces between the job ids and the ``-`` and ``+`` separators.  You cannot use
both ``-`` and ``+`` in one part, i.e., something like ``-j 1-4+2`` will not
work.  Any specified job id that is not available in the database is simply
ignored, including job ids that fall within the ranges.
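
The range notation can be read as follows: ``a-b`` selects all ids from ``a``
to ``b`` inclusive, and ``a+n`` selects ``a`` and the ``n`` ids after it.  The
function below is only a sketch of these semantics as described above; the
name ``expand_job_ids`` is hypothetical, this is not GridTK's own parser, and
it does not check the ids against the database:

.. code-block:: python

   def expand_job_ids(tokens):
       """Expand tokens such as ['1-4', '6-8', '10+2'] into a list of ids."""
       ids = []
       for token in tokens:
           if "-" in token and "+" in token:
               raise ValueError("cannot combine '-' and '+' in %r" % token)
           if "-" in token:
               # 'a-b' covers a, a+1, ..., b
               first, last = token.split("-")
               ids.extend(range(int(first), int(last) + 1))
           elif "+" in token:
               # 'a+n' covers a, a+1, ..., a+n
               first, count = token.split("+")
               ids.extend(range(int(first), int(first) + int(count) + 1))
           else:
               ids.append(int(token))
       return ids

Calling ``expand_job_ids("1-4 6-8 10+2".split())`` yields the ids 1 to 4, 6 to
8 and 10 to 12, matching the example above.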

Since version 1.3.0, GridTK also saves timing information about jobs, i.e.,
time stamps of when jobs were submitted, started and finished.  You can use
the ``-t`` option of ``jman ls`` to add the time stamps to the listing; they
are written both for jobs and for parametric jobs (i.e., when using the
``-a`` option).


Submitting dependent jobs
-------------------------

Sometimes, the execution of one job might depend on the execution of another
job.  The Job Manager can take care of this; simply add the id of the job
that has to finish first:

.. code-block:: sh

   $ jman -vv submit --dependencies 6151645 -- /usr/bin/python myscript.py --help
   ... Added job '<Job: 3> : submitted -- /usr/bin/python myscript.py --help' to the database
   ... Submitted job '<Job: 6151647> : queued -- /usr/bin/python myscript.py --help' to the SGE grid.

Now, the new job will only be run after the first one has finished.

.. note::

   Note the ``--`` between the list of dependencies and the command.

Inspecting log files
--------------------

When a job fails, its status will be ``failure``.  In this case, you might
want to know what happened.  As a first indicator, the exit code of the
program is reported as well.  In addition, the output and error streams of the
job have been recorded and can be inspected with the ``jman report`` command.
E.g.:

.. code-block:: sh

   $ jman -vv report -j [job_id] -a [array_id]

will print the contents of the output and error log files of the job with the
desired id (and only of the array job with the given id).

To report only the output or only the error logs, you can use the ``-o`` or
``-e`` option, respectively.  Hopefully, that helps in debugging the problem!

Re-submitting the job
---------------------

After correcting your code, you might want to submit the same command line
again.  For this purpose, the ``jman resubmit`` command exists.  Simply
specify the job id(s) that you want to resubmit:

.. code-block:: sh

   $ jman -vv resubmit -j [job_id_1] [job_id_2]

This will clean up the old log files (unless you specify the ``--keep-logs``
option) and re-submit the job.  If the job is submitted to the grid, the job
id(s) will change during this process.

Stopping a grid job
-------------------

In case you found an error in the code of a grid job that is currently
executing, you might want to kill the job in the grid.  For this purpose, you
can use the command:

.. code-block:: sh

   $ jman stop

The job is removed from the grid, but all log files are still available.  A
common use case is to stop the grid job, fix the bugs, and re-submit it.


Note about verbosity and time stamps
------------------------------------

For some jobs, it might be interesting to get the time stamps of when the job
started and when it finished.  When you use the ``-vv`` option, two time
stamps are added to the log files (usually the error log file) automatically:
one when the process starts and one when it finishes.  However, there is a
difference between the ``SGE`` operation and the ``--local`` operation.  For
the ``SGE`` operation, you need to use the ``-vv`` option during the
submission or re-submission of a job.  In ``--local`` mode, the ``-vv`` flag
given during execution (using ``jman run-scheduler``) is used instead.

.. note::

   Why are info logs written to the error log file, and not to the default
   output log file?  This is the default behavior of Python's logging module.
   All logs, independent of whether they are error, warning, info or debug
   logs, are written to ``sys.stderr``, which in turn is written into the
   error log files.
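
This default can be checked with a few lines of Python.  In the sketch below,
the helper name ``capture_default_logging`` and the logger name ``demo`` are
made up for illustration.  It attaches a ``logging.StreamHandler`` created
without arguments, which writes to ``sys.stderr``, and captures both streams:

.. code-block:: python

   import contextlib
   import io
   import logging


   def capture_default_logging(message):
       """Log one INFO message and return what reached (stdout, stderr)."""
       out, err = io.StringIO(), io.StringIO()
       with contextlib.redirect_stdout(out), contextlib.redirect_stderr(err):
           # Without a stream argument, the handler uses the current sys.stderr.
           handler = logging.StreamHandler()
           logger = logging.getLogger("demo")
           logger.setLevel(logging.INFO)
           logger.addHandler(handler)
           try:
               logger.info(message)
           finally:
               logger.removeHandler(handler)
       return out.getvalue(), err.getvalue()

The first returned string stays empty, while the second one contains the
logged message, even though the message is only of INFO level.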


Cleaning up
-----------

After the job has been executed (successfully or not), you should clean up the
database using the ``jman delete`` command.  Unless specified otherwise
(i.e., using the ``--keep-logs`` option), this command deletes all jobs from
the database, deletes the log files (including the log directory if it is
empty), and removes the database as well.

Again, job ids and array ids can be specified with the ``-j`` and ``-a``
options, respectively, to limit which jobs are deleted.  It is also possible
to clean up only those jobs (and array jobs) with a certain status.  E.g.,
use:

.. code-block:: sh

  $ jman -vv delete -s success -j 10-20

to delete all successfully finished jobs with job ids from 10 to 20 from the
database, together with their logs.


Other command line tools
========================

For convenience, we also provide additional command line tools, which are
mainly useful at Idiap. These tools are:

- ``qstat.py``: writes the statuses of the jobs that are currently running
  in the SGE grid
- ``qsub.py``: submits jobs to the SGE grid without logging them into the
  database
- ``qdel.py``: deletes jobs from the SGE grid without logging them into the
  database
- ``grid``: executes the command in a grid environment (i.e., as if a
  ``SETSHELL grid`` command had been issued before)