.. vim: set fileencoding=utf-8 :
.. Andre Anjos <andre.dos.anjos@gmail.com>
.. Tue 15 Oct 17:41:52 2013

.. testsetup:: *

  import numpy
  positives = numpy.random.normal(1,1,100)
  negatives = numpy.random.normal(-1,1,100)
  import matplotlib
  if not hasattr(matplotlib, 'backends'):
    matplotlib.use('pdf') #non-interactive avoids exception on display
  import bob.measure

============
 User Guide
============

Methods in the :py:mod:`bob.measure` module can help you to quickly and easily
evaluate error for multi-class or binary classification problems. If you are
not yet familiar with aspects of performance evaluation, we recommend the
following papers and book chapters for an overview of some of the implemented
methods.

* Bengio, S., Keller, M., Mariéthoz, J. (2004). `The Expected Performance
  Curve`_.  International Conference on Machine Learning ICML Workshop on ROC
  Analysis in Machine Learning, 136(1), 1963–1966.
* Martin, A., Doddington, G., Kamm, T., Ordowski, M., & Przybocki, M. (1997).
  `The DET curve in assessment of detection task performance`_. Fifth European
  Conference on Speech Communication and Technology (pp. 1895-1898).
* Li, S., Jain, A.K. (2005), `Handbook of Face Recognition`, Chapter 14, Springer


Overview
--------

A classifier is subject to two types of errors: either the real access/signal
is rejected (false negative) or an impostor attack/a false access is accepted
(false positive). A possible way to measure the detection performance is to
use the Half Total Error Rate (HTER), which combines the False Negative Rate
(FNR) and the False Positive Rate (FPR) and is defined by the following
formula:

.. math::

   HTER(\tau, \mathcal{D}) = \frac{FPR(\tau, \mathcal{D}) + FNR(\tau, \mathcal{D})}{2} \quad \textrm{[\%]}


where :math:`\mathcal{D}` denotes the dataset used. Since both the FPR and the
FNR depend on the threshold :math:`\tau`, they are strongly related to each
other: increasing the FPR will reduce the FNR and vice-versa. For this reason,
results are often presented using either a Receiver Operating Characteristic
(ROC) or a Detection-Error Tradeoff (DET) plot; both plots essentially present
the FPR versus the FNR for different values of the threshold. Another widely
used measure to summarise the performance of a system is the Equal Error Rate
(EER), defined as the point along the ROC or DET curve where the FPR equals
the FNR.

However, it was noted by Bengio et al. (2004) that ROC and DET curves may be
misleading when comparing systems. Hence, the so-called Expected Performance
Curve (EPC) was proposed, which consists of an unbiased estimate of the
reachable performance of a system at various operating points.  Indeed, in
real-world scenarios, the threshold :math:`\tau` has to be set a priori: this
is typically done using a development set (also called cross-validation set).
Nevertheless, the optimal threshold can be different depending on the relative
importance given to the FPR and the FNR. Hence, in the EPC framework, the cost
:math:`\beta \in [0;1]` is defined as the trade-off between the FPR and FNR.
The optimal threshold :math:`\tau^*` is then computed using different values of
:math:`\beta`, corresponding to different operating points:

.. math::
  \tau^{*} = \arg\!\min_{\tau} \quad \beta \cdot \textrm{FPR}(\tau, \mathcal{D}_{d}) + (1-\beta) \cdot \textrm{FNR}(\tau, \mathcal{D}_{d})


where :math:`\mathcal{D}_{d}` denotes the development set, which should be
completely separate from the evaluation set :math:`\mathcal{D}_{t}`.

Performance for different values of :math:`\beta` is then computed on the
evaluation set :math:`\mathcal{D}_{t}` using the previously derived threshold.
Note that setting :math:`\beta` to 0.5 yields the Half Total Error Rate (HTER)
as defined in the first equation.
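
To make this concrete, here is a minimal sketch (using synthetic scores, and
the threshold and error-rate functions introduced in the Verification section
below) of how :math:`\tau^{*}` can be derived on a development set and then
applied to an evaluation set:

.. code-block:: python

   import numpy
   import bob.measure

   # synthetic development and evaluation scores, for illustration only
   dev_negatives = numpy.random.normal(-1, 1, 100)
   dev_positives = numpy.random.normal(1, 1, 100)
   eval_negatives = numpy.random.normal(-1, 1, 100)
   eval_positives = numpy.random.normal(1, 1, 100)

   beta = 0.5  # equal trade-off between FPR and FNR

   # optimal threshold, computed a priori on the development set
   tau = bob.measure.min_weighted_error_rate_threshold(
       dev_negatives, dev_positives, beta)

   # error rates on the evaluation set, using the pre-computed threshold
   fpr, fnr = bob.measure.farfrr(eval_negatives, eval_positives, tau)
   hter = (fpr + fnr) / 2.0  # with beta = 0.5, this is the HTER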

.. note::

  Most of the methods available in this module require as input two
  :py:class:`numpy.ndarray` objects that contain the scores obtained by the
  classification system to be evaluated, in no specific order. Most of the
  methods are designed to deal with two-class problems. Therefore, in this
  setting, and throughout this manual, we have defined that the **negatives**
  represent the impostor attacks or false class accesses (that is, when a
  sample of class A is given to the classifier of another class, such as class
  B). The second set, referred to as the **positives**, represents the true
  class accesses or signal responses of the classifier. The vectors are called
  this way because the procedures implemented in this module expect the scores
  of the **negatives** to be statistically distributed to the left of the
  signal scores (the **positives**). If that is not the case, one should
  either invert the input to the methods or multiply all available scores by
  -1, in order to have them inverted.

  The input to create these two vectors is generated by experiments conducted
  by the user and normally sits in files that may need some parsing before
  these vectors can be extracted. While it is not possible to provide a parser
  for every individual file that may be generated in different experimental
  frameworks, we do provide a parser for a generic two-column format where the
  first column contains -1/1 for negative/positive and the second column
  contains the score values. Please refer to the documentation of
  :py:func:`bob.measure.load.split` for more details.

  In the remainder of this section we assume you have successfully parsed and
  loaded your scores into two 1D float64 vectors and are ready to evaluate the
  performance of the classifier.
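
  For instance, a purely illustrative parsing sketch (assuming your scores sit
  in a hypothetical two-column file called ``scores.txt`` as described above,
  and that :py:func:`bob.measure.load.split` returns the negatives first and
  the positives second) could look like this:

  .. code-block:: python

     import bob.measure.load

     # parse a hypothetical two-column file: -1/1 in the first column,
     # the score value in the second column
     negatives, positives = bob.measure.load.split('scores.txt')

     # if your scores follow the opposite convention (positives scoring
     # lower than negatives), flip the sign of all scores, e.g.:
     # negatives, positives = -negatives, -positives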

Verification
------------

To count the number of correctly classified positives and negatives you can use
the following techniques:

.. doctest::

   >>> # negatives, positives = parse_my_scores(...) # write parser if not provided!
   >>> T = 0.0 #Threshold: later we explain how one can calculate these
   >>> correct_negatives = bob.measure.correctly_classified_negatives(negatives, T)
   >>> FPR = 1 - (float(correct_negatives.sum())/negatives.size)
   >>> correct_positives = bob.measure.correctly_classified_positives(positives, T)
   >>> FNR = 1 - (float(correct_positives.sum())/positives.size)

We do provide a method to calculate the FPR and FNR in a single shot:

.. doctest::

   >>> FPR, FNR = bob.measure.farfrr(negatives, positives, T)

The threshold ``T`` is normally calculated by looking at the distribution of
negatives and positives in a development (or validation) set, selecting a
threshold that matches a certain criterion and applying this derived threshold
to the evaluation set. This technique gives a better overview of the
generalization of a method. We implement different techniques for the
calculation of the threshold:

* Threshold for the EER

  .. doctest::

    >>> T = bob.measure.eer_threshold(negatives, positives)

* Threshold for the minimum HTER

  .. doctest::

    >>> T = bob.measure.min_hter_threshold(negatives, positives)

* Threshold for the minimum weighted error rate (MWER) given a certain cost
  :math:`\beta`.

  .. doctest:: python

     >>> cost = 0.3 #or "beta"
     >>> T = bob.measure.min_weighted_error_rate_threshold(negatives, positives, cost)

  .. note::

     Setting ``cost`` to 0.5 is equivalent to using
     :py:func:`bob.measure.min_hter_threshold`.


.. important::
   Often, it is not numerically possible to match the requested criterion for
   calculating the threshold based on the provided scores. Instead, the closest
   possible threshold is returned. For example, using
   :any:`bob.measure.eer_threshold` **will not** give you a threshold where
   :math:`FPR == FNR`. Hence, you cannot report :math:`FPR` or :math:`FNR`
   instead of :math:`EER`; you should report :math:`(FPR+FNR)/2` instead. This
   is also true for :any:`bob.measure.far_threshold` and
   :any:`bob.measure.frr_threshold`: the threshold returned by those functions
   does not guarantee that you will obtain the requested :math:`FPR` or
   :math:`FNR` value at that threshold. Instead, you should recalculate the
   actual values using :any:`bob.measure.farfrr`.
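
   For example, a minimal sketch of the recommended procedure (with synthetic
   scores, for illustration only):

   .. code-block:: python

      import numpy
      import bob.measure

      # synthetic score vectors, for illustration only
      negatives = numpy.random.normal(-1, 1, 100)
      positives = numpy.random.normal(1, 1, 100)

      # threshold chosen for (approximately) equal error rates
      T = bob.measure.eer_threshold(negatives, positives)

      # recompute the actual error rates obtained at that threshold ...
      fpr, fnr = bob.measure.farfrr(negatives, positives, T)

      # ... and report their average, rather than either value alone
      eer = (fpr + fnr) / 2.0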

.. note::
   Many functions in ``bob.measure`` have an ``is_sorted`` parameter, which defaults to ``False``.
   However, these functions need sorted ``positive`` and/or ``negative`` scores.
   If the scores are not sorted in ascending order, they will be copied internally -- twice!
   To avoid copying the scores, you might want to sort them in ascending order yourself, e.g., by:

   .. doctest:: python

      >>> negatives.sort()
      >>> positives.sort()
      >>> t = bob.measure.min_weighted_error_rate_threshold(negatives, positives, cost, is_sorted = True)
      >>> assert T == t

Identification
--------------

For identification, the Recognition Rate is one of the standard measures.
To compute recognition rates, you can use the :py:func:`bob.measure.recognition_rate` function.
This function expects a relatively complex data structure, which is the same as for the `CMC`_ below.
For each probe item, the scores for negative and positive comparisons are computed, and collected for all probe items:

.. doctest::

   >>> rr_scores = []
   >>> for probe in range(10):
   ...   pos = numpy.random.normal(1, 1, 1)
   ...   neg = numpy.random.normal(0, 1, 19)
   ...   rr_scores.append((neg, pos))
   >>> rr = bob.measure.recognition_rate(rr_scores, rank=1)

For open-set identification, Li and Jain (2005) define two different error measures.
The first measure is the :py:func:`bob.measure.detection_identification_rate`, which counts the number of correctly classified in-gallery probe items.
The second measure is the :py:func:`bob.measure.false_alarm_rate`, which counts how often an out-of-gallery probe item was incorrectly accepted.
Both rates can be computed using the same data structure as above, with one addition:
both functions require that at least one probe item exists which has no corresponding gallery item, i.e., for which the positives are empty or ``None``:

(continued from above...)

.. doctest::

   >>> for probe in range(10):
   ...   pos = None
   ...   neg = numpy.random.normal(-2, 1, 10)
   ...   rr_scores.append((neg, pos))
   >>> dir = bob.measure.detection_identification_rate(rr_scores, threshold = 0, rank=1)
   >>> far = bob.measure.false_alarm_rate(rr_scores, threshold = 0)

Confidence interval
-------------------

A confidence interval for parameter :math:`x` consists of a lower
estimate :math:`L`, and an upper estimate :math:`U`, such that the probability
of the true value being within the interval estimate is equal to :math:`\alpha`.
For example, a 95% confidence interval (i.e. :math:`\alpha = 0.95`) for a
parameter :math:`x` is given by :math:`[L, U]` such that

.. math:: \mathrm{Prob}(x \in [L, U]) = 95\%

The smaller the test size, the wider the confidence interval will be;
similarly, the greater :math:`\alpha`, the wider the confidence interval
will be.

`The Clopper-Pearson interval`_, a common method for calculating confidence
intervals, is a function of the number of successes, the number of trials and
the confidence value :math:`\alpha`. It is implemented as
:py:func:`bob.measure.utils.confidence_for_indicator_variable` and is based on
the cumulative probabilities of the binomial distribution. This method is
quite conservative, meaning that the true coverage rate of a 95%
Clopper-Pearson interval may be well above 95%.

For example, say we want to evaluate the reliability of a system that
identifies registered persons. Among 10,000 accepted transactions, 9856 are
true matches. The 95% confidence interval for the true match rate is then:

.. doctest:: python

    >>> bob.measure.utils.confidence_for_indicator_variable(9856, 10000)
    (0.98306835053282549, 0.98784270928084694)

meaning there is a 95% probability that the true match rate is inside :math:`[0.983,
0.988]`.

Plotting
--------

An image is worth a thousand words, they say. You can combine the capabilities
of `Matplotlib`_ with |project| to plot a number of curves. You must have that
package installed, though. In this section we describe a few recipes.

ROC
===

The Receiver Operating Characteristic (ROC) curve is one of the oldest plots in
town. To plot an ROC curve, in possession of your **negatives** and
**positives**, just do something along the lines of:

.. doctest::

   >>> from matplotlib import pyplot
   >>> # we assume you have your negatives and positives already split
   >>> npoints = 100
   >>> bob.measure.plot.roc(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test') # doctest: +SKIP
   >>> pyplot.xlabel('FPR (%)') # doctest: +SKIP
   >>> pyplot.ylabel('FNR (%)') # doctest: +SKIP
   >>> pyplot.grid(True)
   >>> pyplot.show() # doctest: +SKIP

You should see an image like the following one:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   positives = numpy.random.normal(1,1,100)
   negatives = numpy.random.normal(-1,1,100)
   npoints = 100
   bob.measure.plot.roc(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test')
   pyplot.grid(True)
   pyplot.xlabel('FPR (%)')
   pyplot.ylabel('FNR (%)')
   pyplot.title('ROC')

As can be observed, plotting methods live in the namespace
:py:mod:`bob.measure.plot`. They work like :py:func:`matplotlib.pyplot.plot`
itself, except that instead of receiving the x and y point coordinates as
parameters, they receive the two :py:class:`numpy.ndarray` arrays with
negatives and positives, as well as an indication of the number of points the
curve must contain.

As in the :py:func:`matplotlib.pyplot.plot` command, you can pass optional
parameters for the line, as shown in the example, to set up its color, shape
and even the label.  For an overview of the accepted keywords, please refer to
the `Matplotlib`_ documentation. Other plot properties, such as the plot title,
axis labels, grids and legends, should be controlled directly using the
relevant `Matplotlib`_ controls.

DET
===

A DET curve can be drawn using commands similar to the ones for the ROC curve:

.. doctest::

  >>> from matplotlib import pyplot
  >>> # we assume you have your negatives and positives already split
  >>> npoints = 100
  >>> bob.measure.plot.det(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test') # doctest: +SKIP
  >>> bob.measure.plot.det_axis([0.01, 40, 0.01, 40]) # doctest: +SKIP
  >>> pyplot.xlabel('FPR (%)') # doctest: +SKIP
  >>> pyplot.ylabel('FNR (%)') # doctest: +SKIP
  >>> pyplot.grid(True)
  >>> pyplot.show() # doctest: +SKIP

This will produce an image like the following one:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   positives = numpy.random.normal(1,1,100)
   negatives = numpy.random.normal(-1,1,100)

   npoints = 100
   bob.measure.plot.det(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test')
   bob.measure.plot.det_axis([0.1, 80, 0.1, 80])
   pyplot.grid(True)
   pyplot.xlabel('FPR (%)')
   pyplot.ylabel('FNR (%)')
   pyplot.title('DET')

.. note::

  If you wish to reset axis zooming, you must use the Gaussian scale rather
  than the visual marks shown on the plot, which are just there for display
  purposes. The real axis scale is based on the :py:func:`bob.measure.ppndf`
  method. For example, if you wish to set the x and y axes to display data
  between 1% and 40%, here is the recipe:

  .. doctest::

    >>> #AFTER you plot the DET curve, just set the axis in this way:
    >>> pyplot.axis([bob.measure.ppndf(k/100.0) for k in (1, 40, 1, 40)]) # doctest: +SKIP

  We provide a convenient way to do the above in this module. So, optionally,
  you may use the :py:func:`bob.measure.plot.det_axis` method like this:

  .. doctest::

    >>> bob.measure.plot.det_axis([1, 40, 1, 40]) # doctest: +SKIP

EPC
===

Drawing an EPC requires that both the development set negatives and positives
are provided alongside the evaluation set ones. Because of this, the API is
slightly modified:

.. doctest::

  >>> bob.measure.plot.epc(dev_neg, dev_pos, test_neg, test_pos, npoints, color=(0,0,0), linestyle='-') # doctest: +SKIP
  >>> pyplot.show() # doctest: +SKIP

This will produce an image like the following one:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   dev_pos = numpy.random.normal(1,1,100)
   dev_neg = numpy.random.normal(-1,1,100)
   test_pos = numpy.random.normal(0.9,1,100)
   test_neg = numpy.random.normal(-1.1,1,100)
   npoints = 100
   bob.measure.plot.epc(dev_neg, dev_pos, test_neg, test_pos, npoints, color=(0,0,0), linestyle='-')
   pyplot.grid(True)
   pyplot.title('EPC')


CMC
===

The Cumulative Match Characteristics (CMC) curve estimates the probability that
the correct model is in the *N* models with the highest similarity to a given
probe.  A CMC curve can be plotted using the :py:func:`bob.measure.plot.cmc`
function.  The CMC can be calculated from a relatively complex data structure,
which defines a pair of positive and negative scores **per probe**:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   cmc_scores = []
   for probe in range(10):
     positives = numpy.random.normal(1, 1, 1)
     negatives = numpy.random.normal(0, 1, 19)
     cmc_scores.append((negatives, positives))
   bob.measure.plot.cmc(cmc_scores, logx=False)
   pyplot.grid(True)
   pyplot.title('CMC')
   pyplot.xlabel('Rank')
   pyplot.xticks([1,5,10,20])
   pyplot.xlim([1,20])
   pyplot.ylim([0,100])
   pyplot.ylabel('Probability of Recognition (%)')

Usually, there is only a single positive score per probe, but this is not a fixed restriction.


Detection & Identification Curve
================================

The detection & identification curve is designed to evaluate open-set
identification tasks.  It can be plotted using the
:py:func:`bob.measure.plot.detection_identification_curve` function, but it
requires at least one open-set probe, i.e., a probe for which no corresponding
positive score exists; the FPR values are computed from these probes.  Here,
we plot the detection and identification curve for rank 1, so that the
recognition rate at FPR=1 will be identical to the rank one
:py:func:`bob.measure.recognition_rate` obtained in the CMC plot above.

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   cmc_scores = []
   for probe in range(1000):
     positives = numpy.random.normal(1, 1, 1)
     negatives = numpy.random.normal(0, 1, 19)
     cmc_scores.append((negatives, positives))
   for probe in range(1000):
     negatives = numpy.random.normal(-1, 1, 10)
     cmc_scores.append((negatives, None))

   bob.measure.plot.detection_identification_curve(cmc_scores, rank=1, logx=True)
   pyplot.xlabel('False Alarm Rate')
   pyplot.xlim([0.0001, 1])
   pyplot.ylabel('Detection & Identification Rate (%)')
   pyplot.ylim([0,1])



Fine-tuning
============

The methods inside :py:mod:`bob.measure.plot` are only provided as
`Matplotlib`_ wrappers around equivalent methods in :py:mod:`bob.measure` that
only calculate the points, without doing any plotting. You may prefer to tweak
the plotting or even use a different plotting system such as gnuplot. Have a
look at the implementations in :py:mod:`bob.measure.plot` to understand how to
use the |project| methods to compute the curves and integrate them in the way
that best suits you.
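
As an illustration, here is a minimal sketch of this approach (assuming that
:py:func:`bob.measure.roc` returns the curve coordinates as a ``2 x npoints``
array, with FPR values in the first row and FNR values in the second):

.. code-block:: python

   import numpy
   import bob.measure
   from matplotlib import pyplot

   # synthetic scores, for illustration only
   negatives = numpy.random.normal(-1, 1, 100)
   positives = numpy.random.normal(1, 1, 100)

   # compute the curve points only; no plotting happens here
   points = bob.measure.roc(negatives, positives, 100)

   # the plotting itself is then entirely under your control
   pyplot.plot(points[0, :], points[1, :], color='black', linestyle='--')
   pyplot.xlabel('FPR')
   pyplot.ylabel('FNR')
   pyplot.grid(True)
   pyplot.show()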

.. _bob.measure.command_line:

Full applications
-----------------

Commands under ``bob measure`` can be used to quickly evaluate a set of
scores and generate plots. We present these commands in this section. The
commands take as input the generic two-column data format specified in the
documentation of :py:func:`bob.measure.load.split`.

Metrics
=======

To calculate the threshold using a certain criterion (EER (default) or
min. HTER) on a development set, and to evaluate the performance of the system
at that threshold on an evaluation set, after setting up |project|, just do:

.. code-block:: sh

    ./bin/bob measure  metrics ./MTest1/scores-{dev,eval} -e
    [Min. criterion: EER ] Threshold on Development set `./MTest1/scores-dev`: -1.373550e-02
    bob.measure@2018-06-29 10:20:14,177 -- ERROR: NaNs scores (1.0%) were found in ./MTest1/scores-dev
    bob.measure@2018-06-29 10:20:14,177 -- ERROR: NaNs scores (1.0%) were found in ./MTest1/scores-eval
    ===================  ================  ================
    ..                   Development       Evaluation
    ===================  ================  ================
    False Positive Rate  15.5% (767/4942)  15.5% (767/4942)
    False Negative Rate  15.5% (769/4954)  15.5% (769/4954)
    Precision            0.8               0.8
    Recall               0.8               0.8
    F1-score             0.8               0.8
    ===================  ================  ================

The output will present the threshold together with the FPR, FNR, Precision,
Recall, F1-score and HTER on the given set, calculated using that threshold.
The relative counts of FAs and FRs are also displayed in parentheses.

.. note::
    Several score files can be given at once and the metrics will be computed
    for each of them separately. Development and evaluation files must be given
    in pairs. When evaluation files are provided, the ``--eval`` flag must be
    given.


To evaluate the performance of a new score file with a given threshold, use
``--thres``:

.. code-block:: sh

    ./bin/bob measure  metrics ./MTest1/scores-eval --thres 0.006
    [Min. criterion: user provided] Threshold on Development set `./MTest1/scores-eval`: 6.000000e-03
    bob.measure@2018-06-29 10:22:06,852 -- ERROR: NaNs scores (1.0%) were found in ./MTest1/scores-eval
    ===================  ================
    ..                   Development
    ===================  ================
    False Positive Rate  15.2% (751/4942)
    False Negative Rate  16.1% (796/4954)
    Precision            0.8
    Recall               0.8
    F1-score             0.8
    ===================  ================


You can simultaneously conduct the threshold computation on a development set
and evaluate its performance on an evaluation set, as shown in the first
example above.

.. note::
    The table format can be changed using the ``--tablefmt`` option, the default
    format being ``rst``. Please refer to ``bob measure metrics --help`` for more
    details.


Plots
=====

Customizable plotting commands are available in the :py:mod:`bob.measure` module.
They take a list of development and/or evaluation files and generate a single PDF
file containing the plots. Available plots are:

*  ``roc`` (receiver operating characteristic)

*  ``det`` (detection error trade-off)

*  ``epc`` (expected performance curve)

*  ``hist`` (histograms of positives and negatives)

Use the ``--help`` option on the above-cited commands to find out about more
options.

For example, to generate a DET curve from development and evaluation datasets:

.. code-block:: sh

    $ bob measure det -e -v --output "my_det.pdf" -ts "DetDev1,DetEval1,DetDev2,DetEval2" \
        dev-1.txt eval-1.txt dev-2.txt eval-2.txt

where `my_det.pdf` will contain DET plots for the two experiments.

.. note::
    By default, ``det`` and ``roc`` plot development and evaluation curves on
    different plots. You can force gather everything in the same plot using
    ``--no-split`` option.

.. note::
    The ``--figsize`` and ``--style`` options are two powerful options that can
    dramatically change the appearance of your figures. Try them! (e.g.
    ``--figsize 12,10 --style grayscale``)

Evaluate
========

A convenient command ``evaluate`` is provided to generate multiple metrics and
plots for a list of experiments. It generates two ``metrics`` outputs with EER
and min-HTER criteria along with ``roc``, ``det``, ``epc``, ``hist`` plots for each
experiment. For example:

.. code-block:: sh

    $ bob measure evaluate -e -v -l 'my_metrics.txt' -o 'my_plots.pdf' {sys1,sys2}/{dev,eval}

will output metrics and plots for the two experiments (dev and eval pairs) in
`my_metrics.txt` and `my_plots.pdf`, respectively.

.. include:: links.rst

.. Place your references here:

.. _`The Expected Performance Curve`: http://publications.idiap.ch/downloads/reports/2005/bengio_2005_icml.pdf
.. _`The DET curve in assessment of detection task performance`: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117.4489&rep=rep1&type=pdf
.. _`The Clopper-Pearson interval`: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper-Pearson_interval