.. vim: set fileencoding=utf-8 :
.. Andre Anjos <andre.dos.anjos@gmail.com>
.. Tue 15 Oct 17:41:52 2013

.. testsetup:: *

  import numpy
  numpy.random.seed(42)  # fixed seed keeps doctest runs reproducible
  positives = numpy.random.normal(1,1,100)
  negatives = numpy.random.normal(-1,1,100)
  import matplotlib
  if not hasattr(matplotlib, 'backends'):
    matplotlib.use('pdf') # non-interactive backend avoids exceptions on display-less machines
  import bob.measure

============
 User Guide
============

Methods in the :py:mod:`bob.measure` module help you quickly and easily
evaluate the error of multi-class or binary classification problems. If you
are not yet familiar with performance evaluation, we recommend the following
papers and book chapters for an overview of some of the implemented methods.

* Bengio, S., Keller, M., Mariéthoz, J. (2004). `The Expected Performance
  Curve`_.  International Conference on Machine Learning ICML Workshop on ROC
  Analysis in Machine Learning, 136(1), 1963–1966.
* Martin, A., Doddington, G., Kamm, T., Ordowski, M., & Przybocki, M. (1997).
  `The DET curve in assessment of detection task performance`_. Fifth European
  Conference on Speech Communication and Technology (pp. 1895-1898).
* Li, S., Jain, A.K. (2005), *Handbook of Face Recognition*, Chapter 14, Springer.


Overview
--------

A classifier is subject to two types of errors: either the real access/signal
is rejected (a false rejection) or an impostor attack/false access is accepted
(a false acceptance). A possible way to measure the detection performance is to
use the Half Total Error Rate (HTER), which combines the False Rejection Rate
(FRR) and the False Acceptance Rate (FAR) and is defined by the following
formula:

.. math::

   HTER(\tau, \mathcal{D}) = \frac{FAR(\tau, \mathcal{D}) + FRR(\tau, \mathcal{D})}{2} \quad \textrm{[\%]}

where :math:`\mathcal{D}` denotes the dataset used. Since both the FAR and the
FRR depend on the threshold :math:`\tau`, they are strongly related to each
other: increasing the FAR will reduce the FRR and vice versa. For this reason,
results are often presented using either a Receiver Operating Characteristic
(ROC) or a Detection-Error Tradeoff (DET) plot; these two plots basically
present the FAR versus the FRR for different values of the threshold. Another
widely used measure to summarise the performance of a system is the Equal Error
Rate (EER), defined as the point along the ROC or DET curve where the FAR
equals the FRR.

However, Bengio et al. (2004) noted that ROC and DET curves may be misleading
when comparing systems. Hence, the so-called Expected Performance Curve (EPC)
was proposed, which consists of an unbiased estimate of the reachable
performance of a system at various operating points. Indeed, in real-world
scenarios, the threshold :math:`\tau` has to be set a priori: this is typically
done using a development set (also called a cross-validation set).
Nevertheless, the optimal threshold can be different depending on the relative
importance given to the FAR and the FRR. Hence, in the EPC framework, the cost
:math:`\beta \in [0;1]` is defined as the trade-off between the FAR and FRR.
The optimal threshold :math:`\tau^*` is then computed using different values of
:math:`\beta`, corresponding to different operating points:

.. math::
  \tau^{*} = \arg\!\min_{\tau} \quad \beta \cdot \textrm{FAR}(\tau, \mathcal{D}_{d}) + (1-\beta) \cdot \textrm{FRR}(\tau, \mathcal{D}_{d})

where :math:`\mathcal{D}_{d}` denotes the development set, which should be
completely separate from the evaluation set :math:`\mathcal{D}`.

Performance for different values of :math:`\beta` is then computed on the
evaluation set :math:`\mathcal{D}_{t}` using the previously derived threshold.
Note that setting :math:`\beta` to 0.5 yields the Half Total Error Rate (HTER),
as defined in the first equation.

.. note::

  Most of the methods available in this module require as input a set of two
  :py:class:`numpy.ndarray` objects that contain the scores obtained by the
  classification system to be evaluated, in no particular order. Most of the
  methods are defined to deal with two-class problems. Therefore, in this
  setting, and throughout this manual, the **negatives** represent the
  impostor attacks or false class accesses of the classifier (that is, when a
  sample of class A is given to the classifier of another class, such as class
  B). The second set, referred to as the **positives**, represents the true
  class accesses or signal responses of the classifier. The vectors are named
  this way because the procedures implemented in this module expect the scores
  of the **negatives** to be statistically distributed to the left of the
  signal scores (the **positives**). If that is not the case, you should
  either swap the inputs to the methods or multiply all available scores by
  -1, in order to have them inverted.

  The input used to create these two vectors is generated by experiments
  conducted by the user and normally sits in files that may need some parsing
  before these vectors can be extracted. While it is not possible to provide a
  parser for every individual file that may be generated by different
  experimental frameworks, we do provide a parser for a generic two-column
  format, where the first column contains -1/1 for negative/positive and the
  second column contains the score values. Please refer to the documentation
  of :py:func:`bob.measure.load.split` for more details.

  In the remainder of this section we assume you have successfully parsed and
  loaded your scores into two 1D ``float64`` vectors and are ready to evaluate
  the performance of the classifier.

Verification
------------

To count the number of correctly classified positives and negatives you can use
the following techniques:

.. doctest::

   >>> # negatives, positives = parse_my_scores(...) # write parser if not provided!
   >>> T = 0.0 #Threshold: later we explain how one can calculate these
   >>> correct_negatives = bob.measure.correctly_classified_negatives(negatives, T)
   >>> FAR = 1 - (float(correct_negatives.sum())/negatives.size)
   >>> correct_positives = bob.measure.correctly_classified_positives(positives, T)
   >>> FRR = 1 - (float(correct_positives.sum())/positives.size)

We also provide a method to calculate the FAR and FRR in a single shot:

.. doctest::

   >>> FAR, FRR = bob.measure.farfrr(negatives, positives, T)
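
From these two rates, the HTER at that threshold follows directly from the
first equation:

.. doctest::

   >>> HTER = (FAR + FRR) / 2.0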

The threshold ``T`` is normally calculated by looking at the distribution of
negatives and positives in a development (or validation) set, selecting a
threshold that matches a certain criterion, and applying this derived threshold
to the evaluation set. This technique gives a better overview of the
generalization of a method. We implement different techniques for the
calculation of the threshold:

* Threshold for the EER

  .. doctest::

    >>> T = bob.measure.eer_threshold(negatives, positives)

* Threshold for the minimum HTER

  .. doctest::

    >>> T = bob.measure.min_hter_threshold(negatives, positives)

* Threshold for the minimum weighted error rate (MWER) given a certain cost
  :math:`\beta`.

  .. doctest:: python

     >>> cost = 0.3 #or "beta"
     >>> T = bob.measure.min_weighted_error_rate_threshold(negatives, positives, cost)

  .. note::

     Setting the cost to 0.5 is equivalent to using
     :py:func:`bob.measure.min_hter_threshold`.

.. note::
   Many functions in ``bob.measure`` have an ``is_sorted`` parameter, which
   defaults to ``False``. These functions, however, need the ``positives``
   and/or ``negatives`` scores to be sorted. If the scores are not sorted in
   ascending order, they will be copied internally -- twice! To avoid this
   copy, you may sort the scores in ascending order beforehand, e.g.:

   .. doctest:: python

      >>> negatives.sort()
      >>> positives.sort()
      >>> t = bob.measure.min_weighted_error_rate_threshold(negatives, positives, cost, is_sorted=True)
      >>> assert T == t

Identification
--------------

For identification, the Recognition Rate is one of the standard measures.
To compute recognition rates, you can use the :py:func:`bob.measure.recognition_rate` function.
This function expects a relatively complex data structure, the same one used for the `CMC`_ below.
For each probe item, the scores for negative and positive comparisons are computed and collected for all probe items:

.. doctest::

   >>> rr_scores = []
   >>> for probe in range(10):
   ...   pos = numpy.random.normal(1, 1, 1)
   ...   neg = numpy.random.normal(0, 1, 19)
   ...   rr_scores.append((neg, pos))
   >>> rr = bob.measure.recognition_rate(rr_scores, rank=1)
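
The same structure can be evaluated at other ranks. For example, the rank-5
recognition rate counts a probe as correctly identified whenever its positive
score is among the five highest scores for that probe:

.. doctest::

   >>> rr5 = bob.measure.recognition_rate(rr_scores, rank=5)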

For open-set identification, Li and Jain (2005) define two different error measures.
The first measure is the :py:func:`bob.measure.detection_identification_rate`, which counts the number of correctly classified in-gallery probe items.
The second measure is the :py:func:`bob.measure.false_alarm_rate`, which counts how often an out-of-gallery probe item was incorrectly accepted.
Both rates can be computed using the same data structure, with one addition: both functions require that at least one probe item exists that has no corresponding gallery item, i.e., one for which the positives are empty or ``None``.
Continuing from the example above:

.. doctest::

   >>> for probe in range(10):
   ...   pos = None
   ...   neg = numpy.random.normal(-2, 1, 10)
   ...   rr_scores.append((neg, pos))
   >>> dir = bob.measure.detection_identification_rate(rr_scores, threshold=0, rank=1)
   >>> far = bob.measure.false_alarm_rate(rr_scores, threshold=0)

Confidence interval
-------------------

A confidence interval for a parameter :math:`x` consists of a lower
estimate :math:`L` and an upper estimate :math:`U`, such that the probability
of the true value being within the interval estimate is equal to :math:`\alpha`.
For example, a 95% confidence interval (i.e. :math:`\alpha = 0.95`) for a
parameter :math:`x` is given by :math:`[L, U]` such that

.. math:: P(x \in [L, U]) = 95\%

The smaller the sample size, the wider the confidence interval will be, and
the greater the confidence level :math:`\alpha`, the wider the confidence
interval will be.

`The Clopper-Pearson interval`_, a common method for calculating confidence
intervals, is a function of the number of successes, the number of trials, and
the confidence value :math:`\alpha`. It is implemented as
:py:func:`bob.measure.utils.confidence_for_indicator_variable` and is based on
the cumulative probabilities of the binomial distribution. This method is
quite conservative, meaning that the true coverage rate of a 95%
Clopper-Pearson interval may be well above 95%.

For example, suppose we want to evaluate the reliability of a system that
identifies registered persons. Let's say that among 10,000 accepted
transactions, 9856 are true matches. The 95% confidence interval for the true
match rate is then:

.. doctest:: python

    >>> bob.measure.utils.confidence_for_indicator_variable(9856, 10000)
    (0.98306835053282549, 0.98784270928084694)

meaning there is a 95% probability that the true match rate lies inside
:math:`[0.983, 0.988]`.
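
The same function can be used to attach a confidence interval to any
proportion-like measure. As a sketch, with hypothetical counts of false
non-matches over genuine trials:

.. doctest:: python

    >>> errors, trials = 28, 451  # hypothetical counts from an FNMR measurement
    >>> lower, upper = bob.measure.utils.confidence_for_indicator_variable(errors, trials)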

Plotting
--------

An image is worth a thousand words, they say. You can combine the capabilities
of `Matplotlib`_ with |project| to plot a number of curves; you must have that
package installed, though. In this section we describe a few recipes.

ROC
===

The Receiver Operating Characteristic (ROC) curve is one of the oldest plots in
town. To plot an ROC curve, in possession of your **negatives** and
**positives**, just do something along the lines of:

.. doctest::

   >>> from matplotlib import pyplot
   >>> # we assume you have your negatives and positives already split
   >>> npoints = 100
   >>> bob.measure.plot.roc(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test') # doctest: +SKIP
   >>> pyplot.xlabel('FAR (%)') # doctest: +SKIP
   >>> pyplot.ylabel('FRR (%)') # doctest: +SKIP
   >>> pyplot.grid(True)
   >>> pyplot.show() # doctest: +SKIP

You should see an image like the following one:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   positives = numpy.random.normal(1,1,100)
   negatives = numpy.random.normal(-1,1,100)
   npoints = 100
   bob.measure.plot.roc(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test')
   pyplot.grid(True)
   pyplot.xlabel('FAR (%)')
   pyplot.ylabel('FRR (%)')
   pyplot.title('ROC')

As can be observed, plotting methods live in the namespace
:py:mod:`bob.measure.plot`. They work like
:py:func:`matplotlib.pyplot.plot` itself, except that, instead of receiving
the x and y point coordinates as parameters, they receive the two
:py:class:`numpy.ndarray` objects with negatives and positives, as well as an
indication of the number of points the curve must contain.

As with the :py:func:`matplotlib.pyplot.plot` command, you can pass optional
parameters for the line as shown in the example, to set up its color, shape
and even the label. For an overview of the accepted keywords, please refer to
the `Matplotlib`_ documentation. Other plot properties, such as the plot
title, axis labels, grids and legends, should be controlled directly through
the relevant `Matplotlib`_ functionality.

DET
===

A DET curve can be drawn using commands similar to the ones for the ROC curve:

.. doctest::

  >>> from matplotlib import pyplot
  >>> # we assume you have your negatives and positives already split
  >>> npoints = 100
  >>> bob.measure.plot.det(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test') # doctest: +SKIP
  >>> bob.measure.plot.det_axis([0.01, 40, 0.01, 40]) # doctest: +SKIP
  >>> pyplot.xlabel('FAR (%)') # doctest: +SKIP
  >>> pyplot.ylabel('FRR (%)') # doctest: +SKIP
  >>> pyplot.grid(True)
  >>> pyplot.show() # doctest: +SKIP

This will produce an image like the following one:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   positives = numpy.random.normal(1,1,100)
   negatives = numpy.random.normal(-1,1,100)

   npoints = 100
   bob.measure.plot.det(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test')
   bob.measure.plot.det_axis([0.1, 80, 0.1, 80])
   pyplot.grid(True)
   pyplot.xlabel('FAR (%)')
   pyplot.ylabel('FRR (%)')
   pyplot.title('DET')

.. note::

  If you wish to reset axis zooming, you must use the Gaussian scale rather
  than the visual marks shown on the plot, which are there for display
  purposes only. The real axis scale is based on the
  :py:func:`bob.measure.ppndf` method. For example, if you wish to set the x
  and y axes to display data between 1% and 40%, here is the recipe:

  .. doctest::

    >>> #AFTER you plot the DET curve, just set the axis in this way:
    >>> pyplot.axis([bob.measure.ppndf(k/100.0) for k in (1, 40, 1, 40)]) # doctest: +SKIP

  We provide a convenient way for you to do the above in this module. So,
  optionally, you may use the :py:func:`bob.measure.plot.det_axis` method
  like this:

  .. doctest::

    >>> bob.measure.plot.det_axis([1, 40, 1, 40]) # doctest: +SKIP

EPC
===

366
Drawing an EPC requires that both the development set negatives and positives are provided alongside
the evaluation set ones. Because of this, the API is slightly modified:

.. doctest::

  >>> bob.measure.plot.epc(dev_neg, dev_pos, test_neg, test_pos, npoints, color=(0,0,0), linestyle='-') # doctest: +SKIP
  >>> pyplot.show() # doctest: +SKIP

This will produce an image like the following one:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   dev_pos = numpy.random.normal(1,1,100)
   dev_neg = numpy.random.normal(-1,1,100)
   test_pos = numpy.random.normal(0.9,1,100)
   test_neg = numpy.random.normal(-1.1,1,100)
   npoints = 100
   bob.measure.plot.epc(dev_neg, dev_pos, test_neg, test_pos, npoints, color=(0,0,0), linestyle='-')
   pyplot.grid(True)
   pyplot.title('EPC')


CMC
===

The Cumulative Match Characteristics (CMC) curve estimates the probability that
the correct model is in the *N* models with the highest similarity to a given
probe.  A CMC curve can be plotted using the :py:func:`bob.measure.plot.cmc`
function.  The CMC can be calculated from a relatively complex data structure,
which defines a pair of positive and negative scores **per probe**:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   cmc_scores = []
   for probe in range(10):
     positives = numpy.random.normal(1, 1, 1)
     negatives = numpy.random.normal(0, 1, 19)
     cmc_scores.append((negatives, positives))
   bob.measure.plot.cmc(cmc_scores, logx=False)
   pyplot.grid(True)
   pyplot.title('CMC')
   pyplot.xlabel('Rank')
   pyplot.xticks([1,5,10,20])
   pyplot.xlim([1,20])
   pyplot.ylim([0,100])
   pyplot.ylabel('Probability of Recognition (%)')

Usually, there is only a single positive score per probe, but this is not a fixed restriction.
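
If you need the curve values themselves, e.g., to read off the recognition
rate at several ranks, the underlying :py:func:`bob.measure.cmc` function,
which the plot above wraps, accepts the same data structure (a sketch; we
assume it returns one recognition probability per rank):

.. doctest::

   >>> curve = bob.measure.cmc(cmc_scores) # doctest: +SKIP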


Detection & Identification Curve
================================

The detection & identification curve is designed to evaluate open-set
identification tasks.  It can be plotted using the
:py:func:`bob.measure.plot.detection_identification_curve` function, but it
requires at least one open-set probe, i.e., one for which no corresponding
positive score exists, from which the FAR values are computed.  Here, we plot
the detection and identification curve for rank 1, so that the recognition
rate at FAR=1 will be identical to the rank-1
:py:func:`bob.measure.recognition_rate` obtained in the CMC plot above.

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   cmc_scores = []
   for probe in range(1000):
     positives = numpy.random.normal(1, 1, 1)
     negatives = numpy.random.normal(0, 1, 19)
     cmc_scores.append((negatives, positives))
   for probe in range(1000):
     negatives = numpy.random.normal(-1, 1, 10)
     cmc_scores.append((negatives, None))

   bob.measure.plot.detection_identification_curve(cmc_scores, rank=1, logx=True)
   pyplot.xlabel('False Alarm Rate')
   pyplot.xlim([0.0001, 1])
   pyplot.ylabel('Detection & Identification Rate (%)')
   pyplot.ylim([0,1])



Fine-tuning
===========

The methods inside :py:mod:`bob.measure.plot` are only provided as a
`Matplotlib`_ wrapper to equivalent methods in :py:mod:`bob.measure` that can
only calculate the points without doing any plotting. You may prefer to tweak
the plotting or even use a different plotting system such as gnuplot. Have a
look at the implementations in :py:mod:`bob.measure.plot` to understand how to
use the |project| methods to compute the curves and integrate them in the way
that best suits you.
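
For instance, here is a sketch of a fully manual ROC plot. It assumes that
:py:func:`bob.measure.roc`, on which the plotting wrapper relies, returns the
curve as a ``(2, npoints)`` array of corresponding FAR and FRR values:

.. doctest::

   >>> from matplotlib import pyplot
   >>> points = bob.measure.roc(negatives, positives, 100) # doctest: +SKIP
   >>> pyplot.plot(points[0,:], points[1,:], color=(0,0,0), linestyle='-') # doctest: +SKIP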

.. _bob.measure.command_line:

Full applications
-----------------

Commands under ``bob measure`` can be used to quickly evaluate a set of
scores and generate plots. We present these commands in this section. They
take as input the generic two-column data format specified in the
documentation of :py:func:`bob.measure.load.split`.

Metrics
=======

To calculate the threshold using a certain criterion (EER, the default, or
minimum HTER) on a set, after setting up |project|, just do:

.. code-block:: sh

  $ bob measure metrics dev-1.txt
  [Min. criterion: EER] Threshold on Development set `dev-1.txt`: -8.025286e-03
  ====  ===================
  ..    Development dev-1
  ====  ===================
  FtA   0.000%
  FMR   6.263% (31/495)
  FNMR  6.208% (28/451)
  FAR   5.924%
  FRR   11.273%
  HTER  8.599%
  ====  ===================

The output presents the threshold together with the FtA, FMR, FNMR, FAR, FRR
and HTER on the given set, calculated using that threshold. The relative
counts of FAs and FRs are also displayed in parentheses.

.. note::
    Several score files can be given at once, and the metrics will be computed
    for each of them separately. Development and evaluation files must be
    given in pairs. When only development files are provided, the
    ``--no-evaluation`` flag must be given.


To evaluate the performance of a new score file with a given threshold, use
``--thres``:

.. code-block:: sh

  $ bob measure metrics --thres 0.006 eval-1.txt
  [Min. criterion: user provider] Threshold on Development set `eval-1`: 6.000000e-03
  ====  ====================
  ..    Development eval-1
  ====  ====================
  FtA   0.000%
  FMR   5.010% (24/479)
  FNMR  6.977% (33/473)
  FAR   4.770%
  FRR   11.442%
  HTER  8.106%
  ====  ====================

You can simultaneously conduct the threshold computation and evaluate its
performance on an evaluation set:

.. code-block:: sh

  $ bob measure metrics dev-1.txt eval-1.txt
  [Min. criterion: EER] Threshold on Development set `dev-1`: -8.025286e-03
  ====  ===================  ===============
  ..    Development dev-1    Eval. eval-1
  ====  ===================  ===============
  FtA   0.000%               0.000%
  FMR   6.263% (31/495)      5.637% (27/479)
  FNMR  6.208% (28/451)      6.131% (29/473)
  FAR   5.924%               5.366%
  FRR   11.273%              10.637%
  HTER  8.599%               8.001%
  ====  ===================  ===============

.. note::
    The table format can be changed using the ``--tablefmt`` option; the
    default format is ``rst``. Please refer to ``bob measure metrics --help``
    for more details.


Plots
=====

Customizable plotting commands are available in the :py:mod:`bob.measure` module.
They take a list of development and/or evaluation files and generate a single PDF
file containing the plots. Available plots are:

*  ``roc`` (receiver operating characteristic)

*  ``det`` (detection error trade-off)

*  ``epc`` (expected performance curve)

*  ``hist`` (histograms of positive and negatives)

Use the ``--help`` option on the above-cited commands to find out about more
options.

For example, to generate a DET curve from development and evaluation datasets:

.. code-block:: sh

    $ bob measure det -v --output 'my_det.pdf' dev-1.txt eval-1.txt dev-2.txt eval-2.txt

where ``my_det.pdf`` will contain DET plots for the two experiments.

.. note::
    By default, ``det`` and ``roc`` plot development and evaluation curves on
    different plots. You can force everything to be gathered in the same plot
    using the ``--no-split`` option.

.. note::
    The ``--figsize`` and ``--style`` options are two powerful options that can
    dramatically change the appearance of your figures. Try them! (e.g.
    ``--figsize 12,10 --style grayscale``)

Evaluate
========

A convenient command, ``evaluate``, is provided to generate multiple metrics
and plots for a list of experiments. It generates two ``metrics`` outputs,
with EER and min-HTER criteria, along with ``roc``, ``det``, ``epc`` and
``hist`` plots for each experiment. For example:

.. code-block:: sh

    $ bob measure evaluate -v -l 'my_metrics.txt' -o 'my_plots.pdf' {sys1,sys2}/{eval,dev}

will output metrics and plots for the two experiments (dev and eval pairs) in
``my_metrics.txt`` and ``my_plots.pdf``, respectively.

.. include:: links.rst

.. Place your references here:

.. _`The Expected Performance Curve`: http://publications.idiap.ch/downloads/reports/2005/bengio_2005_icml.pdf
.. _`The DET curve in assessment of detection task performance`: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117.4489&rep=rep1&type=pdf
.. _`The Clopper-Pearson interval`: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper-Pearson_interval