.. vim: set fileencoding=utf-8 :
.. Andre Anjos <andre.dos.anjos@gmail.com>
.. Tue 15 Oct 17:41:52 2013

.. testsetup:: *

  import numpy
  positives = numpy.random.normal(1,1,100)
  negatives = numpy.random.normal(-1,1,100)
  import matplotlib
  if not hasattr(matplotlib, 'backends'):
    matplotlib.use('pdf')  # non-interactive backend avoids exceptions on display
  import bob.measure

============
 User Guide
============

Methods in the :py:mod:`bob.measure` module can help you to quickly and easily
evaluate the error of multi-class or binary classification problems. If you are
not yet familiar with the aspects of performance evaluation, we recommend the
following papers and book chapters for an overview of some of the implemented
methods.

* Bengio, S., Keller, M., Mariéthoz, J. (2004). `The Expected Performance
  Curve`_.  International Conference on Machine Learning ICML Workshop on ROC
  Analysis in Machine Learning, 136(1), 1963–1966.
* Martin, A., Doddington, G., Kamm, T., Ordowski, M., & Przybocki, M. (1997).
  `The DET curve in assessment of detection task performance`_. Fifth European
  Conference on Speech Communication and Technology (pp. 1895-1898).
* Li, S., Jain, A.K. (2005). `Handbook of Face Recognition`, Chapter 14, Springer.


Overview
--------

A classifier is subject to two types of errors: either the real access/signal
is rejected (false rejection), or an impostor attack/false access is accepted
(false acceptance). A possible way to measure the detection performance is to
use the Half Total Error Rate (HTER), which combines the False Rejection Rate
(FRR) and the False Acceptance Rate (FAR) and is defined by the following
formula:

.. math::

   HTER(\tau, \mathcal{D}) = \frac{FAR(\tau, \mathcal{D}) + FRR(\tau, \mathcal{D})}{2} \quad \textrm{[\%]}

where :math:`\mathcal{D}` denotes the dataset used. Since both the FAR and the
FRR depend on the threshold :math:`\tau`, they are strongly related to each
other: increasing the FAR will reduce the FRR and vice-versa. For this reason,
results are often presented using either a Receiver Operating Characteristic
(ROC) or a Detection-Error Tradeoff (DET) plot; these two plots basically
present the FAR versus the FRR for different values of the threshold. Another
widely used measure to summarise the performance of a system is the Equal
Error Rate (EER), defined as the point along the ROC or DET curve where the
FAR equals the FRR.
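
As a quick sketch using functions detailed later in this guide, the EER
threshold and the corresponding HTER can be computed directly from the two
score vectors:

.. doctest::

   >>> T = bob.measure.eer_threshold(negatives, positives)
   >>> far, frr = bob.measure.farfrr(negatives, positives, T)
   >>> hter = (far + frr) / 2.0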

However, it was noted by Bengio et al. (2004) that ROC and DET curves may be
misleading when comparing systems. Hence, the so-called Expected Performance
Curve (EPC) was proposed and consists of an unbiased estimate of the reachable
performance of a system at various operating points.  Indeed, in real-world
scenarios, the threshold :math:`\tau` has to be set a priori: this is typically
done using a development set (also called cross-validation set). Nevertheless,
the optimal threshold can be different depending on the relative importance
given to the FAR and the FRR. Hence, in the EPC framework, the cost
:math:`\beta \in [0;1]` is defined as the trade-off between the FAR and FRR.
The optimal threshold :math:`\tau^*` is then computed using different values of
:math:`\beta`, corresponding to different operating points:

.. math::
  \tau^{*} = \arg\!\min_{\tau} \quad \beta \cdot \textrm{FAR}(\tau, \mathcal{D}_{d}) + (1-\beta) \cdot \textrm{FRR}(\tau, \mathcal{D}_{d})

where :math:`\mathcal{D}_{d}` denotes the development set, which should be
completely separate from the evaluation set :math:`\mathcal{D}`.

Performance for different values of :math:`\beta` is then computed on the test
set :math:`\mathcal{D}_{t}` using the previously derived threshold. Note that
setting :math:`\beta` to 0.5 yields the Half Total Error Rate (HTER) as
defined in the first equation.
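
As a brief sketch of this procedure (for simplicity, the same score vectors
stand in here for both the development and the evaluation sets), the threshold
is derived with one function call and applied with another:

.. doctest::

   >>> beta = 0.5
   >>> tau = bob.measure.min_weighted_error_rate_threshold(negatives, positives, beta)
   >>> far, frr = bob.measure.farfrr(negatives, positives, tau)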

.. note::

  Most of the methods available in this module require as input a pair of
  :py:class:`numpy.ndarray` objects that contain the scores obtained by the
  classification system to be evaluated, in no specific order. Most of the
  classes and functions are defined to deal with two-class problems.
  Therefore, in this setting, and throughout this manual, the **negatives**
  represent the impostor attacks or false class accesses (that is, when a
  sample of class A is given to the classifier of another class, such as
  class B). The second set, referred to as the **positives**, represents the
  true class accesses or the signal response of the classifier. The vectors
  are called this way because the procedures implemented in this module
  expect the scores of the **negatives** to be statistically distributed to
  the left of the **positives**. If that is not the case, you should either
  swap the inputs to the methods or multiply all available scores by -1 to
  invert them.

  The input to create these two vectors is generated by experiments conducted
  by the user and normally sits in files that may need some parsing before
  these vectors can be extracted.
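
  How this parsing looks depends entirely on how your experiments store their
  results. As a minimal sketch, assuming a hypothetical plain-text format with
  one floating-point score per line and one file per score set, the loading
  could be as simple as:

  .. doctest::

     >>> negatives = numpy.loadtxt('negatives.txt', dtype='float64') # doctest: +SKIP
     >>> positives = numpy.loadtxt('positives.txt', dtype='float64') # doctest: +SKIP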

  In the remainder of this section we assume you have successfully parsed and
  loaded your scores into two 1D ``float64`` vectors and are ready to evaluate
  the performance of the classifier.

Verification
------------

To count the number of correctly classified positives and negatives you can use
the following techniques:

.. doctest::

   >>> # negatives, positives = parse_my_scores(...)  # write a parser if not provided!
   >>> T = 0.0  # threshold: later we explain how one can calculate it
   >>> correct_negatives = bob.measure.correctly_classified_negatives(negatives, T)
   >>> FAR = 1 - (float(correct_negatives.sum())/negatives.size)
   >>> correct_positives = bob.measure.correctly_classified_positives(positives, T)
   >>> FRR = 1 - (float(correct_positives.sum())/positives.size)

We also provide a method to calculate the FAR and FRR in a single shot:

.. doctest::

   >>> FAR, FRR = bob.measure.farfrr(negatives, positives, T)

The threshold ``T`` is normally calculated by looking at the distribution of
negatives and positives in a development (or validation) set, selecting a
threshold that matches a certain criterion and applying this derived threshold
to the test (or evaluation) set. This technique gives a better overview of the
generalization of a method. We implement different techniques for the
calculation of the threshold:

* Threshold for the EER

  .. doctest::

    >>> T = bob.measure.eer_threshold(negatives, positives)

* Threshold for the minimum HTER

  .. doctest::

    >>> T = bob.measure.min_hter_threshold(negatives, positives)

* Threshold for the minimum weighted error rate (MWER) given a certain cost
  :math:`\beta`.

  .. doctest:: python

     >>> cost = 0.3  # or "beta"
     >>> T = bob.measure.min_weighted_error_rate_threshold(negatives, positives, cost)

  .. note::

     Setting the cost to 0.5 is equivalent to using
     :py:func:`bob.measure.min_hter_threshold`.

.. note::
   Many functions in ``bob.measure`` have an ``is_sorted`` parameter, which
   defaults to ``False``. However, these functions internally need sorted
   ``positive`` and/or ``negative`` scores. If the scores are not sorted in
   ascending order, they will internally be copied -- twice! To avoid these
   copies, you may sort the scores in ascending order beforehand, e.g.:

   .. doctest:: python

      >>> negatives.sort()
      >>> positives.sort()
      >>> t = bob.measure.min_weighted_error_rate_threshold(negatives, positives, cost, is_sorted = True)
      >>> assert T == t

Identification
--------------

For identification, the Recognition Rate is one of the standard measures.
To compute recognition rates, you can use the :py:func:`bob.measure.recognition_rate` function.
This function expects a relatively complex data structure, which is the same as for the `CMC`_ below.
For each probe item, the scores for negative and positive comparisons are computed and collected over all probe items:

.. doctest::

   >>> rr_scores = []
   >>> for probe in range(10):
   ...   pos = numpy.random.normal(1, 1, 1)
   ...   neg = numpy.random.normal(0, 1, 19)
   ...   rr_scores.append((neg, pos))
   >>> rr = bob.measure.recognition_rate(rr_scores, rank=1)

For open set identification, Li and Jain (2005) define two different error measures.
The first measure is the :py:func:`bob.measure.detection_identification_rate`, which counts the number of correctly classified in-gallery probe items.
The second measure is the :py:func:`bob.measure.false_alarm_rate`, which counts how often an out-of-gallery probe item was incorrectly accepted.
Both rates can be computed using the same data structure as above, with one addition:
both functions require that at least one probe item exists that has no corresponding gallery item, i.e., where the positives are empty or ``None``:

(continued from above...)

.. doctest::

   >>> for probe in range(10):
   ...   pos = None
   ...   neg = numpy.random.normal(-2, 1, 10)
   ...   rr_scores.append((neg, pos))
   >>> dir = bob.measure.detection_identification_rate(rr_scores, threshold = 0, rank=1)
   >>> far = bob.measure.false_alarm_rate(rr_scores, threshold = 0)

Confidence interval
-------------------

A confidence interval for a parameter :math:`x` consists of a lower estimate
:math:`L` and an upper estimate :math:`U`, such that the probability of the
true value being within the interval estimate is equal to :math:`\alpha`.
For example, a 95% confidence interval (i.e. :math:`\alpha = 0.95`) for a
parameter :math:`x` is given by :math:`[L, U]` such that

.. math:: P(x \in [L, U]) = 0.95

The smaller the test set, the wider the confidence interval will be; and the
greater :math:`\alpha`, the wider the confidence interval must be as well.

`The Clopper-Pearson interval`_, a common method for calculating confidence
intervals, is a function of the number of successes, the number of trials and
the confidence value :math:`\alpha`. It is implemented as
:py:func:`bob.measure.utils.confidence_for_indicator_variable` and is based on
the cumulative probabilities of the binomial distribution. This method is
quite conservative, meaning that the true coverage rate of a 95%
Clopper-Pearson interval may be well above 95%.
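
As a brief sketch (we assume here that the function takes the number of
successes, the number of trials and the coverage value, returning a
``(lower, upper)`` pair -- please check the API documentation for the exact
signature), estimating an interval for an error rate could look like this:

.. doctest::

   >>> successes = 990  # e.g., the number of correctly classified samples
   >>> trials = 1000    # the total number of samples
   >>> lower, upper = bob.measure.utils.confidence_for_indicator_variable(successes, trials) # doctest: +SKIP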

Plotting
--------

An image is worth a thousand words, they say. You can combine the capabilities
of `Matplotlib`_ with |project| to plot a number of curves; however, you must
have that package installed. In this section we describe a few recipes.

ROC
===

The Receiver Operating Characteristic (ROC) curve is one of the oldest plots in
town. To plot an ROC curve, in possession of your **negatives** and
**positives**, just do something along the lines of:

.. doctest::

   >>> from matplotlib import pyplot
   >>> # we assume you have your negatives and positives already split
   >>> npoints = 100
   >>> bob.measure.plot.roc(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test') # doctest: +SKIP
   >>> pyplot.xlabel('FAR (%)') # doctest: +SKIP
   >>> pyplot.ylabel('FRR (%)') # doctest: +SKIP
   >>> pyplot.grid(True)
   >>> pyplot.show() # doctest: +SKIP

You should see an image like the following one:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   positives = numpy.random.normal(1,1,100)
   negatives = numpy.random.normal(-1,1,100)
   npoints = 100
   bob.measure.plot.roc(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test')
   pyplot.grid(True)
   pyplot.xlabel('FAR (%)')
   pyplot.ylabel('FRR (%)')
   pyplot.title('ROC')

As can be observed, plotting methods live in the namespace
:py:mod:`bob.measure.plot`. They work like :py:func:`matplotlib.pyplot.plot`
itself, except that instead of receiving the x and y point coordinates as
parameters, they receive the two :py:class:`numpy.ndarray` arrays with
negatives and positives, as well as an indication of the number of points the
curve must contain.

As in the :py:func:`matplotlib.pyplot.plot` command, you can pass optional
parameters for the line as shown in the example, to set up its color, shape
and even the label. For an overview of the accepted keywords, please refer to
the `Matplotlib`_ documentation. Other plot properties such as the plot title,
axis labels, grids and legends should be controlled directly using the
relevant `Matplotlib`_ controls.

DET
===

A DET curve can be drawn using commands similar to the ones for the ROC curve:

.. doctest::

  >>> from matplotlib import pyplot
  >>> # we assume you have your negatives and positives already split
  >>> npoints = 100
  >>> bob.measure.plot.det(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test') # doctest: +SKIP
  >>> bob.measure.plot.det_axis([0.01, 40, 0.01, 40]) # doctest: +SKIP
  >>> pyplot.xlabel('FAR (%)') # doctest: +SKIP
  >>> pyplot.ylabel('FRR (%)') # doctest: +SKIP
  >>> pyplot.grid(True)
  >>> pyplot.show() # doctest: +SKIP

This will produce an image like the following one:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   positives = numpy.random.normal(1,1,100)
   negatives = numpy.random.normal(-1,1,100)

   npoints = 100
   bob.measure.plot.det(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test')
   bob.measure.plot.det_axis([0.1, 80, 0.1, 80])
   pyplot.grid(True)
   pyplot.xlabel('FAR (%)')
   pyplot.ylabel('FRR (%)')
   pyplot.title('DET')

.. note::

  If you wish to reset axis zooming, you must use the Gaussian scale rather
  than the visual marks shown on the plot, which are just there for display
  purposes. The real axis scale is based on the :py:func:`bob.measure.ppndf`
  method. For example, if you wish to set the x and y axis to display data
  between 1% and 40%, here is the recipe:

  .. doctest::

    >>> # AFTER you plot the DET curve, just set the axis in this way:
    >>> pyplot.axis([bob.measure.ppndf(k/100.0) for k in (1, 40, 1, 40)]) # doctest: +SKIP

  We provide a convenient way for you to do the above in this module. So,
  optionally, you may use the :py:func:`bob.measure.plot.det_axis` method
  like this:

  .. doctest::

    >>> bob.measure.plot.det_axis([1, 40, 1, 40]) # doctest: +SKIP

EPC
===

349
Drawing an EPC requires that the development set negatives and positives are
provided alongside the test (or evaluation) set ones. Because of this, the
API is slightly different:

.. doctest::

  >>> bob.measure.plot.epc(dev_neg, dev_pos, test_neg, test_pos, npoints, color=(0,0,0), linestyle='-') # doctest: +SKIP
  >>> pyplot.show() # doctest: +SKIP

This will produce an image like the following one:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   dev_pos = numpy.random.normal(1,1,100)
   dev_neg = numpy.random.normal(-1,1,100)
   test_pos = numpy.random.normal(0.9,1,100)
   test_neg = numpy.random.normal(-1.1,1,100)
   npoints = 100
   bob.measure.plot.epc(dev_neg, dev_pos, test_neg, test_pos, npoints, color=(0,0,0), linestyle='-')
   pyplot.grid(True)
   pyplot.title('EPC')


CMC
===

The Cumulative Match Characteristics (CMC) curve estimates the probability
that the correct model is among the *N* models with the highest similarity to
a given probe.  A CMC curve can be plotted using the
:py:func:`bob.measure.plot.cmc` function.  The CMC can be calculated from a
relatively complex data structure, which defines a pair of positive and
negative scores **per probe**:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   cmc_scores = []
   for probe in range(10):
     positives = numpy.random.normal(1, 1, 1)
     negatives = numpy.random.normal(0, 1, 19)
     cmc_scores.append((negatives, positives))
   bob.measure.plot.cmc(cmc_scores, logx=False)
   pyplot.grid(True)
   pyplot.title('CMC')
   pyplot.xlabel('Rank')
   pyplot.xticks([1,5,10,20])
   pyplot.xlim([1,20])
   pyplot.ylim([0,100])
   pyplot.ylabel('Probability of Recognition (%)')

Usually, there is only a single positive score per probe, but this is not a fixed restriction.
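
Each point of the CMC curve is simply the recognition rate at the
corresponding rank, so you can also query individual points directly with
:py:func:`bob.measure.recognition_rate` by varying its ``rank`` parameter:

.. doctest::

   >>> cmc_scores = []
   >>> for probe in range(10):
   ...   pos = numpy.random.normal(1, 1, 1)
   ...   neg = numpy.random.normal(0, 1, 19)
   ...   cmc_scores.append((neg, pos))
   >>> rr_rank5 = bob.measure.recognition_rate(cmc_scores, rank=5)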


Detection & Identification Curve
================================

The detection & identification curve is designed to evaluate open set
identification tasks.  It can be plotted using the
:py:func:`bob.measure.plot.detection_identification_curve` function, but it
requires at least one open-set probe, i.e., where no corresponding positive
score exists, for which the FAR values are computed.  Here, we plot the
detection and identification curve for rank 1, so that the recognition rate for
FAR=1 will be identical to the rank one :py:func:`bob.measure.recognition_rate`
obtained in the CMC plot above.

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   cmc_scores = []
   for probe in range(1000):
     positives = numpy.random.normal(1, 1, 1)
     negatives = numpy.random.normal(0, 1, 19)
     cmc_scores.append((negatives, positives))
   for probe in range(1000):
     negatives = numpy.random.normal(-1, 1, 10)
     cmc_scores.append((negatives, None))

   bob.measure.plot.detection_identification_curve(cmc_scores, rank=1, logx=True)
   pyplot.xlabel('False Alarm Rate')
   pyplot.xlim([0.0001, 1])
   pyplot.ylabel('Detection & Identification Rate (%)')
   pyplot.ylim([0,1])



Fine-tuning
===========

The methods inside :py:mod:`bob.measure.plot` are only provided as
`Matplotlib`_ wrappers around equivalent methods in :py:mod:`bob.measure`
that only calculate the points, without doing any plotting. You may prefer to
tweak the plotting or even use a different plotting system, such as gnuplot.
Have a look at the implementations in :py:mod:`bob.measure.plot` to
understand how to use the |project| methods to compute the curves and
integrate them in the way that best suits you.
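
For example, the points that :py:func:`bob.measure.plot.roc` draws come from
:py:func:`bob.measure.roc`. As a minimal sketch (we assume here that this
function returns a 2D array with FAR values in the first row and FRR values
in the second -- please check the API documentation), you could feed the
points to any plotting backend yourself:

.. doctest::

   >>> points = bob.measure.roc(negatives, positives, 100) # doctest: +SKIP
   >>> fars, frrs = points[0], points[1] # doctest: +SKIP
   >>> pyplot.plot(fars, frrs) # doctest: +SKIP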

.. include:: links.rst

.. Place your references here:

.. _`The Expected Performance Curve`: http://publications.idiap.ch/downloads/reports/2005/bengio_2005_icml.pdf
.. _`The DET curve in assessment of detection task performance`: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117.4489&rep=rep1&type=pdf
.. _`The Clopper-Pearson interval`: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper-Pearson_interval