.. vim: set fileencoding=utf-8 :
.. Andre Anjos <andre.dos.anjos@gmail.com>
.. Tue 15 Oct 17:41:52 2013

.. testsetup:: measuretest

   import numpy
   import bob.measure

============
 User Guide
============

Methods in the :py:mod:`bob.measure` module can help you quickly and easily
evaluate the error of multi-class or binary classification problems. If you
are not yet familiar with performance evaluation, we recommend the following
papers for an overview of some of the methods implemented here.

* Bengio, S., Keller, M., Mariéthoz, J. (2004). `The Expected Performance
  Curve`_.  International Conference on Machine Learning ICML Workshop on ROC
  Analysis in Machine Learning, 136(1), 1963-1966.
* Martin, A., Doddington, G., Kamm, T., Ordowski, M., & Przybocki, M. (1997).
  `The DET curve in assessment of detection task performance`_. Fifth European
  Conference on Speech Communication and Technology (pp. 1895-1898).

Overview
--------

A classifier is subject to two types of errors: either the real access/signal
is rejected (false rejection) or an impostor attack/false access is accepted
(false acceptance). A possible way to measure the detection performance is to
use the Half Total Error Rate (HTER), which combines the False Rejection Rate
(FRR) and the False Acceptance Rate (FAR) and is defined by the following
formula:

.. math::

  HTER(\tau, \mathcal{D}) = \frac{FAR(\tau, \mathcal{D}) + FRR(\tau, \mathcal{D})}{2} \quad \textrm{[\%]}

where :math:`\mathcal{D}` denotes the dataset used. Since both the FAR and the
FRR depend on the threshold :math:`\tau`, they are strongly related to each
other: increasing the FAR will reduce the FRR and vice-versa. For this reason,
results are often presented using either a Receiver Operating Characteristic
(ROC) or a Detection-Error Tradeoff (DET) plot; these two plots present the
FAR versus the FRR for different values of the threshold. Another widely used
measure to summarise the performance of a system is the Equal Error Rate
(EER), defined as the point along the ROC or DET curve where the FAR equals
the FRR.
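
To make these definitions concrete, here is a minimal numpy sketch of the
quantities above. The score arrays and the threshold ``tau`` are made up for
illustration; the counting convention (negatives at or above the threshold
count as false acceptances, positives below it as false rejections) matches
the one used throughout this guide:

.. code-block:: python

   import numpy

   # illustrative scores: negatives distributed to the left of positives
   negatives = numpy.random.normal(-1, 1, 100)
   positives = numpy.random.normal(1, 1, 100)
   tau = 0.0  # an arbitrary threshold

   FAR = (negatives >= tau).sum() / float(negatives.size)  # accepted impostors
   FRR = (positives < tau).sum() / float(positives.size)   # rejected true accesses
   HTER = (FAR + FRR) / 2.0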

However, it was noted by Bengio et al. (2004) that ROC and DET curves may be
misleading when comparing systems. Hence, the so-called Expected Performance
Curve (EPC) was proposed, which consists of an unbiased estimate of the
reachable performance of a system at various operating points. Indeed, in
real-world scenarios, the threshold :math:`\tau` has to be set a priori: this
is typically done using a development set (also called cross-validation set).
Nevertheless, the optimal threshold can be different depending on the relative
importance given to the FAR and the FRR. Hence, in the EPC framework, a cost
:math:`\beta \in [0;1]` is defined as the tradeoff between the FAR and FRR.
The optimal threshold :math:`\tau^*` is then computed using different values
of :math:`\beta`, corresponding to different operating points:

.. math::
  \tau^{*} = \arg\!\min_{\tau} \quad \beta \cdot \textrm{FAR}(\tau, \mathcal{D}_{d}) + (1-\beta) \cdot \textrm{FRR}(\tau, \mathcal{D}_{d})

where :math:`\mathcal{D}_{d}` denotes the development set, which should be
completely separate from the evaluation set :math:`\mathcal{D}`.

Performance for different values of :math:`\beta` is then computed on the test
set :math:`\mathcal{D}_{t}` using the previously derived threshold. Note that
setting :math:`\beta` to 0.5 yields the Half Total Error Rate (HTER) as
defined in the first equation.
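
As a rough illustration of the minimization above, the following sketch
derives :math:`\tau^*` by brute force, trying every development score as a
candidate threshold. The variable names are made up; in practice you would use
:py:meth:`bob.measure.min_weighted_error_rate_threshold`, described later in
this guide:

.. code-block:: python

   import numpy

   # made-up development-set scores, for illustration only
   dev_negatives = numpy.random.normal(-1, 1, 100)
   dev_positives = numpy.random.normal(1, 1, 100)
   beta = 0.5  # equal weight for FAR and FRR

   def weighted_error_rate(tau):
     far = (dev_negatives >= tau).sum() / float(dev_negatives.size)
     frr = (dev_positives < tau).sum() / float(dev_positives.size)
     return beta * far + (1 - beta) * frr

   # every score is a candidate threshold; keep the one minimizing the cost
   candidates = numpy.concatenate((dev_negatives, dev_positives))
   tau_star = min(candidates, key=weighted_error_rate)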

.. note::

  Most of the methods available in this module require as input a set of two
  :py:class:`numpy.ndarray` objects that contain the scores obtained by the
  classification system to be evaluated, in no particular order. Most of the
  methods are defined to deal with two-class problems. Therefore, in this
  setting, and throughout this manual, we have defined that the **negatives**
  represent the impostor attacks or false class accesses of the classifier
  (that is, when a sample of class A is given to the classifier of another
  class, such as class B). The second set, referred to as the **positives**,
  represents the true class accesses or signal response of the classifier. The
  vectors are called this way because the procedures implemented in this
  module expect the scores of the **negatives** to be statistically
  distributed to the left of the signal scores (the **positives**). If that is
  not the case, one should either invert the input to the methods or multiply
  all available scores by -1, in order to have them inverted.
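
  For example, if your scores happen to be distributed the other way around, a
  two-line sketch (assuming ``negatives`` and ``positives`` are
  :py:class:`numpy.ndarray` objects) restores the expected convention:

  .. code-block:: python

     negatives = negatives * -1.0
     positives = positives * -1.0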

  The input to create these two vectors is generated by experiments conducted
  by the user and normally sits in files that may need some parsing before
  these vectors can be extracted.

  While it is not possible to provide a parser for every individual file that
  may be generated in different experimental frameworks, we do provide a few
  parsers for formats we use the most. Please refer to the documentation of
  :py:mod:`bob.measure.load` for a list of formats and details.

  In the remainder of this section we assume you have successfully parsed and
  loaded your scores in two 1D float64 vectors and are ready to evaluate the
  performance of the classifier.
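
  For example, a minimal loading sketch, assuming your scores sit in the
  4-column format handled by the parsers in :py:mod:`bob.measure.load` (the
  file name below is hypothetical):

  .. code-block:: python

     import bob
     negatives, positives = bob.measure.load.split_four_column('dev-scores-4col.txt')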

.. testsetup:: *

  import numpy
  positives = numpy.random.normal(1,1,100)
  negatives = numpy.random.normal(-1,1,100)
  import bob
  import matplotlib
  if not hasattr(matplotlib, 'backends'):
    matplotlib.use('pdf') #non-interactive avoids exception on display

Evaluation
----------

To count the number of correctly classified positives and negatives you can use
the following techniques:

.. doctest::

  >>> # negatives, positives = parse_my_scores(...) # write parser if not provided!
  >>> T = 0.0  # Threshold: later we explain how one can calculate these
  >>> correct_negatives = bob.measure.correctly_classified_negatives(negatives, T)
  >>> FAR = 1 - (float(correct_negatives.sum())/negatives.size)
  >>> correct_positives = bob.measure.correctly_classified_positives(positives, T)
  >>> FRR = 1 - (float(correct_positives.sum())/positives.size)

We do provide a method to calculate the FAR and FRR in a single shot:

.. doctest::

  >>> FAR, FRR = bob.measure.farfrr(negatives, positives, T)

The threshold ``T`` is normally calculated by looking at the distribution of
negatives and positives in a development (or validation) set, selecting a
threshold that matches a certain criterion, and applying this derived
threshold to the test (or evaluation) set. This technique gives a better
overview of the generalization of a method. We implement different techniques
for the calculation of the threshold, listed below; a sketch of the complete
development/test protocol follows the list:

* Threshold for the EER

  .. doctest::

    >>> T = bob.measure.eer_threshold(negatives, positives)

* Threshold for the minimum HTER

  .. doctest::

    >>> T = bob.measure.min_hter_threshold(negatives, positives)

* Threshold for the minimum weighted error rate (MWER) given a certain cost
  :math:`\beta`.

  .. code-block:: python

     >>> cost = 0.3 #or "beta"
     >>> T = bob.measure.min_weighted_error_rate_threshold(negatives, positives, cost)

  .. note::

    Setting the cost to 0.5 is equivalent to using
    :py:meth:`bob.measure.min_hter_threshold`.
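
Putting the pieces together, here is a sketch of the complete development/test
protocol referenced above. The variable names (``dev_neg``, ``dev_pos``,
``test_neg``, ``test_pos``) are assumptions standing for your already-loaded
development and test score sets:

.. code-block:: python

   >>> # derive the threshold on the development set...
   >>> T = bob.measure.eer_threshold(dev_neg, dev_pos)
   >>> # ...then apply it, unchanged, to the test set
   >>> far, frr = bob.measure.farfrr(test_neg, test_pos, T)
   >>> hter = (far + frr) / 2.0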

Plotting
--------

An image is worth a thousand words, they say. You can combine the capabilities
of `Matplotlib`_ with |project| to plot a number of curves; note that you must
have that package installed. In this section we describe a few recipes.

ROC
===

The Receiver Operating Characteristic (ROC) curve is one of the oldest plots in
town. To plot an ROC curve, in possession of your **negatives** and
**positives**, just do something along the lines of:

.. doctest::

  >>> from matplotlib import pyplot
  >>> # we assume you have your negatives and positives already split
  >>> npoints = 100
  >>> bob.measure.plot.roc(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test') # doctest: +SKIP
  >>> pyplot.xlabel('FRR (%)') # doctest: +SKIP
  >>> pyplot.ylabel('FAR (%)') # doctest: +SKIP
  >>> pyplot.grid(True)
  >>> pyplot.show() # doctest: +SKIP

You should see an image like the following one:

.. plot:: plot/perf_roc.py
  :include-source: False

As can be observed, plotting methods live in the namespace
:py:mod:`bob.measure.plot`. They work like `Matplotlib`_'s `plot()`_ method
itself, except that instead of receiving the x and y point coordinates as
parameters, they receive the two :py:class:`numpy.ndarray` arrays with
negatives and positives, as well as an indication of the number of points the
curve must contain.

As with `Matplotlib`_'s `plot()`_ command, you can pass optional parameters
for the line, as shown in the example, to set up its color, shape and even the
label. For an overview of the accepted keywords, please refer to the
`Matplotlib`_ documentation. Other plot properties, such as the plot title,
axis labels, grids and legends, should be controlled directly using the
relevant `Matplotlib`_ controls.

DET
===

A DET curve can be drawn using commands similar to the ones for the ROC
curve:

.. doctest::

  >>> from matplotlib import pyplot
  >>> # we assume you have your negatives and positives already split
  >>> npoints = 100
  >>> bob.measure.plot.det(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test') # doctest: +SKIP
  >>> bob.measure.plot.det_axis([0.01, 40, 0.01, 40]) # doctest: +SKIP
  >>> pyplot.xlabel('FAR (%)') # doctest: +SKIP
  >>> pyplot.ylabel('FRR (%)') # doctest: +SKIP
  >>> pyplot.grid(True)
  >>> pyplot.show() # doctest: +SKIP

This will produce an image like the following one:

.. plot:: plot/perf_det.py
  :include-source: False

.. note::

  If you wish to reset axis zooming, you must use the Gaussian scale rather
  than the visual marks shown on the plot, which are just there for display
  purposes. The real axis scale is based on the ``bob.measure.ppndf()``
  method. For example, if you wish to set the x and y axes to display data
  between 1% and 40%, here is the recipe:

  .. doctest::

    >>> # AFTER you plot the DET curve, just set the axis in this way:
    >>> pyplot.axis([bob.measure.ppndf(k/100.0) for k in (1, 40, 1, 40)]) # doctest: +SKIP

  We provide a convenient way for you to do the above in this module. So,
  optionally, you may use the ``bob.measure.plot.det_axis`` method like this:

  .. doctest::

    >>> bob.measure.plot.det_axis([1, 40, 1, 40]) # doctest: +SKIP

EPC
===

Drawing an EPC requires that both the development set negatives and positives
are provided alongside the test (or evaluation) set ones. Because of this, the
API is slightly modified:

.. doctest::

  >>> bob.measure.plot.epc(dev_neg, dev_pos, test_neg, test_pos, npoints, color=(0,0,0), linestyle='-') # doctest: +SKIP
  >>> pyplot.show() # doctest: +SKIP

This will produce an image like the following one:

.. plot:: plot/perf_epc.py
  :include-source: False

Fine-tuning
===========

The methods inside :py:mod:`bob.measure.plot` are only provided as a
`Matplotlib`_ wrapper to equivalent methods in :py:mod:`bob.measure` that only
calculate the points, without doing any plotting. You may prefer to tweak the
plotting or even use a different plotting system such as gnuplot. Have a look
at the implementations in :py:mod:`bob.measure.plot` to understand how to use
the |project| methods to compute the curves and integrate them in the way that
best suits you.
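
For instance, here is a sketch of manual ROC plotting. It assumes that
:py:meth:`bob.measure.roc` returns a 2D array whose two rows are,
respectively, the x and y coordinates consumed by
:py:meth:`bob.measure.plot.roc`; check the wrapper's source to confirm the row
order for your version:

.. code-block:: python

   from matplotlib import pyplot

   # compute the curve points only; the plotting itself is under your control
   points = bob.measure.roc(negatives, positives, 100)
   pyplot.plot(100.0 * points[0,:], 100.0 * points[1,:], color=(0,0,0), label='test')
   pyplot.xlabel('FRR (%)')
   pyplot.ylabel('FAR (%)')
   pyplot.grid(True)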

Full applications
-----------------

We do provide a few scripts that can be used to quickly evaluate a set of
scores. We present these scripts in this section. The scripts take as input
either a 4-column or 5-column data format as specified in the documentation of
:py:mod:`bob.measure.load.four_column` or
:py:mod:`bob.measure.load.five_column`.

To calculate the threshold using a certain criterion (EER, minimum HTER or
minimum weighted error rate) on a set, after setting up |project|, just do:

.. code-block:: sh

  $ bob_eval_threshold.py --scores=development-scores-4col.txt
  Threshold: -0.004787956164
  FAR : 6.731% (35/520)
  FRR : 6.667% (26/390)
  HTER: 6.699%

The output will present the threshold together with the FAR, FRR and HTER on
the given set, calculated using that threshold. The relative counts of false
acceptances and false rejections are also displayed in parentheses.

To evaluate the performance of a new score file with a given threshold, use the
application ``bob_apply_threshold.py``:

.. code-block:: sh

  $ bob_apply_threshold.py --scores=test-scores-4col.txt --threshold=-0.0047879
  FAR : 2.115% (11/520)
  FRR : 7.179% (28/390)
  HTER: 4.647%

In this case, only the error figures are presented. You can conduct the
evaluation and plotting of development and test set data using our combined
``bob_compute_perf.py`` script. You pass both sets and it does the rest:

.. code-block:: sh

  $ bob_compute_perf.py --devel=development-scores-4col.txt --test=test-scores-4col.txt
  [Min. criterium: EER] Threshold on Development set: -4.787956e-03
         | Development     | Test
  -------+-----------------+------------------
    FAR  | 6.731% (35/520) | 2.500% (13/520)
    FRR  | 6.667% (26/390) | 6.154% (24/390)
    HTER | 6.699%          | 4.327%
  [Min. criterium: Min. HTER] Threshold on Development set: 3.411070e-03
         | Development     | Test
  -------+-----------------+------------------
    FAR  | 4.231% (22/520) | 1.923% (10/520)
    FRR  | 7.949% (31/390) | 7.692% (30/390)
    HTER | 6.090%          | 4.808%
  [Plots] Performance curves => 'curves.pdf'

Inside that script, we evaluate two different thresholds based on the EER and
the minimum HTER on the development set and apply them to the test set. As can
be seen from the toy example above, the system generalizes reasonably well. A
single PDF file is generated, containing an EPC as well as ROC and DET plots
of the system.

Use the ``--help`` option on the above-cited scripts to find out about more
options.

.. include:: links.rst

.. Place your references here:

.. _`The Expected Performance Curve`: http://publications.idiap.ch/downloads/reports/2005/bengio_2005_icml.pdf
.. _`The DET curve in assessment of detection task performance`: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117.4489&rep=rep1&type=pdf
.. _`plot()`: http://matplotlib.sourceforge.net/api/pyplot_api.html#matplotlib.pyplot.plot