.. vim: set fileencoding=utf-8 :
.. Andre Anjos
.. Tue 15 Oct 17:41:52 2013

.. testsetup:: *

   import numpy
   positives = numpy.random.normal(1,1,100)
   negatives = numpy.random.normal(-1,1,100)
   import matplotlib
   if not hasattr(matplotlib, 'backends'):
     matplotlib.use('pdf') #non-interactive avoids exception on display
   import bob.measure

============
User Guide
============

Methods in the :py:mod:`bob.measure` module can help you quickly and easily
evaluate error for multi-class or binary classification problems. If you are
not yet familiar with aspects of performance evaluation, we recommend the
following papers and book chapters for an overview of some of the implemented
methods.

* Bengio, S., Keller, M., Mariéthoz, J. (2004). `The Expected Performance
  Curve`_. International Conference on Machine Learning ICML Workshop on ROC
  Analysis in Machine Learning, 136(1), 1963–1966.
* Martin, A., Doddington, G., Kamm, T., Ordowski, M., & Przybocki, M. (1997).
  `The DET curve in assessment of detection task performance`_. Fifth European
  Conference on Speech Communication and Technology (pp. 1895-1898).
* Li, S., Jain, A.K. (2005), Handbook of Face Recognition, Chapter 14,
  Springer

Overview
--------

A classifier is subject to two types of errors: either the real access/signal
is rejected (false rejection) or an impostor attack/a false access is accepted
(false acceptance). A possible way to measure the detection performance is to
use the Half Total Error Rate (HTER), which combines the False Rejection Rate
(FRR) and the False Acceptance Rate (FAR) and is defined by the following
formula:

.. math::

   HTER(\tau, \mathcal{D}) = \frac{FAR(\tau, \mathcal{D}) + FRR(\tau, \mathcal{D})}{2} \quad \textrm{[\%]}

where :math:`\mathcal{D}` denotes the dataset used. Since both the FAR and the
FRR depend on the threshold :math:`\tau`, they are strongly related to each
other: increasing the FAR will reduce the FRR and vice-versa.
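
The HTER definition above can be illustrated with a minimal, library-independent
sketch. The helper names ``far_frr`` and ``hter`` and the toy scores are
illustrative assumptions, not part of ``bob.measure``. The sketch also assumes
that a score is accepted when it is greater than or equal to the threshold, and
it reports rates as fractions in [0, 1] rather than percentages.

```python
def far_frr(negatives, positives, threshold):
    """Return (FAR, FRR) as fractions in [0, 1] for one threshold.

    Assumption: a score is accepted when it is >= threshold.
    """
    false_accepts = sum(1 for s in negatives if s >= threshold)
    false_rejects = sum(1 for s in positives if s < threshold)
    return false_accepts / len(negatives), false_rejects / len(positives)


def hter(negatives, positives, threshold):
    """Half Total Error Rate: the mean of FAR and FRR."""
    far, frr = far_frr(negatives, positives, threshold)
    return (far + frr) / 2.0


# Toy scores: negatives lie mostly to the left of positives.
negatives = [-2.0, -1.5, -1.0, -0.5, 0.5]
positives = [-0.25, 0.75, 1.0, 1.5, 2.0]

# Sweep a few thresholds to see FAR fall while FRR rises.
for threshold in (-1.0, 0.0, 1.0):
    far, frr = far_frr(negatives, positives, threshold)
    print(threshold, far, frr, hter(negatives, positives, threshold))
```

Raising the threshold in this toy example lowers the FAR and raises the FRR,
which is the dependency between the two rates described above.
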

Because of this trade-off, results are often presented using either a Receiver
Operating Characteristic (ROC) or a Detection-Error Tradeoff (DET) plot. These
two plots present the FAR versus the FRR for different values of the
threshold. Another widely used measure to summarise the performance of a system
is the Equal Error Rate (EER), defined as the point along the ROC or DET curve
where the FAR equals the FRR.

However, it was noted by Bengio et al. (2004) that ROC and DET curves may be
misleading when comparing systems. Hence, the so-called Expected Performance
Curve (EPC) was proposed; it consists of an unbiased estimate of the reachable
performance of a system at various operating points. Indeed, in real-world
scenarios, the threshold :math:`\tau` has to be set a priori: this is typically
done using a development set (also called cross-validation set). Nevertheless,
the optimal threshold can be different depending on the relative importance
given to the FAR and the FRR. Hence, in the EPC framework, the cost
:math:`\beta \in [0;1]` is defined as the trade-off between the FAR and FRR.
The optimal threshold :math:`\tau^*` is then computed using different values of
:math:`\beta`, corresponding to different operating points:

.. math::

   \tau^{*} = \arg\!\min_{\tau} \quad \beta \cdot \textrm{FAR}(\tau, \mathcal{D}_{d}) + (1-\beta) \cdot \textrm{FRR}(\tau, \mathcal{D}_{d})

where :math:`\mathcal{D}_{d}` denotes the development set and should be
completely separate from the evaluation set :math:`\mathcal{D}`.

Performance for different values of :math:`\beta` is then computed on the test
set :math:`\mathcal{D}_{t}` using the previously derived threshold. Note that
setting :math:`\beta` to 0.5 yields the Half Total Error Rate (HTER) as
defined in the first equation.

.. note::

   Most of the methods available in this module require as input a set of 2
   :py:class:`numpy.ndarray` objects that contain the scores obtained by the
   classification system to be evaluated, in no specific order. Most of the
   classes are defined to deal with two-class problems. Therefore, in this
   setting, and throughout this manual, we have defined that the **negatives**
   represent the impostor attacks or false class accesses (that is, when a
   sample of class A is given to the classifier of another class, such as
   class B). The second set, referred to as the **positives**, represents the
   true class accesses or signal response of the classifier. The vectors are
   called this way because the procedures implemented in this module expect
   the scores of **negatives** to be statistically distributed to the left of
   the signal scores (the **positives**). If that is not the case, one should
   either invert the input to the methods or multiply all scores available by
   -1, in order to have them inverted.

   The input to create these two vectors is generated by experiments conducted
   by the user and normally sits in files that may need some parsing before
   these vectors can be extracted.

   In the remainder of this section we assume you have successfully parsed and
   loaded your scores in two 1D float64 vectors and are ready to evaluate the
   performance of the classifier.

Verification
------------

To count the number of correctly classified positives and negatives you can use
the following techniques:

.. doctest::

   >>> # negatives, positives = parse_my_scores(...) # write parser if not provided!
   >>> T = 0.0 #Threshold: later we explain how one can calculate these
   >>> correct_negatives = bob.measure.correctly_classified_negatives(negatives, T)
   >>> FAR = 1 - (float(correct_negatives.sum())/negatives.size)
   >>> correct_positives = bob.measure.correctly_classified_positives(positives, T)
   >>> FRR = 1 - (float(correct_positives.sum())/positives.size)

We do provide a method to calculate the FAR and FRR in a single shot:

.. doctest::

   >>> FAR, FRR = bob.measure.farfrr(negatives, positives, T)

The threshold ``T`` is normally calculated by looking at the distribution of
negatives and positives in a development (or validation) set, selecting a
threshold that matches a certain criterion and applying this derived threshold
to the test (or evaluation) set. This technique gives a better overview of the
generalization of a method. We implement different techniques for the
calculation of the threshold:

* Threshold for the EER

  .. doctest::

     >>> T = bob.measure.eer_threshold(negatives, positives)

* Threshold for the minimum HTER

  .. doctest::

     >>> T = bob.measure.min_hter_threshold(negatives, positives)

* Threshold for the minimum weighted error rate (MWER) given a certain cost
  :math:`\beta`.

  .. doctest:: python

     >>> cost = 0.3 #or "beta"
     >>> T = bob.measure.min_weighted_error_rate_threshold(negatives, positives, cost)

  .. note::

     Setting ``cost`` to 0.5 is equivalent to using
     :py:func:`bob.measure.min_hter_threshold`.

.. note::

   Many functions in ``bob.measure`` have an ``is_sorted`` parameter, which
   defaults to ``False`` throughout. However, these functions need sorted
   positive and/or negative scores. If the scores are not sorted in ascending
   order, they will be copied internally, twice! To avoid copying the scores,
   you might want to sort them in ascending order, e.g., by:

   .. doctest:: python

      >>> negatives.sort()
      >>> positives.sort()
      >>> t = bob.measure.min_weighted_error_rate_threshold(negatives, positives, cost, is_sorted = True)
      >>> assert T == t

Identification
--------------

For identification, the Recognition Rate is one of the standard measures.
To compute recognition rates, you can use the
:py:func:`bob.measure.recognition_rate` function. This function expects a
relatively complex data structure, which is the same as for the CMC_ below.
For each probe item, the scores for negative and positive comparisons are
computed, and collected for all probe items:

.. doctest::

   >>> rr_scores = []
   >>> for probe in range(10):
   ...     pos = numpy.random.normal(1, 1, 1)
   ...     neg = numpy.random.normal(0, 1, 19)
   ...     rr_scores.append((neg, pos))
   >>> rr = bob.measure.recognition_rate(rr_scores, rank=1)

For open set identification, according to Li and Jain (2005) there are two
different error measures defined. The first measure is the
:py:func:`bob.measure.detection_identification_rate`, which counts the number
of correctly classified in-gallery probe items. The second measure is the
:py:func:`bob.measure.false_alarm_rate`, which counts how often an
out-of-gallery probe item was incorrectly accepted. Both rates can be computed
using the same data structure, with one exception. Both functions require that
at least one probe item exists which has no corresponding gallery item, i.e.,
where the positives are empty or ``None``:

(continued from above...)

.. doctest::

   >>> for probe in range(10):
   ...     pos = None
   ...     neg = numpy.random.normal(-2, 1, 10)
   ...     rr_scores.append((neg, pos))
   >>> dir = bob.measure.detection_identification_rate(rr_scores, threshold = 0, rank=1)
   >>> far = bob.measure.false_alarm_rate(rr_scores, threshold = 0)

Confidence interval
-------------------

A confidence interval for parameter :math:`x` consists of a lower estimate
:math:`L`, and an upper estimate :math:`U`, such that the probability of the
true value being within the interval estimate is equal to :math:`\alpha`. For
example, a 95% confidence interval (i.e. :math:`\alpha = 0.95`) for a parameter
:math:`x` is given by :math:`[L, U]` such that

.. math::

   Prob(x \in [L,U]) = 95\%

The smaller the test size, the wider the confidence interval will be, and the
greater :math:`\alpha`, the wider the confidence interval will be, since a
higher coverage probability requires a larger interval.

`The Clopper-Pearson interval`_, a common method for calculating confidence
intervals, is a function of the number of successes, the number of trials and
the confidence value :math:`\alpha`. It is available as
:py:func:`bob.measure.utils.confidence_for_indicator_variable`.
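
To make the dependence on successes, trials and :math:`\alpha` concrete, here is
a rough, standard-library-only sketch of a Clopper-Pearson computation. It is
not the ``bob.measure`` implementation: the names ``binom_cdf``,
``solve_increasing`` and ``clopper_pearson`` are hypothetical, the bounds are
found by simple bisection on binomial tail probabilities, and ``alpha`` follows
this section's convention of being the coverage (for example 0.95).

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def solve_increasing(f, target, iterations=100):
    """Bisection on [0, 1] for f(p) == target, where f increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(successes, trials, alpha=0.95):
    """Return (L, U) with nominal coverage alpha (assumed convention)."""
    tail = (1.0 - alpha) / 2.0
    # Lower bound: P(X >= successes | p) == tail; this tail grows with p.
    if successes == 0:
        lower = 0.0
    else:
        lower = solve_increasing(
            lambda p: 1.0 - binom_cdf(successes - 1, trials, p), tail)
    # Upper bound: P(X <= successes | p) == tail; the CDF shrinks with p,
    # so solve for 1 - CDF, which grows with p.
    if successes == trials:
        upper = 1.0
    else:
        upper = solve_increasing(
            lambda p: 1.0 - binom_cdf(successes, trials, p), 1.0 - tail)
    return lower, upper


low, high = clopper_pearson(successes=90, trials=100, alpha=0.95)
print(low, high)
```

Calling the sketch with a larger ``alpha`` or with fewer trials gives a wider
interval, in line with the behaviour described above.
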

The Clopper-Pearson interval is based on the cumulative probabilities of the
binomial distribution. This method is quite conservative, meaning that the true
coverage rate of a 95% Clopper-Pearson interval may be well above 95%.

Plotting
--------

An image is worth 1000 words, they say. You can combine the capabilities of
Matplotlib_ with |project| to plot a number of curves. However, you must have
that package installed. In this section we describe a few recipes.

ROC
===

The Receiver Operating Characteristic (ROC) curve is one of the oldest plots in
town. To plot an ROC curve, in possession of your **negatives** and
**positives**, just do something along the lines of:

.. doctest::

   >>> from matplotlib import pyplot
   >>> # we assume you have your negatives and positives already split
   >>> npoints = 100
   >>> bob.measure.plot.roc(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test') # doctest: +SKIP
   >>> pyplot.xlabel('FAR (%)') # doctest: +SKIP
   >>> pyplot.ylabel('FRR (%)') # doctest: +SKIP
   >>> pyplot.grid(True)
   >>> pyplot.show() # doctest: +SKIP

You should see an image like the following one:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   positives = numpy.random.normal(1,1,100)
   negatives = numpy.random.normal(-1,1,100)
   npoints = 100
   bob.measure.plot.roc(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test')
   pyplot.grid(True)
   pyplot.xlabel('FAR (%)')
   pyplot.ylabel('FRR (%)')
   pyplot.title('ROC')

As can be observed, plotting methods live in the namespace
:py:mod:`bob.measure.plot`. They work like
:py:func:`matplotlib.pyplot.plot` itself, except that instead of receiving the
x and y point coordinates as parameters, they receive the two
:py:class:`numpy.ndarray` arrays with negatives and positives, as well as an
indication of the number of points the curve must contain.
As in the :py:func:`matplotlib.pyplot.plot` command, you can pass optional
parameters for the line, as shown in the example, to set up its color, shape
and even the label. For an overview of the keywords accepted, please refer to
the Matplotlib_ documentation. Other plot properties such as the plot title,
axis labels, grids and legends should be controlled directly using the relevant
Matplotlib_ controls.

DET
===

A DET curve can be drawn using commands similar to the ones for the ROC curve:

.. doctest::

   >>> from matplotlib import pyplot
   >>> # we assume you have your negatives and positives already split
   >>> npoints = 100
   >>> bob.measure.plot.det(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test') # doctest: +SKIP
   >>> bob.measure.plot.det_axis([0.01, 40, 0.01, 40]) # doctest: +SKIP
   >>> pyplot.xlabel('FAR (%)') # doctest: +SKIP
   >>> pyplot.ylabel('FRR (%)') # doctest: +SKIP
   >>> pyplot.grid(True)
   >>> pyplot.show() # doctest: +SKIP

This will produce an image like the following one:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   positives = numpy.random.normal(1,1,100)
   negatives = numpy.random.normal(-1,1,100)

   npoints = 100
   bob.measure.plot.det(negatives, positives, npoints, color=(0,0,0), linestyle='-', label='test')
   bob.measure.plot.det_axis([0.1, 80, 0.1, 80])
   pyplot.grid(True)
   pyplot.xlabel('FAR (%)')
   pyplot.ylabel('FRR (%)')
   pyplot.title('DET')

.. note::

   If you wish to reset axis zooming, you must use the Gaussian scale rather
   than the visual marks shown on the plot, which are just there for display
   purposes. The real axis scale is based on the
   :py:func:`bob.measure.ppndf` method. For example, if you wish to set the x
   and y axis to display data between 1% and 40%, here is the recipe:

   .. doctest::

      >>> #AFTER you plot the DET curve, just set the axis in this way:
      >>> pyplot.axis([bob.measure.ppndf(k/100.0) for k in (1, 40, 1, 40)]) # doctest: +SKIP

   We provide a convenient way for you to do the above in this module. So,
   optionally, you may use the ``bob.measure.plot.det_axis`` method like this:

   .. doctest::

      >>> bob.measure.plot.det_axis([1, 40, 1, 40]) # doctest: +SKIP

EPC
===

Drawing an EPC requires that both the development set negatives and positives
are provided alongside the test (or evaluation) set ones. Because of this, the
API is slightly modified:

.. doctest::

   >>> bob.measure.plot.epc(dev_neg, dev_pos, test_neg, test_pos, npoints, color=(0,0,0), linestyle='-') # doctest: +SKIP
   >>> pyplot.show() # doctest: +SKIP

This will produce an image like the following one:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   dev_pos = numpy.random.normal(1,1,100)
   dev_neg = numpy.random.normal(-1,1,100)
   test_pos = numpy.random.normal(0.9,1,100)
   test_neg = numpy.random.normal(-1.1,1,100)
   npoints = 100
   bob.measure.plot.epc(dev_neg, dev_pos, test_neg, test_pos, npoints, color=(0,0,0), linestyle='-')
   pyplot.grid(True)
   pyplot.title('EPC')

CMC
===

The Cumulative Match Characteristics (CMC) curve estimates the probability that
the correct model is among the *N* models with the highest similarity to a
given probe. A CMC curve can be plotted using the
:py:func:`bob.measure.plot.cmc` function. The CMC can be calculated from a
relatively complex data structure, which defines a pair of positive and
negative scores **per probe**:

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   cmc_scores = []
   for probe in range(10):
     positives = numpy.random.normal(1, 1, 1)
     negatives = numpy.random.normal(0, 1, 19)
     cmc_scores.append((negatives, positives))
   bob.measure.plot.cmc(cmc_scores, logx=False)
   pyplot.grid(True)
   pyplot.title('CMC')
   pyplot.xlabel('Rank')
   pyplot.xticks([1,5,10,20])
   pyplot.xlim([1,20])
   pyplot.ylim([0,100])
   pyplot.ylabel('Probability of Recognition (%)')

Usually, there is only a single positive score per probe, but this is not a
fixed restriction.

Detection & Identification Curve
================================

The detection & identification curve is designed to evaluate open set
identification tasks. It can be plotted using the
:py:func:`bob.measure.plot.detection_identification_curve` function, but it
requires at least one open-set probe, i.e., one for which no corresponding
positive score exists, from which the FAR values are computed. Here, we plot
the detection and identification curve for rank 1, so that the recognition rate
for FAR=1 will be identical to the rank one
:py:func:`bob.measure.recognition_rate` obtained in the CMC plot above.

.. plot::

   import numpy
   numpy.random.seed(42)
   import bob.measure
   from matplotlib import pyplot

   cmc_scores = []
   for probe in range(1000):
     positives = numpy.random.normal(1, 1, 1)
     negatives = numpy.random.normal(0, 1, 19)
     cmc_scores.append((negatives, positives))
   for probe in range(1000):
     negatives = numpy.random.normal(-1, 1, 10)
     cmc_scores.append((negatives, None))

   bob.measure.plot.detection_identification_curve(cmc_scores, rank=1, logx=True)
   pyplot.xlabel('False Alarm Rate')
   pyplot.xlim([0.0001, 1])
   pyplot.ylabel('Detection & Identification Rate (%)')
   pyplot.ylim([0,1])

Fine-tuning
===========

The methods inside :py:mod:`bob.measure.plot` are only provided as a
Matplotlib_ wrapper to equivalent methods in :py:mod:`bob.measure` that only
calculate the points without doing any plotting.
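
As a rough illustration of what calculating the points without plotting
involves, the sketch below sweeps a threshold over toy scores and prints one
FAR/FRR pair per line. The function name ``roc_points`` is hypothetical and
this is not the ``bob.measure`` implementation; it assumes evenly spaced
thresholds between the smallest and largest score, and that a score is accepted
when it is greater than or equal to the threshold.

```python
def roc_points(negatives, positives, n_points):
    """Return a list of (FAR, FRR) pairs for n_points evenly spaced thresholds.

    Assumptions: n_points >= 2, and a score is accepted when it is
    >= threshold.
    """
    scores = list(negatives) + list(positives)
    low, high = min(scores), max(scores)
    step = (high - low) / (n_points - 1)
    points = []
    for i in range(n_points):
        threshold = low + i * step
        far = sum(1 for s in negatives if s >= threshold) / len(negatives)
        frr = sum(1 for s in positives if s < threshold) / len(positives)
        points.append((far, frr))
    return points


# Toy scores: negatives lie mostly to the left of positives.
negatives = [-2.0, -1.5, -1.0, -0.5, 0.5]
positives = [-0.25, 0.75, 1.0, 1.5, 2.0]

# Print one "FAR FRR" pair per line, a layout most plotting tools can read.
for far, frr in roc_points(negatives, positives, n_points=9):
    print(f"{far:.3f} {frr:.3f}")
```
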
You may prefer to tweak the plotting or even use a different plotting system
such as gnuplot. Have a look at the implementations in
:py:mod:`bob.measure.plot` to understand how to use the |project| methods to
compute the curves and integrate them in the way that best suits you.

.. include:: links.rst

.. Place your references here:
.. _`The Expected Performance Curve`: http://publications.idiap.ch/downloads/reports/2005/bengio_2005_icml.pdf
.. _`The DET curve in assessment of detection task performance`: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.117.4489&rep=rep1&type=pdf
.. _`The Clopper-Pearson interval`: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval#Clopper-Pearson_interval