[credible region] Added implementation to return the credible region for the F1 score and compare 2 systems
Merge request reports
Activity
    Returns
    -------

    f1_score : (float, float, float, float)
        F1, mean, mode and credible intervals (95% CI). See `F1-score
        <https://en.wikipedia.org/wiki/F1_score>`_. It corresponds
        arithmetically to ``2*P*R/(P+R)`` or ``2*tp/(2*tp+fp+fn)``. The F1 or
        Dice score depends on a TP-only numerator, similarly to the Jaccard
        index. For regions where there are no annotations, the F1-score will
        always be zero, irrespective of the model output. Accuracy may be a
        better proxy if one needs to consider the true absence of annotations
        in a region as part of the measure.

    """

    nbsample = 100000
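Since the F1 score is a ratio of two correlated beta-distributed quantities, its credible region has no simple closed form; the ``nbsample = 100000`` hint in the diff suggests a Monte Carlo estimate. The sketch below illustrates one plausible approach (the function name, seed, and sampling strategy are assumptions for illustration, not the actual bob.measure implementation): draw posterior samples for precision and recall from beta distributions and take quantiles of the induced F1 samples.

```python
import numpy

def f1_credible_region(tp, fp, fn, lambda_=0.5, coverage=0.95, nbsample=100000):
    """Monte Carlo credible interval for the F1 score (illustrative sketch).

    Precision and recall posteriors are modelled as Beta distributions
    with a flat prior controlled by ``lambda_``; the F1 distribution is
    obtained by transforming the samples via F1 = 2*P*R/(P+R).
    """
    rng = numpy.random.default_rng(42)  # fixed seed for reproducibility
    # posterior samples for precision and recall
    p = rng.beta(tp + lambda_, fp + lambda_, size=nbsample)
    r = rng.beta(tp + lambda_, fn + lambda_, size=nbsample)
    f1 = 2 * p * r / (p + r)  # per-sample F1 score
    left = (1.0 - coverage) / 2.0
    lower, upper = numpy.quantile(f1, [left, 1.0 - left])
    return f1.mean(), lower, upper
```

For example, ``f1_credible_region(10, 2, 3)`` yields a mean close to the point estimate ``2*10/(2*10+2+3) = 0.8`` together with a 95% interval around it.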
    coverage = 0.95
    system1 = measures(tp1, fp1, tn1, fn1, lambda_, coverage)
    system2 = measures(tp2, fp2, tn2, fn2, lambda_, coverage)
    measure = ['precision', 'recall', 'specificity', 'accuracy', 'Jaccard index', 'F1 score']
    result = ""
    for i in range(len(measure)):
        result += "For the %s we can say that : \n " % (measure[i])
        if system1[i][2] > system2[i][3]:
            # lower bound from system 1 is greater than the upper bound from system 2
            result += "System 1 is better than system 2 with convincing evidence \n"
        elif system2[i][2] > system1[i][3]:
            # lower bound from system 2 is greater than the upper bound from system 1
            result += "System 2 is better than system 1 with convincing evidence \n"
        else:
            # the confidence intervals overlap so we compute the 85% confidence intervals to compare them
            # (cf. https://mmeredith.net/blog/2013/1303_Comparison_of_confidence_intervals.htm and
        beta(tn, fp, lambda_, coverage),            # specificity
        beta(tp+tn, fp+fn, lambda_, coverage),      # accuracy
        beta(tp, fp+fn, lambda_, coverage),         # jaccard index
-       beta(2*tp, fp+fn, lambda_, coverage),       # f1-score
+       f1score(tp, fp, tn, fn, lambda_, coverage), # f1-score
    )

def compare(tp1, fp1, tn1, fn1, tp2, fp2, tn2, fn2, lambda_):
        else:
            # the confidence intervals overlap so we compute the 85% confidence intervals to compare them
            # (cf. https://mmeredith.net/blog/2013/1303_Comparison_of_confidence_intervals.htm and
            # https://statisticsbyjim.com/hypothesis-testing/confidence-intervals-compare-means)
            coverage = 0.85
            system1 = measures(tp1, fp1, tn1, fn1, lambda_, coverage)
            system2 = measures(tp2, fp2, tn2, fn2, lambda_, coverage)
            if system1[i][2] > system2[i][3]:
                # lower bound from system 1 is greater than the upper bound from system 2
                result += "System 1 is better than system 2 with \"significance\" at the 5% level. \n"
            elif system2[i][2] > system1[i][3]:
                # lower bound from system 2 is greater than the upper bound from system 1
                result += "System 2 is better than system 1 with \"significance\" at the 5% level. \n"
            else:
                result += "There is no statistical difference between the 2 CIs \n"
    return result

Instead of returning a string, let's return a dictionary where the keys are the various measures, and the values encode the condition that is fulfilled (one of 5 possibilities): a tuple with the direction (">", "<" or "=") and the coverage of the credible interval used for the decision (0.95, 0.85 or None). For example:
retval["f1-score"] = (">", 0.85)
means that system 1 is better than system 2 with a 5% uncertainty considering a 0.85 CI, while

retval["precision"] = ("=", None)

means that system 1 and system 2 are comparable according to that metric.
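The suggested dictionary-based return value could be sketched as follows. Everything here is an illustrative assumption rather than the final implementation: ``beta_summary`` stands in for the package's beta credible-interval helper (using Monte Carlo quantiles instead of its actual computation, and the beta approximation of the F1 score from the old code), and the tuple layout mirrors the diff, where items ``[2]`` and ``[3]`` of each measure are the lower and upper bounds.

```python
import numpy

def beta_summary(k, l, lambda_, coverage, nbsample=100000):
    # Monte Carlo summary of a Beta(k+lambda, l+lambda) posterior:
    # returns (mean, median, lower, upper) for the requested coverage.
    a, b = k + lambda_, l + lambda_
    samples = numpy.random.default_rng(0).beta(a, b, nbsample)
    left = (1.0 - coverage) / 2.0
    lower, median, upper = numpy.quantile(samples, [left, 0.5, 1.0 - left])
    return samples.mean(), median, lower, upper

def measures(tp, fp, tn, fn, lambda_, coverage):
    # Same tuple layout as in the diff: per measure, [2]/[3] are the
    # lower/upper credible-interval bounds.
    return (
        beta_summary(tp, fp, lambda_, coverage),            # precision
        beta_summary(tp, fn, lambda_, coverage),            # recall
        beta_summary(tn, fp, lambda_, coverage),            # specificity
        beta_summary(tp + tn, fp + fn, lambda_, coverage),  # accuracy
        beta_summary(tp, fp + fn, lambda_, coverage),       # jaccard
        beta_summary(2 * tp, fp + fn, lambda_, coverage),   # f1 (beta approx.)
    )

def compare(tp1, fp1, tn1, fn1, tp2, fp2, tn2, fn2, lambda_):
    names = ["precision", "recall", "specificity", "accuracy",
             "jaccard", "f1-score"]
    retval = {}
    # try the wide (0.95) interval first, fall back to 0.85 on overlap
    for coverage in (0.95, 0.85):
        s1 = measures(tp1, fp1, tn1, fn1, lambda_, coverage)
        s2 = measures(tp2, fp2, tn2, fn2, lambda_, coverage)
        for i, name in enumerate(names):
            if name in retval:
                continue  # already decided at the wider interval
            if s1[i][2] > s2[i][3]:    # 1's lower bound > 2's upper bound
                retval[name] = (">", coverage)
            elif s2[i][2] > s1[i][3]:  # 2's lower bound > 1's upper bound
                retval[name] = ("<", coverage)
    for name in names:
        retval.setdefault(name, ("=", None))  # intervals overlap at 0.85 too
    return retval
```

With a clearly superior system 1 (e.g. ``tp1=90, fp1=10`` vs ``tp2=50, fp2=50``) the dictionary maps each measure to ``(">", 0.95)``; identical counts yield ``("=", None)`` throughout.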
added 17 commits
- 540cebe9...d654dd55 - 16 commits from branch andres-upgrades
- fafe85a4 - [credible region] Added implementation to return the credible region for the...
@amorais: I fixed the pipeline errors (a rebase to master was required). The problems listed are real problems with the commit. In particular, this one should be addressed:
bob/measure/credible_region.py:docstring of bob.measure.credible_region:12:Indirect hyperlink target "five confidence intervals for proportions that you should know about" refers to target "ci-evaluation", which does not exist.
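The Sphinx error above means the docstring uses an indirect (named) reference whose matching ``.. _ci-evaluation:`` target definition is missing. A sketch of the shape of the fix in reStructuredText (the URL below is a placeholder, since the real target is not shown in this thread):

```
See the `five confidence intervals for proportions that you should know
about <ci-evaluation_>`_ for a discussion of interval choices.

.. _ci-evaluation: https://example.com/placeholder-for-real-target
```

Alternatively, the named reference in the docstring can be turned into an inline link so no separate target is needed.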
added 1 commit
- 7ba50ade - [credible region] fixed the modifications of the implementation for the F1score
@amorais: I just pushed some fixes for the documentation strings.
added 1 commit
- cfc865aa - Finalize first draft of ROC/PR curve with CI visualization (and AUC estimation)
added 1 commit
- 602d5757 - Implements paired comparison; Externalize examples; Improve ROC bound...
mentioned in commit a93a031b