

Merged Antonio MORAIS requested to merge antonio-merge into andres-upgrades
4 unresolved threads

[credible region] Added implementation to return the credible region for the F1 score and compare 2 systems

Merge request reports

Pipeline #56435 passed

Pipeline passed for 728e7711 on antonio-merge

Test coverage 72.00% (-6.00%) from 2 jobs

Merged by André Anjos 3 years ago (Nov 22, 2021 12:03pm UTC)


Pipeline #56436 passed

Pipeline passed for a93a031b on andres-upgrades

Test coverage 72.00% (-6.00%) from 2 jobs

Activity

    Returns
    -------

    f1_score : (float, float, float, float)
        F1, mean, mode and credible intervals (95% CI). See `F1-score
        <https://en.wikipedia.org/wiki/F1_score>`_. It corresponds
        arithmetically to ``2*P*R/(P+R)`` or ``2*tp/(2*tp+fp+fn)``. The F1 or
        Dice score depends on a TP-only numerator, similarly to the Jaccard
        index. For regions where there are no annotations, the F1-score will
        always be zero, irrespective of the model output. Accuracy may be a
        better proxy if one needs to consider the true absence of annotations
        in a region as part of the measure.

    """

    nbsample = 100000
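The excerpt cuts off at ``nbsample = 100000``, which suggests the F1 credible region is estimated by Monte Carlo: drawing precision and recall from their beta posteriors and summarizing the induced F1 samples. A minimal sketch of that idea (the name ``f1_posterior``, the fixed seed, and using the median as a cheap proxy for the mode are assumptions, not the actual ``f1score`` implementation from this merge request):

```python
import numpy

def f1_posterior(tp, fp, fn, lambda_, coverage, nb_samples=100000):
    """Sketch: credible region of the F1-score via Monte Carlo sampling.

    Draws precision and recall from their beta posteriors (symmetric
    prior ``lambda_``) and summarizes the induced F1 samples.
    """
    rng = numpy.random.default_rng(42)  # fixed seed, for reproducibility
    precision = rng.beta(tp + lambda_, fp + lambda_, size=nb_samples)
    recall = rng.beta(tp + lambda_, fn + lambda_, size=nb_samples)
    f1 = 2 * precision * recall / (precision + recall)
    left_tail = (1.0 - coverage) / 2.0
    lower, upper = numpy.quantile(f1, [left_tail, 1.0 - left_tail])
    # the median stands in here for the posterior mode (an assumption)
    return numpy.mean(f1), numpy.median(f1), lower, upper
```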
    coverage = 0.95
    system1 = measures(tp1, fp1, tn1, fn1, lambda_, coverage)
    system2 = measures(tp2, fp2, tn2, fn2, lambda_, coverage)
    measure = ['precision', 'recall', 'specificity', 'accuracy', 'Jaccard index', 'F1 score']
    result = ""
    for i in range(len(measure)):
        result += "For the %s we can say that: \n " % (measure[i])
        if system1[i][2] > system2[i][3]:
            # lower bound from system 1 is greater than the upper bound from system 2
            result += "System 1 is better than system 2 with convincing evidence \n"
        elif system2[i][2] > system1[i][3]:
            # lower bound from system 2 is greater than the upper bound from system 1
            result += "System 2 is better than system 1 with convincing evidence \n"
        else:
            # the credible intervals overlap, so we compute the 85% credible intervals to compare them
            # (cf. https://mmeredith.net/blog/2013/1303_Comparison_of_confidence_intervals.htm and
        beta(tn, fp, lambda_, coverage),        # specificity
        beta(tp+tn, fp+fn, lambda_, coverage),  # accuracy
        beta(tp, fp+fn, lambda_, coverage),     # Jaccard index
    -   beta(2*tp, fp+fn, lambda_, coverage),   # f1-score (removed)
    +   f1score(tp, fp, tn, fn, lambda_, coverage),  # f1-score (added)
    )

    def compare(tp1, fp1, tn1, fn1, tp2, fp2, tn2, fn2, lambda_):
        else:
            # the credible intervals overlap, so we compute the 85% credible intervals to compare them
            # (cf. https://mmeredith.net/blog/2013/1303_Comparison_of_confidence_intervals.htm and
            # https://statisticsbyjim.com/hypothesis-testing/confidence-intervals-compare-means)
            coverage = 0.85
            system1 = measures(tp1, fp1, tn1, fn1, lambda_, coverage)
            system2 = measures(tp2, fp2, tn2, fn2, lambda_, coverage)
            if system1[i][2] > system2[i][3]:
                # lower bound from system 1 is greater than the upper bound from system 2
                result += "System 1 is better than system 2 with \"significance\" at the 5% level. \n"
            elif system2[i][2] > system1[i][3]:
                # lower bound from system 2 is greater than the upper bound from system 1
                result += "System 2 is better than system 1 with \"significance\" at the 5% level. \n"
            else:
                result += "There is no statistical difference between the two CIs \n"
    return result
    • Instead of returning a string, let's return a dictionary where the keys are the various measures and the values indicate the condition that is fulfilled (one of 5 possible values): a tuple with the direction (">", "<", or "=") and the CI coverage (0.95, 0.85, or None). retval["f1-score"] = (">", 0.85) means that system 1 is better than system 2 with a 5% uncertainty, considering a 0.85 CI. retval["precision"] = ("=", None) means that system 1 and system 2 are comparable according to that metric.
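A sketch of that suggestion, assuming each measure is summarized by a ``(lower, upper)`` credible interval at both coverages (the helper names ``classify_pair`` and ``compare_as_dict`` and their signatures are hypothetical, not part of this merge request):

```python
def classify_pair(ci95_1, ci95_2, ci85_1, ci85_2):
    """Sketch: classify two systems for one measure, per the suggestion.

    Each argument is a (lower, upper) credible interval; returns a tuple
    (direction, coverage) with direction in {">", "<", "="} and coverage
    in {0.95, 0.85, None}.
    """
    if ci95_1[0] > ci95_2[1]:
        return (">", 0.95)  # system 1 better: 95% intervals disjoint
    if ci95_2[0] > ci95_1[1]:
        return ("<", 0.95)  # system 2 better: 95% intervals disjoint
    if ci85_1[0] > ci85_2[1]:
        return (">", 0.85)  # system 1 better at the tighter 85% coverage
    if ci85_2[0] > ci85_1[1]:
        return ("<", 0.85)  # system 2 better at the tighter 85% coverage
    return ("=", None)      # intervals overlap at both coverages

def compare_as_dict(ci95_sys1, ci95_sys2, ci85_sys1, ci85_sys2, names):
    """Builds the suggested dictionary over all measures."""
    return {
        name: classify_pair(a, b, c, d)
        for name, a, b, c, d in zip(
            names, ci95_sys1, ci95_sys2, ci85_sys1, ci85_sys2
        )
    }
```

For example, ``compare_as_dict`` over a single "f1-score" entry with disjoint 95% intervals yields ``{"f1-score": (">", 0.95)}``.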

    • changed this line in version 4 of the diff

  • It would be interesting to add Goutte's implementation to compare 2 systems.

  • Furthermore, it would be important that functions allow multiple (vector) inputs.
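One way to support vector inputs is numpy broadcasting over the count arrays. A sketch under that assumption (the name ``beta_ci_vec`` and the sampling-based quantiles are hypothetical; ``scipy.stats.beta.ppf``, which also accepts arrays, would give exact quantiles instead):

```python
import numpy

def beta_ci_vec(successes, failures, lambda_, coverage, nb_samples=100000):
    """Sketch: vectorized beta credible intervals via posterior sampling.

    ``successes`` and ``failures`` are the counts defining each beta
    posterior (e.g. tp and fp for precision); scalars or equal-shape
    arrays are accepted.
    """
    successes = numpy.atleast_1d(numpy.asarray(successes, dtype=float))
    failures = numpy.atleast_1d(numpy.asarray(failures, dtype=float))
    rng = numpy.random.default_rng(0)  # fixed seed, for reproducibility
    # one column of posterior samples per input entry, via broadcasting
    samples = rng.beta(
        successes + lambda_, failures + lambda_,
        size=(nb_samples,) + successes.shape,
    )
    left_tail = (1.0 - coverage) / 2.0
    lower = numpy.quantile(samples, left_tail, axis=0)
    upper = numpy.quantile(samples, 1.0 - left_tail, axis=0)
    return lower, upper
```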

  • André Anjos added 17 commits


    • 540cebe9...d654dd55 - 16 commits from branch andres-upgrades
    • fafe85a4 - [credible region] Added implementation to return the credible region for the...


  • @amorais: I fixed the pipeline errors (a rebase to master was required). The problems listed are real problems with the commit. In particular, this one should be addressed:

    bob/measure/credible_region.py:docstring of bob.measure.credible_region:12:Indirect hyperlink target "five confidence intervals for proportions that you should know about"  refers to target "ci-evaluation", which does not exist.
  • Antonio MORAIS added 1 commit


    • 7ba50ade - [credible region] fixed the modifications of the implementation for the F1score


  • André Anjos added 1 commit


    • 4b1d67d6 - [confidence_interval] Fix import


  • André Anjos added 1 commit


    • f70730cf - [doc] Fix documentation errors


  • @amorais: I just pushed some fixes for the documentation strings.

  • André Anjos added 1 commit


    • cfc865aa - Finalize first draft of ROC/PR curve with CI visualization (and AUC estimation)


  • André Anjos added 1 commit


    • 602d5757 - Implements paired comparison; Externalize examples; Improve ROC bound...


  • André Anjos added 3 commits


    • b6a774a4 - [doc] Ensure precision on demonstration
    • 12f9fa9a - [conda] Pin matplotlib to x.x
    • 728e7711 - [tests] Add tests for bayesian CI estimation


  • merged

  • André Anjos mentioned in commit a93a031b
