Evaluation will require re-integration of Bayesian confidence intervals where relevant
The evaluation system has been simplified to rely mostly on scikit-learn primitives. As part of that simplification, Bayesian confidence interval calculations for the various measures have not yet been re-incorporated.
I'm not sure yet what the best way to approach this is. There is a reviewed implementation at bob/bob.measure!103 (closed), which still needs type annotations to be complete. Ideally, that implementation would also integrate well with scikit-learn's conventions (naming, API style, etc.).
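For reference, a minimal sketch of the kind of calculation involved (not the bob.measure!103 implementation): an equal-tailed credible interval for a binary proportion such as FPR/FNR, using a conjugate Beta posterior. The function name, signature, and default prior here are assumptions for illustration only.

```python
# Sketch only: assumed names/signature, not the reviewed bob.measure code.
import scipy.stats


def proportion_credible_interval(
    successes: int, failures: int, coverage: float = 0.95, prior: float = 1.0
) -> tuple[float, float]:
    """Equal-tailed credible interval for a Bernoulli success rate.

    Uses a Beta(successes + prior, failures + prior) posterior, i.e. a
    conjugate Beta prior with symmetric hyper-parameter ``prior``
    (prior=1.0 corresponds to a flat prior).
    """
    posterior = scipy.stats.beta(successes + prior, failures + prior)
    alpha = 1.0 - coverage
    return posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2)


# Example: 3 false positives out of 100 negative samples
lower, upper = proportion_credible_interval(3, 97)
print(f"FPR 95% credible interval: [{lower:.3f}, {upper:.3f}]")
```

Whatever shape the final API takes, returning plain floats/tuples like scikit-learn metric functions do (rather than custom result objects) would probably help with the integration mentioned above.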
Potentially, we should create a new package just to host this. TBD.