Commit 05b2d5f3 authored by André Anjos

Use a less biometric-related name for functions; Add confidence interval and credible region details; Use python-vectorisation instead of numba for most contents in our library
parent 7dd97739
Pipeline #51397 failed in 4 minutes and 33 seconds
@@ -18,3 +18,4 @@ build
*.egg
src/
record.txt
+.coverage
#!/usr/bin/env python
# coding=utf-8
"""Frequentist confidence interval estimation
(Frequentist) confidence interval interpretation, with 95% coverage: **If we
take several independent random samples from the population and construct a
confidence interval from each sample, then 95 out of 100 of those confidence
intervals will contain the true mean (the true proportion, in this context)**.
See a discussion in `Five Confidence Intervals for Proportions That You
Should Know About <ci-evaluation_>`_.
"""
import numbers
import numpy
import scipy.stats
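# A hedged numeric check of the interpretation above, using the wilson()
# estimator defined further down in this module: out of many simulated
# samples, roughly 95 in 100 of the constructed intervals should contain the
# true proportion (the observed hit rate fluctuates around the nominal
# coverage, so the bounds asserted below are deliberately loose):
#
#   >>> rng = numpy.random.default_rng(1234)
#   >>> p_true, n, trials = 0.3, 100, 1000
#   >>> k = rng.binomial(n, p_true, size=trials)
#   >>> lower, upper = wilson(k, n - k, coverage=0.95)
#   >>> hits = numpy.mean((lower <= p_true) & (p_true <= upper))
#   >>> bool(0.90 < hits < 1.0)
#   True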
def clopper_pearson(k, l, coverage=0.95):
"""Calculates the "exact" confidence interval for proportion estimates
The Clopper-Pearson interval method is used for estimating the confidence
intervals. This implementation is based on [CLOPPER-1934]_. This
technique is **very** conservative - in most cases, the actual coverage is
greater than the required value, which may result in overly wide confidence
intervals.
Parameters
==========
k : int, numpy.ndarray (int, 1D)
Number of successes observed in the experiment
l : int, numpy.ndarray (int, 1D)
Number of failures observed in the experiment
coverage : :py:class:`float`, Optional
A floating-point number between 0 and 1.0 indicating the
coverage you're expecting. A value of 0.95 will ensure 95% coverage.
Returns
-------
lower, upper: float, numpy.ndarray (float, 1D)
The lower and upper bounds of the confidence interval
"""
is_scalar = isinstance(k, numbers.Integral)
if is_scalar:
# make it an array
k = numpy.array(k).reshape((1,))
l = numpy.array(l).reshape((1,))
right = (1.0 - coverage) / 2 # half-width in each side
lower = scipy.stats.beta.ppf(right, k, l + 1)
upper = scipy.stats.beta.ppf(1 - right, k + 1, l)
lower[numpy.isnan(lower)] = 0.0
upper[numpy.isnan(upper)] = 1.0
if is_scalar:
return lower[0], upper[0]
else:
return lower, upper
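# A minimal usage sketch for clopper_pearson (the bounds quoted below are
# approximate; scalar counts yield scalar bounds, integer arrays are handled
# element-wise):
#
#   >>> lower, upper = clopper_pearson(50, 50)   # 50 successes, 50 failures
#   >>> bool(0.39 < lower < 0.41 and 0.59 < upper < 0.61)
#   True
#   >>> lower, upper = clopper_pearson(numpy.array([0, 50]), numpy.array([10, 50]))
#   >>> bool(lower[0] == 0.0)   # the NaN lower bound for k=0 is mapped to 0.0
#   True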
def agresti_coull(k, l, coverage=0.95):
"""Calculates the confidence interval for proportion estimates
The Agresti-Coull interval method is used for estimating the confidence
intervals. This implementation is based on [AGRESTI-1998]_. This
technique is conservative - in most cases, the actual coverage is greater
than the required value, which may imply a larger confidence interval than
required.
This function is considered a good choice for the frequentist approach, if
you cannot use :py:func:`clopper_pearson`.
Parameters
==========
k : int, numpy.ndarray (int, 1D)
Number of successes observed in the experiment
l : int, numpy.ndarray (int, 1D)
Number of failures observed in the experiment
coverage : :py:class:`float`, Optional
A floating-point number between 0 and 1.0 indicating the
coverage you're expecting. A value of 0.95 will ensure 95% coverage.
Returns
-------
lower, upper: float, numpy.ndarray (float, 1D)
The lower and upper bounds of the confidence interval
"""
is_scalar = isinstance(k, numbers.Integral)
if is_scalar:
# make it an array
k = numpy.array(k).reshape((1,))
l = numpy.array(l).reshape((1,))
right = (1.0 - coverage) / 2 # half-width in each side
crit = scipy.stats.norm.isf(right)
kl_c = (k + l) + crit ** 2
q_c = (k + crit ** 2 / 2.0) / kl_c
std_c = numpy.sqrt(q_c * (1.0 - q_c) / kl_c)
dist = crit * std_c
lower = q_c - dist
upper = q_c + dist
lower[numpy.isnan(lower)] = 0.0
upper[numpy.isnan(upper)] = 1.0
if is_scalar:
return lower[0], upper[0]
else:
return lower, upper
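# A minimal usage sketch for agresti_coull (approximate numbers only): with 3
# successes and 7 failures, the 95% interval is roughly (0.10, 0.61) - much
# wider than the point estimate 0.3 alone would suggest for such a small
# sample:
#
#   >>> lower, upper = agresti_coull(3, 7)
#   >>> bool(0.08 < lower < 0.13 and 0.58 < upper < 0.63)
#   True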
def wilson(k, l, coverage=0.95):
"""Calculates the confidence interval for proportion estimates
The Wilson interval method is used for estimating the confidence intervals.
This implementation is based on [WILSON-1927]_ and does **not** include the
continuity correction. It is as conservative at the extremes of the domain
as the Bayesian approach and can be a good default if
:py:func:`clopper_pearson` cannot be used.
This function is considered the best "default" for the frequentist approach,
as it is not too conservative and yields reasonable values throughout the
range.
Parameters
==========
k : int, numpy.ndarray (int, 1D)
Number of successes observed in the experiment
l : int, numpy.ndarray (int, 1D)
Number of failures observed in the experiment
coverage : :py:class:`float`, Optional
A floating-point number between 0 and 1.0 indicating the
coverage you're expecting. A value of 0.95 will ensure 95% coverage.
Returns
-------
lower, upper: float, numpy.ndarray (float, 1D)
The lower and upper bounds of the confidence interval
"""
is_scalar = isinstance(k, numbers.Integral)
if is_scalar:
# make it an array
k = numpy.array(k).reshape((1,))
l = numpy.array(l).reshape((1,))
right = (1.0 - coverage) / 2 # half-width in each side
n = k + l
p = k / n
crit = scipy.stats.norm.isf(right)
crit2 = crit ** 2
denom = 1 + crit2 / n
center = (p + crit2 / (2 * n)) / denom
dist = crit * numpy.sqrt(p * (1.0 - p) / n + crit2 / (4.0 * n ** 2))
dist /= denom
lower = center - dist
upper = center + dist
lower[numpy.isnan(lower)] = 0.0
upper[numpy.isnan(upper)] = 1.0
if is_scalar:
return lower[0], upper[0]
else:
return lower, upper
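# A minimal usage sketch for wilson, illustrating the zero-success edge case
# (approximate numbers): with 0 successes out of 10 trials, the 95% interval
# is close to (0.0, 0.28) - the lower bound collapses to zero while the upper
# bound stays well away from it:
#
#   >>> lower, upper = wilson(0, 10)
#   >>> bool(abs(lower) < 1e-6 and 0.25 < upper < 0.31)
#   True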
#!/usr/bin/env python
# coding=utf-8
"""Methods for global sensitivity analysis"""
import numpy
from .regression import mse
def relevance(data, f):
"""Calculates the relevance of every input feature to the estimation process
Uses the formula from "Neural Triggering System Operating on High Resolution
Calorimetry Information", Anjos et al., April 2006, Nuclear Instruments and
Methods in Physics Research, volume 559, pages 134-138:
.. math::
   R(x_{i}) = |E[(o(x) - o(x|x_{i}=E[x_{i}]))^2]|
In other words, the relevance of a certain input feature **i** is the
change in the machine output value when that feature is replaced by its
mean for all input vectors. For this to work, the ``data`` parameter has to
be a 2D array with features arranged column-wise, while different examples
are arranged row-wise.
Parameters
==========
data : numpy.ndarray (float, 2D)
A 2D float array that contains the input data to be analyzed by the
method of interest. Data should be arranged such that the first
dimension (rows) corresponds to the samples and the second dimension
(columns) to the features available. So, if you have 20 samples and 5
features for each input data sample, then your array should have a
shape equal to ``(20, 5)``.
f : object
A callable that takes the input data and returns a :py:class:`numpy.ndarray`.
Returns
=======
relevance : numpy.ndarray (float, 1D)
A 1D float array as large as the number of columns (second dimension) of
your input array, estimating the "relevance" of each input column (or
feature) to the output provided by the function ``f``.
"""
o = f(data)
d2 = data.copy()
retval = numpy.zeros((data.shape[1],), dtype=float)
for k in range(data.shape[1]):
d2[:, :] = data # reset
d2[:, k] = numpy.mean(data[:, k])
retval[k] = (mse(f(d2), o).sum()) ** 0.5
return retval
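# A hedged usage sketch for relevance: for a linear map f(x) = x @ w, columns
# with larger absolute weights should receive larger relevance scores.  The
# absolute scale depends on the data spread and on the behaviour of the
# imported ``mse`` helper (assumed to return an array-like of squared errors),
# so only the ordering is asserted:
#
#   >>> rng = numpy.random.default_rng(42)
#   >>> data = rng.normal(size=(200, 3))
#   >>> weights = numpy.array([5.0, 1.0, 0.0])
#   >>> scores = relevance(data, lambda x: x @ weights)
#   >>> bool(scores[0] > scores[1] > scores[2])
#   True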
@@ -8,7 +8,7 @@ import h5py
import numpy
import pkg_resources
-from . import load
+from .. import load
def _F(f):