nustattools.stats package¶
Module contents¶
Copyright (c) 2024 Lukas Koch. All rights reserved.
Potentially useful statistical tools that are not available in scipy.stats.
- class nustattools.stats.Bee(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, seed=None)[source]¶
Bases:
rv_continuousA random variable representing the maximum of df chi distributions.
Each
chidisitribution hasdf = 1.Note
You probably do not need to use this class directly. Instead work with the instance
bee.- Parameters:
df (int) – The number of chi-distirbuted variables to take the maximum of.
- class nustattools.stats.Bee2(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, seed=None)[source]¶
Bases:
rv_continuousA random variable representing the maximum of df chi2 distributions.
Each
chii2disitribution hasdf = 1.Note
You probably do not need to use this class directly. Instead work with the instance
bee2.- Parameters:
df (int) – The number of chi-distirbuted variables to take the maximum of.
Notes
This distribution is discussed in [Koch2021] in the context of robust test statistics.
References
[Koch2021]L. Koch, “Robust test statistics for data sets with missing correlation information,” Phys. Rev. D 103, 113008 (2021) , Vol. 103, No. 11 p. 113008, https://arxiv.org/abs/2102.06172
- class nustattools.stats.Cee(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, seed=None)[source]¶
Bases:
rv_continuousA random variable representing the maximum of multiple chi distributions.
Each
chidisitribution can have a differentdf. If alldfare equal to 1, this is identical to theBeedistribution withdf = len(k).Note
You probably do not need to use this class directly. Instead work with the instance
cee.- Parameters:
k (DF or Iterable of DF) – Special class to pass variable length list of degrees of freedom of the chi-distirbuted variables to take the maximum of.
Examples
>>> cee(k=DF(1,2,3)).pdf(1) 0.2501359390297275
- class nustattools.stats.Cee2(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, seed=None)[source]¶
Bases:
rv_continuousA random variable representing the maximum of multiple chi2 distributions.
Each
chi2disitribution can have a differentdf. If alldfare equal to 1, this is identical to aBee2disritbution withdf = len(k).Note
You probably do not need to use this class directly. Instead work with the instance
cee2.- Parameters:
k (DF or Iterable of DF) – Special class to pass variable length list of degrees of freedom of the chi-distirbuted variables to take the maximum of.
Examples
>>> cee2(k=DF(1,2,3)).pdf(1) 0.12506796951321578
Notes
TODO: Add reference to paper.
- class nustattools.stats.DF(*args)[source]¶
Bases:
objectHelper class to get around limitations of rv_continuous.
This class represents the shape parameter for the
CeeandCee2distributions.
- class nustattools.stats.FMaxStatistic(*, k, funcs=None, inv_funcs=None)[source]¶
Bases:
TestStatisticTest statistic that takes the maximum of functions of chi2 distributed data.
- Parameters:
k (Iterable of int) – Number of degrees of freedom for the assumed chi2 distributions of the data points.
funcs (Iterable of Callable or None, optional) – Functions to apply to each data point. Maximum will be taken from the function outputs. If not specified, the identity function``f(x) = x`` will be used for all data points.
invfuncs (Iterable of Callable or None, optional) – Inverse of the functions that are applied to each data point. These are used to calculate the CDF. If not specified, the inverse will be calculated numerically using the funcs.
Notes
TODO: Cite paper.
- class nustattools.stats.OptimalFMaxStatistic(*, k)[source]¶
Bases:
FMaxStatisticFMaxStatistic that minimizes the maximum M-distance for any given p-value.
- Parameters:
k (Iterable of int) – Number of degrees of freedom for the assumed chi2 distributions of the data points.
Notes
TODO: Cite paper.
- class nustattools.stats.QMaxStatistic(*, k)[source]¶
Bases:
FMaxStatisticFMaxStatistic where the functions are the assumed CDFs of the data.
This is equivalent to taking the minimum of the individual p-values as the test statistic.
- Parameters:
k (Iterable of int) – Number of degrees of freedom for the assumed chi2 distributions of the data points.
Notes
TODO: Cite paper.
- class nustattools.stats.RVTestStatistic(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, seed=None)[source]¶
Bases:
rv_continuousA random variable distributed as the expectation of a
TestStatistic.Note
You probably do not need to use this class directly. Instead work with the instance
rvteststatistic.- Parameters:
statistic (TestStatistic or Iterable of TestStatistic)
Examples
>>> ts = OptimalFMaxStatistic(k=[1,2,3]) >>> rvteststatistic(statistic=ts).pdf(1) 0.24601379637056994
- class nustattools.stats.TestStatistic[source]¶
Bases:
ABCGeneral class for test statistcs.
A TestStatistic must implement a way to calculate the test statistic from some data, as well as a way to calculate the CDF of the expected distribution of the test statistic if an assumed model is correct.
- calculate(data)[source]¶
Calculate the test statistic from some given data.
You can also call the object directly, e.g.
statistic(data)instead ofstatistic.calculate(data).- Parameters:
data (numpy.ndarray) – The data to calculate the test statistic for. The test statstic is calculated over the last dimension of the array, so the return shape will be
input_shape[:-1].- Returns:
statistic_value
- Return type:
- cdf(statistic)[source]¶
Calculate the CDF of the expected distribution of the test statistic.
Assumes that certain model assumptions are true.
- Parameters:
statistic (numpy.ndarray) – The test statistic to calculate the CDF for.
- Returns:
cdf
- Return type:
- nustattools.stats.derate_covariance(cov, *, jacobian=None, sigma=3.0, return_dict=None, whitening='zca_aligned', method='gx2', precision=0.01, max_batch_size=10000, goodness_of_fit=False)[source]¶
Derate the covariance of some data to account for unknown correlations.
See [Koch2024].
- Parameters:
cov (numpy.ndarray or list of numpy.ndarray) – The covariance matrix of the data or a list of covariances that add up to the total. Unknown covariance blocks must be
np.nan. Off diagonal blocks may only be0ornp.nan. Diagonal blocks must not benp.nan.jacobian (numpy.ndarray, default=None) – Jacobian matrix of the model prediction wrt the best-fit parameters. If no jacobian is provided, the identity matrix will be used.
sigma (float, default=3.) – The desired confidence level up to which the derated covariance should be conservative, expressed in standard-normal standard deviations. E.g.
sigma=3.corresponds toCL=0.997.return_dict (dict, optional) – If specified, the nightmare covariance and thrown data samples are added to this dictionary for detailed studies outside the function.
whitening (str, default="zca_aligned") – Specify which method to use for the whitening transform.
method (str, default="gx2") – Either
gx2ormc. The former calculates the p-value directly, the latter uses Monte Carlo methods.precision (float, default=0.01) – If the derating factor is calculated using numerical sampling (MC), this parameter determines how many samples to throw. Lower values mean more samples.
max_batch_size (int, default=10_000) – If the derating factor is calculated using numerical sampling (MC), this parameter determines how many samples to are thrown at once. This is repeated until the total number for the required precision is reached.
goodness_of_fit (bool, default=False) – The derating factor for the Goodness of Fit test is different from the derating factor for model parameter estimation. If this parameter is set to True, The provided model jacobian will be used to construct the null space of the model in the data space. This is then used to calculate the necessary derating factor for the Goodness of Fit or Composite Hypothesis test.
- Returns:
a – The derating factor for the total covariance.
- Return type:
Notes
The basic available whitening transforms are:
mahalanobisorzcaW = sqrtm(inv(cov))zca-corW = sqrtm(inv(cor)) @ diag(1/sqrt(diag(cov)))choleskyW = cholesky(inv(cov)).T
See [Kessy2015].
If
_alignedis appended to a basic transform, an additional rotation matrix is prepended, which aligns the whitened coordinate axes with the model parameter space given by jacobian. See [Koch2024].References
[Kessy2015]Kessy, Agnan / Lewin, Alex / Strimmer, Korbinian “Optimal whitening and decorrelation”, The American Statistician 2018, Vol. 72, No. 4, pp. 309-314 , Vol. 72, No. 4, Informa UK Limited, p. 309-314, https://arxiv.org/abs/1512.00809
[Koch2024] (1,2)L. Koch “Hypothesis tests and model parameter estimation on data sets1 with missing correlation information”, https://arxiv.org/abs/2410.22333
- nustattools.stats.rvteststatistic = <nustattools.stats._dist.RVTestStatistic object>¶
Use this instance of
RVTestStatistic.