nustattools.stats package

Module contents

Copyright (c) 2024 Lukas Koch. All rights reserved.

Potentially useful statistical tools that are not available in scipy.stats.

class nustattools.stats.Bee(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, seed=None)[source]

Bases: rv_continuous

A random variable representing the maximum of df chi distributions.

Each chi disitribution has df = 1.

Note

You probably do not need to use this class directly. Instead work with the instance bee.

Parameters:

df (int) – The number of chi-distirbuted variables to take the maximum of.

class nustattools.stats.Bee2(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, seed=None)[source]

Bases: rv_continuous

A random variable representing the maximum of df chi2 distributions.

Each chii2 disitribution has df = 1.

Note

You probably do not need to use this class directly. Instead work with the instance bee2.

Parameters:

df (int) – The number of chi-distirbuted variables to take the maximum of.

Notes

This distribution is discussed in [Koch2021] in the context of robust test statistics.

References

[Koch2021]

L. Koch, “Robust test statistics for data sets with missing correlation information,” Phys. Rev. D 103, 113008 (2021) , Vol. 103, No. 11 p. 113008, https://arxiv.org/abs/2102.06172

class nustattools.stats.Cee(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, seed=None)[source]

Bases: rv_continuous

A random variable representing the maximum of multiple chi distributions.

Each chi disitribution can have a different df. If all df are equal to 1, this is identical to the Bee distribution with df = len(k).

Note

You probably do not need to use this class directly. Instead work with the instance cee.

Parameters:

k (DF or Iterable of DF) – Special class to pass variable length list of degrees of freedom of the chi-distirbuted variables to take the maximum of.

Examples

>>> cee(k=DF(1,2,3)).pdf(1)
0.2501359390297275
class nustattools.stats.Cee2(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, seed=None)[source]

Bases: rv_continuous

A random variable representing the maximum of multiple chi2 distributions.

Each chi2 disitribution can have a different df. If all df are equal to 1, this is identical to a Bee2 disritbution with df = len(k).

Note

You probably do not need to use this class directly. Instead work with the instance cee2.

Parameters:

k (DF or Iterable of DF) – Special class to pass variable length list of degrees of freedom of the chi-distirbuted variables to take the maximum of.

Examples

>>> cee2(k=DF(1,2,3)).pdf(1)
0.12506796951321578

Notes

TODO: Add reference to paper.

class nustattools.stats.DF(*args)[source]

Bases: object

Helper class to get around limitations of rv_continuous.

This class represents the shape parameter for the Cee and Cee2 distributions.

Parameters:

*args (tuple of int) – The df for each chi(2) distribution

class nustattools.stats.FMaxStatistic(*, k, funcs=None, inv_funcs=None)[source]

Bases: TestStatistic

Test statistic that takes the maximum of functions of chi2 distributed data.

Parameters:
  • k (Iterable of int) – Number of degrees of freedom for the assumed chi2 distributions of the data points.

  • funcs (Iterable of Callable or None, optional) – Functions to apply to each data point. Maximum will be taken from the function outputs. If not specified, the identity function``f(x) = x`` will be used for all data points.

  • invfuncs (Iterable of Callable or None, optional) – Inverse of the functions that are applied to each data point. These are used to calculate the CDF. If not specified, the inverse will be calculated numerically using the funcs.

Notes

TODO: Cite paper.

Parameters:

inv_funcs (Optional[Iterable[Optional[Callable[..., ndarray[tuple[Any, ...], dtype[Any]]]]]])

class nustattools.stats.OptimalFMaxStatistic(*, k)[source]

Bases: FMaxStatistic

FMaxStatistic that minimizes the maximum M-distance for any given p-value.

Parameters:

k (Iterable of int) – Number of degrees of freedom for the assumed chi2 distributions of the data points.

Notes

TODO: Cite paper.

class nustattools.stats.QMaxStatistic(*, k)[source]

Bases: FMaxStatistic

FMaxStatistic where the functions are the assumed CDFs of the data.

This is equivalent to taking the minimum of the individual p-values as the test statistic.

Parameters:

k (Iterable of int) – Number of degrees of freedom for the assumed chi2 distributions of the data points.

Notes

TODO: Cite paper.

class nustattools.stats.RVTestStatistic(momtype=1, a=None, b=None, xtol=1e-14, badvalue=None, name=None, longname=None, shapes=None, seed=None)[source]

Bases: rv_continuous

A random variable distributed as the expectation of a TestStatistic.

Note

You probably do not need to use this class directly. Instead work with the instance rvteststatistic.

Parameters:

statistic (TestStatistic or Iterable of TestStatistic)

Examples

>>> ts = OptimalFMaxStatistic(k=[1,2,3])
>>> rvteststatistic(statistic=ts).pdf(1)
0.24601379637056994
class nustattools.stats.TestStatistic[source]

Bases: ABC

General class for test statistcs.

A TestStatistic must implement a way to calculate the test statistic from some data, as well as a way to calculate the CDF of the expected distribution of the test statistic if an assumed model is correct.

calculate(data)[source]

Calculate the test statistic from some given data.

You can also call the object directly, e.g. statistic(data) instead of statistic.calculate(data).

Parameters:

data (numpy.ndarray) – The data to calculate the test statistic for. The test statstic is calculated over the last dimension of the array, so the return shape will be input_shape[:-1].

Returns:

statistic_value

Return type:

numpy.ndarray

cdf(statistic)[source]

Calculate the CDF of the expected distribution of the test statistic.

Assumes that certain model assumptions are true.

Parameters:

statistic (numpy.ndarray) – The test statistic to calculate the CDF for.

Returns:

cdf

Return type:

numpy.ndarray

nustattools.stats.bee = <nustattools.stats._dist.Bee object>

Use this instance of Bee.

nustattools.stats.bee2 = <nustattools.stats._dist.Bee2 object>

Use this instance of Bee2.

nustattools.stats.cee = <nustattools.stats._dist.Cee object>

Use this instance of Cee.

nustattools.stats.cee2 = <nustattools.stats._dist.Cee2 object>

Use this instance of Cee2.

nustattools.stats.derate_covariance(cov, *, jacobian=None, sigma=3.0, return_dict=None, whitening='zca_aligned', method='gx2', precision=0.01, max_batch_size=10000, goodness_of_fit=False)[source]

Derate the covariance of some data to account for unknown correlations.

See [Koch2024].

Parameters:
  • cov (numpy.ndarray or list of numpy.ndarray) – The covariance matrix of the data or a list of covariances that add up to the total. Unknown covariance blocks must be np.nan. Off diagonal blocks may only be 0 or np.nan. Diagonal blocks must not be np.nan.

  • jacobian (numpy.ndarray, default=None) – Jacobian matrix of the model prediction wrt the best-fit parameters. If no jacobian is provided, the identity matrix will be used.

  • sigma (float, default=3.) – The desired confidence level up to which the derated covariance should be conservative, expressed in standard-normal standard deviations. E.g. sigma=3. corresponds to CL=0.997.

  • return_dict (dict, optional) – If specified, the nightmare covariance and thrown data samples are added to this dictionary for detailed studies outside the function.

  • whitening (str, default="zca_aligned") – Specify which method to use for the whitening transform.

  • method (str, default="gx2") – Either gx2 or mc. The former calculates the p-value directly, the latter uses Monte Carlo methods.

  • precision (float, default=0.01) – If the derating factor is calculated using numerical sampling (MC), this parameter determines how many samples to throw. Lower values mean more samples.

  • max_batch_size (int, default=10_000) – If the derating factor is calculated using numerical sampling (MC), this parameter determines how many samples to are thrown at once. This is repeated until the total number for the required precision is reached.

  • goodness_of_fit (bool, default=False) – The derating factor for the Goodness of Fit test is different from the derating factor for model parameter estimation. If this parameter is set to True, The provided model jacobian will be used to construct the null space of the model in the data space. This is then used to calculate the necessary derating factor for the Goodness of Fit or Composite Hypothesis test.

Returns:

a – The derating factor for the total covariance.

Return type:

float

Notes

The basic available whitening transforms are:

mahalanobis or zca

W = sqrtm(inv(cov))

zca-cor

W = sqrtm(inv(cor)) @ diag(1/sqrt(diag(cov)))

cholesky

W = cholesky(inv(cov)).T

See [Kessy2015].

If _aligned is appended to a basic transform, an additional rotation matrix is prepended, which aligns the whitened coordinate axes with the model parameter space given by jacobian. See [Koch2024].

References

[Kessy2015]

Kessy, Agnan / Lewin, Alex / Strimmer, Korbinian “Optimal whitening and decorrelation”, The American Statistician 2018, Vol. 72, No. 4, pp. 309-314 , Vol. 72, No. 4, Informa UK Limited, p. 309-314, https://arxiv.org/abs/1512.00809

[Koch2024] (1,2)

L. Koch “Hypothesis tests and model parameter estimation on data sets1 with missing correlation information”, https://arxiv.org/abs/2410.22333

nustattools.stats.rvteststatistic = <nustattools.stats._dist.RVTestStatistic object>

Use this instance of RVTestStatistic.