nnsa.stats package

Submodules

nnsa.stats.intervals module

Code related to confindence and prediction intervals.

Functions:

prediction_interval(x, y_true, y_pred, x_query)

Compute the prediction intervals for the given linear 1D regression model at the given query points.

nnsa.stats.intervals.prediction_interval(x, y_true, y_pred, x_query, alpha=0.05, dof=None)[source]

Compute the prediction intervals for the given linear 1D regression model at the given query points.

Parameters:
  • x (np.ndarray) – 1D array with the predictor values of the training dataset (array with length n_samples).

  • y_true (np.ndarray) – 1D array with the true outcome values of the training dataset.

  • y_pred (np.ndarray) – 1D array with the predicted outcome values (from the fitted model) of the training dataset.

  • x_query (np.ndarray) – the x-points at which to compute the prediction interval.

  • alpha (float, optional) – the significance determining the prediction interval (PI). PI = (1 - alpha)*100% Defaults to 0.05.

  • dof (int, optional) – the degrees of freedom left after fitting the model. If None, a linear regression model with one predicter and an intercept is assumed, corresponding to dof = n_samples - 2. Defaults to None.

Returns:

delta_y (np.ndarray) – array with same length as x_query containing the prediction margin. The prediction interval is: y +/- delta_y.

Examples

The following example case is taken from https://newonlinecourses.science.psu.edu/stat462/node/128/. >>> import pandas as pd >>> from sklearn.linear_model import LinearRegression

Read data file. >>> datafile = ‘https://newonlinecourses.science.psu.edu/stat462/sites/onlinecourses.science.psu.edu.stat462/files/data/infectionrisk/index.txt’ >>> df = pd.read_csv(datafile, delimiter=’t’)

>>> df = df[(df['Region'] == 1) | (df['Region'] == 2)]  # Only keep region 1 and 2.
>>> df = df[df['Stay'] <= 14]  # Remove 2 outliers.
>>> x = df['Stay'].values.reshape(-1, 1)
>>> y_true = df['InfctRsk'].values

# Fit model (with an intercept). >>> model = LinearRegression().fit(x, y_true) >>> y_pred = model.predict(x)

# Get prediction interval at x = 10. >>> x_query = np.array([10]) >>> delta_y = prediction_interval(x[:, 0], y_true, y_pred, x_query, alpha=0.05) >>> print(‘{:.4f}’.format(delta_y[0])) 2.0699

nnsa.stats.paired module

Functions:

wilcoxon(x[, y, fill_outside_table])

This function is deprecated, as SciPy has implemented a 'mode' parameter that allows for exact p value computation.

nnsa.stats.paired.wilcoxon(x, y=None, fill_outside_table=nan, **kwargs)[source]

This function is deprecated, as SciPy has implemented a ‘mode’ parameter that allows for exact p value computation.

Calculate the Wilcoxon signed-rank test.

The scipy.stats.wilcoxon() function is used to compute the test statistic. For the p-value, the scipy function uses a normal approximation, which is only valid for larger sample sizes (see docs scipy.stats.wilcoxon()). Therefore, for len(x) > 20, the scipy.stats.wilcoxon() function is also used to compute the p-value. For len(x) <= 20, a lookup table is used to determine the lowest p-value for which the statistic is equal or lower.

Parameters:
  • x (np.ndarray) – see scipy.stats.wilcoxon().

  • y (np.ndarray, optional) – see scipy.stats.wilcoxon().

  • fill_outside_table (float, optional) – p-value to return if the test statistic does not fall in the range of the lookup table (for computing the p-value if len(x) <= 20). Defaults to np.nan.

  • **kwargs (optional) – keyword arguments for scipy.stats.wilcoxon().

Returns:
  • statistic (float) – see scipy.stats.wilcoxon().

  • pvalue (float) – see scipy.stats.wilcoxon(). If len(x) <= 20, this is the lowest p-value for which the statistic is equal or lower.

Examples

Create random data with different means. >>> np.random.seed(65) >>> n = 20 >>> x = np.random.normal(2, 1, n) >>> y = np.random.normal(2.5, 1, n)

Compute Wilcoxon p-value using scipy (uses the normal equation). >>> scipy.stats.wilcoxon(x, y, mode=’approx’)[1] 0.020633435105949553

Compute the lowest Wilcoxon p-value in the look up table for which the statistic is equal or lower. It should be similar to the wilcoxon computed above using the normal equation with n = 20. >>> wilcoxon(x, y)[1] 0.025

nnsa.stats.partially_paired module

Statistical test for comparing groups in partially paired data.

Functions:

opt_pooled_ttest(a, b)

Calculate the optimal pooled t-test on 2 partially paired samples.

nnsa.stats.partially_paired.opt_pooled_ttest(a, b)[source]

Calculate the optimal pooled t-test on 2 partially paired samples.

This is a two-sided test for the nul hypothesis that 2 sets of partially paired data have identical averages. The partially paired data refers to e.g. the case where there is a pre-treatment and a post-treatment group. Some subjects may have values for both pre and post treament, while other subjects may only be measured pre or post treatment. This test is suitable to compare the means of such partially paired pre and post groups, when the data is not normally distributed or when the sample size is low. See the reference for more detail and for other test comparing partially paired data.

Reference:

B. Guo and Y. Yuan, “A comparative review of methods for comparing means using partially paired data,” Statistical Methods in Medical Research, vol. 26, no. 3, pp. 1323-1340, Apr. 2015.

Parameters:
  • a (np.ndarray) – sample of the first group. Must have the same shape as b. a[i] and b[i] must correspond to the same subject. Missing values must be indicated by np.nan.

  • b (np.ndarray) – sample of the second group. Must have the same shape as a. a[i] and b[i] must correspond to the same subject. Missing values must be indicated by np.nan.

Returns:
  • statistic (float) – t-statistic.

  • pvalue (float) – two-tailed p-value.

Examples

Compute the t-statstic and p-value for the example case in the paper by Guo et al. 2015 (see Table 5). >>> paired_data = np.asarray(((20, 10), (30, 20), (25, 10), (20, 20), (25, 20), (10, 10), (15, 15), (20, 20), (30, 30))) >>> pre_data = np.array((10,20,25,30,20,30,15,20,30,15,15,20,10,25,30,20,20,30,25,30,20,20,10,25,20,10,20,20)) >>> post_data = np.array((15,25,30,20,10,20,10,30,10,10,10,25,15,20,20,20,20,10,10,10,20,30,10)) >>> pre = paired_data[:, 0] >>> post = paired_data[:, 1] >>> pre = np.concatenate((pre, pre_data, np.full(len(post_data), np.nan))) >>> post = np.concatenate((post, np.full(len(pre_data), np.nan), post_data)) >>> statistic, pvalue = opt_pooled_ttest(pre, post) >>> print(“Test statistic: {:.3f}, p-value: {:.4f}”.format(statistic, pvalue)) Test statistic: 2.697, p-value: 0.0119

Shuffle the data. >>> idx = np.random.permutation(len(pre)) >>> statistic_2, pvalue_2 = opt_pooled_ttest(pre[idx], post[idx]) >>> pvalue == pvalue_2 True

nnsa.stats.statistics module

Author: Tim Hermans (tim-hermans@hotmail.com).

Functions:

compute_mad(x[, k, n, axis, keepdims])

Compute the (running) median absolute deviation.

nnsa.stats.statistics.compute_mad(x, k=1.4826, n=None, axis=-1, keepdims=False)[source]

Compute the (running) median absolute deviation.

Ignores nan values.

Parameters:
  • x (np.ndarray) – input array.

  • k (float) – scale factor for MAD. Defaults to 1.4826 to approximate SD of normal distribution. See also: https://en.wikipedia.org/wiki/Median_absolute_deviation.

  • n (int) – Deprecated. Do not use. This is only here to provide Deprecation warnings.

  • axis (int) – axis along which to compute the MAD.

  • keepdims (bool) – whether to keep dims or not (see np.median()).

Returns:

mad (float or np.ndarray) – MAD value(s) along the specified axis.

nnsa.stats.surrogates module

Functions:

compute_AAFT_surrogate(x[, seed])

Compute the Amplitude Adjusted Fourier Transform surrogate.

compute_AR_surrogate(x[, r, seed])

Compute a red noise surrogate assuming x can be modelled with a univariate lag-1 autoregressive model.

compute_FT_surrogate(x[, xf, seed])

Compute the Fourier Transform surrogate.

compute_IAAFT_surrogate(x[, n_iters, xf, seed])

Compute the Iterative Amplitude Adjusted Fourier Transform surrogate.

compute_RS_surrogate(x[, seed, replace_nan])

Compute the random shuffle (RS) surrogate.

compute_ar1(x)

Compute lag-1 autocorrelation.

compute_surrogate(x, how[, seed])

Compute a surrogate signal for x using the method specified by how.

compute_surrogate_couplings_sample(x, y, fun)

Compute surrogate values for the coupling/interaction between random segments of x and y as computed by fun.

compute_surrogate_fun(inputs, fun[, ...])

Evaluate fun(*inputs) n_surrogates times, where each iteration new surrogates are used of the original inputs.

nnsa.stats.surrogates.compute_AAFT_surrogate(x, seed=None)[source]

Compute the Amplitude Adjusted Fourier Transform surrogate.

This method tries to preserve both the linear structure and the amplitude distribution. The drawback of this method is precisely that the last step changes somewhat the linear structure.

Parameters:
  • x (np.ndarray) – 1D signal array.

  • seed (int, optional) – seed for the random generator.

Returns:

x_surrogate (np.ndarray) – surrogate signal for x.

nnsa.stats.surrogates.compute_AR_surrogate(x, r=None, seed=None)[source]

Compute a red noise surrogate assuming x can be modelled with a univariate lag-1 autoregressive model.

Surrogate data is created by generating random red noise, using the lag-1 autocorrelation estimated from x.

Taken from the pycwt package.

Parameters:
  • x (np.ndarray) – 1D signal array.

  • r (float, optional) – precomputed lag-1 autocorrelation of x.

  • seed (int, optional) – seed for the random generator.

Returns:

x_surrogate (np.ndarray) – surrogate signal for x.

nnsa.stats.surrogates.compute_FT_surrogate(x, xf=None, seed=None)[source]

Compute the Fourier Transform surrogate.

Surrogate data is created by the inverse Fourier Transform of the Fourier Transform of the original signal with new (uniformly random) phases. This preserves the linear correlation (the periodogram) of the original signal.

Parameters:
  • x (np.ndarray) – 1D signal array.

  • xf (np.ndarray, optional) – precomputed one-dimensional discrete Fourier Transform of x, i.e. np.fft.rfft(x, axis=0). If None, it is computed in the function. Defaults to None.

  • seed (int, optional) – seed for the random generator.

Returns:

x_surrogate (np.ndarray) – surrogate signal for x.

nnsa.stats.surrogates.compute_IAAFT_surrogate(x, n_iters=10, xf=None, seed=None)[source]

Compute the Iterative Amplitude Adjusted Fourier Transform surrogate.

This algorithm is an iterative version of AAFT. The steps are repeated so that the surrogate autocorrelation function becomes more similar to the original.

Parameters:
  • x (np.ndarray) – 1D signal array.

  • n_iters (np.ndarray, optional) – number of iterations to do. Defaults to 10.

  • xf (np.ndarray, optional) – precomputed one-dimensional discrete Fourier Transform of x, i.e. np.fft.rfft(x, axis=0). If None, it is computed in the function. Defaults to None.

  • seed (int, optional) – seed for the random generator.

Returns:

x_surrogate (np.ndarray) – surrogate signal for x.

nnsa.stats.surrogates.compute_RS_surrogate(x, seed=None, replace_nan=True)[source]

Compute the random shuffle (RS) surrogate.

Surrogate data is created by randomly shuffling the samples in the time domain. The permutations guarantee the same amplitude distribution than the original series, but destroy any linear correlation. This method is associated to the null hypothesis of the data being uncorrelated noise (possibly Gaussian and measured by a static nonlinear function).

Parameters:
  • x (np.ndarray) – 1D signal array.

  • seed (int, optional) – seed for the random generator.

Returns:

x_surrogate (np.ndarray) – surrogate signal for x.

nnsa.stats.surrogates.compute_ar1(x)[source]

Compute lag-1 autocorrelation.

Parameters:

x (np.ndarray) – 1D signal array.

Returns:

r (float) – lag-1 autocorrelation of x.

References

[1] Allen, M. R. and Smith, L. A. Monte Carlo SSA: detecting

irregular oscillations in the presence of colored noise. Journal of Climate, 1996, 9(12), 3373-3404. <http://dx.doi.org/10.1175/1520-0442(1996)009<3373:MCSDIO>2.0.CO;2>

[2] http://www.madsci.org/posts/archives/may97/864012045.Eg.r.html

nnsa.stats.surrogates.compute_surrogate(x, how, seed=None, **kwargs)[source]

Compute a surrogate signal for x using the method specified by how.

Parameters:
  • x (np.ndarray) – 1D time series.

  • how (str, optional) – method how to generate the surrogates. Choose from: ‘RS’/’random_shuffle’, ‘FT’/’forier_transform’/’RP’/’random_phase’, ‘AAFT’, ‘IAAFT’, ‘AR’. See the codes for each of these methods for more information about them (nnsa.stats.surrogates).

  • seed (int, optional) – seed for the random generator. Defaults to None.

  • **kwargs (optional) – optional keyword arguments for the surrogate function that is called.

Returns:

x_surrogate (np.ndarray) – surrogate signal for x.

nnsa.stats.surrogates.compute_surrogate_couplings_sample(x, y, fun, axis=-1, nperseg=None, n_surrogates=300, how='IAAFT', seed=None)[source]

Compute surrogate values for the coupling/interaction between random segments of x and y as computed by fun.

Create n_surrogate surrogates of segments of x and y by randomly selecting a channel and a starting index. Computes the coupling between the surrogates and returns these.

Parameters:
  • x (np.ndarray) – N-D array containing time series.

  • y (np.ndarray) – N-D array containing time series. Should have the same shape as x.

  • fun (function) – function that takes in two 1D arrays and returns a coupling value.

  • axis (int, optional) – axis in x and y corresponding to the time dimension. Defaults to -1.

  • nperseg (int, optional) – number of samples in a randomly selected segment for computation of the coupling. If None, the entire length of the signal is used to compute the coupling.

  • n_surrogates (int, optional) – number of surrogates to use. Defaults to 300.

  • how (str, optional) – method how to generate the surrogates. See compute_surrogate in nnsa.stats.surrogates. Defaults to ‘IAAFT’.

  • seed (int, optional) – seed for the random generator. Defaults to None.

Returns:

surrogate_values (np.ndarray) – n_surrogates coupling values of the surrogates.

Examples

>>> np.random.seed(43)
>>> x = np.random.rand(10, 512*5, 20)
>>> y = np.random.rand(10, 512*5, 20)
>>> fun = lambda x, y: np.abs(np.corrcoef(x, y)[0, 1])

# Compute coupling for surrogates of random segments from random channels. >>> surrogate_values = compute_surrogate_couplings_sample(x, y, axis=1, fun=fun, n_surrogates=300, nperseg=512) >>> surrogate_values.shape (300,)

# Determine significance threshold for correlation between x and y. >>> sig_threshold = np.percentile(surrogate_values, 95) >>> print(sig_threshold) 0.0820297882707145

nnsa.stats.surrogates.compute_surrogate_fun(inputs, fun, n_surrogates=300, how='IAAFT', seed=None, return_af_errors=False, verbose=1, **kwargs)[source]

Evaluate fun(*inputs) n_surrogates times, where each iteration new surrogates are used of the original inputs.

Parameters:
  • inputs (iterable) – list or tuple of 1D arrays which are the original inputs for fun.

  • fun (function) – function that takes as input len(inputs) 1D arrays and returns a value or array.

  • n_surrogates (int, optional) – number of surrogates to use. Defaults to 300.

  • how (str, optional) – method how to generate the surrogates. See compute_surrogate in nnsa.stats.surrogates. Defaults to ‘IAAFT’.

  • seed (int, optional) – seed for the random generator. Defaults to None.

  • return_af_errors (bool, optional) – if True, returns an additional array which contains the errors between the coherence of clean vs artefacted surrogates. To create the artefact surrogates, artefacts are inserted at locations where there are artefacts in the inputs.

  • verbose (int, optional) – verbosity level. Defaults to 1.

  • **kwargs – optional keyword arguments for fun.

Returns:
  • surrogate_values (np.ndarray) – array with length n_surrogates containing the results for each iteration.

  • errors (np.ndarray) – only returned if return_af_errors: array with length n_surrogates containing the difference of the fun value between clean vs artefact surrogates.

Examples

>>> np.random.seed(43)
>>> x = np.random.rand(512)
>>> y = np.random.rand(512)
>>> fun = lambda x, y: np.abs(np.corrcoef(x, y)[0, 1])

# Compute coupling for all surrogates. >>> surrogate_values = compute_surrogate_fun((x, y), fun, n_surrogates=300, verbose=0) >>> surrogate_values.shape (300,)

# Determine significance threshold for correlation between x and y. >>> sig_threshold = np.percentile(surrogate_values, 95) >>> print(sig_threshold) 0.07910408001084722

nnsa.stats.utils module

Functions:

make_stars(pval)

nnsa.stats.utils.make_stars(pval)[source]

Module contents