nnsa.training package
Submodules
nnsa.training.cross_validation module
Functions:
cross_validate – Cross validate a regression model using (stratified) k-fold cross validation.
kfold_generator – Generate train and validation sets from k folds.
- nnsa.training.cross_validation.cross_validate(X, y, model, n_folds=5, how='stratified', verbose=0)[source]
Cross validate a regression model using (stratified) k-fold cross validation.
For leave-one-out cross validation, set n_folds to len(X) or None.
- Parameters:
X (np.ndarray) – train data with shape (n_samples, n_features).
y (np.ndarray) – outcome values with shape (n_samples,).
model – (untrained) model that implements fit() and predict() methods.
n_folds (int, optional) – number of folds for k-fold cross validation. Defaults to 5. If n_folds is None, takes n_folds=len(X), i.e., leave-one-out cross validation.
how (str, optional) – how to divide in folds. See kfold_generator.
verbose (int, optional) – verbosity level. Defaults to 0.
- Returns:
y_pred_val (np.ndarray) – out-of-fold predictions with the same length as the train set. The prediction for each sample comes from a model trained on folds that did not include that sample.
all_models (list) – list of all models, each trained on a specific part of the data.
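The out-of-fold behaviour described above can be illustrated with a minimal sketch. This is not the nnsa implementation: `cross_validate_sketch` and `MeanModel` are hypothetical names, and the fold assignment (sort by y, then take every n_folds-th sample) is an assumption based on the kfold_generator description.

```python
import copy

import numpy as np


class MeanModel:
    """Toy regressor (hypothetical): always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


def cross_validate_sketch(X, y, model, n_folds=5):
    """Return out-of-fold predictions and the list of per-fold models."""
    n_samples = len(X)
    if n_folds is None:
        n_folds = n_samples  # leave-one-out
    # Assumed 'stratified' behaviour: order samples by outcome value.
    order = np.argsort(y, kind="stable")
    y_pred_val = np.full(n_samples, np.nan)
    all_models = []
    for k in range(n_folds):
        val_mask = np.zeros(n_samples, dtype=bool)
        val_mask[order[k::n_folds]] = True
        # Train a fresh copy so the folds do not share fitted state.
        fold_model = copy.deepcopy(model)
        fold_model.fit(X[~val_mask], y[~val_mask])
        y_pred_val[val_mask] = fold_model.predict(X[val_mask])
        all_models.append(fold_model)
    return y_pred_val, all_models


X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10, dtype=float)
y_pred_val, all_models = cross_validate_sketch(X, y, MeanModel(), n_folds=5)
y_pred_loo, loo_models = cross_validate_sketch(X, y, MeanModel(), n_folds=None)
```

Every sample receives exactly one prediction, so y_pred_val can be compared directly against y to estimate the generalization error.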
- nnsa.training.cross_validation.kfold_generator(X, y, n_folds=5, how='stratified', return_idx=False)[source]
Generate train and validation sets from k folds.
For fold k, every `n_folds`th sample is put in the set, starting at k.
- Parameters:
X (np.ndarray) – input data with len (n_samples).
y (np.ndarray) – output data with len (n_samples).
n_folds (int, optional) – number of folds. Defaults to 5.
how (str, optional) – how to divide the data into folds. Choose from: ‘random’: random division into folds. ‘stratified’: sorts the data based on y, so that the folds are stratified (for regression). Defaults to ‘stratified’.
return_idx (bool, optional) – if True, return/yield the indices for train and validation. Defaults to False.
- Yields:
X_train (np.ndarray) – train input data for current fold.
X_val (np.ndarray) – validation input data for current fold.
y_train (np.ndarray) – train output data for current fold.
y_val (np.ndarray) – validation output data for current fold.
idx_train (np.ndarray, optional) – boolean mask for train data (if return_idx is True).
idx_val (np.ndarray, optional) – boolean mask for validation data (if return_idx is True).
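A minimal sketch of the fold assignment described above, assuming that "the set" filled with every n_folds-th sample is the validation set. `kfold_sketch` and its `seed` argument are hypothetical and not part of nnsa; the sketch always yields the masks, whereas kfold_generator only does so when return_idx is True.

```python
import numpy as np


def kfold_sketch(X, y, n_folds=5, how="stratified", seed=0):
    """Yield (X_train, X_val, y_train, y_val, idx_train, idx_val) per fold."""
    n_samples = len(X)
    if how == "stratified":
        # Sort by outcome so each fold spans the full range of y.
        order = np.argsort(y, kind="stable")
    elif how == "random":
        order = np.random.default_rng(seed).permutation(n_samples)
    else:
        raise ValueError(f"Unknown how: {how!r}")
    for k in range(n_folds):
        # Fold k takes every n_folds-th sample of the ordering, starting at k.
        idx_val = np.zeros(n_samples, dtype=bool)
        idx_val[order[k::n_folds]] = True
        idx_train = ~idx_val
        yield X[idx_train], X[idx_val], y[idx_train], y[idx_val], idx_train, idx_val


X = np.arange(12, dtype=float).reshape(-1, 1)
y = np.array([3.0, 9.0, 1.0, 7.0, 5.0, 11.0, 0.0, 8.0, 2.0, 10.0, 4.0, 6.0])
folds = list(kfold_sketch(X, y, n_folds=3))
val_targets = [np.sort(y_val) for _, _, _, y_val, _, _ in folds]
```

With the sorted ordering, each validation fold contains low, middle, and high outcome values, which is the regression analogue of class stratification.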
nnsa.training.feature_importance module
This module contains functions to assess feature importance.
Functions:
drop_col_feat_imp – TODO Adapted from https://towardsdatascience.com/explaining-feature-importance-by-example-of-a-random-forest-d9166011959e.
- nnsa.training.feature_importance.drop_col_feat_imp(model, X_train, y_train, X_test, y_test, random_state=43)[source]
TODO Adapted from https://towardsdatascience.com/explaining-feature-importance-by-example-of-a-random-forest-d9166011959e.
- Parameters:
model – (not documented)
X_train – (not documented)
y_train – (not documented)
random_state – (not documented)
- Returns:
(not documented)
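Since the docstring is still a TODO, the following is only a generic sketch of drop-column feature importance, the technique the function name suggests. It is an assumption that nnsa follows this scheme. `LeastSquares` and `drop_column_importance` are hypothetical names, and the random_state argument of the real signature is not modelled here.

```python
import copy

import numpy as np


class LeastSquares:
    """Toy linear regressor (hypothetical) with a fit/predict/score interface."""

    def fit(self, X, y):
        A = np.column_stack([X, np.ones(len(X))])  # add intercept column
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        return np.column_stack([X, np.ones(len(X))]) @ self.coef_

    def score(self, X, y):
        # Coefficient of determination (R^2).
        residual = np.sum((y - self.predict(X)) ** 2)
        total = np.sum((y - np.mean(y)) ** 2)
        return 1.0 - residual / total


def drop_column_importance(model, X_train, y_train, X_test, y_test):
    """Importance of a column = drop in test score when the column is removed."""
    baseline = copy.deepcopy(model).fit(X_train, y_train).score(X_test, y_test)
    importances = []
    for col in range(X_train.shape[1]):
        # Retrain from scratch without this column.
        reduced = copy.deepcopy(model).fit(np.delete(X_train, col, axis=1), y_train)
        score = reduced.score(np.delete(X_test, col, axis=1), y_test)
        importances.append(baseline - score)
    return np.array(importances)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=200)  # only column 0 drives y
importances = drop_column_importance(LeastSquares(), X[:150], y[:150], X[150:], y[150:])
```

Because the model is retrained once per feature, this approach is accurate but costly for wide feature sets.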
nnsa.training.feature_selection module
Functions:
get_uncorrelating_features – Get a subset of features, which are present in df, and uncorrelated with each other.
select_features – Select features in X.
select_features_rfr – Remove correlating features, keeping features most important for prediction y according to a RandomForestRegressor.
- nnsa.training.feature_selection.get_uncorrelating_features(df, features=None, max_corr=0.8, match_class=False, verbose=1)[source]
Get a subset of features, which are present in df, and uncorrelated with each other.
The order of the features in features determines their priority. E.g., if 2 features correlate significantly, the feature located closer to the beginning of the features list is kept and the other one is removed.
- Parameters:
df (pd.DataFrame) – DataFrame containing the features (columns) for a number of samples (index).
features (list, optional) – list of features in df to find an uncorrelating subset of. This list must be sorted in order of priority. I.e., put features that you want to keep in the beginning of the list. If None, all numeric columns of the df are used as features, favouring features that correlate with most other features. Defaults to None.
max_corr (float, optional) – the maximum allowable correlation coefficient between two features. If the correlation between two features is higher than max_corr, the feature appearing later in features is removed. Defaults to 0.8.
match_class (bool, optional) – if True, each feature is only correlated with features of the same type. Note that the type of the feature is assumed to be coded by the first couple of characters (before the first underscore), e.g. POW_delta. If False, considers correlations between features of different types. Defaults to False.
verbose (int, optional) – verbosity level. Defaults to 1.
- Returns:
uncorrelating_features (dict) – dict where the keys are a subset of features containing only features whose mutual correlation coefficient is lower than max_corr. The value for each key is the list of features that were removed because they correlate significantly with the key feature.
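The priority rule can be sketched as a greedy pass over the feature list. This is an illustration under assumptions, not the nnsa code: `uncorrelated_subset` is a hypothetical helper, it takes a dict of arrays instead of a DataFrame, and it compares the absolute Pearson correlation against max_corr.

```python
import numpy as np


def uncorrelated_subset(data, features, max_corr=0.8):
    """Greedy selection; `features` must be sorted by priority (highest first).

    data: dict mapping feature name -> 1-D array of samples.
    Returns a dict {kept feature: [features removed because of it]}.
    """
    kept = {}
    for name in features:
        for kept_name in kept:
            r = np.corrcoef(data[name], data[kept_name])[0, 1]
            if abs(r) > max_corr:
                # A higher-priority feature already covers this one.
                kept[kept_name].append(name)
                break
        else:
            kept[name] = []
    return kept


base = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
data = {
    "POW_delta": base,
    "POW_theta": 2.0 * base + 1.0,  # perfectly correlated with POW_delta
    "ENT_delta": np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0]),
}
result = uncorrelated_subset(data, ["POW_delta", "POW_theta", "ENT_delta"])
reversed_priority = uncorrelated_subset(data, ["POW_theta", "POW_delta", "ENT_delta"])
```

Swapping the first two entries of the list swaps which of the two correlated features survives, which is why the ordering of features matters.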
- nnsa.training.feature_selection.select_features(X, y=None, how='RFR')[source]
Select features in X.
- Parameters:
X (pd.DataFrame) – feature data, X.values.shape = (n_samples, n_features).
y (np.ndarray, optional) – outcome data with shape (n_samples,). For some options of how, this input is not needed. Defaults to None.
how (str, optional) – how to select the features. Choose from: ‘all’ or None: use all available features. ‘RFR’ or ‘RandomForestRegressor’: do a feature selection step using a random forest regressor. Defaults to ‘RFR’.
- Returns:
X (pd.DataFrame) – data array with selected data.
- nnsa.training.feature_selection.select_features_rfr(X, y, max_corr=0.8, min_imp=0, **kwargs)[source]
Remove correlating features, keeping features most important for prediction y according to a RandomForestRegressor.
- Parameters:
X (pd.DataFrame) – feature data with shape (n_samples, n_features).
y (np.ndarray) – outcome data with shape (n_samples,).
max_corr (float, optional) – maximum allowed correlation. Features are removed that correlate more than this with a more important feature. Defaults to 0.8.
min_imp (float or str, optional) – if a float, only features with at least this importance are selected. If a str, specify ‘q5’ or ‘q10’ to use the 5th or 10th percentile of the importances as the threshold. Defaults to 0.
**kwargs (optional) – optional keyword parameters to pass to RandomForestRegressor.
- Returns:
X (pd.DataFrame) – data array with selected data.
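One way to read the min_imp option is as a threshold that is either given directly or derived from a percentile of the importances. The sketch below makes that concrete. `resolve_min_importance` is a hypothetical helper, and both the parsing of any ‘qN’ string and the use of >= for the comparison are assumptions.

```python
import numpy as np


def resolve_min_importance(importances, min_imp=0):
    """Turn min_imp (a float, or a string such as 'q5' or 'q10') into a threshold."""
    if isinstance(min_imp, str):
        if not min_imp.startswith("q"):
            raise ValueError(f"Unsupported min_imp: {min_imp!r}")
        # 'q10' -> 10th percentile of the importance values.
        return float(np.percentile(importances, float(min_imp[1:])))
    return float(min_imp)


importances = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
threshold = resolve_min_importance(importances, "q10")
selected = importances >= threshold
```

A percentile threshold adapts to the scale of the importances, whereas a fixed float requires knowing that scale in advance.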
nnsa.training.utils module
Functions:
dropna – Drop rows with nans in X or y.
split_data – TODO Deprecate this method.
- nnsa.training.utils.dropna(X, y, verbose=False)[source]
Drop rows with nans in X or y.
- Parameters:
X (np.ndarray) – data array with dimensions (n_samples, n_features).
y (np.ndarray) – data array with dimensions (n_samples,) or (n_samples, 1).
verbose (bool, optional) – if True, prints the fraction of rows that is removed.
- Returns:
X, y (np.ndarray) – the input arrays with all rows removed that contain a NaN in X or y.
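The row filtering can be sketched with a boolean mask, assuming a row is dropped when any entry of X or y in that row is NaN. `dropna_sketch` is a hypothetical stand-in, and the wording of the verbose message is invented.

```python
import numpy as np


def dropna_sketch(X, y, verbose=False):
    """Return X and y with every row removed that has a NaN in X or y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    nan_in_X = np.isnan(X).any(axis=1)
    # Reshape so that y of shape (n_samples,) and (n_samples, 1) both work.
    nan_in_y = np.isnan(y.reshape(len(y), -1)).any(axis=1)
    keep = ~(nan_in_X | nan_in_y)
    if verbose:
        print(f"Removed {np.mean(~keep):.1%} of rows")
    return X[keep], y[keep]


X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [6.0, 7.0]])
y = np.array([0.5, 1.5, np.nan, 2.5])
X_clean, y_clean = dropna_sketch(X, y)
```

Applying one shared mask to both arrays keeps X and y aligned after filtering.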
- nnsa.training.utils.split_data(x, y, train_frac=0.75, shuffle=True)[source]
TODO Deprecate this method. Use sklearn.model_selection.train_test_split instead.
Split the data samples into a train set and test set.
- Parameters:
x (np.ndarray) – array with (feature) data. The first axis must correspond to the samples.
y (np.ndarray) – array with labels. The first axis must correspond to the samples.
train_frac (float, optional) – fraction of data to keep in train set. Defaults to 0.75.
shuffle (bool, optional) – if True, the data is shuffled before splitting. If False, the data is not shuffled. Defaults to True.
- Returns:
x_train (np.ndarray) – train set features.
y_train (np.ndarray) – train set labels.
x_test (np.ndarray) – test set features.
y_test (np.ndarray) – test set labels.
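For migrating away from split_data, a roughly equivalent call with scikit-learn might look like the sketch below. The data and the random_state value are illustrative. Note that train_test_split returns its outputs in a different order (x_train, x_test, y_train, y_test) than the order listed above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(16, dtype=float).reshape(8, 2)  # 8 samples, 2 features
y = np.arange(8)

# train_size plays the role of train_frac; shuffle has the same meaning.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, train_size=0.75, shuffle=True, random_state=0
)
```

Passing random_state makes the shuffled split reproducible across runs.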