Feature Selection

class endgame.feature_selection.UnivariateSelector(score_func='f_classif', mode='k_best', k=10, percentile=10, alpha=0.05, random_state=None)

Bases: TransformerMixin, BaseEstimator

Unified univariate feature selection.

Selects features based on univariate statistical tests. Supports various scoring functions and selection modes.

Parameters:

score_func (str or callable, default='f_classif') – Scoring function. Options:
  - 'f_classif': ANOVA F-test for classification
  - 'f_regression': F-test for regression
  - 'mutual_info_classif': Mutual information for classification
  - 'mutual_info_regression': Mutual information for regression
  - 'chi2': Chi-squared test (requires non-negative features)
  - callable: Custom function (X, y) -> (scores, pvalues)
mode (str, default='k_best') – Selection mode:
  - 'k_best': Select top k features
  - 'percentile': Select top percentile
  - 'fpr': Select by false positive rate
  - 'fdr': Select by false discovery rate
  - 'fwe': Select by family-wise error
k (int, default=10) – Number of features to select (for k_best mode).
percentile (int, default=10) – Percentage of top-scoring features to keep (for percentile mode).
alpha (float, default=0.05) – Significance threshold applied to the p-values in the fpr, fdr, and fwe modes.
random_state (int, optional) – Random seed for mutual information estimation.
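
The three p-value modes listed under mode can be illustrated with their textbook definitions. The sketch below is not taken from the endgame source: the function name select_by_pvalue is hypothetical, and it assumes fpr means an uncorrected per-feature test, fwe a Bonferroni correction, and fdr the Benjamini-Hochberg procedure.

```python
import numpy as np

def select_by_pvalue(pvalues, mode="fpr", alpha=0.05):
    """Return a boolean mask of selected features (hypothetical helper)."""
    pvalues = np.asarray(pvalues, dtype=float)
    n_features = pvalues.size
    if mode == "fpr":
        # Uncorrected test: keep every feature with p < alpha.
        return pvalues < alpha
    if mode == "fwe":
        # Bonferroni correction: divide alpha by the number of tests.
        return pvalues < alpha / n_features
    if mode == "fdr":
        # Benjamini-Hochberg: compare the i-th smallest p-value with alpha * i / n.
        sorted_p = np.sort(pvalues)
        limits = alpha * np.arange(1, n_features + 1) / n_features
        passing = sorted_p[sorted_p <= limits]
        if passing.size == 0:
            return np.zeros(n_features, dtype=bool)
        return pvalues <= passing.max()
    raise ValueError(f"unknown mode: {mode!r}")

pvalues = [0.001, 0.008, 0.025, 0.045, 0.2]
fpr_mask = select_by_pvalue(pvalues, "fpr")
fdr_mask = select_by_pvalue(pvalues, "fdr")
fwe_mask = select_by_pvalue(pvalues, "fwe")
```

On the same p-values the modes get progressively stricter: fpr keeps the most features, fdr fewer, and fwe the fewest.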

scores_¶

Scores for each feature.

Type:: ndarray

pvalues_¶

P-values for each feature (if available).

Type:: ndarray

selected_features_¶

Indices of selected features.

Type:: ndarray

Example

>>> from endgame.feature_selection import UnivariateSelector
>>> selector = UnivariateSelector(score_func='mutual_info_classif', k=20)
>>> X_selected = selector.fit_transform(X, y)

SCORE_FUNCS = {'chi2': <function chi2>, 'f_classif': <function f_classif>, 'f_regression': <function f_regression>, 'mutual_info_classif': <function mutual_info_classif>, 'mutual_info_regression': <function mutual_info_regression>}¶

fit(X, y)[source]¶

Fit the selector.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.

Returns:

self (UnivariateSelector)

transform(X)[source]¶

Select features.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_selected (ndarray) – Data with selected features.

fit_transform(X, y)[source]¶

Fit and transform.

Return type:: ndarray

get_support(indices=False)[source]¶

Get mask or indices of selected features.

Return type:: ndarray
Parameters:: indices (bool)

get_feature_scores()[source]¶

Get feature scores.

Return type:: ndarray

class endgame.feature_selection.MutualInfoSelector(k=10, task='classification', n_neighbors=3, random_state=None)

Bases: UnivariateSelector

Mutual information-based feature selection.

Convenience wrapper for mutual information scoring, which captures nonlinear dependencies.

Parameters:

k (int, default=10) – Number of features to select.
task (str, default='classification') – Task type: ‘classification’ or ‘regression’.
n_neighbors (int, default=3) – Number of neighbors for MI estimation.
random_state (int, optional) – Random seed.

Example

>>> selector = MutualInfoSelector(k=20, task='classification')
>>> X_selected = selector.fit_transform(X, y)

class endgame.feature_selection.FTestSelector(k=10, task='classification')

Bases: UnivariateSelector

F-test based feature selection.

Uses the ANOVA F-test for classification or the univariate linear-regression F-test for regression. A fast baseline that detects only linear dependence between a feature and the target.

Parameters:

k (int, default=10) – Number of features to select.
task (str, default='classification') – Task type: ‘classification’ or ‘regression’.

Example

>>> selector = FTestSelector(k=20)
>>> X_selected = selector.fit_transform(X, y)

class endgame.feature_selection.Chi2Selector(k=10)

Bases: UnivariateSelector

Chi-squared feature selection.

Intended for categorical features (for example counts or one-hot indicators) paired with a categorical target. All feature values must be non-negative.

Parameters:: k (int, default=10) – Number of features to select.

Example

>>> selector = Chi2Selector(k=20)
>>> X_selected = selector.fit_transform(X_categorical, y)

class endgame.feature_selection.MRMRSelector(n_features=10, task='classification', relevance_func='mutual_info', redundancy_func='pearson', n_neighbors=3, random_state=None, verbose=False)

Bases: TransformerMixin, BaseEstimator

Minimum Redundancy Maximum Relevance feature selection.

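MRMR trades off feature relevance (a high score against the target, mutual information by default) against redundancy (similarity to already-selected features, absolute Pearson correlation by default).

Features are added greedily: at each step, the candidate that maximizes relevance - redundancy is selected.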
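
The greedy relevance-minus-redundancy rule can be sketched in NumPy. This is an illustration under stated assumptions, not the endgame implementation: the function name mrmr_select is hypothetical, relevance is taken as absolute Pearson correlation with the target (standing in for mutual information or the F-statistic), and redundancy is averaged over the already-selected features.

```python
import numpy as np

def mrmr_select(X, y, n_features):
    """Greedy mRMR sketch; returns feature indices in order of selection."""
    n_total = X.shape[1]
    # Relevance: |Pearson correlation| between each feature and the target.
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_total)]
    )
    # Pairwise |Pearson correlation| between features, used for redundancy.
    feature_corr = np.abs(np.corrcoef(X, rowvar=False))
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_features:
        remaining = [j for j in range(n_total) if j not in selected]
        # Redundancy: mean correlation with the features picked so far.
        redundancy = feature_corr[np.ix_(remaining, selected)].mean(axis=1)
        scores = relevance[remaining] - redundancy
        selected.append(remaining[int(np.argmax(scores))])
    return selected

rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = x0 + 0.01 * rng.normal(size=500)   # near-duplicate of x0
x2 = rng.normal(size=500)               # weaker but independent signal
x3 = rng.normal(size=500)               # pure noise
X = np.column_stack([x0, x1, x2, x3])
y = x0 + 0.5 * x2 + 0.1 * rng.normal(size=500)
selected = mrmr_select(X, y, n_features=2)
```

A plain top-2 relevance ranking would pick both x0 and its near-duplicate x1. The redundancy penalty makes the second pick x2 instead.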

Parameters:

n_features (int, default=10) – Number of features to select.
task (str, default='classification') – Task type: ‘classification’ or ‘regression’.
relevance_func (str, default='mutual_info') – Function for computing relevance:
  - 'mutual_info': Mutual information
  - 'f_test': F-statistic
redundancy_func (str, default='pearson') – Function for computing redundancy:
  - 'pearson': Absolute Pearson correlation
  - 'mutual_info': Mutual information between features
n_neighbors (int, default=3) – Number of neighbors for MI estimation.
random_state (int, optional) – Random seed.
verbose (bool, default=False) – Whether to print selection progress.

selected_features_¶

Indices of selected features in order of selection.

Type:: ndarray

relevance_scores_¶

Relevance scores for all features.

Type:: ndarray

ranking_¶

Full feature ranking.

Type:: ndarray

Example

>>> from endgame.feature_selection import MRMRSelector
>>> selector = MRMRSelector(n_features=20)
>>> X_selected = selector.fit_transform(X, y)
>>> print(f"Selected features: {selector.selected_features_}")

fit(X, y)[source]¶

Fit the MRMR selector.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.

Returns:

self (MRMRSelector)

transform(X)[source]¶

Select features.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_selected (ndarray) – Data with selected features.

fit_transform(X, y)[source]¶

Fit and transform.

Return type:: ndarray

get_support(indices=False)[source]¶

Get mask or indices of selected features.

Return type:: ndarray
Parameters:: indices (bool)

get_feature_ranking()[source]¶

Get complete feature ranking.

Return type:: ndarray

class endgame.feature_selection.ReliefFSelector(n_features=10, n_neighbors=10, n_samples=1.0, algorithm='relieff', random_state=None, verbose=False)

Bases: TransformerMixin, BaseEstimator

ReliefF feature selection algorithm.

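ReliefF is an instance-based feature weighting algorithm that can detect features whose relevance depends on interactions with other features. It scores each feature by how well its values separate an instance from its nearest neighbors of a different class (near misses) while staying close to its nearest neighbors of the same class (near hits).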
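
The weighting idea can be shown with the original Relief update, which is simpler than ReliefF. The sketch below is an assumption-laden illustration rather than the library's code: relief_weights is a hypothetical name, it uses a single nearest hit and miss per instance (ReliefF averages over n_neighbors and handles multiple classes), and it assumes two classes with at least two samples each and no constant features.

```python
import numpy as np

def relief_weights(X, y):
    """Basic Relief weights for a two-class problem (simplified sketch)."""
    # Scale every feature to [0, 1] so per-feature differences are comparable.
    span = X.max(axis=0) - X.min(axis=0)
    X = (X - X.min(axis=0)) / span
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    for i in range(n_samples):
        dist = np.abs(X - X[i]).sum(axis=1)   # Manhattan distance to all rows
        dist[i] = np.inf                      # never pick the instance itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class row
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class row
        # Reward features that differ on the miss, penalize those that differ on the hit.
        weights += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return weights / n_samples

rng = np.random.default_rng(0)
y = np.repeat([0, 1], 50)
X = np.column_stack([
    y + 0.1 * rng.normal(size=100),   # feature 0 tracks the class label
    rng.uniform(size=100),            # feature 1 is noise
])
weights = relief_weights(X, y)
```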

Parameters:

n_features (int, default=10) – Number of features to select.
n_neighbors (int, default=10) – Number of neighbors to consider for each instance.
n_samples (int or float, default=1.0) – Number of samples to use for estimation.
  - If int, uses that many samples.
  - If float (0-1), uses that fraction of samples.
algorithm (str, default='relieff') – Algorithm variant:
  - 'relieff': Standard ReliefF
  - 'multisurf': MultiSURF (adaptive radius)
random_state (int, optional) – Random seed.
verbose (bool, default=False) – Whether to print progress.

feature_importances_¶

Feature importance scores.

Type:: ndarray

selected_features_¶

Indices of selected features.

Type:: ndarray

ranking_¶

Feature ranking by importance.

Type:: ndarray

Example

>>> from endgame.feature_selection import ReliefFSelector
>>> selector = ReliefFSelector(n_features=20, n_neighbors=10)
>>> X_selected = selector.fit_transform(X, y)

fit(X, y)[source]¶

Fit the ReliefF selector.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target labels (must be discrete for classification).

Returns:

self (ReliefFSelector)

transform(X)[source]¶

Select features.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_selected (ndarray) – Data with selected features.

fit_transform(X, y)[source]¶

Fit and transform.

Return type:: ndarray

get_support(indices=False)[source]¶

Get mask or indices of selected features.

Return type:: ndarray
Parameters:: indices (bool)

get_feature_ranking()[source]¶

Get complete feature ranking.

Return type:: ndarray

class endgame.feature_selection.CorrelationSelector(threshold=0.95, method='pearson', keep='first')

Bases: TransformerMixin, BaseEstimator

Remove highly correlated features.

Identifies and removes features that are highly correlated with other features, keeping only one from each correlated group.
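
One simple way to realize this, shown here only as a sketch, is a single left-to-right pass over the absolute correlation matrix. The function name features_to_drop is hypothetical, and the pass assumes Pearson correlation and a keep-the-first policy.

```python
import numpy as np

def features_to_drop(X, threshold=0.95):
    """Indices of features correlated above threshold with an earlier kept feature."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    dropped = []
    for j in range(corr.shape[1]):
        # Compare only against earlier features that are still kept.
        kept_earlier = [i for i in range(j) if i not in dropped]
        if any(corr[i, j] > threshold for i in kept_earlier):
            dropped.append(j)
    return dropped

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([
    a,                                   # kept (first of its group)
    2 * a + 1,                           # exact linear copy of a
    b,                                   # independent, kept
    a + 0.05 * rng.normal(size=200),     # noisy copy of a
])
dropped = features_to_drop(X, threshold=0.95)
```

Comparing only against features that are still kept avoids dropping a feature merely because it correlates with one that was already removed.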

Parameters:

threshold (float, default=0.95) – Correlation threshold. Features with correlation above this are considered redundant.
method (str, default='pearson') – Correlation method:
  - 'pearson': Pearson correlation (linear)
  - 'spearman': Spearman rank correlation (monotonic)
  - 'kendall': Kendall tau correlation (ordinal)
keep (str, default='first') – Which feature to keep from correlated pairs:
  - 'first': Keep the first feature encountered
  - 'variance': Keep the feature with higher variance
  - 'target_corr': Keep the feature with higher target correlation

features_to_drop_¶

Indices of features to drop.

Type:: list

selected_features_¶

Indices of selected features.

Type:: ndarray

correlation_matrix_¶

Computed correlation matrix.

Type:: ndarray

Example

>>> from endgame.feature_selection import CorrelationSelector
>>> selector = CorrelationSelector(threshold=0.90)
>>> X_reduced = selector.fit_transform(X)

fit(X, y=None)[source]¶

Fit the correlation selector.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like, optional) – Target values (required if keep=’target_corr’).

Returns:

self (CorrelationSelector)

transform(X)[source]¶

Remove correlated features.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_selected (ndarray) – Data with correlated features removed.

fit_transform(X, y=None)[source]¶

Fit and transform.

Return type:: ndarray

get_support(indices=False)[source]¶

Get mask or indices of selected features.

Return type:: ndarray
Parameters:: indices (bool)

get_correlated_pairs()[source]¶

Get pairs of highly correlated features.

Return type:: list
Returns:: pairs (list of tuples) – Each tuple is (feature_i, feature_j, correlation).

class endgame.feature_selection.RFESelector(estimator=None, n_features=None, step=1, cv=5, scoring=None, min_features_to_select=1, verbose=0)

Bases: TransformerMixin, BaseEstimator

Recursive Feature Elimination feature selection.

RFE iteratively removes the least important features based on model coefficients or feature importances.
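
The elimination loop can be sketched without any estimator class. The example below is illustrative only: rfe_select is a hypothetical name, ordinary least squares (without an intercept, on roughly centered features of comparable scale) stands in for the estimator, and one feature is removed per iteration (step=1).

```python
import numpy as np

def rfe_select(X, y, n_features):
    """Recursive feature elimination with least-squares coefficients as importances."""
    active = list(range(X.shape[1]))
    ranking = np.ones(X.shape[1], dtype=int)   # selected features keep rank 1
    while len(active) > n_features:
        coef, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
        # Drop the active feature with the smallest absolute coefficient.
        worst = active[int(np.argmin(np.abs(coef)))]
        active.remove(worst)
        # Features eliminated later get better (smaller) ranks.
        ranking[worst] = len(active) - n_features + 2
    return active, ranking

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 3] + 0.1 * rng.normal(size=200)
selected, ranking = rfe_select(X, y, n_features=2)
```

The model is refit after every removal, so a feature's importance is always judged relative to the features still in play.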

Parameters:

estimator (BaseEstimator, optional) – Model to use for feature ranking. Must have coef_ or feature_importances_ attribute. Default: RandomForest.
n_features (int, float, or None, default=None) – Number of features to select:
  - If int, select that many features.
  - If float (0-1), select that fraction of features.
  - If None, use cross-validation to find the optimal number.
step (int or float, default=1) – Number of features to remove at each iteration:
  - If int (>= 1), remove that many features.
  - If float (0-1), remove that fraction of features.
cv (int, default=5) – Cross-validation folds (used when n_features=None).
scoring (str, optional) – Scoring metric for RFECV.
min_features_to_select (int, default=1) – Minimum features for RFECV.
verbose (int, default=0) – Verbosity level.

selected_features_¶

Indices of selected features.

Type:: ndarray

ranking_¶

Feature ranking (1 = selected).

Type:: ndarray

n_features_¶

Number of selected features.

Type:: int

estimator_¶

Fitted estimator used for final ranking.

Type:: BaseEstimator

Example

>>> from endgame.feature_selection import RFESelector
>>> selector = RFESelector(n_features=20)
>>> X_selected = selector.fit_transform(X, y)

fit(X, y)[source]¶

Fit the RFE selector.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.

Returns:

self (RFESelector)

transform(X)[source]¶

Select features.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_selected (ndarray) – Data with selected features.

fit_transform(X, y)[source]¶

Fit and transform.

Return type:: ndarray

get_support(indices=False)[source]¶

Get mask or indices of selected features.

Return type:: ndarray
Parameters:: indices (bool)

get_feature_ranking()[source]¶

Get feature ranking (1 = selected).

Return type:: ndarray

class endgame.feature_selection.BorutaSelector(estimator=None, n_estimators='auto', max_iter=100, alpha=0.05, perc=100, two_step=True, random_state=None, verbose=0)

Bases: TransformerMixin, BaseEstimator

Boruta all-relevant feature selection algorithm.

Boruta is a wrapper around Random Forest. It creates “shadow” features (shuffled copies of real features) and selects features that have significantly higher importance than the best shadow feature.

Boruta is an all-relevant method: it aims to find every feature that carries information about the target, not just a minimal predictive subset.
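
The statistical core of Boruta is a binomial test on how often a feature beat the shadow threshold. The sketch below covers only that decision rule and rests on assumptions: the names binomial_tail and boruta_decision are hypothetical, a "hit" means the feature's importance exceeded the shadow threshold in one iteration, the null hit probability is 0.5, and no multiple-testing correction (the two_step option) is applied.

```python
from math import comb

def binomial_tail(hits, n_iter, upper=True):
    """P(X >= hits) if upper else P(X <= hits), for X ~ Binomial(n_iter, 0.5)."""
    ks = range(hits, n_iter + 1) if upper else range(0, hits + 1)
    return sum(comb(n_iter, k) for k in ks) / 2 ** n_iter

def boruta_decision(hits, n_iter, alpha=0.05):
    """Classify one feature from its hit count over n_iter iterations."""
    if binomial_tail(hits, n_iter, upper=True) < alpha:
        return "confirmed"    # beats the shadows far more often than chance
    if binomial_tail(hits, n_iter, upper=False) < alpha:
        return "rejected"     # beats the shadows far less often than chance
    return "tentative"        # not enough evidence either way

decisions = [boruta_decision(h, n_iter=20) for h in (18, 10, 2)]
```

Features that are still tentative when max_iter is reached correspond to the borderline group reported in tentative_features_.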

Parameters:

estimator (BaseEstimator, optional) – Tree-based model with feature_importances_. Default: RandomForest.
n_estimators (int or 'auto', default='auto') – Number of trees. ‘auto’ uses heuristic based on features.
max_iter (int, default=100) – Maximum iterations.
alpha (float, default=0.05) – Significance level for the binomial test.
perc (int, default=100) – Percentile of shadow feature importance distribution to use as threshold. 100 = max (original Boruta).
two_step (bool, default=True) – If True, use two-step correction for multiple testing.
random_state (int, optional) – Random seed.
verbose (int, default=0) – Verbosity level.

selected_features_¶

Indices of confirmed features.

Type:: ndarray

tentative_features_¶

Indices of tentative features (borderline).

Type:: ndarray

rejected_features_¶

Indices of rejected features.

Type:: ndarray

ranking_¶

Feature ranking (1 = confirmed, 2 = tentative, 3 = rejected).

Type:: ndarray

n_features_¶

Number of confirmed features.

Type:: int

feature_importances_¶

Mean feature importances across iterations.

Type:: ndarray

Example

>>> from endgame.feature_selection import BorutaSelector
>>> selector = BorutaSelector(max_iter=100)
>>> X_selected = selector.fit_transform(X, y)
>>> print(f"Confirmed: {len(selector.selected_features_)}")
>>> print(f"Tentative: {len(selector.tentative_features_)}")

fit(X, y)[source]¶

Fit the Boruta selector.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.

Returns:

self (BorutaSelector)

transform(X)[source]¶

Select confirmed features.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_selected (ndarray) – Data with confirmed features.

fit_transform(X, y)[source]¶

Fit and transform.

Return type:: ndarray

get_support(indices=False)[source]¶

Get mask or indices of selected features.

Return type:: ndarray
Parameters:: indices (bool)

get_all_relevant(include_tentative=True)[source]¶

Get all potentially relevant features.

Parameters:: include_tentative (bool, default=True) – Whether to include tentative features.
Return type:: ndarray
Returns:: features (ndarray) – Indices of relevant features.

class endgame.feature_selection.SequentialSelector(estimator=None, n_features='auto', direction='forward', scoring=None, cv=5, tol=None, n_jobs=None, verbose=0)

Bases: TransformerMixin, BaseEstimator

Sequential feature selection.

Implements greedy forward selection, backward elimination, or bidirectional search for a well-performing feature subset.
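
Forward selection can be sketched as a loop that repeatedly adds the feature giving the best validation score. The example is a simplification with hypothetical names (holdout_r2, forward_select): it scores candidates with a least-squares fit on a single train/validation split rather than k-fold cross-validation, and always selects a fixed number of features.

```python
import numpy as np

def add_bias(A):
    """Prepend a column of ones so the linear fit includes an intercept."""
    return np.column_stack([np.ones(len(A)), A])

def holdout_r2(X_train, y_train, X_val, y_val):
    """R^2 on the validation split of a least-squares fit on the training split."""
    coef, *_ = np.linalg.lstsq(add_bias(X_train), y_train, rcond=None)
    residual = y_val - add_bias(X_val) @ coef
    return 1.0 - residual @ residual / np.sum((y_val - y_val.mean()) ** 2)

def forward_select(X, y, n_features):
    """Start empty and greedily add the feature that most improves the score."""
    half = len(X) // 2
    train, val = slice(0, half), slice(half, None)
    selected = []
    while len(selected) < n_features:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        scores = [
            holdout_r2(X[train][:, selected + [j]], y[train],
                       X[val][:, selected + [j]], y[val])
            for j in candidates
        ]
        selected.append(candidates[int(np.argmax(scores))])
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = 2 * X[:, 1] + X[:, 4] + 0.1 * rng.normal(size=300)
selected = forward_select(X, y, n_features=2)
```

Backward elimination mirrors this loop: start from all features and repeatedly remove the one whose removal hurts the score least.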

Parameters:

estimator (BaseEstimator, optional) – Model to use for evaluation. Default: LogisticRegression.
n_features (int, float, or 'auto', default='auto') – Number of features to select:
  - If int, select that many features.
  - If float (0-1), select that fraction.
  - If 'auto', use cross-validation to find optimal.
direction (str, default='forward') – Search direction:
  - 'forward': Start empty, add features
  - 'backward': Start full, remove features
  - 'bidirectional': Both directions (floating)
scoring (str, optional) – Scoring metric.
cv (int, default=5) – Cross-validation folds.
tol (float, optional) – Tolerance for early stopping (only for sklearn >= 1.1).
n_jobs (int, default=None) – Number of parallel jobs.
verbose (int, default=0) – Verbosity level.

selected_features_¶

Indices of selected features.

Type:: ndarray

n_features_¶

Number of selected features.

Type:: int

scores_¶

Cross-validation scores at each step.

Type:: dict

Example

>>> from endgame.feature_selection import SequentialSelector
>>> selector = SequentialSelector(n_features=10, direction='forward')
>>> X_selected = selector.fit_transform(X, y)

fit(X, y)[source]¶

Fit the sequential selector.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.

Returns:

self (SequentialSelector)

transform(X)[source]¶

Select features.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_selected (ndarray) – Data with selected features.

fit_transform(X, y)[source]¶

Fit and transform.

Return type:: ndarray

get_support(indices=False)[source]¶

Get mask or indices of selected features.

Return type:: ndarray
Parameters:: indices (bool)

class endgame.feature_selection.GeneticSelector(estimator=None, population_size=50, n_generations=100, mutation_rate=0.1, crossover_rate=0.8, tournament_size=3, elitism=2, min_features=1, max_features=None, scoring=None, cv=5, early_stopping=None, random_state=None, verbose=0)

Bases: TransformerMixin, BaseEstimator

Genetic algorithm for feature selection.

Evolves feature subsets using selection, crossover, and mutation to optimize cross-validation score.
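
The evolutionary loop can be sketched on boolean feature masks using only the standard library. Everything here is illustrative: evolve and toy_fitness are hypothetical names, the crossover is single-point, and the fitness is a toy function (agreement with a fixed target mask) standing in for a cross-validation score.

```python
import random

def evolve(fitness, n_genes, population_size=20, n_generations=30,
           mutation_rate=0.1, crossover_rate=0.8, tournament_size=3,
           elitism=2, seed=0):
    """Evolve boolean masks; returns the best mask and the per-generation best score."""
    rng = random.Random(seed)
    population = [[rng.random() < 0.5 for _ in range(n_genes)]
                  for _ in range(population_size)]
    history = []
    for _ in range(n_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        # Elitism: copy the best individuals unchanged into the next generation.
        next_gen = [list(ind) for ind in ranked[:elitism]]
        while len(next_gen) < population_size:
            # Tournament selection: best of a random sample, done twice.
            p1 = max(rng.sample(ranked, tournament_size), key=fitness)
            p2 = max(rng.sample(ranked, tournament_size), key=fitness)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_genes)        # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = list(p1)
            # Bit-flip mutation, applied independently to each gene.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            if not any(child):                         # keep at least one feature
                child[rng.randrange(n_genes)] = True
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness), history

# Toy fitness: number of genes that agree with a fixed target mask.
TARGET = [True, False, True, True, False, False, True, False]

def toy_fitness(mask):
    return sum(g == t for g, t in zip(mask, TARGET))

best, history = evolve(toy_fitness, n_genes=len(TARGET))
```

Because the elite individuals are carried over unchanged, the best score recorded in history never decreases from one generation to the next.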

Parameters:

estimator (BaseEstimator, optional) – Model to use for fitness evaluation.
population_size (int, default=50) – Size of the population.
n_generations (int, default=100) – Number of generations.
mutation_rate (float, default=0.1) – Probability of mutating each gene (feature).
crossover_rate (float, default=0.8) – Probability of crossover between parents.
tournament_size (int, default=3) – Number of individuals in tournament selection.
elitism (int, default=2) – Number of best individuals to keep unchanged.
min_features (int, default=1) – Minimum number of features to select.
max_features (int or float, optional) – Maximum features. If float, fraction of total.
scoring (str, optional) – Scoring metric.
cv (int, default=5) – Cross-validation folds.
early_stopping (int, optional) – Stop if no improvement for this many generations.
random_state (int, optional) – Random seed.
verbose (int, default=0) – Verbosity level.

selected_features_¶

Indices of selected features.

Type:: ndarray

best_score_¶

Best cross-validation score achieved.

Type:: float

history_¶

Best score at each generation.

Type:: list

n_features_¶

Number of selected features.

Type:: int

Example

>>> from endgame.feature_selection import GeneticSelector
>>> selector = GeneticSelector(n_generations=50, population_size=30)
>>> X_selected = selector.fit_transform(X, y)

fit(X, y)[source]¶

Fit the genetic selector.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.

Returns:

self (GeneticSelector)

transform(X)[source]¶

Select features.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_selected (ndarray) – Data with selected features.

fit_transform(X, y)[source]¶

Fit and transform.

Return type:: ndarray

get_support(indices=False)[source]¶

Get mask or indices of selected features.

Return type:: ndarray
Parameters:: indices (bool)

class endgame.feature_selection.PermutationSelector(estimator, n_features=10, n_repeats=10, scoring=None, threshold=None, use_pimp=False, alpha=0.05, random_state=None, n_jobs=None)

Bases: TransformerMixin, BaseEstimator

Feature selection based on permutation importance.

Often more reliable than model-specific importances (such as impurity-based tree importances) because it measures how much a scoring metric degrades when a feature's values are shuffled. Can optionally compute p-values (PIMP) for statistical significance.
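
Permutation importance itself is short enough to sketch in NumPy. The function below is a generic illustration, not the endgame code: permutation_importance here is a local helper, the model is any predict callable, and mean squared error is assumed as the metric, so importance is the increase in error after shuffling one column.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean and std of the MSE increase caused by shuffling each feature."""
    rng = np.random.default_rng(seed)

    def mse(X_eval):
        return float(np.mean((y - predict(X_eval)) ** 2))

    baseline = mse(X)
    increases = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Shuffling one column breaks its link with the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases[j, r] = mse(X_perm) - baseline
    return increases.mean(axis=1), increases.std(axis=1)

rng = np.random.default_rng(1)
X_val = rng.normal(size=(200, 3))
coef = np.array([2.0, 0.0, 0.5])            # feature 1 is unused by the model
y_val = X_val @ coef + 0.1 * rng.normal(size=200)
mean_imp, std_imp = permutation_importance(lambda A: A @ coef, X_val, y_val)
```

A feature the model ignores gets an importance of zero regardless of how it correlates with the target, which is the sense in which the method measures predictive contribution.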

Parameters:

estimator (BaseEstimator) – Fitted model or model to fit.
n_features (int or float, default=10) – Number of features to select:
  - If int, select that many features.
  - If float (0-1), select that fraction.
n_repeats (int, default=10) – Number of permutation repetitions.
scoring (str, optional) – Scoring metric.
threshold (float, optional) – Minimum importance threshold. If set, overrides n_features.
use_pimp (bool, default=False) – Whether to compute p-values using PIMP (permutation importance with p-values). More statistically rigorous.
alpha (float, default=0.05) – Significance level for PIMP.
random_state (int, optional) – Random seed.
n_jobs (int, default=None) – Number of parallel jobs.

feature_importances_¶

Permutation importance for each feature.

Type:: ndarray

importance_std_¶

Standard deviation of importance across repeats.

Type:: ndarray

pvalues_¶

P-values for each feature (if use_pimp=True).

Type:: ndarray

selected_features_¶

Indices of selected features.

Type:: ndarray

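Example

>>> from sklearn.ensemble import RandomForestClassifier
>>> from endgame.feature_selection import PermutationSelector
>>> # Any scikit-learn-compatible estimator works; a random forest is used for illustration
>>> model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
>>> selector = PermutationSelector(estimator=model, n_features=20)
>>> # Importances are computed on held-out data because the model is already fitted
>>> X_selected = selector.fit_transform(X_val, y_val)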

fit(X, y)[source]¶

Fit the permutation selector.

Parameters:

X (array-like of shape (n_samples, n_features)) – Data on which permutation importance is computed. If the estimator is already fitted, this should be a held-out validation set rather than the training data.
y (array-like of shape (n_samples,)) – Target values.

Returns:

self (PermutationSelector)

transform(X)[source]¶

Select features.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_selected (ndarray) – Data with selected features.

fit_transform(X, y)[source]¶

Fit and transform.

Return type:: ndarray

get_support(indices=False)[source]¶

Get mask or indices of selected features.

Return type:: ndarray
Parameters:: indices (bool)

get_feature_ranking()[source]¶

Get feature ranking by importance.

Return type:: ndarray

class endgame.feature_selection.SHAPSelector(estimator, n_features=10, explainer_type='auto', background_samples=100, max_samples=None, check_additivity=False, random_state=None)

Bases: TransformerMixin, BaseEstimator

Feature selection based on SHAP values.

Uses mean absolute SHAP values as feature importance. SHAP values derive from Shapley values in cooperative game theory, which gives them a more formal theoretical basis than permutation importance.
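
For a linear model with independent features, SHAP values have a closed form, which makes the mean-absolute-SHAP importance easy to illustrate without the shap package. The sketch below relies on that assumption (it is the special case a linear explainer handles, not what a tree or kernel explainer computes); linear_shap_values is a hypothetical helper.

```python
import numpy as np

def linear_shap_values(coef, X, background):
    """SHAP values of f(x) = intercept + coef @ x, assuming independent features."""
    # Each feature's contribution is its coefficient times its deviation
    # from the background mean.
    return coef * (X - background.mean(axis=0))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
coef, intercept = np.array([3.0, -1.0, 0.0]), 0.5

shap_values = linear_shap_values(coef, X, background=X)
importances = np.abs(shap_values).mean(axis=0)   # mean |SHAP| per feature
ranking = np.argsort(importances)[::-1]          # most important first

# Additivity: base value plus the per-feature contributions equals the prediction.
base_value = intercept + coef @ X.mean(axis=0)
predictions = intercept + X @ coef
```

The additivity identity in the last lines is the property that the check_additivity option verifies for real explainers.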

Parameters:

estimator (BaseEstimator) – Fitted model.
n_features (int or float, default=10) – Number of features to select:
  - If int, select that many features.
  - If float (0-1), select that fraction.
explainer_type (str, default='auto') – Type of SHAP explainer:
  - 'auto': Auto-detect based on model type
  - 'tree': TreeExplainer (fast for tree models)
  - 'linear': LinearExplainer
  - 'kernel': KernelExplainer (model-agnostic, slow)
  - 'deep': DeepExplainer (for neural networks)
background_samples (int, default=100) – Number of background samples for KernelExplainer.
max_samples (int, optional) – Maximum samples to use for SHAP computation.
check_additivity (bool, default=False) – Whether to verify SHAP additivity (slower).
random_state (int, optional) – Random seed.

shap_values_¶

SHAP values for each sample and feature.

Type:: ndarray

feature_importances_¶

Mean absolute SHAP values for each feature.

Type:: ndarray

selected_features_¶

Indices of selected features.

Type:: ndarray

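Example

>>> from sklearn.ensemble import RandomForestClassifier
>>> from endgame.feature_selection import SHAPSelector
>>> # Any fitted model supported by SHAP works; a random forest is used for illustration
>>> model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
>>> selector = SHAPSelector(estimator=model, n_features=20)
>>> X_selected = selector.fit_transform(X_val, y_val)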

fit(X, y=None)[source]¶

Fit the SHAP selector.

Parameters:

X (array-like of shape (n_samples, n_features)) – Data to compute SHAP values on.
y (Ignored)

Returns:

self (SHAPSelector)

transform(X)[source]¶

Select features.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_selected (ndarray) – Data with selected features.

fit_transform(X, y=None)[source]¶

Fit and transform.

Return type:: ndarray

get_support(indices=False)[source]¶

Get mask or indices of selected features.

Return type:: ndarray
Parameters:: indices (bool)

get_feature_ranking()[source]¶

Get feature ranking by SHAP importance.

Return type:: ndarray

get_interaction_values(X)[source]¶

Get SHAP interaction values.

Only available for TreeExplainer.

Parameters:: X (array-like) – Data to compute interactions on.
Return type:: ndarray
Returns:: interaction_values (ndarray of shape (n_samples, n_features, n_features))

class endgame.feature_selection.TreeImportanceSelector(estimator=None, n_features='mean', importance_type='native', threshold=None, prefit=False, random_state=None)

Bases: TransformerMixin, BaseEstimator

Feature selection based on tree-based importance.

Uses Gini/entropy importance from tree-based models. Fast but can be biased toward high-cardinality features.

Parameters:

estimator (BaseEstimator, optional) – Tree-based model with feature_importances_. Default: RandomForestClassifier.
n_features (int, float, or str, default='mean') – Number of features to select:
  - If int, select that many features.
  - If float (0-1), select that fraction.
  - If 'mean', select features with importance > mean.
  - If 'median', select features with importance > median.
importance_type (str, default='native') – Type of importance to use:
  - 'native': Use model's feature_importances_
  - 'gain': Gain-based (LightGBM/XGBoost specific)
  - 'split': Split count-based
threshold (float, optional) – Explicit importance threshold.
prefit (bool, default=False) – Whether the estimator is already fitted.
random_state (int, optional) – Random seed.
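
How the different n_features values translate into a selection mask can be sketched as follows. The helper importance_mask is hypothetical, and the rounding of a fractional n_features (rounded down, with at least one feature kept) is an assumption rather than documented behavior.

```python
import numpy as np

def importance_mask(importances, n_features="mean"):
    """Boolean mask of selected features for an int, float, 'mean' or 'median' rule."""
    importances = np.asarray(importances, dtype=float)
    if n_features == "mean":
        return importances > importances.mean()
    if n_features == "median":
        return importances > np.median(importances)
    if isinstance(n_features, float):
        # Assumption: a fraction is rounded down, keeping at least one feature.
        n_features = max(1, int(n_features * importances.size))
    # Integer case: keep the n_features highest-importance features.
    top = np.argsort(importances)[::-1][:n_features]
    mask = np.zeros(importances.size, dtype=bool)
    mask[top] = True
    return mask

importances = [0.6, 0.2, 0.15, 0.05]
mean_mask = importance_mask(importances, "mean")      # threshold 0.25
median_mask = importance_mask(importances, "median")  # threshold 0.175
top3_mask = importance_mask(importances, 3)
half_mask = importance_mask(importances, 0.5)
```

With skewed importances like these, the mean rule is stricter than the median rule, since a single dominant feature pulls the mean up.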

feature_importances_¶

Feature importance scores.

Type:: ndarray

selected_features_¶

Indices of selected features.

Type:: ndarray

threshold_¶

Actual threshold used for selection.

Type:: float

Example

>>> from endgame.feature_selection import TreeImportanceSelector
>>> selector = TreeImportanceSelector(n_features=20)
>>> X_selected = selector.fit_transform(X, y)

fit(X, y)[source]¶

Fit the tree importance selector.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.

Returns:

self (TreeImportanceSelector)

transform(X)[source]¶

Select features.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_selected (ndarray) – Data with selected features.

fit_transform(X, y)[source]¶

Fit and transform.

Return type:: ndarray

get_support(indices=False)[source]¶

Get mask or indices of selected features.

Return type:: ndarray
Parameters:: indices (bool)

get_feature_ranking()[source]¶

Get feature ranking by importance.

Return type:: ndarray

class endgame.feature_selection.StabilitySelector(base_selector, n_bootstrap=100, sample_fraction=0.5, threshold=0.6, lambda_grid=None, max_features=None, random_state=None, n_jobs=None, verbose=False)

Bases: TransformerMixin, BaseEstimator

Stability selection wrapper for any feature selection method.

Runs feature selection multiple times on bootstrap samples and keeps features that are consistently selected. This addresses the instability problem in most selection methods.

Based on Meinshausen & Bühlmann (2010).
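
The resampling loop behind stability selection can be sketched in NumPy. This is a simplified illustration with hypothetical names: top_k_by_correlation stands in for the wrapped base selector, and each round draws a subsample without replacement (as in the original paper), which may differ from how endgame resamples.

```python
import numpy as np

def top_k_by_correlation(X, y, k):
    """Stand-in base selector: the k features most correlated with y."""
    scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]

def selection_frequencies(X, y, k, n_bootstrap=50, sample_fraction=0.5, seed=0):
    """Fraction of resampling rounds in which each feature was selected."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    subsample_size = int(sample_fraction * n_samples)
    counts = np.zeros(n_features)
    for _ in range(n_bootstrap):
        rows = rng.choice(n_samples, size=subsample_size, replace=False)
        counts[top_k_by_correlation(X[rows], y[rows], k)] += 1
    return counts / n_bootstrap

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=200)
frequencies = selection_frequencies(X, y, k=2)
stable = np.flatnonzero(frequencies > 0.6)   # keep consistently selected features
```

Features that enter the selection only on some subsamples end up with low frequencies and are filtered out by the threshold.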

Parameters:

base_selector (TransformerMixin) – Base feature selector to wrap (must have fit/get_support).
n_bootstrap (int, default=100) – Number of bootstrap iterations.
sample_fraction (float, default=0.5) – Fraction of samples to use in each bootstrap.
threshold (float, default=0.6) – Selection frequency threshold. Features selected in more than this fraction of bootstraps are kept.
lambda_grid (array-like, optional) – For LASSO-style selectors, grid of regularization values.
max_features (int, optional) – Maximum number of features to select.
random_state (int, optional) – Random seed.
n_jobs (int, default=None) – Number of parallel jobs.
verbose (bool, default=False) – Whether to print progress.

selection_frequencies_¶

Selection frequency for each feature.

Type:: ndarray

selected_features_¶

Indices of stable features.

Type:: ndarray

bootstrap_results_¶

Selected features in each bootstrap.

Type:: list

Example

>>> from endgame.feature_selection import StabilitySelector, MRMRSelector
>>> base = MRMRSelector(n_features=20)
>>> stable = StabilitySelector(base, n_bootstrap=50, threshold=0.7)
>>> X_selected = stable.fit_transform(X, y)

fit(X, y)[source]¶

Fit the stability selector.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.

Returns:

self (StabilitySelector)

transform(X)[source]¶

Select stable features.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_selected (ndarray) – Data with stable features.

fit_transform(X, y)[source]¶

Fit and transform.

Return type:: ndarray

get_support(indices=False)[source]¶

Get mask or indices of selected features.

Return type:: ndarray
Parameters:: indices (bool)

get_selection_frequencies()[source]¶

Get selection frequency for all features.

Return type:: ndarray

plot_selection_frequencies(feature_names=None)[source]¶

Plot selection frequencies.

Parameters:: feature_names (list, optional) – Names for features.
Returns:: fig (matplotlib Figure)

class endgame.feature_selection.KnockoffSelector(fdr=0.1, method='equicorrelated', statistic='lasso_cv', offset=1, random_state=None, verbose=False)

Bases: TransformerMixin, BaseEstimator

Knockoff filter for feature selection with FDR control.

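The knockoff filter creates “knockoff” copies of the features that mimic the correlation structure of the originals but are, by construction, conditionally independent of the target given the original features. A feature is selected when its importance statistic exceeds that of its knockoff by more than a data-dependent threshold.

Controls the false discovery rate (FDR) at the target level, provided the assumptions of the chosen knockoff construction hold.

Based on Barber & Candès (2015) and Candès et al. (2018).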
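
Given the statistics W_j, the selection threshold is the smallest t at which the estimated false discovery proportion drops to the target level. The sketch below follows the standard definition from the knockoff literature; knockoff_threshold is a hypothetical helper, and the W values are made up for illustration.

```python
import numpy as np

def knockoff_threshold(W, fdr=0.1, offset=1):
    """Smallest t with (offset + #{W_j <= -t}) / max(1, #{W_j >= t}) <= fdr."""
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the nonzero magnitudes of W, in increasing order.
    for t in np.sort(np.abs(W[W != 0])):
        estimated_fdp = (offset + np.sum(W <= -t)) / max(1, np.sum(W >= t))
        if estimated_fdp <= fdr:
            return float(t)
    return float("inf")   # no threshold qualifies: select nothing

# Large positive W_j: the original feature beats its knockoff.
W = np.array([5, 4, 3, 2.5, 2, 1.5, -1, 0.5, -0.3, 1.2])

t_plus = knockoff_threshold(W, fdr=0.2, offset=1)   # knockoff+
selected_plus = np.flatnonzero(W >= t_plus)

t_orig = knockoff_threshold(W, fdr=0.2, offset=0)   # original knockoff filter
selected_orig = np.flatnonzero(W >= t_orig)
```

The offset of 1 makes the procedure more conservative: on these values it raises the threshold from 0.5 to 1.2 and drops one feature.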

Parameters:

fdr (float, default=0.1) – Target false discovery rate.
method (str, default='equicorrelated') – Knockoff generation method:
  - 'equicorrelated': Equicorrelated knockoffs (faster)
  - 'sdp': SDP knockoffs (more powerful, requires cvxpy)
  - 'gaussian': Model-X Gaussian knockoffs
statistic (str, default='lasso_cv') – Feature statistic:
  - 'lasso_cv': LASSO with CV selection
  - 'lasso_fixed': LASSO with fixed lambda
  - 'ridge': Ridge coefficients
offset (int, default=1) – Offset used in the threshold calculation: 1 gives the knockoff+ procedure, 0 the original knockoff filter.
random_state (int, optional) – Random seed.
verbose (bool, default=False) – Whether to print progress.

selected_features_¶

Indices of selected features.

Type:: ndarray

statistics_¶

Knockoff statistics W_j for each feature.

Type:: ndarray

threshold_¶

Selection threshold.

Type:: float

knockoffs_¶

Generated knockoff features.

Type:: ndarray

Example

>>> from endgame.feature_selection import KnockoffSelector
>>> selector = KnockoffSelector(fdr=0.1)
>>> X_selected = selector.fit_transform(X, y)

fit(X, y)[source]¶

Fit the knockoff selector.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.

Returns:

self (KnockoffSelector)

transform(X)[source]¶

Select features.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_selected (ndarray) – Data with selected features.

fit_transform(X, y)[source]¶

Fit and transform.

Return type:: ndarray

get_support(indices=False)[source]¶

Get mask or indices of selected features.

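Parameters:: indices (bool, default=False) – If True, return the integer indices of the selected features instead of a boolean mask.
Return type:: ndarray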

get_statistics()[source]¶

Get knockoff W statistics.

Return type:: ndarray
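
The role of fdr and offset can be sketched with the data-dependent threshold from Barber & Candès (2015): the smallest t among the nonzero |W_j| at which the estimated false discovery proportion falls to the target level. The function name and the toy statistics below are hypothetical, and how the library itself computes threshold_ is not shown in this reference.

```python
import numpy as np

def knockoff_threshold(W, fdr=0.1, offset=1):
    """Hypothetical helper: smallest t with estimated FDP <= fdr."""
    W = np.asarray(W, dtype=float)
    # Candidate thresholds are the distinct nonzero magnitudes, ascending.
    candidates = np.unique(np.abs(W[W != 0]))
    for t in candidates:
        n_negative = np.sum(W <= -t)  # proxy for false discoveries
        n_positive = np.sum(W >= t)   # features that would be selected
        fdp_hat = (offset + n_negative) / max(1, n_positive)
        if fdp_hat <= fdr:
            return t
    return np.inf  # no threshold meets the target: select nothing

# Toy statistics: eleven clearly positive values and four near zero.
W_demo = np.array([5.0, 4.2, 3.9, 3.1, 2.8, 2.5, 2.2, 1.9,
                   1.7, 1.5, 1.2, -0.4, 0.3, -0.2, 0.1])

t_plus = knockoff_threshold(W_demo, fdr=0.2, offset=1)   # knockoff+
t_orig = knockoff_threshold(W_demo, fdr=0.2, offset=0)   # original knockoff
selected_demo = np.flatnonzero(W_demo >= t_plus)
```

With these toy values the knockoff+ threshold is 0.3 and the original knockoff threshold is 0.1, which shows that offset=1 is the more conservative choice.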

class endgame.feature_selection.AdversarialFeatureSelector(threshold=0.05, max_features_to_remove=10, estimator=None, output_format='auto', random_state=None, verbose=False)[source]¶

Bases: PolarsTransformer

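Removes features that contribute to train/test drift.

Uses adversarial validation: a classifier is trained to distinguish training rows from test rows, and features that carry high importance in that classifier are those whose distributions differ between the two sets. Such features are identified and removed.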

Parameters:

threshold (float, default=0.05) – Remove features with importance above this threshold.
max_features_to_remove (int, default=10) – Maximum number of features to remove.
estimator (BaseEstimator, optional) – Classifier for adversarial validation.
output_format (str)
random_state (int | None)
verbose (bool)

Examples

>>> selector = AdversarialFeatureSelector(threshold=0.05)
>>> selector.fit(X_train, X_test=X_test)
>>> X_train_clean = selector.transform(X_train)

fit(X, y=None, X_test=None, **fit_params)[source]¶

Identify features to remove based on adversarial validation.

Parameters:

X (array-like) – Training features.
y (ignored)
X_test (array-like) – Test features for adversarial validation.

Return type:

AdversarialFeatureSelector

Returns:

self

transform(X)[source]¶

Remove drifted features.

Return type:: Any

property features_to_drop_: list[str]¶

Features identified for removal.

property feature_importances_: dict[str, float]¶

Adversarial validation feature importances.
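
The idea of scoring features by how well they separate train from test can be sketched without a classifier. The example below is a simplified stand-in, not the library's method: instead of a fitted estimator's feature importances it uses a per-feature rank AUC (Mann-Whitney U), and both function names are hypothetical. The demo threshold of 0.2 suits this score and is unrelated to the class default.

```python
import numpy as np

def feature_drift_scores(X_train, X_test):
    """Hypothetical helper: |2 * AUC - 1| per column (0 = no separation)."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    n_tr, n_te = len(X_train), len(X_test)
    scores = np.empty(X_train.shape[1])
    for j in range(X_train.shape[1]):
        combined = np.concatenate([X_train[:, j], X_test[:, j]])
        ranks = combined.argsort().argsort() + 1  # 1-based ranks, ties ignored
        # Mann-Whitney U of the test rows; U / (n_tr * n_te) is the AUC.
        u = ranks[n_tr:].sum() - n_te * (n_te + 1) / 2
        scores[j] = abs(2.0 * u / (n_tr * n_te) - 1.0)
    return scores

def features_to_drop(scores, threshold, max_features_to_remove=10):
    """Flag the most drifted columns above the threshold, up to a cap."""
    order = np.argsort(scores)[::-1]  # most drifted first
    flagged = [int(j) for j in order if scores[j] > threshold]
    return flagged[:max_features_to_remove]

# Toy data: the last column of the test set is shifted to simulate drift.
rng = np.random.default_rng(0)
X_train_demo = rng.standard_normal((1000, 3))
X_test_demo = rng.standard_normal((1000, 3))
X_test_demo[:, 2] += 1.0

drift_demo = feature_drift_scores(X_train_demo, X_test_demo)
dropped_demo = features_to_drop(drift_demo, threshold=0.2)
```

A univariate score like this misses drift that only shows up in feature interactions, which is one reason to use a classifier for adversarial validation instead.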

set_fit_request(*, X_test='$UNCHANGED$')¶

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in fit.
self (AdversarialFeatureSelector)

Returns:

self (object) – The updated object.

Return type:

AdversarialFeatureSelector

class endgame.feature_selection.NullImportanceSelector(estimator=None, n_iterations=100, significance_threshold=0.95, output_format='auto', random_state=None, verbose=False)[source]¶

Bases: PolarsTransformer

Selects features based on null importance distribution.

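A feature is kept only if its importance on the real target significantly exceeds the importances it obtains when the model is refit on randomly shuffled targets (its null distribution). This guards against features that appear important by chance.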

Parameters:

estimator (BaseEstimator, optional) – Model to use. If None, uses LightGBM.
n_iterations (int, default=100) – Number of null importance iterations.
significance_threshold (float, default=0.95) – Percentile of the null importance distribution, expressed as a fraction in [0, 1], used as the significance threshold.
output_format (str)
random_state (int | None)
verbose (bool)

Examples

>>> selector = NullImportanceSelector(n_iterations=100)
>>> selector.fit(X, y)
>>> X_selected = selector.transform(X)

fit(X, y, **fit_params)[source]¶

Compute actual and null importances.

Parameters:

X (array-like) – Training features.
y (array-like) – Target values.

Return type:

NullImportanceSelector

Returns:

self

transform(X)[source]¶

Keep only significant features.

Return type:: Any

property selected_features_: list[str]¶

Features that passed significance test.

property actual_importance_: dict[str, float]¶

Actual feature importances.

property null_threshold_: dict[str, float]¶

Null importance thresholds.
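
The null importance procedure can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the library's implementation: absolute Pearson correlation stands in for model-based importances (the class defaults to LightGBM), significance_threshold is interpreted as a quantile of each feature's null distribution, and the function names are hypothetical.

```python
import numpy as np

def abs_corr_importance(X, y):
    """Stand-in importance: |Pearson correlation| of each column with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

def null_importance_select(X, y, n_iterations=100,
                           significance_threshold=0.95, seed=0):
    """Hypothetical helper: keep features that beat their null quantile."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    actual = abs_corr_importance(X, y)
    # Null distribution: importances recomputed on shuffled targets.
    null = np.stack([abs_corr_importance(X, rng.permutation(y))
                     for _ in range(n_iterations)])
    thresholds = np.quantile(null, significance_threshold, axis=0)
    selected = np.flatnonzero(actual > thresholds)
    return selected, actual, thresholds

# Toy data: only the first two columns influence the target.
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((500, 5))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.standard_normal(500)

selected_demo, actual_demo, thresholds_demo = null_importance_select(X_demo, y_demo)
```

Because each feature is tested separately at the chosen quantile, an uninformative feature can still pass occasionally; raising significance_threshold or n_iterations makes that less likely.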