Feature Selection¶
- class endgame.feature_selection.UnivariateSelector(score_func='f_classif', mode='k_best', k=10, percentile=10, alpha=0.05, random_state=None)[source]¶
Bases: TransformerMixin, BaseEstimator
Unified univariate feature selection.
Selects features based on univariate statistical tests. Supports various scoring functions and selection modes.
- Parameters:
score_func (str or callable, default='f_classif') – Scoring function. Options:
  - ‘f_classif’: ANOVA F-test for classification
  - ‘f_regression’: F-test for regression
  - ‘mutual_info_classif’: Mutual information for classification
  - ‘mutual_info_regression’: Mutual information for regression
  - ‘chi2’: Chi-squared test (requires non-negative features)
  - callable: Custom function (X, y) -> (scores, pvalues)
mode (str, default='k_best') – Selection mode:
  - ‘k_best’: Select the top k features
  - ‘percentile’: Select the top percentile of features
  - ‘fpr’: Select by false positive rate
  - ‘fdr’: Select by false discovery rate
  - ‘fwe’: Select by family-wise error rate
k (int, default=10) – Number of features to select (for k_best mode).
percentile (int, default=10) – Percentile of features to select (for percentile mode).
alpha (float, default=0.05) – Threshold for fpr/fdr/fwe modes.
random_state (int, optional) – Random seed for mutual information estimation.
- scores_¶
Scores for each feature.
- Type:
ndarray
- pvalues_¶
P-values for each feature (if available).
- Type:
ndarray
- selected_features_¶
Indices of selected features.
- Type:
ndarray
Example
>>> from endgame.feature_selection import UnivariateSelector
>>> selector = UnivariateSelector(score_func='mutual_info_classif', k=20)
>>> X_selected = selector.fit_transform(X, y)
- SCORE_FUNCS = {'chi2': <function chi2>, 'f_classif': <function f_classif>, 'f_regression': <function f_regression>, 'mutual_info_classif': <function mutual_info_classif>, 'mutual_info_regression': <function mutual_info_regression>}¶
- fit(X, y)[source]¶
Fit the selector.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.
- Returns:
self (UnivariateSelector)
- transform(X)[source]¶
Select features.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_selected (ndarray) – Data with selected features.
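The selection modes listed for UnivariateSelector can be illustrated with a small, dependency-free sketch. It assumes the modes behave like their scikit-learn counterparts (a plain p-value cutoff for ‘fpr’, Benjamini-Hochberg for ‘fdr’, a Bonferroni cutoff for ‘fwe’); whether endgame implements them exactly this way is not stated here. The helper name `select_by_mode` is hypothetical.

```python
def select_by_mode(scores, pvalues, mode="k_best", k=10, percentile=10, alpha=0.05):
    """Return sorted indices of the features kept under the given mode (sketch)."""
    n = len(scores)
    by_score = sorted(range(n), key=lambda i: scores[i], reverse=True)
    if mode == "k_best":
        # Keep the k highest-scoring features.
        return sorted(by_score[:k])
    if mode == "percentile":
        # Keep roughly the top `percentile` percent of features by score.
        n_keep = max(1, round(n * percentile / 100))
        return sorted(by_score[:n_keep])
    if mode == "fpr":
        # Uncorrected test: keep every feature whose p-value is below alpha.
        return [i for i in range(n) if pvalues[i] < alpha]
    if mode == "fwe":
        # Bonferroni correction: divide alpha by the number of tests.
        return [i for i in range(n) if pvalues[i] < alpha / n]
    if mode == "fdr":
        # Benjamini-Hochberg: largest rank r with p_(r) <= alpha * r / n.
        by_p = sorted(range(n), key=lambda i: pvalues[i])
        cutoff = 0
        for rank, i in enumerate(by_p, start=1):
            if pvalues[i] <= alpha * rank / n:
                cutoff = rank
        return sorted(by_p[:cutoff])
    raise ValueError(f"unknown mode: {mode!r}")
```

The score-based modes ignore p-values, while the three error-rate modes ignore k and percentile, which is why alpha only matters for ‘fpr’, ‘fdr’, and ‘fwe’.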
- class endgame.feature_selection.MutualInfoSelector(k=10, task='classification', n_neighbors=3, random_state=None)[source]¶
Bases: UnivariateSelector
Mutual information-based feature selection.
Convenience wrapper for mutual information scoring, which captures nonlinear dependencies.
- Parameters:
Example
>>> from endgame.feature_selection import MutualInfoSelector
>>> selector = MutualInfoSelector(k=20, task='classification')
>>> X_selected = selector.fit_transform(X, y)
- class endgame.feature_selection.FTestSelector(k=10, task='classification')[source]¶
Bases: UnivariateSelector
F-test based feature selection.
Uses ANOVA F-test for classification or F-regression for regression. Fast linear baseline.
- Parameters:
Example
>>> from endgame.feature_selection import FTestSelector
>>> selector = FTestSelector(k=20)
>>> X_selected = selector.fit_transform(X, y)
- class endgame.feature_selection.Chi2Selector(k=10)[source]¶
Bases: UnivariateSelector
Chi-squared feature selection.
For categorical features vs categorical target. Requires non-negative feature values.
- Parameters:
k (int, default=10) – Number of features to select.
Example
>>> from endgame.feature_selection import Chi2Selector
>>> selector = Chi2Selector(k=20)
>>> X_selected = selector.fit_transform(X_categorical, y)
- class endgame.feature_selection.MRMRSelector(n_features=10, task='classification', relevance_func='mutual_info', redundancy_func='pearson', n_neighbors=3, random_state=None, verbose=False)[source]¶
Bases: TransformerMixin, BaseEstimator
Minimum Redundancy Maximum Relevance feature selection.
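MRMR balances feature relevance (strong dependence on the target) against redundancy (dependence on already-selected features). The two measures are set by relevance_func and redundancy_func.
The selection criterion is max(relevance - redundancy): at each step, the candidate feature with the largest difference is added.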
- Parameters:
n_features (int, default=10) – Number of features to select.
task (str, default='classification') – Task type: ‘classification’ or ‘regression’.
relevance_func (str, default='mutual_info') – Function for computing relevance:
  - ‘mutual_info’: Mutual information
  - ‘f_test’: F-statistic
redundancy_func (str, default='pearson') – Function for computing redundancy:
  - ‘pearson’: Absolute Pearson correlation
  - ‘mutual_info’: Mutual information between features
n_neighbors (int, default=3) – Number of neighbors for MI estimation.
random_state (int, optional) – Random seed.
verbose (bool, default=False) – Whether to print selection progress.
- selected_features_¶
Indices of selected features in order of selection.
- Type:
ndarray
- relevance_scores_¶
Relevance scores for all features.
- Type:
ndarray
- ranking_¶
Full feature ranking.
- Type:
ndarray
Example
>>> from endgame.feature_selection import MRMRSelector
>>> selector = MRMRSelector(n_features=20)
>>> X_selected = selector.fit_transform(X, y)
>>> print(f"Selected features: {selector.selected_features_}")
- fit(X, y)[source]¶
Fit the MRMR selector.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.
- Returns:
self (MRMRSelector)
- transform(X)[source]¶
Select features.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_selected (ndarray) – Data with selected features.
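The criterion max(relevance - redundancy) can be sketched as a greedy loop. This is a minimal illustration, not endgame's implementation: it assumes relevance scores are already computed, uses the mean absolute Pearson correlation with the selected set as redundancy, and the names `pearson` and `mrmr_select` are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def mrmr_select(columns, relevance, n_features):
    """Greedy mRMR over feature columns, given precomputed relevance scores."""
    # Start from the single most relevant feature.
    selected = [max(range(len(columns)), key=lambda j: relevance[j])]
    while len(selected) < min(n_features, len(columns)):
        best, best_score = None, -math.inf
        for j in range(len(columns)):
            if j in selected:
                continue
            # Redundancy: mean absolute correlation with already-selected features.
            redundancy = sum(abs(pearson(columns[j], columns[s])) for s in selected) / len(selected)
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected  # indices in order of selection
```

A feature that is highly relevant but nearly a copy of an already-selected one is pushed down the ranking, which is the behavior the redundancy term is meant to produce.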
- class endgame.feature_selection.ReliefFSelector(n_features=10, n_neighbors=10, n_samples=1.0, algorithm='relieff', random_state=None, verbose=False)[source]¶
Bases: TransformerMixin, BaseEstimator
ReliefF feature selection algorithm.
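ReliefF is an instance-based feature weighting algorithm that naturally handles feature interactions. It scores each feature by how well its values separate an instance from its nearest neighbors of a different class (near misses) while staying similar to its nearest neighbors of the same class (near hits).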
- Parameters:
n_features (int, default=10) – Number of features to select.
n_neighbors (int, default=10) – Number of neighbors to consider for each instance.
n_samples (int or float, default=1.0) – Number of samples to use for estimation:
  - If int, uses that many samples.
  - If float (0-1), uses that fraction of samples.
algorithm (str, default='relieff') – Algorithm variant:
  - ‘relieff’: Standard ReliefF
  - ‘multisurf’: MultiSURF (adaptive radius)
random_state (int, optional) – Random seed.
verbose (bool, default=False) – Whether to print progress.
- feature_importances_¶
Feature importance scores.
- Type:
ndarray
- selected_features_¶
Indices of selected features.
- Type:
ndarray
- ranking_¶
Feature ranking by importance.
- Type:
ndarray
Example
>>> from endgame.feature_selection import ReliefFSelector
>>> selector = ReliefFSelector(n_features=20, n_neighbors=10)
>>> X_selected = selector.fit_transform(X, y)
- fit(X, y)[source]¶
Fit the ReliefF selector.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target labels (must be discrete for classification).
- Returns:
self (ReliefFSelector)
- transform(X)[source]¶
Select features.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_selected (ndarray) – Data with selected features.
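The weighting idea behind Relief-style algorithms can be sketched for the two-class case. This is a simplified illustration under stated assumptions (numeric features, Manhattan distance on range-normalized values, every instance used once); it omits the class-prior weighting of full ReliefF and is not endgame's code. The name `relief_weights` is hypothetical.

```python
def relief_weights(X, y, n_neighbors=1):
    """Simplified two-class Relief with k nearest hits and misses per instance."""
    n, d = len(X), len(X[0])
    # Per-feature value ranges normalize differences into [0, 1].
    spans = [(max(r[j] for r in X) - min(r[j] for r in X)) or 1.0 for j in range(d)]

    def diff(j, a, b):
        return abs(a[j] - b[j]) / spans[j]

    def distance(a, b):
        return sum(diff(j, a, b) for j in range(d))

    weights = [0.0] * d
    for i in range(n):
        others = sorted((m for m in range(n) if m != i), key=lambda m: distance(X[i], X[m]))
        hits = [m for m in others if y[m] == y[i]][:n_neighbors]
        misses = [m for m in others if y[m] != y[i]][:n_neighbors]
        for j in range(d):
            # Penalize features that differ on near hits, reward those that differ on near misses.
            if hits:
                weights[j] -= sum(diff(j, X[i], X[m]) for m in hits) / (len(hits) * n)
            if misses:
                weights[j] += sum(diff(j, X[i], X[m]) for m in misses) / (len(misses) * n)
    return weights
```

Sorting the weights in descending order and keeping the first n_features entries would then give a ranking comparable to the one described above.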
- class endgame.feature_selection.CorrelationSelector(threshold=0.95, method='pearson', keep='first')[source]¶
Bases: TransformerMixin, BaseEstimator
Remove highly correlated features.
Identifies and removes features that are highly correlated with other features, keeping only one from each correlated group.
- Parameters:
threshold (float, default=0.95) – Correlation threshold. Features with correlation above this are considered redundant.
method (str, default='pearson') – Correlation method:
  - ‘pearson’: Pearson correlation (linear)
  - ‘spearman’: Spearman rank correlation (monotonic)
  - ‘kendall’: Kendall tau correlation (ordinal)
keep (str, default='first') – Which feature to keep from each correlated pair:
  - ‘first’: Keep the first feature encountered
  - ‘variance’: Keep the feature with higher variance
  - ‘target_corr’: Keep the feature with higher target correlation
- selected_features_¶
Indices of selected features.
- Type:
ndarray
- correlation_matrix_¶
Computed correlation matrix.
- Type:
ndarray
Example
>>> from endgame.feature_selection import CorrelationSelector
>>> selector = CorrelationSelector(threshold=0.90)
>>> X_reduced = selector.fit_transform(X)
- fit(X, y=None)[source]¶
Fit the correlation selector.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like, optional) – Target values (required if keep=’target_corr’).
- Returns:
self (CorrelationSelector)
- transform(X)[source]¶
Remove correlated features.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_selected (ndarray) – Data with correlated features removed.
Get pairs of highly correlated features.
- Returns:
pairs (list of tuples) – Each tuple is (feature_i, feature_j, correlation).
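A minimal sketch of correlation-based filtering with keep=’first’ semantics follows. It assumes absolute Pearson correlation and a strict "above threshold" comparison; the function names are hypothetical and the sketch is not endgame's implementation.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def correlated_pairs(columns, threshold=0.95):
    """List (feature_i, feature_j, correlation) for pairs above the threshold."""
    pairs = []
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            r = pearson(columns[i], columns[j])
            if abs(r) > threshold:
                pairs.append((i, j, r))
    return pairs

def drop_correlated(columns, threshold=0.95):
    """Keep a feature only if it is not too correlated with any feature kept before it."""
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[i])) <= threshold for i in kept):
            kept.append(j)
    return kept
```

Because features are visited in order, the earliest member of each correlated group survives; the ‘variance’ and ‘target_corr’ options would change only which member of a pair is kept.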
- class endgame.feature_selection.RFESelector(estimator=None, n_features=None, step=1, cv=5, scoring=None, min_features_to_select=1, verbose=0)[source]¶
Bases: TransformerMixin, BaseEstimator
Recursive Feature Elimination feature selection.
RFE iteratively removes the least important features based on model coefficients or feature importances.
- Parameters:
estimator (BaseEstimator, optional) – Model to use for feature ranking. Must have coef_ or feature_importances_ attribute. Default: RandomForest.
n_features (int, float, or None, default=None) – Number of features to select:
  - If int, select that many features.
  - If float (0-1), select that fraction of features.
  - If None, use cross-validation to find the optimal number.
step (int or float, default=1) – Number of features to remove at each iteration:
  - If int (>= 1), remove that many features.
  - If float (0-1), remove that fraction of features.
cv (int, default=5) – Cross-validation folds (used when n_features=None).
scoring (str, optional) – Scoring metric for RFECV.
min_features_to_select (int, default=1) – Minimum features for RFECV.
verbose (int, default=0) – Verbosity level.
- selected_features_¶
Indices of selected features.
- Type:
ndarray
- ranking_¶
Feature ranking (1 = selected).
- Type:
ndarray
- estimator_¶
Fitted estimator used for final ranking.
- Type:
BaseEstimator
Example
>>> from endgame.feature_selection import RFESelector
>>> selector = RFESelector(n_features=20)
>>> X_selected = selector.fit_transform(X, y)
- fit(X, y)[source]¶
Fit the RFE selector.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.
- Returns:
self (RFESelector)
- transform(X)[source]¶
Select features.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_selected (ndarray) – Data with selected features.
- class endgame.feature_selection.BorutaSelector(estimator=None, n_estimators='auto', max_iter=100, alpha=0.05, perc=100, two_step=True, random_state=None, verbose=0)[source]¶
Bases: TransformerMixin, BaseEstimator
Boruta all-relevant feature selection algorithm.
Boruta is a wrapper around Random Forest. It creates “shadow” features (shuffled copies of real features) and selects features that have significantly higher importance than the best shadow feature.
This is a statistically principled method that finds ALL relevant features, not just the minimal set.
- Parameters:
estimator (BaseEstimator, optional) – Tree-based model with feature_importances_. Default: RandomForest.
n_estimators (int or 'auto', default='auto') – Number of trees. ‘auto’ uses heuristic based on features.
max_iter (int, default=100) – Maximum iterations.
alpha (float, default=0.05) – Significance level for the binomial test.
perc (int, default=100) – Percentile of shadow feature importance distribution to use as threshold. 100 = max (original Boruta).
two_step (bool, default=True) – If True, use two-step correction for multiple testing.
random_state (int, optional) – Random seed.
verbose (int, default=0) – Verbosity level.
- selected_features_¶
Indices of confirmed features.
- Type:
ndarray
- tentative_features_¶
Indices of tentative features (borderline).
- Type:
ndarray
- rejected_features_¶
Indices of rejected features.
- Type:
ndarray
- ranking_¶
Feature ranking (1 = confirmed, 2 = tentative, 3 = rejected).
- Type:
ndarray
- feature_importances_¶
Mean feature importances across iterations.
- Type:
ndarray
Example
>>> from endgame.feature_selection import BorutaSelector
>>> selector = BorutaSelector(max_iter=100)
>>> X_selected = selector.fit_transform(X, y)
>>> print(f"Confirmed: {len(selector.selected_features_)}")
>>> print(f"Tentative: {len(selector.tentative_features_)}")
- fit(X, y)[source]¶
Fit the Boruta selector.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.
- Returns:
self (BorutaSelector)
- transform(X)[source]¶
Select confirmed features.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_selected (ndarray) – Data with confirmed features.
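The shadow-feature idea can be sketched without a Random Forest. The code below is an illustration under loud assumptions: absolute Pearson correlation stands in for forest importance, the threshold is the maximum shadow importance (perc=100), and a plain one-sided binomial test is used with no multiple-testing correction (so two_step is not modeled). The names `pearson` and `boruta_sketch` are hypothetical.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def boruta_sketch(columns, y, max_iter=20, alpha=0.05, seed=0):
    """Count how often each feature beats the best shuffled 'shadow' feature."""
    rng = random.Random(seed)

    def importance(col):
        return abs(pearson(col, y))  # toy stand-in for forest importance

    hits = [0] * len(columns)
    for _ in range(max_iter):
        shadows = []
        for col in columns:
            shadow = col[:]
            rng.shuffle(shadow)  # shuffling destroys any link to the target
            shadows.append(shadow)
        best_shadow = max(importance(s) for s in shadows)
        for j, col in enumerate(columns):
            if importance(col) > best_shadow:
                hits[j] += 1

    def tail(lo, hi):
        # Binomial(max_iter, 0.5) probability of a hit count in [lo, hi].
        return sum(math.comb(max_iter, k) for k in range(lo, hi + 1)) / 2 ** max_iter

    confirmed = [j for j, h in enumerate(hits) if tail(h, max_iter) < alpha]
    rejected = [j for j, h in enumerate(hits) if tail(0, h) < alpha]
    return confirmed, rejected, hits
```

Features that land in neither list correspond to the tentative group: their hit counts are not extreme enough in either direction after max_iter iterations.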
- class endgame.feature_selection.SequentialSelector(estimator=None, n_features='auto', direction='forward', scoring=None, cv=5, tol=None, n_jobs=None, verbose=0)[source]¶
Bases: TransformerMixin, BaseEstimator
Sequential feature selection.
Implements forward selection, backward elimination, or bidirectional search for optimal feature subsets.
- Parameters:
estimator (BaseEstimator, optional) – Model to use for evaluation. Default: LogisticRegression.
n_features (int, float, or 'auto', default='auto') – Number of features to select:
  - If int, select that many features.
  - If float (0-1), select that fraction of features.
  - If ‘auto’, use cross-validation to find the optimal number.
direction (str, default='forward') – Search direction:
  - ‘forward’: Start empty, add features
  - ‘backward’: Start full, remove features
  - ‘bidirectional’: Both directions (floating)
scoring (str, optional) – Scoring metric.
cv (int, default=5) – Cross-validation folds.
tol (float, optional) – Tolerance for early stopping (only for sklearn >= 1.1).
n_jobs (int, default=None) – Number of parallel jobs.
verbose (int, default=0) – Verbosity level.
- selected_features_¶
Indices of selected features.
- Type:
ndarray
Example
>>> from endgame.feature_selection import SequentialSelector
>>> selector = SequentialSelector(n_features=10, direction='forward')
>>> X_selected = selector.fit_transform(X, y)
- fit(X, y)[source]¶
Fit the sequential selector.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.
- Returns:
self (SequentialSelector)
- transform(X)[source]¶
Select features.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_selected (ndarray) – Data with selected features.
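Forward selection ("start empty, add features") reduces to a short greedy loop. In this sketch, `score_fn` is a hypothetical callable standing in for the cross-validated score of the estimator on a candidate feature subset; it is an assumption for illustration, not part of the endgame API.

```python
def forward_select(n_total, score_fn, n_features):
    """Greedy forward selection: repeatedly add the feature that scores best."""
    selected = []
    remaining = list(range(n_total))
    while remaining and len(selected) < n_features:
        # Score each candidate subset formed by adding one remaining feature.
        best = max(remaining, key=lambda j: score_fn(selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected  # indices in the order they were added
```

Backward elimination is the mirror image: start from all features and repeatedly drop the one whose removal hurts the score least.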
- class endgame.feature_selection.GeneticSelector(estimator=None, population_size=50, n_generations=100, mutation_rate=0.1, crossover_rate=0.8, tournament_size=3, elitism=2, min_features=1, max_features=None, scoring=None, cv=5, early_stopping=None, random_state=None, verbose=0)[source]¶
Bases: TransformerMixin, BaseEstimator
Genetic algorithm for feature selection.
Evolves feature subsets using selection, crossover, and mutation to optimize cross-validation score.
- Parameters:
estimator (BaseEstimator, optional) – Model to use for fitness evaluation.
population_size (int, default=50) – Size of the population.
n_generations (int, default=100) – Number of generations.
mutation_rate (float, default=0.1) – Probability of mutating each gene (feature).
crossover_rate (float, default=0.8) – Probability of crossover between parents.
tournament_size (int, default=3) – Number of individuals in tournament selection.
elitism (int, default=2) – Number of best individuals to keep unchanged.
min_features (int, default=1) – Minimum number of features to select.
max_features (int or float, optional) – Maximum features. If float, fraction of total.
scoring (str, optional) – Scoring metric.
cv (int, default=5) – Cross-validation folds.
early_stopping (int, optional) – Stop if no improvement for this many generations.
random_state (int, optional) – Random seed.
verbose (int, default=0) – Verbosity level.
- selected_features_¶
Indices of selected features.
- Type:
ndarray
Example
>>> from endgame.feature_selection import GeneticSelector
>>> selector = GeneticSelector(n_generations=50, population_size=30)
>>> X_selected = selector.fit_transform(X, y)
- fit(X, y)[source]¶
Fit the genetic selector.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.
- Returns:
self (GeneticSelector)
- transform(X)[source]¶
Select features.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_selected (ndarray) – Data with selected features.
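The roles of tournament_size, crossover_rate, mutation_rate, and elitism can be seen in a compact sketch of the evolutionary loop. It assumes boolean feature masks, uniform crossover, and bit-flip mutation; `fitness` is a hypothetical callable standing in for the cross-validation score of a mask, and none of this is confirmed to match endgame's internals.

```python
import random

def genetic_select(n_total, fitness, population_size=20, n_generations=30,
                   mutation_rate=0.1, crossover_rate=0.8, tournament_size=3,
                   elitism=2, seed=0):
    """Evolve boolean feature masks and return the indices of the best mask found."""
    rng = random.Random(seed)

    def repair(mask):
        if not any(mask):
            mask[rng.randrange(n_total)] = True  # enforce at least one feature
        return mask

    def tournament(scored):
        # Pick the fittest of `tournament_size` randomly drawn individuals.
        return max(rng.sample(scored, tournament_size), key=lambda p: p[0])[1]

    population = [repair([rng.random() < 0.5 for _ in range(n_total)])
                  for _ in range(population_size)]
    for _ in range(n_generations):
        scored = sorted(((fitness(m), m) for m in population),
                        key=lambda p: p[0], reverse=True)
        next_gen = [m[:] for _, m in scored[:elitism]]  # elites survive unchanged
        while len(next_gen) < population_size:
            a, b = tournament(scored), tournament(scored)
            if rng.random() < crossover_rate:
                # Uniform crossover: each gene comes from either parent.
                child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
            else:
                child = a[:]
            # Bit-flip mutation, applied independently to each gene.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            next_gen.append(repair(child))
        population = next_gen
    best = max(population, key=fitness)
    return [j for j, keep in enumerate(best) if keep]
```

With elitism greater than zero, the best fitness in the population never decreases from one generation to the next, which is what makes an early_stopping counter on "no improvement" meaningful.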
- class endgame.feature_selection.PermutationSelector(estimator, n_features=10, n_repeats=10, scoring=None, threshold=None, use_pimp=False, alpha=0.05, random_state=None, n_jobs=None)[source]¶
Bases: TransformerMixin, BaseEstimator
Feature selection based on permutation importance.
More reliable than model-specific importances as it measures actual predictive contribution. Can optionally compute p-values (PIMP) for statistical significance.
- Parameters:
estimator (BaseEstimator) – Fitted model or model to fit.
n_features (int or float, default=10) – Number of features to select:
  - If int, select that many features.
  - If float (0-1), select that fraction of features.
n_repeats (int, default=10) – Number of permutation repetitions.
scoring (str, optional) – Scoring metric.
threshold (float, optional) – Minimum importance threshold. If set, overrides n_features.
use_pimp (bool, default=False) – Whether to compute p-values using PIMP (permutation importance with p-values). More statistically rigorous.
alpha (float, default=0.05) – Significance level for PIMP.
random_state (int, optional) – Random seed.
n_jobs (int, default=None) – Number of parallel jobs.
- feature_importances_¶
Permutation importance for each feature.
- Type:
ndarray
- importance_std_¶
Standard deviation of importance across repeats.
- Type:
ndarray
- pvalues_¶
P-values for each feature (if use_pimp=True).
- Type:
ndarray
- selected_features_¶
Indices of selected features.
- Type:
ndarray
Example
>>> from endgame.feature_selection import PermutationSelector
>>> model.fit(X, y)
>>> selector = PermutationSelector(estimator=model, n_features=20)
>>> X_selected = selector.fit_transform(X, y)
- fit(X, y)[source]¶
Fit the permutation selector.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data (should be validation set for fitted model).
y (array-like of shape (n_samples,)) – Target values.
- Returns:
self (PermutationSelector)
- transform(X)[source]¶
Select features.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_selected (ndarray) – Data with selected features.
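Permutation importance itself is simple to state in code: shuffle one column, rescore, and record the drop. The sketch below assumes rows are Python lists, `predict` is a hypothetical per-row prediction function for an already fitted model, and `score` is a hypothetical metric where higher is better.

```python
import random

def permutation_importance(predict, X, y, score, n_repeats=10, seed=0):
    """Mean drop in score when each feature column is shuffled, one at a time."""
    rng = random.Random(seed)
    baseline = score(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - score(y, [predict(row) for row in X_perm]))
        importances.append(sum(drops) / n_repeats)
    return importances
```

A feature the model never uses gets an importance of exactly zero, since shuffling it cannot change any prediction.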
- class endgame.feature_selection.SHAPSelector(estimator, n_features=10, explainer_type='auto', background_samples=100, max_samples=None, check_additivity=False, random_state=None)[source]¶
Bases: TransformerMixin, BaseEstimator
Feature selection based on SHAP values.
Uses mean absolute SHAP values as feature importance. More theoretically grounded than permutation importance.
- Parameters:
estimator (BaseEstimator) – Fitted model.
n_features (int or float, default=10) – Number of features to select:
  - If int, select that many features.
  - If float (0-1), select that fraction of features.
explainer_type (str, default='auto') – Type of SHAP explainer:
  - ‘auto’: Auto-detect based on model type
  - ‘tree’: TreeExplainer (fast for tree models)
  - ‘linear’: LinearExplainer
  - ‘kernel’: KernelExplainer (model-agnostic, slow)
  - ‘deep’: DeepExplainer (for neural networks)
background_samples (int, default=100) – Number of background samples for KernelExplainer.
max_samples (int, optional) – Maximum samples to use for SHAP computation.
check_additivity (bool, default=False) – Whether to verify SHAP additivity (slower).
random_state (int, optional) – Random seed.
- shap_values_¶
SHAP values for each sample and feature.
- Type:
ndarray
- feature_importances_¶
Mean absolute SHAP values for each feature.
- Type:
ndarray
- selected_features_¶
Indices of selected features.
- Type:
ndarray
Example
>>> from endgame.feature_selection import SHAPSelector
>>> model.fit(X_train, y_train)
>>> selector = SHAPSelector(estimator=model, n_features=20)
>>> X_selected = selector.fit_transform(X_val, y_val)
- fit(X, y=None)[source]¶
Fit the SHAP selector.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to compute SHAP values on.
y (Ignored)
- Returns:
self (SHAPSelector)
- transform(X)[source]¶
Select features.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_selected (ndarray) – Data with selected features.
- class endgame.feature_selection.TreeImportanceSelector(estimator=None, n_features='mean', importance_type='native', threshold=None, prefit=False, random_state=None)[source]¶
Bases: TransformerMixin, BaseEstimator
Feature selection based on tree-based importance.
Uses Gini/entropy importance from tree-based models. Fast but can be biased toward high-cardinality features.
- Parameters:
estimator (BaseEstimator, optional) – Tree-based model with feature_importances_. Default: RandomForestClassifier.
n_features (int, float, or str, default='mean') – Number of features to select:
  - If int, select that many features.
  - If float (0-1), select that fraction of features.
  - If ‘mean’, select features with importance > mean.
  - If ‘median’, select features with importance > median.
importance_type (str, default='native') – Type of importance to use:
  - ‘native’: Use the model’s feature_importances_
  - ‘gain’: Gain-based (LightGBM/XGBoost specific)
  - ‘split’: Split count-based
threshold (float, optional) – Explicit importance threshold.
prefit (bool, default=False) – Whether the estimator is already fitted.
random_state (int, optional) – Random seed.
- feature_importances_¶
Feature importance scores.
- Type:
ndarray
- selected_features_¶
Indices of selected features.
- Type:
ndarray
Example
>>> from endgame.feature_selection import TreeImportanceSelector
>>> selector = TreeImportanceSelector(n_features=20)
>>> X_selected = selector.fit_transform(X, y)
- fit(X, y)[source]¶
Fit the tree importance selector.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.
- Returns:
self (TreeImportanceSelector)
- transform(X)[source]¶
Select features.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_selected (ndarray) – Data with selected features.
- class endgame.feature_selection.StabilitySelector(base_selector, n_bootstrap=100, sample_fraction=0.5, threshold=0.6, lambda_grid=None, max_features=None, random_state=None, n_jobs=None, verbose=False)[source]¶
Bases: TransformerMixin, BaseEstimator
Stability selection wrapper for any feature selection method.
Runs feature selection multiple times on bootstrap samples and keeps features that are consistently selected. This addresses the instability problem in most selection methods.
Based on Meinshausen & Bühlmann (2010).
- Parameters:
base_selector (TransformerMixin) – Base feature selector to wrap (must have fit/get_support).
n_bootstrap (int, default=100) – Number of bootstrap iterations.
sample_fraction (float, default=0.5) – Fraction of samples to use in each bootstrap.
threshold (float, default=0.6) – Selection frequency threshold. Features selected in more than this fraction of bootstraps are kept.
lambda_grid (array-like, optional) – For LASSO-style selectors, grid of regularization values.
max_features (int, optional) – Maximum number of features to select.
random_state (int, optional) – Random seed.
n_jobs (int, default=None) – Number of parallel jobs.
verbose (bool, default=False) – Whether to print progress.
- selection_frequencies_¶
Selection frequency for each feature.
- Type:
ndarray
- selected_features_¶
Indices of stable features.
- Type:
ndarray
Example
>>> from endgame.feature_selection import StabilitySelector, MRMRSelector
>>> base = MRMRSelector(n_features=20)
>>> stable = StabilitySelector(base, n_bootstrap=50, threshold=0.7)
>>> X_selected = stable.fit_transform(X, y)
- fit(X, y)[source]¶
Fit the stability selector.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.
- Returns:
self (StabilitySelector)
- transform(X)[source]¶
Select stable features.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_selected (ndarray) – Data with stable features.
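The resampling loop behind stability selection can be sketched as follows. The sketch assumes subsampling without replacement at sample_fraction of the rows (as in the original stability selection paper; whether endgame resamples this way is an assumption), and `base_select` is a hypothetical callable that returns the indices chosen on a subsample.

```python
import random

def stability_select(X, y, base_select, n_bootstrap=100, sample_fraction=0.5,
                     threshold=0.6, seed=0):
    """Keep features that the base selector picks in more than `threshold` of the runs."""
    rng = random.Random(seed)
    n_samples, n_feats = len(X), len(X[0])
    counts = [0] * n_feats
    size = max(1, int(n_samples * sample_fraction))
    for _ in range(n_bootstrap):
        # Draw a subsample without replacement and rerun the base selector on it.
        idx = rng.sample(range(n_samples), size)
        for j in base_select([X[i] for i in idx], [y[i] for i in idx]):
            counts[j] += 1
    frequencies = [c / n_bootstrap for c in counts]
    selected = [j for j, f in enumerate(frequencies) if f > threshold]
    return selected, frequencies
```

The returned frequencies play the role of selection_frequencies_: raising the threshold trades recall for a more conservative, more reproducible feature set.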
- class endgame.feature_selection.KnockoffSelector(fdr=0.1, method='equicorrelated', statistic='lasso_cv', offset=1, random_state=None, verbose=False)[source]¶
Bases: TransformerMixin, BaseEstimator
Knockoff filter for feature selection with FDR control.
The knockoff filter creates “knockoff” copies of features that have the same correlation structure but are independent of the target. Features are selected if they are more important than their knockoffs.
Provides rigorous statistical guarantees on False Discovery Rate.
Based on Barber & Candès (2015) and Candès et al. (2018).
- Parameters:
fdr (float, default=0.1) – Target false discovery rate.
method (str, default='equicorrelated') – Knockoff generation method:
  - ‘equicorrelated’: Equicorrelated knockoffs (faster)
  - ‘sdp’: SDP knockoffs (more powerful, requires cvxpy)
  - ‘gaussian’: Model-X Gaussian knockoffs
statistic (str, default='lasso_cv') – Feature statistic:
  - ‘lasso_cv’: LASSO with CV selection
  - ‘lasso_fixed’: LASSO with fixed lambda
  - ‘ridge’: Ridge coefficients
offset (int, default=1) – Knockoff+ offset (0 for original knockoff).
random_state (int, optional) – Random seed.
verbose (bool, default=False) – Whether to print progress.
- selected_features_¶
Indices of selected features.
- Type:
ndarray
- statistics_¶
Knockoff statistics W_j for each feature.
- Type:
ndarray
- knockoffs_¶
Generated knockoff features.
- Type:
ndarray
Example
>>> from endgame.feature_selection import KnockoffSelector
>>> selector = KnockoffSelector(fdr=0.1)
>>> X_selected = selector.fit_transform(X, y)
- fit(X, y)[source]¶
Fit the knockoff selector.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.
- Returns:
self (KnockoffSelector)
- transform(X)[source]¶
Select features.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_selected (ndarray) – Data with selected features.
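Given knockoff statistics W_j, the selection step is a data-dependent threshold. The sketch below follows the standard knockoff and knockoff+ rule from Barber & Candès: choose the smallest t with (offset + #{W_j <= -t}) / max(1, #{W_j >= t}) <= fdr, then keep features with W_j >= t. It covers only this final step, not knockoff generation, and the function names are hypothetical.

```python
def knockoff_threshold(W, fdr=0.1, offset=1):
    """Smallest t meeting the estimated-FDR bound, or infinity if none does."""
    candidates = sorted({abs(w) for w in W if w != 0})
    for t in candidates:
        negatives = sum(1 for w in W if w <= -t)  # proxy for false discoveries
        positives = sum(1 for w in W if w >= t)   # discoveries at this threshold
        if (offset + negatives) / max(1, positives) <= fdr:
            return t
    return float("inf")

def knockoff_select(W, fdr=0.1, offset=1):
    """Indices of features whose statistic reaches the knockoff threshold."""
    t = knockoff_threshold(W, fdr, offset)
    return [j for j, w in enumerate(W) if w >= t]
```

With offset=1 the numerator is never zero, so at least 1/fdr features must clear the threshold before anything is selected; this is why small feature sets with a strict fdr can return an empty selection.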
- class endgame.feature_selection.AdversarialFeatureSelector(threshold=0.05, max_features_to_remove=10, estimator=None, output_format='auto', random_state=None, verbose=False)[source]¶
Bases: PolarsTransformer
Removes features that contribute to train/test drift.
Uses adversarial validation to identify and remove features that differ significantly between train and test distributions.
- Parameters:
threshold (float, default=0.05) – Remove features with importance above this threshold.
max_features_to_remove (int, default=10) – Maximum number of features to remove.
estimator (BaseEstimator, optional) – Classifier for adversarial validation.
output_format (str)
random_state (int | None)
verbose (bool)
Examples
>>> from endgame.feature_selection import AdversarialFeatureSelector
>>> selector = AdversarialFeatureSelector(threshold=0.05)
>>> selector.fit(X_train, X_test=X_test)
>>> X_train_clean = selector.transform(X_train)
- fit(X, y=None, X_test=None, **fit_params)[source]¶
Identify features to remove based on adversarial validation.
- Parameters:
X (array-like) – Training features.
y (ignored)
X_test (array-like) – Test features for adversarial validation.
- Returns:
self
- set_fit_request(*, X_test='$UNCHANGED$')¶
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
  - True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
  - False: metadata is not requested and the meta-estimator will not pass it to fit.
  - None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
  - str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in fit.
self (AdversarialFeatureSelector)
- Returns:
self (object) – The updated object.
- class endgame.feature_selection.NullImportanceSelector(estimator=None, n_iterations=100, significance_threshold=0.95, output_format='auto', random_state=None, verbose=False)[source]¶
Bases: PolarsTransformer
Selects features based on null importance distribution.
Features must significantly outperform a shuffled-target baseline. Robust method for identifying truly predictive features.
- Parameters:
Examples
>>> from endgame.feature_selection import NullImportanceSelector
>>> selector = NullImportanceSelector(n_iterations=100)
>>> selector.fit(X, y)
>>> X_selected = selector.transform(X)
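The "shuffled-target baseline" can be sketched as a comparison between each feature's actual importance and its importances under permuted targets. This is an illustration under assumptions: `importance` is a hypothetical callable (column, target) -> score standing in for a model's feature importance, and reading significance_threshold as "fraction of null runs the actual importance must beat" is one plausible interpretation, not confirmed by the text above.

```python
import random

def null_importance_select(columns, y, importance, n_iterations=100,
                           significance_threshold=0.95, seed=0):
    """Keep features whose actual importance beats most shuffled-target importances."""
    rng = random.Random(seed)
    actual = [importance(col, y) for col in columns]
    beats = [0] * len(columns)
    for _ in range(n_iterations):
        y_null = list(y)
        rng.shuffle(y_null)  # break any real feature-target relationship
        for j, col in enumerate(columns):
            if actual[j] > importance(col, y_null):
                beats[j] += 1
    # Fraction of null runs beaten must reach the significance threshold.
    return [j for j, b in enumerate(beats) if b / n_iterations >= significance_threshold]
```

A feature with no real signal has an actual importance that looks like just another draw from its null distribution, so it beats roughly half of the null runs and is filtered out.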