Preprocessing
- class endgame.preprocessing.SafeTargetEncoder(cols=None, smoothing=10.0, cv=5, min_samples_leaf=1, noise_level=0.0, handle_unknown='global_mean', output_format='auto', random_state=None, verbose=False)[source]¶
Bases: PolarsTransformer
Target encoding with M-estimate smoothing and inner-fold encoding.
Prevents target leakage through nested cross-validation during fit and applies smoothing for rare categories.
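Implements: S_i = (n_i × μ_i + m × μ_global) / (n_i + m), where n_i is the number of training samples in category i, μ_i is the mean target within that category, μ_global is the overall target mean, and m is the smoothing parameter.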
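To make the smoothing formula concrete, here is a minimal pure-Python sketch of the M-estimate alone. It is not the library's implementation: the helper name `m_estimate_encode` is hypothetical, and the inner-fold (cross-validated) step is deliberately left out.

```python
from collections import defaultdict


def m_estimate_encode(categories, targets, m=10.0):
    """Return {category: S_i} with S_i = (n_i*mu_i + m*mu_global) / (n_i + m).

    Hypothetical helper for illustration; not part of endgame.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    mu_global = sum(targets) / len(targets)
    # n_i * mu_i is simply the per-category sum of targets.
    return {c: (sums[c] + m * mu_global) / (counts[c] + m) for c in counts}


cats = ["a", "a", "a", "b"]
y = [1, 1, 0, 1]
enc = m_estimate_encode(cats, y, m=2.0)
print(enc)
```

The rare category "b" (one sample, raw mean 1.0) is pulled toward the global mean of 0.75, while "a" (three samples) moves less. Raising `m` strengthens that pull.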
- Parameters:
cols (List[str], optional) – Columns to encode. If None, encodes all categorical columns.
smoothing (float, default=10) – Smoothing parameter (m) for rare categories. Higher values = more regularization toward global mean.
cv (int, default=5) – Number of folds for inner-fold encoding during fit.
min_samples_leaf (int, default=1) – Minimum samples required to compute category statistic.
noise_level (float, default=0.0) – Gaussian noise std to add for regularization.
handle_unknown (str, default='global_mean') – Strategy for unseen categories: ‘global_mean’, ‘nan’, ‘error’.
output_format (str, default='auto') – Output format: ‘auto’, ‘polars’, ‘pandas’, ‘numpy’.
random_state (int, optional) – Random seed for cross-validation and noise.
verbose (bool)
Examples
>>> from endgame.preprocessing import SafeTargetEncoder
>>> encoder = SafeTargetEncoder(smoothing=10, cv=5)
>>> X_encoded = encoder.fit_transform(X, y)
- fit(X, y, **fit_params)[source]¶
Fit the target encoder.
Uses inner-fold encoding to prevent leakage during training.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.
- Returns:
self
- transform(X)[source]¶
Transform data using learned encodings.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_transformed (array-like) – Transformed data with encoded columns.
- fit_transform(X, y, **fit_params)[source]¶
Fit and transform with inner-fold encoding to prevent leakage.
During fit_transform, uses cross-validation to compute encodings without leakage. Each sample is encoded using statistics computed only from other samples.
- Parameters:
X (array-like) – Training data.
y (array-like) – Target values.
- Returns:
X_transformed (array-like) – Transformed training data.
- class endgame.preprocessing.LeaveOneOutEncoder(cols=None, smoothing=1.0, handle_unknown='global_mean', output_format='auto', random_state=None, verbose=False)[source]¶
Bases: PolarsTransformer
Leave-One-Out target encoding for online settings.
Each sample’s encoding excludes its own target value, preventing direct leakage while still using all available data.
- Parameters:
cols (List[str], optional) – Columns to encode. If None, encodes all categorical columns.
smoothing (float, default=1.0) – Smoothing parameter for regularization.
handle_unknown (str, default='global_mean') – Strategy for unseen categories.
random_state (int, optional) – Random seed for reproducibility.
output_format (str)
verbose (bool)
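The leave-one-out idea can be sketched in a few lines of plain Python. This is an illustration, not the library's code: the function name is hypothetical, and applying `smoothing` as an M-estimate toward the global mean is an assumption, since the source does not state the exact formula.

```python
def leave_one_out_encode(categories, targets, smoothing=1.0):
    """Encode each row from the targets of the *other* rows in its category.

    Hypothetical helper; the smoothing form is an assumption.
    """
    global_mean = sum(targets) / len(targets)
    sums, counts = {}, {}
    for c, t in zip(categories, targets):
        sums[c] = sums.get(c, 0.0) + t
        counts[c] = counts.get(c, 0) + 1
    encoded = []
    for c, t in zip(categories, targets):
        n_other = counts[c] - 1   # category size without this row
        s_other = sums[c] - t     # category target sum without this row
        encoded.append((s_other + smoothing * global_mean) / (n_other + smoothing))
    return encoded


loo = leave_one_out_encode(["a", "a", "b"], [1, 0, 1], smoothing=1.0)
print(loo)
```

A category seen only once (here "b") has no other rows to draw on, so under this formulation its encoding falls back to the global mean.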
- class endgame.preprocessing.CatBoostEncoder(cols=None, smoothing=1.0, output_format='auto', random_state=None, verbose=False)[source]¶
Bases: PolarsTransformer
CatBoost-style ordered target encoding.
Encodes based only on preceding samples, mimicking CatBoost’s internal target statistic computation. Prevents leakage by using only “past” information for each sample.
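A rough sketch of an ordered target statistic follows. It is illustrative only: the function name, the explicit `prior` argument, and the smoothing form are assumptions, and the source does not say how the library orders or permutes the rows.

```python
def ordered_target_encode(categories, targets, prior, smoothing=1.0):
    """Encode each row using only the rows that came before it.

    Hypothetical helper; prior/smoothing handling is an assumption.
    """
    sums, counts = {}, {}
    out = []
    for c, t in zip(categories, targets):
        s = sums.get(c, 0.0)
        n = counts.get(c, 0)
        # The statistic is computed before the current target is added.
        out.append((s + smoothing * prior) / (n + smoothing))
        sums[c] = s + t
        counts[c] = n + 1
    return out


ordered = ordered_target_encode(["a", "a", "a"], [1, 0, 1], prior=0.5)
print(ordered)
```

The first occurrence of a category always receives the prior, because no "past" rows exist for it yet.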
- Parameters:
- class endgame.preprocessing.FrequencyEncoder(cols=None, normalize=True, handle_unknown='zero', output_format='auto', random_state=None, verbose=False)[source]¶
Bases: PolarsTransformer
Frequency encoding for categorical features.
Replaces categories with their frequency (count or proportion). Simple but effective encoding that doesn’t require target values.
- Parameters:
cols (List[str], optional) – Columns to encode. If None, encodes all categorical columns.
normalize (bool, default=True) – If True, use proportions. If False, use raw counts.
handle_unknown (str, default='zero') – Strategy for unseen categories: ‘zero’, ‘nan’, ‘error’.
output_format (str)
random_state (int | None)
verbose (bool)
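As a minimal sketch of what frequency encoding computes, the snippet below builds a lookup table on training values and applies it to new values. The helper names are hypothetical, and mapping unseen categories to 0.0 mirrors the documented 'zero' strategy only by assumption.

```python
from collections import Counter


def fit_frequency(values, normalize=True):
    """Build {category: proportion} (or raw count when normalize=False)."""
    counts = Counter(values)
    total = len(values)
    return {k: (v / total if normalize else v) for k, v in counts.items()}


def transform_frequency(values, table, unknown=0.0):
    """Look up each value; unseen categories get `unknown`."""
    return [table.get(v, unknown) for v in values]


table = fit_frequency(["x", "x", "y", "z"])
encoded = transform_frequency(["x", "y", "w"], table)
print(encoded)  # [0.5, 0.25, 0.0]
```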
- class endgame.preprocessing.AutoAggregator(group_cols, agg_cols=None, methods=('mean', 'std', 'min', 'max'), rank_features=True, diff_features=False, ratio_features=False, prefix=None, output_format='auto', random_state=None, verbose=False)[source]¶
Bases: PolarsTransformer
Generates “Magic Feature” aggregations used in winning solutions.
Creates group-level statistics that capture relationships between entities. Key technique from Optiver 1st place and many tabular wins.
- Parameters:
group_cols (List[str]) – Columns to group by (e.g., [‘customer_id’, ‘store_id’]).
agg_cols (List[str], optional) – Columns to aggregate (e.g., [‘amount’, ‘quantity’]). If None, aggregates all numeric columns.
methods (List[str], default=['mean', 'std', 'min', 'max']) – Aggregation methods: ‘mean’, ‘std’, ‘min’, ‘max’, ‘sum’, ‘count’, ‘median’, ‘skew’, ‘kurtosis’, ‘first’, ‘last’, ‘nunique’.
rank_features (bool, default=True) – Whether to compute rank features within groups. Key technique from Optiver 1st place solution.
diff_features (bool, default=False) – Whether to compute difference from group mean.
ratio_features (bool, default=False) – Whether to compute ratio to group mean.
prefix (str, optional) – Prefix for generated feature names.
output_format (str)
random_state (int | None)
verbose (bool)
Examples
>>> from endgame.preprocessing import AutoAggregator
>>> agg = AutoAggregator(
...     group_cols=['customer_id'],
...     agg_cols=['amount'],
...     methods=['mean', 'std', 'skew'],
...     rank_features=True
... )
>>> X_agg = agg.fit_transform(X)
- fit(X, y=None, **fit_params)[source]¶
Compute aggregation statistics from training data.
- Parameters:
X (array-like) – Training data.
y (array-like, optional) – Ignored.
- Returns:
self
- class endgame.preprocessing.InteractionFeatures(interaction_pairs=None, operations=('multiply', 'divide'), max_interactions=100, include_cols=None, exclude_cols=None, output_format='auto', random_state=None, verbose=False)[source]¶
Bases: PolarsTransformer
Generates interaction features between specified columns.
Creates arithmetic combinations (multiply, divide, add, subtract) between pairs of numeric features.
- Parameters:
interaction_pairs (List[Tuple[str, str]], optional) – Specific pairs to create. If None, creates all pairs.
operations (List[str], default=['multiply', 'divide']) – Operations: ‘multiply’, ‘divide’, ‘add’, ‘subtract’.
max_interactions (int, default=100) – Maximum number of interactions to create.
include_cols (List[str], optional) – Only consider these columns for interactions.
exclude_cols (List[str], optional) – Exclude these columns from interactions.
output_format (str)
random_state (int | None)
verbose (bool)
Examples
>>> from endgame.preprocessing import InteractionFeatures
>>> inter = InteractionFeatures(
...     operations=['multiply', 'divide'],
...     max_interactions=50
... )
>>> X_inter = inter.fit_transform(X)
- class endgame.preprocessing.RankFeatures(cols=None, method='average', pct=True, suffix='_rank', output_format='auto', random_state=None, verbose=False)[source]¶
Bases: PolarsTransformer
Compute rank-based features.
Converts numeric values to ranks, which can be more robust to outliers and non-linear relationships.
- Parameters:
cols (List[str], optional) – Columns to rank. If None, ranks all numeric columns.
method (str, default='average') – Ranking method: ‘average’, ‘min’, ‘max’, ‘dense’, ‘ordinal’.
pct (bool, default=True) – Whether to return percentile ranks (0-1).
suffix (str, default='_rank') – Suffix for ranked column names.
output_format (str)
random_state (int | None)
verbose (bool)
Examples
>>> from endgame.preprocessing import RankFeatures
>>> ranker = RankFeatures(pct=True)
>>> X_ranked = ranker.fit_transform(X)
- class endgame.preprocessing.TemporalFeatures(datetime_cols=None, features=None, cyclical=True, drop_original=False, output_format='auto', random_state=None, verbose=False)[source]¶
Bases: PolarsTransformer
Extracts temporal features from datetime columns.
Generates comprehensive datetime features including cyclical encodings for periodic patterns.
Features generated:
- Basic: year, month, day, dayofweek, hour, minute, second
- Boolean: is_weekend, is_month_start, is_month_end, is_year_start, is_year_end
- Derived: quarter, week_of_year, day_of_year
- Cyclical: sin/cos encodings for month, day, hour, dayofweek
- Parameters:
datetime_cols (List[str], optional) – Datetime columns to extract features from. If None, auto-detects datetime columns.
features (List[str], optional) – Features to extract. If None, extracts all. Options: ‘year’, ‘month’, ‘day’, ‘dayofweek’, ‘hour’, ‘minute’, ‘second’, ‘is_weekend’, ‘quarter’, ‘week_of_year’, ‘day_of_year’, ‘is_month_start’, ‘is_month_end’, ‘cyclical’.
cyclical (bool, default=True) – Whether to add cyclical (sin/cos) encodings.
drop_original (bool, default=False) – Whether to drop the original datetime columns.
output_format (str)
random_state (int | None)
verbose (bool)
Examples
>>> from endgame.preprocessing import TemporalFeatures
>>> tf = TemporalFeatures(cyclical=True)
>>> X_temporal = tf.fit_transform(X)
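The cyclical encodings mentioned above map a periodic value onto the unit circle, so that the end of a cycle sits next to its start. The sketch below shows the general technique. The helper name and the choice of period (24 for hours) are assumptions, not the library's documented conventions.

```python
import math


def cyclical(value, period):
    """Return (sin, cos) of the value's angle within one period."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def distance(p, q):
    """Euclidean distance between two 2-D points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


h0, h12, h23 = cyclical(0, 24), cyclical(12, 24), cyclical(23, 24)
print(distance(h23, h0), distance(h12, h0))
```

With the raw integer encoding, hour 23 is 23 units from hour 0. In the sin/cos space it is the closest of the three points, which is the property periodic models need.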
- class endgame.preprocessing.LagFeatures(cols=None, lags=(1, 2, 3), group_cols=None, fill_value=None, output_format='auto', random_state=None, verbose=False)[source]¶
Bases: PolarsTransformer
Generate lag features for time series data.
Creates shifted versions of features to capture temporal dependencies.
- Parameters:
cols (List[str], optional) – Columns to create lags for. If None, uses all numeric columns.
lags (List[int], default=[1, 2, 3]) – Lag periods to create.
group_cols (List[str], optional) – Columns to group by when computing lags.
fill_value (float, optional) – Value to fill NaN from lagging. If None, keeps NaN.
output_format (str)
random_state (int | None)
verbose (bool)
Examples
>>> from endgame.preprocessing import LagFeatures
>>> lf = LagFeatures(cols=['price'], lags=[1, 7, 30])
>>> X_lagged = lf.fit_transform(X)
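To illustrate how `group_cols` and `fill_value` interact, here is a small pure-Python sketch of a grouped lag. The function is hypothetical and assumes the rows are already sorted by time; it is not the library's implementation.

```python
def lag_within_groups(values, groups, lag=1, fill_value=None):
    """Shift each value by `lag` positions within its own group.

    Hypothetical helper; assumes rows are in chronological order.
    """
    history = {}
    out = []
    for v, g in zip(values, groups):
        past = history.setdefault(g, [])
        # Not enough history in this group yet: use the fill value.
        out.append(past[-lag] if len(past) >= lag else fill_value)
        past.append(v)
    return out


prices = [10, 20, 11, 21, 12]
stores = ["a", "b", "a", "b", "a"]
lag1 = lag_within_groups(prices, stores, lag=1)
lag2 = lag_within_groups(prices, stores, lag=2)
print(lag1)  # [None, None, 10, 20, 11]
print(lag2)  # [None, None, None, None, 10]
```

Without grouping, the lag-1 value for the third row would be 20, which belongs to a different store.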
- class endgame.preprocessing.RollingFeatures(cols=None, windows=(3, 7, 14), methods=('mean', 'std'), group_cols=None, min_periods=1, output_format='auto', random_state=None, verbose=False)[source]¶
Bases: PolarsTransformer
Generate rolling window statistics.
Creates rolling aggregations for time series data.
- Parameters:
cols (List[str], optional) – Columns to compute rolling stats for.
windows (List[int], default=[3, 7, 14]) – Window sizes.
methods (List[str], default=['mean', 'std']) – Aggregation methods: ‘mean’, ‘std’, ‘min’, ‘max’, ‘sum’.
group_cols (List[str], optional) – Columns to group by.
min_periods (int, default=1) – Minimum observations in window required.
output_format (str)
random_state (int | None)
verbose (bool)
Examples
>>> from endgame.preprocessing import RollingFeatures
>>> rf = RollingFeatures(cols=['price'], windows=[7, 30])
>>> X_rolling = rf.fit_transform(X)
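The effect of `min_periods` is easiest to see on a tiny series. The sketch below is a hypothetical helper, not the library's code. It assumes a trailing window that includes the current row, which the source does not confirm.

```python
def rolling_mean(values, window, min_periods=1):
    """Trailing-window mean; None where fewer than min_periods values exist."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk) if len(chunk) >= min_periods else None)
    return out


partial = rolling_mean([1, 2, 3, 4], window=3)
strict = rolling_mean([1, 2, 3, 4], window=3, min_periods=3)
print(partial)  # [1.0, 1.5, 2.0, 3.0]
print(strict)   # [None, None, 2.0, 3.0]
```

With `min_periods=1` the first rows are averaged over whatever history exists. Requiring a full window leaves them undefined.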
- class endgame.preprocessing.AdversarialFeatureSelector(threshold=0.05, max_features_to_remove=10, estimator=None, output_format='auto', random_state=None, verbose=False)[source]¶
Bases: PolarsTransformer
Removes features that contribute to train/test drift.
Uses adversarial validation to identify and remove features that differ significantly between train and test distributions.
- Parameters:
threshold (float, default=0.05) – Remove features with importance above this threshold.
max_features_to_remove (int, default=10) – Maximum number of features to remove.
estimator (BaseEstimator, optional) – Classifier for adversarial validation.
output_format (str)
random_state (int | None)
verbose (bool)
Examples
>>> from endgame.preprocessing import AdversarialFeatureSelector
>>> selector = AdversarialFeatureSelector(threshold=0.05)
>>> selector.fit(X_train, X_test=X_test)
>>> X_train_clean = selector.transform(X_train)
- fit(X, y=None, X_test=None, **fit_params)[source]¶
Identify features to remove based on adversarial validation.
- Parameters:
X (array-like) – Training features.
y (ignored)
X_test (array-like) – Test features for adversarial validation.
- Returns:
self
- set_fit_request(*, X_test='$UNCHANGED$')¶
Configure whether metadata should be requested to be passed to the fit method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to fit.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in fit.
self (AdversarialFeatureSelector)
- Returns:
self (object) – The updated object.
- class endgame.preprocessing.PermutationImportanceSelector(estimator=None, threshold=0.0, n_repeats=10, scoring=None, output_format='auto', random_state=None, verbose=False)[source]¶
Bases: PolarsTransformer
Selects features based on permutation importance.
More robust than model-specific importance measures because it measures actual predictive contribution.
- Parameters:
estimator (BaseEstimator) – Fitted estimator to evaluate.
threshold (float, default=0.0) – Minimum importance to keep a feature.
n_repeats (int, default=10) – Number of permutation repetitions.
scoring (str, optional) – Scoring metric for importance calculation.
output_format (str)
random_state (int | None)
verbose (bool)
Examples
>>> from endgame.preprocessing import PermutationImportanceSelector
>>> selector = PermutationImportanceSelector(estimator=model)
>>> selector.fit(X_val, y_val)
>>> X_selected = selector.transform(X_train)
- class endgame.preprocessing.NullImportanceSelector(estimator=None, n_iterations=100, significance_threshold=0.95, output_format='auto', random_state=None, verbose=False)[source]¶
Bases: PolarsTransformer
Selects features based on null importance distribution.
Features must significantly outperform a shuffled-target baseline. Robust method for identifying truly predictive features.
- Parameters:
Examples
>>> from endgame.preprocessing import NullImportanceSelector
>>> selector = NullImportanceSelector(n_iterations=100)
>>> selector.fit(X, y)
>>> X_selected = selector.transform(X)
- class endgame.preprocessing.BayesianDiscretizer(strategy='mdlp', max_bins=10, min_samples_bin=5, discrete_features='auto', max_unique_continuous=20, random_state=None, verbose=False)[source]¶
Bases: EndgameEstimator, TransformerMixin
Discretizes continuous features for Bayesian Network Classifier consumption.
Supports multiple discretization strategies with automatic handling of already-discrete features.
- Parameters:
strategy ({'mdlp', 'equal_width', 'equal_freq', 'kmeans'}, default='mdlp') – Discretization strategy:
- ‘mdlp’: Minimum Description Length Principle (supervised, requires y)
- ‘equal_width’: Fixed-width bins
- ‘equal_freq’: Equal-frequency bins (quantiles)
- ‘kmeans’: Cluster-based discretization
max_bins (int, default=10) – Maximum number of bins per feature.
min_samples_bin (int, default=5) – Minimum samples per bin (affects MDLP stopping criterion).
discrete_features (array-like of int | 'auto' | None, default='auto') – Which features are already discrete:
- ‘auto’: Detect based on dtype and unique values
- list of int: Indices of discrete features
- None: Treat all features as continuous
max_unique_continuous (int, default=20) – When discrete_features='auto', features with at most this many unique values are considered discrete.
random_state (int, optional) – Random seed for kmeans initialization.
verbose (bool, default=False) – Enable verbose output.
- n_bins_¶
Number of bins for each feature.
- Type:
np.ndarray
- discrete_features_¶
Boolean mask of discrete features.
- Type:
np.ndarray
- feature_names_in_¶
Feature names (if input was DataFrame).
- Type:
np.ndarray
Examples
>>> from endgame.preprocessing import BayesianDiscretizer
>>> disc = BayesianDiscretizer(strategy='mdlp')
>>> X_disc = disc.fit_transform(X_train, y_train)
>>> X_test_disc = disc.transform(X_test)
- fit(X, y=None, **fit_params)[source]¶
Fit the discretizer.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,), optional) – Target values. Required for ‘mdlp’ strategy.
- Returns:
self
- transform(X)[source]¶
Transform continuous features to discrete.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
np.ndarray – Discretized data with integer values.
- inverse_transform(X_disc)[source]¶
Approximate inverse transform (returns bin centers).
Note: This is lossy - the original continuous values cannot be recovered exactly.
- Parameters:
X_disc (np.ndarray) – Discretized data.
- Returns:
np.ndarray – Approximate continuous values (bin centers).
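Why the inverse transform is lossy can be seen with a small equal-width example. The sketch below uses hypothetical helpers and plain lists. It illustrates the bin-center idea in general, not the library's internals or its 'mdlp' strategy.

```python
def equal_width_edges(values, n_bins):
    """Return n_bins + 1 evenly spaced edges spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def discretize(values, edges):
    """Map each value to a bin index; the last bin includes the maximum."""
    inner = edges[1:-1]
    return [sum(1 for e in inner if v >= e) for v in values]


def bin_centers(bin_ids, edges):
    """Approximate inverse: replace each bin index with its midpoint."""
    return [(edges[b] + edges[b + 1]) / 2 for b in bin_ids]


x = [0.0, 1.0, 4.0, 9.0, 10.0]
edges = equal_width_edges(x, 5)
ids = discretize(x, edges)
approx = bin_centers(ids, edges)
print(ids)     # [0, 0, 2, 4, 4]
print(approx)  # [1.0, 1.0, 5.0, 9.0, 9.0]
```

The values 0.0 and 1.0 share a bin, so both come back as 1.0. The round trip preserves only which bin a value fell into.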
- set_inverse_transform_request(*, X_disc='$UNCHANGED$')¶
Configure whether metadata should be requested to be passed to the inverse_transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
X_disc (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_disc parameter in inverse_transform.
self (BayesianDiscretizer)
- Returns:
self (object) – The updated object.
- class endgame.preprocessing.SimpleImputer(strategy='median', fill_value=None, add_indicator=False, copy=True, verbose=False)[source]¶
Bases: EndgameEstimator, TransformerMixin
Simple imputation with mean, median, mode, or constant fill.
Thin wrapper around sklearn.impute.SimpleImputer with better defaults for competition settings (median instead of mean, which is more robust to outliers).
- Parameters:
strategy (str, default='median') – Imputation strategy:
- ‘mean’: Replace with column mean
- ‘median’: Replace with column median (default, outlier-robust)
- ‘most_frequent’: Replace with mode
- ‘constant’: Replace with fill_value
fill_value (float or str, optional) – Value to use when strategy='constant'. Default is 0.
add_indicator (bool, default=False) – If True, append binary missing-indicator columns.
copy (bool, default=True) – If True, create a copy of X before imputing.
verbose (bool, default=False) – Enable verbose output.
- statistics_¶
The imputation fill value for each feature.
- Type:
ndarray of shape (n_features,)
- indicator_¶
Indicator used to add binary indicators for missing values.
- Type:
MissingIndicator or None
Examples
>>> import numpy as np
>>> from endgame.preprocessing.imputation import SimpleImputer
>>> X = np.array([[1, 2], [np.nan, 3], [7, np.nan]])
>>> imp = SimpleImputer(strategy='median')
>>> imp.fit_transform(X)
array([[1. , 2. ],
       [4. , 3. ],
       [7. , 2.5]])
- fit(X, y=None, **fit_params)[source]¶
Fit the imputer on training data.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data with missing values (np.nan).
y (ignored)
- Returns:
self
- transform(X)[source]¶
Impute missing values in X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data with missing values.
- Returns:
X_imputed (ndarray or DataFrame of shape (n_samples, n_features)) – Imputed data.
- class endgame.preprocessing.IndicatorImputer(base_strategy='median', fill_value=None, only_missing=True, verbose=False)[source]¶
Bases: EndgameEstimator, TransformerMixin
Imputer that adds binary missing-indicator columns alongside imputed values.
For each feature with missing values, appends a binary column indicating which rows were originally missing. This is a common Kaggle trick that lets tree-based models learn different splits for missing vs. non-missing.
- Parameters:
base_strategy (str, default='median') – Strategy for filling missing values: ‘mean’, ‘median’, ‘most_frequent’, ‘constant’.
fill_value (float, optional) – Fill value when base_strategy=’constant’.
only_missing (bool, default=True) – If True, only add indicators for features that have missing values in the training data. If False, add indicators for all features.
verbose (bool, default=False) – Enable verbose output.
- statistics_¶
The imputation fill value for each feature.
- Type:
ndarray of shape (n_features,)
Examples
>>> import numpy as np
>>> from endgame.preprocessing.imputation import IndicatorImputer
>>> X = np.array([[1, 2], [np.nan, 3], [7, np.nan]])
>>> imp = IndicatorImputer(base_strategy='median')
>>> X_out = imp.fit_transform(X)
>>> X_out.shape
(3, 4)
- fit(X, y=None, **fit_params)[source]¶
Fit the indicator imputer.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (ignored)
- Returns:
self
- transform(X)[source]¶
Impute and add indicator columns.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data with missing values.
- Returns:
X_out (ndarray or DataFrame of shape (n_samples, n_features + n_indicators)) – Imputed data with binary indicator columns appended.
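The output layout (imputed columns followed by indicator columns) can be sketched with plain lists. The helpers below are hypothetical. They add an indicator for every column, which matches the documented behavior only when all columns have missing values or `only_missing=False`.

```python
import math
from statistics import median


def fit_medians(rows):
    """Per-column median, ignoring NaN entries."""
    return [median(v for v in col if not math.isnan(v)) for col in zip(*rows)]


def impute_with_indicators(rows, medians):
    """Fill NaNs with the column median and append 0/1 missing flags."""
    out = []
    for row in rows:
        filled = [m if math.isnan(v) else v for v, m in zip(row, medians)]
        flags = [1.0 if math.isnan(v) else 0.0 for v in row]
        out.append(filled + flags)
    return out


nan = float("nan")
X_demo = [[1.0, 2.0], [nan, 3.0], [7.0, nan]]
medians = fit_medians(X_demo)
X_demo_out = impute_with_indicators(X_demo, medians)
print(X_demo_out)
```

Each output row has four entries (two imputed values and two flags), consistent with the (3, 4) shape in the example above.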
- class endgame.preprocessing.KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean', add_indicator=False, copy=True, verbose=False)[source]¶
Bases: EndgameEstimator, TransformerMixin
K-Nearest Neighbors imputation with competition defaults.
Wraps sklearn.impute.KNNImputer with defaults tuned for tabular competitions: n_neighbors=5, uniform weights, nan_euclidean distance.
- Parameters:
n_neighbors (int, default=5) – Number of nearest neighbors to use.
weights (str, default='uniform') – Weight function for prediction: ‘uniform’ or ‘distance’.
metric (str, default='nan_euclidean') – Distance metric for finding neighbors.
add_indicator (bool, default=False) – If True, append binary missing-indicator columns.
copy (bool, default=True) – If True, create a copy of X.
verbose (bool, default=False) – Enable verbose output.
Examples
>>> import numpy as np
>>> from endgame.preprocessing.imputation import KNNImputer
>>> X = np.array([[1, 2], [np.nan, 3], [7, 6], [5, np.nan]])
>>> imp = KNNImputer(n_neighbors=2)
>>> imp.fit_transform(X)
array([[1., 2.],
       [4., 3.],
       [7., 6.],
       [5., 4.]])
- fit(X, y=None, **fit_params)[source]¶
Fit the KNN imputer.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (ignored)
- Returns:
self
- transform(X)[source]¶
Impute missing values using KNN.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data with missing values.
- Returns:
X_imputed (ndarray or DataFrame) – Imputed data.
- class endgame.preprocessing.MICEImputer(estimator=None, max_iter=10, tol=0.001, initial_strategy='median', sample_posterior=False, random_state=42, add_indicator=False, verbose=False)[source]¶
Bases: EndgameEstimator, TransformerMixin
Multiple Imputation by Chained Equations.
Uses sklearn.impute.IterativeImputer with BayesianRidge as the default estimator, which is the standard MICE implementation. Iteratively models each feature as a function of all other features.
- Parameters:
estimator (estimator, optional) – The estimator to predict each feature from all others. Default is BayesianRidge, which provides the standard MICE formulation.
max_iter (int, default=10) – Maximum number of imputation rounds.
tol (float, default=1e-3) – Convergence tolerance.
initial_strategy (str, default='median') – Strategy for initial imputation before iterating: ‘mean’, ‘median’, ‘most_frequent’, ‘constant’.
sample_posterior (bool, default=False) – If True, sample from the predictive posterior for each imputation, which is what proper multiple imputation requires.
random_state (int, default=42) – Random seed for reproducibility. Default set for deterministic results in competition settings.
add_indicator (bool, default=False) – If True, append binary missing-indicator columns.
verbose (bool, default=False) – Enable verbose output.
Examples
>>> import numpy as np
>>> from endgame.preprocessing.imputation import MICEImputer
>>> X = np.array([[1, 2], [np.nan, 3], [7, np.nan], [5, 4]])
>>> imp = MICEImputer(max_iter=10, random_state=42)
>>> X_imputed = imp.fit_transform(X)
- fit(X, y=None, **fit_params)[source]¶
Fit the MICE imputer.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (ignored)
- Returns:
self
- transform(X)[source]¶
Impute missing values using MICE.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data with missing values.
- Returns:
X_imputed (ndarray or DataFrame) – Imputed data.
- class endgame.preprocessing.MissForestImputer(n_estimators=100, max_iter=10, max_depth=None, max_features='sqrt', initial_strategy='median', random_state=42, n_jobs=-1, add_indicator=False, verbose=False)[source]¶
Bases: EndgameEstimator, TransformerMixin
Random Forest-based iterative imputation (MissForest algorithm).
Uses sklearn.impute.IterativeImputer with a RandomForestRegressor as the base estimator. This non-parametric approach handles non-linear relationships and interactions between features effectively.
- Parameters:
n_estimators (int, default=100) – Number of trees in the random forest estimator.
max_iter (int, default=10) – Maximum number of imputation rounds.
max_depth (int or None, default=None) – Maximum depth of each tree. None means nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.
max_features (str or float, default='sqrt') – Number of features considered at each split.
initial_strategy (str, default='median') – Strategy for initial imputation before iterating.
random_state (int, default=42) – Random seed for reproducibility.
n_jobs (int, default=-1) – Number of parallel jobs for the random forest. -1 uses all cores.
add_indicator (bool, default=False) – If True, append binary missing-indicator columns.
verbose (bool, default=False) – Enable verbose output.
Examples
>>> import numpy as np
>>> from endgame.preprocessing.imputation import MissForestImputer
>>> X = np.array([[1, 2], [np.nan, 3], [7, np.nan], [5, 4]])
>>> imp = MissForestImputer(n_estimators=50, random_state=42)
>>> X_imputed = imp.fit_transform(X)
- fit(X, y=None, **fit_params)[source]¶
Fit the MissForest imputer.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (ignored)
- Returns:
self
- transform(X)[source]¶
Impute missing values using MissForest.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data with missing values.
- Returns:
X_imputed (ndarray or DataFrame) – Imputed data.
- class endgame.preprocessing.AutoImputer(strategy='auto', low_threshold=0.05, high_threshold=0.3, random_state=42, add_indicator=False, verbose=False)[source]¶
Bases: EndgameEstimator, TransformerMixin
Automatic imputation strategy selection based on missingness patterns.
Analyzes the missingness structure in the data and selects an appropriate imputation strategy:
<5% missing -> SimpleImputer (fast, sufficient for low missingness)
5-30% missing -> KNNImputer (captures local structure)
>30% missing -> MICEImputer (models complex dependencies)
Also performs an approximate Little’s MCAR test to characterize the missingness mechanism (MCAR, MAR, or MNAR).
- Parameters:
strategy (str, default='auto') – Imputation strategy:
- ‘auto’: Automatically select based on missingness percentage
- ‘simple’: Force SimpleImputer
- ‘knn’: Force KNNImputer
- ‘mice’: Force MICEImputer
- ‘missforest’: Force MissForestImputer
low_threshold (float, default=0.05) – Missingness fraction below which SimpleImputer is used (in auto mode).
high_threshold (float, default=0.30) – Missingness fraction above which MICEImputer is used (in auto mode).
random_state (int, default=42) – Random seed for reproducibility.
add_indicator (bool, default=False) – If True, append binary missing-indicator columns.
verbose (bool, default=False) – Enable verbose output.
- imputer_¶
The fitted imputer instance.
- Type:
estimator
Examples
>>> import numpy as np
>>> from endgame.preprocessing.imputation import AutoImputer
>>> X = np.array([[1, 2], [np.nan, 3], [7, np.nan], [5, 4]])
>>> imp = AutoImputer(strategy='auto', random_state=42)
>>> X_imputed = imp.fit_transform(X)
>>> imp.selected_strategy_
'knn'
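The threshold rule described for auto mode can be sketched as a small decision function. This is an illustration under assumptions: the helper names are hypothetical, missingness is measured over all cells (the source does not say whether it is overall or per column), and the handling of values exactly on a threshold is a guess.

```python
import math


def missing_fraction(rows):
    """Fraction of NaN cells over the whole table (assumed definition)."""
    cells = [v for row in rows for v in row]
    return sum(1 for v in cells if math.isnan(v)) / len(cells)


def select_strategy(frac, low_threshold=0.05, high_threshold=0.30):
    """Pick an imputer name from the missingness fraction."""
    if frac < low_threshold:
        return "simple"
    if frac > high_threshold:
        return "mice"
    return "knn"


nan = float("nan")
X_small = [[1.0, 2.0], [nan, 3.0], [7.0, nan], [5.0, 4.0]]
frac = missing_fraction(X_small)
choice = select_strategy(frac)
print(frac, choice)  # 0.25 knn
```

Two of the eight cells are missing (25%), which falls in the middle band and agrees with the 'knn' result shown in the example above.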
- fit(X, y=None, **fit_params)[source]¶
Fit the auto imputer.
Analyzes missingness patterns and selects the appropriate strategy, then fits the chosen imputer.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (ignored)
- Returns:
self
- transform(X)[source]¶
Impute missing values using the selected strategy.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data with missing values.
- Returns:
X_imputed (ndarray or DataFrame) – Imputed data.
- class endgame.preprocessing.TargetTransformer(regressor=None, method='auto', random_state=None, verbose=False)[source]¶
Bases: EndgameEstimator, RegressorMixin
Wrapper that applies target transformations for regression.
Transforms the target variable y during fit, trains the wrapped regressor on the transformed targets, and inverse-transforms predictions at inference time.
- Parameters:
regressor (estimator) – Any sklearn-compatible regressor. This is required.
method (str, default='auto') – Transformation method. One of:
- 'auto': Test normality via Shapiro-Wilk; try Box-Cox and Yeo-Johnson and pick whichever produces the most normal transformed y. Falls back to 'yeo_johnson' when Box-Cox is not applicable (non-positive targets).
- 'log': Natural log. Requires strictly positive targets.
- 'log1p': log(1 + y). Requires non-negative targets.
- 'sqrt': Square root. Requires non-negative targets.
- 'box_cox': Box-Cox power transform (scipy). Requires strictly positive targets.
- 'yeo_johnson': Yeo-Johnson power transform (scipy). Works with any real-valued targets.
- 'quantile': Sklearn QuantileTransformer mapping to normal.
- 'rank': Rank-based (ordinal) normalization.
- 'none': No transformation (passthrough).
random_state (int, optional) – Random seed for reproducibility (passed to quantile transform and the wrapped regressor if it supports it).
verbose (bool, default=False) – Enable verbose output.
- regressor_¶
The fitted regressor (clone of regressor).
- Type:
estimator
- qt_¶
Fitted QuantileTransformer instance (for method='quantile').
- Type:
QuantileTransformer or None
- y_train_sorted_¶
Sorted training targets for rank inverse transform.
- Type:
ndarray or None
- feature_importances_¶
Delegated from the wrapped regressor, if available.
- Type:
ndarray
Examples
>>> from sklearn.ensemble import RandomForestRegressor
>>> from endgame.preprocessing import TargetTransformer
>>> model = TargetTransformer(
...     regressor=RandomForestRegressor(n_estimators=100, random_state=42),
...     method='auto',
... )
>>> model.fit(X_train, y_train)
>>> preds = model.predict(X_test)
- fit(X, y, **fit_params)[source]¶
Fit the wrapped regressor on transformed targets.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training features.
y (array-like of shape (n_samples,)) – Training targets.
**fit_params (dict) – Additional parameters forwarded to the wrapped regressor’s fit method (e.g. sample_weight).
- Returns:
self – Fitted TargetTransformer.
- predict(X)[source]¶
Predict target values, inverse-transforming the regressor’s output.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test features.
- Return type:
- Returns:
ndarray of shape (n_samples,) – Predicted target values in the original scale.
- predict_proba(X)[source]¶
Pass through to the wrapped regressor’s predict_proba, if available.
Some regressors (e.g. NGBoost) support probabilistic predictions. This method delegates directly without inverse-transforming, as the semantics are regressor-specific.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test features.
- Return type:
- Returns:
ndarray – Whatever the wrapped regressor returns from predict_proba.
- Raises:
AttributeError – If the wrapped regressor does not support predict_proba.
- property feature_importances_: ndarray¶
Feature importances from the wrapped regressor.
- Returns:
ndarray of shape (n_features,) – Feature importances.
- Raises:
AttributeError – If the wrapped regressor does not expose feature_importances_.
- set_score_request(*, sample_weight='$UNCHANGED$')¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
self (TargetTransformer)
- Returns:
self (object) – The updated object.
- Return type:
- class endgame.preprocessing.TargetQuantileTransformer(regressor=None, n_quantiles=1000, output_distribution='normal', subsample=100000, random_state=None, verbose=False)[source]¶
Bases: EndgameEstimator, RegressorMixin
Convenience wrapper applying QuantileTransformer to the target.
This is a specialized shortcut for TargetTransformer(method='quantile'). It wraps a regressor and normalizes the target via sklearn’s QuantileTransformer before fitting.
- Parameters:
regressor (estimator) – Any sklearn-compatible regressor.
n_quantiles (int, default=1000) – Number of quantiles for the QuantileTransformer.
output_distribution (str, default='normal') – Output distribution: ‘normal’ or ‘uniform’.
subsample (int, default=100000) – Subsample size for quantile estimation.
random_state (int, optional) – Random seed for reproducibility.
verbose (bool, default=False) – Enable verbose output.
- regressor_¶
The fitted regressor.
- Type:
estimator
- qt_¶
The fitted target QuantileTransformer.
- Type:
QuantileTransformer
- feature_importances_¶
Delegated from the wrapped regressor, if available.
- Type:
ndarray
Examples
>>> from sklearn.linear_model import Ridge
>>> from endgame.preprocessing.target_transform import TargetQuantileTransformer
>>> model = TargetQuantileTransformer(
...     regressor=Ridge(),
...     n_quantiles=500,
...     output_distribution='normal',
... )
>>> model.fit(X_train, y_train)
>>> preds = model.predict(X_test)
- fit(X, y, **fit_params)[source]¶
Fit the wrapped regressor on quantile-transformed targets.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training features.
y (array-like of shape (n_samples,)) – Training targets.
**fit_params (dict) – Additional parameters forwarded to the regressor.
- Return type:
- Returns:
self
- predict(X)[source]¶
Predict target values, inverse-transforming the output.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Test features.
- Return type:
- Returns:
ndarray of shape (n_samples,) – Predicted target values in the original scale.
- set_score_request(*, sample_weight='$UNCHANGED$')¶
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to score.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
self (TargetQuantileTransformer)
- Returns:
self (object) – The updated object.
- Return type:
- class endgame.preprocessing.ConfidentLearningFilter(base_estimator='rf', cv=5, threshold='auto', method='prune_by_class', n_jobs=1, random_state=None)[source]¶
Bases: BaseEstimator
Identify mislabeled examples using Confident Learning.
Uses cross-validated predicted probabilities to estimate the joint distribution of noisy and true labels, then identifies examples that are likely mislabeled.
- Parameters:
base_estimator (estimator or str, default='rf') – Classifier to use for cross-validated probability estimation. Can be ‘rf’ (RandomForest), ‘xgboost’, ‘lgbm’, or any sklearn-compatible classifier with predict_proba.
cv (int, default=5) – Number of cross-validation folds for probability estimation.
threshold (float or str, default='auto') – Confidence threshold for identifying noise. If ‘auto’, uses per-class average predicted probability as threshold. If float, uses the same threshold for all classes.
method (str, default='prune_by_class') – Method for identifying noisy labels:
- ‘prune_by_class’: Remove examples with low self-confidence
- ‘prune_by_noise_rate’: Remove based on estimated noise rates
- ‘both’: Intersection of both methods (most conservative)
n_jobs (int, default=1) – Number of parallel jobs for cross-validation.
random_state (int or None, default=None) – Random state for reproducibility.
- noise_mask_¶
Boolean mask where True indicates suspected noisy labels.
- Type:
ndarray of shape (n_samples,)
- noise_indices_¶
Indices of suspected noisy examples.
- Type:
ndarray
- confident_joint_¶
Estimated joint distribution of noisy vs. true labels.
- per_class_noise_rate_¶
Estimated noise rate per class.
- Type:
ndarray
Example
>>> from endgame.preprocessing import ConfidentLearningFilter
>>> clf = ConfidentLearningFilter(base_estimator='rf', cv=5)
>>> noise_mask = clf.fit_detect(X, y)
>>> print(f"Found {noise_mask.sum()} noisy labels ({noise_mask.mean():.1%})")
>>> X_clean, y_clean = X[~noise_mask], y[~noise_mask]
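To make the confident-joint idea concrete, here is a minimal sketch of the general Confident Learning recipe with per-class average thresholds. It assumes out-of-fold probabilities are already available; `confident_joint` is a hypothetical helper, and the pruning rules endgame applies on top of the joint are not shown in this reference.

```python
def confident_joint(pred_probs, labels):
    """Count (given label, confidently predicted label) pairs."""
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class k
    # over the examples whose given label is k.
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for p, y in zip(pred_probs, labels) if y == k]
        thresholds.append(sum(own) / len(own))

    joint = [[0] * n_classes for _ in range(n_classes)]
    flagged = []
    for p, y in zip(pred_probs, labels):
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            flagged.append(False)  # not confidently in any class, so not counted
            continue
        likely = max(confident, key=lambda k: p[k])
        joint[y][likely] += 1
        flagged.append(likely != y)  # off-diagonal entry: suspected label error
    return joint, flagged, thresholds

pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.15, 0.85], [0.1, 0.9], [0.3, 0.7]]
labels = [0, 0, 0, 1, 1]
joint, flagged, thresholds = confident_joint(pred_probs, labels)
```

In this toy input only the third example is flagged: it carries label 0 but is confidently predicted as class 1.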
- fit(X, y)[source]¶
Fit the noise detector.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training features.
y (array-like of shape (n_samples,)) – Noisy training labels.
- Return type:
- Returns:
self
- fit_detect(X, y)[source]¶
Fit and return the noise mask.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training features.
y (array-like of shape (n_samples,)) – Noisy training labels.
- Return type:
- Returns:
noise_mask (ndarray of shape (n_samples,)) – Boolean mask where True indicates suspected noisy label.
- class endgame.preprocessing.ConsensusFilter(estimators=None, cv=5, consensus_threshold=0.5, n_jobs=1, random_state=None)[source]¶
Bases: BaseEstimator
Identify noisy labels via consensus of multiple classifiers.
Trains multiple diverse classifiers and identifies examples where the majority disagree with the given label.
- Parameters:
estimators (list of estimators, optional) – List of classifiers to use. If None, uses a default diverse set.
cv (int, default=5) – Cross-validation folds for prediction.
consensus_threshold (float, default=0.5) – Fraction of classifiers that must disagree with the given label for it to be flagged as noisy.
n_jobs (int, default=1) – Number of parallel jobs.
random_state (int or None, default=None) – Random state for reproducibility.
Example
>>> from endgame.preprocessing import ConsensusFilter
>>> cf = ConsensusFilter(consensus_threshold=0.7)
>>> noise_mask = cf.fit_detect(X, y)
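The voting rule behind consensus_threshold can be sketched as follows. This assumes each classifier's cross-validated predictions are already computed and that a sample is flagged when the disagreeing fraction is at least the threshold; whether the library uses `>=` or `>` is not stated here, and `consensus_noise_mask` is a hypothetical name.

```python
def consensus_noise_mask(predictions, labels, consensus_threshold=0.5):
    """predictions[m][i] is classifier m's out-of-fold prediction for sample i."""
    n_models = len(predictions)
    mask = []
    for i, y in enumerate(labels):
        disagree = sum(1 for preds in predictions if preds[i] != y)
        # Assumption: flag when the disagreeing fraction reaches the threshold.
        mask.append(disagree / n_models >= consensus_threshold)
    return mask

labels = [0, 1, 1, 0]
predictions = [
    [0, 1, 0, 1],  # classifier A
    [0, 0, 0, 1],  # classifier B
    [0, 1, 0, 0],  # classifier C
]
noise_mask = consensus_noise_mask(predictions, labels, consensus_threshold=0.5)
```

Raising the threshold makes the filter more conservative: with consensus_threshold=0.7 only the sample that all three classifiers reject would remain flagged.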
- fit(X, y)[source]¶
Fit the consensus noise detector.
- Parameters:
X (array-like of shape (n_samples, n_features))
y (array-like of shape (n_samples,))
- Return type:
- Returns:
self
- class endgame.preprocessing.CrossValNoiseDetector(base_estimator=None, cv=5, n_repeats=3, misclassification_threshold=0.5, random_state=None)[source]¶
Bases: BaseEstimator
Simple cross-validated noise detection.
Flags examples that are consistently misclassified across CV folds as potentially noisy.
- Parameters:
base_estimator (estimator, default=None) – Classifier to use. If None, uses RandomForestClassifier.
cv (int, default=5) – Number of CV folds.
n_repeats (int, default=3) – Number of repetitions with different random seeds.
misclassification_threshold (float, default=0.5) – Fraction of times an example must be misclassified across all folds and repeats to be flagged as noisy.
random_state (int or None, default=None) – Random state.
Example
>>> from endgame.preprocessing import CrossValNoiseDetector
>>> detector = CrossValNoiseDetector(n_repeats=5)
>>> noise_mask = detector.fit_detect(X, y)
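The repeated cross-validation idea can be sketched in plain Python. The example below assumes a tiny nearest-centroid classifier on one feature as a stand-in for the base estimator, and flags samples whose out-of-fold misclassification rate reaches the threshold; all function names are hypothetical.

```python
import random

def nearest_centroid_predict(train_x, train_y, x):
    """Tiny stand-in classifier: predict the class with the closest 1-D mean."""
    centroids = {}
    for label in set(train_y):
        vals = [v for v, t in zip(train_x, train_y) if t == label]
        centroids[label] = sum(vals) / len(vals)
    return min(centroids, key=lambda label: abs(centroids[label] - x))

def misclassification_rates(xs, ys, cv=5, n_repeats=3, seed=0):
    """Fraction of repeats in which each sample is misclassified out-of-fold."""
    n = len(xs)
    errors = [0] * n
    for repeat in range(n_repeats):
        idx = list(range(n))
        random.Random(seed + repeat).shuffle(idx)  # new fold assignment per repeat
        folds = [idx[k::cv] for k in range(cv)]
        for held_out in folds:
            held = set(held_out)
            train_x = [xs[i] for i in range(n) if i not in held]
            train_y = [ys[i] for i in range(n) if i not in held]
            for i in held_out:
                if nearest_centroid_predict(train_x, train_y, xs[i]) != ys[i]:
                    errors[i] += 1
    return [e / n_repeats for e in errors]

xs = [0.0, 0.1, 0.2, 0.3, 0.4, 5.0, 5.1, 5.2, 5.3, 5.4]
ys = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # sample 4 sits inside the class-0 cluster
rates = misclassification_rates(xs, ys)
noise_mask = [r >= 0.5 for r in rates]  # misclassification_threshold = 0.5
```

Each sample is held out exactly once per repeat, so the rate is the share of repeats in which its given label was contradicted.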
- class endgame.preprocessing.SMOTEResampler(sampling_strategy='auto', k_neighbors=5, random_state=None)[source]¶
Bases: BaseEstimator
SMOTE (Synthetic Minority Over-sampling Technique) wrapper.
Creates synthetic samples by interpolating between minority class instances and their k-nearest neighbors.
- Parameters:
sampling_strategy (float, str, dict, or callable, default='auto') – Sampling information:
- ‘auto’: Resample all classes but the majority
- ‘minority’: Resample only the minority class
- ‘not majority’: Resample all classes but the majority
- ‘all’: Resample all classes
- float: Ratio of minority to majority (0 < ratio <= 1)
- dict: {class_label: n_samples} for each class
k_neighbors (int, default=5) – Number of nearest neighbors used to construct synthetic samples.
random_state (int or None, default=None) – Random seed for reproducibility.
n_jobs (int, default=-1) – Number of parallel jobs for neighbor search.
- sampler_¶
The fitted SMOTE sampler.
- Type:
imblearn.over_sampling.SMOTE
Examples
>>> from endgame.preprocessing import SMOTEResampler
>>> smote = SMOTEResampler(k_neighbors=5, random_state=42)
>>> X_res, y_res = smote.fit_resample(X, y)
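The interpolation step that SMOTE performs can be illustrated with a short standalone sketch. It is not the wrapped imblearn implementation; `smote_sample` is a hypothetical helper that draws each synthetic point on the segment between a minority sample and one of its nearest minority neighbours.

```python
import math
import random

def smote_sample(minority, k_neighbors=5, n_new=4, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k_neighbors]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(b + lam * (q - b) for b, q in zip(base, neighbour)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sample(minority, k_neighbors=2, n_new=5, seed=42)
```

Because every synthetic point lies between two existing minority samples, the new points stay inside the convex hull of the minority class.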
- fit(X, y)[source]¶
Fit the SMOTE sampler.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.
- Return type:
- Returns:
self (SMOTEResampler) – Fitted sampler.
- fit_resample(X, y)[source]¶
Fit and resample the dataset.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.
- Return type:
- Returns:
X_resampled (ndarray of shape (n_samples_new, n_features)) – Resampled training data.
y_resampled (ndarray of shape (n_samples_new,)) – Resampled target values.
- class endgame.preprocessing.BorderlineSMOTEResampler(sampling_strategy='auto', k_neighbors=5, m_neighbors=10, kind='borderline-1', random_state=None)[source]¶
Bases: BaseEstimator
Borderline-SMOTE wrapper focusing on difficult borderline samples.
Only generates synthetic samples from minority instances that are near the decision boundary (borderline instances).
- Parameters:
sampling_strategy (float, str, dict, or callable, default='auto') – See SMOTEResampler for details.
k_neighbors (int, default=5) – Number of nearest neighbors for SMOTE interpolation.
m_neighbors (int, default=10) – Number of nearest neighbors to determine if instance is borderline.
kind ({'borderline-1', 'borderline-2'}, default='borderline-1') –
- ‘borderline-1’: Only use borderline minority instances
- ‘borderline-2’: Use borderline minority + their majority neighbors
random_state (int or None, default=None) – Random seed for reproducibility.
n_jobs (int, default=-1) – Number of parallel jobs.
- class endgame.preprocessing.ADASYNResampler(sampling_strategy='auto', n_neighbors=5, random_state=None)[source]¶
Bases: BaseEstimator
ADASYN (Adaptive Synthetic Sampling) wrapper.
Generates synthetic samples adaptively based on local density - more samples are generated in regions where minority class is sparse.
- Parameters:
- class endgame.preprocessing.SVMSMOTEResampler(sampling_strategy='auto', k_neighbors=5, m_neighbors=10, svm_estimator=None, out_step=0.5, random_state=None)[source]¶
Bases: BaseEstimator
SVM-SMOTE wrapper using SVM to identify borderline samples.
Uses SVM to identify support vectors (borderline samples) and generates synthetic samples only from those.
- Parameters:
sampling_strategy (float, str, dict, or callable, default='auto') – See SMOTEResampler for details.
k_neighbors (int, default=5) – Number of nearest neighbors for SMOTE.
m_neighbors (int, default=10) – Number of nearest neighbors for borderline detection.
svm_estimator (estimator or None, default=None) – SVM classifier. If None, uses SVC with default parameters.
out_step (float, default=0.5) – Step size for generating samples outside the decision boundary.
random_state (int or None, default=None) – Random seed for reproducibility.
n_jobs (int, default=-1) – Number of parallel jobs.
- class endgame.preprocessing.KMeansSMOTEResampler(sampling_strategy='auto', k_neighbors=2, kmeans_estimator=None, cluster_balance_threshold=0.1, density_exponent='auto', random_state=None, n_jobs=-1)[source]¶
Bases: BaseEstimator
K-Means SMOTE wrapper for cluster-based oversampling.
Applies k-means clustering before SMOTE, generating synthetic samples in under-represented clusters.
- Parameters:
sampling_strategy (float, str, dict, or callable, default='auto') – See SMOTEResampler for details.
k_neighbors (int, default=2) – Number of nearest neighbors for SMOTE.
kmeans_estimator (estimator or int, default=None) – KMeans instance or number of clusters. If None, uses n_classes.
cluster_balance_threshold (float, default=0.1) – Threshold for considering clusters as imbalanced.
density_exponent (float or 'auto', default='auto') – Exponent for density-based sample allocation.
random_state (int or None, default=None) – Random seed for reproducibility.
n_jobs (int, default=-1) – Number of parallel jobs.
- class endgame.preprocessing.RandomOverSampler(sampling_strategy='auto', random_state=None, shrinkage=None)[source]¶
Bases: BaseEstimator
Random over-sampling wrapper (duplicates minority samples).
Simply duplicates random minority class samples. Fast but may lead to overfitting.
- Parameters:
- class endgame.preprocessing.EditedNearestNeighbours(sampling_strategy='auto', n_neighbors=3, kind_sel='all', n_jobs=-1)[source]¶
Bases: BaseEstimator
Edited Nearest Neighbours (ENN) under-sampling.
Removes samples whose class label differs from the majority of their k-nearest neighbors (noise removal).
- Parameters:
sampling_strategy (str, list, or callable, default='auto') – Classes to be under-sampled.
n_neighbors (int, default=3) – Number of nearest neighbors for majority voting.
kind_sel ({'all', 'mode'}, default='all') –
- ‘all’: Sample removed if any neighbor is from different class
- ‘mode’: Sample removed if majority of neighbors are different
n_jobs (int, default=-1) – Number of parallel jobs.
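The difference between the two kind_sel rules is easiest to see on a toy dataset. The sketch below follows the descriptions above literally and, as a simplification, applies the rule to every class rather than only those selected by sampling_strategy; `enn_keep_mask` is a hypothetical helper.

```python
import math

def enn_keep_mask(X, y, n_neighbors=3, kind_sel="all"):
    """Return True for samples that ENN would keep."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        order = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        neigh_labels = [y[j] for j in order[:n_neighbors]]
        different = sum(1 for lab in neigh_labels if lab != yi)
        if kind_sel == "all":
            keep.append(different == 0)  # removed if any neighbour differs
        else:  # "mode"
            keep.append(different <= len(neigh_labels) / 2)  # removed if most differ
    return keep

# Two clean clusters plus one class-1 point (index 8) inside the class-0 cluster.
X = [(0.0,), (0.1,), (0.2,), (0.3,), (2.0,), (2.1,), (2.2,), (2.3,), (0.15,)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

keep_all = enn_keep_mask(X, y, n_neighbors=3, kind_sel="all")
keep_mode = enn_keep_mask(X, y, n_neighbors=3, kind_sel="mode")
```

With 'all', the single stray point also causes its class-0 neighbours to be removed, whereas 'mode' drops only the stray point itself; this is why 'all' is the more aggressive setting.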
- class endgame.preprocessing.AllKNNUnderSampler(sampling_strategy='auto', n_neighbors=3, kind_sel='all', allow_minority=False, n_jobs=-1)[source]¶
Bases: BaseEstimator
AllKNN under-sampling (multiple passes of ENN).
Applies ENN repeatedly with increasing k values until no more samples are removed.
- Parameters:
sampling_strategy (str, list, or callable, default='auto') – Classes to be under-sampled.
n_neighbors (int, default=3) – Starting number of nearest neighbors.
kind_sel ({'all', 'mode'}, default='all') – Selection strategy (see EditedNearestNeighbours).
allow_minority (bool, default=False) – If True, allow removal of minority samples.
n_jobs (int, default=-1) – Number of parallel jobs.
- class endgame.preprocessing.TomekLinksUnderSampler(sampling_strategy='auto', n_jobs=-1)[source]¶
Bases: BaseEstimator
Tomek Links under-sampling.
Removes Tomek links - pairs of instances from different classes that are each other’s nearest neighbor. Cleans the decision boundary.
- Parameters:
- class endgame.preprocessing.RandomUnderSampler(sampling_strategy='auto', random_state=None, replacement=False)[source]¶
Bases: BaseEstimator
Random under-sampling (removes random majority samples).
Randomly removes majority class samples. Fast but may lose important information.
- Parameters:
- class endgame.preprocessing.NearMissUnderSampler(sampling_strategy='auto', version=1, n_neighbors=3, n_neighbors_ver3=3, n_jobs=-1)[source]¶
Bases: BaseEstimator
NearMiss under-sampling using nearest neighbor heuristics.
Selects majority samples based on their distance to minority samples.
- Parameters:
sampling_strategy (float, str, dict, or callable, default='auto') – Sampling information.
version ({1, 2, 3}, default=1) – Version of NearMiss algorithm:
- 1: Select majority samples with smallest average distance to the k nearest minority samples
- 2: Select majority samples with smallest average distance to the k farthest minority samples
- 3: Select majority samples with smallest distance to each minority sample
n_neighbors (int, default=3) – Number of nearest neighbors.
n_neighbors_ver3 (int, default=3) – Number of neighbors for version 3.
n_jobs (int, default=-1) – Number of parallel jobs.
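As an illustration of the version 1 rule, the following sketch ranks majority samples by their average distance to the nearest minority samples and keeps the closest ones. `nearmiss1_select` and its `n_keep` argument are hypothetical; the wrapped sampler derives the number to keep from sampling_strategy.

```python
import math

def nearmiss1_select(majority, minority, n_keep, n_neighbors=3):
    """Keep the n_keep majority samples closest (on average) to the minority class."""
    def avg_dist_to_nearest_minority(p):
        nearest = sorted(math.dist(p, q) for q in minority)[:n_neighbors]
        return sum(nearest) / len(nearest)

    ranked = sorted(majority, key=avg_dist_to_nearest_minority)
    return ranked[:n_keep]

minority = [(0.0,), (0.2,), (0.4,)]
majority = [(1.0,), (5.0,), (0.5,), (9.0,)]
kept = nearmiss1_select(majority, minority, n_keep=2)
```

The retained majority samples are the ones nearest the minority class, so the decision boundary region is preserved while distant majority points are discarded.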
- class endgame.preprocessing.CondensedNearestNeighbour(sampling_strategy='auto', random_state=None, n_neighbors=1, n_seeds_S=1, n_jobs=-1)[source]¶
Bases: BaseEstimator
Condensed Nearest Neighbour (CNN) under-sampling.
Iteratively selects samples that are misclassified by 1-NN on the current condensed set. Finds a minimal consistent subset.
- Parameters:
sampling_strategy (str, list, or callable, default='auto') – Classes to be under-sampled.
random_state (int or None, default=None) – Random seed for reproducibility.
n_neighbors (int, default=1) – Number of nearest neighbors.
n_seeds_S (int, default=1) – Number of samples to start the condensing.
n_jobs (int, default=-1) – Number of parallel jobs.
- class endgame.preprocessing.OneSidedSelectionUnderSampler(sampling_strategy='auto', random_state=None, n_neighbors=1, n_seeds_S=1, n_jobs=-1)[source]¶
Bases: BaseEstimator
One-Sided Selection (OSS) under-sampling.
Combines Tomek links removal with CNN to remove noisy and redundant majority samples.
- Parameters:
sampling_strategy (str, list, or callable, default='auto') – Classes to be under-sampled.
random_state (int or None, default=None) – Random seed for reproducibility.
n_neighbors (int, default=1) – Number of nearest neighbors for CNN step.
n_seeds_S (int, default=1) – Number of samples to start CNN condensing.
n_jobs (int, default=-1) – Number of parallel jobs.
- class endgame.preprocessing.NeighbourhoodCleaningRule(sampling_strategy='auto', n_neighbors=3, threshold_cleaning=0.5, n_jobs=None)[source]¶
Bases: BaseEstimator
Neighbourhood Cleaning Rule (NCR) under-sampling.
Uses ENN to clean the data and then removes majority samples whose nearest neighbors are mostly minority.
- Parameters:
- class endgame.preprocessing.InstanceHardnessThresholdSampler(sampling_strategy='auto', estimator=None, cv=5, random_state=None, n_jobs=-1)[source]¶
Bases: BaseEstimator
Instance Hardness Threshold (IHT) under-sampling.
Removes samples that are hard to classify based on a classifier’s predicted probabilities.
- Parameters:
sampling_strategy (str, list, or callable, default='auto') – Classes to be under-sampled.
estimator (estimator or None, default=None) – Classifier for computing instance hardness. If None, uses RandomForestClassifier.
cv (int, default=5) – Number of cross-validation folds.
random_state (int or None, default=None) – Random seed for reproducibility.
n_jobs (int, default=-1) – Number of parallel jobs.
- class endgame.preprocessing.ClusterCentroidsUnderSampler(sampling_strategy='auto', random_state=None, estimator=None, voting='auto')[source]¶
Bases: BaseEstimator
Cluster Centroids under-sampling.
Replaces majority samples with cluster centroids from k-means.
- Parameters:
sampling_strategy (float, str, dict, or callable, default='auto') – Sampling information.
random_state (int or None, default=None) – Random seed for reproducibility.
estimator (estimator or None, default=None) – Clustering estimator. If None, uses KMeans.
voting ({'hard', 'soft', 'auto'}, default='auto') – Voting strategy for cluster assignment.
- class endgame.preprocessing.SMOTEENNResampler(sampling_strategy='auto', smote=None, enn=None, random_state=None, n_jobs=-1)[source]¶
Bases: BaseEstimator
SMOTE + Edited Nearest Neighbours combined resampling.
Applies SMOTE over-sampling followed by ENN cleaning to remove noisy synthetic samples.
- Parameters:
sampling_strategy (float, str, dict, or callable, default='auto') – Sampling strategy for SMOTE.
smote (SMOTEResampler or dict, default=None) – SMOTE instance or parameters.
enn (EditedNearestNeighbours or dict, default=None) – ENN instance or parameters.
random_state (int or None, default=None) – Random seed for reproducibility.
n_jobs (int, default=-1) – Number of parallel jobs.
- class endgame.preprocessing.SMOTETomekResampler(sampling_strategy='auto', smote=None, tomek=None, random_state=None, n_jobs=-1)[source]¶
Bases: BaseEstimator
SMOTE + Tomek Links combined resampling.
Applies SMOTE over-sampling followed by Tomek links removal to clean the decision boundary.
- Parameters:
sampling_strategy (float, str, dict, or callable, default='auto') – Sampling strategy for SMOTE.
smote (SMOTEResampler or dict, default=None) – SMOTE instance or parameters.
tomek (TomekLinksUnderSampler or dict, default=None) – Tomek Links instance or parameters.
random_state (int or None, default=None) – Random seed for reproducibility.
n_jobs (int, default=-1) – Number of parallel jobs.
- class endgame.preprocessing.MultivariateGaussianSMOTE(sampling_strategy='auto', k_neighbors=5, regularization=1e-06, random_state=None)[source]¶
Bases: BaseEstimator
Multivariate Gaussian SMOTE oversampler.
For each minority sample, fits a local multivariate Gaussian from its k-nearest minority neighbours and samples new points from it.
- Parameters:
sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets() for semantics.
k_neighbors (int, default=5) – Number of nearest minority neighbours for covariance estimation.
regularization (float, default=1e-6) – Ridge added to the diagonal of local covariance matrices to ensure positive-definiteness.
random_state (int or None, default=None) – Random seed.
References
“Do we need rebalancing strategies?” (ICLR 2025)
- fit(X, y)[source]¶
Fit the sampler (validates input and computes targets).
- Parameters:
X (array-like of shape (n_samples, n_features))
y (array-like of shape (n_samples,))
- Return type:
- Returns:
self
- class endgame.preprocessing.SimplicialSMOTE(sampling_strategy='auto', k_neighbors=5, simplex_dim=2, random_state=None)[source]¶
Bases: BaseEstimator
Simplicial complex SMOTE oversampler.
Builds simplicial complexes from the k-NN graph of minority samples and generates new points inside simplices using Dirichlet-distributed barycentric coordinates.
- Parameters:
sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().
k_neighbors (int, default=5) – Number of nearest neighbours for graph construction.
simplex_dim (int, default=2) – Dimension of the simplices to sample from (2 = triangles, 3 = tetrahedra). Clamped to min(simplex_dim, k_neighbors).
random_state (int or None, default=None) – Random seed.
References
Simplicial complex extension of SMOTE (KDD 2025)
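Sampling a point inside a simplex with Dirichlet-distributed barycentric coordinates can be sketched as follows. The example assumes a flat Dirichlet(1, ..., 1), obtained by normalising exponential draws; the concentration parameters the library uses are not documented here, and `sample_in_simplex` is a hypothetical helper.

```python
import random

def sample_in_simplex(vertices, rng):
    """Draw one point inside the simplex spanned by the given vertices."""
    # Normalised Exp(1) draws follow Dirichlet(1, ..., 1): uniform on the simplex.
    raw = [rng.expovariate(1.0) for _ in vertices]
    total = sum(raw)
    weights = [r / total for r in raw]
    dim = len(vertices[0])
    # The new point is the weighted (barycentric) combination of the vertices.
    point = tuple(sum(w * v[d] for w, v in zip(weights, vertices)) for d in range(dim))
    return point, weights

rng = random.Random(0)
triangle = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # simplex_dim=2: three vertices
point, weights = sample_in_simplex(triangle, rng)
```

Plain SMOTE is the one-dimensional case of this scheme: with two vertices the barycentric weights reduce to a single interpolation factor along a line segment.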
- class endgame.preprocessing.CVSMOTEResampler(sampling_strategy='auto', k_neighbors=5, cv=3, estimator=None, scoring='f1_macro', candidate_pool_factor=2.0, random_state=None)[source]¶
Bases: BaseEstimator
Cross-validation guided SMOTE oversampler.
Generates a pool of candidate synthetic samples via SMOTE-style interpolation, then uses cross-validation to retain only those that improve a scorer metric.
- Parameters:
sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().
k_neighbors (int, default=5) – Nearest neighbours for SMOTE interpolation.
cv (int, default=3) – Number of cross-validation folds for candidate evaluation.
estimator (estimator or None, default=None) – Classifier used to score candidate batches. Defaults to LogisticRegression(max_iter=500).
scoring (str, default='f1_macro') – Scoring metric for cross-validation (sklearn convention).
candidate_pool_factor (float, default=2.0) – Generate this many times the required synthetic samples as candidates, then keep the best subset.
random_state (int or None, default=None) – Random seed.
References
CV-informed SMOTE (ICLR 2025)
- class endgame.preprocessing.OverlapRegionDetector(sampling_strategy='auto', base_sampler='smote', overlap_estimator=None, k_neighbors=5, threshold=0.3, random_state=None)[source]¶
Bases: BaseEstimator
Overlap Region Detection meta-method for class imbalance.
Identifies the overlap region between classes using classifier uncertainty, then applies a base sampler with overlap awareness.
Algorithm¶
1. Train a classifier to get predicted probabilities.
2. Samples with high uncertainty (max prob < 1 - threshold) are labelled as “overlap”.
3. Apply the base sampler on the augmented label space.
4. Map generated samples back to original labels.
- Parameters:
sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().
base_sampler (str or estimator, default='smote') – Base oversampling method. If a string, looked up in the combined sampler registries. Otherwise must support fit_resample(X, y).
overlap_estimator (estimator or None, default=None) – Classifier for overlap detection. Defaults to RandomForestClassifier(n_estimators=100).
k_neighbors (int, default=5) – Passed to base sampler when constructed from string.
threshold (float, default=0.3) – Uncertainty threshold: a sample is in the overlap region if max(predicted_proba) < 1 - threshold.
random_state (int or None, default=None) – Random seed.
References
Overlap Region Detection (AAAI 2025)
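The uncertainty rule and the label augmentation around it can be sketched in a few lines. The rule `max(predicted_proba) < 1 - threshold` comes from the parameter description; representing augmented labels as `(label, in_overlap)` pairs is an assumption made for illustration, and all function names are hypothetical.

```python
def overlap_mask(pred_probs, threshold=0.3):
    """True where the classifier's top probability is below 1 - threshold."""
    return [max(p) < 1 - threshold for p in pred_probs]

def augment_labels(labels, mask):
    """Assumed encoding: pair each label with its overlap flag."""
    return [(y, in_overlap) for y, in_overlap in zip(labels, mask)]

def restore_labels(augmented):
    """Map augmented labels back to the original label space."""
    return [y for y, _ in augmented]

pred_probs = [[0.95, 0.05], [0.6, 0.4], [0.45, 0.55], [0.2, 0.8]]
labels = [0, 0, 1, 1]

mask = overlap_mask(pred_probs, threshold=0.3)
augmented = augment_labels(labels, mask)   # a base sampler would resample on these
original = restore_labels(augmented)       # mapping back after resampling
```

A larger threshold lowers the cut-off `1 - threshold`, so fewer samples count as uncertain and the overlap region shrinks.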
- class endgame.preprocessing.AutoBalancer(strategy='auto', sampling_strategy='auto', imbalance_threshold=0.5, severe_imbalance_threshold=0.1, include_generative=False, random_state=None, n_jobs=-1, **kwargs)[source]¶
Bases: BaseEstimator
Automatic class balancing with strategy selection.
Automatically selects and applies the best resampling strategy based on the imbalance ratio and data characteristics.
- Parameters:
strategy (str, default='auto') – Balancing strategy:
- ‘auto’: Automatically select based on imbalance ratio
- ‘oversample’: Use SMOTE-based oversampling
- ‘undersample’: Use ENN-based undersampling
- ‘combine’: Use SMOTE + ENN
- ‘geometric’: Use MultivariateGaussianSMOTE (from geometric module)
- ‘generative’: Use ForestFlowResampler (from generative module)
- Any key from ALL_SAMPLERS (e.g., ‘smote’, ‘borderline_smote’, etc.)
sampling_strategy (float, str, dict, or callable, default='auto') – Target class distribution.
imbalance_threshold (float, default=0.5) – Ratio below which data is considered imbalanced.
severe_imbalance_threshold (float, default=0.1) – Ratio below which imbalance is considered severe.
random_state (int or None, default=None) – Random seed for reproducibility.
include_generative (bool, default=False) – If True, include generative samplers (from imbalance_generative) in the auto-selection pool.
n_jobs (int, default=-1) – Number of parallel jobs.
**kwargs (dict) – Additional parameters passed to the selected sampler.
- sampler_¶
The fitted sampler.
- Type:
BaseEstimator
Examples
>>> from endgame.preprocessing import AutoBalancer
>>> balancer = AutoBalancer(strategy='auto', random_state=42)
>>> X_balanced, y_balanced = balancer.fit_resample(X, y)
>>> print(f"Selected: {balancer.selected_strategy_}")
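How the two thresholds partition datasets can be sketched with the minority/majority ratio. This only classifies the degree of imbalance; which sampler 'auto' picks for each level is not documented here, so the sketch stops short of that. `imbalance_ratio` and `imbalance_level` are hypothetical helpers.

```python
from collections import Counter

def imbalance_ratio(y):
    """Minority count divided by majority count (1.0 when perfectly balanced)."""
    counts = Counter(y)
    return min(counts.values()) / max(counts.values())

def imbalance_level(y, imbalance_threshold=0.5, severe_imbalance_threshold=0.1):
    """Classify a target array as 'balanced', 'imbalanced', or 'severe'."""
    ratio = imbalance_ratio(y)
    if ratio < severe_imbalance_threshold:
        return "severe"
    if ratio < imbalance_threshold:
        return "imbalanced"
    return "balanced"

mild = [0, 0, 0, 0, 0, 1, 1]        # ratio 2 / 5
severe = [0] * 95 + [1] * 5         # ratio 5 / 95
level_mild = imbalance_level(mild)
level_severe = imbalance_level(severe)
```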
- fit(X, y)[source]¶
Fit the auto-balancer.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.
- Return type:
- Returns:
self (AutoBalancer) – Fitted balancer.
- fit_resample(X, y)[source]¶
Fit and resample the dataset.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like of shape (n_samples,)) – Target values.
- Return type:
- Returns:
X_resampled (ndarray of shape (n_samples_new, n_features)) – Resampled training data.
y_resampled (ndarray of shape (n_samples_new,)) – Resampled target values.
- endgame.preprocessing.get_imbalance_ratio(y)[source]¶
Compute the imbalance ratio of a target array.
- Parameters:
y (array-like of shape (n_samples,)) – Target values.
- Return type:
- Returns:
ratio (float) – Imbalance ratio (minority_count / majority_count). Returns 1.0 if all classes have the same count.
Examples
>>> from endgame.preprocessing import get_imbalance_ratio
>>> y = [0, 0, 0, 0, 0, 1, 1]
>>> get_imbalance_ratio(y)
0.4
- endgame.preprocessing.get_class_distribution(y)[source]¶
Get the class distribution of a target array.
- Parameters:
y (array-like of shape (n_samples,)) – Target values.
- Return type:
- Returns:
distribution (dict) – Dictionary mapping class labels to counts.
Examples
>>> from endgame.preprocessing import get_class_distribution
>>> y = [0, 0, 0, 1, 1, 2]
>>> get_class_distribution(y)
{0: 3, 1: 2, 2: 1}
- class endgame.preprocessing.DenoisingAutoEncoder(hidden_dims=None, noise_fraction=0.1, dropout=0.1, activation='relu', n_epochs=100, batch_size=256, learning_rate=0.001, weight_decay=1e-05, early_stopping=10, scheduler='cosine', device='auto', random_state=None, verbose=False)[source]¶
Bases: BaseEstimator, TransformerMixin
Denoising Autoencoder for tabular representation learning.
Corrupts input with swap noise (randomly swapping values between samples), trains to reconstruct the original input, and extracts bottleneck layer embeddings as new features.
This is a key technique from Tabular Playground Series 1st place solutions.
- Parameters:
hidden_dims (List[int], default=[256, 128, 64]) – Architecture of encoder (decoder mirrors). The last dimension is the bottleneck/embedding size.
noise_fraction (float, default=0.1) – Fraction of features to corrupt with swap noise.
dropout (float, default=0.1) – Dropout rate for regularization.
activation (str, default='relu') – Activation function: ‘relu’, ‘leaky_relu’, ‘elu’, ‘selu’, ‘gelu’, ‘swish’, ‘tanh’.
n_epochs (int, default=100) – Maximum training epochs.
batch_size (int, default=256) – Training batch size.
learning_rate (float, default=1e-3) – Initial learning rate.
weight_decay (float, default=1e-5) – L2 regularization strength.
early_stopping (int, default=10) – Patience for early stopping (based on reconstruction loss).
scheduler (str, default='cosine') – Learning rate scheduler: ‘cosine’, ‘step’, ‘none’.
device (str, default='auto') – Device: ‘cuda’, ‘cpu’, or ‘auto’ (auto-detect GPU).
random_state (int, optional) – Random seed for reproducibility.
verbose (bool, default=False) – Enable verbose output.
- model_¶
Fitted PyTorch DAE model.
- Type:
_DAEModule
- scaler_¶
Feature scaler.
- Type:
StandardScaler
Examples
>>> import numpy as np
>>> from endgame.preprocessing import DenoisingAutoEncoder
>>> # Create DAE with 64-dimensional embeddings
>>> dae = DenoisingAutoEncoder(hidden_dims=[256, 128, 64], n_epochs=50)
>>> # Fit on training data
>>> dae.fit(X_train)
>>> # Extract embeddings as new features
>>> X_train_embed = dae.transform(X_train)
>>> X_test_embed = dae.transform(X_test)
>>> # Combine with original features
>>> X_train_enriched = np.hstack([X_train, X_train_embed])
- fit(X, y=None)[source]¶
Fit the Denoising Autoencoder.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (ignored) – Not used, present for API consistency.
- Return type:
DenoisingAutoEncoder
- Returns:
self – Fitted transformer.
- transform(X)[source]¶
Extract bottleneck embeddings.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Return type:
ndarray
- Returns:
ndarray of shape (n_samples, embedding_dim) – Bottleneck embeddings.
- fit_transform(X, y=None)[source]¶
Fit and transform in one step.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (ignored) – Not used.
- Return type:
ndarray
- Returns:
ndarray of shape (n_samples, embedding_dim) – Bottleneck embeddings.
- reconstruct(X)[source]¶
Reconstruct the input by encoding it to the bottleneck and decoding it back.
Useful for detecting anomalies (high reconstruction error).
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to reconstruct.
- Return type:
ndarray
- Returns:
ndarray of shape (n_samples, n_features) – Reconstructed data.
- reconstruction_error(X)[source]¶
Compute per-sample reconstruction error.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to evaluate.
- Return type:
ndarray
- Returns:
ndarray of shape (n_samples,) – Mean squared reconstruction error per sample.
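The per-sample error can be sketched in NumPy as a row-wise mean of squared differences. This is an illustration only: `per_sample_mse` and the 0.9 quantile threshold are hypothetical, and whether the library measures the error in scaled or original feature space is not stated here.

```python
import numpy as np


def per_sample_mse(X, X_reconstructed):
    """Hypothetical helper: mean squared error per row."""
    X = np.asarray(X, dtype=float)
    X_reconstructed = np.asarray(X_reconstructed, dtype=float)
    # Average the squared differences over the feature axis.
    return np.mean((X - X_reconstructed) ** 2, axis=1)


X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_hat = np.array([[1.0, 2.0], [3.0, 5.0], [9.0, 6.0]])
errors = per_sample_mse(X_demo, X_hat)

# Flag rows whose error exceeds an illustrative quantile threshold.
threshold = np.quantile(errors, 0.9)
is_anomaly = errors > threshold
```

With a fitted model, `X_hat` would come from `reconstruct(X)`, and the threshold would normally be chosen on held-out data.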
- class endgame.preprocessing.CTGANResampler(sampling_strategy='auto', embedding_dim=128, generator_dim=(256, 256), discriminator_dim=(256, 256), n_epochs=300, batch_size=500, random_state=None, verbose=False)[source]¶
Bases:
BaseEstimator
Conditional Tabular GAN oversampler.
Thin wrapper around the ctgan.CTGAN package. Trains a conditional GAN on minority class data and generates synthetic samples to balance the classes.
- Parameters:
sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().
embedding_dim (int, default=128) – Embedding dimension for the generator.
generator_dim (tuple of int, default=(256, 256)) – Generator hidden layer sizes.
discriminator_dim (tuple of int, default=(256, 256)) – Discriminator hidden layer sizes.
n_epochs (int, default=300) – Training epochs.
batch_size (int, default=500) – Training batch size.
random_state (int or None, default=None) – Random seed.
verbose (bool, default=False) – Enable verbose output.
References
CTGAN (NeurIPS 2019)
- class endgame.preprocessing.ForestFlowResampler(sampling_strategy='auto', n_estimators=100, max_depth=6, n_steps=50, noise_type='gaussian', random_state=None, verbose=False)[source]¶
Bases:
BaseEstimator
XGBoost-based flow matching oversampler (ForestFlow).
Trains XGBoost to learn the velocity field v(x, t) = x_1 - x_0 of a conditional flow matching ODE, then integrates from noise to data via Euler steps. CPU-friendly: no PyTorch required.
- Parameters:
sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().
n_estimators (int, default=100) – Number of trees per XGBoost model.
max_depth (int, default=6) – Maximum tree depth.
n_steps (int, default=50) – Number of Euler integration steps.
noise_type (str, default='gaussian') – Noise distribution for the source: ‘gaussian’ or ‘uniform’.
random_state (int or None, default=None) – Random seed.
verbose (bool, default=False) – Enable verbose output.
References
Jolicoeur-Martineau et al., “Generating and Imputing Tabular Data via Diffusion and Flow XGBoost Models”, 2024.
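The Euler integration step can be sketched without XGBoost by substituting a known velocity field. In the sketch below, `euler_sample` and `toy_velocity` are hypothetical names. The toy field assumes every data point equals a fixed vector `mu`, so along the linear path x_t = (1 - t) x_0 + t x_1 the target x_1 - x_0 equals (mu - x_t) / (1 - t). In the resampler, this function would instead be the prediction of the trained tree models.

```python
import numpy as np


def euler_sample(velocity, x0, n_steps=50):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt  # t stays strictly below 1, so 1 - t never reaches 0
        x = x + dt * velocity(x, t)
    return x


mu = np.array([2.0, -1.0])  # toy "data distribution": a single point


def toy_velocity(x, t):
    # Stand-in for a learned model of v(x, t) = x_1 - x_0.
    return (mu - x) / (1.0 - t)


rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 2))  # Gaussian source samples
samples = euler_sample(toy_velocity, noise, n_steps=50)
```

Every noise sample is transported to `mu` in this toy case; a larger `n_steps` generally reduces discretization error when the field is learned rather than exact.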
- class endgame.preprocessing.TabDDPMResampler(sampling_strategy='auto', n_timesteps=1000, hidden_dims=None, n_epochs=100, batch_size=256, lr=0.001, device='auto', random_state=None, verbose=False)[source]¶
Bases:
BaseEstimator
Tab-DDPM oversampler: denoising diffusion for tabular data.
Uses Gaussian diffusion with an MLP denoiser that predicts noise given a noisy sample and timestep embedding.
- Parameters:
sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().
n_timesteps (int, default=1000) – Number of diffusion timesteps.
hidden_dims (list of int, default=[256, 256]) – MLP denoiser hidden layer sizes.
n_epochs (int, default=100) – Training epochs.
batch_size (int, default=256) – Training batch size.
lr (float, default=1e-3) – Learning rate.
device (str, default='auto') – Computation device.
random_state (int or None, default=None) – Random seed.
verbose (bool, default=False) – Enable verbose output.
References
TabDDPM (Kotelnikov et al., ICML 2023)
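The forward (noising) half of Gaussian diffusion, which produces the training pairs for a noise-predicting denoiser, can be sketched in NumPy. This is a generic DDPM illustration, not this class's code: `make_alpha_bar`, `q_sample`, and the linear beta schedule with its endpoints are assumptions.

```python
import numpy as np


def make_alpha_bar(n_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal retention for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, n_timesteps)
    return np.cumprod(1.0 - betas)


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) and return it with the noise that was added."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t][:, None]  # one retention factor per row
    x_t = np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
    return x_t, eps


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.standard_normal((8, 5))      # stand-in for scaled minority rows
t = rng.integers(0, 1000, size=8)     # one random timestep per row
x_t, eps = q_sample(x0, t, alpha_bar, rng)
# A denoiser would be trained so that denoiser(x_t, t) approximates eps,
# for example by minimizing the mean squared error between the two.
```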
- class endgame.preprocessing.TabSynResampler(sampling_strategy='auto', latent_dim=64, vae_hidden_dims=None, vae_epochs=100, diffusion_hidden_dims=None, diffusion_epochs=100, n_timesteps=1000, batch_size=256, lr=0.001, device='auto', random_state=None, verbose=False)[source]¶
Bases:
BaseEstimator
TabSyn oversampler: VAE + latent diffusion for tabular data.
Two-stage approach:
1. Train a VAE on minority data to learn a smooth latent space.
2. Train a diffusion model in the latent space.
Generation: reverse diffusion in latent space -> decode through VAE.
- Parameters:
sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().
latent_dim (int, default=64) – VAE latent dimension.
vae_hidden_dims (list of int, default=[256, 128]) – VAE encoder/decoder hidden sizes.
vae_epochs (int, default=100) – VAE training epochs.
diffusion_hidden_dims (list of int, default=[256, 256]) – Diffusion denoiser hidden sizes.
diffusion_epochs (int, default=100) – Diffusion training epochs.
n_timesteps (int, default=1000) – Number of diffusion timesteps.
batch_size (int, default=256) – Training batch size.
lr (float, default=1e-3) – Learning rate.
device (str, default='auto') – Computation device.
random_state (int or None, default=None) – Random seed.
verbose (bool, default=False) – Enable verbose output.
References
TabSyn (Zhang et al., ICLR 2024)
- class endgame.preprocessing.GReaTResampler(sampling_strategy='auto', model_name='distilgpt2', n_epochs=5, batch_size=8, max_length=256, temperature=0.7, feature_names=None, label_name='Class', device='auto', random_state=None, verbose=False)[source]¶
Bases:
BaseEstimator
GReaT oversampler: LLM-based tabular data generation.
Serializes tabular rows as natural language strings, fine-tunes a small causal language model (e.g. distilgpt2), and generates new minority samples by prompting with the minority class label prefix.
- Parameters:
sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().
model_name (str, default='distilgpt2') – HuggingFace model name for the causal LM backbone.
n_epochs (int, default=5) – Fine-tuning epochs.
batch_size (int, default=8) – Training batch size.
max_length (int, default=256) – Maximum token length for serialized rows.
temperature (float, default=0.7) – Sampling temperature for generation.
feature_names (list of str or None, default=None) – Feature names. If None, uses f0, f1, ....
label_name (str, default='Class') – Name for the target column in serialization.
device (str, default='auto') – Computation device.
random_state (int or None, default=None) – Random seed.
verbose (bool, default=False) – Enable verbose output.
References
GReaT (Borisov et al., 2023), ImbLLM (2025)
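Row serialization for GReaT-style generation might look like the sketch below. The exact template used by this class is not documented here: `serialize_row` is a hypothetical helper, and the "name is value" phrasing with the label placed first is an assumption, chosen so that a class-label prefix can serve as the generation prompt.

```python
def serialize_row(values, label, feature_names=None, label_name="Class"):
    """Hypothetical serializer: turn one row into a text string."""
    if feature_names is None:
        # Default names mirror the documented fallback f0, f1, ...
        feature_names = [f"f{i}" for i in range(len(values))]
    parts = [f"{label_name} is {label}"]
    parts += [f"{name} is {value}" for name, value in zip(feature_names, values)]
    return ", ".join(parts)


text = serialize_row([5.1, 3.5], label=1)
# A prompt for generating new rows of class 1 under this assumed template.
prompt = "Class is 1,"
```

A fine-tuned causal language model would then continue `prompt` with feature clauses, which are parsed back into a numeric row.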