Preprocessing

class endgame.preprocessing.SafeTargetEncoder(cols=None, smoothing=10.0, cv=5, min_samples_leaf=1, noise_level=0.0, handle_unknown='global_mean', output_format='auto', random_state=None, verbose=False)[source]

Bases: PolarsTransformer

Target encoding with M-estimate smoothing and inner-fold encoding.

Uses inner-fold (nested) cross-validation during fit to prevent target leakage, and applies smoothing to stabilize the encodings of rare categories.

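Implements: S_i = (n_i × μ_i + m × μ_global) / (n_i + m), where n_i is the number of samples in category i, μ_i is the target mean within that category, μ_global is the overall target mean, and m is the smoothing parameter.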
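
As a rough illustration of the formula above, the following dependency-free sketch computes the smoothed value for each category. The function name m_estimate and the toy data are assumptions made for this example; they are not part of endgame's API.

```python
from collections import defaultdict


def m_estimate(categories, targets, m=10.0):
    """Map each category to S_i = (n_i * mu_i + m * mu_global) / (n_i + m)."""
    mu_global = sum(targets) / len(targets)
    sums = defaultdict(float)   # n_i * mu_i, i.e. the per-category target sum
    counts = defaultdict(int)   # n_i
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    return {cat: (sums[cat] + m * mu_global) / (counts[cat] + m) for cat in counts}


# Toy data (hypothetical): "a" is frequent, "b" appears once.
encoding = m_estimate(["a", "a", "a", "a", "b"], [1, 1, 1, 0, 0], m=2.0)
```

With m=2, the single-sample category "b" moves from its raw mean of 0.0 to 0.4, most of the way toward the global mean of 0.6, while the better-supported category "a" only moves from 0.75 to 0.7.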

Parameters:
  • cols (List[str], optional) – Columns to encode. If None, encodes all categorical columns.

  • smoothing (float, default=10) – Smoothing parameter (m) for rare categories. Higher values = more regularization toward global mean.

  • cv (int, default=5) – Number of folds for inner-fold encoding during fit.

  • min_samples_leaf (int, default=1) – Minimum samples required to compute category statistic.

  • noise_level (float, default=0.0) – Gaussian noise std to add for regularization.

  • handle_unknown (str, default='global_mean') – Strategy for unseen categories: ‘global_mean’, ‘nan’, ‘error’.

  • output_format (str, default='auto') – Output format: ‘auto’, ‘polars’, ‘pandas’, ‘numpy’.

  • random_state (int, optional) – Random seed for cross-validation and noise.

  • verbose (bool)

Examples

>>> from endgame.preprocessing import SafeTargetEncoder
>>> encoder = SafeTargetEncoder(smoothing=10, cv=5)
>>> X_encoded = encoder.fit_transform(X, y)

fit(X, y, **fit_params)[source]

Fit the target encoder.

Uses inner-fold encoding to prevent leakage during training.

Parameters:
  • X (array-like) – Training data.

  • y (array-like) – Target values.

Return type:

SafeTargetEncoder

Returns:

self

transform(X)[source]

Transform data using learned encodings.

Parameters:

X (array-like of shape (n_samples, n_features)) – Data to transform.

Return type:

Any

Returns:

X_transformed (array-like) – Transformed data with encoded columns.

fit_transform(X, y, **fit_params)[source]

Fit and transform with inner-fold encoding to prevent leakage.

During fit_transform, uses cross-validation to compute encodings without leakage. Each sample is encoded using statistics computed only from other samples.

Parameters:
  • X (array-like) – Training data.

  • y (array-like) – Target values.

Return type:

Any

Returns:

X_transformed (array-like) – Transformed training data.
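
The inner-fold idea described for fit_transform can be sketched in plain Python as follows. This is a simplified illustration, not endgame's implementation: the function name out_of_fold_encode is hypothetical, folds are assigned round-robin rather than shuffled, and the smoothing value must be positive.

```python
def out_of_fold_encode(categories, targets, n_folds=5, m=10.0):
    """Encode each row from statistics of the other folds only (requires m > 0)."""
    n = len(categories)
    encoded = [0.0] * n
    for fold in range(n_folds):
        # Assumption: folds are assigned round-robin by row index, without shuffling.
        train = [i for i in range(n) if i % n_folds != fold]
        held_out = [i for i in range(n) if i % n_folds == fold]
        mu_global = sum(targets[i] for i in train) / len(train)
        sums, counts = {}, {}
        for i in train:
            cat = categories[i]
            sums[cat] = sums.get(cat, 0.0) + targets[i]
            counts[cat] = counts.get(cat, 0) + 1
        for i in held_out:
            cat = categories[i]
            # A category absent from the training folds has n_i = 0,
            # so the expression reduces to the training-fold global mean.
            encoded[i] = (sums.get(cat, 0.0) + m * mu_global) / (counts.get(cat, 0) + m)
    return encoded


cats = ["a", "a", "a", "a", "b", "b"]
ys = [1, 0, 1, 1, 0, 0]
encoded = out_of_fold_encode(cats, ys, n_folds=2, m=1.0)
```

Rows 0 and 1 share the category "a" but receive different encodings, because each is computed from a different set of training folds that never includes the row's own target.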

get_feature_names_out(input_features=None)[source]

Get output feature names (same as input for target encoding).

Return type:

list[Text]

Parameters:

input_features (list[str] | None)

class endgame.preprocessing.LeaveOneOutEncoder(cols=None, smoothing=1.0, handle_unknown='global_mean', output_format='auto', random_state=None, verbose=False)[source]

Bases: PolarsTransformer

Leave-One-Out target encoding for online settings.

Each sample’s encoding excludes its own target value, preventing direct leakage while still using all available data.

Parameters:
  • cols (List[str], optional) – Columns to encode. If None, encodes all categorical columns.

  • smoothing (float, default=1.0) – Smoothing parameter for regularization.

  • handle_unknown (str, default='global_mean') – Strategy for unseen categories.

  • random_state (int, optional) – Random seed for reproducibility.

  • output_format (str)

  • verbose (bool)

fit(X, y, **fit_params)[source]

Fit the LOO encoder.

Return type:

LeaveOneOutEncoder

transform(X)[source]

Transform using stored statistics (no LOO at test time).

Return type:

Any

fit_transform(X, y, **fit_params)[source]

Fit and transform with LOO to prevent leakage.

Return type:

Any
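
A minimal sketch of the leave-one-out idea, assuming no smoothing and a global-mean fallback for categories that occur only once. The function name leave_one_out_encode is hypothetical, and the smoothing that LeaveOneOutEncoder applies is omitted because its exact form is not documented here.

```python
def leave_one_out_encode(categories, targets):
    """Encode each row with its category's target mean, excluding the row itself."""
    mu_global = sum(targets) / len(targets)
    sums, counts = {}, {}
    for cat, target in zip(categories, targets):
        sums[cat] = sums.get(cat, 0.0) + target
        counts[cat] = counts.get(cat, 0) + 1
    encoded = []
    for cat, target in zip(categories, targets):
        n_others = counts[cat] - 1
        if n_others == 0:
            # Singleton category: nothing is left after removing the row.
            encoded.append(mu_global)
        else:
            encoded.append((sums[cat] - target) / n_others)
    return encoded


loo = leave_one_out_encode(["a", "a", "a", "b"], [1, 0, 1, 1])
```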

class endgame.preprocessing.CatBoostEncoder(cols=None, smoothing=1.0, output_format='auto', random_state=None, verbose=False)[source]

Bases: PolarsTransformer

CatBoost-style ordered target encoding.

Encodes based only on preceding samples, mimicking CatBoost’s internal target statistic computation. Prevents leakage by using only “past” information for each sample.

Parameters:
  • cols (List[str], optional) – Columns to encode.

  • smoothing (float, default=1.0) – Smoothing parameter.

  • random_state (int, optional) – Random seed for sample ordering.

  • output_format (str)

  • verbose (bool)

fit(X, y, **fit_params)[source]

Fit encoder (stores final statistics for transform).

Return type:

CatBoostEncoder

transform(X)[source]

Transform using final statistics.

Return type:

Any

fit_transform(X, y, **fit_params)[source]

Fit and transform with ordered encoding.

Return type:

Any
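
The "preceding samples only" rule can be illustrated with a running sum and count per category. This sketch assumes the common (sum + a × prior) / (count + a) form and processes rows in the order given instead of a random permutation; the formula CatBoostEncoder actually uses is not stated here, and ordered_target_encode is a hypothetical name.

```python
def ordered_target_encode(categories, targets, prior, smoothing=1.0):
    """Encode each row using only the targets of earlier rows in the same category."""
    sums, counts = {}, {}
    encoded = []
    for cat, target in zip(categories, targets):
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        encoded.append((seen_sum + smoothing * prior) / (seen_count + smoothing))
        # Update the running statistics only after the row has been encoded.
        sums[cat] = seen_sum + target
        counts[cat] = seen_count + 1
    return encoded


ordered = ordered_target_encode(["a", "a", "b", "a"], [1, 0, 1, 1], prior=0.5)
```

The first occurrence of every category receives the prior, since no earlier rows exist for it.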

class endgame.preprocessing.FrequencyEncoder(cols=None, normalize=True, handle_unknown='zero', output_format='auto', random_state=None, verbose=False)[source]

Bases: PolarsTransformer

Frequency encoding for categorical features.

Replaces categories with their frequency (count or proportion). Simple but effective encoding that doesn’t require target values.

Parameters:
  • cols (List[str], optional) – Columns to encode. If None, encodes all categorical columns.

  • normalize (bool, default=True) – If True, use proportions. If False, use raw counts.

  • handle_unknown (str, default='zero') – Strategy for unseen categories: ‘zero’, ‘nan’, ‘error’.

  • output_format (str)

  • random_state (int | None)

  • verbose (bool)

fit(X, y=None, **fit_params)[source]

Compute frequencies from training data.

Return type:

FrequencyEncoder

transform(X)[source]

Apply frequency encoding.

Return type:

Any
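
Frequency encoding reduces to counting categories at fit time and looking them up at transform time. The sketch below uses hypothetical helper names and mirrors the normalize option and the 'zero' strategy for unseen categories; it is not the library's code.

```python
from collections import Counter


def fit_frequencies(values, normalize=True):
    """Return {category: proportion} or {category: count} from training values."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n / total if normalize else n) for cat, n in counts.items()}


def apply_frequencies(values, frequencies, unknown=0.0):
    """Replace each value with its training frequency; unseen values get `unknown`."""
    return [frequencies.get(v, unknown) for v in values]


freq = fit_frequencies(["x", "x", "y", "z"])
encoded = apply_frequencies(["x", "w"], freq)
```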

class endgame.preprocessing.AutoAggregator(group_cols, agg_cols=None, methods=('mean', 'std', 'min', 'max'), rank_features=True, diff_features=False, ratio_features=False, prefix=None, output_format='auto', random_state=None, verbose=False)[source]

Bases: PolarsTransformer

Generates “Magic Feature” aggregations used in winning solutions.

Creates group-level statistics that capture relationships between entities. Key technique from Optiver 1st place and many tabular wins.

Parameters:
  • group_cols (List[str]) – Columns to group by (e.g., [‘customer_id’, ‘store_id’]).

  • agg_cols (List[str], optional) – Columns to aggregate (e.g., [‘amount’, ‘quantity’]). If None, aggregates all numeric columns.

  • methods (List[str], default=['mean', 'std', 'min', 'max']) – Aggregation methods: ‘mean’, ‘std’, ‘min’, ‘max’, ‘sum’, ‘count’, ‘median’, ‘skew’, ‘kurtosis’, ‘first’, ‘last’, ‘nunique’.

  • rank_features (bool, default=True) – Whether to compute rank features within groups. Key technique from Optiver 1st place solution.

  • diff_features (bool, default=False) – Whether to compute difference from group mean.

  • ratio_features (bool, default=False) – Whether to compute ratio to group mean.

  • prefix (str, optional) – Prefix for generated feature names.

  • output_format (str)

  • random_state (int | None)

  • verbose (bool)

Examples

>>> from endgame.preprocessing import AutoAggregator
>>> agg = AutoAggregator(
...     group_cols=['customer_id'],
...     agg_cols=['amount'],
...     methods=['mean', 'std', 'skew'],
...     rank_features=True
... )
>>> X_agg = agg.fit_transform(X)

fit(X, y=None, **fit_params)[source]

Compute aggregation statistics from training data.

Parameters:
  • X (array-like) – Training data.

  • y (array-like, optional) – Ignored.

Return type:

AutoAggregator

Returns:

self

transform(X)[source]

Apply aggregation features to data.

Parameters:

X (array-like) – Data to transform.

Return type:

Any

Returns:

X_transformed (array-like) – Original data with aggregation features added.

get_feature_names_out(input_features=None)[source]

Get output feature names including generated aggregations.

Return type:

list[Text]

Parameters:

input_features (list[str] | None)
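
To make the group mean, difference, and ratio features concrete, here is a small plain-Python sketch. The function name and the output keys (group_mean, diff, ratio) are hypothetical; the feature names AutoAggregator generates are not documented above.

```python
from collections import defaultdict
from statistics import mean


def group_mean_features(groups, values):
    """For each row, return its group's mean plus the row's difference and ratio to it."""
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(v)
    means = {g: mean(vs) for g, vs in by_group.items()}
    rows = []
    for g, v in zip(groups, values):
        gm = means[g]
        rows.append({
            "group_mean": gm,
            "diff": v - gm,
            # Guard against division by zero for groups whose mean is 0.
            "ratio": v / gm if gm != 0 else float("nan"),
        })
    return rows


features = group_mean_features(["c1", "c1", "c2"], [10.0, 30.0, 5.0])
```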

class endgame.preprocessing.InteractionFeatures(interaction_pairs=None, operations=('multiply', 'divide'), max_interactions=100, include_cols=None, exclude_cols=None, output_format='auto', random_state=None, verbose=False)[source]

Bases: PolarsTransformer

Generates interaction features between specified columns.

Creates arithmetic combinations (multiply, divide, add, subtract) between pairs of numeric features.

Parameters:
  • interaction_pairs (List[Tuple[str, str]], optional) – Specific pairs to create. If None, creates all pairs.

  • operations (List[str], default=['multiply', 'divide']) – Operations: ‘multiply’, ‘divide’, ‘add’, ‘subtract’.

  • max_interactions (int, default=100) – Maximum number of interactions to create.

  • include_cols (List[str], optional) – Only consider these columns for interactions.

  • exclude_cols (List[str], optional) – Exclude these columns from interactions.

  • output_format (str)

  • random_state (int | None)

  • verbose (bool)

Examples

>>> from endgame.preprocessing import InteractionFeatures
>>> inter = InteractionFeatures(
...     operations=['multiply', 'divide'],
...     max_interactions=50
... )
>>> X_inter = inter.fit_transform(X)

fit(X, y=None, **fit_params)[source]

Determine interaction pairs from training data.

Return type:

InteractionFeatures

transform(X)[source]

Create interaction features.

Return type:

Any

get_feature_names_out(input_features=None)[source]

Get output feature names.

Return type:

list[Text]

Parameters:

input_features (list[str] | None)

class endgame.preprocessing.RankFeatures(cols=None, method='average', pct=True, suffix='_rank', output_format='auto', random_state=None, verbose=False)[source]

Bases: PolarsTransformer

Compute rank-based features.

Converts numeric values to ranks, which can be more robust to outliers and non-linear relationships.

Parameters:
  • cols (List[str], optional) – Columns to rank. If None, ranks all numeric columns.

  • method (str, default='average') – Ranking method: ‘average’, ‘min’, ‘max’, ‘dense’, ‘ordinal’.

  • pct (bool, default=True) – Whether to return percentile ranks (0-1).

  • suffix (str, default='_rank') – Suffix for ranked column names.

  • output_format (str)

  • random_state (int | None)

  • verbose (bool)

Examples

>>> from endgame.preprocessing import RankFeatures
>>> ranker = RankFeatures(pct=True)
>>> X_ranked = ranker.fit_transform(X)

fit(X, y=None, **fit_params)[source]

Identify columns to rank.

Return type:

RankFeatures

transform(X)[source]

Compute rank features.

Return type:

Any

get_feature_names_out(input_features=None)[source]

Get output feature names.

Return type:

list[Text]

Parameters:

input_features (list[str] | None)
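
The 'average' method with pct=True can be pictured as below. This sketch divides the 1-based average rank by the number of rows, which is one common convention; whether RankFeatures scales ranks the same way is an assumption, and percentile_rank is a hypothetical name.

```python
def percentile_rank(values):
    """Average-method ranks scaled to (0, 1]; tied values share the mean of their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # 1-based mean rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank / n
        i = j + 1
    return ranks


pct = percentile_rank([10, 20, 20, 40])
```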

class endgame.preprocessing.TemporalFeatures(datetime_cols=None, features=None, cyclical=True, drop_original=False, output_format='auto', random_state=None, verbose=False)[source]

Bases: PolarsTransformer

Extracts temporal features from datetime columns.

Generates comprehensive datetime features including cyclical encodings for periodic patterns.

Features generated:
  • Basic: year, month, day, dayofweek, hour, minute, second

  • Boolean: is_weekend, is_month_start, is_month_end, is_year_start, is_year_end

  • Derived: quarter, week_of_year, day_of_year

  • Cyclical: sin/cos encodings for month, day, hour, dayofweek

Parameters:
  • datetime_cols (List[str], optional) – Datetime columns to extract features from. If None, auto-detects datetime columns.

  • features (List[str], optional) – Features to extract. If None, extracts all. Options: ‘year’, ‘month’, ‘day’, ‘dayofweek’, ‘hour’, ‘minute’, ‘second’, ‘is_weekend’, ‘quarter’, ‘week_of_year’, ‘day_of_year’, ‘is_month_start’, ‘is_month_end’, ‘cyclical’.

  • cyclical (bool, default=True) – Whether to add cyclical (sin/cos) encodings.

  • drop_original (bool, default=False) – Whether to drop the original datetime columns.

  • output_format (str)

  • random_state (int | None)

  • verbose (bool)

Examples

>>> from endgame.preprocessing import TemporalFeatures
>>> tf = TemporalFeatures(cyclical=True)
>>> X_temporal = tf.fit_transform(X)

fit(X, y=None, **fit_params)[source]

Identify datetime columns.

Return type:

TemporalFeatures

transform(X)[source]

Extract temporal features.

Return type:

Any
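
A cyclical encoding maps a periodic value onto the unit circle so that the end of a cycle sits next to its start. The sketch below shows the standard sin/cos construction for the hour of day; the helper name cyclical is hypothetical and the column names TemporalFeatures produces are not shown.

```python
import math
from datetime import datetime


def cyclical(value, period):
    """Return the (sin, cos) pair that places `value` on a circle of length `period`."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


ts = datetime(2024, 1, 1, 18, 0)
hour_sin, hour_cos = cyclical(ts.hour, 24)

# Hour 23 ends up close to hour 0, while hour 12 is on the opposite side.
late_to_midnight = math.dist(cyclical(23, 24), cyclical(0, 24))
noon_to_midnight = math.dist(cyclical(12, 24), cyclical(0, 24))
```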

class endgame.preprocessing.LagFeatures(cols=None, lags=(1, 2, 3), group_cols=None, fill_value=None, output_format='auto', random_state=None, verbose=False)[source]

Bases: PolarsTransformer

Generate lag features for time series data.

Creates shifted versions of features to capture temporal dependencies.

Parameters:
  • cols (List[str], optional) – Columns to create lags for. If None, uses all numeric columns.

  • lags (List[int], default=[1, 2, 3]) – Lag periods to create.

  • group_cols (List[str], optional) – Columns to group by when computing lags.

  • fill_value (float, optional) – Value to fill NaN from lagging. If None, keeps NaN.

  • output_format (str)

  • random_state (int | None)

  • verbose (bool)

Examples

>>> from endgame.preprocessing import LagFeatures
>>> lf = LagFeatures(cols=['price'], lags=[1, 7, 30])
>>> X_lagged = lf.fit_transform(X)

fit(X, y=None, **fit_params)[source]

Identify columns to lag.

Return type:

LagFeatures

transform(X)[source]

Create lag features.

Return type:

Any
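
Grouped lagging can be sketched by keeping a per-group history, assuming the rows are already sorted in time order. The function name lag_within_groups is hypothetical; None stands in for the NaN that remains when fill_value is not set.

```python
def lag_within_groups(groups, values, lag=1, fill_value=None):
    """Shift `values` by `lag` rows within each group; early rows get `fill_value`."""
    history = {}
    lagged = []
    for g, v in zip(groups, values):
        past = history.setdefault(g, [])
        lagged.append(past[-lag] if len(past) >= lag else fill_value)
        past.append(v)
    return lagged


groups = ["a", "b", "a", "b", "a"]
prices = [1, 10, 2, 20, 3]
lag1 = lag_within_groups(groups, prices, lag=1)
lag2 = lag_within_groups(groups, prices, lag=2)
```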

class endgame.preprocessing.RollingFeatures(cols=None, windows=(3, 7, 14), methods=('mean', 'std'), group_cols=None, min_periods=1, output_format='auto', random_state=None, verbose=False)[source]

Bases: PolarsTransformer

Generate rolling window statistics.

Creates rolling aggregations for time series data.

Parameters:
  • cols (List[str], optional) – Columns to compute rolling stats for.

  • windows (List[int], default=[3, 7, 14]) – Window sizes.

  • methods (List[str], default=['mean', 'std']) – Aggregation methods: ‘mean’, ‘std’, ‘min’, ‘max’, ‘sum’.

  • group_cols (List[str], optional) – Columns to group by.

  • min_periods (int, default=1) – Minimum observations in window required.

  • output_format (str)

  • random_state (int | None)

  • verbose (bool)

Examples

>>> from endgame.preprocessing import RollingFeatures
>>> rf = RollingFeatures(cols=['price'], windows=[7, 30])
>>> X_rolling = rf.fit_transform(X)

fit(X, y=None, **fit_params)[source]

Identify columns for rolling statistics.

Return type:

RollingFeatures

transform(X)[source]

Compute rolling features.

Return type:

Any
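
The interaction between the window size and min_periods is easiest to see in a small example. This sketch assumes a trailing window that includes the current row; whether RollingFeatures includes or excludes the current row is not stated above, and rolling_mean is a hypothetical helper.

```python
def rolling_mean(values, window, min_periods=1):
    """Trailing-window mean; positions with fewer than `min_periods` rows yield None."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk) if len(chunk) >= min_periods else None)
    return out


partial = rolling_mean([1, 2, 3, 4], window=3)
strict = rolling_mean([1, 2, 3, 4], window=3, min_periods=3)
```

With min_periods=1 the first rows are averaged over whatever history exists; raising it to the window size leaves them empty instead.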

class endgame.preprocessing.AdversarialFeatureSelector(threshold=0.05, max_features_to_remove=10, estimator=None, output_format='auto', random_state=None, verbose=False)[source]

Bases: PolarsTransformer

Removes features that contribute to train/test drift.

Uses adversarial validation to identify and remove features that differ significantly between train and test distributions.

Parameters:
  • threshold (float, default=0.05) – Remove features with importance above this threshold.

  • max_features_to_remove (int, default=10) – Maximum number of features to remove.

  • estimator (BaseEstimator, optional) – Classifier for adversarial validation.

  • output_format (str)

  • random_state (int | None)

  • verbose (bool)

Examples

>>> from endgame.preprocessing import AdversarialFeatureSelector
>>> selector = AdversarialFeatureSelector(threshold=0.05)
>>> selector.fit(X_train, X_test=X_test)
>>> X_train_clean = selector.transform(X_train)

fit(X, y=None, X_test=None, **fit_params)[source]

Identify features to remove based on adversarial validation.

Parameters:
  • X (array-like) – Training features.

  • y (ignored)

  • X_test (array-like) – Test features for adversarial validation.

Return type:

AdversarialFeatureSelector

Returns:

self

transform(X)[source]

Remove drifted features.

Return type:

Any

property features_to_drop_: list[str]

Features identified for removal.

property feature_importances_: dict[str, float]

Adversarial validation feature importances.
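
Adversarial validation asks how well a classifier can tell training rows from test rows. As a simplified stand-in for the multivariate classifier, the sketch below scores each feature on its own by the AUC obtained when that feature alone is used to separate the two sets. The helper names and the score scale (0 for no drift, 1 for complete separation) are assumptions of this example and do not correspond to the importances the selector reports.

```python
def auc(negatives, positives):
    """Probability that a random positive exceeds a random negative (ties count half)."""
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def drift_scores(train, test):
    """Per-column drift in [0, 1]; `train` and `test` map column names to value lists."""
    return {col: abs(auc(train[col], test[col]) - 0.5) * 2 for col in train}


train = {"stable": [1, 2, 3, 4], "drifting": [1, 2, 3, 4]}
test = {"stable": [1, 2, 3, 4], "drifting": [11, 12, 13, 14]}
scores = drift_scores(train, test)
to_drop = [col for col, score in scores.items() if score > 0.05]
```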

set_fit_request(*, X_test='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in fit.

  • self (AdversarialFeatureSelector)

Returns:

self (object) – The updated object.

Return type:

AdversarialFeatureSelector

class endgame.preprocessing.PermutationImportanceSelector(estimator=None, threshold=0.0, n_repeats=10, scoring=None, output_format='auto', random_state=None, verbose=False)[source]

Bases: PolarsTransformer

Selects features based on permutation importance.

More robust than model-specific importance measures because it measures how much the score drops when a feature's values are shuffled, rather than relying on the model's internal statistics.

Parameters:
  • estimator (BaseEstimator) – Fitted estimator to evaluate.

  • threshold (float, default=0.0) – Minimum importance to keep a feature.

  • n_repeats (int, default=10) – Number of permutation repetitions.

  • scoring (str, optional) – Scoring metric for importance calculation.

  • output_format (str)

  • random_state (int | None)

  • verbose (bool)

Examples

>>> from endgame.preprocessing import PermutationImportanceSelector
>>> selector = PermutationImportanceSelector(estimator=model)
>>> selector.fit(X_val, y_val)
>>> X_selected = selector.transform(X_train)

fit(X, y, **fit_params)[source]

Compute permutation importances and select features.

Parameters:
  • X (array-like) – Validation features.

  • y (array-like) – Validation targets.

Return type:

PermutationImportanceSelector

Returns:

self

transform(X)[source]

Keep only selected features.

Return type:

Any

property selected_features_: list[str]

Features selected based on importance.

property importances_: dict[str, float]

Permutation importance for each feature.
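
The quantity being measured can be sketched without any modelling library: shuffle one column, re-score, and record the drop. The function below is an illustration under stated assumptions (accuracy as the score, a plain callable as the fitted model, rows as Python lists); it is not the selector's implementation.

```python
import random


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each column of X is shuffled, one column at a time."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(row) == target for row, target in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column permuted.
            permuted = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(permuted))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical "fitted model": predicts the label from the first column only.
X_val = [[0, 5], [1, 3], [0, 9], [1, 1], [0, 2], [1, 7]]
y_val = [0, 1, 0, 1, 0, 1]
importances = permutation_importance(lambda row: row[0], X_val, y_val)
```

The second column is never read by the model, so shuffling it cannot change the score and its importance is exactly zero; with threshold=0.0 as a strict minimum, such a feature would be a candidate for removal.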

class endgame.preprocessing.NullImportanceSelector(estimator=None, n_iterations=100, significance_threshold=0.95, output_format='auto', random_state=None, verbose=False)[source]

Bases: PolarsTransformer

Selects features based on null importance distribution.

Features must significantly outperform a shuffled-target baseline. Robust method for identifying truly predictive features.

Parameters:
  • estimator (BaseEstimator, optional) – Model to use. If None, uses LightGBM.

  • n_iterations (int, default=100) – Number of null importance iterations.

  • significance_threshold (float, default=0.95) – Percentile threshold for significance.

  • output_format (str)

  • random_state (int | None)

  • verbose (bool)

Examples

>>> from endgame.preprocessing import NullImportanceSelector
>>> selector = NullImportanceSelector(n_iterations=100)
>>> selector.fit(X, y)
>>> X_selected = selector.transform(X)

fit(X, y, **fit_params)[source]

Compute actual and null importances.

Parameters:
  • X (array-like) – Training features.

  • y (array-like) – Target values.

Return type:

NullImportanceSelector

Returns:

self

transform(X)[source]

Keep only significant features.

Return type:

Any

property selected_features_: list[str]

Features that passed significance test.

property actual_importance_: dict[str, float]

Actual feature importances.

property null_threshold_: dict[str, float]

Null importance thresholds.

class endgame.preprocessing.BayesianDiscretizer(strategy='mdlp', max_bins=10, min_samples_bin=5, discrete_features='auto', max_unique_continuous=20, random_state=None, verbose=False)[source]

Bases: EndgameEstimator, TransformerMixin

Discretizes continuous features for Bayesian Network Classifier consumption.

Supports multiple discretization strategies with automatic handling of already-discrete features.

Parameters:
  • strategy ({'mdlp', 'equal_width', 'equal_freq', 'kmeans'}, default='mdlp') – Discretization strategy:
      - ‘mdlp’: Minimum Description Length Principle (supervised, requires y)
      - ‘equal_width’: Fixed-width bins
      - ‘equal_freq’: Equal-frequency bins (quantiles)
      - ‘kmeans’: Cluster-based discretization

  • max_bins (int, default=10) – Maximum number of bins per feature.

  • min_samples_bin (int, default=5) – Minimum samples per bin (affects MDLP stopping criterion).

  • discrete_features (array-like of int | 'auto' | None, default='auto') – Which features are already discrete:
      - ‘auto’: Detect based on dtype and unique values
      - list of int: Indices of discrete features
      - None: Treat all features as continuous

  • max_unique_continuous (int, default=20) – When discrete_features='auto', features with at most this many unique values are treated as discrete.

  • random_state (int, optional) – Random seed for kmeans initialization.

  • verbose (bool, default=False) – Enable verbose output.

n_features_in_

Number of features seen during fit.

Type:

int

n_bins_

Number of bins for each feature.

Type:

np.ndarray

bin_edges_

Bin edges for each continuous feature.

Type:

list[np.ndarray]

discrete_features_

Boolean mask of discrete features.

Type:

np.ndarray

feature_names_in_

Feature names (if input was DataFrame).

Type:

np.ndarray

Examples

>>> from endgame.preprocessing import BayesianDiscretizer
>>> disc = BayesianDiscretizer(strategy='mdlp')
>>> X_disc = disc.fit_transform(X_train, y_train)
>>> X_test_disc = disc.transform(X_test)

fit(X, y=None, **fit_params)[source]

Fit the discretizer.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data.

  • y (array-like of shape (n_samples,), optional) – Target values. Required for ‘mdlp’ strategy.

Return type:

BayesianDiscretizer

Returns:

self

transform(X)[source]

Transform continuous features to discrete.

Parameters:

X (array-like of shape (n_samples, n_features)) – Data to transform.

Return type:

ndarray

Returns:

np.ndarray – Discretized data with integer values.

fit_transform(X, y=None, **fit_params)[source]

Fit and transform in one step.

Return type:

ndarray

get_feature_names_out(input_features=None)[source]

Get output feature names.

Return type:

list[Text]

Parameters:

input_features (list[str] | None)

inverse_transform(X_disc)[source]

Approximate inverse transform (returns bin centers).

Note: This is lossy - the original continuous values cannot be recovered exactly.

Parameters:

X_disc (np.ndarray) – Discretized data.

Return type:

ndarray

Returns:

np.ndarray – Approximate continuous values (bin centers).
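
The lossy round trip is easiest to see with the 'equal_width' strategy. The sketch below uses hypothetical helper names and assumes the feature has a non-zero range; it illustrates binning and the bin-center inverse, not the library's internals.

```python
def equal_width_edges(values, n_bins):
    """Return n_bins + 1 evenly spaced edges spanning the range of `values`."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + k * width for k in range(n_bins + 1)]


def discretize(values, edges):
    """Bin index = number of interior edges at or below the value."""
    interior = edges[1:-1]
    return [sum(v >= edge for edge in interior) for v in values]


def bin_centers(bins, edges):
    """Approximate inverse: map each bin index to the midpoint of its bin."""
    return [(edges[b] + edges[b + 1]) / 2 for b in bins]


values = [0.0, 1.0, 4.9, 5.0, 10.0]
edges = equal_width_edges(values, n_bins=2)
bins = discretize(values, edges)
recovered = bin_centers(bins, edges)
```

Three distinct inputs (0.0, 1.0, 4.9) all come back as 2.5, which is why the inverse is only approximate.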

set_inverse_transform_request(*, X_disc='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the inverse_transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • X_disc (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_disc parameter in inverse_transform.

  • self (BayesianDiscretizer)

Returns:

self (object) – The updated object.

Return type:

BayesianDiscretizer

class endgame.preprocessing.SimpleImputer(strategy='median', fill_value=None, add_indicator=False, copy=True, verbose=False)[source]

Bases: EndgameEstimator, TransformerMixin

Simple imputation with mean, median, mode, or constant fill.

Thin wrapper around sklearn.impute.SimpleImputer with better defaults for competition settings (median instead of mean, which is more robust to outliers).

Parameters:
  • strategy (str, default='median') – Imputation strategy:
      - ‘mean’: Replace with column mean
      - ‘median’: Replace with column median (default, outlier-robust)
      - ‘most_frequent’: Replace with mode
      - ‘constant’: Replace with fill_value

  • fill_value (float or str, optional) – Value to use when strategy='constant'. If left as None, 0 is used.

  • add_indicator (bool, default=False) – If True, append binary missing-indicator columns.

  • copy (bool, default=True) – If True, create a copy of X before imputing.

  • verbose (bool, default=False) – Enable verbose output.

statistics_

The imputation fill value for each feature.

Type:

ndarray of shape (n_features,)

indicator_

Indicator used to add binary indicators for missing values.

Type:

MissingIndicator or None

n_features_in_

Number of features seen during fit.

Type:

int

Examples

>>> import numpy as np
>>> from endgame.preprocessing.imputation import SimpleImputer
>>> X = np.array([[1, 2], [np.nan, 3], [7, np.nan]])
>>> imp = SimpleImputer(strategy='median')
>>> imp.fit_transform(X)
array([[1. , 2. ],
       [4. , 3. ],
       [7. , 2.5]])

fit(X, y=None, **fit_params)[source]

Fit the imputer on training data.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data with missing values (np.nan).

  • y (ignored)

Return type:

SimpleImputer

Returns:

self

transform(X)[source]

Impute missing values in X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Data with missing values.

Return type:

Any

Returns:

X_imputed (ndarray or DataFrame of shape (n_samples, n_features)) – Imputed data.

get_feature_names_out(input_features=None)[source]

Get output feature names.

Return type:

list[Text]

Parameters:

input_features (list[str] | None)

class endgame.preprocessing.IndicatorImputer(base_strategy='median', fill_value=None, only_missing=True, verbose=False)[source]

Bases: EndgameEstimator, TransformerMixin

Imputer that adds binary missing-indicator columns alongside imputed values.

For each feature with missing values, appends a binary column indicating which rows were originally missing. This is a common Kaggle trick that lets tree-based models learn different splits for missing vs. non-missing.

Parameters:
  • base_strategy (str, default='median') – Strategy for filling missing values: ‘mean’, ‘median’, ‘most_frequent’, ‘constant’.

  • fill_value (float, optional) – Fill value when base_strategy=’constant’.

  • only_missing (bool, default=True) – If True, only add indicators for features that have missing values in the training data. If False, add indicators for all features.

  • verbose (bool, default=False) – Enable verbose output.

statistics_

The imputation fill value for each feature.

Type:

ndarray of shape (n_features,)

missing_features_

Indices of features that had missing values during fit.

Type:

list of int

n_features_in_

Number of features seen during fit.

Type:

int

Examples

>>> import numpy as np
>>> from endgame.preprocessing.imputation import IndicatorImputer
>>> X = np.array([[1, 2], [np.nan, 3], [7, np.nan]])
>>> imp = IndicatorImputer(base_strategy='median')
>>> X_out = imp.fit_transform(X)
>>> X_out.shape
(3, 4)

fit(X, y=None, **fit_params)[source]

Fit the indicator imputer.

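Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data with missing values.

  • y (ignored)

Return type: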

IndicatorImputer

Returns:

self

transform(X)[source]

Impute and add indicator columns.

Parameters:

X (array-like of shape (n_samples, n_features)) – Data with missing values.

Return type:

Any

Returns:

X_out (ndarray or DataFrame of shape (n_samples, n_features + n_indicators)) – Imputed data with binary indicator columns appended.

get_feature_names_out(input_features=None)[source]

Get output feature names.

Return type:

list[Text]

Parameters:

input_features (list[str] | None)
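
The (3, 4) output shape in the example above follows from appending one indicator per column that had missing values during fit. A dependency-free sketch of that behaviour, using None in place of np.nan and hypothetical helper names:

```python
from statistics import median


def fit_medians(rows):
    """Return per-column medians and the indices of columns with missing values."""
    medians, missing_cols = [], []
    for j in range(len(rows[0])):
        present = [row[j] for row in rows if row[j] is not None]
        medians.append(median(present))
        if len(present) < len(rows):
            missing_cols.append(j)
    return medians, missing_cols


def impute_with_indicators(rows, medians, missing_cols):
    """Fill missing entries with the median and append 0/1 flags for `missing_cols`."""
    out = []
    for row in rows:
        filled = [medians[j] if v is None else v for j, v in enumerate(row)]
        flags = [1 if row[j] is None else 0 for j in missing_cols]
        out.append(filled + flags)
    return out


rows = [[1, 2], [None, 3], [7, None]]
medians, missing_cols = fit_medians(rows)
imputed = impute_with_indicators(rows, medians, missing_cols)
```

Both columns contain a missing value, so two indicator columns are appended and each output row has four entries.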

class endgame.preprocessing.KNNImputer(n_neighbors=5, weights='uniform', metric='nan_euclidean', add_indicator=False, copy=True, verbose=False)[source]

Bases: EndgameEstimator, TransformerMixin

K-Nearest Neighbors imputation with competition defaults.

Wraps sklearn.impute.KNNImputer with defaults tuned for tabular competitions: n_neighbors=5, uniform weights, nan_euclidean distance.

Parameters:
  • n_neighbors (int, default=5) – Number of nearest neighbors to use.

  • weights (str, default='uniform') – Weight function for prediction: ‘uniform’ or ‘distance’.

  • metric (str, default='nan_euclidean') – Distance metric for finding neighbors.

  • add_indicator (bool, default=False) – If True, append binary missing-indicator columns.

  • copy (bool, default=True) – If True, create a copy of X.

  • verbose (bool, default=False) – Enable verbose output.

n_features_in_

Number of features seen during fit.

Type:

int

Examples

>>> import numpy as np
>>> from endgame.preprocessing.imputation import KNNImputer
>>> X = np.array([[1, 2], [np.nan, 3], [7, 6], [5, np.nan]])
>>> imp = KNNImputer(n_neighbors=2)
>>> imp.fit_transform(X)
array([[1. , 2. ],
       [4. , 3. ],
       [7. , 6. ],
       [5. , 4. ]])

fit(X, y=None, **fit_params)[source]

Fit the KNN imputer.

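Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data with missing values.

  • y (ignored)

Return type: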

KNNImputer

Returns:

self

transform(X)[source]

Impute missing values using KNN.

Parameters:

X (array-like of shape (n_samples, n_features)) – Data with missing values.

Return type:

Any

Returns:

X_imputed (ndarray or DataFrame) – Imputed data.

get_feature_names_out(input_features=None)[source]

Get output feature names.

Return type:

list[Text]

Parameters:

input_features (list[str] | None)

class endgame.preprocessing.MICEImputer(estimator=None, max_iter=10, tol=0.001, initial_strategy='median', sample_posterior=False, random_state=42, add_indicator=False, verbose=False)[source]

Bases: EndgameEstimator, TransformerMixin

Multiple Imputation by Chained Equations.

Uses sklearn.impute.IterativeImputer with BayesianRidge as the default estimator, which is the standard MICE implementation. Iteratively models each feature as a function of all other features.

Parameters:
  • estimator (estimator, optional) – The estimator to predict each feature from all others. Default is BayesianRidge, which provides the standard MICE formulation.

  • max_iter (int, default=10) – Maximum number of imputation rounds.

  • tol (float, default=1e-3) – Convergence tolerance.

  • initial_strategy (str, default='median') – Strategy for initial imputation before iterating: ‘mean’, ‘median’, ‘most_frequent’, ‘constant’.

  • sample_posterior (bool, default=False) – If True, sample from the predictive posterior for each imputation. Provides proper multiple imputations when True.

  • random_state (int, default=42) – Random seed for reproducibility. Default set for deterministic results in competition settings.

  • add_indicator (bool, default=False) – If True, append binary missing-indicator columns.

  • verbose (bool, default=False) – Enable verbose output.

n_features_in_

Number of features seen during fit.

Type:

int

n_iter_

Number of iterations performed.

Type:

int

Examples

>>> import numpy as np
>>> from endgame.preprocessing.imputation import MICEImputer
>>> X = np.array([[1, 2], [np.nan, 3], [7, np.nan], [5, 4]])
>>> imp = MICEImputer(max_iter=10, random_state=42)
>>> X_imputed = imp.fit_transform(X)

fit(X, y=None, **fit_params)[source]

Fit the MICE imputer.

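Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data with missing values.

  • y (ignored)

Return type: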

MICEImputer

Returns:

self

transform(X)[source]

Impute missing values using MICE.

Parameters:

X (array-like of shape (n_samples, n_features)) – Data with missing values.

Return type:

Any

Returns:

X_imputed (ndarray or DataFrame) – Imputed data.

get_feature_names_out(input_features=None)[source]

Get output feature names.

Return type:

list[Text]

Parameters:

input_features (list[str] | None)

class endgame.preprocessing.MissForestImputer(n_estimators=100, max_iter=10, max_depth=None, max_features='sqrt', initial_strategy='median', random_state=42, n_jobs=-1, add_indicator=False, verbose=False)[source]

Bases: EndgameEstimator, TransformerMixin

Random Forest-based iterative imputation (MissForest algorithm).

Uses sklearn.impute.IterativeImputer with a RandomForestRegressor as the base estimator. This non-parametric approach handles non-linear relationships and interactions between features effectively.

Parameters:
  • n_estimators (int, default=100) – Number of trees in the random forest estimator.

  • max_iter (int, default=10) – Maximum number of imputation rounds.

  • max_depth (int or None, default=None) – Maximum depth of each tree. None means nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.

  • max_features (str or float, default='sqrt') – Number of features considered at each split.

  • initial_strategy (str, default='median') – Strategy for initial imputation before iterating.

  • random_state (int, default=42) – Random seed for reproducibility.

  • n_jobs (int, default=-1) – Number of parallel jobs for the random forest. -1 uses all cores.

  • add_indicator (bool, default=False) – If True, append binary missing-indicator columns.

  • verbose (bool, default=False) – Enable verbose output.

n_features_in_

Number of features seen during fit.

Type:

int

n_iter_

Number of iterations performed.

Type:

int

Examples

>>> import numpy as np
>>> from endgame.preprocessing.imputation import MissForestImputer
>>> X = np.array([[1, 2], [np.nan, 3], [7, np.nan], [5, 4]])
>>> imp = MissForestImputer(n_estimators=50, random_state=42)
>>> X_imputed = imp.fit_transform(X)
fit(X, y=None, **fit_params)[source]

Fit the MissForest imputer.

Parameters:
Return type:

MissForestImputer

Returns:

self

transform(X)[source]

Impute missing values using MissForest.

Parameters:

X (array-like of shape (n_samples, n_features)) – Data with missing values.

Return type:

Any

Returns:

X_imputed (ndarray or DataFrame) – Imputed data.

get_feature_names_out(input_features=None)[source]

Get output feature names.

Return type:

list[Text]

Parameters:

input_features (list[str] | None)

class endgame.preprocessing.AutoImputer(strategy='auto', low_threshold=0.05, high_threshold=0.3, random_state=42, add_indicator=False, verbose=False)[source]

Bases: EndgameEstimator, TransformerMixin

Automatic imputation strategy selection based on missingness patterns.

Analyzes the missingness structure in the data and selects an appropriate imputation strategy:

  • <5% missing -> SimpleImputer (fast, sufficient for low missingness)

  • 5-30% missing -> KNNImputer (captures local structure)

  • >30% missing -> MICEImputer (models complex dependencies)

Also performs an approximate Little’s MCAR test to characterize the missingness mechanism (MCAR, MAR, or MNAR).
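The threshold rule listed above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function names are hypothetical, and the handling of values exactly equal to a threshold is an assumption read off the parameter descriptions ("below" low_threshold, "above" high_threshold).

```python
import math


def missing_fraction(rows):
    """Overall fraction of NaN cells in a list of rows."""
    cells = [v for row in rows for v in row]
    return sum(math.isnan(v) for v in cells) / len(cells)


def select_strategy(fraction, low_threshold=0.05, high_threshold=0.30):
    """Map a missingness fraction to a strategy name (assumed boundaries)."""
    if fraction < low_threshold:
        return "simple"
    if fraction > high_threshold:
        return "mice"
    return "knn"


nan = float("nan")
X = [[1, 2], [nan, 3], [7, nan], [5, 4]]
frac = missing_fraction(X)      # 2 missing cells out of 8
choice = select_strategy(frac)
```

Two of the eight cells are missing (25%), which falls in the 5-30% band and therefore maps to 'knn'.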

Parameters:
  • strategy (str, default='auto') –

    Imputation strategy. One of:

    • ‘auto’: Automatically select based on missingness percentage

    • ‘simple’: Force SimpleImputer

    • ‘knn’: Force KNNImputer

    • ‘mice’: Force MICEImputer

    • ‘missforest’: Force MissForestImputer

  • low_threshold (float, default=0.05) – Missingness fraction below which SimpleImputer is used (in auto mode).

  • high_threshold (float, default=0.30) – Missingness fraction above which MICEImputer is used (in auto mode).

  • random_state (int, default=42) – Random seed for reproducibility.

  • add_indicator (bool, default=False) – If True, append binary missing-indicator columns.

  • verbose (bool, default=False) – Enable verbose output.

missingness_fraction_

Overall fraction of missing values in the training data.

Type:

float

missingness_type_

Detected missingness mechanism: ‘MCAR’, ‘MAR’, or ‘MNAR’.

Type:

str

selected_strategy_

The imputation strategy that was selected.

Type:

str

imputer_

The fitted imputer instance.

Type:

estimator

n_features_in_

Number of features seen during fit.

Type:

int

Examples

>>> import numpy as np
>>> from endgame.preprocessing.imputation import AutoImputer
>>> X = np.array([[1, 2], [np.nan, 3], [7, np.nan], [5, 4]])
>>> imp = AutoImputer(strategy='auto', random_state=42)
>>> X_imputed = imp.fit_transform(X)
>>> imp.selected_strategy_
'knn'
fit(X, y=None, **fit_params)[source]

Fit the auto imputer.

Analyzes missingness patterns and selects the appropriate strategy, then fits the chosen imputer.

Parameters:
Return type:

AutoImputer

Returns:

self

transform(X)[source]

Impute missing values using the selected strategy.

Parameters:

X (array-like of shape (n_samples, n_features)) – Data with missing values.

Return type:

Any

Returns:

X_imputed (ndarray or DataFrame) – Imputed data.

get_feature_names_out(input_features=None)[source]

Get output feature names.

Return type:

list[Text]

Parameters:

input_features (list[str] | None)

class endgame.preprocessing.TargetTransformer(regressor=None, method='auto', random_state=None, verbose=False)[source]

Bases: EndgameEstimator, RegressorMixin

Wrapper that applies target transformations for regression.

Transforms the target variable y during fit, trains the wrapped regressor on the transformed targets, and inverse-transforms predictions at inference time.
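The transform, fit, inverse-transform cycle described above can be sketched with a fixed log1p transform and a one-feature least-squares line. The class name Log1pTargetSketch and its attributes are hypothetical; the real wrapper accepts any regressor and several transformation methods.

```python
import math


class Log1pTargetSketch:
    """Fit a one-feature line on log1p(y); predict on the original scale."""

    def fit(self, x, y):
        t = [math.log1p(v) for v in y]  # forward transform of the target
        mx, mt = sum(x) / len(x), sum(t) / len(t)
        sxx = sum((a - mx) ** 2 for a in x)
        self.slope_ = sum((a - mx) * (b - mt) for a, b in zip(x, t)) / sxx
        self.intercept_ = mt - self.slope_ * mx
        return self

    def predict(self, x):
        # Predict in the transformed space, then invert with expm1.
        return [math.expm1(self.slope_ * a + self.intercept_) for a in x]


x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [math.expm1(0.5 * a) for a in x_train]  # log1p(y) is linear in x
model = Log1pTargetSketch().fit(x_train, y_train)
preds = model.predict([4.0])
```

Because the regressor only ever sees log1p(y), the prediction for x = 4 comes back on the original scale only after expm1 is applied.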

Parameters:
  • regressor (estimator) – Any sklearn-compatible regressor. This is required.

  • method (str, default='auto') –

    Transformation method. One of:

    • 'auto': Test normality via Shapiro-Wilk; try Box-Cox and Yeo-Johnson and pick whichever produces the most normal transformed y. Falls back to 'yeo_johnson' when Box-Cox is not applicable (non-positive targets).

    • 'log': Natural log. Requires strictly positive targets.

    • 'log1p': log(1 + y). Requires non-negative targets.

    • 'sqrt': Square root. Requires non-negative targets.

    • 'box_cox': Box-Cox power transform (scipy). Requires strictly positive targets.

    • 'yeo_johnson': Yeo-Johnson power transform (scipy). Works with any real-valued targets.

    • 'quantile': Sklearn QuantileTransformer mapping to normal.

    • 'rank': Rank-based (ordinal) normalization.

    • 'none': No transformation (passthrough).

  • random_state (int, optional) – Random seed for reproducibility (passed to quantile transform and the wrapped regressor if it supports it).

  • verbose (bool, default=False) – Enable verbose output.
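For reference, the Yeo-Johnson transform and its inverse can be written out directly. This sketch takes the parameter lam as given rather than fitting it, and the function names are illustrative; the library is documented as relying on scipy for the actual transform.

```python
import math


def yeo_johnson(y, lam):
    """Yeo-Johnson transform of a single value for a given lambda."""
    if y >= 0:
        return math.log1p(y) if lam == 0 else ((y + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-y)
    return -((1 - y) ** (2 - lam) - 1) / (2 - lam)


def yeo_johnson_inverse(t, lam):
    """Inverse of yeo_johnson; the sign of t selects the branch."""
    if t >= 0:
        return math.expm1(t) if lam == 0 else (lam * t + 1) ** (1 / lam) - 1
    if lam == 2:
        return -math.expm1(-t)
    return 1 - (1 - (2 - lam) * t) ** (1 / (2 - lam))


# Round trip over negative, zero, and positive targets.
values = [-3.0, -0.5, 0.0, 0.5, 4.0]
round_trip = [yeo_johnson_inverse(yeo_johnson(v, 0.5), 0.5) for v in values]
```

The transform preserves the sign of its input, which is why the inverse can choose its branch from the sign of the transformed value alone. This is also what lets Yeo-Johnson handle non-positive targets where Box-Cox cannot.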

regressor_

The fitted regressor (clone of regressor).

Type:

estimator

method_

The method actually used (relevant when method='auto').

Type:

str

lambda_

The fitted lambda parameter for Box-Cox / Yeo-Johnson transforms.

Type:

float or None

qt_

Fitted QuantileTransformer instance (for method='quantile').

Type:

QuantileTransformer or None

y_train_sorted_

Sorted training targets for rank inverse transform.

Type:

ndarray or None

feature_importances_

Delegated from the wrapped regressor, if available.

Type:

ndarray

Examples

>>> from sklearn.ensemble import RandomForestRegressor
>>> from endgame.preprocessing import TargetTransformer
>>> model = TargetTransformer(
...     regressor=RandomForestRegressor(n_estimators=100, random_state=42),
...     method='auto',
... )
>>> model.fit(X_train, y_train)
>>> preds = model.predict(X_test)
fit(X, y, **fit_params)[source]

Fit the wrapped regressor on transformed targets.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training features.

  • y (array-like of shape (n_samples,)) – Training targets.

  • **fit_params (dict) – Additional parameters forwarded to the wrapped regressor’s fit method (e.g. sample_weight).

Return type:

TargetTransformer

Returns:

self – Fitted TargetTransformer.

predict(X)[source]

Predict target values, inverse-transforming the regressor’s output.

Parameters:

X (array-like of shape (n_samples, n_features)) – Test features.

Return type:

ndarray

Returns:

ndarray of shape (n_samples,) – Predicted target values in the original scale.

predict_proba(X)[source]

Pass through to the wrapped regressor’s predict_proba, if available.

Some regressors (e.g. NGBoost) support probabilistic predictions. This method delegates directly without inverse-transforming, as the semantics are regressor-specific.

Parameters:

X (array-like of shape (n_samples, n_features)) – Test features.

Return type:

ndarray

Returns:

ndarray – Whatever the wrapped regressor returns from predict_proba.

Raises:

AttributeError – If the wrapped regressor does not support predict_proba.

property feature_importances_: ndarray

Feature importances from the wrapped regressor.

Returns:

ndarray of shape (n_features,) – Feature importances.

Raises:

AttributeError – If the wrapped regressor does not expose feature_importances_.

set_score_request(*, sample_weight='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

  • self (TargetTransformer)

Returns:

self (object) – The updated object.

Return type:

TargetTransformer

class endgame.preprocessing.TargetQuantileTransformer(regressor=None, n_quantiles=1000, output_distribution='normal', subsample=100000, random_state=None, verbose=False)[source]

Bases: EndgameEstimator, RegressorMixin

Convenience wrapper applying QuantileTransformer to the target.

This is a specialized shortcut for TargetTransformer(method='quantile'). It wraps a regressor and normalizes the target via sklearn’s QuantileTransformer before fitting.

Parameters:
  • regressor (estimator) – Any sklearn-compatible regressor.

  • n_quantiles (int, default=1000) – Number of quantiles for the QuantileTransformer.

  • output_distribution (str, default='normal') – Output distribution: ‘normal’ or ‘uniform’.

  • subsample (int, default=100000) – Subsample size for quantile estimation.

  • random_state (int, optional) – Random seed for reproducibility.

  • verbose (bool, default=False) – Enable verbose output.

regressor_

The fitted regressor.

Type:

estimator

qt_

The fitted target QuantileTransformer.

Type:

QuantileTransformer

feature_importances_

Delegated from the wrapped regressor, if available.

Type:

ndarray

Examples

>>> from sklearn.linear_model import Ridge
>>> from endgame.preprocessing.target_transform import TargetQuantileTransformer
>>> model = TargetQuantileTransformer(
...     regressor=Ridge(),
...     n_quantiles=500,
...     output_distribution='normal',
... )
>>> model.fit(X_train, y_train)
>>> preds = model.predict(X_test)
fit(X, y, **fit_params)[source]

Fit the wrapped regressor on quantile-transformed targets.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training features.

  • y (array-like of shape (n_samples,)) – Training targets.

  • **fit_params (dict) – Additional parameters forwarded to the regressor.

Return type:

TargetQuantileTransformer

Returns:

self

predict(X)[source]

Predict target values, inverse-transforming the output.

Parameters:

X (array-like of shape (n_samples, n_features)) – Test features.

Return type:

ndarray

Returns:

ndarray of shape (n_samples,) – Predicted target values in the original scale.

property feature_importances_: ndarray

Feature importances from the wrapped regressor.

set_score_request(*, sample_weight='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

  • self (TargetQuantileTransformer)

Returns:

self (object) – The updated object.

Return type:

TargetQuantileTransformer

class endgame.preprocessing.ConfidentLearningFilter(base_estimator='rf', cv=5, threshold='auto', method='prune_by_class', n_jobs=1, random_state=None)[source]

Bases: BaseEstimator

Identify mislabeled examples using Confident Learning.

Uses cross-validated predicted probabilities to estimate the joint distribution of noisy and true labels, then identifies examples that are likely mislabeled.
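A minimal sketch of the counting step, assuming the per-class threshold is the mean predicted probability of a class over the examples carrying that label (the 'auto' threshold described below). The function name and the toy probabilities are made up for illustration, and the library's exact pruning rules may differ.

```python
def confident_joint(labels, pred_proba, n_classes):
    """Count (given label, likely true label) pairs and list suspects."""
    # Per-class threshold: mean probability of class j over examples labelled j.
    thresholds = []
    for j in range(n_classes):
        own = [p[j] for y, p in zip(labels, pred_proba) if y == j]
        thresholds.append(sum(own) / len(own))

    joint = [[0] * n_classes for _ in range(n_classes)]
    suspects = []
    for i, (y, p) in enumerate(zip(labels, pred_proba)):
        # Classes whose probability clears their own threshold.
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue  # not confidently assigned to any class
        likely = max(confident, key=lambda j: p[j])
        joint[y][likely] += 1
        if likely != y:
            suspects.append(i)
    return thresholds, joint, suspects


labels = [0, 0, 0, 1, 1, 1]
pred_proba = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],    # labelled 0, but confidently predicted as class 1
    [0.25, 0.75],
    [0.1, 0.9],
    [0.3, 0.7],
]
thresholds, joint, suspects = confident_joint(labels, pred_proba, n_classes=2)
```

Only the third example ends up off the diagonal of the joint count, so it is the single suspected label error in this toy set.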

Parameters:
  • base_estimator (estimator or str, default='rf') – Classifier to use for cross-validated probability estimation. Can be ‘rf’ (RandomForest), ‘xgboost’, ‘lgbm’, or any sklearn-compatible classifier with predict_proba.

  • cv (int, default=5) – Number of cross-validation folds for probability estimation.

  • threshold (float or str, default='auto') – Confidence threshold for identifying noise. If ‘auto’, uses per-class average predicted probability as threshold. If float, uses the same threshold for all classes.

  • method (str, default='prune_by_class') –

    Method for identifying noisy labels:

    • ‘prune_by_class’: Remove examples with low self-confidence

    • ‘prune_by_noise_rate’: Remove based on estimated noise rates

    • ‘both’: Intersection of both methods (most conservative)

  • n_jobs (int, default=1) – Number of parallel jobs for cross-validation.

  • random_state (int or None, default=None) – Random state for reproducibility.

noise_mask_

Boolean mask where True indicates suspected noisy labels.

Type:

ndarray of shape (n_samples,)

noise_indices_

Indices of suspected noisy examples.

Type:

ndarray

confident_joint_

Estimated joint distribution of noisy vs. true labels.

Type:

ndarray of shape (n_classes, n_classes)

noise_rate_

Estimated overall noise rate.

Type:

float

per_class_noise_rate_

Estimated noise rate per class.

Type:

ndarray

pred_proba_

Cross-validated predicted probabilities.

Type:

ndarray of shape (n_samples, n_classes)

Example

>>> from endgame.preprocessing import ConfidentLearningFilter
>>> clf = ConfidentLearningFilter(base_estimator='rf', cv=5)
>>> noise_mask = clf.fit_detect(X, y)
>>> print(f"Found {noise_mask.sum()} noisy labels ({noise_mask.mean():.1%})")
>>> X_clean, y_clean = X[~noise_mask], y[~noise_mask]
fit(X, y)[source]

Fit the noise detector.

Parameters:
Return type:

ConfidentLearningFilter

Returns:

self

fit_detect(X, y)[source]

Fit and return the noise mask.

Parameters:
Return type:

ndarray

Returns:

noise_mask (ndarray of shape (n_samples,)) – Boolean mask where True indicates suspected noisy label.

clean(X, y)[source]

Fit and return cleaned data.

Parameters:
  • X (array-like) – Features.

  • y (array-like) – Labels.

Returns:

  • X_clean (ndarray) – Features with noisy examples removed.

  • y_clean (ndarray) – Labels with noisy examples removed.

class endgame.preprocessing.ConsensusFilter(estimators=None, cv=5, consensus_threshold=0.5, n_jobs=1, random_state=None)[source]

Bases: BaseEstimator

Identify noisy labels via consensus of multiple classifiers.

Trains multiple diverse classifiers and identifies examples where the majority disagree with the given label.
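The voting rule can be sketched on precomputed predictions. The function name is hypothetical, the predictions stand in for out-of-fold outputs of real classifiers, and the strict "greater than" comparison against consensus_threshold is an assumption.

```python
def consensus_noise_mask(y, predictions, consensus_threshold=0.5):
    """predictions[k][i] is classifier k's prediction for example i."""
    n_classifiers = len(predictions)
    mask = []
    for i, label in enumerate(y):
        disagree = sum(pred[i] != label for pred in predictions)
        # Flag the example when the disagreeing fraction exceeds the threshold.
        mask.append(disagree / n_classifiers > consensus_threshold)
    return mask


y = [0, 1, 1, 0]
predictions = [
    [0, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 0, 0],
]
noise_mask = consensus_noise_mask(y, predictions)
```

Raising consensus_threshold makes the filter more conservative: an example is flagged only when a larger share of the classifiers contradicts its label.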

Parameters:
  • estimators (list of estimators, optional) – List of classifiers to use. If None, uses a default diverse set.

  • cv (int, default=5) – Cross-validation folds for prediction.

  • consensus_threshold (float, default=0.5) – Fraction of classifiers that must disagree with the given label for it to be flagged as noisy.

  • n_jobs (int, default=1) – Number of parallel jobs.

  • random_state (int or None, default=None) – Random state for reproducibility.

Example

>>> from endgame.preprocessing import ConsensusFilter
>>> cf = ConsensusFilter(consensus_threshold=0.7)
>>> noise_mask = cf.fit_detect(X, y)
fit(X, y)[source]

Fit the consensus noise detector.

Parameters:
Return type:

ConsensusFilter

Returns:

self

fit_detect(X, y)[source]

Fit and return noise mask.

Return type:

ndarray

clean(X, y)[source]

Fit and return cleaned data.

class endgame.preprocessing.CrossValNoiseDetector(base_estimator=None, cv=5, n_repeats=3, misclassification_threshold=0.5, random_state=None)[source]

Bases: BaseEstimator

Simple cross-validated noise detection.

Flags examples that are consistently misclassified across CV folds as potentially noisy.

Parameters:
  • base_estimator (estimator, default=None) – Classifier to use. If None, uses RandomForestClassifier.

  • cv (int, default=5) – Number of CV folds.

  • n_repeats (int, default=3) – Number of repetitions with different random seeds.

  • misclassification_threshold (float, default=0.5) – Fraction of times an example must be misclassified across all folds and repeats to be flagged as noisy.

  • random_state (int or None, default=None) – Random state.

Example

>>> from endgame.preprocessing import CrossValNoiseDetector
>>> detector = CrossValNoiseDetector(n_repeats=5)
>>> noise_mask = detector.fit_detect(X, y)
fit(X, y)[source]

Fit the noise detector.

Return type:

CrossValNoiseDetector

fit_detect(X, y)[source]

Fit and return noise mask.

Return type:

ndarray

clean(X, y)[source]

Fit and return cleaned data.

class endgame.preprocessing.SMOTEResampler(sampling_strategy='auto', k_neighbors=5, random_state=None)[source]

Bases: BaseEstimator

SMOTE (Synthetic Minority Over-sampling Technique) wrapper.

Creates synthetic samples by interpolating between minority class instances and their k-nearest neighbors.
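The interpolation step can be sketched in plain Python. This is an illustration of the idea only (the wrapper's sampler_ attribute is an imblearn.over_sampling.SMOTE instance); the function name smote_sketch and the toy data are hypothetical.

```python
import math
import random


def smote_sketch(minority, n_new, k_neighbors=5, seed=0):
    """Generate n_new synthetic points from a list of minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        others = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )
        neighbour = rng.choice(others[:k_neighbors])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_new=10, k_neighbors=2)
```

Every synthetic point lies on the segment between a minority sample and one of its nearest minority neighbours, so new samples never leave the region spanned by the existing minority class.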

Parameters:
  • sampling_strategy (float, str, dict, or callable, default='auto') –

    Sampling information:

    • ‘auto’: Resample all classes but the majority

    • ‘minority’: Resample only the minority class

    • ‘not majority’: Resample all classes but the majority

    • ‘all’: Resample all classes

    • float: Ratio of minority to majority (0 < ratio <= 1)

    • dict: {class_label: n_samples} for each class

  • k_neighbors (int, default=5) – Number of nearest neighbors used to construct synthetic samples.

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • n_jobs (int, default=-1) – Number of parallel jobs for neighbor search.

sampler_

The fitted SMOTE sampler.

Type:

imblearn.over_sampling.SMOTE

sampling_strategy_

The computed sampling strategy.

Type:

dict

Examples

>>> from endgame.preprocessing import SMOTEResampler
>>> smote = SMOTEResampler(k_neighbors=5, random_state=42)
>>> X_res, y_res = smote.fit_resample(X, y)
fit(X, y)[source]

Fit the SMOTE sampler.

Parameters:
Return type:

SMOTEResampler

Returns:

self (SMOTEResampler) – Fitted sampler.

fit_resample(X, y)[source]

Fit and resample the dataset.

Parameters:
Return type:

tuple[ndarray, ndarray]

Returns:

  • X_resampled (ndarray of shape (n_samples_new, n_features)) – Resampled training data.

  • y_resampled (ndarray of shape (n_samples_new,)) – Resampled target values.

class endgame.preprocessing.BorderlineSMOTEResampler(sampling_strategy='auto', k_neighbors=5, m_neighbors=10, kind='borderline-1', random_state=None)[source]

Bases: BaseEstimator

Borderline-SMOTE wrapper focusing on difficult borderline samples.

Only generates synthetic samples from minority instances that are near the decision boundary (borderline instances).

Parameters:
  • sampling_strategy (float, str, dict, or callable, default='auto') – See SMOTEResampler for details.

  • k_neighbors (int, default=5) – Number of nearest neighbors for SMOTE interpolation.

  • m_neighbors (int, default=10) – Number of nearest neighbors to determine if instance is borderline.

  • kind ({'borderline-1', 'borderline-2'}, default='borderline-1') –

    • ‘borderline-1’: Only use borderline minority instances

    • ‘borderline-2’: Use borderline minority instances and their majority neighbors

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • n_jobs (int, default=-1) – Number of parallel jobs.

fit(X, y)[source]

Fit the BorderlineSMOTE sampler.

Return type:

BorderlineSMOTEResampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.ADASYNResampler(sampling_strategy='auto', n_neighbors=5, random_state=None)[source]

Bases: BaseEstimator

ADASYN (Adaptive Synthetic Sampling) wrapper.

Generates synthetic samples adaptively based on local density: more samples are generated in regions where the minority class is sparse.

Parameters:
  • sampling_strategy (float, str, dict, or callable, default='auto') – See SMOTEResampler for details.

  • n_neighbors (int, default=5) – Number of nearest neighbors.

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • n_jobs (int, default=-1) – Number of parallel jobs.

fit(X, y)[source]

Fit the ADASYN sampler.

Return type:

ADASYNResampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.SVMSMOTEResampler(sampling_strategy='auto', k_neighbors=5, m_neighbors=10, svm_estimator=None, out_step=0.5, random_state=None)[source]

Bases: BaseEstimator

SVM-SMOTE wrapper using SVM to identify borderline samples.

Uses SVM to identify support vectors (borderline samples) and generates synthetic samples only from those.

Parameters:
  • sampling_strategy (float, str, dict, or callable, default='auto') – See SMOTEResampler for details.

  • k_neighbors (int, default=5) – Number of nearest neighbors for SMOTE.

  • m_neighbors (int, default=10) – Number of nearest neighbors for borderline detection.

  • svm_estimator (estimator or None, default=None) – SVM classifier. If None, uses SVC with default parameters.

  • out_step (float, default=0.5) – Step size for generating samples outside the decision boundary.

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • n_jobs (int, default=-1) – Number of parallel jobs.

fit(X, y)[source]

Fit the SVM-SMOTE sampler.

Return type:

SVMSMOTEResampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.KMeansSMOTEResampler(sampling_strategy='auto', k_neighbors=2, kmeans_estimator=None, cluster_balance_threshold=0.1, density_exponent='auto', random_state=None, n_jobs=-1)[source]

Bases: BaseEstimator

K-Means SMOTE wrapper for cluster-based oversampling.

Applies k-means clustering before SMOTE, generating synthetic samples in under-represented clusters.

Parameters:
  • sampling_strategy (float, str, dict, or callable, default='auto') – See SMOTEResampler for details.

  • k_neighbors (int, default=2) – Number of nearest neighbors for SMOTE.

  • kmeans_estimator (estimator or int, default=None) – KMeans instance or number of clusters. If None, uses n_classes.

  • cluster_balance_threshold (float, default=0.1) – Threshold for considering clusters as imbalanced.

  • density_exponent (float or 'auto', default='auto') – Exponent for density-based sample allocation.

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • n_jobs (int, default=-1) – Number of parallel jobs.

fit(X, y)[source]

Fit the K-Means SMOTE sampler.

Return type:

KMeansSMOTEResampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.RandomOverSampler(sampling_strategy='auto', random_state=None, shrinkage=None)[source]

Bases: BaseEstimator

Random over-sampling wrapper (duplicates minority samples).

Simply duplicates random minority class samples. Fast but may lead to overfitting.

Parameters:
  • sampling_strategy (float, str, dict, or callable, default='auto') – See SMOTEResampler for details.

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • shrinkage (float or dict, default=None) – If not None, apply smoothed bootstrap with this shrinkage factor.

fit(X, y)[source]

Fit the random over-sampler.

Return type:

RandomOverSampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.EditedNearestNeighbours(sampling_strategy='auto', n_neighbors=3, kind_sel='all', n_jobs=-1)[source]

Bases: BaseEstimator

Edited Nearest Neighbours (ENN) under-sampling.

Removes samples whose class label differs from the majority of their k-nearest neighbors (noise removal).

Parameters:
  • sampling_strategy (str, list, or callable, default='auto') – Classes to be under-sampled.

  • n_neighbors (int, default=3) – Number of nearest neighbors for majority voting.

  • kind_sel ({'all', 'mode'}, default='all') –

    • ‘all’: A sample is removed if any of its neighbors belongs to a different class.

    • ‘mode’: A sample is removed if the majority of its neighbors belong to a different class.

  • n_jobs (int, default=-1) – Number of parallel jobs.
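The difference between the two kind_sel settings can be sketched as follows. For brevity this illustration edits every class, whereas the sampling_strategy parameter restricts which classes are under-sampled; the function name enn_keep_mask is hypothetical.

```python
import math
from collections import Counter


def enn_keep_mask(X, y, n_neighbors=3, kind_sel="all"):
    """Return True for each sample that survives the ENN edit."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Indices of the other samples, ordered by distance to sample i.
        order = sorted(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(xi, X[j]),
        )
        neighbour_labels = [y[j] for j in order[:n_neighbors]]
        if kind_sel == "all":
            # Keep only if every neighbour shares the sample's label.
            keep.append(all(label == yi for label in neighbour_labels))
        else:  # "mode"
            # Keep if the most common neighbour label matches.
            majority = Counter(neighbour_labels).most_common(1)[0][0]
            keep.append(majority == yi)
    return keep


X = [[0.0], [1.0], [2.0], [3.0], [10.0], [11.0], [12.0], [13.0]]
y = [0, 0, 1, 0, 1, 1, 1, 1]
keep_all = enn_keep_mask(X, y, kind_sel="all")
keep_mode = enn_keep_mask(X, y, kind_sel="mode")
```

With 'all', the isolated class-1 point at 2.0 also causes its three class-0 neighbours to be dropped; with 'mode', only the isolated point itself is removed.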

fit(X, y)[source]

Fit the ENN sampler.

Return type:

EditedNearestNeighbours

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.AllKNNUnderSampler(sampling_strategy='auto', n_neighbors=3, kind_sel='all', allow_minority=False, n_jobs=-1)[source]

Bases: BaseEstimator

AllKNN under-sampling (multiple passes of ENN).

Applies ENN repeatedly with increasing k values until no more samples are removed.

Parameters:
  • sampling_strategy (str, list, or callable, default='auto') – Classes to be under-sampled.

  • n_neighbors (int, default=3) – Starting number of nearest neighbors.

  • kind_sel ({'all', 'mode'}, default='all') – Selection strategy (see EditedNearestNeighbours).

  • allow_minority (bool, default=False) – If True, allow removal of minority samples.

  • n_jobs (int, default=-1) – Number of parallel jobs.

fit(X, y)[source]

Fit the AllKNN sampler.

Return type:

AllKNNUnderSampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.TomekLinksUnderSampler(sampling_strategy='auto', n_jobs=-1)[source]

Bases: BaseEstimator

Tomek Links under-sampling.

Removes Tomek links, that is, pairs of instances from different classes that are each other’s nearest neighbors. This cleans the decision boundary.
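Detecting such pairs takes only a few lines. The sketch below finds the links; deciding which member of each pair to drop (governed by sampling_strategy in the wrapper) is left out, and the function name is illustrative.

```python
import math


def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links."""

    def nearest(i):
        return min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: math.dist(X[i], X[j]),
        )

    nn = [nearest(i) for i in range(len(X))]
    # A link is a mutual nearest-neighbour pair with different labels.
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and y[i] != y[j]
    ]


X = [[0.0], [1.0], [2.9], [3.1], [5.0], [6.0]]
y = [0, 0, 0, 1, 1, 1]
links = tomek_links(X, y)
```

Here the points at 2.9 and 3.1 are mutual nearest neighbours with different labels, so they form the only Tomek link; the other mutual pairs share a class and are ignored.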

Parameters:
  • sampling_strategy (str, list, or callable, default='auto') – Classes to be under-sampled.

  • n_jobs (int, default=-1) – Number of parallel jobs.

fit(X, y)[source]

Fit the Tomek Links sampler.

Return type:

TomekLinksUnderSampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.RandomUnderSampler(sampling_strategy='auto', random_state=None, replacement=False)[source]

Bases: BaseEstimator

Random under-sampling (removes random majority samples).

Randomly removes majority class samples. Fast but may lose important information.

Parameters:
  • sampling_strategy (float, str, dict, or callable, default='auto') – Sampling information.

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • replacement (bool, default=False) – Whether to sample with replacement.

fit(X, y)[source]

Fit the random under-sampler.

Return type:

RandomUnderSampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.NearMissUnderSampler(sampling_strategy='auto', version=1, n_neighbors=3, n_neighbors_ver3=3, n_jobs=-1)[source]

Bases: BaseEstimator

NearMiss under-sampling using nearest neighbor heuristics.

Selects majority samples based on their distance to minority samples.

Parameters:
  • sampling_strategy (float, str, dict, or callable, default='auto') – Sampling information.

  • version ({1, 2, 3}, default=1) –

    Version of NearMiss algorithm:

    • 1: Select majority samples with smallest average distance to k nearest minority

    • 2: Select majority samples with smallest average distance to k farthest minority

    • 3: Select majority samples with smallest distance to each minority sample

  • n_neighbors (int, default=3) – Number of nearest neighbors.

  • n_neighbors_ver3 (int, default=3) – Number of neighbors for version 3.

  • n_jobs (int, default=-1) – Number of parallel jobs.
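Version 1 of the heuristic can be sketched in plain Python. The function name nearmiss1 and the explicit n_keep argument are assumptions made for this illustration; in the wrapper, the number of retained samples follows from sampling_strategy.

```python
import math


def nearmiss1(majority, minority, n_keep, n_neighbors=3):
    """Indices of the majority samples retained by a NearMiss-1 style rule."""

    def score(point):
        # Average distance to the n_neighbors closest minority samples.
        dists = sorted(math.dist(point, m) for m in minority)
        nearest = dists[:n_neighbors]
        return sum(nearest) / len(nearest)

    ranked = sorted(range(len(majority)), key=lambda i: score(majority[i]))
    return sorted(ranked[:n_keep])


minority = [[0.0], [1.0], [2.0]]
majority = [[3.0], [4.0], [10.0], [20.0], [2.5]]
kept = nearmiss1(majority, minority, n_keep=3, n_neighbors=2)
```

The three majority points closest to the minority cluster (2.5, 3.0, and 4.0) are retained, while the distant points at 10.0 and 20.0 are discarded.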

fit(X, y)[source]

Fit the NearMiss sampler.

Return type:

NearMissUnderSampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.CondensedNearestNeighbour(sampling_strategy='auto', random_state=None, n_neighbors=1, n_seeds_S=1, n_jobs=-1)[source]

Bases: BaseEstimator

Condensed Nearest Neighbour (CNN) under-sampling.

Iteratively adds samples that a 1-NN classifier built on the current condensed set misclassifies, yielding a consistent (though not necessarily minimal) subset.

Parameters:
  • sampling_strategy (str, list, or callable, default='auto') – Classes to be under-sampled.

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • n_neighbors (int, default=1) – Number of nearest neighbors.

  • n_seeds_S (int, default=1) – Number of samples to start the condensing.

  • n_jobs (int, default=-1) – Number of parallel jobs.

fit(X, y)[source]

Fit the CNN sampler.

Return type:

CondensedNearestNeighbour

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.OneSidedSelectionUnderSampler(sampling_strategy='auto', random_state=None, n_neighbors=1, n_seeds_S=1, n_jobs=-1)[source]

Bases: BaseEstimator

One-Sided Selection (OSS) under-sampling.

Combines Tomek links removal with CNN to remove noisy and redundant majority samples.

Parameters:
  • sampling_strategy (str, list, or callable, default='auto') – Classes to be under-sampled.

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • n_neighbors (int, default=1) – Number of nearest neighbors for CNN step.

  • n_seeds_S (int, default=1) – Number of samples to start CNN condensing.

  • n_jobs (int, default=-1) – Number of parallel jobs.
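
The Tomek links step can be illustrated with a short NumPy sketch. A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The helper name tomek_link_pairs is hypothetical, and the sketch only finds the pairs; which member of each pair this class removes is not shown.

```python
import numpy as np

def tomek_link_pairs(X, y):
    """Return (i, j) index pairs that are mutual nearest neighbours with different labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    nearest = dists.argmin(axis=1)
    pairs = []
    for i, j in enumerate(nearest):
        j = int(j)
        # i < j avoids reporting each pair twice.
        if i < j and nearest[j] == i and y[i] != y[j]:
            pairs.append((i, j))
    return pairs

X = [[0.0], [1.0], [1.2], [3.0]]
y = [0, 0, 1, 1]
links = tomek_link_pairs(X, y)
```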

fit(X, y)[source]

Fit the OSS sampler.

Return type:

OneSidedSelectionUnderSampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.NeighbourhoodCleaningRule(sampling_strategy='auto', n_neighbors=3, threshold_cleaning=0.5, n_jobs=None)[source]

Bases: BaseEstimator

Neighbourhood Cleaning Rule (NCR) under-sampling.

Uses ENN to clean the data and then removes majority samples whose nearest neighbors are mostly minority.

Parameters:
  • sampling_strategy (str, list, or callable, default='auto') – Classes to be under-sampled.

  • n_neighbors (int, default=3) – Number of nearest neighbors.

  • threshold_cleaning (float, default=0.5) – Threshold for cleaning majority samples.

  • n_jobs (int, default=None) – Number of parallel jobs.

fit(X, y)[source]

Fit the NCR sampler.

Return type:

NeighbourhoodCleaningRule

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.InstanceHardnessThresholdSampler(sampling_strategy='auto', estimator=None, cv=5, random_state=None, n_jobs=-1)[source]

Bases: BaseEstimator

Instance Hardness Threshold (IHT) under-sampling.

Removes samples that are hard to classify based on a classifier’s predicted probabilities.

Parameters:
  • sampling_strategy (str, list, or callable, default='auto') – Classes to be under-sampled.

  • estimator (estimator or None, default=None) – Classifier for computing instance hardness. If None, uses RandomForestClassifier.

  • cv (int, default=5) – Number of cross-validation folds.

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • n_jobs (int, default=-1) – Number of parallel jobs.
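
One common definition of instance hardness is one minus the predicted probability of the true class. The sketch below computes it from a precomputed probability matrix; it assumes that definition and does not show how this class obtains cross-validated probabilities or chooses its cut-off. The helper name instance_hardness is hypothetical.

```python
import numpy as np

def instance_hardness(proba, y, classes):
    """Hardness = 1 - predicted probability of each sample's true class."""
    proba = np.asarray(proba, dtype=float)
    column = {label: j for j, label in enumerate(classes)}
    true_cols = np.array([column[label] for label in y])
    return 1.0 - proba[np.arange(len(y)), true_cols]

proba = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
y = [0, 0, 1]
hardness = instance_hardness(proba, y, classes=[0, 1])
# The second sample is the hardest: its true class received only 0.4.
hardest = int(np.argmax(hardness))
```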

fit(X, y)[source]

Fit the IHT sampler.

Return type:

InstanceHardnessThresholdSampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.ClusterCentroidsUnderSampler(sampling_strategy='auto', random_state=None, estimator=None, voting='auto')[source]

Bases: BaseEstimator

Cluster Centroids under-sampling.

Replaces majority samples with cluster centroids from k-means.

Parameters:
  • sampling_strategy (float, str, dict, or callable, default='auto') – Sampling information.

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • estimator (estimator or None, default=None) – Clustering estimator. If None, uses KMeans.

  • voting ({'auto', 'hard', 'soft'}, default='auto') – Voting strategy for cluster assignment.
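
The idea of replacing majority samples with centroids can be sketched with a plain Lloyd's k-means in NumPy. This is an illustration under stated assumptions: the class defaults to a KMeans estimator whose configuration is not shown here, and the helper name kmeans_centroids is hypothetical.

```python
import numpy as np

def kmeans_centroids(X, n_clusters, n_iter=20, rng=None):
    """Plain Lloyd's k-means; returns the final centroids."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    # Initialise centroids from distinct random rows of X.
    centroids = X[rng.choice(len(X), size=n_clusters, replace=False)].copy()
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = X[labels == c]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[c] = members.mean(axis=0)
    return centroids

rng = np.random.default_rng(0)
X_majority = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
# 100 majority samples are summarised by 4 centroids.
centroids = kmeans_centroids(X_majority, n_clusters=4, rng=0)
```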

fit(X, y)[source]

Fit the Cluster Centroids sampler.

Return type:

ClusterCentroidsUnderSampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.SMOTEENNResampler(sampling_strategy='auto', smote=None, enn=None, random_state=None, n_jobs=-1)[source]

Bases: BaseEstimator

SMOTE + Edited Nearest Neighbours combined resampling.

Applies SMOTE over-sampling followed by ENN cleaning to remove noisy synthetic samples.

Parameters:
  • sampling_strategy (float, str, dict, or callable, default='auto') – Sampling strategy for SMOTE.

  • smote (SMOTEResampler or dict, default=None) – SMOTE instance or parameters.

  • enn (EditedNearestNeighbours or dict, default=None) – ENN instance or parameters.

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • n_jobs (int, default=-1) – Number of parallel jobs.
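
The ENN cleaning step can be sketched as a neighbour-vote filter: a sample is dropped when most of its nearest neighbours carry a different label. The helper name enn_keep_mask, the default of three neighbours, and the strict-majority rule are assumptions made for illustration.

```python
import numpy as np

def enn_keep_mask(X, y, n_neighbors=3):
    """True for samples whose label agrees with a majority of their neighbours."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # exclude each sample from its own neighbours
    neighbours = np.argsort(dists, axis=1)[:, :n_neighbors]
    keep = np.empty(len(X), dtype=bool)
    for i in range(len(X)):
        agree = np.sum(y[neighbours[i]] == y[i])
        keep[i] = agree > n_neighbors / 2
    return keep

X = [[0.0], [0.1], [0.2], [0.15], [5.0], [5.1], [5.2]]
y = [0, 0, 0, 1, 1, 1, 1]
keep = enn_keep_mask(X, y)
```

The class-1 sample at 0.15 sits inside the class-0 cluster, so it is the only one filtered out.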

fit(X, y)[source]

Fit the SMOTE-ENN sampler.

Return type:

SMOTEENNResampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.SMOTETomekResampler(sampling_strategy='auto', smote=None, tomek=None, random_state=None, n_jobs=-1)[source]

Bases: BaseEstimator

SMOTE + Tomek Links combined resampling.

Applies SMOTE over-sampling followed by Tomek links removal to clean the decision boundary.

Parameters:
  • sampling_strategy (float, str, dict, or callable, default='auto') – Sampling strategy for SMOTE.

  • smote (SMOTEResampler or dict, default=None) – SMOTE instance or parameters.

  • tomek (TomekLinksUnderSampler or dict, default=None) – Tomek Links instance or parameters.

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • n_jobs (int, default=-1) – Number of parallel jobs.

fit(X, y)[source]

Fit the SMOTE-Tomek sampler.

Return type:

SMOTETomekResampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample the dataset.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.MultivariateGaussianSMOTE(sampling_strategy='auto', k_neighbors=5, regularization=1e-06, random_state=None)[source]

Bases: BaseEstimator

Multivariate Gaussian SMOTE oversampler.

For each minority sample, fits a local multivariate Gaussian from its k-nearest minority neighbours and samples new points from it.

Parameters:
  • sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets() for semantics.

  • k_neighbors (int, default=5) – Number of nearest minority neighbours for covariance estimation.

  • regularization (float, default=1e-6) – Ridge added to the diagonal of local covariance matrices to ensure positive-definiteness.

  • random_state (int or None, default=None) – Random seed.
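
Generating one synthetic point from a local Gaussian can be sketched as below. The helper name local_gaussian_sample is hypothetical, and centring the Gaussian on the neighbourhood mean (rather than on the sample itself) is an assumption; the source only states that a local multivariate Gaussian is fitted from the k nearest minority neighbours.

```python
import numpy as np

def local_gaussian_sample(X_min, index, k_neighbors=5, regularization=1e-6, rng=None):
    """Draw one point from a Gaussian fitted to the neighbourhood of X_min[index]."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    k = min(k_neighbors, len(X_min) - 1)
    dists = np.linalg.norm(X_min - X_min[index], axis=1)
    # The sample itself (distance 0) plus its k nearest minority neighbours.
    local = X_min[np.argsort(dists)[: k + 1]]
    mean = local.mean(axis=0)
    cov = np.atleast_2d(np.cov(local, rowvar=False))
    # The ridge on the diagonal keeps the covariance positive-definite.
    cov = cov + regularization * np.eye(X_min.shape[1])
    return rng.multivariate_normal(mean, cov)

rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))
sample = local_gaussian_sample(X_min, index=0, rng=1)
```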

References

“Do we need rebalancing strategies?” (ICLR 2025)

fit(X, y)[source]

Fit the sampler (validates input and computes targets).

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

Return type:

MultivariateGaussianSMOTE

Returns:

self

fit_resample(X, y)[source]

Fit and resample the dataset.

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

Return type:

tuple[ndarray, ndarray]

Returns:

  • X_resampled (ndarray)

  • y_resampled (ndarray)

class endgame.preprocessing.SimplicialSMOTE(sampling_strategy='auto', k_neighbors=5, simplex_dim=2, random_state=None)[source]

Bases: BaseEstimator

Simplicial complex SMOTE oversampler.

Builds simplicial complexes from the k-NN graph of minority samples and generates new points inside simplices using Dirichlet-distributed barycentric coordinates.

Parameters:
  • sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().

  • k_neighbors (int, default=5) – Number of nearest neighbours for graph construction.

  • simplex_dim (int, default=2) – Dimension of the simplices to sample from (2 = triangles, 3 = tetrahedra). Clamped to min(simplex_dim, k_neighbors).

  • random_state (int or None, default=None) – Random seed.
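
Sampling inside a simplex with Dirichlet-distributed barycentric coordinates can be sketched as follows. The helper name sample_in_simplex and the flat Dirichlet(1, ..., 1) prior are assumptions; building the simplices from the k-NN graph is not shown.

```python
import numpy as np

def sample_in_simplex(vertices, rng=None):
    """Return a random point inside the simplex and its barycentric weights."""
    rng = np.random.default_rng(rng)
    vertices = np.asarray(vertices, dtype=float)
    # Dirichlet weights are non-negative and sum to 1, so the weighted
    # combination of the vertices always lies inside the simplex.
    weights = rng.dirichlet(np.ones(len(vertices)))
    return weights @ vertices, weights

# simplex_dim=2 corresponds to a triangle, i.e. three vertices.
triangle = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
point, weights = sample_in_simplex(triangle, rng=0)
```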

References

Simplicial complex extension of SMOTE (KDD 2025)

fit(X, y)[source]

Fit the sampler.

Return type:

SimplicialSMOTE

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.CVSMOTEResampler(sampling_strategy='auto', k_neighbors=5, cv=3, estimator=None, scoring='f1_macro', candidate_pool_factor=2.0, random_state=None)[source]

Bases: BaseEstimator

Cross-validation guided SMOTE oversampler.

Generates a pool of candidate synthetic samples via SMOTE-style interpolation, then uses cross-validation to retain only those that improve a scorer metric.

Parameters:
  • sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().

  • k_neighbors (int, default=5) – Nearest neighbours for SMOTE interpolation.

  • cv (int, default=3) – Number of cross-validation folds for candidate evaluation.

  • estimator (estimator or None, default=None) – Classifier used to score candidate batches. Defaults to LogisticRegression(max_iter=500).

  • scoring (str, default='f1_macro') – Scoring metric for cross-validation (sklearn convention).

  • candidate_pool_factor (float, default=2.0) – Generate this many times the required synthetic samples as candidates, then keep the best subset.

  • random_state (int or None, default=None) – Random seed.
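
The candidate-generation stage can be sketched with SMOTE-style interpolation. The sketch covers only pool creation, sized by candidate_pool_factor; the cross-validated selection of the best subset is not shown, and the helper name smote_candidates is hypothetical.

```python
import math
import numpy as np

def smote_candidates(X_min, n_needed, k_neighbors=5, candidate_pool_factor=2.0, rng=None):
    """Generate a pool of interpolated minority candidates."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    n_candidates = math.ceil(candidate_pool_factor * n_needed)
    k = min(k_neighbors, len(X_min) - 1)
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base sample and one of its k nearest neighbours.
    base = rng.integers(0, len(X_min), size=n_candidates)
    partner = neighbours[base, rng.integers(0, k, size=n_candidates)]
    # Interpolate at a random position on the segment between the two.
    u = rng.random((n_candidates, 1))
    return X_min[base] + u * (X_min[partner] - X_min[base])

rng = np.random.default_rng(0)
X_min = rng.normal(size=(10, 2))
candidates = smote_candidates(X_min, n_needed=4, rng=0)
```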

References

CV-informed SMOTE (ICLR 2025)

fit(X, y)[source]

Fit the sampler.

Return type:

CVSMOTEResampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.OverlapRegionDetector(sampling_strategy='auto', base_sampler='smote', overlap_estimator=None, k_neighbors=5, threshold=0.3, random_state=None)[source]

Bases: BaseEstimator

Overlap Region Detection meta-method for class imbalance.

Identifies the overlap region between classes using classifier uncertainty, then applies a base sampler with overlap awareness.

Algorithm

  1. Train a classifier to get predicted probabilities.

  2. Samples with high uncertainty (max prob < 1 - threshold) are labelled as “overlap”.

  3. Apply the base sampler on the augmented label space.

  4. Map generated samples back to original labels.
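
Step 2 above can be sketched directly from the stated rule. The helper name overlap_mask is hypothetical, and the probabilities are assumed to come from any classifier exposing predict_proba.

```python
import numpy as np

def overlap_mask(proba, threshold=0.3):
    """True where the most likely class has probability below 1 - threshold."""
    proba = np.asarray(proba, dtype=float)
    return proba.max(axis=1) < 1.0 - threshold

proba = [[0.95, 0.05], [0.60, 0.40], [0.75, 0.25], [0.20, 0.80]]
mask = overlap_mask(proba, threshold=0.3)
```

Only the second sample, whose largest probability is 0.60, falls in the overlap region.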

Parameters:
  • sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().

  • base_sampler (str or estimator, default='smote') – Base oversampling method. If a string, looked up in the combined sampler registries. Otherwise must support fit_resample(X, y).

  • overlap_estimator (estimator or None, default=None) – Classifier for overlap detection. Defaults to RandomForestClassifier(n_estimators=100).

  • k_neighbors (int, default=5) – Passed to base sampler when constructed from string.

  • threshold (float, default=0.3) – Uncertainty threshold: a sample is in the overlap region if max(predicted_proba) < 1 - threshold.

  • random_state (int or None, default=None) – Random seed.

References

Overlap Region Detection (AAAI 2025)

fit(X, y)[source]

Fit the sampler.

Return type:

OverlapRegionDetector

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample with overlap awareness.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.AutoBalancer(strategy='auto', sampling_strategy='auto', imbalance_threshold=0.5, severe_imbalance_threshold=0.1, include_generative=False, random_state=None, n_jobs=-1, **kwargs)[source]

Bases: BaseEstimator

Automatic class balancing with strategy selection.

Automatically selects and applies the best resampling strategy based on the imbalance ratio and data characteristics.

Parameters:
  • strategy (str, default='auto') – Balancing strategy:

      - ‘auto’: Automatically select based on the imbalance ratio.
      - ‘oversample’: Use SMOTE-based oversampling.
      - ‘undersample’: Use ENN-based undersampling.
      - ‘combine’: Use SMOTE + ENN.
      - ‘geometric’: Use MultivariateGaussianSMOTE (from the geometric module).
      - ‘generative’: Use ForestFlowResampler (from the generative module).
      - Any key from ALL_SAMPLERS (e.g. ‘smote’, ‘borderline_smote’).

  • sampling_strategy (float, str, dict, or callable, default='auto') – Target class distribution.

  • imbalance_threshold (float, default=0.5) – Ratio below which data is considered imbalanced.

  • severe_imbalance_threshold (float, default=0.1) – Ratio below which imbalance is considered severe.

  • random_state (int or None, default=None) – Random seed for reproducibility.

  • include_generative (bool, default=False) – If True, include generative samplers (from imbalance_generative) in the auto-selection pool.

  • n_jobs (int, default=-1) – Number of parallel jobs.

  • **kwargs (dict) – Additional parameters passed to the selected sampler.
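
How the two thresholds partition the imbalance ratio can be sketched as below. The helper name imbalance_level and the level names are hypothetical, and which sampler the ‘auto’ strategy picks at each level is not documented here, so the sketch stops at the classification.

```python
def imbalance_level(ratio, imbalance_threshold=0.5, severe_imbalance_threshold=0.1):
    """Classify a minority/majority ratio using the two documented thresholds."""
    if ratio < severe_imbalance_threshold:
        return "severe"
    if ratio < imbalance_threshold:
        return "imbalanced"
    return "balanced"

levels = [imbalance_level(r) for r in (0.05, 0.4, 0.8)]
```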

sampler_

The fitted sampler.

Type:

BaseEstimator

imbalance_ratio_

Computed imbalance ratio (minority / majority).

Type:

float

selected_strategy_

The strategy that was selected.

Type:

str

Examples

>>> from endgame.preprocessing import AutoBalancer
>>> balancer = AutoBalancer(strategy='auto', random_state=42)
>>> X_balanced, y_balanced = balancer.fit_resample(X, y)
>>> print(f"Selected: {balancer.selected_strategy_}")
fit(X, y)[source]

Fit the auto-balancer.

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

Return type:

AutoBalancer

Returns:

self (AutoBalancer) – Fitted balancer.

fit_resample(X, y)[source]

Fit and resample the dataset.

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

Return type:

tuple[ndarray, ndarray]

Returns:

  • X_resampled (ndarray of shape (n_samples_new, n_features)) – Resampled training data.

  • y_resampled (ndarray of shape (n_samples_new,)) – Resampled target values.

get_sampler()[source]

Get the underlying sampler.

Return type:

BaseEstimator | None

Returns:

sampler (BaseEstimator or None) – The fitted sampler, or None if no resampling was needed.

endgame.preprocessing.get_imbalance_ratio(y)[source]

Compute the imbalance ratio of a target array.

Parameters:

y (array-like of shape (n_samples,)) – Target values.

Return type:

float

Returns:

ratio (float) – Imbalance ratio (minority_count / majority_count). Returns 1.0 if all classes have the same count.

Examples

>>> from endgame.preprocessing import get_imbalance_ratio
>>> y = [0, 0, 0, 0, 0, 1, 1]
>>> get_imbalance_ratio(y)
0.4
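
The ratio in the example can be reproduced with the standard library. This is a stand-alone sketch of the stated formula (minority_count / majority_count), not the library's code; the name imbalance_ratio_sketch is hypothetical, and using the smallest and largest class counts for multi-class targets is an assumption.

```python
from collections import Counter

def imbalance_ratio_sketch(y):
    """Smallest class count divided by largest class count."""
    counts = Counter(y)
    return min(counts.values()) / max(counts.values())

ratio = imbalance_ratio_sketch([0, 0, 0, 0, 0, 1, 1])  # 2 minority / 5 majority
```
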
endgame.preprocessing.get_class_distribution(y)[source]

Get the class distribution of a target array.

Parameters:

y (array-like of shape (n_samples,)) – Target values.

Return type:

dict[Any, int]

Returns:

distribution (dict) – Dictionary mapping class labels to counts.

Examples

>>> from endgame.preprocessing import get_class_distribution
>>> y = [0, 0, 0, 1, 1, 2]
>>> get_class_distribution(y)
{0: 3, 1: 2, 2: 1}
class endgame.preprocessing.DenoisingAutoEncoder(hidden_dims=None, noise_fraction=0.1, dropout=0.1, activation='relu', n_epochs=100, batch_size=256, learning_rate=0.001, weight_decay=1e-05, early_stopping=10, scheduler='cosine', device='auto', random_state=None, verbose=False)[source]

Bases: BaseEstimator, TransformerMixin

Denoising Autoencoder for tabular representation learning.

Corrupts input with swap noise (randomly swapping values between samples), trains to reconstruct the original input, and extracts bottleneck layer embeddings as new features.

This technique was used in first-place solutions to Kaggle's Tabular Playground Series.

Parameters:
  • hidden_dims (List[int], default=[256, 128, 64]) – Architecture of encoder (decoder mirrors). The last dimension is the bottleneck/embedding size.

  • noise_fraction (float, default=0.1) – Fraction of features to corrupt with swap noise.

  • dropout (float, default=0.1) – Dropout rate for regularization.

  • activation (str, default='relu') – Activation function: ‘relu’, ‘leaky_relu’, ‘elu’, ‘selu’, ‘gelu’, ‘swish’, ‘tanh’.

  • n_epochs (int, default=100) – Maximum training epochs.

  • batch_size (int, default=256) – Training batch size.

  • learning_rate (float, default=1e-3) – Initial learning rate.

  • weight_decay (float, default=1e-5) – L2 regularization strength.

  • early_stopping (int, default=10) – Patience for early stopping (based on reconstruction loss).

  • scheduler (str, default='cosine') – Learning rate scheduler: ‘cosine’, ‘step’, ‘none’.

  • device (str, default='auto') – Device: ‘cuda’, ‘cpu’, or ‘auto’ (auto-detect GPU).

  • random_state (int, optional) – Random seed for reproducibility.

  • verbose (bool, default=False) – Enable verbose output.
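
Swap noise can be sketched in NumPy as below. Each cell is replaced, with probability noise_fraction, by the value of the same feature taken from a random row. Treating noise_fraction as a per-cell probability is an assumption, and the helper name swap_noise is hypothetical.

```python
import numpy as np

def swap_noise(X, noise_fraction=0.1, rng=None):
    """Corrupt X by swapping in same-column values from random rows."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X)
    n_samples, n_features = X.shape
    # Cells selected for corruption.
    mask = rng.random(X.shape) < noise_fraction
    # For every cell, a random donor row; the column stays the same.
    donor_rows = rng.integers(0, n_samples, size=X.shape)
    cols = np.broadcast_to(np.arange(n_features), X.shape)
    corrupted = np.where(mask, X[donor_rows, cols], X)
    return corrupted, mask

X = np.arange(20.0).reshape(5, 4)
corrupted, mask = swap_noise(X, noise_fraction=0.5, rng=0)
```

Because values only move within a column, each corrupted feature keeps its original marginal distribution, which is why swap noise suits mixed-scale tabular data.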

model_

Fitted PyTorch DAE model.

Type:

_DAEModule

scaler_

Feature scaler.

Type:

StandardScaler

n_features_in_

Number of input features.

Type:

int

embedding_dim_

Dimension of the learned embeddings.

Type:

int

history_

Training history with ‘train_loss’ and ‘val_loss’.

Type:

dict

Examples

>>> import numpy as np
>>> from endgame.preprocessing import DenoisingAutoEncoder
>>> # Create DAE with 64-dimensional embeddings
>>> dae = DenoisingAutoEncoder(hidden_dims=[256, 128, 64], n_epochs=50)
>>> # Fit on training data
>>> dae.fit(X_train)
>>> # Extract embeddings as new features
>>> X_train_embed = dae.transform(X_train)
>>> X_test_embed = dae.transform(X_test)
>>> # Combine with original features
>>> X_train_enriched = np.hstack([X_train, X_train_embed])
fit(X, y=None)[source]

Fit the Denoising Autoencoder.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data.

  • y (ignored) – Not used, present for API consistency.

Return type:

DenoisingAutoEncoder

Returns:

self – Fitted transformer.

transform(X)[source]

Extract bottleneck embeddings.

Parameters:

X (array-like of shape (n_samples, n_features)) – Data to transform.

Return type:

ndarray

Returns:

ndarray of shape (n_samples, embedding_dim) – Bottleneck embeddings.

fit_transform(X, y=None)[source]

Fit and transform in one step.

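Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data.

  • y (ignored) – Not used, present for API consistency.
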
Return type:

ndarray

Returns:

ndarray of shape (n_samples, embedding_dim) – Bottleneck embeddings.

reconstruct(X)[source]

Reconstruct the input by encoding it to the bottleneck and decoding it back.

Useful for detecting anomalies (high reconstruction error).

Parameters:

X (array-like of shape (n_samples, n_features)) – Data to reconstruct.

Return type:

ndarray

Returns:

ndarray of shape (n_samples, n_features) – Reconstructed data.

reconstruction_error(X)[source]

Compute per-sample reconstruction error.

Parameters:

X (array-like of shape (n_samples, n_features)) – Data to evaluate.

Return type:

ndarray

Returns:

ndarray of shape (n_samples,) – Mean squared reconstruction error per sample.
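
The per-sample mean squared error can be written in one line of NumPy. This sketch takes the reconstruction as a given array; whether the method measures the error in scaled or original feature space is not stated here, and the helper name per_sample_mse is hypothetical.

```python
import numpy as np

def per_sample_mse(X, X_reconstructed):
    """Mean squared error across features, one value per row."""
    X = np.asarray(X, dtype=float)
    X_reconstructed = np.asarray(X_reconstructed, dtype=float)
    return np.mean((X - X_reconstructed) ** 2, axis=1)

X = [[1.0, 2.0], [3.0, 4.0]]
X_hat = [[1.0, 2.0], [1.0, 4.0]]
errors = per_sample_mse(X, X_hat)
# The row with the largest error is the strongest anomaly candidate.
most_anomalous = int(np.argmax(errors))
```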

get_feature_names_out(input_features=None)[source]

Get output feature names.

Parameters:

input_features (ignored) – Not used.

Return type:

list[str]

Returns:

List[str] – Output feature names.

class endgame.preprocessing.CTGANResampler(sampling_strategy='auto', embedding_dim=128, generator_dim=(256, 256), discriminator_dim=(256, 256), n_epochs=300, batch_size=500, random_state=None, verbose=False)[source]

Bases: BaseEstimator

Conditional Tabular GAN oversampler.

Thin wrapper around the ctgan.CTGAN package. Trains a conditional GAN on minority class data and generates synthetic samples to balance.

Parameters:
  • sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().

  • embedding_dim (int, default=128) – Embedding dimension for the generator.

  • generator_dim (tuple of int, default=(256, 256)) – Generator hidden layer sizes.

  • discriminator_dim (tuple of int, default=(256, 256)) – Discriminator hidden layer sizes.

  • n_epochs (int, default=300) – Training epochs.

  • batch_size (int, default=500) – Training batch size.

  • random_state (int or None, default=None) – Random seed.

  • verbose (bool, default=False) – Enable verbose output.

References

CTGAN (NeurIPS 2019)

fit(X, y)[source]

Fit (validate input).

Return type:

CTGANResampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample using CTGAN.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.ForestFlowResampler(sampling_strategy='auto', n_estimators=100, max_depth=6, n_steps=50, noise_type='gaussian', random_state=None, verbose=False)[source]

Bases: BaseEstimator

XGBoost-based flow matching oversampler (ForestFlow).

Trains XGBoost to learn the velocity field v(x, t) = x_1 - x_0 of a conditional flow matching ODE, then integrates from noise to data via Euler steps. CPU-friendly — no PyTorch required.

Parameters:
  • sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().

  • n_estimators (int, default=100) – Number of trees per XGBoost model.

  • max_depth (int, default=6) – Maximum tree depth.

  • n_steps (int, default=50) – Number of Euler integration steps.

  • noise_type (str, default='gaussian') – Noise distribution for the source: ‘gaussian’ or ‘uniform’.

  • random_state (int or None, default=None) – Random seed.

  • verbose (bool, default=False) – Enable verbose output.
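
The two ingredients described above, regression targets for the velocity field and Euler integration, can be sketched as follows. The helper names are hypothetical, and an analytic velocity field stands in for the trained XGBoost models so that the example stays self-contained.

```python
import numpy as np

def flow_matching_pairs(x1, rng):
    """Build (x_t, t) inputs and velocity targets x_1 - x_0 for regression."""
    x0 = rng.standard_normal(x1.shape)   # noise source samples
    t = rng.random((len(x1), 1))         # one random time per sample
    xt = (1.0 - t) * x0 + t * x1         # point on the straight path from x0 to x1
    return xt, t, x1 - x0

def euler_integrate(velocity, x0, n_steps=50):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for step in range(n_steps):
        x = x + dt * velocity(x, step * dt)
    return x

rng = np.random.default_rng(0)
x1 = rng.normal(size=(8, 2))
xt, t, target = flow_matching_pairs(x1, rng)

# Stand-in velocity: the exact conditional field that transports any start
# point to a single target c. A fitted regressor would replace this lambda.
c = np.array([[2.0, -1.0]])
end = euler_integrate(lambda x, t: (c - x) / (1.0 - t), np.zeros((1, 2)))
```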

References

Jolicoeur-Martineau et al., “Generating and Imputing Tabular Data via Diffusion and Flow XGBoost Models”, 2024.

fit(X, y)[source]

Fit (validate input).

Return type:

ForestFlowResampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample using ForestFlow.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.TabDDPMResampler(sampling_strategy='auto', n_timesteps=1000, hidden_dims=None, n_epochs=100, batch_size=256, lr=0.001, device='auto', random_state=None, verbose=False)[source]

Bases: BaseEstimator

Tab-DDPM oversampler: denoising diffusion for tabular data.

Uses Gaussian diffusion with an MLP denoiser that predicts noise given a noisy sample and timestep embedding.

Parameters:
  • sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().

  • n_timesteps (int, default=1000) – Number of diffusion timesteps.

  • hidden_dims (list of int, default=[256, 256]) – MLP denoiser hidden layer sizes.

  • n_epochs (int, default=100) – Training epochs.

  • batch_size (int, default=256) – Training batch size.

  • lr (float, default=1e-3) – Learning rate.

  • device (str, default='auto') – Computation device.

  • random_state (int or None, default=None) – Random seed.

  • verbose (bool, default=False) – Enable verbose output.
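
The forward (noising) process that the MLP denoiser learns to invert can be sketched as below. The linear beta schedule and its endpoints are assumptions borrowed from common DDPM setups, not values confirmed for this class, and the helper names are hypothetical.

```python
import numpy as np

def make_alpha_bar(n_timesteps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, n_timesteps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Sample x_t given x_0: sqrt(a_t) * x_0 + sqrt(1 - a_t) * noise."""
    noise = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * noise, noise

alpha_bar = make_alpha_bar()
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 3))
xt, noise = q_sample(x0, 500, alpha_bar, rng)
```

The denoiser is trained to predict noise from xt and the timestep; with a perfect prediction, x0 can be recovered by inverting the formula in q_sample.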

References

TabDDPM (Kotelnikov et al., ICML 2023)

fit(X, y)[source]

Fit (validate input).

Return type:

TabDDPMResampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample using TabDDPM.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.TabSynResampler(sampling_strategy='auto', latent_dim=64, vae_hidden_dims=None, vae_epochs=100, diffusion_hidden_dims=None, diffusion_epochs=100, n_timesteps=1000, batch_size=256, lr=0.001, device='auto', random_state=None, verbose=False)[source]

Bases: BaseEstimator

TabSyn oversampler: VAE + latent diffusion for tabular data.

Two-stage approach:

  1. Train a VAE on minority data to learn a smooth latent space.

  2. Train a diffusion model in the latent space.

Generation runs reverse diffusion in the latent space, then decodes the result through the VAE.

Parameters:
  • sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().

  • latent_dim (int, default=64) – VAE latent dimension.

  • vae_hidden_dims (list of int, default=[256, 128]) – VAE encoder/decoder hidden sizes.

  • vae_epochs (int, default=100) – VAE training epochs.

  • diffusion_hidden_dims (list of int, default=[256, 256]) – Diffusion denoiser hidden sizes.

  • diffusion_epochs (int, default=100) – Diffusion training epochs.

  • n_timesteps (int, default=1000) – Number of diffusion timesteps.

  • batch_size (int, default=256) – Training batch size.

  • lr (float, default=1e-3) – Learning rate.

  • device (str, default='auto') – Computation device.

  • random_state (int or None, default=None) – Random seed.

  • verbose (bool, default=False) – Enable verbose output.

References

TabSyn (Zhang et al., ICLR 2024)

fit(X, y)[source]

Fit (validate input).

Return type:

TabSynResampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample using TabSyn.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

class endgame.preprocessing.GReaTResampler(sampling_strategy='auto', model_name='distilgpt2', n_epochs=5, batch_size=8, max_length=256, temperature=0.7, feature_names=None, label_name='Class', device='auto', random_state=None, verbose=False)[source]

Bases: BaseEstimator

GReaT oversampler: LLM-based tabular data generation.

Serializes tabular rows as natural language strings, fine-tunes a small causal language model (e.g. distilgpt2), and generates new minority samples by prompting with the minority class label prefix.

Parameters:
  • sampling_strategy (str, float, or dict, default='auto') – See _compute_sampling_targets().

  • model_name (str, default='distilgpt2') – HuggingFace model name for the causal LM backbone.

  • n_epochs (int, default=5) – Fine-tuning epochs.

  • batch_size (int, default=8) – Training batch size.

  • max_length (int, default=256) – Maximum token length for serialized rows.

  • temperature (float, default=0.7) – Sampling temperature for generation.

  • feature_names (list of str or None, default=None) – Feature names. If None, uses f0, f1, ....

  • label_name (str, default='Class') – Name for the target column in serialization.

  • device (str, default='auto') – Computation device.

  • random_state (int or None, default=None) – Random seed.

  • verbose (bool, default=False) – Enable verbose output.
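
Row serialization can be sketched as below. The "name is value" template and placing the label first (so that generation can be prompted with the label prefix) are assumptions modelled on the GReaT approach; the exact template used by this class is not shown here, and the helper name serialize_row is hypothetical.

```python
def serialize_row(values, label, feature_names=None, label_name="Class"):
    """Turn one tabular row into a sentence-like string."""
    if feature_names is None:
        # Matches the documented fallback names f0, f1, ...
        feature_names = [f"f{i}" for i in range(len(values))]
    parts = [f"{label_name} is {label}"]
    parts += [f"{name} is {value}" for name, value in zip(feature_names, values)]
    return ", ".join(parts)

text = serialize_row([1.5, 2], label=1)
# Prompt prefix for generating a new minority (label 1) row.
prompt = "Class is 1,"
```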

References

GReaT (Borisov et al., 2023), ImbLLM (2025)

fit(X, y)[source]

Fit (validate input).

Return type:

GReaTResampler

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)

fit_resample(X, y)[source]

Fit and resample using GReaT LLM generation.

Return type:

tuple[ndarray, ndarray]

Parameters:
  • X (ArrayLike)

  • y (ArrayLike)