Preprocessing Guide¶

Endgame’s preprocessing module is built on Polars internally, but accepts pandas DataFrames, numpy arrays, and Polars DataFrames as input. All transformers follow the scikit-learn fit/transform API and are compatible with sklearn Pipeline. Output format mirrors the input format by default (output_format='auto').

import endgame as eg
import numpy as np
import pandas as pd
from endgame.preprocessing import SafeTargetEncoder, AutoBalancer

Encoding¶

Categorical encoding transforms string or integer category columns into numeric representations that models can consume. Endgame provides four encoders covering the most common competition and production patterns.

SafeTargetEncoder¶

Target encoding replaces each category with the mean target value for that category. The naive version leaks target information during training — SafeTargetEncoder prevents this via inner-fold cross-validation: each training sample’s encoding is computed from the other folds only. An M-estimate smoothing term regularizes rare categories toward the global mean.

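Formula: S_i = (n_i * mu_i + m * mu_global) / (n_i + m), where n_i is the number of training samples in category i, mu_i is that category's mean target, mu_global is the overall target mean, and m is the smoothing parameter.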
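To make the smoothing formula concrete, here is a small worked example in plain Python. The helper name `m_estimate` and the numbers are illustrative only, not part of Endgame's API.

```python
def m_estimate(n_i: int, mu_i: float, mu_global: float, m: float) -> float:
    """Smoothed encoding: (n_i * mu_i + m * mu_global) / (n_i + m)."""
    return (n_i * mu_i + m * mu_global) / (n_i + m)


mu_global = 0.30  # illustrative overall target mean
m = 10            # smoothing parameter

# A rare category (2 rows) is pulled strongly toward the global mean.
rare = m_estimate(n_i=2, mu_i=1.0, mu_global=mu_global, m=m)

# A frequent category (1000 rows) stays close to its own mean.
common = m_estimate(n_i=1000, mu_i=1.0, mu_global=mu_global, m=m)

print(round(rare, 4), round(common, 4))
```

With the same raw category mean of 1.0, the rare category lands much nearer to 0.30 than the frequent one, which is the regularizing effect the smoothing term is meant to provide.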

from endgame.preprocessing import SafeTargetEncoder

encoder = SafeTargetEncoder(
    smoothing=10,   # Higher = more regularization for rare categories
    cv=5,           # Inner folds for leakage prevention
    noise_level=0.0,
    handle_unknown='global_mean',  # 'global_mean', 'nan', or 'error'
)

# fit_transform uses inner-fold encoding (no leakage)
X_train_enc = encoder.fit_transform(X_train, y_train)

# transform uses full-data statistics
X_test_enc = encoder.transform(X_test)
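The inner-fold idea can be sketched in plain Python as follows. This is a minimal illustration under simplifying assumptions (a deterministic round-robin fold assignment, a single categorical column, and a hypothetical helper name `out_of_fold_encode`); it is not Endgame's implementation.

```python
from collections import defaultdict


def out_of_fold_encode(cats, y, n_folds=2, m=1.0):
    """Encode each row using target statistics from the other folds only."""
    n = len(cats)
    fold_of = [i % n_folds for i in range(n)]  # round-robin folds (assumption)
    encoded = [0.0] * n
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if fold_of[i] != fold]
        mu_global = sum(y[i] for i in train_idx) / len(train_idx)
        sums, counts = defaultdict(float), defaultdict(int)
        for i in train_idx:
            sums[cats[i]] += y[i]
            counts[cats[i]] += 1
        for i in range(n):
            if fold_of[i] == fold:
                c = cats[i]
                # M-estimate smoothing; an unseen category yields mu_global.
                encoded[i] = (sums[c] + m * mu_global) / (counts[c] + m)
    return encoded


cats = ["a", "a", "b", "b"]
y = [1.0, 0.0, 1.0, 1.0]
encoded = out_of_fold_encode(cats, y)
print(encoded)
```

Row 0 has target 1.0, but its encoding is computed only from rows 1 and 3, so its own label never contributes to its own feature value.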

SafeTargetEncoder auto-detects categorical columns when cols=None. To target specific columns:

encoder = SafeTargetEncoder(cols=['city', 'product_id'], smoothing=20)

LeaveOneOutEncoder¶

LOO encoding excludes the current sample’s own target value when computing the category mean, preventing direct self-leakage without requiring cross-validation folds. Suitable for online learning or settings where full CV is too expensive.

from endgame.preprocessing import LeaveOneOutEncoder

encoder = LeaveOneOutEncoder(smoothing=1.0)
X_train_enc = encoder.fit_transform(X_train, y_train)
X_test_enc = encoder.transform(X_test)
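As a rough illustration of the leave-one-out idea, the sketch below computes each row's category mean with that row's own target removed. It omits smoothing, and the fallback to the global mean for single-row categories is an assumption made for this example, not documented Endgame behavior.

```python
from collections import defaultdict


def leave_one_out_encode(cats, y):
    """Category mean of the target, excluding the current row's own value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(cats, y):
        sums[c] += t
        counts[c] += 1
    mu_global = sum(y) / len(y)
    encoded = []
    for c, t in zip(cats, y):
        if counts[c] > 1:
            encoded.append((sums[c] - t) / (counts[c] - 1))
        else:
            # Singleton category: nothing left after exclusion (assumed fallback).
            encoded.append(mu_global)
    return encoded


cats = ["a", "a", "a", "b"]
y = [1.0, 0.0, 1.0, 1.0]
loo = leave_one_out_encode(cats, y)
print(loo)
```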

CatBoostEncoder¶

Mimics CatBoost’s internal ordered target statistic: for each sample, only the “preceding” samples (in a random permutation) contribute to that sample’s encoding. Prevents leakage without cross-validation overhead.

from endgame.preprocessing import CatBoostEncoder

encoder = CatBoostEncoder(smoothing=1.0, random_state=42)
X_train_enc = encoder.fit_transform(X_train, y_train)
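The ordered target statistic can be sketched in a few lines of plain Python. The function name, the `prior` and `prior_weight` parameters, and the optional explicit `order` argument are assumptions made for illustration; they are not Endgame's or CatBoost's actual signatures.

```python
import random
from collections import defaultdict


def ordered_target_encode(cats, y, prior, prior_weight=1.0, order=None, seed=0):
    """Encode each row from rows that precede it in a permutation."""
    n = len(cats)
    if order is None:
        order = list(range(n))
        random.Random(seed).shuffle(order)  # random permutation of the rows
    sums, counts = defaultdict(float), defaultdict(int)
    encoded = [0.0] * n
    for i in order:
        c = cats[i]
        # Only earlier rows of the same category contribute to this value.
        encoded[i] = (sums[c] + prior_weight * prior) / (counts[c] + prior_weight)
        # The current row's target becomes visible only to later rows.
        sums[c] += y[i]
        counts[c] += 1
    return encoded


cats = ["a", "a", "b", "a"]
y = [1.0, 0.0, 1.0, 1.0]
# A fixed order is passed here so the result is easy to check by hand.
enc = ordered_target_encode(cats, y, prior=0.75, order=[0, 1, 2, 3])
print(enc)
```

The first row of each category sees no history and receives the prior; later rows see progressively more of the category's targets.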

FrequencyEncoder¶

Replaces categories with their frequency (proportion or count) in the training data. Does not require target values — useful for unsupervised settings or as a complement to target encoders.

from endgame.preprocessing import FrequencyEncoder

encoder = FrequencyEncoder(
    normalize=True,         # True = proportions, False = raw counts
    handle_unknown='zero',  # 'zero', 'nan', or 'error'
)
X_enc = encoder.fit_transform(X)
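For intuition, frequency encoding reduces to a lookup table built from the training column. The sketch below uses hypothetical helper names and maps unseen categories to zero, mirroring the 'zero' option described above.

```python
from collections import Counter


def fit_frequencies(train_cats, normalize=True):
    """Build a category -> frequency table from the training column."""
    counts = Counter(train_cats)
    total = len(train_cats)
    return {c: (k / total if normalize else k) for c, k in counts.items()}


def apply_frequencies(cats, table, unknown=0.0):
    """Look up each category; unseen categories get the `unknown` value."""
    return [table.get(c, unknown) for c in cats]


table = fit_frequencies(["x", "x", "y", "z"])
encoded = apply_frequencies(["x", "y", "w"], table)  # "w" was never seen
print(encoded)
```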

Imputation¶

Missing value imputation fills np.nan entries before model training. Endgame provides four imputers from fast-and-simple to thorough-and-slow, plus an AutoImputer that selects a strategy based on the fraction of missing values.

MICEImputer¶

Multiple Imputation by Chained Equations iteratively models each feature as a function of all other features using BayesianRidge by default. Handles arbitrary missingness patterns and is the standard choice for datasets with moderate-to-heavy missingness.

from endgame.preprocessing import MICEImputer

imputer = MICEImputer(
    max_iter=10,
    initial_strategy='median',
    random_state=42,
    add_indicator=False,  # Set True to append binary missing-indicator columns
)
X_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)

To use a custom estimator (e.g., a random forest):

from sklearn.ensemble import RandomForestRegressor

imputer = MICEImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=42),
    max_iter=5,
)
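The chained-equations loop itself can be illustrated on a two-column toy dataset. This sketch swaps BayesianRidge for ordinary least squares on a single predictor and uses a mean initial fill, so it shows only the iteration scheme; the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return 0.0, my
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def chained_impute(rows, n_iter=10):
    """rows: list of [a, b] pairs, with None marking a missing value."""
    missing = [[v is None for v in r] for r in rows]
    filled = [list(r) for r in rows]
    for j in range(2):  # initial fill with the column mean
        observed = [r[j] for r in rows if r[j] is not None]
        col_mean = sum(observed) / len(observed)
        for r in filled:
            if r[j] is None:
                r[j] = col_mean
    for _ in range(n_iter):
        for j in range(2):  # model column j as a function of the other column
            k = 1 - j
            xs = [r[k] for r, m in zip(filled, missing) if not m[j]]
            ys = [r[j] for r, m in zip(filled, missing) if not m[j]]
            slope, intercept = fit_line(xs, ys)
            for r, m in zip(filled, missing):
                if m[j]:
                    r[j] = slope * r[k] + intercept
    return filled


rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, None], [None, 10.0]]
result = chained_impute(rows)
print(result[3], result[4])
```

In this toy data the second column is twice the first, and successive iterations move the imputed values away from the initial column means toward that relationship.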

MissForestImputer¶

Uses a RandomForestRegressor as the MICE estimator. Non-parametric and robust to non-linear relationships between features. Slower than MICEImputer with BayesianRidge but often more accurate.

from endgame.preprocessing import MissForestImputer

imputer = MissForestImputer(
    n_estimators=100,
    max_iter=10,
    n_jobs=-1,        # Use all cores
    random_state=42,
)
X_imputed = imputer.fit_transform(X_train)

KNNImputer¶

Fills missing values using the mean of the k nearest observed neighbors. Effective when local structure in the data is informative.

from endgame.preprocessing import KNNImputer

imputer = KNNImputer(
    n_neighbors=5,
    weights='uniform',   # 'uniform' or 'distance'
    add_indicator=False,
)
X_imputed = imputer.fit_transform(X_train)

AutoImputer¶

Inspects the overall fraction of missing values and selects a strategy automatically:

Less than 5% missing: SimpleImputer (median fill, fast)
5–30% missing: KNNImputer
More than 30% missing: MICEImputer

Also runs an approximate Little’s MCAR test and exposes the detected missingness type (MCAR, MAR, or MNAR) via missingness_type_.
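The threshold rule listed above can be written as a small function. The sketch assumes the boundaries fall exactly as the list states (5% and 30% both map to KNN) and uses hypothetical helper names; the MCAR test is not reproduced here.

```python
import math


def missing_fraction(rows):
    """Fraction of NaN cells across all rows and columns."""
    total = sum(len(r) for r in rows)
    missing = sum(1 for r in rows for v in r if math.isnan(v))
    return missing / total


def choose_strategy(fraction):
    """Map the overall missing fraction to an imputation strategy name."""
    if fraction < 0.05:
        return "simple"
    if fraction <= 0.30:
        return "knn"
    return "mice"


nan = float("nan")
rows = [
    [1.0, 2.0],
    [nan, 4.0],
    [5.0, 6.0],
    [7.0, nan],
    [9.0, 10.0],
]
fraction = missing_fraction(rows)
print(fraction, choose_strategy(fraction))
```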

from endgame.preprocessing import AutoImputer

imputer = AutoImputer(strategy='auto', random_state=42)
X_imputed = imputer.fit_transform(X_train)

print(imputer.selected_strategy_)    # e.g. 'knn'
print(imputer.missingness_fraction_) # e.g. 0.12
print(imputer.missingness_type_)     # e.g. 'MAR'

To force a specific strategy, set strategy to 'simple', 'knn', 'mice', or 'missforest'.

Class Balancing¶

Imbalanced datasets often benefit from resampling before training. Endgame wraps imbalanced-learn with competition-tuned defaults. All resamplers expose a fit_resample(X, y) method and return (X_resampled, y_resampled).

Requires imbalanced-learn: pip install imbalanced-learn.

SMOTEResampler¶

SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic minority samples by interpolating between a sample and one of its k nearest neighbors.

from endgame.preprocessing import SMOTEResampler

smote = SMOTEResampler(
    k_neighbors=5,
    sampling_strategy='auto',  # 'auto', 'minority', float, or dict
    random_state=42,
)
X_res, y_res = smote.fit_resample(X_train, y_train)
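The interpolation step at the heart of SMOTE can be sketched without any dependencies. The helper below is a simplified illustration (brute-force neighbor search, one synthetic point per call, hypothetical function name), not the imbalanced-learn or Endgame implementation.

```python
import math
import random


def smote_sample(minority, k, rng):
    """Create one synthetic point between a minority sample and a neighbor."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
rng = random.Random(42)
synthetic = [smote_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Because every synthetic point lies on a segment between two existing minority samples, it always stays inside their bounding box.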

ADASYNResampler¶

Adaptive Synthetic Sampling (ADASYN) generates more synthetic samples for minority examples that are harder to learn, typically those with many majority-class neighbors. This focuses over-sampling effort on hard-to-classify regions rather than spreading it uniformly.

from endgame.preprocessing import ADASYNResampler

adasyn = ADASYNResampler(
    sampling_strategy='auto',
    n_neighbors=5,
    random_state=42,
)
X_res, y_res = adasyn.fit_resample(X_train, y_train)
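The adaptive part of ADASYN is usually described as a per-sample weight: the share of majority-class points among each minority sample's nearest neighbors, normalized to sum to one. The following sketch computes those weights on toy data; the function name and the brute-force neighbor search are illustrative assumptions.

```python
import math


def adasyn_weights(minority, majority, k):
    """Share of synthetic samples to generate around each minority point."""
    ratios = []
    for p in minority:
        # (distance, label) pairs: label 1 marks a majority-class neighbor.
        candidates = [(math.dist(p, q), 0) for q in minority if q is not p]
        candidates += [(math.dist(p, q), 1) for q in majority]
        nearest = sorted(candidates)[:k]
        ratios.append(sum(label for _, label in nearest) / k)
    total = sum(ratios)
    if total == 0:
        return [1 / len(ratios)] * len(ratios)  # no hard points: uniform
    return [r / total for r in ratios]


minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
majority = [(5.0, 5.1), (5.1, 5.0), (4.9, 5.0), (10.0, 10.0)]
weights = adasyn_weights(minority, majority, k=2)
print(weights)
```

The three clustered minority points have only minority neighbors and receive zero weight, while the isolated point surrounded by majority samples receives all of it.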

BorderlineSMOTEResampler¶

Only generates synthetic samples from minority instances near the decision boundary (borderline instances). Avoids wasting capacity on clearly separable minority samples.

from endgame.preprocessing import BorderlineSMOTEResampler

bsmote = BorderlineSMOTEResampler(
    k_neighbors=5,
    m_neighbors=10,
    kind='borderline-1',  # 'borderline-1' or 'borderline-2'
    random_state=42,
)
X_res, y_res = bsmote.fit_resample(X_train, y_train)

Geometric Samplers¶

MultivariateGaussianSMOTE and SimplicialSMOTE generate synthetic samples that stay within the convex geometry of the minority class, avoiding extrapolation beyond the observed manifold. No additional dependencies required.

from endgame.preprocessing import MultivariateGaussianSMOTE, SimplicialSMOTE

geo_smote = MultivariateGaussianSMOTE(sampling_strategy='auto', random_state=42)
X_res, y_res = geo_smote.fit_resample(X_train, y_train)

AutoBalancer¶

Selects a resampling strategy automatically based on the imbalance ratio and dataset size. Evaluates multiple strategies and picks the best for the current dataset.

from endgame.preprocessing import AutoBalancer

balancer = AutoBalancer(strategy='auto', random_state=42)
X_res, y_res = balancer.fit_resample(X_train, y_train)

Pipeline Integration¶

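Use imblearn.pipeline.Pipeline (not sklearn.pipeline.Pipeline) to combine resamplers with estimators. The imbalanced-learn pipeline applies resampling during fit only, so prediction data passes through unchanged: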

from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from endgame.preprocessing import SMOTEResampler

pipe = Pipeline([
    ('smote', SMOTEResampler(k_neighbors=5, random_state=42)),
    ('clf', RandomForestClassifier(n_estimators=200)),
])
pipe.fit(X_train, y_train)

For a complete list of available samplers, inspect endgame.preprocessing.ALL_SAMPLERS.

Feature Engineering¶

AutoAggregator¶

Generates group-level aggregation features (“magic features”) that capture entity-level statistics. A core technique in many Kaggle winning solutions.

from endgame.preprocessing import AutoAggregator

agg = AutoAggregator(
    group_cols=['customer_id'],
    agg_cols=['amount', 'quantity'],         # None = all numeric columns
    methods=['mean', 'std', 'min', 'max', 'skew'],
    rank_features=True,   # Adds within-group rank (key Optiver technique)
    diff_features=True,   # Adds deviation from group mean
    ratio_features=False, # Adds ratio to group mean
)
X_agg = agg.fit_transform(X_train)

Multi-key grouping:

agg = AutoAggregator(
    group_cols=['store_id', 'product_category'],
    agg_cols=['sales'],
    methods=['mean', 'sum', 'count'],
)

Generated column names follow the pattern {group}_{col}_{method}, e.g. customer_id_amount_mean.
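As an illustration of what a group-level feature looks like, the sketch below adds a group mean using the naming pattern described above, plus a deviation-from-mean column. The `_diff` suffix and the helper name are assumptions for this example; the names Endgame generates for diff features may differ.

```python
from collections import defaultdict
from statistics import mean


def add_group_mean(rows, group_col, value_col):
    """Attach the group mean and the deviation from it to every row."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[group_col]].append(r[value_col])
    means = {g: mean(values) for g, values in groups.items()}
    mean_name = f"{group_col}_{value_col}_mean"
    diff_name = f"{group_col}_{value_col}_diff"  # hypothetical column name
    out = []
    for r in rows:
        new_row = dict(r)
        new_row[mean_name] = means[r[group_col]]
        new_row[diff_name] = r[value_col] - means[r[group_col]]
        out.append(new_row)
    return out


rows = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "c1", "amount": 30.0},
    {"customer_id": "c2", "amount": 5.0},
]
features = add_group_mean(rows, "customer_id", "amount")
print(features[0])
```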

InteractionFeatures¶

Creates pairwise interaction terms (products and ratios) between numeric features.

from endgame.preprocessing import InteractionFeatures

interactions = InteractionFeatures(
    cols=['feature_a', 'feature_b', 'feature_c'],
    include_products=True,
    include_ratios=True,
)
X_inter = interactions.fit_transform(X_train)
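Pairwise products and ratios can be sketched for a single row as shown below. The output names (`_x_`, `_div_`) and the NaN result for a zero denominator are assumptions made for this example, not Endgame's documented conventions.

```python
from itertools import combinations


def pairwise_interactions(row, cols):
    """Products and ratios for every unordered pair of columns."""
    out = {}
    for a, b in combinations(cols, 2):
        out[f"{a}_x_{b}"] = row[a] * row[b]
        # Guard against division by zero (assumed handling).
        out[f"{a}_div_{b}"] = row[a] / row[b] if row[b] != 0 else float("nan")
    return out


row = {"feature_a": 2.0, "feature_b": 4.0, "feature_c": 0.5}
inter = pairwise_interactions(row, ["feature_a", "feature_b", "feature_c"])
print(inter)
```

Three input columns yield three pairs and therefore six new features, so the feature count grows quadratically with the number of columns selected.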

Temporal Feature Extraction¶

TemporalFeatures extracts datetime components and cyclical encodings from datetime columns. Cyclical encodings (sin/cos) place periodic features such as hour-of-day or month on a circle, so that values adjacent across the wrap-around (for example December and January) stay close in feature space.

from endgame.preprocessing import TemporalFeatures

tf = TemporalFeatures(
    datetime_cols=['timestamp'],  # None = auto-detect
    cyclical=True,                # Adds sin/cos for periodic features
    drop_original=False,
)
X_temporal = tf.fit_transform(X_train)

Extracted features include: year, month, day, dayofweek, hour, minute, quarter, week_of_year, day_of_year, is_weekend, is_month_start, is_month_end, and cyclical sin/cos variants for periodic components.
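The effect of the sin/cos encoding is easy to verify numerically. The sketch assumes months are numbered 1 to 12 and mapped onto a circle with period 12; the exact mapping Endgame uses is not specified here.

```python
import math


def cyclical(value, period):
    """Map a periodic value to a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


dec = cyclical(12, 12)
jan = cyclical(1, 12)
jul = cyclical(7, 12)

dec_to_jan = math.dist(dec, jan)  # neighbors across the year boundary
jan_to_jul = math.dist(jan, jul)  # opposite sides of the year

print(round(dec_to_jan, 3), round(jan_to_jul, 3))
```

On the raw 1 to 12 scale December and January are 11 units apart, while on the circle they are as close as any other pair of consecutive months.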

For time series contexts, LagFeatures and RollingFeatures are also available:

from endgame.preprocessing import LagFeatures, RollingFeatures

lags = LagFeatures(cols=['value'], lags=[1, 7, 28])
rolls = RollingFeatures(cols=['value'], windows=[7, 28], methods=['mean', 'std'])

X = lags.fit_transform(X)
X = rolls.fit_transform(X)
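For reference, lag and rolling features on a single series can be computed as below. This sketch assumes the rolling window includes the current row and that positions without enough history are left as None; whether Endgame's transformers follow the same conventions is not stated here.

```python
def lag(values, k):
    """Shift the series forward by k steps, padding the start with None."""
    n = len(values)
    return [None] * min(k, n) + list(values[: max(n - k, 0)])


def rolling_mean(values, window):
    """Mean over the current value and the window - 1 values before it."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


series = [1.0, 2.0, 3.0, 4.0, 5.0]
lag_1 = lag(series, 1)
roll_3 = rolling_mean(series, 3)
print(lag_1, roll_3)
```

Both features depend on row order, so the data should be sorted by time (and by entity, if there are several series) before they are computed.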

Noise Detection¶

ConfidentLearningFilter¶

Implements the Confident Learning algorithm (Northcutt et al., 2021) to identify mislabeled training examples. Cross-validated predicted probabilities are used to estimate the joint distribution of noisy and true labels.

from endgame.preprocessing import ConfidentLearningFilter

clf = ConfidentLearningFilter(
    base_estimator='rf',         # 'rf', 'xgboost', 'lgbm', or any sklearn classifier
    cv=5,
    threshold='auto',            # 'auto' = per-class average probability
    method='prune_by_class',     # 'prune_by_class', 'prune_by_noise_rate', or 'both'
    random_state=42,
)

noise_mask = clf.fit_detect(X_train, y_train)
print(f"Detected {noise_mask.sum()} noisy labels out of {len(y_train)}")

X_clean = X_train[~noise_mask]
y_clean = y_train[~noise_mask]
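A heavily simplified sketch of the per-class threshold idea follows. It takes predicted probabilities that are assumed to be out-of-fold, uses each class's average self-confidence as its threshold, and flags a sample when its confidently predicted class differs from its given label. It is an illustration only, not the full algorithm from the paper and not Endgame's implementation.

```python
def flag_label_issues(labels, probs):
    """Return one boolean per sample; True marks a suspected label error."""
    n_classes = len(probs[0])
    thresholds = []
    for k in range(n_classes):
        # Average predicted probability of class k among samples labeled k.
        own = [p[k] for p, y in zip(probs, labels) if y == k]
        thresholds.append(sum(own) / len(own))
    flags = []
    for p, y in zip(probs, labels):
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            likely = max(confident, key=lambda k: p[k])
            flags.append(likely != y)
        else:
            flags.append(False)  # no class is confidently predicted
    return flags


labels = [0, 0, 0, 1, 1, 1]
probs = [
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],  # labeled 0, but the model is confident it is class 1
    [0.15, 0.85],
    [0.10, 0.90],
    [0.30, 0.70],
]
flags = flag_label_issues(labels, probs)
print(flags)
```

The sketch requires every class to appear at least once in the labels, since each threshold is an average over that class's samples.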

ConsensusFilter and CrossValNoiseDetector are also available for ensemble-based noise detection with multiple base estimators voting on which labels are suspect.

Target Transformation¶

For regression tasks, transforming a skewed target distribution toward normality often improves model performance. TargetTransformer wraps any sklearn regressor and applies an invertible transform to y before fitting, then inverse-transforms predictions automatically.

Supported Transforms¶

Method	Requirements	Notes
`'log'`	All targets > 0	Fast, interpretable
`'log1p'`	All targets >= 0	Handles zeros
`'sqrt'`	All targets >= 0	Mild compression
`'box_cox'`	All targets > 0	Optimal power transform
`'yeo_johnson'`	Any targets	Works with negatives
`'quantile'`	Any targets	Maps to normal distribution
`'auto'`	Any targets	Selects via Shapiro-Wilk test

from endgame.preprocessing import TargetTransformer
from sklearn.ensemble import GradientBoostingRegressor

model = TargetTransformer(
    regressor=GradientBoostingRegressor(),
    method='yeo_johnson',  # or 'auto' to select automatically
)

model.fit(X_train, y_train)
preds = model.predict(X_test)  # Automatically inverse-transformed
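The wrap-and-invert pattern can be illustrated with a toy regressor that always predicts the mean of its training targets. Both classes below are hypothetical stand-ins written for this example; they only show how fitting on log1p(y) and inverting with expm1 changes the prediction.

```python
import math


class MeanRegressor:
    """Toy model: predicts the mean of the targets it was fitted on."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class Log1pTargetWrapper:
    """Fit the wrapped regressor on log1p(y); invert predictions with expm1."""

    def __init__(self, regressor):
        self.regressor = regressor

    def fit(self, X, y):
        self.regressor.fit(X, [math.log1p(v) for v in y])
        return self

    def predict(self, X):
        return [math.expm1(p) for p in self.regressor.predict(X)]


X = [[0], [1], [2]]
y = [0.0, 9.0, 99.0]  # right-skewed target

plain = MeanRegressor().fit(X, y).predict([[3]])
wrapped = Log1pTargetWrapper(MeanRegressor()).fit(X, y).predict([[3]])
print(plain[0], round(wrapped[0], 6))
```

The plain model predicts the arithmetic mean, which the largest target dominates, while the wrapped model averages on the log scale and returns a value back on the original scale.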

Using method='auto' runs a Shapiro-Wilk normality test on the target and selects the transform that most improves normality:

model = TargetTransformer(
    regressor=GradientBoostingRegressor(),
    method='auto',
)
model.fit(X_train, y_train)
print(model.selected_method_)  # e.g. 'log1p'

TargetQuantileTransformer applies quantile normalization as a standalone transformer (without wrapping a regressor):

from endgame.preprocessing import TargetQuantileTransformer

qt = TargetQuantileTransformer(output_distribution='normal')
y_transformed = qt.fit_transform(y_train)
y_pred_original = qt.inverse_transform(y_pred_transformed)

API Reference¶

See the API Reference for the full parameter list of each class. The preprocessing module exports:

Encoding: SafeTargetEncoder, LeaveOneOutEncoder, CatBoostEncoder, FrequencyEncoder
Imputation: SimpleImputer, IndicatorImputer, KNNImputer, MICEImputer, MissForestImputer, AutoImputer
Class Balancing: SMOTEResampler, BorderlineSMOTEResampler, ADASYNResampler, SVMSMOTEResampler, KMeansSMOTEResampler, MultivariateGaussianSMOTE, SimplicialSMOTE, AutoBalancer, and 10+ under-sampling and combined methods
Feature Engineering: AutoAggregator, InteractionFeatures, RankFeatures, TemporalFeatures, LagFeatures, RollingFeatures
Noise Detection: ConfidentLearningFilter, ConsensusFilter, CrossValNoiseDetector
Target Transformation: TargetTransformer, TargetQuantileTransformer
Feature Selection: AdversarialFeatureSelector, PermutationImportanceSelector, NullImportanceSelector
Discretization: BayesianDiscretizer