# Preprocessing Guide

Endgame's preprocessing module is built on Polars internally, but accepts pandas DataFrames, numpy arrays, and Polars DataFrames as input. All transformers follow the scikit-learn `fit`/`transform` API and are compatible with sklearn `Pipeline`. Output format mirrors the input format by default (`output_format='auto'`).

```python
import endgame as eg
import numpy as np
import pandas as pd
from endgame.preprocessing import SafeTargetEncoder, AutoBalancer
```

---

## Encoding

Categorical encoding transforms string or integer category columns into numeric representations that models can consume. Endgame provides four encoders covering the most common competition and production patterns.

### SafeTargetEncoder

Target encoding replaces each category with the mean target value for that category. Computed naively on the training set, each row's encoding includes that row's own target, which leaks label information into the features. `SafeTargetEncoder` prevents this via inner-fold cross-validation: each training sample's encoding is computed from the other folds only. An M-estimate smoothing term regularizes rare categories toward the global mean.

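Formula: `S_i = (n_i * mu_i + m * mu_global) / (n_i + m)`, where `n_i` is the number of training rows in category `i`, `mu_i` is the mean target of that category, `mu_global` is the overall mean target, and `m` is the `smoothing` parameter.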
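
To make the effect of `smoothing` concrete, here is a small worked example of the formula in plain Python. The helper name `m_estimate` and the numbers are illustrative assumptions, not part of Endgame's API.

```python
def m_estimate(n_i: int, mu_i: float, mu_global: float, m: float) -> float:
    """Smoothed category mean: shrinks mu_i toward mu_global with weight m."""
    return (n_i * mu_i + m * mu_global) / (n_i + m)


mu_global = 0.20  # hypothetical global target mean
m = 10            # plays the role of the `smoothing` parameter

# Two categories with the same raw mean (1.0) but very different support.
rare = m_estimate(n_i=2, mu_i=1.0, mu_global=mu_global, m=m)
common = m_estimate(n_i=500, mu_i=1.0, mu_global=mu_global, m=m)

print(round(rare, 4))    # 0.3333
print(round(common, 4))  # 0.9843
```

The category seen only twice is pulled most of the way back to the global mean, while the well-supported category keeps an encoding close to its raw mean.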

```python
from endgame.preprocessing import SafeTargetEncoder

encoder = SafeTargetEncoder(
    smoothing=10,   # Higher = more regularization for rare categories
    cv=5,           # Inner folds for leakage prevention
    noise_level=0.0,
    handle_unknown='global_mean',  # 'global_mean', 'nan', or 'error'
)

# fit_transform uses inner-fold encoding (no leakage)
X_train_enc = encoder.fit_transform(X_train, y_train)

# transform uses full-data statistics
X_test_enc = encoder.transform(X_test)
```

`SafeTargetEncoder` auto-detects categorical columns when `cols=None`. To target specific columns:

```python
encoder = SafeTargetEncoder(cols=['city', 'product_id'], smoothing=20)
```
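
The inner-fold idea behind `fit_transform` can be sketched without any library code. The function below is a simplified illustration, assuming deterministic round-robin folds and a single categorical column; `oof_target_encode` is a hypothetical name and this is not Endgame's implementation.

```python
from collections import defaultdict


def oof_target_encode(cats, y, n_folds=2, m=1.0):
    """Encode each row using smoothed category means from the other folds only."""
    n = len(cats)
    fold_of = [i % n_folds for i in range(n)]  # round-robin folds for the sketch
    encoded = [0.0] * n
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if fold_of[i] != fold]
        mu_global = sum(y[i] for i in train_idx) / len(train_idx)
        sums, counts = defaultdict(float), defaultdict(int)
        for i in train_idx:
            sums[cats[i]] += y[i]
            counts[cats[i]] += 1
        for i in range(n):
            if fold_of[i] == fold:
                c = cats[i]
                # A category unseen in the other folds reduces to mu_global.
                encoded[i] = (sums[c] + m * mu_global) / (counts[c] + m)
    return encoded


cats = ["a", "a", "b", "b", "a", "b"]
y = [1, 0, 1, 1, 0, 0]
enc = oof_target_encode(cats, y)
```

Row 0 has target `1`, yet its encoding is built only from rows 1, 3, and 5, so its own label never contributes to its feature value.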

### LeaveOneOutEncoder

LOO encoding excludes the current sample's own target value when computing the category mean, preventing direct self-leakage without requiring cross-validation folds. Suitable for online learning or settings where full CV is too expensive.

```python
from endgame.preprocessing import LeaveOneOutEncoder

encoder = LeaveOneOutEncoder(smoothing=1.0)
X_train_enc = encoder.fit_transform(X_train, y_train)
X_test_enc = encoder.transform(X_test)
```
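
As a rough illustration of the leave-one-out idea, the sketch below computes unsmoothed LOO means in plain Python. The singleton fallback to the global mean is an assumption made for this example; the `smoothing` parameter of `LeaveOneOutEncoder` is not modeled here.

```python
from collections import defaultdict


def leave_one_out_encode(cats, y):
    """For each row, average the targets of the other rows in its category."""
    mu_global = sum(y) / len(y)
    sums, counts = defaultdict(float), defaultdict(int)
    for c, t in zip(cats, y):
        sums[c] += t
        counts[c] += 1
    encoded = []
    for c, t in zip(cats, y):
        if counts[c] > 1:
            # Remove this row's own target before averaging.
            encoded.append((sums[c] - t) / (counts[c] - 1))
        else:
            # A singleton category has no other rows; fall back to the global mean.
            encoded.append(mu_global)
    return encoded


cats = ["a", "a", "a", "b"]
y = [1.0, 0.0, 1.0, 1.0]
loo = leave_one_out_encode(cats, y)
print(loo)  # [0.5, 1.0, 0.5, 0.75]
```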

### CatBoostEncoder

Mimics CatBoost's internal ordered target statistic: for each sample, only the "preceding" samples (in a random permutation) contribute to that sample's encoding. Prevents leakage without cross-validation overhead.

```python
from endgame.preprocessing import CatBoostEncoder

encoder = CatBoostEncoder(smoothing=1.0, random_state=42)
X_train_enc = encoder.fit_transform(X_train, y_train)
```
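
The ordered target statistic can be sketched as a single pass over a random permutation. This is a simplified illustration: the function name, the `prior` argument, and the prior weight `a` are assumptions for the example and may not match how `CatBoostEncoder` parameterizes its smoothing.

```python
import random


def ordered_target_encode(cats, y, prior, a=1.0, seed=42):
    """Encode each row from rows that precede it in a random permutation."""
    order = list(range(len(cats)))
    random.Random(seed).shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * len(cats)
    for i in order:
        c = cats[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        encoded[i] = (s + a * prior) / (n + a)  # uses earlier rows only
        sums[c], counts[c] = s + y[i], n + 1    # row i becomes visible afterwards
    return encoded, order


cats = ["a", "a", "b", "a"]
y = [1, 0, 1, 1]
enc, order = ordered_target_encode(cats, y, prior=0.5)
```

The first row in the permutation has no history, so its encoding equals the prior. Later rows in the same category see progressively more targets, but never their own.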

### FrequencyEncoder

Replaces categories with their frequency (proportion or count) in the training data. It does not require target values, which makes it useful in unsupervised settings or as a complement to target encoders.

```python
from endgame.preprocessing import FrequencyEncoder

encoder = FrequencyEncoder(
    normalize=True,         # True = proportions, False = raw counts
    handle_unknown='zero',  # 'zero', 'nan', or 'error'
)
X_enc = encoder.fit_transform(X)
```
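
For intuition, frequency encoding amounts to a lookup table built from training counts. The helpers below are a minimal sketch with hypothetical names; mapping unseen categories to `0.0` is meant to mirror the `handle_unknown='zero'` option.

```python
from collections import Counter


def fit_frequencies(train_cats, normalize=True):
    """Build a category -> frequency table from training values."""
    counts = Counter(train_cats)
    total = len(train_cats)
    return {c: (n / total if normalize else n) for c, n in counts.items()}


def apply_frequencies(cats, freq, unknown=0.0):
    """Look up each category; unseen categories get the `unknown` value."""
    return [freq.get(c, unknown) for c in cats]


freq = fit_frequencies(["x", "x", "y", "z"])
encoded = apply_frequencies(["x", "y", "w"], freq)
print(encoded)  # [0.5, 0.25, 0.0]
```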

---

## Imputation

Missing value imputation fills `np.nan` entries before model training. Endgame provides four imputers, ranging from fast and simple to thorough and slow, plus an `AutoImputer` that selects a strategy based on the fraction of missing values.

### MICEImputer

Multiple Imputation by Chained Equations (MICE) iteratively models each feature as a function of all other features, using `BayesianRidge` as the default estimator. It handles arbitrary missingness patterns and is a common choice for datasets with moderate-to-heavy missingness.

```python
from endgame.preprocessing import MICEImputer

imputer = MICEImputer(
    max_iter=10,
    initial_strategy='median',
    random_state=42,
    add_indicator=False,  # Set True to append binary missing-indicator columns
)
X_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
```

To use a custom estimator (e.g., a random forest):

```python
from sklearn.ensemble import RandomForestRegressor

imputer = MICEImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=42),
    max_iter=5,
)
```

### MissForestImputer

Uses a `RandomForestRegressor` as the MICE estimator. Non-parametric and robust to non-linear relationships between features. Slower than `MICEImputer` with BayesianRidge but often more accurate.

```python
from endgame.preprocessing import MissForestImputer

imputer = MissForestImputer(
    n_estimators=100,
    max_iter=10,
    n_jobs=-1,        # Use all cores
    random_state=42,
)
X_imputed = imputer.fit_transform(X_train)
```

### KNNImputer

Fills missing values using the mean of the k nearest observed neighbors. Effective when local structure in the data is informative.

```python
from endgame.preprocessing import KNNImputer

imputer = KNNImputer(
    n_neighbors=5,
    weights='uniform',   # 'uniform' or 'distance'
    add_indicator=False,
)
X_imputed = imputer.fit_transform(X_train)
```
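
The nearest-neighbor fill can be sketched in plain Python as follows. This is an illustration under stated assumptions: uniform weights, a NaN-aware distance based on the mean squared difference over columns observed in both rows, and at least one observed value per column. It is not the library's implementation.

```python
import math


def knn_impute(rows, k=2):
    """Fill each NaN with the mean of that column over the k nearest rows."""
    def dist(a, b):
        # Root mean squared difference over columns observed in both rows.
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if not math.isnan(x) and not math.isnan(y)]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if not math.isnan(value):
                continue
            # Candidate donors: other rows where column j is observed.
            donors = [(dist(row, other), other[j])
                      for h, other in enumerate(rows)
                      if h != i and not math.isnan(other[j])]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            filled[i][j] = sum(nearest) / len(nearest)
    return filled


nan = float("nan")
X = [[1.0, 10.0], [1.1, 12.0], [5.0, 50.0], [1.05, nan]]
X_filled = knn_impute(X, k=2)
print(X_filled[3])  # [1.05, 11.0]
```

The missing value is filled from the two rows closest in the first column (10.0 and 12.0), while the distant row with 50.0 is ignored.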

### AutoImputer

Inspects the overall fraction of missing values and selects a strategy automatically:

- Less than 5% missing: `SimpleImputer` (median fill, fast)
- 5–30% missing: `KNNImputer`
- More than 30% missing: `MICEImputer`

Also runs an approximate Little's MCAR test and exposes the detected missingness type (`MCAR`, `MAR`, or `MNAR`) via `missingness_type_`.

```python
from endgame.preprocessing import AutoImputer

imputer = AutoImputer(strategy='auto', random_state=42)
X_imputed = imputer.fit_transform(X_train)

print(imputer.selected_strategy_)    # e.g. 'knn'
print(imputer.missingness_fraction_) # e.g. 0.12
print(imputer.missingness_type_)     # e.g. 'MAR'
```

To force a specific strategy: `strategy='simple'`, `'knn'`, `'mice'`, or `'missforest'`.
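
The threshold rule listed above can be written as a few lines of plain Python. The function names are hypothetical, and treating exactly 5% and exactly 30% as part of the KNN band is an assumption about the boundary handling.

```python
import math


def missing_fraction(rows):
    """Overall fraction of NaN cells in a list of numeric rows."""
    cells = [v for row in rows for v in row]
    return sum(math.isnan(v) for v in cells) / len(cells)


def choose_strategy(fraction):
    """Map the missing fraction to a strategy name using the documented bands."""
    if fraction < 0.05:
        return "simple"
    if fraction <= 0.30:
        return "knn"
    return "mice"


nan = float("nan")
X = [[1.0, nan], [2.0, 3.0], [4.0, 5.0], [6.0, 7.0]]
frac = missing_fraction(X)
print(frac, choose_strategy(frac))  # 0.125 knn
```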

---

## Class Balancing

Imbalanced datasets often benefit from resampling before training. Endgame wraps imbalanced-learn with competition-tuned defaults. All resamplers expose a `fit_resample(X, y)` method and return `(X_resampled, y_resampled)`.

Requires `imbalanced-learn`: `pip install imbalanced-learn`.

### SMOTEResampler

SMOTE (Synthetic Minority Over-sampling Technique) generates synthetic minority samples by interpolating between a sample and one of its k nearest neighbors.

```python
from endgame.preprocessing import SMOTEResampler

smote = SMOTEResampler(
    k_neighbors=5,
    sampling_strategy='auto',  # 'auto', 'minority', float, or dict
    random_state=42,
)
X_res, y_res = smote.fit_resample(X_train, y_train)
```
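
The interpolation step at the heart of SMOTE is short enough to sketch directly. The function below generates a single synthetic point from a list of minority samples; its name and return values are illustrative assumptions, and a real resampler repeats this until the requested class balance is reached.

```python
import math
import random


def smote_sample(minority, k=2, seed=42):
    """Create one synthetic point between a minority sample and a near neighbor."""
    rng = random.Random(seed)
    base = rng.choice(minority)
    # k nearest minority neighbors of the chosen base point (excluding itself).
    neighbors = sorted((p for p in minority if p is not base),
                       key=lambda p: math.dist(base, p))[:k]
    neighbor = rng.choice(neighbors)
    lam = rng.random()  # interpolation weight in [0, 1)
    synthetic = [b + lam * (n - b) for b, n in zip(base, neighbor)]
    return base, neighbor, synthetic


minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [3.0, 3.0]]
base, neighbor, synthetic = smote_sample(minority)
```

Because the new point is a convex combination of two existing minority samples, it always lies on the line segment between them.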

### ADASYNResampler

Adaptive Synthetic Sampling (ADASYN) generates more synthetic samples for minority examples that have many majority-class neighbors. This focuses over-sampling effort on hard-to-classify regions near the class boundary.

```python
from endgame.preprocessing import ADASYNResampler

adasyn = ADASYNResampler(
    sampling_strategy='auto',
    n_neighbors=5,
    random_state=42,
)
X_res, y_res = adasyn.fit_resample(X_train, y_train)
```

### BorderlineSMOTEResampler

Only generates synthetic samples from minority instances near the decision boundary (borderline instances). Avoids wasting capacity on clearly separable minority samples.

```python
from endgame.preprocessing import BorderlineSMOTEResampler

bsmote = BorderlineSMOTEResampler(
    k_neighbors=5,
    m_neighbors=10,
    kind='borderline-1',  # 'borderline-1' or 'borderline-2'
    random_state=42,
)
X_res, y_res = bsmote.fit_resample(X_train, y_train)
```

### Geometric Samplers

`MultivariateGaussianSMOTE` and `SimplicialSMOTE` generate synthetic samples that stay within the convex geometry of the minority class, avoiding extrapolation beyond the observed manifold. No additional dependencies required.

```python
from endgame.preprocessing import MultivariateGaussianSMOTE, SimplicialSMOTE

geo_smote = MultivariateGaussianSMOTE(sampling_strategy='auto', random_state=42)
X_res, y_res = geo_smote.fit_resample(X_train, y_train)
```

### AutoBalancer

Selects a resampling strategy automatically based on the imbalance ratio and dataset size. Evaluates multiple strategies and picks the best for the current dataset.

```python
from endgame.preprocessing import AutoBalancer

balancer = AutoBalancer(strategy='auto', random_state=42)
X_res, y_res = balancer.fit_resample(X_train, y_train)
```

### Pipeline Integration

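Use `imblearn.pipeline.Pipeline` (not `sklearn.pipeline.Pipeline`) to combine resamplers with estimators. The scikit-learn pipeline only calls `fit`/`transform` on intermediate steps and cannot change `y` or the number of rows, whereas the imbalanced-learn pipeline calls `fit_resample` during `fit` and skips resampling at prediction time: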

```python
from imblearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from endgame.preprocessing import SMOTEResampler

pipe = Pipeline([
    ('smote', SMOTEResampler(k_neighbors=5, random_state=42)),
    ('clf', RandomForestClassifier(n_estimators=200)),
])
pipe.fit(X_train, y_train)
```

For a complete list of available samplers, inspect `endgame.preprocessing.ALL_SAMPLERS`.

---

## Feature Engineering

### AutoAggregator

Generates group-level aggregation features ("magic features") that capture entity-level statistics. A core technique in many Kaggle winning solutions.

```python
from endgame.preprocessing import AutoAggregator

agg = AutoAggregator(
    group_cols=['customer_id'],
    agg_cols=['amount', 'quantity'],         # None = all numeric columns
    methods=['mean', 'std', 'min', 'max', 'skew'],
    rank_features=True,   # Adds within-group rank (key Optiver technique)
    diff_features=True,   # Adds deviation from group mean
    ratio_features=False, # Adds ratio to group mean
)
X_agg = agg.fit_transform(X_train)
```

Multi-key grouping:

```python
agg = AutoAggregator(
    group_cols=['store_id', 'product_category'],
    agg_cols=['sales'],
    methods=['mean', 'sum', 'count'],
)
```

Generated column names follow the pattern `{group}_{col}_{method}`, e.g. `customer_id_amount_mean`.
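
To show what a group-level feature looks like, here is a minimal plain-Python sketch that computes a per-customer mean and the deviation from it. The `customer_id_amount_mean` name follows the pattern above; the `customer_id_amount_diff` name is a hypothetical choice for this example and may differ from what `diff_features=True` produces.

```python
from collections import defaultdict
from statistics import mean

rows = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "c1", "amount": 30.0},
    {"customer_id": "c2", "amount": 5.0},
]

# Collect values per group, then compute one statistic per group.
groups = defaultdict(list)
for row in rows:
    groups[row["customer_id"]].append(row["amount"])
group_mean = {key: mean(values) for key, values in groups.items()}

# Broadcast the group statistic back onto every row of that group.
for row in rows:
    g_mean = group_mean[row["customer_id"]]
    row["customer_id_amount_mean"] = g_mean
    row["customer_id_amount_diff"] = row["amount"] - g_mean  # hypothetical name

print(rows[0]["customer_id_amount_mean"], rows[0]["customer_id_amount_diff"])  # 20.0 -10.0
```

In pandas, the same broadcast is typically written as `df.groupby('customer_id')['amount'].transform('mean')`.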

### InteractionFeatures

Creates pairwise interaction terms (products and ratios) between numeric features.

```python
from endgame.preprocessing import InteractionFeatures

interactions = InteractionFeatures(
    cols=['feature_a', 'feature_b', 'feature_c'],
    include_products=True,
    include_ratios=True,
)
X_inter = interactions.fit_transform(X_train)
```
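
A minimal sketch of pairwise products and ratios for a single row is shown below. The output column names and the choice to return NaN for a zero denominator are assumptions for illustration; `InteractionFeatures` may name columns and handle division by zero differently.

```python
from itertools import combinations


def pairwise_interactions(row, cols):
    """Return product and ratio features for every unordered pair of columns."""
    out = {}
    for a, b in combinations(cols, 2):
        out[f"{a}_x_{b}"] = row[a] * row[b]
        # Guard against division by zero with NaN rather than raising.
        out[f"{a}_div_{b}"] = row[a] / row[b] if row[b] != 0 else float("nan")
    return out


row = {"feature_a": 2.0, "feature_b": 4.0, "feature_c": 0.0}
features = pairwise_interactions(row, ["feature_a", "feature_b", "feature_c"])
print(features["feature_a_x_feature_b"], features["feature_a_div_feature_b"])  # 8.0 0.5
```

With three input columns there are three unordered pairs, so this produces six new features. The count grows quadratically with the number of columns, which is a good reason to pass an explicit `cols` list.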

### Temporal Feature Extraction

`TemporalFeatures` extracts datetime components and cyclical encodings from datetime columns. Cyclical encodings (sin/cos) are important for periodic features like hour-of-day or month so that December and January are close in feature space.

```python
from endgame.preprocessing import TemporalFeatures

tf = TemporalFeatures(
    datetime_cols=['timestamp'],  # None = auto-detect
    cyclical=True,                # Adds sin/cos for periodic features
    drop_original=False,
)
X_temporal = tf.fit_transform(X_train)
```

Extracted features include: `year`, `month`, `day`, `dayofweek`, `hour`, `minute`, `quarter`, `week_of_year`, `day_of_year`, `is_weekend`, `is_month_start`, `is_month_end`, and cyclical `sin`/`cos` variants for periodic components.
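
The December/January point can be checked numerically. The sketch below maps a periodic value onto the unit circle and compares distances; the `cyclical` helper is a hypothetical name, and the exact column naming used by `TemporalFeatures` is not shown here.

```python
import math


def cyclical(value, period):
    """Map a periodic value to (sin, cos) coordinates on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


dec, jan, jun = cyclical(12, 12), cyclical(1, 12), cyclical(6, 12)

print(round(math.dist(dec, jan), 3))  # 0.518
print(round(math.dist(dec, jun), 3))  # 2.0
```

As raw integers, December (12) and January (1) are 11 apart, further than December and June. After the sin/cos mapping they are neighbors, and June sits on the opposite side of the circle.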

For time series contexts, `LagFeatures` and `RollingFeatures` are also available:

```python
from endgame.preprocessing import LagFeatures, RollingFeatures

lags = LagFeatures(cols=['value'], lags=[1, 7, 28])
rolls = RollingFeatures(cols=['value'], windows=[7, 28], methods=['mean', 'std'])

X = lags.fit_transform(X)
X = rolls.fit_transform(X)
```
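
For reference, the two operations can be sketched on a plain list. This example assumes a trailing window that includes the current observation and uses `None` where there is not enough history; the library's window alignment and fill value may differ.

```python
from statistics import mean


def lag(values, k):
    """Shift the series forward by k steps; the first k entries have no history."""
    return ([None] * k + list(values))[:len(values)]


def rolling_mean(values, window):
    """Trailing mean over the current and previous window-1 observations."""
    return [
        mean(values[i - window + 1 : i + 1]) if i + 1 >= window else None
        for i in range(len(values))
    ]


series = [1.0, 2.0, 3.0, 4.0, 5.0]
print(lag(series, 1))           # [None, 1.0, 2.0, 3.0, 4.0]
print(rolling_mean(series, 3))  # [None, None, 2.0, 3.0, 4.0]
```

Both features depend on row order, so the data should be sorted by time before they are computed.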

---

## Noise Detection

### ConfidentLearningFilter

Implements the Confident Learning algorithm (Northcutt et al., 2021) to identify mislabeled training examples. Cross-validated predicted probabilities are used to estimate the joint distribution of noisy and true labels.

```python
from endgame.preprocessing import ConfidentLearningFilter

clf = ConfidentLearningFilter(
    base_estimator='rf',         # 'rf', 'xgboost', 'lgbm', or any sklearn classifier
    cv=5,
    threshold='auto',            # 'auto' = per-class average probability
    method='prune_by_class',     # 'prune_by_class', 'prune_by_noise_rate', or 'both'
    random_state=42,
)

noise_mask = clf.fit_detect(X_train, y_train)
print(f"Detected {noise_mask.sum()} noisy labels out of {len(y_train)}")

X_clean = X_train[~noise_mask]
y_clean = y_train[~noise_mask]
```
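
The thresholding idea can be illustrated with a simplified sketch. It assumes the probabilities are out-of-fold predictions, uses the per-class average probability as the threshold (as with `threshold='auto'`), and flags a row when the class it confidently belongs to differs from its given label. The function name is hypothetical, and the pruning methods listed above are not modeled.

```python
def find_label_issues(labels, probs):
    """Flag rows whose given label disagrees with a confident prediction."""
    n_classes = len(probs[0])
    # Threshold for class k: mean predicted probability of k over rows labeled k.
    # Assumes every class appears at least once in `labels`.
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for p, y in zip(probs, labels) if y == k]
        thresholds.append(sum(own) / len(own))
    flags = []
    for p, y in zip(probs, labels):
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            likely = max(confident, key=lambda k: p[k])
            flags.append(likely != y)
        else:
            flags.append(False)  # no class is confident, so keep the label
    return flags, thresholds


labels = [0, 0, 0, 1, 1, 1]
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
         [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
flags, thresholds = find_label_issues(labels, probs)
print(flags)  # [False, False, True, False, False, False]
```

Only the third row is flagged: it is labeled class 0, but the model assigns it to class 1 with a probability above that class's threshold.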

`ConsensusFilter` and `CrossValNoiseDetector` are also available for ensemble-based noise detection with multiple base estimators voting on which labels are suspect.

---

## Target Transformation

For regression tasks, transforming a skewed target distribution toward normality often improves model performance. `TargetTransformer` wraps any sklearn regressor and applies an invertible transform to `y` before fitting, then inverse-transforms predictions automatically.

### Supported Transforms

| Method | Requirements | Notes |
|---|---|---|
| `'log'` | All targets > 0 | Fast, interpretable |
| `'log1p'` | All targets >= 0 | Handles zeros |
| `'sqrt'` | All targets >= 0 | Mild compression |
| `'box_cox'` | All targets > 0 | Optimal power transform |
| `'yeo_johnson'` | Any targets | Works with negatives |
| `'quantile'` | Any targets | Maps to normal distribution |
| `'auto'` | Any targets | Selects via Shapiro-Wilk test |

```python
from endgame.preprocessing import TargetTransformer
from sklearn.ensemble import GradientBoostingRegressor

model = TargetTransformer(
    regressor=GradientBoostingRegressor(),
    method='yeo_johnson',  # or 'auto' to select automatically
)

model.fit(X_train, y_train)
preds = model.predict(X_test)  # Automatically inverse-transformed
```
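
The wrap-and-invert pattern can be sketched with a toy `log1p` wrapper. Both classes below are hypothetical stand-ins written for this example; they are not Endgame classes, and the constant-mean regressor exists only to keep the arithmetic easy to follow.

```python
import math


class Log1pTargetWrapper:
    """Toy wrapper: fit on log1p(y), return expm1 of the regressor's predictions."""

    def __init__(self, regressor):
        self.regressor = regressor

    def fit(self, X, y):
        self.regressor.fit(X, [math.log1p(t) for t in y])
        return self

    def predict(self, X):
        return [math.expm1(p) for p in self.regressor.predict(X)]


class MeanRegressor:
    """Hypothetical stand-in model that always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


model = Log1pTargetWrapper(MeanRegressor()).fit([[0], [1], [2]], [0.0, 9.0, 99.0])
pred = model.predict([[3]])[0]
print(round(pred, 6))  # 9.0
```

The prediction comes back on the original scale. It is 9.0 rather than the arithmetic mean of 36.0 because the inner model averaged in log space, which shows how a transform changes what the model effectively optimizes.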

Using `method='auto'` runs a Shapiro-Wilk normality test on the target and selects the transform that most improves normality:

```python
model = TargetTransformer(
    regressor=GradientBoostingRegressor(),
    method='auto',
)
model.fit(X_train, y_train)
print(model.selected_method_)  # e.g. 'log1p'
```

`TargetQuantileTransformer` applies quantile normalization as a standalone transformer (without wrapping a regressor):

```python
from endgame.preprocessing import TargetQuantileTransformer

qt = TargetQuantileTransformer(output_distribution='normal')
y_transformed = qt.fit_transform(y_train)

# y_pred_transformed: predictions from a model trained on y_transformed
y_pred_original = qt.inverse_transform(y_pred_transformed)
```

---

## API Reference

See the [API Reference](../api/preprocessing) for the full parameter list of each class. The preprocessing module exports:

- **Encoding**: `SafeTargetEncoder`, `LeaveOneOutEncoder`, `CatBoostEncoder`, `FrequencyEncoder`
- **Imputation**: `SimpleImputer`, `IndicatorImputer`, `KNNImputer`, `MICEImputer`, `MissForestImputer`, `AutoImputer`
- **Class Balancing**: `SMOTEResampler`, `BorderlineSMOTEResampler`, `ADASYNResampler`, `SVMSMOTEResampler`, `KMeansSMOTEResampler`, `MultivariateGaussianSMOTE`, `SimplicialSMOTE`, `AutoBalancer`, and 10+ under-sampling and combined methods
- **Feature Engineering**: `AutoAggregator`, `InteractionFeatures`, `RankFeatures`, `TemporalFeatures`, `LagFeatures`, `RollingFeatures`
- **Noise Detection**: `ConfidentLearningFilter`, `ConsensusFilter`, `CrossValNoiseDetector`
- **Target Transformation**: `TargetTransformer`, `TargetQuantileTransformer`
- **Feature Selection**: `AdversarialFeatureSelector`, `PermutationImportanceSelector`, `NullImportanceSelector`
- **Discretization**: `BayesianDiscretizer`