# Ensembles Guide

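Endgame provides a full suite of ensemble methods, from classic stacking and blending
to advanced techniques like hill climbing, optimal weight search, and knowledge
distillation. The ensemble classes follow sklearn conventions (`fit`, `predict`, and,
where probabilities are available, `predict_proba`); the blenders that operate on
precomputed predictions take prediction arrays in place of a feature matrix.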

## SuperLearner

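`SuperLearner` combines arbitrary base learners using non-negative least squares (NNLS)
weighting, producing a convex combination of their predictions. Because putting all
weight on a single model is always a feasible solution, the NNLS fit cannot have a
higher squared error than the best individual model on the out-of-fold predictions it
is fitted to. This is an in-sample property and does not guarantee better performance
on unseen data.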

```python
from endgame.ensemble import SuperLearner
from endgame.models import LGBMWrapper, XGBWrapper, CatBoostWrapper

sl = SuperLearner(
    base_estimators=[
        ("lgbm", LGBMWrapper(preset='endgame')),
        ("xgb", XGBWrapper(preset='endgame')),
        ("cb", CatBoostWrapper(preset='endgame')),
    ],
    meta_learner="nnls",  # non-negative least squares
    cv=5,                 # inner cross-validation folds for meta-features
)

sl.fit(X_train, y_train)
proba = sl.predict_proba(X_test)
preds = sl.predict(X_test)

# Inspect learned weights
print(sl.coef_)   # non-negative, sum to 1
```

The meta-features are out-of-fold predictions from each base learner. The NNLS
solver finds the weight vector that minimises squared error on those meta-features,
guaranteeing non-negative weights without requiring regularisation.
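
The weighting step described above can be sketched with SciPy's `nnls` solver. This is a minimal illustration on synthetic data, not Endgame's implementation: the `noisy_oof` helper and the explicit normalisation step are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_samples = 500
y = rng.integers(0, 2, size=n_samples).astype(float)

def noisy_oof(noise):
    """Hypothetical stand-in for one model's out-of-fold probabilities."""
    return np.clip(y + rng.normal(0.0, noise, size=n_samples), 0.0, 1.0)

# Meta-feature matrix: one column of out-of-fold predictions per base learner
oof = np.column_stack([noisy_oof(0.3), noisy_oof(0.5), noisy_oof(0.8)])

# Solve min ||oof @ w - y||_2 subject to w >= 0
raw_weights, residual_norm = nnls(oof, y)

# NNLS alone does not force the weights to sum to 1, so normalise
# (an assumption here) to obtain a convex combination
weights = raw_weights / raw_weights.sum()

print(weights)
```

Since a weight vector that selects one column is feasible, `residual_norm` is never larger than the residual of the best single column of `oof`.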

## HillClimbingEnsemble

`HillClimbingEnsemble` uses greedy forward selection to build an ensemble that
directly optimises an arbitrary metric. At each step it adds the candidate model
(with repetition allowed) that most improves the ensemble score on the out-of-fold
predictions. This mirrors the approach used in many competition-winning solutions.

```python
from endgame.ensemble import HillClimbingEnsemble
from sklearn.metrics import roc_auc_score

hc = HillClimbingEnsemble(
    metric=roc_auc_score,
    maximize=True,
    n_iterations=100,     # maximum greedy steps
    random_state=42,
)

# Pass a list of OOF prediction arrays
oof_preds = [lgbm_oof, xgb_oof, cb_oof, ft_oof]
hc.fit(oof_preds, y_train)

# Apply the discovered weights to test predictions
test_preds = [lgbm_test, xgb_test, cb_test, ft_test]
final = hc.predict(test_preds)

print(hc.weights_)    # float weights that sum to 1
print(hc.best_score_) # best metric achieved on OOF
```
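
The greedy selection loop can be sketched in plain NumPy. This is an illustrative version under stated assumptions, not the library's code: `hill_climb` and `neg_mse` are hypothetical names, the metric is assumed to be "higher is better", and the search stops as soon as no candidate improves the score.

```python
import numpy as np

def hill_climb(oof_preds, y_true, metric, n_iterations=50):
    """Greedy forward selection with replacement; higher metric is better."""
    preds = np.column_stack(oof_preds)
    n_models = preds.shape[1]
    counts = np.zeros(n_models, dtype=int)   # how often each model was picked
    running_sum = np.zeros(preds.shape[0])   # sum of the picked predictions
    best_score = -np.inf

    for step in range(1, n_iterations + 1):
        # Score the ensemble obtained by adding each candidate one more time
        scores = [
            metric(y_true, (running_sum + preds[:, j]) / step)
            for j in range(n_models)
        ]
        j_best = int(np.argmax(scores))
        if scores[j_best] <= best_score:
            break  # no candidate improves the ensemble
        best_score = scores[j_best]
        counts[j_best] += 1
        running_sum += preds[:, j_best]

    # Selection counts become weights that sum to 1
    return counts / counts.sum(), best_score

def neg_mse(y_true, y_pred):
    return -np.mean((y_true - y_pred) ** 2)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=400).astype(float)
oof_preds = [np.clip(y + rng.normal(0, s, size=400), 0, 1) for s in (0.4, 0.5, 0.6)]

weights, score = hill_climb(oof_preds, y, neg_mse)
print(weights, score)
```

The first step always selects the best single model, so the final score is at least as good as any individual model on the same out-of-fold data.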

## StackingEnsemble

`StackingEnsemble` trains base learners and a meta-learner in a single `fit` call.
Base learner out-of-fold predictions become features for the meta-learner.

```python
from endgame.ensemble import StackingEnsemble
from endgame.models import LGBMWrapper, XGBWrapper
from sklearn.linear_model import LogisticRegression

stack = StackingEnsemble(
    estimators=[
        ('lgbm', LGBMWrapper(preset='endgame')),
        ('xgb',  XGBWrapper(preset='endgame')),
    ],
    meta_learner=LogisticRegression(),
    cv=5,
    passthrough=True,   # also pass original features to meta-learner
    use_proba=True,     # use predict_proba outputs as meta-features
)

stack.fit(X_train, y_train)
preds = stack.predict(X_test)
proba = stack.predict_proba(X_test)
```
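
To make the out-of-fold mechanism concrete, the sketch below builds the meta-features by hand with scikit-learn. It is not how `StackingEnsemble` is implemented; the base models, the synthetic data, and the `stack_predict_proba` helper are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

base_models = [
    DecisionTreeClassifier(max_depth=3, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

# Out-of-fold positive-class probabilities: every row is predicted by a
# model that did not see that row during training
oof = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# Analogue of passthrough=True: append the original features
meta_X = np.hstack([oof, X])
meta = LogisticRegression(max_iter=1000).fit(meta_X, y)

# Refit the base models on all data for use at inference time
fitted = [m.fit(X, y) for m in base_models]

def stack_predict_proba(X_new):
    base = np.column_stack([m.predict_proba(X_new)[:, 1] for m in fitted])
    return meta.predict_proba(np.hstack([base, X_new]))

print(stack_predict_proba(X[:5]))
```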

## BlendingEnsemble

`BlendingEnsemble` uses a fixed hold-out split rather than cross-validation to
generate meta-features. This is faster but uses less data for training base learners.

```python
from endgame.ensemble import BlendingEnsemble
from endgame.models import LGBMWrapper, XGBWrapper, CatBoostWrapper

blend = BlendingEnsemble(
    estimators=[
        ('lgbm', LGBMWrapper()),
        ('xgb',  XGBWrapper()),
        ('cb',   CatBoostWrapper()),
    ],
    meta_learner=LGBMWrapper(n_estimators=200),
    holdout_size=0.2,
    random_state=42,
)

blend.fit(X_train, y_train)
preds = blend.predict(X_test)
```
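
The hold-out variant can be sketched the same way with scikit-learn. This is a minimal illustration, assuming a single stratified 80/20 split and simple stand-in models; the `meta_features` helper is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Base learners train on 80% of the rows; the remaining 20% is reserved
# for fitting the meta-learner
X_base, X_hold, y_base, y_hold = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

base_models = [DecisionTreeClassifier(max_depth=3, random_state=0), GaussianNB()]
for model in base_models:
    model.fit(X_base, y_base)

def meta_features(X_new):
    """Stack each base model's positive-class probability as one column."""
    return np.column_stack([m.predict_proba(X_new)[:, 1] for m in base_models])

# The meta-learner only sees predictions on rows the base models never saw
meta = LogisticRegression().fit(meta_features(X_hold), y_hold)
proba = meta.predict_proba(meta_features(X))[:, 1]
print(proba[:5])
```

Each base model is fitted once rather than once per fold, which is where the speed advantage over cross-validated stacking comes from.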

## OptimizedBlender

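`OptimizedBlender` finds continuous blend weights by optimising a metric over the
provided out-of-fold predictions using SciPy optimisation (L-BFGS-B, with the weights
kept on the simplex). L-BFGS-B itself handles only box bounds such as
`bounds=(0.0, 1.0)`, so the sum-to-one condition has to be imposed separately, for
example by normalising the weights inside the objective.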

```python
from endgame.ensemble import OptimizedBlender
from sklearn.metrics import log_loss

blender = OptimizedBlender(
    metric=log_loss,
    maximize=False,     # log_loss should be minimised
    bounds=(0.0, 1.0),  # weight bounds per model
)

blender.fit(oof_preds_matrix, y_train)  # shape (n_samples, n_models)
final = blender.predict(test_preds_matrix)
print(blender.weights_)
```
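
A constrained weight search of this kind can be sketched with `scipy.optimize.minimize`. Note the substitution: this sketch uses SLSQP rather than L-BFGS-B, because SLSQP accepts the sum-to-one equality constraint directly. The synthetic data and the NumPy `log_loss` helper are assumptions for the example.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
# Hypothetical out-of-fold probabilities, shape (n_samples, n_models)
oof = np.column_stack([
    np.clip(y + rng.normal(0, s, size=500), 0.01, 0.99) for s in (0.3, 0.4, 0.6)
])

def log_loss(y_true, p, eps=1e-12):
    """Binary cross-entropy, clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

n_models = oof.shape[1]
result = minimize(
    fun=lambda w: log_loss(y, oof @ w),
    x0=np.full(n_models, 1.0 / n_models),            # start from equal weights
    method="SLSQP",
    bounds=[(0.0, 1.0)] * n_models,                  # per-model box bounds
    constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0},  # simplex
)
weights = result.x
print(weights, log_loss(y, oof @ weights))
```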

## RankAverageBlender

`RankAverageBlender` converts each model's predictions to ranks before averaging.
This makes the blend insensitive to differences in scale or calibration between
models, which often helps over simple averaging for rank-based metrics such as AUC.
The output is a ranking score rather than a calibrated probability.

```python
from endgame.ensemble import RankAverageBlender

blender = RankAverageBlender(weights=[0.4, 0.35, 0.25])
final = blender.predict(test_preds_matrix)  # shape (n_samples, n_models)
```
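
The rank transform itself is short enough to show in full. The sketch below uses `scipy.stats.rankdata`; the `rank_average` function and the choice to scale ranks by the number of samples are assumptions, not necessarily what `RankAverageBlender` does internally.

```python
import numpy as np
from scipy.stats import rankdata

def rank_average(pred_matrix, weights=None):
    """Weighted average of per-model ranks, scaled to the interval (0, 1]."""
    pred_matrix = np.asarray(pred_matrix, dtype=float)
    n_samples, n_models = pred_matrix.shape
    if weights is None:
        weights = np.full(n_models, 1.0 / n_models)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    # rankdata assigns average ranks to ties; rank each model's column separately
    ranks = np.column_stack(
        [rankdata(pred_matrix[:, j]) for j in range(n_models)]
    ) / n_samples
    return ranks @ weights

# Two models on very different scales that agree on the ordering
preds = np.array([
    [0.10, 10.0],
    [0.40, 30.0],
    [0.35, 20.0],
    [0.80, 90.0],
])
blended = rank_average(preds, weights=[0.5, 0.5])
print(blended)
```

Both columns induce the same ordering, so the blended scores are 0.25, 0.75, 0.5, and 1.0 regardless of the raw scales.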

## ThresholdOptimizer

`ThresholdOptimizer` finds the optimal classification threshold by searching over
a grid of cutoffs and maximising a target metric on out-of-fold predictions. This
is particularly useful for imbalanced datasets, where the default 0.5 threshold is
often suboptimal.

```python
from endgame.ensemble import ThresholdOptimizer
from sklearn.metrics import f1_score

optimizer = ThresholdOptimizer(
    metric=f1_score,
    maximize=True,
    thresholds=100,     # number of candidate thresholds to evaluate
)

optimizer.fit(oof_probabilities, y_train)
print(f"Optimal threshold: {optimizer.threshold_:.4f}")

hard_preds = optimizer.predict(test_probabilities)
```
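
The grid search can be sketched in NumPy as follows. This is an illustration only: `f1` and `best_threshold` are hypothetical helpers, the grid is assumed to be evenly spaced on [0, 1], and the data is synthetic with roughly 10% positives.

```python
import numpy as np

def f1(y_true, y_pred):
    """F1 score for binary labels; returns 0.0 when undefined."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def best_threshold(y_true, proba, n_thresholds=100):
    """Evaluate an evenly spaced grid of cutoffs and keep the best one."""
    grid = np.linspace(0.0, 1.0, n_thresholds)
    scores = np.array([f1(y_true, (proba >= t).astype(int)) for t in grid])
    best = int(np.argmax(scores))
    return grid[best], scores[best]

rng = np.random.default_rng(0)
y = (rng.random(2000) < 0.1).astype(int)  # imbalanced labels
proba = np.clip(0.15 + 0.35 * y + rng.normal(0, 0.15, size=2000), 0, 1)

threshold, score = best_threshold(y, proba)
print(f"Optimal threshold: {threshold:.4f}, F1: {score:.4f}")
```

Because the threshold is tuned on the same predictions it is scored on, it should be fitted on out-of-fold probabilities, as in the example above, rather than on test data.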

## Knowledge Distillation

Endgame supports training a lightweight student model to mimic a heavier teacher
model. This is useful when you need a fast inference model that approximates an
expensive ensemble.

```python
from endgame.ensemble import KnowledgeDistiller
from endgame.models import LGBMWrapper
from endgame.models.baselines import LinearClassifier

teacher = LGBMWrapper(preset='endgame')
teacher.fit(X_train, y_train)

student = LinearClassifier()

kd = KnowledgeDistiller(
    teacher=teacher,
    student=student,
    temperature=3.0,       # softens teacher's probability distribution
    alpha=0.5,             # blend of hard labels vs soft labels
)

kd.fit(X_train, y_train)
preds = kd.student_.predict(X_test)
```

The `temperature` parameter controls how much the teacher's probabilities are
smoothed before being used as targets. Higher temperatures produce softer targets
that expose more of the teacher's relative confidence in the non-predicted classes;
very high temperatures flatten the targets towards uniform. `alpha` controls the
trade-off between learning from the hard ground-truth labels and the soft teacher
labels.
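
The effect of the two parameters can be illustrated numerically. The sketch below assumes the common formulation in which log-probabilities are divided by the temperature and renormalised, and in which `alpha` weights the hard labels while `1 - alpha` weights the soft ones. Both conventions are assumptions; `KnowledgeDistiller` may define them differently.

```python
import numpy as np

def soften(proba, temperature):
    """Temperature-scale class probabilities: softmax(log(p) / T)."""
    logits = np.log(np.clip(proba, 1e-12, 1.0))
    scaled = logits / temperature
    scaled -= scaled.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical teacher outputs for two samples
teacher_proba = np.array([[0.98, 0.02], [0.60, 0.40]])
soft = soften(teacher_proba, temperature=3.0)

# One-hot ground-truth labels for the same two samples
hard = np.array([[1.0, 0.0], [0.0, 1.0]])

alpha = 0.5  # assumed convention: weight on the hard labels
targets = alpha * hard + (1 - alpha) * soft

print(soft)
print(targets)
```

With `temperature=1.0` the teacher probabilities are returned unchanged; larger values pull a confident prediction such as 0.98 closer to 0.5.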

## Choosing an Ensemble Strategy

| Strategy | When to use |
|---|---|
| `SuperLearner` | Strong diverse base learners; want theoretically-grounded weighting |
| `HillClimbingEnsemble` | Have OOF predictions; want to directly optimise a target metric |
| `StackingEnsemble` | Standard competition workflow; enough data for CV-based stacking |
| `BlendingEnsemble` | Limited time; large datasets where full CV is expensive |
| `OptimizedBlender` | OOF predictions already computed; want continuous weight optimisation |
| `RankAverageBlender` | Models have incompatible prediction scales |
| `ThresholdOptimizer` | Binary classification with imbalanced classes or custom metric |
| `KnowledgeDistiller` | Need fast inference; ensemble too slow for production |

## See Also

- [API Reference: ensemble](../api/ensemble)
- [Models Guide](models.md) for base learner options
- [Calibration Guide](calibration.md) for post-hoc probability calibration