Ensembles Guide¶

Endgame provides a full suite of ensemble methods, from classic stacking and blending to advanced techniques like hill climbing, optimal weight search, and knowledge distillation. All ensemble classes follow the sklearn interface (fit, predict, predict_proba).

SuperLearner¶

SuperLearner combines arbitrary base learners using non-negative least squares (NNLS) weighting, producing a convex combination that cannot perform worse than the best individual model on the training data.

from endgame.ensemble import SuperLearner
from endgame.models import LGBMWrapper, XGBWrapper, CatBoostWrapper

base_learners = [
    LGBMWrapper(preset='endgame'),
    XGBWrapper(preset='endgame'),
    CatBoostWrapper(preset='endgame'),
]

sl = SuperLearner(
    base_estimators=[
        ("lgbm", LGBMWrapper(preset='endgame')),
        ("xgb", XGBWrapper(preset='endgame')),
        ("cb", CatBoostWrapper(preset='endgame')),
    ],
    meta_learner="nnls",  # non-negative least squares
    cv=5,                 # inner cross-validation folds for meta-features
)

sl.fit(X_train, y_train)
proba = sl.predict_proba(X_test)
preds = sl.predict(X_test)

# Inspect learned weights
print(sl.coef_)   # non-negative, sum to 1

The meta-features are out-of-fold predictions from each base learner. The NNLS solver finds the weight vector that minimises squared error on those meta-features, guaranteeing non-negative weights without requiring regularisation.

HillClimbingEnsemble¶

HillClimbingEnsemble uses greedy forward selection to build an ensemble that directly optimises an arbitrary metric. At each step it adds the candidate model (with repetition allowed) that most improves the ensemble score on the hold-out fold. This mirrors the approach used in many competition-winning solutions.

from endgame.ensemble import HillClimbingEnsemble
from sklearn.metrics import roc_auc_score

hc = HillClimbingEnsemble(
    metric=roc_auc_score,
    maximize=True,
    n_iterations=100,     # maximum greedy steps
    random_state=42,
)

# Pass a list of OOF prediction arrays
oof_preds = [lgbm_oof, xgb_oof, cb_oof, ft_oof]
hc.fit(oof_preds, y_train)

# Apply the discovered weights to test predictions
test_preds = [lgbm_test, xgb_test, cb_test, ft_test]
final = hc.predict(test_preds)

print(hc.weights_)    # float weights, sums to 1
print(hc.best_score_) # best metric achieved on OOF

StackingEnsemble¶

StackingEnsemble trains base learners and a meta-learner in a single fit call. Base learner out-of-fold predictions become features for the meta-learner.

from endgame.ensemble import StackingEnsemble
from endgame.models import LGBMWrapper, XGBWrapper
from sklearn.linear_model import LogisticRegression

stack = StackingEnsemble(
    estimators=[
        ('lgbm', LGBMWrapper(preset='endgame')),
        ('xgb',  XGBWrapper(preset='endgame')),
    ],
    meta_learner=LogisticRegression(),
    cv=5,
    passthrough=True,   # also pass original features to meta-learner
    use_proba=True,     # use predict_proba outputs as meta-features
)

stack.fit(X_train, y_train)
preds = stack.predict(X_test)
proba = stack.predict_proba(X_test)

BlendingEnsemble¶

BlendingEnsemble uses a fixed hold-out split rather than cross-validation to generate meta-features. This is faster but uses less data for training base learners.

from endgame.ensemble import BlendingEnsemble
from endgame.models import LGBMWrapper, XGBWrapper, CatBoostWrapper

blend = BlendingEnsemble(
    estimators=[
        ('lgbm', LGBMWrapper()),
        ('xgb',  XGBWrapper()),
        ('cb',   CatBoostWrapper()),
    ],
    meta_learner=LGBMWrapper(n_estimators=200),
    holdout_size=0.2,
    random_state=42,
)

blend.fit(X_train, y_train)
preds = blend.predict(X_test)

OptimizedBlender¶

OptimizedBlender finds continuous blend weights by minimising a loss function over the provided out-of-fold predictions using scipy optimisation (L-BFGS-B with a simplex constraint).

from endgame.ensemble import OptimizedBlender
from sklearn.metrics import log_loss

blender = OptimizedBlender(
    metric=log_loss,
    maximize=False,     # log_loss should be minimised
    bounds=(0.0, 1.0),  # weight bounds per model
)

blender.fit(oof_preds_matrix, y_train)  # shape (n_samples, n_models)
final = blender.predict(test_preds_matrix)
print(blender.weights_)

RankAverageBlender¶

RankAverageBlender converts each model’s predictions to ranks before averaging. This is robust to scale differences between models and often outperforms simple averaging when models produce predictions on different scales.

from endgame.ensemble import RankAverageBlender

blender = RankAverageBlender(weights=[0.4, 0.35, 0.25])
final = blender.predict(test_preds_matrix)  # shape (n_samples, n_models)

ThresholdOptimizer¶

ThresholdOptimizer finds the optimal classification threshold by searching over a grid of cutoffs and maximising a target metric on out-of-fold predictions. This is particularly useful for imbalanced datasets where the default 0.5 threshold is suboptimal.

from endgame.ensemble import ThresholdOptimizer
from sklearn.metrics import f1_score

optimizer = ThresholdOptimizer(
    metric=f1_score,
    maximize=True,
    thresholds=100,     # number of candidate thresholds to evaluate
)

optimizer.fit(oof_probabilities, y_train)
print(f"Optimal threshold: {optimizer.threshold_:.4f}")

hard_preds = optimizer.predict(test_probabilities)

Knowledge Distillation¶

Endgame supports training a lightweight student model to mimic a heavier teacher model. This is useful when you need a fast inference model that approximates an expensive ensemble.

from endgame.ensemble import KnowledgeDistiller
from endgame.models import LGBMWrapper
from endgame.models.baselines import LinearClassifier

teacher = LGBMWrapper(preset='endgame')
teacher.fit(X_train, y_train)

student = LinearClassifier()

kd = KnowledgeDistiller(
    teacher=teacher,
    student=student,
    temperature=3.0,       # softens teacher's probability distribution
    alpha=0.5,             # blend of hard labels vs soft labels
)

kd.fit(X_train, y_train)
preds = kd.student_.predict(X_test)

The temperature parameter controls how much the teacher’s soft probabilities are smoothed before being used as targets. Higher temperature produces softer, more informative targets. alpha controls the trade-off between learning from the hard ground-truth labels and the soft teacher labels.

Choosing an Ensemble Strategy¶

Strategy	When to use
`SuperLearner`	Strong diverse base learners; want theoretically-grounded weighting
`HillClimbingEnsemble`	Have OOF predictions; want to directly optimise a target metric
`StackingEnsemble`	Standard competition workflow; enough data for CV-based stacking
`BlendingEnsemble`	Limited time; large datasets where full CV is expensive
`OptimizedBlender`	OOF predictions already computed; want continuous weight optimisation
`RankAverageBlender`	Models have incompatible prediction scales
`ThresholdOptimizer`	Binary classification with imbalanced classes or custom metric
`KnowledgeDistillation`	Need fast inference; ensemble too slow for production