Calibration Guide

Endgame provides a calibration module covering conformal prediction, Venn-ABERS calibration, and classical probability calibration methods. All classes follow scikit-learn conventions (fit, predict, and predict_proba where applicable).

Conformal Prediction (Classification)

ConformalClassifier wraps any classifier to produce prediction sets that contain the true label with probability at least 1 - alpha, averaged over test points (marginal coverage). The only assumption is that the calibration and test data are exchangeable.

from endgame.calibration import ConformalClassifier
from endgame.models import LGBMWrapper
from sklearn.model_selection import train_test_split

X_train, X_cal, y_train, y_cal = train_test_split(
    X, y, test_size=0.2, random_state=42
)

base = LGBMWrapper(preset='endgame')
base.fit(X_train, y_train)

cc = ConformalClassifier(
    estimator=base,
    alpha=0.1,          # target miscoverage rate; 90% marginal coverage
    method='lac',       # 'lac' (softmax-based) or 'aps' (adaptive prediction sets)
)

cc.fit(X_cal, y_cal)   # calibrate on hold-out set

# Returns a list of sets, one per test point
prediction_sets = cc.predict(X_test)
for i, pset in enumerate(prediction_sets[:5]):
    print(f"Sample {i}: possible classes = {pset}")

# Hard point predictions: the base estimator's top-scoring class
preds = base.predict(X_test)

# Empirical coverage on a labelled evaluation set
cov = cc.coverage_score(X_eval, y_eval)
print(f"Empirical coverage: {cov:.3f}")  # should be close to 0.90; it fluctuates on finite samples

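The 'aps' score (Adaptive Prediction Sets) adapts the set size to the difficulty of each input, which tends to give more even coverage across easy and hard examples at the cost of larger sets on average. The 'lac' score (Least Ambiguous set-valued Classifier) gives the smallest average set size. Both carry the same 1 - alpha marginal guarantee.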
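
The calibration step behind the 'lac' score can be sketched in a few lines of NumPy. This is a minimal illustration, not Endgame's implementation: the helper names lac_threshold and lac_sets are hypothetical, and synthetic Dirichlet probabilities stand in for a fitted model's predict_proba output.

```python
import numpy as np

def lac_threshold(proba_cal, y_cal, alpha):
    """Conformal quantile of the LAC score 1 - p(true class)."""
    n = len(y_cal)
    scores = 1.0 - proba_cal[np.arange(n), y_cal]
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha))
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        return np.inf  # too few calibration points: include every class
    return np.sort(scores)[k - 1]

def lac_sets(proba_test, qhat):
    """Keep every class whose score 1 - p is at most the threshold."""
    return [set(np.flatnonzero(1.0 - p <= qhat).tolist()) for p in proba_test]

# Synthetic 3-class probabilities (assumption: stands in for a real model)
rng = np.random.default_rng(0)

def simulate(n):
    proba = rng.dirichlet(np.ones(3) * 0.5, size=n)
    y = np.array([rng.choice(3, p=p) for p in proba])  # labels drawn from proba
    return proba, y

proba_cal, y_cal = simulate(1000)
proba_test, y_test = simulate(2000)

qhat = lac_threshold(proba_cal, y_cal, alpha=0.1)
sets = lac_sets(proba_test, qhat)
coverage = np.mean([y in s for y, s in zip(y_test, sets)])
print(f"threshold={qhat:.3f}  coverage={coverage:.3f}")
```

Because calibration and test data are generated the same way, empirical coverage lands near the 0.90 target, though it varies from one calibration split to the next.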

Conformal Prediction (Regression)

ConformalRegressor produces prediction intervals with guaranteed marginal coverage. With method='split' and no difficulty estimator, every interval has the same width; when a difficulty estimator is provided, the width adapts to the local difficulty of each test point.

import numpy as np

from endgame.calibration import ConformalRegressor
from endgame.models import LGBMWrapper

base = LGBMWrapper(preset='endgame')
base.fit(X_train, y_train)

cr = ConformalRegressor(
    estimator=base,
    alpha=0.05,          # 95% coverage
    method='split',      # 'split' (fast) or 'cv+' (cross-conformal, slower)
)

cr.fit(X_cal, y_cal)

# Returns a tuple of (lower, upper) arrays
lower, upper = cr.predict_interval(X_test)

widths = upper - lower
print(f"Median interval width: {np.median(widths):.4f}")

cov = cr.coverage_score(X_eval, y_eval)
print(f"Empirical coverage: {cov:.3f}")
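
To show what split conformal calibration does under the hood, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not Endgame's code: the data are synthetic, and a least-squares line stands in for the fitted base model.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(-3, 3, size=n)
    y = 2.0 * x + rng.normal(scale=1.0, size=n)
    return x, y

x_train, y_train = make_data(500)
x_cal, y_cal = make_data(500)
x_test, y_test = make_data(2000)

# Stand-in base model: least-squares line fitted on the training split
slope, intercept = np.polyfit(x_train, y_train, deg=1)

def predict(x):
    return slope * x + intercept

alpha = 0.05
residuals = np.abs(y_cal - predict(x_cal))   # conformity scores on the hold-out set
n = len(residuals)
k = int(np.ceil((n + 1) * (1 - alpha)))      # finite-sample corrected rank (k <= n here)
qhat = np.sort(residuals)[k - 1]

lower = predict(x_test) - qhat
upper = predict(x_test) + qhat
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"half-width={qhat:.3f}  coverage={coverage:.3f}")
```

Every interval has width 2 * qhat, which is why plain split conformal cannot reflect regions where the model is more or less certain.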

Conformalized Quantile Regression (CQR)

ConformizedQuantileRegressor combines a quantile regressor with conformal calibration to produce adaptive intervals. Intervals are wider where the model is less certain, unlike split conformal which uses a fixed residual threshold.

from endgame.calibration import ConformizedQuantileRegressor
from endgame.models import LGBMWrapper

# Base model must support quantile regression
qr = LGBMWrapper(objective='quantile', preset='endgame')

cqr = ConformizedQuantileRegressor(
    estimator=qr,
    alpha=0.1,           # 90% coverage target
    quantile_low=0.05,   # lower quantile for the base regressor
    quantile_high=0.95,  # upper quantile for the base regressor
)

cqr.fit(X_train, y_train, X_cal=X_cal, y_cal=y_cal)

lower, upper = cqr.predict_interval(X_test)

CQR is the recommended method when prediction intervals of varying width are needed. The conformity score is max(q_low - y, y - q_high), so the calibration step only moves both raw quantile bounds outward (or inward) by the same additive constant; the shape of the interval across inputs comes entirely from the quantile regressor.
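
The CQR calibration step can be sketched as follows. This is a minimal NumPy illustration under assumptions: the data are synthetic with noise that grows with x, and the functions q_low and q_high are hypothetical stand-ins for a fitted quantile regressor, deliberately made too narrow.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(1, 5, size=n)
    y = rng.normal(scale=x)            # noise standard deviation equals x
    return x, y

# Stand-in quantile model, too narrow on purpose
# (the true 5% / 95% quantiles are -1.645 * x and +1.645 * x)
def q_low(x):
    return -1.2 * x

def q_high(x):
    return 1.2 * x

x_cal, y_cal = make_data(1000)
x_test, y_test = make_data(2000)

alpha = 0.1
# Conformity score: how far y falls outside the raw quantile interval
scores = np.maximum(q_low(x_cal) - y_cal, y_cal - q_high(x_cal))
n = len(scores)
k = int(np.ceil((n + 1) * (1 - alpha)))
qhat = np.sort(scores)[k - 1]          # positive here, so intervals widen

lower = q_low(x_test) - qhat
upper = q_high(x_test) + qhat
coverage = np.mean((y_test >= lower) & (y_test <= upper))
widths = upper - lower
print(f"qhat={qhat:.3f}  coverage={coverage:.3f}  "
      f"width range=({widths.min():.2f}, {widths.max():.2f})")
```

The correction qhat is a single number, yet the resulting widths still vary with x because they inherit the spread of the underlying quantile estimates.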

Venn-ABERS Calibration

VennABERS produces well-calibrated probability estimates without requiring a specific parametric form. For each sample it outputs a pair of probabilities (p0, p1), and, provided the calibration and test data are exchangeable, one of the two is guaranteed to be perfectly calibrated.

from endgame.calibration import VennABERS
from endgame.models import LGBMWrapper

base = LGBMWrapper(preset='endgame')
base.fit(X_train, y_train)

va = VennABERS(estimator=base)
va.fit(X_cal, y_cal)

# Returns point probabilities (geometric mean of the interval bounds)
proba = va.predict_proba(X_test)

# Returns the full Venn-ABERS interval [p0, p1] per sample
intervals = va.predict_interval(X_test)
p0, p1 = intervals[:, 0], intervals[:, 1]

# Interval width indicates epistemic uncertainty
uncertainty = p1 - p0

Unlike Platt scaling or isotonic regression, Venn-ABERS does not require tuning and is valid for small calibration sets. It is particularly useful when the base model has poorly calibrated raw probabilities (e.g., a gradient boosting model).
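
The idea behind the interval can be sketched with a naive implementation: for each test score, fit isotonic regression twice on the calibration set plus the test point, once with the test point labelled 0 and once labelled 1. The code below is an assumption-laden illustration (hypothetical helper names, synthetic data, one refit per test point); practical implementations use a much faster precomputed algorithm.

```python
import numpy as np

def pav(y, w):
    """Weighted pool-adjacent-violators: non-decreasing least-squares fit."""
    blocks = []                                  # [sum of w*y, sum of w, length]
    for yi, wi in zip(y, w):
        blocks.append([yi * wi, wi, 1])
        # Merge while the previous block's mean exceeds the last block's mean
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, ww, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += ww
            blocks[-1][2] += c
    return np.concatenate([np.full(c, s / ww) for s, ww, c in blocks])

def isotonic_at(scores, labels, s):
    """Fit isotonic regression on (scores, labels) and evaluate it at score s."""
    uniq, inv = np.unique(scores, return_inverse=True)   # pool tied scores
    w = np.bincount(inv).astype(float)
    means = np.bincount(inv, weights=labels) / w
    return pav(means, w)[np.searchsorted(uniq, s)]

def venn_abers(scores_cal, y_cal, s):
    """Return (p0, p1) for a single test score s."""
    aug = np.append(scores_cal, s)
    p0 = isotonic_at(aug, np.append(y_cal, 0), s)   # test label assumed 0
    p1 = isotonic_at(aug, np.append(y_cal, 1), s)   # test label assumed 1
    return p0, p1

rng = np.random.default_rng(0)
true_p = rng.uniform(size=500)
y_cal = (rng.uniform(size=500) < true_p).astype(int)
scores_cal = true_p ** 3            # miscalibrated but monotone raw score

for s_test in (0.05, 0.3, 0.8):
    p0, p1 = venn_abers(scores_cal, y_cal, s_test)
    print(f"score={s_test:.2f}  p0={p0:.3f}  p1={p1:.3f}  width={p1 - p0:.3f}")
```

The gap between p0 and p1 shrinks as more calibration points fall near the test score, which is why the width is a useful uncertainty signal.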

Classical Probability Calibration

Temperature Scaling

Temperature scaling divides the logits of a neural network (or any model exposing logits) by a single learnable scalar T. It is the most common post-hoc calibration technique for deep learning.

from endgame.calibration import TemperatureScaling

ts = TemperatureScaling()
ts.fit(logits_cal, y_cal)    # calibrate on logits (pre-softmax)

calibrated_proba = ts.predict_proba(logits_test)
print(f"Learned temperature: {ts.temperature_:.4f}")
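
For intuition, temperature scaling can be reproduced in plain NumPy by minimising the negative log-likelihood over T. The sketch below uses a simple grid search and synthetic logits; the function names are hypothetical, and a real implementation (including Endgame's, presumably) would use a proper optimiser.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)      # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, y, T):
    """Mean negative log-likelihood of labels y under softmax(logits / T)."""
    p = softmax(logits / T)
    return -np.mean(np.log(p[np.arange(len(y)), y] + 1e-12))

def fit_temperature(logits, y, grid=np.linspace(0.05, 10.0, 400)):
    """Pick the T on the grid with the lowest NLL."""
    losses = [nll(logits, y, T) for T in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic overconfident model: calibrated logits multiplied by 3
rng = np.random.default_rng(0)
clean = rng.normal(size=(2000, 4)) * 2.0
y = np.array([rng.choice(4, p=p) for p in softmax(clean)])
logits = 3.0 * clean

T = fit_temperature(logits, y)
print(f"learned T={T:.2f}  NLL before={nll(logits, y, 1.0):.3f}  "
      f"after={nll(logits, y, T):.3f}")
```

Dividing by a positive T never changes the ranking of the classes, so accuracy is unaffected; only the confidence of the predictions changes.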

Platt Scaling

Platt scaling fits a logistic regression on the model’s raw scores. It is most effective when the scores within each class are approximately normally distributed with similar variance, because the true posterior is then a sigmoid of the score.

from endgame.calibration import PlattScaling

ps = PlattScaling()
ps.fit(scores_cal, y_cal)    # 1D array of decision scores

calibrated_proba = ps.predict_proba(scores_test)
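
As a rough illustration of what the fit involves, the sketch below estimates the two Platt parameters with Newton's method in NumPy. It is a simplified version under assumptions: the helper fit_platt is hypothetical, the scores are synthetic, and the label smoothing used in Platt's original formulation is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_platt(scores, y, n_iter=50):
    """Fit p = sigmoid(a * score + b) by Newton's method on the log-loss."""
    X = np.column_stack([scores, np.ones_like(scores)])
    theta = np.zeros(2)
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (p - y)
        # Hessian of the log-loss, with a tiny ridge for numerical safety
        hess = (X * (p * (1 - p))[:, None]).T @ X + 1e-8 * np.eye(2)
        theta -= np.linalg.solve(hess, grad)
    return theta  # (a, b)

# Scores are Gaussian within each class with equal variance, so the true
# posterior is sigmoid(2 * score): class means -1 and +1, unit variance
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=5000)
scores = rng.normal(loc=2.0 * y - 1.0, scale=1.0)

a, b = fit_platt(scores, y)
calibrated = sigmoid(a * scores + b)
print(f"a={a:.2f} (true 2.0)  b={b:.2f} (true 0.0)")
```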

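Beta Calibration

Beta calibration assumes the scores within each class follow a Beta distribution, which yields a three-parameter calibration map: a logistic regression on ln(s) and -ln(1 - s). It offers more flexibility than Platt scaling for scores bounded in [0, 1] (e.g., already-softmaxed probabilities), and it includes the identity map, so scores that are already calibrated can be left unchanged.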

from endgame.calibration import BetaCalibration

bc = BetaCalibration()
bc.fit(proba_cal, y_cal)   # uncalibrated probabilities in [0, 1]

calibrated_proba = bc.predict_proba(proba_test)
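
The shape of the beta calibration map is easy to inspect directly. The sketch below writes out the standard three-parameter form, sigmoid(a * ln(s) - b * ln(1 - s) + c); the function name beta_map and the parameter values are illustrative assumptions, and in practice a, b, and c are fitted by logistic regression on the features ln(s) and -ln(1 - s).

```python
import numpy as np

def beta_map(s, a, b, c):
    """Beta calibration map: sigmoid(a * ln(s) - b * ln(1 - s) + c)."""
    s = np.clip(s, 1e-12, 1 - 1e-12)           # keep the logarithms finite
    z = a * np.log(s) - b * np.log1p(-s) + c
    return 1.0 / (1.0 + np.exp(-z))

s = np.linspace(0.01, 0.99, 99)
identity = beta_map(s, a=1.0, b=1.0, c=0.0)    # a = b = 1, c = 0 leaves scores unchanged
sharpen = beta_map(s, a=2.0, b=2.0, c=0.0)     # pushes scores away from 0.5
soften = beta_map(s, a=0.5, b=0.5, c=0.0)      # pulls scores toward 0.5

print(f"s=0.80 -> identity={beta_map(0.8, 1, 1, 0):.3f}, "
      f"sharpen={beta_map(0.8, 2, 2, 0):.3f}, soften={beta_map(0.8, 0.5, 0.5, 0):.3f}")
```

Platt scaling applied to probabilities cannot represent the identity map, which is one reason beta calibration is preferred when the inputs already lie in [0, 1].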

Isotonic Calibration

Isotonic regression fits a non-parametric monotone mapping from scores to probabilities. It can perfectly fit calibration data but may overfit with small calibration sets.

from endgame.calibration import IsotonicCalibration

ic = IsotonicCalibration()
ic.fit(proba_cal, y_cal)

calibrated_proba = ic.predict_proba(proba_test)
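
The overfitting risk is easy to see on a small sample. The sketch below implements the pool-adjacent-violators algorithm, the standard solver for isotonic regression, in NumPy and applies it to 30 synthetic points whose scores are already calibrated. The helper name pav and the data are assumptions for illustration only.

```python
import numpy as np

def pav(y):
    """Pool-adjacent-violators: best non-decreasing fit to y (equal weights)."""
    blocks = []                                  # [sum, count] per block
    for v in y:
        blocks.append([float(v), 1])
        # Merge while the previous block's mean exceeds the last block's mean
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    return np.concatenate([np.full(c, s / c) for s, c in blocks])

rng = np.random.default_rng(0)
proba = np.sort(rng.uniform(size=30))            # small calibration set, sorted by score
y = (rng.uniform(size=30) < proba).astype(int)   # labels drawn from the scores themselves

fitted = pav(y)                                  # calibrated value for each sorted score
levels = np.unique(fitted)
print(f"{len(levels)} distinct output levels for 30 points: {np.round(levels, 2)}")
```

Even though the input scores need no correction, the fitted map is a coarse step function with only a handful of levels, so it follows the noise in the calibration labels. With more calibration data the steps become finer and the map approaches the diagonal.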

Evaluating Calibration Quality

CalibrationAnalyzer computes multiple calibration diagnostics and generates reliability diagrams.

from endgame.calibration import CalibrationAnalyzer

analyzer = CalibrationAnalyzer(n_bins=10, strategy='uniform')
analyzer.fit(proba_test, y_test)

# Scalar metrics
print(f"ECE  : {analyzer.ece_:.4f}")   # Expected Calibration Error
print(f"MCE  : {analyzer.mce_:.4f}")   # Maximum Calibration Error
print(f"Brier: {analyzer.brier_:.4f}") # Brier Score

# Reliability diagram (matplotlib figure)
fig = analyzer.plot_reliability_diagram(title="Model Calibration")
fig.savefig("reliability.png", dpi=150)

# Per-bin breakdown
print(analyzer.bin_stats_)  # DataFrame: bin_lower, bin_upper, fraction_pos, mean_conf, count
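
For reference, the three scalar metrics can be computed by hand as sketched below. This assumes binary labels, equal-width bins, and ECE defined as the sample-weighted mean gap between the fraction of positives and the mean predicted probability in each bin, which matches the columns listed for bin_stats_; Endgame's exact definitions may differ, and the function name is hypothetical.

```python
import numpy as np

def calibration_metrics(proba, y, n_bins=10):
    """ECE, MCE and Brier score for binary positive-class probabilities."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index per sample; a probability of exactly 1.0 lands in the last bin
    idx = np.clip(np.digitize(proba, edges[1:-1]), 0, n_bins - 1)
    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue                       # empty bins contribute nothing
        gap = abs(y[mask].mean() - proba[mask].mean())
        ece += mask.mean() * gap           # weight by the bin's share of samples
        mce = max(mce, gap)
    brier = np.mean((proba - y) ** 2)
    return ece, mce, brier

rng = np.random.default_rng(0)
true_p = rng.uniform(size=5000)
y = (rng.uniform(size=5000) < true_p).astype(int)
overconfident = true_p**2 / (true_p**2 + (1 - true_p) ** 2)  # pushed toward 0 and 1

ece_good, mce_good, brier_good = calibration_metrics(true_p, y)
ece_bad, mce_bad, brier_bad = calibration_metrics(overconfident, y)
print(f"calibrated:    ECE={ece_good:.3f}  MCE={mce_good:.3f}  Brier={brier_good:.3f}")
print(f"overconfident: ECE={ece_bad:.3f}  MCE={mce_bad:.3f}  Brier={brier_bad:.3f}")
```

ECE and MCE depend on the binning scheme, so values are only comparable when n_bins and the binning strategy are held fixed.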

Choosing a Calibration Method

| Method | Best for |
|---|---|
| `TemperatureScaling` | Neural networks with logit access; large calibration sets |
| `PlattScaling` | SVM or other margin-based models; unimodal score distributions |
| `BetaCalibration` | Models outputting probabilities; flexible boundary handling |
| `IsotonicCalibration` | Large calibration sets; monotone but non-sigmoid miscalibration patterns |
| `VennABERS` | Small calibration sets; validity under exchangeability alone; probability intervals |
| `ConformalClassifier` | Hard prediction sets with coverage guarantees |
| `ConformalRegressor` | Prediction intervals with coverage guarantees |
| `ConformizedQuantileRegressor` | Adaptive-width intervals; heteroscedastic regression |