Validation

class endgame.validation.AdversarialValidator(estimator=None, sample_frac=1.0, cv=5, threshold=0.7, random_state=None, verbose=False)[source]

Bases: EndgameEstimator

Detects train/test distribution drift using adversarial validation.

Trains a classifier to distinguish train from test data. An AUC near 0.5 means the two sets are indistinguishable; the further the AUC rises above 0.5, the stronger the distribution drift. Feature importances identify the drifting features.
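
As a rough illustration of how the AUC is read, the sketch below scores a single feature instead of training a classifier: the rank-based (Mann-Whitney) AUC measures how well that one feature separates train rows from test rows. The helper name rank_auc and the synthetic data are assumptions made for illustration and are not part of endgame; the validator itself trains a full classifier over all features.

```python
import random


def rank_auc(train_values, test_values):
    """ROC-AUC for 'is this row from test?' using one feature as the score.

    Equivalent to the Mann-Whitney U statistic divided by n_train * n_test.
    """
    labelled = sorted(
        [(v, 0) for v in train_values] + [(v, 1) for v in test_values]
    )
    # Assign 1-based ranks, averaging over runs of tied values.
    ranks = [0.0] * len(labelled)
    i = 0
    while i < len(labelled):
        j = i
        while j + 1 < len(labelled) and labelled[j + 1][0] == labelled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    n_test, n_train = len(test_values), len(train_values)
    test_rank_sum = sum(r for r, (_, is_test) in zip(ranks, labelled) if is_test)
    u_stat = test_rank_sum - n_test * (n_test + 1) / 2
    return u_stat / (n_test * n_train)


# Synthetic data (hypothetical): one stable feature, one shifted feature.
rng = random.Random(0)
stable_train = [rng.gauss(0.0, 1.0) for _ in range(500)]
stable_test = [rng.gauss(0.0, 1.0) for _ in range(500)]
shifted_test = [rng.gauss(1.5, 1.0) for _ in range(500)]

auc_stable = rank_auc(stable_train, stable_test)    # same distribution
auc_shifted = rank_auc(stable_train, shifted_test)  # mean shifted upward
```

A feature whose values shifted downward would score below 0.5, so max(auc, 1 - auc) gives a direction-agnostic reading.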

The technique appears across winning competition solutions as a guard against leaderboard overfitting when cross-validation (CV) scores do not correlate with the public leaderboard (LB).

Parameters:
  • estimator (sklearn-compatible classifier, optional) – The classifier used for adversarial validation. If None, uses LightGBM if available, else RandomForest.

  • sample_frac (float, default=1.0) – Fraction of data to use (for large datasets).

  • cv (int, default=5) – Number of cross-validation folds.

  • threshold (float, default=0.7) – AUC threshold above which to flag significant drift.

  • random_state (int, optional) – Random seed for reproducibility.

  • verbose (bool, default=False) – Enable verbose output.

auc_score_

ROC-AUC score from adversarial validation.

Type:

float

feature_importances_

Feature importance in distinguishing train/test.

Type:

Dict[str, float]

drifted_features_

Features contributing most to drift (sorted by importance).

Type:

List[str]

Examples

>>> from endgame.validation import AdversarialValidator
>>> av = AdversarialValidator(threshold=0.6)
>>> result = av.check_drift(X_train, X_test)
>>> print(f"Drift AUC: {result.auc_score:.3f}")
>>> if result.drift_severity == 'severe':
...     # Remove drifted features
...     drop_cols = result.drifted_features[:5]
check_drift(X_train, X_test)[source]

Check for distribution drift between train and test data.

Parameters:
  • X_train (array-like of shape (n_train_samples, n_features)) – Training features.

  • X_test (array-like of shape (n_test_samples, n_features)) – Test features.

Return type:

AdversarialValidationResult

Returns:

AdversarialValidationResult – Result containing:

  • auc_score: float (>0.5 indicates drift)

  • drifted_features: List[str] (features with high importance)

  • feature_importances: Dict[str, float]

  • drift_severity: str (‘none’, ‘mild’, ‘moderate’, ‘severe’)

get_test_like_samples(X_train, y_train, X_test, top_pct=0.2)[source]

Get training samples most similar to test distribution.

Uses adversarial validation predictions to identify training samples that the classifier thinks look like test samples.

Parameters:
  • X_train (array-like) – Training features.

  • y_train (array-like) – Training labels.

  • X_test (array-like) – Test features.

  • top_pct (float, default=0.2) – Top percentage of test-like samples to return.

Return type:

tuple[Any, Any]

Returns:

  • X_selected (array-like) – Selected training features.

  • y_selected (array-like) – Selected training labels.
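
The selection step can be pictured as a simple top-fraction filter over per-row scores. This is a minimal sketch, assuming each training row already has a predicted probability of belonging to the test set; the helper select_test_like and its rounding rule are hypothetical and may differ from what the library does internally.

```python
def select_test_like(test_scores, top_pct=0.2):
    """Return indices of the training rows with the highest test-likeness.

    test_scores[i] is assumed to be the adversarial classifier's predicted
    probability that training row i came from the test set.
    """
    n_keep = max(1, round(top_pct * len(test_scores)))
    ranked = sorted(
        range(len(test_scores)), key=lambda i: test_scores[i], reverse=True
    )
    return sorted(ranked[:n_keep])  # keep the original row order


# Toy data (hypothetical): five training rows and their test-likeness scores.
scores = [0.10, 0.90, 0.50, 0.70, 0.30]
X_train = [[0], [1], [2], [3], [4]]
y_train = [0, 1, 0, 1, 0]

keep = select_test_like(scores, top_pct=0.4)
X_selected = [X_train[i] for i in keep]
y_selected = [y_train[i] for i in keep]
```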

suggest_features_to_drop(X_train, X_test, max_features=10, min_importance=0.05)[source]

Suggest features to drop to reduce drift.

Parameters:
  • X_train (array-like) – Training features.

  • X_test (array-like) – Test features.

  • max_features (int, default=10) – Maximum number of features to suggest.

  • min_importance (float, default=0.05) – Minimum importance threshold.

Return type:

list[str]

Returns:

List[str] – Features suggested for removal.

class endgame.validation.PurgedTimeSeriesSplit(n_splits=5, purge_gap=0, embargo_pct=0.01, max_train_size=None)[source]

Bases: BaseCrossValidator

Time series CV with purging and embargo to prevent lookahead bias.

Essential for financial competitions (Optiver, Jane Street), where temporal leakage can cause severe overfitting.

Purging removes samples between train and validation that might contain information about the validation period.

Embargo adds a gap after validation to prevent using future information.
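
To make the purging idea concrete, here is a minimal sketch of a walk-forward split in which the training window stops purge_gap samples before each validation block. The function walk_forward_purged, the equal-sized block layout, and the expanding training window are assumptions for illustration; the embargo is not modeled, and the class may size its folds differently.

```python
def walk_forward_purged(n_samples, n_splits=5, purge_gap=0):
    """Yield (train_idx, val_idx) lists for an expanding-window split.

    The data is cut into n_splits + 1 equal blocks; block k is validated
    against everything before it, minus the last purge_gap samples.
    """
    block = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        val_start = k * block
        # The final validation block absorbs any remainder.
        val_end = n_samples if k == n_splits else val_start + block
        train_end = max(0, val_start - purge_gap)  # drop the purged samples
        yield list(range(train_end)), list(range(val_start, val_end))


splits = list(walk_forward_purged(n_samples=60, n_splits=5, purge_gap=3))
first_train, first_val = splits[0]
# first_train covers samples 0..6, first_val covers 10..19;
# samples 7, 8 and 9 are purged.
```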

Parameters:
  • n_splits (int, default=5) – Number of folds.

  • purge_gap (int, default=0) – Number of samples to purge between train and validation.

  • embargo_pct (float, default=0.01) – Percentage of test data to embargo after each split.

  • max_train_size (int, optional) – Maximum size of training set (rolling window).

Examples

>>> cv = PurgedTimeSeriesSplit(n_splits=5, purge_gap=10, embargo_pct=0.01)
>>> for train_idx, val_idx in cv.split(X):
...     # train_idx ends purge_gap samples before val_idx starts
...     pass
get_n_splits(X=None, y=None, groups=None)[source]

Return the number of splits.

Return type:

int

Parameters:
  • X (Any | None)

  • y (Any | None)

  • groups (Any | None)

split(X, y=None, groups=None)[source]

Generate train/validation indices with purging and embargo.

Parameters:
  • X (array-like) – Training data.

  • y (array-like, optional) – Target variable (ignored).

  • groups (array-like, optional) – Group labels (ignored).

Yields:
  • train_idx (ndarray) – Training indices for this fold.

  • val_idx (ndarray) – Validation indices for this fold.

Return type:

Generator[tuple[ndarray, ndarray], None, None]

class endgame.validation.StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=None)[source]

Bases: BaseCrossValidator

Stratified K-Fold that respects groups.

Combines stratification (maintaining class balance) with group constraints (keeping all samples from a group in the same fold).

Essential when samples are related (e.g., patient_id, user_id) to prevent data leakage.
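
One way to combine the two constraints is a greedy assignment: each group goes, whole, to whichever fold keeps the per-class counts most even. The sketch below is an illustrative stand-in under that assumption, not the library's algorithm; assign_group_folds is a hypothetical helper and it omits shuffling.

```python
from collections import Counter, defaultdict


def assign_group_folds(y, groups, n_splits):
    """Greedily map each group to one fold while balancing class counts."""
    class_counts = defaultdict(Counter)  # group -> Counter of its labels
    for label, group in zip(y, groups):
        class_counts[group][label] += 1
    classes = sorted(set(y))
    fold_totals = [Counter() for _ in range(n_splits)]
    fold_of = {}

    def imbalance_if_added(group, fold):
        # Sum of squared per-fold class counts after hypothetically adding
        # `group` to `fold`; for a fixed total this is smallest when the
        # counts are spread evenly.
        total = 0
        for c in classes:
            for f in range(n_splits):
                added = class_counts[group][c] if f == fold else 0
                total += (fold_totals[f][c] + added) ** 2
        return total

    # Place the largest groups first; they are the hardest to balance.
    for group in sorted(class_counts, key=lambda g: -sum(class_counts[g].values())):
        best = min(range(n_splits), key=lambda f: imbalance_if_added(group, f))
        fold_totals[best].update(class_counts[group])
        fold_of[group] = best
    return [fold_of[g] for g in groups]


# Toy data (hypothetical): 6 patients, 4 rows each, one positive per patient.
patient_ids = [p for p in range(6) for _ in range(4)]
y = [0, 0, 0, 1] * 6
fold_ids = assign_group_folds(y, patient_ids, n_splits=3)
```

Because every row of a group receives the same fold id, no patient can appear on both sides of a split.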

Parameters:
  • n_splits (int, default=5) – Number of folds.

  • shuffle (bool, default=True) – Whether to shuffle groups before splitting.

  • random_state (int, optional) – Random seed for reproducibility.

Examples

>>> cv = StratifiedGroupKFold(n_splits=5)
>>> for train_idx, val_idx in cv.split(X, y, groups=patient_ids):
...     # No patient appears in both train and val
...     pass
get_n_splits(X=None, y=None, groups=None)[source]

Return the number of splits.

Return type:

int

Parameters:
  • X (Any | None)

  • y (Any | None)

  • groups (Any | None)

split(X, y, groups)[source]

Generate stratified group-aware train/validation indices.

Parameters:
  • X (array-like) – Training data.

  • y (array-like) – Target variable for stratification.

  • groups (array-like) – Group labels (e.g., patient_id).

Yields:
  • train_idx (ndarray) – Training indices for this fold.

  • val_idx (ndarray) – Validation indices for this fold.

Return type:

Generator[tuple[ndarray, ndarray], None, None]

class endgame.validation.MultilabelStratifiedKFold(n_splits=5, shuffle=True, random_state=None)[source]

Bases: BaseCrossValidator

Stratified K-Fold for multilabel classification.

Maintains label distribution across folds for multilabel problems using iterative stratification.

Parameters:
  • n_splits (int, default=5) – Number of folds.

  • shuffle (bool, default=True) – Whether to shuffle before splitting.

  • random_state (int, optional) – Random seed for reproducibility.

Examples

>>> # y is shape (n_samples, n_labels) with binary labels
>>> cv = MultilabelStratifiedKFold(n_splits=5)
>>> for train_idx, val_idx in cv.split(X, y):
...     # Label proportions maintained across folds
...     pass
get_n_splits(X=None, y=None, groups=None)[source]

Return the number of splits.

Return type:

int

Parameters:
  • X (Any | None)

  • y (Any | None)

  • groups (Any | None)

split(X, y, groups=None)[source]

Generate multilabel-stratified train/validation indices.

Uses iterative stratification algorithm to maintain label proportions.

Parameters:
  • X (array-like) – Training data.

  • y (array-like of shape (n_samples, n_labels)) – Multilabel target matrix.

  • groups (array-like, optional) – Ignored.

Yields:
  • train_idx (ndarray) – Training indices for this fold.

  • val_idx (ndarray) – Validation indices for this fold.

Return type:

Generator[tuple[ndarray, ndarray], None, None]

class endgame.validation.AdversarialKFold(n_splits=5, test_similarity_threshold=0.5, random_state=None)[source]

Bases: BaseCrossValidator

K-Fold that balances folds by test-similarity.

Uses adversarial validation to identify training samples that look most like test data, then ensures each fold has similar proportions of test-like samples.
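
The balancing step amounts to stratifying on a binary "test-like" flag. A minimal sketch, assuming similarity scores are already available per training row (for example from an adversarial classifier); the helper adversarial_fold_ids and its round-robin dealing are hypothetical and only illustrate the idea.

```python
import random
from collections import Counter


def adversarial_fold_ids(similarity, n_splits=5, threshold=0.5, seed=0):
    """Assign a fold id to each row, stratified on the test-like flag."""
    rng = random.Random(seed)
    fold_ids = [None] * len(similarity)
    for want_test_like in (True, False):
        # Collect the rows in this stratum and shuffle them.
        stratum = [
            i for i, s in enumerate(similarity)
            if (s >= threshold) == want_test_like
        ]
        rng.shuffle(stratum)
        # Deal rows round-robin so each fold gets an equal share.
        for position, row in enumerate(stratum):
            fold_ids[row] = position % n_splits
    return fold_ids


# Toy scores (hypothetical): 10 test-like rows and 20 ordinary rows.
similarity = [0.9] * 10 + [0.1] * 20
fold_ids = adversarial_fold_ids(similarity, n_splits=5)
test_like_per_fold = Counter(f for f, s in zip(fold_ids, similarity) if s >= 0.5)
```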

Parameters:
  • n_splits (int, default=5) – Number of folds.

  • test_similarity_threshold (float, default=0.5) – Threshold for considering a sample “test-like”.

  • random_state (int, optional) – Random seed for reproducibility.

Examples

>>> cv = AdversarialKFold(n_splits=5)
>>> for train_idx, val_idx in cv.split(X_train, y, X_test=X_test):
...     # Each fold has similar proportion of test-like samples
...     pass
get_n_splits(X=None, y=None, groups=None)[source]

Return the number of splits.

Return type:

int

Parameters:
  • X (Any | None)

  • y (Any | None)

  • groups (Any | None)

fit(X_train, X_test)[source]

Compute test similarity scores for training samples.

Parameters:
  • X_train (array-like) – Training features.

  • X_test (array-like) – Test features.

Return type:

AdversarialKFold

Returns:

self

split(X, y=None, groups=None, X_test=None)[source]

Generate adversarial-aware train/validation indices.

Parameters:
  • X (array-like) – Training data.

  • y (array-like, optional) – Target variable.

  • groups (array-like, optional) – Ignored.

  • X_test (array-like, optional) – Test data for computing similarity (if not already fit).

Yields:
  • train_idx (ndarray) – Training indices for this fold.

  • val_idx (ndarray) – Validation indices for this fold.

Return type:

Generator[tuple[ndarray, ndarray], None, None]

set_fit_request(*, X_test='$UNCHANGED$', X_train='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the fit method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to fit if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to fit.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in fit.

  • X_train (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_train parameter in fit.

  • self (AdversarialKFold)

Returns:

self (object) – The updated object.

Return type:

AdversarialKFold

set_split_request(*, X_test='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the split method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to split if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to split.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • X_test (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_test parameter in split.

  • self (AdversarialKFold)

Returns:

self (object) – The updated object.

Return type:

AdversarialKFold

class endgame.validation.RepeatedStratifiedGroupKFold(n_splits=5, n_repeats=3, random_state=None)[source]

Bases: BaseCrossValidator

Repeated Stratified Group K-Fold.

Runs multiple iterations of StratifiedGroupKFold with different random seeds for more robust CV estimates.

Parameters:
  • n_splits (int, default=5) – Number of folds per repeat.

  • n_repeats (int, default=3) – Number of times to repeat the splits.

  • random_state (int, optional) – Random seed for reproducibility.

get_n_splits(X=None, y=None, groups=None)[source]

Return the total number of splits.

Return type:

int

Parameters:
  • X (Any | None)

  • y (Any | None)

  • groups (Any | None)

split(X, y, groups)[source]

Generate repeated stratified group-aware splits.

Return type:

Generator[tuple[ndarray, ndarray], None, None]

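Parameters:
  • X (array-like) – Training data.

  • y (array-like) – Target variable for stratification.

  • groups (array-like) – Group labels (e.g., patient_id).
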
class endgame.validation.CombinatorialPurgedKFold(n_folds=10, n_test_folds=2, purge_gap=0, embargo_pct=0.0)[source]

Bases: BaseCrossValidator

Combinatorial Purged Cross-Validation for time series/financial data.

Implements the CPCV method from Marcos López de Prado’s “Advances in Financial Machine Learning” (Chapter 12). This method:

  1. Divides data into N sequential groups (folds)

  2. Uses combinations of k groups as test sets (C(N,k) total splits)

  3. Applies purging to remove training samples that overlap with test labels

  4. Applies embargo to remove training samples too close to test periods

This generates multiple “backtest paths” that can be recombined to compute statistics like the distribution of Sharpe ratios, enabling detection of backtest overfitting.
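
The four steps above can be sketched with itertools.combinations. This is a simplified illustration under stated assumptions: cpcv_splits is a hypothetical helper, purging is applied on both sides of each test block, and the embargo is given directly as a sample count rather than as embargo_pct.

```python
from itertools import combinations


def cpcv_splits(n_samples, n_folds=6, n_test_folds=2, purge_gap=0, embargo=0):
    """Yield (train_idx, test_idx) lists for combinatorial purged CV."""
    # Step 1: contiguous, near-equal sequential groups.
    base, extra = divmod(n_samples, n_folds)
    bounds, start = [], 0
    for f in range(n_folds):
        stop = start + base + (1 if f < extra else 0)
        bounds.append((start, stop))
        start = stop
    # Step 2: every combination of n_test_folds groups is one test set.
    for test_folds in combinations(range(n_folds), n_test_folds):
        test, excluded = set(), set()
        for f in test_folds:
            begin, end = bounds[f]
            test.update(range(begin, end))
            # Step 3: purge samples just before and just after the block.
            excluded.update(range(max(0, begin - purge_gap), begin))
            # Step 4: extend the trailing exclusion by the embargo.
            excluded.update(range(end, min(n_samples, end + purge_gap + embargo)))
        train = [i for i in range(n_samples) if i not in test and i not in excluded]
        yield train, sorted(test)


splits = list(cpcv_splits(60, n_folds=6, n_test_folds=2, purge_gap=2, embargo=1))
# C(6, 2) = 15 splits; the first one tests on samples 0..19 and trains on
# 23..59, because samples 20, 21 and 22 are purged or embargoed.
```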

Parameters:
  • n_folds (int, default=10) – Number of sequential groups to divide the data into. Must be >= 3.

  • n_test_folds (int, default=2) – Number of folds to use as test set in each split. Must be >= 1 and < n_folds. Total number of splits = C(n_folds, n_test_folds).

  • purge_gap (int, default=0) – Number of samples to purge (remove) from training set at boundaries with test set. These are samples whose labels might overlap with the test period.

  • embargo_pct (float, default=0.0) – Percentage of total samples to embargo after each test period. Embargo removes training samples that occur immediately after test samples to prevent lookahead bias from label leakage.

n_splits

Total number of train/test splits = C(n_folds, n_test_folds).

Type:

int

n_test_paths

Number of reconstructible test paths from combinations.

Type:

int

fold_bounds_

Start and end indices for each fold (set after split is called).

Type:

List[Tuple[int, int]]

Notes

The key insight of CPCV is that standard k-fold CV produces only ONE backtest path (the concatenation of all test folds). CPCV produces MULTIPLE backtest paths by using combinations of test folds, enabling statistical analysis of strategy performance across different scenarios.

For example, with n_folds=6 and n_test_folds=2:

  • Standard KFold: 6 splits, 1 backtest path

  • CPCV: C(6,2)=15 splits, multiple backtest paths
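
The counts can be checked with a few lines of arithmetic. In López de Prado's formulation, each of the N groups appears in the test set of C(N,k)·k/N splits, which is also the number of backtest paths. The helper cpcv_counts is hypothetical, and it is an assumption that the n_test_paths property follows this same formula.

```python
from math import comb


def cpcv_counts(n_folds, n_test_folds):
    """Return (number of splits, number of backtest paths) for CPCV."""
    n_splits = comb(n_folds, n_test_folds)
    # Each split tests n_test_folds groups, spread evenly over n_folds groups.
    n_paths = n_splits * n_test_folds // n_folds
    return n_splits, n_paths


splits_6_2, paths_6_2 = cpcv_counts(6, 2)     # 15 splits, 5 paths
splits_10_2, paths_10_2 = cpcv_counts(10, 2)  # 45 splits, 9 paths
```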

References

López de Prado, M. (2018). “Advances in Financial Machine Learning”. Chapter 12: Backtesting through Cross-Validation.

Examples

>>> from endgame.validation import CombinatorialPurgedKFold
>>> import numpy as np
>>>
>>> # Financial time series with 1000 samples
>>> X = np.random.randn(1000, 10)
>>> y = np.random.randn(1000)
>>>
>>> # Use 6 folds, 2 test folds per split, with purging and embargo
>>> cpcv = CombinatorialPurgedKFold(
...     n_folds=6,
...     n_test_folds=2,
...     purge_gap=10,
...     embargo_pct=0.01,
... )
>>>
>>> print(f"Number of splits: {cpcv.get_n_splits()}")  # 15 splits
>>>
>>> for train_idx, test_idx in cpcv.split(X):
...     # Train model on train_idx, evaluate on test_idx
...     pass
>>>
>>> # Get backtest paths for strategy analysis
>>> paths = cpcv.get_test_paths(X)
>>> print(f"Number of backtest paths: {len(paths)}")
property n_splits: int

Total number of train/test splits.

property n_test_paths: int

Number of reconstructible backtest paths.

Each path is a complete sequence through the data using different combinations of the test folds.

get_n_splits(X=None, y=None, groups=None)[source]

Return the number of splits.

Return type:

int

Parameters:
  • X (Any | None)

  • y (Any | None)

  • groups (Any | None)

split(X, y=None, groups=None)[source]

Generate combinatorial purged train/test splits.

Parameters:
  • X (array-like) – Training data. Used only to determine the number of samples.

  • y (array-like, optional) – Target variable (ignored, but accepted for sklearn compatibility).

  • groups (array-like, optional) – Group labels (ignored).

Yields:
  • train_idx (np.ndarray) – Training indices for this split (purged and embargoed).

  • test_idx (np.ndarray) – Test indices for this split.

Return type:

Generator[tuple[ndarray, ndarray], None, None]

get_test_paths(X)[source]

Reconstruct all possible backtest paths from the splits.

A backtest path is a sequence of test sets that together cover the entire dataset in temporal order. CPCV allows reconstructing multiple such paths from the combinatorial splits.

Parameters:

X (array-like) – Training data (used only to determine size).

Return type:

list[list[ndarray]]

Returns:

List[List[np.ndarray]] – List of paths, where each path is a list of test index arrays that together form a complete pass through the data.

get_fold_info(X)[source]

Get detailed information about the fold structure.

Parameters:

X (array-like) – Training data.

Return type:

dict[str, Any]

Returns:

Dict[str, Any] – Dictionary containing:

  • n_samples: Total number of samples

  • n_folds: Number of folds

  • n_test_folds: Number of test folds per split

  • n_splits: Total number of splits

  • n_test_paths: Number of backtest paths

  • fold_sizes: List of fold sizes

  • purge_gap: Purge gap setting

  • embargo_size: Embargo size in samples

endgame.validation.cross_validate_oof(estimator, X, y, cv=5, scoring=None, fit_params=None, return_models=True, return_indices=False, groups=None, verbose=False)[source]

Perform cross-validation and return out-of-fold predictions.

This is the standard approach for building stacked ensembles and getting unbiased training set predictions.
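
The defining property of out-of-fold (OOF) predictions is that each row is predicted by a model that never saw it during training. The sketch below shows that bookkeeping with a deliberately trivial "model" (the mean of the training targets); oof_predictions and its interleaved fold layout are illustrative assumptions, not the function's implementation.

```python
from statistics import mean


def oof_predictions(y, n_splits=5):
    """Return one out-of-fold prediction per row using a mean predictor."""
    n = len(y)
    oof = [None] * n
    for fold in range(n_splits):
        val_idx = range(fold, n, n_splits)  # interleaved validation rows
        held_out = set(val_idx)
        # "Fit" on the training rows only: the model is their mean target.
        fold_model = mean(y[i] for i in range(n) if i not in held_out)
        # Predict only the rows this fold's model did not see.
        for i in val_idx:
            oof[i] = fold_model
    return oof


y = [1.0, 2.0, 3.0, 4.0]
oof = oof_predictions(y, n_splits=2)
# Rows 0 and 2 are predicted from rows 1 and 3, and vice versa.
```

Because no prediction leaks its own target, the OOF vector can serve as a training feature for a second-level (stacked) model.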

Parameters:
  • estimator (sklearn-compatible estimator) – The model to cross-validate.

  • X (array-like of shape (n_samples, n_features)) – Training features.

  • y (array-like of shape (n_samples,)) – Target values.

  • cv (int or CV splitter, default=5) – Cross-validation strategy.

  • scoring (str or callable, optional) – Scoring metric. If None, uses estimator’s default.

  • fit_params (dict, optional) – Additional parameters to pass to estimator.fit().

  • return_models (bool, default=True) – Whether to return trained models from each fold.

  • return_indices (bool, default=False) – Whether to return train/val indices for each fold.

  • groups (array-like, optional) – Group labels for group-aware CV.

  • verbose (bool, default=False) – Print fold scores during cross-validation.

Return type:

OOFResult

Returns:

OOFResult

  • oof_predictions: Out-of-fold predictions

  • fold_scores: Validation score for each fold

  • mean_score: Mean score across folds

  • std_score: Standard deviation of scores

  • models: List of trained models (if return_models=True)

  • fold_indices: List of (train_idx, val_idx) tuples

Examples

>>> from endgame.validation import cross_validate_oof
>>> result = cross_validate_oof(model, X, y, cv=5, scoring='roc_auc')
>>> print(f"CV Score: {result.mean_score:.4f} ± {result.std_score:.4f}")
endgame.validation.check_cv_lb_correlation(cv_scores, lb_scores)[source]

Compute correlation between CV and leaderboard scores.

Helps validate CV strategy by checking if CV improvements translate to LB improvements.

Parameters:
  • cv_scores (List[float]) – Cross-validation scores from different experiments.

  • lb_scores (List[float]) – Corresponding public leaderboard scores.

Return type:

dict[str, float]

Returns:

Dict[str, float]

  • pearson: Pearson correlation coefficient

  • spearman: Spearman rank correlation

  • rmse: RMSE between normalized scores
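
For reference, the two correlation measures can be computed by hand as follows. This is a standard-library sketch with hypothetical helper names (pearson, average_ranks, spearman); it assumes non-constant inputs and leaves out the rmse entry, since the normalization used for it is not specified here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


cv_scores = [0.85, 0.86, 0.87, 0.88]
lb_tracking = [0.82, 0.83, 0.84, 0.85]  # LB moves in step with CV
lb_erratic = [0.84, 0.82, 0.85, 0.83]   # LB ordering unrelated to CV

rho_tracking = spearman(cv_scores, lb_tracking)
rho_erratic = spearman(cv_scores, lb_erratic)
```

A rank correlation near 1 suggests CV improvements carry over to the leaderboard; a value near 0 suggests the CV scheme needs rethinking.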

Examples

>>> cv_scores = [0.85, 0.86, 0.87, 0.88]
>>> lb_scores = [0.82, 0.83, 0.84, 0.85]
>>> result = check_cv_lb_correlation(cv_scores, lb_scores)
>>> print(f"Correlation: {result['pearson']:.3f}")
class endgame.validation.NestedCV(estimator=None, search=None, outer_cv=5, scoring='auto', return_oof=True, random_state=None, verbose=0)[source]

Bases: object

Nested cross-validation for unbiased model evaluation.

The inner loop performs model selection (hyperparameter tuning or algorithm comparison) and the outer loop estimates generalization performance using the best model from each inner fold.
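
The two loops can be sketched without any library. In this toy version the "model" predicts alpha times the training mean, the inner loop picks alpha by cross-validated squared error on the outer training rows only, and the outer loop scores the chosen alpha on rows the inner loop never touched. All names here (kfold, mse, nested_cv) and the toy model are assumptions for illustration, not the class's implementation.

```python
import random
from statistics import mean


def kfold(indices, k):
    """Interleaved k-fold over a list of indices: yields (train, val)."""
    for f in range(k):
        val = indices[f::k]
        held_out = set(val)
        yield [i for i in indices if i not in held_out], val


def mse(y, idx, prediction):
    """Mean squared error of a constant prediction on the rows in idx."""
    return mean((y[i] - prediction) ** 2 for i in idx)


def nested_cv(y, alphas, outer_k=3, inner_k=2):
    outer_scores, best_params = [], []
    for train, test in kfold(list(range(len(y))), outer_k):

        def inner_score(alpha):
            # Inner loop: model selection sees the outer training rows only.
            return mean(
                mse(y, va, alpha * mean(y[i] for i in tr))
                for tr, va in kfold(train, inner_k)
            )

        best = min(alphas, key=inner_score)
        best_params.append(best)
        # Outer loop: refit on all outer training rows, score on held-out rows.
        outer_scores.append(mse(y, test, best * mean(y[i] for i in train)))
    return outer_scores, best_params


rng = random.Random(0)
y = [10.0 + rng.gauss(0.0, 1.0) for _ in range(30)]
outer_scores, best_params = nested_cv(y, alphas=[0.5, 1.0, 1.5])
```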

Parameters:
  • estimator (estimator or None) – Base estimator to evaluate. If search is provided, this is ignored (the search object contains the estimator).

  • search (estimator with fit/predict or None) – A search object (e.g., GridSearchCV, RandomizedSearchCV, OptunaOptimizer) that performs inner-loop model selection. Must have best_estimator_ and best_params_ after fitting. If None, estimator is used directly without inner tuning.

  • outer_cv (int or CV splitter, default=5) – Number of outer folds or a CV splitter object.

  • scoring (str or callable, default='auto') – Scoring metric. ‘auto’ uses accuracy for classifiers, r2 for regressors. Can be a string key or a callable(y_true, y_pred).

  • return_oof (bool, default=True) – Whether to return out-of-fold predictions.

  • random_state (int or None, default=None) – Random state for reproducibility.

  • verbose (int, default=0) – Verbosity level. 0=silent, 1=progress, 2=detailed.

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import GridSearchCV
>>>
>>> # With hyperparameter search
>>> search = GridSearchCV(
...     RandomForestClassifier(random_state=42),
...     param_grid={'n_estimators': [50, 100, 200]},
...     cv=3, scoring='accuracy', refit=True
... )
>>> ncv = NestedCV(search=search, outer_cv=5)
>>> result = ncv.evaluate(X, y)
>>>
>>> # Without search (just evaluate a fixed model)
>>> ncv = NestedCV(estimator=RandomForestClassifier(n_estimators=100))
>>> result = ncv.evaluate(X, y)
evaluate(X, y, groups=None)[source]

Run nested cross-validation.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training features.

  • y (array-like of shape (n_samples,)) – Target values.

  • groups (array-like of shape (n_samples,), optional) – Group labels for GroupKFold-style splitting.

Return type:

NestedCVResult

Returns:

NestedCVResult – Results containing scores, best params, and OOF predictions.

class endgame.validation.NestedCVResult(outer_scores=<factory>, mean_score=0.0, std_score=0.0, best_params=<factory>, oof_predictions=None, inner_scores=<factory>, scoring='accuracy')[source]

Bases: object

Results from nested cross-validation.

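Parameters:
  • outer_scores (list[float])

  • mean_score (float)

  • std_score (float)

  • best_params (list[dict[str, Any]])

  • oof_predictions (ndarray | None)

  • inner_scores (list[float])

  • scoring (str)
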
outer_scores

Score for each outer fold.

Type:

list of float

mean_score

Mean of outer fold scores.

Type:

float

std_score

Standard deviation of outer fold scores.

Type:

float

best_params

Best parameters found in each outer fold’s inner search.

Type:

list of dict

oof_predictions

Out-of-fold predictions (if return_oof=True).

Type:

ndarray or None

inner_scores

Best inner CV score for each outer fold.

Type:

list of float

scoring

Metric name used.

Type:

str

outer_scores: list[float]
mean_score: float = 0.0
std_score: float = 0.0
best_params: list[dict[str, Any]]
oof_predictions: ndarray | None = None
inner_scores: list[float]
scoring: str = 'accuracy'