Benchmark

class endgame.benchmark.SuiteLoader(suite='sklearn-classic', max_datasets=None, max_samples=None, max_features=None, cache_dir=None, random_state=42, verbose=True)[source]

Bases: object

Load benchmark datasets from various sources.

Supports OpenML benchmark suites, sklearn built-in datasets, and custom datasets. Provides a standardized interface for benchmark experiments.

Parameters:
  • suite (str or List[int]) – Suite name (e.g., “OpenML-CC18”) or list of OpenML task IDs.

  • max_datasets (int, optional) – Maximum number of datasets to load.

  • max_samples (int, optional) – Maximum samples per dataset (larger datasets are sampled).

  • max_features (int, optional) – Maximum features per dataset.

  • cache_dir (str, optional) – Directory for caching downloaded datasets.

  • random_state (int, default=42) – Random seed for sampling.

  • verbose (bool, default=True) – Enable verbose output.

Examples

>>> loader = SuiteLoader(suite="sklearn-classic")
>>> for dataset in loader.load():
...     print(f"{dataset.name}: {dataset.n_samples} samples, {dataset.n_features} features")

>>> loader = SuiteLoader(suite="OpenML-CC18", max_datasets=10)
>>> datasets = list(loader.load())

load()[source]

Load datasets from the suite.

Yields:

DatasetInfo – Dataset information and data.

Return type:

Generator[DatasetInfo, None, None]

static list_suites()[source]

List available benchmark suites.

Return type:

dict[str, str]

static get_suite_info(suite_name)[source]

Get detailed information about a suite.

Return type:

dict[str, Any]

Parameters:

suite_name (str)

class endgame.benchmark.DatasetInfo(name, task_type, X, y, feature_names=<factory>, categorical_indicator=<factory>, n_samples=0, n_features=0, n_classes=0, class_distribution=<factory>, source='unknown', openml_id=None, cv_splits=None, metadata=<factory>)[source]

Bases: object

Container for dataset information and data.

Parameters:
name

Name of the dataset.

Type:

str

task_type

Type of ML task.

Type:

TaskType

X

Feature matrix.

Type:

np.ndarray

y

Target variable.

Type:

np.ndarray

feature_names

Names of features.

Type:

List[str]

categorical_indicator

Boolean mask indicating categorical features.

Type:

List[bool]

n_samples

Number of samples.

Type:

int

n_features

Number of features.

Type:

int

n_classes

Number of classes (for classification).

Type:

int

class_distribution

Distribution of classes.

Type:

Dict[Any, int]

source

Source of the dataset (e.g., ‘openml’, ‘sklearn’).

Type:

str

openml_id

OpenML dataset ID if applicable.

Type:

Optional[int]

cv_splits

Predefined cross-validation splits.

Type:

Optional[List[Tuple[np.ndarray, np.ndarray]]]

metadata

Additional metadata.

Type:

Dict[str, Any]

name: str
task_type: TaskType
X: ndarray
y: ndarray
feature_names: list[str]
categorical_indicator: list[bool]
n_samples: int = 0
n_features: int = 0
n_classes: int = 0
class_distribution: dict[Any, int]
source: str = 'unknown'
openml_id: int | None = None
cv_splits: list[tuple[ndarray, ndarray]] | None = None
metadata: dict[str, Any]

property n_categorical: int

Number of categorical features.

property n_numerical: int

Number of numerical features.

property imbalance_ratio: float

Class imbalance ratio (max_count / min_count).
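
The ratio described above can be illustrated with a minimal sketch. The helper name imbalance_ratio and its list input are assumptions made for illustration; the property itself presumably reads the counts from class_distribution.

```python
from collections import Counter


def imbalance_ratio(y):
    """Return the largest class count divided by the smallest class count."""
    counts = Counter(y)
    if not counts:
        raise ValueError("y must contain at least one label")
    return max(counts.values()) / min(counts.values())


# Hypothetical labels: 90 samples of class "a", 10 of class "b".
labels = ["a"] * 90 + ["b"] * 10
ratio = imbalance_ratio(labels)
print(ratio)
```

A perfectly balanced dataset gives a ratio of 1.0; larger values indicate stronger imbalance.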

get_cv_splits(n_splits=10, shuffle=True, random_state=42)[source]

Get cross-validation splits.

Returns predefined splits if available, otherwise generates new ones.

Return type:

list[tuple[ndarray, ndarray]]

Parameters:
  • n_splits (int)

  • shuffle (bool)

  • random_state (int)
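
The "predefined splits first, otherwise generate" behaviour can be sketched in plain Python. This is an illustrative stand-in, not the library's implementation: the function takes n_samples and predefined as explicit arguments (both assumptions), returns index lists rather than arrays, and uses plain unstratified K-fold because the source does not say whether stratification is applied.

```python
import random


def get_cv_splits(n_samples, predefined=None, n_splits=10, shuffle=True, random_state=42):
    """Return predefined (train, test) splits if given, else build K-fold splits."""
    if predefined is not None:
        return predefined

    indices = list(range(n_samples))
    if shuffle:
        random.Random(random_state).shuffle(indices)

    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    fold_sizes = [base + (1 if i < extra else 0) for i in range(n_splits)]

    splits = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        splits.append((train_idx, test_idx))
        start += size
    return splits


splits = get_cv_splits(n_samples=25, n_splits=5)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```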

class endgame.benchmark.MetaProfiler(groups=None, use_pymfe=True, landmarking_cv=3, random_state=42, verbose=False)[source]

Bases: object

Extract meta-features from datasets for meta-learning.

Uses pymfe when available, with fallback to pure numpy/sklearn implementations.

Parameters:
  • groups (List[str], optional) – Meta-feature groups to extract. Default: [“simple”, “statistical”, “info-theory”]. Options: “simple”, “statistical”, “info-theory”, “landmarking”, “complexity”.

  • use_pymfe (bool, default=True) – Use pymfe library when available (more comprehensive features).

  • landmarking_cv (int, default=3) – Number of CV folds for landmarking meta-features.

  • random_state (int, default=42) – Random seed for reproducibility.

  • verbose (bool, default=False) – Enable verbose output.

Examples

>>> profiler = MetaProfiler(groups=["simple", "statistical"])
>>> meta_features = profiler.profile(X, y)
>>> print(meta_features.features)

>>> # With landmarking
>>> profiler = MetaProfiler(groups=["simple", "landmarking"])
>>> meta_features = profiler.profile(X, y)

profile(X, y, categorical_indicator=None, task_type='classification')[source]

Extract meta-features from a dataset.

Parameters:
  • X (np.ndarray) – Feature matrix of shape (n_samples, n_features).

  • y (np.ndarray) – Target variable of shape (n_samples,).

  • categorical_indicator (List[bool], optional) – Boolean mask indicating categorical features.

  • task_type (str, default="classification") – Type of task: “classification” or “regression”.

Return type:

MetaFeatureSet

Returns:

MetaFeatureSet – Extracted meta-features.

get_feature_names()[source]

Get list of all possible meta-feature names.

Return type:

list[str]

class endgame.benchmark.MetaFeatureSet(features=<factory>, groups=<factory>, extraction_time=0.0, errors=<factory>)[source]

Bases: object

Container for extracted meta-features.

Parameters:
features

Dictionary of meta-feature name to value.

Type:

Dict[str, float]

groups

Mapping from group name to feature names in that group.

Type:

Dict[str, List[str]]

extraction_time

Time taken to extract features (seconds).

Type:

float

errors

Any errors encountered during extraction.

Type:

List[str]

features: dict[str, float]
groups: dict[str, list[str]]
extraction_time: float = 0.0
errors: list[str]

to_dict()[source]

Convert to dictionary.

Return type:

dict[str, float]

to_array(feature_names=None)[source]

Convert to numpy array.

Parameters:

feature_names (List[str], optional) – Specific features to include (in order). If None, uses all features in sorted order.

Return type:

ndarray
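
The ordering rule for feature_names can be sketched as follows. The standalone function is hypothetical and returns a plain list where the method returns an ndarray. How the library handles a requested name that is missing is not documented here, so this sketch simply raises KeyError.

```python
def to_array(features, feature_names=None):
    """Return feature values in the requested order, or sorted by name if None."""
    names = sorted(features) if feature_names is None else list(feature_names)
    return [features[name] for name in names]


# Hypothetical meta-feature values.
features = {"n_features": 4.0, "class_entropy": 1.58, "n_samples": 150.0}
print(to_array(features))
print(to_array(features, ["n_samples", "n_features"]))
```

A fixed, explicit feature_names list keeps columns aligned when vectors from several datasets are stacked into one matrix.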

get_group(group)[source]

Get features from a specific group.

Return type:

dict[str, float]

Parameters:

group (str)

class endgame.benchmark.ExperimentTracker(name='benchmark', auto_save=False, save_path=None)[source]

Bases: object

Track and store experiment results.

Provides methods for logging experiments, querying results, and exporting to various formats.

Parameters:
  • name (str, default="benchmark") – Name for this tracking session.

  • auto_save (bool, default=False) – Automatically save after each experiment.

  • save_path (str, optional) – Path for auto-saving results.

Examples

>>> tracker = ExperimentTracker(name="my_benchmark")
>>> tracker.log_experiment(
...     dataset_name="iris",
...     model_name="RandomForest",
...     metrics={"accuracy": 0.95, "f1": 0.94},
...     hyperparameters={"n_estimators": 100},
... )
>>> df = tracker.to_dataframe()

log_experiment(dataset_name, model_name, metrics, hyperparameters=None, pipeline_config=None, meta_features=None, cv_scores=None, fit_time=0.0, predict_time=0.0, memory_mb=0.0, n_samples=0, n_features=0, task_type='classification', dataset_id=None, status='success', error_message=None, tags=None, notes='', model_structure=None)[source]

Log a single experiment.

Parameters:
  • dataset_name (str) – Name of the dataset.

  • model_name (str) – Name of the model/pipeline.

  • metrics (Dict[str, float]) – Performance metrics.

  • hyperparameters (Dict, optional) – Model hyperparameters.

  • pipeline_config (Dict, optional) – Full pipeline configuration.

  • meta_features (Dict, optional) – Dataset meta-features.

  • cv_scores (List[float], optional) – Per-fold CV scores.

  • fit_time (float) – Training time in seconds.

  • predict_time (float) – Prediction time in seconds.

  • memory_mb (float) – Peak memory usage in MB.

  • n_samples (int) – Number of samples.

  • n_features (int) – Number of features.

  • task_type (str) – Task type.

  • dataset_id (str, optional) – External dataset ID.

  • status (str) – Experiment status.

  • error_message (str, optional) – Error message if failed.

  • tags (List[str], optional) – Tags for filtering.

  • notes (str) – Additional notes.

  • model_structure (str | None)

Return type:

ExperimentRecord

Returns:

ExperimentRecord – The logged experiment record.

log_failure(dataset_name, model_name, error_message, **kwargs)[source]

Log a failed experiment.

Return type:

ExperimentRecord

Parameters:
  • dataset_name (str)

  • model_name (str)

  • error_message (str)

property records: list[ExperimentRecord]

Get all experiment records.

get_by_dataset(dataset_name)[source]

Get records for a specific dataset.

Return type:

list[ExperimentRecord]

Parameters:

dataset_name (str)

get_by_model(model_name)[source]

Get records for a specific model.

Return type:

list[ExperimentRecord]

Parameters:

model_name (str)

get_by_tag(tag)[source]

Get records with a specific tag.

Return type:

list[ExperimentRecord]

Parameters:

tag (str)

get_successful()[source]

Get successful experiments only.

Return type:

list[ExperimentRecord]

to_dataframe(include_meta_features=True)[source]

Convert to DataFrame.

Parameters:

include_meta_features (bool, default=True) – Include meta-features as columns.

Returns:

DataFrame – Polars DataFrame (or Pandas if Polars unavailable).

to_dict_list()[source]

Convert to list of dictionaries.

Return type:

list[dict[str, Any]]

save(path, append=False, deduplicate=True)[source]

Save results to file.

Parameters:
  • path (str) – Output path. Supports: .parquet, .csv, .json

  • append (bool, default=False) – If True and file exists, append new records to existing file. If False, overwrite existing file.

  • deduplicate (bool, default=True) – When appending, skip records with duplicate config_hash.

Return type:

None

load(path)[source]

Load results from file.

Parameters:

path (str) – Input path.

Return type:

ExperimentTracker

Returns:

self

summary()[source]

Get summary of tracked experiments.

Return type:

str

clear()[source]

Clear all records.

Return type:

None

get_config_hashes()[source]

Get set of all config hashes in the tracker.

Return type:

set

merge(other, deduplicate=True)[source]

Merge another tracker into this one.

Parameters:
  • other (ExperimentTracker) – Tracker to merge.

  • deduplicate (bool, default=True) – Skip records with duplicate config_hash.

Return type:

ExperimentTracker

Returns:

self
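
Deduplicated merging by config_hash can be sketched with plain dictionaries standing in for ExperimentRecord objects. The helper merge_records and the record layout are assumptions for illustration only; the real method mutates the tracker and returns self.

```python
def merge_records(existing, incoming, deduplicate=True):
    """Append incoming records, optionally skipping config hashes already seen."""
    seen = {rec["config_hash"] for rec in existing}
    merged = list(existing)
    added = 0
    for rec in incoming:
        if deduplicate and rec["config_hash"] in seen:
            continue  # same dataset + model + hyperparameters already recorded
        merged.append(rec)
        seen.add(rec["config_hash"])
        added += 1
    return merged, added


# Hypothetical hashes; "b2" appears in both trackers.
existing = [{"config_hash": "a1"}, {"config_hash": "b2"}]
incoming = [{"config_hash": "b2"}, {"config_hash": "c3"}]
merged, added = merge_records(existing, incoming)
print(added, [rec["config_hash"] for rec in merged])
```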

save_to_master(path=None, deduplicate=True)[source]

Save results to master database, appending to existing records.

This is the primary method for building a meta-learning dataset. New experiments are appended to the master database, with duplicate configurations (same dataset + model + hyperparameters) skipped.

Parameters:
  • path (str or Path, optional) – Path to master database. Defaults to ~/.endgame/meta_learning_db.parquet

  • deduplicate (bool, default=True) – Skip records with duplicate config_hash.

Return type:

int

Returns:

int – Number of new records added.

Examples

>>> tracker = ExperimentTracker()
>>> # ... run experiments ...
>>> n_added = tracker.save_to_master()
>>> print(f"Added {n_added} new experiments to master database")

classmethod load_master(path=None)[source]

Load the master meta-learning database.

Parameters:

path (str or Path, optional) – Path to master database. Defaults to ~/.endgame/meta_learning_db.parquet

Return type:

ExperimentTracker

Returns:

ExperimentTracker – Tracker with all historical experiments.

Examples

>>> tracker = ExperimentTracker.load_master()
>>> print(f"Master database has {len(tracker)} experiments")

static get_master_db_path()[source]

Get the default master database path.

Return type:

Path

Returns:

Path – Default path: ~/.endgame/meta_learning_db.parquet

filter_existing(master_path=None)[source]

Find which (dataset, model) pairs already exist in master DB.

Useful for skipping already-benchmarked combinations.

Parameters:

master_path (str or Path, optional) – Path to master database.

Return type:

list[tuple[str, str]]

Returns:

List[Tuple[str, str]] – List of (dataset_name, model_name) pairs that exist.
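
One way the returned pairs might be used is to prune a planned experiment grid before running it. The helper pending_pairs and the example names are hypothetical; only the (dataset_name, model_name) pair format comes from the description above.

```python
def pending_pairs(planned, existing):
    """Return planned (dataset, model) pairs that are not already benchmarked."""
    done = set(existing)
    return [pair for pair in planned if pair not in done]


# Hypothetical grid and a hypothetical result of filter_existing().
planned = [("iris", "RF"), ("iris", "LR"), ("wine", "RF")]
existing = [("iris", "RF")]
todo = pending_pairs(planned, existing)
print(todo)
```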

class endgame.benchmark.ExperimentRecord(experiment_id='', timestamp='', dataset_name='', dataset_id=None, model_name='', pipeline_config=<factory>, hyperparameters=<factory>, metrics=<factory>, meta_features=<factory>, cv_scores=None, fit_time=0.0, predict_time=0.0, memory_mb=0.0, n_samples=0, n_features=0, task_type='classification', status='pending', error_message=None, tags=<factory>, notes='', model_structure=None, config_hash='')[source]

Bases: object

Single experiment record.

Parameters:
experiment_id

Unique identifier for this experiment.

Type:

str

timestamp

ISO timestamp of when the experiment was run.

Type:

str

dataset_name

Name of the dataset.

Type:

str

dataset_id

External ID (e.g., OpenML ID).

Type:

Optional[str]

model_name

Name/identifier of the model or pipeline.

Type:

str

pipeline_config

Serialized pipeline configuration.

Type:

Dict

hyperparameters

Model hyperparameters.

Type:

Dict

metrics

Performance metrics.

Type:

Dict[str, float]

meta_features

Dataset meta-features.

Type:

Dict[str, float]

cv_scores

Per-fold CV scores.

Type:

Optional[List[float]]

fit_time

Training time in seconds.

Type:

float

predict_time

Prediction time in seconds.

Type:

float

memory_mb

Peak memory usage in MB.

Type:

float

n_samples

Number of training samples.

Type:

int

n_features

Number of features.

Type:

int

task_type

Type of task.

Type:

str

status

Experiment status: “success”, “failed”, “timeout”.

Type:

str

error_message

Error message if failed.

Type:

Optional[str]

tags

User-defined tags.

Type:

List[str]

notes

Additional notes.

Type:

str

experiment_id: str = ''
timestamp: str = ''
dataset_name: str = ''
dataset_id: str | None = None
model_name: str = ''
pipeline_config: dict[str, Any]
hyperparameters: dict[str, Any]
metrics: dict[str, float]
meta_features: dict[str, float]
cv_scores: list[float] | None = None
fit_time: float = 0.0
predict_time: float = 0.0
memory_mb: float = 0.0
n_samples: int = 0
n_features: int = 0
task_type: str = 'classification'
status: str = 'pending'
error_message: str | None = None
tags: list[str]
notes: str = ''
model_structure: str | None = None
config_hash: str = ''

to_dict()[source]

Convert to dictionary.

Return type:

dict[str, Any]

classmethod from_dict(data)[source]

Create from dictionary.

Return type:

ExperimentRecord

Parameters:

data (dict[str, Any])
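
A dictionary round trip of this kind can be sketched with a small dataclass. MiniRecord is a hypothetical stand-in carrying only a few of the fields listed above, and ignoring unknown keys in from_dict is an assumption rather than documented behaviour.

```python
from dataclasses import asdict, dataclass, field, fields
from typing import Any


@dataclass
class MiniRecord:
    """Hypothetical, reduced stand-in for ExperimentRecord."""

    dataset_name: str = ""
    model_name: str = ""
    metrics: dict[str, float] = field(default_factory=dict)
    status: str = "pending"

    def to_dict(self) -> dict[str, Any]:
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "MiniRecord":
        # Assumption: keys that are not fields are dropped silently.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


record = MiniRecord("iris", "RF", {"accuracy": 0.95}, "success")
restored = MiniRecord.from_dict(record.to_dict())
print(restored == record)
```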

endgame.benchmark.get_experiment_hash(dataset_name, model_name, hyperparameters, task_type='classification')[source]

Generate a unique hash for an experiment configuration.

This hash is used to detect duplicate experiments in the master database. Two experiments are considered duplicates if they have the same:

  • dataset name

  • model name

  • hyperparameters

  • task type

Parameters:
  • dataset_name (str) – Name of the dataset.

  • model_name (str) – Name of the model/pipeline.

  • hyperparameters (Dict[str, Any]) – Model hyperparameters.

  • task_type (str) – Task type (classification/regression).

Return type:

str

Returns:

str – SHA256 hash (first 16 characters) uniquely identifying this config.
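
A hash with these properties can be sketched with the standard library. The serialization shown here (JSON with sorted keys, non-JSON values converted with str) is an assumption; the source only states that the result is the first 16 characters of a SHA256 digest over the four inputs, so hashes from this sketch will not necessarily match the library's.

```python
import hashlib
import json


def experiment_hash(dataset_name, model_name, hyperparameters, task_type="classification"):
    """Return a 16-character SHA256 prefix identifying an experiment configuration."""
    payload = json.dumps(
        {
            "dataset": dataset_name,
            "model": model_name,
            "hyperparameters": hyperparameters,
            "task_type": task_type,
        },
        sort_keys=True,  # key order must not change the hash
        default=str,     # assumption: fall back to str() for non-JSON values
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


h1 = experiment_hash("iris", "RF", {"n_estimators": 100, "max_depth": 5})
h2 = experiment_hash("iris", "RF", {"max_depth": 5, "n_estimators": 100})
h3 = experiment_hash("iris", "RF", {"n_estimators": 100, "max_depth": 5}, "regression")
print(h1, h1 == h2, h1 == h3)
```

Sorting the keys before hashing is what makes two dictionaries with the same hyperparameters in a different order count as the same configuration.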

class endgame.benchmark.BenchmarkRunner(suite='sklearn-classic', config=None, max_datasets=None, fast_run=False, verbose=True, **kwargs)[source]

Bases: object

Run systematic benchmarks across datasets and models.

Orchestrates the complete benchmark workflow:

  1. Load datasets from benchmark suite

  2. Profile datasets (extract meta-features)

  3. Run cross-validation for each model on each dataset

  4. Record results with full provenance

Parameters:
  • suite (str, default="sklearn-classic") – Benchmark suite name.

  • config (BenchmarkConfig, optional) – Full configuration object.

  • max_datasets (int, optional) – Override maximum number of datasets.

  • fast_run (bool, default=False) – Quick run with reduced settings.

  • verbose (bool, default=True) – Enable verbose output.

  • **kwargs – Additional configuration parameters.

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.linear_model import LogisticRegression
>>>
>>> models = [
...     ("RF", RandomForestClassifier(n_estimators=100, random_state=42)),
...     ("LR", LogisticRegression(max_iter=1000)),
... ]
>>>
>>> runner = BenchmarkRunner(suite="sklearn-classic")
>>> results = runner.run(models)
>>> print(results.summary())
>>>
>>> # Save results
>>> results.save("benchmark_results.parquet")

run(models, output_file=None, continue_on_error=True)[source]

Run benchmark on all models and datasets.

Parameters:
  • models (List[Union[Tuple[str, BaseEstimator], Tuple[str, BaseEstimator, BaseEstimator]]]) –

    List of model specifications. Each can be either:

      - (name, estimator): Single estimator used for all tasks.

      - (name, classifier, regressor): Pair of estimators; the classifier is used for classification tasks and the regressor for regression tasks. Either can be None to skip that task type.

  • output_file (str, optional) – Path to save results.

  • continue_on_error (bool, default=True) – Continue if a model fails on a dataset.

Return type:

ExperimentTracker

Returns:

ExperimentTracker – Tracker with all experiment results.
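
The two accepted forms of a model specification can be resolved per task type as in the sketch below. The helper select_estimator is hypothetical and strings stand in for estimator objects; it only illustrates the selection rule described for the models parameter.

```python
def select_estimator(spec, task_type):
    """Return (name, estimator) for a task; estimator is None when the task is skipped."""
    if len(spec) == 2:
        name, estimator = spec  # one estimator used for every task type
        return name, estimator
    name, classifier, regressor = spec
    estimator = classifier if task_type == "classification" else regressor
    return name, estimator


# Placeholder strings stand in for fitted or unfitted estimator objects.
specs = [
    ("LR", "LogisticRegression()"),
    ("RF", "RandomForestClassifier()", "RandomForestRegressor()"),
    ("SVC", "SVC()", None),  # skipped on regression tasks
]
resolved = [select_estimator(spec, "regression") for spec in specs]
print(resolved)
```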

property tracker: ExperimentTracker

Get the experiment tracker.

property datasets: list[DatasetInfo]

Get loaded datasets.

property meta_features: dict[str, MetaFeatureSet]

Get extracted meta-features.

get_results_dataframe()[source]

Get results as DataFrame.

class endgame.benchmark.BenchmarkConfig(suite='sklearn-classic', max_datasets=None, max_samples=None, cv_folds=5, scoring_classification=<factory>, scoring_regression=<factory>, profile_datasets=True, profile_groups=<factory>, cache_meta_features=True, meta_features_cache_dir=None, timeout_per_fit=300, n_jobs=1, random_state=42, verbose=True, skip_completed=True)[source]

Bases: object

Configuration for benchmark runs.

Parameters:
  • suite (str)

  • max_datasets (int | None)

  • max_samples (int | None)

  • cv_folds (int)

  • scoring_classification (list[str])

  • scoring_regression (list[str])

  • profile_datasets (bool)

  • profile_groups (list[str])

  • cache_meta_features (bool)

  • meta_features_cache_dir (str | None)

  • timeout_per_fit (int)

  • n_jobs (int)

  • random_state (int)

  • verbose (bool)

  • skip_completed (bool)

suite

Benchmark suite name or list of task IDs.

Type:

str

max_datasets

Maximum number of datasets to run.

Type:

int, optional

max_samples

Maximum samples per dataset.

Type:

int, optional

cv_folds

Number of cross-validation folds.

Type:

int

scoring_classification

Metrics for classification tasks.

Type:

List[str]

scoring_regression

Metrics for regression tasks.

Type:

List[str]

profile_datasets

Whether to extract meta-features.

Type:

bool

profile_groups

Meta-feature groups to extract.

Type:

List[str]

cache_meta_features

Whether to cache meta-features to disk.

Type:

bool

meta_features_cache_dir

Directory to cache meta-features. Defaults to ~/.cache/endgame/meta_features.

Type:

str, optional

timeout_per_fit

Timeout per model fit in seconds.

Type:

int

n_jobs

Number of parallel jobs for CV.

Type:

int

random_state

Random seed.

Type:

int

verbose

Enable verbose output.

Type:

bool

skip_completed

Skip experiments that already succeeded.

Type:

bool

suite: str = 'sklearn-classic'
max_datasets: int | None = None
max_samples: int | None = None
cv_folds: int = 5
scoring_classification: list[str]
scoring_regression: list[str]
profile_datasets: bool = True
profile_groups: list[str]
cache_meta_features: bool = True
meta_features_cache_dir: str | None = None
timeout_per_fit: int = 300
n_jobs: int = 1
random_state: int = 42
verbose: bool = True
skip_completed: bool = True

endgame.benchmark.quick_benchmark(model, model_name='model', suite='quick-test', **kwargs)[source]

Quickly benchmark a single model on test datasets.

Parameters:
  • model (BaseEstimator) – Model to benchmark.

  • model_name (str, default="model") – Name for the model.

  • suite (str, default="quick-test") – Benchmark suite.

  • **kwargs – Additional arguments to BenchmarkRunner.

Return type:

ExperimentTracker

Returns:

ExperimentTracker – Results tracker.

Examples

>>> from sklearn.ensemble import RandomForestClassifier
>>> results = quick_benchmark(RandomForestClassifier(), "RF")
>>> print(results.summary())

endgame.benchmark.compare_models(models, suite='sklearn-classic', **kwargs)[source]

Compare multiple models on benchmark datasets.

Parameters:
  • models (List[Tuple[str, BaseEstimator]]) – List of (name, model) tuples.

  • suite (str, default="sklearn-classic") – Benchmark suite.

  • **kwargs – Additional arguments to BenchmarkRunner.

Return type:

ExperimentTracker

Returns:

ExperimentTracker – Results tracker.

class endgame.benchmark.ResultsAnalyzer(tracker, metric='accuracy', higher_is_better=True, significance_level=0.05)[source]

Bases: object

Analyze and compare benchmark results.

Provides methods for:

  • Ranking models across datasets

  • Statistical significance testing

  • Critical difference diagrams

  • Performance profiles

  • Meta-feature correlation analysis

Parameters:
  • tracker (ExperimentTracker) – Tracker containing experiment results.

  • metric (str, default="accuracy") – Primary metric for comparisons.

  • higher_is_better (bool, default=True) – Whether higher metric values are better.

  • significance_level (float, default=0.05) – Alpha level for statistical tests.

Examples

>>> analyzer = ResultsAnalyzer(tracker, metric="accuracy")
>>> rankings = analyzer.rank_models()
>>> print(rankings)
>>>
>>> # Statistical comparison
>>> comparison = analyzer.compare_models("RF", "XGBoost")
>>> print(f"P-value: {comparison.p_value}")

classmethod from_pivot(pivot, metric='accuracy', higher_is_better=True, significance_level=0.05)[source]

Create a ResultsAnalyzer from a pivot dict.

Convenience factory for external experiment systems that already have results in {dataset: {method: score}} form.

Parameters:
  • pivot (Dict[str, Dict[str, float]]) – Mapping of dataset_name -> {method_name: score}.

  • metric (str, default="accuracy") – Name of the metric the scores represent.

  • higher_is_better (bool, default=True) – Whether higher metric values are better.

  • significance_level (float, default=0.05) – Alpha level for statistical tests.

Return type:

ResultsAnalyzer

Returns:

ResultsAnalyzer – Analyzer ready for ranking, comparison, and statistical tests.

Examples

>>> pivot = {
...     "iris": {"RF": 0.95, "XGB": 0.96},
...     "wine": {"RF": 0.97, "XGB": 0.95},
... }
>>> analyzer = ResultsAnalyzer.from_pivot(pivot, metric="accuracy")
>>> print(analyzer.summary_table())

property df

Get results as DataFrame.

get_pivot_table(metric=None)[source]

Get pivot table of models vs datasets.

Parameters:

metric (str, optional) – Metric to use. If None, uses default metric.

Returns:

DataFrame – Pivot table with models as rows, datasets as columns.

rank_models(method=RankingMethod.MEAN_RANK, metric=None)[source]

Rank models across all datasets.

Parameters:
  • method (RankingMethod) – Ranking method to use.

  • metric (str, optional) – Metric to rank by.

Return type:

dict[str, float]

Returns:

Dict[str, float] – Model name to rank/score mapping (sorted).
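
Mean-rank ranking can be sketched over a {dataset: {model: score}} mapping like the one accepted by from_pivot. This is an illustration of the general technique, not the library's code: the function names are hypothetical, ties share the average of their positions, and every model is assumed to have a score on every dataset.

```python
def average_ranks(scores, higher_is_better=True):
    """Rank one dataset's scores; rank 1 is best and ties share the average rank."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for name, _ in ordered[i:j + 1]:
            ranks[name] = shared
        i = j + 1
    return ranks


def mean_rank(pivot, higher_is_better=True):
    """Average each model's per-dataset rank and sort from best to worst."""
    collected = {}
    for scores in pivot.values():
        for name, rank in average_ranks(scores, higher_is_better).items():
            collected.setdefault(name, []).append(rank)
    means = {name: sum(r) / len(r) for name, r in collected.items()}
    return dict(sorted(means.items(), key=lambda kv: kv[1]))


# Hypothetical scores; XGB and LR tie on "wine".
pivot = {
    "iris": {"RF": 0.95, "XGB": 0.96, "LR": 0.90},
    "wine": {"RF": 0.97, "XGB": 0.95, "LR": 0.95},
    "digits": {"RF": 0.98, "XGB": 0.97, "LR": 0.96},
}
ranking = mean_rank(pivot)
print(ranking)
```

Ranking per dataset before averaging keeps one dataset with unusually high or low scores from dominating the comparison, which is the usual argument for mean rank over mean score.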

compare_models(model_a, model_b, metric=None, test='wilcoxon')[source]

Compare two models statistically.

Parameters:
  • model_a (str) – Name of first model.

  • model_b (str) – Name of second model.

  • metric (str, optional) – Metric to compare on.

  • test (str, default="wilcoxon") – Statistical test: “wilcoxon”, “paired_t”, “sign”.

Return type:

ModelComparison

Returns:

ModelComparison – Comparison results.
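
Of the three tests named above, the sign test is simple enough to sketch with the standard library. The function below is a generic two-sided sign test on paired per-dataset scores, with ties dropped; it is not the library's implementation, and its return shape differs from ModelComparison.

```python
from math import comb


def sign_test(scores_a, scores_b):
    """Two-sided sign test on paired scores; returns (wins_a, wins_b, p_value)."""
    wins_a = sum(a > b for a, b in zip(scores_a, scores_b))
    wins_b = sum(b > a for a, b in zip(scores_a, scores_b))
    n = wins_a + wins_b  # tied pairs carry no sign and are dropped
    if n == 0:
        return wins_a, wins_b, 1.0
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for the two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins_a, wins_b, min(1.0, 2 * tail)


# Hypothetical accuracies on eight datasets: model A wins on seven of them.
scores_a = [0.9] * 7 + [0.5]
scores_b = [0.8] * 8
wins_a, wins_b, p_value = sign_test(scores_a, scores_b)
print(wins_a, wins_b, p_value)
```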

friedman_test(metric=None)[source]

Perform Friedman test across all models.

Parameters:

metric (str, optional) – Metric to test on.

Return type:

tuple[float, float]

Returns:

Tuple[float, float] – (chi2 statistic, p-value)
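
The chi-square statistic of the Friedman test can be sketched from a {dataset: {model: score}} mapping. This is a textbook version under stated assumptions: no tied scores within a dataset, no tie correction, and every model scored on every dataset. The p-value is omitted; it would come from a chi-square distribution with k - 1 degrees of freedom (for instance via scipy.stats.chi2.sf).

```python
def friedman_statistic(pivot, higher_is_better=True):
    """Return the Friedman chi-square statistic for N datasets and k models."""
    models = sorted(next(iter(pivot.values())))
    n, k = len(pivot), len(models)
    rank_sums = dict.fromkeys(models, 0.0)
    for scores in pivot.values():
        # Assumption: no ties, so positions in sorted order are the ranks.
        ordered = sorted(models, key=lambda m: scores[m], reverse=higher_is_better)
        for position, model in enumerate(ordered, start=1):
            rank_sums[model] += position
    mean_ranks = [rank_sums[m] / n for m in models]
    sum_sq = sum(r * r for r in mean_ranks)
    return 12 * n / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


# Hypothetical scores where the ordering A > B > C holds on all four datasets.
pivot = {
    "d1": {"A": 0.90, "B": 0.85, "C": 0.80},
    "d2": {"A": 0.75, "B": 0.70, "C": 0.65},
    "d3": {"A": 0.99, "B": 0.95, "C": 0.91},
    "d4": {"A": 0.60, "B": 0.55, "C": 0.50},
}
chi2 = friedman_statistic(pivot)
print(chi2)
```

With perfect agreement across datasets the statistic reaches its maximum of N(k - 1), which is 8 in this example.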

nemenyi_critical_difference(alpha=0.05)[source]

Compute critical difference for Nemenyi test.

Parameters:

alpha (float, default=0.05) – Significance level.

Return type:

float

Returns:

float – Critical difference value.
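
The critical difference is commonly computed as CD = q_alpha * sqrt(k(k+1) / (6N)) for k models and N datasets. The sketch below assumes that standard formula and hard-codes the alpha = 0.05 critical values for 2 to 10 models from Demšar (2006); whether the library uses the same table or supports other alpha values this way is not stated in the source.

```python
import math

# Critical values q_alpha for the two-tailed Nemenyi test at alpha = 0.05,
# indexed by the number of models (Demsar, 2006).
Q_ALPHA_005 = {
    2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
    7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164,
}


def nemenyi_cd(n_models, n_datasets):
    """Return the Nemenyi critical difference in mean rank at alpha = 0.05."""
    q = Q_ALPHA_005[n_models]
    return q * math.sqrt(n_models * (n_models + 1) / (6 * n_datasets))


# Four models compared on fourteen datasets.
cd = nemenyi_cd(n_models=4, n_datasets=14)
print(round(cd, 3))
```

Two models whose mean ranks differ by more than this value are considered significantly different at the chosen level.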

get_model_summary(model_name, metric=None)[source]

Get detailed summary for a specific model.

Parameters:
  • model_name (str) – Name of the model.

  • metric (str, optional) – Metric to summarize.

Return type:

dict[str, Any]

Returns:

Dict[str, Any] – Summary statistics.

get_dataset_summary(dataset_name, metric=None)[source]

Get detailed summary for a specific dataset.

Parameters:
  • dataset_name (str) – Name of the dataset.

  • metric (str, optional) – Metric to summarize.

Return type:

dict[str, Any]

Returns:

Dict[str, Any] – Summary statistics.

summary_table(metric=None, sort_by='mean_rank')[source]

Generate formatted summary table.

Parameters:
  • metric (str, optional) – Metric to summarize.

  • sort_by (str, default="mean_rank") – Column to sort by.

Return type:

str

Returns:

str – Formatted table string.

meta_feature_correlation(metric=None, model_name=None)[source]

Compute correlation between meta-features and performance.

Parameters:
  • metric (str, optional) – Performance metric.

  • model_name (str, optional) – Specific model to analyze. If None, averages across models.

Return type:

dict[str, float]

Returns:

Dict[str, float] – Meta-feature name to correlation mapping.

class endgame.benchmark.RankingMethod(*values)[source]

Bases: str, Enum

Methods for ranking models.

MEAN_SCORE = 'mean_score'
MEAN_RANK = 'mean_rank'
WIN_COUNT = 'win_count'
BORDA_COUNT = 'borda_count'
FRIEDMAN = 'friedman'

class endgame.benchmark.MetaLearner(approach='ranking', base_estimator=None, metric='accuracy', n_top_models=3, random_state=42, verbose=False)[source]

Bases: object

Learn to predict optimal models from dataset meta-features.

Trains a meta-model that predicts which model will perform best on a new dataset based on its meta-features.

Parameters:
  • approach (str, default="ranking") – Meta-learning approach:

      - “ranking”: Predict model rankings

      - “classification”: Predict best model (classification)

      - “regression”: Predict model scores (regression)

  • base_estimator (BaseEstimator, optional) – Base model for meta-learning. If None, uses RandomForest.

  • metric (str, default="accuracy") – Target metric to optimize.

  • n_top_models (int, default=3) – Number of top models to consider for recommendations.

  • random_state (int, default=42) – Random seed.

  • verbose (bool, default=False) – Enable verbose output.

Examples

>>> # Train meta-learner from benchmark results
>>> meta_learner = MetaLearner()
>>> meta_learner.fit(tracker)
>>>
>>> # Get recommendation for new dataset
>>> recommendation = meta_learner.recommend(X_new, y_new)
>>> print(f"Best model: {recommendation.model_name}")

fit(tracker, metric=None)[source]

Fit meta-learner from benchmark results.

Parameters:
  • tracker (ExperimentTracker) – Tracker containing benchmark results.

  • metric (str, optional) – Override target metric.

Return type:

MetaLearner

Returns:

self

recommend(X, y, categorical_indicator=None, task_type='classification')[source]

Get model recommendation for a new dataset.

Parameters:
  • X (np.ndarray) – Feature matrix.

  • y (np.ndarray) – Target variable.

  • categorical_indicator (List[bool], optional) – Boolean mask for categorical features.

  • task_type (str, default="classification") – Task type: “classification” or “regression”.

Return type:

ModelRecommendation

Returns:

ModelRecommendation – Recommended model with confidence and alternatives.

recommend_from_features(meta_features)[source]

Get recommendation from pre-computed meta-features.

Parameters:

meta_features (MetaFeatureSet or Dict) – Pre-computed meta-features.

Return type:

ModelRecommendation

Returns:

ModelRecommendation – Recommended model.

get_feature_importances()[source]

Get feature importances from meta-model.

Return type:

dict[str, float]

Returns:

Dict[str, float] – Feature name to importance mapping.

class endgame.benchmark.PipelineRecommender(meta_learner=None, preprocessing_options=None, verbose=False)[source]

Bases: object

Recommend complete pipelines (preprocessing + model) for new datasets.

Extends MetaLearner to recommend full preprocessing pipelines in addition to models.

Parameters:
  • meta_learner (MetaLearner, optional) – Pre-trained meta-learner.

  • preprocessing_options (List[str], default=["none", "scaling", "imputation"]) – Available preprocessing options.

  • verbose (bool, default=False) – Enable verbose output.

Examples

>>> recommender = PipelineRecommender()
>>> recommender.fit(tracker)
>>> pipeline = recommender.recommend_pipeline(X, y)
>>> print(pipeline)

fit(tracker, **kwargs)[source]

Fit recommender from benchmark results.

Return type:

PipelineRecommender

Parameters:

tracker (ExperimentTracker)

recommend_pipeline(X, y, categorical_indicator=None, task_type='classification')[source]

Recommend a complete pipeline.

Parameters:
  • X (np.ndarray) – Feature matrix.

  • y (np.ndarray) – Target variable.

  • categorical_indicator (List[bool], optional) – Boolean mask for categorical features.

  • task_type (str) – Task type.

Return type:

dict[str, Any]

Returns:

Dict[str, Any] – Pipeline recommendation with model and preprocessing.

class endgame.benchmark.BenchmarkReportGenerator(tracker, title='Endgame Benchmark Report')[source]

Bases: object

Generate HTML reports from benchmark results.

Parameters:
  • tracker (ExperimentTracker) – The experiment tracker with benchmark results.

  • title (str, optional) – Report title.

Examples

>>> from endgame.benchmark import BenchmarkRunner, BenchmarkReportGenerator
>>> runner = BenchmarkRunner(suite="sklearn-classic")
>>> tracker = runner.run(models)
>>> report = BenchmarkReportGenerator(tracker)
>>> report.generate("benchmark_report.html")

add_interpretability_output(model_name, dataset_name, output, output_type='text')[source]

Add interpretability output for a model.

Parameters:
  • model_name (str) – Name of the model.

  • dataset_name (str) – Name of the dataset.

  • output (str) – The interpretability output (rules, tree structure, equation, etc.)

  • output_type (str) – Type of output: “text”, “html”, “latex”, “code”

Return type:

None

generate(output_path, include_interpretability=True, include_meta_features=False)[source]

Generate the HTML report.

Parameters:
  • output_path (str) – Path to save the HTML report.

  • include_interpretability (bool) – Include interpretability outputs section.

  • include_meta_features (bool) – Include dataset meta-features section.

Return type:

str

Returns:

str – Path to the generated report.

endgame.benchmark.extract_interpretability_outputs(models, X_sample, y_sample, dataset_name, feature_names=None)[source]

Extract interpretability outputs from fitted models.

Parameters:
  • models (List[Tuple]) – List of (name, fitted_model) tuples.

  • X_sample (np.ndarray) – Sample data used for fitting.

  • y_sample (np.ndarray) – Sample targets.

  • dataset_name (str) – Name of the dataset.

  • feature_names (List[str], optional) – Feature names for better output.

Return type:

dict[str, str]

Returns:

Dict[str, str] – Dictionary mapping model names to their interpretability outputs.

class endgame.benchmark.LearningCurveExperiment(suite, config=None, max_datasets=None, verbose=True)[source]

Bases: object

Run learning curve experiments across datasets.

Implements the LCDB (Learning Curve Database) protocol for systematic evaluation of sample efficiency.

Parameters:
  • suite (str or List[DatasetInfo]) – Benchmark suite name or list of datasets.

  • config (LearningCurveConfig, optional) – Experiment configuration.

  • max_datasets (int, optional) – Maximum number of datasets.

  • verbose (bool) – Enable verbose output.

Examples

>>> from endgame.benchmark import LearningCurveExperiment, LearningCurveConfig
>>> from endgame.models import LGBMWrapper
>>>
>>> config = LearningCurveConfig(anchors=[0.1, 0.5, 1.0], n_seeds=3)
>>> exp = LearningCurveExperiment(suite="sklearn-classic", config=config)
>>>
>>> models = [
...     ("LGBM", LGBMWrapper(preset="fast")),
... ]
>>> results = exp.run(models)
>>> print(results.summary())
run(models, output_file=None, continue_on_error=True)[source]

Run learning curve experiments.

Parameters:
  • models (List[Tuple[str, BaseEstimator]]) – List of (name, model) tuples.

  • output_file (str, optional) – Path to save results.

  • continue_on_error (bool) – Continue if a model fails.

Return type:

LearningCurveResults

Returns:

LearningCurveResults – Experiment results.

class endgame.benchmark.LearningCurveConfig(anchors=<factory>, n_seeds=5, cv_folds=0, test_fraction=0.2, metrics_classification=<factory>, metrics_regression=<factory>, timeout_per_fit=600, random_state=42, verbose=True)[source]

Bases: object

Configuration for learning curve experiments.

Parameters:
  • anchors (List[float]) – Training set fractions (LCDB protocol default).

  • n_seeds (int) – Number of random seeds per anchor point.

  • cv_folds (int) – Cross-validation folds per seed (0 = holdout only).

  • test_fraction (float) – Holdout test set fraction.

  • metrics_classification (List[str]) – Metrics for classification tasks.

  • metrics_regression (List[str]) – Metrics for regression tasks.

  • timeout_per_fit (int) – Timeout per model fit in seconds.

  • random_state (int) – Base random seed.

  • verbose (bool) – Enable verbose output.

anchors: list[float]
n_seeds: int = 5
cv_folds: int = 0
test_fraction: float = 0.2
metrics_classification: list[str]
metrics_regression: list[str]
timeout_per_fit: int = 600
random_state: int = 42
verbose: bool = True
class endgame.benchmark.LearningCurveResults(records=<factory>, config=None)[source]

Bases: object

Container for learning curve results with analysis methods.

Parameters:
records

All experiment records.

Type:

List[LearningCurveRecord]

config

Configuration used.

Type:

LearningCurveConfig

records: list[LearningCurveRecord]
config: LearningCurveConfig | None = None
add_record(record)[source]

Add a record to results.

Parameters:

record (LearningCurveRecord)

to_dataframe()[source]

Convert results to DataFrame.

Returns:

DataFrame – Results in tabular format.

save(path)[source]

Save results to file.

Parameters:

path (str) – Output path (.parquet, .csv, or .json).

get_learning_curve(dataset, model, metric='accuracy')[source]

Get learning curve for a specific dataset/model.

Parameters:
  • dataset (str) – Dataset name.

  • model (str) – Model name.

  • metric (str) – Metric to retrieve.

Return type:

tuple[ndarray, ndarray, ndarray]

Returns:

  • anchors (ndarray) – Training fractions.

  • means (ndarray) – Mean metric values.

  • stds (ndarray) – Standard deviations.
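
The aggregation described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the records are plain dicts that mirror the documented LearningCurveRecord fields, aggregate_curve is a hypothetical helper name, and the use of the population standard deviation is an assumption (the actual method may use the sample standard deviation and returns ndarrays rather than lists).

```python
from collections import defaultdict
from statistics import mean, pstdev


def aggregate_curve(records, dataset, model, metric="accuracy"):
    """Group successful records by anchor and return (anchors, means, stds).

    Hypothetical helper; `records` are dicts shaped like LearningCurveRecord.
    """
    by_anchor = defaultdict(list)
    for rec in records:
        if (
            rec["dataset_name"] == dataset
            and rec["model_name"] == model
            and rec["status"] == "success"  # failed fits carry no usable metric
            and metric in rec["metrics"]
        ):
            by_anchor[rec["anchor"]].append(rec["metrics"][metric])

    anchors = sorted(by_anchor)
    means = [mean(by_anchor[a]) for a in anchors]
    # Assumption: population standard deviation across seeds.
    stds = [pstdev(by_anchor[a]) for a in anchors]
    return anchors, means, stds


# Illustrative records (made-up values, two seeds per anchor, one failure).
records = [
    {"dataset_name": "iris", "model_name": "LGBM", "anchor": 0.1, "seed": 0,
     "status": "success", "metrics": {"accuracy": 0.80}},
    {"dataset_name": "iris", "model_name": "LGBM", "anchor": 0.1, "seed": 1,
     "status": "success", "metrics": {"accuracy": 0.90}},
    {"dataset_name": "iris", "model_name": "LGBM", "anchor": 1.0, "seed": 0,
     "status": "success", "metrics": {"accuracy": 0.95}},
    {"dataset_name": "iris", "model_name": "LGBM", "anchor": 1.0, "seed": 1,
     "status": "error", "metrics": {}},
]

anchors, means, stds = aggregate_curve(records, "iris", "LGBM")
```

Records with status 'error' are skipped here so that a failed fit does not drag the mean down; whether the library does the same is not stated in this reference.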

compute_aulc(dataset, model, metric='accuracy')[source]

Compute Area Under Learning Curve.

Higher AULC indicates better sample efficiency: the model reaches good performance with fewer training samples.

Parameters:
  • dataset (str) – Dataset name.

  • model (str) – Model name.

  • metric (str) – Metric to use.

Return type:

float

Returns:

float – Area under learning curve (normalized to [0, 1]).
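
This reference does not state how the area is integrated or normalized. One common choice, shown here purely as an assumption, is the trapezoidal rule over the (anchor, mean) points divided by the anchor span, which keeps the result in [0, 1] for any metric bounded by [0, 1]. The function name aulc is hypothetical.

```python
def aulc(anchors, means):
    """Trapezoidal area under the learning curve, divided by the anchor span.

    Sketch under stated assumptions; not necessarily what compute_aulc does.
    """
    if len(anchors) != len(means) or len(anchors) < 2:
        raise ValueError("need at least two matching (anchor, mean) points")

    pairs = sorted(zip(anchors, means))
    span = pairs[-1][0] - pairs[0][0]
    if span <= 0:
        raise ValueError("anchors must cover a non-zero range")

    area = 0.0
    for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
        # Area of the trapezoid between two consecutive anchors.
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area / span


score = aulc([0.1, 0.5, 1.0], [0.6, 0.8, 0.9])
```

Under this normalization a flat curve at value v has an AULC of exactly v, so two models with the same final score are separated by how early they approach it.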

summary(metric='accuracy')[source]

Generate summary statistics.

Parameters:

metric (str) – Primary metric for summary.

Return type:

Dict[str, Dict[str, float]]

Returns:

dict – Summary with AULC and final performance per model.

plot_learning_curves(dataset, metric='accuracy', models=None, ax=None, **kwargs)[source]

Plot learning curves for a dataset.

Parameters:
  • dataset (str) – Dataset name.

  • metric (str) – Metric to plot.

  • models (List[str], optional) – Models to include (default: all).

  • ax (matplotlib.axes.Axes, optional) – Axes to plot on.

  • **kwargs – Additional arguments to plt.plot.

Returns:

ax (matplotlib.axes.Axes) – The axes with the plot.

class endgame.benchmark.LearningCurveRecord(dataset_name, model_name, anchor, n_train, seed, metrics, fit_time, status='success', error_message=None)[source]

Bases: object

Single learning curve data point.

Parameters:
dataset_name

Name of the dataset.

Type:

str

model_name

Name of the model.

Type:

str

anchor

Training set fraction.

Type:

float

n_train

Actual number of training samples.

Type:

int

seed

Random seed used.

Type:

int

metrics

Performance metrics.

Type:

Dict[str, float]

fit_time

Training time in seconds.

Type:

float

status

‘success’ or ‘error’.

Type:

str

error_message

Error message if failed.

Type:

str, optional

dataset_name: str
model_name: str
anchor: float
n_train: int
seed: int
metrics: dict[str, float]
fit_time: float
status: str = 'success'
error_message: str | None = None
endgame.benchmark.quick_learning_curve(model, X, y, anchors=None, n_seeds=3, test_fraction=0.2, random_state=42)[source]

Quick learning curve for a single model/dataset.

Parameters:
  • model (BaseEstimator) – Model to evaluate.

  • X (ndarray) – Features.

  • y (ndarray) – Targets.

  • anchors (List[float], optional) – Training fractions.

  • n_seeds (int) – Seeds per anchor.

  • test_fraction (float) – Test set fraction.

  • random_state (int) – Random seed.

Return type:

tuple[ndarray, ndarray, ndarray]

Returns:

  • anchors (ndarray) – Training fractions.

  • means (ndarray) – Mean accuracies.

  • stds (ndarray) – Standard deviations.
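
The procedure implied by these parameters (hold out a test set, fit on growing fractions of the remainder, repeat over seeds) can be sketched with NumPy alone. Everything below is illustrative: holdout_learning_curve and the tiny NearestCentroid classifier are hypothetical stand-ins, and the splitting details (nested training subsets, seed offsets, rounding) are assumptions rather than the behaviour of quick_learning_curve.

```python
import numpy as np


class NearestCentroid:
    """Tiny stand-in classifier exposing fit/predict (for illustration only)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Squared distance from every sample to every class centroid.
        d = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[d.argmin(axis=1)]


def holdout_learning_curve(model, X, y, anchors=(0.1, 0.5, 1.0), n_seeds=3,
                           test_fraction=0.2, random_state=42):
    """Return (anchors, mean accuracy, std) over seeds for each anchor."""
    anchors = np.asarray(anchors, dtype=float)
    scores = np.zeros((len(anchors), n_seeds))
    n = len(X)
    n_test = max(1, int(round(test_fraction * n)))

    for s in range(n_seeds):
        rng = np.random.default_rng(random_state + s)  # assumed seed scheme
        perm = rng.permutation(n)
        test_idx, pool_idx = perm[:n_test], perm[n_test:]
        for i, a in enumerate(anchors):
            # Each anchor trains on the first fraction of the shuffled pool.
            n_train = max(1, int(round(a * len(pool_idx))))
            train_idx = pool_idx[:n_train]
            model.fit(X[train_idx], y[train_idx])  # fit() resets the model state
            scores[i, s] = np.mean(model.predict(X[test_idx]) == y[test_idx])

    return anchors, scores.mean(axis=1), scores.std(axis=1)


# Two well-separated Gaussian blobs as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
anchors, means, stds = holdout_learning_curve(NearestCentroid(), X, y)
```

With a scikit-learn estimator one would normally clone the model before each fit instead of refitting the same instance.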

endgame.benchmark.make_rotated_blobs(n_samples=1000, n_features=10, n_classes=3, rotation_angle=45.0, cluster_std=1.0, noise=0.0, random_state=None)[source]

Generate synthetic dataset with known rotation.

Creates Gaussian blobs that are axis-aligned in a rotated coordinate system. Rotation learning should be able to recover the rotation and achieve high accuracy with axis-aligned splits in the rotated space.

This is the critical control experiment from the paper. Standard GBDTs fail on this because the decision boundaries are diagonal, while rotation learning should match MLP performance by learning the rotation.

Parameters:
  • n_samples (int, default=1000) – Number of samples.

  • n_features (int, default=10) – Number of features.

  • n_classes (int, default=3) – Number of classes (blob centers).

  • rotation_angle (float, default=45.0) – Rotation angle in degrees applied pairwise to features.

  • cluster_std (float, default=1.0) – Standard deviation of clusters before rotation.

  • noise (float, default=0.0) – Additional Gaussian noise after rotation.

  • random_state (int, optional) – Random seed.

Return type:

DatasetInfo

Returns:

DatasetInfo – Synthetic dataset with metadata including ground truth rotation.

Examples

>>> from endgame.benchmark.synthetic import make_rotated_blobs
>>> dataset = make_rotated_blobs(n_samples=500, rotation_angle=45.0)
>>> print(dataset.name)
synthetic_rotated_45
>>> print(dataset.metadata['ground_truth_rotation'].shape)
(10, 10)
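
A rotation "applied pairwise to features" can be read as a block-diagonal matrix of 2×2 rotations. The sketch below builds such a matrix under that reading; pairwise_rotation is a hypothetical helper, and pairing consecutive features (0, 1), (2, 3), and so on, with an odd trailing feature left unrotated, is an assumption about the construction rather than something this reference states.

```python
import numpy as np


def pairwise_rotation(n_features, angle_deg):
    """Block-diagonal rotation acting on consecutive feature pairs (assumed layout)."""
    theta = np.deg2rad(angle_deg)
    c, s = np.cos(theta), np.sin(theta)
    R = np.eye(n_features)
    for i in range(0, n_features - 1, 2):
        # Standard 2x2 rotation block for features (i, i + 1).
        R[i:i + 2, i:i + 2] = [[c, -s], [s, c]]
    return R


R = pairwise_rotation(10, 45.0)

# Rotate row-vector samples: each row x becomes R @ x.
rng = np.random.default_rng(0)
X_axis_aligned = rng.standard_normal((5, 10))
X_rotated = X_axis_aligned @ R.T

# Because R is orthogonal, its transpose undoes the rotation.
X_recovered = X_rotated @ R
```

The matrix has the same (10, 10) shape as the ground-truth rotation shown in the example above, and its orthogonality is what makes the original axis-aligned structure recoverable.
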
endgame.benchmark.make_hidden_structure(n_samples=1000, n_features=20, n_informative=5, structure_type='diagonal', flip_y=0.01, random_state=None)[source]

Generate dataset with hidden linear structure.

The true decision boundary is simple (axis-aligned) in a rotated coordinate system. This tests whether rotation learning can discover the useful feature combinations.

Parameters:
  • n_samples (int, default=1000) – Number of samples.

  • n_features (int, default=20) – Total number of features.

  • n_informative (int, default=5) – Number of truly informative features.

  • structure_type (str, default='diagonal') – Type of hidden structure:

      - ‘diagonal’: Linear combination of pairs

      - ‘block’: Block structure in feature space

      - ‘random’: Random orthogonal transformation

  • flip_y (float, default=0.01) – Fraction of labels to flip (noise).

  • random_state (int, optional) – Random seed.

Return type:

DatasetInfo

Returns:

DatasetInfo – Synthetic dataset with hidden structure.
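
How the ‘random’ structure is generated is not documented here. A standard way to draw a random orthogonal transformation is the QR decomposition of a Gaussian matrix, sketched below as an assumption; random_orthogonal is a hypothetical helper, and the labelling rule (sign of the first latent coordinate) is chosen only to show a boundary that is axis-aligned before the transformation and hidden after it.

```python
import numpy as np


def random_orthogonal(n, random_state=None):
    """Draw an n x n orthogonal matrix via QR of a Gaussian matrix."""
    rng = np.random.default_rng(random_state)
    A = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    # Fix column signs so the result does not depend on the QR sign convention.
    return Q * np.sign(np.diag(R))


rng = np.random.default_rng(0)
Z = rng.standard_normal((200, 5))   # latent, axis-aligned coordinates
y = (Z[:, 0] > 0).astype(int)       # simple rule on one latent axis

Q = random_orthogonal(5, random_state=1)
X = Z @ Q.T                         # observed features mix all latent axes

# With the true Q known, the axis-aligned view is recovered exactly.
Z_back = X @ Q
```

In the observed space the label depends on a linear combination of all five columns of X, which is the kind of structure a rotation-learning method is expected to rediscover.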

endgame.benchmark.make_xor_rotated(n_samples=1000, n_features=10, rotation_angle=45.0, noise=0.1, random_state=None)[source]

Generate XOR problem in rotated space.

Classic XOR problem in which the class depends on the product of two features, with those features rotated so that axis-aligned trees fail.

Parameters:
  • n_samples (int, default=1000) – Number of samples.

  • n_features (int, default=10) – Total features (XOR uses first 2, rest are noise).

  • rotation_angle (float, default=45.0) – Rotation angle for XOR features.

  • noise (float, default=0.1) – Gaussian noise level.

  • random_state (int, optional) – Random seed.

Return type:

DatasetInfo

Returns:

DatasetInfo – XOR dataset with rotation.
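
The construction can be illustrated with a short NumPy sketch. It is not the library's generator: xor_rotated is a hypothetical name, it returns plain arrays instead of a DatasetInfo, and the sampling distribution (uniform on [-1, 1]) and the way noise is added (Gaussian noise on every feature after rotation) are assumptions.

```python
import numpy as np


def xor_rotated(n_samples=1000, n_features=10, rotation_angle=45.0,
                noise=0.1, random_state=None):
    """XOR labels on two latent features, then rotate those two features."""
    if n_features < 2:
        raise ValueError("XOR needs at least two features")
    rng = np.random.default_rng(random_state)

    Z = rng.uniform(-1.0, 1.0, size=(n_samples, n_features))
    # XOR: class 1 when the two latent features share a sign.
    y = (Z[:, 0] * Z[:, 1] > 0).astype(int)

    theta = np.deg2rad(rotation_angle)
    R2 = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta), np.cos(theta)]])

    X = Z.copy()
    X[:, :2] = Z[:, :2] @ R2.T                  # rotate only the XOR pair
    X += noise * rng.standard_normal(X.shape)   # remaining columns are pure noise
    return X, y


X, y = xor_rotated(n_samples=500, noise=0.0, random_state=0)
```

At 45 degrees the class boundaries become the diagonals of the first two observed features, so no single axis-aligned threshold separates the quadrants.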

endgame.benchmark.make_regression_rotated(n_samples=1000, n_features=10, n_informative=5, rotation_angle=45.0, noise=0.1, random_state=None)[source]

Generate regression dataset with rotated structure.

Linear regression problem where the true coefficients are axis-aligned in a rotated space.

Parameters:
  • n_samples (int, default=1000) – Number of samples.

  • n_features (int, default=10) – Total features.

  • n_informative (int, default=5) – Number of features with non-zero coefficients.

  • rotation_angle (float, default=45.0) – Rotation angle.

  • noise (float, default=0.1) – Target noise level.

  • random_state (int, optional) – Random seed.

Return type:

DatasetInfo

Returns:

DatasetInfo – Regression dataset.

endgame.benchmark.get_synthetic_suite(random_state=42)[source]

Get dictionary of all synthetic datasets for benchmarking.

Returns a comprehensive suite of synthetic datasets designed to test rotation learning methods.

Parameters:

random_state (int, default=42) – Random seed for reproducibility.

Return type:

Dict[str, DatasetInfo]

Returns:

Dict[str, DatasetInfo] – Dictionary mapping dataset names to DatasetInfo objects.

Examples

>>> from endgame.benchmark.synthetic import get_synthetic_suite
>>> suite = get_synthetic_suite()
>>> for name, dataset in suite.items():
...     print(f"{name}: {dataset.n_samples} samples, {dataset.n_features} features")
endgame.benchmark.get_control_dataset(random_state=42)[source]

Get the primary control dataset from the paper.

This is the Synthetic Rotated dataset used as the critical control experiment. Standard GBDTs should fail here while rotation learning should recover the rotation and match MLP performance.

Parameters:

random_state (int, default=42) – Random seed.

Return type:

DatasetInfo

Returns:

DatasetInfo – The control dataset.