# AutoML Guide Endgame provides a full AutoML system that automatically profiles data, checks quality, selects and trains models, tunes hyperparameters, builds ensembles, optimizes thresholds, generates explanations, and produces a structured performance report — all behind a single `fit` / `predict` call. **Import convention:** `import endgame as eg` --- ## Architecture ![Endgame AutoML Pipeline](endgame_automl.png) The AutoML pipeline executes 16 stages with intelligent time budget management. Each stage receives a fraction of the total time budget and unused time is automatically redistributed to later stages. | # | Stage | Purpose | |---|-------|---------| | 1 | **Profiling** | Extract dataset meta-features (size, types, class balance, correlations) | | 2 | **Quality Guardrails** | Detect target leakage, feature redundancy, data health issues | | 3 | **Data Cleaning** | Handle missing values, remove constant columns | | 4 | **Preprocessing** | Encoding, scaling, imputation | | 5 | **Feature Engineering** | Aggregations, interactions, polynomial features | | 6 | **Data Augmentation** | SMOTE, ADASYN for imbalanced datasets | | 7 | **Model Selection** | Search strategy suggests model configurations | | 8 | **Model Training** | Train models with cross-validation from 76 registered models | | 9 | **Constraint Check** | Validate models against deployment constraints (latency, size) | | 10 | **Hyperparameter Tuning** | Optuna-based HPO for top-3 models | | 11 | **Ensembling** | Hill climbing, stacking, blending, rank averaging, or auto-selection | | 12 | **Threshold Optimization** | Optimize classification decision thresholds on OOF predictions | | 13 | **Calibration** | Probability calibration (Platt, isotonic, temperature scaling) | | 14 | **Post-Training** | Knowledge distillation, conformal prediction | | 15 | **Explainability** | SHAP feature importances and feature interactions | | 16 | **Persistence** | Save trained models and pipeline artifacts to disk | After the linear pipeline completes, a **feedback loop** can run up to 3 additional iterations if time permits — updating the search strategy with results, suggesting new model configurations, and re-running ensembling with all models. When `keep_training=True`, the pipeline enters a **continuous optimization loop** that alternates between model search, training, optional HPO, and re-ensembling until convergence or interruption. A **performance report** is generated after the pipeline finishes, summarizing the full run with leaderboard, stage timing, quality warnings, tuning results, and top features. --- ## Quick Start ```python from endgame.automl import TabularPredictor predictor = TabularPredictor(label="target", presets="best_quality") predictor.fit(train_df) y_pred = predictor.predict(test_df) y_proba = predictor.predict_proba(test_df) predictor.leaderboard() ``` `leaderboard()` returns a `pandas.DataFrame` ranked by validation score, one row per trained model: ``` model val_score fit_time_s pred_time_s 0 LGBMWrapper 0.9312 14.2 0.04 1 XGBWrapper 0.9287 18.6 0.06 2 FTTransformer 0.9241 92.0 0.31 3 HillClimbingEnsemble 0.9341 2.1 0.41 ``` --- ## Constructor Parameters `TabularPredictor` accepts the following parameters: | Parameter | Type | Default | Description | |---|---|---|---| | `label` | `str` | *(required)* | Name of the target column | | `problem_type` | `str` | `"auto"` | `"auto"`, `"binary"`, `"multiclass"`, or `"regression"` | | `eval_metric` | `str` | `"auto"` | Evaluation metric (`"roc_auc"`, `"accuracy"`, `"rmse"`, `"mae"`, `"log_loss"`, `"f1"`, `"r2"`, or a callable) | | `presets` | `str` | `"medium_quality"` | Quality preset (see [Preset System](#preset-system)) | | `time_limit` | `int \| None` | `None` | Time budget in seconds. `None` uses preset default | | `search_strategy` | `str` | `"portfolio"` | Search strategy (see [Search Strategies](#search-strategies)) | | `track_experiments` | `bool` | `True` | Track experiments to the meta-learning database | | `output_path` | `str \| None` | `None` | Path to save outputs (models, logs) | | `random_state` | `int` | `42` | Random seed for reproducibility | | `verbosity` | `int` | `2` | Verbosity level (0=silent, 1=progress, 2=detailed, 3=debug) | | `logger` | `ExperimentLogger \| None` | `None` | Experiment logger instance (e.g. MLflow) | | `constraints` | `DeploymentConstraints \| None` | `None` | Deployment constraints (latency, model size) | | `guardrails_strict` | `bool` | `False` | Abort on critical quality issues instead of warning | | `checkpoint_dir` | `str \| None` | `None` | Directory for incremental checkpoints. Saves top-N models after key stages | | `keep_training` | `bool` | `False` | Enable continuous optimization loop after main pipeline | | `patience` | `int` | `5` | Consecutive rounds without improvement before stopping (continuous loop). Set to `0` for unlimited | | `min_improvement` | `float` | `1e-4` | Minimum score improvement to count as progress | | `min_model_time` | `float` | `300.0` | Minimum time budget (seconds) per model. Stops training stage if remaining time is less than this | | `max_model_time` | `float` | `600.0` | Hard ceiling (seconds) per model. Prevents slow models from monopolizing the budget | | `excluded_models` | `list[str] \| None` | `None` | Model names to exclude from the search | | `early_stopping_rounds` | `int` | `50` | Early stopping patience for GBDT models (LightGBM, XGBoost, CatBoost) during CV | | `use_gpu` | `bool` | `False` | Enable GPU acceleration for supported models | ```python predictor = TabularPredictor( label="target", presets="best_quality", time_limit=7200, checkpoint_dir="checkpoints/", keep_training=True, patience=10, min_improvement=1e-5, excluded_models=["saint", "tabpfn"], early_stopping_rounds=100, ) predictor.fit(train_df) ``` --- ## Preset System The `preset` argument controls the quality / speed trade-off. Seven built-in presets are available: | Preset | Description | Default time | CV folds | Ensemble | HPO | |---|---|---|---|---|---| | `'best_quality'` | Maximum accuracy, all model families | No limit | 8 | Auto (6 methods) | 100 trials | | `'high_quality'` | High accuracy, most model families | 4 hours | 5 | Auto (6 methods) | 50 trials | | `'good_quality'` | Balanced speed and quality | 1 hour | 5 | Auto (6 methods) | 25 trials | | `'medium_quality'` | Fast with reasonable quality (default) | 15 min | 5 | Auto (6 methods) | 10 trials | | `'fast'` | GBDTs only, no HPO or ensembling | 5 min | 3 | None | None | | `'interpretable'` | Glass-box models only (EBM, GAM, rules, trees) | 15 min | 3 | None | 25 trials | | `'exhaustive'` | Evolutionary search over all models + preprocessing + ensembles | No limit | 3 | Auto (6 methods) | Genetic | ```python # Fast experiment — good for initial data exploration predictor = TabularPredictor(label="target", presets="fast") predictor.fit(train_df) # Competition-grade — leave running overnight predictor = TabularPredictor(label="target", presets="best_quality") predictor.fit(train_df) # Regulatory/compliance — interpretable models only predictor = TabularPredictor(label="target", presets="interpretable") predictor.fit(train_df) ``` Each preset defines time allocations for all 16 pipeline stages, curated model pools, and search budgets. See `endgame/automl/presets.py` for full details. --- ## Prediction Methods `TabularPredictor` provides four prediction methods: ### `predict(data, model=None)` Returns point predictions. For classification, applies threshold optimization automatically when available (trained during the threshold optimization stage on OOF predictions). For regression, returns raw predicted values. ```python y_pred = predictor.predict(test_df) # Use a specific model instead of the ensemble y_pred = predictor.predict(test_df, model="lgbm_standard") ``` ### `predict_proba(data, model=None)` Returns probability predictions for classification tasks. Applies calibration automatically when a calibrator was fitted during the calibration stage. ```python y_proba = predictor.predict_proba(test_df) # shape (n_samples, n_classes) ``` ### `predict_sets(data, alpha=0.1)` Returns conformal prediction sets (classification) or prediction intervals (regression) with statistical coverage guarantees. Requires a preset that enables conformal prediction (`best_quality` or `high_quality` with validation data). ```python # 90% coverage prediction sets pred_sets = predictor.predict_sets(test_df, alpha=0.1) # Classification: boolean array (n_samples, n_classes) — True = class in set # Regression: array (n_samples, 2) — [lower_bound, upper_bound] ``` ### `predict_distilled(data)` Returns predictions from the lightweight distilled student model, trained via knowledge distillation from the ensemble teacher. Faster inference while approximating ensemble accuracy. ```python y_fast = predictor.predict_distilled(test_df) ``` --- ## Search Strategies Eight search strategies are available: | Strategy | Description | |---|---| | `'portfolio'` | Diverse model portfolio with heuristic ranking (default) | | `'heuristic'` | Data-driven rules based on meta-features | | `'genetic'` | Evolutionary optimization of full pipelines (model + preprocessing + hyperparameters) | | `'random'` | Random valid pipeline sampling | | `'bayesian'` | Optuna-based Bayesian optimization | | `'bandit'` | Successive Halving multi-fidelity search | | `'adaptive'` | Meta-strategy: Portfolio → Bayesian on stagnation | ```python predictor = TabularPredictor( label="target", presets="good_quality", search_strategy="bayesian", ) predictor.fit(train_df) ``` ### Bandit Search (Successive Halving) The `'bandit'` strategy implements multi-fidelity optimization via Successive Halving. Many configurations are trained cheaply on small data fractions, and only the top performers are promoted to progressively larger fractions. This is far more time-efficient than training every configuration on the full dataset. - **Rung 0**: Train all configurations on ~11% of data - **Rung 1**: Promote top 1/3 to ~33% of data - **Rung 2**: Promote top 1/3 to 100% of data The reduction factor (`eta=3`) controls how aggressively configurations are pruned at each rung. ```python predictor = TabularPredictor( label="target", presets="good_quality", search_strategy="bandit", ) predictor.fit(train_df, time_limit=1800) ``` ### Adaptive Search The `'adaptive'` strategy is a meta-strategy that switches between sub-strategies based on performance feedback: 1. **Phase 1 — Portfolio**: Diverse model sweep for broad coverage (first 15 rounds) 2. **Phase 2 — Bayesian**: Focused HPO on top performers (unlimited rounds) The switch happens early when the current strategy stagnates (no improvement for 5 consecutive rounds). ```python predictor = TabularPredictor( label="target", presets="high_quality", search_strategy="adaptive", ) predictor.fit(train_df, time_limit=3600) ``` ### Genetic / Evolutionary Search The `'genetic'` strategy treats the entire pipeline as a genome and evolves it using tournament selection, crossover, and mutation. Each individual encodes: - **Model choice** and hyperparameters - **Preprocessing steps** (imputation strategy, scaling, encoding) - **Feature selection** method and top-k count - **Dimensionality reduction** (PCA, none) ```python predictor = TabularPredictor( label="target", presets="good_quality", search_strategy="genetic", ) predictor.fit(train_df, time_limit=3600) ``` The genetic search is most effective with longer time budgets (30+ minutes) where it has room for multiple generations. For quick experiments, `'portfolio'` or `'heuristic'` converge faster. --- ## Quality Guardrails The guardrails stage runs early in the pipeline and checks for: - **Target leakage** — features with |correlation| > 0.95 with the target - **Feature redundancy** — feature pairs with |correlation| > 0.98 - **Data health** — constant columns, all-missing columns, too few samples, extreme feature-to-sample ratio, minority class < 1%, ID-like columns By default, issues are logged as warnings and the pipeline continues. To abort on critical issues: ```python predictor = TabularPredictor( label="target", presets="good_quality", guardrails_strict=True, # Abort on critical issues ) predictor.fit(train_df) ``` Quality warnings are included in the performance report: ```python report = predictor.report() for warning in report.quality_warnings: print(f"[{warning.severity}] {warning.message}") ``` --- ## Deployment Constraints Specify deployment constraints to automatically filter out non-compliant models: ```python from endgame.automl import TabularPredictor, DeploymentConstraints predictor = TabularPredictor( label="target", presets="good_quality", constraints=DeploymentConstraints( max_predict_latency_ms=10.0, # Max 10ms per 100-sample batch max_model_size_mb=50.0, # Max 50MB serialized require_interpretable=False, # Allow black-box models ), ) predictor.fit(train_df) ``` The constraint check stage runs after model training and before HPO, measuring prediction latency and model size for each trained model. Non-compliant models are flagged in the report but still available for inspection. --- ## Intelligent CV Selection The pipeline automatically selects the most appropriate cross-validation strategy based on data characteristics: | Data Characteristic | CV Strategy | Notes | |---|---|---| | Time series detected | `PurgedTimeSeriesSplit` | Uses purging and embargo to prevent lookahead | | Group column present | `StratifiedGroupKFold` | Keeps groups intact across folds | | Small dataset (< 500 samples) | `RepeatedStratifiedKFold` / `RepeatedKFold` | 3 repeats for stable estimates | | Imbalanced classification | `StratifiedKFold` | Preserves class balance in each fold | | Default classification | `StratifiedKFold` | Standard stratified k-fold | | Default regression | `KFold` | Standard k-fold | The strategy is chosen once per run and applied consistently across all model evaluations. The number of folds is set by the preset (e.g. 8 for `best_quality`, 5 for `good_quality`, 3 for `fast`). --- ## Hyperparameter Tuning When enabled in the preset (`hyperparameter_tune=True`), the HPO stage selects the top-3 models by CV score and tunes them with Optuna. Tuning spaces are defined per model in the model registry (e.g., `lgbm_standard`, `xgb_standard`, `catboost_standard`). The time budget for HPO is divided evenly across the top models. If tuning improves a model's score, the tuned version replaces the original. ```python # HPO is enabled by default for good_quality and above predictor = TabularPredictor(label="target", presets="good_quality") predictor.fit(train_df, time_limit=3600) # Check tuning results report = predictor.report() for entry in report.tuning_summary: print(f"{entry['model']}: {entry['original_score']:.4f} → {entry['tuned_score']:.4f}") ``` --- ## Ensembling After individual models are trained, `TabularPredictor` builds an ensemble. When the preset uses `ensemble_method="auto"` (default for most presets), all six ensemble methods are tried and the best is selected by OOF score: | Method | Description | |---|---| | **Hill climbing** | Forward model selection optimizing the evaluation metric | | **Stacking** | Meta-learner trained on out-of-fold predictions | | **Optimized blend** | Optuna-optimized blending weights | | **Power blend** | Score-proportional power weighting | | **Rank averaging** | Rank-based blending for heterogeneous predictions | | **Uniform averaging** | Simple equal-weight averaging (baseline) | The `fast` and `interpretable` presets disable ensembling (`ensemble_method="none"`) to prioritize speed and interpretability respectively. Ensembling runs after HPO and threshold optimization, so it operates on the best available versions of each model. --- ## Threshold Optimization For classification tasks, the threshold optimization stage finds optimal decision thresholds using out-of-fold predictions. This is particularly valuable for imbalanced datasets where the default 0.5 threshold is suboptimal. The optimized thresholds are automatically applied in `predict()` when available. This is transparent — no code changes needed. --- ## Continuous Training When `keep_training=True`, the predictor enters a continuous optimization loop after the main pipeline completes. This loop alternates between: 1. **Model search** — ask the search strategy for new configurations 2. **Training** — fit the suggested configurations with CV 3. **Optional HPO** — run Optuna on the best models if time permits 4. **Re-ensembling** — rebuild the ensemble with the expanded model pool The loop runs until one of: - `patience` consecutive rounds without improvement exceeding `min_improvement` - Total `time_limit` reached - `KeyboardInterrupt` (saves checkpoint and exits gracefully) ```python # Run until convergence with periodic checkpoints predictor = TabularPredictor( label="target", presets="exhaustive", keep_training=True, patience=10, min_improvement=1e-5, checkpoint_dir="checkpoints/", ) predictor.fit(train_df) ``` Set `patience=0` for truly unlimited optimization (useful with `search_strategy="genetic"` or `"exhaustive"` preset). --- ## Early Stopping for GBDTs Gradient-boosted decision tree models (LightGBM, XGBoost, CatBoost) use early stopping during cross-validation to avoid training unnecessary boosting rounds. A validation set from each CV fold monitors performance, and training halts when no improvement is seen for `early_stopping_rounds` consecutive rounds. This is enabled by default (`early_stopping_rounds=50`) and applies only during CV scoring — the final refit on all data trains for the full `n_estimators`. ```python # Increase patience for noisy datasets predictor = TabularPredictor( label="target", presets="best_quality", early_stopping_rounds=100, ) predictor.fit(train_df) ``` --- ## GPU Support Set `use_gpu=True` to enable GPU acceleration for models that support it (e.g. XGBoost, LightGBM, CatBoost, PyTorch-based neural models). ```python predictor = TabularPredictor( label="target", presets="best_quality", use_gpu=True, ) predictor.fit(train_df) ``` When GPU mode is enabled: - CUDA is validated at startup; a warning is emitted if no GPU is detected - Training uses thread-based execution instead of fork to avoid CUDA re-initialization issues - If a model encounters a CUDA out-of-memory error, it automatically falls back to CPU for that model - When `use_gpu=False` (default), `CUDA_VISIBLE_DEVICES=""` is set to force CPU-only mode in worker processes --- ## Model Interpretability After fitting, inspect the learned structures of trained models: ### `display_models()` Prints rules, trees, equations, scorecards, coefficients, and feature importances for every trained model. ```python predictor = TabularPredictor(label="target", presets="interpretable") predictor.fit(train_df) # Display all trained models text = predictor.display_models() ``` ### `display_model(name)` Display the learned structure of a single model: ```python # Display a specific model's rules/structure predictor.display_model("ebm") predictor.display_model("rulefit") ``` Both methods accept `top_rules` (max rules/terms per model, default 15) and `top_features` (max features per importance display, default 10). --- ## Explainability The explainability stage computes SHAP-based feature importances for the best model using a subsample of the training data. Results are stored in the predictor and the performance report. ```python predictor.fit(train_df) # Access explanations explanations = predictor.explain() print("Top features:", explanations["top_features"]) print(explanations["feature_importance_df"]) ``` --- ## Performance Report After fitting, a structured `AutoMLReport` is generated automatically. It contains: - **Summary** — preset, time limit, total time, best score, number of models - **Stage summary** — per-stage timing and success status - **Model leaderboard** — all trained models ranked by score - **Quality warnings** — issues detected by the guardrails stage - **Feature importances** — SHAP-based importances from the explainability stage - **Tuning summary** — per-model HPO results (original vs tuned score) - **Constraint violations** — deployment constraint failures ```python predictor.fit(train_df) # Get the report object report = predictor.report() # Print as markdown print(report.to_markdown()) # Or convert to dict for programmatic access data = report.to_dict() # Display to stdout report.display() ``` ### HTML Reports Generate self-contained HTML reports with embedded CSS — no external dependencies required: ```python report = predictor.report() # Get HTML string html = report.to_html(title="My Experiment") # Save directly to file report.save_html("report.html", title="My Experiment") ``` The HTML report includes the full leaderboard, stage timing breakdown, quality warnings, feature importances chart, and tuning results in a styled, printable format. --- ## Feedback Loop When the preset enables HPO and time remains after the linear pipeline, a feedback loop runs up to 3 additional iterations: 1. Update the search strategy with all results collected so far 2. Suggest 2 new model configurations not yet tried 3. Train them with 50% of remaining time 4. Merge results and re-run ensembling This iterative refinement is automatic and requires no configuration. It activates when at least 60 seconds remain in the time budget. --- ## Task Inference `TabularPredictor` infers the task type from `y_train` automatically: - Integer or string labels with fewer than 20 unique values → classification - Float labels or integers with many unique values → regression Override with the `problem_type` argument when automatic inference is wrong: ```python predictor = TabularPredictor(label="target", problem_type="regression") predictor.fit(train_df) ``` Supported values: `'binary'`, `'multiclass'`, `'regression'`, `'auto'`. --- ## Customising the Search ### Time limits ```python predictor = TabularPredictor( label="target", presets="high_quality", time_limit=1800, # seconds; stops search after 30 minutes ) predictor.fit(train_df) ``` ### Custom evaluation metric ```python from sklearn.metrics import f1_score def macro_f1(y_true, y_pred): return f1_score(y_true, y_pred, average='macro') predictor = TabularPredictor( label="target", presets="good_quality", eval_metric=macro_f1, ) predictor.fit(train_df) ``` Built-in metric strings (`'roc_auc'`, `'accuracy'`, `'rmse'`, `'mae'`, `'log_loss'`) are also accepted. --- ## Retrieving the Best Model ```python best = predictor.get_model(predictor.fit_summary_.best_model) y_pred = best.predict(X_test) # Or use the predictor directly — delegates to the ensemble / best model y_pred = predictor.predict(test_df) ``` --- ## Incremental Checkpointing Save progress during long runs with `checkpoint_dir`. The top-N models (by score) are saved after key stages and each continuous-loop iteration. Stale models from earlier iterations are automatically removed. ```python predictor = TabularPredictor( label="target", presets="exhaustive", checkpoint_dir="checkpoints/my_run", keep_training=True, ) predictor.fit(train_df) ``` The checkpoint directory contains: - `models/` — top-N serialized models - `ensemble` — current ensemble - `preprocessor` — fitted preprocessor - `leaderboard.csv` — full result history - `checkpoint_meta.pkl` — metadata (preset, problem type, timestamp) --- ## Domain-Specific Predictors Specialised predictors extend `TabularPredictor` with domain defaults: | Class | Domain | Notes | |---|---|---| | `TimeSeriesPredictor` | Forecasting | Wraps `eg.timeseries` models | | `TextPredictor` | NLP / classification | Wraps `eg.nlp` transformers | | `VisionPredictor` | Computer vision | Wraps `eg.vision` backbones | | `MultiModalPredictor` | Multi-modal fusion | Combines tabular + text + image + audio | ```python from endgame.automl import TimeSeriesPredictor ts_pred = TimeSeriesPredictor(preset='high_quality', horizon=12) ts_pred.fit(train_df, target_col='sales') forecast = ts_pred.predict() ``` --- ## Refit for Deployment After `fit()` selects the best model via cross-validation, call `refit_full()` to retrain on **all** available data (train + validation) for maximum deployment performance: ```python predictor = TabularPredictor(label="target", presets="best_quality") predictor.fit(train_df) # Retrain best model on all data before deploying predictor.refit_full() # Now predict with the full-data model y_pred = predictor.predict(test_df) ``` Note: after `refit_full()`, the model can no longer be evaluated on a holdout set. Use this only when you are ready to deploy. --- ## Experiment Tracking Pass an experiment logger to automatically track parameters and metrics: ```python from endgame.automl import TabularPredictor from endgame.tracking import MLflowLogger with MLflowLogger(experiment_name="my_project") as logger: predictor = TabularPredictor(label="target", logger=logger) predictor.fit(train_df) ``` See the [Tracking Guide](tracking.md) for full details on console logging, MLflow integration, and custom backends. --- ## MultiModal Fusion Strategies `MultiModalPredictor` supports five fusion strategies for combining predictions across modalities (tabular, text, image, audio): | Strategy | Description | |---|---| | `"late"` | Equal-weight averaging of predictions | | `"weighted"` | Score-proportional or manual weights | | `"stacking"` | Meta-learner (LogisticRegression/Ridge) on modality outputs | | `"attention"` | Learned per-sample weights via MLP | | `"embedding"` | Mid-level feature concatenation with GradientBoosting on top | ```python from endgame.automl import MultiModalPredictor predictor = MultiModalPredictor( label="sentiment", fusion_strategy="embedding", text_columns=["review"], tabular_columns=["price", "rating"], ) predictor.fit(train_df) ``` --- ## Saving and Loading ```python from endgame.persistence import save, load save(predictor, 'my_predictor.eg') # Later, in a new session: predictor = load('my_predictor.eg') y_pred = predictor.predict(X_test) ``` --- ## API Reference Full parameter documentation is available in the auto-generated API reference at `docs/api/automl.rst` or by calling `help(TabularPredictor)` at the Python prompt.