Semi-Supervised

class endgame.semi_supervised.SelfTrainingClassifier(base_estimator, criterion='threshold', threshold=0.75, k_best=10, max_iter=10, sample_weight_decay=1.0, progressive_weight=False, min_confidence_increase=0.0, verbose=False, random_state=None)[source]

Bases: BaseEstimator, ClassifierMixin, MetaEstimatorMixin

Self-training classifier for semi-supervised learning.

Wraps any sklearn-compatible classifier to perform iterative pseudo-labeling. The algorithm repeatedly:

  1. Trains on labeled + pseudo-labeled data

  2. Predicts on the remaining unlabeled data

  3. Selects high-confidence predictions as new pseudo-labels

  4. Repeats until convergence or the maximum number of iterations is reached
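The loop described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: CentroidClassifier is a hypothetical stand-in for base_estimator, and self_train is an illustrative helper that only covers the 'threshold' criterion.

```python
import numpy as np


class CentroidClassifier:
    """Hypothetical stand-in for a base estimator (nearest class centroid)."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        # Softmax-like normalization over negative distances to each centroid.
        dist = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        scores = np.exp(-dist)
        return scores / scores.sum(axis=1, keepdims=True)


def self_train(X, y, threshold=0.75, max_iter=10):
    """Iterative pseudo-labeling; y uses -1 for unlabeled samples."""
    labels = y.copy()
    labeled_iter = np.where(y == -1, -1, 0)
    for iteration in range(1, max_iter + 1):
        labeled = labels != -1
        if labeled.all():
            break
        # 1. Train on labeled + pseudo-labeled data.
        model = CentroidClassifier().fit(X[labeled], labels[labeled])
        # 2. Predict on the remaining unlabeled data.
        unlabeled_idx = np.flatnonzero(~labeled)
        proba = model.predict_proba(X[unlabeled_idx])
        # 3. Keep only high-confidence predictions as pseudo-labels.
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # 4. Nothing new meets the criterion, so stop.
        chosen = unlabeled_idx[confident]
        labels[chosen] = model.classes_[proba[confident].argmax(axis=1)]
        labeled_iter[chosen] = iteration
    return labels, labeled_iter


X = np.array([[0.0], [0.2], [5.0], [5.2], [0.1], [5.1]])
y = np.array([0, 0, 1, 1, -1, -1])
labels, labeled_iter = self_train(X, y)
```

In this toy run both unlabeled points sit on a class centroid, so they are pseudo-labeled in the first iteration and the loop stops because nothing is left to label.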

Parameters:
  • base_estimator (estimator object) – Any sklearn-compatible classifier with fit, predict, and predict_proba methods. Will be cloned for each iteration.

  • criterion ({'threshold', 'k_best'}, default='threshold') – Selection strategy for pseudo-labels:

    • 'threshold': Select samples with confidence >= threshold

    • 'k_best': Select the top k most confident samples per iteration

  • threshold (float, default=0.75) – Minimum confidence (max probability) required to add a pseudo-label. Only used when criterion='threshold'.

  • k_best (int, default=10) – Number of samples to pseudo-label per iteration. Only used when criterion='k_best'.

  • max_iter (int, default=10) – Maximum number of self-training iterations. Set to None for unlimited iterations (until no more samples meet the criterion).

  • sample_weight_decay (float, default=1.0) – Weight multiplier for pseudo-labeled samples relative to true labels:

    • 1.0: Equal weight for pseudo-labels and true labels

    • < 1.0: Lower weight for pseudo-labels (more conservative)

    Values below 1.0 are recommended when noise in the pseudo-labels is a concern.

  • progressive_weight (bool, default=False) – If True, weight pseudo-labels by their confidence score. Overrides sample_weight_decay for pseudo-labeled samples.

  • min_confidence_increase (float, default=0.0) – Minimum increase in average confidence required to continue. Helps detect when self-training has converged.

  • verbose (bool, default=False) – Print progress information during training.

  • random_state (int, RandomState, or None, default=None) – Random seed for reproducibility (used in k_best tie-breaking).
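The interaction between sample_weight_decay and progressive_weight can be illustrated with a short sketch. It assumes that true labels always get weight 1.0 and that sample_weight_decay is applied as a constant multiplier. The helper pseudo_label_weights and its confidence argument are hypothetical and not part of the documented API.

```python
import numpy as np


def pseudo_label_weights(labeled_iter, confidence, sample_weight_decay=1.0,
                         progressive_weight=False):
    """Build per-sample weights for labeled and pseudo-labeled rows.

    labeled_iter: 0 for true labels, > 0 for pseudo-labels.
    confidence: confidence recorded when each pseudo-label was assigned.
    """
    weights = np.ones(len(labeled_iter), dtype=float)
    pseudo = labeled_iter > 0
    if progressive_weight:
        # Confidence-based weights take precedence over the fixed multiplier.
        weights[pseudo] = confidence[pseudo]
    else:
        weights[pseudo] = sample_weight_decay
    return weights


labeled_iter = np.array([0, 0, 1, 2])
confidence = np.array([1.0, 1.0, 0.9, 0.8])
fixed = pseudo_label_weights(labeled_iter, confidence, sample_weight_decay=0.5)
progressive = pseudo_label_weights(labeled_iter, confidence, progressive_weight=True)
```

Here fixed down-weights both pseudo-labels to 0.5, while progressive uses their individual confidences (0.9 and 0.8).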

base_estimator_

The fitted base estimator.

Type:

estimator

classes_

Class labels.

Type:

ndarray of shape (n_classes,)

n_classes_

Number of classes.

Type:

int

n_features_in_

Number of features seen during fit.

Type:

int

n_iter_

Number of self-training iterations performed.

Type:

int

labeled_iter_

Iteration when each sample was labeled:

  • 0: Originally labeled

  • i > 0: Pseudo-labeled in iteration i

  • -1: Never labeled (still unlabeled)

Type:

ndarray of shape (n_samples,)
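As a rough illustration of how this encoding can be queried, the sketch below uses a hand-written array in place of a fitted labeled_iter_ attribute; the values are hypothetical.

```python
import numpy as np

# Hypothetical values of labeled_iter_ after fitting on eight samples.
labeled_iter = np.array([0, 0, 0, 1, 2, -1, 0, 1])

originally_labeled = np.flatnonzero(labeled_iter == 0)
pseudo_labeled = np.flatnonzero(labeled_iter > 0)
still_unlabeled = np.flatnonzero(labeled_iter == -1)

# Number of pseudo-labels added in each iteration (index 0 = iteration 1).
added_per_iter = np.bincount(labeled_iter[pseudo_labeled])[1:]
```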

pseudo_labels_

Final labels for all samples (true labels + pseudo-labels).

Type:

ndarray of shape (n_samples,)

transduction_

Same as pseudo_labels_ (sklearn compatibility).

Type:

ndarray of shape (n_samples,)

termination_condition_

Reason for stopping: 'max_iter', 'no_change', 'all_labeled', or 'confidence_plateau'.

Type:

str

history_

Training history with keys:

  • 'n_pseudo_labeled': List of cumulative pseudo-labeled counts

  • 'mean_confidence': List of mean confidence per iteration

  • 'selected_per_iter': List of samples selected per iteration

Type:

dict

Examples

>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> from endgame.semi_supervised import SelfTrainingClassifier
>>>
>>> # Prepare data: -1 indicates unlabeled samples
>>> rng = np.random.default_rng(0)
>>> X_train = rng.normal(size=(8, 2))  # one row per entry of y_train
>>> X_test = rng.normal(size=(3, 2))
>>> y_train = np.array([0, 1, 0, -1, -1, -1, 1, -1])
>>>
>>> # Create self-training classifier
>>> st = SelfTrainingClassifier(
...     base_estimator=RandomForestClassifier(n_estimators=100),
...     threshold=0.8,
...     max_iter=10,
... )
>>> st.fit(X_train, y_train)
>>>
>>> # Predict on new data
>>> predictions = st.predict(X_test)
>>> probabilities = st.predict_proba(X_test)
>>>
>>> # Check which samples were pseudo-labeled
>>> print(f"Pseudo-labeled in iter 1: {np.sum(st.labeled_iter_ == 1)}")

Notes

Choosing threshold vs k_best:

  • threshold is preferred when you have a good sense of model calibration. It naturally adapts the number of samples based on confidence.

  • k_best is preferred for controlled expansion. It guarantees progress each iteration but may add low-confidence samples if k is too large.
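The difference between the two criteria can be seen on a small probability matrix. The sketch below is an assumption-laden illustration: select_by_threshold and select_k_best are hypothetical helpers, and confidence is taken to be the maximum class probability, as described for the threshold parameter.

```python
import numpy as np


def select_by_threshold(proba, threshold=0.75):
    """Return row indices whose top class probability is >= threshold."""
    confidence = proba.max(axis=1)
    return np.flatnonzero(confidence >= threshold)


def select_k_best(proba, k_best=10):
    """Return row indices of the k most confident rows, most confident first."""
    confidence = proba.max(axis=1)
    order = np.argsort(-confidence, kind="stable")
    return order[:k_best]


proba = np.array([
    [0.95, 0.05],
    [0.60, 0.40],
    [0.20, 0.80],
    [0.55, 0.45],
])
by_threshold = select_by_threshold(proba, threshold=0.75)
by_k_best = select_k_best(proba, k_best=3)
```

With these values the threshold criterion keeps only rows 0 and 2, while k_best=3 also admits row 1 at a confidence of 0.60, which is the risk noted above when k is too large.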

Avoiding confirmation bias:

Self-training can reinforce the model's mistakes (confirmation bias). To mitigate this:

  • Use a high threshold (0.9+)

  • Use sample_weight_decay < 1.0 to trust pseudo-labels less

  • Set min_confidence_increase > 0 to detect plateaus

  • Consider using progressive_weight=True
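One way to picture the plateau check is as a comparison of mean confidence between consecutive iterations. The helper first_plateau below is hypothetical, and the exact rule the library applies may differ; the sketch assumes training stops once the gain drops below min_confidence_increase.

```python
def first_plateau(mean_confidence, min_confidence_increase=0.01):
    """Return the 1-based iteration at which the gain in mean confidence
    first falls below min_confidence_increase, or None if it never does."""
    for i in range(1, len(mean_confidence)):
        gain = mean_confidence[i] - mean_confidence[i - 1]
        if gain < min_confidence_increase:
            return i + 1
    return None


# Hypothetical mean confidence per iteration (iterations 1 to 4).
mean_confidence = [0.80, 0.86, 0.88, 0.885]
stop_at = first_plateau(mean_confidence, min_confidence_increase=0.01)
```

In this example the gains are 0.06, 0.02, and 0.005, so the check trips at iteration 4.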

Memory efficiency:

The wrapper stores labeled_iter_ for all samples. For very large unlabeled sets, consider batching the unlabeled data.
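A possible batching scheme, sketched under the assumption that each batch should contain every labeled row plus one chunk of unlabeled rows, is shown below. The generator unlabeled_batches is a hypothetical helper; how the per-batch results are combined is left to the caller.

```python
import numpy as np


def unlabeled_batches(y, batch_size):
    """Yield index arrays: all labeled rows plus one chunk of unlabeled rows."""
    labeled_idx = np.flatnonzero(y != -1)
    unlabeled_idx = np.flatnonzero(y == -1)
    for start in range(0, len(unlabeled_idx), batch_size):
        chunk = unlabeled_idx[start:start + batch_size]
        yield np.concatenate([labeled_idx, chunk])


y = np.array([0, 1, -1, -1, -1, 1, -1, -1])
batches = list(unlabeled_batches(y, batch_size=2))
```

Each index array can then be used to slice X and y before fitting on that subset.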

fit(X, y, **fit_params)[source]

Fit the self-training classifier.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data (labeled + unlabeled).

  • y (array-like of shape (n_samples,)) – Target values. Use -1 to indicate unlabeled samples.

  • **fit_params (dict) – Additional parameters passed to base_estimator.fit(). Note: sample_weight is handled internally.

Return type:

SelfTrainingClassifier

Returns:

self (object) – Fitted estimator.
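To build a target vector in the expected format from a fully labeled one, the hidden entries can be overwritten with -1. The sketch below is illustrative only; the variable names and the 60% hiding rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
y_full = np.array([0, 1, 0, 1, 1, 0, 0, 1, 0, 1])

# Hide 6 of the 10 labels to simulate an unlabeled pool.
n_hidden = 6
hidden_idx = rng.choice(len(y_full), size=n_hidden, replace=False)
y_train = y_full.copy()
y_train[hidden_idx] = -1

labeled_mask = y_train != -1
```

Because -1 is reserved as the unlabeled marker, a real class encoded as -1 would be treated as unlabeled and should be re-encoded before fitting.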

predict(X)[source]

Predict class labels for samples in X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Samples to predict.

Return type:

ndarray

Returns:

y_pred (ndarray of shape (n_samples,)) – Predicted class labels.

predict_proba(X)[source]

Predict class probabilities for samples in X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Samples to predict.

Return type:

ndarray

Returns:

proba (ndarray of shape (n_samples, n_classes)) – Class probabilities.

predict_log_proba(X)[source]

Predict class log-probabilities for samples in X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Samples to predict.

Return type:

ndarray

Returns:

log_proba (ndarray of shape (n_samples, n_classes)) – Class log-probabilities.
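Assuming the log-probabilities are the elementwise natural logarithm of the predict_proba output (the usual sklearn convention, not confirmed here for this class), the relationship can be sketched as follows. The proba array is hypothetical.

```python
import numpy as np

# Hypothetical predict_proba output for two samples.
proba = np.array([[0.9, 0.1], [1.0, 0.0]])

# A probability of exactly 0 maps to -inf; suppress the divide warning.
with np.errstate(divide="ignore"):
    log_proba = np.log(proba)
```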

decision_function(X)[source]

Compute decision function for samples in X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Samples.

Return type:

ndarray

Returns:

decision (ndarray) – Decision function values.

get_pseudo_labeled_samples()[source]

Get indices, labels, and iterations of pseudo-labeled samples.

Return type:

tuple[ndarray, ndarray, ndarray]

Returns:

  • indices (ndarray) – Indices of pseudo-labeled samples.

  • labels (ndarray) – Pseudo-labels assigned.

  • iterations (ndarray) – Iteration when each sample was pseudo-labeled.
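The returned tuple can plausibly be derived from labeled_iter_ and pseudo_labels_. The sketch below shows that relationship on hand-written arrays; it assumes never-labeled samples keep -1 in pseudo_labels_, which the source does not state.

```python
import numpy as np

# Hypothetical fitted attributes for six samples.
labeled_iter = np.array([0, 1, -1, 2, 0, 1])
pseudo_labels = np.array([1, 0, -1, 1, 0, 0])

# Pseudo-labeled samples are those labeled in iteration 1 or later.
indices = np.flatnonzero(labeled_iter > 0)
labels = pseudo_labels[indices]
iterations = labeled_iter[indices]
```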

set_score_request(*, sample_weight='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self (object) – The updated object.

Return type:

SelfTrainingClassifier

class endgame.semi_supervised.SelfTrainingRegressor(base_estimator, criterion='threshold', threshold=1.0, k_best=10, uncertainty_method='ensemble', max_iter=10, sample_weight_decay=1.0, verbose=False, random_state=None)[source]

Bases: BaseEstimator, RegressorMixin, MetaEstimatorMixin

Self-training regressor for semi-supervised learning.

Extends self-training to regression by using prediction uncertainty instead of class probabilities for sample selection.

The uncertainty can be estimated via:

  • Ensemble variance (if base_estimator is an ensemble)

  • Quantile predictions (if supported)

  • Residual-based heuristics
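Ensemble-based uncertainty can be sketched as the spread of per-member predictions. The array below is hypothetical; with a fitted sklearn forest, per-member predictions are typically obtained by calling predict on each element of estimators_. Standard deviation is used here so the result is in the same units as threshold.

```python
import numpy as np

# Hypothetical per-member predictions: rows are ensemble members,
# columns are unlabeled samples.
member_predictions = np.array([
    [2.0, 5.0, 9.0],
    [2.1, 7.0, 9.2],
    [1.9, 3.0, 8.8],
])

pseudo_targets = member_predictions.mean(axis=0)
uncertainty = member_predictions.std(axis=0)

# Keep samples whose uncertainty is at or below the threshold.
threshold = 0.5
selected = np.flatnonzero(uncertainty <= threshold)
```

The members disagree strongly on the middle sample, so only the first and last samples would be pseudo-labeled at this threshold.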

Parameters:
  • base_estimator (estimator object) – Any sklearn-compatible regressor with fit and predict methods. For best results, use an estimator that can provide uncertainty estimates (e.g., RandomForestRegressor, GradientBoostingRegressor, QuantileRegressorForest).

  • criterion ({'threshold', 'k_best'}, default='threshold') – Selection strategy:

    • 'threshold': Select samples with uncertainty <= threshold

    • 'k_best': Select the k samples with the lowest uncertainty

  • threshold (float, default=1.0) – Maximum uncertainty (std dev) allowed for pseudo-labeling. Only used when criterion='threshold'.

  • k_best (int, default=10) – Number of samples to pseudo-label per iteration. Only used when criterion='k_best'.

  • uncertainty_method ({'ensemble', 'knn', 'residual'}, default='ensemble') – Method for estimating prediction uncertainty:

    • 'ensemble': Use variance across ensemble members (requires an ensemble with an estimators_ attribute, e.g., RandomForest)

    • 'knn': Use variance among k nearest labeled neighbors

    • 'residual': Use cross-validated residual magnitude

  • max_iter (int, default=10) – Maximum number of self-training iterations.

  • sample_weight_decay (float, default=1.0) – Weight multiplier for pseudo-labeled samples.

  • verbose (bool, default=False) – Print progress information.

  • random_state (int, RandomState, or None, default=None) – Random seed.
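The 'knn' option can be illustrated with a brute-force neighbor search. This is a sketch of the general idea, not the library's code: knn_uncertainty is a hypothetical helper, Euclidean distance and k=3 are assumptions, and it returns a standard deviation rather than a variance so that it is comparable to threshold.

```python
import numpy as np


def knn_uncertainty(X_labeled, y_labeled, X_unlabeled, k=3):
    """Std dev of the targets of the k nearest labeled neighbors."""
    # Pairwise Euclidean distances, shape (n_unlabeled, n_labeled).
    dist = np.linalg.norm(X_unlabeled[:, None, :] - X_labeled[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_labeled[nearest].std(axis=1)


X_labeled = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]])
y_labeled = np.array([1.0, 1.0, 1.0, 0.0, 5.0, 10.0])
X_unlabeled = np.array([[1.0], [11.0]])
uncertainty = knn_uncertainty(X_labeled, y_labeled, X_unlabeled, k=3)
```

The first query point sits among neighbors that all share the same target, so its uncertainty is zero; the second sits among neighbors with widely varying targets and gets a much larger value.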

base_estimator_

The fitted base estimator.

Type:

estimator

n_features_in_

Number of features.

Type:

int

n_iter_

Number of iterations performed.

Type:

int

labeled_iter_

Iteration when each sample was labeled (0 = originally labeled, i > 0 = pseudo-labeled in iteration i, -1 = still unlabeled).

Type:

ndarray of shape (n_samples,)

pseudo_labels_

Final labels including pseudo-labels.

Type:

ndarray of shape (n_samples,)

Examples

>>> import numpy as np
>>> from sklearn.ensemble import RandomForestRegressor
>>> from endgame.semi_supervised import SelfTrainingRegressor
>>>
>>> # Prepare data: np.nan indicates unlabeled samples
>>> rng = np.random.default_rng(0)
>>> X_train = rng.normal(size=(6, 2))  # one row per entry of y_train
>>> X_test = rng.normal(size=(3, 2))
>>> y_train = np.array([1.0, 2.5, 3.0, np.nan, np.nan, np.nan])
>>>
>>> st = SelfTrainingRegressor(
...     base_estimator=RandomForestRegressor(n_estimators=100),
...     threshold=0.5,  # Max std dev for pseudo-labeling
... )
>>> st.fit(X_train, y_train)
>>> predictions = st.predict(X_test)

fit(X, y, **fit_params)[source]

Fit the self-training regressor.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Training data (labeled + unlabeled).

  • y (array-like of shape (n_samples,)) – Target values. Use np.nan to indicate unlabeled samples.

  • **fit_params (dict) – Additional parameters passed to base_estimator.fit().

Return type:

SelfTrainingRegressor

Returns:

self (object) – Fitted estimator.
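Because NaN never compares equal to anything, including itself, the unlabeled rows have to be located with np.isnan rather than an equality test. A minimal sketch, with illustrative variable names:

```python
import numpy as np

y_train = np.array([1.0, 2.5, 3.0, np.nan, np.nan, np.nan])

# np.isnan is required: (y_train == np.nan) is False everywhere.
unlabeled = np.isnan(y_train)
labeled = ~unlabeled

n_unlabeled = int(unlabeled.sum())
```

Integer arrays cannot hold NaN, so the target array needs a floating-point dtype before unlabeled entries are marked.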

predict(X)[source]

Predict target values for samples in X.

Parameters:

X (array-like of shape (n_samples, n_features)) – Samples to predict.

Return type:

ndarray

Returns:

y_pred (ndarray of shape (n_samples,)) – Predicted values.

get_pseudo_labeled_samples()[source]

Get indices, labels, and iterations of pseudo-labeled samples.

Return type:

tuple[ndarray, ndarray, ndarray]

Returns:

  • indices (ndarray) – Indices of pseudo-labeled samples.

  • labels (ndarray) – Pseudo-labels assigned.

  • iterations (ndarray) – Iteration when each sample was pseudo-labeled.

set_score_request(*, sample_weight='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

  • True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.

  • False: metadata is not requested and the meta-estimator will not pass it to score.

  • None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.

  • str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:
  • sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self (object) – The updated object.

Return type:

SelfTrainingRegressor