Semi-Supervised

class endgame.semi_supervised.SelfTrainingClassifier(base_estimator, criterion='threshold', threshold=0.75, k_best=10, max_iter=10, sample_weight_decay=1.0, progressive_weight=False, min_confidence_increase=0.0, verbose=False, random_state=None)

Bases: BaseEstimator, ClassifierMixin, MetaEstimatorMixin

Self-training classifier for semi-supervised learning.

Wraps any sklearn-compatible classifier to perform iterative pseudo-labeling. The algorithm repeatedly:

1. Trains on labeled + pseudo-labeled data
2. Predicts on the remaining unlabeled data
3. Selects high-confidence predictions as new pseudo-labels
4. Repeats until convergence or the maximum number of iterations
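The loop described above can be sketched in plain NumPy. This is a minimal illustration, not the library's implementation: CentroidClassifier and self_train are hypothetical names introduced here, the stand-in classifier replaces a real base_estimator, and only the 'threshold' criterion is shown.

```python
import numpy as np


class CentroidClassifier:
    """Hypothetical stand-in classifier: softmax over negative centroid distances."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        # Distance from every sample to every centroid, shape (n_samples, n_classes).
        dist = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        logits = -dist
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        exp = np.exp(logits)
        return exp / exp.sum(axis=1, keepdims=True)


def self_train(X, y, threshold=0.75, max_iter=10):
    """Iterative pseudo-labeling sketch; y uses -1 for unlabeled samples."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(y).copy()
    labeled_iter = np.where(labels == -1, -1, 0)
    termination = "max_iter"
    for iteration in range(1, max_iter + 1):
        labeled = labels != -1
        if labeled.all():
            termination = "all_labeled"
            break
        # 1. Train on labeled + pseudo-labeled data.
        model = CentroidClassifier().fit(X[labeled], labels[labeled])
        # 2. Predict on the remaining unlabeled data.
        unlabeled_idx = np.flatnonzero(~labeled)
        proba = model.predict_proba(X[unlabeled_idx])
        confidence = proba.max(axis=1)
        # 3. Select high-confidence predictions as new pseudo-labels.
        keep = confidence >= threshold
        if not keep.any():
            termination = "no_change"
            break
        chosen = unlabeled_idx[keep]
        labels[chosen] = model.classes_[proba[keep].argmax(axis=1)]
        labeled_iter[chosen] = iteration
        # 4. The loop repeats until one of the stop conditions is hit.
    final = labels != -1
    model = CentroidClassifier().fit(X[final], labels[final])
    return model, labels, labeled_iter, termination


# Two well-separated clusters with one true label each (illustrative data).
X_demo = np.array([[0.0, 0.0], [0.5, 0.2], [0.2, 0.4],
                   [10.0, 10.0], [10.3, 9.8], [9.7, 10.1]])
y_demo = np.array([0, -1, -1, 1, -1, -1])
model, labels, labeled_iter, termination = self_train(X_demo, y_demo)
```

In this toy setup every unlabeled point is confidently assigned in the first iteration, so the second iteration stops because nothing is left to label.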

Parameters:

base_estimator (estimator object) – Any sklearn-compatible classifier with fit, predict, and predict_proba methods. Will be cloned for each iteration.
criterion ({'threshold', 'k_best'}, default='threshold') – Selection strategy for pseudo-labels:
    - 'threshold': Select samples with confidence >= threshold
    - 'k_best': Select the top k most confident samples per iteration
threshold (float, default=0.75) – Minimum confidence (max probability) required to add a pseudo-label. Only used when criterion='threshold'.
k_best (int, default=10) – Number of samples to pseudo-label per iteration. Only used when criterion='k_best'.
max_iter (int, default=10) – Maximum number of self-training iterations. Set to None for unlimited iterations (until no more samples meet the criterion).
sample_weight_decay (float, default=1.0) – Weight multiplier for pseudo-labeled samples relative to true labels:
    - 1.0: Equal weight for pseudo-labels and true labels
    - < 1.0: Lower weight for pseudo-labels (more conservative)
    Values < 1.0 are recommended when noise in pseudo-labels is a concern.
progressive_weight (bool, default=False) – If True, weight pseudo-labels by their confidence score. Overrides sample_weight_decay for pseudo-labeled samples.
min_confidence_increase (float, default=0.0) – Minimum increase in average confidence required to continue. Helps detect when self-training has converged.
verbose (bool, default=False) – Print progress information during training.
random_state (int, RandomState, or None, default=None) – Random seed for reproducibility (used in k_best tie-breaking).

base_estimator_

The fitted base estimator.

Type: estimator

classes_

Class labels.

Type: ndarray of shape (n_classes,)

n_classes_

Number of classes.

Type: int

n_features_in_

Number of features seen during fit.

Type: int

n_iter_

Number of self-training iterations performed.

Type: int

labeled_iter_

Iteration in which each sample was labeled:
- 0: Originally labeled
- i > 0: Pseudo-labeled in iteration i
- -1: Never labeled (still unlabeled)

Type: ndarray of shape (n_samples,)
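As a small illustration of the labeled_iter_ encoding, the sketch below splits a hand-written array (hypothetical values, not output from a real fit) into the three groups and counts how many samples were added per iteration.

```python
import numpy as np

# Hypothetical labeled_iter_ for 8 training samples (illustrative values only).
labeled_iter = np.array([0, 0, 0, 1, 2, -1, 0, 1])

originally_labeled = np.flatnonzero(labeled_iter == 0)
pseudo_labeled = np.flatnonzero(labeled_iter > 0)
never_labeled = np.flatnonzero(labeled_iter == -1)

# Number of samples pseudo-labeled in each iteration i > 0.
iterations, counts = np.unique(labeled_iter[labeled_iter > 0], return_counts=True)
added_per_iter = dict(zip(iterations.tolist(), counts.tolist()))
```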

pseudo_labels_

Final labels for all samples (true labels + pseudo-labels).

Type: ndarray of shape (n_samples,)

transduction_

Same as pseudo_labels_ (sklearn compatibility).

Type: ndarray of shape (n_samples,)

termination_condition_

Reason for stopping: 'max_iter', 'no_change', 'all_labeled', or 'confidence_plateau'.

Type: str

history_

Training history with keys:
- 'n_pseudo_labeled': List of cumulative pseudo-labeled counts
- 'mean_confidence': List of mean confidence per iteration
- 'selected_per_iter': List of samples selected per iteration

Type: dict

Examples

>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> from endgame.semi_supervised import SelfTrainingClassifier
>>>
>>> # Placeholder features for illustration; one row per entry of y_train
>>> rng = np.random.default_rng(0)
>>> X_train, X_test = rng.normal(size=(8, 4)), rng.normal(size=(3, 4))
>>>
>>> # Prepare data: -1 indicates unlabeled samples
>>> y_train = np.array([0, 1, 0, -1, -1, -1, 1, -1])
>>>
>>> # Create self-training classifier
>>> st = SelfTrainingClassifier(
...     base_estimator=RandomForestClassifier(n_estimators=100),
...     threshold=0.8,
...     max_iter=10,
... )
>>> st.fit(X_train, y_train)
>>>
>>> # Predict on new data
>>> predictions = st.predict(X_test)
>>> probabilities = st.predict_proba(X_test)
>>>
>>> # Check which samples were pseudo-labeled
>>> print(f"Pseudo-labeled in iter 1: {np.sum(st.labeled_iter_ == 1)}")

Notes

Choosing threshold vs k_best:

threshold is preferred when you have a good sense of model calibration. It naturally adapts the number of samples based on confidence.
k_best is preferred for controlled expansion. It guarantees progress each iteration but may add low-confidence samples if k is too large.
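The trade-off between the two criteria can be seen in a short NumPy sketch. The helper names select_threshold and select_k_best are hypothetical and only mirror the documented behavior; the library's own tie-breaking (which uses random_state) is not reproduced here.

```python
import numpy as np


def select_threshold(confidence, threshold=0.75):
    """Indices of samples whose confidence is at least `threshold`."""
    return np.flatnonzero(confidence >= threshold)


def select_k_best(confidence, k_best=10):
    """Indices of the k most confident samples (ties broken by position)."""
    k = min(k_best, confidence.shape[0])
    order = np.argsort(-confidence, kind="stable")
    return np.sort(order[:k])


# Max class probability for five unlabeled samples (illustrative values).
confidence = np.array([0.95, 0.60, 0.80, 0.55, 0.99])

by_threshold = select_threshold(confidence, 0.9)  # count depends on the data
by_k_best = select_k_best(confidence, 3)          # always 3 if enough samples
strict = select_threshold(confidence, 0.999)      # may select nothing at all
```

With a very strict threshold nothing is selected and training stops, while k_best still returns three samples, including the one at 0.80 that a 0.9 threshold would reject.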

Avoiding confirmation bias:

Self-training can reinforce the model's mistakes (confirmation bias). To mitigate this:
- Use a high threshold (0.9+)
- Use sample_weight_decay < 1.0 to trust pseudo-labels less
- Set min_confidence_increase > 0 to detect plateaus
- Consider using progressive_weight=True
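One way to picture the two weighting options is the sketch below. It assumes sample_weight_decay acts as a flat multiplier on every pseudo-labeled sample and that progressive_weight replaces it with the sample's confidence; build_sample_weight is a hypothetical helper, and the library may combine these differently.

```python
import numpy as np


def build_sample_weight(labeled_iter, confidence,
                        sample_weight_decay=1.0, progressive_weight=False):
    """Per-sample weights for the current training subset (labeled_iter >= 0)."""
    labeled_iter = np.asarray(labeled_iter)
    weights = np.ones(labeled_iter.shape[0], dtype=float)  # true labels keep 1.0
    pseudo = labeled_iter > 0
    if progressive_weight:
        # Assumption: confidence overrides the flat decay for pseudo-labels.
        weights[pseudo] = np.asarray(confidence, dtype=float)[pseudo]
    else:
        weights[pseudo] = sample_weight_decay
    return weights


labeled_iter = np.array([0, 0, 1, 2])
confidence = np.array([1.0, 1.0, 0.9, 0.8])

flat = build_sample_weight(labeled_iter, confidence, sample_weight_decay=0.5)
progressive = build_sample_weight(labeled_iter, confidence, progressive_weight=True)
```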

Memory efficiency:

The wrapper stores labeled_iter_ for all samples. For very large unlabeled sets, consider batching the unlabeled data.

fit(X, y, **fit_params)

Fit the self-training classifier.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data (labeled + unlabeled).
y (array-like of shape (n_samples,)) – Target values. Use -1 to indicate unlabeled samples.
**fit_params (dict) – Additional parameters passed to base_estimator.fit(). Note: sample_weight is handled internally.
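A target vector in the expected format can be built from fully known labels and a boolean mask. The helper mask_unlabeled is hypothetical; the check reflects the fact that -1 is reserved as the unlabeled marker and therefore cannot also be a real class.

```python
import numpy as np


def mask_unlabeled(y, known):
    """Return a copy of y with -1 wherever `known` is False."""
    y = np.asarray(y)
    known = np.asarray(known, dtype=bool)
    if np.any(y[known] == -1):
        raise ValueError("-1 is reserved for unlabeled samples")
    return np.where(known, y, -1)


y_true = np.array([0, 1, 0, 1, 1, 0, 1, 0])
known = np.array([True, True, True, False, False, False, True, False])
y_train = mask_unlabeled(y_true, known)
```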

Return type:

SelfTrainingClassifier

Returns:

self (object) – Fitted estimator.

predict(X)

Predict class labels for samples in X.

Parameters: X (array-like of shape (n_samples, n_features)) – Samples to predict.
Return type: ndarray
Returns: y_pred (ndarray of shape (n_samples,)) – Predicted class labels.

predict_proba(X)

Predict class probabilities for samples in X.

Parameters: X (array-like of shape (n_samples, n_features)) – Samples to predict.
Return type: ndarray
Returns: proba (ndarray of shape (n_samples, n_classes)) – Class probabilities.

predict_log_proba(X)

Predict class log-probabilities for samples in X.

Parameters: X (array-like of shape (n_samples, n_features)) – Samples to predict.
Return type: ndarray
Returns: log_proba (ndarray of shape (n_samples, n_classes)) – Class log-probabilities.

decision_function(X)

Compute decision function for samples in X.

Parameters: X (array-like of shape (n_samples, n_features)) – Samples.
Return type: ndarray
Returns: decision (ndarray) – Decision function values.

get_pseudo_labeled_samples()

Get indices, labels, and iterations of pseudo-labeled samples.

Return type:

tuple[ndarray, ndarray, ndarray]

Returns:

indices (ndarray) – Indices of pseudo-labeled samples.
labels (ndarray) – Pseudo-labels assigned.
iterations (ndarray) – Iteration when each sample was pseudo-labeled.
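The returned tuple can be understood in terms of the fitted attributes. The sketch below assumes the three arrays are derived from labeled_iter_ and pseudo_labels_ as shown; pseudo_labeled_samples is a hypothetical function, not the method's actual source.

```python
import numpy as np


def pseudo_labeled_samples(labeled_iter, pseudo_labels):
    """Indices, labels, and iterations of samples with labeled_iter > 0."""
    labeled_iter = np.asarray(labeled_iter)
    pseudo_labels = np.asarray(pseudo_labels)
    indices = np.flatnonzero(labeled_iter > 0)
    return indices, pseudo_labels[indices], labeled_iter[indices]


# Illustrative attribute values for five training samples.
labeled_iter = np.array([0, 1, -1, 2, 0])
pseudo_labels = np.array([0, 1, -1, 0, 1])

indices, labels, iterations = pseudo_labeled_samples(labeled_iter, pseudo_labels)
```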

set_score_request(*, sample_weight='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the score method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self (object) – The updated object.

Return type:

SelfTrainingClassifier

class endgame.semi_supervised.SelfTrainingRegressor(base_estimator, criterion='threshold', threshold=1.0, k_best=10, uncertainty_method='ensemble', max_iter=10, sample_weight_decay=1.0, verbose=False, random_state=None)

Bases: BaseEstimator, RegressorMixin, MetaEstimatorMixin

Self-training regressor for semi-supervised learning.

Extends self-training to regression by using prediction uncertainty instead of class probabilities for sample selection.

The uncertainty can be estimated via:
- Ensemble variance (if base_estimator is an ensemble)
- Quantile predictions (if supported)
- Residual-based heuristics

Parameters:

base_estimator (estimator object) – Any sklearn-compatible regressor with fit and predict methods. For best results, use an estimator that can provide uncertainty estimates (e.g., RandomForestRegressor, GradientBoostingRegressor, QuantileRegressorForest).
criterion ({'threshold', 'k_best'}, default='threshold') – Selection strategy:
    - 'threshold': Select samples with uncertainty <= threshold
    - 'k_best': Select the k samples with the lowest uncertainty
threshold (float, default=1.0) – Maximum uncertainty (std dev) allowed for pseudo-labeling. Only used when criterion='threshold'.
k_best (int, default=10) – Number of samples to pseudo-label per iteration. Only used when criterion='k_best'.
uncertainty_method ({'ensemble', 'knn', 'residual'}, default='ensemble') – Method for estimating prediction uncertainty:
    - 'ensemble': Use variance across ensemble members (requires an ensemble with an estimators_ attribute, e.g., RandomForest)
    - 'knn': Use variance among the k nearest labeled neighbors
    - 'residual': Use cross-validated residual magnitude
max_iter (int, default=10) – Maximum number of self-training iterations.
sample_weight_decay (float, default=1.0) – Weight multiplier for pseudo-labeled samples.
verbose (bool, default=False) – Print progress information.
random_state (int, RandomState, or None, default=None) – Random seed.
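The 'ensemble' and 'knn' uncertainty estimates and the two selection criteria can be sketched in NumPy as follows. The function names are hypothetical, the spread is measured as a standard deviation to match the threshold description, and the neighbor count k is an assumption since the class exposes no such parameter.

```python
import numpy as np


def ensemble_std(member_predictions):
    """Std dev across ensemble members; input shape (n_members, n_samples).

    With a fitted RandomForestRegressor, each row could come from one tree
    in estimators_ calling predict on the unlabeled samples.
    """
    return np.std(np.asarray(member_predictions, dtype=float), axis=0)


def knn_std(X_labeled, y_labeled, X_query, k=3):
    """Std dev of the targets of the k nearest labeled neighbors (k is assumed)."""
    dist = np.linalg.norm(X_query[:, None, :] - X_labeled[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return np.std(y_labeled[nearest], axis=1)


def select(uncertainty, criterion="threshold", threshold=1.0, k_best=10):
    """Pick samples with low uncertainty under either criterion."""
    if criterion == "threshold":
        return np.flatnonzero(uncertainty <= threshold)
    k = min(k_best, uncertainty.shape[0])
    return np.sort(np.argsort(uncertainty, kind="stable")[:k])


# Three ensemble members predicting three unlabeled samples (illustrative).
member_predictions = np.array([[1.0, 2.0, 5.0],
                               [1.0, 2.2, 3.0],
                               [1.0, 1.8, 7.0]])
uncertainty = ensemble_std(member_predictions)
by_threshold = select(uncertainty, threshold=0.5)
by_k_best = select(uncertainty, criterion="k_best", k_best=1)

X_labeled = np.array([[0.0], [1.0], [2.0], [10.0]])
y_labeled = np.array([1.0, 1.0, 1.0, 5.0])
knn_uncertainty = knn_std(X_labeled, y_labeled, np.array([[1.0], [9.0]]))
```

The third sample, on which the members disagree most, is excluded by the 0.5 threshold; the query point at 9.0 gets a nonzero kNN uncertainty because its neighbors have different targets.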

base_estimator_

The fitted base estimator.

Type: estimator

n_features_in_

Number of features.

Type: int

n_iter_

Number of iterations performed.

Type: int

labeled_iter_

Iteration when each sample was labeled (0=original, -1=unlabeled).

Type: ndarray of shape (n_samples,)

pseudo_labels_

Final labels including pseudo-labels.

Type: ndarray of shape (n_samples,)

Examples

>>> import numpy as np
>>> from sklearn.ensemble import RandomForestRegressor
>>> from endgame.semi_supervised import SelfTrainingRegressor
>>>
>>> # Placeholder features for illustration; one row per entry of y_train
>>> rng = np.random.default_rng(0)
>>> X_train, X_test = rng.normal(size=(6, 4)), rng.normal(size=(3, 4))
>>>
>>> # Prepare data: np.nan indicates unlabeled samples
>>> y_train = np.array([1.0, 2.5, 3.0, np.nan, np.nan, np.nan])
>>>
>>> st = SelfTrainingRegressor(
...     base_estimator=RandomForestRegressor(n_estimators=100),
...     threshold=0.5,  # Max std dev for pseudo-labeling
... )
>>> st.fit(X_train, y_train)
>>> predictions = st.predict(X_test)

fit(X, y, **fit_params)

Fit the self-training regressor.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data (labeled + unlabeled).
y (array-like of shape (n_samples,)) – Target values. Use np.nan to indicate unlabeled samples.
**fit_params (dict) – Additional parameters passed to base_estimator.fit().
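Because NaN cannot be stored in an integer array, targets need a float dtype before unlabeled entries are marked. A minimal sketch, with mark_unlabeled as a hypothetical helper:

```python
import numpy as np


def mark_unlabeled(y, known):
    """Return a float copy of y with np.nan wherever `known` is False."""
    y = np.asarray(y, dtype=float)  # integer targets are cast to float
    return np.where(np.asarray(known, dtype=bool), y, np.nan)


y_full = np.array([1, 2, 3, 4, 5, 6])  # integer targets, illustrative
known = np.array([True, True, True, False, False, False])
y_train = mark_unlabeled(y_full, known)

# The unlabeled mask can be recovered with np.isnan.
unlabeled = np.isnan(y_train)
initial_labeled_iter = np.where(unlabeled, -1, 0)
```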

Return type:

SelfTrainingRegressor

Returns:

self (object) – Fitted estimator.

predict(X)

Predict target values for samples in X.

Parameters: X (array-like of shape (n_samples, n_features)) – Samples to predict.
Return type: ndarray
Returns: y_pred (ndarray of shape (n_samples,)) – Predicted values.

get_pseudo_labeled_samples()

Get indices, labels, and iterations of pseudo-labeled samples.

Return type:

tuple[ndarray, ndarray, ndarray]

Returns:

indices (ndarray) – Indices of pseudo-labeled samples.
labels (ndarray) – Pseudo-labels assigned.
iterations (ndarray) – Iteration when each sample was pseudo-labeled.

set_score_request(*, sample_weight='$UNCHANGED$')

Configure whether metadata should be requested to be passed to the score method.

The options for each parameter are:

True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to score.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.

Returns:

self (object) – The updated object.

Return type:

SelfTrainingRegressor