Semi-Supervised
- class endgame.semi_supervised.SelfTrainingClassifier(base_estimator, criterion='threshold', threshold=0.75, k_best=10, max_iter=10, sample_weight_decay=1.0, progressive_weight=False, min_confidence_increase=0.0, verbose=False, random_state=None)
Bases: BaseEstimator, ClassifierMixin, MetaEstimatorMixin
Self-training classifier for semi-supervised learning.
Wraps any sklearn-compatible classifier to perform iterative pseudo-labeling. The algorithm repeatedly:
  1. Trains on labeled + pseudo-labeled data
  2. Predicts on remaining unlabeled data
  3. Selects high-confidence predictions as new pseudo-labels
  4. Repeats until convergence or max iterations
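The loop described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: `CentroidClassifier` and `self_train` are hypothetical names, the toy classifier stands in for a cloned base_estimator, and only the 'threshold' criterion is shown.

```python
import numpy as np


class CentroidClassifier:
    """Hypothetical toy classifier: softmax over negative distances to class means."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        dist = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        logits = -dist
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(logits)
        return p / p.sum(axis=1, keepdims=True)


def self_train(X, y, threshold=0.75, max_iter=10):
    """Sketch of threshold-based self-training; -1 in y marks unlabeled samples."""
    y = np.asarray(y).copy()
    labeled_iter = np.where(y == -1, -1, 0)
    # Step 1: train on the originally labeled data.
    model = CentroidClassifier().fit(X[y != -1], y[y != -1])
    for it in range(1, max_iter + 1):
        unlabeled = np.flatnonzero(y == -1)
        if unlabeled.size == 0:
            break  # everything is labeled
        # Step 2: predict on the remaining unlabeled data.
        proba = model.predict_proba(X[unlabeled])
        confidence = proba.max(axis=1)
        # Step 3: keep only high-confidence predictions as pseudo-labels.
        picked = confidence >= threshold
        if not picked.any():
            break  # no sample meets the criterion
        idx = unlabeled[picked]
        y[idx] = model.classes_[proba[picked].argmax(axis=1)]
        labeled_iter[idx] = it
        # Step 4: refit on labeled + pseudo-labeled data and repeat.
        model = CentroidClassifier().fit(X[y != -1], y[y != -1])
    return model, y, labeled_iter
```

The returned `labeled_iter` array uses the same encoding documented for labeled_iter_: 0 for original labels, i > 0 for the iteration of pseudo-labeling, and -1 for samples that were never labeled.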
- Parameters:
base_estimator (estimator object) – Any sklearn-compatible classifier with fit, predict, and predict_proba methods. Will be cloned for each iteration.
criterion ({'threshold', 'k_best'}, default='threshold') – Selection strategy for pseudo-labels:
  - 'threshold': Select samples with confidence >= threshold
  - 'k_best': Select top k most confident samples per iteration
threshold (float, default=0.75) – Minimum confidence (max probability) required to add a pseudo-label. Only used when criterion=’threshold’.
k_best (int, default=10) – Number of samples to pseudo-label per iteration. Only used when criterion=’k_best’.
max_iter (int, default=10) – Maximum number of self-training iterations. Set to None for unlimited iterations (until no more samples meet the criterion).
sample_weight_decay (float, default=1.0) – Weight multiplier for pseudo-labeled samples relative to true labels.
  - 1.0: Equal weight to pseudo-labels and true labels
  - < 1.0: Lower weight for pseudo-labels (more conservative)
Values < 1 are recommended when noise in pseudo-labels is a concern.
progressive_weight (bool, default=False) – If True, weight pseudo-labels by their confidence score. Overrides sample_weight_decay for pseudo-labeled samples.
min_confidence_increase (float, default=0.0) – Minimum increase in average confidence required to continue. Helps detect when self-training has converged.
verbose (bool, default=False) – Print progress information during training.
random_state (int, RandomState, or None, default=None) – Random seed for reproducibility (used in k_best tie-breaking).
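One way to read sample_weight_decay and progressive_weight together is as a rule for building a per-sample weight vector. The sketch below is an assumption about that rule, not the library's code: `build_sample_weights` is a hypothetical helper, true labels are assumed to keep weight 1.0, and only samples that are already labeled (original or pseudo) are passed in.

```python
import numpy as np


def build_sample_weights(labeled_iter, confidence,
                         sample_weight_decay=1.0, progressive_weight=False):
    """Hypothetical helper: weights for originally labeled and pseudo-labeled samples.

    labeled_iter: 0 for original labels, i > 0 for pseudo-labels from iteration i.
    confidence:   max predicted probability recorded when each sample was labeled.
    """
    labeled_iter = np.asarray(labeled_iter)
    confidence = np.asarray(confidence, dtype=float)
    weights = np.ones(labeled_iter.shape, dtype=float)  # true labels: weight 1.0
    pseudo = labeled_iter > 0
    if progressive_weight:
        # Confidence-based weights override the fixed decay for pseudo-labels.
        weights[pseudo] = confidence[pseudo]
    else:
        weights[pseudo] = sample_weight_decay
    return weights
```

With `sample_weight_decay=0.5`, every pseudo-label counts half as much as a true label; with `progressive_weight=True`, a pseudo-label accepted at confidence 0.9 counts more than one accepted at 0.8.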
- base_estimator_
The fitted base estimator.
- Type:
estimator
- labeled_iter_
Iteration when each sample was labeled:
  - 0: Originally labeled
  - i > 0: Pseudo-labeled in iteration i
  - -1: Never labeled (still unlabeled)
- Type:
ndarray of shape (n_samples,)
- pseudo_labels_
Final labels for all samples (true labels + pseudo-labels).
- Type:
ndarray of shape (n_samples,)
- transduction_
Same as pseudo_labels_ (sklearn compatibility).
- Type:
ndarray of shape (n_samples,)
- termination_condition_
Reason for stopping: 'max_iter', 'no_change', 'all_labeled', or 'confidence_plateau'.
- Type:
str
- history_
Training history with keys:
  - 'n_pseudo_labeled': List of cumulative pseudo-labeled counts
  - 'mean_confidence': List of mean confidence per iteration
  - 'selected_per_iter': List of samples selected per iteration
- Type:
dict
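The 'mean_confidence' entry of history_ is the natural input for a plateau check of the kind min_confidence_increase describes. The exact stopping rule is not documented here, so the following is only a sketch under an assumed rule (stop when the latest gain in mean confidence falls below the minimum); `confidence_plateaued` is a hypothetical name.

```python
def confidence_plateaued(mean_confidence, min_confidence_increase=0.0):
    """Hypothetical check: True if the last gain in mean confidence is too small.

    mean_confidence: list of per-iteration mean confidences, e.g. the
    'mean_confidence' list from a history dict.
    """
    if len(mean_confidence) < 2:
        return False  # need two iterations to measure an increase
    gain = mean_confidence[-1] - mean_confidence[-2]
    return gain < min_confidence_increase
```

Under this assumed rule, the default of 0.0 only triggers when mean confidence actually drops, while a small positive value such as 0.01 also stops runs that have stalled.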
Examples
>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> from endgame.semi_supervised import SelfTrainingClassifier
>>>
>>> # Prepare data: -1 indicates unlabeled samples
>>> # (X_train and X_test are feature arrays defined elsewhere)
>>> y_train = np.array([0, 1, 0, -1, -1, -1, 1, -1])
>>>
>>> # Create self-training classifier
>>> st = SelfTrainingClassifier(
...     base_estimator=RandomForestClassifier(n_estimators=100),
...     threshold=0.8,
...     max_iter=10,
... )
>>> st.fit(X_train, y_train)
>>>
>>> # Predict on new data
>>> predictions = st.predict(X_test)
>>> probabilities = st.predict_proba(X_test)
>>>
>>> # Check which samples were pseudo-labeled
>>> print(f"Pseudo-labeled in iter 1: {np.sum(st.labeled_iter_ == 1)}")
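The example ends by counting samples pseudo-labeled in iteration 1. The same labeled_iter_ encoding (0, i > 0, -1) supports a fuller tally. The array below is made-up data standing in for a fitted estimator's labeled_iter_, so the numbers are illustrative only.

```python
import numpy as np

# Hypothetical stand-in for st.labeled_iter_ after fitting.
labeled_iter = np.array([0, 0, 1, 1, 2, -1, 0, -1])

n_original = int(np.sum(labeled_iter == 0))    # originally labeled
n_pseudo = int(np.sum(labeled_iter > 0))       # pseudo-labeled at some iteration
n_unlabeled = int(np.sum(labeled_iter == -1))  # never labeled

# Number of pseudo-labels added in each iteration.
iters, counts = np.unique(labeled_iter[labeled_iter > 0], return_counts=True)
per_iter = dict(zip(iters.tolist(), counts.tolist()))

print(n_original, n_pseudo, n_unlabeled, per_iter)
```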
Notes
Choosing threshold vs k_best:
threshold is preferred when you have a good sense of model calibration. It naturally adapts the number of samples based on confidence.
k_best is preferred for controlled expansion. It guarantees progress each iteration but may add low-confidence samples if k is too large.
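The difference between the two criteria is easiest to see on a small probability matrix. The sketch below is an illustration, not the library's selection code: `select_pseudo_labels` is a hypothetical helper, and ties under 'k_best' are broken by position here rather than by random_state.

```python
import numpy as np


def select_pseudo_labels(proba, criterion="threshold", threshold=0.75, k_best=10):
    """Hypothetical helper: indices of unlabeled samples to pseudo-label."""
    proba = np.asarray(proba, dtype=float)
    confidence = proba.max(axis=1)  # confidence = max class probability
    if criterion == "threshold":
        # The number of selected samples adapts to how confident the model is.
        return np.flatnonzero(confidence >= threshold)
    if criterion == "k_best":
        # Always selects up to k samples, however low their confidence.
        k = min(k_best, confidence.size)
        order = np.argsort(-confidence, kind="stable")
        return np.sort(order[:k])
    raise ValueError(f"unknown criterion: {criterion!r}")


proba = np.array([[0.90, 0.10], [0.60, 0.40], [0.20, 0.80], [0.55, 0.45]])
by_threshold = select_pseudo_labels(proba, "threshold", threshold=0.75)
by_k_best = select_pseudo_labels(proba, "k_best", k_best=3)
```

Here `by_threshold` keeps only the two samples at 0.90 and 0.80, whereas `by_k_best` with k=3 also admits the sample at 0.60, which is the low-confidence risk noted above when k is too large.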
Avoiding confirmation bias:
Self-training can reinforce the model's mistakes (confirmation bias). To mitigate this:
  - Use a high threshold (0.9+)
  - Use sample_weight_decay < 1.0 to trust pseudo-labels less
  - Set min_confidence_increase > 0 to detect plateaus
  - Consider using progressive_weight=True
Memory efficiency:
The wrapper stores labeled_iter_ for all samples. For very large unlabeled sets, consider batching the unlabeled data.
- fit(X, y, **fit_params)
Fit the self-training classifier.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data (labeled + unlabeled).
y (array-like of shape (n_samples,)) – Target values. Use -1 to indicate unlabeled samples.
**fit_params (dict) – Additional parameters passed to base_estimator.fit(). Note: sample_weight is handled internally.
- Returns:
self (object) – Fitted estimator.
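A common way to build a y of the expected form is to start from fully labeled targets and hide most of them behind -1. The helper below is a sketch for experimentation; `hide_labels` is a hypothetical name, and it assumes integer class labels that never use -1 themselves.

```python
import numpy as np


def hide_labels(y_true, n_labeled, seed=0):
    """Hypothetical helper: keep n_labeled targets, mark the rest as -1 (unlabeled)."""
    y_true = np.asarray(y_true)
    if (y_true == -1).any():
        raise ValueError("-1 is reserved for unlabeled samples")
    rng = np.random.default_rng(seed)
    keep = rng.choice(y_true.size, size=n_labeled, replace=False)
    y_train = np.full(y_true.shape, -1, dtype=y_true.dtype)
    y_train[keep] = y_true[keep]
    return y_train


y_true = np.array([0, 1, 0, 1, 1, 0, 1, 0])
y_train = hide_labels(y_true, n_labeled=3)
```

The resulting `y_train` has the same length as X, so labeled and unlabeled rows stay aligned with their features.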
- predict(X)
Predict class labels for samples in X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Samples to predict.
- Returns:
y_pred (ndarray of shape (n_samples,)) – Predicted class labels.
- predict_proba(X)
Predict class probabilities for samples in X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Samples to predict.
- Returns:
proba (ndarray of shape (n_samples, n_classes)) – Class probabilities.
- predict_log_proba(X)
Predict class log-probabilities for samples in X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Samples to predict.
- Returns:
log_proba (ndarray of shape (n_samples, n_classes)) – Class log-probabilities.
- decision_function(X)
Compute decision function for samples in X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Samples.
- Returns:
decision (ndarray) – Decision function values.
- get_pseudo_labeled_samples()
Get indices, labels, and iterations of pseudo-labeled samples.
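The return format of this method is not documented here. The same three pieces of information can, however, be derived from the documented labeled_iter_ and pseudo_labels_ attributes, as the sketch below shows; `pseudo_labeled_view` is a hypothetical helper and is not claimed to match the method's actual output.

```python
import numpy as np


def pseudo_labeled_view(labeled_iter, pseudo_labels):
    """Hypothetical helper: (indices, labels, iterations) of pseudo-labeled samples."""
    labeled_iter = np.asarray(labeled_iter)
    pseudo_labels = np.asarray(pseudo_labels)
    # Pseudo-labeled samples are exactly those with a positive iteration number.
    indices = np.flatnonzero(labeled_iter > 0)
    return indices, pseudo_labels[indices], labeled_iter[indices]


# Made-up arrays standing in for a fitted estimator's attributes.
idx, labels, iters = pseudo_labeled_view(
    labeled_iter=[0, 1, -1, 2, 0],
    pseudo_labels=[0, 1, -1, 0, 1],
)
```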
- set_score_request(*, sample_weight='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
  - True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
  - False: metadata is not requested and the meta-estimator will not pass it to score.
  - None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
  - str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
self (SelfTrainingClassifier)
- Returns:
self (object) – The updated object.
- class endgame.semi_supervised.SelfTrainingRegressor(base_estimator, criterion='threshold', threshold=1.0, k_best=10, uncertainty_method='ensemble', max_iter=10, sample_weight_decay=1.0, verbose=False, random_state=None)
Bases: BaseEstimator, RegressorMixin, MetaEstimatorMixin
Self-training regressor for semi-supervised learning.
Extends self-training to regression by using prediction uncertainty instead of class probabilities for sample selection.
The uncertainty can be estimated via:
  - Ensemble variance (if base_estimator is an ensemble)
  - Quantile predictions (if supported)
  - Residual-based heuristics
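As an illustration of the ensemble-variance option, the sketch below computes a per-sample standard deviation across the predictions of individual ensemble members. It is a minimal sketch, not the library's code: `ensemble_uncertainty` is a hypothetical helper, and the member predictions are supplied directly as an array.

```python
import numpy as np


def ensemble_uncertainty(member_predictions):
    """Hypothetical helper: mean prediction and std dev across ensemble members.

    member_predictions: array of shape (n_members, n_samples). For a fitted
    scikit-learn RandomForestRegressor `forest`, such an array could be built
    with np.stack([tree.predict(X) for tree in forest.estimators_]).
    """
    member_predictions = np.asarray(member_predictions, dtype=float)
    mean = member_predictions.mean(axis=0)
    std = member_predictions.std(axis=0)  # disagreement between members
    return mean, std


# Two members, two samples: they disagree on the first sample only.
mean, std = ensemble_uncertainty([[1.0, 2.0], [3.0, 2.0]])
```

Samples where the members agree get a standard deviation near zero and are the first candidates for pseudo-labeling.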
- Parameters:
base_estimator (estimator object) – Any sklearn-compatible regressor with fit and predict methods. For best results, use an estimator that can provide uncertainty estimates (e.g., RandomForestRegressor, GradientBoostingRegressor, QuantileRegressorForest).
criterion ({'threshold', 'k_best'}, default='threshold') – Selection strategy:
  - 'threshold': Select samples with uncertainty <= threshold
  - 'k_best': Select k samples with lowest uncertainty
threshold (float, default=1.0) – Maximum uncertainty (std dev) allowed for pseudo-labeling. Only used when criterion=’threshold’.
k_best (int, default=10) – Number of samples to pseudo-label per iteration. Only used when criterion=’k_best’.
uncertainty_method ({'ensemble', 'knn', 'residual'}, default='ensemble') – Method for estimating prediction uncertainty:
  - 'ensemble': Use variance across ensemble members (requires an ensemble with an estimators_ attribute, e.g., RandomForest)
  - 'knn': Use variance among k nearest labeled neighbors
  - 'residual': Use cross-validated residual magnitude
max_iter (int, default=10) – Maximum number of self-training iterations.
sample_weight_decay (float, default=1.0) – Weight multiplier for pseudo-labeled samples.
verbose (bool, default=False) – Print progress information.
random_state (int, RandomState, or None, default=None) – Random seed.
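For regression the selection criteria run in the opposite direction from classification: low uncertainty is good. The sketch below illustrates both criteria on a vector of per-sample uncertainties; `select_by_uncertainty` is a hypothetical helper, not the library's code, and ties are broken by position.

```python
import numpy as np


def select_by_uncertainty(uncertainty, criterion="threshold", threshold=1.0, k_best=10):
    """Hypothetical helper: indices of unlabeled samples to pseudo-label."""
    u = np.asarray(uncertainty, dtype=float)
    if criterion == "threshold":
        # Keep samples whose uncertainty (std dev) does not exceed the threshold.
        return np.flatnonzero(u <= threshold)
    if criterion == "k_best":
        # Keep the k samples with the lowest uncertainty.
        k = min(k_best, u.size)
        order = np.argsort(u, kind="stable")
        return np.sort(order[:k])
    raise ValueError(f"unknown criterion: {criterion!r}")


uncertainty = np.array([0.2, 1.5, 0.9, 3.0])
by_threshold = select_by_uncertainty(uncertainty, "threshold", threshold=1.0)
by_k_best = select_by_uncertainty(uncertainty, "k_best", k_best=3)
```

Because the threshold is in the units of the target, a sensible value depends on the scale of y; the default of 1.0 is only meaningful relative to that scale.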
- base_estimator_
The fitted base estimator.
- Type:
estimator
- labeled_iter_
Iteration when each sample was labeled (0=original, -1=unlabeled).
- Type:
ndarray of shape (n_samples,)
Examples
>>> import numpy as np
>>> from sklearn.ensemble import RandomForestRegressor
>>> from endgame.semi_supervised import SelfTrainingRegressor
>>>
>>> # Prepare data: np.nan indicates unlabeled samples
>>> # (X_train and X_test are feature arrays defined elsewhere)
>>> y_train = np.array([1.0, 2.5, 3.0, np.nan, np.nan, np.nan])
>>>
>>> st = SelfTrainingRegressor(
...     base_estimator=RandomForestRegressor(n_estimators=100),
...     threshold=0.5,  # Max std dev for pseudo-labeling
... )
>>> st.fit(X_train, y_train)
>>> predictions = st.predict(X_test)
- fit(X, y, **fit_params)
Fit the self-training regressor.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data (labeled + unlabeled).
y (array-like of shape (n_samples,)) – Target values. Use np.nan to indicate unlabeled samples.
**fit_params (dict) – Additional parameters passed to base_estimator.fit().
- Returns:
self (object) – Fitted estimator.
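Because unlabeled targets are marked with np.nan, they have to be located with np.isnan rather than with an equality test, since NaN never compares equal to anything. A short sketch of how the labeled and unlabeled parts of y can be separated before or after fitting:

```python
import numpy as np

y_train = np.array([1.0, 2.5, 3.0, np.nan, np.nan, np.nan])

unlabeled = np.isnan(y_train)           # boolean mask of unlabeled samples
labeled_idx = np.flatnonzero(~unlabeled)
unlabeled_idx = np.flatnonzero(unlabeled)

# An equality test silently finds nothing, because NaN != NaN.
wrong_mask = y_train == np.nan

print(labeled_idx, unlabeled_idx, wrong_mask.any())
```

Note that y must have a floating-point dtype for this convention to work, since integer arrays cannot hold NaN.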
- predict(X)
Predict target values for samples in X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Samples to predict.
- Returns:
y_pred (ndarray of shape (n_samples,)) – Predicted values.
- get_pseudo_labeled_samples()
Get indices, labels, and iterations of pseudo-labeled samples.
- set_score_request(*, sample_weight='$UNCHANGED$')
Configure whether metadata should be requested to be passed to the score method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
  - True: metadata is requested, and passed to score if provided. The request is ignored if metadata is not provided.
  - False: metadata is not requested and the meta-estimator will not pass it to score.
  - None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
  - str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
sample_weight (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for sample_weight parameter in score.
self (SelfTrainingRegressor)
- Returns:
self (object) – The updated object.