Dimensionality Reduction¶
- class endgame.dimensionality_reduction.PCAReducer(n_components=None, whiten=False, svd_solver='auto', random_state=None)[source]¶
Bases:
TransformerMixin, BaseEstimator
Principal Component Analysis for dimensionality reduction.
A thin wrapper around sklearn’s PCA with additional utilities for variance analysis and automatic component selection.
- Parameters:
n_components (int, float, or 'mle', default=None) – Number of components to keep.
  - If int, selects that many components.
  - If float (0-1), selects enough components to explain that fraction of the variance.
  - If ‘mle’, uses Minka’s MLE to guess the dimension.
  - If None, keeps all components.
whiten (bool, default=False) – Whether to whiten the data (unit variance in each component).
svd_solver ({'auto', 'full', 'arpack', 'randomized'}, default='auto') – SVD solver to use.
random_state (int, optional) – Random seed for reproducibility.
- components_¶
Principal axes in feature space.
- Type:
ndarray of shape (n_components, n_features)
- explained_variance_ratio_¶
Fraction of the total variance explained by each component (values between 0 and 1).
- Type:
ndarray
Example
>>> from endgame.dimensionality_reduction import PCAReducer
>>> pca = PCAReducer(n_components=0.95)  # Keep 95% variance
>>> X_reduced = pca.fit_transform(X)
>>> print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} dimensions")
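The float form of n_components can be illustrated with plain NumPy. The sketch below is not PCAReducer's implementation; it only shows the selection rule under the assumption that components are kept until their cumulative explained-variance ratio reaches the target. The helper name n_components_for_variance and the demo data are hypothetical.

```python
import numpy as np

def n_components_for_variance(X, target=0.95):
    """Return the smallest k whose cumulative explained-variance ratio reaches `target`."""
    X_centered = X - X.mean(axis=0)
    # Singular values of the centered data determine the variance along each principal axis.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    explained_variance = singular_values**2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    cumulative = np.cumsum(ratio)
    # First position where the cumulative ratio reaches the target, converted to a count.
    k = int(np.searchsorted(cumulative, target)) + 1
    return min(k, ratio.size), ratio

rng = np.random.default_rng(0)
# Hypothetical data: three strong directions embedded in ten features, plus small noise.
X_demo = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 10))
X_demo += 0.01 * rng.normal(size=X_demo.shape)

k, ratio = n_components_for_variance(X_demo, target=0.95)
print(f"{k} components explain {ratio[:k].sum():.3f} of the variance")
```

Because the demo data has only three strong directions, at most three components are needed here regardless of the ten input features.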
- fit(X, y=None)[source]¶
Fit the PCA model.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – Not used, present for API consistency.
- Returns:
self (PCAReducer)
- transform(X)[source]¶
Apply dimensionality reduction to X.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_new (ndarray of shape (n_samples, n_components)) – Transformed data.
- class endgame.dimensionality_reduction.RandomizedPCA(n_components=50, n_oversamples=10, n_iter='auto', whiten=False, random_state=None)[source]¶
Bases:
TransformerMixin, BaseEstimator
Randomized PCA using randomized SVD.
Faster than standard PCA for large datasets with many features. Uses the randomized SVD algorithm which is more efficient when n_components << min(n_samples, n_features).
- Parameters:
n_components (int, default=50) – Number of components to keep.
n_oversamples (int, default=10) – Additional samples for the randomized SVD solver.
n_iter (int or 'auto', default='auto') – Number of power iterations for the randomized SVD solver.
whiten (bool, default=False) – Whether to whiten the data.
random_state (int, optional) – Random seed.
Example
>>> from endgame.dimensionality_reduction import RandomizedPCA
>>> rpca = RandomizedPCA(n_components=100)
>>> X_reduced = rpca.fit_transform(X_large)  # Fast for large X
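The roles of n_oversamples and n_iter are easier to see in a small NumPy version of randomized SVD. This is a sketch of the general technique (random range finder, power iterations, small exact SVD), not the code RandomizedPCA runs; the function name randomized_svd and the fixed n_iter=4 are assumptions made for illustration. For PCA, the input would be centered first.

```python
import numpy as np

def randomized_svd(A, n_components, n_oversamples=10, n_iter=4, seed=0):
    """Approximate the top singular triplets of A with a randomized range finder."""
    rng = np.random.default_rng(seed)
    k = n_components + n_oversamples
    # Project A onto k random directions to sample its column space.
    Q, _ = np.linalg.qr(A @ rng.normal(size=(A.shape[1], k)))
    # Power iterations sharpen the subspace estimate; QR keeps it numerically stable.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    # Exact SVD of the small (k x n_features) projected matrix.
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    U = Q @ U_small
    return U[:, :n_components], s[:n_components], Vt[:n_components]

rng = np.random.default_rng(1)
# Hypothetical test matrix: rank 5 plus small noise.
A_demo = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 80))
A_demo += 0.01 * rng.normal(size=A_demo.shape)

U_r, s_r, Vt_r = randomized_svd(A_demo, n_components=5)
s_exact = np.linalg.svd(A_demo, compute_uv=False)[:5]
print(np.max(np.abs(s_r - s_exact) / s_exact))  # relative error of the singular values
```

The cost is dominated by products with a matrix of only n_components + n_oversamples columns, which is why the approach pays off when n_components << min(n_samples, n_features).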
- class endgame.dimensionality_reduction.TruncatedSVDReducer(n_components=50, algorithm='randomized', n_iter=5, random_state=None)[source]¶
Bases:
TransformerMixin, BaseEstimator
Truncated SVD (LSA) for dimensionality reduction.
Unlike PCA, this works directly with sparse matrices without centering, making it suitable for text data (TF-IDF).
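- Parameters:
n_components (int, default=50) – Number of components to keep.
algorithm (str, default='randomized') – SVD solver to use.
n_iter (int, default=5) – Number of iterations for the randomized SVD solver.
random_state (int, optional) – Random seed.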
Example
>>> from endgame.dimensionality_reduction import TruncatedSVDReducer
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf = TfidfVectorizer()
>>> X_sparse = tfidf.fit_transform(texts)
>>> svd = TruncatedSVDReducer(n_components=100)
>>> X_dense = svd.fit_transform(X_sparse)  # Works with sparse input
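Why skipping the centering step matters for sparse input can be shown with a dense NumPy stand-in. The matrix below is hypothetical (about 5% non-zero entries), and the exact SVD is used in place of whatever solver TruncatedSVDReducer uses internally.

```python
import numpy as np

def density(M):
    """Fraction of non-zero entries."""
    return np.count_nonzero(M) / M.size

rng = np.random.default_rng(0)
# Hypothetical "TF-IDF-like" matrix: mostly zeros.
X_sp = rng.random((100, 50)) * (rng.random((100, 50)) < 0.05)

# PCA-style centering subtracts each column mean, so nearly every zero becomes non-zero.
X_centered = X_sp - X_sp.mean(axis=0)
print(f"density before centering: {density(X_sp):.2f}")
print(f"density after centering:  {density(X_centered):.2f}")

# Truncated SVD factorizes the uncentered matrix directly and keeps the top k components.
U, s, Vt = np.linalg.svd(X_sp, full_matrices=False)
k = 10
X_reduced_sp = U[:, :k] * s[:k]  # equivalent to projecting X_sp onto the top k right singular vectors
```

With a real sparse matrix, centering would force a dense copy in memory; working on the uncentered data avoids that.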
- class endgame.dimensionality_reduction.KernelPCAReducer(n_components=50, kernel='rbf', gamma=None, degree=3, coef0=1.0, fit_inverse_transform=False, random_state=None)[source]¶
Bases:
TransformerMixin, BaseEstimator
Kernel PCA for nonlinear dimensionality reduction.
Applies PCA in a kernel-induced feature space, allowing for nonlinear projections while remaining computationally tractable.
- Parameters:
n_components (int, default=50) – Number of components.
kernel (str, default='rbf') – Kernel type: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘cosine’.
gamma (float, optional) – Kernel coefficient for ‘rbf’, ‘poly’, ‘sigmoid’. If None, defaults to 1/n_features.
degree (int, default=3) – Degree for polynomial kernel.
coef0 (float, default=1.0) – Independent term in ‘poly’ and ‘sigmoid’.
fit_inverse_transform (bool, default=False) – Whether to learn the inverse transform (expensive).
random_state (int, optional) – Random seed.
Example
>>> from endgame.dimensionality_reduction import KernelPCAReducer
>>> kpca = KernelPCAReducer(n_components=2, kernel='rbf', gamma=0.1)
>>> X_nonlinear = kpca.fit_transform(X)
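The phrase "PCA in a kernel-induced feature space" amounts to an eigendecomposition of a centered kernel matrix. The following NumPy sketch shows this for the RBF kernel with the gamma default of 1/n_features described above. It is an illustration of the textbook procedure, not KernelPCAReducer's code, and rbf_kernel_pca is a hypothetical helper name.

```python
import numpy as np

def rbf_kernel_pca(X, n_components=2, gamma=None):
    """Embed the training points with RBF kernel PCA."""
    n_samples, n_features = X.shape
    if gamma is None:
        gamma = 1.0 / n_features
    # Pairwise squared Euclidean distances, clipped to guard against tiny negatives.
    sq_norms = (X**2).sum(axis=1)
    sq_dists = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    K = np.exp(-gamma * sq_dists)
    # Center the kernel matrix, which centers the data in the implicit feature space.
    one_n = np.full((n_samples, n_samples), 1.0 / n_samples)
    K_centered = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # eigh returns eigenvalues in ascending order, so reverse to get the largest first.
    eigvals, eigvecs = np.linalg.eigh(K_centered)
    eigvals = eigvals[::-1][:n_components]
    eigvecs = eigvecs[:, ::-1][:, :n_components]
    # Training-point projections are eigenvectors scaled by the square root of the eigenvalues.
    return eigvecs * np.sqrt(np.maximum(eigvals, 0.0)), eigvals

rng = np.random.default_rng(0)
X_k = rng.normal(size=(60, 3))  # hypothetical data
Z_k, eigvals_k = rbf_kernel_pca(X_k, n_components=2)
print(Z_k.shape, eigvals_k)
```

The kernel matrix is n_samples by n_samples, so memory and time grow quadratically or worse with the number of training points.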
- class endgame.dimensionality_reduction.ICAReducer(n_components=None, algorithm='parallel', whiten='unit-variance', fun='logcosh', max_iter=200, tol=0.0001, random_state=None)[source]¶
Bases:
TransformerMixin, BaseEstimator
Independent Component Analysis for dimensionality reduction.
ICA separates a multivariate signal into additive, independent components. Useful when the underlying sources are non-Gaussian.
- Parameters:
n_components (int, optional) – Number of components. If None, uses all features.
algorithm ({'parallel', 'deflation'}, default='parallel') – ICA algorithm to use.
whiten (str, default='unit-variance') – Whitening strategy. Use ‘unit-variance’ for sklearn >= 1.1.
fun ({'logcosh', 'exp', 'cube'}, default='logcosh') – Functional form of the G function for approximating negentropy.
max_iter (int, default=200) – Maximum number of iterations.
tol (float, default=1e-4) – Tolerance for convergence.
random_state (int, optional) – Random seed.
Example
>>> from endgame.dimensionality_reduction import ICAReducer
>>> ica = ICAReducer(n_components=10)
>>> X_independent = ica.fit_transform(X)
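ICA algorithms usually whiten the data before searching for independent directions. The sketch below shows only that preprocessing step (PCA whitening to unit variance) in NumPy. It is not the ICAReducer implementation, and the helper name whiten_unit_variance and the demo data are assumptions.

```python
import numpy as np

def whiten_unit_variance(X, n_components):
    """Project X onto its top principal axes and rescale each axis to unit variance."""
    X_centered = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Columns of U are orthonormal, so scaling by sqrt(n_samples - 1) gives unit sample variance.
    return U[:, :n_components] * np.sqrt(X.shape[0] - 1)

rng = np.random.default_rng(0)
# Hypothetical correlated data: independent sources passed through a mixing matrix.
X_mixed = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
Z_white = whiten_unit_variance(X_mixed, n_components=3)
print(np.round(np.cov(Z_white, rowvar=False), 3))
```

After whitening, the covariance is the identity, so the remaining ICA step only has to find a rotation that makes the components as non-Gaussian as possible.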
- class endgame.dimensionality_reduction.UMAPReducer(n_components=2, n_neighbors=15, min_dist=0.1, metric='euclidean', spread=1.0, learning_rate=1.0, n_epochs=None, init='spectral', random_state=None, verbose=False)[source]¶
Bases:
TransformerMixin, BaseEstimator
Uniform Manifold Approximation and Projection (UMAP).
UMAP is a manifold learning technique that preserves both local and global structure better than t-SNE while being significantly faster.
- Parameters:
n_components (int, default=2) – Dimension of the embedded space.
n_neighbors (int, default=15) – Number of neighbors for constructing the local manifold. Larger values capture more global structure.
min_dist (float, default=0.1) – Minimum distance between embedded points. Smaller values create tighter clusters.
metric (str, default='euclidean') – Distance metric: ‘euclidean’, ‘manhattan’, ‘cosine’, ‘correlation’, etc.
spread (float, default=1.0) – Effective scale of embedded points.
learning_rate (float, default=1.0) – Learning rate for the embedding optimization.
n_epochs (int, optional) – Number of training epochs. If None, auto-determined.
init (str, default='spectral') – Initialization: ‘spectral’, ‘random’, or array.
random_state (int, optional) – Random seed.
verbose (bool, default=False) – Whether to print progress.
Example
>>> from endgame.dimensionality_reduction import UMAPReducer
>>> umap = UMAPReducer(n_components=2, n_neighbors=30)
>>> X_2d = umap.fit_transform(X)
>>> # For new data (uses approximate transform)
>>> X_new_2d = umap.transform(X_new)
- fit(X, y=None)[source]¶
Fit the UMAP model.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like, optional) – Target labels for semi-supervised mode.
- Returns:
self (UMAPReducer)
- transform(X)[source]¶
Transform new data to the embedding space.
Uses the learned transform to embed new points. Note that this is an approximation based on the nearest neighbors in the training set.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_new (ndarray of shape (n_samples, n_components)) – Transformed data.
- fit_transform(X, y=None)[source]¶
Fit and transform in one step.
This is more efficient than calling fit then transform.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like, optional) – Target labels for semi-supervised mode.
- Returns:
X_new (ndarray of shape (n_samples, n_components)) – Embedding of training data.
- class endgame.dimensionality_reduction.ParametricUMAP(n_components=2, n_neighbors=15, min_dist=0.1, metric='euclidean', encoder_layers=None, decoder_layers=None, n_training_epochs=100, batch_size=256, random_state=None, verbose=False)[source]¶
Bases:
TransformerMixin, BaseEstimator
Parametric UMAP using neural networks.
Unlike standard UMAP, Parametric UMAP learns an explicit mapping function using a neural network, enabling faster transforms on new data and the ability to train on mini-batches.
Requires TensorFlow/Keras to be installed.
- Parameters:
n_components (int, default=2) – Dimension of the embedded space.
n_neighbors (int, default=15) – Number of neighbors for constructing the local manifold.
min_dist (float, default=0.1) – Minimum distance between embedded points.
metric (str, default='euclidean') – Distance metric.
encoder_layers (list of int, optional) – Sizes of encoder hidden layers. Default [256, 256].
decoder_layers (list of int, optional) – Sizes of decoder hidden layers for reconstruction. Default [256, 256].
n_training_epochs (int, default=100) – Number of training epochs for the neural network.
batch_size (int, default=256) – Mini-batch size for training.
random_state (int, optional) – Random seed.
verbose (bool, default=False) – Whether to print progress.
Example
>>> from endgame.dimensionality_reduction import ParametricUMAP
>>> pumap = ParametricUMAP(n_components=2)
>>> pumap.fit(X_train)
>>> # Fast transform on new data
>>> X_new_2d = pumap.transform(X_test)
- fit(X, y=None)[source]¶
Fit the Parametric UMAP model.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored)
- Returns:
self (ParametricUMAP)
- transform(X)[source]¶
Transform new data using the learned encoder.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_new (ndarray of shape (n_samples, n_components)) – Transformed data.
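Transforming new data with a learned encoder is a single forward pass, which is why it is fast compared with a neighbor-based approximation. The NumPy sketch below uses an untrained multilayer perceptron with the [256, 256] hidden sizes mentioned above, purely to show the shape of that computation. The ReLU activation, the weight initialization, and the helper names init_encoder and encode are assumptions; the real model is trained with TensorFlow/Keras.

```python
import numpy as np

def init_encoder(n_features, hidden_layers, n_components, seed=0):
    """Create random (untrained) weights for an MLP encoder."""
    rng = np.random.default_rng(seed)
    sizes = [n_features, *hidden_layers, n_components]
    return [
        (rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, fan_out)), np.zeros(fan_out))
        for fan_in, fan_out in zip(sizes[:-1], sizes[1:])
    ]

def encode(X, params):
    """Forward pass: ReLU hidden layers followed by a linear output layer."""
    h = X
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    return h @ W + b

rng = np.random.default_rng(1)
X_batch = rng.normal(size=(5, 20))  # hypothetical new samples with 20 features
encoder_params = init_encoder(n_features=20, hidden_layers=[256, 256], n_components=2)
Z_batch = encode(X_batch, encoder_params)
print(Z_batch.shape)
```

Each row is embedded independently of the others, so new points need no access to the training set.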
- class endgame.dimensionality_reduction.TriMAPReducer(n_components=2, n_inliers=12, n_outliers=4, n_random=3, weight_adj=None, n_iters=400, apply_pca=True, verbose=False)[source]¶
Bases:
TransformerMixin, BaseEstimator
TriMAP: Dimensionality Reduction Using Triplet Constraints.
TriMAP uses triplet constraints to capture both local and global structure better than t-SNE or UMAP, particularly for hierarchical data structures.
- Parameters:
n_components (int, default=2) – Dimension of the embedded space.
n_inliers (int, default=12) – Number of nearest neighbor inliers per point.
n_outliers (int, default=4) – Number of random outliers per point.
n_random (int, default=3) – Number of random triplets per point.
weight_adj (float, optional) – Weight adjustment factor for triplet loss.
n_iters (int, default=400) – Number of optimization iterations.
apply_pca (bool, default=True) – Whether to apply PCA for initialization.
verbose (bool, default=False) – Whether to print progress.
Example
>>> from endgame.dimensionality_reduction import TriMAPReducer
>>> trimap = TriMAPReducer(n_components=2, n_inliers=15)
>>> X_2d = trimap.fit_transform(X)
- fit(X, y=None)[source]¶
Fit the TriMAP model.
Note: TriMAP is a transductive method, so fit stores the embedding but doesn’t create a general transform function.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored)
- Returns:
self (TriMAPReducer)
- transform(X)[source]¶
Transform new data.
Note: TriMAP doesn’t have native out-of-sample support. This uses a nearest-neighbor approximation.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_new (ndarray of shape (n_samples, n_components)) – Approximate embedding.
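One common way to build a nearest-neighbor approximation for a transductive embedding is to place each new point at a distance-weighted average of the embeddings of its closest training points. The sketch below shows that idea in NumPy. Whether TriMAPReducer uses this exact weighting or this number of neighbors is not stated here, so knn_embed, k=5, and the inverse-distance weights are all assumptions.

```python
import numpy as np

def knn_embed(X_train, Y_train, X_new, k=5):
    """Embed X_new as an inverse-distance weighted mean of the k nearest training embeddings."""
    # Squared distances from every new point to every training point.
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    idx = np.argsort(d2, axis=1)[:, :k]
    d = np.sqrt(np.take_along_axis(d2, idx, axis=1))
    # Small constant avoids division by zero when a new point coincides with a training point.
    w = 1.0 / (d + 1e-12)
    w /= w.sum(axis=1, keepdims=True)
    return (w[:, :, None] * Y_train[idx]).sum(axis=1)

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(100, 10))  # hypothetical training data
Y_tr = rng.normal(size=(100, 2))   # stand-in for a stored 2-D embedding
Y_new = knn_embed(X_tr, Y_tr, X_tr[:3])
print(np.abs(Y_new - Y_tr[:3]).max())
```

A point that coincides with a training sample lands on that sample's stored embedding, while genuinely new points are interpolated, so the result can never leave the convex hull of the training embedding.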
- class endgame.dimensionality_reduction.PHATEReducer(n_components=2, knn=5, decay=40, t='auto', gamma=1.0, n_pca=100, knn_dist='euclidean', mds_solver='sgd', random_state=None, verbose=0)[source]¶
Bases:
TransformerMixin, BaseEstimator
PHATE: Potential of Heat-diffusion for Affinity-based Transition Embedding.
PHATE is designed for visualizing trajectories and progressions in high-dimensional biological data. It preserves both local and global structures through diffusion-based distances.
- Parameters:
n_components (int, default=2) – Dimension of the embedded space.
knn (int, default=5) – Number of nearest neighbors for graph construction.
decay (int, default=40) – Decay rate of the kernel tails.
t (int or 'auto', default='auto') – Power of the diffusion operator.
gamma (float, default=1.0) – Informational distance constant between -1 and 1.
n_pca (int, default=100) – Number of principal components for initial reduction.
knn_dist (str, default='euclidean') – Distance metric for KNN graph.
mds_solver (str, default='sgd') – MDS solver: ‘sgd’ or ‘smacof’.
random_state (int, optional) – Random seed.
verbose (int, default=0) – Verbosity level.
Example
>>> from endgame.dimensionality_reduction import PHATEReducer
>>> phate = PHATEReducer(n_components=2, knn=10)
>>> X_2d = phate.fit_transform(X)
- fit(X, y=None)[source]¶
Fit the PHATE model.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored)
- Returns:
self (PHATEReducer)
- transform(X)[source]¶
Transform new data.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_new (ndarray of shape (n_samples, n_components)) – Transformed data.
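The parameters knn, decay, and t all shape the diffusion operator at the core of PHATE. The NumPy sketch below builds a simplified version: an adaptive-bandwidth kernel with tails controlled by decay, row-normalized into a Markov matrix and raised to the power t. It omits the later potential-distance and MDS stages and uses a fixed t rather than 'auto'. The helper name diffusion_operator is hypothetical and the code is not taken from PHATEReducer.

```python
import numpy as np

def diffusion_operator(X, knn=5, decay=40, t=3):
    """Return the t-step diffusion matrix built from an adaptive-bandwidth kernel."""
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    # Bandwidth per point: distance to its knn-th nearest neighbor (column 0 is the point itself).
    eps = np.sort(d, axis=1)[:, knn]
    # Larger `decay` makes the kernel fall off more sharply beyond the bandwidth.
    K = np.exp(-((d / eps[:, None]) ** decay))
    K = 0.5 * (K + K.T)                   # symmetrize the affinities
    P = K / K.sum(axis=1, keepdims=True)  # row-normalize into transition probabilities
    return np.linalg.matrix_power(P, t)   # t steps of the random walk

rng = np.random.default_rng(0)
X_cells = rng.normal(size=(50, 3))  # hypothetical data
P_t = diffusion_operator(X_cells, knn=5, decay=40, t=3)
print(P_t.sum(axis=1)[:5])
```

Each row of the result is a probability distribution over where a random walk started at that point ends up after t steps, which is what lets the method see beyond immediate neighbors.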
- class endgame.dimensionality_reduction.PaCMAPReducer(n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0, num_iters=450, lr=1.0, apply_pca=True, verbose=False, random_state=None)[source]¶
Bases:
TransformerMixin, BaseEstimator
PaCMAP: Pairwise Controlled Manifold Approximation.
PaCMAP preserves both local and global structure by considering neighbor pairs, mid-near pairs, and further pairs during optimization. It’s faster than t-SNE and UMAP with competitive quality.
- Parameters:
n_components (int, default=2) – Dimension of the embedded space.
n_neighbors (int, default=10) – Number of neighbors for local structure.
MN_ratio (float, default=0.5) – Ratio of mid-near pairs to neighbor pairs.
FP_ratio (float, default=2.0) – Ratio of further pairs to neighbor pairs.
num_iters (int, default=450) – Number of iterations for optimization.
lr (float, default=1.0) – Learning rate.
apply_pca (bool, default=True) – Whether to apply PCA for initialization.
verbose (bool, default=False) – Whether to print progress.
random_state (int, optional) – Random seed.
Example
>>> from endgame.dimensionality_reduction import PaCMAPReducer
>>> pacmap = PaCMAPReducer(n_components=2, n_neighbors=15)
>>> X_2d = pacmap.fit_transform(X)
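Since MN_ratio and FP_ratio are defined relative to n_neighbors, the number of pairs of each kind per point follows from simple arithmetic. The helper below is a hypothetical illustration; rounding to the nearest integer is an assumption about how fractional counts are handled.

```python
def pair_counts(n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0):
    """Pairs sampled per point: (neighbor pairs, mid-near pairs, further pairs)."""
    n_mid_near = int(round(n_neighbors * MN_ratio))
    n_further = int(round(n_neighbors * FP_ratio))
    return n_neighbors, n_mid_near, n_further

print(pair_counts())  # defaults -> (10, 5, 20)
```

With the defaults, each point contributes 10 neighbor pairs, 5 mid-near pairs, and 20 further pairs, so raising n_neighbors scales all three counts together.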
- fit(X, y=None)[source]¶
Fit the PaCMAP model.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored)
- Returns:
self (PaCMAPReducer)
- transform(X)[source]¶
Transform new data.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_new (ndarray of shape (n_samples, n_components)) – Transformed data.
- class endgame.dimensionality_reduction.VAEReducer(n_components=2, encoder_layers=None, decoder_layers=None, activation='relu', dropout=0.0, learning_rate=0.001, batch_size=128, n_epochs=100, beta=1.0, early_stopping=10, validation_fraction=0.1, scale_data=True, device='auto', random_state=None, verbose=False)[source]¶
Bases:
TransformerMixin, BaseEstimator
Variational Autoencoder for Dimensionality Reduction.
VAE learns a probabilistic mapping to a lower-dimensional latent space with a smooth structure, useful for both dimensionality reduction and generative modeling.
- Parameters:
n_components (int, default=2) – Dimension of the latent space.
encoder_layers (list of int, default=[256, 128]) – Hidden layer sizes for the encoder.
decoder_layers (list of int, default=[128, 256]) – Hidden layer sizes for the decoder.
activation (str, default='relu') – Activation function: ‘relu’, ‘leaky_relu’, ‘elu’, ‘tanh’.
dropout (float, default=0.0) – Dropout rate in hidden layers.
learning_rate (float, default=1e-3) – Learning rate for Adam optimizer.
batch_size (int, default=128) – Mini-batch size.
n_epochs (int, default=100) – Number of training epochs.
beta (float, default=1.0) – Weight of KL divergence term (beta-VAE parameter). Higher values create more disentangled representations.
early_stopping (int, default=10) – Stop if no improvement for this many epochs.
validation_fraction (float, default=0.1) – Fraction of data for validation.
scale_data (bool, default=True) – Whether to standardize input features.
device (str, default='auto') – Device: ‘auto’, ‘cpu’, or ‘cuda’.
random_state (int, optional) – Random seed.
verbose (bool, default=False) – Whether to print training progress.
- model_¶
Fitted VAE model.
- Type:
_VAEModule
Example
>>> import numpy as np
>>> from endgame.dimensionality_reduction import VAEReducer
>>> vae = VAEReducer(n_components=10, encoder_layers=[512, 256])
>>> X_latent = vae.fit_transform(X)
>>> # Reconstruct data
>>> X_recon = vae.inverse_transform(X_latent)
>>> # Generate new samples
>>> z_random = np.random.randn(100, 10)
>>> X_generated = vae.decode(z_random)
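The effect of beta is clearest in the loss itself. The NumPy sketch below writes out a standard beta-VAE objective for a diagonal Gaussian posterior: squared reconstruction error plus beta times the KL divergence to a standard normal prior. The exact reduction (sum or mean) and reconstruction term used by VAEReducer are not documented here, so treat the function names and those details as assumptions.

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, log_var, beta=1.0):
    """Return (total, reconstruction, KL) averaged over the batch."""
    recon = ((x - x_recon) ** 2).sum(axis=1)
    # Closed-form KL( N(mu, exp(log_var)) || N(0, I) ) for a diagonal Gaussian.
    kl = -0.5 * (1.0 + log_var - mu**2 - np.exp(log_var)).sum(axis=1)
    return (recon + beta * kl).mean(), recon.mean(), kl.mean()

def sample_latent(mu, log_var, rng):
    """Reparameterization trick used during training: z = mu + sigma * noise."""
    return mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)

rng = np.random.default_rng(0)
x_batch = rng.normal(size=(4, 6))    # hypothetical inputs
mu_batch = np.ones((4, 2))           # hypothetical encoder means
log_var_batch = np.zeros((4, 2))     # unit variance

# Perfect reconstruction, so only the KL term contributes.
total, recon, kl = beta_vae_loss(x_batch, x_batch, mu_batch, log_var_batch, beta=4.0)
print(total, recon, kl)
z_sample = sample_latent(mu_batch, log_var_batch, rng)
```

Raising beta penalizes latent codes that stray from the prior more heavily, trading reconstruction accuracy for a more regular latent space.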
- fit(X, y=None)[source]¶
Fit the VAE model.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored)
- Returns:
self (VAEReducer)
- transform(X)[source]¶
Transform data to latent space.
Uses the mean of the latent distribution (deterministic).
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to transform.
- Returns:
X_latent (ndarray of shape (n_samples, n_components)) – Latent representation.
- decode(z)[source]¶
Decode latent vectors to data space.
Alias for inverse_transform, useful for generation.
- set_inverse_transform_request(*, X_latent='$UNCHANGED$')¶
Configure whether metadata should be requested to be passed to the inverse_transform method.
Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.
The options for each parameter are:
  - True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.
  - False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.
  - None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
  - str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
- Parameters:
X_latent (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_latent parameter in inverse_transform.
- Returns:
self (object) – The updated object.
- reconstruct(X)[source]¶
Reconstruct input through the VAE.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to reconstruct.
- Returns:
X_recon (ndarray of shape (n_samples, n_features)) – Reconstructed data.
- reconstruction_error(X)[source]¶
Compute reconstruction error for each sample.
Useful for anomaly detection.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Data to evaluate.
- Returns:
errors (ndarray of shape (n_samples,)) – Reconstruction error (MSE) per sample.
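Per-sample reconstruction error can be turned into an anomaly score by thresholding it. The sketch below shows the pattern in NumPy, with a rank-2 linear projection standing in for a trained VAE so that the block runs on its own. The helper names, the 0.95 quantile, and the synthetic data are assumptions.

```python
import numpy as np

def reconstruction_mse(X, X_recon):
    """Mean squared error per sample (one value per row)."""
    return ((X - X_recon) ** 2).mean(axis=1)

def flag_anomalies(errors, quantile=0.95):
    """Flag samples whose error exceeds the given quantile of all errors."""
    threshold = np.quantile(errors, quantile)
    return errors > threshold, threshold

rng = np.random.default_rng(0)
# Hypothetical data: 95 samples near a 2-D subspace of 10-D space, plus 5 unstructured samples.
basis = rng.normal(size=(2, 10))
X_normal = rng.normal(size=(95, 2)) @ basis + 0.01 * rng.normal(size=(95, 10))
X_odd = 3.0 * rng.normal(size=(5, 10))
X_all = np.vstack([X_normal, X_odd])

# Stand-in reconstructor: project onto the top two principal axes of the normal samples.
mean = X_normal.mean(axis=0)
_, _, Vt_n = np.linalg.svd(X_normal - mean, full_matrices=False)
X_all_recon = (X_all - mean) @ Vt_n[:2].T @ Vt_n[:2] + mean

errors = reconstruction_mse(X_all, X_all_recon)
flags, threshold = flag_anomalies(errors)
print(np.flatnonzero(flags))
```

With a fitted model, the same two helpers would be applied to the output of reconstruct(X), or reconstruction_error(X) could be thresholded directly.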