Dimensionality Reduction¶

class endgame.dimensionality_reduction.PCAReducer(n_components=None, whiten=False, svd_solver='auto', random_state=None)[source]¶

Bases: TransformerMixin, BaseEstimator

Principal Component Analysis for dimensionality reduction.

A thin wrapper around sklearn’s PCA with additional utilities for variance analysis and automatic component selection.

Parameters:

n_components (int, float, or 'mle', default=None) – Number of components to keep.
    - If int, selects that many components.
    - If float (between 0 and 1), selects enough components to explain that fraction of the variance.
    - If ‘mle’, uses Minka’s MLE to guess the dimension.
    - If None, keeps all components.
whiten (bool, default=False) – Whether to whiten the data (unit variance in each component).
svd_solver ({'auto', 'full', 'arpack', 'randomized'}, default='auto') – SVD solver to use.
random_state (int, optional) – Random seed for reproducibility.

components_¶

Principal axes in feature space.

Type:: ndarray of shape (n_components, n_features)

explained_variance_ratio_¶

Fraction of the total variance explained by each selected component (values between 0 and 1).

Type:: ndarray

n_components_¶

The estimated number of components.

Type:: int

Example

>>> from endgame.dimensionality_reduction import PCAReducer
>>> pca = PCAReducer(n_components=0.95)  # Keep 95% variance
>>> X_reduced = pca.fit_transform(X)
>>> print(f"Reduced from {X.shape[1]} to {X_reduced.shape[1]} dimensions")

fit(X, y=None)[source]¶

Fit the PCA model.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – Not used, present for API consistency.

Returns:

self (PCAReducer)

transform(X)[source]¶

Apply dimensionality reduction to X.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_new (ndarray of shape (n_samples, n_components)) – Transformed data.

fit_transform(X, y=None)[source]¶

Fit and transform in one step.

Return type:: ndarray

inverse_transform(X)[source]¶

Transform data back to original space.

Parameters:: X (array-like of shape (n_samples, n_components)) – Data in reduced space.
Return type:: ndarray
Returns:: X_original (ndarray of shape (n_samples, n_features)) – Reconstructed data.

get_cumulative_variance()[source]¶

Get cumulative explained variance ratio.

Return type:: ndarray
Returns:: cumulative (ndarray) – Cumulative sum of explained variance ratios.

get_n_components_for_variance(variance_threshold)[source]¶

Get the number of components needed to reach a given cumulative explained variance.

Parameters:: variance_threshold (float) – Desired cumulative explained variance (0 to 1).
Return type:: int
Returns:: n_components (int) – Number of components needed.
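
The two variance helpers above can be illustrated with plain scikit-learn and NumPy. This is a minimal sketch of the idea, not the PCAReducer implementation: the helper name `n_components_for_variance` and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 10 features driven by 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(200, 10))

# Keep all components so the full variance spectrum is available
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)


def n_components_for_variance(cumulative, variance_threshold):
    """Hypothetical helper: smallest k whose cumulative ratio reaches the threshold."""
    # searchsorted finds the first index with cumulative[index] >= threshold
    k = int(np.searchsorted(cumulative, variance_threshold, side="left")) + 1
    # Guard against thresholds that exceed the final cumulative value by rounding
    return min(k, len(cumulative))


k = n_components_for_variance(cumulative, 0.95)
print(f"{k} components reach 95% of the variance")
```

Passing a float such as `n_components=0.95` to the constructor should lead to the same kind of selection at fit time, if the wrapper forwards the value to sklearn's PCA unchanged.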

class endgame.dimensionality_reduction.RandomizedPCA(n_components=50, n_oversamples=10, n_iter='auto', whiten=False, random_state=None)[source]¶

Bases: TransformerMixin, BaseEstimator

Randomized PCA using randomized SVD.

Typically faster than full-SVD PCA on large datasets with many features. Uses randomized SVD, which is most efficient when n_components << min(n_samples, n_features).

Parameters:

n_components (int, default=50) – Number of components to keep.
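n_oversamples (int, default=10) – Number of additional random vectors used to sample the range of the data in the randomized SVD solver. Larger values improve accuracy at some extra cost.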
n_iter (int or 'auto', default='auto') – Number of power iterations for the randomized SVD solver.
whiten (bool, default=False) – Whether to whiten the data.
random_state (int, optional) – Random seed.

Example

>>> from endgame.dimensionality_reduction import RandomizedPCA
>>> rpca = RandomizedPCA(n_components=100)
>>> X_reduced = rpca.fit_transform(X_large)  # Fast for large X
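
To show what `n_oversamples` and `n_iter` control, here is a NumPy-only sketch of a randomized range finder followed by a small exact SVD. It is an illustration under stated assumptions, not the library's code: the function `randomized_pca`, its fixed `n_iter=4`, and the synthetic data are all hypothetical.

```python
import numpy as np


def randomized_pca(X, n_components, n_oversamples=10, n_iter=4, seed=0):
    """Hypothetical sketch: approximate the top principal axes of X."""
    rng = np.random.default_rng(seed)
    mean = X.mean(axis=0)
    Xc = X - mean  # PCA operates on centered data
    # Sketch size: requested components plus oversampling, capped by the matrix size
    k = min(n_components + n_oversamples, min(Xc.shape))
    # Project onto k random directions to sample the range of Xc
    Q, _ = np.linalg.qr(Xc @ rng.normal(size=(Xc.shape[1], k)))
    # Power iterations sharpen the estimate when singular values decay slowly
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(Xc.T @ Q)
        Q, _ = np.linalg.qr(Xc @ Q)
    # Exact SVD of the small (k x n_features) projected matrix
    _, s, Vt = np.linalg.svd(Q.T @ Xc, full_matrices=False)
    return mean, Vt[:n_components], s[:n_components]


rng = np.random.default_rng(1)
# Synthetic low-rank data: 5 latent factors plus small noise
X_large = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 200))
X_large += 0.1 * rng.normal(size=(500, 200))

mean, components, singular_values = randomized_pca(X_large, n_components=5)
X_reduced = (X_large - mean) @ components.T

# Reference values from a full SVD, for comparison only
exact = np.linalg.svd(X_large - X_large.mean(axis=0), compute_uv=False)[:5]
```

The expensive decomposition runs on a matrix with only `n_components + n_oversamples` rows, which is where the speed-up comes from when that number is much smaller than the data dimensions.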

fit(X, y=None)[source]¶

Fit the randomized PCA model.

transform(X)[source]¶

Apply dimensionality reduction.

Return type:: ndarray

fit_transform(X, y=None)[source]¶

Fit and transform.

Return type:: ndarray

inverse_transform(X)[source]¶

Reconstruct data from reduced representation.

Return type:: ndarray

class endgame.dimensionality_reduction.TruncatedSVDReducer(n_components=50, algorithm='randomized', n_iter=5, random_state=None)[source]¶

Bases: TransformerMixin, BaseEstimator

Truncated SVD (LSA) for dimensionality reduction.

Unlike PCA, truncated SVD does not center the data before decomposition, so it can operate on sparse matrices without densifying them. This makes it suitable for text features such as TF-IDF matrices.

Parameters:

n_components (int, default=50) – Number of components.
algorithm ({'arpack', 'randomized'}, default='randomized') – SVD solver to use.
n_iter (int, default=5) – Number of iterations for randomized SVD.
random_state (int, optional) – Random seed.

Example

>>> from endgame.dimensionality_reduction import TruncatedSVDReducer
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> tfidf = TfidfVectorizer()
>>> X_sparse = tfidf.fit_transform(texts)
>>> svd = TruncatedSVDReducer(n_components=100)
>>> X_dense = svd.fit_transform(X_sparse)  # Works with sparse input
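
The reason centering matters for sparse input can be seen with scikit-learn's own TruncatedSVD, which this class presumably wraps (an assumption; the source does not say so). The sketch below uses a random sparse matrix as a stand-in for TF-IDF features.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Stand-in for a TF-IDF matrix: 1000 documents, 500 terms, 1% nonzero entries
X_sparse = sparse.random(1000, 500, density=0.01, format="csr", random_state=0)

# Truncated SVD accepts the sparse matrix directly and returns a dense array
svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X_sparse)

# Centering, which PCA requires, turns almost every zero into a nonzero value
column_means = np.asarray(X_sparse.mean(axis=0))
X_centered = X_sparse.toarray() - column_means
density_before = X_sparse.nnz / (X_sparse.shape[0] * X_sparse.shape[1])
density_after = np.count_nonzero(X_centered) / X_centered.size

print(f"density before centering: {density_before:.3f}, after: {density_after:.3f}")
```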

fit(X, y=None)[source]¶

Fit the truncated SVD model.

transform(X)[source]¶

Apply dimensionality reduction.

Return type:: ndarray

fit_transform(X, y=None)[source]¶

Fit and transform.

Return type:: ndarray

inverse_transform(X)[source]¶

Reconstruct from reduced representation.

Return type:: ndarray

class endgame.dimensionality_reduction.KernelPCAReducer(n_components=50, kernel='rbf', gamma=None, degree=3, coef0=1.0, fit_inverse_transform=False, random_state=None)[source]¶

Bases: TransformerMixin, BaseEstimator

Kernel PCA for nonlinear dimensionality reduction.

Applies PCA in a kernel-induced feature space, allowing for nonlinear projections while remaining computationally tractable.

Parameters:

n_components (int, default=50) – Number of components.
kernel (str, default='rbf') – Kernel type: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘cosine’.
gamma (float, optional) – Kernel coefficient for ‘rbf’, ‘poly’, ‘sigmoid’. If None, defaults to 1/n_features.
degree (int, default=3) – Degree for polynomial kernel.
coef0 (float, default=1.0) – Independent term in ‘poly’ and ‘sigmoid’.
fit_inverse_transform (bool, default=False) – Whether to learn the inverse transform (expensive).
random_state (int, optional) – Random seed.

Example

>>> from endgame.dimensionality_reduction import KernelPCAReducer
>>> kpca = KernelPCAReducer(n_components=2, kernel='rbf', gamma=0.1)
>>> X_nonlinear = kpca.fit_transform(X)

fit(X, y=None)[source]¶

Fit the Kernel PCA model.

transform(X)[source]¶

Apply dimensionality reduction.

Return type:: ndarray

fit_transform(X, y=None)[source]¶

Fit and transform.

Return type:: ndarray

inverse_transform(X)[source]¶

Reconstruct from reduced representation.

Only available if fit_inverse_transform=True.

Return type:: ndarray
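
The dependence of inverse_transform on fit_inverse_transform mirrors scikit-learn's KernelPCA. The sketch below uses sklearn directly and assumes the wrapper behaves the same way; the dataset and the `gamma` value are chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

# Two concentric circles: a classic case where linear PCA cannot separate structure
X, _ = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# fit_inverse_transform=True makes the model learn an approximate pre-image map
kpca = KernelPCA(
    n_components=2,
    kernel="rbf",
    gamma=10.0,
    fit_inverse_transform=True,
    random_state=0,
)
X_kpca = kpca.fit_transform(X)

# The reconstruction is approximate because the kernel map has no exact inverse
X_back = kpca.inverse_transform(X_kpca)
reconstruction_mse = float(np.mean((X - X_back) ** 2))
print(f"reconstruction MSE: {reconstruction_mse:.4f}")
```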

class endgame.dimensionality_reduction.ICAReducer(n_components=None, algorithm='parallel', whiten='unit-variance', fun='logcosh', max_iter=200, tol=0.0001, random_state=None)[source]¶

Bases: TransformerMixin, BaseEstimator

Independent Component Analysis for dimensionality reduction.

ICA separates a multivariate signal into additive, independent components. Useful when the underlying sources are non-Gaussian.

Parameters:

n_components (int, optional) – Number of components. If None, uses all features.
algorithm ({'parallel', 'deflation'}, default='parallel') – ICA algorithm to use.
whiten (str, default='unit-variance') – Whitening strategy. Use ‘unit-variance’ for sklearn >= 1.1.
fun ({'logcosh', 'exp', 'cube'}, default='logcosh') – Functional form of the G function for approximating negentropy.
max_iter (int, default=200) – Maximum number of iterations.
tol (float, default=1e-4) – Tolerance for convergence.
random_state (int, optional) – Random seed.

Example

>>> from endgame.dimensionality_reduction import ICAReducer
>>> ica = ICAReducer(n_components=10)
>>> X_independent = ica.fit_transform(X)
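
As an illustration of what "independent components" means, the following sketch mixes two known non-Gaussian signals and recovers them with scikit-learn's FastICA. It assumes ICAReducer forwards its parameters to FastICA, which the source does not state; the signals and mixing matrix are made up for the example.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
# Two non-Gaussian sources: a sine wave and a square wave
S = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])
# Observed signals are linear mixtures of the sources
A = np.array([[1.0, 0.5], [0.5, 1.0]])
X = S @ A.T

ica = FastICA(n_components=2, whiten="unit-variance", max_iter=500, random_state=0)
S_est = ica.fit_transform(X)          # estimated sources
X_back = ica.inverse_transform(S_est)  # mixtures rebuilt from the sources

# ICA recovers sources only up to order, sign, and scale,
# so compare by absolute correlation rather than by value
corr = np.abs(np.corrcoef(S.T, S_est.T)[:2, 2:])
print(corr.round(2))
```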

fit(X, y=None)[source]¶

Fit the ICA model.

transform(X)[source]¶

Apply ICA transformation.

Return type:: ndarray

fit_transform(X, y=None)[source]¶

Fit and transform.

Return type:: ndarray

inverse_transform(X)[source]¶

Reconstruct signals from independent components.

Return type:: ndarray

class endgame.dimensionality_reduction.UMAPReducer(n_components=2, n_neighbors=15, min_dist=0.1, metric='euclidean', spread=1.0, learning_rate=1.0, n_epochs=None, init='spectral', random_state=None, verbose=False)[source]¶

Bases: TransformerMixin, BaseEstimator

Uniform Manifold Approximation and Projection (UMAP).

UMAP is a manifold learning technique that aims to preserve local structure while retaining more global structure than t-SNE typically does. It is also usually considerably faster than t-SNE.

Parameters:

n_components (int, default=2) – Dimension of the embedded space.
n_neighbors (int, default=15) – Number of neighbors for constructing the local manifold. Larger values capture more global structure.
min_dist (float, default=0.1) – Minimum distance between embedded points. Smaller values create tighter clusters.
metric (str, default='euclidean') – Distance metric: ‘euclidean’, ‘manhattan’, ‘cosine’, ‘correlation’, etc.
spread (float, default=1.0) – Effective scale of embedded points.
learning_rate (float, default=1.0) – Learning rate for the embedding optimization.
n_epochs (int, optional) – Number of training epochs. If None, auto-determined.
init (str, default='spectral') – Initialization method: ‘spectral’, ‘random’, or an array of initial embedding coordinates.
random_state (int, optional) – Random seed.
verbose (bool, default=False) – Whether to print progress.

embedding_¶

Embedding of the training data.

Type:: ndarray of shape (n_samples, n_components)

Example

>>> from endgame.dimensionality_reduction import UMAPReducer
>>> umap = UMAPReducer(n_components=2, n_neighbors=30)
>>> X_2d = umap.fit_transform(X)
>>> # For new data (uses approximate transform)
>>> X_new_2d = umap.transform(X_new)

fit(X, y=None)[source]¶

Fit the UMAP model.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like, optional) – Target labels for semi-supervised mode.

Returns:

self (UMAPReducer)

transform(X)[source]¶

Transform new data to the embedding space.

Uses the learned transform to embed new points. Note that this is an approximation based on the nearest neighbors in the training set.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_new (ndarray of shape (n_samples, n_components)) – Transformed data.

fit_transform(X, y=None)[source]¶

Fit and transform in one step.

This is more efficient than calling fit then transform.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (array-like, optional) – Target labels for semi-supervised mode.

Return type:

ndarray

Returns:

X_new (ndarray of shape (n_samples, n_components)) – Embedding of training data.

inverse_transform(X)[source]¶

Transform from embedding space back to data space.

Note: This is an approximation.

Parameters:: X (array-like of shape (n_samples, n_components)) – Data in embedding space.
Return type:: ndarray
Returns:: X_original (ndarray of shape (n_samples, n_features)) – Approximate reconstruction.

class endgame.dimensionality_reduction.ParametricUMAP(n_components=2, n_neighbors=15, min_dist=0.1, metric='euclidean', encoder_layers=None, decoder_layers=None, n_training_epochs=100, batch_size=256, random_state=None, verbose=False)[source]¶

Bases: TransformerMixin, BaseEstimator

Parametric UMAP using neural networks.

Unlike standard UMAP, Parametric UMAP learns an explicit mapping function using a neural network, enabling faster transforms on new data and the ability to train on mini-batches.

Requires TensorFlow/Keras to be installed.

Parameters:

n_components (int, default=2) – Dimension of the embedded space.
n_neighbors (int, default=15) – Number of neighbors for constructing the local manifold.
min_dist (float, default=0.1) – Minimum distance between embedded points.
metric (str, default='euclidean') – Distance metric.
encoder_layers (list of int, optional) – Sizes of encoder hidden layers. Default [256, 256].
decoder_layers (list of int, optional) – Sizes of decoder hidden layers for reconstruction. Default [256, 256].
n_training_epochs (int, default=100) – Number of training epochs for the neural network.
batch_size (int, default=256) – Mini-batch size for training.
random_state (int, optional) – Random seed.
verbose (bool, default=False) – Whether to print progress.

Example

>>> from endgame.dimensionality_reduction import ParametricUMAP
>>> pumap = ParametricUMAP(n_components=2)
>>> pumap.fit(X_train)
>>> # Fast transform on new data
>>> X_new_2d = pumap.transform(X_test)

fit(X, y=None)[source]¶

Fit the Parametric UMAP model.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – Not used, present for API consistency.

Returns:

self (ParametricUMAP)

transform(X)[source]¶

Transform new data using the learned encoder.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_new (ndarray of shape (n_samples, n_components)) – Transformed data.

fit_transform(X, y=None)[source]¶

Fit and transform in one step.

Return type:: ndarray

inverse_transform(X)[source]¶

Transform from embedding space back to data space.

Parameters:: X (array-like of shape (n_samples, n_components)) – Data in embedding space.
Return type:: ndarray
Returns:: X_original (ndarray of shape (n_samples, n_features)) – Reconstructed data.

class endgame.dimensionality_reduction.TriMAPReducer(n_components=2, n_inliers=12, n_outliers=4, n_random=3, weight_adj=None, n_iters=400, apply_pca=True, verbose=False)[source]¶

Bases: TransformerMixin, BaseEstimator

TriMAP: Dimensionality Reduction Using Triplet Constraints.

TriMAP uses triplet constraints to capture both local and global structure. It is intended to preserve global structure better than t-SNE or UMAP, particularly for hierarchical data.

Parameters:

n_components (int, default=2) – Dimension of the embedded space.
n_inliers (int, default=12) – Number of nearest neighbor inliers per point.
n_outliers (int, default=4) – Number of random outliers per point.
n_random (int, default=3) – Number of random triplets per point.
weight_adj (float, optional) – Weight adjustment factor for triplet loss.
n_iters (int, default=400) – Number of optimization iterations.
apply_pca (bool, default=True) – Whether to apply PCA for initialization.
verbose (bool, default=False) – Whether to print progress.

Example

>>> from endgame.dimensionality_reduction import TriMAPReducer
>>> trimap = TriMAPReducer(n_components=2, n_inliers=15)
>>> X_2d = trimap.fit_transform(X)

fit(X, y=None)[source]¶

Fit the TriMAP model.

Note: TriMAP is a transductive method, so fit stores the embedding but doesn’t create a general transform function.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – Not used, present for API consistency.

Returns:

self (TriMAPReducer)

transform(X)[source]¶

Transform new data.

Note: TriMAP doesn’t have native out-of-sample support. This uses a nearest-neighbor approximation.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_new (ndarray of shape (n_samples, n_components)) – Approximate embedding.
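
The source does not describe how the nearest-neighbor approximation works. One common scheme, shown here purely as an assumption, places each new point at the inverse-distance-weighted average of the embeddings of its k nearest training points. The function `knn_embed` and the stand-in embedding are hypothetical.

```python
import numpy as np


def knn_embed(X_train, embedding, X_new, k=5, eps=1e-12):
    """Hypothetical out-of-sample rule: weighted mean of neighbor embeddings."""
    # Squared Euclidean distances, shape (n_new, n_train)
    d2 = ((X_new[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the k closest training points for each new point
    idx = np.argsort(d2, axis=1)[:, :k]
    d = np.sqrt(np.take_along_axis(d2, idx, axis=1))
    # Inverse-distance weights, normalized to sum to one per new point
    w = 1.0 / (d + eps)
    w /= w.sum(axis=1, keepdims=True)
    # embedding[idx] has shape (n_new, k, n_components)
    return (w[:, :, None] * embedding[idx]).sum(axis=1)


rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 8))
# Stand-in for a fitted embedding: any (n_samples, n_components) array works here
embedding = X_train[:, :2].copy()
X_new = rng.normal(size=(10, 8))

X_new_2d = knn_embed(X_train, embedding, X_new, k=5)
```

Because each output is a convex combination of existing embedded points, new points can never land outside the region already covered by the training embedding. That is one reason such transforms are only approximate for transductive methods.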

fit_transform(X, y=None)[source]¶

Fit and return the embedding.

Return type:: ndarray

class endgame.dimensionality_reduction.PHATEReducer(n_components=2, knn=5, decay=40, t='auto', gamma=1.0, n_pca=100, knn_dist='euclidean', mds_solver='sgd', random_state=None, verbose=0)[source]¶

Bases: TransformerMixin, BaseEstimator

PHATE: Potential of Heat-diffusion for Affinity-based Transition Embedding.

PHATE is designed for visualizing trajectories and progressions in high-dimensional biological data. It preserves both local and global structures through diffusion-based distances.

Parameters:

n_components (int, default=2) – Dimension of the embedded space.
knn (int, default=5) – Number of nearest neighbors for graph construction.
decay (int, default=40) – Decay rate of the kernel tails.
t (int or 'auto', default='auto') – Power of the diffusion operator.
gamma (float, default=1.0) – Informational distance constant between -1 and 1.
n_pca (int, default=100) – Number of principal components for initial reduction.
knn_dist (str, default='euclidean') – Distance metric for KNN graph.
mds_solver (str, default='sgd') – MDS solver: ‘sgd’ or ‘smacof’.
random_state (int, optional) – Random seed.
verbose (int, default=0) – Verbosity level.

Example

>>> from endgame.dimensionality_reduction import PHATEReducer
>>> phate = PHATEReducer(n_components=2, knn=10)
>>> X_2d = phate.fit_transform(X)

fit(X, y=None)[source]¶

Fit the PHATE model.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – Not used, present for API consistency.

Returns:

self (PHATEReducer)

transform(X)[source]¶

Transform new data.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_new (ndarray of shape (n_samples, n_components)) – Transformed data.

fit_transform(X, y=None)[source]¶

Fit and return the embedding.

Return type:: ndarray

class endgame.dimensionality_reduction.PaCMAPReducer(n_components=2, n_neighbors=10, MN_ratio=0.5, FP_ratio=2.0, num_iters=450, lr=1.0, apply_pca=True, verbose=False, random_state=None)[source]¶

Bases: TransformerMixin, BaseEstimator

PaCMAP: Pairwise Controlled Manifold Approximation.

PaCMAP preserves both local and global structure by optimizing over three kinds of point pairs: neighbor pairs, mid-near pairs, and further pairs. It is typically faster than t-SNE and UMAP while producing embeddings of competitive quality.

Parameters:

n_components (int, default=2) – Dimension of the embedded space.
n_neighbors (int, default=10) – Number of neighbors for local structure.
MN_ratio (float, default=0.5) – Ratio of mid-near pairs to neighbor pairs.
FP_ratio (float, default=2.0) – Ratio of further pairs to neighbor pairs.
num_iters (int, default=450) – Number of iterations for optimization.
lr (float, default=1.0) – Learning rate.
apply_pca (bool, default=True) – Whether to apply PCA for initialization.
verbose (bool, default=False) – Whether to print progress.
random_state (int, optional) – Random seed.

Example

>>> from endgame.dimensionality_reduction import PaCMAPReducer
>>> pacmap = PaCMAPReducer(n_components=2, n_neighbors=15)
>>> X_2d = pacmap.fit_transform(X)

fit(X, y=None)[source]¶

Fit the PaCMAP model.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – Not used, present for API consistency.

Returns:

self (PaCMAPReducer)

transform(X)[source]¶

Transform new data.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_new (ndarray of shape (n_samples, n_components)) – Transformed data.

fit_transform(X, y=None)[source]¶

Fit and return the embedding.

Return type:: ndarray

class endgame.dimensionality_reduction.VAEReducer(n_components=2, encoder_layers=None, decoder_layers=None, activation='relu', dropout=0.0, learning_rate=0.001, batch_size=128, n_epochs=100, beta=1.0, early_stopping=10, validation_fraction=0.1, scale_data=True, device='auto', random_state=None, verbose=False)[source]¶

Bases: TransformerMixin, BaseEstimator

Variational Autoencoder for Dimensionality Reduction.

VAE learns a probabilistic mapping to a lower-dimensional latent space with a smooth structure, useful for both dimensionality reduction and generative modeling.

Parameters:

n_components (int, default=2) – Dimension of the latent space.
encoder_layers (list of int, default=[256, 128]) – Hidden layer sizes for the encoder.
decoder_layers (list of int, default=[128, 256]) – Hidden layer sizes for the decoder.
activation (str, default='relu') – Activation function: ‘relu’, ‘leaky_relu’, ‘elu’, ‘tanh’.
dropout (float, default=0.0) – Dropout rate in hidden layers.
learning_rate (float, default=1e-3) – Learning rate for Adam optimizer.
batch_size (int, default=128) – Mini-batch size.
n_epochs (int, default=100) – Number of training epochs.
beta (float, default=1.0) – Weight of KL divergence term (beta-VAE parameter). Higher values encourage more disentangled representations, usually at the cost of reconstruction quality.
early_stopping (int, default=10) – Stop if no improvement for this many epochs.
validation_fraction (float, default=0.1) – Fraction of data for validation.
scale_data (bool, default=True) – Whether to standardize input features.
device (str, default='auto') – Device: ‘auto’, ‘cpu’, or ‘cuda’.
random_state (int, optional) – Random seed.
verbose (bool, default=False) – Whether to print training progress.
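
The role of `beta` can be made concrete with the standard beta-VAE objective: a reconstruction term plus `beta` times the KL divergence between the approximate posterior and a standard normal prior. The sketch below is NumPy-only and assumes a squared-error reconstruction term and a diagonal Gaussian posterior. The function `beta_vae_loss` is hypothetical and not taken from the library.

```python
import numpy as np


def beta_vae_loss(x, x_recon, mu, logvar, beta=1.0):
    """Hypothetical sketch of a beta-VAE loss, averaged over the batch."""
    # Reconstruction term: squared error summed over features
    recon = np.mean(np.sum((x - x_recon) ** 2, axis=1))
    # Closed-form KL( N(mu, exp(logvar)) || N(0, I) ), summed over latent dimensions
    kl = np.mean(-0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1))
    return recon + beta * kl, recon, kl


rng = np.random.default_rng(0)
x = rng.normal(size=(16, 5))
x_recon = x + 0.1 * rng.normal(size=(16, 5))  # pretend decoder output
mu = rng.normal(size=(16, 2))                 # pretend encoder means
logvar = 0.1 * rng.normal(size=(16, 2))       # pretend encoder log-variances

loss_beta1, recon, kl = beta_vae_loss(x, x_recon, mu, logvar, beta=1.0)
loss_beta4, _, _ = beta_vae_loss(x, x_recon, mu, logvar, beta=4.0)

# A posterior equal to the prior contributes no KL penalty
_, _, kl_prior = beta_vae_loss(x, x_recon, np.zeros((16, 2)), np.zeros((16, 2)))
```

With `beta` above 1 the optimizer pays more for latent codes that drift from the prior, which is the pressure associated with disentanglement.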

model_¶

Fitted VAE model.

Type:: _VAEModule

reconstruction_loss_¶

Final reconstruction loss.

Type:: float

kl_loss_¶

Final KL divergence loss.

Type:: float

Example

>>> import numpy as np
>>> from endgame.dimensionality_reduction import VAEReducer
>>> vae = VAEReducer(n_components=10, encoder_layers=[512, 256])
>>> X_latent = vae.fit_transform(X)
>>> # Reconstruct data
>>> X_recon = vae.inverse_transform(X_latent)
>>> # Generate new samples (latent dimension must match n_components)
>>> z_random = np.random.randn(100, 10)
>>> X_generated = vae.decode(z_random)

fit(X, y=None)[source]¶

Fit the VAE model.

Parameters:

X (array-like of shape (n_samples, n_features)) – Training data.
y (Ignored) – Not used, present for API consistency.

Returns:

self (VAEReducer)

transform(X)[source]¶

Transform data to latent space.

Uses the mean of the latent distribution (deterministic).

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to transform.
Return type:: ndarray
Returns:: X_latent (ndarray of shape (n_samples, n_components)) – Latent representation.

fit_transform(X, y=None)[source]¶

Fit and transform in one step.

Return type:: ndarray

inverse_transform(X_latent)[source]¶

Transform from latent space back to data space.

Parameters:: X_latent (array-like of shape (n_samples, n_components)) – Data in latent space.
Return type:: ndarray
Returns:: X_recon (ndarray of shape (n_samples, n_features)) – Reconstructed data.

decode(z)[source]¶

Decode latent vectors to data space.

Alias for inverse_transform, useful for generation.

Parameters:: z (array-like of shape (n_samples, n_components)) – Latent vectors.
Return type:: ndarray
Returns:: X (ndarray of shape (n_samples, n_features)) – Generated data.

set_inverse_transform_request(*, X_latent='$UNCHANGED$')¶

Configure whether metadata should be requested to be passed to the inverse_transform method.

Note that this method is only relevant when this estimator is used as a sub-estimator within a meta-estimator and metadata routing is enabled with enable_metadata_routing=True (see sklearn.set_config()). Please check the User Guide on how the routing mechanism works.

The options for each parameter are:

True: metadata is requested, and passed to inverse_transform if provided. The request is ignored if metadata is not provided.
False: metadata is not requested and the meta-estimator will not pass it to inverse_transform.
None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
str: metadata should be passed to the meta-estimator with this given alias instead of the original name.

The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.

Added in version 1.3.

Parameters:

X_latent (str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED) – Metadata routing for X_latent parameter in inverse_transform.

Returns:

self (object) – The updated object.

Return type:

VAEReducer

sample(n_samples=100)[source]¶

Generate new samples from the prior.

Parameters:: n_samples (int, default=100) – Number of samples to generate.
Return type:: ndarray
Returns:: X (ndarray of shape (n_samples, n_features)) – Generated samples.

reconstruct(X)[source]¶

Reconstruct input through the VAE.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to reconstruct.
Return type:: ndarray
Returns:: X_recon (ndarray of shape (n_samples, n_features)) – Reconstructed data.

reconstruction_error(X)[source]¶

Compute reconstruction error for each sample.

Useful for anomaly detection.

Parameters:: X (array-like of shape (n_samples, n_features)) – Data to evaluate.
Return type:: ndarray
Returns:: errors (ndarray of shape (n_samples,)) – Reconstruction error (MSE) per sample.
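
A typical anomaly-detection recipe built on per-sample reconstruction error is to set a threshold from the training errors and flag anything above it. The sketch below shows that recipe with a linear PCA projection standing in for the VAE so that it runs without a deep learning backend. The stand-in `reconstruction_error` function, the 99th-percentile threshold, and the data are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Inliers lie close to a 2-D subspace of a 10-D space
W = rng.normal(size=(2, 10))
X_train = rng.normal(size=(500, 2)) @ W + 0.05 * rng.normal(size=(500, 10))
X_inliers = rng.normal(size=(20, 2)) @ W + 0.05 * rng.normal(size=(20, 10))
X_outliers = 3.0 * rng.normal(size=(5, 10))  # points far from the subspace
X_test = np.vstack([X_inliers, X_outliers])

# Stand-in model: project onto the top-2 principal axes of the training data
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
components = Vt[:2]


def reconstruction_error(X):
    """Per-sample mean squared error between X and its reconstruction."""
    X_recon = (X - mean) @ components.T @ components + mean
    return np.mean((X - X_recon) ** 2, axis=1)


# Flag samples whose error exceeds the 99th percentile of training errors
threshold = np.percentile(reconstruction_error(X_train), 99)
errors = reconstruction_error(X_test)
is_anomaly = errors > threshold
```

With a fitted VAEReducer, the same thresholding step would apply to the array returned by its reconstruction_error method.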