Features Module¶

The features module provides unified feature extraction for molecules and proteins. It handles featurization, caching, and batch processing.

Overview¶

The features system consists of three main components:

MoleculeFeaturizer - Extract molecular representations (fingerprints, descriptors, embeddings)
ProteinFeaturizer - Extract protein sequence embeddings (ESM2, ESM3)
FeatureCache - Efficient caching for expensive feature computations

Molecule Featurizer¶

MoleculeFeaturizer¶

themap.features.molecule.MoleculeFeaturizer ¶

Efficient molecule featurization using molfeat.

Provides batch featurization with SMILES deduplication for efficiency. Supports fingerprints, descriptors, and neural embeddings.

Attributes:

Name	Type	Description
`featurizer_name`		Name of the featurizer to use.
`n_jobs`		Number of parallel workers for featurization.
`_transformer`		Cached molfeat transformer instance.

Examples:

>>> featurizer = MoleculeFeaturizer("ecfp")
>>> features = featurizer.featurize(["CCO", "CCCO", "CCCCO"])
>>> print(features.shape)  # (3, 2048)

>>> # With SMILES deduplication
>>> smiles = ["CCO", "CCCO", "CCO", "CCCCO", "CCCO"]
>>> features = featurizer.featurize_deduplicated(smiles)
>>> print(features.shape)  # (5, 2048) - returns features for all input SMILES

transformer `property` ¶

transformer

Get or create the molfeat transformer (lazy initialization).
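The lazy initialization described above can be sketched as a property that builds an expensive object on first access and reuses it afterwards. This is a generic illustration, not the library's implementation; `LazyFeaturizer`, `factory`, and `build_count` are hypothetical names introduced for the example.

```python
from typing import Any, Callable, Optional


class LazyFeaturizer:
    """Toy class that builds its transformer only on first access."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory              # hypothetical constructor for the expensive object
        self._transformer: Optional[Any] = None
        self.build_count = 0                 # bookkeeping for the demonstration only

    @property
    def transformer(self) -> Any:
        # Build on first access, then reuse the cached instance.
        if self._transformer is None:
            self._transformer = self._factory()
            self.build_count += 1
        return self._transformer


lazy = LazyFeaturizer(factory=lambda: object())
first = lazy.transformer
second = lazy.transformer
print(first is second, lazy.build_count)  # True 1
```

Constructing the featurizer stays cheap with this pattern; the cost of building the transformer is paid only if features are actually requested.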

is_fingerprint `property` ¶

is_fingerprint: bool

Check if this is a fingerprint-based featurizer.

is_neural `property` ¶

is_neural: bool

Check if this is a neural embedding featurizer.

__init__ ¶

__init__(featurizer_name: str = 'ecfp', n_jobs: int = 8, device: str = 'auto')

Initialize the molecule featurizer.

Parameters:

Name	Type	Description	Default
`featurizer_name`	`str`	Name of the molfeat featurizer to use.	`'ecfp'`
`n_jobs`	`int`	Number of parallel workers for featurization.	`8`
`device`	`str`	Device for neural featurizers ('auto', 'cpu', 'cuda').	`'auto'`

featurize ¶

featurize(
    smiles: Union[str, List[str]], ignore_errors: bool = True
) -> NDArray[np.float32]

Featurize one or more SMILES strings.

Parameters:

Name	Type	Description	Default
`smiles`	`Union[str, List[str]]`	Single SMILES string or list of SMILES.	required
`ignore_errors`	`bool`	If True, return NaN for invalid SMILES.	`True`

Returns:

Type	Description
`NDArray[float32]`	Feature array of shape (n_molecules, feature_dim).

featurize_deduplicated ¶

featurize_deduplicated(
    smiles: List[str], ignore_errors: bool = True
) -> NDArray[np.float32]

Featurize SMILES with deduplication for efficiency.

Unique SMILES are featurized once, then results are mapped back to the original list. This is efficient when there are many duplicate SMILES across datasets.

Parameters:

Name	Type	Description	Default
`smiles`	`List[str]`	List of SMILES strings (may contain duplicates).	required
`ignore_errors`	`bool`	If True, return NaN for invalid SMILES.	`True`

Returns:

Type	Description
`NDArray[float32]`	Feature array of shape (n_molecules, feature_dim) in original order.
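The deduplicate-then-map-back idea can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the library's code: `featurize_with_dedup` and `toy_featurizer` are hypothetical names, and the sketch compares SMILES as exact strings, so two different spellings of the same molecule are not merged.

```python
from typing import Callable, Dict, List, Sequence


def featurize_with_dedup(
    smiles: Sequence[str],
    featurize_unique: Callable[[List[str]], List[List[float]]],
) -> List[List[float]]:
    """Featurize each distinct SMILES once and map results back to input order."""
    unique = list(dict.fromkeys(smiles))        # distinct SMILES in first-seen order
    vectors = featurize_unique(unique)          # one expensive call on unique inputs
    lookup: Dict[str, List[float]] = dict(zip(unique, vectors))
    return [lookup[s] for s in smiles]          # one row per input, original order


calls: List[int] = []


def toy_featurizer(batch: List[str]) -> List[List[float]]:
    # Hypothetical stand-in: records the batch size, returns [length, count of 'O'].
    calls.append(len(batch))
    return [[float(len(s)), float(s.count("O"))] for s in batch]


smiles = ["CCO", "CCCO", "CCO", "CCCCO", "CCCO"]
rows = featurize_with_dedup(smiles, toy_featurizer)
print(len(rows), calls)  # 5 [3]
```

Five inputs produce five rows, but the featurizer sees only the three distinct strings.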

featurize_datasets ¶

featurize_datasets(
    datasets: Dict[str, MoleculeDataset], deduplicate: bool = True
) -> Dict[str, NDArray[np.float32]]

Featurize multiple datasets efficiently.

When deduplicate=True, collects all unique SMILES across all datasets, featurizes them once, then maps back to each dataset.

Parameters:

Name	Type	Description	Default
`datasets`	`Dict[str, MoleculeDataset]`	Dictionary mapping task IDs to MoleculeDataset instances.	required
`deduplicate`	`bool`	If True, deduplicate SMILES across all datasets.	`True`

Returns:

Type	Description
`Dict[str, NDArray[float32]]`	Dictionary mapping task IDs to feature arrays.
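Global deduplication across datasets can be sketched the same way: featurize the union of all SMILES once, then rebuild one matrix per task. The sketch uses plain lists of SMILES in place of `MoleculeDataset` objects, and `featurize_many` and `toy` are hypothetical names, so treat it as an illustration of the idea rather than the actual implementation.

```python
from typing import Callable, Dict, List


def featurize_many(
    datasets: Dict[str, List[str]],
    featurize_unique: Callable[[List[str]], List[List[float]]],
) -> Dict[str, List[List[float]]]:
    """Featurize the union of all SMILES once, then rebuild per-task matrices."""
    all_smiles = [s for task_smiles in datasets.values() for s in task_smiles]
    unique = list(dict.fromkeys(all_smiles))    # distinct SMILES across every task
    lookup = dict(zip(unique, featurize_unique(unique)))
    return {
        task: [lookup[s] for s in task_smiles]
        for task, task_smiles in datasets.items()
    }


seen_batches: List[List[str]] = []


def toy(batch: List[str]) -> List[List[float]]:
    # Hypothetical stand-in featurizer: one feature, the string length.
    seen_batches.append(list(batch))
    return [[float(len(s))] for s in batch]


datasets = {"task1": ["CCO", "CCCO"], "task2": ["CCO", "CC(=O)O"]}
per_task = featurize_many(datasets, toy)
print(seen_batches)  # [['CCO', 'CCCO', 'CC(=O)O']]
```

"CCO" appears in both tasks but is featurized once; each task still receives a row for it.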

get_feature_dim ¶

get_feature_dim() -> int

Get the feature dimension for this featurizer.

Returns:

Type	Description
`int`	Number of features produced by this featurizer.

Available Featurizers¶

Fingerprints (Fast)¶

Featurizer	Description	Dimensions
`ecfp`	Extended Connectivity Fingerprints	2048
`maccs`	MACCS Structural Keys	167
`topological`	Topological Fingerprints	2048
`avalon`	Avalon Fingerprints	512

Descriptors (Medium Speed)¶

Featurizer	Description	Dimensions
`desc2D`	2D Molecular Descriptors	~200
`mordred`	Mordred Descriptors	~1600

Neural Embeddings (Slow, GPU Recommended)¶

Featurizer	Description	Dimensions
`ChemBERTa-77M-MLM`	ChemBERTa masked language model	384
`ChemBERTa-77M-MTR`	ChemBERTa multi-task regression	384
`MolT5`	Molecular T5 embeddings	768
`Roberta-Zinc480M-102M`	RoBERTa trained on ZINC	768
`gin_supervised_*`	Graph neural network embeddings	300

Usage Examples¶

from themap.features import MoleculeFeaturizer

# Initialize featurizer
featurizer = MoleculeFeaturizer(
    featurizer_name="ecfp",
    n_jobs=8
)

# Featurize a list of SMILES
smiles_list = ["CCO", "CCCO", "CC(=O)O"]
features = featurizer.featurize(smiles_list)

print(f"Features shape: {features.shape}")
# Features shape: (3, 2048)

Batch Processing with Deduplication¶

from themap.features import MoleculeFeaturizer

featurizer = MoleculeFeaturizer(featurizer_name="ecfp")

# Featurize multiple datasets with global deduplication
datasets = {
    "task1": dataset1,  # MoleculeDataset objects
    "task2": dataset2,
}

features = featurizer.featurize_datasets(
    datasets,
    deduplicate=True  # Avoid re-computing for duplicate SMILES
)

for task_id, task_features in features.items():
    print(f"{task_id}: {task_features.shape}")

Protein Featurizer¶

ProteinFeaturizer¶

themap.features.protein.ProteinFeaturizer ¶

Efficient protein featurization using ESM models.

Provides batch featurization of protein sequences using ESM2 or ESM3 models. Models are cached globally to avoid reloading.
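Global model caching of this kind can be sketched with `functools.lru_cache`, which keeps one loaded object per argument combination. How the library actually stores its models is not described here; `get_model` and `load_count` are hypothetical names, and the returned dictionary stands in for a real model.

```python
from functools import lru_cache

load_count = {"n": 0}


@lru_cache(maxsize=None)
def get_model(name: str, device: str = "cpu") -> dict:
    """Load a model once per (name, device) pair and reuse it afterwards."""
    load_count["n"] += 1
    # Hypothetical placeholder for an expensive model load.
    return {"name": name, "device": device}


a = get_model("esm2_t6_8M_UR50D")
b = get_model("esm2_t6_8M_UR50D")       # served from the cache
c = get_model("esm2_t33_650M_UR50D")    # different name, loaded separately
print(a is b, load_count["n"])  # True 2
```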

Attributes:

Name	Type	Description
`featurizer_name`		Name of the ESM model to use.
`layer`		Which transformer layer to extract embeddings from.
`device`		Device for computation ('auto', 'cpu', 'cuda').

Examples:

>>> featurizer = ProteinFeaturizer("esm2_t33_650M_UR50D")
>>> sequences = {"P1": "MKTVRQ...", "P2": "MENLNM..."}
>>> features = featurizer.featurize(sequences)
>>> print(features.shape)  # (2, 1280)

is_esm2 `property` ¶

is_esm2: bool

Check if this is an ESM2 model.

is_esm3 `property` ¶

is_esm3: bool

Check if this is an ESM3 model.

__init__ ¶

__init__(
    featurizer_name: str = "esm2_t33_650M_UR50D",
    layer: Optional[int] = None,
    device: str = "auto",
)

Initialize the protein featurizer.

Parameters:

Name	Type	Description	Default
`featurizer_name`	`str`	Name of the ESM model to use.	`'esm2_t33_650M_UR50D'`
`layer`	`Optional[int]`	Which transformer layer to extract embeddings from. If None, uses the default for the model.	`None`
`device`	`str`	Device for computation ('auto', 'cpu', 'cuda').	`'auto'`

featurize ¶

featurize(sequences: Union[Dict[str, str], List[str]]) -> NDArray[np.float32]

Featurize protein sequences.

Parameters:

Name	Type	Description	Default
`sequences`	`Union[Dict[str, str], List[str]]`	Either a dictionary mapping protein IDs to sequences, or a list of sequences.	required

Returns:

Type	Description
`NDArray[float32]`	Feature array of shape (n_proteins, embedding_dim).

featurize_from_fasta ¶

featurize_from_fasta(
    fasta_path: Union[str, Path],
) -> Dict[str, NDArray[np.float32]]

Featurize proteins from a FASTA file.

Parameters:

Name	Type	Description	Default
`fasta_path`	`Union[str, Path]`	Path to the FASTA file.	required

Returns:

Type	Description
`Dict[str, NDArray[float32]]`	Dictionary mapping protein IDs to feature vectors.
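For readers unfamiliar with the format, a minimal FASTA parser looks roughly like this. It is a sketch, not the parser used by the library: `parse_fasta` is a hypothetical helper, and taking the first whitespace-delimited token of the header as the protein ID is an assumption.

```python
from typing import Dict, Iterable, List, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text into {record_id: sequence}."""
    records: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumed convention: ID is the first token after '>'.
            current = (line[1:].split() or [""])[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)   # sequences may span several lines
    return {rid: "".join(parts) for rid, parts in records.items()}


fasta_text = """>P1 example protein
MKTVRQERLK
SIVRILERSK
>P2
MGSSHHHHHH
"""
parsed = parse_fasta(fasta_text.splitlines())
print(parsed)
```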

featurize_directory ¶

featurize_directory(
    directory: Union[str, Path], pattern: str = "*.fasta"
) -> Dict[str, NDArray[np.float32]]

Featurize all proteins from FASTA files in a directory.

Each FASTA file is expected to contain one protein sequence. The filename (without extension) is used as the task/protein ID.

Parameters:

Name	Type	Description	Default
`directory`	`Union[str, Path]`	Path to directory containing FASTA files.	required
`pattern`	`str`	Glob pattern for finding FASTA files.	`'*.fasta'`

Returns:

Type	Description
`Dict[str, NDArray[float32]]`	Dictionary mapping task IDs to feature vectors.
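The filename-to-ID convention can be illustrated with `pathlib`: glob the directory with the pattern and use each file's stem as the key. `collect_fasta_files` is a hypothetical helper written for this sketch; it only shows how IDs are derived and does not featurize anything.

```python
import tempfile
from pathlib import Path
from typing import Dict


def collect_fasta_files(directory, pattern: str = "*.fasta") -> Dict[str, Path]:
    """Map each matching file's stem (name without extension) to its path."""
    return {path.stem: path for path in sorted(Path(directory).glob(pattern))}


with tempfile.TemporaryDirectory() as tmp:
    for name in ("CHEMBL123.fasta", "CHEMBL456.fasta", "notes.txt"):
        (Path(tmp) / name).write_text(">x\nMKT\n")
    task_ids = list(collect_fasta_files(tmp))

print(task_ids)  # ['CHEMBL123', 'CHEMBL456']
```

Files that do not match the pattern, such as `notes.txt` here, are ignored.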

get_feature_dim ¶

get_feature_dim() -> int

Get the feature dimension for this featurizer.

Returns:

Type	Description
`int`	Number of features produced by this featurizer.

Available Models¶

ESM2 Models¶

Model	Parameters	Layers	Embedding Dim
`esm2_t6_8M_UR50D`	8M	6	320
`esm2_t12_35M_UR50D`	35M	12	480
`esm2_t30_150M_UR50D`	150M	30	640
`esm2_t33_650M_UR50D`	650M	33	1280

ESM3 Models¶

Model	Description
`esm3_sm_open_v1`	ESM3 small open model

Usage Examples¶

from themap.features import ProteinFeaturizer

# Initialize with ESM2
featurizer = ProteinFeaturizer(
    featurizer_name="esm2_t33_650M_UR50D",
    device="cuda"  # Run on GPU ('auto' and 'cpu' are also accepted)
)

# Featurize protein sequences
sequences = [
    "MKTVRQERLKSIVRILERSKEPVSG",
    "MGSSHHHHHHSSGLVPRGSHM"
]

embeddings = featurizer.featurize(sequences)
print(f"Embeddings shape: {embeddings.shape}")
# Embeddings shape: (2, 1280)

Reading from FASTA Files¶

from themap.features.protein import read_fasta_file

# Read sequences from FASTA
sequences = read_fasta_file("proteins.fasta")

for seq_id, sequence in sequences.items():
    print(f"{seq_id}: {len(sequence)} residues")

Feature Cache¶

FeatureCache¶

themap.features.cache.FeatureCache ¶

Disk-based feature caching for molecules and proteins.

Caches computed features to NPZ/NPY files for efficient reuse across runs. Supports both molecule features (per-dataset with labels) and protein features (single vector per task).

Attributes:

Name	Type	Description
`cache_dir`		Root directory for cached features.
`molecule_dir`		Directory for molecule features.
`protein_dir`		Directory for protein features.

Directory Structure

cache_dir/
├── molecule/
│   └── {featurizer}/
│       └── {task_id}.npz    # features + labels
└── protein/
    └── {featurizer}/
        └── {task_id}.npy    # single vector
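The layout above maps directly onto `pathlib` path construction. The helpers below are hypothetical and only illustrate how a cache path could be derived from the cache root, featurizer name, and task ID.

```python
from pathlib import Path


def molecule_cache_path(cache_dir, featurizer: str, task_id: str) -> Path:
    """Path of the NPZ file holding features and labels for one dataset."""
    return Path(cache_dir) / "molecule" / featurizer / f"{task_id}.npz"


def protein_cache_path(cache_dir, featurizer: str, task_id: str) -> Path:
    """Path of the NPY file holding the single protein vector for one task."""
    return Path(cache_dir) / "protein" / featurizer / f"{task_id}.npy"


mol_path = molecule_cache_path("feature_cache", "ecfp", "CHEMBL123")
prot_path = protein_cache_path("feature_cache", "esm2_t33_650M_UR50D", "CHEMBL123")
print(mol_path.as_posix())   # feature_cache/molecule/ecfp/CHEMBL123.npz
print(prot_path.as_posix())  # feature_cache/protein/esm2_t33_650M_UR50D/CHEMBL123.npy
```

Because the featurizer name is part of the path, features from different featurizers for the same task never overwrite each other.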

Examples:

>>> cache = FeatureCache("./feature_cache")
>>> # Load if cached; returns (None, None) otherwise
>>> features, labels = cache.load_molecule_features("CHEMBL123", "ecfp")
>>> if features is None:
...     # Compute and save (featurizer, smiles, and dataset_labels are defined by the caller)
...     features = featurizer.featurize(smiles)
...     labels = dataset_labels
...     cache.save_molecule_features("CHEMBL123", "ecfp", features, labels)

__init__ ¶

__init__(cache_dir: Union[str, Path])

Initialize the feature cache.

Parameters:

Name	Type	Description	Default
`cache_dir`	`Union[str, Path]`	Root directory for cached features.	required

has_molecule_features ¶

has_molecule_features(task_id: str, featurizer: str) -> bool

Check if molecule features are cached.

Parameters:

Name	Type	Description	Default
`task_id`	`str`	Task ID to check.	required
`featurizer`	`str`	Name of the featurizer.	required

Returns:

Type	Description
`bool`	True if features are cached.

has_protein_features ¶

has_protein_features(task_id: str, featurizer: str) -> bool

Check if protein features are cached.

Parameters:

Name	Type	Description	Default
`task_id`	`str`	Task ID to check.	required
`featurizer`	`str`	Name of the featurizer.	required

Returns:

Type	Description
`bool`	True if features are cached.

save_molecule_features ¶

save_molecule_features(
    task_id: str,
    featurizer: str,
    features: NDArray[float32],
    labels: NDArray[int32],
    metadata: Optional[Dict[str, Any]] = None,
) -> Path

Save molecule features to cache.

Parameters:

Name	Type	Description	Default
`task_id`	`str`	Task ID for the dataset.	required
`featurizer`	`str`	Name of the featurizer used.	required
`features`	`NDArray[float32]`	Feature matrix of shape (n_molecules, feature_dim).	required
`labels`	`NDArray[int32]`	Binary labels of shape (n_molecules,).	required
`metadata`	`Optional[Dict[str, Any]]`	Optional metadata dictionary.	`None`

Returns:

Type	Description
`Path`	Path to the saved file.
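Storing features and labels together in one NPZ file can be sketched with NumPy's `savez` and `load`. The array key names `features` and `labels` are assumptions made for this illustration; the keys used inside the library's cache files are not documented here.

```python
import tempfile
from pathlib import Path

import numpy as np

features = np.random.default_rng(0).random((4, 8)).astype(np.float32)
labels = np.array([0, 1, 1, 0], dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "CHEMBL123.npz"
    # Key names "features" and "labels" are assumptions for this sketch.
    np.savez(path, features=features, labels=labels)
    with np.load(path) as data:
        loaded_features = data["features"]
        loaded_labels = data["labels"]

print(loaded_features.shape, loaded_labels.dtype)  # (4, 8) int32
```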

load_molecule_features ¶

load_molecule_features(
    task_id: str, featurizer: str
) -> Tuple[Optional[NDArray[np.float32]], Optional[NDArray[np.int32]]]

Load molecule features from cache.

Parameters:

Name	Type	Description	Default
`task_id`	`str`	Task ID for the dataset.	required
`featurizer`	`str`	Name of the featurizer.	required

Returns:

Type	Description
`Tuple[Optional[NDArray[float32]], Optional[NDArray[int32]]]`	Tuple of (features, labels), or (None, None) if not cached.

save_protein_features ¶

save_protein_features(
    task_id: str, featurizer: str, features: NDArray[float32]
) -> Path

Save protein features to cache.

Parameters:

Name	Type	Description	Default
`task_id`	`str`	Task ID for the protein.	required
`featurizer`	`str`	Name of the featurizer used.	required
`features`	`NDArray[float32]`	Feature vector of shape (feature_dim,).	required

Returns:

Type	Description
`Path`	Path to the saved file.

load_protein_features ¶

load_protein_features(
    task_id: str, featurizer: str
) -> Optional[NDArray[np.float32]]

Load protein features from cache.

Parameters:

Name	Type	Description	Default
`task_id`	`str`	Task ID for the protein.	required
`featurizer`	`str`	Name of the featurizer.	required

Returns:

Type	Description
`Optional[NDArray[float32]]`	Feature vector, or None if not cached.

save_all_molecule_features ¶

save_all_molecule_features(
    features_dict: Dict[str, NDArray[float32]],
    labels_dict: Dict[str, NDArray[int32]],
    featurizer: str,
) -> List[Path]

Save molecule features for multiple datasets.

Parameters:

Name	Type	Description	Default
`features_dict`	`Dict[str, NDArray[float32]]`	Dictionary mapping task IDs to feature matrices.	required
`labels_dict`	`Dict[str, NDArray[int32]]`	Dictionary mapping task IDs to label arrays.	required
`featurizer`	`str`	Name of the featurizer used.	required

Returns:

Type	Description
`List[Path]`	List of paths to saved files.

save_all_protein_features ¶

save_all_protein_features(
    features_dict: Dict[str, NDArray[float32]], featurizer: str
) -> List[Path]

Save protein features for multiple tasks.

Parameters:

Name	Type	Description	Default
`features_dict`	`Dict[str, NDArray[float32]]`	Dictionary mapping task IDs to feature vectors.	required
`featurizer`	`str`	Name of the featurizer used.	required

Returns:

Type	Description
`List[Path]`	List of paths to saved files.

load_all_molecule_features ¶

load_all_molecule_features(
    task_ids: List[str], featurizer: str
) -> Tuple[
    Dict[str, NDArray[np.float32]], Dict[str, NDArray[np.int32]], List[str]
]

Load molecule features for multiple datasets.

Parameters:

Name	Type	Description	Default
`task_ids`	`List[str]`	List of task IDs to load.	required
`featurizer`	`str`	Name of the featurizer.	required

Returns:

Type	Description
`Tuple[Dict[str, NDArray[float32]], Dict[str, NDArray[int32]], List[str]]`	Tuple of (features_dict, labels_dict, missing_ids). missing_ids contains task IDs that were not found in cache.

load_all_protein_features ¶

load_all_protein_features(
    task_ids: List[str], featurizer: str
) -> Tuple[Dict[str, NDArray[np.float32]], List[str]]

Load protein features for multiple tasks.

Parameters:

Name	Type	Description	Default
`task_ids`	`List[str]`	List of task IDs to load.	required
`featurizer`	`str`	Name of the featurizer.	required

Returns:

Type	Description
`Tuple[Dict[str, NDArray[float32]], List[str]]`	Tuple of (features_dict, missing_ids). missing_ids contains task IDs that were not found in cache.
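The `missing_ids` return value lends itself to a load-or-compute pattern: load what is cached, compute only what is missing, and write the new results back. The sketch below uses an in-memory dictionary in place of the disk cache; `load_all`, `load_or_compute`, and `compute` are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple


def load_all(store: Dict[str, list], task_ids: List[str]) -> Tuple[Dict[str, list], List[str]]:
    """Return cached entries plus the IDs that were not found."""
    found = {t: store[t] for t in task_ids if t in store}
    missing = [t for t in task_ids if t not in store]
    return found, missing


def load_or_compute(
    store: Dict[str, list], task_ids: List[str], compute: Callable[[str], list]
) -> Dict[str, list]:
    """Fill cache misses by computing them and writing them back to the store."""
    found, missing = load_all(store, task_ids)
    for task_id in missing:
        store[task_id] = compute(task_id)
        found[task_id] = store[task_id]
    return found


store = {"task1": [0.1, 0.2]}   # pretend "task1" is already cached
computed: List[str] = []


def compute(task_id: str) -> list:
    # Hypothetical stand-in for an expensive featurization.
    computed.append(task_id)
    return [0.0, 0.0]


result = load_or_compute(store, ["task1", "task2"], compute)
print(sorted(result), computed)  # ['task1', 'task2'] ['task2']
```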

clear ¶

clear(featurizer: Optional[str] = None) -> int

Clear cached features.

Parameters:

Name	Type	Description	Default
`featurizer`	`Optional[str]`	If specified, only clear features for this featurizer. If None, clear all cached features.	`None`

Returns:

Type	Description
`int`	Number of files deleted.
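Given the directory layout of the cache, clearing by featurizer amounts to deleting the files under each `{featurizer}` subdirectory and counting them. The function below is a hypothetical sketch of that behavior on a temporary directory, not the library's implementation.

```python
import tempfile
from pathlib import Path
from typing import Optional


def clear_cache(cache_dir, featurizer: Optional[str] = None) -> int:
    """Delete cached .npz/.npy files and return how many were removed."""
    deleted = 0
    for kind in ("molecule", "protein"):
        root = Path(cache_dir) / kind
        if featurizer is not None:
            root = root / featurizer        # restrict to one featurizer
        if not root.exists():
            continue
        for path in list(root.rglob("*.np[yz]")):
            path.unlink()
            deleted += 1
    return deleted


with tempfile.TemporaryDirectory() as tmp:
    for rel in ("molecule/ecfp/t1.npz", "molecule/maccs/t1.npz", "protein/esm2/t1.npy"):
        target = Path(tmp) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(b"")
    removed_ecfp = clear_cache(tmp, featurizer="ecfp")
    removed_rest = clear_cache(tmp)

print(removed_ecfp, removed_rest)  # 1 2
```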

get_statistics ¶

get_statistics() -> Dict[str, Any]

Get statistics about cached features.

Returns:

Type	Description
`Dict[str, Any]`	Dictionary with cache statistics.

Usage Examples¶

from themap.features import FeatureCache

# Initialize cache
cache = FeatureCache(cache_dir="cache/features")

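# Load features if cached, otherwise compute and save them
task_id, featurizer_name = "task1", "ecfp"
if cache.has_molecule_features(task_id, featurizer_name):
    features, labels = cache.load_molecule_features(task_id, featurizer_name)
else:
    features, labels = compute_features()  # user-supplied; returns (features, labels)
    cache.save_molecule_features(task_id, featurizer_name, features, labels)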

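Combining a Featurizer with the Cache¶

from themap.features import MoleculeFeaturizer, FeatureCache

cache = FeatureCache(cache_dir="cache/")
featurizer = MoleculeFeaturizer(featurizer_name="ecfp")

# smiles_list and dataset_labels are assumed to come from the dataset for "task1"
features, labels = cache.load_molecule_features("task1", "ecfp")
if features is None:
    # First run: compute and cache
    features = featurizer.featurize(smiles_list)
    labels = dataset_labels
    cache.save_molecule_features("task1", "ecfp", features, labels)

# Later runs take the load path above and skip featurization (fast)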

Performance Optimization¶

Choosing the Right Featurizer¶

def choose_featurizer(dataset_size: int, accuracy_priority: bool) -> str:
    """Choose appropriate featurizer based on requirements."""
    if dataset_size > 100000:
        return "ecfp"  # Fast fingerprints for large datasets
    elif accuracy_priority:
        return "ChemBERTa-77M-MLM"  # Neural embeddings for accuracy
    else:
        return "desc2D"  # Good balance of speed and quality

Parallel Processing¶

from themap.features import MoleculeFeaturizer

# Use multiple CPU cores
featurizer = MoleculeFeaturizer(
    featurizer_name="mordred",
    n_jobs=16  # Use 16 parallel workers
)

GPU Acceleration¶

from themap.features import ProteinFeaturizer

# Use GPU for neural models
featurizer = ProteinFeaturizer(
    featurizer_name="esm2_t33_650M_UR50D",
    device="cuda"  # Documented values: 'auto', 'cpu', 'cuda'
)

# Featurize all sequences in a single call
embeddings = featurizer.featurize(sequences)

Error Handling¶

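import numpy as np
from themap.features import MoleculeFeaturizer

featurizer = MoleculeFeaturizer(featurizer_name="ecfp")

# Handle invalid SMILES
smiles_list = ["CCO", "invalid_smiles", "CCCO"]

# Strict mode: with ignore_errors=False an invalid SMILES is expected to raise
# (the exception type is not documented here; ValueError is an assumption)
try:
    features = featurizer.featurize(smiles_list, ignore_errors=False)
except ValueError as e:
    print(f"Invalid SMILES: {e}")

# Default mode: invalid SMILES yield NaN instead of raising
features = featurizer.featurize(smiles_list, ignore_errors=True)
valid = ~np.isnan(features).any(axis=1)  # rows without NaN
features = features[valid]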

Integration with Distance Computation¶

from themap.features import MoleculeFeaturizer
from themap.distance import compute_dataset_distance_matrix
import numpy as np

# Featurize datasets
featurizer = MoleculeFeaturizer(featurizer_name="ecfp")

source_features = featurizer.featurize(source_smiles)
target_features = featurizer.featurize(target_smiles)

# Compute distances
distances = compute_dataset_distance_matrix(
    source_features,
    target_features,
    method="euclidean"
)

Constants¶

Available Featurizer Names¶

from themap.features.molecule import (
    FINGERPRINT_FEATURIZERS,
    DESCRIPTOR_FEATURIZERS,
    NEURAL_FEATURIZERS,
)

print("Fingerprints:", FINGERPRINT_FEATURIZERS)
print("Descriptors:", DESCRIPTOR_FEATURIZERS)
print("Neural:", NEURAL_FEATURIZERS)

ESM Model Names¶

from themap.features.protein import ESM2_MODELS, ESM3_MODELS

print("ESM2 models:", ESM2_MODELS)
print("ESM3 models:", ESM3_MODELS)
