Features Module¶
The features module provides unified feature extraction for molecules and proteins. It handles featurization, caching, and batch processing.
Overview¶
The features system consists of three main components:
- MoleculeFeaturizer - Extract molecular representations (fingerprints, descriptors, embeddings)
- ProteinFeaturizer - Extract protein sequence embeddings (ESM2, ESM3)
- FeatureCache - Efficient caching for expensive feature computations
Molecule Featurizer¶
MoleculeFeaturizer¶
themap.features.molecule.MoleculeFeaturizer ¶
Efficient molecule featurization using molfeat.
Provides batch featurization with SMILES deduplication for efficiency. Supports fingerprints, descriptors, and neural embeddings.
Attributes:
| Name | Type | Description |
|---|---|---|
| featurizer_name | | Name of the featurizer to use. |
| n_jobs | | Number of parallel workers for featurization. |
| _transformer | | Cached molfeat transformer instance. |
Examples:
>>> featurizer = MoleculeFeaturizer("ecfp")
>>> features = featurizer.featurize(["CCO", "CCCO", "CCCCO"])
>>> print(features.shape) # (3, 2048)
>>> # With SMILES deduplication
>>> smiles = ["CCO", "CCCO", "CCO", "CCCCO", "CCCO"]
>>> features = featurizer.featurize_deduplicated(smiles)
>>> print(features.shape) # (5, 2048) - returns features for all input SMILES
__init__ ¶
Initialize the molecule featurizer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| featurizer_name | str | Name of the molfeat featurizer to use. | 'ecfp' |
| n_jobs | int | Number of parallel workers for featurization. | 8 |
| device | str | Device for neural featurizers ('auto', 'cpu', 'cuda'). | 'auto' |
featurize ¶
Featurize one or more SMILES strings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| smiles | Union[str, List[str]] | Single SMILES string or list of SMILES. | required |
| ignore_errors | bool | If True, return NaN for invalid SMILES. | True |

Returns:
| Type | Description |
|---|---|
| NDArray[float32] | Feature array of shape (n_molecules, feature_dim). |
featurize_deduplicated ¶
Featurize SMILES with deduplication for efficiency.
Unique SMILES are featurized once, then results are mapped back to the original list. This is efficient when there are many duplicate SMILES across datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| smiles | List[str] | List of SMILES strings (may contain duplicates). | required |
| ignore_errors | bool | If True, return NaN for invalid SMILES. | True |

Returns:
| Type | Description |
|---|---|
| NDArray[float32] | Feature array of shape (n_molecules, feature_dim) in original order. |
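
The deduplicate-then-map-back idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the library's implementation: `featurize_with_dedup` and `toy_featurizer` are hypothetical names, and plain Python lists stand in for the NumPy arrays the real method returns.

```python
from typing import Callable, Dict, List, Sequence


def featurize_with_dedup(
    smiles: Sequence[str],
    featurize_one: Callable[[str], List[float]],
) -> List[List[float]]:
    """Featurize each unique SMILES once, then map results back to input order."""
    # dict.fromkeys drops duplicates while keeping first-seen order
    unique = list(dict.fromkeys(smiles))
    lookup: Dict[str, List[float]] = {s: featurize_one(s) for s in unique}
    return [lookup[s] for s in smiles]


calls: List[str] = []


def toy_featurizer(s: str) -> List[float]:
    """Hypothetical stand-in for an expensive featurizer; records each call."""
    calls.append(s)
    return [float(len(s)), float(s.count("C"))]


smiles = ["CCO", "CCCO", "CCO", "CCCCO", "CCCO"]
features = featurize_with_dedup(smiles, toy_featurizer)
print(len(features), len(calls))  # 5 rows returned, 3 featurizer calls
```

The output has one row per input SMILES, while the expensive step runs only once per distinct string.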
featurize_datasets ¶
featurize_datasets(
    datasets: Dict[str, MoleculeDataset], deduplicate: bool = True
) -> Dict[str, NDArray[np.float32]]
Featurize multiple datasets efficiently.
When deduplicate=True, collects all unique SMILES across all datasets, featurizes them once, then maps back to each dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| datasets | Dict[str, MoleculeDataset] | Dictionary mapping task IDs to MoleculeDataset instances. | required |
| deduplicate | bool | If True, deduplicate SMILES across all datasets. | True |

Returns:
| Type | Description |
|---|---|
| Dict[str, NDArray[float32]] | Dictionary mapping task IDs to feature arrays. |
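
A rough sketch of what global deduplication across datasets might look like follows. It assumes each dataset can be reduced to a list of SMILES strings; `featurize_datasets_sketch` and `toy_batch_featurizer` are hypothetical helpers for illustration, not part of themap.

```python
from typing import Callable, Dict, List


def featurize_datasets_sketch(
    datasets: Dict[str, List[str]],
    featurize_batch: Callable[[List[str]], List[List[float]]],
) -> Dict[str, List[List[float]]]:
    """Featurize the union of all SMILES once, then rebuild per-task results."""
    # Collect unique SMILES across every task, preserving first-seen order
    unique = list(dict.fromkeys(s for smiles in datasets.values() for s in smiles))
    # A single batched call covers all tasks
    lookup = dict(zip(unique, featurize_batch(unique)))
    return {task_id: [lookup[s] for s in smiles] for task_id, smiles in datasets.items()}


batch_sizes: List[int] = []


def toy_batch_featurizer(batch: List[str]) -> List[List[float]]:
    """Hypothetical stand-in that records how many molecules it was given."""
    batch_sizes.append(len(batch))
    return [[float(len(s))] for s in batch]


datasets = {"task1": ["CCO", "CCCO"], "task2": ["CCO", "CC(=O)O"]}
per_task = featurize_datasets_sketch(datasets, toy_batch_featurizer)
print(batch_sizes)  # one batch containing the 3 distinct SMILES
```

Here "CCO" appears in both tasks but is featurized once, which is where the saving comes from when tasks share many molecules.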
get_feature_dim ¶
Get the feature dimension for this featurizer.
Returns:
| Type | Description |
|---|---|
| int | Number of features produced by this featurizer. |
Available Featurizers¶
Fingerprints (Fast)¶
| Featurizer | Description | Dimensions |
|---|---|---|
| ecfp | Extended Connectivity Fingerprints | 2048 |
| maccs | MACCS Structural Keys | 167 |
| topological | Topological Fingerprints | 2048 |
| avalon | Avalon Fingerprints | 512 |
Descriptors (Medium Speed)¶
| Featurizer | Description | Dimensions |
|---|---|---|
| desc2D | 2D Molecular Descriptors | ~200 |
| mordred | Mordred Descriptors | ~1600 |
Neural Embeddings (Slow, GPU Recommended)¶
| Featurizer | Description | Dimensions |
|---|---|---|
| ChemBERTa-77M-MLM | ChemBERTa masked language model | 384 |
| ChemBERTa-77M-MTR | ChemBERTa multi-task regression | 384 |
| MolT5 | Molecular T5 embeddings | 768 |
| Roberta-Zinc480M-102M | RoBERTa trained on ZINC | 768 |
| gin_supervised_* | Graph neural network embeddings | 300 |
Usage Examples¶
from themap.features import MoleculeFeaturizer
# Initialize featurizer
featurizer = MoleculeFeaturizer(
    featurizer_name="ecfp",
    n_jobs=8
)
# Featurize a list of SMILES
smiles_list = ["CCO", "CCCO", "CC(=O)O"]
features = featurizer.featurize(smiles_list)
print(f"Features shape: {features.shape}")
# Features shape: (3, 2048)
Batch Processing with Deduplication¶
from themap.features import MoleculeFeaturizer
featurizer = MoleculeFeaturizer(featurizer_name="ecfp")
# Featurize multiple datasets with global deduplication
datasets = {
    "task1": dataset1,  # MoleculeDataset objects
    "task2": dataset2,
}
features = featurizer.featurize_datasets(
    datasets,
    deduplicate=True  # Avoid re-computing for duplicate SMILES
)
for task_id, task_features in features.items():
    print(f"{task_id}: {task_features.shape}")
Protein Featurizer¶
ProteinFeaturizer¶
themap.features.protein.ProteinFeaturizer ¶
Efficient protein featurization using ESM models.
Provides batch featurization of protein sequences using ESM2 or ESM3 models. Models are cached globally to avoid reloading.
Attributes:
| Name | Type | Description |
|---|---|---|
| featurizer_name | | Name of the ESM model to use. |
| layer | | Which transformer layer to extract embeddings from. |
| device | | Device for computation ('auto', 'cpu', 'cuda'). |
Examples:
>>> featurizer = ProteinFeaturizer("esm2_t33_650M_UR50D")
>>> sequences = {"P1": "MKTVRQ...", "P2": "MENLNM..."}
>>> features = featurizer.featurize(sequences)
>>> print(features.shape) # (2, 1280)
__init__ ¶
__init__(
    featurizer_name: str = "esm2_t33_650M_UR50D",
    layer: Optional[int] = None,
    device: str = "auto",
)
Initialize the protein featurizer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| featurizer_name | str | Name of the ESM model to use. | 'esm2_t33_650M_UR50D' |
| layer | Optional[int] | Which transformer layer to extract embeddings from. If None, uses the default for the model. | None |
| device | str | Device for computation ('auto', 'cpu', 'cuda'). | 'auto' |
featurize ¶
Featurize protein sequences.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| sequences | Union[Dict[str, str], List[str]] | Either a dictionary mapping protein IDs to sequences, or a list of sequences. | required |

Returns:
| Type | Description |
|---|---|
| NDArray[float32] | Feature array of shape (n_proteins, embedding_dim). |
featurize_from_fasta ¶
Featurize proteins from a FASTA file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| fasta_path | Union[str, Path] | Path to the FASTA file. | required |

Returns:
| Type | Description |
|---|---|
| Dict[str, NDArray[float32]] | Dictionary mapping protein IDs to feature vectors. |
featurize_directory ¶
featurize_directory(
    directory: Union[str, Path], pattern: str = "*.fasta"
) -> Dict[str, NDArray[np.float32]]
Featurize all proteins from FASTA files in a directory.
Each FASTA file is expected to contain one protein sequence. The filename (without extension) is used as the task/protein ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| directory | Union[str, Path] | Path to directory containing FASTA files. | required |
| pattern | str | Glob pattern for finding FASTA files. | '*.fasta' |

Returns:
| Type | Description |
|---|---|
| Dict[str, NDArray[float32]] | Dictionary mapping task IDs to feature vectors. |
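
To make the "one protein per file, filename as ID" convention concrete, here is a minimal sketch that only gathers the sequences; the embedding step is left out. The helpers `read_single_fasta` and `collect_sequences` are hypothetical, and the FASTA parsing is deliberately simplistic (header lines start with ">", and all other lines are concatenated).

```python
import tempfile
from pathlib import Path
from typing import Dict


def read_single_fasta(path: Path) -> str:
    """Return the sequence from a one-record FASTA file, skipping header lines."""
    lines = path.read_text().splitlines()
    return "".join(line.strip() for line in lines if line and not line.startswith(">"))


def collect_sequences(directory: Path, pattern: str = "*.fasta") -> Dict[str, str]:
    """Map each matching file's stem (name without extension) to its sequence."""
    return {p.stem: read_single_fasta(p) for p in sorted(directory.glob(pattern))}


# Build a throwaway directory with two FASTA files and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "CHEMBL123.fasta").write_text(">sp|P1\nMKTVRQ\nERLKS\n")
    (root / "CHEMBL456.fasta").write_text(">sp|P2\nMENLNM\n")
    (root / "notes.txt").write_text("ignored")
    sequences = collect_sequences(root)

print(sequences)
```

Files that do not match the glob pattern are skipped, and the resulting keys are the same task IDs the method's return dictionary uses.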
get_feature_dim ¶
Get the feature dimension for this featurizer.
Returns:
| Type | Description |
|---|---|
| int | Number of features produced by this featurizer. |
Available Models¶
ESM2 Models¶
| Model | Parameters | Layers | Embedding Dim |
|---|---|---|---|
| esm2_t6_8M_UR50D | 8M | 6 | 320 |
| esm2_t12_35M_UR50D | 35M | 12 | 480 |
| esm2_t30_150M_UR50D | 150M | 30 | 640 |
| esm2_t33_650M_UR50D | 650M | 33 | 1280 |
ESM3 Models¶
| Model | Description |
|---|---|
| esm3_sm_open_v1 | ESM3 small open model |
Usage Examples¶
from themap.features import ProteinFeaturizer
# Initialize with ESM2
featurizer = ProteinFeaturizer(
    featurizer_name="esm2_t33_650M_UR50D",
    device="cuda"  # Run on GPU; 'auto' and 'cpu' are also accepted
)
# Featurize protein sequences
sequences = [
    "MKTVRQERLKSIVRILERSKEPVSG",
    "MGSSHHHHHHSSGLVPRGSHM"
]
embeddings = featurizer.featurize(sequences)
print(f"Embeddings shape: {embeddings.shape}")
# Embeddings shape: (2, 1280)
Reading from FASTA Files¶
from themap.features.protein import read_fasta_file
# Read sequences from FASTA
sequences = read_fasta_file("proteins.fasta")
for seq_id, sequence in sequences.items():
    print(f"{seq_id}: {len(sequence)} residues")
Feature Cache¶
FeatureCache¶
themap.features.cache.FeatureCache ¶
Disk-based feature caching for molecules and proteins.
Caches computed features to NPZ/NPY files for efficient reuse across runs. Supports both molecule features (per-dataset with labels) and protein features (single vector per task).
Attributes:
| Name | Type | Description |
|---|---|---|
| cache_dir | | Root directory for cached features. |
| molecule_dir | | Directory for molecule features. |
| protein_dir | | Directory for protein features. |
Directory Structure
cache_dir/
├── molecule/
│   └── {featurizer}/
│       └── {task_id}.npz    # features + labels
└── protein/
    └── {featurizer}/
        └── {task_id}.npy    # single vector
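
The layout above implies a simple mapping from (kind, featurizer, task ID) to a file path. The sketch below shows one way to express it with pathlib; `molecule_cache_path` and `protein_cache_path` are hypothetical helper names for illustration and are not documented FeatureCache methods.

```python
from pathlib import Path


def molecule_cache_path(cache_dir: Path, featurizer: str, task_id: str) -> Path:
    """Path of the .npz file holding features + labels for one task."""
    return cache_dir / "molecule" / featurizer / f"{task_id}.npz"


def protein_cache_path(cache_dir: Path, featurizer: str, task_id: str) -> Path:
    """Path of the .npy file holding a single protein vector for one task."""
    return cache_dir / "protein" / featurizer / f"{task_id}.npy"


root = Path("feature_cache")
mol_path = molecule_cache_path(root, "ecfp", "CHEMBL123")
prot_path = protein_cache_path(root, "esm2_t33_650M_UR50D", "CHEMBL123")
print(mol_path.as_posix())   # feature_cache/molecule/ecfp/CHEMBL123.npz
print(prot_path.as_posix())  # feature_cache/protein/esm2_t33_650M_UR50D/CHEMBL123.npy
```

Because the featurizer name is part of the path, features from different featurizers for the same task never collide.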
Examples:
>>> cache = FeatureCache("./feature_cache")
>>> # Save molecule features
>>> cache.save_molecule_features("CHEMBL123", "ecfp", features, labels)
>>> # Load if exists
>>> features, labels = cache.load_molecule_features("CHEMBL123", "ecfp")
>>> if features is None:
...     # Not cached: compute features and labels here, then save them
...     cache.save_molecule_features("CHEMBL123", "ecfp", features, labels)
__init__ ¶
Initialize the feature cache.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| cache_dir | Union[str, Path] | Root directory for cached features. | required |
has_molecule_features ¶
Check if molecule features are cached.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| task_id | str | Task ID to check. | required |
| featurizer | str | Name of the featurizer. | required |

Returns:
| Type | Description |
|---|---|
| bool | True if features are cached. |
has_protein_features ¶
Check if protein features are cached.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| task_id | str | Task ID to check. | required |
| featurizer | str | Name of the featurizer. | required |

Returns:
| Type | Description |
|---|---|
| bool | True if features are cached. |
save_molecule_features ¶
save_molecule_features(
    task_id: str,
    featurizer: str,
    features: NDArray[float32],
    labels: NDArray[int32],
    metadata: Optional[Dict[str, Any]] = None,
) -> Path
Save molecule features to cache.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| task_id | str | Task ID for the dataset. | required |
| featurizer | str | Name of the featurizer used. | required |
| features | NDArray[float32] | Feature matrix of shape (n_molecules, feature_dim). | required |
| labels | NDArray[int32] | Binary labels of shape (n_molecules,). | required |
| metadata | Optional[Dict[str, Any]] | Optional metadata dictionary. | None |

Returns:
| Type | Description |
|---|---|
| Path | Path to the saved file. |
load_molecule_features ¶
load_molecule_features(
    task_id: str, featurizer: str
) -> Tuple[Optional[NDArray[np.float32]], Optional[NDArray[np.int32]]]
Load molecule features from cache.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| task_id | str | Task ID for the dataset. | required |
| featurizer | str | Name of the featurizer. | required |

Returns:
| Type | Description |
|---|---|
| Tuple[Optional[NDArray[float32]], Optional[NDArray[int32]]] | Tuple of (features, labels), or (None, None) if not cached. |
save_protein_features ¶
Save protein features to cache.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| task_id | str | Task ID for the protein. | required |
| featurizer | str | Name of the featurizer used. | required |
| features | NDArray[float32] | Feature vector of shape (feature_dim,). | required |

Returns:
| Type | Description |
|---|---|
| Path | Path to the saved file. |
load_protein_features ¶
Load protein features from cache.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| task_id | str | Task ID for the protein. | required |
| featurizer | str | Name of the featurizer. | required |

Returns:
| Type | Description |
|---|---|
| Optional[NDArray[float32]] | Feature vector, or None if not cached. |
save_all_molecule_features ¶
save_all_molecule_features(
    features_dict: Dict[str, NDArray[float32]],
    labels_dict: Dict[str, NDArray[int32]],
    featurizer: str,
) -> List[Path]
Save molecule features for multiple datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| features_dict | Dict[str, NDArray[float32]] | Dictionary mapping task IDs to feature matrices. | required |
| labels_dict | Dict[str, NDArray[int32]] | Dictionary mapping task IDs to label arrays. | required |
| featurizer | str | Name of the featurizer used. | required |

Returns:
| Type | Description |
|---|---|
| List[Path] | List of paths to saved files. |
save_all_protein_features ¶
save_all_protein_features(
    features_dict: Dict[str, NDArray[float32]], featurizer: str
) -> List[Path]
Save protein features for multiple tasks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| features_dict | Dict[str, NDArray[float32]] | Dictionary mapping task IDs to feature vectors. | required |
| featurizer | str | Name of the featurizer used. | required |

Returns:
| Type | Description |
|---|---|
| List[Path] | List of paths to saved files. |
load_all_molecule_features ¶
load_all_molecule_features(
    task_ids: List[str], featurizer: str
) -> Tuple[
    Dict[str, NDArray[np.float32]], Dict[str, NDArray[np.int32]], List[str]
]
Load molecule features for multiple datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| task_ids | List[str] | List of task IDs to load. | required |
| featurizer | str | Name of the featurizer. | required |

Returns:
| Type | Description |
|---|---|
| Tuple[Dict[str, NDArray[float32]], Dict[str, NDArray[int32]], List[str]] | Tuple of (features_dict, labels_dict, missing_ids), where missing_ids contains task IDs that were not found in cache. |
load_all_protein_features ¶
load_all_protein_features(
    task_ids: List[str], featurizer: str
) -> Tuple[Dict[str, NDArray[np.float32]], List[str]]
Load protein features for multiple tasks.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| task_ids | List[str] | List of task IDs to load. | required |
| featurizer | str | Name of the featurizer. | required |

Returns:
| Type | Description |
|---|---|
| Tuple[Dict[str, NDArray[float32]], List[str]] | Tuple of (features_dict, missing_ids), where missing_ids contains task IDs that were not found in cache. |
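
The (features_dict, missing_ids) return shape lends itself to a "load what exists, compute the rest" workflow. The sketch below mimics that contract with an in-memory dictionary standing in for the on-disk cache; `load_all_sketch` and `fake_cache` are hypothetical and exist only to illustrate the partitioning logic.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


def load_all_sketch(
    task_ids: List[str],
    load_one: Callable[[str], Optional[Vector]],
) -> Tuple[Dict[str, Vector], List[str]]:
    """Split task IDs into those found in the cache and those missing."""
    features_dict: Dict[str, Vector] = {}
    missing_ids: List[str] = []
    for task_id in task_ids:
        features = load_one(task_id)
        if features is None:
            missing_ids.append(task_id)  # caller still has to compute these
        else:
            features_dict[task_id] = features
    return features_dict, missing_ids


# In-memory stand-in for cached protein vectors (assumption for this sketch)
fake_cache: Dict[str, Vector] = {"CHEMBL123": [0.1, 0.2], "CHEMBL456": [0.3, 0.4]}

found, missing = load_all_sketch(["CHEMBL123", "CHEMBL999", "CHEMBL456"], fake_cache.get)
print(sorted(found), missing)
```

A caller would typically featurize only the IDs in the missing list and then save them, so later runs find everything cached. The molecule variant follows the same pattern with an additional labels dictionary.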
clear ¶
Clear cached features.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| featurizer | Optional[str] | If specified, only clear features for this featurizer. If None, clear all cached features. | None |

Returns:
| Type | Description |
|---|---|
| int | Number of files deleted. |
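
As a rough illustration of the semantics described for clear (optional per-featurizer scope, returning a count of deleted files), the following sketch operates on the directory layout documented for the cache. `clear_cache` is a hypothetical function written for this example; the real method's internals may differ.

```python
import tempfile
from pathlib import Path
from typing import Optional


def clear_cache(cache_dir: Path, featurizer: Optional[str] = None) -> int:
    """Delete cached .npz/.npy files, optionally for one featurizer only."""
    deleted = 0
    for kind in ("molecule", "protein"):
        base = cache_dir / kind
        if featurizer is not None:
            base = base / featurizer
        if not base.exists():
            continue
        # Materialize the listing first so deletion does not disturb iteration
        for path in list(base.rglob("*.np[yz]")):
            path.unlink()
            deleted += 1
    return deleted


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in (
        "molecule/ecfp/t1.npz",
        "molecule/ecfp/t2.npz",
        "molecule/maccs/t1.npz",
        "protein/esm2/t1.npy",
    ):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(b"")
    removed_ecfp = clear_cache(root, "ecfp")  # only the two ecfp files
    removed_rest = clear_cache(root)          # everything that is left

print(removed_ecfp, removed_rest)
```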
get_statistics ¶
Get statistics about cached features.
Returns:
| Type | Description |
|---|---|
| Dict[str, Any] | Dictionary with cache statistics. |
Usage Examples¶
from themap.features import FeatureCache
# Initialize cache
cache = FeatureCache(cache_dir="cache/features")
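# Check if features are cached
task_id, featurizer_name = "task1", "ecfp"
if cache.has_molecule_features(task_id, featurizer_name):
    features, labels = cache.load_molecule_features(task_id, featurizer_name)
else:
    features, labels = compute_features()  # user-supplied; returns (features, labels)
    cache.save_molecule_features(task_id, featurizer_name, features, labels)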
Automatic Caching¶
from themap.features import MoleculeFeaturizer, FeatureCache
cache = FeatureCache(cache_dir="cache/")
featurizer = MoleculeFeaturizer(
    featurizer_name="ecfp",
    cache=cache  # Enable automatic caching
)
# First call computes and caches
features1 = featurizer.featurize(smiles_list)
# Second call loads from cache (fast)
features2 = featurizer.featurize(smiles_list)
Performance Optimization¶
Choosing the Right Featurizer¶
def choose_featurizer(dataset_size: int, accuracy_priority: bool) -> str:
    """Choose appropriate featurizer based on requirements."""
    if dataset_size > 100000:
        return "ecfp"  # Fast fingerprints for large datasets
    elif accuracy_priority:
        return "ChemBERTa-77M-MLM"  # Neural embeddings for accuracy
    else:
        return "desc2D"  # Good balance of speed and quality
Parallel Processing¶
from themap.features import MoleculeFeaturizer
# Use multiple CPU cores
featurizer = MoleculeFeaturizer(
    featurizer_name="mordred",
    n_jobs=16  # Use 16 parallel workers
)
GPU Acceleration¶
from themap.features import ProteinFeaturizer
# Use GPU for neural models
featurizer = ProteinFeaturizer(
    featurizer_name="esm2_t33_650M_UR50D",
    device="cuda:0"  # Specific GPU
)
# Featurize all sequences in one call
embeddings = featurizer.featurize(sequences)
Error Handling¶
from themap.features import MoleculeFeaturizer
featurizer = MoleculeFeaturizer(featurizer_name="ecfp")
# Handle invalid SMILES
smiles_list = ["CCO", "invalid_smiles", "CCCO"]
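try:
    # Disable the default NaN fallback so invalid SMILES raise
    features = featurizer.featurize(smiles_list, ignore_errors=False)
except ValueError as e:
    print(f"Invalid SMILES: {e}")
# Or use safe mode (the default)
features = featurizer.featurize(
    smiles_list,
    ignore_errors=True  # Invalid molecules yield NaN instead of raising
)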
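
When invalid SMILES come back as NaN, downstream code usually needs to drop those rows. The sketch below shows one way to do that with NumPy. It assumes an invalid molecule produces a row containing NaN, and it uses a hand-built array in place of real featurizer output.

```python
import numpy as np

smiles = ["CCO", "invalid_smiles", "CCCO"]
# Stand-in for featurizer output: the invalid entry is an all-NaN row (assumption)
features = np.array(
    [[1.0, 0.0], [np.nan, np.nan], [0.0, 1.0]],
    dtype=np.float32,
)

# A row is valid only if none of its entries is NaN
valid_mask = ~np.isnan(features).any(axis=1)
clean_features = features[valid_mask]
clean_smiles = [s for s, ok in zip(smiles, valid_mask) if ok]

print(clean_features.shape, clean_smiles)  # (2, 2) ['CCO', 'CCCO']
```

Filtering the SMILES list with the same mask keeps features and molecules aligned; any label array should be filtered the same way.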
Integration with Distance Computation¶
from themap.features import MoleculeFeaturizer
from themap.distance import compute_dataset_distance_matrix
import numpy as np
# Featurize datasets
featurizer = MoleculeFeaturizer(featurizer_name="ecfp")
source_features = featurizer.featurize(source_smiles)
target_features = featurizer.featurize(target_smiles)
# Compute distances
distances = compute_dataset_distance_matrix(
    source_features,
    target_features,
    method="euclidean"
)
Constants¶
Available Featurizer Names¶
from themap.features.molecule import (
    FINGERPRINT_FEATURIZERS,
    DESCRIPTOR_FEATURIZERS,
    NEURAL_FEATURIZERS,
)
print("Fingerprints:", FINGERPRINT_FEATURIZERS)
print("Descriptors:", DESCRIPTOR_FEATURIZERS)
print("Neural:", NEURAL_FEATURIZERS)