Features Module

The features module provides unified feature extraction for molecules and proteins. It handles featurization, caching, and batch processing.

Overview

The features system consists of three main components:

  • MoleculeFeaturizer - Extract molecular representations (fingerprints, descriptors, embeddings)
  • ProteinFeaturizer - Extract protein sequence embeddings (ESM2, ESM3)
  • FeatureCache - Efficient caching for expensive feature computations

Molecule Featurizer

MoleculeFeaturizer

themap.features.molecule.MoleculeFeaturizer

Efficient molecule featurization using molfeat.

Provides batch featurization with SMILES deduplication for efficiency. Supports fingerprints, descriptors, and neural embeddings.

Attributes:

Name Type Description
featurizer_name

Name of the featurizer to use.

n_jobs

Number of parallel workers for featurization.

_transformer

Cached molfeat transformer instance.

Examples:

>>> featurizer = MoleculeFeaturizer("ecfp")
>>> features = featurizer.featurize(["CCO", "CCCO", "CCCCO"])
>>> print(features.shape)  # (3, 2048)
>>> # With SMILES deduplication
>>> smiles = ["CCO", "CCCO", "CCO", "CCCCO", "CCCO"]
>>> features = featurizer.featurize_deduplicated(smiles)
>>> print(features.shape)  # (5, 2048) - returns features for all input SMILES

transformer property

transformer

Get or create the molfeat transformer (lazy initialization).
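
Lazy initialization of this kind is commonly written as a property that builds the expensive object on first access and reuses it afterwards. The sketch below only illustrates the pattern: `LazyFeaturizer`, `_build_transformer`, and `build_count` are hypothetical names, and the stand-in transformer is a plain dict rather than a molfeat object.

```python
from typing import Any, Optional


class LazyFeaturizer:
    """Hypothetical class illustrating the lazy-initialization pattern."""

    def __init__(self, featurizer_name: str = "ecfp") -> None:
        self.featurizer_name = featurizer_name
        self._transformer: Optional[Any] = None  # nothing built yet
        self.build_count = 0  # counts how often the expensive step runs

    def _build_transformer(self) -> Any:
        # Stand-in for an expensive construction step (assumption).
        self.build_count += 1
        return {"name": self.featurizer_name}

    @property
    def transformer(self) -> Any:
        # Build on first access, then reuse the cached instance.
        if self._transformer is None:
            self._transformer = self._build_transformer()
        return self._transformer


lazy = LazyFeaturizer("ecfp")
first = lazy.transformer
second = lazy.transformer  # no second build happens here
```

Constructing the object stays cheap, and the cost of building the transformer is paid only if features are actually requested.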

is_fingerprint property

is_fingerprint: bool

Check if this is a fingerprint-based featurizer.

is_neural property

is_neural: bool

Check if this is a neural embedding featurizer.

__init__

__init__(featurizer_name: str = 'ecfp', n_jobs: int = 8, device: str = 'auto')

Initialize the molecule featurizer.

Parameters:

Name Type Description Default
featurizer_name str

Name of the molfeat featurizer to use.

'ecfp'
n_jobs int

Number of parallel workers for featurization.

8
device str

Device for neural featurizers ('auto', 'cpu', 'cuda').

'auto'

featurize

featurize(
    smiles: Union[str, List[str]], ignore_errors: bool = True
) -> NDArray[np.float32]

Featurize one or more SMILES strings.

Parameters:

Name Type Description Default
smiles Union[str, List[str]]

Single SMILES string or list of SMILES.

required
ignore_errors bool

If True, return NaN for invalid SMILES.

True

Returns:

Type Description
NDArray[float32]

Feature array of shape (n_molecules, feature_dim).
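
Because invalid SMILES come back as NaN when ignore_errors is True, downstream code usually needs to mask those rows out. The following is a minimal sketch on a hand-built array; it assumes that every entry of an invalid row is NaN, which the description above does not spell out.

```python
import numpy as np

# Toy feature matrix: row 1 stands in for an invalid SMILES (all NaN, an assumption).
features = np.array(
    [[1.0, 0.0, 1.0], [np.nan, np.nan, np.nan], [0.0, 1.0, 1.0]],
    dtype=np.float32,
)
smiles = ["CCO", "invalid_smiles", "CCCO"]

# A row is valid only if none of its entries is NaN.
valid_mask = ~np.isnan(features).any(axis=1)
clean_features = features[valid_mask]
clean_smiles = [s for s, ok in zip(smiles, valid_mask) if ok]
```

Keeping `valid_mask` around makes it possible to filter labels with the same mask so that features and labels stay aligned.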

featurize_deduplicated

featurize_deduplicated(
    smiles: List[str], ignore_errors: bool = True
) -> NDArray[np.float32]

Featurize SMILES with deduplication for efficiency.

Unique SMILES are featurized once, then results are mapped back to the original list. This is efficient when there are many duplicate SMILES across datasets.

Parameters:

Name Type Description Default
smiles List[str]

List of SMILES strings (may contain duplicates).

required
ignore_errors bool

If True, return NaN for invalid SMILES.

True

Returns:

Type Description
NDArray[float32]

Feature array of shape (n_molecules, feature_dim) in original order.
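
The deduplicate-then-map-back idea can be sketched with `np.unique` and its inverse index. This is an illustration under stated assumptions, not the library's implementation: `toy_featurize` and `featurize_with_dedup` are hypothetical helpers, and the two toy features (string length and count of "O") merely stand in for real fingerprints.

```python
import numpy as np


def toy_featurize(smiles_list):
    # Hypothetical stand-in featurizer: [length, number of 'O' characters].
    return np.array([[len(s), s.count("O")] for s in smiles_list], dtype=np.float32)


def featurize_with_dedup(smiles_list):
    # np.unique returns the sorted unique strings and, for each input
    # position, the index of its unique string.
    unique, inverse = np.unique(np.asarray(smiles_list), return_inverse=True)
    unique_features = toy_featurize(unique.tolist())
    # Indexing with the inverse expands the unique rows back to input order.
    return unique_features[inverse], len(unique)


smiles = ["CCO", "CCCO", "CCO", "CCCCO", "CCCO"]
features, n_unique = featurize_with_dedup(smiles)
```

Only three molecules are featurized here, yet the result has one row per input SMILES in the original order.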

featurize_datasets

featurize_datasets(
    datasets: Dict[str, MoleculeDataset], deduplicate: bool = True
) -> Dict[str, NDArray[np.float32]]

Featurize multiple datasets efficiently.

When deduplicate=True, collects all unique SMILES across all datasets, featurizes them once, then maps back to each dataset.

Parameters:

Name Type Description Default
datasets Dict[str, MoleculeDataset]

Dictionary mapping task IDs to MoleculeDataset instances.

required
deduplicate bool

If True, deduplicate SMILES across all datasets.

True

Returns:

Type Description
Dict[str, NDArray[float32]]

Dictionary mapping task IDs to feature arrays.
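
Global deduplication across datasets can be pictured as a shared lookup table. The sketch below is a simplified illustration: it uses plain lists of SMILES instead of MoleculeDataset objects, and `featurize_many`, `toy_featurize`, and `calls` are hypothetical names introduced for the example.

```python
from typing import Callable, Dict, List, Sequence


def featurize_many(
    datasets: Dict[str, List[str]],
    featurize_fn: Callable[[List[str]], Sequence[Sequence[float]]],
) -> Dict[str, List[Sequence[float]]]:
    # Collect unique SMILES across all datasets, preserving first-seen order.
    unique = list(dict.fromkeys(s for smiles in datasets.values() for s in smiles))
    # Featurize each unique SMILES exactly once.
    lookup = dict(zip(unique, featurize_fn(unique)))
    # Map the shared results back to every dataset in its original order.
    return {task: [lookup[s] for s in smiles] for task, smiles in datasets.items()}


calls: List[int] = []


def toy_featurize(smiles: List[str]) -> List[List[float]]:
    calls.append(len(smiles))  # record how many molecules were featurized
    return [[float(len(s))] for s in smiles]


result = featurize_many(
    {"task1": ["CCO", "CCCO"], "task2": ["CCO", "CCCCO"]}, toy_featurize
)
```

The two toy datasets hold four SMILES in total, but the featurizer is called once on the three unique ones.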

get_feature_dim

get_feature_dim() -> int

Get the feature dimension for this featurizer.

Returns:

Type Description
int

Number of features produced by this featurizer.

Available Featurizers

Fingerprints (Fast)

Featurizer Description Dimensions
ecfp Extended Connectivity Fingerprints 2048
maccs MACCS Structural Keys 167
topological Topological Fingerprints 2048
avalon Avalon Fingerprints 512

Descriptors (Medium Speed)

Featurizer Description Dimensions
desc2D 2D Molecular Descriptors ~200
mordred Mordred Descriptors ~1600

Neural Embeddings

Featurizer Description Dimensions
ChemBERTa-77M-MLM ChemBERTa masked language model 384
ChemBERTa-77M-MTR ChemBERTa multi-task regression 384
MolT5 Molecular T5 embeddings 768
Roberta-Zinc480M-102M RoBERTa trained on ZINC 768
gin_supervised_* Graph neural network embeddings 300

Usage Examples

from themap.features import MoleculeFeaturizer

# Initialize featurizer
featurizer = MoleculeFeaturizer(
    featurizer_name="ecfp",
    n_jobs=8
)

# Featurize a list of SMILES
smiles_list = ["CCO", "CCCO", "CC(=O)O"]
features = featurizer.featurize(smiles_list)

print(f"Features shape: {features.shape}")
# Features shape: (3, 2048)

Batch Processing with Deduplication

from themap.features import MoleculeFeaturizer

featurizer = MoleculeFeaturizer(featurizer_name="ecfp")

# Featurize multiple datasets with global deduplication
datasets = {
    "task1": dataset1,  # MoleculeDataset objects
    "task2": dataset2,
}

features = featurizer.featurize_datasets(
    datasets,
    deduplicate=True  # Avoid re-computing for duplicate SMILES
)

for task_id, task_features in features.items():
    print(f"{task_id}: {task_features.shape}")

Protein Featurizer

ProteinFeaturizer

themap.features.protein.ProteinFeaturizer

Efficient protein featurization using ESM models.

Provides batch featurization of protein sequences using ESM2 or ESM3 models. Models are cached globally to avoid reloading.

Attributes:

Name Type Description
featurizer_name

Name of the ESM model to use.

layer

Which transformer layer to extract embeddings from.

device

Device for computation ('auto', 'cpu', 'cuda').

Examples:

>>> featurizer = ProteinFeaturizer("esm2_t33_650M_UR50D")
>>> sequences = {"P1": "MKTVRQ...", "P2": "MENLNM..."}
>>> features = featurizer.featurize(sequences)
>>> print(features.shape)  # (2, 1280)
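
Global model caching, as mentioned in the class description, is often implemented by memoizing the loader function. The sketch below shows one way to do that with `functools.lru_cache`; `get_model` and `load_log` are hypothetical, the returned dict is a stand-in for a real ESM model, and the library may use a different mechanism.

```python
from functools import lru_cache

load_log = []  # records each real load, for demonstration only


@lru_cache(maxsize=None)
def get_model(model_name: str, device: str = "cpu") -> dict:
    # Stand-in for an expensive model load (assumption: a dict, not a real model).
    load_log.append((model_name, device))
    return {"name": model_name, "device": device}


a = get_model("esm2_t6_8M_UR50D")
b = get_model("esm2_t6_8M_UR50D")    # served from the cache
c = get_model("esm2_t12_35M_UR50D")  # different name, so a new load
```

Every featurizer instance that asks for the same model name then shares one loaded copy.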

is_esm2 property

is_esm2: bool

Check if this is an ESM2 model.

is_esm3 property

is_esm3: bool

Check if this is an ESM3 model.

__init__

__init__(
    featurizer_name: str = "esm2_t33_650M_UR50D",
    layer: Optional[int] = None,
    device: str = "auto",
)

Initialize the protein featurizer.

Parameters:

Name Type Description Default
featurizer_name str

Name of the ESM model to use.

'esm2_t33_650M_UR50D'
layer Optional[int]

Which transformer layer to extract embeddings from. If None, uses the default for the model.

None
device str

Device for computation ('auto', 'cpu', 'cuda').

'auto'

featurize

featurize(sequences: Union[Dict[str, str], List[str]]) -> NDArray[np.float32]

Featurize protein sequences.

Parameters:

Name Type Description Default
sequences Union[Dict[str, str], List[str]]

Either a dictionary mapping protein IDs to sequences, or a list of sequences.

required

Returns:

Type Description
NDArray[float32]

Feature array of shape (n_proteins, embedding_dim).

featurize_from_fasta

featurize_from_fasta(
    fasta_path: Union[str, Path],
) -> Dict[str, NDArray[np.float32]]

Featurize proteins from a FASTA file.

Parameters:

Name Type Description Default
fasta_path Union[str, Path]

Path to the FASTA file.

required

Returns:

Type Description
Dict[str, NDArray[float32]]

Dictionary mapping protein IDs to feature vectors.

featurize_directory

featurize_directory(
    directory: Union[str, Path], pattern: str = "*.fasta"
) -> Dict[str, NDArray[np.float32]]

Featurize all proteins from FASTA files in a directory.

Each FASTA file is expected to contain one protein sequence. The filename (without extension) is used as the task/protein ID.

Parameters:

Name Type Description Default
directory Union[str, Path]

Path to directory containing FASTA files.

required
pattern str

Glob pattern for finding FASTA files.

'*.fasta'

Returns:

Type Description
Dict[str, NDArray[float32]]

Dictionary mapping task IDs to feature vectors.
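
The directory convention described above (one sequence per FASTA file, filename stem as ID) can be sketched with the standard library alone. `read_single_fasta` and `collect_sequences` are hypothetical helpers written for this illustration; they gather sequences only and leave the embedding step out.

```python
from pathlib import Path
from typing import Dict
import tempfile


def read_single_fasta(path: Path) -> str:
    # Join all non-header lines; header lines start with '>'.
    lines = path.read_text().splitlines()
    return "".join(line.strip() for line in lines if line and not line.startswith(">"))


def collect_sequences(directory: Path, pattern: str = "*.fasta") -> Dict[str, str]:
    # The file stem (name without extension) serves as the task/protein ID.
    return {p.stem: read_single_fasta(p) for p in sorted(directory.glob(pattern))}


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "CHEMBL123.fasta").write_text(">sp|demo\nMKTV\nRQER\n")
    (root / "notes.txt").write_text("ignored")  # does not match the pattern
    sequences = collect_sequences(root)
```

The resulting dictionary has the same shape as the dictionary input accepted by featurize.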

get_feature_dim

get_feature_dim() -> int

Get the feature dimension for this featurizer.

Returns:

Type Description
int

Number of features produced by this featurizer.

Available Models

ESM2 Models

Model Parameters Layers Embedding Dim
esm2_t6_8M_UR50D 8M 6 320
esm2_t12_35M_UR50D 35M 12 480
esm2_t30_150M_UR50D 150M 30 640
esm2_t33_650M_UR50D 650M 33 1280
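
The model names in the table encode the layer count and parameter count, so both can be read back from the name. The sketch below is illustrative: `ESM2_EMBEDDING_DIMS` and `parse_esm2_name` are hypothetical and are not part of themap. The dimensions are copied from the table above.

```python
import re

# Embedding dimensions copied from the table above (hypothetical lookup).
ESM2_EMBEDDING_DIMS = {
    "esm2_t6_8M_UR50D": 320,
    "esm2_t12_35M_UR50D": 480,
    "esm2_t30_150M_UR50D": 640,
    "esm2_t33_650M_UR50D": 1280,
}


def parse_esm2_name(name: str) -> dict:
    # Names in the table follow esm2_t<layers>_<params>_UR50D.
    match = re.fullmatch(r"esm2_t(\d+)_(\d+[MB])_UR50D", name)
    if match is None or name not in ESM2_EMBEDDING_DIMS:
        raise ValueError(f"Not an ESM2 model name from the table: {name}")
    return {
        "layers": int(match.group(1)),
        "parameters": match.group(2),
        "embedding_dim": ESM2_EMBEDDING_DIMS[name],
    }


info = parse_esm2_name("esm2_t33_650M_UR50D")
```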

ESM3 Models

Model Description
esm3_sm_open_v1 ESM3 small open model

Usage Examples

from themap.features import ProteinFeaturizer

# Initialize with ESM2
featurizer = ProteinFeaturizer(
    featurizer_name="esm2_t33_650M_UR50D",
    device="cuda"  # Use GPU if available
)

# Featurize protein sequences
sequences = [
    "MKTVRQERLKSIVRILERSKEPVSG",
    "MGSSHHHHHHSSGLVPRGSHM"
]

embeddings = featurizer.featurize(sequences)
print(f"Embeddings shape: {embeddings.shape}")
# Embeddings shape: (2, 1280)

Reading from FASTA Files

from themap.features.protein import read_fasta_file

# Read sequences from FASTA
sequences = read_fasta_file("proteins.fasta")

for seq_id, sequence in sequences.items():
    print(f"{seq_id}: {len(sequence)} residues")

Feature Cache

FeatureCache

themap.features.cache.FeatureCache

Disk-based feature caching for molecules and proteins.

Caches computed features to NPZ/NPY files for efficient reuse across runs. Supports both molecule features (per-dataset with labels) and protein features (single vector per task).

Attributes:

Name Type Description
cache_dir

Root directory for cached features.

molecule_dir

Directory for molecule features.

protein_dir

Directory for protein features.

Directory Structure

cache_dir/
├── molecule/
│   └── {featurizer}/
│       └── {task_id}.npz    # features + labels
└── protein/
    └── {featurizer}/
        └── {task_id}.npy    # single vector

Examples:

>>> cache = FeatureCache("./feature_cache")
>>> # Save molecule features
>>> cache.save_molecule_features("CHEMBL123", "ecfp", features, labels)
>>> # Load if exists
>>> features, labels = cache.load_molecule_features("CHEMBL123", "ecfp")
>>> if features is None:
...     # Cache miss: compute features and labels first (not shown), then save
...     cache.save_molecule_features("CHEMBL123", "ecfp", features, labels)
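
A disk cache following the molecule layout above can be sketched with NumPy's NPZ format. This is a minimal illustration rather than the FeatureCache implementation: `molecule_path`, `save_npz`, and `load_npz` are hypothetical helpers, and the array keys `features` and `labels` inside the file are assumptions.

```python
from pathlib import Path
import tempfile

import numpy as np


def molecule_path(cache_dir: Path, featurizer: str, task_id: str) -> Path:
    # Mirrors the layout above: cache_dir/molecule/{featurizer}/{task_id}.npz
    return cache_dir / "molecule" / featurizer / f"{task_id}.npz"


def save_npz(cache_dir, featurizer, task_id, features, labels) -> Path:
    path = molecule_path(Path(cache_dir), featurizer, task_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    np.savez(path, features=features, labels=labels)
    return path


def load_npz(cache_dir, featurizer, task_id):
    path = molecule_path(Path(cache_dir), featurizer, task_id)
    if not path.exists():
        return None, None  # cache miss
    with np.load(path) as data:
        return data["features"], data["labels"]


with tempfile.TemporaryDirectory() as tmp:
    miss = load_npz(tmp, "ecfp", "CHEMBL123")
    feats = np.ones((2, 4), dtype=np.float32)
    labs = np.array([0, 1], dtype=np.int32)
    saved = save_npz(tmp, "ecfp", "CHEMBL123", feats, labs)
    relative = saved.relative_to(tmp).as_posix()
    loaded_feats, loaded_labs = load_npz(tmp, "ecfp", "CHEMBL123")
```

Returning `(None, None)` on a miss matches the documented contract of load_molecule_features, so callers can branch on `features is None`.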

__init__

__init__(cache_dir: Union[str, Path])

Initialize the feature cache.

Parameters:

Name Type Description Default
cache_dir Union[str, Path]

Root directory for cached features.

required

has_molecule_features

has_molecule_features(task_id: str, featurizer: str) -> bool

Check if molecule features are cached.

Parameters:

Name Type Description Default
task_id str

Task ID to check.

required
featurizer str

Name of the featurizer.

required

Returns:

Type Description
bool

True if features are cached.

has_protein_features

has_protein_features(task_id: str, featurizer: str) -> bool

Check if protein features are cached.

Parameters:

Name Type Description Default
task_id str

Task ID to check.

required
featurizer str

Name of the featurizer.

required

Returns:

Type Description
bool

True if features are cached.

save_molecule_features

save_molecule_features(
    task_id: str,
    featurizer: str,
    features: NDArray[float32],
    labels: NDArray[int32],
    metadata: Optional[Dict[str, Any]] = None,
) -> Path

Save molecule features to cache.

Parameters:

Name Type Description Default
task_id str

Task ID for the dataset.

required
featurizer str

Name of the featurizer used.

required
features NDArray[float32]

Feature matrix of shape (n_molecules, feature_dim).

required
labels NDArray[int32]

Binary labels of shape (n_molecules,).

required
metadata Optional[Dict[str, Any]]

Optional metadata dictionary.

None

Returns:

Type Description
Path

Path to the saved file.

load_molecule_features

load_molecule_features(
    task_id: str, featurizer: str
) -> Tuple[Optional[NDArray[np.float32]], Optional[NDArray[np.int32]]]

Load molecule features from cache.

Parameters:

Name Type Description Default
task_id str

Task ID for the dataset.

required
featurizer str

Name of the featurizer.

required

Returns:

Type Description
Tuple[Optional[NDArray[float32]], Optional[NDArray[int32]]]

Tuple of (features, labels), or (None, None) if not cached.

save_protein_features

save_protein_features(
    task_id: str, featurizer: str, features: NDArray[float32]
) -> Path

Save protein features to cache.

Parameters:

Name Type Description Default
task_id str

Task ID for the protein.

required
featurizer str

Name of the featurizer used.

required
features NDArray[float32]

Feature vector of shape (feature_dim,).

required

Returns:

Type Description
Path

Path to the saved file.

load_protein_features

load_protein_features(
    task_id: str, featurizer: str
) -> Optional[NDArray[np.float32]]

Load protein features from cache.

Parameters:

Name Type Description Default
task_id str

Task ID for the protein.

required
featurizer str

Name of the featurizer.

required

Returns:

Type Description
Optional[NDArray[float32]]

Feature vector, or None if not cached.

save_all_molecule_features

save_all_molecule_features(
    features_dict: Dict[str, NDArray[float32]],
    labels_dict: Dict[str, NDArray[int32]],
    featurizer: str,
) -> List[Path]

Save molecule features for multiple datasets.

Parameters:

Name Type Description Default
features_dict Dict[str, NDArray[float32]]

Dictionary mapping task IDs to feature matrices.

required
labels_dict Dict[str, NDArray[int32]]

Dictionary mapping task IDs to label arrays.

required
featurizer str

Name of the featurizer used.

required

Returns:

Type Description
List[Path]

List of paths to saved files.

save_all_protein_features

save_all_protein_features(
    features_dict: Dict[str, NDArray[float32]], featurizer: str
) -> List[Path]

Save protein features for multiple tasks.

Parameters:

Name Type Description Default
features_dict Dict[str, NDArray[float32]]

Dictionary mapping task IDs to feature vectors.

required
featurizer str

Name of the featurizer used.

required

Returns:

Type Description
List[Path]

List of paths to saved files.

load_all_molecule_features

load_all_molecule_features(
    task_ids: List[str], featurizer: str
) -> Tuple[
    Dict[str, NDArray[np.float32]], Dict[str, NDArray[np.int32]], List[str]
]

Load molecule features for multiple datasets.

Parameters:

Name Type Description Default
task_ids List[str]

List of task IDs to load.

required
featurizer str

Name of the featurizer.

required

Returns:

Type Description
Tuple[Dict[str, NDArray[float32]], Dict[str, NDArray[int32]], List[str]]

Tuple of (features_dict, labels_dict, missing_ids). missing_ids contains task IDs that were not found in cache.

load_all_protein_features

load_all_protein_features(
    task_ids: List[str], featurizer: str
) -> Tuple[Dict[str, NDArray[np.float32]], List[str]]

Load protein features for multiple tasks.

Parameters:

Name Type Description Default
task_ids List[str]

List of task IDs to load.

required
featurizer str

Name of the featurizer.

required

Returns:

Type Description
Tuple[Dict[str, NDArray[float32]], List[str]]

Tuple of (features_dict, missing_ids). missing_ids contains task IDs that were not found in cache.
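
The missing_ids return value supports a load-then-fill pattern: load what is cached, compute only the rest. The sketch below demonstrates the idea with an in-memory dict standing in for the disk cache; `load_all`, `load_or_compute`, and `store` are hypothetical names, not part of FeatureCache.

```python
from typing import Callable, Dict, List, Tuple


def load_all(store: Dict[str, list], task_ids: List[str]) -> Tuple[Dict[str, list], List[str]]:
    # Split requested IDs into cached hits and missing IDs.
    found = {t: store[t] for t in task_ids if t in store}
    missing = [t for t in task_ids if t not in store]
    return found, missing


def load_or_compute(store, task_ids, compute_fn: Callable[[str], list]):
    found, missing = load_all(store, task_ids)
    for task_id in missing:
        # Compute only what the cache lacks, then store it for next time.
        store[task_id] = found[task_id] = compute_fn(task_id)
    return found, missing


store = {"task1": [1.0, 2.0]}
features, computed = load_or_compute(store, ["task1", "task2"], lambda t: [0.0])
```

With the real cache, the compute step would call a featurizer and the store step would call one of the save methods documented above.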

clear

clear(featurizer: Optional[str] = None) -> int

Clear cached features.

Parameters:

Name Type Description Default
featurizer Optional[str]

If specified, only clear features for this featurizer. If None, clear all cached features.

None

Returns:

Type Description
int

Number of files deleted.

get_statistics

get_statistics() -> Dict[str, Any]

Get statistics about cached features.

Returns:

Type Description
Dict[str, Any]

Dictionary with cache statistics.

Usage Examples

from themap.features import FeatureCache

# Initialize cache
cache = FeatureCache(cache_dir="cache/features")

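# Check if features are cached for this task and featurizer
task_id, featurizer_name = "task1", "ecfp"
if cache.has_molecule_features(task_id, featurizer_name):
    features, labels = cache.load_molecule_features(task_id, featurizer_name)
else:
    # compute_features() is a placeholder that returns (features, labels)
    features, labels = compute_features()
    cache.save_molecule_features(task_id, featurizer_name, features, labels)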

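Caching Featurizer Output

from themap.features import MoleculeFeaturizer, FeatureCache

cache = FeatureCache(cache_dir="cache/")
featurizer = MoleculeFeaturizer(featurizer_name="ecfp")

# MoleculeFeaturizer.__init__ takes no cache argument, so cache explicitly.
# labels is the label array for smiles_list, defined elsewhere.
features, cached_labels = cache.load_molecule_features("task1", "ecfp")
if features is None:
    # First run computes and caches
    features = featurizer.featurize(smiles_list)
    cache.save_molecule_features("task1", "ecfp", features, labels)
# Later runs load from the cache (fast)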

Performance Optimization

Choosing the Right Featurizer

def choose_featurizer(dataset_size: int, accuracy_priority: bool) -> str:
    """Choose appropriate featurizer based on requirements."""
    if dataset_size > 100000:
        return "ecfp"  # Fast fingerprints for large datasets
    elif accuracy_priority:
        return "ChemBERTa-77M-MLM"  # Neural embeddings for accuracy
    else:
        return "desc2D"  # Good balance of speed and quality

Parallel Processing

from themap.features import MoleculeFeaturizer

# Use multiple CPU cores
featurizer = MoleculeFeaturizer(
    featurizer_name="mordred",
    n_jobs=16  # Use 16 parallel workers
)

GPU Acceleration

from themap.features import ProteinFeaturizer

# Use GPU for neural models
featurizer = ProteinFeaturizer(
    featurizer_name="esm2_t33_650M_UR50D",
    device="cuda"  # Documented values: 'auto', 'cpu', 'cuda'
)

# Pass all sequences in one call; featurize() takes only the sequences
embeddings = featurizer.featurize(sequences)

Error Handling

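import numpy as np

from themap.features import MoleculeFeaturizer

featurizer = MoleculeFeaturizer(featurizer_name="ecfp")

smiles_list = ["CCO", "invalid_smiles", "CCCO"]

# Default (ignore_errors=True): invalid SMILES produce NaN rows
features = featurizer.featurize(smiles_list)
valid = ~np.isnan(features).any(axis=1)
features = features[valid]  # Keep only the valid molecules

# Strict mode: ignore_errors=False turns off the NaN fallback.
# The exception type raised for invalid SMILES is not documented here.
try:
    features = featurizer.featurize(smiles_list, ignore_errors=False)
except Exception as e:
    print(f"Invalid SMILES: {e}")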

Integration with Distance Computation

from themap.features import MoleculeFeaturizer
from themap.distance import compute_dataset_distance_matrix
import numpy as np

# Featurize datasets
featurizer = MoleculeFeaturizer(featurizer_name="ecfp")

source_features = featurizer.featurize(source_smiles)
target_features = featurizer.featurize(target_smiles)

# Compute distances
distances = compute_dataset_distance_matrix(
    source_features,
    target_features,
    method="euclidean"
)

Constants

Available Featurizer Names

from themap.features.molecule import (
    FINGERPRINT_FEATURIZERS,
    DESCRIPTOR_FEATURIZERS,
    NEURAL_FEATURIZERS,
)

print("Fingerprints:", FINGERPRINT_FEATURIZERS)
print("Descriptors:", DESCRIPTOR_FEATURIZERS)
print("Neural:", NEURAL_FEATURIZERS)

ESM Model Names

from themap.features.protein import ESM2_MODELS, ESM3_MODELS

print("ESM2 models:", ESM2_MODELS)
print("ESM3 models:", ESM3_MODELS)