themap.data

MoleculeDatapoint dataclass

Data structure holding information for a single molecule and associated features.

This class represents a single molecule datapoint with its associated features and labels. It provides methods to compute molecular fingerprints and features, and includes various molecular properties as properties.

Attributes:

Name Type Description
task_id str

String describing the task this datapoint is taken from.

smiles str

SMILES string describing the molecule this datapoint corresponds to.

bool_label bool

Boolean classification label, usually derived from the numeric label using a threshold.

numeric_label Optional[float]

Numeric label (e.g., activity), usually measured in the lab.

_rdkit_mol Optional[Mol]

Cached RDKit molecule object.

Properties

number_of_atoms (int): Number of heavy atoms in the molecule
number_of_bonds (int): Number of bonds in the molecule
molecular_weight (float): Molecular weight in atomic mass units
logp (float): Octanol-water partition coefficient (LogP)
num_rotatable_bonds (int): Number of rotatable bonds in the molecule
smiles_canonical (str): Canonical SMILES representation
rdkit_mol (Chem.Mol): RDKit molecule object (lazy loaded)

Methods:

Name Description
get_fingerprint

Computes and returns the Morgan fingerprint for the molecule

get_features

Computes and returns molecular features using specified featurizer

Example
Create a molecule datapoint

datapoint = MoleculeDatapoint(
    task_id="toxicity_prediction",
    smiles="CCO",  # ethanol
    bool_label=True,
    numeric_label=0.8,
)

Access molecular properties

print(f"Number of atoms: {datapoint.number_of_atoms}")

Number of atoms: 3

print(f"Molecular weight: {datapoint.molecular_weight:.2f}")

Molecular weight: 46.07

print(f"LogP: {datapoint.logp:.2f}")

LogP: -0.31

print(f"Number of rotatable bonds: {datapoint.num_rotatable_bonds}")

Number of rotatable bonds: 0

print(f"SMILES canonical: {datapoint.smiles_canonical}")

SMILES canonical: CCO
Get molecular features

fingerprint = datapoint.get_fingerprint()
print(f"Fingerprint shape: {fingerprint.shape if fingerprint is not None else None}")

Fingerprint shape: (2048,)

features = datapoint.get_features(featurizer_name="ecfp")
print(f"Features shape: {features.shape if features is not None else None}")

Features shape: (2048,)
rdkit_mol property
rdkit_mol: Optional[Mol]

Get the RDKit molecule object.

This property lazily initializes the RDKit molecule if it hasn't been created yet. The molecule is cached to avoid recreating it multiple times.

Returns:

Type Description
Optional[Mol]

Optional[Chem.Mol]: RDKit molecule object. Returns None if molecule creation fails.
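
The lazy initialization and caching described for rdkit_mol can be illustrated without RDKit. The following is a minimal sketch, not the library's implementation: LazyDatapoint and _parse are hypothetical stand-ins, with _parse playing the role of the SMILES parser and returning None on failure.

```python
from dataclasses import dataclass, field
from typing import List, Optional

PARSE_CALLS: List[str] = []  # records every run of the stand-in parser


def _parse(smiles: str) -> Optional[dict]:
    """Hypothetical stand-in for a SMILES parser; returns None on failure."""
    PARSE_CALLS.append(smiles)
    if not smiles:
        return None
    return {"smiles": smiles}


@dataclass
class LazyDatapoint:
    smiles: str
    _mol: Optional[dict] = field(default=None, repr=False)

    @property
    def mol(self) -> Optional[dict]:
        # Build the object on first access only, then reuse the cached copy.
        if self._mol is None:
            self._mol = _parse(self.smiles)
        return self._mol


dp = LazyDatapoint("CCO")
first = dp.mol
second = dp.mol  # no second parse: the cached object is returned
```

In this sketch a failed parse is not cached, so an invalid input is re-parsed on every access; whether the library behaves the same way is not stated in this reference.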

number_of_atoms property
number_of_atoms: int

Get the number of heavy atoms in the molecule.

Returns:

Name Type Description
int int

Number of heavy atoms in the molecule.

number_of_bonds property
number_of_bonds: int

Get the number of bonds in the molecule.

Returns:

Name Type Description
int int

Number of bonds in the molecule.

molecular_weight property
molecular_weight: float

Get the molecular weight of the molecule.

Returns:

Name Type Description
float float

Molecular weight of the molecule in atomic mass units.

logp property
logp: float

Calculate octanol-water partition coefficient.

Returns:

Name Type Description
float float

LogP value of the molecule.

num_rotatable_bonds property
num_rotatable_bonds: int

Get number of rotatable bonds.

Returns:

Name Type Description
int int

Number of rotatable bonds in the molecule.

Raises: ValueError: If the molecule cannot be created.

smiles_canonical property
smiles_canonical: str

Get canonical SMILES representation.

Returns:

Name Type Description
str str

Canonical SMILES string for the molecule.

Raises: ValueError: If the molecule cannot be created.

__post_init__
__post_init__() -> None

Validate initialization data.

to_dict
to_dict() -> dict

Convert datapoint to dictionary for serialization.

from_dict classmethod
from_dict(data: dict) -> MoleculeDatapoint

Create datapoint from dictionary.

get_fingerprint
get_fingerprint(force_recompute: bool = False) -> Optional[np.ndarray]

Get the Morgan fingerprint for a molecule.

This method computes the Extended-Connectivity Fingerprint (ECFP) for the molecule using RDKit's Morgan fingerprint generator. Features are cached globally to avoid recomputation across different instances.

Parameters:

Name Type Description Default
force_recompute bool

If True, the fingerprint is recomputed even if cached.

False

Returns:

Type Description
Optional[ndarray]

Optional[np.ndarray]: Morgan fingerprint for the molecule (r=2, nbits=2048). The fingerprint is a binary vector representing the molecular structure. Returns None if fingerprint generation fails.
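
The interaction between a global cache and force_recompute can be sketched as follows. This is an illustration under stated assumptions: toy_fingerprint is a hypothetical, chemically meaningless stand-in for a Morgan fingerprint, and the cache key (SMILES, n_bits) is an assumption rather than the key the library uses.

```python
import hashlib
from typing import Dict, List, Tuple

# Module-level cache shared by every caller, keyed by (SMILES, n_bits).
_FINGERPRINT_CACHE: Dict[Tuple[str, int], List[int]] = {}
COMPUTE_COUNT = 0  # counts how often the fingerprint is actually computed


def toy_fingerprint(smiles: str, n_bits: int = 2048) -> List[int]:
    """Hypothetical stand-in: hash character bigrams into a binary vector."""
    bits = [0] * n_bits
    for i in range(len(smiles) - 1):
        digest = hashlib.sha256(smiles[i:i + 2].encode()).digest()
        bits[int.from_bytes(digest[:4], "big") % n_bits] = 1
    return bits


def get_fingerprint(smiles: str, force_recompute: bool = False) -> List[int]:
    global COMPUTE_COUNT
    key = (smiles, 2048)
    if force_recompute or key not in _FINGERPRINT_CACHE:
        COMPUTE_COUNT += 1
        _FINGERPRINT_CACHE[key] = toy_fingerprint(smiles)
    return _FINGERPRINT_CACHE[key]


a = get_fingerprint("CCO")
b = get_fingerprint("CCO")                        # served from the cache
c = get_fingerprint("CCO", force_recompute=True)  # recomputed on request
```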

get_features
get_features(
    featurizer_name: Optional[str] = None, force_recompute: bool = False
) -> Optional[np.ndarray]

Get features for a molecule using a featurizer model.

This method computes molecular features using the specified featurizer model. Features are cached globally to avoid recomputation across different instances.

Parameters:

Name Type Description Default
featurizer_name Optional[str]

Name of the featurizer model to use. If None, returns None.

None
force_recompute bool

If True, features are recomputed even if cached.

False

Returns:

Type Description
Optional[ndarray]

Optional[np.ndarray]: Features for the molecule. The shape and content depend on the featurizer used. Returns None if feature generation fails or featurizer_name is None.

MoleculeDataset dataclass

Data structure holding information for a dataset of molecules.

This class represents a collection of molecule datapoints, providing methods for dataset manipulation, feature computation, and statistical analysis.

Attributes:

Name Type Description
task_id str

String describing the task this dataset is taken from.

data List[MoleculeDatapoint]

List of MoleculeDatapoint objects.

_features Optional[NDArray[float32]]

Cached features for the dataset.

_cache_info Dict[str, Any]

Information about the feature caching.

get_features property
get_features: Optional[NDArray[float32]]

Get the cached features for the dataset.

Returns:

Type Description
Optional[NDArray[float32]]

Optional[NDArray[np.float32]]: Cached features for the dataset if available, None otherwise.

get_ratio property
get_ratio: float

Get the ratio of positive to negative examples in the dataset.

Returns:

Name Type Description
float float

Ratio of positive to negative examples in the dataset.

__post_init__
__post_init__() -> None

Validate dataset initialization.

get_dataset_embedding
get_dataset_embedding(
    featurizer_name: str,
    n_jobs: Optional[int] = None,
    force_recompute: bool = False,
    batch_size: int = 1000,
) -> NDArray[np.float32]

Get the features for the entire dataset using a featurizer.

Efficiently computes features for all molecules in the dataset using the specified featurizer, taking advantage of the featurizer's built-in parallelization capabilities and maintaining a two-level cache strategy.

Parameters:

Name Type Description Default
featurizer_name str

Name of the featurizer to use

required
n_jobs Optional[int]

Number of parallel jobs. If provided, temporarily overrides the featurizer's own setting

None
force_recompute bool

Whether to force recomputation even if cached

False
batch_size int

Batch size for processing, used for memory efficiency when handling large datasets

1000

Returns:

Type Description
NDArray[float32]

NDArray[np.float32]: Features for the entire dataset, shape (n_samples, n_features)

Raises:

Type Description
ValueError

If the generated features length doesn't match the dataset length

TypeError

If featurizer_name is not a string

RuntimeError

If featurization fails

IndexError

If dataset is empty
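
The role of batch_size and the length check behind the ValueError above can be sketched in plain Python. The names embed_dataset, batched, and toy_featurizer are hypothetical and only illustrate the idea; caching and parallelism are left out.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_dataset(
    smiles_list: Sequence[str],
    featurize_batch: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 1000,
) -> List[List[float]]:
    if not smiles_list:
        raise IndexError("dataset is empty")
    rows: List[List[float]] = []
    for batch in batched(smiles_list, batch_size):
        rows.extend(featurize_batch(batch))  # one featurizer call per batch
    if len(rows) != len(smiles_list):
        raise ValueError("feature count does not match dataset length")
    return rows


def toy_featurizer(batch: Sequence[str]) -> List[List[float]]:
    # Hypothetical featurizer: (string length, number of 'C' characters).
    return [[float(len(s)), float(s.count("C"))] for s in batch]


features = embed_dataset(["CCO", "CCN", "c1ccccc1", "O"], toy_featurizer, batch_size=3)
```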

validate_dataset_integrity
validate_dataset_integrity() -> bool

Validate the integrity of the dataset.

Returns:

Name Type Description
bool bool

True if dataset is valid, False otherwise

Raises:

Type Description
ValueError

If critical integrity issues are found

get_memory_usage
get_memory_usage() -> Dict[str, float]

Get memory usage statistics for the dataset.

Returns:

Type Description
Dict[str, float]

Dictionary with memory usage in MB for different components

optimize_memory
optimize_memory() -> Dict[str, Any]

Optimize memory usage by cleaning up unnecessary data.

Returns:

Type Description
Dict[str, Any]

Dictionary with optimization results

clear_cache
clear_cache() -> None

Clear cached features for this dataset from global cache.

enable_persistent_cache
enable_persistent_cache(cache_dir: Union[str, Path]) -> None

Enable persistent caching for this dataset.

Parameters:

Name Type Description Default
cache_dir Union[str, Path]

Directory for storing cached features

required
get_dataset_embedding_with_persistent_cache
get_dataset_embedding_with_persistent_cache(
    featurizer_name: str,
    cache_dir: Optional[Union[str, Path]] = None,
    n_jobs: Optional[int] = None,
    force_recompute: bool = False,
    batch_size: int = 1000,
) -> NDArray[np.float32]

Get dataset features with persistent caching enabled.

This method provides the same functionality as get_dataset_embedding but with persistent disk caching to avoid recomputation across sessions.

Parameters:

Name Type Description Default
featurizer_name str

Name of the featurizer to use

required
cache_dir Optional[Union[str, Path]]

Directory for persistent cache (if None, uses existing cache)

None
n_jobs Optional[int]

Number of parallel jobs

None
force_recompute bool

Whether to force recomputation even if cached

False
batch_size int

Batch size for processing

1000

Returns:

Type Description
NDArray[float32]

Features for the entire dataset
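
Persistent caching across sessions amounts to writing computed features to disk and reading them back on the next call. The sketch below assumes a pickle file per (task, featurizer) pair; the file naming scheme and the helper names are hypothetical and not taken from the library.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path
from typing import Callable, List, Sequence, Union


def cache_file(cache_dir: Path, task_id: str, featurizer_name: str) -> Path:
    """Derive a stable file name from task and featurizer (assumed key)."""
    key = hashlib.sha256(f"{task_id}:{featurizer_name}".encode()).hexdigest()[:16]
    return cache_dir / f"{key}.pkl"


def embed_with_disk_cache(
    task_id: str,
    smiles: Sequence[str],
    featurizer_name: str,
    featurize: Callable[[Sequence[str]], List[List[float]]],
    cache_dir: Union[str, Path],
    force_recompute: bool = False,
) -> List[List[float]]:
    path = cache_file(Path(cache_dir), task_id, featurizer_name)
    if path.exists() and not force_recompute:
        with path.open("rb") as fh:
            return pickle.load(fh)  # cache hit: skip featurization
    features = featurize(smiles)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(features, fh)
    return features


calls: List[int] = []


def toy_featurize(smiles: Sequence[str]) -> List[List[float]]:
    calls.append(len(smiles))  # track how often featurization really runs
    return [[float(len(s))] for s in smiles]


with tempfile.TemporaryDirectory() as tmp:
    first = embed_with_disk_cache("CHEMBL123", ["CCO", "O"], "toy", toy_featurize, tmp)
    second = embed_with_disk_cache("CHEMBL123", ["CCO", "O"], "toy", toy_featurize, tmp)
```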

get_persistent_cache_stats
get_persistent_cache_stats() -> Optional[Dict[str, Any]]

Get statistics about the persistent cache.

Returns:

Type Description
Optional[Dict[str, Any]]

Cache statistics if persistent cache is enabled, None otherwise

get_cache_info
get_cache_info() -> Dict[str, Any]

Get information about the current cache state.

get_prototype
get_prototype(
    featurizer_name: str,
) -> Tuple[NDArray[np.float32], NDArray[np.float32]]

Get the prototype of the dataset.

This method calculates the mean feature vector of positive and negative examples in the dataset using the specified featurizer.

Parameters:

Name Type Description Default
featurizer_name str

Name of the featurizer to use.

required

Returns:

Type Description
Tuple[NDArray[float32], NDArray[float32]]

Tuple[NDArray[np.float32], NDArray[np.float32]]: Tuple containing:
- positive_prototype: Mean feature vector of positive examples
- negative_prototype: Mean feature vector of negative examples

Raises:

Type Description
ValueError

If there are no positive or negative examples in the dataset

TypeError

If featurizer_name is not a string

RuntimeError

If feature computation fails
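
The prototype computation described above is a per-class mean of feature vectors. A minimal sketch using plain lists (the function names are illustrative, and the real method works on NumPy arrays):

```python
from typing import List, Sequence, Tuple


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of equally sized vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def get_prototype(
    features: Sequence[Sequence[float]], labels: Sequence[bool]
) -> Tuple[List[float], List[float]]:
    positives = [f for f, y in zip(features, labels) if y]
    negatives = [f for f, y in zip(features, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    return mean_vector(positives), mean_vector(negatives)


features = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
labels = [True, True, False]
positive_prototype, negative_prototype = get_prototype(features, labels)
# positive_prototype == [2.0, 1.0], negative_prototype == [0.0, 4.0]
```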

load_from_file staticmethod
load_from_file(path: Union[str, RichPath]) -> MoleculeDataset

Load dataset from a JSONL.GZ file.

Parameters:

Name Type Description Default
path Union[str, RichPath]

Path to the JSONL.GZ file.

required

Returns:

Name Type Description
MoleculeDataset MoleculeDataset

Loaded dataset.

filter
filter(condition: Callable[[MoleculeDatapoint], bool]) -> MoleculeDataset

Filter dataset based on a condition.

Parameters:

Name Type Description Default
condition Callable[[MoleculeDatapoint], bool]

Function that returns True/False for each datapoint.

required

Returns:

Name Type Description
MoleculeDataset MoleculeDataset

New dataset containing only the filtered datapoints.
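
A short sketch of the filter contract, using hypothetical Point and ToyDataset stand-ins rather than the real classes: the condition is called once per datapoint, and a new dataset is returned while the original is left unchanged.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Point:
    smiles: str
    bool_label: bool


@dataclass
class ToyDataset:
    task_id: str
    data: List[Point] = field(default_factory=list)

    def filter(self, condition: Callable[[Point], bool]) -> "ToyDataset":
        # Return a new dataset; the original list is left untouched.
        return ToyDataset(self.task_id, [p for p in self.data if condition(p)])


dataset = ToyDataset(
    "toy_task", [Point("CCO", True), Point("CCN", False), Point("O", True)]
)
actives = dataset.filter(lambda p: p.bool_label)
```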

get_statistics
get_statistics() -> DatasetStats

Get statistics about the dataset.

Returns:

Name Type Description
DatasetStats DatasetStats

Dictionary containing:
- size: Total number of datapoints
- positive_ratio: Ratio of positive to negative examples
- avg_molecular_weight: Average molecular weight
- avg_atoms: Average number of atoms
- avg_bonds: Average number of bonds

Raises:

Type Description
ValueError

If the dataset is empty

DataFold

Bases: IntEnum

Enum for data fold types.

This enum represents the different data splits used in machine learning:
- TRAIN (0): Training/source tasks
- VALIDATION (1): Validation/development tasks
- TEST (2): Test/target tasks

By inheriting from IntEnum, each fold type is assigned an integer value which allows for easy indexing and comparison operations.
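
The indexing and comparison behavior that IntEnum provides can be seen in a standalone re-declaration of the enum with the values listed above. The tasks_by_fold list and its task IDs are hypothetical placeholders.

```python
from enum import IntEnum


class DataFold(IntEnum):
    TRAIN = 0
    VALIDATION = 1
    TEST = 2


# Hypothetical per-fold task lists, stored in fold order.
tasks_by_fold = [["CHEMBL1", "CHEMBL2"], ["CHEMBL3"], ["CHEMBL4"]]

train_tasks = tasks_by_fold[DataFold.TRAIN]  # members index lists directly
is_ordered = DataFold.TRAIN < DataFold.TEST  # members compare like integers
fold_from_int = DataFold(1)                  # integers convert back to members
```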

MoleculeDatasets

Dataset of related tasks, provided as individual files split into meta-train, meta-valid and meta-test sets.

__init__
__init__(
    train_data_paths: Optional[List[RichPath]] = None,
    valid_data_paths: Optional[List[RichPath]] = None,
    test_data_paths: Optional[List[RichPath]] = None,
    num_workers: Optional[int] = None,
    cache_dir: Optional[Union[str, Path]] = None,
) -> None

Initialize MoleculeDatasets.

Parameters:

Name Type Description Default
train_data_paths List[RichPath]

List of paths to training data files.

None
valid_data_paths List[RichPath]

List of paths to validation data files.

None
test_data_paths List[RichPath]

List of paths to test data files.

None
num_workers Optional[int]

Number of workers for data loading.

None
cache_dir Optional[Union[str, Path]]

Directory for persistent caching.

None
get_num_fold_tasks
get_num_fold_tasks(fold: DataFold) -> int

Get number of tasks in a specific fold.

Parameters:

Name Type Description Default
fold DataFold

The fold to get number of tasks for.

required

Returns:

Name Type Description
int int

Number of tasks in the fold.

from_directory staticmethod
from_directory(
    directory: Union[str, RichPath],
    task_list_file: Optional[Union[str, RichPath]] = None,
    cache_dir: Optional[Union[str, Path]] = None,
    **kwargs: Any,
) -> MoleculeDatasets

Create MoleculeDatasets from a directory.

Parameters:

Name Type Description Default
directory Union[str, RichPath]

Directory containing train/valid/test subdirectories.

required
task_list_file Optional[Union[str, RichPath]]

File containing list of tasks to include. Can be either a text file (one task per line) or JSON file with fold-specific task lists.

None
cache_dir Optional[Union[str, Path]]

Directory for persistent caching.

None
**kwargs Any

Additional arguments to pass to MoleculeDatasets constructor.

{}

Returns:

Name Type Description
MoleculeDatasets MoleculeDatasets

Created dataset.
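
The two accepted task list formats (plain text with one task per line, or JSON with fold-specific lists) could be parsed along these lines. This is only a sketch: the JSON key names "train", "valid", and "test" and the rule that a plain-text list applies to every fold are assumptions, not documented behavior.

```python
import json
from typing import Dict, List

FOLDS = ("train", "valid", "test")  # assumed JSON key names


def parse_task_list(content: str, suffix: str) -> Dict[str, List[str]]:
    """Return {fold_name: [task_id, ...]} for either supported format."""
    if suffix == ".json":
        raw = json.loads(content)
        return {fold: list(raw.get(fold, [])) for fold in FOLDS}
    # Plain text: one task per line, blank lines ignored.
    tasks = [line.strip() for line in content.splitlines() if line.strip()]
    return {fold: list(tasks) for fold in FOLDS}


json_lists = parse_task_list('{"train": ["CHEMBL1"], "test": ["CHEMBL2"]}', ".json")
text_lists = parse_task_list("CHEMBL1\n\nCHEMBL2\n", ".txt")
```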

get_task_names
get_task_names(data_fold: DataFold) -> List[str]

Get list of task names in a specific fold.

Parameters:

Name Type Description Default
data_fold DataFold

The fold to get task names for.

required

Returns:

Type Description
List[str]

List[str]: List of task names in the fold.

load_datasets
load_datasets(
    folds: Optional[List[DataFold]] = None,
) -> Dict[str, MoleculeDataset]

Load all datasets from specified folds.

Parameters:

Name Type Description Default
folds Optional[List[DataFold]]

List of folds to load. If None, loads all folds.

None

Returns:

Type Description
Dict[str, MoleculeDataset]

Dictionary mapping dataset names to loaded datasets

compute_all_features_with_deduplication
compute_all_features_with_deduplication(
    featurizer_name: str,
    folds: Optional[List[DataFold]] = None,
    batch_size: int = 1000,
    n_jobs: int = -1,
    force_recompute: bool = False,
) -> Dict[str, np.ndarray]

Compute features for all datasets with global SMILES deduplication.

This method provides significant efficiency gains by:
1. Finding all unique SMILES across all datasets
2. Computing features only once per unique SMILES
3. Distributing computed features back to all datasets
4. Using persistent caching to avoid recomputation

Parameters:

Name Type Description Default
featurizer_name str

Name of featurizer to use

required
folds Optional[List[DataFold]]

List of folds to process. If None, processes all folds

None
batch_size int

Batch size for feature computation

1000
n_jobs int

Number of parallel jobs

-1
force_recompute bool

Whether to force recomputation even if cached

False

Returns:

Type Description
Dict[str, ndarray]

Dictionary mapping dataset names to computed features

Note

This method has side effects:
- It modifies the datasets in place by adding features to the dataset objects
- It modifies the molecules in place by adding features to the molecule objects
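
The deduplication steps for this method (collect unique SMILES, featurize once, distribute back) can be sketched in a few lines. The function and the toy featurizer below are hypothetical; persistent caching and the in-place side effects are left out.

```python
from typing import Callable, Dict, List


def compute_with_deduplication(
    datasets: Dict[str, List[str]],
    featurize: Callable[[List[str]], List[List[float]]],
) -> Dict[str, List[List[float]]]:
    # 1. Collect unique SMILES across all datasets (dict keeps first-seen order).
    unique = list(dict.fromkeys(s for smiles in datasets.values() for s in smiles))
    # 2. Featurize each unique SMILES exactly once.
    lookup = dict(zip(unique, featurize(unique)))
    # 3. Distribute the shared results back to every dataset.
    return {name: [lookup[s] for s in smiles] for name, smiles in datasets.items()}


seen: List[str] = []


def toy_featurize(smiles: List[str]) -> List[List[float]]:
    seen.extend(smiles)  # record what was actually featurized
    return [[float(len(s))] for s in smiles]


datasets = {"task_a": ["CCO", "O"], "task_b": ["O", "CCN", "CCO"]}
features = compute_with_deduplication(datasets, toy_featurize)
```

Five datapoints across the two toy datasets lead to only three featurizer inputs, which is where the efficiency gain comes from when tasks share molecules.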

get_distance_computation_ready_features
get_distance_computation_ready_features(
    featurizer_name: str,
    source_fold: DataFold = DataFold.TRAIN,
    target_folds: Optional[List[DataFold]] = None,
) -> Tuple[List[np.ndarray], List[np.ndarray], List[str], List[str]]

Get features organized for efficient N×M distance matrix computation.

Parameters:

Name Type Description Default
featurizer_name str

Name of featurizer to use

required
source_fold DataFold

Fold to use as source datasets (N)

TRAIN
target_folds Optional[List[DataFold]]

Folds to use as target datasets (M)

None

Returns:

Type Description
Tuple[List[ndarray], List[ndarray], List[str], List[str]]

Tuple containing:
  • source_features: List of feature arrays for source datasets
  • target_features: List of feature arrays for target datasets
  • source_names: List of source dataset names
  • target_names: List of target dataset names
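
To show how the four returned lists fit together, here is a sketch that builds an N×M matrix from them. It assumes each dataset is summarized by its mean feature vector and compared with Euclidean distance; both choices are illustrative assumptions, and the dataset names are placeholders.

```python
import math
from typing import List

Matrix = List[List[float]]


def mean_vector(rows: Matrix) -> List[float]:
    """Column-wise mean of a (n_samples, n_features) list of rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def distance_matrix(source_features: List[Matrix], target_features: List[Matrix]) -> Matrix:
    sources = [mean_vector(f) for f in source_features]
    targets = [mean_vector(f) for f in target_features]
    # Row i, column j holds the distance between source i and target j.
    return [[math.dist(s, t) for t in targets] for s in sources]


source_features = [[[0.0, 0.0], [2.0, 0.0]], [[0.0, 4.0]]]    # N = 2 datasets
target_features = [[[1.0, 0.0]], [[1.0, 4.0]], [[4.0, 4.0]]]  # M = 3 datasets
source_names = ["src_a", "src_b"]
target_names = ["tgt_a", "tgt_b", "tgt_c"]

D = distance_matrix(source_features, target_features)
# D[i][j] pairs source_names[i] with target_names[j]
```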
get_global_cache_stats
get_global_cache_stats() -> Optional[Dict]

Get statistics about the global cache usage.

Returns:

Type Description
Optional[Dict]

Cache statistics if global cache is enabled, None otherwise

ProteinDatasets

Collection of protein datasets for different folds (train/validation/test).

Similar to MoleculeDatasets but specifically designed for protein data management, including FASTA file downloading, caching, and feature computation.

uniprot_mapping property
uniprot_mapping: DataFrame

Lazy load UniProt mapping dataframe.

__init__
__init__(
    train_data_paths: Optional[List[RichPath]] = None,
    valid_data_paths: Optional[List[RichPath]] = None,
    test_data_paths: Optional[List[RichPath]] = None,
    num_workers: Optional[int] = None,
    cache_dir: Optional[Union[str, Path]] = None,
    uniprot_mapping_file: Optional[Union[str, Path]] = None,
) -> None

Initialize ProteinDatasets.

Parameters:

Name Type Description Default
train_data_paths Optional[List[RichPath]]

List of paths to training FASTA files

None
valid_data_paths Optional[List[RichPath]]

List of paths to validation FASTA files

None
test_data_paths Optional[List[RichPath]]

List of paths to test FASTA files

None
num_workers Optional[int]

Number of workers for data loading

None
cache_dir Optional[Union[str, Path]]

Directory for persistent caching

None
uniprot_mapping_file Optional[Union[str, Path]]

Path to CHEMBLID -> UNIPROT mapping file

None
get_uniprot_id_from_chembl
get_uniprot_id_from_chembl(chembl_id: str) -> Optional[str]

Get UniProt ID from ChEMBL ID using mapping file.

Parameters:

Name Type Description Default
chembl_id str

ChEMBL task ID

required

Returns:

Type Description
Optional[str]

UniProt accession ID if found, None otherwise

download_fasta_for_task
download_fasta_for_task(chembl_id: str, output_path: Path) -> bool

Download FASTA file for a single task.

Parameters:

Name Type Description Default
chembl_id str

ChEMBL task ID

required
output_path Path

Path at which to save the FASTA file

required

Returns:

Type Description
bool

True if successful, False otherwise

create_fasta_files_from_task_list staticmethod
create_fasta_files_from_task_list(
    task_list_file: Union[str, Path],
    output_dir: Union[str, Path],
    uniprot_mapping_file: Optional[Union[str, Path]] = None,
) -> ProteinDatasets

Create FASTA files from a task list and return ProteinDatasets.

Parameters:

Name Type Description Default
task_list_file Union[str, Path]

Path to JSON file containing fold-specific task lists

required
output_dir Union[str, Path]

Base directory in which to create the train/test subdirectories

required
uniprot_mapping_file Optional[Union[str, Path]]

Path to CHEMBLID -> UNIPROT mapping file

None

Returns:

Type Description
ProteinDatasets

ProteinDatasets instance with paths to created FASTA files

from_directory staticmethod
from_directory(
    directory: Union[str, RichPath],
    task_list_file: Optional[Union[str, RichPath]] = None,
    cache_dir: Optional[Union[str, Path]] = None,
    uniprot_mapping_file: Optional[Union[str, Path]] = None,
    **kwargs: Any,
) -> ProteinDatasets

Create ProteinDatasets from a directory containing FASTA files.

Parameters:

Name Type Description Default
directory Union[str, RichPath]

Directory containing train/valid/test subdirectories with FASTA files

required
task_list_file Optional[Union[str, RichPath]]

File containing list of tasks to include

None
cache_dir Optional[Union[str, Path]]

Directory for persistent caching

None
uniprot_mapping_file Optional[Union[str, Path]]

Path to CHEMBLID -> UNIPROT mapping file

None
**kwargs Any

Additional arguments

{}

Returns:

Type Description
ProteinDatasets

ProteinDatasets instance

get_num_fold_tasks
get_num_fold_tasks(fold: DataFold) -> int

Get number of tasks in a specific fold.

get_task_names
get_task_names(data_fold: DataFold) -> List[str]

Get list of task names in a specific fold.

load_datasets
load_datasets(
    folds: Optional[List[DataFold]] = None,
) -> Dict[str, ProteinDataset]

Load all protein datasets from specified folds.

Parameters:

Name Type Description Default
folds Optional[List[DataFold]]

List of folds to load. If None, loads all folds.

None

Returns:

Type Description
Dict[str, ProteinDataset]

Dictionary mapping dataset names to loaded ProteinDataset objects

compute_all_features_with_deduplication
compute_all_features_with_deduplication(
    featurizer_name: str = "esm3_sm_open_v1",
    layer: Optional[int] = None,
    folds: Optional[List[DataFold]] = None,
    batch_size: int = 100,
    force_recompute: bool = False,
) -> Dict[str, NDArray[np.float32]]

Compute features for all protein datasets with UniProt ID deduplication.

Parameters:

Name Type Description Default
featurizer_name str

Name of protein featurizer to use

'esm3_sm_open_v1'
layer Optional[int]

Layer number for ESM models

None
folds Optional[List[DataFold]]

List of folds to process. If None, processes all folds

None
batch_size int

Batch size for feature computation

100
force_recompute bool

Whether to force recomputation even if cached

False

Returns:

Type Description
Dict[str, NDArray[float32]]

Dictionary mapping dataset names to computed features

get_distance_computation_ready_features
get_distance_computation_ready_features(
    featurizer_name: str = "esm3_sm_open_v1",
    layer: Optional[int] = None,
    source_fold: DataFold = DataFold.TRAIN,
    target_folds: Optional[List[DataFold]] = None,
) -> Tuple[
    List[NDArray[np.float32]], List[NDArray[np.float32]], List[str], List[str]
]

Get protein features organized for efficient N×M distance matrix computation.

Parameters:

Name Type Description Default
featurizer_name str

Name of protein featurizer to use

'esm3_sm_open_v1'
layer Optional[int]

Layer number for ESM models

None
source_fold DataFold

Fold to use as source datasets (N)

TRAIN
target_folds Optional[List[DataFold]]

Folds to use as target datasets (M)

None

Returns:

Type Description
Tuple[List[NDArray[float32]], List[NDArray[float32]], List[str], List[str]]

Tuple containing:
  • source_features: List of feature arrays for source datasets
  • target_features: List of feature arrays for target datasets
  • source_names: List of source dataset names
  • target_names: List of target dataset names
save_features_to_file
save_features_to_file(
    output_path: Union[str, Path],
    featurizer_name: str = "esm3_sm_open_v1",
    layer: Optional[int] = None,
    folds: Optional[List[DataFold]] = None,
) -> None

Save computed features to a pickle file for efficient loading.

Parameters:

Name Type Description Default
output_path Union[str, Path]

Path at which to save the features

required
featurizer_name str

Name of protein featurizer used

'esm3_sm_open_v1'
layer Optional[int]

Layer number for ESM models

None
folds Optional[List[DataFold]]

List of folds to save. If None, saves all folds

None
load_features_from_file staticmethod
load_features_from_file(
    file_path: Union[str, Path],
) -> Dict[str, NDArray[np.float32]]

Load precomputed features from a pickle file.

Parameters:

Name Type Description Default
file_path Union[str, Path]

Path to the saved features file

required

Returns:

Type Description
Dict[str, NDArray[float32]]

Dictionary mapping dataset names to features
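
Saving and loading a name-to-features dictionary via pickle can be sketched as a simple round trip. The helper names, the dataset names, and the plain-list feature values below are hypothetical; the real methods store NumPy arrays.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Dict, List, Union


def save_features(output_path: Union[str, Path], features: Dict[str, List[float]]) -> None:
    path = Path(output_path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create missing directories
    with path.open("wb") as fh:
        pickle.dump(features, fh, protocol=pickle.HIGHEST_PROTOCOL)


def load_features(file_path: Union[str, Path]) -> Dict[str, List[float]]:
    with Path(file_path).open("rb") as fh:
        return pickle.load(fh)


features = {"train_CHEMBL123": [0.1, 0.2], "test_CHEMBL456": [0.3, 0.4]}

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "features" / "protein_features.pkl"
    save_features(target, features)
    restored = load_features(target)
```

Because unpickling can execute arbitrary code, pickle files should only be loaded from trusted sources.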

get_global_cache_stats
get_global_cache_stats() -> Optional[Dict[str, Any]]

Get statistics about the global cache usage.

Tasks

Collection of tasks for molecular property prediction across different folds.

This class manages multiple Task objects and provides unified access to molecular, protein, and metadata features across train/validation/test splits.

__init__
__init__(
    train_tasks: Optional[List[Task]] = None,
    valid_tasks: Optional[List[Task]] = None,
    test_tasks: Optional[List[Task]] = None,
    cache_dir: Optional[Union[str, Path]] = None,
) -> None

Initialize Tasks collection.

Parameters:

Name Type Description Default
train_tasks Optional[List[Task]]

List of training tasks

None
valid_tasks Optional[List[Task]]

List of validation tasks

None
test_tasks Optional[List[Task]]

List of test tasks

None
cache_dir Optional[Union[str, Path]]

Directory for persistent caching

None
from_directory staticmethod
from_directory(
    directory: Union[str, RichPath],
    task_list_file: Optional[Union[str, RichPath]] = None,
    cache_dir: Optional[Union[str, Path]] = None,
    load_molecules: bool = True,
    load_proteins: bool = True,
    load_metadata: bool = True,
    metadata_types: Optional[List[str]] = None,
    **kwargs: Any,
) -> Tasks

Create Tasks from a directory structure.

Expected directory structure:

directory/
├── train/
│   ├── CHEMBL123.jsonl.gz (molecules)
│   ├── CHEMBL123.fasta (proteins)
│   ├── CHEMBL123_assay.json (metadata)
│   └── ...
├── valid/
└── test/

Parameters:

Name Type Description Default
directory Union[str, RichPath]

Base directory containing task data

required
task_list_file Optional[Union[str, RichPath]]

JSON file with fold-specific task lists

None
cache_dir Optional[Union[str, Path]]

Directory for persistent caching

None
load_molecules bool

Whether to load molecular data

True
load_proteins bool

Whether to load protein data

True
load_metadata bool

Whether to load metadata

True
metadata_types Optional[List[str]]

List of metadata types to load

None
**kwargs Any

Additional arguments

{}

Returns:

Type Description
Tasks

Tasks instance with loaded data

get_num_fold_tasks
get_num_fold_tasks(fold: DataFold) -> int

Get number of tasks in a specific fold.

get_task_ids
get_task_ids(fold: DataFold) -> List[str]

Get list of task IDs in a specific fold.

get_tasks
get_tasks(fold: DataFold) -> List[Task]

Get list of tasks in a specific fold.

__len__
__len__() -> int

Get number of tasks.

__getitem__
__getitem__(index: int) -> List[Task]

Get the list of tasks in a fold by fold index.

Parameters:

Name Type Description Default
index int

Index of the fold whose tasks are returned (0, 1, or 2)

required

Returns:

Type Description
List[Task]

List[Task]: list of tasks

Note

- index 0: Train Tasks
- index 1: Validation Tasks
- index 2: Test Tasks

Raises:

Type Description
IndexError

if index is out of range

get_task_by_id
get_task_by_id(task_id: str) -> Optional[Task]

Get a specific task by its ID.

compute_all_task_features
compute_all_task_features(
    molecule_featurizer: Optional[str] = None,
    protein_featurizer: Optional[str] = None,
    metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
    combination_method: str = "concatenate",
    folds: Optional[List[DataFold]] = None,
    force_recompute: bool = False,
    **kwargs: Any,
) -> Dict[str, NDArray[np.float32]]

Compute combined features for all tasks.

Parameters:

Name Type Description Default
molecule_featurizer Optional[str]

Molecular featurizer name

None
protein_featurizer Optional[str]

Protein featurizer name

None
metadata_configs Optional[Dict[str, Dict[str, Any]]]

Metadata featurizer configurations

None
combination_method str

How to combine features

'concatenate'
folds Optional[List[DataFold]]

List of folds to process

None
force_recompute bool

Whether to force recomputation

False
**kwargs Any

Additional arguments

{}

Returns:

Type Description
Dict[str, NDArray[float32]]

Dictionary mapping task names to combined features
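
With the default combination_method of "concatenate", the per-modality vectors are presumably joined end to end. A minimal sketch follows; the handling of missing modalities (skipping None) is an assumption, and other combination methods are not covered because this reference does not name any.

```python
from typing import List, Optional, Sequence


def combine_features(
    parts: Sequence[Optional[List[float]]], method: str = "concatenate"
) -> List[float]:
    available = [p for p in parts if p is not None]  # skip missing modalities
    if not available:
        raise ValueError("no features to combine")
    if method != "concatenate":
        raise ValueError(f"method not covered by this sketch: {method}")
    combined: List[float] = []
    for part in available:
        combined.extend(part)
    return combined


molecule_vec = [0.1, 0.2, 0.3]  # e.g. from a molecule featurizer
protein_vec = [0.9, 0.8]        # e.g. from a protein featurizer
task_vec = combine_features([molecule_vec, protein_vec, None])
```

The combined length is the sum of the part lengths, so all tasks need the same set of modalities for their vectors to be comparable.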

get_distance_computation_ready_features
get_distance_computation_ready_features(
    molecule_featurizer: Optional[str] = None,
    protein_featurizer: Optional[str] = None,
    metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
    combination_method: str = "concatenate",
    source_fold: DataFold = DataFold.TRAIN,
    target_folds: Optional[List[DataFold]] = None,
    **kwargs: Any,
) -> Tuple[
    List[NDArray[np.float32]], List[NDArray[np.float32]], List[str], List[str]
]

Get task features organized for efficient N×M distance matrix computation.

Parameters:

Name Type Description Default
molecule_featurizer Optional[str]

Molecular featurizer name

None
protein_featurizer Optional[str]

Protein featurizer name

None
metadata_configs Optional[Dict[str, Dict[str, Any]]]

Metadata featurizer configurations

None
combination_method str

How to combine features

'concatenate'
source_fold DataFold

Fold to use as source tasks (N)

TRAIN
target_folds Optional[List[DataFold]]

Folds to use as target tasks (M)

None
**kwargs Any

Additional arguments

{}

Returns:

Type Description
Tuple[List[NDArray[float32]], List[NDArray[float32]], List[str], List[str]]

Tuple containing:
  • source_features: List of feature arrays for source tasks
  • target_features: List of feature arrays for target tasks
  • source_names: List of source task names
  • target_names: List of target task names
save_task_features_to_file
save_task_features_to_file(
    output_path: Union[str, Path],
    molecule_featurizer: Optional[str] = None,
    protein_featurizer: Optional[str] = None,
    metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
    combination_method: str = "concatenate",
    folds: Optional[List[DataFold]] = None,
    **kwargs: Any,
) -> None

Save computed task features to a pickle file for efficient loading.

load_task_features_from_file staticmethod
load_task_features_from_file(
    file_path: Union[str, Path],
) -> Dict[str, NDArray[np.float32]]

Load precomputed task features from a pickle file.

get_cache_stats
get_cache_stats() -> Dict[str, Any]

Get statistics about feature caching.

TorchMoleculeDataset

Bases: Dataset

PyTorch Dataset for molecular data.

Parameters:

Name Type Description Default
data MoleculeDataset

MoleculeDataset object

required
transform callable

transform to apply to data

None
target_transform callable

transform to apply to targets

None
__init__
__init__(
    data: MoleculeDataset,
    transform: Optional[Callable] = None,
    target_transform: Optional[Callable] = None,
) -> None

Initialize TorchMoleculeDataset.

Parameters:

Name Type Description Default
data MoleculeDataset

Input dataset

required
transform callable

Transform to apply to data

None
target_transform callable

Transform to apply to targets

None
__getitem__
__getitem__(index: int) -> tuple[torch.Tensor, torch.Tensor]

Get a data sample.

Parameters:

Name Type Description Default
index int

Index of the sample to get

required

Returns:

Type Description
tuple[Tensor, Tensor]

tuple[torch.Tensor, torch.Tensor]: Tuple of (features, label)

__len__
__len__() -> int

Get the number of samples in the dataset.

Returns:

Name Type Description
int int

Number of samples
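
The map-style dataset protocol (__len__ plus __getitem__) and the point at which transform and target_transform are applied can be sketched without PyTorch. ToyMapDataset is a hypothetical pure-Python stand-in that returns lists and floats where the real class returns tensors.

```python
from typing import Any, Callable, List, Optional, Tuple


class ToyMapDataset:
    """Pure-Python stand-in for a map-style dataset with optional transforms."""

    def __init__(
        self,
        features: List[List[float]],
        labels: List[bool],
        transform: Optional[Callable[[Any], Any]] = None,
        target_transform: Optional[Callable[[Any], Any]] = None,
    ) -> None:
        if len(features) != len(labels):
            raise ValueError("features and labels must have the same length")
        self.features = features
        self.labels = labels
        self.transform = transform
        self.target_transform = target_transform

    def __len__(self) -> int:
        return len(self.features)

    def __getitem__(self, index: int) -> Tuple[Any, Any]:
        x, y = self.features[index], self.labels[index]
        # Transforms run per sample, at access time.
        if self.transform is not None:
            x = self.transform(x)
        if self.target_transform is not None:
            y = self.target_transform(y)
        return x, y


dataset = ToyMapDataset(
    [[1.0, 2.0], [3.0, 4.0]],
    [True, False],
    transform=lambda x: [v / 2 for v in x],
    target_transform=float,
)
sample = dataset[1]
```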

create_dataloader classmethod
create_dataloader(
    data: MoleculeDataset,
    batch_size: int = 64,
    shuffle: bool = True,
    **kwargs: Any,
) -> torch.utils.data.DataLoader

Create PyTorch DataLoader.

Parameters:

Name Type Description Default
data MoleculeDataset

Input dataset

required
batch_size int

Batch size

64
shuffle bool

Whether to shuffle data

True
**kwargs Any

Additional arguments for DataLoader

{}

Returns:

Name Type Description
DataLoader DataLoader

PyTorch data loader

MoleculeDataloader

MoleculeDataloader(
    data: MoleculeDataset,
    batch_size: int = 64,
    shuffle: bool = True,
    transform: Optional[Callable] = None,
    target_transform: Optional[Callable] = None,
) -> torch.utils.data.DataLoader

Load molecular data and create a PyTorch DataLoader.

Parameters:

Name Type Description Default
data MoleculeDataset

MoleculeDataset object

required
batch_size int

batch size

64
shuffle bool

whether to shuffle data

True
transform callable

transform to apply to data

None
target_transform callable

transform to apply to targets

None

Returns:

Name Type Description
dataset_loader DataLoader

PyTorch DataLoader

Example

from themap.data.torch_dataset import MoleculeDataloader
from themap.data.tasks import Tasks

tasks = Tasks.from_directory(
    directory="datasets/",
    task_list_file="datasets/sample_tasks_list.json",
    load_molecules=True,
    load_proteins=False,
    load_metadata=False,
    cache_dir="./cache",
)
dataset_loader = MoleculeDataloader(
    tasks.get_task_by_id("TASK_ID").molecule_dataset,
    batch_size=10,
    shuffle=True,
)
for batch in dataset_loader:
    print(batch)
    break