Data Module

The data module provides classes and utilities for loading, managing, and converting molecular and protein datasets.

Overview

The data system consists of these main components:

  • MoleculeDataset - Container for molecular data (SMILES, labels)
  • DatasetLoader - Load datasets from directory structures
  • CSVConverter - Convert CSV files to JSONL.GZ format
  • Tasks - Unified task management across train/test/valid splits
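
A minimal end-to-end sketch of how these pieces compose (paths and task IDs below are placeholders):

from themap.data import DatasetLoader
from themap.data.converter import CSVConverter
from themap.data.tasks import Tasks

# Convert a raw CSV into the JSONL.GZ format THEMAP expects
converter = CSVConverter()
converter.convert("raw/my_assay.csv", "datasets/train/MY_ASSAY.jsonl.gz", "MY_ASSAY")

# Load datasets fold by fold
loader = DatasetLoader(data_dir="datasets")
train_datasets = loader.load_datasets("train")

# Or manage molecules, proteins, and metadata through a Tasks collection
tasks = Tasks.from_directory(directory="datasets/")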

MoleculeDataset

themap.data.molecule_dataset.MoleculeDataset dataclass

Simplified dataset structure for molecules.

Optimized for batch distance computation between tasks. Stores SMILES strings and labels directly without per-molecule object overhead.

Attributes:

Name Type Description
task_id str

String identifying the task this dataset belongs to.

smiles_list List[str]

List of SMILES strings for all molecules.

labels NDArray[int32]

Binary labels as numpy array (0/1).

numeric_labels Optional[NDArray[float32]]

Optional continuous labels (e.g., pIC50).

_features Optional[NDArray[float32]]

Precomputed feature matrix (set via set_features or pipeline).

_featurizer_name Optional[str]

Name of featurizer used for current features.

Examples:

>>> dataset = MoleculeDataset.load_from_file("datasets/train/CHEMBL123.jsonl.gz")
>>> print(len(dataset))  # Number of molecules
>>> print(dataset.positive_ratio)  # Ratio of positive labels
>>> # Features are set externally via FeaturizationPipeline
>>> dataset.set_features(features_array, "ecfp")
>>> pos_proto, neg_proto = dataset.get_prototype()

smiles property

smiles: List[str]

Get SMILES list (alias for backward compatibility).

positive_ratio property

positive_ratio: float

Get ratio of positive to total examples.

features property

features: Optional[NDArray[float32]]

Get precomputed features if available.

featurizer_name property

featurizer_name: Optional[str]

Get name of featurizer used for current features.

datapoints property

datapoints: List[Dict[str, Any]]

Legacy property for backward compatibility with metalearning module.

Returns list of dictionaries with molecule data.

data property

data: List[Dict[str, Any]]

Legacy property - alias for datapoints.

__post_init__

__post_init__() -> None

Validate dataset initialization.

__len__

__len__() -> int

Return number of molecules in the dataset.

has_features

has_features() -> bool

Check if features have been computed.

set_features

set_features(features: NDArray[float32], featurizer_name: str) -> None

Set precomputed features for this dataset.

Parameters:

Name Type Description Default
features NDArray[float32]

Feature matrix of shape (n_molecules, feature_dim)

required
featurizer_name str

Name of the featurizer used

required

Raises:

Type Description
ValueError

If feature dimensions don't match dataset size

clear_features

clear_features() -> None

Clear cached features to free memory.

get_features

get_features(
    featurizer_name: str = "ecfp", **kwargs: Any
) -> NDArray[np.float32]

Get molecular features, computing on demand if necessary.

This method returns pre-computed features if available (set via set_features or FeaturizationPipeline), or computes features on demand using the specified featurizer.

Parameters:

Name Type Description Default
featurizer_name str

Name of molecular featurizer to use (e.g., "ecfp", "maccs", "desc2D")

'ecfp'
**kwargs Any

Additional featurizer arguments

{}

Returns:

Type Description
NDArray[float32]

Feature matrix of shape (n_molecules, feature_dim)

Raises:

Type Description
ValueError

If no molecules in dataset or featurization fails
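
For example, assuming dataset is a loaded MoleculeDataset, ECFP features can be computed on demand and reused until cleared:

features = dataset.get_features("ecfp")
print(features.shape)  # (n_molecules, feature_dim)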

get_prototype

get_prototype(
    featurizer_name: Optional[str] = None,
) -> Tuple[NDArray[np.float32], NDArray[np.float32]]

Compute positive and negative prototypes from features.

Prototypes are the mean feature vectors for each class.

Parameters:

Name Type Description Default
featurizer_name Optional[str]

Optional featurizer name. If provided and features aren't yet computed, they will be computed on demand.

None

Returns:

Type Description
Tuple[NDArray[float32], NDArray[float32]]

Tuple of (positive_prototype, negative_prototype)

Raises:

Type Description
ValueError

If features haven't been set or no examples exist for a class
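
For intuition, the prototypes are simply per-class means over the feature matrix; a standalone numpy sketch of the same computation:

import numpy as np

features = np.random.rand(6, 4).astype(np.float32)       # (n_molecules, feature_dim)
labels = np.array([1, 1, 0, 0, 0, 1], dtype=np.int32)    # binary labels

positive_prototype = features[labels == 1].mean(axis=0)  # mean of positive class
negative_prototype = features[labels == 0].mean(axis=0)  # mean of negative class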

get_class_features

get_class_features() -> Tuple[NDArray[np.float32], NDArray[np.float32]]

Get features separated by class.

Returns:

Type Description
Tuple[NDArray[float32], NDArray[float32]]

Tuple of (positive_features, negative_features)

Raises:

Type Description
ValueError

If features haven't been set

load_from_file staticmethod

load_from_file(path: Union[str, RichPath, Path]) -> MoleculeDataset

Load dataset from a JSONL.GZ file.

Parameters:

Name Type Description Default
path Union[str, RichPath, Path]

Path to the JSONL.GZ file.

required

Returns:

Type Description
MoleculeDataset

MoleculeDataset with loaded SMILES and labels.

to_dict

to_dict() -> Dict[str, Any]

Convert dataset to dictionary representation.

from_dict classmethod

from_dict(data: Dict[str, Any]) -> MoleculeDataset

Create dataset from dictionary representation.

filter_by_indices

filter_by_indices(indices: List[int]) -> MoleculeDataset

Create a new dataset with only the specified indices.

Parameters:

Name Type Description Default
indices List[int]

List of indices to keep

required

Returns:

Type Description
MoleculeDataset

New MoleculeDataset with filtered data
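
For example, assuming the dataset holds at least 100 molecules, a subset can be built like this:

subset = dataset.filter_by_indices(list(range(100)))
print(len(subset))  # 100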

get_statistics

get_statistics() -> Dict[str, Any]

Get basic statistics about the dataset.

Returns:

Type Description
Dict[str, Any]

Dictionary with dataset statistics

Usage Examples

from themap.data import MoleculeDataset

# Load from JSONL.GZ file
dataset = MoleculeDataset.load_from_file("datasets/train/CHEMBL123456.jsonl.gz")

print(f"Number of molecules: {len(dataset)}")
print(f"SMILES: {dataset.smiles_list[:3]}")
print(f"Labels: {dataset.labels[:3]}")

Creating from Data

import numpy as np

from themap.data import MoleculeDataset

# Create from in-memory data (labels are stored as a numpy int32 array)
dataset = MoleculeDataset(
    task_id="my_task",
    smiles_list=["CCO", "CCCO", "CC(=O)O"],
    labels=np.array([1, 0, 1], dtype=np.int32),
)

# Save to file
dataset.to_jsonl_gz("output/my_task.jsonl.gz")

DatasetLoader

themap.data.loader.DatasetLoader

Load datasets from directory structure.

Supports the following directory structure:

data_dir/
├── train/           # Training tasks (source)
│   ├── TASK1.csv
│   ├── TASK2.jsonl.gz
│   └── ...
├── test/            # Test tasks (target)
│   ├── TASK3.csv
│   └── ...
├── valid/           # Optional validation tasks
│   └── ...
├── proteins/        # Optional protein FASTA files
│   ├── TASK1.fasta
│   └── ...
└── tasks.json       # Optional task list

If tasks.json is not provided, all CSV/JSONL.GZ files are auto-discovered.

Attributes:

Name Type Description
data_dir

Root directory containing train/test/valid folders.

task_list Optional[Dict[str, List[str]]]

Optional task list loaded from tasks.json.

Examples:

>>> loader = DatasetLoader(Path("datasets/TDC"))
>>> train_datasets = loader.load_datasets("train")
>>> test_datasets = loader.load_datasets("test")
>>> # Get task IDs
>>> train_ids = list(train_datasets.keys())

__init__

__init__(data_dir: Union[str, Path], task_list_file: Optional[str] = None)

Initialize the dataset loader.

Parameters:

Name Type Description Default
data_dir Union[str, Path]

Root directory containing train/test/valid folders.

required
task_list_file Optional[str]

Optional name of task list JSON file in data_dir. If None, all files are auto-discovered.

None

get_fold_dir

get_fold_dir(fold: str) -> Path

Get the directory for a specific fold.

Parameters:

Name Type Description Default
fold str

Fold name (train, test, or valid).

required

Returns:

Type Description
Path

Path to the fold directory.

Raises:

Type Description
ValueError

If fold name is invalid.

get_task_ids

get_task_ids(fold: str) -> List[str]

Get list of task IDs for a fold.

If task_list is provided, uses that. Otherwise auto-discovers files.

Parameters:

Name Type Description Default
fold str

Fold name (train, test, or valid).

required

Returns:

Type Description
List[str]

List of task IDs.

load_dataset

load_dataset(
    fold: str, task_id: str, convert_csv: bool = True
) -> MoleculeDataset

Load a single dataset.

Parameters:

Name Type Description Default
fold str

Fold name (train, test, or valid).

required
task_id str

Task ID to load.

required
convert_csv bool

If True, convert CSV to JSONL.GZ format automatically.

True

Returns:

Type Description
MoleculeDataset

MoleculeDataset instance.

Raises:

Type Description
FileNotFoundError

If dataset file not found.

load_datasets

load_datasets(
    fold: str, task_ids: Optional[List[str]] = None, convert_csv: bool = True
) -> Dict[str, MoleculeDataset]

Load all datasets for a fold.

Parameters:

Name Type Description Default
fold str

Fold name (train, test, or valid).

required
task_ids Optional[List[str]]

Optional list of specific task IDs to load. If None, loads all tasks in the fold.

None
convert_csv bool

If True, convert CSV to JSONL.GZ format automatically.

True

Returns:

Type Description
Dict[str, MoleculeDataset]

Dictionary mapping task IDs to MoleculeDataset instances.

load_all_folds

load_all_folds(
    convert_csv: bool = True,
) -> Dict[str, Dict[str, MoleculeDataset]]

Load datasets from all available folds.

Returns:

Type Description
Dict[str, Dict[str, MoleculeDataset]]

Dictionary mapping fold names to dictionaries of datasets.

get_protein_file

get_protein_file(task_id: str) -> Optional[Path]

Get path to protein FASTA file for a task.

Parameters:

Name Type Description Default
task_id str

Task ID to find protein for.

required

Returns:

Type Description
Optional[Path]

Path to FASTA file, or None if not found.

load_protein_sequences

load_protein_sequences() -> Dict[str, str]

Load all protein sequences from the proteins directory.

Returns:

Type Description
Dict[str, str]

Dictionary mapping task IDs to protein sequences.

get_statistics

get_statistics() -> Dict[str, Any]

Get statistics about available datasets.

Returns:

Type Description
Dict[str, Any]

Dictionary with dataset counts and information.

Usage Examples

from themap.data import DatasetLoader

# Initialize loader
loader = DatasetLoader(
    data_dir="datasets",
    task_list_file="datasets/sample_tasks_list.json"
)

# Load all datasets for each fold
train_datasets = loader.load_datasets("train")
test_datasets = loader.load_datasets("test")

print(f"Loaded {len(train_datasets)} training datasets")
print(f"Loaded {len(test_datasets)} test datasets")

Loading Specific Tasks

from themap.data import DatasetLoader

loader = DatasetLoader(data_dir="datasets")

# Load specific dataset
dataset = loader.load_dataset(
    task_id="CHEMBL123456",
    fold="train"
)

# Load multiple datasets
datasets = loader.load_datasets(
    task_ids=["CHEMBL123456", "CHEMBL789012"],
    fold="train"
)

Get Dataset Statistics

from themap.data import DatasetLoader

loader = DatasetLoader(data_dir="datasets")
stats = loader.get_statistics()

print(f"Data directory: {stats['data_dir']}")
for fold, fold_stats in stats['folds'].items():
    print(f"  {fold}: {fold_stats['task_count']} tasks")

CSVConverter

themap.data.converter.CSVConverter

Convert CSV files to JSONL.GZ format for THEMAP.

Supports auto-detection of SMILES and activity columns, RDKit-based SMILES validation, and various CSV formats.

Examples:

>>> converter = CSVConverter()
>>> stats = converter.convert("input.csv", "output.jsonl.gz", "CHEMBL123")
>>> print(f"Converted {stats.valid_molecules} molecules")

__init__

__init__(
    validate_smiles: bool = True,
    strict_validation: bool = True,
    auto_detect_columns: bool = True,
)

Initialize the converter.

Parameters:

Name Type Description Default
validate_smiles bool

Whether to validate SMILES with RDKit.

True
strict_validation bool

If True, use strict sanitization.

True
auto_detect_columns bool

If True, auto-detect column names.

True

read_csv

read_csv(
    path: Union[str, Path],
    smiles_column: Optional[str] = None,
    activity_column: Optional[str] = None,
) -> Dict[str, Any]

Read CSV file and extract SMILES and labels.

Parameters:

Name Type Description Default
path Union[str, Path]

Path to the CSV file.

required
smiles_column Optional[str]

Name of the SMILES column (auto-detected if None).

None
activity_column Optional[str]

Name of the activity column (auto-detected if None).

None

Returns:

Type Description
Dict[str, Any]

Dictionary with 'smiles', 'labels', 'numeric_labels' keys.
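
For example, to inspect a CSV before converting it (with auto-detected columns):

converter = CSVConverter()
data = converter.read_csv("input.csv")
print(data["smiles"][:3], data["labels"][:3])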

convert

convert(
    input_path: Union[str, Path],
    output_path: Union[str, Path],
    task_id: str,
    smiles_column: Optional[str] = None,
    activity_column: Optional[str] = None,
) -> ConversionStats

Convert CSV file to JSONL.GZ format.

Parameters:

Name Type Description Default
input_path Union[str, Path]

Path to input CSV file.

required
output_path Union[str, Path]

Path for output JSONL.GZ file.

required
task_id str

Task/assay ID for the dataset.

required
smiles_column Optional[str]

Name of SMILES column (auto-detected if None).

None
activity_column Optional[str]

Name of activity column (auto-detected if None).

None

Returns:

Type Description
ConversionStats

ConversionStats with conversion statistics.

Usage Examples

from themap.data.converter import CSVConverter
from pathlib import Path

# Initialize converter
converter = CSVConverter(
    validate_smiles=True,
    auto_detect_columns=True
)

# Convert a CSV file
stats = converter.convert(
    input_path=Path("data/raw.csv"),
    output_path=Path("datasets/train/CHEMBL123456.jsonl.gz"),
    task_id="CHEMBL123456"
)

print(f"Converted {stats.valid_molecules}/{stats.total_rows} molecules")
print(f"Success rate: {stats.success_rate:.1f}%")

Specifying Column Names

from themap.data.converter import CSVConverter
from pathlib import Path

converter = CSVConverter(validate_smiles=True)

stats = converter.convert(
    input_path=Path("data.csv"),
    output_path=Path("output.jsonl.gz"),
    task_id="my_task",
    smiles_column="canonical_smiles",
    activity_column="pIC50"
)

Batch Conversion

from themap.data.converter import CSVConverter
from pathlib import Path

converter = CSVConverter()

# Convert multiple files
csv_files = Path("raw_data").glob("*.csv")

for csv_file in csv_files:
    task_id = csv_file.stem
    output_path = Path(f"datasets/train/{task_id}.jsonl.gz")

    stats = converter.convert(csv_file, output_path, task_id)
    print(f"{task_id}: {stats.valid_molecules} molecules")

Tasks

themap.data.tasks.Tasks

Collection of tasks for molecular property prediction across different folds.

This class manages multiple Task objects and provides unified access to molecular, protein, and metadata features across train/validation/test splits.

__init__

__init__(
    train_tasks: Optional[List[Task]] = None,
    valid_tasks: Optional[List[Task]] = None,
    test_tasks: Optional[List[Task]] = None,
    cache_dir: Optional[Union[str, Path]] = None,
) -> None

Initialize Tasks collection.

Parameters:

Name Type Description Default
train_tasks Optional[List[Task]]

List of training tasks

None
valid_tasks Optional[List[Task]]

List of validation tasks

None
test_tasks Optional[List[Task]]

List of test tasks

None
cache_dir Optional[Union[str, Path]]

Directory for persistent caching

None

from_directory staticmethod

from_directory(
    directory: Union[str, RichPath],
    task_list_file: Optional[Union[str, RichPath]] = None,
    cache_dir: Optional[Union[str, Path]] = None,
    load_molecules: bool = True,
    load_proteins: bool = True,
    load_metadata: bool = True,
    metadata_types: Optional[List[str]] = None,
    **kwargs: Any,
) -> Tasks

Create Tasks from a directory structure.

Expected directory structure:

directory/
├── train/
│   ├── CHEMBL123.jsonl.gz    (molecules)
│   ├── CHEMBL123.fasta       (proteins)
│   ├── CHEMBL123_assay.json  (metadata)
│   └── ...
├── valid/
└── test/

Parameters:

Name Type Description Default
directory Union[str, RichPath]

Base directory containing task data

required
task_list_file Optional[Union[str, RichPath]]

JSON file with fold-specific task lists

None
cache_dir Optional[Union[str, Path]]

Directory for persistent caching

None
load_molecules bool

Whether to load molecular data

True
load_proteins bool

Whether to load protein data

True
load_metadata bool

Whether to load metadata

True
metadata_types Optional[List[str]]

List of metadata types to load

None
**kwargs Any

Additional arguments

{}

Returns:

Type Description
Tasks

Tasks instance with loaded data

get_num_fold_tasks

get_num_fold_tasks(fold: DataFold) -> int

Get number of tasks in a specific fold.

get_task_ids

get_task_ids(fold: DataFold) -> List[str]

Get list of task IDs in a specific fold.

get_tasks

get_tasks(fold: DataFold) -> List[Task]

Get list of tasks in a specific fold.

__len__

__len__() -> int

Get number of tasks.

__getitem__

__getitem__(index: int) -> List[Task]

Get a task by index.

Parameters:

Name Type Description Default
index int

int: index of the task

required

Returns:

Type Description
List[Task]

List[Task]: list of tasks

Note

  • index 0: Train tasks
  • index 1: Validation tasks
  • index 2: Test tasks

Raises:

Type Description
IndexError

if index is out of range
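
Per the note above, indexing a Tasks collection returns the fold task lists (assuming tasks is a Tasks instance):

train_tasks = tasks[0]  # training tasks
valid_tasks = tasks[1]  # validation tasks
test_tasks = tasks[2]   # test tasks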

get_task_by_id

get_task_by_id(task_id: str) -> Optional[Task]

Get a specific task by its ID.

compute_all_task_features

compute_all_task_features(
    molecule_featurizer: Optional[str] = None,
    protein_featurizer: Optional[str] = None,
    metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
    combination_method: str = "concatenate",
    folds: Optional[List[DataFold]] = None,
    force_recompute: bool = False,
    **kwargs: Any,
) -> Dict[str, NDArray[np.float32]]

Compute combined features for all tasks.

Parameters:

Name Type Description Default
molecule_featurizer Optional[str]

Molecular featurizer name

None
protein_featurizer Optional[str]

Protein featurizer name

None
metadata_configs Optional[Dict[str, Dict[str, Any]]]

Metadata featurizer configurations

None
combination_method str

How to combine features

'concatenate'
folds Optional[List[DataFold]]

List of folds to process

None
force_recompute bool

Whether to force recomputation

False
**kwargs Any

Additional arguments

{}

Returns:

Type Description
Dict[str, NDArray[float32]]

Dictionary mapping task names to combined features

get_distance_computation_ready_features

get_distance_computation_ready_features(
    molecule_featurizer: Optional[str] = None,
    protein_featurizer: Optional[str] = None,
    metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
    combination_method: str = "concatenate",
    source_fold: DataFold = DataFold.TRAIN,
    target_folds: Optional[List[DataFold]] = None,
    **kwargs: Any,
) -> Tuple[
    List[NDArray[np.float32]], List[NDArray[np.float32]], List[str], List[str]
]

Get task features organized for efficient N×M distance matrix computation.

Parameters:

Name Type Description Default
molecule_featurizer Optional[str]

Molecular featurizer name

None
protein_featurizer Optional[str]

Protein featurizer name

None
metadata_configs Optional[Dict[str, Dict[str, Any]]]

Metadata featurizer configurations

None
combination_method str

How to combine features

'concatenate'
source_fold DataFold

Fold to use as source tasks (N)

TRAIN
target_folds Optional[List[DataFold]]

Folds to use as target tasks (M)

None
**kwargs Any

Additional arguments

{}

Returns:

Type Description
Tuple[List[NDArray[float32]], List[NDArray[float32]], List[str], List[str]]

Tuple containing:

  • source_features: List of feature arrays for source tasks
  • target_features: List of feature arrays for target tasks
  • source_names: List of source task names
  • target_names: List of target task names
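
For illustration, the returned lists can be stacked into an N×M distance matrix; a minimal sketch assuming each task's combined features form a single 1-D vector:

import numpy as np

# Stand-ins for the lists returned by get_distance_computation_ready_features
source_features = [np.random.rand(16).astype(np.float32) for _ in range(3)]  # N source tasks
target_features = [np.random.rand(16).astype(np.float32) for _ in range(2)]  # M target tasks

src = np.stack(source_features)  # (N, d)
tgt = np.stack(target_features)  # (M, d)

# Pairwise Euclidean distances, shape (N, M)
distances = np.linalg.norm(src[:, None, :] - tgt[None, :, :], axis=-1)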

save_task_features_to_file

save_task_features_to_file(
    output_path: Union[str, Path],
    molecule_featurizer: Optional[str] = None,
    protein_featurizer: Optional[str] = None,
    metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
    combination_method: str = "concatenate",
    folds: Optional[List[DataFold]] = None,
    **kwargs: Any,
) -> None

Save computed task features to a pickle file for efficient loading.

load_task_features_from_file staticmethod

load_task_features_from_file(
    file_path: Union[str, Path],
) -> Dict[str, NDArray[np.float32]]

Load precomputed task features from a pickle file.

get_cache_stats

get_cache_stats() -> Dict[str, Any]

Get statistics about feature caching.

Usage Examples

from themap.data.tasks import Tasks, DataFold

# Load tasks from directory
tasks = Tasks.from_directory(
    directory="datasets/",
    task_list_file="datasets/sample_tasks_list.json",
    load_molecules=True,
    load_proteins=True
)

print(f"Train tasks: {tasks.get_num_fold_tasks('TRAIN')}")
print(f"Test tasks: {tasks.get_num_fold_tasks('TEST')}")

Accessing Tasks

# Get task IDs by fold
train_ids = tasks.get_task_ids(DataFold.TRAIN)
test_ids = tasks.get_task_ids(DataFold.TEST)

# Get specific task
task = tasks.get_task_by_id("CHEMBL123456")
print(f"Task {task.task_id}: {len(task.molecule_dataset)} molecules")

Working with Features

# Compute features for all tasks
all_features = tasks.compute_all_task_features(
    molecule_featurizer="ecfp",
    protein_featurizer="esm2_t33_650M_UR50D",
    folds=[DataFold.TRAIN, DataFold.TEST]
)

# Get features ready for distance computation
source_features, target_features, source_names, target_names = (
    tasks.get_distance_computation_ready_features(
        molecule_featurizer="ecfp",
        source_fold=DataFold.TRAIN,
        target_folds=[DataFold.TEST]
    )
)

Task Class

Task

themap.data.tasks.Task dataclass

A task represents a complete molecular property prediction problem.

Each task contains:

  • Dataset: MoleculeDataset (set of molecules with SMILES and labels)
  • Metadata: Various metadata types including protein (single vectors per task)

Parameters:

Name Type Description Default
task_id str

Unique identifier for the task (e.g., CHEMBL ID)

required
molecule_dataset Optional[MoleculeDataset]

The dataset (set of molecules) for this task

None
metadata_datasets Optional[Dict[str, Any]]

Dictionary of metadata by type. Can include "protein" for protein metadata (single vector per task), as well as "assay_description", "target_info", etc.

None
hardness Optional[float]

Optional measure of task difficulty

None
Note

protein_dataset is deprecated - protein data should be stored in metadata_datasets["protein"]

__post_init__

__post_init__() -> None

Validate task initialization and handle backward compatibility.

get_molecule_features

get_molecule_features(
    featurizer_name: str, **kwargs: Any
) -> Optional[NDArray[np.float32]]

Get molecular features for this task.

This method returns pre-computed features if available (set via set_features or FeaturizationPipeline), or computes features on demand using the specified featurizer.

Parameters:

Name Type Description Default
featurizer_name str

Name of molecular featurizer to use

required
**kwargs Any

Additional featurizer arguments

{}

Returns:

Type Description
Optional[NDArray[float32]]

Molecular features or None if no molecule data

get_protein_features

get_protein_features(
    featurizer_name: str = "esm2_t33_650M_UR50D", layer: int = 33, **kwargs: Any
) -> Optional[NDArray[np.float32]]

Get protein features for this task.

Parameters:

Name Type Description Default
featurizer_name str

Name of protein featurizer to use

'esm2_t33_650M_UR50D'
layer int

Layer number for ESM models

33
**kwargs Any

Additional featurizer arguments

{}

Returns:

Type Description
Optional[NDArray[float32]]

Protein features or None if no protein data

get_metadata_features

get_metadata_features(
    metadata_type: str, featurizer_name: str, **kwargs: Any
) -> Optional[NDArray[np.float32]]

Get metadata features for this task.

Parameters:

Name Type Description Default
metadata_type str

Type of metadata to get features for

required
featurizer_name str

Name of metadata featurizer to use

required
**kwargs Any

Additional featurizer arguments

{}

Returns:

Type Description
Optional[NDArray[float32]]

Metadata features or None if metadata type not available

get_combined_features

get_combined_features(
    molecule_featurizer: Optional[str] = None,
    protein_featurizer: Optional[str] = None,
    metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
    combination_method: str = "concatenate",
    **kwargs: Any,
) -> NDArray[np.float32]

Get combined features from all available data types.

Parameters:

Name Type Description Default
molecule_featurizer Optional[str]

Molecular featurizer name

None
protein_featurizer Optional[str]

Protein featurizer name

None
metadata_configs Optional[Dict[str, Dict[str, Any]]]

Dict mapping metadata types to featurizer configs

None
combination_method str

How to combine features ('concatenate', 'average', 'weighted_average')

'concatenate'
**kwargs Any

Additional arguments

{}

Returns:

Type Description
NDArray[float32]

Combined feature vector
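
As a rough illustration of the combination methods named above (assumed semantics; 'average' and 'weighted_average' presuppose equal feature dimensions):

import numpy as np

# Hypothetical per-task feature vectors of equal dimension
mol_features = np.random.rand(8).astype(np.float32)
prot_features = np.random.rand(8).astype(np.float32)

combined_concat = np.concatenate([mol_features, prot_features])  # 'concatenate'
combined_avg = np.mean([mol_features, prot_features], axis=0)    # 'average'
combined_weighted = np.average(                                  # 'weighted_average'
    [mol_features, prot_features], axis=0, weights=[0.7, 0.3]
)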

get_task_embedding

get_task_embedding(data_model: Any, metadata_model: Any) -> NDArray[np.float32]

Legacy method for backward compatibility.

Parameters:

Name Type Description Default
data_model Any

Model for data feature extraction

required
metadata_model Any

Model for metadata feature extraction

required

Returns:

Type Description
NDArray[float32]

Combined feature vector

__len__

__len__() -> int

Get the number of molecules in the task's dataset.

Usage Examples

from themap.data.tasks import Task

# Access task data
task = tasks.get_task_by_id("CHEMBL123456")

# Molecular data
if task.molecule_dataset:
    smiles = task.molecule_dataset.smiles_list
    labels = task.molecule_dataset.labels

# Protein metadata (protein_dataset is deprecated; protein data lives
# in metadata_datasets["protein"])
if task.metadata_datasets and "protein" in task.metadata_datasets:
    protein_metadata = task.metadata_datasets["protein"]

# Get features
mol_features = task.get_molecule_features("ecfp")
prot_features = task.get_protein_features("esm2_t33_650M_UR50D")

TorchDataset

TorchMoleculeDataset

themap.data.torch_dataset.TorchMoleculeDataset

Bases: Dataset

Enhanced PyTorch Dataset wrapper for molecular data.

This class wraps a MoleculeDataset to provide PyTorch Dataset functionality while maintaining access to all original MoleculeDataset methods through delegation.

Parameters:

Name Type Description Default
data MoleculeDataset

MoleculeDataset object

required
transform callable

Transform to apply to features

None
target_transform callable

Transform to apply to labels

None
lazy_loading bool

Whether to load data lazily. Defaults to False.

False
Example

import torch
from themap.data import MoleculeDataset
from themap.data.torch_dataset import TorchMoleculeDataset

# Load molecular dataset
mol_dataset = MoleculeDataset.load_from_file("data.jsonl.gz")

# Create PyTorch wrapper
torch_dataset = TorchMoleculeDataset(mol_dataset)

# Use as a PyTorch Dataset
dataloader = torch.utils.data.DataLoader(torch_dataset, batch_size=32)

# Access original methods through delegation
stats = torch_dataset.get_statistics()
features = torch_dataset.get_features("ecfp")

dataset property

dataset: MoleculeDataset

Access to the underlying MoleculeDataset.

Returns:

Name Type Description
MoleculeDataset MoleculeDataset

The wrapped dataset

__init__

__init__(
    data: MoleculeDataset,
    transform: Optional[Callable] = None,
    target_transform: Optional[Callable] = None,
    lazy_loading: bool = False,
) -> None

Initialize TorchMoleculeDataset.

Parameters:

Name Type Description Default
data MoleculeDataset

Input molecular dataset

required
transform callable

Transform to apply to features

None
target_transform callable

Transform to apply to labels

None
lazy_loading bool

Whether to load tensors lazily

False

Raises:

Type Description
ValueError

If the dataset is empty or features/labels are invalid

TypeError

If data is not a MoleculeDataset instance

__getitem__

__getitem__(index: int) -> tuple[torch.Tensor, torch.Tensor]

Get a data sample.

Parameters:

Name Type Description Default
index int

Index of the sample to get

required

Returns:

Type Description
tuple[Tensor, Tensor]

tuple[torch.Tensor, torch.Tensor]: Tuple of (features, label)

Raises:

Type Description
IndexError

If index is out of bounds

RuntimeError

If lazy loading fails

__len__

__len__() -> int

Get the number of samples in the dataset.

Returns:

Name Type Description
int int

Number of samples

__repr__

__repr__() -> str

String representation of the dataset.

__getattr__

__getattr__(name: str) -> Any

Delegate attribute access to underlying MoleculeDataset.

Parameters:

Name Type Description Default
name str

Attribute name

required

Returns:

Type Description
Any

The attribute from the underlying dataset

Raises:

Type Description
AttributeError

If attribute doesn't exist in underlying dataset

get_smiles

get_smiles() -> list[str]

Get SMILES strings for all molecules.

Returns:

Type Description
list[str]

list[str]: List of SMILES strings

refresh_tensors

refresh_tensors() -> None

Refresh cached tensors from the underlying dataset.

Useful when the underlying dataset has been modified.

create_dataloader classmethod

create_dataloader(
    data: MoleculeDataset,
    batch_size: int = 64,
    shuffle: bool = True,
    transform: Optional[Callable] = None,
    target_transform: Optional[Callable] = None,
    lazy_loading: bool = False,
    **kwargs: Any,
) -> torch.utils.data.DataLoader

Create PyTorch DataLoader with enhanced options.

Parameters:

Name Type Description Default
data MoleculeDataset

Input molecular dataset

required
batch_size int

Batch size

64
shuffle bool

Whether to shuffle data

True
transform Optional[Callable]

Transform to apply to features

None
target_transform Optional[Callable]

Transform to apply to labels

None
lazy_loading bool

Whether to use lazy loading

False
**kwargs Any

Additional arguments for DataLoader

{}

Returns:

Name Type Description
DataLoader DataLoader

PyTorch data loader

Example

loader = TorchMoleculeDataset.create_dataloader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,
)

Usage Examples

from themap.data.torch_dataset import TorchMoleculeDataset
from torch.utils.data import DataLoader

# Create PyTorch dataset wrapper
torch_dataset = TorchMoleculeDataset(molecule_dataset)

# Use with DataLoader
dataloader = DataLoader(
    torch_dataset,
    batch_size=32,
    shuffle=True
)

for batch in dataloader:
    features, labels = batch
    # Train your model...

Data Format

JSONL.GZ Format

THEMAP uses compressed JSON Lines format for molecular data:

{"SMILES": "CCO", "Property": 1}
{"SMILES": "CCCO", "Property": 0}
{"SMILES": "CC(=O)O", "Property": 1}

Directory Structure

datasets/
├── sample_tasks_list.json      # Task organization
├── train/
│   ├── CHEMBL123456.jsonl.gz   # Molecular data
│   ├── CHEMBL123456.fasta      # Protein sequences
│   └── ...
├── test/
│   └── ...
└── valid/
    └── ...

Task List Format

{
    "train": ["CHEMBL123456", "CHEMBL789012", ...],
    "test": ["CHEMBL111111", "CHEMBL222222", ...],
    "valid": ["CHEMBL333333", ...]
}
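
The file is plain JSON, so generating one for DatasetLoader or Tasks.from_directory is straightforward:

import json
from pathlib import Path

task_list = {
    "train": ["CHEMBL123456", "CHEMBL789012"],
    "test": ["CHEMBL111111", "CHEMBL222222"],
    "valid": ["CHEMBL333333"],
}
Path("datasets/sample_tasks_list.json").write_text(json.dumps(task_list, indent=4))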

Utility Functions

Validation

from themap.data.molecule_dataset import validate_smiles

# Validate a SMILES string
is_valid = validate_smiles("CCO")
print(f"Valid: {is_valid}")  # True

is_valid = validate_smiles("invalid")
print(f"Valid: {is_valid}")  # False

Canonicalization

from themap.data.molecule_dataset import canonicalize_smiles

# Canonicalize SMILES
canonical = canonicalize_smiles("C(C)O")
print(canonical)  # "CCO"

Error Handling

from themap.data import DatasetLoader, MoleculeDataset

try:
    loader = DatasetLoader(data_dir="datasets")
    dataset = loader.load_dataset("CHEMBL123456", fold="train")
except FileNotFoundError:
    print("Dataset file not found")
except ValueError as e:
    print(f"Invalid data format: {e}")

Performance Tips

  1. Lazy loading: Use DatasetLoader to load datasets on demand
  2. Caching: Enable feature caching for repeated computations
  3. Batch processing: Process datasets in batches for memory efficiency
  4. Bulk loading: Use load_all_folds() to load every available fold in one call, as shown below

from themap.data import DatasetLoader

# Load train/test/valid folds in one pass
loader = DatasetLoader(data_dir="datasets")
datasets_by_fold = loader.load_all_folds()