Data Module¶
The data module provides classes and utilities for loading, managing, and converting molecular and protein datasets.
Overview¶
The data system consists of these main components:
- MoleculeDataset - Container for molecular data (SMILES, labels)
- DatasetLoader - Load datasets from directory structures
- CSVConverter - Convert CSV files to JSONL.GZ format
- Tasks - Unified task management across train/test/valid splits
MoleculeDataset¶
themap.data.molecule_dataset.MoleculeDataset dataclass ¶
Simplified dataset structure for molecules.
Optimized for batch distance computation between tasks. Stores SMILES strings and labels directly without per-molecule object overhead.
Attributes:

| Name | Type | Description |
|---|---|---|
| task_id | str | String identifying the task this dataset belongs to. |
| smiles_list | List[str] | List of SMILES strings for all molecules. |
| labels | NDArray[int32] | Binary labels as numpy array (0/1). |
| numeric_labels | Optional[NDArray[float32]] | Optional continuous labels (e.g., pIC50). |
| _features | Optional[NDArray[float32]] | Precomputed feature matrix (set via set_features or pipeline). |
| _featurizer_name | Optional[str] | Name of featurizer used for current features. |
Examples:
>>> dataset = MoleculeDataset.load_from_file("datasets/train/CHEMBL123.jsonl.gz")
>>> print(len(dataset)) # Number of molecules
>>> print(dataset.positive_ratio) # Ratio of positive labels
>>> # Features are set externally via FeaturizationPipeline
>>> dataset.set_features(features_array, "ecfp")
>>> pos_proto, neg_proto = dataset.get_prototype()
featurizer_name property ¶
Get name of featurizer used for current features.
datapoints property ¶
Legacy property for backward compatibility with metalearning module.
Returns list of dictionaries with molecule data.
set_features ¶
Set precomputed features for this dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| features | NDArray[float32] | Feature matrix of shape (n_molecules, feature_dim) | required |
| featurizer_name | str | Name of the featurizer used | required |

Raises:

| Type | Description |
|---|---|
| ValueError | If feature dimensions don't match dataset size |
get_features ¶
Get molecular features, computing on demand if necessary.
This method returns pre-computed features if available (set via set_features or FeaturizationPipeline), or computes features on demand using the specified featurizer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| featurizer_name | str | Name of molecular featurizer to use (e.g., "ecfp", "maccs", "desc2D") | 'ecfp' |
| **kwargs | Any | Additional featurizer arguments | {} |

Returns:

| Type | Description |
|---|---|
| NDArray[float32] | Feature matrix of shape (n_molecules, feature_dim) |

Raises:

| Type | Description |
|---|---|
| ValueError | If no molecules in dataset or featurization fails |
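For instance, a minimal sketch of on-demand featurization (assuming a dataset loaded as in the examples above):
features = dataset.get_features("ecfp")  # computed on demand and cached
print(features.shape)  # (n_molecules, feature_dim)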
get_prototype ¶
get_prototype(
featurizer_name: Optional[str] = None,
) -> Tuple[NDArray[np.float32], NDArray[np.float32]]
Compute positive and negative prototypes from features.
Prototypes are the mean feature vectors for each class.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| featurizer_name | Optional[str] | Optional featurizer name. If provided and features aren't yet computed, they will be computed on demand. | None |

Returns:

| Type | Description |
|---|---|
| Tuple[NDArray[float32], NDArray[float32]] | Tuple of (positive_prototype, negative_prototype) |

Raises:

| Type | Description |
|---|---|
| ValueError | If features haven't been set or no examples exist for a class |
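A short sketch (assuming a loaded dataset with at least one example per class):
pos_proto, neg_proto = dataset.get_prototype(featurizer_name="ecfp")
print(pos_proto.shape)  # (feature_dim,)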
get_class_features ¶
Get features separated by class.
Returns:

| Type | Description |
|---|---|
| Tuple[NDArray[float32], NDArray[float32]] | Tuple of (positive_features, negative_features) |

Raises:

| Type | Description |
|---|---|
| ValueError | If features haven't been set |
load_from_file staticmethod ¶
Load dataset from a JSONL.GZ file.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Union[str, RichPath, Path] | Path to the JSONL.GZ file. | required |

Returns:

| Type | Description |
|---|---|
| MoleculeDataset | MoleculeDataset with loaded SMILES and labels. |
from_dict classmethod ¶
Create dataset from dictionary representation.
filter_by_indices ¶
Create a new dataset with only the specified indices.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| indices | List[int] | List of indices to keep | required |

Returns:

| Type | Description |
|---|---|
| MoleculeDataset | New MoleculeDataset with filtered data |
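For example, to keep only the first ten molecules (a minimal sketch, assuming a loaded dataset):
subset = dataset.filter_by_indices(list(range(10)))
print(len(subset))  # 10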
get_statistics ¶
Get basic statistics about the dataset.
Returns:

| Type | Description |
|---|---|
| Dict[str, Any] | Dictionary with dataset statistics |
Usage Examples¶
from themap.data import MoleculeDataset
# Load from JSONL.GZ file
dataset = MoleculeDataset.load_from_file("datasets/train/CHEMBL123456.jsonl.gz")
print(f"Number of molecules: {len(dataset)}")
print(f"SMILES: {dataset.smiles_list[:3]}")
print(f"Labels: {dataset.labels[:3]}")
Creating from Data¶
import numpy as np
from themap.data import MoleculeDataset
# Create from in-memory data (labels are stored as a numpy int array)
dataset = MoleculeDataset(
    task_id="my_task",
    smiles_list=["CCO", "CCCO", "CC(=O)O"],
    labels=np.array([1, 0, 1], dtype=np.int32),
)
# Save to file
dataset.to_jsonl_gz("output/my_task.jsonl.gz")
DatasetLoader¶
themap.data.loader.DatasetLoader ¶
Load datasets from directory structure.
Supports the following directory structure:
data_dir/
├── train/ # Training tasks (source)
│ ├── TASK1.csv
│ ├── TASK2.jsonl.gz
│ └── ...
├── test/ # Test tasks (target)
│ ├── TASK3.csv
│ └── ...
├── valid/ # Optional validation tasks
│ └── ...
├── proteins/ # Optional protein FASTA files
│ ├── TASK1.fasta
│ └── ...
└── tasks.json # Optional task list
If tasks.json is not provided, all CSV/JSONL.GZ files are auto-discovered.
Attributes:

| Name | Type | Description |
|---|---|---|
| data_dir | | Root directory containing train/test/valid folders. |
| task_list | Optional[Dict[str, List[str]]] | Optional task list loaded from tasks.json. |
Examples:
>>> loader = DatasetLoader(Path("datasets/TDC"))
>>> train_datasets = loader.load_datasets("train")
>>> test_datasets = loader.load_datasets("test")
>>> # Get task IDs
>>> train_ids = list(train_datasets.keys())
__init__ ¶
Initialize the dataset loader.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data_dir | Union[str, Path] | Root directory containing train/test/valid folders. | required |
| task_list_file | Optional[str] | Optional name of task list JSON file in data_dir. If None, all files are auto-discovered. | None |
get_fold_dir ¶
Get the directory for a specific fold.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| fold | str | Fold name (train, test, or valid). | required |

Returns:

| Type | Description |
|---|---|
| Path | Path to the fold directory. |

Raises:

| Type | Description |
|---|---|
| ValueError | If fold name is invalid. |
get_task_ids ¶
Get list of task IDs for a fold.
If task_list is provided, uses that. Otherwise auto-discovers files.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| fold | str | Fold name (train, test, or valid). | required |

Returns:

| Type | Description |
|---|---|
| List[str] | List of task IDs. |
load_dataset ¶
Load a single dataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| fold | str | Fold name (train, test, or valid). | required |
| task_id | str | Task ID to load. | required |
| convert_csv | bool | If True, convert CSV to JSONL.GZ format automatically. | True |

Returns:

| Type | Description |
|---|---|
| MoleculeDataset | MoleculeDataset instance. |

Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If dataset file not found. |
load_datasets ¶
load_datasets(
fold: str, task_ids: Optional[List[str]] = None, convert_csv: bool = True
) -> Dict[str, MoleculeDataset]
Load all datasets for a fold.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| fold | str | Fold name (train, test, or valid). | required |
| task_ids | Optional[List[str]] | Optional list of specific task IDs to load. If None, loads all tasks in the fold. | None |
| convert_csv | bool | If True, convert CSV to JSONL.GZ format automatically. | True |

Returns:

| Type | Description |
|---|---|
| Dict[str, MoleculeDataset] | Dictionary mapping task IDs to MoleculeDataset instances. |
load_all_folds ¶
Load datasets from all available folds.
Returns:

| Type | Description |
|---|---|
| Dict[str, Dict[str, MoleculeDataset]] | Dictionary mapping fold names to dictionaries of datasets. |
get_protein_file ¶
Get path to protein FASTA file for a task.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| task_id | str | Task ID to find protein for. | required |

Returns:

| Type | Description |
|---|---|
| Optional[Path] | Path to FASTA file, or None if not found. |
load_protein_sequences ¶
Load all protein sequences from the proteins directory.
Returns:

| Type | Description |
|---|---|
| Dict[str, str] | Dictionary mapping task IDs to protein sequences. |
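A minimal sketch of the protein helpers (assuming a proteins/ folder with FASTA files named by task ID):
loader = DatasetLoader(data_dir="datasets")
fasta_path = loader.get_protein_file("CHEMBL123456")  # Path to FASTA, or None
sequences = loader.load_protein_sequences()  # {task_id: sequence}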
get_statistics ¶
Get statistics about available datasets.
Returns:

| Type | Description |
|---|---|
| Dict[str, Any] | Dictionary with dataset counts and information. |
Usage Examples¶
from themap.data import DatasetLoader
# Initialize loader
loader = DatasetLoader(
    data_dir="datasets",
    task_list_file="sample_tasks_list.json"  # file name resolved inside data_dir
)
# Load all datasets for each fold
train_datasets = loader.load_datasets("train")
test_datasets = loader.load_datasets("test")
print(f"Loaded {len(train_datasets)} training datasets")
print(f"Loaded {len(test_datasets)} test datasets")
Loading Specific Tasks¶
from themap.data import DatasetLoader
loader = DatasetLoader(data_dir="datasets")
# Load specific dataset
dataset = loader.load_dataset(
task_id="CHEMBL123456",
fold="train"
)
# Load multiple datasets
datasets = loader.load_datasets(
task_ids=["CHEMBL123456", "CHEMBL789012"],
fold="train"
)
Get Dataset Statistics¶
from themap.data import DatasetLoader
loader = DatasetLoader(data_dir="datasets")
stats = loader.get_statistics()
print(f"Data directory: {stats['data_dir']}")
for fold, fold_stats in stats['folds'].items():
print(f" {fold}: {fold_stats['task_count']} tasks")
CSVConverter¶
themap.data.converter.CSVConverter ¶
Convert CSV files to JSONL.GZ format for THEMAP.
Supports auto-detection of SMILES and activity columns, RDKit-based SMILES validation, and various CSV formats.
Examples:
>>> converter = CSVConverter()
>>> stats = converter.convert("input.csv", "output.jsonl.gz", "CHEMBL123")
>>> print(f"Converted {stats.valid_molecules} molecules")
__init__ ¶
__init__(
validate_smiles: bool = True,
strict_validation: bool = True,
auto_detect_columns: bool = True,
)
Initialize the converter.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| validate_smiles | bool | Whether to validate SMILES with RDKit. | True |
| strict_validation | bool | If True, use strict sanitization. | True |
| auto_detect_columns | bool | If True, auto-detect column names. | True |
read_csv ¶
read_csv(
path: Union[str, Path],
smiles_column: Optional[str] = None,
activity_column: Optional[str] = None,
) -> Dict[str, Any]
Read CSV file and extract SMILES and labels.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| path | Union[str, Path] | Path to the CSV file. | required |
| smiles_column | Optional[str] | Name of the SMILES column (auto-detected if None). | None |
| activity_column | Optional[str] | Name of the activity column (auto-detected if None). | None |

Returns:

| Type | Description |
|---|---|
| Dict[str, Any] | Dictionary with 'smiles', 'labels', 'numeric_labels' keys. |
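For example, to inspect a CSV before converting it (a minimal sketch; data.csv is a hypothetical file with SMILES and activity columns):
converter = CSVConverter()
parsed = converter.read_csv("data.csv")  # columns auto-detected
print(len(parsed["smiles"]), parsed["labels"][:5])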
convert ¶
convert(
input_path: Union[str, Path],
output_path: Union[str, Path],
task_id: str,
smiles_column: Optional[str] = None,
activity_column: Optional[str] = None,
) -> ConversionStats
Convert CSV file to JSONL.GZ format.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_path | Union[str, Path] | Path to input CSV file. | required |
| output_path | Union[str, Path] | Path for output JSONL.GZ file. | required |
| task_id | str | Task/assay ID for the dataset. | required |
| smiles_column | Optional[str] | Name of SMILES column (auto-detected if None). | None |
| activity_column | Optional[str] | Name of activity column (auto-detected if None). | None |

Returns:

| Type | Description |
|---|---|
| ConversionStats | ConversionStats with conversion statistics. |
Usage Examples¶
from themap.data.converter import CSVConverter
from pathlib import Path
# Initialize converter
converter = CSVConverter(
validate_smiles=True,
auto_detect_columns=True
)
# Convert a CSV file
stats = converter.convert(
input_path=Path("data/raw.csv"),
output_path=Path("datasets/train/CHEMBL123456.jsonl.gz"),
task_id="CHEMBL123456"
)
print(f"Converted {stats.valid_molecules}/{stats.total_rows} molecules")
print(f"Success rate: {stats.success_rate:.1f}%")
Specifying Column Names¶
from themap.data.converter import CSVConverter
from pathlib import Path
converter = CSVConverter(validate_smiles=True)
stats = converter.convert(
input_path=Path("data.csv"),
output_path=Path("output.jsonl.gz"),
task_id="my_task",
smiles_column="canonical_smiles",
activity_column="pIC50"
)
Batch Conversion¶
from themap.data.converter import CSVConverter
from pathlib import Path
converter = CSVConverter()
# Convert multiple files
csv_files = Path("raw_data").glob("*.csv")
for csv_file in csv_files:
    task_id = csv_file.stem
    output_path = Path(f"datasets/train/{task_id}.jsonl.gz")
    stats = converter.convert(csv_file, output_path, task_id)
    print(f"{task_id}: {stats.valid_molecules} molecules")
Tasks¶
themap.data.tasks.Tasks ¶
Collection of tasks for molecular property prediction across different folds.
This class manages multiple Task objects and provides unified access to molecular, protein, and metadata features across train/validation/test splits.
__init__ ¶
__init__(
train_tasks: Optional[List[Task]] = None,
valid_tasks: Optional[List[Task]] = None,
test_tasks: Optional[List[Task]] = None,
cache_dir: Optional[Union[str, Path]] = None,
) -> None
Initialize Tasks collection.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| train_tasks | Optional[List[Task]] | List of training tasks | None |
| valid_tasks | Optional[List[Task]] | List of validation tasks | None |
| test_tasks | Optional[List[Task]] | List of test tasks | None |
| cache_dir | Optional[Union[str, Path]] | Directory for persistent caching | None |
from_directory staticmethod ¶
from_directory(
directory: Union[str, RichPath],
task_list_file: Optional[Union[str, RichPath]] = None,
cache_dir: Optional[Union[str, Path]] = None,
load_molecules: bool = True,
load_proteins: bool = True,
load_metadata: bool = True,
metadata_types: Optional[List[str]] = None,
**kwargs: Any,
) -> Tasks
Create Tasks from a directory structure.
Expected directory structure:

directory/
├── train/
│   ├── CHEMBL123.jsonl.gz (molecules)
│   ├── CHEMBL123.fasta (proteins)
│   ├── CHEMBL123_assay.json (metadata)
│   └── ...
├── valid/
└── test/
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| directory | Union[str, RichPath] | Base directory containing task data | required |
| task_list_file | Optional[Union[str, RichPath]] | JSON file with fold-specific task lists | None |
| cache_dir | Optional[Union[str, Path]] | Directory for persistent caching | None |
| load_molecules | bool | Whether to load molecular data | True |
| load_proteins | bool | Whether to load protein data | True |
| load_metadata | bool | Whether to load metadata | True |
| metadata_types | Optional[List[str]] | List of metadata types to load | None |
| **kwargs | Any | Additional arguments | {} |

Returns:

| Type | Description |
|---|---|
| Tasks | Tasks instance with loaded data |
get_num_fold_tasks ¶
Get number of tasks in a specific fold.
__getitem__ ¶
Get the tasks of a fold by index.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| index | int | Index of the fold to retrieve | required |

Returns:

| Type | Description |
|---|---|
| List[Task] | List of tasks in the selected fold |

Note
- index 0: train tasks
- index 1: validation tasks
- index 2: test tasks

Raises:

| Type | Description |
|---|---|
| IndexError | If index is out of range |
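For example (a minimal sketch, assuming tasks was created with Tasks.from_directory):
train_tasks = tasks[0]  # training tasks
valid_tasks = tasks[1]  # validation tasks
test_tasks = tasks[2]  # test tasks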
compute_all_task_features ¶
compute_all_task_features(
molecule_featurizer: Optional[str] = None,
protein_featurizer: Optional[str] = None,
metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
combination_method: str = "concatenate",
folds: Optional[List[DataFold]] = None,
force_recompute: bool = False,
**kwargs: Any,
) -> Dict[str, NDArray[np.float32]]
Compute combined features for all tasks.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| molecule_featurizer | Optional[str] | Molecular featurizer name | None |
| protein_featurizer | Optional[str] | Protein featurizer name | None |
| metadata_configs | Optional[Dict[str, Dict[str, Any]]] | Metadata featurizer configurations | None |
| combination_method | str | How to combine features | 'concatenate' |
| folds | Optional[List[DataFold]] | List of folds to process | None |
| force_recompute | bool | Whether to force recomputation | False |
| **kwargs | Any | Additional arguments | {} |

Returns:

| Type | Description |
|---|---|
| Dict[str, NDArray[float32]] | Dictionary mapping task names to combined features |
get_distance_computation_ready_features ¶
get_distance_computation_ready_features(
molecule_featurizer: Optional[str] = None,
protein_featurizer: Optional[str] = None,
metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
combination_method: str = "concatenate",
source_fold: DataFold = DataFold.TRAIN,
target_folds: Optional[List[DataFold]] = None,
**kwargs: Any,
) -> Tuple[
List[NDArray[np.float32]], List[NDArray[np.float32]], List[str], List[str]
]
Get task features organized for efficient N×M distance matrix computation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| molecule_featurizer | Optional[str] | Molecular featurizer name | None |
| protein_featurizer | Optional[str] | Protein featurizer name | None |
| metadata_configs | Optional[Dict[str, Dict[str, Any]]] | Metadata featurizer configurations | None |
| combination_method | str | How to combine features | 'concatenate' |
| source_fold | DataFold | Fold to use as source tasks (N) | TRAIN |
| target_folds | Optional[List[DataFold]] | Folds to use as target tasks (M) | None |
| **kwargs | Any | Additional arguments | {} |
Returns:

| Type | Description |
|---|---|
| Tuple[List[NDArray[float32]], List[NDArray[float32]], List[str], List[str]] | Tuple of (source_features, target_features, source_task_names, target_task_names) |
save_task_features_to_file ¶
save_task_features_to_file(
output_path: Union[str, Path],
molecule_featurizer: Optional[str] = None,
protein_featurizer: Optional[str] = None,
metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
combination_method: str = "concatenate",
folds: Optional[List[DataFold]] = None,
**kwargs: Any,
) -> None
Save computed task features to a pickle file for efficient loading.
load_task_features_from_file staticmethod ¶
Load precomputed task features from a pickle file.
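A sketch of the save/load round trip (the cache path is illustrative, and load_task_features_from_file is assumed to take the same path):
tasks.save_task_features_to_file(
    output_path="cache/task_features.pkl",
    molecule_featurizer="ecfp"
)
features = Tasks.load_task_features_from_file("cache/task_features.pkl")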
Usage Examples¶
from themap.data.tasks import DataFold, Tasks  # DataFold import path assumed
# Load tasks from directory
tasks = Tasks.from_directory(
directory="datasets/",
task_list_file="datasets/sample_tasks_list.json",
load_molecules=True,
load_proteins=True
)
print(f"Train tasks: {tasks.get_num_fold_tasks('TRAIN')}")
print(f"Test tasks: {tasks.get_num_fold_tasks('TEST')}")
Accessing Tasks¶
# Get task IDs by fold
train_ids = tasks.get_task_ids(fold=DataFold.TRAIN)
test_ids = tasks.get_task_ids(fold=DataFold.TEST)
# Get specific task
task = tasks.get_task("CHEMBL123456")
print(f"Task {task.task_id}: {len(task.molecule_dataset)} molecules")
Working with Features¶
# Compute features for all tasks
all_features = tasks.compute_all_task_features(
    molecule_featurizer="ecfp",
    protein_featurizer="esm2_t33_650M_UR50D",
    folds=[DataFold.TRAIN, DataFold.TEST]
)
# Get features ready for distance computation
source_features, target_features, source_names, target_names = (
    tasks.get_distance_computation_ready_features(
        molecule_featurizer="ecfp",
        source_fold=DataFold.TRAIN,
        target_folds=[DataFold.TEST]
    )
)
Task Class¶
Task¶
themap.data.tasks.Task dataclass ¶
A task represents a complete molecular property prediction problem.
Each task contains:
- Dataset: MoleculeDataset (set of molecules with SMILES and labels)
- Metadata: various metadata types, including protein (single vectors per task)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| task_id | str | Unique identifier for the task (e.g., CHEMBL ID) | required |
| molecule_dataset | Optional[MoleculeDataset] | THE dataset - set of molecules for this task | None |
| metadata_datasets | Optional[Dict[str, Any]] | Dictionary of metadata by type. Can include "protein" for protein metadata (single vector per task), "assay_description", "target_info", etc. | None |
| hardness | Optional[float] | Optional measure of task difficulty | None |
Note
protein_dataset is deprecated - protein data should be stored in metadata_datasets["protein"]
__post_init__ ¶
Validate task initialization and handle backward compatibility.
get_molecule_features ¶
Get molecular features for this task.
This method returns pre-computed features if available (set via set_features or FeaturizationPipeline), or computes features on demand using the specified featurizer.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| featurizer_name | str | Name of molecular featurizer to use | required |
| **kwargs | Any | Additional featurizer arguments | {} |

Returns:

| Type | Description |
|---|---|
| Optional[NDArray[float32]] | Molecular features, or None if no molecule data |
get_protein_features ¶
get_protein_features(
featurizer_name: str = "esm2_t33_650M_UR50D", layer: int = 33, **kwargs: Any
) -> Optional[NDArray[np.float32]]
Get protein features for this task.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| featurizer_name | str | Name of protein featurizer to use | 'esm2_t33_650M_UR50D' |
| layer | int | Layer number for ESM models | 33 |
| **kwargs | Any | Additional featurizer arguments | {} |

Returns:

| Type | Description |
|---|---|
| Optional[NDArray[float32]] | Protein features, or None if no protein data |
get_metadata_features ¶
get_metadata_features(
metadata_type: str, featurizer_name: str, **kwargs: Any
) -> Optional[NDArray[np.float32]]
Get metadata features for this task.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| metadata_type | str | Type of metadata to get features for | required |
| featurizer_name | str | Name of metadata featurizer to use | required |
| **kwargs | Any | Additional featurizer arguments | {} |

Returns:

| Type | Description |
|---|---|
| Optional[NDArray[float32]] | Metadata features, or None if the metadata type is not available |
get_combined_features ¶
get_combined_features(
molecule_featurizer: Optional[str] = None,
protein_featurizer: Optional[str] = None,
metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
combination_method: str = "concatenate",
**kwargs: Any,
) -> NDArray[np.float32]
Get combined features from all available data types.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| molecule_featurizer | Optional[str] | Molecular featurizer name | None |
| protein_featurizer | Optional[str] | Protein featurizer name | None |
| metadata_configs | Optional[Dict[str, Dict[str, Any]]] | Dict mapping metadata types to featurizer configs | None |
| combination_method | str | How to combine features ('concatenate', 'average', 'weighted_average') | 'concatenate' |
| **kwargs | Any | Additional arguments | {} |

Returns:

| Type | Description |
|---|---|
| NDArray[float32] | Combined feature vector |
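A sketch for a single task (assuming the task has both molecular and protein data loaded):
combined = task.get_combined_features(
    molecule_featurizer="ecfp",
    protein_featurizer="esm2_t33_650M_UR50D",
    combination_method="concatenate"
)
print(combined.shape)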
get_task_embedding ¶
Legacy method for backward compatibility.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data_model | Any | Model for data feature extraction | required |
| metadata_model | Any | Model for metadata feature extraction | required |

Returns:

| Type | Description |
|---|---|
| NDArray[float32] | Combined feature vector |
Usage Examples¶
from themap.data.tasks import Task
# Access task data
task = tasks.get_task("CHEMBL123456")
# Molecular data
if task.molecule_dataset:
    smiles = task.molecule_dataset.smiles_list
    labels = task.molecule_dataset.labels
# Protein metadata (protein_dataset is deprecated; use metadata_datasets["protein"])
if task.metadata_datasets and "protein" in task.metadata_datasets:
    protein_data = task.metadata_datasets["protein"]
# Get features
mol_features = task.get_molecule_features("ecfp")
prot_features = task.get_protein_features("esm2_t33_650M_UR50D")
TorchDataset¶
TorchMoleculeDataset¶
themap.data.torch_dataset.TorchMoleculeDataset ¶
Bases: Dataset
Enhanced PyTorch Dataset wrapper for molecular data.
This class wraps a MoleculeDataset to provide PyTorch Dataset functionality while maintaining access to all original MoleculeDataset methods through delegation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | MoleculeDataset | MoleculeDataset object | required |
| transform | callable | Transform to apply to features | None |
| target_transform | callable | Transform to apply to labels | None |
| lazy_loading | bool | Whether to load data lazily. Defaults to False. | False |
Example:

import torch
from themap.data import MoleculeDataset
from themap.data.torch_dataset import TorchMoleculeDataset

# Load molecular dataset
mol_dataset = MoleculeDataset.load_from_file("data.jsonl.gz")

# Create PyTorch wrapper
torch_dataset = TorchMoleculeDataset(mol_dataset)

# Use as PyTorch Dataset
dataloader = torch.utils.data.DataLoader(torch_dataset, batch_size=32)

# Access original methods through delegation
stats = torch_dataset.get_statistics()
features = torch_dataset.get_features("ecfp")
dataset property ¶
Access to the underlying MoleculeDataset.
Returns:

| Name | Type | Description |
|---|---|---|
| MoleculeDataset | MoleculeDataset | The wrapped dataset |
__init__ ¶
__init__(
data: MoleculeDataset,
transform: Optional[Callable] = None,
target_transform: Optional[Callable] = None,
lazy_loading: bool = False,
) -> None
Initialize TorchMoleculeDataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | MoleculeDataset | Input molecular dataset | required |
| transform | callable | Transform to apply to features | None |
| target_transform | callable | Transform to apply to labels | None |
| lazy_loading | bool | Whether to load tensors lazily | False |

Raises:

| Type | Description |
|---|---|
| ValueError | If the dataset is empty or features/labels are invalid |
| TypeError | If data is not a MoleculeDataset instance |
__getitem__ ¶
Get a data sample.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| index | int | Index of the sample to get | required |

Returns:

| Type | Description |
|---|---|
| tuple[Tensor, Tensor] | Tuple of (features, label) |

Raises:

| Type | Description |
|---|---|
| IndexError | If index is out of bounds |
| RuntimeError | If lazy loading fails |
__len__ ¶
Get the number of samples in the dataset.
Returns:

| Name | Type | Description |
|---|---|---|
| int | int | Number of samples |
__getattr__ ¶
Delegate attribute access to underlying MoleculeDataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| name | str | Attribute name | required |

Returns:

| Type | Description |
|---|---|
| Any | The attribute from the underlying dataset |

Raises:

| Type | Description |
|---|---|
| AttributeError | If attribute doesn't exist in underlying dataset |
get_smiles ¶
Get SMILES strings for all molecules.
Returns:

| Type | Description |
|---|---|
| list[str] | List of SMILES strings |
refresh_tensors ¶
Refresh cached tensors from the underlying dataset.
Useful when the underlying dataset has been modified.
create_dataloader classmethod ¶
create_dataloader(
data: MoleculeDataset,
batch_size: int = 64,
shuffle: bool = True,
transform: Optional[Callable] = None,
target_transform: Optional[Callable] = None,
lazy_loading: bool = False,
**kwargs: Any,
) -> torch.utils.data.DataLoader
Create PyTorch DataLoader with enhanced options.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | MoleculeDataset | Input molecular dataset | required |
| batch_size | int | Batch size | 64 |
| shuffle | bool | Whether to shuffle data | True |
| transform | Optional[Callable] | Transform to apply to features | None |
| target_transform | Optional[Callable] | Transform to apply to labels | None |
| lazy_loading | bool | Whether to use lazy loading | False |
| **kwargs | Any | Additional arguments for DataLoader | {} |

Returns:

| Name | Type | Description |
|---|---|---|
| DataLoader | DataLoader | PyTorch data loader |
Example:

loader = TorchMoleculeDataset.create_dataloader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4
)
Usage Examples¶
from themap.data.torch_dataset import TorchMoleculeDataset
from torch.utils.data import DataLoader
# Featurize the dataset, then wrap it for PyTorch
molecule_dataset.get_features("ecfp")
torch_dataset = TorchMoleculeDataset(molecule_dataset)
# Use with DataLoader
dataloader = DataLoader(
torch_dataset,
batch_size=32,
shuffle=True
)
for batch in dataloader:
    features, labels = batch
    # Train your model...
Data Format¶
JSONL.GZ Format¶
THEMAP uses compressed JSON Lines format for molecular data:
{"SMILES": "CCO", "Property": 1}
{"SMILES": "CCCO", "Property": 0}
{"SMILES": "CC(=O)O", "Property": 1}
Directory Structure¶
datasets/
├── sample_tasks_list.json # Task organization
├── train/
│ ├── CHEMBL123456.jsonl.gz # Molecular data
│ ├── CHEMBL123456.fasta # Protein sequences
│ └── ...
├── test/
│ └── ...
└── valid/
└── ...
Task List Format¶
{
"train": ["CHEMBL123456", "CHEMBL789012", ...],
"test": ["CHEMBL111111", "CHEMBL222222", ...],
"valid": ["CHEMBL333333", ...]
}
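The task list is plain JSON, so it can be inspected directly (a minimal sketch):
import json
with open("datasets/sample_tasks_list.json") as f:
    task_list = json.load(f)
print(task_list["train"][:3])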
Utility Functions¶
Validation¶
from themap.data.molecule_dataset import validate_smiles
# Validate a SMILES string
is_valid = validate_smiles("CCO")
print(f"Valid: {is_valid}") # True
is_valid = validate_smiles("invalid")
print(f"Valid: {is_valid}") # False
Canonicalization¶
from themap.data.molecule_dataset import canonicalize_smiles
# Canonicalize SMILES
canonical = canonicalize_smiles("C(C)O")
print(canonical) # "CCO"
Error Handling¶
from themap.data import DatasetLoader, MoleculeDataset
try:
    loader = DatasetLoader(data_dir="datasets")
    dataset = loader.load_dataset(fold="train", task_id="CHEMBL123456")
except FileNotFoundError:
    print("Dataset file not found")
except ValueError as e:
    print(f"Invalid data format: {e}")
Performance Tips¶
- Lazy loading: Use DatasetLoader to load datasets on demand
- Caching: Enable feature caching for repeated computations
- Batch processing: Process datasets in batches for memory efficiency
- Parallel loading: Use the n_jobs parameter for parallel dataset loading