themap.data
MoleculeDatapoint
dataclass
Data structure holding information for a single molecule and associated features.
This class represents a single molecule datapoint with its associated features and labels. It provides methods to compute molecular fingerprints and features, and includes various molecular properties as properties.
Attributes:
Name | Type | Description |
---|---|---|
task_id | str | String describing the task this datapoint is taken from. |
smiles | str | SMILES string describing the molecule this datapoint corresponds to. |
bool_label | bool | Boolean classification label, usually derived from the numeric label using a threshold. |
numeric_label | Optional[float] | Numerical label (e.g., activity), usually measured in the lab. |
_rdkit_mol | Optional[Mol] | Cached RDKit molecule object. |
Properties
- number_of_atoms (int): Number of heavy atoms in the molecule
- number_of_bonds (int): Number of bonds in the molecule
- molecular_weight (float): Molecular weight in atomic mass units
- logp (float): Octanol-water partition coefficient (LogP)
- num_rotatable_bonds (int): Number of rotatable bonds in the molecule
- smiles_canonical (str): Canonical SMILES representation
- rdkit_mol (Chem.Mol): RDKit molecule object (lazy loaded)
Methods:
Name | Description |
---|---|
get_fingerprint | Computes and returns the Morgan fingerprint for the molecule. |
get_features | Computes and returns molecular features using the specified featurizer. |
Example
Create a molecule datapoint
datapoint = MoleculeDatapoint(
    task_id="toxicity_prediction",
    smiles="CCO",  # ethanol
    bool_label=True,
    numeric_label=0.8,
)
Access molecular properties
print(f"Number of atoms: {datapoint.number_of_atoms}")
Number of atoms: 9
print(f"Molecular weight: {datapoint.molecular_weight:.2f}")
Molecular weight: 46.07
print(f"LogP: {datapoint.logp:.2f}")
LogP: -0.31
print(f"Number of rotatable bonds: {datapoint.num_rotatable_bonds}")
Number of rotatable bonds: 0
print(f"SMILES canonical: {datapoint.smiles_canonical}")
SMILES canonical: CCO
Get molecular features
fingerprint = datapoint.get_fingerprint()
print(f"Fingerprint shape: {fingerprint.shape if fingerprint is not None else None}")
Fingerprint shape: (2048,)
features = datapoint.get_features(featurizer_name="ecfp")
print(f"Features shape: {features.shape if features is not None else None}")
Features shape: (2048,)
rdkit_mol
property
Get the RDKit molecule object.
This property lazily initializes the RDKit molecule if it hasn't been created yet. The molecule is cached to avoid recreating it multiple times.
Returns:
Type | Description |
---|---|
Optional[Mol] | RDKit molecule object. Returns None if molecule creation fails. |
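The lazy initialization described for rdkit_mol can be illustrated with a minimal sketch. It assumes a hypothetical `parse_smiles` helper standing in for the RDKit parser, and `LazyMolecule` is an illustrative class, not part of themap.

```python
from dataclasses import dataclass, field
from typing import Optional


def parse_smiles(smiles: str) -> Optional[dict]:
    """Hypothetical stand-in for an RDKit parser: returns None on failure."""
    if not smiles:
        return None
    return {"smiles": smiles}


@dataclass
class LazyMolecule:
    """Illustrative class only; it mimics the caching pattern, not themap's code."""

    smiles: str
    _mol: Optional[dict] = field(default=None, repr=False)
    parse_calls: int = 0  # counts how often the parser actually runs

    @property
    def mol(self) -> Optional[dict]:
        # Parse only on first access; later accesses reuse the cached object.
        if self._mol is None:
            self.parse_calls += 1
            self._mol = parse_smiles(self.smiles)
        return self._mol


point = LazyMolecule("CCO")
first = point.mol
second = point.mol
print(first is second, point.parse_calls)  # True 1
```

In this sketch a failed parse leaves the cache empty, so the next access tries again; whether themap behaves the same way is not stated in this reference.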
number_of_atoms
property
Get the number of heavy atoms in the molecule.
Returns:
Name | Type | Description |
---|---|---|
int | int | Number of heavy atoms in the molecule. |
number_of_bonds
property
Get the number of bonds in the molecule.
Returns:
Name | Type | Description |
---|---|---|
int | int | Number of bonds in the molecule. |
molecular_weight
property
Get the molecular weight of the molecule.
Returns:
Name | Type | Description |
---|---|---|
float | float | Molecular weight of the molecule in atomic mass units. |
logp
property
Calculate octanol-water partition coefficient.
Returns:
Name | Type | Description |
---|---|---|
float | float | LogP value of the molecule. |
num_rotatable_bonds
property
Get number of rotatable bonds.
Returns:
Name | Type | Description |
---|---|---|
int | int | Number of rotatable bonds in the molecule. |
Raises: ValueError: If the molecule cannot be created.
smiles_canonical
property
Get canonical SMILES representation.
Returns:
Name | Type | Description |
---|---|---|
str | str | Canonical SMILES string for the molecule. |
Raises: ValueError: If the molecule cannot be created.
get_fingerprint
Get the Morgan fingerprint for a molecule.
This method computes the Extended-Connectivity Fingerprint (ECFP) for the molecule using RDKit's Morgan fingerprint generator. Features are cached globally to avoid recomputation across different instances.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
force_recompute | bool | If True, the fingerprint is recomputed even if cached. | False |
Returns:
Type | Description |
---|---|
Optional[ndarray] | Morgan fingerprint for the molecule (r=2, nbits=2048). The fingerprint is a binary vector representing the molecular structure. Returns None if fingerprint generation fails. |
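The global caching and force_recompute behaviour described above can be sketched as follows. This is an illustration under stated assumptions: `_FEATURE_CACHE`, `toy_fingerprint`, and `cached_features` are hypothetical names, and the toy featurizer is not a real ECFP implementation.

```python
from typing import Callable, Dict, List, Tuple

# Module-level cache shared by every caller, keyed by (SMILES, featurizer name).
_FEATURE_CACHE: Dict[Tuple[str, str], List[int]] = {}


def toy_fingerprint(smiles: str, n_bits: int = 16) -> List[int]:
    """Hypothetical featurizer: sets one bit per character. Not a real ECFP."""
    bits = [0] * n_bits
    for ch in smiles:
        bits[ord(ch) % n_bits] = 1
    return bits


def cached_features(
    smiles: str,
    name: str = "toy",
    compute: Callable[[str], List[int]] = toy_fingerprint,
    force_recompute: bool = False,
) -> List[int]:
    key = (smiles, name)
    # Recompute when asked to, or when nothing is cached for this key yet.
    if force_recompute or key not in _FEATURE_CACHE:
        _FEATURE_CACHE[key] = compute(smiles)
    return _FEATURE_CACHE[key]


a = cached_features("CCO")
b = cached_features("CCO")  # served from the cache
c = cached_features("CCO", force_recompute=True)  # recomputed and re-cached
print(a is b, c is a, c == a)  # True False True
```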
get_features
get_features(
featurizer_name: Optional[str] = None, force_recompute: bool = False
) -> Optional[np.ndarray]
Get features for a molecule using a featurizer model.
This method computes molecular features using the specified featurizer model. Features are cached globally to avoid recomputation across different instances.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_name | Optional[str] | Name of the featurizer model to use. If None, the method returns None. | None |
force_recompute | bool | If True, features are recomputed even if cached. | False |
Returns:
Type | Description |
---|---|
Optional[ndarray] | Features for the molecule. The shape and content depend on the featurizer used. Returns None if feature generation fails or featurizer_name is None. |
MoleculeDataset
dataclass
Data structure holding information for a dataset of molecules.
This class represents a collection of molecule datapoints, providing methods for dataset manipulation, feature computation, and statistical analysis.
Attributes:
Name | Type | Description |
---|---|---|
task_id | str | String describing the task this dataset is taken from. |
data | List[MoleculeDatapoint] | List of MoleculeDatapoint objects. |
_features | Optional[NDArray[float32]] | Cached features for the dataset. |
_cache_info | Dict[str, Any] | Information about the feature caching. |
get_features
property
Get the cached features for the dataset.
Returns:
Type | Description |
---|---|
Optional[NDArray[float32]] | Cached features for the dataset if available, None otherwise. |
get_ratio
property
Get the ratio of positive to negative examples in the dataset.
Returns:
Name | Type | Description |
---|---|---|
float | float | Ratio of positive to negative examples in the dataset. |
get_dataset_embedding
get_dataset_embedding(
featurizer_name: str,
n_jobs: Optional[int] = None,
force_recompute: bool = False,
batch_size: int = 1000,
) -> NDArray[np.float32]
Get the features for the entire dataset using a featurizer.
Efficiently computes features for all molecules in the dataset using the specified featurizer, taking advantage of the featurizer's built-in parallelization capabilities and maintaining a two-level cache strategy.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_name | str | Name of the featurizer to use. | required |
n_jobs | Optional[int] | Number of parallel jobs. If provided, temporarily overrides the featurizer's own setting. | None |
force_recompute | bool | Whether to force recomputation even if cached. | False |
batch_size | int | Batch size for processing, used for memory efficiency when handling large datasets. | 1000 |
Returns:
Type | Description |
---|---|
NDArray[float32] | Features for the entire dataset, shape (n_samples, n_features). |
Raises:
Type | Description |
---|---|
ValueError | If the length of the generated features doesn't match the dataset length. |
TypeError | If featurizer_name is not a string. |
RuntimeError | If featurization fails. |
IndexError | If the dataset is empty. |
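The batching and the length check documented above can be sketched in plain Python. The helper names `batched` and `featurize_in_batches` and the toy featurizer are hypothetical; the sketch mirrors only the documented IndexError and ValueError conditions, not themap's internals.

```python
from typing import Callable, Iterator, List, Sequence

Features = List[List[float]]


def batched(items: Sequence[str], batch_size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def featurize_in_batches(
    smiles_list: Sequence[str],
    featurize_batch: Callable[[Sequence[str]], Features],
    batch_size: int = 1000,
) -> Features:
    if not smiles_list:
        raise IndexError("dataset is empty")
    features: Features = []
    for batch in batched(smiles_list, batch_size):
        # Only one batch of inputs is handed to the featurizer at a time.
        features.extend(featurize_batch(batch))
    if len(features) != len(smiles_list):
        raise ValueError("feature count does not match dataset size")
    return features


def toy_batch(batch: Sequence[str]) -> Features:
    """Hypothetical featurizer: one feature per molecule (SMILES length)."""
    return [[float(len(s))] for s in batch]


rows = featurize_in_batches(["C", "CC", "CCC"], toy_batch, batch_size=2)
print(rows)  # [[1.0], [2.0], [3.0]]
```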
validate_dataset_integrity
Validate the integrity of the dataset.
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the dataset is valid, False otherwise. |
Raises:
Type | Description |
---|---|
ValueError | If critical integrity issues are found. |
get_memory_usage
Get memory usage statistics for the dataset.
Returns:
Type | Description |
---|---|
Dict[str, float] | Dictionary with memory usage in MB for different components. |
optimize_memory
Optimize memory usage by cleaning up unnecessary data.
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary with optimization results. |
enable_persistent_cache
Enable persistent caching for this dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cache_dir | Union[str, Path] | Directory for storing cached features. | required |
get_dataset_embedding_with_persistent_cache
get_dataset_embedding_with_persistent_cache(
featurizer_name: str,
cache_dir: Optional[Union[str, Path]] = None,
n_jobs: Optional[int] = None,
force_recompute: bool = False,
batch_size: int = 1000,
) -> NDArray[np.float32]
Get dataset features with persistent caching enabled.
This method provides the same functionality as get_dataset_embedding but with persistent disk caching to avoid recomputation across sessions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_name | str | Name of the featurizer to use. | required |
cache_dir | Optional[Union[str, Path]] | Directory for the persistent cache. If None, the existing cache is used. | None |
n_jobs | Optional[int] | Number of parallel jobs. | None |
force_recompute | bool | Whether to force recomputation even if cached. | False |
batch_size | int | Batch size for processing. | 1000 |
Returns:
Type | Description |
---|---|
NDArray[float32] | Features for the entire dataset. |
get_persistent_cache_stats
Get statistics about the persistent cache.
Returns:
Type | Description |
---|---|
Optional[Dict[str, Any]] | Cache statistics if the persistent cache is enabled, None otherwise. |
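The idea of a persistent disk cache that survives across sessions can be sketched as below. The file layout, the key scheme, and the names `cache_path` and `load_or_compute` are assumptions made for illustration; this reference does not describe themap's on-disk format.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path
from typing import Callable, List

Features = List[List[float]]


def cache_path(cache_dir: Path, task_id: str, featurizer_name: str) -> Path:
    # Hypothetical key scheme: hash of task id and featurizer name.
    digest = hashlib.sha256(f"{task_id}:{featurizer_name}".encode()).hexdigest()[:16]
    return cache_dir / f"{digest}.pkl"


def load_or_compute(
    cache_dir: Path,
    task_id: str,
    featurizer_name: str,
    compute: Callable[[], Features],
    force_recompute: bool = False,
) -> Features:
    path = cache_path(cache_dir, task_id, featurizer_name)
    if path.exists() and not force_recompute:
        with path.open("rb") as fh:
            return pickle.load(fh)
    features = compute()
    cache_dir.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(features, fh)
    return features


calls: List[int] = []


def expensive() -> Features:
    calls.append(1)  # record each real computation
    return [[0.1, 0.2], [0.3, 0.4]]


with tempfile.TemporaryDirectory() as tmp:
    first = load_or_compute(Path(tmp), "CHEMBL123", "ecfp", expensive)
    second = load_or_compute(Path(tmp), "CHEMBL123", "ecfp", expensive)

print(first == second, len(calls))  # True 1
```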
get_prototype
Get the prototype of the dataset.
This method calculates the mean feature vector of positive and negative examples in the dataset using the specified featurizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_name | str | Name of the featurizer to use. | required |
Returns:
Type | Description |
---|---|
Tuple[NDArray[float32], NDArray[float32]] | Tuple containing positive_prototype (mean feature vector of positive examples) and negative_prototype (mean feature vector of negative examples). |
Raises:
Type | Description |
---|---|
ValueError | If there are no positive or negative examples in the dataset. |
TypeError | If featurizer_name is not a string. |
RuntimeError | If feature computation fails. |
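The prototype computation described above is the per-class mean of the feature vectors. A minimal pure-Python sketch follows, with hypothetical helper names and plain lists in place of NumPy arrays.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_vector(rows: Sequence[Vector]) -> Vector:
    """Column-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def prototypes(features: Sequence[Vector], labels: Sequence[bool]) -> Tuple[Vector, Vector]:
    positives = [f for f, y in zip(features, labels) if y]
    negatives = [f for f, y in zip(features, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    return mean_vector(positives), mean_vector(negatives)


features = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]
labels = [True, True, False]
pos_proto, neg_proto = prototypes(features, labels)
print(pos_proto, neg_proto)  # [2.0, 1.0] [0.0, 4.0]
```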
load_from_file
staticmethod
Load dataset from a JSONL.GZ file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Union[str, RichPath] | Path to the JSONL.GZ file. | required |
Returns:
Name | Type | Description |
---|---|---|
MoleculeDataset | MoleculeDataset | Loaded dataset. |
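For readers unfamiliar with the JSONL.GZ format (one JSON object per line, gzip-compressed), here is a minimal sketch of reading such a file with the standard library. The record keys `smiles` and `label` are placeholders chosen for the example; the field names themap expects are not given in this reference.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Union


def read_jsonl_gz(path: Union[str, Path]) -> List[Dict[str, Any]]:
    """Read a gzip-compressed JSON Lines file into a list of dicts."""
    records: List[Dict[str, Any]] = []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                records.append(json.loads(line))
    return records


# Hypothetical records; the keys are placeholders, not themap's schema.
rows = [{"smiles": "CCO", "label": 1}, {"smiles": "c1ccccc1", "label": 0}]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "task.jsonl.gz"
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")
    loaded = read_jsonl_gz(path)

print(loaded == rows)  # True
```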
filter
Filter dataset based on a condition.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
condition | Callable[[MoleculeDatapoint], bool] | Function that returns True/False for each datapoint. | required |
Returns:
Name | Type | Description |
---|---|---|
MoleculeDataset | MoleculeDataset | New dataset containing only the filtered datapoints. |
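The filter contract (a predicate per datapoint, a new dataset as the result) can be illustrated with stand-in classes. `Point` and `MiniDataset` are hypothetical simplifications used so the example runs without themap or RDKit installed.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Point:
    """Hypothetical stand-in for MoleculeDatapoint."""

    smiles: str
    bool_label: bool


@dataclass
class MiniDataset:
    """Hypothetical stand-in for MoleculeDataset."""

    task_id: str
    data: List[Point]

    def filter(self, condition: Callable[[Point], bool]) -> "MiniDataset":
        # Build a new dataset; the original list is left untouched.
        return MiniDataset(self.task_id, [p for p in self.data if condition(p)])


ds = MiniDataset("demo", [Point("CCO", True), Point("CCN", False), Point("CCC", True)])
positives = ds.filter(lambda p: p.bool_label)
print(len(positives.data), len(ds.data))  # 2 3
```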
get_statistics
Get statistics about the dataset.
Returns:
Name | Type | Description |
---|---|---|
DatasetStats | DatasetStats | Dictionary containing: size (total number of datapoints), positive_ratio (ratio of positive to negative examples), avg_molecular_weight (average molecular weight), avg_atoms (average number of atoms), avg_bonds (average number of bonds). |
Raises:
Type | Description |
---|---|
ValueError | If the dataset is empty. |
DataFold
Bases: IntEnum
Enum for data fold types.
This enum represents the different data splits used in machine learning:
- TRAIN (0): Training/source tasks
- VALIDATION (1): Validation/development tasks
- TEST (2): Test/target tasks
By inheriting from IntEnum, each fold type is assigned an integer value which allows for easy indexing and comparison operations.
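A short sketch of what the IntEnum inheritance allows in practice. The enum is re-declared here with the values listed above so the example is self-contained; `tasks_per_fold` is a hypothetical list used only to show indexing.

```python
from enum import IntEnum


class DataFold(IntEnum):
    """Re-declared locally with the documented values."""

    TRAIN = 0
    VALIDATION = 1
    TEST = 2


# Hypothetical per-fold storage, indexed directly by the enum member.
tasks_per_fold = [["task_a", "task_b"], ["task_c"], ["task_d"]]

print(tasks_per_fold[DataFold.VALIDATION])  # ['task_c']
print(DataFold.TRAIN < DataFold.TEST)       # True
print(DataFold(2).name)                     # TEST
```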
MoleculeDatasets
Dataset of related tasks, provided as individual files split into meta-train, meta-valid and meta-test sets.
__init__
__init__(
train_data_paths: Optional[List[RichPath]] = None,
valid_data_paths: Optional[List[RichPath]] = None,
test_data_paths: Optional[List[RichPath]] = None,
num_workers: Optional[int] = None,
cache_dir: Optional[Union[str, Path]] = None,
) -> None
Initialize MoleculeDatasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
train_data_paths | Optional[List[RichPath]] | List of paths to training data files. | None |
valid_data_paths | Optional[List[RichPath]] | List of paths to validation data files. | None |
test_data_paths | Optional[List[RichPath]] | List of paths to test data files. | None |
num_workers | Optional[int] | Number of workers for data loading. | None |
cache_dir | Optional[Union[str, Path]] | Directory for persistent caching. | None |
get_num_fold_tasks
Get number of tasks in a specific fold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fold | DataFold | The fold to get the number of tasks for. | required |
Returns:
Name | Type | Description |
---|---|---|
int | int | Number of tasks in the fold. |
from_directory
staticmethod
from_directory(
directory: Union[str, RichPath],
task_list_file: Optional[Union[str, RichPath]] = None,
cache_dir: Optional[Union[str, Path]] = None,
**kwargs: Any,
) -> MoleculeDatasets
Create MoleculeDatasets from a directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
directory | Union[str, RichPath] | Directory containing train/valid/test subdirectories. | required |
task_list_file | Optional[Union[str, RichPath]] | File containing the list of tasks to include. Can be either a text file (one task per line) or a JSON file with fold-specific task lists. | None |
cache_dir | Optional[Union[str, Path]] | Directory for persistent caching. | None |
**kwargs | Any | Additional arguments to pass to the MoleculeDatasets constructor. | {} |
Returns:
Name | Type | Description |
---|---|---|
MoleculeDatasets | MoleculeDatasets | Created dataset. |
get_task_names
Get list of task names in a specific fold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_fold | DataFold | The fold to get task names for. | required |
Returns:
Type | Description |
---|---|
List[str] | List of task names in the fold. |
load_datasets
Load all datasets from specified folds.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
folds | Optional[List[DataFold]] | List of folds to load. If None, loads all folds. | None |
Returns:
Type | Description |
---|---|
Dict[str, MoleculeDataset] | Dictionary mapping dataset names to loaded datasets. |
compute_all_features_with_deduplication
compute_all_features_with_deduplication(
featurizer_name: str,
folds: Optional[List[DataFold]] = None,
batch_size: int = 1000,
n_jobs: int = -1,
force_recompute: bool = False,
) -> Dict[str, np.ndarray]
Compute features for all datasets with global SMILES deduplication.
This method provides significant efficiency gains by:
1. Finding all unique SMILES across all datasets
2. Computing features only once per unique SMILES
3. Distributing computed features back to all datasets
4. Using persistent caching to avoid recomputation
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_name | str | Name of the featurizer to use. | required |
folds | Optional[List[DataFold]] | List of folds to process. If None, processes all folds. | None |
batch_size | int | Batch size for feature computation. | 1000 |
n_jobs | int | Number of parallel jobs. | -1 |
force_recompute | bool | Whether to force recomputation even if cached. | False |
Returns:
Type | Description |
---|---|
Dict[str, ndarray] | Dictionary mapping dataset names to computed features. |
Note
This method has side effects:
- It modifies the datasets in place by adding features to the dataset objects
- It modifies the molecules in place by adding features to the molecule objects
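The first three deduplication steps (collect unique SMILES, featurize each once, distribute the results) can be sketched in a few lines. The function `featurize_with_dedup` and the toy featurizer are hypothetical; persistent caching and the in-place side effects are left out of this illustration.

```python
from typing import Callable, Dict, List

Vector = List[float]


def featurize_with_dedup(
    datasets: Dict[str, List[str]],
    featurize: Callable[[str], Vector],
) -> Dict[str, List[Vector]]:
    # 1. Collect every unique SMILES across all datasets.
    unique = sorted({s for smiles in datasets.values() for s in smiles})
    # 2. Featurize each unique SMILES exactly once.
    computed = {s: featurize(s) for s in unique}
    # 3. Distribute the shared results back to each dataset, preserving order.
    return {name: [computed[s] for s in smiles] for name, smiles in datasets.items()}


calls: List[str] = []


def toy(smiles: str) -> Vector:
    """Hypothetical featurizer that records each call."""
    calls.append(smiles)
    return [float(len(smiles))]


datasets = {"task_a": ["CCO", "CCN"], "task_b": ["CCO", "CCCC"]}
features = featurize_with_dedup(datasets, toy)
print(len(calls), features["task_b"])  # 3 [[3.0], [4.0]]
```

Four molecules appear across the two tasks, but only three featurizer calls are made because "CCO" is shared.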
get_distance_computation_ready_features
get_distance_computation_ready_features(
featurizer_name: str,
source_fold: DataFold = DataFold.TRAIN,
target_folds: Optional[List[DataFold]] = None,
) -> Tuple[List[np.ndarray], List[np.ndarray], List[str], List[str]]
Get features organized for efficient N×M distance matrix computation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_name | str | Name of the featurizer to use. | required |
source_fold | DataFold | Fold to use as source datasets (N). | TRAIN |
target_folds | Optional[List[DataFold]] | Folds to use as target datasets (M). | None |
Returns:
Type | Description |
---|---|
Tuple[List[ndarray], List[ndarray], List[str], List[str]] | Tuple containing two lists of feature arrays and two lists of names, for the source (N) and target (M) datasets. |
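To show how features organized this way feed an N×M distance computation, here is a sketch that reduces each dataset to its mean feature vector and takes Euclidean distances. Both choices are assumptions made for illustration; this reference does not say which dataset-level distance themap applies, and the variable names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def mean_vector(rows: Matrix) -> List[float]:
    """Column-wise mean of a non-empty (n_samples, n_features) list."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def distance_matrix(source_features: List[Matrix], target_features: List[Matrix]) -> Matrix:
    # Assumption: each dataset is summarized by its mean feature vector.
    sources = [mean_vector(rows) for rows in source_features]
    targets = [mean_vector(rows) for rows in target_features]
    # Row i, column j holds the distance from source i to target j.
    return [[math.dist(s, t) for t in targets] for s in sources]


source_features = [[[0.0, 0.0], [2.0, 0.0]], [[0.0, 4.0]]]  # N = 2 datasets
target_features = [[[1.0, 0.0]]]                            # M = 1 dataset
source_names = ["train_a", "train_b"]
target_names = ["test_x"]

matrix = distance_matrix(source_features, target_features)
for name, row in zip(source_names, matrix):
    print(name, dict(zip(target_names, row)))
```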
ProteinDatasets
Collection of protein datasets for different folds (train/validation/test).
Similar to MoleculeDatasets but specifically designed for protein data management, including FASTA file downloading, caching, and feature computation.
__init__
__init__(
train_data_paths: Optional[List[RichPath]] = None,
valid_data_paths: Optional[List[RichPath]] = None,
test_data_paths: Optional[List[RichPath]] = None,
num_workers: Optional[int] = None,
cache_dir: Optional[Union[str, Path]] = None,
uniprot_mapping_file: Optional[Union[str, Path]] = None,
) -> None
Initialize ProteinDatasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
train_data_paths | Optional[List[RichPath]] | List of paths to training FASTA files. | None |
valid_data_paths | Optional[List[RichPath]] | List of paths to validation FASTA files. | None |
test_data_paths | Optional[List[RichPath]] | List of paths to test FASTA files. | None |
num_workers | Optional[int] | Number of workers for data loading. | None |
cache_dir | Optional[Union[str, Path]] | Directory for persistent caching. | None |
uniprot_mapping_file | Optional[Union[str, Path]] | Path to the CHEMBLID -> UNIPROT mapping file. | None |
get_uniprot_id_from_chembl
Get UniProt ID from ChEMBL ID using mapping file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chembl_id | str | ChEMBL task ID. | required |
Returns:
Type | Description |
---|---|
Optional[str] | UniProt accession ID if found, None otherwise. |
download_fasta_for_task
Download FASTA file for a single task.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chembl_id | str | ChEMBL task ID. | required |
output_path | Path | Path where the FASTA file is saved. | required |
Returns:
Type | Description |
---|---|
bool | True if successful, False otherwise. |
create_fasta_files_from_task_list
staticmethod
create_fasta_files_from_task_list(
task_list_file: Union[str, Path],
output_dir: Union[str, Path],
uniprot_mapping_file: Optional[Union[str, Path]] = None,
) -> ProteinDatasets
Create FASTA files from a task list and return ProteinDatasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
task_list_file | Union[str, Path] | Path to the JSON file containing fold-specific task lists. | required |
output_dir | Union[str, Path] | Base directory in which to create the train/test subdirectories. | required |
uniprot_mapping_file | Optional[Union[str, Path]] | Path to the CHEMBLID -> UNIPROT mapping file. | None |
Returns:
Type | Description |
---|---|
ProteinDatasets | ProteinDatasets instance with paths to the created FASTA files. |
from_directory
staticmethod
from_directory(
directory: Union[str, RichPath],
task_list_file: Optional[Union[str, RichPath]] = None,
cache_dir: Optional[Union[str, Path]] = None,
uniprot_mapping_file: Optional[Union[str, Path]] = None,
**kwargs: Any,
) -> ProteinDatasets
Create ProteinDatasets from a directory containing FASTA files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
directory | Union[str, RichPath] | Directory containing train/valid/test subdirectories with FASTA files. | required |
task_list_file | Optional[Union[str, RichPath]] | File containing the list of tasks to include. | None |
cache_dir | Optional[Union[str, Path]] | Directory for persistent caching. | None |
uniprot_mapping_file | Optional[Union[str, Path]] | Path to the CHEMBLID -> UNIPROT mapping file. | None |
**kwargs | Any | Additional arguments. | {} |
Returns:
Type | Description |
---|---|
ProteinDatasets | ProteinDatasets instance. |
get_num_fold_tasks
Get number of tasks in a specific fold.
get_task_names
Get list of task names in a specific fold.
load_datasets
Load all protein datasets from specified folds.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
folds | Optional[List[DataFold]] | List of folds to load. If None, loads all folds. | None |
Returns:
Type | Description |
---|---|
Dict[str, ProteinDataset] | Dictionary mapping dataset names to loaded ProteinDataset objects. |
compute_all_features_with_deduplication
compute_all_features_with_deduplication(
featurizer_name: str = "esm3_sm_open_v1",
layer: Optional[int] = None,
folds: Optional[List[DataFold]] = None,
batch_size: int = 100,
force_recompute: bool = False,
) -> Dict[str, NDArray[np.float32]]
Compute features for all protein datasets with UniProt ID deduplication.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_name | str | Name of the protein featurizer to use. | 'esm3_sm_open_v1' |
layer | Optional[int] | Layer number for ESM models. | None |
folds | Optional[List[DataFold]] | List of folds to process. If None, processes all folds. | None |
batch_size | int | Batch size for feature computation. | 100 |
force_recompute | bool | Whether to force recomputation even if cached. | False |
Returns:
Type | Description |
---|---|
Dict[str, NDArray[float32]] | Dictionary mapping dataset names to computed features. |
get_distance_computation_ready_features
get_distance_computation_ready_features(
featurizer_name: str = "esm3_sm_open_v1",
layer: Optional[int] = None,
source_fold: DataFold = DataFold.TRAIN,
target_folds: Optional[List[DataFold]] = None,
) -> Tuple[
List[NDArray[np.float32]], List[NDArray[np.float32]], List[str], List[str]
]
Get protein features organized for efficient N×M distance matrix computation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_name | str | Name of the protein featurizer to use. | 'esm3_sm_open_v1' |
layer | Optional[int] | Layer number for ESM models. | None |
source_fold | DataFold | Fold to use as source datasets (N). | TRAIN |
target_folds | Optional[List[DataFold]] | Folds to use as target datasets (M). | None |
Returns:
Type | Description |
---|---|
Tuple[List[NDArray[float32]], List[NDArray[float32]], List[str], List[str]] | Tuple containing two lists of feature arrays and two lists of names, for the source (N) and target (M) datasets. |
save_features_to_file
save_features_to_file(
output_path: Union[str, Path],
featurizer_name: str = "esm3_sm_open_v1",
layer: Optional[int] = None,
folds: Optional[List[DataFold]] = None,
) -> None
Save computed features to a pickle file for efficient loading.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_path | Union[str, Path] | Path where the features are saved. | required |
featurizer_name | str | Name of the protein featurizer used. | 'esm3_sm_open_v1' |
layer | Optional[int] | Layer number for ESM models. | None |
folds | Optional[List[DataFold]] | List of folds to save. If None, saves all folds. | None |
load_features_from_file
staticmethod
Load precomputed features from a pickle file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | Union[str, Path] | Path to the saved features file. | required |
Returns:
Type | Description |
---|---|
Dict[str, NDArray[float32]] | Dictionary mapping dataset names to features. |
Tasks
Collection of tasks for molecular property prediction across different folds.
This class manages multiple Task objects and provides unified access to molecular, protein, and metadata features across train/validation/test splits.
__init__
__init__(
train_tasks: Optional[List[Task]] = None,
valid_tasks: Optional[List[Task]] = None,
test_tasks: Optional[List[Task]] = None,
cache_dir: Optional[Union[str, Path]] = None,
) -> None
Initialize Tasks collection.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
train_tasks | Optional[List[Task]] | List of training tasks. | None |
valid_tasks | Optional[List[Task]] | List of validation tasks. | None |
test_tasks | Optional[List[Task]] | List of test tasks. | None |
cache_dir | Optional[Union[str, Path]] | Directory for persistent caching. | None |
from_directory
staticmethod
from_directory(
directory: Union[str, RichPath],
task_list_file: Optional[Union[str, RichPath]] = None,
cache_dir: Optional[Union[str, Path]] = None,
load_molecules: bool = True,
load_proteins: bool = True,
load_metadata: bool = True,
metadata_types: Optional[List[str]] = None,
**kwargs: Any,
) -> Tasks
Create Tasks from a directory structure.
Expected directory structure:
directory/
├── train/
│   ├── CHEMBL123.jsonl.gz (molecules)
│   ├── CHEMBL123.fasta (proteins)
│   ├── CHEMBL123_assay.json (metadata)
│   └── ...
├── valid/
└── test/
Parameters:
Name | Type | Description | Default |
---|---|---|---|
directory | Union[str, RichPath] | Base directory containing task data. | required |
task_list_file | Optional[Union[str, RichPath]] | JSON file with fold-specific task lists. | None |
cache_dir | Optional[Union[str, Path]] | Directory for persistent caching. | None |
load_molecules | bool | Whether to load molecular data. | True |
load_proteins | bool | Whether to load protein data. | True |
load_metadata | bool | Whether to load metadata. | True |
metadata_types | Optional[List[str]] | List of metadata types to load. | None |
**kwargs | Any | Additional arguments. | {} |
Returns:
Type | Description |
---|---|
Tasks | Tasks instance with loaded data. |
get_num_fold_tasks
Get number of tasks in a specific fold.
__getitem__
Get the list of tasks in a fold by index.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int | Index of the fold to retrieve. | required |
Returns:
Type | Description |
---|---|
List[Task] | List of tasks in the selected fold. |
Note
- index 0: Train tasks
- index 1: Validation tasks
- index 2: Test tasks
Raises:
Type | Description |
---|---|
IndexError | If the index is out of range. |
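The indexing convention in the note (0 for train, 1 for validation, 2 for test, IndexError otherwise) can be sketched with a small container. `FoldIndexed` is a hypothetical class holding plain strings, not the Tasks implementation.

```python
from typing import List, Sequence


class FoldIndexed:
    """Hypothetical container mimicking fold-based indexing."""

    def __init__(self, train: Sequence[str], valid: Sequence[str], test: Sequence[str]) -> None:
        # Position in this list matches the documented index: 0, 1, 2.
        self._folds: List[List[str]] = [list(train), list(valid), list(test)]

    def __getitem__(self, index: int) -> List[str]:
        if not 0 <= index < len(self._folds):
            raise IndexError(f"fold index {index} is out of range")
        return self._folds[index]


collection = FoldIndexed(["task_a", "task_b"], ["task_c"], [])
print(collection[0])  # ['task_a', 'task_b']
print(collection[2])  # []

try:
    collection[3]
except IndexError as exc:
    print("IndexError:", exc)
```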
compute_all_task_features
compute_all_task_features(
molecule_featurizer: Optional[str] = None,
protein_featurizer: Optional[str] = None,
metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
combination_method: str = "concatenate",
folds: Optional[List[DataFold]] = None,
force_recompute: bool = False,
**kwargs: Any,
) -> Dict[str, NDArray[np.float32]]
Compute combined features for all tasks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
molecule_featurizer | Optional[str] | Molecular featurizer name. | None |
protein_featurizer | Optional[str] | Protein featurizer name. | None |
metadata_configs | Optional[Dict[str, Dict[str, Any]]] | Metadata featurizer configurations. | None |
combination_method | str | How to combine features. | 'concatenate' |
folds | Optional[List[DataFold]] | List of folds to process. | None |
force_recompute | bool | Whether to force recomputation. | False |
**kwargs | Any | Additional arguments. | {} |
Returns:
Type | Description |
---|---|
Dict[str, NDArray[float32]] | Dictionary mapping task names to combined features. |
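The default combination_method, "concatenate", suggests joining the per-modality vectors end to end. The sketch below shows that idea with plain lists; `combine_concatenate` is a hypothetical helper, and skipping modalities that were not requested (passed as None) is an assumption rather than documented behaviour.

```python
from typing import List, Optional

Vector = List[float]


def combine_concatenate(*parts: Optional[Vector]) -> Vector:
    """Join the available feature vectors end to end, skipping missing ones."""
    combined: Vector = []
    for part in parts:
        if part is not None:
            combined.extend(part)
    return combined


molecule_vec = [0.1, 0.2]
protein_vec = None          # e.g. no protein featurizer requested
metadata_vec = [1.0]

task_vec = combine_concatenate(molecule_vec, protein_vec, metadata_vec)
print(task_vec)  # [0.1, 0.2, 1.0]
```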
get_distance_computation_ready_features
get_distance_computation_ready_features(
molecule_featurizer: Optional[str] = None,
protein_featurizer: Optional[str] = None,
metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
combination_method: str = "concatenate",
source_fold: DataFold = DataFold.TRAIN,
target_folds: Optional[List[DataFold]] = None,
**kwargs: Any,
) -> Tuple[
List[NDArray[np.float32]], List[NDArray[np.float32]], List[str], List[str]
]
Get task features organized for efficient N×M distance matrix computation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
molecule_featurizer | Optional[str] | Molecular featurizer name. | None |
protein_featurizer | Optional[str] | Protein featurizer name. | None |
metadata_configs | Optional[Dict[str, Dict[str, Any]]] | Metadata featurizer configurations. | None |
combination_method | str | How to combine features. | 'concatenate' |
source_fold | DataFold | Fold to use as source tasks (N). | TRAIN |
target_folds | Optional[List[DataFold]] | Folds to use as target tasks (M). | None |
**kwargs | Any | Additional arguments. | {} |
Returns:
Type | Description |
---|---|
Tuple[List[NDArray[float32]], List[NDArray[float32]], List[str], List[str]] | Tuple containing two lists of feature arrays and two lists of names, for the source (N) and target (M) tasks. |
save_task_features_to_file
save_task_features_to_file(
output_path: Union[str, Path],
molecule_featurizer: Optional[str] = None,
protein_featurizer: Optional[str] = None,
metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
combination_method: str = "concatenate",
folds: Optional[List[DataFold]] = None,
**kwargs: Any,
) -> None
Save computed task features to a pickle file for efficient loading.
load_task_features_from_file
staticmethod
Load precomputed task features from a pickle file.
TorchMoleculeDataset
Bases: Dataset
PyTorch Dataset for molecular data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | MoleculeDataset | MoleculeDataset object. | required |
transform | callable | Transform to apply to data. | None |
target_transform | callable | Transform to apply to targets. | None |
__init__
__init__(
data: MoleculeDataset,
transform: Optional[Callable] = None,
target_transform: Optional[Callable] = None,
) -> None
Initialize TorchMoleculeDataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | MoleculeDataset | Input dataset. | required |
transform | callable | Transform to apply to data. | None |
target_transform | callable | Transform to apply to targets. | None |
__getitem__
Get a data sample.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int | Index of the sample to get. | required |
Returns:
Type | Description |
---|---|
tuple[Tensor, Tensor] | Tuple of (features, label). |
__len__
Get the number of samples in the dataset.
Returns:
Name | Type | Description |
---|---|---|
int | int | Number of samples. |
create_dataloader
classmethod
create_dataloader(
data: MoleculeDataset,
batch_size: int = 64,
shuffle: bool = True,
**kwargs: Any,
) -> torch.utils.data.DataLoader
Create PyTorch DataLoader.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | MoleculeDataset | Input dataset. | required |
batch_size | int | Batch size. | 64 |
shuffle | bool | Whether to shuffle data. | True |
**kwargs | Any | Additional arguments for DataLoader. | {} |
Returns:
Name | Type | Description |
---|---|---|
DataLoader | DataLoader | PyTorch data loader. |
MoleculeDataloader
MoleculeDataloader(
data: MoleculeDataset,
batch_size: int = 64,
shuffle: bool = True,
transform: Optional[Callable] = None,
target_transform: Optional[Callable] = None,
) -> torch.utils.data.DataLoader
Load molecular data and create a PyTorch dataloader.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | MoleculeDataset | MoleculeDataset object. | required |
batch_size | int | Batch size. | 64 |
shuffle | bool | Whether to shuffle data. | True |
transform | callable | Transform to apply to data. | None |
target_transform | callable | Transform to apply to targets. | None |
Returns:
Name | Type | Description |
---|---|---|
dataset_loader | DataLoader | PyTorch dataloader. |
Example
from themap.data.torch_dataset import MoleculeDataloader
from themap.data.tasks import Tasks

tasks = Tasks.from_directory(
    directory="datasets/",
    task_list_file="datasets/sample_tasks_list.json",
    load_molecules=True,
    load_proteins=False,
    load_metadata=False,
    cache_dir="./cache",
)
dataset_loader = MoleculeDataloader(
    tasks.get_task("TASK_ID").molecule_dataset, batch_size=10, shuffle=True
)
for batch in dataset_loader:
    print(batch)
    break