themap.data
themap.data
MoleculeDatapoint
dataclass
Data structure holding information for a single molecule and associated features.
This class represents a single molecule datapoint with its associated features and labels. It provides methods to compute molecular fingerprints and features, and exposes common molecular descriptors (atom counts, molecular weight, LogP, etc.) as Python properties.
Attributes:
Name | Type | Description |
---|---|---|
task_id | str | String describing the task this datapoint is taken from. |
smiles | str | SMILES string describing the molecule this datapoint corresponds to. |
bool_label | bool | Boolean classification label, usually derived from the numeric label using a threshold. |
numeric_label | Optional[float] | Numerical label (e.g., activity), usually measured in the lab. |
_rdkit_mol | Optional[Mol] | Cached RDKit molecule object. |
Properties
- number_of_atoms (int): Number of heavy atoms in the molecule
- number_of_bonds (int): Number of bonds in the molecule
- molecular_weight (float): Molecular weight in atomic mass units
- logp (float): Octanol-water partition coefficient (LogP)
- num_rotatable_bonds (int): Number of rotatable bonds in the molecule
- smiles_canonical (str): Canonical SMILES representation
- rdkit_mol (Chem.Mol): RDKit molecule object (lazy loaded)
Methods:
Name | Description |
---|---|
from_dict | Create datapoint from dictionary (class method) |
to_dict | Convert datapoint to dictionary |
get_fingerprint | Computes and returns the Morgan fingerprint for the molecule |
get_features | Computes and returns molecular features using the specified featurizer |
Notes
By design, if the SMILES string is invalid and cannot be parsed with RDKit, an InvalidSMILESError is raised. Make sure to validate and sanitize your SMILES strings before creating a MoleculeDatapoint.
Example

Create a molecule datapoint:

```python
>>> datapoint = MoleculeDatapoint(
...     task_id="toxicity_prediction",
...     smiles="CCCO",  # propanol
...     bool_label=True,
...     numeric_label=0.8,
... )
```

Access molecular properties:

```python
>>> print(f"Number of heavy atoms: {datapoint.number_of_atoms}")
Number of heavy atoms: 4
>>> print(f"Molecular weight: {datapoint.molecular_weight:.2f}")
Molecular weight: 60.06
>>> print(f"LogP: {datapoint.logp:.2f}")
LogP: 0.39
>>> print(f"Number of rotatable bonds: {datapoint.num_rotatable_bonds}")
Number of rotatable bonds: 1
>>> print(f"SMILES canonical: {datapoint.smiles_canonical}")
SMILES canonical: CCCO
```

Get molecular features:

```python
>>> fingerprint = datapoint.get_fingerprint()
>>> print(f"Fingerprint shape: {fingerprint.shape if fingerprint is not None else None}")
Fingerprint shape: (2048,)
>>> features = datapoint.get_features(featurizer_name="ecfp")
>>> print(f"Features shape: {features.shape if features is not None else None}")
Features shape: (2048,)
```
rdkit_mol
property
Get the RDKit molecule object.
This property lazily initializes the RDKit molecule if it hasn't been created yet. The molecule is cached to avoid recreating it multiple times.
Returns:
Type | Description |
---|---|
Optional[Mol] | Optional[Chem.Mol]: RDKit molecule object. Returns None if molecule creation fails. |
number_of_atoms
property
Get the number of heavy atoms in the molecule.
Returns:
Name | Type | Description |
---|---|---|
int | int | Number of heavy atoms in the molecule. |
Raises: ValueError: If the molecule cannot be created.
number_of_bonds
property
Get the number of bonds in the molecule.
Returns:
Name | Type | Description |
---|---|---|
int | int | Number of bonds in the molecule. |
Raises: ValueError: If the molecule cannot be created.
molecular_weight
property
Get the molecular weight of the molecule.
Returns:
Name | Type | Description |
---|---|---|
float | float | Molecular weight of the molecule in atomic mass units. |
Raises: ValueError: If the molecule cannot be created.
logp
property
Calculate octanol-water partition coefficient.
Returns:
Name | Type | Description |
---|---|---|
float | float | LogP value of the molecule. |
Raises: ValueError: If the molecule cannot be created.
num_rotatable_bonds
property
Get number of rotatable bonds.
Returns:
Name | Type | Description |
---|---|---|
int | int | Number of rotatable bonds in the molecule. |
Raises: ValueError: If the molecule cannot be created.
smiles_canonical
property
Get canonical SMILES representation.
Returns:
Name | Type | Description |
---|---|---|
str | str | Canonical SMILES string for the molecule. |
Raises: ValueError: If the molecule cannot be created.
from_dict
classmethod
Create datapoint from dictionary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | dict | Dictionary containing the datapoint data. | required |

Returns:

Name | Type | Description |
---|---|---|
MoleculeDatapoint | MoleculeDatapoint | Datapoint object. |
Example

```python
>>> datapoint = MoleculeDatapoint.from_dict({
...     "task_id": "toxicity_prediction",
...     "smiles": "CCCO",
...     "bool_label": True,
...     "numeric_label": 0.8,
... })
>>> print(datapoint)
MoleculeDatapoint(task_id=toxicity_prediction, smiles=CCCO, bool_label=True, numeric_label=0.8)
```
get_fingerprint
Get the Morgan fingerprint for a molecule.
This method computes the Extended-Connectivity Fingerprint (ECFP) for the molecule using RDKit's Morgan fingerprint generator. Features are cached globally to avoid recomputation across different instances.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
force_recompute | bool | If True, the fingerprint is recomputed even if cached. | False |

Returns:

Type | Description |
---|---|
Optional[ndarray] | Optional[np.ndarray]: Morgan fingerprint for the molecule (r=2, nbits=2048). The fingerprint is a binary vector representing the molecular structure. Returns None if fingerprint generation fails. |

Raises:

Type | Description |
---|---|
FeaturizationError | If computing the fingerprint for the molecule fails. |

Note

The dtype of the fingerprint is np.uint8.
get_features
get_features(
featurizer_name: Optional[str] = "ecfp", force_recompute: bool = False
) -> Optional[np.ndarray]
Get features for a molecule using a featurizer model.
This method computes molecular features using the specified featurizer model. Features are cached globally to avoid recomputation across different instances.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
featurizer_name | Optional[str] | Name of the featurizer model to use. Defaults to "ecfp". If None, returns None. | 'ecfp' |
force_recompute | bool | If True, features are recomputed even if cached. | False |

Returns:

Type | Description |
---|---|
Optional[ndarray] | Optional[np.ndarray]: Features for the molecule. The shape and content depend on the featurizer used. Returns None if feature generation fails or featurizer_name is None. |

Note

The dtype of the features differs between featurizers: for example, ecfp and fcfp produce np.uint8, while mordred produces np.float64.

Raises:

Type | Description |
---|---|
FeaturizationError | If computing features for the molecule fails. |
MoleculeDataset
dataclass
Data structure holding information for a dataset of molecules.
This class represents a collection of molecule datapoints, providing methods for dataset manipulation, feature computation, and statistical analysis.
Attributes:
Name | Type | Description |
---|---|---|
task_id | str | String describing the task this dataset is taken from. |
data | List[MoleculeDatapoint] | List of MoleculeDatapoint objects. |
_current_featurizer | Optional[str] | Name of the current featurizer. |
_cache_info | Dict[str, Any] | Information about the feature caching. |
_persistent_cache | Optional[PersistentFeatureCache] | Persistent feature cache. |
Properties
- computed_features (Optional[NDArray[np.float32]]): Get the cached features for the dataset.
- labels (List[bool]): Get the labels for the dataset.
- smiles (List[str]): Get the SMILES for the dataset.
- ratio (float): Get the ratio of positive to negative labels.
Methods:
Name | Description |
---|---|
load_from_file | Load dataset from a file |
get_features | Get the features for the entire dataset using a featurizer |
get_prototype | Get the prototype of the dataset |
get_statistics | Get the statistics of the dataset |
filter | Filter the dataset |
clear_cache | Clear the cache |
enable_persistent_cache | Enable persistent caching for this dataset |
get_persistent_cache_stats | Get statistics about the persistent cache |
get_cache_info | Get information about the current cache state |
get_memory_usage | Get memory usage statistics for the dataset |
optimize_memory | Optimize memory usage by cleaning up unnecessary data |
validate_dataset_integrity | Validate the integrity of the dataset |
Examples:

```python
>>> # Load a dataset from a file
>>> dataset = MoleculeDataset.load_from_file("datasets/test/CHEMBL2219358.jsonl.gz")
>>> print(dataset)
MoleculeDataset(task_id=CHEMBL2219358, task_size=157)
>>> # Compute the dataset embedding
>>> dataset.get_features(featurizer_name="fcfp", n_jobs=1)
>>> # Compute the prototype
>>> dataset.get_prototype(featurizer_name="fcfp")
>>> # Compute the dataset statistics
>>> dataset.get_statistics()
>>> # Filter the dataset
>>> dataset.filter(lambda x: x.bool_label == 1)
>>> # Enable persistent caching
>>> dataset.enable_persistent_cache("cache/")
>>> # Get statistics about the persistent cache
>>> dataset.get_persistent_cache_stats()
>>> # Get information about the current cache state
>>> dataset.get_cache_info()
>>> # Get memory usage statistics for the dataset
>>> dataset.get_memory_usage()
>>> # Optimize memory usage by cleaning up unnecessary data
>>> dataset.optimize_memory()
>>> # Validate the integrity of the dataset
>>> dataset.validate_dataset_integrity()
```
computed_features
property
Get the cached features for the dataset.
Returns:
Type | Description |
---|---|
Optional[NDArray[float32]] | Optional[NDArray[np.float32]]: Cached features for the dataset if available, None otherwise. |
labels
property
Get the boolean labels for all molecules in the dataset.
Returns:
Type | Description |
---|---|
NDArray[int32] | Array of boolean labels converted to integers (0/1). |
smiles
property
Get the SMILES strings for all molecules in the dataset.
Returns:
Type | Description |
---|---|
List[str] | List of SMILES strings. |
positive_ratio
property
Get the ratio of positive to negative examples in the dataset.
Returns:
Name | Type | Description |
---|---|---|
float | float | Ratio of positive to negative examples in the dataset. |
get_computed_features
property
Deprecated: Use 'computed_features' property instead.
get_features
get_features(
featurizer_name: str,
n_jobs: Optional[int] = DEFAULT_N_JOBS,
force_recompute: bool = False,
batch_size: int = DEFAULT_BATCH_SIZE,
) -> np.ndarray
Get the features for the entire dataset using a featurizer.
Efficiently computes features for all molecules in the dataset using the specified featurizer, taking advantage of the featurizer's built-in parallelization capabilities and maintaining a two-level cache strategy.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
featurizer_name | str | Name of the featurizer to use | required |
n_jobs | Optional[int] | Number of parallel jobs. If provided, temporarily overrides the featurizer's own setting | DEFAULT_N_JOBS |
force_recompute | bool | Whether to force recomputation even if cached | False |
batch_size | int | Batch size for processing, used for memory efficiency when handling large datasets | DEFAULT_BATCH_SIZE |

Returns:

Type | Description |
---|---|
ndarray | np.ndarray: Features for the entire dataset, shape (n_samples, n_features). If the features are already computed, they are loaded from the cache; otherwise they are computed and cached. |

Raises:

Type | Description |
---|---|
ValueError | If the generated features length doesn't match the dataset length |
TypeError | If featurizer_name is not a string |
RuntimeError | If featurization fails |
IndexError | If dataset is empty |

Notes

- Output dtype is different for each featurizer.
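A minimal usage sketch (the file path follows the class-level example above, and "ecfp" is one of the featurizer names used elsewhere in these docs):

```python
from themap.data import MoleculeDataset

dataset = MoleculeDataset.load_from_file("datasets/test/CHEMBL2219358.jsonl.gz")

# First call computes features for every molecule and caches them.
features = dataset.get_features(featurizer_name="ecfp", n_jobs=4, batch_size=512)
print(features.shape)  # (n_samples, n_features)

# A repeated call with the same featurizer is served from the cache.
cached = dataset.get_features(featurizer_name="ecfp")
```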
validate_dataset_integrity
Validate the integrity of the dataset.
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if dataset is valid, False otherwise |

Raises:

Type | Description |
---|---|
ValueError | If critical integrity issues are found |
get_memory_usage
Get memory usage statistics for the dataset.
Returns:
Type | Description |
---|---|
Dict[str, float] | Dictionary with memory usage in MB for different components |
optimize_memory
Optimize memory usage by cleaning up unnecessary data.
Returns:
Type | Description |
---|---|
Dict[str, Any] | Dictionary with optimization results |
enable_persistent_cache
Enable persistent caching for this dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
cache_dir | Union[str, Path] | Directory for storing cached features | required |
get_features_with_persistent_cache
get_features_with_persistent_cache(
featurizer_name: str,
cache_dir: Optional[Union[str, Path]] = None,
n_jobs: Optional[int] = None,
force_recompute: bool = False,
batch_size: int = 1000,
) -> NDArray[np.float32]
Get dataset features with persistent caching enabled.
This method provides the same functionality as get_features but with persistent disk caching to avoid recomputation across sessions.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_name | str | Name of the featurizer to use | required |
cache_dir | Optional[Union[str, Path]] | Directory for persistent cache (if None, uses existing cache) | None |
n_jobs | Optional[int] | Number of parallel jobs | None |
force_recompute | bool | Whether to force recomputation even if cached | False |
batch_size | int | Batch size for processing | 1000 |

Returns:

Type | Description |
---|---|
NDArray[float32] | Features for the entire dataset |
get_persistent_cache_stats
Get statistics about the persistent cache.
Returns:
Type | Description |
---|---|
Optional[Dict[str, Any]] | Cache statistics if persistent cache is enabled, None otherwise |
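The persistent-cache methods are typically combined as in this sketch (assuming a writable cache/ directory):

```python
from themap.data import MoleculeDataset

dataset = MoleculeDataset.load_from_file("datasets/test/CHEMBL2219358.jsonl.gz")

# Compute features and persist them to disk; a later session pointing at the
# same cache_dir loads them instead of recomputing.
features = dataset.get_features_with_persistent_cache(
    featurizer_name="ecfp", cache_dir="cache/", n_jobs=4
)

# Returns None if persistent caching has not been enabled.
print(dataset.get_persistent_cache_stats())
```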
get_prototype
Get the prototype of the dataset.
This method calculates the mean feature vector of positive and negative examples in the dataset using the specified featurizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_name | str | Name of the featurizer to use. | required |

Returns:

Type | Description |
---|---|
Tuple[NDArray[float32], NDArray[float32]] | Tuple[NDArray[np.float32], NDArray[np.float32]]: Tuple containing positive_prototype (mean feature vector of positive examples) and negative_prototype (mean feature vector of negative examples). |

Raises:

Type | Description |
---|---|
ValueError | If there are no positive or negative examples in the dataset |
TypeError | If featurizer_name is not a string |
RuntimeError | If feature computation fails |
Notes
- It assumes there are two positive and two negative examples in the dataset.
- Output dtype is different for each featurizer.
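A short sketch of a typical call, under the same assumptions as the class-level example:

```python
from themap.data import MoleculeDataset

dataset = MoleculeDataset.load_from_file("datasets/test/CHEMBL2219358.jsonl.gz")

# Mean feature vectors of the positive and negative examples.
positive_prototype, negative_prototype = dataset.get_prototype(featurizer_name="ecfp")
print(positive_prototype.shape, negative_prototype.shape)  # both (n_features,)
```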
load_from_file
staticmethod
Load dataset from a JSONL.GZ file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Union[str, RichPath] | Path to the JSONL.GZ file. | required |

Returns:

Name | Type | Description |
---|---|---|
MoleculeDataset | MoleculeDataset | Loaded dataset. |
filter
Filter dataset based on a condition.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
condition | Callable[[MoleculeDatapoint], bool] | Function that returns True/False for each datapoint. | required |

Returns:

Name | Type | Description |
---|---|---|
MoleculeDataset | MoleculeDataset | New dataset containing only the filtered datapoints. |
get_statistics
Get statistics about the dataset.
Returns:
Name | Type | Description |
---|---|---|
DatasetStats | DatasetStats | Dictionary containing: size (total number of datapoints), positive_ratio (ratio of positive to negative examples), avg_molecular_weight (average molecular weight), avg_atoms (average number of atoms), avg_bonds (average number of bonds). |

Raises:

Type | Description |
---|---|
ValueError | If the dataset is empty |
DataFold
Bases: IntEnum
Enum for data fold types.
This enum represents the different data splits used in machine learning:

- TRAIN (0): Training/source tasks
- VALIDATION (1): Validation/development tasks
- TEST (2): Test/target tasks
By inheriting from IntEnum, each fold type is assigned an integer value which allows for easy indexing and comparison operations.
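Since the members are IntEnum values, they can be compared with and used as plain integers. A small sketch, assuming DataFold is importable from themap.data alongside the dataset classes:

```python
from themap.data import DataFold  # assumed export location

fold = DataFold.TRAIN
print(int(fold))       # 0
print(fold == 0)       # True: IntEnum members compare equal to their integer value
print(list(DataFold))  # [<DataFold.TRAIN: 0>, <DataFold.VALIDATION: 1>, <DataFold.TEST: 2>]
```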
MoleculeDatasets
Dataset of related tasks, provided as individual files split into meta-train, meta-valid and meta-test sets.
Attributes:
Name | Type | Description |
---|---|---|
_fold_to_data_paths | Dict[DataFold, List[RichPath]] | Dictionary mapping data folds to their respective data paths. |
_num_workers | Optional[int] | Number of workers for data loading. |
_cache_dir | Optional[Union[str, Path]] | Directory for persistent caching. |
_global_cache | Optional[GlobalMoleculeCache] | Global molecule cache. |
_loaded_datasets | Dict[str, MoleculeDataset] | Dictionary mapping dataset names to their respective loaded datasets. |
Properties
- get_num_fold_tasks (int): Get the number of tasks in a specific fold.
- get_task_names (List[str]): Get the list of task names in a specific fold.
- load_datasets (Dict[str, MoleculeDataset]): Load all datasets from specified folds.
- compute_all_features_with_deduplication (Dict[str, np.ndarray]): Compute features for all datasets with global SMILES deduplication.
- get_distance_computation_ready_features (Tuple[List[np.ndarray], List[np.ndarray], List[str], List[str]]): Get features organized for efficient N×M distance matrix computation.
- get_global_cache_stats (Optional[Dict]): Get statistics about the global cache usage.
Methods:
Name | Description |
---|---|
from_directory | Create MoleculeDatasets from a directory. |
get_num_fold_tasks | Get the number of tasks in a specific fold. |
get_task_names | Get the list of task names in a specific fold. |
load_datasets | Load all datasets from specified folds. |
compute_all_features_with_deduplication | Compute features for all datasets with global SMILES deduplication. |
get_distance_computation_ready_features | Get features organized for efficient N×M distance matrix computation. |
get_global_cache_stats | Get statistics about the global cache usage. |
Examples:

```python
>>> # Create MoleculeDatasets from a directory
>>> molecule_datasets = MoleculeDatasets.from_directory("datasets/")
>>> # Get the number of tasks in the train fold
>>> molecule_datasets.get_num_fold_tasks(DataFold.TRAIN)
>>> # Get the list of task names in the validation fold
>>> molecule_datasets.get_task_names(DataFold.VALIDATION)
>>> # Get the list of task names in the test fold
>>> molecule_datasets.get_task_names(DataFold.TEST)
```
__init__
__init__(
train_data_paths: Optional[List[RichPath]] = None,
valid_data_paths: Optional[List[RichPath]] = None,
test_data_paths: Optional[List[RichPath]] = None,
num_workers: Optional[int] = None,
cache_dir: Optional[Union[str, Path]] = None,
) -> None
Initialize MoleculeDatasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
train_data_paths | Optional[List[RichPath]] | List of paths to training data files. | None |
valid_data_paths | Optional[List[RichPath]] | List of paths to validation data files. | None |
test_data_paths | Optional[List[RichPath]] | List of paths to test data files. | None |
num_workers | Optional[int] | Number of workers for data loading. | None |
cache_dir | Optional[Union[str, Path]] | Directory for persistent caching. | None |
get_num_fold_tasks
Get number of tasks in a specific fold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
fold | DataFold | The fold to get number of tasks for. | required |

Returns:

Name | Type | Description |
---|---|---|
int | int | Number of tasks in the fold. |
from_directory
staticmethod
from_directory(
directory: Union[str, RichPath],
task_list_file: Optional[Union[str, RichPath]] = None,
cache_dir: Optional[Union[str, Path]] = None,
**kwargs: Any,
) -> MoleculeDatasets
Create MoleculeDatasets from a directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
directory | Union[str, RichPath] | Directory containing train/valid/test subdirectories. | required |
task_list_file | Optional[Union[str, RichPath]] | File containing list of tasks to include. Can be either a text file (one task per line) or a JSON file with fold-specific task lists. | None |
cache_dir | Optional[Union[str, Path]] | Directory for persistent caching. | None |
**kwargs | Any | Additional arguments to pass to the MoleculeDatasets constructor. | {} |

Returns:

Name | Type | Description |
---|---|---|
MoleculeDatasets | MoleculeDatasets | Created dataset. |
get_task_names
Get list of task names in a specific fold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_fold | DataFold | The fold to get task names for. | required |

Returns:

Type | Description |
---|---|
List[str] | List[str]: List of task names in the fold. |
load_datasets
Load all datasets from specified folds.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
folds | Optional[List[DataFold]] | List of folds to load. If None, loads all folds. | None |

Returns:

Type | Description |
---|---|
Dict[str, MoleculeDataset] | Dictionary mapping dataset names to loaded datasets |
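A sketch of selectively loading folds (import locations assumed):

```python
from themap.data import MoleculeDatasets, DataFold  # assumed export locations

molecule_datasets = MoleculeDatasets.from_directory("datasets/")

# Load only the train and validation folds.
loaded = molecule_datasets.load_datasets(folds=[DataFold.TRAIN, DataFold.VALIDATION])
for name, dataset in loaded.items():
    print(name, len(dataset.data))  # each value is a MoleculeDataset
```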
compute_all_features_with_deduplication
compute_all_features_with_deduplication(
featurizer_name: str,
folds: Optional[List[DataFold]] = None,
batch_size: int = 1000,
n_jobs: int = -1,
force_recompute: bool = False,
) -> Dict[str, np.ndarray]
Compute features for all datasets with global SMILES deduplication.
This method provides significant efficiency gains by:

1. Finding all unique SMILES across all datasets
2. Computing features only once per unique SMILES
3. Distributing computed features back to all datasets
4. Using persistent caching to avoid recomputation
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_name | str | Name of featurizer to use | required |
folds | Optional[List[DataFold]] | List of folds to process. If None, processes all folds | None |
batch_size | int | Batch size for feature computation | 1000 |
n_jobs | int | Number of parallel jobs | -1 |
force_recompute | bool | Whether to force recomputation even if cached | False |

Returns:

Type | Description |
---|---|
Dict[str, ndarray] | Dictionary mapping dataset names to computed features |
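A sketch of featurizing several folds at once with deduplication (import locations assumed; the cache directory is illustrative):

```python
from themap.data import MoleculeDatasets, DataFold  # assumed export locations

molecule_datasets = MoleculeDatasets.from_directory("datasets/", cache_dir="cache/")

# Features are computed once per unique SMILES, then distributed back to
# every dataset that contains that molecule.
all_features = molecule_datasets.compute_all_features_with_deduplication(
    featurizer_name="ecfp",
    folds=[DataFold.TRAIN, DataFold.TEST],
    n_jobs=-1,
)
for name, feats in all_features.items():
    print(name, feats.shape)
```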
get_distance_computation_ready_features
get_distance_computation_ready_features(
featurizer_name: str,
source_fold: DataFold = DataFold.TRAIN,
target_folds: Optional[List[DataFold]] = None,
) -> Tuple[List[np.ndarray], List[np.ndarray], List[str], List[str]]
Get features organized for efficient N×M distance matrix computation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_name | str | Name of featurizer to use | required |
source_fold | DataFold | Fold to use as source datasets (N) | TRAIN |
target_folds | Optional[List[DataFold]] | Folds to use as target datasets (M) | None |

Returns:

Type | Description |
---|---|
Tuple[List[ndarray], List[ndarray], List[str], List[str]] | Tuple containing the source dataset features, the target dataset features, the source dataset names, and the target dataset names. |
Notes

If target_folds is not provided, the validation and test folds are used as the default target folds.
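A sketch of preparing features for an N×M task-distance computation (import locations assumed):

```python
from themap.data import MoleculeDatasets, DataFold  # assumed export locations

molecule_datasets = MoleculeDatasets.from_directory("datasets/")

source_feats, target_feats, source_names, target_names = (
    molecule_datasets.get_distance_computation_ready_features(
        featurizer_name="ecfp",
        source_fold=DataFold.TRAIN,  # targets default to validation + test
    )
)
print(len(source_feats), len(target_feats))  # N source tasks, M target tasks
```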
ProteinMetadataDataset
dataclass
Single protein metadata dataset representing one task.
Attributes:
Name | Type | Description |
---|---|---|
task_id | str | Unique identifier for the task (CHEMBL ID) |
uniprot_id | str | UniProt accession ID for the protein |
sequence | str | Protein amino acid sequence |
features | Optional[NDArray[float32]] | Optional pre-computed protein features |
Properties
get_computed_features (Optional[NDArray[np.float32]]): Get computed protein features.
Methods:
Name | Description |
---|---|
get_features | Get protein features using the specified featurizer |
get_computed_features
property
Get computed protein features.
get_features
get_features(
featurizer_name: str = "esm3_sm_open_v1", layer: Optional[int] = None
) -> NDArray[np.float32]
Get protein features using the specified featurizer.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_name | str | Name of the protein featurizer to use | 'esm3_sm_open_v1' |
layer | Optional[int] | Layer number for ESM models | None |

Returns:

Type | Description |
---|---|
NDArray[float32] | Computed protein features |
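A minimal sketch using the constructor arguments shown in the torch-dataset example later on this page; note that the default ESM featurizer loads model weights, so the first call can be slow:

```python
from themap.data import ProteinMetadataDataset

protein = ProteinMetadataDataset(
    task_id="CHEMBL123",
    uniprot_id="P12345",
    sequence="MKLLVFSLCLLAFSSATAAF",
)

# Embed the sequence with the default ESM featurizer.
features = protein.get_features(featurizer_name="esm3_sm_open_v1")
print(features.shape)
```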
ProteinMetadataDatasets
Collection of protein datasets for different folds (train/validation/test).
Similar to MoleculeDatasets but specifically designed for protein data management, including FASTA file downloading, caching, and feature computation.
Attributes:
Name | Type | Description |
---|---|---|
_fold_to_data_paths | Dict[DataFold, List[RichPath]] | Dictionary mapping data folds to their respective data paths. |
_num_workers | Optional[int] | Number of workers for data loading. |
_cache_dir | Optional[Union[str, Path]] | Directory for persistent caching. |
_global_cache | Optional[GlobalMoleculeCache] | Global molecule cache. |
_loaded_datasets | Dict[str, ProteinMetadataDataset] | Dictionary mapping dataset names to their respective loaded datasets. |
uniprot_mapping_file | Optional[Union[str, Path]] | Path to CHEMBLID -> UNIPROT mapping file. |
Properties
- uniprot_mapping (pd.DataFrame): Lazily loaded UniProt mapping dataframe.
- get_num_fold_tasks (int): Get the number of tasks in a specific fold.
- get_task_names (List[str]): Get the list of task names in a specific fold.
- load_datasets (Dict[str, ProteinMetadataDataset]): Load all datasets from specified folds.
- compute_all_features_with_deduplication (Dict[str, NDArray[np.float32]]): Compute features for all datasets with UniProt ID deduplication.
- get_distance_computation_ready_features (Tuple[List[NDArray[np.float32]], List[NDArray[np.float32]], List[str], List[str]]): Get features organized for efficient N×M distance matrix computation.
- get_global_cache_stats (Optional[Dict[str, Any]]): Get statistics about the global cache usage.
Methods:
Name | Description |
---|---|
from_directory | Create ProteinMetadataDatasets from a directory. |
get_num_fold_tasks | Get the number of tasks in a specific fold. |
get_task_names | Get the list of task names in a specific fold. |
load_datasets | Load all datasets from specified folds. |
compute_all_features_with_deduplication | Compute features for all datasets with UniProt ID deduplication. |
get_distance_computation_ready_features | Get features organized for efficient N×M distance matrix computation. |
get_global_cache_stats | Get statistics about the global cache usage. |
__init__
__init__(
train_data_paths: Optional[List[RichPath]] = None,
valid_data_paths: Optional[List[RichPath]] = None,
test_data_paths: Optional[List[RichPath]] = None,
num_workers: Optional[int] = None,
cache_dir: Optional[Union[str, Path]] = None,
uniprot_mapping_file: Optional[Union[str, Path]] = None,
) -> None
Initialize ProteinMetadataDatasets.
get_uniprot_id_from_chembl
Get UniProt ID from ChEMBL ID using mapping file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chembl_id | str | ChEMBL task ID | required |

Returns:

Type | Description |
---|---|
Optional[str] | UniProt accession ID if found, None otherwise |
download_fasta_for_task
Download FASTA file for a single task.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
chembl_id | str | ChEMBL task ID | required |
output_path | Path | Path where to save the FASTA file | required |

Returns:

Type | Description |
---|---|
bool | True if successful, False otherwise |
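A sketch combining the two lookup/download steps above; it assumes protein_datasets is an existing ProteinMetadataDatasets instance constructed with a uniprot_mapping_file, and the output path is illustrative:

```python
from pathlib import Path

# Resolve the ChEMBL task ID to a UniProt accession, then fetch its FASTA.
uniprot_id = protein_datasets.get_uniprot_id_from_chembl("CHEMBL2219358")
if uniprot_id is not None:
    ok = protein_datasets.download_fasta_for_task(
        "CHEMBL2219358", Path("fasta/CHEMBL2219358.fasta")
    )
    print("saved" if ok else "download failed")
```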
create_fasta_files_from_task_list
staticmethod
create_fasta_files_from_task_list(
task_list_file: Union[str, Path],
output_dir: Union[str, Path],
uniprot_mapping_file: Optional[Union[str, Path]] = None,
) -> ProteinMetadataDatasets
Create FASTA files from a task list and return ProteinMetadataDatasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
task_list_file | Union[str, Path] | Path to JSON file containing fold-specific task lists | required |
output_dir | Union[str, Path] | Base directory where to create train/test subdirectories | required |
uniprot_mapping_file | Optional[Union[str, Path]] | Path to CHEMBLID -> UNIPROT mapping file | None |

Returns:

Type | Description |
---|---|
ProteinMetadataDatasets | ProteinMetadataDatasets instance with paths to created FASTA files |
from_directory
staticmethod
from_directory(
directory: Union[str, RichPath],
task_list_file: Optional[Union[str, RichPath]] = None,
cache_dir: Optional[Union[str, Path]] = None,
uniprot_mapping_file: Optional[Union[str, Path]] = None,
**kwargs: Any,
) -> ProteinMetadataDatasets
Create ProteinMetadataDatasets from a directory containing FASTA files.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
directory | Union[str, RichPath] | Directory containing train/valid/test subdirectories with FASTA files | required |
task_list_file | Optional[Union[str, RichPath]] | File containing list of tasks to include | None |
cache_dir | Optional[Union[str, Path]] | Directory for persistent caching | None |
uniprot_mapping_file | Optional[Union[str, Path]] | Path to CHEMBLID -> UNIPROT mapping file | None |
**kwargs | Any | Additional arguments | {} |

Returns:

Type | Description |
---|---|
ProteinMetadataDatasets | ProteinMetadataDatasets instance |
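A construction sketch (the import location and the mapping-file path are assumptions):

```python
from themap.data import ProteinMetadataDatasets  # assumed export location

protein_datasets = ProteinMetadataDatasets.from_directory(
    "datasets/",
    cache_dir="cache/",
    uniprot_mapping_file="datasets/uniprot_mapping.csv",  # hypothetical path
)
print(len(protein_datasets.load_datasets()))
```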
get_num_fold_tasks
Get number of tasks in a specific fold.
get_task_names
Get list of task names in a specific fold.
load_datasets
Load all protein datasets from specified folds.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
folds | Optional[List[DataFold]] | List of folds to load. If None, loads all folds. | None |

Returns:

Type | Description |
---|---|
Dict[str, ProteinMetadataDataset] | Dictionary mapping dataset names to loaded ProteinMetadataDataset objects |
compute_all_features_with_deduplication
compute_all_features_with_deduplication(
featurizer_name: str = "esm3_sm_open_v1",
layer: Optional[int] = None,
folds: Optional[List[DataFold]] = None,
batch_size: int = 100,
force_recompute: bool = False,
) -> Dict[str, NDArray[np.float32]]
Compute features for all protein datasets with UniProt ID deduplication.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_name | str | Name of protein featurizer to use | 'esm3_sm_open_v1' |
layer | Optional[int] | Layer number for ESM models | None |
folds | Optional[List[DataFold]] | List of folds to process. If None, processes all folds | None |
batch_size | int | Batch size for feature computation | 100 |
force_recompute | bool | Whether to force recomputation even if cached | False |

Returns:

Type | Description |
---|---|
Dict[str, NDArray[float32]] | Dictionary mapping dataset names to computed features |
get_distance_computation_ready_features
get_distance_computation_ready_features(
featurizer_name: str = "esm3_sm_open_v1",
layer: Optional[int] = None,
source_fold: DataFold = DataFold.TRAIN,
target_folds: Optional[List[DataFold]] = None,
) -> Tuple[
List[NDArray[np.float32]], List[NDArray[np.float32]], List[str], List[str]
]
Get protein features organized for efficient N×M distance matrix computation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
featurizer_name | str | Name of protein featurizer to use | 'esm3_sm_open_v1' |
layer | Optional[int] | Layer number for ESM models | None |
source_fold | DataFold | Fold to use as source datasets (N) | TRAIN |
target_folds | Optional[List[DataFold]] | Folds to use as target datasets (M) | None |

Returns:

Type | Description |
---|---|
Tuple[List[NDArray[float32]], List[NDArray[float32]], List[str], List[str]] | Tuple containing the source dataset features, the target dataset features, the source dataset names, and the target dataset names. |
save_features_to_file
save_features_to_file(
output_path: Union[str, Path],
featurizer_name: str = "esm3_sm_open_v1",
layer: Optional[int] = None,
folds: Optional[List[DataFold]] = None,
) -> None
Save computed features to a pickle file for efficient loading.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_path | Union[str, Path] | Path where to save the features | required |
featurizer_name | str | Name of protein featurizer used | 'esm3_sm_open_v1' |
layer | Optional[int] | Layer number for ESM models | None |
folds | Optional[List[DataFold]] | List of folds to save. If None, saves all folds | None |
|
load_features_from_file
staticmethod
Load precomputed features from a pickle file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_path | Union[str, Path] | Path to the saved features file | required |

Returns:

Type | Description |
---|---|
Dict[str, NDArray[float32]] | Dictionary mapping dataset names to features |
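Saving and reloading features typically looks like this sketch (import location and file name assumed):

```python
from themap.data import ProteinMetadataDatasets  # assumed export location

protein_datasets = ProteinMetadataDatasets.from_directory("datasets/")

# Persist computed features, then reload them in a later session without
# re-running the protein featurizer.
protein_datasets.save_features_to_file("protein_features.pkl")
features = ProteinMetadataDatasets.load_features_from_file("protein_features.pkl")
print(list(features)[:3])  # first few dataset names
```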
Tasks
Collection of tasks for molecular property prediction across different folds.
This class manages multiple Task objects and provides unified access to molecular, protein, and metadata features across train/validation/test splits.
__init__
__init__(
train_tasks: Optional[List[Task]] = None,
valid_tasks: Optional[List[Task]] = None,
test_tasks: Optional[List[Task]] = None,
cache_dir: Optional[Union[str, Path]] = None,
) -> None
Initialize Tasks collection.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
train_tasks | Optional[List[Task]] | List of training tasks | None |
valid_tasks | Optional[List[Task]] | List of validation tasks | None |
test_tasks | Optional[List[Task]] | List of test tasks | None |
cache_dir | Optional[Union[str, Path]] | Directory for persistent caching | None |
|
from_directory
staticmethod
from_directory(
directory: Union[str, RichPath],
task_list_file: Optional[Union[str, RichPath]] = None,
cache_dir: Optional[Union[str, Path]] = None,
load_molecules: bool = True,
load_proteins: bool = True,
load_metadata: bool = True,
metadata_types: Optional[List[str]] = None,
**kwargs: Any,
) -> Tasks
Create Tasks from a directory structure.
Expected directory structure:

```
directory/
├── train/
│   ├── CHEMBL123.jsonl.gz (molecules)
│   ├── CHEMBL123.fasta (proteins)
│   ├── CHEMBL123_assay.json (metadata)
│   └── ...
├── valid/
└── test/
```
Parameters:
Name | Type | Description | Default |
---|---|---|---|
directory | Union[str, RichPath] | Base directory containing task data | required |
task_list_file | Optional[Union[str, RichPath]] | JSON file with fold-specific task lists | None |
cache_dir | Optional[Union[str, Path]] | Directory for persistent caching | None |
load_molecules | bool | Whether to load molecular data | True |
load_proteins | bool | Whether to load protein data | True |
load_metadata | bool | Whether to load metadata | True |
metadata_types | Optional[List[str]] | List of metadata types to load | None |
**kwargs | Any | Additional arguments | {} |

Returns:

Type | Description |
---|---|
Tasks | Tasks instance with loaded data |
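A construction sketch following the paths used in the MoleculeDataloader example at the end of this page:

```python
from themap.data.tasks import Tasks
from themap.data import DataFold  # assumed export location

tasks = Tasks.from_directory(
    directory="datasets/",
    task_list_file="datasets/sample_tasks_list.json",
    load_molecules=True,
    load_proteins=True,
    load_metadata=False,
    cache_dir="./cache",
)
print(tasks.get_num_fold_tasks(DataFold.TRAIN))
```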
get_num_fold_tasks
Get number of tasks in a specific fold.
__getitem__
Get a task by index.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
index | int | Index of the fold to get (see Note) | required |

Returns:

Type | Description |
---|---|
List[Task] | List[Task]: list of tasks in the selected fold |

Note

- index 0: Train tasks
- index 1: Validation tasks
- index 2: Test tasks

Raises:

Type | Description |
---|---|
IndexError | If index is out of range |
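Continuing the from_directory sketch above, indexing selects a fold's task list:

```python
train_tasks = tasks[0]  # index 0: train fold
valid_tasks = tasks[1]  # index 1: validation fold
test_tasks = tasks[2]   # index 2: test fold
print(len(train_tasks), len(valid_tasks), len(test_tasks))
```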
compute_all_task_features
compute_all_task_features(
molecule_featurizer: Optional[str] = None,
protein_featurizer: Optional[str] = None,
metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
combination_method: str = "concatenate",
folds: Optional[List[DataFold]] = None,
force_recompute: bool = False,
**kwargs: Any,
) -> Dict[str, NDArray[np.float32]]
Compute combined features for all tasks.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
molecule_featurizer | Optional[str] | Molecular featurizer name | None |
protein_featurizer | Optional[str] | Protein featurizer name | None |
metadata_configs | Optional[Dict[str, Dict[str, Any]]] | Metadata featurizer configurations | None |
combination_method | str | How to combine features | 'concatenate' |
folds | Optional[List[DataFold]] | List of folds to process | None |
force_recompute | bool | Whether to force recomputation | False |
**kwargs | Any | Additional arguments | {} |

Returns:

Type | Description |
---|---|
Dict[str, NDArray[float32]] | Dictionary mapping task names to combined features |
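Continuing the same sketch, combining molecule and protein features per task (featurizer names as used elsewhere in these docs):

```python
combined = tasks.compute_all_task_features(
    molecule_featurizer="ecfp",
    protein_featurizer="esm3_sm_open_v1",
    combination_method="concatenate",
)
for task_name, feats in combined.items():
    print(task_name, feats.shape)
```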
get_distance_computation_ready_features
get_distance_computation_ready_features(
molecule_featurizer: Optional[str] = None,
protein_featurizer: Optional[str] = None,
metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
combination_method: str = "concatenate",
source_fold: DataFold = DataFold.TRAIN,
target_folds: Optional[List[DataFold]] = None,
**kwargs: Any,
) -> Tuple[
List[NDArray[np.float32]], List[NDArray[np.float32]], List[str], List[str]
]
Get task features organized for efficient N×M distance matrix computation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
molecule_featurizer | Optional[str] | Molecular featurizer name | None |
protein_featurizer | Optional[str] | Protein featurizer name | None |
metadata_configs | Optional[Dict[str, Dict[str, Any]]] | Metadata featurizer configurations | None |
combination_method | str | How to combine features | 'concatenate' |
source_fold | DataFold | Fold to use as source tasks (N) | TRAIN |
target_folds | Optional[List[DataFold]] | Folds to use as target tasks (M) | None |
**kwargs | Any | Additional arguments | {} |

Returns:

Type | Description |
---|---|
Tuple[List[NDArray[float32]], List[NDArray[float32]], List[str], List[str]] | Tuple containing the source task features, the target task features, the source task names, and the target task names. |
save_task_features_to_file
save_task_features_to_file(
output_path: Union[str, Path],
molecule_featurizer: Optional[str] = None,
protein_featurizer: Optional[str] = None,
metadata_configs: Optional[Dict[str, Dict[str, Any]]] = None,
combination_method: str = "concatenate",
folds: Optional[List[DataFold]] = None,
**kwargs: Any,
) -> None
Save computed task features to a pickle file for efficient loading.
load_task_features_from_file
staticmethod
Load precomputed task features from a pickle file.
TorchMoleculeDataset
Bases: Dataset
Enhanced PyTorch Dataset wrapper for molecular data.
This class wraps a MoleculeDataset to provide PyTorch Dataset functionality while maintaining access to all original MoleculeDataset methods through delegation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | MoleculeDataset | MoleculeDataset object | required |
transform | callable | Transform to apply to features | None |
target_transform | callable | Transform to apply to labels | None |
lazy_loading | bool | Whether to load data lazily. Defaults to False. | False |
Example

```python
>>> import torch
>>> from themap.data import MoleculeDataset
>>> from themap.data.torch_dataset import TorchMoleculeDataset
>>> # Load molecular dataset
>>> mol_dataset = MoleculeDataset.load_from_file("data.jsonl.gz")
>>> # Create PyTorch wrapper
>>> torch_dataset = TorchMoleculeDataset(mol_dataset)
>>> # Use as PyTorch Dataset
>>> dataloader = torch.utils.data.DataLoader(torch_dataset, batch_size=32)
>>> # Access original methods through delegation
>>> stats = torch_dataset.get_statistics()
>>> features = torch_dataset.get_features("ecfp")
```
dataset
property
Access to the underlying MoleculeDataset.
Returns:
Name | Type | Description |
---|---|---|
MoleculeDataset | MoleculeDataset | The wrapped dataset |
__init__
__init__(
data: MoleculeDataset,
transform: Optional[Callable] = None,
target_transform: Optional[Callable] = None,
lazy_loading: bool = False,
) -> None
Initialize TorchMoleculeDataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | MoleculeDataset | Input molecular dataset | required |
transform | callable | Transform to apply to features | None |
target_transform | callable | Transform to apply to labels | None |
lazy_loading | bool | Whether to load tensors lazily | False |

Raises:

Type | Description |
---|---|
ValueError | If the dataset is empty or features/labels are invalid |
TypeError | If data is not a MoleculeDataset instance |
__getitem__
Get a data sample.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int | Index of the sample to get | required |

Returns:

Type | Description |
---|---|
tuple[Tensor, Tensor] | tuple[torch.Tensor, torch.Tensor]: Tuple of (features, label) |

Raises:

Type | Description |
---|---|
IndexError | If index is out of bounds |
RuntimeError | If lazy loading fails |
__len__
Get the number of samples in the dataset.
Returns:
Name | Type | Description |
---|---|---|
int | int | Number of samples |
__getattr__
Delegate attribute access to underlying MoleculeDataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Attribute name | required |

Returns:

Type | Description |
---|---|
Any | The attribute from the underlying dataset |

Raises:

Type | Description |
---|---|
AttributeError | If attribute doesn't exist in underlying dataset |
get_smiles
Get SMILES strings for all molecules.
Returns:
Type | Description |
---|---|
list[str] | list[str]: List of SMILES strings |
refresh_tensors
Refresh cached tensors from the underlying dataset.
Useful when the underlying dataset has been modified.
create_dataloader
classmethod
create_dataloader(
data: MoleculeDataset,
batch_size: int = 64,
shuffle: bool = True,
transform: Optional[Callable] = None,
target_transform: Optional[Callable] = None,
lazy_loading: bool = False,
**kwargs: Any,
) -> torch.utils.data.DataLoader
Create PyTorch DataLoader with enhanced options.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | MoleculeDataset | Input molecular dataset | required |
batch_size | int | Batch size | 64 |
shuffle | bool | Whether to shuffle data | True |
transform | Optional[Callable] | Transform to apply to features | None |
target_transform | Optional[Callable] | Transform to apply to labels | None |
lazy_loading | bool | Whether to use lazy loading | False |
**kwargs | Any | Additional arguments for DataLoader | {} |

Returns:

Name | Type | Description |
---|---|---|
DataLoader | DataLoader | PyTorch data loader |
Example

```python
>>> loader = TorchMoleculeDataset.create_dataloader(
...     dataset,
...     batch_size=32,
...     shuffle=True,
...     num_workers=4,
... )
```
TorchProteinMetadataDataset
Bases: Dataset
Enhanced PyTorch Dataset wrapper for protein data.
This class wraps a ProteinMetadataDataset to provide PyTorch Dataset functionality while maintaining access to all original ProteinMetadataDataset methods through delegation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | ProteinMetadataDataset | ProteinMetadataDataset object | required |
transform | callable | Transform to apply to features | None |
target_transform | callable | Transform to apply to labels | None |
lazy_loading | bool | Whether to load data lazily. Defaults to False. | False |
sequence_length | int | Fixed sequence length for padding/truncation | None |
Example

```python
>>> import torch
>>> from themap.data import ProteinMetadataDataset
>>> from themap.data.torch_dataset import TorchProteinMetadataDataset
>>> # Create protein dataset
>>> protein_dataset = ProteinMetadataDataset(
...     task_id="CHEMBL123",
...     uniprot_id="P12345",
...     sequence="MKLLVFSLCLLAFSSATAAF",
... )
>>> # Create PyTorch wrapper
>>> torch_dataset = TorchProteinMetadataDataset(protein_dataset)
>>> # Use as PyTorch Dataset
>>> dataloader = torch.utils.data.DataLoader(torch_dataset, batch_size=1)
```
dataset
property
Access to the underlying ProteinMetadataDataset.
Returns:
Name | Type | Description |
---|---|---|
ProteinMetadataDataset | ProteinMetadataDataset | The wrapped dataset |
__init__
__init__(
data: ProteinMetadataDataset,
transform: Optional[Callable] = None,
target_transform: Optional[Callable] = None,
lazy_loading: bool = False,
sequence_length: Optional[int] = None,
) -> None
Initialize TorchProteinMetadataDataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | ProteinMetadataDataset | Input protein dataset | required |
transform | callable | Transform to apply to features | None |
target_transform | callable | Transform to apply to targets | None |
lazy_loading | bool | Whether to load tensors lazily | False |
sequence_length | int | Fixed sequence length for padding/truncation | None |

Raises:

Type | Description |
---|---|
TypeError | If data is not a ProteinMetadataDataset instance |
ValueError | If the dataset is invalid |
__getitem__
Get a protein data sample.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int | Index of the sample to get (should be 0 for a single protein) | required |

Returns:

Type | Description |
---|---|
tuple[Tensor, Tensor] | tuple[torch.Tensor, torch.Tensor]: Tuple of (features, label) |

Raises:

Type | Description |
---|---|
IndexError | If index is out of bounds |
RuntimeError | If lazy loading fails |
__len__
Get the number of samples in the dataset.
Returns:
Name | Type | Description |
---|---|---|
int | int | Always 1 for a single protein |
__getattr__
Delegate attribute access to underlying ProteinMetadataDataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Attribute name | required |

Returns:

Type | Description |
---|---|
Any | The attribute from the underlying dataset |

Raises:

Type | Description |
---|---|
AttributeError | If attribute doesn't exist in underlying dataset |
refresh_tensors
Refresh cached tensors from the underlying dataset.
Useful when the underlying dataset has been modified.
create_dataloader
classmethod
create_dataloader(
data: ProteinMetadataDataset,
batch_size: int = 1,
shuffle: bool = False,
transform: Optional[Callable] = None,
target_transform: Optional[Callable] = None,
lazy_loading: bool = False,
sequence_length: Optional[int] = None,
**kwargs: Any,
) -> torch.utils.data.DataLoader
Create PyTorch DataLoader for protein data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | ProteinMetadataDataset | Input protein dataset | required |
batch_size | int | Batch size (typically 1 for single proteins) | 1 |
shuffle | bool | Whether to shuffle data (typically False for a single protein) | False |
transform | Optional[Callable] | Transform to apply to features | None |
target_transform | Optional[Callable] | Transform to apply to labels | None |
lazy_loading | bool | Whether to use lazy loading | False |
sequence_length | Optional[int] | Fixed sequence length for padding/truncation | None |
**kwargs | Any | Additional arguments for DataLoader | {} |

Returns:

Name | Type | Description |
---|---|---|
DataLoader | DataLoader | PyTorch data loader |
MoleculeDataloader
MoleculeDataloader(
data: MoleculeDataset,
batch_size: int = 64,
shuffle: bool = True,
transform: Optional[Callable] = None,
target_transform: Optional[Callable] = None,
) -> torch.utils.data.DataLoader
Load molecular data and create PyTorch dataloader.
This function is kept for backward compatibility. Consider using TorchMoleculeDataset.create_dataloader() for new code.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | MoleculeDataset | MoleculeDataset object | required |
batch_size | int | Batch size | 64 |
shuffle | bool | Whether to shuffle data | True |
transform | callable | Transform to apply to data | None |
target_transform | callable | Transform to apply to targets | None |

Returns:

Name | Type | Description |
---|---|---|
dataset_loader | DataLoader | PyTorch dataloader |
Example

```python
from themap.data.torch_dataset import MoleculeDataloader
from themap.data.tasks import Tasks

tasks = Tasks.from_directory(
    directory="datasets/",
    task_list_file="datasets/sample_tasks_list.json",
    load_molecules=True,
    load_proteins=False,
    load_metadata=False,
    cache_dir="./cache",
)
dataset_loader = MoleculeDataloader(
    tasks.get_task("TASK_ID").molecule_dataset, batch_size=10, shuffle=True
)
for batch in dataset_loader:
    print(batch)
    break
```
ProteinDataloader
ProteinDataloader(
data: ProteinMetadataDataset,
batch_size: int = 1,
shuffle: bool = False,
transform: Optional[Callable] = None,
target_transform: Optional[Callable] = None,
sequence_length: Optional[int] = None,
) -> torch.utils.data.DataLoader
Load protein data and create PyTorch dataloader.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | ProteinMetadataDataset | ProteinMetadataDataset object | required |
batch_size | int | Batch size (typically 1 for proteins) | 1 |
shuffle | bool | Whether to shuffle data | False |
transform | callable | Transform to apply to data | None |
target_transform | callable | Transform to apply to targets | None |
sequence_length | int | Fixed sequence length for padding/truncation | None |

Returns:

Name | Type | Description |
---|---|---|
dataset_loader | DataLoader | PyTorch dataloader |
Example

```python
from themap.data.torch_dataset import ProteinDataloader
from themap.data.protein_datasets import ProteinMetadataDataset

protein_dataset = ProteinMetadataDataset(
    task_id="CHEMBL123",
    uniprot_id="P12345",
    sequence="MKLLVFSLCLLAFSSATAAF",
)
dataset_loader = ProteinDataloader(protein_dataset)
for batch in dataset_loader:
    print(batch)
    break
```