Distance Computation API
The distance module computes distances between molecule datasets, between protein datasets, and between tasks. It supports several distance methods (OTDD, Euclidean, and cosine) and handles both single dataset comparisons and batch comparisons across multiple datasets.
Overview
The distance computation system consists of three main classes:
- MoleculeDatasetDistance: Computes distances between molecule datasets
- ProteinDatasetDistance: Computes distances between protein datasets
- TaskDistance: Unified interface for computing combined task distances
Core Classes
AbstractTasksDistance
themap.distance.tasks_distance.AbstractTasksDistance
Base class for computing distances between datasets.
This abstract class defines the interface for dataset distance computation. It provides a common structure for both molecule and protein dataset distances.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tasks | Optional[Tasks] | Tasks collection for distance computation | None |
molecule_method | str | Distance computation method for molecules (default: "euclidean") | 'euclidean' |
protein_method | str | Distance computation method for proteins (default: "euclidean") | 'euclidean' |
metadata_method | str | Distance computation method for metadata (default: "euclidean") | 'euclidean' |
method | Optional[str] | Global method (for backward compatibility, overrides individual methods if provided) | None |
get_distance
Compute the distance between datasets.
Each of the subclasses should implement this method.
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, float]] | Dictionary containing distance matrix between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
Raises:
Type | Description |
---|---|
NotImplementedError | If not implemented by subclass |
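The nested return structure can be navigated with plain dictionary operations. The sketch below assumes a result shaped as described above (outer keys are target task IDs, inner keys are source task IDs). The task IDs, the distance values, and the helper name `closest_sources` are made up for illustration and are not part of themap.

```python
from typing import Dict, List, Tuple

# Hypothetical result in the documented shape: {target_id: {source_id: distance}}
distances: Dict[str, Dict[str, float]] = {
    "target_a": {"source_1": 0.75, "source_2": 0.20, "source_3": 0.50},
    "target_b": {"source_1": 0.10, "source_2": 0.90, "source_3": 0.40},
}


def closest_sources(
    dist: Dict[str, Dict[str, float]], k: int = 1
) -> Dict[str, List[Tuple[str, float]]]:
    """Return the k nearest source tasks (smallest distance first) per target."""
    return {
        target: sorted(by_source.items(), key=lambda item: item[1])[:k]
        for target, by_source in dist.items()
    }


nearest = closest_sources(distances, k=2)
for target, ranked in nearest.items():
    print(target, ranked)
```

Sorting the inner dictionary's items by value keeps the source task ID attached to each distance, which is usually what is needed when selecting source tasks for transfer.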
get_hopts
Get hyperparameters for distance computation.
Each of the subclasses should implement this method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_type | str | Type of data ("molecule", "protein", "metadata") | 'molecule' |
Returns:
Type | Description |
---|---|
Optional[Dict[str, Any]] | Dictionary of hyperparameters for the specified data type distance computation method. |
Raises:
Type | Description |
---|---|
NotImplementedError | If not implemented by subclass |
get_supported_methods
Get list of supported methods for a specific data type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_type | str | Type of data ("molecule", "protein", "metadata") | required |
Returns:
Type | Description |
---|---|
List[str] | List of supported method names for the data type |
Raises:
Type | Description |
---|---|
NotImplementedError | If not implemented by subclass |
MoleculeDatasetDistance
themap.distance.tasks_distance.MoleculeDatasetDistance
Bases: AbstractTasksDistance
Calculate distances between molecule datasets using various methods.
This class implements distance computation between molecule datasets using:
- Optimal Transport Dataset Distance (OTDD)
- Euclidean distance
- Cosine distance
The class supports both single dataset comparisons and batch comparisons across multiple datasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tasks | Optional[Tasks] | Tasks collection containing molecule datasets for distance computation | None |
method | Optional[str] | Distance computation method ('otdd', 'euclidean', or 'cosine') | None |
**kwargs | Any | Additional arguments passed to the distance computation method | {} |
Raises:
Type | Description |
---|---|
ValueError | If the specified method is not supported for molecule datasets |
get_hopts
Get hyperparameters for the distance computation method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_type | str | Type of data ("molecule", "protein", "metadata") | 'molecule' |
Returns:
Type | Description |
---|---|
Optional[Dict[str, Any]] | Dictionary of hyperparameters specific to the chosen distance method for the data type. |
get_supported_methods
Get list of supported methods for a specific data type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_type | str | Type of data ("molecule", "protein", "metadata") | required |
Returns:
Type | Description |
---|---|
List[str] | List of supported method names for the data type |
otdd_distance
Compute Optimal Transport Dataset Distance between molecule datasets.
This method uses the OTDD implementation to compute distances between molecule datasets, which takes into account both the feature space and label space of the datasets.
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, float]] | Dictionary containing OTDD distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
euclidean_distance
Compute Euclidean distance between molecule datasets.
This method computes the Euclidean distance between the feature vectors of the datasets. For each dataset, it computes the mean feature vector and then calculates the pairwise distances between these mean vectors.
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, float]] | Dictionary containing Euclidean distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
Raises:
Type | Description |
---|---|
DistanceComputationError | If feature computation fails |
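The mean-vector computation described above can be sketched in pure Python. This illustrates the idea only and is not the library's implementation: the toy feature vectors, the task IDs, and the helper names `mean_vector` and `euclidean` are assumptions.

```python
import math
from typing import Dict, List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Average a dataset's feature vectors column by column."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy per-task feature matrices (one row per molecule); values are made up.
source_features = {
    "source_1": [[0.0, 0.0], [2.0, 0.0]],
    "source_2": [[4.0, 4.0], [4.0, 6.0]],
}
target_features = {"target_a": [[1.0, 0.0], [1.0, 0.0]]}

source_means = {task: mean_vector(rows) for task, rows in source_features.items()}
target_means = {task: mean_vector(rows) for task, rows in target_features.items()}

# Same orientation as the documented return value: {target_id: {source_id: distance}}
mean_distances: Dict[str, Dict[str, float]] = {
    target: {source: euclidean(t_mean, s_mean) for source, s_mean in source_means.items()}
    for target, t_mean in target_means.items()
}
print(mean_distances)
```

Reducing each dataset to a single mean vector makes the cost of the pairwise step depend on the number of tasks rather than the number of molecules, at the price of ignoring how the molecules are spread around that mean.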
cosine_distance
Compute cosine distance between molecule datasets.
This method computes the cosine distance between the feature vectors of the datasets.
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, float]] | Dictionary containing cosine distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
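For reference, the cosine distance between two feature vectors is one minus their cosine similarity. The sketch below shows the formula on toy vectors. How themap aggregates a dataset's molecules into a single vector before this step is not stated here, so the inputs are assumed to be already-aggregated vectors, and `cosine_distance_between` is a hypothetical helper name.

```python
import math
from typing import Sequence


def cosine_distance_between(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cos(angle between u and v)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # The angle is undefined when either vector has zero length.
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)


same_direction = cosine_distance_between([1.0, 2.0], [2.0, 4.0])  # close to 0
orthogonal = cosine_distance_between([1.0, 0.0], [0.0, 3.0])
opposite = cosine_distance_between([1.0, 0.0], [-2.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Unlike Euclidean distance, this measure ignores vector magnitude, so two datasets whose feature vectors point the same way are treated as close even if one has much larger values.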
get_distance
Compute the distance between molecule datasets using the specified method.
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, float]] | Dictionary containing distance matrix between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
load_distance
Load pre-computed distances from a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the file containing pre-computed distances | required |
Raises:
Type | Description |
---|---|
FileNotFoundError | If the file doesn't exist |
ValueError | If the file format is invalid |
to_pandas
Convert the distance matrix to a pandas DataFrame.
Returns:
Type | Description |
---|---|
DataFrame | DataFrame with source task IDs as index and target task IDs as columns, containing the distance values. |
ProteinDatasetDistance
themap.distance.tasks_distance.ProteinDatasetDistance
Bases: AbstractTasksDistance
Calculate distances between protein datasets using various methods.
This class implements distance computation between protein datasets using:
- Euclidean distance
- Cosine distance
The class supports both single dataset comparisons and batch comparisons across multiple datasets.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tasks | Optional[Tasks] | Tasks collection containing protein datasets for distance computation | None |
method | Optional[str] | Distance computation method ('euclidean' or 'cosine') | None |
Raises:
Type | Description |
---|---|
ValueError | If the specified method is not supported for protein datasets |
get_hopts
Get hyperparameters for the distance computation method.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_type | str | Type of data ("molecule", "protein", "metadata") | 'protein' |
Returns:
Type | Description |
---|---|
Optional[Dict[str, Any]] | Dictionary of hyperparameters specific to the chosen distance method for the data type. |
get_supported_methods
Get list of supported methods for a specific data type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_type | str | Type of data ("molecule", "protein", "metadata") | required |
Returns:
Type | Description |
---|---|
List[str] | List of supported method names for the data type |
euclidean_distance
Compute Euclidean distance between protein datasets.
This method calculates the pairwise Euclidean distances between protein feature vectors in the datasets.
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, float]] | Dictionary containing Euclidean distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
cosine_distance
Compute cosine distance between protein datasets.
This method calculates the pairwise cosine distances between protein feature vectors in the datasets.
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, float]] | Dictionary containing cosine distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
sequence_identity_distance
Compute sequence identity-based distance between protein datasets.
This method calculates distances based on protein sequence identity.
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, float]] | Dictionary containing sequence identity-based distances between datasets. |
Raises:
Type | Description |
---|---|
NotImplementedError | This method is not yet implemented |
get_distance
Compute the distance between protein datasets using the specified method.
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, float]] | Dictionary containing distance matrix between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
load_distance
Load pre-computed distances from a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the file containing pre-computed distances | required |
Raises:
Type | Description |
---|---|
FileNotFoundError | If the file doesn't exist |
ValueError | If the file format is invalid |
to_pandas
Convert the distance matrix to a pandas DataFrame.
Returns:
Type | Description |
---|---|
DataFrame | DataFrame with source task IDs as index and target task IDs as columns, containing the distance values. |
TaskDistance
themap.distance.tasks_distance.TaskDistance
Bases: AbstractTasksDistance
Class for computing and managing distances between tasks.
This class handles the computation and storage of distances between tasks, supporting both chemical and protein space distances. It can compute distances directly from Tasks collections or work with pre-computed distance matrices.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
tasks | Optional[Tasks] | Tasks collection for distance computation (optional) | None |
method | Optional[str] | Default distance computation method | None |
source_task_ids | Optional[List[str]] | List of task IDs for source tasks (legacy, optional) | None |
target_task_ids | Optional[List[str]] | List of task IDs for target tasks (legacy, optional) | None |
external_chemical_space | Optional[ndarray] | Pre-computed chemical space distance matrix (optional) | None |
external_protein_space | Optional[ndarray] | Pre-computed protein space distance matrix (optional) | None |
shape
property
Get the shape of the distance matrix.
Returns:
Type | Description |
---|---|
Tuple[int, int] | Tuple containing (number of source tasks, number of target tasks). |
get_distance
Compute and return the default distance between tasks.
Uses the combined distance if both molecule and protein data are available, otherwise uses molecule distance, then protein distance as fallback.
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, float]] | Dictionary containing distance matrix between source and target tasks. |
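The fallback order described above can be pictured as follows. This is a sketch of the selection logic only: `pick_default_distance`, the `average` combiner, and the inputs are hypothetical, and nothing here calls themap.

```python
from typing import Callable, Dict, Optional, Tuple

DistanceDict = Dict[str, Dict[str, float]]


def pick_default_distance(
    molecule: Optional[DistanceDict],
    protein: Optional[DistanceDict],
    combine: Callable[[DistanceDict, DistanceDict], DistanceDict],
) -> Tuple[str, Optional[DistanceDict]]:
    """Return (mode, distances): combined first, then molecule, then protein."""
    if molecule is not None and protein is not None:
        return "combined", combine(molecule, protein)
    if molecule is not None:
        return "molecule", molecule
    if protein is not None:
        return "protein", protein
    return "none", None


def average(mol: DistanceDict, prot: DistanceDict) -> DistanceDict:
    # Assumes both inputs cover the same target/source pairs.
    return {t: {s: (mol[t][s] + prot[t][s]) / 2.0 for s in mol[t]} for t in mol}


mol = {"target_a": {"source_1": 0.2}}
prot = {"target_a": {"source_1": 0.6}}

mode_both, dist_both = pick_default_distance(mol, prot, average)
mode_mol, dist_mol = pick_default_distance(mol, None, average)
mode_prot, dist_prot = pick_default_distance(None, prot, average)
print(mode_both, mode_mol, mode_prot)
```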
get_hopts
Get hyperparameters for distance computation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_type | str | Type of data ("molecule", "protein", "metadata") | 'molecule' |
Returns:
Type | Description |
---|---|
Optional[Dict[str, Any]] | Dictionary of hyperparameters for the specified data type distance computation method. |
get_supported_methods
Get list of supported methods for a specific data type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data_type | str | Type of data ("molecule", "protein", "metadata") | required |
Returns:
Type | Description |
---|---|
List[str] | List of supported method names for the data type |
__repr__
Return a string representation of the TaskDistance instance.
Returns:
Type | Description |
---|---|
str | String containing the number of source and target tasks and the mode. |
compute_molecule_distance
compute_molecule_distance(
    method: Optional[str] = None, molecule_featurizer: str = "ecfp"
) -> Dict[str, Dict[str, float]]
Compute distances between tasks using molecule data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
method | Optional[str] | Distance computation method ('euclidean', 'cosine', or 'otdd'). If None, uses the molecule_method from initialization. | None |
molecule_featurizer | str | Molecular featurizer to use | 'ecfp' |
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, float]] | Dictionary containing molecule-based distances between tasks. |
compute_protein_distance
compute_protein_distance(
    method: Optional[str] = None,
    protein_featurizer: str = "esm2_t33_650M_UR50D",
) -> Dict[str, Dict[str, float]]
Compute distances between tasks using protein data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
method | Optional[str] | Distance computation method ('euclidean' or 'cosine'). If None, uses the protein_method from initialization. | None |
protein_featurizer | str | Protein featurizer to use | 'esm2_t33_650M_UR50D' |
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, float]] | Dictionary containing protein-based distances between tasks. |
compute_combined_distance
compute_combined_distance(
    molecule_method: Optional[str] = None,
    protein_method: Optional[str] = None,
    combination_strategy: str = "average",
    molecule_weight: float = 0.5,
    protein_weight: float = 0.5,
    molecule_featurizer: str = "ecfp",
    protein_featurizer: str = "esm2_t33_650M_UR50D",
) -> Dict[str, Dict[str, float]]
Compute combined distances using both molecule and protein data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
molecule_method | Optional[str] | Method for molecule distance computation | None |
protein_method | Optional[str] | Method for protein distance computation | None |
combination_strategy | str | How to combine distances ('average', 'weighted_average', 'min', 'max') | 'average' |
molecule_weight | float | Weight for molecule distances (used with 'weighted_average') | 0.5 |
protein_weight | float | Weight for protein distances (used with 'weighted_average') | 0.5 |
molecule_featurizer | str | Molecular featurizer to use | 'ecfp' |
protein_featurizer | str | Protein featurizer to use | 'esm2_t33_650M_UR50D' |
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, float]] | Dictionary containing combined distances between tasks. |
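To make the four combination strategies concrete, here is a minimal sketch that merges two distance dictionaries entry by entry. It is not the library's code. `combine_distances` is a hypothetical helper, and two behaviours are assumptions: the weighted average is normalised by the sum of the weights, and task pairs missing from either input are skipped.

```python
from typing import Dict

DistanceDict = Dict[str, Dict[str, float]]
STRATEGIES = ("average", "weighted_average", "min", "max")


def combine_distances(
    mol: DistanceDict,
    prot: DistanceDict,
    strategy: str = "average",
    molecule_weight: float = 0.5,
    protein_weight: float = 0.5,
) -> DistanceDict:
    """Combine molecule and protein distances for pairs present in both inputs."""
    if strategy not in STRATEGIES:
        raise ValueError(f"Unknown combination strategy: {strategy!r}")
    total = molecule_weight + protein_weight
    if strategy == "weighted_average" and total == 0:
        raise ValueError("weights must not sum to zero")

    def combine_pair(m: float, p: float) -> float:
        if strategy == "average":
            return (m + p) / 2.0
        if strategy == "weighted_average":
            return (molecule_weight * m + protein_weight * p) / total
        return min(m, p) if strategy == "min" else max(m, p)

    combined: DistanceDict = {}
    for target, by_source in mol.items():
        for source, m in by_source.items():
            if source in prot.get(target, {}):  # skip pairs lacking protein data
                combined.setdefault(target, {})[source] = combine_pair(m, prot[target][source])
    return combined


mol = {"target_a": {"source_1": 0.2, "source_2": 0.8}}
prot = {"target_a": {"source_1": 0.6, "source_2": 0.4}}
weighted = combine_distances(mol, prot, "weighted_average", 0.7, 0.3)
print(weighted)
```

Note that molecule and protein distances may live on different scales (for example, OTDD values versus cosine distances), so some normalisation before combining is worth considering. Whether themap applies one is not stated above.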
compute_all_distances
compute_all_distances(
    molecule_method: Optional[str] = None,
    protein_method: Optional[str] = None,
    combination_strategy: str = "average",
    molecule_weight: float = 0.5,
    protein_weight: float = 0.5,
    molecule_featurizer: str = "ecfp",
    protein_featurizer: str = "esm2_t33_650M_UR50D",
) -> Dict[str, Dict[str, Dict[str, float]]]
Compute all distance types (molecule, protein, and combined).
Parameters:
Name | Type | Description | Default |
---|---|---|---|
molecule_method | Optional[str] | Method for molecule distance computation | None |
protein_method | Optional[str] | Method for protein distance computation | None |
combination_strategy | str | How to combine distances | 'average' |
molecule_weight | float | Weight for molecule distances | 0.5 |
protein_weight | float | Weight for protein distances | 0.5 |
molecule_featurizer | str | Molecular featurizer to use | 'ecfp' |
protein_featurizer | str | Protein featurizer to use | 'esm2_t33_650M_UR50D' |
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, Dict[str, float]]] | Dictionary with keys 'molecule', 'protein', 'combined' containing respective distance matrices. |
compute_ext_chem_distance
Compute chemical space distances between tasks using external matrices.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
method | str | Distance computation method to use | required |
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, float]] | Dictionary containing chemical space distances between tasks. |
Raises:
Type | Description |
---|---|
NotImplementedError | If external chemical space is not provided |
compute_ext_prot_distance
Compute protein space distances between tasks using external matrices.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
method | str | Distance computation method to use | required |
Returns:
Type | Description |
---|---|
Dict[str, Dict[str, float]] | Dictionary containing protein space distances between tasks. |
Raises:
Type | Description |
---|---|
NotImplementedError | If external protein space is not provided |
load_ext_chem_distance
staticmethod
Load pre-computed chemical space distances from a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the file containing pre-computed chemical space distances | required |
Returns:
Type | Description |
---|---|
TaskDistance | TaskDistance instance initialized with the loaded distances. |
Note
The file should contain a dictionary with keys:
- 'train_chembl_ids' or 'train_pubchem_ids' or 'source_task_ids'
- 'test_chembl_ids' or 'test_pubchem_ids' or 'target_task_ids'
- 'distance_matrices'
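A file with the keys listed in the note could be produced as sketched below. The sketch assumes the file is a pickled dictionary (the usage examples on this page use a `.pkl` extension) and that 'distance_matrices' holds a single source-by-target matrix. The matrix orientation, the task IDs, and the nested list standing in for an array are all assumptions, so check them against the loader before relying on this layout.

```python
import os
import pickle
import tempfile

# Hypothetical payload: 3 source tasks, 2 target tasks.
payload = {
    "source_task_ids": ["SOURCE_1", "SOURCE_2", "SOURCE_3"],
    "target_task_ids": ["TARGET_1", "TARGET_2"],
    # One row per source task, one column per target task (assumed orientation).
    "distance_matrices": [[0.1, 0.9], [0.4, 0.3], [0.7, 0.2]],
}

path = os.path.join(tempfile.mkdtemp(), "chemical_distances.pkl")
with open(path, "wb") as handle:
    pickle.dump(payload, handle)

# Read the file back and confirm the expected keys are present.
with open(path, "rb") as handle:
    loaded = pickle.load(handle)

required = {"source_task_ids", "target_task_ids", "distance_matrices"}
missing = required - set(loaded)
print("missing keys:", sorted(missing))
```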
load_ext_prot_distance
staticmethod
Load pre-computed protein space distances from a file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | str | Path to the file containing pre-computed protein space distances | required |
Returns:
Type | Description |
---|---|
TaskDistance | TaskDistance instance initialized with the loaded distances. |
Note
The file should contain a dictionary with keys:
- 'train_chembl_ids' or 'train_pubchem_ids' or 'source_task_ids'
- 'test_chembl_ids' or 'test_pubchem_ids' or 'target_task_ids'
- 'distance_matrices'
get_computed_distance
Get computed distances of the specified type.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
distance_type | str | Type of distance to return ('molecule', 'protein', 'combined') | 'combined' |
Returns:
Type | Description |
---|---|
Optional[Dict[str, Dict[str, float]]] | Dictionary containing the requested distances, or None if not computed. |
to_pandas
Convert distance matrix to a pandas DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
distance_type | str | Type of distance to convert ('molecule', 'protein', 'combined', 'external_chemical') | 'combined' |
Returns:
Type | Description |
---|---|
DataFrame | DataFrame with source task IDs as index and target task IDs as columns, containing the distance values. |
Raises:
Type | Description |
---|---|
ValueError | If no distances of the specified type are available |
Utility Functions
Validation Functions
themap.distance.tasks_distance._validate_and_extract_task_id
Safely extract task ID from task name with validation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
task_name | str | Task name in format 'fold_task_id' | required |
Returns:
Type | Description |
---|---|
str | Extracted task ID |
Raises:
Type | Description |
---|---|
DataValidationError | If task name format is invalid |
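The exact parsing rules are not spelled out above. As a rough sketch, assuming the fold prefix contains no underscore and everything after the first underscore is the task ID, a validator could look like the following. `extract_task_id` and the local `DataValidationError` class are stand-ins written for this example, not the library's code, and the task name is made up.

```python
class DataValidationError(Exception):
    """Local stand-in for themap's exception of the same name."""


def extract_task_id(task_name: str) -> str:
    """Split 'fold_task_id' at the first underscore and return the task ID."""
    if not isinstance(task_name, str):
        raise DataValidationError(f"Task name must be a string, got {type(task_name).__name__}")
    fold, separator, task_id = task_name.partition("_")
    if not separator or not fold or not task_id:
        raise DataValidationError(f"Expected 'fold_task_id', got {task_name!r}")
    return task_id


extracted = extract_task_id("train_CHEMBL1023359")
print(extracted)
```

Using `str.partition` rather than `str.split` keeps any further underscores inside the task ID intact, which matters if task IDs themselves can contain underscores.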
Exception Classes
DistanceComputationError
themap.distance.tasks_distance.DistanceComputationError
Bases: Exception
Custom exception for distance computation errors.
DataValidationError
themap.distance.tasks_distance.DataValidationError
Bases: Exception
Custom exception for data validation errors.
Constants
Supported Methods
# Available distance methods for molecule datasets
MOLECULE_DISTANCE_METHODS = ["otdd", "euclidean", "cosine"]
# Available distance methods for protein datasets
PROTEIN_DISTANCE_METHODS = ["euclidean", "cosine"]
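These lists can be used to validate a method name before constructing a distance object. The sketch below redefines the two constants locally so that it runs without themap installed. `check_method` and the `SUPPORTED` mapping are hypothetical helpers, not part of the module.

```python
# Local copies of the documented constants.
MOLECULE_DISTANCE_METHODS = ["otdd", "euclidean", "cosine"]
PROTEIN_DISTANCE_METHODS = ["euclidean", "cosine"]

SUPPORTED = {
    "molecule": MOLECULE_DISTANCE_METHODS,
    "protein": PROTEIN_DISTANCE_METHODS,
}


def check_method(method: str, data_type: str) -> str:
    """Return the method unchanged if it is supported for data_type, else raise."""
    supported = SUPPORTED.get(data_type)
    if supported is None:
        raise ValueError(f"Unknown data type: {data_type!r}")
    if method not in supported:
        raise ValueError(
            f"Method {method!r} is not supported for {data_type} datasets; "
            f"choose one of {supported}"
        )
    return method


print(check_method("otdd", "molecule"))
```

Checking up front gives a clear error for combinations such as "otdd" with protein datasets, which the constants above do not list.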
Usage Examples
Basic Molecule Distance Computation
from themap.data.tasks import Tasks
from themap.distance import MoleculeDatasetDistance
# Load tasks from directory
tasks = Tasks.from_directory(
    directory="datasets/",
    task_list_file="datasets/sample_tasks_list.json",
    load_molecules=True,
    load_proteins=False,
)
# Compute molecule distances using OTDD
mol_distance = MoleculeDatasetDistance(
    tasks=tasks,
    molecule_method="otdd",
)
distances = mol_distance.get_distance()
print(distances)
# {'target_task': {'source_task': 0.75, ...}}
Protein Distance Computation
from themap.distance import ProteinDatasetDistance
# Compute protein distances using euclidean method
prot_distance = ProteinDatasetDistance(
    tasks=tasks,
    protein_method="euclidean",
)
distances = prot_distance.get_distance()
Combined Task Distance
from themap.distance import TaskDistance
# Compute combined distances from multiple modalities
task_distance = TaskDistance(
    tasks=tasks,
    molecule_method="cosine",
    protein_method="euclidean",
)
# Compute all distance types
all_distances = task_distance.compute_all_distances(
    combination_strategy="weighted_average",
    molecule_weight=0.7,
    protein_weight=0.3,
)
# Access specific distance types
molecule_distances = all_distances["molecule"]
protein_distances = all_distances["protein"]
combined_distances = all_distances["combined"]
Working with External Distance Matrices
import numpy as np
# Load pre-computed distances
task_distance = TaskDistance.load_ext_chem_distance("path/to/chemical_distances.pkl")
# Or initialize with external matrices
external_chem = np.random.rand(10, 8)  # 10 source, 8 target tasks
task_distance = TaskDistance(
    tasks=None,
    source_task_ids=["task1", "task2", ...],
    target_task_ids=["test1", "test2", ...],
    external_chemical_space=external_chem,
)
# Convert to pandas for analysis
df = task_distance.to_pandas("external_chemical")
Error Handling
from themap.distance import DistanceComputationError, DataValidationError
try:
    # This might fail if OTDD dependencies are missing
    distances = mol_distance.otdd_distance()
except ImportError as e:
    print(f"OTDD not available: {e}")
    # Fall back to Euclidean distance
    distances = mol_distance.euclidean_distance()
except DistanceComputationError as e:
    print(f"Distance computation failed: {e}")
except DataValidationError as e:
    print(f"Data validation failed: {e}")
Performance Considerations
Memory Usage
- OTDD: Most memory-intensive, especially for large datasets
- Euclidean/Cosine: More memory-efficient, suitable for large-scale computations
- External matrices: Memory usage depends on matrix size
Computational Complexity
- OTDD: O(n²m²), where n and m are the sizes of the two datasets being compared
- Euclidean/Cosine: O(nm) for feature extraction plus O(kl) for the distance matrix, where k and l are the numbers of tasks being compared
- Combined distances: Sum of individual method complexities
Optimization Tips
# 1. Use an appropriate maxsamples value for OTDD
hopts = {"maxsamples": 500}  # Reduce for faster computation
# 2. Cache features for repeated computations
tasks.save_task_features_to_file("cached_features.pkl")
cached_features = Tasks.load_task_features_from_file("cached_features.pkl")
# 3. Use appropriate distance method based on data size
if num_molecules > 10000:
    method = "euclidean"  # Faster for large datasets
else:
    method = "otdd"  # More accurate for smaller datasets
Configuration
Distance Method Configuration
Configuration files for distance methods are stored in themap/models/distance_configures/.
Custom Configuration
from themap.utils.distance_utils import get_configure
# Get default configuration
config = get_configure("otdd")
# Modify configuration
config["maxsamples"] = 500
config["device"] = "cpu"
# Use in distance computation
mol_distance = MoleculeDatasetDistance(tasks=tasks, molecule_method="otdd")
# Configuration is automatically loaded and can be overridden