Distance Computation API¶
The distance module computes distances between molecular datasets, protein datasets, and tasks. It supports several distance metrics and handles both single-dataset comparisons and batch comparisons across multiple datasets.
Overview¶
The distance computation system consists of three main classes:
- MoleculeDatasetDistance: Computes distances between molecule datasets
- ProteinDatasetDistance: Computes distances between protein datasets
- TaskDistance: Unified interface for computing combined task distances
Core Classes¶
AbstractTasksDistance¶
themap.distance.base.AbstractTasksDistance ¶
Base class for computing distances between tasks.
This abstract class defines the interface for task distance computation. It distinguishes between:
- Dataset distances: Between sets of molecules (OTDD, set-based Euclidean/Cosine)
- Metadata distances: Between single vectors per task (vector-based Euclidean/Cosine)
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tasks` | `Optional[Tasks]` | Tasks collection for distance computation | `None` |
| `dataset_method` | `str` | Distance computation method for datasets (molecules) (default: "euclidean") | `'euclidean'` |
| `metadata_method` | `str` | Distance computation method for metadata including protein (default: "euclidean") | `'euclidean'` |
| `molecule_method` | `Optional[str]` | Deprecated alias for `dataset_method` | `None` |
| `protein_method` | `Optional[str]` | Deprecated - protein is metadata, use `metadata_method` | `None` |
| `method` | `Optional[str]` | Global method (for backward compatibility, overrides individual methods if provided) | `None` |
get_distance ¶
Compute the distance between datasets.
Each of the subclasses should implement this method.
Returns:
| Type | Description |
|---|---|
| `Dict[str, Dict[str, float]]` | Dictionary containing distance matrix between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
Raises:
| Type | Description |
|---|---|
| `NotImplementedError` | If not implemented by subclass |
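The nested return format can be illustrated without the library. The sketch below uses made-up task IDs and distance values (assumptions, not library output), and the helper `closest_sources` is hypothetical. It shows one way a `{target_id: {source_id: distance}}` dictionary might be queried for the closest source task of each target.

```python
from typing import Dict

# Hypothetical distances in the documented layout:
# outer keys are target task IDs, inner keys are source task IDs.
distances: Dict[str, Dict[str, float]] = {
    "target_a": {"source_1": 0.75, "source_2": 0.20},
    "target_b": {"source_1": 0.10, "source_2": 0.90},
}


def closest_sources(dist: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Return, for each target task, the source task with the smallest distance."""
    return {target: min(row, key=row.get) for target, row in dist.items()}


nearest = closest_sources(distances)
print(nearest)
```

Because the outer key is the target, iterating over the dictionary visits one target task at a time, which suits "find the best source for this target" queries.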
get_hopts ¶
Get hyperparameters for distance computation.
Each of the subclasses should implement this method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_type` | `str` | Type of data ("dataset", "metadata"). Legacy: "molecule" (alias for "dataset"), "protein" (alias for "metadata") | `'dataset'` |
Returns:
| Type | Description |
|---|---|
| `Optional[Dict[str, Any]]` | Dictionary containing hyperparameters for the distance computation method, or None if no hyperparameters are needed. |
Raises:
| Type | Description |
|---|---|
| `NotImplementedError` | If not implemented by subclass |
get_supported_methods ¶
Get list of supported methods for a specific data type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_type` | `str` | Type of data ("dataset", "metadata"). Legacy: "molecule" (alias for "dataset"), "protein" (alias for "metadata") | required |
Returns:
| Type | Description |
|---|---|
| `List[str]` | List of supported method names for the data type |
Raises:
| Type | Description |
|---|---|
| `NotImplementedError` | If not implemented by subclass |
__call__ ¶
Allow the class to be called as a function.
Each of the subclasses should implement this method.
Returns:
| Type | Description |
|---|---|
| `Dict[str, Dict[str, float]]` | The computed distance matrix. |
MoleculeDatasetDistance¶
themap.distance.molecule_distance.MoleculeDatasetDistance ¶
Bases: AbstractTasksDistance
Calculate distances between molecule datasets using various methods.
This class implements distance computation between molecule datasets using:
- Optimal Transport Dataset Distance (OTDD)
- Euclidean distance
- Cosine distance
The class supports both single dataset comparisons and batch comparisons across multiple datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tasks` | `Optional[Tasks]` | Tasks collection containing molecule datasets for distance computation | `None` |
| `method` | `Optional[str]` | Distance computation method ('otdd', 'euclidean', or 'cosine') | `None` |
| `**kwargs` | `Any` | Additional arguments passed to the distance computation method | `{}` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the specified method is not supported for molecule datasets |
get_hopts ¶
Get hyperparameters for the distance computation method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_type` | `str` | Type of data ("molecule", "protein", "metadata") | `'molecule'` |
Returns:
| Type | Description |
|---|---|
| `Optional[Dict[str, Any]]` | Dictionary of hyperparameters specific to the chosen distance method for the data type. |
get_supported_methods ¶
Get list of supported methods for a specific data type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_type` | `str` | Type of data ("molecule", "protein", "metadata") | required |
Returns:
| Type | Description |
|---|---|
| `List[str]` | List of supported method names for the data type |
otdd_distance ¶
Compute Optimal Transport Dataset Distance between molecule datasets.
This method uses the OTDD implementation to compute distances between molecule datasets, which takes into account both the feature space and label space of the datasets.
Returns:
| Type | Description |
|---|---|
| `Dict[str, Dict[str, float]]` | Dictionary containing OTDD distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
euclidean_distance ¶
Compute Euclidean distance between molecule datasets.
This method computes the dataset-level Euclidean distance by comparing the prototypes of the datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `featurizer_name` | `str` | Name of the molecular featurizer to use (e.g., "ecfp", "maccs", "desc2D") | `'ecfp'` |
Returns:
| Type | Description |
|---|---|
| `Dict[str, Dict[str, float]]` | Dictionary containing Euclidean distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
Raises:
| Type | Description |
|---|---|
| `DistanceComputationError` | If feature computation fails |
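To make the prototype comparison concrete, here is a minimal standalone sketch. It assumes a "prototype" is the element-wise mean of a dataset's feature vectors; the library may define prototypes differently (for example per class), and the names `prototype` and `prototype_euclidean` are hypothetical, not part of themap.

```python
import math
from typing import Dict, List, Sequence

Features = Sequence[Sequence[float]]  # one feature vector per molecule


def prototype(features: Features) -> List[float]:
    """Assumed prototype: the element-wise mean of a dataset's feature vectors."""
    if not features:
        raise ValueError("cannot compute a prototype for an empty dataset")
    return [sum(column) / len(features) for column in zip(*features)]


def prototype_euclidean(
    sources: Dict[str, Features], targets: Dict[str, Features]
) -> Dict[str, Dict[str, float]]:
    """Euclidean distance between prototypes, keyed as {target_id: {source_id: d}}."""
    source_protos = {sid: prototype(feats) for sid, feats in sources.items()}
    return {
        tid: {sid: math.dist(prototype(feats), proto) for sid, proto in source_protos.items()}
        for tid, feats in targets.items()
    }


# Toy 2-D features, made up for illustration.
sources = {"source_1": [[0.0, 0.0], [2.0, 0.0]]}  # prototype: [1.0, 0.0]
targets = {"target_a": [[4.0, 4.0]]}              # prototype: [4.0, 4.0]
result = prototype_euclidean(sources, targets)
print(result["target_a"]["source_1"])
```

Reducing each dataset to one vector keeps the cost linear in the number of molecules, at the price of ignoring how the molecules are spread around the prototype.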
cosine_distance ¶
Compute cosine distance between molecule datasets.
This method computes the dataset-level cosine distance by comparing the prototypes of the datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `featurizer_name` | `str` | Name of the molecular featurizer to use (e.g., "ecfp", "maccs", "desc2D") | `'ecfp'` |
Returns:
| Type | Description |
|---|---|
| `Dict[str, Dict[str, float]]` | Dictionary containing cosine distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
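For reference, cosine distance between two prototype vectors is commonly defined as one minus their cosine similarity. The sketch below is a standalone illustration under that assumption. The function name `cosine_distance_vec` is hypothetical, and the library's handling of zero vectors may differ.

```python
import math
from typing import Sequence


def cosine_distance_vec(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cos(angle between u and v); raises on zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Same direction, different magnitude: distance 0.
same_direction = cosine_distance_vec([1.0, 0.0], [2.0, 0.0])
# Orthogonal vectors: distance 1.
orthogonal = cosine_distance_vec([1.0, 0.0], [0.0, 1.0])
# Opposite directions: distance 2.
opposite = cosine_distance_vec([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

Unlike Euclidean distance, this measure ignores vector magnitude, so two prototypes that differ only in scale are treated as identical.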
get_distance ¶
Compute the distance between molecule datasets using the specified method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `featurizer_name` | `str` | Name of the molecular featurizer to use (e.g., "ecfp", "maccs", "desc2D") | `'ecfp'` |
Returns:
| Type | Description |
|---|---|
| `Dict[str, Dict[str, float]]` | Dictionary containing distance matrix between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
load_distance ¶
Load pre-computed distances from a file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the file containing pre-computed distances | required |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the file doesn't exist |
| `ValueError` | If the file format is invalid |
to_pandas ¶
Convert the distance matrix to a pandas DataFrame.
Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with source task IDs as index and target task IDs as columns, containing the distance values. |
__repr__ ¶
Return a string representation of the MoleculeDatasetDistance instance.
Returns:
| Type | Description |
|---|---|
| `str` | String containing the class name and initialization parameters. |
ProteinDatasetDistance¶
themap.distance.protein_distance.ProteinDatasetDistance ¶
Bases: AbstractTasksDistance
Calculate distances between protein datasets using various methods.
This class implements distance computation between protein datasets using:
- Euclidean distance
- Cosine distance
The class supports both single dataset comparisons and batch comparisons across multiple datasets.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tasks` | `Optional[Tasks]` | Tasks collection containing protein datasets for distance computation | `None` |
| `method` | `Optional[str]` | Distance computation method ('euclidean' or 'cosine') | `None` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If the specified method is not supported for protein datasets |
get_hopts ¶
Get hyperparameters for the distance computation method.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_type` | `str` | Type of data ("molecule", "protein", "metadata") | `'protein'` |
Returns:
| Type | Description |
|---|---|
| `Optional[Dict[str, Any]]` | Dictionary of hyperparameters specific to the chosen distance method for the data type. |
get_supported_methods ¶
Get list of supported methods for a specific data type.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `data_type` | `str` | Type of data ("molecule", "protein", "metadata") | required |
Returns:
| Type | Description |
|---|---|
| `List[str]` | List of supported method names for the data type |
euclidean_distance ¶
Compute Euclidean distance between protein datasets.
This method calculates the pairwise Euclidean distances between protein feature vectors in the datasets.
Returns:
| Type | Description |
|---|---|
| `Dict[str, Dict[str, float]]` | Dictionary containing Euclidean distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
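Because each task carries a single protein feature vector, the pairwise computation reduces to one vector-to-vector distance per (target, source) pair. The sketch below illustrates this with made-up two-dimensional vectors. The function `pairwise_vector_distance` and the task IDs are hypothetical, and the real embeddings would be much higher-dimensional.

```python
import math
from typing import Dict, Sequence


def pairwise_vector_distance(
    source_vectors: Dict[str, Sequence[float]],
    target_vectors: Dict[str, Sequence[float]],
) -> Dict[str, Dict[str, float]]:
    """Euclidean distance for every (target, source) pair of single vectors."""
    result: Dict[str, Dict[str, float]] = {}
    for target_id, target_vec in target_vectors.items():
        result[target_id] = {}
        for source_id, source_vec in source_vectors.items():
            if len(source_vec) != len(target_vec):
                raise ValueError(
                    f"dimension mismatch between {source_id!r} and {target_id!r}"
                )
            result[target_id][source_id] = math.dist(target_vec, source_vec)
    return result


# Toy protein vectors, made up for illustration.
source_vectors = {"source_1": [0.0, 0.0], "source_2": [3.0, 4.0]}
target_vectors = {"target_a": [0.0, 0.0]}
protein_distances = pairwise_vector_distance(source_vectors, target_vectors)
print(protein_distances)
```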
cosine_distance ¶
Compute cosine distance between protein datasets.
This method calculates the pairwise cosine distances between protein feature vectors in the datasets.
Returns:
| Type | Description |
|---|---|
| `Dict[str, Dict[str, float]]` | Dictionary containing cosine distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
sequence_identity_distance ¶
Compute sequence identity-based distance between protein datasets.
This method calculates distances based on protein sequence identity.
Returns:
| Type | Description |
|---|---|
| `Dict[str, Dict[str, float]]` | Dictionary containing sequence identity-based distances between datasets. |
Raises:
| Type | Description |
|---|---|
| `NotImplementedError` | This method is not yet implemented |
get_distance ¶
Compute the distance between protein datasets using the specified method.
Returns:
| Type | Description |
|---|---|
| `Dict[str, Dict[str, float]]` | Dictionary containing distance matrix between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values. |
load_distance ¶
Load pre-computed distances from a file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | Path to the file containing pre-computed distances | required |
Raises:
| Type | Description |
|---|---|
| `FileNotFoundError` | If the file doesn't exist |
| `ValueError` | If the file format is invalid |
to_pandas ¶
Convert the distance matrix to a pandas DataFrame.
Returns:
| Type | Description |
|---|---|
| `DataFrame` | DataFrame with source task IDs as index and target task IDs as columns, containing the distance values. |
__repr__ ¶
Return a string representation of the ProteinDatasetDistance instance.
Returns:
| Type | Description |
|---|---|
| `str` | String containing the class name and initialization parameters. |
TaskDistance¶
Utility Functions¶
Validation Functions¶
themap.distance.base._validate_and_extract_task_id ¶
Safely extract task ID from task name with validation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `task_name` | `str` | Task name in format 'fold_task_id' | required |
Returns:
| Type | Description |
|---|---|
| `str` | Extracted task ID |
Raises:
| Type | Description |
|---|---|
| `DataValidationError` | If task name format is invalid |
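The exact parsing rules are not spelled out above, so the following is only a sketch of how such a validator might behave. It assumes the fold is everything before the first underscore and the task ID is everything after it. The function `extract_task_id`, the example name `train_CHEMBL1234`, and the local `DataValidationError` stand-in are all hypothetical.

```python
class DataValidationError(Exception):
    """Local stand-in for themap.distance.exceptions.DataValidationError."""


def extract_task_id(task_name: str) -> str:
    """Split 'fold_task_id' on the first underscore and return the task ID part."""
    if not isinstance(task_name, str):
        raise DataValidationError(f"task name must be a string, got {type(task_name).__name__}")
    fold, separator, task_id = task_name.partition("_")
    if not separator or not fold or not task_id:
        raise DataValidationError(f"expected 'fold_task_id', got {task_name!r}")
    return task_id


extracted = extract_task_id("train_CHEMBL1234")
print(extracted)
```

Splitting on the first underscore only means that task IDs which themselves contain underscores are preserved intact under this assumption.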
Exception Classes¶
DistanceComputationError¶
themap.distance.exceptions.DistanceComputationError ¶
Bases: Exception
Custom exception for distance computation errors.
DataValidationError¶
themap.distance.exceptions.DataValidationError ¶
Bases: Exception
Custom exception for data validation errors.
Constants¶
Supported Methods¶
# Available distance methods for molecule datasets
MOLECULE_DISTANCE_METHODS = ["otdd", "euclidean", "cosine"]
# Available distance methods for protein datasets
PROTEIN_DISTANCE_METHODS = ["euclidean", "cosine"]
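These lists are what the constructors' `ValueError` checks are described as enforcing. A minimal sketch of such a check follows. The helper `check_method` is hypothetical and not part of the library, and the constants are redefined locally so the snippet runs on its own.

```python
from typing import Dict, List

MOLECULE_DISTANCE_METHODS = ["otdd", "euclidean", "cosine"]
PROTEIN_DISTANCE_METHODS = ["euclidean", "cosine"]

# Hypothetical lookup table from data type to its supported methods.
SUPPORTED: Dict[str, List[str]] = {
    "molecule": MOLECULE_DISTANCE_METHODS,
    "protein": PROTEIN_DISTANCE_METHODS,
}


def check_method(method: str, data_type: str) -> str:
    """Return the method unchanged if supported, otherwise raise ValueError."""
    if data_type not in SUPPORTED:
        raise ValueError(f"unknown data type {data_type!r}")
    if method not in SUPPORTED[data_type]:
        raise ValueError(
            f"method {method!r} is not supported for {data_type} datasets; "
            f"choose one of {SUPPORTED[data_type]}"
        )
    return method


checked = check_method("otdd", "molecule")
print(checked)
```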
Usage Examples¶
Basic Molecule Distance Computation¶
from themap.data.tasks import Tasks
from themap.distance import MoleculeDatasetDistance
# Load tasks from directory
tasks = Tasks.from_directory(
    directory="datasets/",
    task_list_file="datasets/sample_tasks_list.json",
    load_molecules=True,
    load_proteins=False
)
# Compute molecule distances using OTDD
# (the constructor documents this argument as `method`)
mol_distance = MoleculeDatasetDistance(
    tasks=tasks,
    method="otdd"
)
distances = mol_distance.get_distance()
print(distances)
# {'target_task': {'source_task': 0.75, ...}}
Protein Distance Computation¶
from themap.distance import ProteinDatasetDistance
# Compute protein distances using euclidean method
prot_distance = ProteinDatasetDistance(
    tasks=tasks,
    method="euclidean"
)
distances = prot_distance.get_distance()
Combined Task Distance¶
from themap.distance import TaskDistance
# Compute combined distances from multiple modalities
task_distance = TaskDistance(
    tasks=tasks,
    molecule_method="cosine",
    protein_method="euclidean"
)
# Compute all distance types
all_distances = task_distance.compute_all_distances(
    combination_strategy="weighted_average",
    molecule_weight=0.7,
    protein_weight=0.3
)
# Access specific distance types
molecule_distances = all_distances["molecule"]
protein_distances = all_distances["protein"]
combined_distances = all_distances["combined"]
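How the "weighted_average" strategy combines the two modalities is not detailed here. One plausible reading, sketched below purely as an assumption, is a normalized weighted mean applied entry by entry to the two nested dictionaries. The function `combine_weighted` and the toy values are hypothetical, and the library may rescale the distances before combining them.

```python
from typing import Dict

Nested = Dict[str, Dict[str, float]]


def combine_weighted(
    molecule: Nested,
    protein: Nested,
    molecule_weight: float = 0.7,
    protein_weight: float = 0.3,
) -> Nested:
    """Entry-wise weighted mean of two {target: {source: distance}} dictionaries."""
    total = molecule_weight + protein_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    combined: Nested = {}
    for target_id, row in molecule.items():
        combined[target_id] = {}
        for source_id, d_mol in row.items():
            d_prot = protein[target_id][source_id]  # KeyError if the pair is missing
            combined[target_id][source_id] = (
                molecule_weight * d_mol + protein_weight * d_prot
            ) / total
    return combined


# Toy distances, made up for illustration.
mol = {"target_a": {"source_1": 1.0, "source_2": 0.5}}
prot = {"target_a": {"source_1": 0.0, "source_2": 0.5}}
combined_example = combine_weighted(mol, prot)
print(combined_example)
```

Dividing by the sum of the weights keeps the combined value on the same scale as the inputs even when the weights do not add up to one.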
Working with External Distance Matrices¶
import numpy as np
# Load pre-computed distances
task_distance = TaskDistance.load_ext_chem_distance("path/to/chemical_distances.pkl")
# Or initialize with external matrices
external_chem = np.random.rand(10, 8) # 10 source, 8 target tasks
task_distance = TaskDistance(
    tasks=None,
    source_task_ids=["task1", "task2", ...],
    target_task_ids=["test1", "test2", ...],
    external_chemical_space=external_chem
)
# Convert to pandas for analysis
df = task_distance.to_pandas("external_chemical")
Error Handling¶
from themap.distance import DistanceComputationError, DataValidationError
try:
    # This might fail if OTDD dependencies are missing
    distances = mol_distance.otdd_distance()
except ImportError as e:
    print(f"OTDD not available: {e}")
    # Fall back to euclidean distance
    distances = mol_distance.euclidean_distance()
except DistanceComputationError as e:
    print(f"Distance computation failed: {e}")
except DataValidationError as e:
    print(f"Data validation failed: {e}")
Performance Considerations¶
Memory Usage¶
- OTDD: Most memory-intensive, especially for large datasets
- Euclidean/Cosine: More memory-efficient, suitable for large-scale computations
- External matrices: Memory usage depends on matrix size
Computational Complexity¶
- OTDD: O(n²m²), where n and m are the sizes of the two datasets being compared
- Euclidean/Cosine: O(nm) for feature extraction plus O(kl) for the distance matrix, where k and l are the numbers of tasks being compared
- Combined distances: Sum of individual method complexities
Optimization Tips¶
# 1. Use an appropriate maxsamples value for OTDD
hopts = {"maxsamples": 500} # Reduce for faster computation
# 2. Cache features for repeated computations
tasks.save_task_features_to_file("cached_features.pkl")
cached_features = Tasks.load_task_features_from_file("cached_features.pkl")
# 3. Use appropriate distance method based on data size
if num_molecules > 10000:
    method = "euclidean"  # Faster for large datasets
else:
    method = "otdd"  # More accurate for smaller datasets
Configuration¶
Distance Method Configuration¶
Configuration files for distance methods are stored in themap/models/distance_configures/.
Custom Configuration¶
from themap.utils.distance_utils import get_configure
# Get default configuration
config = get_configure("otdd")
# Modify configuration
config["maxsamples"] = 500
config["device"] = "cpu"
# Use in distance computation
mol_distance = MoleculeDatasetDistance(tasks=tasks, method="otdd")
# Configuration is automatically loaded and can be overridden