Distance Computation API

The distance module computes distances between molecular datasets, protein datasets, and tasks. It supports several distance metrics and handles both single-dataset comparisons and batch comparisons across multiple datasets.

Overview

The distance computation system consists of three main classes:

  • MoleculeDatasetDistance - Computes distances between molecule datasets
  • ProteinDatasetDistance - Computes distances between protein datasets
  • TaskDistance - Unified interface for computing combined task distances

Core Classes

AbstractTasksDistance

themap.distance.tasks_distance.AbstractTasksDistance

Base class for computing distances between datasets.

This abstract class defines the interface for dataset distance computation. It provides a common structure for both molecule and protein dataset distances.

Parameters:

Name Type Description Default
tasks Optional[Tasks]

Tasks collection for distance computation

None
molecule_method str

Distance computation method for molecules (default: "euclidean")

'euclidean'
protein_method str

Distance computation method for proteins (default: "euclidean")

'euclidean'
metadata_method str

Distance computation method for metadata (default: "euclidean")

'euclidean'
method Optional[str]

Global method (for backward compatibility, overrides individual methods if provided)

None
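The override rule described for `method` can be sketched as follows. This is a minimal illustration of the documented behavior, not the library's code; `resolve_methods` is a hypothetical helper name introduced only for this example.

```python
from typing import Dict, Optional


def resolve_methods(
    molecule_method: str = "euclidean",
    protein_method: str = "euclidean",
    metadata_method: str = "euclidean",
    method: Optional[str] = None,
) -> Dict[str, str]:
    """Return the effective method per data type (hypothetical helper)."""
    if method is not None:
        # A global method, when given, overrides every per-type setting.
        return {"molecule": method, "protein": method, "metadata": method}
    return {
        "molecule": molecule_method,
        "protein": protein_method,
        "metadata": metadata_method,
    }


resolved_default = resolve_methods(molecule_method="cosine")
resolved_global = resolve_methods(molecule_method="cosine", method="euclidean")
```

In this sketch `resolved_default` keeps the per-type choice for molecules, while `resolved_global` uses the global method for all three data types.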

get_num_tasks

get_num_tasks() -> Tuple[int, int]

Get the number of source and target tasks.

get_distance

get_distance() -> Dict[str, Dict[str, float]]

Compute the distance between datasets.

Each of the subclasses should implement this method.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing distance matrix between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.

Raises:

Type Description
NotImplementedError

If not implemented by subclass
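Because the returned dictionary is keyed by target task ID first and source task ID second, finding the closest source task for each target is a single pass over the outer dictionary. The sketch below uses made-up task IDs and distance values purely for illustration; `closest_sources` is a hypothetical helper, not part of the module.

```python
from typing import Dict

# Illustrative values only; real results come from get_distance().
distances: Dict[str, Dict[str, float]] = {
    "target_a": {"source_1": 0.75, "source_2": 0.25},
    "target_b": {"source_1": 0.5, "source_2": 0.9},
}


def closest_sources(dist: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each target task ID to the source task ID with the smallest distance."""
    return {target_id: min(row, key=row.get) for target_id, row in dist.items()}


nearest = closest_sources(distances)
```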

get_hopts

get_hopts(data_type: str = 'molecule') -> Optional[Dict[str, Any]]

Get hyperparameters for distance computation.

Each of the subclasses should implement this method.

Parameters:

Name Type Description Default
data_type str

Type of data ("molecule", "protein", "metadata")

'molecule'

Returns:

Type Description
Optional[Dict[str, Any]]

Dictionary of hyperparameters for the specified data type distance computation method.

Raises:

Type Description
NotImplementedError

If not implemented by subclass

get_supported_methods

get_supported_methods(data_type: str) -> List[str]

Get list of supported methods for a specific data type.

Parameters:

Name Type Description Default
data_type str

Type of data ("molecule", "protein", "metadata")

required

Returns:

Type Description
List[str]

List of supported method names for the data type

Raises:

Type Description
NotImplementedError

If not implemented by subclass

__call__

__call__(*args: Any, **kwds: Any) -> Dict[str, Dict[str, float]]

Allow the class to be called as a function.

Each of the subclasses should implement this method.

Returns:

Type Description
Dict[str, Dict[str, float]]

The computed distance matrix.

MoleculeDatasetDistance

themap.distance.tasks_distance.MoleculeDatasetDistance

Bases: AbstractTasksDistance

Calculate distances between molecule datasets using various methods.

This class implements distance computation between molecule datasets using:

  • Optimal Transport Dataset Distance (OTDD)
  • Euclidean distance
  • Cosine distance

The class supports both single dataset comparisons and batch comparisons across multiple datasets.

Parameters:

Name Type Description Default
tasks Optional[Tasks]

Tasks collection containing molecule datasets for distance computation

None
method Optional[str]

Distance computation method ('otdd', 'euclidean', or 'cosine')

None
**kwargs Any

Additional arguments passed to the distance computation method

{}

Raises:

Type Description
ValueError

If the specified method is not supported for molecule datasets

get_hopts

get_hopts(data_type: str = 'molecule') -> Optional[Dict[str, Any]]

Get hyperparameters for the distance computation method.

Parameters:

Name Type Description Default
data_type str

Type of data ("molecule", "protein", "metadata")

'molecule'

Returns:

Type Description
Optional[Dict[str, Any]]

Dictionary of hyperparameters specific to the chosen distance method for the data type.

get_supported_methods

get_supported_methods(data_type: str) -> List[str]

Get list of supported methods for a specific data type.

Parameters:

Name Type Description Default
data_type str

Type of data ("molecule", "protein", "metadata")

required

Returns:

Type Description
List[str]

List of supported method names for the data type

otdd_distance

otdd_distance() -> Dict[str, Dict[str, float]]

Compute Optimal Transport Dataset Distance between molecule datasets.

This method uses the OTDD implementation to compute distances between molecule datasets, which takes into account both the feature space and label space of the datasets.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing OTDD distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.

euclidean_distance

euclidean_distance() -> Dict[str, Dict[str, float]]

Compute Euclidean distance between molecule datasets.

This method computes the Euclidean distance between the feature vectors of the datasets. For each dataset, it computes the mean feature vector and then calculates the pairwise distances between these mean vectors.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing Euclidean distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.

Raises:

Type Description
DistanceComputationError

If feature computation fails
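The mean-vector approach described for euclidean_distance can be sketched in plain Python. This is an illustration under stated assumptions rather than the library's implementation: features are given as lists of floats, and `mean_vector` and `mean_euclidean` are hypothetical names.

```python
import math
from typing import Dict, List, Sequence


def mean_vector(features: Sequence[Sequence[float]]) -> List[float]:
    """Average the feature vectors of one dataset, column by column."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def mean_euclidean(
    source: Dict[str, Sequence[Sequence[float]]],
    target: Dict[str, Sequence[Sequence[float]]],
) -> Dict[str, Dict[str, float]]:
    """Distances between dataset means, keyed by target ID, then source ID."""
    source_means = {sid: mean_vector(feats) for sid, feats in source.items()}
    result: Dict[str, Dict[str, float]] = {}
    for target_id, feats in target.items():
        target_mean = mean_vector(feats)
        result[target_id] = {
            sid: math.dist(target_mean, s_mean) for sid, s_mean in source_means.items()
        }
    return result


source_sets = {"source_1": [[0.0, 0.0], [2.0, 0.0]]}  # mean is (1, 0)
target_sets = {"target_a": [[4.0, 4.0], [4.0, 4.0]]}  # mean is (4, 4)
euclid = mean_euclidean(source_sets, target_sets)
```

The two means differ by (3, 4), so the single entry in `euclid` is a 3-4-5 triangle distance.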

cosine_distance

cosine_distance() -> Dict[str, Dict[str, float]]

Compute cosine distance between molecule datasets.

This method computes the cosine distance between the feature vectors of the datasets.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing cosine distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.
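For reference, cosine distance between two feature vectors is commonly defined as one minus their cosine similarity. The sketch below assumes that definition and plain lists of floats; whether the library applies it to mean vectors or to individual samples is not stated here, and `cosine_distance` in this block is a standalone function, not the class method.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cos(u, v); raise if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # perpendicular vectors
parallel = cosine_distance([1.0, 2.0], [2.0, 4.0])    # same direction
```

Perpendicular vectors sit at distance 1 and vectors pointing the same way at (numerically close to) 0, regardless of their magnitudes.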

get_distance

get_distance() -> Dict[str, Dict[str, float]]

Compute the distance between molecule datasets using the specified method.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing distance matrix between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.

load_distance

load_distance(path: str) -> None

Load pre-computed distances from a file.

Parameters:

Name Type Description Default
path str

Path to the file containing pre-computed distances

required

Raises:

Type Description
FileNotFoundError

If the file doesn't exist

ValueError

If the file format is invalid

to_pandas

to_pandas() -> pd.DataFrame

Convert the distance matrix to a pandas DataFrame.

Returns:

Type Description
DataFrame

DataFrame with source task IDs as index and target task IDs as columns, containing the distance values.
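Note that the DataFrame is indexed by source task ID, whereas the nested dictionary from get_distance is keyed by target task ID first. A dependency-free sketch of that reorientation is shown below; `by_source` is a hypothetical helper and the values are illustrative.

```python
from typing import Dict


def by_source(dist: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Re-key a target-first nested dict so that source IDs come first."""
    rows: Dict[str, Dict[str, float]] = {}
    for target_id, row in dist.items():
        for source_id, value in row.items():
            rows.setdefault(source_id, {})[target_id] = value
    return rows


target_first = {
    "target_a": {"source_1": 0.75, "source_2": 0.25},
    "target_b": {"source_1": 0.5, "source_2": 0.9},
}
source_first = by_source(target_first)
```

Each key of `source_first` corresponds to one row of the DataFrame, and each inner key to one column.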

__repr__

__repr__() -> str

Return a string representation of the MoleculeDatasetDistance instance.

Returns:

Type Description
str

String containing the class name and initialization parameters.

ProteinDatasetDistance

themap.distance.tasks_distance.ProteinDatasetDistance

Bases: AbstractTasksDistance

Calculate distances between protein datasets using various methods.

This class implements distance computation between protein datasets using:

  • Euclidean distance
  • Cosine distance

The class supports both single dataset comparisons and batch comparisons across multiple datasets.

Parameters:

Name Type Description Default
tasks Optional[Tasks]

Tasks collection containing protein datasets for distance computation

None
method Optional[str]

Distance computation method ('euclidean' or 'cosine')

None

Raises:

Type Description
ValueError

If the specified method is not supported for protein datasets

get_hopts

get_hopts(data_type: str = 'protein') -> Optional[Dict[str, Any]]

Get hyperparameters for the distance computation method.

Parameters:

Name Type Description Default
data_type str

Type of data ("molecule", "protein", "metadata")

'protein'

Returns:

Type Description
Optional[Dict[str, Any]]

Dictionary of hyperparameters specific to the chosen distance method for the data type.

get_supported_methods

get_supported_methods(data_type: str) -> List[str]

Get list of supported methods for a specific data type.

Parameters:

Name Type Description Default
data_type str

Type of data ("molecule", "protein", "metadata")

required

Returns:

Type Description
List[str]

List of supported method names for the data type

euclidean_distance

euclidean_distance() -> Dict[str, Dict[str, float]]

Compute Euclidean distance between protein datasets.

This method calculates the pairwise Euclidean distances between protein feature vectors in the datasets.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing Euclidean distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.

cosine_distance

cosine_distance() -> Dict[str, Dict[str, float]]

Compute cosine distance between protein datasets.

This method calculates the pairwise cosine distances between protein feature vectors in the datasets.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing cosine distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.

sequence_identity_distance

sequence_identity_distance() -> Dict[str, Dict[str, float]]

Compute sequence identity-based distance between protein datasets.

This method calculates distances based on protein sequence identity.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing sequence identity-based distances between datasets.

Raises:

Type Description
NotImplementedError

This method is not yet implemented

get_distance

get_distance() -> Dict[str, Dict[str, float]]

Compute the distance between protein datasets using the specified method.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing distance matrix between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.

load_distance

load_distance(path: str) -> None

Load pre-computed distances from a file.

Parameters:

Name Type Description Default
path str

Path to the file containing pre-computed distances

required

Raises:

Type Description
FileNotFoundError

If the file doesn't exist

ValueError

If the file format is invalid

to_pandas

to_pandas() -> pd.DataFrame

Convert the distance matrix to a pandas DataFrame.

Returns:

Type Description
DataFrame

DataFrame with source task IDs as index and target task IDs as columns, containing the distance values.

__repr__

__repr__() -> str

Return a string representation of the ProteinDatasetDistance instance.

Returns:

Type Description
str

String containing the class name and initialization parameters.

TaskDistance

themap.distance.tasks_distance.TaskDistance

Bases: AbstractTasksDistance

Class for computing and managing distances between tasks.

This class handles the computation and storage of distances between tasks, supporting both chemical and protein space distances. It can compute distances directly from Tasks collections or work with pre-computed distance matrices.

Parameters:

Name Type Description Default
tasks Optional[Tasks]

Tasks collection for distance computation (optional)

None
method Optional[str]

Default distance computation method

None
source_task_ids Optional[List[str]]

List of task IDs for source tasks (legacy, optional)

None
target_task_ids Optional[List[str]]

List of task IDs for target tasks (legacy, optional)

None
external_chemical_space Optional[ndarray]

Pre-computed chemical space distance matrix (optional)

None
external_protein_space Optional[ndarray]

Pre-computed protein space distance matrix (optional)

None

shape property

shape: Tuple[int, int]

Get the shape of the distance matrix.

Returns:

Type Description
Tuple[int, int]

Tuple containing (number of source tasks, number of target tasks).

get_distance

get_distance() -> Dict[str, Dict[str, float]]

Compute and return the default distance between tasks.

Returns the combined distance when both molecule and protein data are available. Otherwise it falls back to the molecule distance, and then to the protein distance.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing distance matrix between source and target tasks.
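The fallback order described above can be sketched as a small selection function. The function name and the error raised when no data is present are assumptions made for this illustration; the source does not say what happens in that case.

```python
def choose_default_distance(has_molecules: bool, has_proteins: bool) -> str:
    """Pick which distance type a default get_distance() call would use."""
    if has_molecules and has_proteins:
        return "combined"
    if has_molecules:
        return "molecule"
    if has_proteins:
        return "protein"
    # Assumption: with neither modality loaded there is nothing to compute.
    raise ValueError("no molecule or protein data available")


both = choose_default_distance(True, True)
molecule_only = choose_default_distance(True, False)
protein_only = choose_default_distance(False, True)
```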

get_hopts

get_hopts(data_type: str = 'molecule') -> Optional[Dict[str, Any]]

Get hyperparameters for distance computation.

Parameters:

Name Type Description Default
data_type str

Type of data ("molecule", "protein", "metadata")

'molecule'

Returns:

Type Description
Optional[Dict[str, Any]]

Dictionary of hyperparameters for the specified data type distance computation method.

get_supported_methods

get_supported_methods(data_type: str) -> List[str]

Get list of supported methods for a specific data type.

Parameters:

Name Type Description Default
data_type str

Type of data ("molecule", "protein", "metadata")

required

Returns:

Type Description
List[str]

List of supported method names for the data type

__repr__

__repr__() -> str

Return a string representation of the TaskDistance instance.

Returns:

Type Description
str

String containing the number of source and target tasks and the mode.

compute_molecule_distance

compute_molecule_distance(
    method: Optional[str] = None, molecule_featurizer: str = "ecfp"
) -> Dict[str, Dict[str, float]]

Compute distances between tasks using molecule data.

Parameters:

Name Type Description Default
method Optional[str]

Distance computation method ('euclidean', 'cosine', or 'otdd'). If None, uses the molecule_method from initialization.

None
molecule_featurizer str

Molecular featurizer to use

'ecfp'

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing molecule-based distances between tasks.

compute_protein_distance

compute_protein_distance(
    method: Optional[str] = None,
    protein_featurizer: str = "esm2_t33_650M_UR50D",
) -> Dict[str, Dict[str, float]]

Compute distances between tasks using protein data.

Parameters:

Name Type Description Default
method Optional[str]

Distance computation method ('euclidean' or 'cosine'). If None, uses the protein_method from initialization.

None
protein_featurizer str

Protein featurizer to use

'esm2_t33_650M_UR50D'

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing protein-based distances between tasks.

compute_combined_distance

compute_combined_distance(
    molecule_method: Optional[str] = None,
    protein_method: Optional[str] = None,
    combination_strategy: str = "average",
    molecule_weight: float = 0.5,
    protein_weight: float = 0.5,
    molecule_featurizer: str = "ecfp",
    protein_featurizer: str = "esm2_t33_650M_UR50D",
) -> Dict[str, Dict[str, float]]

Compute combined distances using both molecule and protein data.

Parameters:

Name Type Description Default
molecule_method Optional[str]

Method for molecule distance computation

None
protein_method Optional[str]

Method for protein distance computation

None
combination_strategy str

How to combine distances ('average', 'weighted_average', 'min', 'max')

'average'
molecule_weight float

Weight for molecule distances (used with 'weighted_average')

0.5
protein_weight float

Weight for protein distances (used with 'weighted_average')

0.5
molecule_featurizer str

Molecular featurizer to use

'ecfp'
protein_featurizer str

Protein featurizer to use

'esm2_t33_650M_UR50D'

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing combined distances between tasks.
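The four combination strategies can be illustrated on two nested distance dictionaries. This is a sketch, not the library's code: `combine_distances` is a hypothetical name, the weighted average is normalized by the sum of the weights (an assumption), and pairs missing from either input are skipped (also an assumption).

```python
from typing import Dict

Nested = Dict[str, Dict[str, float]]


def combine_distances(
    molecule: Nested,
    protein: Nested,
    strategy: str = "average",
    molecule_weight: float = 0.5,
    protein_weight: float = 0.5,
) -> Nested:
    """Combine two target-first distance dicts entry by entry."""
    combined: Nested = {}
    for target_id, mol_row in molecule.items():
        prot_row = protein.get(target_id, {})
        combined[target_id] = {}
        for source_id, m in mol_row.items():
            if source_id not in prot_row:
                continue  # assumption: only pairs present in both inputs are combined
            p = prot_row[source_id]
            if strategy == "average":
                value = (m + p) / 2
            elif strategy == "weighted_average":
                total = molecule_weight + protein_weight
                value = (molecule_weight * m + protein_weight * p) / total
            elif strategy == "min":
                value = min(m, p)
            elif strategy == "max":
                value = max(m, p)
            else:
                raise ValueError(f"unknown combination strategy: {strategy!r}")
            combined[target_id][source_id] = value
    return combined


mol = {"target_a": {"source_1": 1.0}}
prot = {"target_a": {"source_1": 3.0}}
avg = combine_distances(mol, prot)
weighted = combine_distances(mol, prot, "weighted_average", 0.75, 0.25)
smallest = combine_distances(mol, prot, "min")
largest = combine_distances(mol, prot, "max")
```

Since the two inputs may live on different scales (for example OTDD versus Euclidean), normalizing each matrix before combining may be worth considering; the sketch does not do this.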

compute_all_distances

compute_all_distances(
    molecule_method: Optional[str] = None,
    protein_method: Optional[str] = None,
    combination_strategy: str = "average",
    molecule_weight: float = 0.5,
    protein_weight: float = 0.5,
    molecule_featurizer: str = "ecfp",
    protein_featurizer: str = "esm2_t33_650M_UR50D",
) -> Dict[str, Dict[str, Dict[str, float]]]

Compute all distance types (molecule, protein, and combined).

Parameters:

Name Type Description Default
molecule_method Optional[str]

Method for molecule distance computation

None
protein_method Optional[str]

Method for protein distance computation

None
combination_strategy str

How to combine distances

'average'
molecule_weight float

Weight for molecule distances

0.5
protein_weight float

Weight for protein distances

0.5
molecule_featurizer str

Molecular featurizer to use

'ecfp'
protein_featurizer str

Protein featurizer to use

'esm2_t33_650M_UR50D'

Returns:

Type Description
Dict[str, Dict[str, Dict[str, float]]]

Dictionary with keys 'molecule', 'protein', 'combined' containing respective distance matrices.

compute_ext_chem_distance

compute_ext_chem_distance(method: str) -> Dict[str, Dict[str, float]]

Compute chemical space distances between tasks using external matrices.

Parameters:

Name Type Description Default
method str

Distance computation method to use

required

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing chemical space distances between tasks.

Raises:

Type Description
NotImplementedError

If external chemical space is not provided

compute_ext_prot_distance

compute_ext_prot_distance(method: str) -> Dict[str, Dict[str, float]]

Compute protein space distances between tasks using external matrices.

Parameters:

Name Type Description Default
method str

Distance computation method to use

required

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing protein space distances between tasks.

Raises:

Type Description
NotImplementedError

If external protein space is not provided

load_ext_chem_distance staticmethod

load_ext_chem_distance(path: str) -> TaskDistance

Load pre-computed chemical space distances from a file.

Parameters:

Name Type Description Default
path str

Path to the file containing pre-computed chemical space distances

required

Returns:

Type Description
TaskDistance

TaskDistance instance initialized with the loaded distances.

Note

The file should contain a dictionary with keys:

  • 'train_chembl_ids' or 'train_pubchem_ids' or 'source_task_ids'
  • 'test_chembl_ids' or 'test_pubchem_ids' or 'target_task_ids'
  • 'distance_matrices'
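A file with this layout could be produced and read back as sketched below. Several details are assumptions, not confirmed by this page: that the file is a pickle (suggested by the `.pkl` extension used in the usage examples), that 'distance_matrices' holds one 2-D matrix, and that rows correspond to source tasks and columns to target tasks. Plain nested lists stand in for a NumPy array.

```python
import os
import pickle
import tempfile

# Assumed layout: rows are source tasks, columns are target tasks.
payload = {
    "source_task_ids": ["source_1", "source_2"],
    "target_task_ids": ["target_a"],
    "distance_matrices": [[0.25], [0.75]],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "chemical_distances.pkl")
    with open(path, "wb") as handle:
        pickle.dump(payload, handle)
    with open(path, "rb") as handle:
        loaded = pickle.load(handle)

# Convert the matrix into the target-first nested dict used elsewhere in this API.
nested = {
    target_id: {
        source_id: loaded["distance_matrices"][i][j]
        for i, source_id in enumerate(loaded["source_task_ids"])
    }
    for j, target_id in enumerate(loaded["target_task_ids"])
}
```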

load_ext_prot_distance staticmethod

load_ext_prot_distance(path: str) -> TaskDistance

Load pre-computed protein space distances from a file.

Parameters:

Name Type Description Default
path str

Path to the file containing pre-computed protein space distances

required

Returns:

Type Description
TaskDistance

TaskDistance instance initialized with the loaded distances.

Note

The file should contain a dictionary with keys:

  • 'train_chembl_ids' or 'train_pubchem_ids' or 'source_task_ids'
  • 'test_chembl_ids' or 'test_pubchem_ids' or 'target_task_ids'
  • 'distance_matrices'

get_computed_distance

get_computed_distance(
    distance_type: str = "combined",
) -> Optional[Dict[str, Dict[str, float]]]

Get computed distances of the specified type.

Parameters:

Name Type Description Default
distance_type str

Type of distance to return ('molecule', 'protein', 'combined')

'combined'

Returns:

Type Description
Optional[Dict[str, Dict[str, float]]]

Dictionary containing the requested distances, or None if not computed.

to_pandas

to_pandas(distance_type: str = 'combined') -> pd.DataFrame

Convert distance matrix to a pandas DataFrame.

Parameters:

Name Type Description Default
distance_type str

Type of distance to convert ('molecule', 'protein', 'combined', 'external_chemical')

'combined'

Returns:

Type Description
DataFrame

DataFrame with source task IDs as index and target task IDs as columns, containing the distance values.

Raises:

Type Description
ValueError

If no distances of the specified type are available

Utility Functions

Validation Functions

themap.distance.tasks_distance._validate_and_extract_task_id

_validate_and_extract_task_id(task_name: str) -> str

Safely extract task ID from task name with validation.

Parameters:

Name Type Description Default
task_name str

Task name in format 'fold_task_id'

required

Returns:

Type Description
str

Extracted task ID

Raises:

Type Description
DataValidationError

If task name format is invalid
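One way such a validator might work is sketched below. It assumes the fold prefix is everything before the first underscore and the task ID is the remainder (for example "train_CHEMBL1234"); the actual parsing rule is not documented on this page. `extract_task_id` is a hypothetical name, and the exception class is a local stand-in for the module's `DataValidationError`.

```python
class DataValidationError(Exception):
    """Local stand-in for themap's DataValidationError."""


def extract_task_id(task_name: str) -> str:
    """Split 'fold_task_id' at the first underscore and return the task ID."""
    if not isinstance(task_name, str):
        raise DataValidationError(f"task name must be a string, got {type(task_name).__name__}")
    fold, separator, task_id = task_name.partition("_")
    if not separator or not fold or not task_id:
        raise DataValidationError(f"expected 'fold_task_id', got {task_name!r}")
    return task_id


task_id = extract_task_id("train_CHEMBL1234")
```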

Exception Classes

DistanceComputationError

themap.distance.tasks_distance.DistanceComputationError

Bases: Exception

Custom exception for distance computation errors.

DataValidationError

themap.distance.tasks_distance.DataValidationError

Bases: Exception

Custom exception for data validation errors.

Constants

Supported Methods

# Available distance methods for molecule datasets
MOLECULE_DISTANCE_METHODS = ["otdd", "euclidean", "cosine"]

# Available distance methods for protein datasets
PROTEIN_DISTANCE_METHODS = ["euclidean", "cosine"]
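These lists make it easy to validate a method name before constructing a distance object. The check below is a sketch of the kind of validation that leads to the documented `ValueError`; `check_method` is a hypothetical helper, and the constants are redefined locally so the block stands alone.

```python
from typing import Dict, List

MOLECULE_DISTANCE_METHODS = ["otdd", "euclidean", "cosine"]
PROTEIN_DISTANCE_METHODS = ["euclidean", "cosine"]

SUPPORTED: Dict[str, List[str]] = {
    "molecule": MOLECULE_DISTANCE_METHODS,
    "protein": PROTEIN_DISTANCE_METHODS,
}


def check_method(method: str, data_type: str) -> str:
    """Return the method unchanged if it is supported for the data type."""
    if data_type not in SUPPORTED:
        raise ValueError(f"unknown data type: {data_type!r}")
    if method not in SUPPORTED[data_type]:
        raise ValueError(
            f"method {method!r} is not supported for {data_type} datasets; "
            f"choose one of {SUPPORTED[data_type]}"
        )
    return method


molecule_choice = check_method("otdd", "molecule")
```

Calling `check_method("otdd", "protein")` raises `ValueError`, because OTDD is listed only for molecule datasets.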

Usage Examples

Basic Molecule Distance Computation

from themap.data.tasks import Tasks
from themap.distance import MoleculeDatasetDistance

# Load tasks from directory
tasks = Tasks.from_directory(
    directory="datasets/",
    task_list_file="datasets/sample_tasks_list.json",
    load_molecules=True,
    load_proteins=False
)

# Compute molecule distances using OTDD
# (`method` is the constructor parameter documented for MoleculeDatasetDistance)
mol_distance = MoleculeDatasetDistance(
    tasks=tasks,
    method="otdd"
)

distances = mol_distance.get_distance()
print(distances)
# {'target_task': {'source_task': 0.75, ...}}

Protein Distance Computation

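from themap.data.tasks import Tasks
from themap.distance import ProteinDatasetDistance

# Protein distances need protein data, so load the tasks with load_proteins=True
tasks = Tasks.from_directory(
    directory="datasets/",
    task_list_file="datasets/sample_tasks_list.json",
    load_molecules=True,
    load_proteins=True
)

# Compute protein distances using the Euclidean method
prot_distance = ProteinDatasetDistance(
    tasks=tasks,
    method="euclidean"
)

distances = prot_distance.get_distance()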

Combined Task Distance

from themap.distance import TaskDistance

# Compute combined distances from multiple modalities
# (`tasks` must be loaded with both load_molecules=True and load_proteins=True)
task_distance = TaskDistance(
    tasks=tasks,
    molecule_method="cosine",
    protein_method="euclidean"
)

# Compute all distance types
all_distances = task_distance.compute_all_distances(
    combination_strategy="weighted_average",
    molecule_weight=0.7,
    protein_weight=0.3
)

# Access specific distance types
molecule_distances = all_distances["molecule"]
protein_distances = all_distances["protein"]
combined_distances = all_distances["combined"]

Working with External Distance Matrices

import numpy as np

from themap.distance import TaskDistance

# Load pre-computed distances
task_distance = TaskDistance.load_ext_chem_distance("path/to/chemical_distances.pkl")

# Or initialize with external matrices
external_chem = np.random.rand(10, 8)  # 10 source, 8 target tasks
task_distance = TaskDistance(
    tasks=None,
    source_task_ids=[f"task{i}" for i in range(1, 11)],  # "task1" ... "task10"
    target_task_ids=[f"test{i}" for i in range(1, 9)],   # "test1" ... "test8"
    external_chemical_space=external_chem
)

# Convert to pandas for analysis
df = task_distance.to_pandas("external_chemical")

Error Handling

# Both exceptions are documented under themap.distance.tasks_distance
from themap.distance.tasks_distance import DataValidationError, DistanceComputationError

try:
    # This might fail if OTDD dependencies are missing
    distances = mol_distance.otdd_distance()
except ImportError as e:
    print(f"OTDD not available: {e}")
    # Fall back to euclidean distance
    distances = mol_distance.euclidean_distance()
except DistanceComputationError as e:
    print(f"Distance computation failed: {e}")
except DataValidationError as e:
    print(f"Data validation failed: {e}")

Performance Considerations

Memory Usage

  • OTDD: Most memory-intensive, especially for large datasets
  • Euclidean/Cosine: More memory-efficient, suitable for large-scale computations
  • External matrices: Memory usage depends on matrix size

Computational Complexity

  • OTDD: O(n²m²), where n and m are the sizes of the two datasets being compared
  • Euclidean/Cosine: O(nm) for feature extraction plus O(kl) for the distance matrix, where k and l are the numbers of tasks
  • Combined distances: Sum of the individual method complexities

Optimization Tips

from themap.data.tasks import Tasks

# 1. Lower the OTDD maxsamples setting for faster computation
hopts = {"maxsamples": 500}

# 2. Cache features for repeated computations
#    (`tasks` is the Tasks collection loaded in the usage examples)
tasks.save_task_features_to_file("cached_features.pkl")
cached_features = Tasks.load_task_features_from_file("cached_features.pkl")

# 3. Choose the distance method based on data size
num_molecules = 25_000  # placeholder: total molecule count across your tasks
if num_molecules > 10000:
    method = "euclidean"  # Faster for large datasets
else:
    method = "otdd"       # More accurate for smaller datasets

Configuration

Distance Method Configuration

Configuration files for distance methods are stored in themap/models/distance_configures/. For example, otdd.json:

{
    "method": "otdd",
    "maxsamples": 1000,
    "device": "auto",
    "parallel": true
}

Custom Configuration

from themap.utils.distance_utils import get_configure

# Get default configuration
config = get_configure("otdd")

# Modify configuration
config["maxsamples"] = 500
config["device"] = "cpu"

# Use in distance computation; the default configuration is loaded
# automatically and can be overridden
mol_distance = MoleculeDatasetDistance(tasks=tasks, method="otdd")