Distance Computation API

The distance module computes distances between molecular datasets, protein datasets, and tasks. It supports several distance metrics and handles both single dataset comparisons and batch comparisons across multiple datasets.

Overview

The distance computation system consists of three main classes:

  • MoleculeDatasetDistance - Computes distances between molecule datasets
  • ProteinDatasetDistance - Computes distances between protein datasets
  • TaskDistance - Unified interface for computing combined task distances

Core Classes

AbstractTasksDistance

themap.distance.base.AbstractTasksDistance

Base class for computing distances between tasks.

This abstract class defines the interface for task distance computation. It distinguishes between:

  • Dataset distances: between sets of molecules (OTDD, set-based Euclidean/Cosine)
  • Metadata distances: between single vectors per task (vector-based Euclidean/Cosine)

Parameters:

Name Type Description Default
tasks Optional[Tasks]

Tasks collection for distance computation

None
dataset_method str

Distance computation method for datasets (molecules) (default: "euclidean")

'euclidean'
metadata_method str

Distance computation method for metadata including protein (default: "euclidean")

'euclidean'
molecule_method Optional[str]

Deprecated alias for dataset_method

None
protein_method Optional[str]

Deprecated: protein data is treated as metadata; use metadata_method instead

None
method Optional[str]

Global method (for backward compatibility, overrides individual methods if provided)

None
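The way these parameters interact can be sketched as follows. `resolve_methods` is a hypothetical helper, not part of themap, and it assumes that a deprecated alias, when supplied, replaces its modern counterpart. Only the rule that `method` overrides the individual methods comes from the table above.

```python
from typing import Optional, Tuple


def resolve_methods(
    dataset_method: str = "euclidean",
    metadata_method: str = "euclidean",
    molecule_method: Optional[str] = None,
    protein_method: Optional[str] = None,
    method: Optional[str] = None,
) -> Tuple[str, str]:
    """Return the effective (dataset_method, metadata_method) pair."""
    # Assumption: a deprecated alias, if given, replaces its modern counterpart.
    if molecule_method is not None:
        dataset_method = molecule_method
    if protein_method is not None:
        metadata_method = protein_method
    # Documented rule: a global `method` overrides the individual methods.
    if method is not None:
        dataset_method = metadata_method = method
    return dataset_method, metadata_method


print(resolve_methods(molecule_method="otdd"))
print(resolve_methods(dataset_method="otdd", method="cosine"))
```

The first call keeps the metadata default and only switches the dataset method; the second shows the global `method` winning over an explicit `dataset_method`.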

get_num_tasks

get_num_tasks() -> Tuple[int, int]

Get the number of source and target tasks.

get_distance

get_distance() -> Dict[str, Dict[str, float]]

Compute the distance between datasets.

Each of the subclasses should implement this method.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing the distance matrix between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.

Raises:

Type Description
NotImplementedError

If not implemented by subclass
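Because the result is keyed target-first, a common follow-up is to look up the closest source task for each target. The sketch below uses made-up task IDs and distances in the documented layout; `nearest_source` is an illustrative helper, not a themap function.

```python
from typing import Dict

# Made-up distances in the documented layout: {target_id: {source_id: distance}}.
distances: Dict[str, Dict[str, float]] = {
    "target_a": {"source_1": 0.75, "source_2": 0.20},
    "target_b": {"source_1": 0.10, "source_2": 0.90},
}


def nearest_source(dist: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each target task ID to the source task ID with the smallest distance."""
    return {
        target_id: min(by_source, key=by_source.get)
        for target_id, by_source in dist.items()
    }


print(nearest_source(distances))
# {'target_a': 'source_2', 'target_b': 'source_1'}
```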

get_hopts

get_hopts(data_type: str = 'dataset') -> Optional[Dict[str, Any]]

Get hyperparameters for distance computation.

Each of the subclasses should implement this method.

Parameters:

Name Type Description Default
data_type str

Type of data ("dataset" or "metadata"). Legacy values: "molecule" (alias for "dataset") and "protein" (alias for "metadata").

'dataset'

Returns:

Type Description
Optional[Dict[str, Any]]

Dictionary containing hyperparameters for the distance computation method, or None if no hyperparameters are needed.

Raises:

Type Description
NotImplementedError

If not implemented by subclass

get_supported_methods

get_supported_methods(data_type: str) -> List[str]

Get list of supported methods for a specific data type.

Parameters:

Name Type Description Default
data_type str

Type of data ("dataset" or "metadata"). Legacy values: "molecule" (alias for "dataset") and "protein" (alias for "metadata").

required

Returns:

Type Description
List[str]

List of supported method names for the data type

Raises:

Type Description
NotImplementedError

If not implemented by subclass
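The legacy `data_type` values can be mapped onto the current ones before any lookup. This is a minimal sketch of that mapping. The names `LEGACY_DATA_TYPE_ALIASES` and `normalize_data_type` are hypothetical, and raising `ValueError` for unknown values is an assumption rather than documented behavior.

```python
# Alias table taken from the parameter description above.
LEGACY_DATA_TYPE_ALIASES = {"molecule": "dataset", "protein": "metadata"}
VALID_DATA_TYPES = ("dataset", "metadata")


def normalize_data_type(data_type: str) -> str:
    """Translate a legacy data_type into its current name and validate it."""
    canonical = LEGACY_DATA_TYPE_ALIASES.get(data_type, data_type)
    if canonical not in VALID_DATA_TYPES:
        # Assumption: unknown values are rejected with ValueError.
        raise ValueError(
            f"Unknown data_type {data_type!r}; expected one of {VALID_DATA_TYPES}"
        )
    return canonical


print(normalize_data_type("molecule"))  # dataset
print(normalize_data_type("metadata"))  # metadata
```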

__call__

__call__(*args: Any, **kwds: Any) -> Dict[str, Dict[str, float]]

Allow the class to be called as a function.

Each of the subclasses should implement this method.

Returns:

Type Description
Dict[str, Dict[str, float]]

The computed distance matrix.

MoleculeDatasetDistance

themap.distance.molecule_distance.MoleculeDatasetDistance

Bases: AbstractTasksDistance

Calculate distances between molecule datasets using various methods.

This class implements distance computation between molecule datasets using:

  • Optimal Transport Dataset Distance (OTDD)
  • Euclidean distance
  • Cosine distance

The class supports both single dataset comparisons and batch comparisons across multiple datasets.

Parameters:

Name Type Description Default
tasks Optional[Tasks]

Tasks collection containing molecule datasets for distance computation

None
method Optional[str]

Distance computation method ('otdd', 'euclidean', or 'cosine')

None
**kwargs Any

Additional arguments passed to the distance computation method

{}

Raises:

Type Description
ValueError

If the specified method is not supported for molecule datasets

get_hopts

get_hopts(data_type: str = 'molecule') -> Optional[Dict[str, Any]]

Get hyperparameters for the distance computation method.

Parameters:

Name Type Description Default
data_type str

Type of data ("molecule", "protein", "metadata")

'molecule'

Returns:

Type Description
Optional[Dict[str, Any]]

Dictionary of hyperparameters specific to the chosen distance method for the data type.

get_supported_methods

get_supported_methods(data_type: str) -> List[str]

Get list of supported methods for a specific data type.

Parameters:

Name Type Description Default
data_type str

Type of data ("molecule", "protein", "metadata")

required

Returns:

Type Description
List[str]

List of supported method names for the data type

otdd_distance

otdd_distance() -> Dict[str, Dict[str, float]]

Compute Optimal Transport Dataset Distance between molecule datasets.

This method uses the OTDD implementation to compute distances between molecule datasets, which takes into account both the feature space and label space of the datasets.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing OTDD distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.

euclidean_distance

euclidean_distance(
    featurizer_name: str = "ecfp",
) -> Dict[str, Dict[str, float]]

Compute Euclidean distance between molecule datasets.

This method computes the dataset-level Euclidean distance by comparing the prototypes of the datasets.

Parameters:

Name Type Description Default
featurizer_name str

Name of the molecular featurizer to use (e.g., "ecfp", "maccs", "desc2D")

'ecfp'

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing Euclidean distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.

Raises:

Type Description
DistanceComputationError

If feature computation fails
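Comparing prototypes can be illustrated with plain Python. The sketch assumes a prototype is the mean feature vector of a dataset, which this page does not define, and it uses toy 2-D vectors in place of real featurizer output. The helper names are illustrative only.

```python
import math
from typing import Dict, List, Sequence


def prototype(features: Sequence[Sequence[float]]) -> List[float]:
    """Mean feature vector of a dataset (assumed definition of a prototype)."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy feature matrices: one row per molecule.
source_features = {"source_1": [[0.0, 0.0], [2.0, 0.0]]}  # prototype [1.0, 0.0]
target_features = {"target_a": [[1.0, 3.0], [1.0, 5.0]]}  # prototype [1.0, 4.0]

# Build the documented layout: {target_id: {source_id: distance}}.
distances: Dict[str, Dict[str, float]] = {
    t_id: {
        s_id: euclidean(prototype(t_feats), prototype(s_feats))
        for s_id, s_feats in source_features.items()
    }
    for t_id, t_feats in target_features.items()
}
print(distances)  # {'target_a': {'source_1': 4.0}}
```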

cosine_distance

cosine_distance(featurizer_name: str = 'ecfp') -> Dict[str, Dict[str, float]]

Compute cosine distance between molecule datasets.

This method computes the dataset-level cosine distance by comparing the prototypes of the datasets.

Parameters:

Name Type Description Default
featurizer_name str

Name of the molecular featurizer to use (e.g., "ecfp", "maccs", "desc2D")

'ecfp'

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing cosine distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.
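For reference, cosine distance is commonly defined as one minus the cosine similarity of two vectors. The sketch below applies that convention to two prototype vectors. Whether themap uses exactly this convention is not stated here, so treat it as an assumption.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Return 1 - cos(u, v); raises ValueError for zero-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("Cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


# Orthogonal prototypes are maximally distant among non-negative vectors.
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0
# Parallel prototypes have distance 0 regardless of their magnitude.
print(cosine_distance([1.0, 0.0], [2.0, 0.0]))  # 0.0
```

Unlike the Euclidean variant, this measure ignores vector magnitude, so two datasets whose prototypes point in the same direction are treated as identical.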

get_distance

get_distance(featurizer_name: str = 'ecfp') -> Dict[str, Dict[str, float]]

Compute the distance between molecule datasets using the specified method.

Parameters:

Name Type Description Default
featurizer_name str

Name of the molecular featurizer to use (e.g., "ecfp", "maccs", "desc2D")

'ecfp'

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing the distance matrix between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.

load_distance

load_distance(path: str) -> None

Load pre-computed distances from a file.

Parameters:

Name Type Description Default
path str

Path to the file containing pre-computed distances

required

Raises:

Type Description
FileNotFoundError

If the file doesn't exist

ValueError

If the file format is invalid

to_pandas

to_pandas() -> pd.DataFrame

Convert the distance matrix to a pandas DataFrame.

Returns:

Type Description
DataFrame

DataFrame with source task IDs as index and target task IDs as columns, containing the distance values.
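This orientation (sources as rows, targets as columns) is what pandas produces when a dict of dicts is passed to `pd.DataFrame`, since outer keys become columns and inner keys become the index. The pandas-free sketch below lays out the same table from made-up distances; `to_rows` is an illustrative helper, not part of themap.

```python
from typing import Dict, List, Tuple


def to_rows(
    dist: Dict[str, Dict[str, float]],
) -> Tuple[List[str], List[Tuple[str, List[float]]]]:
    """Return (column labels, rows), with one row per source task ID."""
    columns = list(dist)  # target task IDs, in insertion order
    source_ids = sorted({s for by_source in dist.values() for s in by_source})
    rows = [
        # Missing source/target pairs are filled with NaN.
        (s, [dist[t].get(s, float("nan")) for t in columns])
        for s in source_ids
    ]
    return columns, rows


distances = {
    "target_a": {"source_1": 0.75, "source_2": 0.20},
    "target_b": {"source_1": 0.10, "source_2": 0.90},
}
columns, rows = to_rows(distances)
print(columns)
for source_id, values in rows:
    print(source_id, values)
```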

__repr__

__repr__() -> str

Return a string representation of the MoleculeDatasetDistance instance.

Returns:

Type Description
str

String containing the class name and initialization parameters.

ProteinDatasetDistance

themap.distance.protein_distance.ProteinDatasetDistance

Bases: AbstractTasksDistance

Calculate distances between protein datasets using various methods.

This class implements distance computation between protein datasets using:

  • Euclidean distance
  • Cosine distance

The class supports both single dataset comparisons and batch comparisons across multiple datasets.

Parameters:

Name Type Description Default
tasks Optional[Tasks]

Tasks collection containing protein datasets for distance computation

None
method Optional[str]

Distance computation method ('euclidean' or 'cosine')

None

Raises:

Type Description
ValueError

If the specified method is not supported for protein datasets

get_hopts

get_hopts(data_type: str = 'protein') -> Optional[Dict[str, Any]]

Get hyperparameters for the distance computation method.

Parameters:

Name Type Description Default
data_type str

Type of data ("molecule", "protein", "metadata")

'protein'

Returns:

Type Description
Optional[Dict[str, Any]]

Dictionary of hyperparameters specific to the chosen distance method for the data type.

get_supported_methods

get_supported_methods(data_type: str) -> List[str]

Get list of supported methods for a specific data type.

Parameters:

Name Type Description Default
data_type str

Type of data ("molecule", "protein", "metadata")

required

Returns:

Type Description
List[str]

List of supported method names for the data type

euclidean_distance

euclidean_distance() -> Dict[str, Dict[str, float]]

Compute Euclidean distance between protein datasets.

This method calculates the pairwise Euclidean distances between protein feature vectors in the datasets.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing Euclidean distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.

cosine_distance

cosine_distance() -> Dict[str, Dict[str, float]]

Compute cosine distance between protein datasets.

This method calculates the pairwise cosine distances between protein feature vectors in the datasets.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing cosine distances between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.

sequence_identity_distance

sequence_identity_distance() -> Dict[str, Dict[str, float]]

Compute sequence identity-based distance between protein datasets.

This method calculates distances based on protein sequence identity.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing sequence identity-based distances between datasets.

Raises:

Type Description
NotImplementedError

This method is not yet implemented

get_distance

get_distance() -> Dict[str, Dict[str, float]]

Compute the distance between protein datasets using the specified method.

Returns:

Type Description
Dict[str, Dict[str, float]]

Dictionary containing the distance matrix between source and target datasets. The outer dictionary is keyed by target task IDs, and the inner dictionary is keyed by source task IDs with distance values.

load_distance

load_distance(path: str) -> None

Load pre-computed distances from a file.

Parameters:

Name Type Description Default
path str

Path to the file containing pre-computed distances

required

Raises:

Type Description
FileNotFoundError

If the file doesn't exist

ValueError

If the file format is invalid

to_pandas

to_pandas() -> pd.DataFrame

Convert the distance matrix to a pandas DataFrame.

Returns:

Type Description
DataFrame

DataFrame with source task IDs as index and target task IDs as columns, containing the distance values.

__repr__

__repr__() -> str

Return a string representation of the ProteinDatasetDistance instance.

Returns:

Type Description
str

String containing the class name and initialization parameters.

TaskDistance

themap.distance.task_distance.TaskDistance (module attribute)

TaskDistance = TaskDistanceCalculator

TaskDistance is a module-level alias for TaskDistanceCalculator.

Utility Functions

Validation Functions

themap.distance.base._validate_and_extract_task_id

_validate_and_extract_task_id(task_name: str) -> str

Safely extract task ID from task name with validation.

Parameters:

Name Type Description Default
task_name str

Task name in format 'fold_task_id'

required

Returns:

Type Description
str

Extracted task ID

Raises:

Type Description
DataValidationError

If task name format is invalid
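The real implementation is not shown on this page. As a rough sketch only, the extraction could look like the following, assuming the fold and the task ID are separated by the first underscore. `extract_task_id` and the local `DataValidationError` class are stand-ins for illustration.

```python
class DataValidationError(Exception):
    """Local stand-in for themap.distance.exceptions.DataValidationError."""


def extract_task_id(task_name: str) -> str:
    """Return the task ID from a name of the form '<fold>_<task_id>'."""
    if not isinstance(task_name, str):
        raise DataValidationError(
            f"Task name must be a string, got {type(task_name).__name__}"
        )
    # Assumption: the fold and the task ID are separated by the first underscore.
    fold, separator, task_id = task_name.partition("_")
    if not separator or not fold or not task_id:
        raise DataValidationError(f"Invalid task name format: {task_name!r}")
    return task_id


print(extract_task_id("train_CHEMBL1234"))  # CHEMBL1234
```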

Exception Classes

DistanceComputationError

themap.distance.exceptions.DistanceComputationError

Bases: Exception

Custom exception for distance computation errors.

DataValidationError

themap.distance.exceptions.DataValidationError

Bases: Exception

Custom exception for data validation errors.

Constants

Supported Methods

# Available distance methods for molecule datasets
MOLECULE_DISTANCE_METHODS = ["otdd", "euclidean", "cosine"]

# Available distance methods for protein datasets
PROTEIN_DISTANCE_METHODS = ["euclidean", "cosine"]
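These lists make it straightforward to reject an unsupported method early. The sketch below copies the two constants from above and adds a hypothetical `check_method` helper. The `ValueError` mirrors what the class constructors are documented to raise, but the message text is invented.

```python
from typing import List

# Copied from the constants above.
MOLECULE_DISTANCE_METHODS = ["otdd", "euclidean", "cosine"]
PROTEIN_DISTANCE_METHODS = ["euclidean", "cosine"]


def check_method(method: str, supported: List[str], kind: str) -> str:
    """Return `method` unchanged if supported, otherwise raise ValueError."""
    if method not in supported:
        raise ValueError(
            f"Method {method!r} is not supported for {kind} datasets; "
            f"choose one of {supported}"
        )
    return method


print(check_method("otdd", MOLECULE_DISTANCE_METHODS, "molecule"))  # otdd

try:
    # OTDD is listed for molecules only, so this is rejected.
    check_method("otdd", PROTEIN_DISTANCE_METHODS, "protein")
except ValueError as err:
    print(err)
```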

Usage Examples

Basic Molecule Distance Computation

from themap.data.tasks import Tasks
from themap.distance import MoleculeDatasetDistance

# Load tasks from directory
tasks = Tasks.from_directory(
    directory="datasets/",
    task_list_file="datasets/sample_tasks_list.json",
    load_molecules=True,
    load_proteins=False
)

# Compute molecule distances using OTDD
# (`method` is the documented constructor parameter)
mol_distance = MoleculeDatasetDistance(
    tasks=tasks,
    method="otdd"
)

distances = mol_distance.get_distance()
print(distances)
# {'target_task': {'source_task': 0.75, ...}}

Protein Distance Computation

from themap.distance import ProteinDatasetDistance

# Compute protein distances using the Euclidean method
# (tasks must be loaded with load_proteins=True for protein data to be available)
prot_distance = ProteinDatasetDistance(
    tasks=tasks,
    method="euclidean"
)

distances = prot_distance.get_distance()

Combined Task Distance

from themap.distance import TaskDistance

# Compute combined distances from multiple modalities
task_distance = TaskDistance(
    tasks=tasks,
    molecule_method="cosine",
    protein_method="euclidean"
)

# Compute all distance types
all_distances = task_distance.compute_all_distances(
    combination_strategy="weighted_average",
    molecule_weight=0.7,
    protein_weight=0.3
)

# Access specific distance types
molecule_distances = all_distances["molecule"]
protein_distances = all_distances["protein"]
combined_distances = all_distances["combined"]
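One plausible reading of the "weighted_average" strategy is a per-pair weighted mean of the two distance dictionaries. The sketch below implements that reading on made-up values. The normalization by the weight sum and the helper name `combine_weighted` are assumptions, not the library's confirmed behavior.

```python
from typing import Dict

Distances = Dict[str, Dict[str, float]]


def combine_weighted(
    molecule: Distances,
    protein: Distances,
    molecule_weight: float = 0.7,
    protein_weight: float = 0.3,
) -> Distances:
    """Weighted mean of two {target_id: {source_id: distance}} dictionaries."""
    total = molecule_weight + protein_weight
    if total <= 0:
        raise ValueError("Weights must sum to a positive value")
    combined: Distances = {}
    for target_id, by_source in molecule.items():
        combined[target_id] = {}
        for source_id, mol_d in by_source.items():
            prot_d = protein[target_id][source_id]
            # Assumption: weights are normalized so they need not sum to 1.
            combined[target_id][source_id] = (
                molecule_weight * mol_d + protein_weight * prot_d
            ) / total
    return combined


mol = {"target_a": {"source_1": 0.2}}
prot = {"target_a": {"source_1": 0.8}}
combined = combine_weighted(mol, prot)
print(round(combined["target_a"]["source_1"], 6))  # 0.38
```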

Working with External Distance Matrices

import numpy as np

# Load pre-computed distances
task_distance = TaskDistance.load_ext_chem_distance("path/to/chemical_distances.pkl")

# Or initialize with external matrices
external_chem = np.random.rand(10, 8)  # 10 source, 8 target tasks
task_distance = TaskDistance(
    tasks=None,
    source_task_ids=["task1", "task2", ...],
    target_task_ids=["test1", "test2", ...],
    external_chemical_space=external_chem
)

# Convert to pandas for analysis
df = task_distance.to_pandas("external_chemical")

Error Handling

from themap.distance import DistanceComputationError, DataValidationError

try:
    # This might fail if OTDD dependencies are missing
    distances = mol_distance.otdd_distance()
except ImportError as e:
    print(f"OTDD not available: {e}")
    # Fall back to euclidean distance
    distances = mol_distance.euclidean_distance()
except DistanceComputationError as e:
    print(f"Distance computation failed: {e}")
except DataValidationError as e:
    print(f"Data validation failed: {e}")

Performance Considerations

Memory Usage

  • OTDD: Most memory-intensive, especially for large datasets
  • Euclidean/Cosine: More memory-efficient, suitable for large-scale computations
  • External matrices: Memory usage depends on matrix size

Computational Complexity

  • OTDD: O(n²m²), where n and m are the sizes of the two datasets being compared
  • Euclidean/Cosine: O(nm) for feature extraction plus O(kl) for the distance matrix, where k and l are the numbers of tasks on each side
  • Combined distances: Sum of individual method complexities

Optimization Tips

# 1. Cap the OTDD sample count with "maxsamples"
hopts = {"maxsamples": 500}  # Reduce for faster computation

# 2. Cache features for repeated computations
tasks.save_task_features_to_file("cached_features.pkl")
cached_features = Tasks.load_task_features_from_file("cached_features.pkl")

# 3. Choose the distance method based on data size
# (num_molecules is the dataset size, computed by the caller)
if num_molecules > 10000:
    method = "euclidean"  # Faster for large datasets
else:
    method = "otdd"       # More accurate for smaller datasets
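To see why a smaller "maxsamples" speeds things up, consider capping each dataset by random subsampling before the distance is computed. The sketch below assumes that "maxsamples" limits the number of points drawn per dataset. The `subsample` helper and its `seed` argument are illustrative and not part of themap.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(items: Sequence[T], maxsamples: int, seed: int = 0) -> List[T]:
    """Return at most `maxsamples` items, drawn without replacement."""
    if maxsamples <= 0:
        raise ValueError("maxsamples must be positive")
    if len(items) <= maxsamples:
        return list(items)  # nothing to cut
    # A fixed seed keeps the subsample reproducible between runs.
    return random.Random(seed).sample(list(items), maxsamples)


molecules = [f"mol_{i}" for i in range(2000)]
reduced = subsample(molecules, maxsamples=500)
print(len(reduced))  # 500
```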

Configuration

Distance Method Configuration

Configuration files for distance methods are stored in themap/models/distance_configures/:

// otdd.json
{
    "method": "otdd",
    "maxsamples": 1000,
    "device": "auto",
    "parallel": true
}
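A configuration like the one above can be read and selectively overridden with the standard library. In this sketch the JSON is embedded as a string (without the `// otdd.json` label, since standard JSON has no comments). `load_config` is a hypothetical helper, and rejecting unknown keys is an assumed policy.

```python
import json
from typing import Any, Dict

DEFAULT_OTDD_JSON = """
{
    "method": "otdd",
    "maxsamples": 1000,
    "device": "auto",
    "parallel": true
}
"""


def load_config(raw_json: str, **overrides: Any) -> Dict[str, Any]:
    """Parse a JSON config and apply keyword overrides to known keys."""
    config = json.loads(raw_json)
    unknown = set(overrides) - set(config)
    if unknown:
        # Assumption: overriding a key that is not in the file is an error.
        raise KeyError(f"Unknown configuration keys: {sorted(unknown)}")
    config.update(overrides)
    return config


config = load_config(DEFAULT_OTDD_JSON, maxsamples=500, device="cpu")
print(config)
# {'method': 'otdd', 'maxsamples': 500, 'device': 'cpu', 'parallel': True}
```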

Custom Configuration

from themap.distance import MoleculeDatasetDistance
from themap.utils.distance_utils import get_configure

# Get default configuration
config = get_configure("otdd")

# Modify configuration
config["maxsamples"] = 500
config["device"] = "cpu"

# Use in distance computation
mol_distance = MoleculeDatasetDistance(tasks=tasks, method="otdd")
# Configuration is automatically loaded and can be overridden