THEMAP: Task Hardness Estimation for Molecular Activity Prediction

THEMAP is a Python library for drug discovery that estimates task hardness and computes transferability maps for bioactivity prediction tasks. It lets researchers and chemists measure the similarity between molecular datasets and use those measurements to choose transfer learning strategies.

Quick Start: Get up and running with THEMAP in minutes (Getting started)
Tutorials: Step-by-step guides for common workflows (View tutorials)
API Reference: Detailed documentation for all modules (API docs)
Examples: Ready-to-use code examples and scripts (Browse examples)

Key Features

Multi-Modal Distance Computation

Molecular datasets: OTDD, Euclidean, and Cosine distances
Protein sequences: ESM2-based embeddings and similarity metrics
Metadata integration: Assay descriptions and experimental conditions
Combined analysis: Multi-modal fusion strategies

🎯 Task Hardness Estimation

Transfer learning guidance: Identify similar tasks for knowledge transfer
Difficulty quantification: Estimate prediction task complexity
Resource optimization: Prioritize computational resources effectively
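
As a rough illustration of difficulty quantification, one simple proxy scores a target task by the mean distance to its k nearest source tasks: the farther away the closest sources are, the harder the task is presumed to be. The sketch below is plain Python over a `{target: {source: distance}}` dictionary. The function name `hardness_proxy`, the parameter `k`, and the task IDs and values are illustrative assumptions, not part of the THEMAP API.

```python
from statistics import mean


def hardness_proxy(distances, k=3):
    """Score each target by the mean distance to its k nearest source tasks."""
    scores = {}
    for target, per_source in distances.items():
        # Keep only the k smallest distances for this target
        nearest = sorted(per_source.values())[:k]
        scores[target] = mean(nearest)
    return scores


# Made-up distances, for illustration only
example = {
    "target_A": {"src_1": 2.0, "src_2": 4.0, "src_3": 9.0},
    "target_B": {"src_1": 6.0, "src_2": 8.0, "src_3": 7.0},
}
scores = hardness_proxy(example, k=2)
print(scores)
```

A higher score suggests a task with no close neighbours among the source tasks. A target with no source distances at all would raise an error in this sketch and would need separate handling.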

⚡ Production-Ready Framework

Scalable architecture: Handle large-scale dataset comparisons
Caching system: Efficient feature storage and reuse
Error handling: Robust validation and error recovery
GPU acceleration: CUDA support for intensive computations

🔬 Unified Task System

Integrated data management: Molecules, proteins, and metadata in one framework
Flexible organization: Train/validation/test fold management
Feature extraction: Unified API for multi-modal featurization

Installation

Basic Installation

# Clone repository
git clone https://github.com/HFooladi/THEMAP.git
cd THEMAP

# Create conda environment
conda env create -f environment.yml
conda activate themap

# Install THEMAP
pip install --no-deps -e .

With All Features

# Install with all optional dependencies
pip install -e ".[all]"

Development Setup

# Install development dependencies
pip install -e ".[dev,test]"

# Run tests
python run_tests.py
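
After installing, a quick way to confirm that the package is visible to the active interpreter is to look it up on the import path. This is a minimal sketch using only the standard library. The helper name `is_installed` is hypothetical, and `themap` is the import name used in the examples on this page.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the package can be located on the import path."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("themap"):
    print("themap is importable")
else:
    print("themap not found; check that the conda environment is active")
```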

Quick Examples

Molecular Dataset Distance

Compute distances between molecular datasets to understand chemical space similarity:

from themap.data.tasks import Tasks
from themap.distance import MoleculeDatasetDistance

# Load tasks from directory structure
tasks = Tasks.from_directory(
    directory="datasets/",
    task_list_file="datasets/sample_tasks_list.json",
    load_molecules=True,
    load_proteins=True
)

# Compute molecular distances using OTDD
mol_distance = MoleculeDatasetDistance(
    tasks=tasks,
    molecule_method="otdd"
)

distances = mol_distance.get_distance()
print(distances)
# {'CHEMBL2219358': {'CHEMBL1023359': 7.074}}
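
Because the result is a nested dictionary keyed by target task and then by source task, ranking candidate source tasks takes only a sort. The following is a small sketch rather than a THEMAP function: `nearest_sources` is a hypothetical helper, and the second source task and its distance are made up for illustration.

```python
def nearest_sources(distances, target, top_k=1):
    """Return the top_k (source, distance) pairs closest to the target task."""
    ranked = sorted(distances[target].items(), key=lambda item: item[1])
    return ranked[:top_k]


# Same shape as the printed output; SOURCE_TASK_X and 9.5 are invented
example_distances = {
    "CHEMBL2219358": {"CHEMBL1023359": 7.074, "SOURCE_TASK_X": 9.5},
}
best = nearest_sources(example_distances, "CHEMBL2219358")
print(best)
```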

Multi-Modal Task Distance

Combine molecular, protein, and metadata information for comprehensive task comparison:

from themap.distance import TaskDistance

# Compute combined distances from multiple modalities
task_distance = TaskDistance(
    tasks=tasks,
    molecule_method="cosine",
    protein_method="euclidean"
)

# Get all distance types
all_distances = task_distance.compute_all_distances(
    combination_strategy="weighted_average",
    molecule_weight=0.7,
    protein_weight=0.3
)

print(f"Molecule distances: {len(all_distances['molecule'])} tasks")
print(f"Protein distances: {len(all_distances['protein'])} tasks")
print(f"Combined distances: {len(all_distances['combined'])} tasks")
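
To make the `weighted_average` strategy concrete, the sketch below combines two distance dictionaries pair by pair in plain Python. It is an illustration of the idea, not THEMAP's implementation: the function `weighted_average` and the task IDs are hypothetical, and it assumes both modalities are already on comparable scales (whether the library normalizes distances first is not stated here).

```python
def weighted_average(molecule, protein, molecule_weight=0.7, protein_weight=0.3):
    """Combine two {target: {source: distance}} dicts pair by pair."""
    total = molecule_weight + protein_weight
    combined = {}
    for target, per_source in molecule.items():
        combined[target] = {}
        for source, mol_d in per_source.items():
            # Skip pairs that have no protein distance
            if source not in protein.get(target, {}):
                continue
            prot_d = protein[target][source]
            combined[target][source] = (
                molecule_weight * mol_d + protein_weight * prot_d
            ) / total
    return combined


# Toy values for illustration only
mol = {"T1": {"S1": 0.25, "S2": 0.5}}
prot = {"T1": {"S1": 0.75}}
combined = weighted_average(mol, prot, molecule_weight=0.5, protein_weight=0.5)
print(combined)
```

Dividing by the sum of the weights keeps the result on the same scale as the inputs even when the weights do not add up to 1.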

Protein Similarity Analysis

Analyze protein similarity using ESM2 sequence embeddings:

from themap.distance import ProteinDatasetDistance

# Compute protein distances using ESM2 embeddings
prot_distance = ProteinDatasetDistance(
    tasks=tasks,
    protein_method="euclidean"
)

distances = prot_distance.get_distance()
# Organized as {target_task: {source_task: distance, ...}, ...}
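
The nested output layout can be reproduced by hand for intuition. The sketch below assumes each task's protein is represented by one fixed-length embedding vector and uses toy 3-dimensional vectors in place of real ESM2 embeddings. The helper `pairwise_euclidean` and the task IDs are hypothetical.

```python
import math


def pairwise_euclidean(targets, sources):
    """Build {target: {source: distance}} from two {task_id: vector} dicts."""
    return {
        t_id: {s_id: math.dist(t_vec, s_vec) for s_id, s_vec in sources.items()}
        for t_id, t_vec in targets.items()
    }


# Toy vectors standing in for real protein embeddings
targets = {"target_A": [0.0, 0.0, 0.0]}
sources = {"src_1": [3.0, 4.0, 0.0], "src_2": [0.0, 0.0, 1.0]}
result = pairwise_euclidean(targets, sources)
print(result)
```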

Use Cases

Drug Discovery Workflows

Target identification: Find similar protein targets for drug repurposing
Chemical space analysis: Understand molecular diversity across datasets
Assay development: Identify related bioactivity assays

Transfer Learning Applications

Source task selection: Choose optimal training data for new targets
Model adaptation: Quantify domain shift between datasets
Performance prediction: Estimate model performance on new tasks

Computational Biology

Protein function prediction: Leverage sequence similarity for annotation
Chemical-protein interaction: Model molecular-target relationships
Multi-omics integration: Combine molecular and protein data

Performance

The available distance methods trade accuracy against speed and memory use:

Method	Speed	Memory	Accuracy	Best For
OTDD	Slower	High	Highest	Small-medium datasets
Euclidean	Fast	Low	Good	Large datasets
Cosine	Fast	Low	Good	High-dimensional features
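
The difference between the two fast methods is easiest to see on a pair of vectors that point in the same direction but differ in length. Euclidean distance reports them as far apart, while cosine distance, which depends only on direction, reports them as essentially identical. This is a standard-library sketch with hypothetical helper names, not THEMAP code.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.dist(a, b)


def cosine_distance(a, b):
    """1 minus cosine similarity; depends on direction only, not on length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # same direction as u, twice the length
d_euclidean = euclidean(u, v)
d_cosine = cosine_distance(u, v)
print(d_euclidean, d_cosine)
```

Note that `cosine_distance` divides by the vector norms, so an all-zero feature vector would need to be handled separately.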

Scalability Features

Parallel processing: Multi-core distance computations
Memory management: Efficient handling of large datasets
Caching system: Reuse expensive feature computations
Batch processing: Handle thousands of dataset comparisons
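
The idea behind feature caching is that each input is featurized once and looked up afterwards. The sketch below shows this with an in-memory cache from the standard library. The `featurize` function is a stand-in that counts characters rather than computing a real molecular fingerprint, and nothing here reflects how THEMAP's own cache is implemented or stored.

```python
from functools import lru_cache

calls = 0  # counts how often the expensive function body actually runs


@lru_cache(maxsize=None)
def featurize(smiles):
    """Stand-in featurizer: string length and count of the character 'C'."""
    global calls
    calls += 1
    return (len(smiles), smiles.count("C"))


# The repeated "CCO" is served from the cache instead of being recomputed
for s in ["CCO", "CCN", "CCO"]:
    featurize(s)
print(calls)
```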

Citation

If you use THEMAP in your research, please cite:

@article{fooladi2024quantifying,
  title={Quantifying the hardness of bioactivity prediction tasks for transfer learning},
  author={Fooladi, Hosein and Hirte, Steffen and Kirchmair, Johannes},
  journal={Journal of Chemical Information and Modeling},
  volume={64},
  number={10},
  pages={4031-4046},
  year={2024},
  publisher={ACS Publications}
}

Community and Support

Get Help

Documentation: Comprehensive guides and API reference
GitHub Issues: Bug reports and feature requests
Discussions: Community Q&A and best practices

Contributing

We welcome contributions! See our contribution guidelines for:

Code contributions
Documentation improvements
Bug reports and feature requests
Example workflows and tutorials

License

THEMAP is released under the MIT License. See LICENSE for details.

What's Next?

New to THEMAP? Start with our comprehensive getting started guide (Getting started)
Want to compute distances? Learn about all available distance computation methods (Distance computation)
Working with real data? Understand the unified task system for multi-modal data (Task system)
Need inspiration? Browse our collection of examples and use cases (Examples)