Performance Optimization¶
This tutorial covers best practices for optimizing THEMAP performance when working with large datasets.
Overview¶
THEMAP offers several strategies to optimize performance:
- Method selection: Choose appropriate distance metrics
- Featurizer selection: Choose fast vs. accurate featurizers
- Parallel processing: Use multiple cores
- Caching: Save features to disk for reuse
Distance Method Performance¶
Speed Comparison¶
| Method | Speed | Memory | Accuracy | Best For |
|---|---|---|---|---|
| Euclidean | Fast | Low | Good | Initial exploration, large datasets |
| Cosine | Fast | Low | Good | High-dimensional features |
| OTDD | Slow | High | Best | Detailed analysis, small datasets |
Choosing the Right Method¶
from themap import quick_distance
# For large datasets or initial exploration
results = quick_distance(
data_dir="datasets",
molecule_method="euclidean", # Fast
)
# For final analysis on small datasets
results = quick_distance(
data_dir="datasets",
molecule_method="otdd", # Most accurate but slow
)
Featurizer Performance¶
Speed Comparison¶
| Featurizer | Speed | Quality |
|---|---|---|
ecfp |
Fast | Good |
maccs |
Fast | Good |
desc2D |
Medium | Good |
desc3D |
Slow | Better |
# Fast featurizer for large datasets
results = quick_distance(
data_dir="datasets",
molecule_featurizer="ecfp", # Fast fingerprints
)
Parallel Processing¶
Use the n_jobs parameter:
from themap import quick_distance
results = quick_distance(
data_dir="datasets",
n_jobs=8, # Use 8 parallel workers
)
Caching Features¶
Save computed features to disk for reuse:
from themap import run_pipeline
# First run: computes and caches features
results = run_pipeline("config.yaml")
# Subsequent runs: loads cached features (faster)
results = run_pipeline("config.yaml")
Memory Management¶
For large datasets:
from themap import quick_distance
# Use euclidean instead of OTDD (much less memory)
results = quick_distance(
data_dir="datasets",
molecule_featurizer="ecfp",
molecule_method="euclidean",
)
Best Practices¶
Quick Optimization Checklist¶
- Start with fast methods: Use euclidean for exploration
- Use fast featurizers: ecfp or maccs for initial runs
- Enable caching: Set
save_features: true - Use parallel processing: Set
n_jobsto available cores - Reserve OTDD for final analysis: Only use on small datasets
Recommended Configuration¶
# config.yaml - optimized for performance
data:
directory: "datasets"
molecule:
enabled: true
featurizer: "ecfp" # Fast fingerprints
method: "euclidean" # Fast distance
output:
directory: "output"
format: "csv"
save_features: true # Cache for reuse
compute:
n_jobs: 8 # Parallel processing
device: "auto" # Use GPU if available
Benchmarking¶
Compare methods on your data:
import time
from themap import quick_distance
methods = ["euclidean", "cosine"]
for method in methods:
start = time.time()
results = quick_distance(
data_dir="datasets",
molecule_method=method,
)
elapsed = time.time() - start
print(f"{method}: {elapsed:.2f}s")
Next Steps¶
- Check out the examples
- Read the API documentation