Examples¶
This section provides practical, runnable examples for common THEMAP use cases.
Quick Start¶
One-liner Distance Computation¶
from themap import quick_distance

results = quick_distance(
    data_dir="datasets",           # Directory with train/ and test/ folders
    output_dir="output",           # Where to save results
    molecule_featurizer="ecfp",    # Fingerprint type
    molecule_method="euclidean",   # Distance metric
)
# Results saved to output/molecule_distances.csv
Using a Config File¶
Example config.yaml:
data:
  directory: "datasets"
molecule:
  enabled: true
  featurizer: "ecfp"
  method: "euclidean"
output:
  directory: "output"
  format: "csv"
  save_features: true
compute:
  n_jobs: 8
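Before starting a long run, it can help to confirm that the config file parses and contains the sections you expect. The following is a minimal sketch that assumes PyYAML is installed; the helper `missing_sections` and the list of expected sections are hypothetical and only mirror the example config, since which sections THEMAP itself requires is not stated here.

```python
import yaml  # PyYAML, assumed to be installed

# Same content as the example config.yaml, embedded so the sketch is self-contained.
CONFIG_TEXT = """
data:
  directory: "datasets"
molecule:
  enabled: true
  featurizer: "ecfp"
  method: "euclidean"
output:
  directory: "output"
  format: "csv"
  save_features: true
compute:
  n_jobs: 8
"""

# Sections that appear in the example; treating them all as required
# is an assumption made for this sketch.
EXPECTED_SECTIONS = ("data", "molecule", "output", "compute")


def missing_sections(config: dict) -> list:
    """Return the expected top-level sections that are absent from the config."""
    return [name for name in EXPECTED_SECTIONS if name not in config]


config = yaml.safe_load(CONFIG_TEXT)
print("Missing sections:", missing_sections(config))
print("Molecule settings:", config["molecule"])
```

To check a file on disk instead, pass an open file handle for `config.yaml` to `yaml.safe_load`.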
Full Programmatic Control¶
from pathlib import Path

from themap import Pipeline, PipelineConfig
from themap.config import DataConfig, MoleculeDistanceConfig, OutputConfig

config = PipelineConfig(
    data=DataConfig(directory=Path("datasets")),
    molecule=MoleculeDistanceConfig(
        enabled=True,
        featurizer="ecfp",
        method="euclidean",
    ),
    output=OutputConfig(
        directory=Path("output"),
        format="csv",
    ),
)

pipeline = Pipeline(config)
results = pipeline.run()
Analyzing Results¶
Load and Explore Distance Matrix¶
import pandas as pd
# Load computed distances
distances = pd.read_csv("output/molecule_distances.csv", index_col=0)
print(f"Shape: {distances.shape}")
print(f"Sources: {list(distances.index)}")
print(f"Targets: {list(distances.columns)}")
Find Closest Source for Each Target¶
import pandas as pd

distances = pd.read_csv("output/molecule_distances.csv", index_col=0)

for target in distances.columns:
    closest = distances[target].idxmin()
    dist = distances[target].min()
    print(f"{target} <- {closest} (distance: {dist:.4f})")
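The same lookup can be done without an explicit loop, which is convenient when you want the result as a table. This sketch uses a small made-up matrix in place of `output/molecule_distances.csv`; the task names and values are illustrative only.

```python
import pandas as pd

# Toy distance matrix: rows are source tasks, columns are target tasks.
# Names and values are made up for illustration.
distances = pd.DataFrame(
    {"target_A": [0.9, 0.2, 0.5], "target_B": [0.1, 0.7, 0.4]},
    index=["source_1", "source_2", "source_3"],
)

# idxmin() and min() work column-wise by default, so each target
# is paired with its closest source and the corresponding distance.
closest = pd.DataFrame(
    {"closest_source": distances.idxmin(), "distance": distances.min()}
)
print(closest.sort_values("distance"))
```

Sorting by `distance` puts the targets with the nearest available source first.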
Estimate Task Hardness¶
import pandas as pd

distances = pd.read_csv("output/molecule_distances.csv", index_col=0)

# Task hardness = average distance to k-nearest sources
k = 3
for target in distances.columns:
    hardness = distances[target].nsmallest(k).mean()
    print(f"Hardness for {target}: {hardness:.4f}")
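To rank targets rather than print them one by one, the hardness estimate can be wrapped in a function that returns a sorted Series. This is a sketch on made-up data; the function name `task_hardness` is hypothetical and not part of THEMAP.

```python
import pandas as pd


def task_hardness(distances: pd.DataFrame, k: int = 3) -> pd.Series:
    """Mean distance from each target (column) to its k nearest sources (rows).

    If fewer than k sources exist, nsmallest simply uses all of them.
    """
    per_target = distances.apply(lambda column: column.nsmallest(k).mean())
    return per_target.sort_values(ascending=False)


# Toy matrix: four sources (rows), two targets (columns); values are illustrative.
distances = pd.DataFrame(
    {"target_A": [0.9, 0.2, 0.5, 0.3], "target_B": [0.1, 0.7, 0.4, 0.8]},
    index=["source_1", "source_2", "source_3", "source_4"],
)

hardness = task_hardness(distances, k=3)
print(hardness)  # hardest target first
```

Here `target_B` ranks as harder: it has one very close source, but its three nearest sources are farther away on average than those of `target_A`.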
Comparing Distance Methods¶
import time

from themap import quick_distance

methods = ["euclidean", "cosine"]

for method in methods:
    start = time.time()
    results = quick_distance(
        data_dir="datasets",
        molecule_method=method,
    )
    elapsed = time.time() - start
    print(f"{method}: {elapsed:.2f}s")
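Runtime is only one axis of comparison; the two metrics also measure different things. The sketch below computes both on a pair of plain vectors using the standard definitions. How THEMAP applies them to whole datasets is not shown here, so treat this purely as an illustration of the metrics, not of the library's internals.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = [1.0, 2.0, 2.0]
b = [2.0, 4.0, 4.0]  # same direction as a, twice the length

print(f"euclidean: {euclidean_distance(a, b):.4f}")
print(f"cosine:    {cosine_distance(a, b):.4f}")
```

Euclidean distance is sensitive to vector magnitude, while cosine distance depends only on direction, so the two methods can rank the same source tasks differently.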
Visualization¶
Distance Matrix Heatmap¶
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load distances
distances = pd.read_csv("output/molecule_distances.csv", index_col=0)

# Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(
    distances,
    annot=True,
    fmt='.3f',
    cmap='viridis',
    cbar_kws={'label': 'Distance'},
)
plt.title("Dataset Distances")
plt.xlabel("Target Tasks")
plt.ylabel("Source Tasks")
plt.tight_layout()
plt.savefig("distance_heatmap.png", dpi=300)
plt.show()
Data Format¶
Directory Structure¶
datasets/
├── train/                      # Source datasets
│   ├── CHEMBL123456.jsonl.gz
│   └── ...
└── test/                       # Target datasets
    ├── CHEMBL111111.jsonl.gz
    └── ...
JSONL.GZ File Format¶
Each file is a gzip-compressed JSON Lines file: one JSON object per line, with each line describing a molecule.
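A file in this format can be inspected with the standard library alone. The sketch below writes a tiny sample file and reads it back; the field names `SMILES` and `label` are placeholders chosen for illustration and are not a statement of THEMAP's actual schema, and `read_jsonl_gz` is a hypothetical helper.

```python
import gzip
import json
import os
import tempfile


def read_jsonl_gz(path):
    """Return one dict per non-empty line of a gzip-compressed JSON Lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Hypothetical records: these field names are placeholders, not THEMAP's schema.
sample = [
    {"SMILES": "CCO", "label": 1},
    {"SMILES": "c1ccccc1", "label": 0},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "EXAMPLE.jsonl.gz")
    # Write one JSON object per line, gzip-compressed.
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in sample:
            handle.write(json.dumps(record) + "\n")
    records = read_jsonl_gz(path)

print(len(records), "records; keys:", sorted(records[0]))
```

Pointing `read_jsonl_gz` at one of your own files under `datasets/train/` is a quick way to see which fields it actually contains.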
Command Line Usage¶
# Run quickstart example
python examples/quickstart.py --data datasets --featurizer ecfp --method euclidean
# With config file
python examples/quickstart.py --config config.yaml
Available Options¶
Featurizers¶
| Featurizer | Description |
|---|---|
| `ecfp` | Extended Connectivity Fingerprints (fast) |
| `maccs` | MACCS structural keys (fast) |
| `desc2D` | 2D molecular descriptors |
| `desc3D` | 3D molecular descriptors |
Distance Methods¶
| Method | Description |
|---|---|
| `euclidean` | Fast Euclidean distance |
| `cosine` | Cosine distance |
| `otdd` | Optimal Transport Dataset Distance (slow but accurate) |
For more detailed explanations, see our tutorials.