alinemol.splitters
MolecularLogPSplit
Bases: BaseShuffleSplit
Split a molecular dataset by sorting molecules according to their LogP values.
This splitter is designed for chemical domain shift experiments, where you want to evaluate how well models generalize to molecules with different physical properties than those they were trained on. LogP (octanol-water partition coefficient) is a measure of lipophilicity, which affects molecular solubility, permeability, and binding properties.
The splitter works by:
1. Calculating LogP values for all molecules
2. Sorting molecules by their LogP values
3. Splitting the sorted list according to the train/test size parameters
When generalize_to_larger=True (default), the training set contains molecules with lower LogP values, and the test set contains those with higher LogP values. This mimics the real-world scenario of testing on molecules with properties outside the training distribution.
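The sort-then-cut logic described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name `logp_sorted_split`, the rounding of a float `test_size` with `math.ceil`, and the LogP values themselves are assumptions made for the example.

```python
import math


def logp_sorted_split(logp_values, test_size=0.4, generalize_to_larger=True):
    """Return (train_idx, test_idx) after ordering molecules by LogP."""
    n = len(logp_values)
    # Assumption: a float test_size is a fraction that is rounded up.
    n_test = math.ceil(test_size * n) if isinstance(test_size, float) else test_size
    # Indices ordered from lowest to highest LogP.
    order = sorted(range(n), key=lambda i: logp_values[i])
    if not generalize_to_larger:
        # Train on high-LogP molecules, test on low-LogP ones.
        order.reverse()
    n_train = n - n_test
    return order[:n_train], order[n_train:]


# Illustrative LogP values (not computed from real molecules).
logp = [-0.3, 0.1, 1.7, -0.5, 3.0]
train_idx, test_idx = logp_sorted_split(logp, test_size=0.4)
rev_train_idx, rev_test_idx = logp_sorted_split(
    logp, test_size=0.4, generalize_to_larger=False
)
```

With `generalize_to_larger=True`, every test molecule has a higher LogP than every training molecule; with `False`, the relationship is inverted.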
Parameters:
Name | Type | Description | Default |
---|---|---|---|
generalize_to_larger | bool | If True, the train set has smaller LogP values and the test set has larger values. If False, the train set has larger LogP values and the test set has smaller values. | True |
n_splits | int | Number of re-shuffling and splitting iterations. Note that for this deterministic splitter, all iterations produce the same split. | 5 |
smiles | List[str], optional | List of SMILES strings, used if they are not provided directly as input to split() or _iter_indices(). Useful when the input X to those methods is not a list of SMILES strings but some other feature representation. | None |
test_size | float or int, optional | If float, represents the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. | None |
train_size | float or int, optional | If float, represents the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size. | None |
random_state | int or RandomState instance, optional | Controls the randomness of the training and testing indices produced. Note that this splitter is deterministic, so random_state only affects the implementation of _validate_shuffle_split. | None |
Examples:
>>> from alinemol.splitters import MolecularLogPSplit
>>> import numpy as np
>>> # Example with list of SMILES
>>> smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CCN", "CCCCCCC"]
>>> splitter = MolecularLogPSplit(generalize_to_larger=True, test_size=0.4)
>>> for train_idx, test_idx in splitter.split(smiles):
...     print(f"Training on: {[smiles[i] for i in train_idx]}")
...     print(f"Testing on: {[smiles[i] for i in test_idx]}")
...     break  # Just show the first split
>>> # Example with separate features and target
>>> X = np.random.randn(5, 10) # Some molecular features
>>> y = np.random.randint(0, 2, 5) # Binary target
>>> splitter = MolecularLogPSplit(smiles=smiles, test_size=0.4)
>>> for train_idx, test_idx in splitter.split(X, y):
...     X_train, X_test = X[train_idx], X[test_idx]
...     y_train, y_test = y[train_idx], y[test_idx]
...     break  # Just show the first split
Notes
- LogP values are calculated using the Crippen method implemented in datamol
- This splitter is deterministic - calling split() multiple times will produce the same split regardless of n_splits value
- Useful for testing model extrapolation to molecules whose physicochemical properties differ from those of the training set
StratifiedRandomSplit
Bases: object
Randomly reorder the dataset and then split it, ensuring that the label distribution in the training, validation and test sets matches that of the original dataset.
The dataset is permuted before it is split, so the split is stratified random.
train_val_test_split
staticmethod
train_val_test_split(
dataset: LabeledDataset,
frac_train: float = 0.8,
frac_val: float = 0.1,
frac_test: float = 0.1,
random_state: RandomStateType = None,
) -> DatasetSplit
Randomly permute the dataset and then stratified split it into three consecutive chunks for training, validation and test.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | LabeledDataset | We assume | required |
frac_train | float | Fraction of data to use for training. By default, we set this to be 0.8, i.e. 80% of the dataset is used for training. | 0.8 |
frac_val | float | Fraction of data to use for validation. By default, we set this to be 0.1, i.e. 10% of the dataset is used for validation. | 0.1 |
frac_test | float | Fraction of data to use for test. By default, we set this to be 0.1, i.e. 10% of the dataset is used for test. | 0.1 |
random_state | None, int or array_like, optional | Random seed used to initialize the pseudo-random number generator. Can be any integer between 0 and 2**32 - 1 inclusive, an array (or other sequence) of such integers, or None (the default). If seed is None, then RandomState will try to read data from /dev/urandom (or the Windows analogue) if available or seed from the clock otherwise. | None |
Returns:
Type | Description |
---|---|
DatasetSplit | list of length 3. Subsets for training, validation and test, which also have |
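As a rough illustration of what a stratified random three-way split does, the sketch below permutes the indices of each label separately and cuts every per-label list into train, validation and test chunks. It works on a plain list of labels rather than a LabeledDataset, and the function name `stratified_train_val_test` and the use of `round` for chunk sizes are assumptions for the example, not the library's code.

```python
import random
from collections import defaultdict


def stratified_train_val_test(labels, frac_train=0.8, frac_val=0.1, seed=None):
    """Split indices so each subset keeps the overall label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)  # random permutation within one label
        n_train = round(frac_train * len(indices))
        n_val = round(frac_val * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test


# 80 negatives and 20 positives: each subset should stay at 20% positives.
labels = [0] * 80 + [1] * 20
train, val, test = stratified_train_val_test(labels, seed=0)
```

Because the cut is made per label, the 80/10/10 proportions hold within each class as well as overall.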
k_fold_split
staticmethod
k_fold_split(
dataset: LabeledDataset,
k: int = 5,
random_state: RandomStateType = None,
log: bool = True,
) -> KFoldSplit
Performs stratified k-fold split of the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset | LabeledDataset | We assume | required |
k | int | Number of folds. Default is 5. | 5 |
random_state | None, int or array_like, optional | Random seed used to initialize the pseudo-random number generator. Can be any integer between 0 and 2**32 - 1 inclusive, an array (or other sequence) of such integers, or None (the default). | None |
log | bool | Whether to log information about the split. Default is True. | True |
Returns:
Type | Description |
---|---|
KFoldSplit | List of tuples. Each tuple contains (train_set, val_set), where train_set and val_set are Subset objects of the original dataset. |
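One simple way to picture a stratified k-fold split is to shuffle the indices of each label and deal them round-robin into k folds, then use each fold once as the validation set. The sketch below follows that idea on a plain list of labels; the function name `stratified_k_fold` and the round-robin assignment are illustrative assumptions and may differ from how the library builds its folds.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=None):
    """Return k (train_idx, val_idx) pairs with label-balanced folds."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this label across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    splits = []
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j in range(k) if j != i for idx in folds[j]]
        splits.append((train_idx, val_idx))
    return splits


labels = [0] * 40 + [1] * 10
splits = stratified_k_fold(labels, k=5, seed=0)
```

Each validation fold here contains 8 negatives and 2 positives, mirroring the 80/20 ratio of the full label list.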
UMAPSplit
Bases: GroupShuffleSplit
Group-based split that uses the UMAP clustering in the input space for splitting.
From the following papers:
1. "UMAP-based clustering split for rigorous evaluation of AI models for virtual screening on cancer cell lines" https://doi.org/10.26434/chemrxiv-2024-f1v2v-v2
2. "On the Best Way to Cluster NCI-60 Molecules" https://doi.org/10.3390/biom13030498
Parameters:
Name | Type | Description | Default |
---|---|---|---|
n_clusters | int | The number of clusters to use for clustering | 10 |
n_neighbors | int | The number of neighbors to use for the UMAP algorithm | 100 |
min_dist | float | The minimum distance between points in the UMAP embedding | 0.1 |
n_components | int | The number of components to use for the PCA algorithm | 2 |
umap_metric | Union[str, Callable] | The metric to use for the UMAP algorithm | 'jaccard' |
linkage | str | The linkage to use for the AgglomerativeClustering algorithm | 'ward' |
n_splits | int | The number of splits to use for the split | 5 |
test_size | Optional[Union[float, int]] | The size of the test set | None |
train_size | Optional[Union[float, int]] | The size of the train set | None |
random_state | Optional[Union[int, RandomState]] | The random state to use for the split | None |
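The default umap_metric is 'jaccard', a distance suited to binary fingerprints. For binary vectors the Jaccard distance is one minus the Tanimoto similarity. The sketch below computes it by hand for two short illustrative bit vectors; the function name `jaccard_distance` and the convention of returning 0.0 for two all-zero vectors are assumptions made here.

```python
def jaccard_distance(fp_a, fp_b):
    """Jaccard distance between two equal-length binary fingerprints."""
    on_a = {i for i, bit in enumerate(fp_a) if bit}
    on_b = {i for i, bit in enumerate(fp_b) if bit}
    union = on_a | on_b
    if not union:
        # Assumption: two empty fingerprints are treated as identical.
        return 0.0
    return 1.0 - len(on_a & on_b) / len(union)


# Illustrative 8-bit fingerprints (real ones are much longer).
fp1 = [1, 1, 0, 0, 1, 0, 0, 0]
fp2 = [1, 0, 0, 0, 1, 1, 0, 0]
distance = jaccard_distance(fp1, fp2)  # 2 shared bits out of 4 set bits overall
```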
Example
from alinemol.splitters import UMAPSplit

splitter = UMAPSplit(n_clusters=2, linkage="ward", n_neighbors=3, min_dist=0.1, n_components=2, n_splits=5)
smiles = ["c1ccccc1", "CCC", "CCCC(CCC)C(=O)O", "NC1CCCCC1N", "COc1cc(CNC(=O)CCCCC=CC(C)C)ccc1O", "Cc1cc(Br)c(O)c2ncccc12", "OCC(O)c1oc(O)c(O)c1O"]
for train_idx, test_idx in splitter.split(smiles):
    print(train_idx)
    print(test_idx)
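Because UMAPSplit is group-based, whole clusters end up on one side of the split. The sketch below shows that idea in isolation: given one cluster label per molecule, it shuffles the cluster IDs and sends a fraction of the clusters to the test set. The function name `group_shuffle_split` and the cluster labels are hypothetical, and treating `test_size` as a fraction of clusters (as scikit-learn's GroupShuffleSplit does for groups) is an assumption about how the class behaves.

```python
import math
import random


def group_shuffle_split(cluster_labels, test_size=0.2, seed=None):
    """Assign whole clusters to either the train or the test set."""
    rng = random.Random(seed)
    clusters = sorted(set(cluster_labels))
    rng.shuffle(clusters)
    # Assumption: test_size is a fraction of the clusters, rounded up.
    n_test_clusters = max(1, math.ceil(test_size * len(clusters)))
    test_clusters = set(clusters[:n_test_clusters])
    train_idx = [i for i, c in enumerate(cluster_labels) if c not in test_clusters]
    test_idx = [i for i, c in enumerate(cluster_labels) if c in test_clusters]
    return train_idx, test_idx


# Hypothetical cluster labels for ten molecules.
cluster_labels = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
train_idx, test_idx = group_shuffle_split(cluster_labels, test_size=0.25, seed=0)
```

No cluster contributes molecules to both sides, which is what makes the split harder than a random one.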
LoSplit
__init__
__init__(
threshold: float = 0.4,
min_cluster_size: int = 5,
max_clusters: int = 50,
std_threshold: float = 0.6,
)
A splitter that prepares data for training ML models for Lead Optimization or to guide molecular generative models. These models must be sensitive to minor modifications of molecules, and this splitter constructs a test that allows the evaluation of a model's ability to distinguish those modifications.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
threshold | float | ECFP4 1024-bit Tanimoto similarity threshold. Molecules more similar than this threshold are considered too similar and can be grouped together in one cluster. | 0.4 |
min_cluster_size | int | The minimum number of molecules per cluster. | 5 |
max_clusters | int | The maximum number of selected clusters. The remaining molecules go to the training set. This can be useful for limiting your test set to get more molecules in the train set. | 50 |
std_threshold | float | The lower bound of the acceptable standard deviation for a cluster's values. It should be greater than the measurement noise. For ChEMBL-like data, set it to 0.60 for logKi and 0.70 for logIC50. Set it lower if you have a high-quality dataset. | 0.6 |
For more information, see a tutorial in the docs and Steshin 2023, Lo-Hi: Practical ML Drug Discovery Benchmark.
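To make the roles of threshold, min_cluster_size and std_threshold more concrete, the sketch below checks the two conditions in isolation: whether two molecules are similar enough to share a cluster, and whether a candidate cluster is large and varied enough to be kept. It is a simplified illustration, not the Lo-Hi algorithm; fingerprints are represented as sets of on-bit indices, and the helper names and the use of the sample standard deviation are assumptions.

```python
import statistics


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between fingerprints given as sets of on-bit indices."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 0.0


def is_usable_cluster(values, min_cluster_size=5, std_threshold=0.6):
    """Keep a cluster only if it is big enough and its values vary enough."""
    if len(values) < max(2, min_cluster_size):
        return False
    # Assumption: the sample standard deviation is compared to the threshold.
    return statistics.stdev(values) >= std_threshold


# Illustrative activity values for three candidate clusters.
varied = [5.0, 5.5, 6.0, 6.5, 7.0, 7.5]
flat = [6.0, 6.1, 6.0, 6.1, 6.0, 6.1]
tiny = [5.0, 7.0, 9.0]

similarity = tanimoto({1, 2, 3, 4}, {3, 4, 5, 6})  # 2 shared bits out of 6
```

Here only `varied` would pass: `flat` has too little spread to rank molecules above the measurement noise, and `tiny` has fewer than five members.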
split
split(
smiles: List[str], values: List[float], n_jobs: int = -1, verbose: int = 1
) -> Tuple[List[int], List[List[int]]]
Split the dataset into test clusters and train.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | List[str] | List of SMILES strings representing molecules | required |
values | List[float] | List of their continuous activity values | required |
n_jobs | int | Number of parallel jobs to run; -1 means use all processors | -1 |
verbose | int | Set to 0 to turn off the progress bar | 1 |
Returns:
Name | Type | Description |
---|---|---|
train_idx | List[int] | List of indices for the training set |
clusters_idx | List[List[int]] | List of lists containing the indices for each cluster |
HiSplit
Bases: BaseShuffleSplit
__init__
__init__(
similarity_threshold: float = 0.4,
train_min_frac: float = 0.7,
test_min_frac: float = 0.15,
coarsening_threshold: Optional[float] = None,
verbose: bool = True,
max_mip_gap: float = 0.1,
)
A splitter that creates train/test splits with no molecules in the test set having ECFP4 Tanimoto similarity greater than similarity_threshold to molecules in the train set.
This splitter is designed for evaluating model generalization to structurally dissimilar molecules. It uses a min vertex k-cut algorithm to optimally partition molecules while respecting similarity constraints.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
similarity_threshold | float | ECFP4 Tanimoto threshold. Molecules in the test set won't have a similarity greater than this threshold to those in the train set. | 0.4 |
train_min_frac | float | Minimum fraction for the train set, e.g., 0.7 of the entire dataset. | 0.7 |
test_min_frac | float | Minimum fraction for the test set, e.g., 0.1 of the entire dataset. It's possible that the k-cut might not be feasible without discarding some molecules, so ensure that the sum of train_min_frac and test_min_frac is less than 1.0. | 0.15 |
coarsening_threshold | Optional[float] | Molecules with a similarity greater than the coarsening_threshold will be clustered together. This speeds up execution but makes the solution less optimal. None disables clustering (default value); 1.0 has no effect; 0.90 clusters molecules with similarity > 0.90 together. | None |
verbose | bool | If set to False, suppresses status messages. | True |
max_mip_gap | float | Determines when to halt optimization based on proximity to the optimal solution. For example, setting it to 0.5 yields a faster but less optimal solution, while 0.01 aims for a more optimal solution, potentially at the cost of more computation time. | 0.1 |
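The guarantee HiSplit aims for can be checked after the fact: no test molecule should have a Tanimoto similarity above similarity_threshold to any training molecule. The sketch below performs that check by brute force on fingerprints represented as sets of on-bit indices. It does not reproduce the min vertex k-cut optimization; the helper names and the toy fingerprints are hypothetical.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between fingerprints given as sets of on-bit indices."""
    union = bits_a | bits_b
    return len(bits_a & bits_b) / len(union) if union else 0.0


def find_violations(train_fps, test_fps, similarity_threshold=0.4):
    """List (train_pos, test_pos) pairs that are more similar than allowed."""
    return [
        (i, j)
        for j, test_fp in enumerate(test_fps)
        for i, train_fp in enumerate(train_fps)
        if tanimoto(train_fp, test_fp) > similarity_threshold
    ]


# Toy fingerprints: the second test molecule is too close to the first train one.
train_fps = [{1, 2, 3, 4}, {10, 11, 12}]
test_fps = [{3, 4, 5, 6}, {1, 2, 3, 9}]
violations = find_violations(train_fps, test_fps)
```

An empty list means the split satisfies the similarity constraint; any pair returned identifies a test molecule that leaks structural information from the training set.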
split
Split the dataset into train and test sets such that no molecule in the test has ECFP4 Tanimoto similarity to the train > similarity_threshold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | List[str] | List of SMILES strings representing molecules | required |
Returns:
Type | Description |
---|---|
Iterator[Tuple[ndarray, ndarray]] | Tuple containing the indices of the training molecules and the indices of the test molecules |
Example
from alinemol.splitters.lohi import HiSplit

# smiles is a list of SMILES strings defined beforehand
splitter = HiSplit()
for train_indices, test_indices in splitter.split(smiles):
    print(train_indices)
    print(test_indices)
k_fold_split
k_fold_split(
smiles: List[str], k: int = 3, fold_min_frac: Optional[float] = None
) -> List[List[int]]
Split the dataset into k folds such that no molecule in any fold has an ECFP4 Tanimoto similarity greater than similarity_threshold when compared to molecules in another fold.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
smiles | List[str] | List of SMILES strings representing molecules | required |
k | int | Number of folds | 3 |
fold_min_frac | Optional[float] | Minimum fraction of a fold (e.g., 0.2 of the entire dataset). If not specified (None), it defaults to 0.9 / k. | None |
Returns:
Type | Description |
---|---|
List[List[int]] | List of lists, where each list contains the indices of the molecules in that fold |
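The default for fold_min_frac is easy to work out by hand. The short sketch below resolves it for a given k and shows how much of the dataset is left uncommitted by the minimum fold sizes. The helper name `resolve_fold_min_frac` is hypothetical, and reading the leftover fraction as room for discarded molecules is an interpretation of the note on test_min_frac, not a documented guarantee.

```python
import math


def resolve_fold_min_frac(k=3, fold_min_frac=None):
    """Return the minimum fold fraction, applying the documented 0.9 / k default."""
    return 0.9 / k if fold_min_frac is None else fold_min_frac


k = 3
frac = resolve_fold_min_frac(k)  # 0.9 / 3, i.e. about 0.3 of the dataset per fold
# Fraction of the dataset not covered by the k minimum fold sizes.
slack = 1.0 - k * frac
```

With the default, the k folds are only required to cover 90% of the molecules in total.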
get_umap_clusters
get_umap_clusters(
X: Union[ndarray, List[ndarray]],
n_clusters: int = 10,
n_neighbors: int = 100,
min_dist: float = 0.1,
n_components: int = 2,
umap_metric: str = "euclidean",
linkage: str = "ward",
random_state: Optional[Union[int, RandomState]] = None,
n_jobs: int = -1,
return_embedding: bool = False,
**kwargs
) -> Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]
Cluster the input feature vectors (for example, fingerprints computed from SMILES strings) using the UMAP clustering algorithm.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
X | Union[ndarray, List[ndarray]] | The input data (N * D) | required |
n_clusters | int | The number of clusters to use for clustering | 10 |
n_neighbors | int | The number of neighbors to use for the UMAP algorithm | 100 |
min_dist | float | The minimum distance between points in the UMAP embedding | 0.1 |
n_components | int | The number of components to use for the PCA algorithm | 2 |
umap_metric | str | The metric to use for the UMAP algorithm | 'euclidean' |
linkage | str | The linkage to use for the AgglomerativeClustering algorithm | 'ward' |
random_state | Optional[Union[int, RandomState]] | The random state to use for the PCA algorithm and the Empirical Kernel Map | None |
n_jobs | int | The number of jobs to use for the UMAP algorithm | -1 |
return_embedding | bool | Whether to return the UMAP embedding | False |
Returns:
Type | Description |
---|---|
Union[ndarray, Tuple[ndarray, ndarray]] | Array of cluster labels, one for each row of the input X. If return_embedding is True, returns a tuple of the cluster labels and the UMAP embedding. |
Example
import numpy as np
from alinemol.splitters import get_umap_clusters

X = np.random.rand(100, 128)
clusters_indices, embedding = get_umap_clusters(X, n_clusters=10, n_jobs=1, return_embedding=True)
print(clusters_indices)