
alinemol.splitters

MolecularLogPSplit

Bases: BaseShuffleSplit

Split a molecular dataset by sorting molecules according to their LogP values.

This splitter is designed for chemical domain shift experiments, where you want to evaluate how well models generalize to molecules with different physical properties than those they were trained on. LogP (octanol-water partition coefficient) is a measure of lipophilicity, which affects molecular solubility, permeability, and binding properties.

The splitter works by:
1. Calculating LogP values for all molecules
2. Sorting molecules by their LogP values
3. Splitting the sorted list according to train/test size parameters

When generalize_to_larger=True (default), the training set contains molecules with lower LogP values, and the test set contains those with higher LogP values. This mimics the real-world scenario of testing on molecules with properties outside the training distribution.
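
The sort-then-cut behavior described above can be illustrated with a minimal sketch. It is not the library's implementation: the helper name `sorted_property_split` is hypothetical, and the LogP values are made up for illustration rather than computed from molecules.

```python
from typing import List, Sequence, Tuple


def sorted_property_split(
    values: Sequence[float],
    n_test: int,
    generalize_to_larger: bool = True,
) -> Tuple[List[int], List[int]]:
    """Split indices by sorting on a scalar property (hypothetical helper)."""
    if not 0 < n_test < len(values):
        raise ValueError("n_test must be between 1 and len(values) - 1")
    # Indices ordered from the smallest to the largest property value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    if not generalize_to_larger:
        # Reverse so that the training set takes the largest values instead.
        order = order[::-1]
    n_train = len(values) - n_test
    return order[:n_train], order[n_train:]


# Made-up LogP values, for illustration only (not computed by datamol).
logp = [-0.1, 0.3, 1.7, -0.4, 3.0]
train_idx, test_idx = sorted_property_split(logp, n_test=2)
print(train_idx, test_idx)
```

Because the split depends only on the sort order, repeating it gives the same indices every time, which is consistent with the deterministic behavior noted for this splitter.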

Parameters:

Name Type Description Default
generalize_to_larger

bool, default=True If True, train set will have smaller LogP values, test set will have larger values. If False, train set will have larger LogP values, test set will have smaller values.

True
n_splits

int, default=5 Number of re-shuffling & splitting iterations. Note that for this deterministic splitter, all iterations will produce the same split.

5
smiles

List[str], optional List of SMILES strings if not provided directly as input in split() or _iter_indices(). Useful when the input X to those methods is not a list of SMILES strings but some other feature representation.

None
test_size

float or int, optional If float, represents the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size.

None
train_size

float or int, optional If float, represents the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.

None
random_state

int or RandomState instance, optional Controls the randomness of the training and testing indices produced. Note that this splitter is deterministic, so random_state only affects the implementation of _validate_shuffle_split.

None
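
How float, int and None sizes might resolve to absolute counts can be sketched as follows. The rounding (test count rounded up, train count rounded down) and the 0.1 fallback when both sizes are None are assumptions modeled on common shuffle-split conventions, not behavior confirmed by this page; `resolve_split_sizes` is a hypothetical helper.

```python
import math
from typing import Optional, Tuple, Union

Size = Optional[Union[int, float]]


def resolve_split_sizes(
    n_samples: int,
    test_size: Size = None,
    train_size: Size = None,
    default_test_size: float = 0.1,  # assumed fallback, not stated in the docs
) -> Tuple[int, int]:
    """Turn float/int/None size arguments into absolute (n_train, n_test) counts."""

    def to_count(size: Union[int, float], rounder) -> int:
        # Floats are fractions of the dataset; ints are absolute counts.
        if isinstance(size, float):
            return int(rounder(size * n_samples))
        return int(size)

    if test_size is None and train_size is None:
        test_size = default_test_size

    # Assumed rounding: round the test count up and the train count down.
    n_test = to_count(test_size, math.ceil) if test_size is not None else None
    n_train = to_count(train_size, math.floor) if train_size is not None else None

    # A missing size is the complement of the one that was given.
    if n_test is None:
        n_test = n_samples - n_train
    if n_train is None:
        n_train = n_samples - n_test

    if n_train <= 0 or n_test <= 0 or n_train + n_test > n_samples:
        raise ValueError("sizes must leave non-empty, non-overlapping train and test sets")
    return n_train, n_test


print(resolve_split_sizes(10, test_size=0.25))
print(resolve_split_sizes(10, train_size=7))
```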

Examples:

>>> from alinemol.splitters import MolecularLogPSplit
>>> import numpy as np
>>> # Example with list of SMILES
>>> smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CCN", "CCCCCCC"]
>>> splitter = MolecularLogPSplit(generalize_to_larger=True, test_size=0.4)
>>> for train_idx, test_idx in splitter.split(smiles):
...     print(f"Training on: {[smiles[i] for i in train_idx]}")
...     print(f"Testing on: {[smiles[i] for i in test_idx]}")
...     break  # Just show the first split
>>> # Example with separate features and target
>>> X = np.random.randn(5, 10)  # Some molecular features
>>> y = np.random.randint(0, 2, 5)  # Binary target
>>> splitter = MolecularLogPSplit(smiles=smiles, test_size=0.4)
>>> for train_idx, test_idx in splitter.split(X, y):
...     X_train, X_test = X[train_idx], X[test_idx]
...     y_train, y_test = y[train_idx], y[test_idx]
...     break  # Just show the first split

Notes
  • LogP values are calculated using the Crippen method implemented in datamol
  • This splitter is deterministic - calling split() multiple times will produce the same split regardless of n_splits value
  • Useful for testing model extrapolation to molecules with different physical-chemical properties than the training set

StratifiedRandomSplit

Bases: object

Randomly reorder a dataset and then split it, making sure that the label distribution in the training, validation and test sets matches that of the original dataset.

The split is based on a random permutation applied within each label class, which makes it a stratified random split.

train_val_test_split staticmethod
train_val_test_split(
    dataset: LabeledDataset,
    frac_train: float = 0.8,
    frac_val: float = 0.1,
    frac_test: float = 0.1,
    random_state: RandomStateType = None,
) -> DatasetSplit

Randomly permute the dataset and then split it, with stratification by label, into three consecutive chunks for training, validation and test.

Parameters:

Name Type Description Default
dataset

LabeledDataset We assume len(dataset) gives the size for the dataset and dataset[i] gives the ith datapoint.

required
frac_train

float Fraction of data to use for training. By default, we set this to be 0.8, i.e. 80% of the dataset is used for training.

0.8
frac_val

float Fraction of data to use for validation. By default, we set this to be 0.1, i.e. 10% of the dataset is used for validation.

0.1
frac_test

float Fraction of data to use for test. By default, we set this to be 0.1, i.e. 10% of the dataset is used for test.

0.1
random_state

None, int or array_like, optional Random seed used to initialize the pseudo-random number generator. Can be any integer between 0 and 2**32 - 1 inclusive, an array (or other sequence) of such integers, or None (the default). If random_state is None, then RandomState will try to read data from /dev/urandom (or the Windows analogue) if available or seed from the clock otherwise.

None

Returns:

Type Description
DatasetSplit

list of length 3 Subsets for training, validation and test. Each subset also supports len(subset) and subset[i].
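
The idea of a stratified three-way split can be sketched on a plain list of labels. This is an illustration under stated assumptions, not the library's code: it works on labels instead of a LabeledDataset, the function name is hypothetical, and per-class counts are rounded down with the remainder going to the test set.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def stratified_three_way_split(
    labels: Sequence[Hashable],
    frac_train: float = 0.8,
    frac_val: float = 0.1,
    frac_test: float = 0.1,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int], List[int]]:
    """Split indices so each label keeps roughly the same share in every subset."""
    if abs(frac_train + frac_val + frac_test - 1.0) > 1e-8:
        raise ValueError("fractions must sum to 1")
    rng = random.Random(seed)

    # Group sample indices by label.
    by_label: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train: List[int] = []
    val: List[int] = []
    test: List[int] = []
    for indices in by_label.values():
        # Permute within the class, then cut into three consecutive chunks.
        rng.shuffle(indices)
        n_train = int(frac_train * len(indices))
        n_val = int(frac_val * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to test
    return train, val, test


labels = [0] * 20 + [1] * 10
train, val, test = stratified_three_way_split(labels, seed=0)
print(len(train), len(val), len(test))
```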

k_fold_split staticmethod
k_fold_split(
    dataset: LabeledDataset,
    k: int = 5,
    random_state: RandomStateType = None,
    log: bool = True,
) -> KFoldSplit

Performs stratified k-fold split of the dataset.

Parameters:

Name Type Description Default
dataset

LabeledDataset We assume len(dataset) gives the size for the dataset and dataset[i] gives the ith datapoint. The dataset should have a 'labels' attribute.

required
k

int Number of folds. Default is 5.

5
random_state

None, int or array_like, optional Random seed used to initialize the pseudo-random number generator. Can be any integer between 0 and 2**32 - 1 inclusive, an array (or other sequence) of such integers, or None (the default).

None
log

bool Whether to log information about the split. Default is True.

True

Returns:

Type Description
KFoldSplit

list of tuples Each tuple contains (train_set, val_set) where train_set and val_set are Subset objects of the original dataset.
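
A stratified k-fold split can be sketched in the same spirit. The version below deals the shuffled indices of each class round-robin into k folds and returns (train, val) index pairs; the function name and the round-robin assignment are assumptions made for illustration, and the library returns Subset objects rather than index lists.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def stratified_k_fold(
    labels: Sequence[Hashable],
    k: int = 5,
    seed: Optional[int] = None,
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, val_indices) pairs with balanced labels per fold."""
    rng = random.Random(seed)

    # Group sample indices by label.
    by_label: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    # Deal each class round-robin so every fold gets a similar label mix.
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    splits = []
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, val))
    return splits


labels = [0] * 10 + [1] * 5
splits = stratified_k_fold(labels, k=5, seed=0)
for train, val in splits:
    print(len(train), len(val))
```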

UMAPSplit

Bases: GroupShuffleSplit

Group-based split that uses the UMAP clustering in the input space for splitting.

From the following papers:
1. "UMAP-based clustering split for rigorous evaluation of AI models for virtual screening on cancer cell lines" https://doi.org/10.26434/chemrxiv-2024-f1v2v-v2
2. "On the Best Way to Cluster NCI-60 Molecules" https://doi.org/10.3390/biom13030498

Parameters:

Name Type Description Default
n_clusters int

The number of clusters to use for clustering

10
n_neighbors int

The number of neighbors to use for the UMAP algorithm

100
min_dist float

The minimum distance between points in the UMAP embedding

0.1
n_components int

The number of components to use for the PCA algorithm

2
umap_metric Union[str, Callable]

The metric to use for the UMAP algorithm

'jaccard'
linkage str

The linkage to use for the AgglomerativeClustering algorithm

'ward'
n_splits int

The number of re-shuffling and splitting iterations

5
test_size Optional[Union[float, int]]

The size of the test set

None
train_size Optional[Union[float, int]]

The size of the train set

None
random_state Optional[Union[int, RandomState]]

The random state to use for the split

None
Example

from alinemol.splitters import UMAPSplit

splitter = UMAPSplit(n_clusters=2, linkage="ward", n_neighbors=3, min_dist=0.1, n_components=2, n_splits=5)
smiles = ["c1ccccc1", "CCC", "CCCC(CCC)C(=O)O", "NC1CCCCC1N", "COc1cc(CNC(=O)CCCCC=CC(C)C)ccc1O", "Cc1cc(Br)c(O)c2ncccc12", "OCC(O)c1oc(O)c(O)c1O"]
for train_idx, test_idx in splitter.split(smiles):
    print(train_idx)
    print(test_idx)
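
What a group-based split means once cluster labels are available can be shown with a small sketch. It assumes the cluster labels have already been computed, and that the test fraction applies to the number of groups rather than the number of samples; the function name is hypothetical and the clustering step itself is not reproduced.

```python
import math
import random
from typing import Hashable, List, Optional, Sequence, Tuple


def group_shuffle_split(
    groups: Sequence[Hashable],
    test_fraction: float = 0.2,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Assign whole groups (clusters) to either train or test, never both."""
    rng = random.Random(seed)
    unique_groups = sorted(set(groups))
    rng.shuffle(unique_groups)

    # Assumption: the fraction is taken over groups, with at least one test group.
    n_test_groups = max(1, math.ceil(test_fraction * len(unique_groups)))
    test_groups = set(unique_groups[:n_test_groups])

    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical cluster labels, e.g. as produced by a clustering step.
cluster_labels = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
train_idx, test_idx = group_shuffle_split(cluster_labels, test_fraction=0.2, seed=0)
print(train_idx, test_idx)
```

Keeping each cluster entirely on one side of the split is what prevents near-duplicates in the embedding space from appearing in both train and test.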

LoSplit

__init__
__init__(
    threshold: float = 0.4,
    min_cluster_size: int = 5,
    max_clusters: int = 50,
    std_threshold: float = 0.6,
)

A splitter that prepares data for training ML models for Lead Optimization or to guide molecular generative models. These models must be sensitive to minor modifications of molecules, and this splitter constructs a test set that allows the evaluation of a model's ability to distinguish those modifications.

Parameters:

Name Type Description Default
threshold float

ECFP4 1024-bit Tanimoto similarity threshold. Molecules more similar than this threshold are considered too similar and can be grouped together in one cluster.

0.4
min_cluster_size int

The minimum number of molecules per cluster.

5
max_clusters int

The maximum number of selected clusters. The remaining molecules go to the training set. This can be useful for limiting the size of the test set so that more molecules remain in the training set.

50
std_threshold float

The lower bound of the acceptable standard deviation for a cluster's values. It should be greater than the measurement noise. For ChEMBL-like data, set it to 0.60 for logKi and 0.70 for logIC50. Set it lower if you have a high-quality dataset.

0.6

For more information, see a tutorial in the docs and Steshin 2023, Lo-Hi: Practical ML Drug Discovery Benchmark.

split
split(
    smiles: List[str], values: List[float], n_jobs: int = -1, verbose: int = 1
) -> Tuple[List[int], List[List[int]]]

Split the dataset into a training set and a list of test clusters.

Parameters:

Name Type Description Default
smiles List[str]

list of SMILES strings representing molecules

required
values List[float]

list of their continuous activity values

required
n_jobs int

number of parallel jobs to run, -1 means use all processors

-1
verbose int

set to 0 to turn off the progress bar

1

Returns:

Name Type Description
train_idx List[int]

list of indices for training set

clusters_idx List[List[int]]

list of lists containing indices for each cluster
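
How the four constructor parameters interact can be illustrated with a simplified greedy sketch. It is not the library's algorithm: fingerprints are represented as plain sets of on-bits instead of ECFP4 vectors, the selection order is naive, and the names `tanimoto` and `select_lo_clusters` are hypothetical. The real procedure may choose and assign cluster members differently.

```python
import statistics
from typing import FrozenSet, List, Sequence, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def select_lo_clusters(
    fingerprints: Sequence[FrozenSet[int]],
    values: Sequence[float],
    threshold: float = 0.4,
    min_cluster_size: int = 5,
    max_clusters: int = 50,
    std_threshold: float = 0.6,
) -> Tuple[List[int], List[List[int]]]:
    """Greedy sketch: pick clusters of similar molecules with enough value spread."""
    unassigned = set(range(len(fingerprints)))
    clusters: List[List[int]] = []
    for center in range(len(fingerprints)):
        if len(clusters) >= max_clusters:
            break
        if center not in unassigned:
            continue
        # Unassigned molecules more similar to the center than the threshold.
        members = [
            i for i in sorted(unassigned)
            if tanimoto(fingerprints[center], fingerprints[i]) > threshold
        ]
        if len(members) < min_cluster_size:
            continue
        # Reject clusters whose values vary less than the assumed noise level.
        if statistics.pstdev([values[i] for i in members]) < std_threshold:
            continue
        clusters.append(members)
        unassigned -= set(members)
    # Everything not placed in a test cluster stays in the training set.
    return sorted(unassigned), clusters


# Toy data: two groups of five similar fingerprints plus one outlier.
fps = [frozenset({1, 2, 3, 4, 10 + i}) for i in range(5)]
fps += [frozenset({20, 21, 22, 23, 30 + i}) for i in range(5)]
fps += [frozenset({99})]
vals = [5.0, 6.0, 7.0, 8.0, 9.0] + [5.0] * 5 + [4.0]
train_idx, clusters_idx = select_lo_clusters(fps, vals)
print(train_idx, clusters_idx)
```

In this toy data the second group is rejected because its values do not vary, which is the role the std_threshold parameter plays in the description above.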

HiSplit

Bases: BaseShuffleSplit

__init__
__init__(
    similarity_threshold: float = 0.4,
    train_min_frac: float = 0.7,
    test_min_frac: float = 0.15,
    coarsening_threshold: Optional[float] = None,
    verbose: bool = True,
    max_mip_gap: float = 0.1,
)

A splitter that creates train/test splits with no molecules in the test set having ECFP4 Tanimoto similarity greater than similarity_threshold to molecules in the train set.

This splitter is designed for evaluating model generalization to structurally dissimilar molecules. It uses a min vertex k-cut algorithm to optimally partition molecules while respecting similarity constraints.

Parameters:

Name Type Description Default
similarity_threshold float

ECFP4 Tanimoto threshold. Molecules in the test set won't have a similarity greater than this threshold to those in the train set.

0.4
train_min_frac float

Minimum fraction for the train set, e.g., 0.7 of the entire dataset.

0.7
test_min_frac float

Minimum fraction for the test set, e.g., 0.1 of the entire dataset. It's possible that the k-cut might not be feasible without discarding some molecules, so ensure that the sum of train_min_frac and test_min_frac is less than 1.0.

0.15
coarsening_threshold Optional[float]

Molecules with a similarity greater than the coarsening_threshold will be clustered together. It speeds up execution, but makes the solution less optimal.
  • None: disables clustering (default value)
  • 1.0: has no effect
  • 0.90: clusters molecules with similarity > 0.90 together

None
verbose bool

If set to False, suppresses status messages.

True
max_mip_gap float

Determines when to halt optimization based on proximity to the optimal solution. For example, setting it to 0.5 yields a faster but less optimal solution, while 0.01 aims for a more optimal solution, potentially at the cost of more computation time.

0.1
split
split(smiles: List[str]) -> Iterator[Tuple[np.ndarray, np.ndarray]]

Split the dataset into train and test sets such that no molecule in the test set has an ECFP4 Tanimoto similarity greater than similarity_threshold to any molecule in the train set.

Parameters:

Name Type Description Default
smiles List[str]

List of SMILES strings representing molecules

required

Returns:

Type Description
Iterator[Tuple[ndarray, ndarray]]

Iterator yielding tuples containing:
  • np.ndarray: Indices of training molecules
  • np.ndarray: Indices of test molecules

Example

from alinemol.splitters.lohi import HiSplit

# smiles is assumed to be a list of SMILES strings defined beforehand
splitter = HiSplit()
for train_indices, test_indices in splitter.split(smiles):
    print(train_indices)
    print(test_indices)
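
The constraint that a Hi split is meant to satisfy can be checked after the fact with a small sketch. It assumes fingerprints are available as sets of on-bits (standing in for ECFP4 vectors); the helper names are hypothetical, and the optimization that produces the split is not reproduced here.

```python
from typing import FrozenSet, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def satisfies_hi_constraint(
    fingerprints: Sequence[FrozenSet[int]],
    train_idx: Sequence[int],
    test_idx: Sequence[int],
    similarity_threshold: float = 0.4,
) -> bool:
    """True if no test molecule is more similar than the threshold to any train molecule."""
    return all(
        tanimoto(fingerprints[t], fingerprints[r]) <= similarity_threshold
        for t in test_idx
        for r in train_idx
    )


# Toy fingerprints: 0 and 1 overlap heavily, 2 is unrelated to both.
fps = [frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}), frozenset({7, 8, 9})]
print(satisfies_hi_constraint(fps, train_idx=[0, 1], test_idx=[2]))
print(satisfies_hi_constraint(fps, train_idx=[0], test_idx=[1]))
```

A check like this is quadratic in the number of molecules, so it is best suited to validating a finished split rather than searching for one.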

k_fold_split
k_fold_split(
    smiles: List[str], k: int = 3, fold_min_frac: Optional[float] = None
) -> List[List[int]]

Split the dataset into k folds such that no molecule in any fold has an ECFP4 Tanimoto similarity greater than similarity_threshold when compared to molecules in another fold.

Parameters:

Name Type Description Default
smiles List[str]

List of SMILES strings representing molecules

required
k int

Number of folds

3
fold_min_frac Optional[float]

Minimum fraction of a fold (e.g., 0.2 of the entire dataset). If not specified (None), it defaults to 0.9 / k.

None

Returns:

Type Description
List[List[int]]

List of lists, where each inner list contains the indices of the molecules in that fold
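
With the default fold_min_frac of 0.9 / k, the k minimum fold sizes add up to 90% of the dataset. The arithmetic can be sketched as follows, assuming minimum sizes are rounded down to whole molecules; the helper name is hypothetical and the library may count differently.

```python
import math
from typing import Optional, Tuple


def min_fold_sizes(
    n_molecules: int,
    k: int = 3,
    fold_min_frac: Optional[float] = None,
) -> Tuple[int, int]:
    """Return (minimum molecules per fold, molecules not covered by the minimums)."""
    if fold_min_frac is None:
        fold_min_frac = 0.9 / k  # default stated in the parameter description
    if fold_min_frac * k > 1.0:
        raise ValueError("k folds of this minimum size exceed the dataset")
    # Small epsilon guards against floating-point results such as 299.9999999.
    per_fold = math.floor(fold_min_frac * n_molecules + 1e-9)
    return per_fold, n_molecules - k * per_fold


print(min_fold_sizes(1000, k=3))
print(min_fold_sizes(100, k=5))
```

The second number is the slack left to the optimizer: molecules that can be placed in any fold or left out while still meeting every fold's minimum size.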

get_umap_clusters

get_umap_clusters(
    X: Union[ndarray, List[ndarray]],
    n_clusters: int = 10,
    n_neighbors: int = 100,
    min_dist: float = 0.1,
    n_components: int = 2,
    umap_metric: str = "euclidean",
    linkage: str = "ward",
    random_state: Optional[Union[int, RandomState]] = None,
    n_jobs: int = -1,
    return_embedding: bool = False,
    **kwargs
) -> Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]

Cluster the input feature vectors by embedding them with UMAP and then applying agglomerative clustering to the embedding.

Parameters:

Name Type Description Default
X Union[ndarray, List[ndarray]]

The input data (N * D)

required
n_clusters int

The number of clusters to use for clustering

10
n_neighbors int

The number of neighbors to use for the UMAP algorithm

100
min_dist float

The minimum distance between points in the UMAP embedding

0.1
n_components int

The number of components to use for the PCA algorithm

2
umap_metric str

The metric to use for the UMAP algorithm

'euclidean'
linkage str

The linkage to use for the AgglomerativeClustering algorithm

'ward'
random_state Optional[Union[int, RandomState]]

The random state to use for the PCA algorithm and the Empirical Kernel Map

None
n_jobs int

The number of jobs to use for the UMAP algorithm

-1
return_embedding bool

Whether to return the UMAP embedding

False

Returns:

Type Description
Union[ndarray, Tuple[ndarray, ndarray]]

Array of cluster labels, one per input sample in X. If return_embedding is True, returns a tuple of the cluster labels and the UMAP embedding.

Example

import numpy as np
from alinemol.splitters import get_umap_clusters

X = np.random.rand(100, 128)
clusters_indices, embedding = get_umap_clusters(X, n_clusters=10, n_jobs=1, return_embedding=True)
print(clusters_indices)