`alinemol.splitters`

MolecularLogPSplit

Bases: BaseShuffleSplit

Split a molecular dataset by sorting molecules according to their LogP values.

This splitter is designed for chemical domain shift experiments, where you want to evaluate how well models generalize to molecules with different physical properties than those they were trained on. LogP (octanol-water partition coefficient) is a measure of lipophilicity, which affects molecular solubility, permeability, and binding properties.

The splitter works by:

1. Calculating LogP values for all molecules
2. Sorting molecules by their LogP values
3. Splitting the sorted list according to the train/test size parameters

When generalize_to_larger=True (default), the training set contains molecules with lower LogP values, and the test set contains those with higher LogP values. This mimics the real-world scenario of testing on molecules with properties outside the training distribution.
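The sort-then-cut logic described above can be illustrated with a minimal, standard-library sketch. It is not the library's implementation: the function name `sorted_property_split` is hypothetical, the LogP values are made-up numbers rather than Crippen estimates, and the rounding of `test_size` is an assumption.

```python
from typing import List, Sequence, Tuple


def sorted_property_split(
    values: Sequence[float],
    test_size: float = 0.4,
    generalize_to_larger: bool = True,
) -> Tuple[List[int], List[int]]:
    """Split indices by sorting on a scalar property (here: precomputed LogP)."""
    # Assumption: the test count is the rounded fraction, with at least one sample.
    n_test = max(1, round(len(values) * test_size))
    n_train = len(values) - n_test
    # Ascending order puts low-LogP molecules first; reversing it trains on high LogP.
    order = sorted(
        range(len(values)),
        key=lambda i: values[i],
        reverse=not generalize_to_larger,
    )
    return order[:n_train], order[n_train:]


# Illustrative LogP values only, not computed from real molecules.
logp = [-0.3, 0.1, 1.7, -0.2, 3.0]
train_idx, test_idx = sorted_property_split(logp, test_size=0.4)
print(train_idx, test_idx)
```

With `generalize_to_larger=False` the same call would place the two lowest-LogP molecules in the test set instead.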

Parameters:

Name	Type	Description	Default
`generalize_to_larger`	`bool`	If True, the train set has smaller LogP values and the test set has larger values. If False, the train set has larger LogP values and the test set has smaller values.	`True`
`n_splits`	`int`	Number of re-shuffling and splitting iterations. Because this splitter is deterministic, all iterations produce the same split.	`5`
`smiles`	`Optional[List[str]]`	List of SMILES strings, used when they are not provided directly as input to split() or _iter_indices(). Useful when the input X to those methods is not a list of SMILES strings but some other feature representation.	`None`
`test_size`	`Optional[Union[float, int]]`	If float, represents the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size.	`None`
`train_size`	`Optional[Union[float, int]]`	If float, represents the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is set to the complement of the test size.	`None`
`random_state`	`Optional[Union[int, RandomState]]`	Controls the randomness of the training and testing indices produced. This splitter is deterministic, so random_state only affects the implementation of _validate_shuffle_split.	`None`
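The complement rule for `test_size` and `train_size` can be sketched as follows. This is an illustration under assumptions, not the library's validation code: the helper name `resolve_sizes` is hypothetical, and rounding float fractions up for the test set and down for the train set is an assumed convention.

```python
import math
from typing import Optional, Tuple, Union

Size = Optional[Union[int, float]]


def resolve_sizes(
    n_samples: int, test_size: Size = None, train_size: Size = None
) -> Tuple[int, int]:
    """Turn fractional or absolute sizes into (n_train, n_test) counts."""
    if test_size is None and train_size is None:
        raise ValueError("provide at least one of test_size or train_size")

    def to_count(size: Size, rounder) -> Optional[int]:
        if size is None:
            return None
        # Floats are proportions of the dataset; ints are absolute counts.
        return int(rounder(size * n_samples)) if isinstance(size, float) else int(size)

    n_test = to_count(test_size, math.ceil)     # assumed: round test size up
    n_train = to_count(train_size, math.floor)  # assumed: round train size down
    # A missing size is the complement of the one that was given.
    if n_test is None:
        n_test = n_samples - n_train
    if n_train is None:
        n_train = n_samples - n_test
    if n_train + n_test > n_samples:
        raise ValueError("train and test sizes exceed the dataset size")
    return n_train, n_test


print(resolve_sizes(10, test_size=0.2))
print(resolve_sizes(10, train_size=7))
```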

Examples:

>>> from alinemol.splitters import MolecularLogPSplit
>>> import numpy as np
>>> # Example with list of SMILES
>>> smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CCN", "CCCCCCC"]
>>> splitter = MolecularLogPSplit(generalize_to_larger=True, test_size=0.4)
>>> for train_idx, test_idx in splitter.split(smiles):
...     print(f"Training on: {[smiles[i] for i in train_idx]}")
...     print(f"Testing on: {[smiles[i] for i in test_idx]}")
...     break  # Just show the first split

>>> # Example with separate features and target
>>> X = np.random.randn(5, 10)  # Some molecular features
>>> y = np.random.randint(0, 2, 5)  # Binary target
>>> splitter = MolecularLogPSplit(smiles=smiles, test_size=0.4)
>>> for train_idx, test_idx in splitter.split(X, y):
...     X_train, X_test = X[train_idx], X[test_idx]
...     y_train, y_test = y[train_idx], y[test_idx]
...     break  # Just show the first split

Notes

LogP values are calculated using the Crippen method implemented in datamol
This splitter is deterministic - calling split() multiple times will produce the same split regardless of n_splits value
Useful for testing model extrapolation to molecules with different physical-chemical properties than the training set

StratifiedRandomSplit

Bases: object

Randomly reorder a dataset and then split it, ensuring that the label distribution in the training, validation and test sets matches that of the original dataset.

The dataset is permuted before it is split, so the split is both stratified and random.

train_val_test_split `staticmethod`

train_val_test_split(
    dataset: LabeledDataset,
    frac_train: float = 0.8,
    frac_val: float = 0.1,
    frac_test: float = 0.1,
    random_state: RandomStateType = None,
) -> DatasetSplit

Randomly permute the dataset and then split it, with stratification by label, into three consecutive chunks for training, validation and test.

Parameters:

Name	Type	Description	Default
`dataset`	`LabeledDataset`	We assume `len(dataset)` gives the size of the dataset and `dataset[i]` gives the ith datapoint.	required
`frac_train`	`float`	Fraction of data to use for training. The default of 0.8 means 80% of the dataset is used for training.	`0.8`
`frac_val`	`float`	Fraction of data to use for validation. The default of 0.1 means 10% of the dataset is used for validation.	`0.1`
`frac_test`	`float`	Fraction of data to use for test. The default of 0.1 means 10% of the dataset is used for test.	`0.1`
`random_state`	None, int or array_like	Random seed used to initialize the pseudo-random number generator. Can be any integer between 0 and 2**32 - 1 inclusive, an array (or other sequence) of such integers, or None (the default). If None, RandomState will try to read data from /dev/urandom (or the Windows analogue) if available, or seed from the clock otherwise.	`None`

Returns:

Type	Description
`DatasetSplit`	List of length 3 containing the subsets for training, validation and test. Each subset also supports `len(subset)` and `subset[i]`.
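To make "permute, then split with stratification" concrete, here is a small standard-library sketch. It works on a plain list of labels and returns index lists; the function name `stratified_three_way_split` and the per-class rounding are assumptions, and the real method operates on a `LabeledDataset` and returns subsets rather than indices.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Optional, Sequence, Tuple


def stratified_three_way_split(
    labels: Sequence[Hashable],
    frac_train: float = 0.8,
    frac_val: float = 0.1,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices within each class, then cut each class into three chunks."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(frac_train * len(indices))
        n_val = round(frac_val * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # whatever remains goes to test
    return train, val, test


# Toy labels: 80 negatives and 20 positives.
labels = [0] * 80 + [1] * 20
train, val, test = stratified_three_way_split(labels, seed=0)
print(len(train), len(val), len(test))
```

Because each class is cut with the same fractions, the positive rate is 20% in all three index lists for this toy input.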

k_fold_split `staticmethod`

k_fold_split(
    dataset: LabeledDataset,
    k: int = 5,
    random_state: RandomStateType = None,
    log: bool = True,
) -> KFoldSplit

Performs stratified k-fold split of the dataset.

Parameters:

Name	Type	Description	Default
`dataset`	`LabeledDataset`	We assume `len(dataset)` gives the size of the dataset and `dataset[i]` gives the ith datapoint. The dataset should have a 'labels' attribute.	required
`k`	`int`	Number of folds.	`5`
`random_state`	None, int or array_like	Random seed used to initialize the pseudo-random number generator. Can be any integer between 0 and 2**32 - 1 inclusive, an array (or other sequence) of such integers, or None (the default).	`None`
`log`	`bool`	Whether to log information about the split.	`True`

Returns:

Type	Description
`KFoldSplit`	List of tuples. Each tuple contains (train_set, val_set), where train_set and val_set are Subset objects of the original dataset.
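A stratified k-fold split can be sketched by dealing the shuffled members of each class round-robin into k folds. The following is a standard-library illustration on a list of labels, with a hypothetical function name; it returns index lists, whereas the method above returns `Subset` pairs.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Optional, Sequence, Tuple


def stratified_k_fold(
    labels: Sequence[Hashable], k: int = 5, seed: Optional[int] = None
) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, val_indices) pairs with balanced label counts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_label.values():
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    splits = []
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        splits.append((train, val))
    return splits


labels = [0] * 15 + [1] * 5
splits = stratified_k_fold(labels, k=5, seed=0)
for train, val in splits:
    print(len(train), len(val), sum(labels[i] for i in val))
```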

UMAPSplit

Bases: GroupShuffleSplit

Group-based split that uses UMAP clustering in the input space to define the groups.

From the following papers:

1. "UMAP-based clustering split for rigorous evaluation of AI models for virtual screening on cancer cell lines" https://doi.org/10.26434/chemrxiv-2024-f1v2v-v2
2. "On the Best Way to Cluster NCI-60 Molecules" https://doi.org/10.3390/biom13030498

Parameters:

Name	Type	Description	Default
`n_clusters`	`int`	The number of clusters to use for clustering	`10`
`n_neighbors`	`int`	The number of neighbors to use for the UMAP algorithm	`100`
`min_dist`	`float`	The minimum distance between points in the UMAP embedding	`0.1`
`n_components`	`int`	The number of dimensions of the UMAP embedding	`2`
`umap_metric`	`Union[str, Callable]`	The metric to use for the UMAP algorithm	`'jaccard'`
`linkage`	`str`	The linkage to use for the AgglomerativeClustering algorithm	`'ward'`
`n_splits`	`int`	The number of splits to use for the split	`5`
`test_size`	`Optional[Union[float, int]]`	The size of the test set	`None`
`train_size`	`Optional[Union[float, int]]`	The size of the train set	`None`
`random_state`	`Optional[Union[int, RandomState]]`	The random state to use for the split	`None`

Example

>>> from alinemol.splitters import UMAPSplit
>>> splitter = UMAPSplit(n_clusters=2, linkage="ward", n_neighbors=3, min_dist=0.1, n_components=2, n_splits=5)
>>> smiles = ["c1ccccc1", "CCC", "CCCC(CCC)C(=O)O", "NC1CCCCC1N", "COc1cc(CNC(=O)CCCCC=CC(C)C)ccc1O", "Cc1cc(Br)c(O)c2ncccc12", "OCC(O)c1oc(O)c(O)c1O"]
>>> for train_idx, test_idx in splitter.split(smiles):
...     print(train_idx)
...     print(test_idx)
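Once every molecule has a cluster label, a group-based split keeps each cluster entirely on one side. The sketch below shows only that grouping step in plain Python; the cluster labels are made up, the function name `group_shuffle_split` is hypothetical, and interpreting `test_size` as a fraction of clusters (rather than of molecules) is an assumption here.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Optional, Sequence, Tuple


def group_shuffle_split(
    groups: Sequence[Hashable], test_size: float = 0.2, seed: Optional[int] = None
) -> Tuple[List[int], List[int]]:
    """Assign whole groups (clusters) to either the train or the test side."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)

    group_ids = sorted(members)
    rng.shuffle(group_ids)
    # Assumption: test_size is a fraction of the groups, with at least one test group.
    n_test_groups = max(1, round(test_size * len(group_ids)))
    test_groups = set(group_ids[:n_test_groups])

    train = [i for i, g in enumerate(groups) if g not in test_groups]
    test = [i for i, g in enumerate(groups) if g in test_groups]
    return train, test


# Hypothetical cluster labels for ten molecules.
cluster_labels = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
train_idx, test_idx = group_shuffle_split(cluster_labels, test_size=0.25, seed=0)
print(train_idx, test_idx)
```

No cluster label appears on both sides, which is what makes the evaluation harder than a random split.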

LoSplit

`__init__`

__init__(
    threshold: float = 0.4,
    min_cluster_size: int = 5,
    max_clusters: int = 50,
    std_threshold: float = 0.6,
)

A splitter that prepares data for training ML models for Lead Optimization or for guiding molecular generative models. These models must be sensitive to minor modifications of molecules, and this splitter constructs a test set that allows you to evaluate a model's ability to distinguish those modifications.

Parameters:

Name	Type	Description	Default
`threshold`	`float`	ECFP4 1024-bit Tanimoto similarity threshold. Molecules more similar than this threshold are considered too similar and can be grouped together in one cluster.	`0.4`
`min_cluster_size`	`int`	The minimum number of molecules per cluster.	`5`
`max_clusters`	`int`	The maximum number of selected clusters. The remaining molecules go to the training set. This can be useful for limiting the test set so that more molecules remain in the train set.	`50`
`std_threshold`	`float`	The lower bound of the acceptable standard deviation for a cluster's values. It should be greater than the measurement noise. For ChEMBL-like data, set it to 0.60 for logKi and 0.70 for logIC50. Set it lower if you have a high-quality dataset.	`0.6`

For more information, see a tutorial in the docs and Steshin 2023, Lo-Hi: Practical ML Drug Discovery Benchmark.
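The roles of `min_cluster_size`, `max_clusters` and `std_threshold` can be illustrated with a small filter over candidate clusters. This is not the library's clustering procedure: the candidate clusters are supplied by hand, the function name `select_test_clusters` is hypothetical, and the use of the population standard deviation is an assumption.

```python
from statistics import pstdev
from typing import List, Sequence


def select_test_clusters(
    candidate_clusters: Sequence[Sequence[int]],
    values: Sequence[float],
    min_cluster_size: int = 5,
    max_clusters: int = 50,
    std_threshold: float = 0.6,
) -> List[List[int]]:
    """Keep clusters that are large enough and whose activity values vary enough."""
    selected: List[List[int]] = []
    for cluster in candidate_clusters:
        if len(selected) == max_clusters:
            break  # remaining molecules would stay in the training set
        if len(cluster) < min_cluster_size:
            continue  # too few molecules to evaluate within-cluster ranking
        spread = pstdev([values[i] for i in cluster])
        if spread >= std_threshold:  # variation must exceed measurement noise
            selected.append(list(cluster))
    return selected


# Made-up activity values (e.g. log-scale potencies) for eleven molecules.
values = [5.0, 6.5, 7.9, 5.2, 6.8, 6.0, 6.1, 6.0, 6.1, 6.0, 9.0]
candidates = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10]]
print(select_test_clusters(candidates, values))
```

In this toy input the second cluster is rejected because its values barely vary, and the third because it is too small.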

split

split(
    smiles: List[str], values: List[float], n_jobs: int = -1, verbose: int = 1
) -> Tuple[List[int], List[List[int]]]

Split the dataset into a training set and a list of test clusters.

Parameters:

Name	Type	Description	Default
`smiles`	`List[str]`	list of SMILES strings representing molecules	required
`values`	`List[float]`	list of their continuous activity values	required
`n_jobs`	`int`	number of parallel jobs to run, -1 means use all processors	`-1`
`verbose`	`int`	set to 0 to turn off progressbar	`1`

Returns:

Name	Type	Description
`train_idx`	`List[int]`	list of indices for training set
`clusters_idx`	`List[List[int]]`	list of lists containing indices for each cluster

HiSplit

Bases: BaseShuffleSplit

`__init__`

__init__(
    similarity_threshold: float = 0.4,
    train_min_frac: float = 0.7,
    test_min_frac: float = 0.15,
    coarsening_threshold: Optional[float] = None,
    verbose: bool = True,
    max_mip_gap: float = 0.1,
)

A splitter that creates train/test splits with no molecules in the test set having ECFP4 Tanimoto similarity greater than similarity_threshold to molecules in the train set.

This splitter is designed for evaluating model generalization to structurally dissimilar molecules. It uses a min vertex k-cut algorithm to optimally partition molecules while respecting similarity constraints.

Parameters:

Name	Type	Description	Default
`similarity_threshold`	`float`	ECFP4 Tanimoto threshold. Molecules in the test set won't have a similarity greater than this threshold to those in the train set.	`0.4`
`train_min_frac`	`float`	Minimum fraction for the train set, e.g., 0.7 of the entire dataset.	`0.7`
`test_min_frac`	`float`	Minimum fraction for the test set, e.g., 0.1 of the entire dataset. It's possible that the k-cut might not be feasible without discarding some molecules, so ensure that the sum of train_min_frac and test_min_frac is less than 1.0.	`0.15`
`coarsening_threshold`	`Optional[float]`	Molecules with a similarity greater than the coarsening_threshold will be clustered together. This speeds up execution but makes the solution less optimal. None disables clustering (default value); 1.0 has no effect; 0.90 clusters molecules with similarity > 0.90 together.	`None`
`verbose`	`bool`	If set to False, suppresses status messages.	`True`
`max_mip_gap`	`float`	Determines when to halt optimization based on proximity to the optimal solution. For example, setting it to 0.5 yields a faster but less optimal solution, while 0.01 aims for a more optimal solution, potentially at the cost of more computation time.	`0.1`

split

split(smiles: List[str]) -> Iterator[Tuple[np.ndarray, np.ndarray]]

Split the dataset into train and test sets such that no molecule in the test set has an ECFP4 Tanimoto similarity greater than similarity_threshold to any molecule in the train set.

Parameters:

Name	Type	Description	Default
`smiles`	`List[str]`	List of SMILES strings representing molecules	required

Returns:

Type	Description
`Iterator[Tuple[ndarray, ndarray]]`	Iterator yielding (train_indices, test_indices) tuples, where train_indices are the indices of the training molecules and test_indices are the indices of the test molecules.
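The guarantee stated above can be checked after the fact by computing the maximum Tanimoto similarity of each test molecule to the train set. The sketch below does this in plain Python on toy fingerprints represented as sets of on-bit positions; the helper names and the fingerprints are hypothetical, and a real check would use ECFP4 fingerprints from a cheminformatics toolkit.

```python
from typing import FrozenSet, List, Sequence


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of on-bit positions."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def leaking_test_molecules(
    fingerprints: Sequence[FrozenSet[int]],
    train_idx: Sequence[int],
    test_idx: Sequence[int],
    similarity_threshold: float = 0.4,
) -> List[int]:
    """Return test indices that are too similar to at least one train molecule."""
    leaks = []
    for t in test_idx:
        nearest = max(tanimoto(fingerprints[t], fingerprints[j]) for j in train_idx)
        if nearest > similarity_threshold:
            leaks.append(t)
    return leaks


# Toy fingerprints: sets of on-bit positions, not real ECFP4 bits.
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
    frozenset({1, 10, 20, 21}),
]
print(leaking_test_molecules(fps, train_idx=[0, 1, 4], test_idx=[2, 3]))
print(leaking_test_molecules(fps, train_idx=[0], test_idx=[1, 2]))
```

An empty list means the split respects the threshold; any returned index marks a test molecule with a close analogue in the train set.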

Example

>>> from alinemol.splitters.lohi import HiSplit
>>> # smiles is a list of SMILES strings defined elsewhere
>>> splitter = HiSplit()
>>> for train_indices, test_indices in splitter.split(smiles):
...     print(train_indices)
...     print(test_indices)

k_fold_split

k_fold_split(
    smiles: List[str], k: int = 3, fold_min_frac: Optional[float] = None
) -> List[List[int]]

Split the dataset into k folds such that no molecule in any fold has an ECFP4 Tanimoto similarity greater than similarity_threshold when compared to molecules in another fold.

Parameters:

Name	Type	Description	Default
`smiles`	`List[str]`	List of SMILES strings representing molecules	required
`k`	`int`	Number of folds	`3`
`fold_min_frac`	`Optional[float]`	Minimum fraction of a fold (e.g., 0.2 of the entire dataset). If not specified (None), it defaults to 0.9 / k.	`None`

Returns:

Type	Description
`List[List[int]]`	List of lists, where each inner list contains the indices of the molecules in that fold.
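Because `k_fold_split` returns folds rather than train/test pairs, a caller typically holds out one fold at a time. The sketch below shows that conversion together with the documented default of `0.9 / k` for `fold_min_frac`; the helper names and the example folds are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def effective_fold_min_frac(k: int, fold_min_frac: Optional[float] = None) -> float:
    """Apply the documented default: 0.9 / k when no value is given."""
    return 0.9 / k if fold_min_frac is None else fold_min_frac


def folds_to_splits(
    folds: Sequence[Sequence[int]],
) -> List[Tuple[List[int], List[int]]]:
    """Hold out each fold in turn and train on the union of the others."""
    splits = []
    for i, held_out in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, list(held_out)))
    return splits


# Hypothetical folds for nine molecules and k=3.
folds = [[0, 4, 7], [1, 2, 8], [3, 5, 6]]
for train_idx, test_idx in folds_to_splits(folds):
    print(train_idx, test_idx)
print(effective_fold_min_frac(3))
```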

get_umap_clusters

get_umap_clusters(
    X: Union[ndarray, List[ndarray]],
    n_clusters: int = 10,
    n_neighbors: int = 100,
    min_dist: float = 0.1,
    n_components: int = 2,
    umap_metric: str = "euclidean",
    linkage: str = "ward",
    random_state: Optional[Union[int, RandomState]] = None,
    n_jobs: int = -1,
    return_embedding: bool = False,
    **kwargs
) -> Union[np.ndarray, Tuple[np.ndarray, np.ndarray]]

Cluster the input data using UMAP followed by agglomerative clustering.

Parameters:

Name	Type	Description	Default
`X`	`Union[ndarray, List[ndarray]]`	The input data (N * D)	required
`n_clusters`	`int`	The number of clusters to use for clustering	`10`
`n_neighbors`	`int`	The number of neighbors to use for the UMAP algorithm	`100`
`min_dist`	`float`	The minimum distance between points in the UMAP embedding	`0.1`
`n_components`	`int`	The number of dimensions of the UMAP embedding	`2`
`umap_metric`	`str`	The metric to use for the UMAP algorithm	`'euclidean'`
`linkage`	`str`	The linkage to use for the AgglomerativeClustering algorithm	`'ward'`
`random_state`	`Optional[Union[int, RandomState]]`	The random state to use for the PCA algorithm and the Empirical Kernel Map	`None`
`n_jobs`	`int`	The number of jobs to use for the UMAP algorithm	`-1`
`return_embedding`	`bool`	Whether to return the UMAP embedding	`False`

Returns:

Type	Description
`Union[ndarray, Tuple[ndarray, ndarray]]`	Array of cluster labels, one per row of the input X. If return_embedding is True, returns a tuple of the cluster labels and the UMAP embedding.

Example

>>> import numpy as np
>>> from alinemol.splitters import get_umap_clusters
>>> X = np.random.rand(100, 128)
>>> clusters_indices, embedding = get_umap_clusters(X, n_clusters=10, n_jobs=1, return_embedding=True)
>>> print(clusters_indices)