alinemol.preprocessing

standardize_smiles

standardize_smiles(
    x: DataFrame, taut_canonicalization: bool = True
) -> pd.DataFrame

Standardize the SMILES strings in a DataFrame.

Uses the Standardizer to perform a sequence of cleaning operations on each SMILES string.

Parameters:

Name                   Type       Description                                       Default
x                      DataFrame  pd.DataFrame with a 'smiles' column.              required
taut_canonicalization  bool       Whether or not to use tautomer canonicalization.  True

Returns:

Type       Description
DataFrame  pd.DataFrame with 'canonical_smiles', 'molecular_weight', and 'num_atoms' additional columns.

drop_duplicates

drop_duplicates(x: DataFrame) -> pd.DataFrame

Remove conflicting duplicates from a DataFrame.

This function processes the DataFrame to:
  • Drop rows where the 'canonical_smiles' values are the same but the labels differ (conflicting rows).
  • Retain only one row for each set of identical 'canonical_smiles' values with the same label.
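
The two rules above can be sketched in plain pandas. This is an illustrative re-implementation, not the library's code: the helper name drop_conflicting_duplicates and the 'label' column name are assumptions made for the example.

```python
import pandas as pd


def drop_conflicting_duplicates(df: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Hypothetical helper that mimics the documented behaviour (assumed label column)."""
    # Count the distinct labels seen for each canonical SMILES.
    n_labels = df.groupby("canonical_smiles")[label_col].transform("nunique")
    # Rule 1: discard every row of a molecule whose duplicates disagree on the label.
    consistent = df[n_labels == 1]
    # Rule 2: keep a single row for each remaining molecule.
    return consistent.drop_duplicates(subset="canonical_smiles").reset_index(drop=True)


df = pd.DataFrame(
    {
        "canonical_smiles": ["CCO", "CCO", "c1ccccc1", "c1ccccc1", "CC(=O)O"],
        "label": [1, 1, 0, 1, 0],
    }
)
result = drop_conflicting_duplicates(df)
print(result)
```

Here the two benzene rows carry different labels, so both are removed, while the two agreeing ethanol rows collapse into one.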

Parameters:

Name  Type       Description                                                  Default
x     DataFrame  The input DataFrame containing a 'canonical_smiles' column.  required

Returns:

Type       Description
DataFrame  A DataFrame with conflicting duplicates removed, ensuring unique rows.

standardization_pipeline

standardization_pipeline(
    x: DataFrame, taut_canonicalization: bool = True
) -> pd.DataFrame

Standardization pipeline for a DataFrame.

This function performs the following operations on the input DataFrame:
  • Standardize the 'smiles' column using the standardize_smiles function.
  • Drop conflicting duplicates using the drop_duplicates function.
  • Return a DataFrame with 'smiles' and 'label' columns.
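
The order of these steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: sketch_pipeline is a hypothetical name, the canonicalizer is a toy lookup table standing in for real chemical standardization, and placing the canonical form in the output 'smiles' column is a guess at how the returned columns are filled.

```python
from typing import Callable

import pandas as pd


def sketch_pipeline(df: pd.DataFrame, canonicalize: Callable[[str], str]) -> pd.DataFrame:
    """Hypothetical stand-in showing the order of the three documented steps."""
    out = df.copy()
    # Step 1: derive a 'canonical_smiles' column from 'smiles'.
    out["canonical_smiles"] = out["smiles"].map(canonicalize)
    # Step 2: drop molecules with conflicting labels, then keep one row per molecule.
    n_labels = out.groupby("canonical_smiles")["label"].transform("nunique")
    out = out[n_labels == 1].drop_duplicates(subset="canonical_smiles")
    # Step 3: return only 'smiles' and 'label' (assumption: 'smiles' holds the canonical form).
    return (
        out[["canonical_smiles", "label"]]
        .rename(columns={"canonical_smiles": "smiles"})
        .reset_index(drop=True)
    )


# Toy canonicalizer: a fixed lookup table, used here only so the example runs
# without a cheminformatics dependency.
TOY_CANONICAL = {"OCC": "CCO", "C1=CC=CC=C1": "c1ccccc1"}

raw = pd.DataFrame(
    {
        "smiles": ["OCC", "CCO", "C1=CC=CC=C1", "c1ccccc1"],
        "label": [1, 1, 0, 1],
    }
)
clean = sketch_pipeline(raw, lambda s: TOY_CANONICAL.get(s, s))
print(clean)
```

In this toy input, both spellings of ethanol agree on the label and collapse to one row, while the two spellings of benzene disagree and are dropped.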

Parameters:

Name                   Type       Description                                        Default
x                      DataFrame  The input DataFrame containing a 'smiles' column.  required
taut_canonicalization  bool       Whether or not to use tautomer canonicalization.   True

Returns:

Type       Description
DataFrame  A DataFrame with standardized 'canonical_smiles' values and conflicting duplicates removed.

Note

The input DataFrame must contain 'smiles' and 'label' columns. The output DataFrame will contain 'smiles' and 'label' columns.