alinemol.preprocessing
standardize_smiles
Standardization of a SMILES string.
Uses the Standardizer to perform a sequence of cleaning operations on a SMILES string.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | DataFrame | pd.DataFrame with | required |
taut_canonicalization | bool | whether or not to use tautomer canonicalization | True |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame with 'canonical_smiles', 'molecular_weight', and 'num_atoms' additional columns |
drop_duplicates
Remove conflicting duplicates from a DataFrame.
This function processes the DataFrame to:
- Drop rows where the 'canonical_smiles' values are the same but the labels differ (conflicting rows).
- Retain only one row for each set of identical 'canonical_smiles' values with the same label.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | DataFrame | The input DataFrame containing a 'canonical_smiles' column. | required |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A DataFrame with conflicting duplicates removed, ensuring unique rows. |
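The two rules above can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation: the helper name `drop_conflicting_duplicates` and the `label_col` parameter are hypothetical, the label column is assumed to be called 'label', and a conflicting 'canonical_smiles' group is assumed to be removed entirely.

```python
import pandas as pd


def drop_conflicting_duplicates(x: pd.DataFrame, label_col: str = "label") -> pd.DataFrame:
    """Hypothetical stand-in that mimics the documented behaviour."""
    # Number of distinct labels observed for each canonical SMILES.
    n_labels = x.groupby("canonical_smiles")[label_col].transform("nunique")
    # Assumption: every row of a group with more than one label is dropped.
    consistent = x[n_labels == 1]
    # Keep a single row for each remaining canonical SMILES.
    return consistent.drop_duplicates(subset="canonical_smiles").reset_index(drop=True)


df = pd.DataFrame(
    {
        "canonical_smiles": ["CCO", "CCO", "c1ccccc1", "c1ccccc1", "CC(=O)O"],
        "label": [1, 1, 0, 1, 0],
    }
)
result = drop_conflicting_duplicates(df)
print(result)
```

In this toy input, the two 'CCO' rows agree and collapse to one row, while both 'c1ccccc1' rows are discarded because their labels differ.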
standardization_pipeline
Standardization pipeline for a DataFrame.
This function performs the following operations on the input DataFrame:
- Standardize the 'smiles' column using the standardize_smiles function.
- Drop conflicting duplicates using the drop_duplicates function.
- Return a DataFrame with 'smiles' and 'label' columns.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
x | DataFrame | The input DataFrame containing a 'smiles' column. | required |
taut_canonicalization | bool | Whether or not to use tautomer canonicalization. | True |
Returns:
Type | Description |
---|---|
DataFrame | pd.DataFrame: A DataFrame with standardized 'canonical_smiles' values and conflicting duplicates removed. |
Note
The input DataFrame must contain 'smiles' and label columns.
The output DataFrame will contain 'smiles' and label columns.
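The data flow of the pipeline can be sketched with pandas alone. Everything named here is hypothetical: `toy_standardize` only strips whitespace and is a placeholder for the real chemistry-aware standardize_smiles step, `pipeline_sketch` is not the library function, the label column is assumed to be called 'label', and the returned 'smiles' column is assumed to hold the standardized values.

```python
import pandas as pd


def toy_standardize(x: pd.DataFrame) -> pd.DataFrame:
    # Placeholder only: real standardization needs a cheminformatics toolkit.
    out = x.copy()
    out["canonical_smiles"] = out["smiles"].str.strip()
    return out


def pipeline_sketch(x: pd.DataFrame, standardize=toy_standardize) -> pd.DataFrame:
    # Step 1: add a 'canonical_smiles' column.
    out = standardize(x)
    # Step 2: drop groups with conflicting labels, then exact duplicates.
    n_labels = out.groupby("canonical_smiles")["label"].transform("nunique")
    out = out[n_labels == 1].drop_duplicates(subset="canonical_smiles")
    # Step 3: return only 'smiles' and 'label' (assumption: standardized values).
    out = out[["canonical_smiles", "label"]].rename(columns={"canonical_smiles": "smiles"})
    return out.reset_index(drop=True)


raw = pd.DataFrame(
    {
        "smiles": ["CCO", " CCO", "CCN", "CCN ", "CCC"],
        "label": [1, 1, 0, 1, 0],
    }
)
clean = pipeline_sketch(raw)
print(clean)
```

Passing the standardizer as an argument keeps the sketch runnable without a chemistry dependency; in practice the library's own standardize_smiles would take its place.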