`themap.data`

MoleculeDatapoint `dataclass`

Data structure holding information for a single molecule and associated features.

This class represents a single molecule datapoint with its associated features and labels. It provides methods to compute molecular fingerprints and features, and exposes common molecular descriptors as properties.

Attributes:

Name	Type	Description
`task_id`	`str`	String describing the task this datapoint is taken from.
`smiles`	`str`	SMILES string describing the molecule this datapoint corresponds to.
`bool_label`	`bool`	Boolean classification label, usually derived from the numeric label using a threshold.
`numeric_label`	`Optional[float]`	Numeric label (e.g., activity), usually measured in the lab.
`_rdkit_mol`	`Optional[Mol]`	Cached RDKit molecule object.
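
The relationship between `numeric_label` and `bool_label` can be sketched as below. This is only an illustration: the helper name `derive_bool_label`, the threshold value, and the "greater than or equal" direction are assumptions, not part of the documented API.

```python
from typing import Optional


def derive_bool_label(numeric_label: Optional[float], threshold: float) -> Optional[bool]:
    """Hypothetical helper: label a measurement as positive when it
    meets or exceeds the threshold (the direction is an assumption)."""
    if numeric_label is None:
        return None  # no measurement, so no label can be derived
    return numeric_label >= threshold


print(derive_bool_label(0.8, threshold=0.5))  # True
print(derive_bool_label(0.2, threshold=0.5))  # False
```

Whether larger or smaller values count as positive depends on the assay, so the comparison should be chosen per task.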

Properties

Name	Type	Description
`number_of_atoms`	`int`	Number of heavy atoms in the molecule.
`number_of_bonds`	`int`	Number of bonds in the molecule.
`molecular_weight`	`float`	Molecular weight in atomic mass units.
`logp`	`float`	Octanol-water partition coefficient (LogP).
`num_rotatable_bonds`	`int`	Number of rotatable bonds in the molecule.
`smiles_canonical`	`str`	Canonical SMILES representation.
`rdkit_mol`	`Chem.Mol`	RDKit molecule object (lazy loaded).

Methods:

Name	Description
`from_dict`	Create datapoint from dictionary (class method)
`to_dict`	Convert datapoint to dictionary
`get_fingerprint`	Computes and returns the Morgan fingerprint for the molecule
`get_features`	Computes and returns molecular features using specified featurizer
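
To illustrate what a `from_dict` / `to_dict` round trip typically looks like, here is a minimal sketch using a stand-in dataclass. `SketchDatapoint` is hypothetical, and the dictionary keys are assumed to mirror the attribute names listed above; the real `MoleculeDatapoint` may serialize differently.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class SketchDatapoint:
    """Hypothetical stand-in for illustration; not the themap class."""

    task_id: str
    smiles: str
    bool_label: bool
    numeric_label: Optional[float] = None

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SketchDatapoint":
        # Assumed key names: they simply mirror the attribute names.
        return cls(
            task_id=data["task_id"],
            smiles=data["smiles"],
            bool_label=bool(data["bool_label"]),
            numeric_label=data.get("numeric_label"),
        )

    def to_dict(self) -> Dict[str, Any]:
        # asdict() returns a plain dict of the dataclass fields.
        return asdict(self)


record = {
    "task_id": "toxicity_prediction",
    "smiles": "CCCO",
    "bool_label": True,
    "numeric_label": 0.8,
}
restored = SketchDatapoint.from_dict(record)
print(restored.to_dict() == record)  # True
```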

Notes

By design, a SMILES string that is invalid and cannot be parsed by RDKit raises an InvalidSMILESError. Validate and sanitize your SMILES strings before creating a MoleculeDatapoint.
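
One way to pre-screen SMILES strings before constructing datapoints is sketched below. The helper `split_valid_smiles` is hypothetical and not part of themap. By default it relies on RDKit's `Chem.MolFromSmiles`, which returns `None` when a string cannot be parsed; the `parse` argument is an assumption added here so that another parser can be substituted.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple


def split_valid_smiles(
    smiles_list: Iterable[str],
    parse: Optional[Callable[[str], Any]] = None,
) -> Tuple[List[str], List[str]]:
    """Hypothetical helper: separate parseable from unparseable SMILES."""
    if parse is None:
        # Imported lazily so RDKit is only required for the default parser.
        from rdkit import Chem

        parse = Chem.MolFromSmiles

    valid: List[str] = []
    invalid: List[str] = []
    for smiles in smiles_list:
        # A parser result of None is treated as "could not be parsed".
        if parse(smiles) is not None:
            valid.append(smiles)
        else:
            invalid.append(smiles)
    return valid, invalid
```

With RDKit installed, calling `split_valid_smiles(["CCCO", "C(C"])` should place the second string, which has an unclosed branch, in the invalid list. Only the strings in the valid list would then be passed on to `MoleculeDatapoint`.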

Example

Create a molecule datapoint

datapoint = MoleculeDatapoint(
    task_id="toxicity_prediction",
    smiles="CCCO",  # propanol
    bool_label=True,
    numeric_label=0.8,
)

Access molecular properties

print(f"Number of heavy atoms: {datapoint.number_of_atoms}")

Number of heavy atoms: 4

print(f"Molecular weight: {datapoint.molecular_weight:.2f}")

Molecular weight: 60.06

print(f"LogP: {datapoint.logp:.2f}")

LogP: 0.39

print(f"Number of rotatable bonds: {datapoint.num_rotatable_bonds}")

Number of rotatable bonds: 1

print(f"SMILES canonical: {datapoint.smiles_canonical}")

SMILES canonical: CCCO

Get molecular features

fingerprint = datapoint.get_fingerprint()
print(f"Fingerprint shape: {fingerprint.shape if fingerprint is not None else None}")

Fingerprint shape: (2048,)

features = datapoint.get_features(featurizer_name="ecfp")
print(f"Features shape: {features.shape if features is not None else None}")

Features shape: (2048,)

rdkit_mol `property`

rdkit_mol: Optional[Mol]

Get the RDKit molecule object.

This property lazily initializes the RDKit molecule if it hasn't been created yet. The molecule is cached to avoid recreating it multiple times.

Returns:

Type	Description
`Optional[Mol]`	RDKit molecule object. Returns None if molecule creation fails.
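
The lazy-initialization-and-cache behavior described above can be sketched with a small stand-in class. `LazyMolHolder` and its `parser` argument are hypothetical and only illustrate the pattern; they do not reproduce the actual `MoleculeDatapoint` implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class LazyMolHolder:
    """Hypothetical stand-in showing a lazily created, cached object."""

    smiles: str
    parser: Callable[[str], Any]  # e.g. RDKit's Chem.MolFromSmiles
    _mol: Optional[Any] = field(default=None, init=False, repr=False)

    @property
    def mol(self) -> Optional[Any]:
        # Build the object on first access only, then reuse the cached one.
        if self._mol is None:
            self._mol = self.parser(self.smiles)
        return self._mol


calls = []


def counting_parser(smiles: str) -> dict:
    """Toy parser that records how often it is invoked."""
    calls.append(smiles)
    return {"smiles": smiles}


holder = LazyMolHolder("CCCO", counting_parser)
first = holder.mol
second = holder.mol
print(first is second, len(calls))  # True 1
```

In this sketch a parser that returns `None` is retried on every access, because `None` doubles as the "not yet created" marker.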

number_of_atoms `property`

number_of_atoms: int

Get the number of heavy atoms in the molecule.

Returns:

Type	Description
`int`	Number of heavy atoms in the molecule.