API Reference

Module: rustmolbpe

atomwise_tokenize

def atomwise_tokenize(smiles: str) -> List[str]

Tokenize a SMILES string into atom-level tokens.

Arguments:

  • smiles (str): SMILES string to tokenize

Returns:

  • List[str]: List of atom-level tokens

Examples:

>>> rustmolbpe.atomwise_tokenize("CCO")
['C', 'C', 'O']

>>> rustmolbpe.atomwise_tokenize("c1ccccc1")
['c', '1', 'c', 'c', 'c', 'c', 'c', '1']

>>> rustmolbpe.atomwise_tokenize("[C@@H](O)C")
['[C@@H]', '(', 'O', ')', 'C']

>>> rustmolbpe.atomwise_tokenize("CBr")
['C', 'Br']
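
The tokenization shown above can be approximated in pure Python with a regular expression. This is a minimal sketch for illustration only: the pattern and the name atomwise_tokenize_sketch are assumptions, not rustmolbpe's actual implementation, which may treat edge cases differently.

```python
import re
from typing import List

# Hypothetical pattern for illustration; rustmolbpe's own rules may differ.
# Order matters: bracket atoms and two-letter halogens are tried before single letters.
_SMILES_ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%[0-9]{2}|[0-9]|[()=#\-+\\/:~@?>*$.])"
)


def atomwise_tokenize_sketch(smiles: str) -> List[str]:
    """Split a SMILES string into atom-level tokens with a regular expression."""
    tokens = _SMILES_ATOM_PATTERN.findall(smiles)
    # Guard against silently dropping characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


print(atomwise_tokenize_sketch("[C@@H](O)C"))
```

Trying "Br" and "Cl" before the single-letter alternatives is what keeps "CBr" from being split into 'C', 'B', 'r'.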

Class: SmilesTokenizer

BPE tokenizer for molecular SMILES strings.

Constructor

def __init__(self) -> None

Create a new tokenizer with special tokens initialized.

Special tokens are always at fixed IDs:

Token  ID  String
PAD    0   <pad>
UNK    1   <unk>
BOS    2   <bos>
EOS    3   <eos>

Training Methods

train_from_iterator

def train_from_iterator(
    self,
    iterator: Iterator[str],
    vocab_size: int,
    buffer_size: int = 8192,
    min_frequency: int = 2
) -> None

Train the tokenizer from a SMILES iterator.

Arguments:

  • iterator (Iterator[str]): Iterator yielding SMILES strings
  • vocab_size (int): Target vocabulary size (including special tokens and base atoms)
  • buffer_size (int, optional): Number of SMILES to buffer for parallel processing. Default: 8192
  • min_frequency (int, optional): Minimum frequency for a pair to be merged. Default: 2
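
One way to picture buffer_size is as the size of the chunks drawn from the iterator before each chunk is processed. The helper below is a hypothetical pure-Python sketch of that chunking, assuming the final chunk may be shorter; how rustmolbpe buffers internally is not documented here.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def buffered_sketch(iterable: Iterable[str], buffer_size: int = 8192) -> Iterator[List[str]]:
    """Yield lists of up to buffer_size items drawn from iterable."""
    iterator = iter(iterable)
    while True:
        # islice consumes at most buffer_size items without reading the rest.
        chunk = list(islice(iterator, buffer_size))
        if not chunk:
            return
        yield chunk


for chunk in buffered_sketch(["CCO", "CCC", "CBr", "c1ccccc1", "CC(=O)O"], buffer_size=2):
    print(chunk)
```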

Example:

import rustmolbpe

def smiles_generator(path):
    # Yield one SMILES string per line, skipping blank lines
    with open(path) as f:
        for line in f:
            smiles = line.strip()
            if smiles:
                yield smiles

tokenizer = rustmolbpe.SmilesTokenizer()
tokenizer.train_from_iterator(
    smiles_generator("molecules.smi"),
    vocab_size=8000
)
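
To illustrate how min_frequency and the merge budget interact, here is a toy BPE training loop in pure Python. It is a sketch under stated assumptions, not the library's parallel Rust implementation: train_bpe_sketch is a hypothetical name, ties between equally frequent pairs are broken arbitrarily, and the merge budget is taken to be vocab_size minus the special tokens and base atom tokens.

```python
from collections import Counter
from typing import List, Tuple


def train_bpe_sketch(
    corpus: List[List[str]],
    num_merges: int,
    min_frequency: int = 2,
) -> List[Tuple[str, str, str]]:
    """Learn up to num_merges merge rules from atom-level token sequences."""
    sequences = [list(seq) for seq in corpus]
    merges: List[Tuple[str, str, str]] = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts: Counter = Counter()
        for seq in sequences:
            pair_counts.update(zip(seq, seq[1:]))
        if not pair_counts:
            break
        (left, right), freq = pair_counts.most_common(1)[0]
        # Stop once the most frequent pair is rarer than min_frequency.
        if freq < min_frequency:
            break
        merged = left + right
        merges.append((left, right, merged))
        # Replace every occurrence of the pair, scanning left to right.
        for idx, seq in enumerate(sequences):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and seq[j] == left and seq[j + 1] == right:
                    out.append(merged)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            sequences[idx] = out
    return merges


print(train_bpe_sketch([["C", "C", "O"], ["C", "C", "C"]], num_merges=10))
```

In this toy corpus only the pair ('C', 'C') reaches the default min_frequency of 2, so training stops after one merge even though the budget allows ten.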

Vocabulary I/O

load_vocabulary

def load_vocabulary(self, path: str) -> None

Load vocabulary from a SMILESPE-format file.

Arguments:

  • path (str): Path to vocabulary file

Raises:

  • IOError: If file cannot be read

save_vocabulary

def save_vocabulary(self, path: str) -> None

Save vocabulary to a SMILESPE-format file.

Arguments:

  • path (str): Path to save vocabulary file

Raises:

  • IOError: If file cannot be written

Encoding Methods

encode

def encode(self, smiles: str, add_special_tokens: bool = False) -> List[int]

Encode a SMILES string to token IDs.

Arguments:

  • smiles (str): SMILES string to encode
  • add_special_tokens (bool, optional): If True, add BOS at start and EOS at end. Default: False

Returns:

  • List[int]: List of token IDs

Example:

# The IDs shown are illustrative; actual values depend on the loaded vocabulary
ids = tokenizer.encode("CCO")  # e.g. [42]
ids = tokenizer.encode("CCO", add_special_tokens=True)  # e.g. [2, 42, 3]

batch_encode

def batch_encode(
    self,
    smiles_list: List[str],
    add_special_tokens: bool = False
) -> List[List[int]]

Encode multiple SMILES strings in parallel.

Arguments:

  • smiles_list (List[str]): List of SMILES strings
  • add_special_tokens (bool, optional): If True, add BOS/EOS tokens. Default: False

Returns:

  • List[List[int]]: List of token ID lists

Decoding Methods

decode

def decode(self, ids: List[int]) -> str

Decode token IDs back to a SMILES string.

Arguments:

  • ids (List[int]): List of token IDs

Returns:

  • str: Decoded SMILES string

Raises:

  • ValueError: If an ID is not in vocabulary

batch_decode

def batch_decode(self, ids_list: List[List[int]]) -> List[str]

Decode multiple token sequences in parallel.

Arguments:

  • ids_list (List[List[int]]): List of token ID lists

Returns:

  • List[str]: List of decoded SMILES strings

Padding Methods

pad

def pad(
    self,
    sequences: List[List[int]],
    max_length: Optional[int] = None,
    padding: str = "right",
    truncation: bool = False,
    return_attention_mask: bool = True
) -> Dict[str, List[List[int]]]

Pad sequences to equal length.

Arguments:

  • sequences (List[List[int]]): List of token ID sequences
  • max_length (int, optional): Target length. If None, uses longest sequence length
  • padding (str, optional): Padding side, either "right" or "left". Default: "right"
  • truncation (bool, optional): If True, truncate sequences longer than max_length. Default: False
  • return_attention_mask (bool, optional): If True, include attention_mask in result. Default: True

Returns:

  • Dict[str, List[List[int]]]: Dictionary with keys:
    • "input_ids": Padded token ID sequences
    • "attention_mask": Attention masks (if requested)
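
The padding semantics described above can be sketched in pure Python. This is an illustration, not the library's code: pad_sketch is a hypothetical name, PAD is assumed to be ID 0 as in the special-token table, truncation is assumed to keep the leading tokens, and sequences longer than max_length are left unchanged when truncation is False.

```python
from typing import Dict, List, Optional


def pad_sketch(
    sequences: List[List[int]],
    max_length: Optional[int] = None,
    padding: str = "right",
    truncation: bool = False,
    pad_token_id: int = 0,
) -> Dict[str, List[List[int]]]:
    """Pad token ID sequences to a common length and build attention masks."""
    if padding not in ("right", "left"):
        raise ValueError("padding must be 'right' or 'left'")
    if max_length is None:
        # Fall back to the longest sequence in the batch.
        target = max((len(seq) for seq in sequences), default=0)
    else:
        target = max_length
    input_ids, attention_mask = [], []
    for seq in sequences:
        # Assumption: truncation keeps the first `target` tokens.
        ids = list(seq[:target]) if truncation else list(seq)
        fill = [pad_token_id] * max(target - len(ids), 0)
        mask = [1] * len(ids)
        if padding == "right":
            input_ids.append(ids + fill)
            attention_mask.append(mask + [0] * len(fill))
        else:
            input_ids.append(fill + ids)
            attention_mask.append([0] * len(fill) + mask)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


print(pad_sketch([[2, 42, 3], [2, 15, 16, 17, 3]]))
```

The attention mask marks real tokens with 1 and padding with 0, so it always has the same shape as input_ids.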

encode_batch_padded

def encode_batch_padded(
    self,
    smiles_list: List[str],
    max_length: Optional[int] = None,
    padding: str = "right",
    truncation: bool = False,
    add_special_tokens: bool = False,
    return_attention_mask: bool = True
) -> Dict[str, List[List[int]]]

Encode multiple SMILES and pad to equal length.

Convenience method combining batch_encode and pad.

Arguments:

  • smiles_list (List[str]): List of SMILES strings
  • max_length (int, optional): Target length. If None, uses longest sequence length
  • padding (str, optional): Padding side, either "right" or "left". Default: "right"
  • truncation (bool, optional): If True, truncate sequences longer than max_length. Default: False
  • add_special_tokens (bool, optional): If True, add BOS/EOS tokens. Default: False
  • return_attention_mask (bool, optional): If True, include attention_mask in result. Default: True

Returns:

  • Dict[str, List[List[int]]]: Dictionary with "input_ids" and optionally "attention_mask"

Example:

result = tokenizer.encode_batch_padded(
    ["CCO", "c1ccccc1"],
    max_length=10,
    add_special_tokens=True
)
# Illustrative output; actual IDs depend on the loaded vocabulary
print(result["input_ids"])       # [[2, 42, 3, 0, 0, ...], [2, 15, 3, 0, 0, ...]]
print(result["attention_mask"])  # [[1, 1, 1, 0, 0, ...], [1, 1, 1, 0, 0, ...]]

Vocabulary Access

get_vocabulary

def get_vocabulary(self) -> List[Tuple[str, int]]

Get vocabulary as (token, id) pairs.

Returns:

  • List[Tuple[str, int]]: List of (token_string, token_id) tuples
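
Because the vocabulary comes back as a list of (token, id) pairs, it converts directly into lookup dictionaries. The sketch below uses a made-up vocabulary in that shape; only the four special-token IDs come from this reference, and the remaining entries are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical data in the (token, id) shape returned by get_vocabulary();
# IDs above 3 are invented for illustration.
vocab: List[Tuple[str, int]] = [
    ("<pad>", 0), ("<unk>", 1), ("<bos>", 2), ("<eos>", 3),
    ("C", 4), ("O", 5), ("CC", 6),
]

# Forward and reverse lookup tables built from the pairs.
token_to_id_map: Dict[str, int] = dict(vocab)
id_to_token_map: Dict[int, str] = {idx: tok for tok, idx in vocab}

print(token_to_id_map["CC"], id_to_token_map[0])
```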

id_to_token

def id_to_token(self, id: int) -> str

Convert token ID to token string.

Arguments:

  • id (int): Token ID

Returns:

  • str: Token string

Raises:

  • ValueError: If ID is not in vocabulary

token_to_id

def token_to_id(self, token: str) -> int

Convert token string to token ID.

Arguments:

  • token (str): Token string

Returns:

  • int: Token ID

Raises:

  • ValueError: If token is not in vocabulary

Properties

Property         Type  Description
vocab_size       int   Total vocabulary size (special + base atoms + merges)
base_vocab_size  int   Number of base atom tokens
num_merges       int   Number of learned merge operations
pad_token_id     int   PAD token ID (always 0)
unk_token_id     int   UNK token ID (always 1)
bos_token_id     int   BOS token ID (always 2)
eos_token_id     int   EOS token ID (always 3)
pad_token        str   PAD token string (<pad>)
unk_token        str   UNK token string (<unk>)
bos_token        str   BOS token string (<bos>)
eos_token        str   EOS token string (<eos>)

State Inspection

is_trained

def is_trained(self) -> bool

Check whether the tokenizer has merge rules, either learned through training or loaded from a vocabulary file.

Returns:

  • bool: True if the tokenizer has merge rules, False otherwise

Example:

tokenizer = rustmolbpe.SmilesTokenizer()
print(tokenizer.is_trained())  # False

tokenizer.train_from_iterator(iter(["CCO", "CCC"]), vocab_size=50)
print(tokenizer.is_trained())  # True

get_merges

def get_merges(self) -> List[Tuple[str, str, str]]

Get the learned merge rules as tuples.

Returns:

  • List[Tuple[str, str, str]]: List of (left_token, right_token, merged_token) tuples, ordered by merge priority

Example:

tokenizer = rustmolbpe.SmilesTokenizer()
tokenizer.load_vocabulary("data/chembl36_vocab.txt")

merges = tokenizer.get_merges()
print(merges[:3])  # First 3 merge rules
# [('c', 'c', 'cc'), ('C', 'C', 'CC'), ('c', '1', 'c1')]
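
Merge rules in this (left, right, merged) form can be replayed over atom-level tokens in priority order, which is how BPE encoding is conventionally described. The function below is a pure-Python sketch of that idea; apply_merges_sketch is a hypothetical helper, and whether rustmolbpe applies merges in exactly this way is an assumption.

```python
from typing import List, Tuple


def apply_merges_sketch(
    tokens: List[str],
    merges: List[Tuple[str, str, str]],
) -> List[str]:
    """Apply merge rules in priority order to a list of atom-level tokens."""
    tokens = list(tokens)
    for left, right, merged in merges:
        out, i = [], 0
        while i < len(tokens):
            # Merge each non-overlapping occurrence of the pair, left to right.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens


merges = [("c", "c", "cc"), ("C", "C", "CC"), ("c", "1", "c1")]
print(apply_merges_sketch(["c", "1", "c", "c", "c", "c", "c", "1"], merges))
```

Joining the resulting tokens always reproduces the original SMILES string, since each merge only concatenates adjacent tokens.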

Serialization

Pickle Support

SmilesTokenizer supports Python's pickle protocol for serialization.

import pickle
import rustmolbpe

tokenizer = rustmolbpe.SmilesTokenizer()
tokenizer.load_vocabulary("data/chembl36_vocab.txt")

# Save to bytes
data = pickle.dumps(tokenizer)

# Restore from bytes
restored = pickle.loads(data)

# Verify
assert tokenizer.encode("CCO") == restored.encode("CCO")

Multiprocessing Example:

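from multiprocessing import Pool
import rustmolbpe

_tokenizer = None  # set once per worker by init_worker

def init_worker(tokenizer):
    # Runs once in each worker. Under the "spawn" start method the tokenizer
    # arrives here via pickle; under "fork" it is inherited from the parent.
    global _tokenizer
    _tokenizer = tokenizer

def encode_smiles(smiles):
    return _tokenizer.encode(smiles)

if __name__ == "__main__":
    # The guard keeps workers from re-running this block when the module is re-imported
    tokenizer = rustmolbpe.SmilesTokenizer()
    tokenizer.load_vocabulary("data/chembl36_vocab.txt")

    smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
    with Pool(4, initializer=init_worker, initargs=(tokenizer,)) as pool:
        results = pool.map(encode_smiles, smiles_list)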