rustmolbpe

A high-performance BPE (Byte Pair Encoding) tokenizer for molecular SMILES written in Rust with Python bindings.

Current version: 0.2.0

Features

SMILES-aware tokenization: Correctly handles multi-character atoms (Br, Cl), bracket atoms ([C@@H], [N+]), ring closures, and stereochemistry
Fast training: Parallel processing with Rayon for efficient training on large molecular datasets
Streaming support: Train on datasets of any size with configurable buffer sizes
Special tokens: Built-in PAD, UNK, BOS, EOS tokens for sequence modeling
Batch padding: Ready for transformer models with attention masks
SMILESPE compatibility: Load and save vocabularies in SMILESPE format
Pickle support: Full serialization support for multiprocessing workflows
Type hints: PEP 561-compliant, with a py.typed marker
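To make "SMILES-aware" concrete, here is a minimal pure-Python sketch of the general idea: split a SMILES string into chemically meaningful units first, then apply BPE merges over those units rather than over raw characters. This is not rustmolbpe's implementation. The regular expression and the names `atom_tokenize`, `most_frequent_pair`, and `merge_pair` are hypothetical, and the pattern only covers bracket atoms, Br/Cl, and two-digit ring closures.

```python
import re
from collections import Counter

# Illustrative pattern (an assumption, not taken from rustmolbpe):
# bracket atoms first, then two-letter halogens, then two-digit ring
# closures such as %12, then any remaining single character.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    return ATOM_PATTERN.findall(smiles)


def most_frequent_pair(sequences):
    """Return the adjacent token pair that occurs most often."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]


def merge_pair(seq, pair):
    """Replace every occurrence of `pair` in `seq` with one merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


# "Br", "Cl", and "[C@@H]" stay intact instead of being split per character.
print(atom_tokenize("C[C@@H](Br)Cl"))  # ['C', '[C@@H]', '(', 'Br', ')', 'Cl']

# One BPE training step on a toy corpus: find the top pair, then merge it.
corpus = [atom_tokenize(s) for s in ["CCO", "CCN", "CCC"]]
pair = most_frequent_pair(corpus)
print(pair)                         # ('C', 'C')
print(merge_pair(corpus[2], pair))  # ['CC', 'C']
```

A real trainer repeats the count-and-merge step until the target vocabulary size is reached and records each merge in order.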

Performance

rustmolbpe is significantly faster than the original Python SMILESPE implementation:

Operation	Speedup
Encoding	25-35x faster
Training	16-18x faster

Throughput

Batch encoding: ~200,000-280,000 SMILES/second
Training: 2.8M molecules in ~100 seconds (roughly 28,000 molecules/second)

Installation

From PyPI

pip install rustmolbpe

From source

# Install maturin
pip install maturin

# Clone and build
git clone https://github.com/HFooladi/rustmolbpe.git
cd rustmolbpe
maturin develop --release

Quick Example

import rustmolbpe

# Create tokenizer and load vocabulary
tokenizer = rustmolbpe.SmilesTokenizer()
tokenizer.load_vocabulary("data/chembl36_vocab.txt")

# Encode SMILES
ids = tokenizer.encode("CC(=O)Nc1ccc(O)cc1")  # paracetamol
print(ids)  # [2864, 1077]

# Decode back
smiles = tokenizer.decode(ids)
print(smiles)  # CC(=O)Nc1ccc(O)cc1

# Batch processing with padding (for ML)
result = tokenizer.encode_batch_padded(
    ["CCO", "c1ccccc1", "CC(=O)O"],
    add_special_tokens=True,
    return_attention_mask=True
)
print(result["input_ids"])
print(result["attention_mask"])
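For readers unfamiliar with padded batches, the sketch below shows in plain Python what an `input_ids` / `attention_mask` pair typically looks like. It is only an illustration under stated assumptions: `pad_batch` is a hypothetical helper, the padding ID of 0 and right-side padding are assumed, and the token IDs are made up. The IDs and padding behavior that rustmolbpe actually uses may differ.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad each ID list to the longest length and build a 0/1 mask.

    pad_id=0 is an assumed value for illustration only.
    """
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [
        [1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token IDs for two sequences of different lengths.
result = pad_batch([[5, 6, 7], [8]])
print(result["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(result["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

Every row ends up the same length, so the batch can be converted directly into a rectangular tensor, and the mask tells a transformer which positions hold real tokens.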

Pre-trained Vocabularies

Pre-trained vocabularies are included:

data/chembl36_vocab.txt - Trained on ChEMBL 36 (2.8M drug-like molecules)
data/pubchem_10M_vocab.txt - Trained on PubChem (10M diverse molecules)

License

MIT License - see LICENSE