rustmolbpe¶
A high-performance BPE (Byte Pair Encoding) tokenizer for molecular SMILES written in Rust with Python bindings.
Current version: 0.2.0
Features¶
- SMILES-aware tokenization: Correctly handles multi-character atoms (Br, Cl), bracket atoms ([C@@H], [N+]), ring closures, and stereochemistry
- Fast training: Parallel processing with Rayon for efficient training on large molecular datasets
- Streaming support: Train on datasets of any size with configurable buffer sizes
- Special tokens: Built-in PAD, UNK, BOS, EOS tokens for sequence modeling
- Batch padding: Ready for transformer models with attention masks
- SMILESPE compatibility: Load and save vocabularies in SMILESPE format
- Pickle support: Full serialization support for multiprocessing workflows
- Type hints: PEP 561 compliant with
py.typedmarker
Performance¶
rustmolbpe is significantly faster than the original Python SMILESPE implementation:
| Operation | Speedup |
|---|---|
| Encoding | 25-35x faster |
| Training | 16-18x faster |
Throughput¶
- Batch encoding: ~200,000-280,000 SMILES/second
- Training: 2.8M molecules in ~100 seconds
Installation¶
From PyPI¶
From source¶
# Install maturin
pip install maturin
# Clone and build
git clone https://github.com/HFooladi/rustmolbpe.git
cd rustmolbpe
maturin develop --release
Quick Example¶
import rustmolbpe
# Create tokenizer and load vocabulary
tokenizer = rustmolbpe.SmilesTokenizer()
tokenizer.load_vocabulary("data/chembl36_vocab.txt")
# Encode SMILES
ids = tokenizer.encode("CC(=O)Nc1ccc(O)cc1") # paracetamol
print(ids) # [2864, 1077]
# Decode back
smiles = tokenizer.decode(ids)
print(smiles) # CC(=O)Nc1ccc(O)cc1
# Batch processing with padding (for ML)
result = tokenizer.encode_batch_padded(
["CCO", "c1ccccc1", "CC(=O)O"],
add_special_tokens=True,
return_attention_mask=True
)
print(result["input_ids"])
print(result["attention_mask"])
Pre-trained Vocabularies¶
Pre-trained vocabularies are included:
data/chembl36_vocab.txt- Trained on ChEMBL 36 (2.8M drug-like molecules)data/pubchem_10M_vocab.txt- Trained on PubChem (10M diverse molecules)
License¶
MIT License - see LICENSE