QuantumChem-200K: A Large-Scale Open Organic Molecular Dataset for Quantum-Chemistry Property Screening and Language Model Benchmarking
We introduce QuantumChem-200K, a large-scale dataset of over 200,000 organic molecules annotated with seven quantum-chemical properties, including two-photon absorption (TPA) cross sections, TPA spectral ranges, singlet–triplet intersystem crossing (ISC) energies, toxicity and synthetic accessibility scores, solubility, and boiling point. These values are computed using a hybrid workflow that integrates density functional theory (DFT), semi-empirical excited-state methods, atomistic quantum solvers, and neural-network predictors. The data are available at https://huggingface.co/datasets/YinqiZeng704/200k_monomer_properties, and the paper is available at https://arxiv.org/abs/2511.21747
This repository contains scripts and notebooks for fine-tuning and evaluating large language models for SMILES-to-monomer property prediction. Given a molecular SMILES string, the model predicts relevant photochemical, physical, and synthetic properties such as:
- sigma at 780 nm
- maximum sigma
- ISC energy
- toxicity score
- synthetic accessibility score
- boiling point
- solubility
The main workflow includes:
- Fine-tuning a Qwen2.5-32B model with LoRA using Unsloth.
- Running batch inference on SMILES strings.
- Comparing model predictions against ground-truth property tables using weighted MAE, RMSE, and Pearson r.
- Running Claude Haiku, DeepSeek, Gemma, Llama, Phi, Mistral, GPT 5.2, and graph-learning baselines for comparison, using wMAE, precision, recall, Pareto precision, and hypervolume regret.
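The per-property comparison metrics listed above (MAE, RMSE, Pearson r) can be computed with a short helper. A minimal pure-Python sketch (the function name is illustrative; the evaluation notebook uses its own implementation):

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, Pearson r) for one property column."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    # Pearson correlation: covariance over the product of standard deviations.
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    cov = sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
    sd_t = math.sqrt(sum((t - mt) ** 2 for t in y_true))
    sd_p = math.sqrt(sum((p - mp) ** 2 for p in y_pred))
    r = cov / (sd_t * sd_p) if sd_t and sd_p else float("nan")
    return mae, rmse, r
```

For weighted MAE, each property's MAE is additionally multiplied by a per-property weight before averaging (see the evaluation section below).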
.
├── fine_tuning.py # LoRA fine-tuning script using Unsloth + Qwen2.5-32B
├── infer.ipynb # Batch inference notebook using a fine-tuned LoRA adapter
├── claude_infer.py # Claude baseline inference script
├── wmae_eval_sqrt.ipynb # Weighted MAE evaluation notebook with sqrt rebalancing
├── requirements.txt # Python dependencies
├── 100testbank.csv # Input SMILES file for batch inference
├── 3000testbank.csv # Ground-truth test set for evaluation
├── 100prediction.csv # Prediction CSV for evaluation
└── outputs/ # Training checkpoints and model outputs
Requirements: Python 3.10 or newer is recommended. A CUDA-capable NVIDIA GPU is strongly recommended because the training and inference scripts load Qwen2.5-32B in 4-bit mode and move tensors to CUDA.
cd /path/to/your/repo
python3 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
A minimal requirements.txt may include:
torch
transformers
datasets
trl
peft
accelerate
bitsandbytes
unsloth
pandas
numpy
matplotlib
ipython
jupyter
anthropic
For GPU-accelerated PyTorch, install the CUDA build that matches your system from the official PyTorch installation page:
https://pytorch.org/get-started/locally/
Before running inference or uploading models, set your API tokens as environment variables.
export HF_TOKEN="your_huggingface_token"
export ANTHROPIC_API_KEY="your_anthropic_api_key"
Do not hard-code tokens in notebooks or scripts before pushing to GitHub.
In infer.ipynb, replace any hard-coded Hugging Face token with:
import os
HF_TOKEN = os.getenv("HF_TOKEN")
In claude_infer.py, replace:
api_key=os.getenv("your api key")
with:
api_key=os.getenv("ANTHROPIC_API_KEY")
The main training script is:
python fine_tuning.py
This script fine-tunes:
Base model: unsloth/Qwen2.5-32B
Dataset: YinqiZeng704/200k_monomer_properties
Method: LoRA fine-tuning
Quantization: 4-bit
Output directory: outputs/
The training data are formatted using an Alpaca-style prompt:
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Based on the given SMILES string, predict the monomer's relevant properties.
### Input:
<SMILES>
### Response:
<property prediction>
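The Alpaca-style prompt above can be assembled programmatically. A minimal sketch (the template text matches the block above; `format_example` is an illustrative name, not a function from the training script):

```python
# Alpaca-style prompt template used for SMILES-to-property prediction.
ALPACA_TEMPLATE = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Based on the given SMILES string, predict the monomer's relevant properties.

### Input:
{smiles}

### Response:
{response}"""

def format_example(smiles, response=""):
    # Leave response empty at inference time so the model completes it;
    # fill it with the ground-truth property sentence during training.
    return ALPACA_TEMPLATE.format(smiles=smiles, response=response)
```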
The current script resumes training from a hard-coded checkpoint path:
trainer.train(resume_from_checkpoint="/mnt/shared/gpfs/home/renjie2/fine_tune/forward/outputs/checkpoint-3000")
For a fresh training run, change this to:
trainer.train()
Alternatively, replace the checkpoint path with your own local checkpoint path.
The fine-tuning script produces:
outputs/ # model checkpoints
training_log.csv # exported training log
The script can also push the trained adapter and tokenizer to Hugging Face:
model.push_to_hub("YinqiZeng704/200k_model", token="your token")
tokenizer.push_to_hub("YinqiZeng704/200k_model", token="your token")
Before public release, replace the token with HF_TOKEN from the environment.
Open the inference notebook:
jupyter notebook infer.ipynb
or:
jupyter lab infer.ipynb
The notebook loads:
Base model: unsloth/Qwen2.5-32B
LoRA adapter: YinqiZeng704/200k_model_v2
It runs a single test prediction and then performs batch inference on:
3000testbank.csv
The CSV should contain a column named:
SMILES
Example:
SMILES
CC12NC1C1C(C#N)C21
C=C(C)OC
NC(=O)C1=CCCCC1
The notebook writes batch outputs to:
3000outputs.txt
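The batch step reduces to reading the required SMILES column and looping over it. A minimal standard-library sketch (the helper name is illustrative; the notebook uses its own loading code):

```python
import csv

def load_smiles(csv_path, column="SMILES"):
    """Read the SMILES column from an input CSV (header row required).

    Raises ValueError when the required column is missing, and skips
    blank rows so the batch loop only sees usable SMILES strings.
    """
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        if column not in (reader.fieldnames or []):
            raise ValueError(f"Missing required column: {column}")
        return [row[column].strip() for row in reader if row[column].strip()]
```

Each returned SMILES string is then formatted into the Alpaca-style prompt and passed through the model, with the generated text appended to the output file.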
For example, the Claude baseline script is:
python claude_infer.py
This script performs:
- A single test prediction for one SMILES string.
- Batch inference over the first 3000 SMILES strings in 3000.csv.
- Output saving to 3000outputs_claude_haiku.txt.
Expected input:
3000testbank.csv
Required column:
SMILES
Output file:
3000outputs_claude_haiku.txt
The script currently uses:
model = claude-haiku-4.5
temperature = 0.2
max_tokens = 256
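The request the script sends per SMILES can be sketched as a plain dict with those settings. A minimal sketch (the prompt wording and function name are illustrative; in the real script the dict would be passed to the Anthropic SDK's `client.messages.create(**request)`):

```python
def build_claude_request(smiles,
                         model="claude-haiku-4.5",
                         temperature=0.2,
                         max_tokens=256):
    """Assemble the request body for one SMILES property prediction.

    The parameter defaults mirror the settings listed above; the prompt
    text is an illustrative stand-in for the script's actual prompt.
    """
    return {
        "model": model,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": ("Based on the given SMILES string, predict the "
                        f"monomer's relevant properties.\nSMILES: {smiles}"),
        }],
    }
```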
Open the evaluation notebook:
jupyter notebook wmae_eval_sqrt.ipynb
or:
jupyter lab wmae_eval_sqrt.ipynb
The notebook compares a ground-truth CSV and a prediction CSV:
truth_path = "100testbank.csv"
pred_path = "100prediction.csv"
It excludes:
exclude = {"wavelength_range"}
The notebook computes the weighted mean absolute error (wMAE) using square-root rebalancing, where:
- K is the number of evaluated properties.
- M is the number of evaluated molecules.
- n_i is the number of valid samples for property i.
- r_i is the numeric range used for property normalization.
- y_i is the ground-truth value.
- ŷ_i is the predicted value.
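The notebook's formula itself is not reproduced in this README. One plausible form of a sqrt-rebalanced wMAE consistent with the symbols above (an assumption on our part, not necessarily the notebook's exact expression) is:

```latex
\mathrm{wMAE} = \frac{1}{M} \sum_{j=1}^{M} \; \sum_{i \in I(j)} w_i \,
\bigl| y_i^{(j)} - \hat{y}_i^{(j)} \bigr|,
\qquad
w_i = \frac{1}{r_i} \cdot \frac{K \sqrt{1/n_i}}{\sum_{k=1}^{K} \sqrt{1/n_k}}
```

Here I(j) denotes the set of properties with a valid ground-truth value for molecule j; dividing by r_i normalizes each property's error by its numeric range, and the sqrt(1/n_i) term upweights properties with fewer valid samples.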
Evaluation outputs:
wmae_details_sqrt.csv
wmae_contribution.png
wmae_contribution.pdf
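The square-root rebalancing of the per-property weights can be sketched in a few lines. A minimal sketch (the function name is illustrative, and the exact constants in the notebook may differ):

```python
import math

def sqrt_rebalanced_weights(n_valid, ranges):
    """Per-property weights: inverse-range scaling times a sqrt(1/n_i)
    rebalancing term, normalized over the K properties.

    n_valid: {property: number of valid samples n_i}
    ranges:  {property: numeric range r_i used for normalization}
    """
    k = len(n_valid)
    inv_sqrt = {p: math.sqrt(1.0 / n) for p, n in n_valid.items()}
    norm = sum(inv_sqrt.values())
    # Properties with fewer valid samples or smaller ranges get larger weights.
    return {p: (1.0 / ranges[p]) * k * inv_sqrt[p] / norm for p in n_valid}
```

The wMAE is then the average over molecules of the weight-scaled absolute errors for each property with a valid ground-truth value.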
python fine_tuning.py
jupyter notebook infer.ipynb
python claude_infer.py
jupyter notebook wmae_eval_sqrt.ipynb
Model checkpoints, training outputs, prediction files, and logs can become large. These files should generally not be committed to GitHub.
Recommended .gitignore:
.venv/
__pycache__/
.ipynb_checkpoints/
outputs/
runs/
checkpoints/
*.pt
*.pth
*.bin
*.safetensors
training_log.csv
3000outputs.txt
3000outputs_claude_haiku.txt
wmae_details_sqrt.csv
wmae_contribution.png
wmae_contribution.pdf
.env
Before pushing this repository to GitHub:
- Remove all hard-coded API tokens.
- Remove private Hugging Face tokens from notebooks.
- Replace local absolute paths with relative paths.
- Do not commit model checkpoints unless intentionally releasing them.
- Use environment variables for all private credentials.
SMILES: C=C(C)OC
The monomer compound has sigma of ... GM at 780 nm, maximum sigma of ... GM, ISC of ... eV, toxicity score of ..., SA score of ..., boiling point of ... °C, logP of ..., aromaticity of ..., solubility of ... ug/mol, molecular weight of ... g/mol.
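Numeric values can be pulled back out of this response sentence with a regular expression. A minimal sketch covering the first three fields (the pattern and names are illustrative, and the numbers in the test are made-up placeholders, not real predictions):

```python
import re

# Matches the leading fields of the response template above; extend the
# pattern analogously for toxicity, SA score, boiling point, etc.
FIELD_PATTERN = re.compile(
    r"sigma of (?P<sigma_780>[-\d.]+) GM at 780 nm, "
    r"maximum sigma of (?P<sigma_max>[-\d.]+) GM, "
    r"ISC of (?P<isc>[-\d.]+) eV"
)

def parse_prediction(text):
    """Extract numeric fields from the model's response sentence."""
    m = FIELD_PATTERN.search(text)
    return {k: float(v) for k, v in m.groupdict().items()} if m else {}
```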
Using QuantumChem-200K, we fine-tuned the open-source Qwen2.5-32B LLM to create a chemistry AI assistant capable of forward polymer property prediction from SMILES. Domain-specific fine-tuning significantly improves prediction accuracy over baselines such as GPT-4o, Llama-3.1-70B, and the base Qwen2.5-32B model. The evaluation metric used is the weighted MAE (wMAE) described above.
If you use this repository in academic work, please cite the associated project or manuscript.
@misc{quantumchem-zeng2026,
  title  = {QuantumChem-200K: A Large-Scale Organic Molecular Dataset for Quantum-Chemistry Property Screening and Language Model Benchmarking},
  author = {Yinqi Zeng and Shangding Gu and Jiancong Xiao and Ruizhong Qiu and Yuanchen Bei and Hanghang Tong and Renjie Li},
  year   = {2026},
  note   = {under review},
  url    = {https://arxiv.org/abs/2511.21747}
}
License: the data are released under the GNU General Public License v3 and the code under the MIT License.