
[Hackathon 10th Spring No.6] Reproducing the CrystalLLM crystal structure generation model with PaddleMaterials - RFC design document #1256

Open
cloudforge1 wants to merge 4 commits into PaddlePaddle:master from cloudforge1:task/006-rfc-crystalllm

Conversation

@cloudforge1
Contributor

[Hackathon 10th Spring No.6] Reproducing the CrystalLLM crystal structure generation model with PaddleMaterials

1. Overview

This RFC is the design document for task No.6 of the 10th PaddlePaddle Hackathon: reproducing the CrystalLLM model on the PaddleMaterials (ppmat) framework.

CrystalLLM is an autoregressive language model based on the GPT-2 architecture. It represents a crystal structure as a CIF (Crystallographic Information File) text sequence and generates new crystal structures from scratch via autoregressive decoding. The paper was published in Nature Communications 2024 (DOI: 10.1038/s41467-024-54639-7).

2. Task Requirements

Task type: Crystal Structure Generation
Model name: CrystalLLM
Difficulty: ⭐⭐⭐
Tags: GPT-2, CIF Tokenization, Crystal Structure Generation, Autoregressive LM

See: PaddlePaddle/PaddleScience#1202

3. Design Document

See rfcs/Science/hackathon10th_6_crystalllm.md, which covers:

  1. Overview - task background, model principles, ppmat adaptation strategy
  2. Functional goals - both from-scratch generation and conditional generation
  3. Significance - fills the gap in crystal structure generation within ppmat
  4. Design approach - directory layout, data pipeline, model architecture, training/evaluation workflow
  5. Detailed design - complete code skeleton (GPT, CIFTokenDataset, build_cif_text, YAML configs)
  6. Testing and acceptance - acceptance criteria, accuracy alignment plan
  7. Feasibility analysis - risk assessment and schedule estimation
  8. Schedule - three-phase implementation plan

4. Key Design Decisions

  • Follow the ppmat three-layer pattern: _forward() / forward() / predict(), with forward() returning (loss_dict, pred_dict)
  • Add a build_cif_text factory function: CIF text is fundamentally different from graph structures and needs its own data pipeline
  • Automatic download from BCS: pretrained weights and the tokenizer are distributed via Baidu BCS storage
  • Task category: add a crystal_structure_generation/ directory (parallel to the existing crystal_structure_prediction/)
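The three-layer pattern above can be sketched as follows. This is a framework-agnostic illustration in plain Python (a real implementation would subclass paddle.nn.Layer); the method names follow the convention described, while the identity "transformer" and the squared-error loss are hypothetical placeholders.

```python
# Minimal sketch of the ppmat three-layer model pattern (hypothetical
# stand-in; the real GPT model computes transformer logits and a
# cross-entropy language-modeling loss).

class GPTSketch:
    def _forward(self, token_ids):
        # Core computation: return per-token "logits".
        # A dummy identity stands in for the transformer stack here.
        return [float(t) for t in token_ids]

    def forward(self, batch):
        # Training entry point: returns (loss_dict, pred_dict),
        # the format the ppmat trainer consumes.
        logits = self._forward(batch["token_ids"])
        loss = sum((p - y) ** 2 for p, y in zip(logits, batch["labels"])) / len(logits)
        return {"loss": loss}, {"logits": logits}

    def predict(self, batch):
        # Inference entry point: predictions only, no loss.
        return {"logits": self._forward(batch["token_ids"])}


model = GPTSketch()
loss_dict, pred_dict = model.forward(
    {"token_ids": [1, 2, 3], "labels": [1.0, 2.0, 4.0]}
)
```

Keeping the loss inside forward() is what lets a generic trainer drive the loop without model-specific code: it only ever sees the (loss_dict, pred_dict) pair.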

5. References

Design document: Reproducing the CrystalLLM crystal structure generation model with PaddleMaterials
@paddle-bot

paddle-bot bot commented Mar 23, 2026

Your PR has been submitted. Thanks for your contribution!
Please check its format and content. For this, you can refer to Template and Demo.

- Replace wrong GPT-2 defaults (12/12/768, ~124M) with actual CrystalLLM configs
- Small: n_layer=8, n_head=8, n_embd=512, block_size=1024, dropout=0.1 (~33M)
- Large: n_layer=16, n_head=16, n_embd=1024, block_size=2048, dropout=0.1 (~250M)
- Add large model YAML config alongside small
- Update training hyperparams to match upstream (lr=1e-3, batch_size=32/16)
- Remove pretrained weight conversion references (train from scratch)
- Fix section numbering in impact analysis
cloudforge1 added a commit to cloudforge1/PaddleMaterials that referenced this pull request Mar 23, 2026
Reproduce CrystalLLM (Nature Communications 2024) in PaddleMaterials/ppmat.

New files:
- ppmat/models/crystalllm/: GPT model, CIF tokenizer, space groups
- ppmat/datasets/cif_token_dataset.py: memory-mapped CIF token dataset
- ppmat/metrics/crystal_metrics.py: validity, bond-length, space-group metrics
- structure_generation/configs/crystalllm/: 8 configs (4 datasets x 2 sizes)
- structure_generation/convert_weights.py: PyTorch->Paddle weight converter
- test/test_crystalllm_forward.py: 7-test validation suite (all passing)

Architecture: nanoGPT-based causal LM for CIF text generation.
- Small: 8L/8H/512D/1024ctx (~33M params)
- Large: 16L/16H/1024D/2048ctx (~250M params)
- Weight tying via matmul(x, wte.weight^T) (Paddle-idiomatic)
- Vocabulary: 371 tokens (89 atoms + 10 digits + 31 keywords + 13 symbols + 227 space groups + 1 UNK)

RFC: PaddlePaddle/community#1256
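The weight-tying trick mentioned in the commit message (reusing the token-embedding matrix as the output projection) can be illustrated with NumPy. The shapes below match the small config quoted above (vocab 371, embedding 512); the sequence length and random values are illustrative, not taken from the actual code.

```python
import numpy as np

vocab_size, n_embd, seq_len = 371, 512, 16

# Token-embedding table, shared with the output head (weight tying):
# no separate output projection matrix is allocated.
wte = np.random.randn(vocab_size, n_embd).astype("float32")

# Hidden states from the final transformer block (dummy values here).
x = np.random.randn(seq_len, n_embd).astype("float32")

# Output logits via matmul(x, wte.T); in Paddle this corresponds to
# paddle.matmul(x, wte, transpose_y=True).
logits = x @ wte.T

print(logits.shape)  # (16, 371)
```

Tying the embedding and output weights saves vocab_size * n_embd parameters and is standard practice in GPT-style models.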

@leeleolay left a comment


Suggestion: use the suite's existing trainer. The factory function can go under the existing ppmat/datasets directory, and the dataset file can call the factory function.

…tom trainer

Changes per reviewer feedback:
- Remove ppmat/trainer/crystalllm_trainer.py from design (use BaseTrainer directly)
- Rewrite Section 4.4 to clarify no custom trainer file needed
- Update YAML configs to ppmat __class_name__/__init_params__ convention
- Update dataset registration to follow eval(cls_name) pattern
- Fix Section 7 impact numbering after removing trainer item
@cloudforge1
Contributor Author

@leeleolay Thanks for the review; the changes below have been made and pushed per your feedback:

  1. Trainer: removed the custom crystalllm_trainer.py and use BaseTrainer directly. The GPT model's forward() already returns the standard ppmat (loss_dict, pred_dict) format, so BaseTrainer can drive the training loop as-is
  2. Factory function: kept in ppmat/datasets/crystalllm_dataset.py, with CIFTokenDataset registered in __all__ of __init__.py, following the __class_name__/__init_params__ configuration convention
  3. YAML configs: switched to the suite-wide Model.__class_name__: GPT + Dataset.__class_name__: CIFTokenDataset format

Please take another look and let me know if anything else needs adjusting.
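For illustration, a config in the convention described above might look like the fragment below. The nesting and the exact keys under __init_params__ are hypothetical placeholders (the model sizes echo the small config quoted earlier), not copied from the actual repository.

```yaml
# Hypothetical sketch of the ppmat __class_name__/__init_params__ convention.
Model:
  __class_name__: GPT
  __init_params__:
    n_layer: 8
    n_head: 8
    n_embd: 512
    block_size: 1024
    dropout: 0.1

Dataset:
  train:
    dataset:
      __class_name__: CIFTokenDataset
      __init_params__:
        path: ./data/train.bin   # placeholder path
```

The __class_name__ key is typically resolved to a registered class at runtime (the eval(cls_name) pattern mentioned in the commit message), with __init_params__ passed as its constructor arguments.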
