[Hackathon 10th Spring No.6] Reproducing the CrystalLLM crystal structure generation model with PaddleMaterials: RFC design document #1256
Open
cloudforge1 wants to merge 4 commits into PaddlePaddle:master
Design document for reproducing the CrystalLLM crystal structure generation model with PaddleMaterials
- Replace wrong GPT-2 defaults (12/12/768, ~124M) with actual CrystalLLM configs
  - Small: n_layer=8, n_head=8, n_embd=512, block_size=1024, dropout=0.1 (~33M)
  - Large: n_layer=16, n_head=16, n_embd=1024, block_size=2048, dropout=0.1 (~250M)
- Add large model YAML config alongside small
- Update training hyperparams to match upstream (lr=1e-3, batch_size=32/16)
- Remove pretrained weight conversion references (train from scratch)
- Fix section numbering in impact analysis
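The two model sizes named in this commit can be written down as a small configuration object. The sketch below is illustrative only: the `GPTConfig` class name and the `head_dim` helper are assumptions, not names taken from the RFC; only the hyperparameter values come from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPTConfig:
    """Hypothetical container for the hyperparameters listed in the commit message."""

    n_layer: int
    n_head: int
    n_embd: int
    block_size: int
    dropout: float = 0.1

    def __post_init__(self) -> None:
        # Multi-head attention splits the embedding evenly across heads,
        # so the width must be divisible by the number of heads.
        if self.n_embd % self.n_head != 0:
            raise ValueError("n_embd must be divisible by n_head")

    @property
    def head_dim(self) -> int:
        """Width of a single attention head."""
        return self.n_embd // self.n_head


# Values copied from the commit message above.
SMALL = GPTConfig(n_layer=8, n_head=8, n_embd=512, block_size=1024)
LARGE = GPTConfig(n_layer=16, n_head=16, n_embd=1024, block_size=2048)
```

Both sizes use the same per-head width, so the large model is wider and deeper without changing the shape of an individual attention head.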
cloudforge1 added a commit to cloudforge1/PaddleMaterials that referenced this pull request on Mar 23, 2026
Reproduce CrystalLLM (Nature Communications 2024) in PaddleMaterials/ppmat.

New files:
- ppmat/models/crystalllm/: GPT model, CIF tokenizer, space groups
- ppmat/datasets/cif_token_dataset.py: memory-mapped CIF token dataset
- ppmat/metrics/crystal_metrics.py: validity, bond-length, space-group metrics
- structure_generation/configs/crystalllm/: 8 configs (4 datasets x 2 sizes)
- structure_generation/convert_weights.py: PyTorch->Paddle weight converter
- test/test_crystalllm_forward.py: 7-test validation suite (all passing)

Architecture: nanoGPT-based causal LM for CIF text generation.
- Small: 8L/8H/512D/1024ctx (~33M params)
- Large: 16L/16H/1024D/2048ctx (~250M params)
- Weight tying via matmul(x, wte.weight^T) (Paddle-idiomatic)
- Vocabulary: 371 tokens (89 atoms + 10 digits + 31 keywords + 13 symbols + 227 space groups + 1 UNK)

RFC: PaddlePaddle/community#1256
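The vocabulary breakdown and the weight-tying note above can be checked with a small dependency-free sketch. It is not the Paddle implementation: plain Python lists stand in for tensors, the tiny embedding width is chosen only for illustration, and the helper name `tied_logits` is hypothetical. The group sizes are the ones quoted in the commit message.

```python
import random

# Token-group sizes as listed in the commit message.
VOCAB_GROUPS = {
    "atoms": 89,
    "digits": 10,
    "keywords": 31,
    "symbols": 13,
    "space_groups": 227,
    "unk": 1,
}
VOCAB_SIZE = sum(VOCAB_GROUPS.values())


def tied_logits(hidden, wte):
    """Project one hidden vector onto the vocabulary with the embedding table.

    hidden: list of n_embd floats; wte: VOCAB_SIZE rows of n_embd floats.
    This is matmul(hidden, wte^T), so no separate output matrix is stored.
    """
    return [sum(h * w for h, w in zip(hidden, row)) for row in wte]


n_embd = 4  # tiny width, for illustration only
rng = random.Random(0)
wte = [[rng.uniform(-1.0, 1.0) for _ in range(n_embd)] for _ in range(VOCAB_SIZE)]

token_id = 5
hidden = wte[token_id]             # input side: look up the token embedding
logits = tied_logits(hidden, wte)  # output side: reuse the same table
```

Because the input lookup and the output projection share one table, the model holds a single vocabulary-by-width matrix instead of two.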
leeleolay reviewed on Mar 30, 2026 and left a comment:
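I suggest using the trainer that the suite already provides. The factory function can live under the ppmat/dataset directory, and the dataset file can call that factory function.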
…tom trainer

Changes per reviewer feedback:
- Remove ppmat/trainer/crystalllm_trainer.py from design (use BaseTrainer directly)
- Rewrite Section 4.4 to clarify no custom trainer file needed
- Update YAML configs to ppmat __class_name__/__init_params__ convention
- Update dataset registration to follow eval(cls_name) pattern
- Fix Section 7 impact numbering after removing trainer item
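As a rough illustration of the `__class_name__`/`__init_params__` convention and the eval(cls_name) pattern mentioned in this commit, the sketch below builds an object from a config dictionary. It is an assumption about how such a builder could look, not ppmat's actual code: `build_dataset`, the stub `CIFTokenDataset` constructor arguments, and the `data/train.bin` path are all hypothetical.

```python
class CIFTokenDataset:
    """Stand-in for the real dataset class; it only stores its arguments."""

    def __init__(self, data_path: str, block_size: int = 1024):
        self.data_path = data_path
        self.block_size = block_size


def build_dataset(cfg: dict):
    """Instantiate the class named in cfg with the parameters given in cfg."""
    cfg = dict(cfg)  # shallow copy so the caller's config is not mutated
    cls_name = cfg.pop("__class_name__")
    init_params = cfg.pop("__init_params__", {})
    # eval resolves the class name in this module's namespace.
    # It should only ever see names from trusted config files.
    cls = eval(cls_name)
    return cls(**init_params)


# Dictionary form of what a YAML config entry might load into.
dataset_cfg = {
    "__class_name__": "CIFTokenDataset",
    "__init_params__": {"data_path": "data/train.bin", "block_size": 1024},
}
dataset = build_dataset(dataset_cfg)
```

With this shape, adding a dataset means defining a class that the builder can see and naming it in the config, which fits the reviewer's request to reuse the existing trainer rather than add a custom one.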
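@leeleolay Thanks for the review. I have revised the document based on your feedback and pushed the changes.
Please take another look and let me know if anything else needs adjusting.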
[Hackathon 10th Spring No.6] Reproducing the CrystalLLM crystal structure generation model with PaddleMaterials
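1. Overview
This RFC is the design document for task No.6 of the 10th PaddlePaddle Hackathon. It reproduces the CrystalLLM model on the PaddleMaterials (ppmat) framework.
CrystalLLM is an autoregressive language model based on the GPT-2 architecture. It represents crystal structures as CIF (Crystallographic Information File) text sequences and generates new crystal structures de novo through autoregressive generation. The paper was published in Nature Communications in 2024 (DOI: 10.1038/s41467-024-54639-7).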
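The autoregressive generation described above can be sketched as a sampling loop that feeds each new token back into the model. Everything in this example is a toy assumption made for illustration: the five-entry vocabulary, the deterministic `toy_model`, and the `generate` helper are hypothetical and are not CrystalLLM's tokenizer or network.

```python
import random


def generate(next_token_probs, prompt, max_new_tokens, eos_id, rng):
    """Sample tokens one at a time, conditioning each step on all previous tokens."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)  # distribution over the vocabulary
        next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_id)
        if next_id == eos_id:  # stop once the end-of-sequence token appears
            break
    return tokens


# Toy vocabulary of CIF-like fragments (hypothetical, for illustration only).
VOCAB = ["data_", "Na", "Cl", "\n", "<eos>"]
EOS_ID = len(VOCAB) - 1


def toy_model(tokens):
    """Stand-in for the GPT forward pass: always predicts the next vocabulary entry."""
    nxt = min(tokens[-1] + 1, EOS_ID)
    return [1.0 if i == nxt else 0.0 for i in range(len(VOCAB))]


out = generate(toy_model, prompt=[0], max_new_tokens=10, eos_id=EOS_ID, rng=random.Random(0))
text = "".join(VOCAB[i] for i in out if i != EOS_ID)
```

A real model would return a learned distribution at each step; the loop structure, prompt in and one token appended per iteration, stays the same.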
2. Task Requirements
See: PaddlePaddle/PaddleScience#1202
3. Design Document
See rfcs/Science/hackathon10th_6_crystalllm.md, which covers, among other things, GPT, CIFTokenDataset, build_cif_text, and the YAML configuration.
4. Key Design Decisions
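- _forward() → forward() → predict(), returning (loss_dict, pred_dict)
- build_cif_text factory function: CIF text is fundamentally different from graph structures, so it needs its own data pipeline
- crystal_structure_generation/ directory (parallel to the existing crystal_structure_prediction/)
5. References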