This repository is the code implementation for the paper "Simple Named Entity Recognition (NER) System with RoBERTa for Ancient Chinese" (EvaCun2025).
Paper: https://aclanthology.org/2025.alp-1.27.pdf
SimpleNER targets named entity recognition for ancient Chinese texts (e.g., the Shiji, the Twenty-Four Histories, and classical Chinese medicine texts). It uses a GujiRoBERTa_jian_fan + LSTM + CRF architecture together with a two-stage training strategy (freeze the pretrained parameters early, then fine-tune all parameters globally) to mitigate overfitting on small datasets. The code and experimental results reproduce the EvaHan/EvaCun2025 evaluation reported in the paper.
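The CRF layer scores whole tag sequences rather than independent per-token labels, and decoding picks the highest-scoring path with the Viterbi algorithm. The repo uses a library CRF implementation; as a minimal illustration of the decoding step (with toy hand-written scores, not trained parameters):

```python
# Minimal Viterbi decoding sketch for a linear-chain CRF.
# emissions and transitions here are toy values for illustration only.

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring tag sequence.

    emissions:   list of [num_tags] score lists, one per token
    transitions: transitions[i][j] = score of moving from tag i to tag j
    """
    num_tags = len(emissions[0])
    # score[j] = best score of any path ending in tag j at the current token
    score = list(emissions[0])
    backpointers = []
    for emission in emissions[1:]:
        new_score, pointers = [], []
        for j in range(num_tags):
            best_prev, best_val = max(
                ((i, score[i] + transitions[i][j]) for i in range(num_tags)),
                key=lambda x: x[1],
            )
            new_score.append(best_val + emission[j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace the best path back from the last token.
    best_tag = max(range(num_tags), key=lambda j: score[j])
    path = [best_tag]
    for pointers in reversed(backpointers):
        best_tag = pointers[best_tag]
        path.append(best_tag)
    path.reverse()
    return path
```

With zero transition scores the decoder just follows the emissions; a strong transition penalty (e.g., forbidding an I-tag after O) can override them, which is exactly why a CRF improves entity boundary recognition over per-token classification.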
SimpleNER
│
│ README.md
│ requirements.txt
│
├─data
│
├─figure
│ Poster.png
│
├─model
│ README.md
│
├─notebook
│ data.ipynb
│ EvaNer.ipynb
│ EvaNer_crf.ipynb
│ EvaNer_crf_attention.ipynb
│ EvaNer_crf_lstm.ipynb
│ predicted.ipynb
│
└─src
EvaNer.py
git clone https://github.com/Blue-radish/SimpleNER.git
cd SimpleNER
pip install -r requirements.txt

Python 3.10 and a GPU (CUDA) environment are recommended to speed up training.
Run the data extraction and format-conversion cells in notebook/data.ipynb to generate the data needed for training, validation, and testing.
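NER training data for Chinese is typically converted to per-character BIO tags. The exact schema produced by data.ipynb is not shown here; as a sketch, assuming entities arrive as character-level (start, end, label) spans:

```python
# Hedged sketch of span-to-BIO conversion; the span format and label names
# are assumptions for illustration, not the repo's actual data schema.

def spans_to_bio(text, spans):
    """Convert character-level entity spans to per-character BIO tags.

    text:  the raw sentence (one tag per character, as usual for Chinese)
    spans: list of (start, end, label) with end exclusive
    """
    tags = ["O"] * len(text)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters
    return tags
```

For example, `spans_to_bio("司馬遷著史記", [(0, 3, "PER")])` tags the first three characters `B-PER, I-PER, I-PER` and the rest `O`.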
Then run one of the training notebooks:
- notebook/EvaNer.ipynb (GujiRoBERTa baseline)
- notebook/EvaNer_crf.ipynb (GujiRoBERTa + CRF)
- notebook/EvaNer_crf_lstm.ipynb (GujiRoBERTa + LSTM + CRF, the architecture reported in the paper)
- notebook/EvaNer_crf_attention.ipynb (GujiRoBERTa + attention + CRF)
Use notebook/predicted.ipynb to generate predictions.
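The two-stage strategy (freeze the pretrained encoder early, then fine-tune everything) amounts to selecting which parameters receive gradients at each stage. A minimal sketch, assuming encoder parameters are identified by a "roberta." name prefix (an illustrative convention, not the repo's actual naming):

```python
# Sketch of stage-dependent parameter selection for freeze-then-finetune.
# The "roberta." prefix is an assumption for illustration.

def trainable_names(param_names, stage):
    """Return the parameter names that should receive gradients at a stage."""
    if stage == "frozen":
        # Stage 1: train only the task head (LSTM + CRF), encoder frozen.
        return [n for n in param_names if not n.startswith("roberta.")]
    # Stage 2: global fine-tuning of all parameters.
    return list(param_names)

params = ["roberta.embeddings.weight", "lstm.weight_ih_l0", "crf.transitions"]
```

In a PyTorch training loop this selection would translate to setting `requires_grad` per parameter before building the optimizer for each stage.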
If you use this code or data in your research, please cite our paper:
@inproceedings{zhang-etal-2025-simple,
title = "Simple Named Entity Recognition ({NER}) System with {R}o{BERT}a for {A}ncient {C}hinese",
author = "Zhang, Yunmeng and
Liu, Meiling and
Tang, Hanqi and
Lu, Shige and
Xue, Lang",
editor = "Anderson, Adam and
Gordin, Shai and
Li, Bin and
Liu, Yudong and
Passarotti, Marco C. and
Sprugnoli, Rachele",
booktitle = "Proceedings of the Second Workshop on Ancient Language Processing",
month = may,
year = "2025",
address = "The Albuquerque Convention Center, Laguna",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.alp-1.27/",
doi = "10.18653/v1/2025.alp-1.27",
pages = "206--212",
ISBN = "979-8-89176-235-0",
  abstract = "Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP), particularly in the analysis of Chinese historical texts. In this work, we propose an innovative NER model based on GujiRoBERTa, incorporating Conditional Random Fields (CRF) and Long Short-Term Memory Networks (LSTM) to enhance sequence labeling performance. Our model is evaluated on three datasets from the EvaHan2025 competition, demonstrating superior performance over the baseline model, SikuRoBERTa-BiLSTM-CRF. The proposed approach effectively captures contextual dependencies and improves entity boundary recognition. Experimental results show that our method achieves consistent improvements across almost all evaluation metrics, highlighting its robustness and effectiveness in handling ancient Chinese texts."
}
