`GuwenBERT` is a RoBERTa model trained on Classical Chinese text.
In natural language processing, pre-trained language models have become a fundamental technology. Many modern Chinese BERT models are available for download, but language models for Classical Chinese are lacking. To promote research on Classical Chinese and natural language processing, we released a pre-trained language model for Classical Chinese called `GuwenBERT`.
Common tasks in Classical Chinese, such as sentence segmentation, punctuation, and named entity recognition (NER), are usually handled with sequence labeling models. These models rely on pre-trained word vectors or BERT, so a good language model can greatly improve performance. On the NER task, our model outperforms the most popular Chinese RoBERTa by 6.3%. It reaches the same F1 score as Chinese RoBERTa in only 300 steps, which makes it especially suitable for small datasets with limited annotated data. Our model can also reduce the need for steps such as data cleaning, data augmentation, and introducing a dictionary. In the evaluation, we reached second place using only a BERT+CRF model.
- `GuwenBERT` is trained on the Daizhige Classical Chinese Documents corpus, which contains 15,694 Classical Chinese books with 1.7B characters. All traditional characters are converted to simplified characters.
- The vocabulary of `GuwenBERT` is built from the Classical Chinese corpus by taking high-frequency characters. The vocabulary size is 23,292.
- Based on continued training, `GuwenBERT` combines modern Chinese RoBERTa weights with a large Classical Chinese corpus, transferring some language features of modern Chinese to Classical Chinese to improve performance.
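As a rough illustration of the vocabulary point above, a character-level vocabulary can be built by counting characters and keeping the most frequent ones. This is a minimal sketch, not the project's actual script: the function name `build_char_vocab`, the special-token list, and the toy corpus are assumptions made for this example.

```python
from collections import Counter

# Hypothetical special tokens; the real vocabulary file may use a different set.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def build_char_vocab(lines, max_size):
    """Return the special tokens followed by the most frequent characters."""
    counts = Counter()
    for line in lines:
        # Every non-whitespace character is one candidate token.
        counts.update(ch for ch in line if not ch.isspace())
    n_chars = max(max_size - len(SPECIAL_TOKENS), 0)
    frequent = [ch for ch, _ in counts.most_common(n_chars)]
    return SPECIAL_TOKENS + frequent


# Toy corpus for demonstration only.
toy_corpus = ["学而时习之", "学而不思则罔", "思而不学则殆"]
vocab = build_char_vocab(toy_corpus, max_size=8)
print(vocab)
```

With a real corpus, `max_size=23292` would mirror the vocabulary size reported above; rare characters outside the vocabulary would then map to the unknown token.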
Click the picture or here to jump. The model may need to load the first time, so please wait a moment.
- 2020/10/31 CCL2020 conference talk: Classical Chinese Language Model Based on Continued Training (slides)
- 2020/10/25 Our model has been uploaded to Hugging Face Transformers; see the instructions.
- 2020/9/29 Our model won second prize in the 2020 "Gulian Cup" Ancient Book Literature Named Entity Recognition Evaluation Contest.
The following models can be used directly with Hugging Face Transformers.
- `ethanyt/guwenbert-base`: 12-layer, 768-hidden, 12-heads
- `ethanyt/guwenbert-large`: 24-layer, 1024-hidden, 16-heads
Code:
```python
from transformers import AutoTokenizer, AutoModel

# Load the tokenizer and encoder from the Hugging Face hub.
tokenizer = AutoTokenizer.from_pretrained("ethanyt/guwenbert-base")
model = AutoModel.from_pretrained("ethanyt/guwenbert-base")
```
Note: RoBERTa's original tokenizer is based on the BPE algorithm, which is not friendly to Chinese, so this work uses the BERT tokenizer instead. This configuration has been written into `config.json`, so using `AutoTokenizer` directly will automatically load `BertTokenizer`, and `AutoModel` will automatically load `RobertaModel`.
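To check which classes the Auto classes actually resolve to on your installation, a small helper like the one below can be used. It is a sketch only: `resolved_classes` is a hypothetical name introduced here, and calling it needs the `transformers` package and network access (or a local cache) for the first download.

```python
def resolved_classes(model_id="ethanyt/guwenbert-base"):
    """Load tokenizer and model via the Auto classes and report their class names."""
    # Imported lazily so that defining this helper does not require the library.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)  # downloads on first use
    model = AutoModel.from_pretrained(model_id)
    return type(tokenizer).__name__, type(model).__name__
```

Running `print(resolved_classes())` should, per the note above, report a BERT tokenizer class and `RobertaModel`; the exact tokenizer class name may differ between library versions.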
The models we provide are in PyTorch format. If you need a TensorFlow version, please use the conversion script provided by Transformers.
Download directly from the Hugging Face website:
https://huggingface.co/ethanyt/guwenbert-base
https://huggingface.co/ethanyt/guwenbert-large
Scroll to the bottom of the page, click "List all files in model", and download each file from the pop-up.
Users in mainland China who cannot download the models directly from the Hugging Face hub can use the following mirror:
Model | Size | Baidu Pan |
---|---|---|
guwenbert-base | 235.2M | Link Password: 4jng |
guwenbert-large | 738.1M | Link Password: m5sz |
The model took second place in the competition. Detailed test results:
NE Type | Precision | Recall | F1 |
---|---|---|---|
Book Name | 77.50 | 73.73 | 75.57 |
Other Name | 85.85 | 89.32 | 87.55 |
Micro Avg. | 83.88 | 85.39 | 84.63 |
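The F1 column in the table above is the harmonic mean of precision and recall. The short check below recomputes it from the table's precision and recall values; the function name `f1_score` is just a label chosen for this sketch.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above, in percent.
results = {
    "Book Name": (77.50, 73.73),
    "Other Name": (85.85, 89.32),
    "Micro Avg.": (83.88, 85.39),
}
for name, (p, r) in results.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
# Book Name: F1 = 75.57
# Other Name: F1 = 87.55
# Micro Avg.: F1 = 84.63
```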
If you have any questions, please open an issue or contact me directly. Common problems will be summarized here in the future.
- The initial learning rate is a critical parameter and needs to be adjusted according to the sub-task.
- For models that require CRF, please increase the learning rate of the CRF layer, generally more than 100 times that of RoBERTa.
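One way to give the CRF layer a larger learning rate is to build separate optimizer parameter groups. The sketch below is an assumption-laden illustration: the helper name `build_param_groups`, the `crf.` name prefix, and the base learning rate of `2e-5` are all hypothetical and depend on how your model is defined.

```python
def build_param_groups(named_params, base_lr, crf_lr_multiplier=100.0, crf_prefix="crf."):
    """Split parameters into encoder and CRF groups with different learning rates."""
    encoder_params, crf_params = [], []
    for name, param in named_params:
        # Assumption: CRF parameters are registered under a name starting with crf_prefix.
        if name.startswith(crf_prefix):
            crf_params.append(param)
        else:
            encoder_params.append(param)
    return [
        {"params": encoder_params, "lr": base_lr},
        {"params": crf_params, "lr": base_lr * crf_lr_multiplier},
    ]


# Stand-in (name, parameter) pairs; with a real model use model.named_parameters().
fake_named_params = [
    ("encoder.embeddings.word_embeddings.weight", "w_emb"),
    ("encoder.layer.0.attention.self.query.weight", "w_q"),
    ("crf.transitions", "w_crf"),
]
groups = build_param_groups(fake_named_params, base_lr=2e-5)
for group in groups:
    print(len(group["params"]), group["lr"])
```

With PyTorch, the returned list can be passed straight to an optimizer, for example `torch.optim.AdamW(groups)`, which accepts per-group `lr` values.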
Note: This section describes the pre-training process. Do not use the configuration in this section for fine-tuning.
The models are initialized with `hfl/chinese-roberta-wwm-ext` and then pre-trained with a 2-step strategy.
In the first step, the model learns MLM with only word embeddings updated during training, until convergence. In the second step, all parameters are updated during training.
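The two-step strategy can be sketched as toggling which parameters receive gradients. This is not the authors' training code: the helper name `set_trainable_for_step` and the choice to identify word embeddings by the substring `word_embeddings` in the parameter name are assumptions for illustration.

```python
from types import SimpleNamespace


def set_trainable_for_step(named_params, step):
    """Toggle requires_grad for the two pre-training steps.

    step 1: only word-embedding parameters are updated.
    step 2: all parameters are updated.
    Returns the names of the parameters left trainable.
    """
    if step not in (1, 2):
        raise ValueError("step must be 1 or 2")
    trainable = []
    for name, param in named_params:
        # Assumption: word embeddings are identified by their parameter name.
        param.requires_grad = step == 2 or "word_embeddings" in name
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Stand-in parameters; with a real model pass list(model.named_parameters()).
demo = [
    ("roberta.embeddings.word_embeddings.weight", SimpleNamespace(requires_grad=True)),
    ("roberta.embeddings.position_embeddings.weight", SimpleNamespace(requires_grad=True)),
    ("roberta.encoder.layer.0.output.dense.weight", SimpleNamespace(requires_grad=True)),
]
step1 = set_trainable_for_step(demo, step=1)
print(step1)
step2 = set_trainable_for_step(demo, step=2)
print(len(step2))
```

Freezing everything except the word embeddings first lets the new Classical Chinese vocabulary adapt to the inherited encoder before the whole network is updated.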
The detailed hyperparameters are as follows:
Name | Value |
---|---|
Batch size | 2,048 |
Seq Length | 512 |
Optimizer | Adam |
Learning Rate | 2e-4 (base), 1e-4 (large) |
Adam-eps | 1e-6 |
Weight Decay | 0.01 |
Warmup | 5K steps, followed by linear decay of the learning rate |
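The warmup row in the table can be read as the schedule sketched below. Two things are assumptions here: the warmup itself is taken to be linear, and `total_steps=100_000` is a placeholder, since the total number of pre-training steps is not stated in this document.

```python
def learning_rate(step, peak_lr, warmup_steps=5000, total_steps=100_000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


# Peak learning rate of the base model from the table above.
for s in (0, 2500, 5000, 52_500, 100_000):
    print(s, learning_rate(s, peak_lr=2e-4))
```

The rate rises from zero to the peak over the first 5K steps and then falls linearly, reaching half the peak midway through the decay phase and zero at the assumed final step.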
If this project is helpful to your research, please cite this work in your paper. Since our paper has not been published yet, you can cite it as a footnote for now.
\footnote{GuwenBERT \url{https://github.com/ethan-yt/guwenbert}}.
The experimental results presented in the report reflect performance only under a specific dataset and hyperparameter combination and do not represent the inherent quality of each model. Results may vary with random seeds and computing hardware. **The content of this project is for technical research reference only and should not be used as any conclusive basis. Users may use the model freely within the scope of the license, but we are not responsible for direct or indirect losses caused by using the content of this project.**
This work continues training from 中文BERT-wwm (Chinese BERT-wwm).