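This project aims to provide a unified, native-PyTorch implementation of the major current models for legal judgment prediction (LJP), including preprocessing for multiple public datasets in multiple languages and implementations for multiple subtasks.
Run torch_ljp/main.py directly from the command line, passing in arguments, to get the corresponding results. Before doing so, create a config.py file in the torch_ljp folder. (The contents of my real file are meaningless to other users, so I did not upload it; instead I uploaded a fakeconfig.py file, and you only need to fill in the parameters it asks for.)
For the exact commands, see example.txt.
The op_examples folder contains sample outputs; see the corresponding command lines described in example.txt.
The evaluation metrics for model predictions, and how they are computed, are described in the metrics folder.
The key package versions of the system environment I use are listed in enviroment_v.txt.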
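The sections below cover the datasets this project can already analyze and process, the models it implements, and, for the tasks that pair them, a comparison between the results I obtained and those reported in the original papers or in other papers that cite them. (A lot of this material is still unorganized; I will fill it in gradually.) (If you would like me to add a dataset or model, feel free to open an issue!)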
Chinese:
- CAIL (also known as the CAIL2018 dataset) (source: CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction; download: https://cail.oss-cn-qingdao.aliyuncs.com/CAIL2018_ALL_DATA.zip). In the CAIL2018 competition, the original task takes the fact text as input and, framed as classification, predicts the charge (accusation), the law article (law), and the prison term (imprisonment, in months; -1 for life imprisonment and -2 for the death penalty).
- CAIL2021 (source: Equality before the law: Legal judgment consistency analysis for fairness; adapted from the CAIL dataset. Included in FairLex)
- LJP-E (not yet fully public; I emailed the first author, who said it will all be released. Source: Legal Judgment Prediction via Event Extraction with Constraints)
- attribute_charge (source: Few-Shot Charge Prediction with Discriminative Legal Attributes)
- LEVEN (source: LEVEN: A Large-Scale Chinese Legal Event Detection Dataset; download: https://cloud.tsinghua.edu.cn/d/6e911ff1286d47db8016/)
English:
- LJP-MSJudge (source: Legal Judgment Prediction with Multi-Stage Case Representation Learning in the Real Court Setting)
English (United States):
- ILLDM (the authors say in the paper that the data will be released, but it is not yet in the GitHub project. Source: Interpretable Low-Resource Legal Decision Making)
English (Europe):
- ECHR (source: Neural Legal Judgment Prediction in English; download: https://archive.org/download/ECHR-ACL2019/ECHR_Dataset.zip. Included in LexGLUE)
- ECtHR (source: Paragraph-level Rationale Extraction through Regularization: A case study on European Court of Human Rights Cases; download: ecthr_cases · Datasets at Hugging Face. When using it, also cite Neural Legal Judgment Prediction in English. Included in FairLex and LexGLUE)
English (India):
- ILDC (source: ILDC for CJPE: Indian Legal Documents Corpus for Court Judgment Prediction and Explanation)
- ILSI (source: LeSICiN: A Heterogeneous Graph-Based Approach for Automatic Legal Statute Identification from Indian Legal Documents; download: Dataset and additional files/softwares required for the paper "LeSICiN: A Heterogeneous Graph-based Approach for Automatic Legal Statute Identification from Indian Legal Documents" | Zenodo; everything there except best_model.pt and ils2v.bin is a data-related file)
French (Belgium):
- BSARD (source: A Statutory Article Retrieval Dataset in French; download: https://raw.githubusercontent.com/maastrichtlawtech/bsard/master/data/bsard_v1.zip)
Multilingual:
- Swiss-Judgment-Predict dataset (Switzerland; German, French, and Italian; source: Swiss-Judgment-Prediction: A Multilingual Legal Judgment Prediction Benchmark; download 1: SwissJudgmentPrediction | Zenodo; download 2: swiss_judgment_prediction · Datasets at Hugging Face. Included in FairLex)
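As an illustration of the CAIL2018 label convention described above (prison term in months, -1 for life imprisonment, -2 for the death penalty), here is a minimal sketch of turning one raw record into the three targets. The JSON field names (`fact`, `meta`, `accusation`, `relevant_articles`, `term_of_imprisonment`, `death_penalty`, `life_imprisonment`, `imprisonment`) are assumptions about the raw file layout, the helper names `term_label` and `parse_case` are hypothetical, and checking the death penalty before life imprisonment is also an assumption. This is not the preprocessing code in torch_ljp.

```python
import json

LIFE_IMPRISONMENT = -1  # label for life imprisonment, as stated in the dataset list
DEATH_PENALTY = -2      # label for the death penalty, as stated in the dataset list


def term_label(term: dict) -> int:
    """Collapse a prison-term record into one integer label (months, -1, or -2).

    The keys used here are assumed field names, not confirmed by this README.
    """
    if term.get("death_penalty"):
        return DEATH_PENALTY
    if term.get("life_imprisonment"):
        return LIFE_IMPRISONMENT
    return int(term["imprisonment"])


def parse_case(line: str) -> dict:
    """Parse one JSON line into the input text and the three prediction targets."""
    record = json.loads(line)
    meta = record["meta"]
    return {
        "fact": record["fact"],
        "accusation": meta["accusation"],
        "law": meta["relevant_articles"],
        "imprisonment": term_label(meta["term_of_imprisonment"]),
    }


# Made-up sample record (not taken from the dataset) in the assumed layout.
sample_line = json.dumps({
    "fact": "placeholder fact description",
    "meta": {
        "accusation": ["theft"],
        "relevant_articles": [264],
        "term_of_imprisonment": {
            "death_penalty": False,
            "life_imprisonment": False,
            "imprisonment": 6,
        },
    },
})
case = parse_case(sample_line)
print(case["accusation"], case["law"], case["imprisonment"])
```

If the real files use different keys, only `term_label` and `parse_case` need to change; the -1 and -2 convention stays the same.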
Models:
- TFIDF+SVM (also known as LibSVM): nominal data, multi-class single-label setting. (TFIDF is from Term-weighting approaches in automatic text retrieval; SVM is from Least Squares Support Vector Machine Classifiers. Baseline used in CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction. Reference code: CAIL2018/baseline at master · thunlp/CAIL2018)
- fastText (source: Bag of Tricks for Efficient Text Classification. Baseline used in CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction. Reference code: fastText/python at main · facebookresearch/fastText)
- TextCNN (also known as CNN) (source: Convolutional neural networks for sentence classification; baseline used in CAIL2018: A Large-Scale Legal Dataset for Judgment Prediction and in LADAN)
- LSTM (source: Long short-term memory)
- GRU (source: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation)
- RCNN (source: Recurrent Convolutional Neural Networks for Text Classification)
- HAN (also known as HARNN) (source: Hierarchical Attention Networks for Document Classification; baseline used in LADAN)
- DPCNN (source: Deep Pyramid Convolutional Neural Networks for Text Categorization)
- Random forest
- MLAC (also known as FLA) (source: Learning to Predict Charges for Criminal Cases with Legal Basis; baseline used in LADAN, LeSICiN, and EPM)
- DAPM (source: Modeling Dynamic Pairwise Attention for Crime Classification over Legal Articles; baseline used in LeSICiN)
- TOPJUDGE (source: Legal Judgment Prediction via Topological Learning; baseline used in LADAN and EPM. Reference code: thunlp/TopJudge)
- Few-Shot (source: Few-Shot Charge Prediction with Discriminative Legal Attributes; baseline used in LADAN)
- HMN (source: Hierarchical Matching Network for Crime Classification; baseline used in LeSICiN)
- MPBFN (also known as MPBFN-WCA) (source: Legal Judgment Prediction via Multi-Perspective Bi-Feedback Network; baseline used in LADAN and EPM)
- HBERT (source: Neural Legal Judgment Prediction in English; baseline used in LeSICiN)
- HLegalBERT (HBERT with BERT replaced by LegalBERT; baseline used in LeSICiN)
- LegalAtt (source: Charge Prediction with Legal Attention)
- HLCP (source: Legal Cause Prediction with Inner Descriptions and Outer Hierarchies)
- LADAN (source: Distinguish Confusing Law Articles for Legal Judgment Prediction; baseline used in LeSICiN and EPM. Reference code: prometheusXN/LADAN: The source code of article "Distinguish Confusing Law Articles for Legal Judgment Prediction", ACL 2020)
- MSJudge (source: Legal Judgment Prediction with Multi-Stage Case Representation Learning in the Real Court Setting)
- R-former (source: Legal Judgment Prediction via Relational Learning)
- NeurJudge (source: NeurJudge: A Circumstance-aware Neural Framework for Legal Judgment Prediction; baseline used in EPM)
- LawReasoning (source: Judgment Prediction via Injecting Legal Knowledge into Neural Networks; the official GitHub project given in the paper, leileigan/LawReasoning, contains only a README file, so it is of no practical use)
- MLMN (source: Learning Fine-grained Fact-Article Correspondence in Legal Cases)
- MFMI (source: Few-Shot Charge Prediction with Multi-grained Features and Mutual Information)
- Dependency-LJP (source: Dependency Learning for Legal Judgment Prediction with a Unified Text-to-Text Transformer)
- LDAIM (source: Label Definitions Augmented Interaction Model for Legal Charge Prediction)
- LamBERTa (source: Unsupervised law article mining based on deep pre-trained language representation models with application to the Italian civil code)
- CCJudge (source: Legal Judgment Prediction with Multiple Perspectives on Civil Cases)
- LeSICiN (source: LeSICiN: A Heterogeneous Graph-Based Approach for Automatic Legal Statute Identification from Indian Legal Documents)
- ILLDM (only usable on its own special data, which has not yet been released in the GitHub project. Source: Interpretable Low-Resource Legal Decision Making)
- EPM (the official code is not yet fully public; I emailed the first author, who said it will all be released later, so I plan to implement it once that happens. Source: Legal Judgment Prediction via Event Extraction with Constraints)
- FLSA (source: A few-shot transfer learning approach using text-label embedding with legal attributes for law article prediction)
- PRRP (source: Interpretable prison term prediction with reinforce learning and attention)
- DCSCP (source: Charge prediction modeling with interpretation enhancement driven by double-layer criminal system)
- CEEN (source: Improving legal judgment prediction through reinforced criminal element extraction)
- BERT (source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding)
- RoBERTa (source: RoBERTa: A Robustly Optimized BERT Pretraining Approach)
- DistilBERT (source: DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter)
- XLNet (source: XLNet: Generalized Autoregressive Pretraining for Language Understanding)
- NEZHA (source: NEZHA: Neural Contextualized Representation for Chinese Language Understanding)
- Longformer (source: Longformer: The Long-Document Transformer)
- LegalBERT (source: LEGAL-BERT: The Muppets straight out of Law School)
- Lawformer (source: Lawformer: A Pre-trained Language Model for Chinese Legal Long Documents)
- Linear regression
- DEAL (source: Inductive Link Prediction for Nodes Having Only Attribute Information; baseline used in LeSICiN)
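Uses the original CAIL2018 task setting.
The training set is first_stage/train.json, and the test set is first_stage/test.json + restData/rest_data.json. (According to the paper, this configuration removes multi-defendant cases and keeps only single-defendant ones; removes charges and law articles that occur fewer than 30 times; and removes the 102 law articles that are not relevant to specific charges. I did not understand what this last point means.) Word segmentation uses THULAC; the optimizer is Adam with a learning rate of 0.001, the dropout rate is 0.5, and the batch size is 128.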
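The first two filtering steps quoted from the paper can be sketched as follows. This is only one possible reading: it drops every case that carries a rare charge or law article (rather than stripping just the rare label), counts frequencies once after the single-defendant filter, and does not attempt the "102 law articles" step at all. The record keys (`criminals`, `accusation`, `law`) and the function name `filter_cases` are hypothetical.

```python
from collections import Counter

MIN_FREQ = 30  # frequency threshold quoted from the paper


def filter_cases(cases, min_freq=MIN_FREQ):
    """Keep single-defendant cases whose charges and law articles are all frequent.

    `cases` is a list of dicts with the assumed keys "criminals", "accusation",
    and "law", each holding a list.
    """
    # Step 1: drop multi-defendant cases.
    single = [c for c in cases if len(c["criminals"]) == 1]

    # Step 2: count label frequencies over the remaining cases.
    charge_freq = Counter(a for c in single for a in c["accusation"])
    law_freq = Counter(article for c in single for article in c["law"])

    # Step 3: drop cases that carry any label seen fewer than min_freq times.
    return [
        c for c in single
        if all(charge_freq[a] >= min_freq for a in c["accusation"])
        and all(law_freq[article] >= min_freq for article in c["law"])
    ]


# Toy data: 30 frequent cases, 1 rare-charge case, 1 multi-defendant case.
toy = [{"criminals": ["x"], "accusation": ["A"], "law": [1]} for _ in range(30)]
toy.append({"criminals": ["y"], "accusation": ["B"], "law": [1]})
toy.append({"criminals": ["p", "q"], "accusation": ["A"], "law": [1]})
kept = filter_cases(toy)
print(len(kept))
```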
baseline: ① TFIDF+SVM (linear-kernel SVM, feature dimension 5000, 200-dimensional word vectors trained with skip-gram) ② TextCNN (input length capped at 4096, filter widths (2, 3, 4, 5), filter size 64) ③ FastText
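For orientation, here is a minimal PyTorch sketch of a TextCNN with the hyperparameters listed for this experiment. It is not the code in torch_ljp. It reads "filter size 64" as 64 filters per width, reuses the 200-dimensional embedding size for the embedding layer, and uses placeholder values for `vocab_size` and `num_classes`; all of these are assumptions.

```python
import torch
from torch import nn


class TextCNN(nn.Module):
    """Embedding -> parallel 1-D convolutions -> max-over-time pooling -> linear."""

    def __init__(self, vocab_size, num_classes, embed_dim=200,
                 filter_widths=(2, 3, 4, 5), num_filters=64, dropout=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One convolution per filter width, each producing num_filters channels.
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, kernel_size=w) for w in filter_widths]
        )
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(num_filters * len(filter_widths), num_classes)

    def forward(self, token_ids):
        # (batch, seq_len) -> (batch, embed_dim, seq_len), the layout Conv1d expects.
        x = self.embedding(token_ids).transpose(1, 2)
        # Max-over-time pooling turns each feature map into one value per filter.
        pooled = [torch.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.classifier(self.dropout(torch.cat(pooled, dim=1)))


MAX_LEN = 4096  # inputs would be truncated or padded to this length

# Placeholder sizes, for illustration only.
model = TextCNN(vocab_size=1000, num_classes=10)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# A short random batch keeps the demo fast; real inputs would have MAX_LEN tokens.
batch = torch.randint(1, 1000, (2, 32))
logits = model(batch)
print(logits.shape)
```

Each input must be at least as long as the widest filter (5 tokens here), which padding to a fixed length guarantees.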
Metrics: accuracy, macro-precision, macro-recall
See the reappear_files folder.
The experiment configuration is given by the command lines in example.txt.
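A dependency-free sketch of these three metrics for the single-label case is shown below. The project's own definitions live in the metrics folder and may differ in details; here the label set is taken as the union of true and predicted labels, and a class with no predictions (or no true instances) contributes 0 to the average. Both conventions are assumptions, and `macro_scores` is a hypothetical name.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro-precision, macro-recall) for single-label data."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls = [], []
    for label in labels:
        true_pos = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)  # how often the model chose it
        actual = sum(t == label for t in y_true)     # how often it is the truth
        precisions.append(true_pos / predicted if predicted else 0.0)
        recalls.append(true_pos / actual if actual else 0.0)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # Macro averaging weights every class equally, regardless of its frequency.
    return accuracy, sum(precisions) / len(labels), sum(recalls) / len(labels)


accuracy, macro_p, macro_r = macro_scores([0, 0, 1, 2], [0, 1, 1, 2])
print(accuracy, macro_p, macro_r)
```

Because every class counts equally, rare charges and law articles influence the macro scores as much as frequent ones, which is why they are reported next to accuracy.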
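Other notes:
- The torch_ljp/dataset_utils/other_data folder holds some files that are fairly small and whose construction is hard to explain, so they are uploaded directly with the GitHub project.
- cn_criminal_law.txt: the 2021 edition of the Criminal Law of the People's Republic of China. Copied from the Word file downloaded from 中华人民共和国刑法(2022年最新版) - 中国刑事辩护网, with the passages reading “中国刑事辩护网提供……” ("Provided by 中国刑事辩护网 ...") removed.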