This annotated corpus was developed under the Bhashini project, funded by the Ministry of Electronics and Information Technology (MeitY), Government of India. We thank MeitY for funding this work.
The dataset was developed by three partner institutes: IIIT Hyderabad, CDAC Noida, and IIIT Bhubaneshwar. It is licensed under the Creative Commons Attribution 4.0 (CC-BY-4.0) license. The train, test, and dev split sizes for each language are given below.
Language | Train | Test | Dev |
---|---|---|---|
Hindi | 11076 | 1389 | 1389 |
Urdu | 8720 | 1096 | 1094 |
Odia | 12109 | 1519 | 1517 |
Telugu | 2993 | 384 | 384 |
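
As a quick sanity check on the table above, the following minimal sketch computes the total size and the share of each split per language. The counts are copied from the table; the names `SPLITS` and `split_shares` are illustrative and not part of the dataset or any library, and the unit being counted is not stated in the source.

```python
# Split sizes copied from the table above.
# The source does not state the unit being counted (e.g. sentences).
SPLITS = {
    "Hindi": {"train": 11076, "test": 1389, "dev": 1389},
    "Urdu": {"train": 8720, "test": 1096, "dev": 1094},
    "Odia": {"train": 12109, "test": 1519, "dev": 1517},
    "Telugu": {"train": 2993, "test": 384, "dev": 384},
}


def split_shares(counts):
    """Return (total, {split: fraction of total}) for one language."""
    total = sum(counts.values())
    return total, {name: n / total for name, n in counts.items()}


if __name__ == "__main__":
    for language, counts in SPLITS.items():
        total, shares = split_shares(counts)
        summary = ", ".join(f"{name} {share:.1%}" for name, share in shares.items())
        print(f"{language}: total {total} ({summary})")
```

For every language the split works out to roughly 80/10/10 (train/test/dev).
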

If you use this dataset, please cite the following paper:

    @misc{bahad2024finetuning,
      title={Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages},
      author={Sankalp Bahad and Pruthwik Mishra and Karunesh Arora and Rakesh Chandra Balabantaray and Dipti Misra Sharma and Parameswari Krishnamurthy},
      year={2024},
      eprint={2405.04829},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
    }