
This repository contains annotated corpora for the Named Entity Recognition (NER) task in 4 Indian languages, developed under the Bhashini project, along with inference code for the models fine-tuned on the XLM-RoBERTa architecture.


SankalpBahad/IL-NER


IL-NER

These annotated corpora were developed under the Bhashini project, funded by the Ministry of Electronics and Information Technology (MeitY), Government of India. We thank MeitY for funding this work.

This dataset is licensed under the Creative Commons Attribution 4.0 (CC-BY-4.0) license. It was developed by three partnering institutes: IIIT Hyderabad, CDAC Noida and IIIT Bhubaneshwar. The sizes of the train, test and dev splits for each language are given below.

| Language | Train | Test | Dev  |
|----------|-------|------|------|
| Hindi    | 11076 | 1389 | 1389 |
| Urdu     | 8720  | 1096 | 1094 |
| Odia     | 12109 | 1519 | 1517 |
| Telugu   | 2993  | 384  | 384  |
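
As a quick sanity check on the split sizes, the short sketch below totals each language's counts and reports the share of each split. The numbers are copied from the table above; the `SPLITS` dictionary and the `split_shares` helper are illustrative names and are not part of this repository. The README does not state the unit of the counts, so the code treats them as plain item counts.

```python
# Split sizes copied from the table above.
# Assumption: the unit (e.g. sentences) is not stated in the README,
# so these are treated as plain item counts.
SPLITS = {
    "Hindi": {"train": 11076, "test": 1389, "dev": 1389},
    "Urdu": {"train": 8720, "test": 1096, "dev": 1094},
    "Odia": {"train": 12109, "test": 1519, "dev": 1517},
    "Telugu": {"train": 2993, "test": 384, "dev": 384},
}


def split_shares(counts):
    """Return (total, {split name: percentage of total}) for one language."""
    total = sum(counts.values())
    shares = {name: 100.0 * n / total for name, n in counts.items()}
    return total, shares


if __name__ == "__main__":
    for language, counts in SPLITS.items():
        total, shares = split_shares(counts)
        summary = ", ".join(f"{name} {pct:.1f}%" for name, pct in shares.items())
        print(f"{language}: total {total} ({summary})")
```

With these counts, every language comes out close to an 80/10/10 train/test/dev split.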

If you use this dataset, please cite the following paper:

  @misc{bahad2024finetuning,
        title={Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages}, 
        author={Sankalp Bahad and Pruthwik Mishra and Karunesh Arora and Rakesh Chandra Balabantaray and Dipti Misra Sharma and Parameswari Krishnamurthy},
        year={2024},
        eprint={2405.04829},
        archivePrefix={arXiv},
        primaryClass={cs.CL}
  }
