# Dataset Card for Language Identification dataset
## Table of Contents
- Dataset Description
- Dataset Structure
- Dataset Creation
- Considerations for Using the Data
- Additional Information
## Dataset Description

- Homepage:
- Repository:
- Paper:
- Leaderboard:
- Point of Contact:
### Dataset Summary

The Language Identification dataset is a collection of 90k samples consisting of text passages and the corresponding language label. This dataset was created by collecting data from 3 sources: Multilingual Amazon Reviews Corpus, XNLI, and STSb Multi MT.
### Supported Tasks and Leaderboards

The dataset can be used to train a model for language identification, which is a multi-class text classification task. The model `papluca/xlm-roberta-base-language-detection`, which is a fine-tuned version of `xlm-roberta-base`, was trained on this dataset and currently achieves 99.6% accuracy on the test set.
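As a minimal sketch of how this model can be used for inference (assuming the `transformers` library is installed; the model ID is the one quoted above, and the printed output is illustrative):

```python
from transformers import pipeline

# Load the fine-tuned language detection model mentioned above.
classifier = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
)

# The pipeline returns the predicted language tag with a confidence score,
# e.g. [{'label': 'fr', 'score': 0.99...}] for the French example below.
print(classifier("Conforme à la description, produit pratique."))
```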
### Languages

The Language Identification dataset contains text in 20 languages, which are:

arabic (ar), bulgarian (bg), german (de), modern greek (el), english (en), spanish (es), french (fr), hindi (hi), italian (it), japanese (ja), dutch (nl), polish (pl), portuguese (pt), russian (ru), swahili (sw), thai (th), turkish (tr), urdu (ur), vietnamese (vi), and chinese (zh).
## Dataset Structure

### Data Instances

For each instance, there is a string for the text and a string for the label (the language tag). Here is an example:

```
{'labels': 'fr', 'text': 'Conforme à la description, produit pratique.'}
```
### Data Fields

- `labels`: a string indicating the language label.
- `text`: a string consisting of one or more sentences in one of the 20 languages listed above.
### Data Splits

The Language Identification dataset has 3 splits: train, valid, and test. The train set contains 70k samples, while the validation and test sets contain 10k samples each. All splits are perfectly balanced: the train set contains 3500 samples per language, while the validation and test sets contain 500 samples per language.

|                      | train | valid | test  |
|----------------------|------:|------:|------:|
| Total samples        | 70000 | 10000 | 10000 |
| Samples per language |  3500 |   500 |   500 |
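A short sketch of loading the dataset and verifying the split sizes with the `datasets` library; the Hub ID `papluca/language-identification` is an assumption based on this card and may need adjusting:

```python
from collections import Counter

from datasets import load_dataset

# Hub ID assumed from this dataset card.
dataset = load_dataset("papluca/language-identification")

# Expected sizes: train 70k, valid and test 10k each.
for split_name, split in dataset.items():
    print(split_name, len(split))

# Each split is balanced: e.g. 3500 samples per language in the train set.
print(Counter(dataset["train"]["labels"]))
```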
## Dataset Creation

### Curation Rationale

This dataset was built during The Hugging Face Course Community Event, which took place in November 2021, with the goal of collecting a dataset with enough samples for each language to train a robust language detection model.
### Source Data

The Language Identification dataset was created by collecting data from 3 sources: Multilingual Amazon Reviews Corpus, XNLI, and STSb Multi MT.

### Personal and Sensitive Information

The dataset does not contain any personal information about the authors or the crowdworkers.
## Considerations for Using the Data

### Social Impact of Dataset

This dataset was developed as a benchmark for evaluating (balanced) multi-class text classification models.
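For illustration, a hedged sketch of how such a benchmark evaluation might look, reusing the model and (assumed) dataset IDs from above and computing accuracy by hand:

```python
from datasets import load_dataset
from transformers import pipeline

# Dataset Hub ID assumed from this card.
test_set = load_dataset("papluca/language-identification", split="test")

classifier = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
)

# Predict a language tag for every test passage and compare with the gold label.
predictions = classifier(test_set["text"], truncation=True)
correct = sum(
    pred["label"] == gold for pred, gold in zip(predictions, test_set["labels"])
)
print(f"Test accuracy: {correct / len(test_set):.4f}")  # reported: 0.996
```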
### Discussion of Biases

The possible biases correspond to those of the 3 datasets on which this dataset is based.
### Other Known Limitations

[More Information Needed]

## Additional Information

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]
### Contributions

Thanks to @LucaPapariello for adding this dataset.