# Dataset Card for Language Identification dataset
## Table of Contents
- Dataset Description
- Dataset Structure
- Dataset Creation
- Considerations for Using the Data
- Additional Information
## Dataset Description

- Homepage:
- Repository:
- Paper:
- Leaderboard:
- Point of Contact:
### Dataset Summary

The Language Identification dataset is a collection of 90k samples consisting of text passages and the corresponding language label. This dataset was created by collecting data from 3 sources: Multilingual Amazon Reviews Corpus, XNLI, and STSb Multi MT.
### Supported Tasks and Leaderboards

The dataset can be used to train a model for language identification, which is a multi-class text classification task. The model `papluca/xlm-roberta-base-language-detection`, which is a fine-tuned version of `xlm-roberta-base`, was trained on this dataset and currently achieves 99.6% accuracy on the test set.
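As a minimal sketch of how this model can be used for inference (assuming the `transformers` library is installed; the model ID is the one quoted above, and the printed output is illustrative):

```python
from transformers import pipeline

# Load the fine-tuned language detection model mentioned above.
classifier = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
)

# The pipeline returns the predicted language tag with a confidence score,
# e.g. [{'label': 'fr', 'score': 0.99...}] for the French example below.
print(classifier("Conforme à la description, produit pratique."))
```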
### Languages

The Language Identification dataset contains text in 20 languages, which are:

arabic (ar), bulgarian (bg), german (de), modern greek (el), english (en), spanish (es), french (fr), hindi (hi), italian (it), japanese (ja), dutch (nl), polish (pl), portuguese (pt), russian (ru), swahili (sw), thai (th), turkish (tr), urdu (ur), vietnamese (vi), and chinese (zh).
## Dataset Structure

### Data Instances

For each instance, there is a string for the text and a string for the label (the language tag). Here is an example:

```
{'labels': 'fr', 'text': 'Conforme à la description, produit pratique.'}
```
### Data Fields

- `labels`: a string indicating the language label.
- `text`: a string consisting of one or more sentences in one of the 20 languages listed above.
### Data Splits

The Language Identification dataset has 3 splits: train, valid, and test. The train set contains 70k samples, while the validation and test sets contain 10k samples each. All splits are perfectly balanced: the train set contains 3500 samples per language, while the validation and test sets contain 500 samples per language.

|                      | train | valid | test  |
|----------------------|------:|------:|------:|
| Total samples        | 70000 | 10000 | 10000 |
| Samples per language |  3500 |   500 |   500 |
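A short sketch of loading the dataset and verifying the split sizes with the `datasets` library; the Hub ID `papluca/language-identification` is an assumption based on this card and may need adjusting:

```python
from collections import Counter

from datasets import load_dataset

# Hub ID assumed from this dataset card.
dataset = load_dataset("papluca/language-identification")

# Expected sizes: train 70k, valid and test 10k each.
for split_name, split in dataset.items():
    print(split_name, len(split))

# Each split is balanced: e.g. 3500 samples per language in the train set.
print(Counter(dataset["train"]["labels"]))
```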
## Dataset Creation

### Curation Rationale

This dataset was built during The Hugging Face Course Community Event, which took place in November 2021, with the goal of collecting a dataset with enough samples for each language to train a robust language detection model.
### Source Data

The Language Identification dataset was created by collecting data from 3 sources: Multilingual Amazon Reviews Corpus, XNLI, and STSb Multi MT.

### Personal and Sensitive Information

The dataset does not contain any personal information about the authors or the crowdworkers.
## Considerations for Using the Data

### Social Impact of Dataset

This dataset was developed as a benchmark for evaluating (balanced) multi-class text classification models.
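For illustration, a hedged sketch of how such a benchmark evaluation might look, reusing the model and (assumed) dataset IDs from above and computing accuracy by hand:

```python
from datasets import load_dataset
from transformers import pipeline

# Dataset Hub ID assumed from this card.
test_set = load_dataset("papluca/language-identification", split="test")

classifier = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
)

# Predict a language tag for every test passage and compare with the gold label.
predictions = classifier(test_set["text"], truncation=True)
correct = sum(
    pred["label"] == gold for pred, gold in zip(predictions, test_set["labels"])
)
print(f"Test accuracy: {correct / len(test_set):.4f}")  # reported: 0.996
```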
### Discussion of Biases

The possible biases correspond to those of the 3 datasets on which this dataset is based.
### Other Known Limitations

[More Information Needed]

## Additional Information

### Licensing Information

[More Information Needed]

### Citation Information

[More Information Needed]
### Contributions

Thanks to @LucaPapariello for adding this dataset.