GitHub - spraakbanken/multigec-2025: MultiGEC-2025 shared task website, results and scripts.

Quick links: dataset overview | using MultiGEC | citing

MultiGEC is a dataset for Multilingual Grammatical Error Correction in 12 European languages (Czech, English, Estonian, German, Greek, Icelandic, Italian, Latvian, Russian, Slovene, Swedish and Ukrainian) compiled by the CompSLA working group and over 20 external data providers in the context of MultiGEC-2025, the first text-level GEC shared task.

Overview

The MultiGEC dataset is divided into 17 subcorpora covering different languages, domains and correction styles, summarized below. More detailed information about each subcorpus is available with the data as machine-readable metadata, whose format is described here. See also the full dataset statistics.

language code	subcorpus	learners	# essays (train)	# essays (dev)	# essays (test)	# essays (total)	hypothesis sets	minimal	fluency	peculiarities
cs	NatWebInf	L1 (web)	3620	1291	1256	6167	2	✓
cs	Romani	L1 (Romani children)	3247	179	173	3599	2	✓
cs	SecLearn	L2	2057	173	177	2407	2	✓
cs	NatForm	L1 (students)	227	88	76	391	2	✓
en	Write & Improve	L2	4040	506	504	5050	1	✓		separate download
et	EIC	L2	206	26	26	258	3	✓	✓
et	EKIL2	L2	1202	150	151	1503	2		✓
de	Merlin	L2	827	103	103	1033	1	✓		pre-tokenized
el	GLCII	L2	1031	129	129	1289	1	✓
is	IceEC	L1 (mixed)	140	18	18	176	1		✓	pre-tokenized
is	IceL2EC	L2	155	19	19	193	1		✓	pre-tokenized; includes text fragments
it	Merlin	L2	651	81	81	813	1	✓
lv	LaVA	L2	813	101	101	1015	1	✓
ru	RULEC-GEC	mixed (L2 + heritage)	2539	1969	1535	6043	3	✓	✓	pre-tokenized; includes text fragments; separate download
sl	Solar-Eval	L1 (students)	10	50	49	109	1	✓
sv	SweLL_gold	L2	402	50	50	502	1	✓
uk	UA-GEC	mixed (crowdsourced)	1706	87	79	1872	4	✓	✓
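
As a quick sanity check on the overview table, the sketch below parses a few of its rows and confirms that each total equals train + dev + test, then sums the essays per language. This is only an illustration: the `EXCERPT` string is copied by hand from the table above, and the column names in `FIELDS` and the helper functions are hypothetical. They are not part of the official MultiGEC metadata format or scripts.

```python
import csv
import io
from collections import defaultdict

# A few rows copied by hand from the overview table (tab-separated):
# language code, subcorpus, train, dev, test, total.
EXCERPT = (
    "cs\tNatWebInf\t3620\t1291\t1256\t6167\n"
    "cs\tRomani\t3247\t179\t173\t3599\n"
    "et\tEIC\t206\t26\t26\t258\n"
    "et\tEKIL2\t1202\t150\t151\t1503\n"
    "sv\tSweLL_gold\t402\t50\t50\t502\n"
)

# Hypothetical column names chosen for this example only.
FIELDS = ["lang", "subcorpus", "train", "dev", "test", "total"]
COUNT_FIELDS = ("train", "dev", "test", "total")


def load_rows(text):
    """Parse tab-separated rows and convert the essay counts to integers."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=FIELDS, delimiter="\t")
    rows = []
    for record in reader:
        for key in COUNT_FIELDS:
            record[key] = int(record[key])
        rows.append(record)
    return rows


def inconsistent_subcorpora(rows):
    """Return the subcorpora whose total differs from train + dev + test."""
    return [
        r["subcorpus"]
        for r in rows
        if r["train"] + r["dev"] + r["test"] != r["total"]
    ]


def essays_per_language(rows):
    """Sum the total number of essays for each language code."""
    totals = defaultdict(int)
    for r in rows:
        totals[r["lang"]] += r["total"]
    return dict(totals)


rows = load_rows(EXCERPT)
print(inconsistent_subcorpora(rows))
print(essays_per_language(rows))
```

The same functions should work on the full table if all 17 rows are pasted in with only these six columns kept.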

Usage

The MultiGEC dataset is subject to the terms of use listed here. To get the data, go to the download page. A collection of scripts to work with MultiGEC data is available through GitHub.

Citing MultiGEC

Information on how to cite the dataset is available here. See also the list of MultiGEC-related publications.

