GNOME Catalan #15

Open
jorgtied opened this issue Jul 18, 2024 · 0 comments


I think there might be something wrong with one file from the GNOME corpus. The links below are from "legacy" OPUS, but I suspect the problem is the same when the file is obtained with a more current method.

The file is the Catalan (ca) monolingual plain-text file from the GNOME corpus:

https://opus.nlpl.eu/legacy/download.php?f=GNOME/v1/mono/ca.txt.gz

According to the stats on the website, these are the expected figures for the file:

| language | files | tokens | sentences |
|----------|-------|--------|-----------|
| ca       | 2,071 | 6.4M   | 0.9M      |

However, the downloaded file "ca.txt.gz" has far fewer tokens and sentences (wc prints lines, words, and bytes):

    zcat GNOME_v1_mono_ca.txt.gz | wc
    1422 13808 87751

In contrast, the corresponding ca.tok.gz (from https://opus.nlpl.eu/legacy/download.php?f=GNOME/v1/mono/ca.tok.gz) is a much larger file which actually has the expected number of lines:

    zcat GNOME_v1_mono_ca.tok.gz | wc
    668727 6386997 33416861

Could you check whether ca.txt.gz is wrong?
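For anyone who wants to reproduce the comparison without `zcat`, here is a minimal Python sketch of the same check. The helper name `gzip_wc` is hypothetical (not part of OPUS or any tool mentioned here), and the word count splits on ASCII whitespace, so it may differ slightly from `wc` under some locales.

```python
import gzip


def gzip_wc(path):
    """Return (lines, words, bytes) of the decompressed file, roughly like `zcat path | wc`."""
    lines = words = size = 0
    with gzip.open(path, "rb") as fh:
        for raw in fh:
            lines += raw.count(b"\n")   # wc counts newline characters
            words += len(raw.split())   # whitespace-separated tokens (ASCII whitespace only)
            size += len(raw)            # decompressed bytes
    return lines, words, size
```

Assuming the two downloads were saved under the local names used in this report, calling `gzip_wc("GNOME_v1_mono_ca.txt.gz")` and `gzip_wc("GNOME_v1_mono_ca.tok.gz")` should give numbers comparable to the `wc` output quoted above.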
