I think there might be something wrong with one file from the GNOME corpus. These links are from "legacy" OPUS, but I suspect the problem is the same when the file is obtained with a more current method.

The file is the Catalan (ca) monolingual plain text file from the GNOME corpus:

https://opus.nlpl.eu/legacy/download.php?f=GNOME/v1/mono/ca.txt.gz

According to the stats on the website, these are the expected figures for the file:

    language  files  tokens  sentences
    ca        2,071  6.4M    0.9M

However, the downloaded file "ca.txt.gz" has far fewer tokens and sentences (the `wc` columns are lines, words, and bytes):

    zcat GNOME_v1_mono_ca.txt.gz | wc
    1422 13808 87751

In contrast, the corresponding ca.tok.gz (from https://opus.nlpl.eu/legacy/download.php?f=GNOME/v1/mono/ca.tok.gz) is a much larger file that is in line with the expected stats:

    zcat GNOME_v1_mono_ca.tok.gz | wc
    668727 6386997 33416861

Could you check whether ca.txt.gz is wrong?
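For anyone who wants to reproduce the comparison without `zcat`, here is a minimal Python sketch of the same check. The helper name `gz_counts` is my own (hypothetical), and it assumes the two files were downloaded under the local names used above. Word counts may differ slightly from `wc` in some locales, since this splits on ASCII whitespace only.

```python
import gzip


def gz_counts(path):
    """Return (lines, words, bytes) of the decompressed data, similar to `zcat path | wc`.

    Hypothetical helper for illustration; not part of any OPUS tooling.
    """
    lines = words = nbytes = 0
    with gzip.open(path, "rb") as fh:
        for raw in fh:
            lines += raw.count(b"\n")   # wc counts newline characters
            words += len(raw.split())   # whitespace-separated tokens
            nbytes += len(raw)
    return lines, words, nbytes
```

Calling `gz_counts("GNOME_v1_mono_ca.txt.gz")` and `gz_counts("GNOME_v1_mono_ca.tok.gz")` should give the two triples to compare against the website stats.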