
Generate Dataset for OpenSubtitles 2018 #9

Closed
hermitdave opened this issue May 17, 2018 · 9 comments

@hermitdave
Owner

No description provided.

@hermitdave hermitdave self-assigned this May 17, 2018
@hugolpz
Contributor

hugolpz commented Dec 28, 2018

See http://opus.nlpl.eu/OpenSubtitles2018.php 😄
Good to have it at hand! 😃

Method:

I found the Subtlex-pl (2014) discussion interesting. I cross-checked it against Lison & Tiedemann (2016), who barely mention duplicates, so maybe Lison & Tiedemann already did the de-duplication work (very likely, given the extensive processing applied to the data).

Subtlex-pl (2014): article, non-free. The section on data clean-up is interesting. Some parts are within our reach, others are not. Three methods are mentioned:

  1. Remove non-target-language files (~5% of files): check whether the corpus-wide 30 most frequent word types cover at least 10% of the file's tokens.
  2. Remove duplicates and near-duplicate versions of files (~80% of files).
  3. Remove non-words and proper names: keep only words accepted by a publicly available spellchecker.

Corpus compilation, cleaning, and processing
We processed about 105,000 documents containing film and
television subtitles flagged as Polish by the contributors of
http://opensubtitles.org. All subtitle-specific text formatting
was removed before further processing.
(1) To detect documents containing large portions of text in
languages other than Polish, we first calculated preliminary
word frequencies on the basis of all documents and then
removed from the corpus all files in which the 30 most
frequent types did not cover at least 10 % of a total count of
tokens in the file. Using this method, 5,365 files were removed
from the corpus.
(2) Because many documents are available in multiple versions,
it was necessary to remove duplicates from the corpus.
To do so, we first performed a topic analysis using Latent
Dirichlet Allocation (Blei, Ng, & Jordan, 2003), assigning
each file to one of 600 clusters. If any pair of files within a
cluster had an overlap of at least 10 % unique word-trigrams,
the file with the highest number of hapax legomena (words
occurring only once) was removed from the corpus, since
more words occurring once would indicate more misspellings.
After removing duplicates, 27,767 documents remained,
containing about 146 million tokens (individual strings, including
punctuation marks, numbers, etc.).
(3) From these, 101 million tokens (449,300 types) were accepted
as correctly spelled Polish words by the Aspell spell-checker
(http://aspell.net/; Polish dictionary available at ftp://ftp.gnu.org/
gnu/aspell/dict/pl/) and consisted only of legal Polish,
alphabetical characters. All words were converted to
lowercase before spell-checking. Because Aspell rejects
proper names spelled with lowercase, this number does not
include proper names.
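To make checks (1) and (2) concrete, here is a minimal Python sketch of how they could look. The LDA clustering step is skipped, and the tokenizer, thresholds as defaults, and file layout are assumptions on my part, not the paper's actual code:

```python
from collections import Counter
from pathlib import Path

def tokenize(text):
    """Very naive tokenizer; the paper does not specify the one it used."""
    return text.lower().split()

def coverage_filter(files, global_top_n=30, min_coverage=0.10):
    """Check (1): keep only files whose tokens are covered enough by the
    30 most frequent word types computed over the whole corpus."""
    global_counts = Counter()
    tokens_per_file = {}
    for path in files:
        tokens = tokenize(Path(path).read_text(encoding="utf-8", errors="ignore"))
        tokens_per_file[path] = tokens
        global_counts.update(tokens)
    top_types = {w for w, _ in global_counts.most_common(global_top_n)}
    kept = []
    for path, tokens in tokens_per_file.items():
        if not tokens:
            continue
        covered = sum(1 for t in tokens if t in top_types) / len(tokens)
        if covered >= min_coverage:
            kept.append(path)
    return kept

def trigram_overlap(tokens_a, tokens_b):
    """Check (2), core measure: share of unique word trigrams of A also found in B.
    The paper drops, within a cluster, the file with more hapax legomena
    whenever this overlap reaches 10%."""
    tri = lambda toks: {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}
    a, b = tri(tokens_a), tri(tokens_b)
    return len(a & b) / max(len(a), 1)
```

Check (3) would then simply pass the surviving lowercase tokens through a spellchecker such as Aspell and discard anything it rejects.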

P. Lison and J. Tiedemann (2016), "OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles", In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016) http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf .

Furthermore, the administrators of OpenSubtitles have introduced over the last years various mechanisms to sanitise their database and remove duplicate, spurious or misclassified subtitles

Also related to: #2

@hermitdave
Owner Author

@hugolpz apologies for the delay... I downloaded the tar.gz files on another machine and never got around to running it.
Based on early user feedback, I implemented checks to reduce duplicates. The subtitle folders often contain multiple subtitle files, so I now pick up only a single file when multiple are found. Programmatically identifying the language is harder.

I could do basic checks like: this language should only have the Latin / Cyrillic character set.
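A rough sketch of such a character-set check in Python (the per-language script table here is an invented example, not anything from this repository):

```python
import unicodedata

# Hypothetical allowed scripts per language code; adjust per language as needed.
ALLOWED_SCRIPTS = {"en": {"LATIN"}, "ru": {"CYRILLIC"}, "sr": {"LATIN", "CYRILLIC"}}

def word_matches_script(word, lang):
    """True if every alphabetic character in the word belongs to one of the
    scripts allowed for that language."""
    allowed = ALLOWED_SCRIPTS.get(lang, {"LATIN"})
    for ch in word:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if not any(name.startswith(script) for script in allowed):
            return False
    return True
```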

@hermitdave
Owner Author

I am going to use this project to do language detection
https://github.com/TechnikEmpire/language-detection

It's a .NET port of
https://code.google.com/archive/p/language-detection/
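For illustration, the Python port of the same Google project (published on PyPI as langdetect) can be driven like this; this is just a sketch of the detector's use, not the planned .NET code:

```python
# pip install langdetect
from langdetect import DetectorFactory, detect, detect_langs

DetectorFactory.seed = 0  # make results deterministic across runs

print(detect("Ceci est un sous-titre en français."))     # typically 'fr'
print(detect_langs("This line is mostly English, ok?"))   # e.g. [en:0.99...]
```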

@hugolpz
Contributor

hugolpz commented Feb 13, 2019

Hello Dave, cool to see you back!
Thanks for considering my input.

I could do basic checks like this language should only have latin / cyrillic character set.
I think by "latin / cyrillic" you mean Russian and similar languages; note that subtitles often contain Latin-script words regardless of language, at least "ok" and other basic ones.

As for:

I am going to use this project to do language detection

  • Java (original): over 99% precision for 53 languages
  • .NET port
  • Python port: supports 55 languages out of the box

Priority is to get the job done. Then, prefer more popular languages (Python, JS, Java) so the community can jump in if they want.

Or other options:

  • Node franc: packaged with support for 82, 188, or 402 languages

@hermitdave
Owner Author

Thanks @hugolpz, I have reworked the code and added language lookup using .NET for now.
It's slowly chugging along generating the files as we speak. I am downloading the dataset again; they changed from tar.gz to zip, so I wanted to make sure I had the latest set.

I will start uploading soon

@hugolpz
Contributor

hugolpz commented Feb 13, 2019

Ahahahahahahahahaha. I didn't expect it to be that fast 👍

I tried to create my own list and ran into a lot of pollution. Sharing my findings: lots of English names and basic English words in French subtitles.

Noise: lots of character names and a bunch of basic English words. I reviewed and cleaned up the 6,000th-to-8,000th slice of the list: out of 2,000 items, I made 206 edits (10%) and 82 deletions (4%) (diff: open view-source:https://lingualibre.fr/index.php?title=List%3AFra%2Fsubtlex-for-user-Penegal-06001-to-08000&type=revision&diff=83866&oldid=83864 and search for "diff-deletedline" and "diff-addedline").

On my side, I'm building a list of "words to preprocess before calculating stats". Some are to delete (individual names); some are to edit, converting everything to lowercase or to capitalized form. Something like the sketch below?
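For example, such a preprocessing list could be applied roughly like this (a hypothetical Python sketch; the CSV format and file name are made up for illustration):

```python
import csv

def load_rules(path):
    """Load a two-column CSV: word, replacement ('' means delete the word)."""
    rules = {}
    with open(path, newline="", encoding="utf-8") as f:
        for word, replacement in csv.reader(f):
            rules[word.lower()] = replacement
    return rules

def apply_rules(tokens, rules):
    out = []
    for token in tokens:
        key = token.lower()
        if key in rules:
            if rules[key] == "":        # listed for deletion (e.g. character names)
                continue
            out.append(rules[key])      # listed for rewriting (e.g. forced casing)
        else:
            out.append(key)             # default: lowercase everything else
    return out

# freqs = collections.Counter(apply_rules(tokens, load_rules("preprocess_list.csv")))
```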

@hermitdave
Owner Author

@hugolpz I am at home (case of mild flu), which means I can do other things.

That's pretty neat 👍 I force lowercasing to get a small reduction in noise.

@hermitdave
Owner Author

Maybe I should create another output with the words / characters that were filtered out?

@hugolpz
Contributor

hugolpz commented Feb 13, 2019

Yes, forcing lowercase is smart.
For names I had to re-inject the capitalization manually, but I'm not sure it is really worth it, since my end goal is to record words: my recordings can be all lowercase as well.
