
Generate Dataset for OpenSubtitles 2018 #9

Closed
hermitdave opened this issue May 17, 2018 · 9 comments

@hermitdave
Owner

No description provided.

@hermitdave hermitdave self-assigned this May 17, 2018
@hugolpz
Contributor

hugolpz commented Dec 28, 2018

See http://opus.nlpl.eu/OpenSubtitles2018.php 😄
Good to have it at hand! 😃

Method:

I found the Subtlex-pl (2014) discussion interesting. I cross-checked it against Lison & Tiedemann (2016), who barely mention duplicates, so maybe Lison & Tiedemann already did the de-duplication work (very likely, given the extensive processing applied to the data).

Subtlex-pl (2014): article, non-free. The section on data clean-up is interesting. Some parts are within our reach, others are not. Three methods are mentioned:

  1. Remove non-target-language files (~5% of files): check whether the corpus-wide 30 most frequent word types cover at least 10% of the file's tokens.
  2. Remove duplicates and near-duplicate versions of files (~80% of files).
  3. Remove non-words and proper names: keep only words accepted by a publicly available spellchecker.

Corpus compilation, cleaning, and processing
We processed about 105,000 documents containing film and
television subtitles flagged as Polish by the contributors of
http://opensubtitles.org. All subtitle-specific text formatting
was removed before further processing.
(1) To detect documents containing large portions of text in
languages other than Polish, we first calculated preliminary
word frequencies on the basis of all documents and then
removed from the corpus all files in which the 30 most
frequent types did not cover at least 10 % of a total count of
tokens in the file. Using this method, 5,365 files were removed
from the corpus.
(2) Because many documents are available in multiple versions,
it was necessary to remove duplicates from the corpus.
To do so, we first performed a topic analysis using Latent
Dirichlet Allocation (Blei, Ng, & Jordan, 2003), assigning
each file to one of 600 clusters. If any pair of files within a
cluster had an overlap of at least 10 % unique word-trigrams,
the file with the highest number of hapax legomena (words
occurring only once) was removed from the corpus, since
more words occurring once would indicate more misspellings.
After removing duplicates, 27,767 documents remained,
containing about 146 million tokens (individual strings, including
punctuation marks, numbers, etc.).
(3) From these, 101 million tokens (449,300 types) were accepted
as correctly spelled Polish words by the Aspell spell-checker
(http://aspell.net/; Polish dictionary available at ftp://ftp.gnu.org/
gnu/aspell/dict/pl/) and consisted only of legal Polish,
alphabetical characters. All words were converted to
lowercase before spell-checking. Because Aspell rejects
proper names spelled with lowercase, this number does not
include proper names.
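To make checks (1) and (2) concrete, here is a minimal Python sketch of how they could look. The LDA clustering step is skipped, and the tokenizer, thresholds as defaults, and file layout are assumptions on my part, not the paper's actual code:

```python
from collections import Counter
from pathlib import Path

def tokenize(text):
    """Very naive tokenizer; the paper does not specify the one it used."""
    return text.lower().split()

def coverage_filter(files, global_top_n=30, min_coverage=0.10):
    """Check (1): keep only files whose tokens are covered enough by the
    30 most frequent word types computed over the whole corpus."""
    global_counts = Counter()
    tokens_per_file = {}
    for path in files:
        tokens = tokenize(Path(path).read_text(encoding="utf-8", errors="ignore"))
        tokens_per_file[path] = tokens
        global_counts.update(tokens)
    top_types = {w for w, _ in global_counts.most_common(global_top_n)}
    kept = []
    for path, tokens in tokens_per_file.items():
        if not tokens:
            continue
        covered = sum(1 for t in tokens if t in top_types) / len(tokens)
        if covered >= min_coverage:
            kept.append(path)
    return kept

def trigram_overlap(tokens_a, tokens_b):
    """Check (2), core measure: share of unique word trigrams of A also found in B.
    The paper drops, within a cluster, the file with more hapax legomena
    whenever this overlap reaches 10%."""
    tri = lambda toks: {tuple(toks[i:i + 3]) for i in range(len(toks) - 2)}
    a, b = tri(tokens_a), tri(tokens_b)
    return len(a & b) / max(len(a), 1)
```

Check (3) would then simply pass the surviving lowercase tokens through a spellchecker such as Aspell and discard anything it rejects.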

P. Lison and J. Tiedemann (2016), "OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles", In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016) http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf .

Furthermore, the administrators of OpenSubtitles have introduced over the last years various mechanisms to sanitise their database and remove duplicate, spurious or misclassified subtitles

Also related to: #2

@hermitdave
Owner Author

@hugolpz apologies for the delay... I downloaded the tar.gz files on another machine and never got around to running it.
Based on early user feedback, I implemented checks to reduce duplicates. The subtitle folders often contain multiple subtitle files, so I now pick up only a single file when multiple are found. Programmatically identifying the language is harder.

I could do basic checks like: this language should only have the Latin / Cyrillic character set.
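A rough sketch of such a character-set check in Python (the per-language script table here is an invented example, not anything from this repository):

```python
import unicodedata

# Hypothetical allowed scripts per language code; adjust per language as needed.
ALLOWED_SCRIPTS = {"en": {"LATIN"}, "ru": {"CYRILLIC"}, "sr": {"LATIN", "CYRILLIC"}}

def word_matches_script(word, lang):
    """True if every alphabetic character in the word belongs to one of the
    scripts allowed for that language."""
    allowed = ALLOWED_SCRIPTS.get(lang, {"LATIN"})
    for ch in word:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if not any(name.startswith(script) for script in allowed):
            return False
    return True
```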

@hermitdave
Owner Author

I am going to use this project to do language detection
https://github.com/TechnikEmpire/language-detection

It's a .NET port of
https://code.google.com/archive/p/language-detection/
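For illustration, the Python port of the same Google project (published on PyPI as langdetect) can be driven like this; this is just a sketch of the detector's use, not the planned .NET code:

```python
# pip install langdetect
from langdetect import DetectorFactory, detect, detect_langs

DetectorFactory.seed = 0  # make results deterministic across runs

print(detect("Ceci est un sous-titre en français."))     # typically 'fr'
print(detect_langs("This line is mostly English, ok?"))   # e.g. [en:0.99...]
```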

@hugolpz
Contributor

hugolpz commented Feb 13, 2019

Hello Dave, cool to see you back!
Thanks for considering my input.

I could do basic checks like this language should only have latin / cyrillic character set.
I think by "latin / cyrillic" you mean Russian and similar languages; note that subtitles often contain Latin-script words regardless of language, at least "ok" and other basic ones.

As for:

I am going to use this project to do language detection

  • Java (original): over 99% precision for 53 languages
  • .NET port
  • Python port: supports 55 languages out of the box

Priority is to get the job done. Then, prefer more popular languages (Python, JS, Java) so the community can jump in if they want.

Or other options:

  • Node franc: packaged with support for 82, 188, or 402 languages

@hermitdave
Owner Author

Thanks @hugolpz, I have reworked the code and added language lookup using .NET for now.
It's slowly chugging along generating the files as we speak. I am downloading the dataset again; they changed from tar.gz to zip, so I wanted to make sure I had the latest set.

I will start uploading soon

@hugolpz
Contributor

hugolpz commented Feb 13, 2019

Ahahahahahahahahaha. I didn't expect it to be that fast 👍

I tried to create my own list and ran into a lot of pollution. Sharing my findings: lots of English names and basic English words in French subtitles.

Noise: lots of character names and a bunch of basic English words. I reviewed and cleaned up the 6,000th-to-8,000th slice of the list: out of 2,000 items, I made 206 edits (10%) and 82 deletions (4%) (diff: open view-source:https://lingualibre.fr/index.php?title=List%3AFra%2Fsubtlex-for-user-Penegal-06001-to-08000&type=revision&diff=83866&oldid=83864 and search for "diff-deletedline" and "diff-addedline").

On my side, I'm building a list of "words to preprocess before calculating stats". Some are to delete (individual names); some are to edit, converting everything to lowercase or to capitalized form. Something like the sketch below?
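For example, such a preprocessing list could be applied roughly like this (a hypothetical Python sketch; the CSV format and file name are made up for illustration):

```python
import csv

def load_rules(path):
    """Load a two-column CSV: word, replacement ('' means delete the word)."""
    rules = {}
    with open(path, newline="", encoding="utf-8") as f:
        for word, replacement in csv.reader(f):
            rules[word.lower()] = replacement
    return rules

def apply_rules(tokens, rules):
    out = []
    for token in tokens:
        key = token.lower()
        if key in rules:
            if rules[key] == "":        # listed for deletion (e.g. character names)
                continue
            out.append(rules[key])      # listed for rewriting (e.g. forced casing)
        else:
            out.append(key)             # default: lowercase everything else
    return out

# freqs = collections.Counter(apply_rules(tokens, load_rules("preprocess_list.csv")))
```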

@hermitdave
Owner Author

@hugolpz I am at home (case of mild flu), which means I can do other things.

That's pretty neat 👍 I force lowercasing to get a small reduction in noise.

@hermitdave
Owner Author

Maybe I should create another output with the words / characters that were filtered out?

@hugolpz
Contributor

hugolpz commented Feb 13, 2019

Yes, forcing lowercase is smart.
For names I had to re-inject the capitalization manually, but I'm not sure it is really worth it, since my end goal is to record words: my recordings can be all lowercase as well.
