Generate Dataset for OpenSubtitles 2018 #9
See http://opus.nlpl.eu/OpenSubtitles2018.php 😄 Method: I found the Subtlex-pl (2014) discussion interesting. I cross-checked it against Lison & Tiedemann (2016), cited below; they barely talk about duplicates. So maybe Lison & Tiedemann already did the de-duplication work, which seems very likely given the extensive processing applied to the data. Subtlex-pl (2014): article, non-free. The section on data clean-up is interesting. Some parts are within reach, others are not. Three methods are mentioned:
P. Lison and J. Tiedemann (2016), "OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles", In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016) http://stp.lingfil.uu.se/~joerg/paper/opensubs2016.pdf .
Also related to : #2 |
@hugolpz apologies for the delays. I downloaded the tar.gz files on another machine and never got around to running it. I could do basic checks, for example that a given language should only use the Latin / Cyrillic character set. |
I am going to use this project to do language detection. It's a .NET port of |
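For illustration, a minimal Python sketch of the kind of character-set check mentioned above. It is not the code used in this project (which is .NET); the function name `uses_only_scripts` and its `scripts` parameter are hypothetical, and the check relies on the assumption that Unicode character names start with the script name (e.g. `LATIN SMALL LETTER E WITH ACUTE`).

```python
import unicodedata


def uses_only_scripts(word, scripts=("LATIN",)):
    """Return True if every alphabetic character in `word` belongs to one of `scripts`.

    `scripts` holds Unicode name prefixes such as "LATIN" or "CYRILLIC"
    (hypothetical parameter, chosen for this sketch).
    """
    prefixes = tuple(scripts)
    for ch in word:
        if not ch.isalpha():
            continue  # skip digits, apostrophes, hyphens and other non-letters
        name = unicodedata.name(ch, "")  # "" when the character has no name
        if not name.startswith(prefixes):
            return False
    return True
```

A word list for French would then keep only tokens where `uses_only_scripts(token, ("LATIN",))` is true, while a Russian list would pass `("CYRILLIC",)`.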
Hello Dave, cool to see you back,
As for :
The priority is to get the job done. Then use more popular languages (Python, JS, Java) so the community can jump in if they want. Or others:
|
Thanks @hugolpz I have reworked the code and added language lookup using .NET for now. I will start uploading soon |
Ahahahahahahahahaha. I didn't expect it that fast 👍 I tried to create my own list and ran into a lot of pollution. Sharing my findings with you: lots of English names and basic English in the French subtitles.
On my side, I'm building a list of "words to preprocess before calculating stats". Some are to be deleted (individual names). Some are to be edited, moving everything to lowercase or to capitalized form. This kind of thing? |
@hugolpz I am at home (case of mild flu), which means I can do other things. That's pretty neat 👍 I force lower-casing to get a small reduction in noise. |
Maybe I should create another output with the words / characters that were filtered out? |
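A rough Python sketch of that preprocessing idea, assuming the subtitles are already tokenized: lowercase every token, drop the ones on a deletion list (for example individual names), then count. The names `count_words` and `drop` are hypothetical and only serve the example.

```python
from collections import Counter


def count_words(tokens, drop=frozenset()):
    """Count lowercased tokens, skipping any word listed in `drop`.

    `drop` is a hypothetical set of lowercase words to delete before
    computing statistics, such as individual names.
    """
    counts = Counter()
    for tok in tokens:
        word = tok.lower()  # merge "Le" and "le" into a single entry
        if word in drop:
            continue
        counts[word] += 1
    return counts
```

For instance, `count_words(["Le", "le", "John", "chat"], drop={"john"})` counts `le` twice and `chat` once, and ignores `John`.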
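One possible shape for such a second output, sketched in Python under the assumption that filtering is a simple per-word predicate. The names `split_counts` and `keep` are hypothetical; the point is only that rejected tokens are counted too instead of being discarded, so they can be reviewed later.

```python
from collections import Counter


def split_counts(tokens, keep):
    """Return (kept, rejected) frequency counters for lowercased tokens.

    `keep` is a hypothetical predicate: it receives the lowercased word
    and returns True when the word should stay in the main list.
    """
    kept, rejected = Counter(), Counter()
    for tok in tokens:
        word = tok.lower()
        target = kept if keep(word) else rejected
        target[word] += 1
    return kept, rejected
```

Writing `rejected.most_common()` to its own file would make it easy to spot words that were filtered out by mistake.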
Yes, forcing lower case is smart. |