Automatic language detection in Corpus with language listed in output #983
Comments
@wvdvegte, orange3-text guesses the language per corpus, not per document. It therefore does not add a column to the corpus; it only sets a common language for the whole corpus. The guessing mechanism sets the initial value of the language dropdown in the Corpus widget, and the corpus's language is then used to initialize the language dropdowns in other widgets automatically. All methods in the add-on that support a language setting can use only one language per corpus, which is why the language is set per corpus. What you suggest is a method that guesses the language per document. If we implement that, I would suggest doing it in a separate widget.
@PrimozGodec, thanks for the explanation. Indeed, it would make more sense to do this in a separate widget. I think it would be useful for data cleaning: sometimes a corpus gets "polluted" with documents in a different language, and it would be nice to be able to exclude these.
I have written a Python Script that adds a column with the language code.
Is your feature request related to a problem? Please describe.
I have a data file with abstracts from PhD theses in a column "ABSTRACT". Most abstracts are in English, but some are in Italian, and I want to be able to filter by language (e.g. using Select Rows). I get the impression that the latest versions of the Text add-on (>= 1.13) support automatic language detection, and in #916 it is suggested that this has now been incorporated in Corpus. However, from what I see, the language still has to be selected by the user, and there is no column in the output that lists the language.
Describe the solution you'd like
In the Language drop-down menu, add the option "guess" to the list of languages, and add a column "Language" to the output that lists the language of each row.
Describe alternatives you've considered
Since I've learned that the appearance of the word "il" (="the") distinguishes Italian text from English text, I inserted a Feature Constructor with the following assignment:
Language := 'IT' if 'il' in ABSTRACT.lower().split() or ABSTRACT.lower().startswith('il') else 'EN'
but it works for this specific case only and it's rather awkward.
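For illustration, the single-word check could be generalized slightly by counting several function words per language and picking the language with the most hits. The following is only a minimal sketch of that idea, not how orange3-text guesses languages: the `STOPWORDS` table and the `guess_language` function are hypothetical names, and the word lists are hand-picked assumptions. Matching on whole tokens also avoids flagging English words that merely start with "il".

```python
import re

# Hypothetical, hand-picked function words per language (an assumption for
# illustration only; these lists do not come from orange3-text).
STOPWORDS = {
    "EN": {"the", "and", "of", "is", "to", "that", "with", "this"},
    "IT": {"il", "la", "di", "che", "per", "con", "una", "del", "è"},
}


def guess_language(text, default="EN"):
    """Return the language code whose function words occur most often in text.

    Falls back to `default` when no listed function word is found.
    """
    # Split into lowercase tokens made of letters only (Unicode-aware).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    # Count how many tokens belong to each language's function-word set.
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


print(guess_language("Il modello descrive la struttura del sistema"))
print(guess_language("The model describes the structure of the system"))
```

Such a function could be applied to each value of the ABSTRACT column to fill a Language column for use with Select Rows, but it remains a heuristic: short texts or texts with mixed languages can easily be misclassified.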