Automatic language detection in Corpus with language listed in output #983

Open
wvdvegte opened this issue Jun 1, 2023 · 3 comments
Labels
feast (This may require a few weeks of work), help wanted

Comments

@wvdvegte

wvdvegte commented Jun 1, 2023

Is your feature request related to a problem? Please describe.
I have a data file with abstracts from PhD theses in a column "ABSTRACT". Most abstracts are in English, but some are in Italian, and I want to be able to filter by language (e.g. using Select Rows). I get the impression that the latest versions of the Text add-on (>= 1.13) support automatic language detection, and in #916 it is suggested that this has now been incorporated in Corpus. However, from what I see, the language still has to be selected by the user, and there is no column in the output that lists the language.

Describe the solution you'd like
In the Language drop-down menu, add the option "guess" to the list of languages, and add a column "Language" to the output that lists the language of each row.

Describe alternatives you've considered
Having learned that the appearance of the word "il" (= "the") distinguishes Italian text from English text, I inserted a Feature Constructor with the following assignment: Language := 'IT' if 'il' in ABSTRACT.lower().split() or ABSTRACT.lower().startswith('il') else 'EN'. However, it works for this specific case only and is rather awkward.
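The same idea can be generalized a little by counting hits against a small stopword list per language instead of testing for a single word. The sketch below is only an illustration: the `STOPWORDS` sets and the `guess_language` helper are hypothetical names, and the word lists are far too small for real use. It also avoids one weakness of the expression above, where `startswith('il')` also matches English abstracts that begin with a word such as "Illustrating".

```python
import re

# Hypothetical, deliberately tiny stopword lists; real lists would be much longer.
STOPWORDS = {
    "EN": {"the", "and", "of", "is", "to"},
    "IT": {"il", "la", "di", "che", "e", "è", "per"},
}


def guess_language(text, default="EN"):
    """Return the code whose stopwords occur most often in the text."""
    # Split on word characters so punctuation does not stick to the words.
    words = re.findall(r"\w+", text.lower())
    scores = {
        code: sum(word in stopwords for word in words)
        for code, stopwords in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default when no stopword matched at all.
    return best if scores[best] > 0 else default


print(guess_language("Il gatto è sul tavolo"))    # IT
print(guess_language("Illustrating the method"))  # EN
```

Whole-word matching keeps "Illustrating" from being counted as "il", but a heuristic like this still only separates the languages it has lists for.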

@PrimozGodec
Collaborator

@wvdvegte, orange3-text guesses the language on a corpus basis (not on a document basis). So it doesn't add any column to the corpus; it just sets a common language for the whole corpus. The guessing mechanism sets the initial value of the language dropdown in the Corpus widget.

The corpus's language is used to initialize the language dropdowns in other widgets automatically. All methods in the add-on that support a language setting can use only one language per corpus, which is why the language is set on a corpus basis.

What you suggested is to have a method to guess language on a document basis. If we implement that, I would suggest implementing it in a separate widget.
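To make the distinction between the two granularities concrete, here is a minimal sketch. It assumes a majority vote as the way to reduce per-document guesses to one corpus-level value; that choice is an assumption for illustration and is not taken from the orange3-text source. The helper names and the `toy_detect` function are hypothetical, and any real detector with the same call shape could be passed in instead.

```python
from collections import Counter


def guess_per_document(documents, detect):
    """Return one language code per document using the supplied detector."""
    return [detect(doc) for doc in documents]


def guess_for_corpus(documents, detect):
    """Return the most common per-document guess, or None for an empty corpus."""
    codes = guess_per_document(documents, detect)
    return Counter(codes).most_common(1)[0][0] if codes else None


def toy_detect(text):
    """Toy detector for illustration only: flags Italian by the word 'il'."""
    return "it" if "il" in text.lower().split() else "en"


docs = [
    "The thesis studies graphs",
    "Il lavoro analizza i dati",
    "A study of the models",
]
print(guess_per_document(docs, toy_detect))  # ['en', 'it', 'en']
print(guess_for_corpus(docs, toy_detect))    # en
```

The per-document list is what a new output column would hold, while the single corpus-level value is the kind of result that can preselect a language dropdown.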

@wvdvegte
Author

wvdvegte commented Jun 8, 2023

@PrimozGodec, thanks for the explanation. Indeed, it would make more sense to do this in a separate widget. I think it would be useful for data cleaning: sometimes a corpus gets "polluted" with documents in a different language, and it would be nice to be able to exclude these.
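As a rough sketch of that cleaning step, assuming a language code is already available for every document, the documents that do not match the most common code could be dropped as follows. The function name `drop_minority_languages` is hypothetical, and the example data is made up for illustration.

```python
from collections import Counter


def drop_minority_languages(documents, codes):
    """Keep only the documents whose code equals the most common code."""
    if not documents:
        return []
    majority = Counter(codes).most_common(1)[0][0]
    return [doc for doc, code in zip(documents, codes) if code == majority]


abstracts = ["A study of graphs", "Uno studio dei dati", "Models of language"]
codes = ["en", "it", "en"]
print(drop_minority_languages(abstracts, codes))
# ['A study of graphs', 'Models of language']
```

In Orange itself, the same filtering could be done with Select Rows once the language code is present as a column.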

PrimozGodec added the "help wanted" and "feast" (This may require a few weeks of work) labels on Jul 31, 2023
@gmolledaj

gmolledaj commented Jan 19, 2025

I have written a Python Script that adds a column with the language code.
The comments in the code were originally in Esperanto. I'm sorry, but English is discriminatory and has made me spend more than 10,000 euros for each son.

from langdetect import detect
from orangecontrib.text.corpus import Corpus
from Orange.data import StringVariable, Domain
import numpy as np

# Input variable (in_data, of type Corpus)
# Make a copy of in_data to work on
out_data = in_data.copy()

def detect_language(text):
    try:
        return detect(text)
    except Exception:
        return 'nekonaton'  # On error, return 'nekonaton' (Esperanto for "unknown")

tekstaj_trajtoj = out_data.text_features

if tekstaj_trajtoj:
    # Use the first text feature found
    teksta_trajto = tekstaj_trajtoj[0]
    teksto_kolona_nomo = teksta_trajto.name
    # Get the texts from the detected column
    tekstoj = out_data.get_column(teksta_trajto)

    # Detect the language of each comment
    lingvo_kodoj = [detect_language(comment) for comment in tekstoj]

    # Create a new StringVariable for the column
    lingvo_var = StringVariable("lingvo_kodo")

    # Create a new domain with the new column added to the metas
    new_domain = Domain(
        out_data.domain.attributes,
        out_data.domain.class_vars,
        out_data.domain.metas + (lingvo_var,),
    )

    # Extend the metas array to include the new data
    new_metas = out_data.metas
    # Convert the codes to a 2D (n x 1) list so they can be stacked as a column
    lingvo_kodoj_tabelo = [[code] for code in lingvo_kodoj]
    new_metas = np.hstack((new_metas, lingvo_kodoj_tabelo))

    # Create the new Corpus with the updated domain
    out_data = Corpus.from_numpy(
        new_domain,
        out_data.X,
        out_data.Y,
        new_metas,
        out_data.W,
        text_features=out_data.text_features,
    )

    # print("Column 'lingvo_kodo' added to out_data")

else:
    print("There are no comments in the Corpus")
