Automatic language detection in Corpus with language listed in output #983

wvdvegte · 2023-06-01T14:45:44Z

Is your feature request related to a problem? Please describe.
I have a data file with abstracts from PhD theses in a column "ABSTRACT". Most abstracts are in English, but some are in Italian, and I want to be able to filter by language (e.g. using Select Rows). I get the impression that the latest versions of the Text add-on (>= 1.13) support automatic language detection, and in #916 it is suggested that this has now been incorporated in Corpus. However, from what I see, the language still has to be selected by the user, and there is no column in the output that lists the language.

Describe the solution you'd like
In the Language drop-down menu, add the option "guess" to the list of languages, and add a column "Language" to the output that lists the language of each row.

Describe alternatives you've considered
As I've learned that the appearance of the word "il" (= "the") distinguishes Italian text from English text, I inserted a Feature Constructor with the following assignment:

Language := 'IT' if 'il' in ABSTRACT.lower().split() or ABSTRACT.lower().startswith('il') else 'EN'

However, it works for this specific case only and is rather awkward.
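
The single-word test can be generalized a little by scoring each text against a small stopword list per language. The following is a minimal sketch in plain Python, not the add-on's detection method: the function name `guess_language` and the tiny stopword sets are assumptions chosen for illustration, and a real detector would need far larger lists or a statistical model.

```python
import re

# Hypothetical, deliberately tiny stopword lists; extend them for real data.
STOPWORDS = {
    "EN": {"the", "and", "of", "is", "in", "to", "this", "that"},
    "IT": {"il", "la", "e", "di", "che", "è", "un", "per", "della"},
}

def guess_language(text, default="unknown"):
    """Return the language code whose stopwords occur most often in text."""
    tokens = re.findall(r"\w+", text.lower())
    scores = {
        code: sum(token in words for token in tokens)
        for code, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    top = scores[best]
    # No stopword hit at all, or a tie: report the default instead of guessing.
    if top == 0 or list(scores.values()).count(top) > 1:
        return default
    return best
```

Unlike the one-word rule, this returns a default value when no stopword matches or the scores tie, so very short or numeric abstracts are not silently labelled as English.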

PrimozGodec · 2023-06-08T09:28:29Z

@wvdvegte, orange3-text guesses the language on a corpus basis (not on a document basis). So it doesn't add any column to the corpus; it just sets one common language for the whole corpus. The guessing mechanism sets the initial value of the language dropdown in the Corpus widget.

The corpus's language is used to initialize the language dropdowns in other widgets automatically. All methods in the add-on that support a language setting can use only one language per corpus, which is why the language is set on a corpus basis.

What you suggested is to have a method to guess language on a document basis. If we implement that, I would suggest implementing it in a separate widget.
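
To illustrate the difference between the two levels: a corpus-level language can be derived from per-document guesses by taking the most frequent code. This sketch only illustrates that idea and is not how orange3-text computes its guess; `corpus_language` is a hypothetical helper, and the per-document codes are assumed to come from some detector.

```python
from collections import Counter

def corpus_language(doc_languages, ignore=("unknown",)):
    """Return the most frequent language code, or None if nothing is usable."""
    counts = Counter(code for code in doc_languages if code not in ignore)
    if not counts:
        return None
    # most_common(1) returns [(code, count)] for the top-ranked code.
    return counts.most_common(1)[0][0]
```

A document-level method keeps the whole list of codes, while a corpus-level method collapses it to this single value.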

wvdvegte · 2023-06-08T11:39:48Z

@PrimozGodec, thanks for the explanation. Indeed, it would make more sense to do this in a separate widget. I think it would be useful for data cleaning: sometimes a corpus gets "polluted" with documents in a different language, and it would be nice to be able to exclude them.
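
The data-cleaning step described here could look roughly like the sketch below: given one language code per document, keep the documents in the wanted language and set the rest aside. `split_by_language` is a hypothetical helper in plain Python; inside Orange, the same selection could be made with Select Rows once a language column exists.

```python
def split_by_language(documents, doc_languages, keep="en"):
    """Split documents into (kept, excluded) by their detected language code."""
    if len(documents) != len(doc_languages):
        raise ValueError("documents and doc_languages must have equal length")
    kept, excluded = [], []
    for doc, code in zip(documents, doc_languages):
        # Route each document to one of the two lists based on its code.
        (kept if code == keep else excluded).append(doc)
    return kept, excluded
```

Returning the excluded documents as well makes it easy to inspect what was filtered out before discarding it.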

gmolledaj · 2025-01-19T11:41:01Z

I have written a Python Script that adds a column with the language code.
The names and messages in the code are in Esperanto. I'm sorry, but English is discriminatory and has made me spend more than 10,000 euros for each son.

from langdetect import DetectorFactory, detect
from orangecontrib.text.corpus import Corpus
from Orange.data import StringVariable, Domain
import numpy as np

# langdetect is non-deterministic by default; a fixed seed makes results repeatable
DetectorFactory.seed = 0

# Input variable: in_data (of type Corpus), provided by the Python Script widget
# Work on a copy of in_data
out_data = in_data.copy()

def detect_language(text):
    try:
        return detect(text)
    except Exception:
        return 'nekonaton'  # on error, return 'nekonaton' ("unknown")

tekstaj_trajtoj = out_data.text_features

if tekstaj_trajtoj:
    # Use the first text feature found
    teksta_trajto = tekstaj_trajtoj[0]
    # Get the texts from the detected column
    tekstoj = out_data.get_column(teksta_trajto)

    # Detect the language of each document
    lingvo_kodoj = [detect_language(comment) for comment in tekstoj]

    # Create a new StringVariable for the column
    lingvo_var = StringVariable("lingvo_kodo")

    # Create a new domain with the new column added to the metas
    new_domain = Domain(
        out_data.domain.attributes,
        out_data.domain.class_vars,
        out_data.domain.metas + (lingvo_var,),
    )

    # Append the language codes to the metas as one extra object-typed column
    lingvo_kodoj_tabelo = np.array(lingvo_kodoj, dtype=object).reshape(-1, 1)
    new_metas = np.hstack((out_data.metas, lingvo_kodoj_tabelo))

    # Create the new Corpus with the updated domain
    out_data = Corpus.from_numpy(
        new_domain,
        out_data.X,
        out_data.Y,
        new_metas,
        out_data.W,
        text_features=out_data.text_features,
    )

else:
    # Message: "There are no comments in the Corpus"
    print("Ne estas komentoj en la Korpuso")

PrimozGodec added the "help wanted" and "feast" (This may require a few weeks of work) labels · Jul 31, 2023
