Fulltext and suggestion handling of multiple languages #734

Open
kelson42 opened this issue Nov 2, 2022 · 4 comments

@kelson42
Contributor

kelson42 commented Nov 2, 2022

This ticket is a follow-up of kiwix/libkiwix#785

The current libzim search features do not work well with content in different languages, whether or not that content is in the same ZIM. A search can basically apply only one language strategy (so only one stemmer and only one stopword list).

As a consequence, the multizim search/suggestion feature is limited to one language, which is annoying under certain circumstances.

We had a first short list of approaches to go forward on this at kiwix/libkiwix#785 (comment).

@kelson42
Contributor Author

kelson42 commented Nov 3, 2022

Add a language tag to each document. The databases are then considered multilanguage by definition. [New indexation strategy]

@mgautierfr I’m quite interested in this proposal because it might also solve the problem of a ZIM file with articles in different languages. Could you please elaborate on how this would work, for both indexing and search?

@mgautierfr
Collaborator

For now, the language is a property of the whole database.
We could instead add a property to each article giving the language of that article. For most databases, it would simply move the fra information from the database to every article. For multilanguage ZIM files (not really handled for now), we would need a way to tell libzim the language of each article (probably by extending the IndexData API).
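To make the per-article language idea concrete, here is a minimal sketch in plain Python. It is not libzim or xapian code: `ToyIndex`, `toy_stem`, and the `lang=<code>` filter term are all hypothetical names invented for illustration, and the stemmer is a crude stand-in for a real language-specific one.

```python
from collections import defaultdict


def toy_stem(word, lang):
    """Hypothetical stand-in for a real per-language stemmer."""
    suffixes = {"fra": ("s",), "eng": ("ing", "s")}
    word = word.lower()
    for suffix in suffixes.get(lang, ()):
        # Only strip the suffix if a reasonable stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


class ToyIndex:
    """In-memory index where the language is a property of each document."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.languages = set()            # every language seen while indexing

    def add_document(self, doc_id, text, lang):
        # Stem each word with the stemmer of the document's own language.
        for word in text.split():
            self.postings[toy_stem(word, lang)].add(doc_id)
        # Tag the document with a filter term carrying its language.
        self.postings["lang=" + lang].add(doc_id)
        self.languages.add(lang)


index = ToyIndex()
index.add_document(1, "chats noirs", "fra")
index.add_document(2, "cats sleeping", "eng")
print(sorted(index.postings))
```

With this layout a single database can hold articles in several languages, and the language tag becomes an ordinary term that a query can filter on.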

For searching, I see different strategies:

  • We assume the user searches in only one language (French here). We parse/stem the query using this language and we search for articles matching <query> AND lang=fra. So it returns only French articles and we don't search for articles in other languages.
  • We want to search in different languages. We parse/stem the query several times (once per language) and we search with the query (<parsed_query_fra> AND lang=fra) OR (<parsed_query_eng> AND lang=eng) OR (<parsed_query_spa> AND lang=spa). The list of languages can come from the user (select box and corresponding API) or from the languages we have in the (multi)database(s).
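Both strategies can be sketched with one helper that takes a list of languages: a single-element list gives the first strategy, several languages give the second. This is plain Python that only builds a readable boolean expression; `build_query` and `toy_stem` are hypothetical names, the stemmer is a toy, and a real implementation would build a xapian query object instead of a string.

```python
def toy_stem(word, lang):
    """Hypothetical stand-in for a real per-language stemmer."""
    suffixes = {"fra": ("s",), "eng": ("ing", "s")}
    word = word.lower()
    for suffix in suffixes.get(lang, ()):
        # Only strip the suffix if a reasonable stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_query(user_query, languages):
    """Build one '(terms AND lang=xxx)' clause per language, OR-ed together."""
    clauses = []
    for lang in languages:
        # The same user input is parsed/stemmed once per language.
        terms = [toy_stem(word, lang) for word in user_query.split()]
        clauses.append("(" + " AND ".join(terms + ["lang=" + lang]) + ")")
    return " OR ".join(clauses)


# Strategy 1: a single language.
print(build_query("sleeping cats", ["fra"]))
# Strategy 2: several languages, one stemmed sub-query each.
print(build_query("sleeping cats", ["fra", "eng"]))
```

Because each clause carries its own language filter, terms stemmed with the French rules can only match articles tagged as French, and likewise for the other languages.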

@kelson42 kelson42 modified the milestones: 8.2.0, 8.3.0 Apr 6, 2023
@kelson42 kelson42 modified the milestones: 9.0.0, 9.1.0 Sep 26, 2023
@kelson42
Contributor Author

kelson42 commented Oct 29, 2023

@mgautierfr What you propose seems appropriate to me. But:

  • Would that also fix the multizim and multilanguage searches?
  • Could we achieve an « auto search » that way, where the search runs automatically on all the languages available in the ZIM content?

@mgautierfr
Collaborator

Would that also allow to fix the multizim and multilanguage searches?

Yes and no.
Multilang (multizim or not) requires that we change how we create the fulltext db, so yes.
However, multilang and multizim with the current xapian format would not work, because we can't have (or I don't know how to have) a criterion such as db_level_property_lang=fra.

Could we achieve that way to have an « auto search », where the search executes automatically on all available (from zim content perspective) languages?

Yes, we can store at db level a list of all the articles' languages, and so know for which languages to build the query.
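A minimal sketch of that « auto search » language selection, in plain Python. The `languages` key on each database record and the `languages_to_query` helper are hypothetical; the thread does not specify how or where the list would actually be stored.

```python
def languages_to_query(databases, requested=None):
    """Collect the languages stored at db level across all opened databases.

    If the user requested specific languages, keep only those that are
    actually present in at least one database.
    """
    available = set()
    for db in databases:
        # Assumption: each database exposes the list of its articles' languages.
        available.update(db["languages"])
    if requested is None:
        return sorted(available)
    return sorted(available & set(requested))


databases = [
    {"name": "zim_a", "languages": ["fra"]},
    {"name": "zim_b", "languages": ["eng", "fra"]},
]
print(languages_to_query(databases))
print(languages_to_query(databases, requested=["fra", "deu"]))
```

The resulting list would then decide how many per-language sub-queries are built for a multizim search.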
