Should merge_results take the union or the intersection of the resulting documents? #1

Open
Dim131 opened this issue Apr 7, 2019 · 1 comment
Labels
question Further information is requested

Comments

@Dim131
Collaborator

Dim131 commented Apr 7, 2019

In NgramPostingLists we currently take the union of the results for the different n-grams. This means that if the query changes from "atlantis" to "an atlantis", the number of documents returned becomes much larger.

Do we want this?

@Dim131 Dim131 added the question Further information is requested label Apr 7, 2019
@DevanKuleindiren
Owner

Good point. I think we should still take the union: if we take the intersection, we might start dropping documents that are great matches but don't contain every single n-gram from the query.

For example, suppose you have:
query = "what kind of company is google"
doc1 = "google is a technology company"

Whilst doc1 is a great match, taking the intersection would drop it for not matching every single n-gram from the query.
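
To make the example concrete, here is a minimal sketch of the two merge strategies. It is not the project's actual merge_results: the helper names (ngrams, build_index, merge_union, merge_intersection) are hypothetical, and it assumes word unigrams (n=1) and a plain dict-of-sets index, which may differ from how NgramPostingLists works.

```python
from collections import defaultdict


def ngrams(text, n=1):
    """Return the set of word n-grams in text (n=1 is an assumption)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def build_index(docs, n=1):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text, n):
            index[gram].add(doc_id)
    return index


def merge_union(index, query, n=1):
    """Documents matching at least one n-gram of the query."""
    result = set()
    for gram in ngrams(query, n):
        result |= index.get(gram, set())
    return result


def merge_intersection(index, query, n=1):
    """Documents matching every n-gram of the query."""
    postings = [index.get(gram, set()) for gram in ngrams(query, n)]
    return set.intersection(*postings) if postings else set()


docs = {"doc1": "google is a technology company"}
index = build_index(docs)
query = "what kind of company is google"

union_hits = merge_union(index, query)
intersection_hits = merge_intersection(index, query)
print(union_hits)         # {'doc1'}
print(intersection_hits)  # set()
```

Under these assumptions, doc1 survives the union because it shares "company", "is" and "google" with the query, but the intersection is empty because doc1 contains none of "what", "kind" or "of".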

I realize that taking the union means we will get more results for longer queries, but I think it's better not to drop potentially very relevant results. Taking the intersection might give better precision, but taking the union will almost certainly give better recall. In the precision/recall trade-off, we probably want to prioritize recall, since our answer-extraction component should take care of finding the best answer amongst the top results.
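
As a rough illustration of that trade-off, the sketch below computes precision and recall for two result sets. The document ids and the relevant set are made up for illustration, not measurements from this project, and precision_recall is a hypothetical helper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document ids."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical outcome for one query: d1 and d2 are the relevant documents.
relevant = {"d1", "d2"}
union_result = {"d1", "d2", "d3", "d4"}  # broad: matches any query n-gram
intersection_result = {"d1"}             # narrow: matches every query n-gram

union_pr = precision_recall(union_result, relevant)
intersection_pr = precision_recall(intersection_result, relevant)
print(union_pr)         # (0.5, 1.0)
print(intersection_pr)  # (1.0, 0.5)
```

In this made-up case the union returns every relevant document at the cost of some noise, while the intersection returns only clean results but misses d2. A downstream ranking or answer-extraction step can filter noise, but it cannot recover a document that was never retrieved.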

What do you think?
