Should merge_results take the union or the intersection of the resulting documents? #1

Dim131 · 2019-04-07T23:01:06Z

In NgramPostingLists we are taking the union of the results for the different n-grams. This means that if the query is changed from "atlantis" to "an atlantis", the number of documents returned becomes much larger.

Do we want this?

DevanKuleindiren · 2019-04-08T23:39:13Z

Good point. I think we should still take the union, because if we take the intersection then we might start dropping documents which are really great matches, but don't contain every single n-gram from the query.

For example, suppose you have:
query = "what kind of company is google"
doc1 = "google is a technology company"

Whilst doc1 is a great match, it would be dropped for not matching every single n-gram from the query.
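
To make this concrete, here is a minimal Python sketch of the two merge strategies applied to the example above. It assumes word unigrams and a toy in-memory index; `build_index`, `posting_lists`, `merge_union` and `merge_intersection` are hypothetical names for illustration, not the actual NgramPostingLists or merge_results code.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word (treated here as a unigram) to the set of doc ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def posting_lists(index, query):
    # One posting list per query n-gram; n-grams missing from the index get an empty set.
    return [index.get(token, set()) for token in query.lower().split()]


def merge_union(lists):
    # A document is kept if it matches at least one query n-gram.
    return set().union(*lists)


def merge_intersection(lists):
    # A document is kept only if it matches every query n-gram.
    return set.intersection(*lists) if lists else set()


docs = {"doc1": "google is a technology company"}
index = build_index(docs)
lists = posting_lists(index, "what kind of company is google")

print(merge_union(lists))         # {'doc1'}
print(merge_intersection(lists))  # set()
```

Under these assumptions, "what", "kind" and "of" have empty posting lists, so the intersection is empty even though doc1 matches "company", "is" and "google".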

I realize that taking the union means we will get more results for longer queries, but I think it's better not to drop potentially very relevant results. Taking the intersection might give better precision, but taking the union will almost certainly give better recall. In the precision/recall trade-off, I think we want recall more, since our answer-extraction component should take care of finding the best answer amongst the top results.
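
As a rough illustration of that trade-off, the sketch below computes precision and recall for both strategies. The posting lists and the relevance judgement are made up for the example, and `precision_recall` is a hypothetical helper, not part of the codebase.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up posting lists for a two-n-gram query, plus a made-up relevance judgement.
posting_a = {"d1", "d2", "d3", "d4"}  # a common n-gram, e.g. "an"
posting_b = {"d2", "d5"}              # a rare n-gram, e.g. "atlantis"
relevant = {"d2", "d5"}

union = posting_a | posting_b         # 5 documents
intersection = posting_a & posting_b  # 1 document

print(precision_recall(union, relevant))         # (0.4, 1.0)
print(precision_recall(intersection, relevant))  # (1.0, 0.5)
```

Since the intersection is always a subset of the union, recall under the intersection can never exceed recall under the union; the precision gain, on the other hand, depends on the data.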

What do you think?

Dim131 added the "question" label (Further information is requested) on Apr 7, 2019
