Searchable.search doesn't return keyword's position information #224

Open
lmtoo opened this issue May 21, 2020 · 5 comments

Comments

@lmtoo

lmtoo commented May 21, 2020

Searchable.search doesn't return the keyword's position information, such as the page number or text position.

@lmtoo lmtoo changed the title Searchable.search dosn't return keyword's position information Searchable.search doesn't return keyword's position information May 21, 2020
@paulcwarren
Owner

Hmmm, yeah, interesting issue. The search integration aims to return entities whose associated content matches the search terms.

That said, I can see how this would be useful. Perhaps the searchContent endpoint should return a resultset that links to the entity and also supplies additional information about the match including pageNumber, text position, relevancy and so on.
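To make the idea concrete, here is a minimal sketch of what such a result could look like. It is plain Kotlin, not Spring Content API: `ContentMatch` and `findMatches` are hypothetical names, the scan runs over already-extracted page text rather than asking the search engine, and relevancy is left out because it would have to come from the engine's scoring.

```kotlin
// Hypothetical result type: none of these names come from Spring Content.
data class ContentMatch(
    val entityId: String,  // link back to the entity that owns the content
    val pageNumber: Int,   // 1-based page on which the keyword was found
    val position: Int      // character offset of the match within that page's text
)

// Naive, case-insensitive scan over already-extracted page text.
// A real integration would take this information from the search engine instead.
fun findMatches(entityId: String, pages: List<String>, keyword: String): List<ContentMatch> {
    if (keyword.isEmpty()) return emptyList()
    val matches = mutableListOf<ContentMatch>()
    pages.forEachIndexed { index, text ->
        var from = text.indexOf(keyword, 0, ignoreCase = true)
        while (from >= 0) {
            matches += ContentMatch(entityId, index + 1, from)
            // Continue after the current hit so overlapping matches are not reported twice.
            from = text.indexOf(keyword, from + keyword.length, ignoreCase = true)
        }
    }
    return matches
}
```

Calling `findMatches("doc-1", listOf("Spring Content", "content about content"), "content")` yields one match on page 1 and two on page 2, each with its character offset.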

@paulcwarren
Owner

@lmtoo you are using Elasticsearch, correct?

@lmtoo
Author

lmtoo commented Aug 27, 2020

Hi @paulcwarren, I removed Spring Content's Elasticsearch module and implemented a similar feature myself:

  1. Use a TextExtractor to extract the document's words.

  2. Index the document's words with Elasticsearch.

  3. Use Spring Batch to run this job.

The TextExtractor looks like this:

```kotlin
interface TextExtractor {

    fun consumes(): String

    // Returns one list element per page, each holding that page's words.
    fun extract(resource: Resource): List<String>

}
```

The extract method returns the document's words page by page: each element in the list holds one page's words.

Each page's words map to a DocumentPage instance, which has a contentId, a pageNumber and pageContent.
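As a rough illustration of that mapping, assuming the three fields named above and nothing more, it could be sketched like this. The field types, the 1-based page numbering and the helper name `toDocumentPages` are assumptions made for the example, not the actual implementation.

```kotlin
// Field names follow the description above; the types are assumptions.
data class DocumentPage(
    val contentId: String,
    val pageNumber: Int,
    val pageContent: String
)

// Pairs each extracted page with a 1-based page number (the numbering base is an assumption).
// `pages` stands for the list that TextExtractor.extract returns.
fun toDocumentPages(contentId: String, pages: List<String>): List<DocumentPage> =
    pages.mapIndexed { index, words -> DocumentPage(contentId, index + 1, words) }
```

Each resulting DocumentPage can then be indexed as its own search document, so a hit on a page already carries the pageNumber it came from.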

@paulcwarren
Owner

I see. So you have a custom solution for the page numbers part of it then. That makes sense because, to the best of my knowledge, neither Elasticsearch nor Solr can provide page number information. The closest features they offer are term vectors (for position) and highlighting (for marked-up abstracts). Even then, I don't think SolrJ (the client API we use) supports term vectors. Plus, I have little to no experience of how accurate the position information is when it is taken from extracted text and then applied to the original document content.

That said, I am definitely happy to extend the Spring Content fulltext modules to support both term vectors and highlighting, and then we can see if there is a customization for supporting page numbers, but I can't think how to do that cleanly atm. While I'm there I will have a go at tackling your previous issue #223 too.

@paulcwarren
Owner

So, here is where we are at with this one. Spring Content Solr, Elasticsearch and REST all now support custom search types, allowing you to define your own result type to be returned from a searchContent query. This supports custom attributes and highlighting.

I would like to understand your solution more though, to see how we progress from here.

If I understand your solution, it sounds like you have one DocumentPage for each page of a document, and each page's content is associated with its DocumentPage instance. It's unclear to me whether you still use searchContent to search that content, or whether you run some other search against the word index directly.
