[Bug]: Number of retrieved nodes not equal to the similarity_top_k set in VectorIndexRetriever #17407
Comments
You likely have duplicate nodes in your index. Nodes are deduplicated after retrieval.
You can confirm this by using a lower-level API.
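The snippet that accompanied this comment is not preserved here. As a stand-in, the following is a minimal sketch of the check itself using plain Python data: take the raw hits returned by the vector store before any deduplication, group them by text, and compare the counts. The `Hit` class and the sample values are hypothetical and are not part of LlamaIndex or Qdrant.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical stand-in for one raw vector-store result."""
    node_id: str
    text: str
    score: float


def group_by_text(hits):
    """Map each distinct text to the list of node IDs that carry it."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit.text].append(hit.node_id)
    return dict(groups)


# Made-up raw results: five hits, but only three distinct texts.
raw_hits = [
    Hit("a1", "Section one", 0.91),
    Hit("b1", "Section one", 0.91),
    Hit("a2", "Section two", 0.85),
    Hit("b2", "Section two", 0.85),
    Hit("a3", "Section three", 0.80),
]

groups = group_by_text(raw_hits)
print(f"raw hits: {len(raw_hits)}, distinct texts: {len(groups)}")
for text, ids in groups.items():
    if len(ids) > 1:
        print(f"{text!r} is stored under {len(ids)} node IDs: {ids}")
```

If the number of distinct texts is smaller than the number of raw hits, a retriever that deduplicates by text will return fewer nodes than the requested top k, even when every node ID is unique.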
@logan-markewich Thanks a lot! I ran the code and yes, there are duplicated nodes, but the node IDs are all different. I used the code below to create and save the index:
And the output is:
I think these two numbers are expected to be the same? Is there something wrong with the index? Thanks!
It's deduplicating based on the text. As it is today, this is expected. It seems like the markdown reader is parsing your data into duplicate chunks. Have you looked at the nodes it's creating? Tbh I wouldn't combine these two; just use one or the other. It's a bit unintuitive, but the markdown reader is already splitting.
You can also deduplicate your nodes before indexing too.
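One way to do that deduplication, sketched under assumptions: keep the first occurrence of each distinct text and drop the rest before building the index. The helper name `dedupe_by_text` and the `get_text` parameter are made up for illustration, and exact-match comparison on the text is a choice made here, not something the thread specifies. With LlamaIndex nodes you would presumably pass something like `get_text=lambda n: n.get_content()`.

```python
import hashlib


def dedupe_by_text(nodes, get_text=lambda n: n):
    """Return the first occurrence of each distinct text, preserving order.

    `get_text` extracts the text from a node; by default the node itself
    is assumed to be a string.
    """
    seen = set()
    unique = []
    for node in nodes:
        # Hash the text so the `seen` set stays small for long chunks.
        digest = hashlib.sha256(get_text(node).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(node)
    return unique


# Made-up chunks containing two repeats.
chunks = ["Intro", "Setup", "Intro", "Usage", "Setup"]
unique_chunks = dedupe_by_text(chunks)
print(unique_chunks)
```

This only removes exact textual repeats. Chunks that differ by whitespace or overlap partially would still be treated as distinct.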
@logan-markewich I found that I made a mistake earlier. Each time I created and saved the index, the database accumulated. So the markdown reader did not parse my data into duplicate chunks; rather, I ran the script several times to create and save the index, which produced a cumulative database and thus duplicate chunks in the index loaded later. After I deleted the database, ran the script only once to create and save the index, and then loaded it to retrieve, the number of nodes retrieved matched the top_k. But I have one doubt here. In the index which had duplicate chunks (i.e., the case when the
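The accumulation described in this comment can be illustrated with a toy model. The sketch below is not Qdrant or LlamaIndex code: `ToyStore`, `stable_id`, and `ingest` are hypothetical names, and the store is just a dictionary keyed by point ID. It shows why re-running an ingestion script with freshly generated random IDs grows the store on every run, while IDs derived from the content make repeated runs overwrite the same entries. Whether a given integration lets you control IDs this way is an assumption to verify; deleting the database before re-indexing, as done above, is the approach this thread confirms.

```python
import uuid

# Arbitrary fixed namespace so that derived IDs are reproducible across runs.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")


def stable_id(source, text):
    """Derive a deterministic ID from the source name and the chunk text."""
    return str(uuid.uuid5(NAMESPACE, f"{source}\n{text}"))


class ToyStore:
    """Dictionary-backed stand-in for a persistent vector store."""

    def __init__(self):
        self.points = {}

    def upsert(self, point_id, text):
        # Same ID overwrites; a new ID adds a new entry.
        self.points[point_id] = text


def ingest(store, source, chunks, deterministic):
    """Write every chunk to the store, with either stable or random IDs."""
    for chunk in chunks:
        point_id = stable_id(source, chunk) if deterministic else str(uuid.uuid4())
        store.upsert(point_id, chunk)


chunks = ["Intro", "Setup", "Usage"]
random_store, stable_store = ToyStore(), ToyStore()

# Simulate running the same indexing script three times.
for _ in range(3):
    ingest(random_store, "doc.md", chunks, deterministic=False)
    ingest(stable_store, "doc.md", chunks, deterministic=True)

print(len(random_store.points), len(stable_store.points))
```

With random IDs the three runs leave nine entries for three chunks, which is the kind of duplication that makes a text-deduplicating retriever return fewer nodes than the top k.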
Bug Description
Hi, I am using Qdrant as a vector store for a document, and then using VectorIndexRetriever to do a similarity top-k search for an input query. I set up the retriever as:
But I only got 4 retrieved nodes:
If I set a smaller top k, the number of nodes retrieved is smaller too, but it never equals the top k.
Version
llama_index.core.version='0.12.8'
Steps to Reproduce
Run the code below with the path pointing to the attached folder (in the zip file) and the error will be reproduced.
doc_vector.zip
Output:
Relevant Logs/Tracebacks
No response