
[Bug]: Number of retrieved nodes not equal to the similarity_top_k set in VectorIndexRetriever #17407

Open
Howie-Arup opened this issue Jan 2, 2025 · 6 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@Howie-Arup

Bug Description

Hi, I am using Qdrant to store a document as a vector store, and then using VectorIndexRetriever to do a similarity top-k search for an input query. I set up the retriever as:

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

But I only got 4 retrieved nodes:

retrieved_nodes = retriever.retrieve("what is this guideline about")
print(f"\nNumber of nodes retrieved: {len(retrieved_nodes)}")
Output: Number of nodes retrieved: 4

If I set a smaller top_k, the number of retrieved nodes also decreases, but the two never match.

Version

llama_index.core.version='0.12.8'

Steps to Reproduce

Run the code below with the path pointing to the attached folder (in zip) and the issue will be reproduced.
doc_vector.zip

from llama_index.core import Settings
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core.indices.vector_store.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from qdrant_client import QdrantClient

Settings.context_window = 120000

azure_llm = AzureOpenAI(
    model="gpt-4-1106-preview",
    deployment_name="gpt-4-1106-preview",
    api_key=AZURE_API_KEY,
    azure_endpoint=AZURE_API_ENDPOINT,
    api_version=AZURE_API_VERSION,
    temperature=0,
)

azure_embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="text-embedding-ada-002",
    api_key=EMBEDDING_AZURE_API_KEY,
    azure_endpoint=EMBEDDING_AZURE_API_ENDPOINT,
    api_version=EMBEDDING_AZURE_API_VERSION,
)

Settings.llm = azure_llm
Settings.embed_model = azure_embed_model

client = QdrantClient(path=...)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="test"
)

storage_context = StorageContext.from_defaults(
    persist_dir=...,
    vector_store=vector_store
)
index = load_index_from_storage(storage_context)

# Note: index._vector_store.__doc__ is the class docstring, not the node count
print(f"Total nodes in vector store: {len(vector_store.get_nodes())}")
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

query_engine = RetrieverQueryEngine.from_args(
    index.as_retriever(sub_retrievers=[retriever]),
    llm=azure_llm,
)

retrieved_nodes = retriever.retrieve("what is this guideline about")
print(f"\nNumber of nodes retrieved: {len(retrieved_nodes)}")
    
for i, node in enumerate(retrieved_nodes):
    print(f"Node {i+1}:")
    print(f"Text: {node.node.get_text()[:100]}...")  
    print(f"Score: {node.score}")
    print("----------------------------------------------------------")

Output:

Number of nodes retrieved: 4
Node 1:
Text: 11.1 GENERAL

This chapter provides guidelines on the design of manholes....
Score: 0.7686689584315392
----------------------------------------------------------
Node 2:
Text: 1.1 SCOPE

This Manual offers guidance on the planning, design, operation and maintenance of stormwa...
Score: 0.7667132384163163
----------------------------------------------------------
Node 3:
Text: 14.2.6 Harbourfront Enhancement

For new polder and floodwater pumping facilities to be provided on ...
Score: 0.765393196106186
----------------------------------------------------------
Node 4:
Text: 16.1 INTRODUCTION
Page: 105...
Score: 0.7629867536021274
----------------------------------------------------------

Relevant Logs/Tracebacks

No response

@Howie-Arup Howie-Arup added bug Something isn't working triage Issue needs to be triaged/prioritized labels Jan 2, 2025
@logan-markewich
Collaborator

You likely have duplicate nodes in your index? The nodes are deduplicated after retrieval.

@logan-markewich
Collaborator

logan-markewich commented Jan 2, 2025

You can confirm this by using a lower-level API:

from llama_index.core.vector_stores.types import VectorStoreQuery

result = vector_store.query(VectorStoreQuery(
    query_str="hello world",
    query_embedding=embed_model.get_query_embedding("hello world"),
    similarity_top_k=10,
))

# i think it's result.nodes? result.ids works too
print(len(result.nodes))

@Howie-Arup
Author

Howie-Arup commented Jan 2, 2025


@logan-markewich Thanks a lot! I used that code and yes, there are duplicate nodes. But the node IDs are all different. I used the code below to create and save the index:

client = QdrantClient(path=db_path)
vector_store = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
)
storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
)

reader = MarkdownReader()
documents = reader.load_data(r'...\Manual.txt')
node_parser = MarkdownNodeParser()
nodes = node_parser.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes=nodes, storage_context=storage_context, show_progress=True)

print(f"Number of nodes before indexing: {len(nodes)}")
print(f"Number of nodes in vector store: {len(vector_store.get_nodes())}")

index.storage_context.persist(persist_dir=...)

And the output is:

Number of nodes before indexing: 420
Number of nodes in vector store: 4478

I think these two numbers are expected to be the same? Is there something wrong with the index? Thanks!

@logan-markewich
Collaborator

It's deduplicating based on the text. As it is today, this is expected.

It seems like the markdown reader is parsing your data into duplicate chunks

Have you looked at the nodes it's creating? Tbh I wouldn't combine these two, just use one or the other. It's a bit unintuitive, but the markdown reader is already splitting.

@logan-markewich
Collaborator

You can also deduplicate your nodes before indexing.
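A minimal sketch of that idea, deduplicating by exact text before building the index. This is not a LlamaIndex API, just a hypothetical helper; plain strings stand in for node contents here (with real nodes you would hash `node.get_content()` instead):

```python
import hashlib

def dedupe_by_text(texts):
    """Keep the first occurrence of each distinct text, drop exact duplicates."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

print(len(dedupe_by_text(["11.1 GENERAL", "1.1 SCOPE", "11.1 GENERAL"])))  # 2
```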

@Howie-Arup
Author

Howie-Arup commented Jan 2, 2025

@logan-markewich I found that I made a mistake earlier. Each time I created and saved the index, the database grew cumulatively. So the markdown reader did not parse my data into duplicate chunks; rather, I ran the creation script several times, which accumulated duplicate chunks in the database and thus in the index loaded later. After I deleted the database and ran the creation script only once, then loaded the index for retrieval, the number of retrieved nodes matched the top_k.
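One way to make re-runs of an ingestion script idempotent is to derive each node's ID deterministically from its content, since Qdrant overwrites points that share an ID on upsert rather than adding new ones. A sketch of stable ID generation (the namespace UUID and the idea of assigning `node.id_ = stable_id(node.get_content())` before indexing are hypothetical, not a documented LlamaIndex pattern):

```python
import uuid

# Any fixed namespace works; it just has to stay constant across runs.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")

def stable_id(text: str) -> str:
    """Map identical text to the identical UUID on every run."""
    return str(uuid.uuid5(NAMESPACE, text))

# Re-ingesting the same chunk produces the same point ID:
print(stable_id("11.1 GENERAL") == stable_id("11.1 GENERAL"))  # True
```

With stable IDs, re-running the script upserts the same points instead of accumulating duplicates; deleting the collection before each run, as described above, achieves the same end.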

But I have one doubt here. In the index that had duplicate chunks (i.e., the case where the number of retrieved nodes differed from the top_k set in the retriever), how does the retriever remove duplicates from the retrieved top-k results? For example, in the initial top-10 nodes below, there are five duplicates with different node IDs. Does it compare the texts? I didn't find where the de-duplication happens in VectorIndexRetriever. Sorry if I missed anything.

(screenshot: the initial top-10 retrieved nodes, including duplicates with distinct node IDs)
