
[Bug]: Number of retrieved nodes not equal to the similarity_top_k set in VectorIndexRetriever #17407

Open
Howie-Arup opened this issue Jan 2, 2025 · 6 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@Howie-Arup

Bug Description

Hi, I am using Qdrant to store a document as a vector store, and then using VectorIndexRetriever to do a similarity top-k search for an input query. I set up the retriever as:

retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

But I only got 4 retrieved nodes:

retrieved_nodes = retriever.retrieve("what is this guideline about")
print(f"\nNumber of nodes retrieved: {len(retrieved_nodes)}")
Output: Number of nodes retrieved: 4

If I set a smaller top_k, the number of retrieved nodes also decreases, but the two never match.

Version

llama_index.core.version='0.12.8'

Steps to Reproduce

Run the code below with the path pointing to the attached folder (in zip) and the issue will be reproduced.
doc_vector.zip

from llama_index.core import Settings
from llama_index.core import StorageContext, load_index_from_storage
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core.indices.vector_store.retrievers import VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from qdrant_client import QdrantClient

Settings.context_window = 120000

azure_llm = AzureOpenAI(
    model="gpt-4-1106-preview",
    deployment_name="gpt-4-1106-preview",
    api_key=AZURE_API_KEY,
    azure_endpoint=AZURE_API_ENDPOINT,
    api_version=AZURE_API_VERSION,
    temperature=0,
)

azure_embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="text-embedding-ada-002",
    api_key=EMBEDDING_AZURE_API_KEY,
    azure_endpoint=EMBEDDING_AZURE_API_ENDPOINT,
    api_version=EMBEDDING_AZURE_API_VERSION,
)

Settings.llm = azure_llm
Settings.embed_model = azure_embed_model

client = QdrantClient(path=...)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="test"
)

storage_context = StorageContext.from_defaults(
    persist_dir=...,
    vector_store=vector_store
)
index = load_index_from_storage(storage_context)

# Note: index._vector_store.__doc__ is the class docstring, not the node count
print(f"Total nodes in vector store: {len(vector_store.get_nodes())}")
retriever = VectorIndexRetriever(
    index=index,
    similarity_top_k=10,
)

query_engine = RetrieverQueryEngine.from_args(
    index.as_retriever(sub_retrievers=[retriever]),
    llm=azure_llm,
)

retrieved_nodes = retriever.retrieve("what is this guideline about")
print(f"\nNumber of nodes retrieved: {len(retrieved_nodes)}")
    
for i, node in enumerate(retrieved_nodes):
    print(f"Node {i+1}:")
    print(f"Text: {node.node.get_text()[:100]}...")  
    print(f"Score: {node.score}")
    print("----------------------------------------------------------")

Output:

Number of nodes retrieved: 4
Node 1:
Text: 11.1 GENERAL

This chapter provides guidelines on the design of manholes....
Score: 0.7686689584315392
----------------------------------------------------------
Node 2:
Text: 1.1 SCOPE

This Manual offers guidance on the planning, design, operation and maintenance of stormwa...
Score: 0.7667132384163163
----------------------------------------------------------
Node 3:
Text: 14.2.6 Harbourfront Enhancement

For new polder and floodwater pumping facilities to be provided on ...
Score: 0.765393196106186
----------------------------------------------------------
Node 4:
Text: 16.1 INTRODUCTION
Page: 105...
Score: 0.7629867536021274
----------------------------------------------------------

Relevant Logs/Tracebacks

No response

@Howie-Arup Howie-Arup added bug Something isn't working triage Issue needs to be triaged/prioritized labels Jan 2, 2025
@logan-markewich
Collaborator

You likely have duplicate nodes in your index? The nodes are deduplicated after retrieval.

@logan-markewich
Collaborator

logan-markewich commented Jan 2, 2025

You can confirm this by using a lower-level API:

from llama_index.core.vector_stores.types import VectorStoreQuery

result = vector_store.query(VectorStoreQuery(
    query_str="hello world",
    query_embedding=embed_model.get_query_embedding("hello world"),
    similarity_top_k=10,
))

# i think it's result.nodes? result.ids works too
print(len(result.nodes))

@Howie-Arup
Author

Howie-Arup commented Jan 2, 2025


@logan-markewich Thanks a lot! I used that code and yes, there are duplicate nodes. But the node IDs are all different. I used the code below to create and save the index:

client = QdrantClient(path=db_path)
vector_store = QdrantVectorStore(
    client=client,
    collection_name=collection_name,
)
storage_context = StorageContext.from_defaults(
    vector_store=vector_store,
)

reader = MarkdownReader()
documents = reader.load_data(r'...\Manual.txt')
node_parser = MarkdownNodeParser()
nodes = node_parser.get_nodes_from_documents(documents)

index = VectorStoreIndex(nodes=nodes, storage_context=storage_context, show_progress=True)

print(f"Number of nodes before indexing: {len(nodes)}")
print(f"Number of nodes in vector store: {len(vector_store.get_nodes())}")

index.storage_context.persist(persist_dir=...)

And the output is:

Number of nodes before indexing: 420
Number of nodes in vector store: 4478

I think these two numbers are expected to be the same? Is there something wrong with the index? Thanks!

@logan-markewich
Collaborator

It's deduplicating based on the text. As it is today, this is expected.

It seems like the markdown reader is parsing your data into duplicate chunks

Have you looked at the nodes it's creating? Tbh I wouldn't combine these two, just use one or the other. It's a bit unintuitive, but the markdown reader is already splitting.

@logan-markewich
Collaborator

You can also deduplicate your nodes before indexing.
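A minimal sketch of that idea, deduplicating by exact text before building the index. This is not a LlamaIndex API, just a hypothetical helper; plain strings stand in for node contents here (with real nodes you would hash `node.get_content()` instead):

```python
import hashlib

def dedupe_by_text(texts):
    """Keep the first occurrence of each distinct text, drop exact duplicates."""
    seen = set()
    unique = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

print(len(dedupe_by_text(["11.1 GENERAL", "1.1 SCOPE", "11.1 GENERAL"])))  # 2
```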

@Howie-Arup
Author

Howie-Arup commented Jan 2, 2025

@logan-markewich I found that I made a mistake earlier. Each time I created and saved the index, the database grew cumulatively. So the markdown reader did not parse my data into duplicate chunks; rather, I ran the creation script several times, which accumulated duplicate chunks in the database and thus in the index loaded later. After I deleted the database and ran the creation script only once, then loaded the index for retrieval, the number of retrieved nodes matched the top_k.
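One way to make re-runs of an ingestion script idempotent is to derive each node's ID deterministically from its content, since Qdrant overwrites points that share an ID on upsert rather than adding new ones. A sketch of stable ID generation (the namespace UUID and the idea of assigning `node.id_ = stable_id(node.get_content())` before indexing are hypothetical, not a documented LlamaIndex pattern):

```python
import uuid

# Any fixed namespace works; it just has to stay constant across runs.
NAMESPACE = uuid.UUID("12345678-1234-5678-1234-567812345678")

def stable_id(text: str) -> str:
    """Map identical text to the identical UUID on every run."""
    return str(uuid.uuid5(NAMESPACE, text))

# Re-ingesting the same chunk produces the same point ID:
print(stable_id("11.1 GENERAL") == stable_id("11.1 GENERAL"))  # True
```

With stable IDs, re-running the script upserts the same points instead of accumulating duplicates; deleting the collection before each run, as described above, achieves the same end.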

But I have one doubt here. In the index that had duplicate chunks (i.e., the case where the number of retrieved nodes differed from the top_k set in the retriever), how does the retriever remove duplicates from the retrieved top-k results? For example, in the initial top-10 nodes below, there are five duplicates with different node IDs. Does it compare the texts? I didn't find where the de-duplication happens in VectorIndexRetriever. Sorry if I missed anything.

(screenshot: the initial top-10 retrieved nodes, including duplicates with distinct node IDs)
