
[Bug]: Duplicated citation nodes in CitationQueryEngine when chunk_size is less than the content length #17439

Open
minmie opened this issue Jan 6, 2025 · 1 comment · May be fixed by #17440
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@minmie

minmie commented Jan 6, 2025

Bug Description

As shown in the screenshot below, when chunk_size is smaller than the content length, the sub-nodes are duplicated.

TextNode.model_validate doesn't create a new instance but returns node.node itself.

[screenshot: debug view showing duplicated citation sub-nodes]
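
This can be demonstrated in isolation (a minimal sketch, assuming pydantic v2's default revalidate_instances='never', under which model_validate returns an instance of the same model unchanged):

from llama_index.core.schema import TextNode

node = TextNode(text="original text")
same = TextNode.model_validate(node)  # returns the instance itself, not a copy
print(same is node)                   # True

same.set_content("overwritten")
print(node.get_content())             # "overwritten" -- the original mutated too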

Version

0.12.5

Steps to Reproduce

from llama_index.core.query_engine import CitationQueryEngine
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai.utils import ALL_AVAILABLE_MODELS, CHAT_MODELS
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

MODEL='ep-20241223171230-8tv46'
Settings.context_window = 4096
ALL_AVAILABLE_MODELS[MODEL] = 4000
CHAT_MODELS[MODEL] = 4000
Settings.llm = OpenAI(temperature=0.1,
                        model=MODEL,
                      api_base='https://ark.cn-beijing.volces.com/api/v3',
                      api_key='xxxx',
                      max_tokens=512
                      )


Settings.embed_model = HuggingFaceEmbedding(
    model_name="/home/chenjq/models/m3e-base"
)



text = """
Introduction#
What is context augmentation?#
LLMs offer a natural language interface between humans and data. LLMs come pre-trained on huge amounts of publicly available data, but they are not trained on your data. Your data may be private or specific to the problem you're trying to solve. It's behind APIs, in SQL databases, or trapped in PDFs and slide decks.

Context augmentation makes your data available to the LLM to solve the problem at hand. LlamaIndex provides the tools to build any of context-augmentation use case, from prototype to production. Our tools allow you to ingest, parse, index and process your data and quickly implement complex query workflows combining data access with LLM prompting.

The most popular example of context-augmentation is Retrieval-Augmented Generation or RAG, which combines context with LLMs at inference time.

What are agents?#
Agents are LLM-powered knowledge assistants that use tools to perform tasks like research, data extraction, and more. Agents range from simple question-answering to being able to sense, decide and take actions in order to complete tasks.

LlamaIndex provides a framework for building agents including the ability to use RAG pipelines as one of many tools to complete a task.

What are workflows?#
Workflows are multi-step processes that combine one or more agents, data connectors, and other tools to complete a task. They are event-driven software that allows you to combine RAG data sources and multiple agents to create a complex application that can perform a wide variety of tasks with reflection, error-correction, and other hallmarks of advanced LLM applications. You can then deploy these agentic workflows as production microservices.

LlamaIndex is the framework for Context-Augmented LLM Applications#
LlamaIndex imposes no restriction on how you use LLMs. You can use LLMs as auto-complete, chatbots, agents, and more. It just makes using them easier. We provide tools like:

Data connectors ingest your existing data from their native source and format. These could be APIs, PDFs, SQL, and (much) more.
Data indexes structure your data in intermediate representations that are easy and performant for LLMs to consume.
Engines provide natural language access to your data. For example:
Query engines are powerful interfaces for question-answering (e.g. a RAG flow).
Chat engines are conversational interfaces for multi-message, "back and forth" interactions with your data.
Agents are LLM-powered knowledge workers augmented by tools, from simple helper functions to API integrations and more.
Observability/Evaluation integrations that enable you to rigorously experiment, evaluate, and monitor your app in a virtuous cycle.
Workflows allow you to combine all of the above into an event-driven system far more flexible than other, graph-based approaches.
Use cases#
Some popular use cases for LlamaIndex and context augmentation in general include:

Question-Answering (Retrieval-Augmented Generation aka RAG)
Chatbots
Document Understanding and Data Extraction
Autonomous Agents that can perform research and take actions
Multi-modal applications that combine text, images, and other data types
Fine-tuning models on data to improve performance
Check out our use cases documentation for more examples and links to tutorials.

👨‍👩‍👧‍👦 Who is LlamaIndex for?#
LlamaIndex provides tools for beginners, advanced users, and everyone in between.

Our high-level API allows beginner users to use LlamaIndex to ingest and query their data in 5 lines of code.

For more complex applications, our lower-level APIs allow advanced users to customize and extend any module -- data connectors, indices, retrievers, query engines, and reranking modules -- to fit their needs.

Getting Started#
LlamaIndex is available in Python (these docs) and Typescript. If you're not sure where to start, we recommend reading how to read these docs which will point you to the right place based on your experience level.
"""
text_list = [text]
documents = [Document(text=t) for t in text_list]
node_parser = SentenceSplitter(chunk_size=500, chunk_overlap=50)
nodes = node_parser.get_nodes_from_documents(
    documents, show_progress=False
)
# build index
index = VectorStoreIndex(nodes)

query_engine = CitationQueryEngine.from_args(
    index,
    similarity_top_k=3,
    citation_chunk_size=200,  # smaller than the retrieved nodes' content!
)

response = query_engine.query("What are agents?")


print(response)
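
To make the duplication visible, one can also dump the citation source nodes afterwards (a quick check; response.source_nodes is assumed here to hold the citation nodes the engine built):

# With the bug, every chunk split from the same retrieved node prints
# identical content, because they all share one underlying TextNode.
for i, src in enumerate(response.source_nodes):
    print(f"--- Source {i + 1} ---")
    print(src.node.get_content()[:120])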

Relevant Logs/Tracebacks

No response

@minmie minmie added bug Something isn't working triage Issue needs to be triaged/prioritized labels Jan 6, 2025

dosubot bot commented Jan 6, 2025

To resolve the issue of duplicated citation nodes in the CitationQueryEngine when the chunk_size is less than the content length in version 0.12.5, you should ensure that the text_splitter is correctly configured to handle the splitting of text into chunks without duplication. The text_splitter is responsible for dividing the content into chunks based on the specified citation_chunk_size and citation_chunk_overlap.

Here's a relevant part of the code that deals with text splitting:

self.text_splitter = text_splitter or SentenceSplitter(
    chunk_size=citation_chunk_size, chunk_overlap=citation_chunk_overlap
)

The SentenceSplitter or any TextSplitter used should be configured to handle overlaps correctly to avoid duplication. If the default SentenceSplitter is causing duplication, you might need to implement a custom TextSplitter that better handles your specific content structure and avoids creating overlapping chunks that lead to duplicated nodes.

If the issue persists, consider reviewing the logic in the _create_citation_nodes method to ensure that it correctly processes the split text chunks without introducing duplicates:

def _create_citation_nodes(self, nodes: List[NodeWithScore]) -> List[NodeWithScore]:
    new_nodes: List[NodeWithScore] = []
    for node in nodes:
        text_chunks = self.text_splitter.split_text(
            node.node.get_content(metadata_mode=self._metadata_mode)
        )

        for text_chunk in text_chunks:
            text = f"Source {len(new_nodes) + 1}:\n{text_chunk}\n"

            new_node = NodeWithScore(
                # NOTE: as reported above, model_validate() returns node.node
                # itself here, so each set_content() call below overwrites
                # the same underlying node
                node=TextNode.model_validate(node.node), score=node.score
            )
            new_node.node.set_content(text)
            new_nodes.append(new_node)
    return new_nodes

Ensure that the split_text method in your TextSplitter implementation is correctly handling the overlap and chunking logic to prevent duplication. Adjust the citation_chunk_size and citation_chunk_overlap parameters as needed to fit your content's structure [1].
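
One way to avoid the shared instance (a sketch only, not necessarily what the linked PR #17440 does) is to copy the node explicitly before overwriting its content, e.g. with pydantic's model_copy():

new_node = NodeWithScore(
    node=node.node.model_copy(),  # a real copy, so set_content() below
    score=node.score,             # no longer mutates the retrieved node
)
new_node.node.set_content(text)
new_nodes.append(new_node)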


@minmie minmie linked a pull request Jan 6, 2025 that will close this issue