Bug Description
As shown in the picture, when citation_chunk_size is less than the content length, the citation sub-nodes are duplicated.
TextNode.model_validate doesn't create a new instance but returns node.node itself.
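The pass-through behavior can be demonstrated with plain pydantic (TextNode is a pydantic model; this is a minimal sketch assuming pydantic v2's default revalidate_instances='never' configuration, not llama-index itself):

```python
from pydantic import BaseModel


class Node(BaseModel):
    text: str


node = Node(text="some chunk")

# model_validate on an instance of the same model class does NOT copy:
# with the default revalidate_instances='never', the instance is
# returned as-is, so mutating the result mutates the original node.
same = Node.model_validate(node)
print(same is node)  # True

# An explicit copy is needed to get an independent instance.
fresh = node.model_copy()
print(fresh is node)  # False
```

This is why splitting and relabeling the "validated" node ends up modifying the retrieved node itself.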
Version
0.12.5
Steps to Reproduce
from llama_index.core.query_engine import CitationQueryEngine
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.openai.utils import ALL_AVAILABLE_MODELS, CHAT_MODELS
from llama_index.llms.openai import OpenAI
from llama_index.core import Settings
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

MODEL = 'ep-20241223171230-8tv46'
Settings.context_window = 4096
ALL_AVAILABLE_MODELS[MODEL] = 4000
CHAT_MODELS[MODEL] = 4000
Settings.llm = OpenAI(
    temperature=0.1,
model=MODEL,
api_base='https://ark.cn-beijing.volces.com/api/v3',
api_key='xxxx',
max_tokens=512
)
Settings.embed_model = HuggingFaceEmbedding(
model_name="/home/chenjq/models/m3e-base"
)
text="""Introduction#What is context augmentation?#LLMs offer a natural language interface between humans and data. LLMs come pre-trained on huge amounts of publicly available data, but they are not trained on your data. Your data may be private or specific to the problem you're trying to solve. It's behind APIs, in SQL databases, or trapped in PDFs and slide decks.Context augmentation makes your data available to the LLM to solve the problem at hand. LlamaIndex provides the tools to build any of context-augmentation use case, from prototype to production. Our tools allow you to ingest, parse, index and process your data and quickly implement complex query workflows combining data access with LLM prompting.The most popular example of context-augmentation is Retrieval-Augmented Generation or RAG, which combines context with LLMs at inference time.What are agents?#Agents are LLM-powered knowledge assistants that use tools to perform tasks like research, data extraction, and more. Agents range from simple question-answering to being able to sense, decide and take actions in order to complete tasks.LlamaIndex provides a framework for building agents including the ability to use RAG pipelines as one of many tools to complete a task.What are workflows?#Workflows are multi-step processes that combine one or more agents, data connectors, and other tools to complete a task. They are event-driven software that allows you to combine RAG data sources and multiple agents to create a complex application that can perform a wide variety of tasks with reflection, error-correction, and other hallmarks of advanced LLM applications. You can then deploy these agentic workflows as production microservices.LlamaIndex is the framework for Context-Augmented LLM Applications#LlamaIndex imposes no restriction on how you use LLMs. You can use LLMs as auto-complete, chatbots, agents, and more. It just makes using them easier. 
We provide tools like:Data connectors ingest your existing data from their native source and format. These could be APIs, PDFs, SQL, and (much) more.Data indexes structure your data in intermediate representations that are easy and performant for LLMs to consume.Engines provide natural language access to your data. For example:Query engines are powerful interfaces for question-answering (e.g. a RAG flow).Chat engines are conversational interfaces for multi-message, "back and forth" interactions with your data.Agents are LLM-powered knowledge workers augmented by tools, from simple helper functions to API integrations and more.Observability/Evaluation integrations that enable you to rigorously experiment, evaluate, and monitor your app in a virtuous cycle.Workflows allow you to combine all of the above into an event-driven system far more flexible than other, graph-based approaches.Use cases#Some popular use cases for LlamaIndex and context augmentation in general include:Question-Answering (Retrieval-Augmented Generation aka RAG)ChatbotsDocument Understanding and Data ExtractionAutonomous Agents that can perform research and take actionsMulti-modal applications that combine text, images, and other data typesFine-tuning models on data to improve performanceCheck out our use cases documentation for more examples and links to tutorials.👨👩👧👦 Who is LlamaIndex for?#LlamaIndex provides tools for beginners, advanced users, and everyone in between.Our high-level API allows beginner users to use LlamaIndex to ingest and query their data in 5 lines of code.For more complex applications, our lower-level APIs allow advanced users to customize and extend any module -- data connectors, indices, retrievers, query engines, and reranking modules -- to fit their needs.Getting Started#LlamaIndex is available in Python (these docs) and Typescript. 
If you're not sure where to start, we recommend reading how to read these docs which will point you to the right place based on your experience level."""
text_list = [text]
documents = [Document(text=t) for t in text_list]
node_parser = SentenceSplitter(chunk_size=500, chunk_overlap=50)
nodes=node_parser.get_nodes_from_documents(
documents, show_progress=False
)
# build index
index = VectorStoreIndex(nodes)
query_engine=CitationQueryEngine.from_args(
index,
similarity_top_k=3,
citation_chunk_size=200,  # smaller than the retrieved nodes!
)
response = query_engine.query("What are agents?")
print(response)
Relevant Logs/Tracebacks
No response
To resolve the issue of duplicated citation nodes in the CitationQueryEngine when the chunk_size is less than the content length in version 0.12.5, you should ensure that the text_splitter is correctly configured to handle the splitting of text into chunks without duplication. The text_splitter is responsible for dividing the content into chunks based on the specified citation_chunk_size and citation_chunk_overlap.
The relevant code path is the text splitter invoked inside CitationQueryEngine, which re-splits each retrieved node into citation chunks of citation_chunk_size with citation_chunk_overlap.
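In paraphrase (plain strings instead of llama-index node objects; the function and variable names here are illustrative, not the library's exact source), the citation-node creation amounts to:

```python
def create_citation_nodes(texts, split_text):
    """Re-split each retrieved text and prefix every chunk with a
    running "Source N:" label, as CitationQueryEngine does with its
    internal text splitter."""
    citation_nodes = []
    counter = 1
    for text in texts:
        for chunk in split_text(text):
            citation_nodes.append(f"Source {counter}:\n{chunk}")
            counter += 1
    return citation_nodes


# If split_text returns overlapping chunks, the overlap text shows up
# in more than one citation node, which reads as duplicated content.
chunks = create_citation_nodes(
    ["alpha beta gamma"],
    lambda t: [t[:10], t[5:]],  # deliberately overlapping split
)
print(chunks)  # ['Source 1:\nalpha beta', 'Source 2:\n beta gamma']
```

The sketch shows how any overlap produced by the splitter is carried straight into the numbered citation nodes.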
The SentenceSplitter or any TextSplitter used should be configured to handle overlaps correctly to avoid duplication. If the default SentenceSplitter is causing duplication, you might need to implement a custom TextSplitter that better handles your specific content structure and avoids creating overlapping chunks that lead to duplicated nodes.
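For example, the core of a zero-overlap splitter (a sketch of the splitting logic only; a real llama-index TextSplitter subclass would wrap this in its split_text method) could look like:

```python
def split_text_no_overlap(text, chunk_size=200):
    """Cut text into fixed-size chunks that share no characters,
    so no content can appear in two citation nodes."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


pieces = split_text_no_overlap("a" * 450, chunk_size=200)
print([len(p) for p in pieces])  # [200, 200, 50]
```

Because the chunks partition the text exactly, concatenating them reproduces the original content with nothing repeated.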
If the issue persists, consider reviewing the logic in the _create_citation_nodes method to ensure that it correctly processes the split text chunks without introducing duplicates:
Ensure that the split_text method in your TextSplitter implementation is correctly handling the overlap and chunking logic to prevent duplication. Adjust the citation_chunk_size and citation_chunk_overlap parameters as needed to fit your content's structure [1].