Docs changes (#1462)
* up

* up

* up

* up
shreyaspimpalgaonkar authored Oct 23, 2024
1 parent f6e2789 commit 0c8f327
Showing 4 changed files with 181 additions and 21 deletions.
51 changes: 42 additions & 9 deletions docs/cookbooks/advanced-graphrag.mdx
@@ -135,23 +135,36 @@ Prompt tuning produces:

Prompt tuning allows us to generate communities that better reflect the natural organization of the domain knowledge while maintaining more precise technical and thematic boundaries between related concepts.

## Contextual Embeddings
## Contextual Chunk Enrichment

Contextual embeddings are a technique that allows us to capture the semantic meaning of the entities and relationships in the knowledge graph. This is done by using a combination of the entity's textual description and its contextual embeddings.
Contextual chunk enrichment captures the semantic meaning of the entities and relationships in the knowledge graph by combining each entity's textual description with its contextual embeddings. This enrichment process enhances the quality and depth of information in your knowledge graph by:

1. Analyzing the surrounding context of each entity mention
2. Incorporating semantic information from related passages
3. Preserving important contextual nuances that might be lost in simple entity extraction

You can learn more about contextual chunk enrichment [here](/cookbooks/contextual-enrichment).


### Entity Deduplication

Note that the entities and triples are created at the document level. This means that if you have multiple documents with the same entity, the entity will be duplicated for each document.
When creating a knowledge graph across multiple documents, entities are initially created at the document level. This means that the same real-world entity (e.g., "Albert Einstein" or "CRISPR") might appear multiple times if it's mentioned in different documents. This duplication can lead to:

To deduplicate the entities, you can run the `deduplicate-entities` endpoint. This endpoint will merge duplicate entities and delete the duplicate entities.
- Redundant information in your knowledge graph
- Fragmented relationships across duplicate entities
- Increased storage and processing overhead
- Potentially inconsistent entity descriptions

The `deduplicate-entities` endpoint addresses these issues by:
1. Identifying similar entities by name (currently exact match; other strategies coming soon)
2. Merging their properties and relationships
3. Maintaining the most comprehensive description
4. Removing the duplicate entries

<Tabs>
<Tab title="CLI">
```bash
r2r deduplicate-entities --collection-id=122fdf6a-e116-546b-a8f6-e4cb2e2c0a09
r2r deduplicate-entities --collection-id=122fdf6a-e116-546b-a8f6-e4cb2e2c0a09 --run

# Example Response
[{'message': 'Deduplication task queued successfully.', 'task_id': 'd9dae1bb-5862-4a16-abaf-5297024df390'}]
@@ -163,15 +176,35 @@ r2r deduplicate-entities --collection-id=122fdf6a-e116-546b-a8f6-e4cb2e2c0a09
from r2r import R2RClient

client = R2RClient("http://localhost:7272")
client.deduplicate_entities(collection_id="122fdf6a-e116-546b-a8f6-e4cb2e2c0a09")
client.deduplicate_entities(
collection_id="122fdf6a-e116-546b-a8f6-e4cb2e2c0a09",
run_type="run"
)

# Example Response
[{'message': 'Deduplication task queued successfully.', 'task_id': 'd9dae1bb-5862-4a16-abaf-5297024df390'}]
```
</Tab>
</Tabs>

You can check the status of the deduplication task using the hatchet dashboard on http://localhost:7274. And once that is complete, check the endpoints at a entity_level = collection to see the deduplicated entities and triples.
#### Monitoring Deduplication

You can monitor the deduplication process in two ways:

1. **Hatchet Dashboard**: Access the dashboard at http://localhost:7274 to view:
- Task status and progress
- Any errors or warnings
- Completion time estimates

2. **API Endpoints**: Once deduplication is complete, verify the results using these endpoints with `entity_level = collection` (a query sketch follows this list):
- [Entities API](http://localhost:7272/v2/entities?collection_id=122fdf6a-e116-546b-a8f6-e4cb2e2c0a09&entity_level=collection)
- [Triples API](http://localhost:7272/v2/triples?collection_id=122fdf6a-e116-546b-a8f6-e4cb2e2c0a09&entity_level=collection)
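
As a quick programmatic check, the same endpoints can be queried with `requests`; this is a minimal sketch that assumes the default local deployment and that responses wrap their payload under a `results` key, so verify the shape against your R2R version:

```python
import requests

BASE = "http://localhost:7272/v2"
params = {
    "collection_id": "122fdf6a-e116-546b-a8f6-e4cb2e2c0a09",
    "entity_level": "collection",
}

# Fetch collection-level entities and triples once deduplication has finished.
entities = requests.get(f"{BASE}/entities", params=params).json()
triples = requests.get(f"{BASE}/triples", params=params).json()

# The `results` wrapper is an assumption; inspect the raw response if the keys differ.
print("entities:", len(entities.get("results", [])))
print("triples:", len(triples.get("results", [])))
```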

#### Best Practices

When using entity deduplication:

- Entities: [Entities](http://localhost:7272/v2/entities?collection_id=122fdf6a-e116-546b-a8f6-e4cb2e2c0a09&entity_level=collection)
- Triples: [Triples](http://localhost:7272/v2/triples?collection_id=122fdf6a-e116-546b-a8f6-e4cb2e2c0a09&entity_level=collection)
- Run deduplication after initial graph creation but before any enrichment steps
- Monitor the number of entities before and after to ensure expected reduction
- Review a sample of merged entities to verify accuracy (see the sketch after this list)
- For large collections, expect the process to take longer and plan accordingly
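
For the spot-check step, a small sketch like the one below can surface a few deduplicated entities for manual review; the `name` and `description` fields are assumptions about the response payload, so inspect one record first:

```python
import requests

resp = requests.get(
    "http://localhost:7272/v2/entities",
    params={
        "collection_id": "122fdf6a-e116-546b-a8f6-e4cb2e2c0a09",
        "entity_level": "collection",
    },
).json()

# Print a handful of merged entities for manual review.
for entity in resp.get("results", [])[:10]:
    print(entity.get("name"), "->", (entity.get("description") or "")[:80])
```
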
148 changes: 137 additions & 11 deletions docs/cookbooks/contextual-enrichment.mdx
@@ -1,25 +1,151 @@
# Contextual Chunk Enrichment
---
title: 'Contextual Chunk Enrichment'
description: 'Enhance your RAG system chunks with rich contextual information'
icon: 'puzzle-piece'
---

[In Progress]
# Understanding Chunk Enrichment in RAG Systems

When documents are ingested, they are broken into chunks. Chunks are a great way to store bits of information for vector search, but they may not always contain enough information for a downstream task like question answering.
In modern Retrieval-Augmented Generation (RAG) systems, documents are systematically broken down into smaller, manageable pieces called chunks. While chunking is essential for efficient vector search operations, these individual chunks sometimes lack the broader context needed for comprehensive question answering or analysis tasks.

Let's take an example of the lyft-2021 pdf here: https://github.com/SciPhi-AI/R2R/blob/main/py/core/examples/data/lyft_2021.pdf.
## The Challenge of Context Loss

One of the chunks we get is:
Let's examine a real-world example using Lyft's 2021 annual report (Form 10-K) from their [public filing](https://github.com/SciPhi-AI/R2R/blob/main/py/core/examples/data/lyft_2021.pdf).

```
During ingestion, this 200+ page document is broken into 1,223 distinct chunks. Consider this isolated chunk:

```plaintext
storing unrented and returned vehicles. These impacts to the demand for and operations of the different rental programs have and may continue to adversely affect our business, financial condition and results of operation.
```

This chunk does not contain specific information of Lyft's business, financial condition and results of operation.
Reading this chunk in isolation raises several questions:
- What specific impacts are being discussed?
- Which rental programs are affected?
- What's the broader context of these business challenges?

This is where contextual enrichment becomes invaluable.

## Introducing Contextual Enrichment

Contextual enrichment is an advanced technique that enhances chunks with relevant information from surrounding or semantically related content. Think of it as giving each chunk its own "memory" of related information.

### Enabling Enrichment

To activate this feature, configure your `r2r.toml` file with the following settings:

```toml
[ingestion.chunk_enrichment_settings]
enable_chunk_enrichment = true # disabled by default
strategies = ["semantic", "neighborhood"]
forward_chunks = 3 # Look ahead 3 chunks
backward_chunks = 3 # Look behind 3 chunks
semantic_neighbors = 10 # Find 10 semantically similar chunks
semantic_similarity_threshold = 0.7 # Minimum similarity score
generation_config = { model = "openai/gpt-4o-mini" }
```

## Enrichment Strategies Explained

R2R implements two sophisticated strategies for chunk enrichment (a combined sketch of both follows the descriptions below):

### 1. Neighborhood Strategy
This approach looks at the document's natural flow by examining chunks that come before and after the target chunk:
- **Forward Looking**: Captures upcoming context (configurable, default: 3 chunks)
- **Backward Looking**: Incorporates previous context (configurable, default: 3 chunks)
- **Use Case**: Particularly effective for narrative documents where context flows linearly

### 2. Semantic Strategy
This method uses advanced embedding similarity to find related content throughout the document:
- **Vector Similarity**: Identifies chunks with similar meaning regardless of location
- **Configurable Neighbors**: Customizable number of similar chunks to consider
- **Similarity Threshold**: Set minimum similarity scores to ensure relevance
- **Use Case**: Excellent for documents with themes repeated across different sections
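
To make the two strategies concrete, here is a minimal, illustrative sketch of how context chunks might be selected. It is not R2R's internal implementation; the function and variable names are invented, and the defaults mirror the `forward_chunks`, `backward_chunks`, `semantic_neighbors`, and `semantic_similarity_threshold` settings from the configuration above:

```python
import numpy as np

def select_context(chunks, embeddings, idx, forward=3, backward=3,
                   semantic_neighbors=10, threshold=0.7):
    """Pick context chunks for chunks[idx] using both strategies."""
    # Neighborhood strategy: chunks immediately before and after the target.
    lo, hi = max(0, idx - backward), min(len(chunks), idx + forward + 1)
    neighborhood = [chunks[i] for i in range(lo, hi) if i != idx]

    # Semantic strategy: other chunks ranked by cosine similarity to the target.
    target = embeddings[idx]
    sims = embeddings @ target / (
        np.linalg.norm(embeddings, axis=1) * np.linalg.norm(target) + 1e-9
    )
    ranked = np.argsort(-sims)
    semantic = [chunks[i] for i in ranked
                if i != idx and sims[i] >= threshold][:semantic_neighbors]

    # The union of both sets (order preserved, duplicates dropped) becomes
    # the context handed to the enrichment prompt.
    return list(dict.fromkeys(neighborhood + semantic))
```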

## The Enrichment Process

When enriching chunks, R2R uses a carefully crafted prompt to guide the LLM:

```plaintext
## Task:
Enrich and refine the given chunk of text using information from the provided context chunks. The goal is to make the chunk more precise and self-contained.
## Context Chunks:
{context_chunks}
## Chunk to Enrich:
{chunk}
## Instructions:
1. Rewrite the chunk in third person.
2. Replace all common nouns with appropriate proper nouns.
3. Use information from the context chunks to enhance clarity.
4. Ensure the enriched chunk remains independent and self-contained.
5. Maintain original scope without bleeding information.
6. Focus on precision and informativeness.
7. Preserve original meaning while improving clarity.
8. Output only the enriched chunk.
## Enriched Chunk:
```
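
As a sketch of how this template might be filled in and sent to the configured model, the snippet below uses `litellm` (an assumption on our part, chosen because the `generation_config` above uses a litellm-style model string); the template text mirrors the prompt shown above:

```python
from litellm import completion  # assumption: any chat-completion client would do

PROMPT_TEMPLATE = """## Task:
Enrich and refine the given chunk of text using information from the provided context chunks. The goal is to make the chunk more precise and self-contained.
## Context Chunks:
{context_chunks}
## Chunk to Enrich:
{chunk}
## Instructions:
1. Rewrite the chunk in third person.
2. Replace all common nouns with appropriate proper nouns.
3. Use information from the context chunks to enhance clarity.
4. Ensure the enriched chunk remains independent and self-contained.
5. Maintain original scope without bleeding information.
6. Focus on precision and informativeness.
7. Preserve original meaning while improving clarity.
8. Output only the enriched chunk.
## Enriched Chunk:"""

def enrich_chunk(chunk: str, context_chunks: list[str]) -> str:
    # Fill the template with the target chunk and its selected context.
    prompt = PROMPT_TEMPLATE.format(
        context_chunks="\n\n".join(context_chunks), chunk=chunk
    )
    response = completion(
        model="openai/gpt-4o-mini",  # matches generation_config above
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```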

## Implementation and Results

To process your documents with enrichment:

To address this, we can use contextual enrichment to add more information to the chunk.
```bash
r2r ingest-files --file_paths path/to/lyft_2021.pdf
```
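
If you prefer the Python SDK, the equivalent call is likely along the lines below; the method name mirrors the CLI command and is an assumption about the client API rather than a confirmed signature:

```python
from r2r import R2RClient

client = R2RClient("http://localhost:7272")

# Ingest the example report; enrichment runs during ingestion when
# `enable_chunk_enrichment = true` is set in r2r.toml.
client.ingest_files(file_paths=["path/to/lyft_2021.pdf"])
```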

After enrichment, we get the following chunk:
### Viewing Enriched Results

Access your enriched chunks through the API:
```
http://localhost:7272/v2/document_chunks/{document_id}
```
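
For example, a minimal `requests` sketch (the document ID is a placeholder, and the `results` wrapper is assumed based on the response shape shown in the Metadata and Storage section below):

```python
import requests

document_id = "your-document-id"  # placeholder; use the ID returned at ingestion
resp = requests.get(
    f"http://localhost:7272/v2/document_chunks/{document_id}"
).json()

# Print the first few enriched chunks.
for chunk in resp.get("results", [])[:3]:
    print(chunk["text"][:200])
    print("---")
```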
The impacts of the COVID-19 pandemic on the demand for and operations of the various vehicle rental programs, including Lyft Rentals and the Express Drive program, have resulted in challenges regarding the storage of unrented and returned vehicles. These adverse conditions are anticipated to continue affecting Lyft’s overall business performance, financial condition, and operational results.

Let's compare the before and after of our example chunk:

**Before Enrichment:**
```plaintext
storing unrented and returned vehicles. These impacts to the demand for and operations of the different rental programs have and may continue to adversely affect our business, financial condition and results of operation.
```

You can enable contextual enrichment by setting the `enrich_chunks` flag to `true` when creating a collection in r2r.toml.
**After Enrichment:**
```plaintext
The impacts of the COVID-19 pandemic on the demand for and operations of the various vehicle rental programs, including Lyft Rentals and the Express Drive program, have resulted in challenges regarding the storage of unrented and returned vehicles. These adverse conditions are anticipated to continue affecting Lyft's overall business performance, financial condition, and operational results.
```

Notice how the enriched version:
- Specifies the cause (COVID-19 pandemic)
- Names specific programs (Lyft Rentals, Express Drive)
- Provides clearer context about the business impact
- Maintains professional, third-person tone

## Metadata and Storage

The system maintains both enriched and original versions:

```json
{
"results": [
{
"text": "enriched_version",
"metadata": {
"original_text": "original_version",
"chunk_enrichment_status": "success",
// ... additional metadata ...
}
}
]
}
```

This dual storage ensures transparency and allows for version comparison when needed.
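
A small sketch of such a comparison, reusing the chunks endpoint above and assuming the metadata keys shown in the example response:

```python
import requests

document_id = "your-document-id"  # placeholder
chunks = requests.get(
    f"http://localhost:7272/v2/document_chunks/{document_id}"
).json().get("results", [])

# Compare each stored original against its enriched version.
for chunk in chunks:
    meta = chunk.get("metadata", {})
    if meta.get("chunk_enrichment_status") == "success":
        print("ORIGINAL:", meta.get("original_text", "")[:120])
        print("ENRICHED:", chunk.get("text", "")[:120])
        print("---")
```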

## Best Practices

1. **Tune Your Parameters**: Adjust `forward_chunks`, `backward_chunks`, and `semantic_neighbors` based on your document structure
2. **Monitor Enrichment Quality**: Regularly review enriched chunks to ensure they maintain accuracy
3. **Consider Document Type**: Different documents may benefit from different enrichment strategies
4. **Balance Context Size**: More context isn't always better - find the sweet spot for your use case
1 change: 1 addition & 0 deletions docs/mint.json
@@ -410,6 +410,7 @@
"pages": [
"cookbooks/walkthrough",
"cookbooks/ingestion",
"cookbooks/contextual-enrichment",
"cookbooks/hybrid-search",
"cookbooks/advanced-rag",
"cookbooks/graphrag",
2 changes: 1 addition & 1 deletion py/core/providers/kg/postgres.py
@@ -915,7 +915,7 @@ async def delete_node_via_document_id(
if count == 0:
# If it's the last document, delete collection-related data
collection_queries = [
f"DELETE FROM {self._get_table_name('community')} WHERE collection_id = $1",
f"DELETE FROM {self._get_table_name('community_info')} WHERE collection_id = $1",
f"DELETE FROM {self._get_table_name('community_report')} WHERE collection_id = $1",
]
for query in collection_queries:
