Dev minor (#1126)

* Feature/improve r2r telemetry (#1122)
* improve telemetry
* finish telemetry tweaks
* Feature/improve cli infra (#1123)
* improve telemetry
* finish telemetry tweaks
* up
* Feature/add serve fallback to main (#1125)
* improve telemetry
* finish telemetry tweaks
* up
* fallback to main
* Merge fragments (#1127)
* troubleshooting docs (#1128)
* troubleshooting docs (#1129)
* add system diagram (#1130)
* add system diagram
* rm multi
* fix overview
* cleanup and fix
* fix syntax
* change to fast strategy by default (#1133)
* Update parameter passing in js sdk (#1132)
* Docs changes + add entity and relationship types (#1134)
* up
* up
* up
* up
* reduce verbosity
* Feature/dev minor cleanups (#1135)
* cleanups
* bump pkg

---------

Co-authored-by: Shreyas Pimpalgaonkar <shreyas.gp.7@gmail.com>
Co-authored-by: Nolan Tremelling <34580718+NolanTrem@users.noreply.github.com>
SciPhi-AI · Sep 12, 2024 · dd6ef23
1 parent 74a5e51
Showing 74 changed files with 10,199 additions and 1,939 deletions.
diff --git a/.github/workflows/build-main-old.yml b/.github/workflows/build-main-old.yml
diff --git a/.github/workflows/build-main.yml b/.github/workflows/build-main.yml
@@ -18,13 +18,25 @@ jobs:
       release_version: ${{ steps.version.outputs.RELEASE_VERSION }}
       matrix: ${{ steps.set-matrix.outputs.matrix }}
     steps:
+      - name: Checkout Repository
+        uses: actions/checkout@v4
+
+      - name: Set up Python
+        uses: actions/setup-python@v4
+        with:
+          python-version: '3.10'
+
+      - name: Install toml package
+        run: pip install toml
+
       - name: Determine version to use
         id: version
         run: |
           if [ -n "${{ github.event.inputs.version }}" ]; then
             echo "RELEASE_VERSION=${{ github.event.inputs.version }}" >> $GITHUB_OUTPUT
           else
-            echo "RELEASE_VERSION=main" >> $GITHUB_OUTPUT
+            VERSION=$(python -c "import toml; print(toml.load('py/pyproject.toml')['tool']['poetry']['version'])")
+            echo "RELEASE_VERSION=$VERSION" >> $GITHUB_OUTPUT
           fi
 
       - name: Set matrix

diff --git a/.gitignore b/.gitignore
@@ -4,6 +4,7 @@
 *.gguf
 logs/
 workspace/
+py/workspace/
 uploads/
 env/
 **/__pycache__
@@ -19,6 +20,7 @@ coverage.xml
 
 node_modules/
 dist/
+**/.data/*
 
 *.exe
 *.exe~

diff --git a/docs/cookbooks/graphrag.mdx b/docs/cookbooks/graphrag.mdx
@@ -24,27 +24,21 @@ Note that graph construction may take long for local LLMs, we recommend using cl
 <Tabs>
 <Tab title="Cloud LLMs">
 ```bash
-r2r serve --config-name=neo4j_kg
+r2r serve
 ```
 
-<Accordion icon="gear" title="Configuration: neo4j_kg">
+<Accordion icon="gear" title="Configuration: r2r.toml">
 ``` toml
-[chunking]
-provider = "unstructured_local"
-strategy = "auto"
-chunking_strategy = "basic"
-new_after_n_chars = 2_048
-max_characters = 4_096 # use larger max_characters for KG construction
-combine_under_n_chars = 512
-overlap = 20
-
 [kg]
 provider = "neo4j"
 batch_size = 256
 kg_extraction_prompt = "graphrag_triplet_extraction_zero_shot"
 
   [kg.kg_creation_settings]
+    entity_types = [] # if empty, all entities are extracted
+    relation_types = [] # if empty, all relations are extracted
     max_knowledge_triples = 100
+    fragment_merge_count = 4 # number of fragments to merge into a single extraction
     generation_config = { model = "gpt-4o-mini" } # and other params, model used for triplet extraction
 
   [kg.kg_enrichment_settings]
@@ -104,7 +98,10 @@ provider = "neo4j"
 kg_extraction_prompt = "graphrag_triplet_extraction_zero_shot"
 
   [kg.kg_creation_settings]
+    entity_types = [] # if empty, all entities are extracted
+    relation_types = [] # if empty, all relations are extracted
     max_knowledge_triples = 100
+    fragment_merge_count = 4 # number of fragments to merge into a single extraction
     generation_config = { model = "ollama/llama3.1" } # and other params, model used for triplet extraction
 
   [kg.kg_enrichment_settings]
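
Reviewer note: the config comment describes `fragment_merge_count` as the number of fragments to merge into a single extraction. One plausible reading is that consecutive fragments are grouped in batches of that size before being sent to the extraction model. The sketch below illustrates only that reading. The helper `merge_fragments` and the newline join are assumptions, and R2R's actual merging logic is not shown in this diff.

```python
from typing import List


def merge_fragments(fragments: List[str], fragment_merge_count: int = 4) -> List[str]:
    """Group consecutive fragments into batches and join each batch into one text."""
    if fragment_merge_count < 1:
        raise ValueError("fragment_merge_count must be at least 1")
    return [
        "\n".join(fragments[i : i + fragment_merge_count])
        for i in range(0, len(fragments), fragment_merge_count)
    ]


# Ten hypothetical fragments with a merge count of 4 give three
# extraction inputs: two full batches of 4 and a final batch of 2.
fragments = [f"fragment {n}" for n in range(1, 11)]
merged = merge_fragments(fragments, fragment_merge_count=4)
print(len(merged))
```

Under this reading, a larger merge count means fewer, longer extraction calls, which interacts with `max_knowledge_triples` if that limit applies per extraction.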

diff --git a/docs/documentation/configuration/knowledge-graph/enrichment.mdx b/docs/documentation/configuration/knowledge-graph/enrichment.mdx
@@ -5,7 +5,7 @@ description: 'Configuration for Restructuring data after ingestion using Knowled
 
 It is often effective to restructure data after ingestion to improve retrieval performance and accuracy. R2R supports knowledge graphs for data restructuring. You can find out more about creating knowledge graphs in the [Knowledge Graphs Guide](/cookbooks/graphrag).
 
-You can configure knowledge graph enrichment in the R2R configuration file. To do this, just set the `kg.kg_enrichment_settings` section in the configuration file. Following is the sample format from the example configuration file `neo4j_kg.toml`.
+You can configure knowledge graph enrichment in the R2R configuration file. To do this, just set the `kg.kg_enrichment_settings` section in the configuration file. Following is the sample format from the example configuration file `r2r.toml`.
 
 ```toml
 [kg]
@@ -14,6 +14,9 @@ batch_size = 256
 kg_extraction_prompt = "graphrag_triplet_extraction_zero_shot"
 
   [kg.kg_creation_settings]
+    entity_types = [] # if empty, all entities are extracted
+    relation_types = [] # if empty, all relations are extracted
+    fragment_merge_count = 4 # number of fragments to merge into a single extraction
     max_knowledge_triples = 100 # max number of triples to extract for each document chunk
     generation_config = { model = "gpt-4o-mini" } # and other generation params
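
Reviewer note: the new `entity_types` and `relation_types` settings use "empty means everything" semantics. The sketch below shows that rule as a post-extraction filter so the behavior is easy to see. This is an illustration only: the `filter_triples` helper and the dictionary shape of a triple are hypothetical, and R2R may instead pass these lists into the extraction prompt rather than filtering afterwards.

```python
from typing import Dict, List


def filter_triples(
    triples: List[Dict[str, str]],
    entity_types: List[str],
    relation_types: List[str],
) -> List[Dict[str, str]]:
    """Keep triples whose types are allowed; an empty list allows every type."""

    def entity_ok(entity_type: str) -> bool:
        return not entity_types or entity_type in entity_types

    def relation_ok(relation: str) -> bool:
        return not relation_types or relation in relation_types

    return [
        t
        for t in triples
        if entity_ok(t["subject_type"])
        and entity_ok(t["object_type"])
        and relation_ok(t["predicate"])
    ]


# Hypothetical extracted triples.
triples = [
    {"subject": "Ada", "subject_type": "PERSON", "predicate": "WORKS_AT",
     "object": "Acme", "object_type": "ORGANIZATION"},
    {"subject": "Acme", "subject_type": "ORGANIZATION", "predicate": "LOCATED_IN",
     "object": "Paris", "object_type": "LOCATION"},
]

everything = filter_triples(triples, [], [])  # both lists empty: nothing is dropped
people_and_orgs = filter_triples(triples, ["PERSON", "ORGANIZATION"], [])
located_only = filter_triples(triples, [], ["LOCATED_IN"])
```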
 

diff --git a/docs/documentation/configuration/knowledge-graph/overview.mdx b/docs/documentation/configuration/knowledge-graph/overview.mdx
@@ -4,10 +4,10 @@ description: 'Configure your R2R knowledge graph provider.'
 ---
 ## Knowledge Graph Provider
 
-R2R supports knowledge graph functionality to enhance document understanding and retrieval. By default, R2R uses [Neo4j](https://neo4j.com/) as the knowledge graph provider. We are actively working to integrate with [Memgraph](https://memgraph.com/docs). You can find out more about creating knowledge graphs in the [Knowledge Graphs Guide](/cookbooks/graphrag).
+R2R supports knowledge graph functionality to enhance document understanding and retrieval. By default, R2R uses [Neo4j](https://neo4j.com/) as the knowledge graph provider. We are actively working to integrate with [Memgraph](https://memgraph.com/docs). You can find out more about creating knowledge graphs in the [GraphRAG Cookbook](/cookbooks/graphrag).
 
 
-To configure the knowledge graph settings:
+To configure the knowledge graph settings for your project:
 
 1. Edit the `kg` section in your `r2r.toml` file:
 
@@ -18,8 +18,11 @@ batch_size = 256
 kg_extraction_prompt = "graphrag_triplet_extraction_zero_shot"
 
   [kg.kg_creation_settings]
+    entity_types = [] # if empty, all entities are extracted
+    relation_types = [] # if empty, all relations are extracted
     generation_config = { model = "gpt-4o-mini" }
     max_knowledge_triples = 100 # max number of triples to extract for each document chunk
+    fragment_merge_count = 4 # number of fragments to merge into a single extraction
 
   [kg.kg_enrichment_settings]
     max_summary_input_length = 65536
@@ -38,6 +41,7 @@ Let's break down the knowledge graph configuration options:
 - `kg_extraction_prompt`: Specifies the prompt template to use for extracting knowledge graph information from text.
 - `kg_creation_settings`: Configuration for the model used in knowledge graph creation.
   - `max_knowledge_triples`: The maximum number of knowledge triples to extract for each document chunk.
+  - `fragment_merge_count`: The number of fragments to merge into a single extraction.
   - `generation_config`: Configuration for the model used in knowledge graph creation.
 - `kg_enrichment_settings`: Similar configuration for the model used in knowledge graph enrichment.
   - `generation_config`: Configuration for the model used in knowledge graph enrichment.
@@ -46,7 +50,7 @@ Let's break down the knowledge graph configuration options:
 
 ### Neo4j Configuration
 
-When using Neo4j as the knowledge graph provider, you need to set up the following environment variables or provide them in the `r2r.toml` file:
+When using Neo4j as the knowledge graph provider, you need to set up the following environment variables or provide them in the `r2r.toml` file. To set them as environment variables:
 
 ```bash
 export NEO4J_USER=your_neo4j_username
@@ -55,6 +59,8 @@ export NEO4J_URL=bolt://your_neo4j_host:7687
 export NEO4J_DATABASE=neo4j
 ```
 
+And to set them directly in your config:
+
 ```toml r2r.toml
 [kg]
 provider = "neo4j"
@@ -64,6 +70,10 @@ url = "bolt://your_neo4j_host:7687"
 database = "neo4j"
 ```
 
+<Note>
+Setting configuration values in the `r2r.toml` will override environment variables by default.
+</Note>
+
 
 ### Knowledge Graph Operations
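
Reviewer note: the added `<Note>` states that values in `r2r.toml` override environment variables by default. A minimal sketch of that precedence rule follows. The helper `resolve_neo4j_setting` is hypothetical and is not R2R's loader. It only covers the `url` and `database` keys, which appear in both the TOML and the `export` examples in this file.

```python
import os
from typing import Mapping, Optional


def resolve_neo4j_setting(
    key: str,
    kg_config: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the r2r.toml value if present, else the matching NEO4J_* variable."""
    env = os.environ if env is None else env
    if key in kg_config:  # a value set in r2r.toml takes precedence
        return kg_config[key]
    return env.get(f"NEO4J_{key.upper()}")  # e.g. "url" -> NEO4J_URL


# Hypothetical inputs: the URL is set in both places, the database only in the environment.
kg_section = {"url": "bolt://toml-host:7687"}
fake_env = {"NEO4J_URL": "bolt://env-host:7687", "NEO4J_DATABASE": "neo4j"}

print(resolve_neo4j_setting("url", kg_section, fake_env))       # value from r2r.toml
print(resolve_neo4j_setting("database", kg_section, fake_env))  # value from the environment
```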
 

diff --git a/docs/documentation/deep-dive/main/config.mdx b/docs/documentation/deep-dive/main/config.mdx
@@ -34,7 +34,7 @@ config = R2RConfig.from_toml("path/to/your/r2r.toml")
 r2r = R2RBuilder(config).build()
 
 # Or use a preset configuration
-r2r = R2RBuilder(config_name="neo4j_kg").build()
+r2r = R2RBuilder(config_name="default").build()
 ```
 
 ## Configuration Sections