Commit 0bfc9ee

Merge pull request #3140 from ClickHouse/measuring_search

New search

gingerwizard authored Jan 27, 2025
2 parents 023c8de + d9196a5
Showing 20 changed files with 1,625 additions and 69 deletions.
44 changes: 44 additions & 0 deletions .github/workflows/build-search.yml
@@ -0,0 +1,44 @@
name: Update Algolia Search

on:
pull_request:
types:
- closed

workflow_dispatch:

schedule:
- cron: '0 4 * * *'

env:
PYTHONUNBUFFERED: 1 # Force the stdout and stderr streams to be unbuffered

jobs:
update-search:
if: github.event.pull_request.merged == true && contains(github.event.pull_request.labels.*.name, 'update search') && github.event.pull_request.base.ref == 'main'
#if: contains(github.event.pull_request.labels.*.name, 'update search') # Updated to trigger directly on PRs with the label
runs-on: ubuntu-latest

steps:
- name: Checkout Repository
uses: actions/checkout@v3

- name: Set up Node.js
uses: actions/setup-node@v3
with:
node-version: '20'

- name: Run Prep from Master
run: yarn copy-clickhouse-repo-docs

- name: Run Auto Generate Settings
run: yarn auto-generate-settings

- name: Run Indexer
run: yarn run-indexer
env:
ALGOLIA_API_KEY: ${{ secrets.ALGOLIA_API_KEY }}
ALGOLIA_APP_ID: 5H9UG7CX5W

- name: Verify Completion
run: echo "All steps completed successfully!"
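
For reference, the `workflow_dispatch` trigger above also allows kicking this workflow off manually. A rough sketch of doing that through the GitHub REST API follows; the repository path and token are assumptions for illustration, not part of this PR:

```python
import requests

# Hypothetical manual trigger for the workflow above via the GitHub REST API.
# Repository path and token are placeholders.
resp = requests.post(
    "https://api.github.com/repos/ClickHouse/clickhouse-docs/actions/workflows/build-search.yml/dispatches",
    headers={
        "Authorization": "Bearer <YOUR_GITHUB_TOKEN>",
        "Accept": "application/vnd.github+json",
    },
    json={"ref": "main"},  # branch to run the workflow against
)
resp.raise_for_status()  # GitHub returns 204 No Content on success
```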
2 changes: 1 addition & 1 deletion docs/en/chdb/getting-started.md
@@ -49,7 +49,7 @@ pip install pandas pyarrow
## Querying a JSON file in S3

Let's now have a look at how to query a JSON file that's stored in an S3 bucket.
The [YouTube dislikes dataset](https://clickhouse.com/docs/en/getting-started/example-datasets/youtube-dislikes) contains more than 4 billion rows of dislikes on YouTube videos up to 2021.
The [YouTube dislikes dataset](/docs/en/getting-started/example-datasets/youtube-dislikes) contains more than 4 billion rows of dislikes on YouTube videos up to 2021.
We're going to work with one of the JSON files from that dataset.

Import chdb:
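The diff is truncated here; the doc continues with a chdb query against that dataset. A minimal sketch of the idea, where the S3 path and input format are assumptions rather than the doc's exact query:

```python
import chdb

# Hedged sketch: count rows in the YouTube dislikes dataset straight from S3.
# The bucket path and JSONLines format are assumptions for illustration.
result = chdb.query(
    "SELECT count() FROM s3("
    "'https://clickhouse-public-datasets.s3.amazonaws.com/youtube/original/files/*.zst', "
    "'JSONLines')",
    "CSV",
)
print(result)
```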
@@ -5,7 +5,7 @@ slug: /en/integrations/kafka/clickhouse-kafka-connect-sink
description: The official Kafka connector from ClickHouse.
---

import ConnectionDetails from '@site/docs/en/\_snippets/\_gather_your_details_http.mdx';
import ConnectionDetails from '@site/docs/en/_snippets/_gather_your_details_http.mdx';

# ClickHouse Kafka Connect Sink

@@ -5,7 +5,7 @@ keywords: [clickhouse, Mitzu, connect, integrate, ui]
description: Mitzu is a no-code warehouse-native product analytics application.
---

import ConnectionDetails from '@site/docs/en/\_snippets/\_gather_your_details_http.mdx';
import ConnectionDetails from '@site/docs/en/_snippets/_gather_your_details_http.mdx';

# Connecting Mitzu to ClickHouse

@@ -5,7 +5,7 @@ keywords: [clickhouse, Omni, connect, integrate, ui]
description: Omni is an enterprise platform for BI, data applications, and embedded analytics that helps you explore and share insights in real time.
---

import ConnectionDetails from '@site/docs/en/\_snippets/\_gather_your_details_http.mdx';
import ConnectionDetails from '@site/docs/en/_snippets/_gather_your_details_http.mdx';

# Omni

5 changes: 4 additions & 1 deletion docs/en/managing-data/core-concepts/partitions.md
@@ -2,7 +2,7 @@
slug: /en/partitions
title: Table partitions
description: What are table partitions in ClickHouse
keywords: [partitions]
keywords: [partitions, partition by]
---

## What are table partitions in ClickHouse?
@@ -12,6 +12,7 @@ keywords: [partitions]

Partitions group the [data parts](/docs/en/parts) of a table in the [MergeTree engine family](/docs/en/engines/table-engines/mergetree-family) into organized, logical units, which is a way of organizing data that is conceptually meaningful and aligned with specific criteria, such as time ranges, categories, or other key attributes. These logical units make data easier to manage, query, and optimize.

### Partition By

Partitioning can be enabled when a table is initially defined via the [PARTITION BY clause](/docs/en/engines/table-engines/mergetree-family/custom-partitioning-key). This clause can contain a SQL expression on any columns, the results of which will define which partition a row belongs to.

@@ -33,6 +34,8 @@

You can [query this table](https://sql.clickhouse.com/?query=U0VMRUNUICogRlJPTSB1ay51a19wcmljZV9wYWlkX3NpbXBsZV9wYXJ0aXRpb25lZA&run_query=true&tab=results) in our ClickHouse SQL Playground.

### Structure on disk

Whenever a set of rows is inserted into the table, instead of creating (at [least](/docs/en/operations/settings/settings#max_insert_block_size)) one single data part containing all the inserted rows (as described [here](/docs/en/parts)), ClickHouse creates one new data part for each unique partition key value among the inserted rows:

<img src={require('./images/partitions.png').default} alt='INSERT PROCESSING' class='image' style={{width: '100%'}} />
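To make the "one new data part per unique partition key value" behaviour above concrete, here is a hedged sketch using chdb; the database name and sample rows are made up, while the `PARTITION BY` expression matches the doc's example:

```python
from chdb import session

s = session.Session()  # throwaway local ClickHouse session
s.query("CREATE DATABASE IF NOT EXISTS demo")
s.query("""
    CREATE TABLE demo.uk_price_paid_simple_partitioned (
        date Date, town String, price UInt32
    )
    ENGINE = MergeTree
    ORDER BY (town, price)
    PARTITION BY toStartOfMonth(date)
""")
# One insert spanning two months -> ClickHouse writes (at least) two parts,
# one per unique partition key value among the inserted rows
s.query("""
    INSERT INTO demo.uk_price_paid_simple_partitioned VALUES
        ('2024-01-15', 'London', 500000),
        ('2024-02-03', 'Leeds',  280000)
""")
print(s.query(
    "SELECT partition, name FROM system.parts "
    "WHERE `table` = 'uk_price_paid_simple_partitioned' AND active",
    "Pretty",
))
```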
2 changes: 1 addition & 1 deletion docs/en/managing-data/deleting-data/overview.md
@@ -1,6 +1,6 @@
---
slug: /en/deletes/overview
title: Overview
title: Delete Overview
description: How to delete data in ClickHouse
keywords: [delete, truncate, drop, lightweight delete]
---
4 changes: 2 additions & 2 deletions docusaurus.config.js
@@ -174,8 +174,8 @@ const config = {
/** @type {import('@docusaurus/preset-classic').ThemeConfig} */
({
algolia: {
appId: '62VCH2MD74',
apiKey: '2363bec2ff1cf20b0fcac675040107c3',
appId: '5H9UG7CX5W',
apiKey: '4a7bf25cf3edbef29d78d5e1eecfdca5',
indexName: 'clickhouse',
contextualSearch: false,
searchPagePath: 'search',
8 changes: 6 additions & 2 deletions package.json
@@ -13,19 +13,23 @@
"docusaurus": "docusaurus",
"prep-from-local": "bash -c 'array_root=($npm_package_config_prep_array_root);array_en=($npm_package_config_prep_array_en);for folder in ${array_en[@]}; do cp -r $0/$folder docs/en;echo \"Copied $folder from [$0]\";done;for folder in ${array_root[@]}; do cp -r $0/$folder docs/;echo \"Copied $folder from [$0]\";done;echo \"Prep completed\";'",
"prep-from-master": "bash -c 'array_root=($npm_package_config_prep_array_root);array_en=($npm_package_config_prep_array_en);ch_temp=/tmp/ch_temp_$RANDOM && mkdir -p $ch_temp && git clone --depth 1 --branch master https://github.com/ClickHouse/ClickHouse $ch_temp; for folder in ${array_en[@]}; do cp -r $ch_temp/$folder docs/en;echo \"Copied $folder from ClickHouse master branch\";done;for folder in ${array_root[@]}; do cp -r $ch_temp/$folder docs/;echo \"Copied $folder from ClickHouse master branch\";done;rm -rf $ch_temp && echo \"Prep completed\";'",
"copy-clickhouse-repo-docs": "bash ./copyClickhouseRepoDocs.sh",
"serve": "docusaurus serve",
"build-api-doc": "node clickhouseapi.js",
"build-swagger": "npx @redocly/cli build-docs https://api.clickhouse.cloud/v1 --output build/en/cloud/manage/api/swagger.html",
"new-build": "bash ./copyClickhouseRepoDocs.sh && bash ./scripts/settings/autogenerate-settings.sh && yarn build-api-doc && yarn build && yarn build-swagger",
"auto-generate-settings": "bash ./scripts/settings/autogenerate-settings.sh",
"new-build": "yarn copy-clickhouse-repo-docs && yarn auto-generate-settings && yarn build-api-doc && yarn build && yarn build-swagger",
"start": "docusaurus start",
"swizzle": "docusaurus swizzle",
"write-heading-ids": "docusaurus write-heading-ids"
"write-heading-ids": "docusaurus write-heading-ids",
"run-indexer": "bash ./scripts/search/run_indexer.sh"
},
"dependencies": {
"@docusaurus/core": "3.7.0",
"@docusaurus/plugin-client-redirects": "3.7.0",
"@docusaurus/preset-classic": "3.7.0",
"@docusaurus/theme-mermaid": "3.7.0",
"@docusaurus/theme-search-algolia": "^3.7.0",
"@mdx-js/react": "^3.1.0",
"@radix-ui/react-navigation-menu": "^1.2.3",
"axios": "^1.7.9",
36 changes: 30 additions & 6 deletions scripts/search/README.md
@@ -31,9 +31,33 @@ options:

## Results


| Date | Average nDCG | Results |
|------------|--------------|------------------------------------------------------------------------------------------------|
| 20/01/2024 | 0.5010 | [here](https://pastila.nl/?008231f5/bc107912f8a5074d70201e27b1a66c6c#cB/yJOsZPOWi9h8xAkuTUQ==) |
| | | |

| **Date** | **Average nDCG** | **Results** | **Changes** |
|------------|------------------|--------------------------------------------------------------------------------------------------------|--------------------------------------------------|
| 20/01/2024 | 0.4700 | [View Results](https://pastila.nl/?008231f5/bc107912f8a5074d70201e27b1a66c6c#cB/yJOsZPOWi9h8xAkuTUQ==) | Baseline |
| 21/01/2024 | 0.5021 | [View Results](https://pastila.nl/?00bb2c2f/936a9a3af62a9bdda186af5f37f55782#m7Hg0i9F1YCesMW6ot25yA==) | Index `_` character and move language to English |
| 24/01/2024 | 0.7072           | [View Results](https://pastila.nl/?065e3e67/e4ad889d0c166226118e6160b4ee53ff#x1NPd2R7hU90CZvvrE4nhg==)  | Process markdown and tune settings.              |
| 24/01/2024 | 0.7412           | [View Results](https://pastila.nl/?0020013d/e69b33aaae82e49bc71c5ee2cea9ad46#pqq3VtRd4eP4JM5/izcBcA==)  | Include manual promotions for ambiguous terms.   |

Note: exact scores may vary due to constant content changes.
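
For readers unfamiliar with the metric: nDCG compares the DCG of the returned ranking against the DCG of the ideal ranking. A minimal sketch with binary relevance, mirroring the approach in `scripts/search/compute_ndcg.py` but simplified:

```python
import math

def dcg(relevance):
    # DCG = sum(rel_i / log2(i + 1)), with positions i starting at 1
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance, start=1))

def ndcg(retrieved, ideal):
    # Normalize against the best possible ordering of the relevant results
    ideal_dcg = dcg(sorted(ideal, reverse=True))
    return dcg(retrieved) / ideal_dcg if ideal_dcg > 0 else 0.0

# Top-3 hits where the 1st and 3rd results are relevant, out of 2 relevant docs
print(round(ndcg([1, 0, 1], [1, 1]), 4))  # 0.9197
```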

## Issues

1. Some pages are not optimized for retrieval, e.g.
   a. https://clickhouse.com/docs/en/sql-reference/aggregate-functions/combinators#-if will never return for `countIf`, `sumif`, `multiif`
1. Some pages are hidden, e.g. https://clickhouse.com/docs/en/install#from-docker-image - this needs to be a separate page.
1. Some pages, e.g. https://clickhouse.com/docs/en/sql-reference/statements/alter, need headings, e.g. `Alter table`
1. https://clickhouse.com/docs/en/optimize/sparse-primary-indexes needs to be optimized for the term `primary key`
1. case `when` - https://clickhouse.com/docs/en/sql-reference/functions/conditional-functions needs to be improved. Maybe keywords or a header
1. `has` - https://clickhouse.com/docs/en/sql-reference/functions/array-functions#hasarr-elem is tricky
1. `codec` - we need better content
1. `shard` - we need a better page
1. `populate` - we need a subheading on the materialized view page
1. `contains` - https://clickhouse.com/docs/en/sql-reference/functions/string-search-functions needs the term `contains` in its wording
1. `replica` - https://clickhouse.com/docs/en/architecture/horizontal-scaling needs more terms, but we also need a better page


Algolia configs to try:

- minProximity - 1
- minWordSizefor2Typos - 7
- minWordSizefor1Typo - 3
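
A hedged sketch of applying those settings with the same v4 Python client used by `compute_ndcg.py`; the app ID and admin key are placeholders, and applying settings requires an admin key rather than the search-only key:

```python
from algoliasearch.search.client import SearchClientSync

# Placeholder credentials - use an admin API key to change index settings
client = SearchClientSync("<APP_ID>", "<ADMIN_API_KEY>")

client.set_settings(
    index_name="clickhouse",
    index_settings={
        "minProximity": 1,
        "minWordSizefor2Typos": 7,
        "minWordSizefor1Typo": 3,
    },
)
```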
31 changes: 22 additions & 9 deletions scripts/search/compute_ndcg.py
@@ -3,13 +3,22 @@
import argparse
from algoliasearch.search.client import SearchClientSync

# Initialize Algolia client
ALGOLIA_APP_ID = "62VCH2MD74"
ALGOLIA_API_KEY = "b78244d947484fe3ece7bc5472e9f2af"
ALGOLIA_INDEX_NAME = "clickhouse"

client = SearchClientSync(ALGOLIA_APP_ID, ALGOLIA_API_KEY)
# dev details
ALGOLIA_APP_ID = "7AL1W7YVZK"
ALGOLIA_API_KEY = "43bd50d4617a97c9b60042a2e8a348f9"

# Prod details
# ALGOLIA_APP_ID = "5H9UG7CX5W"
# ALGOLIA_API_KEY = "4a7bf25cf3edbef29d78d5e1eecfdca5"

# old search engine using crawler
# ALGOLIA_APP_ID = "62VCH2MD74"
# ALGOLIA_API_KEY = "b78244d947484fe3ece7bc5472e9f2af"


client = SearchClientSync(ALGOLIA_APP_ID, ALGOLIA_API_KEY)

def compute_dcg(relevance_scores):
"""Compute Discounted Cumulative Gain (DCG)."""
@@ -32,12 +41,13 @@ def main(input_csv, detailed, k=3):
with open(input_csv, mode='r', newline='', encoding='utf-8') as file:
reader = csv.reader(file)
rows = list(reader)

results = []
total_ndcg = 0

for row in rows:
term = row[0]
expected_links = [link for link in row[1:4] if link] # Skip empty cells
# Remove duplicates in expected links - can happen as some docs return same url
expected_links = list(dict.fromkeys([link for link in row[1:4] if link])) # Ensure uniqueness

# Query Algolia
response = client.search(
Expand All @@ -58,17 +68,20 @@ def main(input_csv, detailed, k=3):
total_ndcg += ndcg
results.append({"term": term, "nDCG": ndcg})

# Calculate Mean nDCG
mean_ndcg = total_ndcg / len(rows) if rows else 0
# Sort results by descending nDCG
results.sort(key=lambda x: x['nDCG'], reverse=True)

# Display results
print(f"Mean nDCG: {mean_ndcg:.4f}")
if detailed:
print("\nSearch Term\t\tnDCG")
print("=" * 30)
for result in results:
print(f"{result['term']}\t\t{result['nDCG']:.4f}")

# Calculate Mean nDCG
mean_ndcg = total_ndcg / len(rows) if rows else 0
print(f"Mean nDCG: {mean_ndcg:.4f}")


if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Compute nDCG for Algolia search results.")
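The truncated `client.search(...)` call inside `main()` follows the v4 client's multi-query request shape. A hedged sketch of that request; the response traversal (`results[0].actual_instance.hits`) and the `url` attribute on hits are assumptions about the generated response model:

```python
# Hedged sketch of the Algolia v4 query used inside main(). The response
# traversal below is an assumption about the generated client's models.
response = client.search(
    search_method_params={
        "requests": [
            {"indexName": "clickhouse", "query": "materialized view", "hitsPerPage": 3}
        ]
    }
)
hits = response.results[0].actual_instance.hits  # assumption: oneOf wrapper
print([hit.url for hit in hits])  # assumption: records expose a `url` field
```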