Commit 0bfc9ee

Merge pull request #3140 from ClickHouse/measuring_search

New search

gingerwizard authored Jan 27, 2025
2 parents 023c8de + d9196a5
Showing 20 changed files with 1,625 additions and 69 deletions.
44 changes: 44 additions & 0 deletions .github/workflows/build-search.yml
@@ -0,0 +1,44 @@
name: Update Algolia Search

on:
pull_request:
types:
- closed

workflow_dispatch:

schedule:
- cron: '0 4 * * *'

env:
PYTHONUNBUFFERED: 1 # Force the stdout and stderr streams to be unbuffered

jobs:
update-search:
if: github.event.pull_request.merged == true && contains(github.event.pull_request.labels.*.name, 'update search') && github.event.pull_request.base.ref == 'main'
#if: contains(github.event.pull_request.labels.*.name, 'update search') # Updated to trigger directly on PRs with the label
runs-on: ubuntu-latest

steps:
- name: Checkout Repository
uses: actions/checkout@v3

- name: Set up Node.js
uses: actions/setup-node@v3
with:
node-version: '20'

- name: Run Prep from Master
run: yarn copy-clickhouse-repo-docs

- name: Run Auto Generate Settings
run: yarn auto-generate-settings

- name: Run Indexer
run: yarn run-indexer
env:
ALGOLIA_API_KEY: ${{ secrets.ALGOLIA_API_KEY }}
ALGOLIA_APP_ID: 5H9UG7CX5W

- name: Verify Completion
run: echo "All steps completed successfully!"
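
For reference, the `workflow_dispatch` trigger above also allows kicking this workflow off manually. A rough sketch of doing that through the GitHub REST API follows; the repository path and token are assumptions for illustration, not part of this PR:

```python
import requests

# Hypothetical manual trigger for the workflow above via the GitHub REST API.
# Repository path and token are placeholders.
resp = requests.post(
    "https://api.github.com/repos/ClickHouse/clickhouse-docs/actions/workflows/build-search.yml/dispatches",
    headers={
        "Authorization": "Bearer <YOUR_GITHUB_TOKEN>",
        "Accept": "application/vnd.github+json",
    },
    json={"ref": "main"},  # branch to run the workflow against
)
resp.raise_for_status()  # GitHub returns 204 No Content on success
```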
2 changes: 1 addition & 1 deletion docs/en/chdb/getting-started.md
@@ -49,7 +49,7 @@ pip install pandas pyarrow
## Querying a JSON file in S3

Let's now have a look at how to query a JSON file that's stored in an S3 bucket.
The [YouTube dislikes dataset](https://clickhouse.com/docs/en/getting-started/example-datasets/youtube-dislikes) contains more than 4 billion rows of dislikes on YouTube videos up to 2021.
The [YouTube dislikes dataset](/docs/en/getting-started/example-datasets/youtube-dislikes) contains more than 4 billion rows of dislikes on YouTube videos up to 2021.
We're going to work with one of the JSON files from that dataset.

Import chdb:
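The diff is truncated here; the doc continues with a chdb query against that dataset. A minimal sketch of the idea, where the S3 path and input format are assumptions rather than the doc's exact query:

```python
import chdb

# Hedged sketch: count rows in the YouTube dislikes dataset straight from S3.
# The bucket path and JSONLines format are assumptions for illustration.
result = chdb.query(
    "SELECT count() FROM s3("
    "'https://clickhouse-public-datasets.s3.amazonaws.com/youtube/original/files/*.zst', "
    "'JSONLines')",
    "CSV",
)
print(result)
```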
@@ -5,7 +5,7 @@ slug: /en/integrations/kafka/clickhouse-kafka-connect-sink
description: The official Kafka connector from ClickHouse.
---

import ConnectionDetails from '@site/docs/en/\_snippets/\_gather_your_details_http.mdx';
import ConnectionDetails from '@site/docs/en/_snippets/_gather_your_details_http.mdx';

# ClickHouse Kafka Connect Sink

@@ -5,7 +5,7 @@ keywords: [clickhouse, Mitzu, connect, integrate, ui]
description: Mitzu is a no-code warehouse-native product analytics application.
---

import ConnectionDetails from '@site/docs/en/\_snippets/\_gather_your_details_http.mdx';
import ConnectionDetails from '@site/docs/en/_snippets/_gather_your_details_http.mdx';

# Connecting Mitzu to ClickHouse

@@ -5,7 +5,7 @@ keywords: [clickhouse, Omni, connect, integrate, ui]
description: Omni is an enterprise platform for BI, data applications, and embedded analytics that helps you explore and share insights in real time.
---

import ConnectionDetails from '@site/docs/en/\_snippets/\_gather_your_details_http.mdx';
import ConnectionDetails from '@site/docs/en/_snippets/_gather_your_details_http.mdx';

# Omni

5 changes: 4 additions & 1 deletion docs/en/managing-data/core-concepts/partitions.md
@@ -2,7 +2,7 @@
slug: /en/partitions
title: Table partitions
description: What are table partitions in ClickHouse
keywords: [partitions]
keywords: [partitions, partition by]
---

## What are table partitions in ClickHouse?
@@ -12,6 +12,7 @@ keywords: [partitions]

Partitions group the [data parts](/docs/en/parts) of a table in the [MergeTree engine family](/docs/en/engines/table-engines/mergetree-family) into organized, logical units, which is a way of organizing data that is conceptually meaningful and aligned with specific criteria, such as time ranges, categories, or other key attributes. These logical units make data easier to manage, query, and optimize.

### Partition By

Partitioning can be enabled when a table is initially defined via the [PARTITION BY clause](/docs/en/engines/table-engines/mergetree-family/custom-partitioning-key). This clause can contain a SQL expression on any columns, the results of which will define which partition a row belongs to.

@@ -33,6 +34,8 @@

You can [query this table](https://sql.clickhouse.com/?query=U0VMRUNUICogRlJPTSB1ay51a19wcmljZV9wYWlkX3NpbXBsZV9wYXJ0aXRpb25lZA&run_query=true&tab=results) in our ClickHouse SQL Playground.

### Structure on disk

Whenever a set of rows is inserted into the table, instead of creating (at [least](/docs/en/operations/settings/settings#max_insert_block_size)) one single data part containing all the inserted rows (as described [here](/docs/en/parts)), ClickHouse creates one new data part for each unique partition key value among the inserted rows:

<img src={require('./images/partitions.png').default} alt='INSERT PROCESSING' class='image' style={{width: '100%'}} />
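To make the "one new data part per unique partition key value" behaviour above concrete, here is a hedged sketch using chdb; the database name and sample rows are made up, while the `PARTITION BY` expression matches the doc's example:

```python
from chdb import session

s = session.Session()  # throwaway local ClickHouse session
s.query("CREATE DATABASE IF NOT EXISTS demo")
s.query("""
    CREATE TABLE demo.uk_price_paid_simple_partitioned (
        date Date, town String, price UInt32
    )
    ENGINE = MergeTree
    ORDER BY (town, price)
    PARTITION BY toStartOfMonth(date)
""")
# One insert spanning two months -> ClickHouse writes (at least) two parts,
# one per unique partition key value among the inserted rows
s.query("""
    INSERT INTO demo.uk_price_paid_simple_partitioned VALUES
        ('2024-01-15', 'London', 500000),
        ('2024-02-03', 'Leeds',  280000)
""")
print(s.query(
    "SELECT partition, name FROM system.parts "
    "WHERE `table` = 'uk_price_paid_simple_partitioned' AND active",
    "Pretty",
))
```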
2 changes: 1 addition & 1 deletion docs/en/managing-data/deleting-data/overview.md
@@ -1,6 +1,6 @@
---
slug: /en/deletes/overview
title: Overview
title: Delete Overview
description: How to delete data in ClickHouse
keywords: [delete, truncate, drop, lightweight delete]
---
4 changes: 2 additions & 2 deletions docusaurus.config.js
@@ -174,8 +174,8 @@ const config = {
/** @type {import('@docusaurus/preset-classic').ThemeConfig} */
({
algolia: {
appId: '62VCH2MD74',
apiKey: '2363bec2ff1cf20b0fcac675040107c3',
appId: '5H9UG7CX5W',
apiKey: '4a7bf25cf3edbef29d78d5e1eecfdca5',
indexName: 'clickhouse',
contextualSearch: false,
searchPagePath: 'search',
8 changes: 6 additions & 2 deletions package.json
@@ -13,19 +13,23 @@
"docusaurus": "docusaurus",
"prep-from-local": "bash -c 'array_root=($npm_package_config_prep_array_root);array_en=($npm_package_config_prep_array_en);for folder in ${array_en[@]}; do cp -r $0/$folder docs/en;echo \"Copied $folder from [$0]\";done;for folder in ${array_root[@]}; do cp -r $0/$folder docs/;echo \"Copied $folder from [$0]\";done;echo \"Prep completed\";'",
"prep-from-master": "bash -c 'array_root=($npm_package_config_prep_array_root);array_en=($npm_package_config_prep_array_en);ch_temp=/tmp/ch_temp_$RANDOM && mkdir -p $ch_temp && git clone --depth 1 --branch master https://github.com/ClickHouse/ClickHouse $ch_temp; for folder in ${array_en[@]}; do cp -r $ch_temp/$folder docs/en;echo \"Copied $folder from ClickHouse master branch\";done;for folder in ${array_root[@]}; do cp -r $ch_temp/$folder docs/;echo \"Copied $folder from ClickHouse master branch\";done;rm -rf $ch_temp && echo \"Prep completed\";'",
"copy-clickhouse-repo-docs": "bash ./copyClickhouseRepoDocs.sh",
"serve": "docusaurus serve",
"build-api-doc": "node clickhouseapi.js",
"build-swagger": "npx @redocly/cli build-docs https://api.clickhouse.cloud/v1 --output build/en/cloud/manage/api/swagger.html",
"new-build": "bash ./copyClickhouseRepoDocs.sh && bash ./scripts/settings/autogenerate-settings.sh && yarn build-api-doc && yarn build && yarn build-swagger",
"auto-generate-settings": "bash ./scripts/settings/autogenerate-settings.sh",
"new-build": "yarn copy-clickhouse-repo-docs && yarn auto-generate-settings && yarn build-api-doc && yarn build && yarn build-swagger",
"start": "docusaurus start",
"swizzle": "docusaurus swizzle",
"write-heading-ids": "docusaurus write-heading-ids"
"write-heading-ids": "docusaurus write-heading-ids",
"run-indexer": "bash ./scripts/search/run_indexer.sh"
},
"dependencies": {
"@docusaurus/core": "3.7.0",
"@docusaurus/plugin-client-redirects": "3.7.0",
"@docusaurus/preset-classic": "3.7.0",
"@docusaurus/theme-mermaid": "3.7.0",
"@docusaurus/theme-search-algolia": "^3.7.0",
"@mdx-js/react": "^3.1.0",
"@radix-ui/react-navigation-menu": "^1.2.3",
"axios": "^1.7.9",
36 changes: 30 additions & 6 deletions scripts/search/README.md
@@ -31,9 +31,33 @@ options:

## Results


| Date | Average nDCG | Results |
|------------|--------------|------------------------------------------------------------------------------------------------|
| 20/01/2024 | 0.5010 | [here](https://pastila.nl/?008231f5/bc107912f8a5074d70201e27b1a66c6c#cB/yJOsZPOWi9h8xAkuTUQ==) |
| | | |

| **Date** | **Average nDCG** | **Results** | **Changes** |
|------------|------------------|--------------------------------------------------------------------------------------------------------|--------------------------------------------------|
| 20/01/2024 | 0.4700 | [View Results](https://pastila.nl/?008231f5/bc107912f8a5074d70201e27b1a66c6c#cB/yJOsZPOWi9h8xAkuTUQ==) | Baseline |
| 21/01/2024 | 0.5021 | [View Results](https://pastila.nl/?00bb2c2f/936a9a3af62a9bdda186af5f37f55782#m7Hg0i9F1YCesMW6ot25yA==) | Index `_` character and move language to English |
| 24/01/2024 | 0.7072           | [View Results](https://pastila.nl/?065e3e67/e4ad889d0c166226118e6160b4ee53ff#x1NPd2R7hU90CZvvrE4nhg==)  | Process markdown and tune settings.              |
| 24/01/2024 | 0.7412           | [View Results](https://pastila.nl/?0020013d/e69b33aaae82e49bc71c5ee2cea9ad46#pqq3VtRd4eP4JM5/izcBcA==)  | Include manual promotions for ambiguous terms.   |

Note: exact scores may vary due to constant content changes.
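
For readers unfamiliar with the metric: nDCG compares the DCG of the returned ranking against the DCG of the ideal ranking. A minimal sketch with binary relevance, mirroring the approach in `scripts/search/compute_ndcg.py` but simplified:

```python
import math

def dcg(relevance):
    # DCG = sum(rel_i / log2(i + 1)), with positions i starting at 1
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance, start=1))

def ndcg(retrieved, ideal):
    # Normalize against the best possible ordering of the relevant results
    ideal_dcg = dcg(sorted(ideal, reverse=True))
    return dcg(retrieved) / ideal_dcg if ideal_dcg > 0 else 0.0

# Top-3 hits where the 1st and 3rd results are relevant, out of 2 relevant docs
print(round(ndcg([1, 0, 1], [1, 1]), 4))  # 0.9197
```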

## Issues

1. Some pages are not optimized for retrieval, e.g.
   a. https://clickhouse.com/docs/en/sql-reference/aggregate-functions/combinators#-if will never return for `countIf`, `sumif`, `multiif`
1. Some pages are hidden, e.g. https://clickhouse.com/docs/en/install#from-docker-image - this needs to be a separate page.
1. Some pages, e.g. https://clickhouse.com/docs/en/sql-reference/statements/alter, need headings, e.g. `Alter table`
1. https://clickhouse.com/docs/en/optimize/sparse-primary-indexes needs to be optimized for the term `primary key`
1. case `when` - https://clickhouse.com/docs/en/sql-reference/functions/conditional-functions needs to be improved. Maybe keywords or a header
1. `has` - https://clickhouse.com/docs/en/sql-reference/functions/array-functions#hasarr-elem is tricky
1. `codec` - we need better content
1. `shard` - we need a better page
1. `populate` - we need a subheading on the materialized view page
1. `contains` - https://clickhouse.com/docs/en/sql-reference/functions/string-search-functions needs the term `contains` in its wording
1. `replica` - https://clickhouse.com/docs/en/architecture/horizontal-scaling needs more terms, but we also need a better page


Algolia configs to try:

- minProximity - 1
- minWordSizefor2Typos - 7
- minWordSizefor1Typo - 3
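
A hedged sketch of applying those settings with the same v4 Python client used by `compute_ndcg.py`; the app ID and admin key are placeholders, and applying settings requires an admin key rather than the search-only key:

```python
from algoliasearch.search.client import SearchClientSync

# Placeholder credentials - use an admin API key to change index settings
client = SearchClientSync("<APP_ID>", "<ADMIN_API_KEY>")

client.set_settings(
    index_name="clickhouse",
    index_settings={
        "minProximity": 1,
        "minWordSizefor2Typos": 7,
        "minWordSizefor1Typo": 3,
    },
)
```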
31 changes: 22 additions & 9 deletions scripts/search/compute_ndcg.py
@@ -3,13 +3,22 @@
import argparse
from algoliasearch.search.client import SearchClientSync

# Initialize Algolia client
ALGOLIA_APP_ID = "62VCH2MD74"
ALGOLIA_API_KEY = "b78244d947484fe3ece7bc5472e9f2af"
ALGOLIA_INDEX_NAME = "clickhouse"

client = SearchClientSync(ALGOLIA_APP_ID, ALGOLIA_API_KEY)
# dev details
ALGOLIA_APP_ID = "7AL1W7YVZK"
ALGOLIA_API_KEY = "43bd50d4617a97c9b60042a2e8a348f9"

# Prod details
# ALGOLIA_APP_ID = "5H9UG7CX5W"
# ALGOLIA_API_KEY = "4a7bf25cf3edbef29d78d5e1eecfdca5"

# old search engine using crawler
# ALGOLIA_APP_ID = "62VCH2MD74"
# ALGOLIA_API_KEY = "b78244d947484fe3ece7bc5472e9f2af"


client = SearchClientSync(ALGOLIA_APP_ID, ALGOLIA_API_KEY)

def compute_dcg(relevance_scores):
"""Compute Discounted Cumulative Gain (DCG)."""
@@ -32,12 +41,13 @@ def main(input_csv, detailed, k=3):
with open(input_csv, mode='r', newline='', encoding='utf-8') as file:
reader = csv.reader(file)
rows = list(reader)

results = []
total_ndcg = 0

for row in rows:
term = row[0]
expected_links = [link for link in row[1:4] if link] # Skip empty cells
# Remove duplicates in expected links - can happen as some docs return same url
expected_links = list(dict.fromkeys([link for link in row[1:4] if link])) # Ensure uniqueness

# Query Algolia
response = client.search(
Expand All @@ -58,17 +68,20 @@ def main(input_csv, detailed, k=3):
total_ndcg += ndcg
results.append({"term": term, "nDCG": ndcg})

# Calculate Mean nDCG
mean_ndcg = total_ndcg / len(rows) if rows else 0
# Sort results by descending nDCG
results.sort(key=lambda x: x['nDCG'], reverse=True)

# Display results
print(f"Mean nDCG: {mean_ndcg:.4f}")
if detailed:
print("\nSearch Term\t\tnDCG")
print("=" * 30)
for result in results:
print(f"{result['term']}\t\t{result['nDCG']:.4f}")

# Calculate Mean nDCG
mean_ndcg = total_ndcg / len(rows) if rows else 0
print(f"Mean nDCG: {mean_ndcg:.4f}")


if __name__ == "__main__":
parser = argparse.ArgumentParser(description="Compute nDCG for Algolia search results.")
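The truncated `client.search(...)` call inside `main()` follows the v4 client's multi-query request shape. A hedged sketch of that request; the response traversal (`results[0].actual_instance.hits`) and the `url` attribute on hits are assumptions about the generated response model:

```python
# Hedged sketch of the Algolia v4 query used inside main(). The response
# traversal below is an assumption about the generated client's models.
response = client.search(
    search_method_params={
        "requests": [
            {"indexName": "clickhouse", "query": "materialized view", "hitsPerPage": 3}
        ]
    }
)
hits = response.results[0].actual_instance.hits  # assumption: oneOf wrapper
print([hit.url for hit in hits])  # assumption: records expose a `url` field
```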