Short Circuit And Filter Operator During Index Evaluation #14700

ankitsultana · 2024-12-22T21:19:21Z

Returns early from the AndFilterOperator if any of the non-lazy predicates has already evaluated to an empty doc ID set.

We also return early from CombinedFilterOperator if the main filter operator evaluates to an empty doc ID set.
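
A minimal sketch of the early return, for illustration only. It uses `java.util.BitSet` as a stand-in for the bitmap-backed doc ID sets, and the class name `ShortCircuitAndSketch` and its `evaluate` method are hypothetical, not the code in this PR. Each `Supplier` models an index-based (non-lazy) predicate.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for illustration; this is not Pinot's AndFilterOperator.
final class ShortCircuitAndSketch {
  private ShortCircuitAndSketch() {
  }

  // Each supplier models an index-based (non-lazy) predicate returning its matching doc IDs.
  static BitSet evaluate(List<Supplier<BitSet>> indexPredicates, int numDocs) {
    List<BitSet> evaluated = new ArrayList<>();
    for (Supplier<BitSet> predicate : indexPredicates) {
      BitSet matches = predicate.get();
      if (matches.isEmpty()) {
        // One empty child makes the whole AND empty, so the remaining predicates are never evaluated.
        return new BitSet();
      }
      evaluated.add(matches);
    }
    // No child was empty: intersect everything, starting from "all docs match".
    BitSet result = new BitSet(numDocs);
    result.set(0, numDocs);
    for (BitSet matches : evaluated) {
      result.and(matches);
    }
    return result;
  }
}
```

In this sketch, if the second of three predicates matches nothing, the third supplier is never invoked.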

Caveats

Note that this doesn't work for scan-based predicates, since those are evaluated lazily in AndDocIdSet#iterator(), so their results are not yet known at the point where we could return early.

Alternatives Considered

An alternative approach to this PR could be to restructure our predicate evaluation for ANDs altogether, and instead compute the AND bitmap on a running basis (i.e. currentBitmap.and(filterOperators.get(i).getTrues())). That way, if the AND of a prefix of the predicates evaluates to an empty doc ID set (!currentBitmap.cardinalityExceeds(0)), we could return early.
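
A rough sketch of this running AND, again with `java.util.BitSet` standing in for the bitmap type and with hypothetical names (`RunningAndSketch`, `runningAnd`). Unlike the per-predicate check in this PR, it can bail even when no single predicate is empty, as long as the intersection of a prefix is.

```java
import java.util.BitSet;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration of the "running AND" alternative; not part of this PR.
final class RunningAndSketch {
  private RunningAndSketch() {
  }

  static BitSet runningAnd(List<Supplier<BitSet>> predicates, int numDocs) {
    // Start from "all docs match" and narrow with each predicate in turn.
    BitSet current = new BitSet(numDocs);
    current.set(0, numDocs);
    for (Supplier<BitSet> predicate : predicates) {
      current.and(predicate.get());
      if (current.isEmpty()) {
        // The prefix already matches nothing, so later predicates cannot add docs back.
        return current;
      }
    }
    return current;
  }
}
```

For example, predicates matching {1} and {2} intersect to nothing, so a third predicate in the list would never be evaluated.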

We could then also use the idea shared by @richardstartin in #14694 to push down the currently evaluated bitmap to the index, to potentially leverage more optimizations.

I think the first step could be to add a Running AND for the prefix of predicates in the AndFilterOperator and bail early when possible, and the next step could be to push down a partial predicate to certain indexes.

cc: @Jackie-Jiang

Test Plan

Unit tests have decent coverage. I am also planning to run some benchmarks and will add their results soon.

Perf Tests

I created a microbenchmark that runs the example filter below (obfuscated) on one of our prod segments, and the filter evaluates 2-3x faster with the patch. This is a cherry-picked example, but for any use case with a high ratio of segmentsProcessed to segmentsMatched, this optimization should be helpful. (Note that the error is quite high, but the results are good enough for comparison.)

Without Optimization

Benchmark                                           Mode  Cnt  Score   Error  Units
BenchmarkFilterOperator.testFilterDocIdSetOperator  avgt    5  0.208 ± 0.051  ms/op

With Optimization

Benchmark                                           Mode  Cnt  Score   Error  Units
BenchmarkFilterOperator.testFilterDocIdSetOperator  avgt    5  0.070 ± 0.018  ms/op

Example Filter (Indicative because Obfuscated)

  (
    source = 'some_source'
    AND is_deleted = false
    AND catalog_uuid = 'xyz'
    AND storefront_uuid = 'abc'
    AND NOT (external_ids is NULL)
    AND (
      (
        TEXT_MATCH(analyzed_product_name, '/tostitosx.*/')
      )
    )
    AND (
      (
        TEXT_MATCH(analyzed_product_name, '/chips.*/')
      )
    )
  )

codecov-commenter · 2024-12-22T22:02:35Z

Codecov Report

Attention: Patch coverage is 64.51613% with 11 lines in your changes missing coverage. Please review.

Project coverage is 63.83%. Comparing base (59551e4) to head (d7b5abd).
Report is 1528 commits behind head on master.

Files with missing lines	Patch %	Lines
...va/org/apache/pinot/core/common/BlockDocIdSet.java	25.00%	2 Missing and 1 partial ⚠️
.../pinot/core/operator/docidsets/BitmapDocIdSet.java	40.00%	2 Missing and 1 partial ⚠️
...re/operator/docidsets/RangelessBitmapDocIdSet.java	33.33%	1 Missing and 1 partial ⚠️
...t/core/operator/filter/CombinedFilterOperator.java	0.00%	1 Missing and 1 partial ⚠️
...inot/core/operator/docidsets/MatchAllDocIdSet.java	0.00%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #14700      +/-   ##
============================================
+ Coverage     61.75%   63.83%   +2.08%     
- Complexity      207     1607    +1400     
============================================
  Files          2436     2703     +267     
  Lines        133233   150759   +17526     
  Branches      20636    23296    +2660     
============================================
+ Hits          82274    96237   +13963     
- Misses        44911    47323    +2412     
- Partials       6048     7199    +1151

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (+99.99%)`	⬆️
integration	`100.00% <ø> (+99.99%)`	⬆️
integration1	`100.00% <ø> (+99.99%)`	⬆️
integration2	`0.00% <ø> (ø)`
java-11	`63.81% <64.51%> (+2.10%)`	⬆️
java-21	`63.70% <64.51%> (+2.07%)`	⬆️
skip-bytebuffers-false	`63.82% <64.51%> (+2.07%)`	⬆️
skip-bytebuffers-true	`63.68% <64.51%> (+35.95%)`	⬆️
temurin	`63.83% <64.51%> (+2.08%)`	⬆️
unittests	`63.83% <64.51%> (+2.08%)`	⬆️
unittests1	`56.26% <64.51%> (+9.37%)`	⬆️
unittests2	`34.16% <0.00%> (+6.43%)`	⬆️

ankitsultana · 2024-12-23T18:47:54Z

pinot-core/src/test/java/org/apache/pinot/queries/H3IndexQueriesTest.java

@@ -401,7 +401,7 @@ public void queryStContainsWithMultipleFilters()

    AggregationOperator aggregationOperator = getOperator(query);
    AggregationResultsBlock resultsBlock = aggregationOperator.nextBlock();
-    QueriesTestUtils.testInnerSegmentExecutionStatistics(aggregationOperator.getExecutionStatistics(), 0, 2, 0, 1);
+    QueriesTestUtils.testInnerSegmentExecutionStatistics(aggregationOperator.getExecutionStatistics(), 0, 1, 0, 1);


This is expected. Previously we would evaluate both predicates, but now we only evaluate the first one, and since it doesn't match any docs, we skip the second predicate.

Jackie-Jiang

Another optimization we can consider is to stop using the index and switch to scan-based evaluation once the cardinality drops below a threshold. Scanning a small number of records can be faster than reading an index.
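
One way to picture this suggestion, as a sketch under assumptions rather than a proposed implementation: `java.util.BitSet` stands in for the candidate doc ID set, and the names `IndexOrScanSketch`, `applyNext`, `rowMatches` and `scanThreshold` are all hypothetical. The threshold value would have to be tuned by measurement.

```java
import java.util.BitSet;
import java.util.function.IntPredicate;
import java.util.function.Supplier;

// Hypothetical illustration of switching from index reads to scanning for small candidate sets.
final class IndexOrScanSketch {
  private IndexOrScanSketch() {
  }

  static BitSet applyNext(BitSet candidates, Supplier<BitSet> indexLookup, IntPredicate rowMatches,
      int scanThreshold) {
    BitSet result = (BitSet) candidates.clone();
    if (candidates.cardinality() <= scanThreshold) {
      // Few candidates left: test each remaining doc directly and skip the index read.
      for (int docId = candidates.nextSetBit(0); docId >= 0; docId = candidates.nextSetBit(docId + 1)) {
        if (!rowMatches.test(docId)) {
          result.clear(docId);
        }
      }
    } else {
      // Many candidates: read the index and intersect with its matching docs.
      result.and(indexLookup.get());
    }
    return result;
  }
}
```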

pinot-core/src/main/java/org/apache/pinot/core/common/BlockDocIdSet.java

pinot-core/src/main/java/org/apache/pinot/core/operator/docidsets/BitmapDocIdSet.java

Jackie-Jiang

Now I remember why we didn't have isAlwaysFalse() and isAlwaysTrue() in BlockDocIdSet. We should directly create EmptyDocIdSet and MatchAllDocIdSet when we know the result is always false/true. Initially we followed this contract, but over time newer contributions stopped following it.

Can you try to fix the violations of this contract? We should also document this contract in the interface.

Jackie-Jiang · 2025-01-07T06:14:39Z

pinot-core/src/main/java/org/apache/pinot/core/common/BlockDocIdSet.java

  /**
   * For scan-based FilterBlockDocIdSet, pre-scans the documents and returns a non-scan-based FilterBlockDocIdSet.
   */
  default BlockDocIdSet toNonScanDocIdSet() {
    BlockDocIdIterator docIdIterator = iterator();
-
+    if (docIdIterator instanceof EmptyDocIdIterator) {


Do we need this change? This prioritizes the always-empty filter case, which should be extremely rare.

Jackie-Jiang · 2025-01-07T06:31:28Z

pinot-core/src/main/java/org/apache/pinot/core/operator/filter/AndFilterOperator.java

+      blockDocIdSets.add(blockDocIdSet);
+      if (blockDocIdSet.isAlwaysFalse()) {
+        // Return AndDocIdSet to ensure that getNumEntriesScannedInFilter is correctly reported.
+        return new AndDocIdSet(blockDocIdSets, _queryOptions, true);


Do we need to create AndDocIdSet here? The scan shouldn't have happened yet, and the overhead of using AndDocIdSet is quite high.

[WIP] [PoC] Bail Early in And Filter Operator When Possible

669e95c

fix bugs and tests

ff3bd17

ankitsultana changed the title ~~[WIP] [PoC] Bail Early in And Filter Operator When Possible~~ [WIP] Bail Early in And Filter Operator When Possible Dec 23, 2024

skip remaining scan based operators when possible

a64894d

ankitsultana requested a review from Jackie-Jiang December 23, 2024 20:08

ankitsultana changed the title ~~[WIP] Bail Early in And Filter Operator When Possible~~ Bail Early in And Filter Operator When Possible Dec 23, 2024

ankitsultana marked this pull request as ready for review December 23, 2024 20:14

ankitsultana added 4 commits December 23, 2024 22:15

minor refactor

897410b

bug fix: return empty block doc id set in toNonScan..

8b04c2a

skip in filtered aggregates too

e468d6e

last set of improvements

abf4a1d

ankitsultana added the performance label Dec 25, 2024

Jackie-Jiang reviewed Dec 27, 2024

pinot-core/src/main/java/org/apache/pinot/core/common/BlockDocIdSet.java (outdated)

pinot-core/src/main/java/org/apache/pinot/core/operator/docidsets/BitmapDocIdSet.java (outdated)

address feedback

11f1636

ankitsultana changed the title ~~Bail Early in And Filter Operator When Possible~~ Short Circuit And Filter Operator for Index Reads Jan 2, 2025

ankitsultana changed the title ~~Short Circuit And Filter Operator for Index Reads~~ Short Circuit And Filter Operator During Index Evaluation Jan 2, 2025

ankitsultana requested a review from Jackie-Jiang January 2, 2025 20:06

Update AndFilterOperator.java

d7b5abd

ankitsultana mentioned this pull request Jan 3, 2025

Inverted Index Filters Run After All Other Index Filters #14744

Closed

Jackie-Jiang reviewed Jan 7, 2025
