Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Short Circuit And Filter Operator During Index Evaluation #14700

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

ankitsultana
Copy link
Contributor

@ankitsultana ankitsultana commented Dec 22, 2024

Returns early from the AndFilterOperator if any of the non-lazy predicate has already evaluated to an Empty Doc Id Set.

Additionally, we also return early from CombinedFilterOperator if the main filter operator evaluated to an empty Doc Id Set.

Caveats

Note that this doesn't work for scan based predicates, since those are lazily evaluated in AndDocIdSet#iterator() for obvious reasons.

Alternatives Considered

An alternative approach to this PR could be to restructure our predicate evaluation for ANDs altogether, and instead run the AND Bitmap on a running basis (i.e. currentBitmap.and(filterOperators.get(i).getTrues()). That way, if the AND of the prefix of the predicates evaluates to an Empty Doc Id Set (!currentBitmap.cardinalityExceeds(0)), we could return early.

We could then also use the idea shared by @richardstartin in #14694 to push down the currently evaluated bitmap to the index, to potentially leverage more optimizations.

I think the first step could be to add a Running AND for the prefix of predicates in the AndFilterOperator and bail early when possible, and the next step could be to push down a partial predicate to certain indexes.

cc: @Jackie-Jiang

Test Plan

Unit tests have decent coverage. I am also planning to run some benchmarks and will add their results soon.

Perf Tests

Created a microbenchmark and that runs this filter (obfuscated) on one of our prod segments and the filter evaluates 2-3x faster with the patch. This is a cherry-picked example, but for any use-case that has a high ratio of segmentsProcessed and segmentsMatched, this optimization should be helpful. (note that Error is quite high.. but results are good enough for comparison)

Without Optimization

Benchmark                                           Mode  Cnt  Score   Error  Units
BenchmarkFilterOperator.testFilterDocIdSetOperator  avgt    5  0.208 ± 0.051  ms/op

With Optimization

Benchmark                                           Mode  Cnt  Score   Error  Units
BenchmarkFilterOperator.testFilterDocIdSetOperator  avgt    5  0.070 ± 0.018  ms/op

Example Filter (Indicative because Obfuscated)

  (
    source = 'some_source'
    AND is_deleted = false
    AND catalog_uuid = 'xyz'
    AND storefront_uuid = 'abc'
    AND NOT (external_ids is NULL)
    AND (
      (
        TEXT_MATCH(analyzed_product_name, '/tostitosx.*/')
      )
    )
    AND (
      (
        TEXT_MATCH(analyzed_product_name, '/chips.*/')
      )
    )
  )

@codecov-commenter
Copy link

codecov-commenter commented Dec 22, 2024

Codecov Report

Attention: Patch coverage is 64.51613% with 11 lines in your changes missing coverage. Please review.

Project coverage is 63.83%. Comparing base (59551e4) to head (d7b5abd).
Report is 1528 commits behind head on master.

Files with missing lines Patch % Lines
...va/org/apache/pinot/core/common/BlockDocIdSet.java 25.00% 2 Missing and 1 partial ⚠️
.../pinot/core/operator/docidsets/BitmapDocIdSet.java 40.00% 2 Missing and 1 partial ⚠️
...re/operator/docidsets/RangelessBitmapDocIdSet.java 33.33% 1 Missing and 1 partial ⚠️
...t/core/operator/filter/CombinedFilterOperator.java 0.00% 1 Missing and 1 partial ⚠️
...inot/core/operator/docidsets/MatchAllDocIdSet.java 0.00% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14700      +/-   ##
============================================
+ Coverage     61.75%   63.83%   +2.08%     
- Complexity      207     1607    +1400     
============================================
  Files          2436     2703     +267     
  Lines        133233   150759   +17526     
  Branches      20636    23296    +2660     
============================================
+ Hits          82274    96237   +13963     
- Misses        44911    47323    +2412     
- Partials       6048     7199    +1151     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.81% <64.51%> (+2.10%) ⬆️
java-21 63.70% <64.51%> (+2.07%) ⬆️
skip-bytebuffers-false 63.82% <64.51%> (+2.07%) ⬆️
skip-bytebuffers-true 63.68% <64.51%> (+35.95%) ⬆️
temurin 63.83% <64.51%> (+2.08%) ⬆️
unittests 63.83% <64.51%> (+2.08%) ⬆️
unittests1 56.26% <64.51%> (+9.37%) ⬆️
unittests2 34.16% <0.00%> (+6.43%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ankitsultana ankitsultana changed the title [WIP] [PoC] Bail Early in And Filter Operator When Possible [WIP] Bail Early in And Filter Operator When Possible Dec 23, 2024
@@ -401,7 +401,7 @@ public void queryStContainsWithMultipleFilters()

AggregationOperator aggregationOperator = getOperator(query);
AggregationResultsBlock resultsBlock = aggregationOperator.nextBlock();
QueriesTestUtils.testInnerSegmentExecutionStatistics(aggregationOperator.getExecutionStatistics(), 0, 2, 0, 1);
QueriesTestUtils.testInnerSegmentExecutionStatistics(aggregationOperator.getExecutionStatistics(), 0, 1, 0, 1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is expected. Previously we would end up evaluating both the predicates, but now we only evaluate the first one, and since that doesn't match any docs, we skip the second predicate

@ankitsultana ankitsultana changed the title [WIP] Bail Early in And Filter Operator When Possible Bail Early in And Filter Operator When Possible Dec 23, 2024
@ankitsultana ankitsultana marked this pull request as ready for review December 23, 2024 20:14
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another optimization we can consider is to stop using index and use scan based after the cardinality is lower than a threshold. Scanning a small amount of records can be faster than reading an index

@ankitsultana ankitsultana changed the title Bail Early in And Filter Operator When Possible Short Circuit And Filter Operator for Index Reads Jan 2, 2025
@ankitsultana ankitsultana changed the title Short Circuit And Filter Operator for Index Reads Short Circuit And Filter Operator During Index Evaluation Jan 2, 2025
Copy link
Contributor

@Jackie-Jiang Jackie-Jiang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now I remember why we didn't have isAlwaysFalse() and isAlwaysTrue() in BlockDocIdSet. We should directly create EmptyDocIdSet and MatchAllDocIdSet when we know the result is always false/true. Initially we followed this contract, but through times newer contribution didn't follow the contract.

Can you try if you can fix the violations of this contract? We should also document this contract in the interface

/**
* For scan-based FilterBlockDocIdSet, pre-scans the documents and returns a non-scan-based FilterBlockDocIdSet.
*/
default BlockDocIdSet toNonScanDocIdSet() {
BlockDocIdIterator docIdIterator = iterator();

if (docIdIterator instanceof EmptyDocIdIterator) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this change? This is prioritizing the always empty filter case, which should be extremely rare

blockDocIdSets.add(blockDocIdSet);
if (blockDocIdSet.isAlwaysFalse()) {
// Return AndDocIdSet to ensure that getNumEntriesScannedInFilter is correctly reported.
return new AndDocIdSet(blockDocIdSets, _queryOptions, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to create AndDocIdSet here? The scan shouldn't happen yet. The overhead of using AndDocIdSet is quite high

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants