Snap 2358 Sorted Column Batches on partitioning keys #1054
Open: vibhaska wants to merge 392 commits into master from SNAP-2358 (+2,755 −96)
Includes the changes for the two issues and a bunch of other fixes found in testing.
- Implementation of StoreCallbacks.columnTableScan that translates Filters to Expressions and generates the code to apply the same locally to ColumnBatchIterator stats rows
- Changed smart connector iterator to use the new COLUMN_TABLE_SCAN procedure instead of multiple queries.
- Added passing of Filters to the plans and recreation of those in getPartitions if the parameter values have changed for ParamLiterals
- Fixed RowFormatScanRDD to regenerate filter clause in getPartitions if the parameter values have changed for ParamLiterals (not seen earlier because index columns were incorrect, which has been fixed in store)
- Perf fix to ColumnFormatIterator: keep track of updated delta stats separately with forced faultin like the full stats in DiskMultiColumnBatch so that the entire batch does not need to be read if the filter can skip using stats
- Perf fix to RemoteEntriesIterator:
  - fetch both full stats and delta stats rows when fetching keys the first time
  - sort and ensure both rows of a batch are together when fetching other columns
- Updated ParamLiteral serialization to replace its value with the updated one in LiteralValue since the parameter may have changed (but the base Literal.value is a val and cannot be changed)
- Corrected RDD to be cleared in CachedDataFrame to use either the cachedRDD or the last used one for execution.
- Added handling of the new DECOMPRESS_IF_IN_MEMORY fetch type to return self (or null) if decompression cannot replace the underlying in-memory value
- Updated ColumnFormatValue to store disk RegionEntry instead of diskId since the latter can change.
- Improved performance of Snappy stats iterator to avoid lookup of the deleted bitmask column for every stats column; instead iterate both of them and add negative size for deletes if any.
- Added more transient expected exception types to SnappyTestRunner.
- Updated store link.
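To illustrate why applying filters to stats rows lets a scan skip whole batches, here is a minimal sketch of min/max pruning. The `BatchStats` and `Pred` types are hypothetical stand-ins invented for this example; they are not the SnappyData or Spark classes, and the actual change generates code for this check rather than interpreting predicates as done here.

```scala
// Hypothetical stand-ins for a stats row and a pushed-down predicate;
// these are NOT SnappyData or Spark classes.
final case class BatchStats(min: Int, max: Int)

sealed trait Pred
final case class EqualTo(value: Int) extends Pred
final case class GreaterThan(value: Int) extends Pred
final case class LessThan(value: Int) extends Pred

object StatsSkip {
  // True when some row in the batch could satisfy the predicate,
  // judged only from the batch's min/max statistics. A false result
  // means the batch can be skipped without reading its columns.
  def mayContain(stats: BatchStats, pred: Pred): Boolean = pred match {
    case EqualTo(v)     => stats.min <= v && v <= stats.max
    case GreaterThan(v) => stats.max > v
    case LessThan(v)    => stats.min < v
  }
}
```

A batch whose stats are `BatchStats(10, 20)` would be read for `EqualTo(15)` but skipped for `GreaterThan(20)`.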
also refactored PooledKryoSerializer to add generic serialize/deserialize methods that accept closures
- primary reason being that StringStartsWith Filter requires a string as pattern and cannot hold "Any", so the hack of stuffing a ParamLiteral inside a Filter does not work; now using Expressions, which are translated to Filters just before use if required
- pushdown of filters from smart connector to server still uses Filter after conversion from Expression and when the ParamLiterals have been substituted with current values
- removed awkward handling of ParamLiterals inside Filters as a result of the above changes
- fixed the StartsWith stats filter to use a ParamLiteral and generated code for the comparison against stats row bounds
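The StartsWith comparison against stats row bounds can be sketched as follows. This is an assumption-laden illustration, not the generated code from this PR: `StartsWithStats` and `mayContainPrefix` are hypothetical names, and the bounds are assumed to be plain strings ordered by `String.compareTo`.

```scala
object StartsWithStats {
  // Hypothetical helper: decides from a batch's [lower, upper] string bounds
  // whether any value in the batch could start with `prefix`.
  //  - The smallest string with the prefix is the prefix itself, so it must
  //    not sort after the upper bound.
  //  - The lower bound, truncated to the prefix length, must not sort after
  //    the prefix; otherwise every prefixed string is below the lower bound.
  def mayContainPrefix(lower: String, upper: String, prefix: String): Boolean = {
    val lowerHead = lower.take(prefix.length)
    prefix.compareTo(upper) <= 0 && lowerHead.compareTo(prefix) <= 0
  }
}
```

With bounds `"apple"` to `"banana"`, the prefix `"ba"` may match, so the batch is read, while the prefix `"c"` cannot match and the batch can be skipped.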
…edQueryRoutingSingleNodeSuite was failing with NPE exception since DefaultSource did not have relevant properties.
…d insert. Need to implement again and also address cases of group by. Now this project is not dependent on AQP changes.
sumwale force-pushed the master branch 5 times, most recently from 8b43301 to 2b254d9 (October 1, 2021 09:23)
sumwale force-pushed the master branch 5 times, most recently from 2c254f0 to 0f2888f (October 18, 2021 17:01)
sumwale force-pushed the master branch 2 times, most recently from a466d26 to ea127bd (April 12, 2022 10:05)
sumwale force-pushed the master branch 2 times, most recently from 99ec79c to c7b84fa (June 12, 2022 04:19)
Changes proposed in this pull request
Users can now create sorted column batches on partitioning keys using the DDL shown below. Each column batch is kept sorted on the partitioning columns, which can be leveraged for better performance of point queries, range queries, and colocated join queries on those columns. For more details, please refer to https://jira.snappydata.io/browse/SNAP-2358
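As a rough illustration of why a sorted batch helps point and range queries, the sketch below uses binary search over an ascending key array to find the matching row range without scanning every row. The names `SortedBatchLookup`, `lowerBound`, and `rangeRows` are hypothetical, and a plain `Array[Int]` stands in for a column batch; this does not reflect how the PR actually encodes or scans batches.

```scala
object SortedBatchLookup {
  // Index of the first element >= key (keys.length if there is none).
  // Assumes `keys` is sorted in ascending order.
  def lowerBound(keys: Array[Int], key: Int): Int = {
    var lo = 0
    var hi = keys.length
    while (lo < hi) {
      val mid = lo + (hi - lo) / 2
      if (keys(mid) < key) lo = mid + 1 else hi = mid
    }
    lo
  }

  // Half-open row range [start, end) holding keys with from <= key <= to.
  // A point query is the special case from == to.
  def rangeRows(keys: Array[Int], from: Int, to: Int): (Int, Int) = {
    val start = lowerBound(keys, from)
    // Guard against overflow of to + 1 at the top of the Int range.
    val end = if (to == Int.MaxValue) keys.length else lowerBound(keys, to + 1)
    (start, math.max(start, end))
  }
}
```

For an unsorted batch the same query would need to examine every row; with sorted keys it takes a logarithmic number of comparisons to locate the range.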
TODO:
Patch testing
Unit test
Precheckin
ReleaseNotes.txt changes
A sample DDL to create a table with sorted partitioning columns:
session.sql(s"create table $colTableName (id int, addr string, status boolean) " +
s"using column options(buckets '$numBuckets', partition_by 'id SORTING ASC' " + s")")
If no sorting is required, the DDL above becomes:
session.sql(s"create table $colTableName (id int, addr string, status boolean) " +
s"using column options(buckets '$numBuckets', partition_by 'id' " + s")")
Valid sorting identifiers are:
SORTING ASC
SORTING DESC
SORTING Ascending
SORTING Descending
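One way a single `partition_by` entry with these identifiers could be interpreted is sketched below. The `PartitionBySpec` object and `SortDirection` types are hypothetical and exist only for this example; case-insensitive matching and the single-column form are assumptions, and the parsing in the PR itself may differ.

```scala
sealed trait SortDirection
case object Ascending extends SortDirection
case object Descending extends SortDirection

object PartitionBySpec {
  // Hypothetical parser for one entry such as "id" or "id SORTING ASC".
  // Returns the column name and the optional sort direction.
  // Assumption: keywords are matched case-insensitively.
  def parse(spec: String): (String, Option[SortDirection]) =
    spec.trim.split("\\s+").toList match {
      case col :: Nil if col.nonEmpty => (col, None)
      case col :: kw :: dir :: Nil if kw.equalsIgnoreCase("SORTING") =>
        dir.toLowerCase match {
          case "asc" | "ascending"   => (col, Some(Ascending))
          case "desc" | "descending" => (col, Some(Descending))
          case other =>
            throw new IllegalArgumentException(s"Unknown sorting identifier: $other")
        }
      case _ =>
        throw new IllegalArgumentException(s"Invalid partition_by entry: $spec")
    }
}
```

Under these assumptions, `PartitionBySpec.parse("id SORTING ASC")` yields the column `id` with an ascending direction, and `PartitionBySpec.parse("id")` yields `id` with no sorting.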
Other PRs
TIBCOSoftware/snappy-store#395