-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support GCS #73
Support GCS #73
Conversation
2af4b2b
to
6a655ea
Compare
03ea259
to
6892f98
Compare
@zhousun please review when you get a chance. thanks! |
@raghunandanbhat thanks for the contribute. For delta-rs, I think it is possible to inline the secret as json instead of providing a path: https://docs.rs/object_store/latest/src/object_store/gcp/builder.rs.html#365 |
delta-rs now uses the service account json instead of the file path. I believe it is better to provide the path when creating/inserting a secret into |
If I understand correct, providing a path like this requires the files to be in server side. For example if you connect to a remote postgres, you cannot use a secret file in local path. That's why I believe having a way to direct type/copy the secret is important. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general lgtm, will approve once you resolve the other comment (service account json file won't work with server-client setup).
thx!
README.md
Outdated
#### GCS Buckets | ||
- Add your GCS credentials | ||
```sql | ||
SELECT mooncake.create_secret('<name>', 'GCS', '<HMAC_key_id>', '<HMAC_secret>', '{"PATH":"path/to/service_account.json"}'); | ||
``` | ||
- Set default bucket after adding GCS credentials. | ||
```sql | ||
SET mooncake.default_bucket = 'gs://<gcs_bucket>'; | ||
``` | ||
>**Note**: delta-rs seems to accept only Service Account JSON credentials when creating delta tables in GCS. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: To keep readme more condense, we will move the bucket instruction to https://pgmooncake.com/docs/cloud-storage. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure.
please make sure to update the sql query.
SELECT mooncake.create_secret('<name>', 'GCS', '<HMAC_key_id>', '<HMAC_secret>', '{"PATH":"path/to/service_account.json"}');
to
SELECT mooncake.create_secret('<name>', 'GCS', '<HMAC_key_id>', '<HMAC_secret>', '{"GCS_SECRET":"Service Account JSON"}');
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @paurora17
what if we keep both options?
|
Yep that would work |
db60953
to
b4b9472
Compare
rust_extensions/delta/src/lib.rs
Outdated
@@ -48,8 +55,10 @@ pub fn DeltaCreateTable( | |||
runtime.block_on(async { | |||
let mut storage_options: HashMap<String, String> = | |||
serde_json::from_str(options.to_str()?).expect("invalid options"); | |||
// Write directly to S3 without locking is safe since Mooncake is the only writer | |||
storage_options.insert("AWS_S3_ALLOW_UNSAFE_RENAME".to_string(), "true".to_string()); | |||
if get_storage_type(path) == StorageType::S3 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this again, I think it might be better to simply return the type from ColumnstoreMetadata::SecretsSearchDeltaOptions()
and pass it down here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, it is much simpler. we can return tuple<string, string>
from ColumnstoreMetadata::SecretsSearchDeltaOptions()
and pass them separately to DeltaCreateTable()
b4b9472
to
62a1f99
Compare
README.md
Outdated
|
||
> **Note**: On Neon, only cloud storage is supported. Neon users must bring their own S3 or R2 buckets or get a free S3 bucket by signing up at [s3.pgmooncake.com](https://s3.pgmooncake.com/). For cloud storage configuration instructions, see [Cloud Storage](https://pgmooncake.com/docs/cloud-storage). We are working to improve this experience. | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: rm extra newline
rust_extensions/delta/src/lib.rs
Outdated
@@ -48,8 +52,10 @@ pub fn DeltaCreateTable( | |||
runtime.block_on(async { | |||
let mut storage_options: HashMap<String, String> = | |||
serde_json::from_str(options.to_str()?).expect("invalid options"); | |||
// Write directly to S3 without locking is safe since Mooncake is the only writer | |||
storage_options.insert("AWS_S3_ALLOW_UNSAFE_RENAME".to_string(), "true".to_string()); | |||
if storage_type.to_str()?.eq("S3") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose you can just storage_type.eq("S3")
sql/pg_mooncake--0.1.0.sql
Outdated
@@ -275,14 +275,17 @@ RETURNS VOID | |||
LANGUAGE plpgsql | |||
AS $create_secret$ | |||
DECLARE | |||
allowed_keys TEXT[] := ARRAY['ENDPOINT', 'REGION', 'SCOPE', 'USE_SSL']; | |||
s3_allowed_keys TEXT[] := ARRAY['ENDPOINT', 'REGION', 'SCOPE', 'USE_SSL']; | |||
gcs_allowed_keys TEXT[] := ARRAY['SCOPE', 'GCS_SECRET', 'PATH']; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: sort alphabetically
sql/pg_mooncake--0.1.0.sql
Outdated
keys TEXT[]; | ||
invalid_keys TEXT[]; | ||
delta_endpoint TEXT; | ||
gcs_service_account_json JSONB; | ||
gcs_required_keys TEXT[] := ARRAY['type', 'project_id', 'private_key_id', 'private_key', 'client_email', 'client_id']; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sort alphabetically
Should these also be uppercase?
We need to also add doc for these - I think your original doc doesn't include them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
gcs_required_keys
lists the keys in lowercase and in the same order as they appear in secrets JSON file. This makes it easy to spot which key is missing when error message is shown.
cc: @paurora17
For docs, you can use the below-
Google Cloud Storage
Add your GCS credentials by either providing the path to the service account JSON file:
SELECT mooncake.create_secret('<name>', 'GCS', '<HMAC_key_id>', '<HMAC_secret>', '{"PATH":"path/to/service_account.json"}');
Or by supplying the Service Account JSON string directly:
SELECT mooncake.create_secret('<name>', 'GCS', '<HMAC_key_id>', '<HMAC_secret>', '{"GCS_SECRET":"Service Account JSON string"}');
Note: When supplying secrets using a JSON string, ensure it includes the following keys, properly escaped:
'type', 'project_id', 'private_key_id', 'private_key', 'client_email', 'client_id'
.
Set the default bucket after adding GCS credentials:
SET mooncake.default_bucket = 'gs://<gcs_bucket>';
Note: delta-rs seems to accept only Service Account JSON credentials when creating delta tables in GCS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much for the background! I'll revert the order of keys to your original order
@paurora17
When adding the docs, please clearly state that '{"PATH":"path/to/service_account.json"}'
only works if service_account.json
is located on the server side
By default, this won't work in Docker containers unless the file is mounted, and it definitely won't work on Neon
Should we consider omitting this option from the doc altogether?
longest_match = scope.length(); | ||
} | ||
const string storage_type = TextDatumGetCString(values[1]); | ||
if (!IsMatchingStorageType(storage_type, path)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should just compute the type based on path
outside of the loop
@@ -57,6 +57,20 @@ Oid Secrets() { | |||
return get_relname_relid("secrets", Mooncake()); | |||
} | |||
|
|||
static const std::vector<string> s3_prefixes = {"s3://", "s3a://", "s3n://"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
- no need for static. you are already in anonymous namespace
- std::vector -> vector
static const std::vector<string> gcs_prefixes = {"gcs://", "gs://"}; | ||
|
||
// Checks if a path matches the URL scheme for the given storage type (S3 or GCS) | ||
bool IsMatchingStorageType(const string &storage_type, const string &path) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
your variable names are not conforming with our code convention
bool IsMatchingStorageType(const string &storage_type, const string &path) { | ||
auto has_matching_prefix = [&path](const std::vector<string> &prefixes) { | ||
return std::any_of(prefixes.begin(), prefixes.end(), | ||
[&path](string prefix) { return StringUtil::StartsWith(path, prefix); }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should use reference string &prefix
@@ -29,7 +29,7 @@ class ColumnstoreMetadata { | |||
const ColumnList *columns = nullptr); | |||
|
|||
vector<string> SecretsGetDuckdbQueries(); | |||
string SecretsSearchDeltaOptions(const string &path); | |||
std::tuple<string, string> SecretsSearchDeltaStorageConfig(const string &path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: std::tuple<string /*type*/, string /*options*/>
src/lake/lake.cpp
Outdated
} | ||
|
||
void ChangeFile(Oid oid, string file_name, int64_t file_size, bool is_add_file) { | ||
if (cached_table_infos.count(oid) == 0) { | ||
ColumnstoreMetadata metadata(NULL /*snapshot*/); | ||
auto [path, timeline_id] = metadata.TablesSearch(oid); | ||
cached_table_infos[oid] = {path, std::move(timeline_id), metadata.SecretsSearchDeltaOptions(path)}; | ||
auto [storage_type, options] = metadata.SecretsSearchDeltaStorageConfig(path); | ||
cached_table_infos[oid] = {path, std::move(timeline_id), storage_type, options}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should use std::move()
on storage_type
and options
Sorry, but my force-push to gcs-support branch somehow triggered Github to close this PR accidentally.. We've cleaned up your implementation at 42392b7 Also, I believe DuckDB's delta_scan() won't work with GCS right now due to duckdb/duckdb-delta#97 |
looks great! |
Merged into the main branch. Thank you for your work! 🥮 |
fixes #38