Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support GCS #73

Closed
Closed

Conversation

raghunandanbhat
Copy link
Contributor

fixes #38

@dpxcc dpxcc requested a review from zhousun January 7, 2025 19:09
@dpxcc dpxcc force-pushed the main branch 2 times, most recently from 2af4b2b to 6a655ea Compare January 10, 2025 04:05
@raghunandanbhat raghunandanbhat force-pushed the gcs-support branch 3 times, most recently from 03ea259 to 6892f98 Compare January 13, 2025 16:33
@raghunandanbhat raghunandanbhat marked this pull request as ready for review January 13, 2025 16:33
@raghunandanbhat
Copy link
Contributor Author

@zhousun please review when you get a chance. thanks!

@zhousun
Copy link
Contributor

zhousun commented Jan 13, 2025

@raghunandanbhat thanks for the contribute. For delta-rs, I think it is possible to inline the secret as json instead of providing a path: https://docs.rs/object_store/latest/src/object_store/gcp/builder.rs.html#365

@raghunandanbhat
Copy link
Contributor Author

delta-rs now uses the service account json instead of the file path.

I believe it is better to provide the path when creating/inserting a secret into mooncake.secrets, as it is easier than typing the entire JSON.

@zhousun
Copy link
Contributor

zhousun commented Jan 15, 2025

If I understand correct, providing a path like this requires the files to be in server side. For example if you connect to a remote postgres, you cannot use a secret file in local path. That's why I believe having a way to direct type/copy the secret is important.

Copy link
Contributor

@zhousun zhousun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general lgtm, will approve once you resolve the other comment (service account json file won't work with server-client setup).
thx!

README.md Outdated
Comment on lines 86 to 96
#### GCS Buckets
- Add your GCS credentials
```sql
SELECT mooncake.create_secret('<name>', 'GCS', '<HMAC_key_id>', '<HMAC_secret>', '{"PATH":"path/to/service_account.json"}');
```
- Set default bucket after adding GCS credentials.
```sql
SET mooncake.default_bucket = 'gs://<gcs_bucket>';
```
>**Note**: delta-rs seems to accept only Service Account JSON credentials when creating delta tables in GCS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: To keep readme more condense, we will move the bucket instruction to https://pgmooncake.com/docs/cloud-storage. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure.
please make sure to update the sql query.

SELECT mooncake.create_secret('<name>', 'GCS', '<HMAC_key_id>', '<HMAC_secret>', '{"PATH":"path/to/service_account.json"}');

to

SELECT mooncake.create_secret('<name>', 'GCS', '<HMAC_key_id>', '<HMAC_secret>', '{"GCS_SECRET":"Service Account JSON"}');

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raghunandanbhat
Copy link
Contributor Author

If I understand correct, providing a path like this requires the files to be in server side. For example if you connect to a remote postgres, you cannot use a secret file in local path. That's why I believe having a way to direct type/copy the secret is important.

what if we keep both options?

  • type/copy secrets for server-client setup
  • path based secrets for local setup

@zhousun
Copy link
Contributor

zhousun commented Jan 15, 2025

If I understand correct, providing a path like this requires the files to be in server side. For example if you connect to a remote postgres, you cannot use a secret file in local path. That's why I believe having a way to direct type/copy the secret is important.

what if we keep both options?

  • type/copy secrets for server-client setup
  • path based secrets for local setup

Yep that would work

rust_extensions/delta/Cargo.toml Outdated Show resolved Hide resolved
rust_extensions/delta/src/lib.rs Outdated Show resolved Hide resolved
rust_extensions/delta/src/lib.rs Outdated Show resolved Hide resolved
sql/pg_mooncake--0.1.0.sql Outdated Show resolved Hide resolved
src/columnstore/columnstore_metadata.cpp Outdated Show resolved Hide resolved
src/columnstore/columnstore_metadata.cpp Outdated Show resolved Hide resolved
rust_extensions/delta/src/lib.rs Outdated Show resolved Hide resolved
@@ -48,8 +55,10 @@ pub fn DeltaCreateTable(
runtime.block_on(async {
let mut storage_options: HashMap<String, String> =
serde_json::from_str(options.to_str()?).expect("invalid options");
// Write directly to S3 without locking is safe since Mooncake is the only writer
storage_options.insert("AWS_S3_ALLOW_UNSAFE_RENAME".to_string(), "true".to_string());
if get_storage_type(path) == StorageType::S3 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this again, I think it might be better to simply return the type from ColumnstoreMetadata::SecretsSearchDeltaOptions() and pass it down here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, it is much simpler. we can return tuple<string, string> from ColumnstoreMetadata::SecretsSearchDeltaOptions() and pass them separately to DeltaCreateTable()

src/columnstore/columnstore_metadata.cpp Outdated Show resolved Hide resolved
README.md Outdated

> **Note**: On Neon, only cloud storage is supported. Neon users must bring their own S3 or R2 buckets or get a free S3 bucket by signing up at [s3.pgmooncake.com](https://s3.pgmooncake.com/). For cloud storage configuration instructions, see [Cloud Storage](https://pgmooncake.com/docs/cloud-storage). We are working to improve this experience.


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: rm extra newline

@@ -48,8 +52,10 @@ pub fn DeltaCreateTable(
runtime.block_on(async {
let mut storage_options: HashMap<String, String> =
serde_json::from_str(options.to_str()?).expect("invalid options");
// Write directly to S3 without locking is safe since Mooncake is the only writer
storage_options.insert("AWS_S3_ALLOW_UNSAFE_RENAME".to_string(), "true".to_string());
if storage_type.to_str()?.eq("S3") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose you can just storage_type.eq("S3")

@@ -275,14 +275,17 @@ RETURNS VOID
LANGUAGE plpgsql
AS $create_secret$
DECLARE
allowed_keys TEXT[] := ARRAY['ENDPOINT', 'REGION', 'SCOPE', 'USE_SSL'];
s3_allowed_keys TEXT[] := ARRAY['ENDPOINT', 'REGION', 'SCOPE', 'USE_SSL'];
gcs_allowed_keys TEXT[] := ARRAY['SCOPE', 'GCS_SECRET', 'PATH'];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: sort alphabetically

keys TEXT[];
invalid_keys TEXT[];
delta_endpoint TEXT;
gcs_service_account_json JSONB;
gcs_required_keys TEXT[] := ARRAY['type', 'project_id', 'private_key_id', 'private_key', 'client_email', 'client_id'];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sort alphabetically
Should these also be uppercase?
We need to also add doc for these - I think your original doc doesn't include them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

gcs_required_keys lists the keys in lowercase and in the same order as they appear in secrets JSON file. This makes it easy to spot which key is missing when error message is shown.

cc: @paurora17
For docs, you can use the below-

Google Cloud Storage

Add your GCS credentials by either providing the path to the service account JSON file:

SELECT mooncake.create_secret('<name>', 'GCS', '<HMAC_key_id>', '<HMAC_secret>', '{"PATH":"path/to/service_account.json"}');

Or by supplying the Service Account JSON string directly:

SELECT mooncake.create_secret('<name>', 'GCS', '<HMAC_key_id>', '<HMAC_secret>', '{"GCS_SECRET":"Service Account JSON string"}');

Note: When supplying secrets using a JSON string, ensure it includes the following keys, properly escaped: 'type', 'project_id', 'private_key_id', 'private_key', 'client_email', 'client_id'.

Set the default bucket after adding GCS credentials:

SET mooncake.default_bucket = 'gs://<gcs_bucket>';

Note: delta-rs seems to accept only Service Account JSON credentials when creating delta tables in GCS.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for the background! I'll revert the order of keys to your original order

@paurora17
When adding the docs, please clearly state that '{"PATH":"path/to/service_account.json"}' only works if service_account.json is located on the server side
By default, this won't work in Docker containers unless the file is mounted, and it definitely won't work on Neon
Should we consider omitting this option from the doc altogether?

longest_match = scope.length();
}
const string storage_type = TextDatumGetCString(values[1]);
if (!IsMatchingStorageType(storage_type, path)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should just compute the type based on path outside of the loop

@@ -57,6 +57,20 @@ Oid Secrets() {
return get_relname_relid("secrets", Mooncake());
}

static const std::vector<string> s3_prefixes = {"s3://", "s3a://", "s3n://"};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

  1. no need for static. you are already in anonymous namespace
  2. std::vector -> vector

static const std::vector<string> gcs_prefixes = {"gcs://", "gs://"};

// Checks if a path matches the URL scheme for the given storage type (S3 or GCS)
bool IsMatchingStorageType(const string &storage_type, const string &path) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

your variable names are not conforming with our code convention

bool IsMatchingStorageType(const string &storage_type, const string &path) {
auto has_matching_prefix = [&path](const std::vector<string> &prefixes) {
return std::any_of(prefixes.begin(), prefixes.end(),
[&path](string prefix) { return StringUtil::StartsWith(path, prefix); });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should use reference string &prefix

@@ -29,7 +29,7 @@ class ColumnstoreMetadata {
const ColumnList *columns = nullptr);

vector<string> SecretsGetDuckdbQueries();
string SecretsSearchDeltaOptions(const string &path);
std::tuple<string, string> SecretsSearchDeltaStorageConfig(const string &path);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: std::tuple<string /*type*/, string /*options*/>

}

void ChangeFile(Oid oid, string file_name, int64_t file_size, bool is_add_file) {
if (cached_table_infos.count(oid) == 0) {
ColumnstoreMetadata metadata(NULL /*snapshot*/);
auto [path, timeline_id] = metadata.TablesSearch(oid);
cached_table_infos[oid] = {path, std::move(timeline_id), metadata.SecretsSearchDeltaOptions(path)};
auto [storage_type, options] = metadata.SecretsSearchDeltaStorageConfig(path);
cached_table_infos[oid] = {path, std::move(timeline_id), storage_type, options};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should use std::move() on storage_type and options

@dpxcc
Copy link
Contributor

dpxcc commented Jan 21, 2025

Sorry, but my force-push to gcs-support branch somehow triggered Github to close this PR accidentally..

We've cleaned up your implementation at 42392b7
(The cleanup addressed the comments above)
Let us know if you are okay with that

Also, I believe DuckDB's delta_scan() won't work with GCS right now due to duckdb/duckdb-delta#97

@raghunandanbhat
Copy link
Contributor Author

We've cleaned up your implementation at 42392b7
(The cleanup addressed the comments above)
Let us know if you are okay with that

looks great!

dpxcc pushed a commit that referenced this pull request Jan 21, 2025
dpxcc pushed a commit that referenced this pull request Jan 21, 2025
@dpxcc
Copy link
Contributor

dpxcc commented Jan 21, 2025

Merged into the main branch. Thank you for your work! 🥮

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

support for gcs
3 participants