
Implement initial version of read cache at filesystem layer #79

Draft
wants to merge 5 commits into main from hjiang/read-cache-filesystem-initial

Conversation


@dentiny dentiny commented Jan 12, 2025

This PR implements the initial version of the read cache mentioned in the RFC here.

Key items included:

  • A basic read-cache filesystem, which downloads the whole file content and dumps it to the local filesystem as a cache
  • Wrap the filesystem instance in ClientContext with the read-cache filesystem wrapper

There are a few TODOs left unimplemented to control the scope and size of the current PR; the most important ones:

  • Design the state machine for cache files using an unlogged table, so we could delete unused cache files to control overall disk space consumption; I will put out a design for the state machine transitions and implement it in the next PR
    • This is not a regression compared with the existing write-cache solution, which doesn't implement disk reclamation either
  • A few read and write optimizations mentioned in the review thread and in the code, which we could follow up on

How I tested:

@dentiny dentiny force-pushed the hjiang/read-cache-filesystem-initial branch 3 times, most recently from 221f7fd to b0f66d2 on January 12, 2025 10:44
@dentiny dentiny force-pushed the hjiang/read-cache-filesystem-initial branch from b0f66d2 to 0b823e5 on January 12, 2025 10:49
src/columnstore/columnstore_read_cache_filesystem.cpp (outdated review thread)
@@ -135,6 +138,10 @@ unique_ptr<GlobalTableFunctionState> ColumnstoreScanInitGlobal(ClientContext &co
}

TableFunction ColumnstoreTable::GetScanFunction(ClientContext &context, unique_ptr<FunctionData> &bind_data) {
// Use read-optimized filesystem for scan operations.
context.client_data->client_file_system =
Contributor

This is not the right place to hook into the filesystem.
context is shared globally, so you are essentially replacing the global filesystem; it should be hooked in at the database level, such as within DuckDBManager.

Contributor Author

That makes sense! In the latest commit, I wrap the global filesystem once at DuckDB manager initialization;
I also left a TODO item for adding an extension for registration. Let me know if it's OK, thanks!

// Get local cache filename for the given [remote_file].
string GetLocalCacheFile(const string &remote_file) {
	const string fname = StringUtil::GetFileName(remote_file);
	return x_mooncake_local_cache + fname;
}
Contributor

@dpxcc dpxcc Jan 15, 2025

Because you are replacing the global filesystem, you will get conflicts when:

SELECT * FROM mooncake.read_parquet('s3://bucket1/file.parquet');
SELECT * FROM mooncake.read_parquet('s3://bucket2/file.parquet');

This was not a problem before for data files because data filenames are UUIDs

Contributor Author

Yeah, I made it so because we suffix with a UUID here:

file_name = UUID::ToString(UUID::GenerateRandomUUID()) + ".parquet";

I think your comment is valid; I do see we have a plan to import external tables: #61

It's a good point. I added a hash value to the formatted cache filename to ensure uniqueness.

Contributor

We don’t need #61 to reproduce the issue
mooncake.read_parquet() already supports reading arbitrary Parquet files
As shown in my example above, both s3://bucket1/file.parquet and s3://bucket2/file.parquet will be cached to the same location, file.parquet, causing a conflict

Adding a hash value to the cache filename doesn’t resolve this conflict either

Contributor

OK, you are adding Hash(file_path) instead of Hash(filename)
However, the issue remains because Hash() uses MurmurHash64(), which is not collision-resistant
It’s relatively easy to create collisions, and since the cache directory is shared across sessions, this opens up the possibility of injecting a cache file into a different session, posing a security risk

To address this issue, we can:

  1. Use a cryptographically strong hash function like SHA-2
  2. Use a regular PostgreSQL table to track the mapping between file_path and cached filename

Contributor Author

That's a good point! I use SHA-256 in this PR.

Alternatives considered:

  • SHA-1 is more prone to hash collisions; I didn't find a generic SHA-2 util inside of duckdb, so I use SHA-256 here.
  • I also considered sanitizing the remote filename (i.e. replacing / with -), but I'm afraid the file path could be pretty long.

Use a regular PostgreSQL table to track the mapping between file_path and cached filename

I plan to handle it in the next PR;
for example, if the same file has already been cached locally, we simply add a reference count instead of downloading the whole remote object.
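For illustration, here is a minimal, self-contained sketch of the SHA-256-based cache naming described above. It uses OpenSSL's SHA256() purely so the example compiles on its own (the PR itself uses duckdb's SHA-256 util), and kLocalCacheDir is a hypothetical stand-in for x_mooncake_local_cache:

#include <openssl/sha.h>

#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical stand-in for the x_mooncake_local_cache constant in the PR.
static const std::string kLocalCacheDir = "/tmp/mooncake_cache/";

// Hex-encode the SHA-256 digest of [input].
static std::string Sha256Hex(const std::string &input) {
    unsigned char digest[SHA256_DIGEST_LENGTH];
    SHA256(reinterpret_cast<const unsigned char *>(input.data()), input.size(), digest);
    std::ostringstream oss;
    for (unsigned char byte : digest) {
        oss << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(byte);
    }
    return oss.str();
}

// Hash the full remote path (not just the basename) so that
// s3://bucket1/file.parquet and s3://bucket2/file.parquet map to different cache files.
std::string GetLocalCacheFile(const std::string &remote_file) {
    const auto slash_pos = remote_file.find_last_of('/');
    const std::string basename =
        slash_pos == std::string::npos ? remote_file : remote_file.substr(slash_pos + 1);
    return kLocalCacheDir + Sha256Hex(remote_file) + "-" + basename;
}

Keeping the original basename after the hash makes cache files easier to inspect manually, while the hash prefix is what actually prevents collisions between identically named files from different buckets.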

Contributor

@dpxcc dpxcc Jan 18, 2025

I believe SHA-2 is just a family of hash functions that includes SHA-256
Yes, I think it's better to use a PG table to track the mapping, but we can do it in a separate PR

for example, if the same file has already been cached locally, we simply add a reference count instead of download the whole remote object

I'm not sure if we can use it for ref counting; when the PG process crashes, it won't clear its ref count

Contributor Author

I'm not sure if we can use it for ref counting, when PG process crashes, it won't clear its ref count

Yes, I'm thinking of recording the ref count along with its timestamp as well, so we could detect stale ref counts via a reasonable Postgres transaction timeout.

@@ -0,0 +1,149 @@
// Columnstore read cache filesystem implements read cache to optimize query performance.
Contributor

I think it's better to create a CachedHttpfsExtension similar to pg_duckdb
However, instead of copying the entire HttpfsExtension and making changes to it, I believe we can just copy HttpfsExtension's LoadInternal() function, replacing

fs.RegisterSubSystem(make_uniq<HTTPFileSystem>());
fs.RegisterSubSystem(make_uniq<HuggingFaceFileSystem>());
fs.RegisterSubSystem(make_uniq<S3FileSystem>(BufferManager::GetBufferManager(instance)));

with their cached versions to build a very light-weight extension

Contributor Author

@dentiny dentiny Jan 16, 2025

I think that makes sense; I left a TODO for that.
While checking the extension-related code, one thing caught my eye: LoadExtensions is disabled by us.

Disabled by pg_mooncake:

// LoadExtensions(context);

Enabled for pg_duckdb:
https://github.com/duckdb/pg_duckdb/blob/cc4a2d6356b02ec773545247a0edcc091c1ac615/src/pgduckdb_duckdb.cpp#L204

Could you please give me more context?

Contributor

LoadExtensions() is used by the install_extension() UDF to allow end-users to install custom DuckDB extensions: duckdb/pg_duckdb#62
While this is a nice-to-have feature, we didn’t enable it since we don’t have a strong need for it

Your case is different since you want to install some pre-defined extensions. You can simply run DuckDBQueryOrThrow(context, "LOAD <extension>") in DuckDBManager::Initialize()
Even better, you don’t need to package an extension at all - just invoke the RegisterSubSystem() calls directly on the database's filesystem, similar to LoadInternal()

Contributor Author

@dentiny dentiny Jan 17, 2025

Even better, you don’t need to package an extension at all - just invoke the RegisterSubSystem() calls directly on the database's filesystem, similar to LoadInternal()

That's something I've considered, but I do have one concern:
The current implementation of the virtual filesystem is that, among all the registered sub-filesystems, the first fit is returned rather than the best fit, which means the filesystem used depends on the order in which they're registered.
https://github.com/duckdb/duckdb/blob/adc6f607a71b87da2d0a7550e90db623e9bea637/src/common/virtual_file_system.cpp#L182-L196

That's also why they introduced the "mutual set" concept to provide a minimum-priority notion: duckdb/duckdb#10698

That's why in the latest commit, I directly wrap the filesystem instance with our read-cache wrapper, which I think is easy to implement and understand, and is initialized only once.

Let me know if you think it's ok :)

Contributor

I believe you need to remove httpfs from DuckDB: https://github.com/Mooncake-Labs/pg_mooncake/blob/main/Makefile#L109
Now when you register your cached versions of those sub-filesystems, there won't be any conflicts
Note that pg_duckdb also disallows users from loading httpfs extension, which will conflict with their own cached_httpfs extension: https://github.com/Mooncake-Labs/pg_mooncake/blob/main/src/pgduckdb/pgduckdb_duckdb.cpp#L330

Contributor

No need to apologize

Here is the approach I proposed earlier:

  1. Create cached versions of HTTPFileSystem, HuggingFaceFileSystem, and S3FileSystem, instead of VirtualFileSystem. The advantage is that you just need to override Read(), which is far fewer changes, and you can probably use a template to do it generically (see the sketch after this list)
  2. In DuckDBManager::Initialize(), call something similar to HttpfsExtension::Load(), but instead of fs.RegisterSubSystem(make_uniq<HTTPFileSystem>());, register the cached version you created earlier
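To make the template idea concrete, here is a rough, self-contained sketch of such a caching read decorator. The FileSystemLike interface, the CachedReadFileSystem name, and the whole-file ReadFile() entry point are illustrative assumptions rather than DuckDB's actual FileSystem API, which exposes lower-level Read()/OpenFile() hooks:

#include <cstddef>
#include <fstream>
#include <functional>
#include <iterator>
#include <string>
#include <vector>

// Illustrative stand-in for a remote filesystem interface; DuckDB's real FileSystem API differs.
class FileSystemLike {
public:
    virtual ~FileSystemLike() = default;
    // Read the whole object at [path] into memory.
    virtual std::vector<char> ReadFile(const std::string &path) = 0;
};

// Template decorator: wraps any concrete filesystem derived from FileSystemLike and caches
// whole files on local disk, so only the first read of a given path hits the remote store.
template <typename BaseFs>
class CachedReadFileSystem : public BaseFs {
public:
    explicit CachedReadFileSystem(std::string cache_dir) : cache_dir_(std::move(cache_dir)) {}

    std::vector<char> ReadFile(const std::string &path) override {
        const std::string cache_path =
            cache_dir_ + "/" + std::to_string(std::hash<std::string>{}(path));
        std::ifstream cached(cache_path, std::ios::binary);
        if (cached) {
            // Cache hit: serve bytes from the local copy.
            return std::vector<char>(std::istreambuf_iterator<char>(cached),
                                     std::istreambuf_iterator<char>());
        }
        // Cache miss: read through the wrapped filesystem, then persist the bytes locally.
        std::vector<char> data = BaseFs::ReadFile(path);
        std::ofstream out(cache_path, std::ios::binary);
        out.write(data.data(), static_cast<std::streamsize>(data.size()));
        return data;
    }

private:
    std::string cache_dir_;
};

In DuckDB terms, step 2 would then register something like a cached wrapper around HTTPFileSystem (and the S3/HuggingFace equivalents) via RegisterSubSystem() instead of the plain sub-filesystems.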

Contributor

@dpxcc dpxcc Jan 21, 2025

Also, I just noticed that DuckDB's httpfs has a flag

config.AddExtensionOption("force_download", "Forces upfront download of file", LogicalType::BOOLEAN, Value(false));

It enables caching of remote files in memory for the current process
I think this is good enough for us for now

  1. Non-serverless deployments like Docker are already fine with the write cache
  2. For serverless deployments like Neon, I doubt you can reuse the cache across different connections anyway. Also, they are building an S3 cache themselves in Q2

In principle, I'm more inclined to move the cache responsibility to infra, not us, because it's challenging to build a good read cache with Postgres' process model, and it seems to be reinventing the wheel: nginx-s3-gateway
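For reference, a standalone DuckDB C++ sketch of the force_download flag in action (outside pg_mooncake; in the extension this would presumably be issued through DuckDBQueryOrThrow() during DuckDBManager::Initialize(), and the S3 path below is just the example from earlier in this thread):

#include "duckdb.hpp"

int main() {
    duckdb::DuckDB db(nullptr);   // in-memory database
    duckdb::Connection con(db);

    // Load httpfs and enable upfront download, so each remote file is fetched once
    // and cached in memory for the lifetime of this process.
    con.Query("INSTALL httpfs; LOAD httpfs;");
    con.Query("SET force_download = true;");

    // Subsequent reads of the same object are served from the in-memory copy.
    auto result = con.Query("SELECT count(*) FROM read_parquet('s3://bucket1/file.parquet');");
    result->Print();
    return 0;
}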

Contributor Author

Here is the approach I proposed earlier

I considered that; there are two concerns:

  • If we have more filesystem instances in the future (likely via other extensions), we need to register more cached fs instances into the VFS;
  • These filesystem instances live in duckdb's third-party or extension folders, which makes it harder to integrate them into mooncake's dependencies via the makefile.
    • I made it work, but it requires extra build targets like
      thrift/transport/TBufferTransports.cpp.o: ../../third_party/duckdb/third_party/thrift/thrift/transport/TBufferTransports.cpp
      	@mkdir -p $(dir $@)
      	$(CXX) $(CPPFLAGS) $(CXXFLAGS) -c $< -o $@
      plus additional flags to suppress compilation warnings.

Contributor Author

Also, I just noticed that DuckDB's httpfs has a flag

Thank you, good find! I completely agree it's better if caching can be delegated to the filesystem instance.

Contributor

Good point! Both of your concerns are very valid, so I agree that my proposal isn't better

@dentiny dentiny requested a review from dpxcc January 16, 2025 12:50
@dentiny dentiny marked this pull request as draft January 27, 2025 01:30