
Implement initial version of read cache at filesystem layer #79

Draft
wants to merge 5 commits into main from hjiang/read-cache-filesystem-initial

Conversation


@dentiny dentiny commented Jan 12, 2025

This PR implements the initial version of the read cache mentioned in the RFC here.

Key items included:

  • A basic read-cache filesystem, which downloads the whole file content and dumps it to the local filesystem as a cache
  • Wrap the filesystem instance in ClientContext with the read-cache filesystem wrapper

There are a few TODOs left unimplemented to control the scope and size of the current PR; the most important ones:

  • Design the state machine for cache files using an unlogged table, so we could delete unused cache files to control overall disk space consumption; I will put out a design for the state machine transitions and implement it in the next PR
    • This is not a regression compared with the existing write-cache solution, which doesn't implement disk reclamation either
  • A few read and write optimizations mentioned in the review thread and in the code, which we could follow up on

How I tested:

@dentiny dentiny force-pushed the hjiang/read-cache-filesystem-initial branch 3 times, most recently from 221f7fd to b0f66d2 on January 12, 2025 10:44
@dentiny dentiny force-pushed the hjiang/read-cache-filesystem-initial branch from b0f66d2 to 0b823e5 on January 12, 2025 10:49
src/columnstore/columnstore_read_cache_filesystem.cpp (outdated review thread)
@@ -135,6 +138,10 @@ unique_ptr<GlobalTableFunctionState> ColumnstoreScanInitGlobal(ClientContext &co
}

TableFunction ColumnstoreTable::GetScanFunction(ClientContext &context, unique_ptr<FunctionData> &bind_data) {
// Use read-optimized filesystem for scan operations.
context.client_data->client_file_system =
Contributor

This is not the right place to hook into the filesystem.
context is shared globally, so you are essentially replacing the global filesystem; it should be hooked in at the database level, such as within DuckDBManager.

Contributor Author

That makes sense! In the latest commit, I wrap the global filesystem once at DuckDB manager initialization;
I also left a TODO item for adding an extension for registration. Let me know if it's OK, thanks!

// Get local cache filename for the given [remote_file].
string GetLocalCacheFile(const string &remote_file) {
	const string fname = StringUtil::GetFileName(remote_file);
	return x_mooncake_local_cache + fname;
}
Contributor

@dpxcc dpxcc Jan 15, 2025

Because you are replacing the global filesystem, you will get conflicts when:

SELECT * FROM mooncake.read_parquet('s3://bucket1/file.parquet');
SELECT * FROM mooncake.read_parquet('s3://bucket2/file.parquet');

This was not a problem before for data files because data filenames are UUIDs

Contributor Author

Yeah, I made it so because we suffix with a UUID here:

file_name = UUID::ToString(UUID::GenerateRandomUUID()) + ".parquet";

I think your comment is valid; I do see we have a plan to import external tables: #61

It's a good point. I added a hash value to the formatted cache filename to ensure uniqueness.

Contributor

We don’t need #61 to reproduce the issue
mooncake.read_parquet() already supports reading arbitrary Parquet files
As shown in my example above, both s3://bucket1/file.parquet and s3://bucket2/file.parquet will be cached to the same location, file.parquet, causing a conflict

Adding a hash value to the cache filename doesn’t resolve this conflict either

Contributor

OK, you are adding Hash(file_path) instead of Hash(filename)
However, the issue remains because Hash() uses MurmurHash64(), which is not collision-resistant
It’s relatively easy to create collisions, and since the cache directory is shared across sessions, this opens up the possibility of injecting a cache file into a different session, posing a security risk

To address this issue, we can:

  1. Use a cryptographically strong hash function like SHA-2
  2. Use a regular PostgreSQL table to track the mapping between file_path and cached filename

Contributor Author

That's a good point! I use SHA-256 in this PR.

Alternatives considered:

  • SHA-1 is more prone to hash collisions; I didn't find a generic SHA-2 util inside of duckdb, so I use SHA-256 here.
  • I also considered sanitizing the remote filename (i.e. replacing / with -), but I'm afraid the file path could be pretty long.

Use a regular PostgreSQL table to track the mapping between file_path and cached filename

I plan to handle it in the next PR;
for example, if the same file has already been cached locally, we simply add a reference count instead of downloading the whole remote object.
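For illustration, here is a minimal, self-contained sketch of the SHA-256-based cache naming described above. It uses OpenSSL's SHA256() purely so the example compiles on its own (the PR itself uses duckdb's SHA-256 util), and kLocalCacheDir is a hypothetical stand-in for x_mooncake_local_cache:

#include <openssl/sha.h>

#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical stand-in for the x_mooncake_local_cache constant in the PR.
static const std::string kLocalCacheDir = "/tmp/mooncake_cache/";

// Hex-encode the SHA-256 digest of [input].
static std::string Sha256Hex(const std::string &input) {
    unsigned char digest[SHA256_DIGEST_LENGTH];
    SHA256(reinterpret_cast<const unsigned char *>(input.data()), input.size(), digest);
    std::ostringstream oss;
    for (unsigned char byte : digest) {
        oss << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(byte);
    }
    return oss.str();
}

// Hash the full remote path (not just the basename) so that
// s3://bucket1/file.parquet and s3://bucket2/file.parquet map to different cache files.
std::string GetLocalCacheFile(const std::string &remote_file) {
    const auto slash_pos = remote_file.find_last_of('/');
    const std::string basename =
        slash_pos == std::string::npos ? remote_file : remote_file.substr(slash_pos + 1);
    return kLocalCacheDir + Sha256Hex(remote_file) + "-" + basename;
}

Keeping the original basename after the hash makes cache files easier to inspect manually, while the hash prefix is what actually prevents collisions between identically named files from different buckets.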

Contributor

@dpxcc dpxcc Jan 18, 2025

I believe SHA-2 is just a family of hash functions that includes SHA-256
Yes, I think it's better to use a PG table to track the mapping, but we can do it in a separate PR

for example, if the same file has already been cached locally, we simply add a reference count instead of download the whole remote object

I'm not sure if we can use it for ref counting; when the PG process crashes, it won't clear its ref count

Contributor Author

I'm not sure if we can use it for ref counting, when PG process crashes, it won't clear its ref count

Yes, I'm thinking of recording the ref count along with its timestamp as well, so we could detect stale ref counts via a reasonable Postgres transaction timeout.

@@ -0,0 +1,149 @@
// Columnstore read cache filesystem implements read cache to optimize query performance.
Contributor

I think it's better to create a CachedHttpfsExtension similar to pg_duckdb
However, instead of copying the entire HttpfsExtension and making changes to it, I believe we can just copy HttpfsExtension's LoadInternal() function, replacing

fs.RegisterSubSystem(make_uniq<HTTPFileSystem>());
fs.RegisterSubSystem(make_uniq<HuggingFaceFileSystem>());
fs.RegisterSubSystem(make_uniq<S3FileSystem>(BufferManager::GetBufferManager(instance)));

with their cached versions to build a very light-weight extension

Contributor Author

@dentiny dentiny Jan 16, 2025

I think that makes sense; I left a TODO for that.
While checking the extension-related code, one thing caught my eye: LoadExtensions is disabled by us.

Disabled by pg_mooncake:

// LoadExtensions(context);

Enabled for pg_duckdb:
https://github.com/duckdb/pg_duckdb/blob/cc4a2d6356b02ec773545247a0edcc091c1ac615/src/pgduckdb_duckdb.cpp#L204

Could you please give me more context?

Contributor

LoadExtensions() is used by the install_extension() UDF to allow end-users to install custom DuckDB extensions: duckdb/pg_duckdb#62
While this is a nice-to-have feature, we didn’t enable it since we don’t have a strong need for it

Your case is different since you want to install some pre-defined extensions. You can simply run DuckDBQueryOrThrow(context, "LOAD <extension>") in DuckDBManager::Initialize()
Even better, you don’t need to package an extension at all - just invoke the RegisterSubSystem() calls directly on the database's filesystem, similar to LoadInternal()

Contributor Author

@dentiny dentiny Jan 17, 2025

Even better, you don’t need to package an extension at all - just invoke the RegisterSubSystem() calls directly on the database's filesystem, similar to LoadInternal()

That's something I've considered, but I do have one concern:
The current implementation of the virtual filesystem is that, among all the registered sub-filesystems, the first fit is returned rather than the best fit, which means the filesystem used depends on the order in which they're registered.
https://github.com/duckdb/duckdb/blob/adc6f607a71b87da2d0a7550e90db623e9bea637/src/common/virtual_file_system.cpp#L182-L196

That's also why they introduced the "mutual set" concept to provide a minimum-priority notion: duckdb/duckdb#10698

That's why in the latest commit, I directly wrap the filesystem instance with our read-cache wrapper, which I think is easy to implement and understand, and is initialized only once.

Let me know if you think it's ok :)

Contributor

I believe you need to remove httpfs from DuckDB: https://github.com/Mooncake-Labs/pg_mooncake/blob/main/Makefile#L109
Now when you register your cached versions of those sub-filesystems, there won't be any conflicts
Note that pg_duckdb also disallows users from loading httpfs extension, which will conflict with their own cached_httpfs extension: https://github.com/Mooncake-Labs/pg_mooncake/blob/main/src/pgduckdb/pgduckdb_duckdb.cpp#L330

Contributor

No need to apologize

Here is the approach I proposed earlier:

  1. Create cached versions of HTTPFileSystem, HuggingFaceFileSystem, and S3FileSystem, instead of VirtualFileSystem. The advantage is that you just need to override Read(), which is far fewer changes, and you can probably use a template to do it generically (see the sketch after this list)
  2. In DuckDBManager::Initialize(), call something similar to HttpfsExtension::Load(), but instead of fs.RegisterSubSystem(make_uniq<HTTPFileSystem>());, register the cached version you created earlier
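To make the template idea concrete, here is a rough, self-contained sketch of such a caching read decorator. The FileSystemLike interface, the CachedReadFileSystem name, and the whole-file ReadFile() entry point are illustrative assumptions rather than DuckDB's actual FileSystem API, which exposes lower-level Read()/OpenFile() hooks:

#include <cstddef>
#include <fstream>
#include <functional>
#include <iterator>
#include <string>
#include <vector>

// Illustrative stand-in for a remote filesystem interface; DuckDB's real FileSystem API differs.
class FileSystemLike {
public:
    virtual ~FileSystemLike() = default;
    // Read the whole object at [path] into memory.
    virtual std::vector<char> ReadFile(const std::string &path) = 0;
};

// Template decorator: wraps any concrete filesystem derived from FileSystemLike and caches
// whole files on local disk, so only the first read of a given path hits the remote store.
template <typename BaseFs>
class CachedReadFileSystem : public BaseFs {
public:
    explicit CachedReadFileSystem(std::string cache_dir) : cache_dir_(std::move(cache_dir)) {}

    std::vector<char> ReadFile(const std::string &path) override {
        const std::string cache_path =
            cache_dir_ + "/" + std::to_string(std::hash<std::string>{}(path));
        std::ifstream cached(cache_path, std::ios::binary);
        if (cached) {
            // Cache hit: serve bytes from the local copy.
            return std::vector<char>(std::istreambuf_iterator<char>(cached),
                                     std::istreambuf_iterator<char>());
        }
        // Cache miss: read through the wrapped filesystem, then persist the bytes locally.
        std::vector<char> data = BaseFs::ReadFile(path);
        std::ofstream out(cache_path, std::ios::binary);
        out.write(data.data(), static_cast<std::streamsize>(data.size()));
        return data;
    }

private:
    std::string cache_dir_;
};

In DuckDB terms, step 2 would then register something like a cached wrapper around HTTPFileSystem (and the S3/HuggingFace equivalents) via RegisterSubSystem() instead of the plain sub-filesystems.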

Contributor

@dpxcc dpxcc Jan 21, 2025

Also, I just noticed that DuckDB's httpfs has a flag

config.AddExtensionOption("force_download", "Forces upfront download of file", LogicalType::BOOLEAN, Value(false));

It enables caching of remote files in memory for the current process
I think this is good enough for us for now

  1. Non-serverless deployments like Docker are already fine with the write cache
  2. For serverless deployments like Neon, I doubt you can reuse the cache across different connections anyway. Also, they are building an S3 cache themselves in Q2

In principle, I'm more inclined to move the cache responsibility to infra, not us, because it's challenging to build a good read cache with Postgres' process model, and it seems to be reinventing the wheel: nginx-s3-gateway
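For reference, a standalone DuckDB C++ sketch of the force_download flag in action (outside pg_mooncake; in the extension this would presumably be issued through DuckDBQueryOrThrow() during DuckDBManager::Initialize(), and the S3 path below is just the example from earlier in this thread):

#include "duckdb.hpp"

int main() {
    duckdb::DuckDB db(nullptr);   // in-memory database
    duckdb::Connection con(db);

    // Load httpfs and enable upfront download, so each remote file is fetched once
    // and cached in memory for the lifetime of this process.
    con.Query("INSTALL httpfs; LOAD httpfs;");
    con.Query("SET force_download = true;");

    // Subsequent reads of the same object are served from the in-memory copy.
    auto result = con.Query("SELECT count(*) FROM read_parquet('s3://bucket1/file.parquet');");
    result->Print();
    return 0;
}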

Contributor Author

Here is the approach I proposed earlier

I considered that; there are two concerns:

  • If we have more filesystem instances in the future (likely via other extensions), we need to register more cached fs instances into the VFS;
  • These filesystem instances live in duckdb's third-party or extension folders, which makes it harder to integrate them into mooncake's dependencies via the makefile.
    • I made it work, but it requires extra build targets like
      thrift/transport/TBufferTransports.cpp.o: ../../third_party/duckdb/third_party/thrift/thrift/transport/TBufferTransports.cpp
      	@mkdir -p $(dir $@)
      	$(CXX) $(CPPFLAGS) $(CXXFLAGS) -c $< -o $@
      plus additional flags to suppress compilation warnings.

Contributor Author

Also, I just noticed that DuckDB's httpfs has a flag

Thank you, good find! I completely agree it's better if caching can be delegated to the filesystem instance.

Contributor

Good point! Both of your concerns are very valid, so I agree that my proposal isn't better

@dentiny dentiny requested a review from dpxcc January 16, 2025 12:50
@dentiny dentiny marked this pull request as draft January 27, 2025 01:30