Implement initial version of read cache at filesystem layer #79
Conversation
```
@@ -135,6 +138,10 @@ unique_ptr<GlobalTableFunctionState> ColumnstoreScanInitGlobal(ClientContext &co
}

TableFunction ColumnstoreTable::GetScanFunction(ClientContext &context, unique_ptr<FunctionData> &bind_data) {
    // Use read-optimized filesystem for scan operations.
    context.client_data->client_file_system =
```
This is not the right place to hook into the filesystem. `context` is shared globally, so you are essentially replacing the global filesystem; it should be hooked at the database level, such as within `DuckDBManager`.
That makes sense! In the latest commit, I wrap the global filesystem once at DuckDB manager initialization.
I also left a TODO item for adding an extension for registration; let me know if that's OK, thanks!
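For context, a minimal sketch of what wrapping the database-level filesystem once at manager initialization could look like (the `ReadCacheFileSystem` name and the helper's shape are assumptions for illustration, not the PR's actual code):

```cpp
#include "duckdb.hpp"

// Hypothetical sketch: decorate DuckDB's database-wide filesystem with a
// read-cache wrapper exactly once, during manager initialization.
// ReadCacheFileSystem is the wrapper class this PR introduces (name assumed).
static void WrapDatabaseFileSystem(duckdb::DatabaseInstance &instance) {
    auto &config = duckdb::DBConfig::GetConfig(instance);
    // Take ownership of the current filesystem and wrap it, so every read
    // issued through this database goes through the cache layer.
    config.file_system =
        duckdb::make_uniq<ReadCacheFileSystem>(std::move(config.file_system));
}
```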
```cpp
// Get local cache filename for the given [remote_file].
string GetLocalCacheFile(const string &remote_file) {
    const string fname = StringUtil::GetFileName(remote_file);
    return x_mooncake_local_cache + fname;
}
```
Because you are replacing the global filesystem, you will get conflicts between:

```sql
SELECT * FROM mooncake.read_parquet('s3://bucket1/file.parquet');
SELECT * FROM mooncake.read_parquet('s3://bucket2/file.parquet');
```

This was not a problem before for data files, because data filenames are UUIDs.
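Concretely, with the `GetLocalCacheFile()` shown above, both remote paths collapse to the same cache entry (illustrative values, not actual PR output):

```cpp
// Both remote paths strip down to the same base filename, so the second
// query would be served the first query's cached bytes.
GetLocalCacheFile("s3://bucket1/file.parquet"); // x_mooncake_local_cache + "file.parquet"
GetLocalCacheFile("s3://bucket2/file.parquet"); // x_mooncake_local_cache + "file.parquet"
```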
Yeah, I made it so because we suffix with a UUID here:

```cpp
file_name = UUID::ToString(UUID::GenerateRandomUUID()) + ".parquet";
```

I think your comment is valid; I do see we have a plan to import external tables: #61
It's a good point. I added a hash value to the formatted cache filename to ensure uniqueness.
We don't need #61 to reproduce the issue; `mooncake.read_parquet()` already supports reading arbitrary Parquet files.
As shown in my example above, both `s3://bucket1/file.parquet` and `s3://bucket2/file.parquet` will be cached to the same location, `file.parquet`, causing a conflict.
Adding a hash value to the cache filename doesn't resolve this conflict either.
OK, you are adding `Hash(file_path)` instead of `Hash(filename)`. However, the issue remains because `Hash()` uses `MurmurHash64()`, which is not collision-resistant.
It's relatively easy to create collisions, and since the cache directory is shared across sessions, this opens up the possibility of injecting a cache file into a different session, posing a security risk.
To address this issue, we can:
- Use a cryptographically strong hash function like SHA-2
- Use a regular PostgreSQL table to track the mapping between `file_path` and cached filename
That's a good point! I use SHA-256 in this PR.
Alternatives considered:
- SHA-1 is more susceptible to hash collisions; I didn't find a SHA-2 util inside DuckDB, so I use SHA-256 here.
- I also considered sanitizing the remote filename (i.e., replacing `/` with `-`), but I'm afraid the file path could be pretty long.
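A sketch of the SHA-256-based derivation under discussion (the `Sha256Hex` helper and the `.parquet` suffix are assumptions for illustration; the PR uses whatever SHA-256 utility is available in the tree):

```cpp
#include <string>
using std::string;

extern const string x_mooncake_local_cache; // cache dir prefix from the PR

// Hypothetical helper: lowercase hex SHA-256 digest of [input].
string Sha256Hex(const string &input);

// Derive the local cache filename from the full remote path, so
// s3://bucket1/file.parquet and s3://bucket2/file.parquet no longer
// collide, and manufacturing a collision is cryptographically hard.
string GetLocalCacheFile(const string &remote_file) {
    return x_mooncake_local_cache + Sha256Hex(remote_file) + ".parquet";
}
```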
> Use a regular PostgreSQL table to track the mapping between file_path and cached filename

I plan to handle it in the next PR; for example, if the same file has already been cached locally, we simply add a reference count instead of downloading the whole remote object.
I believe SHA-2 is just a family of hash functions that includes SHA-256.
Yes, I think it's better to use a PG table to track the mapping, but we can do it in a separate PR.

> for example, if the same file has already been cached locally, we simply add a reference count instead of download the whole remote object

I'm not sure we can use it for ref counting; when a PG process crashes, it won't clear its ref count.
> I'm not sure if we can use it for ref counting, when PG process crashes, it won't clear its ref count

Yes, I'm thinking of recording the ref count along with a timestamp as well, so we could detect stale ref counts via a reasonable Postgres transaction timeout.
```
@@ -0,0 +1,149 @@
// Columnstore read cache filesystem implements read cache to optimize query performance.
```
I think it's better to create a `CachedHttpfsExtension`, similar to pg_duckdb. However, instead of copying the entire `HttpfsExtension` and making changes to it, I believe we can just copy `HttpfsExtension`'s `LoadInternal()` function, replacing

```cpp
fs.RegisterSubSystem(make_uniq<HTTPFileSystem>());
fs.RegisterSubSystem(make_uniq<HuggingFaceFileSystem>());
fs.RegisterSubSystem(make_uniq<S3FileSystem>(BufferManager::GetBufferManager(instance)));
```

with their cached versions to build a very lightweight extension.
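A sketch of that lightweight extension's load function, mirroring httpfs's `LoadInternal()` (the `Cached*FileSystem` wrapper types are hypothetical names, not existing classes):

```cpp
// Hypothetical CachedHttpfsExtension internals: the same registrations as
// the httpfs extension, but with cache-wrapping sub-filesystems.
static void LoadInternal(duckdb::DatabaseInstance &instance) {
    auto &fs = instance.GetFileSystem();
    fs.RegisterSubSystem(duckdb::make_uniq<CachedHttpFileSystem>());
    fs.RegisterSubSystem(duckdb::make_uniq<CachedHuggingFaceFileSystem>());
    fs.RegisterSubSystem(duckdb::make_uniq<CachedS3FileSystem>(
        duckdb::BufferManager::GetBufferManager(instance)));
}
```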
I think that makes sense; I left a TODO for that.
While checking the extension-related code, one thing caught my eye: `LoadExtensions` is disabled by us.

Disabled by pg_mooncake (pg_mooncake/src/pgduckdb/pgduckdb_duckdb.cpp, line 239 in edf716e):

```cpp
// LoadExtensions(context);
```

Enabled for pg_duckdb:
https://github.com/duckdb/pg_duckdb/blob/cc4a2d6356b02ec773545247a0edcc091c1ac615/src/pgduckdb_duckdb.cpp#L204

Could you please give me more context?
`LoadExtensions()` is used by the `install_extension()` UDF to allow end users to install custom DuckDB extensions: duckdb/pg_duckdb#62
While this is a nice-to-have feature, we didn't enable it since we don't have a strong need for it.
Your case is different, since you want to install some pre-defined extensions. You can simply run `DuckDBQueryOrThrow(context, "LOAD <extension>")` in `DuckDBManager::Initialize()`.
Even better, you don't need to package an extension at all - just invoke the `RegisterSubSystem()` calls directly on the database's filesystem, similar to `LoadInternal()`.
> Even better, you don't need to package an extension at all - just invoke the RegisterSubSystem() calls directly on the database's filesystem, similar to LoadInternal()

That's something I've considered, but I do have one concern: in the current virtual filesystem implementation, the first registered sub-filesystem that fits is returned rather than the best fit (sketched below), which means the filesystem used depends on registration order.
https://github.com/duckdb/duckdb/blob/adc6f607a71b87da2d0a7550e90db623e9bea637/src/common/virtual_file_system.cpp#L182-L196
That's also why they introduced the "mutual set" concept to provide a minimal notion of priority: duckdb/duckdb#10698
That's why, in the latest commit, I directly wrap the filesystem instance with our read-cache wrapper, which I think is easy to implement and understand, and is initialized only once.
Let me know if you think it's OK :)
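Roughly, the linked first-fit lookup behaves like this (a simplified paraphrase of the linked VirtualFileSystem code, not verbatim):

```cpp
// The first registered sub-filesystem whose CanHandleFile() matches wins,
// so dispatch depends on registration order, not on specificity.
duckdb::FileSystem &FindFileSystem(
    const std::vector<duckdb::unique_ptr<duckdb::FileSystem>> &sub_systems,
    duckdb::FileSystem &default_fs, const std::string &path) {
    for (auto &sub_system : sub_systems) {
        if (sub_system->CanHandleFile(path)) {
            return *sub_system;
        }
    }
    return default_fs; // fall back when no sub-filesystem claims the path
}
```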
I believe you need to remove httpfs from DuckDB: https://github.com/Mooncake-Labs/pg_mooncake/blob/main/Makefile#L109
Then, when you register your cached versions of those sub-filesystems, there won't be any conflicts.
Note that pg_duckdb also disallows users from loading the httpfs extension, which would conflict with their own cached_httpfs extension: https://github.com/Mooncake-Labs/pg_mooncake/blob/main/src/pgduckdb/pgduckdb_duckdb.cpp#L330
No need to apologize.
Here is the approach I proposed earlier:
- Create cached versions of `HTTPFileSystem`, `HuggingFaceFileSystem`, and `S3FileSystem`, instead of `VirtualFileSystem`. The advantage is that you only need to override `Read()`, which is far fewer changes. You can probably use a template to do it generically (see the sketch after this list).
- In `DuckDBManager::Initialize()`, call something similar to `HttpfsExtension::Load()`, but instead of `fs.RegisterSubSystem(make_uniq<HTTPFileSystem>());`, register the cached versions you created earlier.
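A minimal sketch of the template idea, assuming DuckDB's `Read()` override point; `TryReadFromLocalCache` and `PopulateLocalCache` are illustrative helper names, not real APIs, and the `path` field access may vary by DuckDB version:

```cpp
// Hypothetical generic wrapper: inherits a concrete filesystem and overrides
// only Read(), consulting a local cache before hitting the remote store.
template <class BaseFS>
class CachedFileSystem : public BaseFS {
public:
    using BaseFS::BaseFS; // reuse base constructors (e.g. S3FileSystem's)

    void Read(duckdb::FileHandle &handle, void *buffer, int64_t nr_bytes,
              duckdb::idx_t location) override {
        // Serve from the local cache when the byte range is already present.
        if (TryReadFromLocalCache(handle.path, buffer, nr_bytes, location)) {
            return;
        }
        BaseFS::Read(handle, buffer, nr_bytes, location); // remote read
        // Store the freshly read bytes for subsequent queries.
        PopulateLocalCache(handle.path, buffer, nr_bytes, location);
    }
};

// The cached sub-filesystems then come for free, e.g.:
using CachedHttpFileSystem = CachedFileSystem<duckdb::HTTPFileSystem>;
```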
Also, I just noticed that DuckDB's httpfs has a flag:

```cpp
config.AddExtensionOption("force_download", "Forces upfront download of file", LogicalType::BOOLEAN, Value(false));
```

It enables caching of remote files in memory for the current process.
I think this is good enough for us for now:
- A non-serverless deployment like Docker is already fine with the write cache.
- For a serverless deployment like Neon, I doubt you can reuse the cache across different connections anyway. Also, they are building an S3 cache themselves in Q2.
In principle, I'm more inclined to move the cache responsibility to infra, not us. It's challenging to build a good read cache with Postgres' process model, and it seems to be reinventing the wheel: nginx-s3-gateway
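If we go that route, enabling the flag could be as simple as issuing the setting at manager initialization (a sketch; `DuckDBQueryOrThrow` is the helper mentioned earlier, and the exact call site is an assumption):

```cpp
// Hypothetical: turn on httpfs's upfront-download behavior once, so remote
// files are fetched whole and cached in memory for this process.
DuckDBQueryOrThrow(context, "SET force_download = true;");
```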
> Here is the approach I proposed earlier

I considered that; there are two concerns:
- If we add more filesystem instances in the future (likely via other extensions), we need to register more cached fs instances into the VFS.
- These filesystem classes live in DuckDB's third_party or extension folders, which makes them harder to integrate into mooncake's dependencies via the Makefile. I made it work, but it requires extra build targets like (lines 73 to 75 in ea4aced):

```make
thrift/transport/TBufferTransports.cpp.o: ../../third_party/duckdb/third_party/thrift/thrift/transport/TBufferTransports.cpp
	@mkdir -p $(dir $@)
	$(CXX) $(CPPFLAGS) $(CXXFLAGS) -c $< -o $@
```
> Also, I just noticed that DuckDB's httpfs has a flag

Thank you, good find! I completely agree it's better if caching can be delegated to the filesystem instance.
Good point! Both of your concerns are very valid, so I agree that my proposal isn't better.
This PR implements the initial version of the read cache mentioned in the RFC here.
Key included items:
- Wrap `ClientContext` with the read-cache fs wrapper

There are a few TODOs left unimplemented to control the scope and size of the current PR; the most important ones:
How I tested: