External materialized model with per_thread_output parameter does not work #492

Comments
@elephantmetropolis thanks for the report here, this is a fun one -- taking a look
Quick update on this issue regarding `per_thread_output`. My first (DuckDB-uneducated) assumption was that, due to some DuckDB optimization, the query planner decided to use only one thread to perform the copy. I decided to test with a very simple model:

```sql
{{
  config(
    materialized='external',
    format='parquet',
    options={
      'per_thread_output': true
    }
  )
}}

select *
from {{ ref('my_5GB_table') }} -- 'my_5GB_table' is materialized as a table
```

The behaviour is still the same: it writes one 5 GB file to S3 in 2 min 30 s. Now, if I connect to the DuckDB database (the same one used by dbt) and run:

```sql
copy (select * from my_5GB_table)
to 's3://my_bucket/my_path'
(format parquet, per_thread_output);
```

it writes 40 files to S3 in ~3 seconds. I've checked everything I could in the implementation of the `external` materialization.
Oh I've been there before -- the most likely explanation is that the Python version of DuckDB you're using with dbt isn't the same as the CLI version you're running locally here (i.e. the Python package is on an older version of DuckDB that doesn't support this feature yet). The other possibility is that the version of dbt-duckdb you're using doesn't have the fix you included, since I don't think we've cut a release since you merged it -- are you running off of your own build of dbt-duckdb, or did you do something like …?
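For example, a quick way to compare the two environments (a minimal check; run it once through the Python package and once in the CLI):

```sql
-- Returns the DuckDB library version the connection is built against:
select version();
```

If the two results differ, that would explain the behavior gap.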
Hello, I am running into an issue when using the `external` materialization with the `per_thread_output` option. This option is supposed to create a number of files based on the number of threads used by DuckDB.
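For context, this is the underlying DuckDB feature; a minimal sketch (bucket and paths are illustrative, and writing to S3 assumes the `httpfs` extension and credentials are set up):

```sql
-- per_thread_output turns the COPY target into a directory and
-- lets each thread write its own file, e.g. (names illustrative):
--   s3://my_bucket/my_path/data_0.parquet
--   s3://my_bucket/my_path/data_1.parquet
--   ...
copy (select * from my_table)
to 's3://my_bucket/my_path'
(format parquet, per_thread_output);
```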
When this parameter is passed along with the `external` materialization, I receive an error. The funny thing is that the files are actually written to S3 in the right place, and it is the current model that crashes, not the downstream model. For example, if I have `model_b` depending on `model_a`, it is `model_a` itself that fails. After some research I found that part of the code:
dbt-duckdb/dbt/adapters/duckdb/impl.py
Line 173 in 6b539a6
From what I understand, only models created with the `partition_by` parameter can use a file glob, and I suppose that's the problem here (see the read-back sketch below). Is this assumption correct? If yes, is it possible to add support for the `per_thread_output` parameter, or is it better to manage this issue in a more "parameter agnostic" way? I'm not an expert, but I would be happy to contribute in any way I can.
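For context on the glob point, a multi-file output has to be read back with a pattern rather than an exact path; a sketch (paths illustrative, not the adapter's actual code):

```sql
-- A single-file external location reads back by its exact path;
-- a directory of per-thread files needs a glob:
select * from read_parquet('s3://my_bucket/my_path/*.parquet');
```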
Edit: I tried to modify impl.py and it seems to work on my side: #493
Edit 2: After more investigation, it seems the `per_thread_output` parameter is not working properly for external materializations. I tested writing to S3 using a `post-hook` and a macro, and that approach works -- a sketch of the idea (macro name and paths are illustrative, not my exact code):
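```sql
-- macros/copy_per_thread.sql (illustrative name): runs a raw COPY
-- with per_thread_output against the freshly built relation.
{% macro copy_per_thread(relation, s3_path) %}
copy (select * from {{ relation }})
to '{{ s3_path }}'
(format parquet, per_thread_output)
{% endmacro %}
```

Used from a plain table model:

```sql
{{ config(
    materialized='table',
    post_hook="{{ copy_per_thread(this, 's3://my_bucket/my_path') }}"
) }}

select * from {{ ref('my_5GB_table') }}
```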
But using `per_thread_output` directly in `{{ config() }}` ends up with DuckDB writing one big file to S3. For reference, I set DuckDB to use 30 threads and 300 GB of RAM.
However, the following kind of code doesn't work properly: only one file is created, but the file is created inside a subfolder, which makes sense given the expected `PER_THREAD_OUTPUT` behavior:
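Something along these lines (illustrative, not the verbatim model; `location` points at the S3 prefix):

```sql
{{ config(
    materialized='external',
    location='s3://my_bucket/my_path',
    format='parquet',
    options={'per_thread_output': true}
) }}

select * from {{ ref('my_5GB_table') }}
```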
It seems there are two issues here: one is `per_thread_output` not being taken into account with the file glob, and the second is the `per_thread_output` parameter behaving weirdly (only using one thread?).
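On the "only one thread?" question, a quick way to confirm what the connection is actually configured with (a small diagnostic, assuming access to the same database):

```sql
-- Returns the number of threads DuckDB will use for this connection:
select current_setting('threads');
```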