Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make MAXIMUM_SEED_SIZE_MIB configurable #7125

Open
wants to merge 10 commits into
base: main
Choose a base branch
from
6 changes: 6 additions & 0 deletions .changes/unreleased/Features-20230307-134838.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
kind: Features
body: Make MAXIMUM_SEED_SIZE configurable
time: 2023-03-07T13:48:38.792321024Z
custom:
Author: noppaz acurtis-evi
Issue: 7117 7124
13 changes: 11 additions & 2 deletions core/dbt/constants.py
Original file line number Diff line number Diff line change
@@ -1,9 +1,18 @@
import os

SECRET_ENV_PREFIX = "DBT_ENV_SECRET_"
DEFAULT_ENV_PLACEHOLDER = "DBT_DEFAULT_PLACEHOLDER"
METADATA_ENV_PREFIX = "DBT_ENV_CUSTOM_ENV_"

MAXIMUM_SEED_SIZE = 1 * 1024 * 1024
MAXIMUM_SEED_SIZE_NAME = "1MB"

def get_max_seed_size():
mx = os.getenv("DBT_MAXIMUM_SEED_SIZE", "1")
Copy link
Contributor

@jtcohen6 jtcohen6 Mar 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer that this config's name include the unit: DBT_MAXIMUM_SEED_SIZE_MIB

Rather than accessing an env var directly here, this should really be implemented as a "global config" (available in our Flags module). I don't think it would make very much sense to pass as a CLI flag (--maximum-seed-size), but I could very much see someone wanting to set this in "user config":

# profiles.yml -- for now
config:
  maximum_seed_size_mb: 5

That means:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have addressed this in fa836bb. It required some additional moving of the usage of the global flag as it is not a constant anymore and the constants.py file is referenced much earlier in the import cycle.

return int(mx)


DEFAULT_MAXIMUM_SEED_SIZE = 1 * 1024 * 1024
MAXIMUM_SEED_SIZE = get_max_seed_size() * DEFAULT_MAXIMUM_SEED_SIZE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need both of these constants defined. MAXIMUM_SEED_SIZE should just be get_max_seed_size() * 1024 * 1024. The get_max_seed_size() function will return the default value of 1 if the user does not set/override.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for checking @jtcohen6!

One reason for keeping DEFAULT_MAXIMUM_SEED_SIZE would be to be able to keep the logic of load_seed_source_file() in read_files.py. Since we do not convert the file from bytes in the new incremental logic we opted to continue computing the checksum with the "old" method for files smaller than 1 MiB so that we'd ensure backward compatibility with current manifests. This is since the checksum of a file encoded as anything other than UTF-8 will be different depending on if it gets UTF-8 encoded or is read as raw bytes.

The reason for not converting from bytes to UTF-8 string is performance and somewhat cleaner code but if we want to remove DEFAULT_MAXIMUM_SEED_SIZE maybe we should remove the elif and only compute the seed file hash incrementally instead. I see two options then:

  1. Read the file as string and ensure UTF-8 encoding before hashing
  2. Continue reading the raw bytes and accept that some projects with seed files that are not utf-8 might have a diff in their manifest deferral when upgrading to this version of dbt-core.

What's your thoughts on this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've implemented this suggestion with removal of DEFAULT_MAXIMUM_SEED_SIZE and went with alternative 1 above. The changes can be seen in 53021d9.

MAXIMUM_SEED_SIZE_NAME = str(get_max_seed_size()) + "MiB"

PIN_PACKAGE_URL = (
"https://docs.getdbt.com/docs/package-management#section-specifying-package-versions"
Expand Down
22 changes: 18 additions & 4 deletions core/dbt/contracts/files.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,9 @@
from mashumaro.types import SerializableType
from typing import List, Optional, Union, Dict, Any

from dbt.constants import MAXIMUM_SEED_SIZE

from dbt.clients.system import convert_path

from dbt.dataclass_schema import dbtClassMixin, StrEnum

from .util import SourceKey
Expand Down Expand Up @@ -63,9 +65,8 @@ def absolute_path(self) -> str:
def original_file_path(self) -> str:
return os.path.join(self.searched_path, self.relative_path)

def seed_too_large(self) -> bool:
"""Return whether the file this represents is over the seed size limit"""
return os.stat(self.full_path).st_size > MAXIMUM_SEED_SIZE
def file_size(self) -> int:
return os.stat(self.full_path).st_size


@dataclass
Expand Down Expand Up @@ -107,6 +108,19 @@ def from_contents(cls, contents: str, name="sha256") -> "FileHash":
checksum = hashlib.new(name, data).hexdigest()
return cls(name=name, checksum=checksum)

@classmethod
def from_path(cls, path: str, name="sha256") -> "FileHash":
"""Create a file hash from the file at given path."""
path = convert_path(path)
chunk_size = 1 * 1024 * 1024
file_hash = hashlib.new(name)
with open(path, "rb") as handle:
chunk = handle.read(chunk_size)
while chunk:
file_hash.update(chunk)
chunk = handle.read(chunk_size)
return cls(name=name, checksum=file_hash.hexdigest())


@dataclass
class RemoteFile(dbtClassMixin):
Expand Down
14 changes: 12 additions & 2 deletions core/dbt/parser/read_files.py
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,8 @@ class FileDiff(dbtClassMixin):
changed: List[InputFile]
added: List[InputFile]

from dbt.constants import MAXIMUM_SEED_SIZE, DEFAULT_MAXIMUM_SEED_SIZE


# This loads the files contents and creates the SourceFile object
def load_source_file(
Expand Down Expand Up @@ -114,14 +116,22 @@ def validate_yaml(file_path, dct):

# Special processing for big seed files
def load_seed_source_file(match: FilePath, project_name) -> SourceFile:
if match.seed_too_large():
# MAXIMUM_SEED_SIZE = 0 means no limit
if match.file_size() > MAXIMUM_SEED_SIZE and MAXIMUM_SEED_SIZE != 0:
# We don't want to calculate a hash of this file. Use the path.
source_file = SourceFile.big_seed(match)
else:
elif match.file_size() <= DEFAULT_MAXIMUM_SEED_SIZE:
# This is here because the original seed calculation used utf8
# and the FileHash.from_path does not. This will leave hashes
# unchanged for previous installations.
file_contents = load_file_contents(match.absolute_path, strip=True)
checksum = FileHash.from_contents(file_contents)
source_file = SourceFile(path=match, checksum=checksum)
source_file.contents = ""
else:
checksum = FileHash.from_path(match.absolute_path)
source_file = SourceFile(path=match, checksum=checksum)
source_file.contents = ""
source_file.parse_file_type = ParseFileType.Seed
source_file.project_name = project_name
return source_file
Expand Down