-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add current_date and current_timestamp functions #548
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some questions for clarification.
extensions/functions_datetime.yaml
Outdated
The `mode` option determines how often the date is evaluated. STREAMING mode | ||
evaluates the current date for each row. BATCH mode evaluates the current date | ||
once when the query starts and uses the same result for every row. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the mode is BATCH
, and the function is used multiple times within a single query, should it have the same value for both invocations? In other words, imagine the query is:
SELECT CURRENT_DATE():mode=BATCH AS a, CURRENT_DATE():mode=BATCH AS b FROM ...
From the description I think that every row of a
should have the same value. Also, every row of b
should have the same value. However, must the values in a
be identical to the values in b
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my understanding, a
would not always be identical to b
. Each CURRENT_DATE()
will get evaluated once separately and then those will be used...so if there is a date/time change between each evaluation they'll be different.
Do you have a suggestion for how I should update the description?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my confusion is around the words "when the query starts". I interpret that to be the same point in time for all function calls within a query. So, if we don't want to guarantee that, then I'd suggest removing those words.
BATCH mode evaluates the current date one and uses the same result for every row.
Note that many current engines do guarantee that a
is always equal to b
and they specifically use the "when the query starts" wording to do so.
Postgres:
It is important to know that CURRENT_TIMESTAMP and related functions return the start time of the current transaction; their values do not change during the transaction. This is considered a feature: the intent is to allow a single transaction to have a consistent notion of the "current" time, so that multiple modifications within the same transaction bear the same time stamp.
Presto:
Returns the current timestamp as of the start of the query.
Spark:
Returns the current timestamp at the start of query evaluation as a TimestampType column. All calls of current_timestamp within the same query return the same value.
Note that, if the query is distributed, then this becomes a rather tricky thing to guarantee.
My preference would probably be to state that we do guarantee it to be the same:
BATCH mode evaluates the current date once and uses the same result for every call to this function in the query.
If an engine is distributed then either it has to solve this problem somehow (e.g. by evaluating once at the start of the query and sending that down to the workers) or reject BATCH mode.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the details from the other backends! I was mainly basing my thought around this (probably very outdated) example on sql server: https://stackoverflow.com/questions/4056355/selecting-getdate-function-twice-in-a-select-list-same-value-for-both
I'm good with guaranteeing the same results though. Just updated!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@westonpace let me know if the updated descriptions are good with you
7b62b1a
to
b2117d2
Compare
- | ||
name: "current_date" | ||
description: >- | ||
Return the current date. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it obvious that this intended to be in UTC?
Unlike timestamp, the date doesn't have enough information to do time zone conversion, so a parameter or separate function would be needed to be able to support both.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The options I see are using the locale the server was set up with and UTC. So perhaps we need an option. That could also apply to current_timestamp (either to return a timestamp with the timezone of current or UTC).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively if all of the engines just return UTC then we should require that.
I made some comments in the last sync. Basically, I don't think the "batch" mode of these is actually a function. It would be what I would call a contextual function. The function needs to evaluate and is determinant given the same context values. Context values are things like the time when the query ran, the timezone of the user, the currently selected database, etc. These context functions could be a new unique expression type or could be an extended version of what we talked before for variables in prepared statements (since they are basically pointers to external variables). For the "STREAM" case, I think this is fine. However, I think we should probably use a different name to avoid confusion from people coming from the SQL world, where there is always the "BATCH" behavior. |
PR to add current_date and current_timestamp functions'