Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add missing functions and docs for LazyFrame #49

Open
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

etiennebacher
Copy link
Contributor

Same changes as in #31, just without the tests

src/rust/src/conversion/mod.rs Outdated Show resolved Hide resolved
src/rust/src/conversion/mod.rs Outdated Show resolved Hide resolved
@eitsupi
Copy link
Owner

eitsupi commented Dec 14, 2024

Thanks, but just to be clear, don't expect to see this huge change merged immediately as it is very difficult to review.

R/lazyframe-frame.R Outdated Show resolved Hide resolved
R/lazyframe-frame.R Outdated Show resolved Hide resolved
R/lazyframe-frame.R Outdated Show resolved Hide resolved
R/lazyframe-frame.R Outdated Show resolved Hide resolved
#'
#' df$clear(n = 5)
lazyframe__clear <- function(n = 0) {
pl$DataFrame(schema = self$schema)$clear(n)$lazy()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct???

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It needs a small adjustment since we don't have schema but .schema_overrides in pl$DataFrame() but that's the idea, yes:

https://github.com/pola-rs/polars/blob/91ad299b17c48422131c584f63acedc758581100/py-polars/polars/lazyframe/frame.py#L3238

(Note that this doesn't work because $clear() is not implemented for DataFrame yet)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 4 to 12
#[savvy]
fn deserialize_lf(json: &str) -> Result<PlRLazyFrame> {
let lp = serde_json::from_str::<DslPlan>(json).map_err(|_| {
let msg = "could not deserialize input into a LazyFrame";
RPolarsErr::Other(msg.to_string())
})?;
let out = LazyFrame::from(lp);
Ok(<PlRLazyFrame>::from(out))
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a clone of Python Polars' ones, is it?
The function should be the same as Python Polars.

Of course, it must be an associated function.

Copy link
Contributor Author

@etiennebacher etiennebacher Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No I think I had copied this from r-polars actually, so it should be modified/removed.

Deserialization is a bit tricky in terms of API and I think that's why we have deviated with this deserialize_lf() in r-polars. In py-polars they have:

pl.DataFrame.deserialize(...)
pl.LazyFrame.deserialize(...)

and I'm not sure neo-r-polars support something like this since pl$DataFrame just returns the body of the function. I wouldn't mind removing this serde file for now, what do you think?

Copy link
Owner

@eitsupi eitsupi Dec 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since pl$DataFrame is a function, something like pl$DataFrame$foo cannot be realized in R.

However, this is a different thing: whether the function on the R side is something like deserialize_lf, an associated function can be used on the Rust side (extendr binds the associated functiona to the R side in a strange way, but with savvy, it is not a problem).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand and I think this might be tricky to implement. I have removed this function for now: it's not essential and I don't want it to block the rest of the PR. I added a todo and we can address this later in a more focused PR.

@etiennebacher
Copy link
Contributor Author

@eitsupi just checking in to know if you were able to review parts of this PR already. I know this one is big, I could maybe split it in 2 or 3 if it helps.

I'll also try to fix the wasm build errors. Do you know if there's a way to mock this build locally or if it can only be done in CI?

@eitsupi
Copy link
Owner

eitsupi commented Jan 16, 2025

Sorry for the late review. I have not read through most of the individual methods.
I'm pretty stuck on the API, especially StatisticsOptions, and I think it would be more natural for R to have each element as a separate argument rather than specifying it as a list.

I'll also try to fix the wasm build errors. Do you know if there's a way to mock this build locally or if it can only be done in CI?

This would be relatively easy to run locally as it builds on Docker, but I have not tried it.

https://github.com/r-wasm/actions/tree/main/build-rwasm

@etiennebacher
Copy link
Contributor Author

etiennebacher commented Jan 20, 2025

I'm pretty stuck on the API, especially StatisticsOptions, and I think it would be more natural for R to have each element as a separate argument rather than specifying it as a list.

I'm not convinced about this for several reasons. First, it increases the number of arguments to browse in the docs, and maybe more statistics will be added in the future so it would be annoying to have 5-6 arguments only about statistics. Second, we use lists in other params as well (e.g. storage_options in sink_parquet()) so I don't see why we would diverge here.

Other packages like readr have functions like locale() which are not very useful on their own but can be passed to read_csv() for instance. I think it would be a good compromise: it allows to have proper named arguments while not adding extra args to sink_parquet(). What do you think about this?

Regarding the rust side, I'm also not a fan of the current impl TryFrom<ListSexp> for Wrap<StatisticsOptions>. Maybe we could internally split the statistics arguments to individual arguments in R and then pass each of those in Rust? That would depart a bit from the py-polars implementation though.

@etiennebacher
Copy link
Contributor Author

@eitsupi Thoughts on the previous comment?

@eitsupi
Copy link
Owner

eitsupi commented Jan 23, 2025

Sorry, I forgot to comment.

Your opinion makes sense, and I think it's a good idea to pass the elements to Rust that decompose the list on the R side.

@etiennebacher
Copy link
Contributor Author

I've added a parquet_statistics() function which is essentially a list() but allows us to ensure there are no misnamed statistics. I'd like to keep the TRUE/FALSE shortcut to include or exclude all stats at once.

We can discuss this further but if so I'd rather put a TODO on this and address this in a subsequent PR so we can focus on the other methods in this PR. Once this one is merged I can move on with the tests and the DataFrame methods (which I will do in smaller pieces to avoid massive PRs like this one).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants