Why are active bindings and methods redefined in wrap()? #3

Open
etiennebacher opened this issue Sep 1, 2024 · 14 comments

@etiennebacher
Contributor

It's my first exploration of neo-polars so far so I'm just putting some remarks here.

I'm surprised by the block of makeActiveBinding and the lapply() call below:

wrap.PlRDataFrame <- function(x, ...) {
  self <- new.env(parent = emptyenv())
  self$`_df` <- x

  # TODO: flags
  makeActiveBinding("columns", function() self$`_df`$columns(), self)
  makeActiveBinding("dtypes", function() {
    self$`_df`$dtypes() |>
      lapply(\(x) .savvy_wrap_PlRDataType(x) |> wrap())
  }, self)
  makeActiveBinding("schema", function() structure(self$dtypes, names = self$columns), self)
  makeActiveBinding("shape", function() self$`_df`$shape(), self)
  makeActiveBinding("height", function() self$`_df`$height(), self)
  makeActiveBinding("width", function() self$`_df`$width(), self)

  lapply(names(polars_dataframe__methods), function(name) {
    fn <- polars_dataframe__methods[[name]]
    environment(fn) <- environment()
    assign(name, fn, envir = self)
  })

  class(self) <- c("polars_data_frame", "polars_object")
  self
}

Correct me if I'm wrong, but wrap() is called in every single function, so this code basically stores the active bindings and the list of methods multiple times if we chain several functions together. This seems very wasteful and potentially quite a perf hit. The current way we do this in r-polars is to store functions in specific environments only once when we load the package. Why isn't this possible here? I suppose you thought about that and found some issues implementing it?

(I don't mean this as a criticism, I know this is a massive undertaking, just wondering if you can explain so that I can maybe help with that)
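To make the concern concrete, here is a minimal base-R sketch of the same pattern. Everything in it (`toy_methods`, `toy_wrap()`, the `_data` field, the `toy_object` class) is a hypothetical stand-in and not neo-polars code; it only assumes that methods find `self` through their enclosing environment, as in the snippet above.

```r
# Hypothetical stand-in for polars_dataframe__methods: a named list of
# functions that refer to `self`, which is resolved through the function's
# enclosing environment rather than passed as an argument.
toy_methods <- list(
  head_n = function(n) toy_wrap(utils::head(self$`_data`, n)),
  n_rows = function() length(self$`_data`)
)

toy_wrap <- function(x) {
  self <- new.env(parent = emptyenv())
  self$`_data` <- x
  # Every call re-binds every method, so the work done here grows with
  # length(toy_methods), and it is repeated at each step of a chain.
  lapply(names(toy_methods), function(name) {
    fn <- toy_methods[[name]]
    environment(fn) <- environment()
    assign(name, fn, envir = self)
  })
  class(self) <- "toy_object"
  self
}

obj <- toy_wrap(1:10)
res <- obj$head_n(5)$head_n(3)  # three objects built, methods bound three times
res$n_rows()
# Each object carries its own copies of the closures:
identical(environment(obj$head_n), environment(res$head_n))
```

With two methods the cost is negligible; the open question in this issue is what happens once the method list has hundreds of entries.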

@etiennebacher
Contributor Author

Regarding the lapply() call, could it be replaced by cloning the environment where dataframe functions are stored at the beginning of the function (instead of creating an empty one and filling it by hand)?

@eitsupi
Owner

eitsupi commented Sep 1, 2024

I understand the concern and have noted it in the README.

neo-r-polars/README.md

Lines 216 to 222 in 130ea55

### Disadvantages
Due to the changes in the R class structure, the methods are now
dynamically added by a loop each time an R class is built. So I’m
worried that the performance will degrade after a large number of
methods are available. However, it is difficult to compare this at the
moment.

The current way we do this in r-polars is to store functions in specific environments only once when we load the package. Why isn't this possible here?

Because that differs from the way Python (and R6, etc.) stores methods, and it prevents R Polars from copying the behavior of Python Polars.

Without such a mechanism, it is not possible to implement features such as:
https://docs.pola.rs/api/python/stable/reference/api.html

@eitsupi
Owner

eitsupi commented Sep 1, 2024

Regarding the lapply() call, could it be replaced by cloning the environment where dataframe functions are stored at the beginning of the function (instead of creating an empty one and filling it by hand)?

It certainly seems possible, but copying the environment doesn't seem easy and I don't know how much of a performance advantage it would be.

If I understand correctly, what I am doing here is the same thing that R6 and savvy do, and I think that if it caused performance issues, they would not have adopted this approach.

@eitsupi
Owner

eitsupi commented Sep 1, 2024

One place where this approach is useful is the following code, where the DataType properties are registered as bindings.

# Bindings mimic attributes of DataType classes of Python Polars
env_bind(self, !!!x$`_get_datatype_fields`())

## _inner is a pointer now, so it should be wrapped
if (exists("_inner", envir = self)) {
  makeActiveBinding("inner", function() {
    .savvy_wrap_PlRDataType(self$`_inner`) |>
      wrap()
  }, self)
}

## _fields is a list of pointers now, so they should be wrapped
if (exists("_fields", envir = self)) {
  makeActiveBinding("fields", function() {
    lapply(self$`_fields`, function(x) {
      .savvy_wrap_PlRDataType(x) |>
        wrap()
    })
  }, self)
}

fn _get_datatype_fields(&self) -> Result<Sexp> {
    match &self.dt {
        DataType::Decimal(precision, scale) => {
            let mut out = OwnedListSexp::new(2, true)?;
            let precision: Sexp =
                precision.map_or_else(|| NullSexp.into(), |v| (v as f64).try_into())?;
            let scale: Sexp =
                scale.map_or_else(|| NullSexp.into(), |v| (v as f64).try_into())?;
            let _ = out.set_name_and_value(0, "precision", precision);
            let _ = out.set_name_and_value(1, "scale", scale);
            Ok(out.into())
        }
        DataType::Datetime(time_unit, time_zone) => {
            let mut out = OwnedListSexp::new(2, true)?;
            let time_unit: Sexp = format!("{time_unit}").try_into()?;
            let time_zone: Sexp = time_zone
                .as_ref()
                .map_or_else(|| NullSexp.into(), |v| v.to_owned().try_into())?;
            let _ = out.set_name_and_value(0, "time_unit", time_unit);
            let _ = out.set_name_and_value(1, "time_zone", time_zone);
            Ok(out.into())
        }
        DataType::Duration(time_unit) => {
            let mut out = OwnedListSexp::new(1, true)?;
            let time_unit: Sexp = format!("{time_unit}").try_into()?;
            let _ = out.set_name_and_value(0, "time_unit", time_unit);
            Ok(out.into())
        }
        DataType::Array(inner, width) => {
            let mut out = OwnedListSexp::new(2, true)?;
            let inner: Sexp = PlRDataType { dt: *inner.clone() }.try_into()?;
            let width: Sexp = (*width as f64).try_into()?;
            let _ = out.set_name_and_value(0, "_inner", inner);
            let _ = out.set_name_and_value(1, "width", width);
            Ok(out.into())
        }
        DataType::List(inner) => {
            let mut out = OwnedListSexp::new(1, true)?;
            let inner: Sexp = PlRDataType { dt: *inner.clone() }.try_into()?;
            let _ = out.set_name_and_value(0, "_inner", inner);
            Ok(out.into())
        }
        DataType::Struct(fields) => {
            let mut out = OwnedListSexp::new(1, true)?;
            let mut list = OwnedListSexp::new(fields.len(), true)?;
            for (i, field) in fields.iter().enumerate() {
                let name = field.name().as_str();
                let value: Sexp = PlRDataType {
                    dt: field.data_type().clone(),
                }
                .try_into()?;
                let _ = list.set_name_and_value(i, name, value);
            }
            let _ = out.set_name_and_value(0, "_fields", list);
            Ok(out.into())
        }
        DataType::Categorical(_, ordering) => {
            let mut out = OwnedListSexp::new(1, true)?;
            let ordering: Sexp = <String>::from(Wrap(ordering)).try_into()?;
            let _ = out.set_name_and_value(0, "ordering", ordering);
            Ok(out.into())
        }
        DataType::Enum(categories, _) => {
            let mut out = OwnedListSexp::new(1, true)?;
            let categories: Sexp = categories.as_ref().map_or_else(
                || NullSexp.into(),
                |v| {
                    v.get_categories()
                        .into_iter()
                        .map(|v| v.unwrap_or_default().to_string())
                        .collect::<Vec<_>>()
                        .try_into()
                },
            )?;
            let _ = out.set_name_and_value(0, "categories", categories);
            Ok(out.into())
        }
        _ => Ok(NullSexp.into()),
    }
}
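The active-binding part of this pattern can be tried in isolation with base R alone. In the sketch below, `dtype` and the string stored in `_inner` are hypothetical stand-ins for a wrapped DataType and its raw pointer, and `paste0()` stands in for the `.savvy_wrap_PlRDataType() |> wrap()` step; only `makeActiveBinding()` itself is the real base-R function.

```r
# Hypothetical stand-in for a wrapped DataType object.
dtype <- new.env(parent = emptyenv())
dtype$`_inner` <- "Int64"  # stands in for a raw pointer to the inner dtype

# Counter to show that the binding function runs on every access.
calls <- 0L

# `inner` is computed on demand, so it always reflects the current `_inner`.
makeActiveBinding("inner", function() {
  calls <<- calls + 1L
  paste0("wrapped<", dtype$`_inner`, ">")  # stands in for wrapping the pointer
}, dtype)

first <- dtype$inner
dtype$`_inner` <- "String"
second <- dtype$inner
```

Because the binding is re-evaluated on each access, `first` and `second` differ even though `inner` was never reassigned. The trade-off is that the wrapping cost is paid every time the property is read.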

@etiennebacher
Contributor Author

It certainly seems possible, but copying the environment doesn't seem easy and I don't know how much of a performance advantage it would be.

I don't know about the perf but it would probably be possible (and cleaner) with rlang::env_clone(). I think this would be faster than going through a large list of functions.

@eitsupi
Owner

eitsupi commented Sep 2, 2024

If it improves performance, then copying env seems like a good way to go, but I am not sure if it will work properly because of the processing we are doing to update self as shown below.

lapply(names(polars_series__methods), function(name) {
  fn <- polars_series__methods[[name]]
  environment(fn) <- environment()
  assign(name, fn, envir = self)
})

If overwriting self is not possible, it may be necessary to wrap each function in a function with only self as an argument and then bind it to the environment, as the savvy CLI does.
This would require additional code and could increase complexity.

neo-r-polars/R/000-wrappers.R

Lines 1026 to 1030 in ddb0a2b

`PlRExpr_struct_with_fields` <- function(self) {
  function(`fields`) {
    .savvy_wrap_PlRExpr(.Call(savvy_PlRExpr_struct_with_fields__impl, `self`, `fields`))
  }
}

e$`struct_with_fields` <- `PlRExpr_struct_with_fields`(ptr)

@etiennebacher
Contributor Author

Sorry, I won't have enough time in the next few weeks to think about this in detail. Let's keep going like this for now, but I'd like to revisit this before we make the switch in polars (which probably won't be very soon anyway, I suppose).

@eitsupi
Owner

eitsupi commented Sep 2, 2024

Yes, of course it would have to be after implementing a large number of methods to make a clear difference in performance.
So it would be better to do it later.

@yutannihilation

If you want to avoid copying, extendr's way might be preferable. It prepares an environment once and extracts the method from it on access. There's no strong reason I didn't follow this approach in savvy. I just felt that using an environment directly works well with tab completion.

https://github.com/extendr/extendr/blob/39ac2e8ce8fda7c86df5bd69be37190f49e9cf48/tests/extendrtests/R/extendr-wrappers.R#L186

(If I were implementing this, I would store the operations on the R side and pass them to the Rust side lazily, though.)
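A rough sketch of that idea in base R: methods are stored once in a shared environment and a `$` S3 method looks them up on access. This is only loosely modeled on the extendr wrapper linked above, and all names (`Toy`, `new_toy()`, the `toy_shared` class) are hypothetical.

```r
# Methods are stored once, when the code is loaded. Each takes `self` as its
# first argument instead of finding it through its enclosing environment.
Toy <- new.env(parent = emptyenv())
Toy$head_n <- function(self, n) new_toy(utils::head(.subset2(self, "data"), n))
Toy$n_rows <- function(self) length(.subset2(self, "data"))

# Constructing an object no longer touches the method list at all.
new_toy <- function(x) structure(list(data = x), class = "toy_shared")

# `$` looks the method up in the shared environment and binds `self` lazily.
`$.toy_shared` <- function(self, name) {
  fn <- get(name, envir = Toy, inherits = FALSE)
  function(...) fn(self, ...)
}

n <- new_toy(1:10)$head_n(5)$n_rows()
```

Here the per-object cost does not depend on the number of methods, but every `$` access goes through S3 dispatch and allocates a small closure. Tab completion would presumably need a separate `.DollarNames` method, which is the convenience that binding methods directly into the environment provides for free.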

@eitsupi
Owner

eitsupi commented Sep 6, 2024

Thank you.
A quick experiment shows that even if we copy the parent environment, the function is still tied to the original environment, so I guess this approach still won't work in a custom environment like R6.

at <- arrow::as_arrow_table(mtcars)
at$AddColumn
#> function (i, new_field, value)
#> Table__AddColumn(self, i, new_field, value)
#> <environment: 0x55e07fc9dc10>
at2 <- rlang::env_clone(at)
at2$AddColumn
#> function (i, new_field, value)
#> Table__AddColumn(self, i, new_field, value)
#> <environment: 0x55e07fc9dc10>
at2
#> <environment: 0x55e07fdbac68>

Created on 2024-09-06 with reprex v2.1.1
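The same behavior can be reproduced without arrow or rlang. The sketch below uses a hypothetical `make_obj()` constructor, and a shallow copy made with `list2env()` as a stand-in for `rlang::env_clone()`; it is meant to show the consequence, not to mirror either package's internals.

```r
make_obj <- function(value) {
  self <- new.env(parent = emptyenv())
  self$value <- value
  # `get_value` is a closure over this call frame, which is where `self` lives.
  self$get_value <- function() self$value
  self
}

original <- make_obj(1)

# Shallow copy of the bindings (stand-in for rlang::env_clone()).
clone <- list2env(
  as.list.environment(original, all.names = TRUE),
  parent = parent.env(original)
)
clone$value <- 2

# The copied method still sees the `self` of the original object.
from_clone <- clone$get_value()
same_closure_env <- identical(
  environment(clone$get_value),
  environment(original$get_value)
)
```

`from_clone` is 1 rather than 2, because copying the bindings does not rewrite the environment each function was created in. A clone-based `wrap()` would therefore still have to reset `environment(fn)` for every method, which is the loop it was supposed to replace.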

In any case, I don't think anyone is concerned about R6 performance in the arrow package, so I don't know if this is really a performance issue.

@eitsupi
Owner

eitsupi commented Sep 9, 2024

A simple experiment: comparing the process with and without going through wrap() multiple times shows that there appears to be little penalty for wrap().

bench::mark(
  without_wrap = {
    neopolars:::PlRSeries$new_i32("", 1:10^5)$cast(neopolars::pl$Int64$`_dt`, TRUE)$cast(neopolars::pl$String$`_dt`, TRUE)
  },
  with_wrap = {
    neopolars::as_polars_series(1:10^5)$cast(neopolars::pl$Int64, TRUE)$cast(neopolars::pl$String, TRUE)
  },
  check = FALSE,
  min_iterations = 30
)
#> # A tibble: 2 × 6
#>   expression        min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 without_wrap   48.4ms   51.9ms      18.8    2.08MB        0
#> 2 with_wrap        49ms   51.9ms      18.6  741.02KB        0

Created on 2024-09-09 with reprex v2.1.1

@eitsupi
Owner

eitsupi commented Jan 19, 2025

Just a note: S7 does not allow environment-based objects (i.e., those with reference semantics). This may be allowed in the future (RConsortium/S7#290), but it seems incompatible with the functional OOP philosophy. So if we move to S7, we need to go back to custom method implementations for the $ function.

One of the reasons I wanted to move away from custom $ methods was the difficulty of defining active bindings, but in S7, properties are accessed with @, so access with $ can be limited to functions (how to handle subnamespaces is still an open question, though).

@etiennebacher
Contributor Author

etiennebacher commented Jan 21, 2025

There clearly seems to be a performance hit due to the way wrap() is defined:

library(polars)

system.time({
  for (i in 1:300) {
    polars::pl$col("x")$cast(pl$String)$mean()$sum()$std()
  }
})
#>    user  system elapsed 
#>   0.353   0.017   0.369

library(neopolars)
#> 
#> Attaching package: 'neopolars'
#> The following objects are masked from 'package:polars':
#> 
#>     as_polars_df, as_polars_lf, as_polars_series, is_polars_df,
#>     is_polars_dtype, is_polars_lf, is_polars_series, pl

system.time({
  for (i in 1:300) {
    neopolars::pl$col("x")$cast(pl$String)$mean()$sum()$std()
  }
})
#>    user  system elapsed 
#>   2.724   0.002   2.727


(Note: this is with --profile release)


Of course this is not an actual use case, but I think it's very possible to run into this kind of scenario, for example when chaining a lot of select(), with_columns(), etc.
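To check how this overhead scales with the number of methods without involving the Rust side, something like the following base-R sketch could be used. `make_wrap()` and `time_wraps()` are hypothetical helpers, and the methods are trivial placeholders, so the absolute numbers say nothing about neo-polars itself; only the trend across method counts is of interest.

```r
# Build a wrap()-like constructor that binds `n_methods` placeholder methods
# into each new object, resetting each method's environment as wrap() does.
make_wrap <- function(n_methods) {
  methods <- stats::setNames(
    replicate(n_methods, function() self, simplify = FALSE),
    paste0("method_", seq_len(n_methods))
  )
  function(x) {
    self <- new.env(parent = emptyenv())
    self$`_x` <- x
    for (name in names(methods)) {
      fn <- methods[[name]]
      environment(fn) <- environment()
      assign(name, fn, envir = self)
    }
    self
  }
}

# Elapsed seconds for `n_calls` object constructions.
time_wraps <- function(n_methods, n_calls = 1000L) {
  wrap <- make_wrap(n_methods)
  unname(system.time(for (i in seq_len(n_calls)) wrap(i))["elapsed"])
}

timings <- vapply(c(10L, 100L, 500L), time_wraps, numeric(1))
```

Comparing the entries of `timings` gives a rough idea of whether construction cost grows with the size of the method list on a given machine.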

@eitsupi
Owner

eitsupi commented Jan 22, 2025

Thank you for profiling.
The use of custom environments is probably not a good idea given the compatibility with S7.
