Why are active bindings and methods redefined in `wrap()`? #3

etiennebacher · 2024-09-01T13:35:19Z

This is my first exploration of neo-polars, so I'm just putting some remarks here.

I'm surprised by the block of `makeActiveBinding()` calls and the `lapply()` call below:

Lines 67 to 89 in eaa014f

    
    wrap.PlRDataFrame <- function(x, ...) {
      self <- new.env(parent = emptyenv())
      self$`_df` <- x
      # TODO: flags
      makeActiveBinding("columns", function() self$`_df`$columns(), self)
      makeActiveBinding("dtypes", function() {
        self$`_df`$dtypes() |>
          lapply(\(x) .savvy_wrap_PlRDataType(x) |> wrap())
      }, self)
      makeActiveBinding("schema", function() structure(self$dtypes, names = self$columns), self)
      makeActiveBinding("shape", function() self$`_df`$shape(), self)
      makeActiveBinding("height", function() self$`_df`$height(), self)
      makeActiveBinding("width", function() self$`_df`$width(), self)
      lapply(names(polars_dataframe__methods), function(name) {
        fn <- polars_dataframe__methods[[name]]
        environment(fn) <- environment()
        assign(name, fn, envir = self)
      })
      class(self) <- c("polars_data_frame", "polars_object")
      self

Correct me if I'm wrong, but wrap() is called in every single function, so this code basically stores the active bindings and the list of methods multiple times if we chain several functions together. This seems very wasteful and potentially quite a perf hit. The current way we do this in r-polars is to store functions in specific environments only once when we load the package. Why isn't this possible here? I suppose you thought about that and found some issues implementing it?

(I don't mean this as a criticism, I know this is a massive undertaking, just wondering if you can explain so that I can maybe help with that)
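To make the concern concrete, here is a minimal sketch of the same pattern outside the package. The names `counter_methods` and `wrap_counter` are hypothetical stand-ins, not neo-polars code; the point is only that the `lapply()` loop copies every method into a fresh environment on each call in a chain.

```r
# Hypothetical method table, defined once (e.g. at package load time).
# Each method refers to a free variable `self`.
counter_methods <- list(
  add = function(n) wrap_counter(self$`_value` + n),
  get = function() self$`_value`
)

# Simplified stand-in for wrap(): builds a fresh environment per object.
wrap_counter <- function(x) {
  self <- new.env(parent = emptyenv())
  self$`_value` <- x
  # Re-point each method's enclosure so that `self` resolves to this object,
  # then copy the method into the object. This loop runs on every call.
  lapply(names(counter_methods), function(name) {
    fn <- counter_methods[[name]]
    environment(fn) <- environment()
    assign(name, fn, envir = self)
  })
  class(self) <- "counter"
  self
}

# Three objects are built here, so the method table is copied three times.
obj <- wrap_counter(1)$add(2)$add(3)
obj$get()
#> [1] 6
```

With two methods the cost is negligible; the question is how it scales once the method table holds hundreds of entries.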


etiennebacher · 2024-09-01T14:01:11Z

Regarding the lapply() call, could it be replaced by cloning the environment where dataframe functions are stored at the beginning of the function (instead of creating an empty one and filling it by hand)?

eitsupi · 2024-09-01T14:03:22Z

I understand the concern and have noted it in the README.

neo-r-polars/README.md

Lines 216 to 222 in 130ea55

    
    ### Disadvantages

    Due to the changes in the R class structure, the methods are now
    dynamically added by a loop each time an R class is built. So I’m
    worried that the performance will degrade after a large number of
    methods are available. However, it is difficult to compare this at the
    moment.

> The current way we do this in r-polars is to store functions in specific environments only once when we load the package. Why isn't this possible here?

This is because that approach differs from how Python (and R6, etc.) stores methods, which prevents R Polars from copying the behavior of Python Polars.

Without such a mechanism, it is not possible to implement features such as:
https://docs.pola.rs/api/python/stable/reference/api.html

eitsupi · 2024-09-01T14:12:27Z

> Regarding the lapply() call, could it be replaced by cloning the environment where dataframe functions are stored at the beginning of the function (instead of creating an empty one and filling it by hand)?

It certainly seems possible, but copying the environment doesn't seem easy and I don't know how much of a performance advantage it would be.

If I understand correctly, what I am doing here is the same thing that R6 and savvy do; if it caused performance issues, I don't think they would have adopted this approach.

eitsupi · 2024-09-01T14:28:36Z

One place where this approach is useful is the registration of the following DataType properties as bindings.

neo-r-polars/R/datatypes-classes.R

Lines 22 to 41 in 130ea55

    
    # Bindings mimic attributes of DataType classes of Python Polars
    env_bind(self, !!!x$`_get_datatype_fields`())
    ## _inner is a pointer now, so it should be wrapped
    if (exists("_inner", envir = self)) {
      makeActiveBinding("inner", function() {
        .savvy_wrap_PlRDataType(self$`_inner`) |>
          wrap()
      }, self)
    }
    ## _fields is a list of pointers now, so they should be wrapped
    if (exists("_fields", envir = self)) {
      makeActiveBinding("fields", function() {
        lapply(self$`_fields`, function(x) {
          .savvy_wrap_PlRDataType(x) |>
            wrap()
        })
      }, self)
    }

neo-r-polars/src/rust/src/datatypes.rs

Lines 91 to 170 in 130ea55

    
    fn _get_datatype_fields(&self) -> Result<Sexp> {
        match &self.dt {
            DataType::Decimal(precision, scale) => {
                let mut out = OwnedListSexp::new(2, true)?;
                let precision: Sexp =
                    precision.map_or_else(|| NullSexp.into(), |v| (v as f64).try_into())?;
                let scale: Sexp =
                    scale.map_or_else(|| NullSexp.into(), |v| (v as f64).try_into())?;
                let _ = out.set_name_and_value(0, "precision", precision);
                let _ = out.set_name_and_value(1, "scale", scale);
                Ok(out.into())
            }
            DataType::Datetime(time_unit, time_zone) => {
                let mut out = OwnedListSexp::new(2, true)?;
                let time_unit: Sexp = format!("{time_unit}").try_into()?;
                let time_zone: Sexp = time_zone
                    .as_ref()
                    .map_or_else(|| NullSexp.into(), |v| v.to_owned().try_into())?;
                let _ = out.set_name_and_value(0, "time_unit", time_unit);
                let _ = out.set_name_and_value(1, "time_zone", time_zone);
                Ok(out.into())
            }
            DataType::Duration(time_unit) => {
                let mut out = OwnedListSexp::new(1, true)?;
                let time_unit: Sexp = format!("{time_unit}").try_into()?;
                let _ = out.set_name_and_value(0, "time_unit", time_unit);
                Ok(out.into())
            }
            DataType::Array(inner, width) => {
                let mut out = OwnedListSexp::new(2, true)?;
                let inner: Sexp = PlRDataType { dt: *inner.clone() }.try_into()?;
                let width: Sexp = (*width as f64).try_into()?;
                let _ = out.set_name_and_value(0, "_inner", inner);
                let _ = out.set_name_and_value(1, "width", width);
                Ok(out.into())
            }
            DataType::List(inner) => {
                let mut out = OwnedListSexp::new(1, true)?;
                let inner: Sexp = PlRDataType { dt: *inner.clone() }.try_into()?;
                let _ = out.set_name_and_value(0, "_inner", inner);
                Ok(out.into())
            }
            DataType::Struct(fields) => {
                let mut out = OwnedListSexp::new(1, true)?;
                let mut list = OwnedListSexp::new(fields.len(), true)?;
                for (i, field) in fields.iter().enumerate() {
                    let name = field.name().as_str();
                    let value: Sexp = PlRDataType {
                        dt: field.data_type().clone(),
                    }
                    .try_into()?;
                    let _ = list.set_name_and_value(i, name, value);
                }
                let _ = out.set_name_and_value(0, "_fields", list);
                Ok(out.into())
            }
            DataType::Categorical(_, ordering) => {
                let mut out = OwnedListSexp::new(1, true)?;
                let ordering: Sexp = <String>::from(Wrap(ordering)).try_into()?;
                let _ = out.set_name_and_value(0, "ordering", ordering);
                Ok(out.into())
            }
            DataType::Enum(categories, _) => {
                let mut out = OwnedListSexp::new(1, true)?;
                let categories: Sexp = categories.as_ref().map_or_else(
                    || NullSexp.into(),
                    |v| {
                        v.get_categories()
                            .into_iter()
                            .map(|v| v.unwrap_or_default().to_string())
                            .collect::<Vec<_>>()
                            .try_into()
                    },
                )?;
                let _ = out.set_name_and_value(0, "categories", categories);
                Ok(out.into())
            }
            _ => Ok(NullSexp.into()),
        }
    }

etiennebacher · 2024-09-01T14:34:12Z

> It certainly seems possible, but copying the environment doesn't seem easy and I don't know how much of a performance advantage it would be.

I don't know about the perf but it would probably be possible (and cleaner) with rlang::env_clone(). I think this would be faster than going through a large list of functions.

eitsupi · 2024-09-02T12:29:29Z

If it improves performance, then copying env seems like a good way to go, but I am not sure if it will work properly because of the processing we are doing to update self as shown below.

neo-r-polars/R/series-series.R

Lines 55 to 59 in ddb0a2b

    
    lapply(names(polars_series__methods), function(name) {
      fn <- polars_series__methods[[name]]
      environment(fn) <- environment()
      assign(name, fn, envir = self)
    })

If overwriting self is not possible, it may be necessary to wrap each function in a function with only self as an argument and then bind it to the environment, as the savvy CLI does.
This would require additional code and could increase complexity.

neo-r-polars/R/000-wrappers.R

Lines 1026 to 1030 in ddb0a2b

    
    `PlRExpr_struct_with_fields` <- function(self) {
      function(`fields`) {
        .savvy_wrap_PlRExpr(.Call(savvy_PlRExpr_struct_with_fields__impl, `self`, `fields`))
      }
    }

neo-r-polars/R/000-wrappers.R

Line 1106 in ddb0a2b

    e$`struct_with_fields` <- `PlRExpr_struct_with_fields`(ptr)
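As a rough illustration of that factory style, the sketch below binds `self` through a closure instead of rewriting function environments. All names (`tally_add`, `tally_get`, `new_tally`) are hypothetical and only mirror the shape of the savvy-generated code above.

```r
# Hypothetical factories: each takes the object and returns the actual method,
# so `self` is captured by the closure rather than by environment(fn) <- ...
tally_add <- function(self) {
  force(self)
  function(n) new_tally(self$`_value` + n)
}
tally_get <- function(self) {
  force(self)
  function() self$`_value`
}

# Constructor: one explicit binding per method.
new_tally <- function(x) {
  e <- new.env(parent = emptyenv())
  e$`_value` <- x
  e$add <- tally_add(e)
  e$get <- tally_get(e)
  class(e) <- "tally"
  e
}

new_tally(1)$add(2)$get()
#> [1] 3
```

This still creates one closure per method per object, so it changes how `self` is bound rather than how much work each construction does.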

etiennebacher · 2024-09-02T12:34:58Z

Sorry, I won't have enough time in the next few weeks to think about this in detail. Let's keep going like this for now, but I'd like to revisit this before we make the switch in polars (which probably won't be very soon anyway, I suppose).

eitsupi · 2024-09-02T12:38:46Z

Yes, of course it would have to be after implementing a large number of methods to make a clear difference in performance.
So it would be better to do it later.

yutannihilation · 2024-09-06T12:07:27Z

If you want to avoid copying, extendr's way might be preferable. It prepares an environment and extracts the method from it. There's no strong reason I didn't follow this way in savvy. I just felt that using the environment directly is good for tab-completion.

https://github.com/extendr/extendr/blob/39ac2e8ce8fda7c86df5bd69be37190f49e9cf48/tests/extendrtests/R/extendr-wrappers.R#L186

(If I were to implement this, I would store the operations on the R side and pass them to the Rust side lazily, though.)
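A minimal sketch of that lookup-on-access idea, as I understand the linked wrappers: methods live in one shared environment and a `$` method binds `self` only when a method is requested. The names `Gauge` and `new_gauge` are hypothetical, and a plain list stands in for the external pointer a real binding would hold.

```r
# Hypothetical shared method table, created once.
Gauge <- new.env(parent = emptyenv())
# .subset2() reads the field directly, without going through `$.Gauge` again.
Gauge$add <- function(n) new_gauge(.subset2(self, "value") + n)
Gauge$get <- function() .subset2(self, "value")

# Objects carry only data, so construction does not touch the method table.
new_gauge <- function(x) structure(list(value = x), class = "Gauge")

# Methods are looked up on access and bound to `self` on the fly.
`$.Gauge` <- function(self, name) {
  fn <- Gauge[[name]]
  environment(fn) <- environment()
  fn
}

new_gauge(1)$add(2)$get()
#> [1] 3
```

Here the per-object cost no longer depends on the number of methods; the trade-off is that only the accessed method is materialized, which is why tab-completion needs extra support.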

eitsupi · 2024-09-06T13:23:43Z

Thank you.
A quick experiment shows that even if we copy the parent environment, the function is still tied to the original environment, so I guess this approach still won't work in a custom environment like R6.

    at <- arrow::as_arrow_table(mtcars)
    at$AddColumn
    #> function (i, new_field, value)
    #> Table__AddColumn(self, i, new_field, value)
    #> <environment: 0x55e07fc9dc10>

    at2 <- rlang::env_clone(at)
    at2$AddColumn
    #> function (i, new_field, value)
    #> Table__AddColumn(self, i, new_field, value)
    #> <environment: 0x55e07fc9dc10>

    at2
    #> <environment: 0x55e07fdbac68>

Created on 2024-09-06 with reprex v2.1.1

In any case, I don't think anyone is concerned about R6 performance in the arrow package, so I don't know if this is really a performance issue.
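The same effect can be reproduced without arrow. The sketch below uses a base R shallow copy (`as.list()` plus `list2env()`) as a stand-in for `rlang::env_clone()`; `make_obj` is a hypothetical constructor. The copied method keeps its original enclosure, so it still reads from the original object.

```r
# Hypothetical constructor: `get` is a closure whose `self` is the original env.
make_obj <- function(x) {
  self <- new.env(parent = emptyenv())
  self$value <- x
  self$get <- function() self$value
  self
}

orig <- make_obj(1)

# Shallow copy of all bindings into a new environment.
copy <- list2env(as.list(orig, all.names = TRUE), parent = emptyenv())
copy$value <- 2

# The copied method still sees the original object's value.
copy$get()
#> [1] 1

# Both methods share the same enclosing environment.
identical(environment(copy$get), environment(orig$get))
#> [1] TRUE
```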

eitsupi · 2024-09-09T14:49:22Z

A simple experiment: comparing the process with and without going through wrap() multiple times shows that there appears to be little penalty for wrap().

    bench::mark(
      without_wrap = {
        neopolars:::PlRSeries$new_i32("", 1:10^5)$cast(neopolars::pl$Int64$`_dt`, TRUE)$cast(neopolars::pl$String$`_dt`, TRUE)
      },
      with_wrap = {
        neopolars::as_polars_series(1:10^5)$cast(neopolars::pl$Int64, TRUE)$cast(neopolars::pl$String, TRUE)
      },
      check = FALSE,
      min_iterations = 30
    )
    #> # A tibble: 2 × 6
    #>   expression        min   median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr>   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    #> 1 without_wrap   48.4ms   51.9ms      18.8    2.08MB        0
    #> 2 with_wrap        49ms   51.9ms      18.6  741.02KB        0

Created on 2024-09-09 with reprex v2.1.1

eitsupi · 2025-01-19T06:03:52Z

Just a note: if we move to S7, we need to go back to custom method implementations for the `$` function, because S7 does not allow environment-based objects (i.e., those with reference semantics). This may be allowed in the future (RConsortium/S7#290), but it seems incompatible with the functional OOP philosophy.

One of the reasons I wanted to move away from custom $ methods was the difficulty of defining active bindings, but in S7, properties are accessed with @, so access with $ can be limited to functions (How to handle subnamespaces is an issue, though).

etiennebacher · 2025-01-21T11:31:31Z

There clearly seems to be a performance hit due to the way wrap() is defined:

    library(polars)

    system.time({
      for (i in 1:300) {
        polars::pl$col("x")$cast(pl$String)$mean()$sum()$std()
      }
    })
    #>    user  system elapsed 
    #>   0.353   0.017   0.369

    library(neopolars)
    #> 
    #> Attaching package: 'neopolars'
    #> The following objects are masked from 'package:polars':
    #> 
    #>     as_polars_df, as_polars_lf, as_polars_series, is_polars_df,
    #>     is_polars_dtype, is_polars_lf, is_polars_series, pl

    system.time({
      for (i in 1:300) {
        neopolars::pl$col("x")$cast(pl$String)$mean()$sum()$std()
      }
    })
    #>    user  system elapsed 
    #>   2.724   0.002   2.727

(Note: this is with --profile release)

Of course this is not an actual use case, but I think it's very possible to have this kind of scenario, for example when chaining a lot of select(), with_columns(), etc.

eitsupi · 2025-01-22T07:06:05Z

Thank you for profiling.
The use of custom environments is probably not a good idea given the compatibility with S7.
