Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

median calculation fails when using dplyr::mutate #993

Open
cy-james-lee opened this issue Jan 9, 2025 · 1 comment · May be fixed by #1011
Open

median calculation fails when using dplyr::mutate #993

cy-james-lee opened this issue Jan 9, 2025 · 1 comment · May be fixed by #1011

Comments

@cy-james-lee
Copy link

library(duckdb)
#> Loading required package: DBI
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
conn1 <- dbConnect(duckdb())
duckdb_register(
  conn1,
  "airquality",
  airquality
)
tbl_aq <- tbl(conn1, "airquality")
tbl_aq |> 
  summarize(
    .by = Day,
    med = median(Temp)
  )
#> Warning: Missing values are always removed in SQL aggregation functions.
#> Use `na.rm = TRUE` to silence this warning
#> This warning is displayed once every 8 hours.
#> # Source:   SQL [?? x 2]
#> # Database: DuckDB v1.1.3 [A6FHAZZ@Windows 10 x64:R 4.4.2/:memory:]
#>      Day   med
#>    <int> <dbl>
#>  1    11    82
#>  2    15    79
#>  3    16    77
#>  4    22    73
#>  5    26    78
#>  6    31    81
#>  7     5    85
#>  8     9    90
#>  9    14    80
#> 10    17    72
#> # ℹ more rows
tbl_aq |> 
  mutate(
    .by = Day,
    med = median(Temp)
  )
#> Error in `collect()`:
#> ! Failed to collect lazy table.
#> Caused by error in `dbSendQuery()`:
#> ! rapi_prepare: Failed to extract statements:
#> Parser Error: ORDER BY is not implemented for window functions!

Created on 2025-01-09 with reprex v2.1.1

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.2 (2024-10-31 ucrt)
#>  os       Windows 11 x64 (build 22631)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/New_York
#>  date     2025-01-09
#>  pandoc   3.6.1 @ C:/PROGRA~3/CHOCOL~1/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  blob          1.2.4   2023-03-17 [1] CRAN (R 4.4.1)
#>  cli           3.6.3   2024-06-21 [1] CRAN (R 4.4.1)
#>  DBI         * 1.2.3   2024-06-02 [1] CRAN (R 4.4.1)
#>  dbplyr        2.5.0   2024-03-19 [1] CRAN (R 4.4.1)
#>  digest        0.6.37  2024-08-19 [1] CRAN (R 4.4.1)
#>  dplyr       * 1.1.4   2023-11-17 [1] CRAN (R 4.4.1)
#>  duckdb      * 1.1.3-1 2024-12-08 [1] CRAN (R 4.4.2)
#>  evaluate      1.0.1   2024-10-10 [1] CRAN (R 4.4.1)
#>  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.1)
#>  fs            1.6.5   2024-10-30 [1] CRAN (R 4.4.2)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.4.1)
#>  glue          1.8.0   2024-09-30 [1] CRAN (R 4.4.1)
#>  htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.1)
#>  knitr         1.49    2024-11-08 [1] CRAN (R 4.4.2)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.1)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.1)
#>  pillar        1.10.1  2025-01-07 [1] CRAN (R 4.4.2)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.4.1)
#>  purrr         1.0.2   2023-08-10 [1] CRAN (R 4.4.1)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.4.1)
#>  reprex        2.1.1   2024-07-06 [1] CRAN (R 4.4.1)
#>  rlang         1.1.4   2024-06-04 [1] CRAN (R 4.4.1)
#>  rmarkdown     2.29    2024-11-04 [1] CRAN (R 4.4.2)
#>  rstudioapi    0.17.1  2024-10-22 [1] CRAN (R 4.4.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.1)
#>  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.4.1)
#>  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.4.1)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.1)
#>  withr         3.0.2   2024-10-28 [1] CRAN (R 4.4.1)
#>  xfun          0.50    2025-01-07 [1] CRAN (R 4.4.2)
#>  yaml          2.3.10  2024-07-26 [1] CRAN (R 4.4.1)
#> 
#>  [1] C:/Users/A6FHAZZ/AppData/Local/R/win-library/4.4
#>  [2] C:/Program Files/R/R-4.4.2/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

It looks like there is a median function available. https://duckdb.org/docs/sql/functions/aggregates#medianx

Couldn't this be just used instead of percentile_cont?

@toppyy toppyy linked a pull request Jan 18, 2025 that will close this issue
@toppyy
Copy link
Contributor

toppyy commented Jan 18, 2025

Added a PR for translating median.

You might also consider using duckplyr. For example:

library(duckplyr)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> ✔ Overwriting dplyr methods with duckplyr methods.
#> ℹ Turn off with `duckplyr::methods_restore()`.

tbl_aq <- as_duck_tbl(airquality)

df <- tbl_aq |> 
  mutate(
    .by = Day,
    med = median(Temp)
  ) 

df |> explain()
#> ┌───────────────────────────┐
#> │         PROJECTION        │
#> │    ────────────────────   │
#> │             #0            │
#> │             #1            │
#> │             #2            │
#> │             #3            │
#> │             #4            │
#> │             #5            │
#> │             #6            │
#> │                           │
#> │         ~153 Rows         │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │           WINDOW          │
#> │    ────────────────────   │
#> │        Projections:       │
#> │     median(Temp) OVER     │
#> │     (PARTITION BY Day)    │
#> └─────────────┬─────────────┘
#> ┌─────────────┴─────────────┐
#> │     R_DATAFRAME_SCAN      │
#> │    ────────────────────   │
#> │      Text: data.frame     │
#> │                           │
#> │        Projections:       │
#> │           Ozone           │
#> │          Solar.R          │
#> │            Wind           │
#> │            Temp           │
#> │           Month           │
#> │            Day            │
#> │                           │
#> │         ~153 Rows         │
#> └───────────────────────────┘

df
#> # A duckplyr data frame: 7 variables
#>    Ozone Solar.R  Wind  Temp Month   Day   med
#>    <int>   <int> <dbl> <int> <int> <int> <dbl>
#>  1    16     256   9.7    69     5    12    76
#>  2    12     149  12.6    74     5     3    81
#>  3    NA     242  16.1    67     6     3    81
#>  4    32     236   9.2    81     7     3    81
#>  5    16      77   7.4    82     8     3    81
#>  6    73     183   2.8    93     9     3    81
#>  7    28      NA  14.9    66     5     6    83
#>  8    NA     264  14.3    79     6     6    83
#>  9    40     314  10.9    83     7     6    83
#> 10    66      NA   4.6    87     8     6    83
#> # ℹ more rows

Created on 2025-01-18 with reprex v2.1.1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants