Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using future_map with dbGetQuery and stringr results in method error #188

Closed
D3SL opened this issue Jan 31, 2021 · 7 comments
Closed

Using future_map with dbGetQuery and stringr results in method error #188

D3SL opened this issue Jan 31, 2021 · 7 comments

Comments

@D3SL
Copy link

D3SL commented Jan 31, 2021

I used to construct batches of SQL queries programmatically via str_glue(), insert them into a list, and then use future_map() to massively speed up the process. Attempting this with R 4.0.3, Furrr 0.2.2, Pool 0.1.6, and stringr 1.4.0 results in the following error:

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'dbGetQuery' for signature '"Pool", "glue"'

On my machine at least I can both reproduce the issue and isolate it to future_map(). Furrr works with a list of queries made using paste(), and regular map() will work with a list of queries using str_glue. Trying to use a normal connection instead of pool doesn't change the outcome.

It may be unrelated but shortly before my slightly overdue R and packages update I also started getting many instances of the below error when using furrr for this, sometimes up to 5 or 6 sets.

1: In connection_release(conn@ptr) : Already disconnected
2: It wasn't possible to activate and/or validate the object. Trying again with a new object. 

Minimal example:

test_conn<-dbPool(<your connection here>)

#swap between str_glue and paste
test1<-str_glue("SELECT * FROM test_table WHERE FromDate='2021-01-31'") 
test2<-str_glue("SELECT * FROM test_table WHERE FromDate='2021-01-31'")

test_list<-list(test1,test2)

#swap between future_map and map
asr_results<-future_map(test_list, dbGetQuery, conn=test_conn, .options = furrr_options(seed = 123)) 

>

Session Info

R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_World.1252  LC_CTYPE=English_World.1252    LC_MONETARY=English_World.1252 LC_NUMERIC=C                  
[5] LC_TIME=English_World.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] beepr_1.3             readwritesqlite_0.1.2 data.table_1.13.6     here_1.0.1            pool_0.1.6            odbc_1.3.0           
 [7] RMariaDB_1.1.0        DBI_1.1.1             furrr_0.2.2           future_1.21.0         lubridate_1.7.9.2     magrittr_2.0.1       
[13] forcats_0.5.1         stringr_1.4.0         dplyr_1.0.3           purrr_0.3.4           readr_1.4.0           tidyr_1.1.2          
[19] tibble_3.0.6          ggplot2_3.3.3         tidyverse_1.3.0       RSQLite_2.2.3        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6        listenv_0.8.0     assertthat_0.2.1  rprojroot_2.0.2   digest_0.6.27     parallelly_1.23.0 R6_2.5.0          cellranger_1.1.0 
 [9] backports_1.2.1   reprex_1.0.0      httr_1.4.2        pillar_1.4.7      rlang_0.4.10      readxl_1.3.1      rstudioapi_0.13   blob_1.2.1       
[17] bit_4.0.4         munsell_0.5.0     broom_0.7.4       compiler_4.0.3    modelr_0.1.8      pkgconfig_2.0.3   globals_0.14.0    tidyselect_1.1.0 
[25] audio_0.1-7       codetools_0.2-18  fansi_0.4.2       crayon_1.4.0      dbplyr_2.0.0      withr_2.4.1       later_1.1.0.1     grid_4.0.3       
[33] jsonlite_1.7.2    gtable_0.3.0      lifecycle_0.2.0   scales_1.1.1      cli_2.2.0         stringi_1.5.3     fs_1.5.0          xml2_1.3.2       
[41] chk_0.5.1         ellipsis_0.3.1    generics_0.1.0    vctrs_0.3.6       tools_4.0.3       bit64_4.0.5       glue_1.4.2        hms_1.0.0        
[49] parallel_4.0.3    colorspace_2.0-0  rvest_0.3.6       memoise_2.0.0     haven_2.3.1 
@D3SL D3SL changed the title Using future_map with dbGetQuery and stringr results in errors Using future_map with dbGetQuery and stringr results in method error Jan 31, 2021
@DavisVaughan
Copy link
Collaborator

You could try explicitly loading {glue} on the worker. Even so, isn't serializing a connection object like that going to destroy it? For example, here is a reproducible example with RSQLite. You can overcome the weird method error, but you still get an error that the connection is invalid

library(DBI)
library(glue)
library(furrr)

plan(multisession, workers = 2)

con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

dbWriteTable(con, "test_table", mtcars)

test1 <- glue("SELECT * FROM test_table") 
test2 <- glue("SELECT * FROM test_table")

test_list <- list(test1, test2)

future_map(
  test_list, 
  dbGetQuery, 
  conn = con, 
  .options = furrr_options(seed = 123)
) 
#> Loading required package: RSQLite
#> Warning: package ‘RSQLite’ was built under R version 4.0.2
#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘dbGetQuery’ for signature ‘"SQLiteConnection", "glue"’

future_map(
  test_list, 
  dbGetQuery, 
  conn = con, 
  .options = furrr_options(seed = 123, packages = "glue")
) 
#> Error: external pointer is not valid

Can't serialize a connection object:

library(DBI)

con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

con
#> <SQLiteConnection>
#>   Path: :memory:
#>   Extensions: TRUE

unserialize(serialize(con, NULL))
#> <SQLiteConnection>
#>   DISCONNECTED

@DavisVaughan
Copy link
Collaborator

To read more about non exportable objects:
https://cran.r-project.org/web/packages/future/vignettes/future-4-non-exportable-objects.html

Search that for "DBI" to see that it is an object that can't be sent to workers

@DavisVaughan
Copy link
Collaborator

There is at least one bug coming from the underlying globals package not finding "DBI" as a required package
futureverse/globals#73

@DavisVaughan
Copy link
Collaborator

There is an additional issue with the glue part of this. There is no way for the underlying globals package to recognize that the "glue" class you passed in requires the {glue} package. So the glue package won't be loaded on the worker. This causes issues with DBI as you can see below (this is not furrr/future specific, so I'll demonstrate with callr):

fn <- function(query) {
  con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
  DBI::dbWriteTable(con, "test_table", mtcars)
  out <- DBI::dbGetQuery(conn = con, query)
  DBI::dbDisconnect(con)
  head(out, 1)
}

fn2 <- function(query) {
  library(glue)
  con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
  DBI::dbWriteTable(con, "test_table", mtcars)
  out <- DBI::dbGetQuery(conn = con, query)
  DBI::dbDisconnect(con)
  head(out, 1)
}

test_query1 <- "SELECT * FROM test_table"
test_query2 <- glue::glue(test_query1) 

callr::r(
  func = fn, 
  args = list(query = test_query1)
)
#>   mpg cyl disp  hp drat   wt  qsec vs am gear carb
#> 1  21   6  160 110  3.9 2.62 16.46  0  1    4    4

callr::r(
  func = fn, 
  args = list(query = test_query2)
)
#> Error: callr subprocess failed: unable to find an inherited method for function ‘dbGetQuery’ for signature ‘"SQLiteConnection", "glue"’

callr::r(
  func = fn2, 
  args = list(query = test_query2)
)
#>   mpg cyl disp  hp drat   wt  qsec vs am gear carb
#> 1  21   6  160 110  3.9 2.62 16.46  0  1    4    4

The issue is that when glue is loaded, it makes glue look like a "character" vector to S4 methods by calling setOldClass()
https://github.com/tidyverse/glue/blob/bcb719b488f6d6d0fe5652cd8e0e3744545e523d/R/glue.R#L311

If glue isn't loaded, this doesn't happen and you get the above failure. You can verify that this is the issue by calling it manually, and not loading glue:

fn <- function(query) {
  methods::setOldClass(c("glue", "character"))
  con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
  DBI::dbWriteTable(con, "test_table", mtcars)
  out <- DBI::dbGetQuery(conn = con, query)
  DBI::dbDisconnect(con)
  head(out, 1)
}

test_query1 <- "SELECT * FROM test_table"
test_query2 <- glue::glue(test_query1) 

callr::r(
  func = fn, 
  args = list(query = test_query2)
)
#>   mpg cyl disp  hp drat   wt  qsec vs am gear carb
#> 1  21   6  160 110  3.9 2.62 16.46  0  1    4    4

Since the globals package has no way to identify that the glue package is required, my recommendation to you would be to:

  • Prefix all DBI calls with DBI:: until the globals bug I linked to is fixed
  • Manually load the globals package, or just convert to character first

So that would look something like:

library(furrr)
library(glue)

plan(multisession, workers = 2)

fn <- function(query) {
  con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
  DBI::dbWriteTable(con, "test_table", mtcars)
  out <- DBI::dbGetQuery(conn = con, query)
  DBI::dbDisconnect(con)
  head(out, 1)
}

test_query1 <- glue("SELECT * FROM test_table")
test_query2 <- glue("SELECT * FROM test_table")

test_list <- list(test_query1, test_query2)

future_map(
  test_list,
  fn,
  .options = furrr_options(seed = 123, packages = "glue")
)
#> [[1]]
#>   mpg cyl disp  hp drat   wt  qsec vs am gear carb
#> 1  21   6  160 110  3.9 2.62 16.46  0  1    4    4
#> 
#> [[2]]
#>   mpg cyl disp  hp drat   wt  qsec vs am gear carb
#> 1  21   6  160 110  3.9 2.62 16.46  0  1    4    4

@D3SL
Copy link
Author

D3SL commented Feb 1, 2021

That's the thing, it did work before. My assumption at the time was the same sugar that handled objects and packages was also creating a new pool for each worker based on the existing one. The only other thing I can think of is it was silently failing to running sequentially in the main R process, but the significant speed gains over using regular map() would suggest otherwise.

@DavisVaughan
Copy link
Collaborator

DavisVaughan commented Feb 1, 2021

Without a fully reproducible example showing that it was working before (maybe with an old furrr / future / globals version), I don't think there is much else I can do here. I would be extremely surprised if the connection object was previously able to be serialized/unserialized, but can't be now.

Here is an old pool issue with another person not being able to roundtrip through serialization
rstudio/pool#72

My guess is that somehow it was running sequentially.

@D3SL
Copy link
Author

D3SL commented Feb 2, 2021

Originally my assumption was some kind of syntactic sugar taking the admittedly wasteful approach of making a new connection for each worker, but you're right it is definitely a possibility the code was silently failing to sequential operation. The performance difference I saw during initial testing may have been happenstance or due to the SQL server caching the results.

I'll rework things following your example. I've been looking into approaches like that or possibly using a second session/container for "heavy lifting" and throwing stuff back and forth between the two.

@D3SL D3SL closed this as completed Feb 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants