Using future_map with dbGetQuery and stringr results in method error #188
You could try explicitly loading {glue} on the worker. Even so, isn't serializing a connection object like that going to destroy it? For example, here is a reproducible example with RSQLite. You can overcome the weird method error, but you still get an error that the connection is invalid:
library(DBI)
library(glue)
library(furrr)
plan(multisession, workers = 2)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
dbWriteTable(con, "test_table", mtcars)
test1 <- glue("SELECT * FROM test_table")
test2 <- glue("SELECT * FROM test_table")
test_list <- list(test1, test2)
future_map(
  test_list,
  dbGetQuery,
  conn = con,
  .options = furrr_options(seed = 123)
)
#> Loading required package: RSQLite
#> Warning: package ‘RSQLite’ was built under R version 4.0.2
#> Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘dbGetQuery’ for signature ‘"SQLiteConnection", "glue"’
future_map(
  test_list,
  dbGetQuery,
  conn = con,
  .options = furrr_options(seed = 123, packages = "glue")
)
#> Error: external pointer is not valid
Can't serialize a connection object:
library(DBI)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
con
#> <SQLiteConnection>
#> Path: :memory:
#> Extensions: TRUE
unserialize(serialize(con, NULL))
#> <SQLiteConnection>
#> DISCONNECTED
To read more about non-exportable objects: Search that for "DBI" to see that it is an object that can't be sent to workers.
There is at least one bug coming from the underlying globals package not finding "DBI" as a required package.
There is an additional issue with the glue part of this. There is no way for the underlying globals package to recognize that the glue package is required:
fn <- function(query) {
  con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
  DBI::dbWriteTable(con, "test_table", mtcars)
  out <- DBI::dbGetQuery(conn = con, query)
  DBI::dbDisconnect(con)
  head(out, 1)
}
fn2 <- function(query) {
  library(glue)
  con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
  DBI::dbWriteTable(con, "test_table", mtcars)
  out <- DBI::dbGetQuery(conn = con, query)
  DBI::dbDisconnect(con)
  head(out, 1)
}
test_query1 <- "SELECT * FROM test_table"
test_query2 <- glue::glue(test_query1)
callr::r(
  func = fn,
  args = list(query = test_query1)
)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
callr::r(
  func = fn,
  args = list(query = test_query2)
)
#> Error: callr subprocess failed: unable to find an inherited method for function ‘dbGetQuery’ for signature ‘"SQLiteConnection", "glue"’
callr::r(
  func = fn2,
  args = list(query = test_query2)
)
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
The issue is that when glue is loaded, it makes glue look like a "character" vector to S4 methods by calling methods::setOldClass(c("glue", "character")). If glue isn't loaded, this doesn't happen and you get the above failure. You can verify that this is the issue by calling it manually, and not loading glue:
fn <- function(query) {
  methods::setOldClass(c("glue", "character"))
  con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
  DBI::dbWriteTable(con, "test_table", mtcars)
  out <- DBI::dbGetQuery(conn = con, query)
  DBI::dbDisconnect(con)
  head(out, 1)
}
test_query1 <- "SELECT * FROM test_table"
test_query2 <- glue::glue(test_query1)
callr::r(
  func = fn,
  args = list(query = test_query2)
)
#> mpg cyl disp hp drat wt qsec vs am gear carb
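#> 1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
Since the globals package has no way to identify that the glue package is required, my recommendation to you would be to:
- Create the connection inside the function that runs on the worker, rather than passing it in.
- Load glue on the workers explicitly with furrr_options(packages = "glue").
So that would look something like:
library(furrr)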
library(glue)
plan(multisession, workers = 2)
fn <- function(query) {
  con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
  DBI::dbWriteTable(con, "test_table", mtcars)
  out <- DBI::dbGetQuery(conn = con, query)
  DBI::dbDisconnect(con)
  head(out, 1)
}
test_query1 <- glue("SELECT * FROM test_table")
test_query2 <- glue("SELECT * FROM test_table")
test_list <- list(test_query1, test_query2)
future_map(
  test_list,
  fn,
  .options = furrr_options(seed = 123, packages = "glue")
)
#> [[1]]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
#>
#> [[2]]
#> mpg cyl disp hp drat wt qsec vs am gear carb
#> 1 21 6 160 110 3.9 2.62 16.46 0 1 4 4
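The S4 dispatch point above can be reproduced without DBI or glue. The following is only a minimal sketch: `run_query`, `registered_sql`, and `unregistered_sql` are hypothetical names standing in for `dbGetQuery()` and the glue class, and it assumes nothing beyond the methods package that ships with R.

```r
library(methods)

# Hypothetical generic standing in for DBI::dbGetQuery(); it only has a
# method for plain "character" statements.
setGeneric("run_query", function(statement) standardGeneric("run_query"))
setMethod("run_query", "character", function(statement) {
  paste("running:", as.character(statement))
})

# Two S3 classes built on character, shaped like glue's c("glue", "character").
registered <- structure("SELECT 1", class = c("registered_sql", "character"))
unregistered <- structure("SELECT 1", class = c("unregistered_sql", "character"))

# Only one of them is announced to the S4 system, which is what loading
# glue does for the "glue" class.
setOldClass(c("registered_sql", "character"))

# Dispatch finds the "character" method for the registered class ...
ok <- run_query(registered)

# ... but not for the unregistered one; capture the error message instead
# of stopping.
failed <- tryCatch(
  run_query(unregistered),
  error = function(e) conditionMessage(e)
)

print(ok)
print(failed)
```

The unregistered case mirrors a worker where glue was never loaded: the object still carries its class attribute, but S4 has no record that the class extends "character".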
That's the thing, it did work before. My assumption at the time was that the same sugar that handled objects and packages was also creating a new pool for each worker based on the existing one. The only other thing I can think of is that it was silently falling back to running sequentially in the main R process, but the significant speed gains over using regular
Without a fully reproducible example showing that it was working before (maybe with an old furrr / future / globals version), I don't think there is much else I can do here. I would be extremely surprised if the connection object was previously able to be serialized/unserialized, but can't be now. Here is an old pool issue with another person not being able to round-trip through serialization. My guess is that somehow it was running sequentially.
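One rough way to check whether work is really being sent to workers, rather than running sequentially in the main process, is to compare process IDs. This is a sketch under the assumption that furrr is installed and a multisession plan can start two workers; the variable names are illustrative only.

```r
library(furrr)

plan(multisession, workers = 2)

# PID of the main R session.
main_pid <- Sys.getpid()

# Each element reports the PID of the process that evaluated it.
worker_pids <- future_map_int(1:4, ~ Sys.getpid())

# TRUE when every element ran outside the main process.
ran_on_workers <- all(worker_pids != main_pid)
print(ran_on_workers)

# Shut the background workers down again.
plan(sequential)
```

If `ran_on_workers` is FALSE, the map was evaluated in the main session, which would be consistent with the sequential-fallback guess.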
Originally my assumption was some kind of syntactic sugar taking the admittedly wasteful approach of making a new connection for each worker, but you're right, it is definitely possible that the code was silently falling back to sequential operation. The performance difference I saw during initial testing may have been happenstance or due to the SQL server caching the results. I'll rework things following your example. I've been looking into approaches like that or possibly using a second session/container for "heavy lifting" and throwing stuff back and forth between the two.
I used to construct batches of SQL queries programmatically via str_glue(), insert them into a list, and then use future_map() to massively speed up the process. Attempting this with R 4.0.3, furrr 0.2.2, pool 0.1.6, and stringr 1.4.0 results in the following error:
On my machine at least I can both reproduce the issue and isolate it to future_map(). Furrr works with a list of queries made using paste(), and regular map() will work with a list of queries using str_glue(). Trying to use a normal connection instead of pool doesn't change the outcome.
It may be unrelated, but shortly before my slightly overdue R and packages update I also started getting many instances of the below error when using furrr for this, sometimes up to 5 or 6 sets.
Minimal example:
Session Info