delaytional provides a backend for DelayedArray that stores the array data in an SQL database. The main motivation is to let you use the performance of DuckDB to store array data in Parquet files and query them out of memory (without loading the entire dataset into R).
Other possible use cases include:
- Out-of-memory querying of CSV or NDJSON data, via DuckDB
- Using SQLite databases as array stores
- Re-using an existing database server (DBMS) to support matrix data
You can install the development version of delaytional from GitHub with:
# install.packages("remotes")
remotes::install_github("WEHI-ResearchComputing/delaytional")
Let’s say we’re working with a fairly large matrix:
in_mem <- rnorm(n = 4E8) |> matrix(ncol = 2E4)
lobstr::obj_size(in_mem)
#> 3.20 GB
This isn’t too bad, but you can easily imagine we might be in trouble if the matrix were 100 times larger.
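As a rough sanity check on that figure, the size can be estimated by hand. This is a minimal sketch that assumes 8 bytes per double and ignores the small fixed overhead of the R object itself; the variable names are only illustrative.

```r
# Back-of-the-envelope size of a dense double matrix.
n_elements <- 4e8        # same element count as in_mem above
bytes_per_double <- 8    # R stores each double in 8 bytes

# Size in decimal gigabytes (1 GB = 1e9 bytes).
size_gb <- n_elements * bytes_per_double / 1e9
size_gb

# Estimated size, in GB, of a matrix 100 times larger.
size_gb_100x <- size_gb * 100
size_gb_100x
```

The first value comes to 3.2 GB, in line with the reported object size, and the second to 320 GB, which is more RAM than most machines have.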
Let’s convert it into a delaytional array and see if we can improve on that. First we need to write the array to Parquet using array_to_table:
table_dest <- tempfile(fileext = ".parquet")
in_mem |>
  delaytional::array_to_table() |>
  arrow::write_parquet(table_dest)
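To give some intuition for how an array can live in a database table at all, here is a hypothetical base R illustration of a "long" layout: one row per cell, with index columns and a value column. The column names row, col and value are made up for this sketch, and the layout that array_to_table actually produces may differ.

```r
# A tiny 2 x 3 matrix standing in for real data.
m <- matrix(c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5), nrow = 2)

# One row per cell: the cell's indices plus its value.
# (Hypothetical schema, not necessarily the one delaytional uses.)
long <- data.frame(
  row   = as.vector(row(m)),
  col   = as.vector(col(m)),
  value = as.vector(m)
)

# Subsetting the matrix becomes a filter on the index columns.
# This is the long-table equivalent of m[2, 3].
cell <- long$value[long$row == 2 & long$col == 3]
cell
```

In a layout like this, a subset such as m[2, 3] corresponds to a filter on the index columns, which is the kind of query a database can answer without reading the whole table into R.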
Next, we can create an array from the Parquet file. This is also how you would create a DelayedArray from an existing Parquet dataset that you didn’t make yourself:
delayed_parquet <- duckdb::duckdb() |>
  DBI::dbConnect() |>
  dplyr::tbl(table_dest) |>
  delaytional::SqlArraySeed() |>
  DelayedArray::DelayedArray()
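The pipeline above never stores the database connection, so it cannot be closed explicitly later. If you want control over that, one possible variation is to keep the connection in a variable. The sketch below uses only the delaytional calls shown in this document, plus DBI::dbDisconnect; the small stand-in matrix and the names small, small_dest, con and delayed_small are assumptions made for illustration.

```r
# A small stand-in matrix so the sketch runs quickly.
small <- matrix(rnorm(20), ncol = 4)
small_dest <- tempfile(fileext = ".parquet")
small |>
  delaytional::array_to_table() |>
  arrow::write_parquet(small_dest)

# Keep the connection in a variable so it can be closed later.
con <- DBI::dbConnect(duckdb::duckdb())

delayed_small <- con |>
  dplyr::tbl(small_dest) |>
  delaytional::SqlArraySeed() |>
  DelayedArray::DelayedArray()

# Query the array while the connection is still open.
small_dim <- dim(delayed_small)
small_dim

# Close the connection and shut down the in-process DuckDB instance.
DBI::dbDisconnect(con, shutdown = TRUE)
```

Any work on the array has to happen before the connection is closed, because the data is only fetched from the database when it is requested.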
How big is it now?
lobstr::obj_size(delayed_parquet)
#> 38.14 kB
Now let’s pull out some random indices from the two arrays, and compare the results:
set.seed(1)
x <- sample.int(n = nrow(in_mem), size = 5)
y <- sample.int(n = ncol(in_mem), size = 5)
in_mem[x, y]
#>            [,1]       [,2]       [,3]       [,4]       [,5]
#> [1,] -0.6700517  0.9356821 -1.0847204 -0.3044768  0.3201105
#> [2,]  0.3117883 -1.1637274  0.5206586  0.6136792  0.3863917
#> [3,]  0.5248609 -1.1065383 -0.4499743  1.1963352 -0.8393234
#> [4,] -0.4036493  1.7691156 -1.7958300 -0.7343334  0.2773608
#> [5,] -0.5505380  1.4433347  0.3846686 -1.5767486  2.5452338
delayed_parquet[x, y]
#> <5 x 5> matrix of class DelayedMatrix and type "double":
#>            [,1]       [,2]       [,3]       [,4]       [,5]
#> [1,] -0.6700517  0.9356821 -1.0847204 -0.3044768  0.3201105
#> [2,]  0.3117883 -1.1637274  0.5206586  0.6136792  0.3863917
#> [3,]  0.5248609 -1.1065383 -0.4499743  1.1963352 -0.8393234
#> [4,] -0.4036493  1.7691156 -1.7958300 -0.7343334  0.2773608
#> [5,] -0.5505380  1.4433347  0.3846686 -1.5767486  2.5452338
all.equal(
  in_mem[x, y],
  as.array(delayed_parquet[x, y])
)
#> [1] TRUE
The same results! The only difference is that delayed_parquet has a much smaller memory footprint.