delaytional

Lifecycle: experimental

delaytional provides a backend for DelayedArray that stores the array data in an SQL database. The main motivation is to use the impressive performance of DuckDB to store array data in Parquet files and query it without loading the entire dataset into R's memory. Other possible use cases include:

  • Out-of-memory querying of CSV or NDJSON data, via DuckDB
  • Using SQLite databases as array stores
  • Re-using an existing database server (DBMS) to support matrix data

Installation

You can install the development version of delaytional from GitHub with:

# install.packages("remotes")
remotes::install_github("WEHI-ResearchComputing/delaytional")

Example

Let’s say we’re working with a fairly large matrix:

in_mem <- rnorm(n = 4E8) |> matrix(ncol = 2E4)
lobstr::obj_size(in_mem)
#> 3.20 GB

This isn’t too bad, but you can easily imagine we might be in trouble if the matrix were 100 times larger.
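The reported size follows from simple arithmetic, since a dense numeric matrix in R stores 8 bytes per element. A quick base R check of that figure and of the hypothetical "100 times larger" case (the variable names are only illustrative):

```r
# Size of a dense double matrix: number of elements times 8 bytes per double.
n_elements <- 4e8
n_cols <- 2e4
n_rows <- n_elements / n_cols        # 20000 rows, so the matrix is 20000 x 20000
size_gb <- n_elements * 8 / 1e9      # 3.2 decimal gigabytes, ignoring the small object header
size_100x_gb <- size_gb * 100        # 320 GB, more RAM than most workstations have
```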

Let’s convert it into a delaytional array and see if we can reduce that memory footprint. First we need to write the array to Parquet, using array_to_table to convert it into a table that arrow can write:

table_dest <- tempfile(fileext = ".parquet")
in_mem |>
    delaytional::array_to_table() |>
    arrow::write_parquet(table_dest)
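If you want to see what the table form looks like before writing several gigabytes to disk, one option is to run the same conversion on a tiny matrix and inspect the result. This is only a sketch: it assumes delaytional is installed, and it makes no claim about the column names or types that array_to_table produces, which is exactly what str() is there to reveal.

```r
# A tiny 2 x 3 matrix stands in for the large one.
small <- matrix(1:6, nrow = 2)

# Same conversion step as above, applied to the small matrix.
small_table <- delaytional::array_to_table(small)

# Inspect the structure rather than assuming a particular schema.
str(small_table)
```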

Next, we can create an array from the Parquet file. This is also how you would create a DelayedArray from an existing Parquet dataset that you didn’t make yourself:

delayed_parquet <- duckdb::duckdb() |>
  DBI::dbConnect() |>
  dplyr::tbl(table_dest) |>
  delaytional::SqlArraySeed() |>
  DelayedArray::DelayedArray()

How big is it now?

lobstr::obj_size(delayed_parquet)
#> 38.14 kB

Now let’s pull out some random indices from the two arrays, and compare the results:

set.seed(1)
x <- sample.int(n = nrow(in_mem), size = 5)
y <- sample.int(n = ncol(in_mem), size = 5)
in_mem[x, y]
#>            [,1]       [,2]       [,3]       [,4]       [,5]
#> [1,] -0.6700517  0.9356821 -1.0847204 -0.3044768  0.3201105
#> [2,]  0.3117883 -1.1637274  0.5206586  0.6136792  0.3863917
#> [3,]  0.5248609 -1.1065383 -0.4499743  1.1963352 -0.8393234
#> [4,] -0.4036493  1.7691156 -1.7958300 -0.7343334  0.2773608
#> [5,] -0.5505380  1.4433347  0.3846686 -1.5767486  2.5452338
delayed_parquet[x, y]
#> <5 x 5> matrix of class DelayedMatrix and type "double":
#>            [,1]       [,2]       [,3]       [,4]       [,5]
#> [1,] -0.6700517  0.9356821 -1.0847204 -0.3044768  0.3201105
#> [2,]  0.3117883 -1.1637274  0.5206586  0.6136792  0.3863917
#> [3,]  0.5248609 -1.1065383 -0.4499743  1.1963352 -0.8393234
#> [4,] -0.4036493  1.7691156 -1.7958300 -0.7343334  0.2773608
#> [5,] -0.5505380  1.4433347  0.3846686 -1.5767486  2.5452338
all.equal(
    in_mem[x, y],
    as.array(delayed_parquet[x, y])
)
#> [1] TRUE

The same results! The only difference is that delayed_parquet has a much smaller memory footprint.
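One detail about the comparison: all.equal() tests for near equality within a numeric tolerance rather than bit-for-bit identity. A minimal base R sketch of the difference, using made-up values that are unrelated to the matrices above:

```r
# Two 1 x 2 matrices whose first elements differ only by floating-point rounding.
a <- matrix(c(0.1 + 0.2, 1), nrow = 1)
b <- matrix(c(0.3, 1), nrow = 1)

strict <- identical(a, b)            # FALSE: 0.1 + 0.2 is not exactly 0.3 in double precision
tolerant <- isTRUE(all.equal(a, b))  # TRUE: the difference is far below the default tolerance
```

If you need a strict check, identical() on the realized matrices is an option, though it also requires attributes such as dimnames to match.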