Global operations
Although OAMap lets you operate on data as objects in procedural code (e.g. for loops/if statements), the fact that data are stored as arrays lets us perform some operations on the whole dataset with better time complexity than if the data were stored as individual objects.
There are two major categories of these operations: (1) those that transform OAMap datasets into new OAMap datasets, usually sharing the majority of their data, and (2) those that produce small, non-OAMap outputs from an OAMap dataset, sharing nothing. The first type must be performed on some sort of centralized database, where symbolic links are valid without copying, while the second may be used to get data out of the database.
Within the first category, there are: (A) metadata-only operations, which scale only with the size of the schema and don't even need to look at any array data, (B) operations that need to read array data to check validity, but do not need to write array data, (C) operations that produce new arrays that must be written somewhere, and (D) operations that additionally require a user-defined function (functionals). Only type (D) requires just-in-time compilation.
Most of these operations must be applied to an entire OAMap dataset (a large object, usually a list) but modify only a substructure. To identify this substructure, we use "paths": strings that identify record fields within record fields using a directory-like syntax (/). In our Python implementation, we use the glob library, which allows wildcards (* and ?) for multiple matches (if applicable). This syntax ignores distinctions between, for example, records of records and records of lists of records. The operations described below do not allow the wrong choice, so the intended structure can be deduced from context as the one choice that is not an error. Forcing users to make such distinctions explicitly would likely only annoy them.
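The wildcard behavior described above can be sketched with Python's standard fnmatch module, which implements the same glob-style pattern rules. This is only an illustration: the field paths and the match_paths helper are hypothetical, and OAMap's own path resolution may differ in detail.

```python
from fnmatch import fnmatchcase

# Hypothetical field paths for a dataset of events with muons and jets.
FIELD_PATHS = [
    "muons/pt",
    "muons/eta",
    "muons/phi",
    "jets/pt",
    "jets/eta",
    "met",
]

def match_paths(pattern, paths=FIELD_PATHS):
    """Return the paths selected by a glob-style pattern, in their original order."""
    # fnmatchcase is case-sensitive; "*" matches any run of characters
    # (including "/"), and "?" matches exactly one character.
    return [p for p in paths if fnmatchcase(p, pattern)]

all_muon_fields = match_paths("muons/*")    # every field under muons
all_pt_fields = match_paths("*/pt")         # the pt field of every collection
two_letter_p = match_paths("muons/p?")      # "pt" matches, "phi" does not
```

Note that a pattern without wildcards selects at most one path, which is the common case for operations that act on a single substructure.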
The following operations allow the user to define new datasets, possibly on a remote server without downloading. New and old datasets can share a substantial fraction of their data by linking, rather than copying, dramatically reducing the storage cost associated with customizing data.
In all cases, the old datasets remain available and unchanged after the operation: these are functional transformations of immutable values.
The following operations scale only with the size of the schema; they do not even need to read the data.
fieldname(data, path, newname)
recordname(data, path, newname)
project(data, path)
keep(data, *paths)
drop(data, *paths)
split(data, *paths)
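To see why operations like these can scale with the schema alone, consider a toy model in which a schema is a mapping from field paths to array names and the arrays live in a separate store. The sketch below is not OAMap's implementation: the schema layout and the rename_field and drop_fields helpers are hypothetical stand-ins chosen only to show that a new dataset can be defined without touching or copying array data.

```python
# Toy array store: each entry stands in for a large column of data.
arrays = {
    "a0": [10.5, 22.1, 31.7],
    "a1": [0.1, -1.2, 2.3],
}

# Toy schema: field path -> name of the array that holds its values.
schema = {"muons/pt": "a0", "muons/eta": "a1"}

def rename_field(schema, path, newname):
    """Return a new schema in which the last component of path is newname."""
    parent, _, _ = path.rpartition("/")
    newpath = f"{parent}/{newname}" if parent else newname
    # Only the mapping is rebuilt; array references are reused as-is.
    return {(newpath if p == path else p): ref for p, ref in schema.items()}

def drop_fields(schema, *paths):
    """Return a new schema without the given field paths."""
    return {p: ref for p, ref in schema.items() if p not in paths}

renamed = rename_field(schema, "muons/pt", "transverse_momentum")
trimmed = drop_fields(schema, "muons/eta")
```

In this model the cost of each call depends on the number of fields, not on the length of the arrays, and the original schema is left unchanged, matching the functional, immutable style described above.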
The following operations scale with the size of the schema and the read time of the few arrays they need to read.
merge(data, container, *paths)
The following operations scale with the size of the schema and the read and write times of the few arrays involved.
mask(data, path, low, high=None)
flatten(data, at="")
The following operations scale with the size of the schema, the read time of all the arrays required by the user-defined function, and the compute time of the user-defined function.
filter(data, fcn, args=(), at="", numba=True)
define(data, fieldname, fcn, args=(), at="", fieldtype=Primitive(float), numba=True)
The following operations export data from the OAMap schema in highly reduced forms (so that they can be easily downloaded, unlike the original OAMap data).
map(data, fcn, args=(), at="", names=None, numba=True)
Produces a flat table of data (a Numpy recarray, which can be passed as a single argument to the pandas.DataFrame constructor) from an OAMap dataset data. The fcn function, with optional args arguments, is applied to every element at path at to produce a row of the table. If names are not supplied, the columns get the Numpy recarray defaults (f0, f1, f2...); otherwise, names labels the columns. The numba option has the same meaning as above.
fcn(datum, *args) -> row of a table as a tuple of numbers
(UNTESTED)
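The row-building behavior of map can be modeled in plain Python as follows. This is a minimal sketch of the semantics described above, not the library's code: map_rows is a hypothetical helper, the data are a plain list of dicts rather than an OAMap dataset, there is no at path or Numba compilation, and the result is a pair of column names and row tuples rather than a Numpy recarray.

```python
def map_rows(data, fcn, args=(), names=None):
    """Apply fcn(datum, *args) to each element and collect the tuples as rows."""
    rows = [tuple(fcn(datum, *args)) for datum in data]
    width = len(rows[0]) if rows else 0
    if names is None:
        # Mimic the recarray default column names: f0, f1, f2, ...
        names = [f"f{i}" for i in range(width)]
    if rows and len(names) != width:
        raise ValueError("number of names must match the number of columns")
    return list(names), rows

# Hypothetical stand-in for the elements found at some path.
muons = [{"pt": 10.0, "eta": 0.5}, {"pt": 25.0, "eta": -1.0}]

def scaled_pt_and_eta(muon, scale):
    return (muon["pt"] * scale, muon["eta"])

default_names, rows = map_rows(muons, scaled_pt_and_eta, args=(2.0,))
custom_names, _ = map_rows(muons, scaled_pt_and_eta, args=(2.0,),
                           names=["pt2", "eta"])
```

A result in this shape could be handed to pandas as pandas.DataFrame(rows, columns=names), which plays the role that the recarray plays for the real map.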
reduce(data, tally, fcn, args=(), at="", numba=True)
Aggregates data by repeatedly applying the fcn function to the tally, with optional args arguments, at every element at path at, expecting a new tally in return. The numba option has the same meaning as above.
fcn(datum, tally, *args) -> new tally
Intended for histogramming, though summation, max/min, averaging, etc. are also possible with arguments like this:
reduce(data, 0, lambda x, tally: x + tally, at="muons/pt")
The initial tally value (0 above), the second argument of fcn, and the return value of fcn must agree in data type (explicitly tested for Numba-compiled functions).
(UNTESTED)
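The tally-passing pattern used by reduce can be illustrated with a plain-Python loop. The sketch below is an assumption-laden model rather than the library's implementation: reduce_sketch and fill are hypothetical names, the data are a plain list of floats rather than an OAMap dataset, and nothing is compiled. It uses 0.0 as the initial tally for a sum of floats and a tuple of integer counts for a histogram, so that the initial tally, the tally argument, and the return value share one type at every step.

```python
def reduce_sketch(data, tally, fcn, args=()):
    """Fold fcn(datum, tally, *args) over data, returning the final tally."""
    for datum in data:
        tally = fcn(datum, tally, *args)
    return tally

def fill(pt, counts, low, high):
    """Return a new tuple of bin counts with the bin containing pt incremented."""
    nbins = len(counts)
    if low <= pt < high:
        i = int((pt - low) / (high - low) * nbins)
        return counts[:i] + (counts[i] + 1,) + counts[i + 1:]
    return counts  # values outside [low, high) are ignored

# Hypothetical stand-in for the values found at a path such as "muons/pt".
pts = [5.0, 12.5, 17.0, 33.0, 48.0, 75.0]

# Summation: float tally in, float tally out.
total = reduce_sketch(pts, 0.0, lambda x, tally: x + tally)

# Histogramming: five bins of width 10 covering [0, 50).
hist = reduce_sketch(pts, (0, 0, 0, 0, 0), fill, args=(0.0, 50.0))
```

Because fill returns a new tuple instead of modifying its input, each step produces a fresh tally, in keeping with the fcn(datum, tally, *args) -> new tally contract.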