This repository has been archived by the owner on Mar 31, 2019. It is now read-only.

Global operations

Jim Pivarski edited this page Apr 14, 2018 · 34 revisions

Context

Although OAMap lets you operate on data as objects in procedural code (e.g. for loops/if statements), the fact that data are stored as arrays lets us perform some operations on the whole dataset with better time complexity than if the data were stored as individual objects.
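To make the time-complexity claim concrete, here is a minimal sketch of a columnar layout. It is not OAMap's actual internal format: the names `pt_content`, `offsets`, `muons_per_event`, and `event_pts` are hypothetical and chosen only for illustration. It shows how a whole-dataset question (how many muons are in each event?) can be answered from a small offsets array without visiting any individual object.

```python
# Hypothetical columnar layout (an assumption, not OAMap's real format):
# a list of events, each holding a variable-length list of muon pt values,
# stored as one flat content array plus an offsets array.
pt_content = [3.14, 2.71, 1.62, 9.99, 0.58]  # all muon pt values, concatenated
offsets = [0, 3, 3, 5]  # event i owns pt_content[offsets[i]:offsets[i + 1]]


def muons_per_event(offsets):
    """Count muons per event from the offsets alone, never reading pt_content."""
    return [stop - start for start, stop in zip(offsets[:-1], offsets[1:])]


def event_pts(i):
    """Object-like view of one event, materialized only on demand."""
    return pt_content[offsets[i]:offsets[i + 1]]


counts = muons_per_event(offsets)
```

The per-event view remains available through `event_pts`, but the count never needs it.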

There are two major categories of these operations: (1) those that transform OAMap datasets into new OAMap datasets, usually sharing the majority of their data, and (2) those that produce small, non-OAMap outputs from an OAMap dataset, sharing nothing. The first type must be performed on some sort of centralized database, where symbolic links are valid without copying, while the second may be used to get data out of the database.

Within the first category, there are: (A) metadata-only operations, which scale only with the size of the schema and never need to look at array data, (B) operations that need to read array data to check validity, but do not need to write array data, (C) operations that produce new arrays that must be written somewhere, and (D) operations that additionally require a user-defined function (functionals). Only type (D) requires just-in-time compilation.

Most of these operations must be applied to an entire OAMap dataset (a large object, usually a list) but modify only a substructure. To identify this substructure, we use "paths": strings that identify record fields within record fields using a directory-like syntax (/). In our Python implementation, we use the glob library, which allows wildcards (* and ?) to match multiple fields where applicable. This syntax does not distinguish between, for example, records of records and records of lists of records. That is deliberate: the operations described below reject the erroneous interpretation, so the intended one can be deduced from context, and forcing users to spell out the distinction would likely only annoy them.
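As a rough illustration of wildcard paths, the sketch below matches patterns against a flattened list of field paths. It uses the standard library's `fnmatch` module, which implements the same `*` and `?` wildcard rules that `glob` relies on, as a stand-in; how OAMap itself resolves paths is not shown here. The `field_paths` list and the `match_paths` helper are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical flattened field paths for a schema with a "muons" list of
# records (pt, eta, phi) and a top-level "met" field.
field_paths = ["muons/pt", "muons/eta", "muons/phi", "met"]


def match_paths(pattern, paths):
    """Return the paths selected by a glob-style pattern (* and ? wildcards)."""
    return [p for p in paths if fnmatchcase(p, pattern)]


all_muon_fields = match_paths("muons/*", field_paths)  # every field under muons
p_fields = match_paths("muons/p?", field_paths)        # "p" plus exactly one character
```

Note that the path `muons/pt` reads the same whether `muons` is a record or a list of records, which is the ambiguity the paragraph above describes.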

1. Operations that transform datasets into datasets

The following operations allow the user to define new datasets, possibly on a remote server without downloading. New and old datasets can share a substantial fraction of their data by linking, rather than copying, dramatically reducing the storage cost associated with customizing data.

In all cases, the old datasets remain available and unchanged after the operation: these are functional transformations of immutable values.

A. Metadata-only operations

The following operations scale only with the size of the schema; they do not even need to read the data.
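The idea of a metadata-only operation can be sketched with a toy model in which a schema is a mapping from field names to array keys. This is an assumption made for illustration, not OAMap's real data model, and `rename_field` is a hypothetical helper rather than the library's `fieldname`. Renaming a field builds a new schema, leaves the old one intact, and points both at the same array.

```python
# Hypothetical stand-in for a dataset (not OAMap's real data model):
# a schema mapping field names to array keys, plus a store of arrays.
arrays = {"arr-0": [3.14, 2.71, 1.62], "arr-1": [4.13, 0.5, -1.2]}
schema = {"pt": "arr-0", "eta": "arr-1"}


def rename_field(schema, oldname, newname):
    """Return a new schema with one field renamed; the input is left unchanged."""
    if oldname not in schema:
        raise KeyError(oldname)
    if newname in schema:
        raise ValueError(f"field {newname!r} already exists")
    # Only the small schema mapping is rebuilt; no array is read or copied.
    return {(newname if name == oldname else name): key
            for name, key in schema.items()}


new_schema = rename_field(schema, "pt", "pT")
```

The cost depends on the number of fields in the schema, not on the length of the arrays.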

fieldname

fieldname(data, path, newname)

recordname

recordname(data, path, newname)

project

project(data, path)

keep

keep(data, *paths)

drop

drop(data, *paths)

split

split(data, *paths)

B. Operations that only need to read array data

The following operations scale with the size of the schema and the read time of the few arrays they need to read.

merge

merge(data, container, *paths)

C. Operations that also need to write array data

The following operations scale with the size of the schema and the read and write times of the few arrays involved.

mask

mask(data, path, low, high=None)

flatten

flatten(data, at="")

D. Functionals: operations that depend on user-defined functions

The following operations scale with the size of the schema, the read time of all the arrays required by the user-defined function, and the compute time of the user-defined function.

filter

filter(data, fcn, args=(), at="", numba=True)

define

define(data, fieldname, fcn, args=(), at="",
       fieldtype=Primitive(numpy.float), numba=True)

Adds a field named fieldname at path at by applying the function fcn, with optional extra arguments args, to every object at that path. The return type of fcn must match fieldtype. The numba option has the same meaning as in filter.

>>> data = (List(Record({"muons":
...     List(Record({"pt": "float64", "eta": "float64", "phi": "float64"}))}))
...     .fromdata([
...         {"muons": [
...             {"pt": 3.14, "eta": 4.13, "phi": 22.2},
...             {"pt": 3.14, "eta": 4.13, "phi": 22.2},
...             {"pt": 3.14, "eta": 4.13, "phi": 22.2},
...         ]},
...         {"muons": [
...         ]},
...         {"muons": [
...             {"pt": 3.14, "eta": 4.13, "phi": 22.2},
...             {"pt": 3.14, "eta": 4.13, "phi": 22.2},
...         ]},
...     ]))
>>> new = define(
...     data,
...     "nummuons",
...     lambda event: len(event.muons),
...     fieldtype=Primitive("int64"))
>>> new
[<Record at index 0>, <Record at index 1>, <Record at index 2>]
>>> new[0].nummuons
3
>>> project(new, "nummuons")
[3, 0, 2]
>>> from math import sinh
>>> new2 = define(
...     data,
...     "pz",
...     lambda muon: muon.pt * sinh(muon.eta),
...     at="muons")
>>> new2
[<Record at index 0>, <Record at index 1>, <Record at index 2>]
>>> project(new2, "muons/pz")
[[97.59408888782299, 97.59408888782299, 97.59408888782299],
 [],
 [97.59408888782299, 97.59408888782299]]

2. Operations that return non-linked, non-OAMap data

The following operations export data from the OAMap schema in highly reduced forms (so that they can be easily downloaded, unlike the original OAMap data).

map

map(data, fcn, args=(), at="", names=None, numba=True)

Produces a flat table (a Numpy recarray, which can be passed as a single argument to the pandas.DataFrame constructor) from an OAMap dataset data. The function fcn, with optional extra arguments args, is applied to every element at path at, and each call produces one row of the table. If names is not supplied, the columns take the Numpy recarray defaults (f0, f1, f2, ...); otherwise, names labels the columns. The numba option has the same meaning as above.

fcn(datum, *args) -> row of a table as a tuple of numbers

(UNTESTED)
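The row-per-element behavior described for map can be sketched in plain Python. This is not the OAMap implementation: `map_rows`, `Muon`, and `events` are hypothetical, the data are ordinary Python objects, and the result is a list of tuples rather than a Numpy recarray. Iterating over every muon in every event plays the role of at="muons".

```python
from collections import namedtuple

# Hypothetical nested data as plain Python objects, standing in for an OAMap dataset.
Muon = namedtuple("Muon", ["pt", "eta"])
events = [[Muon(3.0, 0.5), Muon(1.5, -1.0)], [], [Muon(7.25, 2.0)]]


def map_rows(elements, fcn, args=(), names=None):
    """Apply fcn(datum, *args) to each element and collect the tuples as table rows."""
    rows = [tuple(fcn(datum, *args)) for datum in elements]
    width = len(rows[0]) if rows else 0
    if names is None:
        # Mimic the default column names described above (f0, f1, f2, ...).
        names = [f"f{i}" for i in range(width)]
    return list(names), rows


# Flatten across events so that fcn sees one muon at a time.
all_muons = (muon for event in events for muon in event)
names, rows = map_rows(all_muons, lambda m, scale: (m.pt * scale, m.eta), args=(2.0,))
```

Empty events contribute no rows, so the table has one row per muon rather than one per event.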

reduce

reduce(data, tally, fcn, args=(), at="", numba=True)

Aggregates data by applying the function fcn to every element at path at, together with the current tally and optional extra arguments args. Each call returns a new tally, which is passed to the next call. The numba option has the same meaning as above.

fcn(datum, tally, *args) -> new tally

Intended for histogramming, though summation, max/min, averaging, etc. are also possible with arguments like this:

reduce(data, 0, lambda x, tally: x + tally, at="muons/pt")

The initial tally value (0 above), the second argument of fcn, and the return value of fcn must agree in data type (explicitly tested for Numba-compiled functions).

(UNTESTED)
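Since reduce is intended for histogramming, here is a minimal sketch of a histogram-filling fcn with the fcn(datum, tally, *args) signature given above. It runs on a plain Python list through `functools.reduce` rather than through OAMap; `fill_histogram`, `reduce_sketch`, and the `pts` values are hypothetical. The tally is a tuple of bin counts, so the initial tally, the second argument of fcn, and its return value all share one type.

```python
from functools import reduce as py_reduce


def fill_histogram(pt, tally, low, high):
    """fcn(datum, tally, *args) -> new tally, where the tally is a tuple of bin counts."""
    nbins = len(tally)
    if not (low <= pt < high):
        return tally  # ignore underflow and overflow
    index = int((pt - low) / (high - low) * nbins)
    return tally[:index] + (tally[index] + 1,) + tally[index + 1:]


def reduce_sketch(elements, tally, fcn, args=()):
    """Thread the tally through fcn for every element, in order."""
    # functools.reduce passes (accumulator, item); swap to match fcn(datum, tally, *args).
    return py_reduce(lambda acc, datum: fcn(datum, acc, *args), elements, tally)


pts = [3.14, 0.5, 7.9, 12.0, 3.9]  # hypothetical values found at "muons/pt"
counts = reduce_sketch(pts, (0, 0, 0, 0), fill_histogram, args=(0.0, 8.0))
total = reduce_sketch(pts, 0.0, lambda x, tally: x + tally)
```

Here `counts` holds four bins of width 2 over the range from 0 to 8, and `total` reuses the same machinery for the summation example.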
