This repository has been archived by the owner on Mar 31, 2019. It is now read-only.

Global operations

Jim Pivarski edited this page Apr 15, 2018 · 34 revisions

Context

Although OAMap lets you operate on data as objects in procedural code (e.g. for loops/if statements), the fact that data are stored as arrays lets us perform some operations on the whole dataset with better time complexity than if the data were stored as individual objects.

There are two major categories of these operations: (1) those that transform OAMap datasets into new OAMap datasets, usually sharing the majority of the data between new and old, and (2) those that produce small, non-OAMap outputs from an OAMap dataset, sharing nothing. The first type must be performed on some sort of centralized database, where symbolic links are valid without copying, while the second may be used to get data out of the database.

Within the first category, there are: (A) metadata-only operations, whose time complexity scales only with the size of the schema (they don't even need to look at any array data), (B) operations that need to read array data to check validity, but do not need to write array data, (C) operations that produce new arrays that must be written somewhere, and (D) operations that additionally require a user-defined function (functionals). Only type (D) requires just-in-time compilation.

Most of these operations must be applied to an entire OAMap dataset (a large object, usually a list) but modify only a substructure. To identify this substructure, we use "paths": strings that identify record fields within record fields using a directory-like syntax (/). In our Python implementation, we use the glob library, which allows wildcards (* and ?) for multiple matches (if applicable). This syntax does not distinguish between, for example, records of records and records of lists of records. The operations described below reject the wrong interpretation, so the correct one can be deduced from context; forcing users to spell out the distinction would likely only annoy them.
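As a rough illustration of how a wildcard path might select fields, the sketch below matches a pattern against the slash-separated paths of a nested schema. It uses Python's standard fnmatch module and a plain nested dict as a stand-in for a schema; the helpers field_paths and match_paths are hypothetical and are not part of OAMap.

```python
from fnmatch import fnmatchcase

def field_paths(schema, prefix=""):
    """Yield the slash-separated path of every field in a nested dict.

    The nested dict is only a stand-in for a record schema: keys are
    field names, and a dict value represents a nested record.
    """
    for name, value in sorted(schema.items()):
        path = prefix + name
        yield path
        if isinstance(value, dict):
            for sub in field_paths(value, path + "/"):
                yield sub

def match_paths(schema, pattern):
    """Return every field path that matches a glob pattern (* and ?)."""
    return [p for p in field_paths(schema) if fnmatchcase(p, pattern)]

schema = {"met": {"pt": "float64", "phi": "float64"},
          "muons": {"pt": "float64", "eta": "float64", "phi": "float64"}}

print(match_paths(schema, "muons/*"))   # every field of muons
print(match_paths(schema, "*/pt"))      # pt wherever it appears
print(match_paths(schema, "m??"))       # ? matches exactly one character
```

Note that in fnmatch, * also matches the / separator, so a real implementation may want to match one path component at a time.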

1. Operations that transform datasets into datasets

The following operations allow the user to define new datasets, possibly on a remote server without downloading. New and old datasets can share a substantial fraction of their data by linking, rather than copying, dramatically reducing the storage cost associated with customizing data.

In all cases, the old datasets remain available and unchanged after the operation: these are functional transformations of immutable values.

A. Metadata-only operations

The following operations scale only with the size of the schema; they do not even need to read the data.

fieldname

fieldname(data, path, newname)

Renames a field of a record.

Consider the following example:

>>> from oamap.schema import *
>>> data = (Record({"a": "int", "b": Record({"x": "bool", "y": List("int")})})
...         .fromdata({"a": 1, "b": {"x": True, "y": [1, 2, 3]}}))
>>> data
<Record at index 0>
>>> data.a
1
>>> data.b
<Record at index 0>
>>> data.b.x
True
>>> data.b.y
[1, 2, 3]

Let's replace them with more fun names:

>>> data = fieldname(data, "a", "awesome")
>>> data = fieldname(data, "b", "bodacious")
>>> data = fieldname(data, "bodacious/x", "xcellent")
>>> data = fieldname(data, "bodacious/y", "yippee")

>>> data.awesome
1
>>> data.bodacious
<Record at index 0>
>>> data.bodacious.xcellent
True
>>> data.bodacious.yippee
[1, 2, 3]
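To see why a rename need not touch array data, here is a minimal sketch in which a schema is a small dict mapping field names to array names, and the arrays live in a separate store. This is not OAMap's internal representation; rename_field, the dict layout, and the array names are assumptions made for illustration.

```python
def rename_field(schema, old, new):
    """Return a new schema with one field renamed; arrays are untouched."""
    if old not in schema:
        raise KeyError(old)
    if new in schema:
        raise ValueError("field %r already exists" % new)
    # Copy only the (small) schema; values are array *names*, not arrays.
    return {(new if name == old else name): array_name
            for name, array_name in schema.items()}

# The array store is shared by both schemas and is never read or copied.
arrays = {"object-Fa": [1], "object-Fb-Fx": [True]}
old_schema = {"a": "object-Fa", "x": "object-Fb-Fx"}
new_schema = rename_field(old_schema, "a", "awesome")

print(sorted(old_schema))   # the old schema is unchanged
print(sorted(new_schema))   # the new schema has the new name
print(arrays[new_schema["awesome"]] is arrays[old_schema["a"]])
```

The cost is proportional to the number of fields in the schema, regardless of how many entries the arrays hold.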

recordname

recordname(data, path, newname)

Renames a record, which affects how it is displayed and possibly how it behaves (if extensions are defined for a particular name, like "LorentzVector").

Continuing with the previous example:

>>> data = recordname(data, "", "Bill")             # top-level path is ""
>>> data = recordname(data, "bodacious", "Ted")     # using new field names

>>> data
<Bill at index 0>
>>> data.bodacious
<Ted at index 0>

project

project(data, path)

View a projection of the OAMap data at some path.

For example, consider the following data:

>>> data = (List(Record({
...         "met": Record({"pt": "float64",
...                        "phi": "float64"}),
...         "muons": List(Record({"pt": "float64",
...                               "eta": "float64",
...                               "phi": "float64"}))
...     }))
...     .fromdata([
...         {"met": {"pt": 10.1, "phi": 32.1},
...          "muons": [
...             {"pt": 1.1, "eta": 4.13, "phi": 22.2},
...             {"pt": 2.2, "eta": 4.13, "phi": 22.2},
...             {"pt": 3.3, "eta": 4.13, "phi": 22.2},
...         ]},
...         {"met": {"pt": 20.1, "phi": 32.1},
...          "muons": [
...         ]},
...         {"met": {"pt": 30.1, "phi": 32.1},
...          "muons": [
...             {"pt": 4.4, "eta": 4.13, "phi": 22.2},
...             {"pt": 5.5, "eta": 4.13, "phi": 22.2},
...         ]},
...     ]))

OAMap presents these as objects, which requires individual dereferencing to access:

>>> data
[<Record at index 0>, <Record at index 1>, <Record at index 2>]
>>> data[0].met
<Record at index 0>
>>> data[0].met.pt
10.1
>>> data[0].muons
[<Record at index 0>, <Record at index 1>, <Record at index 2>]
>>> data[0].muons[0].pt
1.1

But if we get rid of the pesky record structure, we can see through to the values more easily:

>>> project(data, "met")
[<Record at index 0>, <Record at index 1>, <Record at index 2>]

>>> project(data, "met/pt")
[10.1, 20.1, 30.1]

>>> project(data, "muons")
[[<Record at index 0>, <Record at index 1>, <Record at index 2>],
 [],
 [<Record at index 3>, <Record at index 4>]]

>>> project(data, "muons/pt")
[[1.1, 2.2, 3.3], [], [4.4, 5.5]]

This operation basically removes some of the abstraction that OAMap provides.
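A projection can be metadata-only because, in a columnar layout, each field is already a separate flat array and each list is a pair of offsets arrays. The sketch below uses plain Python lists as stand-ins for those arrays; the variable names and the jagged helper are hypothetical.

```python
def jagged(starts, stops, content):
    """Materialize a list of lists from start/stop offsets into content."""
    return [content[i:j] for i, j in zip(starts, stops)]

# Columnar stand-in for the events above: one offsets pair per event,
# and one flat array per field.
muon_starts = [0, 3, 3]
muon_stops = [3, 3, 5]
muon_pt = [1.1, 2.2, 3.3, 4.4, 5.5]
met_pt = [10.1, 20.1, 30.1]

print(met_pt)                                     # like project(data, "met/pt")
print(jagged(muon_starts, muon_stops, muon_pt))   # like project(data, "muons/pt")
```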

keep

keep(data, *paths)

Eliminate all but the paths specified in paths.

Example:

>>> data = List(Record({"good": "int",
...                     "goody": "float",
...                     "bad": List("bool"),
...                     "baddy": List("int")})).fromdata([
...            {"good": 1, "goody": 1.1, "bad": [], "baddy": []},
...            {"good": 2, "goody": 2.2, "bad": [True], "baddy": [1, 2, 3]},
...            {"good": 3, "goody": 3.3, "bad": [False], "baddy": [4]}
...        ])

>>> new = keep(data, "good*")
>>> new[0].good
1
>>> new[0].goody
1.1
>>> new[0].bad
Traceback (most recent call last):
AttributeError: 'Record' object has no attribute 'bad'
>>> new[0].baddy
Traceback (most recent call last):
AttributeError: 'Record' object has no attribute 'baddy'

drop

drop(data, *paths)

Eliminate only the paths specified in paths.

Example (different from above because the fields are nested in "x"):

>>> data = Record({"x": List(Record({"good": "int",
...                                  "goody": "float",
...                                  "bad": List("bool"),
...                                  "baddy": List("int")}))}).fromdata({"x": [
...            {"good": 1, "goody": 1.1, "bad": [], "baddy": []},
...            {"good": 2, "goody": 2.2, "bad": [True], "baddy": [1, 2, 3]},
...            {"good": 3, "goody": 3.3, "bad": [False], "baddy": [4]}
...        ]})

>>> new = drop(data, "x/bad*")
>>> new.x[0].good
1
>>> new.x[0].goody
1.1
>>> new.x[0].bad
Traceback (most recent call last):
AttributeError: 'Record' object has no attribute 'bad'
>>> new.x[0].baddy
Traceback (most recent call last):
AttributeError: 'Record' object has no attribute 'baddy'

split

split(data, *paths)

Splits a single list of records with fields identified by paths into separate lists, each representing just one field value.

For example, consider the following data:

>>> data = (List(Record({"muons":
...     List(Record({"pt": "float64", "eta": "float64", "phi": "float64"}))}))
...     .fromdata([
...         {"muons": [
...             {"pt": 3.14, "eta": 4.13, "phi": 22.2},
...             {"pt": 3.14, "eta": 4.13, "phi": 22.2},
...             {"pt": 3.14, "eta": 4.13, "phi": 22.2},
...         ]},
...         {"muons": [
...         ]},
...         {"muons": [
...             {"pt": 3.14, "eta": 4.13, "phi": 22.2},
...             {"pt": 3.14, "eta": 4.13, "phi": 22.2},
...         ]},
...     ]))

>>> for event in data:
...     print "event"
...     for muon in event.muons:
...         print "    muon", muon.pt, muon.eta, muon.phi
... 
event
    muon 3.14 4.13 22.2
    muon 3.14 4.13 22.2
    muon 3.14 4.13 22.2
event
event
    muon 3.14 4.13 22.2
    muon 3.14 4.13 22.2

This split call removes the "phi" field from "muons," promoting it to its own list.

>>> new = split(data, "muons/phi")

>>> for event in new:
...     print "new event"
...     for muon in event.muons:
...         print "    muon", muon.pt, muon.eta
...     for phi in event.phi:
...         print "    phi", phi
... 
new event
    muon 3.14 4.13
    muon 3.14 4.13
    muon 3.14 4.13
    phi 22.2
    phi 22.2
    phi 22.2
new event
new event
    muon 3.14 4.13
    muon 3.14 4.13
    phi 22.2
    phi 22.2

This is a purely metadata operation (though it couldn't be for rowwise data!). All that was changed was the schema:

>>> new.schema.show()
List(
  starts = 'object-B',
  stops = 'object-E',
  content = Record(
    fields = {
      'muons': List(
        starts = 'object-L-Fmuons-B',
        stops = 'object-L-Fmuons-E',
        content = Record(
          fields = {
            'eta': Primitive(dtype('float64'),
                       data='object-L-Fmuons-L-Feta-Df8'),
            'pt': Primitive(dtype('float64'),
                       data='object-L-Fmuons-L-Fpt-Df8')
          })
      ),
      'phi': List(
        starts = 'object-L-Fmuons-B',    # same arrays as the muon records,
        stops = 'object-L-Fmuons-E',     # in multiple places in the same schema
        content = Primitive(dtype('float64'),
                      data='object-L-Fmuons-L-Fphi-Df8')
      )
    })
)

Many fields can be split out of a record using the glob-pattern feature of paths:

>>> new2 = split(data, "muons/*")

>>> for event in new2:
...     print "new event"
...     for pt in event.pt:
...         print "    pt", pt
...     for eta in event.eta:
...         print "    eta", eta
...     for phi in event.phi:
...         print "    phi", phi
... 
new event
    pt 3.14
    pt 3.14
    pt 3.14
    eta 4.13
    eta 4.13
    eta 4.13
    phi 22.2
    phi 22.2
    phi 22.2
new event
new event
    pt 3.14
    pt 3.14
    eta 4.13
    eta 4.13
    phi 22.2
    phi 22.2
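The reason split needs no array I/O can be sketched with plain dicts: after the split, each field has its own list schema, but every one of them names the same offsets arrays. The dict layout, the array names, and the view helper below are assumptions for illustration, not OAMap internals.

```python
def view(schema, arrays):
    """Materialize a list-of-primitives schema as nested Python lists."""
    starts = arrays[schema["starts"]]
    stops = arrays[schema["stops"]]
    data = arrays[schema["data"]]
    return [data[i:j] for i, j in zip(starts, stops)]

arrays = {
    "muons-B": [0, 3, 3],
    "muons-E": [3, 3, 5],
    "muons-pt": [3.14] * 5,
    "muons-phi": [22.2] * 5,
}

# Both schemas name the same offsets arrays, so nothing is copied
# or rewritten; only the schemas themselves are new.
pt_list = {"starts": "muons-B", "stops": "muons-E", "data": "muons-pt"}
phi_list = {"starts": "muons-B", "stops": "muons-E", "data": "muons-phi"}

print(view(pt_list, arrays))
print(view(phi_list, arrays))
```

This only works because the data are columnar: in a rowwise layout, pt and phi would be interleaved and separating them would require rewriting the data.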

B. Operations that only need to read array data

The following operations scale with the size of the schema and the read time of the few arrays they need to read.

merge

merge(data, container, *paths)

Reverses the split operation by combining paths into a pre-existing or new container.

Continuing with the previous example, merge can put the "phi" values back into the "muons" records because their lists are identical. This particular case is a metadata-only operation because the list's starts/stops arrays have the same names and therefore must be identical. However, if the lists do not have the same names, the merge operation must read the arrays to verify that they are the same.

>>> undo = merge(new, "muons", "phi")

>>> for event in undo:
...     print "event"
...     for muon in event.muons:
...         print "    muon", muon.pt, muon.eta, muon.phi
... 
event
    muon 3.14 4.13 22.2
    muon 3.14 4.13 22.2
    muon 3.14 4.13 22.2
event
event
    muon 3.14 4.13 22.2
    muon 3.14 4.13 22.2
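The validity check described for merge might look roughly like the following: if two lists name the same offsets arrays, they are compatible by construction; otherwise the arrays must be read and compared. The same_lists helper, the dict layout, and the array names are hypothetical.

```python
def same_lists(a, b, arrays):
    """Decide whether two list schemas describe identical list boundaries.

    Returns (compatible, arrays_read): arrays_read is False when the
    decision could be made from the array names alone.
    """
    if (a["starts"], a["stops"]) == (b["starts"], b["stops"]):
        return True, False          # same names: metadata-only
    same = (arrays[a["starts"]] == arrays[b["starts"]] and
            arrays[a["stops"]] == arrays[b["stops"]])
    return same, True               # different names: had to read arrays

arrays = {
    "muons-B": [0, 3, 3], "muons-E": [3, 3, 5],
    "other-B": [0, 3, 3], "other-E": [3, 3, 5],
    "bad-B": [0, 2, 2],   "bad-E": [2, 2, 5],
}
muons = {"starts": "muons-B", "stops": "muons-E"}
phi = {"starts": "muons-B", "stops": "muons-E"}
other = {"starts": "other-B", "stops": "other-E"}
bad = {"starts": "bad-B", "stops": "bad-E"}

print(same_lists(muons, phi, arrays))     # same names, no read needed
print(same_lists(muons, other, arrays))   # different names, equal contents
print(same_lists(muons, bad, arrays))     # different boundaries: cannot merge
```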

C. Operations that also need to write array data

The following operations scale with the size of the schema and the read and write times of the few arrays involved.

mask

mask(data, at, low, high=None)

Replace values with None in data at path at that are either equal to low (when high is None) or between low and high (inclusive on both endpoints). NaN is handled correctly.

Example:

>>> data = (List(Record({"muons":
...     List(Record({"pt": "float64"}))}))
...     .fromdata([
...         {"muons": [
...             {"pt": 1.1},
...             {"pt": 2.2},
...             {"pt": -1000}
...         ]},
...         {"muons": [
...             {"pt": -1000},
...             {"pt": 4.4},
...             {"pt": 5.5}
...         ]}
...     ]))
>>> data
[<Record at index 0>, <Record at index 1>]
>>> project(data, "muons/pt")
[[1.1, 2.2, -1000.0], [-1000.0, 4.4, 5.5]]   # physicist used -1000 for "missing"

>>> new = mask(data, "muons/pt", -1000)
>>> new
[<Record at index 0>, <Record at index 1>]
>>> project(new, "muons/pt")
[[1.1, 2.2, None], [None, 4.4, 5.5]]         # None instead of -1000

The schema has been changed; new arrays have been added and distinguished from the old ones with a namespace.

>>> new.schema.show()
List(
  starts = 'object-B',
  stops = 'object-E',
  content = Record(
    fields = {
      'muons': List(
        starts = 'object-L-Fmuons-B',
        stops = 'object-L-Fmuons-E',
        content = Record(
          fields = {
            'pt': Primitive(dtype('float64'),
                            nullable=True,   # pt is now nullable
                            data='array-0',
                            mask='array-1',  # with a mask array
                            namespace='namespace-0')
          })
      )
    })
)

The actual data are in an in-memory dict that needs to be saved in the database (somehow).

>>> new._arrays
<__main__.DualSource object at 0x7aa1eba32550>
>>> new._arrays.new
{'array-1': array([ 0,  1, -1, -1,  4,  5], dtype=int32),
 'array-0': array([    1.1,     2.2, -1000. , -1000. ,     4.4,     5.5])}
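Judging from the arrays shown above, the mask array appears to hold each kept element's own index and -1 for each masked element, while the data array keeps its original values. The following sketch builds such a mask in pure Python under that reading. The build_mask and read helpers are hypothetical, and treating a NaN value of low as "match NaN entries" is an assumption about what correct NaN handling means here.

```python
import math

def build_mask(values, low, high=None):
    """Return a mask array: the element's own index if it is kept,
    -1 if it should read back as None."""
    def is_masked(x):
        if high is None:
            # Assumption: a NaN sentinel should match NaN values,
            # which a plain == comparison would never do.
            if math.isnan(low):
                return math.isnan(x)
            return x == low
        return low <= x <= high     # inclusive on both endpoints

    return [-1 if is_masked(x) else i for i, x in enumerate(values)]

def read(values, mask):
    """Read values back through the mask, yielding None where masked."""
    return [None if m < 0 else values[m] for m in mask]

pt = [1.1, 2.2, -1000.0, -1000.0, 4.4, 5.5]
mask = build_mask(pt, -1000)
print(mask)             # [0, 1, -1, -1, 4, 5]
print(read(pt, mask))   # [1.1, 2.2, None, None, 4.4, 5.5]
```

Only the mask is new information, so the original values remain recoverable from the data array.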

flatten

flatten(data, at="")

Turn a list of lists into a simple list at path at.

>>> data = List(List("int")).fromdata([[1, 2, 3], [], [4, 5]])
>>> data
[[1, 2, 3], [], [4, 5]]
>>> flatten(data)
[1, 2, 3, 4, 5]

>>> data2 = (Record({"x": List(List("int")), "y": "bool"})
...     .fromdata({"x": [[1, 2, 3], [], [4, 5]], "y": True}))
>>> 
>>> new = flatten(data2, at="x")
>>> new.x
[1, 2, 3, 4, 5]
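One way to picture why flatten writes new arrays but leaves the content alone: two levels of list offsets are combined into a single level that points into the same content array. The sketch below assumes the inner lists are stored contiguously and in order, which may not hold for every dataset; flatten_offsets is a hypothetical helper, not OAMap's implementation.

```python
def flatten_offsets(outer_starts, outer_stops, inner_starts, inner_stops):
    """Combine two levels of list offsets into one.

    Assumes the inner lists are stored contiguously and in order, so the
    items of outer list i run from the start of its first inner list to
    the stop of its last inner list.
    """
    starts, stops = [], []
    for i, j in zip(outer_starts, outer_stops):
        if i == j:                       # empty outer list
            starts.append(0)
            stops.append(0)
        else:
            starts.append(inner_starts[i])
            stops.append(inner_stops[j - 1])
    return starts, stops

# [[1, 2, 3], [], [4, 5]] stored as one outer list of three inner lists
content = [1, 2, 3, 4, 5]
inner_starts, inner_stops = [0, 3, 3], [3, 3, 5]
outer_starts, outer_stops = [0], [3]

starts, stops = flatten_offsets(outer_starts, outer_stops,
                                inner_starts, inner_stops)
print([content[i:j] for i, j in zip(starts, stops)])   # [[1, 2, 3, 4, 5]]
```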

D. Functionals: operations that depend on user-defined functions

The following operations scale with the size of the schema, the read time of all the arrays required by the user-defined function, and the compute time of the user-defined function.

In all such functions, the numba option has the following possible values:

  • True (default): compile with Numba if Numba is installed (import numba does not raise an ImportError). Default Numba options will be used, which include a fallback to partially compiled code if the user-defined function fails a type check.
  • dict of options: options to pass to Numba compilation. For example, {"nopython": True, "nogil": True} raises an error instead of falling back to partially compiled code, and releases Python's GIL during execution.
  • False or None: pure Python; do not attempt to compile with Numba, even if Numba is available. This can be faster for small datasets (where compilation overhead dominates runtime).
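The three cases above could be dispatched along these lines. This is a sketch, not OAMap's code: maybe_compile is a hypothetical helper, and what Numba's default options do depends on the installed Numba version.

```python
def maybe_compile(fcn, numba=True):
    """Return fcn, compiled with Numba when requested and available."""
    if numba is None or numba is False:
        return fcn                       # pure Python, never import Numba
    try:
        import numba as nb
    except ImportError:
        return fcn                       # Numba not installed: fall back
    if numba is True:
        return nb.jit(fcn)               # Numba's default options
    return nb.jit(**numba)(fcn)          # dict of options, e.g. nopython

def is_positive(x):
    return x > 0

plain = maybe_compile(is_positive, numba=False)
print(plain is is_positive)   # True: returned unchanged
print(plain(3.0))             # True
```

Passing a dict such as {"nopython": True, "nogil": True} would take the last branch, provided Numba is installed.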

filter

filter(data, fcn, args=(), at="", numba=True)

Removes objects from data that fail the fcn function with possible args arguments at path at. The return type of fcn must be boolean (checked when compiled with Numba). The numba option has the meaning described above.

Example of filtering events:

>>> data = (List(Record({"muons":
...     List(Record({"pt": "float64", "charge": "int8"}))}))
...     .fromdata([
...         {"muons": [
...             {"pt": 1.1, "charge":  1},
...             {"pt": 2.2, "charge": -1},
...             {"pt": 3.3, "charge":  1},
...         ]},
...         {"muons": [
...         ]},
...         {"muons": [
...             {"pt": 4.4, "charge": -1},
...             {"pt": 5.5, "charge": -1},
...         ]},
...     ]))
>>> data
[<Record at index 0>, <Record at index 1>, <Record at index 2>]
>>> project(data, "muons")
[[<Record at index 0>, <Record at index 1>, <Record at index 2>],
 [],
 [<Record at index 3>, <Record at index 4>]]

>>> new = filter(
...     data,
...     lambda event: len(event.muons) > 0)
>>> new
[<Record at index 0>, <Record at index 2>]
>>> project(new, "muons")
[[<Record at index 0>, <Record at index 1>, <Record at index 2>],
 [<Record at index 3>, <Record at index 4>]]

Example of filtering particles:

>>> new2 = filter(
...     data,
...     lambda muon: muon.charge > 0,
...     at="muons")
>>> new2
[<Record at index 0>, <Record at index 1>, <Record at index 2>]
>>> project(new2, "muons")
[[<Record at index 0>, <Record at index 2>], [], []]

Note that filtering without modifying the original is accomplished through pointers: the filtered data is a list of pointers to the data in the original list. Thus, a filtered dataset is actually an "event list," though this is transparent to the physicist. Compare the fully qualified schemas of data and new:

>>> data._generator.namedschema().show()   # internal: adds fully qualified names
List(
  starts = 'object-B',
  stops = 'object-E',
  content = Record(
    fields = {
      'muons': List(
        starts = 'object-L-Fmuons-B',
        stops = 'object-L-Fmuons-E',
        content = Record(
          fields = {
            'charge': Primitive(dtype('int8'),
                          data='object-L-Fmuons-L-Fcharge-Di1'),
            'pt': Primitive(dtype('float64'),
                          data='object-L-Fmuons-L-Fpt-Df8')
          })
      )
    })
)
>>> new.schema.show()
List(                               # new List (shorter than the original)
  starts = 'array-0',
  stops = 'array-1',
  namespace = 'namespace-0',
  content = Pointer(                # new Pointer (to contents of the original)
    positions = 'array-2',
    namespace = 'namespace-0',
    target = Record(
      fields = {
        'muons': List(
          starts = 'object-L-Fmuons-B',
          stops = 'object-L-Fmuons-E',
          content = Record(
            fields = {
              'charge': Primitive(dtype('int8'),
                            data='object-L-Fmuons-L-Fcharge-Di1'),
              'pt': Primitive(dtype('float64'),
                            data='object-L-Fmuons-L-Fpt-Df8')
            })
        )
      })
  )
)
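The pointer mechanism can be pictured with plain lists: filtering produces only a short array of positions, and reading the filtered list dereferences those positions into the untouched original arrays. The names below (filter_positions and the stand-in arrays) are hypothetical.

```python
def filter_positions(n_items, predicate):
    """Return the indexes of the items that pass the predicate."""
    return [i for i in range(n_items) if predicate(i)]

# Columnar stand-in for the events above: only the list boundaries are
# needed to evaluate len(event.muons) > 0.
muon_starts = [0, 3, 3]
muon_stops = [3, 3, 5]
muon_pt = [1.1, 2.2, 3.3, 4.4, 5.5]

positions = filter_positions(
    len(muon_starts),
    lambda i: muon_stops[i] - muon_starts[i] > 0)
print(positions)   # [0, 2]

# The filtered list dereferences positions into the original arrays.
print([muon_pt[muon_starts[p]:muon_stops[p]] for p in positions])
```

The original arrays are never modified, so the unfiltered dataset stays valid alongside the filtered one.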

define

define(data, fieldname, fcn, args=(), at="",
                              fieldtype=Primitive(numpy.float), numba=True)

Adds a field fieldname at path at by applying fcn function with possible args arguments to every object at path at. The return type of fcn must be fieldtype. The numba option has the meaning described above.

Example of defining a new event attribute:

>>> data = (List(Record({"muons":
...     List(Record({"pt": "float64", "eta": "float64", "phi": "float64"}))}))
...     .fromdata([
...         {"muons": [
...             {"pt": 3.14, "eta": 4.13, "phi": 22.2},
...             {"pt": 3.14, "eta": 4.13, "phi": 22.2},
...             {"pt": 3.14, "eta": 4.13, "phi": 22.2},
...         ]},
...         {"muons": [
...         ]},
...         {"muons": [
...             {"pt": 3.14, "eta": 4.13, "phi": 22.2},
...             {"pt": 3.14, "eta": 4.13, "phi": 22.2},
...         ]},
...     ]))
>>> new = define(
...     data,
...     "nummuons",
...     lambda event: len(event.muons),
...     fieldtype=Primitive("int64"))
>>> new
[<Record at index 0>, <Record at index 1>, <Record at index 2>]
>>> new[0].nummuons
3
>>> project(new, "nummuons")
[3, 0, 2]

Example of defining a new particle attribute:

>>> from math import sinh
>>> new2 = define(
...     data,
...     "pz",
...     lambda muon: muon.pt * sinh(muon.eta),
...     at="muons")
>>> new2
[<Record at index 0>, <Record at index 1>, <Record at index 2>]
>>> new2[0].muons[0].pz
97.59408888782299
>>> project(new2, "muons/pz")
[[97.59408888782299, 97.59408888782299, 97.59408888782299],
 [],
 [97.59408888782299, 97.59408888782299]]
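In columnar terms, defining a field amounts to computing one new flat array with one entry per object at the chosen path, so it lines up with the existing list offsets. The sketch below illustrates this with plain dicts of lists; define_column and its inputs argument are hypothetical and differ from the define signature above.

```python
from math import sinh

def define_column(columns, name, fcn, inputs):
    """Return a new column set with one extra array computed elementwise.

    The existing arrays are shared, not copied; only the new one is built.
    """
    new_array = [fcn(*row) for row in zip(*(columns[k] for k in inputs))]
    out = dict(columns)          # shallow copy: same array objects
    out[name] = new_array
    return out

muons = {"pt": [3.14] * 5, "eta": [4.13] * 5}
muons2 = define_column(muons, "pz",
                       lambda pt, eta: pt * sinh(eta), ["pt", "eta"])

print(len(muons2["pz"]))             # 5: one value per muon
print(muons2["pt"] is muons["pt"])   # True: old arrays are shared
print("pz" in muons)                 # False: original is unchanged
```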

2. Operations that return non-linked, non-OAMap data

The following operations export data from the OAMap schema in highly reduced forms (so that they can be easily downloaded, unlike the original OAMap data).

map

map(data, fcn, args=(), at="", names=None, numba=True)

Produces a flat table of data (Numpy recarray, which can be passed as a single argument to the pandas.DataFrame constructor) from an OAMap dataset data. The fcn function with possible args arguments is applied to every element at path at to produce a row of the table. If names are not supplied, they'll be Numpy recarray defaults (f0, f1, f2...); otherwise, names labels the columns. The numba option has the meaning described above.

fcn(datum, *args) -> row of a table as a tuple of numbers

(UNTESTED)
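The row-building contract can be sketched without OAMap or NumPy: apply fcn to each element, collect the returned tuples, and fall back to f0, f1, ... when no names are given. map_rows is a hypothetical pure-Python stand-in; it returns a list of tuples rather than a recarray.

```python
def map_rows(items, fcn, args=(), names=None):
    """Apply fcn to every item and collect the returned tuples as rows."""
    rows = [tuple(fcn(item, *args)) for item in items]
    width = len(rows[0]) if rows else 0
    if names is None:
        # Mimic NumPy's default record field names: f0, f1, f2, ...
        names = ["f%d" % i for i in range(width)]
    return list(names), rows

muons = [(1.1, 1), (2.2, -1), (3.3, 1)]     # (pt, charge) pairs
names, rows = map_rows(muons, lambda m, scale: (m[0] * scale, m[1]),
                       args=(2.0,))
print(names)   # ['f0', 'f1']
print(rows)
```

With NumPy installed, something like numpy.rec.fromrecords(rows, names=names) should turn these rows into a recarray.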

reduce

reduce(data, tally, fcn, args=(), at="", numba=True)

Aggregates data by repeatedly applying the fcn function to the tally with possible args arguments at every element at path at, expecting a new tally in return. The numba option has the meaning described above.

fcn(datum, tally, *args) -> new tally

Intended for histogramming, though summation, max/min, averaging, etc. are also possible with arguments like this:

reduce(data, 0, lambda x, tally: x + tally, at="muons/pt")

The initial tally value (0 above), the second argument of fcn, and the return value of fcn must agree in data type (explicitly tested for Numba-compiled functions).

(UNTESTED)
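The tally-passing convention can be illustrated in pure Python, with the datum as the first argument and the tally as the second, as in the signature above. reduce_tally and fill are hypothetical helpers; the histogram keeps its counts in an immutable tuple so that each step returns a new tally.

```python
def reduce_tally(items, tally, fcn, args=()):
    """Fold fcn(datum, tally, *args) -> new tally over every item."""
    for datum in items:
        tally = fcn(datum, tally, *args)
    return tally

pts = [1.1, 2.2, 3.3, 4.4, 5.5]

# Initial tallies are floats so that they agree in type with the data.
total = reduce_tally(pts, 0.0, lambda x, tally: x + tally)
largest = reduce_tally(pts, float("-inf"), lambda x, tally: max(x, tally))

def fill(x, counts, low, width):
    """Histogram step: return a new tuple of counts with one bin incremented."""
    i = int((x - low) // width)
    if 0 <= i < len(counts):
        counts = counts[:i] + (counts[i] + 1,) + counts[i + 1:]
    return counts

# Three bins of width 2.0 starting at 0.0: [0, 2), [2, 4), [4, 6)
hist = reduce_tally(pts, (0, 0, 0), fill, args=(0.0, 2.0))
print(total, largest, hist)
```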
