Global operations
Although OAMap lets you operate on data as objects in procedural code (e.g. for loops/if statements), the fact that data are stored as arrays lets us perform some operations on the whole dataset with better time complexity than if the data were stored as individual objects.
There are two major categories of these operations: (1) those that transform OAMap datasets into new OAMap datasets, usually sharing the majority of the data between new and old, and (2) those that produce small, non-OAMap outputs from an OAMap dataset, sharing nothing. The first type must be performed on some sort of centralized database, where symbolic links are valid without copying, while the second may be used to get data out of the database.
Within the first category, there are: (A) metadata-only operations, whose time complexity scales only with the size of the schema (they don't even need to look at any array data), (B) operations that need to read array data to check validity but do not need to write array data, (C) operations that produce new arrays that must be written somewhere, and (D) operations that additionally require a user-defined function (functionals). Only type (D) requires just-in-time compilation.
Most of these operations must be applied to an entire OAMap dataset (a large object, usually a list) but modify only a substructure. To identify this substructure, we use "paths": strings that identify record fields within record fields using a directory-like syntax (/). In our Python implementation, we use the glob library, which allows wildcards (* and ?) for multiple matches (if applicable). This syntax ignores distinctions between, for example, records of records and records of lists of records. The operations described below reject the wrong choice anyway, so the intended structure can be deduced from context; forcing users to spell out such distinctions would likely only annoy them.
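As a rough illustration of how wildcard paths select fields, the sketch below matches patterns against a flat list of field paths using the standard-library fnmatch module. It is not OAMap's implementation: the field_paths list and the match_paths helper are assumptions made for this example. Note that in fnmatch, * also matches across /, which may differ from the library's behavior.

```python
from fnmatch import fnmatchcase

# Hypothetical flat listing of the field paths in a schema (illustration only).
field_paths = ["met/pt", "met/phi", "muons/pt", "muons/eta", "muons/phi"]

def match_paths(pattern, paths):
    """Return the paths selected by a glob-style pattern, in their original order."""
    return [p for p in paths if fnmatchcase(p, pattern)]

print(match_paths("muons/*", field_paths))   # every field of "muons"
print(match_paths("*/pt", field_paths))      # the "pt" field of every record
print(match_paths("met/p??", field_paths))   # "?" matches exactly one character
```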
The following operations allow the user to define new datasets, possibly on a remote server without downloading. New and old datasets can share a substantial fraction of their data by linking, rather than copying, dramatically reducing the storage cost associated with customizing data.
In all cases, the old datasets remain available and unchanged after the operation: these are functional transformations of immutable values.
The following operations scale only with the size of the schema; they do not even need to read the data.
fieldname(data, path, newname)
Renames a field at path of a record to newname in OAMap dataset data.
Consider the following example:
>>> from oamap.schema import *
>>> data = (Record({"a": "int", "b": Record({"x": "bool", "y": List("int")})})
... .fromdata({"a": 1, "b": {"x": True, "y": [1, 2, 3]}}))
>>> data
<Record at index 0>
>>> data.a
1
>>> data.b
<Record at index 0>
>>> data.b.x
True
>>> data.b.y
[1, 2, 3]
Let's replace them with more fun names:
>>> data = fieldname(data, "a", "awesome")
>>> data = fieldname(data, "b", "bodacious")
>>> data = fieldname(data, "bodacious/x", "xcellent")
>>> data = fieldname(data, "bodacious/y", "yippee")
>>> data.awesome
1
>>> data.bodacious
<Record at index 0>
>>> data.bodacious.xcellent
True
>>> data.bodacious.yippee
[1, 2, 3]
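To see why a rename can be a metadata-only operation, here is a minimal sketch using a toy model in which the schema is a dict from field names to array names. This is not OAMap's schema class; the schema and arrays dicts and the rename_field helper are assumptions for illustration. The point is that the rename builds a new, small schema while the arrays are shared and never read.

```python
# Toy model (not OAMap's classes): the schema maps field names to array names,
# and the arrays live in a separate dict that a rename never touches.
arrays = {"array-a": [1, 2, 3], "array-x": [True, False, True]}
schema = {"a": "array-a", "x": "array-x"}

def rename_field(schema, oldname, newname):
    """Return a new schema with one field renamed; array names are unchanged."""
    if oldname not in schema:
        raise KeyError(oldname)
    return {(newname if name == oldname else name): array_name
            for name, array_name in schema.items()}

new_schema = rename_field(schema, "a", "awesome")
print(sorted(new_schema))
# Old and new schemas point at the very same array object (shared, not copied).
print(arrays[new_schema["awesome"]] is arrays[schema["a"]])
```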
recordname(data, path, newname)
Renames a record at path to newname in OAMap dataset data. This affects how it is displayed and possibly how it behaves (if extensions are defined for a particular name, like "LorentzVector").
Continuing with the previous example:
>>> data = recordname(data, "", "Bill") # top-level path is ""
>>> data = recordname(data, "bodacious", "Ted") # using new field names
>>> data
<Bill at index 0>
>>> data.bodacious
<Ted at index 0>
project(data, path)
View a projection of the OAMap data at some path.
For example, consider the following data:
>>> data = (List(Record({
... "met": Record({"pt": "float64",
... "phi": "float64"}),
... "muons": List(Record({"pt": "float64",
... "eta": "float64",
... "phi": "float64"}))
... }))
... .fromdata([
... {"met": {"pt": 10.1, "phi": 32.1},
... "muons": [
... {"pt": 1.1, "eta": 4.13, "phi": 22.2},
... {"pt": 2.2, "eta": 4.13, "phi": 22.2},
... {"pt": 3.3, "eta": 4.13, "phi": 22.2},
... ]},
... {"met": {"pt": 20.1, "phi": 32.1},
... "muons": [
... ]},
... {"met": {"pt": 30.1, "phi": 32.1},
... "muons": [
... {"pt": 4.4, "eta": 4.13, "phi": 22.2},
... {"pt": 5.5, "eta": 4.13, "phi": 22.2},
... ]},
... ]))
OAMap presents these as objects, which requires individual dereferencing to access:
>>> data
[<Record at index 0>, <Record at index 1>, <Record at index 2>]
>>> data[0].met
<Record at index 0>
>>> data[0].met.pt
10.1
>>> data[0].muons
[<Record at index 0>, <Record at index 1>, <Record at index 2>]
>>> data[0].muons[0].pt
1.1
But if we get rid of the pesky record structure, we can see through to the values more easily:
>>> project(data, "met")
[<Record at index 0>, <Record at index 1>, <Record at index 2>]
>>> project(data, "met/pt")
[10.1, 20.1, 30.1]
>>> project(data, "muons")
[[<Record at index 0>, <Record at index 1>, <Record at index 2>],
[],
[<Record at index 3>, <Record at index 4>]]
>>> project(data, "muons/pt")
[[1.1, 2.2, 3.3], [], [4.4, 5.5]]
This operation basically removes some of the abstraction that OAMap provides.
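The reason a projection is cheap can be sketched with plain NumPy arrays. The layout below is an assumption made for illustration (one starts/stops pair for the muon lists and one flat array per field); it is not taken from OAMap's internals. Projecting "muons/pt" then amounts to slicing the existing pt array with the existing offsets.

```python
import numpy

# Hypothetical columnar layout for the "muons" list above (illustration only):
# one starts/stops pair delimits each event's muons within the content arrays.
starts = numpy.array([0, 3, 3])
stops = numpy.array([3, 3, 5])
muon_pt = numpy.array([1.1, 2.2, 3.3, 4.4, 5.5])

def jagged_view(starts, stops, content):
    """Slice a flat content array into one Python list per event."""
    return [content[a:b].tolist() for a, b in zip(starts, stops)]

print(jagged_view(starts, stops, muon_pt))
```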
keep(data, *paths)
Eliminate all but the paths specified in paths.
Example:
>>> data = List(Record({"good": "int",
... "goody": "float",
... "bad": List("bool"),
... "baddy": List("int")})).fromdata([
... {"good": 1, "goody": 1.1, "bad": [], "baddy": []},
... {"good": 2, "goody": 2.2, "bad": [True], "baddy": [1, 2, 3]},
... {"good": 3, "goody": 3.3, "bad": [False], "baddy": [4]}
... ])
>>> new = keep(data, "good*")
>>> new[0].good
1
>>> new[0].goody
1.1
>>> new[0].bad
AttributeError: 'Record' object has no attribute 'bad'
>>> new[0].baddy
Traceback (most recent call last):
AttributeError: 'Record' object has no attribute 'baddy'
drop(data, *paths)
Eliminate only the paths specified in paths.
Example (different from above because the records are nested in "x"):
>>> data = Record({"x": List(Record({"good": "int",
... "goody": "float",
... "bad": List("bool"),
... "baddy": List("int")}))}).fromdata({"x": [
... {"good": 1, "goody": 1.1, "bad": [], "baddy": []},
... {"good": 2, "goody": 2.2, "bad": [True], "baddy": [1, 2, 3]},
... {"good": 3, "goody": 3.3, "bad": [False], "baddy": [4]}
... ]})
>>> new = drop(data, "x/bad*")
>>> new.x[0].good
1
>>> new.x[0].goody
1.1
>>> new.x[0].bad
AttributeError: 'Record' object has no attribute 'bad'
>>> new.x[0].baddy
AttributeError: 'Record' object has no attribute 'baddy'
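The relationship between keep and drop can be sketched on a toy schema, again modeled as a dict from field names to array names. The keep_fields and drop_fields helpers are hypothetical and use fnmatch for the wildcard matching; they only show that both operations select entries of the schema and leave the arrays alone.

```python
from fnmatch import fnmatchcase

# Toy schema (not OAMap's classes): field name -> name of the array holding it.
schema = {"good": "array-0", "goody": "array-1", "bad": "array-2", "baddy": "array-3"}

def keep_fields(schema, *patterns):
    """Keep only the fields whose names match at least one pattern."""
    return {name: arr for name, arr in schema.items()
            if any(fnmatchcase(name, p) for p in patterns)}

def drop_fields(schema, *patterns):
    """Remove the fields whose names match at least one pattern."""
    return {name: arr for name, arr in schema.items()
            if not any(fnmatchcase(name, p) for p in patterns)}

# With these field names, keeping "good*" and dropping "bad*" give the same schema.
print(sorted(keep_fields(schema, "good*")))
print(sorted(drop_fields(schema, "bad*")))
```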
split(data, *paths)
Splits a single list of records with fields identified by paths into separate lists, each representing just one field value.
For example, consider the following data:
>>> data = (List(Record({"muons":
... List(Record({"pt": "float64", "eta": "float64", "phi": "float64"}))}))
... .fromdata([
... {"muons": [
... {"pt": 3.14, "eta": 4.13, "phi": 22.2},
... {"pt": 3.14, "eta": 4.13, "phi": 22.2},
... {"pt": 3.14, "eta": 4.13, "phi": 22.2},
... ]},
... {"muons": [
... ]},
... {"muons": [
... {"pt": 3.14, "eta": 4.13, "phi": 22.2},
... {"pt": 3.14, "eta": 4.13, "phi": 22.2},
... ]},
... ]))
>>> for event in data:
...     print("event")
...     for muon in event.muons:
...         print(" muon", muon.pt, muon.eta, muon.phi)
...
event
 muon 3.14 4.13 22.2
 muon 3.14 4.13 22.2
 muon 3.14 4.13 22.2
event
event
 muon 3.14 4.13 22.2
 muon 3.14 4.13 22.2
This split call removes the "phi" field from "muons," promoting it to its own list.
>>> new = split(data, "muons/phi")
>>> for event in new:
...     print("new event")
...     for muon in event.muons:
...         print(" muon", muon.pt, muon.eta)
...     for phi in event.phi:
...         print(" phi", phi)
...
new event
 muon 3.14 4.13
 muon 3.14 4.13
 muon 3.14 4.13
 phi 22.2
 phi 22.2
 phi 22.2
new event
new event
 muon 3.14 4.13
 muon 3.14 4.13
 phi 22.2
 phi 22.2
This is a purely metadata operation (though it couldn't be for rowwise data!). All that was changed was the schema:
>>> new.schema.show()
List(
starts = 'object-B',
stops = 'object-E',
content = Record(
fields = {
'muons': List(
starts = 'object-L-Fmuons-B',
stops = 'object-L-Fmuons-E',
content = Record(
fields = {
'eta': Primitive(dtype('float64'),
data='object-L-Fmuons-L-Feta-Df8'),
'pt': Primitive(dtype('float64'),
data='object-L-Fmuons-L-Fpt-Df8')
})
),
'phi': List(
starts = 'object-L-Fmuons-B', # same arrays as the muon records,
stops = 'object-L-Fmuons-E', # in multiple places in the same schema
content = Primitive(dtype('float64'),
data='object-L-Fmuons-L-Fphi-Df8')
)
})
)
Many fields can be split out of a record using the glob-pattern feature of paths:
>>> new2 = split(data, "muons/*")
>>> for event in new2:
...     print("new event")
...     for pt in event.pt:
...         print(" pt", pt)
...     for eta in event.eta:
...         print(" eta", eta)
...     for phi in event.phi:
...         print(" phi", phi)
...
new event
 pt 3.14
 pt 3.14
 pt 3.14
 eta 4.13
 eta 4.13
 eta 4.13
 phi 22.2
 phi 22.2
 phi 22.2
new event
new event
 pt 3.14
 pt 3.14
 eta 4.13
 eta 4.13
 phi 22.2
 phi 22.2
The following operations scale with the size of the schema and the read time of the few arrays they need to read.
merge(data, container, *paths)
Reverses the split operation by combining paths into a pre-existing or new container.
Continuing with the previous example, merge can put the "phi" values back into the "muons" records because their lists are identical. This particular case is a metadata-only operation because the list's starts/stops arrays have the same names and therefore must be identical. However, if the lists do not have the same names, the merge operation must read the arrays to verify that they are the same.
>>> undo = merge(new, "muons", "phi")
>>> for event in undo:
...     print("event")
...     for muon in event.muons:
...         print(" muon", muon.pt, muon.eta, muon.phi)
...
event
 muon 3.14 4.13 22.2
 muon 3.14 4.13 22.2
 muon 3.14 4.13 22.2
event
event
 muon 3.14 4.13 22.2
 muon 3.14 4.13 22.2
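The compatibility check that merge needs can be sketched as follows. This is an assumed, simplified version of the idea and not OAMap's code: the same_list_structure helper and the array names are hypothetical. Lists whose starts/stops arrays have the same names are accepted without reading anything; otherwise the arrays are read and compared element by element.

```python
import numpy

def same_list_structure(names_a, names_b, arrays):
    """Decide whether two lists share their starts/stops.

    names_a and names_b are (starts_name, stops_name) pairs; arrays maps
    array names to NumPy arrays. Identical names need no array reads.
    """
    if names_a == names_b:
        return True   # metadata-only shortcut
    return (numpy.array_equal(arrays[names_a[0]], arrays[names_b[0]]) and
            numpy.array_equal(arrays[names_a[1]], arrays[names_b[1]]))

arrays = {
    "muons-B": numpy.array([0, 3, 3]), "muons-E": numpy.array([3, 3, 5]),
    "other-B": numpy.array([0, 3, 3]), "other-E": numpy.array([3, 3, 5]),
    "short-B": numpy.array([0, 2, 2]), "short-E": numpy.array([2, 2, 5]),
}

print(same_list_structure(("muons-B", "muons-E"), ("other-B", "other-E"), arrays))
print(same_list_structure(("muons-B", "muons-E"), ("short-B", "short-E"), arrays))
```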
The following operations scale with the size of the schema and the read and write times of the few arrays involved.
mask(data, at, low, high=None)
Replace values with None in data at path at that are either equal to low or, if high is given, between low and high (inclusive on both endpoints). NaN is handled correctly.
Example:
>>> data = (List(Record({"muons":
... List(Record({"pt": "float64"}))}))
... .fromdata([
... {"muons": [
... {"pt": 1.1},
... {"pt": 2.2},
... {"pt": -1000}
... ]},
... {"muons": [
... {"pt": -1000},
... {"pt": 4.4},
... {"pt": 5.5}
... ]}
... ]))
>>> data
[<Record at index 0>, <Record at index 1>]
>>> project(data, "muons/pt")
[[1.1, 2.2, -1000.0], [-1000.0, 4.4, 5.5]] # physicist used -1000 for "missing"
>>> new = mask(data, "muons/pt", -1000)
>>> new
[<Record at index 0>, <Record at index 1>]
>>> project(new, "muons/pt")
[[1.1, 2.2, None], [None, 4.4, 5.5]] # None instead of -1000
The schema has been changed; new arrays have been added and distinguished from the old ones with a namespace.
>>> new.schema.show()
List(
starts = 'object-B',
stops = 'object-E',
content = Record(
fields = {
'muons': List(
starts = 'object-L-Fmuons-B',
stops = 'object-L-Fmuons-E',
content = Record(
fields = {
'pt': Primitive(dtype('float64'),
nullable=True, # pt is now nullable
data='array-0',
mask='array-1', # with a mask array
namespace='namespace-0')
})
)
})
)
The actual data are in an in-memory dict that needs to be saved in the database (somehow).
>>> new._arrays
<__main__.DualSource object at 0x7aa1eba32550>
>>> new._arrays.new
{'array-1': array([ 0, 1, -1, -1, 4, 5], dtype=int32),
'array-0': array([ 1.1, 2.2, -1000. , -1000. , 4.4, 5.5])}
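The mask array shown above appears to hold each value's own index where the value is valid and -1 where it is masked. Under that reading, a minimal NumPy sketch of building and applying such a mask could look like this. The build_mask and read_masked helpers are hypothetical, and the NaN branch is one plausible way to treat NaN, not necessarily the library's.

```python
import numpy

def build_mask(values, low, high=None):
    """Return an int32 mask: -1 where a value is masked, else its own index."""
    values = numpy.asarray(values, dtype=numpy.float64)
    if high is None:
        if numpy.isnan(low):
            hit = numpy.isnan(values)              # NaN never equals itself
        else:
            hit = values == low
    else:
        hit = (values >= low) & (values <= high)   # inclusive on both ends
    mask = numpy.arange(len(values), dtype=numpy.int32)
    mask[hit] = -1
    return mask

def read_masked(values, mask):
    """Materialize values, substituting None wherever the mask is negative."""
    return [None if m < 0 else values[m] for m in mask]

pt = [1.1, 2.2, -1000.0, -1000.0, 4.4, 5.5]
mask = build_mask(pt, -1000)
print(mask.tolist())
print(read_masked(pt, mask))
```

The original data array is left untouched; only the new mask array has to be written.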
flatten(data, at="")
Turn a list of lists into a simple list at path at.
>>> data = List(List("int")).fromdata([[1, 2, 3], [], [4, 5]])
>>> data
[[1, 2, 3], [], [4, 5]]
>>> flatten(data)
[1, 2, 3, 4, 5]
>>> data2 = (Record({"x": List(List("int")), "y": "bool"})
... .fromdata({"x": [[1, 2, 3], [], [4, 5]], "y": True}))
>>>
>>> new = flatten(data2, at="x")
>>> new.x
[1, 2, 3, 4, 5]
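One simple way to picture flattening on columnar data is to concatenate the slices that the inner starts/stops arrays delimit. The sketch below makes that assumption explicit; the flatten_content helper and the array layout are hypothetical, and a real implementation could avoid the copy when the sublists are already contiguous.

```python
import numpy

def flatten_content(starts, stops, content):
    """Concatenate the sublists delimited by starts/stops into one flat array."""
    pieces = [content[a:b] for a, b in zip(starts, stops)]
    if not pieces:
        return content[:0]      # no sublists: empty array of the same dtype
    return numpy.concatenate(pieces)

starts = numpy.array([0, 3, 3])
stops = numpy.array([3, 3, 5])
content = numpy.array([1, 2, 3, 4, 5])
print(flatten_content(starts, stops, content).tolist())
```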
The following operations scale with the size of the schema, the read time of all the arrays required by the user-defined function, and the compute time of the user-defined function.
In all such functions, the numba option has the following possible values:
- True (default): compile with Numba if Numba is installed (import numba does not raise an ImportError). Default Numba options will be used, which include a fallback to partially compiled code if the user-defined function fails a type check.
- dict of options: options to pass to Numba compilation. For example, {"nopython": True, "nogil": True} raises an error instead of falling back to partially compiled code and releases Python's GIL during execution.
- False or None: pure Python; do not attempt to compile with Numba, even if Numba is available. This can be faster for small datasets (where compilation overhead dominates over runtime).
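The three cases above can be sketched as a small dispatch helper. The maybe_compile function is hypothetical and only mirrors the described behavior; the one Numba call it relies on is numba.jit, used as a decorator factory with keyword options.

```python
def maybe_compile(fcn, numba=True):
    """Return fcn compiled with Numba when requested and available, else fcn itself."""
    if numba is None or numba is False:
        return fcn                          # pure Python requested
    try:
        import numba as nb
    except ImportError:
        return fcn                          # Numba not installed: fall back
    options = numba if isinstance(numba, dict) else {}
    return nb.jit(**options)(fcn)           # numba=True uses Numba's defaults

def add_one(x):
    return x + 1

print(maybe_compile(add_one, numba=False)(3))
```

Passing a dict such as {"nopython": True, "nogil": True} forwards those options to numba.jit unchanged.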
filter(data, fcn, args=(), at="", numba=True)
Removes objects from data that fail the fcn function with possible args arguments at path at. The return type of fcn must be boolean (checked when compiled with Numba). The numba option has the meaning described above.
Example of filtering events:
>>> data = (List(Record({"muons":
... List(Record({"pt": "float64", "charge": "int8"}))}))
... .fromdata([
... {"muons": [
... {"pt": 1.1, "charge": 1},
... {"pt": 2.2, "charge": -1},
... {"pt": 3.3, "charge": 1},
... ]},
... {"muons": [
... ]},
... {"muons": [
... {"pt": 4.4, "charge": -1},
... {"pt": 5.5, "charge": -1},
... ]},
... ]))
>>> data
[<Record at index 0>, <Record at index 1>, <Record at index 2>]
>>> project(data, "muons")
[[<Record at index 0>, <Record at index 1>, <Record at index 2>],
[],
[<Record at index 3>, <Record at index 4>]]
>>> new = filter(
... data,
... lambda event: len(event.muons) > 0)
>>> new
[<Record at index 0>, <Record at index 2>]
>>> project(new, "muons")
[[<Record at index 0>, <Record at index 1>, <Record at index 2>],
[<Record at index 3>, <Record at index 4>]]
Example of filtering particles:
>>> new2 = filter(
... data,
... lambda muon: muon.charge > 0,
... at="muons")
>>> new2
[<Record at index 0>, <Record at index 1>, <Record at index 2>]
>>> project(new2, "muons")
[[<Record at index 0>, <Record at index 2>], [], []]
Note that filtering without modifying the original is accomplished through pointers: the filtered data is a list of pointers to the data in the original list. Thus, a filtered dataset is actually an "event list," though this is transparent to the physicist. Compare the fully qualified schemas of data and new:
>>> data._generator.namedschema().show() # internal: adds fully qualified names
List(
starts = 'object-B',
stops = 'object-E',
content = Record(
fields = {
'muons': List(
starts = 'object-L-Fmuons-B',
stops = 'object-L-Fmuons-E',
content = Record(
fields = {
'charge': Primitive(dtype('int8'),
data='object-L-Fmuons-L-Fcharge-Di1'),
'pt': Primitive(dtype('float64'),
data='object-L-Fmuons-L-Fpt-Df8')
})
)
})
)
>>> new.schema.show()
List( # new List (shorter than the original)
starts = 'array-0',
stops = 'array-1',
namespace = 'namespace-0',
content = Pointer( # new Pointer (to contents of the original)
positions = 'array-2',
namespace = 'namespace-0',
target = Record(
fields = {
'muons': List(
starts = 'object-L-Fmuons-B',
stops = 'object-L-Fmuons-E',
content = Record(
fields = {
'charge': Primitive(dtype('int8'),
data='object-L-Fmuons-L-Fcharge-Di1'),
'pt': Primitive(dtype('float64'),
data='object-L-Fmuons-L-Fpt-Df8')
})
)
})
)
)
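The pointer idea can be sketched with NumPy: filtering produces an array of positions into the original list, and reading the filtered list dereferences those positions. The layout and variable names below are assumptions for illustration, and the vectorized predicate stands in for a user-defined function applied to each event.

```python
import numpy

# Hypothetical columnar layout of the three events above (illustration only).
starts = numpy.array([0, 3, 3])
stops = numpy.array([3, 3, 5])

# Events that pass "len(event.muons) > 0": positions into the original list.
positions = numpy.nonzero(stops - starts > 0)[0]
print(positions.tolist())

# Reading the filtered list dereferences each position in the original arrays.
filtered_counts = [(stops[i] - starts[i]).item() for i in positions]
print(filtered_counts)
```

Only the small positions array is new; the muon arrays themselves are shared with the original dataset.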
define(data, fieldname, fcn, args=(), at="", fieldtype=Primitive(numpy.float), numba=True)
Adds a field fieldname at path at by applying fcn function with possible args arguments to every object at path at. The return type of fcn must be fieldtype. The numba option has the meaning described above.
Example of defining a new event attribute:
>>> data = (List(Record({"muons":
... List(Record({"pt": "float64", "eta": "float64", "phi": "float64"}))}))
... .fromdata([
... {"muons": [
... {"pt": 3.14, "eta": 4.13, "phi": 22.2},
... {"pt": 3.14, "eta": 4.13, "phi": 22.2},
... {"pt": 3.14, "eta": 4.13, "phi": 22.2},
... ]},
... {"muons": [
... ]},
... {"muons": [
... {"pt": 3.14, "eta": 4.13, "phi": 22.2},
... {"pt": 3.14, "eta": 4.13, "phi": 22.2},
... ]},
... ]))
>>> new = define(
... data,
... "nummuons",
... lambda event: len(event.muons),
... fieldtype=Primitive("int64"))
>>> new
[<Record at index 0>, <Record at index 1>, <Record at index 2>]
>>> new[0].nummuons
3
>>> project(new, "nummuons")
[3, 0, 2]
Example of defining a new particle attribute:
>>> from math import sinh
>>> new2 = define(
... data,
... "pz",
... lambda muon: muon.pt * sinh(muon.eta),
... at="muons")
>>> new2
[<Record at index 0>, <Record at index 1>, <Record at index 2>]
>>> new2[0].muons[0].pz
97.59408888782299
>>> project(new2, "muons/pz")
[[97.59408888782299, 97.59408888782299, 97.59408888782299],
[],
[97.59408888782299, 97.59408888782299]]
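In columnar terms, a defined field is one new array aligned with the existing ones. The NumPy sketch below illustrates that with an assumed layout (it computes the values in a vectorized way, whereas define applies a user function to each object): an event-level field has one entry per event, and a particle-level field has one entry per muon and can reuse the muon list's starts/stops.

```python
import math
import numpy

# Hypothetical columnar layout of the example data (illustration only).
starts = numpy.array([0, 3, 3])
stops = numpy.array([3, 3, 5])
pt = numpy.full(5, 3.14)
eta = numpy.full(5, 4.13)

# Event-level attribute: one value per event, derived from the list offsets.
nummuons = stops - starts

# Particle-level attribute: one value per muon, aligned with pt and eta, so it
# can reuse the same starts/stops as the existing muon fields.
pz = pt * numpy.sinh(eta)

print(nummuons.tolist())
print(len(pz) == len(pt))
```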
The following operations export data from the OAMap schema in highly reduced forms (so that they can be easily downloaded, unlike the original OAMap data).
map(data, fcn, args=(), at="", names=None, numba=True)
Produces a flat table of data (Numpy recarray, which can be passed as a single argument to the pandas.DataFrame constructor) from an OAMap dataset data. The fcn function with possible args arguments is applied to every element at path at to produce a row of the table. If names are not supplied, they'll be Numpy recarray defaults (f0, f1, f2...); otherwise, names labels the columns. The numba option has the meaning described above.
fcn(datum, *args) -> row of a table as a tuple of numbers
(UNTESTED, NO WORKING EXAMPLES)
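Since no working example exists for map, the following is only a sketch of the described contract on stand-in data, not a call into OAMap. The map_rows helper is hypothetical; it applies a row-producing function to plain Python tuples and packs the results with numpy.rec.fromrecords, which assigns the default names f0, f1, ... when names is not given.

```python
import numpy

def map_rows(items, fcn, args=(), names=None):
    """Apply fcn to each item and pack the resulting tuples into a recarray."""
    rows = [tuple(fcn(item, *args)) for item in items]
    return numpy.rec.fromrecords(rows, names=names)

# Stand-in data: plain (pt, eta) pairs instead of OAMap muon objects.
muons = [(1.1, 0.5), (2.2, -0.5), (3.3, 1.5)]

table = map_rows(muons, lambda m, scale: (m[0] * scale, m[1]), args=(2.0,),
                 names=["pt2", "eta"])
default = map_rows(muons, lambda m: (m[0], m[1]))
print(table.dtype.names, default.dtype.names)
```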
reduce(data, tally, fcn, args=(), at="", numba=True)
Aggregates data by repeatedly applying the fcn function to the tally with possible args arguments at every element at path at, expecting a new tally in return. The numba option has the meaning described above.
fcn(datum, tally, *args) -> new tally
Intended for histogramming, though summation, max/min, averaging, etc. are also possible with arguments like this:
reduce(data, 0, lambda x, tally: x + tally, at="muons/pt")
The initial tally value (0 above), the second argument of fcn, and the return value of fcn must agree in data type (explicitly tested for Numba-compiled functions).
(UNTESTED, NO WORKING EXAMPLES)
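The tally-passing contract can be illustrated in pure Python on stand-in data. The reduce_items helper and the fill function are hypothetical and do not call OAMap; they only show how summation, a maximum, and a simple histogram all fit the fcn(datum, tally, *args) -> new tally shape.

```python
def reduce_items(items, tally, fcn, args=()):
    """Fold fcn(datum, tally, *args) over items, threading the tally through."""
    for datum in items:
        tally = fcn(datum, tally, *args)
    return tally

# Stand-in for the values found at path "muons/pt" (illustration only).
pts = [1.1, 2.2, 3.3, 4.4, 5.5]

total = reduce_items(pts, 0.0, lambda x, tally: x + tally)
largest = reduce_items(pts, float("-inf"), lambda x, tally: max(x, tally))

# Histogramming: the tally is a tuple of bin counts; the lower edge of the
# first bin and the bin width are passed through args.
def fill(x, counts, low, width):
    i = int((x - low) // width)
    if 0 <= i < len(counts):
        counts = counts[:i] + (counts[i] + 1,) + counts[i + 1:]
    return counts

hist = reduce_items(pts, (0, 0, 0), fill, args=(0.0, 2.0))
print(largest, hist)
```

In each case the initial tally, the tally argument, and the return value share one type (float, float, and a tuple of ints, respectively).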