Replies: 7 comments
-
This is motivated by the idea of having some
That works but seems slow and I hope to be faster by doing:
Which doesn't work (for me).
-
This doesn't work for
>>> a = ak.Array([[1, 2, 3], [], [4, 5]])
>>> b = ak.Array([["a", "b"], ["c"], ["d", "e"]])
>>> index1 = ak.argcartesian({"x": a, "y": b}, nested=False)
>>> index2 = ak.argcartesian({"x": a, "y": b}, nested=True)
>>> index1.tolist()
[[{'x': 0, 'y': 0},
{'x': 0, 'y': 1},
{'x': 1, 'y': 0},
{'x': 1, 'y': 1},
{'x': 2, 'y': 0},
{'x': 2, 'y': 1}],
[],
[{'x': 0, 'y': 0},
{'x': 0, 'y': 1},
{'x': 1, 'y': 0},
{'x': 1, 'y': 1}]]
>>> index2.tolist()
[[[{'x': 0, 'y': 0}, {'x': 0, 'y': 1}],
[{'x': 1, 'y': 0}, {'x': 1, 'y': 1}],
[{'x': 2, 'y': 0}, {'x': 2, 'y': 1}]],
[],
[[{'x': 0, 'y': 0}, {'x': 0, 'y': 1}],
[{'x': 1, 'y': 0}, {'x': 1, 'y': 1}]]]
>>> index1["x"].tolist()
[[0, 0, 1, 1, 2, 2], [], [0, 0, 1, 1]]
>>> index2["x"].tolist()
[[[0, 0], [1, 1], [2, 2]], [], [[0, 0], [1, 1]]]
>>> ak.broadcast_arrays(a, index2["x"])[0].tolist()
[[[1, 1], [2, 2], [3, 3]], [], [[4, 4], [5, 5]]]
>>> ak.broadcast_arrays(a, index2["x"])[0].tolist()
[[[1, 1], [2, 2], [3, 3]], [], [[4, 4], [5, 5]]]
>>> index2["x"].tolist()
[[[0, 0], [1, 1], [2, 2]], [], [[0, 0], [1, 1]]]
but it still doesn't work:
>>> ak.broadcast_arrays(a, index2["x"])[0][index2["x"]]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/jpivarski/miniconda3/lib/python3.8/site-packages/awkward/highlevel.py", line 946, in __getitem__
return ak._util.wrap(self._layout[where], self._behavior)
ValueError: in ListArray64 attempting to get 2, index out of range
because, for example, the
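One possible reading of the error, sketched in plain Python rather than with Awkward Array itself: the integers in index2["x"] are positions within each list of a (0, 1, 2 for the first list), but after broadcasting, the innermost lists have the length of the matching list in b (2 here), so position 2 no longer exists. The helper pick_innermost below is hypothetical and only models jagged integer indexing on nested Python lists; it is not the library's implementation.

```python
def pick_innermost(data, index):
    """Toy model of jagged integer indexing on nested lists.

    Walks `data` and `index` in step; at the innermost level of `index`,
    each integer selects an item from the aligned list in `data`.
    """
    if index and all(isinstance(i, int) for i in index):
        return [data[i] for i in index]
    return [pick_innermost(d, ix) for d, ix in zip(data, index)]


a = [[1, 2, 3], [], [4, 5]]

# Values copied from the REPL output above.
index1_x = [[0, 0, 1, 1, 2, 2], [], [0, 0, 1, 1]]
index2_x = [[[0, 0], [1, 1], [2, 2]], [], [[0, 0], [1, 1]]]
broadcast_a = [[[1, 1], [2, 2], [3, 3]], [], [[4, 4], [5, 5]]]

# The flat index lines up with the lists of `a`, so this works.
flat_result = pick_innermost(a, index1_x)

# The nested index is applied to innermost lists of length 2,
# so asking for position 2 in [3, 3] fails.
try:
    pick_innermost(broadcast_a, index2_x)
    error = None
except IndexError as exc:
    error = exc
```

In this model the failure comes from [3, 3] being asked for position 2, which matches the "attempting to get 2" in the traceback.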
-
An easier way to deal with this conceptually, with possibly better performance as well, is to
>>> many_a = ak.zip({"a1": a, "a2": a, "a3": a})
>>> many_b = ak.zip({"b1": b, "b2": b, "b3": b})
>>> ak.cartesian({"x": many_a, "y": many_b}).tolist()
[[{'x': {'a1': 1, 'a2': 1, 'a3': 1}, 'y': {'b1': 'a', 'b2': 'a', 'b3': 'a'}},
{'x': {'a1': 1, 'a2': 1, 'a3': 1}, 'y': {'b1': 'b', 'b2': 'b', 'b3': 'b'}},
{'x': {'a1': 2, 'a2': 2, 'a3': 2}, 'y': {'b1': 'a', 'b2': 'a', 'b3': 'a'}},
{'x': {'a1': 2, 'a2': 2, 'a3': 2}, 'y': {'b1': 'b', 'b2': 'b', 'b3': 'b'}},
{'x': {'a1': 3, 'a2': 3, 'a3': 3}, 'y': {'b1': 'a', 'b2': 'a', 'b3': 'a'}},
{'x': {'a1': 3, 'a2': 3, 'a3': 3}, 'y': {'b1': 'b', 'b2': 'b', 'b3': 'b'}}],
[],
[{'x': {'a1': 4, 'a2': 4, 'a3': 4}, 'y': {'b1': 'd', 'b2': 'd', 'b3': 'd'}},
{'x': {'a1': 4, 'a2': 4, 'a3': 4}, 'y': {'b1': 'e', 'b2': 'e', 'b3': 'e'}},
{'x': {'a1': 5, 'a2': 5, 'a3': 5}, 'y': {'b1': 'd', 'b2': 'd', 'b3': 'd'}},
{'x': {'a1': 5, 'a2': 5, 'a3': 5}, 'y': {'b1': 'e', 'b2': 'e', 'b3': 'e'}}]]
Beyond that, Numba is generally faster than these array-at-a-time functions, though the array-at-a-time functions are generally more convenient. The second thing to try would be Numba.
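As a rough sketch of what the loop-based alternative could look like, here is a plain Python function that pairs up the items of aligned lists and computes a difference for each pair, without building an intermediate Cartesian-product array. It works on ordinary nested lists so that it runs without extra dependencies. The function name pairwise_differences and the numeric values of b are made up for illustration, and compiling such a loop with Numba is assumed to be a separate step that is not shown here.

```python
def pairwise_differences(a, b):
    """For each pair of aligned lists, return x - y for every (x, y) pair.

    The pairs are visited in the same order as the non-nested Cartesian
    product shown earlier: x varies slowest, y varies fastest.
    """
    out = []
    for xs, ys in zip(a, b):
        row = []
        for x in xs:
            for y in ys:
                row.append(x - y)
        out.append(row)
    return out


a = [[1, 2, 3], [], [4, 5]]
b = [[10, 20], [30], [40, 50]]  # hypothetical numeric stand-in for b

result = pairwise_differences(a, b)
print(result)
# [[-9, -19, -8, -18, -7, -17], [], [-36, -46, -35, -45]]
```

Nothing is duplicated before it is used: each pair is formed, consumed, and discarded inside the inner loop.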
-
Thank you for the fast response.
-
#555 is about adding a way to create new arrays-of-lists. It's something one can already do from the low level (directly manipulating layouts), but it would provide a high-level way to do it, motivated by
I've relabeled this issue as a question, because I think it was about technique, not new functionality. Zipping all of your arrays would let you do a single
-
Okay, thank you for the clarification.
The above takes
The above takes
The above takes
The above takes almost no time.
The above suddenly takes
The new approach is faster in crossing but for some reason way slower in calculating the difference, which doesn't make sense to me as
-
The trade-off is due to a difference in technique. When computing the Cartesian product (as well as several other operations) of records, we don't descend through every field of the record, computing the Cartesian product of each field. Instead, we create an IndexedArray of those records. I'll demonstrate. For the Cartesian product of two record arrays,
>>> one = ak.Array([[{"x": 1}, {"x": 2}, {"x": 3}], [], [{"x": 4}, {"x": 5}]])
>>> two = ak.Array([[{"y": 1}, {"y": 2}], [{"y": 3}], [{"y": 4}, {"y": 5}]])
>>> ak.cartesian([one, two]).tolist()
[[({'x': 1}, {'y': 1}),
({'x': 1}, {'y': 2}),
({'x': 2}, {'y': 1}),
({'x': 2}, {'y': 2}),
({'x': 3}, {'y': 1}),
({'x': 3}, {'y': 2})],
[],
[({'x': 4}, {'y': 4}),
({'x': 4}, {'y': 5}),
({'x': 5}, {'y': 4}),
({'x': 5}, {'y': 5})]]
The result is a ListOffsetArray of RecordArray of IndexedArrays, with the unduplicated data inside the IndexedArrays.
>>> ak.cartesian([one, two]).layout
<ListOffsetArray64>
<offsets><Index64 i="[0 6 6 10]" offset="0" length="4" at="0x562c068de5d0"/></offsets>
<content><RecordArray>
<field index="0">
<IndexedArray64>
<index><Index64 i="[0 0 1 1 2 2 3 3 4 4]" offset="0" length="10" at="0x562c06903df0"/></index>
<content><RecordArray>
<field index="0" key="x">
<NumpyArray format="l" shape="5" data="1 2 3 4 5" at="0x562c069f74e0"/>
</field>
</RecordArray></content>
</IndexedArray64>
</field>
<field index="1">
<IndexedArray64>
<index><Index64 i="[0 1 0 1 0 1 3 4 3 4]" offset="0" length="10" at="0x562c068e3ee0"/></index>
<content><RecordArray>
<field index="0" key="y">
<NumpyArray format="l" shape="5" data="1 2 3 4 5" at="0x562c06a3e660"/>
</field>
</RecordArray></content>
</IndexedArray64>
</field>
</RecordArray></content>
</ListOffsetArray64>
The
By contrast, if you did a Cartesian product of non-record arrays:
>>> ak.cartesian([ak.Array([[1, 2, 3], [], [4, 5]]), ak.Array([[1, 2], [3], [4, 5]])]).layout
<ListOffsetArray64>
<offsets><Index64 i="[0 6 6 10]" offset="0" length="4" at="0x562c06a43ef0"/></offsets>
<content><RecordArray>
<field index="0">
<NumpyArray format="l" shape="10" data="1 1 2 2 3 3 4 4 5 5" at="0x562c0688a7d0"/>
</field>
<field index="1">
<NumpyArray format="l" shape="10" data="1 2 1 2 1 2 4 5 4 5" at="0x562c068e3ee0"/>
</field>
</RecordArray></content>
</ListOffsetArray64>
there are no IndexedArrays and the innermost NumpyArray data are duplicated.
Adding this IndexedArray layer to most operations with RecordArrays was a performance upgrade this summer: issue #204, PR #261. However, as you point out, it is a tradeoff. If you have records with few fields and frequently access the results (in most or all fields), then you'd rather do
We were motivated to make the change because (1) many physics records are wide, and a given analysis typically doesn't use all of those fields, and (2) it's also fairly common for records to contain VirtualArrays for lazy reading of data. The IndexedArray indirection not only delays duplication, it also prevents unaccessed fields from being eagerly read. So we were seeing cases in which dozens of fields were "Cartesianed" and then ignored, as well as cases in which fields were unnecessarily read from disk, "Cartesianed," and ignored.
What you're seeing in your examples just illustrates that you have to pay the price eventually: either up front or upon access. Since you will be using all of the fields of your record, there's not a strong advantage to one case or the other. We could imagine complicating the interface by adding compute-once caches to the IndexedArrays, but there are diminishing returns in that: the array-at-a-time interface is a balance of fast and convenient (at least, faster than Python loops and more convenient than writing a function in Numba).
If you really need speed for a given application, you should probably turn to Numba. That avoids the intermediate arrays that duplicate the data altogether: a traditional for loop will win on a CPU. (It's not as clear on a GPU, but the GPU implementations are far from ready.) This is the same reason Numba usually beats NumPy. But this isn't a concession: the ability to mix Awkward Array operations with the occasional Numba-accelerated loop was part of the plan.
It enables "non-premature optimization": you try it in Awkward operations because they're the most convenient, then replace just the hot-spots with Numba, leaving everything else intact.
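To make the "up front or upon access" trade-off concrete, here is a toy model in plain Python. It is not how Awkward Array is implemented: the class LazyIndexed and its materialize_count counter are hypothetical, and the assumption that every access repeats the gather is a simplification. The index and content values are taken from the first field of the layouts printed above.

```python
class LazyIndexed:
    """Toy stand-in for an IndexedArray: an index into shared content.

    Nothing is copied at construction time; the gather happens each
    time the values are requested.
    """

    def __init__(self, index, content):
        self.index = index
        self.content = content
        self.materialize_count = 0  # how many times the gather was paid

    def materialize(self):
        self.materialize_count += 1
        return [self.content[i] for i in self.index]


content_x = [1, 2, 3, 4, 5]
index_x = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]

# Lazy: cheap to build, pays for the gather on every access.
lazy_x = LazyIndexed(index_x, content_x)

# Eager: pays for the gather once, at construction time.
eager_x = [content_x[i] for i in index_x]

first_access = lazy_x.materialize()
second_access = lazy_x.materialize()
print(first_access == eager_x, lazy_x.materialize_count)
```

If a field is never accessed, the lazy version costs almost nothing. If every field is accessed repeatedly, the eager version does the same work fewer times.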
-
Let a and b be some jagged arrays with the same outer (axis = 0) length.
It is possible to do:
However, it would also be nice to have this working with nested = True.
At the moment, this results in: