
Implement TMA store reduction for add, min, max #5699

Open · wants to merge 1 commit into main

Conversation

ardaunal (Author)

This change adds support for lowering _experimental_descriptor_store to the TMA store-reduction instruction cp.reduce.async.bulk.tensor. Performance of reduction operations that can use this asynchronous instruction should improve, as shown in the plot below, which is generated from 10-vector-tma-reduce-add.py.

[Performance plot generated from 10-vector-tma-reduce-add.py]

During the implementation, we noticed a discrepancy in TMA descriptor generation in driver.c: the element size needed to be mapped to the corresponding element type. This is currently fixed for a subset of types in fill1DTMADescriptor and could be improved further. Once we settle on a final design, fill2DTMADescriptor should be updated in the same way.

I am also planning to convert the tutorial 10-vector-tma-reduce-add.py into an end-to-end test.
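
For reference, here is a minimal sketch of how the proposed store_reduce keyword is meant to be used, loosely following the tutorial. The host-side descriptor helpers are the existing experimental ones, and their exact names and signatures may differ across Triton versions, so treat this as illustrative rather than canonical:

import torch
import triton
import triton.language as tl
from triton.tools.experimental_descriptor import create_1d_tma_descriptor


@triton.jit
def reduce_add_kernel(x_desc, out_desc, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    # Load one tile of x through TMA.
    x = tl._experimental_descriptor_load(x_desc, [block_start], [BLOCK_SIZE], tl.float32)
    # Accumulate the tile into the output via the TMA reduce-add store added by this PR
    # (lowered to cp.reduce.async.bulk.tensor).
    tl._experimental_descriptor_store(out_desc, x, [block_start], store_reduce="add")


def reduce_add(x: torch.Tensor, out: torch.Tensor, BLOCK_SIZE: int = 1024):
    # Host-side 1D TMA descriptors; note that the helper only receives the element *size*,
    # which is exactly the driver.c discrepancy described above.
    x_desc = create_1d_tma_descriptor(x.data_ptr(), x.numel(), BLOCK_SIZE, x.element_size())
    out_desc = create_1d_tma_descriptor(out.data_ptr(), out.numel(), BLOCK_SIZE, out.element_size())
    grid = (triton.cdiv(x.numel(), BLOCK_SIZE),)
    reduce_add_kernel[grid](x_desc, out_desc, BLOCK_SIZE)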

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

ardaunal requested a review from ptillet as a code owner on January 25, 2025 at 01:13
ardaunal (Author)

@htyu

ThomasRaoux (Collaborator)

I haven't looked in detail, but a high-level comment:
I had experimented with this in #3911 but couldn't find a good case where it helps.
Reductions should be highly bandwidth bound, so I'm surprised by the big performance boost reported above. How could we be so far from peak performance on a pure reduction? I suspect something is off in the benchmarking. Also, the bandwidth of the Triton case is well above the theoretical peak of H100 (3.35 TB/s).
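
A rough sanity check for numbers like these, assuming the benchmark charges one global read and one global write (the reduce) per element: bytes moved divided by kernel time should stay below the roughly 3350 GB/s HBM3 peak of H100.

def achieved_bandwidth_gbs(n_elements: int, bytes_per_element: int,
                           reads_per_element: int, writes_per_element: int, ms: float) -> float:
    """Achieved bandwidth in GB/s from the element count, the global reads/writes
    the kernel performs per element, and the measured kernel time in milliseconds."""
    total_bytes = n_elements * bytes_per_element * (reads_per_element + writes_per_element)
    return total_bytes / (ms * 1e-3) / 1e9


# Example: 2**26 fp32 elements, one load of y plus one reduce-add to the output per
# element, measured at 0.25 ms -> ~2147 GB/s, comfortably below the H100 peak.
print(achieved_bandwidth_gbs(2**26, 4, reads_per_element=1, writes_per_element=1, ms=0.25))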

Comment on lines +48 to +54
    tl._experimental_descriptor_store(output_desc, x, [block_start])
    # Load y through TMA.
    y = tl._experimental_descriptor_load(
        y_desc, [block_start], [BLOCK_SIZE], y_ptr.dtype.element_ty
    )
    # Store y through TMA reduce-add.
    tl._experimental_descriptor_store(output_desc, y, [block_start], store_reduce="add")
Contributor


Is this a race? For the store, the PTX manual says:

> The copy operation in cp.async.bulk.tensor is treated as a weak memory operation and the complete-tx operation on the mbarrier has .release semantics at the .cluster scope

And for the reduction:

> Each reduction operation performed by cp.reduce.async.bulk.tensor has individually .relaxed.gpu memory ordering semantics.

Therefore we need an acquire fence between the store and reduce ops.
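
One way to sidestep the ordering question in the tutorial itself, assuming the intent is just out = x + y, is to accumulate in registers and issue only the reduction store against a zero-initialized output, so no plain TMA store precedes the reduce on the same addresses. A sketch (this does not settle the fence question for the general case):

import triton
import triton.language as tl


@triton.jit
def add_kernel_reduce_only(x_desc, y_desc, out_desc, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    x = tl._experimental_descriptor_load(x_desc, [block_start], [BLOCK_SIZE], tl.float32)
    y = tl._experimental_descriptor_load(y_desc, [block_start], [BLOCK_SIZE], tl.float32)
    # Single reduce-add into an output the host has zero-initialized: there is no
    # earlier plain TMA store to the same addresses to order against.
    tl._experimental_descriptor_store(out_desc, x + y, [block_start], store_reduce="add")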

    self,
    offsets: List[constexpr | tensor],
    value: tensor,
    store_reduce,
Contributor


IMO this should be a separate API from store since it has different memory semantics.

  case 1:
    type = CU_TENSOR_MAP_DATA_TYPE_UINT8;
Contributor


Obviously this is incomplete as it breaks lots of dtypes. Also need to update the MakeTensorDescriptor op.
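
To make the dtype problem concrete: the element size alone cannot recover the CUtensorMap data type, because several dtypes share a byte width. A sketch of the kind of mapping fill1DTMADescriptor (and MakeTensorDescriptor) would ultimately need, using the CUDA driver's CU_TENSOR_MAP_DATA_TYPE_* enum names; which dtypes the final design supports is still open:

# Several dtypes share a byte width, so a size-based switch cannot tell them apart.
TMA_DATA_TYPE_BY_DTYPE = {
    "uint8":    "CU_TENSOR_MAP_DATA_TYPE_UINT8",     # 1 byte
    "uint16":   "CU_TENSOR_MAP_DATA_TYPE_UINT16",    # 2 bytes
    "float16":  "CU_TENSOR_MAP_DATA_TYPE_FLOAT16",   # 2 bytes
    "bfloat16": "CU_TENSOR_MAP_DATA_TYPE_BFLOAT16",  # 2 bytes
    "int32":    "CU_TENSOR_MAP_DATA_TYPE_INT32",     # 4 bytes
    "uint32":   "CU_TENSOR_MAP_DATA_TYPE_UINT32",    # 4 bytes
    "float32":  "CU_TENSOR_MAP_DATA_TYPE_FLOAT32",   # 4 bytes
    "int64":    "CU_TENSOR_MAP_DATA_TYPE_INT64",     # 8 bytes
    "uint64":   "CU_TENSOR_MAP_DATA_TYPE_UINT64",    # 8 bytes
    "float64":  "CU_TENSOR_MAP_DATA_TYPE_FLOAT64",   # 8 bytes
}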

htyu (Collaborator) commented Jan 27, 2025

> I had experimented with this in #3911 but couldn't find a good case where it helps.

Did you try the split-k gemm kernel? It looks like TMA reduction gives better perf than tl.atomic_add.

TMA reduction is also used in ThunderKittens' attention backward kernel.

ThomasRaoux (Collaborator)

> I had experimented with this in #3911 but couldn't find a good case where it helps.
>
> Did you try the split-k gemm kernel? It looks like TMA reduction gives better perf than tl.atomic_add.
>
> TMA reduction is also used in ThunderKittens' attention backward kernel.

No, I was trying to use the reduction as a replacement for matmul with the accumulator in HBM. Split-k gemm with atomics will cause non-determinism?

htyu (Collaborator) commented Jan 27, 2025

> Split-k gemm with atomics will cause non-determinism

That's true. I have clearly seen it cause numerical issues with FP8 gemm, but it is still used in some places where performance matters more.
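
The root cause of the non-determinism is that floating-point addition is not associative, so accumulating the same partial results in a different order (which is what unordered accumulation across CTAs does) can change the final value, and the effect grows as precision drops. A small plain-Python illustration:

import random

random.seed(0)
parts = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

forward = sum(parts)
shuffled = parts[:]
random.shuffle(shuffled)
reordered = sum(shuffled)

# Mathematically equal, but the two sums can differ slightly in floating point;
# this order dependence is the run-to-run variation being discussed.
print(forward, reordered, forward - reordered)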
