
Implement TMA store reduction for add, min, max #5699

Open · wants to merge 1 commit into main

Conversation

ardaunal (Author)

This change adds support for lowering _experimental_descriptor_store to the TMA store-reduction instruction cp.reduce.async.bulk.tensor. Performance of reduction operations that can use this asynchronous instruction should improve, as shown in the plot below, which is generated from 10-vector-tma-reduce-add.py.

[Performance plot generated from 10-vector-tma-reduce-add.py]

During the implementation, we noticed a discrepancy in TMA descriptor generation in driver.c: the element size needed to be mapped to the corresponding element type. This is currently fixed for a subset of types in fill1DTMADescriptor and could be improved further. Once we settle on a final design, fill2DTMADescriptor should be updated in the same way.

I am also planning to convert the tutorial 10-vector-tma-reduce-add.py into an end-to-end test.
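
For reference, here is a minimal sketch of how the proposed store_reduce keyword is meant to be used, loosely following the tutorial. The host-side descriptor helpers are the existing experimental ones, and their exact names and signatures may differ across Triton versions, so treat this as illustrative rather than canonical:

import torch
import triton
import triton.language as tl
from triton.tools.experimental_descriptor import create_1d_tma_descriptor


@triton.jit
def reduce_add_kernel(x_desc, out_desc, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    # Load one tile of x through TMA.
    x = tl._experimental_descriptor_load(x_desc, [block_start], [BLOCK_SIZE], tl.float32)
    # Accumulate the tile into the output via the TMA reduce-add store added by this PR
    # (lowered to cp.reduce.async.bulk.tensor).
    tl._experimental_descriptor_store(out_desc, x, [block_start], store_reduce="add")


def reduce_add(x: torch.Tensor, out: torch.Tensor, BLOCK_SIZE: int = 1024):
    # Host-side 1D TMA descriptors; note that the helper only receives the element *size*,
    # which is exactly the driver.c discrepancy described above.
    x_desc = create_1d_tma_descriptor(x.data_ptr(), x.numel(), BLOCK_SIZE, x.element_size())
    out_desc = create_1d_tma_descriptor(out.data_ptr(), out.numel(), BLOCK_SIZE, out.element_size())
    grid = (triton.cdiv(x.numel(), BLOCK_SIZE),)
    reduce_add_kernel[grid](x_desc, out_desc, BLOCK_SIZE)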

New contributor declaration

  • I am not making a trivial change, such as fixing a typo in a comment.

  • I have written a PR description following these
    rules.

  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.

  • Select one of the following.

    • I have added tests.
      • /test for lit tests
      • /unittest for C++ tests
      • /python/test for end-to-end tests
    • This PR does not need a test because FILL THIS IN.
  • Select one of the following.

    • I have not added any lit tests.
    • The lit tests I have added follow these best practices,
      including the "tests should be minimal" section. (Usually running Python code
      and using the instructions it generates is not minimal.)

ardaunal requested a review from ptillet as a code owner on January 25, 2025 at 01:13
ardaunal (Author)

@htyu

ThomasRaoux (Collaborator)

I haven't looked in detail, but a high-level comment:
I had experimented with this in #3911 but couldn't find a good case where it helps.
Reductions should be highly bandwidth bound, so I'm surprised by the big performance boost reported above. How could we be so far from peak performance on a pure reduction? I suspect something is off in the benchmarking. Also, the bandwidth of the Triton case is well above the theoretical peak of H100 (3.35 TB/s).
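
A rough sanity check for numbers like these, assuming the benchmark charges one global read and one global write (the reduce) per element: bytes moved divided by kernel time should stay below the roughly 3350 GB/s HBM3 peak of H100.

def achieved_bandwidth_gbs(n_elements: int, bytes_per_element: int,
                           reads_per_element: int, writes_per_element: int, ms: float) -> float:
    """Achieved bandwidth in GB/s from the element count, the global reads/writes
    the kernel performs per element, and the measured kernel time in milliseconds."""
    total_bytes = n_elements * bytes_per_element * (reads_per_element + writes_per_element)
    return total_bytes / (ms * 1e-3) / 1e9


# Example: 2**26 fp32 elements, one load of y plus one reduce-add to the output per
# element, measured at 0.25 ms -> ~2147 GB/s, comfortably below the H100 peak.
print(achieved_bandwidth_gbs(2**26, 4, reads_per_element=1, writes_per_element=1, ms=0.25))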

Comment on lines +48 to +54
    tl._experimental_descriptor_store(output_desc, x, [block_start])
    # Load y through TMA.
    y = tl._experimental_descriptor_load(
        y_desc, [block_start], [BLOCK_SIZE], y_ptr.dtype.element_ty
    )
    # Store y through TMA reduce-add.
    tl._experimental_descriptor_store(output_desc, y, [block_start], store_reduce="add")
Contributor


Is this a race? For the store, the PTX manual says:

> The copy operation in cp.async.bulk.tensor is treated as a weak memory operation and the complete-tx operation on the mbarrier has .release semantics at the .cluster scope

And for the reduction:

> Each reduction operation performed by cp.reduce.async.bulk.tensor has individually .relaxed.gpu memory ordering semantics.

Therefore we need an acquire fence between the store and reduce ops.
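
One way to sidestep the ordering question in the tutorial itself, assuming the intent is just out = x + y, is to accumulate in registers and issue only the reduction store against a zero-initialized output, so no plain TMA store precedes the reduce on the same addresses. A sketch (this does not settle the fence question for the general case):

import triton
import triton.language as tl


@triton.jit
def add_kernel_reduce_only(x_desc, y_desc, out_desc, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    x = tl._experimental_descriptor_load(x_desc, [block_start], [BLOCK_SIZE], tl.float32)
    y = tl._experimental_descriptor_load(y_desc, [block_start], [BLOCK_SIZE], tl.float32)
    # Single reduce-add into an output the host has zero-initialized: there is no
    # earlier plain TMA store to the same addresses to order against.
    tl._experimental_descriptor_store(out_desc, x + y, [block_start], store_reduce="add")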

    self,
    offsets: List[constexpr | tensor],
    value: tensor,
    store_reduce,
Contributor


IMO this should be a separate API from store since it has different memory semantics.

  case 1:
    type = CU_TENSOR_MAP_DATA_TYPE_UINT8;
Contributor


Obviously this is incomplete as it breaks lots of dtypes. Also need to update the MakeTensorDescriptor op.
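
To make the dtype problem concrete: the element size alone cannot recover the CUtensorMap data type, because several dtypes share a byte width. A sketch of the kind of mapping fill1DTMADescriptor (and MakeTensorDescriptor) would ultimately need, using the CUDA driver's CU_TENSOR_MAP_DATA_TYPE_* enum names; which dtypes the final design supports is still open:

# Several dtypes share a byte width, so a size-based switch cannot tell them apart.
TMA_DATA_TYPE_BY_DTYPE = {
    "uint8":    "CU_TENSOR_MAP_DATA_TYPE_UINT8",     # 1 byte
    "uint16":   "CU_TENSOR_MAP_DATA_TYPE_UINT16",    # 2 bytes
    "float16":  "CU_TENSOR_MAP_DATA_TYPE_FLOAT16",   # 2 bytes
    "bfloat16": "CU_TENSOR_MAP_DATA_TYPE_BFLOAT16",  # 2 bytes
    "int32":    "CU_TENSOR_MAP_DATA_TYPE_INT32",     # 4 bytes
    "uint32":   "CU_TENSOR_MAP_DATA_TYPE_UINT32",    # 4 bytes
    "float32":  "CU_TENSOR_MAP_DATA_TYPE_FLOAT32",   # 4 bytes
    "int64":    "CU_TENSOR_MAP_DATA_TYPE_INT64",     # 8 bytes
    "uint64":   "CU_TENSOR_MAP_DATA_TYPE_UINT64",    # 8 bytes
    "float64":  "CU_TENSOR_MAP_DATA_TYPE_FLOAT64",   # 8 bytes
}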

htyu (Collaborator) commented Jan 27, 2025

> I had experimented with this in #3911 but couldn't find a good case where it helps.

Did you try the split-k gemm kernel? It looks like TMA reduction gives better perf than tl.atomic_add.

TMA reduction is also used in ThunderKittens' attention backward kernel.

ThomasRaoux (Collaborator)

> I had experimented with this in #3911 but couldn't find a good case where it helps.
>
> Did you try the split-k gemm kernel? It looks like TMA reduction gives better perf than tl.atomic_add.
>
> TMA reduction is also used in ThunderKittens' attention backward kernel.

No, I was trying to use the reduction as a replacement for matmul with the accumulator in HBM. Split-k gemm with atomics will cause non-determinism?

htyu (Collaborator) commented Jan 27, 2025

> Split-k gemm with atomics will cause non-determinism

That's true. I have clearly seen it cause numerical issues with FP8 gemm, but it is still used in some places where performance matters more.
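
The root cause of the non-determinism is that floating-point addition is not associative, so accumulating the same partial results in a different order (which is what unordered accumulation across CTAs does) can change the final value, and the effect grows as precision drops. A small plain-Python illustration:

import random

random.seed(0)
parts = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

forward = sum(parts)
shuffled = parts[:]
random.shuffle(shuffled)
reordered = sum(shuffled)

# Mathematically equal, but the two sums can differ slightly in floating point;
# this order dependence is the run-to-run variation being discussed.
print(forward, reordered, forward - reordered)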
