
[Proton][Dialect] Add Proton Device Memory Buffer Init and Allocate Pass #5606

Open · wants to merge 32 commits into base: proton-dev

Conversation

@CRobeck (Contributor) commented Jan 14, 2025:
Add the initialization and allocation of the Proton dialect device buffer, which can be used in place of the shared memory buffer. The device buffer is just a module-local, zero-initialized, stack buffer in address space (1).
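As a rough illustration of what such a buffer could lower to (the symbol name and size here are hypothetical, not taken from this PR), a module-local, zero-initialized global in address space 1 looks like this in LLVM IR:

```llvm
; Hypothetical lowering sketch: a module-local, zero-initialized
; buffer placed in address space 1 (device/global memory).
@__proton_device_buffer = internal addrspace(1) global [1024 x i32] zeroinitializer, align 16
```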

@CRobeck changed the base branch from main to proton-dev Jan 23, 2025
@CRobeck changed the title [WIP][Proton][Dialect] Add Initial Infrastructure For Proton Shared Memory Buffer → [Proton][Dialect] Add Infrastructure For Proton Device Memory Buffer Jan 23, 2025
@CRobeck marked this pull request as ready for review Jan 23, 2025
@CRobeck requested a review from ptillet as a code owner Jan 23, 2025
@CRobeck changed the title [Proton][Dialect] Add Infrastructure For Proton Device Memory Buffer → [Proton][Dialect] Add Infrastructure For Proton Device Memory Buffer Pass Jan 24, 2025
@CRobeck changed the title [Proton][Dialect] Add Infrastructure For Proton Device Memory Buffer Pass → [Proton][Dialect] Add Proton Device Memory Buffer Pass Jan 24, 2025
@CRobeck changed the title [Proton][Dialect] Add Proton Device Memory Buffer Pass → [Proton][Dialect] Add Proton Device Memory Buffer Init and Allocate Pass Jan 25, 2025
@fywkevin self-assigned this Jan 25, 2025
@@ -256,6 +256,9 @@ def make_ttgir(mod, metadata, options):
passes.common.add_canonicalizer(pm)
passes.common.add_cse(pm)
passes.common.add_symbol_dce(pm)

proton.passes.ttgpuir.add_allocate_device_buffer(pm)
Contributor:
I don't think we need a separate pass for adding those ops.

@CRobeck (Contributor Author), Jan 25, 2025:

What's the alternative? Just performing the device buffer allocation in the same pass as the init op insertion? Or getting rid of the TTGIR init op altogether and doing this entirely at the LLVM IR level?

I'm somewhat worried about hiding the details of how all this is done behind the scenes from developers. My thinking was that if we have an "init" op at the TTGIR level that maps to the device buffer allocation at the LLVM IR level, things would at least have a 1:1 mapping. But I think that needs to be done in two passes: one to add the init op if a record op is found, and another to rewrite the init op into the LLVM-level device address space buffer.
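A rough sketch of the two-pass flow being described (op and pass names are illustrative, based on the ops in this PR):

```mlir
// Pass 1 (TTGIR level): if a proton.record op is present, insert an
// init op for the device buffer at the top of the kernel.
tt.func @kernel() {
  proton.init_device_buffer ()   // inserted by pass 1
  proton.record ()               // pre-existing record op
  tt.return
}

// Pass 2 (LLVM lowering): rewrite proton.init_device_buffer into a
// module-local, zero-initialized global in device address space 1.
```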

lib/Conversion/TritonToTritonGPU/TritonToTritonGPUPass.cpp Outdated Show resolved Hide resolved
@@ -231,6 +232,9 @@ struct ConvertTritonAMDGPUToLLVM
mlir::triton::populatePrintOpToLLVMPattern(typeConverter, patterns,
targetInfo, commonBenefit);

mlir::triton::proton::populateInitDeviceBufferOpToLLVMPattern(
Contributor:
We shouldn't couple our Proton LLVM lowering logic into the other backends' (NVIDIA and AMD) LLVM lowering rules. Instead, we are going to have a separate Proton conversion pass which populates all of the Proton op lowering patterns.

@@ -267,6 +267,7 @@ def make_ttgir(mod, metadata, opt, capability):
nvidia.passes.ttnvgpuir.add_fence_insertion(pm)
nvidia.passes.ttnvgpuir.add_tma_lowering(pm)
passes.common.add_canonicalizer(pm)
proton.passes.ttgpuir.add_allocate_device_buffer(pm)
Contributor:
Same here; we don't want this to be a separate pass.

@@ -153,6 +153,9 @@ struct ConvertTritonGPUToLLVM
targetInfo, benefit);
mlir::triton::proton::populateRecordOpToLLVMPattern(typeConverter, patterns,
targetInfo, benefit);
mlir::triton::proton::populateInitDeviceBufferOpToLLVMPattern(
Contributor:
Same as the AMD comment.

@@ -62,4 +64,15 @@ def TT_RecordOp : TT_Proton_Op<"record", [DeclareOpInterfaceMethods<MemoryEffect
let assemblyFormat = " `(` operands `)` attr-dict";
}


def TT_InitDeviceBufferOp : TT_Proton_Op<"init_device_buffer", [DeclareOpInterfaceMethods<MemoryEffectsOpInterface>]> {
Contributor:
Could you rename it to something like buffer_alloc (consistent with local_alloc)? It would also be better for it to have a memory-descriptor return type; then we could use smem and device_buffer interchangeably in different RecordOp strategies.
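A hedged sketch of what the suggested rename might look like in TableGen (the result-type constraint name and assembly format are assumptions modeled on the existing TT_InitDeviceBufferOp definition, not part of this PR):

```tablegen
// Hypothetical: buffer_alloc returning a memory descriptor, so smem
// and device_buffer can be used interchangeably by RecordOp strategies.
def TT_BufferAllocOp : TT_Proton_Op<"buffer_alloc",
    [DeclareOpInterfaceMethods<MemoryEffectsOpInterface>]> {
  let results = (outs TTG_MemDescType:$result);
  let assemblyFormat = "attr-dict `:` type($result)";
}
```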

Contributor:
It is also a good idea to have its dual operation: buffer_dealloc. It doesn't need to be in the same PR, but it's good to think about as well.

@CRobeck (Contributor Author), Jan 25, 2025:
I don't think that is actually possible in this case. We're allocating a compile-time, fixed-size, stack buffer in device memory; I don't know how we would dealloc that. Keep in mind this is different from the shared memory buffer, which is dynamically allocated.

@@ -10,6 +10,11 @@ void populateRecordOpToLLVMPattern(LLVMTypeConverter &typeConverter,
RewritePatternSet &patterns,
const TargetInfoBase &targetInfo,
PatternBenefit benefit);
void populateInitDeviceBufferOpToLLVMPattern(LLVMTypeConverter &typeConverter,
Contributor:
After we have our own LLVM lowering conversion pass, let's move this to proton/dialect/lib/...



@triton.jit
def softmax_kernel(output_ptr, input_ptr, input_row_stride, output_row_stride, n_rows, n_cols, BLOCK_SIZE: tl.constexpr,
Contributor:
For the end-to-end testing, you could manually construct a TTGIR module with the buffer_alloc op, read from and write to the buffer, and finally write it back to gmem to check its value in Python.
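For example (hypothetical op names and types, following the suggestion above), a hand-written TTGIR test kernel might allocate the buffer, store a known value, and copy it back to global memory for a Python-side assertion:

```mlir
// Hypothetical test kernel: write a sentinel into the proton buffer,
// read it back, and store it to a gmem pointer the test can inspect.
tt.func @buffer_roundtrip(%out: !tt.ptr<i32>) {
  %buf = proton.buffer_alloc : !ttg.memdesc<1024xi32>
  // ... write a known value into %buf, then load it back ...
  // ... tt.store the loaded value to %out ...
  tt.return
}
```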

@CRobeck (Contributor Author):

Right. I think we'll want to go through in another PR and add all the end-to-end testing at once, to make sure we have the code coverage we want.
