
[Major] Fuse bias+gemm and layernorm+quantization for more efficient ViT #254

Open · wants to merge 1 commit into main
Conversation

Louym (Contributor) commented on Jan 13, 2025:

@ys-2020 I have fused bias+gemm and layernorm+quantization. These optimizations achieve 1.5-1.6x and 1.5-2.0x kernel speedups respectively on RTX 4090, which translates into roughly a 1.15x end-to-end speedup for ViT. I have also made some of these optimizations for Orin, but there is still room for improvement there; I will finish that part later.
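For readers skimming the diff, the core of the layernorm+quantization fusion is that the normalized value is quantized in-register and written straight to int8, so the fp16 activations never take an extra round trip through global memory. Below is a minimal sketch of that idea, not the PR's actual kernel: the one-warp-per-token layout, the names, and the precomputed per-tensor scale inv_scale are assumptions made here for brevity.

// Illustrative only: fused LayerNorm + INT8 quantization, one warp per token.
// Assumptions (not from the PR): blockDim.x == 32 and a precomputed per-tensor
// quantization scale; the real kernel may use per-token scales and wider blocks.
#include <cuda_fp16.h>
#include <cstdint>

__global__ void layernorm_quant_sketch(const half* __restrict__ in,
                                       const half* __restrict__ gamma,
                                       const half* __restrict__ beta,
                                       int8_t* __restrict__ out,
                                       float inv_scale, int hidden_size, float eps) {
    const half* row = in + (size_t)blockIdx.x * hidden_size;  // one block per token
    float sum = 0.f, sumsq = 0.f;
    for (int i = threadIdx.x; i < hidden_size; i += 32) {
        float v = __half2float(row[i]);
        sum += v;
        sumsq += v * v;
    }
    // Butterfly warp reduction: every lane ends up with the full sums.
    for (int off = 16; off > 0; off >>= 1) {
        sum   += __shfl_xor_sync(0xffffffffu, sum, off);
        sumsq += __shfl_xor_sync(0xffffffffu, sumsq, off);
    }
    float mean = sum / hidden_size;
    float var  = sumsq / hidden_size - mean * mean;           // Var[x] = E[x^2] - E[x]^2
    float rstd = rsqrtf(var + eps);
    for (int i = threadIdx.x; i < hidden_size; i += 32) {
        float v = (__half2float(row[i]) - mean) * rstd;
        v = v * __half2float(gamma[i]) + __half2float(beta[i]);
        int q = __float2int_rn(v * inv_scale);                // quantize in-register, no extra kernel
        out[(size_t)blockIdx.x * hidden_size + i] = (int8_t)max(-128, min(127, q));
    }
}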

  int hidden_size = input.size(-1);
  int num_tokens = input.numel() / hidden_size;
  dim3 grid(num_tokens);
- dim3 block(std::min(hidden_size, 1024));
+ dim3 block(std::min(hidden_size / 2, 1024));  // Prevent thread idling when the embedding size is greater than 1024 and not an integer multiple of it.
Contributor: Can we fix this in a more elegant way to improve utilization?
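One possible shape of a more elegant fix, written here as a hypothetical host-side helper rather than anything taken from the PR: derive the per-thread iteration count first, then size the block to a warp multiple that keeps every thread busy on every iteration.

// Hypothetical helper, not from the PR: choose a block size that avoids idle
// threads on the last strided iteration over a row of `hidden_size` elements.
#include <algorithm>

static int pick_block_size(int hidden_size, int max_threads = 1024) {
    int iters   = (hidden_size + max_threads - 1) / max_threads;  // elements per thread
    int threads = (hidden_size + iters - 1) / iters;              // threads needed for that many iterations
    threads     = ((threads + 31) / 32) * 32;                     // round up to a full warp
    return std::min(threads, max_threads);
}

For hidden_size = 1280 this lands on the same 640 threads as the hidden_size/2 workaround (instead of 1024 threads, 768 of which idle on the second strided iteration), but it also handles sizes where /2 still leaves an imbalance, e.g. 2304 becomes 3 iterations over 768 threads with no idle lanes.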

@@ -41,23 +40,24 @@ __inline__ __device__ Tf compute_layernorm(Tf val, float s_mean, float s_varianc
  * First pass (loop) computes the mean.
  * Second computes the variance via Var[x] = E[(x - E[x])²].
  * Third pass computes and writes normed_output
  *
- * with USE_DIFF_OF_SQUARES set to true (may be faster but less accurate):
+ * For better speedup, we set USE_DIFF_OF_SQUARES to true (may be faster but less accurate):
Contributor: Let's keep the original template for better flexibility.
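One way to read "keep the original template", sketched with placeholder names and signatures rather than the repo's: leave USE_DIFF_OF_SQUARES as a template parameter and let the host launcher choose the instantiation, so the ViT path can default to the fast variant without removing the accurate one.

// Illustrative sketch only; kernel/launcher names and parameters are placeholders.
#include <cuda_fp16.h>
#include <cstdint>
#include <algorithm>

template <bool USE_DIFF_OF_SQUARES>
__global__ void layernorm_quant_kernel(const half* in, int8_t* out,
                                       int hidden_size, float eps) {
    // ... same body as today; when USE_DIFF_OF_SQUARES is true the variance comes
    // from E[x^2] - E[x]^2 in a single pass, otherwise from a second pass over
    // (x - mean)^2 ...
}

void launch_layernorm_quant(const half* in, int8_t* out, int num_tokens,
                            int hidden_size, float eps, bool fast_variance,
                            cudaStream_t stream) {
    dim3 grid(num_tokens), block(std::min(hidden_size, 1024));
    if (fast_variance)  // the ViT path can pass true by default
        layernorm_quant_kernel<true><<<grid, block, 0, stream>>>(in, out, hidden_size, eps);
    else
        layernorm_quant_kernel<false><<<grid, block, 0, stream>>>(in, out, hidden_size, eps);
}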

#pragma unroll
for (int i = 0; i < CTA_N; i++)
{
    Bias_shared[i] = __half2float(Bias[cta_offset_n + i]);
}
Contributor: Is this the bottleneck?
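If the bias load does show up in profiles, one hedged alternative (the kernel's actual thread layout is not visible in this hunk, so this is only a guess at how it could look) is a cooperative strided load, so the CTA writes each of the CTA_N values once instead of every thread writing all of them:

// Hypothetical variant of the fragment above: spread the CTA_N bias elements
// across the thread block instead of looping over all of them in every thread.
for (int i = threadIdx.x; i < CTA_N; i += blockDim.x) {
    Bias_shared[i] = __half2float(Bias[cta_offset_n + i]);
}
__syncthreads();  // make the shared-memory bias visible to all threads before the epilogue reads it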

@@ -592,10 +606,10 @@ void w8a8_gemm_fuse_bias_forward_cuda(torch::Tensor _in_feats,
  constexpr int CTA_M = 128;
  constexpr int CTA_N = 128;
  constexpr int CTA_K = 64;
- constexpr int WARP_M = 128;
+ constexpr int WARP_M = 64;
Contributor: [Issue] Why couldn't we have 128 here?

@@ -604,7 +618,7 @@ void w8a8_gemm_fuse_bias_forward_cuda(torch::Tensor _in_feats,
  constexpr int CTA_N = 64;
  constexpr int CTA_K = 64;
  constexpr int WARP_M = 32;
- constexpr int WARP_N = 32;
+ constexpr int WARP_N = 16;
Contributor: [Issue] Why couldn't we have 16 here?
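For what it's worth, the legal WARP_M/WARP_N choices in tiled INT8 GEMMs are usually pinned down by divisibility and warp-count constraints like the ones below. These are generic checks written for illustration, not constraints read out of this repo, and CTA_M for this branch is not visible in the hunk, so it is assumed to be 64 here.

// Generic illustration, not this repo's actual constraints; CTA_M is assumed.
constexpr int CTA_M = 64, CTA_N = 64, CTA_K = 64;
constexpr int WARP_M = 32, WARP_N = 16;
static_assert(CTA_M % WARP_M == 0 && CTA_N % WARP_N == 0,
              "the warp tile must partition the CTA tile exactly");
constexpr int WARPS_PER_CTA = (CTA_M / WARP_M) * (CTA_N / WARP_N);
static_assert(WARPS_PER_CTA * 32 <= 1024, "CUDA block size limit");
// WARP_N = 16 -> 2 x 4 = 8 warps (256 threads) per CTA; WARP_N = 32 -> 2 x 2 = 4
// warps (128 threads), each holding a larger accumulator tile in registers.
// Occupancy versus register pressure is the usual trade-off behind such a change.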

ys-2020 changed the title from "[Minor] Fused some kernels" to "[Major] Fuse bias+gemm and layernorm+quantization for more efficient ViT" on Jan 14, 2025.