Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added SVE implementation to improve the performance on ARM architecture #10680

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

divya2108
Copy link

Motivation: This pull request aims to improve the performance of training algorithm of XGBoost on ARM architecture by leveraging SVE intrinsics.

Brief description:

  1. This change of including SVE intrinsics improves the performance by 55% as compared to the ARM default.
  2. The modified function iterates over a row of data and updates a histogram based on the given indices and offsets.
  3. The accuracy has been verified after the modifications.
image

@trivialfis
Copy link
Member

Thank you for the PR! I'm not an expert in SIMDs, is it guaranteed to have aligned pointers and padded memory allocation for the intrinsics?

@divya2108
Copy link
Author

Hi @trivialfis
The code has been thoroughly validated to ensure alignment and padding issues are addressed. The datatypes have not been altered from the scalar code; instead, the original scalar operations have been translated into SIMD using equivalent SVE intrinsics and there are no compile-time errors.
All potential accuracy issues have been resolved and verified on widely used datasets like Iris, Airlines delay and breast cancer detection.

@trivialfis
Copy link
Member

trivialfis commented Aug 7, 2024

Thank you for the detailed info. Could you please help explain why it works without the use of specialized allocators like https://en.cppreference.com/w/c/memory/aligned_alloc ? It's important for us to know the logic for future maintenance.

@divya2108
Copy link
Author

Specialized allocators like aligned_alloc() doesn't help with SVE intrinsics because:

  1. ARM's SVE SIMD architecture handles data processing in parallel, which inherently considers data alignment. For example for a 256 bit vector length system, we load 8 float elements (8*32) through VLA(vector length agnostic) instructions into a SVE register.
  2. Most of the instructions including widening and narrowing instructions helps take care of the data alignment.

@divya2108
Copy link
Author

Hi @trivialfis,
Additionally, SVE also provides predicate registers enabling key features such as:
a) Per-lane predication that allows SIMD instructions to be executed conditionally on specific lanes of a SIMD register
b) Predicate-driven loop control and management that helps to manage data that does not align perfectly with the vector length.

@Mousius
Copy link

Mousius commented Aug 13, 2024

Thank you for the detailed info. Could you please help explain why it works without the use of specialized allocators like https://en.cppreference.com/w/c/memory/aligned_alloc ? It's important for us to know the logic for future maintenance.

Hi @trivialfis,

As @divya2108 mentioned, SVE has predication support.

These lines create masks which limit the load/stores from going out of bounds:

svbool_t pg32 = svwhilelt_b32(j, row_size);
svbool_t pg64 = svwhilelt_b64(j, row_size);

SVE is also happy to do element-aligned loads and stores rather than full vectors.

@trivialfis
Copy link
Member

Thank you for the explanation! I will take a deeper look.

Copy link
Member

@trivialfis trivialfis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Started looking into this PR today. Thank you for working on using the arm intrinsic, but could you please add detailed code comments and extract the code into an independent section (like a function that can be inlined)? Most people here (me included) have a background closer to data science instead of low-level programming.

CMakeLists.txt Outdated
@@ -265,6 +265,51 @@ if(${CMAKE_SYSTEM_NAME} MATCHES "OS400")
set(CMAKE_CXX_ARCHIVE_CREATE "<CMAKE_AR> -X64 qc <TARGET> <OBJECTS>")
endif()

if(CMAKE_SYSTEM_PROCESSOR MATCHES "aarch64")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please extract this into a module similar to cmake/PrefetchIntrinsics.cmake?

CMakeLists.txt Outdated
if(RUN_RESULT EQUAL 0)
message(STATUS "ARM SVE hardware support detected")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -march=armv8-a+sve")
string(APPEND CMAKE_CXX_FLAGS " -DSVE_SUPPORT_DETECTED")
Copy link
Member

@trivialfis trivialfis Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please prefix the flag with XGBOOST_ and use targeted flags instead of the CMAKE_CXX_FLAGS.

Comment on lines 261 to 262
svfloat64_t pgh_t0_vec = svdup_n_f64(pgh_t[0]);
svfloat64_t pgh_t1_vec = svdup_n_f64(pgh_t[1]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems you don't need the pgh_t in the SVE code section. Could you please use p_gpair and have the names of the loaded vectors, such as svfloat64_t grad, svfloat64_t hess, for readability?

@divya2108
Copy link
Author

Started looking into this PR today. Thank you for working on using the arm intrinsic, but could you please add detailed code comments and extract the code into an independent section (like a function that can be inlined)? Most people here (me included) have a background closer to data science instead of low-level programming.

Hi @trivialfis, Thank you for suggesting the appropriate changes. I have made the modifications as recommended. Could you please review the updated changes?

@maajidkhann
Copy link

The CMake logic looks right. It only compiles SVE code when the compiler supports it and during the runtime it triggers the SVE code only when the hardware supports SVE (I see there's a runtime HW check for SVE ISA). Changes LGTM.

@Mousius
Copy link

Mousius commented Sep 3, 2024

The CMake logic looks right. It only compiles SVE code when the compiler supports it and during the runtime it triggers the SVE code only when the hardware supports SVE (I see there's a runtime HW check for SVE ISA). Changes LGTM.

Can you point me to where the runtime check happens? As far as I can tell, this only works if the build environment supports compiling and running SVE.

The new path is conditionally compiled with #ifdef XGBOOST_SVE_SUPPORT_DETECTED with no fallback at runtime here:
https://github.com/dmlc/xgboost/pull/10680/files#diff-def34075edb2b3bdb6dc7b5ebcffd518793520fd4fffd70870b12f076a3cb481R305-R308

This makes me think this is only for users compiling from sources on a specific piece of hardware. If we wanted this to work in the generically distributed wheel, we'd have to do the SVE runtime check instead of the #ifdef.

Correct me if I'm wrong 😸

@maajidkhann
Copy link

The CMake logic looks right. It only compiles SVE code when the compiler supports it and during the runtime it triggers the SVE code only when the hardware supports SVE (I see there's a runtime HW check for SVE ISA). Changes LGTM.

Can you point me to where the runtime check happens? As far as I can tell, this only works if the build environment supports compiling and running SVE.

The new path is conditionally compiled with #ifdef XGBOOST_SVE_SUPPORT_DETECTED with no fallback at runtime here: https://github.com/dmlc/xgboost/pull/10680/files#diff-def34075edb2b3bdb6dc7b5ebcffd518793520fd4fffd70870b12f076a3cb481R305-R308

This makes me think this is only for users compiling from sources on a specific piece of hardware. If we wanted this to work in the generically distributed wheel, we'd have to do the SVE runtime check instead of the #ifdef.

Correct me if I'm wrong 😸

I agree with you. I found the HW detection logic here: https://github.com/dmlc/xgboost/pull/10680/files#diff-5650b69c609ef22dea88915eb256a6838341248d3ddfd17430388f7f7e58c4feR24

But this is just for compile time. I think similar logic need to be used during runtime and a runtime check is required.

Since there's already a working SVE HW detection logic, should be easy to reintroduce it in the source code file.

CC @divya2108

@trivialfis
Copy link
Member

Is SVE guaranteed to be available for ARM implementation?

@divya2108
Copy link
Author

Is SVE guaranteed to be available for ARM implementation?

No, SVE is not guaranteed to be available on all ARM implementations. While ARMv8-A architecture, which includes SVE support, is present in newer processors like Graviton3, Graviton4, Grace, it is not mandatory for all ARM CPUs to implement SVE. The code in hist_util.cc checks for SVE support at runtime to ensure that the target hardware supports it & runs the default code otherwise.

@divya2108
Copy link
Author

divya2108 commented Sep 9, 2024

The CMake logic looks right. It only compiles SVE code when the compiler supports it and during the runtime it triggers the SVE code only when the hardware supports SVE (I see there's a runtime HW check for SVE ISA). Changes LGTM.

Can you point me to where the runtime check happens? As far as I can tell, this only works if the build environment supports compiling and running SVE.
The new path is conditionally compiled with #ifdef XGBOOST_SVE_SUPPORT_DETECTED with no fallback at runtime here: https://github.com/dmlc/xgboost/pull/10680/files#diff-def34075edb2b3bdb6dc7b5ebcffd518793520fd4fffd70870b12f076a3cb481R305-R308
This makes me think this is only for users compiling from sources on a specific piece of hardware. If we wanted this to work in the generically distributed wheel, we'd have to do the SVE runtime check instead of the #ifdef.
Correct me if I'm wrong 😸

I agree with you. I found the HW detection logic here: https://github.com/dmlc/xgboost/pull/10680/files#diff-5650b69c609ef22dea88915eb256a6838341248d3ddfd17430388f7f7e58c4feR24

But this is just for compile time. I think similar logic need to be used during runtime and a runtime check is required.

Since there's already a working SVE HW detection logic, should be easy to reintroduce it in the source code file.

CC @divya2108

Yes, I agree. Thank you for bringing this to notice.
I have added a SVE hardware check at runtime. Now it is generically compiled and falls back on the default code if SVE hardware support is not detected.

I have verified this by building the code on different architectures. Here is a summary for more clarity:
image

@rageshhajela16
Copy link

rageshhajela16 commented Sep 20, 2024

@trivialfis Thanks for the initial review and your comments. Can you please suggest any additional feedback which might need further clarification/evaluation from our side or any improvements to incorporate in the proposed implementation. Thanks. cc: @divya2108

@trivialfis
Copy link
Member

Sorry for the slow reply, got stuck at some other work lately. One question, is it possible to reduce the call frequency of check_sve_hw_support to maybe once per training session?

@divya2108
Copy link
Author

Sorry for the slow reply, got stuck at some other work lately. One question, is it possible to reduce the call frequency of check_sve_hw_support to maybe once per training session?

Yes, it's possible to reduce the frequency of calls to check_sve_hw_support by implementing a caching mechanism that checks the SVE hardware support status only once at the beginning of a training session. I have stored the result and it is being reused throughout the session.

@divya2108 divya2108 force-pushed the sve-optimised branch 4 times, most recently from b43108b to 03298b5 Compare October 8, 2024 10:51
@divya2108
Copy link
Author

Hi @trivialfis, just wanted to follow up on the code review. Let me know if you need any additional details or clarifications.

@rageshhajela16
Copy link

rageshhajela16 commented Oct 17, 2024

Hi @trivialfis, just wanted to follow up on the code review. Let me know if you need any additional details or clarifications.

Hi @trivialfis , we have pushed all the necessary changes. Kindly review and let us know for any additional details or modifications required. Thanks in advance for your time while reviewing this. Thanks.

@trivialfis
Copy link
Member

Apologies for the slow response, will look into this. Thank you very much for your patience!

@hcho3 hcho3 self-assigned this Oct 17, 2024
@trivialfis
Copy link
Member

trivialfis commented Oct 18, 2024

@hcho3 I see you have assigned yourself to the PR. Thank you for volunteering! Feel free to review the code.

I recently got access to a Grace machine and might be able to do some tests there.

@rageshhajela16
Copy link

Thanks @trivialfis for your review and contributions to this implementation! Please let us if you would like us to contribute any additional fixes based on review comments. We would like to confirm with you before proceeding to avoid any duplicate work. Thanks again for your time in review of this PR. We appreciate! cc: @divya2108

@trivialfis
Copy link
Member

@hcho3 Do you think it's possible to have this in the pip wheel?

@trivialfis
Copy link
Member

trivialfis commented Nov 7, 2024

Could you please share the CPU you were using for the benchmarks? I ran a benchmark on a Grace machine (I work for NVIDIA) with synthetic data, and the performance is actually lower. I have verified that the row-wise kernel is being used.

My synthetic data:

  • n_samples: 67108864
  • n_features: 256

Training parameters:

  • 64 iterations
  • 6 max depth

Compilers:

  • g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
* SVE disabled

[63]    Train-rmse:18.88612
Qdm train (sec) ended in:  123.9944839477539 seconds.
Trained for 64 iterations.
{'load-batches': {'load (sec)': 7.225183725357056}, 'load-all': {'concat (sec)': 1.3589859008789062e-05}, 'Qdm': {'Train-DMatrix (sec)': 29.529566526412964, 'train (sec)': 123.9944839477539}}

* SVE enabled

[63]    Train-rmse:18.88612
Qdm train (sec) ended in:  154.86435317993164 seconds.
Trained for 64 iterations.
{'load-batches': {'load (sec)': 7.193156003952026}, 'load-all': {'concat (sec)': 1.430511474609375e-05}, 'Qdm': {'Train-DMatrix (sec)': 29.482257604599, 'train (sec)': 154.86435317993164}}

It's okay to be slower on certain platforms, we can look for a way to disable it. But I would like to get some understanding of how the performance works for your platform as well.

@hcho3
Copy link
Collaborator

hcho3 commented Nov 8, 2024

Do you think it's possible to have this in the pip wheel?

Yes, it should be possible.

@divya2108
Copy link
Author

divya2108 commented Nov 8, 2024

Could you please share the CPU you were using for the benchmarks? I ran a benchmark on a Grace machine (I work for NVIDIA) with synthetic data, and the performance is actually lower. I have verified that the row-wise kernel is being used.

My synthetic data:

  • n_samples: 67108864
  • n_features: 256

Training parameters:

  • 64 iterations
  • 6 max depth

Compilers:

  • g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
* SVE disabled

[63]    Train-rmse:18.88612
Qdm train (sec) ended in:  123.9944839477539 seconds.
Trained for 64 iterations.
{'load-batches': {'load (sec)': 7.225183725357056}, 'load-all': {'concat (sec)': 1.3589859008789062e-05}, 'Qdm': {'Train-DMatrix (sec)': 29.529566526412964, 'train (sec)': 123.9944839477539}}

* SVE enabled

[63]    Train-rmse:18.88612
Qdm train (sec) ended in:  154.86435317993164 seconds.
Trained for 64 iterations.
{'load-batches': {'load (sec)': 7.193156003952026}, 'load-all': {'concat (sec)': 1.430511474609375e-05}, 'Qdm': {'Train-DMatrix (sec)': 29.482257604599, 'train (sec)': 154.86435317993164}}

It's okay to be slower on certain platforms, we can look for a way to disable it. But I would like to get some understanding of how the performance works for your platform as well.

These are the machine and dataset details which I used:

  • AWS Graviton3, ARM-based CPU
  • Dataset details: kaggle higgs boson dataset (250000 samples, 32 features)
  • Training parameters: 120 iterations, 6 max_depth
  • compiler: g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

@divya2108
Copy link
Author

Hi @trivialfis, just wanted to follow up on the PR review. Let me know if there’s anything I can do to help or clarify. Thanks!

@trivialfis
Copy link
Member

apologies for the slow response here. I need to test it on aws, and find a way to disable it for grace.

@Mousius
Copy link

Mousius commented Nov 26, 2024

apologies for the slow response here. I need to test it on aws, and find a way to disable it for grace.

My guess is that it's slightly lower overhead to use ASIMD without SVE for the 128-bit vector length?

Maybe we change this:

https://github.com/dmlc/xgboost/pull/10680/files#diff-def34075edb2b3bdb6dc7b5ebcffd518793520fd4fffd70870b12f076a3cb481R275-R280

to check if the length is > 16 (128-bit in bytes) ?

@rageshhajela16
Copy link

rageshhajela16 commented Dec 3, 2024

apologies for the slow response here. I need to test it on aws, and find a way to disable it for grace.

My guess is that it's slightly lower overhead to use ASIMD without SVE for the 128-bit vector length?

Maybe we change this:

https://github.com/dmlc/xgboost/pull/10680/files#diff-def34075edb2b3bdb6dc7b5ebcffd518793520fd4fffd70870b12f076a3cb481R275-R280

to check if the length is > 16 (128-bit in bytes) ?

Thanks for your suggestion @Mousius , Agree.
@trivialfis can you please help to review and try it! Let us know your feedback. Thanks

@trivialfis
Copy link
Member

can you please help to review and try it! Let us know your feedback. Thanks

Could you please elaborate on what I should try?

- Changed cmake design by extracting the code into
  cmake/CheckSVEsupport.cmake
- Prefixed the flags with XGBOOST_ and used targeted flags
- Extracted the SVE code into an inlined function
- Added detailed code comments
- Modified vector names for better readability
Signed-off-by: divya2108 <divya.kotadiya@fujitsu.com>
Disables SVE for SVE128 supported hardware and runs the default Neon flow
@divya2108 divya2108 force-pushed the sve-optimised branch 3 times, most recently from 9b3a0d9 to e03ca37 Compare December 11, 2024 12:28
@divya2108
Copy link
Author

apologies for the slow response here. I need to test it on aws, and find a way to disable it for grace.

Hi @trivialfis, I have added a change to disable SVE for SVE128 supported hardware (grace) and run the default Neon flow in that. case. Now the SVE implementation works for ARM CPU's having vector length >=256.

And also seems like the changes you pushed in the past seem to be missing now somehow? I can't locate it here in remote and even locally. Can you help me point to the source of these commits (Can't even find them in your xgboost forked repo), I can cherry-pick it back on top of my branch.

@trivialfis
Copy link
Member

Let me take a look tomorrow. Seems there's a force push that overrode the previous changes.

@trivialfis
Copy link
Member

Will try to redo some of the changes and run test.

@trivialfis
Copy link
Member

Tested again on grace today, small overhead for dispatching, I think I can work around it:

master:

[63]	Train-rmse:18.88612
Qdm train (sec) ended in:  97.99142909049988 seconds.
Trained for 64 iterations.
{'load-batches': {'load (sec)': 7.013079643249512}, 'load-all': {'concat (sec)': 1.7404556274414062e-05}, 'Qdm': {'Train-DMatrix (sec)': 30.664090633392334, 'train (sec)': 97.99142909049988}}

sve:

[63]	Train-rmse:18.88612
Qdm train (sec) ended in:  103.17199659347534 seconds.
Trained for 64 iterations.
{'load-batches': {'load (sec)': 7.116881370544434}, 'load-all': {'concat (sec)': 1.6450881958007812e-05}, 'Qdm': {'Train-DMatrix (sec)': 30.612082481384277, 'train (sec)': 103.17199659347534}}

Will test again on AWS with graviton.

@trivialfis
Copy link
Member

trivialfis commented Dec 18, 2024

I just tested the default build on aws c8g.x16large, the vector_length is 128 and the extension is not used there?

@divya2108
Copy link
Author

I just tested the default build on aws c8g.x16large, the vector_length is 128 and the extension is not used there?

Hi @trivialfis. As discussed above in my comment, the current logic, checks for SVE VL of the hardware
and if the VL < 256, then it falls back to the default flow (Neon).

So, in the machine you tested earlier (Grace) and the c8g.x16large (Graviton4) machine you tested today,
both have SVE128 in the hardware and as per our current logic, it will fall back to default flow (Neon).

Curious to know, if overhead remained the same even in c8g.x16large or it came down? Did you try any workarounds
around that in the code with the dispatcher?

I believe, if you now run the code on machines with SVE>=256, you would observe the boost in performance because
of SVE ACLE code. One good machine to test SVE goodness would be m7g.8xlarge (Graviton 3) which has SVE256 in the hardware.

@trivialfis
Copy link
Member

trivialfis commented Dec 19, 2024

@divya2108 Do you expect the extension would provide speedup for graviton 4 if we remove the 256-bit restriction? If so I can run the test again. Otherwise, I don't think we will merge the extension only for graviton3 unless there are other latest and accessible platforms that can benefit from this.

@Mousius
Copy link

Mousius commented Dec 30, 2024

@divya2108 Do you expect the extension would provide speedup for graviton 4 if we remove the 256-bit restriction? If so I can run the test again. Otherwise, I don't think we will merge the extension only for graviton3 unless there are other latest and accessible platforms that can benefit from this.

It wouldn't be only Graviton 3, A64FX has 512-bit SVE, and we can incrementally improve this to detect the streaming SVE vector length on devices where that's >128-bit.

@divya2108
Copy link
Author

I agree with @Mousius. Right now, we have ARM CPUs with varying SVE VL (Graviton 4 - SVE128, Graviton 3 - SVE256 and Fugaku A64FX - SVE512). The hardware's with SVE VL>128, will directly benefit by this PR by leveraging the goodness of SVE.

Briefly summarizing this PR:

We are very keen to contribute more optimizations using SVE and support multiple vector lengths with VLA SVE ACLE code. We request to merge this PR and make it possible for us to continue working on more optimizations and contribute to XGBoost.

@trivialfis @hcho3 @Mousius

@trivialfis
Copy link
Member

trivialfis commented Jan 21, 2025

Apologies for the slow response here.

Would be great if there's a latest hardware that we can benchmark on and see the merit and have some confidence that the merit will stay with future hardware. Otherwise it's quite unlikely that we will merge a assembly like PR.

We love performance, but most people here are data scientists instead of performance engineers, it takes some really good use cases to justify the code complexity. As you can see, we couldn't attrach anyone to help maintain the C++ codebase, let alone an assembly function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants