feat: 44 make flash attention configurable #60

Open

wants to merge 85 commits into base: main

Conversation

theissenhelen
Collaborator

@theissenhelen theissenhelen commented Jan 6, 2025

Current setup:

  • If flash-attn is available in the environment, the MultiHeadSelfAttention module automatically imports the corresponding attention function. At inference time, however, we do not have that information.

Now:

  • Flex Attention is now available.
  • The user specifies in the model config whether flash-attn, Flex Attention, or scaled dot-product attention should be used.
  • Adds configurable parameters (softcap, ALiBi) for flash attention.
    For ALiBi, a function computes the slopes from the number of attention heads (see the sketch after this list).
  • Scaled dot-product attention now supports a sliding window, making it numerically equivalent to flash/flex attention.
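For context on the ALiBi item above, the slopes are usually a geometric sequence keyed to the number of heads, following the ALiBi paper. A minimal sketch of such a helper (names are illustrative, not necessarily the exact function added in this PR):

import math

import torch

def get_alibi_slopes(num_heads: int) -> torch.Tensor:
    """Return one ALiBi slope per attention head, per the ALiBi paper."""

    def slopes_power_of_2(n: int) -> list[float]:
        start = 2.0 ** (-8.0 / n)  # first slope; each subsequent slope is multiplied by start
        return [start**i for i in range(1, n + 1)]

    if math.log2(num_heads).is_integer():
        slopes = slopes_power_of_2(num_heads)
    else:
        # Fall back to the nearest lower power of two, then fill the remaining
        # heads with every other slope from the next power of two.
        closest = 2 ** math.floor(math.log2(num_heads))
        slopes = slopes_power_of_2(closest)
        slopes += slopes_power_of_2(2 * closest)[0::2][: num_heads - closest]

    return torch.tensor(slopes, dtype=torch.float32)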

Todo:

  • test various attention options
  • adjust test case coverage

theissenhelen and others added 30 commits September 27, 2024 08:42
* fix: change pre-cmmit autoupdate schedule to monthly

* fix: change the merge strategy for Changelog to Union

* fix: add .envrc to .gitignore

* ci: ignore pre-commit-config and readthedocs for changelog updates

* ci: fix to correct hpc workflow call

* fix: update precommit config

* chore: update pre-commits

* feat: add codeowners file

* chore: update dependencies

* ci: add hpc-config

* docs: changelog

* fix: respond to review comments

---------

Co-authored-by: Jesper Dramsch <jesper.dramsch@ecmwf.int>
* feat: add configurability to dropout in MultiHeadSelfAttention

Co-authored-by: Rilwan (Akanni) Adewoyin <18564167+Rilwan-Adewoyin@users.noreply.github.com>

* test: adjust to dropout_p

* doc: update changelog

* Feature/integrate reusable workflows (#16)

* ci: add public pr label

* ci: add readthedocs update check

* ci: add downstream ci

* ci: add ci-config

* chore(deps): remove unused dependency

* docs: update changelog

* ci: switch to main

* chore: changelog 0.2.1

* Update error messages from invalid sub_graph in model instantiation (#20)

* ci: inherit pypi publish flow (#17)

* ci: inherit pypi publish flow

Co-authored-by: Helen Theissen <helen.theissen@ecmwf.int>

* docs: add to changelog

* fix: typo in reusable workflow

* fix: another typo

* chore: bump actions/setup-python to v5

* ci: run downstream-ci for changes in src and tests

* docs: update changelog

---------

Co-authored-by: Helen Theissen <helen.theissen@ecmwf.int>

* Update CHANGELOG.md to KeepChangelog format

* [pre-commit.ci] pre-commit autoupdate (#25)

updates:
- [github.com/psf/black-pre-commit-mirror: 24.4.2 → 24.8.0](psf/black-pre-commit-mirror@24.4.2...24.8.0)
- [github.com/astral-sh/ruff-pre-commit: v0.4.6 → v0.6.2](astral-sh/ruff-pre-commit@v0.4.6...v0.6.2)
- [github.com/tox-dev/pyproject-fmt: 2.1.3 → 2.2.1](tox-dev/pyproject-fmt@2.1.3...2.2.1)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* Ci/changelog-release-updater (#26)

* ci: add changelof release updater

* docs: update changelog

---------

Co-authored-by: Rilwan (Akanni) Adewoyin <18564167+Rilwan-Adewoyin@users.noreply.github.com>
Co-authored-by: Gert Mertes <gert.mertes@ecmwf.int>
Co-authored-by: Mario Santa Cruz <48736305+JPXKQX@users.noreply.github.com>
Co-authored-by: Jesper Dramsch <jesper.dramsch@ecmwf.int>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
xfail for MultiHeadSelfAttention
@FussyDuck

FussyDuck commented Jan 6, 2025

CLA assistant check
All committers have signed the CLA.

@theissenhelen theissenhelen marked this pull request as ready for review January 17, 2025 10:59
@theissenhelen theissenhelen changed the title Feature/44 make flash attention configurable feat: 44 make flash attention configurable Jan 17, 2025
mishooax
mishooax previously approved these changes Jan 20, 2025
Member

@mishooax mishooax left a comment


This looks OK to me (I haven't run the code, but Cathal did) - nice work @theissenhelen.

Member

@HCookie HCookie left a comment


Brilliant work exists here,
Just a few thoughts and questions,
Looks great overall.


For the flash attention implementation, two additional parameters are available: softcap, use_alibi_slopes

softcap: Softcapping prevents the logits from grwoing excessively large
Member


Suggested change
softcap: Softcapping prevents the logits from grwoing excessively large
softcap: Softcapping prevents the logits from growing excessively large
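For readers unfamiliar with the option: softcapping squashes the attention logits through a scaled tanh so they stay bounded. A rough standalone sketch of the idea (flash-attn applies an equivalent transform inside its fused kernel, so this is illustration only):

import torch

def softcap_logits(scores: torch.Tensor, softcap: float) -> torch.Tensor:
    # Smoothly caps the logits to the interval (-softcap, softcap).
    return softcap * torch.tanh(scores / softcap)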


softcap: Softcapping prevents the logits from grwoing excessively large

use_alibi_slopes: Adds bias of (-alibi_slope * |i + seqlen_k - seqlen_q - j|) to the attention score of
Member


Suggested change
use_alibi_slopes: Adds bias of (-alibi_slope * |i + seqlen_k - seqlen_q - j|) to the attention score of
use_alibi_slopes: Adds bias of `(-alibi_slope * |i + seqlen_k - seqlen_q - j|)` to the attention score of
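Writing the quoted formula out as a reference implementation makes the shapes and sign explicit; a naive, unfused sketch:

import torch

def alibi_bias(slopes: torch.Tensor, seqlen_q: int, seqlen_k: int) -> torch.Tensor:
    # bias[h, i, j] = -slopes[h] * |i + seqlen_k - seqlen_q - j|
    i = torch.arange(seqlen_q).view(-1, 1)
    j = torch.arange(seqlen_k).view(1, -1)
    distance = (i + seqlen_k - seqlen_q - j).abs().to(slopes.dtype)
    return -slopes.view(-1, 1, 1) * distance  # shape (num_heads, seqlen_q, seqlen_k)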

Comment on lines +120 to +121
Please change model.processor.attention_implementation in the config to one of: {attn_funcs.keys()}"
LOGGER.info(f"Using {self.attention_implementation}")
Member


Should this reference the config or just say that the arg was wrong?

Comment on lines +77 to +80
softcap : float, optional
Anything > 0 activates softcapping attention, by default None
use_alibi_slopes : bool, optional
Adds bias
Member


Given that these are only used for flash_attention, if kwargs are needed for the other attention types we may end up with a large number of unused kwargs. Could it make sense to add attention_kwargs: dict[str, Any] and use that?
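One possible shape of that suggestion, with illustrative names rather than code from this PR: backend-specific options travel in a single dict and are unpacked only by the backend that understands them.

from typing import Any

class MultiHeadSelfAttentionExample:
    # Illustrative constructor only: bundle flash-attention-specific options
    # (softcap, use_alibi_slopes, ...) into one attention_kwargs dict.
    def __init__(self, attention_implementation: str, attention_kwargs: dict[str, Any] | None = None):
        self.attention_implementation = attention_implementation
        self.attention_kwargs = attention_kwargs or {}

# e.g. attention_kwargs={"softcap": 50.0, "use_alibi_slopes": True} for flash attention,
# and an empty dict for scaled dot product attention.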

if self.attention_implementation == "flex_attention":
assert (
self.embed_dim / self.num_heads >= 16
), f"Embedding dimension ({self.embed_dim}) devided by number of heads ({self.num_heads}) must be >= 16."
Contributor


minor typo, 'devided' -> 'divided'

# initalise the attn func here
self.attention = attn_funcs[self.attention_implementation]()

if self.attention_implementation == "flex_attention":
Contributor


I believe this if should go before self.attention = attn_funcs[self.attention_implementation]()
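A sketch of the reordering being suggested, using the names from the quoted fragment: validate the flex-attention constraint first, then instantiate from the registry.

# Validate backend-specific constraints before constructing the attention function.
if self.attention_implementation == "flex_attention":
    assert (
        self.embed_dim / self.num_heads >= 16
    ), f"Embedding dimension ({self.embed_dim}) divided by number of heads ({self.num_heads}) must be >= 16."

# Initialise the attention function only after validation has passed.
self.attention = attn_funcs[self.attention_implementation]()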

@@ -14,6 +14,7 @@ processor:
num_heads: 16 # GraphTransformer or Transformer only
window_size: 512
dropout_p: 0.0 # GraphTransformer
attention_implementation: flash_attention # flash_attention, flex_attention, scaled_dot_product_attention

Contributor


Are use_alibi_slopes and softcap meant to also be passed via the config? Since the default is flash_attention, I was wondering whether it would be better to also include the defaults for those two parameters.

@@ -0,0 +1,7 @@
[pytest]
Contributor


Out of curiosity, what is this file doing?

NotImplementedError("Error. Causal not yet implemented in FlexAttn in Anemoi.")

# This assumes seq_len never changes
# across iterations and stages
Contributor


what's this comment about? 🤔

)
self.attention = functools.partial(
flex_attention, block_mask=self.block_mask
) # Cache the block mask (recomended in attn blog post)
Contributor


It could be nice to add a reference to that blog post?
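For reference, the caching pattern in question follows the recommendation in the PyTorch FlexAttention blog post. A sketch assuming PyTorch 2.5+ with torch.nn.attention.flex_attention, a fixed sequence length, and a symmetric sliding window:

import functools

import torch
from torch.nn.attention.flex_attention import create_block_mask, flex_attention

WINDOW_SIZE = 512
SEQ_LEN = 4096

def sliding_window(b, h, q_idx, kv_idx):
    # Keep only key positions within WINDOW_SIZE of the query position.
    return (q_idx - kv_idx).abs() <= WINDOW_SIZE

# Building the block mask is comparatively expensive, so it is created once
# (valid as long as seq_len never changes) and bound into a partial.
block_mask = create_block_mask(sliding_window, B=None, H=None, Q_LEN=SEQ_LEN, KV_LEN=SEQ_LEN)
attention = functools.partial(flex_attention, block_mask=block_mask)

q = k = v = torch.randn(1, 16, SEQ_LEN, 64, device="cuda")
out = attention(q, k, v)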

super().__init__()
import flash_attn

if version.parse(flash_attn.__version__) < version.parse("2.6.0"):
Contributor


Flash attention can't yet be installed through the pyproject, right? Do we have some checks ahead of this to confirm that the user has installed flash_attn before trying to use it?
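One common pattern for optional dependencies like this is to guard the import and fail with an actionable message before the version check; a sketch of such a guard (not necessarily what the PR does):

from packaging import version

def load_flash_attn():
    try:
        import flash_attn
    except ImportError as err:
        raise ImportError(
            "attention_implementation='flash_attention' requires the optional flash-attn "
            "package; install it or select another implementation in the model config."
        ) from err

    if version.parse(flash_attn.__version__) < version.parse("2.6.0"):
        raise RuntimeError("flash-attn >= 2.6.0 is required by this module.")

    return flash_attn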

Projects
Status: Under Review

7 participants