feat(satp-hermes): add crash recovery & rollback protocol #3491

Yogesh01000100
Contributor

@Yogesh01000100 Yogesh01000100 commented Aug 20, 2024

Description

This PR addresses issue #3114 by implementing core components for crash recovery and rollback protocols. The changes enhance fault tolerance and ensure consistent recovery during failures.


Key Changes

1. CrashManager

Introduced a CrashManager class responsible for managing crash detection, recovery, and rollback processes.
Key functionalities include:

  • Session Management: Tracks and maintains SATP sessions.
  • Recovery Initiation: Detects crashes and triggers recovery logic.
  • Rollback Execution: Handles rollback processes for failed recovery attempts.
  • Cron Job Integration: Added scheduled crash detection using node-schedule.
    • Ensures jobs pause during rollback to prevent conflicts.
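The pause-during-rollback behaviour can be pictured with a minimal sketch. It models the scheduled check as a plain `tick()` method that a scheduler (node-schedule in this PR) would call on every firing. The names `CrashDetectionJob`, `tick`, and `runRollback` are hypothetical and are not taken from the PR, and the rollback is synchronous here only to keep the example short.

```typescript
// Hypothetical sketch: class and method names are illustrative,
// not the CrashManager API from this PR.
class CrashDetectionJob {
  private paused = false;
  private readonly check: () => void;
  public runs = 0;
  public skipped = 0;

  constructor(check: () => void) {
    this.check = check;
  }

  // Invoked by the scheduler on every firing.
  tick(): void {
    if (this.paused) {
      // A rollback is in progress, so this firing is skipped.
      this.skipped += 1;
      return;
    }
    this.runs += 1;
    this.check();
  }

  // Runs a rollback with detection paused; detection resumes
  // even if the rollback throws.
  runRollback<T>(rollback: () => T): T {
    this.paused = true;
    try {
      return rollback();
    } finally {
      this.paused = false;
    }
  }
}
```

Calling `tick()` from inside the callback passed to `runRollback` simulates a scheduler firing mid-rollback: the check is skipped and counted in `skipped` rather than executed.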

2. Protocol Services

Updated crash_recovery.proto to define:

  • RecoverMessage, RecoverUpdateMessage, and RecoverSuccessMessage for crash recovery.
  • RollbackMessage and RollbackAckMessage for rollback processes.
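For orientation, the five message names can be arranged into a request/reply table. Only the message names come from the description above; the ordering (Recover, then RecoverUpdate, then RecoverSuccess; Rollback, then RollbackAck) is an assumption inferred from the names, and `MessageKind` and `expectedReply` are hypothetical helpers rather than part of crash_recovery.proto.

```typescript
// Only the message names are from the PR description; the ordering
// below is an assumption inferred from those names.
type MessageKind =
  | "RecoverMessage"
  | "RecoverUpdateMessage"
  | "RecoverSuccessMessage"
  | "RollbackMessage"
  | "RollbackAckMessage";

// Maps each message to the message assumed to follow it,
// or null when it ends its exchange.
const NEXT_MESSAGE: Record<MessageKind, MessageKind | null> = {
  RecoverMessage: "RecoverUpdateMessage",
  RecoverUpdateMessage: "RecoverSuccessMessage",
  RecoverSuccessMessage: null,
  RollbackMessage: "RollbackAckMessage",
  RollbackAckMessage: null,
};

function expectedReply(kind: MessageKind): MessageKind | null {
  return NEXT_MESSAGE[kind];
}
```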

3. Recovery & Rollback Strategies

Implemented recovery & rollback strategies for all SATP protocol stages, ensuring the ability to revert to a consistent state upon failure.

  • Added RollbackStrategyFactory to centralize strategy selection.
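A minimal sketch of centralized strategy selection follows. `RollbackStrategyFactory` is the name used in this PR, but the `register` and `create` methods, the `StageRollback` class, and the stage numbering 0 to 3 are assumptions made for illustration.

```typescript
// Hypothetical sketch: only the factory's name comes from the PR.
type Stage = 0 | 1 | 2 | 3;

interface RollbackStrategy {
  readonly stage: Stage;
  execute(sessionId: string): string;
}

// Placeholder strategy: a real one would undo the stage's side effects.
class StageRollback implements RollbackStrategy {
  readonly stage: Stage;

  constructor(stage: Stage) {
    this.stage = stage;
  }

  execute(sessionId: string): string {
    return `rolled back stage ${this.stage} of session ${sessionId}`;
  }
}

class RollbackStrategyFactory {
  private readonly strategies = new Map<Stage, RollbackStrategy>();

  register(strategy: RollbackStrategy): void {
    this.strategies.set(strategy.stage, strategy);
  }

  // Single place where callers obtain the strategy for a stage.
  create(stage: Stage): RollbackStrategy {
    const strategy = this.strategies.get(stage);
    if (strategy === undefined) {
      throw new Error(`no rollback strategy registered for stage ${stage}`);
    }
    return strategy;
  }
}
```

Keeping the lookup in one factory means a caller only needs the stage at which the session failed and never has to reference a concrete strategy class.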

4. Crash Detection and Handling

Added mechanisms to:

  • Detect incomplete operations (via logs).
  • Compare timestamps against session timeouts to trigger recovery/rollback.
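The timestamp comparison can be sketched as a small pure function. The decision rule used here (attempt recovery while the session timeout has not elapsed, roll back once it has) is an assumption, as are the `LogEntry` shape, the operation labels, and the `decide` function name.

```typescript
// Hypothetical sketch: the log shape and the decision rule are assumptions.
interface LogEntry {
  sessionId: string;
  // "done" marks a completed operation; anything else is incomplete.
  operation: "init" | "exec" | "done";
  timestampMs: number;
}

type Decision = "none" | "recover" | "rollback";

function decide(
  lastLog: LogEntry,
  nowMs: number,
  sessionTimeoutMs: number,
): Decision {
  if (lastLog.operation === "done") {
    return "none"; // last operation completed, nothing to recover
  }
  const elapsedMs = nowMs - lastLog.timestampMs;
  // Incomplete operation: recover while the session is still within
  // its timeout, otherwise roll back.
  return elapsedMs > sessionTimeoutMs ? "rollback" : "recover";
}
```

Passing `nowMs` in as a parameter instead of reading the clock inside the function keeps the check deterministic and easy to unit test.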

@RafaelAPB
Contributor

I will review this PR

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from f9014b0 to 0de9744 Compare August 21, 2024 20:08
@RafaelAPB
Contributor

@Yogesh01000100 please rebase with satp-dev (should not have conflicts)

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from 0de9744 to 4c0124d Compare August 23, 2024 17:19
@Yogesh01000100 Yogesh01000100 changed the title from "feat: add crash recovery and knex config for production" to "feat(recovery): add crash recovery implementation" Aug 25, 2024
@RafaelAPB
Contributor

@Yogesh01000100 please include documentation and tests, and update the description, as discussed.

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from ce9a179 to 24b8eaf Compare September 8, 2024 07:44
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from 24b8eaf to 728e7cb Compare September 16, 2024 18:56
@RafaelAPB
Contributor

@Yogesh01000100 could you please squash the commits and rebase with latest version of satp-dev, prior to merge?

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch 2 times, most recently from 1a55673 to 21ad772 Compare September 17, 2024 10:11
Contributor

@RafaelAPB RafaelAPB left a comment

Generally looks very good, but there are some changes to be done prior to merging.
Summarizing my comments:

  1. Add other authors to the commit
  2. Incorporate feedback from the logging process (namely un-hardcoding logs and adding more information)
  3. Implement RollbackState (for example, it should state, at any moment, how many more steps remain to be rolled back, what has already been rolled back, the estimated time to completion, etc.)
  4. Please add tests that support the new feature
  5. Please add comprehensive documentation on this feature. For example, the SATP README should have a section on how to run Docker Compose, with several example configurations.
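One possible shape for the RollbackState requested in item 3 is sketched below. It tracks which steps are pending, which have been rolled back, and derives a time estimate from the average duration of completed steps. All field and function names are hypothetical, and the averaging estimate is just one simple choice.

```typescript
// Hypothetical sketch of a RollbackState; not the type added by this PR.
interface RollbackState {
  sessionId: string;
  pendingSteps: string[];     // steps still to be rolled back, in order
  rolledBackSteps: string[];  // steps already rolled back
  elapsedMs: number;          // time spent on completed steps
}

// Returns a new state with the next pending step marked as rolled back.
function completeStep(state: RollbackState, stepDurationMs: number): RollbackState {
  const done = state.pendingSteps[0];
  if (done === undefined) {
    return state; // nothing left to roll back
  }
  return {
    sessionId: state.sessionId,
    pendingSteps: state.pendingSteps.slice(1),
    rolledBackSteps: [...state.rolledBackSteps, done],
    elapsedMs: state.elapsedMs + stepDurationMs,
  };
}

// Estimates remaining time as (average step duration) x (steps remaining).
// Returns null until at least one step has completed.
function estimateRemainingMs(state: RollbackState): number | null {
  const completed = state.rolledBackSteps.length;
  if (completed === 0) {
    return null;
  }
  return (state.elapsedMs / completed) * state.pendingSteps.length;
}
```

With this shape, "how many more steps" is `pendingSteps.length`, "what was rolled back already" is `rolledBackSteps`, and the estimate is recomputed after every completed step.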

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch 2 times, most recently from 49e1135 to fb703b4 Compare October 16, 2024 19:57
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from fb703b4 to b30ccb5 Compare November 3, 2024 19:16
Contributor

@LordKubaya LordKubaya left a comment

Review how sessionData is being used, and take a look at the Stage 3 question.
Please document the new code as well. The rest is being documented in this PR:
#3619

@RafaelAPB RafaelAPB force-pushed the satp-dev branch 2 times, most recently from 13e0302 to 2896426 Compare November 13, 2024 15:33
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from f0e50ef to cb24d53 Compare November 15, 2024 13:46
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from cb24d53 to d14f178 Compare November 18, 2024 16:23
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from d14f178 to 4eef528 Compare November 26, 2024 20:35
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch 2 times, most recently from d6ffbca to 1405923 Compare December 11, 2024 22:12
@Yogesh01000100 Yogesh01000100 changed the title from "feat(recovery): add crash recovery implementation" to "feat(satp-hermes): add crash recovery & rollback protocol" Dec 11, 2024
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from d73a5eb to e16e84d Compare December 13, 2024 22:11
@RafaelAPB RafaelAPB self-requested a review December 14, 2024 13:31
Contributor

@RafaelAPB RafaelAPB left a comment

I'll leave some comments:

Contributor

@RafaelAPB RafaelAPB left a comment

I'll leave some comments:

As discussed 3 months ago: @Yogesh01000100 please include documentation and tests, and update the description, as discussed.
Add other authors to the commit

Contributor

@RafaelAPB RafaelAPB left a comment

Please include CrashStatus and LocalLog types in the open api spec and import them where needed

@Yogesh01000100
Contributor Author

> Please include CrashStatus and LocalLog types in the open api spec and import them where needed

About the OpenAPI spec: it has both a tpl.json and a .json file, and I'm unsure which one to update.

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch 2 times, most recently from b094409 to 222d088 Compare December 16, 2024 09:15
Contributor

@RafaelAPB RafaelAPB left a comment

We need to consider this carefully. If sessionData remains as it is, we must handle it with care and clearly differentiate between the client and server sides of the gateway. I designed the sessionData this way to ensure that a gateway can act as both a client and server to itself.

@yogesh please address this concern
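To make the concern concrete, here is a minimal sketch of a session that keeps separate client-side and server-side data, so that a gateway acting in both roles for the same session never mixes the two. The `Session` and `SideData` shapes, the `lastSequenceNumber` field, and the `sideFor` helper are all hypothetical and do not reflect the actual sessionData definition.

```typescript
// Hypothetical sketch: field and type names are illustrative only.
type Role = "client" | "server";

interface SideData {
  lastSequenceNumber: number;
}

interface Session {
  id: string;
  client?: SideData; // present when the gateway acts as client
  server?: SideData; // present when the gateway acts as server
}

// Always select the side explicitly by role instead of assuming
// that only one side exists.
function sideFor(session: Session, role: Role): SideData {
  const side = role === "client" ? session.client : session.server;
  if (side === undefined) {
    throw new Error(`session ${session.id} has no ${role} data`);
  }
  return side;
}
```

Recovery or rollback code that takes the role as an explicit argument still works when both sides are populated, which is the case when a gateway talks to itself.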

Contributor

@RafaelAPB RafaelAPB left a comment

> Please include CrashStatus and LocalLog types in the open api spec and import them where needed
>
> about this open API spec part as it has a tpl.json and a .json, which one to update I'm a bit unsure

Please check package.json to see which one is used for generation and for what purpose.

Contributor

@RafaelAPB RafaelAPB left a comment

@RafaelAPB

Yogesh, can you confirm this has been addressed?

Contributor

@LordKubaya LordKubaya left a comment

Please resolve these issues; they are important and can cause problems in the future.

@Yogesh01000100
Contributor Author

> Please include CrashStatus and LocalLog types in the open api spec and import them where needed

Is there a specific need for these to be in the OpenAPI spec? I don't see CrashStatus being used in any API response; it's a status tracker, and the LocalLog type is fine in types. Please clarify if there is a specific requirement for them to be in OpenAPI.

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from 503658c to b56fe20 Compare December 19, 2024 18:40
Contributor

@LordKubaya LordKubaya left a comment

There are problems that need to be fixed.

@RafaelAPB
Contributor

Added a commit fixing several issues. @Yogesh01000100 could you please take a look at the tests and double check everything works?

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from 6efaa6f to 523284f Compare January 8, 2025 08:49
Contributor

@LordKubaya LordKubaya left a comment

There are some issues that need to be addressed. If there are any issues that you cannot resolve, please add a TODO comment with an explanation everywhere it is needed.

@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from 523284f to c5eaa19 Compare January 9, 2025 22:40
1. Implemented recovery & rollback using RPC-based message handlers.
2. Added rollback strategies for all SATP stages.
3. Integrated database log management for recovery and rollback.
4. Added cron jobs for scheduled crash detection and recovery initiation.

Co-authored-by: Rafael Belchior <rafael.belchior@tecnico.ulisboa.pt>
Co-authored-by: Carlos Amaro <carlosrscamaro@tecnico.ulisboa.pt>
Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

chore(satp-hermes): improve DB management

Signed-off-by: Rafael Belchior <rafael.belchior@tecnico.ulisboa.pt>

chore(satp-hermes): crash recovery architecture

Signed-off-by: Rafael Belchior <rafael.belchior@tecnico.ulisboa.pt>

fix(recovery): enhance crash recovery and rollback implementation

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

refactor(recovery): consolidate logic and improve SATP message handling

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

feat(recovery): add rollback implementations

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

fix: correct return types and inits

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

fix: add unit tests and resolve rollbackstate

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

feat: add function processing logs from g2

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

feat: add cron schedule for periodic crash checks

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

fix: resolve rollback condition and add tests

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

feat: add orchestrator communication layer using connect-RPC

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

feat: add rollback protocol rpc

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

fix: handle server log synchronization

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

fix: resolve gol errors, add unit tests

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

fix: handle server-side rollback

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>

fix: resolve networkId in rollback strategies

Signed-off-by: Yogesh01000100 <yogeshone678@gmail.com>
@Yogesh01000100 Yogesh01000100 force-pushed the feature/crash-recovery-improvements branch from c5eaa19 to 43367c9 Compare January 9, 2025 22:57
Contributor

@LordKubaya LordKubaya left a comment

LGTM

@RafaelAPB RafaelAPB merged commit 1737df4 into hyperledger-cacti:satp-dev Jan 10, 2025
7 of 8 checks passed
@Yogesh01000100 Yogesh01000100 deleted the feature/crash-recovery-improvements branch January 13, 2025 10:22
4 participants