
Implement Raft protocol to replicate KV pairs #10

Open · wants to merge 24 commits into master

Conversation


@lnikon lnikon commented Dec 28, 2024

Summary by CodeRabbit

  • New Features

    • Added Raft consensus algorithm implementation.
    • Introduced gRPC-based node communication.
    • Implemented thread safety annotations.
    • New debugging configurations for the RaftMain application.
    • Added a script to launch multiple Raft replica instances.
    • New protocol buffer definitions for Raft service and messages.
    • Introduced a new header file for thread safety macros.
    • Added unit tests for NodeClient and ConsensusModule.
  • Bug Fixes

    • Improved segment merging logic for accuracy.
  • Documentation

    • Added script to run multiple Raft replica instances.
  • Chores

    • Updated CMake configurations.
    • Enhanced compiler warning flags for thread safety.
    • Updated compiler configurations for Clang and GCC.
    • Removed unnecessary C++ standard library specifications in profiles.
    • Added Google Test framework as a project dependency.


coderabbitai bot commented Dec 28, 2024

Walkthrough

This pull request introduces new debugging configurations for the RaftMain program, enhances compiler flags for thread safety checks, and implements the Raft consensus algorithm in C++. It adds a new subdirectory for the Raft module, defines a new protocol buffer file for Raft RPCs, and creates a script to launch multiple instances of the Raft application. Additionally, it establishes a structured approach for thread safety annotations in the codebase.

Changes

File Change Summary
.vscode/launch.json Added new debugging configurations for RaftMain with GDB setup; modified comment in existing configuration
CMakeLists.txt Updated compiler flags for C and C++ to use new variables; added -Wthread-safety flag conditionally for Clang
examples/CMakeLists.txt Added raft subdirectory to build process
examples/raft/CMakeLists.txt Created new CMake file for RaftMain executable; defined executable targets for RaftMain and RaftTest
examples/raft/main.cpp Implemented Raft consensus algorithm with node_client_t and consensus_module_t classes; enhanced command-line options parsing
examples/raft/thread_safety.h Added thread safety annotation macros and mutex handling
lib/proto/CMakeLists.txt Added support for Raft.proto compilation; created new library for Raft protocol objects
lib/proto/Raft.proto Defined Raft service messages and RPCs
run_replicas.sh Created script to launch multiple Raft node instances
examples/raft/raft.cpp Implemented node_client_t and consensus_module_t classes for Raft algorithm
examples/raft/raft.h Declared node_client_t and consensus_module_t classes and their methods
conan/profiles/debug-clang Updated compiler version from 18 to 19 and removed compiler.libcxx setting
conan/profiles/debug-clang-tsan Removed compiler.libcxx setting
conan/profiles/debug-gcc Changed C++ standard version setting from -compiler.cppstd=gnu23 to +compiler.cppstd=gnu23
conan/profiles/release-clang Updated compiler version from 18 to 19 and removed compiler.libcxx setting
conan/profiles/release-gcc Changed C++ standard version setting from compiler.cppstd=gnu23 to compiler.cppstd=23
conanfile.txt Added dependency on gtest/1.15.0
examples/raft/raft_test.cpp Introduced unit tests for NodeClient and ConsensusModule components

Sequence Diagram

sequenceDiagram
    participant Candidate
    participant Follower1
    participant Follower2
    
    Candidate->>Follower1: RequestVote RPC
    Candidate->>Follower2: RequestVote RPC
    
    Follower1-->>Candidate: Vote Granted
    Follower2-->>Candidate: Vote Granted
    
    Candidate->>Candidate: Become Leader
    
    loop Heartbeat
        Candidate->>Follower1: AppendEntries RPC
        Candidate->>Follower2: AppendEntries RPC
    end

Poem

🐰 In the realm of distributed might,
Raft nodes dance with consensus bright.
Votes are cast, leaders rise and fall,
Synchronizing state through protocol's call.
A rabbit's tale of nodes in sync! 🌐



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (7)
examples/raft/thread_safety.h (2)

52-96: Commented-out Mutex class

Currently, the annotated Mutex class is left commented out. If it’s still under development, consider removing it to reduce code clutter or enabling it with an appropriate implementation.


229-279: Deprecated lock-style annotations

The lock-style annotations at the end of this file are marked as deprecated. If the project no longer relies on them, removing the dead code can improve readability. Otherwise, keep them in a dedicated legacy-compatibility file or mark them with a deprecation notice to guide users to the new annotation approach.
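For reference, here is a minimal sketch of the capability-style annotations that supersede the deprecated lock-style ones. The macro and class names below are illustrative assumptions, not the exact contents of thread_safety.h:

#include <mutex>

#if defined(__clang__)
#define CAPABILITY(x) __attribute__((capability(x)))
#define SCOPED_CAPABILITY __attribute__((scoped_lockable))
#define GUARDED_BY(x) __attribute__((guarded_by(x)))
#define REQUIRES(x) __attribute__((requires_capability(x)))
#define ACQUIRE(...) __attribute__((acquire_capability(__VA_ARGS__)))
#define RELEASE(...) __attribute__((release_capability(__VA_ARGS__)))
#else
#define CAPABILITY(x)
#define SCOPED_CAPABILITY
#define GUARDED_BY(x)
#define REQUIRES(x)
#define ACQUIRE(...)
#define RELEASE(...)
#endif

// Annotated wrapper around std::mutex so Clang's -Wthread-safety can track it.
class CAPABILITY("mutex") Mutex
{
  public:
    void lock() ACQUIRE() { m_impl.lock(); }
    void unlock() RELEASE() { m_impl.unlock(); }

  private:
    std::mutex m_impl;
};

class Counter
{
  public:
    void increment() REQUIRES(m_mutex) { ++m_value; } // warns if the caller does not hold m_mutex

    Mutex m_mutex;

  private:
    int m_value GUARDED_BY(m_mutex) = 0;
};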

examples/raft/main.cpp (1)

58-58: Consider passing 'ip' by const reference

Static analysis suggests passing 'ip' by const reference to avoid potential performance overhead and unnecessary copying when constructing NodeClient.

- NodeClient(const ID id, const IP ip)
+ NodeClient(const ID id, const IP& ip)
🧰 Tools
🪛 cppcheck (2.10-2)

[performance] 58-58: Function parameter 'ip' should be passed by const reference.

(passedByValue)

lib/proto/Raft.proto (1)

45-45: Inconsistent field naming: “lastLogterm”

For consistency with other fields like term, lastLogIndex, and typical naming conventions, consider renaming lastLogterm to lastLogTerm.

- uint32 lastLogterm = 4;
+ uint32 lastLogTerm = 4;
lib/proto/CMakeLists.txt (3)

40-41: Remove duplicate variable definitions.

These variables are already defined above (lines 11-12) and don't need to be redefined since they haven't changed.

-set(PROTO_BINARY_DIR "${CMAKE_CURRENT_BINARY_DIR}")
-set(PROTO_IMPORT_DIRS "${CMAKE_CURRENT_LIST_DIR}")

34-62: Consider reducing code duplication in proto generation setup.

The Raft proto setup duplicates the same pattern as TinyKVPP. Consider creating a function to reduce duplication and improve maintainability.

Example approach:

function(setup_proto_library TARGET_NAME PROTO_FILE)
    if(NOT EXISTS "${PROTO_FILE}")
        message(FATAL_ERROR "Proto file not found: ${PROTO_FILE}")
    endif()

    add_library(${TARGET_NAME} OBJECT "${PROTO_FILE}")
    target_include_directories(${TARGET_NAME} PUBLIC 
        "$<BUILD_INTERFACE:${PROTO_BINARY_DIR}>" 
        ${protobuf_INCLUDE_DIR})
    target_link_libraries(${TARGET_NAME} PUBLIC 
        protobuf::libprotobuf 
        gRPC::grpc++)

    protobuf_generate(
        TARGET ${TARGET_NAME}
        IMPORT_DIRS ${PROTO_IMPORT_DIRS}
        PROTOC_OUT_DIR "${PROTO_BINARY_DIR}"
    )

    protobuf_generate(
        TARGET ${TARGET_NAME}
        LANGUAGE grpc
        PLUGIN protoc-gen-grpc=$<TARGET_FILE:gRPC::grpc_cpp_plugin>
        PLUGIN_OPTIONS generate_mock_code=true
        GENERATE_EXTENSIONS .grpc.pb.h .grpc.pb.cc)
endfunction()

# Usage:
setup_proto_library(TKVProtoObjects "${TINYKVPP_PROTO_FILE}")
setup_proto_library(RaftProtoObjects "${RAFT_PROTO_FILE}")

1-3: Add version requirements for protobuf and gRPC.

Since this project relies heavily on protobuf and gRPC for the Raft implementation, consider adding minimum version requirements to ensure compatibility.

 cmake_minimum_required(VERSION 3.25)
 project(zkv)
+
+find_package(Protobuf 3.21 REQUIRED)
+find_package(gRPC 1.50 REQUIRED)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 928ff53 and 10396a7.

📒 Files selected for processing (10)
  • .vscode/launch.json (2 hunks)
  • CMakeLists.txt (1 hunks)
  • examples/CMakeLists.txt (1 hunks)
  • examples/raft/CMakeLists.txt (1 hunks)
  • examples/raft/main.cpp (1 hunks)
  • examples/raft/thread_safety.h (1 hunks)
  • lib/proto/CMakeLists.txt (2 hunks)
  • lib/proto/Raft.proto (1 hunks)
  • lib/structures/lsmtree/levels/level.cpp (1 hunks)
  • run_replicas.sh (1 hunks)
🧰 Additional context used
🪛 cppcheck (2.10-2)
examples/raft/main.cpp

[performance] 58-58: Function parameter 'ip' should be passed by const reference.

(passedByValue)

🔇 Additional comments (6)
examples/raft/thread_safety.h (2)

6-10: Clang-specific annotations and portability concern

These conditional macros work well for Clang’s thread safety analysis. However, if this header is consumed in a project that compiles with different compilers lacking Clang’s __attribute__ support, the macros become no-ops. Ensure that consumers of this header file are aware of how these macros behave in non-Clang environments.


108-124: Effective RAII-based locking

The MutexLocker class employs a clean RAII pattern to ensure mutex acquisition and release. This pattern mitigates risks of forgetting to unlock in exceptional code paths. The approach is sound and safe.
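As an illustration only (building on the annotation sketch earlier, not the actual header), such an RAII locker typically has this shape:

// Sketch of a scoped RAII locker; Mutex, SCOPED_CAPABILITY, ACQUIRE and RELEASE are
// assumed to be provided by thread_safety.h.
class SCOPED_CAPABILITY MutexLocker
{
  public:
    explicit MutexLocker(Mutex *mutex) ACQUIRE(mutex) : m_mutex(mutex) { m_mutex->lock(); }
    ~MutexLocker() RELEASE() { m_mutex->unlock(); }

    MutexLocker(const MutexLocker &) = delete;            // a lock has exactly one owner
    MutexLocker &operator=(const MutexLocker &) = delete;

  private:
    Mutex *m_mutex;
};

The mutex is released on every path out of the enclosing scope, including exceptions, which is what makes the pattern safe.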

examples/raft/main.cpp (1)

84-85: Review election timeout usage

Using generateRandomTimeout() directly as the RPC deadline is somewhat ad hoc. If the generated timeout happens to be short, requests may fail prematurely and disrupt the election process. Validate that this approach stays within the intended design constraints.
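As a point of comparison, a hedged sketch of giving RPCs their own fixed deadline instead of reusing the election timeout (the stub type and message names are assumed to match Raft.proto; the 100 ms value is arbitrary):

#include <chrono>
#include <grpcpp/grpcpp.h>

// Sketch only: keep the RPC deadline independent from the randomized election timeout.
bool requestVoteWithDeadline(RaftService::Stub &stub,
                             const RequestVoteRequest &request,
                             RequestVoteResponse *response)
{
    grpc::ClientContext context;
    constexpr auto kRpcDeadline = std::chrono::milliseconds(100); // illustrative value
    context.set_deadline(std::chrono::system_clock::now() + kRpcDeadline);

    const grpc::Status status = stub.RequestVote(&context, request, response);
    // DEADLINE_EXCEEDED means the peer was slow or unreachable, not that the vote was denied.
    return status.ok();
}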

examples/CMakeLists.txt (1)

6-6: New subdirectory addition

Adding the raft subdirectory properly includes the new Raft module. Ensure that all required libraries and dependencies (like grpc, spdlog, and the proto files) are consistently linked in the raft subdirectory’s CMakeLists.txt.

run_replicas.sh (1)

3-5: Verify the deployment strategy for production

The current configuration binds every node to the wildcard address (0.0.0.0), which is fine for local development but may need adjustment for production deployment.

Please clarify:

  1. Is this script intended for development only?
  2. For production deployment, will there be:
    • Configuration for actual host addresses?
    • Container orchestration (e.g., Kubernetes)?
    • Load balancer configuration?
.vscode/launch.json (1)

15-16: Verify external console requirement.

The configuration enables external console which differs from other configurations. Verify if this is required for the Raft implementation (e.g., for viewing distributed node logs separately).

Outdated review comments (resolved) on: examples/raft/main.cpp, lib/structures/lsmtree/levels/level.cpp, CMakeLists.txt
Comment on lines 1 to 6
cmake_minimum_required(VERSION 3.25)
project(zkv)

add_executable(RaftMain "${CMAKE_CURRENT_SOURCE_DIR}/main.cpp")
set_target_properties(RaftMain PROPERTIES CXX_STANDARD 23)
target_link_libraries(RaftMain PRIVATE DB RaftProtoObjects)

🛠️ Refactor suggestion

Add test configuration for Raft implementation

The Raft consensus is a critical component that requires thorough testing. Consider adding test configuration to ensure reliability.

Add the following after the existing code:

# Add test executable
add_executable(RaftTests "${CMAKE_CURRENT_SOURCE_DIR}/tests/raft_tests.cpp")
set_target_properties(RaftTests PROPERTIES CXX_STANDARD 23)
target_link_libraries(RaftTests PRIVATE 
    DB 
    RaftProtoObjects 
    Catch2::Catch2WithMain
)

# Register with CTest
add_test(NAME raft_tests COMMAND RaftTests)

This will help ensure:

  • Leader election correctness
  • Log replication consistency
  • Network partition handling
  • State persistence reliability

Comment on lines +1 to +5
#!/bin/bash

./build/RaftMain --id 1 --nodes 0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082 &> log_1.txt &
./build/RaftMain --id 2 --nodes 0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082 &> log_2.txt &
./build/RaftMain --id 3 --nodes 0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082 &> log_3.txt &

⚠️ Potential issue

Enhance the replica deployment script for production readiness

The current script has several limitations that could cause issues in production:

  1. No cleanup of background processes
  2. No health checks for node startup
  3. No error handling for port availability
  4. Hardcoded configuration
  5. No graceful shutdown mechanism

Replace the script with this more robust version:

-#!/bin/bash
-
-./build/RaftMain --id 1 --nodes 0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082 &> log_1.txt &
-./build/RaftMain --id 2 --nodes 0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082 &> log_2.txt &
-./build/RaftMain --id 3 --nodes 0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082 &> log_3.txt &
+#!/bin/bash
+set -euo pipefail
+
+# Configuration
+readonly BASE_PORT=8080
+readonly NUM_REPLICAS=3
+readonly BINARY="./build/RaftMain"
+readonly LOG_DIR="logs"
+
+# Create log directory
+mkdir -p "${LOG_DIR}"
+
+# Cleanup function
+cleanup() {
+    echo "Shutting down Raft nodes..."
+    kill $(jobs -p) 2>/dev/null || true
+    wait
+}
+
+# Register cleanup on script exit
+trap cleanup EXIT
+
+# Build node list
+NODES=""
+for ((i=1; i<=NUM_REPLICAS; i++)); do
+    port=$((BASE_PORT + i - 1))
+    if [ ! -z "$NODES" ]; then
+        NODES="${NODES},"
+    fi
+    NODES="${NODES}0.0.0.0:${port}"
+done
+
+# Start nodes
+for ((i=1; i<=NUM_REPLICAS; i++)); do
+    port=$((BASE_PORT + i - 1))
+    
+    # Check if port is available
+    if nc -z localhost $port 2>/dev/null; then
+        echo "Error: Port $port is already in use"
+        exit 1
+    fi
+    
+    # Start node
+    echo "Starting Raft node $i on port $port..."
+    $BINARY --id $i --nodes $NODES &> "${LOG_DIR}/log_${i}.txt" &
+    
+    # Wait briefly to ensure proper startup order
+    sleep 1
+    
+    # Check if process is running
+    if ! kill -0 $! 2>/dev/null; then
+        echo "Error: Failed to start node $i"
+        exit 1
+    fi
+done
+
+echo "All Raft nodes started. Logs available in ${LOG_DIR}/"
+echo "Press Ctrl+C to shutdown all nodes"
+
+# Wait for Ctrl+C
+wait

This enhanced version includes:

  • Process cleanup on exit
  • Port availability checking
  • Configurable number of replicas
  • Organized logging
  • Startup health checks
  • Proper error handling


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (5)
examples/raft/main.cpp (5)

32-33: Consider making timeout values configurable.

The hardcoded timeout values could be moved to configuration to allow tuning based on network conditions and deployment requirements.

-    const int minTimeout{150};
-    const int maxTimeout{300};
+    static constexpr const char* CONFIG_MIN_TIMEOUT = "RAFT_MIN_TIMEOUT";
+    static constexpr const char* CONFIG_MAX_TIMEOUT = "RAFT_MAX_TIMEOUT";
+    const int minTimeout = std::getenv(CONFIG_MIN_TIMEOUT) ? std::atoi(std::getenv(CONFIG_MIN_TIMEOUT)) : 150;
+    const int maxTimeout = std::getenv(CONFIG_MAX_TIMEOUT) ? std::atoi(std::getenv(CONFIG_MAX_TIMEOUT)) : 300;

53-60: Optimize parameter passing in constructor.

The ip parameter should be passed by const reference for better performance with strings.

-    NodeClient(const ID id, const IP ip)
+    NodeClient(const ID id, const IP& ip)
🧰 Tools
🪛 cppcheck (2.10-2)

[performance] 53-53: Function parameter 'ip' should be passed by const reference.

(passedByValue)


103-103: Remove commented-out code.

The commented-out completion queue declaration should be removed if it's not needed.

-    /*grpc::CompletionQueue                   m_cq;*/

309-309: Implement persistent state initialization.

The TODO comment indicates missing persistence implementation which is crucial for Raft's durability guarantees.

Would you like me to help implement the persistent state initialization using a simple file-based storage solution?
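For orientation, a minimal, hedged sketch of such file-based persistence for the two critical fields, currentTerm and votedFor (the struct, file format, and function names are assumptions, not part of this PR):

#include <cstdint>
#include <fstream>
#include <string>

// Sketch only: persist currentTerm and votedFor before responding to RPCs and reload
// them on startup. A real implementation should also fsync and persist the log entries.
struct PersistentState
{
    std::uint32_t currentTerm{0};
    std::uint32_t votedFor{0};
};

inline bool savePersistentState(const std::string &path, const PersistentState &state)
{
    std::ofstream out(path, std::ios::trunc);
    if (!out)
    {
        return false;
    }
    out << state.currentTerm << ' ' << state.votedFor << '\n';
    return static_cast<bool>(out.flush());
}

inline PersistentState loadPersistentState(const std::string &path)
{
    PersistentState state;
    std::ifstream in(path);
    if (in)
    {
        in >> state.currentTerm >> state.votedFor; // falls back to defaults on a fresh node
    }
    return state;
}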


591-592: Improve error message formatting.

The error message should follow consistent formatting and include the actual value for better debugging.

-        spdlog::error("ID of the node should be positve integer");
+        spdlog::error("Invalid node ID: {}. Node ID must be a positive integer.", nodeId);
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 10396a7 and b9b5995.

📒 Files selected for processing (1)
  • examples/raft/main.cpp (1 hunks)
🧰 Additional context used
🪛 cppcheck (2.10-2)
examples/raft/main.cpp

[performance] 53-53: Function parameter 'ip' should be passed by const reference.

(passedByValue)

Comment on lines 504 to 507
void decrementNextIndex(ID id)
{
(void)id;
}

⚠️ Potential issue

Implement decrementNextIndex method.

The method is currently a no-op but is crucial for Raft's log replication mechanism.

     void decrementNextIndex(ID id)
     {
-        (void)id;
+        if (m_nextIndex[id - 1] > 0) {
+            m_nextIndex[id - 1]--;
+        }
     }

Comment on lines 226 to 240
while (!m_stopElectionTimer)
{
if (token.stop_requested())
{
spdlog::info("Stopping election timer thread");
return;
}

{
absl::MutexLock locker(&m_electionMutex);
if (m_state == NodeState::LEADER)
{
continue;
}
}

⚠️ Potential issue

Fix potential race condition in election timer loop.

The continue statement inside the leader check could cause the thread to spin continuously if the node is the leader. Consider using a condition variable to wait efficiently.

                     if (m_state == NodeState::LEADER)
                     {
-                        continue;
+                        m_timerCV.WaitWithTimeout(&m_timerMutex, absl::Milliseconds(100));
+                        continue;
                     }

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 449 to 502
std::thread(
[this, &client, heartbeatInterval]()
{
while (m_state == NodeState::LEADER)
{
AppendEntriesRequest request;
{
absl::ReaderMutexLock locker(&m_electionMutex);

request.set_term(m_currentTerm);
request.set_prevlogterm(getLastLogTerm());
request.set_prevlogindex(getLastLogIndex());
request.set_leadercommit(m_commitIndex);
request.set_senderid(m_id);
}

{
AppendEntriesResponse response;
if (!client.appendEntries(request, &response))
{
spdlog::error("AppendEntriesRequest failed during heartbeat");
return;
}

auto responseTerm = response.term();
auto success = response.success();

spdlog::info(
"Received AppendEntriesResponse in requester thread peerTerm={} success={} responderId={}",
responseTerm,
success,
response.responderid());

{
absl::WriterMutexLock locker(&m_electionMutex);

if (responseTerm > m_currentTerm)
{
becomeFollower(responseTerm);
return;
}

if (!success)
{
decrementNextIndex(client.getId());
}
}
}

std::this_thread::sleep_for(heartbeatInterval);
}
})
.detach();
}

⚠️ Potential issue

Potential resource leak in heartbeat thread.

The detached thread might outlive the NodeClient reference it captures, leading to undefined behavior. Consider using a managed thread pool or ensuring proper lifecycle management.

-    std::thread(
+    auto heartbeatThread = std::make_shared<std::jthread>(
         [this, &client, heartbeatInterval]()
         {
             // ... existing code ...
-        })
-        .detach();
+        });
+    m_heartbeatThreads.push_back(heartbeatThread);

Committable suggestion skipped: line range outside the PR's diff.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (2)
examples/raft/main.cpp (2)

30-40: Consider making timeout constants configurable and more descriptive.

The hardcoded timeout values could be made configurable to allow tuning based on network conditions and deployment requirements.

-    const int minTimeout{150};
-    const int maxTimeout{300};
+    static constexpr int MIN_ELECTION_TIMEOUT_MS{150};
+    static constexpr int MAX_ELECTION_TIMEOUT_MS{300};
+    // TODO: Consider making these configurable via command line arguments

53-60: Pass 'ip' parameter by const reference for better performance.

Passing string parameters by value creates unnecessary copies.

-    NodeClient(const ID id, const IP ip)
+    NodeClient(const ID id, const IP& ip)
🧰 Tools
🪛 cppcheck (2.10-2)

[performance] 53-53: Function parameter 'ip' should be passed by const reference.

(passedByValue)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9b5995 and 614c2bc.

📒 Files selected for processing (1)
  • examples/raft/main.cpp (1 hunks)
🧰 Additional context used
🪛 cppcheck (2.10-2)
examples/raft/main.cpp

[performance] 53-53: Function parameter 'ip' should be passed by const reference.

(passedByValue)

🔇 Additional comments (3)
examples/raft/main.cpp (3)

491-494: ⚠️ Potential issue

Implement decrementNextIndex method.

The method is currently a no-op but is crucial for Raft's log replication mechanism.

     void decrementNextIndex(ID id)
     {
-        (void)id;
+        if (m_nextIndex[id - 1] > 0) {
+            m_nextIndex[id - 1]--;
+        }
     }

Likely invalid or redundant comment.


436-489: ⚠️ Potential issue

Replace detached thread with managed thread for heartbeat.

Using detached threads can lead to resource leaks and undefined behavior if the referenced objects are destroyed. Consider using a managed thread pool or at least storing the thread handle.

-        std::thread(
+        auto heartbeatThread = std::make_shared<std::jthread>(
             [this, &client, heartbeatInterval]()
             {
                 // ... existing code ...
-            })
-            .detach();
+            });
+        m_heartbeatThreads.push_back(heartbeatThread);

Likely invalid or redundant comment.
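For illustration, a self-contained sketch of the managed-thread pattern suggested above, using std::jthread and std::stop_token (the class and member names are assumptions, not the PR's code):

#include <chrono>
#include <stop_token>
#include <thread>
#include <vector>

// Sketch only: a stoppable heartbeat loop owned by the consensus module instead of a
// detached thread. sendOneHeartbeat() is a placeholder for the AppendEntries call.
class HeartbeatSender
{
  public:
    void start()
    {
        m_threads.emplace_back(
            [this](std::stop_token token)
            {
                while (!token.stop_requested())
                {
                    sendOneHeartbeat();
                    std::this_thread::sleep_for(std::chrono::milliseconds(100));
                }
            });
    }

    // std::jthread requests stop and joins in its destructor, so clearing the vector
    // is enough for a clean shutdown; no detach, no dangling 'this'.
    void stop() { m_threads.clear(); }

  private:
    void sendOneHeartbeat() { /* issue AppendEntries to one peer (placeholder) */ }

    std::vector<std::jthread> m_threads;
};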


226-240: ⚠️ Potential issue

Fix potential deadlock in election timer loop.

The current implementation might deadlock if the node becomes a leader while holding the election mutex. Consider using a condition variable with a timeout instead of a busy wait.

-                    if (m_state == NodeState::LEADER)
-                    {
-                        continue;
-                    }
+                    if (m_state == NodeState::LEADER)
+                    {
+                        m_timerCV.WaitWithTimeout(&m_timerMutex, absl::Milliseconds(100));
+                        continue;
+                    }

Likely invalid or redundant comment.

Comment on lines 56 to 57
m_channel(grpc::CreateChannel(m_ip, grpc::InsecureChannelCredentials())),
m_stub(RaftService::NewStub(m_channel))

⚠️ Potential issue

Security concern: Using insecure channel credentials.

The use of grpc::InsecureChannelCredentials() is not suitable for production environments as it exposes the communication to potential man-in-the-middle attacks.

Consider using TLS certificates for secure communication:

auto creds = grpc::SslCredentials(grpc::SslCredentialsOptions());
m_channel = grpc::CreateChannel(m_ip, creds);
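For example, a hedged sketch of wiring TLS into the channel (the certificate path and the readFile helper are illustrative assumptions):

#include <fstream>
#include <memory>
#include <sstream>
#include <string>
#include <grpcpp/grpcpp.h>

// Sketch only: load the CA bundle that signed the peers' certificates and build a TLS
// channel. Client certificates can be supplied via pem_private_key/pem_cert_chain.
inline std::string readFile(const std::string &path)
{
    std::ifstream in(path);
    std::ostringstream contents;
    contents << in.rdbuf();
    return contents.str();
}

inline std::shared_ptr<grpc::Channel> makeSecureChannel(const std::string &target)
{
    grpc::SslCredentialsOptions options;
    options.pem_root_certs = readFile("certs/ca.pem"); // path is an assumption
    return grpc::CreateChannel(target, grpc::SslCredentials(options));
}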

Comment on lines 595 to 597
cm.start();

cm.stop();

🛠️ Refactor suggestion

Implement graceful shutdown handling.

The server should handle SIGTERM/SIGINT signals for graceful shutdown instead of immediate stop.

#include <atomic>
#include <csignal>

std::atomic<bool> g_shutdownRequested{false};

void signal_handler(int signal) {
    // Only async-signal-safe work belongs here: record the request and return.
    (void)signal;
    g_shutdownRequested.store(true);
}

// In main():
std::signal(SIGTERM, signal_handler);
std::signal(SIGINT, signal_handler);
// ... then wait until g_shutdownRequested becomes true before calling cm.stop().


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
examples/raft/raft.h (1)

22-38: Consider passing ip by const reference in the NodeClient constructor.

Static analysis flagged this constructor parameter as a performance concern. Although this might be minor for small strings, passing by const reference avoids unnecessary copies, especially if IP can hold large strings or if this constructor is called frequently.

- NodeClient(const ID id, const IP ip);
+ NodeClient(const ID id, const IP& ip);
examples/raft/raft.cpp (1)

22-22: Pass ip by const reference to optimize.

As noted by static analysis, passing ip by const reference can avoid copying strings if they are large or if the constructor is called often.

- NodeClient::NodeClient(const ID id, const IP ip)
+ NodeClient::NodeClient(const ID id, const IP& ip)
🧰 Tools
🪛 cppcheck (2.10-2)

[performance] 22-22: Function parameter 'ip' should be passed by const reference.

(passedByValue)

examples/raft/main.cpp (1)

39-42: Avoid copying IP strings in the for loop.

In the loop iterating over nodeIps, consider capturing the iterated string by const reference to avoid unnecessary copies. For large or numerous IP strings, this minor refactor can improve performance:

- for (auto ip : nodeIps)
+ for (const auto& ip : nodeIps)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 614c2bc and 2e6374a.

📒 Files selected for processing (4)
  • examples/raft/CMakeLists.txt (1 hunks)
  • examples/raft/main.cpp (1 hunks)
  • examples/raft/raft.cpp (1 hunks)
  • examples/raft/raft.h (1 hunks)
🧰 Additional context used
🪛 cppcheck (2.10-2)
examples/raft/raft.cpp

[performance] 22-22: Function parameter 'ip' should be passed by const reference.

(passedByValue)

examples/raft/main.cpp

[performance] 25-25: Function parameter 'ip' should be passed by const reference.

(passedByValue)

🔇 Additional comments (4)
examples/raft/raft.h (1)

40-125: enable_shared_from_this usage check.

The class ConsensusModule inherits from std::enable_shared_from_this<ConsensusModule> but does not appear to call shared_from_this(). Verify whether this inheritance is intentional. If not, consider removing it to simplify the class.

examples/raft/raft.cpp (2)

193-198: Election timer loop concurrency risk
This comment reaffirms a previously raised concern about potential continuous spinning or premature exit when the node becomes a leader, preventing election timeout monitoring if leadership changes again.


459-463: decrementNextIndex remains unimplemented
This method is critical for the Raft log replication process but is currently a no-op. Consider implementing the logic to decrement m_nextIndex[id - 1], as recommended in a prior review.

examples/raft/CMakeLists.txt (1)

4-6: Add specialized test target for Raft.

A test configuration for the Raft consensus ensures reliability of leader election, log replication, and fault tolerance. See previous suggestion for details on adding a dedicated test executable.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

🧹 Nitpick comments (2)
examples/raft/raft.cpp (2)

31-43: Add deadline to appendEntries RPC call.

For consistency with requestVote, consider adding a deadline to prevent hanging on network issues:

 auto NodeClient::appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool
 {
     grpc::ClientContext context;
+    context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(generateRandomTimeout()));
 
     grpc::Status status = m_stub->AppendEntries(&context, request, response);

279-286: Implement persistent state storage.

The TODO comment indicates missing persistence implementation:

Would you like me to help implement the persistent state storage using RocksDB or a similar key-value store? This would ensure durability of critical Raft state across restarts.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2e6374a and e0a354b.

📒 Files selected for processing (2)
  • examples/raft/raft.cpp (1 hunks)
  • lib/structures/lsmtree/levels/level.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/structures/lsmtree/levels/level.cpp
🧰 Additional context used
🪛 cppcheck (2.10-2)
examples/raft/raft.cpp

[performance] 22-22: Function parameter 'ip' should be passed by const reference.

(passedByValue)

examples/raft/raft.cpp — 7 outdated review comments (resolved)

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
examples/raft/raft.cpp (2)

31-43: Enhance error handling in appendEntries.

The error handling could be more detailed and include retry logic for transient failures.

Consider adding:

 auto NodeClient::appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool
 {
+    constexpr int maxRetries = 3;
+    int attempts = 0;
+    while (attempts++ < maxRetries) {
     grpc::ClientContext context;
+    context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(1));
 
     grpc::Status status = m_stub->AppendEntries(&context, request, response);
     if (!status.ok())
     {
-        spdlog::error("AppendEntries RPC call failed");
+        spdlog::error("AppendEntries RPC call failed. Error code={}, message={}, attempt={}/{}",
+                      static_cast<int>(status.error_code()),
+                      status.error_message(),
+                      attempts,
+                      maxRetries);
+        if (attempts < maxRetries) {
+            std::this_thread::sleep_for(std::chrono::milliseconds(100 * attempts));
+            continue;
+        }
         return false;
     }
+    break;
+    }
 
     return true;
 }

293-300: Implement persistent state initialization.

The TODO comment indicates missing persistence implementation, which is crucial for maintaining consistency across node restarts.

Would you like me to help implement the persistent state initialization using a simple file-based storage system?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e0a354b and 2fa9fca.

📒 Files selected for processing (1)
  • examples/raft/raft.cpp (1 hunks)
🧰 Additional context used
🪛 cppcheck (2.10-2)
examples/raft/raft.cpp

[performance] 22-22: Function parameter 'ip' should be passed by const reference.

(passedByValue)

🔇 Additional comments (5)
examples/raft/raft.cpp (5)

1-21: LGTM! Headers are well-organized and complete.

The includes are properly categorized and contain all necessary dependencies for gRPC, abseil synchronization primitives, and logging.


401-405: Fix incorrect majority calculation.

The majority calculation is incorrect as noted in the previous review. Using the existing fix:

 auto ConsensusModule::hasMajority(const uint32_t votes) const -> bool
 {
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    // Total nodes = replicas + self
+    const size_t totalNodes = m_replicas.size() + 1;
+    return votes > totalNodes / 2;
 }

429-488: 🛠️ Refactor suggestion

Enhance heartbeat thread reliability.

The heartbeat mechanism needs improvements:

  1. Add retry logic for failed heartbeats
  2. Implement backoff strategy
  3. Add thread cleanup on errors
 void ConsensusModule::sendHeartbeat(NodeClient &client)
 {
-    constexpr const auto heartbeatInterval{std::chrono::milliseconds(100)};
+    constexpr const auto baseHeartbeatInterval{std::chrono::milliseconds(100)};
+    constexpr const int maxRetries = 3;
+    constexpr const auto maxBackoff{std::chrono::milliseconds(1000)};
 
     m_heartbeatThreads.emplace_back(
-        [this, &client, heartbeatInterval](std::stop_token token)
+        [this, &client](std::stop_token token)
         {
+            auto heartbeatInterval = baseHeartbeatInterval;
+            int consecutiveFailures = 0;

Likely invalid or redundant comment.


22-29: 🛠️ Refactor suggestion

Enhance constructor robustness and efficiency.

The constructor should validate inputs and handle connection failures:

-NodeClient::NodeClient(const ID id, const IP ip)
+NodeClient::NodeClient(const ID id, const IP& ip)
     : m_id{id},
       m_ip{ip},
       m_channel(grpc::CreateChannel(m_ip, grpc::InsecureChannelCredentials())),
       m_stub(RaftService::NewStub(m_channel))
 {
     assert(m_id > 0);
+    assert(!m_ip.empty());
+    
+    // Verify channel is valid
+    if (!m_channel || !m_stub) {
+        throw std::runtime_error("Failed to establish gRPC channel to " + m_ip);
+    }
 }

Likely invalid or redundant comment.

🧰 Tools
🪛 cppcheck (2.10-2)

[performance] 22-22: Function parameter 'ip' should be passed by const reference.

(passedByValue)


66-96: 🛠️ Refactor suggestion

Enhance constructor's thread safety and error handling.

The constructor needs better initialization and validation:

  1. Initialize atomic variables explicitly
  2. Add error handling for server startup
  3. Validate replica IDs
 ConsensusModule::ConsensusModule(const ID id, std::vector<IP> replicas)
     : m_id{id},
       m_currentTerm{0},
       m_votedFor{0},
-      m_state{NodeState::FOLLOWER}
+      m_state{NodeState::FOLLOWER},
+      m_stopElectionTimer{false},
+      m_electionInProgress{false},
+      m_leaderHeartbeatReceived{false},
+      m_voteCount{0}
 {
     assert(m_id > 0);
     assert(replicas.size() > 0);
     assert(m_id <= replicas.size());
 
     m_ip = replicas[m_id - 1];
+    if (m_ip.empty()) {
+        throw std::runtime_error("Invalid IP address for node " + std::to_string(m_id));
+    }
 
     grpc::ServerBuilder builder;
     builder.AddListeningPort(m_ip, grpc::InsecureServerCredentials());
     builder.RegisterService(this);
 
     m_server = builder.BuildAndStart();
+    if (!m_server) {
+        throw std::runtime_error("Failed to start gRPC server on " + m_ip);
+    }

Likely invalid or redundant comment.

Comment on lines 172 to 179
/*if (pRequest->lastlogterm() < getLastLogTerm() ||*/
/* (pRequest->lastlogterm() == getLastLogTerm() && pRequest->lastlogindex() < getLastLogIndex()))*/
/*{*/
/* pResponse->set_votegranted(0);*/
/* return grpc::Status::OK;*/
/*}*/
/**/
return grpc::Status::OK;

⚠️ Potential issue

Uncomment and fix the log comparison logic in RequestVote.

The commented-out log comparison logic is essential for the Raft protocol's safety properties. It should be uncommented and fixed.

-    /*if (pRequest->lastlogterm() < getLastLogTerm() ||*/
-    /*    (pRequest->lastlogterm() == getLastLogTerm() && pRequest->lastlogindex() < getLastLogIndex()))*/
-    /*{*/
-    /*    pResponse->set_votegranted(0);*/
-    /*    return grpc::Status::OK;*/
-    /*}*/
-    /**/
+    // Rule: Candidate's log must be at least as up-to-date as receiver's log
+    if (pRequest->lastlogterm() < getLastLogTerm() ||
+        (pRequest->lastlogterm() == getLastLogTerm() && pRequest->lastlogindex() < getLastLogIndex()))
+    {
+        pResponse->set_term(m_currentTerm);
+        pResponse->set_votegranted(0);
+        pResponse->set_responderid(m_id);
+        return grpc::Status::OK;
+    }


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (1)
examples/raft/raft.cpp (1)

422-426: ⚠️ Potential issue

Fix incorrect majority calculation.
The current logic can lead to split votes. Please use total nodes (replicas.size() + 1) instead of just replicas when determining majority.

-constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+const size_t totalNodes = m_replicas.size() + 1; // includes self
+return votes > totalNodes / 2;
🧹 Nitpick comments (7)
examples/raft/raft.cpp (5)

25-43: Strengthen inline IP validation and logging.
Currently, the constructor only checks that the IP is non-empty and logs channel-creation issues. Consider adding further checks or context (e.g., validating the IP/port format) to catch configuration errors proactively.


45-57: Add trace-level logs for debug support.
The appendEntries RPC call is crucial to Raft. Adding trace logs (e.g., printing the request’s term) can help debug replication issues.


59-73: Implement consistent RPC deadlines.
You set a deadline for RequestVote, whereas appendEntries has no per-RPC deadline. Consider aligning both to ensure consistent timeout behavior.


80-114: Consider checking the validity of term or replica size mismatch.
In the constructor, you assume m_id <= replicas.size(). If this is violated, the assertion aborts in debug builds (and is compiled out in release builds); consider converting it into a runtime error or a user-friendly message for production builds.


288-312: Ensure graceful handling of any outstanding RPC calls on stop.
Consider waiting for in-flight RPCs to complete or adding a short deadline prior to forcibly shutting down the gRPC server.

examples/raft/raft.h (2)

20-20: Consider adding randomness seeding or injection for reproducibility.
generateRandomTimeout() most likely relies on an internal PRNG. For test determinism, consider seeding it explicitly or allowing a configurable random source.
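A brief, hedged sketch of an injectable, seedable timeout source (the names and range below are assumptions):

#include <chrono>
#include <cstdint>
#include <random>

// Sketch only: tests construct this with a fixed seed for deterministic timeouts;
// production code can rely on the std::random_device default.
class ElectionTimeout
{
  public:
    explicit ElectionTimeout(std::uint64_t seed = std::random_device{}())
        : m_engine(seed), m_dist(150, 300) // typical Raft election timeout range, in ms
    {
    }

    std::chrono::milliseconds next() { return std::chrono::milliseconds(m_dist(m_engine)); }

  private:
    std::mt19937_64                    m_engine;
    std::uniform_int_distribution<int> m_dist;
};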


72-75: Document the lock invariants.
It is helpful to clarify in comments that becomeLeader and sendHeartbeat require the m_stateMutex to be locked. This ensures future maintainers understand the concurrency design.
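One hedged way to do this is to state the requirement in the declarations themselves so that -Wthread-safety checks call sites (member names assumed from the review excerpts):

#include "absl/base/thread_annotations.h"
#include "absl/synchronization/mutex.h"

// Sketch only: the annotation turns the comment "caller must hold m_stateMutex"
// into something the compiler can verify.
class ConsensusModuleSketch
{
  public:
    void becomeLeader() ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
    void sendHeartbeat() ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);

  private:
    absl::Mutex m_stateMutex;
};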

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2fa9fca and 0832944.

📒 Files selected for processing (2)
  • examples/raft/raft.cpp (1 hunks)
  • examples/raft/raft.h (1 hunks)
🔇 Additional comments (2)
examples/raft/raft.cpp (1)

154-196: Ensure you handle a second vote request from the same candidate within the same term.
When m_votedFor != 0 and is not the candidate’s ID, you immediately continue, but you might want to short-circuit with a negative response.

examples/raft/raft.h (1)

40-40: Revisit the multiple inheritance approach.
Inheriting from both RaftService::Service and std::enable_shared_from_this is valid but can complicate object lifetime. Confirm that shared pointers are consistently managed to avoid double free or dangling references.

Comment on lines 116 to 152
auto ConsensusModule::AppendEntries(grpc::ServerContext *pContext,
const AppendEntriesRequest *pRequest,
AppendEntriesResponse *pResponse) -> grpc::Status
{
(void)pContext;
(void)pRequest;
(void)pResponse;

spdlog::info("Recevied AppendEntries RPC from leader={} during term={}", pRequest->senderid(), pRequest->term());

absl::MutexLock locker(&m_stateMutex);
absl::MutexLock timerLocker(&m_timerMutex);

if (pRequest->term() < m_currentTerm)
{
pResponse->set_term(m_currentTerm);
pResponse->set_success(false);
pResponse->set_responderid(m_id);
return grpc::Status::OK;
}

if (pRequest->term() > m_currentTerm)
{
becomeFollower(pRequest->term());
}

pResponse->set_term(m_currentTerm);
pResponse->set_success(true);
pResponse->set_responderid(m_id);

spdlog::info("Node={} is resetting election timeout at term={}", m_id, m_currentTerm);
resetElectionTimer();

m_leaderHeartbeatReceived.store(true);

return grpc::Status::OK;
}

⚠️ Potential issue

Validate the log index in AppendEntries.
The method always sets success=true without verifying log continuity (as per Raft spec). Not rejecting an inconsistent log entry could lead to divergent logs.
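For reference, a hedged sketch of the missing consistency check (the protobuf accessors follow the names used in the excerpts above; m_log and getLastLogIndex() are assumed members):

// Sketch only: reject entries whose prevLogIndex/prevLogTerm do not match the local log,
// so the leader backs off via decrementNextIndex and retries with earlier entries.
if (pRequest->prevlogindex() > getLastLogIndex() ||
    (pRequest->prevlogindex() > 0 &&
     m_log[pRequest->prevlogindex() - 1].term() != pRequest->prevlogterm()))
{
    pResponse->set_term(m_currentTerm);
    pResponse->set_success(false);
    pResponse->set_responderid(m_id);
    return grpc::Status::OK;
}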

Comment on lines 450 to 495
void ConsensusModule::sendHeartbeat(NodeClient &client)
{
constexpr const auto heartbeatInterval{std::chrono::milliseconds(10)};
constexpr const int maxRetries{3};

m_heartbeatThreads.emplace_back(
[this, maxRetries, &client, heartbeatInterval](std::stop_token token)
{
spdlog::info("Node={} is starting a heartbeat thread for client={}", m_id, client.getId());

int consecutiveFailures = 0;
while (!token.stop_requested())
{
AppendEntriesRequest request;
{
absl::ReaderMutexLock locker(&m_stateMutex);
if (m_state != NodeState::LEADER)
{
spdlog::info("Node={} is no longer a leader. Stopping the heartbeat thread");
break;
}

request.set_term(m_currentTerm);
request.set_prevlogterm(getLastLogTerm());
request.set_prevlogindex(getLastLogIndex());
request.set_leadercommit(m_commitIndex);
request.set_senderid(m_id);
}

{
AppendEntriesResponse response;
if (!client.appendEntries(request, &response))
{
consecutiveFailures++;

consecutiveFailures = 0;
spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
consecutiveFailures,
maxRetries);
if (consecutiveFailures >= maxRetries)
{
return;
}

continue;
}

⚠️ Potential issue

Fix the consecutive failures logic.
Within the heartbeat loop, consecutiveFailures++ is immediately reset to zero. This prevents retries from accumulating.

 if (!client.appendEntries(request, &response)) {
     consecutiveFailures++;
-    consecutiveFailures = 0;  // This negates the increment
     spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
                   consecutiveFailures,
                   maxRetries);
     ...
 }


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
.vscode/launch.json (1)

7-33: 🛠️ Refactor suggestion

Enhance RaftMain debug configuration for Raft development

The configuration needs improvements for effectively debugging Raft nodes:

  1. Add preLaunchTask to ensure RaftMain is compiled before debugging
  2. Add command line arguments for Raft configuration (node ID, peer list)
  3. Make the debug port configurable using VS Code input variables

Apply this diff to enhance the configuration:

         {
             "name": "rr - RaftMain",
             "type": "cppdbg",
             "request": "launch",
             "program": "${workspaceFolder}/build/RaftMain",
-            "miDebuggerServerAddress": "localhost:50505",
+            "miDebuggerServerAddress": "localhost:${input:debugPort}",
+            "args": [
+                "--node-id", "${input:nodeId}",
+                "--config", "${workspaceFolder}/assets/raft_config.json"
+            ],
             "stopAtEntry": false,
             "cwd": "${workspaceFolder}",
             "environment": [],
             "externalConsole": true,
+            "preLaunchTask": "build",
             "linux": {

Add these input configurations at the root level:

    "inputs": [
        {
            "id": "debugPort",
            "type": "promptString",
            "description": "Debug port for remote debugging",
            "default": "50505"
        },
        {
            "id": "nodeId",
            "type": "promptString",
            "description": "Raft node ID",
            "default": "1"
        }
    ]
🧹 Nitpick comments (3)
lib/proto/Raft.proto (1)

8-17: Unify naming conventions in enums and fields
Consider using a consistent naming style (e.g., lastLogTerm) in the request messages or switching to snake_case to match common protobuf conventions.

examples/raft/main.cpp (1)

69-69: Consider implementing signal-based graceful shutdown
Currently, the Raft node stops only after the main logic completes. Adding handling of SIGTERM/SIGINT to initiate a controlled shutdown can prevent data inconsistencies or partial writes.

.vscode/launch.json (1)

34-64: Align Debug - RaftMain configuration with other debug configurations

The configuration should be consistent with other debug configurations in the file:

  1. Add command line arguments for Raft configuration
  2. Add launchCompleteCommand for consistent behavior
  3. Add missing field descriptions

Apply this diff:

         {
             "name": "Debug - RaftMain",
             "type": "cppdbg",
             "request": "attach",
             "program": "${workspaceFolder}/build/RaftMain",
             "processId": "${command:pickProcess}",
             "MIMode": "gdb",
+            "args": [ // Arguments for Raft configuration
+                "--node-id", "1",
+                "--config", "${workspaceFolder}/assets/raft_config.json"
+            ],
             "setupCommands": [
                 {
                     "description": "Enable pretty-printing for gdb",
                     "text": "-enable-pretty-printing",
                     "ignoreFailures": true
                 }
             ],
             "preLaunchTask": "build",
             "miDebuggerPath": "/usr/bin/gdb",
             "logging": {
                 "trace": true,
                 "traceResponse": true,
                 "engineLogging": false
             },
             "targetArchitecture": "x86_64",
             "pipeTransport": {
                 "pipeCwd": "",
                 "pipeProgram": "/bin/bash",
                 "pipeArgs": [
                     "-c"
                 ],
                 "debuggerPath": "/usr/bin/gdb"
             },
+            "launchCompleteCommand": "exec-run" // Consistent with other configurations
         },
🧰 Tools
🪛 Biome (1.9.4)

[error] 64-64: Expected a property but instead found '}'.

Expected a property here.

(parse)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0832944 and 2b7359e.

📒 Files selected for processing (6)
  • .vscode/launch.json (2 hunks)
  • examples/raft/CMakeLists.txt (1 hunks)
  • examples/raft/main.cpp (1 hunks)
  • examples/raft/raft.cpp (1 hunks)
  • examples/raft/raft.h (1 hunks)
  • lib/proto/Raft.proto (1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
.vscode/launch.json

[error] 64-64: Expected a property but instead found '}'.

Expected a property here.

(parse)

🔇 Additional comments (9)
examples/raft/raft.cpp (5)

28-28: Use atomic or mutex to protect gFirstElection
gFirstElection is a global integer that is written from multiple threads (e.g., in the election timer and upon heartbeat reception). For thread safety, consider making this variable atomic or protecting it with a mutex.
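A sketch of the atomic variant, assuming gFirstElection is only read and written as a single value (if it participates in a larger invariant together with other state, a mutex is the safer choice):

#include <atomic>

// Illustrative replacement for the plain int global.
std::atomic<int> gFirstElection{1};

// Writer side, e.g. when the first heartbeat is observed:
//     gFirstElection.store(0, std::memory_order_release);
// Reader side, e.g. in the election timer:
//     if (gFirstElection.load(std::memory_order_acquire)) { /* first-election handling */ }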


33-34: Security concern: Using insecure channel credentials
This code uses grpc::InsecureChannelCredentials(), which is typically insecure for production. Consider TLS or an equivalent secure channel.
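A sketch of a TLS-secured channel for comparison, assuming PEM-encoded certificates are available on disk; the file paths and the readPem helper are placeholders, not part of this PR:

#include <fstream>
#include <memory>
#include <sstream>
#include <string>
#include <grpcpp/grpcpp.h>

// Placeholder helper: read a PEM file into a string.
std::string readPem(const std::string &path)
{
    std::ifstream     ifs(path);
    std::stringstream buffer;
    buffer << ifs.rdbuf();
    return buffer.str();
}

std::shared_ptr<grpc::Channel> makeSecureChannel(const std::string &target)
{
    grpc::SslCredentialsOptions options;
    options.pem_root_certs  = readPem("certs/ca.pem");     // trust anchor
    options.pem_private_key = readPem("certs/client.key"); // only needed for mutual TLS
    options.pem_cert_chain  = readPem("certs/client.pem"); // only needed for mutual TLS
    return grpc::CreateChannel(target, grpc::SslCredentials(options));
}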


509-513: Fix incorrect majority calculation
This logic uses votes > m_replicas.size() / 2.0 and lacks the node’s own vote in the total. Consider using (m_replicas.size() + 1) to account for the node itself.


569-581: Fix the consecutive failures logic
consecutiveFailures++ is immediately reset to zero on line 571, preventing retries from incrementing. This makes the retry loop unusable as intended.


133-134: Unused or uninitialized server thread
While m_serverThread is declared and stopped in stop(), it never appears to be started. This might cause confusion or unexpected deadlocks. Verify if it’s needed and ensure proper initialization if so.

Also applies to: 421-423

examples/raft/CMakeLists.txt (1)

4-6: Add test configuration for Raft implementation
Comprehensive testing of Raft is crucial. Adding a test binary and registering tests ensures reliability of leader election, log replication, and fault tolerance.
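As a starting point, a minimal GoogleTest sketch of the kind of case the RaftTest target could register (linked against gtest_main); the single-node setup and the expectation are illustrative and assume the ConsensusModule interface introduced in this PR:

#include <vector>
#include <gtest/gtest.h>
#include "raft.h"

// Illustrative smoke test: a one-node "cluster" should initialize its
// persistent state without errors before the full election and replication
// scenarios are added.
TEST(ConsensusModuleTest, SingleNodeInitializes)
{
    ConsensusModule node(/*nodeId=*/1, std::vector<IP>{"127.0.0.1:9001"});
    EXPECT_TRUE(node.init());
}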

examples/raft/main.cpp (1)

25-25: Environment variable usage may have portability issues
Using setenv("ASAN_OPTIONS", ...) is helpful for debugging, but consider limiting this to debug builds to avoid altering production behavior.

examples/raft/raft.h (2)

74-75: Unused method initializeVolatileState
Although declared, initializeVolatileState() has no definition or usage in the codebase. If it's part of your design, ensure it is implemented. Otherwise, remove it to avoid confusion.
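If the method is meant to stay, a plausible definition is to reset the leader's volatile bookkeeping after winning an election; this sketch assumes the m_replicas, m_nextIndex, and m_matchIndex members discussed elsewhere in this review and is not the author's implementation:

void ConsensusModule::initializeVolatileState()
{
    // Raft §5.3: on becoming leader, nextIndex starts just past the local log
    // and matchIndex starts at zero for every peer.
    const auto nextIndex = getLastLogIndex() + 1;
    for (const auto &[peerId, client] : m_replicas)
    {
        (void)client;
        m_nextIndex[peerId]  = nextIndex;
        m_matchIndex[peerId] = 0;
    }
}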


82-83: Confirm concurrency invariants for sendAppendEntriesRPC
sendAppendEntriesRPC is not marked with the same lock requirements as sendHeartbeat. Verify that you consistently acquire the necessary locks in both methods so that log mutations are always thread-safe.
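One way to make the expectation explicit is to annotate both declarations so Clang's -Wthread-safety (enabled in this PR) can check callers; whether LOCKS_REQUIRED or LOCKS_EXCLUDED is appropriate depends on whether each method takes m_stateMutex itself, so treat the following as a sketch:

// In raft.h, assuming callers are expected to hold m_stateMutex:
void sendHeartbeat() ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
void sendAppendEntriesRPC(NodeClient &client, std::vector<LogEntry> logEntries)
    ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);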

],
"debuggerPath": "/usr/bin/gdb"
},
},

⚠️ Potential issue

Fix JSON syntax error

Remove the trailing comma after the pipeTransport object.

Apply this diff:

                 ],
                 "debuggerPath": "/usr/bin/gdb"
-            },
+            }
         },
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
},
],
"debuggerPath": "/usr/bin/gdb"
}
},
🧰 Tools
🪛 Biome (1.9.4)

[error] 64-64: Expected a property but instead found '}'.

Expected a property here.

(parse)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

♻️ Duplicate comments (3)
examples/raft/raft.cpp (3)

141-224: ⚠️ Potential issue

Fix log replication safety issues.

The AppendEntries implementation has several critical issues:

  1. Missing validation of request fields
  2. Potential off-by-one errors in log indexing
  3. Inconsistent state updates
+    // Validate request fields
+    if (!pRequest) {
+        return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT, "Null request");
+    }

     // 3. Append new entries and remove conflicting ones
     auto newEntryStart = pRequest->prevlogindex() + 1;
+    // Check for integer overflow
+    if (newEntryStart < pRequest->prevlogindex()) {
+        return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT, "Index overflow");
+    }

-    m_log.insert(m_log.end(), pRequest->entries().begin(), pRequest->entries().end());
+    // Validate and append entries
+    for (const auto& entry : pRequest->entries()) {
+        if (entry.term() < m_currentTerm) {
+            return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Entry from previous term");
+        }
+        m_log.push_back(entry);
+    }

481-584: ⚠️ Potential issue

Fix election safety issues.

The election implementation has critical safety issues:

  1. Race condition in vote counting
  2. Incorrect majority calculation (as noted in past reviews)
  3. Missing timeout handling for concurrent elections
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    // Total nodes = replicas + self
+    const size_t totalNodes = m_replicas.size() + 1;
+    return votes > totalNodes / 2;

     void startElection() {
+        const auto electionTimeout = std::chrono::seconds(generateRandomTimeout());
+        const auto deadline = std::chrono::steady_clock::now() + electionTimeout;
+
         RequestVoteRequest request;
         {
             absl::WriterMutexLock locker(&m_stateMutex);
+            if (std::chrono::steady_clock::now() >= deadline) {
+                spdlog::warn("Election timed out before starting");
+                return;
+            }

586-660: ⚠️ Potential issue

Fix heartbeat reliability issues.

The heartbeat implementation has several critical issues:

  1. Incorrect retry logic (consecutiveFailures reset too early)
  2. Potential thread leaks in error paths
  3. Missing timeout handling
     if (!client.appendEntries(request, &response)) {
         consecutiveFailures++;
-        consecutiveFailures = 0;  // This negates the increment
         spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
                       consecutiveFailures,
                       maxRetries);
         if (consecutiveFailures >= maxRetries) {
+            absl::WriterMutexLock locker(&m_stateMutex);
+            becomeFollower(m_currentTerm);  // Step down if can't reach majority
             return;
         }
         continue;
     }
🧹 Nitpick comments (2)
examples/raft/raft.h (1)

1-24: Add include guards and improve documentation.

While #pragma once works for most modern compilers, consider adding traditional include guards for maximum compatibility. Also add documentation for type aliases and threading model.

 #pragma once
+#ifndef RAFT_H_
+#define RAFT_H_

+// Type aliases for improved readability and maintainability
 using IP = std::string;         // Represents an IP address
 using ID = uint32_t;           // Unique identifier for nodes
 using Clock = std::chrono::high_resolution_clock;
 using TimePoint = std::chrono::high_resolution_clock::time_point;

 // Valid IDs start from 1
 constexpr const ID invalidId = 0;

+#endif  // RAFT_H_
examples/raft/raft.cpp (1)

1-796: Improve logging for better observability.

The logging implementation needs improvements:

  1. Add structured logging with consistent fields
  2. Use appropriate log levels
  3. Add logging for all error paths
+    // Add logging configuration
+    spdlog::set_pattern("[%Y-%m-%d %H:%M:%S.%e] [%n] [%^%l%$] [%s:%#] %v");

-    spdlog::debug("Node={} starts election. New term={}", m_id, m_currentTerm);
+    spdlog::debug("Starting election node_id={} current_term={} state={}", 
+                  m_id, m_currentTerm, static_cast<int>(m_state));

-    spdlog::error("AppendEntries RPC call failed");
+    spdlog::error("RPC failed operation={} node_id={} peer_id={} error_code={} error_message={}", 
+                  "AppendEntries", m_id, client.getId(), 
+                  static_cast<int>(status.error_code()), status.error_message());
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2b7359e and f53b501.

📒 Files selected for processing (2)
  • examples/raft/raft.cpp (1 hunks)
  • examples/raft/raft.h (1 hunks)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (7)
examples/raft/raft.cpp (6)

56-67: ⚠️ Potential issue

Correct the error messages in exceptions.

In the NodeClient constructor, the error messages incorrectly use m_id twice when formatting the error string. The second m_id should be m_ip to display the correct IP address.

Apply this diff to fix the error messages:

-throw std::runtime_error(fmt::format("Failed to establish a gRPC channel for node={} ip={}", m_id, m_id));
+throw std::runtime_error(fmt::format("Failed to establish a gRPC channel for node={} ip={}", m_id, m_ip));

Similarly, update the other error messages:

-throw std::runtime_error(fmt::format("Failed to create a stub for node={} ip={}", m_id, m_id));
+throw std::runtime_error(fmt::format("Failed to create a stub for node={} ip={}", m_id, m_ip));

-throw std::runtime_error(fmt::format("Failed to create a KV stub for node={} ip={}", m_id, m_id));
+throw std::runtime_error(fmt::format("Failed to create a KV stub for node={} ip={}", m_id, m_ip));

137-138: ⚠️ Potential issue

Avoid unsafe dynamic_cast and check for nullptr.

In the ConsensusModule constructor, you are using dynamic_cast without checking for a nullptr result, which can lead to undefined behavior if the cast fails.

Consider modifying the code to check if the cast is successful before using the result:

-auto raftService = dynamic_cast<RaftService::Service*>(this);
-auto kvService = dynamic_cast<TinyKVPPService::Service*>(this);
+auto* raftService = dynamic_cast<RaftService::Service*>(this);
+auto* kvService = dynamic_cast<TinyKVPPService::Service*>(this);
+if (!raftService || !kvService) {
+    throw std::runtime_error("Invalid service cast in ConsensusModule constructor");
+}

 builder.AddListeningPort(m_ip, grpc::InsecureServerCredentials());
-builder.RegisterService(dynamic_cast<RaftService::Service *>(this));
-builder.RegisterService(dynamic_cast<TinyKVPPService::Service *>(this));
+builder.RegisterService(raftService);
+builder.RegisterService(kvService);

647-649: ⚠️ Potential issue

Fix the logic error with consecutiveFailures.

Inside the heartbeat loop in sendHeartbeat, consecutiveFailures is incremented and then immediately reset to zero. This negates the increment and prevents retry logic from working correctly.

Apply this diff to correct the logic:

 if (!client.appendEntries(request, &response))
 {
     consecutiveFailures++;
-    consecutiveFailures = 0;
     spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
                   consecutiveFailures,
                   maxRetries);
     if (consecutiveFailures >= maxRetries)
     {
         return;
     }
     continue;
 }

471-491: ⚠️ Potential issue

Ensure proper shutdown and cleanup in stop() method.

In the ConsensusModule::stop() method, there are potential issues with thread management and server shutdown:

  • The m_serverThread is being requested to stop and joined, but it's not clear where this thread was started.
  • The m_raftServer->Shutdown() should be called before stopping the election and heartbeat threads to prevent new RPCs from being accepted during shutdown.

Consider reordering the shutdown sequence and ensuring all threads are properly managed:

 void ConsensusModule::stop()
 {
     absl::WriterMutexLock locker{&m_stateMutex};

+    if (m_raftServer)
+    {
+        m_raftServer->Shutdown();
+    }

     m_stopElection = true;

     m_electionThread.request_stop();
     m_electionThread.join();

     for (auto &heartbeatThread : m_heartbeatThreads)
     {
         heartbeatThread.request_stop();
         heartbeatThread.join();
     }
     m_heartbeatThreads.clear();

-    if (m_raftServer)
-    {
-        m_raftServer->Shutdown();
-    }

-    m_serverThread.request_stop();
-    m_serverThread.join();
 }

128-131: ⚠️ Potential issue

Initialize m_matchIndex and m_nextIndex for all replicas.

The m_matchIndex and m_nextIndex maps are not initialized in the constructor, which can lead to undefined behavior when accessing these values later.

Add initialization for these maps in the constructor:

 for (auto [id, ip] : std::ranges::views::enumerate(replicas))
 {
     if (id + 1 == m_id)
     {
         continue;
     }

     m_replicas.emplace(id + 1, NodeClient(id + 1, ip));
+    // Initialize matchIndex and nextIndex for each replica
+    m_matchIndex[id + 1] = 0;
+    m_nextIndex[id + 1] = getLastLogIndex() + 1;
 }

169-240: ⚠️ Potential issue

Ensure compliance with Raft protocol in AppendEntries.

The AppendEntries method appears to leave parts of the Raft protocol incomplete, such as:

  • Not updating m_currentTerm correctly.
  • Missing reset of m_votedFor.
  • Incomplete log replication logic.

Review the method to ensure it fully complies with the Raft specification, especially Sections 5.3 and 5.4.

examples/raft/raft.h (1)

117-121: ⚠️ Potential issue

Add thread safety annotations for member variables.

Member variables that are accessed concurrently should have appropriate thread safety annotations to prevent data races and ensure correctness.

Add ABSL_GUARDED_BY annotations:

 uint32_t m_currentTerm      ABSL_GUARDED_BY(m_stateMutex);
 uint32_t m_votedFor         ABSL_GUARDED_BY(m_stateMutex);
 NodeState m_state           ABSL_GUARDED_BY(m_stateMutex);
 std::vector<LogEntry> m_log ABSL_GUARDED_BY(m_stateMutex);
-/*db::wal::wal_t m_wal        ABSL_GUARDED_BY(m_stateMutex);*/
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f53b501 and 497b88d.

📒 Files selected for processing (2)
  • examples/raft/raft.cpp (1 hunks)
  • examples/raft/raft.h (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-gcc
  • GitHub Check: build-clang
🔇 Additional comments (4)
examples/raft/raft.cpp (1)

379-389: Handle failure of persistent state initialization in init().

In the init() method, if initializePersistentState() fails, the node currently logs a warning but still returns false. There might be a need for more robust error handling or recovery strategies.

Please confirm whether it's acceptable for the node to proceed if persistent state initialization fails. If not, consider terminating the node or implementing a retry mechanism.
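If proceeding without persistent state is not acceptable, a bounded retry loop is one option; the sketch below is illustrative, reuses the initializePersistentState() and spdlog calls already present, and the retry counts are arbitrary:

constexpr int  kMaxInitAttempts = 3;
constexpr auto kInitRetryDelay  = std::chrono::milliseconds(200);

bool initialized = false;
for (int attempt = 1; attempt <= kMaxInitAttempts && !initialized; ++attempt)
{
    initialized = initializePersistentState();
    if (!initialized)
    {
        spdlog::warn("Persistent state initialization failed (attempt {}/{})", attempt, kMaxInitAttempts);
        std::this_thread::sleep_for(kInitRetryDelay);
    }
}

if (!initialized)
{
    spdlog::error("Node={} giving up on persistent state initialization", m_id);
    return false; // refuse to start rather than run with unknown state
}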

examples/raft/raft.h (3)

30-37: Add virtual destructor to NodeClient.

The NodeClient class contains virtual methods due to the use of gRPC stubs. It is recommended to have a virtual destructor to ensure proper cleanup of derived classes.

Add a virtual destructor:

 class NodeClient
 {
   public:
     NodeClient(ID nodeId, IP nodeIp);
+    virtual ~NodeClient() = default;

     auto appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool;
     auto requestVote(const RequestVoteRequest &request, RequestVoteResponse *response) -> bool;

48-75: Improve documentation and thread safety annotations in ConsensusModule.

The ConsensusModule class lacks sufficient documentation and thread safety annotations, which are important for maintainability and correctness.

Consider adding method documentation and ensuring that thread safety annotations are consistently applied:

 class ConsensusModule : public RaftService::Service,
                         public TinyKVPPService::Service,
                         std::enable_shared_from_this<ConsensusModule>
 {
   public:
     // Constructs a ConsensusModule with the given node ID and list of replica IPs.
     ConsensusModule(ID nodeId, std::vector<IP> replicas);

     // Raft RPCs
     auto AppendEntries(grpc::ServerContext        *pContext,
                        const AppendEntriesRequest *pRequest,
                        AppendEntriesResponse      *pResponse) -> grpc::Status override;

     // Other methods...
+    // Initializes the persistent state. Must be called before start().
     auto init() -> bool;

+    // Starts the consensus module's main loop and begins participating in the cluster.
     void start();

+    // Stops the consensus module and cleans up resources.
     void stop();

134-138: Use std::atomic_flag for better synchronization.

For atomic boolean flags, std::atomic_flag provides better guarantees and semantics compared to std::atomic<bool>.

Modify the declarations:

-std::atomic<bool>     m_leaderHeartbeatReceived{false};
-std::atomic<bool>     m_stopElection{false};
+std::atomic_flag      m_leaderHeartbeatReceived = ATOMIC_FLAG_INIT;
+std::atomic_flag      m_stopElection = ATOMIC_FLAG_INIT;

@lnikon lnikon self-assigned this Jan 9, 2025

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (3)
examples/raft/raft.cpp (3)

666-676: ⚠️ Potential issue

Fix the consecutiveFailures logic in sendHeartbeat.

Currently, consecutiveFailures is reset to zero immediately after being incremented, which prevents the failure count from accumulating correctly. This negates the retry mechanism and may cause premature termination of the heartbeat thread.

Apply this diff to correct the logic:

        consecutiveFailures++;
        spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
                      consecutiveFailures,
                      maxRetries);
        if (consecutiveFailures >= maxRetries)
        {
            return;
        }
-        consecutiveFailures = 0;
        continue;

710-784: 🛠️ Refactor suggestion

Avoid using detached threads in sendAppendEntriesRPC.

Using detached threads can lead to resource management issues and potential undefined behavior. It's safer to manage thread lifetimes explicitly.

Apply this diff to manage threads properly:

-        std::thread(
+        m_replicationThreads.emplace_back(
             [this](NodeClient &client, std::vector<LogEntry> logEntries)
             {
                 // ... existing code ...
             },
             std::ref(client),
             logEntries)
-            .detach();
+        );

Ensure that m_replicationThreads is stored (e.g., in a std::vector<std::jthread>) and properly joined or stopped during shutdown to prevent resource leaks.


590-603: ⚠️ Potential issue

Ensure thread safety in becomeFollower.

The method becomeFollower modifies shared state but may not have the necessary mutex protection in all calling contexts.

Ensure that becomeFollower is always called with m_stateMutex held or add mutex locking within the method:

+    absl::MutexLock locker(&m_stateMutex);
    m_currentTerm = newTerm;
    m_state = NodeState::FOLLOWER;
    updatePersistentState(std::nullopt, 0);

    // ... existing code ...

Alternatively, if the method is intended to be called with the mutex already locked, add a thread annotation to enforce this:

void ConsensusModule::becomeFollower(uint32_t newTerm) ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
🧹 Nitpick comments (3)
conan/profiles/release-gcc (1)

10-11: Review profile consistency across compilers

The GCC profile retains the compiler.libcxx setting while it's removed from Clang profiles. Consider:

  1. Standardizing the approach across all profiles
  2. Documenting the reason for different handling if intended
examples/raft/raft.cpp (1)

642-646: Prevent tight loop on failed heartbeats.

When the heartbeat fails, the loop immediately continues without a delay, potentially causing high CPU usage due to rapid retries.

Consider adding a short sleep before retrying to prevent a tight loop:

+       // Add a short delay before the next attempt
+       std::this_thread::sleep_for(std::chrono::milliseconds(50));
        continue;
    }
examples/raft/raft.h (1)

41-41: Add noexcept to destructor for NodeClient.

Consistently marking the destructor as noexcept is good practice and conveys exception safety guarantees.

Ensure the destructor is declared as:

virtual ~NodeClient() noexcept = default;
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 497b88d and ee68090.

📒 Files selected for processing (9)
  • CMakeLists.txt (2 hunks)
  • conan/profiles/debug-clang (1 hunks)
  • conan/profiles/debug-clang-tsan (0 hunks)
  • conan/profiles/debug-gcc (1 hunks)
  • conan/profiles/release-clang (1 hunks)
  • conan/profiles/release-gcc (1 hunks)
  • examples/raft/main.cpp (1 hunks)
  • examples/raft/raft.cpp (1 hunks)
  • examples/raft/raft.h (1 hunks)
💤 Files with no reviewable changes (1)
  • conan/profiles/debug-clang-tsan
✅ Files skipped from review due to trivial changes (1)
  • conan/profiles/debug-gcc
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/raft/main.cpp
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-gcc
  • GitHub Check: build-clang
🔇 Additional comments (6)
conan/profiles/debug-clang (1)

10-10: Verify Clang 19 availability in the build environment

The upgrade to Clang 19 is significant. Please ensure:

  1. Clang 19 is available in the build environment
  2. All team members have access to this version
  3. CI/CD pipelines are updated accordingly

Run this script to check Clang availability:

✅ Verification successful

Clang 19 upgrade is well-supported by the project infrastructure

The compiler version upgrade is consistent with the project's requirements:

  • Both debug and release Clang profiles are aligned to version 19
  • Project requires C++23 features, which Clang 19 fully supports
  • Build environment is controlled via Docker images
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if Clang 19 is available and its features
clang_path=$(which clang)
if [ -x "$clang_path" ]; then
    $clang_path --version
    # Check C++23 support
    echo '#include <concepts>' | $clang_path -x c++ -std=c++23 -
fi

Length of output: 140


Script:

#!/bin/bash
# Search for compiler configurations and CI files
echo "=== Compiler Profiles ==="
fd ".*" conan/profiles --type f --exec cat {}

echo -e "\n=== CI Configurations ==="
fd "^\.gitlab-ci\.yml$|^\.github/workflows/.*\.yml$" --type f --exec cat {}

echo -e "\n=== Build System Files ==="
fd "CMakeLists.txt|conanfile\.txt|conanfile\.py" --type f --exec cat {}

Length of output: 14546

conan/profiles/release-clang (1)

10-10: LGTM! Changes are consistent with debug profile

The changes maintain consistency with the debug-clang profile, which is good practice.

conan/profiles/release-gcc (1)

9-9: Consider implications of removing GNU extensions

Changing from gnu23 to 23 removes GNU extensions. Please verify that:

  1. The Raft implementation doesn't rely on any GNU-specific features
  2. All existing code compiles without GNU extensions

Run this script to check for GNU extension usage:

✅ Verification successful

GNU extensions removal is safe

The codebase doesn't rely on GNU-specific features. The only GNU attribute usage is in thread safety annotations that are already properly guarded for Clang-only usage, and in isolated third-party GSL utility files. The core Raft implementation will compile correctly with standard C++23.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for common GNU extension usage
rg -l "(__GNUC__|__extension__|__attribute__|__asm__|typeof|__typeof__|__builtin)" 

# Check if any files use GNU-specific headers
rg -l "#include.*<(gnu/|bits/)"

Length of output: 328


Script:

#!/bin/bash
# Check thread_safety.h content
echo "=== thread_safety.h content ==="
cat examples/raft/thread_safety.h

# Look for GNU extensions specifically in Raft implementation
echo -e "\n=== GNU extensions in Raft code ==="
fd -e cpp -e h -e hpp . examples/raft --exec rg -l "(__GNUC__|__extension__|__attribute__|__asm__|typeof|__typeof__|__builtin)" {}

Length of output: 9927

examples/raft/raft.cpp (2)

142-147: Use static_cast instead of dynamic_cast for base class casting.

Since ConsensusModule publicly inherits from RaftService::Service, static_cast is sufficient and more efficient than dynamic_cast. This change also eliminates the need for runtime nullptr checks.

Apply this diff to replace dynamic_cast with static_cast:

-auto *raftService = dynamic_cast<RaftService::Service *>(this);
+auto *raftService = static_cast<RaftService::Service *>(this);
 if (raftService == nullptr)
 {
     throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to RaftService"));
 }

Similarly, update the casting for tkvppService:

-auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this);
+auto *tkvppService = static_cast<TinyKVPPService::Service *>(this);

448-458: Verify election timer reset logic.

The logic within AwaitWithTimeout may not properly reset the election timer, potentially causing unintended election timeouts.

Please verify that the condition in AwaitWithTimeout correctly captures both the receipt of a heartbeat and the passage of the timeout duration:

const auto heartbeatOrTimeout = [this, timeToWaitDeadlineMs]() {
    return m_leaderHeartbeatReceived.load() || currentTimeMs() >= timeToWaitDeadlineMs;
};
bool heartbeatReceived = m_timerMutex.AwaitWithTimeout(absl::Condition(&heartbeatOrTimeout),
                                                       absl::Milliseconds(timeToWaitMs));

Ensure that currentTimeMs() is appropriately calculated within the lambda to prevent stale time comparisons.

CMakeLists.txt (1)

16-22: Conditionally include -Wthread-safety for Clang compiler.

The -Wthread-safety flag is specific to Clang and can cause errors with other compilers like GCC.

The added conditional correctly handles compiler-specific flags, ensuring compatibility across different compilers.

Comment on lines 151 to 152
std::atomic<bool> m_leaderHeartbeatReceived{false};
std::atomic<bool> m_stopElection{false};

💡 Codebase verification

Atomic variables need explicit memory ordering specifications

The verification reveals that while atomic methods are used, memory ordering is not specified. This could lead to synchronization issues:

  • m_leaderHeartbeatReceived: All load/store operations should specify memory ordering
  • m_stopElection: Direct boolean access should be replaced with load/store operations with proper memory ordering

Recommended changes:

  • For m_leaderHeartbeatReceived.store(), use memory_order_release
  • For m_leaderHeartbeatReceived.load(), use memory_order_acquire
  • Replace direct access to m_stopElection with proper atomic operations
  • For m_stopElection, use memory_order_seq_cst as it's used for thread termination
🔗 Analysis chain

Ensure atomic variables are used safely with appropriate memory order.

The atomic variables m_leaderHeartbeatReceived and m_stopElection may require explicit memory order semantics to prevent data races.

Please verify that these atomic variables are accessed with consistent memory order, or consider protecting them with mutexes if they represent complex state changes.

For instance, use:

std::atomic<bool> m_leaderHeartbeatReceived{false};

And when accessing:

if (m_leaderHeartbeatReceived.load(std::memory_order_acquire)) {
    // ...
}
🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Search for usages of m_leaderHeartbeatReceived
ast-grep --pattern 'm_leaderHeartbeatReceived'

# Search for usages of m_stopElection
ast-grep --pattern 'm_stopElection'

# Backup search using ripgrep for more context
rg "m_leaderHeartbeatReceived|m_stopElection" -A 2 -B 2

Length of output: 3647

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (5)
examples/raft/raft.cpp (2)

605-609: ⚠️ Potential issue

Correct majority calculation in hasMajority function

The current majority calculation may lead to incorrect results, especially in clusters with an even number of nodes. The calculation should consider the total number of nodes, including the current node.

Apply this diff to fix the calculation:

 auto ConsensusModule::hasMajority(uint32_t votes) const -> bool
 {
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    const size_t totalNodes = m_replicas.size() + 1; // Including self
+    return votes > totalNodes / 2;
 }

666-678: ⚠️ Potential issue

Fix consecutiveFailures reset logic in heartbeat

The consecutiveFailures counter is reset to zero immediately after incrementing, which prevents the failure count from accumulating and may cause premature termination of the heartbeat loop.

Apply this diff to correct the failure counter logic:

 if (!client.appendEntries(request, &response))
 {
     consecutiveFailures++;
     spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
                   consecutiveFailures,
                   maxRetries);
     if (consecutiveFailures >= maxRetries)
     {
         return;
     }
-    consecutiveFailures = 0;
     continue;
 }
+else
+{
+    consecutiveFailures = 0; // Reset on successful response
+}
examples/raft/raft.h (3)

67-67: ⚠️ Potential issue

Inherit enable_shared_from_this publicly

ConsensusModule should inherit from std::enable_shared_from_this publicly to use shared_from_this() safely. Private inheritance may lead to undefined behavior when calling shared_from_this().

Apply this diff to change the inheritance:

 class ConsensusModule : public RaftService::Service,
                         public TinyKVPPService::Service,
-                        std::enable_shared_from_this<ConsensusModule>
+                        public std::enable_shared_from_this<ConsensusModule>
{
    // Class definition
};

151-152: 🛠️ Refactor suggestion

Specify memory order for atomic variables

The std::atomic<bool> variables m_leaderHeartbeatReceived and m_stopElection are accessed without specifying memory order, which may lead to synchronization issues.

Apply this diff to specify memory ordering:

 std::atomic<bool> m_leaderHeartbeatReceived{false};
 std::atomic<bool> m_stopElection{false};

 // When storing:
-m_leaderHeartbeatReceived.store(true);
+m_leaderHeartbeatReceived.store(true, std::memory_order_release);

 // When loading:
-if (m_leaderHeartbeatReceived.load())
+if (m_leaderHeartbeatReceived.load(std::memory_order_acquire))

Repeat similar changes for m_stopElection. Alternatively, protect accesses with mutexes if appropriate.


145-148: ⚠️ Potential issue

Initialize m_matchIndex and m_nextIndex properly

Ensure that m_matchIndex and m_nextIndex are initialized for all replica nodes in the constructor to prevent uninitialized access.

Confirm that in the constructor, after initializing m_replicas, you properly initialize m_matchIndex and m_nextIndex. If this is already done, ensure that the logic correctly covers all replica nodes.

🧹 Nitpick comments (1)
examples/raft/raft.cpp (1)

925-943: Handle malformed log entries during restore

In restorePersistentState(), if a log line is malformed, the code may crash or behave unexpectedly. Add checks to ensure that each line is correctly parsed.

Apply this diff to add error handling:

 while (std::getline(ifs, logLine))
 {
     std::stringstream sst(logLine);

     std::string   key;
     std::string   value;
     std::uint32_t term = 0;

     sst >> key >> value >> term;

+    if (sst.fail())
+    {
+        spdlog::error("Malformed log entry in {}: {}", path.c_str(), logLine);
+        continue; // Or handle the error as needed
+    }

     LogEntry logEntry;
     logEntry.set_key(key);
     logEntry.set_value(value);
     logEntry.set_term(term);
     m_log.emplace_back(logEntry);

     spdlog::info("Node={} restored logEntry=[key={}, value={}, term={}]", m_id, key, value, term);
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee68090 and f68e8d1.

📒 Files selected for processing (2)
  • examples/raft/raft.cpp (1 hunks)
  • examples/raft/raft.h (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-gcc
  • GitHub Check: build-clang
🔇 Additional comments (4)
examples/raft/raft.cpp (2)

44-46: Pass nodeIp by const reference in NodeClient constructor

Passing nodeIp by const reference avoids unnecessary copying of the std::string, improving performance.

Apply this diff to modify the constructor:

-NodeClient::NodeClient(ID nodeId, IP nodeIp)
+NodeClient::NodeClient(ID nodeId, const IP& nodeIp)
     : m_id{nodeId},
-      m_ip{std::move(nodeIp)},
+      m_ip{nodeIp},
       m_channel(grpc::CreateChannel(m_ip, grpc::InsecureChannelCredentials())),
       m_stub(RaftService::NewStub(m_channel)),
       m_kvStub(TinyKVPPService::NewStub(m_channel))
{
    assert(m_id > 0);
    assert(!m_ip.empty());
    // ...
}

Remember to update the declaration in raft.h as well.


658-659: Correct previous log term and index in heartbeat

In the sendHeartbeat method, request.set_prevlogterm(getLastLogTerm()); and request.set_prevlogindex(getLastLogIndex()); may not correctly represent the leader's state when there are no new entries. Ensure that the correct previous log term and index are sent.

Please verify that getLastLogTerm() and getLastLogIndex() return the appropriate values in the context of heartbeats. If they refer to the latest log entries, consider adjusting them to represent the previous ones when necessary.

Run the following script to check the implementations of getLastLogTerm() and getLastLogIndex():

Ensure that these functions return zero when the log is empty and the correct term/index otherwise.
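For reference, the conventional definitions look like the sketch below (1-based log indexing, term 0 as the empty-log sentinel); this is an assumption about the intent, not the code in this PR:

auto ConsensusModule::getLastLogIndex() const -> uint32_t
{
    // With 1-based indexing an empty log has last index 0.
    return static_cast<uint32_t>(m_log.size());
}

auto ConsensusModule::getLastLogTerm() const -> uint32_t
{
    return m_log.empty() ? 0 : m_log.back().term();
}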

examples/raft/raft.h (2)

32-48: Add virtual destructor and documentation to NodeClient

To ensure proper resource cleanup and facilitate inheritance, add a virtual destructor to NodeClient. Additionally, provide method documentation for clarity.

Apply this diff:

 class NodeClient
 {
   public:
+    /**
+     * Virtual destructor for safe polymorphic destruction.
+     */
+    virtual ~NodeClient() noexcept = default;

     // Existing constructor and methods...

     /**
      * Sends AppendEntries RPC to the target node.
      */
     auto appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool;

     // Other methods...
 };

65-93: Improve documentation and inheritance in ConsensusModule

Add method documentation and ensure consistent use of override specifiers. Publicly inherit from enable_shared_from_this to avoid issues with shared_from_this().

Apply this diff:

 class ConsensusModule : public RaftService::Service,
                         public TinyKVPPService::Service,
-                        std::enable_shared_from_this<ConsensusModule>
+                        public std::enable_shared_from_this<ConsensusModule>
 {
   public:
     /**
      * Constructs a consensus module for a node in the Raft cluster.
      * @param nodeId Unique identifier for this node
      * @param replicas Vector of IP addresses for all nodes in cluster
      * @throws std::runtime_error if server creation fails
      */
     ConsensusModule(ID nodeId, std::vector<IP> replicas);

+    virtual ~ConsensusModule() noexcept override = default;

     /**
      * Handles AppendEntries RPC calls from other nodes.
      */
     auto AppendEntries(grpc::ServerContext *pContext,
                        const AppendEntriesRequest *pRequest,
                        AppendEntriesResponse *pResponse) -> grpc::Status override;

     // Other methods with documentation...
 };

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
examples/raft/raft.cpp (3)

616-620: ⚠️ Potential issue

Fix incorrect majority calculation in hasMajority.

The majority calculation is incorrect and could lead to split votes or incorrect behavior. The function should consider the total number of nodes (replicas plus self) when determining if a majority has been achieved.

Apply this diff to correct the calculation:

 auto ConsensusModule::hasMajority(uint32_t votes) const -> bool
 {
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    // Total nodes = replicas + self
+    const size_t totalNodes = m_replicas.size() + 1;
+    return votes >= (totalNodes / 2) + 1;
 }

This ensures that a majority requires more than half of the total nodes, which aligns with the Raft consensus requirements.


678-687: ⚠️ Potential issue

Correct the consecutiveFailures logic in heartbeat.

The consecutiveFailures counter is reset to zero immediately after being incremented, which prevents retries from accumulating correctly. This negates the purpose of tracking consecutive failures and may cause the heartbeat thread to continue indefinitely without proper handling of failures.

Apply this diff to fix the logic:

 if (!client.appendEntries(request, &response))
 {
     consecutiveFailures++;

     spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
                   consecutiveFailures,
                   maxRetries);
     if (consecutiveFailures >= maxRetries)
     {
         return;
     }
-    consecutiveFailures = 0;
     continue;
 }

+// Reset consecutiveFailures to zero only when AppendEntries RPC call succeeds
+else
+{
+    consecutiveFailures = 0;
+}

This change ensures that consecutiveFailures is only reset upon a successful RPC call, allowing the retry mechanism to function correctly.


721-797: 🛠️ Refactor suggestion

Avoid detaching threads in sendAppendEntriesRPC.

Detaching threads can lead to undefined behavior and resource leaks. Use std::jthread or store the thread handles to manage thread lifetimes properly instead of detaching threads. Ensure proper thread management and cleanup.

Apply this diff to properly manage the threads:

 void ConsensusModule::sendAppendEntriesRPC(NodeClient &client, std::vector<LogEntry> logEntries)
 {
-    std::thread(
+    m_replicationThreads.emplace_back(
         [this](NodeClient &client, std::vector<LogEntry> logEntries)
         {
             // ... existing code ...
         },
-        std::ref(client),
-        logEntries)
-        .detach();
+        std::ref(client),
+        std::move(logEntries)
+    );
 }

+// Declare m_replicationThreads in your class to store these threads
+std::vector<std::jthread> m_replicationThreads;

Ensure you join these threads appropriately in the stop() method to prevent resource leaks:

 void ConsensusModule::stop()
 {
     // ... existing code ...

+    for (auto &replicationThread : m_replicationThreads)
+    {
+        replicationThread.request_stop();
+        replicationThread.join();
+    }
+    m_replicationThreads.clear();
 }
🧹 Nitpick comments (3)
examples/raft/raft.cpp (3)

72-75: Use consistent and fixed RPC timeout values instead of random timeouts.

Using random timeouts for RPC calls in appendEntries, requestVote, and put methods can lead to unpredictable behavior and complicate debugging. It's generally better to use a consistent and reasonable timeout value for RPC calls to ensure reliable communication and easier troubleshooting.

Apply this diff to use a fixed timeout:

 // In NodeClient::appendEntries (lines 72-75)
-    const auto RPC_TIMEOUT = std::chrono::seconds(generateRandomTimeout());
+    constexpr auto RPC_TIMEOUT = std::chrono::seconds(5); // Use a fixed 5-second timeout

 // In NodeClient::requestVote (lines 88-91)
-    const auto RPC_TIMEOUT = std::chrono::seconds(generateRandomTimeout());
+    constexpr auto RPC_TIMEOUT = std::chrono::seconds(5); // Use a fixed 5-second timeout

 // In NodeClient::put (lines 106-108)
-    context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(generateRandomTimeout()));
+    context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(5)); // Use a fixed 5-second timeout

Also applies to: 88-91, 106-108


143-155: Use static_cast instead of dynamic_cast for service registration.

Since ConsensusModule inherits from RaftService::Service and TinyKVPPService::Service, you can safely use static_cast instead of dynamic_cast. This avoids the overhead of run-time type checking and does not require RTTI support, improving performance and compatibility.

Apply this diff:

 // Replace dynamic_cast with static_cast for RaftService
-    auto *raftService = dynamic_cast<RaftService::Service *>(this);
+    auto *raftService = static_cast<RaftService::Service *>(this);

 // Replace dynamic_cast with static_cast for TinyKVPPService
-    auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this);
+    auto *tkvppService = static_cast<TinyKVPPService::Service *>(this);

508-512: Remove commented and unused code related to m_serverThread.

The code involving m_serverThread is commented out, and there is no evidence of m_serverThread being initialized or used elsewhere. Keeping unused or commented code can lead to confusion and maintainability issues. Consider removing this code if it's not needed.

Apply this diff:

-    /*if (m_serverThread.joinable())*/
     {
         m_serverThread.request_stop();
         m_serverThread.join();
     }

If m_serverThread is intended to be used for managing the server thread, ensure it is properly initialized and managed throughout the code.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f68e8d1 and 3372412.

📒 Files selected for processing (2)
  • examples/raft/main.cpp (1 hunks)
  • examples/raft/raft.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/raft/main.cpp
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-gcc
  • GitHub Check: build-clang

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (6)
examples/raft/raft.cpp (6)

73-85: Use consistent and reasonable RPC timeouts.

The RPC timeout is being set using generateRandomTimeout(), which may lead to inconsistent behavior. It's preferable to have a consistent and reasonable timeout value for RPC calls.

Apply this diff to set a fixed RPC timeout:

-    const auto RPC_TIMEOUT = std::chrono::seconds(generateRandomTimeout());
+    constexpr auto RPC_TIMEOUT = std::chrono::seconds(5);

Also, consider defining RPC_TIMEOUT as a class constant if it's used in multiple methods.


87-103: Improve error handling in requestVote method.

The method logs an error when status.ok() is false but does not provide sufficient context. Also, consider handling specific gRPC error codes.

Apply this diff to enhance error logging:

 if (!status.ok())
 {
-    spdlog::error("RequestVote RPC call failed. Error code={} and message={}",
-                  static_cast<int>(status.error_code()),
-                  status.error_message());
+    spdlog::error("RequestVote RPC to node={} failed. Error code={} ({}) - {}",
+                  m_id,
+                  static_cast<int>(status.error_code()),
+                  status.error_code(),
+                  status.error_message());
     return false;
 }

186-254: Correct typo in log message and handle missing leader heartbeat.

There is a typo in the log message at line 186 (Recevied should be Received), and the code should ensure that missing heartbeats are handled appropriately.

Apply this diff to fix the typo:

-spdlog::debug("Recevied AppendEntries RPC from leader={} during term={}", pRequest->senderid(), pRequest->term());
+spdlog::debug("Received AppendEntries RPC from leader={} during term={}", pRequest->senderid(), pRequest->term());

Also, consider adding logic to handle cases where heartbeats are not received in time, potentially triggering a new election.
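On the follower side, the usual pattern is an election timer that is re-armed on every valid AppendEntries and starts an election when it fires; a compressed sketch follows, assuming the m_leaderHeartbeatReceived flag, startElection(), and generateRandomTimeout() referenced in this review (electionTimerLoop is a hypothetical name):

// Simplified follower loop, not the author's implementation.
void ConsensusModule::electionTimerLoop(std::stop_token stopToken)
{
    while (!stopToken.stop_requested())
    {
        std::this_thread::sleep_for(std::chrono::seconds(generateRandomTimeout()));

        // No heartbeat arrived during the window: assume the leader is gone.
        if (!m_leaderHeartbeatReceived.exchange(false))
        {
            startElection();
        }
    }
}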


614-616: Clarify log message in becomeFollower method.

The log message at line 614 could be more informative by including the node ID.

Apply this diff to enhance the log message:

-spdlog::debug("Server reverted to follower state in term={}", m_currentTerm);
+spdlog::debug("Node={} reverted to follower state in term={}", m_id, m_currentTerm);

511-513: Remove unreachable code or fix the condition.

The commented-out check for m_serverThread.joinable() and the subsequent unconditional block may lead to confusion.

Either uncomment the check or remove the unused code.

-/*if (m_serverThread.joinable())*/
+if (m_serverThread.joinable())
 {
     m_serverThread.request_stop();
     m_serverThread.join();
 }

748-752: Include error status in log messages for better debugging.

In sendAppendEntriesRPC, when logging an error, including the status message can aid in troubleshooting.

Modify the log statement to include status.error_message().

 spdlog::error("Node={} failed to send AppendEntries RPC to peer={} at term={}. Error: {}",
               m_id,
               client.getId(),
               m_currentTerm,
+              status.error_message());
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3372412 and 1b8b93b.

📒 Files selected for processing (1)
  • examples/raft/raft.cpp (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-clang
  • GitHub Check: build-gcc
🔇 Additional comments (5)
examples/raft/raft.cpp (5)

617-621: ⚠️ Potential issue

Correct majority calculation in hasMajority method.

The majority calculation may not be accurate, leading to potential split votes.

Apply this diff to fix the majority calculation:

 auto ConsensusModule::hasMajority(uint32_t votes) const -> bool
 {
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    // Total nodes = replicas + self
+    const size_t totalNodes = m_replicas.size() + 1;
+    return votes > totalNodes / 2;
 }

Likely invalid or redundant comment.


825-841: ⚠️ Potential issue

Fix potential infinite wait in waitForMajorityReplication.

The condition in waitForMajorityReplication might never become true, causing an infinite wait.

Apply this diff to ensure timely exit:

 constexpr const auto replicationTimeout{5};

 auto hasMajority = [this, logIndex]() ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex)
 {
     uint32_t count = 1;
     for (const auto &[peer, matchIdx] : m_matchIndex)
     {
-        count += matchIdx >= logIndex;
+        if (matchIdx >= logIndex)
+        {
+            ++count;
+        }
     }
     return count >= (m_replicas.size() + 1) / 2 + 1;
 };

 spdlog::info("Node={} is waiting for majority to agree on logIndex={}", m_id, logIndex);
 return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority), absl::Seconds(replicationTimeout));

Likely invalid or redundant comment.


45-69: ⚠️ Potential issue

Potential memory leak and unnecessary dynamic allocation in NodeClient constructor.

The NodeClient constructor initializes gRPC channel and stubs without checking for potential issues with dynamic allocation. Additionally, if the channel or stub creation fails, it throws exceptions, but the use of new is not evident, possibly leading to memory mismanagement.

Apply this diff to ensure proper initialization and avoid potential memory leaks:

 NodeClient::NodeClient(ID nodeId, IP nodeIp)
     : m_id{nodeId},
       m_ip{std::move(nodeIp)},
-      m_channel(grpc::CreateChannel(m_ip, grpc::InsecureChannelCredentials())),
-      m_stub(RaftService::NewStub(m_channel)),
-      m_kvStub(TinyKVPPService::NewStub(m_channel))
+      m_channel{grpc::CreateChannel(m_ip, grpc::InsecureChannelCredentials())},
+      m_stub{RaftService::NewStub(m_channel)},
+      m_kvStub{TinyKVPPService::NewStub(m_channel)}
 {
     assert(m_id > 0);
     assert(!m_ip.empty());

     if (!m_channel)
     {
         throw std::runtime_error(fmt::format("Failed to establish a gRPC channel for node={} ip={}", m_id, m_ip));
     }

     if (!m_stub)
     {
         throw std::runtime_error(fmt::format("Failed to create a stub for node={} ip={}", m_id, m_ip));
     }

     if (!m_kvStub)
     {
         throw std::runtime_error(fmt::format("Failed to create a KV stub for node={} ip={}", m_id, m_ip));
     }
 }

Likely invalid or redundant comment.


186-468: ⚠️ Potential issue

Avoid possible deadlock due to lock order inversion.

In AppendEntries and election functions, mutexes m_stateMutex and m_timerMutex are acquired. Ensure consistent lock ordering to prevent deadlocks.

Standardize the order in which mutexes are locked throughout the code, acquiring m_stateMutex before m_timerMutex or vice versa, but consistently.

Run the following script to identify inconsistent mutex lock orders:


107-119: Forward Put requests to the leader when not the leader.

In the put method, when the node is not the leader, it attempts to forward the request to the leader. Ensure that the forwarding logic handles errors and edge cases appropriately.

Run the following script to verify that all non-leader nodes correctly forward Put requests to the current leader:

✅ Verification successful

Put request forwarding implementation is correct and complete

The implementation properly handles forwarding Put requests to the leader:

  • Checks if the node is not the leader
  • Verifies if the leader is known (via votedFor)
  • Forwards requests using the replica's Put RPC
  • Handles errors appropriately with detailed logging
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that non-leader nodes forward Put requests to the leader.

# Test: Search for `Put` method implementations and check forwarding logic.
rg "auto ConsensusModule::Put" -A 50 | rg "if \(m_state != NodeState::LEADER\)" -A 20

Length of output: 1227


Script:

#!/bin/bash
# Get the complete Put method implementation
ast-grep --pattern 'auto ConsensusModule::Put($$$) {
  $$$
}'

# Also search for any forwarding-related methods or RPC calls
rg "Forward.*Put|forward.*put" -B 2 -A 5

Length of output: 521

Comment on lines 722 to 798
void ConsensusModule::sendAppendEntriesRPC(NodeClient &client, std::vector<LogEntry> logEntries)
{
    std::thread(
        [this](NodeClient &client, std::vector<LogEntry> logEntries)
        {
            AppendEntriesRequest request;
            {
                absl::MutexLock locker{&m_stateMutex};

                request.set_term(m_currentTerm);
                request.set_prevlogterm(logEntries.front().term());
                request.set_prevlogindex(getLogTerm(logEntries.front().index() - 1));
                request.set_leadercommit(m_commitIndex);
                request.set_senderid(m_id);

                for (const auto& logEntry : logEntries)
                {
                    request.add_entries()->CopyFrom(logEntry);
                }
            }

            AppendEntriesResponse response;
            auto status = client.appendEntries(request, &response);
            if (!status)
            {
                absl::MutexLock locker{&m_stateMutex};
                spdlog::error("Node={} failed to send AppendEntries RPC to peer={} at term={}",
                              m_id,
                              client.getId(),
                              m_currentTerm);
                return;
            }

            {
                absl::MutexLock locker{&m_stateMutex};
                if (response.term() > m_currentTerm)
                {
                    becomeFollower(response.term());
                    return;
                }
            }

            if (response.success())
            {
                absl::MutexLock locker{&m_stateMutex};
                m_matchIndex[client.getId()] = response.match_index();
                m_nextIndex[client.getId()] = response.match_index() + 1;

                uint32_t majorityIndex = findMajorityIndexMatch();
                if (majorityIndex > m_commitIndex && m_log[majorityIndex - 1].term() == m_currentTerm)
                {
                    updatePersistentState(majorityIndex, std::nullopt);

                    // Apply successfull replication to the state machine e.g. in-memory hash-table
                    while (m_lastApplied < m_commitIndex)
                    {
                        ++m_lastApplied;
                        m_kv[m_log[m_lastApplied - 1].key()] = m_log[m_lastApplied - 1].value();
                    }

                    return;
                }
            }

            if (!response.success())
            {
                {
                    absl::MutexLock locker{&m_stateMutex};
                    m_nextIndex[client.getId()] = std::max(1U, m_nextIndex[client.getId()] - 1);
                }
                sendAppendEntriesRPC(client, {});
            }
        },
        std::ref(client),
        logEntries)
        .detach();
}

⚠️ Potential issue

Avoid detaching threads in sendAppendEntriesRPC.

Detaching threads may lead to resource leaks and undefined behavior. It's better to manage thread lifetimes properly.

Apply this diff to store and manage threads:

 void ConsensusModule::sendAppendEntriesRPC(NodeClient &client, std::vector<LogEntry> logEntries)
 {
-    std::thread(
+    m_replicationThreads.emplace_back(
         [this](NodeClient &client, std::vector<LogEntry> logEntries)
         {
             // Existing code...
         },
         std::ref(client),
         logEntries
-    ).detach();
+    );
 }

Also, declare m_replicationThreads in your class to store these threads:

std::vector<std::jthread> m_replicationThreads;

Ensure you join these threads appropriately in the stop() method to prevent resource leaks.

Comment on lines 857 to 905

auto ConsensusModule::updatePersistentState(std::optional<std::uint32_t> commitIndex,
                                            std::optional<std::uint32_t> votedFor) -> bool
{
    m_commitIndex = commitIndex.has_value() ? commitIndex.value() : m_commitIndex;
    m_votedFor = votedFor.has_value() ? votedFor.value() : m_votedFor;
    return flushPersistentState();
}

auto ConsensusModule::flushPersistentState() -> bool
{
    // Flush commitIndex and votedFor
    {
        auto path = std::filesystem::path(constructFilename(gRaftFilename, m_id));
        std::ofstream fsa(path, std::fstream::out | std::fstream::trunc);
        if (!fsa.is_open())
        {
            spdlog::error("Node={} is unable to open {} to flush commitIndex={} and votedFor={}",
                          m_id,
                          path.c_str(),
                          m_commitIndex,
                          m_votedFor);
            return false;
        }
        fsa << m_commitIndex << " " << m_votedFor << "\n";
        fsa.flush();
        // TODO(lnikon): ::fsync(fsa->handle());
    }

    // Flush the log
    {
        auto path = std::filesystem::path(constructFilename(gLogFilename, m_id));
        std::ofstream fsa(path, std::fstream::out | std::fstream::trunc);
        if (!fsa.is_open())
        {
            spdlog::error("Node={} is unable to open {} to flush the log", m_id, path.c_str());
            return false;
        }

        for (const auto &entry : m_log)
        {
            fsa << entry.key() << " " << entry.value() << " " << entry.term() << "\n";
        }
        fsa.flush();
        // TODO(lnikon): ::fsync(fsa->handle());
    }

    return true;
}

🛠️ Refactor suggestion

Ensure persistence methods handle errors properly.

The methods updatePersistentState, flushPersistentState, and restorePersistentState should handle I/O errors more robustly.

Consider adding error checks and fallback mechanisms. Also, ensure that file operations are exception-safe.

Additionally, the TODO comments about fsync should be addressed to ensure data is flushed to disk.

Would you like me to implement the error handling and finalize the persistence layer?
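In the meantime, a rough illustration of the fsync TODO: one POSIX-only approach is to write through a raw file descriptor, since std::ofstream does not portably expose one. writeDurably below is a hypothetical helper, not part of this PR:

#include <fcntl.h>
#include <unistd.h>

#include <string>

// Open/truncate the file, write the payload, then fsync before closing so the
// bytes are durable even if the process crashes right after returning.
bool writeDurably(const std::string &path, const std::string &payload)
{
    const int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
    {
        return false;
    }

    const ssize_t written = ::write(fd, payload.data(), payload.size());
    const bool    ok = written == static_cast<ssize_t>(payload.size()) && ::fsync(fd) == 0;
    ::close(fd);
    return ok;
}

flushPersistentState() could then format the commitIndex/votedFor line and the log lines into a string and hand it to such a helper instead of writing through std::ofstream.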

Comment on lines 355 to 358
{
absl::MutexLock locker{&m_stateMutex};
m_log.push_back(logEntry);
}

⚠️ Potential issue

Ensure thread safety when modifying shared data structures.

In the Put method, the m_log is being modified without proper synchronization, which may lead to data races.

Apply this diff to enforce thread safety:

 {
-    absl::MutexLock locker{&m_stateMutex};
+    absl::WriterMutexLock locker{&m_stateMutex};
     m_log.push_back(logEntry);
 }

Use absl::WriterMutexLock instead of absl::MutexLock for write operations.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
{
absl::MutexLock locker{&m_stateMutex};
m_log.push_back(logEntry);
}
{
absl::WriterMutexLock locker{&m_stateMutex};
m_log.push_back(logEntry);
}

Comment on lines 800 to 808
auto ConsensusModule::getLogTerm(uint32_t index) const -> uint32_t
{
if (index == 0 || index > m_log.size())
{
return 0;
}

return m_log[index - 1].term();
}

⚠️ Potential issue

Handle out-of-range access in getLogTerm method.

The method getLogTerm might return an incorrect term if the index is out of bounds.

Apply this diff to fix the index check:

 if (index == 0 || index > m_log.size())
 {
     return 0;
 }

-return m_log[index - 1].term();
+return m_log.at(index - 1).term();

Using m_log.at(index - 1) will throw an exception if the index is out of range, making the error more apparent.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto ConsensusModule::getLogTerm(uint32_t index) const -> uint32_t
{
if (index == 0 || index > m_log.size())
{
return 0;
}
return m_log[index - 1].term();
}
auto ConsensusModule::getLogTerm(uint32_t index) const -> uint32_t
{
if (index == 0 || index > m_log.size())
{
return 0;
}
return m_log.at(index - 1).term();
}

Comment on lines 907 to 959
auto ConsensusModule::restorePersistentState() -> bool
{
{
auto path = std::filesystem::path(constructFilename(gRaftFilename, m_id));
if (!std::filesystem::exists(path))
{
spdlog::info("Node={} is running the first time", m_id);
return true;
}

std::ifstream ifs(path, std::ifstream::in);
if (!ifs.is_open())
{
spdlog::error("Node={} is unable to open {} to restore commitIndex and votedFor", m_id, path.c_str());
return false;
}

ifs >> m_commitIndex >> m_votedFor;
spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_id, m_commitIndex, m_votedFor);
}

{
auto path = std::filesystem::path(constructFilename(gLogFilename, m_id));
std::ifstream ifs(path, std::ifstream::in);
if (!ifs.is_open())
{
spdlog::error("Node={} is unable to open {} to restore log", m_id, path.c_str());
return false;
}

std::string logLine;
while (std::getline(ifs, logLine))
{
std::stringstream sst(logLine);

std::string key;
std::string value;
std::uint32_t term = 0;

sst >> key >> value >> term;

LogEntry logEntry;
logEntry.set_key(key);
logEntry.set_value(value);
logEntry.set_term(term);
m_log.emplace_back(logEntry);

spdlog::info("Node={} restored logEntry=[key={}, value={}, term={}]", m_id, key, value, term);
}
}

return true;
}

⚠️ Potential issue

Handle file parsing errors in restorePersistentState.

While restoring persistent state, the code assumes the file format is correct. If the file is corrupted, it may cause undefined behavior.

Apply this diff to add error handling:

 ifs >> m_commitIndex >> m_votedFor;
+if (ifs.fail())
+{
+    spdlog::error("Node={} encountered error while parsing {}", m_id, path.c_str());
+    return false;
+}
 spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_id, m_commitIndex, m_votedFor);

Similarly, validate the parsing of log entries.
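A minimal sketch of that validation, reusing the names from the snippet above (stream extraction reports failure through the stream state):

std::stringstream sst(logLine);

std::string   key;
std::string   value;
std::uint32_t term = 0;

// Bail out if the line does not contain exactly the expected fields.
if (!(sst >> key >> value >> term))
{
    spdlog::error("Node={} found a malformed log line in {}: '{}'", m_id, path.c_str(), logLine);
    return false;
}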

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto ConsensusModule::restorePersistentState() -> bool
{
{
auto path = std::filesystem::path(constructFilename(gRaftFilename, m_id));
if (!std::filesystem::exists(path))
{
spdlog::info("Node={} is running the first time", m_id);
return true;
}
std::ifstream ifs(path, std::ifstream::in);
if (!ifs.is_open())
{
spdlog::error("Node={} is unable to open {} to restore commitIndex and votedFor", m_id, path.c_str());
return false;
}
ifs >> m_commitIndex >> m_votedFor;
spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_id, m_commitIndex, m_votedFor);
}
{
auto path = std::filesystem::path(constructFilename(gLogFilename, m_id));
std::ifstream ifs(path, std::ifstream::in);
if (!ifs.is_open())
{
spdlog::error("Node={} is unable to open {} to restore log", m_id, path.c_str());
return false;
}
std::string logLine;
while (std::getline(ifs, logLine))
{
std::stringstream sst(logLine);
std::string key;
std::string value;
std::uint32_t term = 0;
sst >> key >> value >> term;
LogEntry logEntry;
logEntry.set_key(key);
logEntry.set_value(value);
logEntry.set_term(term);
m_log.emplace_back(logEntry);
spdlog::info("Node={} restored logEntry=[key={}, value={}, term={}]", m_id, key, value, term);
}
}
return true;
}
auto ConsensusModule::restorePersistentState() -> bool
{
{
auto path = std::filesystem::path(constructFilename(gRaftFilename, m_id));
if (!std::filesystem::exists(path))
{
spdlog::info("Node={} is running the first time", m_id);
return true;
}
std::ifstream ifs(path, std::ifstream::in);
if (!ifs.is_open())
{
spdlog::error("Node={} is unable to open {} to restore commitIndex and votedFor", m_id, path.c_str());
return false;
}
ifs >> m_commitIndex >> m_votedFor;
if (ifs.fail())
{
spdlog::error("Node={} encountered error while parsing {}", m_id, path.c_str());
return false;
}
spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_id, m_commitIndex, m_votedFor);
}
{
auto path = std::filesystem::path(constructFilename(gLogFilename, m_id));
std::ifstream ifs(path, std::ifstream::in);
if (!ifs.is_open())
{
spdlog::error("Node={} is unable to open {} to restore log", m_id, path.c_str());
return false;
}
std::string logLine;
while (std::getline(ifs, logLine))
{
std::stringstream sst(logLine);
std::string key;
std::string value;
std::uint32_t term = 0;
sst >> key >> value >> term;
LogEntry logEntry;
logEntry.set_key(key);
logEntry.set_value(value);
logEntry.set_term(term);
m_log.emplace_back(logEntry);
spdlog::info("Node={} restored logEntry=[key={}, value={}, term={}]", m_id, key, value, term);
}
}
return true;
}

Comment on lines 256 to 301
auto ConsensusModule::RequestVote(grpc::ServerContext *pContext,
const RequestVoteRequest *pRequest,
RequestVoteResponse *pResponse) -> grpc::Status
{
(void)pContext;

absl::WriterMutexLock locker(&m_stateMutex);

spdlog::debug("Received RequestVote RPC from candidate={} during term={} peerTerm={}",
pRequest->candidateid(),
m_currentTerm,
pRequest->term());

pResponse->set_term(m_currentTerm);
pResponse->set_votegranted(0);
pResponse->set_responderid(m_id);

// Become follower if candidates has higher term
if (pRequest->term() > m_currentTerm)
{
becomeFollower(pRequest->term());
}

// Don't grant vote to the candidate if the nodes term is higher
if (pRequest->term() < m_currentTerm)
{
return grpc::Status::OK;
}

// Grant vote to the candidate if the node hasn't voted yet and
// candidates log is at least as up-to-date as receiver's log
if (m_votedFor == 0 || m_votedFor == pRequest->candidateid())
{
if (pRequest->lastlogterm() > getLastLogTerm() ||
(pRequest->lastlogterm() == getLastLogTerm() && pRequest->lastlogindex() >= getLastLogIndex()))
{

updatePersistentState(std::nullopt, pRequest->candidateid());
pResponse->set_term(m_currentTerm);
pResponse->set_votegranted(1);
m_leaderHeartbeatReceived.store(true);
}
}

return grpc::Status::OK;
}

🛠️ Refactor suggestion

⚠️ Potential issue

Ensure thread safety and proper term updates in RequestVote method.

The RequestVote handler should ensure that term updates and state changes are thread-safe. Also, reset election timers upon granting a vote.

Apply this diff to improve the RequestVote implementation:

 // Grant vote to the candidate if the node hasn't voted yet and
 // candidate's log is at least as up-to-date as receiver's log
 if (m_votedFor == 0 || m_votedFor == pRequest->candidateid())
 {
     if (pRequest->lastlogterm() > getLastLogTerm() ||
         (pRequest->lastlogterm() == getLastLogTerm() && pRequest->lastlogindex() >= getLastLogIndex()))
     {
+        m_votedFor = pRequest->candidateid();
+        m_leaderHeartbeatReceived.store(true);
+        // Reset election timer
+        // Code to reset election timer goes here
+
         updatePersistentState(std::nullopt, pRequest->candidateid());
         pResponse->set_term(m_currentTerm);
         pResponse->set_votegranted(1);
-        m_leaderHeartbeatReceived.store(true);
     }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto ConsensusModule::RequestVote(grpc::ServerContext *pContext,
const RequestVoteRequest *pRequest,
RequestVoteResponse *pResponse) -> grpc::Status
{
(void)pContext;
absl::WriterMutexLock locker(&m_stateMutex);
spdlog::debug("Received RequestVote RPC from candidate={} during term={} peerTerm={}",
pRequest->candidateid(),
m_currentTerm,
pRequest->term());
pResponse->set_term(m_currentTerm);
pResponse->set_votegranted(0);
pResponse->set_responderid(m_id);
// Become follower if candidates has higher term
if (pRequest->term() > m_currentTerm)
{
becomeFollower(pRequest->term());
}
// Don't grant vote to the candidate if the nodes term is higher
if (pRequest->term() < m_currentTerm)
{
return grpc::Status::OK;
}
// Grant vote to the candidate if the node hasn't voted yet and
// candidates log is at least as up-to-date as receiver's log
if (m_votedFor == 0 || m_votedFor == pRequest->candidateid())
{
if (pRequest->lastlogterm() > getLastLogTerm() ||
(pRequest->lastlogterm() == getLastLogTerm() && pRequest->lastlogindex() >= getLastLogIndex()))
{
updatePersistentState(std::nullopt, pRequest->candidateid());
pResponse->set_term(m_currentTerm);
pResponse->set_votegranted(1);
m_leaderHeartbeatReceived.store(true);
}
}
return grpc::Status::OK;
}
auto ConsensusModule::RequestVote(grpc::ServerContext *pContext,
const RequestVoteRequest *pRequest,
RequestVoteResponse *pResponse) -> grpc::Status
{
(void)pContext;
absl::WriterMutexLock locker(&m_stateMutex);
spdlog::debug("Received RequestVote RPC from candidate={} during term={} peerTerm={}",
pRequest->candidateid(),
m_currentTerm,
pRequest->term());
pResponse->set_term(m_currentTerm);
pResponse->set_votegranted(0);
pResponse->set_responderid(m_id);
// Become follower if candidates has higher term
if (pRequest->term() > m_currentTerm)
{
becomeFollower(pRequest->term());
}
// Don't grant vote to the candidate if the nodes term is higher
if (pRequest->term() < m_currentTerm)
{
return grpc::Status::OK;
}
// Grant vote to the candidate if the node hasn't voted yet and
// candidates log is at least as up-to-date as receiver's log
if (m_votedFor == 0 || m_votedFor == pRequest->candidateid())
{
if (pRequest->lastlogterm() > getLastLogTerm() ||
(pRequest->lastlogterm() == getLastLogTerm() && pRequest->lastlogindex() >= getLastLogIndex()))
{
m_votedFor = pRequest->candidateid();
m_leaderHeartbeatReceived.store(true);
// Reset election timer
// Code to reset election timer goes here
updatePersistentState(std::nullopt, pRequest->candidateid());
pResponse->set_term(m_currentTerm);
pResponse->set_votegranted(1);
}
}
return grpc::Status::OK;
}
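As a hedged sketch of the "reset election timer" placeholder above: if the module tracks the time of the last valid leader contact (m_lastHeartbeatTime and m_timerMutex are assumed names), granting a vote could simply refresh that timestamp so the election loop restarts its countdown:

// Hypothetical helper; the election-timer thread would compare
// std::chrono::steady_clock::now() - m_lastHeartbeatTime against the randomized timeout.
void ConsensusModule::resetElectionTimer()
{
    absl::MutexLock locker{&m_timerMutex};
    m_lastHeartbeatTime = std::chrono::steady_clock::now();
}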

Comment on lines 126 to 176
ConsensusModule::ConsensusModule(ID nodeId, std::vector<IP> replicas)
: m_id{nodeId},
m_currentTerm{0},
m_votedFor{0},
m_state{NodeState::FOLLOWER},
m_commitIndex{0},
m_lastApplied{0},
m_voteCount{0}
{
assert(m_id > 0);
assert(replicas.size() > 0);
assert(m_id <= replicas.size());

m_ip = replicas[m_id - 1];

grpc::ServerBuilder builder;
builder.AddListeningPort(m_ip, grpc::InsecureServerCredentials());

auto *raftService = dynamic_cast<RaftService::Service *>(this);
if (raftService == nullptr)
{
throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to RaftService"));
}
builder.RegisterService(raftService);

auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this);
if (tkvppService == nullptr)
{
throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to TinyKVPPService"));
}
builder.RegisterService(tkvppService);

m_raftServer = builder.BuildAndStart();
if (!m_raftServer)
{
throw std::runtime_error(fmt::format("Failed to create a gRPC server for node={} ip={}", m_id, m_ip));
}

std::size_t id{1};
for (const auto &ip : replicas)
{
if (id != m_id)
{
m_replicas[id] = NodeClient(id, ip);
m_matchIndex[id] = 0;
m_nextIndex[id] = 1;
}

++id;
}
}

🛠️ Refactor suggestion

⚠️ Potential issue

Initialize all member variables in ConsensusModule constructor and check dynamic casts.

The constructor doesn't initialize some member variables, which could lead to undefined behavior. Also, dynamic casts should be validated.

Apply this diff to initialize all member variables and validate dynamic casts:

 ConsensusModule::ConsensusModule(ID nodeId, std::vector<IP> replicas)
     : m_id{nodeId},
       m_currentTerm{0},
       m_votedFor{0},
       m_state{NodeState::FOLLOWER},
       m_commitIndex{0},
       m_lastApplied{0},
       m_voteCount{0},
+      m_stopElection{false},
+      m_leaderHeartbeatReceived{false},
+      m_raftServer{nullptr}
 {
     assert(m_id > 0);
     assert(replicas.size() > 0);
     assert(m_id <= replicas.size());

     m_ip = replicas[m_id - 1];

     grpc::ServerBuilder builder;
     builder.AddListeningPort(m_ip, grpc::InsecureServerCredentials());

-    auto *raftService = dynamic_cast<RaftService::Service *>(this);
-    if (raftService == nullptr)
+    // Register RaftService
+    if (!this->RegisterService(&builder))
     {
         throw std::runtime_error(fmt::format("Failed to register RaftService for node={} ip={}", m_id, m_ip));
     }

     builder.RegisterService(static_cast<RaftService::Service *>(this));

     m_raftServer = builder.BuildAndStart();
     if (!m_raftServer)
     {
         throw std::runtime_error(fmt::format("Failed to create a gRPC server for node={} ip={}", m_id, m_ip));
     }

     std::size_t id{1};
     for (const auto &ip : replicas)
     {
         if (id != m_id)
         {
             m_replicas[id] = NodeClient(id, ip);
             m_matchIndex[id] = 0;
             m_nextIndex[id] = 1;
         }

         ++id;
     }
 }

Committable suggestion skipped: line range outside the PR's diff.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (4)
examples/raft/raft.cpp (4)

251-252: ⚠️ Potential issue

Fix incorrect format string in spdlog::error

The spdlog::error call has one format placeholder but is provided with two arguments. This mismatch can lead to incorrect logging output or runtime errors. Ensure that the number of placeholders matches the number of arguments.

Apply this diff to correct the format string:

     if (!updatePersistentState(std::min(pRequest->leadercommit(), (uint32_t)m_log.size()), std::nullopt))
     {
-        spdlog::error("Node={} is unable to persist commitIndex", m_id, m_commitIndex);
+        spdlog::error("Node={} is unable to persist commitIndex={}", m_id, m_commitIndex);
         pResponse->set_term(m_currentTerm);
         pResponse->set_success(false);

Repeat this correction for other spdlog statements where the number of placeholders does not match the number of arguments.


710-718: ⚠️ Potential issue

Correct the reset of consecutiveFailures in heartbeat logic

In the heartbeat thread, consecutiveFailures is incremented when an AppendEntries RPC fails but is immediately reset to zero within the failure condition. This prevents the retry logic from working correctly, as consecutiveFailures never accumulates.

Apply this diff to adjust the position of consecutiveFailures reset:

                             spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
                                           consecutiveFailures,
                                           maxRetries);
                             if (consecutiveFailures >= maxRetries)
                             {
                                 return;
                             }
-                            consecutiveFailures = 0;

                             continue;
                         }

+                        consecutiveFailures = 0;

754-829: 🛠️ Refactor suggestion

Avoid detaching threads in sendAppendEntriesRPC

Detaching threads may lead to resource leaks and undefined behavior. It's important to manage thread lifetimes properly by storing thread handles and ensuring they are joined upon completion.

Apply this diff to manage the threads properly:

 void consensus_module_t::sendAppendEntriesRPC(node_client_t &client, std::vector<LogEntry> logEntries)
 {
-    std::thread(
+    m_replicationThreads.emplace_back(
         [this](node_client_t &client, std::vector<LogEntry> logEntries)
         {
             // Existing code...
         },
         std::ref(client),
         logEntries
-    ).detach();
+    );
 }

Declare m_replicationThreads in your class to store these threads:

std::vector<std::jthread> m_replicationThreads;

Ensure you join these threads appropriately in the stop() method to prevent resource leaks.


831-840: ⚠️ Potential issue

Handle out-of-range access in getLogTerm method

The getLogTerm method may return an incorrect term if the index parameter is out of bounds, which can lead to undefined behavior when accessing m_log[index - 1].

Apply this diff to fix the index check:

 auto consensus_module_t::getLogTerm(uint32_t index) const -> uint32_t
 {
     if (index == 0 || index > m_log.size())
     {
         return 0;
     }
 
-    return m_log[index - 1].term();
+    return m_log.at(index - 1).term();
 }

Using m_log.at(index - 1) throws an exception if the index is ever out of range, which surfaces the error instead of silently reading out of bounds.

🧹 Nitpick comments (5)
examples/raft/raft.h (3)

33-65: Enhance method documentation and specifications.

The class methods need thread safety documentation and noexcept specifications where applicable:

     [[nodiscard]] auto id() const -> id_t;
+    [[nodiscard]] auto id() const noexcept -> id_t;

     /** Thread-safe: Can be called concurrently by multiple threads */
     auto appendEntries(const AppendEntriesRequest &request, 
                       AppendEntriesResponse *response) -> bool;

     /** Thread-safe: Can be called concurrently by multiple threads */
     auto requestVote(const RequestVoteRequest &request,
                     RequestVoteResponse *response) -> bool;

     /** Thread-safe: Can be called concurrently by multiple threads */
     auto put(const PutRequest &request, PutResponse *pResponse) -> bool;

88-92: Document public interface methods.

Key public methods lack documentation explaining their purpose, thread safety guarantees, and error handling behavior.

+    /**
+     * Initializes the consensus module's persistent state.
+     * Must be called before start().
+     * @return true if initialization succeeds, false otherwise
+     * Thread-safe: Can be called only once
+     */
     [[nodiscard]] auto init() -> bool;

+    /**
+     * Starts the consensus module, election timer, and RPC servers.
+     * Thread-safe: Can be called only once
+     */
     void start();

+    /**
+     * Stops all threads and servers gracefully.
+     * Thread-safe: Can be called multiple times
+     */
     void stop();

165-165: Consider using persistent storage for key-value pairs.

The in-memory hashtable m_kv will lose all data if the node crashes. Consider using the WAL (Write-Ahead Log) that's already included or a persistent key-value store for durability.
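As a rough illustration only (the interface below is hypothetical and deliberately does not guess at the project's wal API), the apply step could route committed entries through a durable store before updating the in-memory view:

#include <string>
#include <unordered_map>

// Hypothetical durability hook; in this codebase the WAL or a persistent
// key-value store would play this role.
struct durable_kv_t
{
    virtual ~durable_kv_t() = default;
    virtual bool persist(const std::string &key, const std::string &value) = 0;
};

// Persist first, then expose the entry through the in-memory map, so a crash
// between the two steps loses nothing that was acknowledged as committed.
inline void applyCommitted(durable_kv_t                                 &store,
                           std::unordered_map<std::string, std::string> &kv,
                           const std::string                            &key,
                           const std::string                            &value)
{
    if (store.persist(key, value))
    {
        kv[key] = value;
    }
}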

lib/db/db.cpp (2)

67-73: Remove commented-out code or implement consensus replication

The code from lines 67 to 73 contains commented-out code related to consensus replication using ConsensusModule. Leaving such code commented out may lead to confusion and reduce code clarity.

Apply this diff to remove the commented-out code:

     // TODO(lnikon): Indicate on insertion failure
     void db_t::put(const structures::lsmtree::key_t &key, const structures::lsmtree::value_t &value)
     {
-        // ConsensusModule m_cm;
-        // if (!m_cm.replicate(key, value))
-        // {
-        //     spdlog::error("ConensusModule is unable to replicate");
-        //     return;
-        // }

Alternatively, if consensus replication is planned, consider implementing this feature or tracking it with a TODO comment with a clear description.


70-70: Fix typo in error message

There's a typo in the commented-out error message: "ConensusModule" should be "ConsensusModule".

Apply this diff to correct the typo:

     //     spdlog::error("ConensusModule is unable to replicate");
+    //     spdlog::error("ConsensusModule is unable to replicate");
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1b8b93b and ff8edca.

📒 Files selected for processing (4)
  • examples/raft/main.cpp (1 hunks)
  • examples/raft/raft.cpp (1 hunks)
  • examples/raft/raft.h (1 hunks)
  • lib/db/db.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/raft/main.cpp
🔇 Additional comments (1)
examples/raft/raft.h (1)

21-28: LGTM! Well-defined type aliases and constants.

The type aliases improve code readability and the invalid ID constant is well-documented.

Comment on lines 3 to 17
#include <wal/wal.h>

#include "Raft.grpc.pb.h"
#include "Raft.pb.h"

#include "TinyKVPP.pb.h"
#include "TinyKVPP.grpc.pb.h"

#include <absl/base/thread_annotations.h>

#include <chrono> // for 'std::chrono::high_resolution_clock'
#include <cstdint> // for 'uint32_t'
#include <string> // for 'std::string'
#include <thread> // for 'std::jthread'


⚠️ Potential issue

Add missing essential headers.

The following headers are used but not included:

  • <memory> for std::shared_ptr, std::unique_ptr
  • <optional> for std::optional
  • <unordered_map> for std::unordered_map
  • <vector> for std::vector
 #include <chrono>  // for 'std::chrono::high_resolution_clock'
 #include <cstdint> // for 'uint32_t'
+#include <memory>  // for 'std::shared_ptr', 'std::unique_ptr'
+#include <optional>  // for 'std::optional'
 #include <string>  // for 'std::string'
 #include <thread>  // for 'std::jthread'
+#include <unordered_map>  // for 'std::unordered_map'
+#include <vector>  // for 'std::vector'
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#include <wal/wal.h>
#include "Raft.grpc.pb.h"
#include "Raft.pb.h"
#include "TinyKVPP.pb.h"
#include "TinyKVPP.grpc.pb.h"
#include <absl/base/thread_annotations.h>
#include <chrono> // for 'std::chrono::high_resolution_clock'
#include <cstdint> // for 'uint32_t'
#include <string> // for 'std::string'
#include <thread> // for 'std::jthread'
#include <wal/wal.h>
#include "Raft.grpc.pb.h"
#include "Raft.pb.h"
#include "TinyKVPP.pb.h"
#include "TinyKVPP.grpc.pb.h"
#include <absl/base/thread_annotations.h>
#include <chrono> // for 'std::chrono::high_resolution_clock'
#include <cstdint> // for 'uint32_t'
#include <memory> // for 'std::shared_ptr', 'std::unique_ptr'
#include <optional> // for 'std::optional'
#include <string> // for 'std::string'
#include <thread> // for 'std::jthread'
#include <unordered_map> // for 'std::unordered_map'
#include <vector> // for 'std::vector'

Comment on lines 68 to 81
if (!m_channel)
{
throw std::runtime_error(fmt::format("Failed to establish a gRPC channel for node={} ip={}", m_id, m_ip));
}

if (!m_stub)
{
throw std::runtime_error(fmt::format("Failed to create a stub for node={} ip={}", m_id, m_ip));
}

if (!m_kvStub)
{
throw std::runtime_error(fmt::format("Failed to create a KV stub for node={} ip={}", m_id, m_ip));
}

⚠️ Potential issue

Correct ineffective null checks on gRPC objects

The grpc::CreateChannel and NewStub methods do not return null pointers, even if the connection cannot be established. Therefore, the null checks on m_channel, m_stub, and m_kvStub are ineffective and should be removed or replaced with proper connection checks. Instead of checking for null, consider verifying the channel's connectivity state or handling errors during RPC calls.

Apply this diff to remove ineffective null checks:

     assert(m_id > 0);
     assert(!m_ip.empty());

-    if (!m_channel)
-    {
-        throw std::runtime_error(fmt::format("Failed to establish a gRPC channel for node={} ip={}", m_id, m_ip));
-    }

-    if (!m_stub)
-    {
-        throw std::runtime_error(fmt::format("Failed to create a stub for node={} ip={}", m_id, m_ip));
-    }

-    if (!m_kvStub)
-    {
-        throw std::runtime_error(fmt::format("Failed to create a KV stub for node={} ip={}", m_id, m_ip));
-    }

If necessary, implement connection verification using the channel's state:

// Example: Wait for the channel to become ready
if (!m_channel->WaitForConnected(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME), gpr_time_from_seconds(5, GPR_TIMESPAN)))) {
    throw std::runtime_error(fmt::format("gRPC channel to node={} ip={} failed to connect", m_id, m_ip));
}

Comment on lines 890 to 895
auto consensus_module_t::updatePersistentState(std::optional<std::uint32_t> commitIndex,
std::optional<std::uint32_t> votedFor) -> bool
{
m_commitIndex = commitIndex.has_value() ? commitIndex.value() : m_commitIndex;
m_votedFor = votedFor.has_value() ? votedFor.value() : m_votedFor;
return flushPersistentState();

⚠️ Potential issue

Handle concurrent updates in updatePersistentState

The method updatePersistentState assigns new values to m_commitIndex and m_votedFor without synchronization. This may lead to data races if accessed concurrently from multiple threads.

Apply this diff to protect the assignments with a mutex lock:

 auto consensus_module_t::updatePersistentState(std::optional<std::uint32_t> commitIndex,
                                                std::optional<std::uint32_t> votedFor) -> bool
 {
+    absl::MutexLock locker{&m_stateMutex};
     m_commitIndex = commitIndex.has_value() ? commitIndex.value() : m_commitIndex;
     m_votedFor = votedFor.has_value() ? votedFor.value() : m_votedFor;
     return flushPersistentState();
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto consensus_module_t::updatePersistentState(std::optional<std::uint32_t> commitIndex,
std::optional<std::uint32_t> votedFor) -> bool
{
m_commitIndex = commitIndex.has_value() ? commitIndex.value() : m_commitIndex;
m_votedFor = votedFor.has_value() ? votedFor.value() : m_votedFor;
return flushPersistentState();
auto consensus_module_t::updatePersistentState(std::optional<std::uint32_t> commitIndex,
std::optional<std::uint32_t> votedFor) -> bool
{
absl::MutexLock locker{&m_stateMutex};
m_commitIndex = commitIndex.has_value() ? commitIndex.value() : m_commitIndex;
m_votedFor = votedFor.has_value() ? votedFor.value() : m_votedFor;
return flushPersistentState();

Comment on lines +816 to +824
}

if (!response.success())
{
{
absl::MutexLock locker{&m_stateMutex};
m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
}
sendAppendEntriesRPC(client, {});

⚠️ Potential issue

Fix infinite recursion in failed AppendEntries handling

In the sendAppendEntriesRPC method, when an AppendEntries RPC fails (success is false), the code reduces m_nextIndex[client.id()] and calls sendAppendEntriesRPC recursively without a termination condition. This can lead to infinite recursion and stack overflow.

Apply this diff to modify the logic and prevent infinite recursion:

             if (!response.success())
             {
                 {
                     absl::MutexLock locker{&m_stateMutex};
                     m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
                 }
-                sendAppendEntriesRPC(client, {});
+                // Retry sending AppendEntries with updated nextIndex
+                // Consider implementing a loop with a termination condition instead of recursion
             }

Consider redesigning the method to use iteration instead of recursion, ensuring there's a clear exit condition when retries fail.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
}
if (!response.success())
{
{
absl::MutexLock locker{&m_stateMutex};
m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
}
sendAppendEntriesRPC(client, {});
}
if (!response.success())
{
{
absl::MutexLock locker{&m_stateMutex};
m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
}
// Retry sending AppendEntries with updated nextIndex
// Consider implementing a loop with a termination condition instead of recursion
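One possible shape for that loop, sketched with assumed helpers (buildRequestFor and maxRetries are illustrative, not code from this PR):

// Back off nextIndex and retry a bounded number of times instead of recursing.
int       attempt = 0;
const int maxRetries = 3;

while (!response.success() && attempt < maxRetries)
{
    {
        absl::MutexLock locker{&m_stateMutex};
        m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
    }

    // Rebuild the request from the reduced nextIndex and try again.
    if (!client.appendEntries(buildRequestFor(client), &response))
    {
        break; // transport failure; let the next heartbeat drive the retry
    }

    ++attempt;
}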

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

♻️ Duplicate comments (3)
examples/raft/main.cpp (2)

12-24: ⚠️ Potential issue

Add signal handlers for graceful shutdown.

The program should handle SIGTERM/SIGINT signals for graceful shutdown.

Reference: This was mentioned in a previous review. The program currently lacks proper shutdown handling, which could lead to resource leaks and inconsistent state.

void signal_handler(int signal) {
    spdlog::info("Received signal {}. Initiating graceful shutdown...", signal);
    // Trigger graceful shutdown
}

int main(int argc, char *argv[]) {
    std::signal(SIGTERM, signal_handler);
    std::signal(SIGINT, signal_handler);
    // ... rest of the code
}

46-48: ⚠️ Potential issue

Security concern: Using insecure channel credentials.

The use of grpc::InsecureChannelCredentials() is not suitable for production environments.

Consider using TLS certificates for secure communication:

auto creds = grpc::SslCredentials(grpc::SslCredentialsOptions());
m_channel = grpc::CreateChannel(replicaIp, creds);
examples/raft/CMakeLists.txt (1)

13-15: 🛠️ Refactor suggestion

Complete test configuration setup

The current setup only includes Catch2 test registration. If you decide to use Google Test, you'll need different registration.

Based on your chosen framework, use one of these configurations:

# For Catch2 (current setup)
include(Catch)
catch_discover_tests(RaftTest)

# OR for Google Test
include(GoogleTest)
gtest_discover_tests(RaftTest)

# Add test properties for both cases
set_tests_properties(${ALL_TESTS} PROPERTIES
    ENVIRONMENT "RAFT_TEST_DIR=${CMAKE_CURRENT_BINARY_DIR}/test_data"
)
🧹 Nitpick comments (9)
examples/raft/main.cpp (2)

1-11: Organize includes and add missing headers.

Consider organizing includes into logical groups and adding missing standard headers:

 // gRPC includes
 #include <grpcpp/create_channel.h>
 #include <grpcpp/security/credentials.h>
 #include <grpcpp/server.h>
 
 // Project includes
 #include "raft.h"
 
 // Third-party includes
 #include <cxxopts.hpp>
 #include <spdlog/spdlog.h>
 
+// Standard includes
+#include <memory>
+#include <vector>
+#include <csignal>

29-29: Fix typo in error message.

Correct the spelling of "positve" to "positive".

-        spdlog::error("ID of the node should be positve integer");
+        spdlog::error("ID of the node should be positive integer");
examples/raft/raft.h (2)

59-59: Pass node_config_t by const reference to avoid unnecessary copying

Consider passing config as a const node_config_t& to avoid unnecessary copying in the node_client_t constructor.

Apply this diff:

-        node_client_t(node_config_t config, std::unique_ptr<RaftService::StubInterface> pRaftStub);
+        node_client_t(const node_config_t& config, std::unique_ptr<RaftService::StubInterface> pRaftStub);

91-91: Pass parameters by const reference to avoid unnecessary copying

Consider passing node_config_t and replicas by const reference in the consensus_module_t constructor to avoid unnecessary copies.

Apply this diff:

-        consensus_module_t(node_config_t nodeConfig, std::vector<node_client_t> replicas);
+        consensus_module_t(const node_config_t& nodeConfig, const std::vector<node_client_t>& replicas);
conanfile.txt (1)

14-14: Consider consolidating dependencies to a single testing framework

Since the project already uses Catch2, adding gtest as a dependency may be unnecessary. Consolidating to one testing framework can reduce complexity and dependency management overhead.

examples/raft/raft.cpp (2)

29-39: Consider using constexpr for timeout bounds and more descriptive names.

The timeout generation function could be more maintainable with:

  • More descriptive names indicating these are election timeouts
  • Documentation explaining the purpose of these specific timeout values
-    const int minTimeout{150};
-    const int maxTimeout{300};
+    // Election timeout bounds (in milliseconds) as per Raft paper
+    constexpr int MIN_ELECTION_TIMEOUT_MS{150};
+    constexpr int MAX_ELECTION_TIMEOUT_MS{300};

     std::random_device              randomDevice;
     std::mt19937                    gen(randomDevice());
-    std::uniform_int_distribution<> dist(minTimeout, maxTimeout);
+    std::uniform_int_distribution<> dist(MIN_ELECTION_TIMEOUT_MS, MAX_ELECTION_TIMEOUT_MS);

121-136: Standardize RPC timeout handling across methods.

The put method's timeout handling differs from appendEntries and requestVote. Consider extracting the timeout logic to maintain consistency.

+    const auto RPC_TIMEOUT = std::chrono::seconds(generate_random_timeout());
+
     grpc::ClientContext context;
-    context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(generate_random_timeout()));
+    context.set_deadline(std::chrono::system_clock::now() + RPC_TIMEOUT);
examples/raft/CMakeLists.txt (2)

1-2: Consider aligning project name with repository name

The project name "zkv" differs from the repository name "tinykvpp". This inconsistency might cause confusion. Consider using a consistent naming scheme across the project.

-project(zkv)
+project(tinykvpp)

4-6: Add debug configuration and consider C++ standard compatibility

The C++23 standard is very recent and might not be widely supported. Consider:

  1. Adding debug configuration options
  2. Using C++20 for better compiler compatibility unless C++23 features are specifically needed

Add the following configuration:

# Add debug configuration
target_compile_definitions(RaftMain PRIVATE
    $<$<CONFIG:Debug>:RAFT_DEBUG>
    $<$<CONFIG:Debug>:RAFT_LOG_LEVEL=debug>
)

# Optional: Use C++20 if C++23 features aren't needed
# set_target_properties(RaftMain PROPERTIES CXX_STANDARD 20)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff8edca and 7ca37a9.

📒 Files selected for processing (7)
  • CMakeLists.txt (3 hunks)
  • conanfile.txt (1 hunks)
  • examples/raft/CMakeLists.txt (1 hunks)
  • examples/raft/main.cpp (1 hunks)
  • examples/raft/raft.cpp (1 hunks)
  • examples/raft/raft.h (1 hunks)
  • examples/raft/raft_test.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • CMakeLists.txt
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-gcc
  • GitHub Check: build-clang
🔇 Additional comments (6)
examples/raft/raft.h (3)

85-105: Enhance class design with virtual destructor and method documentation

The consensus_module_t class would benefit from a virtual destructor and detailed method documentation to improve maintainability and safety.


172-173: Atomic variables need explicit memory ordering specifications

The atomic variables m_leaderHeartbeatReceived and m_stopElection may require explicit memory order semantics to prevent synchronization issues.
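A self-contained illustration of the acquire/release pairing that would make the handshake explicit (a generic sketch, not the PR's members):

#include <atomic>

std::atomic<bool> heartbeatReceived{false};

// RPC handler thread: publish the flag with release semantics.
void onLeaderHeartbeat()
{
    heartbeatReceived.store(true, std::memory_order_release);
}

// Election-timer thread: consume and clear the flag with acquire semantics.
bool sawHeartbeatSinceLastCheck()
{
    return heartbeatReceived.exchange(false, std::memory_order_acq_rel);
}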


85-86: ⚠️ Potential issue

Ensure public inheritance of std::enable_shared_from_this

Currently, consensus_module_t inherits from std::enable_shared_from_this<consensus_module_t> without specifying the access specifier, which defaults to private inheritance in C++. This can lead to undefined behavior when calling shared_from_this(). Please make the inheritance public.

Apply this diff to fix the issue:

-                               std::enable_shared_from_this<consensus_module_t>
+                               public std::enable_shared_from_this<consensus_module_t>

Likely invalid or redundant comment.

examples/raft/raft.cpp (3)

665-669: ⚠️ Potential issue

Fix incorrect majority calculation.

The current majority calculation is incorrect and could lead to split votes. The total number of nodes should include the current node.

 auto consensus_module_t::hasMajority(uint32_t votes) const -> bool
 {
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    // Total nodes = replicas + self
+    const size_t totalNodes = m_replicas.size() + 1;
+    return votes > totalNodes / 2;
 }

Likely invalid or redundant comment.


910-916: ⚠️ Potential issue

Fix race condition in updatePersistentState.

The method modifies shared state without proper synchronization.

 auto consensus_module_t::updatePersistentState(std::optional<std::uint32_t> commitIndex,
                                              std::optional<std::uint32_t> votedFor) -> bool
 {
+    absl::MutexLock locker{&m_stateMutex};
     m_commitIndex = commitIndex.has_value() ? commitIndex.value() : m_commitIndex;
     m_votedFor = votedFor.has_value() ? votedFor.value() : m_votedFor;
     return flushPersistentState();
 }

Likely invalid or redundant comment.


1002-1003: ⚠️ Potential issue

Add error handling for file parsing.

The state restoration code doesn't validate the parsed values.

         ifs >> m_commitIndex >> m_votedFor;
+        if (ifs.fail()) {
+            spdlog::error("Node={} failed to parse state file", m_config.m_id);
+            return false;
+        }
         spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_config.m_id, m_commitIndex, m_votedFor);

Likely invalid or redundant comment.

Comment on lines +33 to +38
auto nodeIps = parsedOptions["nodes"].as<std::vector<raft::ip_t>>();
if (nodeIps.empty())
{
spdlog::error("List of node IPs can't be empty");
return EXIT_FAILURE;
}

🛠️ Refactor suggestion

Enhance IP address validation.

The current validation only checks if the list is not empty. Consider adding validation for:

  • IP address format
  • Port number range
  • Duplicate addresses

Example validation function:

bool validateIpAddress(const std::string& ip) {
    std::regex ipRegex(
        "^(?:[0-9]{1,3}\\.){3}[0-9]{1,3}(:[0-9]{1,5})?$"
    );
    if (!std::regex_match(ip, ipRegex)) {
        return false;
    }
    // Additional validation for port range and IP octets
    return true;
}

Comment on lines 40 to 53
raft::id_t replicaId{1};
std::vector<raft::node_client_t> replicas;
for (const auto &replicaIp : nodeIps)
{
if (replicaId != nodeId)
{
std::unique_ptr<RaftService::Stub> stub{
RaftService::NewStub(grpc::CreateChannel(replicaIp, grpc::InsecureChannelCredentials()))};

replicas.emplace_back(raft::node_config_t{.m_id = replicaId, .m_ip = replicaIp}, std::move(stub));
}

++replicaId;
}

🛠️ Refactor suggestion

Add error handling for replica setup.

The replica setup should include error handling for channel creation and connection failures.

for (const auto &replicaIp : nodeIps) {
    if (replicaId != nodeId) {
        try {
            auto channel = grpc::CreateChannel(replicaIp, creds);
            if (!channel->WaitForConnected(
                std::chrono::system_clock::now() + 
                std::chrono::seconds(5))) {
                spdlog::error("Failed to connect to replica at {}", replicaIp);
                return EXIT_FAILURE;
            }
            std::unique_ptr<RaftService::Stub> stub{
                RaftService::NewStub(channel)};
            replicas.emplace_back(
                raft::node_config_t{.m_id = replicaId, .m_ip = replicaIp}, 
                std::move(stub));
        } catch (const std::exception& e) {
            spdlog::error("Failed to setup replica: {}", e.what());
            return EXIT_FAILURE;
        }
    }
    ++replicaId;
}

Comment on lines 62 to 64
consensusModule.start();

/*cm.stop();*/

⚠️ Potential issue

Implement proper consensus module lifecycle management.

The commented-out stop() call and lack of proper cleanup could lead to resource leaks.

Consider implementing RAII pattern or proper cleanup:

int main(int argc, char *argv[]) {
    // ... existing code ...

    // A signal handler cannot capture locals, so use a flag with static storage
    // duration and perform the actual cleanup on the main thread after the wait loop.
    static std::atomic<bool> running{true};

    std::signal(SIGTERM, [](int) { running = false; });
    std::signal(SIGINT, [](int) { running = false; });

    consensusModule.start();

    // Wait for shutdown signal
    while (running) {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }

    consensusModule.stop();
    spdlog::info("Consensus module stopped gracefully");

    return EXIT_SUCCESS;
}

Comment on lines 1 to 14
#include "gmock/gmock.h"
#include <catch2/catch_test_macros.hpp>

#include <grpc/grpc.h>
#include <grpcpp/channel.h>
#include <grpcpp/client_context.h>
#include <grpcpp/create_channel.h>
#include <grpcpp/security/credentials.h>
#include <grpcpp/security/server_credentials.h>
#include <grpcpp/server_builder.h>
#include <grpcpp/support/status.h>

#include <gtest/gtest.h>
#include <memory>

🛠️ Refactor suggestion

Avoid mixing Catch2 and Google Test in the same test file

Including both Catch2 and Google Test (gtest/gmock) in the same test file can lead to conflicts and is generally not recommended. Choose one testing framework to maintain consistency in your tests.

Comment on lines 8 to 11
add_executable(RaftTest "raft_test.cpp" "raft.cpp")
set_target_properties(RaftTest PROPERTIES CXX_STANDARD 23)
target_link_libraries(RaftTest Catch2::Catch2WithMain gtest::gtest DB
RaftProtoObjects TKVProtoObjects)

🛠️ Refactor suggestion

Standardize on a single testing framework

The test executable links against both Catch2 and Google Test frameworks. This is unusual and could lead to:

  • Increased maintenance overhead
  • Confusion in test organization
  • Larger binary size

Choose one framework and update the configuration:

# If choosing Catch2
target_link_libraries(RaftTest PRIVATE
    Catch2::Catch2WithMain
    DB
    RaftProtoObjects
    TKVProtoObjects
)

# OR if choosing Google Test
target_link_libraries(RaftTest PRIVATE
    gtest::gtest
    gtest::gtest_main
    DB
    RaftProtoObjects
    TKVProtoObjects
)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 11

♻️ Duplicate comments (5)
examples/raft/main.cpp (2)

61-63: ⚠️ Potential issue

Security concern: Using insecure channel credentials.

The use of grpc::InsecureChannelCredentials() is not suitable for production environments.

Consider using TLS certificates for secure communication:

-                RaftService::NewStub(grpc::CreateChannel(replicaIp, grpc::InsecureChannelCredentials()))};
+                auto creds = grpc::SslCredentials(grpc::SslCredentialsOptions());
+                RaftService::NewStub(grpc::CreateChannel(replicaIp, creds))};

56-69: ⚠️ Potential issue

Add error handling for replica setup.

The replica setup should include error handling for channel creation and connection failures.

Consider wrapping the replica setup in a try-catch block and adding connection timeout checks.

examples/raft/raft.cpp (3)

689-693: ⚠️ Potential issue

Fix incorrect majority calculation.

The majority calculation is incorrect and could lead to split votes.

 auto consensus_module_t::hasMajority(uint32_t votes) const -> bool
 {
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    // Total nodes = replicas + self
+    const size_t totalNodes = m_replicas.size() + 1;
+    return votes > totalNodes / 2;
 }

794-874: ⚠️ Potential issue

Fix thread safety issues in sendAppendEntriesRPC.

Using detached threads is unsafe and could lead to resource leaks and undefined behavior.

-    std::thread(
+    m_replicationThreads.emplace_back(
         [this](node_client_t &client, std::vector<LogEntry> logEntries)
         {
             // ... existing code ...
-            if (!response.success())
-            {
-                {
-                    absl::MutexLock locker{&m_stateMutex};
-                    m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
-                }
-                sendAppendEntriesRPC(client, {});  // Recursive call!
-            }
+            if (!response.success())
+            {
+                absl::MutexLock locker{&m_stateMutex};
+                m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
+                // Use a loop instead of recursion
+                continue;
+            }
         },
         std::ref(client),
-        logEntries)
-        .detach();
+        logEntries);

942-1006: ⚠️ Potential issue

Improve persistence reliability and error handling.

The persistence implementation lacks atomic write guarantees and proper fsync calls.

 auto consensus_module_t::flushPersistentState() -> bool
 {
+    const auto tmpPath = path.string() + ".tmp";
+    
     // Flush commitIndex and votedFor
     {
-        auto          path = std::filesystem::path(constructFilename(gRaftFilename, m_config.m_id));
-        std::ofstream fsa(path, std::fstream::out | std::fstream::trunc);
+        std::ofstream fsa(tmpPath, std::fstream::out | std::fstream::trunc);
         if (!fsa.is_open())
         {
             return false;
         }

         fsa << m_commitIndex << " " << m_votedFor << "\n";
         fsa.flush();
-        // TODO(lnikon): ::fsync(fsa->handle());
+        if (::fsync(fileno(fsa.rdbuf()->fd())) != 0) {
+            return false;
+        }
+        fsa.close();
+        
+        if (std::rename(tmpPath.c_str(), path.c_str()) != 0) {
+            return false;
+        }
     }
🧹 Nitpick comments (8)
examples/raft/main.cpp (4)

1-14: Add header guards or pragma once directive.

Even though this is a source file, it's good practice to add header guards to prevent potential issues if the file is accidentally included elsewhere.

+#pragma once

 #include <chrono>
 #include <csignal>

42-47: Enhance node ID validation.

The current validation only checks if the ID is non-zero. Consider adding:

  • Upper bound check to prevent overflow
  • Validation against the size of the node list
 auto nodeId = parsedOptions["id"].as<id_t>();
-if (nodeId == 0)
+if (nodeId == 0 || nodeId > nodeIps.size())
 {
-    spdlog::error("ID of the node should be positve integer");
+    spdlog::error("Node ID should be a positive integer not exceeding the number of nodes ({})", nodeIps.size());
     return EXIT_FAILURE;
 }

70-76: Enhance error reporting for consensus module initialization.

Consider adding more detailed error reporting to help diagnose initialization failures.

-    if (!consensusModule.init())
+    try 
+    {
+        if (!consensusModule.init())
+        {
+            spdlog::error("Failed to initialize the consensus module: initialization returned false");
+            return EXIT_FAILURE;
+        }
+    }
+    catch (const std::exception& e)
+    {
+        spdlog::error("Failed to initialize the consensus module: {}", e.what());
+        return EXIT_FAILURE;
+    }

77-77: Make log level configurable.

Consider adding a command-line option to configure the log level instead of hardcoding it.

+    options.add_options()
+        ("log-level", "Log level (trace, debug, info, warn, error, critical)", 
+         cxxopts::value<std::string>()->default_value("info"));

+    auto logLevel = parsedOptions["log-level"].as<std::string>();
+    if (logLevel == "trace") spdlog::set_level(spdlog::level::trace);
+    else if (logLevel == "debug") spdlog::set_level(spdlog::level::debug);
+    // ... handle other levels
-    spdlog::set_level(spdlog::level::debug);
examples/raft/raft_test.cpp (1)

93-95: Implement leader election test cases

The leader election test case is empty but crucial for validating Raft consensus. Key scenarios to test:

  • Normal leader election
  • Split vote scenarios
  • Leader failure and re-election

Would you like me to help implement these test cases or create a GitHub issue to track this task?

examples/raft/raft.h (2)

32-38: Consider using steady_clock for better portability.

high_resolution_clock might alias to system_clock on some platforms, which can be adjusted and jump backwards/forwards. For election timeouts and heartbeats, steady_clock would be more appropriate as it's monotonic and provides consistent intervals.

-using clock_t = std::chrono::high_resolution_clock;
-using timepoint_t = std::chrono::high_resolution_clock::time_point;
+using clock_t = std::chrono::steady_clock;
+using timepoint_t = std::chrono::steady_clock::time_point;

40-45: Add validation in node_config_t constructor.

The struct could benefit from a constructor that validates the ID and IP address format.

 struct node_config_t
 {
+    node_config_t(id_t id, ip_t ip) 
+        : m_id(id), m_ip(std::move(ip)) {
+        if (m_id == gInvalidId) {
+            throw std::invalid_argument("Invalid node ID");
+        }
+        if (m_ip.empty()) {
+            throw std::invalid_argument("Empty IP address");
+        }
+    }
+
     id_t m_id{gInvalidId};
     ip_t m_ip;
 };
examples/raft/raft.cpp (1)

166-178: Improve service registration error handling.

The dynamic_cast error handling could be more informative and the casts could be combined.

-    auto *raftService = dynamic_cast<RaftService::Service *>(this);
-    if (raftService == nullptr)
-    {
-        throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to RaftService"));
-    }
-    builder.RegisterService(raftService);
-
-    auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this);
-    if (tkvppService == nullptr)
-    {
-        throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to TinyKVPPService"));
-    }
-    builder.RegisterService(tkvppService);
+    if (auto* raftService = dynamic_cast<RaftService::Service*>(this)) {
+        builder.RegisterService(raftService);
+    } else {
+        throw std::runtime_error(
+            fmt::format("Node {} failed to register RaftService", m_config.m_id));
+    }
+
+    if (auto* tkvppService = dynamic_cast<TinyKVPPService::Service*>(this)) {
+        builder.RegisterService(tkvppService);
+    } else {
+        throw std::runtime_error(
+            fmt::format("Node {} failed to register TinyKVPPService", m_config.m_id));
+    }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7ca37a9 and 54f02c9.

📒 Files selected for processing (5)
  • .vscode/launch.json (2 hunks)
  • examples/raft/main.cpp (1 hunks)
  • examples/raft/raft.cpp (1 hunks)
  • examples/raft/raft.h (1 hunks)
  • examples/raft/raft_test.cpp (1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
.vscode/launch.json

[error] 74-74: Expected a property but instead found '}'.

Expected a property here.

(parse)


[error] 105-105: Expected a property but instead found '}'.

Expected a property here.

(parse)

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-gcc
  • GitHub Check: build-clang
🔇 Additional comments (8)
examples/raft/main.cpp (1)

17-24: LGTM! Thread-safe signal handling implementation.

The signal handler correctly uses atomic operations for thread-safe graceful shutdown handling.

examples/raft/raft_test.cpp (3)

1-2: Avoid mixing test frameworks

Including both Catch2 (catch_test_macros.hpp) and Google Test/Mock (gmock.h, gtest/gtest.h) in the same test file can lead to conflicts and maintenance issues.

Choose one framework consistently:

  • If using Google Test: Remove Catch2 includes and convert test cases to TEST/TEST_F
  • If using Catch2: Remove Google Test includes and pair Catch2 with a dedicated mocking library (e.g. trompeloeil), since Catch2 does not ship its own mocking facilities

Also applies to: 14-14


89-91: Verify consensus module state after initialization

The test starts the consensus module but doesn't verify its state or behavior.

Add assertions to verify:

  • The module transitions to the correct initial state
  • The heartbeat mechanism is active
  • The term number is correctly initialized

25-42: 🛠️ Refactor suggestion

Maintain consistent assertion style

The test mixes Google Test assertions (EXPECT_*) within a Catch2 test case. This makes the test harder to maintain and understand.

Convert the value checks to consistent Catch2 style; the EXPECT_CALL expectation stays in Google Mock syntax, since Catch2 provides no mocking of its own:

 TEST_CASE("NodeClient AppendEntries", "[NodeClient]")
 {
     raft::node_config_t nodeConfig{.m_id = 1, .m_ip = "0.0.0.0:9090"};
     auto mockStub = std::make_unique<MockRaftServiceStub>();
     AppendEntriesResponse response;
     EXPECT_CALL(*mockStub, AppendEntries)
         .Times(testing::AtLeast(2))
         .WillOnce(testing::DoAll(testing::SetArgPointee<2>(response), testing::Return(grpc::Status::OK)))
         .WillOnce(testing::DoAll(testing::SetArgPointee<2>(response), testing::Return(grpc::Status::CANCELLED)));
     
     raft::node_client_t nodeClient{nodeConfig, std::move(mockStub)};
     
     AppendEntriesRequest request;
-    EXPECT_EQ(nodeClient.appendEntries(request, &response), true);
-    EXPECT_EQ(nodeClient.appendEntries(request, &response), false);
+    REQUIRE(nodeClient.appendEntries(request, &response) == true);
+    REQUIRE(nodeClient.appendEntries(request, &response) == false);
 }

Likely invalid or redundant comment.

.vscode/launch.json (1)

7-33: Enhance RaftMain debug configurations

The debug configurations could be more flexible:

  1. Add configurable debug port
  2. Make node ID and peer list configurable via inputs
  3. Add build task dependency

As suggested in the previous review, consider using input variables for configuration:

{
    "inputs": [
        {
            "id": "debugPort",
            "type": "promptString",
            "description": "Debug port for remote debugging",
            "default": "50505"
        },
        {
            "id": "nodeId",
            "type": "promptString",
            "description": "Raft node ID",
            "default": "1"
        }
    ]
}

Also applies to: 34-74

examples/raft/raft.h (3)

1-30: LGTM! Well-organized headers with clear grouping.

Headers are properly organized and include all necessary dependencies with helpful comments indicating their purpose.


83-86: Fix inheritance from enable_shared_from_this.

The inheritance should be public to ensure proper functionality of shared_from_this().

 class consensus_module_t : public RaftService::Service,
                          public TinyKVPPService::Service,
-                         std::enable_shared_from_this<consensus_module_t>
+                         public std::enable_shared_from_this<consensus_module_t>

171-173: Clarify synchronization of m_leaderHeartbeatReceived.

The flag is loaded and stored without explicit memory ordering. Either pass acquire/release ordering to every load()/store(), or tie the flag to m_timerMutex and annotate it so the thread-safety analysis can enforce the locking:

-    std::atomic<bool>     m_leaderHeartbeatReceived{false};
+    std::atomic<bool>     m_leaderHeartbeatReceived ABSL_GUARDED_BY(m_timerMutex){false};

examples/raft/main.cpp (outdated review comment, resolved)
Comment on lines 73 to 74
aeResponse.set_responderid(1);
aeResponse.set_responderid(2);

⚠️ Potential issue

Fix overwritten responder ID

The responder ID is being overwritten immediately after being set:

-    aeResponse.set_responderid(1);
-    aeResponse.set_responderid(2);
+    aeResponse.set_responderid(2);

],
"debuggerPath": "/usr/bin/gdb"
},
},

⚠️ Potential issue

Fix JSON syntax errors

Remove trailing commas after the pipeTransport objects:

             "pipeTransport": {
                 // ...
                 "debuggerPath": "/usr/bin/gdb"
-            },
+            }
         },

Also applies to: 105-105

🧰 Tools
🪛 Biome (1.9.4)

[error] 74-74: Expected a property but instead found '}'.

Expected a property here.

(parse)

Comment on lines 118 to 124
void startElection();
void becomeFollower(uint32_t newTerm) ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
void becomeLeader() ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
void sendHeartbeat(node_client_t &client) ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
void sendAppendEntriesRPC(node_client_t &client, std::vector<LogEntry> logEntries);
[[nodiscard]] auto waitForMajorityReplication(uint32_t logIndex) -> bool;


🛠️ Refactor suggestion

Add missing thread safety annotations.

Several private methods lack thread safety annotations despite accessing shared state.

-    void               startElection();
+    void               startElection() ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
-    void               sendHeartbeat(node_client_t &client) ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
+    void               sendHeartbeat(node_client_t &client) 
+        ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex, m_timerMutex);
-    void               sendAppendEntriesRPC(node_client_t &client, std::vector<LogEntry> logEntries);
+    void               sendAppendEntriesRPC(node_client_t &client, std::vector<LogEntry> logEntries)
+        ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
-    [[nodiscard]] auto waitForMajorityReplication(uint32_t logIndex) -> bool;
+    [[nodiscard]] auto waitForMajorityReplication(uint32_t logIndex) -> bool
+        ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex);

Comment on lines 46 to 81
/**
* Client for communicating with other nodes in the Raft cluster.
* Handles RPC operations for consensus and key-value operations.
*/
class node_client_t
{
public:
/**
* Constructs a client for communicating with a specific node.
* @param nodeId Unique identifier for the target node
* @param nodeIp IP address of the target node
* @throws std::runtime_error if connection cannot be established
*/
node_client_t(node_config_t config, std::unique_ptr<RaftService::StubInterface> pRaftStub);
virtual ~node_client_t() noexcept = default;

node_client_t(const node_client_t &) = delete;
auto operator=(const node_client_t &) -> node_client_t & = delete;

node_client_t(node_client_t &&) = default;
auto operator=(node_client_t &&) -> node_client_t & = default;

auto appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool;
auto requestVote(const RequestVoteRequest &request, RequestVoteResponse *response) -> bool;

auto put(const PutRequest &request, PutResponse *pResponse) -> bool;

[[nodiscard]] auto id() const -> id_t;
[[nodiscard]] auto ip() const -> ip_t;

private:
node_config_t m_config{};

std::unique_ptr<RaftService::StubInterface> m_stub{nullptr};
std::unique_ptr<TinyKVPPService::Stub> m_kvStub{nullptr};
};

🛠️ Refactor suggestion

Document thread safety guarantees for RPC methods.

The class handles RPC operations but lacks thread safety documentation. gRPC stubs are thread-safe, but this should be explicitly documented.

 /**
  * Client for communicating with other nodes in the Raft cluster.
  * Handles RPC operations for consensus and key-value operations.
+ * Thread safety: All public methods are thread-safe as they use thread-safe gRPC stubs.
  */
 class node_client_t

Comment on lines 82 to 100
auto node_client_t::appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool
{
const auto RPC_TIMEOUT = std::chrono::seconds(generate_random_timeout());

grpc::ClientContext context;
context.set_deadline(std::chrono::system_clock::now() + RPC_TIMEOUT);
grpc::Status status = m_stub->AppendEntries(&context, request, response);
if (!status.ok())
{
spdlog::error("AppendEntries RPC to peer id={} ip={} failed. Error code={} message={}",
m_config.m_id,
m_config.m_ip,
static_cast<int>(status.error_code()),
status.error_message());
return false;
}

return true;
}

🛠️ Refactor suggestion

Refactor RPC methods to use consistent timeout and error handling.

The RPC methods (appendEntries, requestVote, put) have duplicated error handling code and inconsistent timeout handling.

Consider extracting common RPC handling logic:

template<typename Request, typename Response>
bool executeRPC(const char* operation, const Request& req, Response* resp,
               std::function<grpc::Status(grpc::ClientContext*, const Request&, Response*)> rpc) {
    const auto RPC_TIMEOUT = std::chrono::seconds(generate_random_timeout());
    
    grpc::ClientContext context;
    context.set_deadline(std::chrono::system_clock::now() + RPC_TIMEOUT);
    
    grpc::Status status = rpc(&context, req, resp);
    if (!status.ok()) {
        spdlog::error("{} RPC to peer id={} ip={} failed. Error code={} message={}",
                      operation,
                      m_config.m_id,
                      m_config.m_ip,
                      static_cast<int>(status.error_code()),
                      status.error_message());
        return false;
    }
    return true;
}

Then use it in RPC methods:

auto node_client_t::appendEntries(const AppendEntriesRequest& request,
                                 AppendEntriesResponse* response) -> bool {
    return executeRPC("AppendEntries", request, response,
                     [this](auto ctx, auto& req, auto* resp) {
                         return m_stub->AppendEntries(ctx, req, resp);
                     });
}

Also applies to: 102-121, 123-138
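
If this helper is adopted, the other RPCs collapse to the same shape; for example (a sketch, reusing the stub member shown above):

auto node_client_t::requestVote(const RequestVoteRequest& request,
                                RequestVoteResponse* response) -> bool {
    // Same deadline and error-logging behaviour as appendEntries,
    // just a different stub call.
    return executeRPC("RequestVote", request, response,
                      [this](auto* ctx, const auto& req, auto* resp) {
                          return m_stub->RequestVote(ctx, req, resp);
                      });
}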

Comment on lines +32 to +41
{
const int minTimeout{150};
const int maxTimeout{300};

std::random_device randomDevice;
std::mt19937 gen(randomDevice());
std::uniform_int_distribution<> dist(minTimeout, maxTimeout);

return dist(gen);
}

🛠️ Refactor suggestion

Improve thread safety and configurability of timeout generation.

The random number generator is not thread-safe and timeout values are hardcoded.

Apply this diff to fix these issues:

+constexpr const int MIN_TIMEOUT_MS = 150;
+constexpr const int MAX_TIMEOUT_MS = 300;

 auto generate_random_timeout() -> int
 {
-    const int minTimeout{150};
-    const int maxTimeout{300};
 
-    std::random_device              randomDevice;
-    std::mt19937                    gen(randomDevice());
+    thread_local std::random_device randomDevice;
+    thread_local std::mt19937 gen(randomDevice());
-    std::uniform_int_distribution<> dist(minTimeout, maxTimeout);
+    std::uniform_int_distribution<> dist(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS);
 
     return dist(gen);
 }

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

♻️ Duplicate comments (4)
examples/raft/raft.cpp (2)

803-883: 🛠️ Refactor suggestion

Avoid detaching threads in sendAppendEntriesRPC.

Detaching threads can lead to resource leaks and undefined behavior, as there is no way to ensure they have completed before the program exits. It's better to manage thread lifetimes properly by storing the threads and joining them appropriately.

Apply this diff to manage the threads:

 void consensus_module_t::sendAppendEntriesRPC(node_client_t &client, std::vector<LogEntry> logEntries)
 {
-    std::thread(
+    m_replicationThreads.emplace_back(
         [this](node_client_t &client, std::vector<LogEntry> logEntries)
         {
             // Existing code...
         },
         std::ref(client),
         std::move(logEntries)
-    ).detach();
+    );
 }

Declare m_replicationThreads in your class to store these threads:

std::vector<std::jthread> m_replicationThreads;

Ensure you join these threads appropriately in the stop() method to prevent resource leaks.
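
A minimal sketch of the corresponding teardown in stop(), assuming the m_replicationThreads member suggested above (and that the worker lambdas accept a std::stop_token for cooperative cancellation):

void consensus_module_t::stop()
{
    for (auto &worker : m_replicationThreads)
    {
        worker.request_stop();    // signal cooperative cancellation
    }
    m_replicationThreads.clear(); // std::jthread joins in its destructor
}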


871-878: ⚠️ Potential issue

Fix infinite recursion in sendAppendEntriesRPC.

In the sendAppendEntriesRPC method, when an AppendEntries RPC fails (success is false), the code reduces m_nextIndex[client.id()] and calls sendAppendEntriesRPC recursively without a termination condition. This can lead to infinite recursion and stack overflow.

Consider modifying the logic to use a loop instead of recursion, ensuring there's a clear exit condition when retries fail. For example:

 if (!response.success())
 {
     {
         absl::MutexLock locker{&m_stateMutex};
         m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
     }
-    sendAppendEntriesRPC(client, {});
+    // Retry sending AppendEntries with updated nextIndex
+    // Implement a loop with a termination condition instead of recursion
 }
examples/raft/raft.h (2)

171-171: ⚠️ Potential issue

Specify memory ordering for atomic variables.

The atomic variable m_leaderHeartbeatReceived is used without specifying explicit memory ordering. This may lead to subtle synchronization issues. Consider specifying memory ordering when calling load() and store() methods.

For example:

// When setting the flag
m_leaderHeartbeatReceived.store(true, std::memory_order_release);

// When reading the flag
if (m_leaderHeartbeatReceived.load(std::memory_order_acquire)) {
    // ...
}

Alternatively, if you are protecting access with mutexes, consider whether you need the variable to be atomic at all.


85-85: ⚠️ Potential issue

Inherit enable_shared_from_this publicly.

The class consensus_module_t should inherit from std::enable_shared_from_this publicly to use shared_from_this() correctly. Failing to do so can lead to undefined behavior when shared_from_this() is called.

Ensure the inheritance is public:

-class consensus_module_t : public RaftService::Service,
-                           public TinyKVPPService::Service,
-                           std::enable_shared_from_this<consensus_module_t>
+class consensus_module_t : public RaftService::Service,
+                           public TinyKVPPService::Service,
+                           public std::enable_shared_from_this<consensus_module_t>
🧹 Nitpick comments (3)
examples/raft/main.cpp (3)

1-14: Improve header organization and encapsulation.

Consider:

  1. Grouping related headers (system, gRPC, project, third-party)
  2. Encapsulating the global condition variable within a SignalHandler class
 // System headers
 #include <csignal>
+#include <condition_variable>
+#include <mutex>
 
 // gRPC headers
 #include <grpcpp/create_channel.h>
 #include <grpcpp/security/credentials.h>
 #include <grpcpp/server.h>
 
 // Project headers
 #include "raft.h"
 
 // Third-party headers
 #include <cxxopts.hpp>
 #include <spdlog/spdlog.h>
 
-std::condition_variable gCv;
+class SignalHandler {
+public:
+    static SignalHandler& instance() {
+        static SignalHandler handler;
+        return handler;
+    }
+    void wait() {
+        std::unique_lock<std::mutex> lock(m_mutex);
+        m_cv.wait(lock);
+    }
+    void notify() {
+        m_cv.notify_all();
+    }
+private:
+    std::condition_variable m_cv;
+    std::mutex m_mutex;
+};

32-38: Enhance error handling in command-line parsing.

Add try-catch block to handle parsing errors and improve error messages.

-    auto parsedOptions = options.parse(argc, argv);
+    cxxopts::ParseResult parsedOptions;
+    try {
+        parsedOptions = options.parse(argc, argv);
+    } catch (const cxxopts::OptionParseException& e) {
+        spdlog::error("Failed to parse command line options: {}", e.what());
+        spdlog::info("{}", options.help());
+        return EXIT_FAILURE;
+    }

75-76: Make log level configurable via command line.

Allow users to specify the desired log level instead of hardcoding it to debug.

+    options.add_options()
+        // ... existing options ...
+        ("log-level", "Log level (trace, debug, info, warn, error, critical)", 
+         cxxopts::value<std::string>()->default_value("info"));
+
+    auto logLevel = parsedOptions["log-level"].as<std::string>();
+    if (logLevel == "trace") spdlog::set_level(spdlog::level::trace);
+    else if (logLevel == "debug") spdlog::set_level(spdlog::level::debug);
+    else if (logLevel == "info") spdlog::set_level(spdlog::level::info);
+    else if (logLevel == "warn") spdlog::set_level(spdlog::level::warn);
+    else if (logLevel == "error") spdlog::set_level(spdlog::level::err);
+    else if (logLevel == "critical") spdlog::set_level(spdlog::level::critical);
+    else {
+        spdlog::warn("Unknown log level: {}. Defaulting to info", logLevel);
+        spdlog::set_level(spdlog::level::info);
+    }
-    spdlog::set_level(spdlog::level::debug);
     consensusModule.start();
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 54f02c9 and 9d5cd21.

📒 Files selected for processing (5)
  • CMakeLists.txt (3 hunks)
  • conan/profiles/debug-clang-tsan (1 hunks)
  • examples/raft/main.cpp (1 hunks)
  • examples/raft/raft.cpp (1 hunks)
  • examples/raft/raft.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • conan/profiles/debug-clang-tsan
  • CMakeLists.txt
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-clang
  • GitHub Check: build-gcc
🔇 Additional comments (1)
examples/raft/raft.cpp (1)

31-41: Optimize random number generation in generate_random_timeout().

Currently, the function generate_random_timeout() re-initializes the random number generator every time it is called, which can lead to unnecessary overhead. Consider using a thread-local random number generator to improve performance and ensure thread safety.

Apply this diff:

+constexpr const int MIN_TIMEOUT_MS = 150;
+constexpr const int MAX_TIMEOUT_MS = 300;

 auto generate_random_timeout() -> int
 {
-    const int minTimeout{150};
-    const int maxTimeout{300};

-    std::random_device randomDevice;
-    std::mt19937 gen(randomDevice());
+    thread_local std::random_device randomDevice;
+    thread_local std::mt19937 gen(randomDevice());
     std::uniform_int_distribution<> dist(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS);

     return dist(gen);
 }

Comment on lines +68 to +73
raft::consensus_module_t consensusModule({.m_id = nodeId, .m_ip = nodeIps[nodeId - 1]}, std::move(replicas));
if (!consensusModule.init())
{
spdlog::error("Failed to initialize the state machine");
return EXIT_FAILURE;
}

🛠️ Refactor suggestion

Enhance consensus module initialization error handling.

Add timeout and detailed error reporting for initialization failures.

-    if (!consensusModule.init())
+    auto initStart = std::chrono::steady_clock::now();
+    while (!consensusModule.init())
     {
-        spdlog::error("Failed to initialize the state machine");
+        auto elapsed = std::chrono::steady_clock::now() - initStart;
+        if (elapsed > std::chrono::seconds(10)) {
+            spdlog::error("Failed to initialize consensus module after {} seconds", 
+                std::chrono::duration_cast<std::chrono::seconds>(elapsed).count());
+            spdlog::error("Last error: {}", consensusModule.getLastError());
+            return EXIT_FAILURE;
+        }
+        std::this_thread::sleep_for(std::chrono::milliseconds(100));
+    }

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 54 to 66
std::vector<raft::node_client_t> replicas;
for (raft::id_t replicaId{1}; const auto &replicaIp : nodeIps)
{
if (replicaId != nodeId)
{
std::unique_ptr<RaftService::Stub> stub{
RaftService::NewStub(grpc::CreateChannel(replicaIp, grpc::InsecureChannelCredentials()))};

replicas.emplace_back(raft::node_config_t{.m_id = replicaId, .m_ip = replicaIp}, std::move(stub));
}

++replicaId;
}

⚠️ Potential issue

Fix potential index out of bounds error in replica setup.

The replicaId starts from 1 and increments for each IP, but there's no validation that the number of IPs matches the expected node count.

+    if (nodeIps.size() < nodeId) {
+        spdlog::error("Node ID {} is larger than the number of provided IPs {}", nodeId, nodeIps.size());
+        return EXIT_FAILURE;
+    }
+
     std::vector<raft::node_client_t> replicas;
     for (raft::id_t replicaId{1}; const auto &replicaIp : nodeIps)

Comment on lines 1018 to 1071
auto consensus_module_t::restorePersistentState() -> bool
{
{
auto path = std::filesystem::path(constructFilename(gRaftFilename, m_config.m_id));
if (!std::filesystem::exists(path))
{
spdlog::info("Node={} is running the first time", m_config.m_id);
return true;
}

std::ifstream ifs(path, std::ifstream::in);
if (!ifs.is_open())
{
spdlog::error(
"Node={} is unable to open {} to restore commitIndex and votedFor", m_config.m_id, path.c_str());
return false;
}

ifs >> m_commitIndex >> m_votedFor;
spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_config.m_id, m_commitIndex, m_votedFor);
}

{
auto path = std::filesystem::path(constructFilename(gLogFilename, m_config.m_id));
std::ifstream ifs(path, std::ifstream::in);
if (!ifs.is_open())
{
spdlog::error("Node={} is unable to open {} to restore log", m_config.m_id, path.c_str());
return false;
}

std::string logLine;
while (std::getline(ifs, logLine))
{
std::stringstream sst(logLine);

std::string key;
std::string value;
std::uint32_t term = 0;

sst >> key >> value >> term;

LogEntry logEntry;
logEntry.set_key(key);
logEntry.set_value(value);
logEntry.set_term(term);
m_log.emplace_back(logEntry);

spdlog::info("Node={} restored logEntry=[key={}, value={}, term={}]", m_config.m_id, key, value, term);
}
}

return true;
}

⚠️ Potential issue

Handle file parsing errors in restorePersistentState().

While restoring persistent state, the code assumes that the file format is correct. If the file is corrupted, it may cause undefined behavior or crashes. Add error handling to validate the parsed values and handle file corruption gracefully.

Apply this diff to add error handling:

Range checks against the type's own maximum are vacuous for uint32_t members; if stricter validation is needed, check domain invariants instead (e.g. that votedFor refers to a known node ID).

 ifs >> m_commitIndex >> m_votedFor;
+if (ifs.fail())
+{
+    spdlog::error("Node={} failed to parse state file {}", m_config.m_id, path.c_str());
+    return false;
+}
 spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_config.m_id, m_commitIndex, m_votedFor);

std::jthread m_serverThread;

// Temporary in-memory hashtable to store KVs
std::unordered_map<std::string, std::string> m_kv;

⚠️ Potential issue

Ensure thread safety when accessing m_kv.

The in-memory key-value store m_kv is accessed from multiple threads (e.g., via Put and Get RPCs) but lacks thread protection. This can lead to data races and undefined behavior.

Add a mutex to protect m_kv:

mutable absl::Mutex m_kvMutex;

std::unordered_map<std::string, std::string> m_kv ABSL_GUARDED_BY(m_kvMutex);

And guard access to m_kv with the mutex:

{
    absl::MutexLock locker(&m_kvMutex);
    m_kv[logEntry.key()] = logEntry.value();
}

Ensure all accesses and modifications to m_kv are properly synchronized.
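
The read path would need the same guard; a hedged sketch (the exact Get signature is assumed here, not checked against the generated service):

grpc::Status consensus_module_t::Get(grpc::ServerContext *pContext,
                                     const GetRequest    *pRequest,
                                     GetResponse         *pResponse)
{
    absl::ReaderMutexLock lock(&m_kvMutex);
    if (auto it = m_kv.find(pRequest->key()); it != m_kv.end())
    {
        pResponse->set_value(it->second);
    }
    return grpc::Status::OK;
}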

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

🧹 Nitpick comments (2)
examples/raft/raft.h (1)

87-90: Document error handling for RPC methods.

The RPC methods should document their error handling behavior:

  • When do they return false?
  • What exceptions can they throw?
+    /**
+     * Sends append entries RPC to the node.
+     * @return false if RPC fails or connection is lost
+     * @throws std::runtime_error if stub is null
+     */
     auto appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool;

+    /**
+     * Sends request vote RPC to the node.
+     * @return false if RPC fails or connection is lost
+     * @throws std::runtime_error if stub is null
+     */
     auto requestVote(const RequestVoteRequest &request, RequestVoteResponse *response) -> bool;

+    /**
+     * Sends put operation RPC to the node.
+     * @return false if RPC fails or connection is lost
+     * @throws std::runtime_error if stub is null
+     */
     auto put(const PutRequest &request, PutResponse *pResponse) -> bool;
examples/raft/raft.cpp (1)

432-516: Improve election timer reliability and clarity.

The election timer implementation has several areas for improvement:

  1. Complex timer logic that's hard to understand
  2. Potential race conditions in state checks
  3. Magic numbers in timeout calculations

Consider extracting the timer logic into a separate class:

class ElectionTimer {
    absl::Mutex m_mutex;
    std::atomic<bool> m_heartbeatReceived{false};

public:
    // Returns true if a heartbeat (or a stop request) arrived before the
    // randomized election deadline expired.
    bool waitForHeartbeat(std::stop_token token) {
        const absl::Time deadline =
            absl::Now() + absl::Milliseconds(generate_random_timeout());

        auto shouldWakeUp = [&] {
            return m_heartbeatReceived.load() || token.stop_requested();
        };

        absl::MutexLock lock(&m_mutex);
        return m_mutex.AwaitWithDeadline(absl::Condition(&shouldWakeUp), deadline);
    }

    void onHeartbeat() { m_heartbeatReceived.store(true); }
};
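
Roughly how the election loop could drive such a timer (member and function names below are assumptions):

// Inside the jthread body that owns the election timer:
while (!stopToken.stop_requested())
{
    if (!m_electionTimer.waitForHeartbeat(stopToken))
    {
        startElection(); // the deadline expired without hearing from a leader
    }
}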
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9d5cd21 and d63a8fe.

📒 Files selected for processing (3)
  • examples/raft/raft.cpp (1 hunks)
  • examples/raft/raft.h (1 hunks)
  • examples/raft/raft_test.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/raft/raft_test.cpp
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-gcc
  • GitHub Check: build-clang
🔇 Additional comments (8)
examples/raft/raft.h (6)

3-27: Add missing essential headers.

The header includes are well-organized, but some essential headers are still missing:

  • <mutex> for std::mutex and synchronization primitives
  • <stdexcept> for std::runtime_error

32-38: Well-structured type aliases and constants!

Good use of type aliases for improved maintainability and self-documenting constant definition.


137-142: Add missing thread safety annotations for private methods.

Several private methods lack proper thread safety annotations despite accessing shared state.


191-195: Fix atomic variable usage.

The atomic variables need explicit memory ordering specifications:

  1. m_leaderHeartbeatReceived should use proper memory ordering
  2. m_voteCount should specify memory ordering for increment/decrement operations

206-206: Add thread safety for key-value store access.

The in-memory key-value store m_kv is accessed from multiple threads but lacks synchronization.


102-104: Fix inheritance of enable_shared_from_this.

enable_shared_from_this should be inherited publicly.

examples/raft/raft.cpp (2)

691-695: ⚠️ Potential issue

Fix incorrect majority calculation.

The current implementation could lead to split votes and doesn't account for the total number of nodes correctly.

 auto consensus_module_t::hasMajority(uint32_t votes) const -> bool
 {
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    // Total nodes = replicas + self
+    const size_t totalNodes = m_replicas.size() + 1;
+    return votes > totalNodes / 2;
 }

Likely invalid or redundant comment.


376-379: ⚠️ Potential issue

Fix thread safety in log modification.

The log modification is not properly synchronized, which could lead to data races.

 {
-    absl::MutexLock locker{&m_stateMutex};
+    absl::WriterMutexLock locker{&m_stateMutex};
     m_log.push_back(logEntry);
 }

Use absl::WriterMutexLock instead of absl::MutexLock to properly protect write operations on shared data.

Likely invalid or redundant comment.

Comment on lines 40 to 57
struct tkvpp_absl_try_unlock
{
tkvpp_absl_try_unlock(absl::Mutex *mu)
: m_mu{mu}
{
}

~tkvpp_absl_try_unlock()
{
if (m_mu)
{
m_mu->Unlock();
}
}

private:
absl::Mutex *m_mu{nullptr};
};

🛠️ Refactor suggestion

Improve RAII mutex guard implementation.

The tkvpp_absl_try_unlock class needs improvements:

  1. Add null check in constructor
  2. Prevent potential double unlocks by deleting move operations
 struct tkvpp_absl_try_unlock
 {
     tkvpp_absl_try_unlock(absl::Mutex *mu)
-        : m_mu{mu}
+        : m_mu{mu ? mu : throw std::invalid_argument("Mutex pointer cannot be null")}
     {
     }

+    tkvpp_absl_try_unlock(const tkvpp_absl_try_unlock&) = delete;
+    tkvpp_absl_try_unlock& operator=(const tkvpp_absl_try_unlock&) = delete;
+    tkvpp_absl_try_unlock(tkvpp_absl_try_unlock&&) = delete;
+    tkvpp_absl_try_unlock& operator=(tkvpp_absl_try_unlock&&) = delete;

     ~tkvpp_absl_try_unlock()

Comment on lines 56 to 112
auto node_client_t::appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool
{
const auto RPC_TIMEOUT = std::chrono::seconds(generate_random_timeout());

grpc::ClientContext context;
context.set_deadline(std::chrono::system_clock::now() + RPC_TIMEOUT);
grpc::Status status = m_stub->AppendEntries(&context, request, response);
if (!status.ok())
{
spdlog::error("AppendEntries RPC to peer id={} ip={} failed. Error code={} message={}",
m_config.m_id,
m_config.m_ip,
static_cast<int>(status.error_code()),
status.error_message());
return false;
}

return true;
}

auto node_client_t::requestVote(const RequestVoteRequest &request, RequestVoteResponse *response) -> bool
{
const auto RPC_TIMEOUT = std::chrono::seconds(generate_random_timeout());

grpc::ClientContext context;
context.set_deadline(std::chrono::system_clock::now() + RPC_TIMEOUT);

grpc::Status status = m_stub->RequestVote(&context, request, response);
if (!status.ok())
{
spdlog::error("RequestVote RPC to peer id={} ip={} failed. Error code={} message={}",
m_config.m_id,
m_config.m_ip,
static_cast<int>(status.error_code()),
status.error_message());
return false;
}

return true;
}

auto node_client_t::put(const PutRequest &request, PutResponse *pResponse) -> bool
{
grpc::ClientContext context;
context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(generate_random_timeout()));

grpc::Status status = m_kvStub->Put(&context, request, pResponse);
if (!status.ok())
{
spdlog::error("Put RPC call failed. Error code={} and message={}",
static_cast<int>(status.error_code()),
status.error_message());
return false;
}

return true;
}

🛠️ Refactor suggestion

Refactor duplicated RPC error handling code.

The RPC methods (appendEntries, requestVote, put) have duplicated error handling logic and inconsistent timeout handling.

Extract common RPC handling logic into a template method:

template<typename Request, typename Response>
bool executeRPC(const char* operation, const Request& req, Response* resp,
               std::function<grpc::Status(grpc::ClientContext*, const Request&, Response*)> rpc) {
    const auto RPC_TIMEOUT = std::chrono::seconds(generate_random_timeout());
    
    grpc::ClientContext context;
    context.set_deadline(std::chrono::system_clock::now() + RPC_TIMEOUT);
    
    grpc::Status status = rpc(&context, req, resp);
    if (!status.ok()) {
        spdlog::error("{} RPC to peer id={} ip={} failed. Error code={} message={}",
                      operation,
                      m_config.m_id,
                      m_config.m_ip,
                      static_cast<int>(status.error_code()),
                      status.error_message());
        return false;
    }
    return true;
}

Then use it in RPC methods:

auto node_client_t::appendEntries(const AppendEntriesRequest& request,
                                 AppendEntriesResponse* response) -> bool {
    return executeRPC("AppendEntries", request, response,
                     [this](auto ctx, auto& req, auto* resp) {
                         return m_stub->AppendEntries(ctx, req, resp);
                     });
}

Comment on lines 803 to 883
void consensus_module_t::sendAppendEntriesRPC(node_client_t &client, std::vector<LogEntry> logEntries)
{
std::thread(
[this](node_client_t &client, std::vector<LogEntry> logEntries)
{
AppendEntriesRequest request;
{
absl::MutexLock locker{&m_stateMutex};

request.set_term(m_currentTerm);
request.set_prevlogterm(logEntries.front().term());
request.set_prevlogindex(getLogTerm(logEntries.front().index() - 1));
request.set_leadercommit(m_commitIndex);
request.set_senderid(m_config.m_id);

for (const auto &logEntry : logEntries)
{
request.add_entries()->CopyFrom(logEntry);
}
}

AppendEntriesResponse response;
auto status = client.appendEntries(request, &response);
if (!status)
{
absl::MutexLock locker{&m_stateMutex};
spdlog::error("Node={} failed to send AppendEntries RPC to peer={} at term={}",
m_config.m_id,
client.id(),
m_currentTerm);
return;
}

{
absl::MutexLock locker{&m_stateMutex};
if (response.term() > m_currentTerm)
{
becomeFollower(response.term());
return;
}
}

if (response.success())
{
absl::MutexLock locker{&m_stateMutex};
m_matchIndex[client.id()] = response.match_index();
m_nextIndex[client.id()] = response.match_index() + 1;

uint32_t majorityIndex = findMajorityIndexMatch();
if (majorityIndex > m_commitIndex && m_log[majorityIndex - 1].term() == m_currentTerm)
{
if (!updatePersistentState(majorityIndex, std::nullopt))
{
spdlog::error("Node={} is unable to persist commitIndex={}", m_config.m_id, m_commitIndex);
return;
}

// Apply successfull replication to the state machine e.g. in-memory hash-table
while (m_lastApplied < m_commitIndex)
{
++m_lastApplied;
m_kv[m_log[m_lastApplied - 1].key()] = m_log[m_lastApplied - 1].value();
}

return;
}
}

if (!response.success())
{
{
absl::MutexLock locker{&m_stateMutex};
m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
}
sendAppendEntriesRPC(client, {});
}
},
std::ref(client),
logEntries)
.detach();
}

⚠️ Potential issue

Fix critical thread safety and lifecycle issues in sendAppendEntriesRPC.

The method has several critical issues:

  1. Uses std::thread::detach() which can lead to resource leaks
  2. Potential infinite recursion in failure handling
  3. Unsafe access to shared state
-void consensus_module_t::sendAppendEntriesRPC(node_client_t &client, std::vector<LogEntry> logEntries)
+void consensus_module_t::sendAppendEntriesRPC(node_client_t &client, const std::vector<LogEntry>& logEntries)
 {
-    std::thread(
+    m_replicationThreads.emplace_back(
         [this](node_client_t &client, std::vector<LogEntry> logEntries)
         {
             // ... existing code ...

             if (!response.success())
             {
                 {
                     absl::MutexLock locker{&m_stateMutex};
                     m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
                 }
-                sendAppendEntriesRPC(client, {});  // Recursive call!
+                // Retry with the decremented nextIndex inside a bounded loop
+                // instead of recursing, and give up after a fixed number of attempts.
             }
         },
         std::ref(client),
-        logEntries)
-        .detach();
+        logEntries);
 }

Also add this member to the class to manage thread lifecycles:

std::vector<std::jthread> m_replicationThreads;

m_voteCount{0}
{
assert(m_config.m_id > 0);
// assert(replicas.size() > 0);

⚠️ Potential issue

Uncomment and fix the replicas size assertion.

The assertion for checking replicas size is commented out, which could lead to undefined behavior if an empty replicas vector is provided.

-    // assert(replicas.size() > 0);
+    assert(!replicas.empty());

Comment on lines 140 to 152
auto *raftService = dynamic_cast<RaftService::Service *>(this);
if (raftService == nullptr)
{
throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to RaftService"));
}
builder.RegisterService(raftService);

auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this);
if (tkvppService == nullptr)
{
throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to TinyKVPPService"));
}
builder.RegisterService(tkvppService);

🛠️ Refactor suggestion

Improve service registration error handling.

The dynamic casts and service registration could be more robust with better error messages and validation.

-    auto *raftService = dynamic_cast<RaftService::Service *>(this);
-    if (raftService == nullptr)
+    if (auto *raftService = dynamic_cast<RaftService::Service *>(this))
     {
-        throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to RaftService"));
+        builder.RegisterService(raftService);
+    }
+    else
+    {
+        throw std::runtime_error(
+            fmt::format("Node {} failed to register RaftService: invalid cast", m_config.m_id));
     }
-    builder.RegisterService(raftService);

-    auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this);
-    if (tkvppService == nullptr)
+    if (auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this))
     {
-        throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to TinyKVPPService"));
+        builder.RegisterService(tkvppService);
+    }
+    else
+    {
+        throw std::runtime_error(
+            fmt::format("Node {} failed to register TinyKVPPService: invalid cast", m_config.m_id));
     }
-    builder.RegisterService(tkvppService);

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (5)
examples/raft/main.cpp (2)

68-68: ⚠️ Potential issue

Potential Out-of-Bounds Access When Accessing nodeIps[nodeId - 1]

The code accesses nodeIps[nodeId - 1] without validating that nodeId is within the bounds of nodeIps. If nodeId is greater than nodeIps.size(), this will result in undefined behavior due to out-of-bounds access.

This issue was previously raised but remains unaddressed.

Suggested fix:

Add a check to ensure that nodeId is within the valid range before accessing nodeIps[nodeId - 1]:

+    if (nodeId > nodeIps.size())
+    {
+        spdlog::error("Node ID {} is larger than the number of provided IPs {}", nodeId, nodeIps.size());
+        return EXIT_FAILURE;
+    }

     raft::consensus_module_t consensusModule({.m_id = nodeId, .m_ip = nodeIps[nodeId - 1]}, std::move(replicas));

47-52: 🛠️ Refactor suggestion

Enhance IP Address Validation

Currently, the code only checks if the list of node IPs is empty. Consider adding validation for:

  • Correct IP address format
  • Port number range
  • Duplicate addresses

This suggestion was previously made but not yet implemented.

Suggested improvement:

Implement a validation function for IP addresses and use it to validate each IP in the list:

bool validateIpAddress(const std::string& ip)
{
    std::regex ipRegex(
        R"(^(?:\d{1,3}\.){3}\d{1,3}:\d{1,5}$)"
    );
    if (!std::regex_match(ip, ipRegex))
    {
        return false;
    }
    // Additional validation for IP octet ranges and port numbers can be added here
    return true;
}

Then, use it in your code:

     auto nodeIps = parsedOptions["nodes"].as<std::vector<raft::ip_t>>();
     if (nodeIps.empty())
     {
         spdlog::error("List of node IPs can't be empty");
         return EXIT_FAILURE;
     }
+    for (const auto& ip : nodeIps)
+    {
+        if (!validateIpAddress(ip))
+        {
+            spdlog::error("Invalid IP address format: {}", ip);
+            return EXIT_FAILURE;
+        }
+    }
examples/raft/raft_test.cpp (2)

1-2: 🛠️ Refactor suggestion

Avoid Mixing Catch2 and Google Test Frameworks

Including both Catch2 and Google Test in the same test file can lead to conflicts and maintenance issues. It is recommended to choose one testing framework for consistency.

This issue was raised previously but remains unresolved.

Suggested action:

Standardize on a single testing framework for this test file. If possible, choose either Catch2 or Google Test and refactor the tests accordingly.

Also applies to: 14-14
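
For illustration, the Google Test form of the first test case could look roughly like this (the test name is made up; the mock setup is unchanged):

TEST(NodeClientTest, AppendEntriesReturnsRpcOutcome)
{
    raft::node_config_t nodeConfig{.m_id = 1, .m_ip = "0.0.0.0:9090"};
    auto mockStub = std::make_unique<MockRaftServiceStub>();

    AppendEntriesResponse response;
    EXPECT_CALL(*mockStub, AppendEntries)
        .WillOnce(testing::DoAll(testing::SetArgPointee<2>(response), testing::Return(grpc::Status::OK)))
        .WillOnce(testing::DoAll(testing::SetArgPointee<2>(response), testing::Return(grpc::Status::CANCELLED)));

    raft::node_client_t nodeClient{nodeConfig, std::move(mockStub)};

    AppendEntriesRequest request;
    EXPECT_TRUE(nodeClient.appendEntries(request, &response));
    EXPECT_FALSE(nodeClient.appendEntries(request, &response));
}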


72-82: ⚠️ Potential issue

Overwriting aeResponse May Cause Inconsistent Test Behavior

In the test case, aeResponse is modified between mock expectations, which could lead to unexpected behavior due to shared state.

This issue was previously noted but still exists.

Suggested fix:

Use separate AppendEntriesResponse instances for each expectation to avoid unintended side effects:

     AppendEntriesResponse aeResponse2;
     aeResponse2.set_responderid(2);
     aeResponse2.set_success(true);
     EXPECT_CALL(*mockStub2, AppendEntries)
         .Times(testing::AtLeast(1))
-        .WillRepeatedly(testing::DoAll(testing::SetArgPointee<2>(aeResponse), testing::Return(grpc::Status::OK)));
+        .WillRepeatedly(testing::DoAll(testing::SetArgPointee<2>(aeResponse2), testing::Return(grpc::Status::OK)));

     AppendEntriesResponse aeResponse3;
     aeResponse3.set_responderid(3);
     aeResponse3.set_success(true);
     EXPECT_CALL(*mockStub3, AppendEntries)
         .Times(testing::AtLeast(1))
-        .WillRepeatedly(testing::DoAll(testing::SetArgPointee<2>(aeResponse), testing::Return(grpc::Status::OK)));
+        .WillRepeatedly(testing::DoAll(testing::SetArgPointee<2>(aeResponse3), testing::Return(grpc::Status::OK)));

This ensures that each mock expectation uses its own response object.

examples/raft/raft.h (1)

218-218: ⚠️ Potential issue

Ensure Thread Safety When Accessing m_kv

The in-memory key-value store m_kv is accessed from multiple threads (e.g., via Put and Get RPC handlers) without synchronization, which can lead to data races and undefined behavior.

This issue was previously highlighted but remains unaddressed.

Suggested fix:

Add a mutex to protect m_kv and use it to guard all accesses and modifications:

+    mutable absl::Mutex m_kvMutex;

     // Temporary in-memory hashtable to store KVs
-    std::unordered_map<std::string, std::string> m_kv;
+    std::unordered_map<std::string, std::string> m_kv ABSL_GUARDED_BY(m_kvMutex);

Then, in your methods, lock the mutex when accessing m_kv:

grpc::Status consensus_module_t::Put(grpc::ServerContext *pContext,
                                     const PutRequest *pRequest,
                                     PutResponse *pResponse)
{
    absl::MutexLock lock(&m_kvMutex);
    m_kv[pRequest->key()] = pRequest->value();
    // ...
}

Ensure that all accesses to m_kv in Put and Get methods are properly synchronized.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d63a8fe and 0bab68f.

📒 Files selected for processing (4)
  • examples/raft/main.cpp (1 hunks)
  • examples/raft/raft.cpp (1 hunks)
  • examples/raft/raft.h (1 hunks)
  • examples/raft/raft_test.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/raft/raft.cpp
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-clang
  • GitHub Check: build-gcc
🔇 Additional comments (1)
examples/raft/raft.h (1)

148-153: Add Thread Safety Annotations to Private Methods

Several private methods that access shared state lack thread safety annotations. Adding these annotations helps with static analysis and ensures proper synchronization.

This suggestion was made earlier but has not been implemented.

Suggested improvement:

Add appropriate thread safety annotations to the methods:

-    void               startElection();
+    void               startElection() ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);

     void               becomeFollower(uint32_t newTerm) ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
     void               becomeLeader() ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
-    void               sendHeartbeat(raft_node_grpc_client_t &client);
+    void               sendHeartbeat(raft_node_grpc_client_t &client) ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);

-    void               sendAppendEntriesRPC(raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries);
+    void               sendAppendEntriesRPC(raft_node_grpc_client_t &client,
+                                            std::vector<LogEntry> logEntries)
+        ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);

-    [[nodiscard]] auto waitForMajorityReplication(uint32_t logIndex) -> bool;
+    [[nodiscard]] auto waitForMajorityReplication(uint32_t logIndex) -> bool
+        ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex);

This helps to document and enforce the locking requirements for these methods.
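As a standalone illustration of what the annotations buy (not code from the PR; the class and names are made up), clang's -Wthread-safety flags any call site that does not hold the required mutex:

#include "absl/base/thread_annotations.h"
#include "absl/synchronization/mutex.h"

class counter_t
{
  public:
    // May only be called while m_mutex is held exclusively.
    void incrementLocked() ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_mutex) { ++m_value; }

    void increment()
    {
        absl::MutexLock lock{&m_mutex};
        incrementLocked(); // OK: lock is held
    }

    void incrementBroken()
    {
        incrementLocked(); // warning: calling requires holding mutex 'm_mutex' exclusively
    }

  private:
    absl::Mutex m_mutex;
    int         m_value ABSL_GUARDED_BY(m_mutex) = 0;
};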

Comment on lines +15 to +21
static void signalHandler(int sig)
{
    if (sig == SIGTERM || sig == SIGINT)
    {
        gCv.notify_all();
    }
}

⚠️ Potential issue

Critical Issue: Unsafe use of std::condition_variable in signal handler

Calling std::condition_variable::notify_all() within a signal handler is unsafe and can lead to undefined behavior because it is not an async-signal-safe function. According to the C++ standard, only a limited set of functions are safe to call from within a signal handler.

Suggested fix:

Use an std::atomic<bool> flag to signal the main thread to exit, instead of using a condition variable. Modify the code as follows:

-std::condition_variable gCv;

-static void signalHandler(int sig)
+std::atomic<bool> gShutdown{false};

+void signalHandler(int sig)
 {
     if (sig == SIGTERM || sig == SIGINT)
     {
-        gCv.notify_all();
+        gShutdown.store(true);
     }
 }

 ...

-    std::mutex                   mtx;
-    std::unique_lock<std::mutex> lock(mtx);
-    gCv.wait(lock);
+    while (!gShutdown.load())
+    {
+        std::this_thread::sleep_for(std::chrono::milliseconds(100));
+    }

This replaces the condition variable with an atomic flag that the main thread can periodically check.

Committable suggestion skipped: line range outside the PR's diff.
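For completeness, here is the whole pattern as a standalone sketch (the consensus-module start/stop calls are elided; gShutdown and the 100 ms poll interval mirror the suggestion above):

#include <atomic>
#include <chrono>
#include <csignal>
#include <thread>

namespace
{
// std::atomic<bool> is async-signal-safe only when it is lock-free, which
// the static_assert makes explicit for the target platform.
std::atomic<bool> gShutdown{false};
static_assert(std::atomic<bool>::is_always_lock_free);

void signalHandler(int sig)
{
    if (sig == SIGTERM || sig == SIGINT)
    {
        gShutdown.store(true);
    }
}
} // namespace

int main()
{
    std::signal(SIGTERM, signalHandler);
    std::signal(SIGINT, signalHandler);

    // ... initialize and start the consensus module here ...

    // Poll the flag instead of blocking on a condition variable.
    while (!gShutdown.load())
    {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
    }

    // ... stop the consensus module here ...
    return 0;
}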

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0bab68f and ece39ad.

📒 Files selected for processing (4)
  • examples/raft/CMakeLists.txt (1 hunks)
  • examples/raft/main.cpp (1 hunks)
  • examples/raft/raft.cpp (1 hunks)
  • examples/raft/raft_test.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • examples/raft/CMakeLists.txt
  • examples/raft/main.cpp
  • examples/raft/raft_test.cpp
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-gcc
  • GitHub Check: build-clang
🔇 Additional comments (8)
examples/raft/raft.cpp (8)

95-114: Extract common RPC handling logic.

The RPC error handling code is duplicated across different methods and timeout handling is inconsistent.

Also applies to: 116-135


166-178: 🛠️ Refactor suggestion

Improve service registration error handling.

The dynamic casts and service registration could be more robust with better error messages and validation.

-    auto *raftService = dynamic_cast<RaftService::Service *>(this);
-    if (raftService == nullptr)
+    if (auto *raftService = dynamic_cast<RaftService::Service *>(this))
     {
-        throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to RaftService"));
+        builder.RegisterService(raftService);
+    }
+    else
+    {
+        throw std::runtime_error(
+            fmt::format("Node {} failed to register RaftService: invalid cast", m_config.m_id));
     }
-    builder.RegisterService(raftService);

-    auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this);
-    if (tkvppService == nullptr)
+    if (auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this))
     {
-        throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to TinyKVPPService"));
+        builder.RegisterService(tkvppService);
+    }
+    else
+    {
+        throw std::runtime_error(
+            fmt::format("Node {} failed to register TinyKVPPService: invalid cast", m_config.m_id));
     }
-    builder.RegisterService(tkvppService);

Likely invalid or redundant comment.


1064-1067: ⚠️ Potential issue

Handle file parsing errors in restorePersistentState.

The method assumes that the file format is correct. If the file is corrupted or truncated, the stream extractions fail silently and the node continues with stale or garbage state.

Apply this diff to add error handling:

 ifs >> m_commitIndex >> m_votedFor;
+if (ifs.fail())
+{
+    spdlog::error("Node={} failed to parse state file {}", m_config.m_id, path.c_str());
+    return false;
+}
+
+if (m_commitIndex > std::numeric_limits<uint32_t>::max() ||
+    m_votedFor > std::numeric_limits<uint32_t>::max())
+{
+    spdlog::error("Node={} invalid values in state file {}", m_config.m_id, path.c_str());
+    return false;
+}
-m_votedFor = 0;  // Remove this line as it overwrites the loaded value
 spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_config.m_id, m_commitIndex, m_votedFor);

Likely invalid or redundant comment.


407-410: ⚠️ Potential issue

Ensure thread safety when modifying shared data structures.

In the Put method, the m_log is being modified without proper synchronization, which may lead to data races.

Apply this diff to enforce thread safety:

 {
-    absl::MutexLock locker{&m_stateMutex};
+    absl::WriterMutexLock locker{&m_stateMutex};
     m_log.push_back(logEntry);
 }

Use absl::WriterMutexLock instead of absl::MutexLock for write operations.

Likely invalid or redundant comment.


719-723: ⚠️ Potential issue

Fix incorrect majority calculation.

The majority calculation is incorrect and could lead to split votes.

 auto consensus_module_t::hasMajority(uint32_t votes) const -> bool
 {
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    // Total nodes = replicas + self
+    const size_t totalNodes = m_replicas.size() + 1;
+    return votes > totalNodes / 2;
 }

Likely invalid or redundant comment.
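As a quick sanity check of the integer arithmetic (standalone snippet, not part of the PR):

#include <cassert>
#include <cstddef>

// Strict majority: more than half of all nodes (replicas + self).
bool hasMajority(std::size_t votes, std::size_t replicas)
{
    const std::size_t totalNodes = replicas + 1;
    return votes > totalNodes / 2;
}

int main()
{
    assert(hasMajority(2, 2));  // 3-node cluster: 2 of 3 wins the election
    assert(!hasMajority(1, 2)); // 1 of 3 does not
    assert(hasMajority(3, 4));  // 5-node cluster: 3 of 5 wins
    assert(!hasMajority(2, 3)); // 4-node cluster: 2 of 4 is only a split
}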


736-739: ⚠️ Potential issue

Reset vote count when becoming leader.

The vote count should be reset when becoming leader to avoid interference with future elections.

     m_state = NodeState::LEADER;
+    m_voteCount = 0;

     spdlog::info("Node={} become a leader at term={}", m_config.m_id, m_currentTerm);

Likely invalid or redundant comment.


913-921: ⚠️ Potential issue

Add bounds checking to getLogTerm.

The array access is not properly bounds-checked and could lead to undefined behavior.

 auto consensus_module_t::getLogTerm(uint32_t index) const -> uint32_t
 {
     if (index == 0 || index > m_log.size())
     {
         return 0;
     }

-    return m_log[index - 1].term();
+    try {
+        return m_log.at(index - 1).term();
+    } catch (const std::out_of_range&) {
+        spdlog::error("Node={} attempted to access invalid log index {}", m_config.m_id, index);
+        return 0;
+    }
 }

Likely invalid or redundant comment.


248-261: ⚠️ Potential issue

Fix potential buffer overflow in log entry handling.

The log entry insertion doesn't properly validate array bounds and could cause buffer overflow. Also, the insertion method is inefficient.

-    auto newEntryStart = pRequest->prevlogindex() + 1;
-    for (auto idx{0}; idx < pRequest->entries().size(); ++idx)
+    const auto newEntryStart = pRequest->prevlogindex() + 1;
+    if (newEntryStart > m_log.size() + 1) {
+        pResponse->set_term(m_currentTerm);
+        pResponse->set_success(false);
+        pResponse->set_responderid(m_config.m_id);
+        return grpc::Status::OK;
+    }
+
+    for (size_t idx = 0; idx < pRequest->entries().size(); ++idx)
     {
-        const auto newIdx{newEntryStart + idx - 1};
+        const auto newIdx = newEntryStart + idx - 1;
         if (m_log.size() >= newEntryStart + idx && m_log[newIdx].term() != pRequest->entries(idx).term())
         {
             m_log.resize(newIdx);
             break;
         }
     }
-    m_log.insert(m_log.end(), pRequest->entries().begin(), pRequest->entries().end());
+    
+    // Use reserve to prevent multiple reallocations
+    m_log.reserve(m_log.size() + pRequest->entries().size());
+    for (const auto& entry : pRequest->entries()) {
+        m_log.push_back(entry);
+    }

Likely invalid or redundant comment.

Comment on lines 58 to 73
auto tkvpp_node_grpc_client_t::put(const PutRequest &request, PutResponse *pResponse) -> bool
{
    grpc::ClientContext context;
    context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(generate_random_timeout()));

    grpc::Status status = m_stub->Put(&context, request, pResponse);
    if (!status.ok())
    {
        spdlog::error("Put RPC call failed. Error code={} and message={}",
                      static_cast<int>(status.error_code()),
                      status.error_message());
        return false;
    }

    return true;
}

🛠️ Refactor suggestion

Extract common RPC handling logic.

The RPC error handling code is duplicated across different methods and timeout handling is inconsistent.

Consider extracting common RPC handling logic into a template method:

template<typename Request, typename Response>
bool executeRPC(const char* operation, const Request& req, Response* resp,
               std::function<grpc::Status(grpc::ClientContext*, const Request&, Response*)> rpc) {
    const auto RPC_TIMEOUT = std::chrono::seconds(generate_random_timeout());
    
    grpc::ClientContext context;
    context.set_deadline(std::chrono::system_clock::now() + RPC_TIMEOUT);
    
    grpc::Status status = rpc(&context, req, resp);
    if (!status.ok()) {
        spdlog::error("{} RPC to peer id={} ip={} failed. Error code={} message={}",
                      operation,
                      m_config.m_id,
                      m_config.m_ip,
                      static_cast<int>(status.error_code()),
                      status.error_message());
        return false;
    }
    return true;
}

Then use it in RPC methods:

auto tkvpp_node_grpc_client_t::put(const PutRequest& request, PutResponse* response) -> bool {
    return executeRPC("Put", request, response,
                     [this](auto ctx, auto& req, auto* resp) {
                         return m_stub->Put(ctx, req, resp);
                     });
}
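If the std::function parameter ever shows up in profiles, the same helper can take the callable as a template parameter to avoid the type erasure. A sketch under the same assumptions as above (spdlog and generate_random_timeout available), with a simplified log message:

template <typename Request, typename Response, typename Rpc>
bool executeRPC(const char *operation, const Request &req, Response *resp, Rpc &&rpc)
{
    grpc::ClientContext context;
    context.set_deadline(std::chrono::system_clock::now() +
                         std::chrono::seconds(generate_random_timeout()));

    const grpc::Status status = std::forward<Rpc>(rpc)(&context, req, resp);
    if (!status.ok())
    {
        spdlog::error("{} RPC failed. Error code={} message={}",
                      operation,
                      static_cast<int>(status.error_code()),
                      status.error_message());
        return false;
    }
    return true;
}

Call sites stay exactly as in the lambda-based example above.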

Comment on lines 831 to 911
void consensus_module_t::sendAppendEntriesRPC(raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries)
{
std::thread(
[this](raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries)
{
AppendEntriesRequest request;
{
absl::MutexLock locker{&m_stateMutex};

request.set_term(m_currentTerm);
request.set_prevlogterm(logEntries.front().term());
request.set_prevlogindex(getLogTerm(logEntries.front().index() - 1));
request.set_leadercommit(m_commitIndex);
request.set_senderid(m_config.m_id);

for (const auto &logEntry : logEntries)
{
request.add_entries()->CopyFrom(logEntry);
}
}

AppendEntriesResponse response;
auto status = client.appendEntries(request, &response);
if (!status)
{
absl::MutexLock locker{&m_stateMutex};
spdlog::error("Node={} failed to send AppendEntries RPC to peer={} at term={}",
m_config.m_id,
client.id(),
m_currentTerm);
return;
}

{
absl::MutexLock locker{&m_stateMutex};
if (response.term() > m_currentTerm)
{
becomeFollower(response.term());
return;
}
}

if (response.success())
{
absl::MutexLock locker{&m_stateMutex};
m_matchIndex[client.id()] = response.match_index();
m_nextIndex[client.id()] = response.match_index() + 1;

uint32_t majorityIndex = findMajorityIndexMatch();
if (majorityIndex > m_commitIndex && m_log[majorityIndex - 1].term() == m_currentTerm)
{
if (!updatePersistentState(majorityIndex, std::nullopt))
{
spdlog::error("Node={} is unable to persist commitIndex={}", m_config.m_id, m_commitIndex);
return;
}

// Apply successfull replication to the state machine e.g. in-memory hash-table
while (m_lastApplied < m_commitIndex)
{
++m_lastApplied;
m_kv[m_log[m_lastApplied - 1].key()] = m_log[m_lastApplied - 1].value();
}

return;
}
}

if (!response.success())
{
{
absl::MutexLock locker{&m_stateMutex};
m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
}
sendAppendEntriesRPC(client, {});
}
},
std::ref(client),
logEntries)
.detach();
}

⚠️ Potential issue

Fix critical thread safety and lifecycle issues in sendAppendEntriesRPC.

The method has several critical issues:

  1. Uses std::thread::detach() which can lead to resource leaks
  2. Potential infinite recursion in failure handling
  3. Unsafe access to shared state
-void consensus_module_t::sendAppendEntriesRPC(raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries)
+void consensus_module_t::sendAppendEntriesRPC(raft_node_grpc_client_t &client, const std::vector<LogEntry>& logEntries)
 {
-    std::thread(
+    m_replicationThreads.emplace_back(
         [this](raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries)
         {
             // ... existing code ...

             if (!response.success())
             {
                 {
                     absl::MutexLock locker{&m_stateMutex};
                     m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
                 }
-                sendAppendEntriesRPC(client, {});  // Recursive call!
+                // Use a retry loop instead of recursion
+                continue;
             }
         },
         std::ref(client),
-        logEntries)
-        .detach();
+        logEntries);
 }

Also add this member to the class to manage thread lifecycles:

std::vector<std::jthread> m_replicationThreads;
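For context on that choice: std::jthread (C++20) joins automatically in its destructor, so stop() can simply clear or destroy the vector without risking std::terminate. A minimal standalone illustration:

#include <thread>
#include <vector>

int main()
{
    std::vector<std::jthread> workers;
    workers.emplace_back([] { /* replication work would go here */ });

    // No explicit join needed: each jthread joins when the vector is destroyed.
}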

Comment on lines +980 to +1044
auto consensus_module_t::flushPersistentState() -> bool
{
// Flush commitIndex and votedFor
{
auto path = std::filesystem::path(constructFilename(gRaftFilename, m_config.m_id));
std::ofstream fsa(path, std::fstream::out | std::fstream::trunc);
if (!fsa.is_open())
{
spdlog::error("Node={} is unable to open {} to flush commitIndex={} and votedFor={}",
m_config.m_id,
path.c_str(),
m_commitIndex,
m_votedFor);
return false;
}

fsa << m_commitIndex << " " << m_votedFor << "\n";
if (fsa.fail())
{
spdlog::error("Node={} is unable to write into {} the commitIndex={} and votedFor={}",
m_config.m_id,
path.c_str(),
m_commitIndex,
m_votedFor);
return false;
}

fsa.flush();
if (fsa.fail())
{
return false;
}
// TODO(lnikon): ::fsync(fsa->handle());
}

// Flush the log
{
auto path = std::filesystem::path(constructFilename(gLogFilename, m_config.m_id));
std::ofstream fsa(path, std::fstream::out | std::fstream::trunc);
if (!fsa.is_open())
{
spdlog::error("Node={} is unable to open {} to flush the log", m_config.m_id, path.c_str());
return false;
}

for (const auto &entry : m_log)
{
fsa << entry.key() << " " << entry.value() << " " << entry.term() << "\n";

if (fsa.fail())
{
spdlog::error("Node={} is unable to write into {} the commitIndex={} and votedFor={}",
m_config.m_id,
path.c_str(),
m_commitIndex,
m_votedFor);
return false;
}
}
fsa.flush();
// TODO(lnikon): ::fsync(fsa->handle());
}

return true;
}

🛠️ Refactor suggestion

⚠️ Potential issue

Improve persistence reliability and error handling.

The persistence implementation lacks atomic write guarantees and proper error handling. The TODO comments about fsync need to be addressed.

  1. Use atomic file operations to prevent data corruption
  2. Implement proper fsync calls
  3. Add comprehensive error handling
 auto consensus_module_t::flushPersistentState() -> bool
 {
+    const auto tmpPath = path.string() + ".tmp";
     // Flush commitIndex and votedFor
     {
         auto path = std::filesystem::path(constructFilename(gRaftFilename, m_config.m_id));
-        std::ofstream fsa(path, std::fstream::out | std::fstream::trunc);
+        std::ofstream fsa(tmpPath, std::fstream::out | std::fstream::trunc);
         if (!fsa.is_open()) {
             return false;
         }

         fsa << m_commitIndex << " " << m_votedFor << "\n";
         fsa.flush();
+        if (::fsync(fileno(fsa.rdbuf()->fd())) != 0) {
+            return false;
+        }
+        fsa.close();
+
+        if (std::rename(tmpPath.c_str(), path.c_str()) != 0) {
+            return false;
+        }
     }
     // Similar changes for log persistence...
     return true;
 }

Committable suggestion skipped: line range outside the PR's diff.
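One caveat with the diff above: std::ofstream does not expose its file descriptor, so fileno(fsa.rdbuf()->fd()) will not compile. A portable way to get the write-temp/fsync/rename behaviour is to go through POSIX descriptors directly; a minimal sketch (POSIX only, helper name is illustrative and not from the PR):

#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <string>
#include <unistd.h>

bool atomicWrite(const std::string &path, const std::string &contents)
{
    const std::string tmpPath = path + ".tmp";

    int fd = ::open(tmpPath.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
    {
        return false;
    }

    const char *data = contents.data();
    std::size_t left = contents.size();
    while (left > 0)
    {
        const ssize_t written = ::write(fd, data, left);
        if (written < 0)
        {
            ::close(fd);
            return false;
        }
        data += written;
        left -= static_cast<std::size_t>(written);
    }

    // Push the data to stable storage before making the file visible.
    if (::fsync(fd) != 0)
    {
        ::close(fd);
        return false;
    }
    ::close(fd);

    // rename() atomically replaces the old state file with the new one.
    return std::rename(tmpPath.c_str(), path.c_str()) == 0;
}

For full durability the containing directory should also be fsynced after the rename, otherwise the rename itself can be lost on a crash.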

Comment on lines +938 to +954
auto consensus_module_t::waitForMajorityReplication(uint32_t logIndex) -> bool
{
    constexpr const auto replicationTimeout{5};

    auto hasMajority = [this, logIndex]() ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex)
    {
        uint32_t count = 1;
        for (const auto &[peer, matchIdx] : m_matchIndex)
        {
            count += matchIdx >= logIndex;
        }
        return count >= (m_replicas.size() + 1) / 2 + 1;
    };

    spdlog::info("Node={} is waiting for majority to agree on logIndex={}", m_config.m_id, logIndex);
    return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority), absl::Seconds(replicationTimeout));
}

🛠️ Refactor suggestion

Improve replication wait mechanism.

The timeout value is hardcoded and the majority calculation could be improved.

-    constexpr const auto replicationTimeout{5};
+    static constexpr const auto REPLICATION_TIMEOUT_SECONDS = 5;
+    static constexpr const auto MIN_MAJORITY = 2;  // Minimum number of nodes for consensus

     auto hasMajority = [this, logIndex]() ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex)
     {
         uint32_t count = 1;  // Count self
         for (const auto &[peer, matchIdx] : m_matchIndex)
         {
             count += matchIdx >= logIndex;
         }
-        return count >= (m_replicas.size() + 1) / 2 + 1;
+        const size_t majorityCount = (m_replicas.size() + 1) / 2 + 1;
+        return count >= std::max(majorityCount, MIN_MAJORITY);
     };

     spdlog::info("Node={} is waiting for majority to agree on logIndex={}", m_config.m_id, logIndex);
-    return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority), absl::Seconds(replicationTimeout));
+    return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority), 
+                                        absl::Seconds(REPLICATION_TIMEOUT_SECONDS));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto consensus_module_t::waitForMajorityReplication(uint32_t logIndex) -> bool
{
constexpr const auto replicationTimeout{5};
auto hasMajority = [this, logIndex]() ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex)
{
uint32_t count = 1;
for (const auto &[peer, matchIdx] : m_matchIndex)
{
count += matchIdx >= logIndex;
}
return count >= (m_replicas.size() + 1) / 2 + 1;
};
spdlog::info("Node={} is waiting for majority to agree on logIndex={}", m_config.m_id, logIndex);
return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority), absl::Seconds(replicationTimeout));
}
auto consensus_module_t::waitForMajorityReplication(uint32_t logIndex) -> bool
{
static constexpr const auto REPLICATION_TIMEOUT_SECONDS = 5;
static constexpr const auto MIN_MAJORITY = 2; // Minimum number of nodes for consensus
auto hasMajority = [this, logIndex]() ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex)
{
uint32_t count = 1; // Count self
for (const auto &[peer, matchIdx] : m_matchIndex)
{
count += matchIdx >= logIndex;
}
const size_t majorityCount = (m_replicas.size() + 1) / 2 + 1;
return count >= std::max(majorityCount, MIN_MAJORITY);
};
spdlog::info("Node={} is waiting for majority to agree on logIndex={}", m_config.m_id, logIndex);
return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority),
absl::Seconds(REPLICATION_TIMEOUT_SECONDS));
}

Comment on lines +31 to +41
auto generate_random_timeout() -> int
{
    const int minTimeout{150};
    const int maxTimeout{300};

    std::random_device randomDevice;
    std::mt19937 gen(randomDevice());
    std::uniform_int_distribution<> dist(minTimeout, maxTimeout);

    return dist(gen);
}

⚠️ Potential issue

Fix thread safety issues in random number generation.

The random number generator is not thread-safe and could lead to race conditions when called from multiple threads. Also, the timeout values should be configurable constants.

Apply this diff to fix these issues:

+constexpr const int MIN_TIMEOUT_MS = 150;
+constexpr const int MAX_TIMEOUT_MS = 300;

 auto generate_random_timeout() -> int
 {
-    const int minTimeout{150};
-    const int maxTimeout{300};
 
-    std::random_device              randomDevice;
-    std::mt19937                    gen(randomDevice());
+    thread_local std::random_device randomDevice;
+    thread_local std::mt19937 gen(randomDevice());
-    std::uniform_int_distribution<> dist(minTimeout, maxTimeout);
+    std::uniform_int_distribution<> dist(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS);
 
     return dist(gen);
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto generate_random_timeout() -> int
{
const int minTimeout{150};
const int maxTimeout{300};
std::random_device randomDevice;
std::mt19937 gen(randomDevice());
std::uniform_int_distribution<> dist(minTimeout, maxTimeout);
return dist(gen);
}
auto generate_random_timeout() -> int
{
constexpr const int MIN_TIMEOUT_MS = 150;
constexpr const int MAX_TIMEOUT_MS = 300;
thread_local std::random_device randomDevice;
thread_local std::mt19937 gen(randomDevice());
std::uniform_int_distribution<> dist(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS);
return dist(gen);
}

Comment on lines +923 to +936
auto consensus_module_t::findMajorityIndexMatch() -> uint32_t
{
    std::vector<int> matchIndexes;
    matchIndexes.resize(m_replicas.size() + 1);
    for (const auto &[peer, matchIdx] : m_matchIndex)
    {
        matchIndexes.emplace_back(matchIdx);
    }
    matchIndexes.emplace_back(m_log.back().index());

    std::ranges::sort(matchIndexes);

    return matchIndexes[matchIndexes.size() / 2];
}

⚠️ Potential issue

Fix majority index calculation.

The vector size calculation is incorrect and there are potential duplicate entries.

 auto consensus_module_t::findMajorityIndexMatch() -> uint32_t
 {
-    std::vector<int> matchIndexes;
-    matchIndexes.resize(m_replicas.size() + 1);
+    std::vector<uint32_t> matchIndexes;
+    matchIndexes.reserve(m_replicas.size() + 1);  // Reserve space for all replicas + self
+
     for (const auto &[peer, matchIdx] : m_matchIndex)
     {
         matchIndexes.emplace_back(matchIdx);
     }
-    matchIndexes.emplace_back(m_log.back().index());
+    // Add self's last log index
+    matchIndexes.emplace_back(getLastLogIndex());

     std::ranges::sort(matchIndexes);

+    // Majority is (N+1)/2 where N is total number of servers
     return matchIndexes[matchIndexes.size() / 2];
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto consensus_module_t::findMajorityIndexMatch() -> uint32_t
{
std::vector<int> matchIndexes;
matchIndexes.resize(m_replicas.size() + 1);
for (const auto &[peer, matchIdx] : m_matchIndex)
{
matchIndexes.emplace_back(matchIdx);
}
matchIndexes.emplace_back(m_log.back().index());
std::ranges::sort(matchIndexes);
return matchIndexes[matchIndexes.size() / 2];
}
auto consensus_module_t::findMajorityIndexMatch() -> uint32_t
{
std::vector<uint32_t> matchIndexes;
matchIndexes.reserve(m_replicas.size() + 1); // Reserve space for all replicas + self
for (const auto &[peer, matchIdx] : m_matchIndex)
{
matchIndexes.emplace_back(matchIdx);
}
// Add self's last log index
matchIndexes.emplace_back(getLastLogIndex());
std::ranges::sort(matchIndexes);
// Majority is (N+1)/2 where N is total number of servers
return matchIndexes[matchIndexes.size() / 2];
}
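To see why the middle element of the sorted matchIndex values is the majority-replicated index, here is a standalone example with made-up values:

#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

int main()
{
    // matchIndex of four replicas plus the leader's own last log index.
    std::vector<std::uint32_t> matchIndexes{5, 3, 7, 4, 6};
    std::ranges::sort(matchIndexes); // {3, 4, 5, 6, 7}

    // Index 5 is replicated on at least 3 of the 5 nodes, i.e. a majority.
    assert(matchIndexes[matchIndexes.size() / 2] == 5);
}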

Comment on lines +782 to +800
{
    AppendEntriesResponse response;
    if (!client.appendEntries(request, &response))
    {
        consecutiveFailures++;

        spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
                      consecutiveFailures,
                      maxRetries);
        if (consecutiveFailures >= maxRetries)
        {
            spdlog::error(
                "Stopping heartbeat thread due to too much failed AppendEntries RPC attempts");
            return;
        }
    }

    consecutiveFailures = 0;


⚠️ Potential issue

Fix consecutive failures handling in heartbeat mechanism.

The consecutive failures counter is reset too early, which breaks the retry mechanism.

                     if (!client.appendEntries(request, &response))
                     {
                         consecutiveFailures++;
-                        consecutiveFailures = 0;  // Remove this line
                         spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
                                     consecutiveFailures,
                                     maxRetries);
                         if (consecutiveFailures >= maxRetries)
                         {
                             return;
                         }
                         continue;
                     }
+                    // Only reset on success
+                    consecutiveFailures = 0;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
{
AppendEntriesResponse response;
if (!client.appendEntries(request, &response))
{
consecutiveFailures++;
spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
consecutiveFailures,
maxRetries);
if (consecutiveFailures >= maxRetries)
{
spdlog::error(
"Stopping heartbeat thread due to too much failed AppendEntries RPC attempts");
return;
}
}
consecutiveFailures = 0;
{
AppendEntriesResponse response;
if (!client.appendEntries(request, &response))
{
consecutiveFailures++;
spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
consecutiveFailures,
maxRetries);
if (consecutiveFailures >= maxRetries)
{
spdlog::error(
"Stopping heartbeat thread due to too much failed AppendEntries RPC attempts");
return;
}
continue;
}
// Only reset on success
consecutiveFailures = 0;

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

♻️ Duplicate comments (4)
examples/raft/main.cpp (3)

14-20: ⚠️ Potential issue

Critical Issue: Unsafe use of std::condition_variable in signal handler

Calling std::condition_variable::notify_all() within a signal handler is unsafe and can lead to undefined behavior because it is not async-signal-safe. According to the C++ standard, only a limited set of functions are safe to call from within a signal handler.

Apply this diff to replace the condition variable with an atomic flag:

-std::condition_variable gCv;

 static void signalHandler(int sig)
 {
     if (sig == SIGTERM || sig == SIGINT)
     {
-        gCv.notify_all();
+        gShutdown.store(true);
     }
 }

+std::atomic<bool> gShutdown{false};

 ...

 int main(int argc, char *argv[])
 {
     std::signal(SIGTERM, signalHandler);
     std::signal(SIGINT, signalHandler);

     ...

-    std::mutex                   mtx;
-    std::unique_lock<std::mutex> lock(mtx);
-    gCv.wait(lock);
+    while (!gShutdown.load())
+    {
+        std::this_thread::sleep_for(std::chrono::milliseconds(100));
+    }

     consensusModule.stop();

     return EXIT_SUCCESS;
 }

46-51: 🛠️ Refactor suggestion

Enhance IP address validation and error handling

Currently, the code only checks if the list of node IPs is not empty. For robustness, consider adding validation for:

  • The nodeId is within the range of provided IPs.
  • IP address format correctness.
  • Duplicate IP addresses.

Apply this diff to enhance validation:

 if (nodeIps.empty())
 {
     spdlog::error("List of node IPs can't be empty");
     return EXIT_FAILURE;
 }
+
+if (nodeId > nodeIps.size())
+{
+    spdlog::error("Node ID {} is larger than the number of provided IPs {}", nodeId, nodeIps.size());
+    return EXIT_FAILURE;
+}
+
+std::unordered_set<std::string> ipSet(nodeIps.begin(), nodeIps.end());
+if (ipSet.size() != nodeIps.size())
+{
+    spdlog::error("Duplicate IP addresses detected in the node IP list");
+    return EXIT_FAILURE;
+}
+
+std::regex ipRegex("^(?:[0-9]{1,3}\\.){3}[0-9]{1,3}:[0-9]{1,5}$");
+for (const auto &ip : nodeIps)
+{
+    if (!std::regex_match(ip, ipRegex))
+    {
+        spdlog::error("Invalid IP address format: {}", ip);
+        return EXIT_FAILURE;
+    }
+}
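One small note: the validation above relies on std::regex and std::unordered_set, so main.cpp also needs the corresponding headers if they are not already pulled in transitively:

#include <regex>
#include <unordered_set>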

58-59: ⚠️ Potential issue

Security Concern: Using insecure channel credentials

The use of grpc::InsecureChannelCredentials() is not suitable for production environments as it exposes the communication to potential man-in-the-middle attacks.

Consider using secure channel credentials:

-std::unique_ptr<RaftService::Stub> stub{
-    RaftService::NewStub(grpc::CreateChannel(replicaIp, grpc::InsecureChannelCredentials()))};
+auto creds = grpc::SslCredentials(grpc::SslCredentialsOptions());
+std::unique_ptr<RaftService::Stub> stub{
+    RaftService::NewStub(grpc::CreateChannel(replicaIp, creds))};
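A default-constructed SslCredentialsOptions falls back to the default root certificates; in practice the CA bundle (and, for mutual TLS, the node's key and certificate chain) is loaded from files. A sketch with illustrative paths that are not part of the PR:

#include <fstream>
#include <memory>
#include <sstream>
#include <string>

#include <grpcpp/grpcpp.h>

std::string readFile(const std::string &path)
{
    std::ifstream     ifs(path);
    std::stringstream buffer;
    buffer << ifs.rdbuf();
    return buffer.str();
}

auto makeSecureChannel(const std::string &target) -> std::shared_ptr<grpc::Channel>
{
    grpc::SslCredentialsOptions options;
    options.pem_root_certs = readFile("certs/ca.pem"); // trusted CA bundle
    // For mutual TLS also set options.pem_private_key and options.pem_cert_chain.
    return grpc::CreateChannel(target, grpc::SslCredentials(options));
}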
.vscode/launch.json (1)

105-105: ⚠️ Potential issue

Fix JSON syntax error.

Remove the trailing comma after the pipeTransport object.

Apply this diff:

                 "debuggerPath": "/usr/bin/gdb"
-            },
+            }
         },
🧰 Tools
🪛 Biome (1.9.4)

[error] 105-105: Expected a property but instead found '}'.

Expected a property here.

(parse)

🧹 Nitpick comments (9)
examples/raft/main.cpp (1)

74-75: Set logging level before module initialization

Consider setting the logging level before initializing the consensus module to capture all debug logs during initialization.

Apply this diff to adjust the order:

 spdlog::set_level(spdlog::level::debug);
 consensusModule.start();
examples/raft/raft.h (3)

14-24: Remove unnecessary includes

The includes for Raft.grpc.pb.h and Raft.pb.h are sufficient. The additional gRPC headers in lines 3~ to 11~ may not be necessary and can be removed to reduce compilation time.

Apply this diff to clean up the includes:

-#include <grpc/grpc.h>
-#include <grpcpp/channel.h>
-#include <grpcpp/client_context.h>
-#include <grpcpp/create_channel.h>
-#include <grpcpp/security/credentials.h>
-#include <grpcpp/security/server_credentials.h>
-#include <grpcpp/server_builder.h>
-#include <grpcpp/support/status.h>

 #include "Raft.grpc.pb.h"
 #include "Raft.pb.h"

79-106: Enhance class design and documentation for raft_node_grpc_client_t

To improve maintainability and readability, consider the following:

  • Add method documentation for public methods.
  • Mark the destructor as noexcept.
  • Ensure thread safety guarantees are documented.

Apply this diff to enhance the class:

 /**
  * Client for communicating with other nodes in the Raft cluster.
  * Handles RPC operations for consensus and key-value operations.
+ * Thread-safe: All public methods are thread-safe as they use thread-safe gRPC stubs.
  */
 class raft_node_grpc_client_t
 {
   public:
     /**
      * Constructs a client for communicating with a specific node.
      * @param config Node configuration containing ID and IP address.
      * @param pRaftStub Unique pointer to the Raft service stub interface.
      */
     raft_node_grpc_client_t(node_config_t config, std::unique_ptr<RaftService::StubInterface> pRaftStub);
-    virtual ~raft_node_grpc_client_t() noexcept = default;
+    virtual ~raft_node_grpc_client_t() noexcept;

     raft_node_grpc_client_t(const raft_node_grpc_client_t &) = delete;
     auto operator=(const raft_node_grpc_client_t &) -> raft_node_grpc_client_t & = delete;

     raft_node_grpc_client_t(raft_node_grpc_client_t &&) = default;
     auto operator=(raft_node_grpc_client_t &&) -> raft_node_grpc_client_t & = default;

     /**
      * Sends append entries RPC to the node.
      * @param request AppendEntriesRequest message.
      * @param response Pointer to AppendEntriesResponse message.
      * @return True if the RPC was successful; otherwise, false.
      */
     auto appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool;
     /**
      * Sends request vote RPC to the node.
      * @param request RequestVoteRequest message.
      * @param response Pointer to RequestVoteResponse message.
      * @return True if the RPC was successful; otherwise, false.
      */
     auto requestVote(const RequestVoteRequest &request, RequestVoteResponse *response) -> bool;

     [[nodiscard]] auto id() const -> id_t;
     [[nodiscard]] auto ip() const -> ip_t;
 };

176-179: Add missing thread safety annotations

The m_state variable is guarded by m_stateMutex but lacks thread safety annotations.

Apply this diff to add the annotation:

 uint32_t m_commitIndex ABSL_GUARDED_BY(m_stateMutex);
 uint32_t m_lastApplied ABSL_GUARDED_BY(m_stateMutex);
-NodeState m_state      ABSL_GUARDED_BY(m_stateMutex);
+NodeState m_state ABSL_GUARDED_BY(m_stateMutex);
examples/raft/raft_test.cpp (2)

84-85: Simplify range transformations

The use of std::views::transform and std::ranges::to may be unnecessary if a simpler approach suffices.

Consider simplifying the raftClients() method:

 auto raftClients()
 {
-    return m_raftClients | std::views::transform([](auto &&pair) { return std::move(pair.second); }) |
-           std::ranges::to<std::vector<raft::raft_node_grpc_client_t>>();
+    std::vector<raft::raft_node_grpc_client_t> clients;
+    for (auto &[id, client] : m_raftClients)
+    {
+        clients.push_back(std::move(client));
+    }
+    return clients;
 }

121-123: Ensure proper cleanup in main function

After running all tests, it's good practice to clean up any resources or perform any necessary teardown.

Consider adding any required cleanup code after RUN_ALL_TESTS():

int main(int argc, char **argv)
{
    spdlog::set_level(spdlog::level::debug);

    testing::InitGoogleTest(&argc, argv);
    int result = RUN_ALL_TESTS();

    // Perform any necessary cleanup here

    return result;
}
.vscode/launch.json (1)

39-44: Make node configuration configurable.

The hardcoded node ID and peer list make it difficult to debug different nodes.

Apply this diff:

             "args": [
                 "--id",
-                "1",
+                "${input:nodeId}",
                 "--nodes",
-                "0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082"
+                "${input:peerList}"
             ],

Add to the inputs section:

        {
            "id": "peerList",
            "type": "promptString",
            "description": "Comma-separated list of peer addresses",
            "default": "0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082"
        }
examples/raft/raft.cpp (2)

918-934: Improve replication wait mechanism.

The timeout value is hardcoded and the majority calculation could be improved.

Apply this diff:

-    constexpr const auto replicationTimeout{5};
+    static constexpr const auto REPLICATION_TIMEOUT_SECONDS = 5;
+    static constexpr const auto MIN_MAJORITY = 2;  // Minimum number of nodes for consensus

     auto hasMajority = [this, logIndex]() ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex)
     {
         uint32_t count = 1;  // Count self
         for (const auto &[peer, matchIdx] : m_matchIndex)
         {
             count += matchIdx >= logIndex;
         }
-        return count >= (m_replicas.size() + 1) / 2 + 1;
+        const size_t majorityCount = (m_replicas.size() + 1) / 2 + 1;
+        return count >= std::max(majorityCount, MIN_MAJORITY);
     };

     spdlog::info("Node={} is waiting for majority to agree on logIndex={}", m_config.m_id, logIndex);
-    return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority), absl::Seconds(replicationTimeout));
+    return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority), 
+                                        absl::Seconds(REPLICATION_TIMEOUT_SECONDS));

893-901: Use safe array access in getLogTerm.

The array access should use at() to catch out-of-bounds access.

Apply this diff:

     if (index == 0 || index > m_log.size())
     {
         return 0;
     }

-    return m_log[index - 1].term();
+    return m_log.at(index - 1).term();
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ece39ad and 0834c89.

📒 Files selected for processing (5)
  • .vscode/launch.json (2 hunks)
  • examples/raft/main.cpp (1 hunks)
  • examples/raft/raft.cpp (1 hunks)
  • examples/raft/raft.h (1 hunks)
  • examples/raft/raft_test.cpp (1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
.vscode/launch.json

[error] 74-74: Expected a property but instead found '}'.

Expected a property here.

(parse)


[error] 105-105: Expected a property but instead found '}'.

Expected a property here.

(parse)

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build-clang
  • GitHub Check: build-gcc
🔇 Additional comments (10)
examples/raft/main.cpp (1)

68-72: Enhance consensus module initialization error handling

Add timeout and detailed error reporting for initialization failures to improve robustness.

Apply this diff to enhance error handling:

 auto nodeId = parsedOptions["id"].as<id_t>();

 ...

 raft::consensus_module_t consensusModule({.m_id = nodeId, .m_ip = nodeIps[nodeId - 1]}, std::move(replicas));
+auto initStart = std::chrono::steady_clock::now();
+while (!consensusModule.init())
 {
     spdlog::error("Failed to initialize the state machine");
+    auto elapsed = std::chrono::steady_clock::now() - initStart;
+    if (elapsed > std::chrono::seconds(10))
+    {
+        spdlog::error("Failed to initialize consensus module after {} seconds",
+                      std::chrono::duration_cast<std::chrono::seconds>(elapsed).count());
+        return EXIT_FAILURE;
+    }
+    std::this_thread::sleep_for(std::chrono::milliseconds(100));
 }
examples/raft/raft_test.cpp (1)

28-45: Avoid mixing Google Test and Catch2 frameworks

The test file uses Google Test (gtest/gtest.h) and Google Mock (gmock/gmock.h). Ensure that only one testing framework is used to prevent potential conflicts.

.vscode/launch.json (1)

74-74: ⚠️ Potential issue

Fix JSON syntax error.

Remove the trailing comma after the pipeTransport object.

Apply this diff:

                 "debuggerPath": "/usr/bin/gdb"
-            },
+            }
         },

Likely invalid or redundant comment.

🧰 Tools
🪛 Biome (1.9.4)

[error] 74-74: Expected a property but instead found '}'.

Expected a property here.

(parse)

examples/raft/raft.cpp (7)

704-725: ⚠️ Potential issue

Reset vote count when becoming leader.

The vote count should be reset to avoid interference with future elections.

Apply this diff:

     m_state = NodeState::LEADER;
+    m_voteCount = 0;  // Reset vote count

Likely invalid or redundant comment.


903-916: ⚠️ Potential issue

Fix majority index calculation.

The vector size calculation is incorrect and there are potential duplicate entries.

Apply this diff:

 auto consensus_module_t::findMajorityIndexMatch() -> uint32_t
 {
-    std::vector<int> matchIndexes;
-    matchIndexes.resize(m_replicas.size() + 1);
+    std::vector<uint32_t> matchIndexes;
+    matchIndexes.reserve(m_replicas.size() + 1);  // Reserve space for all replicas + self
+
     for (const auto &[peer, matchIdx] : m_matchIndex)
     {
         matchIndexes.emplace_back(matchIdx);
     }
-    matchIndexes.emplace_back(m_log.back().index());
+    // Add self's last log index
+    matchIndexes.emplace_back(getLastLogIndex());

     std::ranges::sort(matchIndexes);

+    // Majority is (N+1)/2 where N is total number of servers
     return matchIndexes[matchIndexes.size() / 2];
 }

Likely invalid or redundant comment.


31-41: ⚠️ Potential issue

Fix thread safety issues in random number generation.

The current implementation creates new random device and generator on each call, which is inefficient and not thread-safe.

Apply this diff:

+constexpr const int MIN_TIMEOUT_MS = 150;
+constexpr const int MAX_TIMEOUT_MS = 300;

 auto generate_random_timeout() -> int
 {
-    const int minTimeout{150};
-    const int maxTimeout{300};
 
-    std::random_device              randomDevice;
-    std::mt19937                    gen(randomDevice());
+    thread_local std::random_device randomDevice;
+    thread_local std::mt19937 gen(randomDevice());
-    std::uniform_int_distribution<> dist(minTimeout, maxTimeout);
+    std::uniform_int_distribution<> dist(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS);
 
     return dist(gen);
 }

Likely invalid or redundant comment.


960-1024: ⚠️ Potential issue

Improve persistence reliability and error handling.

The persistence implementation lacks atomic write guarantees and proper error handling.

Apply this diff:

 auto consensus_module_t::flushPersistentState() -> bool
 {
+    const auto tmpPath = path.string() + ".tmp";
     // Flush commitIndex and votedFor
     {
         auto path = std::filesystem::path(constructFilename(gRaftFilename, m_config.m_id));
-        std::ofstream fsa(path, std::fstream::out | std::fstream::trunc);
+        std::ofstream fsa(tmpPath, std::fstream::out | std::fstream::trunc);
         if (!fsa.is_open()) {
             return false;
         }

         fsa << m_commitIndex << " " << m_votedFor << "\n";
         fsa.flush();
+        if (::fsync(fileno(fsa.rdbuf()->fd())) != 0) {
+            return false;
+        }
+        fsa.close();
+
+        if (std::rename(tmpPath.c_str(), path.c_str()) != 0) {
+            return false;
+        }
     }
     // Similar changes for log persistence...
     return true;
 }

Likely invalid or redundant comment.


761-779: ⚠️ Potential issue

Fix consecutive failures handling in heartbeat mechanism.

The consecutive failures counter is reset too early, which breaks the retry mechanism.

Apply this diff:

                     if (!client.appendEntries(request, &response))
                     {
                         consecutiveFailures++;
-                        consecutiveFailures = 0;  // Remove this line
                         spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
                                     consecutiveFailures,
                                     maxRetries);
                         if (consecutiveFailures >= maxRetries)
                         {
                             return;
                         }
                         continue;
                     }
+                    // Only reset on success
+                    consecutiveFailures = 0;

Likely invalid or redundant comment.


877-885: ⚠️ Potential issue

Fix infinite recursion in failed AppendEntries handling.

The recursive call to sendAppendEntriesRPC on failure can lead to infinite recursion and stack overflow.

Apply this diff:

             if (!response.success())
             {
                 {
                     absl::MutexLock locker{&m_stateMutex};
                     m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
                 }
-                sendAppendEntriesRPC(client, {});
+                // Use a retry loop instead of recursion
+                continue;
             }

Likely invalid or redundant comment.


698-702: ⚠️ Potential issue

Fix incorrect majority calculation.

The majority calculation is incorrect and could lead to split votes.

Apply this diff:

 auto consensus_module_t::hasMajority(const uint32_t votes) const -> bool
 {
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    // Total nodes = replicas + self
+    const size_t totalNodes = m_replicas.size() + 1;
+    return votes > totalNodes / 2;
 }

Likely invalid or redundant comment.

Comment on lines +171 to +173
uint32_t m_currentTerm ABSL_GUARDED_BY(m_stateMutex);
uint32_t m_votedFor ABSL_GUARDED_BY(m_stateMutex);
std::vector<LogEntry> m_log ABSL_GUARDED_BY(m_stateMutex);

⚠️ Potential issue

Ensure proper initialization of member variables

Member variables m_currentTerm and m_votedFor are declared but not initialized. Uninitialized variables can lead to undefined behavior.

Apply this diff to initialize the variables:

 uint32_t m_currentTerm      ABSL_GUARDED_BY(m_stateMutex);
 uint32_t m_votedFor         ABSL_GUARDED_BY(m_stateMutex);

Initialize them in the constructor:

consensus_module_t::consensus_module_t(node_config_t nodeConfig, std::vector<raft_node_grpc_client_t> replicas)
    : m_config(std::move(nodeConfig))
    , m_replicas(std::move(replicas))
    , m_currentTerm(0)
    , m_votedFor(gInvalidId)
{
    // Constructor implementation
}
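An alternative that does not depend on every constructor is in-class default member initializers (sketch; 0 is used here, but the gInvalidId sentinel from the snippet above would work equally well if it is a constant):

uint32_t m_currentTerm ABSL_GUARDED_BY(m_stateMutex) = 0;
uint32_t m_votedFor    ABSL_GUARDED_BY(m_stateMutex) = 0;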

Comment on lines +108 to +125
class consensus_module_t : public RaftService::Service
{
  public:
    // @id is the ID of the current node. Order of RaftServices in @replicas is important!
    consensus_module_t(node_config_t nodeConfig, std::vector<raft_node_grpc_client_t> replicas);

    // NOLINTBEGIN(modernize-use-trailing-return-type)
    grpc::Status AppendEntries(grpc::ServerContext *pContext,
                               const AppendEntriesRequest *pRequest,
                               AppendEntriesResponse *pResponse) override ABSL_LOCKS_EXCLUDED(m_stateMutex);

    auto RequestVote(grpc::ServerContext *pContext, const RequestVoteRequest *pRequest, RequestVoteResponse *pResponse)
        -> grpc::Status override ABSL_LOCKS_EXCLUDED(m_stateMutex);
    // NOLINTEND(modernize-use-trailing-return-type)

    [[nodiscard]] auto init() -> bool;
    void start();
    void stop();

🛠️ Refactor suggestion

Correct inheritance of std::enable_shared_from_this and document thread safety

  • Inherit publicly from std::enable_shared_from_this to ensure proper functionality.
  • Provide documentation on thread safety guarantees of consensus_module_t.

Apply this diff to correct inheritance and add documentation:

 class consensus_module_t : public RaftService::Service
-{
+    , public std::enable_shared_from_this<consensus_module_t>
+{
  public:
     /**
      * Constructs a consensus module for a node in the Raft cluster.
      * @param nodeConfig Configuration of the current node.
      * @param replicas Vector of client stubs for replica nodes.
      */
     consensus_module_t(node_config_t nodeConfig, std::vector<raft_node_grpc_client_t> replicas);

+    /**
+     * Initializes the consensus module. Must be called before start().
+     * @return True if initialization was successful; otherwise, false.
+     */
     [[nodiscard]] auto init() -> bool;
     /**
      * Starts the consensus module operations.
      */
     void               start();
     /**
      * Stops the consensus module operations.
      */
     void               stop();
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
class consensus_module_t : public RaftService::Service
{
public:
// @id is the ID of the current node. Order of RaftServices in @replicas is important!
consensus_module_t(node_config_t nodeConfig, std::vector<raft_node_grpc_client_t> replicas);
// NOLINTBEGIN(modernize-use-trailing-return-type)
grpc::Status AppendEntries(grpc::ServerContext *pContext,
const AppendEntriesRequest *pRequest,
AppendEntriesResponse *pResponse) override ABSL_LOCKS_EXCLUDED(m_stateMutex);
auto RequestVote(grpc::ServerContext *pContext, const RequestVoteRequest *pRequest, RequestVoteResponse *pResponse)
-> grpc::Status override ABSL_LOCKS_EXCLUDED(m_stateMutex);
// NOLINTEND(modernize-use-trailing-return-type)
[[nodiscard]] auto init() -> bool;
void start();
void stop();
class consensus_module_t : public RaftService::Service
, public std::enable_shared_from_this<consensus_module_t>
{
public:
/**
* Constructs a consensus module for a node in the Raft cluster.
* @param nodeConfig Configuration of the current node.
* @param replicas Vector of client stubs for replica nodes.
*/
consensus_module_t(node_config_t nodeConfig, std::vector<raft_node_grpc_client_t> replicas);
// NOLINTBEGIN(modernize-use-trailing-return-type)
grpc::Status AppendEntries(grpc::ServerContext *pContext,
const AppendEntriesRequest *pRequest,
AppendEntriesResponse *pResponse) override ABSL_LOCKS_EXCLUDED(m_stateMutex);
auto RequestVote(grpc::ServerContext *pContext, const RequestVoteRequest *pRequest, RequestVoteResponse *pResponse)
-> grpc::Status override ABSL_LOCKS_EXCLUDED(m_stateMutex);
// NOLINTEND(modernize-use-trailing-return-type)
/**
* Initializes the consensus module. Must be called before start().
* @return True if initialization was successful; otherwise, false.
*/
[[nodiscard]] auto init() -> bool;
/**
* Starts the consensus module operations.
*/
void start();
/**
* Stops the consensus module operations.
*/
void stop();

Comment on lines +7 to +33
{
"name": "rr - RaftMain",
"type": "cppdbg",
"request": "launch",
"program": "${workspaceFolder}/build/RaftMain",
"miDebuggerServerAddress": "localhost:50505",
"stopAtEntry": false,
"cwd": "${workspaceFolder}",
"environment": [],
"externalConsole": true,
"linux": {
"MIMode": "gdb",
"setupCommands": [
{
"description": "Setup to resolve symbols",
"text": "set sysroot /",
"ignoreFailures": false
}
]
},
"osx": {
"MIMode": "gdb"
},
"windows": {
"MIMode": "gdb"
}
},

🛠️ Refactor suggestion

Enhance RaftMain debug configuration for multi-node debugging.

The configuration needs improvements for debugging multiple Raft nodes:

  1. Add build task to ensure RaftMain is compiled
  2. Add command line arguments for Raft configuration (node ID, peer list)
  3. Make debug port configurable to avoid conflicts

Apply this diff:

         {
             "name": "rr - RaftMain",
             "type": "cppdbg",
             "request": "launch",
             "program": "${workspaceFolder}/build/RaftMain",
-            "miDebuggerServerAddress": "localhost:50505",
+            "miDebuggerServerAddress": "localhost:${input:debugPort}",
+            "args": [
+                "--node-id", "${input:nodeId}",
+                "--config", "${workspaceFolder}/assets/raft_config.json"
+            ],
             "stopAtEntry": false,
             "cwd": "${workspaceFolder}",
             "environment": [],
             "externalConsole": true,
+            "preLaunchTask": "build",
             "linux": {
                 "MIMode": "gdb",

Add to the end of the file:

    "inputs": [
        {
            "id": "debugPort",
            "type": "promptString",
            "description": "Debug port for remote debugging",
            "default": "50505"
        },
        {
            "id": "nodeId",
            "type": "promptString",
            "description": "Raft node ID",
            "default": "1"
        }
    ]

Comment on lines +810 to +891
void consensus_module_t::sendAppendEntriesRPC(raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries)
{
std::thread(
[this](raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries)
{
AppendEntriesRequest request;
{
absl::MutexLock locker{&m_stateMutex};

request.set_term(m_currentTerm);
request.set_prevlogterm(logEntries.front().term());
request.set_prevlogindex(getLogTerm(logEntries.front().index() - 1));
request.set_leadercommit(m_commitIndex);
request.set_senderid(m_config.m_id);

for (const auto &logEntry : logEntries)
{
request.add_entries()->CopyFrom(logEntry);
}
}

AppendEntriesResponse response;
auto status = client.appendEntries(request, &response);
if (!status)
{
absl::MutexLock locker{&m_stateMutex};
spdlog::error("Node={} failed to send AppendEntries RPC to peer={} at term={}",
m_config.m_id,
client.id(),
m_currentTerm);
return;
}

{
absl::MutexLock locker{&m_stateMutex};
if (response.term() > m_currentTerm)
{
becomeFollower(response.term());
return;
}
}

if (response.success())
{
absl::MutexLock locker{&m_stateMutex};
m_matchIndex[client.id()] = response.match_index();
m_nextIndex[client.id()] = response.match_index() + 1;

uint32_t majorityIndex = findMajorityIndexMatch();
if (majorityIndex > m_commitIndex && m_log[majorityIndex - 1].term() == m_currentTerm)
{
if (!updatePersistentState(majorityIndex, std::nullopt))
{
spdlog::error("Node={} is unable to persist commitIndex={}", m_config.m_id, m_commitIndex);
return;
}

// Apply successfull replication to the state machine e.g. in-memory hash-table
while (m_lastApplied < m_commitIndex)
{
++m_lastApplied;
// TODO(lnikon): Update the state machine
// m_kv[m_log[m_lastApplied - 1].key()] = m_log[m_lastApplied - 1].value();
}

return;
}
}

if (!response.success())
{
{
absl::MutexLock locker{&m_stateMutex};
m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
}
sendAppendEntriesRPC(client, {});
}
},
std::ref(client),
logEntries)
.detach();
}

⚠️ Potential issue

Avoid detaching threads in sendAppendEntriesRPC.

Using detached threads can lead to resource leaks and undefined behavior.

Apply this diff:

-    std::thread(
+    m_replicationThreads.emplace_back(
         [this](raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries)
         {
             // ... existing code ...
         },
         std::ref(client),
-        logEntries)
-        .detach();
+        std::move(logEntries));

Add to the class definition:

std::vector<std::jthread> m_replicationThreads;

Comment on lines +234 to +247
// 3. Append new entries and remove conflicting ones
auto newEntryStart = pRequest->prevlogindex() + 1;
for (auto idx{0}; idx < pRequest->entries().size(); ++idx)
{
    const auto newIdx{newEntryStart + idx - 1};
    if (m_log.size() >= newEntryStart + idx && m_log[newIdx].term() != pRequest->entries(idx).term())
    {
        m_log.resize(newIdx);
        break;
    }
}

m_log.insert(m_log.end(), pRequest->entries().begin(), pRequest->entries().end());


⚠️ Potential issue

Fix potential buffer overflow in log entry handling.

The log entry insertion doesn't properly validate array bounds and could cause buffer overflow. Also, the insertion method is inefficient.

Apply this diff:

-    auto newEntryStart = pRequest->prevlogindex() + 1;
-    for (auto idx{0}; idx < pRequest->entries().size(); ++idx)
+    const auto newEntryStart = pRequest->prevlogindex() + 1;
+    if (newEntryStart > m_log.size() + 1) {
+        pResponse->set_term(m_currentTerm);
+        pResponse->set_success(false);
+        pResponse->set_responderid(m_config.m_id);
+        return grpc::Status::OK;
+    }
+
+    for (size_t idx = 0; idx < pRequest->entries().size(); ++idx)
     {
-        const auto newIdx{newEntryStart + idx - 1};
+        const auto newIdx = newEntryStart + idx - 1;
         if (m_log.size() >= newEntryStart + idx && m_log[newIdx].term() != pRequest->entries(idx).term())
         {
             m_log.resize(newIdx);
             break;
         }
     }
-    m_log.insert(m_log.end(), pRequest->entries().begin(), pRequest->entries().end());
+    
+    // Use reserve to prevent multiple reallocations
+    m_log.reserve(m_log.size() + pRequest->entries().size());
+    for (const auto& entry : pRequest->entries()) {
+        m_log.push_back(entry);
+    }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    // 3. Append new entries and remove conflicting ones
    const auto newEntryStart = pRequest->prevlogindex() + 1;
    if (newEntryStart > m_log.size() + 1) {
        pResponse->set_term(m_currentTerm);
        pResponse->set_success(false);
        pResponse->set_responderid(m_config.m_id);
        return grpc::Status::OK;
    }

    for (size_t idx = 0; idx < pRequest->entries().size(); ++idx)
    {
        const auto newIdx = newEntryStart + idx - 1;
        if (m_log.size() >= newEntryStart + idx && m_log[newIdx].term() != pRequest->entries(idx).term())
        {
            m_log.resize(newIdx);
            break;
        }
    }

    // Use reserve to prevent multiple reallocations
    m_log.reserve(m_log.size() + pRequest->entries().size());
    for (const auto& entry : pRequest->entries()) {
        m_log.push_back(entry);
    }
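For reference, the append/truncate rule the suggestion implements can be exercised in isolation with a simplified, self-contained sketch; entry_t and appendEntries below are stand-ins for the generated protobuf types and the real handler, and unlike the handler above this sketch also skips entries the follower already holds:

// Simplified model of the Raft AppendEntries log reconciliation step.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct entry_t { std::uint32_t term; };

// Returns false when prevLogIndex points past the end of the follower's log.
bool appendEntries(std::vector<entry_t> &log, std::size_t prevLogIndex, const std::vector<entry_t> &entries)
{
    if (prevLogIndex > log.size())
    {
        return false;                                // gap in the log: reject, leader backs off
    }

    for (std::size_t i = 0; i < entries.size(); ++i)
    {
        const std::size_t pos = prevLogIndex + i;    // 0-based slot of the i-th incoming entry
        if (pos < log.size() && log[pos].term != entries[i].term)
        {
            log.resize(pos);                         // drop the conflicting suffix
            break;
        }
    }

    for (std::size_t i = 0; i < entries.size(); ++i)
    {
        const std::size_t pos = prevLogIndex + i;
        if (pos >= log.size())
        {
            log.push_back(entries[i]);               // append only the missing entries
        }
    }
    return true;
}

int main()
{
    std::vector<entry_t> log{{1}, {1}, {2}};
    assert(!appendEntries(log, 5, {{3}}));           // out-of-range prevLogIndex is rejected
    assert(appendEntries(log, 1, {{1}, {3}}));       // conflict at slot 2: old term 2 vs new term 3
    assert(log.size() == 3 && log[2].term == 3);
}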

Comment on lines +1026 to +1080
auto consensus_module_t::restorePersistentState() -> bool
{
    {
        auto path = std::filesystem::path(constructFilename(gRaftFilename, m_config.m_id));
        if (!std::filesystem::exists(path))
        {
            spdlog::info("Node={} is running the first time", m_config.m_id);
            return true;
        }

        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error(
                "Node={} is unable to open {} to restore commitIndex and votedFor", m_config.m_id, path.c_str());
            return false;
        }

        ifs >> m_commitIndex >> m_votedFor;
        m_votedFor = 0;
        spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_config.m_id, m_commitIndex, m_votedFor);
    }

    {
        auto path = std::filesystem::path(constructFilename(gLogFilename, m_config.m_id));
        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error("Node={} is unable to open {} to restore log", m_config.m_id, path.c_str());
            return false;
        }

        std::string logLine;
        while (std::getline(ifs, logLine))
        {
            std::stringstream sst(logLine);

            std::string key;
            std::string value;
            std::uint32_t term = 0;

            sst >> key >> value >> term;

            LogEntry logEntry;
            logEntry.set_key(key);
            logEntry.set_value(value);
            logEntry.set_term(term);
            m_log.emplace_back(logEntry);

            spdlog::info("Node={} restored logEntry=[key={}, value={}, term={}]", m_config.m_id, key, value, term);
        }
    }

    return true;
}

⚠️ Potential issue

Handle file parsing errors in restorePersistentState.

The method never checks whether the stream extractions succeeded, so a corrupted or truncated file silently restores default or partially parsed values for commitIndex, votedFor, and the log entries.

Apply this diff:

         ifs >> m_commitIndex >> m_votedFor;
+        if (ifs.fail()) {
+            spdlog::error("Node={} failed to parse state file {}", m_config.m_id, path.c_str());
+            return false;
+        }
+
+        if (m_commitIndex > std::numeric_limits<uint32_t>::max() ||
+            m_votedFor > std::numeric_limits<uint32_t>::max()) {
+            spdlog::error("Node={} invalid values in state file {}", m_config.m_id, path.c_str());
+            return false;
+        }

         while (std::getline(ifs, logLine)) {
             std::stringstream sst(logLine);
             std::string key, value;
             std::uint32_t term = 0;
             
             sst >> key >> value >> term;
+            if (sst.fail()) {
+                spdlog::error("Node={} failed to parse log entry", m_config.m_id);
+                return false;
+            }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto consensus_module_t::restorePersistentState() -> bool
{
    {
        auto path = std::filesystem::path(constructFilename(gRaftFilename, m_config.m_id));
        if (!std::filesystem::exists(path))
        {
            spdlog::info("Node={} is running the first time", m_config.m_id);
            return true;
        }

        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error(
                "Node={} is unable to open {} to restore commitIndex and votedFor", m_config.m_id, path.c_str());
            return false;
        }

        ifs >> m_commitIndex >> m_votedFor;
        if (ifs.fail()) {
            spdlog::error("Node={} failed to parse state file {}", m_config.m_id, path.c_str());
            return false;
        }

        if (m_commitIndex > std::numeric_limits<uint32_t>::max() ||
            m_votedFor > std::numeric_limits<uint32_t>::max()) {
            spdlog::error("Node={} invalid values in state file {}", m_config.m_id, path.c_str());
            return false;
        }

        m_votedFor = 0;
        spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_config.m_id, m_commitIndex, m_votedFor);
    }

    {
        auto path = std::filesystem::path(constructFilename(gLogFilename, m_config.m_id));
        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error("Node={} is unable to open {} to restore log", m_config.m_id, path.c_str());
            return false;
        }

        std::string logLine;
        while (std::getline(ifs, logLine))
        {
            std::stringstream sst(logLine);

            std::string key;
            std::string value;
            std::uint32_t term = 0;

            sst >> key >> value >> term;
            if (sst.fail()) {
                spdlog::error("Node={} failed to parse log entry", m_config.m_id);
                return false;
            }

            LogEntry logEntry;
            logEntry.set_key(key);
            logEntry.set_value(value);
            logEntry.set_term(term);
            m_log.emplace_back(logEntry);

            spdlog::info("Node={} restored logEntry=[key={}, value={}, term={}]", m_config.m_id, key, value, term);
        }
    }

    return true;
}
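To complement the suggestion, here is a small self-contained sketch of defensive parsing for the assumed one-record-per-line "key value term" format; parse_log_line and parsed_entry_t are illustrative names, not part of the PR:

// Parse a single persisted log line, rejecting truncated or malformed records.
#include <cstdint>
#include <iostream>
#include <optional>
#include <sstream>
#include <string>

struct parsed_entry_t
{
    std::string   key;
    std::string   value;
    std::uint32_t term{0};
};

std::optional<parsed_entry_t> parse_log_line(const std::string &line)
{
    std::istringstream iss(line);
    parsed_entry_t entry;
    if (!(iss >> entry.key >> entry.value >> entry.term))
    {
        return std::nullopt;             // missing field or non-numeric term
    }
    std::string extra;
    if (iss >> extra)
    {
        return std::nullopt;             // trailing tokens indicate a corrupted line
    }
    return entry;
}

int main()
{
    std::cout << parse_log_line("foo bar 3").has_value()    // 1: well-formed
              << parse_log_line("foo bar").has_value()       // 0: term missing
              << parse_log_line("foo bar x").has_value()     // 0: term is not a number
              << '\n';
}

Returning std::optional instead of silently continuing lets restorePersistentState stop and report the exact line that failed, which is what the added sst.fail() check is aiming at.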
