Implement Raft protocol to replicate KV pairs #10
base: master
Conversation
Walkthrough
This pull request introduces new debugging configurations and a Raft consensus example for replicating KV pairs.
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Candidate
    participant Follower1
    participant Follower2
    Candidate->>Follower1: RequestVote RPC
    Candidate->>Follower2: RequestVote RPC
    Follower1-->>Candidate: Vote Granted
    Follower2-->>Candidate: Vote Granted
    Candidate->>Candidate: Become Leader
    loop Heartbeat
        Candidate->>Follower1: AppendEntries RPC
        Candidate->>Follower2: AppendEntries RPC
    end
```
Actionable comments posted: 6
🧹 Nitpick comments (7)
examples/raft/thread_safety.h (2)

52-96: Commented-out Mutex class
Currently, the annotated `Mutex` class is left commented out. If it is no longer needed, consider removing it to reduce code clutter; if it is still under development, enable it with an appropriate implementation.

229-279: Deprecated lock-style annotations
The lock-style annotations at the end of this file are marked as deprecated. If the project no longer relies on them, removing the dead code improves readability. Otherwise, move them to a dedicated legacy-compatibility file or mark them with a deprecation notice that points users to the new annotation approach.
examples/raft/main.cpp (1)

58-58: Consider passing 'ip' by const reference
Static analysis suggests passing 'ip' by const reference to avoid unnecessary copying when constructing NodeClient.

```diff
- NodeClient(const ID id, const IP ip)
+ NodeClient(const ID id, const IP& ip)
```

🧰 Tools
🪛 cppcheck (2.10-2)
[performance] 58-58: Function parameter 'ip' should be passed by const reference.
(passedByValue)
lib/proto/Raft.proto (1)

45-45: Inconsistent field naming: "lastLogterm"
For consistency with other fields like term and lastLogIndex, and with typical naming conventions, consider renaming lastLogterm to lastLogTerm.

```diff
- uint32 lastLogterm = 4;
+ uint32 lastLogTerm = 4;
```

lib/proto/CMakeLists.txt (3)
40-41: Remove duplicate variable definitions
These variables are already defined above (lines 11-12) and don't need to be redefined since they haven't changed.

```diff
-set(PROTO_BINARY_DIR "${CMAKE_CURRENT_BINARY_DIR}")
-set(PROTO_IMPORT_DIRS "${CMAKE_CURRENT_LIST_DIR}")
```
34-62: Consider reducing code duplication in proto generation setup
The Raft proto setup duplicates the same pattern as TinyKVPP. Consider creating a function to reduce duplication and improve maintainability.

Example approach:

```cmake
function(setup_proto_library TARGET_NAME PROTO_FILE)
  if(NOT EXISTS "${PROTO_FILE}")
    message(FATAL_ERROR "Proto file not found: ${PROTO_FILE}")
  endif()

  add_library(${TARGET_NAME} OBJECT "${PROTO_FILE}")
  target_include_directories(${TARGET_NAME} PUBLIC
    "$<BUILD_INTERFACE:${PROTO_BINARY_DIR}>"
    ${protobuf_INCLUDE_DIR})
  target_link_libraries(${TARGET_NAME} PUBLIC protobuf::libprotobuf gRPC::grpc++)

  protobuf_generate(
    TARGET ${TARGET_NAME}
    IMPORT_DIRS ${PROTO_IMPORT_DIRS}
    PROTOC_OUT_DIR "${PROTO_BINARY_DIR}")

  protobuf_generate(
    TARGET ${TARGET_NAME}
    LANGUAGE grpc
    PLUGIN protoc-gen-grpc=$<TARGET_FILE:gRPC::grpc_cpp_plugin>
    PLUGIN_OPTIONS generate_mock_code=true
    GENERATE_EXTENSIONS .grpc.pb.h .grpc.pb.cc)
endfunction()

# Usage:
setup_proto_library(TKVProtoObjects "${TINYKVPP_PROTO_FILE}")
setup_proto_library(RaftProtoObjects "${RAFT_PROTO_FILE}")
```
1-3: Add version requirements for protobuf and gRPC
Since this project relies heavily on protobuf and gRPC for the Raft implementation, consider adding minimum version requirements to ensure compatibility.

```diff
 cmake_minimum_required(VERSION 3.25)
 project(zkv)
+
+find_package(Protobuf 3.21 REQUIRED)
+find_package(gRPC 1.50 REQUIRED)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- .vscode/launch.json (2 hunks)
- CMakeLists.txt (1 hunks)
- examples/CMakeLists.txt (1 hunks)
- examples/raft/CMakeLists.txt (1 hunks)
- examples/raft/main.cpp (1 hunks)
- examples/raft/thread_safety.h (1 hunks)
- lib/proto/CMakeLists.txt (2 hunks)
- lib/proto/Raft.proto (1 hunks)
- lib/structures/lsmtree/levels/level.cpp (1 hunks)
- run_replicas.sh (1 hunks)
🧰 Additional context used
🪛 cppcheck (2.10-2)
examples/raft/main.cpp
[performance] 58-58: Function parameter 'ip' should be passed by const reference.
(passedByValue)
🔇 Additional comments (6)
examples/raft/thread_safety.h (2)

6-10: Clang-specific annotations and portability concern
These conditional macros work well for Clang's thread safety analysis. However, if this header is consumed by compilers lacking Clang's __attribute__ support, the macros become no-ops. Ensure consumers of this header are aware of how these macros behave in non-Clang environments.
108-124: Effective RAII-based locking
The MutexLocker class uses a clean RAII pattern to ensure mutex acquisition and release, mitigating the risk of forgetting to unlock on exceptional code paths. The approach is sound and safe.
examples/raft/main.cpp (1)

84-85: Review election timeout usage
Using generateRandomTimeout() directly as the RPC deadline is somewhat ad hoc. If timeouts are short, requests might fail prematurely and disrupt the election process. Verify that this approach fits the intended design constraints.
examples/CMakeLists.txt (1)

6-6: New subdirectory addition
Adding the raft subdirectory properly includes the new Raft module. Ensure that all required libraries and dependencies (such as grpc, spdlog, and the proto files) are consistently linked in the raft subdirectory's CMakeLists.txt.
run_replicas.sh (1)

3-5: Verify the deployment strategy for production
The current configuration binds to the wildcard address 0.0.0.0, which is suitable for development but may need adjustment for production deployment.
Please clarify:
- Is this script intended for development only?
- For production deployment, will there be:
- Configuration for actual host addresses?
- Container orchestration (e.g., Kubernetes)?
- Load balancer configuration?
.vscode/launch.json (1)

15-16: Verify external console requirement
This configuration enables an external console, which differs from the other configurations. Verify whether this is required for the Raft implementation (e.g., for viewing distributed node logs separately).
examples/raft/CMakeLists.txt
Outdated

```cmake
cmake_minimum_required(VERSION 3.25)
project(zkv)

add_executable(RaftMain "${CMAKE_CURRENT_SOURCE_DIR}/main.cpp")
set_target_properties(RaftMain PROPERTIES CXX_STANDARD 23)
target_link_libraries(RaftMain PRIVATE DB RaftProtoObjects)
```
🛠️ Refactor suggestion
Add test configuration for Raft implementation
The Raft consensus is a critical component that requires thorough testing. Consider adding test configuration to ensure reliability.
Add the following after the existing code:
```cmake
# Add test executable
add_executable(RaftTests "${CMAKE_CURRENT_SOURCE_DIR}/tests/raft_tests.cpp")
set_target_properties(RaftTests PROPERTIES CXX_STANDARD 23)
target_link_libraries(RaftTests PRIVATE
    DB
    RaftProtoObjects
    Catch2::Catch2WithMain
)

# Register with CTest
add_test(NAME raft_tests COMMAND RaftTests)
```
This will help ensure:
- Leader election correctness
- Log replication consistency
- Network partition handling
- State persistence reliability
run_replicas.sh
Outdated

```bash
#!/bin/bash

./build/RaftMain --id 1 --nodes 0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082 &> log_1.txt &
./build/RaftMain --id 2 --nodes 0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082 &> log_2.txt &
./build/RaftMain --id 3 --nodes 0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082 &> log_3.txt &
```
Enhance the replica deployment script for production readiness
The current script has several limitations that could cause issues in production:
- No cleanup of background processes
- No health checks for node startup
- No error handling for port availability
- Hardcoded configuration
- No graceful shutdown mechanism
Replace the script with this more robust version:
-#!/bin/bash
-
-./build/RaftMain --id 1 --nodes 0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082 &> log_1.txt &
-./build/RaftMain --id 2 --nodes 0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082 &> log_2.txt &
-./build/RaftMain --id 3 --nodes 0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082 &> log_3.txt &
+#!/bin/bash
+set -euo pipefail
+
+# Configuration
+readonly BASE_PORT=8080
+readonly NUM_REPLICAS=3
+readonly BINARY="./build/RaftMain"
+readonly LOG_DIR="logs"
+
+# Create log directory
+mkdir -p "${LOG_DIR}"
+
+# Cleanup function
+cleanup() {
+ echo "Shutting down Raft nodes..."
+ kill $(jobs -p) 2>/dev/null || true
+ wait
+}
+
+# Register cleanup on script exit
+trap cleanup EXIT
+
+# Build node list
+NODES=""
+for ((i=1; i<=NUM_REPLICAS; i++)); do
+ port=$((BASE_PORT + i - 1))
+ if [ ! -z "$NODES" ]; then
+ NODES="${NODES},"
+ fi
+ NODES="${NODES}0.0.0.0:${port}"
+done
+
+# Start nodes
+for ((i=1; i<=NUM_REPLICAS; i++)); do
+ port=$((BASE_PORT + i - 1))
+
+ # Check if port is available
+ if nc -z localhost $port 2>/dev/null; then
+ echo "Error: Port $port is already in use"
+ exit 1
+ fi
+
+ # Start node
+ echo "Starting Raft node $i on port $port..."
+ $BINARY --id $i --nodes $NODES &> "${LOG_DIR}/log_${i}.txt" &
+
+ # Wait briefly to ensure proper startup order
+ sleep 1
+
+ # Check if process is running
+ if ! kill -0 $! 2>/dev/null; then
+ echo "Error: Failed to start node $i"
+ exit 1
+ fi
+done
+
+echo "All Raft nodes started. Logs available in ${LOG_DIR}/"
+echo "Press Ctrl+C to shutdown all nodes"
+
+# Wait for Ctrl+C
+wait
This enhanced version includes:
- Process cleanup on exit
- Port availability checking
- Configurable number of replicas
- Organized logging
- Startup health checks
- Proper error handling
Actionable comments posted: 4
🧹 Nitpick comments (5)
examples/raft/main.cpp (5)
32-33: Consider making timeout values configurable
The hardcoded timeout values could be moved to configuration to allow tuning based on network conditions and deployment requirements.

```diff
- const int minTimeout{150};
- const int maxTimeout{300};
+ static constexpr const char* CONFIG_MIN_TIMEOUT = "RAFT_MIN_TIMEOUT";
+ static constexpr const char* CONFIG_MAX_TIMEOUT = "RAFT_MAX_TIMEOUT";
+ const int minTimeout = std::getenv(CONFIG_MIN_TIMEOUT) ? std::atoi(std::getenv(CONFIG_MIN_TIMEOUT)) : 150;
+ const int maxTimeout = std::getenv(CONFIG_MAX_TIMEOUT) ? std::atoi(std::getenv(CONFIG_MAX_TIMEOUT)) : 300;
```
53-60: Optimize parameter passing in constructor
The ip parameter should be passed by const reference for better performance with strings.

```diff
- NodeClient(const ID id, const IP ip)
+ NodeClient(const ID id, const IP& ip)
```

🧰 Tools
🪛 cppcheck (2.10-2)
[performance] 53-53: Function parameter 'ip' should be passed by const reference.
(passedByValue)
103-103: Remove commented-out code
The commented-out completion queue declaration should be removed if it is not needed.

```diff
- /*grpc::CompletionQueue m_cq;*/
```
309-309: Implement persistent state initialization
The TODO comment indicates missing persistence implementation, which is crucial for Raft's durability guarantees.
Would you like me to help implement the persistent state initialization using a simple file-based storage solution?
591-592: Improve error message formatting
The error message should follow consistent formatting and include the actual value for better debugging.

```diff
- spdlog::error("ID of the node should be positve integer");
+ spdlog::error("Invalid node ID: {}. Node ID must be a positive integer.", nodeId);
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- examples/raft/main.cpp (1 hunks)
🧰 Additional context used
🪛 cppcheck (2.10-2)
examples/raft/main.cpp
[performance] 53-53: Function parameter 'ip' should be passed by const reference.
(passedByValue)
examples/raft/main.cpp
Outdated

```cpp
void decrementNextIndex(ID id)
{
    (void)id;
}
```
Implement decrementNextIndex method.
The method is currently a no-op but is crucial for Raft's log replication mechanism.
void decrementNextIndex(ID id)
{
- (void)id;
+ if (m_nextIndex[id - 1] > 0) {
+ m_nextIndex[id - 1]--;
+ }
}
examples/raft/main.cpp
Outdated

```cpp
while (!m_stopElectionTimer)
{
    if (token.stop_requested())
    {
        spdlog::info("Stopping election timer thread");
        return;
    }

    {
        absl::MutexLock locker(&m_electionMutex);
        if (m_state == NodeState::LEADER)
        {
            continue;
        }
    }
```
Fix potential race condition in election timer loop.
The continue statement inside the leader check could cause the thread to spin continuously while the node is the leader. Consider using a condition variable to wait efficiently.
if (m_state == NodeState::LEADER)
{
- continue;
+ m_timerCV.WaitWithTimeout(&m_timerMutex, absl::Milliseconds(100));
+ continue;
}
Committable suggestion skipped: line range outside the PR's diff.
examples/raft/main.cpp
Outdated

```cpp
std::thread(
    [this, &client, heartbeatInterval]()
    {
        while (m_state == NodeState::LEADER)
        {
            AppendEntriesRequest request;
            {
                absl::ReaderMutexLock locker(&m_electionMutex);

                request.set_term(m_currentTerm);
                request.set_prevlogterm(getLastLogTerm());
                request.set_prevlogindex(getLastLogIndex());
                request.set_leadercommit(m_commitIndex);
                request.set_senderid(m_id);
            }

            {
                AppendEntriesResponse response;
                if (!client.appendEntries(request, &response))
                {
                    spdlog::error("AppendEntriesRequest failed during heartbeat");
                    return;
                }

                auto responseTerm = response.term();
                auto success = response.success();

                spdlog::info(
                    "Received AppendEntriesResponse in requester thread peerTerm={} success={} responderId={}",
                    responseTerm,
                    success,
                    response.responderid());

                {
                    absl::WriterMutexLock locker(&m_electionMutex);

                    if (responseTerm > m_currentTerm)
                    {
                        becomeFollower(responseTerm);
                        return;
                    }

                    if (!success)
                    {
                        decrementNextIndex(client.getId());
                    }
                }
            }

            std::this_thread::sleep_for(heartbeatInterval);
        }
    })
    .detach();
```
Potential resource leak in heartbeat thread.
The detached thread might outlive the NodeClient reference it captures, leading to undefined behavior. Consider using a managed thread pool or ensuring proper lifecycle management.
- std::thread(
+ auto heartbeatThread = std::make_shared<std::jthread>(
[this, &client, heartbeatInterval]()
{
// ... existing code ...
- })
- .detach();
+ });
+ m_heartbeatThreads.push_back(heartbeatThread);
Committable suggestion skipped: line range outside the PR's diff.
Actionable comments posted: 2
🧹 Nitpick comments (2)
examples/raft/main.cpp (2)
30-40: Consider making timeout constants configurable and more descriptive
The hardcoded timeout values could be made configurable to allow tuning based on network conditions and deployment requirements.

```diff
- const int minTimeout{150};
- const int maxTimeout{300};
+ static constexpr int MIN_ELECTION_TIMEOUT_MS{150};
+ static constexpr int MAX_ELECTION_TIMEOUT_MS{300};
+ // TODO: Consider making these configurable via command line arguments
```
53-60: Pass 'ip' parameter by const reference for better performance
Passing string parameters by value creates unnecessary copies.

```diff
- NodeClient(const ID id, const IP ip)
+ NodeClient(const ID id, const IP& ip)
```

🧰 Tools
🪛 cppcheck (2.10-2)
[performance] 53-53: Function parameter 'ip' should be passed by const reference.
(passedByValue)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- examples/raft/main.cpp (1 hunks)
🧰 Additional context used
🪛 cppcheck (2.10-2)
examples/raft/main.cpp
[performance] 53-53: Function parameter 'ip' should be passed by const reference.
(passedByValue)
🔇 Additional comments (3)
examples/raft/main.cpp (3)
491-494: Implement decrementNextIndex method.
The method is currently a no-op but is crucial for Raft's log replication mechanism.
void decrementNextIndex(ID id)
{
- (void)id;
+ if (m_nextIndex[id - 1] > 0) {
+ m_nextIndex[id - 1]--;
+ }
}
Likely invalid or redundant comment.
436-489: Replace detached thread with managed thread for heartbeat.
Using detached threads can lead to resource leaks and undefined behavior if the referenced objects are destroyed. Consider using a managed thread pool or at least storing the thread handle.
- std::thread(
+ auto heartbeatThread = std::make_shared<std::jthread>(
[this, &client, heartbeatInterval]()
{
// ... existing code ...
- })
- .detach();
+ });
+ m_heartbeatThreads.push_back(heartbeatThread);
Likely invalid or redundant comment.
226-240: Fix potential deadlock in election timer loop.
The current implementation might deadlock if the node becomes a leader while holding the election mutex. Consider using a condition variable with a timeout instead of a busy wait.
- if (m_state == NodeState::LEADER)
- {
- continue;
- }
+ if (m_state == NodeState::LEADER)
+ {
+ m_timerCV.WaitWithTimeout(&m_timerMutex, absl::Milliseconds(100));
+ continue;
+ }
Likely invalid or redundant comment.
examples/raft/main.cpp
Outdated

```cpp
m_channel(grpc::CreateChannel(m_ip, grpc::InsecureChannelCredentials())),
m_stub(RaftService::NewStub(m_channel))
```
Security concern: Using insecure channel credentials.
The use of grpc::InsecureChannelCredentials()
is not suitable for production environments as it exposes the communication to potential man-in-the-middle attacks.
Consider using TLS certificates for secure communication:

```cpp
auto creds = grpc::SslCredentials(grpc::SslCredentialsOptions());
m_channel = grpc::CreateChannel(m_ip, creds);
```
examples/raft/main.cpp
Outdated

```cpp
cm.start();

cm.stop();
```
🛠️ Refactor suggestion
Implement graceful shutdown handling.
The server should handle SIGTERM/SIGINT signals for graceful shutdown instead of immediate stop.
```cpp
void signal_handler(int signal) {
    spdlog::info("Received signal {}. Initiating graceful shutdown...", signal);
    // Trigger graceful shutdown
}

// In main():
std::signal(SIGTERM, signal_handler);
std::signal(SIGINT, signal_handler);
```
Actionable comments posted: 0
🧹 Nitpick comments (3)
examples/raft/raft.h (1)

22-38: Consider passing ip by const reference in the NodeClient constructor
Static analysis flagged this constructor parameter as a performance concern. Although this might be minor for small strings, passing by const reference avoids unnecessary copies, especially if IP can hold large strings or if this constructor is called frequently.

```diff
- NodeClient(const ID id, const IP ip);
+ NodeClient(const ID id, const IP& ip);
```

examples/raft/raft.cpp (1)
22-22: Pass ip by const reference to optimize
As noted by static analysis, passing ip by const reference can avoid copying strings if they are large or if the constructor is called often.

```diff
- NodeClient::NodeClient(const ID id, const IP ip)
+ NodeClient::NodeClient(const ID id, const IP& ip)
```
🪛 cppcheck (2.10-2)
[performance] 22-22: Function parameter 'ip' should be passed by const reference.
(passedByValue)
examples/raft/main.cpp (1)

39-42: Avoid copying IP strings in the for loop
In the loop iterating over nodeIps, consider capturing the iterated string by const reference to avoid unnecessary copies. For large or numerous IP strings, this minor refactor can improve performance:

```diff
- for (auto ip : nodeIps)
+ for (const auto& ip : nodeIps)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- examples/raft/CMakeLists.txt (1 hunks)
- examples/raft/main.cpp (1 hunks)
- examples/raft/raft.cpp (1 hunks)
- examples/raft/raft.h (1 hunks)
🧰 Additional context used
🪛 cppcheck (2.10-2)
examples/raft/raft.cpp
[performance] 22-22: Function parameter 'ip' should be passed by const reference.
(passedByValue)
examples/raft/main.cpp
[performance] 25-25: Function parameter 'ip' should be passed by const reference.
(passedByValue)
🔇 Additional comments (4)
examples/raft/raft.h (1)

40-125: enable_shared_from_this usage check
The class ConsensusModule inherits from std::enable_shared_from_this<ConsensusModule> but does not appear to call shared_from_this(). Verify whether this inheritance is intentional. If not, consider removing it to simplify the class.
examples/raft/raft.cpp (2)

193-198: Election timer loop concurrency risk
This comment reaffirms a previously raised concern about potential continuous spinning or premature exit when the node becomes a leader, preventing election timeout monitoring if leadership changes again.

459-463: decrementNextIndex remains unimplemented
This method is critical for the Raft log replication process but is currently a no-op. Consider implementing the logic to decrement m_nextIndex[id - 1], as recommended in a prior review.

examples/raft/CMakeLists.txt (1)

4-6: Add specialized test target for Raft
A test configuration for the Raft consensus ensures reliability of leader election, log replication, and fault tolerance. See the previous suggestion for details on adding a dedicated test executable.
Actionable comments posted: 7
🧹 Nitpick comments (2)
examples/raft/raft.cpp (2)
31-43: Add deadline to appendEntries RPC call
For consistency with requestVote, consider adding a deadline to prevent hanging on network issues:

```diff
 auto NodeClient::appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool
 {
     grpc::ClientContext context;
+    context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(generateRandomTimeout()));
     grpc::Status status = m_stub->AppendEntries(&context, request, response);
```
279-286: Implement persistent state storage
The TODO comment indicates missing persistence implementation.

Would you like me to help implement the persistent state storage using RocksDB or a similar key-value store? This would ensure durability of critical Raft state across restarts.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- examples/raft/raft.cpp (1 hunks)
- lib/structures/lsmtree/levels/level.cpp (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)
- lib/structures/lsmtree/levels/level.cpp
🧰 Additional context used
🪛 cppcheck (2.10-2)
examples/raft/raft.cpp
[performance] 22-22: Function parameter 'ip' should be passed by const reference.
(passedByValue)
Actionable comments posted: 1
🧹 Nitpick comments (2)
examples/raft/raft.cpp (2)
31-43: Enhance error handling in appendEntries
The error handling could be more detailed and include retry logic for transient failures. Consider adding:

```diff
 auto NodeClient::appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool
 {
+    constexpr int maxRetries = 3;
+    int attempts = 0;
+    while (attempts++ < maxRetries) {
     grpc::ClientContext context;
+    context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(1));
     grpc::Status status = m_stub->AppendEntries(&context, request, response);
     if (!status.ok())
     {
-        spdlog::error("AppendEntries RPC call failed");
+        spdlog::error("AppendEntries RPC call failed. Error code={}, message={}, attempt={}/{}",
+                      static_cast<int>(status.error_code()),
+                      status.error_message(),
+                      attempts,
+                      maxRetries);
+        if (attempts < maxRetries) {
+            std::this_thread::sleep_for(std::chrono::milliseconds(100 * attempts));
+            continue;
+        }
         return false;
     }
+    break;
+    }
     return true;
 }
```
293-300: Implement persistent state initialization
The TODO comment indicates missing persistence implementation, which is crucial for maintaining consistency across node restarts.

Would you like me to help implement the persistent state initialization using a simple file-based storage system?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- examples/raft/raft.cpp (1 hunks)
🧰 Additional context used
🪛 cppcheck (2.10-2)
examples/raft/raft.cpp
[performance] 22-22: Function parameter 'ip' should be passed by const reference.
(passedByValue)
🔇 Additional comments (5)
examples/raft/raft.cpp (5)
1-21: LGTM! Headers are well-organized and complete
The includes are properly categorized and contain all necessary dependencies for gRPC, abseil synchronization primitives, and logging.
401-405: Fix incorrect majority calculation
The majority calculation is incorrect, as noted in the previous review. Apply the existing fix:
auto ConsensusModule::hasMajority(const uint32_t votes) const -> bool
{
- constexpr const double HALF_OF_THE_REPLICAS = 2.0;
- return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+ // Total nodes = replicas + self
+ const size_t totalNodes = m_replicas.size() + 1;
+ return votes > totalNodes / 2;
}
429-488
: 🛠️ Refactor suggestion
Enhance heartbeat thread reliability.
The heartbeat mechanism needs improvements:
- Add retry logic for failed heartbeats
- Implement backoff strategy
- Add thread cleanup on errors
void ConsensusModule::sendHeartbeat(NodeClient &client)
{
- constexpr const auto heartbeatInterval{std::chrono::milliseconds(100)};
+ constexpr const auto baseHeartbeatInterval{std::chrono::milliseconds(100)};
+ constexpr const int maxRetries = 3;
+ constexpr const auto maxBackoff{std::chrono::milliseconds(1000)};
m_heartbeatThreads.emplace_back(
- [this, &client, heartbeatInterval](std::stop_token token)
+ [this, &client](std::stop_token token)
{
+ auto heartbeatInterval = baseHeartbeatInterval;
+ int consecutiveFailures = 0;
Likely invalid or redundant comment.
22-29
: 🛠️ Refactor suggestion
Enhance constructor robustness and efficiency.
The constructor should validate inputs and handle connection failures:
-NodeClient::NodeClient(const ID id, const IP ip)
+NodeClient::NodeClient(const ID id, const IP& ip)
: m_id{id},
m_ip{ip},
m_channel(grpc::CreateChannel(m_ip, grpc::InsecureChannelCredentials())),
m_stub(RaftService::NewStub(m_channel))
{
assert(m_id > 0);
+ assert(!m_ip.empty());
+
+ // Verify channel is valid
+ if (!m_channel || !m_stub) {
+ throw std::runtime_error("Failed to establish gRPC channel to " + m_ip);
+ }
}
Likely invalid or redundant comment.
🧰 Tools
🪛 cppcheck (2.10-2)
[performance] 22-22: Function parameter 'ip' should be passed by const reference.
(passedByValue)
66-96
: 🛠️ Refactor suggestion
Enhance constructor's thread safety and error handling.
The constructor needs better initialization and validation:
- Initialize atomic variables explicitly
- Add error handling for server startup
- Validate replica IDs
ConsensusModule::ConsensusModule(const ID id, std::vector<IP> replicas)
: m_id{id},
m_currentTerm{0},
m_votedFor{0},
- m_state{NodeState::FOLLOWER}
+ m_state{NodeState::FOLLOWER},
+ m_stopElectionTimer{false},
+ m_electionInProgress{false},
+ m_leaderHeartbeatReceived{false},
+ m_voteCount{0}
{
assert(m_id > 0);
assert(replicas.size() > 0);
assert(m_id <= replicas.size());
m_ip = replicas[m_id - 1];
+ if (m_ip.empty()) {
+ throw std::runtime_error("Invalid IP address for node " + std::to_string(m_id));
+ }
grpc::ServerBuilder builder;
builder.AddListeningPort(m_ip, grpc::InsecureServerCredentials());
builder.RegisterService(this);
m_server = builder.BuildAndStart();
+ if (!m_server) {
+ throw std::runtime_error("Failed to start gRPC server on " + m_ip);
+ }
Likely invalid or redundant comment.
examples/raft/raft.cpp
Outdated
/*if (pRequest->lastlogterm() < getLastLogTerm() ||*/
/*    (pRequest->lastlogterm() == getLastLogTerm() && pRequest->lastlogindex() < getLastLogIndex()))*/
/*{*/
/*    pResponse->set_votegranted(0);*/
/*    return grpc::Status::OK;*/
/*}*/
/**/
return grpc::Status::OK;
Uncomment and fix the log comparison logic in RequestVote.
The commented-out log comparison logic is essential for the Raft protocol's safety properties. It should be uncommented and fixed.
- /*if (pRequest->lastlogterm() < getLastLogTerm() ||*/
- /* (pRequest->lastlogterm() == getLastLogTerm() && pRequest->lastlogindex() < getLastLogIndex()))*/
- /*{*/
- /* pResponse->set_votegranted(0);*/
- /* return grpc::Status::OK;*/
- /*}*/
- /**/
+ // Rule: Candidate's log must be at least as up-to-date as receiver's log
+ if (pRequest->lastlogterm() < getLastLogTerm() ||
+ (pRequest->lastlogterm() == getLastLogTerm() && pRequest->lastlogindex() < getLastLogIndex()))
+ {
+ pResponse->set_term(m_currentTerm);
+ pResponse->set_votegranted(0);
+ pResponse->set_responderid(m_id);
+ return grpc::Status::OK;
+ }
📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// Rule: Candidate's log must be at least as up-to-date as receiver's log
if (pRequest->lastlogterm() < getLastLogTerm() ||
    (pRequest->lastlogterm() == getLastLogTerm() && pRequest->lastlogindex() < getLastLogIndex()))
{
    pResponse->set_term(m_currentTerm);
    pResponse->set_votegranted(0);
    pResponse->set_responderid(m_id);
    return grpc::Status::OK;
}
return grpc::Status::OK;
Actionable comments posted: 2
♻️ Duplicate comments (1)
examples/raft/raft.cpp (1)
422-426: ⚠️ Potential issue: Fix incorrect majority calculation.
The current logic can lead to split votes. Please use total nodes (replicas.size() + 1) instead of just replicas when determining majority.
-constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+const size_t totalNodes = m_replicas.size() + 1; // includes self
+return votes > totalNodes / 2;
🧹 Nitpick comments (7)
examples/raft/raft.cpp (5)
25-43: Increase inline IP validation or logging information.
Currently, the constructor validates a non-empty IP and logs channel creation issues. Consider adding more context or checks (e.g., IP format) to proactively catch configuration errors.
45-57: Add trace-level logs for debug support.
The appendEntries RPC call is crucial to Raft. Adding trace logs (e.g., printing the request's term) can help debug replication issues.
59-73: Implement consistent RPC deadlines.
You set a deadline for RequestVote, whereas appendEntries has no per-RPC deadline. Consider aligning both to ensure consistent timeout behavior.
80-114: Consider checking the validity of the term or a replica size mismatch.
In the constructor, you assume m_id <= replicas.size(). If this is violated, you throw in the assertion; consider converting it into a runtime error or a user-friendly message for production builds.
288-312: Ensure graceful handling of any outstanding RPC calls on stop.
Consider waiting for in-flight RPCs to complete or adding a short deadline prior to forcibly shutting down the gRPC server.
examples/raft/raft.h (2)
20-20: Consider adding randomness seeding or injection for reproducibility.
generateRandomTimeout() likely depends on an internal PRNG. For test determinism, you can seed it or allow a configurable random source.
72-75: Document the lock invariants.
It is helpful to clarify in comments that becomeLeader and sendHeartbeat require the m_stateMutex to be locked. This ensures future maintainers understand the concurrency design.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
examples/raft/raft.cpp (1 hunks)
examples/raft/raft.h (1 hunks)
🔇 Additional comments (2)
examples/raft/raft.cpp (1)
154-196: Ensure you handle a second vote request from the same candidate within the same term.
When m_votedFor != 0 and is not the candidate's ID, you immediately continue, but you might want to short-circuit with a negative response.
examples/raft/raft.h (1)
40-40: Revisit the multiple inheritance approach.
Inheriting from both RaftService::Service and std::enable_shared_from_this is valid but can complicate object lifetime. Confirm that shared pointers are consistently managed to avoid double frees or dangling references.
examples/raft/raft.cpp
Outdated
auto ConsensusModule::AppendEntries(grpc::ServerContext *pContext,
                                    const AppendEntriesRequest *pRequest,
                                    AppendEntriesResponse *pResponse) -> grpc::Status
{
    (void)pContext;
    (void)pRequest;
    (void)pResponse;

    spdlog::info("Recevied AppendEntries RPC from leader={} during term={}", pRequest->senderid(), pRequest->term());

    absl::MutexLock locker(&m_stateMutex);
    absl::MutexLock timerLocker(&m_timerMutex);

    if (pRequest->term() < m_currentTerm)
    {
        pResponse->set_term(m_currentTerm);
        pResponse->set_success(false);
        pResponse->set_responderid(m_id);
        return grpc::Status::OK;
    }

    if (pRequest->term() > m_currentTerm)
    {
        becomeFollower(pRequest->term());
    }

    pResponse->set_term(m_currentTerm);
    pResponse->set_success(true);
    pResponse->set_responderid(m_id);

    spdlog::info("Node={} is resetting election timeout at term={}", m_id, m_currentTerm);
    resetElectionTimer();

    m_leaderHeartbeatReceived.store(true);

    return grpc::Status::OK;
}
Validate the log index in AppendEntries.
The method always sets success=true without verifying log continuity (as per the Raft spec). Not rejecting an inconsistent log entry could lead to divergent logs.
examples/raft/raft.cpp
Outdated
void ConsensusModule::sendHeartbeat(NodeClient &client)
{
    constexpr const auto heartbeatInterval{std::chrono::milliseconds(10)};
    constexpr const int maxRetries{3};

    m_heartbeatThreads.emplace_back(
        [this, maxRetries, &client, heartbeatInterval](std::stop_token token)
        {
            spdlog::info("Node={} is starting a heartbeat thread for client={}", m_id, client.getId());

            int consecutiveFailures = 0;
            while (!token.stop_requested())
            {
                AppendEntriesRequest request;
                {
                    absl::ReaderMutexLock locker(&m_stateMutex);
                    if (m_state != NodeState::LEADER)
                    {
                        spdlog::info("Node={} is no longer a leader. Stopping the heartbeat thread");
                        break;
                    }

                    request.set_term(m_currentTerm);
                    request.set_prevlogterm(getLastLogTerm());
                    request.set_prevlogindex(getLastLogIndex());
                    request.set_leadercommit(m_commitIndex);
                    request.set_senderid(m_id);
                }

                {
                    AppendEntriesResponse response;
                    if (!client.appendEntries(request, &response))
                    {
                        consecutiveFailures++;
                        consecutiveFailures = 0;
                        spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
                                      consecutiveFailures,
                                      maxRetries);
                        if (consecutiveFailures >= maxRetries)
                        {
                            return;
                        }

                        continue;
                    }
Fix the consecutive failures logic.
Within the heartbeat loop, consecutiveFailures is incremented and then immediately reset to zero, so failures never accumulate and the maxRetries limit is never reached.
if (!client.appendEntries(request, &response)) {
consecutiveFailures++;
- consecutiveFailures = 0; // This negates the increment
spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
consecutiveFailures,
maxRetries);
...
}
📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
void ConsensusModule::sendHeartbeat(NodeClient &client)
{
    constexpr const auto heartbeatInterval{std::chrono::milliseconds(10)};
    constexpr const int maxRetries{3};

    m_heartbeatThreads.emplace_back(
        [this, maxRetries, &client, heartbeatInterval](std::stop_token token)
        {
            spdlog::info("Node={} is starting a heartbeat thread for client={}", m_id, client.getId());

            int consecutiveFailures = 0;
            while (!token.stop_requested())
            {
                AppendEntriesRequest request;
                {
                    absl::ReaderMutexLock locker(&m_stateMutex);
                    if (m_state != NodeState::LEADER)
                    {
                        spdlog::info("Node={} is no longer a leader. Stopping the heartbeat thread");
                        break;
                    }

                    request.set_term(m_currentTerm);
                    request.set_prevlogterm(getLastLogTerm());
                    request.set_prevlogindex(getLastLogIndex());
                    request.set_leadercommit(m_commitIndex);
                    request.set_senderid(m_id);
                }

                {
                    AppendEntriesResponse response;
                    if (!client.appendEntries(request, &response))
                    {
                        consecutiveFailures++;
                        spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
                                      consecutiveFailures,
                                      maxRetries);
                        if (consecutiveFailures >= maxRetries)
                        {
                            return;
                        }

                        continue;
                    }
Actionable comments posted: 1
♻️ Duplicate comments (1)
.vscode/launch.json (1)
7-33: 🛠️ Refactor suggestion: Enhance the RaftMain debug configuration for Raft development.
The configuration needs improvements for effectively debugging Raft nodes:
- Add preLaunchTask to ensure RaftMain is compiled before debugging
- Add command line arguments for Raft configuration (node ID, peer list)
- Make the debug port configurable using VS Code input variables
Apply this diff to enhance the configuration:
 {
     "name": "rr - RaftMain",
     "type": "cppdbg",
     "request": "launch",
     "program": "${workspaceFolder}/build/RaftMain",
-    "miDebuggerServerAddress": "localhost:50505",
+    "miDebuggerServerAddress": "localhost:${input:debugPort}",
+    "args": [
+        "--node-id", "${input:nodeId}",
+        "--config", "${workspaceFolder}/assets/raft_config.json"
+    ],
     "stopAtEntry": false,
     "cwd": "${workspaceFolder}",
     "environment": [],
     "externalConsole": true,
+    "preLaunchTask": "build",
     "linux": {
Add these input configurations at the root level:
 "inputs": [
     {
         "id": "debugPort",
         "type": "promptString",
         "description": "Debug port for remote debugging",
         "default": "50505"
     },
     {
         "id": "nodeId",
         "type": "promptString",
         "description": "Raft node ID",
         "default": "1"
     }
 ]
🧹 Nitpick comments (3)
lib/proto/Raft.proto (1)
8-17: Unify naming conventions in enums and fields.
Consider using a consistent naming style (e.g., lastLogTerm) in the request messages, or switching to snake_case to match common protobuf conventions.
examples/raft/main.cpp (1)
69-69: Consider implementing signal-based graceful shutdown.
Currently, the Raft node stops only after the main logic completes. Adding handling of SIGTERM/SIGINT to initiate a controlled shutdown can prevent data inconsistencies or partial writes.
.vscode/launch.json (1)
34-64: Align the Debug - RaftMain configuration with the other debug configurations.
The configuration should be consistent with the other debug configurations in the file:
- Add command line arguments for Raft configuration
- Add launchCompleteCommand for consistent behavior
- Add missing field descriptions
Apply this diff:
 {
     "name": "Debug - RaftMain",
     "type": "cppdbg",
     "request": "attach",
     "program": "${workspaceFolder}/build/RaftMain",
     "processId": "${command:pickProcess}",
     "MIMode": "gdb",
+    "args": [ // Arguments for Raft configuration
+        "--node-id", "1",
+        "--config", "${workspaceFolder}/assets/raft_config.json"
+    ],
     "setupCommands": [
         {
             "description": "Enable pretty-printing for gdb",
             "text": "-enable-pretty-printing",
             "ignoreFailures": true
         }
     ],
     "preLaunchTask": "build",
     "miDebuggerPath": "/usr/bin/gdb",
     "logging": {
         "trace": true,
         "traceResponse": true,
         "engineLogging": false
     },
     "targetArchitecture": "x86_64",
     "pipeTransport": {
         "pipeCwd": "",
         "pipeProgram": "/bin/bash",
         "pipeArgs": [ "-c" ],
         "debuggerPath": "/usr/bin/gdb"
     },
+    "launchCompleteCommand": "exec-run" // Consistent with other configurations
 },
🪛 Biome (1.9.4)
[error] 64-64: Expected a property but instead found '}'.
Expected a property here.
(parse)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
.vscode/launch.json (2 hunks)
examples/raft/CMakeLists.txt (1 hunks)
examples/raft/main.cpp (1 hunks)
examples/raft/raft.cpp (1 hunks)
examples/raft/raft.h (1 hunks)
lib/proto/Raft.proto (1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
.vscode/launch.json
[error] 64-64: Expected a property but instead found '}'.
Expected a property here.
(parse)
🔇 Additional comments (9)
examples/raft/raft.cpp (5)
28-28: Use atomic or mutex to protect gFirstElection.
gFirstElection is a global integer that is written from multiple threads (e.g., in the election timer and upon heartbeat reception). For thread safety, consider making this variable atomic or protecting it with a mutex.
33-34: Security concern: using insecure channel credentials.
This code uses grpc::InsecureChannelCredentials(), which is typically insecure for production. Consider TLS or an equivalent secure channel.
509-513: Fix incorrect majority calculation.
This logic uses votes > m_replicas.size() / 2.0 and omits the node's own vote from the total. Consider using (m_replicas.size() + 1) to account for the node itself.
569-581: Fix the consecutive failures logic.
consecutiveFailures++ is immediately reset to zero on line 571, preventing retries from incrementing. This makes the retry loop unusable as intended.
133-134: Unused or uninitialized server thread.
While m_serverThread is declared and stopped in stop(), it never appears to be started. This might cause confusion or unexpected deadlocks. Verify whether it is needed and ensure proper initialization if so.
Also applies to: 421-423
examples/raft/CMakeLists.txt (1)
4-6: Add test configuration for the Raft implementation.
Comprehensive testing of Raft is crucial. Adding a test binary and registering tests ensures reliability of leader election, log replication, and fault tolerance.
examples/raft/main.cpp (1)
25-25: Environment variable usage may have portability issues.
Using setenv("ASAN_OPTIONS", ...) is helpful for debugging, but consider limiting this to debug builds to avoid altering production behavior.
examples/raft/raft.h (2)
74-75: Unused method initializeVolatileState.
Although declared, initializeVolatileState() has no definition or usage in the codebase. If it is part of your design, ensure it is implemented; otherwise, remove it to avoid confusion.
82-83: Confirm concurrency invariants for sendAppendEntriesRPC.
sendAppendEntriesRPC is not marked with the same lock requirements as sendHeartbeat. Verify that you consistently acquire the necessary locks in both methods so that log mutations are always thread-safe.
    ],
    "debuggerPath": "/usr/bin/gdb"
},
},
Fix JSON syntax error
Remove the trailing comma after the pipeTransport object.
Apply this diff:
],
"debuggerPath": "/usr/bin/gdb"
- },
+ }
},
📝 Committable suggestion
‼️ IMPORTANT: Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
    ],
    "debuggerPath": "/usr/bin/gdb"
}
},
🧰 Tools
🪛 Biome (1.9.4)
[error] 64-64: Expected a property but instead found '}'.
Expected a property here.
(parse)
Actionable comments posted: 7
♻️ Duplicate comments (3)
examples/raft/raft.cpp (3)
141-224: ⚠️ Potential issue: Fix log replication safety issues.
The AppendEntries implementation has several critical issues:
- Missing validation of request fields
- Potential off-by-one errors in log indexing
- Inconsistent state updates
+ // Validate request fields
+ if (!pRequest) {
+     return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT, "Null request");
+ }
  // 3. Append new entries and remove conflicting ones
  auto newEntryStart = pRequest->prevlogindex() + 1;
+ // Check for integer overflow
+ if (newEntryStart < pRequest->prevlogindex()) {
+     return grpc::Status(grpc::StatusCode::INVALID_ARGUMENT, "Index overflow");
+ }
- m_log.insert(m_log.end(), pRequest->entries().begin(), pRequest->entries().end());
+ // Validate and append entries
+ for (const auto& entry : pRequest->entries()) {
+     if (entry.term() < m_currentTerm) {
+         return grpc::Status(grpc::StatusCode::FAILED_PRECONDITION, "Entry from previous term");
+     }
+     m_log.push_back(entry);
+ }
481-584: ⚠️ Potential issue: Fix election safety issues.
The election implementation has critical safety issues:
- Race condition in vote counting
- Incorrect majority calculation (as noted in past reviews)
- Missing timeout handling for concurrent elections
- constexpr const double HALF_OF_THE_REPLICAS = 2.0;
- return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+ // Total nodes = replicas + self
+ const size_t totalNodes = m_replicas.size() + 1;
+ return votes > totalNodes / 2;
 void startElection()
 {
+    const auto electionTimeout = std::chrono::seconds(generateRandomTimeout());
+    const auto deadline = std::chrono::steady_clock::now() + electionTimeout;
     RequestVoteRequest request;
     {
         absl::WriterMutexLock locker(&m_stateMutex);
+        if (std::chrono::steady_clock::now() >= deadline) {
+            spdlog::warn("Election timed out before starting");
+            return;
+        }
586-660: ⚠️ Potential issue: Fix heartbeat reliability issues.
The heartbeat implementation has several critical issues:
- Incorrect retry logic (consecutiveFailures reset too early)
- Potential thread leaks in error paths
- Missing timeout handling
 if (!client.appendEntries(request, &response))
 {
     consecutiveFailures++;
-    consecutiveFailures = 0; // This negates the increment
     spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}", consecutiveFailures, maxRetries);
     if (consecutiveFailures >= maxRetries)
     {
+        absl::WriterMutexLock locker(&m_stateMutex);
+        becomeFollower(m_currentTerm); // Step down if we can't reach a majority
         return;
     }
     continue;
 }
🧹 Nitpick comments (2)
examples/raft/raft.h (1)
1-24: Add include guards and improve documentation.
While #pragma once works for most modern compilers, consider adding traditional include guards for maximum compatibility. Also add documentation for the type aliases and the threading model.
 #pragma once
+#ifndef RAFT_H_
+#define RAFT_H_
+// Type aliases for improved readability and maintainability
 using IP = std::string;  // Represents an IP address
 using ID = uint32_t;     // Unique identifier for nodes
 using Clock = std::chrono::high_resolution_clock;
 using TimePoint = std::chrono::high_resolution_clock::time_point;
 // Valid IDs start from 1
 constexpr const ID invalidId = 0;
+#endif // RAFT_H_
examples/raft/raft.cpp (1)
1-796: Improve logging for better observability.
The logging implementation needs improvements:
- Add structured logging with consistent fields
- Use appropriate log levels
- Add logging for all error paths
+ // Add logging configuration
+ spdlog::set_pattern("[%Y-%m-%d %H:%M:%S.%e] [%n] [%^%l%$] [%s:%#] %v");
- spdlog::debug("Node={} starts election. New term={}", m_id, m_currentTerm);
+ spdlog::debug("Starting election node_id={} current_term={} state={}", m_id, m_currentTerm, static_cast<int>(m_state));
- spdlog::error("AppendEntries RPC call failed");
+ spdlog::error("RPC failed operation={} node_id={} peer_id={} error_code={} error_message={}", "AppendEntries", m_id, client.getId(), static_cast<int>(status.error_code()), status.error_message());
Actionable comments posted: 2
♻️ Duplicate comments (7)
examples/raft/raft.cpp (6)
56-67: ⚠️ Potential issue: Correct the error messages in exceptions.
In the NodeClient constructor, the error messages incorrectly use m_id twice when formatting the error string. The second m_id should be m_ip to display the correct IP address.
Apply this diff to fix the error messages:
-throw std::runtime_error(fmt::format("Failed to establish a gRPC channel for node={} ip={}", m_id, m_id));
+throw std::runtime_error(fmt::format("Failed to establish a gRPC channel for node={} ip={}", m_id, m_ip));
Similarly, update the other error messages:
-throw std::runtime_error(fmt::format("Failed to create a stub for node={} ip={}", m_id, m_id));
+throw std::runtime_error(fmt::format("Failed to create a stub for node={} ip={}", m_id, m_ip));
-throw std::runtime_error(fmt::format("Failed to create a KV stub for node={} ip={}", m_id, m_id));
+throw std::runtime_error(fmt::format("Failed to create a KV stub for node={} ip={}", m_id, m_ip));
137-138: ⚠️ Potential issue: Avoid unsafe dynamic_cast and check for nullptr.
In the ConsensusModule constructor, dynamic_cast is used without checking for a nullptr result, which can lead to undefined behavior if the cast fails.
Consider checking that the casts succeed before using the results:
+auto* raftService = dynamic_cast<RaftService::Service*>(this);
+auto* kvService = dynamic_cast<TinyKVPPService::Service*>(this);
+if (!raftService || !kvService) {
+    throw std::runtime_error("Invalid service cast in ConsensusModule constructor");
+}
 builder.AddListeningPort(m_ip, grpc::InsecureServerCredentials());
-builder.RegisterService(dynamic_cast<RaftService::Service *>(this));
-builder.RegisterService(dynamic_cast<TinyKVPPService::Service *>(this));
+builder.RegisterService(raftService);
+builder.RegisterService(kvService);
647-649: ⚠️ Potential issue: Fix the logic error with consecutiveFailures.
Inside the heartbeat loop in sendHeartbeat, consecutiveFailures is incremented and then immediately reset to zero. This negates the increment and prevents the retry logic from working correctly.
Apply this diff to correct the logic:
 if (!client.appendEntries(request, &response))
 {
     consecutiveFailures++;
-    consecutiveFailures = 0;
     spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}", consecutiveFailures, maxRetries);
     if (consecutiveFailures >= maxRetries)
     {
         return;
     }
     continue;
 }
471-491: ⚠️ Potential issue: Ensure proper shutdown and cleanup in the stop() method.
In the ConsensusModule::stop() method, there are potential issues with thread management and server shutdown:
- The m_serverThread is requested to stop and joined, but it is not clear where this thread was started.
- m_raftServer->Shutdown() should be called before stopping the election and heartbeat threads to prevent new RPCs from being accepted during shutdown.
Consider reordering the shutdown sequence and ensuring all threads are properly managed:
 void ConsensusModule::stop()
 {
     absl::WriterMutexLock locker{&m_stateMutex};
+    if (m_raftServer)
+    {
+        m_raftServer->Shutdown();
+    }
     m_stopElection = true;
     m_electionThread.request_stop();
     m_electionThread.join();
     for (auto &heartbeatThread : m_heartbeatThreads)
     {
         heartbeatThread.request_stop();
         heartbeatThread.join();
     }
     m_heartbeatThreads.clear();
-    if (m_raftServer)
-    {
-        m_raftServer->Shutdown();
-    }
-    m_serverThread.request_stop();
-    m_serverThread.join();
 }
128-131: ⚠️ Potential issue: Initialize m_matchIndex and m_nextIndex for all replicas.
The m_matchIndex and m_nextIndex maps are not initialized in the constructor, which can lead to undefined behavior when accessing these values later.
Add initialization for these maps in the constructor:
 for (auto [id, ip] : std::ranges::views::enumerate(replicas))
 {
     if (id + 1 == m_id)
     {
         continue;
     }
     m_replicas.emplace(id + 1, NodeClient(id + 1, ip));
+    // Initialize matchIndex and nextIndex for each replica
+    m_matchIndex[id + 1] = 0;
+    m_nextIndex[id + 1] = getLastLogIndex() + 1;
 }
169-240: ⚠️ Potential issue: Ensure compliance with the Raft protocol in AppendEntries.
The AppendEntries method has some incomplete implementation aspects regarding the Raft protocol, such as:
- Not updating m_currentTerm correctly.
- Missing reset of m_votedFor.
- Incomplete log replication logic.
Review the method to ensure it fully complies with the Raft specification, especially Sections 5.3 and 5.4.
examples/raft/raft.h (1)
117-121: ⚠️ Potential issue: Add thread safety annotations for member variables.
Member variables that are accessed concurrently should have appropriate thread safety annotations to prevent data races and ensure correctness.
Add ABSL_GUARDED_BY annotations:
 uint32_t m_currentTerm ABSL_GUARDED_BY(m_stateMutex);
 uint32_t m_votedFor ABSL_GUARDED_BY(m_stateMutex);
 NodeState m_state ABSL_GUARDED_BY(m_stateMutex);
 std::vector<LogEntry> m_log ABSL_GUARDED_BY(m_stateMutex);
-/*db::wal::wal_t m_wal ABSL_GUARDED_BY(m_stateMutex);*/
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
examples/raft/raft.cpp (1 hunks)
examples/raft/raft.h (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-gcc
- GitHub Check: build-clang
🔇 Additional comments (4)
examples/raft/raft.cpp (1)
379-389: Handle failure of persistent state initialization in init().
In the init() method, if initializePersistentState() fails, the node currently logs a warning but still returns false. There might be a need for more robust error handling or recovery strategies.
Please confirm whether it is acceptable for the node to proceed if persistent state initialization fails. If not, consider terminating the node or implementing a retry mechanism.
examples/raft/raft.h (3)
30-37: Add a virtual destructor to NodeClient.
The NodeClient class contains virtual methods due to the use of gRPC stubs. It is recommended to have a virtual destructor to ensure proper cleanup of derived classes.
Add a virtual destructor:
 class NodeClient
 {
 public:
     NodeClient(ID nodeId, IP nodeIp);
+    virtual ~NodeClient() = default;
     auto appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool;
     auto requestVote(const RequestVoteRequest &request, RequestVoteResponse *response) -> bool;
48-75: Improve documentation and thread safety annotations in ConsensusModule.
The ConsensusModule class lacks sufficient documentation and thread safety annotations, which are important for maintainability and correctness.
Consider adding method documentation and ensuring that thread safety annotations are consistently applied:
 class ConsensusModule : public RaftService::Service, public TinyKVPPService::Service, std::enable_shared_from_this<ConsensusModule>
 {
 public:
     // Constructs a ConsensusModule with the given node ID and list of replica IPs.
     ConsensusModule(ID nodeId, std::vector<IP> replicas);
     // Raft RPCs
     auto AppendEntries(grpc::ServerContext *pContext, const AppendEntriesRequest *pRequest, AppendEntriesResponse *pResponse) -> grpc::Status override;
     // Other methods...
+    // Initializes the persistent state. Must be called before start().
     auto init() -> bool;
+    // Starts the consensus module's main loop and begins participating in the cluster.
     void start();
+    // Stops the consensus module and cleans up resources.
     void stop();
134-138: Use std::atomic_flag for better synchronization.
For atomic boolean flags, std::atomic_flag provides better guarantees and semantics compared to std::atomic<bool>.
Modify the declarations:
-std::atomic<bool> m_leaderHeartbeatReceived{false};
-std::atomic<bool> m_stopElection{false};
+std::atomic_flag m_leaderHeartbeatReceived = ATOMIC_FLAG_INIT;
+std::atomic_flag m_stopElection = ATOMIC_FLAG_INIT;
Actionable comments posted: 3
♻️ Duplicate comments (3)
examples/raft/raft.cpp (3)
666-676
:⚠️ Potential issueFix the
consecutiveFailures
logic insendHeartbeat
.Currently,
consecutiveFailures
is reset to zero immediately after being incremented, which prevents the failure count from accumulating correctly. This negates the retry mechanism and may cause premature termination of the heartbeat thread.Apply this diff to correct the logic:
consecutiveFailures++; spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}", consecutiveFailures, maxRetries); if (consecutiveFailures >= maxRetries) { return; } - consecutiveFailures = 0; continue;
`710-784`: 🛠️ Refactor suggestion — Avoid using detached threads in `sendAppendEntriesRPC`.

Using detached threads can lead to resource management issues and potential undefined behavior. It's safer to manage thread lifetimes explicitly.

Apply this diff to manage threads properly:

```diff
-    std::thread(
+    m_replicationThreads.emplace_back(
         [this](NodeClient &client, std::vector<LogEntry> logEntries)
         {
             // ... existing code ...
         },
         std::ref(client),
         logEntries)
-        .detach();
+    );
```

Ensure that `m_replicationThreads` is stored (e.g., in a `std::vector<std::jthread>`) and properly joined or stopped during shutdown to prevent resource leaks.
`590-603`: ⚠️ Potential issue — Ensure thread safety in `becomeFollower`.

The method `becomeFollower` modifies shared state but may not have the necessary mutex protection in all calling contexts.

Ensure that `becomeFollower` is always called with `m_stateMutex` held, or add mutex locking within the method:

```diff
+    absl::MutexLock locker(&m_stateMutex);
     m_currentTerm = newTerm;
     m_state = NodeState::FOLLOWER;
     updatePersistentState(std::nullopt, 0);
     // ... existing code ...
```

Alternatively, if the method is intended to be called with the mutex already locked, add a thread annotation to enforce this:

```cpp
void ConsensusModule::becomeFollower(uint32_t newTerm) ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
```
🧹 Nitpick comments (3)
conan/profiles/release-gcc (1)
10-11
: Review profile consistency across compilers

The GCC profile retains the
compiler.libcxx
setting while it's removed from Clang profiles. Consider:
- Standardizing the approach across all profiles
- Documenting the reason for different handling if intended
examples/raft/raft.cpp (1)
642-646
: Prevent tight loop on failed heartbeats.When the heartbeat fails, the loop immediately continues without a delay, potentially causing high CPU usage due to rapid retries.
Consider adding a short sleep before retrying to prevent a tight loop:
```diff
     continue;
+    // Add a short delay before the next attempt
+    std::this_thread::sleep_for(std::chrono::milliseconds(50));
 }
```

examples/raft/raft.h (1)
`41-41`: Add `noexcept` to destructor for `NodeClient`.

Consistently marking the destructor as `noexcept` is good practice and conveys exception safety guarantees.

Ensure the destructor is declared as:

```cpp
virtual ~NodeClient() noexcept = default;
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
CMakeLists.txt
(2 hunks)conan/profiles/debug-clang
(1 hunks)conan/profiles/debug-clang-tsan
(0 hunks)conan/profiles/debug-gcc
(1 hunks)conan/profiles/release-clang
(1 hunks)conan/profiles/release-gcc
(1 hunks)examples/raft/main.cpp
(1 hunks)examples/raft/raft.cpp
(1 hunks)examples/raft/raft.h
(1 hunks)
💤 Files with no reviewable changes (1)
- conan/profiles/debug-clang-tsan
✅ Files skipped from review due to trivial changes (1)
- conan/profiles/debug-gcc
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/raft/main.cpp
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-gcc
- GitHub Check: build-clang
🔇 Additional comments (6)
conan/profiles/debug-clang (1)
10-10
: Verify Clang 19 availability in the build environmentThe upgrade to Clang 19 is significant. Please ensure:
- Clang 19 is available in the build environment
- All team members have access to this version
- CI/CD pipelines are updated accordingly
Run this script to check Clang availability:
✅ Verification successful
Clang 19 upgrade is well-supported by the project infrastructure
The compiler version upgrade is consistent with the project's requirements:
- Both debug and release Clang profiles are aligned to version 19
- Project requires C++23 features, which Clang 19 fully supports
- Build environment is controlled via Docker images
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Check if Clang 19 is available and its features
clang_path=$(which clang)
if [ -x "$clang_path" ]; then
    $clang_path --version
    # Check C++23 support
    echo '#include <concepts>' | $clang_path -x c++ -std=c++23 -
fi
```
Script:
```bash
#!/bin/bash
# Search for compiler configurations and CI files
echo "=== Compiler Profiles ==="
fd ".*" conan/profiles --type f --exec cat {}
echo -e "\n=== CI Configurations ==="
fd "^\.gitlab-ci\.yml$|^\.github/workflows/.*\.yml$" --type f --exec cat {}
echo -e "\n=== Build System Files ==="
fd "CMakeLists.txt|conanfile\.txt|conanfile\.py" --type f --exec cat {}
```
conan/profiles/release-clang (1)
10-10
: LGTM! Changes are consistent with debug profileThe changes maintain consistency with the debug-clang profile, which is good practice.
conan/profiles/release-gcc (1)
9-9
: Consider implications of removing GNU extensionsChanging from
gnu23
to23
removes GNU extensions. Please verify that:
- The Raft implementation doesn't rely on any GNU-specific features
- All existing code compiles without GNU extensions
Run this script to check for GNU extension usage:
✅ Verification successful
GNU extensions removal is safe ✅
The codebase doesn't rely on GNU-specific features. The only GNU attribute usage is in thread safety annotations that are already properly guarded for Clang-only usage, and in isolated third-party GSL utility files. The core Raft implementation will compile correctly with standard C++23.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Search for common GNU extension usage
rg -l "(__GNUC__|__extension__|__attribute__|__asm__|typeof|__typeof__|__builtin)"
# Check if any files use GNU-specific headers
rg -l "#include.*<(gnu/|bits/)"
```
Script:
```bash
#!/bin/bash
# Check thread_safety.h content
echo "=== thread_safety.h content ==="
cat examples/raft/thread_safety.h
# Look for GNU extensions specifically in Raft implementation
echo -e "\n=== GNU extensions in Raft code ==="
fd -e cpp -e h -e hpp . examples/raft --exec rg -l "(__GNUC__|__extension__|__attribute__|__asm__|typeof|__typeof__|__builtin)" {}
```
examples/raft/raft.cpp (2)
142-147
: Usestatic_cast
instead ofdynamic_cast
for base class casting.Since
ConsensusModule
publicly inherits fromRaftService::Service
,static_cast
is sufficient and more efficient thandynamic_cast
. This change also eliminates the need for runtimenullptr
checks.Apply this diff to replace
dynamic_cast
withstatic_cast
```diff
-auto *raftService = dynamic_cast<RaftService::Service *>(this);
+auto *raftService = static_cast<RaftService::Service *>(this);
 if (raftService == nullptr)
 {
     throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to RaftService"));
 }
```

Similarly, update the casting for
tkvppService
:

```diff
-auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this);
+auto *tkvppService = static_cast<TinyKVPPService::Service *>(this);
```
448-458
: Verify election timer reset logic.The logic within
AwaitWithTimeout
may not properly reset the election timer, potentially causing unintended election timeouts.Please verify that the condition in
AwaitWithTimeout
correctly captures both the receipt of a heartbeat and the passage of the timeout duration:bool heartbeatReceived = m_timerMutex.AwaitWithTimeout( absl::Condition( +[this, &timeToWaitDeadlineMs]() { return m_leaderHeartbeatReceived.load() || currentTimeMs() >= timeToWaitDeadlineMs; } ), absl::Milliseconds(timeToWaitMs) );Ensure that
currentTimeMs()
is appropriately calculated within the lambda to prevent stale time comparisons.

CMakeLists.txt (1)
16-22
: Conditionally include-Wthread-safety
for Clang compiler.The
-Wthread-safety
flag is specific to Clang and can cause errors with other compilers like GCC.The added conditional correctly handles compiler-specific flags, ensuring compatibility across different compilers.
examples/raft/raft.h
std::atomic<bool> m_leaderHeartbeatReceived{false};
std::atomic<bool> m_stopElection{false};
💡 Codebase verification
Atomic variables need explicit memory ordering specifications
The verification reveals that while atomic methods are used, memory ordering is not specified. This could lead to synchronization issues:
m_leaderHeartbeatReceived
: All load/store operations should specify memory orderingm_stopElection
: Direct boolean access should be replaced with load/store operations with proper memory ordering
Recommended changes:
- For
m_leaderHeartbeatReceived.store()
, usememory_order_release
- For
m_leaderHeartbeatReceived.load()
, usememory_order_acquire
- Replace direct access to
m_stopElection
with proper atomic operations - For
m_stopElection
, usememory_order_seq_cst
as it's used for thread termination
🔗 Analysis chain
Ensure atomic variables are used safely with appropriate memory order.
The atomic variables m_leaderHeartbeatReceived
and m_stopElection
may require explicit memory order semantics to prevent data races.
Please verify that these atomic variables are accessed with consistent memory order, or consider protecting them with mutexes if they represent complex state changes.
For instance, use:
std::atomic<bool> m_leaderHeartbeatReceived{false};
And when accessing:
if (m_leaderHeartbeatReceived.load(std::memory_order_acquire)) {
// ...
}
🏁 Scripts executed
The following scripts were executed for the analysis:
Error: No shell scripts found
Script:
#!/bin/bash
# Search for usages of m_leaderHeartbeatReceived
ast-grep --pattern 'm_leaderHeartbeatReceived'
# Search for usages of m_stopElection
ast-grep --pattern 'm_stopElection'
# Backup search using ripgrep for more context
rg "m_leaderHeartbeatReceived|m_stopElection" -A 2 -B 2
Length of output: 3647
Actionable comments posted: 4
♻️ Duplicate comments (5)
examples/raft/raft.cpp (2)
`605-609`: ⚠️ Potential issue — Correct majority calculation in `hasMajority` function.

The current majority calculation may lead to incorrect results, especially in clusters with an even number of nodes. The calculation should consider the total number of nodes, including the current node.

Apply this diff to fix the calculation:

```diff
 auto ConsensusModule::hasMajority(uint32_t votes) const -> bool
 {
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    const size_t totalNodes = m_replicas.size() + 1; // Including self
+    return votes > totalNodes / 2;
 }
```
`666-678`: ⚠️ Potential issue — Fix `consecutiveFailures` reset logic in heartbeat.

The `consecutiveFailures` counter is reset to zero immediately after incrementing, which prevents the failure count from accumulating and may cause premature termination of the heartbeat loop.

Apply this diff to correct the failure counter logic:

```diff
 if (!client.appendEntries(request, &response))
 {
     consecutiveFailures++;
     spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}", consecutiveFailures, maxRetries);
     if (consecutiveFailures >= maxRetries)
     {
         return;
     }
-    consecutiveFailures = 0;
     continue;
 }
+else
+{
+    consecutiveFailures = 0; // Reset on successful response
+}
```
`67-67`: ⚠️ Potential issue — Inherit `enable_shared_from_this` publicly.

`ConsensusModule` should inherit from `std::enable_shared_from_this` publicly to use `shared_from_this()` safely. Private inheritance may lead to undefined behavior when calling `shared_from_this()`.

Apply this diff to change the inheritance:

```diff
 class ConsensusModule : public RaftService::Service,
                         public TinyKVPPService::Service,
-                        std::enable_shared_from_this<ConsensusModule>
+                        public std::enable_shared_from_this<ConsensusModule>
 {
     // Class definition
 };
```
`151-152`: 🛠️ Refactor suggestion — Specify memory order for atomic variables.

The `std::atomic<bool>` variables `m_leaderHeartbeatReceived` and `m_stopElection` are accessed without specifying memory order, which may lead to synchronization issues.

Apply this diff to specify memory ordering:

```diff
 std::atomic<bool> m_leaderHeartbeatReceived{false};
 std::atomic<bool> m_stopElection{false};

 // When storing:
-m_leaderHeartbeatReceived.store(true);
+m_leaderHeartbeatReceived.store(true, std::memory_order_release);

 // When loading:
-if (m_leaderHeartbeatReceived.load())
+if (m_leaderHeartbeatReceived.load(std::memory_order_acquire))
```

Repeat similar changes for
m_stopElection
. Alternatively, protect accesses with mutexes if appropriate.
145-148
:⚠️ Potential issueInitialize
m_matchIndex
andm_nextIndex
properlyEnsure that
m_matchIndex
andm_nextIndex
are initialized for all replica nodes in the constructor to prevent uninitialized access.Confirm that in the constructor, after initializing
m_replicas
, you properly initializem_matchIndex
andm_nextIndex
. If this is already done, ensure that the logic correctly covers all replica nodes.
🧹 Nitpick comments (1)
examples/raft/raft.cpp (1)
`925-943`: Handle malformed log entries during restore.

In `restorePersistentState()`, if a log line is malformed, the code may crash or behave unexpectedly. Add checks to ensure that each line is correctly parsed.

Apply this diff to add error handling:

```diff
 while (std::getline(ifs, logLine))
 {
     std::stringstream sst(logLine);
     std::string   key;
     std::string   value;
     std::uint32_t term = 0;
     sst >> key >> value >> term;
+    if (sst.fail())
+    {
+        spdlog::error("Malformed log entry in {}: {}", path.c_str(), logLine);
+        continue; // Or handle the error as needed
+    }
     LogEntry logEntry;
     logEntry.set_key(key);
     logEntry.set_value(value);
     logEntry.set_term(term);
     m_log.emplace_back(logEntry);
     spdlog::info("Node={} restored logEntry=[key={}, value={}, term={}]", m_id, key, value, term);
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
examples/raft/raft.cpp
(1 hunks)examples/raft/raft.h
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-gcc
- GitHub Check: build-clang
🔇 Additional comments (4)
examples/raft/raft.cpp (2)
44-46
: PassnodeIp
by const reference inNodeClient
constructorPassing
nodeIp
by const reference avoids unnecessary copying of thestd::string
, improving performance.Apply this diff to modify the constructor:
-NodeClient::NodeClient(ID nodeId, IP nodeIp) +NodeClient::NodeClient(ID nodeId, const IP& nodeIp) : m_id{nodeId}, - m_ip{std::move(nodeIp)}, + m_ip{nodeIp}, m_channel(grpc::CreateChannel(m_ip, grpc::InsecureChannelCredentials())), m_stub(RaftService::NewStub(m_channel)), m_kvStub(TinyKVPPService::NewStub(m_channel)) { assert(m_id > 0); assert(!m_ip.empty()); // ... }Remember to update the declaration in
raft.h
as well.
658-659
: Correct previous log term and index in heartbeatIn the
sendHeartbeat
method,request.set_prevlogterm(getLastLogTerm());
andrequest.set_prevlogindex(getLastLogIndex());
may not correctly represent the leader's state when there are no new entries. Ensure that the correct previous log term and index are sent.Please verify that
getLastLogTerm()
andgetLastLogIndex()
return the appropriate values in the context of heartbeats. If they refer to the latest log entries, consider adjusting them to represent the previous ones when necessary.Run the following script to check the implementations of
getLastLogTerm()
andgetLastLogIndex()
:Ensure that these functions return zero when the log is empty and the correct term/index otherwise.
examples/raft/raft.h (2)
32-48
: Add virtual destructor and documentation toNodeClient
To ensure proper resource cleanup and facilitate inheritance, add a virtual destructor to
NodeClient
. Additionally, provide method documentation for clarity.Apply this diff:
class NodeClient { public: + /** + * Virtual destructor for safe polymorphic destruction. + */ + virtual ~NodeClient() noexcept = default; // Existing constructor and methods... /** * Sends AppendEntries RPC to the target node. */ auto appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool; // Other methods... };
65-93
: Improve documentation and inheritance inConsensusModule
Add method documentation and ensure consistent use of override specifiers. Publicly inherit from
enable_shared_from_this
to avoid issues withshared_from_this()
.Apply this diff:
class ConsensusModule : public RaftService::Service, public TinyKVPPService::Service, - std::enable_shared_from_this<ConsensusModule> + public std::enable_shared_from_this<ConsensusModule> { public: /** * Constructs a consensus module for a node in the Raft cluster. * @param nodeId Unique identifier for this node * @param replicas Vector of IP addresses for all nodes in cluster * @throws std::runtime_error if server creation fails */ ConsensusModule(ID nodeId, std::vector<IP> replicas); + virtual ~ConsensusModule() noexcept override = default; /** * Handles AppendEntries RPC calls from other nodes. */ auto AppendEntries(grpc::ServerContext *pContext, const AppendEntriesRequest *pRequest, AppendEntriesResponse *pResponse) -> grpc::Status override; // Other methods with documentation... };
Actionable comments posted: 0
♻️ Duplicate comments (3)
examples/raft/raft.cpp (3)
`616-620`: ⚠️ Potential issue — Fix incorrect majority calculation in `hasMajority`.

The majority calculation is incorrect and could lead to split votes or incorrect behavior. The function should consider the total number of nodes (replicas plus self) when determining if a majority has been achieved.

Apply this diff to correct the calculation:

```diff
 auto ConsensusModule::hasMajority(uint32_t votes) const -> bool
 {
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    // Total nodes = replicas + self
+    const size_t totalNodes = m_replicas.size() + 1;
+    return votes >= (totalNodes / 2) + 1;
 }
```

This ensures that a majority requires more than half of the total nodes, which aligns with the Raft consensus requirements.
678-687
:⚠️ Potential issueCorrect the
consecutiveFailures
logic in heartbeat.The
consecutiveFailures
counter is reset to zero immediately after being incremented, which prevents retries from accumulating correctly. This negates the purpose of tracking consecutive failures and may cause the heartbeat thread to continue indefinitely without proper handling of failures.Apply this diff to fix the logic:
```diff
 if (!client.appendEntries(request, &response))
 {
     consecutiveFailures++;
     spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}", consecutiveFailures, maxRetries);
     if (consecutiveFailures >= maxRetries)
     {
         return;
     }
-    consecutiveFailures = 0;
     continue;
 }
+// Reset consecutiveFailures to zero only when the AppendEntries RPC call succeeds
+else
+{
+    consecutiveFailures = 0;
+}
```
consecutiveFailures
is only reset upon a successful RPC call, allowing the retry mechanism to function correctly.
721-797
: 🛠️ Refactor suggestionAvoid detaching threads in
sendAppendEntriesRPC
.Detaching threads can lead to undefined behavior and resource leaks. Use
std::jthread
or store the thread handles to manage thread lifetimes properly instead of detaching threads. Ensure proper thread management and cleanup.Apply this diff to properly manage the threads:
```diff
 void ConsensusModule::sendAppendEntriesRPC(NodeClient &client, std::vector<LogEntry> logEntries)
 {
-    std::thread(
+    m_replicationThreads.emplace_back(
         [this](NodeClient &client, std::vector<LogEntry> logEntries)
         {
             // ... existing code ...
         },
-        std::ref(client),
-        logEntries)
-        .detach();
+        std::ref(client),
+        std::move(logEntries));
 }
+
+// Declare m_replicationThreads in your class to store these threads
+std::vector<std::jthread> m_replicationThreads;
```

Ensure you join these threads appropriately in the
stop()
method to prevent resource leaks:

```diff
 void ConsensusModule::stop()
 {
     // ... existing code ...
+    for (auto &replicationThread : m_replicationThreads)
+    {
+        replicationThread.request_stop();
+        replicationThread.join();
+    }
+    m_replicationThreads.clear();
 }
```
🧹 Nitpick comments (3)
examples/raft/raft.cpp (3)
72-75
: Use consistent and fixed RPC timeout values instead of random timeouts.Using random timeouts for RPC calls in
appendEntries
,requestVote
, andput
methods can lead to unpredictable behavior and complicate debugging. It's generally better to use a consistent and reasonable timeout value for RPC calls to ensure reliable communication and easier troubleshooting.Apply this diff to use a fixed timeout:
```diff
 // In NodeClient::appendEntries (lines 72-75)
-const auto RPC_TIMEOUT = std::chrono::seconds(generateRandomTimeout());
+constexpr auto RPC_TIMEOUT = std::chrono::seconds(5); // Use a fixed 5-second timeout

 // In NodeClient::requestVote (lines 88-91)
-const auto RPC_TIMEOUT = std::chrono::seconds(generateRandomTimeout());
+constexpr auto RPC_TIMEOUT = std::chrono::seconds(5); // Use a fixed 5-second timeout

 // In NodeClient::put (lines 106-108)
-context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(generateRandomTimeout()));
+context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(5)); // Use a fixed 5-second timeout
```

Also applies to: 88-91, 106-108
143-155
: Usestatic_cast
instead ofdynamic_cast
for service registration.Since
ConsensusModule
inherits fromRaftService::Service
andTinyKVPPService::Service
, you can safely usestatic_cast
instead ofdynamic_cast
. This avoids the overhead of run-time type checking and does not require RTTI support, improving performance and compatibility.Apply this diff:
// Replace dynamic_cast with static_cast for RaftService - auto *raftService = dynamic_cast<RaftService::Service *>(this); + auto *raftService = static_cast<RaftService::Service *>(this); // Replace dynamic_cast with static_cast for TinyKVPPService - auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this); + auto *tkvppService = static_cast<TinyKVPPService::Service *>(this);
508-512
: Remove commented and unused code related tom_serverThread
.The code involving
m_serverThread
is commented out, and there is no evidence ofm_serverThread
being initialized or used elsewhere. Keeping unused or commented code can lead to confusion and maintainability issues. Consider removing this code if it's not needed.Apply this diff:
- /*if (m_serverThread.joinable())*/ { m_serverThread.request_stop(); m_serverThread.join(); }
If
m_serverThread
is intended to be used for managing the server thread, ensure it is properly initialized and managed throughout the code.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
examples/raft/main.cpp
(1 hunks)examples/raft/raft.cpp
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/raft/main.cpp
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-gcc
- GitHub Check: build-clang
Actionable comments posted: 7
🧹 Nitpick comments (6)
examples/raft/raft.cpp (6)
73-85
: Use consistent and reasonable RPC timeouts.The RPC timeout is being set using
generateRandomTimeout()
, which may lead to inconsistent behavior. It's preferable to have a consistent and reasonable timeout value for RPC calls.Apply this diff to set a fixed RPC timeout:
- const auto RPC_TIMEOUT = std::chrono::seconds(generateRandomTimeout()); + constexpr auto RPC_TIMEOUT = std::chrono::seconds(5);Also, consider defining
RPC_TIMEOUT
as a class constant if it's used in multiple methods.
87-103
: Improve error handling inrequestVote
method.The method logs an error when
status.ok()
is false but does not provide sufficient context. Also, consider handling specific gRPC error codes.Apply this diff to enhance error logging:
if (!status.ok()) { - spdlog::error("RequestVote RPC call failed. Error code={} and message={}", - static_cast<int>(status.error_code()), - status.error_message()); + spdlog::error("RequestVote RPC to node={} failed. Error code={} ({}) - {}", + m_id, + static_cast<int>(status.error_code()), + status.error_code(), + status.error_message()); return false; }
186-254
: Correct typo in log message and handle missing leader heartbeat.There is a typo in the log message at line 186 (
Recevied
should beReceived
), and the code should ensure that missing heartbeats are handled appropriately.Apply this diff to fix the typo:
```diff
-spdlog::debug("Recevied AppendEntries RPC from leader={} during term={}", pRequest->senderid(), pRequest->term());
+spdlog::debug("Received AppendEntries RPC from leader={} during term={}", pRequest->senderid(), pRequest->term());
```

Also, consider adding logic to handle cases where heartbeats are not received in time, potentially triggering a new election.
614-616
: Clarify log message inbecomeFollower
method.The log message at line 614 could be more informative by including the node ID.
Apply this diff to enhance the log message:
-spdlog::debug("Server reverted to follower state in term={}", m_currentTerm); +spdlog::debug("Node={} reverted to follower state in term={}", m_id, m_currentTerm);
511-513
: Remove unreachable code or fix the condition.The commented-out check for
m_serverThread.joinable()
and the subsequent unconditional block may lead to confusion.Either uncomment the check or remove the unused code.
-/*if (m_serverThread.joinable())*/ +if (m_serverThread.joinable()) { m_serverThread.request_stop(); m_serverThread.join(); }
748-752
: Include error status in log messages for better debugging.In
sendAppendEntriesRPC
, when logging an error, including the status message can aid in troubleshooting.Modify the log statement to include
status.error_message()
.spdlog::error("Node={} failed to send AppendEntries RPC to peer={} at term={}. Error: {}", m_id, client.getId(), m_currentTerm, + status.error_message());
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/raft/raft.cpp
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-clang
- GitHub Check: build-gcc
🔇 Additional comments (5)
examples/raft/raft.cpp (5)
617-621
:⚠️ Potential issueCorrect majority calculation in
hasMajority
method.The majority calculation may not be accurate, leading to potential split votes.
Apply this diff to fix the majority calculation:
```diff
 auto ConsensusModule::hasMajority(uint32_t votes) const -> bool
 {
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    // Total nodes = replicas + self
+    const size_t totalNodes = m_replicas.size() + 1;
+    return votes > totalNodes / 2;
 }
```

Likely invalid or redundant comment.
825-841
:⚠️ Potential issueFix potential infinite wait in
waitForMajorityReplication
.The condition in
waitForMajorityReplication
might never become true, causing an infinite wait.Apply this diff to ensure timely exit:
```diff
 constexpr const auto replicationTimeout{5};
 auto hasMajority = [this, logIndex]() ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex)
 {
     uint32_t count = 1;
     for (const auto &[peer, matchIdx] : m_matchIndex)
     {
-        count += matchIdx >= logIndex;
+        if (matchIdx >= logIndex)
+        {
+            ++count;
+        }
     }
     return count >= (m_replicas.size() + 1) / 2 + 1;
 };
 spdlog::info("Node={} is waiting for majority to agree on logIndex={}", m_id, logIndex);
 return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority), absl::Seconds(replicationTimeout));
```

Likely invalid or redundant comment.
45-69
:⚠️ Potential issuePotential memory leak and unnecessary dynamic allocation in
NodeClient
constructor.The
NodeClient
constructor initializes gRPC channel and stubs without checking for potential issues with dynamic allocation. Additionally, if the channel or stub creation fails, it throws exceptions, but the use ofnew
is not evident, possibly leading to memory mismanagement.Apply this diff to ensure proper initialization and avoid potential memory leaks:
NodeClient::NodeClient(ID nodeId, IP nodeIp) : m_id{nodeId}, m_ip{std::move(nodeIp)}, - m_channel(grpc::CreateChannel(m_ip, grpc::InsecureChannelCredentials())), - m_stub(RaftService::NewStub(m_channel)), - m_kvStub(TinyKVPPService::NewStub(m_channel)) + m_channel{grpc::CreateChannel(m_ip, grpc::InsecureChannelCredentials())}, + m_stub{RaftService::NewStub(m_channel)}, + m_kvStub{TinyKVPPService::NewStub(m_channel)} { assert(m_id > 0); assert(!m_ip.empty()); if (!m_channel) { throw std::runtime_error(fmt::format("Failed to establish a gRPC channel for node={} ip={}", m_id, m_ip)); } if (!m_stub) { throw std::runtime_error(fmt::format("Failed to create a stub for node={} ip={}", m_id, m_ip)); } if (!m_kvStub) { throw std::runtime_error(fmt::format("Failed to create a KV stub for node={} ip={}", m_id, m_ip)); } }Likely invalid or redundant comment.
186-468
:⚠️ Potential issueAvoid possible deadlock due to lock order inversion.
In
AppendEntries
and election functions, mutexesm_stateMutex
andm_timerMutex
are acquired. Ensure consistent lock ordering to prevent deadlocks.Standardize the order in which mutexes are locked throughout the code, acquiring
m_stateMutex
beforem_timerMutex
or vice versa, but consistently.Run the following script to identify inconsistent mutex lock orders:
107-119
: Forward Put requests to the leader when not the leader.In the
put
method, when the node is not the leader, it attempts to forward the request to the leader. Ensure that the forwarding logic handles errors and edge cases appropriately.Run the following script to verify that all non-leader nodes correctly forward Put requests to the current leader:
✅ Verification successful
Put request forwarding implementation is correct and complete
The implementation properly handles forwarding Put requests to the leader:
- Checks if the node is not the leader
- Verifies if the leader is known (via votedFor)
- Forwards requests using the replica's Put RPC
- Handles errors appropriately with detailed logging
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify that non-leader nodes forward Put requests to the leader.
# Test: Search for `Put` method implementations and check forwarding logic.
rg "auto ConsensusModule::Put" -A 50 | rg "if \(m_state != NodeState::LEADER\)" -A 20
```
Script:
```bash
#!/bin/bash
# Get the complete Put method implementation
ast-grep --pattern 'auto ConsensusModule::Put($$$) { $$$ }'
# Also search for any forwarding-related methods or RPC calls
rg "Forward.*Put|forward.*put" -B 2 -A 5
```
examples/raft/raft.cpp
```cpp
void ConsensusModule::sendAppendEntriesRPC(NodeClient &client, std::vector<LogEntry> logEntries)
{
    std::thread(
        [this](NodeClient &client, std::vector<LogEntry> logEntries)
        {
            AppendEntriesRequest request;
            {
                absl::MutexLock locker{&m_stateMutex};

                request.set_term(m_currentTerm);
                request.set_prevlogterm(logEntries.front().term());
                request.set_prevlogindex(getLogTerm(logEntries.front().index() - 1));
                request.set_leadercommit(m_commitIndex);
                request.set_senderid(m_id);

                for (const auto& logEntry : logEntries)
                {
                    request.add_entries()->CopyFrom(logEntry);
                }
            }

            AppendEntriesResponse response;
            auto status = client.appendEntries(request, &response);
            if (!status)
            {
                absl::MutexLock locker{&m_stateMutex};
                spdlog::error("Node={} failed to send AppendEntries RPC to peer={} at term={}",
                              m_id,
                              client.getId(),
                              m_currentTerm);
                return;
            }

            {
                absl::MutexLock locker{&m_stateMutex};
                if (response.term() > m_currentTerm)
                {
                    becomeFollower(response.term());
                    return;
                }
            }

            if (response.success())
            {
                absl::MutexLock locker{&m_stateMutex};
                m_matchIndex[client.getId()] = response.match_index();
                m_nextIndex[client.getId()] = response.match_index() + 1;

                uint32_t majorityIndex = findMajorityIndexMatch();
                if (majorityIndex > m_commitIndex && m_log[majorityIndex - 1].term() == m_currentTerm)
                {
                    updatePersistentState(majorityIndex, std::nullopt);

                    // Apply successfull replication to the state machine e.g. in-memory hash-table
                    while (m_lastApplied < m_commitIndex)
                    {
                        ++m_lastApplied;
                        m_kv[m_log[m_lastApplied - 1].key()] = m_log[m_lastApplied - 1].value();
                    }

                    return;
                }
            }

            if (!response.success())
            {
                {
                    absl::MutexLock locker{&m_stateMutex};
                    m_nextIndex[client.getId()] = std::max(1U, m_nextIndex[client.getId()] - 1);
                }
                sendAppendEntriesRPC(client, {});
            }
        },
        std::ref(client),
        logEntries)
        .detach();
}
```
Avoid detaching threads in sendAppendEntriesRPC.
Detaching threads may lead to resource leaks and undefined behavior. It's better to manage thread lifetimes properly.
Apply this diff to store and manage threads:
void ConsensusModule::sendAppendEntriesRPC(NodeClient &client, std::vector<LogEntry> logEntries)
{
- std::thread(
+ m_replicationThreads.emplace_back(
[this](NodeClient &client, std::vector<LogEntry> logEntries)
{
// Existing code...
},
std::ref(client),
logEntries
- ).detach();
+ );
}
Also, declare m_replicationThreads in your class to store these threads:
std::vector<std::jthread> m_replicationThreads;
Ensure you join these threads appropriately in the stop() method to prevent resource leaks.
examples/raft/raft.cpp
Outdated
auto ConsensusModule::updatePersistentState(std::optional<std::uint32_t> commitIndex,
                                            std::optional<std::uint32_t> votedFor) -> bool
{
    m_commitIndex = commitIndex.has_value() ? commitIndex.value() : m_commitIndex;
    m_votedFor = votedFor.has_value() ? votedFor.value() : m_votedFor;
    return flushPersistentState();
}

auto ConsensusModule::flushPersistentState() -> bool
{
    // Flush commitIndex and votedFor
    {
        auto path = std::filesystem::path(constructFilename(gRaftFilename, m_id));
        std::ofstream fsa(path, std::fstream::out | std::fstream::trunc);
        if (!fsa.is_open())
        {
            spdlog::error("Node={} is unable to open {} to flush commitIndex={} and votedFor={}",
                          m_id,
                          path.c_str(),
                          m_commitIndex,
                          m_votedFor);
            return false;
        }
        fsa << m_commitIndex << " " << m_votedFor << "\n";
        fsa.flush();
        // TODO(lnikon): ::fsync(fsa->handle());
    }

    // Flush the log
    {
        auto path = std::filesystem::path(constructFilename(gLogFilename, m_id));
        std::ofstream fsa(path, std::fstream::out | std::fstream::trunc);
        if (!fsa.is_open())
        {
            spdlog::error("Node={} is unable to open {} to flush the log", m_id, path.c_str());
            return false;
        }

        for (const auto &entry : m_log)
        {
            fsa << entry.key() << " " << entry.value() << " " << entry.term() << "\n";
        }
        fsa.flush();
        // TODO(lnikon): ::fsync(fsa->handle());
    }

    return true;
}
🛠️ Refactor suggestion
Ensure persistence methods handle errors properly.
The methods updatePersistentState, flushPersistentState, and restorePersistentState should handle I/O errors more robustly.
Consider adding error checks and fallback mechanisms. Also, ensure that file operations are exception-safe.
Additionally, the TODO comments about fsync should be addressed to ensure data is flushed to disk.
Would you like me to implement the error handling and finalize the persistence layer?
examples/raft/raft.cpp
Outdated
{
    absl::MutexLock locker{&m_stateMutex};
    m_log.push_back(logEntry);
}
Ensure thread safety when modifying shared data structures.
In the Put method, the m_log is being modified without proper synchronization, which may lead to data races.
Apply this diff to enforce thread safety:
{
- absl::MutexLock locker{&m_stateMutex};
+ absl::WriterMutexLock locker{&m_stateMutex};
m_log.push_back(logEntry);
}
Use absl::WriterMutexLock instead of absl::MutexLock for write operations.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
 {
-    absl::MutexLock locker{&m_stateMutex};
+    absl::WriterMutexLock locker{&m_stateMutex};
     m_log.push_back(logEntry);
 }
examples/raft/raft.cpp
Outdated
auto ConsensusModule::getLogTerm(uint32_t index) const -> uint32_t
{
    if (index == 0 || index > m_log.size())
    {
        return 0;
    }

    return m_log[index - 1].term();
}
Handle out-of-range access in the getLogTerm method.
The method getLogTerm might return an incorrect term if the index is out of bounds.
Apply this diff to fix the index check:
if (index == 0 || index > m_log.size())
{
return 0;
}
-return m_log[index - 1].term();
+return m_log.at(index - 1).term();
Using m_log.at(index - 1) will throw an exception if the index is out of range, making the error more apparent.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
 auto ConsensusModule::getLogTerm(uint32_t index) const -> uint32_t
 {
     if (index == 0 || index > m_log.size())
     {
         return 0;
     }
-    return m_log[index - 1].term();
+    return m_log.at(index - 1).term();
 }
examples/raft/raft.cpp
Outdated
auto ConsensusModule::restorePersistentState() -> bool
{
    {
        auto path = std::filesystem::path(constructFilename(gRaftFilename, m_id));
        if (!std::filesystem::exists(path))
        {
            spdlog::info("Node={} is running the first time", m_id);
            return true;
        }

        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error("Node={} is unable to open {} to restore commitIndex and votedFor", m_id, path.c_str());
            return false;
        }

        ifs >> m_commitIndex >> m_votedFor;
        spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_id, m_commitIndex, m_votedFor);
    }

    {
        auto path = std::filesystem::path(constructFilename(gLogFilename, m_id));
        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error("Node={} is unable to open {} to restore log", m_id, path.c_str());
            return false;
        }

        std::string logLine;
        while (std::getline(ifs, logLine))
        {
            std::stringstream sst(logLine);

            std::string key;
            std::string value;
            std::uint32_t term = 0;

            sst >> key >> value >> term;

            LogEntry logEntry;
            logEntry.set_key(key);
            logEntry.set_value(value);
            logEntry.set_term(term);
            m_log.emplace_back(logEntry);

            spdlog::info("Node={} restored logEntry=[key={}, value={}, term={}]", m_id, key, value, term);
        }
    }

    return true;
}
Handle file parsing errors in restorePersistentState.
While restoring persistent state, the code assumes the file format is correct. If the file is corrupted, it may cause undefined behavior.
Apply this diff to add error handling:
ifs >> m_commitIndex >> m_votedFor;
+if (ifs.fail())
+{
+ spdlog::error("Node={} encountered error while parsing {}", m_id, path.c_str());
+ return false;
+}
spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_id, m_commitIndex, m_votedFor);
Similarly, validate the parsing of log entries.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
auto ConsensusModule::restorePersistentState() -> bool
{
    {
        auto path = std::filesystem::path(constructFilename(gRaftFilename, m_id));
        if (!std::filesystem::exists(path))
        {
            spdlog::info("Node={} is running the first time", m_id);
            return true;
        }
        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error("Node={} is unable to open {} to restore commitIndex and votedFor", m_id, path.c_str());
            return false;
        }
        ifs >> m_commitIndex >> m_votedFor;
+       if (ifs.fail())
+       {
+           spdlog::error("Node={} encountered error while parsing {}", m_id, path.c_str());
+           return false;
+       }
        spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_id, m_commitIndex, m_votedFor);
    }
    {
        auto path = std::filesystem::path(constructFilename(gLogFilename, m_id));
        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error("Node={} is unable to open {} to restore log", m_id, path.c_str());
            return false;
        }
        std::string logLine;
        while (std::getline(ifs, logLine))
        {
            std::stringstream sst(logLine);
            std::string key;
            std::string value;
            std::uint32_t term = 0;
            sst >> key >> value >> term;
            LogEntry logEntry;
            logEntry.set_key(key);
            logEntry.set_value(value);
            logEntry.set_term(term);
            m_log.emplace_back(logEntry);
            spdlog::info("Node={} restored logEntry=[key={}, value={}, term={}]", m_id, key, value, term);
        }
    }
    return true;
}
examples/raft/raft.cpp
Outdated
auto ConsensusModule::RequestVote(grpc::ServerContext *pContext,
                                  const RequestVoteRequest *pRequest,
                                  RequestVoteResponse *pResponse) -> grpc::Status
{
    (void)pContext;

    absl::WriterMutexLock locker(&m_stateMutex);

    spdlog::debug("Received RequestVote RPC from candidate={} during term={} peerTerm={}",
                  pRequest->candidateid(),
                  m_currentTerm,
                  pRequest->term());

    pResponse->set_term(m_currentTerm);
    pResponse->set_votegranted(0);
    pResponse->set_responderid(m_id);

    // Become follower if candidates has higher term
    if (pRequest->term() > m_currentTerm)
    {
        becomeFollower(pRequest->term());
    }

    // Don't grant vote to the candidate if the nodes term is higher
    if (pRequest->term() < m_currentTerm)
    {
        return grpc::Status::OK;
    }

    // Grant vote to the candidate if the node hasn't voted yet and
    // candidates log is at least as up-to-date as receiver's log
    if (m_votedFor == 0 || m_votedFor == pRequest->candidateid())
    {
        if (pRequest->lastlogterm() > getLastLogTerm() ||
            (pRequest->lastlogterm() == getLastLogTerm() && pRequest->lastlogindex() >= getLastLogIndex()))
        {
            updatePersistentState(std::nullopt, pRequest->candidateid());
            pResponse->set_term(m_currentTerm);
            pResponse->set_votegranted(1);
            m_leaderHeartbeatReceived.store(true);
        }
    }

    return grpc::Status::OK;
}
🛠️ Refactor suggestion
Ensure thread safety and proper term updates in the RequestVote method.
The RequestVote handler should ensure that term updates and state changes are thread-safe. Also, reset election timers upon granting a vote.
Apply this diff to improve the RequestVote implementation:
// Grant vote to the candidate if the node hasn't voted yet and
// candidate's log is at least as up-to-date as receiver's log
if (m_votedFor == 0 || m_votedFor == pRequest->candidateid())
{
if (pRequest->lastlogterm() > getLastLogTerm() ||
(pRequest->lastlogterm() == getLastLogTerm() && pRequest->lastlogindex() >= getLastLogIndex()))
{
+ m_votedFor = pRequest->candidateid();
+ m_leaderHeartbeatReceived.store(true);
+ // Reset election timer
+ // Code to reset election timer goes here
+
updatePersistentState(std::nullopt, pRequest->candidateid());
pResponse->set_term(m_currentTerm);
pResponse->set_votegranted(1);
- m_leaderHeartbeatReceived.store(true);
}
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
 auto ConsensusModule::RequestVote(grpc::ServerContext *pContext,
                                   const RequestVoteRequest *pRequest,
                                   RequestVoteResponse *pResponse) -> grpc::Status
 {
     (void)pContext;
     absl::WriterMutexLock locker(&m_stateMutex);
     spdlog::debug("Received RequestVote RPC from candidate={} during term={} peerTerm={}",
                   pRequest->candidateid(),
                   m_currentTerm,
                   pRequest->term());
     pResponse->set_term(m_currentTerm);
     pResponse->set_votegranted(0);
     pResponse->set_responderid(m_id);
     // Become follower if candidates has higher term
     if (pRequest->term() > m_currentTerm)
     {
         becomeFollower(pRequest->term());
     }
     // Don't grant vote to the candidate if the nodes term is higher
     if (pRequest->term() < m_currentTerm)
     {
         return grpc::Status::OK;
     }
     // Grant vote to the candidate if the node hasn't voted yet and
     // candidates log is at least as up-to-date as receiver's log
     if (m_votedFor == 0 || m_votedFor == pRequest->candidateid())
     {
         if (pRequest->lastlogterm() > getLastLogTerm() ||
             (pRequest->lastlogterm() == getLastLogTerm() && pRequest->lastlogindex() >= getLastLogIndex()))
         {
+            m_votedFor = pRequest->candidateid();
+            m_leaderHeartbeatReceived.store(true);
+            // Reset election timer
+            // Code to reset election timer goes here
             updatePersistentState(std::nullopt, pRequest->candidateid());
             pResponse->set_term(m_currentTerm);
             pResponse->set_votegranted(1);
-            m_leaderHeartbeatReceived.store(true);
         }
     }
     return grpc::Status::OK;
 }
examples/raft/raft.cpp
Outdated
ConsensusModule::ConsensusModule(ID nodeId, std::vector<IP> replicas)
    : m_id{nodeId},
      m_currentTerm{0},
      m_votedFor{0},
      m_state{NodeState::FOLLOWER},
      m_commitIndex{0},
      m_lastApplied{0},
      m_voteCount{0}
{
    assert(m_id > 0);
    assert(replicas.size() > 0);
    assert(m_id <= replicas.size());

    m_ip = replicas[m_id - 1];

    grpc::ServerBuilder builder;
    builder.AddListeningPort(m_ip, grpc::InsecureServerCredentials());

    auto *raftService = dynamic_cast<RaftService::Service *>(this);
    if (raftService == nullptr)
    {
        throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to RaftService"));
    }
    builder.RegisterService(raftService);

    auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this);
    if (tkvppService == nullptr)
    {
        throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to TinyKVPPService"));
    }
    builder.RegisterService(tkvppService);

    m_raftServer = builder.BuildAndStart();
    if (!m_raftServer)
    {
        throw std::runtime_error(fmt::format("Failed to create a gRPC server for node={} ip={}", m_id, m_ip));
    }

    std::size_t id{1};
    for (const auto &ip : replicas)
    {
        if (id != m_id)
        {
            m_replicas[id] = NodeClient(id, ip);
            m_matchIndex[id] = 0;
            m_nextIndex[id] = 1;
        }

        ++id;
    }
}
🛠️ Refactor suggestion
Initialize all member variables in the ConsensusModule constructor and check dynamic casts.
The constructor doesn't initialize some member variables, which could lead to undefined behavior. Also, dynamic casts should be validated.
Apply this diff to initialize all member variables and validate dynamic casts:
ConsensusModule::ConsensusModule(ID nodeId, std::vector<IP> replicas)
: m_id{nodeId},
m_currentTerm{0},
m_votedFor{0},
m_state{NodeState::FOLLOWER},
m_commitIndex{0},
m_lastApplied{0},
m_voteCount{0},
+ m_stopElection{false},
+ m_leaderHeartbeatReceived{false},
+ m_raftServer{nullptr}
{
assert(m_id > 0);
assert(replicas.size() > 0);
assert(m_id <= replicas.size());
m_ip = replicas[m_id - 1];
grpc::ServerBuilder builder;
builder.AddListeningPort(m_ip, grpc::InsecureServerCredentials());
- auto *raftService = dynamic_cast<RaftService::Service *>(this);
- if (raftService == nullptr)
+ // Register RaftService
+ if (!this->RegisterService(&builder))
{
throw std::runtime_error(fmt::format("Failed to register RaftService for node={} ip={}", m_id, m_ip));
}
builder.RegisterService(static_cast<RaftService::Service *>(this));
m_raftServer = builder.BuildAndStart();
if (!m_raftServer)
{
throw std::runtime_error(fmt::format("Failed to create a gRPC server for node={} ip={}", m_id, m_ip));
}
std::size_t id{1};
for (const auto &ip : replicas)
{
if (id != m_id)
{
m_replicas[id] = NodeClient(id, ip);
m_matchIndex[id] = 0;
m_nextIndex[id] = 1;
}
++id;
}
}
Committable suggestion skipped: line range outside the PR's diff.
Actionable comments posted: 4
♻️ Duplicate comments (4)
examples/raft/raft.cpp (4)
251-252
:⚠️ Potential issueFix incorrect format string in
spdlog::error
The
spdlog::error
call has one format placeholder but is provided with two arguments. This mismatch can lead to incorrect logging output or runtime errors. Ensure that the number of placeholders matches the number of arguments.Apply this diff to correct the format string:
if (!updatePersistentState(std::min(pRequest->leadercommit(), (uint32_t)m_log.size()), std::nullopt)) { - spdlog::error("Node={} is unable to persist commitIndex", m_id, m_commitIndex); + spdlog::error("Node={} is unable to persist commitIndex={}", m_id, m_commitIndex); pResponse->set_term(m_currentTerm); pResponse->set_success(false);Repeat this correction for other
spdlog
statements where the number of placeholders does not match the number of arguments.
710-718
:⚠️ Potential issueCorrect the reset of
consecutiveFailures
in heartbeat logicIn the heartbeat thread,
consecutiveFailures
is incremented when anAppendEntries
RPC fails but is immediately reset to zero within the failure condition. This prevents the retry logic from working correctly, asconsecutiveFailures
never accumulates.Apply this diff to adjust the position of
consecutiveFailures
reset:spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}", consecutiveFailures, maxRetries); if (consecutiveFailures >= maxRetries) { return; } - consecutiveFailures = 0; continue; } + consecutiveFailures = 0;
754-829
: 🛠️ Refactor suggestionAvoid detaching threads in
sendAppendEntriesRPC
Detaching threads may lead to resource leaks and undefined behavior. It's important to manage thread lifetimes properly by storing thread handles and ensuring they are joined upon completion.
Apply this diff to manage the threads properly:
void consensus_module_t::sendAppendEntriesRPC(node_client_t &client, std::vector<LogEntry> logEntries) { - std::thread( + m_replicationThreads.emplace_back( [this](node_client_t &client, std::vector<LogEntry> logEntries) { // Existing code... }, std::ref(client), logEntries - ).detach(); + ); }Declare
m_replicationThreads
in your class to store these threads:std::vector<std::jthread> m_replicationThreads;Ensure you join these threads appropriately in the
stop()
method to prevent resource leaks.
831-840
:⚠️ Potential issueHandle out-of-range access in
getLogTerm
methodThe
getLogTerm
method may return an incorrect term if theindex
parameter is out of bounds, which can lead to undefined behavior when accessingm_log[index - 1]
.Apply this diff to fix the index check:
auto consensus_module_t::getLogTerm(uint32_t index) const -> uint32_t { - if (index == 0 || index > m_log.size()) + if (index == 0 || index > m_log.size()) { return 0; } - return m_log[index - 1].term(); + return index <= m_log.size() ? m_log[index - 1].term() : 0; }Alternatively, use
m_log.at(index - 1)
to throw an exception if the index is out of range, making the error more apparent.
🧹 Nitpick comments (5)
examples/raft/raft.h (3)
33-65
: Enhance method documentation and specifications.The class methods need thread safety documentation and noexcept specifications where applicable:
[[nodiscard]] auto id() const -> id_t; + [[nodiscard]] auto id() const noexcept -> id_t; /** Thread-safe: Can be called concurrently by multiple threads */ auto appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool; /** Thread-safe: Can be called concurrently by multiple threads */ auto requestVote(const RequestVoteRequest &request, RequestVoteResponse *response) -> bool; /** Thread-safe: Can be called concurrently by multiple threads */ auto put(const PutRequest &request, PutResponse *pResponse) -> bool;
88-92
: Document public interface methods.Key public methods lack documentation explaining their purpose, thread safety guarantees, and error handling behavior.
+ /** + * Initializes the consensus module's persistent state. + * Must be called before start(). + * @return true if initialization succeeds, false otherwise + * Thread-safe: Can be called only once + */ [[nodiscard]] auto init() -> bool; + /** + * Starts the consensus module, election timer, and RPC servers. + * Thread-safe: Can be called only once + */ void start(); + /** + * Stops all threads and servers gracefully. + * Thread-safe: Can be called multiple times + */ void stop();
165-165
: Consider using persistent storage for key-value pairs.The in-memory hashtable
m_kv
will lose all data if the node crashes. Consider using the WAL (Write-Ahead Log) that's already included or a persistent key-value store for durability.lib/db/db.cpp (2)
67-73
: Remove commented-out code or implement consensus replicationThe code from lines 67 to 73 contains commented-out code related to consensus replication using
ConsensusModule
. Leaving such code commented out may lead to confusion and reduce code clarity.Apply this diff to remove the commented-out code:
// TODO(lnikon): Indicate on insertion failure void db_t::put(const structures::lsmtree::key_t &key, const structures::lsmtree::value_t &value) { - // ConsensusModule m_cm; - // if (!m_cm.replicate(key, value)) - // { - // spdlog::error("ConensusModule is unable to replicate"); - // return; - // }Alternatively, if consensus replication is planned, consider implementing this feature or tracking it with a TODO comment with a clear description.
70-70
: Fix typo in error messageThere's a typo in the commented-out error message: "ConensusModule" should be "ConsensusModule".
Apply this diff to correct the typo:
// spdlog::error("ConensusModule is unable to replicate"); + // spdlog::error("ConsensusModule is unable to replicate");
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
examples/raft/main.cpp
(1 hunks)examples/raft/raft.cpp
(1 hunks)examples/raft/raft.h
(1 hunks)lib/db/db.cpp
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/raft/main.cpp
🔇 Additional comments (1)
examples/raft/raft.h (1)
21-28
: LGTM! Well-defined type aliases and constants.The type aliases improve code readability and the invalid ID constant is well-documented.
examples/raft/raft.h
Outdated
#include <wal/wal.h>

#include "Raft.grpc.pb.h"
#include "Raft.pb.h"

#include "TinyKVPP.pb.h"
#include "TinyKVPP.grpc.pb.h"

#include <absl/base/thread_annotations.h>

#include <chrono>  // for 'std::chrono::high_resolution_clock'
#include <cstdint> // for 'uint32_t'
#include <string>  // for 'std::string'
#include <thread>  // for 'std::jthread'
Add missing essential headers.
The following headers are used but not included:
- <memory> for std::shared_ptr, std::unique_ptr
- <optional> for std::optional
- <unordered_map> for std::unordered_map
- <vector> for std::vector
#include <chrono> // for 'std::chrono::high_resolution_clock'
#include <cstdint> // for 'uint32_t'
+#include <memory> // for 'std::shared_ptr', 'std::unique_ptr'
+#include <optional> // for 'std::optional'
#include <string> // for 'std::string'
#include <thread> // for 'std::jthread'
+#include <unordered_map> // for 'std::unordered_map'
+#include <vector> // for 'std::vector'
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
 #include <wal/wal.h>
 #include "Raft.grpc.pb.h"
 #include "Raft.pb.h"
 #include "TinyKVPP.pb.h"
 #include "TinyKVPP.grpc.pb.h"
 #include <absl/base/thread_annotations.h>
 #include <chrono>        // for 'std::chrono::high_resolution_clock'
 #include <cstdint>       // for 'uint32_t'
+#include <memory>        // for 'std::shared_ptr', 'std::unique_ptr'
+#include <optional>      // for 'std::optional'
 #include <string>        // for 'std::string'
 #include <thread>        // for 'std::jthread'
+#include <unordered_map> // for 'std::unordered_map'
+#include <vector>        // for 'std::vector'
examples/raft/raft.cpp
Outdated
if (!m_channel)
{
    throw std::runtime_error(fmt::format("Failed to establish a gRPC channel for node={} ip={}", m_id, m_ip));
}

if (!m_stub)
{
    throw std::runtime_error(fmt::format("Failed to create a stub for node={} ip={}", m_id, m_ip));
}

if (!m_kvStub)
{
    throw std::runtime_error(fmt::format("Failed to create a KV stub for node={} ip={}", m_id, m_ip));
}
Correct ineffective null checks on gRPC objects.
The grpc::CreateChannel and NewStub methods do not return null pointers, even if the connection cannot be established. Therefore, the null checks on m_channel, m_stub, and m_kvStub are ineffective and should be removed or replaced with proper connection checks. Instead of checking for null, consider verifying the channel's connectivity state or handling errors during RPC calls.
Apply this diff to remove ineffective null checks:
assert(m_id > 0);
assert(!m_ip.empty());
- if (!m_channel)
- {
- throw std::runtime_error(fmt::format("Failed to establish a gRPC channel for node={} ip={}", m_id, m_ip));
- }
- if (!m_stub)
- {
- throw std::runtime_error(fmt::format("Failed to create a stub for node={} ip={}", m_id, m_ip));
- }
- if (!m_kvStub)
- {
- throw std::runtime_error(fmt::format("Failed to create a KV stub for node={} ip={}", m_id, m_ip));
- }
If necessary, implement connection verification using the channel's state:
// Example: Wait for the channel to become ready
if (!m_channel->WaitForConnected(gpr_time_add(gpr_now(GPR_CLOCK_REALTIME), gpr_time_from_seconds(5, GPR_TIMESPAN)))) {
throw std::runtime_error(fmt::format("gRPC channel to node={} ip={} failed to connect", m_id, m_ip));
}
examples/raft/raft.cpp
Outdated
auto consensus_module_t::updatePersistentState(std::optional<std::uint32_t> commitIndex,
                                               std::optional<std::uint32_t> votedFor) -> bool
{
    m_commitIndex = commitIndex.has_value() ? commitIndex.value() : m_commitIndex;
    m_votedFor = votedFor.has_value() ? votedFor.value() : m_votedFor;
    return flushPersistentState();
Handle concurrent updates in updatePersistentState
The method updatePersistentState assigns new values to m_commitIndex and m_votedFor without synchronization. This may lead to data races if accessed concurrently from multiple threads.
Apply this diff to protect the assignments with a mutex lock:
auto consensus_module_t::updatePersistentState(std::optional<std::uint32_t> commitIndex,
std::optional<std::uint32_t> votedFor) -> bool
{
+ absl::MutexLock locker{&m_stateMutex};
m_commitIndex = commitIndex.has_value() ? commitIndex.value() : m_commitIndex;
m_votedFor = votedFor.has_value() ? votedFor.value() : m_votedFor;
return flushPersistentState();
}
}

if (!response.success())
{
    {
        absl::MutexLock locker{&m_stateMutex};
        m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
    }
    sendAppendEntriesRPC(client, {});
Fix infinite recursion in failed AppendEntries handling
In the sendAppendEntriesRPC method, when an AppendEntries RPC fails (success is false), the code reduces m_nextIndex[client.id()] and calls sendAppendEntriesRPC recursively without a termination condition. This can lead to infinite recursion and stack overflow.
Apply this diff to modify the logic and prevent infinite recursion:
if (!response.success())
{
{
absl::MutexLock locker{&m_stateMutex};
m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
}
- sendAppendEntriesRPC(client, {});
+ // Retry sending AppendEntries with updated nextIndex
+ // Consider implementing a loop with a termination condition instead of recursion
}
Consider redesigning the method to use iteration instead of recursion, ensuring there's a clear exit condition when retries fail.
Actionable comments posted: 7
♻️ Duplicate comments (3)
examples/raft/main.cpp (2)
12-24: ⚠️ Potential issue: Add signal handlers for graceful shutdown.
The program should handle SIGTERM/SIGINT signals for graceful shutdown.
Reference: This was mentioned in a previous review. The program currently lacks proper shutdown handling, which could lead to resource leaks and inconsistent state.
void signal_handler(int signal) {
    spdlog::info("Received signal {}. Initiating graceful shutdown...", signal);
    // Trigger graceful shutdown
}

int main(int argc, char *argv[]) {
    std::signal(SIGTERM, signal_handler);
    std::signal(SIGINT, signal_handler);
    // ... rest of the code
}
46-48: ⚠️ Potential issue: Security concern: Using insecure channel credentials.

The use of grpc::InsecureChannelCredentials() is not suitable for production environments.

Consider using TLS certificates for secure communication:

auto creds = grpc::SslCredentials(grpc::SslCredentialsOptions());
m_channel = grpc::CreateChannel(replicaIp, creds);
examples/raft/CMakeLists.txt (1)
13-15: 🛠️ Refactor suggestion: Complete test configuration setup

The current setup only includes Catch2 test registration. If you decide to use Google Test, you'll need different registration.

Based on your chosen framework, use one of these configurations:

# For Catch2 (current setup)
include(Catch)
catch_discover_tests(RaftTest)

# OR for Google Test
include(GoogleTest)
gtest_discover_tests(RaftTest)

# Add test properties for both cases
set_tests_properties(${ALL_TESTS} PROPERTIES
    ENVIRONMENT "RAFT_TEST_DIR=${CMAKE_CURRENT_BINARY_DIR}/test_data"
)
🧹 Nitpick comments (9)
examples/raft/main.cpp (2)
1-11: Organize includes and add missing headers.

Consider organizing includes into logical groups and adding missing standard headers:

 // gRPC includes
 #include <grpcpp/create_channel.h>
 #include <grpcpp/security/credentials.h>
 #include <grpcpp/server.h>

 // Project includes
 #include "raft.h"

 // Third-party includes
 #include <cxxopts.hpp>
 #include <spdlog/spdlog.h>

+// Standard includes
+#include <memory>
+#include <vector>
+#include <csignal>
29-29: Fix typo in error message.

Correct the spelling of "positve" to "positive".

- spdlog::error("ID of the node should be positve integer");
+ spdlog::error("ID of the node should be positive integer");

examples/raft/raft.h (2)
59-59: Pass node_config_t by const reference to avoid unnecessary copying

Consider passing config as a const node_config_t& to avoid unnecessary copying in the node_client_t constructor.

Apply this diff:

- node_client_t(node_config_t config, std::unique_ptr<RaftService::StubInterface> pRaftStub);
+ node_client_t(const node_config_t& config, std::unique_ptr<RaftService::StubInterface> pRaftStub);
91-91: Pass parameters by const reference to avoid unnecessary copying

Consider passing node_config_t and replicas by const reference in the consensus_module_t constructor to avoid unnecessary copies.

Apply this diff:

- consensus_module_t(node_config_t nodeConfig, std::vector<node_client_t> replicas);
+ consensus_module_t(const node_config_t& nodeConfig, const std::vector<node_client_t>& replicas);

conanfile.txt (1)
14-14: Consider consolidating dependencies to a single testing framework

Since the project already uses Catch2, adding gtest as a dependency may be unnecessary. Consolidating to one testing framework can reduce complexity and dependency management overhead.

examples/raft/raft.cpp (2)
29-39: Consider using constexpr for timeout bounds and more descriptive names.

The timeout generation function could be more maintainable with:
- More descriptive names indicating these are election timeouts
- Documentation explaining the purpose of these specific timeout values
- const int minTimeout{150};
- const int maxTimeout{300};
+ // Election timeout bounds (in milliseconds) as per Raft paper
+ constexpr int MIN_ELECTION_TIMEOUT_MS{150};
+ constexpr int MAX_ELECTION_TIMEOUT_MS{300};
  std::random_device randomDevice;
  std::mt19937 gen(randomDevice());
- std::uniform_int_distribution<> dist(minTimeout, maxTimeout);
+ std::uniform_int_distribution<> dist(MIN_ELECTION_TIMEOUT_MS, MAX_ELECTION_TIMEOUT_MS);
121-136: Standardize RPC timeout handling across methods.

The put method's timeout handling differs from appendEntries and requestVote. Consider extracting the timeout logic to maintain consistency.

+ const auto RPC_TIMEOUT = std::chrono::seconds(generate_random_timeout());
+
  grpc::ClientContext context;
- context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(generate_random_timeout()));
+ context.set_deadline(std::chrono::system_clock::now() + RPC_TIMEOUT);

examples/raft/CMakeLists.txt (2)
1-2: Consider aligning project name with repository name

The project name "zkv" differs from the repository name "tinykvpp". This inconsistency might cause confusion. Consider using a consistent naming scheme across the project.

-project(zkv)
+project(tinykvpp)
4-6: Add debug configuration and consider C++ standard compatibility

The C++23 standard is very recent and might not be widely supported. Consider:
- Adding debug configuration options
- Using C++20 for better compiler compatibility unless C++23 features are specifically needed
Add the following configuration:
# Add debug configuration
target_compile_definitions(RaftMain PRIVATE
    $<$<CONFIG:Debug>:RAFT_DEBUG>
    $<$<CONFIG:Debug>:RAFT_LOG_LEVEL=debug>
)

# Optional: Use C++20 if C++23 features aren't needed
# set_target_properties(RaftMain PROPERTIES CXX_STANDARD 20)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- CMakeLists.txt (3 hunks)
- conanfile.txt (1 hunks)
- examples/raft/CMakeLists.txt (1 hunks)
- examples/raft/main.cpp (1 hunks)
- examples/raft/raft.cpp (1 hunks)
- examples/raft/raft.h (1 hunks)
- examples/raft/raft_test.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- CMakeLists.txt
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-gcc
- GitHub Check: build-clang
🔇 Additional comments (6)
examples/raft/raft.h (3)
85-105: Enhance class design with virtual destructor and method documentation

The consensus_module_t class would benefit from a virtual destructor and detailed method documentation to improve maintainability and safety.
172-173: Atomic variables need explicit memory ordering specifications

The atomic variables m_leaderHeartbeatReceived and m_stopElection may require explicit memory order semantics to prevent synchronization issues.
85-86: ⚠️ Potential issue: Ensure public inheritance of std::enable_shared_from_this

Currently, consensus_module_t inherits from std::enable_shared_from_this<consensus_module_t> without specifying the access specifier, which defaults to private inheritance in C++. This can lead to undefined behavior when calling shared_from_this(). Please make the inheritance public.

Apply this diff to fix the issue:

- std::enable_shared_from_this<consensus_module_t>
+ public std::enable_shared_from_this<consensus_module_t>

Likely invalid or redundant comment.
examples/raft/raft.cpp (3)
665-669: ⚠️ Potential issue: Fix incorrect majority calculation.

The current majority calculation is incorrect and could lead to split votes. The total number of nodes should include the current node.

 auto consensus_module_t::hasMajority(uint32_t votes) const -> bool
 {
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    // Total nodes = replicas + self
+    const size_t totalNodes = m_replicas.size() + 1;
+    return votes > totalNodes / 2;
 }

Likely invalid or redundant comment.
910-916: ⚠️ Potential issue: Fix race condition in updatePersistentState.

The method modifies shared state without proper synchronization.

 auto consensus_module_t::updatePersistentState(std::optional<std::uint32_t> commitIndex,
                                                std::optional<std::uint32_t> votedFor) -> bool
 {
+    absl::MutexLock locker{&m_stateMutex};
     m_commitIndex = commitIndex.has_value() ? commitIndex.value() : m_commitIndex;
     m_votedFor = votedFor.has_value() ? votedFor.value() : m_votedFor;
     return flushPersistentState();
 }
Likely invalid or redundant comment.
1002-1003: ⚠️ Potential issue: Add error handling for file parsing.

The state restoration code doesn't validate the parsed values.

 ifs >> m_commitIndex >> m_votedFor;
+if (ifs.fail()) {
+    spdlog::error("Node={} failed to parse state file", m_config.m_id);
+    return false;
+}
 spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_config.m_id, m_commitIndex, m_votedFor);

Likely invalid or redundant comment.
auto nodeIps = parsedOptions["nodes"].as<std::vector<raft::ip_t>>();
if (nodeIps.empty())
{
    spdlog::error("List of node IPs can't be empty");
    return EXIT_FAILURE;
}
🛠️ Refactor suggestion
Enhance IP address validation.
The current validation only checks if the list is not empty. Consider adding validation for:
- IP address format
- Port number range
- Duplicate addresses
Example validation function:
bool validateIpAddress(const std::string& ip) {
std::regex ipRegex(
"^(?:[0-9]{1,3}\\.){3}[0-9]{1,3}(:[0-9]{1,5})?$"
);
if (!std::regex_match(ip, ipRegex)) {
return false;
}
// Additional validation for port range and IP octets
return true;
}
examples/raft/main.cpp
Outdated
raft::id_t replicaId{1};
std::vector<raft::node_client_t> replicas;
for (const auto &replicaIp : nodeIps)
{
    if (replicaId != nodeId)
    {
        std::unique_ptr<RaftService::Stub> stub{
            RaftService::NewStub(grpc::CreateChannel(replicaIp, grpc::InsecureChannelCredentials()))};

        replicas.emplace_back(raft::node_config_t{.m_id = replicaId, .m_ip = replicaIp}, std::move(stub));
    }

    ++replicaId;
}
🛠️ Refactor suggestion
Add error handling for replica setup.
The replica setup should include error handling for channel creation and connection failures.
for (const auto &replicaIp : nodeIps) {
if (replicaId != nodeId) {
try {
auto channel = grpc::CreateChannel(replicaIp, creds);
if (!channel->WaitForConnected(
std::chrono::system_clock::now() +
std::chrono::seconds(5))) {
spdlog::error("Failed to connect to replica at {}", replicaIp);
return EXIT_FAILURE;
}
std::unique_ptr<RaftService::Stub> stub{
RaftService::NewStub(channel)};
replicas.emplace_back(
raft::node_config_t{.m_id = replicaId, .m_ip = replicaIp},
std::move(stub));
} catch (const std::exception& e) {
spdlog::error("Failed to setup replica: {}", e.what());
return EXIT_FAILURE;
}
}
++replicaId;
}
examples/raft/main.cpp
Outdated
consensusModule.start();

/*cm.stop();*/
Implement proper consensus module lifecycle management.
The commented-out stop()
call and lack of proper cleanup could lead to resource leaks.
Consider implementing RAII pattern or proper cleanup:
int main(int argc, char *argv[]) {
// ... existing code ...
std::atomic<bool> running{true};
auto cleanup = [&running, &consensusModule]() {
running = false;
consensusModule.stop();
spdlog::info("Consensus module stopped gracefully");
};
// Set up signal handler to trigger cleanup
std::signal(SIGTERM, [](int) { cleanup(); });
std::signal(SIGINT, [](int) { cleanup(); });
consensusModule.start();
// Wait for shutdown signal
while (running) {
std::this_thread::sleep_for(std::chrono::milliseconds(100));
}
return EXIT_SUCCESS;
}
examples/raft/raft_test.cpp
Outdated
#include "gmock/gmock.h"
#include <catch2/catch_test_macros.hpp>

#include <grpc/grpc.h>
#include <grpcpp/channel.h>
#include <grpcpp/client_context.h>
#include <grpcpp/create_channel.h>
#include <grpcpp/security/credentials.h>
#include <grpcpp/security/server_credentials.h>
#include <grpcpp/server_builder.h>
#include <grpcpp/support/status.h>

#include <gtest/gtest.h>
#include <memory>
🛠️ Refactor suggestion
Avoid mixing Catch2 and Google Test in the same test file
Including both Catch2 and Google Test (gtest/gmock) in the same test file can lead to conflicts and is generally not recommended. Choose one testing framework to maintain consistency in your tests.
examples/raft/CMakeLists.txt
Outdated
add_executable(RaftTest "raft_test.cpp" "raft.cpp")
set_target_properties(RaftTest PROPERTIES CXX_STANDARD 23)
target_link_libraries(RaftTest Catch2::Catch2WithMain gtest::gtest DB
                      RaftProtoObjects TKVProtoObjects)
🛠️ Refactor suggestion
Standardize on a single testing framework
The test executable links against both Catch2 and Google Test frameworks. This is unusual and could lead to:
- Increased maintenance overhead
- Confusion in test organization
- Larger binary size
Choose one framework and update the configuration:
# If choosing Catch2
target_link_libraries(RaftTest PRIVATE
Catch2::Catch2WithMain
DB
RaftProtoObjects
TKVProtoObjects
)
# OR if choosing Google Test
target_link_libraries(RaftTest PRIVATE
gtest::gtest
gtest::gtest_main
DB
RaftProtoObjects
TKVProtoObjects
)
Actionable comments posted: 11
♻️ Duplicate comments (5)
examples/raft/main.cpp (2)
61-63: ⚠️ Potential issue: Security concern: Using insecure channel credentials.

The use of grpc::InsecureChannelCredentials() is not suitable for production environments.

Consider using TLS certificates for secure communication:

- RaftService::NewStub(grpc::CreateChannel(replicaIp, grpc::InsecureChannelCredentials()))};
+ auto creds = grpc::SslCredentials(grpc::SslCredentialsOptions());
+ RaftService::NewStub(grpc::CreateChannel(replicaIp, creds))};
56-69: ⚠️ Potential issue: Add error handling for replica setup.
The replica setup should include error handling for channel creation and connection failures.
Consider wrapping the replica setup in a try-catch block and adding connection timeout checks.
examples/raft/raft.cpp (3)
689-693: ⚠️ Potential issue: Fix incorrect majority calculation.

The majority calculation is incorrect and could lead to split votes.

 auto consensus_module_t::hasMajority(uint32_t votes) const -> bool
 {
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    // Total nodes = replicas + self
+    const size_t totalNodes = m_replicas.size() + 1;
+    return votes > totalNodes / 2;
 }
794-874: ⚠️ Potential issue: Fix thread safety issues in sendAppendEntriesRPC.

Using detached threads is unsafe and could lead to resource leaks and undefined behavior.

-    std::thread(
+    m_replicationThreads.emplace_back(
         [this](node_client_t &client, std::vector<LogEntry> logEntries)
         {
             // ... existing code ...
-            if (!response.success())
-            {
-                {
-                    absl::MutexLock locker{&m_stateMutex};
-                    m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
-                }
-                sendAppendEntriesRPC(client, {}); // Recursive call!
-            }
+            if (!response.success())
+            {
+                absl::MutexLock locker{&m_stateMutex};
+                m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
+                // Use a loop instead of recursion
+                continue;
+            }
         },
         std::ref(client),
-        logEntries)
-        .detach();
+        logEntries);
942-1006: ⚠️ Potential issue: Improve persistence reliability and error handling.

The persistence implementation lacks atomic write guarantees and proper fsync calls.

 auto consensus_module_t::flushPersistentState() -> bool
 {
+    const auto tmpPath = path.string() + ".tmp";
+
     // Flush commitIndex and votedFor
     {
-        auto path = std::filesystem::path(constructFilename(gRaftFilename, m_config.m_id));
-        std::ofstream fsa(path, std::fstream::out | std::fstream::trunc);
+        std::ofstream fsa(tmpPath, std::fstream::out | std::fstream::trunc);
         if (!fsa.is_open())
         {
             return false;
         }
         fsa << m_commitIndex << " " << m_votedFor << "\n";
         fsa.flush();
-        // TODO(lnikon): ::fsync(fsa->handle());
+        if (::fsync(fileno(fsa.rdbuf()->fd())) != 0) {
+            return false;
+        }
+        fsa.close();
+
+        if (std::rename(tmpPath.c_str(), path.c_str()) != 0) {
+            return false;
+        }
     }
🧹 Nitpick comments (8)
examples/raft/main.cpp (4)
1-14: Add header guards or pragma once directive.

Even though this is a source file, it's good practice to add header guards to prevent potential issues if the file is accidentally included elsewhere.

+#pragma once
 #include <chrono>
 #include <csignal>
42-47: Enhance node ID validation.

The current validation only checks if the ID is non-zero. Consider adding:
- Upper bound check to prevent overflow
- Validation against the size of the node list
 auto nodeId = parsedOptions["id"].as<id_t>();
-if (nodeId == 0)
+if (nodeId == 0 || nodeId > nodeIps.size())
 {
-    spdlog::error("ID of the node should be positve integer");
+    spdlog::error("Node ID should be a positive integer not exceeding the number of nodes ({})", nodeIps.size());
     return EXIT_FAILURE;
 }
70-76: Enhance error reporting for consensus module initialization.

Consider adding more detailed error reporting to help diagnose initialization failures.

-    if (!consensusModule.init())
+    try
+    {
+        if (!consensusModule.init())
+        {
+            spdlog::error("Failed to initialize the consensus module: initialization returned false");
+            return EXIT_FAILURE;
+        }
+    }
+    catch (const std::exception& e)
+    {
+        spdlog::error("Failed to initialize the consensus module: {}", e.what());
+        return EXIT_FAILURE;
+    }
77-77: Make log level configurable.

Consider adding a command-line option to configure the log level instead of hardcoding it.

+ options.add_options()
+     ("log-level", "Log level (trace, debug, info, warn, error, critical)",
+      cxxopts::value<std::string>()->default_value("info"));
+ auto logLevel = parsedOptions["log-level"].as<std::string>();
+ if (logLevel == "trace") spdlog::set_level(spdlog::level::trace);
+ else if (logLevel == "debug") spdlog::set_level(spdlog::level::debug);
+ // ... handle other levels
- spdlog::set_level(spdlog::level::debug);

examples/raft/raft_test.cpp (1)
93-95
: Implement leader election test cases

The leader election test case is empty but crucial for validating Raft consensus. Key scenarios to test:
- Normal leader election
- Split vote scenarios
- Leader failure and re-election
Would you like me to help implement these test cases or create a GitHub issue to track this task?
examples/raft/raft.h (2)
32-38: Consider using steady_clock for better portability.

high_resolution_clock might alias to system_clock on some platforms, which can be adjusted and jump backwards/forwards. For election timeouts and heartbeats, steady_clock would be more appropriate as it's monotonic and provides consistent intervals.
would be more appropriate as it's monotonic and provides consistent intervals.-using clock_t = std::chrono::high_resolution_clock; -using timepoint_t = std::chrono::high_resolution_clock::time_point; +using clock_t = std::chrono::steady_clock; +using timepoint_t = std::chrono::steady_clock::time_point;
40-45: Add validation in node_config_t constructor.

The struct could benefit from a constructor that validates the ID and IP address format.

 struct node_config_t
 {
+    node_config_t(id_t id, ip_t ip)
+        : m_id(id), m_ip(std::move(ip)) {
+        if (m_id == gInvalidId) {
+            throw std::invalid_argument("Invalid node ID");
+        }
+        if (m_ip.empty()) {
+            throw std::invalid_argument("Empty IP address");
+        }
+    }
+
     id_t m_id{gInvalidId};
     ip_t m_ip;
 };
166-178
: Improve service registration error handling.The dynamic_cast error handling could be more informative and the casts could be combined.
- auto *raftService = dynamic_cast<RaftService::Service *>(this); - if (raftService == nullptr) - { - throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to RaftService")); - } - builder.RegisterService(raftService); - - auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this); - if (tkvppService == nullptr) - { - throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to TinyKVPPService")); - } - builder.RegisterService(tkvppService); + if (auto* raftService = dynamic_cast<RaftService::Service*>(this)) { + builder.RegisterService(raftService); + } else { + throw std::runtime_error( + fmt::format("Node {} failed to register RaftService", m_config.m_id)); + } + + if (auto* tkvppService = dynamic_cast<TinyKVPPService::Service*>(this)) { + builder.RegisterService(tkvppService); + } else { + throw std::runtime_error( + fmt::format("Node {} failed to register TinyKVPPService", m_config.m_id)); + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- .vscode/launch.json (2 hunks)
- examples/raft/main.cpp (1 hunks)
- examples/raft/raft.cpp (1 hunks)
- examples/raft/raft.h (1 hunks)
- examples/raft/raft_test.cpp (1 hunks)
🧰 Additional context used
🪛 Biome (1.9.4)
.vscode/launch.json
[error] 74-74: Expected a property but instead found '}'.
Expected a property here.
(parse)
[error] 105-105: Expected a property but instead found '}'.
Expected a property here.
(parse)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-gcc
- GitHub Check: build-clang
🔇 Additional comments (8)
examples/raft/main.cpp (1)
17-24: LGTM! Thread-safe signal handling implementation.

The signal handler correctly uses atomic operations for thread-safe graceful shutdown handling.
examples/raft/raft_test.cpp (3)
1-2: Avoid mixing test frameworks

Including both Catch2 (catch_test_macros.hpp) and Google Test/Mock (gmock.h, gtest/gtest.h) in the same test file can lead to conflicts and maintenance issues.

Choose one framework consistently:
- If using Google Test: Remove Catch2 includes and convert test cases to TEST/TEST_F
- If using Catch2: Remove Google Test includes and use Catch2's built-in mocking capabilities
Also applies to: 14-14
89-91: Verify consensus module state after initialization

The test starts the consensus module but doesn't verify its state or behavior.
Add assertions to verify:
- The module transitions to the correct initial state
- The heartbeat mechanism is active
- The term number is correctly initialized
25-42: 🛠️ Refactor suggestion: Maintain consistent assertion style

The test mixes Google Test assertions (EXPECT_*) within a Catch2 test case. This makes the test harder to maintain and understand.

Convert to consistent Catch2 style:

 TEST_CASE("NodeClient AppendEntries", "[NodeClient]")
 {
     raft::node_config_t nodeConfig{.m_id = 1, .m_ip = "0.0.0.0:9090"};
     auto mockStub = std::make_unique<MockRaftServiceStub>();
     AppendEntriesResponse response;
-    EXPECT_CALL(*mockStub, AppendEntries)
+    REQUIRE_CALL(*mockStub, AppendEntries)
         .Times(testing::AtLeast(2))
         .WillOnce(testing::DoAll(testing::SetArgPointee<2>(response), testing::Return(grpc::Status::OK)))
         .WillOnce(testing::DoAll(testing::SetArgPointee<2>(response), testing::Return(grpc::Status::CANCELLED)));
     raft::node_client_t nodeClient{nodeConfig, std::move(mockStub)};
     AppendEntriesRequest request;
-    EXPECT_EQ(nodeClient.appendEntries(request, &response), true);
-    EXPECT_EQ(nodeClient.appendEntries(request, &response), false);
+    REQUIRE(nodeClient.appendEntries(request, &response) == true);
+    REQUIRE(nodeClient.appendEntries(request, &response) == false);
 }

Likely invalid or redundant comment.
.vscode/launch.json (1)
7-33: Enhance RaftMain debug configurations

The debug configurations could be more flexible:
- Add configurable debug port
- Make node ID and peer list configurable via inputs
- Add build task dependency
As suggested in the previous review, consider using input variables for configuration:
{
    "inputs": [
        {
            "id": "debugPort",
            "type": "promptString",
            "description": "Debug port for remote debugging",
            "default": "50505"
        },
        {
            "id": "nodeId",
            "type": "promptString",
            "description": "Raft node ID",
            "default": "1"
        }
    ]
}

Also applies to: 34-74
examples/raft/raft.h (3)
1-30: LGTM! Well-organized headers with clear grouping.

Headers are properly organized and include all necessary dependencies with helpful comments indicating their purpose.
83-86: Fix inheritance from enable_shared_from_this.

The inheritance should be public to ensure proper functionality of shared_from_this().

 class consensus_module_t : public RaftService::Service,
                            public TinyKVPPService::Service,
-                           std::enable_shared_from_this<consensus_module_t>
+                           public std::enable_shared_from_this<consensus_module_t>
171-173: Specify memory ordering for atomic operations.

The atomic variables need explicit memory ordering specifications to ensure proper synchronization.

- std::atomic<bool> m_leaderHeartbeatReceived{false};
+ std::atomic<bool> m_leaderHeartbeatReceived ABSL_GUARDED_BY(m_timerMutex){false};
examples/raft/raft_test.cpp
Outdated
aeResponse.set_responderid(1);
aeResponse.set_responderid(2);
Fix overwritten responder ID
The responder ID is being overwritten immediately after being set:
- aeResponse.set_responderid(1);
- aeResponse.set_responderid(2);
+ aeResponse.set_responderid(2);
],
"debuggerPath": "/usr/bin/gdb"
},
},
Fix JSON syntax errors
Remove trailing commas after the pipeTransport objects:
"pipeTransport": {
// ...
"debuggerPath": "/usr/bin/gdb"
- },
+ }
},
Also applies to: 105-105
🧰 Tools
🪛 Biome (1.9.4)
[error] 74-74: Expected a property but instead found '}'.
Expected a property here.
(parse)
examples/raft/raft.h
Outdated
void startElection();
void becomeFollower(uint32_t newTerm) ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
void becomeLeader() ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
void sendHeartbeat(node_client_t &client) ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
void sendAppendEntriesRPC(node_client_t &client, std::vector<LogEntry> logEntries);
[[nodiscard]] auto waitForMajorityReplication(uint32_t logIndex) -> bool;
🛠️ Refactor suggestion
Add missing thread safety annotations.
Several private methods lack thread safety annotations despite accessing shared state.
- void startElection();
+ void startElection() ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
- void sendHeartbeat(node_client_t &client) ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
+ void sendHeartbeat(node_client_t &client)
+ ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex, m_timerMutex);
- void sendAppendEntriesRPC(node_client_t &client, std::vector<LogEntry> logEntries);
+ void sendAppendEntriesRPC(node_client_t &client, std::vector<LogEntry> logEntries)
+ ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
- [[nodiscard]] auto waitForMajorityReplication(uint32_t logIndex) -> bool;
+ [[nodiscard]] auto waitForMajorityReplication(uint32_t logIndex) -> bool
+ ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex);
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
void startElection() ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
void becomeFollower(uint32_t newTerm) ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
void becomeLeader() ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
void sendHeartbeat(node_client_t &client)
    ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex, m_timerMutex);
void sendAppendEntriesRPC(node_client_t &client, std::vector<LogEntry> logEntries)
    ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
[[nodiscard]] auto waitForMajorityReplication(uint32_t logIndex) -> bool
    ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex);
examples/raft/raft.h
Outdated
/**
 * Client for communicating with other nodes in the Raft cluster.
 * Handles RPC operations for consensus and key-value operations.
 */
class node_client_t
{
  public:
    /**
     * Constructs a client for communicating with a specific node.
     * @param nodeId Unique identifier for the target node
     * @param nodeIp IP address of the target node
     * @throws std::runtime_error if connection cannot be established
     */
    node_client_t(node_config_t config, std::unique_ptr<RaftService::StubInterface> pRaftStub);
    virtual ~node_client_t() noexcept = default;

    node_client_t(const node_client_t &) = delete;
    auto operator=(const node_client_t &) -> node_client_t & = delete;

    node_client_t(node_client_t &&) = default;
    auto operator=(node_client_t &&) -> node_client_t & = default;

    auto appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool;
    auto requestVote(const RequestVoteRequest &request, RequestVoteResponse *response) -> bool;

    auto put(const PutRequest &request, PutResponse *pResponse) -> bool;

    [[nodiscard]] auto id() const -> id_t;
    [[nodiscard]] auto ip() const -> ip_t;

  private:
    node_config_t m_config{};

    std::unique_ptr<RaftService::StubInterface> m_stub{nullptr};
    std::unique_ptr<TinyKVPPService::Stub> m_kvStub{nullptr};
};
🛠️ Refactor suggestion
Document thread safety guarantees for RPC methods.
The class handles RPC operations but lacks thread safety documentation. gRPC stubs are thread-safe, but this should be explicitly documented.
/**
* Client for communicating with other nodes in the Raft cluster.
* Handles RPC operations for consensus and key-value operations.
+ * Thread safety: All public methods are thread-safe as they use thread-safe gRPC stubs.
*/
class node_client_t
📝 Committable suggestion
/**
 * Client for communicating with other nodes in the Raft cluster.
 * Handles RPC operations for consensus and key-value operations.
 * Thread safety: All public methods are thread-safe as they use thread-safe gRPC stubs.
 */
class node_client_t
{
  public:
    /**
     * Constructs a client for communicating with a specific node.
     * @param nodeId Unique identifier for the target node
     * @param nodeIp IP address of the target node
     * @throws std::runtime_error if connection cannot be established
     */
    node_client_t(node_config_t config, std::unique_ptr<RaftService::StubInterface> pRaftStub);
    virtual ~node_client_t() noexcept = default;
    node_client_t(const node_client_t &) = delete;
    auto operator=(const node_client_t &) -> node_client_t & = delete;
    node_client_t(node_client_t &&) = default;
    auto operator=(node_client_t &&) -> node_client_t & = default;
    auto appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool;
    auto requestVote(const RequestVoteRequest &request, RequestVoteResponse *response) -> bool;
    auto put(const PutRequest &request, PutResponse *pResponse) -> bool;
    [[nodiscard]] auto id() const -> id_t;
    [[nodiscard]] auto ip() const -> ip_t;
  private:
    node_config_t m_config{};
    std::unique_ptr<RaftService::StubInterface> m_stub{nullptr};
    std::unique_ptr<TinyKVPPService::Stub> m_kvStub{nullptr};
};
examples/raft/raft.cpp
Outdated
auto node_client_t::appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool
{
    const auto RPC_TIMEOUT = std::chrono::seconds(generate_random_timeout());

    grpc::ClientContext context;
    context.set_deadline(std::chrono::system_clock::now() + RPC_TIMEOUT);
    grpc::Status status = m_stub->AppendEntries(&context, request, response);
    if (!status.ok())
    {
        spdlog::error("AppendEntries RPC to peer id={} ip={} failed. Error code={} message={}",
                      m_config.m_id,
                      m_config.m_ip,
                      static_cast<int>(status.error_code()),
                      status.error_message());
        return false;
    }

    return true;
}
🛠️ Refactor suggestion
Refactor RPC methods to use consistent timeout and error handling.
The RPC methods (appendEntries, requestVote, put) have duplicated error handling code and inconsistent timeout handling.
Consider extracting common RPC handling logic:
template<typename Request, typename Response>
bool executeRPC(const char* operation, const Request& req, Response* resp,
std::function<grpc::Status(grpc::ClientContext*, const Request&, Response*)> rpc) {
const auto RPC_TIMEOUT = std::chrono::seconds(generate_random_timeout());
grpc::ClientContext context;
context.set_deadline(std::chrono::system_clock::now() + RPC_TIMEOUT);
grpc::Status status = rpc(&context, req, resp);
if (!status.ok()) {
spdlog::error("{} RPC to peer id={} ip={} failed. Error code={} message={}",
operation,
m_config.m_id,
m_config.m_ip,
static_cast<int>(status.error_code()),
status.error_message());
return false;
}
return true;
}
Then use it in RPC methods:
auto node_client_t::appendEntries(const AppendEntriesRequest& request,
AppendEntriesResponse* response) -> bool {
return executeRPC("AppendEntries", request, response,
[this](auto ctx, auto& req, auto* resp) {
return m_stub->AppendEntries(ctx, req, resp);
});
}
Also applies to: 102-121, 123-138
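The de-duplication pattern suggested above can be exercised without gRPC. Below is a minimal, hedged sketch: `fake_status_t`, the parameter names, and the error-log string are assumptions for illustration, not the project's actual API.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Stand-in for grpc::Status: just an ok flag plus a message.
struct fake_status_t
{
    bool ok;
    std::string message;
};

// Shape of the common-RPC helper: one place owns error handling,
// parameterised on the actual call to perform.
template <typename Request, typename Response>
bool executeRPC(const char *operation,
                const Request &req,
                Response *resp,
                const std::function<fake_status_t(const Request &, Response *)> &rpc,
                std::string &errorLog)
{
    fake_status_t status = rpc(req, resp);
    if (!status.ok)
    {
        // Centralised error reporting for every RPC method.
        errorLog = std::string(operation) + " failed: " + status.message;
        return false;
    }
    return true;
}
```

Each RPC method then reduces to a one-line lambda forwarding to the stub, so timeout setup and logging cannot drift apart between `appendEntries`, `requestVote`, and `put`.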
{
    const int minTimeout{150};
    const int maxTimeout{300};

    std::random_device randomDevice;
    std::mt19937 gen(randomDevice());
    std::uniform_int_distribution<> dist(minTimeout, maxTimeout);

    return dist(gen);
}
🛠️ Refactor suggestion
Improve thread safety and configurability of timeout generation.
The random number generator is not thread-safe and timeout values are hardcoded.
Apply this diff to fix these issues:
+constexpr const int MIN_TIMEOUT_MS = 150;
+constexpr const int MAX_TIMEOUT_MS = 300;
auto generate_random_timeout() -> int
{
- const int minTimeout{150};
- const int maxTimeout{300};
- std::random_device randomDevice;
- std::mt19937 gen(randomDevice());
+ thread_local std::random_device randomDevice;
+ thread_local std::mt19937 gen(randomDevice());
- std::uniform_int_distribution<> dist(minTimeout, maxTimeout);
+ std::uniform_int_distribution<> dist(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS);
return dist(gen);
}
📝 Committable suggestion
{
    constexpr const int MIN_TIMEOUT_MS = 150;
    constexpr const int MAX_TIMEOUT_MS = 300;
    thread_local std::random_device randomDevice;
    thread_local std::mt19937 gen(randomDevice());
    std::uniform_int_distribution<> dist(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS);
    return dist(gen);
}
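For reference, the thread-local variant can be sanity-checked in isolation. This is a sketch under the assumption that only the 150-300 ms range matters; the constant names mirror the suggestion above.

```cpp
#include <cassert>
#include <random>

constexpr int MIN_TIMEOUT_MS = 150;
constexpr int MAX_TIMEOUT_MS = 300;

// Thread-local engine: seeded once per thread, no per-call
// std::random_device construction, and no shared mutable state.
inline int generate_random_timeout()
{
    thread_local std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<> dist(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS);
    return dist(gen);
}
```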
Actionable comments posted: 5
♻️ Duplicate comments (4)
examples/raft/raft.cpp (2)
803-883
: 🛠️ Refactor suggestion: Avoid detaching threads in sendAppendEntriesRPC.

Detaching threads can lead to resource leaks and undefined behavior, as there is no way to ensure they have completed before the program exits. It's better to manage thread lifetimes properly by storing the threads and joining them appropriately.
Apply this diff to manage the threads:
void consensus_module_t::sendAppendEntriesRPC(node_client_t &client, std::vector<LogEntry> logEntries)
{
-    std::thread(
+    m_replicationThreads.emplace_back(
         [this](node_client_t &client, std::vector<LogEntry> logEntries)
         {
             // Existing code...
         },
         std::ref(client),
         std::move(logEntries)
-    ).detach();
+    );
}

Declare
m_replicationThreads
in your class to store these threads:

std::vector<std::jthread> m_replicationThreads;

Ensure you join these threads appropriately in the
stop()
method to prevent resource leaks.
871-878
:⚠️ Potential issue: Fix infinite recursion in sendAppendEntriesRPC.

In the sendAppendEntriesRPC method, when an AppendEntries RPC fails (success is false
if (!response.success()) { { absl::MutexLock locker{&m_stateMutex}; m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1); } - sendAppendEntriesRPC(client, {}); + // Retry sending AppendEntries with updated nextIndex + // Implement a loop with a termination condition instead of recursion }examples/raft/raft.h (2)
171-171
:⚠️ Potential issueSpecify memory ordering for atomic variables.
The atomic variable
m_leaderHeartbeatReceived
is used without specifying explicit memory ordering. This may lead to subtle synchronization issues. Consider specifying memory ordering when callingload()
andstore()
methods.For example:
// When setting the flag m_leaderHeartbeatReceived.store(true, std::memory_order_release); // When reading the flag if (m_leaderHeartbeatReceived.load(std::memory_order_acquire)) { // ... }Alternatively, if you are protecting access with mutexes, consider whether you need the variable to be atomic at all.
85-85
:⚠️ Potential issueInherit
enable_shared_from_this
publicly.The class
consensus_module_t
should inherit fromstd::enable_shared_from_this
publicly to useshared_from_this()
correctly. Failing to do so can lead to undefined behavior whenshared_from_this()
is called.Ensure the inheritance is public:
-class consensus_module_t : public RaftService::Service,
-                           public TinyKVPPService::Service,
-                           std::enable_shared_from_this<consensus_module_t>
+class consensus_module_t : public RaftService::Service,
+                           public TinyKVPPService::Service,
+                           public std::enable_shared_from_this<consensus_module_t>
🧹 Nitpick comments (3)
examples/raft/main.cpp (3)
1-14
: Improve header organization and encapsulation.Consider:
- Grouping related headers (system, gRPC, project, third-party)
- Encapsulating the global condition variable within a SignalHandler class
// System headers
#include <csignal>
+#include <condition_variable>
+#include <mutex>

// gRPC headers
#include <grpcpp/create_channel.h>
#include <grpcpp/security/credentials.h>
#include <grpcpp/server.h>

// Project headers
#include "raft.h"

// Third-party headers
#include <cxxopts.hpp>
#include <spdlog/spdlog.h>

-std::condition_variable gCv;
+class SignalHandler {
+public:
+    static SignalHandler& instance() {
+        static SignalHandler handler;
+        return handler;
+    }
+    void wait() {
+        std::unique_lock<std::mutex> lock(m_mutex);
+        m_cv.wait(lock);
+    }
+    void notify() {
+        m_cv.notify_all();
+    }
+private:
+    std::condition_variable m_cv;
+    std::mutex m_mutex;
+};
32-38
: Enhance error handling in command-line parsing.Add try-catch block to handle parsing errors and improve error messages.
-    auto parsedOptions = options.parse(argc, argv);
+    cxxopts::ParseResult parsedOptions;
+    try {
+        parsedOptions = options.parse(argc, argv);
+    } catch (const cxxopts::OptionParseException& e) {
+        spdlog::error("Failed to parse command line options: {}", e.what());
+        spdlog::info("{}", options.help());
+        return EXIT_FAILURE;
+    }
75-76
: Make log level configurable via command line.Allow users to specify the desired log level instead of hardcoding it to debug.
+    options.add_options()
+        // ... existing options ...
+        ("log-level", "Log level (trace, debug, info, warn, error, critical)",
+         cxxopts::value<std::string>()->default_value("info"));
+
+    auto logLevel = parsedOptions["log-level"].as<std::string>();
+    if (logLevel == "trace") spdlog::set_level(spdlog::level::trace);
+    else if (logLevel == "debug") spdlog::set_level(spdlog::level::debug);
+    else if (logLevel == "info") spdlog::set_level(spdlog::level::info);
+    else if (logLevel == "warn") spdlog::set_level(spdlog::level::warn);
+    else if (logLevel == "error") spdlog::set_level(spdlog::level::err);
+    else if (logLevel == "critical") spdlog::set_level(spdlog::level::critical);
+    else {
+        spdlog::warn("Unknown log level: {}. Defaulting to info", logLevel);
+        spdlog::set_level(spdlog::level::info);
+    }
-    spdlog::set_level(spdlog::level::debug);

     consensusModule.start();
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- CMakeLists.txt (3 hunks)
- conan/profiles/debug-clang-tsan (1 hunks)
- examples/raft/main.cpp (1 hunks)
- examples/raft/raft.cpp (1 hunks)
- examples/raft/raft.h (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- conan/profiles/debug-clang-tsan
- CMakeLists.txt
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-clang
- GitHub Check: build-gcc
🔇 Additional comments (1)
examples/raft/raft.cpp (1)
31-41
: Optimize random number generation in generate_random_timeout(). Currently, the function
+constexpr const int MIN_TIMEOUT_MS = 150; +constexpr const int MAX_TIMEOUT_MS = 300; auto generate_random_timeout() -> int { - const int minTimeout{150}; - const int maxTimeout{300}; - std::random_device randomDevice; - std::mt19937 gen(randomDevice()); + thread_local std::random_device randomDevice; + thread_local std::mt19937 gen(randomDevice()); std::uniform_int_distribution<> dist(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS); return dist(gen); }
raft::consensus_module_t consensusModule({.m_id = nodeId, .m_ip = nodeIps[nodeId - 1]}, std::move(replicas));
if (!consensusModule.init())
{
    spdlog::error("Failed to initialize the state machine");
    return EXIT_FAILURE;
}
🛠️ Refactor suggestion
Enhance consensus module initialization error handling.
Add timeout and detailed error reporting for initialization failures.
- if (!consensusModule.init())
+ auto initStart = std::chrono::steady_clock::now();
+ while (!consensusModule.init())
{
- spdlog::error("Failed to initialize the state machine");
+ auto elapsed = std::chrono::steady_clock::now() - initStart;
+ if (elapsed > std::chrono::seconds(10)) {
+ spdlog::error("Failed to initialize consensus module after {} seconds",
+ std::chrono::duration_cast<std::chrono::seconds>(elapsed).count());
+ spdlog::error("Last error: {}", consensusModule.getLastError());
+ return EXIT_FAILURE;
+ }
+ std::this_thread::sleep_for(std::chrono::milliseconds(100));
+ }
Committable suggestion skipped: line range outside the PR's diff.
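The init-retry-with-deadline pattern sketched in that diff can be exercised generically. The sketch below is an illustration only: time is abstracted as plain ticks, and the function and parameter names are assumptions rather than project API.

```cpp
#include <cassert>
#include <functional>

// Keep polling `attempt` until it succeeds or `now()` passes `deadlineTick`.
// Returns true on success, false if the deadline expired first, so callers
// get a clear failure signal instead of blocking forever.
inline bool retryUntilDeadline(long long deadlineTick,
                               const std::function<bool()> &attempt,
                               const std::function<long long()> &now)
{
    while (!attempt())
    {
        if (now() > deadlineTick)
        {
            return false;  // give up: deadline exceeded
        }
    }
    return true;
}
```

In production code the fake clock would be `std::chrono::steady_clock::now()` and a short sleep between attempts would avoid busy-waiting.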
examples/raft/main.cpp
Outdated
std::vector<raft::node_client_t> replicas;
for (raft::id_t replicaId{1}; const auto &replicaIp : nodeIps)
{
    if (replicaId != nodeId)
    {
        std::unique_ptr<RaftService::Stub> stub{
            RaftService::NewStub(grpc::CreateChannel(replicaIp, grpc::InsecureChannelCredentials()))};

        replicas.emplace_back(raft::node_config_t{.m_id = replicaId, .m_ip = replicaIp}, std::move(stub));
    }

    ++replicaId;
}
Fix potential index out of bounds error in replica setup.
The replicaId
starts from 1 and increments for each IP, but there's no validation that the number of IPs matches the expected node count.
+ if (nodeIps.size() < nodeId) {
+ spdlog::error("Node ID {} is larger than the number of provided IPs {}", nodeId, nodeIps.size());
+ return EXIT_FAILURE;
+ }
+
std::vector<raft::node_client_t> replicas;
for (raft::id_t replicaId{1}; const auto &replicaIp : nodeIps)
📝 Committable suggestion
if (nodeIps.size() < nodeId) {
    spdlog::error("Node ID {} is larger than the number of provided IPs {}", nodeId, nodeIps.size());
    return EXIT_FAILURE;
}

std::vector<raft::node_client_t> replicas;
for (raft::id_t replicaId{1}; const auto &replicaIp : nodeIps)
{
    if (replicaId != nodeId)
    {
        std::unique_ptr<RaftService::Stub> stub{
            RaftService::NewStub(grpc::CreateChannel(replicaIp, grpc::InsecureChannelCredentials()))};
        replicas.emplace_back(raft::node_config_t{.m_id = replicaId, .m_ip = replicaIp}, std::move(stub));
    }
    ++replicaId;
}
examples/raft/raft.cpp
Outdated
auto consensus_module_t::restorePersistentState() -> bool
{
    {
        auto path = std::filesystem::path(constructFilename(gRaftFilename, m_config.m_id));
        if (!std::filesystem::exists(path))
        {
            spdlog::info("Node={} is running the first time", m_config.m_id);
            return true;
        }

        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error(
                "Node={} is unable to open {} to restore commitIndex and votedFor", m_config.m_id, path.c_str());
            return false;
        }

        ifs >> m_commitIndex >> m_votedFor;
        spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_config.m_id, m_commitIndex, m_votedFor);
    }

    {
        auto path = std::filesystem::path(constructFilename(gLogFilename, m_config.m_id));
        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error("Node={} is unable to open {} to restore log", m_config.m_id, path.c_str());
            return false;
        }

        std::string logLine;
        while (std::getline(ifs, logLine))
        {
            std::stringstream sst(logLine);

            std::string key;
            std::string value;
            std::uint32_t term = 0;

            sst >> key >> value >> term;

            LogEntry logEntry;
            logEntry.set_key(key);
            logEntry.set_value(value);
            logEntry.set_term(term);
            m_log.emplace_back(logEntry);

            spdlog::info("Node={} restored logEntry=[key={}, value={}, term={}]", m_config.m_id, key, value, term);
        }
    }

    return true;
}
Handle file parsing errors in restorePersistentState()
.
While restoring persistent state, the code assumes that the file format is correct. If the file is corrupted, it may cause undefined behavior or crashes. Add error handling to validate the parsed values and handle file corruption gracefully.
Apply this diff to add error handling:
ifs >> m_commitIndex >> m_votedFor;
+if (ifs.fail())
+{
+ spdlog::error("Node={} failed to parse state file {}", m_config.m_id, path.c_str());
+ return false;
+}
+if (m_commitIndex > std::numeric_limits<uint32_t>::max() ||
+ m_votedFor > std::numeric_limits<uint32_t>::max())
+{
+ spdlog::error("Node={} invalid values in state file {}", m_config.m_id, path.c_str());
+ return false;
+}
spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_config.m_id, m_commitIndex, m_votedFor);
📝 Committable suggestion
auto consensus_module_t::restorePersistentState() -> bool
{
    {
        auto path = std::filesystem::path(constructFilename(gRaftFilename, m_config.m_id));
        if (!std::filesystem::exists(path))
        {
            spdlog::info("Node={} is running the first time", m_config.m_id);
            return true;
        }
        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error(
                "Node={} is unable to open {} to restore commitIndex and votedFor", m_config.m_id, path.c_str());
            return false;
        }
        ifs >> m_commitIndex >> m_votedFor;
        if (ifs.fail())
        {
            spdlog::error("Node={} failed to parse state file {}", m_config.m_id, path.c_str());
            return false;
        }
        if (m_commitIndex > std::numeric_limits<uint32_t>::max() ||
            m_votedFor > std::numeric_limits<uint32_t>::max())
        {
            spdlog::error("Node={} invalid values in state file {}", m_config.m_id, path.c_str());
            return false;
        }
        spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_config.m_id, m_commitIndex, m_votedFor);
    }
    {
        auto path = std::filesystem::path(constructFilename(gLogFilename, m_config.m_id));
        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error("Node={} is unable to open {} to restore log", m_config.m_id, path.c_str());
            return false;
        }
        std::string logLine;
        while (std::getline(ifs, logLine))
        {
            std::stringstream sst(logLine);
            std::string key;
            std::string value;
            std::uint32_t term = 0;
            sst >> key >> value >> term;
            LogEntry logEntry;
            logEntry.set_key(key);
            logEntry.set_value(value);
            logEntry.set_term(term);
            m_log.emplace_back(logEntry);
            spdlog::info("Node={} restored logEntry=[key={}, value={}, term={}]", m_config.m_id, key, value, term);
        }
    }
    return true;
}
examples/raft/raft.h
Outdated
std::jthread m_serverThread;

// Temporary in-memory hashtable to store KVs
std::unordered_map<std::string, std::string> m_kv;
Ensure thread safety when accessing m_kv.

The in-memory key-value store m_kv is accessed from multiple threads (e.g., via Put and Get RPCs) but lacks thread protection. This can lead to data races and undefined behavior.

Add a mutex to protect m_kv:
mutable absl::Mutex m_kvMutex;
std::unordered_map<std::string, std::string> m_kv ABSL_GUARDED_BY(m_kvMutex);
And guard access to m_kv with the mutex:
{
absl::MutexLock locker(&m_kvMutex);
m_kv[logEntry.key()] = logEntry.value();
}
Ensure all accesses and modifications to m_kv are properly synchronized.
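The same guarding idea can be illustrated with standard-library primitives. This is a sketch only: std::shared_mutex and the wrapper name are assumptions here, while the review itself uses absl::Mutex with ABSL_GUARDED_BY annotations.

```cpp
#include <cassert>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>
#include <unordered_map>

// Every access path to the map goes through a lock: exclusive for
// writers, shared for readers.
class guarded_kv_t
{
  public:
    void put(const std::string &key, const std::string &value)
    {
        std::unique_lock lock(m_mutex);  // exclusive: one writer at a time
        m_kv[key] = value;
    }

    std::optional<std::string> get(const std::string &key) const
    {
        std::shared_lock lock(m_mutex);  // shared: concurrent readers allowed
        auto it = m_kv.find(key);
        if (it == m_kv.end())
        {
            return std::nullopt;
        }
        return it->second;
    }

  private:
    mutable std::shared_mutex m_mutex;
    std::unordered_map<std::string, std::string> m_kv;
};
```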
Actionable comments posted: 8
🧹 Nitpick comments (2)
examples/raft/raft.h (1)
87-90
: Document error handling for RPC methods.

The RPC methods should document their error handling behavior:
- When do they return false?
- What exceptions can they throw?
+/**
+ * Sends append entries RPC to the node.
+ * @return false if RPC fails or connection is lost
+ * @throws std::runtime_error if stub is null
+ */
 auto appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool;
+/**
+ * Sends request vote RPC to the node.
+ * @return false if RPC fails or connection is lost
+ * @throws std::runtime_error if stub is null
+ */
 auto requestVote(const RequestVoteRequest &request, RequestVoteResponse *response) -> bool;
+/**
+ * Sends put operation RPC to the node.
+ * @return false if RPC fails or connection is lost
+ * @throws std::runtime_error if stub is null
+ */
 auto put(const PutRequest &request, PutResponse *pResponse) -> bool;

examples/raft/raft.cpp (1)
432-516
: Improve election timer reliability and clarity.

The election timer implementation has several areas for improvement:
- Complex timer logic that's hard to understand
- Potential race conditions in state checks
- Magic numbers in timeout calculations
Consider extracting the timer logic into a separate class:
class ElectionTimer
{
    absl::Mutex m_mutex;
    std::atomic<bool> m_heartbeatReceived{false};

public:
    bool waitForHeartbeat(std::stop_token& token)
    {
        const auto deadline = std::chrono::steady_clock::now() +
                              std::chrono::milliseconds(generate_random_timeout());
        absl::MutexLock lock(&m_mutex);
        return m_mutex.AwaitWithDeadline(
            absl::Condition(this, &ElectionTimer::shouldWakeUp),
            absl::FromChrono(deadline));
    }

private:
    bool shouldWakeUp() const
    {
        return m_heartbeatReceived.load() || token.stop_requested();
    }
};
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- examples/raft/raft.cpp (1 hunks)
- examples/raft/raft.h (1 hunks)
- examples/raft/raft_test.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/raft/raft_test.cpp
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-gcc
- GitHub Check: build-clang
🔇 Additional comments (8)
examples/raft/raft.h (6)
3-27
: Add missing essential headers. The header includes are well-organized, but some essential headers are still missing:
- <mutex> for std::mutex and synchronization primitives
- <stdexcept> for std::runtime_error
32-38
: Well-structured type aliases and constants!Good use of type aliases for improved maintainability and self-documenting constant definition.
137-142
: Add missing thread safety annotations for private methods.Several private methods lack proper thread safety annotations despite accessing shared state.
191-195
: Fix atomic variable usage. The atomic variables need explicit memory ordering specifications:
- m_leaderHeartbeatReceived should use proper memory ordering
- m_voteCount should specify memory ordering for increment/decrement operations
206-206
: Add thread safety for key-value store access. The in-memory key-value store m_kv is accessed from multiple threads but lacks synchronization.
102-104
: Fix inheritance of enable_shared_from_this. enable_shared_from_this should be inherited publicly.

examples/raft/raft.cpp (2)
691-695
:⚠️ Potential issueFix incorrect majority calculation.
The current implementation could lead to split votes and doesn't account for the total number of nodes correctly.
auto consensus_module_t::hasMajority(uint32_t votes) const -> bool { - constexpr const double HALF_OF_THE_REPLICAS = 2.0; - return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS; + // Total nodes = replicas + self + const size_t totalNodes = m_replicas.size() + 1; + return votes > totalNodes / 2; }Likely invalid or redundant comment.
376-379
:⚠️ Potential issueFix thread safety in log modification.
The log modification is not properly synchronized, which could lead to data races.
{ - absl::MutexLock locker{&m_stateMutex}; + absl::WriterMutexLock locker{&m_stateMutex}; m_log.push_back(logEntry); }Use
absl::WriterMutexLock
instead ofabsl::MutexLock
to properly protect write operations on shared data.Likely invalid or redundant comment.
examples/raft/raft.h
Outdated
struct tkvpp_absl_try_unlock
{
    tkvpp_absl_try_unlock(absl::Mutex *mu)
        : m_mu{mu}
    {
    }

    ~tkvpp_absl_try_unlock()
    {
        if (m_mu)
        {
            m_mu->Unlock();
        }
    }

  private:
    absl::Mutex *m_mu{nullptr};
};
🛠️ Refactor suggestion
Improve RAII mutex guard implementation.
The `tkvpp_absl_try_unlock` class needs improvements:
- Add null check in constructor
- Prevent potential double unlocks by deleting move operations
struct tkvpp_absl_try_unlock
{
tkvpp_absl_try_unlock(absl::Mutex *mu)
- : m_mu{mu}
+ : m_mu{mu ? mu : throw std::invalid_argument("Mutex pointer cannot be null")}
{
}
+ tkvpp_absl_try_unlock(const tkvpp_absl_try_unlock&) = delete;
+ tkvpp_absl_try_unlock& operator=(const tkvpp_absl_try_unlock&) = delete;
+ tkvpp_absl_try_unlock(tkvpp_absl_try_unlock&&) = delete;
+ tkvpp_absl_try_unlock& operator=(tkvpp_absl_try_unlock&&) = delete;
~tkvpp_absl_try_unlock()
examples/raft/raft.cpp
Outdated
auto node_client_t::appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool
{
    const auto RPC_TIMEOUT = std::chrono::seconds(generate_random_timeout());

    grpc::ClientContext context;
    context.set_deadline(std::chrono::system_clock::now() + RPC_TIMEOUT);
    grpc::Status status = m_stub->AppendEntries(&context, request, response);
    if (!status.ok())
    {
        spdlog::error("AppendEntries RPC to peer id={} ip={} failed. Error code={} message={}",
                      m_config.m_id,
                      m_config.m_ip,
                      static_cast<int>(status.error_code()),
                      status.error_message());
        return false;
    }

    return true;
}

auto node_client_t::requestVote(const RequestVoteRequest &request, RequestVoteResponse *response) -> bool
{
    const auto RPC_TIMEOUT = std::chrono::seconds(generate_random_timeout());

    grpc::ClientContext context;
    context.set_deadline(std::chrono::system_clock::now() + RPC_TIMEOUT);

    grpc::Status status = m_stub->RequestVote(&context, request, response);
    if (!status.ok())
    {
        spdlog::error("RequestVote RPC to peer id={} ip={} failed. Error code={} message={}",
                      m_config.m_id,
                      m_config.m_ip,
                      static_cast<int>(status.error_code()),
                      status.error_message());
        return false;
    }

    return true;
}

auto node_client_t::put(const PutRequest &request, PutResponse *pResponse) -> bool
{
    grpc::ClientContext context;
    context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(generate_random_timeout()));

    grpc::Status status = m_kvStub->Put(&context, request, pResponse);
    if (!status.ok())
    {
        spdlog::error("Put RPC call failed. Error code={} and message={}",
                      static_cast<int>(status.error_code()),
                      status.error_message());
        return false;
    }

    return true;
}
🛠️ Refactor suggestion
Refactor duplicated RPC error handling code.
The RPC methods (appendEntries, requestVote, put) have duplicated error handling logic and inconsistent timeout handling.
Extract common RPC handling logic into a template method:
template<typename Request, typename Response>
bool executeRPC(const char* operation, const Request& req, Response* resp,
std::function<grpc::Status(grpc::ClientContext*, const Request&, Response*)> rpc) {
const auto RPC_TIMEOUT = std::chrono::seconds(generate_random_timeout());
grpc::ClientContext context;
context.set_deadline(std::chrono::system_clock::now() + RPC_TIMEOUT);
grpc::Status status = rpc(&context, req, resp);
if (!status.ok()) {
spdlog::error("{} RPC to peer id={} ip={} failed. Error code={} message={}",
operation,
m_config.m_id,
m_config.m_ip,
static_cast<int>(status.error_code()),
status.error_message());
return false;
}
return true;
}
Then use it in RPC methods:
auto node_client_t::appendEntries(const AppendEntriesRequest& request,
AppendEntriesResponse* response) -> bool {
return executeRPC("AppendEntries", request, response,
[this](auto ctx, auto& req, auto* resp) {
return m_stub->AppendEntries(ctx, req, resp);
});
}
examples/raft/raft.cpp
Outdated
void consensus_module_t::sendAppendEntriesRPC(node_client_t &client, std::vector<LogEntry> logEntries)
{
    std::thread(
        [this](node_client_t &client, std::vector<LogEntry> logEntries)
        {
            AppendEntriesRequest request;
            {
                absl::MutexLock locker{&m_stateMutex};

                request.set_term(m_currentTerm);
                request.set_prevlogterm(logEntries.front().term());
                request.set_prevlogindex(getLogTerm(logEntries.front().index() - 1));
                request.set_leadercommit(m_commitIndex);
                request.set_senderid(m_config.m_id);

                for (const auto &logEntry : logEntries)
                {
                    request.add_entries()->CopyFrom(logEntry);
                }
            }

            AppendEntriesResponse response;
            auto status = client.appendEntries(request, &response);
            if (!status)
            {
                absl::MutexLock locker{&m_stateMutex};
                spdlog::error("Node={} failed to send AppendEntries RPC to peer={} at term={}",
                              m_config.m_id,
                              client.id(),
                              m_currentTerm);
                return;
            }

            {
                absl::MutexLock locker{&m_stateMutex};
                if (response.term() > m_currentTerm)
                {
                    becomeFollower(response.term());
                    return;
                }
            }

            if (response.success())
            {
                absl::MutexLock locker{&m_stateMutex};
                m_matchIndex[client.id()] = response.match_index();
                m_nextIndex[client.id()] = response.match_index() + 1;

                uint32_t majorityIndex = findMajorityIndexMatch();
                if (majorityIndex > m_commitIndex && m_log[majorityIndex - 1].term() == m_currentTerm)
                {
                    if (!updatePersistentState(majorityIndex, std::nullopt))
                    {
                        spdlog::error("Node={} is unable to persist commitIndex={}", m_config.m_id, m_commitIndex);
                        return;
                    }

                    // Apply successful replication to the state machine e.g. in-memory hash-table
                    while (m_lastApplied < m_commitIndex)
                    {
                        ++m_lastApplied;
                        m_kv[m_log[m_lastApplied - 1].key()] = m_log[m_lastApplied - 1].value();
                    }

                    return;
                }
            }

            if (!response.success())
            {
                {
                    absl::MutexLock locker{&m_stateMutex};
                    m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
                }
                sendAppendEntriesRPC(client, {});
            }
        },
        std::ref(client),
        logEntries)
        .detach();
}
Fix critical thread safety and lifecycle issues in sendAppendEntriesRPC.
The method has several critical issues:
- Uses `std::thread::detach()` which can lead to resource leaks
- Potential infinite recursion in failure handling
- Unsafe access to shared state
-void consensus_module_t::sendAppendEntriesRPC(node_client_t &client, std::vector<LogEntry> logEntries)
+void consensus_module_t::sendAppendEntriesRPC(node_client_t &client, const std::vector<LogEntry>& logEntries)
{
- std::thread(
+ m_replicationThreads.emplace_back(
[this](node_client_t &client, std::vector<LogEntry> logEntries)
{
// ... existing code ...
if (!response.success())
{
{
absl::MutexLock locker{&m_stateMutex};
m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
}
-                sendAppendEntriesRPC(client, {}); // Recursive call!
+                // Retry via a bounded backoff loop in the caller instead of
+                // recursing here (a bare `continue` would not compile in this lambda)
}
},
std::ref(client),
- logEntries)
- .detach();
+ logEntries);
}
Also add this member to the class to manage thread lifecycles:
std::vector<std::jthread> m_replicationThreads;
examples/raft/raft.cpp
Outdated
    m_voteCount{0}
{
    assert(m_config.m_id > 0);
    // assert(replicas.size() > 0);
Uncomment and fix the replicas size assertion.
The assertion for checking replicas size is commented out, which could lead to undefined behavior if an empty replicas vector is provided.
- // assert(replicas.size() > 0);
+ assert(!replicas.empty());
examples/raft/raft.cpp
Outdated
auto *raftService = dynamic_cast<RaftService::Service *>(this);
if (raftService == nullptr)
{
    throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to RaftService"));
}
builder.RegisterService(raftService);

auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this);
if (tkvppService == nullptr)
{
    throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to TinyKVPPService"));
}
builder.RegisterService(tkvppService);
🛠️ Refactor suggestion
Improve service registration error handling.
The dynamic casts and service registration could be more robust with better error messages and validation.
- auto *raftService = dynamic_cast<RaftService::Service *>(this);
- if (raftService == nullptr)
+ if (auto *raftService = dynamic_cast<RaftService::Service *>(this))
{
- throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to RaftService"));
+ builder.RegisterService(raftService);
+ }
+ else
+ {
+ throw std::runtime_error(
+ fmt::format("Node {} failed to register RaftService: invalid cast", m_config.m_id));
}
- builder.RegisterService(raftService);
- auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this);
- if (tkvppService == nullptr)
+ if (auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this))
{
- throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to TinyKVPPService"));
+ builder.RegisterService(tkvppService);
+ }
+ else
+ {
+ throw std::runtime_error(
+ fmt::format("Node {} failed to register TinyKVPPService: invalid cast", m_config.m_id));
}
- builder.RegisterService(tkvppService);
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
auto *raftService = dynamic_cast<RaftService::Service *>(this); | |
if (raftService == nullptr) | |
{ | |
throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to RaftService")); | |
} | |
builder.RegisterService(raftService); | |
auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this); | |
if (tkvppService == nullptr) | |
{ | |
throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to TinyKVPPService")); | |
} | |
builder.RegisterService(tkvppService); | |
if (auto *raftService = dynamic_cast<RaftService::Service *>(this)) | |
{ | |
builder.RegisterService(raftService); | |
} | |
else | |
{ | |
throw std::runtime_error( | |
fmt::format("Node {} failed to register RaftService: invalid cast", m_config.m_id)); | |
} | |
if (auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this)) | |
{ | |
builder.RegisterService(tkvppService); | |
} | |
else | |
{ | |
throw std::runtime_error( | |
fmt::format("Node {} failed to register TinyKVPPService: invalid cast", m_config.m_id)); | |
} |
Actionable comments posted: 1
♻️ Duplicate comments (5)
examples/raft/main.cpp (2)
68-68 (⚠️ Potential issue): Potential out-of-bounds access when accessing `nodeIps[nodeId - 1]`.

The code accesses `nodeIps[nodeId - 1]` without validating that `nodeId` is within the bounds of `nodeIps`. If `nodeId` is greater than `nodeIps.size()`, this will result in undefined behavior due to out-of-bounds access.

This issue was previously raised but remains unaddressed.

Suggested fix: add a check to ensure that `nodeId` is within the valid range before accessing `nodeIps[nodeId - 1]`:

+ if (nodeId > nodeIps.size())
+ {
+     spdlog::error("Node ID {} is larger than the number of provided IPs {}", nodeId, nodeIps.size());
+     return EXIT_FAILURE;
+ }
  raft::consensus_module_t consensusModule({.m_id = nodeId, .m_ip = nodeIps[nodeId - 1]}, std::move(replicas));
47-52 (🛠️ Refactor suggestion): Enhance IP address validation.

Currently, the code only checks if the list of node IPs is empty. Consider adding validation for:

- Correct IP address format
- Port number range
- Duplicate addresses

This suggestion was previously made but not yet implemented.

Suggested improvement: implement a validation function for IP addresses and use it to validate each IP in the list:

bool validateIpAddress(const std::string &ip)
{
    std::regex ipRegex(R"(^(?:\d{1,3}\.){3}\d{1,3}:\d{1,5}$)");
    if (!std::regex_match(ip, ipRegex))
    {
        return false;
    }
    // Additional validation for IP octet ranges and port numbers can be added here
    return true;
}

Then, use it in your code:

  auto nodeIps = parsedOptions["nodes"].as<std::vector<raft::ip_t>>();
  if (nodeIps.empty())
  {
      spdlog::error("List of node IPs can't be empty");
      return EXIT_FAILURE;
  }
+ for (const auto &ip : nodeIps)
+ {
+     if (!validateIpAddress(ip))
+     {
+         spdlog::error("Invalid IP address format: {}", ip);
+         return EXIT_FAILURE;
+     }
+ }

examples/raft/raft_test.cpp (2)
1-2 (🛠️ Refactor suggestion): Avoid mixing Catch2 and Google Test frameworks.

Including both Catch2 and Google Test in the same test file can lead to conflicts and maintenance issues. It is recommended to choose one testing framework for consistency.

This issue was raised previously but remains unresolved.

Suggested action: standardize on a single testing framework for this test file. If possible, choose either Catch2 or Google Test and refactor the tests accordingly.

Also applies to: 14-14

72-82 (⚠️ Potential issue): Overwriting `aeResponse` may cause inconsistent test behavior.

In the test case, `aeResponse` is modified between mock expectations, which could lead to unexpected behavior due to shared state.

This issue was previously noted but still exists.

Suggested fix: use separate `AppendEntriesResponse` instances for each expectation to avoid unintended side effects:

  AppendEntriesResponse aeResponse2;
  aeResponse2.set_responderid(2);
  aeResponse2.set_success(true);
  EXPECT_CALL(*mockStub2, AppendEntries)
      .Times(testing::AtLeast(1))
-     .WillRepeatedly(testing::DoAll(testing::SetArgPointee<2>(aeResponse), testing::Return(grpc::Status::OK)));
+     .WillRepeatedly(testing::DoAll(testing::SetArgPointee<2>(aeResponse2), testing::Return(grpc::Status::OK)));

  AppendEntriesResponse aeResponse3;
  aeResponse3.set_responderid(3);
  aeResponse3.set_success(true);
  EXPECT_CALL(*mockStub3, AppendEntries)
      .Times(testing::AtLeast(1))
-     .WillRepeatedly(testing::DoAll(testing::SetArgPointee<2>(aeResponse), testing::Return(grpc::Status::OK)));
+     .WillRepeatedly(testing::DoAll(testing::SetArgPointee<2>(aeResponse3), testing::Return(grpc::Status::OK)));

This ensures that each mock expectation uses its own response object.
examples/raft/raft.h (1)
218-218 (⚠️ Potential issue): Ensure thread safety when accessing `m_kv`.

The in-memory key-value store `m_kv` is accessed from multiple threads (e.g., via `Put` and `Get` RPC handlers) without synchronization, which can lead to data races and undefined behavior.

This issue was previously highlighted but remains unaddressed.

Suggested fix: add a mutex to protect `m_kv` and use it to guard all accesses and modifications:

+ mutable absl::Mutex m_kvMutex;
  // Temporary in-memory hashtable to store KVs
- std::unordered_map<std::string, std::string> m_kv;
+ std::unordered_map<std::string, std::string> m_kv ABSL_GUARDED_BY(m_kvMutex);

Then, in your methods, lock the mutex when accessing `m_kv`:

grpc::Status consensus_module_t::Put(grpc::ServerContext *pContext, const PutRequest *pRequest, PutResponse *pResponse)
{
    absl::MutexLock lock(&m_kvMutex);
    m_kv[pRequest->key()] = pRequest->value();
    // ...
}

Ensure that all accesses to `m_kv` in `Put` and `Get` methods are properly synchronized.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `examples/raft/main.cpp` (1 hunks)
- `examples/raft/raft.cpp` (1 hunks)
- `examples/raft/raft.h` (1 hunks)
- `examples/raft/raft_test.cpp` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- examples/raft/raft.cpp
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-clang
- GitHub Check: build-gcc
🔇 Additional comments (1)
examples/raft/raft.h (1)
148-153: Add thread safety annotations to private methods.

Several private methods that access shared state lack thread safety annotations. Adding these annotations helps with static analysis and ensures proper synchronization.

This suggestion was made earlier but has not been implemented.

Suggested improvement: add appropriate thread safety annotations to the methods:

- void startElection();
+ void startElection() ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
  void becomeFollower(uint32_t newTerm) ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
  void becomeLeader() ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
- void sendHeartbeat(raft_node_grpc_client_t &client);
+ void sendHeartbeat(raft_node_grpc_client_t &client) ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
- void sendAppendEntriesRPC(raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries);
+ void sendAppendEntriesRPC(raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries)
+     ABSL_EXCLUSIVE_LOCKS_REQUIRED(m_stateMutex);
- [[nodiscard]] auto waitForMajorityReplication(uint32_t logIndex) -> bool;
+ [[nodiscard]] auto waitForMajorityReplication(uint32_t logIndex) -> bool
+     ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex);

This helps to document and enforce the locking requirements for these methods.
static void signalHandler(int sig)
{
    if (sig == SIGTERM || sig == SIGINT)
    {
        gCv.notify_all();
    }
}
Critical issue: unsafe use of `std::condition_variable` in a signal handler.

Calling `std::condition_variable::notify_all()` within a signal handler is unsafe and can lead to undefined behavior because it is not an async-signal-safe function. According to the C++ standard, only a limited set of functions are safe to call from within a signal handler.

Suggested fix: use an `std::atomic<bool>` flag to signal the main thread to exit, instead of using a condition variable. Modify the code as follows:
-std::condition_variable gCv;
-static void signalHandler(int sig)
+std::atomic<bool> gShutdown{false};
+void signalHandler(int sig)
{
if (sig == SIGTERM || sig == SIGINT)
{
- gCv.notify_all();
+ gShutdown.store(true);
}
}
...
- std::mutex mtx;
- std::unique_lock<std::mutex> lock(mtx);
- gCv.wait(lock);
+ while (!gShutdown.load())
+ {
+ std::this_thread::sleep_for(std::chrono::milliseconds(100));
+ }
This replaces the condition variable with an atomic flag that the main thread can periodically check.
Committable suggestion skipped: line range outside the PR's diff.
Actionable comments posted: 7
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `examples/raft/CMakeLists.txt` (1 hunks)
- `examples/raft/main.cpp` (1 hunks)
- `examples/raft/raft.cpp` (1 hunks)
- `examples/raft/raft_test.cpp` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- examples/raft/CMakeLists.txt
- examples/raft/main.cpp
- examples/raft/raft_test.cpp
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-gcc
- GitHub Check: build-clang
🔇 Additional comments (8)
examples/raft/raft.cpp (8)
95-114: Extract common RPC handling logic.

The RPC error handling code is duplicated across different methods and timeout handling is inconsistent.

Also applies to: 116-135

166-178 (🛠️ Refactor suggestion): Improve service registration error handling.

The dynamic casts and service registration could be more robust with better error messages and validation.

- auto *raftService = dynamic_cast<RaftService::Service *>(this);
- if (raftService == nullptr)
+ if (auto *raftService = dynamic_cast<RaftService::Service *>(this))
  {
-     throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to RaftService"));
+     builder.RegisterService(raftService);
+ }
+ else
+ {
+     throw std::runtime_error(
+         fmt::format("Node {} failed to register RaftService: invalid cast", m_config.m_id));
  }
- builder.RegisterService(raftService);
- auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this);
- if (tkvppService == nullptr)
+ if (auto *tkvppService = dynamic_cast<TinyKVPPService::Service *>(this))
  {
-     throw std::runtime_error(fmt::format("Failed to dynamic_cast ConsensusModule to TinyKVPPService"));
+     builder.RegisterService(tkvppService);
+ }
+ else
+ {
+     throw std::runtime_error(
+         fmt::format("Node {} failed to register TinyKVPPService: invalid cast", m_config.m_id));
  }
- builder.RegisterService(tkvppService);

Likely invalid or redundant comment.
1064-1067 (⚠️ Potential issue): Handle file parsing errors in restorePersistentState.

The method assumes that the file format is correct. If the file is corrupted, it may cause undefined behavior.

Apply this diff to add error handling:

 ifs >> m_commitIndex >> m_votedFor;
+if (ifs.fail())
+{
+    spdlog::error("Node={} failed to parse state file {}", m_config.m_id, path.c_str());
+    return false;
+}
-m_votedFor = 0; // Remove this line as it overwrites the loaded value
 spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_config.m_id, m_commitIndex, m_votedFor);

Likely invalid or redundant comment.
407-410 (⚠️ Potential issue): Ensure thread safety when modifying shared data structures.

In the `Put` method, `m_log` is being modified without proper synchronization, which may lead to data races.

Apply this diff to enforce thread safety:

 {
-    absl::MutexLock locker{&m_stateMutex};
+    absl::WriterMutexLock locker{&m_stateMutex};
     m_log.push_back(logEntry);
 }

Use `absl::WriterMutexLock` instead of `absl::MutexLock` for write operations.

Likely invalid or redundant comment.
719-723 (⚠️ Potential issue): Fix incorrect majority calculation.

The majority calculation is incorrect and could lead to split votes.

 auto consensus_module_t::hasMajority(uint32_t votes) const -> bool
 {
-    constexpr const double HALF_OF_THE_REPLICAS = 2.0;
-    return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS;
+    // Total nodes = replicas + self
+    const size_t totalNodes = m_replicas.size() + 1;
+    return votes > totalNodes / 2;
 }

Likely invalid or redundant comment.
736-739 (⚠️ Potential issue): Reset vote count when becoming leader.

The vote count should be reset when becoming leader to avoid interference with future elections.

  m_state = NodeState::LEADER;
+ m_voteCount = 0;
  spdlog::info("Node={} become a leader at term={}", m_config.m_id, m_currentTerm);

Likely invalid or redundant comment.
913-921 (⚠️ Potential issue): Add bounds checking to getLogTerm.

The array access is not properly bounds-checked and could lead to undefined behavior.

 auto consensus_module_t::getLogTerm(uint32_t index) const -> uint32_t
 {
     if (index == 0 || index > m_log.size())
     {
         return 0;
     }
-    return m_log[index - 1].term();
+    try {
+        return m_log.at(index - 1).term();
+    } catch (const std::out_of_range&) {
+        spdlog::error("Node={} attempted to access invalid log index {}", m_config.m_id, index);
+        return 0;
+    }
 }

Likely invalid or redundant comment.
248-261 (⚠️ Potential issue): Fix potential buffer overflow in log entry handling.

The log entry insertion doesn't properly validate array bounds and could cause buffer overflow. Also, the insertion method is inefficient.

-    auto newEntryStart = pRequest->prevlogindex() + 1;
-    for (auto idx{0}; idx < pRequest->entries().size(); ++idx)
+    const auto newEntryStart = pRequest->prevlogindex() + 1;
+    if (newEntryStart > m_log.size() + 1)
     {
+        pResponse->set_term(m_currentTerm);
+        pResponse->set_success(false);
+        pResponse->set_responderid(m_config.m_id);
+        return grpc::Status::OK;
+    }
+
+    for (size_t idx = 0; idx < pRequest->entries().size(); ++idx)
     {
-        const auto newIdx{newEntryStart + idx - 1};
+        const auto newIdx = newEntryStart + idx - 1;
         if (m_log.size() >= newEntryStart + idx && m_log[newIdx].term() != pRequest->entries(idx).term())
         {
             m_log.resize(newIdx);
             break;
         }
     }
-    m_log.insert(m_log.end(), pRequest->entries().begin(), pRequest->entries().end());
+
+    // Use reserve to prevent multiple reallocations
+    m_log.reserve(m_log.size() + pRequest->entries().size());
+    for (const auto& entry : pRequest->entries()) {
+        m_log.push_back(entry);
+    }

Likely invalid or redundant comment.
examples/raft/raft.cpp
Outdated
auto tkvpp_node_grpc_client_t::put(const PutRequest &request, PutResponse *pResponse) -> bool
{
    grpc::ClientContext context;
    context.set_deadline(std::chrono::system_clock::now() + std::chrono::seconds(generate_random_timeout()));

    grpc::Status status = m_stub->Put(&context, request, pResponse);
    if (!status.ok())
    {
        spdlog::error("Put RPC call failed. Error code={} and message={}",
                      static_cast<int>(status.error_code()),
                      status.error_message());
        return false;
    }

    return true;
}
🛠️ Refactor suggestion
Extract common RPC handling logic.
The RPC error handling code is duplicated across different methods and timeout handling is inconsistent.
Consider extracting common RPC handling logic into a template method:
template<typename Request, typename Response>
bool executeRPC(const char* operation, const Request& req, Response* resp,
std::function<grpc::Status(grpc::ClientContext*, const Request&, Response*)> rpc) {
const auto RPC_TIMEOUT = std::chrono::seconds(generate_random_timeout());
grpc::ClientContext context;
context.set_deadline(std::chrono::system_clock::now() + RPC_TIMEOUT);
grpc::Status status = rpc(&context, req, resp);
if (!status.ok()) {
spdlog::error("{} RPC to peer id={} ip={} failed. Error code={} message={}",
operation,
m_config.m_id,
m_config.m_ip,
static_cast<int>(status.error_code()),
status.error_message());
return false;
}
return true;
}
Then use it in RPC methods:
auto tkvpp_node_grpc_client_t::put(const PutRequest& request, PutResponse* response) -> bool {
return executeRPC("Put", request, response,
[this](auto ctx, auto& req, auto* resp) {
return m_stub->Put(ctx, req, resp);
});
}
examples/raft/raft.cpp
Outdated
void consensus_module_t::sendAppendEntriesRPC(raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries)
{
    std::thread(
        [this](raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries)
        {
            AppendEntriesRequest request;
            {
                absl::MutexLock locker{&m_stateMutex};

                request.set_term(m_currentTerm);
                request.set_prevlogterm(logEntries.front().term());
                request.set_prevlogindex(getLogTerm(logEntries.front().index() - 1));
                request.set_leadercommit(m_commitIndex);
                request.set_senderid(m_config.m_id);

                for (const auto &logEntry : logEntries)
                {
                    request.add_entries()->CopyFrom(logEntry);
                }
            }

            AppendEntriesResponse response;
            auto status = client.appendEntries(request, &response);
            if (!status)
            {
                absl::MutexLock locker{&m_stateMutex};
                spdlog::error("Node={} failed to send AppendEntries RPC to peer={} at term={}",
                              m_config.m_id,
                              client.id(),
                              m_currentTerm);
                return;
            }

            {
                absl::MutexLock locker{&m_stateMutex};
                if (response.term() > m_currentTerm)
                {
                    becomeFollower(response.term());
                    return;
                }
            }

            if (response.success())
            {
                absl::MutexLock locker{&m_stateMutex};
                m_matchIndex[client.id()] = response.match_index();
                m_nextIndex[client.id()] = response.match_index() + 1;

                uint32_t majorityIndex = findMajorityIndexMatch();
                if (majorityIndex > m_commitIndex && m_log[majorityIndex - 1].term() == m_currentTerm)
                {
                    if (!updatePersistentState(majorityIndex, std::nullopt))
                    {
                        spdlog::error("Node={} is unable to persist commitIndex={}", m_config.m_id, m_commitIndex);
                        return;
                    }

                    // Apply successful replication to the state machine e.g. in-memory hash-table
                    while (m_lastApplied < m_commitIndex)
                    {
                        ++m_lastApplied;
                        m_kv[m_log[m_lastApplied - 1].key()] = m_log[m_lastApplied - 1].value();
                    }

                    return;
                }
            }

            if (!response.success())
            {
                {
                    absl::MutexLock locker{&m_stateMutex};
                    m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
                }
                sendAppendEntriesRPC(client, {});
            }
        },
        std::ref(client),
        logEntries)
        .detach();
}
Fix critical thread safety and lifecycle issues in sendAppendEntriesRPC.
The method has several critical issues:
- Uses std::thread::detach(), which can lead to resource leaks
- Potential infinite recursion in failure handling
- Unsafe access to shared state
-void consensus_module_t::sendAppendEntriesRPC(raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries)
+void consensus_module_t::sendAppendEntriesRPC(raft_node_grpc_client_t &client, const std::vector<LogEntry>& logEntries)
{
- std::thread(
+ m_replicationThreads.emplace_back(
[this](raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries)
{
// ... existing code ...
if (!response.success())
{
{
absl::MutexLock locker{&m_stateMutex};
m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
}
- sendAppendEntriesRPC(client, {}); // Recursive call!
+ // Use a retry loop instead of recursion
+ continue;
}
},
std::ref(client),
- logEntries)
- .detach();
+ logEntries);
}
Also add this member to the class to manage thread lifecycles:
std::vector<std::jthread> m_replicationThreads;
auto consensus_module_t::flushPersistentState() -> bool
{
    // Flush commitIndex and votedFor
    {
        auto path = std::filesystem::path(constructFilename(gRaftFilename, m_config.m_id));
        std::ofstream fsa(path, std::fstream::out | std::fstream::trunc);
        if (!fsa.is_open())
        {
            spdlog::error("Node={} is unable to open {} to flush commitIndex={} and votedFor={}",
                          m_config.m_id,
                          path.c_str(),
                          m_commitIndex,
                          m_votedFor);
            return false;
        }

        fsa << m_commitIndex << " " << m_votedFor << "\n";
        if (fsa.fail())
        {
            spdlog::error("Node={} is unable to write into {} the commitIndex={} and votedFor={}",
                          m_config.m_id,
                          path.c_str(),
                          m_commitIndex,
                          m_votedFor);
            return false;
        }

        fsa.flush();
        if (fsa.fail())
        {
            return false;
        }
        // TODO(lnikon): ::fsync(fsa->handle());
    }

    // Flush the log
    {
        auto path = std::filesystem::path(constructFilename(gLogFilename, m_config.m_id));
        std::ofstream fsa(path, std::fstream::out | std::fstream::trunc);
        if (!fsa.is_open())
        {
            spdlog::error("Node={} is unable to open {} to flush the log", m_config.m_id, path.c_str());
            return false;
        }

        for (const auto &entry : m_log)
        {
            fsa << entry.key() << " " << entry.value() << " " << entry.term() << "\n";

            if (fsa.fail())
            {
                spdlog::error("Node={} is unable to write into {} the commitIndex={} and votedFor={}",
                              m_config.m_id,
                              path.c_str(),
                              m_commitIndex,
                              m_votedFor);
                return false;
            }
        }
        fsa.flush();
        // TODO(lnikon): ::fsync(fsa->handle());
    }

    return true;
}
🛠️ Refactor suggestion
Improve persistence reliability and error handling.
The persistence implementation lacks atomic write guarantees and proper error handling. The TODO comments about fsync need to be addressed.
- Use atomic file operations to prevent data corruption
- Implement proper fsync calls
- Add comprehensive error handling
auto consensus_module_t::flushPersistentState() -> bool
{
+ const auto tmpPath = path.string() + ".tmp";
// Flush commitIndex and votedFor
{
auto path = std::filesystem::path(constructFilename(gRaftFilename, m_config.m_id));
- std::ofstream fsa(path, std::fstream::out | std::fstream::trunc);
+ std::ofstream fsa(tmpPath, std::fstream::out | std::fstream::trunc);
if (!fsa.is_open()) {
return false;
}
fsa << m_commitIndex << " " << m_votedFor << "\n";
fsa.flush();
+    // NOTE: std::ofstream does not portably expose a file descriptor, so a
+    // hypothetical helper syncToDisk() (wrapping POSIX ::fsync() on a
+    // descriptor obtained via open()) is assumed here.
+    if (!syncToDisk(fsa)) {
+        return false;
+    }
+ fsa.close();
+
+ if (std::rename(tmpPath.c_str(), path.c_str()) != 0) {
+ return false;
+ }
}
// Similar changes for log persistence...
return true;
}
Committable suggestion skipped: line range outside the PR's diff.
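Since std::ofstream does not portably expose a file descriptor for ::fsync(), durable atomic updates are usually written with POSIX calls directly: write to a temporary file, fsync it, then rename it over the target. A minimal sketch (file names and helper names are illustrative, not from the PR):

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cstdio>
#include <string>

// Sketch: readers either see the old file or the complete new one,
// because rename(2) replaces the target atomically.
bool atomicWrite(const std::string &path, const std::string &data)
{
    const std::string tmp = path + ".tmp";
    int fd = ::open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
    {
        return false;
    }
    const bool wrote  = ::write(fd, data.data(), data.size()) == static_cast<ssize_t>(data.size());
    const bool synced = wrote && ::fsync(fd) == 0; // flush to stable storage
    const bool closed = ::close(fd) == 0;
    if (!(synced && closed))
    {
        ::unlink(tmp.c_str());
        return false;
    }
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

std::string readAll(const std::string &path)
{
    std::string out(256, '\0');
    int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0)
    {
        return {};
    }
    const ssize_t n = ::read(fd, out.data(), out.size());
    ::close(fd);
    out.resize(n > 0 ? static_cast<size_t>(n) : 0);
    return out;
}
```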
auto consensus_module_t::waitForMajorityReplication(uint32_t logIndex) -> bool
{
    constexpr const auto replicationTimeout{5};

    auto hasMajority = [this, logIndex]() ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex)
    {
        uint32_t count = 1;
        for (const auto &[peer, matchIdx] : m_matchIndex)
        {
            count += matchIdx >= logIndex;
        }
        return count >= (m_replicas.size() + 1) / 2 + 1;
    };

    spdlog::info("Node={} is waiting for majority to agree on logIndex={}", m_config.m_id, logIndex);
    return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority), absl::Seconds(replicationTimeout));
}
🛠️ Refactor suggestion
Improve replication wait mechanism.
The timeout value is hardcoded and the majority calculation could be improved.
- constexpr const auto replicationTimeout{5};
+ static constexpr const auto REPLICATION_TIMEOUT_SECONDS = 5;
+ static constexpr const auto MIN_MAJORITY = 2; // Minimum number of nodes for consensus
auto hasMajority = [this, logIndex]() ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex)
{
uint32_t count = 1; // Count self
for (const auto &[peer, matchIdx] : m_matchIndex)
{
count += matchIdx >= logIndex;
}
- return count >= (m_replicas.size() + 1) / 2 + 1;
+ const size_t majorityCount = (m_replicas.size() + 1) / 2 + 1;
+ return count >= std::max(majorityCount, MIN_MAJORITY);
};
spdlog::info("Node={} is waiting for majority to agree on logIndex={}", m_config.m_id, logIndex);
- return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority), absl::Seconds(replicationTimeout));
+ return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority),
+ absl::Seconds(REPLICATION_TIMEOUT_SECONDS));
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

auto consensus_module_t::waitForMajorityReplication(uint32_t logIndex) -> bool
{
    constexpr const auto replicationTimeout{5};
    auto hasMajority = [this, logIndex]() ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex)
    {
        uint32_t count = 1;
        for (const auto &[peer, matchIdx] : m_matchIndex)
        {
            count += matchIdx >= logIndex;
        }
        return count >= (m_replicas.size() + 1) / 2 + 1;
    };
    spdlog::info("Node={} is waiting for majority to agree on logIndex={}", m_config.m_id, logIndex);
    return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority), absl::Seconds(replicationTimeout));
}

After:

auto consensus_module_t::waitForMajorityReplication(uint32_t logIndex) -> bool
{
    static constexpr const auto REPLICATION_TIMEOUT_SECONDS = 5;
    static constexpr const auto MIN_MAJORITY = 2; // Minimum number of nodes for consensus
    auto hasMajority = [this, logIndex]() ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex)
    {
        uint32_t count = 1; // Count self
        for (const auto &[peer, matchIdx] : m_matchIndex)
        {
            count += matchIdx >= logIndex;
        }
        const size_t majorityCount = (m_replicas.size() + 1) / 2 + 1;
        return count >= std::max(majorityCount, MIN_MAJORITY);
    };
    spdlog::info("Node={} is waiting for majority to agree on logIndex={}", m_config.m_id, logIndex);
    return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority),
                                         absl::Seconds(REPLICATION_TIMEOUT_SECONDS));
}
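The quorum arithmetic both versions rely on reduces to floor(N/2) + 1 acknowledgements out of N total nodes (replicas plus self). A tiny hypothetical helper, not the PR's method, makes the edge cases easy to check:

```cpp
#include <cstddef>

// Sketch: Raft needs a strict majority of the full cluster, counting the
// leader itself among the acknowledgements.
bool hasQuorum(std::size_t acks, std::size_t totalNodes)
{
    return acks >= totalNodes / 2 + 1;
}
```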
auto generate_random_timeout() -> int
{
    const int minTimeout{150};
    const int maxTimeout{300};

    std::random_device randomDevice;
    std::mt19937 gen(randomDevice());
    std::uniform_int_distribution<> dist(minTimeout, maxTimeout);

    return dist(gen);
}
Fix thread safety issues in random number generation.
The random number generator is not thread-safe and could lead to race conditions when called from multiple threads. Also, the timeout values should be configurable constants.
Apply this diff to fix these issues:
+constexpr const int MIN_TIMEOUT_MS = 150;
+constexpr const int MAX_TIMEOUT_MS = 300;
auto generate_random_timeout() -> int
{
- const int minTimeout{150};
- const int maxTimeout{300};
- std::random_device randomDevice;
- std::mt19937 gen(randomDevice());
+ thread_local std::random_device randomDevice;
+ thread_local std::mt19937 gen(randomDevice());
- std::uniform_int_distribution<> dist(minTimeout, maxTimeout);
+ std::uniform_int_distribution<> dist(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS);
return dist(gen);
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

auto generate_random_timeout() -> int
{
    const int minTimeout{150};
    const int maxTimeout{300};
    std::random_device randomDevice;
    std::mt19937 gen(randomDevice());
    std::uniform_int_distribution<> dist(minTimeout, maxTimeout);
    return dist(gen);
}

After:

auto generate_random_timeout() -> int
{
    constexpr const int MIN_TIMEOUT_MS = 150;
    constexpr const int MAX_TIMEOUT_MS = 300;
    thread_local std::random_device randomDevice;
    thread_local std::mt19937 gen(randomDevice());
    std::uniform_int_distribution<> dist(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS);
    return dist(gen);
}
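The thread_local pattern suggested above can be reduced to a few lines for reference (a sketch, not the PR's function): the engine is constructed once per thread, so concurrent callers never share generator state.

```cpp
#include <random>

// Sketch: each thread gets its own lazily-initialized Mersenne Twister,
// seeded once from std::random_device.
int randomTimeoutMs(int minMs, int maxMs)
{
    thread_local std::mt19937 gen{std::random_device{}()};
    std::uniform_int_distribution<> dist(minMs, maxMs);
    return dist(gen);
}
```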
auto consensus_module_t::findMajorityIndexMatch() -> uint32_t
{
    std::vector<int> matchIndexes;
    matchIndexes.resize(m_replicas.size() + 1);
    for (const auto &[peer, matchIdx] : m_matchIndex)
    {
        matchIndexes.emplace_back(matchIdx);
    }
    matchIndexes.emplace_back(m_log.back().index());

    std::ranges::sort(matchIndexes);

    return matchIndexes[matchIndexes.size() / 2];
}
Fix majority index calculation.
The vector size calculation is incorrect and there are potential duplicate entries.
auto consensus_module_t::findMajorityIndexMatch() -> uint32_t
{
- std::vector<int> matchIndexes;
- matchIndexes.resize(m_replicas.size() + 1);
+ std::vector<uint32_t> matchIndexes;
+ matchIndexes.reserve(m_replicas.size() + 1); // Reserve space for all replicas + self
+
for (const auto &[peer, matchIdx] : m_matchIndex)
{
matchIndexes.emplace_back(matchIdx);
}
- matchIndexes.emplace_back(m_log.back().index());
+ // Add self's last log index
+ matchIndexes.emplace_back(getLastLogIndex());
std::ranges::sort(matchIndexes);
+ // Majority is (N+1)/2 where N is total number of servers
return matchIndexes[matchIndexes.size() / 2];
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

auto consensus_module_t::findMajorityIndexMatch() -> uint32_t
{
    std::vector<int> matchIndexes;
    matchIndexes.resize(m_replicas.size() + 1);
    for (const auto &[peer, matchIdx] : m_matchIndex)
    {
        matchIndexes.emplace_back(matchIdx);
    }
    matchIndexes.emplace_back(m_log.back().index());
    std::ranges::sort(matchIndexes);
    return matchIndexes[matchIndexes.size() / 2];
}

After:

auto consensus_module_t::findMajorityIndexMatch() -> uint32_t
{
    std::vector<uint32_t> matchIndexes;
    matchIndexes.reserve(m_replicas.size() + 1); // Reserve space for all replicas + self
    for (const auto &[peer, matchIdx] : m_matchIndex)
    {
        matchIndexes.emplace_back(matchIdx);
    }
    // Add self's last log index
    matchIndexes.emplace_back(getLastLogIndex());
    std::ranges::sort(matchIndexes);
    // Majority is (N+1)/2 where N is total number of servers
    return matchIndexes[matchIndexes.size() / 2];
}
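The idea behind findMajorityIndexMatch is that, once the match indexes (including the leader's own last log index) are sorted, the median is the highest index already replicated on a majority. Isolated as a hypothetical helper:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: with 2k+1 sorted values, element k is >= on at least k+1 nodes,
// i.e. it is replicated on a majority of the cluster.
uint32_t majorityMatchIndex(std::vector<uint32_t> matchIndexes)
{
    std::sort(matchIndexes.begin(), matchIndexes.end());
    return matchIndexes[matchIndexes.size() / 2];
}
```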
{
    AppendEntriesResponse response;
    if (!client.appendEntries(request, &response))
    {
        consecutiveFailures++;

        spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
                      consecutiveFailures,
                      maxRetries);
        if (consecutiveFailures >= maxRetries)
        {
            spdlog::error(
                "Stopping heartbeat thread due to too much failed AppendEntries RPC attempts");
            return;
        }
    }

    consecutiveFailures = 0;
Fix consecutive failures handling in heartbeat mechanism.
The consecutive failures counter is reset too early, which breaks the retry mechanism.
if (!client.appendEntries(request, &response))
{
consecutiveFailures++;
- consecutiveFailures = 0; // Remove this line
spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
consecutiveFailures,
maxRetries);
if (consecutiveFailures >= maxRetries)
{
return;
}
continue;
}
+ // Only reset on success
+ consecutiveFailures = 0;
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
Before:

{
    AppendEntriesResponse response;
    if (!client.appendEntries(request, &response))
    {
        consecutiveFailures++;
        spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
                      consecutiveFailures,
                      maxRetries);
        if (consecutiveFailures >= maxRetries)
        {
            spdlog::error(
                "Stopping heartbeat thread due to too much failed AppendEntries RPC attempts");
            return;
        }
    }
    consecutiveFailures = 0;

After:

{
    AppendEntriesResponse response;
    if (!client.appendEntries(request, &response))
    {
        consecutiveFailures++;
        spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}",
                      consecutiveFailures,
                      maxRetries);
        if (consecutiveFailures >= maxRetries)
        {
            spdlog::error(
                "Stopping heartbeat thread due to too much failed AppendEntries RPC attempts");
            return;
        }
        continue;
    }
    // Only reset on success
    consecutiveFailures = 0;
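The retry behavior the suggestion aims for, where the failure counter resets only on success, can be isolated in a small hypothetical type (not from the PR):

```cpp
// Sketch: record() returns false once maxRetries consecutive failures
// accumulate; any success resets the counter.
struct retry_counter_t
{
    int failures = 0;
    int maxRetries = 3;

    bool record(bool success)
    {
        if (success)
        {
            failures = 0; // reset only on success
            return true;
        }
        return ++failures < maxRetries;
    }
};
```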
Actionable comments posted: 6
♻️ Duplicate comments (4)
examples/raft/main.cpp (3)
14-20
: ⚠️ Potential issue
Critical Issue: Unsafe use of std::condition_variable in signal handler
Calling std::condition_variable::notify_all() within a signal handler is unsafe and can lead to undefined behavior because it is not async-signal-safe. According to the C++ standard, only a limited set of functions are safe to call from within a signal handler.
Apply this diff to replace the condition variable with an atomic flag:
-std::condition_variable gCv;

 static void signalHandler(int sig)
 {
     if (sig == SIGTERM || sig == SIGINT)
     {
-        gCv.notify_all();
+        gShutdown.store(true);
     }
 }
+std::atomic<bool> gShutdown{false};

 ...

 int main(int argc, char *argv[])
 {
     std::signal(SIGTERM, signalHandler);
     std::signal(SIGINT, signalHandler);
     ...
-    std::mutex mtx;
-    std::unique_lock<std::mutex> lock(mtx);
-    gCv.wait(lock);
+    while (!gShutdown.load())
+    {
+        std::this_thread::sleep_for(std::chrono::milliseconds(100));
+    }
     consensusModule.stop();
     return EXIT_SUCCESS;
 }
46-51
: 🛠️ Refactor suggestion
Enhance IP address validation and error handling
Currently, the code only checks if the list of node IPs is not empty. For robustness, consider adding validation for:
- The nodeId is within the range of provided IPs.
- Duplicate IP addresses.
Apply this diff to enhance validation:
if (nodeIps.empty()) { spdlog::error("List of node IPs can't be empty"); return EXIT_FAILURE; } + +if (nodeId > nodeIps.size()) +{ + spdlog::error("Node ID {} is larger than the number of provided IPs {}", nodeId, nodeIps.size()); + return EXIT_FAILURE; +} + +std::unordered_set<std::string> ipSet(nodeIps.begin(), nodeIps.end()); +if (ipSet.size() != nodeIps.size()) +{ + spdlog::error("Duplicate IP addresses detected in the node IP list"); + return EXIT_FAILURE; +} + +std::regex ipRegex("^(?:[0-9]{1,3}\\.){3}[0-9]{1,3}:[0-9]{1,5}$"); +for (const auto &ip : nodeIps) +{ + if (!std::regex_match(ip, ipRegex)) + { + spdlog::error("Invalid IP address format: {}", ip); + return EXIT_FAILURE; + } +}
58-59
:⚠️ Potential issueSecurity Concern: Using insecure channel credentials
The use of
grpc::InsecureChannelCredentials()
is not suitable for production environments as it exposes the communication to potential man-in-the-middle attacks.Consider using secure channel credentials:
-std::unique_ptr<RaftService::Stub> stub{ - RaftService::NewStub(grpc::CreateChannel(replicaIp, grpc::InsecureChannelCredentials()))}; +auto creds = grpc::SslCredentials(grpc::SslCredentialsOptions()); +std::unique_ptr<RaftService::Stub> stub{ + RaftService::NewStub(grpc::CreateChannel(replicaIp, creds))};.vscode/launch.json (1)
105-105
:⚠️ Potential issueFix JSON syntax error.
Remove the trailing comma after the pipeTransport object.
Apply this diff:
"debuggerPath": "/usr/bin/gdb" - }, + } },🧰 Tools
🪛 Biome (1.9.4)
[error] 105-105: Expected a property but instead found '}'.
Expected a property here.
(parse)
🧹 Nitpick comments (9)
examples/raft/main.cpp (1)
74-75
: Set logging level before module initializationConsider setting the logging level before initializing the consensus module to capture all debug logs during initialization.
Apply this diff to adjust the order:
spdlog::set_level(spdlog::level::debug); consensusModule.start();examples/raft/raft.h (3)
14-24
: Remove unnecessary includesThe includes for
Raft.grpc.pb.h
andRaft.pb.h
are sufficient. The additional gRPC headers in lines 3~ to 11~ may not be necessary and can be removed to reduce compilation time.Apply this diff to clean up the includes:
-#include <grpc/grpc.h> -#include <grpcpp/channel.h> -#include <grpcpp/client_context.h> -#include <grpcpp/create_channel.h> -#include <grpcpp/security/credentials.h> -#include <grpcpp/security/server_credentials.h> -#include <grpcpp/server_builder.h> -#include <grpcpp/support/status.h> #include "Raft.grpc.pb.h" #include "Raft.pb.h"
79-106
: Enhance class design and documentation forraft_node_grpc_client_t
To improve maintainability and readability, consider the following:
- Add method documentation for public methods.
- Mark the destructor as
noexcept
.- Ensure thread safety guarantees are documented.
Apply this diff to enhance the class:
/** * Client for communicating with other nodes in the Raft cluster. * Handles RPC operations for consensus and key-value operations. + * Thread-safe: All public methods are thread-safe as they use thread-safe gRPC stubs. */ class raft_node_grpc_client_t { public: /** * Constructs a client for communicating with a specific node. * @param config Node configuration containing ID and IP address. * @param pRaftStub Unique pointer to the Raft service stub interface. */ raft_node_grpc_client_t(node_config_t config, std::unique_ptr<RaftService::StubInterface> pRaftStub); - virtual ~raft_node_grpc_client_t() noexcept = default; + virtual ~raft_node_grpc_client_t() noexcept; raft_node_grpc_client_t(const raft_node_grpc_client_t &) = delete; auto operator=(const raft_node_grpc_client_t &) -> raft_node_grpc_client_t & = delete; raft_node_grpc_client_t(raft_node_grpc_client_t &&) = default; auto operator=(raft_node_grpc_client_t &&) -> raft_node_grpc_client_t & = default; /** * Sends append entries RPC to the node. * @param request AppendEntriesRequest message. * @param response Pointer to AppendEntriesResponse message. * @return True if the RPC was successful; otherwise, false. */ auto appendEntries(const AppendEntriesRequest &request, AppendEntriesResponse *response) -> bool; /** * Sends request vote RPC to the node. * @param request RequestVoteRequest message. * @param response Pointer to RequestVoteResponse message. * @return True if the RPC was successful; otherwise, false. */ auto requestVote(const RequestVoteRequest &request, RequestVoteResponse *response) -> bool; [[nodiscard]] auto id() const -> id_t; [[nodiscard]] auto ip() const -> ip_t; };
176-179
: Add missing thread safety annotationsThe
m_state
variable is guarded bym_stateMutex
but lacks thread safety annotations.Apply this diff to add the annotation:
uint32_t m_commitIndex ABSL_GUARDED_BY(m_stateMutex); uint32_t m_lastApplied ABSL_GUARDED_BY(m_stateMutex); -NodeState m_state ABSL_GUARDED_BY(m_stateMutex); +NodeState m_state ABSL_GUARDED_BY(m_stateMutex);examples/raft/raft_test.cpp (2)
84-85
: Simplify range transformationsThe use of
std::views::transform
andstd::ranges::to
may be unnecessary if a simpler approach suffices.Consider simplifying the
raftClients()
method:auto raftClients() { - return m_raftClients | std::views::transform([](auto &&pair) { return std::move(pair.second); }) | - std::ranges::to<std::vector<raft::raft_node_grpc_client_t>>(); + std::vector<raft::raft_node_grpc_client_t> clients; + for (auto &[id, client] : m_raftClients) + { + clients.push_back(std::move(client)); + } + return clients; }
121-123
: Ensure proper cleanup inmain
functionAfter running all tests, it's good practice to clean up any resources or perform any necessary teardown.
Consider adding any required cleanup code after
RUN_ALL_TESTS()
:int main(int argc, char **argv) { spdlog::set_level(spdlog::level::debug); testing::InitGoogleTest(&argc, argv); int result = RUN_ALL_TESTS(); // Perform any necessary cleanup here return result; }.vscode/launch.json (1)
39-44
: Make node configuration configurable.The hardcoded node ID and peer list make it difficult to debug different nodes.
Apply this diff:
"args": [ "--id", - "1", + "${input:nodeId}", "--nodes", - "0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082" + "${input:peerList}" ],Add to the inputs section:
{ "id": "peerList", "type": "promptString", "description": "Comma-separated list of peer addresses", "default": "0.0.0.0:8080,0.0.0.0:8081,0.0.0.0:8082" }examples/raft/raft.cpp (2)
918-934
: Improve replication wait mechanism.The timeout value is hardcoded and the majority calculation could be improved.
Apply this diff:
- constexpr const auto replicationTimeout{5}; + static constexpr const auto REPLICATION_TIMEOUT_SECONDS = 5; + static constexpr const auto MIN_MAJORITY = 2; // Minimum number of nodes for consensus auto hasMajority = [this, logIndex]() ABSL_SHARED_LOCKS_REQUIRED(m_stateMutex) { uint32_t count = 1; // Count self for (const auto &[peer, matchIdx] : m_matchIndex) { count += matchIdx >= logIndex; } - return count >= (m_replicas.size() + 1) / 2 + 1; + const size_t majorityCount = (m_replicas.size() + 1) / 2 + 1; + return count >= std::max(majorityCount, MIN_MAJORITY); }; spdlog::info("Node={} is waiting for majority to agree on logIndex={}", m_config.m_id, logIndex); - return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority), absl::Seconds(replicationTimeout)); + return m_stateMutex.AwaitWithTimeout(absl::Condition(&hasMajority), + absl::Seconds(REPLICATION_TIMEOUT_SECONDS));
893-901
: Use safe array access in getLogTerm.The array access should use
at()
to catch out-of-bounds access.Apply this diff:
if (index == 0 || index > m_log.size()) { return 0; } - return m_log[index - 1].term(); + return m_log.at(index - 1).term();
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- .vscode/launch.json (2 hunks)
- examples/raft/main.cpp (1 hunk)
- examples/raft/raft.cpp (1 hunk)
- examples/raft/raft.h (1 hunk)
- examples/raft/raft_test.cpp (1 hunk)
🧰 Additional context used
🪛 Biome (1.9.4)
.vscode/launch.json
[error] 74-74: Expected a property but instead found '}'.
Expected a property here.
(parse)
[error] 105-105: Expected a property but instead found '}'.
Expected a property here.
(parse)
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: build-clang
- GitHub Check: build-gcc
🔇 Additional comments (10)
examples/raft/main.cpp (1)
68-72
: Enhance consensus module initialization error handling
Add timeout and detailed error reporting for initialization failures to improve robustness.
Apply this diff to enhance error handling:
Apply this diff to enhance error handling:
auto nodeId = parsedOptions["id"].as<id_t>(); ... raft::consensus_module_t consensusModule({.m_id = nodeId, .m_ip = nodeIps[nodeId - 1]}, std::move(replicas)); +auto initStart = std::chrono::steady_clock::now(); +while (!consensusModule.init()) { spdlog::error("Failed to initialize the state machine"); + auto elapsed = std::chrono::steady_clock::now() - initStart; + if (elapsed > std::chrono::seconds(10)) + { + spdlog::error("Failed to initialize consensus module after {} seconds", + std::chrono::duration_cast<std::chrono::seconds>(elapsed).count()); + return EXIT_FAILURE; + } + std::this_thread::sleep_for(std::chrono::milliseconds(100)); }examples/raft/raft_test.cpp (1)
28-45
: Avoid mixing Google Test and Catch2 frameworksThe test file uses Google Test (
gtest/gtest.h
) and Google Mock (gmock/gmock.h
). Ensure that only one testing framework is used to prevent potential conflicts..vscode/launch.json (1)
74-74
:⚠️ Potential issueFix JSON syntax error.
Remove the trailing comma after the pipeTransport object.
Apply this diff:
"debuggerPath": "/usr/bin/gdb" - }, + } },Likely invalid or redundant comment.
🧰 Tools
🪛 Biome (1.9.4)
[error] 74-74: Expected a property but instead found '}'.
Expected a property here.
(parse)
examples/raft/raft.cpp (7)
704-725
: ⚠️ Potential issue
Reset vote count when becoming leader.
The vote count should be reset to avoid interference with future elections.
Apply this diff:
m_state = NodeState::LEADER; + m_voteCount = 0; // Reset vote count
Likely invalid or redundant comment.
903-916
: ⚠️ Potential issue
Fix majority index calculation.
The vector size calculation is incorrect and there are potential duplicate entries.
Apply this diff:
auto consensus_module_t::findMajorityIndexMatch() -> uint32_t { - std::vector<int> matchIndexes; - matchIndexes.resize(m_replicas.size() + 1); + std::vector<uint32_t> matchIndexes; + matchIndexes.reserve(m_replicas.size() + 1); // Reserve space for all replicas + self + for (const auto &[peer, matchIdx] : m_matchIndex) { matchIndexes.emplace_back(matchIdx); } - matchIndexes.emplace_back(m_log.back().index()); + // Add self's last log index + matchIndexes.emplace_back(getLastLogIndex()); std::ranges::sort(matchIndexes); + // Majority is (N+1)/2 where N is total number of servers return matchIndexes[matchIndexes.size() / 2]; }Likely invalid or redundant comment.
31-41
: ⚠️ Potential issue
Fix thread safety issues in random number generation.
The current implementation creates new random device and generator on each call, which is inefficient and not thread-safe.
Apply this diff:
+constexpr const int MIN_TIMEOUT_MS = 150; +constexpr const int MAX_TIMEOUT_MS = 300; auto generate_random_timeout() -> int { - const int minTimeout{150}; - const int maxTimeout{300}; - std::random_device randomDevice; - std::mt19937 gen(randomDevice()); + thread_local std::random_device randomDevice; + thread_local std::mt19937 gen(randomDevice()); - std::uniform_int_distribution<> dist(minTimeout, maxTimeout); + std::uniform_int_distribution<> dist(MIN_TIMEOUT_MS, MAX_TIMEOUT_MS); return dist(gen); }Likely invalid or redundant comment.
960-1024
: ⚠️ Potential issue
Improve persistence reliability and error handling.
The persistence implementation lacks atomic write guarantees and proper error handling.
Apply this diff:
auto consensus_module_t::flushPersistentState() -> bool { + const auto tmpPath = path.string() + ".tmp"; // Flush commitIndex and votedFor { auto path = std::filesystem::path(constructFilename(gRaftFilename, m_config.m_id)); - std::ofstream fsa(path, std::fstream::out | std::fstream::trunc); + std::ofstream fsa(tmpPath, std::fstream::out | std::fstream::trunc); if (!fsa.is_open()) { return false; } fsa << m_commitIndex << " " << m_votedFor << "\n"; fsa.flush(); + if (::fsync(fileno(fsa.rdbuf()->fd())) != 0) { + return false; + } + fsa.close(); + + if (std::rename(tmpPath.c_str(), path.c_str()) != 0) { + return false; + } } // Similar changes for log persistence... return true; }Likely invalid or redundant comment.
761-779
: ⚠️ Potential issue
Fix consecutive failures handling in heartbeat mechanism.
The consecutive failures counter is reset too early, which breaks the retry mechanism.
Apply this diff:
if (!client.appendEntries(request, &response)) { consecutiveFailures++; - consecutiveFailures = 0; // Remove this line spdlog::error("AppendEntriesRequest failed during heartbeat. Attempt {}/{}", consecutiveFailures, maxRetries); if (consecutiveFailures >= maxRetries) { return; } continue; } + // Only reset on success + consecutiveFailures = 0;Likely invalid or redundant comment.
877-885
: ⚠️ Potential issue
Fix infinite recursion in failed AppendEntries handling.
The recursive call to
sendAppendEntriesRPC
on failure can lead to infinite recursion and stack overflow.Apply this diff:
if (!response.success()) { { absl::MutexLock locker{&m_stateMutex}; m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1); } - sendAppendEntriesRPC(client, {}); + // Use a retry loop instead of recursion + continue; }Likely invalid or redundant comment.
698-702
: ⚠️ Potential issue
Fix incorrect majority calculation.
The majority calculation is incorrect and could lead to split votes.
Apply this diff:
auto consensus_module_t::hasMajority(const uint32_t votes) const -> bool { - constexpr const double HALF_OF_THE_REPLICAS = 2.0; - return votes > static_cast<double>(m_replicas.size()) / HALF_OF_THE_REPLICAS; + // Total nodes = replicas + self + const size_t totalNodes = m_replicas.size() + 1; + return votes > totalNodes / 2; }Likely invalid or redundant comment.
uint32_t m_currentTerm ABSL_GUARDED_BY(m_stateMutex);
uint32_t m_votedFor ABSL_GUARDED_BY(m_stateMutex);
std::vector<LogEntry> m_log ABSL_GUARDED_BY(m_stateMutex);
Ensure proper initialization of member variables
Member variables m_currentTerm and m_votedFor are declared but not initialized. Uninitialized variables can lead to undefined behavior.
The affected declarations:
uint32_t m_currentTerm ABSL_GUARDED_BY(m_stateMutex);
uint32_t m_votedFor ABSL_GUARDED_BY(m_stateMutex);
Initialize them in the constructor:
consensus_module_t::consensus_module_t(node_config_t nodeConfig, std::vector<raft_node_grpc_client_t> replicas)
: m_config(std::move(nodeConfig))
, m_replicas()
, m_currentTerm(0)
, m_votedFor(gInvalidId)
{
// Constructor implementation
}
class consensus_module_t : public RaftService::Service
{
public:
    // @id is the ID of the current node. Order of RaftServices in @replicas is important!
    consensus_module_t(node_config_t nodeConfig, std::vector<raft_node_grpc_client_t> replicas);

    // NOLINTBEGIN(modernize-use-trailing-return-type)
    grpc::Status AppendEntries(grpc::ServerContext *pContext,
                               const AppendEntriesRequest *pRequest,
                               AppendEntriesResponse *pResponse) override ABSL_LOCKS_EXCLUDED(m_stateMutex);

    auto RequestVote(grpc::ServerContext *pContext, const RequestVoteRequest *pRequest, RequestVoteResponse *pResponse)
        -> grpc::Status override ABSL_LOCKS_EXCLUDED(m_stateMutex);
    // NOLINTEND(modernize-use-trailing-return-type)

    [[nodiscard]] auto init() -> bool;
    void start();
    void stop();
🛠️ Refactor suggestion
Correct inheritance of std::enable_shared_from_this and document thread safety

- Inherit publicly from std::enable_shared_from_this to ensure proper functionality.
- Provide documentation on thread safety guarantees of consensus_module_t.
Apply this diff to correct inheritance and add documentation:
 class consensus_module_t : public RaftService::Service
-{
+    , public std::enable_shared_from_this<consensus_module_t>
+{
   public:
+    /**
+     * Constructs a consensus module for a node in the Raft cluster.
+     * @param nodeConfig Configuration of the current node.
+     * @param replicas Vector of client stubs for replica nodes.
+     */
     consensus_module_t(node_config_t nodeConfig, std::vector<raft_node_grpc_client_t> replicas);

+    /**
+     * Initializes the consensus module. Must be called before start().
+     * @return True if initialization was successful; otherwise, false.
+     */
     [[nodiscard]] auto init() -> bool;

+    /**
+     * Starts the consensus module operations.
+     */
     void start();

+    /**
+     * Stops the consensus module operations.
+     */
     void stop();
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
class consensus_module_t : public RaftService::Service
    , public std::enable_shared_from_this<consensus_module_t>
{
  public:
    /**
     * Constructs a consensus module for a node in the Raft cluster.
     * @param nodeConfig Configuration of the current node.
     * @param replicas Vector of client stubs for replica nodes.
     */
    consensus_module_t(node_config_t nodeConfig, std::vector<raft_node_grpc_client_t> replicas);

    // NOLINTBEGIN(modernize-use-trailing-return-type)
    grpc::Status AppendEntries(grpc::ServerContext *pContext,
                               const AppendEntriesRequest *pRequest,
                               AppendEntriesResponse *pResponse) override ABSL_LOCKS_EXCLUDED(m_stateMutex);

    auto RequestVote(grpc::ServerContext *pContext, const RequestVoteRequest *pRequest, RequestVoteResponse *pResponse)
        -> grpc::Status override ABSL_LOCKS_EXCLUDED(m_stateMutex);
    // NOLINTEND(modernize-use-trailing-return-type)

    /**
     * Initializes the consensus module. Must be called before start().
     * @return True if initialization was successful; otherwise, false.
     */
    [[nodiscard]] auto init() -> bool;

    /**
     * Starts the consensus module operations.
     */
    void start();

    /**
     * Stops the consensus module operations.
     */
    void stop();
{
    "name": "rr - RaftMain",
    "type": "cppdbg",
    "request": "launch",
    "program": "${workspaceFolder}/build/RaftMain",
    "miDebuggerServerAddress": "localhost:50505",
    "stopAtEntry": false,
    "cwd": "${workspaceFolder}",
    "environment": [],
    "externalConsole": true,
    "linux": {
        "MIMode": "gdb",
        "setupCommands": [
            {
                "description": "Setup to resolve symbols",
                "text": "set sysroot /",
                "ignoreFailures": false
            }
        ]
    },
    "osx": {
        "MIMode": "gdb"
    },
    "windows": {
        "MIMode": "gdb"
    }
},
🛠️ Refactor suggestion
Enhance RaftMain debug configuration for multi-node debugging.
The configuration needs improvements for debugging multiple Raft nodes:
- Add build task to ensure RaftMain is compiled
- Add command line arguments for Raft configuration (node ID, peer list)
- Make debug port configurable to avoid conflicts
Apply this diff:
{
"name": "rr - RaftMain",
"type": "cppdbg",
"request": "launch",
"program": "${workspaceFolder}/build/RaftMain",
- "miDebuggerServerAddress": "localhost:50505",
+ "miDebuggerServerAddress": "localhost:${input:debugPort}",
+ "args": [
+ "--node-id", "${input:nodeId}",
+ "--config", "${workspaceFolder}/assets/raft_config.json"
+ ],
"stopAtEntry": false,
"cwd": "${workspaceFolder}",
"environment": [],
"externalConsole": true,
+ "preLaunchTask": "build",
"linux": {
"MIMode": "gdb",
Add to the end of the file:
"inputs": [
{
"id": "debugPort",
"type": "promptString",
"description": "Debug port for remote debugging",
"default": "50505"
},
{
"id": "nodeId",
"type": "promptString",
"description": "Raft node ID",
"default": "1"
}
]
void consensus_module_t::sendAppendEntriesRPC(raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries)
{
    std::thread(
        [this](raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries)
        {
            AppendEntriesRequest request;
            {
                absl::MutexLock locker{&m_stateMutex};

                request.set_term(m_currentTerm);
                request.set_prevlogterm(logEntries.front().term());
                request.set_prevlogindex(getLogTerm(logEntries.front().index() - 1));
                request.set_leadercommit(m_commitIndex);
                request.set_senderid(m_config.m_id);

                for (const auto &logEntry : logEntries)
                {
                    request.add_entries()->CopyFrom(logEntry);
                }
            }

            AppendEntriesResponse response;
            auto status = client.appendEntries(request, &response);
            if (!status)
            {
                absl::MutexLock locker{&m_stateMutex};
                spdlog::error("Node={} failed to send AppendEntries RPC to peer={} at term={}",
                              m_config.m_id,
                              client.id(),
                              m_currentTerm);
                return;
            }

            {
                absl::MutexLock locker{&m_stateMutex};
                if (response.term() > m_currentTerm)
                {
                    becomeFollower(response.term());
                    return;
                }
            }

            if (response.success())
            {
                absl::MutexLock locker{&m_stateMutex};
                m_matchIndex[client.id()] = response.match_index();
                m_nextIndex[client.id()] = response.match_index() + 1;

                uint32_t majorityIndex = findMajorityIndexMatch();
                if (majorityIndex > m_commitIndex && m_log[majorityIndex - 1].term() == m_currentTerm)
                {
                    if (!updatePersistentState(majorityIndex, std::nullopt))
                    {
                        spdlog::error("Node={} is unable to persist commitIndex={}", m_config.m_id, m_commitIndex);
                        return;
                    }

                    // Apply successful replication to the state machine e.g. in-memory hash-table
                    while (m_lastApplied < m_commitIndex)
                    {
                        ++m_lastApplied;
                        // TODO(lnikon): Update the state machine
                        // m_kv[m_log[m_lastApplied - 1].key()] = m_log[m_lastApplied - 1].value();
                    }

                    return;
                }
            }

            if (!response.success())
            {
                {
                    absl::MutexLock locker{&m_stateMutex};
                    m_nextIndex[client.id()] = std::max(1U, m_nextIndex[client.id()] - 1);
                }
                sendAppendEntriesRPC(client, {});
            }
        },
        std::ref(client),
        logEntries)
        .detach();
}
Avoid detaching threads in sendAppendEntriesRPC.
Using detached threads can lead to resource leaks and undefined behavior.
Apply this diff:
- std::thread(
+ m_replicationThreads.emplace_back(
[this](raft_node_grpc_client_t &client, std::vector<LogEntry> logEntries)
{
// ... existing code ...
},
std::ref(client),
- logEntries)
- .detach();
+ std::move(logEntries));
Add to the class definition:
std::vector<std::jthread> m_replicationThreads;
    // 3. Append new entries and remove conflicting ones
    auto newEntryStart = pRequest->prevlogindex() + 1;
    for (auto idx{0}; idx < pRequest->entries().size(); ++idx)
    {
        const auto newIdx{newEntryStart + idx - 1};
        if (m_log.size() >= newEntryStart + idx && m_log[newIdx].term() != pRequest->entries(idx).term())
        {
            m_log.resize(newIdx);
            break;
        }
    }

    m_log.insert(m_log.end(), pRequest->entries().begin(), pRequest->entries().end());
Fix potential buffer overflow in log entry handling.
The log entry insertion doesn't properly validate array bounds and could cause buffer overflow. Also, the insertion method is inefficient.
Apply this diff:
- auto newEntryStart = pRequest->prevlogindex() + 1;
- for (auto idx{0}; idx < pRequest->entries().size(); ++idx)
+ const auto newEntryStart = pRequest->prevlogindex() + 1;
+ if (newEntryStart > m_log.size() + 1) {
+ pResponse->set_term(m_currentTerm);
+ pResponse->set_success(false);
+ pResponse->set_responderid(m_config.m_id);
+ return grpc::Status::OK;
+ }
+
+ for (size_t idx = 0; idx < pRequest->entries().size(); ++idx)
{
- const auto newIdx{newEntryStart + idx - 1};
+ const auto newIdx = newEntryStart + idx - 1;
if (m_log.size() >= newEntryStart + idx && m_log[newIdx].term() != pRequest->entries(idx).term())
{
m_log.resize(newIdx);
break;
}
}
- m_log.insert(m_log.end(), pRequest->entries().begin(), pRequest->entries().end());
+
+ // Use reserve to prevent multiple reallocations
+ m_log.reserve(m_log.size() + pRequest->entries().size());
+ for (const auto& entry : pRequest->entries()) {
+ m_log.push_back(entry);
+ }
📝 Committable suggestion
    // 3. Append new entries and remove conflicting ones
    const auto newEntryStart = pRequest->prevlogindex() + 1;
    if (newEntryStart > m_log.size() + 1) {
        pResponse->set_term(m_currentTerm);
        pResponse->set_success(false);
        pResponse->set_responderid(m_config.m_id);
        return grpc::Status::OK;
    }

    for (size_t idx = 0; idx < pRequest->entries().size(); ++idx)
    {
        const auto newIdx = newEntryStart + idx - 1;
        if (m_log.size() >= newEntryStart + idx && m_log[newIdx].term() != pRequest->entries(idx).term())
        {
            m_log.resize(newIdx);
            break;
        }
    }

    // Use reserve to prevent multiple reallocations
    m_log.reserve(m_log.size() + pRequest->entries().size());
    for (const auto& entry : pRequest->entries()) {
        m_log.push_back(entry);
    }
auto consensus_module_t::restorePersistentState() -> bool
{
    {
        auto path = std::filesystem::path(constructFilename(gRaftFilename, m_config.m_id));
        if (!std::filesystem::exists(path))
        {
            spdlog::info("Node={} is running the first time", m_config.m_id);
            return true;
        }

        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error(
                "Node={} is unable to open {} to restore commitIndex and votedFor", m_config.m_id, path.c_str());
            return false;
        }

        ifs >> m_commitIndex >> m_votedFor;
        m_votedFor = 0;
        spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_config.m_id, m_commitIndex, m_votedFor);
    }

    {
        auto path = std::filesystem::path(constructFilename(gLogFilename, m_config.m_id));
        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error("Node={} is unable to open {} to restore log", m_config.m_id, path.c_str());
            return false;
        }

        std::string logLine;
        while (std::getline(ifs, logLine))
        {
            std::stringstream sst(logLine);

            std::string key;
            std::string value;
            std::uint32_t term = 0;

            sst >> key >> value >> term;

            LogEntry logEntry;
            logEntry.set_key(key);
            logEntry.set_value(value);
            logEntry.set_term(term);
            m_log.emplace_back(logEntry);

            spdlog::info("Node={} restored logEntry=[key={}, value={}, term={}]", m_config.m_id, key, value, term);
        }
    }

    return true;
}
Handle file parsing errors in restorePersistentState.
The method doesn't properly validate parsed values and could lead to undefined behavior with corrupted files.
Apply this diff:
ifs >> m_commitIndex >> m_votedFor;
+ if (ifs.fail()) {
+ spdlog::error("Node={} failed to parse state file {}", m_config.m_id, path.c_str());
+ return false;
+ }
+
while (std::getline(ifs, logLine)) {
std::stringstream sst(logLine);
std::string key, value;
std::uint32_t term = 0;
sst >> key >> value >> term;
+ if (sst.fail()) {
+ spdlog::error("Node={} failed to parse log entry", m_config.m_id);
+ return false;
+ }
📝 Committable suggestion
auto consensus_module_t::restorePersistentState() -> bool
{
    {
        auto path = std::filesystem::path(constructFilename(gRaftFilename, m_config.m_id));
        if (!std::filesystem::exists(path))
        {
            spdlog::info("Node={} is running the first time", m_config.m_id);
            return true;
        }

        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error(
                "Node={} is unable to open {} to restore commitIndex and votedFor", m_config.m_id, path.c_str());
            return false;
        }

        ifs >> m_commitIndex >> m_votedFor;
        if (ifs.fail())
        {
            spdlog::error("Node={} failed to parse state file {}", m_config.m_id, path.c_str());
            return false;
        }
        m_votedFor = 0;
        spdlog::info("Node={} restored commitIndex={} and votedFor={}", m_config.m_id, m_commitIndex, m_votedFor);
    }

    {
        auto path = std::filesystem::path(constructFilename(gLogFilename, m_config.m_id));
        std::ifstream ifs(path, std::ifstream::in);
        if (!ifs.is_open())
        {
            spdlog::error("Node={} is unable to open {} to restore log", m_config.m_id, path.c_str());
            return false;
        }

        std::string logLine;
        while (std::getline(ifs, logLine))
        {
            std::stringstream sst(logLine);

            std::string key;
            std::string value;
            std::uint32_t term = 0;

            sst >> key >> value >> term;
            if (sst.fail())
            {
                spdlog::error("Node={} failed to parse log entry", m_config.m_id);
                return false;
            }

            LogEntry logEntry;
            logEntry.set_key(key);
            logEntry.set_value(value);
            logEntry.set_term(term);
            m_log.emplace_back(logEntry);

            spdlog::info("Node={} restored logEntry=[key={}, value={}, term={}]", m_config.m_id, key, value, term);
        }
    }

    return true;
}
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores