
Test Retry Logic #49

Open

tustvold opened this issue Jan 11, 2022 · 8 comments

Comments

@tustvold
Contributor

Currently there is no real testing of the retry logic, in particular leader failover, connection errors, etc. This should be addressed.

@carols10cents
Contributor

I'm trying to cause leader failover manually in redpanda to just observe what happens. I have a test where I'm creating 2 topics with replication 3, and I can see each topic has a different leader and has all 3 redpanda instances as replicas.

Then I put a tokio::time::sleep in, during which I go over to the Docker dashboard and stop redpanda-0.
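
Roughly, the test looks like this (just a sketch, not the exact code; the ClientBuilder, controller_client().create_topic(...) and list_topics() names here are from the current rskafka API surface and the exact signatures may not match the version in this repo):

use std::time::Duration;
use rskafka::client::ClientBuilder;

#[tokio::test]
async fn observe_leader_failover() {
    // Bootstrap against one of the three redpanda brokers from the
    // docker-compose setup (the port here is a placeholder).
    let client = ClientBuilder::new(vec!["localhost:9093".to_owned()])
        .build()
        .await
        .unwrap();

    // Two topics, one partition each, replication factor 3, so every broker
    // holds a replica and the topics can end up with different leaders.
    let controller = client.controller_client().unwrap();
    controller.create_topic("test_topic_a", 1, 3, 5_000).await.unwrap();
    controller.create_topic("test_topic_b", 1, 3, 5_000).await.unwrap();

    // Pause long enough to stop redpanda-0 from the Docker dashboard
    // (or docker stop redpanda-0) and force a leader election.
    tokio::time::sleep(Duration::from_secs(60)).await;

    // After failover this metadata request should be retried against the
    // surviving brokers rather than hanging on the stopped one.
    let topics = client.list_topics().await.unwrap();
    assert!(topics.iter().any(|t| t.name == "test_topic_a"));
}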

In the docker logs for redpanda-1 and redpanda-2, I'm seeing errors about leader election:

Redpanda logs
redpanda-0    | INFO  2022-01-13 16:06:57,473 [shard 0] redpanda::main - application.cc:175 - Stopping...
redpanda-0    | INFO  2022-01-13 16:06:57,473 [shard 0] rpc - server.cc:166 - kafka rpc protocol - Stopping 1 listeners
redpanda-0    | INFO  2022-01-13 16:06:57,473 [shard 0] rpc - server.cc:175 - kafka rpc protocol - Shutting down 1 connections
redpanda-0    | INFO  2022-01-13 16:06:57,473 [shard 0] rpc - server.cc:166 - vectorized internal rpc protocol - Stopping 1 listeners
redpanda-0    | INFO  2022-01-13 16:06:57,473 [shard 0] rpc - server.cc:175 - vectorized internal rpc protocol - Shutting down 1 connections
redpanda-0    | ERROR 2022-01-13 16:06:57,473 [shard 0] cluster - controller_backend.cc:345 - Error while reconciling topics - seastar::abort_requested_exception (abort requested)
redpanda-0    | INFO  2022-01-13 16:06:57,475 [shard 0] cluster - state_machine.cc:139 - State machine for ntp={redpanda/controller/0} handles raft::offset_monitor::wait_aborted (offset monitor wait aborted)
redpanda-0    | INFO  2022-01-13 16:06:57,475 [shard 0] cluster - members_manager.cc:290 - stopping cluster::members_manager...
redpanda-0    | INFO  2022-01-13 16:06:57,475 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:158 - Stopping
redpanda-0    | INFO  2022-01-13 16:06:57,475 [shard 0] raft - [group_id:1, {kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0}] consensus.cc:158 - Stopping
redpanda-0    | INFO  2022-01-13 16:06:57,475 [shard 0] raft - [group_id:2, {kafka/test_topic_f8e9a618-5a15-4214-8398-ab18357ba8c9/0}] consensus.cc:158 - Stopping
redpanda-0    | INFO  2022-01-13 16:06:57,481 [shard 0] kvstore - kvstore.cc:105 - Stopping kvstore: dir /var/lib/redpanda/data/redpanda/kvstore/0_0
redpanda-2    | WARN  2022-01-13 16:06:57,490 [shard 0] rpc - error connecting to 172.19.0.4:33145 - std::__1::system_error (error system:111, Connection refused)
redpanda-0 exited with code 0
redpanda-1    | INFO  2022-01-13 16:07:02,911 [shard 0] cluster - leader_balancer.cc:92 - Leader balancer: controller leadership lost
redpanda-2    | INFO  2022-01-13 16:07:02,923 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1356 - Received vote request with larger term from node {id: {1}, revision: {0}}, received 2, current 1
redpanda-1    | INFO  2022-01-13 16:07:02,935 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:76 - vote reply from {id: {2}, revision: {0}} - {term:{2}, target_node_id{id: {1}, revision: {0}}, vote_granted: 1, log_ok:1}
redpanda-1    | INFO  2022-01-13 16:07:02,935 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:246 - became the leader term:2
redpanda-1    | INFO  2022-01-13 16:07:02,935 [shard 0] cluster - leader_balancer.cc:86 - Leader balancer: controller leadership detected. Starting rebalancer in 30 seconds
redpanda-1    | INFO  2022-01-13 16:07:02,939 [shard 0] storage - segment.cc:628 - Creating new segment /var/lib/redpanda/data/redpanda/controller/0_0/14-2-v1.log
redpanda-2    | INFO  2022-01-13 16:07:03,005 [shard 0] cluster - leader_balancer.cc:92 - Leader balancer: controller leadership lost
redpanda-2    | INFO  2022-01-13 16:07:03,008 [shard 0] storage - segment.cc:628 - Creating new segment /var/lib/redpanda/data/redpanda/controller/0_0/14-2-v1.log
redpanda-1    | INFO  2022-01-13 16:07:03,293 [shard 0] raft - [group_id:1, {kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0}] consensus.cc:1356 - Received vote request with larger term from node {id: {2}, revision: {11}}, received 2, current 1
redpanda-2    | INFO  2022-01-13 16:07:03,305 [shard 0] raft - [group_id:1, {kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0}] vote_stm.cc:76 - vote reply from {id: {1}, revision: {11}} - {term:{2}, target_node_id{id: {2}, revision: {11}}, vote_granted: 1, log_ok:1}
redpanda-2    | INFO  2022-01-13 16:07:03,305 [shard 0] raft - [group_id:1, {kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0}] vote_stm.cc:246 - became the leader term:2
redpanda-2    | INFO  2022-01-13 16:07:03,319 [shard 0] storage - segment.cc:628 - Creating new segment /var/lib/redpanda/data/kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0_11/1-2-v1.log
redpanda-1    | INFO  2022-01-13 16:07:03,356 [shard 0] storage - segment.cc:628 - Creating new segment /var/lib/redpanda/data/kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0_11/1-2-v1.log
redpanda-2    | INFO  2022-01-13 16:07:32,930 [shard 0] raft - [group_id:1, {kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0}] consensus.cc:2385 - Starting leadership transfer from {id: {2}, revision: {11}} to {id: {0}, revision: {11}} in term 2
redpanda-2    | WARN  2022-01-13 16:07:32,930 [shard 0] raft - [group_id:1, {kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0}] consensus.cc:2437 - Waiting on node to recover before requesting election
redpanda-2    | WARN  2022-01-13 16:07:32,932 [shard 0] raft - [group_id:1, {kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0}] consensus.cc:2305 - Unable to transfer leadership to {0}: raft::errc::timeout
redpanda-1    | INFO  2022-01-13 16:07:32,932 [shard 0] cluster - leader_balancer.cc:629 - Leadership transfer of group 1 failed with error: raft::errc:10
redpanda-1    | INFO  2022-01-13 16:07:32,932 [shard 0] cluster - leader_balancer.cc:249 - Error transferring leadership group 1 from {node_id: 2, shard: 0} to {node_id: 0, shard: 0}
redpanda-2    | INFO  2022-01-13 16:07:37,933 [shard 0] raft - [group_id:2, {kafka/test_topic_f8e9a618-5a15-4214-8398-ab18357ba8c9/0}] consensus.cc:2385 - Starting leadership transfer from {id: {2}, revision: {13}} to {id: {0}, revision: {13}} in term 1
redpanda-2    | WARN  2022-01-13 16:07:37,933 [shard 0] raft - [group_id:2, {kafka/test_topic_f8e9a618-5a15-4214-8398-ab18357ba8c9/0}] consensus.cc:2305 - Unable to transfer leadership to {0}: rpc::errc::exponential_backoff
redpanda-1    | INFO  2022-01-13 16:07:37,934 [shard 0] cluster - leader_balancer.cc:629 - Leadership transfer of group 2 failed with error: raft::errc:0
redpanda-1    | INFO  2022-01-13 16:07:37,934 [shard 0] cluster - leader_balancer.cc:249 - Error transferring leadership group 2 from {node_id: 2, shard: 0} to {node_id: 0, shard: 0}
redpanda-1    | INFO  2022-01-13 16:07:42,938 [shard 0] cluster - leader_balancer.cc:234 - No leadership balance improvements found with total delta 2.666666666666667, number of muted groups 2
redpanda-1    | INFO  2022-01-13 16:09:42,938 [shard 0] cluster - leader_balancer.cc:365 - Leadership rebalancer muting node 0 last heartbeat 130007 ms
redpanda-1    | INFO  2022-01-13 16:09:42,939 [shard 0] cluster - leader_balancer.cc:234 - No leadership balance improvements found with total delta 3.0, number of muted groups 2

Maybe this issue? That looks like it was fixed in redpanda 21.10.1, but this repo is using 21.9.2; is there a reason not to upgrade? (Latest looks like 21.11.2.)

Then when my test continues, it tries to request the topic metadata again and it just hangs.

I'm going to try something similar with Kafka instead now that Marco's improved the Kafka compatibility, but am I on the right track for creating the conditions we need to test, redpanda bugs notwithstanding? Is there anything else I could try?

@crepererum
Collaborator

@carols10cents it's safe to upgrade redpanda.

To debug client issues, just run the tests with RUST_LOG=trace and --nocapture (I should have documented that).

@carols10cents
Contributor

carols10cents commented Jan 13, 2022

And now all the integration tests on main are failing against Kafka with ServerError(NotController, "") whenever the tests try to create a topic :(

@crepererum
Collaborator

Do you get the NotController error during the normal integration tests, or only when you kill nodes?

@carols10cents
Contributor

Normal integration tests, not touching the docker processes.

@crepererum
Collaborator

Could you please select a failing test and post the FULL output of the following command:

env TEST_INTEGRATION=1 \
    KAFKA_CONNECT=localhost:9093 \
    RUST_BACKTRACE=1 \
    RUST_LOG=trace \
    cargo test -- <the test you chose> --nocapture

I wanna figure out which action fails (backtrace) and what rskafka thinks is going on (logs).

@crepererum
Collaborator

I can reproduce this issue now by using a different port (the 3 Kafka instances are mapped to three different ports). It seems that Client::create_topic is failing. I suspect this request needs to be sent to the cluster controller and not to an arbitrary broker.

I guess you got unlucky: the controller normally happens to be the first node (since it starts first), but your restart tests moved controllership to a different node. I'll look into it.
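
The rough idea for the fix (a sketch only, with made-up types rather than the actual rskafka internals): take the controller id from the Metadata response, route CreateTopics to that broker, and retry the metadata lookup if the request still comes back with NOT_CONTROLLER, since the controller can move between the two requests.

// Hypothetical types standing in for the real rskafka structs.
struct Broker {
    node_id: i32,
    host: String,
    port: i32,
}

struct MetadataResponse {
    controller_id: i32,
    brokers: Vec<Broker>,
}

/// Pick the broker that is currently the cluster controller.
///
/// CreateTopics (and other admin requests) is only accepted by the
/// controller; any other broker answers with NOT_CONTROLLER, which is the
/// error seen above.
fn controller_endpoint(metadata: &MetadataResponse) -> Option<String> {
    metadata
        .brokers
        .iter()
        .find(|broker| broker.node_id == metadata.controller_id)
        .map(|broker| format!("{}:{}", broker.host, broker.port))
}

#[test]
fn finds_controller_among_brokers() {
    let metadata = MetadataResponse {
        controller_id: 1,
        brokers: vec![
            Broker { node_id: 0, host: "kafka-0".to_owned(), port: 9092 },
            Broker { node_id: 1, host: "kafka-1".to_owned(), port: 9092 },
        ],
    };
    assert_eq!(controller_endpoint(&metadata), Some("kafka-1:9092".to_owned()));
}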

@crepererum
Collaborator

#60 should fix this issue.
