
Test Retry Logic #49

Open

tustvold opened this issue Jan 11, 2022 · 8 comments

Comments

@tustvold
Contributor

Currently there is no real testing of the retry logic, in particular leader failover, connection errors, etc. This should be addressed.

@carols10cents
Contributor

I'm trying to cause leader failover manually in redpanda to just observe what happens. I have a test where I'm creating 2 topics with replication 3, and I can see each topic has a different leader and has all 3 redpanda instances as replicas.

Then I put a tokio::time::sleep in, during which I go over to the Docker dashboard and stop redpanda-0.
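
Roughly, the test looks like this (just a sketch, not the exact code; the ClientBuilder, controller_client().create_topic(...) and list_topics() names here are from the current rskafka API surface and the exact signatures may not match the version in this repo):

use std::time::Duration;
use rskafka::client::ClientBuilder;

#[tokio::test]
async fn observe_leader_failover() {
    // Bootstrap against one of the three redpanda brokers from the
    // docker-compose setup (the port here is a placeholder).
    let client = ClientBuilder::new(vec!["localhost:9093".to_owned()])
        .build()
        .await
        .unwrap();

    // Two topics, one partition each, replication factor 3, so every broker
    // holds a replica and the topics can end up with different leaders.
    let controller = client.controller_client().unwrap();
    controller.create_topic("test_topic_a", 1, 3, 5_000).await.unwrap();
    controller.create_topic("test_topic_b", 1, 3, 5_000).await.unwrap();

    // Pause long enough to stop redpanda-0 from the Docker dashboard
    // (or docker stop redpanda-0) and force a leader election.
    tokio::time::sleep(Duration::from_secs(60)).await;

    // After failover this metadata request should be retried against the
    // surviving brokers rather than hanging on the stopped one.
    let topics = client.list_topics().await.unwrap();
    assert!(topics.iter().any(|t| t.name == "test_topic_a"));
}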

In the docker logs for redpanda-1 and redpanda-2, I'm seeing errors about leader election:

Redpanda logs
redpanda-0    | INFO  2022-01-13 16:06:57,473 [shard 0] redpanda::main - application.cc:175 - Stopping...
redpanda-0    | INFO  2022-01-13 16:06:57,473 [shard 0] rpc - server.cc:166 - kafka rpc protocol - Stopping 1 listeners
redpanda-0    | INFO  2022-01-13 16:06:57,473 [shard 0] rpc - server.cc:175 - kafka rpc protocol - Shutting down 1 connections
redpanda-0    | INFO  2022-01-13 16:06:57,473 [shard 0] rpc - server.cc:166 - vectorized internal rpc protocol - Stopping 1 listeners
redpanda-0    | INFO  2022-01-13 16:06:57,473 [shard 0] rpc - server.cc:175 - vectorized internal rpc protocol - Shutting down 1 connections
redpanda-0    | ERROR 2022-01-13 16:06:57,473 [shard 0] cluster - controller_backend.cc:345 - Error while reconciling topics - seastar::abort_requested_exception (abort requested)
redpanda-0    | INFO  2022-01-13 16:06:57,475 [shard 0] cluster - state_machine.cc:139 - State machine for ntp={redpanda/controller/0} handles raft::offset_monitor::wait_aborted (offset monitor wait aborted)
redpanda-0    | INFO  2022-01-13 16:06:57,475 [shard 0] cluster - members_manager.cc:290 - stopping cluster::members_manager...
redpanda-0    | INFO  2022-01-13 16:06:57,475 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:158 - Stopping
redpanda-0    | INFO  2022-01-13 16:06:57,475 [shard 0] raft - [group_id:1, {kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0}] consensus.cc:158 - Stopping
redpanda-0    | INFO  2022-01-13 16:06:57,475 [shard 0] raft - [group_id:2, {kafka/test_topic_f8e9a618-5a15-4214-8398-ab18357ba8c9/0}] consensus.cc:158 - Stopping
redpanda-0    | INFO  2022-01-13 16:06:57,481 [shard 0] kvstore - kvstore.cc:105 - Stopping kvstore: dir /var/lib/redpanda/data/redpanda/kvstore/0_0
redpanda-2    | WARN  2022-01-13 16:06:57,490 [shard 0] rpc - error connecting to 172.19.0.4:33145 - std::__1::system_error (error system:111, Connection refused)
redpanda-0 exited with code 0
redpanda-1    | INFO  2022-01-13 16:07:02,911 [shard 0] cluster - leader_balancer.cc:92 - Leader balancer: controller leadership lost
redpanda-2    | INFO  2022-01-13 16:07:02,923 [shard 0] raft - [group_id:0, {redpanda/controller/0}] consensus.cc:1356 - Received vote request with larger term from node {id: {1}, revision: {0}}, received 2, current 1
redpanda-1    | INFO  2022-01-13 16:07:02,935 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:76 - vote reply from {id: {2}, revision: {0}} - {term:{2}, target_node_id{id: {1}, revision: {0}}, vote_granted: 1, log_ok:1}
redpanda-1    | INFO  2022-01-13 16:07:02,935 [shard 0] raft - [group_id:0, {redpanda/controller/0}] vote_stm.cc:246 - became the leader term:2
redpanda-1    | INFO  2022-01-13 16:07:02,935 [shard 0] cluster - leader_balancer.cc:86 - Leader balancer: controller leadership detected. Starting rebalancer in 30 seconds
redpanda-1    | INFO  2022-01-13 16:07:02,939 [shard 0] storage - segment.cc:628 - Creating new segment /var/lib/redpanda/data/redpanda/controller/0_0/14-2-v1.log
redpanda-2    | INFO  2022-01-13 16:07:03,005 [shard 0] cluster - leader_balancer.cc:92 - Leader balancer: controller leadership lost
redpanda-2    | INFO  2022-01-13 16:07:03,008 [shard 0] storage - segment.cc:628 - Creating new segment /var/lib/redpanda/data/redpanda/controller/0_0/14-2-v1.log
redpanda-1    | INFO  2022-01-13 16:07:03,293 [shard 0] raft - [group_id:1, {kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0}] consensus.cc:1356 - Received vote request with larger term from node {id: {2}, revision: {11}}, received 2, current 1
redpanda-2    | INFO  2022-01-13 16:07:03,305 [shard 0] raft - [group_id:1, {kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0}] vote_stm.cc:76 - vote reply from {id: {1}, revision: {11}} - {term:{2}, target_node_id{id: {2}, revision: {11}}, vote_granted: 1, log_ok:1}
redpanda-2    | INFO  2022-01-13 16:07:03,305 [shard 0] raft - [group_id:1, {kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0}] vote_stm.cc:246 - became the leader term:2
redpanda-2    | INFO  2022-01-13 16:07:03,319 [shard 0] storage - segment.cc:628 - Creating new segment /var/lib/redpanda/data/kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0_11/1-2-v1.log
redpanda-1    | INFO  2022-01-13 16:07:03,356 [shard 0] storage - segment.cc:628 - Creating new segment /var/lib/redpanda/data/kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0_11/1-2-v1.log
redpanda-2    | INFO  2022-01-13 16:07:32,930 [shard 0] raft - [group_id:1, {kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0}] consensus.cc:2385 - Starting leadership transfer from {id: {2}, revision: {11}} to {id: {0}, revision: {11}} in term 2
redpanda-2    | WARN  2022-01-13 16:07:32,930 [shard 0] raft - [group_id:1, {kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0}] consensus.cc:2437 - Waiting on node to recover before requesting election
redpanda-2    | WARN  2022-01-13 16:07:32,932 [shard 0] raft - [group_id:1, {kafka/test_topic_60dcf21b-41e9-4f5f-8c7a-123f84e2f0e6/0}] consensus.cc:2305 - Unable to transfer leadership to {0}: raft::errc::timeout
redpanda-1    | INFO  2022-01-13 16:07:32,932 [shard 0] cluster - leader_balancer.cc:629 - Leadership transfer of group 1 failed with error: raft::errc:10
redpanda-1    | INFO  2022-01-13 16:07:32,932 [shard 0] cluster - leader_balancer.cc:249 - Error transferring leadership group 1 from {node_id: 2, shard: 0} to {node_id: 0, shard: 0}
redpanda-2    | INFO  2022-01-13 16:07:37,933 [shard 0] raft - [group_id:2, {kafka/test_topic_f8e9a618-5a15-4214-8398-ab18357ba8c9/0}] consensus.cc:2385 - Starting leadership transfer from {id: {2}, revision: {13}} to {id: {0}, revision: {13}} in term 1
redpanda-2    | WARN  2022-01-13 16:07:37,933 [shard 0] raft - [group_id:2, {kafka/test_topic_f8e9a618-5a15-4214-8398-ab18357ba8c9/0}] consensus.cc:2305 - Unable to transfer leadership to {0}: rpc::errc::exponential_backoff
redpanda-1    | INFO  2022-01-13 16:07:37,934 [shard 0] cluster - leader_balancer.cc:629 - Leadership transfer of group 2 failed with error: raft::errc:0
redpanda-1    | INFO  2022-01-13 16:07:37,934 [shard 0] cluster - leader_balancer.cc:249 - Error transferring leadership group 2 from {node_id: 2, shard: 0} to {node_id: 0, shard: 0}
redpanda-1    | INFO  2022-01-13 16:07:42,938 [shard 0] cluster - leader_balancer.cc:234 - No leadership balance improvements found with total delta 2.666666666666667, number of muted groups 2
redpanda-1    | INFO  2022-01-13 16:09:42,938 [shard 0] cluster - leader_balancer.cc:365 - Leadership rebalancer muting node 0 last heartbeat 130007 ms
redpanda-1    | INFO  2022-01-13 16:09:42,939 [shard 0] cluster - leader_balancer.cc:234 - No leadership balance improvements found with total delta 3.0, number of muted groups 2

Maybe this issue? That looks like it was fixed in redpanda 21.10.1, but this repo is using 21.9.2; is there a reason not to upgrade? (Latest looks like 21.11.2.)

Then when my test continues, it tries to request the topic metadata again and it just hangs.

I'm going to try something similar with Kafka instead now that Marco's improved the Kafka compatibility, but am I on the right track for creating the conditions we need to test, redpanda bugs notwithstanding? Is there anything else I could try?

@crepererum
Collaborator

@carols10cents it's safe to upgrade redpanda.

To debug client issues, just run the tests with RUST_LOG=trace and --nocapture (I should have documented that).

@carols10cents
Contributor

carols10cents commented Jan 13, 2022

And now all the integration tests on main are failing against Kafka with ServerError(NotController, "") whenever the tests try to create a topic :(

@crepererum
Collaborator

Do you get the NotController error during the normal integration tests, or only when you kill nodes?

@carols10cents
Contributor

Normal integration tests, not touching the docker processes.

@crepererum
Collaborator

Could you please select a failing test and post the FULL output of the following command:

env TEST_INTEGRATION=1 \
    KAFKA_CONNECT=localhost:9093 \
    RUST_BACKTRACE=1 \
    RUST_LOG=trace \
    cargo test -- <the test you chose> --nocapture

I wanna figure out which action fails (backtrace) and what rskafka thinks is going on (logs).

@crepererum
Collaborator

I can reproduce this issue now by using a different port (the 3 Kafka instances are mapped to three different ports). It seems that Client::create_topic is failing. I suspect this request needs to be sent to the cluster controller and not to an arbitrary broker.

I guess you got unlucky: the controller normally happens to be the first node (since it starts first), but your restart tests moved controllership to a different node. I'll look into it.
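
The rough idea for the fix (a sketch only, with made-up types rather than the actual rskafka internals): take the controller id from the Metadata response, route CreateTopics to that broker, and retry the metadata lookup if the request still comes back with NOT_CONTROLLER, since the controller can move between the two requests.

// Hypothetical types standing in for the real rskafka structs.
struct Broker {
    node_id: i32,
    host: String,
    port: i32,
}

struct MetadataResponse {
    controller_id: i32,
    brokers: Vec<Broker>,
}

/// Pick the broker that is currently the cluster controller.
///
/// CreateTopics (and other admin requests) is only accepted by the
/// controller; any other broker answers with NOT_CONTROLLER, which is the
/// error seen above.
fn controller_endpoint(metadata: &MetadataResponse) -> Option<String> {
    metadata
        .brokers
        .iter()
        .find(|broker| broker.node_id == metadata.controller_id)
        .map(|broker| format!("{}:{}", broker.host, broker.port))
}

#[test]
fn finds_controller_among_brokers() {
    let metadata = MetadataResponse {
        controller_id: 1,
        brokers: vec![
            Broker { node_id: 0, host: "kafka-0".to_owned(), port: 9092 },
            Broker { node_id: 1, host: "kafka-1".to_owned(), port: 9092 },
        ],
    };
    assert_eq!(controller_endpoint(&metadata), Some("kafka-1:9092".to_owned()));
}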

@crepererum
Collaborator

#60 should fix this issue.
