-
Notifications
You must be signed in to change notification settings - Fork 703
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix flakly half-finish importing cluster test #1532
base: unstable
Are you sure you want to change the base?
Fix flakly half-finish importing cluster test #1532
Conversation
test "Half-finish importing" { | ||
# Now we half finish 'importing' node | ||
assert_equal {OK} [$nodeto(link) cluster setslot 609 node $nodeto(id)] | ||
assert_equal {OK} [$nodefrom(link) cluster setslot 609 node $nodeto(id)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't the point of "Half-finish importing" that we should only finish the importing on the target node, not on the source node, and check that the cluster can handle this? Or what does "half-finish" mean here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hpatro @zuiderkwast I was not sure if half-finish importing is related to setslots itself or not. Without this, the key was lost
in more than a few cases. The output for keys *
was empty.
I tried with setting additional timeouts, retries, wait_for_cluster_propagation
, it didn't seem to reliably fix it. I can mark this PR as draft, and look more behind the scenes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, yes it would be good to understand better why the keys are lost. $nodeto(link) cluster setslot 609 node $nodeto(id)
should make this node bump the epoch and the keys are already migrated to this node, so it seems strange that they are lost. Do you know what the cluster_fix
line below does?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dove deep a bit into this.
Do you know what the cluster_fix line below does
I seems to be running the valkey-cli --cluster fix
command. Based on documentation ~ in order to fix the cluster so that keys will be migrated according to the hash slots each node is authoritative or not.
Although, it would have run successfully or else the test might have failed with fail "Cluster could not settle with configuration"
I was checking this documentation: https://valkey.io/commands/cluster-setslot/
The order of step 4 and 5 is important. The destination node is responsible for propagating the change to the rest of the cluster.
Looks like one of the nodes was not informed about the slot migration (the test wants to assert during this behaviour only). With this half-finish importing, we are at step 4, when we assert whether the key aga
is present or not. When I was locally replicating this, it seems like the request is indeed getting redirected from the non-slot-owning
node to the slot-owning
node.
╰─ ./src/valkey-cli -c -h 127.0.0.1 -p 7000 ─╯
127.0.0.1:7000> get aga
-> Redirected to slot [609] located at 127.0.0.1:6379
"xyz"
127.0.0.1:6379>
$cluster
looks like needs to be refreshed to know which is the master node for the slot 609. I am not sure if there is a way that client redirection fails here. It's a bit hard for me to replicate this manually with commands. So far, I have not been able to even after moving the key few times.
The latest change to refresh the cluster before we inform the nodes about the migrated key seems to reliably fix the test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order of step 4 and 5 is important. The destination node is responsible for propagating the change to the rest of the cluster.
This is interesting. I wonder what can happen if this is not followed. This test case is obviously not following this rule, so maybe the test case is invalid? Shall we just delete it?
@PingXie Do you remember what happens if you send SETSLOT NODE to the source node only? Do the other nodes gossip back the ownership of the slot to the old owner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also for commandstats, could you reset after the half-migrating test is completed and then capture it. It's a bit difficult to process the information as it is from the start of the server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DEBUG HTSTATS 0
is empty when the test fails
Output:
[Dictionary HT]
[Expires HT]
Command Stats after reset:
Nodeto command stats: # Commandstats
cmdstat_restore-asking:calls=1,usec=12,usec_per_call=12.00,rejected_calls=0,failed_calls=0
cmdstat_debug:calls=1,usec=3,usec_per_call=3.00,rejected_calls=0,failed_calls=0
cmdstat_keys:calls=3,usec=10,usec_per_call=3.33,rejected_calls=0,failed_calls=0
cmdstat_dbsize:calls=2,usec=0,usec_per_call=0.00,rejected_calls=0,failed_calls=0
cmdstat_get:calls=1,usec=1,usec_per_call=1.00,rejected_calls=0,failed_calls=0
cmdstat_asking:calls=1,usec=0,usec_per_call=0.00,rejected_calls=0,failed_calls=0
cmdstat_config|resetstat:calls=1,usec=36,usec_per_call=36.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|slots:calls=1,usec=10,usec_per_call=10.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|nodes:calls=4,usec=93,usec_per_call=23.25,rejected_calls=0,failed_calls=0
cmdstat_cluster|setslot:calls=2,usec=12,usec_per_call=6.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|shards:calls=1,usec=11,usec_per_call=11.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|countkeysinslot:calls=2,usec=0,usec_per_call=0.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|info:calls=1,usec=14,usec_per_call=14.00,rejected_calls=0,failed_calls=0
I don't see a delete call :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While experimenting with the test, this seemed intriguing.
These lines are executed, after the fix_cluster $nodefrom(addr)
puts $nodeto(addr)
puts $nodefrom(addr)
puts [$nodeto(link) get aga]
I got this response
127.0.0.1:21143
127.0.0.1:21144
[exception]: Executing test client: MOVED 609 127.0.0.1:21144.
MOVED 609 127.0.0.1:21144
But HTSTATS
for both the nodes aren't showing the key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried running it on mac/linux, it never fails for me. Do we need to build it with any particular flags to reproduce it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No! Just make
and if I run the test in loop for few times. I am on Apple M3 Chip
. Are you also checking aarch/arm? (Not sure if this is making a difference.)
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
40fb786
to
71861a6
Compare
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
71861a6
to
489b38e
Compare
With this, the source node is also informed that the slot ownership has been changed.
100 iterations passed for this test, which were flakily failing before.
Fixes #1526.