Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flakly half-finish importing cluster test #1532

Open
wants to merge 2 commits into
base: unstable
Choose a base branch
from

Conversation

sarthakaggarwal97
Copy link
Contributor

@sarthakaggarwal97 sarthakaggarwal97 commented Jan 8, 2025

With this, the source node is also informed that the slot ownership has been changed.
100 iterations passed for this test, which were flakily failing before.

Fixes #1526.

Comment on lines 84 to 87
test "Half-finish importing" {
# Now we half finish 'importing' node
assert_equal {OK} [$nodeto(link) cluster setslot 609 node $nodeto(id)]
assert_equal {OK} [$nodefrom(link) cluster setslot 609 node $nodeto(id)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't the point of "Half-finish importing" that we should only finish the importing on the target node, not on the source node, and check that the cluster can handle this? Or what does "half-finish" mean here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@hpatro @zuiderkwast I was not sure if half-finish importing is related to setslots itself or not. Without this, the key was lost in more than a few cases. The output for keys * was empty.
I tried with setting additional timeouts, retries, wait_for_cluster_propagation, it didn't seem to reliably fix it. I can mark this PR as draft, and look more behind the scenes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, yes it would be good to understand better why the keys are lost. $nodeto(link) cluster setslot 609 node $nodeto(id) should make this node bump the epoch and the keys are already migrated to this node, so it seems strange that they are lost. Do you know what the cluster_fix line below does?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dove deep a bit into this.

Do you know what the cluster_fix line below does

I seems to be running the valkey-cli --cluster fix command. Based on documentation ~ in order to fix the cluster so that keys will be migrated according to the hash slots each node is authoritative or not.
Although, it would have run successfully or else the test might have failed with fail "Cluster could not settle with configuration"

I was checking this documentation: https://valkey.io/commands/cluster-setslot/

The order of step 4 and 5 is important. The destination node is responsible for propagating the change to the rest of the cluster.

Looks like one of the nodes was not informed about the slot migration (the test wants to assert during this behaviour only). With this half-finish importing, we are at step 4, when we assert whether the key aga is present or not. When I was locally replicating this, it seems like the request is indeed getting redirected from the non-slot-owning node to the slot-owning node.

╰─ ./src/valkey-cli -c -h 127.0.0.1 -p 7000                                                                                                                                                                                                                      ─╯
127.0.0.1:7000> get aga
-> Redirected to slot [609] located at 127.0.0.1:6379
"xyz"
127.0.0.1:6379> 

$cluster looks like needs to be refreshed to know which is the master node for the slot 609. I am not sure if there is a way that client redirection fails here. It's a bit hard for me to replicate this manually with commands. So far, I have not been able to even after moving the key few times.

The latest change to refresh the cluster before we inform the nodes about the migrated key seems to reliably fix the test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of step 4 and 5 is important. The destination node is responsible for propagating the change to the rest of the cluster.

This is interesting. I wonder what can happen if this is not followed. This test case is obviously not following this rule, so maybe the test case is invalid? Shall we just delete it?

@PingXie Do you remember what happens if you send SETSLOT NODE to the source node only? Do the other nodes gossip back the ownership of the slot to the old owner?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also for commandstats, could you reset after the half-migrating test is completed and then capture it. It's a bit difficult to process the information as it is from the start of the server.

Copy link
Contributor Author

@sarthakaggarwal97 sarthakaggarwal97 Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DEBUG HTSTATS 0 is empty when the test fails

Output:

[Dictionary HT]
[Expires HT]

Command Stats after reset:

Nodeto command stats: # Commandstats
cmdstat_restore-asking:calls=1,usec=12,usec_per_call=12.00,rejected_calls=0,failed_calls=0
cmdstat_debug:calls=1,usec=3,usec_per_call=3.00,rejected_calls=0,failed_calls=0
cmdstat_keys:calls=3,usec=10,usec_per_call=3.33,rejected_calls=0,failed_calls=0
cmdstat_dbsize:calls=2,usec=0,usec_per_call=0.00,rejected_calls=0,failed_calls=0
cmdstat_get:calls=1,usec=1,usec_per_call=1.00,rejected_calls=0,failed_calls=0
cmdstat_asking:calls=1,usec=0,usec_per_call=0.00,rejected_calls=0,failed_calls=0
cmdstat_config|resetstat:calls=1,usec=36,usec_per_call=36.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|slots:calls=1,usec=10,usec_per_call=10.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|nodes:calls=4,usec=93,usec_per_call=23.25,rejected_calls=0,failed_calls=0
cmdstat_cluster|setslot:calls=2,usec=12,usec_per_call=6.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|shards:calls=1,usec=11,usec_per_call=11.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|countkeysinslot:calls=2,usec=0,usec_per_call=0.00,rejected_calls=0,failed_calls=0
cmdstat_cluster|info:calls=1,usec=14,usec_per_call=14.00,rejected_calls=0,failed_calls=0

I don't see a delete call :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While experimenting with the test, this seemed intriguing.

These lines are executed, after the fix_cluster $nodefrom(addr)

    puts $nodeto(addr)
    puts $nodefrom(addr)
    puts [$nodeto(link) get aga]

I got this response

127.0.0.1:21143
127.0.0.1:21144
[exception]: Executing test client: MOVED 609 127.0.0.1:21144.
MOVED 609 127.0.0.1:21144

But HTSTATS for both the nodes aren't showing the key.

Copy link
Collaborator

@hpatro hpatro Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried running it on mac/linux, it never fails for me. Do we need to build it with any particular flags to reproduce it?

Copy link
Contributor Author

@sarthakaggarwal97 sarthakaggarwal97 Jan 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No! Just make and if I run the test in loop for few times. I am on Apple M3 Chip. Are you also checking aarch/arm? (Not sure if this is making a difference.)

@sarthakaggarwal97 sarthakaggarwal97 marked this pull request as draft January 8, 2025 22:44
@enjoy-binbin enjoy-binbin changed the title Fixes Test #1526 Fix flakly half-finish importing cluster test Jan 9, 2025
@sarthakaggarwal97 sarthakaggarwal97 marked this pull request as ready for review January 9, 2025 18:41
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Signed-off-by: Sarthak Aggarwal <sarthagg@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Test Failure] Half-finish importing in tests/unit/cluster/half-migrated-slot.tcl
4 participants