Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: SDK47 Distribution Panic #16940

Closed
shellvish opened this issue Jul 12, 2023 · 14 comments
Closed

[Bug]: SDK47 Distribution Panic #16940

shellvish opened this issue Jul 12, 2023 · 14 comments
Labels

Comments

@shellvish
Copy link

shellvish commented Jul 12, 2023

Summary of Bug

Some addresses on Stride have reported an issue with claiming staking rewards and redelegating tokens. These users were able to claim staking rewards successfully prior to Stride's last upgrade, and the upgrade effectively only upgraded from SDK 46 to SDK 47. We are investigating on our end as well, but I suspect there might be a bug with SDK 47 (or, perhaps, a very precise way that state should be migrated from SDK 46 to 47).

Two examples of users are stride1ct5jdjxwcs2whefy4c4dpuc0t3vuv7ksrnh4j3 and stride1lf87nq0pxw7qzcvvc65evnntd7za6mxw8evpcx.

Importantly, both of these users delegated to a validator that got downtime slashed, restarted their validator, and then started producing blocks again.

Ways to Observe the Issue

Either of the links above link to BigDipper and show “0” staking rewards despite having many tokens staked.

In addition, this tx shows the error msg when a user tries to claim staking rewards. (TLDR is calculated final stake for delegator stride1ct5... greater than current stake, final stake: 0.257945648504322936 current stake: 0.257919854260751778).

Lastly, this cmd line query strided q distribution rewards stride1lf87nq0pxw7qzcvvc65evnntd7za6mxw8evpcx also returns a similar error Error: rpc error: code = Unknown desc = calculated final stake for delegator stride1lf87nq0pxw7qzcvvc65evnntd7za6mxw8evpcx greater than current stake.

My initial guess as to what’s going on is that SDK 47 has some sort of slashing bug, and users are either getting slashed more than they should be, or there’s some accounting bug in the slashing code. It’s possible this error only appears with non-staking based tokens (e.g. native tokens), but extremely unclear right now.

Version

This is occurring on Stride version v11.0.0 which uses Cosmos SDK v0.47.3.

Steps to Reproduce

This tx will show a transaction that is failing. strided q distribution rewards stride1lf87nq0pxw7qzcvvc65evnntd7za6mxw8evpcx --node https://stride-rpc.polkachu.com:443 --chain-id stride-1 will also display the same issue.

Please let me know if I can provide any more detail, or if there's anything I can do to help debug. Thanks so much for the help, it's very much appreciated.

@shellvish
Copy link
Author

@alexanderbez @tac0turtle just tagging you as I think you might be familiar with this part of the codebase, but please let me know if there's anyone else I should reach out to instead!

I really appreciate your help here!

@facundomedica
Copy link
Member

facundomedica commented Jul 12, 2023

I was checking on this and found that both of those accounts are delegating to stridevaloper1tlz6ksce084ndhwlq2usghamvh0dut9q4z2gxd (Crypteloid). All of its delegators are getting the same error, so the clue must be in that validator's slashing history

Here the last successful tx + first failed tx with this error (14 days ago):

https://bigdipper.live/stride/transactions/8EB8195D344435B287BF57A8F0BB7476089F0446C5CB1CE005E4F88BA4DFE939
https://bigdipper.live/stride/transactions/4E9F682D94507EAA50B9E2774652B60B580BDA4DF9E49660F1104EFA797A2FD4

@facundomedica
Copy link
Member

@shellvish it would be helpful to log everything inside CalculateDelegationRewards. So we can see the starting value and all the substractions we do

@alexanderbez
Copy link
Contributor

alexanderbez commented Jul 12, 2023

I would review the validator's slashing history as @facundomedica pointed out. The delta is small, 0.00002579424357, so it could be a rounding issue in stake or something else. This particular area of the module's codebase is very...duct-tapey, so we'll need more insight into the validator's history prior to understanding the root issue.

@facundomedica
Copy link
Member

@shellvish friendly ping :)

@shellvish
Copy link
Author

@alexanderbez @facundomedica Apologies for the delay here! We were a bit sidetracked with some post-ICS cleanup and took a bit longer to dive into this than I would have hoped.. Should be more responsive going forward!

I dug into this a bit more and I think I've narrowed down the issue. Here are logs from the validator in question (in cosmos-sdk/x/distribution/keeper/delegation.go:CalculateDelegationRewards

STARTING HEIGHT: 1160182
ENDING HEIGHT: 4835808

Validator: stridevaloper1tlz6ksce084ndhwlq2usghamvh0dut9q4z2gxd
Height: 3777879
Starting Period: 323
Ending Period: 3113
Stake: 1300000.000000000000000000
Fraction: 0.000099998511687014
===========
Validator: stridevaloper1tlz6ksce084ndhwlq2usghamvh0dut9q4z2gxd
Height: 4005933
Starting Period: 3113
Ending Period: 3469
Stake: 1299870.001934806881800000
Fraction: 0.000099997750168942

This gives a final Token Amount of 1299740.0178591, but the TokensFromShares calculation gives 1299610.04578328, which is almost exactly 1 basis point lower (the slash penalty on Stride).

It seems like the TokensFromShares is factoring in 3 slashes for this validator, but as far as I can tell, only 2 slashes actually occurred. This also matches up with Numia's Bigquery Table of Stride Events, which we can inspect with this query:

SELECT * FROM `numia-data.stride.stride_event_attributes` 
WHERE event_type = "slash" AND attribute_value = 'stridevalcons1a86u9267kgma5g4ptv35fp5pmpq4szmn8s6s94'

I'm still digging into how this "ghost slash" happened, but I'm having a bit of a hard time investigating. I'm curious if either of you might have any theories here?

In terms of solving the issue, we think there are two possible solutions

  1. We raise the threshold of the invariant check above to have a margin of error of ~200ustrd
  2. We fix the Shares > Tokens conversion properly to only reflect two slashes

I think (2) is the optimal route, but it might take some time to do this change correctly without introducing any new bugs (although would fully defer to you both here!). (1) might make sense as a short-term bandaid solution though, in order to allow delegators to this validator to have access to their funds again (right now, any tx or query touching this validator throws a panic, so users can't undelegate their STRD).

Let me know what you think :) And I truly appreciate your help here!

@alexanderbez
Copy link
Contributor

Interesting. So is it possible this "ghost" slash is something that Numia just didn't catch? Are the two known slashes from the current version of Stride? Perhaps the ghost one was from a previous version (i.e. pre-upgrade)?

In any case, how does (2) really solve it? Could you describe that solution in more detail pls?

@shellvish
Copy link
Author

Very possible they didn't catch it! But the logs above show that CalculateDelegationRewards also only thinks there were 2 slashes. I'm not sure if there's a better way to query for the ground truth here, do you know if there's anywhere else I should look for?

I'm not quite sure what (2) would entail, but I'm thinking something like: in an upgrade handler, we manually modify the Token -> Shares conversion to reflect that there were 2 slashes and not 3. This could either be increasing the number of tokens that the validator has staked or decreasing the number of shares in the validator. To be clear though, I'm purely speculating, as I'm quite unfamiliar with this part of the codebase!

@alexanderbez
Copy link
Contributor

Very possible they didn't catch it! But the logs above show that CalculateDelegationRewards also only thinks there were 2 slashes. I'm not sure if there's a better way to query for the ground truth here, do you know if there's anywhere else I should look for?

Yes, that's a good point. The only way that could happen is if there was a traditional export/import upgrade where one record might've been missed or perhaps an upgrade handler that modified these records.

In any case, it is a bit worrisome. This part of the codebase is probably the most sensitive as well (primarily due to sdk.Dec), so your idea makes sense.

@shellvish
Copy link
Author

shellvish commented Aug 8, 2023

Hm, it's possible, but as far as I know, I don't think there was an export/import upgrade or an upgrade handler that modified these records? I just checked the v10 and v11 upgrades, and I don't think they modified the distribution records.

Do you know which of those two approaches to solve the root cause is preferable, e.g. should we increase the number of tokens or decrease the number of shares?

Also, for the short-term, I think we'll raise the threshold of the invariant check above to have a margin of error of ~200ustrd. I just want to confirm with you that you think this is okay / isn't likely to raise any alarm bells? Also happy to chat through this live if that's easier for you!

@alexanderbez
Copy link
Contributor

Do you know which of those two approaches to solve the root cause is preferable, e.g. should we increase the number of tokens or decrease the number of shares?

I would probably keep the tokens untouched, and modify the shares to have parity with the tokens.

Yeah, overall this is pretty weird. Typically, what we'd do to debug this is log all related slashing/distribution actions related to a specific validator and see what we can infer from that. Given that you're already at v11, this might be super tedious. So debugging this is gonna be a challenge.

@facundomedica
Copy link
Member

Stride added a fix for the specific validator Stride-Labs@1bc120b

@tac0turtle
Copy link
Member

closing for now, i believe this could be related to custom logic in a fork

@angelorc
Copy link

angelorc commented Dec 3, 2024

Hello everyone, I would like to reopen this issue. The bug has not been corrected and also bitsong is now affected by this problem. It seems that there are "ghost slash" that have not been taken into consideration.

#22733

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants