Lighthouse space usage - Increasing dramatically from 87GB to 430GB in 7 days #4773
If you still have any oversized nodes (or if it happens again) there are a couple of things you can try: running a command to check whether the database is affected, and if it is, running a follow-up command to clean it up. We haven't identified the bug that causes this, but have had a few reports of it. The database schema is getting completely overhauled in a few versions' time, with a new system that prunes states more aggressively and so far has not shown any signs of blowing up in the same way (see the tree states releases if you're curious). Still, if we find the bug in the current DB it would be nice to fix it. Debug logs from an affected node would be helpful (DM me on Discord).
Sure! I still have that oversized node around to help debug this issue. Here are the results you asked for:
I'm not sure whether I'm executing this second one correctly:
Perfect, I will contact you on Discord.
Looks like you've got the command right, but you need to stop the BN while you run it. It will produce quite a lot of output, so you might also want to pipe it to a file. I'm AFK from Discord for the weekend but will get in touch Monday :)
Thanks, I can see what the issue is now:
The tree states release I mentioned should alleviate the latter point, because it results in vastly less data being stored on disk. However, it would be good to understand why the database is erroring, as I haven't seen this error before. Strangely, the block that's missing from your DB isn't one that I can find: it's not part of the canonical chain and it wasn't seen by the couple of nodes I spot-checked (I'll check more thoroughly tomorrow). It would be great if you could provide logs from around the time the database started growing, as this might lead us to the mystery block and the root cause. For the best chance of success it would be great to have debug-level logs, which should have been written by default to your datadir. Thanks!
Good to know, thanks! Unfortunately, I only have logs from September 20th onwards, because of the internal log rotation, I would say...
Any info logs from earlier that might have been retained by Docker or the OS?
I have checked and it is not possible either; they have been rotated :(
That's ok, thanks for your efforts! I'll see if I can dig up that block from somewhere and see if it provides a clue about what went wrong.
If the same issue shows up on any of my Lighthouse nodes, I will copy the logs before they are rotated @michaelsproul
Hey @luarx I did some investigating and found a few interesting things:
Anyway, the details of the block's publishing probably aren't too important, because the issue is how it was processed by your node. To get that error, the database pruning process must have hit this error branch: lighthouse/beacon_node/beacon_chain/src/migrate.rs, lines 487 to 496 (commit 441fc16).
The pruning is trying to fetch a block which exists in the head tracker but is missing from the database.
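As an illustration of that failure mode, here is a minimal sketch in Rust. It is not the actual Lighthouse code; `Store`, `HeadTracker`, `PruneError`, and the flat in-memory layout are hypothetical stand-ins for the real types, and only the shape of the error path is meant to match the description above.

```rust
use std::collections::HashMap;

type Hash256 = [u8; 32];

#[derive(Debug)]
enum PruneError {
    MissingHeadBlock(Hash256),
}

struct Store {
    blocks: HashMap<Hash256, Vec<u8>>, // block root -> serialized block bytes
}

struct HeadTracker {
    heads: Vec<Hash256>, // roots of all known chain heads
}

fn prune(store: &Store, head_tracker: &HeadTracker) -> Result<(), PruneError> {
    for head_root in &head_tracker.heads {
        // If the head tracker references a block that is not on disk, pruning
        // aborts here: nothing gets deleted and the database keeps growing.
        let _block = store
            .blocks
            .get(head_root)
            .ok_or(PruneError::MissingHeadBlock(*head_root))?;
        // ...otherwise, walk back from this head and mark abandoned data for deletion...
    }
    Ok(())
}

fn main() {
    let store = Store { blocks: HashMap::new() };
    let head_tracker = HeadTracker { heads: vec![[0u8; 32]] };
    // With an empty store, this reproduces the "missing head block" error path.
    println!("{:?}", prune(&store, &head_tracker));
}
```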
There are only two places where we add a block to the head tracker. The main one is here, after importing the block: lighthouse/beacon_node/beacon_chain/src/beacon_chain.rs, lines 3105 to 3106 (441fc16).
Notably this happens after the block is written to disk, here: lighthouse/beacon_node/beacon_chain/src/beacon_chain.rs, lines 3024 to 3028 (441fc16).
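To make the ordering argument concrete, here is a rough sketch (again with hypothetical types, not the real import code) of how the two steps relate:

```rust
use std::collections::{HashMap, HashSet};

type Hash256 = [u8; 32];

#[derive(Default)]
struct Store {
    blocks: HashMap<Hash256, Vec<u8>>,
}

#[derive(Default)]
struct HeadTracker {
    heads: HashSet<Hash256>,
}

fn import_block(store: &mut Store, head_tracker: &mut HeadTracker, root: Hash256, block: Vec<u8>) {
    // Step 1: persist the block to the database.
    store.blocks.insert(root, block);

    // A crash between these two steps leaves the block on disk but absent from
    // the head tracker, which is harmless for pruning. The reverse situation
    // (in the head tracker but not on disk) cannot arise on this path.

    // Step 2: register the block as a potential chain head.
    head_tracker.heads.insert(root);
}

fn main() {
    let mut store = Store::default();
    let mut head_tracker = HeadTracker::default();
    import_block(&mut store, &mut head_tracker, [1u8; 32], vec![0xde, 0xad]);
    assert!(store.blocks.contains_key(&[1u8; 32]));
    assert!(head_tracker.heads.contains(&[1u8; 32]));
}
```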
So on this code path it seems like it would be impossible to end up in a situation where the block is in the head tracker but not written to disk. Even if the node crashed or was SIGKILLed in between those two calls, the result would be that the block was on disk and not in the head tracker. So where's the other place we add to the head tracker? It's here, in the client builder (used when reverting the head at startup): lighthouse/beacon_node/beacon_chain/src/builder.rs, lines 680 to 697 (441fc16).
I doubt that your node could have hit this, because it hadn't missed the most recent fork (Capella, which happened back in April). It could have tried to run that revert if the head block was corrupt on disk, but even if it did, it wouldn't have selected 7323584, because of the condition here: lighthouse/beacon_node/beacon_chain/src/fork_revert.rs, lines 57 to 58 (441fc16).
So, in terms of entries being added to the head tracker without the block existing on disk, I think we're in the clear. The other option is that a block was deleted from disk but not deleted from the head tracker. The head tracker entries are removed during pruning, here: lighthouse/beacon_node/beacon_chain/src/migrate.rs, lines 629 to 631 (441fc16).
Those head tracker deletions get added to an atomic I/O batch here, together with the block deletions: lighthouse/beacon_node/beacon_chain/src/migrate.rs, lines 651 to 659 (441fc16).
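A stripped-down sketch of that batching pattern follows (hypothetical types; the real code goes through Lighthouse's store abstraction and a LevelDB write batch rather than an in-memory map):

```rust
use std::collections::{HashMap, HashSet};

type Hash256 = [u8; 32];

enum StoreOp {
    DeleteBlock(Hash256),
    DeleteHeadTrackerEntry(Hash256),
}

#[derive(Default)]
struct InMemoryStore {
    blocks: HashMap<Hash256, Vec<u8>>,
    heads: HashSet<Hash256>,
}

impl InMemoryStore {
    // Stand-in for an atomic write batch: the property relied upon is that the
    // on-disk store commits the whole batch or none of it.
    fn do_atomically(&mut self, batch: Vec<StoreOp>) {
        for op in batch {
            match op {
                StoreOp::DeleteBlock(root) => {
                    self.blocks.remove(&root);
                }
                StoreOp::DeleteHeadTrackerEntry(root) => {
                    self.heads.remove(&root);
                }
            }
        }
    }
}

fn prune_abandoned_heads(store: &mut InMemoryStore, abandoned: &[Hash256]) {
    let mut batch = Vec::new();
    for root in abandoned {
        // Both deletions for a given head go into the same batch, so the head
        // tracker should never end up referencing a block the batch deleted.
        batch.push(StoreOp::DeleteBlock(*root));
        batch.push(StoreOp::DeleteHeadTrackerEntry(*root));
    }
    store.do_atomically(batch);
}

fn main() {
    let mut store = InMemoryStore::default();
    store.blocks.insert([2u8; 32], vec![1, 2, 3]);
    store.heads.insert([2u8; 32]);
    prune_abandoned_heads(&mut store, &[[2u8; 32]]);
    assert!(store.blocks.is_empty() && store.heads.is_empty());
}
```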
If LevelDB (and the filesystem) uphold their ACID guarantees, then this batch should happen atomically: either the blocks & head tracker entries are both deleted, or neither are. Even though we release the lock on the head tracker early, I don't think that undermines the atomicity of the batch itself. The other possibility is that blocks are deleted somewhere else entirely, but this is not the case: the only place blocks get deleted is via that pruning batch. Therefore, I'm a bit stumped. For the bug to occur, one of the assumptions I've made must have been violated, e.g. the filesystem not actually applying the batch atomically.
Even if we don't understand exactly how it happened, we could still try to mitigate the impact by making the pruning process ignore this kind of error. We could simply log a warning whenever it is detected, rather than stalling pruning indefinitely (which is what leads to the database growth); a rough sketch of this idea is included below the TL;DR. The outcome would be a little bit of database bloat, due to states/blocks that weren't fully cleaned up, but I think that's better than the alternative. This is probably a good harm-minimisation approach, particularly if we discover that this failure is quite likely on NFS. Long-term we could also mitigate it by:
TL;DR: I haven't been able to find an obvious Lighthouse bug, and suspect foul play from the filesystem, possibly NFS.
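For concreteness, here is a hedged sketch of the mitigation described above: tolerate a head tracker entry whose block is missing, warn, and keep pruning the remaining heads. The names are hypothetical, and this is not necessarily how the eventual fix is implemented.

```rust
use std::collections::HashMap;

type Hash256 = [u8; 32];

struct Store {
    blocks: HashMap<Hash256, Vec<u8>>,
}

fn prune_lenient(store: &Store, heads: &[Hash256]) {
    for head_root in heads {
        match store.blocks.get(head_root) {
            Some(_block) => {
                // ...prune this head's abandoned chain as normal...
            }
            None => {
                // Previously a hard error here stalled pruning indefinitely and
                // let the database grow; instead, warn and keep going, accepting
                // a small amount of leftover data on disk.
                eprintln!(
                    "WARN: head tracker references missing block {:02x?}; skipping",
                    &head_root[..4]
                );
            }
        }
    }
}

fn main() {
    let store = Store { blocks: HashMap::new() };
    prune_lenient(&store, &[[3u8; 32]]);
}
```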
Amazing, thank you for taking the time to debug this and for sharing what you found in such a detailed explanation! 🙌 🙌 🙌
Unfortunately I can't help much more on this, as it's internal Lighthouse logic that escapes my control... but I do agree that a good solution, at least, is to make the pruning process ignore this kind of error to prevent the volume usage increase 👀
Just curious, do you think it could be a bug introduced in v4.1.0 and fixed in more recent versions, @michaelsproul?
No, I don't think that's likely, as most of this code hasn't changed since v4.1.0. I would still recommend upgrading though, as there's a small chance it will help.
Are you using Docker or Kubernetes? I'm trying to understand whether there could be some extra layer between Lighthouse and AWS EBS that is buggy.
I am using Kubernetes 🔥
I had a similar problem on my prater machine. Before resyncing:
About 24hrs after resyncing started and the loading of the historical data was completed
So it would appear a "regular" X86 Ubuntu machine can run into this problem:
Sadly, I didn't save any logs so I don't have any other information, but let me know if there's any testing or debugging you'd like me to do. I vote this bug be fixed ASAP :)
Had another report of this on GCP (Google Cloud Platform) using ext4 as the filesystem. Were you also using GCP, @luarx?
No. And not only that, a different system ran out of disk space yesterday :(
I removed the beacon directory and did a checkpoint sync, and it's running fine now with 78% available.
I'll add this to my TODO list for next week. Just getting stuck in after the holidays, but can probably squeeze a fix into v4.6.0 proper.
I just saw the announcement for the Experimental 4.6.111 release; is your proposed fix in that release?
No, a different cloud provider 👀
No, it isn't. The database is substantially different in that release, but I think it could still be affected by this issue, seeing as we don't actually know the root cause. It doesn't contain the fix that I'm proposing: to ignore the database inconsistency when it happens, in order to avoid the space blow-up.
@michaelsproul, maybe you should add it to 4.6.111 and I'll test it on my prater machine that has failed. Has anyone else besides @luarx and me seen this problem?
@winksaville 4.6.111 requires a resync at the moment, so I'd rather get you to test a PR off stable. I think a few other people on Discord have hit this issue as well.
OK
Resolved via #5084
So it seems this issue is the non-trivial manifestation that #1557 has been waiting for? ;)
@adaszko Exactly, yes! I'd forgotten about that issue, thanks for making the link! We think there are still issues remaining in the head tracker impl, so my resolution now is to delete it entirely.
Is there another issue tracking the remaining issues? What's the expected time frame for #5084 to be incorporated into a release?
@winksaville There's one here:
Another (milder) atomicity issue that we decided not to fix is described in the fix PR:
There's also more discussion on the issue about deleting the head tracker:
Looking forward to seeing the fix in a new version, as we are still suffering from this bug on v4.6.0 😞 Could a hotfix version be created, at least with #5084? 🙏
@luarx The fix is already in v4.6
Did you have a node that wasn't already corrupt become corrupt while running 4.6? If so, can you please share logs? The fix will not un-corrupt an already corrupt database. For that you need to re-sync.
Regarding your questions:
Are you sure, @michaelsproul? I see that it is only mentioned in https://github.com/sigp/lighthouse/releases/tag/v4.6.222-exp
Yeah, it just got left out of the release notes for 4.6; I'll add it now. You can see all the tags that a commit is part of by clicking on it: 585124f. That commit is in v4.6.0 and v4.6.222-exp.
It could be... next time we will try to keep the logs so we can debug!
Description
We are running several Lighthouse clients, and I have noticed that on about 10% of them there are occasions where the volume usage increases 5x in 7 days!
My temporary solution is to delete the volume data and resync; after that the Lighthouse client is stable again. But some weeks later, a different Lighthouse client repeats this weird behaviour...
Version
Lighthouse v4.1.0 (docker image)
Present Behaviour
The volume usage increases 5x in 7 days
Params:
Expected Behaviour
The volume usage should not increase so heavily in a few days; it should stay at a relatively stable level.
Steps to resolve
Prevent the volume usage from increasing so heavily in a few days.