
improve files cache memory usage even further #8618

Open
MichaelDietzel opened this issue Jan 2, 2025 · 1 comment

Comments

@MichaelDietzel

Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

None of the above. Probably a feature request.

Your borg version (borg -V).

2.0.0b14

Describe the problem you're observing.

In #8511 the memory usage of the files cache was already greatly reduced by using the IDs provided by the new borghash. If I understand everything correctly, a file's chunk list probably often contains runs of multiple consecutive IDs. Memory usage could therefore be reduced further by storing, alongside each chunk ID, an additional count of how many chunks with consecutive IDs follow it.
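To illustrate the idea (this is just a sketch, not borg code; `rle_encode` and `rle_decode` are hypothetical helper names), runs of consecutive IDs can be collapsed into `(start_id, extra_consecutive)` pairs:

```python
# Hypothetical sketch of the proposal: collapse runs of consecutive
# chunk IDs into (start, extra) pairs, where "extra" counts how many
# consecutive IDs follow the start ID.

def rle_encode(ids):
    """Collapse runs of consecutive integer IDs into (start, extra) pairs."""
    runs = []
    for cid in ids:
        if runs and cid == runs[-1][0] + runs[-1][1] + 1:
            start, extra = runs[-1]
            runs[-1] = (start, extra + 1)  # extend the current run
        else:
            runs.append((cid, 0))          # start a new run
    return runs

def rle_decode(runs):
    """Expand (start, extra) pairs back into the full ID list."""
    return [start + i for start, extra in runs for i in range(extra + 1)]
```

A chunk list like `[10, 11, 12, 20, 21, 5]` would shrink to three pairs instead of six entries; the gain obviously depends on how long the consecutive runs actually are in practice.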

disclaimer

I am not familiar with how to use msgpack efficiently, so here is just one rough idea for implementing this. There are probably better approaches, but I wanted to get a rough estimate of what this could mean for memory usage. (I could find barely any documentation about the inner workings of msgpack, and I did not read the source code to find out exactly what happens. If anyone knows good documentation, I'd be happy for a hint.)

optimization for consecutive chunks

Increase the size of each ID entry from 4 bytes to 5 bytes. The added byte stores how many consecutive chunks follow the given ID, so 0 to 255 following consecutive chunks can be encoded. If a run is longer than that, another ID entry has to be started. This hurts a little, but at that point the reduction in memory usage should already be pretty good.
Some estimates of the change in memory usage, based on the estimate given in #8511 of X + N * 5 bytes, with N being the number of chunks used by a file:

  • Worst case X + N * 6 Bytes. Probably mostly for small files where X is big compared to N * 6.
  • Best case X + N * ~0.02 Bytes
  • When each chunk ID has exactly one consecutive following ID: X + N * 3 Bytes

This assumes that msgpack does not align the data it stores. If it did align 5-byte values at 8-byte addresses, each entry would carry 3 bytes of padding overhead; in that case, maybe storing the count bytes in a separate list would help.
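To make the 5-byte layout concrete, here is a hypothetical packing sketch (`pack_runs`/`unpack_runs` are made-up names) using Python's `struct` module. The `<IB` format packs exactly 5 bytes with no alignment padding, and runs longer than 255 extras are split into multiple entries as described above:

```python
import struct

MAX_RUN = 255  # one byte for the "extra consecutive chunks" count

def pack_runs(runs):
    """Pack (start_id, extra) pairs into 5 bytes each:
    a 4-byte little-endian ID plus a 1-byte count, unaligned."""
    out = bytearray()
    for start, extra in runs:
        while extra > MAX_RUN:
            # run too long for one byte: emit a full entry and continue
            out += struct.pack("<IB", start, MAX_RUN)
            start += MAX_RUN + 1
            extra -= MAX_RUN + 1
        out += struct.pack("<IB", start, extra)
    return bytes(out)

def unpack_runs(data):
    """Inverse of pack_runs: read back the (start_id, extra) pairs."""
    return [struct.unpack_from("<IB", data, off)
            for off in range(0, len(data), 5)]
```

Note this only demonstrates the byte layout; whether msgpack would store such a byte string without extra overhead is exactly the open question above.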

additional optimization for repeated chunks

I assume that besides consecutive IDs there could also be repeated IDs, for example when a disk image with long runs of zeroes (empty space) is stored. So maybe a special case for this could be worthwhile as well: either by storing this information separately, or by "taking some values away" from the additional byte and using them to indicate repeated chunks instead. Example: 255 means 1 additional repetition, 254 means 2 repetitions, 253 means 4 repetitions, 252 means 8 repetitions, ... up to some good value. That way the best case from above would only get slightly worse, while still allowing significant gains for long runs of repeated values.
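A hedged sketch of how decoding such "stolen" count values could look; `REP_BASE = 248` (i.e. reserving the top 8 values for repetitions) is an arbitrary assumption, not something proposed above:

```python
# Hypothetical decoder for the scheme above: count bytes 255, 254, 253, ...
# encode 1, 2, 4, ... repetitions of the same chunk ID (powers of two),
# while lower count values keep their "extra consecutive chunks" meaning.

REP_BASE = 248  # assumption: reserve the top 8 count values for repetitions

def decode_entry(chunk_id, count_byte):
    """Expand one (chunk_id, count_byte) entry into its chunk ID list."""
    if count_byte >= REP_BASE:
        reps = 1 << (255 - count_byte)   # 255 -> 1, 254 -> 2, 253 -> 4, ...
        return [chunk_id] * (reps + 1)   # the ID itself plus its repetitions
    # normal case: chunk_id followed by count_byte consecutive IDs
    return [chunk_id + i for i in range(count_byte + 1)]
```

With this split, consecutive runs would max out at 247 extra chunks per entry instead of 255, which is the "slightly worse best case" traded for the repetition support.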

So I hope my assumptions are correct and this leads to some more nice improvements.

@ThomasWaldmann
Member

Not sure how frequent consecutive runs of the same chunk are in practice.

I guess there is one "popular" case, when long runs of all-zero ranges get chunked. But guess that's all, isn't it? In case of sparse files, such ranges might already get some special treatment depending on the chunker (only the fixed chunker does that).

Guess we need to avoid "optimizing" for a not-that-frequent case - otherwise there might be more overhead for that than we save by it.
