
improve files cache memory usage even further #8618

Open
MichaelDietzel opened this issue Jan 2, 2025 · 1 comment

Comments

@MichaelDietzel

Have you checked borgbackup docs, FAQ, and open GitHub issues?

Yes

Is this a BUG / ISSUE report or a QUESTION?

None of the above. Probably a feature request.

Your borg version (borg -V).

2.0.0b14

Describe the problem you're observing.

In #8511 the memory usage of the files cache was already greatly reduced by using the IDs provided by the new borghash. If I understand everything correctly, a file's chunk list probably often contains runs of multiple consecutive IDs. Memory usage could therefore be reduced further by storing, alongside each chunk ID, an additional count of how many chunks with consecutive IDs follow it.
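To illustrate the idea (this is just a sketch, not borg code; `rle_encode` and `rle_decode` are hypothetical helper names), runs of consecutive IDs can be collapsed into `(start_id, extra_consecutive)` pairs:

```python
# Hypothetical sketch of the proposal: collapse runs of consecutive
# chunk IDs into (start, extra) pairs, where "extra" counts how many
# consecutive IDs follow the start ID.

def rle_encode(ids):
    """Collapse runs of consecutive integer IDs into (start, extra) pairs."""
    runs = []
    for cid in ids:
        if runs and cid == runs[-1][0] + runs[-1][1] + 1:
            start, extra = runs[-1]
            runs[-1] = (start, extra + 1)  # extend the current run
        else:
            runs.append((cid, 0))          # start a new run
    return runs

def rle_decode(runs):
    """Expand (start, extra) pairs back into the full ID list."""
    return [start + i for start, extra in runs for i in range(extra + 1)]
```

A chunk list like `[10, 11, 12, 20, 21, 5]` would shrink to three pairs instead of six entries; the gain obviously depends on how long the consecutive runs actually are in practice.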

disclaimer

I am not familiar with how to use msgpack efficiently, so here is just one rough idea for implementing this. There are probably better approaches, but I wanted to get a rough estimate of what this could mean for memory usage. (I could find barely any documentation about the inner workings of msgpack, and I did not read the source code to find out exactly what happens. If anyone knows good documentation, I'd be happy for a hint.)

optimization for consecutive chunks

Increase the size of each ID entry from 4 bytes to 5 bytes. The added byte stores how many consecutive chunks follow the given ID, so 0 to 255 following consecutive chunks can be encoded. If a run is longer than that, another ID entry has to be started. This hurts a little, but at that point the reduction in memory usage should already be pretty good.
Some estimates of the change in memory usage, based on the estimate given in #8511 of X + N * 5 bytes, with N being the number of chunks used by a file:

  • Worst case X + N * 6 Bytes. Probably mostly for small files where X is big compared to N * 6.
  • Best case X + N * ~0.02 Bytes
  • When each chunk ID has exactly one consecutive following ID: X + N * 3 Bytes

This assumes that msgpack does not align the data it stores. If it did align 5-byte values at 8-byte addresses, each entry would carry 3 bytes of padding overhead; in that case, maybe storing the count bytes in a separate list would help.
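To make the 5-byte layout concrete, here is a hypothetical packing sketch (`pack_runs`/`unpack_runs` are made-up names) using Python's `struct` module. The `<IB` format packs exactly 5 bytes with no alignment padding, and runs longer than 255 extras are split into multiple entries as described above:

```python
import struct

MAX_RUN = 255  # one byte for the "extra consecutive chunks" count

def pack_runs(runs):
    """Pack (start_id, extra) pairs into 5 bytes each:
    a 4-byte little-endian ID plus a 1-byte count, unaligned."""
    out = bytearray()
    for start, extra in runs:
        while extra > MAX_RUN:
            # run too long for one byte: emit a full entry and continue
            out += struct.pack("<IB", start, MAX_RUN)
            start += MAX_RUN + 1
            extra -= MAX_RUN + 1
        out += struct.pack("<IB", start, extra)
    return bytes(out)

def unpack_runs(data):
    """Inverse of pack_runs: read back the (start_id, extra) pairs."""
    return [struct.unpack_from("<IB", data, off)
            for off in range(0, len(data), 5)]
```

Note this only demonstrates the byte layout; whether msgpack would store such a byte string without extra overhead is exactly the open question above.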

additional optimization for repeated chunks

I assume that besides consecutive IDs there could also be repeated IDs, for example when a disk image with long runs of zeroes (empty space) is stored. So maybe a special case for this could be worthwhile as well: either by storing this information separately, or by "taking some values away" from the additional byte and using them to indicate repeated chunks instead. Example: 255 means 1 additional repetition, 254 means 2 repetitions, 253 means 4 repetitions, 252 means 8 repetitions, ... up to some good value. That way the best case from above would only get slightly worse, while still allowing significant gains for long runs of repeated values.
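A hedged sketch of how decoding such "stolen" count values could look; `REP_BASE = 248` (i.e. reserving the top 8 values for repetitions) is an arbitrary assumption, not something proposed above:

```python
# Hypothetical decoder for the scheme above: count bytes 255, 254, 253, ...
# encode 1, 2, 4, ... repetitions of the same chunk ID (powers of two),
# while lower count values keep their "extra consecutive chunks" meaning.

REP_BASE = 248  # assumption: reserve the top 8 count values for repetitions

def decode_entry(chunk_id, count_byte):
    """Expand one (chunk_id, count_byte) entry into its chunk ID list."""
    if count_byte >= REP_BASE:
        reps = 1 << (255 - count_byte)   # 255 -> 1, 254 -> 2, 253 -> 4, ...
        return [chunk_id] * (reps + 1)   # the ID itself plus its repetitions
    # normal case: chunk_id followed by count_byte consecutive IDs
    return [chunk_id + i for i in range(count_byte + 1)]
```

With this split, consecutive runs would max out at 247 extra chunks per entry instead of 255, which is the "slightly worse best case" traded for the repetition support.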

So I hope my assumptions are correct and this leads to some more nice improvements.

@ThomasWaldmann
Member

Not sure how frequent consecutive runs of the same chunk are in practice.

I guess there is one "popular" case, when long runs of all-zero ranges get chunked. But guess that's all, isn't it? In case of sparse files, such ranges might already get some special treatment depending on the chunker (only the fixed chunker does that).

Guess we need to avoid "optimizing" for a not-that-frequent case - otherwise there might be more overhead for that than we save by it.
