Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: move aside TSM file on errBlockRead #25839

Open
wants to merge 4 commits into
base: master-1.x
Choose a base branch
from

Conversation

davidby-influx
Copy link
Contributor

The error type check for errBlockRead was incorrect,
and bad TSM files were not being moved aside when
that error was encountered. Use errors.Join,
errors.Is, and errors.As to correctly unwrap multiple
errors.

Closes #25838

@gwossum
Copy link
Member

gwossum commented Jan 15, 2025

@davidby-influx This is probably an uncommon occurrence, but what if the first errBlockRead that a tsmBatchKeyIterator encounters is after the error list has reached its max? In this case, the errors.As(err, &blockReadErr) in MoveTsmOnReadErr would fail, and the TSM file would not be moved out of the way because the errBlockRead got dropped.

One solution would be to have tsmBatchKeyIterator.AppendError keep a boolean to indicate if it has included a errBlockRead yet. If it gets an errBlockRead and the error list is at its max but it hasn't included an errBlockRead then it should go ahead and append it, then mark that it now has an errBlockRead. This would prevent the issue described above, but it kind of breaks abstraction layers.

Another solution would to change tsmBatchKeyIterator.AppendError so that if the error list has hit its max, it will scan the list of errors to see if it has included one of the errors it just received. If not, then it will append the error even the list is already at its max. This might also prevent other edge cases in the code for other error types.

@davidby-influx
Copy link
Contributor Author

The second solution is conceptually cleaner, but the reason we put the list limit in was that this error list was crashing the system with an OOM when a bad file was being queried again and again. The search for existing errors matching the new error should probably dictate a different data structure to store errors to reduce search costs.

@davidby-influx
Copy link
Contributor Author

davidby-influx commented Jan 16, 2025

The original problem for which the error list was limited to avoid running out of memory was because a bad block caused thousands of the same error. I suggest we store each unique error (as determined by its Error() string) once in a map.

We will still limit map size, but this will cause far fewer overflows of the error limit, because the symptom we have seen is many, many multiples of the same error. This still runs the risk of discarding an errBlockRead, but I think the chance is acceptably small.

@davidby-influx
Copy link
Contributor Author

Alternately, we could remove the limit on error storage, on the theory that the number of unique errors is unlikely to cause an out-of-memory situation. Then we would have a small risk of an OOM, but reduce the risk of missing an errBlockRead

@gwossum
Copy link
Member

gwossum commented Jan 16, 2025

I'd still be tempted to key off the type of error instead of the value Error. I have fears there might be small differences in the error messages (e.g. bad thingy at offset 1, bad thingy at offset 2, bad thingy at offset 3) that would still cause the list of errors to essentially be unbounded.

I think it would also be good to keep the errors in order, if possible. The ordering of the errors might yield clues to what went wrong.

@davidby-influx
Copy link
Contributor Author

Types are problematic. A lot of errors come up as type *errors.errorString, with different innards.

The point on ordering is important, though.

@davidby-influx
Copy link
Contributor Author

@gwossum - take a look at my latest revision. We now store unique errors in order of appearance up to a limit. Should reduce storage in the case we first saw, and retains the ordering.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

block read error should move aside bad TSM file
2 participants