An assortment of Pbench Ops fixes and fun #3612
Merged
Conversation
dbutenhof added the Server, Audit, Database, and Operations labels on Mar 8, 2024
FYI: I faked broken metadata by using
This fixes several issues observed during ops review:

1. The `/api/v1/endpoints` API fails if the server is shut down.
2. `tar` unpack errors can result in enormous `stderr` output, which is captured in the `Audit` log; truncate it to 5Mb.
3. Change the `pbench-audit` utility to use `dateutil.parser` instead of `click.DateTime()` so we can include fractional seconds and timezone.

During the time when we broke PostgreSQL, we failed to create metadata for a number of datasets that were allowed to upload. (Whether we should allow this vs. failing the upload is a separate issue.) We also want to repair the excessively large `Audit` attributes records. So I took a stab at some wondrous and magical SQL queries and hackery to begin a new `pbench-repair` utility.

Right now, it repairs long audit attributes "intelligently" by trimming individual JSON key values, and it adds metadata to datasets which lack critical values. Currently, this includes `server.tarball-path` (which we need to enable TOC and visualization), `dataset.metalog` (capturing the tarball `metadata.log` file), and `server.benchmark` for visualization.

There are other `server` namespace values (including expiration time) that could be repaired; I decided not to worry about those, as we're not doing expiration anyway. (Though I might add it over the weekend, since it shouldn't be hard.) And there are probably other things we might want to repair in the future using this framework.

I tested this in a `runlocal` container, using `psql` to "break" datasets and repair them. I hacked the local `repair.py` with a low "max error" limit to force truncation of audit attributes:

```
pbench-repair --detail --errors --verify --progress 10
(22:52:08) Repairing audit
|| 60:FAILURE upload fio_rw_2018.02.01T22.40.57 [message] truncated (107) to 105
|| 116:SUCCESS apikey None [key] truncated (197) to 105
22 audit records had attributes too long
2 records were fixed
(22:52:08) Repairing metadata
|| fio_rw_2018.02.01T22.40.57 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_rw_2018.02.01T22.40.57.tar.xz
|| fio_rw_2018.02.01T22.40.57 has no metalog: setting from metadata.log
|| fio_rw_2018.02.01T22.40.57 has no server.benchmark: setting 'fio'
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no metalog: setting from metadata.log
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no server.benchmark: setting 'pbench-user-benchmark'
2 server.tarball-path repairs, 0 failures
2 dataset.metalog repairs, 0 failures
2 server.benchmark repairs
```
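On the `pbench-audit` change (item 3): `dateutil.parser` accepts fractional seconds and timezone offsets directly, whereas `click.DateTime()` matches only a fixed list of `strptime` formats (and its defaults include neither). A quick illustration with a made-up timestamp:

```python
from dateutil import parser

# Fractional seconds and a timezone offset parse in one call, with no
# need to enumerate strptime format strings as click.DateTime() requires.
ts = parser.parse("2024-03-08T22:52:08.123456+00:00")
print(ts.isoformat())  # 2024-03-08T22:52:08.123456+00:00
```

And for a sense of what "intelligent" trimming of audit attributes means, here is a minimal sketch — not the actual `pbench-repair` code; the `MARKER` text, the helper name, and the 105-character limit (echoing the demo output above) are illustrative assumptions:

```python
import json

MARKER = "[truncated]"  # hypothetical truncation marker


def trim_attributes(attributes: dict, limit: int) -> dict:
    """Shrink individual JSON string values, longest first, until the
    serialized attributes document fits within `limit` characters."""
    attrs = dict(attributes)
    while len(json.dumps(attrs)) > limit:
        # Only string values longer than the marker are worth trimming.
        candidates = [
            k for k, v in attrs.items()
            if isinstance(v, str) and len(v) > len(MARKER) + 1
        ]
        if not candidates:
            break  # nothing left to trim; caller must report the failure
        key = max(candidates, key=lambda k: len(attrs[k]))
        excess = len(json.dumps(attrs)) - limit
        keep = max(1, len(attrs[key]) - excess - len(MARKER))
        attrs[key] = attrs[key][:keep] + MARKER
    return attrs


# Example: the huge value is trimmed to fit; the short key survives intact.
big = {"operation": "upload", "message": "x" * 10000}
print(json.dumps(trim_attributes(big, 105)))
```

Trimming the longest value by exactly the excess (plus room for the marker) keeps short, high-signal keys like the operation name intact.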
This adds repair for `server.deletion` (the expiration timestamp), completing the repair of the `server` namespace. In copying the setup from `intake_base.py`, I realized that intake was technically incorrect (not that it really matters much, as we don't, and likely won't, implement dataset expiration): it always uses the static lifetime setting from `pbench-config.cfg` rather than recognizing the dynamic server settings value. So I fixed that and made a common implementation.

It's also been bothering me that, in the midst of our PostgreSQL problems, we allowed upload of datasets without metadata. I'd initially allowed this deliberately, viewing the metadata as "extra" and figuring I didn't want to fail an upload just because of that. However, with recent optimizations we really depend internally on `server.tarball-path` in particular: the new optimized `CacheManager.find_dataset` won't work without it. So failure to set metadata on intake is now a fatal internal server error.
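For illustration, the "common implementation" of lifetime resolution might look like this — a minimal sketch with hypothetical names and values; the real logic lives in the server's settings and intake code, and the dynamic value comes from the server settings table:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def compute_deletion(
    upload_time: datetime,
    dynamic_lifetime_days: Optional[int],
    static_lifetime_days: int,
) -> datetime:
    """Compute a dataset's server.deletion timestamp, preferring the
    dynamic server-settings lifetime over the static config default."""
    days = (
        dynamic_lifetime_days
        if dynamic_lifetime_days is not None
        else static_lifetime_days
    )
    return upload_time + timedelta(days=days)


# Example: a 90-day dynamic setting overrides a 3650-day static default.
print(compute_deletion(datetime.now(timezone.utc), 90, 3650).isoformat())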
To simplify edge cases, I give in, although for the record I'm not happy about giving up on the line-based truncation: I just want it to be done. (And, ultimately, I don't think it really matters all that much.)
webbnh approved these changes on Apr 5, 2024
🚢