
An assortment of Pbench Ops fixes and fun #3612

Merged
merged 7 commits into distributed-system-analysis:main from ops on Apr 5, 2024

Conversation

dbutenhof
Member

@dbutenhof dbutenhof commented Mar 8, 2024

This fixes several issues observed during ops review:

  1. The `/api/v1/endpoints` API fails if the server is shut down
  2. `tar` unpack errors can result in enormous `stderr` output, which is captured in the `Audit` log; truncate it to 5Kb
  3. Change the `pbench-audit` utility to use `dateutil.parser` instead of `click.DateTime()` so we can accept fractional seconds and a timezone (a rough sketch follows this list).
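
A rough illustration of the `click` + `dateutil` pattern from item 3 (the option name and callback here are hypothetical; this is not the actual `pbench-audit` code):

```python
import click
from dateutil import parser as date_parser


def parse_timestamp(ctx, param, value):
    """Click callback: accept ISO-8601 timestamps with fractional seconds and a timezone."""
    if value is None:
        return None
    try:
        return date_parser.parse(value)
    except (ValueError, OverflowError) as exc:
        raise click.BadParameter(f"{value!r} is not a recognizable timestamp") from exc


@click.command()
@click.option("--since", callback=parse_timestamp, help="earliest audit record to show")
def audit(since):
    """Stand-in command that just echoes the parsed timestamp."""
    click.echo(f"since={since!r}")


if __name__ == "__main__":
    audit()
```

A value like `2024-03-08T23:14:05.123456+00:00` parses cleanly this way, where a fixed `click.DateTime()` format list would reject it.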

During the time when we broke PostgreSQL, we failed to create metadata for a number of datasets that were allowed to upload. (Whether we should allow this versus failing the upload is a separate issue.) We also want to repair the excessively large `Audit` attributes records. So I took a stab at some wondrous and magical SQL queries and hackery to begin a new `pbench-repair` utility. Right now, it repairs long audit attributes "intelligently" by trimming individual JSON key values, and it adds metadata to datasets that lack critical values. Currently, this includes `server.tarball-path` (which we need to enable TOC and visualization), `dataset.metalog` (capturing the tarball `metadata.log` file), and `server.benchmark` for visualization.
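
Very roughly, the value-level trimming might look like the sketch below; the helper and the per-value cap are made up for illustration (the cap just mirrors the 105-character limit visible in the hacked test run further down), and this is not the actual `repair.py` code:

```python
import json

# Hypothetical per-value cap; the hacked test run below forced a similarly
# small limit to exercise the truncation path.
MAX_VALUE_LENGTH = 105


def trim_attributes(attributes: dict, limit: int = MAX_VALUE_LENGTH):
    """Truncate oversized string values in an audit attributes document.

    Returns the trimmed document plus the keys that were cut, so a caller
    can report something like "[message] truncated (107) to 105".
    """
    trimmed = {}
    cut = []
    for key, value in attributes.items():
        if isinstance(value, str) and len(value) > limit:
            trimmed[key] = value[:limit]
            cut.append(key)
        else:
            trimmed[key] = value
    return trimmed, cut


if __name__ == "__main__":
    doc, cut = trim_attributes({"message": "x" * 107, "note": "short"})
    print(cut, len(json.dumps(doc)))
```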

There are other `server` namespace values (including expiration time) that could be repaired; I decided not to worry about those since we're not doing expiration anyway. (Though I might add it over the weekend, since it shouldn't be hard.) And there are probably other things we might want to repair in the future using this framework.

I tested this in a `runlocal` container, using `psql` to "break" datasets and repair them. I hacked the local `repair.py` with a low "max error" limit to force truncation of audit attributes:

```
pbench-repair --detail --errors --verify --progress 10
(22:52:08) Repairing audit
|| 60:FAILURE upload fio_rw_2018.02.01T22.40.57 [message] truncated (107) to 105
|| 116:SUCCESS apikey None [key] truncated (197) to 105
22 audit records had attributes too long
2 records were fixed
(22:52:08) Repairing metadata
|| fio_rw_2018.02.01T22.40.57 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_rw_2018.02.01T22.40.57.tar.xz
|| fio_rw_2018.02.01T22.40.57 has no metalog: setting from metadata.log
|| fio_rw_2018.02.01T22.40.57 has no server.benchmark: setting 'fio'
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no metalog: setting from metadata.log
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no server.benchmark: setting 'pbench-user-benchmark'
2 server.tarball-path repairs, 0 failures
2 dataset.metalog repairs, 0 failures
2 server.benchmark repairs
```

@dbutenhof dbutenhof added the Server, Audit, Database, and Operations labels Mar 8, 2024
@dbutenhof dbutenhof requested a review from webbnh March 8, 2024 23:14
@dbutenhof dbutenhof self-assigned this Mar 8, 2024
webbnh

This comment was marked as resolved.

@dbutenhof
Member Author

FYI:

I faked broken metadata by using `psql` to delete some `server` and `metalog` rows (a rough sketch of this kind of deletion appears after the output below):

|| Missing MD5 /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz.md5
|| Isolator directory /srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa contains multiple tarballs: ['/srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_rw_2018.02.01T22.40.57.tar.xz', '/srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_mock_2020.02.27T22.16.14.tar.xz']
(16:01:28) Found ['/srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_rw_2018.02.01T22.40.57.tar.xz', '/srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_mock_2020.02.27T22.16.14.tar.xz'] for ID 08516cc7448035be2cc502f0517783fa
|| fio_rw_2018.02.01T22.40.57 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_rw_2018.02.01T22.40.57.tar.xz
|| fio_rw_2018.02.01T22.40.57 has no metalog: setting from metadata.log
|| fio_rw_2018.02.01T22.40.57 server.deletion set (730 days) to 2026-03-12T15:20:34.380181+00:00
|| fio_rw_2018.02.01T22.40.57 has no server.benchmark: setting 'fio'
(16:01:29) Found /srv/pbench/archive/fs-version-001/dhcp31-44.perf.lab.eng.bos.redhat.com/22a4bc5748b920c6ce271eb68f08d91c/fio_rw_2018.02.01T22.40.57.tar.xz for ID 22a4bc5748b920c6ce271eb68f08d91c
|| fio_rw_2018.02.01T22.40.57 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/dhcp31-44.perf.lab.eng.bos.redhat.com/22a4bc5748b920c6ce271eb68f08d91c/fio_rw_2018.02.01T22.40.57.tar.xz
|| fio_rw_2018.02.01T22.40.57 has no metalog: setting from metadata.log
|| fio_rw_2018.02.01T22.40.57 server.deletion set (730 days) to 2026-03-12T15:20:33.301420+00:00
|| fio_rw_2018.02.01T22.40.57 has no server.benchmark: setting 'fio'
|| Missing MD5 /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz.md5
|| Isolated tarball /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz MD5 doesn't match isolator 45f0e2af41977b89e07bae4303dc9972
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 doesn't seem to have a tarball
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no metalog: setting from default
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 server.deletion set (730 days) to 2026-03-12T15:20:33.441340+00:00
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no server.benchmark: setting 'unknown'
|| Missing MD5 /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz.md5
(16:01:29) Found /srv/pbench/archive/fs-version-001/rhel8-1/4b8da5832aa9c7c6a21dc74123b8968b/uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57.tar.xz for ID 4b8da5832aa9c7c6a21dc74123b8968b
|| uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/rhel8-1/4b8da5832aa9c7c6a21dc74123b8968b/uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57.tar.xz
|| uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57 has no metalog: setting from metadata.log
|| uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57 server.deletion set (730 days) to 2026-03-12T15:20:33.609509+00:00
|| uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57 has no server.benchmark: setting 'uperf'
4 server.tarball-path repairs, 1 failures
4 server.deletion repairs, 0 failures
4 dataset.metalog repairs, 0 failures
4 server.benchmark repairs
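
For context on the "breaking" step above, the deletion might look roughly like the following; the connection URL, the `dataset_metadata` table, the `key`/`dataset_ref` columns, and the dataset id are all invented for illustration, and it is shown through SQLAlchemy rather than an interactive `psql` session:

```python
from sqlalchemy import create_engine, text

# Hypothetical connection URL and schema names, for illustration only.
engine = create_engine("postgresql://pbench:secret@localhost:5432/pbench")

with engine.begin() as connection:
    # Drop the 'server' and 'metalog' metadata rows for one dataset so that
    # pbench-repair has something to find and fix.
    connection.execute(
        text(
            "DELETE FROM dataset_metadata "
            "WHERE key IN ('server', 'metalog') AND dataset_ref = :dataset_id"
        ),
        {"dataset_id": 42},
    )
```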

webbnh

This comment was marked as resolved.

webbnh

This comment was marked as resolved.

webbnh

This comment was marked as resolved.

This adds repair for `server.deletion` (expiration timestamp), completing the
repair of the `server` namespace.

In copying the setup from `intake_base.py` I realized that intake was
technically incorrect (not that it really matters much as we don't, and likely
won't, implement dataset expiration) in that it always uses the static
lifetime setting from `pbench-config.cfg` rather than recognizing the dynamic
server settings value. So I fixed that and made a common implementation.
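
A minimal sketch of the shared lifetime logic described above, assuming a dynamic server-settings value (named `dataset-lifetime` here purely for illustration) should win over the static configuration default; the 730-day figure mirrors the repair output earlier in this thread:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical names and defaults, for illustration only.
STATIC_CONFIG_LIFETIME_DAYS = 730


def dataset_lifetime(server_settings: dict) -> int:
    """Prefer the dynamic server setting, falling back to the static config value."""
    return int(server_settings.get("dataset-lifetime", STATIC_CONFIG_LIFETIME_DAYS))


def deletion_time(uploaded: datetime, server_settings: dict) -> datetime:
    """Compute a server.deletion timestamp from the upload time and the lifetime."""
    return uploaded + timedelta(days=dataset_lifetime(server_settings))


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(deletion_time(now, {}))                        # static 730-day default
    print(deletion_time(now, {"dataset-lifetime": 14}))  # dynamic override
```
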

It's also been bothering me that, in the midst of our PostgreSQL problems, we
allowed upload of datasets without metadata. I'd initially deliberatedly
allowed this looking at the metadata as "extra" and figuring I didn't want
to fail an upload just because of that. However, with recent optimizations,
we really depend internally on `server.tarball-path` in particular: the new
optimized `CacheManager.find_dataset` won't work without it. So failure in
setting metadata on intake is now a fatal internal server error.
To simplify edge cases, I give in, although for the record I'm not happy about
giving up on the line-based truncation: I just want it to be done. (And,
ultimately, I don't think it really matters all that much.)
webbnh

This comment was marked as resolved.

Member

@webbnh webbnh left a comment


🚢

@dbutenhof dbutenhof merged commit 4a35b7e into distributed-system-analysis:main Apr 5, 2024
4 checks passed
@dbutenhof dbutenhof deleted the ops branch April 5, 2024 15:51