
An assortment of Pbench Ops fixes and fun #3612

Merged
merged 7 commits into distributed-system-analysis:main from ops on Apr 5, 2024

Conversation

dbutenhof
Member

@dbutenhof dbutenhof commented Mar 8, 2024

This fixes several issues observed during ops review:

  1. The `/api/v1/endpoints` API fails if the server is shut down
  2. `tar` unpack errors can result in enormous `stderr` output, which is captured in the `Audit` log; truncate it to 5Kb
  3. Change the `pbench-audit` utility to use `dateutil.parser` instead of `click.DateTime()` so we can accept fractional seconds and a timezone (a rough sketch follows this list).
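
A rough illustration of the `click` + `dateutil` pattern from item 3 (the option name and callback here are hypothetical; this is not the actual `pbench-audit` code):

```python
import click
from dateutil import parser as date_parser


def parse_timestamp(ctx, param, value):
    """Click callback: accept ISO-8601 timestamps with fractional seconds and a timezone."""
    if value is None:
        return None
    try:
        return date_parser.parse(value)
    except (ValueError, OverflowError) as exc:
        raise click.BadParameter(f"{value!r} is not a recognizable timestamp") from exc


@click.command()
@click.option("--since", callback=parse_timestamp, help="earliest audit record to show")
def audit(since):
    """Stand-in command that just echoes the parsed timestamp."""
    click.echo(f"since={since!r}")


if __name__ == "__main__":
    audit()
```

A value like `2024-03-08T23:14:05.123456+00:00` parses cleanly this way, where a fixed `click.DateTime()` format list would reject it.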

During the time when we broke PostgreSQL, we failed to create metadata for a number of datasets that were allowed to upload. (Whether we should allow this versus failing the upload is a separate issue.) We also want to repair the excessively large `Audit` attributes records. So I took a stab at some wondrous and magical SQL queries and hackery to begin a new `pbench-repair` utility. Right now, it repairs long audit attributes "intelligently" by trimming individual JSON key values, and it adds metadata to datasets that lack critical values. Currently, this includes `server.tarball-path` (which we need to enable TOC and visualization), `dataset.metalog` (capturing the tarball `metadata.log` file), and `server.benchmark` for visualization.
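
Very roughly, the value-level trimming might look like the sketch below; the helper and the per-value cap are made up for illustration (the cap just mirrors the 105-character limit visible in the hacked test run further down), and this is not the actual `repair.py` code:

```python
import json

# Hypothetical per-value cap; the hacked test run below forced a similarly
# small limit to exercise the truncation path.
MAX_VALUE_LENGTH = 105


def trim_attributes(attributes: dict, limit: int = MAX_VALUE_LENGTH):
    """Truncate oversized string values in an audit attributes document.

    Returns the trimmed document plus the keys that were cut, so a caller
    can report something like "[message] truncated (107) to 105".
    """
    trimmed = {}
    cut = []
    for key, value in attributes.items():
        if isinstance(value, str) and len(value) > limit:
            trimmed[key] = value[:limit]
            cut.append(key)
        else:
            trimmed[key] = value
    return trimmed, cut


if __name__ == "__main__":
    doc, cut = trim_attributes({"message": "x" * 107, "note": "short"})
    print(cut, len(json.dumps(doc)))
```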

There are other `server` namespace values (including expiration time) that could be repaired; I decided not to worry about those since we're not doing expiration anyway. (Though I might add it over the weekend, since it shouldn't be hard.) And there are probably other things we might want to repair in the future using this framework.

I tested this in a `runlocal` container, using `psql` to "break" datasets and repair them. I hacked the local `repair.py` with a low "max error" limit to force truncation of audit attributes:

```
pbench-repair --detail --errors --verify --progress 10
(22:52:08) Repairing audit
|| 60:FAILURE upload fio_rw_2018.02.01T22.40.57 [message] truncated (107) to 105
|| 116:SUCCESS apikey None [key] truncated (197) to 105
22 audit records had attributes too long
2 records were fixed
(22:52:08) Repairing metadata
|| fio_rw_2018.02.01T22.40.57 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_rw_2018.02.01T22.40.57.tar.xz
|| fio_rw_2018.02.01T22.40.57 has no metalog: setting from metadata.log
|| fio_rw_2018.02.01T22.40.57 has no server.benchmark: setting 'fio'
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no metalog: setting from metadata.log
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no server.benchmark: setting 'pbench-user-benchmark'
2 server.tarball-path repairs, 0 failures
2 dataset.metalog repairs, 0 failures
2 server.benchmark repairs
```

@dbutenhof dbutenhof added the Server, Audit, Database, and Operations labels Mar 8, 2024
@dbutenhof dbutenhof requested a review from webbnh March 8, 2024 23:14
@dbutenhof dbutenhof self-assigned this Mar 8, 2024
webbnh

This comment was marked as resolved.

@dbutenhof
Member Author

FYI:

I faked broken metadata by using `psql` to delete some `server` and `metalog` rows (a rough sketch of this kind of deletion appears after the output below):

|| Missing MD5 /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz.md5
|| Isolator directory /srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa contains multiple tarballs: ['/srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_rw_2018.02.01T22.40.57.tar.xz', '/srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_mock_2020.02.27T22.16.14.tar.xz']
(16:01:28) Found ['/srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_rw_2018.02.01T22.40.57.tar.xz', '/srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_mock_2020.02.27T22.16.14.tar.xz'] for ID 08516cc7448035be2cc502f0517783fa
|| fio_rw_2018.02.01T22.40.57 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/dhcp31-45.perf.lab.eng.bos.redhat.com/08516cc7448035be2cc502f0517783fa/fio_rw_2018.02.01T22.40.57.tar.xz
|| fio_rw_2018.02.01T22.40.57 has no metalog: setting from metadata.log
|| fio_rw_2018.02.01T22.40.57 server.deletion set (730 days) to 2026-03-12T15:20:34.380181+00:00
|| fio_rw_2018.02.01T22.40.57 has no server.benchmark: setting 'fio'
(16:01:29) Found /srv/pbench/archive/fs-version-001/dhcp31-44.perf.lab.eng.bos.redhat.com/22a4bc5748b920c6ce271eb68f08d91c/fio_rw_2018.02.01T22.40.57.tar.xz for ID 22a4bc5748b920c6ce271eb68f08d91c
|| fio_rw_2018.02.01T22.40.57 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/dhcp31-44.perf.lab.eng.bos.redhat.com/22a4bc5748b920c6ce271eb68f08d91c/fio_rw_2018.02.01T22.40.57.tar.xz
|| fio_rw_2018.02.01T22.40.57 has no metalog: setting from metadata.log
|| fio_rw_2018.02.01T22.40.57 server.deletion set (730 days) to 2026-03-12T15:20:33.301420+00:00
|| fio_rw_2018.02.01T22.40.57 has no server.benchmark: setting 'fio'
|| Missing MD5 /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz.md5
|| Isolated tarball /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz MD5 doesn't match isolator 45f0e2af41977b89e07bae4303dc9972
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 doesn't seem to have a tarball
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no metalog: setting from default
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 server.deletion set (730 days) to 2026-03-12T15:20:33.441340+00:00
|| pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18 has no server.benchmark: setting 'unknown'
|| Missing MD5 /srv/pbench/archive/fs-version-001/ansible-host/45f0e2af41977b89e07bae4303dc9972/pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18.tar.xz.md5
(16:01:29) Found /srv/pbench/archive/fs-version-001/rhel8-1/4b8da5832aa9c7c6a21dc74123b8968b/uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57.tar.xz for ID 4b8da5832aa9c7c6a21dc74123b8968b
|| uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57 has no server.tarball-path: setting /srv/pbench/archive/fs-version-001/rhel8-1/4b8da5832aa9c7c6a21dc74123b8968b/uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57.tar.xz
|| uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57 has no metalog: setting from metadata.log
|| uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57 server.deletion set (730 days) to 2026-03-12T15:20:33.609509+00:00
|| uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57 has no server.benchmark: setting 'uperf'
4 server.tarball-path repairs, 1 failures
4 server.deletion repairs, 0 failures
4 dataset.metalog repairs, 0 failures
4 server.benchmark repairs
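
For context on the "breaking" step above, the deletion might look roughly like the following; the connection URL, the `dataset_metadata` table, the `key`/`dataset_ref` columns, and the dataset id are all invented for illustration, and it is shown through SQLAlchemy rather than an interactive `psql` session:

```python
from sqlalchemy import create_engine, text

# Hypothetical connection URL and schema names, for illustration only.
engine = create_engine("postgresql://pbench:secret@localhost:5432/pbench")

with engine.begin() as connection:
    # Drop the 'server' and 'metalog' metadata rows for one dataset so that
    # pbench-repair has something to find and fix.
    connection.execute(
        text(
            "DELETE FROM dataset_metadata "
            "WHERE key IN ('server', 'metalog') AND dataset_ref = :dataset_id"
        ),
        {"dataset_id": 42},
    )
```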

webbnh

This comment was marked as resolved.

webbnh

This comment was marked as resolved.

webbnh

This comment was marked as resolved.

This adds repair for `server.deletion` (expiration timestamp), completing the
repair of the `server` namespace.

In copying the setup from `intake_base.py` I realized that intake was
technically incorrect (not that it really matters much as we don't, and likely
won't, implement dataset expiration) in that it always uses the static
lifetime setting from `pbench-config.cfg` rather than recognizing the dynamic
server settings value. So I fixed that and made a common implementation.
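
A minimal sketch of the shared lifetime logic described above, assuming a dynamic server-settings value (named `dataset-lifetime` here purely for illustration) should win over the static configuration default; the 730-day figure mirrors the repair output earlier in this thread:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical names and defaults, for illustration only.
STATIC_CONFIG_LIFETIME_DAYS = 730


def dataset_lifetime(server_settings: dict) -> int:
    """Prefer the dynamic server setting, falling back to the static config value."""
    return int(server_settings.get("dataset-lifetime", STATIC_CONFIG_LIFETIME_DAYS))


def deletion_time(uploaded: datetime, server_settings: dict) -> datetime:
    """Compute a server.deletion timestamp from the upload time and the lifetime."""
    return uploaded + timedelta(days=dataset_lifetime(server_settings))


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(deletion_time(now, {}))                        # static 730-day default
    print(deletion_time(now, {"dataset-lifetime": 14}))  # dynamic override
```
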

It's also been bothering me that, in the midst of our PostgreSQL problems, we
allowed upload of datasets without metadata. I'd initially deliberatedly
allowed this looking at the metadata as "extra" and figuring I didn't want
to fail an upload just because of that. However, with recent optimizations,
we really depend internally on `server.tarball-path` in particular: the new
optimized `CacheManager.find_dataset` won't work without it. So failure in
setting metadata on intake is now a fatal internal server error.
To simplify edge cases, I give in, although for the record I'm not happy about
giving up on the line-based truncation: I just want it to be done. (And,
ultimately, I don't think it really matters all that much.)
webbnh

This comment was marked as resolved.

Member

@webbnh webbnh left a comment


🚢

@dbutenhof dbutenhof merged commit 4a35b7e into distributed-system-analysis:main Apr 5, 2024
4 checks passed
@dbutenhof dbutenhof deleted the ops branch April 5, 2024 15:51