Track and report unpack performance #3610
Conversation
This is a thought experiment, based on our earlier discussion of tracking an asynchronous unpacker. I thought it would be fun to see some information on tarball compression ratios, as well as tracking the min/max unpack time in accessible metadata rather than just in logs. I need to jog back to Horreum, but before diving back into the pool of Java, I took a recreational break...

So I added a simple `server.unpack-perf` metadata, which is a JSON block like `{"min": <seconds>, "max": <seconds>, "count": <unpack_count>}`, and then played with the report generator to get some statistics. The sample below is for a `runlocal`, with a few small-ish tarballs. The big catch in deploying this is that none of the existing datasets will have `server.unpack-perf` until they're unpacked again, which definitely reduces the usefulness of this thought experiment. Nevertheless, I figured I might as well post it for consideration.

```
Cache report:
  5 datasets currently unpacked, consuming 51.7 MB
  8 datasets have been unpacked a total of 15 times
  The least recently used cache was referenced today, pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18
  The most recently used cache was referenced today, uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57
  The smallest cache is 4.1 kB, nometadata
  The biggest cache is 19.6 MB, trafficgen_basic-forwarding-example_tg:trex-profile_pf:forwarding_test.json_ml:5_tt:bs__2019-08-27T14:58:38
  The worst compression ratio is 22.156%, uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57
  The best compression ratio is 96.834%, pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18
  The fastest cache unpack is 0.013 seconds, nometadata
  The slowest cache unpack is 0.078 seconds, trafficgen_basic-forwarding-example_tg:trex-profile_pf:forwarding_test.json_ml:5_tt:bs__2019-08-27T14:58:38
  The fastest cache unpack streaming rate is 253.666 Mb/second, trafficgen_basic-forwarding-example_tg:trex-profile_pf:forwarding_test.json_ml:5_tt:bs__2019-08-27T14:58:38
  The slowest cache unpack streaming rate is 0.133 Mb/second, nometadata
```
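For clarity, the per-dataset bookkeeping behind `server.unpack-perf` is simple: merge each new unpack elapsed time into the running min/max and bump the count. A minimal sketch of that merge (the function name and the shape of the `metadata` dict are assumptions for illustration, not the actual server code):

```python
def update_unpack_perf(metadata: dict, elapsed: float) -> dict:
    """Merge one unpack timing into a hypothetical 'server.unpack-perf'
    JSON block of the form {"min": s, "max": s, "count": n}."""
    perf = metadata.get("server.unpack-perf")
    if perf is None:
        # First unpack recorded for this dataset.
        perf = {"min": elapsed, "max": elapsed, "count": 1}
    else:
        perf = {
            "min": min(perf["min"], elapsed),
            "max": max(perf["max"], elapsed),
            "count": perf["count"] + 1,
        }
    metadata["server.unpack-perf"] = perf
    return metadata
```

The report generator can then fold these blocks across all datasets to find the global fastest/slowest unpack and the total unpack count.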
The `pbench-tree-manage` utility supports a deep `ARCHIVE` tree display and also manages the periodic background cache reclamation. Curious after we saw a failed reclaim, I wanted to play with it a bit and realized that it does a full (`search`) discovery unconditionally. First, even with `--display` (which, now that we have a proper report generator, is rarely necessary) we can probably use the faster SQL discovery most of the time, although I added an option to select the slower `--search` discovery. More importantly, though, the cache reclaimer doesn't need a fully discovered cache manager, since it takes the short-cut of examining the `/srv/pbench/cache` tree directly; so we can move the discovery into the `--display` path.
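The control flow change described above can be sketched as follows. This is an illustrative outline only, under assumed option and step names, not the actual `pbench-tree-manage` implementation:

```python
import argparse

def plan(argv):
    """Return the ordered steps the tool would run for these arguments."""
    parser = argparse.ArgumentParser(prog="tree-manage-sketch")
    parser.add_argument("--display", action="store_true",
                        help="show a deep ARCHIVE tree report")
    parser.add_argument("--search", action="store_true",
                        help="use the slower filesystem-search discovery "
                             "instead of SQL discovery")
    args = parser.parse_args(argv)

    steps = []
    if args.display:
        # Only the display path needs a fully discovered cache manager;
        # default to the faster SQL discovery unless --search is given.
        steps.append("discover:search" if args.search else "discover:sql")
        steps.append("display")
    else:
        # The reclaimer examines the /srv/pbench/cache tree directly,
        # so it skips the expensive discovery step entirely.
        steps.append("reclaim")
    return steps
```

The point of the restructuring is visible in the last branch: the periodic reclaim run no longer pays the discovery cost at all.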
Also a few minor corrections identified during ops review.
I thought this morning about adding a CLI audit tool to query the audit log. While I wasn't quite motivated enough to write it, it occurred to me to at least cobble up a simple set of audit log statistics while eating breakfast. So here 'tis.
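The kind of quick breakfast-grade summary described here amounts to counting audit rows per operation, status, and user. A minimal sketch, assuming audit rows exposed as dicts with `operation`, `status`, and `user` keys (the field names are assumptions, not the actual `Audit` schema):

```python
from collections import Counter

def audit_summary(rows):
    """Tally audit-log rows by operation, status, and user."""
    summary = {"operation": Counter(), "status": Counter(), "user": Counter()}
    for row in rows:
        for field in summary:
            summary[field][row[field]] += 1
    return summary
```

The same aggregation could of course be done in SQL with `GROUP BY`, but a few lines of Python against already-fetched rows is enough for a one-off report.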
Well isn't that cute: an expired SSL cert trying to copy the IT CA cert!
And ... I can't log in to Jenkins to restart the build (just in case), because it seems to just ignore the login. Which might conceivably be related ... I suppose it's telling me to "enjoy my weekend and get off the computer"...
Some cleanup and review comments.
Looks good, although I do have one wrinkle for your consideration.
👍
I added a simple `server.unpack-perf` metadata, which is a JSON block like `{"min": <seconds>, "max": <seconds>, "count": <unpack_count>}`, and then played with the report generator to get some statistics. I also wrote a report of the `Audit` table contents to summarize the operations, statuses, and users involved in the Pbench Server. The sample below is for a `runlocal`, with a few small-ish tarballs. The big catch in deploying this is that none of the existing datasets will have `server.unpack-perf` until they're unpacked again (e.g., for TOC or visualize), which somewhat reduces the value of the statistics for now. Nevertheless, I figured I might as well post it for consideration. Some of the statistics (and how they're calculated and/or represented) are no doubt arguable; but I enjoyed seeing the numbers anyway. 😆