Track and report unpack performance #3610

Merged: 8 commits, Mar 6, 2024

Conversation

dbutenhof
Member

@dbutenhof dbutenhof commented Feb 21, 2024

I added a simple server.unpack-perf metadata, which is a JSON block like {"min": <seconds>, "max": <seconds>, "count": <unpack_count>}, and then played with the report generator to get some statistics.
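As a rough illustration of how a `{"min", "max", "count"}` record like this could be maintained, here is a minimal sketch. The helper names `update_unpack_perf` and `timed_unpack` are hypothetical and are not taken from the Pbench Server code; only the three JSON keys come from the description above.

```python
import json
import time
from typing import Callable, Optional


def update_unpack_perf(previous: Optional[dict], elapsed: float) -> dict:
    """Fold one unpack duration (in seconds) into a min/max/count record."""
    if not previous:
        # First recorded unpack: min and max are both this sample.
        return {"min": elapsed, "max": elapsed, "count": 1}
    return {
        "min": min(previous["min"], elapsed),
        "max": max(previous["max"], elapsed),
        "count": previous["count"] + 1,
    }


def timed_unpack(unpack: Callable[[], None], previous: Optional[dict]) -> dict:
    """Time a (hypothetical) unpack callable and return the updated record."""
    start = time.monotonic()
    unpack()
    return update_unpack_perf(previous, time.monotonic() - start)


# Two unpacks of the same dataset, using made-up durations.
perf = update_unpack_perf(None, 0.084)
perf = update_unpack_perf(perf, 0.014)
print(json.dumps(perf))  # {"min": 0.014, "max": 0.084, "count": 2}
```

A dataset that has never been unpacked since this change simply has no record, which is why older datasets would show up as missing unpack performance data.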

I also wrote a report of the Audit table contents to summarize the operations, statuses, and users involved in the Pbench Server.

The sample below is for a runlocal, with a few small-ish tarballs. The big catch in deploying this would be that none of the existing datasets will have server.unpack-perf until they're unpacked again (e.g., for TOC or visualize), which somewhat reduces the value of the statistics in the meantime.

Nevertheless, I figured I might as well post it for consideration. Some of the statistics (and how they're calculated and/or represented) are no doubt arguable; but I enjoyed seeing the numbers anyway. 😆

Cache report:
  7 datasets currently unpacked, consuming 51.7 MB
  7 datasets have been unpacked a total of 7 times
  The least recently used cache was referenced today, fio_rw_2018.02.01T22.40.57
  The most recently used cache was referenced today, trafficgen_basic-forwarding-example_tg:trex-profile_pf:forwarding_test.json_ml:5_tt:bs__2019-08-27T14:58:38
  The smallest cache is 307.2 kB, linpack_mock_2020.02.28T19.10.55
  The biggest cache is 19.6 MB, trafficgen_basic-forwarding-example_tg:trex-profile_pf:forwarding_test.json_ml:5_tt:bs__2019-08-27T14:58:38
  The worst compression ratio is 22.156%, uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57
  The best compression ratio is 96.834%, pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18
  The fastest cache unpack is 0.014 seconds, linpack_mock_2020.02.28T19.10.55
  The slowest cache unpack is 0.084 seconds, trafficgen_basic-forwarding-example_tg:trex-profile_pf:forwarding_test.json_ml:5_tt:bs__2019-08-27T14:58:38
  The fastest cache unpack streaming rate is 233.226 Mb/second, trafficgen_basic-forwarding-example_tg:trex-profile_pf:forwarding_test.json_ml:5_tt:bs__2019-08-27T14:58:38
  The slowest cache unpack streaming rate is 22.228 Mb/second, linpack_mock_2020.02.28T19.10.55
  1 datasets have no unpacked size, 1 are missing reference timestamps, 0 have bad size metadata
  1 datasets are missing unpack metric data, 0 have bad unpack metric data
  1 datasets are missing unpack performance data
Audit logs:
  138 audit log rows for 69 events
  0 unterminated root rows, 0 unmatched terminators
  Status summary:
                   BEGIN         69
                 SUCCESS         68
                 FAILURE          1
  Operation summary:
                template         36
                  upload          9
                   cache          7
                   index          6
                  apikey          1
                  update         10
  Object type summary:
                TEMPLATE         36
                 DATASET         32
                 API_KEY          1
  Users summary:
              BACKGROUND         49
                  tester         18
               testadmin          2
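The audit summary above could be produced by a tally along these lines. This is only a sketch: the row fields (`id`, `root_id`, `status`, `operation`, `object_type`, `user`) are assumed for illustration and may not match the real Audit table schema. It follows the sample totals, where statuses are counted per row (138) while operations, object types, and users are counted per event (69).

```python
from collections import Counter
from typing import Iterable


def summarize(rows: Iterable[dict]) -> dict:
    """Tally audit rows; a row with root_id None is assumed to open an event."""
    rows = list(rows)
    roots = [r for r in rows if r["root_id"] is None]
    root_ids = {r["id"] for r in roots}
    terminators = [r for r in rows if r["root_id"] is not None]
    terminated_ids = {r["root_id"] for r in terminators}
    return {
        "rows": len(rows),
        "events": len(roots),
        # Root rows that no later row refers back to.
        "unterminated": len(root_ids - terminated_ids),
        # Terminating rows whose root row is not present.
        "unmatched": sum(1 for r in terminators if r["root_id"] not in root_ids),
        "status": Counter(r["status"] for r in rows),
        "operation": Counter(r["operation"] for r in roots),
        "object_type": Counter(r["object_type"] for r in roots),
        "user": Counter(r["user"] for r in roots),
    }


# Made-up rows: one completed upload and one cache operation still in flight.
sample = [
    {"id": 1, "root_id": None, "status": "BEGIN", "operation": "upload",
     "object_type": "DATASET", "user": "tester"},
    {"id": 2, "root_id": 1, "status": "SUCCESS", "operation": "upload",
     "object_type": "DATASET", "user": "tester"},
    {"id": 3, "root_id": None, "status": "BEGIN", "operation": "cache",
     "object_type": "DATASET", "user": "BACKGROUND"},
]
summary = summarize(sample)
print(f"{summary['rows']} audit log rows for {summary['events']} events")
for name, count in summary["status"].items():
    print(f"{name:>24} {count:>10}")
```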

@dbutenhof dbutenhof added the Server, Code Infrastructure, Audit, and Operations labels Feb 21, 2024
@dbutenhof dbutenhof requested a review from webbnh February 21, 2024 22:02
@dbutenhof dbutenhof self-assigned this Feb 21, 2024
@dbutenhof dbutenhof marked this pull request as ready for review February 23, 2024 01:07
webbnh

This comment was marked as resolved.

This is a thought experiment, based off our earlier discussion of tracking an
asynchronous unpacker. I thought it would be fun to see some information on
tarball compression ratios, as well as tracking the min/max unpack time in
accessible metadata rather than just in logs. I need to jog back to Horreum,
but before diving back into the pool of Java, I took a recreational break...

So I added a simple `server.unpack-perf` metadata, which is a JSON block like
`{"min": <seconds>, "max": <seconds>, "count": <unpack_count>}`, and then
played with the report generator to get some statistics. The sample below is
for a `runlocal`, with a few small-ish tarballs. The big catch in deploying
this would be that none of the existing datasets will have
`server.unpack-perf` until they're unpacked again, which definitely reduces
the usefulness of this thought experiment.

Nevertheless, I figured I might as well post it for consideration.

```
Cache report:
  5 datasets currently unpacked, consuming 51.7 MB
  8 datasets have been unpacked a total of 15 times
  The least recently used cache was referenced today, pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18
  The most recently used cache was referenced today, uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57
  The smallest cache is 4.1 kB, nometadata
  The biggest cache is 19.6 MB, trafficgen_basic-forwarding-example_tg:trex-profile_pf:forwarding_test.json_ml:5_tt:bs__2019-08-27T14:58:38
  The worst compression ratio is 22.156%, uperf_rhel8.1_4.18.0-107.el8_snap4_25gb_virt_2019.06.21T01.28.57
  The best compression ratio is 96.834%, pbench-user-benchmark_example-vmstat_2018.10.24T14.38.18
  The fastest cache unpack is 0.013 seconds, nometadata
  The slowest cache unpack is 0.078 seconds, trafficgen_basic-forwarding-example_tg:trex-profile_pf:forwarding_test.json_ml:5_tt:bs__2019-08-27T14:58:38
  The fastest cache unpack streaming rate is 253.666 Mb/second, trafficgen_basic-forwarding-example_tg:trex-profile_pf:forwarding_test.json_ml:5_tt:bs__2019-08-27T14:58:38
  The slowest cache unpack streaming rate is 0.133 Mb/second, nometadata
```
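For readers wondering how figures like the compression ratio and streaming rate might be derived, here is a minimal sketch. Both definitions are assumptions rather than the report generator's actual code: the ratio is taken as the percentage of space saved by the tarball relative to the unpacked tree, and the rate as unpacked megabytes (10**6 bytes) per second of unpack time.

```python
from typing import Optional


def compression_ratio(tarball_bytes: int, unpacked_bytes: int) -> Optional[float]:
    """Percent of space saved by compression (assumed definition)."""
    if unpacked_bytes <= 0:
        return None  # no usable unpacked size recorded
    return 100.0 * (1.0 - tarball_bytes / unpacked_bytes)


def streaming_rate(unpacked_bytes: int, seconds: float) -> Optional[float]:
    """Unpacked megabytes per second (assumed definition)."""
    if seconds <= 0:
        return None  # no usable unpack timing recorded
    return unpacked_bytes / seconds / 1_000_000


# Made-up sizes: a 250 kB tarball that unpacks to 1 MB.
print(f"{compression_ratio(250_000, 1_000_000):.3f}%")  # 75.000%
# Made-up timing: 20 MB unpacked in 0.08 seconds.
print(f"{streaming_rate(20_000_000, 0.08):.3f} MB/second")
```

Returning `None` for missing or zero values keeps such datasets out of the min/max comparisons, so they can be counted separately as missing or bad metadata.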
The `pbench-tree-manage` utility supports a deep `ARCHIVE` tree display and
also manages the periodic background cache reclamation. Curious after we saw
a failed reclaim, I wanted to play with it a bit and realized that it does a
full (`search`) discovery unconditionally. First, even with `--display` (which
is rarely necessary now that we have a proper report generator) we can
probably use the faster SQL discovery most of the time, although I added a
`--search` option to select the slower discovery. More importantly, though, the
cache reclaimer doesn't need a fully discovered cache manager, since it takes
the shortcut of examining the `/srv/pbench/cache` tree directly; so we can
move the discovery into the `--display` path.
Also includes a few minor corrections identified during ops review.
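The control-flow change described in this commit message can be sketched roughly as follows. This is not the real `pbench-tree-manage` code: `discover` is a stand-in, the `--reclaim` flag is a hypothetical simplification of the reclaim options, and the sketch only shows discovery being run inside the `--display` path instead of unconditionally.

```python
import argparse
from typing import List, Optional


def discover(search: bool) -> str:
    """Stand-in for cache manager discovery; returns the method used."""
    return "search" if search else "sql"


def main(argv: Optional[List[str]] = None) -> List[str]:
    parser = argparse.ArgumentParser(prog="tree-manage-sketch")
    parser.add_argument("--display", action="store_true")
    parser.add_argument("--search", action="store_true")
    parser.add_argument("--reclaim", action="store_true")  # hypothetical flag
    args = parser.parse_args(argv)

    actions = []
    if args.display:
        # Discovery happens only when a tree display is requested,
        # defaulting to the faster SQL method unless --search is given.
        actions.append(f"discover:{discover(args.search)}")
        actions.append("display")
    if args.reclaim:
        # The reclaimer walks the cache directory directly, so it
        # does not need a discovered cache manager.
        actions.append("reclaim")
    return actions


print(main(["--display"]))              # ['discover:sql', 'display']
print(main(["--display", "--search"]))  # ['discover:search', 'display']
print(main(["--reclaim"]))              # ['reclaim']
```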
webbnh

This comment was marked as resolved.

I thought this morning about adding a CLI audit tool to query the audit log.

While I wasn't quite motivated enough to write it, it occurred to me to at
least cobble up a simple set of audit log statistics while eating breakfast.

So here 'tis.
@dbutenhof dbutenhof requested a review from webbnh March 2, 2024 14:57
@dbutenhof
Member Author

dbutenhof commented Mar 2, 2024

Well isn't that cute: an expired SSL cert trying to copy the IT CA cert!

Get "https://certs.corp.redhat.com/certs/2015-IT-Root-CA.pem": x509: certificate has expired or is not yet valid: current time 2024-03-02T15:11:39Z is after 2024-03-01T23:59:59Z

And ... I can't log in to Jenkins to restart the build (just in case), because it seems to just ignore the login. Which might conceivably be related ...

I suppose it's telling me to "enjoy my weekend and get off the computer"...

webbnh

This comment was marked as resolved.

Some cleanup and review comments.
webbnh
webbnh previously approved these changes Mar 5, 2024
Member

@webbnh webbnh left a comment

Looks good, although I do have one wrinkle for your consideration.

lib/pbench/cli/server/report.py (outdated; resolved)
Member

@webbnh webbnh left a comment

👍

@dbutenhof dbutenhof merged commit 906a06a into distributed-system-analysis:main Mar 6, 2024
4 checks passed
@dbutenhof dbutenhof deleted the timer branch March 6, 2024 01:37