Improving the monitoring portability #20

Merged
merged 9 commits into from
May 31, 2024

ErwanAliasr1
Collaborator

This PR improves the monitoring feature based on all the feedback received so far.

@ErwanAliasr1 ErwanAliasr1 force-pushed the monito branch 5 times, most recently from 5b618a5 to 7a22bd8 May 31, 2024 14:16
When starting hwbench or when reading a result file, there is no mention
of the BMC driver used. This could be useful to understand some metrics
or even for hwgraph to make some decisions.

This commit:
- adds BMC.get_driver_name() to report the class name as the driver
  name

- adds BMC.dump() so the driver name can be added to the result
  file. The hardware data structure looks like the following:

  "hardware": {
    "dmi": {
      "vendor": "Dell Inc.",
      "product": "PowerEdge C6615",
      "serial": "XXXXXX",
      "bios": {
        "version": "1.2.3",
        "release": "1.2"
      },
      "chassis": {
        "product": "PowerEdge C6600",
        "serial": "XXXXXX"
      },
      "sysconf_threads": 128
    },
    "cpu": {
      "vendor": "AuthenticAMD",
      "model": "AMD EPYC 8534P 64-Core Processor",
      "logical_cores": 128,
      "physical_cores": 64,
      "numa_domains": 8,
      "sockets": 1
    },
    "bmc": {
      "driver": "IDRAC"
    }

- updates the startup message to indicate which driver is used. A
  typical output looks like:

  python3 -m hwbench.hwbench -j configs/mini.conf -m monitoring.cfg
  Starting monitoring for DELL vendor with driver IDRAC @ 10.168.97.148
  ...
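
As an illustration of the two new methods, here is a minimal sketch. The BMC and IDRAC classes below are simplified, hypothetical stand-ins and not the actual hwbench classes; only the method names get_driver_name() and dump() and the "driver" key come from this commit message.

```python
class BMC:
    """Hypothetical base class standing in for hwbench's BMC drivers."""

    def get_driver_name(self) -> str:
        # The driver name is simply the name of the concrete class.
        return type(self).__name__

    def dump(self) -> dict[str, str]:
        # Fragment meant to be stored under the "bmc" key of the result file.
        return {"driver": self.get_driver_name()}


class IDRAC(BMC):
    """Hypothetical Dell driver; the real class carries much more logic."""


print(IDRAC().dump())  # {'driver': 'IDRAC'}
```

Using the class name avoids maintaining a separate name string for each driver.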

Signed-off-by: Erwan Velu <e.velu@criteo.com>
Some block devices, like zram, do not have any scheduler.
This case made hwbench crash at startup.

This commit simply ignores block devices with no scheduler.
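
A minimal sketch of the idea, assuming the scheduler is read from /sys/block/<device>/queue/scheduler, where the active scheduler is shown between brackets, and assuming a device without a scheduler exposes no bracketed entry. The function names and sample strings are hypothetical, not taken from hwbench.

```python
import re
from typing import Optional


def active_scheduler(sysfs_content: str) -> Optional[str]:
    """Return the scheduler marked between brackets, or None if there is none."""
    match = re.search(r"\[([^\]]+)\]", sysfs_content)
    return match.group(1) if match else None


def schedulers_by_device(raw: dict[str, str]) -> dict[str, str]:
    """Keep only the devices that expose an active scheduler."""
    result = {}
    for device, content in raw.items():
        scheduler = active_scheduler(content)
        if scheduler is None:
            continue  # nothing to report for this device, skip instead of crashing
        result[device] = scheduler
    return result
```

With illustrative inputs such as {"sda": "mq-deadline [bfq] none", "zram0": "none"}, only sda would be kept.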

Signed-off-by: Erwan Velu <e.velu@criteo.com>
When an engine uses a third-party binary, its presence must be tested,
otherwise the code will crash.

This commit:
- adds a new helper (is_binary_available) to check if a binary is
  available
- adds a generic check for engines
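
One possible way to write such a helper is sketched below, assuming a PATH lookup through shutil.which() is enough. Only the name is_binary_available comes from this commit; its signature, the check_engine_binary function and the message are hypothetical.

```python
import shutil


def is_binary_available(binary_name: str) -> bool:
    """Return True if binary_name can be found in the PATH."""
    return shutil.which(binary_name) is not None


def check_engine_binary(engine_name: str, binary_name: str) -> None:
    """Hypothetical generic check: fail early with a clear message."""
    if not is_binary_available(binary_name):
        raise SystemExit(
            f"Engine {engine_name} requires '{binary_name}', which is not installed"
        )
```

Checking before the engine runs turns a late crash into an early, readable error.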

Signed-off-by: Erwan Velu <e.velu@criteo.com>
Testing if the BMC IP is set to 0.0.0.0 is useless since:
- Some vendors use a dedicated channel interface, like CHIF on HPE
- If a network connection is required (like redfish), the connection is
  already established or has generated a fault.

So this commit removes this useless code.

Signed-off-by: Erwan Velu <e.velu@criteo.com>
This simple commit updates the monitoring text printed at startup.

A typical output looks like the following:

	Monitoring/turbostat: initialize
	Monitoring/turbostat: Freq metrics:64xCPU
	Monitoring/BMC: initialize DELL vendor with IDRAC driver @ 10.168.97.148
	Monitoring/BMC: Thermal metrics:1xCPU, 1xIntake
	Monitoring/BMC: Fans metrics:10xFan
	Monitoring/BMC: PowerConsumption metrics:65xCPU, 4xBMC
	Monitoring/BMC: PowerSupplies metrics:2xBMC

Signed-off-by: Erwan Velu <e.velu@criteo.com>
When the External class is used and the binary it points to is not
installed, a FileNotFoundError exception is raised.

Instead of crashing, hwbench now prints a custom fatal message indicating
which binary is missing.
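
The behaviour can be sketched as follows, assuming the command is started through subprocess. The run_external function and the message wording are hypothetical and do not reproduce the External class itself.

```python
import subprocess


def run_external(cmdline: list[str]) -> subprocess.CompletedProcess:
    """Run a third-party command, turning a missing binary into a fatal message."""
    try:
        return subprocess.run(cmdline, capture_output=True, text=True, check=False)
    except FileNotFoundError:
        # subprocess raises FileNotFoundError when the executable does not exist;
        # report which binary is missing instead of showing a traceback.
        raise SystemExit(
            f"Fatal: binary '{cmdline[0]}' is missing, please install it"
        ) from None
```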

Signed-off-by: Erwan Velu <e.velu@criteo.com>
hwbench requires at least turbostat 2022.04.16 (from Kernel 5.19), otherwise
filtering the C1% field would not be possible.

This commit:
- updates the requirement in the documentation

- implements a simple test when Turbostat() is instantiated to guarantee
  that the minimal release is present

- if no suitable release is found, makes hwbench stop with a fatal
  message. A typical example looks like the following:

	Monitoring/turbostat: Detected release 19.8.31
	ERROR:root:Monitoring/turbostat: minimal expected release is 2022.4.16
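
A minimal sketch of such a release check, assuming the release appears as three dot-separated numbers in the text reported by turbostat. The function names are hypothetical; the minimal release is the one stated above.

```python
import re

# Minimal release required by hwbench, as stated in this commit.
MINIMAL_RELEASE = (2022, 4, 16)


def parse_release(text: str) -> tuple[int, int, int]:
    """Extract the first x.y.z release found in text as a tuple of integers."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no release found in {text!r}")
    year, month, day = (int(part) for part in match.groups())
    return (year, month, day)


def is_supported(text: str) -> bool:
    """Compare as integer tuples so that 19.8.31 sorts before 2022.4.16."""
    return parse_release(text) >= MINIMAL_RELEASE
```

Comparing integer tuples rather than strings keeps the ordering correct whether or not the fields carry leading zeros.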

Signed-off-by: Erwan Velu <e.velu@criteo.com>
Some processors, like the Intel(R) Core(TM) i7-9750H, report CorWatt only for core 0.

This commit simply ignores cores that do not report CorWatt,
even if the header mentions it.

A typical turbostat output of such a processor:

Core	CPU	Avg_MHz	Busy%	Bzy_MHz	TSC_MHz	IPC	IRQ	SMI	POLL	C1	C1E	C3	C6	C7s	C8	C9	C10	POLL%	C1%	C1E%	C3%	C6%	C7s%	C8%	C9%	C10%	CPU%c1	CPU%c3	CPU%c6	CPU%c7	CoreTmp	CoreThr	PkgTmp	Totl%C0	Any%C0	GFX%C0	CPUGFX%	Pkg%pc2	Pkg%pc3	Pkg%pc6	Pkg%pc7	Pkg%pc8	Pkg%pc9	Pk%pc10	CPU%LPI	SYS%LPI	PkgWatt	CorWatt	GFXWatt	RAMWatt	PKG_%	RAM_%	UncMHz
-	-	3	0.33	800	2592	0.50	1620	0	1	3	10	16	206	0	214	1	1342	0.00	0.00	0.00	0.00	0.26	0.00	0.36	0.02	99.05	0.64	0.00	0.47	98.57	40	2592	40	4.90	4.24	0.00	0.00	9.55	85.04	0.00	0.00	0.00	0.00	0.00	0.00	0.00	11.38	0.25	0.00	1.17	0.00	0.00	800
0	0	1	0.09	800	2592	0.35	20	0	0	0	0	0	2	0	4	0	113	0.00	0.00	0.00	0.00	0.02	0.00	0.07	0.00	99.82	1.13	0.00	0.08	98.69	37	2592	40	4.90	4.24	0.00	0.00	9.55	85.04	0.00	0.00	0.00	0.00	0.00	0.00	0.00	11.38	0.25	0.00	1.17	0.00	0.00	800
0	6	6	0.69	800	2592	0.31	341	0	0	0	0	0	7	0	3	0	311	0.00	0.00	0.00	0.00	0.08	0.00	0.06	0.00	99.20	0.53
1	1	6	0.70	800	2592	0.51	260	0	1	3	3	3	15	0	23	0	187	0.00	0.00	0.01	0.01	0.20	0.00	0.47	0.00	98.64	0.62	0.01	0.32	98.35	40	1352
1	7	2	0.31	800	2592	1.57	67	0	0	0	1	0	11	0	10	0	36	0.00	0.00	0.00	0.00	0.16	0.00	0.21	0.00	99.33	1.00
2	2	5	0.57	800	2592	0.33	66	0	0	0	1	3	11	0	9	0	145	0.00	0.00	0.00	0.00	0.17	0.00	0.19	0.00	99.08	0.46	0.00	0.52	98.44	38	1255
2	8	1	0.17	800	2592	0.38	108	0	0	0	1	2	24	0	21	0	66	0.00	0.00	0.00	0.01	0.42	0.00	0.41	0.00	99.01	0.86
3	3	4	0.44	800	2592	0.32	230	0	0	0	1	0	9	0	15	0	203	0.00	0.00	0.00	0.00	0.11	0.00	0.30	0.00	99.17	0.70	0.00	0.75	98.11	37	1078
3	9	2	0.29	800	2592	0.54	151	0	0	0	0	0	48	0	50	1	62	0.00	0.00	0.00	0.00	0.73	0.00	1.00	0.21	97.79	0.85
4	4	3	0.39	800	2592	0.30	264	0	0	0	2	7	34	0	57	0	158	0.00	0.00	0.00	0.01	0.52	0.00	1.13	0.00	97.98	0.38	0.00	0.50	98.73	37	237
4	10	1	0.08	800	2592	0.58	18	0	0	0	0	0	5	0	6	0	17	0.00	0.00	0.00	0.00	0.08	0.00	0.12	0.00	99.72	0.68
5	5	0	0.05	800	2592	0.47	25	0	0	0	1	0	7	0	1	0	22	0.00	0.00	0.00	0.00	0.10	0.00	0.02	0.00	99.84	0.26	0.01	0.62	99.07	36	0
5	11	1	0.14	800	2592	0.90	70	0	0	0	0	1	33	0	15	0	22	0.00	0.00	0.00	0.01	0.58	0.00	0.30	0.00	98.98	0.17
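
The handling of such truncated rows can be sketched as below, assuming tab-separated output where a row that does not report a trailing column is simply shorter than the header. The function name and the reduced sample are hypothetical.

```python
def corewatt_by_cpu(output: str) -> dict[int, float]:
    """Map each CPU to its CorWatt value, skipping rows that do not report it."""
    lines = output.strip().splitlines()
    header = lines[0].split("\t")
    result = {}
    for line in lines[1:]:
        # zip() stops at the shortest sequence, so a truncated row simply
        # has no entry for the trailing columns it does not report.
        row = dict(zip(header, line.split("\t")))
        if row.get("Core") == "-":
            continue  # summary line covering the whole system
        if "CorWatt" not in row:
            continue  # header mentions CorWatt but this row does not report it
        result[int(row["CPU"])] = float(row["CorWatt"])
    return result


# Reduced, illustrative sample: only CPU 0 reports CorWatt.
SAMPLE = "Core\tCPU\tBusy%\tCorWatt\n-\t-\t0.33\t0.25\n0\t0\t0.09\t0.25\n0\t6\t0.69\n1\t1\t0.70\n"
print(corewatt_by_cpu(SAMPLE))  # {0: 0.25}
```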

Signed-off-by: Erwan Velu <e.velu@criteo.com>
Starting with Kernel 6.9, the -n option became ambiguous, which prevents
turbostat from running, with the following message:

	turbostat: option '-n' is ambiguous; possibilities: '-num_iterations' '-no-msr' '-no-perf'

This commit removes all short option names and replaces them with
long names to avoid this case.

This patch was tested successfully from Kernel 5.19 (2022.4.16) up
to the upcoming 6.10 (2024.5.10).
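
As an illustration, a command line using only long options could be built as sketched below. The function and its parameters are hypothetical, and the exact set of options hwbench passes to turbostat is not listed in this commit message; the sketch only uses the --interval, --num_iterations and --show long options of turbostat.

```python
def turbostat_cmdline(interval: float, iterations: int, columns: list[str]) -> list[str]:
    """Build a turbostat command line that only uses long option names."""
    return [
        "turbostat",
        "--interval", str(interval),
        "--num_iterations", str(iterations),  # long form of the ambiguous -n
        "--show", ",".join(columns),
    ]
```

Spelling out the full option names keeps the command valid even when a newer release adds options sharing the same prefix.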

Signed-off-by: Erwan Velu <e.velu@criteo.com>
Collaborator

@beorn- beorn- left a comment

good to go !

@ErwanAliasr1 ErwanAliasr1 merged commit d2bc303 into main May 31, 2024
4 checks passed
@ErwanAliasr1 ErwanAliasr1 deleted the monito branch June 5, 2024 08:22