Broken scrape jobs can get past our checks #554

dstathis · 2023-12-07T14:43:55Z

Bug Description

When a certain set of scrape jobs is deployed, our scrape job validation is "fooled": the scrape jobs pass the check and are written to disk, causing Prometheus to fail.

To Reproduce

Deploy the attached bundle and relate it to COS (adjust the saas section as needed):
machine_model_bundle.txt
Here is the charm used in the bundle, in case the branch goes away (remove the .txt file extension):
grafana-agent_ubuntu-22.04-amd64.charm.txt

Environment

                           juju info v0.1                            
┌──────────────┬────────────────────────────────────────────────────┐
│ jhack        │ 0.3.18.3                                           │
│ python       │ 3.10.12 (/home/dylan/repos/jhack/venv/bin/python3) │
│ juju-* snaps │  juju      │ 3.3.0 - 25355 (latest/stable)         │
│              │  juju-wait │ 2.8.4~2.8.4 - 96 (stable)             │
│ microk8s     │ MicroK8s v1.28.3 revision 6089                     │
│ lxd          │ 5.19                                               │
│ multipass    │ 1.12.2                                             │
│ multipassd   │ 1.12.2                                             │
│ os           │ Ubuntu 22.04.3 LTS                                 │
│ kernel       │ Linux 5.15.0-89-generic x86_64                     │
└──────────────┴────────────────────────────────────────────────────┘

Relevant log output

unit-prometheus-0: 14:43:14 ERROR unit.prometheus/0.juju-log Invalid prometheus configuration. Stdout: Checking /etc/prometheus/prometheus.yml
  SUCCESS: 6 rule files found
 SUCCESS: /etc/prometheus/prometheus.yml is valid prometheus config file syntax

Checking /etc/prometheus/rules/juju_lma_0c334414_alertmanager_metrics-endpoint_17.rules
  SUCCESS: 4 rules found

Checking /etc/prometheus/rules/juju_lma_0c334414_grafana_metrics-endpoint_19.rules
  SUCCESS: 2 rules found

Checking /etc/prometheus/rules/juju_lma_0c334414_loki_metrics-endpoint_18.rules
  SUCCESS: 4 rules found

Checking /etc/prometheus/rules/juju_lma_0c334414_traefik_metrics-endpoint_16.rules
  SUCCESS: 2 rules found

Checking /etc/prometheus/rules/juju_machine_006cfa02_zookeeper.rules

Checking /etc/prometheus/rules/juju_stuff_6e55ee64_agent.rules
  SUCCESS: 35 rules found

 Stderr:   FAILED:
lint error 39 duplicate rule(s) found.
Metric: CollectorFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: error
Metric: IPMICurrentStateNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: {{ toLower $labels.state }}
Metric: IPMIDCMICommandFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: IPMIDCMIPowerConsumptionPercentageOutstanding
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: IPMIFanSpeedStateNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: {{ toLower $labels.state }}
Metric: IPMIMonitoringCommandFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: IPMIPowerStateNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: {{ toLower $labels.state }}
Metric: IPMISELCommandFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: IPMISELStateCritical
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: IPMISELStateWarning
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: IPMISensorStateNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: {{ toLower $labels.state }}
Metric: IPMITemperatureStateNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: {{ toLower $labels.state }}
Metric: IPMIVoltageStateNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: {{ toLower $labels.state }}
Metric: LSISASControllerNotFound
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: LSISASIRVolumeNotFound
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: LSISASIRVolumeUnready
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: LSISASPhysicalDiskUnready
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: MegaRAIDControllerNotFound
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: MegaRAIDVirtualDriveNotOptimal
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: PerccliCommandFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: PowerEdgeRAIDControllerNotFound
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: PowerEdgeRAIDControllerSuccess
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: PowerEdgeRAIDVirtualDriveNotOptimal
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: RedfishCallFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: RedfishChassisHealthNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: RedfishMemoryDimmHealthNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: RedfishProcessorHealthNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: RedfishSensorHealthNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: RedfishServiceUnavailable
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: RedfishSmartStorageHealthNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: RedfishStorageControllerHealthNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: RedfishStorageDriveHealthNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: SasircuCommandFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: SsaCLICommandFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: SsaCLIControllerNotFound
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: SsaCLIControllerNotOK
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: SsaCLILogicalDriveNotOK
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: SsaCLIPhysicalDriveNotOK
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: StorcliCommandFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Might cause inconsistency while recording expressions

Additional context

No response

lucabello · 2024-08-23T13:32:04Z

First, we should validate the scrape jobs with promtool; if we find one that's malformed, we should set the charm to Blocked. We don't want to stop Prometheus, because a Blocked status is better than an outage.

If schema validation fails for the scrape jobs coming from one relation, we omit those scrape jobs from the final configuration, and we set the charm to Blocked.

We need to add the same behavior in Grafana Agent, because that can also scrape metrics. We should probably have some helper function in the Prometheus library to handle that.
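
A minimal sketch of the approach described above, not the charm's actual code. It assumes scrape jobs arrive grouped by relation ID and that `promtool` is on the PATH; the names `promtool_validator`, `select_valid_jobs` and `status_for` are hypothetical. The validator is passed in as a parameter so the same helper could, in principle, be shared between Prometheus and Grafana Agent.

```python
import json
import os
import subprocess
import tempfile
from typing import Callable, Dict, List, Tuple

ScrapeJob = dict


def promtool_validator(jobs: List[ScrapeJob]) -> bool:
    """Hypothetical validator: run `promtool check config` on a throwaway config.

    JSON is valid YAML, so json.dump avoids a third-party YAML dependency.
    Assumes the `promtool` binary is available on PATH.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "prometheus.yml")
        with open(path, "w") as f:
            json.dump({"scrape_configs": jobs}, f)
        result = subprocess.run(
            ["promtool", "check", "config", path],
            capture_output=True,
            text=True,
        )
    return result.returncode == 0


def select_valid_jobs(
    jobs_by_relation: Dict[int, List[ScrapeJob]],
    validate: Callable[[List[ScrapeJob]], bool],
) -> Tuple[List[ScrapeJob], List[int]]:
    """Keep jobs from relations that validate; record the relations that do not."""
    accepted: List[ScrapeJob] = []
    rejected: List[int] = []
    for relation_id in sorted(jobs_by_relation):
        jobs = jobs_by_relation[relation_id]
        if validate(jobs):
            accepted.extend(jobs)
        else:
            # Omit every job from the failing relation instead of stopping Prometheus.
            rejected.append(relation_id)
    return accepted, rejected


def status_for(rejected: List[int]) -> Tuple[str, str]:
    """Map the validation outcome to a (status, message) pair for the charm."""
    if rejected:
        ids = ", ".join(str(r) for r in rejected)
        return "blocked", f"invalid scrape jobs from relation(s): {ids}"
    return "active", ""
```

In a charm, the `accepted` list would be what gets written into the final configuration, and the pair from `status_for` would be translated into the unit status, for example via `select_valid_jobs(jobs_by_relation, promtool_validator)`.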

dstathis added Status: Triage Type: Bug labels Dec 7, 2023
