Broken scrape jobs can get past our checks #554

Open
dstathis opened this issue Dec 7, 2023 · 1 comment

@dstathis
Contributor

dstathis commented Dec 7, 2023

Bug Description

When a certain set of scrape jobs is deployed, our scrape job validation is "fooled": the scrape jobs are written to disk, which causes Prometheus to fail.

To Reproduce

Deploy the attached bundle and relate it to COS (adjust the saas section as needed):
machine_model_bundle.txt
Here is the charm used in the bundle, in case the branch goes away (remove the .txt file extension):
grafana-agent_ubuntu-22.04-amd64.charm.txt

Environment

                           juju info v0.1                            
┌──────────────┬────────────────────────────────────────────────────┐
│ jhack        │ 0.3.18.3                                           │
│ python       │ 3.10.12 (/home/dylan/repos/jhack/venv/bin/python3) │
│ juju-* snaps │  juju      │ 3.3.0 - 25355 (latest/stable)         │
│              │  juju-wait │ 2.8.4~2.8.4 - 96 (stable)             │
│ microk8s     │ MicroK8s v1.28.3 revision 6089                     │
│ lxd          │ 5.19                                               │
│ multipass    │ 1.12.2                                             │
│ multipassd   │ 1.12.2                                             │
│ os           │ Ubuntu 22.04.3 LTS                                 │
│ kernel       │ Linux 5.15.0-89-generic x86_64                     │
└──────────────┴────────────────────────────────────────────────────┘

Relevant log output

unit-prometheus-0: 14:43:14 ERROR unit.prometheus/0.juju-log Invalid prometheus configuration. Stdout: Checking /etc/prometheus/prometheus.yml
  SUCCESS: 6 rule files found
 SUCCESS: /etc/prometheus/prometheus.yml is valid prometheus config file syntax

Checking /etc/prometheus/rules/juju_lma_0c334414_alertmanager_metrics-endpoint_17.rules
  SUCCESS: 4 rules found

Checking /etc/prometheus/rules/juju_lma_0c334414_grafana_metrics-endpoint_19.rules
  SUCCESS: 2 rules found

Checking /etc/prometheus/rules/juju_lma_0c334414_loki_metrics-endpoint_18.rules
  SUCCESS: 4 rules found

Checking /etc/prometheus/rules/juju_lma_0c334414_traefik_metrics-endpoint_16.rules
  SUCCESS: 2 rules found

Checking /etc/prometheus/rules/juju_machine_006cfa02_zookeeper.rules

Checking /etc/prometheus/rules/juju_stuff_6e55ee64_agent.rules
  SUCCESS: 35 rules found

 Stderr:   FAILED:
lint error 39 duplicate rule(s) found.
Metric: CollectorFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: error
Metric: IPMICurrentStateNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: {{ toLower $labels.state }}
Metric: IPMIDCMICommandFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: IPMIDCMIPowerConsumptionPercentageOutstanding
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: IPMIFanSpeedStateNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: {{ toLower $labels.state }}
Metric: IPMIMonitoringCommandFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: IPMIPowerStateNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: {{ toLower $labels.state }}
Metric: IPMISELCommandFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: IPMISELStateCritical
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: IPMISELStateWarning
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: IPMISensorStateNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: {{ toLower $labels.state }}
Metric: IPMITemperatureStateNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: {{ toLower $labels.state }}
Metric: IPMIVoltageStateNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: {{ toLower $labels.state }}
Metric: LSISASControllerNotFound
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: LSISASIRVolumeNotFound
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: LSISASIRVolumeUnready
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: LSISASPhysicalDiskUnready
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: MegaRAIDControllerNotFound
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: MegaRAIDVirtualDriveNotOptimal
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: PerccliCommandFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: PowerEdgeRAIDControllerNotFound
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: PowerEdgeRAIDControllerSuccess
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: PowerEdgeRAIDVirtualDriveNotOptimal
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: RedfishCallFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: RedfishChassisHealthNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: RedfishMemoryDimmHealthNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: RedfishProcessorHealthNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: RedfishSensorHealthNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: RedfishServiceUnavailable
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: RedfishSmartStorageHealthNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: RedfishStorageControllerHealthNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: RedfishStorageDriveHealthNotOk
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: SasircuCommandFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: SsaCLICommandFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: SsaCLIControllerNotFound
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: warning
Metric: SsaCLIControllerNotOK
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: SsaCLILogicalDriveNotOK
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: SsaCLIPhysicalDriveNotOK
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Metric: StorcliCommandFailed
Label(s):
	juju_application: hob
	juju_charm: hardware-observer
	juju_model: machine
	juju_model_uuid: 006cfa02-a74e-4ba7-863f-b3c423a4ebd1
	severity: critical
Might cause inconsistency while recording expressions

Additional context

No response

@lucabello
Contributor

First, we should validate the scrape jobs with promtool; if we find one that's malformed, we should set the charm to Blocked. We don't want to stop Prometheus, because a Blocked status is better than an outage.

If schema validation fails for the scrape jobs coming from one relation, we omit those scrape jobs from the final configuration and set the charm to Blocked.

We need the same behavior in Grafana Agent, because it can also scrape metrics. We should probably add a helper function to the Prometheus library to handle that.
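
A minimal sketch of what such a shared helper could look like, assuming scrape jobs arrive grouped per relation. The names `filter_scrape_jobs` and `looks_like_scrape_job` and the relation keys are hypothetical and not part of the existing library. The schema check is only a stand-in: a real implementation would presumably delegate to promtool instead.

```python
from typing import Any, Callable, Dict, List, Tuple

ScrapeJob = Dict[str, Any]


def looks_like_scrape_job(job: Any) -> bool:
    """Hypothetical stand-in for real validation (e.g. via promtool).

    Only checks that the job is a mapping with a non-empty string
    ``job_name`` and, if present, a list-valued ``static_configs``.
    """
    return (
        isinstance(job, dict)
        and isinstance(job.get("job_name"), str)
        and bool(job["job_name"])
        and isinstance(job.get("static_configs", []), list)
    )


def filter_scrape_jobs(
    jobs_by_relation: Dict[str, List[ScrapeJob]],
    is_valid: Callable[[Any], bool] = looks_like_scrape_job,
) -> Tuple[List[ScrapeJob], List[str]]:
    """Split scrape jobs into accepted jobs and rejected relations.

    If any job from a relation fails validation, every job from that
    relation is omitted, and the relation is reported so the caller can
    set a Blocked status while Prometheus keeps running.
    """
    accepted: List[ScrapeJob] = []
    rejected: List[str] = []
    for relation, jobs in jobs_by_relation.items():
        if isinstance(jobs, list) and all(is_valid(job) for job in jobs):
            accepted.extend(jobs)
        else:
            rejected.append(relation)
    return accepted, rejected


if __name__ == "__main__":
    # Assumed example data; the relation keys are made up for illustration.
    jobs = {
        "metrics-endpoint:16": [
            {"job_name": "traefik", "static_configs": [{"targets": ["10.0.0.1:9100"]}]}
        ],
        "metrics-endpoint:21": [{"static_configs": "oops"}],
    }
    accepted, rejected = filter_scrape_jobs(jobs)
    if rejected:
        # A charm would set a Blocked status here instead of printing.
        print(f"Blocked: invalid scrape jobs from {', '.join(rejected)}")
```

Returning the rejected relations instead of raising lets both Prometheus and Grafana Agent write the remaining valid jobs to disk and decide separately how to surface the Blocked status.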
