Commit

Merge pull request #166 from ramachandranravi/remove_expose_cgroup_metrics

removed_cgroupfs_from_kepler_doc
sthaha authored Nov 13, 2024
2 parents 293c205 + 3f4bb00 commit 702573f
Showing 8 changed files with 13 additions and 59 deletions.
21 changes: 0 additions & 21 deletions docs/design/metrics.md
@@ -116,27 +116,6 @@ All the metrics specific to the Kepler Exporter are prefixed with `kepler`.
!!! note
You can enable/disable expose of those metrics through `expose-hardware-counter-metrics` Kepler execution option or `EXPOSE_HW_COUNTER_METRICS` environment value.

-### cGroups Metrics
-
-- **kepler_container_cgroupfs_cpu_usage_us_total**
-
-This measures the total CPU time used by the container reading from cGroups stat.
-
-- **kepler_container_cgroupfs_memory_usage_bytes_total**
-
-This measures the total memory in bytes used by the container reading from cGroups stat.
-
-- **kepler_container_cgroupfs_system_cpu_usage_us_total**
-
-This measures the total CPU time in kernel space used by the container reading from cGroups stat.
-
-- **kepler_container_cgroupfs_user_cpu_usage_us_total**
-
-This measures the total CPU time in userspace used by the container reading from cGroups stat.
-
-!!! note
-You can enable/disable expose of those metrics through `EXPOSE_CGROUP_METRICS` environment value.
-
### IRQ Metrics

- **kepler_container_bpf_net_tx_irq_total**
9 changes: 6 additions & 3 deletions docs/hardwareengagement/index.md
@@ -24,9 +24,12 @@ Currently, we use power consumption API as RAPL or ACPI.
For some of the devices, you may need to find your own way to get power consumption, and implement in golang for Kepler usage.
For further plan, please ref [here](https://github.com/sustainable-computing-io/kepler/issues/644)

-### eBPF/cgroup data
+### eBPF data

-Currently, we relays on eBPF and cgroup to characterization a process/pod. Hence, you can ref to our dependency as BCC or cgroup. To test those golang package works well on your device.
+Currently, we rely on eBPF to obtain key cpu, irq and perf information about a process.
+Hence, refer to the documentation of [cilium/ebpf](https://github.com/cilium/ebpf) to test whether these Go packages work well on your device.
+
+Please let us know if you need any further adjustments!

## Stage 1 Integration with ratio

@@ -39,7 +42,7 @@ You should know the scope of the Power consumption API. How many API do you have
### Interval

You should know the intervals of the Power consumption API.
-As Kepler collects eBPF and cgroup data in each 3s by default, you should know the interval and make them in same time slot.
+As Kepler collects eBPF data in each 3s by default, you should know the interval and make them in same time slot.
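
As an illustration of putting power readings "in same time slot" as the 3s collection interval, here is a minimal Python sketch. It is not Kepler code: the function name `bucket_power_samples`, the `(timestamp_s, joules)` sample shape, and the choice to sum energy per window are all assumptions made for this example.

```python
from collections import defaultdict

# Default collection interval stated in the doc text (3 seconds).
KEPLER_INTERVAL_S = 3.0


def bucket_power_samples(samples, interval_s=KEPLER_INTERVAL_S):
    """Group (timestamp_s, joules) readings into fixed windows.

    Hypothetical helper: each reading is assigned to the window that
    starts at floor(timestamp / interval) * interval, and the energy of
    all readings in a window is summed.
    """
    windows = defaultdict(float)
    for timestamp_s, joules in samples:
        window_start = (timestamp_s // interval_s) * interval_s
        windows[window_start] += joules
    return dict(sorted(windows.items()))


# Made-up readings from a power API that reports at irregular times.
readings = [(0.5, 1.0), (1.5, 2.0), (3.2, 4.0), (5.9, 1.5), (6.0, 0.5)]
aligned = bucket_power_samples(readings)
```

If the power API reports at a coarser interval than 3s, the window size passed as `interval_s` would have to be widened instead, so that every window contains at least one reading.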

### Verify

2 changes: 1 addition & 1 deletion docs/index.md
@@ -2,7 +2,7 @@

Kepler (Kubernetes-based Efficient Power Level Exporter) is a Prometheus exporter. It uses eBPF to probe CPU performance counters and Linux kernel tracepoints.

-These data and stats from cgroup and sysfs can then be fed into ML models to estimate energy consumption by Pods.
+These data and stats from sysfs can then be fed into ML models to estimate energy consumption by Pods.

Check out the project on GitHub ➡️ [Kepler](https://github.com/sustainable-computing-io/kepler).

5 changes: 2 additions & 3 deletions docs/kepler_model_server/pipeline.md
@@ -43,13 +43,12 @@ for each defined resource utilization metric group as below.
Group Name|Features|Kepler Metric Source(s)
---|---|---
CounterOnly|COUNTER_FEATURES|[Hardware Counter](../design/metrics.md#hardware-counter-metrics)
-CgroupOnly|CGROUP_FEATURES|[cGroups](../design/metrics.md#cgroups-metrics)
BPFOnly|BPF_FEATURES|[BPF](../design/metrics.md#base-metric)
IRQOnly|IRQ_FEATURES|[IRQ](../design/metrics.md#irq-metrics)
AcceleratorOnly|ACCELERATOR_FEATURES|[Accelerator](../design/metrics.md#Accelerator-metrics)
CounterIRQCombined|COUNTER_FEATURES, IRQ_FEATURES|BPF and Hardware Counter
-Basic|COUNTER_FEATURES, CGROUP_FEATURES, BPF_FEATURES|All except IRQ and node information
-WorkloadOnly|COUNTER_FEATURES, CGROUP_FEATURES, BPF_FEATURES, IRQ_FEATURES, ACCELERATOR_FEATURES|All except node information
+Basic|COUNTER_FEATURES, BPF_FEATURES|All except IRQ and node information
+WorkloadOnly|COUNTER_FEATURES, BPF_FEATURES, IRQ_FEATURES, ACCELERATOR_FEATURES|All except node information
Full|WORKLOAD_FEATURES, SYSTEM_FEATURES|All

Node information refers to value from [kepler_node_info](../design/metrics.md#kepler-metrics-for-node-information)
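
To make the effect of this hunk easier to check, the post-change table can be sketched as a Python mapping. This is an illustration only: the dictionary `METRIC_GROUPS` and the helper `groups_using` are hypothetical names, not part of the model server, and the entries are copied from the table rows that remain after the commit.

```python
# Group name -> feature-group names, copied from the table as it reads
# after this commit (the CgroupOnly row and CGROUP_FEATURES are gone).
METRIC_GROUPS = {
    "CounterOnly": ["COUNTER_FEATURES"],
    "BPFOnly": ["BPF_FEATURES"],
    "IRQOnly": ["IRQ_FEATURES"],
    "AcceleratorOnly": ["ACCELERATOR_FEATURES"],
    "CounterIRQCombined": ["COUNTER_FEATURES", "IRQ_FEATURES"],
    "Basic": ["COUNTER_FEATURES", "BPF_FEATURES"],
    "WorkloadOnly": [
        "COUNTER_FEATURES",
        "BPF_FEATURES",
        "IRQ_FEATURES",
        "ACCELERATOR_FEATURES",
    ],
    "Full": ["WORKLOAD_FEATURES", "SYSTEM_FEATURES"],
}


def groups_using(feature_group):
    """Return the sorted names of groups that list the given feature group."""
    return sorted(
        name for name, features in METRIC_GROUPS.items() if feature_group in features
    )


# After the commit, no group should reference the cgroup features.
cgroup_users = groups_using("CGROUP_FEATURES")
bpf_users = groups_using("BPF_FEATURES")
```

Here `cgroup_users` comes back empty, which matches the intent of the change, while `bpf_users` lists the groups that still depend on BPF features.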
8 changes: 4 additions & 4 deletions docs/usage/deep_dive.md
@@ -12,11 +12,11 @@ Kepler, Kubernetes-based Efficient Power Level Exporter, offers a way to estimat

Kepler uses the following to collects power data:

-#### EBPF, Hardware Counters, cGroups
+#### EBPF, Hardware Counters

-Kepler can utilize a BPF program integrated into the kernel's pathway to extract process-related resource utilization metrics or use metrics from Hardware Counters or cGroups.
+Kepler can utilize a BPF program integrated into the kernel's pathway to extract process-related resource utilization metrics or use metrics from Hardware Counters.
The type of metrics used to build the model can differ based on the system's environment.
-For example, it might use hardware counters, or metrics from tools like eBPF or cGroups, depending on what is available in the system that will use the model.
+For example, it might use hardware counters, or metrics from tools like eBPF, depending on what is available in the system that will use the model.

#### Real-time Component Power Meters

@@ -44,7 +44,7 @@ When creating the power model, the Model Server uses a regression algorithm. It

Once trained, the Model Server makes these models accessible through a github repository, where any Kepler deployment can download the model from.
Kepler then uses these models to calculate how much power a node (VM) consumes based on the way its resources are being used. The type of metrics used to build the model can differ based on the system's environment.
-For example, it might use hardware counters, or metrics from tools like eBPF or cGroups, depending on what is available in the system that will use the model.
+For example, it might use hardware counters, or metrics from tools like eBPF, depending on what is available in the system that will use the model.

![Power model training](../fig/power_model_training.jpg)

1 change: 0 additions & 1 deletion docs/usage/general_config.md
@@ -27,7 +27,6 @@ This is a list of configurable values of Kepler System. The configuration can be
|Model Server Pod Environment (INITIAL_MODEL_NAMES.[`MODEL_TYPE`])|model-server.[`MODEL_TYPE`]|Name of default pipeline for each model type|-|
|***CollectMetric CR*** (single item: default)||||
|Kepler DaemonSet Environment (COUNTER_METRICS)|counter|List of performance metrics to enable from counter source| * (enable all available metrics from counter source)|
-|Kepler DaemonSet Environment (CGROUP_METRICS)|cgroup|List of performance metrics to enable from cgroup source| * (enable all available metrics from cgroup source)|
|Kepler DaemonSet Environment (BPF_METRICS)|bpf|List of performance metrics to enable from bpf (aka. eBPF) source| * (enable all available metrics from bpf source)|
|Kepler DaemonSet Environment (GPU_METRICS)|gpu|List of performance metrics to enable from gpu source| * (enable all available metrics from gpu source)|
|***ExportMetric CR*** (single item: default)||||
1 change: 0 additions & 1 deletion docs/usage/kepler_daemon.md
@@ -15,7 +15,6 @@ To set environments by ConfigMap:
data:
MODEL_SERVER_ENABLE: true
COUNTER_METRICS: '*'
-CGROUP_METRICS: '*'
BPF_METRICS: '*'
# KUBELET_METRICS: ''
# GPU_METRICS: ''
25 changes: 0 additions & 25 deletions docs/usage/trouble_shooting.md
@@ -28,28 +28,3 @@ apt install linux-headers-$(uname -r)
```

On OpenShift, install the MachineConfiguration [here](https://github.com/sustainable-computing-io/kepler/tree/main/manifests/config/cluster-prereqs)
-
-## Kepler energy metrics are zeroes
-
-<!-- markdownlint-disable MD024 -->
-### Background
-
-Kepler uses RAPL counters on x86 platforms to read energy consumption.
-VMs do not have RAPL counters and thus Kepler estimates energy consumption based on the pre-trained
-ML models. The models use either hardware performance counters or cGroup stats to estimate energy
-consumed by processes. Currently the cGroup based models use cGroup v2 features such as
-`cgroupfs_cpu_usage_us`, `cgroupfs_memory_usage_bytes`, `cgroupfs_system_cpu_usage_us`,
-`cgroupfs_user_cpu_usage_us`, `bytes_read`, and `bytes_writes`.
-
-### Diagnose
-
-The Kepler metrics are zeroes, check if cGroup version on the node:
-
-```bash
-ls /sys/fs/cgroup/cgroup.controllers
-```
-
-### Solution
-<!-- markdownlint-enable MD024 -->
-
-Enable cGroup v2 on the node by following [these Kubernetes instruction](https://kubernetes.io/docs/concepts/architecture/cgroups/).
