diff --git a/docs/design/metrics.md b/docs/design/metrics.md
index 7b867e2c..a511f069 100644
--- a/docs/design/metrics.md
+++ b/docs/design/metrics.md
@@ -116,27 +116,6 @@ All the metrics specific to the Kepler Exporter are prefixed with `kepler`.
 !!! note
     You can enable/disable expose of those metrics through `expose-hardware-counter-metrics` Kepler execution option or `EXPOSE_HW_COUNTER_METRICS` environment value.
 
-### cGroups Metrics
-
-- **kepler_container_cgroupfs_cpu_usage_us_total**
-
-    This measures the total CPU time used by the container reading from cGroups stat.
-
-- **kepler_container_cgroupfs_memory_usage_bytes_total**
-
-    This measures the total memory in bytes used by the container reading from cGroups stat.
-
-- **kepler_container_cgroupfs_system_cpu_usage_us_total**
-
-    This measures the total CPU time in kernel space used by the container reading from cGroups stat.
-
-- **kepler_container_cgroupfs_user_cpu_usage_us_total**
-
-    This measures the total CPU time in userspace used by the container reading from cGroups stat.
-
-!!! note
-    You can enable/disable expose of those metrics through `EXPOSE_CGROUP_METRICS` environment value.
-
 ### IRQ Metrics
 
 - **kepler_container_bpf_net_tx_irq_total**
diff --git a/docs/hardwareengagement/index.md b/docs/hardwareengagement/index.md
index 41a048a9..77de6275 100644
--- a/docs/hardwareengagement/index.md
+++ b/docs/hardwareengagement/index.md
@@ -24,9 +24,12 @@ Currently, we use power consumption API as RAPL or ACPI.
 For some of the devices, you may need to find your own way to get power consumption, and implement in golang for Kepler usage.
 For further plan, please ref [here](https://github.com/sustainable-computing-io/kepler/issues/644)
 
-### eBPF/cgroup data
+### eBPF data
 
-Currently, we relays on eBPF and cgroup to characterization a process/pod. Hence, you can ref to our dependency as BCC or cgroup. To test those golang package works well on your device.
+Currently, we rely on eBPF to obtain key cpu, irq and perf information about a process. Hence, refer to the documentation of
+[cilium/ebpf](https://github.com/cilium/ebpf) to test whether these Go packages work well on your device.
+
+Please let us know if you need any further adjustments!
 
 ## Stage 1 Integration with ratio
 
@@ -39,7 +42,7 @@ You should know the scope of the Power consumption API. How many API do you have
 ### Interval
 
 You should know the intervals of the Power consumption API.
-As Kepler collects eBPF and cgroup data in each 3s by default, you should know the interval and make them in same time slot.
+As Kepler collects eBPF data in each 3s by default, you should know the interval and make them in same time slot.
 
 ### Verify
 
diff --git a/docs/index.md b/docs/index.md
index 5cda3176..5d0b6313 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -2,7 +2,7 @@
 
 Kepler (Kubernetes-based Efficient Power Level Exporter) is a Prometheus exporter.
 It uses eBPF to probe CPU performance counters and Linux kernel tracepoints.
-These data and stats from cgroup and sysfs can then be fed into ML models to estimate energy consumption by Pods.
+These data and stats from sysfs can then be fed into ML models to estimate energy consumption by Pods.
 
 Check out the project on GitHub ➡️ [Kepler](https://github.com/sustainable-computing-io/kepler).
 
diff --git a/docs/kepler_model_server/pipeline.md b/docs/kepler_model_server/pipeline.md
index 81a43247..3c2a716a 100644
--- a/docs/kepler_model_server/pipeline.md
+++ b/docs/kepler_model_server/pipeline.md
@@ -43,13 +43,12 @@ for each defined resource utilization metric group as below.
 Group Name|Features|Kepler Metric Source(s)
 ---|---|---
 CounterOnly|COUNTER_FEATURES|[Hardware Counter](../design/metrics.md#hardware-counter-metrics)
-CgroupOnly|CGROUP_FEATURES|[cGroups](../design/metrics.md#cgroups-metrics)
 BPFOnly|BPF_FEATURES|[BPF](../design/metrics.md#base-metric)
 IRQOnly|IRQ_FEATURES|[IRQ](../design/metrics.md#irq-metrics)
 AcceleratorOnly|ACCELERATOR_FEATURES|[Accelerator](../design/metrics.md#Accelerator-metrics)
 CounterIRQCombined|COUNTER_FEATURES, IRQ_FEATURES|BPF and Hardware Counter
-Basic|COUNTER_FEATURES, CGROUP_FEATURES, BPF_FEATURES|All except IRQ and node information
-WorkloadOnly|COUNTER_FEATURES, CGROUP_FEATURES, BPF_FEATURES, IRQ_FEATURES, ACCELERATOR_FEATURES|All except node information
+Basic|COUNTER_FEATURES, BPF_FEATURES|All except IRQ and node information
+WorkloadOnly|COUNTER_FEATURES, BPF_FEATURES, IRQ_FEATURES, ACCELERATOR_FEATURES|All except node information
 Full|WORKLOAD_FEATURES, SYSTEM_FEATURES|All
 
 Node information refers to value from [kepler_node_info](../design/metrics.md#kepler-metrics-for-node-information)
diff --git a/docs/usage/deep_dive.md b/docs/usage/deep_dive.md
index af98a0e8..c0619d60 100644
--- a/docs/usage/deep_dive.md
+++ b/docs/usage/deep_dive.md
@@ -12,11 +12,11 @@ Kepler, Kubernetes-based Efficient Power Level Exporter, offers a way to estimat
 
 Kepler uses the following to collects power data:
 
-#### EBPF, Hardware Counters, cGroups
+#### EBPF, Hardware Counters
 
-Kepler can utilize a BPF program integrated into the kernel's pathway to extract process-related resource utilization metrics or use metrics from Hardware Counters or cGroups.
+Kepler can utilize a BPF program integrated into the kernel's pathway to extract process-related resource utilization metrics or use metrics from Hardware Counters.
 The type of metrics used to build the model can differ based on the system's environment.
-For example, it might use hardware counters, or metrics from tools like eBPF or cGroups, depending on what is available in the system that will use the model.
+For example, it might use hardware counters, or metrics from tools like eBPF, depending on what is available in the system that will use the model.
 
 #### Real-time Component Power Meters
 
@@ -44,7 +44,7 @@ When creating the power model, the Model Server uses a regression algorithm. It
 
 Once trained, the Model Server makes these models accessible through a github repository, where any Kepler deployment can download the model from. Kepler then uses these models to calculate how much power a node (VM) consumes based on the way its resources are being used.
 The type of metrics used to build the model can differ based on the system's environment.
-For example, it might use hardware counters, or metrics from tools like eBPF or cGroups, depending on what is available in the system that will use the model.
+For example, it might use hardware counters, or metrics from tools like eBPF, depending on what is available in the system that will use the model.
 
 ![Power model training](../fig/power_model_training.jpg)
 
diff --git a/docs/usage/general_config.md b/docs/usage/general_config.md
index 7f1b5dc4..18db2e5e 100644
--- a/docs/usage/general_config.md
+++ b/docs/usage/general_config.md
@@ -27,7 +27,6 @@ This is a list of configurable values of Kepler System. The configuration can be
 |Model Server Pod Environment (INITIAL_MODEL_NAMES.[`MODEL_TYPE`])|model-server.[`MODEL_TYPE`]|Name of default pipeline for each model type|-|
 |***CollectMetric CR*** (single item: default)||||
 |Kepler DaemonSet Environment (COUNTER_METRICS)|counter|List of performance metrics to enable from counter source| * (enable all available metrics from counter source)|
-|Kepler DaemonSet Environment (CGROUP_METRICS)|cgroup|List of performance metrics to enable from cgroup source| * (enable all available metrics from cgroup source)|
 |Kepler DaemonSet Environment (BPF_METRICS)|bpf|List of performance metrics to enable from bpf (aka. eBPF) source| * (enable all available metrics from bpf source)|
 |Kepler DaemonSet Environment (GPU_METRICS)|gpu|List of performance metrics to enable from gpu source| * (enable all available metrics from gpu source)|
 |***ExportMetric CR*** (single item: default)||||
diff --git a/docs/usage/kepler_daemon.md b/docs/usage/kepler_daemon.md
index 9327002c..c44a9fe1 100644
--- a/docs/usage/kepler_daemon.md
+++ b/docs/usage/kepler_daemon.md
@@ -15,7 +15,6 @@ To set environments by ConfigMap:
 data:
   MODEL_SERVER_ENABLE: true
   COUNTER_METRICS: '*'
-  CGROUP_METRICS: '*'
   BPF_METRICS: '*'
   # KUBELET_METRICS: ''
   # GPU_METRICS: ''
diff --git a/docs/usage/trouble_shooting.md b/docs/usage/trouble_shooting.md
index 522fee3b..53b5c66c 100644
--- a/docs/usage/trouble_shooting.md
+++ b/docs/usage/trouble_shooting.md
@@ -28,28 +28,3 @@ apt install linux-headers-$(uname -r)
 ```
 
 On OpenShift, install the MachineConfiguration [here](https://github.com/sustainable-computing-io/kepler/tree/main/manifests/config/cluster-prereqs)
-
-## Kepler energy metrics are zeroes
-
-
-### Background
-
-Kepler uses RAPL counters on x86 platforms to read energy consumption.
-VMs do not have RAPL counters and thus Kepler estimates energy consumption based on the pre-trained
-ML models. The models use either hardware performance counters or cGroup stats to estimate energy
-consumed by processes. Currently the cGroup based models use cGroup v2 features such as
-`cgroupfs_cpu_usage_us`, `cgroupfs_memory_usage_bytes`, `cgroupfs_system_cpu_usage_us`,
-`cgroupfs_user_cpu_usage_us`, `bytes_read`, and `bytes_writes`.
-
-### Diagnose
-
-The Kepler metrics are zeroes, check if cGroup version on the node:
-
-```bash
-ls /sys/fs/cgroup/cgroup.controllers
-```
-
-### Solution
-
-
-Enable cGroup v2 on the node by following [these Kubernetes instruction](https://kubernetes.io/docs/concepts/architecture/cgroups/).