Update monitoring.md and remove Ganglia (#180)
Co-authored-by: Bernt Popp <bernt.popp@googlemail.com>
sellth and berntpopp authored Oct 28, 2024
1 parent 5d303bd commit 9dfcb3f
Showing 5 changed files with 7 additions and 62 deletions.
2 changes: 1 addition & 1 deletion bih-cluster/docs/admin/getting-access.md
@@ -122,7 +122,7 @@ Additional members (cluster user names):

 !!! note "Notes"
 - All projects must have one owner and may have one delegate.
-- Please note that we will enforce [kebab case]([url](https://en.wikipedia.org/wiki/Letter_case#Kebab_case)) for all project names and folders.
+- Please note that we will enforce [kebab case](https://en.wikipedia.org/wiki/Letter_case#Kebab_case) for all project names and folders.
 - Tier 1 project storage will be supplemented with 10 TB of T1 scratch by default.
 - Users can be associated with multiple projects.
 - Project membership does not grant cluster access. A primary group affiliation is still required.
3 changes: 0 additions & 3 deletions bih-cluster/docs/overview/figures/Ganglia_Aggregate_GPUs.png

This file was deleted.

3 changes: 0 additions & 3 deletions bih-cluster/docs/overview/figures/Ganglia_Example.png

This file was deleted.

3 changes: 3 additions & 0 deletions bih-cluster/docs/overview/figures/metrics_dashboard.png
58 changes: 3 additions & 55 deletions bih-cluster/docs/overview/monitoring.md
@@ -1,59 +1,7 @@
 # Monitoring

-We currently provide you only with Ganglia for monitoring the cluster status.
+We currently provide a Grafana dashboard for monitoring various aspects of the cluster's status:

-## Using Ganglia
+https://metrics.cubi.bihealth.org/public-dashboards/dc3e4d5b1ea049429abf39e412c47302

-Go to the following address and login with your home organization (Charite or MDC):
-
-- https://hpc-ganglia.cubi.bihealth.org
-
-
-!!! cite "Ganglia does not know about Slurm"
-Ganglia will not show you anything about the Slurm job schedulign system.
-If a job uses a whole node but uses no CPUs then this will be displayed as unused in Ganglia.
-However, Slurm would not schedule another job on this node.
-
-You will be show a screen as shown below.
-This allows you to get a good idea of what is going on on the HPC.
-
-![](figures/Ganglia_Example.png)
-
-By default you will be shown the cluster usage of the last day.
-You can quickly switch to report for two or four hours as well, etc.
-
-In the first row of pictures you see the number of total CPUs (actually hardware threads), number of hosts seen as up and down by Ganglia, and cluster load/utilization.
-You will then see the overall cluster load, memory usage, CPU usage, and network utilization across the selected time period.
-
-!!! cite "Linux load is not intuitive"
-Note that the technical details behind Linux **load** is not very interactive.
-It is incorporating much more than just the CPU usage.
-You can find a quite comprehensive [treatement of Linux Load here](https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html).
-
-We are using a fast shared storage system and almost no local storage (except in `/tmp`).
-Also, almost no jobs use MPI or other heavy network communication.
-Thus, the network utilization is a good measure of the I/O on the cluster.
-
-Below, you can drill down into various metrics and visualize them historically.
-Just try it out and find your way around, you cannot break anything.
-Sadly, there is no good documentation of Ganglia online.
-
-## Aggregate GPU Utilization Visualization
-
-Ganglia allows you to obtain metrics in several interesting and useful ways.
-If you click on "Aggregate Graphs" then you could enter the following values to get an overview of the live GPU utilization.
-
-- Title: `Aggreate GPU Utilization`
-- Host Regular expression: `hpc-gpu-.*`
-- Metric Regular Expressions: `gpu._util`
-- Graph Type: `Stacked`
-- Legend Options: `Hide legend`
-
-Then click `Create Graph`.
-
-![](figures/Ganglia_Aggregate_GPUs.png)
-
-If a GPU is fully used, it will contribute 100 points on the vertical axis.
-See above for an example, and here is a direct link:
-
-- [Aggregate GPU Utilization](https://hpc-ganglia.cubi.bihealth.org/ganglia/graph_all_periods.php?title=Aggregate+GPU+Utilization&cs=&ce=&vl=&x=&n=&hreg%5B%5D=hpc-gpu-.*&mreg%5B%5D=gpu._util&gtype=stack&glegend=hide&aggregate=1)
+![](figures/metrics_dashboard.png)
