diff --git a/bih-cluster/docs/admin/getting-access.md b/bih-cluster/docs/admin/getting-access.md
index fab1eecae..caeecec80 100644
--- a/bih-cluster/docs/admin/getting-access.md
+++ b/bih-cluster/docs/admin/getting-access.md
@@ -122,7 +122,7 @@ Additional members (cluster user names):
 
 !!! note "Notes"
     - All projects must have one owner and may have one delegate.
-    - Please note that we will enforce [kebab case]([url](https://en.wikipedia.org/wiki/Letter_case#Kebab_case)) for all project names and folders.
+    - Please note that we will enforce [kebab case](https://en.wikipedia.org/wiki/Letter_case#Kebab_case) for all project names and folders.
     - Tier 1 project storage will be supplemented with 10 TB of T1 scratch by default.
     - Users can be associated with multiple projects.
     - Project membership does not grant cluster access. A primary group affiliation is still required.
diff --git a/bih-cluster/docs/overview/figures/Ganglia_Aggregate_GPUs.png b/bih-cluster/docs/overview/figures/Ganglia_Aggregate_GPUs.png
deleted file mode 100644
index 0ff6a81b0..000000000
--- a/bih-cluster/docs/overview/figures/Ganglia_Aggregate_GPUs.png
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:a4e8aa2ea5757f40b961986f29aab37c992cbdeaffa19ea0de8c170593b2227f
-size 179155
diff --git a/bih-cluster/docs/overview/figures/Ganglia_Example.png b/bih-cluster/docs/overview/figures/Ganglia_Example.png
deleted file mode 100644
index 06a8b2093..000000000
--- a/bih-cluster/docs/overview/figures/Ganglia_Example.png
+++ /dev/null
@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:a50436a5e539f115b12c2516fa2001076b6262b27a2a803b72e55e86544da7f3
-size 245295
diff --git a/bih-cluster/docs/overview/figures/metrics_dashboard.png b/bih-cluster/docs/overview/figures/metrics_dashboard.png
new file mode 100644
index 000000000..4c2a14cbe
--- /dev/null
+++ b/bih-cluster/docs/overview/figures/metrics_dashboard.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5e85de08aff75d193c884d2df336be4e84cbb216dbaf9cb3a849bd13ba4b50da
+size 300694
diff --git a/bih-cluster/docs/overview/monitoring.md b/bih-cluster/docs/overview/monitoring.md
index a05092d36..c4615a114 100644
--- a/bih-cluster/docs/overview/monitoring.md
+++ b/bih-cluster/docs/overview/monitoring.md
@@ -1,59 +1,7 @@
 # Monitoring
 
-We currently provide you only with Ganglia for monitoring the cluster status.
+We currently provide a Grafana dashboard for monitoring various aspects of the cluster's status:
 
-## Using Ganglia
+https://metrics.cubi.bihealth.org/public-dashboards/dc3e4d5b1ea049429abf39e412c47302
 
-Go to the following address and login with your home organization (Charite or MDC):
-
-- https://hpc-ganglia.cubi.bihealth.org
-
-
-!!! cite "Ganglia does not know about Slurm"
-    Ganglia will not show you anything about the Slurm job schedulign system.
-    If a job uses a whole node but uses no CPUs then this will be displayed as unused in Ganglia.
-    However, Slurm would not schedule another job on this node.
-
-You will be show a screen as shown below.
-This allows you to get a good idea of what is going on on the HPC.
-
-![](figures/Ganglia_Example.png)
-
-By default you will be shown the cluster usage of the last day.
-You can quickly switch to report for two or four hours as well, etc.
-
-In the first row of pictures you see the number of total CPUs (actually hardware threads), number of hosts seen as up and down by Ganglia, and cluster load/utilization.
-You will then see the overall cluster load, memory usage, CPU usage, and network utilization across the selected time period.
-
-!!! cite "Linux load is not intuitive"
-    Note that the technical details behind Linux **load** is not very interactive.
-    It is incorporating much more than just the CPU usage.
-    You can find a quite comprehensive [treatement of Linux Load here](https://www.brendangregg.com/blog/2017-08-08/linux-load-averages.html).
-
-We are using a fast shared storage system and almost no local storage (except in `/tmp`).
-Also, almost no jobs use MPI or other heavy network communication.
-Thus, the network utilization is a good measure of the I/O on the cluster.
-
-Below, you can drill down into various metrics and visualize them historically.
-Just try it out and find your way around, you cannot break anything.
-Sadly, there is no good documentation of Ganglia online.
-
-## Aggregate GPU Utilization Visualization
-
-Ganglia allows you to obtain metrics in several interesting and useful ways.
-If you click on "Aggregate Graphs" then you could enter the following values to get an overview of the live GPU utilization.
-
-- Title: `Aggreate GPU Utilization`
-- Host Regular expression: `hpc-gpu-.*`
-- Metric Regular Expressions: `gpu._util`
-- Graph Type: `Stacked`
-- Legend Options: `Hide legend`
-
-Then click `Create Graph`.
-
-![](figures/Ganglia_Aggregate_GPUs.png)
-
-If a GPU is fully used, it will contribute 100 points on the vertical axis.
-See above for an example, and here is a direct link:
-
-- [Aggregate GPU Utilization](https://hpc-ganglia.cubi.bihealth.org/ganglia/graph_all_periods.php?title=Aggregate+GPU+Utilization&cs=&ce=&vl=&x=&n=&hreg%5B%5D=hpc-gpu-.*&mreg%5B%5D=gpu._util&gtype=stack&glegend=hide&aggregate=1)
\ No newline at end of file
+![](figures/metrics_dashboard.png)