Commit
module updates
mrobson committed Jan 10, 2025
1 parent 7b8d138 commit 418cc10
Showing 8 changed files with 224 additions and 38 deletions.
4 changes: 3 additions & 1 deletion content/modules/ROOT/nav.adoc
@@ -61,4 +61,6 @@
** xref:module-09.adoc#checkingressconfig[Check the _Ingress Controller_ configuration]
** xref:module-09.adoc#solution[Issue solution]
* xref:module-10.adoc[10. Exploring etcd snapshots with koff]
* xref:module-10.adoc[10. Exploring etcd snapshots with koff]
** xref:module-10.adoc#gettingstarted[Getting Started with koff]
** xref:module-10.adoc#koffget[Viewing resources in etcd]
2 changes: 1 addition & 1 deletion content/modules/ROOT/pages/module-02.adoc
@@ -29,7 +29,7 @@ cd ~/Module2/
[source,bash]
----
$ omc use module2-demo-must-gather
Must-Gather : /home/lab-user/Module2/sno_demo/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2de07af89683678ae6bb7a939615fc0d4ced7fe185add38b050f2c6f60023b6f
Must-Gather : /home/lab-user/Module2/module2-demo-must-gather/sno_demo/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-2de07af89683678ae6bb7a939615fc0d4ced7fe185add38b050f2c6f60023b6f
Project : default
ApiServerURL : https://api.cluster-6fmht.dynamic.redhatworkshops.io:6443
Platform : None
2 changes: 1 addition & 1 deletion content/modules/ROOT/pages/module-03.adoc
@@ -26,7 +26,7 @@ We recently patched this cluster to from *4.14.27* to *4.14.37*. Previous scale-

. Check the cluster `nodes` and cluster `machines` to verify the nodes do not exist and the machines do

.Click to show some commands if you need a hint
.*Click to show some commands if you need a hint*
[%collapsible]
====
[source,bash]
10 changes: 5 additions & 5 deletions content/modules/ROOT/pages/module-04.adoc
@@ -97,7 +97,7 @@ Start by taking a high level view. You can be both broad and granular with audit

Look at the top usage for the common `--by=` groups like `resource` and `user`

.Click to show some commands if you need a hint
.*Click to show some commands if you need a hint*
[%collapsible]
====
[source,bash]
@@ -120,7 +120,7 @@ We spotted something suspicious, so let's drill down a little deeper.
When evaluating the data, always factor in things like the total number of requests, time period and the number of nodes.
====

.Click to show some details if you need a hint
.*Click to show some details if you need a hint*
[%collapsible]
====
Our top 3 resources from the previous command were `nodes`, `configmaps` and `pods`:
@@ -140,7 +140,7 @@ Our top 3 users from the previous command were `sysdig-agent`, `apiserver` and `
One of those sticks out a lot, but let's first take a look at our top 3 resources. For this we can use the `--resource=` flag, in addition to `--by=` and `-o top`, to drill down on a specific resource.
.Click to show some details if you need a hint
.*Click to show some details if you need a hint*
[%collapsible]
====
----
@@ -167,7 +167,7 @@ Let's try to answer the following:
What is the user doing? +
What is the problem?
.Click to show some details if you need a hint
.*Click to show some details if you need a hint*
[%collapsible]
====
----
@@ -186,7 +186,7 @@ Top 10 "GET" (of 440076 total hits):
8308x [ 270.327µs] [403-8307] /api/v1/nodes/cluster-app-02.dmz/proxy/metrics [system:serviceaccount:openshift-example-sysdig-agent:sysdig-agent]
----

The conclusion is that there was an issue with the `SysDig`` monitoring component that was causing it to fail authentication when trying to collect `node` metrics and in turn spam the API server.
The conclusion is that there was an issue with the `SysDig` monitoring component that was causing it to fail authentication when trying to collect `node` metrics and in turn spam the API server.
====

I hope you found this introduction to the `kubectl-dev_tool` useful and can leverage it the next time you have an issue!
13 changes: 9 additions & 4 deletions content/modules/ROOT/pages/module-05.adoc
@@ -63,7 +63,12 @@ Options:
====
[source,bash]
----
$ ocp_insights.sh --file insights_archive.tar.gz
cd ~/Module5/
----
[source,bash]
----
$ ocp_insights.sh --file module5-insights-data
Cluster Version: 4.14.27
Channel: eus-4.14
@@ -152,7 +157,7 @@ To see all Alerts run: jq -r . insights-2024-08-14-144858/config/alerts.json
====
[source,bash]
----
$ ocp_insights.sh --file insights_archive.tar.gz --customer_memory
$ ocp_insights.sh --file module5-insights-data --customer_memory
...
Customer Namespace Memory Usage.
@@ -184,7 +189,7 @@ Total Customer Namespace Memory Usage: 121.9884G
====
[source,bash]
----
$ ocp_insights.sh --file insights_archive.tar.gz --etcd_metrics
$ ocp_insights.sh --file module5-insights-data --etcd_metrics
etcd server slow apply total
etcd-ocp4-2nvq7-master-0,3548
@@ -209,7 +214,7 @@ etcd-ocp4-2nvq7-master-1,22
====
[source,bash]
----
$ ocp_insights.sh --file insights_archive.tar.gz --storage_classes
$ ocp_insights.sh --file module5-insights-data --storage_classes
...
StorageClass Information.
17 changes: 11 additions & 6 deletions content/modules/ROOT/pages/module-06.adoc
@@ -60,7 +60,12 @@ options:
====
[source,bash]
----
$ etcd-ocp-diag.py --path <path_to_mg> --stats
cd ~/Module6/
----
[source,bash]
----
$ etcd-ocp-diag.py --path module6-must-gather.6521552859184261155/ --stats
Stats about etcd "apply request took too long" messages: etcd-ocp4-2nvq7-master-0
First Occurrence: 2024-07-28T04:00:27
Last Occurrence: 2024-08-14T15:18:23
@@ -135,7 +140,7 @@ Stats about etcd "slow fdatasync" messages: etcd-ocp4-2nvq7-master-2
====
[source,bash]
----
$ etcd-ocp-diag.py --path <path_to_mg> --errors
$ etcd-ocp-diag.py --path module6-must-gather.6521552859184261155/ --errors
POD ERROR COUNT
etcd-ocp4-2nvq7-master-0 waiting for ReadIndex response took too long, retrying 295
etcd-ocp4-2nvq7-master-0 slow fdatasync 60
@@ -180,7 +185,7 @@ etcd-ocp4-2nvq7-master-2 sending buffer is full
====
[source,bash]
----
$ etcd-ocp-diag.py --path <path_to_mg> --ttl
$ etcd-ocp-diag.py --path module6-must-gather.6521552859184261155/ --ttl
POD DATE COUNT
etcd-ocp4-2nvq7-master-0 2024-07-28 121
etcd-ocp4-2nvq7-master-0 2024-07-29 112
@@ -215,7 +220,7 @@ etcd-ocp4-2nvq7-master-2 2024-08-14 952
====
[source,bash]
----
$ etcd-ocp-diag.py --path <path_to_mg> --ttl --date 2024-07-28 --pod etcd-ocp4-2nvq7-master-1
$ etcd-ocp-diag.py --path module6-must-gather.6521552859184261155/ --ttl --date 2024-07-28 --pod etcd-prodshift-2nvq7-master-1
POD DATE COUNT
etcd-ocp4-2nvq7-master-1 05:16 12
etcd-ocp4-2nvq7-master-1 05:31 13
@@ -232,7 +237,7 @@ etcd-ocp4-2nvq7-master-1 08:12 3
====
[source,bash]
----
$ etcd-ocp-diag.py --path <path_to_mg> --ttl --compare
$ etcd-ocp-diag.py --path module6-must-gather.6521552859184261155/ --ttl --compare
Date: 2024-07-28
POD COUNT
etcd-ocp4-2nvq7-master-0 121
@@ -254,7 +259,7 @@ etcd-ocp4-2nvq7-master-2 152

[source,bash]
----
$ etcd-ocp-diag.py --path <path_to_mg> --ttl --date 2024-07-28 --compare
$ etcd-ocp-diag.py --path module6-must-gather.6521552859184261155/ --ttl --date 2024-07-28 --compare
Date: 04:02
POD COUNT
etcd-ocp4-2nvq7-master-0 8
41 changes: 21 additions & 20 deletions content/modules/ROOT/pages/module-09.adoc
@@ -20,7 +20,7 @@ We are on a _UPI_ cluster and use an external _Load Balancer_ to send traffic to

. Using the `omc use` command, set the `module9-must-gather.local` _must-gather_ as the current archive in use.

.Click to show some commands if you need a hint
.*Click to show some commands if you need a hint*
[%collapsible]
====
[source,bash]
@@ -46,7 +46,7 @@ OCP might receive critical networking bugfixes between different z-releases, the

* Check the cluster version.
.Click to show some commands if you need a hint
.*Click to show some commands if you need a hint*
[%collapsible]
====
[source,bash]
@@ -57,7 +57,7 @@ omc get ClusterVersion version
* Check which CNI (_Container Network Interface_) plugin is being used on the cluster.
.Click to show some commands if you need a hint
.*Click to show some commands if you need a hint*
[%collapsible]
====
[source,bash]
@@ -68,7 +68,7 @@ omc get Network cluster -o json | yq '.spec.networkType'
* Check which _Ingress Controllers_ are installed.
.Click to show some commands if you need a hint
.*Click to show some commands if you need a hint*
[%collapsible]
====
[source,bash]
@@ -99,7 +99,7 @@ The command `oc adm must-gather` does not collect data from all Namespaces, but

Which command should we ask a customer to run in order to collect data for a specific Namespace?

.Click to show some commands if you need a hint
.*Click to show some commands if you need a hint*
[%collapsible]
====
[source,bash]
@@ -115,7 +115,7 @@ In this lab, the _inspect_ of `fsi-project` is named `module9-inspect-fsi-projec
The `omc` tool isn't restricted to _must-gathers_; it can also be set to read from a Namespace _inspect_ archive.
=====

.Click to show some commands if you need a hint
.*Click to show some commands if you need a hint*
[%collapsible]
====
[source,bash]
@@ -141,7 +141,7 @@ As the saying goes: _"When you hear hoofbeats behind you, don't expect to see a

* First of all, it will be handy to find the Selector used by the Deployment `fsi-application` for its Pods. Let's check it and put it into a shell variable.
.Click to show some commands if you need a hint
.*Click to show some commands if you need a hint*
[%collapsible]
====
[source,bash]
@@ -157,7 +157,7 @@ echo $SELECTOR_LABEL
* Then, check that the Pod replicas in the reported Deployment `fsi-application` are all running.
.Click to show some commands if you need a hint
.*Click to show some commands if you need a hint*
[%collapsible]
====
[source,bash]
@@ -178,7 +178,7 @@ omc get pod -l $SELECTOR_LABEL
When a Pod is correctly "connected" to a Service, its IP address will appear in the Endpoints object corresponding to the Service
=====

.Click to show some commands if you need a hint
.*Click to show some commands if you need a hint*
[%collapsible]
====
[source,bash]
@@ -194,7 +194,7 @@ omc get pod -l $SELECTOR_LABEL -o wide
* As reported by the customer, even if the above checks were successful, we should still expect to see traffic logs (for example, _GET_ requests) only in the logs of one of the two Pods. Let's verify by checking the logs of all Pods.
.Click to show some commands if you need a hint
.*Click to show some commands if you need a hint*
[%collapsible]
====
[source,bash]
@@ -222,7 +222,7 @@ That is, let's analyze the application related Route and how it is configured in
* Let's check the Route.
.Click to show some commands if you need a hint
.*Click to show some commands if you need a hint*
[%collapsible]
====
[source,bash]
@@ -243,14 +243,16 @@ fsi-route fsi-route-fsi-project.apps.foobarbank.lab.upshift.rdu2.redhat.com
[IMPORTANT]
=====
The application specific Route is contained into the _inspect_, however the `default` _Ingress Controller_ configuration is always contained into the _must-gather_.
The application specific Route is found inside the _inspect_ must-gather, however the `default` _Ingress Controller_ configuration is always only found in a full _must-gather_.
=====

.Click to show some commands if you need a hint
.*Click to show some commands if you need a hint*
[%collapsible]
====
Switch back to the full must-gather and use the `backends` command to view all of the haproxy configurations.
[source,bash]
----
omc use module9-must-gather.local/
omc haproxy backends fsi-project
----
====
@@ -263,18 +265,18 @@ NAMESPACE NAME INGRESSCONTROLLER SERVICES PORT TERMINATION
fsi-project fsi-route default fsi-service https(8443) passthrough/Redirect
----
* Everything seems correct so far, therefore we need to dig deeper. Let's manually print the whole `fsi-route` Route "admission" directly from the `default` _Ingress Controller_ configuration file.
* Everything seems correct so far, therefore we need to dig deeper. Let's manually print the `fsi-route` configuration directly from the `default` _Ingress Controller_ haproxy configuration file.
[TIP]
=====
In general, the `default` _Ingress Controller_ configuration file can be found at the following path:
In a full must-gather, the `default` _Ingress Controller_ configuration file can be found at the following path:

`<must-gather-archive>/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-<hash>/ingress_controllers/default/<ingress-default-pod>/haproxy.config`.

Note that there is one `haproxy.config` file for each _Ingress Controller_ Pod.
Note that there is one `haproxy.config` file for each _Ingress Controller_ Pod, although they should all be the same.
=====

.Click to show some commands if you need a hint
.*Click to show some commands if you need a hint*
[%collapsible]
====
[source,bash]
@@ -309,7 +311,7 @@ backend be_tcp:fsi-project:fsi-route
[#solution]
== Issue solution
Gothca ! The Route seems using the _balance_ of type `source`. We can verify whether this is the intended _Ingress Controller_ behavior by checking the official OCP documentation about link:https://docs.openshift.com/container-platform/4.17/networking/routes/route-configuration.html#nw-route-specific-annotations_route-configuration[_Route-specific annotations_].
Gotcha! The Route seems to be using the `source` _balance_ type. We can verify whether this is the intended _Ingress Controller_ behavior by checking the official OCP documentation about link:https://docs.openshift.com/container-platform/4.17/networking/routes/route-configuration.html#nw-route-specific-annotations_route-configuration[_Route-specific annotations_].
There we can read:
@@ -318,5 +320,4 @@ There we can read:
The default value is "source" for TLS passthrough routes. For all other routes, the default is "random".
----
OCP is therefore correctly behaving. The issue is not a bug, but a misconfiguration by the customer who supposed the _balance_ type was `random` for all types of Routes.
OCP is therefore behaving correctly. The issue is not a bug, but a misconfiguration/misunderstanding by the customer, who assumed the _balance_ type was `random` for all Routes.
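
To double-check a conclusion like this across every Route at once, a small helper can list the _balance_ algorithm configured for each backend in a `haproxy.config` file. The following is only a minimal sketch: the function name `balance_by_backend` and the sample stanzas are hypothetical and are not taken from the lab must-gather. It assumes each stanza starts with a `backend <name>` line and contains at most one `balance <algorithm>` line.

```shell
#!/usr/bin/env bash
# Print "<backend> <balance algorithm>" for every backend stanza in a haproxy.config.
# Assumption: a stanza starts with "backend <name>" and holds at most one
# "balance <algorithm>" line; other section types are not handled specially.
balance_by_backend() {
  awk '
    $1 == "backend" { name = $2 }                      # remember the current backend
    $1 == "balance" && name != "" { print name, $2 }   # report its balance algorithm
  ' "$1"
}

# Illustrative sample only (hypothetical content, not copied from the lab data).
sample="$(mktemp)"
cat > "$sample" <<'EOF'
backend be_tcp:fsi-project:fsi-route
  balance source
  server pod:example-1 10.0.0.1:8443 weight 1
backend be_http:example-project:example-route
  balance random
EOF

balance_by_backend "$sample"
rm -f "$sample"
```

Against a real archive, the function would be pointed at one of the `haproxy.config` files of the `default` _Ingress Controller_ Pods. If the customer wants a different distribution for a passthrough Route, the `haproxy.router.openshift.io/balance` annotation described on the documentation page linked above is the place to look.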