lifecycle-cpu-isolation and lifecycle-affinity-required-pods don't appear to be matching pods #1119

Open
joeldavis84 opened this issue May 19, 2023 · 15 comments


@joeldavis84

Relevant portion of the TNF config (CNF name redacted, can provide whatever is needed side channel):

podsUnderTestLabels:
  - "app.kubernetes.io/name:cnfname-topology"
  - "app.kubernetes.io/name:cnfname-directive"
  - "cnfname-directive:cnfname-directive-payload-coredns"
  - "cnfname-directive:cnfname-directive-topology"
  - "app:nat"
  - "app:cnfname-operator"

And there are pods with requests/limits configured in a way that seems like it would cause the test to fail (mismatched, not using whole CPUs, etc.):

[root@dci-server cnf-certification-test]# oc get pods -l cnfname-directive=cnfname-directive-payload-coredns
NAME                                    READY   STATUS    RESTARTS      AGE
cnfname-directive-payload-coredns-2fsgz   1/1     Running   0             21h
cnfname-directive-payload-coredns-5g6xg   1/1     Running   0             21h
cnfname-directive-payload-coredns-5p2tf   1/1     Running   0             21h
cnfname-directive-payload-coredns-5tj9d   1/1     Running   1 (17h ago)   21h
cnfname-directive-payload-coredns-m4sbd   1/1     Running   0             21h
cnfname-directive-payload-coredns-vz489   1/1     Running   0             21h
cnfname-directive-payload-coredns-znkhh   1/1     Running   0             21h

[root@dci-server cnf-certification-test]# oc get pods cnfname-directive-payload-coredns-znkhh -o yaml | grep -A6 resources:
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
      requests:
        cpu: 250m
        memory: 250Mi

but they aren't found when the test suite runs:

          {
            "-classname": "CNF Certification Test Suite",
            "-name": "[It] lifecycle lifecycle-cpu-isolation [common, telco, lifecycle-cpu-isolation, lifecycle]",
            "-status": "skipped",
            "-time": "0.000523242",
            "skipped": {
              "-message": "skipped - Test skipped because there are no []*provider.Pod to test, please check under test labels"
            },
            "system-err": "\u003e Enter [BeforeEach] lifecycle - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:55 @ 05/19/23 17:19:27.076\n\u003c Exit [BeforeEach] lifecycle - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:55 @ 05/19/23 17:19:27.076 (0s)\n\u003e Enter [It] lifecycle-cpu-isolation - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:167 @ 05/19/23 17:19:27.076\n[SKIPPED] Test skipped because there are no []*provider.Pod to test, please check under test labels\nIn [It] at: /usr/tnf/tnf-src/pkg/testhelper/testhelper.go:130 @ 05/19/23 17:19:27.076\n\u003c Exit [It] lifecycle-cpu-isolation - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:167 @ 05/19/23 17:19:27.076 (1ms)\n\u003e Enter [ReportAfterEach] lifecycle - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:58 @ 05/19/23 17:19:27.076\n\u003c Exit [ReportAfterEach] lifecycle - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:58 @ 05/19/23 17:19:27.076 (0s)"
          },

The above is for lifecycle-cpu-isolation, but it's also relevant for lifecycle-affinity-required-pods: these same pods have affinity rules but not the affinity-related labels or annotations mentioned in the CATALOG.md description, which seems like it should cause a failure rather than a skip.
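As a quick cross-check for the affinity case, the affinity rules can be dumped straight from one of the pods listed above; an illustrative command (not from the original run), reusing the pod from the resources check:

oc get pod cnfname-directive-payload-coredns-znkhh -o json | jq .spec.affinity

If that returns rules while the pod carries none of the affinity-related labels or annotations described in CATALOG.md, the expectation here is a failure rather than a skip.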

@sebrandon1
Member

Has this been figured out? Just going through the issue list and saw this hasn't been responded to.

@joeldavis84
Author

I haven't heard back about it, and as of the last test run in DCI this still seems to be skipped, even though AFAIK they're currently only working through failures.

@sebrandon1
Member

Looking at the DCI run that you sent me via Slack, the tnf_config.yml looks like it's still using targetPodLabels instead of the podsUnderTestLabels shown in the original comment.

@joeldavis84
Author

There have been multiple iterations on this problem. IIRC the issue was filed with a manual run of the test suite but I gave you the latest DCI run. It's possible that's the disconnect.

I can try using the new version of the test suite

@ramperher
Collaborator

> Looking at the DCI run that you sent me via Slack, the tnf_config.yml looks like it's still using targetPodLabels instead of the podsUnderTestLabels shown in the original comment.

Just in case: on DCI, we are moving to podsUnderTestLabels starting with tnf v4.2.3, so if you use the latest v4.3.0, you should have it in place.

@joeldavis84
Author

OK I re-ran the tests in DCI again: https://www.distributed-ci.io/jobs/22d3f865-2240-4fb5-972c-6379127b0be2/files

Still getting a skip because it's not locating the pods, though.

@sebrandon1
Member

@joeldavis84 I'm getting a "job not found" error from that link.

@joeldavis84
Author

OK, sorry, I deleted a bunch of jobs in response to partner input and may have gotten a bit overzealous.

New test run: https://www.distributed-ci.io/jobs/abc116a5-e157-478b-9c2b-387302f17fae?sort=date

From tnf_config:
podsUnderTestLabels:
- "app.kubernetes.io/name: cnfname-topology"
- "app.kubernetes.io/name: cnfname-directive"
- "cnfname-directive: cnfname-directive-payload-coredns"
- "cnfname-directive: cnfname-directive-topology"

And verified that the label is present (not sure if this is in the DCI results):

 [root@dci-server dci-openshift-app-agent]# oc get pod cnfname-directive-payload-coredns-dev-hv4gm -o yaml | yq .metadata.labels
{
  "app.kubernetes.io/name": "cnfname-directive",
  "controller-revision-hash": "6bb7fb979",
  "pod-template-generation": "1",
  "cnfname-directive": "cnfname-directive-payload-coredns-dev"
}


[root@dci-server dci-openshift-app-agent]# oc get pod cnfname-directive-payload-coredns-dev-hv4gm -o json | jq .spec.containers[0].resources 
{
  "limits": {
    "cpu": "1",
    "memory": "1Gi"
  },
  "requests": {
    "cpu": "250m",
    "memory": "250Mi"
  }
}

In the DCI results I can see "lifecycle-cpu-isolation" being skipped due to no labels:

[SKIPPED] Test skipped because there are no []*provider.Pod to test, please check under test labels

And from how CATALOG.md is worded, I would assume that both the mismatch between limits and requests and requests.cpu being in millicores would cause this to fail.
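For comparison, a resources block that would satisfy those criteria as described here (whole CPUs, requests equal to limits, i.e. Guaranteed QoS) would look roughly like this; a sketch for illustration only, not taken from the CNF:

    resources:
      limits:
        cpu: "1"        # whole CPU, no millicores
        memory: 1Gi
      requests:
        cpu: "1"        # equal to the limit, giving Guaranteed QoS
        memory: 1Gi

The pod shown above, with requests.cpu of 250m against a limit of "1", is exactly the mismatched/subdivided case that is expected to be reported as a failure rather than skipped.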

@ramperher
Collaborator

@joeldavis84, I've been checking your job, and I think what's happening is that you're not targeting the correct namespace to retrieve the pods.

If you check the tnf_config.yml file uploaded in the Files section of the DCI job, you'll see that you're only tracking the tawon namespace:

---
targetNameSpaces:
  - name: tawon

podsUnderTestLabels:
  - "app.kubernetes.io/name: tawon-topology"
  - "app.kubernetes.io/name: tawon-directive"
  - "tawon-directive: tawon-directive-payload-coredns"
  - "tawon-directive: tawon-directive-topology"
...

In the file called tawon_status.log, you can see all the pods deployed in that namespace:

NAME                                                  READY   STATUS    RESTARTS       AGE     IP             NODE        NOMINATED NODE   READINESS GATES   LABELS
nats-7d8fff94fb-6fj7b                                 1/1     Running   0              5d11h   10.129.2.92    worker-0    <none>           <none>            app=nats,pod-template-hash=7d8fff94fb
tawon-directive-payload-coredns-dev-8z69k             1/1     Running   0              2d      10.129.2.147   worker-0    <none>           <none>            app.kubernetes.io/name=tawon-directive,controller-revision-hash=6bb7fb979,pod-template-generation=1,tawon-directive=tawon-directive-payload-coredns-dev
tawon-directive-payload-coredns-dev-b6qg6             1/1     Running   0              2d      10.130.2.222   storage-2   <none>           <none>            app.kubernetes.io/name=tawon-directive,controller-revision-hash=6bb7fb979,pod-template-generation=1,tawon-directive=tawon-directive-payload-coredns-dev
tawon-directive-payload-coredns-dev-hv4gm             1/1     Running   2 (29m ago)    2d      10.131.1.110   worker-3    <none>           <none>            app.kubernetes.io/name=tawon-directive,controller-revision-hash=6bb7fb979,pod-template-generation=1,tawon-directive=tawon-directive-payload-coredns-dev
tawon-directive-payload-coredns-dev-p9cn9             1/1     Running   0              2d      10.129.4.231   storage-1   <none>           <none>            app.kubernetes.io/name=tawon-directive,controller-revision-hash=6bb7fb979,pod-template-generation=1,tawon-directive=tawon-directive-payload-coredns-dev
tawon-directive-payload-coredns-dev-pgkzp             1/1     Running   0              2d      10.128.4.109   worker-1    <none>           <none>            app.kubernetes.io/name=tawon-directive,controller-revision-hash=6bb7fb979,pod-template-generation=1,tawon-directive=tawon-directive-payload-coredns-dev
tawon-directive-payload-coredns-dev-r5btk             1/1     Running   2 (166m ago)   2d      10.128.3.201   worker-2    <none>           <none>            app.kubernetes.io/name=tawon-directive,controller-revision-hash=6bb7fb979,pod-template-generation=1,tawon-directive=tawon-directive-payload-coredns-dev
tawon-directive-payload-coredns-dev-wpbnt             1/1     Running   0              2d      10.131.2.232   storage-0   <none>           <none>            app.kubernetes.io/name=tawon-directive,controller-revision-hash=6bb7fb979,pod-template-generation=1,tawon-directive=tawon-directive-payload-coredns-dev
tawon-operator-controller-manager-6575fd7675-8pdmh    2/2     Running   0              7d7h    10.131.0.132   worker-3    <none>           <none>            control-plane=controller-manager,pod-template-hash=6575fd7675
tawon-topology-topology-aggregator-747dbb8966-ld59w   1/1     Running   0              7d6h    10.128.4.230   worker-1    <none>           <none>            app.kubernetes.io/name=tawon-topology,pod-template-hash=747dbb8966,tawon-directive=tawon-topology-topology-aggregator
tawon-topology-topology-aggregator-747dbb8966-wgpzg   1/1     Running   0              7d6h    10.131.0.138   worker-3    <none>           <none>            app.kubernetes.io/name=tawon-topology,pod-template-hash=747dbb8966,tawon-directive=tawon-topology-topology-aggregator

If you check the list, the pod you're commenting on is not there; I suppose it's in a different namespace.

So you really need to check the tnf_config variable passed to your DCI job to be sure that you're passing all the namespaces where you're deploying workloads, plus the labels for the pods and operators that tnf should retrieve. Remember you have some examples in our cnf-cert role.

For the moment, I wouldn't say it's an issue on the tnf side, because we run these tests regularly against the latest releases and can confirm they work fine in tnf v4.2.4 and v4.3.0. If you need some support here, don't hesitate to reach out!
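In other words, tnf_config.yml needs an entry under targetNameSpaces for every namespace the workload deploys into, alongside the pod labels; a sketch based on the config above (the second namespace entry is a hypothetical placeholder):

---
targetNameSpaces:
  - name: tawon
  - name: other-workload-namespace   # hypothetical: list every namespace the CNF deploys into
podsUnderTestLabels:
  - "app.kubernetes.io/name: tawon-topology"
  - "app.kubernetes.io/name: tawon-directive"
...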

@joeldavis84
Author

The pod I was looking at was just a random pod in the CNF that had that label. The pods do exist in the tawon namespace though:

[root@dci-server dci-openshift-app-agent]# oc get pods -n tawon -l app.kubernetes.io/name=tawon-directive
NAME                                        READY   STATUS    RESTARTS      AGE
tawon-directive-payload-coredns-dev-8z69k   1/1     Running   0             2d21h
tawon-directive-payload-coredns-dev-b6qg6   1/1     Running   0             2d21h
tawon-directive-payload-coredns-dev-hv4gm   1/1     Running   2 (21h ago)   2d21h
tawon-directive-payload-coredns-dev-p9cn9   1/1     Running   0             2d21h
tawon-directive-payload-coredns-dev-pgkzp   1/1     Running   0             2d21h
tawon-directive-payload-coredns-dev-r5btk   1/1     Running   2 (24h ago)   2d21h
tawon-directive-payload-coredns-dev-wpbnt   1/1     Running   0             2d21h

The particular pod no longer being in the namespace is likely just because the partner is working on the CNF and restarting or recreating pods.
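A quick way to rule the namespace question out entirely is to search for the label across all namespaces (an illustrative command using standard oc flags); per the comment above, any matching pod in a namespace not listed under targetNameSpaces won't be discovered:

oc get pods -A -l app.kubernetes.io/name=tawon-directive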

@joeldavis84
Author

Is it possible it's related to them using DaemonSets? I don't know how the tests are written, but that is another unusual thing they're doing, so I don't know whether the pods are tracked down by looking at ReplicaSets and Deployments or something similar.

@sebrandon1
Member

As long as the pods are labeled, they will be tested regardless of whether their parent is a DaemonSet. CNFs aren't supposed to use DaemonSets per the requirements docs, but that shouldn't prevent the pods themselves from being tested if they are labeled.
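One way to confirm that labeled DaemonSet pods are matched just like any others is to list them by label and print the owning controller kind; an illustrative command using standard oc/jsonpath syntax and the label from the config above:

oc get pods -n tawon -l app.kubernetes.io/name=tawon-directive \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.ownerReferences[0].kind}{"\n"}{end}'

The owner kind has no bearing on the label-based discovery; only the pod labels matter.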

@joeldavis84
Author

We seem to be running into this issue again with a different partner who is using regular Deployments. The test is being skipped when it seems like it should be failing due to subdivided CPUs and "limits" != "requests".

@sebrandon1
Member

Can you send me the claim.json from their run?

@joeldavis84
Author

This is the run I was looking at when I made this comment and it's the latest one: https://www.distributed-ci.io/jobs/80912905-b9ed-45dd-9701-ad35584d0712?sort=date

