lifecycle-cpu-isolation and lifecycle-affinity-required-pods don't appear to be matching pods #1119

Open
joeldavis84 opened this issue May 19, 2023 · 15 comments


@joeldavis84

Relevant portion of the TNF config (CNF name redacted, can provide whatever is needed side channel):

podsUnderTestLabels:
  - "app.kubernetes.io/name:cnfname-topology"
  - "app.kubernetes.io/name:cnfname-directive"
  - "cnfname-directive:cnfname-directive-payload-coredns"
  - "cnfname-directive:cnfname-directive-topology"
  - "app:nat"
  - "app:cnfname-operator"

And there are pods with requests/limits configured in a way that seems like it would cause the test to fail (mismatched, not using whole CPUs, etc.):

[root@dci-server cnf-certification-test]# oc get pods -l cnfname-directive=cnfname-directive-payload-coredns
NAME                                    READY   STATUS    RESTARTS      AGE
cnfname-directive-payload-coredns-2fsgz   1/1     Running   0             21h
cnfname-directive-payload-coredns-5g6xg   1/1     Running   0             21h
cnfname-directive-payload-coredns-5p2tf   1/1     Running   0             21h
cnfname-directive-payload-coredns-5tj9d   1/1     Running   1 (17h ago)   21h
cnfname-directive-payload-coredns-m4sbd   1/1     Running   0             21h
cnfname-directive-payload-coredns-vz489   1/1     Running   0             21h
cnfname-directive-payload-coredns-znkhh   1/1     Running   0             21h

[root@dci-server cnf-certification-test]# oc get pods cnfname-directive-payload-coredns-znkhh -o yaml | grep -A6 resources:
    resources:
      limits:
        cpu: "1"
        memory: 1Gi
      requests:
        cpu: 250m
        memory: 250Mi

but they aren't found when the test suite runs:

          {
            "-classname": "CNF Certification Test Suite",
            "-name": "[It] lifecycle lifecycle-cpu-isolation [common, telco, lifecycle-cpu-isolation, lifecycle]",
            "-status": "skipped",
            "-time": "0.000523242",
            "skipped": {
              "-message": "skipped - Test skipped because there are no []*provider.Pod to test, please check under test labels"
            },
            "system-err": "\u003e Enter [BeforeEach] lifecycle - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:55 @ 05/19/23 17:19:27.076\n\u003c Exit [BeforeEach] lifecycle - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:55 @ 05/19/23 17:19:27.076 (0s)\n\u003e Enter [It] lifecycle-cpu-isolation - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:167 @ 05/19/23 17:19:27.076\n[SKIPPED] Test skipped because there are no []*provider.Pod to test, please check under test labels\nIn [It] at: /usr/tnf/tnf-src/pkg/testhelper/testhelper.go:130 @ 05/19/23 17:19:27.076\n\u003c Exit [It] lifecycle-cpu-isolation - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:167 @ 05/19/23 17:19:27.076 (1ms)\n\u003e Enter [ReportAfterEach] lifecycle - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:58 @ 05/19/23 17:19:27.076\n\u003c Exit [ReportAfterEach] lifecycle - /usr/tnf/tnf-src/cnf-certification-test/lifecycle/suite.go:58 @ 05/19/23 17:19:27.076 (0s)"
          },

The above is for lifecycle-cpu-isolation, but it's also relevant for lifecycle-affinity-required-pods: these same pods have affinity rules but not the affinity-related labels or annotations mentioned in the CATALOG.md description, which seems like it should cause a failure rather than a skip.
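As a quick cross-check for the affinity case, the affinity rules can be dumped straight from one of the pods listed above; an illustrative command (not from the original run), reusing the pod from the resources check:

oc get pod cnfname-directive-payload-coredns-znkhh -o json | jq .spec.affinity

If that returns rules while the pod carries none of the affinity-related labels or annotations described in CATALOG.md, the expectation here is a failure rather than a skip.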

@sebrandon1
Member

Has this been figured out? Just going through the issue list and saw this hasn't been responded to.

@joeldavis84
Author

I haven't heard back about it, and as of the last test run in DCI this still seems to be skipped, even though AFAIK they're currently only working through failures.

@sebrandon1
Member

Looking at the DCI run that you sent me via Slack, the tnf_config.yml looks like it's still using targetPodLabels instead of the podsUnderTestLabels shown in the original comment.

@joeldavis84
Author

There have been multiple iterations on this problem. IIRC the issue was filed with a manual run of the test suite but I gave you the latest DCI run. It's possible that's the disconnect.

I can try using the new version of the test suite

@ramperher
Collaborator

> Looking at the DCI run that you sent me via Slack, the tnf_config.yml looks like it's still using targetPodLabels instead of the podsUnderTestLabels shown in the original comment.

Just in case: on DCI, we are moving to podsUnderTestLabels starting with tnf v4.2.3, so if you use the latest v4.3.0, you should have it in place.

@joeldavis84
Author

OK I re-ran the tests in DCI again: https://www.distributed-ci.io/jobs/22d3f865-2240-4fb5-972c-6379127b0be2/files

Still getting a skip because it's not locating the pods, though.

@sebrandon1
Member

@joeldavis84 I'm getting a "job not found" error from that link.

@joeldavis84
Author

OK, sorry, I deleted a bunch of jobs in response to partner input and may have gotten a bit overzealous.

New test run: https://www.distributed-ci.io/jobs/abc116a5-e157-478b-9c2b-387302f17fae?sort=date

From tnf_config:
podsUnderTestLabels:
- "app.kubernetes.io/name: cnfname-topology"
- "app.kubernetes.io/name: cnfname-directive"
- "cnfname-directive: cnfname-directive-payload-coredns"
- "cnfname-directive: cnfname-directive-topology"

And verified that the label is present (not sure if this is in the DCI results):

 [root@dci-server dci-openshift-app-agent]# oc get pod cnfname-directive-payload-coredns-dev-hv4gm -o yaml | yq .metadata.labels
{
  "app.kubernetes.io/name": "cnfname-directive",
  "controller-revision-hash": "6bb7fb979",
  "pod-template-generation": "1",
  "cnfname-directive": "cnfname-directive-payload-coredns-dev"
}


[root@dci-server dci-openshift-app-agent]# oc get pod cnfname-directive-payload-coredns-dev-hv4gm -o json | jq .spec.containers[0].resources 
{
  "limits": {
    "cpu": "1",
    "memory": "1Gi"
  },
  "requests": {
    "cpu": "250m",
    "memory": "250Mi"
  }
}

In the DCI results I can see "lifecycle-cpu-isolation" being skipped due to no labels:

[SKIPPED] Test skipped because there are no []*provider.Pod to test, please check under test labels

And from how CATALOG.md is worded, I would assume that both the mismatch between limits and requests and requests.cpu being in millicores would cause this to fail.
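For comparison, a resources block that would satisfy those criteria as described here (whole CPUs, requests equal to limits, i.e. Guaranteed QoS) would look roughly like this; a sketch for illustration only, not taken from the CNF:

    resources:
      limits:
        cpu: "1"        # whole CPU, no millicores
        memory: 1Gi
      requests:
        cpu: "1"        # equal to the limit, giving Guaranteed QoS
        memory: 1Gi

The pod shown above, with requests.cpu of 250m against a limit of "1", is exactly the mismatched/subdivided case that is expected to be reported as a failure rather than skipped.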

@ramperher
Collaborator

@joeldavis84, I've been checking your job, and I think what's happening is that you're not targeting the correct namespace to retrieve the pods.

If you check the tnf_config.yml file uploaded in the Files section of the DCI job, you'll see that you're only tracking the tawon namespace:

---
targetNameSpaces:
  - name: tawon

podsUnderTestLabels:
  - "app.kubernetes.io/name: tawon-topology"
  - "app.kubernetes.io/name: tawon-directive"
  - "tawon-directive: tawon-directive-payload-coredns"
  - "tawon-directive: tawon-directive-topology"
...

In the file called tawon_status.log, you can see all the pods deployed in that namespace:

NAME                                                  READY   STATUS    RESTARTS       AGE     IP             NODE        NOMINATED NODE   READINESS GATES   LABELS
nats-7d8fff94fb-6fj7b                                 1/1     Running   0              5d11h   10.129.2.92    worker-0    <none>           <none>            app=nats,pod-template-hash=7d8fff94fb
tawon-directive-payload-coredns-dev-8z69k             1/1     Running   0              2d      10.129.2.147   worker-0    <none>           <none>            app.kubernetes.io/name=tawon-directive,controller-revision-hash=6bb7fb979,pod-template-generation=1,tawon-directive=tawon-directive-payload-coredns-dev
tawon-directive-payload-coredns-dev-b6qg6             1/1     Running   0              2d      10.130.2.222   storage-2   <none>           <none>            app.kubernetes.io/name=tawon-directive,controller-revision-hash=6bb7fb979,pod-template-generation=1,tawon-directive=tawon-directive-payload-coredns-dev
tawon-directive-payload-coredns-dev-hv4gm             1/1     Running   2 (29m ago)    2d      10.131.1.110   worker-3    <none>           <none>            app.kubernetes.io/name=tawon-directive,controller-revision-hash=6bb7fb979,pod-template-generation=1,tawon-directive=tawon-directive-payload-coredns-dev
tawon-directive-payload-coredns-dev-p9cn9             1/1     Running   0              2d      10.129.4.231   storage-1   <none>           <none>            app.kubernetes.io/name=tawon-directive,controller-revision-hash=6bb7fb979,pod-template-generation=1,tawon-directive=tawon-directive-payload-coredns-dev
tawon-directive-payload-coredns-dev-pgkzp             1/1     Running   0              2d      10.128.4.109   worker-1    <none>           <none>            app.kubernetes.io/name=tawon-directive,controller-revision-hash=6bb7fb979,pod-template-generation=1,tawon-directive=tawon-directive-payload-coredns-dev
tawon-directive-payload-coredns-dev-r5btk             1/1     Running   2 (166m ago)   2d      10.128.3.201   worker-2    <none>           <none>            app.kubernetes.io/name=tawon-directive,controller-revision-hash=6bb7fb979,pod-template-generation=1,tawon-directive=tawon-directive-payload-coredns-dev
tawon-directive-payload-coredns-dev-wpbnt             1/1     Running   0              2d      10.131.2.232   storage-0   <none>           <none>            app.kubernetes.io/name=tawon-directive,controller-revision-hash=6bb7fb979,pod-template-generation=1,tawon-directive=tawon-directive-payload-coredns-dev
tawon-operator-controller-manager-6575fd7675-8pdmh    2/2     Running   0              7d7h    10.131.0.132   worker-3    <none>           <none>            control-plane=controller-manager,pod-template-hash=6575fd7675
tawon-topology-topology-aggregator-747dbb8966-ld59w   1/1     Running   0              7d6h    10.128.4.230   worker-1    <none>           <none>            app.kubernetes.io/name=tawon-topology,pod-template-hash=747dbb8966,tawon-directive=tawon-topology-topology-aggregator
tawon-topology-topology-aggregator-747dbb8966-wgpzg   1/1     Running   0              7d6h    10.131.0.138   worker-3    <none>           <none>            app.kubernetes.io/name=tawon-topology,pod-template-hash=747dbb8966,tawon-directive=tawon-topology-topology-aggregator

If you check the list, the pod you're commenting on is not there; I suppose it's in a different namespace.

So you really need to check the tnf_config variable passed to your DCI job to be sure that you're passing all the namespaces where you're deploying workloads, plus the labels for the pods and operators that tnf should retrieve. Remember you have some examples in our cnf-cert role.

For the moment, I wouldn't say it's an issue on the tnf side, because we run these tests regularly against the latest releases and can confirm they work fine in tnf v4.2.4 and v4.3.0. If you need some support here, don't hesitate to reach out!
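In other words, tnf_config.yml needs an entry under targetNameSpaces for every namespace the workload deploys into, alongside the pod labels; a sketch based on the config above (the second namespace entry is a hypothetical placeholder):

---
targetNameSpaces:
  - name: tawon
  - name: other-workload-namespace   # hypothetical: list every namespace the CNF deploys into
podsUnderTestLabels:
  - "app.kubernetes.io/name: tawon-topology"
  - "app.kubernetes.io/name: tawon-directive"
...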

@joeldavis84
Author

The pod I was looking at was just a random pod in the CNF that had that label. The pods do exist in the tawon namespace though:

[root@dci-server dci-openshift-app-agent]# oc get pods -n tawon -l app.kubernetes.io/name=tawon-directive
NAME                                        READY   STATUS    RESTARTS      AGE
tawon-directive-payload-coredns-dev-8z69k   1/1     Running   0             2d21h
tawon-directive-payload-coredns-dev-b6qg6   1/1     Running   0             2d21h
tawon-directive-payload-coredns-dev-hv4gm   1/1     Running   2 (21h ago)   2d21h
tawon-directive-payload-coredns-dev-p9cn9   1/1     Running   0             2d21h
tawon-directive-payload-coredns-dev-pgkzp   1/1     Running   0             2d21h
tawon-directive-payload-coredns-dev-r5btk   1/1     Running   2 (24h ago)   2d21h
tawon-directive-payload-coredns-dev-wpbnt   1/1     Running   0             2d21h

The particular pod no longer being in the namespace is likely just because the partner is working on the CNF and restarting or recreating pods.
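A quick way to rule the namespace question out entirely is to search for the label across all namespaces (an illustrative command using standard oc flags); per the comment above, any matching pod in a namespace not listed under targetNameSpaces won't be discovered:

oc get pods -A -l app.kubernetes.io/name=tawon-directive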

@joeldavis84
Author

Is it possible it's related to them using DaemonSets? I don't know how the tests are written, but that is another unusual thing they're doing, so I don't know whether the pods are tracked down by looking at ReplicaSets and Deployments or something similar.

@sebrandon1
Member

As long as the pods are labeled, they will be tested regardless of whether their parent is a DaemonSet. CNFs aren't supposed to use DaemonSets per the requirements docs, but that shouldn't prevent the pods themselves from being tested if they are labeled.
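One way to confirm that labeled DaemonSet pods are matched just like any others is to list them by label and print the owning controller kind; an illustrative command using standard oc/jsonpath syntax and the label from the config above:

oc get pods -n tawon -l app.kubernetes.io/name=tawon-directive \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.ownerReferences[0].kind}{"\n"}{end}'

The owner kind has no bearing on the label-based discovery; only the pod labels matter.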

@joeldavis84
Author

We seem to be running into this issue again with a different partner who is using regular Deployments. The test is being skipped when it seems like it should be failing due to subdivided CPUs and "limits" != "requests".

@sebrandon1
Member

Can you send me the claim.json from their run?

@joeldavis84
Author

This is the run I was looking at when I made this comment and it's the latest one: https://www.distributed-ci.io/jobs/80912905-b9ed-45dd-9701-ad35584d0712?sort=date

