Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Constant reconciliation due to changing error message on CRD validation #1456

Open
m1kola opened this issue Nov 13, 2024 · 3 comments
Open
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@m1kola
Copy link
Member

m1kola commented Nov 13, 2024

Steps to reproduce

  1. Create pre-requisites such as RBAC and service account. This can be used as a reference.
  2. Deploy argocd-operator of version 0.5.0 from operatorhubio catalog.
    apiVersion: olm.operatorframework.io/v1
    kind: ClusterExtension
    metadata:
      name: argocd
    spec:
      namespace: argocd
      serviceAccount:
        name: argocd-installer
      source:
        sourceType: Catalog
        catalog:
          packageName: argocd-operator
          version: 0.5.0
  3. Upgrade to 0.6.0
    kubectl patch clusterextension argocd --type='merge' -p '{"spec": {"source": {"catalog": {"version": "0.6.0"}}}}'
  4. Watch for changes:
    kubectl get clusterextension argocd -o yaml -w

Result

Error message seems to be ordered differently on each reconciliation which probably results in useless reconciliation runs.

E.g. I see the condition alternating few variations such as:

  - lastTransitionTime: "2024-11-13T15:48:14Z"
    message: |-
      validating upgrade for CRD "argocds.argoproj.io" failed: CustomResourceDefinition argocds.argoproj.io failed upgrade safety validation. "ChangeValidator" validation failed: version "v1alpha1", field "^.status.dex" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.status.redis" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.status.phase" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.status.notificationsController" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.status.server" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.status.repo" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.status.ssoConfig" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.status.applicationController" has unknown change, refusing to determine that change is safe
      validating upgrade for CRD "applications.argoproj.io" failed: CustomResourceDefinition applications.argoproj.io failed upgrade safety validation. "ChangeValidator" validation failed: version "v1alpha1", field "^.status.history" has unknown change, refusing to determine that change is safe
      validating upgrade for CRD "applicationsets.argoproj.io" failed: CustomResourceDefinition applicationsets.argoproj.io failed upgrade safety validation. "ChangeValidator" validation failed: version "v1alpha1", field "^.spec.generators" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.spec.generators[*].merge.generators" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.spec.generators[*].matrix.generators" has unknown change, refusing to determine that change is safe for resolved bundle "argocd-operator.v0.6.0" with version "0.6.0"
    observedGeneration: 4
    reason: Retrying
    status: "True"
    type: Progressing
  - lastTransitionTime: "2024-11-13T15:48:14Z"
    message: |-
      validating upgrade for CRD "argocds.argoproj.io" failed: CustomResourceDefinition argocds.argoproj.io failed upgrade safety validation. "ChangeValidator" validation failed: version "v1alpha1", field "^.status.dex" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.status.ssoConfig" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.status.applicationController" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.status.server" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.status.repo" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.status.phase" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.status.notificationsController" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.status.redis" has unknown change, refusing to determine that change is safe
      validating upgrade for CRD "applications.argoproj.io" failed: CustomResourceDefinition applications.argoproj.io failed upgrade safety validation. "ChangeValidator" validation failed: version "v1alpha1", field "^.status.history" has unknown change, refusing to determine that change is safe
      validating upgrade for CRD "applicationsets.argoproj.io" failed: CustomResourceDefinition applicationsets.argoproj.io failed upgrade safety validation. "ChangeValidator" validation failed: version "v1alpha1", field "^.spec.generators[*].matrix.generators" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.spec.generators" has unknown change, refusing to determine that change is safe
      version "v1alpha1", field "^.spec.generators[*].merge.generators" has unknown change, refusing to determine that change is safe for resolved bundle "argocd-operator.v0.6.0" with version "0.6.0"
    observedGeneration: 4
    reason: Retrying
    status: "True"
    type: Progressing

Expected result

@m1kola m1kola added the kind/bug Categorizes issue or PR as related to a bug. label Nov 13, 2024
@azych
Copy link
Contributor

azych commented Jan 15, 2025

I can look into it

@azych
Copy link
Contributor

azych commented Jan 16, 2025

This is what I have been able to find so far, using the provided argocd clusterextension example and reproduction steps:

Random order

The main issue with random ordering comes directly from the way the github.com/carvel-dev/kapp's ChangeValidator processes and returns its data. For context, this validator is what we use inside our only currently used preflight validator (ie. crdupgradesafety) - see: https://github.com/operator-framework/operator-controller/blob/main/internal/rukpak/preflights/crdupgradesafety/crdupgradesafety.go#L57)

As you can see here: https://github.com/carvel-dev/kapp/blob/develop/pkg/kapp/crdupgradesafety/change_validator.go#L452 ChangeValidator iterates on a map (diffs) when calling underlying validation logic. This means there is no guaranteed order in which those validations happen and no guaranteed order in which any potential errors are added to a local errors list (https://github.com/carvel-dev/kapp/blob/develop/pkg/kapp/crdupgradesafety/change_validator.go#L466). This local errors list is then finally combined into a single multi-line error returned to us/callers.

This means that when validating the example argocd manifests for 0.6.0 version, we get three errors as output, with each one being a multi-line string that can have its lines returned in any order, effectively making it a "different error" with each variation:

  1. "argocds.argoproj.io" object validation fails with:
CustomResourceDefinition argocds.argoproj.io failed upgrade safety validation. "ChangeValidator" validation failed:

And returns the following error (all of the below is a single error)

version "v1alpha1", field "^.status.dex" has unknown change, refusing to determine that change is safe
	version "v1alpha1", field "^.status.repo" has unknown change, refusing to determine that change is safe
	version "v1alpha1", field "^.status.ssoConfig" has unknown change, refusing to determine that change is safe
	version "v1alpha1", field "^.status.redis" has unknown change, refusing to determine that change is safe
	version "v1alpha1", field "^.status.notificationsController" has unknown change, refusing to determine that change is safe
	version "v1alpha1", field "^.status.phase" has unknown change, refusing to determine that change is safe
	version "v1alpha1", field "^.status.server" has unknown change, refusing to determine that change is safe
	version "v1alpha1", field "^.status.applicationController" has unknown change, refusing to determine that change is safe
  1. "applicationsets.argoproj.io" object validation also fails with a "ChangeValidator" error (again, single error):
version "v1alpha1", field "^.spec.generators[*].matrix.generators" has unknown change, refusing to determine that change is safe
	version "v1alpha1", field "^.spec.generators" has unknown change, refusing to determine that change is safe
	version "v1alpha1", field "^.spec.generators[*].merge.generators" has unknown change, refusing to determine that change is safe
  1. "applications.argoproj.io" object validation also fails with a "ChangeValidator" error:
version "v1alpha1", field "^.status.history" has unknown change, refusing to determine that change is safe"

Reconciliations

The above random order indeed causes unnecessary attempts to reconcile the status of failed clusterextensions, by forcing the call to r.Client.Status().Update (https://github.com/operator-framework/operator-controller/blob/main/internal/controllers/clusterextension_controller.go#L132) at every reconciliation, which definitely is an issue.
I was able to verify this by temporarily overriding the Apply() error to be fixed on return.

Fix

I'm currently looking into how best to approach fixing it. Ensuring order directly upstream seems like the best idea imho, otherwise we would have split - (re)order - rejoin messages we're getting with an assumption of a specific separator being used (newline).

@azych
Copy link
Contributor

azych commented Jan 16, 2025

Update on the fix: I created and linked to upstream PR.

Once this is merged, besides bumping kapp's version in our dependencies, we would have to ensure that we order Validate() calls errors by manifests here: https://github.com/operator-framework/operator-controller/blob/main/internal/rukpak/preflights/crdupgradesafety/crdupgradesafety.go#L115 because that cannot be done for us upstream.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

No branches or pull requests

2 participants