Argo S3 Artifacts Failing in v3.6.2 #14021

Open
3 of 4 tasks
RyanDevlin opened this issue Dec 19, 2024 · 0 comments
Labels
  • area/artifacts: S3/GCP/OSS/Git/HDFS etc
  • solution/workaround: There's a workaround, might not be great, but exists
  • type/bug
  • type/regression: Regression from previous behavior (a specific type of bug)

Pre-requisites

  • I have double-checked my configuration
  • I have tested with the :latest image tag (i.e. quay.io/argoproj/workflow-controller:latest) and can confirm the issue still exists on :latest. If not, I have explained why, in detail, in my description below.
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened? What did you expect to happen?

Background

We've been using Argo Workflows in production for a while now and recently upgraded to v3.6.2 in a test environment. After the upgrade, S3 artifacts suddenly stopped working.

Our first Workflow step uploads output Artifacts back to S3, which worked fine in v3.5.11. With v3.6.2, the first step of the workflow still runs our code successfully, but the Emissary container fails with the following error:

Error (exit code 64): failed to create new S3 client: operation error STS: 
AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region; 
failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region; 
failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region; 

I've double-checked our Workflow Artifact configuration, and we do supply the region:

outputs:
      artifacts:
      - archive:
          none: {}
        name: manifests
        path: <REDACTED>/manifests
        s3:
          bucket: <REDACTED>
          endpoint: s3.amazonaws.com
          key: <REDACTED>/executionName/etcd-test-7/{{inputs.parameters.mb-id}}
          region: us-east-1
          roleARN: arn:aws:iam::<REDACTED>:role/<REDACTED>
      - archive:
          none: {}
        name: execution-plan
        path: <REDACTED>/ExecutionPlan.gz
        s3:
          bucket: <REDACTED>
          endpoint: s3.amazonaws.com
          key: <REDACTED>/executionName/etcd-test-7/{{inputs.parameters.mb-id}}/ExecutionPlan.gz
          region: us-east-1
          roleARN: arn:aws:iam::<REDACTED>:role/<REDACTED>
      - archive:
          none: {}
        name: entities
        path: <REDACTED>/Entities.gz
        s3:
          bucket: <REDACTED>
          endpoint: s3.amazonaws.com
          key: <REDACTED>/executionName/etcd-test-7/{{inputs.parameters.mb-id}}/Entities-{{inputs.parameters.mb-id}}.gz
          region: us-east-1
          roleARN: arn:aws:iam::<REDACTED>:role/<REDACTED>

Problem

I've spent time tracing this through the code. It seems that Argo first takes the S3 Artifact Config we supply in the workflow and builds an S3 Driver using our supplied region. The driver is then fed into the client options, which are passed to the STS client.

The STS client (used to assume IAM Roles, which provide authentication to AWS services) unpacks the region from these options and uses it to resolve the endpoint of the STS server.

The failure occurs during this endpoint resolution, where the STS SDK returns the Missing Region error shown above.

Solution

To test my theory that the STS client within the Emissary Executor was not correctly resolving the AWS region, I manually set the AWS_DEFAULT_REGION env var within the Emissary container by leveraging the workflow-controller-configmap:

-bash-4.2$ kubectl -n <REDACTED> get cm workflow-controller-configmap -o yaml
apiVersion: v1
data:
  executor: |
    env:
    - name: AWS_DEFAULT_REGION
      value: us-east-1

This immediately fixed the issue, and my artifacts started appearing in S3. It works because the AWS SDK falls back on the standard AWS env vars when the regular configuration fails to produce the metadata the SDK needs to determine the endpoint and credentials of an AWS service.
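
The fallback that makes this workaround effective can be sketched as follows. This is a simplified model, not the SDK's implementation: `resolveRegion` is a hypothetical helper, and the lookup order (explicit region, then `AWS_REGION`, then `AWS_DEFAULT_REGION`) reflects my understanding of how the AWS SDK for Go v2 reads its environment, so treat it as an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// resolveRegion returns the first non-empty region it finds: the explicitly
// configured one, then AWS_REGION, then AWS_DEFAULT_REGION. It returns an
// empty string when none of them is set.
func resolveRegion(explicit string, getenv func(string) string) string {
	if explicit != "" {
		return explicit
	}
	for _, name := range []string{"AWS_REGION", "AWS_DEFAULT_REGION"} {
		if v := getenv(name); v != "" {
			return v
		}
	}
	return ""
}

func main() {
	// With no explicit region, the result depends only on the environment,
	// which is what the executor env entry in the ConfigMap provides.
	fmt.Printf("resolved region: %q\n", resolveRegion("", os.Getenv))
}
```

Under this model, setting `AWS_DEFAULT_REGION` on the executor only masks the problem: every artifact is treated as living in that one region, regardless of the `region` field in the Workflow.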

Version(s)

v3.6.2

Paste a minimal workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

# This is an example of a workflow producing an S3 output artifact which is saved to a hard-wired
# location. This is useful for workflows which want to publish results to a well known or
# pre-determined location.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: output-artifact-s3-
spec:
  entrypoint: hello-world-to-file
  templates:
  - name: hello-world-to-file
    container:
      image: busybox
      command: [sh, -c]
      args: ["echo hello world | tee /tmp/hello_world.txt"]
    outputs:
      artifacts:
      - name: message
        path: /tmp
        # It is possible to disable tar.gz archiving by setting the archive strategy to 'none'
        # Disabling archiving has the following limitations on S3: symbolic links will not be
        # uploaded, as S3 does not support the concept/file mode of symlinks.
        # archive:
        #   none: {}

        s3:
          # Use the corresponding endpoint depending on your S3 provider:
          #   AWS: s3.amazonaws.com
          #   GCS: storage.googleapis.com
          #   Minio: my-minio-endpoint.default:9000
          endpoint: s3.amazonaws.com
          bucket: my-bucket
          # Specify the bucket region. Note that if you want Argo to figure out this automatically,
          # you can set additional statement policy that allows `s3:GetBucketLocation` action.
          # For details, check out: https://argo-workflows.readthedocs.io/en/latest/configure-artifact-repository/#configuring-aws-s3
          region: us-west-2

          # NOTE: by default, output artifacts are automatically tarred and gzipped before saving.
          # As a best practice, .tgz or .tar.gz should be suffixed into the key name so the
          # resulting object has an accurate file extension and mime-type. If archive is set to
          # 'none', then preserve the appropriate file extension for the key name
          key: path/in/bucket/hello_world.txt.tgz

          # accessKeySecret and secretKeySecret are secret selectors. It references the k8s secret
          # named 'my-s3-credentials'. This secret is expected to have the keys 'accessKey'
          # and 'secretKey', containing the base64 encoded credentials to the bucket.
          accessKeySecret:
            name: my-s3-credentials
            key: accessKey
          secretKeySecret:
            name: my-s3-credentials
            key: secretKey

Logs from the workflow controller

These had no useful information. They only showed the Workflow nodes failing with the error:

Error (exit code 64): failed to create new S3 client: operation error STS: 
AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region; 
failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region; 
failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region; 

If needed, I can go back and recreate the failure to get these logs, but I don't think they would be of much use.



Logs from your workflow's wait container

```text
time="2024-12-19T04:16:37.773Z" level=info msg="Staging artifact: manifests"
time="2024-12-19T04:16:37.773Z" level=info msg="Staging /<REDACTED>/manifests from mirrored volume mount /mainctrfs/<REDACTED>/manifests"
time="2024-12-19T04:16:37.773Z" level=info msg="No compression strategy needed. Staging skipped"
time="2024-12-19T04:16:37.773Z" level=info msg="S3 Save path: /mainctrfs/<REDACTED>/manifests, key: <REDACTED>/executionName/etcd-test-3/<REDACTED>"
time="2024-12-19T04:16:37.779Z" level=info msg="Creating minio client using assumed-role credentials" roleArn="arn:aws:iam::<REDACTED>:role/<REDACTED>"
time="2024-12-19T04:16:37.784Z" level=warning msg="Non-transient error: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region"
time="2024-12-19T04:16:37.784Z" level=info msg="Save artifact" artifactName=manifests duration=11.05296ms error="failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region" key=<REDACTED>/executionName/etcd-test-3/<REDACTED>
time="2024-12-19T04:16:37.784Z" level=info msg="Staging artifact: execution-plan"
time="2024-12-19T04:16:37.784Z" level=info msg="Staging /<REDACTED>/<REDACTED>/ExecutionPlan.gz from mirrored volume mount /<REDACTED>/<REDACTED>/ExecutionPlan.gz"
time="2024-12-19T04:16:37.784Z" level=info msg="No compression strategy needed. Staging skipped"
time="2024-12-19T04:16:37.785Z" level=info msg="S3 Save path: /<REDACTED>/<REDACTED>/ExecutionPlan.gz, key: <REDACTED>/executionName/etcd-test-3/<REDACTED>/ExecutionPlan.gz"
time="2024-12-19T04:16:37.787Z" level=info msg="Creating minio client using assumed-role credentials" roleArn="arn:aws:iam::<REDACTED>:role/<REDACTED>"
time="2024-12-19T04:16:37.790Z" level=warning msg="Non-transient error: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region"
time="2024-12-19T04:16:37.790Z" level=info msg="Save artifact" artifactName=execution-plan duration=5.88166ms error="failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region" key=<REDACTED>/executionName/etcd-test-3/<REDACTED>/ExecutionPlan.gz
time="2024-12-19T04:16:37.791Z" level=info msg="Staging artifact: entities"
time="2024-12-19T04:16:37.791Z" level=info msg="Staging /<REDACTED>/<REDACTED>/Entities.gz from mirrored volume mount /<REDACTED>/<REDACTED>/Entities.gz"
time="2024-12-19T04:16:37.791Z" level=info msg="No compression strategy needed. Staging skipped"
time="2024-12-19T04:16:37.791Z" level=info msg="S3 Save path: /<REDACTED>/<REDACTED>/Entities.gz, key: <REDACTED>/executionName/etcd-test-3/<REDACTED>/Entities-<REDACTED>.gz"
time="2024-12-19T04:16:37.793Z" level=info msg="Creating minio client using assumed-role credentials" roleArn="arn:aws:iam::<REDACTED>:role/<REDACTED>"
time="2024-12-19T04:16:37.797Z" level=warning msg="Non-transient error: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region"
time="2024-12-19T04:16:37.797Z" level=info msg="Save artifact" artifactName=entities duration=6.212947ms error="failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region" key=<REDACTED>/etcd-test-3/<REDACTED>/Entities-<REDACTED>.gz
time="2024-12-19T04:16:37.797Z" level=error msg="executor error: failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region; failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region; failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region; "
time="2024-12-19T04:16:37.816Z" level=info msg="Alloc=22995 TotalAlloc=29795 Sys=36501 NumGC=3 Goroutines=8"
Error: failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region; failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region; failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region; 
Usage:
  argoexec wait [flags]

Flags:
  -h, --help   help for wait

Global Flags:
      --as string                      Username to impersonate for the operation
      --as-group stringArray           Group to impersonate for the operation, this flag can be repeated to specify multiple groups.
      --as-uid string                  UID to impersonate for the operation
      --certificate-authority string   Path to a cert file for the certificate authority
      --client-certificate string      Path to a client certificate file for TLS
      --client-key string              Path to a client key file for TLS
      --cluster string                 The name of the kubeconfig cluster to use
      --context string                 The name of the kubeconfig context to use
      --disable-compression            If true, opt-out of response compression for all requests to the server
      --gloglevel int                  Set the glog logging level
      --insecure-skip-tls-verify       If true, the server's certificate will not be checked for validity. This will make your HTTPS connections insecure
      --kubeconfig string              Path to a kube config. Only required if out-of-cluster
      --log-format string              The formatter to use for logs. One of: text|json (default "text")
      --loglevel string                Set the logging level. One of: debug|info|warn|error (default "info")
  -n, --namespace string               If present, the namespace scope for this CLI request
      --password string                Password for basic authentication to the API server
      --proxy-url string               If provided, this URL will be used to connect via proxy
      --request-timeout string         The length of time to wait before giving up on a single server request. Non-zero values should contain a corresponding time unit (e.g. 1s, 2m, 3h). A value of zero means don't timeout requests. (default "0")
      --server string                  The address and port of the Kubernetes API server
      --tls-server-name string         If provided, this name will be used to validate server certificate. If this is not provided, hostname used to contact the server is used.
      --token string                   Bearer token for authentication to the API server
      --user string                    The name of the kubeconfig user to use
      --username string                Username for basic authentication to the API server

failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region; failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region; failed to create new S3 client: operation error STS: AssumeRole, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region;
```

@blkperl added the type/regression, solution/workaround, and area/artifacts labels on Dec 20, 2024.