-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Data Repository Associations for Lusture 2.12 or newer filesystems(e.g. PERSISTENT_2
deployment type)
#368
base: master
Are you sure you want to change the base?
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: everpeace The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
PERSISTENT_2
deployment type filesystemsPERSISTENT_2
deployment type)
d7260f4
to
c268728
Compare
/retest pull-aws-fsx-csi-driver-e2e |
@everpeace: The
Use In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/test pull-aws-fsx-csi-driver-e2e |
c268728
to
bf8d4ac
Compare
pkg/cloud/cloud.go
Outdated
@@ -62,6 +64,10 @@ var ( | |||
// disks are found with the same volume name. | |||
ErrMultiFileSystems = errors.New("Multiple filesystems with same ID") | |||
|
|||
// ErrMultiAssociations is an error that is returned when multiple | |||
// associations are found with the same volume name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, this would be if there are multiple DRAs with the same association id, not multiple associations with the same volume.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. fixed in bba7484.
bf8d4ac
to
04efd39
Compare
96129a4
to
3120d26
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jacobwolfaws Thank you for the quick review. I addressed your feedbacks. PTAL 🙇
@@ -65,6 +65,7 @@ controller: | |||
- effect: NoExecute | |||
operator: Exists | |||
tolerationSeconds: 300 | |||
provisionerTimeout: 5m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we make the default provisioner timeout longer in helm chart? It is because it often takes more time to prepare an FSx filesystem when it has data repository associations.
Single FSx for Lusture fielsystem can have up to 8 data repository associations.
In my experience, it usually takes around 7-10 minutes to make single data repository associations available even for empty S3 bucket.
Moreover, setting up data repository associations on the specific filesystem looks like sequential.
So, I think 90 min = 10x8 min (data repository associations) + 5 min (FSx filesystem) + <buffer>
would be safe because the current CreateVolume
operation is synchronous operation and not be safe when timeout happened.
What do you think??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think keeping the default timer the same + clearly documenting the need to change the timeout if using DRAs would be the correct move. This ensures consistent behavior for users who aren't using DRAs. Extending it is a one way door (because reducing the timeout would break compatibility for users who are using a large number of DRAs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Extending it is a one way door (because reducing the timeout would break compatibility for users who are using a large number of DRAs)
It makes sense.
the default timer the same + clearly documenting the need to change the timeout if using DRAs would be the correct move
OK. let me add the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the information for users, it seems like users using DRAs will still be fine in most cases:
https://github.com/kubernetes-csi/external-provisioner?tab=readme-ov-file
https://github.com/kubernetes-csi/external-provisioner?tab=readme-ov-file#csi-error-and-timeout-handling
The CreateVolume will timeout and other ones will be made with an exponential backoff. It's only in the case of a large number of DRAs where this will be an issue.
/test pull-aws-fsx-csi-driver-e2e |
063cd00
to
4d05480
Compare
…sing Data Repository Associasions
@@ -0,0 +1 @@ | |||
CONTROLLER_PROVISIONER_TIMEOUT=5m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the value of creating a separate file for this vs. putting it in the values.yaml:
https://github.com/kubernetes-sigs/aws-fsx-csi-driver/blob/master/charts/aws-fsx-csi-driver/values.yaml#L42-L67
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is for manifests only for kustomize. values.yaml
is dedicated to helm chart. I understand this driver supports both kustomize and helm.
In kustomize, injecting parameter in building manifest needs a bit hack. This env file is needed for kustomize users to change timeout value. I also updated install.md
as below:
https://github.com/everpeace/aws-fsx-csi-driver/blob/suppor-dra/docs/install.md#deploy-driver
# To set CSI controller's provisioner timeout,
# Please follow the instruction
$ cd $(mktemp -d)
$ kustomize init
$ kustomize edit add resource "github.com/kubernetes-sigs/aws-fsx-csi-driver/deploy/kubernetes/overlays/stable/?ref=release-1.1"
$ kustomize edit add configmap fsx-csi-controller --from-literal=CONTROLLER_PROVISIONER_TIMEOUT=30m --behavior=merge
$ kubectl apply -k .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should avoid hacks when possible and this seems like an avoidable instance. If users want to configure their kustomize templates, they can download them, configure them, and deploy them freely. We should follow precedent in terms of implementation, which is to put it in the values.yaml.
@@ -65,6 +65,7 @@ controller: | |||
- effect: NoExecute | |||
operator: Exists | |||
tolerationSeconds: 300 | |||
provisionerTimeout: 5m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the information for users, it seems like users using DRAs will still be fine in most cases:
https://github.com/kubernetes-csi/external-provisioner?tab=readme-ov-file
https://github.com/kubernetes-csi/external-provisioner?tab=readme-ov-file#csi-error-and-timeout-handling
The CreateVolume will timeout and other ones will be made with an exponential backoff. It's only in the case of a large number of DRAs where this will be an issue.
@@ -0,0 +1 @@ | |||
CONTROLLER_PROVISIONER_TIMEOUT=5m |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should avoid hacks when possible and this seems like an avoidable instance. If users want to configure their kustomize templates, they can download them, configure them, and deploy them freely. We should follow precedent in terms of implementation, which is to put it in the values.yaml.
// target file system values | ||
PollCheckTimeout = 10 * time.Minute | ||
PollCheckTimeout = 15 * time.Minute |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If PollCheckTimeout < provisionerTimeout, the provisionerTimeout will always kill the CreateVolume call before it the PollCheckTimeout is hit. I don't think incrementing this should make a difference
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
This has been up for a while, sorry. Going to freeze this PR for now, seems like there are some open comments and design decisions to be made here |
/lifecycle frozen |
@jacobwolfaws: The In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Seems like I can't freeze a PR :( /remove-lifecycle rotten |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
Is this a bug fix or adding new feature?
new feature
fixes #367
What is this PR about? / Why do we need it?
This PR supports Data Repository Association(API reference) for Lusture 2.12 or newer filesystems(e.g.
PERSISTENT_2
deployment type) like below:What testing is done?