k8s.io disaster recovery plan #70

Open
amy opened this issue Jul 10, 2019 · 38 comments
Labels
lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
priority/critical-urgent: Highest priority. Must be actively worked on as someone's top priority right now.
sig/k8s-infra: Categorizes an issue or PR as relevant to SIG K8s Infra.
sig/release: Categorizes an issue or PR as relevant to SIG Release.

Comments

@amy

amy commented Jul 10, 2019

Broad issue to track what our disaster recovery plan is if k8s.io registry somehow gets deleted.

One suggestion was creating a backup registry that snapshots k8s.io registry.

@amy
Author

amy commented Jul 10, 2019

There are 2 things:
1.) Does the lack of a disaster recovery plan (aka maintaining the status quo today) prohibit the image promoter from being released to the general public?
2.) Initial brainstorming on possible disaster recovery options.

@amy
Author

amy commented Jul 10, 2019

cc/ @thockin @listx @spiffxp

@thockin

thockin commented Jul 10, 2019

Does the lack of a disaster recovery plan (aka maintaining the status quo today) prohibit the image promoter from being released to the general public?

IMO it does. I don't want to be over-pedantic but if we don't force ourselves to do it, it won't get done. :(

Brainstorming: This doesn't have to be the most amazing, elegant, automatic thing in the world. It might simply be:

  • A daily job (running where? how do we know if it fails or stops running?) that copies all images by SHA to another GCS bucket, one that far fewer principals can access and that has a strong retention policy. It would also snapshot the promoter YAMLs (the SHA-to-tag mappings).

  • A program which consumes the snapshot yaml files and promotes the backup images into a GCR, restoring tags to SHAs.

  • A monthly job that runs the restore into a test GCR, generates a log, and then erases it all.

If that is too onerous, what corners can we cut?
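
One way to answer the "how do we know if it fails or stops running" question is a freshness check on the last successful backup. This is a minimal sketch, assuming the job records its last success as a Unix timestamp; the function name and the one-day default are hypothetical and not part of any existing tooling.

```shell
# Succeed only if the last successful backup is recent enough.
# Arguments: last-success epoch, current epoch, optional max age in seconds.
backup_is_fresh() {
  local last_success_epoch="$1"
  local now_epoch="$2"
  local max_age_seconds="${3:-86400}"   # assumed default: one day
  [ "$(( now_epoch - last_success_epoch ))" -le "$max_age_seconds" ]
}
```

A monitor could call `backup_is_fresh "$last" "$(date +%s)"` and alert when it returns non-zero.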

@cblecker
Member

Isn't GCR just a GCS bucket fronted by a proxy/API? Could the backend bucket just be backed up/copied? Would the GCP storage transfer service be enough? Or a cron'd gsutil sync?

@listx
Contributor

listx commented Jul 10, 2019

Not as simple as a dumb bucket-wise copy, but: GCR just stores digests of images. As long as there is a reference to it, it won't get deleted. So, we could copy everything into another GCR, but namespace it under a timestamped folder. E.g. gcr.io/backup/20190701/..., and it won't eat up a ton of storage because Docker already de-dupes things.
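
The timestamped namespace can be sketched as a small path-mapping helper. This is only an illustration: the function name is hypothetical, gcr.io/backup is the placeholder registry from the example above, and it assumes source references have the form host/project/image.

```shell
# Map a source image reference to a date-prefixed path in a backup registry.
# Arguments: source reference, date stamp, optional backup registry root.
backup_dest() {
  local src="$1"
  local stamp="$2"
  local backup_registry="${3:-gcr.io/backup}"   # placeholder from the thread
  # Drop the registry host and project (the first two path components).
  local image_path="${src#*/*/}"
  printf '%s/%s/%s\n' "$backup_registry" "$stamp" "$image_path"
}
```

For example, `backup_dest gcr.io/k8s-artifacts-prod/kube-apiserver:v1.15.0 20190701` prints `gcr.io/backup/20190701/kube-apiserver:v1.15.0`.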

And we could also try turning on that lifecycle thing for the underlying GCS (bucket) layer: https://cloud.google.com/storage/docs/lifecycle.

@amy
Author

amy commented Jul 15, 2019

+1 on listx's suggestion. Let's narrow down the solutions so that we can get started and unblock releasing the promoter to the rest of the community.

@javier-b-perez

Is backing up the GCS bucket good enough, or do we need to do it at the GCR level, copying image by image along with tags and so on?

@cblecker
Member

@listx What happens if the bucket or project gets deleted?

@listx
Contributor

listx commented Jul 15, 2019

@cblecker Can you clarify?

@amy
Author

amy commented Jul 15, 2019

/assign @amy

Please continue the discussion. I'm following the thread & will write up a google doc of some options for next week's meeting.

@cblecker
Member

@listx I guess I'm not clear on your proposed multiple GCR with digests proposal. Isn't the GCR just a fronted underlying GCS bucket scoped to a project?

@listx
Contributor

listx commented Jul 16, 2019

@listx I guess I'm not clear on your proposed multiple GCR with digests proposal. Isn't the GCR just a fronted underlying GCS bucket scoped to a project?

Yes. But AFAIK GCS alone does not auto-dedup data. A quick google search led me to https://cloud.google.com/solutions/partners/storreduce-cloud-deduplication which supports my assumption.

Ultimately we would be taking daily(?) snapshots of all the images in k8s.gcr.io. If deduplication is free (via another Docker Registry such as GCR), then we can even take hourly snapshots and it won't matter much.

@listx
Contributor

listx commented Aug 16, 2019

Short of reaching consensus on the initial backup approach, let's try to identify some invariants.

(1) job duration < 24 hrs: I think we want the backups to happen at least daily.
(2) disk usage: because of (1), we really want to de-dupe data. This rules out GCS bucketwise copies (although, one could argue, we could have a rolling window of backed-up snapshots --- e.g. only the last 30 days).
(3) restoration: following the spirit of Tim's 2nd bullet point, there needs to be some process that understands how to restore from the backup to an "original" state. Using the prefixed-by-date GCR backup idea, this would be as simple as copying all images from (for example) gcr.io/some-backup-project-name/20190808/... -> {asia,eu,us}.gcr.io/k8s-artifacts-prod/... . There are many options here: it could involve some combination of the promoter's -snapshot flag along with gcrane. Unlike gcloud (which the promoter currently relies on), gcrane can copy images that don't even have a tag.
(4) a job that actually runs the restoration: this follows Tim's 3rd bullet point.
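
The restoration copy in (3) could be sketched as a dry-run planner that only prints the commands it would run. This assumes `gcrane cp -r` (as used elsewhere in this thread) is the copy mechanism; the function name is hypothetical and the project names are the examples from (3).

```shell
# Print (do not run) the gcrane commands that would restore one dated backup
# into each regional production registry.
# Arguments: backup registry root, date stamp, production project name.
restore_plan() {
  local backup_root="$1"
  local stamp="$2"
  local prod_project="$3"
  local region
  for region in asia eu us; do
    printf 'gcrane cp -r %s/%s %s.gcr.io/%s\n' \
      "$backup_root" "$stamp" "$region" "$prod_project"
  done
}
```

Running `restore_plan gcr.io/some-backup-project-name 20190808 k8s-artifacts-prod` prints three lines, one per region, which can be reviewed before being piped to a shell.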

Do these points sound reasonable as a first stab at this problem? I think using the promoter's -snapshot flag to generate an easy-to-read YAML inventory of all images in the GCR-to-backup makes sense. These snapshot YAMLs would be stored in GCS (or if we're fancy, in Github). I think the backup "job" should run in Prow (and surely, Prow has some slack alert thing that we can enable for the backup job).

As for where this backup job logic should live --- I'm guessing github.com/kubernetes/k8s.io, or some other k8s repo (and not this promoter repo).

@listx
Contributor

listx commented Aug 16, 2019

Looks like there is already a GCS disaster recovery script underway here: kubernetes/k8s.io#334. We should probably follow the same infrastructural patterns established there.

@justinsb
Contributor

justinsb commented Aug 20, 2019

The pattern that I'm proposing in #334 is a different script for copying everything, with a no-overwrite / no-delete policy (I implemented that in code, @thockin pointed out that we can probably just use retention policy). (Edit: different as in not reusing the same code that we use for promotion)

However, for registries which naturally de-dup I agree with the suggestions of using a date suffix.

And nice find on gcrane @listx ! How about:

gcrane cp -r gcr.io/k8s-staging-cluster-api-aws gcr.io/backup-dest/k8s-staging-cluster-api-aws/$(date --rfc-3339=date)

Of course, it'll take some time to translate that from bash to a programming language ;-) And while this solution does dedup, it doesn't protect against accidental/malicious tag deletion if someone gets access.

If we do want to protect against that, another option is to rsync the bucket underlying GCR, and then also export the manifests and upload them. This is relatively cheap, and we also can then have a GCS bucket with a retention policy to prevent overwriting.

ID=`date --rfc-3339=date`
gsutil rsync -r gs://artifacts.k8s-staging-cluster-api-aws.appspot.com/containers/images/ gs://backup/containers/images/
mkdir -p tags/${ID}/gcr.io/k8s-staging-cluster-api-aws/
gcrane ls -r gcr.io/k8s-staging-cluster-api-aws | grep -v @sha256 | xargs -I {} bash -c "gcrane manifest {} > tags/${ID}/{}.manifest"
gsutil rsync -r tags/ gs://backup/tags

(This one probably does need some work, because I cheated when creating the directories: it fails on nested images)
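
A possible fix for the nested-image failure is to create each manifest's parent directory just before writing it. This is a sketch, not a tested replacement for the script above: `export_manifest` and the `MANIFEST_CMD` override are hypothetical names, the latter added only so the function can be exercised without gcrane installed.

```shell
# Write one manifest to <out_root>/<image ref>.manifest, creating any
# intermediate directories first so that nested image names work.
# Arguments: image reference, output root directory.
export_manifest() {
  local ref="$1"
  local out_root="$2"
  local out="${out_root}/${ref}.manifest"
  mkdir -p "$(dirname "$out")"
  # Defaults to "gcrane manifest"; MANIFEST_CMD is a hypothetical override.
  ${MANIFEST_CMD:-gcrane manifest} "$ref" > "$out"
}
```

It would slot into the pipeline as `gcrane ls -r gcr.io/k8s-staging-cluster-api-aws | grep -v @sha256 | while read -r ref; do export_manifest "$ref" "tags/${ID}"; done`, which replaces the single up-front mkdir.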

The downside is that it isn't trivial to restore from that, and that we're making some assumptions about the structure of GCR. But we could easily bring up a server that serves from this structure - whether that's a temporary one for DR, or because we want some mirrors that don't use GCR. If we're really sneaky, it's even possible to serve direct from GCS I believe.

@listx
Contributor

listx commented Aug 26, 2019

The pattern that I'm proposing in #334 is a different script for copying everything, with a no-overwrite / no-delete policy (I implemented that in code, @thockin pointed out that we can probably just use retention policy). (Edit: different as in not reusing the same code that we use for promotion)

However, for registries which naturally de-dup I agree with the suggestions of using a date suffix.

And nice find on gcrane @listx ! How about:

gcrane cp -r gcr.io/k8s-staging-cluster-api-aws gcr.io/backup-dest/k8s-staging-cluster-api-aws/$(date --rfc-3339=date)

Of course, it'll take some time to translate that from bash to a programming language ;-) And while this solution does dedup, it doesn't protect against accidental/malicious tag deletion if someone gets access.

I think it makes sense to just start out with something simple like this. One thing to note here is that the backup GCR will have its own separate service account for write access to the backups. It doesn't buy us a ton of security but it's better than the status quo.

Are there any volunteers for this initial implementation using gcrane to do the copy? It would have to live in a prow job. Please comment!

EDIT: I'd like to clarify that I will take an initial stab at the implementation (you should see a PR this week); I just wanted to see if other people on this thread wanted to chip in. :)

@listx listx added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Aug 27, 2019
@listx
Contributor

listx commented Aug 28, 2019

An additional thought: I think it makes sense for the backup GCR to additionally mirror the latest snapshot of the prod GCR. This way, we could just redirect the vanity domain k8s.gcr.io to point to the backup GCR in case the prod GCR gets hosed, so that we don't have to wait for the backfill process to finish (there would be very minimal downtime).

The one slightly ugly part is that now the backup GCR looks like this:

gcr.io/<backup-project-name>/foo-img
gcr.io/<backup-project-name>/bar-img
gcr.io/<backup-project-name>/...
gcr.io/<backup-project-name>/backups/<DATE>/...

where the backups folder would take up a name that the new prod GCR must not have (it's a sort of reserved name). But I think this is minor/negligible.

I suppose the missing piece here is that the backup GCR has to be made smart enough to only mirror good states (i.e., if an attacker re-tags all images, we don't want the backup mirror to do the same --- there would have to be some sort of delta heuristic for the backup process to detect bad states of the original and know when not to mirror them).
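
One very simple form of such a heuristic is to count how many existing tags changed digest since the last mirrored state and refuse to mirror above a limit. This is a sketch under loud assumptions: the two-column "image:tag digest" listing format, the function names, and the default limit of zero are all hypothetical, and a real heuristic would likely need more signals.

```shell
# Count tags present in both listings whose digest differs.
# Each input line is assumed to be "<image:tag> <digest>".
# Arguments: previous listing file, current listing file.
changed_tags() {
  awk 'NR == FNR { prev[$1] = $2; next }
       ($1 in prev) && prev[$1] != $2 { n++ }
       END { print n + 0 }' "$1" "$2"
}

# Succeed only if the number of re-pointed tags is at or below the limit.
# Arguments: previous listing, current listing, optional limit (default 0).
safe_to_mirror() {
  local prev="$1"
  local cur="$2"
  local limit="${3:-0}"
  [ "$(changed_tags "$prev" "$cur")" -le "$limit" ]
}
```

Newly added tags are not counted, so ordinary promotions pass while a mass re-tag trips the check.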

@listx
Contributor

listx commented Aug 30, 2019

Are there any thoughts about using the promoter directly for performing backups? We should be able to do this once #118 is merged.

The backup process would be:

  1. Construct a "backup" Promoter manifest. We can use the -snapshot flag to record all reachable images in a repo. (This output is 99% of a regular Promoter manifest, minus the registries: field).
  2. Promote all images in the backup manifest with a rebased name, prefixed by date:
registries:
- name: gcr.io/k8s-cip-test-prod
  service-account: k8s-infra-gcr-promoter@k8s-cip-test-prod.iam.gserviceaccount.com
  src: true
# Same for all the regions for multi-regional backups.
- name: us.gcr.io/k8s-cip-test-prod/<DATE>
  service-account: k8s-infra-gcr-promoter@k8s-cip-test-prod.iam.gserviceaccount.com
  3. Save the backup manifest to a GCS bucket (or Github or somewhere else). Saving it in Github would be nice because of the easier discoverability and change history.
  4. Repeat the above steps daily.

I think steps 1 and 2 can be glued together with either a shell script or Go binary (we already have the framework for this sort of "glue" code in our e2e tests, so we can reuse the code there if we decide to use Go instead of bash).

I think this is 1/2 of Disaster Recovery. The other 1/2 would be the Restoration process that restores backed-up images to a test GCR. This is actually pretty similar to the other half:

  1. Promote all images in the backup (the backup will already have a list of snapshot YAMLs by date) to the target GCR (in this case, the test GCR).
  2. Take a snapshot of the test GCR and ensure that it matches with the snapshot we used for the promotion (this is the same approach we use in our e2e tests).
  3. Delete the test GCR.

I think the only missing piece is some easy way of making the promoter promote directly from a snapshot YAML, by allowing the user to supply the missing registries: field dynamically as CLI arguments or ENV vars or some other.
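
Until the promoter can accept the registries: field dynamically, a wrapper could render that field itself and prepend it to a snapshot. This is a sketch that reuses the field names from the YAML above; the function name is hypothetical, and whether the promoter accepts a manifest assembled this way is an assumption.

```shell
# Render a registries: block for a backup manifest.
# Arguments: source registry, destination registry, service account.
render_registries() {
  local src="$1"
  local dest="$2"
  local service_account="$3"
  cat <<EOF
registries:
- name: ${src}
  service-account: ${service_account}
  src: true
- name: ${dest}
  service-account: ${service_account}
EOF
}
```

The output could then be combined with a saved snapshot, for example `{ render_registries "$SRC" "$DEST" "$SA"; cat snapshot.yaml; } > backup-manifest.yaml`, where the variable and file names are placeholders.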

@jonjohnsonjr
Contributor

gcrane cp -r should work for this

@listx
Contributor

listx commented Sep 20, 2019

I am working on a doc to sum everything up + an initial implementation. Will share with this thread soon... stay tuned!

@listx
Contributor

listx commented Sep 23, 2019

Here is a writeup of an initial approach/design: https://docs.google.com/document/d/1od5y-Z2xP9mVmg2Yztnv-GQ7D-orj9HsTmeVvNHkzzA/edit?usp=sharing

Mailing list link: https://groups.google.com/d/msg/kubernetes-wg-k8s-infra/cseCwgALwdk/iOYkaEYFCAAJ

You must be a member of the kubernetes-wg-k8s-infra Google group in order to access the document.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 22, 2019
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 21, 2020
@listx
Contributor

listx commented Mar 9, 2020

/unassign @amy

@bartsmykla

I'm going to assign myself to this too, because I think it's an important topic we should work on sooner or later.

/assign

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 8, 2020
@listx
Contributor

listx commented Jun 9, 2020

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 9, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 7, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 7, 2020
@justaugustus justaugustus removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Oct 16, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 14, 2021
@spiffxp
Member

spiffxp commented Jan 15, 2021

/remove-lifecycle stale
/lifecycle frozen
/wg k8s-infra
/sig release
/assign @justaugustus
please assign to someone more appropriate in @kubernetes-sigs/release-engineering

@k8s-ci-robot k8s-ci-robot added wg/k8s-infra lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. sig/release Categorizes an issue or PR as relevant to SIG Release. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 15, 2021
@k8s-ci-robot k8s-ci-robot added sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. and removed wg/k8s-infra labels Sep 30, 2021
@BenTheElder
Member

This still needs doing I think.
