Create a v2 snapshot when running etcdutl migrate command #19168

Open
ahrtr wants to merge 1 commit into main from etcdutl_snapshot_20250110
Conversation

@ahrtr (Member) commented Jan 10, 2025

Refer to #17911 (comment)

This PR will make the etcdutl migrate command fully functional.

  • It creates a v2 snapshot from the v3 store (see the sketch after this list).
    You will never see the error below anymore when executing the etcdutl migrate command:

    Error: cannot downgrade storage, WAL contains newer entries, as the target version (3.5.0) is lower than the version (3.6.0) detected from WAL logs

    After executing the migrate command for all members, you just need to replace the binary of each member directly, and the offline downgrade is done. Of course, it's still recommended to follow the online downgrade process, as it doesn't break the workload. cc @ivanvc @jmhbnz

  • It also adds a separate etcdutl v2snapshot create command.
    It's just a manual, last-resort solution for any potential issue. Usually we don't need it.
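
A rough Go sketch of what creating a v2 snapshot from the v3 store involves (an illustration only, not the code in this PR; readMetaFromBackend is a hypothetical helper, and the import paths follow the 3.6 layout and may differ):

package sketch

import (
	"errors"
	"path/filepath"

	"go.etcd.io/etcd/server/v3/etcdserver/api/snap"
	"go.etcd.io/etcd/server/v3/storage/wal"
	"go.etcd.io/etcd/server/v3/storage/wal/walpb"
	"go.etcd.io/raft/v3/raftpb"
	"go.uber.org/zap"
)

// createV2Snapshot sketches the flow: read the consistent index, term and
// ConfState from the bbolt db, write a *.snap file, and record the snapshot
// in the WAL so etcdserver picks it up on bootstrap.
func createV2Snapshot(lg *zap.Logger, dataDir string) error {
	ci, term, confState, err := readMetaFromBackend(filepath.Join(dataDir, "member", "snap", "db"))
	if err != nil {
		return err
	}

	// Persist the snapshot file (member/snap/*.snap). The real implementation
	// also fills raftSnap.Data with the serialized (v2) membership store,
	// which is omitted here.
	ss := snap.New(lg, filepath.Join(dataDir, "member", "snap"))
	raftSnap := raftpb.Snapshot{
		Metadata: raftpb.SnapshotMetadata{Index: ci, Term: term, ConfState: confState},
	}
	if err := ss.SaveSnap(raftSnap); err != nil {
		return err
	}

	// Record the snapshot marker in the WAL. The WAL must be read through
	// before new records can be appended.
	w, err := wal.Open(lg, filepath.Join(dataDir, "member", "wal"), walpb.Snapshot{})
	if err != nil {
		return err
	}
	defer w.Close()
	if _, _, _, err := w.ReadAll(); err != nil {
		return err
	}
	// (The PR also saves a HardState when needed; omitted here.)
	return w.SaveSnapshot(walpb.Snapshot{Index: ci, Term: term, ConfState: &confState})
}

// readMetaFromBackend is a stand-in for reading the "meta" bucket of the
// bbolt db (consistent_index, term, confState); it is not implemented in
// this sketch.
func readMetaFromBackend(dbPath string) (uint64, uint64, raftpb.ConfState, error) {
	return 0, 0, raftpb.ConfState{}, errors.New("not implemented in this sketch")
}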

I need to add e2e tests. I may also break it down into smaller PRs.

@ahrtr ahrtr force-pushed the etcdutl_snapshot_20250110 branch 2 times, most recently from 250f708 to d8f3b56 on January 10, 2025 18:21

codecov bot commented Jan 10, 2025

Codecov Report

Attention: Patch coverage is 68.33333% with 19 lines in your changes missing coverage. Please review.

Project coverage is 68.81%. Comparing base (32cfd45) to head (732e7ef).

Files with missing lines             Patch %   Lines
etcdutl/etcdutl/common.go            77.35%    6 Missing and 6 partials ⚠️
etcdutl/etcdutl/migrate_command.go   0.00%     7 Missing ⚠️

Additional details and impacted files

Files with missing lines             Coverage Δ
etcdutl/etcdutl/migrate_command.go   0.00% <0.00%> (ø)
etcdutl/etcdutl/common.go            74.68% <77.35%> (+28.52%) ⬆️

... and 26 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19168      +/-   ##
==========================================
- Coverage   68.83%   68.81%   -0.03%     
==========================================
  Files         420      420              
  Lines       35678    35737      +59     
==========================================
+ Hits        24560    24592      +32     
- Misses       9688     9705      +17     
- Partials     1430     1440      +10     

Continue to review full report in Codecov by Sentry.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 32cfd45...732e7ef. Read the comment docs.

@ahrtr ahrtr force-pushed the etcdutl_snapshot_20250110 branch from d8f3b56 to 033f4cf on January 10, 2025 19:36
@ahrtr ahrtr force-pushed the etcdutl_snapshot_20250110 branch 2 times, most recently from d71fc9f to f59e8b8 on January 11, 2025 16:15
@ahrtr (Member, Author) commented Jan 11, 2025

This is a huge PR; let me break it down into smaller PRs to make the review easier.

@ahrtr ahrtr force-pushed the etcdutl_snapshot_20250110 branch 4 times, most recently from 43a213e to acffdb5 on January 13, 2025 10:13
@etcd-io etcd-io deleted a comment from k8s-ci-robot Jan 13, 2025
@ahrtr ahrtr force-pushed the etcdutl_snapshot_20250110 branch from acffdb5 to 174c8b4 on January 13, 2025 12:25
@serathius (Member) commented:

So many codecov warnings, is there any way to hide them?

@ahrtr ahrtr force-pushed the etcdutl_snapshot_20250110 branch 2 times, most recently from df49a51 to 099e356 on January 15, 2025 16:15
@ahrtr ahrtr force-pushed the etcdutl_snapshot_20250110 branch from 099e356 to 1187204 on January 15, 2025 19:06
@ahrtr ahrtr force-pushed the etcdutl_snapshot_20250110 branch from 1187204 to 1180f3a on January 15, 2025 19:33
@ahrtr ahrtr force-pushed the etcdutl_snapshot_20250110 branch from 5695acd to e3fb899 on January 16, 2025 15:33
@ahrtr (Member, Author) commented Jan 16, 2025

/retest

@ahrtr (Member, Author) commented Jan 16, 2025

/test pull-etcd-robustness-arm64

@ahrtr (Member, Author) commented Jan 20, 2025

Note that previously I was thinking that creating a v2 snapshot file would be just an optional step when executing etcdutl migrate.

But it would definitely be better to make it a required step; otherwise we will always see the cannot downgrade storage, WAL contains newer entries error message when executing etcdutl migrate on a newly created cluster, due to the cluster version being set to the new version (e.g. 3.6, when migrating from 3.6 to 3.5).

See also #13405 (comment)

cc @fuweid @siyuanfoundation @serathius PTAL

@serathius (Member) commented:

What about the invariant snapshot.Metadata.Index < db.consistentIndex?

// RecoverSnapshotBackend recovers the DB from a snapshot in case etcd crashes
// before updating the backend db after persisting raft snapshot to disk,
// violating the invariant snapshot.Metadata.Index < db.consistentIndex. In this
// case, replace the db with the snapshot db sent by the leader.

We cannot just snapshot without updating the db. That's not an easy thing to change.

For the cannot downgrade storage, WAL contains newer entries, as the target version (3.5.0) is lower than the version (3.6.0) detected from WAL logs error, we might consider not treating SetClusterVersion("3.6") as an entry that makes the WAL version equal to 3.6.

@ahrtr (Member, Author) commented Jan 21, 2025

What about the invariant snapshot.Metadata.Index < db.consistentIndex?

// RecoverSnapshotBackend recovers the DB from a snapshot in case etcd crashes
// before updating the backend db after persisting raft snapshot to disk,
// violating the invariant snapshot.Metadata.Index < db.consistentIndex. In this
// case, replace the db with the snapshot db sent by the leader.

We cannot just snapshot without updating the db. That's not an easy thing to change.

Not sure I got your point. etcdutl migrate is just an offline tool, and we are going to create a v2 snapshot file based on the v3 store (bbolt db). Recovering the DB from a snapshot in case etcd crashes isn't etcdutl's responsibility.

For the cannot downgrade storage, WAL contains newer entries, as the target version (3.5.0) is lower than the version (3.6.0) detected from WAL logs error, we might consider not treating SetClusterVersion("3.6") as an entry that makes the WAL version equal to 3.6.

I thought about it before, but the answer is NO. Reasons:

  • 3.5 doesn't write SetClusterVersion entries to the WAL file. If we see such entries, they were definitely generated by 3.6 or a later version.
    Example for 3.6:

    $ ./etcd-dump-logs ../../default.etcd/
    Snapshot:
    empty
    Start dumping log entries from snapshot.
    WAL metadata:
    nodeID=8e9e05c52164694d clusterID=cdf818194e3a8c32 term=2 commitIndex=5 vote=8e9e05c52164694d
    WAL entries: 5
    lastIndex=5
    term	     index	type	data
       1	         1	conf	method=ConfChangeAddNode id=8e9e05c52164694d
       2	         2	norm	
       2	         3	norm	header:<ID:7587884260861681666 > cluster_member_attr_set:<member_ID:10276657743932975437 member_attributes:<name:"default" client_urls:"http://localhost:2379" > > 
       2	         4	norm	header:<ID:7587884260861681668 > cluster_version_set:<ver:"3.6.0" > 
       2	         5	norm	header:<ID:7587884260861681669 > put:<key:"k1" value:"v1" > 
    
    Entry types (Normal,ConfigChange) count is : 5
    

    Example for 3.5:

    $ ./etcd-dump-logs ../../default.etcd/
    Snapshot:
    empty
    Start dumping log entries from snapshot.
    WAL metadata:
    nodeID=8e9e05c52164694d clusterID=cdf818194e3a8c32 term=2 commitIndex=5 vote=8e9e05c52164694d
    WAL entries: 5
    lastIndex=5
    term	     index	type	data
       1	         1	conf	method=ConfChangeAddNode id=8e9e05c52164694d
       2	         2	norm	
       2	         3	norm	method=PUT path="/0/members/8e9e05c52164694d/attributes" val="{\"name\":\"default\",\"clientURLs\":[\"http://localhost:2379\"]}"
       2	         4	norm	method=PUT path="/0/version" val="3.5.0"
       2	         5	norm	header:<ID:7587884260899009541 > put:<key:"k1" value:"v1" > 
    
    Entry types (Normal,ConfigChange) count is : 5
    
  • We can revisit this in 3.7, as it won't be needed anymore.

Resolved review threads: etcdutl/etcdutl/common.go, etcdutl/etcdutl/common_test.go
@ahrtr ahrtr force-pushed the etcdutl_snapshot_20250110 branch from e3fb899 to 7589807 on January 22, 2025 09:49
if err := w.SaveSnapshot(walpb.Snapshot{Index: ci, Term: term, ConfState: &confState}); err != nil {
	return err
}
if err := w.Save(raftpb.HardState{Term: term, Commit: ci, Vote: st.Vote}, nil); err != nil {
@serathius (Member) commented:

Not sure about adding a HardState here; what's the purpose? We use the CI read from the db file, which means the index here is for sure committed; the commit index might be further ahead than the db, which is only flushed once every 5 seconds. I'm worried that it might cause problems by breaking the monotonicity of the commit index.

@ahrtr (Member, Author) commented:

Not sure about adding a HardState here, what's the purpose?

To ensure the snapshot index is committed; otherwise the snapshot may be filtered out by etcdserver on bootstrap. Of course, I agree that we need to ensure the commitIndex never decreases. Will update later.

We use the CI read from the db file, which means the index here is for sure committed

It is NOT guaranteed, because etcd applies entries and syncs the HardState to the WAL asynchronously. It's easy to verify.

First, make a change like the one below:

$ git diff
diff --git a/server/storage/wal/wal.go b/server/storage/wal/wal.go
index f3d7bc5f4..9597a0ccd 100644
--- a/server/storage/wal/wal.go
+++ b/server/storage/wal/wal.go
@@ -961,6 +961,13 @@ func (w *WAL) Save(st raftpb.HardState, ents []raftpb.Entry) error {
                return nil
        }
 
+       if st.Commit > 5 {
+               w.lg.Info("########### Sleeping 10 seconds", zap.Uint64("Index", st.Commit))
+               time.Sleep(10 * time.Second)
+               w.lg.Info("########### Panicking after 10 seconds")
+               panic("non empty hard state")
+       }
+
        mustSync := raft.MustSync(st, w.state, len(ents))
 
        // TODO(xiangli): no more reference operator

Execute a couple of etcdctl put commands after starting etcd; etcdserver will then panic, and you will find that the consistentIndex is greater than the commitIndex.

$ ./etcd-dump-db iterate-bucket ../../default.etcd/member/snap/db meta --decode
key="term", value=2
key="storageVersion", value="3.6.0"
key="consistent_index", value=6
key="confState", value="{\"voters\":[10276657743932975437],\"auto_leave\":false}"

$ ./etcd-dump-logs ../../default.etcd/
Snapshot:
empty
Start dumping log entries from snapshot.
WAL metadata:
nodeID=8e9e05c52164694d clusterID=cdf818194e3a8c32 term=2 commitIndex=5 vote=8e9e05c52164694d
WAL entries: 6
lastIndex=6
term	     index	type	data
   1	         1	conf	method=ConfChangeAddNode id=8e9e05c52164694d
   2	         2	norm	
   2	         3	norm	header:<ID:7587884288152690690 > cluster_member_attr_set:<member_ID:10276657743932975437 member_attributes:<name:"default" client_urls:"http://localhost:2379" > > 
   2	         4	norm	header:<ID:7587884288152690692 > cluster_version_set:<ver:"3.6.0" > 
   2	         5	norm	header:<ID:7587884288152690694 > put:<key:"k1" value:"v1" > 
   2	         6	norm	header:<ID:7587884288152690695 > put:<key:"k2" value:"v2" > 

Entry types (Normal,ConfigChange) count is : 6

@ahrtr (Member, Author) commented:

Updated.

@serathius (Member) commented:

Good observation, you are right; there is nothing beneficial about fsyncing the HardState to the WAL. Still, I don't think there is anything beneficial about committing the snapshot.

@ahrtr (Member, Author) commented:

Probably I did not explain it clearly.

Firstly, the consistent_index may be greater than the commit index (already confirmed by my comment above).

Secondly, when etcdserver bootstraps, it only loads v2 snapshots with Index <= commitIndex (see below):

// filter out any snaps that are newer than the committed hardstate
n := 0
for _, s := range snaps {
	if s.Index <= state.Commit {
		snaps[n] = s
		n++
	}
}

So we need to commit the snapshot index, but of course only when it's greater than the existing commitIndex.
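
A minimal sketch of that guard (an illustration only, with a hypothetical helper name, not the code in this PR):

package sketch

import "go.etcd.io/raft/v3/raftpb"

// hardStateForSnapshot illustrates the guard discussed above: the commit
// index recorded in the WAL must never decrease, so it is only raised to the
// snapshot index (the consistent_index read from the db) when that index is
// ahead of the commit index already persisted in the HardState.
func hardStateForSnapshot(st raftpb.HardState, snapIndex uint64) (raftpb.HardState, bool) {
	if snapIndex <= st.Commit {
		// Nothing to save; writing a smaller commit index would break monotonicity.
		return st, false
	}
	st.Commit = snapIndex
	return st, true
}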

@serathius (Member) commented Jan 23, 2025:

OK, I think I now understand the motivation for putting the HardState; however, I see another problem. If the reason to add snapshot generation is to prevent etcd v3.5 from going back in the WAL and reading SetClusterVersion("3.6"), then what about situations where SetClusterVersion has a raft index after the CI?

PS: Please don't resolve conversations that haven't received an answer accepted by both sides. It slows down review.

@ahrtr (Member, Author) commented:

then what about situations where SetClusterVersion has a raft index after the CI?

Theoretically it's possible, but in practice it's unlikely. Creating a v2 snapshot is just a best-effort action.

PS: Please don't resolve conversations that haven't received an answer accepted by both sides. It slows down review.

The motivation is that resolving a conversation gives reviewers the impression that it's ready for a further round of review, instead of still being stuck on an existing comment.

Also, based on previous experience, a comment often gets no response for a long time, sometimes no follow-up comments at all (this isn't aimed at you, just a general comment on the review process). So I tend to resolve a conversation if I think everything has been resolved.

Improvement:

  • I will try to keep it open a little longer next time, even if I think everything is resolved.
  • Please also feel free to unresolve it if you or anyone else still has comments in a conversation.

@ahrtr ahrtr force-pushed the etcdutl_snapshot_20250110 branch from 7589807 to ae5d208 on January 22, 2025 16:50
@ahrtr (Member, Author) commented Jan 23, 2025

cc @serathius

@ahrtr ahrtr closed this Jan 23, 2025
@ahrtr ahrtr deleted the etcdutl_snapshot_20250110 branch January 23, 2025 16:25
@ahrtr ahrtr restored the etcdutl_snapshot_20250110 branch January 23, 2025 16:31
@ahrtr ahrtr reopened this Jan 23, 2025
Also added test to cover the etcdutl migrate command

Signed-off-by: Benjamin Wang <benjamin.ahrtr@gmail.com>
@ahrtr ahrtr force-pushed the etcdutl_snapshot_20250110 branch from ae5d208 to 732e7ef on January 23, 2025 16:33
@ah8ad3 (Contributor) left a comment:

Thanks for implementing this; it gave me good insight.

@serathius (Member) commented:

3.5 doesn't write SetClusterVersion entries to the WAL file. If we see such entries, they were definitely generated by 3.6 or a later version.

ClusterVersionSet was introduced in v3.5 and is annotated as such

message ClusterVersionSetRequest {
  option (versionpb.etcd_version_msg) = "3.5";
  string ver = 1;
}

The reason v3.5 etcd still uses the old method=PUT path="/0/version" val="3.5.0" is backward compatibility. Versioned WAL entries were introduced only in v3.6, so before that the etcd maintainers added new protos across multiple releases to ensure compatibility. SetClusterVersion was added in v3.5 but not used, to ensure it doesn't break v3.4 during an upgrade. SetClusterVersion started being used in v3.6, but it is v3.5-compatible because v3.5 understands it.

@serathius (Member) commented:

cc @siyuanfoundation: can you also take a look, as you worked on downgrades?

@serathius (Member) commented Jan 23, 2025

We can revisit this in 3.7, as it won't be needed anymore.

I think if we change how SetClusterVersion is handled in WAL version detection, then we would not need this PR at all. I think that would be a simpler solution. I might have made a mistake by implementing it like this just to be safe. However, if there are no incompatible entries in the WAL, there is no reason not to allow a downgrade without a snapshot. Also, as there are no new entries in v3.6, we would always be able to downgrade to v3.5.

This reminds me that, because of the WAL compatibility of v3.5 and v3.6, I explicitly wanted to add a fake new entry to the WAL that could be used to test downgrades and these compatibility checks. I think I just ended up using SetClusterVersion for that.
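
A rough sketch of that alternative (an illustration only, not etcd's actual WAL version-detection code; entryRequiresV36 is a hypothetical placeholder for the versionpb-annotation-driven check):

package sketch

import (
	"github.com/coreos/go-semver/semver"

	"go.etcd.io/etcd/api/v3/etcdserverpb"
	"go.etcd.io/raft/v3/raftpb"
)

// minimalWALVersion illustrates the idea: when computing the minimal etcd
// version required to replay the WAL, ClusterVersionSet entries are skipped,
// since v3.5 already understands them.
func minimalWALVersion(ents []raftpb.Entry) *semver.Version {
	minVer := semver.New("3.5.0")
	for _, e := range ents {
		if e.Type != raftpb.EntryNormal || len(e.Data) == 0 {
			continue
		}
		var req etcdserverpb.InternalRaftRequest
		if err := req.Unmarshal(e.Data); err != nil {
			continue
		}
		// ClusterVersionSet alone should not force the detected WAL version
		// up to 3.6.
		if req.ClusterVersionSet != nil {
			continue
		}
		if entryRequiresV36(&req) {
			minVer = semver.New("3.6.0")
		}
	}
	return minVer
}

// entryRequiresV36 is a stand-in for the real per-entry check; it always
// returns false in this sketch.
func entryRequiresV36(r *etcdserverpb.InternalRaftRequest) bool { return false }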

@ahrtr (Member, Author) commented Jan 23, 2025

I think if we change how SetClusterVersion is handled in WAL version detection, then we would not need this PR at all.

It means that we need to partially roll back #13405, which is what I had been trying to avoid doing at the last minute before we release 3.6.0.

Anyway, let me see what the effort and the impact would be.

@ahrtr (Member, Author) commented Jan 23, 2025

Just raised #19263, PTAL

@k8s-ci-robot commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, fuweid, siyuanfoundation

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
