fix: Resolve claim lock contention issue by separating transaction #1741
Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @renjiezh)
src/main/kotlin/org/wfanet/measurement/duchy/deploy/gcloud/spanner/computation/GcpSpannerComputationsDatabaseTransactor.kt
line 148 at r2 (raw file):
  prioritizedStages: List<StageT>,
): String? {
  /** Claim a specific task represented by the results of running the above sql. */
This now needs to be done in a broader loop.
1. Query for an unclaimed task in the first transaction.
2. Check that the task is still unclaimed in the second transaction. If not, retry starting at (1).
3. Claim in the second transaction.
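The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `TaskRow`, `InMemoryTaskStore`, `findUnclaimed`, `tryClaim`, and `claimTask` are hypothetical names, and a synchronized in-memory map stands in for Spanner, with each store method playing the role of one transaction.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical stand-in for a Computations row; not the real Spanner schema.
data class TaskRow(val id: Long, val owner: String?, val lockExpiration: Instant?)

// Hypothetical in-memory stand-in for the database. Each public method plays
// the role of one transaction.
class InMemoryTaskStore(initialRows: List<TaskRow>) {
  private val rows: MutableMap<Long, TaskRow> =
    initialRows.associateBy { it.id }.toMutableMap()

  // A row is claimable if nobody owns it or its lock has expired.
  private fun isClaimable(row: TaskRow, now: Instant): Boolean =
    row.owner == null || row.lockExpiration == null || !row.lockExpiration.isAfter(now)

  // Step 1: read-only query for the lowest-id claimable task, if any.
  @Synchronized
  fun findUnclaimed(now: Instant): Long? =
    rows.values.filter { isClaimable(it, now) }.minOfOrNull { it.id }

  // Steps 2 and 3: re-read the single row, verify that it is still claimable,
  // and only then write the claim.
  @Synchronized
  fun tryClaim(id: Long, owner: String, now: Instant, lockDuration: Duration): Boolean {
    val row = rows[id] ?: return false
    if (!isClaimable(row, now)) return false
    rows[id] = row.copy(owner = owner, lockExpiration = now.plus(lockDuration))
    return true
  }
}

// Broader loop: if the verify step fails, start again from the query.
fun claimTask(
  store: InMemoryTaskStore,
  owner: String,
  now: Instant = Instant.now(),
  lockDuration: Duration = Duration.ofMinutes(5),
  maxAttempts: Int = 3,
): Long? {
  repeat(maxAttempts) {
    val id = store.findUnclaimed(now) ?: return null // nothing left to claim
    if (store.tryClaim(id, owner, now, lockDuration)) return id
    // Another worker claimed the row between the two transactions; retry.
  }
  return null
}

fun main() {
  val store = InMemoryTaskStore(listOf(TaskRow(1, null, null), TaskRow(2, null, null)))
  println(claimTask(store, "mill-a"))
  println(claimTask(store, "mill-b"))
  println(claimTask(store, "mill-c"))
}
```

The point of the split is that the query never holds write locks, while the write transaction only touches the single row it re-reads.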
src/main/kotlin/org/wfanet/measurement/duchy/deploy/gcloud/spanner/computation/GcpSpannerComputationsDatabaseTransactor.kt
line 151 at r2 (raw file):
suspend fun claimSpecificTask(result: UnclaimedTaskQueryResult<StageT>): Boolean =
  databaseClient.readWriteTransaction().execute { txn ->
    claim(
Where is the additional read/check in this second transaction? Without that, the original issue from #1722 is present again.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @SanjayVas)
src/main/kotlin/org/wfanet/measurement/duchy/deploy/gcloud/spanner/computation/GcpSpannerComputationsDatabaseTransactor.kt
line 151 at r2 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
Where is the additional read/check in this second transaction? Without that, the original issue from #1722 is present again.
I found that the claim() function already checks the updateTime. I thought you had added that check in the recent PR, but it has actually been there for a long time. It is now a mystery to me why multiple mills can claim the same Computation at the same time.
Reviewable status: 3 of 4 files reviewed, 2 unresolved discussions (waiting on @SanjayVas)
src/main/kotlin/org/wfanet/measurement/duchy/deploy/gcloud/spanner/computation/GcpSpannerComputationsDatabaseTransactor.kt
line 148 at r2 (raw file):
Previously, SanjayVas (Sanjay Vasandani) wrote…
This now needs to be done in a broader loop.
1. Query for an unclaimed task in the first transaction.
2. Check that the task is still unclaimed in the second transaction. If not, retry starting at (1).
3. Claim in the second transaction.
Discussed offline.
src/main/kotlin/org/wfanet/measurement/duchy/deploy/gcloud/spanner/computation/GcpSpannerComputationsDatabaseTransactor.kt
line 151 at r2 (raw file):
Previously, renjiezh wrote…
I found that the claim() function already checks the updateTime. I thought you had added that check in the recent PR, but it has actually been there for a long time. It is now a mystery to me why multiple mills can claim the same Computation at the same time.
Discussed offline.
Reviewed 3 of 4 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @SanjayVas)
For some additional context, this is mostly reverting #1726. Using a single transaction can result in lock contention, as the unclaimed tasks query will lock all claimable Computations DB rows. The solution is to use a separate transaction to perform the write, but to re-read the single row and ensure that it's still claimable. The old version of this check had a flaw in that it incorrectly assumed that the JVM clock is monotonic. This PR introduces an additional check based on LockExpirationTime, which is less susceptible to clock skew.
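To make the clock issue concrete, here is a minimal sketch of one way such a re-read check could fail and how a lock-expiration comparison might catch it. This is an assumed reading, not the PR's actual logic: `LockSnapshot`, `updateTimeUnchanged`, and `isStillClaimable` are hypothetical names, and the scenario assumes that update times are written from a JVM clock that can step backwards, so that two different writes may carry the same timestamp.

```kotlin
import java.time.Instant

// Hypothetical view of the lock-related columns of one Computations row.
data class LockSnapshot(val updateTime: Instant, val lockExpirationTime: Instant?)

// Assumed shape of the older check: the row counts as untouched if its
// update time still equals the one seen by the unclaimed-tasks query.
fun updateTimeUnchanged(seenByQuery: LockSnapshot, reRead: LockSnapshot): Boolean =
  reRead.updateTime == seenByQuery.updateTime

// Assumed shape of the additional check: the lock expiration must also be
// unchanged, so a claim by another worker is noticed even when the update
// times collide.
fun isStillClaimable(seenByQuery: LockSnapshot, reRead: LockSnapshot): Boolean =
  updateTimeUnchanged(seenByQuery, reRead) &&
    reRead.lockExpirationTime == seenByQuery.lockExpirationTime

fun main() {
  val t = Instant.parse("2024-01-01T00:00:00Z")
  val seenByQuery = LockSnapshot(updateTime = t, lockExpirationTime = null)
  // Another mill claims the row, but its clock stepped back, so it writes the
  // same update time alongside a new lock expiration.
  val afterOtherClaim = LockSnapshot(updateTime = t, lockExpirationTime = t.plusSeconds(300))

  println(updateTimeUnchanged(seenByQuery, afterOtherClaim))
  println(isStillClaimable(seenByQuery, afterOtherClaim))
}
```

In this scenario the update-time comparison alone would let a second mill claim the row, while the combined comparison would reject it.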
Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved (waiting on @renjiezh)