fix: Resolve claim lock contention issue by separating transaction #1741

Merged on Aug 8, 2024 (6 commits)

Conversation

renjiezh
Contributor

@renjiezh renjiezh commented Aug 7, 2024

No description provided.

@wfa-reviewable

This change is Reviewable

@renjiezh renjiezh changed the title from "Use only 2 gcp duchies." to "Debug v0.5.7-rc3 in dev" Aug 7, 2024
@renjiezh renjiezh changed the title from "Debug v0.5.7-rc3 in dev" to "Resolve claim lock contention issue by separating transaction" Aug 8, 2024
@renjiezh renjiezh marked this pull request as ready for review August 8, 2024 04:21
@renjiezh renjiezh requested a review from SanjayVas August 8, 2024 04:21
Member

@SanjayVas SanjayVas left a comment

Reviewed 4 of 4 files at r2, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @renjiezh)


src/main/kotlin/org/wfanet/measurement/duchy/deploy/gcloud/spanner/computation/GcpSpannerComputationsDatabaseTransactor.kt line 148 at r2 (raw file):

    prioritizedStages: List<StageT>,
  ): String? {
    /** Claim a specific task represented by the results of running the above sql. */

This now needs to be done in a broader loop.

  1. Query for unclaimed task in first txn
  2. Check task is still unclaimed in second transaction. If not, retry starting at (1).
  3. Claim in second transaction
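The three steps above can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in `GcpSpannerComputationsDatabaseTransactor.kt`: `Task`, `TaskStore`, `queryUnclaimed`, `claimIfStillUnclaimed`, `claimTask`, and `maxAttempts` are all hypothetical names, and each `@Synchronized` method merely stands in for one Spanner transaction.

```kotlin
// Hypothetical in-memory stand-in for the Computations table (not the real schema).
data class Task(val id: Long, var owner: String? = null)

class TaskStore(private val tasks: MutableMap<Long, Task>) {
  // Step 1 (first transaction): read-only query for unclaimed candidates.
  @Synchronized
  fun queryUnclaimed(): List<Long> =
    tasks.values.filter { it.owner == null }.map { it.id }.sorted()

  // Steps 2 and 3 (second transaction): re-read the single row, and claim it
  // only if it is still unclaimed.
  @Synchronized
  fun claimIfStillUnclaimed(id: Long, owner: String): Boolean {
    val task = tasks[id] ?: return false
    if (task.owner != null) return false // Claimed by someone else in between.
    task.owner = owner
    return true
  }
}

// Broader loop: if the re-check fails, start again from the query in step 1.
// Returns the claimed task ID, or null if nothing could be claimed.
fun claimTask(store: TaskStore, owner: String, maxAttempts: Int = 5): Long? {
  repeat(maxAttempts) {
    val candidate = store.queryUnclaimed().firstOrNull() ?: return null
    if (store.claimIfStillUnclaimed(candidate, owner)) return candidate
  }
  return null
}
```

Bounding the retries with `maxAttempts` is an assumption of this sketch; the point is only that the query and the write live in separate transactions, with the re-check guarding the write.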

src/main/kotlin/org/wfanet/measurement/duchy/deploy/gcloud/spanner/computation/GcpSpannerComputationsDatabaseTransactor.kt line 151 at r2 (raw file):

    suspend fun claimSpecificTask(result: UnclaimedTaskQueryResult<StageT>): Boolean =
      databaseClient.readWriteTransaction().execute { txn ->
        claim(

Where is the additional read/check in this second transaction? Without it, the original issue from #1722 is present again.

Contributor Author

@renjiezh renjiezh left a comment

Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @SanjayVas)


src/main/kotlin/org/wfanet/measurement/duchy/deploy/gcloud/spanner/computation/GcpSpannerComputationsDatabaseTransactor.kt line 151 at r2 (raw file):

Previously, SanjayVas (Sanjay Vasandani) wrote…

Where is the additional read/check in this second transaction? Without it, the original issue from #1722 is present again.

I found that the claim() function already checks the updateTime. I thought you had added that check in the recent PR, but it has actually been there for a long time. So it is now a mystery to me why multiple mills can claim the same Computation at the same time.

Contributor Author

@renjiezh renjiezh left a comment

Reviewable status: 3 of 4 files reviewed, 2 unresolved discussions (waiting on @SanjayVas)


src/main/kotlin/org/wfanet/measurement/duchy/deploy/gcloud/spanner/computation/GcpSpannerComputationsDatabaseTransactor.kt line 148 at r2 (raw file):

Previously, SanjayVas (Sanjay Vasandani) wrote…

This now needs to be done in a broader loop.

  1. Query for unclaimed task in first txn
  2. Check task is still unclaimed in second transaction. If not, retry starting at (1).
  3. Claim in second transaction

Discussed offline.


src/main/kotlin/org/wfanet/measurement/duchy/deploy/gcloud/spanner/computation/GcpSpannerComputationsDatabaseTransactor.kt line 151 at r2 (raw file):

Previously, renjiezh wrote…

I found that the claim() function already checks the updateTime. I thought you had added that check in the recent PR, but it has actually been there for a long time. So it is now a mystery to me why multiple mills can claim the same Computation at the same time.

Discussed offline.

Collaborator

@stevenwarejones stevenwarejones left a comment

Reviewed 3 of 4 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @SanjayVas)

Member

@SanjayVas SanjayVas left a comment

For some additional context, this is mostly reverting #1726. Using a single transaction can result in a lock contention issue, as the unclaimed tasks query will lock all claimable Computations DB rows. The solution is to perform the write in a separate transaction, but to re-read the single row in that transaction and ensure that it's still claimable. The old version of this check had a flaw in that it incorrectly assumed that the JVM clock is monotonic. This PR introduces an additional check based on LockExpirationTime, which is less susceptible to clock skew.
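To make the re-read idea concrete, here is a small, self-contained sketch of what such a claimability check could look like. It is an assumption-laden illustration rather than the PR's code: `Row`, `Observed`, `isStillClaimable`, and `tryClaim` are hypothetical names, and comparing the observed `lockExpirationTime` for equality is just one plausible form of the additional check.

```kotlin
import java.time.Duration
import java.time.Instant

// Hypothetical snapshot of one Computations row (not the real schema).
data class Row(
  val updateTime: Instant,
  val lockOwner: String?,
  val lockExpirationTime: Instant?,
)

// Values seen by the first (query) transaction.
data class Observed(val updateTime: Instant, val lockExpirationTime: Instant?)

// Assumed check: the row is still claimable only if both values are unchanged.
// Checking updateTime alone can miss a concurrent claim when the writer's
// clock went backwards and reproduced the same updateTime.
fun isStillClaimable(observed: Observed, current: Row): Boolean =
  current.updateTime == observed.updateTime &&
    current.lockExpirationTime == observed.lockExpirationTime

// Second (write) transaction: re-read the row, check it, then claim it.
// Returns the updated row, or null if the row changed since the query.
fun tryClaim(
  observed: Observed,
  current: Row,
  owner: String,
  now: Instant,
  lockDuration: Duration,
): Row? {
  if (!isStillClaimable(observed, current)) return null
  return current.copy(
    updateTime = now,
    lockOwner = owner,
    lockExpirationTime = now.plus(lockDuration),
  )
}
```

In this sketch, a row whose `updateTime` happens to match the observed value but whose `lockExpirationTime` was set by another claimant is rejected, which is the kind of case an `updateTime`-only comparison would let through.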

Reviewed 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved (waiting on @renjiezh)

@renjiezh renjiezh enabled auto-merge (squash) August 8, 2024 19:29
@renjiezh renjiezh merged commit aabb17a into main Aug 8, 2024
4 checks passed
@renjiezh renjiezh deleted the renjiez-claim-task-debug branch August 8, 2024 19:43
ple13 pushed a commit that referenced this pull request Aug 16, 2024
@mikkokotila mikkokotila changed the title from "Resolve claim lock contention issue by separating transaction" to "fix: Resolve claim lock contention issue by separating transaction" Dec 23, 2024
4 participants