bug(collection DOI update): dataset stuck on initialize after updating collection #7396

Open
Bento007 opened this issue Dec 6, 2024 · 0 comments
Labels
bug Someone made a missteak... P2 Priority 2 - Improvement with narrower impact, fix within a month

Bento007 commented Dec 6, 2024

Describe the bug

A curator updated the DOI on a private collection, which started the metadata update for its datasets. A batch job performs this update. A few datasets got stuck on initialize. The batch job showed this error:

error: DockerTimeoutError: Could not transition to created; timed out after waiting 4m0s

The error is related to ECR failing to create the Docker container. This is not something we have direct control over. However, retry logic should be able to restart the job without any intervention. There is retry logic in the Terraform for this batch job, but the failing batch job does not show any retry logic being applied.

Work Around

If this is encountered, the batch job can be restarted manually by cloning the job for each stuck dataset.
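The manual clone could also be scripted. The sketch below assumes a boto3 Batch client is passed in as `batch` and that the original job's queue, definition, parameters, command, and environment are enough to reproduce it; `clone_job` is a hypothetical helper, not existing project code.

```python
def clone_job(batch, job_id):
    """Resubmit a Batch job using the settings of an existing job (hypothetical helper).

    `batch` is expected to behave like boto3.client("batch").
    """
    jobs = batch.describe_jobs(jobs=[job_id])["jobs"]
    if not jobs:
        raise LookupError(f"Batch job {job_id} not found")
    job = jobs[0]

    container = job.get("container", {})
    overrides = {}
    if container.get("command"):
        overrides["command"] = container["command"]
    # Variable names starting with AWS_BATCH are reserved by Batch, so skip them.
    env = [
        e for e in container.get("environment", [])
        if not e["name"].startswith("AWS_BATCH")
    ]
    if env:
        overrides["environment"] = env

    response = batch.submit_job(
        jobName=job["jobName"],
        jobQueue=job["jobQueue"],
        jobDefinition=job["jobDefinition"],
        parameters=job.get("parameters", {}),
        containerOverrides=overrides,
    )
    return response["jobId"]
```

Usage would look like `clone_job(boto3.client("batch"), "<stuck-job-id>")`, once per stuck dataset. Batch only returns details for recently finished jobs, so this needs to run soon after the failure.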

Expected behavior

  • The job is retried on the above error.
  • A nice-to-have would be a program that checks for stuck or failed dataset_metadata_update batch jobs and sets the status of the affected datasets to an appropriate error message.
    • There are no CloudWatch error messages because this was a transient AWS error, so alerts and metrics will not tell us that it failed.
    • A step function could be used, but is likely overkill.
    • Adjusting the retry logic in the batch job is the best solution, as long as it can catch these transient AWS errors.
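As a rough illustration of the nice-to-have check, the sketch below lists failed jobs in a queue and collects the reason for each. It assumes a boto3 Batch client is passed in as `batch` and that the relevant job names start with `dataset_metadata_update`; the helper name, the prefix, and the returned shape are all hypothetical, and mapping a job back to its dataset and setting the dataset status is left out because that depends on application code not shown here.

```python
def find_failed_jobs(batch, job_queue, name_prefix="dataset_metadata_update"):
    """Collect failed Batch jobs whose name starts with name_prefix (hypothetical helper).

    `batch` is expected to behave like boto3.client("batch").
    """
    failed = []
    kwargs = {"jobQueue": job_queue, "jobStatus": "FAILED"}
    while True:
        page = batch.list_jobs(**kwargs)
        for summary in page.get("jobSummaryList", []):
            if not summary.get("jobName", "").startswith(name_prefix):
                continue
            # Prefer the job-level reason, then fall back to the container-level one.
            reason = (
                summary.get("statusReason")
                or summary.get("container", {}).get("reason")
                or "unknown"
            )
            failed.append({
                "jobId": summary["jobId"],
                "jobName": summary["jobName"],
                "reason": reason,
            })
        token = page.get("nextToken")
        if not token:
            return failed
        kwargs["nextToken"] = token
```

A scheduled task could call this and write the returned `reason` to each affected dataset as its error message. Jobs that stay in a non-terminal state would need a separate query on the other job statuses plus an age threshold.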

Environment

First discovered in 46431eb.

Additional Context
