bug(collection DOI update): dataset stuck on initialize after updating collection #7396

Bento007 · 2024-12-06T17:47:10Z

Describe the bug

A curator updated the DOI on a private collection, which started the metadata update for its datasets. A batch job performs this update. A few datasets got stuck on initialize. The batch job showed this error:

error: DockerTimeoutError: Could not transition to created; timed out after waiting 4m0s

The error is related to ECR failing to create the docker container. This is not something we have direct control over. However, retry logic should be able to restart the job without any intervention. There is retry logic in the terraform for this batch job, but the failing batch job does not show any retry logic being applied.
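To illustrate the kind of retry logic meant here, below is a minimal sketch of an AWS Batch retry strategy in the shape that boto3's `register_job_definition` accepts as `retryStrategy`. It is not the project's terraform configuration. The attempt count, the function names, and the assumption that the `DockerTimeoutError` text surfaces in either the job's status reason or the container reason are all hypothetical. The `action_for` helper is only a local approximation of how Batch evaluates the rules, included so the sketch can be checked without calling AWS.

```python
def build_retry_strategy(attempts: int = 3) -> dict:
    """Build a retryStrategy dict for an AWS Batch job definition.

    Assumption: the transient error text starts with "DockerTimeoutError"
    in the status reason or the container reason. Verify against a real
    failed job before relying on these patterns.
    """
    if not 1 <= attempts <= 10:
        raise ValueError("AWS Batch allows between 1 and 10 attempts")
    return {
        "attempts": attempts,
        "evaluateOnExit": [
            # Retry when the job-level status reason reports the timeout.
            {"onStatusReason": "DockerTimeoutError*", "action": "RETRY"},
            # Retry when the container-level reason reports the timeout.
            {"onReason": "DockerTimeoutError*", "action": "RETRY"},
            # Any other failure exits without retrying.
            {"onReason": "*", "action": "EXIT"},
        ],
    }


def _matches(pattern: str, value: str) -> bool:
    # Local approximation: a trailing "*" means prefix match, else exact match.
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return value == pattern


def action_for(strategy: dict, status_reason: str = "", reason: str = "") -> str:
    """Approximate which action the first matching rule would select."""
    for rule in strategy["evaluateOnExit"]:
        checks = []
        if "onStatusReason" in rule:
            checks.append(_matches(rule["onStatusReason"], status_reason))
        if "onReason" in rule:
            checks.append(_matches(rule["onReason"], reason))
        if checks and all(checks):
            return rule["action"]
    # When no rule matches, Batch retries the job.
    return "RETRY"
```

The catch-all `EXIT` rule is a design choice: it keeps genuine application failures from being retried, so only the transient error consumes extra attempts.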

Workaround

If this is encountered, the batch job can be restarted manually by cloning the job for each stuck dataset.

Expected behavior

The job is retried on the above error.
A nice-to-have would be a program that checks for stuck or failed dataset_metadata_update batch jobs and sets the status of the affected datasets to an appropriate error message.
- There are no CloudWatch error messages, since this was a transient AWS error, so alerts and metrics will not tell us that the job failed.
- A step function could be used, but is likely overkill
- Adjusting the retry logic in the batch job is the best solution as long as it can catch these transient AWS errors.
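The checker described above could be sketched roughly as follows. It takes job summaries shaped like the `jobSummaryList` entries returned by boto3's Batch `list_jobs` call (`jobId`, `status`, `statusReason`, `createdAt` in milliseconds) and reports jobs that failed or have sat in a non-terminal state for too long. The function name, the one-hour threshold, and the message format are assumptions for illustration. Fetching the jobs and writing the message to the dataset status are project-specific and not shown.

```python
import time
from typing import Iterable, List, Optional, Tuple

# AWS Batch job states that are not yet terminal.
NON_TERMINAL = {"SUBMITTED", "PENDING", "RUNNABLE", "STARTING", "RUNNING"}


def find_problem_jobs(
    job_summaries: Iterable[dict],
    max_age_seconds: int = 3600,  # hypothetical threshold for "stuck"
    now_ms: Optional[int] = None,
) -> List[Tuple[str, str]]:
    """Return (jobId, message) pairs for failed or apparently stuck jobs."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    problems = []
    for job in job_summaries:
        status = job.get("status")
        if status == "FAILED":
            reason = job.get("statusReason", "unknown reason")
            problems.append((job["jobId"], f"failed: {reason}"))
        elif status in NON_TERMINAL:
            # createdAt is a Unix timestamp in milliseconds.
            age_seconds = (now_ms - job["createdAt"]) // 1000
            if age_seconds > max_age_seconds:
                problems.append(
                    (job["jobId"], f"stuck in {status} for {age_seconds}s")
                )
    return problems
```

Each returned message could then be used as the error status for the dataset tied to that job, so that it no longer appears to be initializing indefinitely.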

Environment

First discovered in 46431eb

Additional Context

One of the failed batch jobs

Bento007 added bug Someone made a missteak... P2 Priority 2 - Improvement with narrower impact, fix within a month labels Dec 6, 2024
