A curator updated the DOI on a private collection, which started the metadata update for its datasets. A batch job performs this update. A few datasets got stuck on initialize. The batch job showed this error:
error: DockerTimeoutError: Could not transition to created; timed out after waiting 4m0s
The error is related to ECR failing to create the docker container. This is not something we have direct control over. However, retry logic should be able to restart the job without any intervention. There is retry logic in the terraform for this batch job, but the failing batch job does not show any retry logic.
Work Around
If this is encountered, the batch job can be restarted manually by cloning the job for each stuck dataset.
Expected behavior
The job is retried on the above error.
A nice-to-have would be a program that checks for stuck or failed dataset_metadata_update batch jobs and sets the status of the affected datasets to an appropriate error message.
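A minimal sketch of such a check, assuming the jobs run on AWS Batch and that a boto3-style Batch client is passed in. The function names, the queue name in the usage note, and the choice to match on the `DockerTimeoutError` text are illustrative assumptions, not existing code. It only finds jobs that reached FAILED; detecting jobs that are stuck in another state, and updating the dataset status, would need application-specific logic that is not shown.

```python
from typing import Any, Dict, Iterator, List

# Marker taken from the error text reported in this issue.
TRANSIENT_MARKER = "DockerTimeoutError"


def iter_failed_jobs(batch_client: Any, job_queue: str) -> Iterator[Dict[str, Any]]:
    """Yield summaries of FAILED jobs in a queue, following pagination."""
    kwargs: Dict[str, Any] = {"jobQueue": job_queue, "jobStatus": "FAILED"}
    while True:
        page = batch_client.list_jobs(**kwargs)
        yield from page.get("jobSummaryList", [])
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token


def find_transient_failures(
    batch_client: Any, job_queue: str, marker: str = TRANSIENT_MARKER
) -> List[Dict[str, str]]:
    """Return failed jobs whose status or container reason contains the marker."""
    hits: List[Dict[str, str]] = []
    for job in iter_failed_jobs(batch_client, job_queue):
        # The reason may be reported at the job level or at the container level.
        reasons = [
            job.get("statusReason") or "",
            (job.get("container") or {}).get("reason") or "",
        ]
        matched = next((r for r in reasons if marker in r), None)
        if matched is not None:
            hits.append(
                {"jobId": job["jobId"], "jobName": job["jobName"], "reason": matched}
            )
    return hits
```

It could be called as `find_transient_failures(boto3.client("batch"), "my-job-queue")`, where the queue name is a placeholder. Each returned entry identifies a job whose dataset could then be marked with an error status or resubmitted.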
There are no CloudWatch error messages, since this was a transient AWS error, so alerts and metrics will not tell us the job failed.
A step function could be used, but is likely overkill.
Adjusting the retry logic in the batch job is the best solution as long as it can catch these transient AWS errors.
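As a rough illustration of what such retry logic could look like, the sketch below builds an AWS Batch retry strategy that retries only when the failure reason starts with `DockerTimeoutError` and exits on anything else. The helper name and the default attempt count are assumptions, and whether Batch reports this error in the container reason (`onReason`) or the job status reason (`onStatusReason`) is not confirmed here, so both are listed and should be checked against a real failed job.

```python
from typing import Any, Dict


def build_retry_strategy(attempts: int = 3) -> Dict[str, Any]:
    """Build a Batch retryStrategy that retries only the transient Docker timeout."""
    # AWS Batch accepts between 1 and 10 attempts.
    if not 1 <= attempts <= 10:
        raise ValueError("attempts must be between 1 and 10")
    return {
        "attempts": attempts,
        "evaluateOnExit": [
            # Assumption: the error text appears in one of these two fields.
            {"onReason": "DockerTimeoutError*", "action": "RETRY"},
            {"onStatusReason": "DockerTimeoutError*", "action": "RETRY"},
            # Catch-all: any other failure exits instead of being retried.
            {"onReason": "*", "action": "EXIT"},
        ],
    }
```

The resulting dictionary has the shape expected by the `retryStrategy` parameter of the boto3 Batch `submit_job` and `register_job_definition` calls. The equivalent in terraform would be a `retry_strategy` block with `evaluate_on_exit` entries on the job definition. It would also be worth confirming that the deployed job definition actually carries the strategy, since the failing job showed none.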
Environment
first discovered in 46431eb
Additional Context