add 2tb gpu workers for translations (#48)
We're seeing out-of-disk-space failures in https://firefox-ci-tc.services.mozilla.com/tasks/DQeRyr1_TjmXhC0Z-5KWWw/runs/1, which we suspect are caused by an OpusTrainer bug. In the short term, we're going to work around this by adding 2 TB workers. In the medium term, we'll fix the root cause and remove these.

Disks are pretty cheap (and certainly a tiny percentage of our translations spend), so while this is not ideal, it's not hugely expensive to do.
bhearsum authored Jul 31, 2024
1 parent dfac9f8 commit 5653aac
Showing 1 changed file with 35 additions and 0 deletions.
35 changes: 35 additions & 0 deletions worker-pools.yml
@@ -1771,6 +1771,41 @@ pools:
          guestAccelerators:
            - acceleratorCount: 4
              acceleratorType: nvidia-tesla-v100
  - pool_id: '{pool-group}/b-linux-v100-gpu-4-2tb'
    description: Worker for machine learning and other high GPU tasks
    owner: release+tc-workers@mozilla.com
    variants:
      - pool-group: translations-1
    email_on_error: true
    provider_id:
      by-chain-of-trust:
        trusted: fxci-level3-gcp
        default: fxci-level1-gcp
    config:
      worker-config:
        genericWorker:
          config:
            # 2592000s is 30 days.
            maxTaskRunTime: 2592000
            enableInteractive: true
      minCapacity: 0
      # We use 4 GPUs per instance across 4 regions, with a limit of 128 GPUs
      # per region at any given time. 4 regions * 128 GPUs = 512 total GPUs.
      # 512 GPUs / 4 per instance = 128 instances possibly running at once.
      maxCapacity: 128
      implementation: generic-worker/worker-runner-linux
      regions: [us-central1, us-west1, us-east1, europe-west4]
      image: monopacker-translations-worker
      instance_types:
        - minCpuPlatform: Intel Skylake
          disks:
            - <<: *persistent-disk
              diskSizeGb: 2048
          # 40 CPUs, 256GB RAM
          machine_type: n1-custom-40-262144
          guestAccelerators:
            - acceleratorCount: 4
              acceleratorType: nvidia-tesla-v100
  - pool_id: 'translations-1/b-linux-aerickson-test'
    description: Worker for testing new Translations images.
    owner: aerickson@mozilla.com
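
The numbers in the new pool definition can be sanity-checked with a little arithmetic. The sketch below is illustrative only and is not part of the commit. It assumes the "limit of 128 per region" in the comment refers to GPUs, and it relies on the GCE custom machine type naming scheme (family, `custom`, vCPU count, memory in MB). The function names are hypothetical helpers introduced for this example.

```python
import re

# Values copied from the pool definition in the diff.
REGIONS = ["us-central1", "us-west1", "us-east1", "europe-west4"]
GPUS_PER_INSTANCE = 4          # acceleratorCount
GPU_LIMIT_PER_REGION = 128     # assumption: the comment's limit counts GPUs
MAX_CAPACITY = 128             # maxCapacity
MACHINE_TYPE = "n1-custom-40-262144"
DISK_SIZE_GB = 2048            # diskSizeGb


def max_instances(gpu_limit_per_region, regions, gpus_per_instance):
    """Return how many instances fit within the combined regional GPU limits."""
    total_gpus = gpu_limit_per_region * len(regions)
    return total_gpus // gpus_per_instance


def parse_custom_machine_type(machine_type):
    """Split a GCE custom machine type name into (vCPUs, memory in GB)."""
    match = re.fullmatch(r"[a-z0-9]+-custom-(\d+)-(\d+)", machine_type)
    if match is None:
        raise ValueError(f"not a custom machine type: {machine_type!r}")
    vcpus = int(match.group(1))
    memory_mb = int(match.group(2))
    return vcpus, memory_mb / 1024


if __name__ == "__main__":
    instances = max_instances(GPU_LIMIT_PER_REGION, REGIONS, GPUS_PER_INSTANCE)
    vcpus, memory_gb = parse_custom_machine_type(MACHINE_TYPE)
    print(f"instances allowed by GPU limits: {instances}")
    print(f"maxCapacity within limits: {MAX_CAPACITY <= instances}")
    print(f"machine: {vcpus} vCPUs, {memory_gb:g} GB RAM")
    print(f"disk: {DISK_SIZE_GB / 1024:g} TB")
```

Under these assumptions, the 512 total GPUs divided by 4 per instance gives the 128 used for `maxCapacity`, and `n1-custom-40-262144` decodes to the 40 CPUs and 256 GB RAM noted in the comment.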