Cannot submit a custom training job to a VAI Persistent Resource #6285
Comments
I think you need to make a few changes in training_clients.py to use the create client, and create_custom_job needs to be updated as shown in the create_custom_job example code. Also, get_custom_job needs to be updated as shown here. Let us know if this works for you. Thank you! |
@singhniraj08 Thank you for responding quickly. It's unlikely I would have stumbled upon the In
Note that existing TFX code did:
I changed this to use the additional
In
in
I could not see anywhere else that needed to be changed. |
@singhniraj08 hi, not sure why this issue was closed - I did not knowingly close it. Thx for re-opening! |
If you already changed I've found some configurations [1] that need to be done when using a persistent resource like
It seems that you already specified the persistent_resource_id, but I have no idea whether Could you please check this? |
@briron hi, thanks for the reply. yes, have added
which I think aligns with the VAI Persistent Resource cluster we provisioned:
|
@adriangay |
@briron No, we don't normally set it. You think this is required? I can try that 😸 |
@adriangay Let's try. I'll investigate more apart from that. |
@briron added:
to
"Resources are insufficient..." job failure, then resource was acquired after a retry and training started. So I'm assuming I got lucky on the retry and this is not running on the persistent cluster. I have no direct way of observing where execution occurred other than the labels for the log messages:
|
@briron The logging to the VAI pipeline console UI does not show any of the logging I see in Stackdriver logs. All I see in VAI UI is:
The messages re: insufficient resources and retry are not there. But retry may be happening on other successful jobs and I wouldn't see them regardless of where it ran? |
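One way to narrow down the question above is to read the job back from the service and check whether the `persistent_resource_id` survived submission. The sketch below is only an illustration: `requested_persistent_resource`, `custom_job_name`, and `fetch_custom_job` are hypothetical helper names, the project, region, and job ID are placeholders, and it assumes the `google-cloud-aiplatform` package is installed. A non-empty value only shows what was requested; it does not prove where the job actually ran.

```python
from types import SimpleNamespace
from typing import Optional


def custom_job_name(project: str, region: str, job_id: str) -> str:
    """Build the full resource name of a Vertex AI custom job."""
    return f"projects/{project}/locations/{region}/customJobs/{job_id}"


def requested_persistent_resource(custom_job) -> Optional[str]:
    """Return the persistent_resource_id stored on a job, or None if it is unset.

    Works on any object exposing `job_spec.persistent_resource_id`. An unset
    proto string field reads back as an empty string, which maps to None here.
    """
    resource_id = getattr(custom_job.job_spec, "persistent_resource_id", "")
    return resource_id or None


def fetch_custom_job(project: str, region: str, job_id: str):
    """Read a custom job back through the v1beta1 API (needs credentials)."""
    # Imported lazily so the helpers above work without the library installed.
    from google.cloud import aiplatform_v1beta1

    client = aiplatform_v1beta1.JobServiceClient(
        client_options={"api_endpoint": f"{region}-aiplatform.googleapis.com"}
    )
    return client.get_custom_job(name=custom_job_name(project, region, job_id))


# Stand-in objects used only to show the helper's behaviour offline.
routed = SimpleNamespace(job_spec=SimpleNamespace(persistent_resource_id="my-persistent-resource"))
not_routed = SimpleNamespace(job_spec=SimpleNamespace(persistent_resource_id=""))
print(requested_persistent_resource(routed))
print(requested_persistent_resource(not_routed))
```

If the field reads back empty on a job that was submitted with it set, that would be consistent with the field being dropped somewhere between TFX and the service.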
@briron i've uploaded my modified |
If you're using VertexClient, it looks right. The job seems to be submitted well, but I have no idea how VAI works on its side. |
@briron ok, thank you for investigating |
System information
TFX Version (you are using): Applies to all versions up to latest 1.14.0
Environment in which you plan to use the feature: Google Cloud
Are you willing to contribute it (Yes/No): Yes
Do you have a workaround or are completely blocked by this? : blocked, can't get workaround to work
Name of your Organization (Optional): Sky/NBCU
Describe the feature and the current behavior/state.
TFX `tfx.extensions.google_cloud_ai_platform.training_clients.py` uses `google.cloud.aiplatform_v1`. Update TFX to use `google.cloud.aiplatform_v1beta1` or later.
Google Vertex AI has a new feature in preview - VAI Persistent Resource. This allows customers to reserve and use a cluster with GPU and appropriate CPU for model training. Using this feature is highly desirable because of the ongoing global GPU shortage, which causes very frequent 'stockout' errors ("resources insufficient in region"). These errors make custom training pipeline jobs fail, resulting in stale models. Creating the cluster works fine; submitting custom training jobs from TFX Trainer does not work.
The reason for this is that in order for the job to be submitted to a VAI Persistent Resource, a new field, `persistent_resource_id`, must be added to the `CustomJobSpec` provided on job submission. This was introduced at some point in `google.cloud.aiplatform_v1beta1` and is defined here: https://cloud.google.com/python/docs/reference/aiplatform/latest/google.cloud.aiplatform_v1beta1.types.CustomJobSpec
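As a rough illustration of the field described above, here is a `CustomJobSpec`-shaped dict with `persistent_resource_id` set next to `worker_pool_specs`. This is a sketch, not the snippet from the original report: `add_persistent_resource` is a hypothetical helper, and the project, image, machine type, accelerator, and resource ID are all placeholder values. The worker pool spec would presumably need to match the machine configuration of the provisioned persistent resource.

```python
import copy


def add_persistent_resource(job_spec: dict, persistent_resource_id: str) -> dict:
    """Return a copy of a CustomJobSpec-shaped dict with persistent_resource_id set.

    The input dict is left untouched so the same base spec can be reused for
    jobs that should not target the persistent resource.
    """
    if not persistent_resource_id:
        raise ValueError("persistent_resource_id must be a non-empty string")
    updated = copy.deepcopy(job_spec)
    updated["persistent_resource_id"] = persistent_resource_id
    return updated


# Placeholder values throughout; none of these come from the original report.
vertex_job_spec = {
    "project": "my-gcp-project",
    "worker_pool_specs": [
        {
            "machine_spec": {
                "machine_type": "n1-standard-8",
                "accelerator_type": "NVIDIA_TESLA_T4",
                "accelerator_count": 1,
            },
            "replica_count": 1,
            "container_spec": {
                "image_uri": "gcr.io/my-gcp-project/my-tfx-image:latest",
            },
        }
    ],
}

training_args = add_persistent_resource(vertex_job_spec, "my-persistent-resource")
print(training_args["persistent_resource_id"])
```

Keeping the field in a small helper makes it easy to switch a pipeline between on-demand capacity and the reserved cluster without editing the base spec.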
It must be added to the TFX Trainer `ai_platform_training_args` args like this:
This results in `ValueError: Protocol message CustomJobSpec has no "persistent_resource_id" field` on job submission:
Because TFX uses the `v1` API and `v1` `CustomJobSpec`.
In an attempt to patch packages in a TFX container to work around this, we reverse-engineered the code path and modified the TFX container we build to replace imports referencing `CustomJob` and `CustomJobSpec` in various places with:
While this fixes the `ValueError` and job submission now succeeds, the job is not routed to the persistent resource cluster. We think that the issue is that TFX `training_clients.py` is still 'calling' the `google.cloud.aiplatform_v1` API, so the Google service is just ignoring the extra field of the `v1beta1` `CustomJobSpec` we are passing?
We can see `gapic` is the API, and references to `gapic_version` being set, but don't really understand how that is selected or can be patched, if that is the issue now. If this is the case, we would appreciate some advice and guidance on what further patching on the TFX container would be required to enable `training_clients.py` to 'call' the `v1beta1` API.
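For reference, in the `google-cloud-aiplatform` Python package the API version is chosen by which module the client class is imported from, so 'calling' `v1beta1` means constructing `aiplatform_v1beta1.JobServiceClient` rather than the `aiplatform_v1` one. The following is a minimal sketch of a direct submission outside TFX, not the actual `training_clients.py` code: `regional_endpoint`, `parent_path`, and `submit_to_persistent_resource` are hypothetical names, and it assumes the package is installed and credentials are configured. Whether this alone gets the job routed to the persistent resource is not confirmed in this thread.

```python
def regional_endpoint(region: str) -> str:
    """Regional Vertex AI API endpoint, e.g. us-central1-aiplatform.googleapis.com."""
    return f"{region}-aiplatform.googleapis.com"


def parent_path(project: str, region: str) -> str:
    """Parent resource under which custom jobs are created."""
    return f"projects/{project}/locations/{region}"


def submit_to_persistent_resource(
    project: str,
    region: str,
    display_name: str,
    worker_pool_specs: list,
    persistent_resource_id: str,
):
    """Create a custom job through the v1beta1 client and message types."""
    # Imported lazily so the pure helpers above can be used without the library.
    # Both the client and the message types come from the v1beta1 module.
    from google.cloud import aiplatform_v1beta1

    client = aiplatform_v1beta1.JobServiceClient(
        client_options={"api_endpoint": regional_endpoint(region)}
    )
    custom_job = aiplatform_v1beta1.CustomJob(
        display_name=display_name,
        job_spec=aiplatform_v1beta1.CustomJobSpec(
            worker_pool_specs=worker_pool_specs,  # list of dicts is accepted
            persistent_resource_id=persistent_resource_id,
        ),
    )
    return client.create_custom_job(
        parent=parent_path(project, region), custom_job=custom_job
    )


print(regional_endpoint("us-central1"))
print(parent_path("my-gcp-project", "us-central1"))
```

Submitting one job this way, outside the pipeline, could help separate a TFX client problem from a problem with the persistent resource configuration itself.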