Clarify job fails in spark mode #127
Comments
Thank you for your question! For batch size in endpoint calls, Clarify has a system for figuring out the optimal batch size. If all else fails, we will end up with a single instance per request. I am curious why you think your endpoint calls failed, and I would like to understand your concern. Is it that your Clarify jobs are failing, or that the job is slower than you expected? If they are failing, can you please share the error stack? If it is just the logs, you generally shouldn't have to worry about it. Also, you can specify the
When you enable Spark integration, could you also increase the shadow endpoint instance count (instance_count parameter of sagemaker.clarify.ModelConfig)? We recommend a 1:1 ratio between processing instance count and endpoint instance count. This repo (amazon-sagemaker-clarify) is one of the core libraries used by the SageMaker Clarify processing container, and the SageMakerClarifyProcessor API is designed to launch that container, which is Amazon proprietary. If you want to launch your own processing container, then the generic Processor API is a better choice.
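A minimal sketch of the 1:1 recommendation above, for illustration only. The helper `matched_instance_kwargs`, the instance types, and the model name are hypothetical; the commented lines assume the SageMaker Python SDK is installed and that `role` and `session` are already defined.

```python
# Sketch: derive the processing and endpoint instance counts from one number
# so the two cannot drift apart. Instance types here are placeholders.

def matched_instance_kwargs(instance_count,
                            processing_instance_type="ml.m5.xlarge",
                            endpoint_instance_type="ml.m5.xlarge"):
    """Return keyword arguments for the processor and the model config
    that share the same instance count (1:1 ratio)."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    processor_kwargs = {
        "instance_count": instance_count,
        "instance_type": processing_instance_type,
    }
    model_kwargs = {
        "instance_count": instance_count,
        "instance_type": endpoint_instance_type,
    }
    return processor_kwargs, model_kwargs


processor_kwargs, model_kwargs = matched_instance_kwargs(2)

# With the SDK available, the dictionaries would be passed on like this:
# from sagemaker import clarify
# processor = clarify.SageMakerClarifyProcessor(
#     role=role, sagemaker_session=session, **processor_kwargs)
# model_config = clarify.ModelConfig(
#     model_name="my-model",  # placeholder
#     accept_type="text/csv", content_type="text/csv", **model_kwargs)
```

Deriving both counts from one value is just a convenience; setting the two instance_count arguments by hand to the same number has the same effect.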
Thanks very much @keerthanvasist and @xgchena, I will follow up on your posts soon to provide more details.
Thanks for that swift response (I wish it was like that in all AWS (SageMaker) repos). My Clarify job worked when I used an instance count of 1 for both the processing job and the endpoint. I got various warnings in the log saying the batch size was reduced:
The Clarify job completed but was slow (~17h). Checking in CloudWatch, the model latency never exceeded one minute and was in the range of 250k microseconds (about 0.25 s). Increasing the number of workers in the endpoint did not speed up the calculation but reduced the CPU load on each worker (judging from CloudWatch metrics). This made me believe that the requests are blocking and that the single processing worker sends requests to the different endpoint instances one at a time. Hence, the next natural step was to increase the number of workers in the processing job. According to the docs, this means that Apache Spark is now leveraged, and it's recommended to use a 1:1 ratio for endpoint and processing instances (as @xgchena just pointed out above):
I did that (e.g. 2 workers on both ends), and some variations of it, but I always get errors there too (even after batch size is reduced to 1, while latency did not go down) and the job did not complete. I believe the relevant errors (from the below logs) are always of this form:
The endpoints had many errors of this form:
I checked CloudWatch and I see that the average model latency is much higher with Spark enabled. The maximum latency quickly goes to 60s, which I believe is probably causing a timeout. Below I plot the average latency. I also noted that the warnings about maximal payload don't occur. This led me to believe that I should choose a more powerful machine behind the endpoint to bring down the latency. I tried to replace my initial value Logs
Here's my Python script (won't run on your machine because of account- and machine-specific dependencies): It would be great if you could help me solve this.
Noted. But I think my use case is just the regular use case, so I prefer to understand why it fails instead of building my own solution with sagemaker processing.
Thanks, but without open-sourcing the container, I don't think it makes sense to use this. Also, I think my use case is the regular use case, so it should be fixed upstream instead of building my own container.
As I hope becomes clear from the above description, it fails (the SageMaker job does not complete).
Thanks Lorenz for the great description. We are looking at this with the team and will get back to you shortly. Just to let you know, you can use ANALYZER_USE_SPARK=1 in the environment to use the Spark implementation even on a single instance if you want.
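A rough sketch of how that variable could be set from the SageMaker Python SDK. The helper `clarify_env` is hypothetical, and the commented call assumes that the processor's `env` argument is passed through to the processing container and that `role` is already defined.

```python
# Sketch: build the container environment that switches on the Spark
# implementation, using the variable name given in the comment above.

def clarify_env(use_spark):
    """Return environment variables for the Clarify processing container."""
    return {"ANALYZER_USE_SPARK": "1"} if use_spark else {}


env = clarify_env(use_spark=True)

# With the SDK available, the environment would be handed to the processor:
# from sagemaker import clarify
# processor = clarify.SageMakerClarifyProcessor(
#     role=role,
#     instance_count=1,              # single instance, Spark still enabled
#     instance_type="ml.m5.xlarge",  # placeholder
#     env=env)
```

This may help to separate the effect of the Spark code path from the effect of adding processing instances when comparing latencies.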
Thanks @larroy, appreciate it. I updated the title to reflect the discussion.
Cross-reference https://tiny.amazon.com/8lwa4yrv
Thanks for this project. For my project, I'd need to configure some elements of the Clarify processing, which would require the respective Dockerfiles to be available for modification. More concretely, I am facing timeouts in the endpoint calls due to a very high max batch size/max payload and a slow model, but only when Apache Spark integration is used, i.e. instance_count > 1. In that case, the max payload is for some reason much higher than when Spark integration is disabled, leading to longer response times for a batch. Choosing more instances, or a bigger or more powerful instance, for the endpoint does not solve the problem. Can you open-source the Dockerfiles? This would be very beneficial.
In addition, sagemaker.clarify.SageMakerClarifyProcessor() should accept an optional image_uri argument so I can supply my custom image, but that I can also solve myself by forking the SageMaker SDK and creating a PR.