fix: Validate the embeddings size to catch silent embeddings batch failures #103
base: main
With a larger batch size for `add_documents` (e.g. 1000), the embeddings service may silently fail and return nothing for some entries. This leads to a rather cryptic error:

```
    [values_dict[key][i] for key in values_dict]
     ~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
```

This PR adds size validation and a suggestion on how to remedy the failure.
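For readers skimming the thread, the kind of check described above might look roughly like the sketch below. It is not the code from this PR: the function name `validate_embeddings` and the error wording are hypothetical, and it assumes a dropped entry shows up either as a shorter result list or as an empty/`None` vector.

```python
from typing import Optional, Sequence


def validate_embeddings(
    texts: Sequence[str],
    embeddings: Sequence[Optional[Sequence[float]]],
) -> None:
    """Raise a descriptive error if embeddings do not line up with the inputs.

    Hypothetical helper for illustration; not the library's actual API.
    """
    # Case 1: the service returned fewer (or more) vectors than inputs.
    if len(embeddings) != len(texts):
        raise ValueError(
            f"Expected {len(texts)} embeddings but received {len(embeddings)}. "
            "The embeddings service may have dropped entries; "
            "try a smaller batch size."
        )

    # Case 2: the counts match, but some vectors came back empty.
    empty = [i for i, vec in enumerate(embeddings) if vec is None or len(vec) == 0]
    if empty:
        raise ValueError(
            f"{len(empty)} of {len(texts)} embeddings are empty "
            f"(first empty index: {empty[0]}); try a smaller batch size."
        )
```

Running a check like this before building the per-row values turns the `IndexError` into a message that points at the batch size.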
/gcbrun
@aperepel, thank you for raising the pull request. Question: your PR adds a validation check, which is certainly needed. On a separate note: if embedding fails for large batches, have you tried dividing the input into smaller batches and generating the embeddings per batch?
Changes look good to me.
Yes, that's what we've been doing. With a larger batch (usually around 1000 items), the embeddings service dropped roughly 20-25% of the entries. A batch size of 100, however, is perfectly stable. We use retry logic everywhere we run micro-batches, but if you were asking about proactively splitting a batch after a failure, that would probably mean too many LLM calls, since we must retry a whole split.
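A minimal sketch of the micro-batching-with-retry approach described in this comment, for illustration only. The names `embed_in_batches`, `embed_fn`, `max_attempts` and `backoff_seconds` are hypothetical, and `embed_fn` stands in for whatever call reaches the embeddings service; the default batch size of 100 is taken from the comment above.

```python
import time
from typing import Callable, List, Sequence

Vector = List[float]


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[Vector]],  # hypothetical service call
    batch_size: int = 100,
    max_attempts: int = 3,
    backoff_seconds: float = 1.0,
) -> List[Vector]:
    """Embed texts in fixed-size micro-batches, retrying incomplete batches."""
    if batch_size <= 0 or max_attempts <= 0:
        raise ValueError("batch_size and max_attempts must be positive")

    results: List[Vector] = []
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        for attempt in range(max_attempts):
            vectors = embed_fn(batch)
            # Accept the batch only if every input got a non-empty vector.
            complete = len(vectors) == len(batch) and all(
                vec is not None and len(vec) > 0 for vec in vectors
            )
            if complete:
                results.extend(vectors)
                break
            # Exponential backoff before retrying the whole batch.
            if attempt < max_attempts - 1:
                time.sleep(backoff_seconds * (2 ** attempt))
        else:
            # Every attempt came back incomplete.
            raise RuntimeError(
                f"Batch starting at index {start} was still incomplete "
                f"after {max_attempts} attempts"
            )
    return results
```

Retrying the whole micro-batch keeps the logic simple; with small batches the cost of a repeated call stays bounded, which matches the trade-off mentioned above.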
Tests are failing due to code coverage. Can you add tests for your changes?