Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Validate the embeddings size to catch silent embeddings batch failures #103

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

aperepel
Copy link
Contributor

@aperepel aperepel commented Oct 9, 2024

With a larger batch size for add_documents (e.g. 1000), the embeddings service may silently fail and return nothing for some entries. This lead to the more cryptic error:

[values_dict[key][i] for key in values_dict]
     ~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range

Add additional size validation and a suggestion on how to remedy.

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

With a larger batch size for `add_documents` (e.g. 1000), the embeddings service may silently fail and return nothing for some entries. This lead to the more cryptic error:
```
[values_dict[key][i] for key in values_dict]
     ~~~~~~~~~~~~~~~~^^^
IndexError: list index out of range
```
Add additional size validation and a suggestion on how to remedy.
@aperepel aperepel requested review from a team as code owners October 9, 2024 20:12
@product-auto-label product-auto-label bot added the api: spanner Issues related to the googleapis/langchain-google-spanner-python API. label Oct 9, 2024
@averikitsch
Copy link
Collaborator

/gcbrun

@averikitsch averikitsch changed the title Validate the embeddings size to catch silent embeddings batch failures fix: Validate the embeddings size to catch silent embeddings batch failures Oct 10, 2024
@gauravpurohit06
Copy link
Contributor

@aperepel, thank you raising the pull request. Question: In your PR you are adding a validation check, which is required for sure. On a separate note: Have you tried diving it into batches and then generating the embeddings if it's failing for large batches?

Copy link
Contributor

@gauravpurohit06 gauravpurohit06 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good to me.

@aperepel
Copy link
Contributor Author

@aperepel, thank you raising the pull request. Question: In your PR you are adding a validation check, which is required for sure. On a separate note: Have you tried diving it into batches and then generating the embeddings if it's failing for large batches?

Yes, that's what we've been doing. On a larger batch (usually around 1000 items), the embeddings service returned ~20-25% loss. The 100 size batch works perfectly stable, however.

We do use retry logic everywhere we run micro-batches, but if you were asking about pro-actively splitting it after a failure - that's probably too many LLM calls, as we must retry a whole split.

@averikitsch
Copy link
Collaborator

Tests are failing due to code coverage. Can you add a tests for your changes?

@kurtisvg
Copy link
Collaborator

kurtisvg commented Nov 1, 2024

@gauravpurohit06 @vishwarajanand FYI

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: spanner Issues related to the googleapis/langchain-google-spanner-python API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants