Adding Metadata To Embeddings #75
Comments
Hi David! For the layer zip files, those are just a way to pre-package the layer as we have it defined here and then move the source into a region of your choice, ideally for network-isolated environments where we can't pull dependencies on the fly.
As for what you'd like to do, it sounds like we may need to update the RAG API itself (and we welcome pull requests against the develop branch 🎉 ). If you are willing to hack on LISA to add this, my first guess would be around this area: https://github.com/awslabs/LISA/blob/develop/lambda/repository/lambda_functions.py
Specifically, this function is what we call to generate the initial embeddings: https://github.com/awslabs/LISA/blob/develop/lambda/repository/lambda_functions.py#L125-L151
and then similaritySearch is doing the embedding call for the prompt text: https://github.com/awslabs/LISA/blob/develop/lambda/repository/lambda_functions.py#L80-L107
We're using LangChain under the hood, and we've created a form of LangChain-compatible OpenAI binding for embeddings specifically over here: https://github.com/awslabs/LISA/blob/develop/lisa-sdk/lisapy/langchain.py#L102-L153 (ignore other things in the file, there are some unused clients that we need to clean up 😬 )
So if there's a solution you had in mind or could point us in a direction to help with, I think these would be the best starting points. I'm not sure if this answers your question or helps guide in a direction, so please let me know! |
This makes sense and is helpful. I imagine that folks won't want all S3 metadata translated to embeddings. Do you think it would make sense to check for a prefix, a la: if S3 object metadata is prefixed with _lisa_ (or something), then it is translated to vector metadata? I figure it's worth asking before heading down the wrong path.
(In reply to Peter Muller's comment of Mon, Sep 9, 2024, 4:21 PM.)
|
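For illustration, the prefix check proposed above might look something like the following. This is only a sketch and not LISA code: the `_lisa_` prefix comes from the "or something" suggestion in the thread, and `select_vector_metadata` is a hypothetical helper name.

```python
from typing import Mapping

# Hypothetical prefix; the thread only suggests "_lisa_ (or something)".
LISA_METADATA_PREFIX = "_lisa_"


def select_vector_metadata(
    s3_metadata: Mapping[str, str],
    prefix: str = LISA_METADATA_PREFIX,
) -> dict[str, str]:
    """Keep only the S3 metadata entries whose key starts with `prefix`.

    The prefix is stripped from the returned keys so the vector metadata
    does not carry the opt-in marker around with it.
    """
    return {
        key[len(prefix):]: value
        for key, value in s3_metadata.items()
        if key.startswith(prefix) and len(key) > len(prefix)
    }


# Only the two prefixed entries are selected; "uploaded-by" is ignored.
example = {"_lisa_department": "legal", "_lisa_year": "2024", "uploaded-by": "david"}
selected = select_vector_metadata(example)
```

Making the prefix a parameter keeps the opt-in convention configurable, since the exact marker is still an open question here.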
I could see the possibility of adding some fields to the related APIs to add another map to the requests, such that those will contain the additional metadata. The metadata is attached at the Document (https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) level, so we could possibly make it part of the API per document, or, if we assume a list of files already in S3, there's the possibility for us to edit the processing function to add more metadata than just the document location over here: https://github.com/awslabs/LISA/blob/develop/lambda/utilities/file_processing.py#L146
So for your suggestion, would the LISA prefix be related to the metadata already on the S3 object? As in something along the lines of:
1. Upload file to S3 with Object metadata attached
2. Use LISA ingestion to consume / embed files
3. Per file, check if there's S3 metadata (optionally: and check if the metadata is prefixed with a LISA-known prefix)
4. Add metadata to metadata dictionary that is processed along with the Document object
5. Metadata is now returned with the document text for requested vectors
Is this the workflow you're thinking of? |
Yeah, that was my first thought. I'm not sure if tying the vector metadata to S3 metadata is out of line with the goals of the project for some reason, but unless you're averse I can put it in a PR.
(In reply to Peter Muller's comment of Mon, Sep 9, 2024, 7:25 PM.)
|
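A rough sketch of steps 3 and 4 of the workflow listed above, under some loud assumptions: the `Document` class here is a local stand-in for LangChain's `Document` so the snippet runs without dependencies, `build_document` is a hypothetical helper, the `_lisa_` prefix and the `"source"` key for the document location are guesses, and the response dict only mimics the shape of a boto3 `head_object` result (user-defined metadata under the `"Metadata"` key).

```python
from dataclasses import dataclass, field
from typing import Any, Mapping


@dataclass
class Document:
    """Stand-in for LangChain's Document: text plus a metadata dict."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def build_document(
    text: str,
    source: str,
    head_object_response: Mapping[str, Any],
    prefix: str = "_lisa_",  # hypothetical LISA-known prefix
) -> Document:
    # Step 3: read the user-defined S3 metadata, if there is any.
    s3_metadata = head_object_response.get("Metadata", {})
    extra = {
        key[len(prefix):]: value
        for key, value in s3_metadata.items()
        if key.startswith(prefix) and len(key) > len(prefix)
    }
    # Step 4: merge into the metadata dictionary processed with the Document.
    # The document location is written last so S3 metadata cannot override it.
    return Document(page_content=text, metadata={**extra, "source": source})


# A dict shaped like a head_object response, so no AWS call is needed here.
response = {"Metadata": {"_lisa_department": "legal", "owner": "david"}}
doc = build_document("Policy text", "s3://example-bucket/policy.pdf", response)
```

In a real ingestion path the response would come from the S3 client rather than a literal, and step 5 follows for free if the vector store returns the `Document` metadata with each hit.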
I had to step away for the past couple of days, but I'm circling back to where I was originally when looking through this and trying to figure out a path to the RAG functionality I need. I understand that the layer zip files are included in the config optionally for network isolation. Could they not also serve to replace the RAG functionality, if that is my end goal? What I need out of RAG is pretty boutique, and I doubt it will be useful to other LISA users (it would likely include custom embedding generation logic that is specific to the shape of specific documents). So I figure any contribution I make here would end up looking like: place custom functionality somewhere (likely in the form of a Lambda) and use it to replace some part or all of the RAG API. I am questioning whether this already exists in plain sight or whether I am missing something. |
No worries at all! I've been thinking on this one for a little bit too, and I think the main issue in our way is that our implementation of the RAG feature is fairly limited from the UI. Direct invocation via curl command or similar isn't really documented, but as I'm staring at it, I can see that it is possible to upload a custom list of keys to the RAG store, so long as they exist in the LISA-provided document bucket (which is also something that we could edit to be user-provided). And with that, we could then provide additional metadata as part of the ingest_documents request. Several routes to go from here, but possible ones:
And to answer your question, yes, the RAG layer could be used that way, but then it's a lot harder for us to support it or improve on our existing things. I would say even based on all of this, we would still welcome a pull request with your ideas in it, and we can work to find the best path forward on it. If the goal for now is to just make a utility outside of the Chat UI to ingest documents with metadata, I think that backwards-compatible changes to the repository API would be fine (as long as it doesn't break the current functionality then I'm good 👍 ) Some points of interest for that:
Just some ideas and totally not prescriptive by any means! |
Great ideas, Peter! I think these are great ways to get to the goal I expressed of adding metadata to vector embeddings. I think I may have convoluted the thread with a second, related goal, the one that motivated this comment: #75 (comment). I am having a harder time thinking through how to add it in a way that could be useful to the broader LISA community. I'll leave it here in case you have thoughts, but I recognize it should be in another ticket, and I think I have the information I was seeking about metadata creation. Basically, I would like to be able to use boutique embedding creation logic so that I could parse a document and include some a priori knowledge about its shape in the embedding creation process, so that I can, for instance, inject a title and subheading for each chunk generated from a section in a policy document. Looking through the codebase, I believe that would require replacing the routine here: LISA/lambda/utilities/file_processing.py Line 59 in 0e824eb
|
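To make the title and subheading idea above concrete, here is a minimal sketch of shape-aware chunking. It is not the routine in file_processing.py; `Chunk` and `chunk_policy_document` are hypothetical names, and the document shape (first non-empty line is the title, lines starting with `## ` are subheadings) is purely an assumption for the example.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    metadata: dict


def chunk_policy_document(raw: str, chunk_size: int = 200) -> list[Chunk]:
    """Split a document into chunks that each carry its title and subheading."""
    lines = [line.strip() for line in raw.splitlines() if line.strip()]
    if not lines:
        return []
    title, body = lines[0], lines[1:]

    # Group body lines into (subheading, lines) sections.
    sections: list[tuple[str, list[str]]] = []
    for line in body:
        if line.startswith("## "):
            sections.append((line[3:], []))
        elif sections:
            sections[-1][1].append(line)
        else:
            sections.append(("", [line]))  # text before the first subheading

    chunks: list[Chunk] = []
    for subheading, section_lines in sections:
        text = " ".join(section_lines)
        header = f"{title}\n{subheading}\n" if subheading else f"{title}\n"
        # Fixed-width character split; a real splitter would respect sentences.
        for start in range(0, len(text), chunk_size):
            piece = text[start:start + chunk_size]
            chunks.append(
                Chunk(header + piece, {"title": title, "subheading": subheading})
            )
    return chunks


sample = (
    "Travel Policy\n"
    "## Eligibility\nAll full-time staff may travel.\n"
    "## Reimbursement\nSubmit receipts within 30 days."
)
chunks = chunk_policy_document(sample)
```

Each chunk's text is what would be embedded, so the title and subheading influence the vector, while the same values are also kept as chunk-level metadata.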
As of v3.5.0, we now have a Document meta table that stores additional information about a document outside the vector store. This is primarily used for managing ingested documents, but this might be a good place to store your additional data fields. Unfortunately, it won't address chunk-level metadata as stated above. |
I'd like to add S3 metadata to my embeddings during the embedding creation process and realized that I wasn't sure of the best place to do that. I wasn't sure if forking the project and adding to the file processing would be ideal, or if there was something I could do by defining a ragLambdaLayer as described here:
LISA/example_config.yaml
Line 16 in 2c3b03b