-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC for MM-RAG #49
base: main
Are you sure you want to change the base?
RFC for MM-RAG #49
Conversation
Signed-off-by: Tiep Le <tiep.le@intel.com>
Signed-off-by: Tiep Le <tiep.le@intel.com>
|
||
The proposed architecture involves the creation of two megaservices. | ||
- The first megaservice functions as the core pipeline, comprising four microservices: embedding, retriever, reranking, and LVLM. This megaservice exposes a MMRagBasedVisualQnAGateway, allowing users to query the system via the `/v1/mmrag_visual_qna` endpoint. | ||
- The second megaservice manages user data storage in VectorStore and is composed of a single microservice, embedding. This megaservice provides a MMRagDataIngestionGateway, enabling user access through the `/v1/mmrag_data_ingestion` endpoint. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be an enhanced microservice based on the existing data ingestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hshen14 Thanks for your comments. The simple answer is yes. We will enhance/reuse the existing data ingestion microservice. Current data ingestion microservice supports only TextDoc. In our proposal, we will enhance its interface to accept MultimodalDoc which can be either TextDoc, ImageDoc, ImageTextPairDoc, etc... If the input is of type TextDoc, we will divert the execution to current microservice/functions.
The proposed architecture involves the creation of two megaservices. | ||
- The first megaservice functions as the core pipeline, comprising four microservices: embedding, retriever, reranking, and LVLM. This megaservice exposes a MMRagBasedVisualQnAGateway, allowing users to query the system via the `/v1/mmrag_visual_qna` endpoint. | ||
- The second megaservice manages user data storage in VectorStore and is composed of a single microservice, embedding. This megaservice provides a MMRagDataIngestionGateway, enabling user access through the `/v1/mmrag_data_ingestion` endpoint. | ||
- The third megaservice functions as a helper to extract list of frame-transcript pairs from videos using audio-to-text models (e.g., BLIP2) for transcripting or LVLM model (e.g., LLAVA) for captioning. This megaservice is composed of 2 microservices: transcripting and LVLM. This megaservice provides a MMRagVideoprepGateway, enabling user access through the `/v1/mmrag_video_prep` endpoint. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mentioned either audio-to-text models for transcripting or LVLM model for captioning, while you also mentioned they need to be composed. If composing is not mandatory, does it make sense to make it as microservice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@hshen14 Thanks for your comment. We were proposing each transcripting and LVLM being a microservice. The composition is not mandatory. We believe that when we ingest a video, it is better to include both frames' transcripts and frames' captions as metadata for inference (LVLM) after retrieval. However, in the megaservice we will have different options for user to choose whether they want transcript only, caption only or both. The composition here is optional. Hope this is clear to you.
#### 2.1 Embeddings | ||
- Interface `MultimodalEmbeddings` that extends the interface langchain_core.embeddings.Embeddings with an abstract method: | ||
```python | ||
embed_multimodal_document(self, doc: MultimodalDoc) -> List[float] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The overall RFC is good to me. just minor comment here. I think this interface is implementation specific one, right?. my point here is you don't need to mention that. you just need tell user what's the standard input and output just like you did in Data Classes section, that's enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RFC file naming convention follows this rule: yy-mm-dd-[OPEA Project Name]-[index]-title.md
For example, 24-04-29-GenAIExamples-001-Using_MicroService_to_implement_ChatQnA.md
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tileintel could you pls update this file name to be consistent with other PRs?
Can this be closed? |
Yes, will close it soon |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tileintel could you pls update this file name to be consistent with other PRs? then I can merge this one
We submit our RFC for Multimodal-RAG based Visual QnA.
@ftian1 @hshen14: Could you please help to provide feedbacks?