-
Notifications
You must be signed in to change notification settings - Fork 85
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
H-3614: Introduce chonky
PDF Embeddings
#5673
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @JesusFileto!
I noticed that you have used curl
which means there is no asynchronous execution of requests possible. In particular for AI calls, asynchronous calls are preferred. In addition, curl
pulls in libcurl
(and OpenSSL), while reqwest
is written in Rust and can use rustls
.
As async backend we use tokio
. If you add the "macros"
and "rt-multi-thread"
to tokio
you can use #[tokio::main]
on fn main()
and #[tokio::test]
instead of #[test]
which makes the function async
.
As this effectively makes the code async
we can also use tokio
to read files (by using the fs
feature on tokio
and tokio::fs
instead of std::fs
)
chonky
PDF Embeddings
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #5673 +/- ##
=======================================
Coverage 21.72% 21.72%
=======================================
Files 566 566
Lines 19156 19156
Branches 2755 2755
=======================================
Hits 4162 4162
Misses 14942 14942
Partials 52 52
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work moving to async
🎉
I did a more in-depth review. I think it's a good time to improve the code quality a bit 🙂
.to_owned()) | ||
} | ||
|
||
fn base64_json(image_data: Vec<u8>) -> Result<String, Report<ChonkyError>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not needed to pass a vector, a slice is enough
fn base64_json(image_data: Vec<u8>) -> Result<String, Report<ChonkyError>> { | |
fn base64_json(image_data: &[u8]) -> Result<String, Report<ChonkyError>> { |
If you want to be even more generic you can use the same approach as above:
fn base64_json(image_data: Vec<u8>) -> Result<String, Report<ChonkyError>> { | |
fn base64_json(image_data: impl AsRef<[u8]>) -> Result<String, Report<ChonkyError>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see, will add this in next PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second suggestion is a more flexible way without side effects (the calling site doesn't need to be updated).
However, you can change the caller to
- payload = base64_json(image_payload);
+ payload = base64_json(&image_payload);
} | ||
|
||
// Parses the response to extract the image embedding vector | ||
fn extract_embedding(response_text: &str) -> Result<Vec<f64>, Report<ChonkyError>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have a struct for embedding, such as TextEmbedding
and ImageEmbedding
we probably want to return an enum here
enum Embedding {
Text(TextEmbedding),
Image(ImageEmbedding),
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was implementing this but the ImageEmbedding
struct requires the image in the struct, which is not accessible when just requesting the embedding, since the calling function should have knowledge on the type of vector it is receiving, it may be easier to not implement this (to avoid having to handling if let
branches upstream where the compiler does not know what datatype we are expecting but we do)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a re-review and the changes look very good. I added a few more comments.
Also, let's make sure to consistently use the new Embedding
structs as they really contributes to the readability of the code.
let xmin: i32 = num_traits::cast(bbox.xmin).ok_or(ChonkyError::Pdfium)?; | ||
let ymin: i32 = num_traits::cast(bbox.ymin).ok_or(ChonkyError::Pdfium)?; | ||
let xmax: i32 = num_traits::cast(bbox.xmax).ok_or(ChonkyError::Pdfium)?; | ||
let ymax: i32 = num_traits::cast(bbox.ymax).ok_or(ChonkyError::Pdfium)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the exact behavior here we expect? Because num_traits::cast
simply discards the decimals which is the behavior of floor
. We should add to the comment that flooring it is the expected behavior.
libs/chonky/src/lib.rs
Outdated
pub mod embedding; | ||
|
||
pub use embedding::{hugging_face_api, multi_modal_embedding}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need both to be pub. Either is sufficient.
libs/chonky/src/lib.rs
Outdated
|
||
#[derive(Debug, Clone)] | ||
pub struct Embedding { | ||
_model_used: String, //model name reveals image or text embedding model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's often the case that the model is statically non
_model_used: String, //model name reveals image or text embedding model | |
_model_used: Cow<'static, str>, //model name reveals image or text embedding model |
Also, is there a specific reason why you hide this detail and not making this pub
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only reason the field is not pub
is since it is not used yet so clippy complains.
pub async fn embed_pdf_object_images( | ||
pdf_image_extract: Vec<Vec<DynamicImage>>, | ||
project_id: &str, | ||
) -> Result<Vec<Vec<Vec<f64>>>, Report<ChonkyError>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely see the improvement already. I'll keep this comment open for the Box<[f64]>
approach.
/// | ||
/// [`ChonkyError::VertexAPI`] when there are HTTP request errors | ||
pub async fn embed_screenshots( | ||
pdf_image_extract: Vec<DynamicImage>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The actual idiomatic version (but also quite hard to start with) Is using streams.
This way the function would become
pub fn embed_pdf_object_images(
pdf_image_extract: impl IntoIterator<Item = PageImageObjects>,
project_id: &str,
) -> impl Stream<Item = Result<PageImageObjectsEmbeddings, Report<ChonkyError>>> {
stream::iter(pdf_image_extract)
.then(|page_images| embed_screenshots(page_images, project_id))
.map(|embeddings| {
Ok(PageImageObjectsEmbeddings {
_embeddings: embeddings?,
})
})
}
Note, that this function is not async
, because it returns a Stream
(which is pretty much an async iterator).
I don't say we need to take this route because streams are not yet well matured in Rust (and a more built-in feature is on the way), but it avoids allocations.
pub async fn embed_tables( | ||
pdf_table_bounds: Vec<Vec<ExtractedTable>>, | ||
project_id: &str, | ||
) -> Result<Vec<Vec<Vec<f64>>>, Report<ChonkyError>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we still have this signature 😅
async fn get_binary_image_data( | ||
image_path: impl AsRef<Path> + core::marker::Send + core::marker::Sync, | ||
) -> Result<Vec<u8>, Report<ChonkyError>> { | ||
fs::read(image_path) | ||
.await | ||
.change_context(ChonkyError::ImageError) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This really doesn't more than reading a file. We should probably inline it. Also, the error ImageError
seems to be inaccurate as this is really a file operation. (Btw, tokio::fs::read
only wraps the reading operation in an another task, nothing else).
If we keep it, we should use Send
only and directly:
async fn get_binary_image_data( | |
image_path: impl AsRef<Path> + core::marker::Send + core::marker::Sync, | |
) -> Result<Vec<u8>, Report<ChonkyError>> { | |
fs::read(image_path) | |
.await | |
.change_context(ChonkyError::ImageError) | |
} | |
async fn get_binary_image_data( | |
image_path: impl AsRef<Path> + Send, | |
) -> Result<Vec<u8>, Report<ChonkyError>> { | |
fs::read(image_path) | |
.await | |
.change_context(ChonkyError::ImageError) | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In trying to move the function inline, found that the explicit references to Send + Sync
bounds avoid compiler errors when using references to image_path
further downstream when retrying the request if the hugging face model was initially cold. For this reason it may be wise to not inline this function.
.to_owned()) | ||
} | ||
|
||
fn base64_json(image_data: Vec<u8>) -> Result<String, Report<ChonkyError>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The second suggestion is a more flexible way without side effects (the calling site doesn't need to be updated).
However, you can change the caller to
- payload = base64_json(image_payload);
+ payload = base64_json(&image_payload);
… `chonky` tests (#5687) Co-authored-by: JesusFileto <91223391+JesusFileto@users.noreply.github.com> Co-authored-by: Jesus Fileto <jesus_filetojr@icloud.com>
} | ||
|
||
/// A function that performs authentication with Google Vertex API and performs | ||
/// a curl request to obtain multimodal embeddings given an image path |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// a curl request to obtain multimodal embeddings given an image path | |
/// a request to obtain multimodal embeddings given an image path |
Cargo.lock
Outdated
[[package]] | ||
name = "curl" | ||
version = "0.4.47" | ||
source = "registry+https://github.com/rust-lang/crates.io-index" | ||
checksum = "d9fb4d13a1be2b58f14d60adba57c9834b78c62fd86c3e76a148f732686e9265" | ||
dependencies = [ | ||
"curl-sys", | ||
"libc", | ||
"openssl-probe", | ||
"openssl-sys", | ||
"schannel", | ||
"socket2", | ||
"windows-sys 0.52.0", | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like you need to update this file, it seems outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add json files to .gitattributes
as well?
Tagging @TimDiekmann in to get this PR over the line as we'd like to unblock merge. Thanks @JesusFileto for your work in getting it this far! |
# Conflicts: # .markdownlintignore # Cargo.lock # Cargo.toml
Benchmark results
|
Function | Value | Mean | Flame graphs |
---|---|---|---|
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/song/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/organization/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/building/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/playlist/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/uk-address/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/book/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/block/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/person/v/1
|
Flame Graph | |
entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/page/v/2
|
Flame Graph |
representative_read_multiple_entities
Function | Value | Mean | Flame graphs |
---|---|---|---|
entity_by_property | depths: DT=255, PT=255, ET=255, E=255 | Flame Graph | |
entity_by_property | depths: DT=0, PT=0, ET=0, E=0 | Flame Graph | |
entity_by_property | depths: DT=2, PT=2, ET=2, E=2 | Flame Graph | |
entity_by_property | depths: DT=0, PT=0, ET=0, E=2 | Flame Graph | |
entity_by_property | depths: DT=0, PT=0, ET=2, E=2 | Flame Graph | |
entity_by_property | depths: DT=0, PT=2, ET=2, E=2 | Flame Graph | |
link_by_source_by_property | depths: DT=255, PT=255, ET=255, E=255 | Flame Graph | |
link_by_source_by_property | depths: DT=0, PT=0, ET=0, E=0 | Flame Graph | |
link_by_source_by_property | depths: DT=2, PT=2, ET=2, E=2 | Flame Graph | |
link_by_source_by_property | depths: DT=0, PT=0, ET=0, E=2 | Flame Graph | |
link_by_source_by_property | depths: DT=0, PT=0, ET=2, E=2 | Flame Graph | |
link_by_source_by_property | depths: DT=0, PT=2, ET=2, E=2 | Flame Graph |
representative_read_entity_type
Function | Value | Mean | Flame graphs |
---|---|---|---|
get_entity_type_by_id | Account ID: d4e16033-c281-4cde-aa35-9085bf2e7579
|
Flame Graph |
scaling_read_entity_complete_one_depth
Function | Value | Mean | Flame graphs |
---|---|---|---|
entity_by_id | 50 entities | Flame Graph | |
entity_by_id | 5 entities | Flame Graph | |
entity_by_id | 1 entities | Flame Graph | |
entity_by_id | 10 entities | Flame Graph | |
entity_by_id | 25 entities | Flame Graph |
scaling_read_entity_linkless
Function | Value | Mean | Flame graphs |
---|---|---|---|
entity_by_id | 1 entities | Flame Graph | |
entity_by_id | 100 entities | Flame Graph | |
entity_by_id | 10 entities | Flame Graph | |
entity_by_id | 1000 entities | Flame Graph | |
entity_by_id | 10000 entities | Flame Graph |
scaling_read_entity_complete_zero_depth
Function | Value | Mean | Flame graphs |
---|---|---|---|
entity_by_id | 50 entities | Flame Graph | |
entity_by_id | 5 entities | Flame Graph | |
entity_by_id | 1 entities | Flame Graph | |
entity_by_id | 10 entities | Flame Graph | |
entity_by_id | 25 entities | Flame Graph |
🌟 What is the purpose of this PR?
This PR outlines the work for the embeddings that will be done in the PDF and preliminary structs with how they will be stored. The main goal is to split embeddings into multiple levels of textual information such as table embeddings. Future PRs will focus on implementing the XML metadata that will accompany these embeddings to be used for entity extraction.
🚫 Blocked by
🔍 What does this change?
Pre-Merge Checklist 🚀
🚢 Has this modified a publishable library?
This PR:
📜 Does this require a change to the docs?
The changes in this PR:
🕸️ Does this require a change to the Turbo Graph?
The changes in this PR:
🐾 Next steps
🛡 What tests cover this?