[Bug]: Barely multimodal and useless for the most part in PDF extraction #1278

Closed
1 task done
sidoncloud opened this issue Oct 17, 2024 · 1 comment

@sidoncloud

File Name

https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/function-calling/multimodal_function_calling.ipynb

What happened?

Change the query in the PDF extraction part to "Retrieve the details of the sold items along with the amount paid for Bikbear.". Here I am trying to fetch the other details from the PDF besides the company name, which is not a huge deal anyway. You will see how poorly Gemini performs at extracting those details. It basically does nothing.

Relevant log output

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@koverholt
Member

Thanks for the feedback on this sample notebook. As a reminder, we aim to keep a friendly, welcoming, and constructive community and environment here per our Code of Conduct.

Multimodal Function Calling is a different use case than sending a prompt with an image or PDF to Gemini and asking it to extract details or text from the document. If you need that functionality, a simpler multimodal request to the Gemini API will work great. You could also add Controlled Generation to get the contents of documents as a structured data object, without needing to use Gemini Function Calling at all.
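
As a rough illustration of the Controlled Generation route, the sketch below shows the kind of response schema you might supply and how the returned JSON text could be parsed. It makes no API call: the schema, the field names (`company_name`, `items`, `amount_paid`), and the `sample` string are all assumptions made up for this example, not output from the notebook or from Gemini.

```python
import json
from dataclasses import dataclass

# Hypothetical OpenAPI-style response schema describing the fields to extract.
# The field names are illustrative assumptions, not taken from the notebook.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "integer"},
                },
                "required": ["description"],
            },
        },
        "amount_paid": {"type": "number"},
    },
    "required": ["company_name", "items", "amount_paid"],
}


@dataclass
class Invoice:
    company_name: str
    items: list
    amount_paid: float


def parse_invoice(json_text: str) -> Invoice:
    """Parse JSON text that is expected to follow INVOICE_SCHEMA."""
    data = json.loads(json_text)
    # Only the top-level required keys are checked here; this is not a
    # full JSON Schema validation.
    missing = [key for key in INVOICE_SCHEMA["required"] if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return Invoice(data["company_name"], data["items"], float(data["amount_paid"]))


# Made-up placeholder text standing in for a schema-constrained model response.
sample = (
    '{"company_name": "Example Co", '
    '"items": [{"description": "widget", "quantity": 2}], '
    '"amount_paid": 10.5}'
)
invoice = parse_invoice(sample)
print(invoice)
```

In a real request, a schema like this would be passed to the model as the response schema together with a JSON response MIME type, and the PDF would be sent alongside the prompt.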

Multimodal Function Calling aims to go further when you need to take action based on the results of a multimodal function call request. This involves defining a JSON schema for your function (a FunctionDeclaration), wrapping those FunctionDeclarations in a tool, then using Function Calling as you normally would to get predicted functions & parameters, call an external API or function, then return the results to Gemini.
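
The steps above can be sketched without the SDK to show how the pieces fit together. Everything here is a simplified assumption: plain dicts stand in for the `FunctionDeclaration` and tool objects, `predicted_call` is a hard-coded stand-in for what the model would predict, and `get_company_information` is a hypothetical local function in place of an external API.

```python
# Step 1: a function declaration as a plain dict (hypothetical name and schema).
GET_COMPANY_INFO = {
    "name": "get_company_information",
    "description": "Look up information about a company by name",
    "parameters": {
        "type": "object",
        "properties": {"company_name": {"type": "string"}},
        "required": ["company_name"],
    },
}

# Step 2: wrap the declarations in a tool.
TOOL = {"function_declarations": [GET_COMPANY_INFO]}


# Step 3: a local stub standing in for the external API or function.
def get_company_information(company_name: str) -> dict:
    return {"company_name": company_name, "status": "stubbed lookup"}


HANDLERS = {"get_company_information": get_company_information}


def dispatch(function_call: dict) -> dict:
    """Run the handler for a predicted call and build the response payload."""
    name = function_call["name"]
    declared = {decl["name"] for decl in TOOL["function_declarations"]}
    if name not in declared or name not in HANDLERS:
        raise KeyError(f"no handler for function {name!r}")
    result = HANDLERS[name](**function_call["args"])
    # Step 4: this payload is what would be returned to the model.
    return {"name": name, "response": {"content": result}}


# Hard-coded stand-in for the function name and parameters the model predicts.
predicted_call = {
    "name": "get_company_information",
    "args": {"company_name": "Example Co"},
}
function_response = dispatch(predicted_call)
print(function_response)
```

The point of the dispatch step is that the model only predicts the call; your code decides whether and how to execute it before sending the result back.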

So, in order to modify the PDF example in https://github.com/GoogleCloudPlatform/generative-ai/blob/main/gemini/function-calling/multimodal_function_calling.ipynb, you would need to modify the JSON schema to specify the exact data structure that you want to output, modify the files and/or prompts as needed, then handle the function name and parameters to make an external API or function call.
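
For instance, a schema change along these lines might look like the sketch below. The original declaration shown here is an assumption about the notebook's shape (company name only), and the new names `record_sale`, `sold_items`, and `amount_paid` are hypothetical; adapt them to whatever your downstream function expects.

```python
import copy

# Assumed shape of a declaration that only asks for the company name.
ORIGINAL_DECLARATION = {
    "name": "get_company_information",
    "description": "Get information about a company",
    "parameters": {
        "type": "object",
        "properties": {"company_name": {"type": "string"}},
        "required": ["company_name"],
    },
}


def extend_declaration(declaration: dict) -> dict:
    """Return a copy of the declaration that also requests items and amount."""
    extended = copy.deepcopy(declaration)
    extended["name"] = "record_sale"  # hypothetical function name
    extended["description"] = "Record the sold items and amount paid for a company"
    params = extended["parameters"]
    params["properties"]["sold_items"] = {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {
                "description": {"type": "string"},
                "quantity": {"type": "integer"},
            },
        },
    }
    params["properties"]["amount_paid"] = {"type": "number"}
    params["required"] = ["company_name", "sold_items", "amount_paid"]
    return extended


def record_sale(company_name: str, sold_items: list, amount_paid: float) -> dict:
    """Hypothetical handler that receives the parameters the model predicts."""
    return {
        "company_name": company_name,
        "item_count": len(sold_items),
        "amount_paid": amount_paid,
    }


extended = extend_declaration(ORIGINAL_DECLARATION)
print(sorted(extended["parameters"]["properties"]))
```

With only the original declaration in place, the model has nowhere to put item details or amounts, so the predicted parameters would contain the company name alone.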

It's hard to say without seeing the full inputs and outputs that you used, but it might be the case that you didn't update the JSON schema in the FunctionDeclaration and are only seeing the company name in the output, as defined in the current FunctionDeclaration. In summary, consider using multimodal calls to the Gemini API or Controlled Generation if you just want to extract details from documents. Or, if you need that and also want to implement Function Calling on top of it, Multimodal Function Calling might be a good fit!
