multimodal audio in vs. transcribing an audio attachment? #43

Open
ghchinoy opened this issue Jan 10, 2025 · 1 comment
@ghchinoy

Currently, it appears that audio is handled as a binary attachment and is then transcribed (ref):

Future<void> _onTranslateStt(XFile file) async {

For multimodal models such as Gemini, audio input is natively supported.

The expectation is that instead of transcribing the audio attachment and sending the transcription, the audio itself should be passed directly to the model as input.
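
Not part of the original issue, but for illustration: a minimal sketch of what direct audio input could look like using the google_generative_ai Dart package, assuming the attachment arrives as an `XFile` as in `_onTranslateStt` above. The function name, prompt text, and mime-type fallback are hypothetical, not the toolkit's actual API.

```dart
import 'dart:typed_data';

import 'package:cross_file/cross_file.dart';
import 'package:google_generative_ai/google_generative_ai.dart';

/// Hypothetical sketch: send the recorded audio to Gemini directly as a
/// multimodal part instead of transcribing it first.
Future<String?> sendAudioDirectly(XFile file, GenerativeModel model) async {
  // Read the raw audio bytes from the attachment.
  final Uint8List bytes = await file.readAsBytes();

  // Pass the audio inline as a DataPart alongside a text prompt; the
  // fallback mime type here assumes an m4a/mp4 recording.
  final response = await model.generateContent([
    Content.multi([
      TextPart('Please respond to the following audio message.'),
      DataPart(file.mimeType ?? 'audio/mp4', bytes),
    ]),
  ]);

  return response.text;
}
```

The key difference from the current flow is that no speech-to-text step is involved: the model receives the audio bytes and the prompt together in one `generateContent` call.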

@csells
Contributor

csells commented Jan 11, 2025

thanks, @ghchinoy. that's a viable option and an excellent feature request. it's out of scope for the AI Toolkit as we work towards v1, but very much on the list for a future v2.

@csells csells added this to the v2 milestone Jan 11, 2025