Llama Vision Model Integration for Actions #2

SharanyaSD · 2024-12-19T11:45:17Z

Integrate the Llama Vision Model to interpret UI screenshots and generate browser actions.
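As a starting point, the browser actions the backend returns could be modeled as small typed records and serialized to JSON for the Puppeteer client. This is a minimal sketch under stated assumptions. The class names, the field names (x, y, delta_y, type), and the restriction to click and scroll are all hypothetical choices, taken from the examples in the task list below rather than from an agreed API.

```python
import json
from dataclasses import dataclass, asdict
from typing import Literal, Union


# Hypothetical action schema: names and fields are assumptions, not a defined API.
@dataclass
class ClickAction:
    x: int  # pixel x-coordinate within the page viewport
    y: int  # pixel y-coordinate within the page viewport
    type: Literal["click"] = "click"


@dataclass
class ScrollAction:
    delta_y: int  # positive scrolls down, negative scrolls up
    type: Literal["scroll"] = "scroll"


BrowserAction = Union[ClickAction, ScrollAction]


def to_json(action: BrowserAction) -> str:
    """Serialize an action into the JSON payload the Puppeteer client would consume."""
    return json.dumps(asdict(action))
```

On the Node side, a click record maps naturally onto Puppeteer's `page.mouse.click(x, y)` and a scroll record onto `page.mouse.wheel({ deltaY })`.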

Set up the Llama Vision Model in the FastAPI backend to process UI screenshots.
Create an endpoint to accept images, predict coordinates, and return actions (e.g., click, scroll).
Write logic that converts Llama Vision predictions into coordinates and actions that Puppeteer can execute.
Test basic tasks, such as clicking buttons or scrolling, on sample pages.
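The prediction-to-action step in the list above could look roughly like the sketch below. It assumes the model is prompted to reply with a single JSON object, such as `{"action": "click", "x": 0.42, "y": 0.13}` with coordinates normalized to [0, 1], or `{"action": "scroll", "direction": "down"}`. That output format, the function name `prediction_to_action`, and the half-viewport scroll step are all hypothetical choices, not confirmed behavior of Llama Vision or of this project.

```python
import json
import re

# Assumed prediction format (set by our prompt, not by the model itself):
#   {"action": "click", "x": 0.42, "y": 0.13}   x, y normalized to [0, 1]
#   {"action": "scroll", "direction": "down"}
_JSON_OBJECT = re.compile(r"\{.*\}", re.DOTALL)


def prediction_to_action(raw: str, width: int, height: int) -> dict:
    """Convert raw model text into a pixel-space action for Puppeteer.

    width/height are the viewport dimensions in CSS pixels.
    Raises ValueError if the output cannot be parsed or is unsupported.
    """
    match = _JSON_OBJECT.search(raw)  # models often wrap the JSON in extra prose
    if match is None:
        raise ValueError(f"no JSON object in model output: {raw!r}")
    pred = json.loads(match.group(0))  # JSONDecodeError is a ValueError subclass
    kind = pred.get("action")

    if kind == "click":
        # Clamp to [0, 1] so a slightly out-of-range prediction stays on-screen,
        # then scale to pixels and cap at the last valid pixel index.
        nx = min(max(float(pred["x"]), 0.0), 1.0)
        ny = min(max(float(pred["y"]), 0.0), 1.0)
        return {
            "type": "click",
            "x": min(round(nx * width), width - 1),
            "y": min(round(ny * height), height - 1),
        }
    if kind == "scroll":
        # Assumed policy: scroll by half a viewport per step.
        step = height // 2
        direction = pred.get("direction", "down")
        if direction not in ("up", "down"):
            raise ValueError(f"unsupported scroll direction: {direction!r}")
        return {"type": "scroll", "deltaY": step if direction == "down" else -step}
    raise ValueError(f"unsupported action: {kind!r}")
```

For example, `prediction_to_action('{"action": "click", "x": 0.5, "y": 0.25}', 1280, 720)` yields a click at (640, 180). One reason to prefer normalized coordinates is that they don't depend on the resolution at which the screenshot is fed to the model; the backend only needs the viewport size to map them back. If screenshots are captured with a device scale factor above 1, pass the viewport size in CSS pixels, since that is what `page.mouse.click` expects.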
