This repository contains a comprehensive guide and set of tools for evaluating AI agents using the Dria SDK. The notebook provided in this repository demonstrates how to generate an evaluation set for your AI agents and assess their performance using various tools and datasets.
Evaluating Retrieval-Augmented Generation (RAG) agents is crucial for ensuring their effectiveness and reliability across diverse datasets and scenarios. Testing these agents with detailed questions and varied personas reveals their strengths and weaknesses and helps you refine them for real-world applications.
Evaluating different models with various RAG methodologies also enables a direct comparison of their capabilities, highlighting differences in performance and adaptability and guiding the selection of the most suitable model for a given task.
This notebook walks through generating an evaluation set for your AI agents with Dria. Once the set is generated, you can evaluate the agents with promptfoo and review the results.
To get started, clone this repository and install the dependencies by running the provided code block in the notebook. We recommend using a Python virtual environment to manage dependencies, and running the notebook on your local machine rather than Google Colab due to potential incompatibilities.
The notebook relies on several external services, so you will need API keys from providers such as Firecrawl, Jina Reader, Upstash, Cohere, and OpenAI. Create a `.env` file containing the required API keys.
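As a minimal sketch, the keys can be loaded with python-dotenv (already in the dependency list); the variable names below are assumptions, so match them to whatever the notebook expects.

```python
import os

from dotenv import load_dotenv

# Reads the .env file in the working directory into the process environment.
load_dotenv()

# Hypothetical variable names; rename them to match the notebook's expectations.
firecrawl_key = os.environ["FIRECRAWL_API_KEY"]
jina_key = os.environ["JINA_API_KEY"]
upstash_url = os.environ["UPSTASH_VECTOR_REST_URL"]
upstash_token = os.environ["UPSTASH_VECTOR_REST_TOKEN"]
cohere_key = os.environ["COHERE_API_KEY"]
openai_key = os.environ["OPENAI_API_KEY"]
```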
Use the command-line interface provided in the notebook to scrape content from web domains. You can choose to scrape an entire domain or a single URL.
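For illustration, the sketch below scrapes a single page with the Firecrawl Python client; this is an assumption about how the scraping step could look rather than the notebook's actual CLI, and the URL is a placeholder.

```python
import os

from dotenv import load_dotenv
from firecrawl import FirecrawlApp  # assumes the firecrawl-py client is installed

load_dotenv()

app = FirecrawlApp(api_key=os.environ["FIRECRAWL_API_KEY"])

# Placeholder URL; point this at the domain or page you want to scrape.
result = app.scrape_url("https://example.com/docs")

# Inspect the returned content; field names vary across client versions.
print(result)
```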
The notebook demonstrates how to combine scraped content with personas to create a comprehensive dataset for evaluation.
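A minimal sketch of that combination step is shown below; the persona descriptions, the `contexts` list, and the output file name are placeholders, and the notebook may structure the dataset differently.

```python
from itertools import product

import pandas as pd

# Placeholder personas and scraped contexts; the notebook builds these
# from its own persona definitions and the scraping step above.
personas = [
    "a new user skimming the docs for a quick start",
    "an experienced developer debugging an integration",
]
contexts = [
    "Chunk of scraped documentation text...",
    "Another chunk of scraped documentation text...",
]

# Cross every persona with every context so each context is questioned
# from multiple perspectives.
rows = [
    {"persona": persona, "context": context}
    for persona, context in product(personas, contexts)
]

dataset = pd.DataFrame(rows)
dataset.to_csv("evaluation_contexts.csv", index=False)
```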
Finally, use the combined data to evaluate your AI agents. Dria generates synthetic QA pairs for each context-persona combination; these pairs simulate real-world scenarios and offer insight into how different RAG configurations perform. The notebook also shows how to run this evaluation with promptfoo.
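The Dria generation call itself is not reproduced here. As a rough sketch of the hand-off to promptfoo, synthetic QA pairs can be written out as promptfoo test cases along the following lines; the `qa_pairs` structure, output file name, and choice of the `similar` assertion are assumptions to adapt to your setup.

```python
import json

# Hypothetical synthetic QA pairs produced for each context-persona combination.
qa_pairs = [
    {
        "question": "How do I create an API key?",
        "answer": "Open the dashboard and generate a key under Settings.",
        "context": "Chunk of scraped documentation text...",
    },
]

# promptfoo test cases take `vars` (fed into your prompt or RAG chain) and
# `assert` blocks that grade the model output against the expected answer.
tests = [
    {
        "vars": {"question": qa["question"], "context": qa["context"]},
        "assert": [{"type": "similar", "value": qa["answer"]}],
    }
    for qa in qa_pairs
]

with open("promptfoo_tests.json", "w") as f:
    json.dump(tests, f, indent=2)
```

You can then reference the generated file from the tests section of a promptfoo configuration to compare providers or RAG configurations side by side.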
The project requires several Python packages, including but not limited to:
- requests
- openai
- pandas
- nltk
- matplotlib
- firecrawl
- upstash_vector
- cohere
- python-dotenv
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License. See the LICENSE file for more details.