Eval Framework? #20
Unanswered
stobias123 asked this question in Q&A

Doing some tinkering and absolutely love it so far. Trying to learn more, and I'm curious if you have any docs on how you evaluate new functions / agents.

Replies: 1 comment
-
I have been building the GAIA and SWE-bench Lite benchmark runners to evaluate the agent's overall ability. Individual prompts have so far just been tweaked manually based on observations. Evaluation and meta-prompting is an area I'm looking into (DSPy and the like) to get a feel for what exists and what would be most suitable to integrate or build on. I have a few ideas I'd like to play with, so building evaluation datasets will be an important part. SWE-bench also has an "oracle" dataset that includes the files that need to be edited, which gives us a dataset for evaluating the file-selection functionality in selectFilesToEdit.ts.
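To make that last point concrete, here is a minimal sketch of how the oracle file lists could drive a regression metric for the file-selection step. Everything below is illustrative, not the project's actual harness: `OracleInstance`, the precomputed `oracleFiles` field (which would be derived from the files touched by each instance's gold patch), and the `selectFiles` signature are assumptions.

```ts
// Minimal sketch: score a file-selection function against SWE-bench "oracle"
// instances. These shapes are assumptions for illustration, not real project
// types: oracleFiles would come from the files touched by the gold patch.

interface OracleInstance {
  instanceId: string;       // e.g. an instance identifier string
  problemStatement: string; // the issue text given to the agent
  oracleFiles: string[];    // repo-relative paths edited by the gold patch
}

interface FileSelectionScore {
  instanceId: string;
  precision: number; // fraction of selected files that are in the oracle set
  recall: number;    // fraction of oracle files that were selected
  exactMatch: boolean;
}

function scoreSelection(
  predicted: string[],
  oracle: string[],
): Omit<FileSelectionScore, "instanceId"> {
  const predictedSet = new Set(predicted);
  const oracleSet = new Set(oracle);
  const hits = [...predictedSet].filter((f) => oracleSet.has(f)).length;
  return {
    precision: predictedSet.size > 0 ? hits / predictedSet.size : 0,
    recall: oracleSet.size > 0 ? hits / oracleSet.size : 0,
    exactMatch: predictedSet.size === oracleSet.size && hits === oracleSet.size,
  };
}

// Hypothetical runner; `selectFiles` stands in for whatever
// selectFilesToEdit.ts exports.
async function runFileSelectionEval(
  instances: OracleInstance[],
  selectFiles: (problem: string) => Promise<string[]>,
): Promise<FileSelectionScore[]> {
  const results: FileSelectionScore[] = [];
  for (const inst of instances) {
    const predicted = await selectFiles(inst.problemStatement);
    results.push({
      instanceId: inst.instanceId,
      ...scoreSelection(predicted, inst.oracleFiles),
    });
  }
  return results;
}
```

Averaging precision/recall across instances would give a single number to track when tweaking prompts, rather than eyeballing individual runs.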