Adding first version of AgentEval -- a framework for assessing task u…

…tility for LLM-powered applications (#681) * add agenteval-notebook for math problems and the blog post about it * update gitignore * updates to notebook * adding folder for the logs * adding math problems logs * adding folder for alfworld logs * added limitiation and future work to blog post * minor edits blog post * adding changes * reorg * modify the main notebook * modification of the main notebook * remove wrong notebook * uploading new notebook * update agenteval notebook * change the sample * Update agenteval_cq_math.ipynb * adding final changes to notebook * updated framework picture * Update index.mdx * Update index.md * Add files via upload * updates to notebool * revise the blog * revise the blog * update the agent img * revise the blog * revise the blog * Excluded model logs from the main branch, you can find them in agenteval branch * Fixed pre-commit formatting. * Update website/blog/2023-11-11-AgentEval/index.mdx Co-authored-by: Chi Wang <wang.chi@microsoft.com> * update gitignore * update index.mdx * update authors.yml by adding Negar and Julia * remove md file * remove md file * update gitignore * update authors file * pre-commit checks * pre-commit checks on authors.yml * pre-commit checks on authors.yml * update index.mdx * update authors.yml by adding Negar and Julia * updated the blog-post version 1 * updated the blog-post: TL;DR is ready * updated the blog-post: first part of introduction is ready * updated figures: typos on fig 1, changed terminology on the fig 2 * upadated the Framework part * fixed redering issues * upload zip file instead of single samples * update prealgebra.zip * update * upload * update z * update naming * update zip * update the agenteval notebook * update the notebook - removing unmercenary logs * updated fig 1 and references to it * updated fig 1 * incorporated PR comments * merged agenteval branch * final changes to the blog * updated taxonomy * update notebook * minor changes to the blog * Fixed formatting * Update the link in agenteval_cq_math.ipynb * update the blog and link in notebook * Update index.mdx * change folder name * Changes to be committed: modified: OAI_CONFIG_LIST_sample.txt * add sample OAI file * fix the url link to colab and typos * fix the url link to colab and typos * add authors * update profile pic * "update authors" * fixing the problem in test_groupchat.py * update the title lower case * reverting changes in setup.py * rerun pre-commit --------- Co-authored-by: Negar Arabzadeh <ngr.arabzadeh@gmail.com> Co-authored-by: Julia Kiseleva <jukisele@microsoft.com> Co-authored-by: afourney <adamfo@microsoft.com> Co-authored-by: Chi Wang <wang.chi@microsoft.com> Co-authored-by: Qingyun Wu <qingyun.wu@psu.edu>
microsoft · Nov 21, 2023 · 19c7da2 · 19c7da2
1 parent 07646d4
commit 19c7da2
Show file tree

Hide file tree

Showing 12 changed files with 1,151 additions and 77 deletions.
diff --git a/.gitignore b/.gitignore
@@ -167,3 +167,6 @@ wolfram.txt
 
 # DB on disk for TeachableAgent
 tmp/
+
+# Storage for the AgentEval output
+test/test_files/agenteval-in-out/out/
diff --git a/notebook/agenteval_cq_math.ipynb b/notebook/agenteval_cq_math.ipynb
diff --git a/test/test_files/agenteval-in-out/samples/sample_math_criteria.json b/test/test_files/agenteval-in-out/samples/sample_math_criteria.json
@@ -0,0 +1,26 @@
+{
+    "Problem Interpretation": {
+      "description": "Ability to correctly interpret the problem.",
+      "accepted_values": ["completely off", "slightly relevant", "relevant", "mostly accurate", "completely accurate"]
+    },
+    "Mathematical Methodology": {
+      "description": "Adequacy of the chosen mathematical or algorithmic methodology for the question",
+      "accepted_values": ["inappropriate", "barely adequate", "adequate", "mostly effective", "completely effective"]
+    },
+    "Calculation Correctness": {
+      "description": "Accuracy of calculations made and solutions given",
+      "accepted_values": ["completely incorrect", "mostly incorrect", "neither", "mostly correct", "completely correct"]
+    },
+    "Explanation Clarity": {
+      "description": "Clarity and comprehensibility of explanations, including language use and structure",
+      "accepted_values": ["not at all clear", "slightly clear", "moderately clear", "very clear", "completely clear"]
+    },
+    "Code Efficiency": {
+      "description": "Quality of code in terms of  efficiency and elegance",
+      "accepted_values": ["not at all efficient", "slightly efficient", "moderately efficient", "very efficient", "extremely efficient"]
+    },
+    "Code Correctness": {
+      "description": "Correctness of the provided code",
+      "accepted_values": ["completely incorrect", "mostly incorrect", "partly correct", "mostly correct", "completely correct"]
+    }
+  }
diff --git a/test/test_files/agenteval-in-out/samples/sample_math_evaluated_results.json b/test/test_files/agenteval-in-out/samples/sample_math_evaluated_results.json
@@ -0,0 +1,13 @@
+{
+    "sample_algebra.json": {
+      "actual_success": "true",
+      "estimated_performance": {
+        "Problem Interpretation": "completely accurate",
+        "Mathematical Methodology": "completely effective",
+        "Calculation Correctness": "completely correct",
+        "Explanation Clarity": "very clear",
+        "Code Efficiency": "very efficient",
+        "Code Correctness": "completely correct"
+      }
+  }
+}
diff --git a/test/test_files/agenteval-in-out/samples/sample_math_response_failed.txt b/test/test_files/agenteval-in-out/samples/sample_math_response_failed.txt
@@ -0,0 +1,32 @@
+{
+  "problem": "Find all $x$ that satisfy the inequality $(2x+10)(x+3)<(3x+9)(x+8)$. Express your answer in interval notation.",
+  "level": "Level 5",
+  "type": "Algebra",
+  "solution": "We have \\begin{align*} (2x+10)(x+3)&<(3x+9)(x+8) \\quad \\Rightarrow\n\\\\ 2(x+5)(x+3)&<3(x+3)(x+8) \\quad \\Rightarrow\n\\\\ 2(x+5)(x+3)-3(x+3)(x+8)&<0 \\quad \\Rightarrow\n\\\\ (2x+10-(3x+24))(x+3)&<0 \\quad \\Rightarrow\n\\\\ (-x-14)(x+3)&<0 \\quad \\Rightarrow\n\\\\ (x+14)(x+3)&>0.\n\\end{align*} This inequality is satisfied if and only if $(x+14)$ and $(x+3)$ are either both positive or both negative.  Both factors are positive for $x>-3$ and both factors are negative for $x<-14$.  When $-14<x<-3$, one factor is positive and the other negative, so their product is negative.   Therefore, the range of $x$ that satisfies the inequality is $ \\boxed{(-\\infty, -14)\\cup(-3,\\infty)} $.",
+  "correct_ans": "(-\\infty, -14)\\cup(-3,\\infty)",
+  "problem_id": "0",
+  "response_with_ans": "I apologize for the confusion in my previous message. Let's find the factorized form of the quadratic using Python code.\n\n```python\nfrom sympy import symbols, factor\r\nx = symbols('x')\r\nfactorized = factor(x**2 + 17*x + 42)\r\nfactorized\r\n```\n\nPlease run the Python code above to find the factorized form of the quadratic, and we can proceed to determine the intervals where the inequality holds true.",
+  "round": 1,
+  "messages": [
+    {
+      "content": "Find all $x$ that satisfy the inequality $(2x+10)(x+3)<(3x+9)(x+8)$. Express your answer in interval notation.",
+      "role": "user"
+    },
+    {
+      "content": "First, let's simplify and isolate the inequality to solve for $x$. For this, we need to expand the expressions on both sides and then move everything to one side.\n\nStep 1: Expand the expressions on both sides\n$(2x+10)(x+3) = 2x^2 + 6x + 10x + 30 = 2x^2 + 16x + 30$\n$(3x+9)(x+8) = 3x^2 + 24x + 9x + 72 = 3x^2 + 33x + 72$\n\nStep 2: Move everything to one side\n$0 < 3x^2 + 33x + 72 - (2x^2 + 16x + 30) = x^2 + 17x + 42$\n\nNow we need to find the factorized form of the quadratic inequality $x^2 + 17x + 42 > 0$ and then find intervals where the inequality holds true.\n\nLet's first find the factorized form of the quadratic:",
+      "role": "assistant"
+    },
+    {
+      "content": "",
+      "role": "user"
+    },
+    {
+      "content": "I apologize for the confusion in my previous message. Let's find the factorized form of the quadratic using Python code.\n\n```python\nfrom sympy import symbols, factor\r\nx = symbols('x')\r\nfactorized = factor(x**2 + 17*x + 42)\r\nfactorized\r\n```\n\nPlease run the Python code above to find the factorized form of the quadratic, and we can proceed to determine the intervals where the inequality holds true.",
+      "role": "assistant"
+    }
+  ],
+  "time": 24.91333508491516,
+  "check_result": "the reply doesn't contain an answer.",
+  "is_correct": false,
+  "trial": -1
+}
diff --git a/test/test_files/agenteval-in-out/samples/sample_math_response_successful.txt b/test/test_files/agenteval-in-out/samples/sample_math_response_successful.txt
@@ -0,0 +1,24 @@
+{
+  "problem": "What is the sum of all the distinct positive two-digit factors of 144?",
+  "level": "Level 5",
+  "type": "Number Theory",
+  "solution": "Prime factorize $144=2^4\\cdot3^2$. The sum of the positive two-digit factors of 144 is $2^4+2\\cdot3^2+2^2\\cdot3+2^2\\cdot3^2+2^3\\cdot3+2^3\\cdot3^2+2^4\\cdot3=\\boxed{226}.$",
+  "correct_ans": "226",
+  "problem_id": "0",
+  "response_with_ans": "To find the sum of all the distinct positive two-digit factors of 144, we need to first find all these factors. We can do this by iterating through the numbers from 10 to 99 and checking if they are factors of 144. Then, we can sum these factors and print their sum.\n\nHere's a Python script to accomplish this:\n\n```python\ntwo_digit_factors = []\n\nfor i in range(10, 100):\n    if 144 % i == 0:\n        two_digit_factors.append(i)\n\nsum_of_factors = sum(two_digit_factors)\nprint(\"The sum of all the distinct positive two-digit factors of 144 is:\", sum_of_factors)\n```\n\nPlease run this script to find the desired sum.",
+  "round": 0,
+  "messages": [
+    {
+      "content": "What is the sum of all the distinct positive two-digit factors of 144?",
+      "role": "user"
+    },
+    {
+      "content": "To find the sum of all the distinct positive two-digit factors of 144, we need to first find all these factors. We can do this by iterating through the numbers from 10 to 99 and checking if they are factors of 144. Then, we can sum these factors and print their sum.\n\nHere's a Python script to accomplish this:\n\n```python\ntwo_digit_factors = []\n\nfor i in range(10, 100):\n    if 144 % i == 0:\n        two_digit_factors.append(i)\n\nsum_of_factors = sum(two_digit_factors)\nprint(\"The sum of all the distinct positive two-digit factors of 144 is:\", sum_of_factors)\n```\n\nPlease run this script to find the desired sum.",
+      "role": "assistant"
+    }
+  ],
+  "time": 11.140539407730103,
+  "check_result": "the answer is not directly provided in the reply, but i can extract it by running the given python code:\n\n```python\ntwo_digit_factors = []\n\nfor i in range(10, 100):\n    if 144 % i == 0:\n        two_digit_factors.append(i)\n\nsum_of_factors = sum(two_digit_factors)\nprint(\"the sum of all the distinct positive two-digit factors of 144 is:\", sum_of_factors)\n```\n\nafter running this code, the result is:\nthe sum of all the distinct positive two-digit factors of 144 is: 226\n\nthe answer is 226. the ground truth answer is 226. therefore,\n\nthe answer is correct.",
+  "is_correct": true,
+  "trial": -1
+}
diff --git a/test/test_files/agenteval-in-out/samples/sample_test_case.json b/test/test_files/agenteval-in-out/samples/sample_test_case.json
@@ -0,0 +1,24 @@
+{
+  "problem": "Find $24^{-1} \\pmod{11^2}$. That is, find the residue $b$ for which $24b \\equiv 1\\pmod{11^2}$.\n\nExpress your answer as an integer from $0$ to $11^2-1$, inclusive.",
+  "level": "Level 5",
+  "type": "Number Theory",
+  "solution": "Since $5 \\times 24 = 120 = 121 - 1$, it follows that $-5 \\times 24 \\equiv 1 \\pmod{121}$. Adding 121 to $-5$ to make it positive, we find $(-5 + 121) \\times 24 \\equiv 116 \\times 24 \\equiv 1 \\pmod{121}$, so it follows that the modular inverse of $24$ is $\\boxed{116}$ when taken modulo $121$.",
+  "correct_ans": "116",
+  "problem_id": "5",
+  "response_with_ans": "To find the modular inverse of 24 modulo 11^2, we can use the Extended Euclidean Algorithm. Here is a Python function to compute the modular inverse using this algorithm:\n\n```python\ndef mod_inverse(a, m):\n    g, x, _ = extended_gcd(a, m)\n    if g != 1:\n        raise Exception(f\"{a} and {m} are not coprime.\")\n    return x % m\n\ndef extended_gcd(a, b):\n    if a == 0:\n        return b, 0, 1\n    else:\n        g, x, y = extended_gcd(b % a, a)\n        return g, y - (b // a) * x, x\n```\n\nLet's use the above function to find the modular inverse of 24 modulo 11^2:\n\n```python\na = 24\nm = 11**2\nmod_inverse(a, m)\n```\n\nI will execute the above code to find the modular inverse of 24 modulo 11^2.",
+  "round": 0,
+  "messages": [
+    {
+      "content": "Find $24^{-1} \\pmod{11^2}$. That is, find the residue $b$ for which $24b \\equiv 1\\pmod{11^2}$.\n\nExpress your answer as an integer from $0$ to $11^2-1$, inclusive.",
+      "role": "user"
+    },
+    {
+      "content": "To find the modular inverse of 24 modulo 11^2, we can use the Extended Euclidean Algorithm. Here is a Python function to compute the modular inverse using this algorithm:\n\n```python\ndef mod_inverse(a, m):\n    g, x, _ = extended_gcd(a, m)\n    if g != 1:\n        raise Exception(f\"{a} and {m} are not coprime.\")\n    return x % m\n\ndef extended_gcd(a, b):\n    if a == 0:\n        return b, 0, 1\n    else:\n        g, x, y = extended_gcd(b % a, a)\n        return g, y - (b // a) * x, x\n```\n\nLet's use the above function to find the modular inverse of 24 modulo 11^2:\n\n```python\na = 24\nm = 11**2\nmod_inverse(a, m)\n```\n\nI will execute the above code to find the modular inverse of 24 modulo 11^2.",
+      "role": "assistant"
+    }
+  ],
+  "time": 13.481226921081543,
+  "check_result": "let me verify the modular inverse of 24 modulo 11^2.\n\nafter executing the provided python code:\n\n```python\na = 24\nm = 11**2\nmod_inverse(a, m)\n```\n\nthe answer is found to be: 116.\n\nthe answer is correct.",
+  "is_correct": true,
+  "trial": -1
+}
diff --git a/website/blog/2023-11-20-AgentEval/img/agenteval-CQ.png b/website/blog/2023-11-20-AgentEval/img/agenteval-CQ.png
diff --git a/website/blog/2023-11-20-AgentEval/img/math-problems-plot.png b/website/blog/2023-11-20-AgentEval/img/math-problems-plot.png
diff --git a/website/blog/2023-11-20-AgentEval/img/tasks-taxonomy.png b/website/blog/2023-11-20-AgentEval/img/tasks-taxonomy.png