# Demonstrating the `AgentEval` framework using the task of solving math problems as an example

AgentEval: a multi-agent system for assessing the utility of LLM-powered applications
This notebook demonstrates how `AgentEval` works in an offline scenario, using a math problem-solving task as an example.
`AgentEval` consists of two key steps:

- `generate_criteria`: an LLM-based function that generates a list of criteria to help evaluate the utility of a given task.
- `quantify_criteria`: a function that quantifies the performance of any sample task against the criteria generated in the `generate_criteria` step.
For more detailed explanations, please refer to the accompanying blog post.
## Requirements

AG2 requires Python >= 3.9. To run this notebook example, please install `pyautogen`, Docker, and OpenAI:
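The original install cell is not reproduced here; a minimal equivalent for a notebook cell might look like the sketch below. Any version pins used by the original notebook are not reproduced, and Docker Engine itself must be installed separately (the `docker` package is only the Python SDK).

```python
# Minimal install sketch for a notebook cell; adjust version pins to your environment.
%pip install pyautogen docker openai
```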
## Set your API Endpoint

The `config_list_from_json` function loads a list of configurations from an environment variable or a JSON file. It first looks for an environment variable with the specified name, whose value must be a valid JSON string. If that variable is not found, it looks for a JSON file with the same name. The resulting configs are filtered by `filter_dict`.

You can set the value of `config_list` in any way you prefer. Please refer to this User Guide for full code examples of the different methods.
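A typical loading pattern looks like the sketch below; the `OAI_CONFIG_LIST` name and the model filter are placeholders you can change:

```python
import autogen

# Load configurations from the OAI_CONFIG_LIST environment variable, or from a
# JSON file of that name, keeping only entries whose model matches filter_dict.
config_list = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model": ["gpt-4"]},
)

llm_config = {"config_list": config_list}
```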
## Run the Critic

To run the critic, we need a couple of math problem examples. One of them failed to solve the problem, given in `agenteval-in-out/response_failed.txt`, and the other one was solved successfully, given in `agenteval-in-out/response_successful.txt`.
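A sketch of the critic step is shown below. The module paths, the `Task` fields, and the `generate_criteria` signature are assumptions about the `agent_eval` contrib module and may need adjusting to your installed version:

```python
from autogen.agentchat.contrib.agent_eval.agent_eval import generate_criteria
from autogen.agentchat.contrib.agent_eval.task import Task

# Read one failed and one successful response (adjust the paths to where the
# sample files live in your checkout).
with open("agenteval-in-out/response_failed.txt") as f:
    failed_response = f.read()
with open("agenteval-in-out/response_successful.txt") as f:
    successful_response = f.read()

# Describe the task and hand the two contrasting examples to the critic.
task = Task(
    name="Math problem solving",
    description="Solve the given math problem as accurately as possible.",
    successful_response=successful_response,
    failed_response=failed_response,
)

criteria = generate_criteria(llm_config={"config_list": config_list}, task=task)
```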
## The Criteria

Now, we print the criteria designed for assessing math problems.

Note: You can also define and use your own criteria to feed into the quantifier; a possible format is sketched below.
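An illustrative shape (not an authoritative schema) for custom criteria is a list of entries with a name, a description, and the accepted values the quantifier may assign:

```python
import json

# Illustrative custom criteria; check the output of generate_criteria for the
# exact field names your version of agent_eval expects.
custom_criteria = [
    {
        "name": "Accuracy",
        "description": "Whether the final answer to the math problem is correct.",
        "accepted_values": ["incorrect", "partially correct", "correct"],
    },
    {
        "name": "Clarity",
        "description": "How clearly the solution steps are explained.",
        "accepted_values": ["poor", "adequate", "excellent"],
    },
]

# Hypothetical file name; point criteria_file at it later if you use your own criteria.
with open("my_criteria.json", "w") as f:
    json.dump(custom_criteria, f, indent=2)
```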
## The QuantifierAgent

Once we have the criteria, we need to quantify a new sample based on the designed criteria and their accepted values. This is done through `quantify_criteria` from `agent_eval`. Again, you can use your own criteria defined in `criteria_file`.
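Loading the criteria back from a file could look like the sketch below; `Criterion.parse_json_str` and the `criteria_file` path are assumptions about the contrib module's helpers:

```python
from autogen.agentchat.contrib.agent_eval.criterion import Criterion

# Load criteria from a file; this works for the generated criteria or your own.
criteria_file = "agenteval-in-out/criteria.json"  # hypothetical path
with open(criteria_file) as f:
    criteria = Criterion.parse_json_str(f.read())
```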
## Running the quantifier on a single test case

Here, we run the quantifier on a single math problem test case, `sample_test_case.json`, for demonstration.
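A possible invocation is sketched below; the `quantify_criteria` keyword arguments and the way the test case is passed are assumptions to check against your version of `agent_eval`:

```python
from autogen.agentchat.contrib.agent_eval.agent_eval import quantify_criteria

# Read the single test case as a string (adjust the path to your checkout).
with open("sample_test_case.json") as f:
    test_case = f.read()

# config_list, criteria, and task come from the earlier cells.
quantified = quantify_criteria(
    llm_config={"config_list": config_list},
    criteria=criteria,
    task=task,
    test_case=test_case,
    ground_truth="",  # supply the reference answer if your test case includes one
)
print(quantified)
```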
## Run `AgentEval` on the logs

In the example below, `log_path` points to the sample logs folder on which to run the quantifier. The current sample belongs to the prealgebra category, which will be downloaded from here. If you want to replicate the results described in the blog post, you can download all the logs for math problems using the following link.
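One way to run the quantifier over every log in the folder is sketched below; `log_path` is a placeholder, and the assumption is that each log file can be passed to `quantify_criteria` as a test case:

```python
import os

log_path = "agenteval-in-out/agentchat_results"  # placeholder folder of downloaded logs

# Quantify every log in the folder and collect the results.
outcomes = []
for file_name in sorted(os.listdir(log_path)):
    with open(os.path.join(log_path, file_name)) as f:
        test_case = f.read()
    outcomes.append(
        quantify_criteria(
            llm_config={"config_list": config_list},
            criteria=criteria,
            task=task,
            test_case=test_case,
            ground_truth="",
        )
    )

print(f"Quantified {len(outcomes)} logs from {log_path}")
```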
## Plotting the estimated performance

Here you can find an example of how to visualize the obtained results as a histogram (similar to the one in the blog post). The final plot will be saved in `../test/test_files/agenteval-in-out/estimated_performance.png`.
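A minimal matplotlib sketch is shown below. It plots a simplified bar chart of mean scores per criterion rather than the blog post's full histograms, and the parsing of `outcomes` (key names, numeric scores) is an assumption you will need to adapt to the quantifier's actual output:

```python
import json
import os

import matplotlib.pyplot as plt

# Assumed parsing: each outcome carries a JSON string mapping criterion -> score.
scores_per_criterion = {}
for outcome in outcomes:
    estimated = json.loads(outcome["estimated_performance"])  # assumed key name
    for criterion_name, score in estimated.items():
        scores_per_criterion.setdefault(criterion_name, []).append(float(score))

names = list(scores_per_criterion)
means = [sum(v) / len(v) for v in scores_per_criterion.values()]

plt.figure(figsize=(10, 4))
plt.bar(names, means)
plt.ylabel("Mean quantified score")
plt.title("Estimated performance per criterion")
plt.tight_layout()

out_path = "../test/test_files/agenteval-in-out/estimated_performance.png"
os.makedirs(os.path.dirname(out_path), exist_ok=True)
plt.savefig(out_path)
```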