AgentEval: a multi-agent system for assessing utility of LLM-powered applications
This notebook demonstrates how AgentEval works in an offline scenario, using a math problem-solving task as an example.
AgentEval consists of two key steps:

1. generate_criteria: an LLM-based function that generates a list of criteria to help evaluate the utility of a given task.
2. quantify_criteria: a function that quantifies the performance of any sample task against the criteria produced in the generate_criteria step.
For more detailed explanations, please refer to the accompanying blog post.
AG2 requires Python>=3.9. To run this notebook example, please install ag2, Docker, and OpenAI:
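A minimal install sketch for a notebook cell (the package names and extras are assumptions; adjust versions to your setup):

```python
# Run in a notebook cell; package names are assumptions -- you may prefer an
# extra such as ag2[openai] or pinned versions.
%pip install ag2 docker openai
```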
The config_list_from_json function loads a list of configurations from an environment variable or a JSON file. It first looks for an environment variable with the specified name, whose value must be a valid JSON string. If that variable is not found, it looks for a JSON file with the same name. It then filters the configs by filter_dict. You can set the value of config_list in any way you prefer; please refer to the User Guide for full code examples of the different methods.
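For illustration, a typical call might look like the following; the OAI_CONFIG_LIST name and the model filter are just examples:

```python
import autogen

# Load LLM configurations from the OAI_CONFIG_LIST environment variable or a
# JSON file with that name, keeping only the models listed in filter_dict.
config_list = autogen.config_list_from_json(
    "OAI_CONFIG_LIST",
    filter_dict={"model": ["gpt-4"]},
)
llm_config = {"config_list": config_list}
```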
To run the critic, we need a couple of math problem examples: one where the problem was not solved successfully, given in agenteval-in-out/response_failed.txt, and one where it was solved successfully, given in agenteval-in-out/response_successful.txt.
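As a rough sketch, the two transcripts can be wrapped into a Task and handed to generate_criteria along the following lines; the import paths, the Task fields, the task description, and the file locations are assumptions based on the AG2 contrib module and may differ between versions:

```python
from autogen.agentchat.contrib.agent_eval.agent_eval import generate_criteria
from autogen.agentchat.contrib.agent_eval.task import Task

# Read the two example transcripts; the directory prefix is assumed from the
# output path mentioned later in this notebook.
base_dir = "../test/test_files/agenteval-in-out"
with open(f"{base_dir}/response_successful.txt") as f:
    successful_response = f.read()
with open(f"{base_dir}/response_failed.txt") as f:
    failed_response = f.read()

# Describe the task and hand both transcripts to the criteria generator.
task = Task(
    name="Math problem solving",
    description="Given a math question, solve it as accurately and concisely as possible.",
    successful_response=successful_response,
    failed_response=failed_response,
)
criteria = generate_criteria(llm_config={"config_list": config_list}, task=task)
```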
Now, we print the designed criteria for assessing math problems.
Note: You can also define your own criteria and feed them into the quantifier.
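For instance, the generated list can be inspected with a simple loop (the Criterion attribute names are assumptions based on the AG2 contrib module):

```python
# Each generated Criterion carries a name, a description, and the accepted
# values the quantifier may assign (attribute names assumed).
for criterion in criteria:
    print(f"{criterion.name}: {criterion.description} -> {criterion.accepted_values}")
```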
QuantifierAgent
Once we have the criteria, we need to quantify a new sample based on the designed criteria and their accepted values. This is done through quantify_criteria from agent_eval. Again, you can use your own defined criteria in criteria_file.
Here, we run the quantifier on a single math problem test case, sample_test_case.json, for demonstration.
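A sketch of that call is shown below; the file path and the quantify_criteria keyword arguments are assumptions and may need adjusting for your version of AG2:

```python
from autogen.agentchat.contrib.agent_eval.agent_eval import quantify_criteria

# Path is assumed; point it at your own test case. If the file embeds the
# ground truth, strip it out beforehand so the quantifier cannot peek at it.
with open("../test/test_files/agenteval-in-out/sample_test_case.json") as f:
    test_case = f.read()

quantifier_output = quantify_criteria(
    llm_config={"config_list": config_list},
    criteria=criteria,
    task=task,
    test_case=test_case,
    ground_truth="",  # pass the known outcome here if available
)
print(quantifier_output)
```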
AgentEval on the logs
In the example below, log_path points to the sample logs folder used to run the quantifier. The current sample belongs to the prealgebra category, which will be downloaded from here. If you want to replicate the results described in the blog post, you can download all the logs for the math problems using the following link.
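One way to drive the quantifier over a whole folder of logs might look like this; the agentchat_results folder name is an assumption, and the task and criteria are reused from above:

```python
import os

# Folder of per-problem logs downloaded as described above (path assumed).
log_path = "../test/test_files/agenteval-in-out/agentchat_results"

outcomes = []
for file_name in sorted(os.listdir(log_path)):
    with open(os.path.join(log_path, file_name)) as f:
        test_case = f.read()
    outcomes.append(
        quantify_criteria(
            llm_config={"config_list": config_list},
            criteria=criteria,
            task=task,
            test_case=test_case,
            ground_truth="",
        )
    )
```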
Here you can find an example of how to visualize the obtained result in histogram form (similar to the one in the blog post).
The final plot will be saved to ../test/test_files/agenteval-in-out/estimated_performance.png.
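A minimal plotting sketch, assuming you have aggregated the quantifier outputs into per-criterion averages (the criterion names and numbers below are placeholders, not real results):

```python
import matplotlib.pyplot as plt

# Hypothetical per-criterion averages; replace these placeholder values with
# your own aggregation over `outcomes` collected above.
criteria_averages = {"Accuracy": 4.1, "Efficiency": 3.4, "Clarity": 3.9}

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(list(criteria_averages), list(criteria_averages.values()), color="steelblue")
ax.set_xlabel("Criterion")
ax.set_ylabel("Average quantified score")
ax.set_title("Estimated performance across criteria")
fig.tight_layout()
fig.savefig("../test/test_files/agenteval-in-out/estimated_performance.png")
```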