Requirements
AG2 requires Python>=3.9. To run this notebook example, please install with the [blendsearch] option:
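For example (assuming the package is published as ag2 on PyPI; adjust the package name if you installed via pyautogen):

```
pip install "ag2[blendsearch]"
```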
Note: For code corresponding to version <0.2, you can refer to the repository.

In this example, we tune the model with autogen.ChatCompletion.tune and make a request with the tuned config using autogen.ChatCompletion.create. First, we import autogen:
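```python
import autogen
```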
Set your API Endpoint
The config_list_openai_aoai function tries to create a list of Azure OpenAI endpoints and OpenAI endpoints. It assumes the API keys and API bases are stored in the corresponding environment variables or local txt files:

- OpenAI API key: os.environ["OPENAI_API_KEY"] or openai_api_key_file="key_openai.txt".
- Azure OpenAI API key: os.environ["AZURE_OPENAI_API_KEY"] or aoai_api_key_file="key_aoai.txt". Multiple keys can be stored, one per line.
- Azure OpenAI API base: os.environ["AZURE_OPENAI_API_BASE"] or aoai_api_base_file="base_aoai.txt". Multiple bases can be stored, one per line.
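For example, the following sketch builds the config list from those defaults (parameter names follow the pre-0.2 autogen API; the file names are the defaults listed above):

```python
config_list = autogen.config_list_openai_aoai(
    key_file_path=".",  # directory to search for the txt files below
    openai_api_key_file="key_openai.txt",
    aoai_api_key_file="key_aoai.txt",
    aoai_api_base_file="base_aoai.txt",
)
```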
Load dataset
We load the competition_math dataset. The dataset contains 201 "Level 2" Algebra examples. We use a random sample of 20 examples for tuning the generation hyperparameters and the remaining for evaluation.
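A sketch of the loading step, assuming the Hugging Face datasets package; the field names (problem, level, type, solution) follow the competition_math schema, and the 20/remaining split mirrors the description above:

```python
import random
from datasets import load_dataset

data = load_dataset("competition_math")
# Keep only "Level 2" Algebra problems, as described above.
examples = [
    {"problem": x["problem"], "solution": x["solution"]}
    for split in ("train", "test")
    for x in data[split]
    if x["level"] == "Level 2" and x["type"] == "Algebra"
]
random.Random(41).shuffle(examples)  # fixed seed for reproducibility
tune_data, test_data = examples[:20], examples[20:]
print(len(tune_data), len(test_data))
```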
Define Success Metric

Before we start tuning, we must define the success metric we want to optimize. For each math task, we use voting to select the response with the most common answer among all generated responses. We consider the task successfully solved if that answer is equivalent to the canonical solution. We then optimize the mean success rate over a collection of tasks.
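A minimal sketch of that voting metric; extract_answer and is_equivalent are hypothetical helpers (one pulls the final answer out of a response string, the other checks mathematical equivalence), and pre-0.2 autogen ships a comparable built-in, eval_math_responses, in autogen.math_utils:

```python
from collections import Counter

def success_vote(responses, solution):
    """Return 1 if the most-voted answer matches the canonical solution.

    extract_answer and is_equivalent are hypothetical helpers: one pulls
    the final answer out of a response string, the other checks
    mathematical equivalence.
    """
    counts = Counter(
        a for a in (extract_answer(r) for r in responses) if a is not None
    )
    if not counts:
        return 0
    voted_answer, _ = counts.most_common(1)[0]
    return int(is_equivalent(voted_answer, extract_answer(solution)))
```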
Use the tuning data to find a good configuration

For (local) reproducibility and cost efficiency, we cache responses from OpenAI with a controllable seed. You can change cache_path_root from ".cache" to a different path in set_cache().
The caches for different seeds are stored separately.
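For example (the signature follows the pre-0.2 autogen API; seed 41 is arbitrary):

```python
# Cache responses under the seed's own directory; other seeds get
# separate directories.
autogen.ChatCompletion.set_cache(seed=41, cache_path_root=".cache")
```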
Perform tuning
The tuning will take a while to finish, depending on the optimization budget. It is performed under the specified budgets, as shown in the sketch after this list.

- inference_budget is the benchmark's target average inference budget per instance. For example, 0.004 means the target inference budget is 0.004 dollars, which translates to 2000 tokens (input + output combined) if the gpt-3.5-turbo model is used.
- optimization_budget is the total budget allowed for tuning. For example, 1 means 1 dollar is allowed in total, which translates to 500K tokens for the gpt-3.5-turbo model.
- num_samples is the number of different hyperparameter configurations allowed to be tried. The tuning will stop after either num_samples trials are completed or optimization_budget dollars are spent, whichever happens first. -1 means no hard restriction on the number of trials; the actual number is decided by optimization_budget.
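Putting these budgets together, a tuning call might look like the following sketch (tune_data and success_vote come from the earlier snippets; wrapping the metric in a dict reflects the pre-0.2 eval_func convention):

```python
def eval_func(responses, solution, **kwargs):
    # tune expects a dict of metric values per instance (an assumption
    # based on the pre-0.2 API).
    return {"success_vote": success_vote(responses, solution)}

config, analysis = autogen.ChatCompletion.tune(
    data=tune_data,          # the 20 tuning examples
    metric="success_vote",   # which metric from eval_func to optimize
    mode="max",              # maximize the success rate
    eval_func=eval_func,
    inference_budget=0.004,  # target average dollars per instance
    optimization_budget=1,   # total dollars allowed for tuning
    num_samples=-1,          # let optimization_budget decide the trial count
)
```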
Output tuning results
After the tuning, we can print out the config and the result found by AG2, which uses flaml for tuning.
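For example (assuming tune returned config and analysis as in the sketch above; best_result is flaml's summary of the best trial):

```python
print("optimized config", config)
print("best result on tuning data", analysis.best_result)
```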
Make a request with the tuned config

We can apply the tuned config on the request for an example task:
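A sketch, reusing config_list and the tuned config from above; context supplies the fields the tuned prompt template refers to:

```python
response = autogen.ChatCompletion.create(
    context=tune_data[1],    # an example task from the tuning data
    config_list=config_list,
    **config,                # the tuned hyperparameters
)
print(response)
```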
Evaluate the success rate on the test data

You can use autogen.ChatCompletion.test to evaluate the performance of an entire dataset with the tuned config. The following code will take a while (30 minutes to 1 hour) to evaluate all the test data instances if uncommented and run. It will cost roughly $3.
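A sketch of that evaluation, left commented out because of the time and cost noted above (the signature follows the pre-0.2 autogen API):

```python
import logging

# result = autogen.ChatCompletion.test(
#     test_data,
#     logging_level=logging.INFO,
#     config_list=config_list,
#     **config,
# )
# print("performance on test data with the tuned config:", result)
```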