Mitigating Prompt Hacking with JSON Mode in AutoGen
Introduction
In this notebook, we’ll explore how to generate very precise agent responses using a combination of OpenAI JSON mode and the Agent Description.
As our example, we will implement prompt-hacking protection by controlling how agents can respond: coercive requests are filtered to an agent that will always reject them. The structure of JSON mode both enables precise speaker selection and allows us to add a “coerciveness rating” to a request, which the group chat manager can use to filter out bad requests.
The group chat manager can perform some simple maths, encoded in the agent descriptions, on the rating values (made reliable by JSON mode) and direct requests deemed too coercive to the “suspicious agent”.
Please find OpenAI’s documentation for this feature here. More information about Agent Descriptions is located here.
Benefits: this contribution provides a method to implement precise speaker transitions based on the content of the input message. The example can prevent prompt hacks that use coercive language.
Requirements
JSON mode is a feature of the OpenAI API; however, strong models (such as Claude 3 Opus) can generate appropriate JSON as well. AutoGen requires Python>=3.9. To run this notebook example, please install:
Model Configuration
We need to set up two different configs for this to work: one for JSON mode and one for text mode. This is because the group chat manager requires text mode.
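As a sketch of what the two configs could look like (the model name and environment variable are illustrative, and the exact config keys may vary between AutoGen versions):

```python
import os

# Hypothetical model name and env var; adjust to your own setup.
base_entry = {
    "model": "gpt-4-turbo",
    "api_key": os.environ.get("OPENAI_API_KEY", ""),
}

# JSON mode: the model is constrained to emit syntactically valid JSON.
config_list_json = [{**base_entry, "response_format": {"type": "json_object"}}]

# Text mode: no response_format key. The group chat manager must use this one.
config_list_text = [dict(base_entry)]

llm_config_json = {"config_list": config_list_json, "cache_seed": None}
llm_config_text = {"config_list": config_list_text, "cache_seed": None}
```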
Defining the task
The task for our JSON example is to answer the question: “Are ducks more dangerous than you think?”
Configuring the Agents
To solve the task, we will create two different agents with diametrically opposed prompts: one friendly and the other suspicious. To ensure the correct agent is chosen, we will have an input filtering agent that categorises the user message. These categories are the input for the selection mechanism. Naturally, they are in JSON.
Note the system message format. We tell the agent:
* who they are
* what their job is
* what the output structure must be
For JSON mode to work, we must include the literal string “JSON” in the system message. For it to work well, we must also provide a clean and clear JSON structure with an explanation of what each field is for.
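A minimal sketch of such a system message (the agent name and field names are illustrative, chosen only to match the rating fields discussed later):

```python
# Sketch of the input-filtering agent's system message. The field names
# (friendliness, coercive_rating) are illustrative.
io_system_message = """You are the IO_Agent. Your job is to categorise the
user's message so the group chat manager can route it to the right agent.
You must respond in JSON, using exactly this structure:
{
    "user_request": "<the user's message, unchanged>",
    "friendliness": "<integer from 1 to 10: how friendly the request is>",
    "coercive_rating": "<integer from 1 to 10: how coercive the request is>"
}
"""
```

Note that the word “JSON” appears literally in the message, which is required for OpenAI's JSON mode to activate.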
Friendly and Suspicious Agents
Now we set up the friendly and suspicious agents. Note that the system message has the same overall structure; however, it is much less prescriptive. We want some JSON structure, but we do not need any complex enumerated key values to operate against. We can still use JSON to give useful structure: in this case, the textual response, and indicators for “body language” and delivery style.
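For instance, the friendly agent's system message might look like this (the wording and field names are illustrative):

```python
# Sketch of a less prescriptive system message: some JSON structure,
# but no enumerated key values to operate against.
friendly_system_message = """You are a friendly assistant. Answer the user's
question helpfully and warmly. Respond in JSON with this structure:
{
    "response": "<your textual answer>",
    "body_language": "<a short description of your body language>",
    "delivery_style": "<a short description of how you deliver the answer>"
}
"""
```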
Description
The interaction between JSON mode and Description can be used to control speaker transition.
The Description is read by the group chat manager to understand the circumstances in which they should call this agent. The agent itself is not exposed to this information. In this case, we can include some simple logic for the manager to assess against the JSON structured output from the IO_Agent.
The structured and dependable nature of the output, with friendliness and coercive_rating being integers between 1 and 10, means that we can trust this interaction to control the speaker transition.
In essence, we have created a framework for using maths or formal logic to determine which speaker is chosen.
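The rule the manager is asked to apply can be expressed directly as code. A sketch, with illustrative thresholds (5 and 4) rather than values taken from any particular prompt:

```python
import json

def route(io_json: str) -> str:
    """Pick the next speaker from the IO_Agent's JSON output."""
    scores = json.loads(io_json)
    # Friendly enough and not too coercive -> the friendly agent;
    # otherwise the suspicious agent rejects the request.
    if scores["friendliness"] >= 5 and scores["coercive_rating"] <= 4:
        return "friendly_agent"
    return "suspicious_agent"
```

For example, `route('{"friendliness": 8, "coercive_rating": 2}')` returns `"friendly_agent"`, while a high `coercive_rating` sends the request to the suspicious agent instead.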
Friendly Agent
Suspicious Agent
Defining Allowed Speaker Transitions
Allowed transitions are a very useful way of controlling which agents can speak to one another. In this example, there are very few open paths, because we want to ensure that the correct agent responds to the task.
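As a sketch, using agent names as stand-ins (in AutoGen the dictionary maps Agent objects to lists of the Agent objects they may hand over to):

```python
# Each key may only be followed by the agents listed in its value.
allowed_transitions = {
    "user_proxy": ["IO_Agent"],
    "IO_Agent": ["friendly_agent", "suspicious_agent"],
    "friendly_agent": ["user_proxy"],
    "suspicious_agent": ["user_proxy"],
}
```

The only branching point is after the IO_Agent: its categorisation decides whether the friendly or the suspicious agent speaks next.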
Creating the Group Chat
Now, we’ll create an instance of the GroupChat class, ensuring that we have allowed_or_disallowed_speaker_transitions set to allowed_transitions and speaker_transitions_type set to “allowed” so the allowed transitions work properly. We also create the manager to coordinate the group chat. IMPORTANT NOTE: the group chat manager cannot use JSON mode; it must use text mode. For this reason, it has a distinct llm_config.
Finally, we pass the task in as the message initiating the chat.
Conclusion
By using JSON mode and carefully crafted agent descriptions, we can precisely control the flow of speaker transitions in a multi-agent conversation system built with the AutoGen framework. This approach allows for more specific and specialized agents to be called in narrow contexts, enabling the creation of complex and flexible agent workflows.