TL;DR: Introduce Stateflow, a task-solving paradigm that conceptualizes complex task-solving processes backed by LLMs as state machines. Introduce how to use GroupChat to realize such an idea with a customized speaker selection function.

Introduction

It is a notable trend to use Large Language Models (LLMs) to tackle complex tasks, e.g., tasks that require a sequence of actions and dynamic interaction with tools and external environments. In this paper, we propose StateFlow, a novel LLM-based task-solving paradigm that conceptualizes complex task-solving processes as state machines. In StateFlow, we distinguish between “process grounding” (via state and state transitions) and “sub-task solving” (through actions within a state), enhancing control and interpretability of the task-solving procedure. A state represents the status of a running process. The transitions between states are controlled by heuristic rules or decisions made by the LLM, allowing for a dynamic and adaptive progression. Upon entering a state, a series of actions is executed, involving not only calling LLMs guided by different prompts, but also the utilization of external tools as needed.

StateFlow

Finite State machines (FSMs) are used as control systems to monitor practical applications, such as traffic light control. A defined state machine is a model of behavior that decides what to do based on current status. A state represents one situation that the FSM might be in. Drawing from this concept, we want to use FSMs to model the task-solving process of LLMs. When using LLMs to solve a task with multiple steps, each step of the task-solving process can be mapped to a state.

Let’s take an example of an SQL task (See the figure below). For this task, a desired procedure is:

  1. gather information about the tables and columns in the database,
  2. construct a query to retrieve the required information,
  3. finally verify the task is solved and end the process.

For each step, we create a corresponding state. Also, we define an error state to handle failures. In the figure, execution outcomes are indicated by red arrows for failures and green for successes. Transition to different states is based on specific rules. For example, at a successful “Submit” command, the model transits to the End state. When reaching a state, a sequence of output functions defined is executed (e.g., M_i -> E means to first call the model and then execute the SQL command).

Experiments

InterCode: We evaluate StateFlow on the SQL task and Bash task from the InterCode benchmark, with both GTP-3.5-Turbo and GPT-4-Turbo. We record different metrics for a comprehensive comparison. The ‘SR’ (success rate) measures the performance, ‘Turns’ represents the number of interactions with the environment, and ‘Error Rate’ represents the percentage of errors of the commands executed. We also record the cost of the LLM usage.

We compare with the following baselines: (1) ReAct: a few-shot prompting method that prompts the model to generate thoughts and actions. (2) Plan & Solve: A two-step prompting strategy to first ask the model to propose a plan and then execute it.

The results of the Bash task are presented below:

ALFWorld: We also experiment with the ALFWorld benchmark, a synthetic text-based game implemented in the TextWorld environments. We tested with GPT-3.5-Turbo and took an average of 3 attempts.

We evaluate with: (1) ReAct: We use the two-shot prompt from the ReAct. Note there is a specific prompt for each type of task. (2) ALFChat (2 agents): A two-agent system setting from AutoGen consisting of an assistant agent and an executor agent. ALFChat is based on ReAct, which modifies the ReAct prompt to follow a conversational manner. (3) ALFChat (3 agents): Based on the 2-agent system, it introduces a grounding agent to provide commonsense facts whenever the assistant outputs the same action three times in a row.

For both tasks, StateFlow achieves the best performance with the lowest cost. For more details, please refer to our paper.

Implement StateFlow With GroupChat

We illustrate how to build StateFlow with GroupChat. Previous blog FSM Group Chat introduces a new feature of GroupChat that allows us to input a transition graph to constrain agent transitions. It requires us to use natural language to describe the transition conditions of the FSM in the agent’s description parameter, and then use an LLM to take in the description and make decisions for the next agent. In this blog, we take advantage of a customized speaker selection function passed to the speaker_selection_method of the GroupChat object. This function allows us to customize the transition logic between agents and can be used together with the transition graph introduced in FSM Group Chat. The current StateFlow implementation also allows the user to override the transition graph. These transitions can be based on the current speaker and static checking of the context history (for example, checking if ‘Error’ is in the last message).

We present an example of how to build a state-oriented workflow using GroupChat. We define a custom speaker selection function to be passed into the speaker_selection_method parameter of the GroupChat. Here, the task is to retrieve research papers related to a given topic and create a markdown table for these papers.

We define the following agents:

  • Initializer: Start the workflow by sending a task.
  • Coder: Retrieve papers from the internet by writing code.
  • Executor: Execute the code.
  • Scientist: Read the papers and write a summary.
# Define the agents, the code is for illustration purposes and is not executable.
initializer = autogen.UserProxyAgent(
   name="Init"
)
coder = autogen.AssistantAgent(
   name="Coder",
   system_message="""You are the Coder. Write Python Code to retrieve papers from arxiv."""
)
executor = autogen.UserProxyAgent(
   name="Executor",
   system_message="Executor. Execute the code written by the Coder and report the result.",
)
scientist = autogen.AssistantAgent(
   name="Scientist",
   system_message="""You are the Scientist. Please categorize papers after seeing their abstracts printed and create a markdown table with Domain, Title, Authors, Summary and Link. Return 'TERMINATE' in the end.""",
)

In the Figure, we define a simple workflow for research with 4 states: Init, Retrieve, Research, and End. Within each state, we will call different agents to perform the tasks.

  • Init: We use the initializer to start the workflow.
  • Retrieve: We will first call the coder to write code and then call the executor to execute the code.
  • Research: We will call the scientist to read the papers and write a summary.
  • End: We will end the workflow.

Then we define a customized function to control the transition between states:

def state_transition(last_speaker, groupchat):
   messages = groupchat.messages

   if last_speaker is initializer:
       # init -> retrieve
       return coder
   elif last_speaker is coder:
       # retrieve: action 1 -> action 2
       return executor
   elif last_speaker is executor:
       if messages[-1]["content"] == "exitcode: 1":
           # retrieve --(execution failed)--> retrieve
           return coder
       else:
           # retrieve --(execution success)--> research
           return scientist
   elif last_speaker == "Scientist":
       # research -> end
       return None


groupchat = autogen.GroupChat(
   agents=[initializer, coder, executor, scientist],
   messages=[],
   max_round=20,
   speaker_selection_method=state_transition,
)

We recommend implementing the transition logic for each speaker in the customized function. In analogy to a state machine, a state transition function determines the next state based on the current state and input. Instead of returning an Agent class representing the next speaker, we can also return a string from ['auto', 'manual', 'random', 'round_robin'] to select a default method to use. For example, we can always default to the built-in auto method to employ an LLM-based group chat manager to select the next speaker. When returning None, the group chat will terminate. Note that some of the transitions, such as “initializer” -> “coder” can be defined with the transition graph.

For Further Reading