AI Agents #3: Collaboration in OpenAI Swarm
Multi-agent systems (MAS) have experienced waves of popularity across various domains over the years, driven by their ability to model and solve complex problems. Now, with the advent of large language models (LLMs), MAS are back in the spotlight, but with a new twist: for the first time, one can communicate with such software modules in natural language. How does this change the way software modules collaborate? Let’s explore OpenAI’s recently open-sourced pedagogical codebase, Swarm, to find out.
Communication, as a manifestation of collaboration, can occur in a variety of ways. In this post, let’s focus on the most basic scenario, where only two agents collaborate at any given time, using direct communication. We’ll further assume that one agent is a controller and the other is a worker. This is a popular setting, applicable to a wide range of applications, and it serves as the basis for the initial examples in OpenAI’s Swarm GitHub repository. But first, what is OpenAI Swarm?
A short introduction to OpenAI Swarm
Swarm is a new multi-agent ‘sample’ framework from OpenAI: a code repository with a handful of example applications that you can explore to learn how agents work together to create user experiences.
In Swarm, an agent is designed to solve a particular task. It is defined by a set of instructions (describing a workflow, but in natural language) and a set of Python functions that can be triggered by those instructions. In addition, the Swarm runtime maintains a dictionary of context variables (with session-global scope) that every agent and function can read and write. Here is an excerpt from the airline example:
from swarm import Agent

def transfer_to_flight_modification():
    """Hand the conversation off to the flight_modification agent."""
    return flight_modification  # flight_modification is defined elsewhere in the example

def triage_instructions(context_variables):
    # Instructions can be a callable that reads the session-wide context variables.
    customer_context = context_variables.get("customer_context", None)
    flight_context = context_variables.get("flight_context", None)
    return f"""You are to triage a user's request, and call a tool to transfer to the right intent.
    Once you are ready to transfer to the right intent, call the tool to transfer to the right intent.
    You don't need to know specifics, just the topic of the request.
    When you need more information to triage the request to an agent, ask a direct question without explaining why you're asking it.
    Do not share your thought process with the user! Do not make unreasonable assumptions on behalf of user.
    The customer context is here: {customer_context}, and flight context is here: {flight_context}"""

triage = Agent(
    name="Triage Agent",
    instructions=triage_instructions,
    functions=[
        transfer_to_flight_modification,
        transfer_to_lost_baggage,  # defined elsewhere in the example
    ],
)
...
flight_change = Agent(
    name="Flight change traversal",
    instructions=STARTER_PROMPT + FLIGHT_CHANGE_POLICY,  # policy prompts defined elsewhere in the example
    functions=[
        escalate_to_agent,
        change_flight,
        valid_to_change_flight,
        transfer_to_triage,
        case_resolved,
    ],
)
These agents, along with the user session (the conversation history and context variables), are managed by the framework runtime, accessible via a client, which runs a loop that forwards user input to the currently active agent. When control is handed off to an agent, it responds to the input based on its predefined instructions and may call one of its functions. If that function returns another agent, control of the user session passes to the new agent. This allows agents to be composed statically into a network of ‘agents,’ enabling the solution of increasingly complex problems.
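To make the loop concrete, here is a minimal sketch of driving the triage agent defined above through the Swarm client. The message and context values are made up for illustration, but client.run() with agent, messages, and context_variables is the entry point the framework exposes.
from swarm import Swarm

client = Swarm()

# The runtime forwards the user message to the active agent. If one of the
# agent's functions returns another agent, the response reflects the handoff.
response = client.run(
    agent=triage,
    messages=[{"role": "user", "content": "I need to change my flight."}],
    context_variables={
        "customer_context": "Name: Jane Doe, status: Gold",      # illustrative
        "flight_context": "Flight BA-117, LHR to JFK, tomorrow",  # illustrative
    },
)

print(response.agent.name)               # e.g. "Flight change traversal" after a handoff
print(response.messages[-1]["content"])  # the latest assistant reply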
How is the collaborating agent picked?
One of the first questions about collaboration is: how does the controller determine which module or agent to call? Is it purely decided by the input (ultimately from the user), or does the business have some influence?
Businesses always want to differentiate their offerings, so they need control over how their software behaves. Compared to function calling, which translates user input into a semantic representation of a function invocation, Swarm agents can dispatch control based not only on user input but also on instructions predefined by the business, which allows business goals, policies, or user-specific preferences to be incorporated into the decision-making process. This makes it easier for businesses to build useful and profitable applications.
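To illustrate, a business rule can live directly in the instructions that drive dispatch. The agent, its policy text, and the two target agents below are hypothetical, not from the repository:
from swarm import Agent

rebooking_agent = Agent(name="Rebooking Agent", instructions="Offer a free rebooking.")
refunds_agent = Agent(name="Refunds Agent", instructions="Process the refund.")

def transfer_to_rebooking():
    """Hand off to the hypothetical rebooking agent."""
    return rebooking_agent

def transfer_to_refunds():
    """Hand off to the hypothetical refunds agent."""
    return refunds_agent

# The business policy lives in the instructions: even when the user asks for
# a refund outright, dispatch is steered toward rebooking first.
retention_triage = Agent(
    name="Retention Triage Agent",
    instructions=(
        "Route the user's request. Company policy: when a customer asks to "
        "cancel or get a refund, first offer a free rebooking; transfer to "
        "refunds only if they decline."
    ),
    functions=[transfer_to_rebooking, transfer_to_refunds],
)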
How is information exchanged during integration?
Another important decision about collaboration is how information should be exchanged between agents so that the task is carried out effectively from the controller’s perspective: in natural language, or in a structured semantic representation?
In the past, software modules, often written in different programming languages and integrated via gRPC or RESTful APIs, collaborated through function calls: the controller sent parameters to the worker module and got a structured result back. With instruction-tuned LLMs, as in Swarm, the input to an agent is natural language, and the output can be a natural-language message or a function call.
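Side by side, with a made-up change_flight function standing in for a worker module, the two styles of exchange look like this:
# Structured exchange: the contract is an explicit, typed schema.
def change_flight(booking_id: str, new_date: str) -> dict:
    return {"booking_id": booking_id, "new_date": new_date, "status": "confirmed"}

result = change_flight("BK123", "2025-01-15")  # checked at the call boundary

# Natural-language exchange: the contract is implicit. The worker agent's LLM
# must recover the same parameters from free-form text, and two different
# models may not recover them identically.
user_message = "Can you move my booking BK123 to January 15th?"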
Sending input as natural language makes it possible to solve problems when we don’t know how to express them at the schema level, but there are downsides. For example, different LLMs may interpret the same text differently, potentially introducing inconsistencies in behavior or responses, which is problematic when a consistent user experience is required.
If we decompose an agent into three components — dialogue understanding (input, or converting text to data), a workflow that completes the task, and response generation (output, or converting data to text) — and use an overall input and output layer, we can integrate workflows at the semantic level (i.e., using any formal programming language). Since the labels for agents and functions in the semantic space must be unique, they are defined based on careful analysis. Integration at this level provides a solid foundation for a consistent user experience and allows us to directly leverage a vast number of existing, well-tested traditional software modules without needing to wrap them first. This approach is especially useful when the problem is well-defined, or corresponding API schemas are available.
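Here is a minimal sketch of that decomposition; the frame type and the two llm_* stand-ins are assumptions for illustration, not Swarm APIs:
from dataclasses import dataclass

@dataclass
class ChangeFlightFrame:
    """A semantic frame: a unique label plus typed slots."""
    booking_id: str
    new_date: str

def llm_extract(text: str) -> ChangeFlightFrame:
    """Dialogue understanding (input): text -> data. A real system would call
    an LLM with a structured-output schema here; this is a canned stand-in."""
    return ChangeFlightFrame(booking_id="BK123", new_date="2025-01-15")

def change_flight(frame: ChangeFlightFrame) -> dict:
    """The workflow: ordinary, well-tested code operating on typed data,
    integrable at the semantic level from any programming language."""
    return {"booking_id": frame.booking_id, "new_date": frame.new_date, "status": "confirmed"}

def llm_render(result: dict) -> str:
    """Response generation (output): data -> text. Also an LLM call in practice."""
    return f"Done: booking {result['booking_id']} moved to {result['new_date']}."

def handle(text: str) -> str:
    # The controller-worker contract sits between understanding and generation.
    return llm_render(change_flight(llm_extract(text)))

print(handle("Can you move my booking BK123 to January 15th?"))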
How is dispatch done?
Another important question is how the dispatch should be handled: statically at compile time, or dynamically at runtime? Despite being introduced in late 2024, Swarm shares the same design flaw as the original Dialogflow ES from Google: procedural, static dispatching of functionalities.
Compared to the dynamic dispatching offered by modern object-oriented languages, where the same event representing a user’s intent can be handled by different objects with different implementations in a decoupled manner, static dispatch requires defining flows for all combinations of user input and business logic. This approach makes it cumbersome and verbose for businesses to define user experiences, especially since they do not have control over what users might express at each turn. To provide enough flexibility for real enterprise use cases, Google had to come up with a completely different offering: Dialogflow CX. Therefore, this is expected to be a limiting factor for Swarm as well.
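A toy Python sketch (not Swarm’s API) of the two dispatch styles:
# Static dispatch: every (input, business-state) combination is enumerated up front.
def handle_statically(intent: str, user_tier: str) -> str:
    if intent == "change_flight" and user_tier == "gold":
        return "Rebook for free."
    if intent == "change_flight" and user_tier == "basic":
        return "Rebook with a fee."
    if intent == "refund" and user_tier == "gold":
        return "Refund immediately."
    # ... one branch per combination, and the list only grows.
    return "Escalate."

# Dynamic dispatch: the same intent event is handled by whichever handler
# object is active, each with its own implementation, added independently.
class Handler:
    def on_change_flight(self) -> str: ...
    def on_refund(self) -> str: ...

class GoldHandler(Handler):
    def on_change_flight(self) -> str:
        return "Rebook for free."
    def on_refund(self) -> str:
        return "Refund immediately."

class BasicHandler(Handler):
    def on_change_flight(self) -> str:
        return "Rebook with a fee."
    def on_refund(self) -> str:
        return "Escalate."

def handle_dynamically(handler: Handler, intent: str) -> str:
    # The event is routed to the active handler's implementation at runtime.
    return getattr(handler, f"on_{intent}")()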
Parting words
Building consumer products is different from developing tools that help businesses create user-facing experiences. For the former, the focus is on delivering what the general public desires; for the latter, the goal is to empower businesses to define precise, repeatable, and unique behaviors. A hybrid AI approach, such as agents, enables businesses to leverage readily available traditional software modules while delivering consistent user experiences, making it a trending solution for the latter.
Swarm represents a solid step forward in building hybrid neural-symbolic applications by focusing on injecting business logic to shape a consistent user experience. However, for well-defined problems, it might be useful to further decompose LLM capabilities into natural language understanding and declarative problem solving, which would allow us to treat hard-coded functions and soft, LLM-based functionality (agents) uniformly. Yes, Swarm adopts tried-and-true techniques, such as an event-driven design, but as a framework for building conversational user experiences (in the form of agents), where is the good old model-view-controller?