Advisor Tool
Pair a faster executor model with a higher-intelligence advisor model that provides strategic guidance mid-generation.
The advisor tool lets a fast, lower-cost executor model (Sonnet or Haiku) consult a high-intelligence advisor model (Opus 4.6) mid-generation. The advisor reads the full conversation and produces a plan or course correction — typically 400–700 text tokens — and the executor continues with the task.
This pattern is well-suited for long-horizon agentic workloads (coding agents, computer use, multi-step research) where most turns are mechanical but having an excellent plan is crucial. You get close to advisor-solo quality while the bulk of token generation happens at executor-model rates.
The advisor tool is in beta. Include `anthropic-beta: advisor-tool-2026-03-01` in your requests — LiteLLM adds this automatically when it detects the advisor tool in your `tools` array.
Supported Providers
| Provider | Chat Completions API | Messages API | Notes |
|---|---|---|---|
| Anthropic API | ✅ | ✅ | Native — runs server-side |
| OpenAI / Azure OpenAI | ✅ | ✅ | LiteLLM orchestration loop |
| Amazon Bedrock | ✅ | ✅ | LiteLLM orchestration loop |
| Google Vertex AI | ✅ | ✅ | LiteLLM orchestration loop |
| Groq / Mistral / others | ✅ | ✅ | LiteLLM orchestration loop |
How it works (LiteLLM native orchestration)
For non-Anthropic providers, LiteLLM implements the advisor loop itself. The API you call is identical — LiteLLM handles everything transparently.
When a request arrives with an `advisor_20260301` tool and a non-Anthropic provider, `AdvisorOrchestrationHandler` intercepts it. It translates the advisor tool into a regular function tool the provider understands, then runs an orchestration loop.
What LiteLLM does for you:

- Strips `advisor_20260301` from the outgoing request — the provider only sees a standard function tool named `advisor`
- When the executor calls it, intercepts before the result reaches you, runs the advisor sub-call, and injects the advice
- Strips any `advisor_tool_result` / `server_tool_use` blocks from message history on re-send so non-Anthropic providers never see Anthropic-specific types
- Wraps the final response in an SSE stream if you requested `stream=True`
- Enforces `max_uses` as a hard cap — `AdvisorMaxIterationsError` is raised if exceeded; `max_uses=0` disables the advisor entirely
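The loop can be pictured as follows. This is a simplified sketch of the idea, not LiteLLM's actual implementation; `call_executor` and `call_advisor` are stand-ins for the real provider calls, and the message shapes are illustrative:

```python
# Simplified sketch of the advisor orchestration loop (illustrative only).
# call_executor / call_advisor stand in for the real provider sub-calls.

def run_advisor_loop(messages, call_executor, call_advisor, max_uses=3):
    uses = 0
    while True:
        # The executor sees a plain "advisor" function tool.
        reply = call_executor(messages)
        if reply.get("tool_call") != "advisor":
            # No advisor call: this is the final answer.
            return reply
        if uses >= max_uses:
            raise RuntimeError("AdvisorMaxIterationsError: advisor call cap exceeded")
        uses += 1
        # Sub-call to the advisor model, which reads the full conversation.
        advice = call_advisor(messages)
        # Inject the advice as a tool result and let the executor continue.
        messages = messages + [
            {"role": "assistant", "content": reply.get("content", "")},
            {"role": "tool", "name": "advisor", "content": advice},
        ]
```

The key property is that the caller never sees the intermediate advisor round-trips; only the executor's final reply is returned.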
Model Compatibility

The executor and advisor models must form a valid pair. Currently the only supported advisor model is `claude-opus-4-6`.
| Executor | Advisor |
|---|---|
| claude-haiku-4-5-20251001 | claude-opus-4-6 |
| claude-sonnet-4-6 | claude-opus-4-6 |
| claude-opus-4-6 | claude-opus-4-6 |
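If you want to fail fast before sending a request, the pairing rule above can be checked client-side. This is a hypothetical helper, not part of LiteLLM:

```python
# Hypothetical client-side check of the executor/advisor pairing rule
# from the compatibility table (not a LiteLLM API).
SUPPORTED_ADVISORS = {"claude-opus-4-6"}
SUPPORTED_EXECUTORS = {
    "claude-haiku-4-5-20251001",
    "claude-sonnet-4-6",
    "claude-opus-4-6",
}

def is_valid_pair(executor: str, advisor: str) -> bool:
    """Return True if the executor/advisor combination is a supported pair."""
    return executor in SUPPORTED_EXECUTORS and advisor in SUPPORTED_ADVISORS
```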
Chat Completions API

SDK Usage

Basic Example

```python
import litellm

response = litellm.completion(
    model="anthropic/claude-sonnet-4-6",
    messages=[
        {"role": "user", "content": "Build a concurrent worker pool in Go with graceful shutdown."}
    ],
    tools=[
        {
            "type": "advisor_20260301",
            "name": "advisor",
            "model": "claude-opus-4-6",
        }
    ],
    max_tokens=4096,
)

print(response.choices[0].message.content)
```
With Optional Parameters

```python
import litellm

response = litellm.completion(
    model="anthropic/claude-sonnet-4-6",
    messages=[
        {"role": "user", "content": "Build a REST API with authentication in Python."}
    ],
    tools=[
        {
            "type": "advisor_20260301",
            "name": "advisor",
            "model": "claude-opus-4-6",
            "max_uses": 3,  # cap advisor calls per request
            "caching": {"type": "ephemeral", "ttl": "5m"},  # enable for 3+ calls per conversation
        }
    ],
    max_tokens=4096,
)
```
Streaming

```python
import litellm

response = litellm.completion(
    model="anthropic/claude-sonnet-4-6",
    messages=[
        {"role": "user", "content": "Implement a distributed rate limiter."}
    ],
    tools=[
        {
            "type": "advisor_20260301",
            "name": "advisor",
            "model": "claude-opus-4-6",
        }
    ],
    max_tokens=4096,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```
The advisor sub-inference does not stream. The executor's stream pauses while the advisor runs, then the full advisor result arrives in a single event. Executor output resumes streaming afterward.
Multi-Turn Conversation

```python
import litellm

tools = [
    {
        "type": "advisor_20260301",
        "name": "advisor",
        "model": "claude-opus-4-6",
    }
]

messages = [
    {"role": "user", "content": "Build a concurrent worker pool in Go with graceful shutdown."}
]

response = litellm.completion(
    model="anthropic/claude-sonnet-4-6",
    messages=messages,
    tools=tools,
    max_tokens=4096,
)

# Append the full response (includes server_tool_use + advisor_tool_result blocks)
messages.append({"role": "assistant", "content": response.choices[0].message.content})

# Continue the conversation — keep the same tools array
messages.append({"role": "user", "content": "Now add a max-in-flight limit of 10."})

response2 = litellm.completion(
    model="anthropic/claude-sonnet-4-6",
    messages=messages,
    tools=tools,
    max_tokens=4096,
)
```
LiteLLM automatically strips `advisor_tool_result` blocks from message history when the advisor tool is not present in the current request. This prevents the Anthropic 400 error that would otherwise occur.
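Conceptually, that stripping step is just a filter over assistant content blocks. A minimal sketch of the idea, assuming the list-of-blocks message shape shown later in this page:

```python
# Sketch of the history-sanitizing step: drop Anthropic-specific block
# types so providers that don't understand them never see them.
ANTHROPIC_ONLY_TYPES = {"server_tool_use", "advisor_tool_result"}

def strip_advisor_blocks(messages):
    """Return a copy of the history with advisor-specific blocks removed."""
    cleaned = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            # Keep only blocks whose type is not Anthropic-specific.
            content = [b for b in content if b.get("type") not in ANTHROPIC_ONLY_TYPES]
        cleaned.append({**msg, "content": content})
    return cleaned
```

Plain string content passes through untouched; only list-style content is filtered.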
AI Gateway Usage

Proxy Configuration

```yaml
model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
```
Client Request via Proxy

```python
from openai import OpenAI

client = OpenAI(
    api_key="your-litellm-proxy-key",
    base_url="http://0.0.0.0:4000/v1"
)

response = client.chat.completions.create(
    model="claude-sonnet",
    messages=[
        {"role": "user", "content": "Implement a distributed rate limiter in Python."}
    ],
    tools=[
        {
            "type": "advisor_20260301",
            "name": "advisor",
            "model": "claude-opus-4-6",
        }
    ],
    max_tokens=4096,
)
```
Messages API

SDK Usage

Basic Example

```python
import asyncio
import litellm

async def main():
    response = await litellm.anthropic.messages.acreate(
        model="anthropic/claude-sonnet-4-6",
        messages=[
            {"role": "user", "content": "Build a concurrent worker pool in Go with graceful shutdown."}
        ],
        tools=[
            {
                "type": "advisor_20260301",
                "name": "advisor",
                "model": "claude-opus-4-6",
            }
        ],
        max_tokens=4096,
    )
    print(response)

asyncio.run(main())
```
Streaming

```python
import asyncio
import json
import litellm

async def main():
    response = await litellm.anthropic.messages.acreate(
        model="anthropic/claude-sonnet-4-6",
        messages=[
            {"role": "user", "content": "Implement a distributed rate limiter."}
        ],
        tools=[
            {
                "type": "advisor_20260301",
                "name": "advisor",
                "model": "claude-opus-4-6",
            }
        ],
        max_tokens=4096,
        stream=True,
    )
    async for chunk in response:
        if isinstance(chunk, bytes):
            for line in chunk.decode("utf-8").split("\n"):
                if line.startswith("data: "):
                    try:
                        print(json.loads(line[6:]))
                    except json.JSONDecodeError:
                        pass

asyncio.run(main())
```
AI Gateway Usage

Proxy Configuration

```yaml
model_list:
  - model_name: claude-sonnet
    litellm_params:
      model: anthropic/claude-sonnet-4-6
      api_key: os.environ/ANTHROPIC_API_KEY
```
Client Request via Proxy (Anthropic SDK)

```python
import anthropic

client = anthropic.Anthropic(
    api_key="your-litellm-proxy-key",
    base_url="http://0.0.0.0:4000"
)

response = client.beta.messages.create(
    model="claude-sonnet",
    max_tokens=4096,
    betas=["advisor-tool-2026-03-01"],
    messages=[
        {"role": "user", "content": "Build a concurrent worker pool in Go with graceful shutdown."}
    ],
    tools=[
        {
            "type": "advisor_20260301",
            "name": "advisor",
            "model": "claude-opus-4-6",
        }
    ],
)

print(response)
```
Non-Anthropic Provider (LiteLLM orchestration loop)

```python
import asyncio
import litellm

async def main():
    # executor: openai/gpt-4.1-mini | advisor: claude-opus-4-6
    # LiteLLM runs the orchestration loop automatically
    response = await litellm.anthropic.messages.acreate(
        model="openai/gpt-4.1-mini",
        messages=[
            {"role": "user", "content": "Implement a Python LRU cache with O(1) get and put."}
        ],
        tools=[
            {
                "type": "advisor_20260301",
                "name": "advisor",
                "model": "claude-opus-4-6",
                "max_uses": 3,
            }
        ],
        max_tokens=1024,
        custom_llm_provider="openai",
    )
    # Final response is clean — no advisor tool_use blocks
    print(response["content"][0]["text"])

asyncio.run(main())
```
Response Structure

A successful advisor call returns `server_tool_use` and `advisor_tool_result` blocks in the assistant content:

```json
{
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "Let me consult the advisor on this."
    },
    {
      "type": "server_tool_use",
      "id": "srvtoolu_abc123",
      "name": "advisor",
      "input": {}
    },
    {
      "type": "advisor_tool_result",
      "tool_use_id": "srvtoolu_abc123",
      "content": {
        "type": "advisor_result",
        "text": "Use a channel-based coordination pattern. The tricky part is draining in-flight work during shutdown: close the input channel first, then wait on a WaitGroup..."
      }
    },
    {
      "type": "text",
      "text": "Here's the implementation using a channel-based coordination pattern..."
    }
  ]
}
```
Pass the full assistant content, including advisor blocks, back on subsequent turns. LiteLLM handles this automatically through `provider_specific_fields`.
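If you want to log or display the advice itself, you can pull it out of the content-block list. A small sketch, assuming the block shapes shown in the Response Structure example:

```python
def extract_advice(assistant_content):
    """Collect advisor_result text from an assistant content-block list."""
    return [
        block["content"]["text"]
        for block in assistant_content
        if block.get("type") == "advisor_tool_result"
    ]
```

A request with multiple advisor calls yields one entry per `advisor_tool_result` block, in order.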
Cost Control

Advisor calls run as a separate sub-inference billed at the advisor model's rates. Usage is reported in `usage.iterations[]`:

```json
{
  "usage": {
    "input_tokens": 412,
    "output_tokens": 531,
    "iterations": [
      {
        "type": "message",
        "input_tokens": 412,
        "output_tokens": 89
      },
      {
        "type": "advisor_message",
        "model": "claude-opus-4-6",
        "input_tokens": 823,
        "output_tokens": 1612
      },
      {
        "type": "message",
        "input_tokens": 1348,
        "output_tokens": 442
      }
    ]
  }
}
```

Top-level `usage` reflects executor tokens only. Advisor tokens appear in `iterations` entries with `type: "advisor_message"` and are billed at Opus rates.
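To track advisor spend per request, sum the `advisor_message` entries from `usage.iterations`. A sketch, assuming the usage shape shown above:

```python
def advisor_token_totals(usage):
    """Sum (input_tokens, output_tokens) across advisor_message iterations."""
    inp = out = 0
    for it in usage.get("iterations", []):
        if it.get("type") == "advisor_message":
            inp += it.get("input_tokens", 0)
            out += it.get("output_tokens", 0)
    return inp, out
```

Multiply the totals by your advisor model's per-token rates to get the advisor-side cost of the request.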
Tips:

- Enable `caching` on the tool definition only when you expect 3+ advisor calls per conversation; it costs more than it saves below that threshold.
- Use `max_uses` to cap advisor calls per request. Once reached, the executor continues without further advice.
- For conversation-level caps, count advisor calls client-side. When you reach your limit, remove the advisor tool from `tools`.
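The conversation-level cap can be sketched like this: count `server_tool_use` blocks for the advisor in the accumulated history, and drop the tool from `tools` once the budget is spent. A hypothetical helper, assuming the content-block message shape from the Response Structure section:

```python
# Hypothetical conversation-level advisor budget (client-side, not a LiteLLM API).
ADVISOR_TOOL = {"type": "advisor_20260301", "name": "advisor", "model": "claude-opus-4-6"}

def count_advisor_calls(messages):
    """Count advisor server_tool_use blocks across the conversation history."""
    n = 0
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            n += sum(
                1 for b in content
                if b.get("type") == "server_tool_use" and b.get("name") == "advisor"
            )
    return n

def tools_for_turn(messages, other_tools, conversation_cap=5):
    """Include the advisor tool only while under the conversation-level budget."""
    if count_advisor_calls(messages) < conversation_cap:
        return list(other_tools) + [ADVISOR_TOOL]
    return list(other_tools)
```

Call `tools_for_turn` before each request instead of passing a static `tools` list.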
Recommended System Prompt
For coding and agent tasks, Anthropic recommends prepending these blocks to your system prompt for consistent advisor timing and optimal cost/quality:
```text
You have access to an `advisor` tool backed by a stronger reviewer model. It takes NO parameters — when you call advisor(), your entire conversation history is automatically forwarded. They see the task, every tool call you've made, every result you've seen.

Call advisor BEFORE substantive work — before writing, before committing to an interpretation, before building on an assumption. If the task requires orientation first (finding files, fetching a source, seeing what's there), do that, then call advisor. Orientation is not substantive work. Writing, editing, and declaring an answer are.

Also call advisor:
- When you believe the task is complete. BEFORE this call, make your deliverable durable: write the file, save the result, commit the change.
- When stuck — errors recurring, approach not converging, results that don't fit.
- When considering a change of approach.

On tasks longer than a few steps, call advisor at least once before committing to an approach and once before declaring done. On short reactive tasks where the next action is dictated by tool output you just read, you don't need to keep calling.

Give the advice serious weight. If you follow a step and it fails empirically, or you have primary-source evidence that contradicts a specific claim, adapt. A passing self-test is not evidence the advice is wrong.

If you've already retrieved data pointing one way and the advisor points another: don't silently switch. Surface the conflict in one more advisor call — "I found X, you suggest Y, which constraint breaks the tie?"
```
To reduce advisor output length by 35–45% without losing quality, add:
```text
The advisor should respond in under 100 words and use enumerated steps, not explanations.
```