The OpenAI Assistants API is OpenAI’s framework for building stateful AI applications — ones that remember conversation history, search through uploaded files, and call your functions automatically. It sits a layer above the raw Chat Completions API and handles the infrastructure complexity that would otherwise land on your server.
What the Assistants API Actually Does
The Assistants API introduces four core primitives:
- Assistants: a configured AI persona with a model, instructions, and enabled tools
- Threads: persistent conversation histories stored server-side, so you never resend earlier turns yourself
- Messages: individual user or assistant turns within a thread
- Runs: an execution of an assistant against a thread that produces a response
The major advantage over direct completions: you don’t manage conversation history yourself. OpenAI stores it, truncates intelligently when context limits approach, and keeps your application code stateless.
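For contrast, here is a minimal sketch of the history trimming you would otherwise write yourself when calling Chat Completions directly. The `trim_history` helper and its rough four-characters-per-token estimate are illustrative assumptions; this is not how OpenAI truncates threads internally.

```python
from typing import Dict, List

Message = Dict[str, str]


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def trim_history(messages: List[Message], max_tokens: int) -> List[Message]:
    """Keep the most recent messages whose estimated size fits in max_tokens.

    Walks the history from newest to oldest and stops at the first message
    that would exceed the budget, so the kept messages stay contiguous.
    """
    kept: List[Message] = []
    used = 0
    for msg in reversed(messages):
        cost = estimate_tokens(msg["content"])
        if used + cost > max_tokens:
            break
        kept.append(msg)
        used += cost
    kept.reverse()  # restore chronological order
    return kept
```

With threads, this bookkeeping (and the storage behind it) moves to OpenAI's side.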
Built-in tools available in 2026:
- File Search (formerly Retrieval): semantic search over uploaded documents
- Code Interpreter: sandboxed Python execution with file I/O
- Function Calling: structured calls to your own APIs
Setup: Install the Python SDK
pip install openai
Set your API key as an environment variable:
export OPENAI_API_KEY="sk-proj-..."
from openai import OpenAI
client = OpenAI() # Reads OPENAI_API_KEY from environment
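If you would rather fail early with a clear message when the variable is missing, a small guard like the following can run before the client is constructed. This is a sketch; `require_api_key` is a hypothetical helper name, not part of the SDK.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before creating the client."
        )
    return key
```

Calling `require_api_key()` at startup surfaces a missing key before the first request is made.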
Creating an Assistant
assistant = client.beta.assistants.create(
    name="Technical Support Bot",
    instructions="""You are a technical support specialist.
Answer questions based on the provided documentation.
Be concise and include code examples where relevant.""",
    model="gpt-4o",
    tools=[
        {"type": "file_search"},
        {"type": "code_interpreter"}
    ]
)
print(f"Assistant ID: {assistant.id}")
# Save this ID — you can reuse it across sessions
Models available in 2026: gpt-4o, gpt-4o-mini, gpt-4-turbo, o3, o4-mini. Use gpt-4o-mini for cost-sensitive applications.
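Since the assistant ID is meant to be reused across sessions, one way to avoid creating a duplicate assistant on every start is to cache the ID locally. The sketch below assumes a small JSON file as the cache; `get_or_create_assistant_id` and the file layout are hypothetical, and the `create` callable stands in for whatever code creates the assistant.

```python
import json
from pathlib import Path
from typing import Callable


def get_or_create_assistant_id(cache_path: Path, create: Callable[[], str]) -> str:
    """Return a cached assistant ID, creating and caching one if absent."""
    if cache_path.exists():
        cached = json.loads(cache_path.read_text()).get("assistant_id")
        if cached:
            return cached
    assistant_id = create()  # only called when nothing is cached yet
    cache_path.write_text(json.dumps({"assistant_id": assistant_id}))
    return assistant_id
```

In practice `create` could be a lambda that wraps the `client.beta.assistants.create(...)` call shown earlier and returns its `.id`.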
Uploading Files for Retrieval
File Search lets your assistant answer questions grounded in your documents — PDFs, text files, code, markdown, and more.
# Upload a file
with open("technical_docs.pdf", "rb") as f:
    file = client.files.create(
        file=f,
        purpose="assistants"
    )
# Create a vector store and attach the file
# (newer SDK releases expose this as client.vector_stores instead of client.beta.vector_stores)
vector_store = client.beta.vector_stores.create(
    name="Technical Docs",
    file_ids=[file.id]
)
# Update the assistant to use this vector store
assistant = client.beta.assistants.update(
    assistant.id,
    tool_resources={
        "file_search": {
            "vector_store_ids": [vector_store.id]
        }
    }
)
OpenAI automatically chunks and embeds your files. Queries to the assistant will semantically search across all attached documents.
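To give a feel for what chunking means here, the following toy sketch splits text into overlapping word-based chunks. It is only an illustration of the general idea: the real service works on tokens, and `chunk_words` with its parameters is a made-up helper, not OpenAI's implementation.

```python
from typing import List


def chunk_words(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Split text into word-based chunks where neighbours share `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which helps retrieval find it.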
Creating Threads and Messages
Threads are free to create and persist indefinitely. Each user session should typically get its own thread:
# Create a new thread for this user session
thread = client.beta.threads.create()
# Add a user message
message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What does the rate limiting section say about retry strategies?"
)
You can attach files at the message level too, for one-off document questions:
message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="Summarize this report",
    attachments=[{
        "file_id": file.id,
        "tools": [{"type": "file_search"}]
    }]
)
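Giving each user session its own thread implies some mapping from your session IDs to thread IDs. A minimal in-memory sketch is shown below; `ThreadRegistry` is a hypothetical class, and a real application would more likely keep this mapping in a database or session store.

```python
from typing import Callable, Dict


class ThreadRegistry:
    """Maps application session IDs to thread IDs, creating threads lazily."""

    def __init__(self, create_thread: Callable[[], str]) -> None:
        # create_thread is any callable that returns a new thread ID.
        self._create_thread = create_thread
        self._threads: Dict[str, str] = {}

    def thread_for(self, session_id: str) -> str:
        """Return the thread ID for a session, creating one on first use."""
        if session_id not in self._threads:
            self._threads[session_id] = self._create_thread()
        return self._threads[session_id]
```

It could be wired up as `ThreadRegistry(lambda: client.beta.threads.create().id)`, so that repeated requests from the same session land in the same thread.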
Running and Polling
A Run triggers the assistant to process the thread and generate a response:
import time
# Create a run
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id
)
# Poll until the run leaves the active states
while run.status in ["queued", "in_progress"]:
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(
        thread_id=thread.id,
        run_id=run.id
    )
# Retrieve the assistant's response (newest message first)
if run.status == "completed":
    messages = client.beta.threads.messages.list(
        thread_id=thread.id,
        order="desc",
        limit=1
    )
    print(messages.data[0].content[0].text.value)
else:
    print(f"Run ended with status: {run.status}")
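A fixed one-second sleep with no upper bound can hang forever if a run gets stuck. The sketch below wraps the same polling idea with capped exponential backoff and a timeout. `wait_for_run` is a hypothetical helper; it takes a `retrieve` callable so that it does not depend on any particular SDK call.

```python
import time
from typing import Any, Callable

# Statuses in which a run is still working, matching the polling loop above.
ACTIVE_STATUSES = {"queued", "in_progress"}


def wait_for_run(
    retrieve: Callable[[], Any],
    timeout: float = 60.0,
    initial_delay: float = 0.5,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll retrieve() until the run leaves the active statuses or time runs out."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    run = retrieve()
    while run.status in ACTIVE_STATUSES:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {run.status} after {timeout} seconds")
        sleep(delay)
        delay = min(delay * 2, max_delay)  # exponential backoff, capped
        run = retrieve()
    return run


# Possible usage (assumes `client`, `thread`, and `run` from the steps above):
# run_id = run.id
# run = wait_for_run(lambda: client.beta.threads.runs.retrieve(
#     thread_id=thread.id, run_id=run_id))
```

The caller still needs to check the returned status, since a run can also end as failed, cancelled, or expired.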
Using the Streaming API (Recommended)
Polling adds latency. The streaming approach delivers tokens as they’re generated:
with client.beta.threads.runs.stream(
    thread_id=thread.id,
    assistant_id=assistant.id
) as stream:
    for text in stream.text_deltas:
        print(text, end="", flush=True)
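Often you want both the live output and the complete reply, for logging or storage. A small sketch of that pattern follows; `collect_text` is a hypothetical helper that works on any iterable of string deltas.

```python
from typing import Callable, Iterable, List


def collect_text(
    deltas: Iterable[str],
    on_delta: Callable[[str], None] = lambda _delta: None,
) -> str:
    """Forward each delta to on_delta as it arrives and return the full text."""
    parts: List[str] = []
    for delta in deltas:
        on_delta(delta)
        parts.append(delta)
    return "".join(parts)
```

Inside the `with` block above, this could be called as `collect_text(stream.text_deltas, lambda t: print(t, end="", flush=True))`.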
Function Calling Integration
Register your custom functions as tools, and the assistant will call them when appropriate:
import json
# Define the assistant with a function tool
assistant = client.beta.assistants.create(
    name="Weather Assistant",
    model="gpt-4o",
    tools=[{
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }]
)
# Handle tool calls during a run
run = client.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id
)
# A new run starts as "queued", so keep polling and answer tool calls
# whenever the run pauses in "requires_action"
while run.status in ["queued", "in_progress", "requires_action"]:
    if run.status == "requires_action":
        tool_calls = run.required_action.submit_tool_outputs.tool_calls
        tool_outputs = []
        for tc in tool_calls:
            args = json.loads(tc.function.arguments)
            # Call your actual function here, passing the parsed args
            result = {"temperature": 22, "condition": "sunny"}
            tool_outputs.append({
                "tool_call_id": tc.id,
                "output": json.dumps(result)
            })
        run = client.beta.threads.runs.submit_tool_outputs(
            thread_id=thread.id,
            run_id=run.id,
            tool_outputs=tool_outputs
        )
    else:
        time.sleep(1)
        run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)
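Once an assistant has more than one function, the "call your actual function here" step is usually handled with a dispatch table. The sketch below shows one way to do that; `HANDLERS`, `build_tool_outputs`, and the placeholder weather function are illustrative names, and the tool-call objects are assumed to have the same `id`, `function.name`, and `function.arguments` fields used in the loop above.

```python
import json
from typing import Any, Callable, Dict, Iterable, List


def get_current_weather(location: str, unit: str = "celsius") -> Dict[str, Any]:
    """Placeholder implementation; a real one would call a weather service."""
    return {"location": location, "unit": unit, "temperature": 22, "condition": "sunny"}


# Maps tool names (as registered on the assistant) to local Python functions.
HANDLERS: Dict[str, Callable[..., Any]] = {"get_current_weather": get_current_weather}


def build_tool_outputs(tool_calls: Iterable[Any]) -> List[Dict[str, str]]:
    """Run the matching handler for each tool call and format the results."""
    outputs: List[Dict[str, str]] = []
    for tc in tool_calls:
        handler = HANDLERS.get(tc.function.name)
        if handler is None:
            result: Any = {"error": f"unknown function {tc.function.name}"}
        else:
            try:
                result = handler(**json.loads(tc.function.arguments))
            except (TypeError, ValueError) as exc:
                # Malformed or unexpected arguments: report back instead of crashing.
                result = {"error": str(exc)}
        outputs.append({"tool_call_id": tc.id, "output": json.dumps(result)})
    return outputs
```

Returning an error payload rather than raising lets the run continue, so the model can explain the problem or retry with different arguments.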
Cost Considerations
Assistants API pricing in 2026 includes:
| Cost Component | Details |
|---|---|
| Input/output tokens | Same as Chat Completions API rates |
| File Search | $0.10 per GB/day for vector storage |
| Code Interpreter | $0.03 per session |
| File storage | $0.20 per GB/day |
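As a quick worked example, the non-token rows of the table can be turned into a rough estimate. The sketch below simply multiplies the rates listed above; `estimate_tool_costs` is a hypothetical helper, token charges are left out because they depend on the model, and the rates should be treated as inputs to update if pricing changes.

```python
from typing import Dict

# Rates in USD, copied from the table above.
VECTOR_STORAGE_PER_GB_DAY = 0.10
FILE_STORAGE_PER_GB_DAY = 0.20
CODE_INTERPRETER_PER_SESSION = 0.03


def estimate_tool_costs(
    vector_gb: float, file_gb: float, sessions: int, days: int = 30
) -> Dict[str, float]:
    """Estimate non-token costs over `days` days from the table's rates."""
    vector = vector_gb * days * VECTOR_STORAGE_PER_GB_DAY
    files = file_gb * days * FILE_STORAGE_PER_GB_DAY
    interpreter = sessions * CODE_INTERPRETER_PER_SESSION
    return {
        "vector_storage": round(vector, 2),
        "file_storage": round(files, 2),
        "code_interpreter": round(interpreter, 2),
        "total": round(vector + files + interpreter, 2),
    }
```

For instance, 2 GB of vector storage, 1 GB of files, and 100 Code Interpreter sessions over 30 days come to 6.00 + 6.00 + 3.00 = 15.00 USD at these rates.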
Cost-saving tips:
- Use gpt-4o-mini for most tasks; it handles retrieval and function calling well
- Delete threads and files when a session ends
- Set max_prompt_tokens and max_completion_tokens on runs to cap costs
- Use truncation_strategy to control how much thread history is included in each run's context
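The run-level caps from the tips above can be collected in one place and passed as keyword arguments when creating a run. This is a sketch: `run_limits` is a hypothetical helper, and the specific numbers in the usage note are arbitrary examples rather than recommendations.

```python
from typing import Any, Dict, Optional


def run_limits(
    max_prompt_tokens: int,
    max_completion_tokens: int,
    last_messages: Optional[int] = None,
) -> Dict[str, Any]:
    """Build keyword arguments that cap token usage for a single run."""
    if max_prompt_tokens <= 0 or max_completion_tokens <= 0:
        raise ValueError("token caps must be positive")
    kwargs: Dict[str, Any] = {
        "max_prompt_tokens": max_prompt_tokens,
        "max_completion_tokens": max_completion_tokens,
    }
    if last_messages is not None:
        # Only the most recent N messages of the thread are sent to the model.
        kwargs["truncation_strategy"] = {
            "type": "last_messages",
            "last_messages": last_messages,
        }
    return kwargs
```

A run could then be created with `client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id, **run_limits(4000, 500, last_messages=10))`.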
Assistants API vs. Chat Completions API
| Scenario | Use Assistants API | Use Chat Completions |
|---|---|---|
| Multi-turn conversations | Yes | Manage history yourself |
| Document Q&A | Yes (File Search) | Use embeddings + RAG |
| Code execution sandbox | Yes | No |
| Simple one-shot queries | Overkill | Yes |
| Streaming responses | Supported | Yes, simpler |
| Full control over context | No | Yes |
| Lower latency | No | Yes |
The Assistants API shines for user-facing chatbots where you want persistent history, document grounding, and tool use without building the scaffolding yourself. For batch processing, RAG pipelines you control, or latency-sensitive applications, direct completions remain the better choice.