Vision models accept images alongside text for description, classification, and visual Q&A.
- `gemma3` — Google's multimodal model
- `gemma4` — Latest with vision, tools, thinking, audio (cloud)
- `qwen3-vl` — Qwen vision-language model (cloud)
- `llava` — LLaVA series
- `kimi-k2.5` — Multimodal agentic (cloud)

Quick start from the CLI:

```shell
ollama run gemma3 ./image.png "What's in this image?"
```
Images must be base64-encoded in the REST API. SDKs accept file paths, URLs, or raw bytes.
```shell
# 1. Encode the image
IMG=$(base64 < photo.jpg | tr -d '\n')

# 2. Send to Ollama
curl http://localhost:11434/api/chat -d '{
  "model": "gemma3",
  "messages": [{
    "role": "user",
    "content": "What is in this image?",
    "images": ["'"$IMG"'"]
  }],
  "stream": false
}'
```
The `/api/generate` endpoint accepts images the same way:

```shell
curl http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "Describe what you see",
  "images": ["iVBORw0KGgoAAAANSUhEUg...base64..."],
  "stream": false
}'
```
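The same payload can be assembled in Python without the shell pipeline. A minimal sketch, using stand-in bytes in place of a real image file:

```python
import base64
import json

# Stand-in for an image file's raw bytes; in practice use
# Path("photo.jpg").read_bytes().
img_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 16

# base64.b64encode never emits newlines, so no tr -d '\n' step is needed.
img_b64 = base64.b64encode(img_bytes).decode("ascii")

payload = json.dumps({
    "model": "gemma3",
    "messages": [{
        "role": "user",
        "content": "What is in this image?",
        "images": [img_b64],
    }],
    "stream": False,
})
```

POST `payload` to `http://localhost:11434/api/chat` to get the same result as the curl command above.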
File path (the SDK handles encoding):

```python
from ollama import chat

response = chat(
    model='gemma3',
    messages=[{
        'role': 'user',
        'content': 'What is in this image? Be concise.',
        'images': ['/path/to/image.jpg'],
    }],
)
print(response.message.content)
```
Raw bytes:

```python
from pathlib import Path

img_bytes = Path('/path/to/image.jpg').read_bytes()
response = chat(
    model='gemma3',
    messages=[{
        'role': 'user',
        'content': 'Describe this image.',
        'images': [img_bytes],
    }],
)
```
Base64 string:

```python
import base64

img_b64 = base64.b64encode(Path('/path/to/image.jpg').read_bytes()).decode()
response = chat(
    model='gemma3',
    messages=[{
        'role': 'user',
        'content': 'What do you see?',
        'images': [img_b64],
    }],
)
```
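Under the hood, all three forms end up as base64 text on the wire. A minimal sketch of that normalization (the helper name `to_base64` is illustrative, not part of the SDK, and the SDK's actual logic may differ):

```python
import base64
from pathlib import Path


def to_base64(image) -> str:
    """Normalize a file path, raw bytes, or base64 string to base64 text."""
    if isinstance(image, (bytes, bytearray)):
        return base64.b64encode(bytes(image)).decode("ascii")
    path = Path(str(image))
    if path.is_file():
        # A path to an existing file: read and encode it.
        return base64.b64encode(path.read_bytes()).decode("ascii")
    # Otherwise assume the string is already base64.
    return str(image)
```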
```javascript
import ollama from 'ollama'

const response = await ollama.chat({
  model: 'gemma3',
  messages: [{
    role: 'user',
    content: 'What is in this image?',
    images: ['/absolute/path/to/image.jpg'],
  }],
  stream: false,
})
console.log(response.message.content)
```
```python
from openai import OpenAI

client = OpenAI(base_url='http://localhost:11434/v1/', api_key='ollama')

response = client.chat.completions.create(
    model='gemma3',
    messages=[{
        'role': 'user',
        'content': [
            {'type': 'text', 'text': "What's in this image?"},
            # image_url must be an object with a 'url' key, per the OpenAI API
            {'type': 'image_url', 'image_url': {'url': 'data:image/png;base64,iVBORw0KGgo...'}},
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```
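Unlike the native API, the OpenAI-compatible endpoint expects a data URL rather than bare base64. A minimal sketch of building one, with stand-in bytes in place of a real PNG:

```python
import base64

# Stand-in for real PNG bytes; in practice read them from a file.
img_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8

# Prefix the base64 payload with the data-URL scheme and MIME type.
data_url = "data:image/png;base64," + base64.b64encode(img_bytes).decode("ascii")
```

Pass `data_url` as the `'url'` value in the `image_url` content part.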
See structured_outputs.md for combining vision with JSON schemas.
The existing OllamaAPI already supports images via kwargs:
```python
import base64
from pathlib import Path

# Encode image
img_b64 = base64.b64encode(Path("screenshot.png").read_bytes()).decode()

# Via OllamaAPI
result = await ollama_api.generate_text(
    prompt="Describe what you see in this dashboard screenshot",
    model="gemma3",  # Must be a vision model
    images=[img_b64]
)
```
A helper that combines vision with structured outputs:

```python
import base64
from pathlib import Path

from ollama import chat
from pydantic import BaseModel


class AvatarAnalysis(BaseModel):
    style: str
    quality_score: float
    colors: list[str]
    description: str


async def analyze_avatar(image_path: str) -> AvatarAnalysis:
    """Use a vision model to analyze generated avatar quality."""
    img_b64 = base64.b64encode(Path(image_path).read_bytes()).decode()
    response = chat(
        model='gemma3',
        messages=[{
            'role': 'user',
            'content': 'Analyze this avatar image for quality and style.',
            'images': [img_b64],
        }],
        format=AvatarAnalysis.model_json_schema(),
        options={'temperature': 0},
    )
    return AvatarAnalysis.model_validate_json(response.message.content)
```