This document describes the updates made to the Ollama API integration in `api/ollama_url.py` to align with the official Ollama API documentation.
## Timeout Configuration

- Previous: 10s total, 5s sock_read (too short for large models)
- Updated: 120s total, 60s sock_read (sufficient for large-model inference)

```python
# Session timeout for inference requests
timeout=aiohttp.ClientTimeout(
    total=120,     # 120-second total timeout for large models
    connect=10,    # 10-second connection timeout
    sock_read=60,  # 60-second read timeout for inference
)
```
Impact: Prevents premature timeouts with larger models (e.g., 30B+ parameter models)
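The effect of the larger total budget can be illustrated with a stdlib-only sketch; `asyncio.wait_for` stands in for aiohttp's client timeout here, and `fake_inference` / `generate_with_budget` are placeholder names, not functions from `api/ollama_url.py`:

```python
import asyncio

TOTAL_TIMEOUT = 120  # seconds, mirrors aiohttp.ClientTimeout(total=120)

async def fake_inference(delay: float) -> str:
    # Stand-in for the real Ollama HTTP request
    await asyncio.sleep(delay)
    return "ok"

async def generate_with_budget(delay: float, budget: float = TOTAL_TIMEOUT) -> str:
    # Enforce a total time budget around the inference call
    try:
        return await asyncio.wait_for(fake_inference(delay), timeout=budget)
    except asyncio.TimeoutError:
        return "timeout"

print(asyncio.run(generate_with_budget(0.01)))               # ok
print(asyncio.run(generate_with_budget(0.5, budget=0.05)))   # timeout
```

A slow model that previously exceeded the 10s budget now has 120s to finish; only genuinely stuck requests hit the timeout path.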
## keep_alive Parameter

Per Ollama API Docs: The `keep_alive` parameter controls how long the model stays loaded in memory after a request (default: 5m).
Implementation:
```python
payload = {
    "model": model,
    "prompt": prompt,
    "stream": False,
    "keep_alive": kwargs.get("keep_alive", "5m"),  # Keep model loaded (default 5m)
    "options": {
        "num_predict": max_tokens,
        "temperature": temperature,
    }
}
```
Benefits:

- The model stays loaded between requests instead of being reloaded each time, controlled via the `keep_alive` parameter

## Additional Parameters

Added Support For:
- `format`: JSON mode or structured outputs (JSON schema)
- `system`: System message override
- `template`: Prompt template override
- `raw`: Bypass templating system
- `suffix`: Text after model response (for code completion)
- `images`: Base64-encoded images (for multimodal models)
- `think`: Enable thinking mode (for thinking models)

Implementation:
```python
# Add any additional top-level parameters from kwargs
for key in ["format", "system", "template", "raw", "suffix", "images", "think"]:
    if key in kwargs:
        payload[key] = kwargs[key]
```
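Taken together, the payload assembly above can be sketched as a small helper. `build_payload` is a hypothetical name for illustration; the real code in `api/ollama_url.py` may be structured differently:

```python
# Hypothetical sketch of the payload assembly described above.
OPTIONAL_KEYS = ["format", "system", "template", "raw", "suffix", "images", "think"]

def build_payload(model, prompt, max_tokens=256, temperature=0.7, **kwargs):
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": kwargs.get("keep_alive", "5m"),  # keep model loaded
        "options": {"num_predict": max_tokens, "temperature": temperature},
    }
    # Forward any recognized top-level parameters unchanged
    for key in OPTIONAL_KEYS:
        if key in kwargs:
            payload[key] = kwargs[key]
    return payload

p = build_payload("llama3.2", "Why is the sky blue?", format="json")
print(p["format"], p["keep_alive"])  # json 5m
```

Unrecognized kwargs are silently ignored rather than forwarded, so a typo in a parameter name never reaches the Ollama server.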
## Token Counting

- Previous: Estimated tokens using word count
- Updated: Uses actual token counts from the Ollama API response
```python
# Extract performance metrics if available (per Ollama API docs)
eval_count = data.get("eval_count", 0)
prompt_eval_count = data.get("prompt_eval_count", 0)

# Use actual token counts if available, otherwise estimate
if eval_count > 0 or prompt_eval_count > 0:
    total_tokens = eval_count + prompt_eval_count
    self.metrics.total_tokens += total_tokens
else:
    # Fallback estimation: ~1.3 tokens per whitespace-delimited word
    estimated_tokens = len(prompt.split()) * 1.3 + len(content.split()) * 1.3
    self.metrics.total_tokens += int(estimated_tokens)
```
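The same logic as a self-contained helper; `count_tokens` is an illustrative name, not a function from the module:

```python
def count_tokens(data: dict, prompt: str, content: str) -> int:
    """Return token usage from an Ollama response, falling back to a
    rough word-count estimate (~1.3 tokens per word) when the API
    omits eval_count / prompt_eval_count."""
    eval_count = data.get("eval_count", 0)
    prompt_eval_count = data.get("prompt_eval_count", 0)
    if eval_count > 0 or prompt_eval_count > 0:
        return eval_count + prompt_eval_count
    return int(len(prompt.split()) * 1.3 + len(content.split()) * 1.3)

# Actual counts from the API response win over the estimate
print(count_tokens({"eval_count": 42, "prompt_eval_count": 8}, "", ""))  # 50
```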
Benefits:

- Token metrics reflect actual model usage instead of a word-count estimate
## Error Handling

Added Specific Timeout Error Handling:
```python
except asyncio.TimeoutError as e:
    logger.error(f"Ollama API timeout: {e}")
    return json.dumps({"error": "TimeoutError", "message": "Request timed out after 120s..."})
except aiohttp.ServerTimeoutError as e:
    logger.error(f"Ollama API server timeout: {e}")
    return json.dumps({"error": "ServerTimeoutError", "message": "Server timeout reading response..."})
```
Benefits:

- Timeout failures are logged and returned as structured JSON errors instead of surfacing as unhandled exceptions
## Usage Examples

### /api/generate Endpoint

Standard Completion:
```python
result = await ollama_api.generate_text(
    prompt="Why is the sky blue?",
    model="llama3.2",
    max_tokens=100,
    temperature=0.7,
)
```
With JSON Mode:
```python
result = await ollama_api.generate_text(
    prompt="Return a JSON object with age and availability",
    model="llama3.1:8b",
    format="json",
    max_tokens=100,
)
```
With Structured Outputs:
```python
result = await ollama_api.generate_text(
    prompt="Describe the weather",
    model="llama3.1:8b",
    format={
        "type": "object",
        "properties": {
            "temperature": {"type": "integer"},
            "condition": {"type": "string"},
        },
    },
)
```
With Code Completion (suffix):
```python
result = await ollama_api.generate_text(
    prompt="def compute_gcd(a, b):",
    suffix=" return result",
    model="codellama:code",
    temperature=0,
)
```
With Custom keep_alive:
```python
result = await ollama_api.generate_text(
    prompt="Analyze this data",
    model="mistral-nemo:latest",
    keep_alive="10m",  # Keep model loaded for 10 minutes
)
```
### /api/chat Endpoint

Chat Completion:
```python
result = await ollama_api.generate_text(
    prompt="What is the weather?",
    model="llama3.2",
    use_chat=True,
    messages=[
        {"role": "user", "content": "What is the weather?"}
    ],
)
```
With Conversation History:
```python
result = await ollama_api.generate_text(
    prompt="How is that different?",
    model="llama3.2",
    use_chat=True,
    messages=[
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": "Due to Rayleigh scattering."},
        {"role": "user", "content": "How is that different?"},
    ],
)
```
## Performance Metrics

The API now tracks:

- Token counts: `eval_count` + `prompt_eval_count` from the API response
- Durations: `total_duration`, `load_duration`, `eval_duration` (in nanoseconds)
- Throughput: `eval_count / eval_duration * 10^9` tokens/second

## Best Practices

- Use `keep_alive` for Repeated Requests: Set `keep_alive="10m"` or longer if making multiple requests to the same model
- For JSON output, use `format="json"` and instruct the model in the prompt
- Use `/api/chat` with message history for better context

No Breaking Changes: All existing code continues to work. New parameters are optional.
Recommended Updates:

- Set a longer `keep_alive` parameter for frequently-used models
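Per the Ollama docs, `keep_alive` accepts either a duration string (such as `"10m"` or `"24h"`) or a number of seconds, with negative values keeping the model loaded indefinitely. A hypothetical helper for normalizing such values to seconds (not part of `api/ollama_url.py`):

```python
# Hypothetical helper: normalize a keep_alive value to seconds.
# Assumes Ollama's documented forms: duration strings like "10m"/"24h",
# plain numbers of seconds, negative meaning "keep loaded indefinitely".
_UNITS = {"s": 1, "m": 60, "h": 3600}

def keep_alive_seconds(value) -> float:
    if isinstance(value, (int, float)):
        return float(value)
    value = value.strip()
    if value and value[-1] in _UNITS:
        return float(value[:-1]) * _UNITS[value[-1]]
    return float(value)  # bare number in a string

print(keep_alive_seconds("10m"))  # 600.0
```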