Ollama API Integration - Official Documentation Compliance

Overview

This document describes the updates made to the Ollama API integration in api/ollama_url.py to align with the official Ollama API documentation.

Changes Made

1. Extended Timeout Configuration

Previous: 10s total, 5s sock_read (too short for large models)

Updated: 120s total, 60s sock_read (sufficient for large model inference)

# Session timeout for inference requests
timeout=aiohttp.ClientTimeout(
    total=120,  # 120 second total timeout for large models
    connect=10,  # 10 second connection timeout
    sock_read=60  # 60 second read timeout for inference
)

Impact: Prevents premature timeouts with larger models (e.g., 30B+ parameter models)

2. Added keep_alive Parameter

Per Ollama API Docs: The keep_alive parameter controls how long the model stays loaded in memory after a request (default: 5m).

Implementation:

payload = {
    "model": model,
    "prompt": prompt,
    "stream": False,
    "keep_alive": kwargs.get("keep_alive", "5m"),  # Keep model loaded (default 5m)
    "options": {
        "num_predict": max_tokens,
        "temperature": temperature,
    }
}

Benefits:

3. Support for Additional API Parameters

Added Support For:

Implementation:

# Add any additional top-level parameters from kwargs
for key in ["format", "system", "template", "raw", "suffix", "images", "think"]:
    if key in kwargs:
        payload[key] = kwargs[key]
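
The payload assembly from sections 2 and 3 can be combined into one function. The sketch below is illustrative only: `build_generate_payload` and `PASSTHROUGH_KEYS` are hypothetical names, not identifiers taken from api/ollama_url.py, and the default values for `max_tokens` and `temperature` are assumptions.

```python
from typing import Any, Dict

# Top-level keys forwarded unchanged, matching the list in the loop above.
PASSTHROUGH_KEYS = ("format", "system", "template", "raw", "suffix", "images", "think")


def build_generate_payload(
    model: str,
    prompt: str,
    max_tokens: int = 100,       # assumed default, for illustration
    temperature: float = 0.7,    # assumed default, for illustration
    **kwargs: Any,
) -> Dict[str, Any]:
    """Assemble an /api/generate request body (hypothetical helper)."""
    payload: Dict[str, Any] = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": kwargs.get("keep_alive", "5m"),
        "options": {"num_predict": max_tokens, "temperature": temperature},
    }
    # Copy only the recognised top-level parameters; anything else is ignored.
    for key in PASSTHROUGH_KEYS:
        if key in kwargs:
            payload[key] = kwargs[key]
    return payload


payload = build_generate_payload(
    "llama3.2", "Why is the sky blue?", format="json", top_k=40
)
```

In this sketch an unrecognised keyword such as `top_k` is silently dropped rather than forwarded. In the Ollama API, sampling parameters of that kind belong under `options`, not at the top level.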

4. Improved Token Counting

Previous: Estimated tokens using word count

Updated: Uses actual token counts from Ollama API response

# Extract performance metrics if available (per Ollama API docs)
eval_count = data.get("eval_count", 0)
prompt_eval_count = data.get("prompt_eval_count", 0)

# Use actual token counts if available, otherwise estimate
if eval_count > 0 or prompt_eval_count > 0:
    total_tokens = eval_count + prompt_eval_count
    self.metrics.total_tokens += total_tokens
else:
    # Fallback estimation
    estimated_tokens = len(prompt.split()) * 1.3 + len(content.split()) * 1.3
    self.metrics.total_tokens += int(estimated_tokens)
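
The same counting logic can be written as a standalone function, which makes it easier to test without a running Ollama server. This is a minimal sketch: `count_tokens` is a hypothetical helper, and returning a flag that marks estimated counts is an assumption, not something the integration is documented to do.

```python
from typing import Any, Dict, Tuple


def count_tokens(data: Dict[str, Any], prompt: str, content: str) -> Tuple[int, bool]:
    """Return (total_tokens, is_exact) for one response (hypothetical helper)."""
    eval_count = data.get("eval_count", 0)
    prompt_eval_count = data.get("prompt_eval_count", 0)
    if eval_count > 0 or prompt_eval_count > 0:
        # Counts reported by Ollama: generated tokens plus prompt tokens.
        return eval_count + prompt_eval_count, True
    # Fallback: roughly 1.3 tokens per whitespace-separated word.
    estimated = len(prompt.split()) * 1.3 + len(content.split()) * 1.3
    return int(estimated), False


exact = count_tokens({"eval_count": 42, "prompt_eval_count": 8}, "ignored", "ignored")
estimate = count_tokens({}, "Why is the sky blue?", "Rayleigh scattering.")
```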

Benefits:

5. Enhanced Error Handling

Added Specific Timeout Error Handling:

# aiohttp.ServerTimeoutError subclasses asyncio.TimeoutError, so catch it first
except aiohttp.ServerTimeoutError as e:
    logger.error(f"Ollama API server timeout: {e}")
    return json.dumps({"error": "ServerTimeoutError", "message": "Server timeout reading response..."})
except asyncio.TimeoutError as e:
    logger.error(f"Ollama API timeout: {e}")
    return json.dumps({"error": "TimeoutError", "message": "Request timed out after 120s..."})
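
To show the shape of the error a caller receives, here is a standard-library-only sketch of the pattern: a coroutine is awaited with a deadline, and a timeout is converted into a JSON error string. `run_with_timeout` and `slow_inference` are hypothetical stand-ins. The real integration uses aiohttp timeouts, not `asyncio.wait_for`.

```python
import asyncio
import json


async def run_with_timeout(coro, timeout_s: float) -> str:
    """Await `coro`; return a JSON error string if it exceeds `timeout_s`."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return json.dumps({
            "error": "TimeoutError",
            "message": f"Request timed out after {timeout_s}s",
        })


async def slow_inference() -> str:
    # Stand-in for a model call that takes longer than the allowed time.
    await asyncio.sleep(1)
    return "done"


result = asyncio.run(run_with_timeout(slow_inference(), timeout_s=0.05))
error = json.loads(result)
```

Because the error is returned as a string and not raised, callers need to inspect the result for an `error` key before treating it as model output.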

Benefits:

API Endpoint Usage

/api/generate Endpoint

Standard Completion:

result = await ollama_api.generate_text(
    prompt="Why is the sky blue?",
    model="llama3.2",
    max_tokens=100,
    temperature=0.7
)

With JSON Mode:

result = await ollama_api.generate_text(
    prompt="Return a JSON object with age and availability",
    model="llama3.1:8b",
    format="json",
    max_tokens=100
)

With Structured Outputs:

result = await ollama_api.generate_text(
    prompt="Describe the weather",
    model="llama3.1:8b",
    format={
        "type": "object",
        "properties": {
            "temperature": {"type": "integer"},
            "condition": {"type": "string"}
        }
    }
)

With Code Completion (suffix):

result = await ollama_api.generate_text(
    prompt="def compute_gcd(a, b):",
    suffix="    return result",
    model="codellama:code",
    temperature=0
)

With Custom keep_alive:

result = await ollama_api.generate_text(
    prompt="Analyze this data",
    model="mistral-nemo:latest",
    keep_alive="10m"  # Keep model loaded for 10 minutes
)

/api/chat Endpoint

Chat Completion:

result = await ollama_api.generate_text(
    prompt="What is the weather?",
    model="llama3.2",
    use_chat=True,
    messages=[
        {"role": "user", "content": "What is the weather?"}
    ]
)

With Conversation History:

result = await ollama_api.generate_text(
    prompt="How is that different?",
    model="llama3.2",
    use_chat=True,
    messages=[
        {"role": "user", "content": "Why is the sky blue?"},
        {"role": "assistant", "content": "Due to Rayleigh scattering."},
        {"role": "user", "content": "How is that different?"}
    ]
)
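
For orientation, the sketch below shows the request and response shapes that /api/chat uses when `stream` is false: the request carries a `messages` list, and the reply text is found under `message.content`. The helper names are hypothetical, and how `generate_text` maps `use_chat` and `messages` onto the request internally is an assumption here.

```python
from typing import Any, Dict, List


def build_chat_payload(
    model: str, messages: List[Dict[str, str]], **kwargs: Any
) -> Dict[str, Any]:
    """Assemble a non-streaming /api/chat request body (hypothetical helper)."""
    return {
        "model": model,
        "messages": messages,
        "stream": False,
        "keep_alive": kwargs.get("keep_alive", "5m"),
    }


def extract_chat_content(data: Dict[str, Any]) -> str:
    """Return the assistant reply from a non-streaming /api/chat response."""
    return data.get("message", {}).get("content", "")


history = [
    {"role": "user", "content": "Why is the sky blue?"},
    {"role": "assistant", "content": "Due to Rayleigh scattering."},
    {"role": "user", "content": "How is that different?"},
]
chat_payload = build_chat_payload("llama3.2", history)

# Example response body, trimmed to the fields used here.
sample_response = {
    "model": "llama3.2",
    "message": {"role": "assistant", "content": "Mie scattering affects larger particles."},
    "done": True,
}
reply = extract_chat_content(sample_response)
```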

Performance Metrics

The API now tracks:
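
As an illustration of what can be derived from a response, Ollama reports `eval_count` (generated tokens) and `eval_duration` (generation time in nanoseconds), from which generation speed follows directly. `tokens_per_second` is a hypothetical helper; whether the integration computes this value itself is not stated in this document.

```python
from typing import Any, Dict, Optional


def tokens_per_second(data: Dict[str, Any]) -> Optional[float]:
    """Generation speed from an Ollama response, or None if timing is missing."""
    eval_count = data.get("eval_count", 0)
    eval_duration_ns = data.get("eval_duration", 0)
    if eval_count <= 0 or eval_duration_ns <= 0:
        return None
    # eval_duration is in nanoseconds, so scale to seconds.
    return eval_count / eval_duration_ns * 1e9


speed = tokens_per_second({"eval_count": 100, "eval_duration": 2_000_000_000})
```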

Best Practices

  1. Use keep_alive for Repeated Requests: Set keep_alive="10m" or longer if making multiple requests to the same model
  2. Handle Timeouts Gracefully: Large models may take 60+ seconds; ensure client code handles timeouts
  3. Use JSON Mode for Structured Data: Set format="json" and also instruct the model in the prompt to respond with JSON
  4. Monitor Token Usage: Use actual token counts from API responses for accurate tracking
  5. Use Chat Endpoint for Conversations: Use /api/chat with message history for better context
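
One way to handle timeouts gracefully on the client side is to retry with exponential backoff when the call returns a timeout error payload. This is a sketch under stated assumptions: `generate_with_retry` is a hypothetical wrapper, and it assumes the wrapped function returns a JSON string with an `error` key on timeout (as in the error-handling section) and ordinary text otherwise.

```python
import asyncio
import json

TIMEOUT_ERRORS = {"TimeoutError", "ServerTimeoutError"}


def is_timeout_error(result: str) -> bool:
    """True if `result` is a JSON object whose `error` names a timeout."""
    try:
        parsed = json.loads(result)
    except (json.JSONDecodeError, TypeError):
        return False  # plain text, not an error payload
    return isinstance(parsed, dict) and parsed.get("error") in TIMEOUT_ERRORS


async def generate_with_retry(generate, *, retries: int = 2,
                              base_delay: float = 1.0, **kwargs) -> str:
    """Call `generate(**kwargs)`, retrying on timeout error payloads."""
    result = ""
    for attempt in range(retries + 1):
        result = await generate(**kwargs)
        if not is_timeout_error(result):
            return result
        if attempt < retries:
            # Back off: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
    return result  # still a timeout payload after all attempts


# Demo with a fake generator that times out once, then succeeds.
calls = []


async def fake_generate(**kwargs) -> str:
    calls.append(kwargs)
    if len(calls) == 1:
        return json.dumps({"error": "TimeoutError", "message": "timed out"})
    return "The sky is blue."


answer = asyncio.run(
    generate_with_retry(fake_generate, retries=2, base_delay=0, prompt="Why?")
)
```

In real use, `generate` would be `ollama_api.generate_text`. A longer `keep_alive` pairs well with retries, since the model is more likely to still be loaded on the second attempt.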

Migration Notes

No Breaking Changes: All existing code continues to work. New parameters are optional.

Recommended Updates:

References

