
Ollama Complete Reference — Local Documentation for mindX

> Self-contained reference for all Ollama capabilities.
> No external docs needed — resilient offline operation.
> Source: docs.ollama.com (fetched 2026-04-11) + mindX integration specifics.

Back to mindX Documentation Hub

Operational Standards

mindX operates from two inference pillars — both are operational standards, not fallbacks:

| Pillar | Source | Speed | Model Scale | Availability | Cost |
|---|---|---|---|---|---|
| CPU inference | localhost:11434 | ~8 tok/s | 0.6B–1.7B | Always (no network) | Zero |
| Cloud inference | ollama.com via OllamaCloudTool | ~65 tok/s | 3B–1T | 24/7/365 (free tier) | Zero |

CPU provides autonomy — mindX reasons even offline, even when every API key is exhausted. Cloud provides scale — 120B+ parameter models on NVIDIA GPUs, 8.2x faster than local CPU. Together they form the resilience guarantee: mindX never stops inferring.

The 5-step resolution chain in _resolve_inference_model() tries the best available source first and walks down until inference is guaranteed. CPU is the failsafe; Cloud is the guarantee. Both are always ready.


Quick Navigation

API Reference

| Endpoint | Local | Cloud | Doc |
|---|---|---|---|
| POST /api/generate | localhost:11434 | ollama.com | generate.md |
| POST /api/chat | localhost:11434 | ollama.com | chat.md |
| POST /api/embed | localhost:11434 | ollama.com | embeddings.md |
| Model management | localhost:11434 | — | models.md |
| GET /api/ps, GET /api/version | localhost:11434 | — | running.md |

All endpoints documented with every parameter, response field, and curl/Python/JavaScript examples. See the Ollama OpenAPI spec for the authoritative schema.
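The endpoint table above can be exercised from Python with nothing more than a request body. A minimal sketch, assuming only the documented /api/chat request shape; the pillar names and base-URL constants are this page's terminology, not a specific mindX module:

```python
# Sketch: build an /api/chat request body that works against either pillar.
# The request shape is the documented Ollama API; CPU_BASE/CLOUD_BASE mirror
# this page's dual-pillar terminology.

CPU_BASE = "http://localhost:11434"
CLOUD_BASE = "https://ollama.com"

def chat_request(pillar: str, model: str, prompt: str, stream: bool = False):
    """Return (url, json_body) for POST /api/chat on the chosen pillar."""
    base = CPU_BASE if pillar == "cpu" else CLOUD_BASE
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,  # OllamaAPI currently uses stream=False
    }
    return f"{base}/api/chat", body

url, body = chat_request("cpu", "deepseek-r1:1.5b", "Describe mindX in one sentence.")
# POST `body` as JSON to `url` with any HTTP client (mindX uses aiohttp).
```

The same body works against both pillars; only the base URL (and, for the direct cloud API, a Bearer token) changes.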

Features

Each feature doc includes curl, Python SDK, JavaScript SDK, and mindX-specific code examples. All features work identically on both CPU and Cloud pillars.

  • Streaming — Real-time token-by-token output via /api/chat and /api/generate; extends OllamaAPI which currently uses stream=False
  • Thinking — Chain-of-thought reasoning with the think parameter; supported models include DeepSeek R1 (local) and GPT-OSS (cloud, levels: "low"/"medium"/"high")
  • Structured Outputs — JSON schema-constrained generation via the format parameter; works with Pydantic and Zod; used by BDI reasoning for structured state extraction
  • Vision — Image understanding with multimodal models; cloud models gemma4, kimi-k2.5 support vision
  • Embeddings — Vector embeddings for RAGE semantic search and pgvector storage; mindX uses mxbai-embed-large and nomic-embed-text
  • Tool Calling — Function calling / tool use; single, parallel, and agent loop patterns; bridges to mindX BaseTool via OllamaCloudTool
  • Web Search — Grounded generation via Ollama web search API; requires OLLAMA_API_KEY; available as OllamaCloudTool.execute(operation="web_search")
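As an illustration of the structured-outputs feature listed above, a minimal sketch of attaching a JSON schema through the documented format parameter. The belief_schema and the function name are invented for illustration, not taken from the BDI code:

```python
# Sketch: constrain /api/chat output with the documented `format` parameter.
# The schema below is illustrative; mindX's BDI state extraction would supply
# its own schema.

def structured_chat_body(model: str, prompt: str, schema: dict) -> dict:
    """Build an /api/chat body whose `format` field carries a JSON schema."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "format": schema,   # Ollama constrains generation to this schema
        "stream": False,
    }

belief_schema = {
    "type": "object",
    "properties": {
        "belief": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["belief", "confidence"],
}

body = structured_chat_body("deepseek-r1:1.5b", "State one belief about mindX.", belief_schema)
# The response's message.content is then schema-conforming JSON:
#   import json; belief = json.loads(response["message"]["content"])
```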
Cloud & Infrastructure

  • Ollama Cloud — Free/Pro/Max tiers, API keys, cloud models, local offload vs direct API; the cloud operational pillar
  • Cloud Model Search — Programmatic discovery via /api/tags and OllamaCloudModelDiscovery class; feeds into Modelfile schema and Chimaiera alignment; the -cloud suffix distinction
  • Cloud Rate Limiting — CloudRateLimiter embedded in OllamaCloudTool with adaptive pacing (3s–30s); uses actual token counts from the Ollama API; integrates with rate_limiter.py
  • OpenAI Compatibility — Drop-in replacement at /v1/chat/completions; works with OpenAI Python SDK and OpenAI JS SDK; base URL localhost:11434/v1/ for CPU pillar, ollama.com/v1/ for cloud pillar
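The OpenAI-compatibility bullet above boils down to a base-URL switch. A minimal sketch, with the OpenAI SDK call shown only in comments; the helper name is an assumption, not a mindX API:

```python
# Sketch: select the OpenAI-compatible base URL per pillar. The /v1 paths
# are the documented compatibility endpoints; `pillar` is this page's term.

def openai_base_url(pillar: str) -> str:
    if pillar == "cpu":
        return "http://localhost:11434/v1"
    if pillar == "cloud":
        return "https://ollama.com/v1"
    raise ValueError(f"unknown pillar: {pillar}")

# With the OpenAI Python SDK this would be used as (not executed here):
#   from openai import OpenAI
#   client = OpenAI(base_url=openai_base_url("cpu"), api_key="ollama")
#   client.chat.completions.create(model="deepseek-r1:1.5b", messages=[...])
```

The local endpoint ignores the api_key value, but the OpenAI SDK requires one to be set, hence the conventional dummy value.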
SDKs

  • Python SDK — ollama library (PyPI); sync, async, cloud client, auto-parsed tool schemas; mindX uses aiohttp directly via OllamaAPI for maximum control
  • JavaScript SDK — ollama library (npm); browser, Node.js, cloud, abort; used by the mindX frontend
Setup & Operations

  • Getting Started — Installation on Linux, macOS, Windows; first model pull; mindX quick setup for the CPU pillar
  • GPU Support — NVIDIA (CC 5.0+), AMD ROCm, Apple Metal, Vulkan (experimental); the 10.0.0.155 GPU server when online
  • Docker — CPU, NVIDIA, AMD, Vulkan containers; Docker Hub; Compose with mindX
  • Modelfile — Custom model creation; canonical schema for model collection, rating, and agent-model alignment toward Chimaiera
  • FAQ & Troubleshooting — Context window, keep_alive, Flash Attention, KV cache quantization, concurrency, VPS production notes
mindX Integration

  • OllamaCloudTool — First-class BaseTool for the cloud pillar. Any agent can call execute(operation="chat", model="deepseek-v3.2", message="..."). Dual access (local proxy + direct API), embedded CloudRateLimiter, 18dp precision metrics, conversation history, branch-ready. Registered in augmentic_tools_registry.json with access_control: [""]. Wired into _resolve_inference_model() as Step 5 (guarantee).
  • Architecture — Integration layer diagram; OllamaAPI, OllamaChatManager, LLMFactory, InferenceDiscovery; 5-step resilience chain; cloud offload
  • Configuration — MINDX_LLM__OLLAMA__BASE_URL (CPU pillar), OLLAMA_API_KEY (cloud pillar), models/ollama.yaml, BANKON vault, llm_factory_config.json
  • Precision Metrics — llm/precision_metrics.py: 18-decimal-place scientific tracking; Decimal accumulation; actual counts only; separate cloud file at data/metrics/cloud_precision_metrics.json
  • Capability Examples — Working Python code for all 10 capabilities: streaming, thinking, structured outputs, vision, embeddings, tool calling, web search, cloud, model management, rate-limited cloud client
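The execute(operation=...) pattern above can be sketched with a tiny stand-in class. This is a hypothetical stub of the dispatch shape only, not the real OllamaCloudTool (which lives in tools/cloud/ollama_cloud_tool.py and does network I/O):

```python
# Hypothetical stub of the execute(operation=...) dispatch pattern — NOT the
# real OllamaCloudTool. It only illustrates the call shape agents use.

class CloudToolStub:
    """Minimal shape of an operation-dispatching tool."""

    def execute(self, operation: str, **kwargs):
        handler = getattr(self, f"_op_{operation}", None)
        if handler is None:
            return {"status": "error", "error": f"unknown operation: {operation}"}
        return handler(**kwargs)

    def _op_chat(self, model: str, message: str):
        # The real tool would try the local proxy, then the direct cloud API.
        return {"status": "ok", "model": model, "echo": message}

tool = CloudToolStub()
result = tool.execute(operation="chat", model="deepseek-v3.2", message="ping")
```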
Test & Benchmarking

  • scripts/test_cloud_all_models.py — Primary benchmark: single prompt to every model, precision metrics (18dp Decimal), actual eval_count/eval_duration from the Ollama API; see Latest Benchmark and How Cloud Works Without an API Key
  • scripts/test_cloud_inference.py — Original multi-source benchmark (local + cloud + vLLM)
  • scripts/test_ollama_connection.py — Connection test using OllamaAPI
Existing mindX Ollama Docs (pre-2026-04-11)

  • docs/ollama_api_integration.md — Original API compliance notes (timeouts, keep_alive, token counting)
  • docs/ollama_integration.md — Custom client (OllamaAPI) vs official library (Python SDK)
  • docs/ollama_model_capability_tool.md — Model discovery and capability registration
  • docs/OLLAMA_VLLM_CLOUD_RESEARCH.md — Cloud + vLLM research (2026-04-10); established the dual-pillar strategy
  • llm/RESILIENCE.md — Graded inference hierarchy: Primary → Secondary → Failsafe (CPU) → Guarantee (Cloud)

Latest Benchmark (2026-04-11)

Prompt: "You are mindX. In one sentence, describe what you are."
Script: test_cloud_all_models.py | Results: data/cloud_test_results.json

| Model | Pillar | eval | prompt | total | tok/s | wall_ms | total_ms |
|---|---|---|---|---|---|---|---|
| gpt-oss:120b-cloud | Cloud | 67 | 81 | 148 | 65.52 | 1,214 | 1,022 |
| deepseek-r1:1.5b | CPU | 79 | 17 | 96 | 8.00 | 16,294 | 16,291 |
| deepseek-coder:latest | CPU | 72 | 83 | 155 | 7.29 | 22,569 | 22,565 |

Aggregate (all values ACTUAL from the Ollama API, 18dp precision):

  • Total tokens: 399 (218 eval + 181 prompt) = 399000000000000000000 sub-tokens
  • Aggregate throughput: 11.03 tok/s (11.033658223708593293 at 18dp)
  • Cloud vs CPU speedup: 8.2x (65.52 vs 8.00 tok/s) — 120B cloud GPU vs 1.5B local CPU
Cloud Timing Note

Cloud-proxied models (gpt-oss:120b-cloud) return eval_duration_ns: 0 — the local offload proxy does not expose per-stage timing from the remote GPU. The total_duration_ns is used for the tok/s calculation instead. CPU pillar models return all duration fields. See test_cloud_all_models.py line 114 for the fallback logic.
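The fallback this note describes can be sketched as a pure function. The field names match the Ollama API response; the function name and the CPU-row durations below are illustrative (chosen to match the benchmark's ~8 tok/s):

```python
# Sketch of the timing fallback: cloud-proxied responses report
# eval_duration of 0, so tok/s falls back to total_duration. Field names
# match the Ollama API; the function name is illustrative.
from decimal import Decimal

NS_PER_SEC = Decimal(1_000_000_000)

def tokens_per_sec(eval_count: int, eval_duration_ns: int, total_duration_ns: int) -> Decimal:
    duration = eval_duration_ns if eval_duration_ns > 0 else total_duration_ns
    if duration <= 0:
        return Decimal(0)
    return Decimal(eval_count) * NS_PER_SEC / Decimal(duration)

# CPU pillar: all duration fields present, eval_duration is used.
print(tokens_per_sec(79, 9_875_000_000, 16_291_000_000))  # 8 tok/s
# Cloud proxy: eval_duration is 0, so total_duration is used instead.
print(tokens_per_sec(67, 0, 1_022_000_000))
```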

36 Additional Cloud Models Available

The cloud catalog lists 36 models at ollama.com/api/tags. To test them:

```shell
# Option A: pull with the -cloud suffix for the free-tier proxy (metadata only, no weights)
ollama pull deepseek-v3.2-cloud
python3 scripts/test_cloud_all_models.py --local

# Option B: set an API key for direct cloud access to all 36 models
export OLLAMA_API_KEY=your_key
python3 scripts/test_cloud_all_models.py
```

How Cloud Works Without an API Key

test_cloud_inference.py and OllamaCloudTool return cloud model responses without OLLAMA_API_KEY because of Ollama's local offload architecture.

The -cloud Suffix

Model names with -cloud appended (e.g., gpt-oss:120b-cloud) are metadata-only pulls that proxy inference to ollama.com. Without the suffix (e.g., gpt-oss:120b), ollama pull downloads the full model weights (gigabytes) for CPU pillar execution.

The cloud catalog returns names without -cloud. Append it for the free-tier local proxy:

```shell
ollama pull gpt-oss:120b-cloud      # metadata only → inference proxied to cloud
ollama pull deepseek-v3.2-cloud     # metadata only → inference proxied to cloud
```

versus a full local pull:

```shell
ollama pull gemma3:4b               # downloads 3.3GB weights for local CPU execution
```

The Mechanism

  • ollama pull gpt-oss:120b-cloud downloads metadata (not weights) to the local daemon
  • ollama run gpt-oss:120b-cloud sends the request to localhost:11434 like any local model
  • The local Ollama daemon detects the -cloud tag and transparently proxies to ollama.com
  • Authentication is handled by the daemon using credentials from ollama signin (stored at ~/.ollama/id_ed25519; see FAQ)
  • OllamaCloudTool calls localhost:11434/api/chat — no Bearer token needed because the local daemon is the auth proxy

```
Agent → OllamaCloudTool.execute(operation="chat") → _try_local_proxy()
      → localhost:11434/api/chat (model-cloud) → local Ollama daemon
                                                    ↓ (transparent proxy)
                                               ollama.com (auth via ed25519 key)
                                                    ↓
                                               Cloud GPU inference
                                                    ↓
Agent ← result (eval_count, tokens_per_sec, 18dp) ← ollama.com
```

Three Access Paths

| Path | URL | Auth | Pull | When | Tool Method |
|---|---|---|---|---|---|
| Local offload | localhost:11434 | None (daemon) | ollama pull model-cloud | Free tier, no key | _try_local_proxy() |
| Direct API | ollama.com/api/chat | Bearer $OLLAMA_API_KEY | None needed | Key set | _try_direct_cloud() |
| Local execution | localhost:11434 | None | ollama pull model (full weights) | Always, even offline | OllamaAPI |

OllamaCloudTool in auto mode tries the local proxy first, then the direct cloud API — matching the dual-pillar design.
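The auto-mode ordering can be sketched as two small helpers; the names are illustrative stand-ins for logic that lives inside OllamaCloudTool:

```python
# Sketch of the auto-mode path ordering. Helper names are illustrative;
# the real logic lives in OllamaCloudTool.

def cloud_name(model: str) -> str:
    """Append -cloud for the free-tier local offload proxy."""
    return model if model.endswith("-cloud") else f"{model}-cloud"

def access_paths(api_key_set: bool) -> list:
    """Order of attempts in auto mode: local proxy first, then direct API."""
    paths = ["_try_local_proxy"]
    if api_key_set:
        paths.append("_try_direct_cloud")
    return paths

print(cloud_name("gpt-oss:120b"))       # gpt-oss:120b-cloud
print(access_paths(api_key_set=True))   # local proxy first, direct cloud second
```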

Why /api/tags Works Without Auth

The model listing endpoint at https://ollama.com/api/tags is publicly accessible — it lists available cloud models for discovery. This is how test_cloud_all_models.py, OllamaCloudTool.list_models, and OllamaCloudModelDiscovery discover available models without authentication.
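Discovery against this endpoint reduces to parsing a models array. A minimal sketch with an invented sample payload; the {"models": [{"name": ...}, ...]} shape is the documented response format:

```python
# Sketch: extract model names from an /api/tags response. The sample payload
# below is invented for illustration, not live catalog data.
import json

def model_names(tags_json: str) -> list:
    return [m["name"] for m in json.loads(tags_json).get("models", [])]

sample = json.dumps({"models": [{"name": "gpt-oss:120b"}, {"name": "deepseek-v3.2"}]})
print(model_names(sample))  # ['gpt-oss:120b', 'deepseek-v3.2']

# Live fetch (not executed here) needs no auth:
#   import urllib.request
#   raw = urllib.request.urlopen("https://ollama.com/api/tags").read()
```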

Free Tier Limits

| Limit | Value | Reset | Tracked By |
|---|---|---|---|
| Session | Light usage | Every 5 hours | CloudRateLimiter in OllamaCloudTool |
| Weekly | Light usage | Every 7 days | CloudQuotaTracker |
| Concurrent cloud models | 1 | — | Ollama server-side |

See Cloud Rate Limiting for the adaptive pacing strategy (3s–30s based on quota utilization) that maximizes throughput within these limits using actual token counts.
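The adaptive pacing can be sketched as a clamp-and-interpolate function. Linear interpolation between the 3 s floor and 30 s ceiling is an assumption here; the real CloudRateLimiter may pace differently:

```python
# Minimal sketch: scale the inter-request delay from the 3 s floor to the
# 30 s ceiling as quota utilization rises. Linear interpolation is an
# assumption; the real CloudRateLimiter strategy may differ.

MIN_DELAY_S = 3.0
MAX_DELAY_S = 30.0

def adaptive_delay(utilization: float) -> float:
    """utilization: fraction of the quota window consumed (0.0-1.0)."""
    u = min(max(utilization, 0.0), 1.0)  # clamp out-of-range inputs
    return MIN_DELAY_S + u * (MAX_DELAY_S - MIN_DELAY_S)

print(adaptive_delay(0.0))   # 3.0  (idle quota: fastest pacing)
print(adaptive_delay(1.0))   # 30.0 (quota nearly spent: back off hard)
```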


Modelfile as Canonical Schema

The Ollama Modelfile is mindX's canonical schema for model collection and rating across both pillars:

| Instruction | Maps To | mindX Component |
|---|---|---|
| FROM | Base architecture/weights | models/ollama.yaml models[].name |
| PARAMETER | Operational characteristics | models/ollama.yaml model_selection |
| TEMPLATE | Communication protocol | Go template syntax |
| SYSTEM | Cognitive identity | Agent system prompts in BDIAgent |
| Capabilities | Dynamic from /api/show | OllamaCloudModelDiscovery |

This feeds into:

  • HierarchicalModelScorer — learned task_scores from precision metrics feedback
  • OllamaCloudModelDiscovery — dynamic capability detection across both CPU and cloud models
  • InferenceDiscovery — provider routing with cloud guarantee fallback
  • Agent-model alignment toward Chimaiera (the ROI moment when model composition outperforms single-model inference)

See the Modelfile Reference for the full instruction set and the Chimaiera alignment section.


Precision Metrics

Token tracking at 18 decimal places using Python Decimal. No floating-point drift. No estimation. Applied identically to both CPU and Cloud pillars.

| What | Before | After | Module |
|---|---|---|---|
| Token counts | word_count × 1.3 | eval_count from Ollama API | precision_metrics.py |
| Timing | float milliseconds | int nanoseconds (Ollama native) | OllamaResponseMetrics |
| Accumulation | float (compounding drift) | Decimal (28-digit significand) | PrecisionAccumulator |
| Sub-token unit | none | 1 token = 10^18 sub-tokens (wei equivalent) | SUBTOKEN_FACTOR |
| Cloud tok/s | not tracked | eval_count / total_duration_ns (cloud proxy returns eval_duration: 0) | OllamaCloudTool |

Local metrics: data/metrics/precision_metrics.json (via OllamaAPI)
Cloud metrics: data/metrics/cloud_precision_metrics.json (via OllamaCloudTool)

Full docs: Precision Metrics.
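The sub-token accounting above can be sketched with Python's Decimal. The class name is illustrative (the real accumulator lives in llm/precision_metrics.py); the eval counts reuse the benchmark above:

```python
# Sketch of drift-free accumulation: integer sub-tokens (1 token = 10**18
# sub-tokens, the wei-style unit above) summed via Decimal. The class name
# is illustrative; the real accumulator is in llm/precision_metrics.py.
from decimal import Decimal, getcontext

getcontext().prec = 28          # 28-digit significand, as noted above
SUBTOKEN_FACTOR = 10 ** 18

class Accumulator:
    def __init__(self):
        self.subtokens = Decimal(0)

    def add_tokens(self, eval_count: int):
        # Only actual eval_count values from the Ollama API — no estimates.
        self.subtokens += Decimal(eval_count) * SUBTOKEN_FACTOR

    @property
    def tokens(self) -> Decimal:
        return self.subtokens / SUBTOKEN_FACTOR

acc = Accumulator()
for n in (67, 79, 72):          # eval counts from the benchmark above
    acc.add_tokens(n)
print(acc.subtokens)            # 218000000000000000000
print(acc.tokens)               # 218
```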


Resilience Design

The 5-step resolution chain in _resolve_inference_model() ensures mindX always has inference when any network path is available:

    Step 1: InferenceDiscovery → best provider (Gemini, Mistral, Groq, etc.)
              ↓ all keys exhausted or rate limited
    Step 2: OllamaChatManager → local model selection (HierarchicalModelScorer)
              ↓ connection stale or failed
    Step 3: Re-init OllamaChatManager → retry with fresh connection
              ↓ still failing
    Step 4: Direct HTTP → localhost:11434/api/tags (zero dependencies)
              ↓ local Ollama completely down
    Step 5: OllamaCloudTool → ollama.com GPU inference ← GUARANTEE (24/7/365)
              ↓ cloud also unreachable (network down)
         → None → fallback_decide() rule-based heuristics → 2-min backoff
    
| Tier | Role | Provider | Speed | mindX Component |
|---|---|---|---|---|
| Primary | Best quality | Gemini, Mistral | Varies | LLMFactory |
| Secondary | Speed/cost | Groq, Together | Fast | LLMFactory |
| Failsafe | CPU pillar | Ollama local (localhost:11434) | ~8 tok/s | OllamaChatManager |
| Guarantee | Cloud pillar | Ollama Cloud (ollama.com) | ~65 tok/s | OllamaCloudTool |
| Last resort | No inference | — | — | fallback_decide() rule-based |

Cloud is the guarantee, not the default. The _cloud_inference_active flag in mindXagent.py routes one chat through OllamaCloudTool, then resets so the next cycle tries local first. This preserves CPU pillar autonomy while ensuring the cloud pillar catches every gap.

InferenceDiscovery.get_provider_for_task() routes tasks through the same hierarchy: preferred provider → ollama_local → ollama_cloud → any available → None.

Implementation: _resolve_inference_model() (5 steps) → InferenceDiscovery (provider probing + cloud fallback) → OllamaCloudTool (cloud guarantee) → RESILIENCE.md (graded hierarchy docs) → chat_with_ollama() (cloud routing when active).
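The walk-down above can be sketched as a generic first-success resolver. The step functions are stand-ins for the real resolvers named in the chain, wired here to mimic a run where only the cloud guarantee answers:

```python
# Sketch of the 5-step walk-down: try each source in order, return the first
# answer, and fall back to rule-based heuristics when everything fails.
# The step callables are stand-ins, not the real mindX resolvers.

def _stale_connection():
    raise ConnectionError("local Ollama down")

def resolve(steps, fallback):
    """Return (source_name, result) from the first step that succeeds."""
    for name, step in steps:
        try:
            result = step()
            if result is not None:
                return name, result
        except Exception:
            continue  # walk down to the next source
    return "fallback", fallback()

# Stand-ins: primary providers exhausted, local Ollama down, cloud answers.
steps = [
    ("inference_discovery", lambda: None),            # all API keys exhausted
    ("ollama_chat_manager", _stale_connection),       # local connection failed
    ("ollama_cloud_tool", lambda: "cloud response"),  # Step 5: guarantee
]
source, result = resolve(steps, fallback=lambda: "rule-based heuristic")
print(source, result)   # ollama_cloud_tool cloud response
```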


mindX File Map

Core Ollama Integration

| File | Role | Doc | Pillar |
|---|---|---|---|
| tools/cloud/ollama_cloud_tool.py | OllamaCloudTool — cloud inference for any agent | This page | Cloud |
| api/ollama/ollama_url.py | HTTP API client, rate limiter, precision metrics, failover | Architecture | CPU |
| agents/core/ollama_chat_manager.py | Connection manager, model discovery, conversation history | Architecture | CPU |
| agents/core/mindXagent.py | 5-step resolution chain, cloud routing, autonomous loop | Architecture | Both |
| llm/ollama_handler.py | LLMFactory handler interface | Architecture | CPU |
| llm/llm_factory.py | Master factory, provider selection | Configuration | Both |
| llm/rate_limiter.py | Token-bucket rate limiting | Cloud Rate Limiting | Both |
| llm/precision_metrics.py | 18dp scientific token tracking | Precision Metrics | Both |
| llm/inference_discovery.py | Boot-time probe, task routing, cloud guarantee | Architecture | Both |
| models/ollama.yaml | Model registry, task scores, cloud config | Configuration | Both |
| api/ollama/ollama_admin_routes.py | Admin endpoints (status, test, models) | FAQ | CPU |
| agents/core/model_scorer.py | HierarchicalModelScorer | Modelfile Schema | Both |
| agents/core/inference_optimizer.py | Sliding-scale frequency optimization | Architecture | CPU |
| agents/hostinger_vps_agent.py | VPS management: 3 MCP channels (SSH + Hostinger API + Backend) | NAV.md | Both |

Test Scripts

| File | Purpose | Pillar |
|---|---|---|
| scripts/test_cloud_all_models.py | Primary: every model, precision metrics, 18dp Decimal | Both |
| scripts/test_cloud_inference.py | Original: local + cloud + vLLM comparison | Both |
| scripts/test_ollama_connection.py | Connection test via OllamaAPI | CPU |
| data/cloud_test_results.json | Latest benchmark results (JSON, 18dp) | Both |

External References

| Resource | URL | Relevance |
|---|---|---|
| Ollama Homepage | ollama.com | Both pillars |
| Ollama Docs | docs.ollama.com | API reference source |
| Ollama API (OpenAPI) | docs.ollama.com/openapi.yaml | API docs source |
| Ollama GitHub | github.com/ollama/ollama | Setup |
| Python SDK | github.com/ollama/ollama-python | SDK docs |
| JavaScript SDK | github.com/ollama/ollama-js | SDK docs |
| Cloud Models | ollama.com/search?c=cloud | Cloud pillar catalog |
| Thinking Models | ollama.com/search?c=thinking | Thinking feature |
| Vision Models | ollama.com/search?c=vision | Vision feature |
| Tool Models | ollama.com/search?c=tools | Tool Calling feature |
| Model Library | ollama.com/library | Modelfile reference |
| API Keys | ollama.com/settings/keys | Cloud auth |
| Discord | discord.gg/ollama | Community |
| Docker Hub | hub.docker.com/r/ollama/ollama | Docker setup |
| OllamaFreeAPI | github.com/mfoud444/ollamafreeapi | Community gateway |
| mindX Production | mindx.pythai.net | Live CPU pillar |
| mindX Thesis | docs/THESIS.md | Darwin-Godel Machine synthesis |
| mindX Manifesto | docs/MANIFESTO.md | Chimaiera roadmap |
| RAGE | docs/AGINT.md | Embeddings architecture — RAGE wipes the floor with RAG |
| Attribution | docs/ATTRIBUTION.md | Open source stack: Ollama, vLLM, SwarmClaw, pgvector |

Version Info

  • Ollama docs: Fetched 2026-04-11 from docs.ollama.com
  • Operational standards: CPU (OllamaAPI + OllamaChatManager) + Cloud (OllamaCloudTool)
  • Resilience: 5-step chain in _resolve_inference_model() with cloud guarantee
  • Precision: 18dp Decimal via precision_metrics.py, actual counts from Ollama API
  • Production: mindx.pythai.net (4GB VPS, CPU pillar, dual-URL failover)
  • Benchmark: 2026-04-11 — 3 models, 399 tokens, cloud 8.2x faster than CPU
  • 28 files, ~6,000 lines — self-contained for resilient offline operation
