mindXdashboard/docs/book/journal/api/dojo/inference/governance/origin

philosophymanifesto thesis origin whitepaper ataraxia roadmap press|archoverview orchestration codebase hierarchy core|agentsmindXagent ceo mastermind bdi evolution author all

govdaio civilization identity security|memorypgvector embed aglm memory|inferencevllm ollama mistral gemini|timeoracle

toolsindex tools a2a mcp shell|publishauthoragent book journal|deployproduction security monitoring|apireference swagger|learnusage guide hackathon

ollama/INDEX.md · 30.6 KB

Ollama Complete Reference — Local Documentation for mindX

Self-contained reference for all Ollama capabilities.

No external docs needed — resilient offline operation.

Source: docs.ollama.com (fetched 2026-04-11) + mindX integration specifics.

Back to mindX Documentation Hub

Operational Standards

mindX operates from two inference pillars — both are operational standards, not fallbacks:

PillarSourceSpeedModel ScaleAvailabilityCost CPU inferencelocalhost:11434~8 tok/s0.6B–1.7BAlways (no network)Zero Cloud inferenceollama.com via OllamaCloudTool~65 tok/s3B–1T24/7/365 (free tier)Zero

CPU provides autonomy — mindX reasons even offline, even when every API key is exhausted. Cloud provides scale — 120B+ parameter models on NVIDIA GPUs, 8.2x faster than local CPU. Together they form the resilience guarantee: mindX never stops inferring.

The 5-step resolution chain in _resolve_inference_model() tries the best available source first and walks down to guarantee. CPU is the failsafe. Cloud is the guarantee. Both are always ready.

Quick Navigation

API Reference

EndpointLocalCloudDoc POST /api/generatelocalhost:11434ollama.comgenerate.md POST /api/chatlocalhost:11434ollama.comchat.md POST /api/embedlocalhost:11434ollama.comembeddings.md Model managementlocalhost:11434—models.md GET /api/ps, /api/versionlocalhost:11434—running.md

All endpoints documented with every parameter, response field, and curl/Python/JavaScript examples. See the Ollama OpenAPI spec for the authoritative schema.

Features

Each feature doc includes curl, Python SDK, JavaScript SDK, and mindX-specific code examples. All features work identically on both CPU and Cloud pillars.

Streaming — Real-time token-by-token output via /api/chat and /api/generate; extends OllamaAPI which currently uses stream=False

Thinking — Chain-of-thought reasoning with the think parameter; supported models include DeepSeek R1 (local) and GPT-OSS (cloud, levels: "low"/"medium"/"high")

Structured Outputs — JSON schema-constrained generation via the format parameter; works with Pydantic and Zod; used by BDI reasoning for structured state extraction

Vision — Image understanding with multimodal models; cloud models gemma4, kimi-k2.5 support vision

Embeddings — Vector embeddings for RAGE semantic search and pgvector storage; mindX uses mxbai-embed-large and nomic-embed-text

Tool Calling — Function calling / tool use; single, parallel, and agent loop patterns; bridges to mindX BaseTool via OllamaCloudTool

Web Search — Grounded generation via Ollama web search API; requires OLLAMA_API_KEY; available as OllamaCloudTool.execute(operation="web_search")

Cloud & Infrastructure

Ollama Cloud — Free/Pro/Max tiers, API keys, cloud models, local offload vs direct API; the cloud operational pillar

Cloud Model Search — Programmatic discovery via /api/tags and OllamaCloudModelDiscovery class; feeds into Modelfile schema and Chimaiera alignment; the -cloud suffix distinction

Cloud Rate Limiting — CloudRateLimiter embedded in OllamaCloudTool with adaptive pacing (3s–30s); uses actual token counts from Ollama API; integrates with rate_limiter.py

OpenAI Compatibility — Drop-in replacement at /v1/chat/completions; works with OpenAI Python SDK and OpenAI JS SDK; base URL localhost:11434/v1/ for CPU pillar, ollama.com/v1/ for cloud pillar

SDKs

Python SDK — ollama library (PyPI); sync, async, cloud client, auto-parsed tool schemas; mindX uses aiohttp directly via OllamaAPI for maximum control

JavaScript SDK — ollama library (npm); browser, Node.js, cloud, abort; used by mindX frontend

Setup & Operations

Getting Started — Installation on Linux, macOS, Windows; first model pull; mindX quick setup for the CPU pillar

GPU Support — NVIDIA (CC 5.0+), AMD ROCm, Apple Metal, Vulkan (experimental); the 10.0.0.155 GPU server when online

Docker — CPU, NVIDIA, AMD, Vulkan containers; Docker Hub; Compose with mindX

Modelfile — Custom model creation; canonical schema for model collection, rating, and agent-model alignment toward Chimaiera

FAQ & Troubleshooting — Context window, keep_alive, Flash Attention, KV cache quantization, concurrency, VPS production notes

mindX Integration

OllamaCloudTool — First-class BaseTool for the cloud pillar. Any agent can execute(operation="chat", model="deepseek-v3.2", message="..."). Dual access (local proxy + direct API), embedded CloudRateLimiter, 18dp precision metrics, conversation history, branch-ready. Registered in augmentic_tools_registry.json with access_control: [""]. Wired into _resolve_inference_model() as Step 5 (guarantee).

Architecture — Integration layer diagram; OllamaAPI → OllamaChatManager → LLMFactory → InferenceDiscovery; 5-step resilience chain; cloud offload

Configuration — MINDX_LLM__OLLAMA__BASE_URL (CPU pillar), OLLAMA_API_KEY (cloud pillar), models/ollama.yaml, BANKON vault, llm_factory_config.json

Precision Metrics — llm/precision_metrics.py: 18-decimal-place scientific tracking; Decimal accumulation; actual counts only; separate cloud file at data/metrics/cloud_precision_metrics.json

Capability Examples — Working Python code for all 10 capabilities: streaming, thinking, structured outputs, vision, embeddings, tool calling, web search, cloud, model management, rate-limited cloud client

Test & Benchmarking

scripts/test_cloud_all_models.py — Primary benchmark: single prompt to every model, precision metrics (18dp Decimal), actual eval_count/eval_duration from Ollama API; see Latest Benchmark and How Cloud Works Without an API Key

scripts/test_cloud_inference.py — Original multi-source benchmark (local + cloud + vLLM)

scripts/test_ollama_connection.py — Connection test using OllamaAPI

Existing mindX Ollama Docs (pre-2026-04-11)

docs/ollama_api_integration.md — Original API compliance notes (timeouts, keep_alive, token counting)

docs/ollama_integration.md — Custom client (OllamaAPI) vs official library (Python SDK)

docs/ollama_model_capability_tool.md — Model discovery and capability registration

docs/OLLAMA_VLLM_CLOUD_RESEARCH.md — Cloud + vLLM research (2026-04-10); established the dual-pillar strategy

llm/RESILIENCE.md — Graded inference hierarchy: Primary → Secondary → Failsafe (CPU) → Guarantee (Cloud)

Latest Benchmark (2026-04-11)

Prompt: "You are mindX. In one sentence, describe what you are." Script: test_cloud_all_models.py | Results: data/cloud_test_results.json

ModelPillarevalprompttotaltok/swall_mstotal_ms gpt-oss:120b-cloudCloud678114865.521,2141,022 deepseek-r1:1.5bCPU7917968.0016,29416,291 deepseek-coder:latestCPU72831557.2922,56922,565
Aggregate (all values ACTUAL from Ollama API, 18dp precision):
Total tokens: 399 (218 eval + 181 prompt) = 399000000000000000000 sub-tokens

Aggregate throughput: 11.03 tok/s (11.033658223708593293 at 18dp)

Cloud vs CPU speedup: 8.2x (65.52 vs ~7.6 tok/s) — 120B cloud GPU vs 1.5B local CPU

Cloud Timing Note

Cloud-proxied models (gpt-oss:120b-cloud) return eval_duration_ns: 0 — the local offload proxy does not expose per-stage timing from the remote GPU. The total_duration_ns is used for tok/s calculation instead. CPU pillar models return all duration fields. See test_cloud_all_models.py line 114 for the fallback logic.

36 Additional Cloud Models Available

The cloud catalog lists 36 models at ollama.com/api/tags. To test them:

# Option A: Pull with -cloud suffix for free-tier proxy (metadata only, no weights) ollama pull deepseek-v3.2-cloud python3 scripts/test_cloud_all_models.py --local Option B: Set API key for direct cloud access to all 36 models export OLLAMA_API_KEY=your_key python3 scripts/test_cloud_all_models.py

How Cloud Works Without an API Key

test_cloud_inference.py and OllamaCloudTool return cloud model responses without OLLAMA_API_KEY because of Ollama's local offload architecture:

The -cloud Suffix

Model names with -cloud appended (e.g., gpt-oss:120b-cloud) are metadata-only pulls that proxy inference to ollama.com. Without the suffix (e.g., gpt-oss:120b), ollama pull downloads the full model weights (gigabytes) for CPU pillar execution.

The cloud catalog returns names without -cloud. Append it for free-tier local proxy:

ollama pull gpt-oss:120b-cloud # metadata only → inference proxied to cloud ollama pull deepseek-v3.2-cloud # metadata only → inference proxied to cloud vs ollama pull gemma3:4b # downloads 3.3GB weights for local CPU execution

The Mechanism

ollama pull gpt-oss:120b-cloud downloads metadata (not weights) to the local daemon

ollama run gpt-oss:120b-cloud sends the request to localhost:11434 like any local model

The local Ollama daemon detects the -cloud tag and transparently proxies to ollama.com

Authentication is handled by the daemon using credentials from ollama signin (stored at ~/.ollama/id_ed25519; see FAQ)

OllamaCloudTool calls localhost:11434/api/chat — no Bearer token needed because the local daemon is the auth proxy

Agent → OllamaCloudTool.execute(operation="chat") → _try_local_proxy() → localhost:11434/api/chat (model-cloud) → local Ollama daemon ↓ (transparent proxy) ollama.com (auth via ed25519 key) ↓ Cloud GPU inference ↓ Agent ← result (eval_count, tokens_per_sec, 18dp) ← ollama.com

Three Access Paths
PathURLAuthPullWhenTool Method Local offloadlocalhost:11434None (daemon)ollama pull model-cloudFree tier, no key_try_local_proxy() Direct APIollama.com/api/chatBearer $OLLAMA_API_KEYNone neededKey set_try_direct_cloud() Local executionlocalhost:11434Noneollama pull model (full weights)Always offlineOllamaAPI
OllamaCloudTool in auto mode tries local proxy first, then direct cloud — matching the dual-pillar design.

Why /api/tags Works Without Auth

The model listing endpoint at https://ollama.com/api/tags is publicly accessible — it lists available cloud models for discovery. This is how test_cloud_all_models.py, OllamaCloudTool.list_models, and OllamaCloudModelDiscovery discover available models without authentication.

Free Tier Limits
LimitValueResetTracked By SessionLight usageEvery 5 hoursCloudRateLimiter in OllamaCloudTool WeeklyLight usageEvery 7 daysCloudQuotaTracker Concurrent cloud models1—Ollama server-side
See Cloud Rate Limiting for the adaptive pacing strategy (3s–30s based on quota utilization) that maximizes throughput within these limits using actual token counts.

Modelfile as Canonical Schema

The Ollama Modelfile is mindX's canonical schema for model collection and rating across both pillars:
InstructionMaps TomindX Component FROMBase architecture/weightsmodels/ollama.yaml models[].name PARAMETEROperational characteristicsmodels/ollama.yaml model_selection TEMPLATECommunication protocolGo template syntax SYSTEMCognitive identityAgent system prompts in BDIAgent CapabilitiesDynamic from /api/showOllamaCloudModelDiscovery
This feeds into:
HierarchicalModelScorer — learned task_scores from precision metrics feedback

OllamaCloudModelDiscovery — dynamic capability detection across both CPU and cloud models

InferenceDiscovery — provider routing with cloud guarantee fallback

Agent-model alignment toward Chimaiera (the ROI moment when model composition outperforms single-model inference)

See Modelfile Reference for the full instruction set and Chimaiera alignment section.

Precision Metrics

Token tracking at 18 decimal places using Python Decimal. No floating-point drift. No estimation. Applied identically to both CPU and Cloud pillars.
WhatBeforeAfterModule Token countsword_count 1.3eval_count from Ollama API precision_metrics.py Timingfloat millisecondsint nanoseconds (Ollama native)OllamaResponseMetrics Accumulationfloat (compounding drift)Decimal (28-digit significand)PrecisionAccumulator Sub-token unitnone1 token = 10^18 sub-tokens (wei equivalent)SUBTOKEN_FACTOR Cloud tok/snot trackedeval_count / total_duration_ns (cloud proxy returns eval_duration: 0)OllamaCloudTool

Local metrics: data/metrics/precision_metrics.json (via OllamaAPI) Cloud metrics: data/metrics/cloud_precision_metrics.json (via OllamaCloudTool)

Full docs: Precision Metrics.

Resilience Design

The 5-step resolution chain in _resolve_inference_model() ensures mindX always has inference when any network path is available:

Step 1: InferenceDiscovery → best provider (Gemini, Mistral, Groq, etc.)
          ↓ all keys exhausted or rate limited
Step 2: OllamaChatManager → local model selection (HierarchicalModelScorer)
          ↓ connection stale or failed
Step 3: Re-init OllamaChatManager → retry with fresh connection
          ↓ still failing
Step 4: Direct HTTP → localhost:11434/api/tags (zero dependencies)
          ↓ local Ollama completely down
Step 5: OllamaCloudTool → ollama.com GPU inference ← GUARANTEE (24/7/365)
          ↓ cloud also unreachable (network down)
     → None → fallback_decide() rule-based heuristics → 2-min backoff

TierRoleProviderSpeedmindX Component PrimaryBest qualityGemini, MistralVariesLLMFactory SecondarySpeed/costGroq, TogetherFastLLMFactory FailsafeCPU pillarOllama local (localhost:11434)~8 tok/sOllamaChatManager GuaranteeCloud pillarOllama Cloud (ollama.com)~65 tok/sOllamaCloudTool Last resortNo inference——fallback_decide() rule-based

Cloud is guarantee, not default. The _cloud_inference_active flag in mindXagent.py routes one chat through OllamaCloudTool, then resets so the next cycle tries local first. This preserves CPU pillar autonomy while ensuring the cloud pillar catches every gap.

InferenceDiscovery.get_provider_for_task() routes tasks through the same hierarchy: preferred provider → ollama_local → ollama_cloud → any available → None.

Implementation: _resolve_inference_model() (5 steps) → InferenceDiscovery (provider probing + cloud fallback) → OllamaCloudTool (cloud guarantee) → RESILIENCE.md (graded hierarchy docs) → chat_with_ollama() (cloud routing when active).

mindX File Map

Core Ollama Integration

FileRoleDocPillar tools/cloud/ollama_cloud_tool.pyOllamaCloudTool — cloud inference for any agentThis pageCloud api/ollama/ollama_url.pyHTTP API client, rate limiter, precision metrics, failover ArchitectureCPU agents/core/ollama_chat_manager.pyConnection manager, model discovery, conversation historyArchitectureCPU agents/core/mindXagent.py5-step resolution chain, cloud routing, autonomous loopArchitectureBoth llm/ollama_handler.pyLLMFactory handler interfaceArchitectureCPU llm/llm_factory.pyMaster factory, provider selectionConfigurationBoth llm/rate_limiter.pyToken-bucket rate limitingCloud Rate LimitingBoth llm/precision_metrics.py18dp scientific token trackingPrecision MetricsBoth llm/inference_discovery.pyBoot-time probe, task routing, cloud guarantee ArchitectureBoth models/ollama.yamlModel registry, task scores, cloud config ConfigurationBoth api/ollama/ollama_admin_routes.pyAdmin endpoints (status, test, models)FAQCPU agents/core/model_scorer.pyHierarchicalModelScorerModelfile SchemaBoth agents/core/inference_optimizer.pySliding-scale frequency optimizationArchitectureCPU agents/hostinger_vps_agent.pyVPS management: 3 MCP channels (SSH + Hostinger API + Backend)NAV.mdBoth

Test Scripts

FilePurposePillar scripts/test_cloud_all_models.pyPrimary: every model, precision metrics, 18dp DecimalBoth scripts/test_cloud_inference.pyOriginal: local + cloud + vLLM comparisonBoth scripts/test_ollama_connection.pyConnection test via OllamaAPICPU data/cloud_test_results.jsonLatest benchmark results (JSON, 18dp)Both

External References

ResourceURLRelevance Ollama Homepageollama.com Both pillars Ollama Docsdocs.ollama.comAPI reference source Ollama API (OpenAPI)docs.ollama.com/openapi.yaml API docs source Ollama GitHubgithub.com/ollama/ollama Setup Python SDKgithub.com/ollama/ollama-python SDK docs JavaScript SDKgithub.com/ollama/ollama-js SDK docs Cloud Modelsollama.com/search?c=cloud Cloud pillar catalog Thinking Modelsollama.com/search?c=thinking Thinking feature Vision Modelsollama.com/search?c=vision Vision feature Tool Modelsollama.com/search?c=tools Tool Calling feature Model Libraryollama.com/library Modelfile reference API Keysollama.com/settings/keys Cloud auth Discorddiscord.gg/ollamaCommunity Docker Hubhub.docker.com/r/ollama/ollama Docker setup OllamaFreeAPIgithub.com/mfoud444/ollamafreeapiCommunity gateway mindX Productionmindx.pythai.netLive CPU pillar mindX Thesisdocs/THESIS.mdDarwin-Godel Machine synthesis mindX Manifestodocs/MANIFESTO.mdChimaiera roadmap RAGEdocs/AGINT.mdEmbeddings architecture — RAGE wipes the floor with RAG Attributiondocs/ATTRIBUTION.mdOpen source stack: Ollama, vLLM, SwarmClaw, pgvector

Version Info

Ollama docs: Fetched 2026-04-11 from docs.ollama.com

Operational standards: CPU (OllamaAPI + OllamaChatManager) + Cloud (OllamaCloudTool)

Resilience: 5-step chain in _resolve_inference_model() with cloud guarantee

Precision: 18dp Decimal via precision_metrics.py, actual counts from Ollama API

Production: mindx.pythai.net (4GB VPS, CPU pillar, dual-URL failover)

Benchmark: 2026-04-11 — 3 models, 399 tokens, cloud 8.2x faster than CPU

28 files, ~6,000 lines — self-contained for resilient offline operation

Referenced in this document

ATTRIBUTION MANIFESTO NAV OLLAMA_VLLM_CLOUD_RESEARCH THESIS agint ollama_api_integration ollama_integration ollama_model_capability_tool

All Documents Document Index The Book of mindX Improvement Journal API Reference

Ollama Complete Reference — Local Documentation for mindX

Operational Standards

Quick Navigation

API Reference

Features

Cloud & Infrastructure

SDKs

Setup & Operations

mindX Integration

Test & Benchmarking

Existing mindX Ollama Docs (pre-2026-04-11)

Latest Benchmark (2026-04-11)

Cloud Timing Note

36 Additional Cloud Models Available

Option B: Set API key for direct cloud access to all 36 models

How Cloud Works Without an API Key

The `-cloud` Suffix

vs

The Mechanism

Three Access Paths

Why `/api/tags` Works Without Auth

Free Tier Limits

Modelfile as Canonical Schema

Precision Metrics

Resilience Design

mindX File Map

Core Ollama Integration

Test Scripts

External References

Version Info

Ollama Complete Reference — Local Documentation for mindX

Operational Standards

Quick Navigation

API Reference

Features

Cloud & Infrastructure

SDKs

Setup & Operations

mindX Integration

Test & Benchmarking

Existing mindX Ollama Docs (pre-2026-04-11)

Latest Benchmark (2026-04-11)

Cloud Timing Note

36 Additional Cloud Models Available

Option B: Set API key for direct cloud access to all 36 models

How Cloud Works Without an API Key

The -cloud Suffix

vs

The Mechanism

Three Access Paths

Why /api/tags Works Without Auth

Free Tier Limits

Modelfile as Canonical Schema

Precision Metrics

Resilience Design

mindX File Map

Core Ollama Integration

Test Scripts

External References

Version Info

The `-cloud` Suffix

Why `/api/tags` Works Without Auth