Self-contained reference for all Ollama capabilities.
No external docs needed — resilient offline operation.
Source: docs.ollama.com (fetched 2026-04-11) + mindX integration specifics.
mindX operates from two inference pillars — both are operational standards, not fallbacks:
- **CPU pillar** — local Ollama at `localhost:11434`. CPU provides autonomy: mindX reasons even offline, even when every API key is exhausted.
- **Cloud pillar** — `ollama.com` via `OllamaCloudTool`. Cloud provides scale: 120B+ parameter models on NVIDIA GPUs, 8.2x faster than local CPU.

Together they form the resilience guarantee: mindX never stops inferring.
The 5-step resolution chain in _resolve_inference_model() tries the best available source first and walks down to guarantee. CPU is the failsafe. Cloud is the guarantee. Both are always ready.
- `POST /api/generate` — `localhost:11434` and `ollama.com`
- `POST /api/chat` — `localhost:11434` and `ollama.com`
- `POST /api/embed` — `localhost:11434` and `ollama.com`
- `GET /api/ps`, `GET /api/version` — `localhost:11434`

All endpoints documented with every parameter, response field, and curl/Python/JavaScript examples. See the Ollama OpenAPI spec for the authoritative schema.
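Because the request schema is identical on both pillars, a chat call can be sketched with only the Python standard library. This is an illustrative sketch, not mindX code: the helper name `build_chat_request` and the example model are assumptions; only the endpoint path and payload fields come from the Ollama API.

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST /api/chat request; works unchanged against either pillar's base URL."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # OllamaAPI currently uses stream=False
    }
    return urllib.request.Request(
        f"{base_url}/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    # Requires a running local daemon; swap the base URL for the cloud pillar.
    req = build_chat_request("http://localhost:11434", "gemma3:4b", "Hello")
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["message"]["content"])
```

Swapping `base_url` between `http://localhost:11434` and the cloud endpoint is the entire pillar switch at the HTTP level.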
Each feature doc includes curl, Python SDK, JavaScript SDK, and mindX-specific code examples. All features work identically on both CPU and Cloud pillars.
- **Streaming** — `OllamaAPI`, which currently uses `stream=False`
- **Thinking** — `think` parameter; supported models include DeepSeek R1 (local) and GPT-OSS (cloud, levels: `"low"`/`"medium"`/`"high"`)
- **Structured outputs** — `format` parameter; works with Pydantic and Zod; used by BDI reasoning for structured state extraction
- **Embeddings** — `mxbai-embed-large` and `nomic-embed-text`
- **Tool calling** — `BaseTool` via `OllamaCloudTool`
- **Web search** — `OllamaCloudTool.execute(operation="web_search")`
- **Model discovery** — `/api/tags` and the `OllamaCloudModelDiscovery` class; feeds into the Modelfile schema and Chimaiera alignment; the `-cloud` suffix distinction
- **Rate limiting** — `CloudRateLimiter` embedded in `OllamaCloudTool` with adaptive pacing (3s–30s); uses actual token counts from the Ollama API; integrates with `rate_limiter.py`
- **OpenAI compatibility** — `/v1/chat/completions`; works with the OpenAI Python SDK and OpenAI JS SDK; base URL `localhost:11434/v1/` for the CPU pillar, `ollama.com/v1/` for the cloud pillar
- **Python SDK** — `ollama` library (PyPI); sync, async, cloud client, auto-parsed tool schemas; mindX uses aiohttp directly via `OllamaAPI` for maximum control
- **JavaScript SDK** — `ollama` library (npm); browser, Node.js, cloud, abort; used by the mindX frontend

`OllamaCloudTool` is the `BaseTool` for the cloud pillar. Any agent can `execute(operation="chat", model="deepseek-v3.2", message="...")`. Dual access (local proxy + direct API), embedded `CloudRateLimiter`, 18dp precision metrics, conversation history, branch-ready. Registered in `augmentic_tools_registry.json` with `access_control: [""]`. Wired into `_resolve_inference_model()` as Step 5 (guarantee).

- **Architecture** — `OllamaAPI` → `OllamaChatManager` → `LLMFactory` → `InferenceDiscovery`; 5-step resilience chain; cloud offload
- **Configuration** — `MINDX_LLM__OLLAMA__BASE_URL` (CPU pillar), `OLLAMA_API_KEY` (cloud pillar), `models/ollama.yaml`, BANKON vault, `llm_factory_config.json`
- **Precision metrics** — `llm/precision_metrics.py`: 18-decimal-place scientific tracking; Decimal accumulation; actual counts only; separate cloud file at `data/metrics/cloud_precision_metrics.json`
- `scripts/test_cloud_all_models.py` — primary benchmark: single prompt to every model, precision metrics (18dp Decimal), actual `eval_count`/`eval_duration` from the Ollama API; see Latest Benchmark and How Cloud Works Without an API Key
- `scripts/test_cloud_inference.py` — original multi-source benchmark (local + cloud + vLLM)
- `scripts/test_ollama_connection.py` — connection test using `OllamaAPI`
- `docs/ollama_api_integration.md` — original API compliance notes (timeouts, keep_alive, token counting)
- `docs/ollama_integration.md` — custom client (`OllamaAPI`) vs official library (Python SDK)
- `docs/ollama_model_capability_tool.md` — model discovery and capability registration
- `docs/OLLAMA_VLLM_CLOUD_RESEARCH.md` — cloud + vLLM research (2026-04-10); established the dual-pillar strategy
- `llm/RESILIENCE.md` — graded inference hierarchy: Primary → Secondary → Failsafe (CPU) → Guarantee (Cloud)

Prompt: "You are mindX. In one sentence, describe what you are."
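As a concrete illustration of the `format` parameter: Ollama accepts a JSON schema that constrains the model's reply. The helper name and the example schema below are hypothetical — a sketch of the kind of payload BDI structured state extraction might send, not mindX code.

```python
# Illustrative sketch: a POST /api/chat payload using Ollama's `format`
# parameter. The helper name and the example schema are hypothetical.
def structured_chat_payload(model: str, prompt: str, schema: dict) -> dict:
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "format": schema,  # the model's reply must validate against this schema
        "stream": False,
    }

# Example: the shape of schema BDI reasoning might use for state extraction
state_schema = {
    "type": "object",
    "properties": {
        "goal": {"type": "string"},
        "confidence": {"type": "number"},
    },
    "required": ["goal", "confidence"],
}
payload = structured_chat_payload("deepseek-v3.2", "Extract the BDI state.", state_schema)
```

The same schema object can be generated from a Pydantic model (`Model.model_json_schema()`) or a Zod schema on the JavaScript side.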
Script: test_cloud_all_models.py | Results: data/cloud_test_results.json
Models tested: `gpt-oss:120b-cloud`, `deepseek-r1:1.5b`, `deepseek-coder:latest`.

Aggregate (all values ACTUAL from Ollama API, 18dp precision):
399000000000000000000 sub-tokens accumulated; throughput values carry 18 decimal places (e.g., 11.033658223708593293). Cloud-proxied models (`gpt-oss:120b-cloud`) return `eval_duration_ns: 0` — the local offload proxy does not expose per-stage timing from the remote GPU. The `total_duration_ns` is used for tok/s calculation instead. CPU pillar models return all duration fields. See `test_cloud_all_models.py` line 114 for the fallback logic.
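The fallback can be sketched as follows — a simplified illustration of the idea, not the actual code in `test_cloud_all_models.py`:

```python
from decimal import Decimal

NS_PER_SECOND = Decimal(10) ** 9
PLACES = Decimal("1." + "0" * 18)  # quantize results to 18 decimal places

def tokens_per_sec(eval_count: int, eval_duration_ns: int, total_duration_ns: int) -> Decimal:
    """Cloud-proxied models report eval_duration_ns == 0, so fall back to total duration."""
    duration_ns = eval_duration_ns or total_duration_ns
    if duration_ns == 0:
        return Decimal(0)
    return (Decimal(eval_count) * NS_PER_SECOND / Decimal(duration_ns)).quantize(PLACES)
```

`Decimal` arithmetic keeps the 18dp result exact where a `float` division would drift.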
The cloud catalog lists 36 models at ollama.com/api/tags. To test them:
```shell
# Option A: Pull with -cloud suffix for free-tier proxy (metadata only, no weights)
ollama pull deepseek-v3.2-cloud
python3 scripts/test_cloud_all_models.py --local

# Option B: Set API key for direct cloud access to all 36 models
export OLLAMA_API_KEY=your_key
python3 scripts/test_cloud_all_models.py
```
test_cloud_inference.py and OllamaCloudTool return cloud model responses without OLLAMA_API_KEY because of Ollama's local offload architecture:
**The `-cloud` suffix:** Model names with `-cloud` appended (e.g., `gpt-oss:120b-cloud`) are metadata-only pulls that proxy inference to ollama.com. Without the suffix (e.g., `gpt-oss:120b`), `ollama pull` downloads the full model weights (gigabytes) for CPU pillar execution.
The cloud catalog returns names without -cloud. Append it for free-tier local proxy:
```shell
ollama pull gpt-oss:120b-cloud    # metadata only → inference proxied to cloud
ollama pull deepseek-v3.2-cloud   # metadata only → inference proxied to cloud
```

vs

```shell
ollama pull gemma3:4b             # downloads 3.3GB weights for local CPU execution
```
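A tiny hypothetical helper capturing that naming convention (not mindX code — just the suffix rule made explicit):

```python
def to_cloud_tag(name: str) -> str:
    """Append -cloud to a catalog name so `ollama pull` fetches the free-tier proxy stub
    instead of the full weights. Idempotent if the suffix is already present."""
    return name if name.endswith("-cloud") else f"{name}-cloud"
```

Applied to the catalog from `/api/tags`, this yields the names the local daemon will recognize as proxy models.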
1. `ollama pull gpt-oss:120b-cloud` downloads metadata (not weights) to the local daemon.
2. `ollama run gpt-oss:120b-cloud` sends the request to `localhost:11434` like any local model.
3. The daemon recognizes the `-cloud` tag and transparently proxies to ollama.com.
4. Authentication uses the key from `ollama signin` (stored at `~/.ollama/id_ed25519`; see FAQ).
5. `OllamaCloudTool` calls `localhost:11434/api/chat` — no Bearer token needed because the local daemon is the auth proxy.

```
Agent → OllamaCloudTool.execute(operation="chat") → _try_local_proxy()
  → localhost:11434/api/chat (model-cloud) → local Ollama daemon
        ↓ (transparent proxy)
    ollama.com (auth via ed25519 key)
        ↓
    Cloud GPU inference
        ↓
Agent ← result (eval_count, tokens_per_sec, 18dp) ← ollama.com
```
| Mode | Endpoint | Setup | Code path |
|------|----------|-------|-----------|
| Local proxy (cloud pillar) | `localhost:11434` | `ollama pull model-cloud` | `_try_local_proxy()` |
| Direct cloud (cloud pillar) | `ollama.com/api/chat` | `Bearer $OLLAMA_API_KEY` | `_try_direct_cloud()` |
| Full local (CPU pillar) | `localhost:11434` | `ollama pull model` (full weights) | `OllamaAPI` |

`OllamaCloudTool` in auto mode tries the local proxy first, then direct cloud — matching the dual-pillar design.
**Why `/api/tags` works without auth:** The model listing endpoint at `https://ollama.com/api/tags` is publicly accessible — it lists available cloud models for discovery. This is how `test_cloud_all_models.py`, `OllamaCloudTool.list_models`, and `OllamaCloudModelDiscovery` discover available models without authentication.
Rate limiting is handled by `CloudRateLimiter` in `OllamaCloudTool` together with `CloudQuotaTracker`. See Cloud Rate Limiting for the adaptive pacing strategy (3s–30s based on quota utilization) that maximizes throughput within these limits using actual token counts.
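One plausible shape for such pacing — linear interpolation across the 3s–30s band by quota utilization. This is a sketch of the concept; the real `CloudRateLimiter` policy may differ:

```python
MIN_DELAY_S = 3.0
MAX_DELAY_S = 30.0

def adaptive_delay(used_tokens: int, quota_tokens: int) -> float:
    """Delay grows from 3s (quota idle) to 30s (quota nearly exhausted),
    driven by actual token counts rather than estimates."""
    utilization = min(max(used_tokens / quota_tokens, 0.0), 1.0)
    return MIN_DELAY_S + (MAX_DELAY_S - MIN_DELAY_S) * utilization
```

Feeding the function actual `eval_count` totals (not word-count estimates) keeps the pacing honest against the real quota.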
The Ollama Modelfile is mindX's canonical schema for model collection and rating across both pillars:
- `FROM` — mapped to `models/ollama.yaml` `models[].name`
- `PARAMETER` — mapped to `models/ollama.yaml` `model_selection`
- `TEMPLATE` and `SYSTEM` — consumed by `BDIAgent`
- Instruction metadata is read via `/api/show` by `OllamaCloudModelDiscovery`

This feeds into:

- `HierarchicalModelScorer` — learned `task_scores` from precision metrics feedback
- `OllamaCloudModelDiscovery` — dynamic capability detection across both CPU and cloud models
- `InferenceDiscovery` — provider routing with cloud guarantee fallback

See Modelfile Reference for the full instruction set and Chimaiera alignment section.
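For reference, a minimal Modelfile exercising those instructions (the model name and values are illustrative, not a mindX configuration):

```
FROM gemma3:4b
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
SYSTEM You are mindX, an autonomous agent.
TEMPLATE "{{ .System }} {{ .Prompt }}"
```

The same fields come back from `POST /api/show` for any installed model, which is what makes the Modelfile usable as a collection and rating schema.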
Token tracking at 18 decimal places using Python Decimal. No floating-point drift. No estimation. Applied identically to both CPU and Cloud pillars.
| Measurement | Naive approach | mindX approach | Component |
|---|---|---|---|
| Token counts | estimated `word_count × 1.3` | actual `eval_count` from the Ollama API | `precision_metrics.py` |
| Durations | `float` milliseconds | `int` nanoseconds (Ollama native) | `OllamaResponseMetrics` |
| Accumulation | `float` (compounding drift) | `Decimal` (28-digit significand) with `SUBTOKEN_FACTOR` | `PrecisionAccumulator` |
| Cloud tok/s | — | `eval_count / total_duration_ns` (cloud proxy returns `eval_duration: 0`) | `OllamaCloudTool` |

Local metrics: `data/metrics/precision_metrics.json` (via `OllamaAPI`)
Cloud metrics: data/metrics/cloud_precision_metrics.json (via OllamaCloudTool)
Full docs: Precision Metrics.
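The accumulation approach can be sketched as follows. This is not the actual `PrecisionAccumulator` source; the 10^18 `SUBTOKEN_FACTOR` is an assumption inferred from the sub-token totals reported by the benchmark:

```python
from decimal import Decimal, getcontext

getcontext().prec = 28  # 28-digit significand: no float drift at 18dp

SUBTOKEN_FACTOR = Decimal(10) ** 18  # assumed: 1 token = 10^18 sub-tokens

class SubtokenAccumulator:
    """Accumulate actual eval_count values as exact integer sub-tokens;
    division back to tokens happens only on read."""
    def __init__(self) -> None:
        self.subtokens = Decimal(0)

    def add(self, eval_count: int) -> None:
        self.subtokens += Decimal(eval_count) * SUBTOKEN_FACTOR

    @property
    def tokens(self) -> Decimal:
        return self.subtokens / SUBTOKEN_FACTOR
```

Because every addend is an exact integer `Decimal`, repeated accumulation never drifts the way `float` sums do.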
The 5-step resolution chain in _resolve_inference_model() ensures mindX always has inference when any network path is available:
```
Step 1: InferenceDiscovery → best provider (Gemini, Mistral, Groq, etc.)
  ↓ all keys exhausted or rate limited
Step 2: OllamaChatManager → local model selection (HierarchicalModelScorer)
  ↓ connection stale or failed
Step 3: Re-init OllamaChatManager → retry with fresh connection
  ↓ still failing
Step 4: Direct HTTP → localhost:11434/api/tags (zero dependencies)
  ↓ local Ollama completely down
Step 5: OllamaCloudTool → ollama.com GPU inference ← GUARANTEE (24/7/365)
  ↓ cloud also unreachable (network down)
→ None → fallback_decide() rule-based heuristics → 2-min backoff
```
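The walk-down can be sketched as an ordered probe list — probe names and stubs below are illustrative; the real chain lives in `_resolve_inference_model()`:

```python
from typing import Any, Callable, Optional, Sequence, Tuple

def resolve_inference(steps: Sequence[Tuple[str, Callable[[], Any]]]) -> Optional[Tuple[str, Any]]:
    """Try each (name, probe) in order; first non-None success wins.
    Returning None means total outage → rule-based fallback_decide()."""
    for name, probe in steps:
        try:
            result = probe()
        except Exception:
            continue  # provider down or rate limited; walk down the chain
        if result is not None:
            return name, result
    return None

# Hypothetical probes showing the walk-down:
def _stale_local() -> str:
    raise ConnectionError("connection stale")

chain = [
    ("inference_discovery", lambda: None),         # all API keys exhausted
    ("ollama_chat_manager", _stale_local),         # local connection failed
    ("ollama_cloud_tool", lambda: "cloud reply"),  # Step 5 guarantee catches it
]
winner = resolve_inference(chain)
```

Both failure modes — a probe that raises and a probe that returns nothing — fall through identically, which is what makes the cloud step a guarantee rather than a special case.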
| Tier | Component |
|---|---|
| Primary | `LLMFactory` |
| Secondary | `LLMFactory` |
| Failsafe (`localhost:11434`) | `OllamaChatManager` |
| Guarantee (`ollama.com`) | `OllamaCloudTool` |
| Last resort | `fallback_decide()` rule-based |

Cloud is guarantee, not default. The `_cloud_inference_active` flag in `mindXagent.py` routes one chat through `OllamaCloudTool`, then resets so the next cycle tries local first. This preserves CPU pillar autonomy while ensuring the cloud pillar catches every gap.
InferenceDiscovery.get_provider_for_task() routes tasks through the same hierarchy: preferred provider → ollama_local → ollama_cloud → any available → None.
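That routing order can be sketched as a small helper (the function itself is illustrative; only the preference order comes from the document):

```python
from typing import Optional, Sequence

def provider_for_task(preferred: str, available: Sequence[str]) -> Optional[str]:
    """Routing order: preferred → ollama_local → ollama_cloud → any available → None."""
    for candidate in (preferred, "ollama_local", "ollama_cloud"):
        if candidate in available:
            return candidate
    return available[0] if available else None
```

So a task preferring Gemini still lands on the cloud pillar when Gemini is unavailable, and on any remaining provider before giving up entirely.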
Implementation: _resolve_inference_model() (5 steps) → InferenceDiscovery (provider probing + cloud fallback) → OllamaCloudTool (cloud guarantee) → RESILIENCE.md (graded hierarchy docs) → chat_with_ollama() (cloud routing when active).
- `tools/cloud/ollama_cloud_tool.py`
- `api/ollama/ollama_url.py`
- `agents/core/ollama_chat_manager.py`
- `agents/core/mindXagent.py`
- `llm/ollama_handler.py` — `LLMFactory` handler interface
- `llm/llm_factory.py`
- `llm/rate_limiter.py`
- `llm/precision_metrics.py`
- `llm/inference_discovery.py`
- `models/ollama.yaml`
- `api/ollama/ollama_admin_routes.py`
- `agents/core/model_scorer.py` — `HierarchicalModelScorer`
- `agents/core/inference_optimizer.py`
- `agents/hostinger_vps_agent.py`
- `scripts/test_cloud_all_models.py` — `Decimal`
- `scripts/test_cloud_inference.py`
- `scripts/test_ollama_connection.py` — `OllamaAPI`
- `data/cloud_test_results.json`
- `docs/THESIS.md`, `docs/MANIFESTO.md`, `docs/AGINT.md`, `docs/ATTRIBUTION.md`

Summary: Local (`OllamaAPI` + `OllamaChatManager`) + Cloud (`OllamaCloudTool`); `_resolve_inference_model()` with cloud guarantee; `Decimal` via `precision_metrics.py`, actual counts from the Ollama API.