ollama/mindx/architecture.md · 9.8 KB

mindX Ollama Architecture

How mindX uses Ollama — from production deployment at mindx.pythai.net to local development.

Integration Layer

┌─────────────────────────────────────────────────────────────┐
│                   mindX Agent Layer                          │
│  MindXAgent · BlueprintAgent · AuthorAgent · CEOAgent       │
│                                                              │
│  ┌────────────────────┐  ┌──────────────────────────────┐   │
│  │ OllamaChatManager  │  │ InferenceDiscovery           │   │
│  │ (agents/core/)     │  │ (llm/inference_discovery.py) │   │
│  │                    │  │                               │   │
│  │ • Model discovery  │  │ • Probes all sources at boot │   │
│  │ • Best model select│  │ • Validates before each cycle│   │
│  │ • Chat history     │  │ • Feeds HierarchicalScorer   │   │
│  │ • Auto-retry       │  │                               │   │
│  └────────┬───────────┘  └──────────────┬───────────────┘   │
│           │                              │                    │
│  ┌────────┴──────────────────────────────┴───────────────┐   │
│  │              OllamaAPI (api/ollama/ollama_url.py)      │   │
│  │                                                        │   │
│  │  • /api/generate and /api/chat endpoints               │   │
│  │  • Token-bucket rate limiter (1000 RPM local)          │   │
│  │  • Dual-URL failover (primary → fallback)              │   │
│  │  • 120s timeout, keep_alive, format, think support     │   │
│  │  • Actual token counting from API response             │   │
│  └───────────┬────────────────────────┬──────────────────┘   │
│              │                        │                       │
│  ┌───────────┴───────┐  ┌────────────┴──────────────────┐   │
│  │ OllamaHandler     │  │ LLMFactory                    │   │
│  │ (llm/ollama_      │  │ (llm/llm_factory.py)          │   │
│  │  handler.py)      │  │                                │   │
│  │                   │  │ • Provider preference order    │   │
│  │ • LLMHandlerIface │  │ • DualLayerRateLimiter        │   │
│  │ • /api/generate   │  │ • Handler caching              │   │
│  │ • Returns None on │  │ • Ollama = last resort fallback│   │
│  │   failure (→ next) │  │ • Default: phi3:mini          │   │
│  └───────────────────┘  └───────────────────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                    │                    │
        ┌───────────┘                    └──────────┐
        ▼                                           ▼
┌───────────────────┐                    ┌──────────────────┐
│ Primary: GPU      │                    │ Cloud: ollama.com│
│ 10.0.0.155:18080  │                    │ (OLLAMA_API_KEY) │
│ (when available)  │                    │ Free/Pro/Max tier│
└───────────────────┘                    └──────────────────┘
        │ (unreachable?)
        ▼
┌───────────────────┐
│ Fallback: CPU     │
│ localhost:11434   │
│ (always available)│
└───────────────────┘
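The primary-to-fallback path at the bottom of the diagram can be illustrated with a minimal sketch. It is not the code in api/ollama/ollama_url.py: the function name `first_reachable`, the `probe` callback, and the `http://` scheme on both URLs are assumptions made for illustration, and the probe here is a stand-in that pretends the GPU host is down.

```python
from typing import Callable, Optional, Sequence

# Hosts and ports come from the diagram; the http:// scheme is an assumption.
PRIMARY_URL = "http://10.0.0.155:18080"
FALLBACK_URL = "http://localhost:11434"


def first_reachable(urls: Sequence[str], probe: Callable[[str], bool]) -> Optional[str]:
    """Return the first URL for which probe(url) is True, else None."""
    for url in urls:
        try:
            if probe(url):
                return url
        except OSError:
            # Connection errors count as "unreachable"; try the next URL.
            continue
    return None


def fake_probe(url: str) -> bool:
    # Stand-in for a real HTTP check; simulates an unreachable GPU host.
    if url == PRIMARY_URL:
        raise ConnectionRefusedError("GPU host unreachable")
    return True


chosen = first_reachable([PRIMARY_URL, FALLBACK_URL], fake_probe)
print(chosen)
```

Passing the probe in as a callback keeps the ordering logic testable without a network; a real probe would likely issue an HTTP request with a short timeout.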

File Map

| File | Role |
| --- | --- |
| api/ollama/ollama_url.py | HTTP API client, rate limiter, metrics, failover |
| agents/core/ollama_chat_manager.py | Connection manager, model discovery, conversation history |
| llm/ollama_handler.py | LLMFactory handler interface implementation |
| llm/llm_factory.py | Master factory, provider selection, Ollama as fallback |
| llm/rate_limiter.py | Token-bucket rate limiting with metrics |
| llm/inference_discovery.py | Boot-time probe of all inference sources |
| models/ollama.yaml | Model registry with task scores |
| api/ollama/ollama_admin_routes.py | Admin endpoints (status, test, generate, models) |
| api/ollama/ollama_model_capability_tool.py | Dynamic capability detection |
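For readers unfamiliar with token-bucket rate limiting, the following is a generic sketch of the technique, not the contents of llm/rate_limiter.py. The class name `TokenBucket`, its parameters, and the injected `clock` are hypothetical; only the 1000 RPM figure comes from the diagram above.

```python
import time


class TokenBucket:
    """Generic token bucket: refills at rate_per_minute / 60 tokens per second."""

    def __init__(self, rate_per_minute: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_minute / 60.0  # tokens added per second
        self.capacity = capacity            # maximum burst size
        self.tokens = capacity              # start with a full bucket
        self.clock = clock
        self.last = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# A fake clock makes the behaviour deterministic for the example.
fake_now = [0.0]
bucket = TokenBucket(rate_per_minute=1000, capacity=2, clock=lambda: fake_now[0])
results = [bucket.try_acquire() for _ in range(3)]  # burst of 3 against capacity 2
fake_now[0] += 0.12                                 # enough time to refill the bucket
after_wait = bucket.try_acquire()
print(results, after_wait)
```

The third request in the burst is rejected because the bucket holds only two tokens; after the simulated wait the bucket has refilled and the next request succeeds.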

Configuration Cascade

ENV: MINDX_LLM__OLLAMA__BASE_URL
  → explicit base_url parameter
    → models/ollama.yaml base_url
      → data/config/*.json settings
        → localhost:11434 (default)
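The cascade above amounts to "first non-empty value wins". A minimal sketch, assuming the order shown and a hypothetical `base_url` key in both the YAML and JSON settings; `resolve_base_url` and the `http://` scheme on the default are illustrative names and choices, not the project's actual loader:

```python
import os
from typing import Mapping, Optional

DEFAULT_BASE_URL = "http://localhost:11434"  # scheme is an assumption


def resolve_base_url(
    explicit: Optional[str] = None,
    yaml_cfg: Optional[Mapping[str, str]] = None,
    json_cfg: Optional[Mapping[str, str]] = None,
    env: Mapping[str, str] = os.environ,
) -> str:
    """Walk the cascade in the documented order and return the first hit."""
    candidates = (
        env.get("MINDX_LLM__OLLAMA__BASE_URL"),  # 1. environment variable
        explicit,                                # 2. explicit base_url parameter
        (yaml_cfg or {}).get("base_url"),        # 3. models/ollama.yaml
        (json_cfg or {}).get("base_url"),        # 4. data/config/*.json
    )
    for value in candidates:
        if value:
            return value
    return DEFAULT_BASE_URL                      # 5. built-in default


print(resolve_base_url(env={}))
print(resolve_base_url(explicit="http://a", env={"MINDX_LLM__OLLAMA__BASE_URL": "http://b"}))
```

Passing `env` as a parameter keeps the function easy to test without mutating the process environment.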

Model Selection Hierarchy

1. HierarchicalModelScorer: learned from feedback (success rate, latency, token throughput)

2. Task keyword matching: chat→mistral/llama, reasoning→nemo/deepseek, coding→codegemma

3. First available: whatever is loaded
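The three levels above can be sketched as a single function. This is an illustration under assumptions, not the HierarchicalModelScorer itself: `select_model`, the `scores` mapping (standing in for learned feedback), and substring matching on model tags are all hypothetical.

```python
from typing import Mapping, Optional, Sequence

# Keyword table taken from the hierarchy above.
TASK_KEYWORDS = {
    "chat": ("mistral", "llama"),
    "reasoning": ("nemo", "deepseek"),
    "coding": ("codegemma",),
}


def select_model(
    task: str,
    available: Sequence[str],
    scores: Optional[Mapping[str, float]] = None,
) -> Optional[str]:
    """Pick a model: learned score, then task keyword, then first available."""
    if not available:
        return None
    # Level 1: prefer the highest learned score among available models.
    scored = [m for m in available if scores and m in scores]
    if scored:
        return max(scored, key=lambda m: scores[m])
    # Level 2: match a task keyword against the model tag.
    for keyword in TASK_KEYWORDS.get(task, ()):
        for model in available:
            if keyword in model:
                return model
    # Level 3: fall back to whatever is loaded first.
    return available[0]


print(select_model("coding", ["qwen3:1.7b", "codegemma:2b"]))
```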

Resilience Chain (from llm/RESILIENCE.md)

_resolve_inference_model() — 5-step chain:

Step 1: InferenceDiscovery → best provider (Gemini, Mistral, Groq, etc.)
Step 2: OllamaChatManager → local model selection
Step 3: Re-init OllamaChatManager → retry with fresh connection
Step 4: Direct HTTP → localhost:11434/api/tags (zero dependencies)
Step 5: OllamaCloudTool → ollama.com GPU inference ← GUARANTEE (24/7/365)
  → None → fallback_decide() → 2-min backoff

| Tier | Role | Provider | When |
| --- | --- | --- | --- |
| Primary | Best quality | Gemini, Mistral | First choice |
| Secondary | Speed/cost | Groq, Together | Latency or cost |
| Failsafe | Local fallback | Ollama CPU (localhost:11434) | When cloud APIs fail |
| Guarantee | Cloud fallback | OllamaCloudTool (ollama.com) | When local is also down (24/7/365) |
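The chain pattern behind `_resolve_inference_model()` is "try each source in order and stop at the first one that yields a model". The sketch below shows that pattern only; `resolve_first` and the stand-in step functions are hypothetical, and the real method's signature and error handling may differ.

```python
from typing import Callable, Optional, Sequence, Tuple

Resolver = Callable[[], Optional[str]]


def resolve_first(steps: Sequence[Tuple[str, Resolver]]) -> Tuple[Optional[str], Optional[str]]:
    """Run steps in order; return (step name, model) for the first success."""
    for name, step in steps:
        try:
            model = step()
        except Exception:
            # A failing step is treated like "no model"; move on to the next.
            model = None
        if model is not None:
            return name, model
    # Every step failed: the caller would fall back and back off.
    return None, None


# Stand-in steps that simulate two failures followed by a success.
def discovery_down() -> Optional[str]:
    return None


def chat_manager_down() -> Optional[str]:
    raise ConnectionError("daemon not running")


def direct_http() -> Optional[str]:
    return "qwen3:1.7b"


steps = [
    ("InferenceDiscovery", discovery_down),
    ("OllamaChatManager", chat_manager_down),
    ("Direct HTTP", direct_http),
    ("OllamaCloudTool", lambda: "gpt-oss:120b-cloud"),
]
source, model = resolve_first(steps)
print(source, model)
```

Because the loop stops at the first success, the cloud step is never called here, which matches the idea that cloud is the guarantee and not the default.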

As long as ollama.com is reachable, mindX always has an inference source. Cloud is the guarantee, not the default: the _cloud_inference_active flag in mindXagent.py resets after one use, so the next cycle tries local first.

Cloud Offload (via `-cloud` suffix)

Cloud models accessed through the local daemon use the -cloud tag suffix. This is a metadata-only pull — inference is proxied to ollama.com GPU servers. See How Cloud Works Without an API Key and the latest benchmark.

ollama pull gpt-oss:120b-cloud    → metadata only, inference on cloud GPU (65 tok/s)
ollama pull deepseek-r1:1.5b      → full weights, inference on local CPU (8 tok/s)
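Since the only visible difference between the two pulls is the tag suffix, a caller can tell cloud-proxied models from local ones by inspecting the tag. A small sketch, assuming the `-cloud` suffix is the sole marker; the helper names are hypothetical:

```python
from typing import Dict, List, Sequence


def is_cloud_model(tag: str) -> bool:
    """True when the model tag ends with the -cloud suffix."""
    return tag.endswith("-cloud")


def partition_models(tags: Sequence[str]) -> Dict[str, List[str]]:
    """Split model tags into cloud-proxied and locally executed groups."""
    groups: Dict[str, List[str]] = {"cloud": [], "local": []}
    for tag in tags:
        groups["cloud" if is_cloud_model(tag) else "local"].append(tag)
    return groups


groups = partition_models(["gpt-oss:120b-cloud", "deepseek-r1:1.5b"])
print(groups)
```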

Test script: scripts/test_cloud_all_models.py

Production Deployment Notes (mindx.pythai.net)

VPS: 4GB RAM, No GPU, Hostinger

Only 1 model loaded at a time

qwen3:1.7b as autonomous default (~2GB RAM)

qwen3:0.6b for lightweight tasks (~1GB)

Embedding models: mxbai-embed-large (0.7GB), nomic-embed-text (0.3GB)

keep_alive: 5m — free memory between cycles

Autonomous cycle: 300s interval with inference pre-check
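To make the keep_alive setting concrete, here is a sketch of a request body for Ollama's /api/generate endpoint using the values listed above. The `model`, `prompt`, `stream`, and `keep_alive` fields are part of the Ollama API; the helper `build_generate_payload` is a hypothetical name, and the block only builds the JSON without sending it.

```python
import json
from typing import Any, Dict


def build_generate_payload(
    prompt: str,
    model: str = "qwen3:1.7b",  # autonomous default from the notes above
    keep_alive: str = "5m",     # unload the model after 5 idle minutes
) -> Dict[str, Any]:
    """Build a non-streaming /api/generate request body."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "keep_alive": keep_alive,
    }


payload = build_generate_payload("Summarise the last cycle.")
body = json.dumps(payload)
print(body)
```

On a 4GB host, a short keep_alive trades a slower first token on the next cycle for memory that is freed between cycles.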

Known Issues (from audit 2026-04-10)

OllamaHandler ignores rate limiting — uses direct aiohttp, not rate_limiter

No streaming in OllamaAPI — stream: False hardcoded

MastermindAgent.autonomous_loop_task never created — declared but not wired

blueprint_agent crashes on None LLM response — no null check before json.loads

MemoryAgent missing get_memories_by_agent — RAGE route fallback fails
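The blueprint_agent crash comes from passing None to json.loads. One possible guard is sketched below; `parse_llm_json` is a hypothetical helper, not a function from the codebase, and returning None on bad input is one design choice among several.

```python
import json
from typing import Any, Optional


def parse_llm_json(response: Optional[str]) -> Optional[Any]:
    """Parse an LLM response as JSON, returning None for missing or invalid input."""
    if response is None or not response.strip():
        # Handlers return None on failure, so check before parsing.
        return None
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        return None


print(parse_llm_json(None))
print(parse_llm_json('{"goal": "refactor"}'))
```

Callers can then treat None as "no usable blueprint" and skip or retry the cycle.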

What's Working Well

Dual-URL failover is production-proven

Token counting from API response (not estimation)

Model discovery with 24h refresh

Conversation history persistence to JSON

Admin routes for diagnostics

HierarchicalModelScorer feedback loop

Cloud Integration (Implemented)

OllamaCloudTool — cloud inference as a first-class BaseTool

Wired into _resolve_inference_model() as Step 5 (guarantee)

Rate limited at 10 RPM via embedded CloudRateLimiter

18dp precision metrics at data/metrics/cloud_precision_metrics.json

VPS Deployment (HostingerVPSAgent)

agents/hostinger_vps_agent.py manages the production VPS through three MCP channels:

| Channel | Transport | Auth | Capabilities |
| --- | --- | --- | --- |
| SSH | root@168.231.126.58 | ~/.ssh/id_rsa | deploy, health, restart, logs, models, disk |
| Hostinger API | HTTPS | HOSTINGER_API_KEY | restart (no SSH), metrics, backups, VPS info |
| mindX Backend | HTTPS | None (public) | health, diagnostics, inference, dojo, activity |

full_health_check() queries all three in parallel. register_mcp_context() publishes tool definitions for agent discovery. See .agent definition.
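Querying three channels in parallel is a natural fit for `asyncio.gather`. The sketch below shows that pattern with stand-in coroutines; it is not the body of `full_health_check()`, and the function and channel names used here are hypothetical.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Check = Callable[[], Awaitable[Dict[str, Any]]]


async def check_ssh() -> Dict[str, Any]:
    await asyncio.sleep(0)  # stand-in for an SSH round trip
    return {"ok": True}


async def check_hostinger_api() -> Dict[str, Any]:
    await asyncio.sleep(0)  # stand-in for an HTTPS call that fails
    raise TimeoutError("API timed out")


async def check_backend() -> Dict[str, Any]:
    await asyncio.sleep(0)  # stand-in for a public health endpoint
    return {"ok": True}


async def health_check_sketch(checks: Dict[str, Check]) -> Dict[str, Dict[str, Any]]:
    """Run all checks concurrently; a failing channel does not abort the others."""
    names = list(checks)
    results = await asyncio.gather(*(checks[n]() for n in names), return_exceptions=True)
    report: Dict[str, Dict[str, Any]] = {}
    for name, result in zip(names, results):
        if isinstance(result, Exception):
            report[name] = {"ok": False, "error": str(result)}
        else:
            report[name] = result
    return report


report = asyncio.run(health_check_sketch({
    "ssh": check_ssh,
    "hostinger_api": check_hostinger_api,
    "backend": check_backend,
}))
print(report)
```

Using `return_exceptions=True` means one unreachable channel is reported as unhealthy instead of raising out of the whole check.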
