mindXdashboard/docs/book/journal/api/dojo/inference/governance/origin

philosophymanifesto thesis origin whitepaper ataraxia roadmap press|archoverview orchestration codebase hierarchy core|agentsmindXagent ceo mastermind bdi evolution author all

govdaio civilization identity security|memorypgvector embed aglm memory|inferencevllm ollama mistral gemini|timeoracle

toolsindex tools a2a mcp shell|publishauthoragent book journal|deployproduction security monitoring|apireference swagger|learnusage guide hackathon

OLLAMA_VLLM_CLOUD_RESEARCH.md · 6.1 KB

Ollama Cloud & vLLM Research — 2026-04-10

Summary

Ollama now has a free cloud tier. vLLM is not viable on this VPS. The optimal strategy for mindx.pythai.net (4GB RAM, no GPU) is Ollama local for lightweight tasks + Ollama cloud free tier for heavy reasoning.

1. Ollama Cloud API

Ollama is no longer strictly local-only. A cloud inference service launched with free and paid tiers.

Free Tier

Light usage with session limits (reset every 5 hours) and weekly limits (reset every 7 days)

1 cloud model at a time

Cloud models run on NVIDIA GPU hardware with native weights (not quantized)

Paid Tiers

Pro ($20/mo): 50x more cloud usage, 3 concurrent cloud models

Max ($100/mo): 5x more than Pro, 10 concurrent models

API Endpoints

Native: https://ollama.com/api/chat

OpenAI-compatible: https://ollama.com/v1/chat/completions

Authentication: OLLAMA_API_KEY via bearer token

Cloud-Enabled Models

Available at https://ollama.com/search?c=cloud:

qwen3.5, qwen3-coder-next, qwen3-vl

deepseek-v3.2, gemma4, glm-5

nemotron-3-super, devstral-small-2

ministral-3, kimi-k2.5

Many more — full list at the search URL

Third-Party Free Option

OllamaFreeAPI — community-run public gateway to managed Ollama servers with 50+ models, no API key required.

2. Local Models for Constrained Hardware (4GB RAM, No GPU)

Models that fit in 4GB RAM with CPU-only inference:

ModelParamsDisk (Q4)RAMStrength qwen2.5-coder:0.5b0.5B~400MB~1GBCoding, completion qwen2.5-coder:1.5b1.5B~1.0GB~2GBBest small coder deepseek-r1:1.5b1.5B~1.0GB~2GBReasoning, chain-of-thought qwen3.5:0.8b0.8B~600MB~1GBGeneral + reasoning (newest) qwen3:0.6b0.6B~500MB~1GBGeneral (already installed) qwen3:1.7b1.7B~1.4GB~2GBGeneral (already installed, current autonomous model) smollm2:1.7b1.7B~1.0GB~2GBGeneral purpose smollm2:360m360M~250MB~500MBUltra-light, basic tasks lfm2.5-thinking:1.2b1.2B~800MB~1.5GBReasoning (hybrid arch)

Best Picks for mindX

Coding tasks: qwen2.5-coder:1.5b

Reasoning/improvement: deepseek-r1:1.5b (already installed)

General/current: qwen3:1.7b (already installed, current autonomous model)

Ultra-light embedding: qwen3:0.6b (already installed)

Heavy tasks: Ollama cloud free tier → large models remotely

Currently Installed on VPS

qwen3.5:2b (2.7GB) — newest, may be tight on RAM

qwen3:1.7b (1.4GB) — current autonomous model

mxbai-embed-large:latest (0.7GB) — embeddings

nomic-embed-text:latest (0.3GB) — embeddings

qwen3:0.6b (0.5GB) — lightweight

3. vLLM on CPU

Verdict: Not Viable for This VPS

vLLM is designed for high-throughput GPU serving with PagedAttention. On CPU:

AspectOllama (llama.cpp)vLLM CPU performance~80 tok/s~55 tok/s RAM efficiencyExcellent (GGUF Q4)Poor (FP16 default) 4GB RAM viableYes (0.5-1.5B models)No Setup complexitySimpleComplex GPU performanceGoodExcellent (3-20x faster)

vLLM requires significantly more RAM than llama.cpp — it uses FP16/BF16 weights by default with no native GGUF quantization on CPU. For a 4GB VPS, vLLM cannot even load its runtime plus a model.

When vLLM Makes Sense

GPU servers with 24GB+ VRAM

Multi-user concurrent serving

Production throughput optimization

When the 10.0.0.155 GPU server comes back online

Free vLLM Cloud

AMD Developer Cloud: Free GPU credits to run vLLM with open-source models

No general free hosted vLLM API exists

4. Recommended Strategy for mindX

Tier 1: Local (always available)

qwen3:1.7b via Ollama localhost:11434 for autonomous improvement cycles

qwen3:0.6b for lightweight tasks (heartbeat, quick classification)

Embedding models (mxbai-embed-large, nomic-embed-text) for RAGE/pgvector

Tier 2: Ollama Cloud (free, rate-limited)

Use https://ollama.com as a secondary inference source

Route heavy reasoning tasks to cloud models (deepseek-v3.2, qwen3-coder-next)

Configure in InferenceDiscovery as an additional provider

Tier 3: GPU Server (when available)

10.0.0.155:18080 for larger models when the GPU server is online

Automatic failover already implemented in ollama_url.py

Tier 4: vLLM (future)

Deploy on the GPU server when it returns

Not for the VPS — Ollama is strictly better on CPU

Multi-Model Agent Assignment

Different agents should use different models suited to their tasks:

mindXagent (autonomous loop): qwen3:1.7b (general reasoning)

BlueprintAgent (evolution planning): Ollama cloud → qwen3-coder-next (coding focus)

AuthorAgent (chapter writing): qwen3:1.7b or cloud → qwen3.5 (prose quality)

Heartbeat/health: qwen3:0.6b (ultra-fast, minimal resource)

Self-improvement evaluation: deepseek-r1:1.5b (chain-of-thought reasoning)

5. Implementation Notes

Current State (2026-04-10)

Ollama localhost:11434 is the primary inference source (env var MINDX_LLM__OLLAMA__BASE_URL)

OllamaHandler and OllamaAPI both respect the env var override

5 models available locally

Autonomous loop running with qwen3:1.7b, executing real improvements

To Add Ollama Cloud

Store in BANKON vault: python manage_credentials.py store ollama_cloud_api_key "KEY"

Add as inference source in InferenceDiscovery

Configure ResourceGovernor to route heavy tasks to cloud, light tasks to local

Resource Management

Run only 1 local model at a time (RAM constraint)

ResourceGovernor should be in balanced or minimal mode

Monitor memory via HealthAuditorTool — if >80%, downshift to qwen3:0.6b

Research conducted 2026-04-10. Sources: ollama.com/pricing, ollama.com/blog/cloud-models, developers.redhat.com (benchmarks), github.com/mfoud444/ollamafreeapi.

All Documents Document Index The Book of mindX Improvement Journal API Reference