
Ollama Cloud & vLLM Research — 2026-04-10

Summary

Ollama now has a free cloud tier. vLLM is not viable on this VPS. The optimal strategy for mindx.pythai.net (4GB RAM, no GPU) is Ollama local for lightweight tasks + Ollama cloud free tier for heavy reasoning.


1. Ollama Cloud API

Ollama is no longer strictly local-only: a cloud inference service has launched with free and paid tiers.

Free Tier

  • Light usage with session limits (reset every 5 hours) and weekly limits (reset every 7 days)
  • 1 cloud model at a time
  • Cloud models run on NVIDIA GPU hardware with native weights (not quantized)

Paid Tiers

  • Pro ($20/mo): 50x more cloud usage, 3 concurrent cloud models
  • Max ($100/mo): 5x more than Pro, 10 concurrent models

API Endpoints

  • Native: https://ollama.com/api/chat
  • OpenAI-compatible: https://ollama.com/v1/chat/completions
  • Authentication: OLLAMA_API_KEY sent as a bearer token
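
As a minimal sketch of the OpenAI-compatible endpoint with bearer-token auth described above (the model name and prompt are illustrative; only the URL and auth scheme come from these notes):

```python
import json
import os
import urllib.request

OLLAMA_CLOUD_URL = "https://ollama.com/v1/chat/completions"

def build_request(prompt: str, model: str = "qwen3-coder-next") -> urllib.request.Request:
    """Build an authenticated chat-completion request (not yet sent)."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_CLOUD_URL,
        data=body,
        headers={
            # Bearer token from the environment; empty string if unset
            "Content-Type": "application/json",
            "Authorization": f"Bearer {os.environ.get('OLLAMA_API_KEY', '')}",
        },
    )

# To actually send it (requires a valid OLLAMA_API_KEY):
#   with urllib.request.urlopen(build_request("hello")) as resp:
#       reply = json.load(resp)["choices"][0]["message"]["content"]
```

Because the endpoint is OpenAI-compatible, any OpenAI client library pointed at https://ollama.com/v1 should also work.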

Cloud-Enabled Models

Available at https://ollama.com/search?c=cloud:

  • qwen3.5, qwen3-coder-next, qwen3-vl
  • deepseek-v3.2, gemma4, glm-5
  • nemotron-3-super, devstral-small-2
  • ministral-3, kimi-k2.5
  • Many more — full list at the search URL

Third-Party Free Option

OllamaFreeAPI — community-run public gateway to managed Ollama servers with 50+ models, no API key required.


2. Local Models for Constrained Hardware (4GB RAM, No GPU)

Models that fit in 4GB RAM with CPU-only inference:

| Model | Params | Disk (Q4) | RAM | Strength |
|---|---|---|---|---|
| qwen2.5-coder:0.5b | 0.5B | ~400MB | ~1GB | Coding, completion |
| qwen2.5-coder:1.5b | 1.5B | ~1.0GB | ~2GB | Best small coder |
| deepseek-r1:1.5b | 1.5B | ~1.0GB | ~2GB | Reasoning, chain-of-thought |
| qwen3.5:0.8b | 0.8B | ~600MB | ~1GB | General + reasoning (newest) |
| qwen3:0.6b | 0.6B | ~500MB | ~1GB | General (already installed) |
| qwen3:1.7b | 1.7B | ~1.4GB | ~2GB | General (already installed, current autonomous model) |
| smollm2:1.7b | 1.7B | ~1.0GB | ~2GB | General purpose |
| smollm2:360m | 360M | ~250MB | ~500MB | Ultra-light, basic tasks |
| lfm2.5-thinking:1.2b | 1.2B | ~800MB | ~1.5GB | Reasoning (hybrid arch) |

Best Picks for mindX

  • Coding tasks: qwen2.5-coder:1.5b
  • Reasoning/improvement: deepseek-r1:1.5b (already installed)
  • General/current: qwen3:1.7b (already installed, current autonomous model)
  • Ultra-light embedding: qwen3:0.6b (already installed)
  • Heavy tasks: Ollama cloud free tier → large models remotely

Currently Installed on VPS

  • qwen3.5:2b (2.7GB) — newest, may be tight on RAM
  • qwen3:1.7b (1.4GB) — current autonomous model
  • mxbai-embed-large:latest (0.7GB) — embeddings
  • nomic-embed-text:latest (0.3GB) — embeddings
  • qwen3:0.6b (0.5GB) — lightweight

3. vLLM on CPU

Verdict: Not Viable for This VPS

vLLM is designed for high-throughput GPU serving with PagedAttention. On CPU:

| Aspect | Ollama (llama.cpp) | vLLM |
|---|---|---|
| CPU performance | ~80 tok/s | ~55 tok/s |
| RAM efficiency | Excellent (GGUF Q4) | Poor (FP16 default) |
| 4GB RAM viable | Yes (0.5-1.5B models) | No |
| Setup complexity | Simple | Complex |
| GPU performance | Good | Excellent (3-20x faster) |

vLLM requires significantly more RAM than llama.cpp — it uses FP16/BF16 weights by default, with no native GGUF quantization on CPU. On a 4GB VPS, vLLM cannot even load its runtime plus a model.

When vLLM Makes Sense

  • GPU servers with 24GB+ VRAM
  • Multi-user concurrent serving
  • Production throughput optimization
  • When the 10.0.0.155 GPU server comes back online

Free vLLM Cloud

  • AMD Developer Cloud: Free GPU credits to run vLLM with open-source models
  • No general free hosted vLLM API exists

4. Recommended Strategy for mindX

Tier 1: Local (always available)

  • qwen3:1.7b via Ollama localhost:11434 for autonomous improvement cycles
  • qwen3:0.6b for lightweight tasks (heartbeat, quick classification)
  • Embedding models (mxbai-embed-large, nomic-embed-text) for RAGE/pgvector

Tier 2: Ollama Cloud (free, rate-limited)

  • Register for Ollama cloud free tier
  • Use https://ollama.com as a secondary inference source
  • Route heavy reasoning tasks to cloud models (deepseek-v3.2, qwen3-coder-next)
  • Configure in InferenceDiscovery as an additional provider

Tier 3: GPU Server (when available)

  • 10.0.0.155:18080 for larger models when the GPU server is online
  • Automatic failover already implemented in ollama_url.py

Tier 4: vLLM (future)

  • Deploy on the GPU server when it returns
  • Not for the VPS — Ollama is strictly better on CPU

Multi-Model Agent Assignment

Different agents should use different models suited to their tasks:
  • mindXagent (autonomous loop): qwen3:1.7b (general reasoning)
  • BlueprintAgent (evolution planning): Ollama cloud → qwen3-coder-next (coding focus)
  • AuthorAgent (chapter writing): qwen3:1.7b or cloud → qwen3.5 (prose quality)
  • Heartbeat/health: qwen3:0.6b (ultra-fast, minimal resource)
  • Self-improvement evaluation: deepseek-r1:1.5b (chain-of-thought reasoning)
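
The assignment above can be sketched as a routing table. The dict keys mirror the agent names in these notes; the "cloud:" prefix for remote models and the lookup helper are assumptions, not an existing mindX convention:

```python
# Agent-to-model routing table (sketch; prefix "cloud:" marks remote models)
AGENT_MODELS = {
    "mindXagent": "qwen3:1.7b",                  # autonomous loop, general reasoning
    "BlueprintAgent": "cloud:qwen3-coder-next",  # evolution planning, coding focus
    "AuthorAgent": "qwen3:1.7b",                 # chapter writing (cloud qwen3.5 optional)
    "heartbeat": "qwen3:0.6b",                   # ultra-fast health checks
    "self_improvement": "deepseek-r1:1.5b",      # chain-of-thought evaluation
}

def model_for(agent: str) -> str:
    """Look up an agent's model, falling back to the current autonomous default."""
    return AGENT_MODELS.get(agent, "qwen3:1.7b")
```

Keeping the table in one place makes it easy to downshift every agent at once under memory pressure.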

5. Implementation Notes

Current State (2026-04-10)

  • Ollama localhost:11434 is the primary inference source (env var MINDX_LLM__OLLAMA__BASE_URL)
  • OllamaHandler and OllamaAPI both respect the env var override
  • 5 models available locally
  • Autonomous loop running with qwen3:1.7b, executing real improvements
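
The env-var override mentioned above can be sketched as a single resolver function (the variable name comes from these notes; the function name and the standard Ollama port default are assumptions):

```python
import os

def ollama_base_url(default: str = "http://localhost:11434") -> str:
    """Resolve the Ollama base URL, honoring the MINDX_LLM__OLLAMA__BASE_URL override."""
    return os.environ.get("MINDX_LLM__OLLAMA__BASE_URL", default)
```

Both OllamaHandler and OllamaAPI would call this one resolver so the override applies consistently.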

To Add Ollama Cloud

  • Sign up at ollama.com, get API key
  • Store in BANKON vault: python manage_credentials.py store ollama_cloud_api_key "KEY"
  • Add as inference source in InferenceDiscovery
  • Configure ResourceGovernor to route heavy tasks to cloud, light tasks to local
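
A hypothetical sketch of the heavy-to-cloud, light-to-local routing rule above. Task labels, model choices, and the fallback behavior are assumptions; the real logic would live in ResourceGovernor / InferenceDiscovery:

```python
LOCAL_URL = "http://localhost:11434"
CLOUD_URL = "https://ollama.com"

# Illustrative set of tasks considered "heavy" enough for cloud models
HEAVY_TASKS = {"evolution_planning", "deep_reasoning", "code_generation"}

def route_task(task: str, cloud_available: bool = True) -> tuple[str, str]:
    """Return (base_url, model); fall back to local when the cloud tier is unreachable."""
    if task in HEAVY_TASKS and cloud_available:
        return CLOUD_URL, "deepseek-v3.2"   # heavy reasoning goes remote
    return LOCAL_URL, "qwen3:1.7b"          # everything else stays local
```

Falling back to local keeps the autonomous loop alive when the free tier's session or weekly limits are exhausted.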

Resource Management

  • Run only 1 local model at a time (RAM constraint)
  • ResourceGovernor should be in balanced or minimal mode
  • Monitor memory via HealthAuditorTool — if >80%, downshift to qwen3:0.6b
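
The downshift rule above reduces to one pure function. The 80% threshold and the two model names come from these notes; the helper itself is hypothetical (in practice HealthAuditorTool would supply the memory reading):

```python
def pick_model(mem_used_pct: float) -> str:
    """Choose the autonomous model for the current RAM usage (0-100 percent)."""
    if mem_used_pct > 80.0:
        return "qwen3:0.6b"   # downshift under memory pressure
    return "qwen3:1.7b"       # normal operating model
```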

Research conducted 2026-04-10. Sources: ollama.com/pricing, ollama.com/blog/cloud-models, developers.redhat.com (benchmarks), github.com/mfoud444/ollamafreeapi.

