
Ollama Cloud & vLLM Research — 2026-04-10

Summary

Ollama now has a free cloud tier. vLLM is not viable on this VPS. The recommended strategy for mindx.pythai.net (4GB RAM, no GPU) is to run Ollama locally for lightweight tasks and to use the Ollama cloud free tier for heavy reasoning.


1. Ollama Cloud API

Ollama is no longer local-only: it now offers a cloud inference service with free and paid tiers.

Free Tier

Paid Tiers

API Endpoints

Cloud-Enabled Models

The current list is available at https://ollama.com/search?c=cloud.

Third-Party Free Option

OllamaFreeAPI — community-run public gateway to managed Ollama servers with 50+ models, no API key required.


2. Local Models for Constrained Hardware (4GB RAM, No GPU)

Models that fit in 4GB RAM with CPU-only inference:

| Model | Params | Disk (Q4) | RAM | Strength |
|---|---|---|---|---|
| qwen2.5-coder:0.5b | 0.5B | ~400MB | ~1GB | Coding, completion |
| qwen2.5-coder:1.5b | 1.5B | ~1.0GB | ~2GB | Best small coder |
| deepseek-r1:1.5b | 1.5B | ~1.0GB | ~2GB | Reasoning, chain-of-thought |
| qwen3.5:0.8b | 0.8B | ~600MB | ~1GB | General + reasoning (newest) |
| qwen3:0.6b | 0.6B | ~500MB | ~1GB | General (already installed) |
| qwen3:1.7b | 1.7B | ~1.4GB | ~2GB | General (already installed, current autonomous model) |
| smollm2:1.7b | 1.7B | ~1.0GB | ~2GB | General purpose |
| smollm2:360m | 360M | ~250MB | ~500MB | Ultra-light, basic tasks |
| lfm2.5-thinking:1.2b | 1.2B | ~800MB | ~1.5GB | Reasoning (hybrid arch) |
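As a rough aid for reading the table, the sketch below filters the listed models by the RAM left over once the OS and other services are accounted for. The RAM figures are copied from the table; the function name `models_within_budget` and the 2.5GB reserve used in the example are illustrative assumptions, not measurements from the VPS.

```python
# Approximate RAM needs in GB, copied from the table above.
MODEL_RAM_GB = {
    "qwen2.5-coder:0.5b": 1.0,
    "qwen2.5-coder:1.5b": 2.0,
    "deepseek-r1:1.5b": 2.0,
    "qwen3.5:0.8b": 1.0,
    "qwen3:0.6b": 1.0,
    "qwen3:1.7b": 2.0,
    "smollm2:1.7b": 2.0,
    "smollm2:360m": 0.5,
    "lfm2.5-thinking:1.2b": 1.5,
}


def models_within_budget(total_ram_gb, reserved_gb, models=MODEL_RAM_GB):
    """Return model names that fit in (total - reserved) GB, smallest first.

    `reserved_gb` is a hypothetical allowance for the OS and other
    services; tune it to what the host actually uses.
    """
    budget = total_ram_gb - reserved_gb
    fitting = [(ram, name) for name, ram in models.items() if ram <= budget]
    # Sort by RAM need, then by name so the order is deterministic.
    return [name for ram, name in sorted(fitting)]


if __name__ == "__main__":
    # Example: 4GB host with an assumed 2.5GB already in use.
    for name in models_within_budget(4.0, 2.5):
        print(name)
```

With a smaller reserve the 2GB-class models come back into range, which matches the table's note that the 1.5B to 1.7B models are usable but leave little headroom.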

Best Picks for mindX

Currently Installed on VPS

  1. qwen3.5:2b (2.7GB) — newest, may be tight on RAM
  2. qwen3:1.7b (1.4GB) — current autonomous model
  3. mxbai-embed-large:latest (0.7GB) — embeddings
  4. nomic-embed-text:latest (0.3GB) — embeddings
  5. qwen3:0.6b (0.5GB) — lightweight

3. vLLM on CPU

Verdict: Not Viable for This VPS

vLLM is designed for high-throughput GPU serving with PagedAttention. On CPU:

| Aspect | Ollama (llama.cpp) | vLLM |
|---|---|---|
| CPU performance | ~80 tok/s | ~55 tok/s |
| RAM efficiency | Excellent (GGUF Q4) | Poor (FP16 default) |
| 4GB RAM viable | Yes (0.5-1.5B models) | No |
| Setup complexity | Simple | Complex |
| GPU performance | Good | Excellent (3-20x faster) |

vLLM requires significantly more RAM than llama.cpp — it uses FP16/BF16 weights by default with no native GGUF quantization on CPU. For a 4GB VPS, vLLM cannot even load its runtime plus a model.
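The RAM gap can be illustrated with back-of-the-envelope arithmetic: weight memory is roughly parameter count times bits per weight divided by 8. The sketch below assumes about 4.5 effective bits per weight for a Q4 GGUF file (an approximation, since the exact figure depends on the quantization variant) and ignores KV cache, activations, and runtime overhead.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate memory for model weights alone, in GB (10^9 bytes).

    Ignores KV cache, activations, and runtime overhead, so real
    usage is higher than this figure.
    """
    return params_billions * bits_per_weight / 8


if __name__ == "__main__":
    params = 1.5  # a 1.5B-parameter model, as in the table of local models
    fp16 = weight_memory_gb(params, 16)   # FP16/BF16: 16 bits per weight
    q4 = weight_memory_gb(params, 4.5)    # assumed ~4.5 bits for Q4 GGUF
    print(f"FP16 weights: {fp16:.2f} GB")
    print(f"Q4 weights:   {q4:.2f} GB")
```

Under these assumptions a 1.5B model needs about 3GB for FP16 weights alone, which leaves very little of a 4GB host for the OS and the serving runtime, while the Q4 version stays under 1GB.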

When vLLM Makes Sense

Free vLLM Cloud


4. Recommended Strategy for mindX

Tier 1: Local (always available)

Tier 2: Ollama Cloud (free, rate-limited)

Tier 3: GPU Server (when available)

Tier 4: vLLM (future)

Multi-Model Agent Assignment

Different agents should use different models suited to their tasks.

5. Implementation Notes

Current State (2026-04-10)

To Add Ollama Cloud

  1. Sign up at ollama.com, get API key
  2. Store in BANKON vault: python manage_credentials.py store ollama_cloud_api_key "KEY"
  3. Add as inference source in InferenceDiscovery
  4. Configure ResourceGovernor to route heavy tasks to cloud, light tasks to local
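The routing described in step 4 could be sketched as below. This is a minimal illustration that stands apart from the actual InferenceDiscovery and ResourceGovernor interfaces, which are not shown in this document. The names `Route` and `route_task` are hypothetical, the cloud URL is an assumption to verify against Ollama's documentation, and `CLOUD_MODEL_NAME` is a placeholder to fill in from the cloud model list.

```python
from dataclasses import dataclass, field

LOCAL_URL = "http://localhost:11434/api/chat"  # Ollama's default local port
CLOUD_URL = "https://ollama.com/api/chat"      # assumed cloud endpoint; verify before use


@dataclass
class Route:
    """Where to send a request: endpoint, model name, and HTTP headers."""
    url: str
    model: str
    headers: dict = field(default_factory=dict)


def route_task(heavy, api_key=None,
               local_model="qwen3:1.7b",
               cloud_model="CLOUD_MODEL_NAME"):
    """Send heavy tasks to the cloud when a key is available, else stay local.

    Falling back to local when no key is configured keeps the agent
    working even if the cloud tier is unavailable.
    """
    if heavy and api_key:
        return Route(CLOUD_URL, cloud_model,
                     {"Authorization": f"Bearer {api_key}"})
    return Route(LOCAL_URL, local_model)


if __name__ == "__main__":
    print(route_task(heavy=False))
    print(route_task(heavy=True, api_key="KEY"))
```

The sketch only decides where a request should go; it does not send anything. Rate-limit handling for the free tier would sit on top of this, for example by retrying locally when the cloud call is rejected.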

Resource Management


Research conducted 2026-04-10. Sources: ollama.com/pricing, ollama.com/blog/cloud-models, developers.redhat.com (benchmarks), github.com/mfoud444/ollamafreeapi.

