Ollama now has a free cloud tier. vLLM is not viable on this VPS. The optimal strategy for mindx.pythai.net (4GB RAM, no GPU) is Ollama local for lightweight tasks + Ollama cloud free tier for heavy reasoning.
Ollama is no longer strictly local-only. A cloud inference service launched with free and paid tiers.
https://ollama.com/api/chathttps://ollama.com/v1/chat/completionsOLLAMA_API_KEY via bearer tokenhttps://ollama.com/search?c=cloud:
Models that fit in 4GB RAM with CPU-only inference:
| Model | Params | Disk (Q4) | RAM | Strength |
|---|---|---|---|---|
qwen2.5-coder:0.5b | 0.5B | ~400MB | ~1GB | Coding, completion |
qwen2.5-coder:1.5b | 1.5B | ~1.0GB | ~2GB | Best small coder |
deepseek-r1:1.5b | 1.5B | ~1.0GB | ~2GB | Reasoning, chain-of-thought |
qwen3.5:0.8b | 0.8B | ~600MB | ~1GB | General + reasoning (newest) |
qwen3:0.6b | 0.6B | ~500MB | ~1GB | General (already installed) |
qwen3:1.7b | 1.7B | ~1.4GB | ~2GB | General (already installed, current autonomous model) |
smollm2:1.7b | 1.7B | ~1.0GB | ~2GB | General purpose |
smollm2:360m | 360M | ~250MB | ~500MB | Ultra-light, basic tasks |
lfm2.5-thinking:1.2b | 1.2B | ~800MB | ~1.5GB | Reasoning (hybrid arch) |
qwen2.5-coder:1.5bdeepseek-r1:1.5b (already installed)qwen3:1.7b (already installed, current autonomous model)qwen3:0.6b (already installed)qwen3.5:2b (2.7GB) — newest, may be tight on RAMqwen3:1.7b (1.4GB) — current autonomous modelmxbai-embed-large:latest (0.7GB) — embeddingsnomic-embed-text:latest (0.3GB) — embeddingsqwen3:0.6b (0.5GB) — lightweightvLLM is designed for high-throughput GPU serving with PagedAttention. On CPU:
| Aspect | Ollama (llama.cpp) | vLLM |
|---|---|---|
| CPU performance | ~80 tok/s | ~55 tok/s |
| RAM efficiency | Excellent (GGUF Q4) | Poor (FP16 default) |
| 4GB RAM viable | Yes (0.5-1.5B models) | No |
| Setup complexity | Simple | Complex |
| GPU performance | Good | Excellent (3-20x faster) |
vLLM requires significantly more RAM than llama.cpp — it uses FP16/BF16 weights by default with no native GGUF quantization on CPU. For a 4GB VPS, vLLM cannot even load its runtime plus a model.
qwen3:1.7b via Ollama localhost:11434 for autonomous improvement cyclesqwen3:0.6b for lightweight tasks (heartbeat, quick classification)mxbai-embed-large, nomic-embed-text) for RAGE/pgvectorhttps://ollama.com as a secondary inference sourceInferenceDiscovery as an additional provider10.0.0.155:18080 for larger models when the GPU server is onlineollama_url.pyqwen3:1.7b (general reasoning)qwen3-coder-next (coding focus)qwen3:1.7b or cloud → qwen3.5 (prose quality)qwen3:0.6b (ultra-fast, minimal resource)deepseek-r1:1.5b (chain-of-thought reasoning)MINDX_LLM__OLLAMA__BASE_URL)OllamaHandler and OllamaAPI both respect the env var overrideqwen3:1.7b, executing real improvementspython manage_credentials.py store ollama_cloud_api_key "KEY"InferenceDiscoveryResourceGovernor to route heavy tasks to cloud, light tasks to localResourceGovernor should be in balanced or minimal modeHealthAuditorTool — if >80%, downshift to qwen3:0.6bResearch conducted 2026-04-10. Sources: ollama.com/pricing, ollama.com/blog/cloud-models, developers.redhat.com (benchmarks), github.com/mfoud444/ollamafreeapi.