Ollama Cloud & vLLM Research — 2026-04-10

Summary

Ollama now has a free cloud tier. vLLM is not viable on this VPS. The optimal strategy for mindx.pythai.net (4GB RAM, no GPU) is Ollama local for lightweight tasks + Ollama cloud free tier for heavy reasoning.

1. Ollama Cloud API

Ollama is no longer strictly local-only. A cloud inference service launched with free and paid tiers.

Free Tier

Light usage with session limits (reset every 5 hours) and weekly limits (reset every 7 days)
1 cloud model at a time
Cloud models run on NVIDIA GPU hardware with native weights (not quantized)

Paid Tiers

Pro ($20/mo): 50x more cloud usage, 3 concurrent cloud models
Max ($100/mo): 5x more than Pro, 10 concurrent models

API Endpoints

Native: https://ollama.com/api/chat
OpenAI-compatible: https://ollama.com/v1/chat/completions
Authentication: OLLAMA_API_KEY via bearer token

Cloud-Enabled Models

Available at https://ollama.com/search?c=cloud:

qwen3.5, qwen3-coder-next, qwen3-vl
deepseek-v3.2, gemma4, glm-5
nemotron-3-super, devstral-small-2
ministral-3, kimi-k2.5
Many more — full list at the search URL

Third-Party Free Option

OllamaFreeAPI — community-run public gateway to managed Ollama servers with 50+ models, no API key required.

2. Local Models for Constrained Hardware (4GB RAM, No GPU)

Models that fit in 4GB RAM with CPU-only inference:

Model	Params	Disk (Q4)	RAM	Strength
`qwen2.5-coder:0.5b`	0.5B	~400MB	~1GB	Coding, completion
`qwen2.5-coder:1.5b`	1.5B	~1.0GB	~2GB	Best small coder
`deepseek-r1:1.5b`	1.5B	~1.0GB	~2GB	Reasoning, chain-of-thought
`qwen3.5:0.8b`	0.8B	~600MB	~1GB	General + reasoning (newest)
`qwen3:0.6b`	0.6B	~500MB	~1GB	General (already installed)
`qwen3:1.7b`	1.7B	~1.4GB	~2GB	General (already installed, current autonomous model)
`smollm2:1.7b`	1.7B	~1.0GB	~2GB	General purpose
`smollm2:360m`	360M	~250MB	~500MB	Ultra-light, basic tasks
`lfm2.5-thinking:1.2b`	1.2B	~800MB	~1.5GB	Reasoning (hybrid arch)

Best Picks for mindX

Coding tasks: qwen2.5-coder:1.5b
Reasoning/improvement: deepseek-r1:1.5b (already installed)
General/current: qwen3:1.7b (already installed, current autonomous model)
Ultra-light embedding: qwen3:0.6b (already installed)
Heavy tasks: Ollama cloud free tier → large models remotely

Currently Installed on VPS

qwen3.5:2b (2.7GB) — newest, may be tight on RAM
qwen3:1.7b (1.4GB) — current autonomous model
mxbai-embed-large:latest (0.7GB) — embeddings
nomic-embed-text:latest (0.3GB) — embeddings
qwen3:0.6b (0.5GB) — lightweight

3. vLLM on CPU

Verdict: Not Viable for This VPS

vLLM is designed for high-throughput GPU serving with PagedAttention. On CPU:

Aspect	Ollama (llama.cpp)	vLLM
CPU performance	~80 tok/s	~55 tok/s
RAM efficiency	Excellent (GGUF Q4)	Poor (FP16 default)
4GB RAM viable	Yes (0.5-1.5B models)	No
Setup complexity	Simple	Complex
GPU performance	Good	Excellent (3-20x faster)

vLLM requires significantly more RAM than llama.cpp — it uses FP16/BF16 weights by default with no native GGUF quantization on CPU. For a 4GB VPS, vLLM cannot even load its runtime plus a model.

When vLLM Makes Sense

GPU servers with 24GB+ VRAM
Multi-user concurrent serving
Production throughput optimization
When the 10.0.0.155 GPU server comes back online

Free vLLM Cloud

AMD Developer Cloud: Free GPU credits to run vLLM with open-source models
No general free hosted vLLM API exists

4. Recommended Strategy for mindX

Tier 1: Local (always available)

qwen3:1.7b via Ollama localhost:11434 for autonomous improvement cycles
qwen3:0.6b for lightweight tasks (heartbeat, quick classification)
Embedding models (mxbai-embed-large, nomic-embed-text) for RAGE/pgvector

Tier 2: Ollama Cloud (free, rate-limited)

Register for Ollama cloud free tier
Use https://ollama.com as a secondary inference source
Route heavy reasoning tasks to cloud models (deepseek-v3.2, qwen3-coder-next)
Configure in InferenceDiscovery as an additional provider

Tier 3: GPU Server (when available)

10.0.0.155:18080 for larger models when the GPU server is online
Automatic failover already implemented in ollama_url.py

Tier 4: vLLM (future)

Deploy on the GPU server when it returns
Not for the VPS — Ollama is strictly better on CPU

Multi-Model Agent Assignment

Different agents should use different models suited to their tasks:

mindXagent (autonomous loop): qwen3:1.7b (general reasoning)
BlueprintAgent (evolution planning): Ollama cloud → qwen3-coder-next (coding focus)
AuthorAgent (chapter writing): qwen3:1.7b or cloud → qwen3.5 (prose quality)
Heartbeat/health: qwen3:0.6b (ultra-fast, minimal resource)
Self-improvement evaluation: deepseek-r1:1.5b (chain-of-thought reasoning)

5. Implementation Notes

Current State (2026-04-10)

Ollama localhost:11434 is the primary inference source (env var MINDX_LLM__OLLAMA__BASE_URL)
OllamaHandler and OllamaAPI both respect the env var override
5 models available locally
Autonomous loop running with qwen3:1.7b, executing real improvements

To Add Ollama Cloud

Sign up at ollama.com, get API key
Store in BANKON vault: python manage_credentials.py store ollama_cloud_api_key "KEY"
Add as inference source in InferenceDiscovery
Configure ResourceGovernor to route heavy tasks to cloud, light tasks to local

Resource Management

Run only 1 local model at a time (RAM constraint)
ResourceGovernor should be in balanced or minimal mode
Monitor memory via HealthAuditorTool — if >80%, downshift to qwen3:0.6b

Research conducted 2026-04-10. Sources: ollama.com/pricing, ollama.com/blog/cloud-models, developers.redhat.com (benchmarks), github.com/mfoud444/ollamafreeapi.

All Documents Document Index The Book of mindX Improvement Journal API Reference