
Ollama Cloud

Run larger models without powerful hardware. Cloud models offload inference to Ollama's cloud service while remaining compatible with the same local tooling.

Account Setup

# Create account and sign in
ollama signin

Generate API key at: https://ollama.com/settings/keys

export OLLAMA_API_KEY=your_api_key

Tiers (ollama.com/pricing)

| Tier | Price | Cloud Usage | Concurrent Models |
|------|-------|-------------|-------------------|
| Free | $0 | Light (session + weekly limits) | 1 |
| Pro | $20/mo | 50x free | 3 |
| Max | $100/mo | 5x Pro | 10 |

Free tier limits reset: session limits every 5 hours, weekly limits every 7 days.
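The reset cadence can be tracked locally. A minimal sketch, assuming resets are measured from the start of each usage window (the exact reset semantics are an assumption; `next_resets` is an illustrative helper, not part of Ollama):

```python
from datetime import datetime, timedelta

def next_resets(window_start: datetime) -> tuple[datetime, datetime]:
    # Session limits reset every 5 hours, weekly limits every 7 days.
    return window_start + timedelta(hours=5), window_start + timedelta(days=7)
```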

Critical constraint — Free tier = 1 concurrent model. This means:

  • The boardroom MUST use a single cloud model for all soldiers (no model switching between queries)
  • If soldier A uses deepseek-v3.2-cloud and soldier B requests qwen3-coder-next-cloud, the second request must wait for the first model to unload
  • Recommended: Use one strong general model (e.g. gpt-oss:120b-cloud at 65 tok/s) for all boardroom cloud queries
  • Local models are unlimited — mix freely on the VPS
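To stay inside the free tier's single slot, all boardroom cloud traffic can be pinned to one model name. A minimal sketch (`CLOUD_MODEL` and `cloud_chat` are illustrative helpers, not part of the Ollama API):

```python
# Pin one cloud model for every boardroom query so the free tier never has to
# unload and reload models between soldiers (1 concurrent cloud model on Free).
CLOUD_MODEL = 'gpt-oss:120b-cloud'  # one strong general model for all soldiers

def cloud_chat(client, messages):
    # Ignore any per-soldier model preference; always route to the pinned model.
    return client.chat(CLOUD_MODEL, messages=messages)
```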
API Endpoints

| Endpoint | URL | Auth |
|----------|-----|------|
| Local (offloaded) | http://localhost:11434/api/chat | None (auto-offload) |
| Cloud direct | https://ollama.com/api/chat | Bearer token |
| OpenAI-compatible | https://ollama.com/v1/chat/completions | Bearer token |
| List cloud models | https://ollama.com/api/tags | None (public catalog) |

    Usage Methods

    Method 1: Local Ollama with Cloud Offload (Recommended)

    Cloud models are pulled locally but inference runs in the cloud. Seamless — same API as local models.

    Critical: Append -cloud to the model name when pulling. Without it, ollama pull downloads full weights (gigabytes) for local execution. With -cloud, only metadata is pulled and inference is proxied to ollama.com GPU servers.

    # Pull cloud model (metadata only — inference proxied to cloud GPU)
    ollama pull gpt-oss:120b-cloud
    ollama pull deepseek-v3.2-cloud

    NOT this — downloads 3.3GB weights for local execution:

    ollama pull gemma3:4b

    Run (automatically offloads to cloud)

    ollama run gpt-oss:120b-cloud

    The cloud catalog lists names without -cloud. Append it yourself:

    # Catalog returns: gpt-oss:120b, deepseek-v3.2, qwen3-coder-next, etc.
    

    Pull as: gpt-oss:120b-cloud, deepseek-v3.2-cloud, qwen3-coder-next-cloud
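The append step can be automated. A tiny helper (hypothetical, not part of the Ollama CLI) that maps a catalog name to its pull name:

```python
def cloud_pull_name(catalog_name: str) -> str:
    # Catalog entries omit the suffix; pulling without it fetches full weights.
    if catalog_name.endswith('-cloud'):
        return catalog_name
    return catalog_name + '-cloud'
```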

    Requires ollama signin first (stores ed25519 key at ~/.ollama/id_ed25519).

    Benchmark (test_cloud_all_models.py, 2026-04-11): gpt-oss:120b-cloud at 65.52 tok/s vs local CPU deepseek-r1:1.5b at 8.00 tok/s — 8.2x speedup on cloud GPU with a 120B model.
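Throughput figures like the ones above can be derived from the chat response metadata, assuming the standard Ollama `/api/chat` fields `eval_count` (tokens generated) and `eval_duration` (nanoseconds):

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    # eval_duration is reported in nanoseconds in /api/chat responses.
    return eval_count / (eval_duration_ns / 1e9)
```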

from ollama import Client

client = Client()
for part in client.chat('gpt-oss:120b-cloud', messages=[
    {'role': 'user', 'content': 'Why is the sky blue?'}
], stream=True):
    print(part.message.content, end='', flush=True)

    Method 2: Direct Cloud API

    Use https://ollama.com as a remote Ollama host:

    import os
    from ollama import Client

client = Client(
    host='https://ollama.com',
    # Raises KeyError if OLLAMA_API_KEY is unset, rather than silently
    # sending "Bearer None".
    headers={'Authorization': 'Bearer ' + os.environ['OLLAMA_API_KEY']}
)

for part in client.chat('gpt-oss:120b', messages=[
    {'role': 'user', 'content': 'Why is the sky blue?'}
], stream=True):
    print(part.message.content, end='', flush=True)

    import { Ollama } from 'ollama'

const ollama = new Ollama({
  host: 'https://ollama.com',
  headers: { Authorization: 'Bearer ' + process.env.OLLAMA_API_KEY },
})

const response = await ollama.chat({
  model: 'gpt-oss:120b',
  messages: [{ role: 'user', content: 'Explain quantum computing' }],
  stream: true,
})
for await (const part of response) {
  process.stdout.write(part.message.content)
}

    curl https://ollama.com/api/chat \
      -H "Authorization: Bearer $OLLAMA_API_KEY" \
      -d '{
        "model": "gpt-oss:120b",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "stream": false
      }'
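The OpenAI-compatible endpoint takes the same payload shape as OpenAI's chat completions API. A sketch that builds (but does not send) such a request using only the standard library; `build_chat_request` is an illustrative helper:

```python
import json
import os
import urllib.request

OPENAI_COMPAT_URL = 'https://ollama.com/v1/chat/completions'

def build_chat_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    # Same JSON shape as OpenAI chat completions: model + messages list.
    payload = {'model': model, 'messages': [{'role': 'user', 'content': prompt}]}
    return urllib.request.Request(
        OPENAI_COMPAT_URL,
        data=json.dumps(payload).encode('utf-8'),
        headers={'Authorization': 'Bearer ' + api_key,
                 'Content-Type': 'application/json'},
        method='POST',
    )

req = build_chat_request('gpt-oss:120b', 'Why is the sky blue?',
                         os.environ.get('OLLAMA_API_KEY', ''))
# urllib.request.urlopen(req) would send it (requires a valid key)
```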
    

List Available Cloud Models

The catalog is public, so the Authorization header here is optional:

    curl https://ollama.com/api/tags \
      -H "Authorization: Bearer $OLLAMA_API_KEY"
    

    Cloud Models (as of 2026-04-11)

| Model | Params (Active) | Tags | Best For |
|-------|-----------------|------|----------|
| glm-5.1 | — | tools, thinking | Agentic engineering, coding |
| gemma4 | 26b, 31b | vision, tools, thinking, audio | Frontier multimodal |
| minimax-m2.7 | — | tools, thinking | Coding, productivity |
| qwen3.5 | 0.8b-122b | vision, tools, thinking | General multimodal |
| qwen3-coder-next | — | tools | Agentic coding |
| ministral-3 | 3b, 8b, 14b | vision, tools | Edge deployment |
| devstral-small-2 | 24b | vision, tools | Code exploration |
| nemotron-3-super | 120b (12b active) | tools, thinking | Efficient MoE |
| qwen3-next | 80b | tools, thinking | Efficient reasoning |
| glm-5 | 744b (40b active) | tools, thinking | Complex engineering |
| kimi-k2.5 | — | vision, tools, thinking | Multimodal agentic |
| rnj-1 | 8b | tools | Code + STEM |
| nemotron-3-nano | 4b, 30b | tools, thinking | Efficient agentic |
| deepseek-v3.2 | — | tools, thinking | Efficient reasoning |
| cogito-2.1 | 671b | — | General (MIT) |
| gemini-3-flash-preview | — | vision, tools, thinking | Speed + intelligence |

    Full list: ollama.com/search?c=cloud

    Local-Only Mode

    Disable cloud features entirely:

    // ~/.ollama/server.json
    {"disable_ollama_cloud": true}
    

    Or: OLLAMA_NO_CLOUD=1
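A client that wants to honor both opt-outs could check them like this. A sketch: the config path and key come from the snippet above, but the precedence order (env var first) is an assumption, and `cloud_disabled` is an illustrative helper:

```python
import json
import os
from pathlib import Path

def cloud_disabled() -> bool:
    # Env var takes precedence (assumption); either setting disables cloud.
    if os.environ.get('OLLAMA_NO_CLOUD') == '1':
        return True
    cfg = Path.home() / '.ollama' / 'server.json'
    if cfg.exists():
        return bool(json.loads(cfg.read_text()).get('disable_ollama_cloud', False))
    return False
```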

    mindX Cloud Strategy

    See cloud/rate_limiting.md for maximizing the free tier.

    Tier Assignment for mindX Agents

| Agent | Model Source | Why |
|-------|--------------|-----|
| mindXagent (autonomous) | Local qwen3:1.7b | Always available, no quota |
| BlueprintAgent | Cloud qwen3-coder-next | Coding focus, needs power |
| AuthorAgent | Cloud qwen3.5:27b | Prose quality |
| Heartbeat/health | Local qwen3:0.6b | Ultra-fast, no quota burn |
| Self-improvement eval | Local deepseek-r1:1.5b | Chain-of-thought local |
| Heavy reasoning | Cloud deepseek-v3.2 | Best reasoning, cloud only |
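The assignment above can be expressed as a routing table. A sketch: the dict, helper, and exact `-cloud` tag spellings are illustrative (the doc's convention is to append `-cloud` when pulling), not a mindX API:

```python
# (source, model) per mindX agent, mirroring the tier-assignment table.
AGENT_MODELS = {
    'mindXagent':          ('local', 'qwen3:1.7b'),
    'BlueprintAgent':      ('cloud', 'qwen3-coder-next-cloud'),
    'AuthorAgent':         ('cloud', 'qwen3.5:27b-cloud'),
    'Heartbeat':           ('local', 'qwen3:0.6b'),
    'SelfImprovementEval': ('local', 'deepseek-r1:1.5b'),
    'HeavyReasoning':      ('cloud', 'deepseek-v3.2-cloud'),
}

def model_for(agent: str) -> tuple[str, str]:
    # Fall back to the always-available local model for unknown agents.
    return AGENT_MODELS.get(agent, ('local', 'qwen3:1.7b'))
```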