
Ollama Cloud

Run larger models without powerful hardware. Cloud models offload inference to Ollama's cloud service while remaining compatible with the same local tooling.

Account Setup

# Create account and sign in
ollama signin

Generate API key at: https://ollama.com/settings/keys

export OLLAMA_API_KEY=your_api_key

Tiers (ollama.com/pricing)

| Tier | Price | Cloud Usage | Concurrent Models |
|------|-------|-------------|-------------------|
| Free | $0 | Light (session + weekly limits) | 1 |
| Pro | $20/mo | 50x free | 3 |
| Max | $100/mo | 5x Pro | 10 |

Free tier limits reset: session limits every 5 hours, weekly limits every 7 days.
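The reset cadence can be tracked locally. A minimal sketch, assuming resets are measured from the start of each usage window (the exact reset semantics are an assumption; `next_resets` is an illustrative helper, not part of Ollama):

```python
from datetime import datetime, timedelta

def next_resets(window_start: datetime) -> tuple[datetime, datetime]:
    # Session limits reset every 5 hours, weekly limits every 7 days.
    return window_start + timedelta(hours=5), window_start + timedelta(days=7)
```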

Critical constraint — Free tier = 1 concurrent model. This means:

  • The boardroom MUST use a single cloud model for all soldiers (no model switching between queries)
  • If soldier A uses deepseek-v3.2-cloud and soldier B requests qwen3-coder-next-cloud, the second request must wait for the first model to unload
  • Recommended: Use one strong general model (e.g. gpt-oss:120b-cloud at 65 tok/s) for all boardroom cloud queries
  • Local models are unlimited — mix freely on the VPS
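To stay inside the free tier's single slot, all boardroom cloud traffic can be pinned to one model name. A minimal sketch (`CLOUD_MODEL` and `cloud_chat` are illustrative helpers, not part of the Ollama API):

```python
# Pin one cloud model for every boardroom query so the free tier never has to
# unload and reload models between soldiers (1 concurrent cloud model on Free).
CLOUD_MODEL = 'gpt-oss:120b-cloud'  # one strong general model for all soldiers

def cloud_chat(client, messages):
    # Ignore any per-soldier model preference; always route to the pinned model.
    return client.chat(CLOUD_MODEL, messages=messages)
```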
API Endpoints

| Endpoint | URL | Auth |
|----------|-----|------|
| Local (offloaded) | http://localhost:11434/api/chat | None (auto-offload) |
| Cloud direct | https://ollama.com/api/chat | Bearer token |
| OpenAI-compatible | https://ollama.com/v1/chat/completions | Bearer token |
| List cloud models | https://ollama.com/api/tags | None (public catalog) |

    Usage Methods

    Method 1: Local Ollama with Cloud Offload (Recommended)

    Cloud models are pulled locally but inference runs in the cloud. Seamless — same API as local models.

    Critical: Append -cloud to the model name when pulling. Without it, ollama pull downloads full weights (gigabytes) for local execution. With -cloud, only metadata is pulled and inference is proxied to ollama.com GPU servers.

    # Pull cloud model (metadata only — inference proxied to cloud GPU)
    ollama pull gpt-oss:120b-cloud
    ollama pull deepseek-v3.2-cloud

    NOT this — downloads 3.3GB weights for local execution:

    ollama pull gemma3:4b

    Run (automatically offloads to cloud)

    ollama run gpt-oss:120b-cloud

    The cloud catalog lists names without -cloud. Append it yourself:

    # Catalog returns: gpt-oss:120b, deepseek-v3.2, qwen3-coder-next, etc.
    

    Pull as: gpt-oss:120b-cloud, deepseek-v3.2-cloud, qwen3-coder-next-cloud
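The append step can be automated. A tiny helper (hypothetical, not part of the Ollama CLI) that maps a catalog name to its pull name:

```python
def cloud_pull_name(catalog_name: str) -> str:
    # Catalog entries omit the suffix; pulling without it fetches full weights.
    if catalog_name.endswith('-cloud'):
        return catalog_name
    return catalog_name + '-cloud'
```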

    Requires ollama signin first (stores ed25519 key at ~/.ollama/id_ed25519).

    Benchmark (test_cloud_all_models.py, 2026-04-11): gpt-oss:120b-cloud at 65.52 tok/s vs local CPU deepseek-r1:1.5b at 8.00 tok/s — 8.2x speedup on cloud GPU with a 120B model.
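Throughput figures like the ones above can be derived from the chat response metadata, assuming the standard Ollama `/api/chat` fields `eval_count` (tokens generated) and `eval_duration` (nanoseconds):

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    # eval_duration is reported in nanoseconds in /api/chat responses.
    return eval_count / (eval_duration_ns / 1e9)
```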

from ollama import Client

client = Client()
for part in client.chat('gpt-oss:120b-cloud', messages=[
    {'role': 'user', 'content': 'Why is the sky blue?'}
], stream=True):
    print(part.message.content, end='', flush=True)

    Method 2: Direct Cloud API

    Use https://ollama.com as a remote Ollama host:

    import os
    from ollama import Client

client = Client(
    host='https://ollama.com',
    # Raises KeyError if OLLAMA_API_KEY is unset, rather than silently
    # sending "Bearer None".
    headers={'Authorization': 'Bearer ' + os.environ['OLLAMA_API_KEY']}
)

for part in client.chat('gpt-oss:120b', messages=[
    {'role': 'user', 'content': 'Why is the sky blue?'}
], stream=True):
    print(part.message.content, end='', flush=True)

    import { Ollama } from 'ollama'

const ollama = new Ollama({
  host: 'https://ollama.com',
  headers: { Authorization: 'Bearer ' + process.env.OLLAMA_API_KEY },
})

const response = await ollama.chat({
  model: 'gpt-oss:120b',
  messages: [{ role: 'user', content: 'Explain quantum computing' }],
  stream: true,
})
for await (const part of response) {
  process.stdout.write(part.message.content)
}

    curl https://ollama.com/api/chat \
      -H "Authorization: Bearer $OLLAMA_API_KEY" \
      -d '{
        "model": "gpt-oss:120b",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "stream": false
      }'
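The OpenAI-compatible endpoint takes the same payload shape as OpenAI's chat completions API. A sketch that builds (but does not send) such a request using only the standard library; `build_chat_request` is an illustrative helper:

```python
import json
import os
import urllib.request

OPENAI_COMPAT_URL = 'https://ollama.com/v1/chat/completions'

def build_chat_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    # Same JSON shape as OpenAI chat completions: model + messages list.
    payload = {'model': model, 'messages': [{'role': 'user', 'content': prompt}]}
    return urllib.request.Request(
        OPENAI_COMPAT_URL,
        data=json.dumps(payload).encode('utf-8'),
        headers={'Authorization': 'Bearer ' + api_key,
                 'Content-Type': 'application/json'},
        method='POST',
    )

req = build_chat_request('gpt-oss:120b', 'Why is the sky blue?',
                         os.environ.get('OLLAMA_API_KEY', ''))
# urllib.request.urlopen(req) would send it (requires a valid key)
```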
    

List Available Cloud Models

The catalog is public, so the Authorization header here is optional:

    curl https://ollama.com/api/tags \
      -H "Authorization: Bearer $OLLAMA_API_KEY"
    

    Cloud Models (as of 2026-04-11)

| Model | Params (Active) | Tags | Best For |
|-------|-----------------|------|----------|
| glm-5.1 | — | tools, thinking | Agentic engineering, coding |
| gemma4 | 26b, 31b | vision, tools, thinking, audio | Frontier multimodal |
| minimax-m2.7 | — | tools, thinking | Coding, productivity |
| qwen3.5 | 0.8b-122b | vision, tools, thinking | General multimodal |
| qwen3-coder-next | — | tools | Agentic coding |
| ministral-3 | 3b, 8b, 14b | vision, tools | Edge deployment |
| devstral-small-2 | 24b | vision, tools | Code exploration |
| nemotron-3-super | 120b (12b active) | tools, thinking | Efficient MoE |
| qwen3-next | 80b | tools, thinking | Efficient reasoning |
| glm-5 | 744b (40b active) | tools, thinking | Complex engineering |
| kimi-k2.5 | — | vision, tools, thinking | Multimodal agentic |
| rnj-1 | 8b | tools | Code + STEM |
| nemotron-3-nano | 4b, 30b | tools, thinking | Efficient agentic |
| deepseek-v3.2 | — | tools, thinking | Efficient reasoning |
| cogito-2.1 | 671b | — | General (MIT) |
| gemini-3-flash-preview | — | vision, tools, thinking | Speed + intelligence |

    Full list: ollama.com/search?c=cloud

    Local-Only Mode

    Disable cloud features entirely:

    // ~/.ollama/server.json
    {"disable_ollama_cloud": true}
    

    Or: OLLAMA_NO_CLOUD=1
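A client that wants to honor both opt-outs could check them like this. A sketch: the config path and key come from the snippet above, but the precedence order (env var first) is an assumption, and `cloud_disabled` is an illustrative helper:

```python
import json
import os
from pathlib import Path

def cloud_disabled() -> bool:
    # Env var takes precedence (assumption); either setting disables cloud.
    if os.environ.get('OLLAMA_NO_CLOUD') == '1':
        return True
    cfg = Path.home() / '.ollama' / 'server.json'
    if cfg.exists():
        return bool(json.loads(cfg.read_text()).get('disable_ollama_cloud', False))
    return False
```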

    mindX Cloud Strategy

    See cloud/rate_limiting.md for maximizing the free tier.

    Tier Assignment for mindX Agents

| Agent | Model Source | Why |
|-------|--------------|-----|
| mindXagent (autonomous) | Local qwen3:1.7b | Always available, no quota |
| BlueprintAgent | Cloud qwen3-coder-next | Coding focus, needs power |
| AuthorAgent | Cloud qwen3.5:27b | Prose quality |
| Heartbeat/health | Local qwen3:0.6b | Ultra-fast, no quota burn |
| Self-improvement eval | Local deepseek-r1:1.5b | Chain-of-thought local |
| Heavy reasoning | Cloud deepseek-v3.2 | Best reasoning, cloud only |
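The assignment above can be expressed as a routing table. A sketch: the dict, helper, and exact `-cloud` tag spellings are illustrative (the doc's convention is to append `-cloud` when pulling), not a mindX API:

```python
# (source, model) per mindX agent, mirroring the tier-assignment table.
AGENT_MODELS = {
    'mindXagent':          ('local', 'qwen3:1.7b'),
    'BlueprintAgent':      ('cloud', 'qwen3-coder-next-cloud'),
    'AuthorAgent':         ('cloud', 'qwen3.5:27b-cloud'),
    'Heartbeat':           ('local', 'qwen3:0.6b'),
    'SelfImprovementEval': ('local', 'deepseek-r1:1.5b'),
    'HeavyReasoning':      ('cloud', 'deepseek-v3.2-cloud'),
}

def model_for(agent: str) -> tuple[str, str]:
    # Fall back to the always-available local model for unknown agents.
    return AGENT_MODELS.get(agent, ('local', 'qwen3:1.7b'))
```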