Run larger models without powerful hardware. Cloud models offload inference to Ollama's cloud service while staying compatible with the same local tooling.
# Create account and sign in
ollama signin
# Generate an API key at https://ollama.com/settings/keys
export OLLAMA_API_KEY=your_api_key
Free tier limits reset: session limits every 5 hours, weekly limits every 7 days.
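The reset windows can be tracked client-side so you know when quota returns. A minimal sketch using the window lengths stated above (the helper name is hypothetical, not part of Ollama's API):

```python
from datetime import datetime, timedelta

SESSION_WINDOW = timedelta(hours=5)  # session limits reset every 5 hours
WEEKLY_WINDOW = timedelta(days=7)    # weekly limits reset every 7 days

def next_resets(window_start: datetime) -> tuple[datetime, datetime]:
    """Given when the current usage window began, return when the
    session and weekly limits next reset."""
    return window_start + SESSION_WINDOW, window_start + WEEKLY_WINDOW

start = datetime(2026, 4, 11, 9, 0)
session_reset, weekly_reset = next_resets(start)
# session_reset -> 2026-04-11 14:00, weekly_reset -> 2026-04-18 09:00
```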
Critical constraint — Free tier = 1 concurrent model. This means: if soldier A is running deepseek-v3.2-cloud and soldier B requests qwen3-coder-next-cloud, the second request must wait for the first model to unload. Use a single model (gpt-oss:120b-cloud at 65 tok/s) for all boardroom cloud queries.

Endpoints:
- Local proxy: http://localhost:11434/api/chat
- Cloud direct: https://ollama.com/api/chat
- OpenAI-compatible: https://ollama.com/v1/chat/completions
- List models: https://ollama.com/api/tags

Cloud models are pulled locally but inference runs in the cloud. Seamless — same API as local models.
Critical: Append -cloud to the model name when pulling. Without it, ollama pull downloads full weights (gigabytes) for local execution. With -cloud, only metadata is pulled and inference is proxied to ollama.com GPU servers.
# Pull cloud model (metadata only — inference proxied to cloud GPU)
ollama pull gpt-oss:120b-cloud
ollama pull deepseek-v3.2-cloud
# NOT this — downloads 3.3GB weights for local execution:
ollama pull gemma3:4b

# Run (automatically offloads to cloud)
ollama run gpt-oss:120b-cloud
The cloud catalog lists names without -cloud. Append it yourself:
# Catalog returns: gpt-oss:120b, deepseek-v3.2, qwen3-coder-next, etc.
Pull as: gpt-oss:120b-cloud, deepseek-v3.2-cloud, qwen3-coder-next-cloud
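Mapping a catalog name to a pullable tag is a one-line transform. A small sketch (the function name is hypothetical):

```python
def to_cloud_tag(catalog_name: str) -> str:
    """Append -cloud unless the tag already carries the suffix."""
    if catalog_name.endswith('-cloud'):
        return catalog_name
    return catalog_name + '-cloud'

print(to_cloud_tag('gpt-oss:120b'))  # gpt-oss:120b-cloud
```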
Requires ollama signin first (stores ed25519 key at ~/.ollama/id_ed25519).
Benchmark (test_cloud_all_models.py, 2026-04-11): gpt-oss:120b-cloud at 65.52 tok/s vs local CPU deepseek-r1:1.5b at 8.00 tok/s — 8.2x speedup on cloud GPU with a 120B model.
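Throughput figures like these can be reproduced from the counters Ollama attaches to a completed response: `eval_count` (tokens generated) and `eval_duration` (nanoseconds). A sketch of the arithmetic:

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """tok/s from Ollama response metadata: tokens / seconds of eval time."""
    return eval_count / (eval_duration_ns / 1e9)

# e.g. 655 tokens generated over 10 s of eval time
print(round(tokens_per_second(655, 10_000_000_000), 2))  # 65.5
```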
from ollama import Client

client = Client()
for part in client.chat('gpt-oss:120b-cloud', messages=[
    {'role': 'user', 'content': 'Why is the sky blue?'}
], stream=True):
    print(part.message.content, end='', flush=True)
Use https://ollama.com as a remote Ollama host:
import os
from ollama import Client

# os.environ[...] fails fast with KeyError if the key is missing,
# instead of sending "Bearer None".
client = Client(
    host='https://ollama.com',
    headers={'Authorization': 'Bearer ' + os.environ['OLLAMA_API_KEY']}
)
for part in client.chat('gpt-oss:120b', messages=[
    {'role': 'user', 'content': 'Why is the sky blue?'}
], stream=True):
    print(part.message.content, end='', flush=True)
import { Ollama } from 'ollama'
const ollama = new Ollama({
  host: 'https://ollama.com',
  headers: { Authorization: 'Bearer ' + process.env.OLLAMA_API_KEY },
})

const response = await ollama.chat({
  model: 'gpt-oss:120b',
  messages: [{ role: 'user', content: 'Explain quantum computing' }],
  stream: true,
})

for await (const part of response) {
  process.stdout.write(part.message.content)
}
curl https://ollama.com/api/chat \
  -H "Authorization: Bearer $OLLAMA_API_KEY" \
  -d '{
    "model": "gpt-oss:120b",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "stream": false
  }'
curl https://ollama.com/api/tags \
-H "Authorization: Bearer $OLLAMA_API_KEY"
Full list: ollama.com/search?c=cloud
Disable cloud features entirely:
// ~/.ollama/server.json
{"disable_ollama_cloud": true}
Or: OLLAMA_NO_CLOUD=1
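A client script can check the same switches before attempting cloud pulls. A sketch, assuming the server.json layout shown above (the helper name is hypothetical):

```python
import json
import os
from pathlib import Path

def cloud_disabled(config_path: str = '~/.ollama/server.json') -> bool:
    """True if cloud features are switched off via env var or server.json."""
    if os.environ.get('OLLAMA_NO_CLOUD') == '1':
        return True
    path = Path(config_path).expanduser()
    if path.exists():
        return bool(json.loads(path.read_text()).get('disable_ollama_cloud'))
    return False
```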
See cloud/rate_limiting.md for maximizing the free tier.
qwen3:1.7b, qwen3-coder-next, qwen3.5:27b, qwen3:0.6b, deepseek-r1:1.5b, deepseek-v3.2