The default context length is 4096 tokens. Override it per request or globally:
# Global (env var)
OLLAMA_CONTEXT_LENGTH=8192 ollama serve
# Per request (API)
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:1.7b",
  "prompt": "...",
  "options": {"num_ctx": 8192}
}'
# Interactive session
/set parameter num_ctx 8192
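To see a model's trained context window for reference, inspect it with ollama show:
# "context length" appears under the model info
ollama show qwen3:1.7b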
Models stay loaded for 5 minutes after the last request by default. Control this with keep_alive:
# Keep forever
curl http://localhost:11434/api/generate -d '{"model": "qwen3:1.7b", "keep_alive": -1}'
# Unload immediately
curl http://localhost:11434/api/generate -d '{"model": "qwen3:1.7b", "keep_alive": 0}'
# Or stop from the CLI
ollama stop qwen3:1.7b
# Global default
OLLAMA_KEEP_ALIVE=10m ollama serve
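keep_alive also accepts duration strings per request, e.g. to hold a model for an hour:
curl http://localhost:11434/api/generate -d '{"model": "qwen3:1.7b", "keep_alive": "1h"}'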
To preload a model into memory, send an empty request:
# Generate endpoint
curl http://localhost:11434/api/generate -d '{"model": "qwen3:1.7b"}'
# Chat endpoint
curl http://localhost:11434/api/chat -d '{"model": "qwen3:1.7b"}'
# CLI
ollama run qwen3:1.7b ""
Check what's loaded and when it will unload:
ollama ps
NAME          SIZE      PROCESSOR    UNTIL
qwen3:1.7b    1.4 GB    100% CPU     4 minutes from now
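The same information is available over the API:
curl http://localhost:11434/api/ps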
# Bind to all interfaces so other machines can reach the server
OLLAMA_HOST=0.0.0.0:11434 ollama serve
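Then verify reachability from another machine (192.168.1.50 is an illustrative address; substitute your server's):
curl http://192.168.1.50:11434/api/tags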
# Allow cross-origin requests from additional origins (CORS)
OLLAMA_ORIGINS=chrome-extension://*,http://localhost:3000 ollama serve
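A quick way to check an allowed origin is a request with an Origin header; the response should include an Access-Control-Allow-Origin header:
curl -i -H "Origin: http://localhost:3000" http://localhost:11434/api/tags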
Model storage locations:
- macOS: ~/.ollama/models
- Linux: /usr/share/ollama/.ollama/models
- Windows: C:\Users\%username%\.ollama\models
Override with OLLAMA_MODELS=/custom/path
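A minimal sketch for relocating an existing store on a Linux systemd install (assumes the installer's ollama user; adjust paths to your setup):
sudo systemctl stop ollama
sudo mv /usr/share/ollama/.ollama/models /data/models
sudo chown -R ollama:ollama /data/models
# Set OLLAMA_MODELS=/data/models (see the systemd section below), then:
sudo systemctl start ollama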
Disable Ollama cloud in ~/.ollama/server.json:
{"disable_ollama_cloud": true}
Or set OLLAMA_NO_CLOUD=1
To persist these environment variables on Linux, use systemctl edit ollama.service:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_MODELS=/data/models"
Environment="OLLAMA_KEEP_ALIVE=10m"
Environment="OLLAMA_CONTEXT_LENGTH=8192"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_MAX_QUEUE=512"
Then: systemctl daemon-reload && systemctl restart ollama
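To confirm the overrides took effect:
systemctl show ollama --property=Environment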
Concurrency is controlled by three variables:
- OLLAMA_MAX_LOADED_MODELS: models kept in memory at once
- OLLAMA_NUM_PARALLEL: parallel requests served per model
- OLLAMA_MAX_QUEUE: requests queued before new ones are rejected
RAM usage scales with OLLAMA_NUM_PARALLEL * OLLAMA_CONTEXT_LENGTH.
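For example, with OLLAMA_NUM_PARALLEL=4 and OLLAMA_CONTEXT_LENGTH=8192, each loaded model allocates KV cache for 4 * 8192 = 32768 tokens.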
Flash Attention reduces memory use with large contexts:
OLLAMA_FLASH_ATTENTION=1 ollama serve
KV cache quantization (requires Flash Attention enabled). Options:
- f16 (default)
- q8_0
- q4_0
OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve
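A rough sense of scale, using assumed figures for illustration (actual layer counts, KV heads, and head dims vary per model):
# Hypothetical model: 28 layers, 8 KV heads, head dim 128, num_ctx 8192
# f16 (2 bytes/element): 2 (K and V) * 28 * 8 * 128 * 2 * 8192 bytes ~ 0.94 GB
# q8_0 stores ~1 byte/element, roughly halving that to ~ 0.47 GB
echo "$((2 * 28 * 8 * 128 * 2 * 8192)) bytes of KV cache at f16"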
# NVIDIA — select specific GPUs
CUDA_VISIBLE_DEVICES=0,1 ollama serve
# AMD
ROCR_VISIBLE_DEVICES=0 ollama serve
# Force CPU only
CUDA_VISIBLE_DEVICES=-1 ollama serve
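Confirm placement afterwards; the PROCESSOR column of ollama ps shows the GPU/CPU split (e.g. 100% GPU when fully offloaded):
ollama ps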
# Linux (systemd)
journalctl -u ollama --no-pager --follow
# macOS
cat ~/.ollama/logs/server.log
# Docker
docker logs <container-name>
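For more detail in any of these logs, enable debug logging:
OLLAMA_DEBUG=1 ollama serve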
# NVIDIA: if the GPU stops being detected (e.g. after suspend/resume), reload the UVM driver
sudo rmmod nvidia_uvm && sudo modprobe nvidia_uvm
If GPU discovery times out (30s), upgrade to ROCm v7:
# Install amdgpu-install from AMD docs, then:
sudo amdgpu-install
sudo reboot
# Force a specific compute backend instead of autodetection
OLLAMA_LLM_LIBRARY="cpu_avx2" ollama serve
Options: cpu, cpu_avx, cpu_avx2, cuda_v11, rocm_v5, rocm_v6
If requests are rejected because the queue is full, increase it: OLLAMA_MAX_QUEUE=1024
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Specific version
curl -fsSL https://ollama.com/install.sh | OLLAMA_VERSION=0.5.7 sh
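Verify the installed version:
ollama -v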
mindX VPS defaults:
- OLLAMA_MAX_LOADED_MODELS=1
- qwen3:1.7b as the default model (fits in ~2GB RAM)
- OLLAMA_KV_CACHE_TYPE=q8_0 to halve context memory
- ResourceGovernor downshifts to qwen3:0.6b at >80% RAM
- OLLAMA_KEEP_ALIVE=5m frees memory between cycles
- the keep_alive: 0 trick unloads immediately after each request during RAM-critical periods

Endpoints:
- Primary: MINDX_LLM__OLLAMA__BASE_URL (10.0.0.155:18080 when GPU available)
- Fallback: localhost:11434 (always available on VPS)
- Timeouts: 120s total, 10s connect, 60s sock_read
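A minimal sketch of the primary-then-fallback probe (URLs from the settings above; the shell variable names are illustrative):
PRIMARY="http://10.0.0.155:18080"
FALLBACK="http://localhost:11434"
BASE_URL=$(curl -fsS --connect-timeout 10 "$PRIMARY/api/tags" > /dev/null 2>&1 && echo "$PRIMARY" || echo "$FALLBACK")
echo "Using Ollama at $BASE_URL"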
# Simple ping
curl http://localhost:11434/api/tags
# mindX admin route
curl http://localhost:8000/api/admin/ollama/status