Shows models currently loaded in memory with their resource usage.
```shell
curl http://localhost:11434/api/ps
```
```json
{
  "models": [
    {
      "name": "qwen3:1.7b",
      "model": "qwen3:1.7b",
      "size": 2800000000,
      "digest": "sha256:a2af6cc3eb7f...",
      "details": {
        "format": "gguf",
        "family": "qwen3",
        "families": ["qwen3"],
        "parameter_size": "1.7B",
        "quantization_level": "Q4_K_M"
      },
      "expires_at": "2025-10-17T16:47:07Z",
      "size_vram": 2500000000,
      "context_length": 4096
    }
  ]
}
```
Response fields: `name`, `size`, `digest`, `details`, `expires_at`, `size_vram`, `context_length`.

Monitor which models are loaded to prevent OOM on a 4GB VPS:
```python
import aiohttp

async def get_running_models(base_url="http://localhost:11434"):
    async with aiohttp.ClientSession() as session:
        async with session.get(f"{base_url}/api/ps") as resp:
            data = await resp.json()
            return data.get("models", [])

# Check if we need to unload before loading a new model
running = await get_running_models()
total_mem = sum(m["size"] for m in running)
if total_mem > 3_000_000_000:  # 3GB threshold on 4GB VPS
    # Unload least recently used first (earliest expires_at)
    for model in sorted(running, key=lambda m: m["expires_at"]):
        await unload_model(model["name"])
        total_mem -= model["size"]
        if total_mem <= 3_000_000_000:
            break
```
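The `unload_model` helper is not defined in the snippet above. Ollama unloads a model when it receives a request with `keep_alive` set to `0`; a minimal sketch using `/api/generate` (helper name chosen to match the snippet; the `keep_alive` behavior is documented Ollama API):

```python
import aiohttp

async def unload_model(name, base_url="http://localhost:11434"):
    # Sending keep_alive=0 with no prompt asks Ollama to unload
    # the model immediately without generating anything.
    payload = {"model": name, "keep_alive": 0}
    async with aiohttp.ClientSession() as session:
        async with session.post(f"{base_url}/api/generate", json=payload) as resp:
            resp.raise_for_status()
```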
```shell
curl http://localhost:11434/api/version
```
Response:
```json
{"version": "0.12.6"}
```
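Since API behavior can differ across releases, it can be useful to gate features on this version string. A small illustrative helper (not part of the Ollama API; assumes the plain dotted-integer format shown above):

```python
def version_at_least(version: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, e.g. '0.12.6' vs '0.12.0'."""
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(version) >= parse(minimum)

print(version_at_least("0.12.6", "0.12.0"))  # True
print(version_at_least("0.9.9", "0.12.0"))   # False (numeric, not lexical)
```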
```shell
# List running models
ollama ps

# Stop/unload a model
ollama stop qwen3:1.7b
```
Example `ollama ps` output:
```
NAME          ID              SIZE      PROCESSOR    UNTIL
qwen3:1.7b    abc123def456    1.4 GB    100% CPU     4 minutes from now
```
The PROCESSOR column shows:

- `100% GPU` — entirely on GPU
- `100% CPU` — entirely in system memory
- `48%/52% CPU/GPU` — split across both
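The percentages correspond to the `size` and `size_vram` fields from `/api/ps`: `size_vram` is the portion resident in GPU memory, so `size_vram / size` gives the GPU fraction. A small helper that reconstructs the column from those fields (illustrative, not an Ollama API):

```python
def processor_split(model: dict) -> str:
    """Mirror the PROCESSOR column using the size/size_vram fields."""
    size = model["size"]
    vram = model.get("size_vram", 0)
    if vram == 0:
        return "100% CPU"
    if vram >= size:
        return "100% GPU"
    gpu = round(100 * vram / size)
    return f"{100 - gpu}%/{gpu}% CPU/GPU"

# Using the example response above (2.5 GB of 2.8 GB in VRAM):
print(processor_split({"size": 2_800_000_000, "size_vram": 2_500_000_000}))
# → 11%/89% CPU/GPU
```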