vLLM is the primary high-performance inference engine in mindX. It provides PagedAttention-optimized serving for both embeddings and language model inference, with Ollama as the always-on fallback.
```
Request (embedding / chat / completion)
                  ↓
InferenceDiscovery (auto-scores available providers)
                  ↓ best_provider()
┌─────────────────────────────────────────┐
│ vLLM (primary)               port 8001  │
│   PagedAttention, continuous batching   │
│   OpenAI-compatible API                 │
│   Model: mxbai-embed-large (embeddings) │
│          qwen3:0.6b / 1.7b (chat)       │
├─────────────────────────────────────────┤
│ Ollama (fallback)           port 11434  │
│   CPU-native, always-on                 │
│   qwen3:0.6b, qwen3:1.7b, mxbai-embed   │
├─────────────────────────────────────────┤
│ Cloud (escalation)        Gemini, Groq  │
│   For complex tasks beyond local models │
└─────────────────────────────────────────┘
```
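The tier order in the diagram can be expressed as a simple probe-and-resolve loop. This is an illustrative sketch only: the helper names (`probe`, `resolve_provider`) and the Ollama probe URL are assumptions, not actual mindX APIs.

```python
import urllib.request

# Tiers in priority order, per the diagram above. The cloud tier has no
# local health URL, so it is chosen whenever both local tiers are down.
PROVIDERS = [
    ("vllm", "http://localhost:8001/health"),       # primary
    ("ollama", "http://localhost:11434/api/tags"),  # always-on fallback
    ("cloud", None),                                # Gemini / Groq escalation
]

def probe(url: str, timeout: float = 0.5) -> bool:
    """Return True if the provider answers its health URL with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def resolve_provider() -> str:
    """Walk the tiers in order and return the first live provider."""
    for name, url in PROVIDERS:
        if url is None or probe(url):
            return name
    return "cloud"
```

In mindX this resolution is handled by InferenceDiscovery (described below), which scores providers rather than taking the first one that responds.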
All semantic search in mindX flows through vLLM embeddings first:
- Endpoint: `POST /v1/embeddings` on the vLLM server
- Storage: `doc_embeddings` table
- Model: `mxbai-embed-large` (1024-dimensional vectors)
- Fallback: `/api/embeddings` on port 11434 if vLLM is unavailable
- Entry point: `memory_pgvector.generate_embedding(text)`

```
memory_pgvector.generate_embedding(text)
        ↓ try vLLM first
POST http://localhost:8001/v1/embeddings
        ↓ fallback to Ollama
POST http://localhost:11434/api/embeddings
        ↓ returns
1024-dim float vector → pgvectorscale
```
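The fallback chain above can be sketched with the standard library only. `generate_embedding` here is a simplified stand-in for `memory_pgvector.generate_embedding`, not the actual implementation; the request/response shapes follow the OpenAI-compatible and Ollama-native embedding APIs respectively.

```python
import json
import urllib.error
import urllib.request

VLLM_EMBED = "http://localhost:8001/v1/embeddings"
OLLAMA_EMBED = "http://localhost:11434/api/embeddings"

def _post_json(url: str, payload: dict, timeout: float = 10.0) -> dict:
    """POST a JSON payload and decode the JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)

def generate_embedding(text: str) -> list[float]:
    """Try vLLM's OpenAI-style endpoint first, then fall back to Ollama."""
    try:
        # OpenAI-compatible shape: {"data": [{"embedding": [...]}]}
        out = _post_json(VLLM_EMBED, {"model": "mxbai-embed-large", "input": text})
        return out["data"][0]["embedding"]
    except (urllib.error.URLError, OSError, KeyError):
        # Ollama-native shape: {"embedding": [...]}
        out = _post_json(OLLAMA_EMBED, {"model": "mxbai-embed-large", "prompt": text})
        return out["embedding"]
```

Note the payload difference: the OpenAI-style endpoint takes `input`, while Ollama's native endpoint takes `prompt`.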
InferenceDiscovery automatically probes and scores all available inference sources:
- Probes each provider's `/health` endpoint (preferred), with fallback to `/v1/models`
- Scores each source as `reliability × speed_factor × recency`
- `get_best_provider()` returns the highest-scoring source

The VLLMAgent (singleton) manages the vLLM server lifecycle and persists its status to `data/vllm_status.json`.

LLMFactory creates vLLM handlers when the provider is available:
```python
# Provider resolution order
handler = LLMFactory.create_llm_handler(
    provider="vllm",   # or auto-detected
    model="qwen3:0.6b",
)
```
The VLLMAgent exposes management endpoints:

- `GET /vllm/status`
- `POST /vllm/build-cpu`
- `POST /vllm/serve?model=MODEL`
- `POST /vllm/stop`
- `GET /vllm/health`

Configuration (`config/providers/vllm.env`):

```
# vLLM OpenAI-compatible endpoints
VLLM_BASE_URL=http://localhost:8001
VLLM_CHAT_ENDPOINT=/v1/chat/completions
VLLM_EMBED_ENDPOINT=/v1/embeddings
VLLM_MODELS_ENDPOINT=/v1/models
```
The embedding server can be started manually (see `scripts/start_vllm_embed.sh`):

```shell
python -m vllm.entrypoints.openai.api_server \
    --model mxbai-embed-large \
    --port 8001 \
    --dtype float16
```
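Once the server is up, the endpoints from `vllm.env` can be smoke-tested from Python. This is a stdlib-only sketch; it assumes the server started above is listening on port 8001.

```python
import json
import os
import urllib.request

BASE = os.environ.get("VLLM_BASE_URL", "http://localhost:8001")

def list_models(timeout: float = 5.0) -> list[str]:
    """Ask the OpenAI-compatible /v1/models endpoint which models are loaded."""
    with urllib.request.urlopen(f"{BASE}/v1/models", timeout=timeout) as resp:
        data = json.load(resp)
    return [m["id"] for m in data.get("data", [])]
```

If the server is healthy, the returned list should include the served model's name.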
vLLM serves OpenAI-style routes (`/v1/`), while Ollama serves its native routes (`/api/`): embedding requests prefer `/v1/embeddings` (fast) and fall back to `/api/embeddings` (reliable).

Key files:

- `agents/vllm_agent.py`
- `llm/vllm_handler.py`
- `llm/inference_discovery.py`
- `agents/memory_pgvector.py`
- `llm/llm_factory.py`
- `config/providers/vllm.env`
- `models/vllm.yaml`
- `scripts/start_vllm_embed.sh`