vLLM Integration — mindX Inference Engine

vLLM is the primary high-performance inference engine in mindX. It provides PagedAttention-optimized serving for both embeddings and language model inference, with Ollama as the always-on fallback.

Architecture

Request (embedding / chat / completion)
    ↓
InferenceDiscovery (auto-scores available providers)
    ↓ best_provider()
┌─────────────────────────────────────────┐
│  vLLM (primary)          port 8001      │
│  PagedAttention, continuous batching    │
│  OpenAI-compatible API                  │
│  Model: mxbai-embed-large (embeddings)  │
│         qwen3:0.6b / 1.7b (chat)        │
├─────────────────────────────────────────┤
│  Ollama (fallback)       port 11434     │
│  CPU-native, always-on                  │
│  qwen3:0.6b, qwen3:1.7b, mxbai-embed    │
├─────────────────────────────────────────┤
│  Cloud (escalation)      Gemini, Groq   │
│  For complex tasks beyond local models  │
└─────────────────────────────────────────┘

How vLLM Is Used

1. Embedding Engine (Primary)

All semantic search in mindX flows through vLLM embeddings first:

memory_pgvector.generate_embedding(text)
    ↓ try vLLM first
    POST http://localhost:8001/v1/embeddings
    ↓ fallback to Ollama
    POST http://localhost:11434/api/embeddings
    ↓ returns
    1024-dim float vector → pgvectorscale
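
The flow above can be sketched in a few lines. This is a minimal illustration, not the actual memory_pgvector code: the `post_json` helper, the injectable `post` parameter, and the set of caught exceptions are assumptions made for the example. The request and response shapes follow the OpenAI-compatible embeddings API (vLLM) and the Ollama `/api/embeddings` API.

```python
import json
import urllib.request
from typing import Callable, List

VLLM_URL = "http://localhost:8001/v1/embeddings"
OLLAMA_URL = "http://localhost:11434/api/embeddings"
MODEL = "mxbai-embed-large"  # model name as written in this document


def post_json(url: str, payload: dict, timeout: float = 5.0) -> dict:
    """POST a JSON body and decode the JSON response (hypothetical helper)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def generate_embedding(
    text: str, post: Callable[[str, dict], dict] = post_json
) -> List[float]:
    """Try vLLM first; on any transport or shape error, fall back to Ollama."""
    try:
        # OpenAI-compatible response: {"data": [{"embedding": [...]}]}
        body = post(VLLM_URL, {"model": MODEL, "input": text})
        return body["data"][0]["embedding"]
    except (OSError, KeyError, IndexError, ValueError):
        # Ollama response: {"embedding": [...]}
        body = post(OLLAMA_URL, {"model": MODEL, "prompt": text})
        return body["embedding"]
```

Passing `post` as a parameter keeps the fallback logic testable without a running server. The sketch does not check that the returned vector has 1024 dimensions.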

2. Inference Discovery

InferenceDiscovery automatically probes and scores all available inference sources, and best_provider() returns the highest-scoring one.
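
One way such probing and scoring could work is sketched below. The `Provider` dataclass, the `priority` weights, and the scoring formula are all hypothetical; the real scoring rules in llm/inference_discovery.py are not shown in this document. The probe URLs use the vLLM `/v1/models` endpoint from the configuration section and Ollama's `/api/tags` listing endpoint. The cloud escalation tier is left out.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Provider:
    name: str
    probe_url: str
    priority: int  # hypothetical weight: higher is preferred


PROVIDERS = [
    Provider("vllm", "http://localhost:8001/v1/models", priority=2),
    Provider("ollama", "http://localhost:11434/api/tags", priority=1),
]


def http_probe(url: str, timeout: float = 2.0) -> Optional[float]:
    """Return round-trip latency in seconds, or None if the provider is unreachable."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return None
    except OSError:
        return None
    return time.monotonic() - start


def score(provider: Provider, latency: Optional[float]) -> float:
    """Hypothetical score: 0 when down; otherwise priority dominates and latency breaks ties."""
    if latency is None:
        return 0.0
    return provider.priority + 1.0 / (1.0 + latency)


def best_provider(
    providers: List[Provider] = PROVIDERS,
    probe: Callable[[str], Optional[float]] = http_probe,
) -> Optional[str]:
    """Probe every provider and return the name of the best one, or None if all are down."""
    if not providers:
        return None
    scored = [(score(p, probe(p.probe_url)), p.name) for p in providers]
    top_score, top_name = max(scored)
    return top_name if top_score > 0 else None
```

Because the latency term is always between 0 and 1, a reachable vLLM outranks a reachable Ollama regardless of response time, which matches the primary/fallback ordering in the architecture diagram.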

3. VLLMAgent — Lifecycle Management

The VLLMAgent (singleton) manages the vLLM server lifecycle: build, serve, stop, and monitor.
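
The singleton-plus-subprocess pattern could look roughly like the sketch below. `VLLMAgentSketch` and its method names are hypothetical stand-ins, not the interface of agents/vllm_agent.py; the build and monitor steps are omitted. The launch command mirrors the one in scripts/start_vllm_embed.sh.

```python
import subprocess
import sys
from typing import List, Optional


class VLLMAgentSketch:
    """Hypothetical stand-in for VLLMAgent: one shared instance, one server process."""

    _instance: Optional["VLLMAgentSketch"] = None

    def __new__(cls) -> "VLLMAgentSketch":
        # Every construction returns the same object.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._proc = None
        return cls._instance

    @staticmethod
    def serve_command(model: str, port: int = 8001) -> List[str]:
        """Build the command line for the OpenAI-compatible vLLM server."""
        return [
            sys.executable, "-m", "vllm.entrypoints.openai.api_server",
            "--model", model,
            "--port", str(port),
        ]

    def is_serving(self) -> bool:
        """True while the child process exists and has not exited."""
        return self._proc is not None and self._proc.poll() is None

    def serve(self, model: str, port: int = 8001) -> None:
        """Start the server unless one is already running."""
        if self.is_serving():
            return
        self._proc = subprocess.Popen(self.serve_command(model, port))

    def stop(self, timeout: float = 10.0) -> None:
        """Terminate the server, escalating to kill if it does not exit in time."""
        if self.is_serving():
            self._proc.terminate()
            try:
                self._proc.wait(timeout=timeout)
            except subprocess.TimeoutExpired:
                self._proc.kill()
        self._proc = None
```

Holding the process handle on a single shared instance means any caller can ask whether the server is up or stop it, without two callers starting competing servers on port 8001.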

4. LLM Factory Integration

LLMFactory creates vLLM handlers when the provider is available:

# Create a handler for the vLLM provider
from llm.llm_factory import LLMFactory  # import path assumed from Key Files

handler = LLMFactory.create_llm_handler(
    provider="vllm",     # or auto-detected
    model="qwen3:0.6b",
)

API Endpoints

| Endpoint | Method | Purpose |
|---|---|---|
| GET /vllm/status | GET | Full efficiency report: version, backend, serving model, hardware, recommendations |
| POST /vllm/build-cpu | POST | Build vLLM from source for CPU-only VPS (10-30 min) |
| POST /vllm/serve?model=MODEL | POST | Start serving a model on port 8001 |
| POST /vllm/stop | POST | Stop serving |
| GET /vllm/health | GET | Server health check |
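
A small client for these endpoints might look like the sketch below. The base URL `http://localhost:8000` is a placeholder, since this document does not say where the mindX API listens, and the assumption that every endpoint returns JSON is also unverified.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

MINDX_API = "http://localhost:8000"  # hypothetical base URL of the mindX API

# Method and path for each endpoint listed in the table above.
ROUTES = {
    "status": ("GET", "/vllm/status"),
    "build-cpu": ("POST", "/vllm/build-cpu"),
    "serve": ("POST", "/vllm/serve"),
    "stop": ("POST", "/vllm/stop"),
    "health": ("GET", "/vllm/health"),
}


def vllm_request(
    action: str, model: Optional[str] = None, base: str = MINDX_API
) -> urllib.request.Request:
    """Build the HTTP request for one lifecycle action without sending it."""
    method, path = ROUTES[action]
    url = base + path
    if action == "serve":
        if not model:
            raise ValueError("serve requires a model name")
        # The model goes in the query string: /vllm/serve?model=MODEL
        url += "?" + urllib.parse.urlencode({"model": model})
    return urllib.request.Request(url, method=method)


def call(action: str, model: Optional[str] = None, timeout: float = 30.0) -> dict:
    """Send the request and decode the response, assuming a JSON body."""
    with urllib.request.urlopen(vllm_request(action, model), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

For example, `call("serve", model="qwen3:0.6b")` would issue `POST /vllm/serve?model=qwen3%3A0.6b`. A CPU build can take 10-30 minutes, so `build-cpu` likely needs a much longer timeout or a follow-up poll of `/vllm/status`.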

Configuration

config/providers/vllm.env

# vLLM OpenAI-compatible endpoints
VLLM_BASE_URL=http://localhost:8001
VLLM_CHAT_ENDPOINT=/v1/chat/completions
VLLM_EMBED_ENDPOINT=/v1/embeddings
VLLM_MODELS_ENDPOINT=/v1/models
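
As an illustration of how these four values combine into full URLs, here is a minimal sketch. The `parse_env` and `endpoint` helpers are hypothetical; mindX may load this file with a different mechanism.

```python
from typing import Dict

SAMPLE = """\
# vLLM OpenAI-compatible endpoints
VLLM_BASE_URL=http://localhost:8001
VLLM_CHAT_ENDPOINT=/v1/chat/completions
VLLM_EMBED_ENDPOINT=/v1/embeddings
VLLM_MODELS_ENDPOINT=/v1/models
"""


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def endpoint(cfg: Dict[str, str], name: str) -> str:
    """Join the base URL with one of the endpoint paths."""
    return cfg["VLLM_BASE_URL"].rstrip("/") + cfg[name]


cfg = parse_env(SAMPLE)
chat_url = endpoint(cfg, "VLLM_CHAT_ENDPOINT")
```

Keeping the base URL separate from the paths means moving vLLM to another host or port requires changing only `VLLM_BASE_URL`.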

models/vllm.yaml

Recommended models for the 2-core VPS:

scripts/start_vllm_embed.sh

python -m vllm.entrypoints.openai.api_server \
  --model mxbai-embed-large \
  --port 8001 \
  --dtype float16
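
The server takes some time to load the model before it accepts requests, so a caller may want to wait for readiness after launching the script. The sketch below polls the `/v1/models` endpoint from the configuration section; the function names, attempt count, and delay are illustrative assumptions.

```python
import time
import urllib.request
from typing import Callable


def models_endpoint_up(
    url: str = "http://localhost:8001/v1/models", timeout: float = 2.0
) -> bool:
    """Return True if the vLLM server answers the model listing with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False


def wait_until_ready(
    probe: Callable[[], bool] = models_endpoint_up,
    attempts: int = 30,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the probe up to `attempts` times, sleeping `delay` seconds between tries."""
    for i in range(attempts):
        if probe():
            return True
        if i < attempts - 1:
            sleep(delay)
    return False
```

If `wait_until_ready()` returns False, embedding requests can still be served by the Ollama fallback on port 11434.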

vLLM vs Ollama

| Feature | vLLM | Ollama |
|---|---|---|
| Role | Primary (performance) | Fallback (reliability) |
| Architecture | PagedAttention, continuous batching | CPU-native, simple |
| API | OpenAI-compatible (/v1/) | Ollama API (/api/) |
| Embeddings | /v1/embeddings (fast) | /api/embeddings (reliable) |
| Port | 8001 | 11434 |
| GPU | Optimized (tensor parallelism) | CPU-only on VPS |
| Always-on | On-demand | 24/7 |

Current Deployment (mindx.pythai.net)

Key Files

| File | Purpose |
|---|---|
| agents/vllm_agent.py | Lifecycle management (build, serve, stop, monitor) |
| llm/vllm_handler.py | OpenAI API client for vLLM |
| llm/inference_discovery.py | Auto-discovery, health probing, provider scoring |
| agents/memory_pgvector.py | Embedding engine (vLLM primary, Ollama fallback) |
| llm/llm_factory.py | Provider creation and caching |
| config/providers/vllm.env | Configuration |
| models/vllm.yaml | Model recommendations |
| scripts/start_vllm_embed.sh | Startup script |
