EMBEDDING_SYSTEM.md · 5.3 KB

RAGE Embed — Semantic Search over Documentation and Memory

Overview

RAGE (Retrieval Augmented Generative Engine) embed is the embedding layer of mindX. It bridges LLM inference with pgvector database storage, enabling semantic search over all documentation and agent memories.

mindX embeds all documentation (194 files) and agent memories into pgvector using mxbai-embed-large (1024 dimensions). RAGE facilitates the interaction between LLM and pgvectorscale, providing the semantic retrieval layer for RAG (Retrieval Augmented Generation) queries.

Architecture

Question → Embed (mxbai-embed-large) → pgvector cosine similarity → Top-K chunks → qwen3:0.6b answer

Embedding Pipeline

  • Model: mxbai-embed-large via Ollama (1024-dimensional vectors)
  • Production (VPS): Ollama /api/embeddings on port 11434 — CPU-native, always running
  • GPU path: vLLM /v1/embeddings on port 8001 — auto-activates when GPU hardware available
  • Storage: PostgreSQL pgvector — doc_embeddings and memories.embedding columns
  • Indexing: IVFFlat cosine similarity index for fast nearest-neighbor search
  • Tables

    -- Document chunks with embeddings
    doc_embeddings (
        doc_name VARCHAR(256),
        chunk_idx INTEGER,
        text_content TEXT,
        embedding vector(1024),
        UNIQUE(doc_name, chunk_idx)
    )

    -- Memory embeddings (column on existing memories table) memories.embedding vector(1024)

    Chunking Strategy

    Documents are split into ~500-word chunks. Each chunk is embedded independently. A 10KB doc typically produces 3-5 chunks. This ensures that search results return specific, relevant passages rather than entire documents.

    API Endpoints

    RAGE Embed Endpoints

    EndpointMethodPurpose /api/rage/embed?query=...GETSemantic search over docs + memories /api/rage/embed/statsGETEmbedding counts + action efficiency /actions/exportGETExport all actions as JSON /actions/export/csvGETExport actions as CSV download /actions/efficiencyGETAction pipeline efficiency metrics /diagnostics/exportGETFull diagnostics snapshot download

    Chat Endpoints

    POST /chat/docs — RAG Q&A

    Ask a question about mindX documentation. Returns an answer generated from relevant doc chunks.

    curl -X POST "https://mindx.pythai.net/chat/docs?question=What+is+the+BDI+agent"
    

    Response:

    {
      "question": "What is the BDI agent",
      "answer": "The BDI (Belief-Desire-Intention) agent is the core reasoning engine...",
      "sources": [
        {"doc": "AGINT", "similarity": 0.6226},
        {"doc": "AGENTS", "similarity": 0.6207}
      ]
    }
    

    GET /chat/docs/stats — Embedding Statistics

    curl https://mindx.pythai.net/chat/docs/stats
    

    Response:

    {"docs": 95, "memories": 558}
    

    Embedding Models

    ModelDimensionsSpeedUse mxbai-embed-large1024~100ms/queryPrimary — best quality nomic-embed-text768~80ms/queryAvailable as alternative qwen3:0.6bN/A~3-6s/queryChat/generation (not embedding)

    Batch Embedding

    # Embed all docs and memories
    python scripts/embed_docs.py
    

    This walks docs/*.md, chunks each, embeds via mxbai-embed-large, and stores in pgvector. Also embeds all memories without embeddings. Takes ~5 minutes on CPU.

    Auto-Embedding

    New memories are auto-embedded on save via MemoryAgent.save_timestamped_memory(). New docs are picked up every 6 hours by the periodic re-embedding task.

    File Paths

  • Embedding engine: agents/memory_pgvector.py (generate_embedding, embed_and_store_doc, semantic_search_docs)
  • Batch script: scripts/embed_docs.py
  • vLLM handler: llm/vllm_handler.py (generate_embeddings method)
  • vLLM startup: scripts/start_vllm_embed.sh
  • RAG endpoint: mindx_backend_service/main_service.py (/chat/docs)
  • vLLMAgent

    agents/vllm_agent.py manages the vLLM lifecycle for mindX:

    EndpointPurpose GET /vllm/statusInstallation status, hardware, recommendations POST /vllm/build-cpuBuild vLLM from source for CPU (AVX2) POST /vllm/serveStart serving a model on port 8001 POST /vllm/stopStop serving GET /vllm/healthServer health check

    Current VPS Status

  • vLLM 0.19.0 installed, backend=ready
  • AMD EPYC 7543P (2 vCPUs, AVX2 supported)
  • 7.8GB RAM — sufficient for mxbai-embed-large
  • Ollama handles chat (qwen3:0.6b) and embeddings (mxbai-embed-large) on CPU
  • vLLM can serve embeddings when started (POST /vllm/serve)
  • Efficiency Strategy

  • Embeddings: vLLM on port 8001 (when serving) → Ollama fallback on port 11434
  • Chat/Generation: Ollama qwen3:0.6b (always available, CPU-native)
  • Cloud LLM: Gemini, Groq, etc. for complex reasoning
  • Multi-stream: Parallel queries across providers for critical decisions
  • Configuration

    # Environment variables
    VLLM_EMBED_URL=http://localhost:8001  # vLLM embedding server
    VLLM_PORT=8001                        # vLLM serving port
    EMBED_MODEL=mxbai-embed-large         # Default embedding model

    Ollama models (pull if not present)

    ollama pull mxbai-embed-large ollama pull nomic-embed-text ollama pull qwen3:0.6b

    All DocumentsDocument IndexThe Book of mindXImprovement JournalAPI Reference