
Monitoring and Rate Control (Both Directions)

Whether mindX is ingesting data (receiving requests from clients), providing inference (calling Ollama/LLMs), or running services (orchestration, memory, tools), monitoring and rate control are essential in both directions. This document defines the network and data metrics that mindX reports, each with explicit units (SI or a widely used standard), and states where each metric is collected.


1. Both directions

| Direction | Role | Monitoring | Rate control |
|---|---|---|---|
| Inbound | Clients → mindX (ingestion, agent/call, ollama/ingest) | Request latency, payload size, throughput | Per-client or global req/s, req/min |
| Outbound | mindX → Ollama / LLM providers (inference) | Latency, tokens, throughput, errors | RPM, RPH, token bucket (see llm/rate_limiter.py) |

Both directions must be measured and, where configured, limited so that ingestion, inference, and services stay within capacity and quotas.


2. Scientific network and data metrics

All metrics use explicit units. Prefer SI or widely used standards.

2.1 Time

| Symbol | Unit | Description |
|---|---|---|
| \(t\) | s (second) | Wall-clock time |
| \(T_{\mathrm{lat}}\) | s or ms | Latency (request start → response end) |
| \(T_{\mathrm{wait}}\) | ms | Wait time in rate limiter before sending |

  • Latency: Report in seconds (s) or milliseconds (ms). APIs: average_latency_ms, latency_ms, total_duration (ns → convert to s or ms).
  • Throughput (time-based): Requests per second req/s or per minute req/min (RPM).
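
The nanosecond conversion mentioned above is easy to get wrong by a factor of 1000, so a small helper can keep it in one place. This is a minimal sketch; the function names `ns_to_ms` and `ns_to_s` are hypothetical and are not taken from the mindX codebase.

```python
# Hypothetical helpers for reporting a nanosecond duration (such as a
# total_duration field) in the units this document uses for latency.

NS_PER_MS = 1_000_000
NS_PER_S = 1_000_000_000


def ns_to_ms(duration_ns: int) -> float:
    """Convert nanoseconds to milliseconds."""
    return duration_ns / NS_PER_MS


def ns_to_s(duration_ns: int) -> float:
    """Convert nanoseconds to seconds."""
    return duration_ns / NS_PER_S
```

For example, a `total_duration` of 1 500 000 000 ns would be reported as 1500 ms or 1.5 s.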

2.2 Data volume

| Symbol | Unit | Description |
|---|---|---|
| \(B_{\mathrm{in}}\) | byte | Request body size (payload in) |
| \(B_{\mathrm{out}}\) | byte | Response body size (payload out) |
| \(N_{\mathrm{tok}}\) | 1 (dimensionless) | Token count (input + output) |

  • Payload sizes: Report in bytes (B). Request/response body length in bytes.
  • Tokens: Count as integers; report tokens or tokens/s for throughput.

2.3 Rate (throughput)

| Quantity | Unit | Description |
|---|---|---|
| Request rate (inbound) | req/s, req/min | Incoming API requests per unit time |
| Request rate (outbound) | req/min (RPM), req/h (RPH) | Outgoing calls to Ollama/LLM per unit time |
| Token rate | tokens/s, tokens/min (TPM) | Tokens consumed or generated per unit time |
| Data rate | byte/s (B/s), kB/s | Payload bytes per unit time |
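
Each rate above is a count divided by the length of the observation window. The sketch below shows that arithmetic for one window; `WindowTotals` and `rates` are hypothetical names chosen for illustration, and kB is taken as 1000 bytes (SI), which is an assumption about the intended convention.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class WindowTotals:
    """Counts accumulated over one observation window (hypothetical type)."""
    window_s: float       # window length in seconds
    requests: int         # requests seen in the window
    tokens: int           # tokens consumed or generated in the window
    payload_bytes: int    # request + response body bytes in the window


def rates(totals: WindowTotals) -> Dict[str, float]:
    """Derive per-second and per-minute rates from window totals."""
    if totals.window_s <= 0:
        raise ValueError("window_s must be positive")
    w = totals.window_s
    bytes_per_s = totals.payload_bytes / w
    return {
        "req_per_s": totals.requests / w,
        "req_per_min": totals.requests * 60.0 / w,      # RPM
        "tokens_per_s": totals.tokens / w,
        "tokens_per_min": totals.tokens * 60.0 / w,     # TPM
        "bytes_per_s": bytes_per_s,
        "kB_per_s": bytes_per_s / 1000.0,               # SI kilobyte assumed
    }
```

With 60 requests, 4500 tokens, and 120 000 payload bytes in a 30 s window, this gives 2 req/s (120 req/min), 150 tokens/s (9000 TPM), and 4 kB/s.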

2.4 Counts and ratios

| Quantity | Unit | Description |
|---|---|---|
| Total requests | 1 | Cumulative count |
| Success / failure | 1 | Counts or ratio (dimensionless) |
| Rate limit hits | 1 | Count of requests delayed or blocked by limiter |
| Utilization | 0–1 or % | e.g. token bucket utilization, queue depth / max |


3. Where metrics are collected

3.1 Inbound (clients → mindX)

  • Middleware: InboundMetricsMiddleware in mindx_backend_service/inbound_metrics.py records per-request latency \(T_{\mathrm{lat}}\) (ms), request body size \(B_{\mathrm{in}}\) (bytes), and response body size \(B_{\mathrm{out}}\) (bytes). An optional inbound rate limit (req/min) returns HTTP 429 when exceeded.
  • Aggregates: get_metrics(window_s) returns total_requests, total_latency_ms, average_latency_ms, total_request_bytes, total_response_bytes, requests_per_minute (in window), rate_limit_rejects, latency_p50_ms, latency_p90_ms, latency_p99_ms.
  • API: GET /api/monitoring/inbound — returns inbound_metrics (scientific units) and inbound_rate_limit (requests_per_minute, window_s). Enable limit via set_inbound_rate_limit(requests_per_minute, window_s).
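
To make the aggregate fields above concrete, here is a framework-free sketch of a windowed recorder with an optional request limit. It is not the project's InboundMetricsMiddleware: the class `InboundWindow`, the `admit` and `record` methods, the nearest-rank percentile, and the choice to compute every aggregate over the window are all assumptions made for illustration.

```python
import time
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple


def nearest_rank(sorted_values: List[float], percent: int) -> float:
    """Nearest-rank percentile of an ascending list; 0.0 when empty."""
    if not sorted_values:
        return 0.0
    rank = (percent * len(sorted_values) + 99) // 100  # integer ceiling
    return sorted_values[max(rank, 1) - 1]


class InboundWindow:
    """Hypothetical in-memory recorder for inbound request metrics."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # Each sample: (timestamp_s, latency_ms, bytes_in, bytes_out)
        self._samples: Deque[Tuple[float, float, int, int]] = deque()
        self._admitted: Deque[float] = deque()
        self._limit: Optional[Tuple[int, float]] = None
        self.rate_limit_rejects = 0

    def set_rate_limit(self, requests_per_minute: int, window_s: float = 60.0) -> None:
        self._limit = (requests_per_minute, window_s)

    def admit(self) -> bool:
        """Return False when the caller should answer with HTTP 429."""
        if self._limit is None:
            return True
        rpm, window_s = self._limit
        now = self._clock()
        # Drop admissions that have left the sliding window.
        while self._admitted and now - self._admitted[0] >= window_s:
            self._admitted.popleft()
        if len(self._admitted) >= rpm * window_s / 60.0:
            self.rate_limit_rejects += 1
            return False
        self._admitted.append(now)
        return True

    def record(self, latency_ms: float, bytes_in: int, bytes_out: int) -> None:
        self._samples.append((self._clock(), latency_ms, bytes_in, bytes_out))

    def get_metrics(self, window_s: float = 60.0) -> Dict[str, float]:
        now = self._clock()
        recent = [s for s in self._samples if now - s[0] <= window_s]
        latencies = sorted(s[1] for s in recent)
        total_latency = sum(latencies)
        n = len(recent)
        return {
            "total_requests": n,
            "total_latency_ms": total_latency,
            "average_latency_ms": total_latency / n if n else 0.0,
            "total_request_bytes": sum(s[2] for s in recent),
            "total_response_bytes": sum(s[3] for s in recent),
            "requests_per_minute": n * 60.0 / window_s,
            "rate_limit_rejects": self.rate_limit_rejects,
            "latency_p50_ms": nearest_rank(latencies, 50),
            "latency_p90_ms": nearest_rank(latencies, 90),
            "latency_p99_ms": nearest_rank(latencies, 99),
        }
```

A real middleware would call `admit()` before handling a request and `record()` after the response is sent. The injectable clock exists only so the behaviour can be checked without sleeping.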

3.2 Outbound (mindX → Ollama / LLMs)

  • Per provider (e.g. api/ollama/ollama_url.py): total_requests, successful_requests, failed_requests, rate_limit_hits, total_tokens, average_latency_ms, rate_limits.rpm, rate_limits.tpm.
  • Rate limiter (llm/rate_limiter.py): wait_time_ms, wait_time_p50/p90/p99, token_utilization, requests_per_minute, requests_per_hour; get_metrics() returns these.
  • Units: Latency in ms, tokens as count, rate as req/min and req/h.
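
The outbound fields `wait_time_ms`, `token_utilization`, and `rate_limit_hits` fit naturally onto a token bucket. The following is a generic sketch of that technique, not the contents of llm/rate_limiter.py: the class `TokenBucket`, its `burst` parameter, and the definition of utilization as the consumed fraction of the bucket are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Hypothetical outbound limiter; one token permits one request."""

    def __init__(self, rpm: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rpm <= 0 or burst < 1:
            raise ValueError("rpm must be > 0 and burst >= 1")
        self.rate_per_s = rpm / 60.0   # refill rate in tokens per second
        self.capacity = float(burst)   # maximum number of stored tokens
        self.rate_limit_hits = 0       # refused acquisitions so far
        self._clock = clock
        self._tokens = self.capacity   # the bucket starts full
        self._last = clock()

    def _refill(self) -> None:
        now = self._clock()
        elapsed_s = now - self._last
        self._tokens = min(self.capacity,
                           self._tokens + elapsed_s * self.rate_per_s)
        self._last = now

    def try_acquire(self) -> bool:
        """Take one token if available; otherwise count a rate limit hit."""
        self._refill()
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        self.rate_limit_hits += 1
        return False

    def wait_time_ms(self) -> float:
        """Time in ms until one token is available (0.0 if one is ready)."""
        self._refill()
        missing = max(0.0, 1.0 - self._tokens)
        return missing / self.rate_per_s * 1000.0

    def utilization(self) -> float:
        """Fraction of the bucket currently consumed, from 0.0 to 1.0."""
        self._refill()
        return 1.0 - self._tokens / self.capacity
```

With `rpm=60` and `burst=2`, two calls pass immediately, a third is refused, and `wait_time_ms()` then reports 1000 ms, since the bucket refills at one token per second.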

3.3 Services (internal)

  • PerformanceMonitor (agents/monitoring/performance_monitor.py): total_calls, successful_calls, failed_calls, total_latency_ms, latencies_ms, total_prompt_tokens, total_completion_tokens, total_cost.
  • Use the same units: ms for latency, tokens for counts, USD or equivalent for cost where applicable.
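
The PerformanceMonitor counters listed above are raw totals; the dimensionless ratios and per-call averages from section 2.4 can be derived from them. This sketch assumes the counters are plain numbers passed in by the caller. The function `summarize_calls` and its output keys are hypothetical, and cost is treated as USD.

```python
from typing import Dict


def summarize_calls(total_calls: int, successful_calls: int,
                    total_latency_ms: float, total_prompt_tokens: int,
                    total_completion_tokens: int,
                    total_cost: float) -> Dict[str, float]:
    """Derive ratios and averages from cumulative call counters."""
    total_tokens = total_prompt_tokens + total_completion_tokens
    if total_calls == 0:
        # No calls yet: report zeros rather than dividing by zero.
        return {"success_ratio": 0.0, "average_latency_ms": 0.0,
                "tokens_per_call": 0.0, "cost_per_1k_tokens": 0.0}
    return {
        "success_ratio": successful_calls / total_calls,        # 0 to 1
        "average_latency_ms": total_latency_ms / total_calls,   # ms
        "tokens_per_call": total_tokens / total_calls,          # tokens
        "cost_per_1k_tokens": (total_cost * 1000.0 / total_tokens
                               if total_tokens else 0.0),       # USD
    }
```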


4. API and config

  • Rate limiter API: llm/rate_limiter.py provides RateLimiter.get_metrics(), DualLayerRateLimiter.get_metrics(), and HourlyRateLimiter.get_metrics(); api/llm_routes.py provides rate limit status and update endpoints.
  • Provider YAML: models/*.yaml sets rate_limits (rpm, rph) and an optional quota (total_calls, period_days) for even distribution.
  • Factory config: data/config/llm_factory_config.json defines rate_limit_profiles (rpm, rph, strict, very_strict, etc.).
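
One plausible reading of "even distribution" is that a quota of total_calls over period_days is spread uniformly across the period and then capped by the configured rph. The sketch below works through that arithmetic. How mindX actually interprets the quota is not stated here, so the function `even_quota_rate` and its return keys are assumptions.

```python
from typing import Dict, Optional


def even_quota_rate(total_calls: int, period_days: float,
                    configured_rph: Optional[float] = None) -> Dict[str, float]:
    """Spread a call quota uniformly over its period.

    Returns the effective requests per hour, the same rate per minute,
    and the minimum spacing between calls in seconds.
    """
    if total_calls <= 0 or period_days <= 0:
        raise ValueError("total_calls and period_days must be positive")
    quota_rph = total_calls / (period_days * 24.0)
    # The stricter of the quota-derived rate and the configured rph wins.
    rph = quota_rph if configured_rph is None else min(quota_rph, configured_rph)
    return {
        "rph": rph,
        "rpm": rph / 60.0,
        "min_interval_s": 3600.0 / rph,
    }
```

For instance, 7200 calls over 30 days works out to 10 req/h, or one call every 360 s.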


5. Summary

  • Ingestion, inference, and services: Monitor and apply rate control in both directions (inbound and outbound).
  • Scientific metrics: Use s or ms for time, bytes for payload size, req/s or req/min for request rate, tokens and tokens/s or TPM for token throughput, and dimensionless counts/ratios where appropriate.
  • Actual metrics: Exposed via get_metrics() on limiters and API clients, PerformanceMonitor, and optional inbound middleware; persist or export as needed for dashboards and alerts.
