MindXAgent now includes a sliding-scale optimization system that automatically tunes the inference frequency for Ollama models based on collected performance data.
The optimizer is implemented in `agents/core/inference_optimizer.py`. Every inference request is recorded with its timestamp, model name, input/output token counts, latency, and success status.
Metrics are aggregated into fixed time windows (default: 5 minutes).
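A window can be modeled roughly like this (a minimal sketch; the `WindowMetrics` name and `record` helper are illustrative assumptions, not the module's actual API — the field names mirror the persisted schema shown later in this document):

```python
from dataclasses import dataclass

@dataclass
class WindowMetrics:
    """One aggregation window (default: 300 s) of per-request metrics."""
    frequency: float          # requests/minute in effect during the window
    start_time: float
    end_time: float
    total_requests: int = 0
    successful_requests: int = 0
    avg_latency_ms: float = 0.0
    total_output_tokens: int = 0

    def record(self, latency_ms: float, output_tokens: int, success: bool) -> None:
        # Running average keeps memory constant regardless of request volume.
        self.avg_latency_ms = (
            (self.avg_latency_ms * self.total_requests + latency_ms)
            / (self.total_requests + 1)
        )
        self.total_requests += 1
        self.successful_requests += int(success)
        self.total_output_tokens += output_tokens

    @property
    def error_rate(self) -> float:
        failed = self.total_requests - self.successful_requests
        return failed / self.total_requests if self.total_requests else 0.0
```

Aggregating per-window rather than keeping every raw request bounds memory while still supporting the per-frequency scoring described below.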
Each frequency is scored on throughput, error rate, latency, and success rate (see the scoring formula below).
The optimizer uses a sliding-scale approach within configurable bounds. Defaults:
```python
InferenceOptimizer(
    min_frequency=1.0,           # Minimum: 1 request/minute
    max_frequency=120.0,         # Maximum: 120 requests/minute
    initial_frequency=10.0,      # Start: 10 requests/minute
    window_duration=300,         # 5 minutes per window
    optimization_interval=600,   # Optimize every 10 minutes
)
```
```python
# In ollama_chat_manager.py initialization
self.inference_optimizer = InferenceOptimizer(
    config=self.config,
    metrics_file=Path("data/custom_metrics.json"),
    min_frequency=5.0,            # Custom minimum
    max_frequency=60.0,           # Custom maximum
    initial_frequency=20.0,       # Custom starting point
    window_duration=600,          # 10 minutes per window
    optimization_interval=1200,   # Optimize every 20 minutes
)
```
The optimizer runs automatically when OllamaChatManager is initialized:
```python
# Already integrated in MindXAgent
mindx_agent = await MindXAgent.get_instance()
# Optimization is active automatically

# Get current optimal frequency
frequency = mindx_agent.get_optimal_inference_frequency()
print(f"Optimal frequency: {frequency} rpm")

# Get optimization metrics
metrics = mindx_agent.get_inference_optimization_metrics()
print(f"Total requests: {metrics['total_requests']}")
print(f"Success rate: {metrics['recent_success_rate'] * 100:.1f}%")
print(f"Avg latency: {metrics['recent_avg_latency_ms']:.0f}ms")
```
```python
if mindx_agent.ollama_chat_manager:
    optimizer = mindx_agent.ollama_chat_manager.inference_optimizer
    if optimizer:
        # Get current frequency
        freq = optimizer.get_current_frequency()

        # Get metrics summary
        summary = optimizer.get_metrics_summary()

        # Manually trigger optimization
        optimal = await optimizer.optimize_frequency()
```
Metrics are persisted to `data/inference_optimizer_metrics.json` by default (the location is configurable via the `metrics_file` parameter):

```json
{
  "metrics": [
    {
      "timestamp": 1705536000.0,
      "model": "mistral-nemo:latest",
      "input_tokens": 50,
      "output_tokens": 100,
      "latency_ms": 1500.0,
      "success": true
    }
  ],
  "frequency_windows": [
    {
      "frequency": 10.0,
      "start_time": 1705536000.0,
      "end_time": 1705536300.0,
      "total_requests": 50,
      "successful_requests": 48,
      "failed_requests": 2,
      "avg_latency_ms": 1200.0,
      "total_input_tokens": 2500,
      "total_output_tokens": 5000,
      "throughput_tokens_per_sec": 25.0,
      "error_rate": 0.04
    }
  ],
  "current_frequency": 12.5,
  "optimal_frequency": 12.5,
  "last_updated": "2026-01-17T22:00:00"
}
```
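Because the format is plain JSON, the persisted state can be inspected with nothing but the standard library (a sketch; `load_optimizer_state` is a hypothetical helper written for illustration, not part of the module):

```python
import json
from pathlib import Path

def load_optimizer_state(path: Path) -> dict:
    """Load the persisted optimizer state and summarize its windows."""
    state = json.loads(path.read_text())
    windows = state.get("frequency_windows", [])
    return {
        "current_frequency": state.get("current_frequency"),
        "optimal_frequency": state.get("optimal_frequency"),
        "windows": len(windows),
        # Request-weighted error rate across all recorded windows.
        "overall_error_rate": (
            sum(w["failed_requests"] for w in windows)
            / max(1, sum(w["total_requests"] for w in windows))
        ),
    }
```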
```
score = (throughput * 0.4) +
        (error_rate * -0.3) +
        (latency_penalty * -0.2) +
        (success_bonus * 0.1)
```

Where:

- `throughput` = tokens per second
- `error_rate` = failed requests / total requests
- `latency_penalty` = average latency in seconds
- `success_bonus` = successful requests / total requests

Frequency adjustments are capped per step:

- Increase: `new_frequency = current + min(10, (max - current) * 0.1)`
- Decrease: `new_frequency = current - min(10, (current - min) * 0.2)`
- Otherwise: `new_frequency = current` (no change)
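The formulas above translate directly into a pair of pure functions (a sketch using the stated weights and step caps; the function names are illustrative, and the real implementation may differ):

```python
def score_window(throughput: float, error_rate: float,
                 avg_latency_s: float, success_rate: float) -> float:
    # Higher throughput and success rate raise the score;
    # errors and latency lower it, per the weights above.
    return (throughput * 0.4
            + error_rate * -0.3
            + avg_latency_s * -0.2
            + success_rate * 0.1)

def adjust_frequency(current: float, direction: str,
                     min_freq: float = 1.0, max_freq: float = 120.0) -> float:
    # Steps are capped at 10 rpm and scaled by the remaining headroom,
    # so adjustments slow down as the bounds are approached.
    if direction == "increase":
        return current + min(10, (max_freq - current) * 0.1)
    if direction == "decrease":
        return current - min(10, (current - min_freq) * 0.2)
    return current  # no change
```

Scaling the step by distance-to-bound means the frequency converges smoothly instead of oscillating between the limits.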
The optimization system starts automatically at agent initialization:

```
MindXAgent._async_init()
        ↓
OllamaChatManager.initialize()
        ↓
InferenceOptimizer.start_optimization_loop()
        ↓
Metrics collection begins
        ↓
Optimization runs every 10 minutes
```
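The `start_optimization_loop()` step amounts to a background task along these lines (a minimal asyncio sketch; only `optimize_frequency()` appears in the documented API, the rest is an assumption):

```python
import asyncio
from typing import Optional

async def optimization_loop(optimizer, interval: float = 600.0,
                            stop: Optional[asyncio.Event] = None) -> None:
    """Periodically re-run frequency optimization until `stop` is set."""
    stop = stop or asyncio.Event()
    while not stop.is_set():
        await optimizer.optimize_frequency()
        try:
            # Wake up early if shutdown is requested mid-interval.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
```

Using an `Event` for the sleep (instead of a bare `asyncio.sleep`) lets a shutdown interrupt the loop immediately rather than waiting out the remaining interval.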
```python
# Get current status
metrics = mindx_agent.get_inference_optimization_metrics()

# Check optimization status
if metrics.get("status") == "no_data":
    print("Collecting initial data...")
else:
    print(f"Frequency: {metrics['current_frequency']} rpm")
    print(f"Requests: {metrics['total_requests']}")
    print(f"Success: {metrics['recent_success_rate'] * 100:.1f}%")
```
```python
# Access optimizer directly
optimizer = mindx_agent.ollama_chat_manager.inference_optimizer

# Get all windows
windows = optimizer.frequency_windows

# Analyze trends
for window in windows[-10:]:  # Last 10 windows
    print(f"{window.frequency} rpm: {window.error_rate * 100:.1f}% errors, "
          f"{window.throughput_tokens_per_sec:.1f} tok/s")
```
Quick reference: the optimizer instance is available at `mindx_agent.ollama_chat_manager.inference_optimizer`, and its metrics are persisted to `data/inference_optimizer_metrics.json`.