The SystemHealthTool provides system monitoring and health management for mindX. It monitors system resources (CPU, memory, disk, network, temperatures) and performs basic remediation actions to maintain system health.
File: tools/system_health_tool.py
Class: SystemHealthTool
Version: 1.0.0
Status: ✅ Active
class SystemHealthTool(BaseTool):
- tasks: Dict[str, Callable] - Task dispatcher
- config: Config - Configuration object
- _send_email_alert(): Email notification helper
monitor_cpu
Parameters: None (uses config thresholds)
Returns:
{
"status": "OK" | "ALERT",
"cpu_usage": float,
"threshold": float,
"message": str # Only if ALERT
}
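The response shape above can be produced by a simple threshold comparison. The following is a minimal sketch, not the tool's actual implementation: the helper name `evaluate_cpu` is hypothetical, the usage figure is assumed to come from something like `psutil.cpu_percent(interval=1)`, and treating the threshold as a strict upper bound is an assumption.

```python
from typing import Any, Dict


def evaluate_cpu(cpu_usage: float, threshold: float = 90.0) -> Dict[str, Any]:
    """Build a monitor_cpu-style response (hypothetical helper)."""
    result: Dict[str, Any] = {
        "status": "OK",
        "cpu_usage": cpu_usage,
        "threshold": threshold,
    }
    # Assumption: an alert fires only when usage is strictly above the threshold.
    if cpu_usage > threshold:
        result["status"] = "ALERT"
        result["message"] = (
            f"CPU usage {cpu_usage:.1f}% exceeds threshold {threshold:.1f}%"
        )
    return result
```

Note that `message` is only present in the ALERT case, so callers should read it with `result.get("message")` unless they have already checked `status`.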
monitor_memory_disk
Parameters: None
Returns:
{
"status": "OK" | "ALERT",
"memory_usage": float,
"disk_usage": float,
"message": str # Only if ALERT
}
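A single status covers both memory and disk here, so an ALERT may come from either one. The sketch below shows one way such a combined result could be built. The helper names are hypothetical; the disk figure uses the standard library's `shutil.disk_usage`, while the memory figure is assumed to be supplied by the caller (for example from `psutil.virtual_memory().percent`). The default thresholds mirror the configuration values documented for this tool.

```python
import shutil
from typing import Any, Dict, List


def disk_usage_percent(path: str = "/") -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


def evaluate_memory_disk(
    memory_usage: float,
    disk_usage: float,
    mem_threshold: float = 90.0,
    disk_threshold: float = 80.0,
) -> Dict[str, Any]:
    """Build a monitor_memory_disk-style response (hypothetical helper)."""
    problems: List[str] = []
    if memory_usage > mem_threshold:
        problems.append(f"Memory usage {memory_usage:.1f}% > {mem_threshold:.1f}%")
    if disk_usage > disk_threshold:
        problems.append(f"Disk usage {disk_usage:.1f}% > {disk_threshold:.1f}%")

    result: Dict[str, Any] = {
        "status": "ALERT" if problems else "OK",
        "memory_usage": memory_usage,
        "disk_usage": disk_usage,
    }
    if problems:
        # One message that names every resource over its limit.
        result["message"] = "; ".join(problems)
    return result
```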
monitor_network
Parameters: None
Returns:
{
"status": "OK" | "ALERT",
"sent_kbs": float,
"recv_kbs": float,
"message": str # Only if ALERT
}
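Throughput in KB/s has to be derived from two readings of cumulative byte counters taken some time apart. The sketch below illustrates that arithmetic under stated assumptions: the helper names are hypothetical, the counters are assumed to be values such as `bytes_sent` and `bytes_recv` from `psutil.net_io_counters()`, 1 KB is taken as 1024 bytes, and the alert rule (either direction above the threshold) is a guess at the behavior.

```python
from typing import Any, Dict


def throughput_kbs(prev_bytes: int, curr_bytes: int, elapsed_s: float) -> float:
    """Convert two cumulative byte counters into a KB/s rate."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (curr_bytes - prev_bytes) / 1024 / elapsed_s


def evaluate_network(
    sent_kbs: float, recv_kbs: float, threshold_kbs: float = 1000.0
) -> Dict[str, Any]:
    """Build a monitor_network-style response (hypothetical helper)."""
    result: Dict[str, Any] = {
        "status": "OK",
        "sent_kbs": sent_kbs,
        "recv_kbs": recv_kbs,
    }
    # Assumption: traffic in either direction can trigger the alert.
    if max(sent_kbs, recv_kbs) > threshold_kbs:
        result["status"] = "ALERT"
        result["message"] = (
            f"Network throughput above {threshold_kbs:.0f} KB/s "
            f"(sent {sent_kbs:.1f}, received {recv_kbs:.1f})"
        )
    return result
```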
monitor_temperatures
Parameters: None
Returns: Temperature data (implementation may vary)
get_top_cpu_processes
Parameters: None
Returns:
{
"status": "SUCCESS" | "ERROR",
"processes": List[str],
"message": str # Only if ERROR
}
clean_log_directory
Parameters:
- directory (str, optional): Log directory path (default: "/var/log/aion")
Returns:
{
"status": "SUCCESS" | "ERROR",
"removed_count": int,
"errors": List[str],
"message": str
}
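The source does not say which files `clean_log_directory` removes, so the sketch below should be read as one plausible policy rather than the tool's behavior: it deletes regular files older than a hypothetical `max_age_days` cutoff and reports the outcome in the response shape shown above. The function name and the age parameter are both assumptions.

```python
import time
from pathlib import Path
from typing import Any, Dict, List


def clean_old_logs(directory: str, max_age_days: float = 7.0) -> Dict[str, Any]:
    """Remove files older than `max_age_days` (hypothetical cleanup policy)."""
    root = Path(directory)
    if not root.is_dir():
        return {
            "status": "ERROR",
            "removed_count": 0,
            "errors": [],
            "message": f"Not a directory: {directory}",
        }

    cutoff = time.time() - max_age_days * 86400
    removed = 0
    errors: List[str] = []
    for path in root.iterdir():
        try:
            # Only top-level regular files are considered; subdirectories are skipped.
            if path.is_file() and path.stat().st_mtime < cutoff:
                path.unlink()
                removed += 1
        except OSError as exc:
            # Keep going so one unreadable file does not abort the cleanup.
            errors.append(f"{path}: {exc}")

    return {
        "status": "SUCCESS",
        "removed_count": removed,
        "errors": errors,
        "message": f"Removed {removed} file(s) from {directory}",
    }
```

Collecting per-file failures in `errors` instead of raising lets a caller see partial progress, which matches the presence of both `removed_count` and `errors` in the documented response.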
update_man_db
Parameters: None
Returns:
{
"status": "SUCCESS" | "ERROR" | "SKIPPED",
"executed": bool,
"message": str
}
kill_stale_processes (self_healing)
Parameters:
- max_runtime_hours (int, optional): Max runtime in hours (default: 1)
- process_name (str, optional): Process name filter (default: "python")
Returns:
{
"status": "SUCCESS",
"killed_count": int,
"details": List[Dict[str, Any]]
}
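Before anything is killed, the candidate processes have to be selected by name and runtime. The sketch below covers only that selection step and never terminates anything. It assumes process records shaped like the `info` dicts yielded by `psutil.process_iter(["pid", "name", "create_time"])`; the function name `find_stale`, the substring name match, and the fields in each returned detail are assumptions.

```python
import time
from typing import Any, Dict, Iterable, List, Optional


def find_stale(
    processes: Iterable[Dict[str, Any]],
    max_runtime_hours: float = 1,
    process_name: str = "python",
    now: Optional[float] = None,
) -> List[Dict[str, Any]]:
    """Return details for processes that match the name and exceed the runtime."""
    now = time.time() if now is None else now
    limit_s = max_runtime_hours * 3600
    stale: List[Dict[str, Any]] = []
    for info in processes:
        name = info.get("name") or ""
        created = info.get("create_time")
        # Assumption: the filter is a substring match, so "python3" also matches.
        if created is None or process_name not in name:
            continue
        runtime_s = now - created
        if runtime_s > limit_s:
            stale.append(
                {
                    "pid": info["pid"],
                    "name": name,
                    "runtime_hours": round(runtime_s / 3600, 2),
                }
            )
    return stale
```

Keeping selection separate from termination makes it possible to log or review the candidate list first, which is worth considering given that the default filter `"python"` can match unrelated interpreters.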
from tools.system_health_tool import SystemHealthTool
from utils.config import Config
config = Config()
tool = SystemHealthTool(config=config)
# Monitor CPU
result = await tool.execute(task="monitor_cpu")
if result["status"] == "ALERT":
    print(f"CPU Alert: {result['message']}")
# Monitor memory and disk
result = await tool.execute(task="monitor_memory_disk")
# Kill stale processes
result = await tool.execute(
    task="kill_stale_processes",
    max_runtime_hours=2,
    process_name="python"
)
print(f"Killed {result['killed_count']} stale processes")
# Clean log directory
result = await tool.execute(
    task="clean_log_directory",
    directory="/var/log/mindx"
)
# Alert thresholds
tools.system_health.cpu_alert_threshold: 90 # CPU % threshold
tools.system_health.mem_alert_threshold: 90 # Memory % threshold
tools.system_health.disk_alert_threshold: 80 # Disk % threshold
tools.system_health.network_alert_threshold: 1000 # KB/s threshold
# Email alerts
tools.system_health.email_alerts: false # Enable email alerts
tools.system_health.email_recipient: "admin@example.com"
# Self-healing
tools.system_health.cpu_permit_man_update: 50 # CPU % for man update
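The dotted keys above suggest a flat lookup with per-key fallbacks. The sketch below shows that pattern with a plain mapping standing in for the `Config` object, whose real interface is not shown in this document. The function name `resolve_thresholds` is hypothetical, and using the values listed above as fallbacks is an assumption.

```python
from typing import Any, Dict, Mapping

# Values as listed in the configuration above, used here as fallbacks (assumption).
DOCUMENTED_THRESHOLDS: Dict[str, float] = {
    "tools.system_health.cpu_alert_threshold": 90.0,
    "tools.system_health.mem_alert_threshold": 90.0,
    "tools.system_health.disk_alert_threshold": 80.0,
    "tools.system_health.network_alert_threshold": 1000.0,
    "tools.system_health.cpu_permit_man_update": 50.0,
}


def resolve_thresholds(settings: Mapping[str, Any]) -> Dict[str, float]:
    """Look up each threshold, falling back to the documented value."""
    resolved: Dict[str, float] = {}
    for key, fallback in DOCUMENTED_THRESHOLDS.items():
        value = float(settings.get(key, fallback))
        if value < 0:
            raise ValueError(f"{key} must not be negative, got {value}")
        # Strip the common prefix so callers can use short names.
        resolved[key.rsplit(".", 1)[-1]] = value
    return resolved
```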
Email notifications for critical issues are optional. Configure them via:
tools.system_health.email_alerts: true
tools.system_health.email_recipient: "admin@example.com"
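Since `smtplib` is blocking, sending an alert directly from an async task would stall the event loop for the duration of the SMTP exchange. The sketch below shows one common way around that, using `asyncio.to_thread` (Python 3.9+). It is not the tool's `_send_email_alert()` implementation; the function names, the sender address, and the SMTP host and port are all placeholder assumptions.

```python
import asyncio
import smtplib
from email.message import EmailMessage


def build_alert_email(
    subject: str,
    body: str,
    recipient: str,
    sender: str = "mindx@localhost",  # placeholder sender address
) -> EmailMessage:
    """Assemble a plain-text alert message."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(body)
    return msg


def _send_blocking(msg: EmailMessage, host: str, port: int) -> None:
    """Synchronous send; intended to run in a worker thread."""
    with smtplib.SMTP(host, port, timeout=10) as smtp:
        smtp.send_message(msg)


async def send_alert(msg: EmailMessage, host: str = "localhost", port: int = 25) -> None:
    """Send without blocking the event loop by delegating to a thread."""
    await asyncio.to_thread(_send_blocking, msg, host, port)
```

A caller would build the message from an ALERT result, for example `build_alert_email("CPU Alert", result["message"], "admin@example.com")`, and then `await send_alert(msg)`.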
Automatic remediation:
# In agent plan
plan = [
    {
        "action": "monitor_system_health",
        "task": "monitor_cpu"
    },
    {
        "action": "self_heal",
        "task": "kill_stale_processes",
        "max_runtime_hours": 2
    }
]
The SystemHealthTool can be used with:
# Check all system resources
cpu_result = await tool.execute(task="monitor_cpu")
mem_result = await tool.execute(task="monitor_memory_disk")
net_result = await tool.execute(task="monitor_network")
# Get top processes if CPU is high
if cpu_result["status"] == "ALERT":
    processes = await tool.execute(task="get_top_cpu_processes")
# Clean logs if disk is high
disk_result = await tool.execute(task="monitor_memory_disk")
if disk_result["status"] == "ALERT" and disk_result["disk_usage"] > 85:
    cleanup = await tool.execute(
        task="clean_log_directory",
        directory="/var/log/mindx"
    )
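The monitoring calls are independent of one another, so they could also be issued concurrently. The sketch below illustrates this with `asyncio.gather`, assuming `tool.execute` is safe to call concurrently, which the source does not confirm. `FakeTool` is a stand-in written for this example, and `check_all` and `alerting` are hypothetical helpers.

```python
import asyncio
from typing import Any, Dict, List


class FakeTool:
    """Stand-in for SystemHealthTool so the sketch runs on its own."""

    async def execute(self, task: str, **kwargs: Any) -> Dict[str, Any]:
        await asyncio.sleep(0)  # yield control, as a real check would while waiting
        status = "ALERT" if task == "monitor_cpu" else "OK"
        return {"status": status}


async def check_all(tool: Any, tasks: List[str]) -> Dict[str, Dict[str, Any]]:
    """Run every monitoring task concurrently and key the results by task name."""
    # gather() returns results in the same order as the awaitables passed in.
    results = await asyncio.gather(*(tool.execute(task=t) for t in tasks))
    return dict(zip(tasks, results))


def alerting(results: Dict[str, Dict[str, Any]]) -> List[str]:
    """Names of the tasks whose status is ALERT."""
    return [task for task, res in results.items() if res.get("status") == "ALERT"]


if __name__ == "__main__":
    outcome = asyncio.run(
        check_all(FakeTool(), ["monitor_cpu", "monitor_memory_disk", "monitor_network"])
    )
    print(alerting(outcome))
```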
- psutil: System and process utilities
- smtplib: Email notifications (synchronous)
- subprocess: Process management
- core.bdi_agent.BaseTool: Base tool class
All tasks return structured error responses:
{
"status": "ERROR",
"message": "Error description"
}
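When failures arrive as data like this, a caller that forgets to check `status` can silently continue with an error result. One option is a small wrapper that converts ERROR responses into exceptions. The sketch below is a caller-side convenience, not part of the tool; `HealthTaskError` and `unwrap` are hypothetical names.

```python
from typing import Any, Dict


class HealthTaskError(RuntimeError):
    """Raised when a task response has status ERROR (hypothetical)."""


def unwrap(result: Dict[str, Any]) -> Dict[str, Any]:
    """Return the result unchanged, or raise if it reports an error."""
    if result.get("status") == "ERROR":
        raise HealthTaskError(result.get("message", "unknown error"))
    return result
```

With this in place, `unwrap(await tool.execute(task="get_top_cpu_processes"))` either yields a usable result or raises, so error handling can live in a single `try`/`except` block.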