base_gen_agent_backup.md · 24.7 KB

BaseGenAgent (base_gen_agent.py) - Configurable Codebase Documenter

optimized in markdown for LLM ingestion
mindX/tools/base_gen_agent.py
data/config/basegen_config.json

⚠️ IMPORTANT: Size Optimization Notice

For code auditing use cases, use OptimizedAuditGenAgent instead. The BaseGenAgent can generate extremely large files (up to 327MB) when run on directories containing memory data, making it impractical for LLM processing. See Optimization Assessment below.

Introduction

The BaseGenAgent is a utility agent within the MindX toolkit (Augmentic Project). Its primary function is to automatically generate a comprehensive Markdown document that provides a snapshot of a given codebase directory. This documentation includes a visual directory tree of included files and the complete content of those files, with appropriate language tagging for Markdown syntax highlighting.

A key feature of this agent is its configurability. It intelligently filters files based on:

- .gitignore rules found within the target codebase.
- User-defined include/exclude glob patterns passed via CLI or programmatically.
- A central, modifiable JSON configuration file (basegen_config.json) that specifies hardcoded file/pattern exclusions and language mappings for syntax highlighting.

This makes BaseGenAgent a valuable tool for MindX itself (e.g., for the SelfImprovementAgent or CoordinatorAgent to understand code they are about to modify) or for developers needing a quick, shareable overview of a project or component.

CLI Testing Results ✅

The BaseGenAgent CLI has been validated and works correctly:

<pre><code class="lang-bash"># CLI help
PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py --help

# Test results (June 2025)
PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./utils --include "*.py" -o demo.md
# ✅ Generated: 39KB (5 files analyzed)

PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./tools --include "*.py" -o demo.md
# ✅ Generated: 211KB (21 files analyzed)
</code></pre>

CLI Requirements:

Optimization Assessment

🚨 Critical Size Issues Identified

Giant File Problem:

Size Comparison Results

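| Directory | BaseGenAgent Output | Files Analyzed | Issue |
|-----------|---------------------|----------------|-------|
| ./tools | 211KB | 21 files | ✅ Normal |
| ./utils | 39KB | 5 files | ✅ Normal |
| ./mindX | 327MB | 8000+ files | ❌ Includes memory data |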

Solution: OptimizedAuditGenAgent

For code auditing scenarios, use the specialized OptimizedAuditGenAgent:

<pre><code class="lang-python">from tools.optimized_audit_gen_agent import OptimizedAuditGenAgent
from agents.memory_agent import MemoryAgent

# For code auditing (recommended)
audit_agent = OptimizedAuditGenAgent(MemoryAgent(), max_files_per_chunk=50)
success, result = audit_agent.generate_audit_documentation('./core')

# Results: 99.93% size reduction, chunked output, audit-focused
</code></pre>

OptimizedAuditGenAgent Benefits:

When to Use Each Agent

| Use Case | Recommended Agent | Reason |
|----------|-------------------|--------|
| Code auditing | OptimizedAuditGenAgent | Prevents giant files, audit-focused |
| General documentation | BaseGenAgent | Complete documentation including .md files |
| Memory analysis | Specialized tool needed | Neither agent suitable for memory dumps |

Explanation

Core Functionality

Configuration Loading (_load_agent_config): The agent's behavior is controlled by a JSON configuration file, typically PROJECT_ROOT/data/config/basegen_config.json. A custom path can also be provided during instantiation. If the external config file is not found, it falls back to internal DEFAULT_CONFIG_DATA.

Configuration Merging: Values from an external basegen_config.json are merged with the internal defaults. For lists like HARD_CODED_EXCLUDES, entries are combined and deduplicated. For dictionaries like LANGUAGE_MAPPING and the base_gen_agent settings block, external values update or override internal defaults.

Key Configurable Sections:

- HARD_CODED_EXCLUDES: A list of glob patterns for common binary files, lock files, IDE metadata, temporary files, and version control directories (e.g., .png, node_modules/, .git/) that are generally excluded from code documentation.
- LANGUAGE_MAPPING: A dictionary mapping file extensions (e.g., .py, .rs) to language tags recognized by Markdown for syntax highlighting (e.g., python, rust).
- base_gen_agent: A sub-dictionary for settings specific to this agent:
  - max_file_size_kb_for_inclusion: (Default: 1024KB) Files larger than this will have their content omitted with a warning.
  - default_output_filename: Default name for the output Markdown if not specified by the caller.
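The merge rules described above can be sketched roughly as follows. This is a minimal illustration, not the agent's actual `_load_agent_config` code: the function name `merge_config` and the sample default values are assumptions made for the example.

```python
import copy

# Hypothetical stand-in for the agent's internal defaults (values are illustrative).
DEFAULT_CONFIG_DATA = {
    "HARD_CODED_EXCLUDES": ["*.pyc", ".git/"],
    "LANGUAGE_MAPPING": {".py": "python"},
    "base_gen_agent": {"max_file_size_kb_for_inclusion": 1024},
}


def merge_config(defaults: dict, external: dict) -> dict:
    """Merge an external config into the defaults (assumed helper, not the real API)."""
    merged = copy.deepcopy(defaults)
    for key, value in external.items():
        current = merged.get(key)
        if isinstance(current, list) and isinstance(value, list):
            # Lists: combine and deduplicate, keeping first-seen order.
            merged[key] = list(dict.fromkeys(current + value))
        elif isinstance(current, dict) and isinstance(value, dict):
            # Dictionaries: external values update or override the defaults.
            merged[key] = {**current, **value}
        else:
            # Anything else: the external value simply replaces the default.
            merged[key] = copy.deepcopy(value)
    return merged


external = {
    "HARD_CODED_EXCLUDES": [".git/", "data/memory/"],
    "LANGUAGE_MAPPING": {".rs": "rust"},
    "base_gen_agent": {"max_file_size_kb_for_inclusion": 2048},
}
config = merge_config(DEFAULT_CONFIG_DATA, external)
print(config["HARD_CODED_EXCLUDES"])  # ['*.pyc', '.git/', 'data/memory/']
```

Deep-copying the defaults first keeps the module-level dictionary unchanged, so repeated instantiations would start from the same baseline.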

File Discovery and Filtering Logic (generate_documentation, _should_include_file): The agent recursively scans the target root_path_str directory.

.gitignore Processing (_load_gitignore_specs): If use_gitignore is true (default), it finds all .gitignore files within the root_path_str, aggregates their patterns, and compiles them into a pathspec.PathSpec object. This spec is used to efficiently exclude any files or directories ignored by Git. The .git/ directory itself is always implicitly ignored.

Filtering Precedence for _should_include_file:

  1. If a file matches the gitignore_spec (and use_gitignore is true), it's excluded.
  2. The file is then checked against the combined exclude patterns (CLI/programmatic user_exclude_patterns + HARD_CODED_EXCLUDES from config). If it matches any, it's excluded.
  3. If include_patterns are provided (CLI/programmatic), the file must match at least one of these to be considered further. If it doesn't match any, it's excluded.
  4. If none of the above exclusion rules apply, the file is included.
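The precedence above can be illustrated with a small standalone function. This is a sketch under assumptions, not the agent's `_should_include_file`: the signature is invented for the example, and the gitignore check is modeled as an optional callable (`is_gitignored`) so the snippet runs with the standard library only. In the agent that role would presumably be played by the compiled pathspec.PathSpec object.

```python
import fnmatch
from typing import Callable, Optional, Sequence


def should_include_file(
    rel_path: str,
    exclude_patterns: Sequence[str],
    include_patterns: Optional[Sequence[str]] = None,
    is_gitignored: Optional[Callable[[str], bool]] = None,
) -> bool:
    """Apply the documented precedence to one POSIX-style relative path."""
    # 1. A gitignore match excludes the file outright.
    if is_gitignored is not None and is_gitignored(rel_path):
        return False
    # 2. Any match against the combined exclude patterns excludes it.
    if any(fnmatch.fnmatch(rel_path, pattern) for pattern in exclude_patterns):
        return False
    # 3. If include patterns were given, at least one must match.
    if include_patterns and not any(
        fnmatch.fnmatch(rel_path, pattern) for pattern in include_patterns
    ):
        return False
    # 4. Nothing excluded the file, so it is included.
    return True


print(should_include_file("src/agent.py", ["*.pyc"], ["*.py"]))  # True
print(should_include_file("src/agent.py", [], [".py"]))          # False
```

The second call shows why include patterns need a wildcard: with fnmatch, the bare pattern `.py` only matches a path that is literally `.py`.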

Directory Tree Generation (_build_tree_dict, _format_tree_lines): A list of included files (as relative paths from the root_path_str) is used.

- _build_tree_dict: Constructs a nested dictionary representing the directory hierarchy of these included files.
- _format_tree_lines: Recursively traverses this dictionary to create an indented, human-readable string representation of the tree structure, suitable for Markdown text code blocks. Directories are marked with a trailing /.
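A rough sketch of the two steps, assuming POSIX-style relative paths as input. The function names mirror the private methods named above, but the bodies, the two-space indent, and the sample paths are assumptions for illustration only.

```python
from typing import Dict, List


def build_tree_dict(rel_paths: List[str]) -> Dict[str, dict]:
    """Turn relative file paths into a nested dict; files map to empty dicts."""
    tree: Dict[str, dict] = {}
    for rel_path in rel_paths:
        node = tree
        for part in rel_path.split("/"):
            node = node.setdefault(part, {})
    return tree


def format_tree_lines(tree: Dict[str, dict], indent: str = "") -> List[str]:
    """Recursively render the nested dict as indented lines."""
    lines: List[str] = []
    for name in sorted(tree):
        children = tree[name]
        # Only included files are listed, so any node with children is a directory.
        lines.append(f"{indent}{name}/" if children else f"{indent}{name}")
        lines.extend(format_tree_lines(children, indent + "  "))
    return lines


paths = ["utils/config.py", "utils/logging_config.py", "README.md"]
print("\n".join(format_tree_lines(build_tree_dict(paths))))
```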

Markdown Document Generation (generate_documentation): This is the main public method. It orchestrates scanning, filtering, and tree generation. It then iterates through the list of included files:

- For each file, a Markdown section is created with its POSIX-style relative path as a heading (e.g., ### `src/module/file.py`).
- The file's content is read (UTF-8, replacing errors).
- If a file exceeds max_file_size_kb_for_inclusion, its content is omitted, and a warning is included in the Markdown.
- _guess_language() determines the Markdown language tag for syntax highlighting based on the file extension and the LANGUAGE_MAPPING from the configuration.
- The file content is embedded within a fenced code block (e.g., ```python ... ```).

The complete Markdown content (tree + file contents) is written to the specified output file. The default output path is PROJECT_ROOT/data/generated_docs/_codebase_snapshot.md.

Return Value: Returns a dictionary summarizing the operation: {"status": "SUCCESS"|"ERROR", "message": str, "output_file": str|None, "files_included": int}. This structured return makes it suitable for programmatic use by other MindX agents.
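The per-file step might look roughly like the sketch below. The helper names `guess_language` and `render_file_section`, the sample mapping, and the wording of the omission note are all assumptions; only the behavior (heading, size limit, UTF-8 with replacement, language tag lookup) follows the description above.

```python
from pathlib import Path
from typing import Dict

# Illustrative mapping; the real one comes from LANGUAGE_MAPPING in the config.
LANGUAGE_MAPPING: Dict[str, str] = {".py": "python", ".rs": "rust", ".md": "markdown"}
FENCE = "`" * 3  # built at runtime so this example does not nest code fences


def guess_language(rel_path: str, mapping: Dict[str, str]) -> str:
    """Return the language tag for a file extension, or '' if unknown."""
    return mapping.get(Path(rel_path).suffix.lower(), "")


def render_file_section(root: Path, rel_path: str, max_kb: int = 1024) -> str:
    """Render one file as a heading plus a fenced code block (hypothetical helper)."""
    file_path = root / rel_path
    heading = f"### `{Path(rel_path).as_posix()}`"
    size_kb = file_path.stat().st_size / 1024
    if size_kb > max_kb:
        # Oversized files keep their heading but lose their content.
        return f"{heading}\n\n*Content omitted: {size_kb:.0f} KB exceeds the {max_kb} KB limit.*\n"
    content = file_path.read_text(encoding="utf-8", errors="replace")
    lang = guess_language(rel_path, LANGUAGE_MAPPING)
    return f"{heading}\n\n{FENCE}{lang}\n{content}\n{FENCE}\n"
```

Calling `render_file_section(Path("./utils"), "config.py")` would return a single Markdown section as a string; the file name here is only an example.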

Agent Structure

Technical Details

- .gitignore: pathspec library (requires pip install pathspec).
- Include/Exclude Globs: Python's fnmatch module.

Usage

As a Standalone CLI Tool

The agent can be executed directly:

<pre><code class="lang-bash">PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py <input_dir> [options] </code></pre>

⚠️ Warning: Avoid running on directories containing data/memory/ or large datasets as it can generate unusably large files (327MB+).

Key Arguments:

Example:

<pre><code class="lang-bash">PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./core \
  -o ./docs/core_docs.md \
  --include "*.py" \
  --exclude "/test" \
  --exclude "data/memory/"  # Prevent giant files
</code></pre>

The CLI will print a JSON object summarizing the result.

Programmatic Usage by Other MindX Agents

The CoordinatorAgent or StrategicEvolutionAgent can instantiate and call BaseGenAgent to get a structured understanding of a component they intend to analyze or modify.

<pre><code class="lang-python">from pathlib import Path

# BaseGenAgent lives in tools/base_gen_agent.py (see the paths at the top of this document)
from tools.base_gen_agent import BaseGenAgent
from agents.memory_agent import MemoryAgent

# Initialize with MemoryAgent (required)
memory_agent = MemoryAgent()
base_gen = BaseGenAgent(memory_agent=memory_agent)

target_module_path = "./utils"
output_doc_path = "./docs/utils_snapshot.md"

result = base_gen.generate_markdown_summary(
    root_path_str=target_module_path,
    output_file_str=output_doc_path,
    include_patterns=["*.py"],
    user_exclude_patterns=["test", "data/memory/"],  # Prevent giant files
)

if result["status"] == "SUCCESS":
    print(f"BaseGenAgent created doc: {result['output_file']}")
    # Markdown content can now be read and fed to an LLM for analysis
    markdown_content = Path(result["output_file"]).read_text()
else:
    print(f"BaseGenAgent failed: {result['message']}")
</code></pre>

Dual-Agent Integration for Auditing

For comprehensive auditing workflows, use both agents strategically:

<pre><code class="lang-python">from tools.base_gen_agent import BaseGenAgent
from tools.optimized_audit_gen_agent import OptimizedAuditGenAgent
from agents.memory_agent import MemoryAgent

class AuditAndImproveTool:
    def __init__(self, memory_agent):
        # Keep both for different use cases
        self.base_gen_agent = BaseGenAgent(memory_agent)  # General docs
        self.audit_gen_agent = OptimizedAuditGenAgent(memory_agent)  # Code auditing

    def execute(self, target_path: str, audit_mode: bool = True):
        if audit_mode:
            # Use optimized agent for code auditing (99.93% smaller)
            return self.audit_gen_agent.generate_audit_documentation(target_path)
        else:
            # Use original for general documentation
            return self.base_gen_agent.generate_markdown_summary(target_path)
</code></pre>

Configuration

The BaseGenAgent relies on a JSON configuration file. The primary default location for this file is PROJECT_ROOT/data/config/basegen_config.json. This path can be overridden during instantiation or via the --config-file CLI argument.

Key configurable sections:

- HARD_CODED_EXCLUDES: A list of glob patterns for files and directories to always exclude (e.g., .pyc, __pycache__/, .git/).
- LANGUAGE_MAPPING: A dictionary mapping file extensions to language identifiers for Markdown code blocks (e.g., ".py": "python").
- base_gen_agent_settings: An object containing settings specific to this agent:
  - max_file_size_kb_for_inclusion: Maximum size (in KB) for a file's content to be included directly. Larger files will have their content omitted with a note. (Default: 1024KB, Config: 2048KB)
  - default_output_filename_stem: The base name for the output Markdown file if not specified. (Default: "codebase_snapshot")

Example basegen_config.json snippet (relevant parts):

<pre><code class="lang-json">{
  "HARD_CODED_EXCLUDES": [
    "*.pyc",
    "__pycache__/",
    ".git/",
    "data/memory/",
    "*.log"
  ],
  "LANGUAGE_MAPPING": {
    ".py": "python",
    ".js": "javascript",
    ".md": "markdown"
  },
  "base_gen_agent_settings": {
    "max_file_size_kb_for_inclusion": 1024,
    "default_output_filename_stem": "codebase_snapshot"
  }
}
</code></pre>

Current Production Configuration

The current basegen_config.json includes:

Recommendation: Add "data/memory/" and "data/logs/" to HARD_CODED_EXCLUDES to prevent giant files.

Command-Line Interface (CLI) Usage

Validated CLI Functionality ✅

The BaseGenAgent CLI has been tested and confirmed working:

<pre><code class="lang-bash"># Prerequisites
export PYTHONPATH=/path/to/mindX

# Basic usage (tested June 2025)
python3 tools/base_gen_agent.py <input_dir> [options]

# Successful test examples:
python3 tools/base_gen_agent.py ./utils --include "*.py" -o demo.md
# ✅ Output: 39KB, 5 files analyzed

python3 tools/base_gen_agent.py ./tools --include "*.py" -o demo.md
# ✅ Output: 211KB, 21 files analyzed
</code></pre>

Arguments:

Positional:

Options:

Special Configuration Operations:

List Operations for HARD_CODED_EXCLUDES:

<pre><code class="lang-bash"># Append items
--update-config '{"HARD_CODED_EXCLUDES": [{"_LIST_OP_":"APPEND_UNIQUE"}, "data/memory/", "*.log"]}'

# Remove items
--update-config '{"HARD_CODED_EXCLUDES": [{"_LIST_OP_":"REMOVE"}, "*.tmp"]}'

# Replace entirely
--update-config '{"HARD_CODED_EXCLUDES": ["*.py", "*.js"]}'
</code></pre>

Deep merge for dictionaries: <pre><code class="lang-bash">--update-config '{"LANGUAGE_MAPPING": {".foo": "bar", "_MERGE_DEEP_": true}}' </code></pre>
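One way to read the list operations shown above is sketched here. The semantics are inferred from the operation names and the examples, not taken from the agent's source, and `apply_list_update` is a hypothetical helper. The `_MERGE_DEEP_` dictionary flag is not covered.

```python
from typing import Any, List


def apply_list_update(current: List[str], update: List[Any]) -> List[str]:
    """Apply an --update-config list value to an existing list (assumed semantics)."""
    if update and isinstance(update[0], dict) and "_LIST_OP_" in update[0]:
        operation, items = update[0]["_LIST_OP_"], update[1:]
        if operation == "APPEND_UNIQUE":
            # Add only entries that are not already present, preserving order.
            return list(dict.fromkeys(current + items))
        if operation == "REMOVE":
            return [entry for entry in current if entry not in items]
        raise ValueError(f"Unknown list operation: {operation}")
    # No operation marker: the new list replaces the old one entirely.
    return list(update)


excludes = ["*.pyc", "*.tmp"]
excludes = apply_list_update(excludes, [{"_LIST_OP_": "APPEND_UNIQUE"}, "data/memory/", "*.pyc"])
print(excludes)  # ['*.pyc', '*.tmp', 'data/memory/']
```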

CLI Output:

The CLI prints a JSON object indicating the status:

<pre><code class="lang-json">{
  "status": "SUCCESS",
  "message": "Markdown documentation generated successfully: /path/to/output.md",
  "output_file": "/path/to/output.md",
  "files_included": 21
}
</code></pre>
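A calling script could consume that JSON summary along these lines. This is a minimal sketch that assumes the captured standard output contains only the JSON object; `parse_cli_result` is a hypothetical helper, and the sample string simply repeats the output shown above.

```python
import json


def parse_cli_result(stdout: str) -> dict:
    """Parse the CLI's JSON summary and raise if the run did not succeed."""
    result = json.loads(stdout)
    if result.get("status") != "SUCCESS":
        raise RuntimeError(result.get("message", "BaseGenAgent reported an error"))
    return result


sample_stdout = """
{
  "status": "SUCCESS",
  "message": "Markdown documentation generated successfully: /path/to/output.md",
  "output_file": "/path/to/output.md",
  "files_included": 21
}
"""
summary = parse_cli_result(sample_stdout)
print(summary["output_file"], summary["files_included"])  # /path/to/output.md 21
```

Checking the "status" field rather than the process exit code keeps the sketch independent of exit-code conventions, which are not listed here.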

Exit Codes:

Example CLI Calls:

<pre><code class="lang-bash"># Basic documentation generation
PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./core -o core_docs.md

# Python files only with memory exclusion (recommended)
PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./core \
  --include "*.py" \
  --exclude "data/memory/" "/__pycache__/" \
  -o core_python_docs.md

# Custom configuration
PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./project \
  --config-file ./configs/audit_config.json \
  -o project_audit.md

# Update configuration first, then generate
PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py \
  --update-config '{"base_gen_agent_settings.max_file_size_kb_for_inclusion": 500}'
</code></pre>

Best Practices

✅ Recommended Usage Patterns

  1. For Code Auditing: Use OptimizedAuditGenAgent instead
  2. For General Documentation: Use BaseGenAgent with memory exclusions
  3. Always exclude memory data: Add --exclude "data/memory/" "data/logs/"
  4. Use include patterns: Specify --include "*.py" "*.js" for code-only docs
  5. Test on small directories first: Verify output size before large directories
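To check size before a large run, the raw bytes that would be embedded can be estimated up front. The helper below is a hypothetical sketch, not part of BaseGenAgent: it applies fnmatch-style include/exclude patterns only, ignores .gitignore rules, and does not count Markdown overhead, so treat the result as a rough figure.

```python
import fnmatch
from pathlib import Path
from typing import Iterable


def estimate_snapshot_kb(root: str, include: Iterable[str], exclude: Iterable[str] = ()) -> float:
    """Sum the sizes (in KB) of files that pass the include/exclude patterns."""
    root_path = Path(root)
    include, exclude = list(include), list(exclude)
    total_bytes = 0
    for path in root_path.rglob("*"):
        if not path.is_file():
            continue
        rel = path.relative_to(root_path).as_posix()
        if any(fnmatch.fnmatch(rel, pattern) for pattern in exclude):
            continue
        if include and not any(fnmatch.fnmatch(rel, pattern) for pattern in include):
            continue
        total_bytes += path.stat().st_size
    return total_bytes / 1024
```

For example, `estimate_snapshot_kb("./core", ["*.py"], ["data/memory/*"])` gives a quick figure to compare against what an LLM context can realistically hold.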

❌ Avoid These Patterns

<pre><code class="lang-bash"># ❌ DON'T: Run on entire mindX without exclusions (creates 327MB files)
python3 tools/base_gen_agent.py ./mindX -o huge_file.md

# ✅ DO: Use exclusions or OptimizedAuditGenAgent
python3 tools/base_gen_agent.py ./mindX --exclude "data/memory/" -o manageable_file.md
</code></pre>

Integration with MastermindAgent

The MastermindAgent can instantiate and use BaseGenAgent internally via its BDI action ANALYZE_CODEBASE_FOR_STRATEGY. However, for audit use cases, it should prefer the OptimizedAuditGenAgent to avoid giant file generation.

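Recommended Integration Pattern:

<pre><code class="lang-python">from tools.base_gen_agent import BaseGenAgent
from tools.optimized_audit_gen_agent import OptimizedAuditGenAgent

class MastermindAgent:
    def __init__(self, memory_agent):
        # Assumption: the memory agent is passed in, so it exists before the
        # generator agents that depend on it are created.
        self.memory_agent = memory_agent
        self.base_gen_agent = BaseGenAgent(self.memory_agent)
        self.audit_gen_agent = OptimizedAuditGenAgent(self.memory_agent)

    def analyze_codebase_for_strategy(self, target_path: str, audit_mode: bool = True):
        if audit_mode:
            return self.audit_gen_agent.generate_audit_documentation(target_path)
        else:
            return self.base_gen_agent.generate_markdown_summary(
                target_path,
                user_exclude_patterns=["data/memory/", "data/logs/"],
            )
</code></pre>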

This makes BaseGenAgent a crucial internal tool for Mastermind's self-understanding and its ability to strategically plan the evolution of the mindX system and its toolset.


⚠️ OPTIMIZATION ASSESSMENT - CRITICAL FINDINGS

Giant File Problem Discovered

Issue: BaseGenAgent generates unusably large files when run on directories containing memory data.

Test Results:

Size Comparison

| Directory | Output Size | Files | Status |
|-----------|-------------|-------|--------|
| ./utils | 39KB | 5 files | ✅ Normal |
| ./tools | 211KB | 21 files | ✅ Normal |
| ./mindX | 327MB | 8000+ files | ❌ Too large |

CLI Testing Results ✅

Validated working CLI commands:

<pre><code class="lang-bash">PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./utils --include "*.py" -o demo.md
# ✅ Success: 39KB output

PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./tools --include "*.py" -o demo.md
# ✅ Success: 211KB output
</code></pre>

Solution: Use OptimizedAuditGenAgent

For code auditing, use the optimized version:

<pre><code class="lang-python">from tools.optimized_audit_gen_agent import OptimizedAuditGenAgent
from agents.memory_agent import MemoryAgent

memory_agent = MemoryAgent()

# 99.93% size reduction (327MB → 213KB)
audit_agent = OptimizedAuditGenAgent(memory_agent, max_files_per_chunk=50)
success, result = audit_agent.generate_audit_documentation('./core')
</code></pre>
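The quoted reduction can be checked from the two sizes given above. The sketch assumes decimal units (1MB = 1000KB), under which the result matches the stated 99.93%; with binary units the figure is marginally higher.

```python
def size_reduction_percent(before_kb: float, after_kb: float) -> float:
    """Percentage by which the output shrank."""
    return (1 - after_kb / before_kb) * 100


before_kb = 327 * 1000  # 327MB, assuming 1MB = 1000KB
after_kb = 213          # 213KB

print(f"{size_reduction_percent(before_kb, after_kb):.2f}%")  # 99.93%
```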

Critical Recommendations

  1. For auditing: Use OptimizedAuditGenAgent (99.93% smaller)
  2. For general docs: Add memory exclusions to BaseGenAgent
  3. Always exclude: "data/memory/", "data/logs/"
  4. Test first: Run on small directories before large ones

When to Use Each Tool

| Use Case | Tool | Reason |
|----------|------|--------|
| Code auditing | OptimizedAuditGenAgent | Prevents giant files |
| General documentation | BaseGenAgent + exclusions | Complete docs |
| Memory analysis | Specialized tools | Neither suitable |

Related Tools

Configuration Recommendations

Add to the basegen_config.json HARD_CODED_EXCLUDES list:

{
  "HARD_CODED_EXCLUDES": [
    "data/memory/",
    "data/logs/", 
    ".log",
    "/*.mem.json"
  ]
}

This prevents the giant file problem while maintaining full BaseGenAgent functionality for appropriate use cases.

