base_gen_agent_backup.md · 24.7 KB

BaseGenAgent (base_gen_agent.py) - Configurable Codebase Documenter

optimized in markdown for LLM ingestion
mindX/tools/base_gen_agent.py
data/config/basegen_config.json

⚠️ IMPORTANT: Size Optimization Notice

For code auditing use cases, use OptimizedAuditGenAgent instead. The BaseGenAgent can generate extremely large files (up to 327MB) when run on directories containing memory data, making it impractical for LLM processing. See Optimization Assessment below.

⚠️ IMPORTANT: Size Optimization Notice

For code auditing use cases, use OptimizedAuditGenAgent instead. The BaseGenAgent can generate extremely large files (up to 327MB) when run on directories containing memory data, making it impractical for LLM processing. See Optimization Assessment below.

Introduction

The BaseGenAgent is a utility agent within the MindX toolkit (Augmentic Project). Its primary function is to automatically generate a comprehensive Markdown document that provides a snapshot of a given codebase directory. This documentation includes a visual directory tree of included files and the complete content of those files, with appropriate language tagging for Markdown syntax highlighting.

A key feature of this agent is its configurability. It intelligently filters files based on: .gitignore rules found within the target codebase. User-defined include/exclude glob patterns passed via CLI or programmatically. A central, modifiable JSON configuration file (basegen_config.json) that specifies hardcoded file/pattern exclusions and language mappings for syntax highlighting.

This makes BaseGenAgent a valuable tool for MindX itself (e.g., for the SelfImprovementAgent or CoordinatorAgent to understand code they are about to modify) or for developers needing a quick, shareable overview of a project or component.

CLI Testing Results ✅

The BaseGenAgent CLI has been validated and works correctly:

# CLI Help
PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py --help

Test Results (June 2025)

PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./utils --include ".py" -o demo.md

✅ Generated: 39KB (5 files analyzed)

PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./tools --include ".py" -o demo.md

✅ Generated: 211KB (21 files analyzed)

CLI Requirements:

  • Set PYTHONPATH to mindX root directory
  • Uses existing MemoryAgent and basegen_config.json
  • Supports all documented CLI options
  • Optimization Assessment

    🚨 Critical Size Issues Identified

    Giant File Problem:

  • BaseGenAgent on mindX directory: 327MB, 8.6M lines
  • Root cause: Includes data/memory/ directory (506MB of JSON memory files)
  • Size multiplier: 1,591x larger than normal directories
  • Impact: Unusable for LLM processing, hours to generate
  • Size Comparison Results

    DirectoryBaseGenAgent OutputFiles AnalyzedIssue ./tools211KB21 files✅ Normal ./utils39KB5 files✅ Normal ./mindX327MB8000+ files❌ Includes memory data

    Solution: OptimizedAuditGenAgent

    For code auditing scenarios, use the specialized OptimizedAuditGenAgent:

    from tools.optimized_audit_gen_agent import OptimizedAuditGenAgent
    from agents.memory_agent import MemoryAgent

    For code auditing (recommended)

    audit_agent = OptimizedAuditGenAgent(MemoryAgent(), max_files_per_chunk=50) success, result = audit_agent.generate_audit_documentation('./core')

    Results: 99.93% size reduction, chunked output, audit-focused

    OptimizedAuditGenAgent Benefits:

  • 99.93% size reduction (327MB → 213KB)
  • Smart filtering: Excludes memory/log data automatically
  • Chunking: Creates manageable pieces (50 files max per chunk)
  • Audit focus: Only includes code files (.py, .js, .json, configs)
  • LLM compatible: Always fits in context windows
  • When to Use Each Agent

    Use CaseRecommended AgentReason Code auditingOptimizedAuditGenAgentPrevents giant files, audit-focused General documentationBaseGenAgentComplete documentation including .md files Memory analysisSpecialized tool neededNeither agent suitable for memory dumps

    Explanation

    Core Functionality

    Configuration Loading (_load_agent_config): The agent's behavior is controlled by a JSON configuration file, typically PROJECT_ROOT/data/config/basegen_config.json. A custom path can also be provided during instantiation. If the external config file is not found, it falls back to internal DEFAULT_CONFIG_DATA. Configuration Merging: Values from an external basegen_config.json are merged with the internal defaults. For lists like HARD_CODED_EXCLUDES, entries are combined and deduplicated. For dictionaries like LANGUAGE_MAPPING and the base_gen_agent settings block, external values update or override internal defaults. Key Configurable Sections: - HARD_CODED_EXCLUDES: A list of glob patterns for common binary files, lock files, IDE metadata, temporary files, and version control directories (e.g., .png, node_modules/, .git/) that are generally excluded from code documentation. - LANGUAGE_MAPPING: A dictionary mapping file extensions (e.g., .py, .rs) to language tags recognized by Markdown for syntax highlighting (e.g., python, rust). - base_gen_agent: A sub-dictionary for settings specific to this agent: - max_file_size_kb_for_inclusion: (Default: 1024KB) Files larger than this will have their content omitted with a warning. - default_output_filename: Default name for the output Markdown if not specified by the caller.

    File Discovery and Filtering Logic (generate_documentation, _should_include_file): The agent recursively scans the target root_path_str directory. .gitignore Processing (_load_gitignore_specs): If use_gitignore is true (default), it finds all .gitignore files within the root_path_str, aggregates their patterns, and compiles them into a pathspec.PathSpec object. This spec is used to efficiently exclude any files or directories ignored by Git. The .git/ directory itself is always implicitly ignored. Filtering Precedence for _should_include_file: If a file matches the gitignore_spec (and use_gitignore is true), it's excluded. The file is then checked against the combined exclude patterns (CLI/programmatic user_exclude_patterns + HARD_CODED_EXCLUDES from config). If it matches any, it's excluded. If include_patterns are provided (CLI/programmatic), the file must match at least one of these to be considered further. If it doesn't match any, it's excluded. If none of the above exclusion rules apply, the file is included.

    Directory Tree Generation (_build_tree_dict, _format_tree_lines): A list of included files (as relative paths from the root_path_str) is used. _build_tree_dict: Constructs a nested dictionary representing the directory hierarchy of these included files. _format_tree_lines: Recursively traverses this dictionary to create an indented, human-readable string representation of the tree structure, suitable for Markdown text code blocks. Directories are marked with a trailing /.

    Markdown Document Generation (generate_documentation): This is the main public method. It orchestrates scanning, filtering, and tree generation. It then iterates through the list of included files: For each file, a Markdown section is created with its POSIX-style relative path as a heading (e.g., ### \src/module/file.py\`). The file's content is read (UTF-8, replacing errors). If a file exceeds max_file_size_kb_for_inclusion, its content is omitted, and a warning is included in the Markdown. _guess_language() determines the Markdown language tag for syntax highlighting based on the file extension and the LANGUAGE_MAPPING from the configuration. The file content is embedded within a fenced code block (e.g., \\\python ... \\\). The complete Markdown content (tree + file contents) is written to the specified output file. The default output path is PROJECT_ROOT/data/generated_docs/_codebase_snapshot.md. Return Value: Returns a dictionary summarizing the operation: {"status": "SUCCESS"|"ERROR", "message": str, "output_file": str|None, "files_included": int}. This structured return makes it suitable for programmatic use by other MindX agents.

    Agent Structure

  • The core logic is encapsulated within the BaseGenAgent class.
  • Helper methods are private (prefixed with _).
  • Configuration is loaded during instantiation.
  • Technical Details

  • Path Handling: Uses pathlib.Path for robust, cross-platform path operations. Paths are generally resolved to absolute paths for internal consistency.
  • Pattern Matching:
  • -
    .gitignore: pathspec library (requires pip install pathspec). - Include/Exclude Globs: Python's fnmatch module.
  • Configuration: Loads from an external basegen_config.json file, falling back to internal defaults. This external file is a target for potential self-modification by MindX.
  • Error Handling: Includes try-except blocks for file I/O, JSON parsing, and directory scanning. Errors are logged, and the main method returns an error status.
  • Standalone CLI: A main_cli() function using argparse allows direct command-line execution of the agent, outputting its status dictionary as JSON.
  • Usage

    As a Standalone CLI Tool

    The agent can be executed directly:

    <pre><code class="lang-bash">PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py <input_dir> [options] </code></pre>

    ⚠️ Warning: Avoid running on directories containing data/memory/ or large datasets as it can generate unusably large files (327MB+).

    Key Arguments:

  • input_dir: Path to the codebase root.
  • -o, --output : Output Markdown file. Defaults to PROJECT_ROOT/data/generated_docs/<input_dir_name>_codebase_snapshot.md.
  • --include : Glob(s) for files to include.
  • --exclude : Glob(s) for files to exclude (these are additional to config excludes).
  • --no-gitignore: Ignore .gitignore files.
  • --config-file : Path to a custom agent configuration JSON file.
  • Example: <pre><code class="lang-bash">PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./core \ -o ./docs/core_docs.md \ --include "*.py" \ --exclude "**/test" \ --exclude "data/memory/" # Prevent giant files </code></pre>

    The CLI will print a JSON object summarizing the result.

    Programmatic Usage by Other MindX Agents

    The CoordinatorAgent or StrategicEvolutionAgent can instantiate and call BaseGenAgent to get a structured understanding of a component they intend to analyze or modify

    <pre><code class="lang-python">from agents.base_gen_agent import BaseGenAgent from agents.memory_agent import MemoryAgent

    Initialize with MemoryAgent (required)

    memory_agent = MemoryAgent() base_gen = BaseGenAgent(memory_agent=memory_agent)

    target_module_path = "./utils" output_doc_path = "./docs/utils_snapshot.md"

    result = base_gen.generate_markdown_summary( root_path_str=target_module_path, output_file_str=output_doc_path, include_patterns=[".py"], user_exclude_patterns=["test", "data/memory/"] # Prevent giant files )

    if result["status"] == "SUCCESS": print(f"BaseGenAgent created doc: {result['output_file']}") # Markdown content can now be read and fed to an LLM for analysis markdown_content = Path(result['output_file']).read_text() else: print(f"BaseGenAgent failed: {result['message']}") </code></pre>

    Dual-Agent Integration for Auditing

    For comprehensive auditing workflows, use both agents strategically:

    <pre><code class="lang-python">from agents.base_gen_agent import BaseGenAgent from tools.optimized_audit_gen_agent import OptimizedAuditGenAgent from agents.memory_agent import MemoryAgent

    class AuditAndImproveTool: def __init__(self, memory_agent): # Keep both for different use cases self.base_gen_agent = BaseGenAgent(memory_agent) # General docs self.audit_gen_agent = OptimizedAuditGenAgent(memory_agent) # Code auditing def execute(self, target_path: str, audit_mode: bool = True): if audit_mode: # Use optimized agent for code auditing (99.93% smaller) return self.audit_gen_agent.generate_audit_documentation(target_path) else: # Use original for general documentation return self.base_gen_agent.generate_markdown_summary(target_path) </code></pre>

    Configuration

    The BaseGenAgent relies on a JSON configuration file. The primary default location for this file is /data/config/basegen_config.json. This path can be overridden during instantiation or via the --config-file CLI argument.

    Key configurable sections:

    HARD_CODED_EXCLUDES: A list of glob patterns for files and directories to always exclude (e.g., .pyc, __pycache__/, .git/). LANGUAGE_MAPPING: A dictionary mapping file extensions to language identifiers for Markdown code blocks (e.g., ".py": "python"). base_gen_agent_settings: An object containing settings specific to this agent: max_file_size_kb_for_inclusion: Maximum size (in KB) for a file's content to be included directly. Larger files will have their content omitted with a note. (Default: 1024KB, Config: 2048KB) default_output_filename_stem: The base name for the output Markdown file if not specified. (Default: "codebase_snapshot")

    Example basegen_config.json snippet (relevant parts):

    <pre><code class="lang-json">{ "HARD_CODED_EXCLUDES": [ ".pyc", "__pycache__/", ".git/", "data/memory/", ".log" ], "LANGUAGE_MAPPING": { ".py": "python", ".js": "javascript", ".md": "markdown" }, "base_gen_agent_settings": { "max_file_size_kb_for_inclusion": 1024, "default_output_filename_stem": "codebase_snapshot" } } </code></pre>

    Current Production Configuration

    The current basegen_config.json includes:

  • max_file_size_kb_for_inclusion: 2048KB (allows large files)
  • HARD_CODED_EXCLUDES: Comprehensive list but missing memory data exclusions
  • LANGUAGE_MAPPING: Extensive language support (70+ languages)
  • Recommendation: Add "data/memory/" and "data/logs/" to HARD_CODED_EXCLUDES to prevent giant files.

    Command-Line Interface (CLI) Usage

    Validated CLI Functionality ✅

    The BaseGenAgent CLI has been tested and confirmed working:

    <pre><code class="lang-bash"># Prerequisites export PYTHONPATH=/path/to/mindX

    Basic usage (tested June 2025)

    python3 tools/base_gen_agent.py <input_dir> [options]

    Successful test examples:

    python3 tools/base_gen_agent.py ./utils --include "
    .py" -o demo.md

    ✅ Output: 39KB, 5 files analyzed

    python3 tools/base_gen_agent.py ./tools --include ".py" -o demo.md

    ✅ Output: 211KB, 21 files analyzed

    </code></pre>

    Arguments:

    Positional:

  • input_dir: Path to the codebase root directory to document.
  • Options:

  • -o OUTPUT, --output OUTPUT: Output Markdown file path. Defaults to a generated name in a configured output directory.
  • --include PATT [PATT ...]: Glob pattern(s) for files to explicitly include (e.g., .py src/**/.js).
  • --exclude PATT [PATT ...]: Glob pattern(s) to explicitly exclude (e.g., /temp/ .DS_Store).
  • --no-gitignore: Disable applying .gitignore file exclusions.
  • --config-file FILE_PATH: Path to a custom agent JSON configuration file.
  • --update-config JSON_STRING: JSON string of settings to update in the agent's configuration.
  • Special Configuration Operations:

    List Operations for HARD_CODED_EXCLUDES: <pre><code class="lang-bash"># Append items --update-config '{"HARD_CODED_EXCLUDES": [{"_LIST_OP_":"APPEND_UNIQUE"}, "data/memory/", ".log"]}'

    Remove items

    --update-config '{"HARD_CODED_EXCLUDES": [{"_LIST_OP_":"REMOVE"}, "
    .tmp"]}'

    Replace entirely

    --update-config '{"HARD_CODED_EXCLUDES": [".py", ".js"]}' </code></pre>

    Deep merge for dictionaries: <pre><code class="lang-bash">--update-config '{"LANGUAGE_MAPPING": {".foo": "bar", "_MERGE_DEEP_": true}}' </code></pre>

    CLI Output:

    The CLI prints a JSON object indicating the status: <pre><code class="lang-json">{ "status": "SUCCESS", "message": "Markdown documentation generated successfully: /path/to/output.md", "output_file": "/path/to/output.md", "files_included": 21 } </code></pre>

    Exit Codes:

  • 0: Success
  • 1: Operational failure
  • 2: Fatal error
  • Example CLI Calls:

    <pre><code class="lang-bash"># Basic documentation generation PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./core -o core_docs.md

    Python files only with memory exclusion (recommended)

    PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./core \ --include ".py" \ --exclude "data/memory/" "**/__pycache__/" \ -o core_python_docs.md

    Custom configuration

    PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./project \ --config-file ./configs/audit_config.json \ -o project_audit.md

    Update configuration first, then generate

    PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py \ --update-config '{"base_gen_agent_settings.max_file_size_kb_for_inclusion": 500}' </code></pre>

    Best Practices

    ✅ Recommended Usage Patterns

  • For Code Auditing: Use OptimizedAuditGenAgent instead
  • For General Documentation: Use BaseGenAgent with memory exclusions
  • Always exclude memory data: Add --exclude "data/memory/" "data/logs/"
  • Use include patterns: Specify --include ".py" ".js" for code-only docs
  • Test on small directories first: Verify output size before large directories
  • ❌ Avoid These Patterns

    <pre><code class="lang-bash"># ❌ DON'T: Run on entire mindX without exclusions (creates 327MB files) python3 tools/base_gen_agent.py ./mindX -o huge_file.md

    ✅ DO: Use exclusions or OptimizedAuditGenAgent

    python3 tools/base_gen_agent.py ./mindX --exclude "data/memory/
    " -o manageable_file.md </code></pre>

    Integration with MastermindAgent

    The MastermindAgent can instantiate and use BaseGenAgent internally via its BDI action ANALYZE_CODEBASE_FOR_STRATEGY. However, for audit use cases, it should prefer the OptimizedAuditGenAgent to avoid giant file generation.

    Recommended Integration Pattern: <pre><code class="lang-python">class MastermindAgent: def __init__(self): self.base_gen_agent = BaseGenAgent(self.memory_agent) self.audit_gen_agent = OptimizedAuditGenAgent(self.memory_agent) def analyze_codebase_for_strategy(self, target_path: str, audit_mode: bool = True): if audit_mode: return self.audit_gen_agent.generate_audit_documentation(target_path) else: return self.base_gen_agent.generate_markdown_summary( target_path, user_exclude_patterns=["data/memory/", "data/logs/"] ) </code></pre>

    This makes BaseGenAgent a crucial internal tool for Mastermind's self-understanding and its ability to strategically plan the evolution of the mindX system and its toolset.


    ⚠️ OPTIMIZATION ASSESSMENT - CRITICAL FINDINGS

    Giant File Problem Discovered

    Issue: BaseGenAgent generates unusably large files when run on directories containing memory data.

    Test Results:

  • mindX directory: 327MB output (8.6M lines) - UNUSABLE
  • Root cause: Includes data/memory/ (506MB of JSON files)
  • Normal directories: 39KB-211KB (perfectly usable)
  • Size Comparison

    DirectoryOutput SizeFilesStatus
    ./utils39KB5 files✅ Normal ./tools211KB21 files✅ Normal ./mindX327MB8000+ files❌ Too large

    CLI Testing Results ✅

    Validated working CLI commands: <pre><code class="lang-bash">PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./utils --include ".py" -o demo.md

    ✅ Success: 39KB output

    PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./tools --include ".py" -o demo.md

    ✅ Success: 211KB output

    </code></pre>

    Solution: Use OptimizedAuditGenAgent

    For code auditing, use the optimized version: <pre><code class="lang-python">from tools.optimized_audit_gen_agent import OptimizedAuditGenAgent

    99.93% size reduction (327MB → 213KB)

    audit_agent = OptimizedAuditGenAgent(memory_agent, max_files_per_chunk=50) success, result = audit_agent.generate_audit_documentation('./core') </code></pre>

    Critical Recommendations

  • For auditing: Use OptimizedAuditGenAgent (99.93% smaller)
  • For general docs: Add memory exclusions to BaseGenAgent
  • Always exclude: "data/memory/", "data/logs/"
  • Test first: Run on small directories before large ones
  • When to Use Each Tool

    Use CaseToolReason Code auditing
    OptimizedAuditGenAgentPrevents giant files General documentationBaseGenAgent + exclusionsComplete docs Memory analysisSpecialized toolsNeither suitable

    Related Tools

  • OptimizedAuditGenAgent: Specialized for code auditing, 99.93% smaller output
  • AuditAndImproveTool: Uses both agents strategically based on use case
  • Memory analysis tools: Needed for analyzing memory data patterns (separate tooling required)
  • Optimization Assessment and CLI Testing Results

    ⚠️ Critical Size Issues Identified

    Giant File Problem Discovered:

  • BaseGenAgent on mindX directory: 327MB output, 8.6M lines
  • Root cause: Includes data/memory/ directory (506MB of JSON memory files)
  • Size multiplier: 1,591x larger than normal directories
  • Impact: Unusable for LLM processing, hours to generate
  • Size Comparison Results

    DirectoryBaseGenAgent OutputFiles AnalyzedStatus
    ./tools211KB21 files✅ Normal ./utils39KB5 files✅ Normal ./mindX327MB8000+ files❌ Includes memory data

    CLI Testing Results ✅

    The BaseGenAgent CLI has been validated and works correctly:

    <pre><code class="lang-bash"># CLI Help PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py --help

    Test Results (June 2025)

    PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./utils --include ".py" -o demo.md

    ✅ Generated: 39KB (5 files analyzed)

    PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./tools --include ".py" -o demo.md

    ✅ Generated: 211KB (21 files analyzed)

    </code></pre>

    CLI Requirements:

  • Set PYTHONPATH to mindX root directory
  • Uses existing MemoryAgent and basegen_config.json
  • Supports all documented CLI options
  • Solution: OptimizedAuditGenAgent

    For code auditing scenarios, use the specialized OptimizedAuditGenAgent:

    <pre><code class="lang-python">from tools.optimized_audit_gen_agent import OptimizedAuditGenAgent from agents.memory_agent import MemoryAgent

    For code auditing (recommended)

    audit_agent = OptimizedAuditGenAgent(MemoryAgent(), max_files_per_chunk=50) success, result = audit_agent.generate_audit_documentation('./core')

    Results: 99.93% size reduction, chunked output, audit-focused

    </code></pre>

    OptimizedAuditGenAgent Benefits:

  • 99.93% size reduction (327MB → 213KB)
  • Smart filtering: Excludes memory/log data automatically
  • Chunking: Creates manageable pieces (50 files max per chunk)
  • Audit focus: Only includes code files (.py, .js, .json, configs)
  • LLM compatible: Always fits in context windows
  • When to Use Each Agent

    Use CaseRecommended AgentReason Code auditing
    OptimizedAuditGenAgentPrevents giant files, audit-focused General documentationBaseGenAgentComplete documentation including .md files Memory analysisSpecialized tool neededNeither agent suitable for memory dumps

    Best Practices

    ✅ Recommended Usage Patterns:

  • For Code Auditing: Use OptimizedAuditGenAgent instead
  • For General Documentation: Use BaseGenAgent with memory exclusions
  • Always exclude memory data: Add --exclude "data/memory/" "data/logs/"
  • Use include patterns: Specify --include ".py" ".js" for code-only docs
  • Test on small directories first: Verify output size before large directories
  • ❌ Avoid These Patterns: <pre><code class="lang-bash"># ❌ DON'T: Run on entire mindX without exclusions (creates 327MB files) python3 tools/base_gen_agent.py ./mindX -o huge_file.md

    ✅ DO: Use exclusions or OptimizedAuditGenAgent

    python3 tools/base_gen_agent.py ./mindX --exclude "data/memory/" -o manageable_file.md </code></pre>

    Configuration Recommendations

    Add to basegen_config.json` HARD_CODED_EXCLUDES:

    {
      "HARD_CODED_EXCLUDES": [
        "data/memory/",
        "data/logs/", 
        ".log",
        "**/*.mem.json"
      ]
    }
    

    This prevents the giant file problem while maintaining full BaseGenAgent functionality for appropriate use cases.


    All DocumentsDocument IndexThe Book of mindXImprovement JournalAPI Reference