base_gen_agent.py) - Configurable Codebase DocumenterFor code auditing use cases, use OptimizedAuditGenAgent instead. The BaseGenAgent can generate extremely large files (up to 327MB) when run on directories containing memory data, making it impractical for LLM processing. See Optimization Assessment below.
For code auditing use cases, use OptimizedAuditGenAgent instead. The BaseGenAgent can generate extremely large files (up to 327MB) when run on directories containing memory data, making it impractical for LLM processing. See Optimization Assessment below.
The BaseGenAgent is a utility agent within the MindX toolkit (Augmentic Project). Its primary function is to automatically generate a comprehensive Markdown document that provides a snapshot of a given codebase directory. This documentation includes a visual directory tree of included files and the complete content of those files, with appropriate language tagging for Markdown syntax highlighting.
A key feature of this agent is its configurability. It intelligently filters files based on:
.gitignore rules found within the target codebase.
User-defined include/exclude glob patterns passed via CLI or programmatically.
A central, modifiable JSON configuration file (basegen_config.json) that specifies hardcoded file/pattern exclusions and language mappings for syntax highlighting.
This makes BaseGenAgent a valuable tool for MindX itself (e.g., for the SelfImprovementAgent or CoordinatorAgent to understand code they are about to modify) or for developers needing a quick, shareable overview of a project or component.
The BaseGenAgent CLI has been validated and works correctly:
# CLI Help
PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py --help
Test Results (June 2025)
PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./utils --include ".py" -o demo.md
✅ Generated: 39KB (5 files analyzed)
PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./tools --include "
.py" -o demo.md
✅ Generated: 211KB (21 files analyzed)
CLI Requirements:
PYTHONPATH to mindX root directoryMemoryAgent and basegen_config.jsonGiant File Problem:
data/memory/ directory (506MB of JSON memory files)./tools./utils./mindXFor code auditing scenarios, use the specialized OptimizedAuditGenAgent:
from tools.optimized_audit_gen_agent import OptimizedAuditGenAgent
from agents.memory_agent import MemoryAgent
For code auditing (recommended)
audit_agent = OptimizedAuditGenAgent(MemoryAgent(), max_files_per_chunk=50)
success, result = audit_agent.generate_audit_documentation('./core')
Results: 99.93% size reduction, chunked output, audit-focused
OptimizedAuditGenAgent Benefits:
OptimizedAuditGenAgentBaseGenAgent Configuration Loading (_load_agent_config):
The agent's behavior is controlled by a JSON configuration file, typically PROJECT_ROOT/data/config/basegen_config.json. A custom path can also be provided during instantiation.
If the external config file is not found, it falls back to internal DEFAULT_CONFIG_DATA.
Configuration Merging: Values from an external basegen_config.json are merged with the internal defaults. For lists like HARD_CODED_EXCLUDES, entries are combined and deduplicated. For dictionaries like LANGUAGE_MAPPING and the base_gen_agent settings block, external values update or override internal defaults.
Key Configurable Sections:
- HARD_CODED_EXCLUDES: A list of glob patterns for common binary files, lock files, IDE metadata, temporary files, and version control directories (e.g., .png, node_modules/, .git/) that are generally excluded from code documentation.
- LANGUAGE_MAPPING: A dictionary mapping file extensions (e.g., .py, .rs) to language tags recognized by Markdown for syntax highlighting (e.g., python, rust).
- base_gen_agent: A sub-dictionary for settings specific to this agent:
- max_file_size_kb_for_inclusion: (Default: 1024KB) Files larger than this will have their content omitted with a warning.
- default_output_filename: Default name for the output Markdown if not specified by the caller.
File Discovery and Filtering Logic (generate_documentation, _should_include_file):
The agent recursively scans the target root_path_str directory.
.gitignore Processing (_load_gitignore_specs): If use_gitignore is true (default), it finds all .gitignore files within the root_path_str, aggregates their patterns, and compiles them into a pathspec.PathSpec object. This spec is used to efficiently exclude any files or directories ignored by Git. The .git/ directory itself is always implicitly ignored.
Filtering Precedence for _should_include_file:
If a file matches the gitignore_spec (and use_gitignore is true), it's excluded.
The file is then checked against the combined exclude patterns (CLI/programmatic user_exclude_patterns + HARD_CODED_EXCLUDES from config). If it matches any, it's excluded.
If include_patterns are provided (CLI/programmatic), the file must match at least one of these to be considered further. If it doesn't match any, it's excluded.
If none of the above exclusion rules apply, the file is included.
Directory Tree Generation (_build_tree_dict, _format_tree_lines):
A list of included files (as relative paths from the root_path_str) is used.
_build_tree_dict: Constructs a nested dictionary representing the directory hierarchy of these included files.
_format_tree_lines: Recursively traverses this dictionary to create an indented, human-readable string representation of the tree structure, suitable for Markdown text code blocks. Directories are marked with a trailing /.
Markdown Document Generation (generate_documentation):
This is the main public method.
It orchestrates scanning, filtering, and tree generation.
It then iterates through the list of included files:
For each file, a Markdown section is created with its POSIX-style relative path as a heading (e.g., ### \src/module/file.py\`).
The file's content is read (UTF-8, replacing errors). If a file exceeds max_file_size_kb_for_inclusion, its content is omitted, and a warning is included in the Markdown.
_guess_language() determines the Markdown language tag for syntax highlighting based on the file extension and the LANGUAGE_MAPPING from the configuration.
The file content is embedded within a fenced code block (e.g., \\\python ... \\\).
The complete Markdown content (tree + file contents) is written to the specified output file. The default output path is PROJECT_ROOT/data/generated_docs/.
Return Value: Returns a dictionary summarizing the operation: {"status": "SUCCESS"|"ERROR", "message": str, "output_file": str|None, "files_included": int}. This structured return makes it suitable for programmatic use by other MindX agents.
class.). for robust, cross-platform path operations. Paths are generally resolved to absolute paths for internal consistency.: pathspec library (requires pip install pathspec).
- Include/Exclude Globs: Python's fnmatch module.
Configuration: Loads from an external basegen_config.json file, falling back to internal defaults. This external file is a target for potential self-modification by MindX.
Error Handling: Includes try-except blocks for file I/O, JSON parsing, and directory scanning. Errors are logged, and the main method returns an error status.
Standalone CLI: A main_cli() function using argparse allows direct command-line execution of the agent, outputting its status dictionary as JSON.
Usage
As a Standalone CLI Tool
The agent can be executed directly:
<pre><code class="lang-bash">PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py <input_dir> [options]
</code></pre>
⚠️ Warning: Avoid running on directories containing
data/memory/ or large datasets as it can generate unusably large files (327MB+).
Key Arguments:
input_dir: Path to the codebase root.
-o, --output : Output Markdown file. Defaults to PROJECT_ROOT/data/generated_docs/<input_dir_name>_codebase_snapshot.md.
--include : Glob(s) for files to include.
--exclude : Glob(s) for files to exclude (these are additional to config excludes).
--no-gitignore: Ignore .gitignore files.
--config-file : Path to a custom agent configuration JSON file.
Example:
<pre><code class="lang-bash">PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./core \
-o ./docs/core_docs.md \
--include "*.py" \
--exclude "**/test" \
--exclude "data/memory/" # Prevent giant files
</code></pre>
The CLI will print a JSON object summarizing the result.
Programmatic Usage by Other MindX Agents
The CoordinatorAgent or StrategicEvolutionAgent can instantiate and call BaseGenAgent to get a structured understanding of a component they intend to analyze or modify
<pre><code class="lang-python">from agents.base_gen_agent import BaseGenAgent
from agents.memory_agent import MemoryAgent
Initialize with MemoryAgent (required)
memory_agent = MemoryAgent()
base_gen = BaseGenAgent(memory_agent=memory_agent)
target_module_path = "./utils"
output_doc_path = "./docs/utils_snapshot.md"
result = base_gen.generate_markdown_summary(
root_path_str=target_module_path,
output_file_str=output_doc_path,
include_patterns=[".py"],
user_exclude_patterns=["test", "data/memory/"] # Prevent giant files
)
if result["status"] == "SUCCESS":
print(f"BaseGenAgent created doc: {result['output_file']}")
# Markdown content can now be read and fed to an LLM for analysis
markdown_content = Path(result['output_file']).read_text()
else:
print(f"BaseGenAgent failed: {result['message']}")
</code></pre>
Dual-Agent Integration for Auditing
For comprehensive auditing workflows, use both agents strategically:
<pre><code class="lang-python">from agents.base_gen_agent import BaseGenAgent
from tools.optimized_audit_gen_agent import OptimizedAuditGenAgent
from agents.memory_agent import MemoryAgent
class AuditAndImproveTool:
def __init__(self, memory_agent):
# Keep both for different use cases
self.base_gen_agent = BaseGenAgent(memory_agent) # General docs
self.audit_gen_agent = OptimizedAuditGenAgent(memory_agent) # Code auditing
def execute(self, target_path: str, audit_mode: bool = True):
if audit_mode:
# Use optimized agent for code auditing (99.93% smaller)
return self.audit_gen_agent.generate_audit_documentation(target_path)
else:
# Use original for general documentation
return self.base_gen_agent.generate_markdown_summary(target_path)
</code></pre>
Configuration
The
BaseGenAgent relies on a JSON configuration file. The primary default location for this file is . This path can be overridden during instantiation or via the --config-file CLI argument.
Key configurable sections:
HARD_CODED_EXCLUDES: A list of glob patterns for files and directories to always exclude (e.g., .pyc, __pycache__/, .git/).
LANGUAGE_MAPPING: A dictionary mapping file extensions to language identifiers for Markdown code blocks (e.g., ".py": "python").
base_gen_agent_settings: An object containing settings specific to this agent:
max_file_size_kb_for_inclusion: Maximum size (in KB) for a file's content to be included directly. Larger files will have their content omitted with a note. (Default: 1024KB, Config: 2048KB)
default_output_filename_stem: The base name for the output Markdown file if not specified. (Default: "codebase_snapshot")
Example
basegen_config.json snippet (relevant parts):
<pre><code class="lang-json">{
"HARD_CODED_EXCLUDES": [
".pyc",
"__pycache__/",
".git/",
"data/memory/",
".log"
],
"LANGUAGE_MAPPING": {
".py": "python",
".js": "javascript",
".md": "markdown"
},
"base_gen_agent_settings": {
"max_file_size_kb_for_inclusion": 1024,
"default_output_filename_stem": "codebase_snapshot"
}
}
</code></pre>
Current Production Configuration
The current
basegen_config.json includes:
max_file_size_kb_for_inclusion: 2048KB (allows large files)
HARD_CODED_EXCLUDES: Comprehensive list but missing memory data exclusions
LANGUAGE_MAPPING: Extensive language support (70+ languages)
Recommendation: Add
"data/memory/" and "data/logs/" to HARD_CODED_EXCLUDES to prevent giant files.
Command-Line Interface (CLI) Usage
Validated CLI Functionality ✅
The BaseGenAgent CLI has been tested and confirmed working:
<pre><code class="lang-bash"># Prerequisites
export PYTHONPATH=/path/to/mindX
Basic usage (tested June 2025)
python3 tools/base_gen_agent.py <input_dir> [options]
Successful test examples:
python3 tools/base_gen_agent.py ./utils --include ".py" -o demo.md
python3 tools/base_gen_agent.py ./tools --include ".py" -o demo.md
✅ Output: 211KB, 21 files analyzed
</code></pre>
Positional:
: Path to the codebase root directory to document.Options:
: Output Markdown file path. Defaults to a generated name in a configured output directory.: Glob pattern(s) for files to explicitly include (e.g., .py src/**/.js).: Glob pattern(s) to explicitly exclude (e.g., /temp/ .DS_Store).: Disable applying .gitignore file exclusions.: Path to a custom agent JSON configuration file.: JSON string of settings to update in the agent's configuration.List Operations for HARD_CODED_EXCLUDES:
<pre><code class="lang-bash"># Append items
--update-config '{"HARD_CODED_EXCLUDES": [{"_LIST_OP_":"APPEND_UNIQUE"}, "data/memory/", ".log"]}'
Deep merge for dictionaries: <pre><code class="lang-bash">--update-config '{"LANGUAGE_MAPPING": {".foo": "bar", "_MERGE_DEEP_": true}}' </code></pre>
The CLI prints a JSON object indicating the status: <pre><code class="lang-json">{ "status": "SUCCESS", "message": "Markdown documentation generated successfully: /path/to/output.md", "output_file": "/path/to/output.md", "files_included": 21 } </code></pre>
Exit Codes:
: Success: Operational failure : Fatal error<pre><code class="lang-bash"># Basic documentation generation PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./core -o core_docs.md
instead
Use include patterns: Specify --include ".py" ".js" for code-only docs
Test on small directories first: Verify output size before large directories
❌ Avoid These Patterns
<pre><code class="lang-bash"># ❌ DON'T: Run on entire mindX without exclusions (creates 327MB files)
python3 tools/base_gen_agent.py ./mindX -o huge_file.md
✅ DO: Use exclusions or OptimizedAuditGenAgent
python3 tools/base_gen_agent.py ./mindX --exclude "data/memory/" -o manageable_file.md
</code></pre>
The MastermindAgent can instantiate and use BaseGenAgent internally via its BDI action ANALYZE_CODEBASE_FOR_STRATEGY. However, for audit use cases, it should prefer the OptimizedAuditGenAgent to avoid giant file generation.
Recommended Integration Pattern: <pre><code class="lang-python">class MastermindAgent: def __init__(self): self.base_gen_agent = BaseGenAgent(self.memory_agent) self.audit_gen_agent = OptimizedAuditGenAgent(self.memory_agent) def analyze_codebase_for_strategy(self, target_path: str, audit_mode: bool = True): if audit_mode: return self.audit_gen_agent.generate_audit_documentation(target_path) else: return self.base_gen_agent.generate_markdown_summary( target_path, user_exclude_patterns=["data/memory/", "data/logs/"] ) </code></pre>
This makes BaseGenAgent a crucial internal tool for Mastermind's self-understanding and its ability to strategically plan the evolution of the mindX system and its toolset.
Issue: BaseGenAgent generates unusably large files when run on directories containing memory data.
Test Results:
(506MB of JSON files)Validated working CLI commands:
<pre><code class="lang-bash">PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./utils --include ".py" -o demo.md
PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./tools --include "✅ Success: 39KB output
For code auditing, use the optimized version: <pre><code class="lang-python">from tools.optimized_audit_gen_agent import OptimizedAuditGenAgent
(99.93% smaller), "data/logs/" + exclusions: Specialized for code auditing, 99.93% smaller output: Uses both agents strategically based on use caseGiant File Problem Discovered:
directory (506MB of JSON memory files)The BaseGenAgent CLI has been validated and works correctly:
<pre><code class="lang-bash"># CLI Help PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py --help
PYTHONPATH=/path/to/mindX python3 tools/base_gen_agent.py ./tools --include "
.py" -o demo.mdCLI Requirements:
to mindX root directory and basegen_config.jsonFor code auditing scenarios, use the specialized OptimizedAuditGenAgent:
<pre><code class="lang-python">from tools.optimized_audit_gen_agent import OptimizedAuditGenAgent from agents.memory_agent import MemoryAgent
OptimizedAuditGenAgent Benefits:
✅ Recommended Usage Patterns:
instead for code-only docs❌ Avoid These Patterns: <pre><code class="lang-bash"># ❌ DON'T: Run on entire mindX without exclusions (creates 327MB files) python3 tools/base_gen_agent.py ./mindX -o huge_file.md
Add to basegen_config.json` HARD_CODED_EXCLUDES:
{
"HARD_CODED_EXCLUDES": [
"data/memory/",
"data/logs/",
".log",
"**/*.mem.json"
]
}
This prevents the giant file problem while maintaining full BaseGenAgent functionality for appropriate use cases.