# Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

Paper: 2602.11988
Authors: Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, Martin Vechev
Published: February 2026

## Summary
This paper evaluates whether repository-level context files (like AGENTS.md) actually help coding agents perform better on real-world software engineering tasks.

## Key Findings
### Performance Impact
- LLM-generated context files: Decrease task success rates by ~3% on average
- Developer-provided context files: Marginally improve performance by ~4% on average
- No context files: Baseline performance
### Cost Impact
- Context files increase inference costs by over 20% on average
- More steps required to complete tasks (2.45-3.92 additional steps)
### Behavioral Changes
- More testing and exploration: Agents run more tests, search more files, read more files
- Instruction following: Agents generally follow instructions in context files
- Redundant documentation: Context files are often redundant with existing documentation
- No effective overviews: Context files don't provide useful repository overviews

## AGENTBENCH
- 138 unique instances from 12 repositories
- Real GitHub issues (bug-fixing and feature-addition tasks)
- Developer-written context files
- Python software engineering tasks

AGENTBENCH complements SWE-BENCH LITE (which uses popular repositories without context files).
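The two benchmarks differ mainly in whether a developer-written context file accompanies each task. A minimal sketch of such a task record (the field names are illustrative assumptions, not the paper's actual schema):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskInstance:
    """One benchmark task: a real GitHub issue plus repository metadata.

    Field names are hypothetical, chosen only to illustrate the contrast
    between the two benchmarks described above.
    """
    repo: str                    # e.g. "owner/project"
    issue_id: int                # GitHub issue number the task is drawn from
    task_type: str               # "bug-fixing" or "feature-addition"
    context_file: Optional[str]  # path to a developer-written AGENTS.md, or None


# SWE-BENCH LITE instances carry no context file; AGENTBENCH instances do.
swe_bench_task = TaskInstance("owner/project", 101, "bug-fixing", None)
agentbench_task = TaskInstance("owner/project", 202, "feature-addition", "AGENTS.md")
```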
## Experimental Setup

### Coding Agents Evaluated
- CLAUDE CODE with SONNET-4.5
- CODEX with GPT-5.2 and GPT-5.1 MINI
- QWEN CODE with QWEN3-30B-CODER

### Datasets
- SWE-BENCH LITE: 300 tasks from 11 popular Python repositories (no context files)
- AGENTBENCH: 138 tasks from 12 repositories with developer-provided context files

### Settings Evaluated
1. NONE: No context files
2. LLM: LLM-generated context files (using agent-developer recommendations)
3. HUM: Developer-provided context files
## Key Insights

### 1. Context Files Make Tasks Harder
Instructions in context files increase reasoning tokens by 14-22%, suggesting tasks become more complex.
### 2. Context Files Are Redundant
When documentation files are removed from repositories, LLM-generated context files actually improve performance by 2.7% on average.
### 3. Stronger Models Don't Generate Better Context Files
Using GPT-5.2 to generate context files improves SWE-BENCH LITE performance by 2% but degrades AGENTBENCH performance by 3%.
### 4. Context Files Encourage Exploration
Agents use more repository-specific tools (e.g., uv, repo_tool) and run more tests when context files are present.

## Recommendations
1. Omit LLM-generated context files for now, contrary to agent-developer recommendations
2. Include only minimal requirements in context files (e.g., specific tooling to use)
3. Human-written context files should describe only essential information
4. Future work: Improve automatic generation of concise, task-relevant guidance
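Taken together, these recommendations imply a context file that states only hard requirements and nothing else. A hypothetical minimal AGENTS.md along these lines (the specific commands are invented for illustration, not taken from the paper):

```markdown
# AGENTS.md

- Use `uv` for all dependency management; do not call `pip` directly.
- Run the test suite with `uv run pytest` before submitting changes.
```

No repository overview, architecture tour, or style guide: per the findings above, that material tends to be redundant with existing documentation and adds cost without improving success rates.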
## Limitations
- Evaluation focused heavily on Python (a language well-represented in training data)
- Context files are a recent development (August 2025)
- Popular repositories used in benchmarks may not be representative of most codebases

## Related Work
- SWE-BENCH: Repository-level coding agent evaluation
- AGENTBENCH: New benchmark for context file evaluation
- Context files: AGENTS.md, CLAUDE.md, README files for agents

## Conclusion
Context files have only a marginal effect on coding agent behavior. While they encourage broader exploration and instruction following, they don't provide effective repository overviews and often make tasks harder. The authors recommend omitting LLM-generated context files and including only minimal requirements in human-written ones.