# Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?
Paper: 2602.11988
Authors: Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, Martin Vechev
Published: February 2026

## Summary

This paper evaluates whether repository-level context files (such as `AGENTS.md`) actually help coding agents perform better on real-world software engineering tasks.

## Key Findings

### Performance Impact
- LLM-generated context files: decrease task success rates by ~3% on average
- Developer-provided context files: modestly improve performance by ~4% on average
- No context files: baseline performance

### Cost Impact
- Context files increase inference costs by over 20% on average
- Tasks require more steps to complete (2.45-3.92 additional steps)

### Behavioral Changes
- More testing and exploration: agents run more tests, search more files, and read more files
- Instruction following: agents generally follow the instructions in context files
- Redundant documentation: context files are often redundant with existing documentation
- No effective overviews: context files don't provide useful repository overviews

## AGENTBENCH

The authors created a new benchmark called AGENTBENCH consisting of:
- 138 unique instances from 12 repositories
- Real GitHub issues (bug-fixing and feature-addition tasks)
- Developer-written context files
- Python software engineering tasks

AGENTBENCH complements SWE-BENCH LITE (which uses popular repositories without context files).

## Experimental Setup

### Coding Agents Evaluated
- CLAUDE CODE with SONNET-4.5
- CODEX with GPT-5.2 and GPT-5.1 MINI
- QWEN CODE with QWEN3-30B-CODER

### Datasets
- SWE-BENCH LITE: 300 tasks from 11 popular Python repositories (no context files)
- AGENTBENCH: 138 tasks from 12 repositories with developer-provided context files

### Settings Evaluated
1. NONE: no context files
2. LLM: LLM-generated context files (using agent-developer recommendations)
3. HUM: developer-provided context files

## Key Insights

### 1. Context Files Make Tasks Harder
Instructions in context files increase reasoning tokens by 14-22%, suggesting that the tasks agents attempt become more complex.

### 2. Context Files Are Redundant
When documentation files are removed from repositories, LLM-generated context files actually improve performance by 2.7% on average.

### 3. Stronger Models Don't Generate Better Context Files
Using GPT-5.2 to generate context files improves SWE-BENCH LITE performance by 2% but degrades AGENTBENCH performance by 3%.

### 4. Context Files Encourage Exploration
Agents use more repository-specific tools and run more tests when context files are present.

## Recommendations

1. Omit LLM-generated context files for now, contrary to agent-developer recommendations
2. Include only minimal requirements in context files (e.g., specific tooling to use)
3. Human-written context files should describe only essential information
4. Future work: improve automatic generation of concise, task-relevant guidance

## Limitations

- Evaluation focused heavily on Python (a language well-represented in training data)
- Context files are a recent development (August 2025)
- Popular repositories used in benchmarks may not be representative of most codebases

## Related Work

- SWE-BENCH: repository-level coding agent evaluation
- AGENTBENCH: new benchmark for context file evaluation
- Context files: AGENTS.md, CLAUDE.md, README files for agents

## Conclusion

Context files have only a marginal effect on coding agent behavior. While they encourage broader exploration and instruction following, they don't provide effective repository overviews and often make tasks harder.
The authors recommend omitting LLM-generated context files and including only minimal requirements in human-written ones.

---
Tags: #agents #context-files #evaluation #SWE-bench #LLM-agents
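Addendum: the three-setting comparison (NONE vs. LLM vs. HUM) boils down to comparing per-setting success rates against the no-context-file baseline. A minimal sketch of that bookkeeping is below; the `SettingResult` type is my own, and all solved/total counts are invented placeholders, not the paper's data.

```python
# Hypothetical sketch of the paper's three-setting comparison.
# The counts below are invented placeholders, NOT results from the paper.
from dataclasses import dataclass


@dataclass
class SettingResult:
    solved: int  # tasks the agent resolved
    total: int   # tasks attempted

    @property
    def success_rate(self) -> float:
        return self.solved / self.total


# The paper's three settings: NONE (baseline), LLM-generated, developer-written (HUM)
results = {
    "NONE": SettingResult(solved=60, total=138),
    "LLM": SettingResult(solved=56, total=138),
    "HUM": SettingResult(solved=65, total=138),
}

baseline = results["NONE"].success_rate
for name, r in results.items():
    delta_pp = (r.success_rate - baseline) * 100  # percentage-point delta vs. baseline
    print(f"{name}: {r.success_rate:.1%} ({delta_pp:+.1f} pp vs. NONE)")
```

With placeholder counts like these, the script reports each setting's success rate and its percentage-point delta against NONE, which is the shape of the comparison the paper runs per agent and per benchmark.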