# Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?

Paper: 2602.11988
Authors: Thibaud Gloaguen, Niels Mündler, Mark Müller, Veselin Raychev, Martin Vechev
Published: February 2026

## Summary
This paper evaluates whether repository-level context files (like AGENTS.md) actually help coding agents perform better on real-world software engineering tasks.

## Key Findings
### Performance Impact
- LLM-generated context files: Decrease task success rates by ~3% on average
- Developer-provided context files: Marginally improve performance by ~4% on average
- No context files: Baseline performance
### Cost Impact
- Context files increase inference costs by over 20% on average
- More steps required to complete tasks (2.45-3.92 additional steps)
### Behavioral Changes
- More testing and exploration: Agents run more tests, search more files, read more files
- Instruction following: Agents generally follow instructions in context files
- Redundant documentation: Context files are often redundant with existing documentation
- No effective overviews: Context files don't provide useful repository overviews

## AGENTBENCH
- 138 unique instances from 12 repositories
- Real GitHub issues (bug-fixing and feature-addition tasks)
- Developer-written context files
- Python software engineering tasks

AGENTBENCH complements SWE-BENCH LITE (which uses popular repositories without context files).
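The two benchmarks differ mainly in whether a developer-written context file accompanies each task. A minimal sketch of such a task record (the field names are illustrative assumptions, not the paper's actual schema):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskInstance:
    """One benchmark task: a real GitHub issue plus repository metadata.

    Field names are hypothetical, chosen only to illustrate the contrast
    between the two benchmarks described above.
    """
    repo: str                    # e.g. "owner/project"
    issue_id: int                # GitHub issue number the task is drawn from
    task_type: str               # "bug-fixing" or "feature-addition"
    context_file: Optional[str]  # path to a developer-written AGENTS.md, or None


# SWE-BENCH LITE instances carry no context file; AGENTBENCH instances do.
swe_bench_task = TaskInstance("owner/project", 101, "bug-fixing", None)
agentbench_task = TaskInstance("owner/project", 202, "feature-addition", "AGENTS.md")
```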
## Experimental Setup

### Coding Agents Evaluated
- CLAUDE CODE with SONNET-4.5
- CODEX with GPT-5.2 and GPT-5.1 MINI
- QWEN CODE with QWEN3-30B-CODER

### Datasets
- SWE-BENCH LITE: 300 tasks from 11 popular Python repositories (no context files)
- AGENTBENCH: 138 tasks from 12 repositories with developer-provided context files

### Settings Evaluated
1. NONE: No context files
2. LLM: LLM-generated context files (using agent-developer recommendations)
3. HUM: Developer-provided context files
## Key Insights

### 1. Context Files Make Tasks Harder
Instructions in context files increase reasoning tokens by 14-22%, suggesting tasks become more complex.
### 2. Context Files Are Redundant
When documentation files are removed from repositories, LLM-generated context files actually improve performance by 2.7% on average.
### 3. Stronger Models Don't Generate Better Context Files
Using GPT-5.2 to generate context files improves SWE-BENCH LITE performance by 2% but degrades AGENTBENCH performance by 3%.
### 4. Context Files Encourage Exploration
Agents use more repository-specific tools (e.g., uv, repo_tool) and run more tests when context files are present.

## Recommendations
1. Omit LLM-generated context files for now, contrary to agent-developer recommendations
2. Include only minimal requirements in context files (e.g., specific tooling to use)
3. Human-written context files should describe only essential information
4. Future work: Improve automatic generation of concise, task-relevant guidance
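Taken together, these recommendations imply a context file that states only hard requirements and nothing else. A hypothetical minimal AGENTS.md along these lines (the specific commands are invented for illustration, not taken from the paper):

```markdown
# AGENTS.md

- Use `uv` for all dependency management; do not call `pip` directly.
- Run the test suite with `uv run pytest` before submitting changes.
```

No repository overview, architecture tour, or style guide: per the findings above, that material tends to be redundant with existing documentation and adds cost without improving success rates.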
## Limitations
- Evaluation focused heavily on Python (a language well-represented in training data)
- Context files are a recent development (August 2025)
- Popular repositories used in benchmarks may not be representative of most codebases

## Related Work
- SWE-BENCH: Repository-level coding agent evaluation
- AGENTBENCH: New benchmark for context file evaluation
- Context files: AGENTS.md, CLAUDE.md, README files for agents

## Conclusion
Context files have only a marginal effect on coding agent behavior. While they encourage broader exploration and instruction following, they don't provide effective repository overviews and often make tasks harder. The authors recommend omitting LLM-generated context files and including only minimal requirements in human-written ones.