Heat score: 1
Topic analysis
Show HN: Mdarena – Benchmark your Claude.md against your own PRs
Benchmark your CLAUDE.md against your own PRs. Most CLAUDE.md files are written blindly. Research shows they often reduce agent success rates and cost 20%+ more tokens. mdarena lets you measure whether yours helps or hurts, on tasks from your actual codebase.

mdarena can run your repo's actual tests to grade agent patches, the same way SWE-bench does it. Parses .github/workflows/*.yml, package.json, pyproject.toml, Cargo.toml, and go.mod. When tests aren't available, falls back to diff overlap scoring.

Pass a directory to benchmark a full CLAUDE.md tree: Each directory mirrors your repo structure. Baseline strips ALL CLAUDE.md and AGENTS.md files from the entire tree.

We ran mdarena against a large production monorepo: 20 merged PRs, Claude Opus 4.6, three conditions (bare baseline, existing CLAUDE.md, hand-written alternative). Patches graded against real test suites. Not string matching, not LLM-as-judge. The winning CLAUDE.md wasn't the longest or most detailed. It was the one that put the right context in front of the agent at the right time.

Only benchmark repositories you trust. mdarena executes code from the repos it benchmarks (test commands run via shell=True, Claude Code runs with --dangerously-skip-permissions). Sandboxes are isolated temp directories under /tmp but processes run as your user.

Benchmark integrity: Because tasks come from historical PRs, the gold patch is in the repo's git history. Claude 4 Sonnet exploited this against SWE-bench by walking future commits via tags. mdarena prevents this with history-free checkouts: git archive exports a snapshot at base_commit into a fresh single-commit repo. Future commits don't exist in the object database at all. See tests/test_isolated_checkout.py for the integrity assertions.
Sources: 1
Platforms: 1
Relations: 5
First seen: Apr 6, 2026, 7:35 AM
Last updated: Apr 6, 2026, 12:00 PM
Why this topic matters
"Show HN: Mdarena – Benchmark your Claude.md against your own PRs" is currently shaped by signals from 1 source platform. This page organizes AI analysis summaries, 1 timeline event, and 5 relationship edges so search engines and AI systems can understand the topic's factual basis and propagation arc.
Keywords
5 tags
Source evidence
1 evidence item
Show HN: Mdarena – Benchmark your Claude.md against your own PRs
News · 1
Timeline
Show HN: Mdarena – Benchmark your Claude.md against your own PRs
Apr 6, 2026, 7:35 AM