Topic analysis

Show HN: Mdarena – Benchmark your Claude.md against your own PRs

Most CLAUDE.md files are written blindly. Research shows they often reduce agent success rates and cost 20%+ more tokens. mdarena lets you measure whether yours helps or hurts, on tasks drawn from your actual codebase.

mdarena can run your repo's actual tests to grade agent patches, the same way SWE-bench does. It parses .github/workflows/*.yml, package.json, pyproject.toml, Cargo.toml, and go.mod. When tests aren't available, it falls back to diff overlap scoring.

Pass a directory to benchmark a full CLAUDE.md tree; each subdirectory mirrors your repo structure. The baseline condition strips all CLAUDE.md and AGENTS.md files from the entire tree.

We ran mdarena against a large production monorepo: 20 merged PRs, Claude Opus 4.6, three conditions (bare baseline, the existing CLAUDE.md, and a hand-written alternative), with patches graded against real test suites, not string matching or LLM-as-judge. The winning CLAUDE.md wasn't the longest or most detailed; it was the one that put the right context in front of the agent at the right time.

Only benchmark repositories you trust. mdarena executes code from the repos it benchmarks (test commands run via shell=True, and Claude Code runs with --dangerously-skip-permissions). Sandboxes are isolated temp directories under /tmp, but processes run as your user.

Benchmark integrity: because tasks come from historical PRs, the gold patch is in the repo's git history. Claude 4 Sonnet exploited this against SWE-bench by walking future commits via tags. mdarena prevents this with history-free checkouts: git archive exports a snapshot at base_commit into a fresh single-commit repo, so future commits don't exist in the object database at all. See tests/test_isolated_checkout.py for the integrity assertions.
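The summary says mdarena parses manifest files to drive test-based grading, but doesn't show how. As a rough sketch of the idea (the function name and the mapping are assumptions, and the real tool also inspects .github/workflows/*.yml for explicit test steps), detection could look like:

```python
import json
import os

# Hypothetical sketch: map well-known manifest files to a default
# test command; mdarena's actual detection logic is not shown here.
def detect_test_command(repo: str) -> str | None:
    pkg = os.path.join(repo, "package.json")
    if os.path.exists(pkg):
        with open(pkg) as f:
            scripts = json.load(f).get("scripts", {})
        if "test" in scripts:
            return "npm test"
    if os.path.exists(os.path.join(repo, "pyproject.toml")):
        return "python -m pytest"
    if os.path.exists(os.path.join(repo, "Cargo.toml")):
        return "cargo test"
    if os.path.exists(os.path.join(repo, "go.mod")):
        return "go test ./..."
    return None  # caller falls back to diff overlap scoring
```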
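The fallback scorer is described only as "diff overlap scoring". One plausible formulation (hypothetical, not mdarena's actual metric) is the fraction of the gold patch's added lines that the agent's patch also adds:

```python
def added_lines(patch: str) -> set[tuple[str, str]]:
    """Collect (file, added line) pairs from a unified diff."""
    pairs, current = set(), None
    for line in patch.splitlines():
        if line.startswith("+++ "):
            current = line[4:].removeprefix("b/")
        elif line.startswith("+") and not line.startswith("+++"):
            pairs.add((current, line[1:].strip()))
    return pairs

def diff_overlap(agent_patch: str, gold_patch: str) -> float:
    """Score in [0, 1]: share of the gold patch's additions the agent hit."""
    gold = added_lines(gold_patch)
    if not gold:
        return 0.0
    return len(added_lines(agent_patch) & gold) / len(gold)
```

This is far weaker than running the test suite (it rewards textual similarity, not behavior), which is presumably why it is only the fallback.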
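The baseline condition strips every CLAUDE.md and AGENTS.md from the tree. A minimal sketch of that step (helper name is hypothetical):

```python
import os

def strip_agent_docs(root: str) -> list[str]:
    """Delete every CLAUDE.md / AGENTS.md under root and report them."""
    removed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name in ("CLAUDE.md", "AGENTS.md"):
                path = os.path.join(dirpath, name)
                os.remove(path)
                removed.append(path)
    return removed
```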
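The history-free checkout described above (git archive at base_commit into a fresh single-commit repo) can be sketched as follows; the function name and details are assumptions, and the real implementation is mdarena's own:

```python
import io
import subprocess
import tarfile
import tempfile

def history_free_checkout(repo: str, base_commit: str) -> str:
    """Snapshot repo at base_commit into a fresh single-commit repo.

    git archive emits only the tree at base_commit, so the new repo's
    object database contains no future commits for an agent to mine.
    """
    dest = tempfile.mkdtemp(prefix="mdarena-")
    tar = subprocess.run(["git", "archive", base_commit], cwd=repo,
                         check=True, capture_output=True).stdout
    with tarfile.open(fileobj=io.BytesIO(tar)) as tf:
        tf.extractall(dest)
    # Re-init from scratch: nothing from the source repo's history
    # is copied, only the files in the snapshot.
    subprocess.run(["git", "init", "-q"], cwd=dest, check=True)
    subprocess.run(["git", "add", "-A"], cwd=dest, check=True)
    subprocess.run(["git", "-c", "user.name=bench",
                    "-c", "user.email=bench@example.com",
                    "commit", "-qm", "snapshot"], cwd=dest, check=True)
    return dest
```

Unlike a plain `git checkout base_commit` (where `git log --all`, tags, and the reflog still expose the future), the exported repo simply has nothing after the base commit to find.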

Heat score

1

Sources

1

Platforms

1

Relations

5

First seen

Apr 6, 2026, 7:35 AM

Last updated

Apr 6, 2026, 12:00 PM

Why this topic matters

Show HN: Mdarena – Benchmark your Claude.md against your own PRs is currently shaped by signals from 1 source platform. This page organizes AI analysis summaries, 1 timeline event, and 5 relationship edges so search engines and AI systems can understand the topic's factual basis and propagation arc.

News


Source evidence

1 evidence item

Show HN: Mdarena – Benchmark your Claude.md against your own PRs

News · 1
Apr 6, 2026, 7:35 AM

Timeline

Show HN: Mdarena – Benchmark your Claude.md against your own PRs

Apr 6, 2026, 7:35 AM

Related topics

Show HN: Hippo, biologically inspired memory for AI agents

Relation score 0.80

The cult of vibe coding is insane

Relation score 0.70

Show HN: Modo – I built an open-source alternative to Kiro, Cursor, and Windsurf

Relation score 0.80


Show HN: I built a tiny LLM to demystify how language models work

Relation score 0.60