Topic analysis

Show HN: Mdarena – Benchmark your Claude.md against your own PRs

Most CLAUDE.md files are written blindly. Research shows they often reduce agent success rates and cost 20%+ more tokens. mdarena lets you measure whether yours helps or hurts, on tasks drawn from your actual codebase.

mdarena can run your repo's actual tests to grade agent patches, the same way SWE-bench does. It parses .github/workflows/*.yml, package.json, pyproject.toml, Cargo.toml, and go.mod. When tests aren't available, it falls back to diff overlap scoring.

Pass a directory to benchmark a full CLAUDE.md tree; each subdirectory mirrors your repo structure. The baseline condition strips all CLAUDE.md and AGENTS.md files from the entire tree.

We ran mdarena against a large production monorepo: 20 merged PRs, Claude Opus 4.6, three conditions (bare baseline, the existing CLAUDE.md, and a hand-written alternative), with patches graded against real test suites, not string matching or LLM-as-judge. The winning CLAUDE.md wasn't the longest or most detailed; it was the one that put the right context in front of the agent at the right time.

Only benchmark repositories you trust. mdarena executes code from the repos it benchmarks (test commands run via shell=True, and Claude Code runs with --dangerously-skip-permissions). Sandboxes are isolated temp directories under /tmp, but processes run as your user.

Benchmark integrity: because tasks come from historical PRs, the gold patch is in the repo's git history. Claude 4 Sonnet exploited this against SWE-bench by walking future commits via tags. mdarena prevents this with history-free checkouts: git archive exports a snapshot at base_commit into a fresh single-commit repo, so future commits don't exist in the object database at all. See tests/test_isolated_checkout.py for the integrity assertions.
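The summary says mdarena parses manifest files to drive test-based grading, but doesn't show how. As a rough sketch of the idea (the function name and the mapping are assumptions, and the real tool also inspects .github/workflows/*.yml for explicit test steps), detection could look like:

```python
import json
import os

# Hypothetical sketch: map well-known manifest files to a default
# test command; mdarena's actual detection logic is not shown here.
def detect_test_command(repo: str) -> str | None:
    pkg = os.path.join(repo, "package.json")
    if os.path.exists(pkg):
        with open(pkg) as f:
            scripts = json.load(f).get("scripts", {})
        if "test" in scripts:
            return "npm test"
    if os.path.exists(os.path.join(repo, "pyproject.toml")):
        return "python -m pytest"
    if os.path.exists(os.path.join(repo, "Cargo.toml")):
        return "cargo test"
    if os.path.exists(os.path.join(repo, "go.mod")):
        return "go test ./..."
    return None  # caller falls back to diff overlap scoring
```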
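The fallback scorer is described only as "diff overlap scoring". One plausible formulation (hypothetical, not mdarena's actual metric) is the fraction of the gold patch's added lines that the agent's patch also adds:

```python
def added_lines(patch: str) -> set[tuple[str, str]]:
    """Collect (file, added line) pairs from a unified diff."""
    pairs, current = set(), None
    for line in patch.splitlines():
        if line.startswith("+++ "):
            current = line[4:].removeprefix("b/")
        elif line.startswith("+") and not line.startswith("+++"):
            pairs.add((current, line[1:].strip()))
    return pairs

def diff_overlap(agent_patch: str, gold_patch: str) -> float:
    """Score in [0, 1]: share of the gold patch's additions the agent hit."""
    gold = added_lines(gold_patch)
    if not gold:
        return 0.0
    return len(added_lines(agent_patch) & gold) / len(gold)
```

This is far weaker than running the test suite (it rewards textual similarity, not behavior), which is presumably why it is only the fallback.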
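The baseline condition strips every CLAUDE.md and AGENTS.md from the tree. A minimal sketch of that step (helper name is hypothetical):

```python
import os

def strip_agent_docs(root: str) -> list[str]:
    """Delete every CLAUDE.md / AGENTS.md under root and report them."""
    removed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if name in ("CLAUDE.md", "AGENTS.md"):
                path = os.path.join(dirpath, name)
                os.remove(path)
                removed.append(path)
    return removed
```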
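The history-free checkout described above (git archive at base_commit into a fresh single-commit repo) can be sketched as follows; the function name and details are assumptions, and the real implementation is mdarena's own:

```python
import io
import subprocess
import tarfile
import tempfile

def history_free_checkout(repo: str, base_commit: str) -> str:
    """Snapshot repo at base_commit into a fresh single-commit repo.

    git archive emits only the tree at base_commit, so the new repo's
    object database contains no future commits for an agent to mine.
    """
    dest = tempfile.mkdtemp(prefix="mdarena-")
    tar = subprocess.run(["git", "archive", base_commit], cwd=repo,
                         check=True, capture_output=True).stdout
    with tarfile.open(fileobj=io.BytesIO(tar)) as tf:
        tf.extractall(dest)
    # Re-init from scratch: nothing from the source repo's history
    # is copied, only the files in the snapshot.
    subprocess.run(["git", "init", "-q"], cwd=dest, check=True)
    subprocess.run(["git", "add", "-A"], cwd=dest, check=True)
    subprocess.run(["git", "-c", "user.name=bench",
                    "-c", "user.email=bench@example.com",
                    "commit", "-qm", "snapshot"], cwd=dest, check=True)
    return dest
```

Unlike a plain `git checkout base_commit` (where `git log --all`, tags, and the reflog still expose the future), the exported repo simply has nothing after the base commit to find.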

Heat score

1

Sources

1

Platforms

1

Relations

5

First seen

Apr 6, 2026, 7:35 AM

Last updated

Apr 6, 2026, 12:00 PM

Why this topic matters

Show HN: Mdarena – Benchmark your Claude.md against your own PRs is currently shaped by signals from 1 source platform. This page organizes AI analysis summaries, 1 timeline event, and 5 relationship edges so search engines and AI systems can understand the topic's factual basis and propagation arc.

News


Source evidence

1 evidence item

Show HN: Mdarena – Benchmark your Claude.md against your own PRs

News · 1
Apr 6, 2026, 7:35 AM

Timeline

Show HN: Mdarena – Benchmark your Claude.md against your own PRs

Apr 6, 2026, 7:35 AM

Related topics

Show HN: Hippo, biologically inspired memory for AI agents

Relation score 0.80

The cult of vibe coding is insane

Relation score 0.70

Show HN: Modo – I built an open-source alternative to Kiro, Cursor, and Windsurf

Relation score 0.80


Show HN: I built a tiny LLM to demystify how language models work

Relation score 0.60