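A Review of Good Cases and Bad Cases in Hermes Agent's Own Iteration, Measured Against Anthropic's Evaluation Framework
Analysis period: 2026.04–05 · Primary data sources: ka-sales evals CHECKPOINT.md + 8 evaluation reports + model comparison experiments
The Anthropic article defines the core concepts of agent evals. The sections below compare each one with what Hermes does in practice:
judge.py handles structured regex checks (table presence, field completeness, tool-call counts), while judge_llm.py handles semantic quality (strategic value, factual accuracy, strength of the Challenger narrative). The two are complementary, covering the objective and subjective dimensions respectively.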
⇄ Anthropic: "Agent evaluations typically combine three types of graders: code-based, model-based, and human. Choose the right graders for the job."
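A code-based grader of the kind described above might look like the following. This is a minimal sketch, not the actual judge.py: the function name `grade_structure`, the required field names, and the `tool_call:` transcript marker are all hypothetical, chosen only to illustrate the three check types (table presence, field completeness, tool-call count).

```python
import re

# Hypothetical required fields; the real rubric in judge.py is not shown in this report.
REQUIRED_FIELDS = ("Company", "Decision maker", "Next step")


def grade_structure(output: str, transcript: str, min_tool_calls: int = 1) -> dict:
    """Run objective checks on one agent output and return a pass/fail map."""
    # A markdown table is a header row followed by a |---|---| separator row.
    has_table = re.search(
        r"^\|.+\|\s*\n\|[\s:|-]+\|\s*$", output, re.MULTILINE
    ) is not None
    # Field completeness: every required field name must appear in the output.
    missing = [f for f in REQUIRED_FIELDS if f.lower() not in output.lower()]
    # Assumes each tool invocation is logged on a line starting with "tool_call:".
    tool_calls = len(re.findall(r"^tool_call:", transcript, re.MULTILINE))
    return {
        "has_table": has_table,
        "fields_complete": not missing,
        "missing_fields": missing,
        "enough_tool_calls": tool_calls >= min_tool_calls,
    }
```

Checks like these are cheap and deterministic, which is why they pair well with a model-based grader for the subjective dimensions.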
Each case is run both with-skill and without-skill, and Δ(auto) and Δ(full) are computed. research-profile averages Δ=+33.4, first-contact Δ=+26.8, and competitive-displacement Δ=+38.0. The conclusions rest on data, not on impressions.
⇄ Anthropic: "Capability evals ask 'What can this agent do well?' — they should start at a low pass rate, giving teams a hill to climb."
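The with-skill versus without-skill comparison reduces to a per-case difference and a suite average. The sketch below assumes scores are stored as plain dicts keyed by case ID; the function name `skill_delta` and the sample scores are made up for illustration and are not taken from the evaluation reports.

```python
from statistics import mean


def skill_delta(with_skill: dict, without_skill: dict) -> dict:
    """Per-case score lift from enabling the skill, plus the suite average."""
    # Only compare cases that were run in both configurations.
    shared = sorted(with_skill.keys() & without_skill.keys())
    deltas = {case: with_skill[case] - without_skill[case] for case in shared}
    return {"per_case": deltas, "avg_delta": round(mean(deltas.values()), 1)}


# Made-up scores for illustration only.
with_skill = {"Q01": 90, "Q02": 85, "Q03": 80}
without_skill = {"Q01": 60, "Q02": 50, "Q03": 55}
result = skill_delta(with_skill, without_skill)
```

Note that `mean` raises `StatisticsError` when no case is shared, which is a reasonable failure mode for a misconfigured run.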
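In the ka-research-profile end-to-end evaluation, the raw QCC data is collected once (4 tools called serially) and then fed to multiple model APIs for a generation comparison. This design decouples tool calling from text generation and measures only the generation stage (TTFT, total time, quality), which is cleaner and more reproducible than running the full agent loop.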
⇄ Anthropic: "We're evaluating the harness and the model working together." Evaluating the harness and the model separately is an advanced practice.
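The "collect once, replay to every model" design can be sketched as a timing wrapper around any streaming generator. Everything here is an assumption for illustration: `time_generation` and `fake_model` are hypothetical names, and `fake_model` stands in for a real streaming API client, which the report does not show.

```python
import time
from typing import Callable, Iterable


def time_generation(stream_fn: Callable[[str], Iterable[str]], prompt: str) -> dict:
    """Measure time-to-first-token and total time for one streaming generation."""
    start = time.perf_counter()
    ttft = None
    chunks = []
    for chunk in stream_fn(prompt):
        if ttft is None:
            # The first chunk to arrive defines TTFT.
            ttft = time.perf_counter() - start
        chunks.append(chunk)
    total = time.perf_counter() - start
    return {"ttft_s": ttft, "total_s": total, "text": "".join(chunks)}


def fake_model(prompt: str):
    """Stand-in for a real streaming API client (hypothetical)."""
    time.sleep(0.01)
    yield "profile: "
    time.sleep(0.01)
    yield prompt[:10]


# Tool data is collected once, then the same payload is replayed to every model.
cached_tool_data = "QCC raw data collected once"
models = {"fake-a": fake_model, "fake-b": fake_model}
results = {name: time_generation(fn, cached_tool_data) for name, fn in models.items()}
```

Because every model sees the identical cached payload, differences in TTFT, total time, and quality can be attributed to the generation stage rather than to tool-call variance.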
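After a major change, delegate_task launches an independent subagent to run a full review. This has proven effective twice: the subagent repeatedly found issues that the main conversation had missed (including incorrect skill cross-references and leftover hard-coded values). This "outside review" mechanism compensates for the blind spots of a single agent.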
⇄ Anthropic: "Read the transcripts! You won't know if your graders are working well unless you read transcripts from many trials." Subagent review works as automated transcript reading.
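OP-Q04 (a single opportunity must not trigger multi-customer operations), DIB-Q04 (do not force-generate a daily report), DW-Q04 (clarify an ambiguous instruction first), RP-Q06 (org-chart draft only). judge.py adds exemption logic for each boundary case so that the rubric does not penalize correct behavior.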
⇄ Anthropic: "Don't unnecessarily punish creativity: grade what the agent produced, not the path it took." as well as "Build balanced problem sets: test both where behavior should and shouldn't occur."
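The 284-line CHECKPOINT.md records a milestone status table (✅/🔄/⏳/⚠️), paths to key artifacts, recovery commands, and known issues with their mitigations. After a cross-session interruption, work can resume seamlessly instead of being continued from memory.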
⇄ Anthropic: "An eval suite is a living artifact that needs ongoing attention and clear ownership to remain useful."
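The same prompt and the same model (deepseek-v4-pro) were compared at reasoning_effort=none and reasoning_effort=low. none turned out to be 3.1x slower than low (343s vs 111s), and output quality fell off a cliff. This directly overturned the intuitive assumption that turning reasoning_effort down always makes generation faster.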
⇄ Anthropic: "Evals also shape how quickly you can adopt new models. Teams with evals can quickly determine model strengths and upgrade in days."
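flash took 42s versus 111s for pro (2.6x). flash routed correctly (Fast format), while pro's routing drifted (Deep format). pro's reliability also collapsed: the first run succeeded in 111s, but the second ran for 562s+ with zero output (a 50% success rate over two runs). The data put an end to the assumption that pro is always better.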
⇄ Anthropic: "When more powerful models come out, teams without evals face weeks of testing while competitors with evals can quickly determine the model's strengths."
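After the P0 boundary-downgrade optimization, 4 representative cases were re-scored: DIB-Q04 12→48 (+36), RP-Q06 30→42 (+12), DW-Q04 35→40 (+5), OP-Q04 40→50 (+10). Re-scoring through the same entry point ensures the fix does not break cases that already passed.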
⇄ Anthropic: "Regression evals protect against backsliding. As teams hill-climb on capability evals, also run regression evals."
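A re-score like the one above can be turned into an automatic regression gate. The sketch below uses the four before/after scores reported in this section; the function name `regression_report` and the `tolerance` parameter are hypothetical.

```python
# Scores before and after the P0 fix, as reported above (out of 50).
before = {"DIB-Q04": 12, "RP-Q06": 30, "DW-Q04": 35, "OP-Q04": 40}
after = {"DIB-Q04": 48, "RP-Q06": 42, "DW-Q04": 40, "OP-Q04": 50}


def regression_report(before: dict, after: dict, tolerance: int = 0) -> dict:
    """Compute per-case deltas and list every case whose score dropped."""
    deltas = {case: after[case] - before[case] for case in before}
    # A case counts as regressed when it loses more than `tolerance` points.
    regressed = sorted(case for case, d in deltas.items() if d < -tolerance)
    return {"deltas": deltas, "regressed": regressed}


report = regression_report(before, after)
```

Running the same gate over the previously passing cases, not only the four that were fixed, is what would actually guard against backsliding.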
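At run time, judge.py reads ~/.hermes/skills/ka-sales/intel as the SSOT (competitor dossiers, customer profiles). This ensures the scoring basis matches the facts the skill actually uses at run time, rather than relying on the grader's own knowledge.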
⇄ Anthropic: "Each trial should be isolated by starting from a clean environment. Unnecessary shared state can cause correlated failures." Inverted here: we deliberately use a shared SSOT to ensure consistency.
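The 12 ka-sales skills were developed first and given evals afterwards. meddpicc-deal was the earliest, but the eval framework (runner.py + judge.py) only took shape on 2026-04-29. As a result, early iterations relied entirely on intuition plus manual testing, and the effect of optimizations could not be quantified.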
⇄ Anthropic: "Start early. Evals get harder to build the longer you wait. Early on, product requirements naturally translate into test cases. Wait too long and you're reverse-engineering success criteria from a live system."
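The LLM judges for first-contact, competitive-displacement, and intel-watch all hit delegate_task 404 errors. We were eventually forced to fall back to running claude -p serially on the local machine, which also required handling the structured_output wrapper layer. The evaluation infrastructure is not reliable enough.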
⇄ Anthropic: "Build a robust eval harness with a stable environment. Each trial should be isolated. Infrastructure flakiness shouldn't cause correlated failures."
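The fallback path described above can be made explicit rather than ad hoc. The following is a sketch under stated assumptions: backends are modeled as plain callables, the two sample backends (`flaky_remote`, `local_cli`) are hypothetical stand-ins, and the wrapper is assumed to be a dict with a `structured_output` key, which the report mentions by name but does not specify.

```python
from typing import Any, Callable, Sequence


def unwrap(payload: Any) -> Any:
    """Strip a {"structured_output": ...} wrapper if present (assumed shape)."""
    if isinstance(payload, dict) and "structured_output" in payload:
        return payload["structured_output"]
    return payload


def judge_with_fallback(backends: Sequence[Callable[[str], Any]], transcript: str) -> Any:
    """Try each judge backend in order and return the first successful verdict."""
    errors = []
    for backend in backends:
        try:
            return unwrap(backend(transcript))
        except Exception as exc:
            # Record the failure and move on to the next backend.
            errors.append(f"{getattr(backend, '__name__', 'backend')}: {exc}")
    raise RuntimeError("all judge backends failed: " + "; ".join(errors))


def flaky_remote(transcript: str):
    """Hypothetical remote judge that always fails, mimicking the 404 case."""
    raise ConnectionError("404")


def local_cli(transcript: str):
    """Hypothetical local judge that returns a wrapped verdict."""
    return {"structured_output": {"score": 42}}


verdict = judge_with_fallback([flaky_remote, local_cli], "transcript text")
```

Keeping the error list means a total failure still reports which backends were tried and why each one failed.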
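Within a single eval run, a gateway race between the test0 profile and the default profile caused T2 to fail twice before succeeding in a third environment. This infrastructure-level nondeterminism directly contaminated the E2E timing measurement (T2's duration is an estimate of 280-350s rather than an exact value).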
⇄ Anthropic: "Unnecessary shared state between runs can cause correlated failures due to infrastructure flakiness rather than agent performance."
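All LLM judge rubrics were written by hand, but LLM scores were never systematically calibrated against human expert scores. For DIB-Q04 the LLM gave 46/50 while auto gave only 12/50. A gap that large was never arbitrated by a human; we accepted the LLM's judgment by default.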
⇄ Anthropic: "LLM-as-judge graders should be closely calibrated with human experts. Use human graders judiciously for additional validation."
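One lightweight way to route such disagreements to a human is a threshold on the gap between the two graders. This is a sketch only: the function name `needs_human_review` and the 30% threshold are assumptions, not part of the existing pipeline.

```python
def needs_human_review(
    auto_score: float,
    llm_score: float,
    max_score: float = 50,
    threshold: float = 0.3,
) -> bool:
    """Flag a case when the two graders disagree by more than `threshold` of the scale."""
    return abs(auto_score - llm_score) / max_score > threshold


# DIB-Q04 as reported above: auto 12/50, LLM 46/50.
flag = needs_human_review(12, 46)
```

Flagged cases then become the natural sample for calibrating the LLM judge against human expert scores.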
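The first re-score of OP-Q04 after the P0 optimization produced an anomalous breakdown of "3 customers / 31 action items", showing that automatic scoring can confidently return 40/50 while the actual output is absurd. Only manual investigation revealed a bug that double-counted repeated customer names. The process systematically lacks a "read the transcript before drawing conclusions" step.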
⇄ Anthropic: "Read the transcripts! You won't know if your graders are working well unless you read the transcripts and grades from many trials. Failures should seem fair."
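A routine transcript spot check can be as simple as a seeded random sample per run, so that reading is not limited to cases that already look broken. The helper name `sample_for_review` and the case IDs in the usage note (other than OP-Q04) are hypothetical.

```python
import random


def sample_for_review(case_ids: list, k: int = 3, seed: int = 0) -> list:
    """Pick up to k case IDs for manual transcript reading, reproducibly."""
    # A dedicated Random instance keeps the sample independent of global state.
    rng = random.Random(seed)
    return sorted(rng.sample(case_ids, min(k, len(case_ids))))
```

For example, `sample_for_review(["OP-Q01", "OP-Q02", "OP-Q03", "OP-Q04"], k=2, seed=7)` returns two of the four IDs, and the same seed always returns the same two, so a reviewer and a rerun see the same sample.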
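Every eval case is run only once. Because model output is nondeterministic, a single-run pass@1 can land far from the model's true success rate. The pro model's 50% zero-output rate is the clearest example: a single run scores either full marks or zero, which says nothing about real reliability.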
⇄ Anthropic: "Because model outputs vary between runs, we run multiple trials to produce more consistent results. pass@k and pass^k help capture this nuance."
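The two metrics can be sketched under a simplifying assumption: trials are independent and share one per-trial success probability p. Then pass@k (at least one of k trials succeeds) is 1 - (1 - p)^k and pass^k (all k trials succeed) is p^k. The function names below are illustrative.

```python
def estimate_success_rate(outcomes: list) -> float:
    """Fraction of successful trials; `outcomes` is a list of booleans."""
    return sum(outcomes) / len(outcomes)


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials succeeds."""
    return 1 - (1 - p) ** k


def pass_pow_k(p: float, k: int) -> float:
    """Probability that all k independent trials succeed."""
    return p ** k


# The pro model as reported above: one success, one zero-output run.
p = estimate_success_rate([True, False])
at_3 = pass_at_k(p, 3)
pow_3 = pass_pow_k(p, 3)
```

With p = 0.5 and k = 3, pass@k is 0.875 while pass^k is 0.125, which shows how far apart "can succeed" and "succeeds reliably" sit for a flaky model. Two trials are far too few to pin down p; the estimate only illustrates the arithmetic.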
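Anthropic's framework calls for 6 layers: Automated Evals + Production Monitoring + A/B Testing + User Feedback + Manual Transcript Review + Systematic Human Studies. Hermes currently has only the automated eval layer. How skills perform for real users in production (effectiveness, satisfaction, success rate) is entirely unknown.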
⇄ Anthropic: "Like the Swiss Cheese Model, no single evaluation layer catches every issue. The most effective teams combine multiple methods."
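first-contact has a full average score of 89.0, negotiation-pricing 92.0, and operations-rhythm 88.8. Once scores exceed 85, the remaining gap comes more from the rubric design itself (how operable the auto-scored items are) than from real differences in agent capability. Harder tasks or finer-grained graders are needed.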
⇄ Anthropic: "Eval saturation occurs when an agent passes all solvable tasks. As evals approach saturation, large capability improvements appear as small increases in scores."
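Saturation monitoring can start as a one-line filter over suite averages. The sketch below uses the three averages reported above and the 85-point line mentioned in the text; the function name `saturated_suites` is hypothetical.

```python
# Full average scores as reported above.
suite_means = {
    "first-contact": 89.0,
    "negotiation-pricing": 92.0,
    "operations-rhythm": 88.8,
}


def saturated_suites(means: dict, threshold: float = 85.0) -> list:
    """Return the suites whose average score exceeds the saturation threshold."""
    return sorted(name for name, score in means.items() if score > threshold)


flagged = saturated_suites(suite_means)
```

A flagged suite is a prompt to add harder tasks or finer-grained graders, not a verdict that the skill is finished.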
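The P0 optimization added boundary exemption logic to 4 skills, but did so by hand-writing a regex exemption for each specific Q0X case. The DW/RP/DIB boundary exemptions still rely mainly on regex and have not been abstracted into a unified helper. This runs against Anthropic's advice: "checking that agents followed very specific steps results in overly brittle tests."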
⇄ Anthropic: "Design graders thoughtfully. Grade what the agent produced, not the path it took. Build in partial credit."
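A unified helper could keep the exemptions in one declarative table instead of scattering them through the grader. The sketch below is one possible shape, not the existing judge.py logic: `BoundaryRule`, `RULES`, `apply_exemptions`, the accept pattern, and the check names are all hypothetical. It still matches with regex, but the rule for each case lives in one place and states the expected behavior in words.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundaryRule:
    case_id: str
    expected_behavior: str   # human-readable description of the correct behavior
    accept_pattern: str      # regex that marks the correct boundary behavior
    waived_checks: tuple     # rubric checks to waive when the pattern matches


# Hypothetical rule table; the pattern is illustrative only.
RULES = {
    "DW-Q04": BoundaryRule(
        case_id="DW-Q04",
        expected_behavior="clarify the ambiguous instruction first",
        accept_pattern=r"\b(clarify|confirm|which)\b",
        waived_checks=("has_table",),
    ),
}


def apply_exemptions(case_id: str, output: str, checks: dict) -> dict:
    """Return a copy of `checks` with waived items passed when the rule matches."""
    rule = RULES.get(case_id)
    if rule is None or not re.search(rule.accept_pattern, output, re.IGNORECASE):
        return dict(checks)
    return {
        name: (True if name in rule.waived_checks else ok)
        for name, ok in checks.items()
    }
```

Adding a new boundary case then means adding one entry to `RULES` rather than another branch in the grader.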
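The development flow for every new skill has been: write SKILL.md → test manually → ship → add evals later. Anthropic recommends the reverse: first write eval tasks that define the expected behavior → then develop the skill → iterate driven by evals. As a result, many skills' boundary behaviors were discovered only when evals were added after the fact.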
⇄ Anthropic: "Practice eval-driven development: build evals to define planned capabilities before agents can fulfill them, then iterate until the agent performs well."
Measured against Anthropic's "0→1 roadmap", Hermes's eval practice sits roughly within Steps 1-5 and is stuck at Steps 6-8:
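| Anthropic Step | Hermes status | Gap |
|---|---|---|
| 0. Start early | ❌ Late start | Skill development preceded evals by 2-3 weeks |
| 1. Start with manual checks | ✅ Done | suite.json cases come from real usage scenarios |
| 2. Unambiguous tasks | ⚠️ Partial | Boundary cases are still ambiguous, but fixes are under way |
| 3. Balanced problem sets | ✅ Strong | 4 cases per suite, covering boundary and regular cases |
| 4. Robust harness | ❌ Unstable | delegate_task 404s, gateway races, claude -p fallback |
| 5. Thoughtful graders | ⚠️ Right direction but fragmented | The Auto+LLM two-layer design is sound, but boundary exemptions are piled-up regexes |
| 6. Read transcripts | ❌ Not systematic | Transcripts are read only when a bug surfaces; no routine spot checks |
| 7. Monitor saturation | ⚠️ Problem identified | Several skills average 85+, but no harder tasks have been designed proactively |
| 8. Long-term maintenance | ⚠️ CHECKPOINT.md is a good start | Still lacks a team collaboration mechanism and an eval owner |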
Generated on 2026-05-27 · Based on the Hermes Agent ka-sales evaluation system (~/.hermes/skills/ka-sales/evals/)
Original Anthropic article © 2026 Anthropic PBC