⚡ March 11, 2026
Generated: 2026-03-11 21:08 UTC
Total Duration: 42m 5s
Iterations: 1
Judge (classifier) model: gpt-4.1
Fast Benchmark
Markers: regression or benchmark
Schedule: Weekly (Sunday 2 AM UTC)
Purpose: Quick regression tests to catch breaking changes
HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.
If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.
Model Accuracy Comparison
| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| deepseek-r1-reasoner | 13 | 3 | 0 | 16 | 🟡 81% (13/16) |
| deepseek-v3.2-chat | 14 | 2 | 0 | 16 | 🟡 88% (14/16) |
| gemini-3.1-pro-preview | 15 | 1 | 0 | 16 | 🟡 94% (15/16) |
| gpt-5.3-codex | 4 | 12 | 0 | 16 | 🟡 25% (4/16) |
| gpt-5.4 | 13 | 3 | 0 | 16 | 🟡 81% (13/16) |
| haiku-4.5 | 13 | 3 | 0 | 16 | 🟡 81% (13/16) |
| opus-4.6 | 14 | 2 | 0 | 16 | 🟡 88% (14/16) |
| qwen-next-80B-instruct | 9 | 7 | 0 | 16 | 🟡 56% (9/16) |
| qwen-next-80B-thinking | 5 | 11 | 0 | 16 | 🟡 31% (5/16) |
| sonnet-4.6 | 16 | 0 | 0 | 16 | 🟢 100% (16/16) |
Model Cost Comparison
| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| deepseek-r1-reasoner | 16 | $0.02 | $0.00 | $0.05 | $0.32 |
| deepseek-v3.2-chat | 16 | $0.01 | $0.00 | $0.03 | $0.21 |
| gemini-3.1-pro-preview | 16 | $0.12 | $0.05 | $0.25 | $1.84 |
| gpt-5.3-codex | 16 | $0.02 | $0.00 | $0.06 | $0.36 |
| gpt-5.4 | 16 | $0.13 | $0.02 | $0.25 | $2.13 |
| haiku-4.5 | 16 | $0.06 | $0.02 | $0.11 | $0.98 |
| opus-4.6 | 16 | $0.31 | $0.12 | $0.54 | $4.95 |
| qwen-next-80B-instruct | 16 | $0.04 | $0.00 | $0.07 | $0.65 |
| qwen-next-80B-thinking | 16 | $0.03 | $0.00 | $0.09 | $0.47 |
| sonnet-4.6 | 16 | $0.18 | $0.07 | $0.27 | $2.91 |
Model Latency Comparison
| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| deepseek-r1-reasoner | 164.1 | 17.0 | 308.1 | 187.7 | 308.1 |
| deepseek-v3.2-chat | 94.7 | 33.1 | 214.5 | 87.2 | 214.5 |
| gemini-3.1-pro-preview | 35.2 | 18.2 | 63.1 | 37.3 | 63.1 |
| gpt-5.3-codex | 13.7 | 5.2 | 30.3 | 12.7 | 30.3 |
| gpt-5.4 | 37.5 | 7.3 | 55.5 | 40.9 | 55.5 |
| haiku-4.5 | 31.1 | 5.4 | 60.6 | 30.2 | 60.6 |
| opus-4.6 | 49.5 | 9.1 | 86.5 | 47.6 | 86.5 |
| qwen-next-80B-instruct | 34.1 | 5.7 | 54.2 | 35.2 | 54.2 |
| qwen-next-80B-thinking | 53.7 | 8.9 | 139.0 | 41.3 | 139.0 |
| sonnet-4.6 | 40.8 | 3.8 | 60.2 | 46.0 | 60.2 |
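The summary columns in this table can be reproduced from the per-eval durations listed in the raw results. The sketch below does so for the gpt-5.3-codex column. It assumes the percentile is the element at index `int(n * p)` of the sorted durations; that convention is inferred from the reported numbers and is not a confirmed detail of the report generator. The function name `summarize_latency` is hypothetical.

```python
from statistics import mean

# Per-eval durations in seconds for gpt-5.3-codex, copied from the raw results table.
DURATIONS = [5.2, 20.9, 23.9, 7.9, 13.3, 24.6, 5.8, 5.4,
             30.3, 22.3, 7.2, 5.4, 7.7, 20.2, 6.4, 12.7]


def summarize_latency(durations):
    """Return avg/min/max/p50/p95 using an index-based percentile (assumed convention)."""
    ordered = sorted(durations)
    n = len(ordered)

    def percentile(p):
        # Assumption: take the element at index int(n * p), clamped to the last item.
        return ordered[min(int(n * p), n - 1)]

    return {
        "avg": round(mean(ordered), 1),
        "min": ordered[0],
        "max": ordered[-1],
        "p50": percentile(0.50),
        "p95": percentile(0.95),
    }


summary = summarize_latency(DURATIONS)
print(summary)
```

Under this convention, 16 samples put the P95 index at 15, the last element, which would explain why P95 equals Max for every model in the table.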
Performance by Tag
Success rate by test category and model:
| Tag | deepseek-r1-reasoner | deepseek-v3.2-chat | gemini-3.1-pro-preview | gpt-5.3-codex | gpt-5.4 | haiku-4.5 | opus-4.6 | qwen-next-80B-instruct | qwen-next-80B-thinking | sonnet-4.6 | Warnings |
|---|---|---|---|---|---|---|---|---|---|---|---|
| benchmark | 🟡 83% (5/6) | 🟡 67% (4/6) | 🟡 83% (5/6) | 🟡 33% (2/6) | 🟡 50% (3/6) | 🟡 50% (3/6) | 🟡 67% (4/6) | 🟡 17% (1/6) | 🟡 17% (1/6) | 🟢 100% (6/6) | |
| context_window | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🔴 0% (0/2) | 🟢 100% (2/2) | 🔴 0% (0/2) | 🟢 100% (2/2) | 🔴 0% (0/2) | 🔴 0% (0/2) | 🟢 100% (2/2) | |
| counting | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🔴 0% (0/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | |
| datetime | 🟡 67% (2/3) | 🟢 100% (3/3) | 🟡 67% (2/3) | 🟡 33% (1/3) | 🟢 100% (3/3) | 🟡 33% (1/3) | 🟢 100% (3/3) | 🟡 33% (1/3) | 🟡 33% (1/3) | 🟢 100% (3/3) | |
| easy | 🟡 86% (6/7) | 🟢 100% (7/7) | 🟢 100% (7/7) | 🟡 29% (2/7) | 🟢 100% (7/7) | 🟢 100% (7/7) | 🟢 100% (7/7) | 🟡 86% (6/7) | 🟡 43% (3/7) | 🟢 100% (7/7) | |
| grafana-dashboard | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| hard | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | |
| kubernetes | 🟡 89% (8/9) | 🟢 100% (9/9) | 🟢 100% (9/9) | 🟡 22% (2/9) | 🟡 89% (8/9) | 🟢 100% (9/9) | 🟢 100% (9/9) | 🟡 67% (6/9) | 🟡 33% (3/9) | 🟢 100% (9/9) | |
| logs | 🟡 80% (4/5) | 🟡 80% (4/5) | 🟡 80% (4/5) | 🟡 20% (1/5) | 🟡 60% (3/5) | 🟡 40% (2/5) | 🟡 80% (4/5) | 🟡 20% (1/5) | 🔴 0% (0/5) | 🟢 100% (5/5) | |
| loki | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🔴 0% (0/2) | 🟢 100% (2/2) | |
| medium | 🟡 83% (5/6) | 🟡 83% (5/6) | 🟡 83% (5/6) | 🟡 17% (1/6) | 🟡 67% (4/6) | 🟡 67% (4/6) | 🟡 83% (5/6) | 🟡 17% (1/6) | 🔴 0% (0/6) | 🟢 100% (6/6) | |
| metrics | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| network | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | |
| one-test | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| port-forward | 🟢 100% (3/3) | 🟢 100% (3/3) | 🟢 100% (3/3) | 🟡 67% (2/3) | 🟡 67% (2/3) | 🟢 100% (3/3) | 🟢 100% (3/3) | 🟡 67% (2/3) | 🟡 33% (1/3) | 🟢 100% (3/3) | |
| question-answer | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| regression | 🟡 80% (8/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | 🟡 20% (2/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | 🟡 80% (8/10) | 🟡 40% (4/10) | 🟢 100% (10/10) | |
| runbooks | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🔴 0% (0/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | |
| Overall | 🟡 81% (13/16) | 🟡 88% (14/16) | 🟡 94% (15/16) | 🟡 25% (4/16) | 🟡 81% (13/16) | 🟡 81% (13/16) | 🟡 88% (14/16) | 🟡 56% (9/16) | 🟡 31% (5/16) | 🟢 100% (16/16) | |
Raw Results
Status of all evaluations across models. Color coding:
- 🟢 Passing 100% (stable)
- 🟡 Passing 1-99%
- 🔴 Passing 0% (failing)
- 🔧 Mock data failure (missing or invalid test data)
- ⚠️ Setup failure (environment/infrastructure issue)
- ⏱️ Timeout or rate limit error (as a status icon; inside result cells, ⏱️ precedes the run duration in seconds)
- ⏭️ Test skipped (e.g., known issue or precondition not met)
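The three pass-rate icons in this legend map directly onto a pass count. A minimal sketch of that mapping follows, assuming percentages are rounded half up to the nearest integer (consistent with the figures in this report, but not confirmed as the generator's rule). The function name `format_status` is hypothetical, and the error-state icons (🔧, ⚠️, ⏱️, ⏭️) are not covered.

```python
import math


def format_status(passed: int, total: int) -> str:
    """Render a pass count as an icon, a percentage, and a fraction."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Assumption: percentages are rounded half up to the nearest integer.
    percent = math.floor(100 * passed / total + 0.5)
    if passed == total:
        icon = "🟢"  # passing 100% (stable)
    elif passed == 0:
        icon = "🔴"  # passing 0% (failing)
    else:
        icon = "🟡"  # passing 1-99%
    return f"{icon} {percent}% ({passed}/{total})"


for passed in (16, 14, 0):
    print(format_status(passed, 16))
```

The icon is chosen from the counts rather than the rounded percentage, so a result such as 199/200 would still be shown as 🟡 even though it rounds to 100%.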
Detailed Raw Results
| Eval ID | deepseek-r1-reasoner | deepseek-v3.2-chat | gemini-3.1-pro-preview | gpt-5.3-codex | gpt-5.4 | haiku-4.5 | opus-4.6 | qwen-next-80B-instruct | qwen-next-80B-thinking | sonnet-4.6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 100% (1/1) / ⏱️ 115.4s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 90.9s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 25.9s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 5.2s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 37.1s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 30.2s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 44.1s / 💰 $0.29 | 🟢 100% (1/1) / ⏱️ 54.2s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 74.5s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 43.7s / 💰 $0.18 |
| 100a_loki_historical_logs 🔗 | 🟢 100% (1/1) / ⏱️ 308.1s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 145.8s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 30.6s / 💰 $0.12 | 🔴 0% (0/1) / ⏱️ 20.9s / 💰 $0.05 | 🔴 0% (0/1) / ⏱️ 51.2s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 36.4s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 86.5s / 💰 $0.54 | 🔴 0% (0/1) / ⏱️ 34.6s / 💰 $0.04 | 🔴 0% (0/1) / ⏱️ 84.8s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 52.4s / 💰 $0.21 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🟢 100% (1/1) / ⏱️ 234.7s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 214.5s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 38.1s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 23.9s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 46.7s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 40.6s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 79.9s / 💰 $0.37 | 🟢 100% (1/1) / ⏱️ 35.2s / 💰 $0.04 | 🔴 0% (0/1) / ⏱️ 25.7s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 50.1s / 💰 $0.19 |
| 108_logs_nearby_lines 🔗 | 🔴 0% (0/1) / ⏱️ 209.1s / 💰 $0.03 | 🔴 0% (0/1) / ⏱️ 106.4s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 63.1s / 💰 $0.25 | 🔴 0% (0/1) / ⏱️ 7.9s / 💰 $0.01 | 🔴 0% (0/1) / ⏱️ 40.9s / 💰 $0.13 | 🔴 0% (0/1) / ⏱️ 36.4s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 53.7s / 💰 $0.34 | 🔴 0% (0/1) / ⏱️ 53.3s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 135.7s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 46.0s / 💰 $0.27 |
| 111_pod_names_contain_service 🔗 | 🔴 0% (0/1) / ⏱️ 17.0s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 87.2s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 23.9s / 💰 $0.06 | 🔴 0% (0/1) / ⏱️ 13.3s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 41.8s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 29.0s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 43.2s / 💰 $0.27 | 🟢 100% (1/1) / ⏱️ 39.2s / 💰 $0.04 | 🔴 0% (0/1) / ⏱️ 14.0s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 44.3s / 💰 $0.18 |
| 112_find_pvcs_by_uuid 🔗 | 🟢 100% (1/1) / ⏱️ 206.8s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 82.8s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 24.4s / 💰 $0.06 | 🔴 0% (0/1) / ⏱️ 24.6s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 37.0s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 23.2s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 36.6s / 💰 $0.25 | 🔴 0% (0/1) / ⏱️ 28.8s / 💰 $0.03 | 🔴 0% (0/1) / ⏱️ 41.3s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 28.9s / 💰 $0.14 |
| 12_job_crashing 🔗 | 🟢 100% (1/1) / ⏱️ 187.7s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 138.1s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 40.5s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 5.8s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 45.0s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 36.8s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 47.6s / 💰 $0.31 | 🟢 100% (1/1) / ⏱️ 32.4s / 💰 $0.04 | 🔴 0% (0/1) / ⏱️ 27.1s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 41.3s / 💰 $0.17 |
| 176_network_policy_blocking_traffic_no_runbooks 🔗 | 🟢 100% (1/1) / ⏱️ 151.0s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 73.1s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 47.2s / 💰 $0.16 | 🔴 0% (0/1) / ⏱️ 5.4s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 55.5s / 💰 $0.25 | 🟢 100% (1/1) / ⏱️ 40.3s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 48.9s / 💰 $0.32 | 🔴 0% (0/1) / ⏱️ 46.0s / 💰 $0.05 | 🔴 0% (0/1) / ⏱️ 16.6s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 58.6s / 💰 $0.26 |
| 179_grafana_big_dashboard_query 🔗 | 🟢 100% (1/1) / ⏱️ 72.4s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 50.2s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 18.3s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 30.3s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 17.2s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 27.3s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 22.6s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 13.8s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 77.8s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 19.2s / 💰 $0.12 |
| 227_count_configmaps_per_namespace[0] 🔗 | 🟢 100% (1/1) / ⏱️ 113.4s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 74.1s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 41.6s / 💰 $0.10 | 🔴 0% (0/1) / ⏱️ 22.3s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 29.9s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 27.8s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 40.3s / 💰 $0.27 | 🟢 100% (1/1) / ⏱️ 30.8s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 87.4s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 30.7s / 💰 $0.15 |
| 24_misconfigured_pvc 🔗 | 🟢 100% (1/1) / ⏱️ 187.5s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 76.3s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 37.3s / 💰 $0.12 | 🔴 0% (0/1) / ⏱️ 7.2s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 27.9s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 27.0s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 44.9s / 💰 $0.31 | 🟢 100% (1/1) / ⏱️ 43.0s / 💰 $0.05 | 🔴 0% (0/1) / ⏱️ 8.9s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 46.6s / 💰 $0.20 |
| 43_current_datetime_from_prompt 🔗 | 🔴 0% (0/1) / ⏱️ 32.4s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 33.1s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 18.2s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 5.4s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 7.3s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 5.4s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 9.1s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 5.7s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 11.8s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 3.8s / 💰 $0.07 |
| 61_exact_match_counting 🔗 | 🟢 100% (1/1) / ⏱️ 78.7s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 48.0s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 23.1s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 7.7s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 24.1s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 22.0s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 29.6s / 💰 $0.19 | 🟢 100% (1/1) / ⏱️ 13.4s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 77.4s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 23.5s / 💰 $0.10 |
| 73a_time_window_anomaly 🔗 | 🟢 100% (1/1) / ⏱️ 243.4s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 109.5s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 34.2s / 💰 $0.10 | 🔴 0% (0/1) / ⏱️ 20.2s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 37.9s / 💰 $0.17 | 🔴 0% (0/1) / ⏱️ 21.4s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 68.3s / 💰 $0.37 | 🔴 0% (0/1) / ⏱️ 36.8s / 💰 $0.05 | 🔴 0% (0/1) / ⏱️ 24.4s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 50.9s / 💰 $0.19 |
| 73b_time_window_anomaly 🔗 | 🟢 100% (1/1) / ⏱️ 272.7s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 77.1s / 💰 $0.01 | 🔴 0% (0/1) / ⏱️ 57.9s / 💰 $0.21 | 🔴 0% (0/1) / ⏱️ 6.4s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 49.1s / 💰 $0.16 | 🔴 0% (0/1) / ⏱️ 34.0s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 56.5s / 💰 $0.32 | 🔴 0% (0/1) / ⏱️ 34.7s / 💰 $0.04 | 🔴 0% (0/1) / ⏱️ 13.3s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 53.5s / 💰 $0.21 |
| 96_no_matching_runbook 🔗 | 🟢 100% (1/1) / ⏱️ 195.6s / 💰 $0.02 | 🔴 0% (0/1) / ⏱️ 108.2s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 39.5s / 💰 $0.17 | 🟢 100% (1/1) / ⏱️ 12.7s / 💰 $0.03 | 🔴 0% (0/1) / ⏱️ 51.3s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 60.6s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 80.1s / 💰 $0.47 | 🔴 0% (0/1) / ⏱️ 43.9s / 💰 $0.06 | 🔴 0% (0/1) / ⏱️ 139.0s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 60.2s / 💰 $0.27 |
Results are generated automatically and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment ci-benchmark-22972733375.