⚡ March 15, 2026
Generated: 2026-03-15 04:11 UTC
Total Duration: 58m 37s
Iterations: 1
Judge (classifier) model: gpt-4.1
Fast Benchmark
Markers: regression or benchmark
Schedule: Weekly (Sunday 2 AM UTC)
Purpose: Quick regression tests to catch breaking changes
HolmesGPT is continuously evaluated against real-world Kubernetes and cloud troubleshooting scenarios.
If you find scenarios that HolmesGPT does not perform well on, please consider adding them as evals to the benchmark.
Model Accuracy Comparison
| Model | Pass | Fail | Skip/Error | Total | Success Rate |
|---|---|---|---|---|---|
| deepseek-r1-reasoner | 14 | 2 | 0 | 16 | 🟡 88% (14/16) |
| deepseek-v3.2-chat | 13 | 3 | 0 | 16 | 🟡 81% (13/16) |
| gemini-3.1-pro-preview | 14 | 2 | 0 | 16 | 🟡 88% (14/16) |
| gpt-5.3-codex | 6 | 10 | 0 | 16 | 🟡 38% (6/16) |
| gpt-5.4 | 13 | 3 | 0 | 16 | 🟡 81% (13/16) |
| haiku-4.5 | 13 | 3 | 0 | 16 | 🟡 81% (13/16) |
| opus-4.6 | 16 | 0 | 0 | 16 | 🟢 100% (16/16) |
| qwen-next-80B-instruct | 12 | 4 | 0 | 16 | 🟡 75% (12/16) |
| qwen-next-80B-thinking | 7 | 9 | 0 | 16 | 🟡 44% (7/16) |
| sonnet-4.6 | 16 | 0 | 0 | 16 | 🟢 100% (16/16) |
Model Cost Comparison
| Model | Tests | Avg Cost | Min Cost | Max Cost | Total Cost |
|---|---|---|---|---|---|
| deepseek-r1-reasoner | 16 | $0.02 | $0.00 | $0.03 | $0.26 |
| deepseek-v3.2-chat | 16 | $0.02 | $0.00 | $0.04 | $0.25 |
| gemini-3.1-pro-preview | 16 | $0.12 | $0.04 | $0.39 | $1.99 |
| gpt-5.3-codex | 16 | $0.03 | $0.00 | $0.08 | $0.41 |
| gpt-5.4 | 16 | $0.13 | $0.02 | $0.30 | $2.06 |
| haiku-4.5 | 16 | $0.06 | $0.02 | $0.12 | $0.94 |
| opus-4.6 | 16 | $0.32 | $0.12 | $0.51 | $5.15 |
| qwen-next-80B-instruct | 16 | $0.04 | $0.00 | $0.09 | $0.59 |
| qwen-next-80B-thinking | 16 | $0.03 | $0.00 | $0.09 | $0.49 |
| sonnet-4.6 | 16 | $0.18 | $0.07 | $0.31 | $2.89 |
Model Latency Comparison
| Model | Avg (s) | Min (s) | Max (s) | P50 (s) | P95 (s) |
|---|---|---|---|---|---|
| deepseek-r1-reasoner | 304.2 | 12.5 | 548.2 | 297.5 | 548.2 |
| deepseek-v3.2-chat | 188.1 | 13.9 | 377.0 | 207.9 | 377.0 |
| gemini-3.1-pro-preview | 38.0 | 11.8 | 121.1 | 28.1 | 121.1 |
| gpt-5.3-codex | 16.2 | 4.4 | 61.0 | 13.1 | 61.0 |
| gpt-5.4 | 47.9 | 15.7 | 183.8 | 46.3 | 183.8 |
| haiku-4.5 | 28.9 | 6.3 | 51.7 | 26.9 | 51.7 |
| opus-4.6 | 42.0 | 4.5 | 78.9 | 45.1 | 78.9 |
| qwen-next-80B-instruct | 32.2 | 3.2 | 62.3 | 32.2 | 62.3 |
| qwen-next-80B-thinking | 48.1 | 6.1 | 116.2 | 48.6 | 116.2 |
| sonnet-4.6 | 35.2 | 5.8 | 53.8 | 37.6 | 53.8 |
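The latency columns can be reproduced from the per-eval durations in the Detailed Raw Results table. The sketch below does this for the deepseek-r1-reasoner column. The helper names are hypothetical, and the percentile convention (take the sorted value at index `floor(p * n)`) is an assumption: the report does not state its method, but this convention matches the reported P50 and explains why P95 equals Max with only 16 samples.

```python
from statistics import fmean

# Durations in seconds for deepseek-r1-reasoner, copied from the
# Detailed Raw Results table (one value per eval, 16 in total).
DEEPSEEK_R1_SECONDS = [
    285.2, 522.2, 401.1, 548.2, 349.0, 294.0, 412.3, 267.8,
    173.8, 255.8, 28.2, 12.5, 97.8, 297.5, 432.9, 489.4,
]


def percentile(ordered, fraction):
    """Return the sorted value at index floor(fraction * n), capped at n - 1.

    Assumed convention: it reproduces the reported numbers, but the
    report itself does not document which percentile method it uses.
    """
    index = min(len(ordered) - 1, int(fraction * len(ordered)))
    return ordered[index]


def latency_summary(durations):
    """Summarize a non-empty list of durations like one row of the table."""
    ordered = sorted(durations)
    return {
        "avg": round(fmean(ordered), 1),
        "min": ordered[0],
        "max": ordered[-1],
        "p50": percentile(ordered, 0.50),
        "p95": percentile(ordered, 0.95),
    }


summary = latency_summary(DEEPSEEK_R1_SECONDS)
print(summary)
# {'avg': 304.2, 'min': 12.5, 'max': 548.2, 'p50': 297.5, 'p95': 548.2}
```

With 16 samples, index `floor(0.95 * 16)` is 15, the last element, so P95 and Max coincide for every model in this run.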
Performance by Tag
Success rate by test category and model:
| Tag | deepseek-r1-reasoner | deepseek-v3.2-chat | gemini-3.1-pro-preview | gpt-5.3-codex | gpt-5.4 | haiku-4.5 | opus-4.6 | qwen-next-80B-instruct | qwen-next-80B-thinking | sonnet-4.6 | Warnings |
|---|---|---|---|---|---|---|---|---|---|---|---|
| benchmark | 🟡 83% (5/6) | 🟡 67% (4/6) | 🟡 67% (4/6) | 🟡 50% (3/6) | 🟡 67% (4/6) | 🟡 50% (3/6) | 🟢 100% (6/6) | 🟡 50% (3/6) | 🟡 17% (1/6) | 🟢 100% (6/6) | |
| context_window | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | 🔴 0% (0/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | |
| counting | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | |
| datetime | 🟢 100% (3/3) | 🟡 67% (2/3) | 🟢 100% (3/3) | 🟡 67% (2/3) | 🟢 100% (3/3) | 🟡 33% (1/3) | 🟢 100% (3/3) | 🟢 100% (3/3) | 🟡 67% (2/3) | 🟢 100% (3/3) | |
| easy | 🟡 86% (6/7) | 🟡 86% (6/7) | 🟢 100% (7/7) | 🟡 29% (2/7) | 🟡 86% (6/7) | 🟢 100% (7/7) | 🟢 100% (7/7) | 🟡 86% (6/7) | 🟡 71% (5/7) | 🟢 100% (7/7) | |
| grafana-dashboard | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | |
| hard | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🔴 0% (0/2) | 🟢 100% (2/2) | |
| kubernetes | 🟡 89% (8/9) | 🟢 100% (9/9) | 🟢 100% (9/9) | 🟡 33% (3/9) | 🟡 78% (7/9) | 🟢 100% (9/9) | 🟢 100% (9/9) | 🟡 78% (7/9) | 🟡 33% (3/9) | 🟢 100% (9/9) | |
| logs | 🟡 80% (4/5) | 🟡 80% (4/5) | 🟡 80% (4/5) | 🟡 40% (2/5) | 🟡 60% (3/5) | 🟡 40% (2/5) | 🟢 100% (5/5) | 🟡 60% (3/5) | 🟡 20% (1/5) | 🟢 100% (5/5) | |
| loki | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🟡 50% (1/2) | 🟢 100% (2/2) | 🟢 100% (2/2) | 🟡 50% (1/2) | 🔴 0% (0/2) | 🟢 100% (2/2) | |
| medium | 🟢 100% (6/6) | 🟡 83% (5/6) | 🟡 83% (5/6) | 🟡 33% (2/6) | 🟡 83% (5/6) | 🟡 67% (4/6) | 🟢 100% (6/6) | 🟡 67% (4/6) | 🟡 17% (1/6) | 🟢 100% (6/6) | |
| metrics | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | |
| network | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| one-test | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | |
| port-forward | 🟢 100% (3/3) | 🟢 100% (3/3) | 🟢 100% (3/3) | 🟡 67% (2/3) | 🟡 67% (2/3) | 🟢 100% (3/3) | 🟢 100% (3/3) | 🟡 67% (2/3) | 🔴 0% (0/3) | 🟢 100% (3/3) | |
| question-answer | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | |
| regression | 🟡 90% (9/10) | 🟡 90% (9/10) | 🟢 100% (10/10) | 🟡 30% (3/10) | 🟡 90% (9/10) | 🟢 100% (10/10) | 🟢 100% (10/10) | 🟡 90% (9/10) | 🟡 60% (6/10) | 🟢 100% (10/10) | |
| runbooks | 🟢 100% (1/1) | 🔴 0% (0/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🟢 100% (1/1) | 🔴 0% (0/1) | 🔴 0% (0/1) | 🟢 100% (1/1) | |
| Overall | 🟡 88% (14/16) | 🟡 81% (13/16) | 🟡 88% (14/16) | 🟡 38% (6/16) | 🟡 81% (13/16) | 🟡 81% (13/16) | 🟢 100% (16/16) | 🟡 75% (12/16) | 🟡 44% (7/16) | 🟢 100% (16/16) | |
Raw Results
Status of all evaluations across models. Color coding:
- 🟢 Passing 100% (stable)
- 🟡 Passing 1-99%
- 🔴 Passing 0% (failing)
- 🔧 Mock data failure (missing or invalid test data)
- ⚠️ Setup failure (environment/infrastructure issue)
- ⏱️ Timeout or rate limit error
- ⏭️ Test skipped (e.g., known issue or precondition not met)
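The first three entries of the color coding, combined with the `N% (passed/total)` cell format used throughout the tables, can be sketched as a small formatting helper. The function name is hypothetical and the half-up rounding is an assumption: the report does not say how percentages are rounded, although half-up reproduces the values shown (for example 14/16 as 88% and 6/16 as 38%).

```python
def status_cell(passed: int, total: int) -> str:
    """Format a pass count the way the result tables display it.

    Icon thresholds follow the legend: green for 100%, red for 0%,
    yellow for anything in between. Rounding half-up is an assumption.
    """
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("expected 0 <= passed <= total and total > 0")

    if passed == total:
        icon = "🟢"
    elif passed == 0:
        icon = "🔴"
    else:
        icon = "🟡"

    # Integer arithmetic for round-half-up of 100 * passed / total.
    percent = (200 * passed + total) // (2 * total)
    return f"{icon} {percent}% ({passed}/{total})"


print(status_cell(14, 16))  # 🟡 88% (14/16)
print(status_cell(16, 16))  # 🟢 100% (16/16)
print(status_cell(0, 2))    # 🔴 0% (0/2)
```

The remaining legend icons (mock data failure, setup failure, timeout, skipped) describe run outcomes rather than pass rates, so they are outside the scope of this helper.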
Detailed Raw Results
| Eval ID | deepseek-r1-reasoner | deepseek-v3.2-chat | gemini-3.1-pro-preview | gpt-5.3-codex | gpt-5.4 | haiku-4.5 | opus-4.6 | qwen-next-80B-instruct | qwen-next-80B-thinking | sonnet-4.6 |
|---|---|---|---|---|---|---|---|---|---|---|
| 09_crashpod 🔗 | 🟢 100% (1/1) / ⏱️ 285.2s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 193.5s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 23.9s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 4.8s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 26.0s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 26.3s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 35.5s / 💰 $0.28 | 🟢 100% (1/1) / ⏱️ 27.3s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 48.6s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 37.6s / 💰 $0.18 |
| 100a_loki_historical_logs 🔗 | 🟢 100% (1/1) / ⏱️ 522.2s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 250.0s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 39.5s / 💰 $0.16 | 🔴 0% (0/1) / ⏱️ 26.9s / 💰 $0.04 | 🔴 0% (0/1) / ⏱️ 183.8s / 💰 $0.30 | 🟢 100% (1/1) / ⏱️ 51.7s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 78.9s / 💰 $0.43 | 🔴 0% (0/1) / ⏱️ 55.8s / 💰 $0.06 | 🔴 0% (0/1) / ⏱️ 19.8s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 45.9s / 💰 $0.20 |
| 101_loki_historical_logs_pod_deleted 🔗 | 🟢 100% (1/1) / ⏱️ 401.1s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 158.3s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 23.3s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 24.1s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 62.3s / 💰 $0.10 | 🟢 100% (1/1) / ⏱️ 26.9s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 63.7s / 💰 $0.37 | 🟢 100% (1/1) / ⏱️ 40.0s / 💰 $0.04 | 🔴 0% (0/1) / ⏱️ 20.5s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 51.7s / 💰 $0.22 |
| 108_logs_nearby_lines 🔗 | 🔴 0% (0/1) / ⏱️ 548.2s / 💰 $0.02 | 🔴 0% (0/1) / ⏱️ 377.0s / 💰 $0.04 | 🔴 0% (0/1) / ⏱️ 121.1s / 💰 $0.39 | 🔴 0% (0/1) / ⏱️ 6.1s / 💰 $0.00 | 🔴 0% (0/1) / ⏱️ 46.9s / 💰 $0.18 | 🔴 0% (0/1) / ⏱️ 37.5s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 45.1s / 💰 $0.32 | 🔴 0% (0/1) / ⏱️ 56.0s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 116.2s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 49.6s / 💰 $0.31 |
| 111_pod_names_contain_service 🔗 | 🟢 100% (1/1) / ⏱️ 349.0s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 213.6s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 28.1s / 💰 $0.08 | 🔴 0% (0/1) / ⏱️ 4.7s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 27.5s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 28.9s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 39.6s / 💰 $0.28 | 🟢 100% (1/1) / ⏱️ 32.6s / 💰 $0.04 | 🔴 0% (0/1) / ⏱️ 7.0s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 34.3s / 💰 $0.16 |
| 112_find_pvcs_by_uuid 🔗 | 🟢 100% (1/1) / ⏱️ 294.0s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 183.2s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 20.3s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 14.6s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 18.5s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 31.8s / 💰 $0.06 | 🟢 100% (1/1) / ⏱️ 24.0s / 💰 $0.20 | 🟢 100% (1/1) / ⏱️ 14.8s / 💰 $0.01 | 🔴 0% (0/1) / ⏱️ 20.0s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 27.9s / 💰 $0.15 |
| 12_job_crashing 🔗 | 🟢 100% (1/1) / ⏱️ 412.3s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 249.5s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 37.6s / 💰 $0.12 | 🔴 0% (0/1) / ⏱️ 6.8s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 48.0s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 23.3s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 37.9s / 💰 $0.27 | 🟢 100% (1/1) / ⏱️ 32.2s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 95.9s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 33.7s / 💰 $0.16 |
| 176_network_policy_blocking_traffic_no_runbooks 🔗 | 🟢 100% (1/1) / ⏱️ 267.8s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 239.7s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 26.8s / 💰 $0.09 | 🔴 0% (0/1) / ⏱️ 4.4s / 💰 $0.02 | 🔴 0% (0/1) / ⏱️ 46.3s / 💰 $0.11 | 🟢 100% (1/1) / ⏱️ 33.9s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 55.2s / 💰 $0.39 | 🔴 0% (0/1) / ⏱️ 44.4s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 71.7s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 46.4s / 💰 $0.22 |
| 179_grafana_big_dashboard_query 🔗 | 🟢 100% (1/1) / ⏱️ 173.8s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 111.2s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 16.7s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 19.2s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 21.2s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 23.6s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 20.7s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 12.9s / 💰 $0.01 | 🔴 0% (0/1) / ⏱️ 52.7s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 15.7s / 💰 $0.11 |
| 227_count_configmaps_per_namespace[0] 🔗 | 🟢 100% (1/1) / ⏱️ 255.8s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 107.9s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 31.1s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 29.6s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 20.8s / 💰 $0.21 | 🟢 100% (1/1) / ⏱️ 20.7s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 33.4s / 💰 $0.26 | 🟢 100% (1/1) / ⏱️ 26.1s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 63.0s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 26.1s / 💰 $0.14 |
| 24_misconfigured_pvc 🔗 | 🔴 0% (0/1) / ⏱️ 28.2s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 166.6s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 36.5s / 💰 $0.11 | 🔴 0% (0/1) / ⏱️ 4.5s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 21.2s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 29.8s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 53.6s / 💰 $0.35 | 🟢 100% (1/1) / ⏱️ 62.3s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 6.1s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 40.6s / 💰 $0.20 |
| 43_current_datetime_from_prompt 🔗 | 🟢 100% (1/1) / ⏱️ 12.5s / 💰 $0.00 | 🔴 0% (0/1) / ⏱️ 13.9s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 11.8s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 4.6s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 15.7s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 6.3s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 4.5s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 3.2s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 12.6s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 5.8s / 💰 $0.07 |
| 61_exact_match_counting 🔗 | 🟢 100% (1/1) / ⏱️ 97.8s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 92.9s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 14.4s / 💰 $0.07 | 🔴 0% (0/1) / ⏱️ 4.6s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 18.6s / 💰 $0.12 | 🟢 100% (1/1) / ⏱️ 22.8s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 28.4s / 💰 $0.18 | 🟢 100% (1/1) / ⏱️ 10.7s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 48.4s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 18.4s / 💰 $0.10 |
| 73a_time_window_anomaly 🔗 | 🟢 100% (1/1) / ⏱️ 297.5s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 207.9s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 28.0s / 💰 $0.13 | 🟢 100% (1/1) / ⏱️ 30.0s / 💰 $0.08 | 🟢 100% (1/1) / ⏱️ 67.4s / 💰 $0.19 | 🔴 0% (0/1) / ⏱️ 23.9s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 48.3s / 💰 $0.50 | 🟢 100% (1/1) / ⏱️ 17.0s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 74.6s / 💰 $0.05 | 🟢 100% (1/1) / ⏱️ 35.2s / 💰 $0.18 |
| 73b_time_window_anomaly 🔗 | 🟢 100% (1/1) / ⏱️ 432.9s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 227.9s / 💰 $0.02 | 🟢 100% (1/1) / ⏱️ 31.1s / 💰 $0.13 | 🔴 0% (0/1) / ⏱️ 61.0s / 💰 $0.01 | 🟢 100% (1/1) / ⏱️ 60.7s / 💰 $0.10 | 🔴 0% (0/1) / ⏱️ 23.1s / 💰 $0.04 | 🟢 100% (1/1) / ⏱️ 48.1s / 💰 $0.51 | 🟢 100% (1/1) / ⏱️ 25.2s / 💰 $0.03 | 🔴 0% (0/1) / ⏱️ 11.7s / 💰 $0.00 | 🟢 100% (1/1) / ⏱️ 40.3s / 💰 $0.20 |
| 96_no_matching_runbook 🔗 | 🟢 100% (1/1) / ⏱️ 489.4s / 💰 $0.02 | 🔴 0% (0/1) / ⏱️ 216.6s / 💰 $0.02 | 🔴 0% (0/1) / ⏱️ 117.4s / 💰 $0.28 | 🟢 100% (1/1) / ⏱️ 13.1s / 💰 $0.03 | 🟢 100% (1/1) / ⏱️ 81.7s / 💰 $0.14 | 🟢 100% (1/1) / ⏱️ 51.3s / 💰 $0.09 | 🟢 100% (1/1) / ⏱️ 56.0s / 💰 $0.46 | 🔴 0% (0/1) / ⏱️ 55.0s / 💰 $0.06 | 🔴 0% (0/1) / ⏱️ 100.4s / 💰 $0.07 | 🟢 100% (1/1) / ⏱️ 53.8s / 💰 $0.30 |
Results are generated automatically and updated weekly. Full traces and detailed analysis are available in the Braintrust experiment ci-benchmark-23102181491.