Accuracy Degradation
How accuracy falls as reasoning chains grow longer (chain length k = 1–5). Dashed lines are offline reference scores from notebook 04.
Reference lines are from offline evaluation on the same 250 held-out examples. Live results may differ slightly due to inference temperature and token streaming.
Overall Results
Accuracy and latency for each evaluated model.
| Model | Overall | k=1 | k=2 | k=3 | k=4 | k=5 | vs Baseline | p50 Latency | Avg TPS |
|---|---|---|---|---|---|---|---|---|---|