SpatialFT · AIPI 590.03

Live Benchmark

StepGame evaluation — 250 held-out examples, k = 1–5 reasoning steps

⏳ No live benchmark results yet. Trigger the Spatial Benchmark workflow in GitHub Actions to populate this page.

Accuracy Degradation

How accuracy falls as reasoning chains grow longer (k = 1–5). Dashed lines are offline reference scores from notebook 04.

Reference lines come from an offline evaluation on the same 250 held-out examples. Live results may differ slightly because of inference temperature and token streaming.
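As a rough illustration of how the per-k accuracy points behind such a chart could be computed, here is a minimal sketch. The record format (a tuple of k, predicted label, gold label) and the function name `accuracy_by_k` are assumptions made for this example, not taken from the benchmark's code, and the case-insensitive exact-match scoring rule is likewise assumed.

```python
from collections import defaultdict


def accuracy_by_k(records):
    """Return {k: accuracy} for an iterable of (k, predicted, gold) tuples.

    The tuple layout is hypothetical; adapt it to the real result schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for k, predicted, gold in records:
        total[k] += 1
        # Assumed scoring rule: case-insensitive exact match on the label.
        if predicted.strip().lower() == gold.strip().lower():
            correct[k] += 1
    return {k: correct[k] / total[k] for k in sorted(total)}


# Tiny made-up sample, for illustration only (not benchmark data).
sample = [
    (1, "left", "left"),
    (1, "above", "above"),
    (2, "right", "left"),
    (2, "below", "below"),
]
print(accuracy_by_k(sample))  # {1: 1.0, 2: 0.5}
```

Plotting the returned values against k, with the offline scores drawn as dashed lines, would give the degradation curve described above.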

Overall Results

Accuracy and latency for each evaluated model.

Model	Overall	k=1	k=2	k=3	k=4	k=5	vs Baseline	p50 Latency	Avg TPS
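The page does not define how the summary columns are derived, so the following is only a sketch under stated assumptions: "vs Baseline" is taken as the difference in overall accuracy from a baseline model, "p50 Latency" as the median per-request latency, and "Avg TPS" as total generated tokens divided by total generation time. The dictionary keys and the function name `summarize` are hypothetical.

```python
from statistics import median


def summarize(runs, baseline_accuracy):
    """Summarize one model's runs.

    runs: list of dicts with hypothetical keys
      'correct' (bool), 'latency_s' (float), 'tokens' (int).
    """
    n = len(runs)
    overall = sum(r["correct"] for r in runs) / n
    # p50 latency: the median of per-request latencies.
    p50 = median(r["latency_s"] for r in runs)
    # Assumed definition: tokens per second aggregated over all requests.
    tps = sum(r["tokens"] for r in runs) / sum(r["latency_s"] for r in runs)
    return {
        "overall": overall,
        "vs_baseline": overall - baseline_accuracy,
        "p50_latency_s": p50,
        "avg_tps": tps,
    }


# Made-up runs, for illustration only (not benchmark data).
runs = [
    {"correct": True, "latency_s": 1.0, "tokens": 20},
    {"correct": False, "latency_s": 2.0, "tokens": 40},
    {"correct": True, "latency_s": 3.0, "tokens": 60},
    {"correct": True, "latency_s": 2.0, "tokens": 40},
]
print(summarize(runs, baseline_accuracy=0.5))
```

Averaging per-request tokens-per-second instead of aggregating totals is an equally plausible definition of "Avg TPS"; the two can diverge when request lengths vary widely.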

spatialft/spatialft.github.io · lm-arena