HELIX v1.2 — Validated on IBM Fusion · March 2026

Benchmark Results

HELIX was validated against IBM Granite 4.0 Hybrid Small running on an AMD EPYC 9254 at IBM Fusion. Every number was measured on identical hardware; the only variable is HELIX.

These benchmarks show what HELIX adds to your model, not the model's standalone performance. The deltas are what you gain.

HumanEval pass@1: ~+10pp
  • 93.90% with HELIX vs ~83–84% baseline
  • 154/164 problems · 0 runtime errors

GSM8K Accuracy: +1.82pp
  • 92.42% with HELIX vs 90.60% baseline
  • 1,219/1,319 problems · full benchmark

Throughput Gain: 2.1×
  • 14.4 tok/s with HELIX vs 6.8 tok/s baseline
  • Same pod · no hardware changes

What this means: If you're running Granite 4.0 Hybrid Small today without HELIX, these are the gains measured after adding HELIX to the same model on the same hardware.
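The headline figures can be re-derived from the raw counts quoted above. The following is a minimal sketch of that arithmetic; the variable names are illustrative and not part of HELIX.

```python
# Recompute the headline figures from the counts and rates quoted above.
humaneval_passed, humaneval_total = 154, 164
gsm8k_correct, gsm8k_total = 1219, 1319
gsm8k_baseline_pct = 90.60
helix_tok_s, baseline_tok_s = 14.4, 6.8

# Pass rates as percentages.
humaneval_pct = 100 * humaneval_passed / humaneval_total
gsm8k_pct = 100 * gsm8k_correct / gsm8k_total

# Accuracy delta in percentage points, using the published two-decimal figure.
gsm8k_delta_pp = round(gsm8k_pct, 2) - gsm8k_baseline_pct

# Throughput ratio between the HELIX run and the baseline run.
speedup = helix_tok_s / baseline_tok_s

print(f"HumanEval pass@1: {humaneval_pct:.2f}%")
print(f"GSM8K accuracy:   {gsm8k_pct:.2f}%")
print(f"GSM8K delta:      +{gsm8k_delta_pp:.2f}pp")
print(f"Throughput gain:  {speedup:.1f}x")
```

The HumanEval delta is left out because the baseline is quoted only as a range (~83–84%).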

HELIX vs Baseline — Full Comparison

Metric               No HELIX    + HELIX      Delta
GSM8K Accuracy       90.60%      92.42%       +1.82pp
HumanEval pass@1     ~83–84%     93.90%       ~+10pp
Throughput           6.8 tok/s   14.4 tok/s   2.1× faster
Active params/token  9B (MoE)    1.125B       ~87.5% reduction
Completion errors    Present     0            Zero across all runs
Telemetry            None        100%         Per-token UTS

MMLU STEM: 71.50% (18 domains)

Hardware: AMD EPYC 9254 shared pod, IBM Fusion HPC. Multiple concurrent workloads were active during the run, so the figures are conservative and production-realistic. Validated March 2026.

Test Conditions

Hardware

  • AMD EPYC 9254 shared production pod
  • IBM Fusion HPC infrastructure
  • Multiple concurrent workloads active
  • Production-realistic, not isolated benchmark conditions

Model Under Test

  • IBM Granite 4.0 Hybrid Small (32B total, 9B MoE active)
  • HELIX slice: 1.125B active parameters per token
  • 263,786 slice rerank operations · zero failures
  • 5-hour continuous benchmark run
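
The parameter and rerank figures above can be related with simple arithmetic. This is a rough sketch, assuming the run lasted exactly 5 hours; the source does not say how rerank operations map to tokens, so the rate below is only an average over the run. Variable names are illustrative.

```python
# Figures taken from the test conditions above (parameters in billions).
total_params_b = 32.0     # total parameters
moe_active_b = 9.0        # active per token without HELIX (MoE)
helix_active_b = 1.125    # active per token with the HELIX slice
rerank_ops = 263_786      # slice rerank operations over the run
run_hours = 5             # assumption: exactly 5 hours of wall-clock time

# Reduction relative to the model's normal active parameter count.
reduction_vs_active = 1 - helix_active_b / moe_active_b

# Share of the full 32B model that executes per token with HELIX.
fraction_of_total = helix_active_b / total_params_b

# Average rerank rate over the whole run.
ops_per_second = rerank_ops / (run_hours * 3600)

print(f"Reduction vs 9B active: {reduction_vs_active:.1%}")
print(f"Share of 32B total:     {fraction_of_total:.2%}")
print(f"Average rerank rate:    {ops_per_second:.2f} ops/s")
```

The 87.5% result matches the comparison table; the "~90%" quoted elsewhere on this page is consistent with rounding that figure.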

Why Accuracy Improves

Accuracy improves because irrelevant parameter activations inject noise into generation. HELIX eliminates that noise — it does not change the model, only which parameters execute. Fewer irrelevant activations = cleaner signal = better output. This is "addition by subtraction."
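
The source does not describe how HELIX chooses which parameters execute. As a purely illustrative toy, and not HELIX's actual mechanism, the sketch below shows what "filtering at execution" can look like in general: stored weights are never modified, and a per-token selection decides which blocks run. The functions `select_slice` and `run_filtered`, the block weights, and the relevance scores are all hypothetical.

```python
from typing import List, Sequence


def select_slice(scores: Sequence[float], keep: int) -> List[int]:
    """Return the indices of the `keep` highest-scoring blocks (hypothetical ranking step)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def run_filtered(blocks: Sequence[float], x: float, active: Sequence[int]) -> float:
    """Apply only the active blocks to x; inactive weights stay stored but are not executed."""
    return sum(blocks[i] * x for i in active)


# Toy model: 8 parameter blocks, each reduced to a single weight.
blocks = [0.5, -1.0, 2.0, 0.1, 0.3, -0.2, 1.5, 0.05]
# Hypothetical per-token relevance score for each block.
scores = [0.1, 0.0, 0.9, 0.2, 0.1, 0.0, 0.3, 0.0]

active = select_slice(scores, keep=1)
output = run_filtered(blocks, x=1.0, active=active)
excluded_fraction = 1 - len(active) / len(blocks)

print(f"Active blocks: {active}")
print(f"Output: {output}")
print(f"Excluded per token: {excluded_fraction:.1%}")
```

Keeping 1 of 8 blocks mirrors the 1.125B-of-9B ratio quoted above. The toy shows only the selection step and makes no claim about accuracy.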

Parameters excluded per token: ~90%
  • Not destroyed; filtered at execution

Slice rerank operations: 263,786
  • Zero failures across 5-hour run

Telemetry coverage: 100%
  • Per-token UTS on every token

Live Benchmark Recording: GPU vs CPU vs CPU + HELIX

GPU (L40S)
  • Reference baseline
  • IBM Fusion · same cluster

CPU Full Model
  • AMD EPYC 9254 · 9B active
  • 6.8 tok/s · 90.60% GSM8K

CPU + HELIX Slice ✓
  • 1.125B active · ~90% reduction
  • 14.4 tok/s · 92.42% GSM8K

Note: The video shows HELIX with full per-token UTS telemetry logging enabled. The 2.1× figure comes from this logged configuration; production deployments without logging run materially faster.