StepFun today released Step 3.7 Flash, a multimodal Mixture-of-Experts model targeting agentic use cases. It adds native vision input and improved tool-use reliability over Step 3.5 Flash.
What is Step 3.7 Flash?
Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model. It pairs a 196B-parameter language backbone with a 1.8B-parameter vision encoder (ViT) for native image understanding.
The model activates approximately 11B parameters per token during inference. In MoE architectures, only a subset of “expert” sub-networks fires per forward pass — not the full network. This keeps inference compute closer to an 11B dense model while maintaining a 198B total parameter budget.
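As a rough illustration of sparse activation, the sketch below routes one token through a toy top-k gate and then computes the active share of parameters from the figures above. The expert count, the value of `k`, the gate scores, and the function name `top_k_route` are all assumptions for illustration; the source does not describe Step 3.7 Flash's actual router.

```python
import math

def top_k_route(gate_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Pick the k experts with the largest gate logits.
    chosen = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)[:k]
    # Softmax over the chosen logits only, so the weights sum to 1.
    peak = max(gate_logits[i] for i in chosen)
    exps = [math.exp(gate_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Hypothetical gate scores for 8 experts; only 2 of them run for this token.
logits = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.9]
routes = top_k_route(logits, k=2)
print(routes)  # experts 1 and 4 are selected

# Share of parameters active per token, using the article's 11B and 198B.
active_fraction = 11 / 198
print(f"{active_fraction:.1%}")  # 5.6%
```

The unselected experts contribute no compute for this token, which is why per-token cost tracks the active parameter count rather than the total.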
Key specs:
| Spec | Value |
|---|---|
| Total parameters | 198B (196B language + 1.8B ViT) |
| Active parameters per token | ~11B |
| Context window | 256k tokens |
| Throughput | Up to 400 tokens/sec |
| Reasoning levels | Low, medium, high |
| License | Apache 2.0 |
Architecture Notes
The vision encoder runs as a separate 1.8B ViT module. It injects image representations into the language backbone’s context. Step 3.5 Flash had no multimodal support; this is a new addition in 3.7.
Three selectable reasoning depths — low, medium, and high — let developers trade latency for reasoning depth. Low is faster and cheaper; high applies more computation per response.
Agentic Coding Performance
On SWE-Bench Pro, Step 3.7 Flash scores 56.26%, up from Step 3.5 Flash’s 51.3% — a gain of roughly 5 percentage points. On Terminal-Bench 2.1, it scores 59.55%, up from 53.37%.
On SWE-MTLG (a multi-task long-generation coding benchmark), it scores 72.42%.
Cross-harness consistency on StepFun’s internal Step-SWE-Bench:
| Scaffold | Step 3.7 Flash | Step 3.5 Flash |
|---|---|---|
| Hermes Agent | 67.5% | 60.0% |
| OpenClaw | 67.0% | 47.0% |
| KiloCode | 67.5% | 59.0% |
| RooCode | 64.5% | 43.0% |
| Claude Code | 71.5% | 73.0% |
| OpenCode | 64.5% | 57.0% |
Step 3.5 Flash ranged from 43.0% to 73.0% across harnesses, a 30-point spread. Step 3.7 Flash ranges from 64.5% to 71.5%, a 7-point spread, and improves on every scaffold except Claude Code, where it slips from 73.0% to 71.5%. In production, coding agents often run inside heterogeneous scaffolds, each with its own prompting conventions and tool schemas. A narrower spread across harnesses means more predictable behavior across different setups.
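The consistency claim can be checked directly from the table. The short sketch below copies the six per-scaffold scores and computes the spread (max minus min) and the population standard deviation for each model; the variable and function names are illustrative only.

```python
from statistics import pstdev

# Step-SWE-Bench scores (%) per scaffold: (Step 3.7 Flash, Step 3.5 Flash).
scores = {
    "Hermes Agent": (67.5, 60.0),
    "OpenClaw": (67.0, 47.0),
    "KiloCode": (67.5, 59.0),
    "RooCode": (64.5, 43.0),
    "Claude Code": (71.5, 73.0),
    "OpenCode": (64.5, 57.0),
}

def spread(values):
    """Difference between the best and worst score."""
    return max(values) - min(values)

v37 = [new for new, _ in scores.values()]
v35 = [old for _, old in scores.values()]

print(spread(v37), spread(v35))  # 7.0 30.0
print(round(pstdev(v37), 2), round(pstdev(v35), 2))

# Scaffolds where the newer model scores lower than the older one.
regressions = [name for name, (new, old) in scores.items() if new < old]
print(regressions)  # ['Claude Code']
```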
Advisor Mode
Step 3.7 Flash supports Advisor Mode, StepFun’s implementation of the advisor strategy described by Anthropic. The model runs the agentic loop end-to-end — calling tools, reading results, iterating — and escalates to a larger advisor model only at specific inflection points, such as planning or recovering from repeated failures. Most of the run stays at executor cost.
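To make the control flow concrete, here is a minimal sketch of an executor loop that consults an advisor only at two inflection points: once for an initial plan, and again after repeated failures. It is not StepFun's or Anthropic's implementation. The function `run_with_advisor`, the callables `executor_step` and `ask_advisor`, and the `failure_limit` threshold are all hypothetical names chosen for illustration.

```python
def run_with_advisor(task, executor_step, ask_advisor,
                     max_steps=20, failure_limit=2):
    """Toy executor loop that escalates to an advisor only when needed."""
    advisor_calls = 0
    # Inflection point 1: ask the advisor for a plan before starting.
    hint = ask_advisor(task, reason="plan")
    advisor_calls += 1
    consecutive_failures = 0
    for step in range(1, max_steps + 1):
        ok, done = executor_step(task, hint)
        if done:
            return {"solved": True, "steps": step,
                    "advisor_calls": advisor_calls}
        consecutive_failures = 0 if ok else consecutive_failures + 1
        # Inflection point 2: escalate after repeated failures.
        if consecutive_failures >= failure_limit:
            hint = ask_advisor(task, reason="recover")
            advisor_calls += 1
            consecutive_failures = 0
    return {"solved": False, "steps": max_steps,
            "advisor_calls": advisor_calls}

# Stub executor: fails until the advisor has supplied a recovery hint.
def demo_executor(task, hint):
    if hint == "recover-hint":
        return True, True
    return False, False

# Stub advisor: returns a label instead of calling a real model.
def demo_advisor(task, reason):
    return f"{reason}-hint"

result = run_with_advisor("fix failing test", demo_executor, demo_advisor)
print(result)  # {'solved': True, 'steps': 3, 'advisor_calls': 2}
```

In this toy run the advisor is called twice across three executor steps, which mirrors the idea that most of the run stays at executor cost.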
With Advisor Mode enabled on SWE-Bench Verified, StepFun reports Step 3.7 Flash reaches 97% of Claude Opus 4.6’s coding performance at roughly one-ninth the per-task cost ($0.19 vs. $1.76 per task). These are StepFun’s internal figures.
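The 97% and one-ninth figures follow from the numbers StepFun reports: SWE-Bench Verified scores of 76.3% for Step 3.7 Flash with Advisor Mode and 78.7% for Claude Opus 4.6, plus the two per-task costs. The sketch below only reproduces that arithmetic; the variable names are illustrative.

```python
# StepFun's reported SWE-Bench Verified figures.
flash_score, opus_score = 76.3, 78.7  # percent
flash_cost, opus_cost = 0.19, 1.76    # USD per task

relative_score = flash_score / opus_score
cost_ratio = opus_cost / flash_cost

print(f"relative score: {relative_score:.1%}")
print(f"cost ratio: {cost_ratio:.1f}x")  # about 9.3x, i.e. roughly one-ninth
```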
Multimodal Capabilities
Step 3.7 Flash supports two visual tool pathways:
Visual Search Tool: For recognition tasks where the model’s parametric knowledge is insufficient (long-tail entities, recently emerged concepts), it invokes a visual search tool to retrieve and verify. On SimpleVQA (with Search), it scores 79.16%, comparable to GPT 5.5 (79.11%) and above Kimi K2.6 (78.24%) and GLM 5V Turbo (78.20%).
Python Tool: For fine-grained visual tasks (high-resolution images, visual probing, bounding-box analysis), it uses a code interface to crop, zoom, and draw pixels or bounding boxes. On V* (a self-tested score, with the Python tool), it scores 95.29%. On HR-Bench 4K and HR-Bench 8K, it scores 89.13% and 86.34% respectively.
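The crop-and-zoom operations mentioned for the Python tool can be sketched in a few lines. The source does not say which imaging library the model calls, so this example uses plain nested lists as a stand-in for an image; the functions `crop` and `zoom` and the box convention are assumptions for illustration.

```python
def crop(image, box):
    """Return the region inside box = (left, top, right, bottom).

    right and bottom are exclusive, as in Python slicing.
    """
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]

def zoom(image, factor):
    """Nearest-neighbour upscaling by an integer factor."""
    out = []
    for row in image:
        # Repeat each pixel horizontally, then repeat the row vertically.
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out

# A 4x6 grayscale "image" whose pixel value encodes its row and column.
image = [[r * 10 + c for c in range(6)] for r in range(4)]

patch = zoom(crop(image, (2, 1, 4, 3)), 2)
for row in patch:
    print(row)
# [12, 12, 13, 13]
# [12, 12, 13, 13]
# [22, 22, 23, 23]
# [22, 22, 23, 23]
```

Cropping first and zooming second keeps the work proportional to the region of interest, which is the point of inspecting a small area of a high-resolution image.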
StepFun notes an observed behavior during testing: the model combined visual tools with non-visual tools without being explicitly trained to do so. For example, after generating frontend code, it used the GUI to render and inspect the result before iterating. StepFun describes this as emergent compositional tool use.
On Android Daily (long-horizon phone UI task completion), Step 3.7 Flash scores 61.87%, ahead of Kimi K2.6 (53.36%) and GLM 5V Turbo (51.68%). Gemini 3 Flash (63.21%) leads this benchmark.
Search and Research Benchmarks
StepFun focused this model’s search design on planning, evidence filtering, and synthesis — integrating search as part of the reasoning loop rather than a separate add-on.
| Benchmark | Step 3.7 Flash | Notable comparison |
|---|---|---|
| HLE with Tools (acc) | 47.20% | DeepSeek V4 Flash: 45.10% |
| BrowseComp (acc) | 75.82% | Claude Opus 4.7: 79.30% |
| DeepSearchQA (F1) | 92.82% | Kimi K2.6: 92.50% |
| ResearchRubrics (score) | 71.68% | GPT 5.5: 61.50% |
Note: The HLE with Tools score of 47.20% compares to Step 3.5 Flash’s text-only score of 35.68%. Step 3.5 Flash did not support tool-augmented evaluation on HLE.
General Agent Benchmarks
| Benchmark | Step 3.7 Flash | Description |
|---|---|---|
| Toolathlon | 49.51% | Multi-tool coordination |
| ClawEval-1.1 | 67.07% | Daily autonomous task execution in realistic environments |
| GDPval (44 occupations) | 45.8% | General professional task execution |
| Tau2-bench Telecom | >98% | Across different reasoning difficulty tiers |
On ClawEval-1.1, Step 3.7 Flash (67.07%) leads DeepSeek V4 Flash (57.80%) and DeepSeek V4 Pro (59.80%) among the compared models.
Long-Context Performance
On AA-LCR (a long-context reasoning benchmark, avg@16/acc), Step 3.7 Flash scores 63.94%. That is marginally above DeepSeek V4 Flash (63.70%) and below DeepSeek V4 Pro (66.30%).
Pricing
| Token Type | Price |
|---|---|
| Input (cache miss) | $0.20 / M tokens |
| Input (cache hit) | $0.04 / M tokens |
| Output | $1.15 / M tokens |
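As a worked example of the price table, the sketch below estimates the cost of a single request from its token counts. The function name `request_cost`, the `cached_tokens` parameter, and the sample token counts are assumptions for illustration; actual billing rules are defined by the provider.

```python
# USD per million tokens, from the pricing table above.
PRICE_PER_M = {"input_miss": 0.20, "input_hit": 0.04, "output": 1.15}

def request_cost(input_tokens, output_tokens, cached_tokens=0):
    """Estimate the USD cost of one request.

    cached_tokens is the portion of input_tokens served from cache.
    """
    uncached = input_tokens - cached_tokens
    total = (uncached * PRICE_PER_M["input_miss"]
             + cached_tokens * PRICE_PER_M["input_hit"]
             + output_tokens * PRICE_PER_M["output"])
    return total / 1_000_000

# Hypothetical agent turn: 100k input tokens (80k cached), 5k output tokens.
cost = request_cost(input_tokens=100_000, output_tokens=5_000,
                    cached_tokens=80_000)
print(f"${cost:.5f}")

# The same turn with no cache hits, for comparison.
print(f"${request_cost(100_000, 5_000):.5f}")
```

Long agentic runs tend to resend a large shared prefix on every turn, so the cache-hit rate can matter as much as the output price.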
Marktechpost’s Visual Explainer
Slide 1 of 8 — Overview
What Is Step 3.7 Flash?
Step 3.7 Flash is a sparse Mixture-of-Experts (MoE) vision-language model from StepFun. It combines a 196B-parameter language backbone with a 1.8B-parameter Vision Transformer (ViT) encoder for native image understanding.
In a MoE model, only a subset of “expert” sub-networks activates per token — not the full network. This keeps inference compute close to an 11B dense model while maintaining 198B total parameters.
Context window: 256k tokens
Reasoning levels: Low / Medium / High
Slide 2 of 8 — Architecture
Architecture Notes
The 1.8B ViT encoder runs as a separate module and injects image representations into the language backbone’s context. Step 3.5 Flash was text-only; native multimodal support is new in 3.7.
Three selectable reasoning depths let developers balance speed and cost:
- Low — Fastest, cheapest. Suitable for simple completions.
- Medium — Balanced cost and reasoning depth.
- High — More compute per response. Best for complex agent tasks.
MoE routing means you pay compute for ~11B active params per token, not 198B, although all 198B parameters must still be held in memory. This is the core efficiency trade-off in Flash-tier models.
Slide 3 of 8 — Agentic Coding
Agentic Coding Performance
Step 3.7 Flash scores 56.26% on SWE-Bench Pro (up from 51.3% in 3.5 Flash) and 59.55% on Terminal-Bench 2.1 (up from 53.37%). On SWE-MTLG it scores 72.42%.
Per-harness scores on StepFun’s internal Step-SWE-Bench:
| Scaffold | 3.7 Flash | 3.5 Flash |
|---|---|---|
| Hermes Agent | 67.5% | 60.0% |
| OpenClaw | 67.0% | 47.0% |
| KiloCode | 67.5% | 59.0% |
| RooCode | 64.5% | 43.0% |
| Claude Code | 71.5% | 73.0% |
| OpenCode | 64.5% | 57.0% |
3.5 Flash ranged 43–73% across harnesses. 3.7 Flash narrows that to 64.5–71.5% — more predictable across heterogeneous scaffolds.
Slide 4 of 8 — Advisor Mode
Advisor Mode
Step 3.7 Flash supports Advisor Mode, StepFun’s implementation of the advisor strategy described by Anthropic. The model runs the full agentic loop — calling tools, reading results, iterating — and escalates to a larger advisor model only at specific inflection points.
- Escalates during planning or recovery from repeated failures
- Most of the run stays at executor (Flash) cost
- Large advisor model is consulted sparingly
SWE-Bench Verified results with Advisor Mode (StepFun internal figures):
| System | SWE-Bench Verified score | Cost per task |
|---|---|---|
| Step 3.7 Flash + Advisor | 76.3% | $0.19 |
| Claude Opus 4.6 | 78.7% | $1.76 |
Slide 5 of 8 — Multimodal
Multimodal Capabilities
Step 3.7 Flash supports two visual tool pathways:
- Visual Search Tool — Invoked for long-tail entity recognition or recently emerged concepts where parametric knowledge is insufficient. SimpleVQA (Search): 79.16%
- Python Tool — Code interface for cropping, zooming, pixel/bounding-box operations on high-resolution images. V* (Python): 95.29% | HR-Bench 4K: 89.13% | HR-Bench 8K: 86.34%
Android Daily (long-horizon phone UI tasks): Step 3.7 Flash scores 61.87%, ahead of Kimi K2.6 (53.36%) and GLM 5V Turbo (51.68%). Gemini 3 Flash leads at 63.21%.
StepFun reports emergent compositional tool use during testing — the model combined visual and non-visual tools without explicit training to do so.
Slide 6 of 8 — Search & Research
Search and Research Benchmarks
Search is integrated into the model’s reasoning loop rather than treated as an external add-on. StepFun focused training on search planning, evidence filtering, and synthesis.
| Benchmark | 3.7 Flash | Comparison |
|---|---|---|
| HLE w. Tools (acc) | 47.20% | DeepSeek V4 Flash: 45.10% |
| BrowseComp (acc) | 75.82% | Claude Opus 4.7: 79.30% |
| DeepSearchQA (F1) | 92.82% | Kimi K2.6: 92.50% |
| ResearchRubrics | 71.68% | GPT 5.5: 61.50% |
HLE comparison: Step 3.5 Flash scored 35.68% text-only. Step 3.7 Flash scores 47.20% with tool access — these are not apples-to-apples.
Slide 7 of 8 — Deployment
Pricing, Deployment & Ecosystem
| Token Type | Price |
|---|---|
| Input (cache miss) | $0.20 / M tokens |
| Input (cache hit) | $0.04 / M tokens |
| Output | $1.15 / M tokens |
Available on: StepFun Platform, OpenRouter, NVIDIA NIM, DeepInfra (soon), Fireworks AI (soon), Modal (soon)
Inference backends: vLLM, SGLang, Hugging Face Transformers (requires v5.0+), llama.cpp
Quantization formats: BF16, FP8, NVFP4, GGUF
Local minimum: 120 GB unified memory/VRAM
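A back-of-the-envelope estimate helps relate the quantization formats to the 120 GB local minimum. The sketch below computes the weight-only footprint of 198B parameters at nominal bit widths. The bits-per-weight values are approximations: real formats add scale and metadata overhead, GGUF sizes depend on the chosen quantization type, and KV cache and activations are not counted.

```python
TOTAL_PARAMS = 198e9  # total parameters, from the article

# Nominal bits per weight (assumption: ignores per-block scale overhead).
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}

def weight_gb(params, bits):
    """Weight-only footprint in decimal gigabytes."""
    return params * bits / 8 / 1e9

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name}: ~{weight_gb(TOTAL_PARAMS, bits):.0f} GB")
# BF16: ~396 GB
# FP8: ~198 GB
# NVFP4: ~99 GB
```

Under these rough numbers only the 4-bit estimate fits below 120 GB, which suggests the stated minimum assumes a roughly 4-bit quantization. That reading is an inference from the arithmetic, not something the source states.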
Slide 8 of 8 — Key Takeaways
Key Takeaways
- 198B sparse MoE model with ~11B active params per token and a 256k context window
- Native multimodal support (images, GUIs, documents) — Step 3.5 Flash was text-only
- Advisor Mode scores 76.3% on SWE-Bench Verified at $0.19/task vs. Claude Opus 4.6 at $1.76
- Cross-harness coding variance narrowed from 43–73% (3.5) to 64.5–71.5% (3.7)
- Released Apache 2.0 with BF16, FP8, NVFP4, and GGUF weights on Hugging Face
Compatible harnesses: Claude Code, KiloCode, Hermes Agent, OpenClaw
Where to Run Step 3.7 Flash
Step 3.7 Flash, StepFun’s 198B MoE vision-language model, is available through hosted APIs (live now) and as open weights under Apache 2.0.
Sources: StepFun model page, Hugging Face, GitHub, OpenRouter, NVIDIA Technical Blog. Accurate as of May 29, 2026.
