Run Step 3.7 Flash on NVIDIA GPUs
StepFun has released Step 3.7 Flash, a new 198-billion-parameter multimodal model designed for enterprise production environments. Running on NVIDIA-accelerated infrastructure, this Mixture-of-Experts vision-language model activates approximately 11 billion parameters per forward pass. It combines perception, search, and multi-step reasoning to process images, documents, video, and text in real time. The model features a 256,000-token context window, native video and image input, and three configurable reasoning levels to suit varying computational needs. Step 3.7 Flash targets high-throughput use cases such as financial analysis and concurrent coding agents. Developers can optimize inference using StepFun's NVFP4-quantized checkpoints available on Hugging Face, which reduce memory bandwidth and storage requirements. The model supports deployment through open-source frameworks like SGLang, NVIDIA TensorRT-LLM, and vLLM, all optimized for NVIDIA hardware. For prototyping and evaluation, NVIDIA provides GPU-accelerated endpoints via build.nvidia.com. A demo notebook illustrates how Step 3.7 Flash works with NVIDIA Nemotron Parse to extract structured insights from complex documents like financial reports and scientific papers. These pipelines organize unstructured data into usable formats with bounding box accuracy. To facilitate production-ready deployment, NVIDIA NIM offers containerized inference microservices. NIM packages the model with performance tuning and standardized OpenAI-compatible APIs, allowing easy integration across on-premises, cloud, or hybrid environments. This approach streamlines the transition from development to live operations. Customization is supported through the NVIDIA NeMo Framework. Using the NeMo Automodel library, teams can perform Day 0 fine-tuning directly from Hugging Face checkpoints without conversion. This supports supervised fine-tuning and memory-efficient LoRA, achieving 600 tokens per second on Hopper GPUs. For large-scale training, the NeMo Megatron-Bridge recipe offers further performance optimizations. Hardware flexibility extends from data centers using NVIDIA Blackwell to desk-side solutions like the NVIDIA DGX Station. The DGX Station, with 748 gigabytes of coherent memory, provides ample headroom for the full 256k context length, enabling faster local iteration. NVIDIA maintains an active role in the open-source ecosystem, releasing hundreds of projects to promote transparency and AI safety. Users are encouraged to test Step 3.7 Flash on Hugging Face, utilize NVIDIA endpoints, or deploy locally using the vLLM Playbook. This integration ensures that StepFun's advanced AI capabilities are accessible and scalable for diverse enterprise needs.
