NVIDIA DGX Spark Powers Intensive AI Workloads with Petaflop Performance, Large Memory, and Preinstalled AI Software
NVIDIA DGX Spark is designed to handle the most demanding AI workloads by delivering powerful performance in a compact, local system. Unlike traditional desktops and laptops, which often lack sufficient memory and specialized software, DGX Spark brings data-center-grade capabilities directly to developers. Powered by the NVIDIA Blackwell architecture, it delivers 1 petaflop of FP4 AI performance, 128 GB of coherent unified system memory, and 273 GB/s of memory bandwidth. It also ships with the full NVIDIA AI software stack preinstalled, so developers can run intensive AI tasks locally without relying on cloud instances or data-center queues.

The system excels at fine-tuning large language models. Full fine-tuning of a Llama 3.2 3B model reached a peak of 82,739 tokens per second at BF16 precision. Tuning a Llama 3.1 8B model with LoRA achieved 53,657 tokens per second, while fine-tuning a Llama 3.3 70B model with QLoRA reached 5,079 tokens per second at FP4 precision. These results highlight DGX Spark's ability to handle memory-heavy tasks that would be impossible on consumer GPUs with only 32 GB of memory.

In image generation, DGX Spark's large memory and strong compute allow high-resolution output and fast processing. Using the Flux.1 12B model at FP4 precision, it generates a 1K image every 2.6 seconds. It can also run the SDXL 1.0 model at BF16 to produce seven 1K images per minute, handling complex, high-quality image creation entirely locally.

For data science, DGX Spark leverages NVIDIA CUDA-X libraries such as cuML and cuDF. cuML accelerates machine learning algorithms such as UMAP and HDBSCAN, processing a 250 MB dataset in roughly 4 and 10 seconds respectively. cuDF accelerates pandas-style operations on datasets ranging from 0.5 to 5 GB, completing them in under 12 seconds, dramatically faster than CPU-based systems.
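The memory arithmetic behind these fine-tuning results is easy to sketch. The estimate below is illustrative only, not an official sizing tool: the per-parameter byte counts are rough assumptions that ignore activations and KV-cache overhead. It shows why a 70B model fits in DGX Spark's 128 GB only with 4-bit quantization, and why even a small model under full fine-tuning overwhelms a 32 GB consumer GPU.

```python
# Rough weight-memory estimates per precision (illustrative assumptions).
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}

def weight_gb(params_billion: float, precision: str) -> float:
    """Approximate memory needed just to hold the weights, in GB."""
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1e9

# A 70B model: ~140 GB of weights in BF16, but only ~35 GB at FP4.
print(f"70B @ BF16: {weight_gb(70, 'bf16'):.0f} GB")  # 140 GB
print(f"70B @ FP4:  {weight_gb(70, 'fp4'):.0f} GB")   # 35 GB

# Full fine-tuning in BF16 also stores gradients (~2 bytes/param) and
# Adam optimizer state (~8 bytes/param), so a 3B model already needs
# roughly 3e9 * (2 + 2 + 8) bytes before activations are counted.
full_ft_gb = 3 * (2 + 2 + 8)  # GB for a 3B model: weights + grads + Adam
print(f"3B full fine-tune (BF16 + Adam): ~{full_ft_gb} GB")  # ~36 GB
```

By this rough accounting, a 3B full fine-tune already exceeds a 32 GB consumer card, an 8B LoRA run fits comfortably in 128 GB, and a 70B model is practical only via QLoRA at 4-bit precision, which matches the benchmark configurations above.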
For inference workloads, DGX Spark benefits from the NVFP4 and MXFP4 4-bit formats, which deliver accuracy close to FP8 while reducing memory usage and increasing speed, enabling efficient model deployment. Benchmarks show high throughput across multiple models: Qwen3 14B achieved 5,928 tokens per second in prompt processing, while GPT-OSS-20B reached 3,670 tokens per second. The system also supports multi-node setups: two DGX Sparks connected via their ConnectX-7 network interfaces ran the 235B-parameter Qwen3 model at 11.73 tokens per second in generation, a workload typically reserved for large cloud infrastructure.

The new NVFP4 version of the NVIDIA Nemotron Nano 2 model runs efficiently on DGX Spark, delivering up to twice the throughput with negligible accuracy loss. Developers can access model checkpoints on Hugging Face or as NVIDIA NIM containers.

With its powerful hardware, extensive software support, and ability to run models that usually require cloud or data-center resources, DGX Spark empowers developers to innovate faster, experiment locally, and scale their AI projects from concept to deployment, all without leaving their desks.
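To put throughput figures like these in context, a simple two-phase latency model separates prompt processing (prefill) from token generation (decode). The sketch below uses the Qwen3 14B prefill number reported above; the decode rate is a hypothetical value chosen purely for illustration, and real serving stacks add batching and scheduling effects this model ignores.

```python
# Back-of-the-envelope request-latency model (illustrative only).
def request_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tps: float, decode_tps: float) -> float:
    """Time to process the prompt plus generate the output tokens."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Prefill rate from the Qwen3 14B benchmark above: 5,928 tokens/s.
# decode_tps=40 is an assumed, hypothetical generation rate.
t = request_seconds(prompt_tokens=4096, output_tokens=256,
                    prefill_tps=5928, decode_tps=40)
print(f"{t:.1f} s")  # ≈ 7.1 s: ~0.7 s of prefill, ~6.4 s of decode
```

The split illustrates why high prefill throughput matters for long-context workloads: even a 4K-token prompt adds well under a second, leaving decode speed as the dominant cost of a request.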
