NVIDIA Unveils AI PC Upgrades: Faster LLMs and Diffusion Models on RTX with Open-Source Tools and New Audio-Video SDKs
AI development on consumer PCs is advancing rapidly, fueled by the growing performance and accessibility of small language models (SLMs) and diffusion models such as FLUX.2, GPT-OSS-20B, and Nemotron 3 Nano. Tools like ComfyUI, llama.cpp, Ollama, and Unsloth are seeing a surge in adoption: usage has doubled in the past year, and the number of developers running local models has grown tenfold. This shift marks a move from experimentation to real-world application, as developers build the next generation of AI software on NVIDIA RTX GPUs, from the data center to the desktop.

At CES 2026, NVIDIA unveiled a series of updates to accelerate AI on RTX PCs, focusing on open-source frameworks and performance improvements, developed in collaboration with the open-source community to enhance inference across the AI stack.

On the diffusion side, ComfyUI now delivers optimized performance on NVIDIA GPUs through PyTorch-CUDA, with support for NVFP4 and FP8 quantization. These formats reduce memory usage by 60% and 40%, respectively, enabling speedups of up to 3x with NVFP4 and 2x with FP8. Sample code and pre-trained checkpoints, including LTX-2, FLUX.2, FLUX.1-dev, FLUX.1-Kontext, Qwen-Image, and Z-Image, are available on Hugging Face.

For SLMs, performance on RTX PCs has improved significantly: token-generation throughput for mixture-of-experts (MoE) models is up 35% in llama.cpp and 30% in Ollama. These gains come from new features in llama.cpp, including GPU-based token sampling, which offloads algorithms such as top-k, top-p, and temperature sampling to the GPU for better response quality and speed. Concurrent QKV projections using multiple CUDA streams, MMVQ kernel optimizations, and faster model loading (up to 65% faster on DGX Spark and 15% on RTX) further boost performance. NVIDIA's new Blackwell GPUs also support native MXFP4, delivering up to 25% faster prompt processing.
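To illustrate what the offloaded sampling pipeline computes, here is a minimal CPU-side sketch of temperature, top-k, and top-p (nucleus) sampling in plain Python. This is not llama.cpp's implementation, just the standard algorithm the article refers to; the function name and defaults are illustrative.

```python
import math
import random

def sample_token(logits, temperature=0.8, top_k=40, top_p=0.95, rng=None):
    """Apply temperature scaling, then top-k, then top-p filtering,
    and sample one token id from the surviving distribution."""
    rng = rng or random.Random(0)
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax over the scaled logits.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]
    # Top-k: keep only the k most probable tokens.
    probs.sort(key=lambda ip: ip[1], reverse=True)
    probs = probs[:top_k]
    # Top-p: keep the smallest prefix whose cumulative mass reaches p.
    kept, cum = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the kept tokens and draw a sample.
    norm = sum(p for _, p in kept)
    r = rng.random() * norm
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

Running these three filters per token over a large vocabulary is exactly the kind of data-parallel work that benefits from being moved onto the GPU, avoiding a logits round-trip to the CPU at every step.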
Ollama has been updated to leverage these improvements, with new builds available for developers to test in tools like LM Studio and the Ollama App.

NVIDIA and Lightricks are also launching LTX-2, a powerful open-source audio-video model that runs locally on RTX AI PCs and DGX Spark. It generates up to 20 seconds of synchronized 4K video at 50 fps with multi-modal control, enabling advanced creative workflows. The model is available in BF16 and NVFP8 formats, with the quantized version reducing memory usage by 30%, making it efficient for local deployment.

To support agentic AI development, NVIDIA introduced Nemotron 3 Nano, a 32B-parameter MoE model with 3.6B active parameters and a 1M-token context window. It excels at coding, instruction following, long-context reasoning, and STEM tasks, and is optimized for RTX and DGX Spark via Ollama and llama.cpp. It supports LoRA-based fine-tuning and is fully open, with accessible weights, datasets, and training recipes to promote transparency and efficiency.

For retrieval-augmented generation (RAG), NVIDIA partnered with Docling, a document-processing tool developed at IBM and contributed to the Linux Foundation. Docling is optimized for RTX GPUs and delivers up to 4x faster performance than CPUs. It offers both traditional OCR and advanced VLM-based pipelines for complex multi-modal documents, available through vLLM on WSL and Linux.

Finally, the NVIDIA Video and Audio Effects SDKs have been enhanced: AI relighting is now 3x faster and requires only an RTX 3060 or higher, and the model size has been reduced by up to 6x, enabling broader accessibility. These updates are showcased in the latest version of NVIDIA Broadcast.

NVIDIA continues to work closely with the open-source community to deliver powerful, efficient tools for developers building AI applications on RTX PCs and DGX Spark. The ecosystem is now more capable than ever, empowering creators and engineers to innovate locally with high performance and privacy.
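The LoRA-based fine-tuning mentioned above works by freezing a pretrained weight matrix and learning a low-rank additive update. The sketch below shows the core arithmetic with NumPy on toy dimensions; the function name, rank, and alpha value are illustrative, not taken from any Nemotron training recipe.

```python
import numpy as np

def lora_update(W, A, B, alpha=16):
    """Effective weight with a LoRA adapter: W + (alpha / r) * B @ A,
    where r is the adapter rank (the number of rows of A)."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

# Toy dimensions: a 512x512 projection with a rank-8 adapter.
d_out, d_in, r = 512, 512, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in)).astype(np.float32)   # frozen base weight
A = rng.standard_normal((r, d_in)).astype(np.float32) * 0.01
B = np.zeros((d_out, r), dtype=np.float32)  # B starts at zero, so the adapter is initially a no-op

W_eff = lora_update(W, A, B)
trainable = A.size + B.size   # only A and B are trained
full = W.size
print(f"trainable fraction: {trainable / full:.3%}")
```

Only A and B are updated during fine-tuning (here about 3% of the parameters of the full matrix), which is what makes adapting a 32B-parameter model feasible on a single RTX GPU or DGX Spark.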
