
For Financial Services Firms, AI Inference Is As Challenging As Training

For financial services firms, AI inference has become just as challenging as training, and in many cases more complex, despite the lingering perception that inference is a simple afterthought. Early machine learning models were small and inference was straightforward; today’s generative AI models demand far more sophisticated handling, especially in industries where speed, accuracy, and security are paramount.

Financial institutions, including investment banks, insurance providers, and trading firms, now face a wide spectrum of inference workloads. These range from real-time fraud detection and risk assessment to customer-facing chatbots, personalized recommendations, and internal code assistance tools. Many of these systems must operate across diverse environments, from edge devices in bank branches and mobile phones to massive datacenter clusters.

A key challenge is running inference efficiently on a mix of hardware, spanning CPUs with vector and tensor accelerators, GPUs, FPGAs, and custom ASICs, while managing storage not as an afterthought but as a critical performance enabler. Unlike traditional high-performance computing, where storage was often overlooked, AI inference requires intelligent storage that preserves context, reduces redundant computation, and lowers latency.

One major innovation is the use of key-value and context window caches. These store intermediate results, such as token generation states, on fast flash or persistent memory so they do not have to be recomputed with every new token. This dramatically reduces GPU memory pressure and cuts inference costs, especially as context windows grow longer; a minimal sketch of the idea appears below. Companies like Vast Data enable this with platforms that extend memory capacity using persistent storage over high-speed networks such as RDMA, allowing systems to retain user sessions even after idle periods and avoiding the need to recompute entire conversations. As Jeff Denworth of Vast Data notes, the cost of inference grows quadratically with context length, making efficient storage essential.

Data orchestration is equally critical. Hammerspace’s global file system treats local NVMe storage on GPU servers as a Tier 0 distributed cache, moving data into position before inference jobs begin; a generic staging loop is also sketched below. By creating a unified view of all storage resources, Hammerspace ensures data arrives at the right GPU at the right time, minimizing bottlenecks.

Real-world examples illustrate the scale and complexity of these efforts. JPMorgan Chase launched IndexGPT in July 2024, using OpenAI’s GPT-4 to generate keywords for thematic stock indices. While the model itself is static, it automates a previously manual process and improves accuracy; the tool is now available through the Bloomberg and Vida platforms. Bank of America’s Erica, launched in 2018, has evolved into a widely used AI assistant with over 2.6 billion interactions and 20 million active users. Though it relies on traditional NLP and machine learning rather than generative models, it demonstrates the long-term value of AI in customer service.

Wells Fargo’s Fargo app, launched in 2022, takes a more advanced approach: it runs a tiny LLM on the device for speech-to-text, strips out personal data locally, and only then sends the query to a remote Google Gemini Flash model (the split is sketched below). With 245.5 million interactions in 2024 alone, the app highlights the explosive growth in AI-driven customer engagement.
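To make the caching idea concrete, here is a minimal Python sketch of a per-session key-value cache that can be spilled to persistent storage, plus a toy cost model showing the quadratic growth Denworth describes. The class, array shapes, and constants are illustrative assumptions, not Vast Data’s actual interface.

```python
import numpy as np

class SessionKVCache:
    """Minimal sketch of a per-session key/value cache, assuming one
    (tokens, head_dim) array of keys and values per transformer layer.
    Names and shapes are illustrative, not any vendor's API."""

    def __init__(self, num_layers: int, head_dim: int):
        self.keys = [np.empty((0, head_dim), dtype=np.float16) for _ in range(num_layers)]
        self.values = [np.empty((0, head_dim), dtype=np.float16) for _ in range(num_layers)]

    def append(self, layer: int, k: np.ndarray, v: np.ndarray) -> None:
        # Called once per generated token: only the new token's
        # projections are computed; the prefix is never re-encoded.
        self.keys[layer] = np.vstack([self.keys[layer], k])
        self.values[layer] = np.vstack([self.values[layer], v])

    def spill(self, path: str) -> None:
        # Evict an idle session from GPU/host memory to flash or
        # persistent memory; restore with np.load on the next turn
        # instead of recomputing the whole conversation.
        arrays = {f"k{i}": k for i, k in enumerate(self.keys)}
        arrays.update({f"v{i}": v for i, v in enumerate(self.values)})
        np.savez(path, **arrays)

def attention_flops(context_len: int, hidden_dim: int) -> int:
    # Rough self-attention cost model: work scales with the square of
    # context length, which is the quadratic growth described above.
    return 2 * context_len ** 2 * hidden_dim

# 8x the context costs ~64x the attention compute if recomputed from scratch:
print(attention_flops(8192, 4096) // attention_flops(1024, 4096))  # -> 64
```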
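The staging pattern can be illustrated generically: copy the files a job will need onto a GPU server’s local NVMe before the job is scheduled there. The function and paths below are hypothetical stand-ins and do not represent Hammerspace’s actual API.

```python
import shutil
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

def stage_to_tier0(manifest: list[str], shared: Path, tier0: Path) -> None:
    """Pre-copy a job's files from a shared namespace onto local NVMe
    (the 'Tier 0' cache) so GPUs are not left idle waiting on data."""
    tier0.mkdir(parents=True, exist_ok=True)

    def pull(name: str) -> None:
        dst = tier0 / name
        if not dst.exists():  # skip files already cached locally
            shutil.copy2(shared / name, dst)

    # Parallel copies keep the staging window short relative to the job.
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(pull, manifest))

# Example (hypothetical paths): stage model shards before launch.
# stage_to_tier0(["model-00.safetensors", "model-01.safetensors"],
#                Path("/mnt/shared"), Path("/mnt/nvme/cache"))
```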
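A rough sketch of that on-device/remote split follows; the regexes and stub functions stand in for the real components and are not Wells Fargo’s or Google’s actual interfaces.

```python
import re

# Illustrative patterns only; a production system would use far more
# robust PII detection than these three regexes.
PII_PATTERNS = [
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
    (re.compile(r"\b\d{13,16}\b"), "[CARD]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
]

def on_device_speech_to_text(audio: bytes) -> str:
    # Stand-in for the tiny local speech-to-text model.
    return "send $500 from card 4111111111111111 to jane@example.com"

def scrub(text: str) -> str:
    # Runs on the phone: personal data is replaced before any bytes
    # leave the device.
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

def call_remote_model(prompt: str) -> str:
    # Stand-in for the round trip to a hosted model such as Gemini Flash.
    return f"(model response to: {prompt})"

def handle_utterance(audio: bytes) -> str:
    return call_remote_model(scrub(on_device_speech_to_text(audio)))

print(handle_utterance(b""))
# -> (model response to: send $500 from card [CARD] to [EMAIL])
```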
To handle such workloads, financial firms are turning to next-generation infrastructure. Systems like Nvidia’s GB300 NVL72, built by Supermicro and others, pack 72 Blackwell GPUs into a single rack, delivering roughly 1.1 exaflops of FP4 inference performance (sanity-checked in the note below). The upcoming VR200 NVL144 will offer 3.6 exaflops of FP4 inference power, enabling complex chain-of-thought reasoning across hundreds of smaller models. These systems are not just about scale; they are necessary for high-precision, low-latency inference in regulated environments. While banks may move cautiously due to compliance concerns, hedge funds and trading firms are already pushing the limits.

In short, financial services firms are leading the charge in redefining AI inference, not just in capability but in infrastructure, storage, and system design. Their experience offers valuable lessons for all industries as AI becomes embedded in every layer of enterprise computing.
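A note on the rack-level arithmetic above: the per-GPU figure here is an assumption used to show why the rack number is measured in exaflops rather than petaflops; it is not a spec quoted in the article.

```python
# Back-of-envelope check, assuming roughly 15 petaflops of dense FP4
# per Blackwell Ultra GPU (an illustrative assumption):
gpus_per_rack = 72
fp4_pflops_per_gpu = 15
total_exaflops = gpus_per_rack * fp4_pflops_per_gpu / 1000
print(f"{total_exaflops:.2f} exaflops FP4 per rack")  # -> 1.08, i.e. about 1.1
```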
