Co-design boosts AI efficiency on edge devices
Researchers from the University of Michigan have developed a groundbreaking hardware-software co-design that enables high-efficiency AI processing on edge devices. Published in Nature Communications, the study demonstrates how this approach significantly reduces energy consumption and latency, allowing real-time analysis of continuous data streams such as video feeds and sensor inputs. This advancement makes it feasible to run powerful artificial intelligence models directly on battery-powered devices like smartphones, hearing aids, and autonomous vehicle cameras without relying on cloud connectivity.

The core innovation lies in the team's ability to map complex state space models (SSMs) onto a compute-in-memory architecture. While transformer models like those powering ChatGPT are dominant, their memory demands grow steeply as input sequences lengthen. Compute-in-memory systems, by contrast, offer superior energy efficiency by processing data where it is stored, yet they have historically been incompatible with the complex mathematics required by traditional AI models. Wei Lu, the study's corresponding author, noted that while compute-in-memory is rigid for convolutions and transformers, it is ideally suited for state space models: every operation within the SSM can be implemented efficiently through device physics.

To optimize the system, the researchers addressed specific hardware and software bottlenecks. State space models have traditionally relied on complex numbers, forcing chips to perform separate calculations for the real and imaginary parts. The team modified the model to use only real numbers, so each memory cell can directly represent its data. To avoid memory bottlenecks during real-time processing, the system also assigns a fixed decay rate to each block of the model rather than a unique rate to every neuron; the decay rate controls how quickly the system forgets old data to make room for new information.
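The real-valued recurrence with block-shared decay described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a simple diagonal SSM, and the names (`ssm_step`, `decay`, `B`, `C`) are hypothetical.

```python
import numpy as np

def ssm_step(h, x, decay, B, C):
    """One step of a diagonal, real-valued state space model (illustrative).

    h     : (n,) hidden state -- real-valued only, no separate imaginary part
    x     : scalar input at this time step
    decay : (n,) decay factors in (0, 1); with block-shared decay, every
            state in a block holds the same value
    B, C  : (n,) input and output projection vectors
    """
    h = decay * h + B * x          # old information fades at the decay rate
    y = np.dot(C, h)               # read out the current state
    return h, y

# Block-shared decay: two blocks of four states, one decay value per block.
decay = np.repeat([0.9, 0.5], 4)   # the second block forgets faster
n = decay.size
rng = np.random.default_rng(0)
B, C = rng.normal(size=n), rng.normal(size=n)

h = np.zeros(n)
for x in [1.0, 0.0, 0.0]:          # an impulse followed by silence
    h, y = ssm_step(h, x, decay, B, C)
# After two silent steps the state is the impulse response B scaled
# by decay squared -- the fast-decay block has faded the most.
```

Because each state is a plain real number, each value maps directly onto a single analog memory cell, which is the property the real-valued reformulation buys.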
The hardware implementation used a Resistive RAM (RRAM) crossbar array fabricated in a standard 65-nanometer CMOS process. By growing tungsten oxide memristors of varying thicknesses through controlled oxidation, the team realized the different decay rates physically: thinner oxide layers produced faster short-term memory decay, while thicker layers slowed it, aligning with the model's requirements.

Experimental results showed that the RRAM arrays performed vector-matrix multiplications with high precision, achieving an effective accuracy of 4.6 bits relative to the ideal mathematical output. The co-design not only maintained high accuracy but also significantly outperformed conventional digital hardware in both power consumption and latency. The study shows that state space models and neuromorphic hardware form a naturally compatible pair, overcoming the noise and performance degradation typically associated with porting algorithms to physical hardware. As co-first authors Xiaoyu Zhang and Mingtao Hu put it, this work physically restructures how state space models compute, moving the field toward hardware-native AI that can operate efficiently anywhere. The breakthrough addresses the critical need for local data processing to improve speed, privacy, and energy sustainability in next-generation intelligent devices.
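The crossbar operation at the heart of these results can be sketched as follows: input voltages drive the rows, each cell's conductance encodes a weight, and the column currents sum to a vector-matrix product. This is a simplified model, not the paper's measurement setup; the 16 conductance levels and all names here are illustrative assumptions.

```python
import numpy as np

def crossbar_vmm(v, G):
    """Idealized crossbar vector-matrix multiply: voltages v on the rows,
    conductance matrix G in the cells; column currents I_j = sum_i v_i G_ij
    follow from Ohm's law and Kirchhoff's current law."""
    return v @ G

rng = np.random.default_rng(1)
W = rng.uniform(-1, 1, size=(8, 4))   # target weight matrix
v = rng.uniform(0, 1, size=8)         # input voltage vector

# Model the cells' limited analog precision by snapping each weight to one
# of a few discrete conductance states (16 levels is a hypothetical choice).
levels = 16
W_prog = np.round((W + 1) / 2 * (levels - 1)) / (levels - 1) * 2 - 1

ideal = crossbar_vmm(v, W)        # exact mathematical result
analog = crossbar_vmm(v, W_prog)  # result with quantized conductances
err = np.max(np.abs(analog - ideal))
# err is bounded by the quantization step scaled by the total input drive,
# which is the kind of deviation an effective-bits figure summarizes.
```

Real devices add noise and drift on top of this quantization, which is why demonstrating 4.6 effective bits on fabricated arrays is a meaningful precision result.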
