
Nanjing University Team Achieves Local AI Operation on Domestic GPUs

AI privacy concerns are growing, and a breakthrough from Nanjing University's research team is offering a promising solution. In 2023, Samsung experienced multiple internal data leaks shortly after integrating ChatGPT, as employees inadvertently entered sensitive information—such as semiconductor design parameters, source code, and production yield data—into the model. This data was then potentially stored in the model's training database, exposing critical intellectual property.

Such risks are not limited to corporate secrets; they extend to personal privacy and government data. Most current AI applications on smartphones rely on cloud-based processing, meaning user queries are sent to remote servers for analysis. This model requires users to accept data usage policies, often leading to the collection and retention of private information. As large models become cheaper and more widespread, reliance on centralized cloud infrastructure could concentrate data in the hands of a few tech giants, raising serious security and privacy issues.

To address this, a team led by Dr. Meng Li from the School of Computer Science at Nanjing University has developed a novel approach to enable local, on-device AI deployment—using only domestic GPUs. Their work keeps data on users' devices, eliminating the need to send it to the cloud and thereby significantly improving privacy, security, and reliability.

A key challenge in deploying large AI models on edge devices like smartphones is memory: most smartphones lack the capacity to load large models entirely into RAM. Current solutions rely on heavily compressed models, which often sacrifice performance and increase computational overhead. Li's team has overcome this bottleneck by discovering a fundamental principle of mixture-of-experts (MoE) models: low-scoring experts can be safely replaced without compromising accuracy.
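To make the insight concrete, here is a minimal sketch of top-k MoE routing in which a low-scoring expert that is not resident in GPU memory is substituted by the best expert already in the cache, rather than stalling to load it from storage. All names, thresholds, and the substitution rule are illustrative assumptions, not the team's published algorithm.

```python
import numpy as np

def route_with_substitution(router_logits, cached_experts, k=2):
    """Top-k MoE routing with low-score expert substitution (sketch).

    The top-1 expert is always used, since it dominates the layer's
    output. A lower-ranked expert that is missing from GPU memory is
    swapped for the highest-scoring expert that IS cached, avoiding a
    load from external storage.
    """
    # Softmax over router logits gives per-expert routing scores.
    exp = np.exp(router_logits - router_logits.max())
    scores = exp / exp.sum()
    ranked = list(np.argsort(scores)[::-1])   # experts, best first
    topk = ranked[:k]

    selected = []
    for e in topk:
        if e in cached_experts or e == topk[0]:
            selected.append(e)                # resident (or top-1): use as-is
        else:
            # Low-scoring and not cached: substitute the best cached
            # expert not already selected; fall back to the original.
            sub = next((c for c in ranked
                        if c in cached_experts and c not in selected), e)
            selected.append(sub)
    return selected
```

With four experts, logits `[3.0, 2.0, 0.1, 0.05]`, and only experts 0 and 2 cached, the router picks expert 0 (top-1, always loaded) and substitutes cached expert 2 for the uncached expert 1.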
Building on this insight, they designed a system that dynamically replaces low-scoring experts during inference and predicts future expert usage patterns across multiple steps. This strategy dramatically increases cache hit rates—more than doubling them in some cases—and maximizes GPU memory utilization. As a result, tasks that previously required two high-end GPUs can now run efficiently on a single card, drastically reducing hardware demands.

This advancement is particularly impactful for edge computing scenarios such as small businesses or homes, where cost and hardware constraints are critical. More importantly, it enables powerful AI models to run directly on smartphones—without requiring users to upgrade to devices with larger memory capacities. The technology achieves this by keeping only the necessary model components in memory at any given time, offloading unused parts to external storage and loading them on demand.

The research was conducted in collaboration with domestic computing hardware teams. Initially, Li's team struggled with domestic GPUs: computational power was sufficient, but memory capacity was not. When they attempted to load a large model onto a 24GB GPU, the memory bottleneck prevented full deployment, prompting a deeper investigation into efficient memory management strategies.

Rather than modifying the model architecture or sacrificing accuracy, the team focused on system-level optimizations. Their breakthrough came from recognizing that not all experts in MoE models contribute equally—and that replacing low-performing ones could drastically reduce the memory footprint without harming performance. Combined with predictive caching based on the continuity of expert usage across decoding steps, this approach delivers both high speed and high accuracy. The system was validated on both domestic and NVIDIA hardware, proving its effectiveness across platforms.
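The caching scheme described above—keeping only a subset of experts resident in GPU memory, evicting the least recently used, and prefetching experts predicted for upcoming steps—can be sketched as follows. The class name, eviction policy, and predictor interface are assumptions for illustration; the actual system's policies are not described in the article.

```python
from collections import OrderedDict

class PredictiveExpertCache:
    """GPU-resident expert cache with lookahead prefetching (sketch).

    Only `capacity` experts live in (simulated) GPU memory; the rest
    stay in host or flash storage and are loaded on demand. A predictor
    of future expert usage can call `prefetch` so that the experts
    needed in upcoming decode steps are already resident.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()      # expert_id -> weights (stubbed)
        self.hits = 0
        self.misses = 0

    def access(self, expert_id):
        """Fetch an expert for the current step, tracking hit rate."""
        if expert_id in self.cache:
            self.cache.move_to_end(expert_id)   # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            self._load(expert_id)

    def prefetch(self, predicted_ids):
        """Load experts predicted for future steps (not counted as hits)."""
        for e in predicted_ids:
            if e not in self.cache:
                self._load(e)

    def _load(self, expert_id):
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)      # evict least recently used
        self.cache[expert_id] = f"weights[{expert_id}]"

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Because expert usage tends to persist across consecutive decoding steps, prefetching the predicted set before each step turns most accesses into cache hits, which is the mechanism behind the hit-rate gains the article reports.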
This success not only advances on-device AI but also reshapes perceptions of domestic computing hardware. Before the project, Li had limited experience with Chinese-made GPUs. Through this collaboration, he observed rapid progress in domestic hardware performance and ecosystem maturity, even if some toolchain challenges remain. The team's work underscores a critical truth: progress in edge AI requires tight integration of software, algorithms, and hardware. This holistic co-design approach is now the foundation of their new research direction, which focuses on resource-constrained environments like smartphones, personal computers, and small servers.

Dr. Li envisions a future where intelligent computing becomes as accessible and affordable as electricity or water. His long-term goal is to drive down the cost of each unit of AI computation—measured in tokens—so that even low-cost devices can run powerful models. A small, inexpensive hardware module costing just tens or hundreds of yuan could one day enable meaningful AI capabilities on everyday devices. Such a shift would democratize artificial intelligence, freeing people from repetitive tasks and empowering them to focus on creativity and innovation. Just as plumbing transformed daily life by making water universally accessible, Li believes that widespread, low-cost AI at the edge will revolutionize society—making intelligent technology truly accessible and beneficial for all.
