MosaicLeaks: New Training Cuts Privacy Leakage in Deep Research Agents
Researchers have introduced MosaicLeaks, a benchmark and training framework that targets a critical privacy vulnerability in deep-research AI agents. As these systems increasingly combine local enterprise data with external web retrieval tools, they can expose sensitive information through a phenomenon known as the mosaic effect: each search query may look harmless on its own, but the accumulated query log lets an adversary reconstruct private facts without ever accessing the original documents.
To quantify this risk, the benchmark comprises 1,001 multi-hop research chains that deliberately interleave local and web-based sub-questions. It measures privacy exposure with three progressively stricter metrics: intent leakage, where an observer can deduce the research goal; answer leakage, where the query log is enough to answer specific private questions; and full-information leakage, where the public query trail alone reveals verifiable enterprise secrets. In testing, untrained agents leaked frequently, with baseline answer and full-information leakage rates reaching 34.0 percent.
Initial mitigation attempts relied on direct prompt engineering, instructing models not to expose local data during web searches. This approach yielded inconsistent results and frequently degraded task performance. More critically, standard reinforcement learning optimized solely for accuracy made the problem worse, driving leakage up to 51.7 percent as models packed more contextual detail into their search strings to improve retrieval success. To resolve this tension between task performance and privacy, the team introduced Privacy-Aware Deep Research, a reinforcement learning framework that pairs a learned privacy classifier with situational task rewards.
Rather than scoring entire trajectories, the method evaluates individual planning steps against identical contextual states, rewarding secure query construction and penalizing phrasing that carries private detail. A lightweight classifier continuously monitors outgoing web queries, flagging both direct leaks and cumulative mosaic risks. This dual-reward system pushes the model to retrieve the public documents it needs without carrying private metrics, dates, or entity names into its search strings.
The results show gains in both security and efficiency. Under the new training paradigm, answer and full-information leakage dropped to 9.9 percent, well below the untrained baseline's 34.0 percent, while the agent maintained a strict chain success rate of 58.7 percent. The situational reward mechanism also improved sample efficiency, reaching comparable performance thresholds with roughly five to six times fewer training examples than traditional outcome-based reinforcement learning.
The study underscores a fundamental limitation in current AI safety practice: privacy safeguards cannot be effectively enforced through prompt engineering alone. Because the mosaic effect emerges from sequential decisions made over time, security must be built into the training loop through precise reward modeling. MosaicLeaks operates within a controlled environment that uses synthetic enterprise documents and a fixed web corpus, but its findings establish a measurable pathway for hardening next-generation research agents against observational inference attacks. As organizations deploy increasingly autonomous AI systems for internal research, the framework highlights the need for reward-based privacy alignment to keep operational data from leaking into public query channels.
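
The step-level reward described above can be illustrated with a minimal Python sketch. It is not the researchers' implementation: the names `Candidate`, `exposed_terms`, `privacy_penalty`, and `step_advantages` are hypothetical, a simple substring match stands in for the learned privacy classifier, and subtracting the group mean is only one assumed way to compare candidate queries that share the same contextual state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One candidate web query proposed at a single planning step (hypothetical type)."""
    query: str
    task_reward: float  # assumed usefulness score, e.g. from retrieval success


def exposed_terms(text: str, private_terms: list[str]) -> set[str]:
    """Return the private terms that appear in the text, ignoring case."""
    lowered = text.lower()
    return {term for term in private_terms if term.lower() in lowered}


def privacy_penalty(query: str, history: list[str], private_terms: list[str]) -> float:
    """Toy stand-in for the learned classifier (assumption: substring matching).

    A query with no private term costs nothing. Otherwise the penalty is the
    fraction of private terms visible in the whole log once this query is
    added, so earlier queries raise the cost of a new partial leak.
    """
    if not exposed_terms(query, private_terms):
        return 0.0
    cumulative = exposed_terms(" ".join([*history, query]), private_terms)
    return len(cumulative) / len(private_terms)


def step_advantages(
    candidates: list[Candidate],
    history: list[str],
    private_terms: list[str],
    weight: float = 1.0,
) -> list[float]:
    """Score candidates from the same state and centre them on the group mean."""
    rewards = [
        c.task_reward - weight * privacy_penalty(c.query, history, private_terms)
        for c in candidates
    ]
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


if __name__ == "__main__":
    # Illustrative data only; none of these values come from the benchmark.
    private_terms = ["Acme", "14% churn", "Project Falcon"]
    history = ["Project Falcon launch date"]
    candidates = [
        Candidate("industry average SaaS churn rate", 0.7),
        Candidate("Acme customer churn benchmarks", 0.8),
        Candidate("Acme 14% churn after Project Falcon", 0.9),
    ]
    for cand, adv in zip(candidates, step_advantages(candidates, history, private_terms)):
        print(f"{adv:+.3f}  {cand.query}")
```

In this toy setup the generic query receives the highest advantage even though its task score is the lowest, while the second query is penalized for more than the single term it contains because the earlier log entry already exposed another private term. That is the mosaic effect expressed as a reward signal.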
