LLMs Struggle to Generate Plausible Passwords Due to Limited Memorization and Domain Skills
Large language models (LLMs), despite their strong performance in language understanding and code generation, show significant limitations when asked to generate plausible passwords for specific users, according to a recent study by researchers at the Future Data Minds Research Lab in Australia. The study, published on the arXiv preprint server, reveals that even state-of-the-art open-source models like TinyLLaMA, Falcon-RW-1B, and Flan-T5 struggle to produce accurate or realistic passwords when given synthetic user profiles containing names, birthdates, and hobbies.

The researchers evaluated the models by prompting them to generate password suggestions based on structured personal attributes. They used standard metrics from password-guessing research, Hit@1, Hit@5, and Hit@10, to measure how often the correct password appeared in the top-ranked guesses. Performance was consistently poor: all models achieved less than 1.5% accuracy at Hit@10, meaning the correct password rarely appeared among the top ten suggestions.

In contrast, traditional password-cracking methods, such as rule-based attacks and combinatorial techniques, achieved far higher success rates. These tools systematically exploit common patterns like date formats, name variations, and keyboard sequences, which makes them far more effective than LLMs at guessing passwords.

Further analysis uncovered key reasons behind the LLMs' shortcomings. The models lack strong memorization abilities and struggle to recall specific examples from their training data. They also fail to generalize learned password patterns to new, unseen scenarios. Although they can generate fluent text, they do not reliably apply the domain-specific knowledge needed for effective password inference. The researchers emphasize that current LLMs lack the domain adaptation and fine-tuning required for tasks like password guessing, particularly when they have not been trained on real leaked password datasets.
Without supervised fine-tuning on such data, their ability to infer passwords remains weak. While this study focused on only three models, it highlights an important gap in LLM capabilities that could guide future security research. The findings suggest that LLMs are not currently viable tools for adversarial password cracking, which may actually benefit cybersecurity efforts by reducing the threat of AI-powered attacks. The work also lays the foundation for developing more secure, privacy-preserving approaches to password modeling. By understanding the limits of LLMs, researchers can design better defenses and improve authentication systems to protect user accounts. This insight underscores the importance of ongoing evaluation of AI tools in sensitive domains like cybersecurity.
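The evaluation setup described above can be illustrated with a minimal sketch. This is not code from the study; the profile, the mangling rules, and the "true" password are all hypothetical stand-ins. It shows two of the ideas the article mentions: rule-based candidate generation that combines profile attributes with common suffixes (names plus birth years, etc.), and the Hit@k metric, which scores 1 if the correct password appears among the top k ranked guesses and 0 otherwise.

```python
# Illustrative sketch only: toy rule-based password guessing and Hit@k scoring.
# Profile values and the ground-truth password are made up for this demo.

def rule_based_candidates(name, birth_year, hobby):
    """Combine profile attributes the way simple mangling rules do."""
    base_words = [name.lower(), name.capitalize(), hobby.lower()]
    suffixes = ["", str(birth_year), str(birth_year)[-2:], "123", "!"]
    # Cartesian product of base words and suffixes, in rule order.
    return [word + suffix for word in base_words for suffix in suffixes]

def hit_at_k(ranked_guesses, true_password, k):
    """Hit@k: 1 if the true password is among the top-k guesses, else 0."""
    return int(true_password in ranked_guesses[:k])

profile = {"name": "Alice", "birth_year": 1990, "hobby": "tennis"}
guesses = rule_based_candidates(**profile)

true_password = "alice1990"  # hypothetical ground truth for the demo
print(hit_at_k(guesses, true_password, 1))   # prints 0 (top guess is "alice")
print(hit_at_k(guesses, true_password, 10))  # prints 1 (found in top ten)
```

In the study's framing, an LLM would replace the rule-based generator: the model is prompted with the profile and its ranked suggestions are scored with the same Hit@k metric, which is where the sub-1.5% Hit@10 figures come from.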
