
MIT and Stanford Unveil SketchAgent: An AI That Draws Like Humans Using Language and Strokes

6 days ago

Scientists from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) and Stanford University have introduced "SketchAgent," a drawing system that enables AI models to sketch more like humans. Unlike traditional text-to-image models such as DALL-E 3, which produce static images, SketchAgent builds a drawing through a sequence of individual strokes, making its output more natural and fluid. This approach could change how teachers and researchers diagram complex concepts, and could power interactive art games and drawing lessons.

At the core of SketchAgent is a multimodal language model that combines text and image data. Instead of relying on extensive datasets of human-drawn sketches, the researchers developed a "sketching language" that translates a sketch into a numbered sequence of strokes on a grid. Each stroke is labeled with its specific role, such as a rectangle labeled "front door" in a house, which allows the model to understand the drawing and generalize to new concepts. The system can produce abstract drawings of robots, butterflies, DNA helices, flowcharts, and even landmarks like the Sydney Opera House.

CSAIL postdoc Yael Vinker, the paper's lead author, emphasizes SketchAgent's role in natural communication. "Not everyone realizes how frequently they draw in their daily lives, whether it’s sketching out ideas or visualizing thoughts," she says. "Our tool aims to emulate this process, making multimodal language models more useful for visually expressing ideas."

The research team also includes CSAIL postdoc Tamar Rott Shaham, undergraduate researcher Alex Zhao, MIT Professor Antonio Torralba, and Stanford University collaborators Kristine Zheng and Judith Ellen Fan. They will present their findings at the 2025 Conference on Computer Vision and Pattern Recognition (CVPR).
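The "sketching language" described above can be sketched in code: a drawing is an ordered list of labeled strokes over grid coordinates, serialized as plain text that a language model could read or emit. Everything below, including the `Stroke` structure, the coordinates, and the serialization format, is an illustrative assumption rather than the paper's actual representation:

```python
from dataclasses import dataclass

@dataclass
class Stroke:
    number: int                    # position in the drawing sequence
    label: str                     # semantic role, e.g. "front door"
    points: list[tuple[int, int]]  # grid coordinates the stroke passes through

def describe(strokes: list[Stroke]) -> list[str]:
    """Serialize strokes, in order, into labeled text lines."""
    return [
        f"stroke {s.number} ({s.label}): "
        + " -> ".join(f"({x},{y})" for x, y in s.points)
        for s in sorted(strokes, key=lambda s: s.number)
    ]

# A toy "house" concept built from three labeled strokes.
house = [
    Stroke(1, "roof", [(0, 5), (5, 10), (10, 5)]),
    Stroke(2, "walls", [(0, 5), (0, 0), (10, 0), (10, 5)]),
    Stroke(3, "front door", [(4, 0), (4, 3), (6, 3), (6, 0)]),
]

for line in describe(house):
    print(line)
```

In a format like this, generalizing to a new concept only requires naming new stroke roles; the grid and sequencing conventions stay the same, which is what lets a language model reuse them.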
To assess SketchAgent's capabilities, the researchers ran several tests. Comparing its drawings to those generated by traditional models, they found that the step-by-step approach produced more human-like results. One key test involved collaboration mode, in which a human and the AI model drew a concept together. By removing the AI's contributions after the fact, the team showed that SketchAgent's strokes were essential to a recognizable final sketch: without SketchAgent's mast in a sailboat drawing, for example, the sketch became unrecognizable.

SketchAgent still has clear limitations, however. It renders simple concepts as stick figures and basic doodles but struggles with more complex images, such as logos, sentences, and detailed human figures. The model has also occasionally misread user intentions, producing errors like a bunny with two heads. This may stem from its chain-of-thought reasoning, in which it breaks a task into smaller steps and sometimes falls out of alignment with the human's contributions. The researchers suggest that training on synthetic data from diffusion models could mitigate these issues.

Collaboration is a significant part of SketchAgent's potential: the system not only draws on its own but can work alongside humans toward more aligned and coherent final designs. "As models advance in generating other modalities, like sketches, they open up new avenues for users to express ideas in more intuitive and human-like ways," says Rott Shaham. "This could greatly enrich interactions, making AI more accessible and versatile."

Future work on SketchAgent includes refining the model's drawing skills and making its interface more user-friendly.
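The collaboration-mode ablation described above, removing the AI's strokes to see whether the remaining human strokes still form a recognizable sketch, can be illustrated with a toy example. The stroke records and author tags here are hypothetical, not the authors' code:

```python
# Each stroke in a collaborative sketch records who contributed it.
sailboat = [
    {"author": "human", "label": "hull"},
    {"author": "agent", "label": "mast"},
    {"author": "human", "label": "sail"},
]

def without_agent(strokes):
    """Drop every stroke contributed by the AI, keeping human strokes."""
    return [s for s in strokes if s["author"] != "agent"]

remaining = without_agent(sailboat)
print([s["label"] for s in remaining])  # the mast is gone
```

Rendering only `remaining` and asking viewers whether they still recognize a sailboat is the kind of check that showed the agent's strokes were essential.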
The researchers aim to reduce the number of prompts needed to generate human-like doodles and to improve the system's ability to understand and align with human intentions during collaborative drawing sessions. Industry observers have praised the work, noting that it bridges a gap between AI and human creativity and makes multimodal communication more seamless and interactive.

The project was funded by the U.S. National Science Foundation, a Hoffman-Yee Grant from the Stanford Institute for Human-Centered AI, Hyundai Motor Co., the U.S. Army Research Laboratory, the Zuckerman STEM Leadership Program, and a Viterbi Fellowship.

Overall, SketchAgent marks a significant step forward in AI-generated sketching, offering a new level of interaction and creativity. While there is still room for improvement, the potential applications in education, research, and art games underscore the growing versatility of AI in multimodal communication.
