Vector Prism: Animating Vector Graphics by Priming Semantic Structure

Jooyeol Yun, Jaegul Choo

Abstract

Scalable Vector Graphics (SVG) are a cornerstone of modern web design, and the demand for animating them keeps growing as web environments become increasingly dynamic. Yet automating vector graphics animation remains a major challenge for vision-language models (VLMs), despite recent advances in code generation and motion planning. VLMs struggle to understand SVGs because visually coherent parts are often decomposed into low-level shapes that offer little guidance on which elements should move together. In this paper, we present a framework that recovers the semantic structure required for reliable SVG animation, exposing the missing layer that current VLM systems overlook. This is achieved through statistical aggregation of multiple weak part predictions, allowing the system to infer semantics robustly from noisy outputs. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce far more coherent animations. Our experiments show substantial improvements over existing methods, suggesting that semantic recovery is the key step that unlocks reliable SVG animation and supports more interpretable interactions between VLMs and vector graphics.

One-sentence Summary

KAIST researchers introduce Vector Prism, a framework that automates Scalable Vector Graphics animation by recovering semantic structure through statistical aggregation of multiple weak part predictions, solving vision-language models' element fragmentation issues; this enables coherent animations via correct element grouping and achieves significant performance gains over prior approaches, unlocking robust and interpretable SVG interactions.

Key Contributions

  • SVGs are structured for rendering efficiency rather than semantic clarity, causing vision-language models (VLMs) to misinterpret fragmented low-level shapes during animation attempts and fail to identify coherent moving parts. This paper formally defines the semantic restructuring challenge and establishes its critical role in enabling reliable SVG animation.
  • The Vector Prism framework recovers semantic structure through statistical aggregation of multiple weak part predictions from VLMs, using a Dawid-Skene model to transform noisy view-dependent outputs into robust semantic labels without requiring VLM fine-tuning. This reorganization aligns SVG syntax with human-understandable parts for coherent motion planning.
  • Experiments show significant improvements in animation quality and instruction faithfulness over state-of-the-art methods, with the approach outperforming commercial services like Sora 2 on complex real-world SVGs by enabling accurate part grouping and motion execution.

Introduction

The authors address the growing need for automated animation of Scalable Vector Graphics (SVGs) in dynamic web environments, where current vision-language models (VLMs) struggle due to SVGs' rendering-optimized structure fragmenting coherent visual elements into disconnected low-level shapes. This fragmentation prevents VLMs from identifying which parts should move together, leading to incoherent animations. Prior approaches either rely on gradient-based optimization that produces jittery, repetitive motions or require massive datasets to fine-tune language models without resolving SVGs' inherent semantic ambiguity. The authors introduce Vector Prism, a statistical inference framework that aggregates multiple noisy VLM predictions across varied SVG visualizations to recover robust semantic part groupings. By restructuring SVGs with these inferred semantic labels, their method enables VLMs to generate significantly more coherent and instruction-faithful animations without domain-specific model fine-tuning.

Dataset

The authors use a carefully curated test dataset of 114 hand-crafted animation instructions paired with 57 unique SVG files, sourced from SVGRepo. Each SVG receives an average of two distinct animation scenarios, covering diverse objects like animals, logos, buildings, and natural elements (fire, clouds, water).

Key details include:

  • Composition: 57 SVG files spanning six thematic categories, with Nature/Environment (31.6%) and Objects/Miscellaneous (26.3%) most represented.
  • Animation patterns: Appearance/Reveal effects (28.1%) and State Transitions (13.2%) dominate, alongside Organic/Natural Movement (12.3%) and Rotational Movement (8.8%).
  • Curation: Instructions simulate real-world web use cases, testing techniques from simple movements to complex 3D rotations and synchronized transitions.

The dataset serves exclusively for evaluation—not training—to assess SVG animation tools against practical web development needs. No training splits, mixture ratios, or additional processing (e.g., cropping) are applied; all entries are manually designed to ensure coverage of critical interaction patterns and visual content types. Metadata reflects thematic categories and animation patterns for structured performance analysis.

Method

The authors leverage a three-stage pipeline to enable vision-language models (VLMs) to generate semantically coherent animations from SVG files. The core innovation lies in the middle stage, Vector Prism, which restructures the inherently syntactic and rendering-optimized SVG hierarchy into a semantically meaningful one that aligns with how VLMs perceive visual concepts. This restructuring bridges the gap between high-level animation planning and low-level code generation.

The pipeline begins with animation planning, where a VLM interprets a rasterized version of the SVG and generates a high-level plan based on user instructions. For example, given the instruction “make the sun rise,” the VLM identifies the sun and sky as semantic components and proposes their respective motions. However, since VLMs lack understanding of SVG syntax, they cannot directly implement these plans. This is where Vector Prism intervenes.
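To make this concrete, a high-level plan for that instruction might look like the sketch below; the schema, field names, and values are hypothetical illustrations, not the authors' actual prompt or output format.

```python
# Hypothetical shape of a high-level animation plan for "make the sun rise".
# Field names and values are illustrative only, not the paper's actual schema.
plan = {
    "instruction": "make the sun rise",
    "components": [
        {"part": "sun", "motion": "translate upward from behind the horizon", "duration_s": 3.0},
        {"part": "sky", "motion": "gradually brighten the fill color", "duration_s": 3.0},
    ],
}
```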

Vector Prism takes each SVG primitive (such as a path, rect, or circle element) and renders it through multiple focused views: bounding box, isolation, highlight, zoom-in, and outline. Each view provides a complementary visual cue to the VLM, which then assigns a tentative semantic label to the primitive. These labels are inherently noisy, as different rendering methods vary in reliability. For instance, a bounding box view might reliably identify a “Plus” symbol with p = 0.9, while a zoom-in view might misclassify it with p = 0.5.
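The per-view labeling step can be pictured as building a label matrix of methods by primitives. In the sketch below, render_view and query_vlm_label are assumed placeholder functions rather than the authors' actual API; the list of views follows the description above.

```python
# Sketch of collecting noisy per-view semantic labels for every SVG primitive.
# render_view and query_vlm_label are hypothetical placeholders for the rendering
# and VLM-querying steps described above.

RENDER_METHODS = ["bounding_box", "isolation", "highlight", "zoom_in", "outline"]

def collect_label_matrix(primitives, candidate_labels, render_view, query_vlm_label):
    """Return a dict mapping each rendering method to its list of per-primitive labels."""
    labels = {method: [] for method in RENDER_METHODS}
    for method in RENDER_METHODS:
        for prim in primitives:
            image = render_view(prim, method)                   # one focused view of one primitive
            labels[method].append(query_vlm_label(image, candidate_labels))  # noisy prediction
    return labels
```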

To resolve this noise, Vector Prism employs a statistical inference framework based on the Dawid-Skene model. It first estimates the reliability p_i of each rendering method i by analyzing pairwise agreement patterns across all primitives. The agreement matrix A_{ij}, which captures how often methods i and j agree on labels, is centered to remove chance agreement and decomposed via eigenvector analysis. The top eigenvector of the centered matrix yields the skill vector \delta, from which the reliability p_i = \frac{1}{k} + \delta_i is recovered.
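A minimal sketch of this reliability estimation under the one-coin Dawid-Skene assumption is shown below. It takes the label matrix as an integer array of shape (methods × primitives); the exact scaling of the eigenvector follows the standard spectral recipe and may differ in detail from the authors' implementation.

```python
import numpy as np

def estimate_reliabilities(label_matrix, k):
    """Spectral reliability estimate (one-coin Dawid-Skene).

    label_matrix: (m, n) integer array, m rendering methods x n primitives.
    k: number of candidate semantic labels.
    Returns p_hat, the estimated reliability of each method.
    """
    m, n = label_matrix.shape
    # Pairwise agreement A_ij: fraction of primitives on which methods i and j give the same label.
    A = (label_matrix[:, None, :] == label_matrix[None, :, :]).mean(axis=2)
    # Center to remove chance agreement; under the one-coin model,
    # A_ij - 1/k is approximately (k/(k-1)) * delta_i * delta_j for i != j.
    C = A - 1.0 / k
    np.fill_diagonal(C, 0.0)
    # Top eigenvector of the centered agreement matrix gives the skill vector delta up to scale/sign.
    eigvals, eigvecs = np.linalg.eigh(C)
    v = eigvecs[:, -1]
    if v.sum() < 0:                       # fix the sign so most methods beat chance
        v = -v
    delta = np.sqrt(max(eigvals[-1], 0.0) * (k - 1) / k) * v
    return np.clip(1.0 / k + delta, 1e-3, 1.0 - 1e-3)   # p_i = 1/k + delta_i
```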

With reliabilities estimated, Vector Prism applies a Bayes decision rule to assign the final semantic label to each primitive. Instead of majority voting, it performs a weighted vote where the weight for method i is w_i = \log \frac{(k-1)\hat{p}_i}{1 - \hat{p}_i}. This downweights unreliable predictions (for example, with k = 2, a method with p = 0.1 contributes \log \frac{1}{9} to the score), ensuring that the final label minimizes expected classification error. This weighting outperforms majority voting by preventing low-reliability predictors from swinging the outcome.
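Given the estimated reliabilities, the weighted vote is only a few lines. The sketch below reuses the label matrix and p_hat from the snippets above and assumes labels are encoded as integers in [0, k).

```python
import numpy as np

def weighted_vote(label_matrix, p_hat, k):
    """Bayes decision rule: weighted vote with w_i = log((k-1) * p_i / (1 - p_i))."""
    m, n = label_matrix.shape
    weights = np.log((k - 1) * p_hat / (1.0 - p_hat))   # methods below chance get negative weight
    scores = np.zeros((n, k))
    for i in range(m):                                   # accumulate each method's weighted votes
        scores[np.arange(n), label_matrix[i]] += weights[i]
    return scores.argmax(axis=1)                         # final label per primitive
```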

Once semantic labels are assigned, the final stage restructures the SVG. The original hierarchy is flattened, and primitives are regrouped by label while preserving the original paint order and visual appearance. Overlaps between different semantic groups are checked to prevent rendering artifacts. The resulting SVG retains identical visual output but is now organized into coherent, animation-ready groups—such as “Ears,” “Eyes,” and “Nose”—each annotated with metadata like bounding boxes and geometric centers. This structured SVG is then passed to an LLM, which generates CSS animation code per semantic group, using an iterative strategy to handle token limits and a “lanes” system to prevent animation conflicts.
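The regrouping step can be sketched roughly as follows, assuming each primitive carries its inferred label, its original paint-order index, and a bounding box; the group metadata and the omitted overlap check are simplifications of what the paper describes.

```python
from collections import defaultdict

def regroup_by_label(primitives):
    """primitives: list of dicts with 'element', 'label', 'order', and 'bbox' = (x0, y0, x1, y1).

    Groups primitives by semantic label while preserving the original paint order
    inside each group; overlap checks between groups are omitted in this sketch.
    """
    groups = defaultdict(list)
    for prim in sorted(primitives, key=lambda p: p["order"]):
        groups[prim["label"]].append(prim)

    svg_groups = []
    for label, members in groups.items():
        x0s, y0s, x1s, y1s = zip(*(p["bbox"] for p in members))
        bbox = (min(x0s), min(y0s), max(x1s), max(y1s))
        center = ((bbox[0] + bbox[2]) / 2, (bbox[1] + bbox[3]) / 2)
        svg_groups.append({
            "id": label,                                   # e.g. "Ears", "Eyes", "Nose"
            "bbox": bbox,                                  # metadata handed to the animation LLM
            "center": center,
            "children": [p["element"] for p in members],   # original primitives, paint order kept
        })
    return svg_groups
```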

The entire process transforms an unstructured SVG into a semantically enriched one, enabling VLMs to animate at the level of meaningful parts rather than low-level shapes. This results in animations that are both visually stable and semantically consistent with user intent.

Experiment

  • Evaluated against AniClipart, GPT-5, Wan 2.2, and Sora 2 on instruction-following and perceptual quality metrics (CLIP-T2V, GPT-T2V, DOVER), achieving best scores across all benchmarks by enabling semantic part-aware motion in vector graphics.
  • User study with 760 pairwise comparisons showed 83.4% human preference alignment with GPT-T2V scores, consistently favoring the method over Sora 2 and Wan 2.2 in instruction adherence.
  • SVG animations achieved 54× smaller file sizes than Sora 2 while maintaining resolution independence, demonstrating superior encoding efficiency for web deployment.
  • Semantic clustering via Vector Prism achieved Davies-Bouldin Index of 0.82, significantly outperforming majority voting (12.6) and original SVG groupings (33.8) in structural coherence.
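For readers who want to reproduce this kind of comparison, the Davies-Bouldin Index (lower is better) is available off the shelf; the toy features below, built from synthetic primitive centers, are only an assumption about how primitives might be featurized, not the paper's exact protocol.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Toy illustration of the clustering comparison (lower DBI = tighter, better-separated groups).
rng = np.random.default_rng(0)
part_centers = rng.uniform(0, 100, size=(8, 2))                            # 8 semantic parts
features = np.vstack([c + rng.normal(0, 1.0, size=(25, 2)) for c in part_centers])
coherent = np.repeat(np.arange(8), 25)                                     # grouped by semantic part
scrambled = rng.permutation(coherent)                                      # arbitrary original grouping

print(davies_bouldin_score(features, coherent))    # small: coherent grouping
print(davies_bouldin_score(features, scrambled))   # much larger: fragmented grouping
```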

The authors analyze 114 animation instructions categorized by subject theme, with Nature/Environment being the most frequent at 31.6%, followed by Objects/Miscellaneous at 26.3%. UI/Interface Elements, Tech Logos/Brands, Animals/Characters, and Faces/Emojis make up the remaining categories, reflecting a diverse set of animation targets used in their evaluation.

The authors also categorize the 114 animation instructions by interaction pattern, finding that “Other/Mixed” patterns are most frequent at 37.7%, followed by “Appearance/Reveal” at 28.1%. The distribution is diverse across categories, with “Rotational Movement” the least common at 8.8%.

The authors evaluate their method against baselines using CLIP-T2V, GPT-T2V, and DOVER metrics, with Vector Prism achieving the highest scores across all three. Results show that their approach outperforms both vector-based and video generation models in instruction following and perceptual quality, despite not being trained on video-text data. The method also maintains vector format compatibility, unlike raster-based video models.

