Vector Prism: Animating Vector Graphics by Layering Semantic Structure

Jooyeol Yun, Jaegul Choo

Abstract

Scalable Vector Graphics (SVGs) play a central role in modern web design, and the demand for animating these elements keeps growing as the web shifts toward increasingly dynamic environments. Yet automating vector graphics animation remains a challenge for vision-language models (VLMs), despite recent progress in code generation and motion planning. VLMs frequently mishandle SVGs because visually coherent parts are often fragmented into low-level shapes that offer little guidance about which elements should move together. In this paper, we propose a framework that restores the semantic structure needed for reliable SVG animation, highlighting the missing layer that current VLM systems overlook. This restoration is achieved through statistical aggregation of multiple weak partial predictions, allowing the system to infer semantics stably from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce far more coherent animations. Our experiments show substantial gains over existing approaches, suggesting that semantic recovery is the key step toward robust SVG animation and more interpretable interactions between VLMs and vector graphics.

One-sentence Summary

KAIST researchers introduce Vector Prism, a framework that automates Scalable Vector Graphics (SVG) animation by recovering semantic structure through statistical aggregation of multiple weak part predictions, resolving the element fragmentation that confuses vision-language models; correct part grouping yields coherent animations and significant gains over prior approaches, enabling robust and interpretable SVG interactions.

Key Contributions

  • SVGs are structured for rendering efficiency rather than semantic clarity, causing vision-language models (VLMs) to misinterpret fragmented low-level shapes during animation attempts and fail to identify coherent moving parts. This paper formally defines the semantic restructuring challenge and establishes its critical role in enabling reliable SVG animation.
  • The Vector Prism framework recovers semantic structure through statistical aggregation of multiple weak part predictions from VLMs, using a Dawid-Skene model to transform noisy view-dependent outputs into robust semantic labels without requiring VLM fine-tuning. This reorganization aligns SVG syntax with human-understandable parts for coherent motion planning.
  • Experiments show significant improvements in animation quality and instruction faithfulness over state-of-the-art methods, with the approach outperforming commercial services like Sora 2 on complex real-world SVGs by enabling accurate part grouping and motion execution.

Introduction

The authors address the growing need for automated animation of Scalable Vector Graphics (SVGs) in dynamic web environments, where current vision-language models (VLMs) struggle because SVGs' rendering-optimized structure fragments coherent visual elements into disconnected low-level shapes. This fragmentation prevents VLMs from identifying which parts should move together, leading to incoherent animations. Prior approaches either rely on gradient-based optimization that produces jittery, repetitive motions or require massive datasets to fine-tune language models without resolving SVGs' inherent semantic ambiguity. The authors introduce Vector Prism, a statistical inference framework that aggregates multiple noisy VLM predictions across varied SVG visualizations to recover robust semantic part groupings. By restructuring SVGs with these inferred semantic labels, their method enables VLMs to generate significantly more coherent and instruction-faithful animations without domain-specific model fine-tuning.

Dataset

The authors use a carefully curated test dataset of 114 hand-crafted animation instructions paired with 57 unique SVG files, sourced from SVGRepo. Each SVG receives an average of two distinct animation scenarios, covering diverse objects like animals, logos, buildings, and natural elements (fire, clouds, water).

Key details include:

  • Composition: 57 SVG files spanning six thematic categories, with Nature/Environment (31.6%) and Objects/Miscellaneous (26.3%) most represented.
  • Animation patterns: Appearance/Reveal effects (28.1%) and State Transitions (13.2%) dominate, alongside Organic/Natural Movement (12.3%) and Rotational Movement (8.8%).
  • Curation: Instructions simulate real-world web use cases, testing techniques from simple movements to complex 3D rotations and synchronized transitions.

The dataset serves exclusively for evaluation—not training—to assess SVG animation tools against practical web development needs. No training splits, mixture ratios, or additional processing (e.g., cropping) are applied; all entries are manually designed to ensure coverage of critical interaction patterns and visual content types. Metadata reflects thematic categories and animation patterns for structured performance analysis.

Method

The authors leverage a three-stage pipeline to enable vision-language models (VLMs) to generate semantically coherent animations from SVG files. The core innovation lies in the middle stage, Vector Prism, which restructures the inherently syntactic and rendering-optimized SVG hierarchy into a semantically meaningful one that aligns with how VLMs perceive visual concepts. This restructuring bridges the gap between high-level animation planning and low-level code generation.

The pipeline begins with animation planning, where a VLM interprets a rasterized version of the SVG and generates a high-level plan based on user instructions. For example, given the instruction “make the sun rise,” the VLM identifies the sun and sky as semantic components and proposes their respective motions. However, since VLMs lack understanding of SVG syntax, they cannot directly implement these plans. This is where Vector Prism intervenes.

As shown in the figure below, Vector Prism takes each SVG primitive, such as an individual path or basic shape element, and renders it through multiple focused views: bounding box, isolation, highlight, zoom-in, and outline. Each view provides a complementary visual cue to the VLM, which then assigns a tentative semantic label to the primitive. These labels are inherently noisy, as different rendering methods vary in reliability. For instance, a bounding box view might reliably identify a "Plus" symbol with $p = 0.9$, while a zoom-in view might misclassify it with $p = 0.5$.
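As a concrete sketch of what this stage produces, the per-view predictions can be collected into a primitives-by-views label matrix that the aggregation step below consumes. The `collect_view_labels` function, the view-name strings, and the `query_vlm_label` helper are our own illustration under these assumptions, not the paper's API:

```python
import numpy as np

# The five focused views described above; each yields one tentative label per primitive.
VIEWS = ["bounding_box", "isolation", "highlight", "zoom_in", "outline"]

def collect_view_labels(primitives, query_vlm_label):
    """Return an (n_primitives x n_views) integer label matrix and the label vocabulary.

    query_vlm_label(primitive, view) is a hypothetical helper that renders the
    primitive under the given view and asks the VLM for a semantic label string.
    """
    raw = [[query_vlm_label(p, v) for v in VIEWS] for p in primitives]
    vocab = sorted({lab for row in raw for lab in row})   # e.g. ["Ears", "Eyes", "Nose", ...]
    index = {lab: i for i, lab in enumerate(vocab)}
    labels = np.array([[index[lab] for lab in row] for row in raw], dtype=int)
    return labels, vocab
```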

To resolve this noise, Vector Prism employs a statistical inference framework based on the Dawid-Skene model. It first estimates the reliability $p_i$ of each rendering method $i$ by analyzing pairwise agreement patterns across all primitives. The agreement matrix $A_{ij}$, which captures how often methods $i$ and $j$ agree on labels, is centered to remove chance agreement and decomposed via eigenvector analysis. The top eigenvector of the centered matrix yields the skill vector $\delta$, from which the reliability $p_i = \frac{1}{k} + \delta_i$ is recovered.
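A minimal sketch of this estimation step, assuming the symmetric-noise Dawid-Skene setting the text describes (each method errs uniformly over the $k - 1$ wrong labels, so $A_{ij} - \frac{1}{k} \approx \frac{k}{k-1}\delta_i\delta_j$ and the centered matrix is rank one); the diagonal heuristic, the clipping, and the function name are our own simplifications:

```python
import numpy as np

def estimate_reliabilities(labels, k):
    """Estimate per-view accuracies p_i from pairwise agreement.

    labels: (n_primitives, n_views) integer label matrix from the previous sketch.
    k:      number of distinct semantic labels.
    """
    n, m = labels.shape
    # Empirical pairwise agreement, centered to remove chance-level agreement.
    A = np.array([[np.mean(labels[:, i] == labels[:, j]) for j in range(m)]
                  for i in range(m)])
    C = A - 1.0 / k
    # The diagonal (self-agreement) is uninformative; replace it with each row's
    # largest off-diagonal magnitude so it does not distort the eigendecomposition.
    off = C.copy()
    np.fill_diagonal(off, 0.0)
    np.fill_diagonal(C, np.abs(off).max(axis=1))
    # Top eigenpair of the (approximately rank-one) centered matrix yields delta.
    eigvals, eigvecs = np.linalg.eigh(C)
    lam, v = eigvals[-1], eigvecs[:, -1]
    delta = np.sqrt(max(lam, 0.0) * (k - 1) / k) * v
    if delta.mean() < 0:      # resolve the sign ambiguity: most views beat chance
        delta = -delta
    # Clip away from 0 and 1 so the log-odds weights below stay finite.
    return np.clip(1.0 / k + delta, 1e-3, 1.0 - 1e-3)
```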

With reliabilities estimated, Vector Prism applies a Bayes decision rule to assign the final semantic label to each primitive. Instead of majority voting, it performs a weighted vote in which method $i$ receives weight $w_i = \log \frac{(k-1)\hat{p}_i}{1 - \hat{p}_i}$. This downweights unreliable predictions: for example, with $k = 2$ labels, a method with $p = 0.1$ contributes $\log \frac{1}{9}$ to the score, ensuring that the final label minimizes the expected classification error. As illustrated in the figure, this approach outperforms majority voting by preventing low-reliability predictors from swinging the outcome.
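The corresponding decision rule, continuing the same sketch (the function name is ours); note that a view at chance accuracy $\frac{1}{k}$ gets weight zero and therefore cannot swing the vote:

```python
import numpy as np

def bayes_decision(labels, p_hat, k):
    """Pick, for each primitive, the label with the highest log-odds-weighted vote."""
    w = np.log((k - 1) * p_hat / (1.0 - p_hat))   # w_i = log((k-1) p_i / (1 - p_i))
    n, _ = labels.shape
    scores = np.zeros((n, k))
    for i, wi in enumerate(w):                    # accumulate each view's weighted vote
        scores[np.arange(n), labels[:, i]] += wi
    return scores.argmax(axis=1)                  # final semantic label per primitive
```

Chaining the three sketches (collect_view_labels, then estimate_reliabilities, then bayes_decision) yields one robust semantic label per primitive, which the restructuring stage below consumes.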

Once semantic labels are assigned, the final stage restructures the SVG. The original hierarchy is flattened, and primitives are regrouped by label while preserving the original paint order and visual appearance. Overlaps between different semantic groups are checked to prevent rendering artifacts. The resulting SVG retains identical visual output but is now organized into coherent, animation-ready groups—such as “Ears,” “Eyes,” and “Nose”—each annotated with metadata like bounding boxes and geometric centers. This structured SVG is then passed to an LLM, which generates CSS animation code per semantic group, using an iterative strategy to handle token limits and a “lanes” system to prevent animation conflicts.
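A hedged sketch of the regrouping idea using Python's standard xml.etree module; it assumes a flat SVG without group-level transforms and omits the paper's overlap checks and metadata annotation, so it should be read as an illustration of the paint-order-preserving grouping rather than the authors' implementation:

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)   # keep the default namespace on output

SHAPES = ("path", "rect", "circle", "ellipse", "polygon", "polyline", "line")

def regroup_by_label(svg_in, svg_out, labels_by_index):
    """Wrap consecutive same-label primitives in <g> groups, keeping paint order.

    labels_by_index maps the document-order index of each drawable primitive to
    its inferred semantic label (the output of the aggregation sketches above).
    """
    tree = ET.parse(svg_in)
    root = tree.getroot()
    drawables = [el for el in root.iter() if el.tag.split("}")[-1] in SHAPES]
    parents = {child: parent for parent in root.iter() for child in parent}
    for el in drawables:                      # flatten: detach primitives from old groups
        parents[el].remove(el)
    group, current = None, None
    for idx, el in enumerate(drawables):      # regroup consecutive runs of one label
        label = labels_by_index.get(idx, "Unlabeled")
        if label != current:
            group = ET.SubElement(root, f"{{{SVG_NS}}}g", {"id": label})
            current = label
        group.append(el)
    tree.write(svg_out)
```

Grouping only consecutive runs of a label keeps the relative drawing order of the primitives, which is why the rendered image stays visually identical while the markup becomes animation-ready.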

The entire process transforms an unstructured SVG into a semantically enriched one, enabling VLMs to animate at the level of meaningful parts rather than low-level shapes. This results in animations that are both visually stable and semantically consistent with user intent.

Experiment

  • Evaluated against AniClipart, GPT-5, Wan 2.2, and Sora 2 on instruction-following and perceptual quality metrics (CLIP-T2V, GPT-T2V, DOVER), achieving best scores across all benchmarks by enabling semantic part-aware motion in vector graphics.
  • User study with 760 pairwise comparisons showed 83.4% human preference alignment with GPT-T2V scores, consistently favoring the method over Sora 2 and Wan 2.2 in instruction adherence.
  • SVG animations achieved 54× smaller file sizes than Sora 2 while maintaining resolution independence, demonstrating superior encoding efficiency for web deployment.
  • Semantic clustering via Vector Prism achieved Davies-Bouldin Index of 0.82, significantly outperforming majority voting (12.6) and original SVG groupings (33.8) in structural coherence.

The authors analyze 114 animation instructions categorized by subject theme, with Nature/Environment being the most frequent at 31.6%, followed by Objects/Miscellaneous at 26.3%. UI/Interface Elements, Tech Logos/Brands, Animals/Characters, and Faces/Emojis make up the remaining categories, reflecting a diverse set of animation targets used in their evaluation.

The authors also break down the 114 animation instructions by interaction pattern, finding that "Other/Mixed" patterns are most frequent at 37.7%, followed by "Appearance/Reveal" at 28.1%. Results show a diverse distribution across categories, with "Rotational Movement" being the least common at 8.8%.

The authors evaluate their method against baselines using CLIP-T2V, GPT-T2V, and DOVER metrics, with Vector Prism achieving the highest scores across all three. Results show that their approach outperforms both vector-based and video generation models in instruction following and perceptual quality, despite not being trained on video-text data. The method also maintains vector format compatibility, unlike raster-based video models.

