Voxify3D: Pixel Art Meets Volumetric Rendering

Yi-Chuan Huang, Jiewen Chan, Hao-Jen Chien, Yu-Lun Liu

Abstract

Voxel art is a distinctive stylization widely used in games and digital media, yet automated generation from 3D meshes remains challenging due to conflicting requirements of geometric abstraction, semantic preservation, and discrete color coherence. Existing methods either over-simplify geometry or fail to achieve the pixel-precise, palette-constrained aesthetics of voxel art. We introduce Voxify3D, a differentiable two-stage framework bridging 3D mesh optimization with 2D pixel art supervision. Our core innovation lies in the synergistic integration of three components: (1) orthographic pixel art supervision that eliminates perspective distortion for precise voxel-pixel alignment; (2) patch-based CLIP alignment that preserves semantics across discretization levels; (3) palette-constrained Gumbel-Softmax quantization enabling differentiable optimization over discrete color spaces with controllable palette strategies. This integration addresses fundamental challenges: semantic preservation under extreme discretization, pixel-art aesthetics through volumetric rendering, and end-to-end discrete optimization. Experiments show superior performance (37.12 CLIP-IQA, 77.90% user preference) across diverse characters and controllable abstraction (2-8 colors, 20x-50x resolutions). Project page: https://yichuanh.github.io/Voxify-3D/

Summarization

Researchers from National Taiwan University propose Voxify3D, a differentiable two-stage framework that generates high-fidelity voxel art from 3D meshes by combining orthographic pixel art supervision, patch-based CLIP alignment, and palette-constrained Gumbel-Softmax quantization, enabling semantic preservation, precise voxel-pixel alignment, and controllable color discretization for game-ready assets.

Key Contributions

  • Voxify3D addresses the challenge of generating semantically meaningful voxel art from 3D meshes by introducing orthographic pixel art supervision across six canonical views, eliminating perspective distortion and enabling precise voxel-pixel alignment for effective gradient-based optimization.
  • The method preserves critical semantic features under extreme geometric discretization (20×–50× resolution reduction) through a patch-based CLIP loss that maintains local and global object identity where standard perceptual losses fail.
  • Voxify3D enables end-to-end optimization with controllable discrete color palettes (2–8 colors) via palette-constrained Gumbel-Softmax quantization, supporting flexible palette extraction strategies and achieving superior aesthetic quality, as validated by high CLIP-IQA scores (37.12) and strong user preference (77.90%).

Introduction

The authors leverage the growing demand for stylized 3D content in games and digital media to address the challenge of automating high-quality voxel art generation from 3D meshes. Existing methods either focus on 2D pixel art—unsuitable for 3D due to projection misalignment and view inconsistency—or rely on photorealistic neural rendering that lacks stylistic abstraction. Prior approaches also fail to preserve semantic features under extreme discretization and struggle with discrete color optimization, while procedural tools require extensive manual tuning.

The authors’ main contribution is Voxify3D, a two-stage framework that bridges 3D voxel optimization with 2D pixel art supervision to generate semantically faithful, palette-constrained voxel art. It overcomes fundamental misalignment and quantization issues through tightly coupled rendering and loss design.

Key innovations include:

  • Orthographic pixel art supervision using six canonical views to eliminate perspective distortion and enable precise, gradient-based stylization.
  • Resolution-adaptive patch-based CLIP loss that preserves critical semantic features (e.g., facial details) even under 20×–50× discretization, where global perceptual losses fail.
  • Palette-constrained differentiable quantization via Gumbel-Softmax with user-controllable palette extraction (K-means, Max-Min, Median Cut, Simulated Annealing), enabling end-to-end optimization of discrete color spaces (2–8 colors).

Method

The authors leverage a two-stage framework to convert 3D meshes into stylized voxel art, balancing geometric fidelity with semantic abstraction. The pipeline begins with coarse voxel grid initialization and progresses to fine-tuning under pixel-art supervision, incorporating semantic guidance and discrete color quantization for stylized output.

In the first stage, the authors adapt Direct Voxel Grid Optimization (DVGO) to construct an explicit voxel radiance field. This grid comprises two components: a density grid $d$ for spatial occupancy and an RGB color grid $\mathbf{c} = (r, g, b)$ for appearance. The grid resolution is determined by dividing the object's bounding box into $(W/\texttt{cell\_size})^3$ voxels, where $W$ is the canonical orthographic image width and $\texttt{cell\_size}$ defines the pixel-to-voxel scale. Volume rendering along a ray $\mathbf{r}$ computes the final color $C(\mathbf{r})$ using the standard compositing formula:

$$C(\mathbf{r}) = \sum_{k=1}^{N} T_k \alpha_k \mathbf{c}_k, \quad T_k = \exp\left(-\sum_{j=1}^{k-1} d_j \delta_j\right), \quad \alpha_k = 1 - \exp(-d_k \delta_k),$$

where $N$ is the number of samples, $d_k$ is the density, $\delta_k$ is the step size, $T_k$ is the accumulated transmittance, and $\alpha_k$ is the opacity at sample $k$. The coarse grid is optimized using a composite loss:

$$\mathcal{L}_\text{total} = \mathcal{L}_\text{render} + \lambda_d \mathcal{L}_\text{density} + \lambda_b \mathcal{L}_\text{bg},$$

where $\mathcal{L}_\text{render}$ minimizes the MSE between rendered and target colors, $\mathcal{L}_\text{density}$ applies total-variation (TV) regularization to enforce spatial smoothness, and $\mathcal{L}_\text{bg}$ uses an entropy loss to suppress background artifacts. This stage provides a geometrically and chromatically stable initialization.
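
As a reference for the formulas above, here is a minimal PyTorch sketch of the compositing step and the Stage-1 loss. The tensor shapes, the placeholder weights `lambda_d` and `lambda_b`, and the function names are assumptions for illustration, not the authors' implementation, which builds on DVGO's optimized grid sampling.

```python
import torch
import torch.nn.functional as F

def composite_rays(density, color, delta):
    """Alpha-composite N samples along R rays.
    density: (R, N) raw densities d_k; color: (R, N, 3); delta: (R, N) step sizes.
    Returns per-ray RGB C(r) and the accumulated opacity."""
    alpha = 1.0 - torch.exp(-density * delta)                        # alpha_k = 1 - exp(-d_k * delta_k)
    # T_k = exp(-sum_{j<k} d_j delta_j) = prod_{j<k} (1 - alpha_j), via a shifted cumulative product
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = trans * alpha                                          # T_k * alpha_k
    rgb = (weights.unsqueeze(-1) * color).sum(dim=1)                 # C(r) = sum_k T_k alpha_k c_k
    return rgb, weights.sum(dim=1)

def total_variation(grid):
    """TV regularizer on a (X, Y, Z) density grid (L_density)."""
    return ((grid[1:] - grid[:-1]).abs().mean()
            + (grid[:, 1:] - grid[:, :-1]).abs().mean()
            + (grid[:, :, 1:] - grid[:, :, :-1]).abs().mean())

def stage1_loss(rgb_pred, rgb_gt, density_grid, acc_opacity, lambda_d=0.01, lambda_b=0.001):
    """L_total = L_render + lambda_d * L_density + lambda_b * L_bg (weights are placeholders)."""
    l_render = F.mse_loss(rgb_pred, rgb_gt)                          # L_render
    l_density = total_variation(density_grid)                        # L_density
    p = acc_opacity.clamp(1e-5, 1.0 - 1e-5)                          # entropy on accumulated opacity
    l_bg = -(p * torch.log(p) + (1.0 - p) * torch.log(1.0 - p)).mean()
    return l_render + lambda_d * l_density + lambda_b * l_bg
```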

In the second stage, the authors fine-tune the voxel grid using orthographic pixel art supervision generated from six axis-aligned views. This setup ensures pixel-to-voxel alignment without perspective distortion, as illustrated in the comparison between perspective and orthographic projections. Orthographic rays are defined as $\mathbf{r}_i(t) = \mathbf{o}_i + t\mathbf{d}$, where $\mathbf{o}_i$ is the ray origin for pixel $\mathbf{p}_i$ and $\mathbf{d}$ is a fixed direction. The authors apply three key losses: a pixel-level MSE $\mathcal{L}_\text{pixel} = \|C(\mathbf{r}) - C_\text{pixel}\|_2^2$, a depth-consistency term $\mathcal{L}_\text{depth} = \|D(\mathbf{r}) - D_\text{gt}\|_1$, and an alpha regularization $\mathcal{L}_\alpha = \|\mathcal{M}_\alpha \odot \bar{\alpha}\|^2$, where $\mathcal{M}_\alpha$ is a binary mask derived from the pixel art alpha channel and $\bar{\alpha}$ is the accumulated opacity. These losses jointly preserve structure, enforce clean silhouettes, and suppress floating density in background regions.
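
A sketch of one orthographic view's rays and the three geometric losses follows, under assumed conventions (axis-aligned unit directions, a bounding box of half-width `extent`, unit loss weights); the actual pipeline generates six such views and weights the terms separately.

```python
import torch
import torch.nn.functional as F

def orthographic_rays(res, extent, axis=2, sign=-1.0):
    """Rays r_i(t) = o_i + t * d for one canonical view: every ray shares the fixed
    direction d along `axis`, and the origins o_i form a pixel-aligned grid on the
    opposite face of the bounding box (half-width `extent`)."""
    lin = torch.linspace(-extent, extent, res)
    u, v = torch.meshgrid(lin, lin, indexing="ij")
    plane = torch.full_like(u, -sign * extent)          # start on the far face, march toward the object
    coords = [u, v]
    coords.insert(axis, plane)
    origins = torch.stack(coords, dim=-1).reshape(-1, 3)
    direction = torch.zeros(3)
    direction[axis] = sign
    return origins, direction.expand_as(origins)

def stage2_geometry_losses(rgb_pred, rgb_pixel_art, depth_pred, depth_gt, acc_opacity, bg_mask):
    """L_pixel, L_depth, and L_alpha as defined above.
    `bg_mask` plays the role of M_alpha, derived from the pixel art alpha channel and
    marking regions where the accumulated opacity should vanish."""
    l_pixel = F.mse_loss(rgb_pred, rgb_pixel_art)       # ||C(r) - C_pixel||^2
    l_depth = F.l1_loss(depth_pred, depth_gt)           # ||D(r) - D_gt||_1
    l_alpha = ((bg_mask * acc_opacity) ** 2).mean()     # suppress floating density in the background
    return l_pixel, l_depth, l_alpha
```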

To maintain semantic alignment during stylization, the authors introduce a CLIP-based perceptual loss. Half of the rays are sampled to form image patches, and CLIP features are extracted from both rendered patches $\hat{I}_\text{patch}$ and corresponding mesh-based patches $I^\text{mesh}_\text{patch}$. The loss is computed as:

$$\mathcal{L}_\text{clip} = 1 - \cos\left(\text{CLIP}(\hat{I}_\text{patch}),\ \text{CLIP}(I^\text{mesh}_\text{patch})\right),$$

where cosine similarity encourages semantic fidelity while allowing stylistic abstraction.
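
A minimal sketch of the patch CLIP loss is shown below; the CLIP model interface (`encode_image`, as in OpenAI CLIP / open_clip), the patch size, and the grouping of half the rays into patches are assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def patch_clip_loss(clip_model, rendered_patches, mesh_patches):
    """L_clip = 1 - cos(CLIP(rendered patch), CLIP(mesh patch)), averaged over patches.
    Both inputs are (B, 3, 224, 224) tensors already resized and normalized for CLIP."""
    with torch.no_grad():                                    # mesh renders are fixed targets
        feat_mesh = F.normalize(clip_model.encode_image(mesh_patches), dim=-1)
    feat_render = F.normalize(clip_model.encode_image(rendered_patches), dim=-1)
    cosine = (feat_render * feat_mesh).sum(dim=-1)           # cosine similarity per patch
    return (1.0 - cosine).mean()
```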

To achieve clean, stylized outputs with a coherent palette, the authors replace the RGB color grid with a learned color-logit grid. Each voxel $(i,j,k)$ stores a logit vector $\boldsymbol{\lambda}_{i,j,k} \in \mathbb{R}^C$, where $C$ is the number of colors in a predefined palette extracted from the pixel art views. During training, Gumbel noise $\mathbf{G}_{i,j,k} \sim \text{Gumbel}(0,1)$ is added to produce noisy logits:

$$\mathbf{Y}_{i,j,k} = \boldsymbol{\lambda}_{i,j,k} + \mathbf{G}_{i,j,k}.$$

A temperature-controlled softmax then computes selection probabilities:

$$s_{i,j,k,n}(\tau) = \frac{\exp(Y_{i,j,k,n}/\tau)}{\sum_{n'=1}^{C} \exp(Y_{i,j,k,n'}/\tau)},$$

where $\tau$ is annealed during training to transition from soft exploration to discrete selection. The final RGB value is a weighted sum over the palette:

$$\text{RGB}_{i,j,k} = \sum_{n=1}^{C} s_{i,j,k,n}\,\mathbf{c}_n.$$

In the forward pass, a straight-through estimator uses $\arg\max_n s_{i,j,k,n}$ for discrete selection, while gradients flow through the soft weights. After training, the voxel color is assigned as:

$$\text{RGB}^\text{voxel}_{i,j,k} = \mathbf{c}_{\arg\max_n \lambda_{i,j,k,n}}.$$

This enables end-to-end optimization of discrete color assignments while preserving differentiability during training.
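
The palette-constrained quantization above can be sketched as a small module; the grid layout, zero initialization of the logits, and the default temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

class PaletteColorGrid(torch.nn.Module):
    """Color-logit grid with palette-constrained Gumbel-Softmax selection.
    `palette` is a (C, 3) tensor of RGB colors; each voxel stores a logit vector
    lambda_{i,j,k} in R^C over those palette entries."""

    def __init__(self, grid_shape, palette):
        super().__init__()
        num_colors = palette.shape[0]
        self.logits = torch.nn.Parameter(torch.zeros(*grid_shape, num_colors))
        self.register_buffer("palette", palette)

    def forward(self, tau=1.0, hard=True):
        # Y = lambda + G, with G ~ Gumbel(0, 1)
        gumbel = -torch.log(-torch.log(torch.rand_like(self.logits) + 1e-10) + 1e-10)
        s = F.softmax((self.logits + gumbel) / tau, dim=-1)   # soft selection probabilities s(tau)
        if hard:
            # straight-through estimator: hard argmax in the forward pass,
            # gradients flow through the soft weights in the backward pass
            index = s.argmax(dim=-1, keepdim=True)
            s_hard = torch.zeros_like(s).scatter_(-1, index, 1.0)
            s = s_hard + (s - s.detach())
        return s @ self.palette                                # (X, Y, Z, 3) RGB per voxel

    @torch.no_grad()
    def export_colors(self):
        """Final discrete assignment: RGB_{ijk} = c_{argmax_n lambda_{ijk,n}}."""
        return self.palette[self.logits.argmax(dim=-1)]
```

During fine-tuning, `tau` would typically be annealed from around 1.0 toward a small value so that the soft weights collapse onto single palette entries before the final hard export.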

The overall fine-tuning loss is a weighted sum:

$$\mathcal{L}_\text{total} = \lambda_\text{pixel}\mathcal{L}_\text{pixel} + \lambda_\text{depth}\mathcal{L}_\text{depth} + \lambda_\text{alpha}\mathcal{L}_\text{alpha} + \lambda_\text{clip}\mathcal{L}_\text{clip},$$

where $\mathcal{L}_\text{pixel}$, $\mathcal{L}_\text{depth}$, and $\mathcal{L}_\text{alpha}$ supervise geometry and appearance, and $\mathcal{L}_\text{clip}$ provides semantic guidance. Training is scheduled to prioritize the CLIP loss early (until 6,000 iterations), then shift focus to silhouette refinement via $\mathcal{L}_\text{alpha}$. After 4,500 iterations, optimization is restricted to the front view to refine salient features while preserving global consistency.
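
A hedged sketch of how the weighted loss and iteration schedule described above might be wired together; the weight values and the exact switching behavior are placeholders, since the summary only gives the iteration thresholds.

```python
def finetune_loss(losses, iteration, weights=None):
    """Weighted Stage-2 objective; `losses` maps names ('pixel', 'depth', 'alpha', 'clip')
    to scalar tensors. The weight values below are placeholders, not the paper's settings."""
    w = dict(weights or {"pixel": 1.0, "depth": 0.1, "alpha": 0.5, "clip": 0.2})
    if iteration >= 6000:
        w["clip"] = 0.0            # CLIP guidance is prioritized only until 6,000 iterations
        w["alpha"] *= 2.0          # afterwards, emphasize silhouette refinement via L_alpha
    return sum(w[k] * losses[k] for k in w)

def active_views(iteration, views=("front", "back", "left", "right", "top", "bottom")):
    """After 4,500 iterations, supervision is restricted to the front view."""
    return ("front",) if iteration >= 4500 else views
```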

Experiment

  • Qualitative comparisons on eight character meshes show the proposed method preserves sharp edges and key features (e.g., ears, eyes) across 25×–50× resolutions, outperforming IN2N, Vox-E, and Blender in consistency, stylization, and semantic alignment.
  • Quantitative evaluation using CLIP-IQA on 35 character meshes shows the method achieves the highest average cosine similarity between GPT-4-generated prompts and rendered images, indicating superior semantic fidelity and stylization.
  • Ablation study confirms the necessity of key components: removing pixel art supervision, orthographic projection, coarse initialization, depth loss, CLIP loss, or Gumbel Softmax leads to blurred results, distortions, or color ambiguity.
  • User study with 72 participants shows the method wins 77.90% of votes for abstract detail, 80.36% for visual appeal, and 96.55% for geometry preservation against four baselines.
  • Expert study with 10 art-trained participants shows 88.89% preference for results using Gumbel Softmax in color quantization, highlighting its role in achieving clear edges and dominant tones.
  • Color palette controllability is demonstrated across 2–8 colors using K-means, Median Cut, Max-Min, and Simulated Annealing, with K-means as the default (a minimal extraction sketch follows this list).
  • Additional comparisons show the method outperforms Gemini 3 in controllable voxel resolution and color, and Rodin in geometric fidelity, due to multi-view optimization.
  • Runtime analysis reports total generation time of under 2 hours on an RTX 4090 (8.5 min for Stage 1, 108 min for Stage 2), faster than SD-piXL (~4h).
  • Failure cases occur on highly complex shapes at low resolutions, suggesting future potential in adaptive voxel grids.
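
As referenced in the palette-controllability bullet above, a K-means palette extraction over the pixel art views might look as follows; the function name, the use of scikit-learn, and the RGBA input convention are assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_palette(pixel_art_views, n_colors=6):
    """Extract an n_colors palette from the pixel art views with K-means (the default
    strategy). `pixel_art_views` is a list of (H, W, 4) RGBA uint8 arrays; fully
    transparent pixels are excluded so the background does not become a palette color."""
    pixels = []
    for img in pixel_art_views:
        rgb, alpha = img[..., :3], img[..., 3]
        pixels.append(rgb[alpha > 0].reshape(-1, 3))
    pixels = np.concatenate(pixels, axis=0).astype(np.float32) / 255.0
    km = KMeans(n_clusters=n_colors, n_init=10, random_state=0).fit(pixels)
    return km.cluster_centers_                     # (n_colors, 3) palette in [0, 1]
```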

The authors use CLIP-IQA to evaluate semantic fidelity by computing cosine similarity between GPT-4 generated prompts and rendered voxel outputs across 35 character meshes. Results show their method achieves the highest score of 37.12, outperforming all baselines including Blender (36.31), Pixel (35.53), and Vox-E (35.02), with IN2N scoring lowest at 23.93. This confirms superior semantic alignment and stylized abstraction in their approach.

The authors evaluate the impact of CLIP loss across different voxel resolutions, showing that incorporating CLIP loss consistently improves semantic alignment compared to ablations without it. Results indicate higher CLIP-IQA scores across all tested voxel sizes (25× to 50×), confirming that CLIP loss enhances character identity preservation during voxel abstraction.

The authors evaluate their method against four baselines using a user study with 72 participants, measuring performance across abstract detail, visual appeal, and geometry preservation. Results show their method receives 77.90% of votes for abstract detail, 80.36% for visual appeal, and 96.55% for geometry faithfulness, substantially outperforming all alternatives.

The authors evaluate color quantization using a user study with 10 art-trained participants, comparing results with and without Gumbel-Softmax across 10 example pairs. Results show that 88.89% of participants preferred the outputs generated with Gumbel-Softmax for voxel art appeal, highlighting its role in producing clear edges and dominant tones.
