초록

컴퓨터 지원 설계 (CAD) 는 구조화되고 편집 가능한 기하학적 표현에 의존하지만, 기존 생성 방법들은 명시적인 설계 이력이나 경계 표현 (BRep, Boundary Representation) 레이블이 포함된 소규모 주석 데이터셋에 의해 제한받고 있습니다. 반면, 수백만 개의 주석되지 않은 3D 메시는 아직 활용되지 못한 채 방치되어 있어 확장 가능한 CAD 생성 기술의 발전을 저해하고 있습니다. 이에 본 연구에서는 CAD 특화 주석 없이 점 단위 감독 (point-level supervision) 만으로 편집 가능한 BRep 을 직접 생성하는 다중 모달 생성 프레임워크인 'DreamCAD'를 제안합니다. DreamCAD 는 각 BRep 을 매개변수 패치 집합 (예: 베지어 곡면) 으로 표현하고, 미분 가능한 테셀레이션 기법을 활용하여 메시를 생성함으로써, 대규모 3D 데이터셋에서의 학습과 연결성 및 편집성이 보장된 표면 재구성을 동시에 가능하게 합니다. 또한, 텍스트-투-CAD 연구 고도화를 위해 GPT-5 를 활용하여 100 만 개 이상의 설명을 생성한 'CADCap-1M'을 소개합니다. 이는 현재까지 공개된 CAD 캡셔닝 데이터셋 중 가장 규모가 큰 것입니다. 실험 결과, DreamCAD 는 텍스트, 이미지, 점 기반 모달리티를 아우르는 ABC 및 Objaverse 벤치마크에서 최첨단 (state-of-the-art) 성능을 달성하였으며, 기하학적 충실도를 향상시키고 사용자 선호도에서 75% 를 상회하는 성과를 보였습니다. 관련 코드와 데이터셋은 추후 공개될 예정입니다.

One-sentence Summary

Researchers from DFKI, RPTU, Imperial College London, and Huawei London propose DreamCAD, a multimodal framework using differentiable Bézier patches to generate editable CAD models from unannotated 3D meshes. This approach overcomes scalability limits of prior methods and introduces the massive CADCap-1M dataset for advanced text-to-CAD applications.

Key Contributions

DreamCAD addresses the scalability bottleneck in multimodal CAD generation by representing shapes as $C^{0}$ -continuous Bézier patches with differentiable tessellation, enabling direct point-level supervision on large-scale 3D meshes without requiring explicit CAD annotations.
The authors introduce CADCap-1M, the largest CAD captioning dataset to date, which contains over 1 million GPT-5-generated descriptions to advance text-to-CAD research and overcome the limitations of small, existing annotated datasets.
Experiments on the ABC and Objaverse benchmarks demonstrate that DreamCAD achieves state-of-the-art performance across text, image, and point modalities, reducing Chamfer Distance by up to 70% and surpassing 75% in user preference evaluations.

Introduction

Computer-Aided Design (CAD) is essential for engineering and manufacturing, yet scaling generative AI for this domain remains difficult because standard Boundary Representation (BRep) formats are discrete and non-differentiable. Prior methods relying on design histories are limited to small datasets and struggle with complex shapes, while approaches using explicit BRep annotations cannot leverage the millions of available unannotated 3D meshes. To overcome these barriers, the authors introduce DreamCAD, a multimodal framework that represents shapes as differentiable Bézier patches to enable direct point-level supervision on large-scale mesh data without requiring CAD-specific labels. This approach allows the model to generate editable parametric surfaces from text, images, or point clouds while achieving state-of-the-art geometric accuracy and supporting the creation of the massive CADCap-1M dataset for future research.

Dataset

Dataset Composition and Sources: The authors introduce CADCap-1M, a dataset containing over 1 million high-quality captions for CAD models sourced from ABC, Automate, CADParser, Fusion360, ModelNet, and 3D-Future. This resource fills a gap in text-to-CAD generation by providing shape-centric descriptions derived from diverse industrial and synthetic repositories.
Key Details for Each Subset:
- ABC and Automate: These subsets undergo rigorous filtering where 99% of trivial cuboids and simple cylindrical objects are removed based on topological analysis of faces, edges, and curvature. Degenerate models with unrealistic bounding boxes or insufficient geometric complexity (fewer than 5 faces or 10 vertices) are also discarded.
- Fusion360: Approximately 46% of samples in this subset contain extractable part names, which are leveraged to enhance caption specificity.
- General Statistics: The final dataset features captions with a mean length under 20 words and high linguistic diversity, including over 21k unigrams and 2.3M trigrams.
Data Usage and Processing:
- Caption Generation: The authors render four orthographic views per model using Blender and prompt GPT-5 to generate descriptions. Prompts are augmented with metadata such as model names, hole counts, and relative dimensions to reduce hallucinations and improve geometric accuracy.
- Training Preparation: For image-to-CAD training, the team renders 150 multi-view images per object using three complementary camera trajectories (azimuth sweep, elevation sweep, and uniform hemisphere sampling) to ensure full coverage.
- Visual Feature Extraction: Textures are synthesized by assigning random diffuse colors from a mid-tone to light palette for textureless meshes. Images are resized to 518x518 pixels for DINO processing.
Metadata and Filtering Strategies:
- Metadata Augmentation: The pipeline extracts part names from STEP files and computes geometric properties like hole counts and aspect ratios to guide the LLM. This approach allows the model to distinguish between visually similar parts, such as different washer types or bolt specifications.
- Quality Control: Low-quality models are filtered using OpenCascade to analyze surface types (planes, cylinders, B-splines) and edge characteristics. This ensures the training data excludes overly simple or physically unrealistic geometries.
- Color Palette: During fine-tuning of Stable-Diffusion 3.5, a curated palette of 30+ dark, low-saturation colors is applied to textureless models to improve visual consistency.

Method

The DreamCAD framework is designed as a multi-modal generative system that produces editable parametric surfaces directly from point-level supervision. The architecture relies on a differentiable representation of 3D shapes and a coarse-to-fine generation strategy to ensure geometric fidelity and topological consistency.

Parametric Surface Representation

The core of the method utilizes bicubic rational Bézier surfaces due to their analytical tractability and compatibility with standard CAD operations. A rational Bézier surface $S(u,v)$ is defined over a $uv$ domain by a grid of control points $C$ and associated non-negative weights $W$ . The surface is evaluated as:

S ( u , v ) = \frac { \sum _ { i , j } B _ { i } ^ { n } ( u ) B _ { j } ^ { m } ( v ) w _ { i j } c _ { i j } } { \sum _ { i , i } B _ { i } ^ { n } ( u ) B _ { i } ^ { m } ( v ) w _ { i j } }

where $B$ represents Bernstein basis functions. This formulation allows for differentiable mesh generation through tessellation, where the $uv$ domain is sampled on a grid to form quadrilateral cells, which are then split into triangles.

Variational Autoencoder and Surface Decoding

The authors employ a Sparse Transformer VAE to encode 3D shapes into compact latent representations. The process begins by voxelizing an input mesh to a $32^3$ resolution. To preserve fine geometric details, each active voxel is augmented with visual cues derived from rendering 150 RGB and normal views. Features are extracted using DINOv2 embeddings and combined with per-view normals, voxel centers, and signed distance values. These features are processed by a sparse Transformer encoder to produce structured latents.

Refer to the framework diagram to see the full pipeline. The decoder reconstructs the 3D shape as a set of Bézier patches. To ensure $C^0$ continuity between adjacent patches, the system first generates an initial parametric quad surface from the sparse voxels using a flood-fill algorithm. Each quad is converted into a bicubic rational Bézier patch by sampling a $4 \times 4$ grid of control points. The decoder then refines this initial surface by predicting local deformations and weight updates for each control point, enforcing continuity by averaging predictions at shared boundaries.

Coarse-to-Fine Conditional Generation

For conditional generation from text, images, or point clouds, DreamCAD adopts a two-stage flow-matching framework. In the first stage, a coarse voxel grid is generated from the input condition using a lightweight VAE. In the second stage, local features are generated for each active voxel, which are then transformed into the final parametric surface by the pretrained decoder.

For text-to-CAD tasks, the authors utilize a two-stage approach to improve prompt fidelity. They fine-tune a Stable Diffusion model on the CADCap-1M dataset to align its output with the image-to-CAD distribution. The generated images then condition the image-to-CAD model. This avoids the slow convergence and low fidelity often associated with direct text-to-3D training.

CAD Topology Recovery

While the generative model produces a set of parametric patches, recovering the explicit CAD topology (BRep) remains a challenge. The authors introduce a post-processing module that leverages a large language model (Qwen3) to recover the topology. The control points and weights of the generated patches are embedded and fed into the model, which outputs a structured NURBS representation including face connectivity, poles, knots, and weights. This allows the final output to be converted into a standard CAD format.

Experiment

Multimodal generation experiments validate that DreamCAD achieves state-of-the-art performance in point-to-CAD, image-to-CAD, and text-to-CAD tasks, accurately reconstructing complex geometries and intricate features while maintaining a zero invalidity ratio across in-distribution and out-of-distribution datasets.
Caption quality evaluations confirm that metadata-augmented prompting produces highly accurate descriptions of part names and geometric details, ensuring reliable semantic alignment for training.
Ablation studies demonstrate that combining G1 and Laplacian regularizers yields the smoothest surfaces with minimal artifacts, while a moderate voxel-grid resolution offers the optimal balance between reconstruction quality and computational efficiency.
Fine-tuning the text-to-image model significantly improves prompt fidelity compared to using a pretrained model, and a subsequent topology recovery experiment proves that the generated parametric surfaces can be successfully converted into valid, production-ready NURBS CAD models.

소스 PDF

AI로 AI 구축

아이디어에서 출시까지 — 무료 AI 코코딩, 즉시 사용 가능한 환경, 최적의 GPU 가격으로 AI 개발을 가속화하세요.

AI 협업 코딩

바로 사용 가능한 GPU

최적의 가격

시작하기 가격 보기

HyperAI Newsletters

최신 정보 구독하기

한국 시간 매주 월요일 오전 9시 에 이번 주의 최신 업데이트를 메일로 발송합니다

이메일 서비스 제공: MailChimp

2달 전

Mohammad Sadil Khan Muhammad Usama Rolandos Alexandros Potamias Didier Stricker Muhammad Zeshan Afzal Jiankang Deng Ismail Elezi

초록

One-sentence Summary

Key Contributions

DreamCAD addresses the scalability bottleneck in multimodal CAD generation by representing shapes as $C^{0}$ -continuous Bézier patches with differentiable tessellation, enabling direct point-level supervision on large-scale 3D meshes without requiring explicit CAD annotations.
The authors introduce CADCap-1M, the largest CAD captioning dataset to date, which contains over 1 million GPT-5-generated descriptions to advance text-to-CAD research and overcome the limitations of small, existing annotated datasets.
Experiments on the ABC and Objaverse benchmarks demonstrate that DreamCAD achieves state-of-the-art performance across text, image, and point modalities, reducing Chamfer Distance by up to 70% and surpassing 75% in user preference evaluations.

Introduction

Dataset

Dataset Composition and Sources: The authors introduce CADCap-1M, a dataset containing over 1 million high-quality captions for CAD models sourced from ABC, Automate, CADParser, Fusion360, ModelNet, and 3D-Future. This resource fills a gap in text-to-CAD generation by providing shape-centric descriptions derived from diverse industrial and synthetic repositories.
Key Details for Each Subset:
- ABC and Automate: These subsets undergo rigorous filtering where 99% of trivial cuboids and simple cylindrical objects are removed based on topological analysis of faces, edges, and curvature. Degenerate models with unrealistic bounding boxes or insufficient geometric complexity (fewer than 5 faces or 10 vertices) are also discarded.
- Fusion360: Approximately 46% of samples in this subset contain extractable part names, which are leveraged to enhance caption specificity.
- General Statistics: The final dataset features captions with a mean length under 20 words and high linguistic diversity, including over 21k unigrams and 2.3M trigrams.
Data Usage and Processing:
- Caption Generation: The authors render four orthographic views per model using Blender and prompt GPT-5 to generate descriptions. Prompts are augmented with metadata such as model names, hole counts, and relative dimensions to reduce hallucinations and improve geometric accuracy.
- Training Preparation: For image-to-CAD training, the team renders 150 multi-view images per object using three complementary camera trajectories (azimuth sweep, elevation sweep, and uniform hemisphere sampling) to ensure full coverage.
- Visual Feature Extraction: Textures are synthesized by assigning random diffuse colors from a mid-tone to light palette for textureless meshes. Images are resized to 518x518 pixels for DINO processing.
Metadata and Filtering Strategies:
- Metadata Augmentation: The pipeline extracts part names from STEP files and computes geometric properties like hole counts and aspect ratios to guide the LLM. This approach allows the model to distinguish between visually similar parts, such as different washer types or bolt specifications.
- Quality Control: Low-quality models are filtered using OpenCascade to analyze surface types (planes, cylinders, B-splines) and edge characteristics. This ensures the training data excludes overly simple or physically unrealistic geometries.
- Color Palette: During fine-tuning of Stable-Diffusion 3.5, a curated palette of 30+ dark, low-saturation colors is applied to textureless models to improve visual consistency.