
Abstract

Recent work on image content manipulation based on vision-language pre-training models has been effectively extended to text-driven 3D scene editing. However, existing schemes for 3D scene editing still exhibit certain shortcomings, hindering their further interactive design. Such schemes typically adhere to fixed input patterns, limiting users' flexibility in text input. Moreover, their editing capabilities are constrained by a single or a few 2D visual models and require intricate pipeline design to integrate these models into 3D reconstruction processes. To address the aforementioned issues, we propose a dialogue-based 3D scene editing approach, termed CE3D, which is centered around a large language model that allows for arbitrary textual input from users and interprets their intentions, subsequently facilitating the autonomous invocation of the corresponding visual expert models. Furthermore, we design a scheme utilizing Hash-Atlas to represent 3D scene views, which transfers the editing of 3D scenes onto 2D atlas images. This design achieves complete decoupling between the 2D editing and 3D reconstruction processes, enabling CE3D to flexibly integrate a wide range of existing 2D or 3D visual models without necessitating intricate fusion designs. Experimental results demonstrate that CE3D effectively integrates multiple visual models to achieve diverse editing visual effects, possessing strong scene comprehension and multi-round dialog capabilities. The code is available at https://sk-fun.fun/CE3D.
