Repaint123: Fast and High-quality One Image to 3D Generation
with Progressive Controllable 2D Repainting

arXiv 2023


Junwu Zhang1,*, Zhenyu Tang1,*, Yatian Pang3, Xinhua Cheng1, Peng Jin1, Yida Wei4, Munan Ning1,2, Li Yuan1,2

*Equal contributions   
1Peking University    2Pengcheng Laboratory    3National University of Singapore    4Wuhan University

Abstract


Repaint123 crafts 3D content from a single image, matching 2D generation quality in just 2 minutes.

Recent one-image-to-3D generation methods commonly adopt Score Distillation Sampling (SDS). Despite the impressive results, these methods suffer from several deficiencies, including multi-view inconsistency, over-saturated and over-smoothed textures, and slow generation. To address these issues, we present Repaint123, which alleviates multi-view bias and texture degradation while speeding up the generation process. The core idea is to combine the powerful image generation capability of the 2D diffusion model with the texture alignment ability of the repainting strategy to generate high-quality, multi-view consistent images. We further propose visibility-aware adaptive repainting strength for overlap regions to enhance image quality during the repainting process. The resulting high-quality, multi-view consistent images enable the use of a simple Mean Squared Error (MSE) loss for fast 3D content generation. Extensive experiments show that our method generates high-quality 3D content with multi-view consistency and fine textures in 2 minutes from scratch.


Image-to-3D generation pipeline


In the coarse stage, we adopt a Gaussian Splatting representation optimized by SDS loss at novel views. In the fine stage, we export a mesh representation and sample novel views bidirectionally and progressively for controllable progressive repainting. The refined novel-view images are then compared with the rendered novel-view images via an MSE loss for efficient generation. Cameras in red are the bidirectional neighbor cameras used for obtaining the visibility map.
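As a concrete illustration, the bidirectional progressive view schedule can be sketched as alternating left/right azimuth steps, starting from the reference view and meeting at the back view. The step size and the exact schedule below are assumptions for illustration, not the paper's precise settings:

```python
def bidirectional_azimuths(step_deg=40.0):
    """Progressively sample novel-view azimuth angles, alternating
    between the two rotation directions (+/-), starting at the
    reference view (0 deg) and ending at the back view (180 deg)."""
    views = [0.0]
    angle = step_deg
    while angle < 180.0:
        views.extend([angle, -angle])  # one step clockwise, one counter-clockwise
        angle += step_deg
    views.append(180.0)  # the back view is shared by both directions
    return views
```

With a 40° step this yields 0°, ±40°, ±80°, ±120°, ±160°, 180°, so every new view overlaps substantially with an already-repainted neighbor, which is what makes the visibility-map-based repainting well posed.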


Controllable repainting scheme


Our scheme employs DDIM inversion to generate deterministic noisy latents from coarse images, which are then refined via a diffusion model controlled by depth-guided geometry, reference-image semantics, and attention-driven reference texture. We binarize the visibility map into an overlap mask with a timestep-aware binarization operation; overlap regions are selectively repainted at each denoising step, yielding high-quality novel-view images.
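A minimal sketch of the masked repainting step, in the spirit of RePaint-style selective denoising. The linear threshold schedule, the parameter names, and the NumPy arrays standing in for latents are all illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def binarize_visibility(visibility, t, t_max=1000):
    """Timestep-aware binarization of a visibility map into an overlap
    mask (1 = overlap).  Illustrative schedule: the threshold tracks
    t / t_max, so early (noisy) steps mark little as overlap and repaint
    broadly, while later steps preserve more of the existing texture."""
    thresh = t / t_max  # assumed linear schedule
    return (np.asarray(visibility) >= thresh).astype(np.float32)

def repaint_step(x_denoised, x_known_t, overlap_mask):
    """One selective repainting step: inside overlap regions keep the
    already-painted content (noised to the current level), elsewhere
    take the freshly denoised content."""
    return overlap_mask * x_known_t + (1.0 - overlap_mask) * x_denoised
```

Each denoising step would then blend the diffusion model's prediction with the noised coarse latent under this mask, so repainting strength adapts to both visibility and timestep.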



Comparison


Repaint123 generates high-quality and view-consistent 3D objects from a single unposed image.

Input vs. Repaint123 (GS) vs. Repaint123 (NeRF), shown for: cherries, donuts, hamburger, horse, stone dragon statue.

More results


Additional novel-view results for: cherries, donuts, hamburger, horse, stone dragon statue.

Citation


@misc{zhang2023repaint123,
    title={Repaint123: Fast and High-quality One Image to 3D Generation with Progressive Controllable 2D Repainting},
    author={Junwu Zhang and Zhenyu Tang and Yatian Pang and Xinhua Cheng and Peng Jin and Yida Wei and Wangbo Yu and Munan Ning and Li Yuan},
    year={2023},
    eprint={2312.13271},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}