*Equal contributions
1Peking University
2Pengcheng Laboratory
3National University of Singapore
Recent 3D large reconstruction models typically employ a two-stage process: first generating multi-view images with a multi-view diffusion model, and then using a feed-forward model to reconstruct the images into 3D content. However, multi-view diffusion models often produce low-quality and inconsistent images, adversely affecting the quality of the final 3D reconstruction. To address this issue, we propose a unified 3D generation framework called Cycle3D, which cyclically utilizes a 2D diffusion-based generation module and a feed-forward 3D reconstruction module during the multi-step diffusion process. During denoising, the 2D diffusion model can also control the generation of unseen views and inject reference-view information, thereby enhancing the diversity and texture consistency of 3D generation. Extensive experiments demonstrate that our method creates 3D content with higher quality and consistency than state-of-the-art baseline methods.
During the multi-step denoising process of Cycle3D, the input view remains clean; the pre-trained 2D generation model gradually produces multi-view images of higher quality, while the reconstruction model continuously corrects their 3D inconsistencies. The red boxes highlight inconsistencies between the multi-view images, which are then corrected by the reconstruction model.
We propose a unified image-to-3D diffusion framework that cyclically utilizes a pre-trained 2D diffusion model and a 3D reconstruction model. During denoising, the 2D diffusion model injects reference-view features, and the reconstruction model incorporates time embeddings to adapt to \(\hat{x}_0\) at different timesteps. Additionally, the interaction between the features of the reconstruction model's encoder and the 2D diffusion model's decoder enhances the robustness of reconstruction. During inference, we resample \(x_{t-1}\) from the multi-view images \(\hat{x}'_0\) rendered by the reconstruction model and the previous step's \(x_t\), while keeping the input view clean.
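To make the generation-reconstruction cycle concrete, below is a minimal sketch of the inference loop described above. All interfaces (diffusion_2d, recon_3d, scheduler) and their method names are hypothetical placeholders for illustration only, not the released API.

```python
import torch

@torch.no_grad()
def cycle3d_sample(diffusion_2d, recon_3d, scheduler, ref_image, num_views, timesteps):
    """Sketch of Cycle3D inference: alternate 2D denoising and 3D reconstruction."""
    # Start multi-view samples from Gaussian noise.
    x_t = torch.randn(num_views, *ref_image.shape)

    for t in timesteps:  # e.g. DDIM timesteps, from high noise to low noise
        # Keep the input (reference) view clean by replacing its sample with
        # the reference image noised to the current timestep.
        x_t[0] = scheduler.add_noise(ref_image, torch.randn_like(ref_image), t)

        # 1) 2D generation: predict clean multi-view images \hat{x}_0,
        #    injecting reference-view features for texture consistency.
        x0_hat = diffusion_2d.predict_x0(x_t, t, ref_image=ref_image)

        # 2) 3D reconstruction: fuse \hat{x}_0 into a 3D representation
        #    (the time embedding lets the model adapt to early, noisier inputs),
        #    then render the 3D-consistent views \hat{x}'_0.
        recon = recon_3d(x0_hat, t)
        x0_prime = recon_3d.render(recon, num_views)

        # 3) Resample x_{t-1} from the corrected views and the previous x_t.
        x_t = scheduler.resample(x0_prime, x_t, t)

    return recon, x0_prime
```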
Cycle3D achieves high-quality and consistent 3D generation from a single unposed image.
@misc{tang2024cycle3dhighqualityconsistentimageto3d,
title={Cycle3D: High-quality and Consistent Image-to-3D Generation via Generation-Reconstruction Cycle},
author={Zhenyu Tang and Junwu Zhang and Xinhua Cheng and Wangbo Yu and Chaoran Feng and Yatian Pang and Bin Lin and Li Yuan},
year={2024},
eprint={2407.19548},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2407.19548},
}