- Carnegie Mellon University
- Cisco Research
Abstract
Score distillation of 2D diffusion models has proven to be a powerful mechanism to guide 3D optimization, for example enabling text-based 3D generation or single-view reconstruction. A common limitation of existing score distillation formulations, however, is that the outputs of the (mode-seeking) optimization are limited in diversity despite the underlying diffusion model being capable of generating diverse samples. In this work, inspired by the sampling process in denoising diffusion, we propose a score formulation that guides the optimization to follow generation paths defined by random initial seeds, thus ensuring diversity. We then present an approximation to adapt this formulation for scenarios where the optimization may not precisely follow the generation paths (e.g. a 3D representation whose renderings evolve in a co-dependent manner). We showcase the applications of our "Diverse Score Distillation" (DSD) formulation across tasks such as 2D optimization, text-based 3D inference, and single-view reconstruction. We also empirically validate DSD against prior score distillation formulations and show that it significantly improves sample diversity while preserving fidelity.
Method
While the DDIM process follows (ODE-based) trajectories for the noisy image, it also induces trajectories in the one-step prediction and predicted noise spaces. By viewing score distillation as a process in the one-step prediction space and tracking a specified DDIM ODE, we can guide the distillation to follow that ODE. We can therefore ensure the diversity of the generated samples by maintaining the diversity of the underlying ODE trajectories, which are defined by the initial noise.
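Concretely, in standard DDIM notation (the deterministic, η = 0 sampler; this is the usual DDIM update, not a formula specific to this work), each step re-noises the one-step prediction, so the trajectory of the noisy image x_t induces trajectories of both the one-step prediction and the predicted noise:

```latex
x_{t-1} = \sqrt{\bar\alpha_{t-1}}\,\hat{x}_0(x_t)
        + \sqrt{1-\bar\alpha_{t-1}}\,\epsilon_\theta(x_t, t),
\qquad
\hat{x}_0(x_t) = \frac{x_t - \sqrt{1-\bar\alpha_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar\alpha_t}}
```

Since the update is deterministic, the initial noise alone selects the entire trajectory of $(x_t, \hat{x}_0, \epsilon_\theta)$.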
In 2D, our method can simulate ODE trajectories to generate diverse images. When the number of DDIM steps equals the number of optimization steps (DSD*), the generated images are identical to those produced by DDIM. When the step counts differ (DSD), the generated images remain similar to their DDIM counterparts while being more diverse than those of other methods.
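To make the trajectory view concrete, here is a minimal, self-contained 1-D sketch (not the paper's code): a hypothetical two-mode toy distribution stands in for a trained diffusion model, and deterministic DDIM sampling is run from different seeds. The names (`toy_x0`, `ddim_sample`) and the linear alpha-bar schedule are illustrative assumptions; the point is that the initial noise alone fixes the whole trajectory and thus which mode is reached.

```python
import numpy as np

MU = 2.0                                   # toy data: two modes, at +MU and -MU
ABARS = np.linspace(0.9999, 1e-4, 50)      # alpha-bar schedule; index = timestep

def toy_x0(x, ab):
    """Posterior mean E[x0 | x_t] of the two-mode toy distribution --
    a stand-in for a trained denoiser's one-step prediction."""
    m, s2 = np.sqrt(ab) * MU, 1.0 - ab
    return MU * np.tanh(m * x / s2)

def toy_eps(x, ab):
    """Noise prediction implied by the one-step prediction."""
    return (x - np.sqrt(ab) * toy_x0(x, ab)) / np.sqrt(1.0 - ab)

def ddim_step(x, ab_t, ab_prev):
    """Deterministic DDIM (eta = 0) update: re-noise the one-step
    prediction to the previous, less noisy timestep."""
    eps = toy_eps(x, ab_t)
    x0_hat = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1.0 - ab_prev) * eps

def ddim_sample(seed):
    """Follow the ODE trajectory fixed entirely by the initial noise."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal()              # initial seed defines the trajectory
    x = z
    for t in range(len(ABARS) - 1, 0, -1):
        x = ddim_step(x, ABARS[t], ABARS[t - 1])
    return z, x
```

Two runs with the same seed reach the identical endpoint, and which mode the ODE lands on is determined by the seed alone; this seed-indexed diversity is what DSD aims to preserve during distillation.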
In 3D, however, the shape's renderings may not precisely follow the 2D DDIM ODE. To address this, we propose an approximation (via interpolation) to adapt the formulation for 3D generation. Specifically, a unique ODE starting point, ε, is assigned to each 3D shape throughout the optimization process, and renderings from different views are assumed to lie on the view-conditioned ODE trajectories starting from ε. At each iteration, DSD simulates the corresponding ODE up to time t and obtains its noise prediction, ε(t). The rendered view is then linked to the ODE through an interpolation approximation, which is used to compute the gradient.
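As a sketch of this linkage (again a 1-D illustration with a hypothetical two-mode toy denoiser, not the paper's implementation): the rendering θ is noised with the ODE's own noise prediction ε(t) rather than a fresh random ε, and the resulting distillation residual vanishes exactly when θ sits on the ODE's one-step-prediction trajectory.

```python
import numpy as np

MU = 2.0                                    # toy data: two modes, at +MU and -MU
ABARS = np.linspace(0.9999, 1e-4, 50)       # alpha-bar schedule; index = timestep

def toy_x0(x, ab):
    # posterior mean E[x0 | x_t] of the toy distribution -- a stand-in
    # for a trained denoiser's one-step prediction
    m, s2 = np.sqrt(ab) * MU, 1.0 - ab
    return MU * np.tanh(m * x / s2)

def toy_eps(x, ab):
    # noise prediction implied by the one-step prediction
    return (x - np.sqrt(ab) * toy_x0(x, ab)) / np.sqrt(1.0 - ab)

def simulate_ode(z, t_idx):
    # run deterministic DDIM from the fixed seed z down to timestep t_idx;
    # return the ODE's one-step prediction and noise prediction there
    x = z
    for t in range(len(ABARS) - 1, t_idx, -1):
        x0, eps = toy_x0(x, ABARS[t]), toy_eps(x, ABARS[t])
        x = np.sqrt(ABARS[t - 1]) * x0 + np.sqrt(1.0 - ABARS[t - 1]) * eps
    return toy_x0(x, ABARS[t_idx]), toy_eps(x, ABARS[t_idx])

def dsd_residual(theta, z, t_idx):
    # interpolation approximation: noise the rendering theta with the ODE's
    # noise prediction eps(t) instead of a fresh random epsilon, then take
    # the distillation residual against that same anchor
    ab = ABARS[t_idx]
    _, eps_ode = simulate_ode(z, t_idx)
    x_t = np.sqrt(ab) * theta + np.sqrt(1.0 - ab) * eps_ode
    return toy_eps(x_t, ab) - eps_ode

# the residual vanishes when the rendering sits on the ODE's one-step-
# prediction trajectory, so each seed z pins down its own target
z = np.random.default_rng(0).standard_normal()
x0_ode, _ = simulate_ode(z, 45)             # 45 = a high-noise timestep
print(dsd_residual(x0_ode, z, 45))          # ~0: on-trajectory rendering
```

Because the anchor ε(t) comes from a seed-specific ODE, the zero of the residual — and hence the optimization target — differs per seed, which is what yields diverse 3D outputs.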
Comparison
We compare our method with other score distillation methods. DSD generates more diverse samples while achieving comparable or better quality.
Diverse Single View Reconstruction
Given an input image, the geometry of the underlying 3D shape is highly underdetermined. With a diffusion model conditioned on the image and camera, DSD (Ours) reconstructs high-frequency details of the object with diverse interpretations of the underlying geometry. The default distillation adopted by Zero-1-to-3 does not produce multi-modal reconstructions and yields limited detail.
Diverse Generation with MVDream
Using a camera-aware multiview diffusion model, such as MVDream, our method can generate diverse 3D shapes without the Janus effect.