SparseGS

Real-Time 360° Sparse View Synthesis using Gaussian Splatting
University of California, Los Angeles

Abstract

The problem of novel view synthesis has grown significantly in popularity recently with the introduction of Neural Radiance Fields (NeRFs) and other implicit scene representation methods. A recent advance, 3D Gaussian Splatting (3DGS), leverages an explicit representation to achieve real-time rendering with high-quality results. However, 3DGS still requires an abundance of training views to generate a coherent scene representation. In few-shot settings, similar to NeRF, 3DGS tends to overfit to the training views, causing background collapse and excessive floaters, especially as the number of training views is reduced. We propose a method that enables training coherent 3DGS-based radiance fields of 360° scenes from sparse training views. We find that naive depth priors alone are not sufficient, and integrate depth priors with generative and explicit constraints to reduce background collapse, remove floaters, and enhance consistency from unseen viewpoints. Experiments show that our method outperforms base 3DGS by up to 30.5% and NeRF-based methods by up to 15.6% in LPIPS on the MipNeRF-360 dataset, at substantially lower training and inference cost.

From left to right: Ground Truth, SparseNeRF, MipNeRF360, 3DGS, SparseGS (Ours)

Background

Novel view synthesis has grown significantly in popularity recently thanks to the introduction of implicit 3D scene representations such as NeRF. These techniques enable many downstream applications in fields ranging from robotics to entertainment. However, most of them are limited by the number of training views required: many need up to 200 views, which can be prohibitively high for real-world usage. Prior work has tackled this problem with sparse-view techniques, but these almost entirely focus on forward-facing scenes and suffer from the long training and inference times of NeRFs.

Approach

We introduce a technique for real-time 360° sparse view synthesis that leverages 3D Gaussian Splatting. The explicit nature of our scene representation allows us to reduce sparse-view artifacts with techniques that operate directly on the scene representation in an adaptive manner. Combined with depth-based constraints, we are able to render high-quality novel views and depth maps for unbounded scenes.

Pipeline

Our proposed pipeline integrates depth and diffusion constraints, along with a floater pruning technique, to enhance the performance of few-shot novel view synthesis. During training, we render the alpha-blended depth, denoted d_alpha, and employ a Pearson correlation loss to align it with the monocularly estimated depth d_pt. Furthermore, we impose a score distillation sampling (SDS) loss on novel viewpoints to encourage naturally-appearing images. At predetermined intervals, we execute floater pruning as described in Section 3 of our paper. In the pipeline illustration, the new components we introduce are highlighted in color, while the foundational 3D Gaussian Splatting pipeline is depicted in grey.
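To make the depth constraint concrete, below is a minimal PyTorch sketch of a Pearson-correlation depth loss. This is an illustration rather than the exact implementation from our codebase; the function name, tensor shapes, and epsilon value are assumptions.

import torch

def pearson_depth_loss(rendered_depth: torch.Tensor,
                       mono_depth: torch.Tensor) -> torch.Tensor:
    """Penalize low Pearson correlation between two (H, W) depth maps.

    Correlation is invariant to the unknown scale and shift of monocular
    depth estimates, so no explicit alignment step is required.
    """
    x = rendered_depth.reshape(-1)
    y = mono_depth.reshape(-1)
    x = x - x.mean()
    y = y - y.mean()
    corr = (x * y).sum() / (x.norm() * y.norm() + 1e-8)  # Pearson coefficient
    return 1.0 - corr  # 0 when the depth maps are perfectly correlated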

Key Ideas

  1. Use off-the-shelf depth estimation models to regularize novel view outputs
  2. Apply a softmax function to Gaussian depth values for better depth gradient control (see the depth sketch after this list)
  3. Leverage the explicit Gaussian representation to directly remove "floaters"
  4. Reconstruct regions with low coverage in the training views with diffusion-model guidance
  5. Use depth warping to create more training views (see the warping sketch after this list)
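The following is a minimal per-pixel sketch of the softmax depth idea from item 2, assuming the alpha-blending weights of the Gaussians covering a single pixel are already available. The function name, the temperature value beta, and the exact weighting are illustrative assumptions; in practice this computation lives inside the rasterizer.

import torch

def softmax_depth(depths: torch.Tensor,
                  blend_weights: torch.Tensor,
                  beta: float = 20.0) -> torch.Tensor:
    """Sharpened per-pixel depth from per-Gaussian contributions.

    depths:        (N,) depths of the Gaussians overlapping one pixel.
    blend_weights: (N,) alpha-blending weights T_i * alpha_i for that pixel.
    A plain alpha-blended depth would be (blend_weights * depths).sum().
    Passing the weights through a temperature-scaled softmax instead
    concentrates the estimate, and therefore its gradient, on the
    dominant Gaussian along the ray.
    """
    w = torch.softmax(beta * blend_weights, dim=0)
    return (w * depths).sum()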
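And a sketch of the depth-warping idea from item 5: given a training image, its depth, and camera parameters, we can forward-warp the pixels into a nearby novel viewpoint to obtain an extra pseudo training view. Shared intrinsics, nearest-pixel splatting, and the absence of occlusion handling are simplifying assumptions; function and variable names are illustrative.

import torch

def warp_to_novel_view(image: torch.Tensor, depth: torch.Tensor,
                       K: torch.Tensor, T_src2nov: torch.Tensor):
    """Forward-warp a (3, H, W) source image into a nearby novel view.

    depth:     (H, W) per-pixel depth in the source camera frame.
    K:         (3, 3) camera intrinsics shared by both views.
    T_src2nov: (4, 4) rigid transform from the source to the novel camera.
    Returns the warped image and a validity mask (holes are left black).
    """
    _, H, W = image.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(3, -1).float()

    # Back-project pixels to 3D points in the source camera frame.
    pts = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)

    # Transform into the novel camera frame and project back to pixels.
    pts_h = torch.cat([pts, torch.ones(1, pts.shape[1])], dim=0)
    pts_nov = (T_src2nov @ pts_h)[:3]
    proj = K @ pts_nov
    u_n = (proj[0] / proj[2]).round().long()
    v_n = (proj[1] / proj[2]).round().long()

    valid = (u_n >= 0) & (u_n < W) & (v_n >= 0) & (v_n < H) & (pts_nov[2] > 0)
    warped = torch.zeros_like(image)
    warped[:, v_n[valid], u_n[valid]] = image.reshape(3, -1)[:, valid]
    return warped, valid.reshape(H, W)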

Table 1: Ablation study on pipeline components

Model                              PSNR ↑   SSIM ↑   LPIPS ↓
Base 3DGS                          15.38    0.442    0.506
Base + Alpha-blended Depth Loss    15.67    0.456    0.500
Base + Softmax Depth Loss          16.52    0.587    0.438
  ↑ + SDS Loss                     16.79    0.585    0.452
  ↑ + Depth Warping                16.93    0.598    0.438
  ↑ + Floater Pruning              17.18    0.602    0.437

Each "↑" row adds the listed component on top of the configuration in the row above it.

Floater Removal Example

Table 2: Sparse-view baseline comparisons on the MipNeRF360 dataset

Model              PSNR ↑    SSIM ↑   LPIPS ↓   Runtime* (h)   Render FPS
SparseNeRF         11.5638   0.3206   0.6984    4              1/120
RegNeRF            11.7379   0.2266   0.6892    4              1/120
Mip-NeRF 360       17.1044   0.4660   0.5750    3              1/120
Base 3DGS          15.3840   0.4415   0.5061    0.5            30
ViP-NeRF           11.1622   0.2291   0.7132    4              1/120
SparseGS (Ours)    17.2558   0.5067   0.4738    0.75           30

We use 12 images for each scene. * Runtimes are measured on a single RTX 3090.

Table 3: Sparse-view baseline comparisons on the DTU dataset

Model              PSNR ↑   SSIM ↑   LPIPS ↓
SparseNeRF         19.55    0.769    0.201
RegNeRF            18.89    0.745    0.190
DSNeRF             16.90    0.570    0.450
Base 3DGS          14.18    0.6284   0.301
SparseGS (Ours)    18.89    0.702    0.229

While our method does not surpass some prior methods specifically designed for forward-facing scenes on the DTU dataset, it significantly improves over the original 3DGS, achieving high visual fidelity and competitive metrics.

More Pictures

From left to right: Base 3DGS, MipNeRF360, SparseGS (Ours)

More Videos (Updating)

Citation

@article{xiong2023sparsegs,
  author    = {Xiong, Haolin and Muttukuru, Sairisheek and Upadhyay, Rishi and Chari, Pradyumna and Kadambi, Achuta},
  title     = {SparseGS: Real-Time 360° Sparse View Synthesis using Gaussian Splatting},
  journal   = {arXiv preprint},
  year      = {2023},
}