Learned Initializations for Optimizing Coordinate-Based Neural Representations

CVPR 2021

Matthew Tancik*1, Ben Mildenhall*1, Terrance Wang1, Divi Schmidt1, Pratul P. Srinivasan2, Jonathan T. Barron2, Ren Ng1

1 UC Berkeley

2 Google Research

* Denotes Equal Contribution

Overview

Coordinate-based neural representations have shown significant promise as an alternative to discrete, array-based representations for complex low-dimensional signals. However, optimizing a coordinate-based network from randomly initialized weights for each new signal is inefficient. We propose applying standard meta-learning algorithms to learn the initial weight parameters for these fully-connected networks based on the underlying class of signals being represented (e.g., images of faces or 3D models of chairs). Despite requiring only a minor change in implementation, using these learned initial weights enables faster convergence during optimization and can serve as a strong prior over the signal class being modeled, resulting in better generalization when only partial observations of a given signal are available.
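As a rough illustration of how a learned initialization can be obtained, the sketch below applies Reptile, one standard meta-learning algorithm, to a toy 1D problem. Everything specific here is an assumption made for illustration and is not taken from the paper's code: the signal class (sinusoids with random amplitude and phase), the one-hidden-layer tanh MLP, the hand-written gradient, the hyperparameters, and all function names.

```python
import numpy as np

def init_params(rng, hidden=32):
    """Random initial weights for a 1-hidden-layer MLP mapping x -> y."""
    return {
        "W1": rng.normal(0.0, 1.0, (1, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(hidden), (hidden, 1)),
        "b2": np.zeros(1),
    }

def loss_and_grad(params, x, y):
    """Mean squared error on (x, y) and its gradient w.r.t. every parameter."""
    h = np.tanh(x @ params["W1"] + params["b1"])   # (N, hidden)
    pred = h @ params["W2"] + params["b2"]         # (N, 1)
    err = pred - y
    loss = float(np.mean(err ** 2))
    d_pred = 2.0 * err / err.size                  # dL/dpred
    d_pre = (d_pred @ params["W2"].T) * (1.0 - h ** 2)
    grads = {
        "W1": x.T @ d_pre,
        "b1": d_pre.sum(axis=0),
        "W2": h.T @ d_pred,
        "b2": d_pred.sum(axis=0),
    }
    return loss, grads

def inner_loop(params, x, y, steps, lr):
    """Optimize a copy of the weights on one signal with plain gradient descent."""
    params = {k: v.copy() for k, v in params.items()}
    for _ in range(steps):
        _, grads = loss_and_grad(params, x, y)
        for k in params:
            params[k] -= lr * grads[k]
    return params

def reptile_step(meta, adapted, beta):
    """Reptile outer update: move the initialization toward the adapted weights."""
    return {k: meta[k] + beta * (adapted[k] - meta[k]) for k in meta}

def sample_signal(rng, x):
    """Hypothetical signal class: sinusoids with random amplitude and phase."""
    amp = rng.uniform(0.5, 1.5)
    phase = rng.uniform(0.0, np.pi)
    return amp * np.sin(3.0 * x + phase)

def meta_learn(rng, x, outer_steps=200, inner_steps=5, inner_lr=0.01, beta=0.1):
    """Learn initial weights by repeatedly adapting to signals drawn from the class."""
    meta = init_params(rng)
    for _ in range(outer_steps):
        y = sample_signal(rng, x)
        adapted = inner_loop(meta, x, y, inner_steps, inner_lr)
        meta = reptile_step(meta, adapted, beta)
    return meta
```

Under these assumptions, fitting a new signal amounts to calling `inner_loop(meta, x, y_new, steps, lr)` with the weights returned by `meta_learn` in place of `init_params(rng)`, which is the "minor change in implementation" the overview refers to.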

For example, a coordinate-based MLP that represents an image takes a pixel coordinate (x,y) as input and outputs the (R,G,B) color at that location. The network weights θ are typically optimized via gradient descent to produce the desired image. However, finding good parameters can be computationally expensive, and the full optimization process must be repeated for each new target. Meta-learning can find initial network weights that allow for faster convergence and better generalization.
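The mapping from pixel coordinates to colors can be sketched in a few lines of NumPy. This is only an illustration of the idea: the layer widths, the ReLU and sigmoid activations, the [0, 1] coordinate scaling, and the function names are assumptions chosen for brevity, not the architecture used in the paper. The gradient descent step on θ is omitted; in practice an automatic differentiation library would supply it.

```python
import numpy as np

def pixel_coords(height, width):
    """Return an (H*W, 2) array of (x, y) pixel coordinates scaled to [0, 1]."""
    ys, xs = np.meshgrid(
        np.linspace(0.0, 1.0, height), np.linspace(0.0, 1.0, width), indexing="ij"
    )
    return np.stack([xs.ravel(), ys.ravel()], axis=-1)

def init_theta(rng, widths=(2, 64, 64, 3)):
    """Randomly initialized weights theta: one (W, b) pair per layer."""
    return [
        (rng.normal(0.0, np.sqrt(2.0 / fan_in), (fan_in, fan_out)), np.zeros(fan_out))
        for fan_in, fan_out in zip(widths[:-1], widths[1:])
    ]

def mlp(theta, coords):
    """Map (N, 2) coordinates to (N, 3) colors in the range [0, 1]."""
    h = coords
    for W, b in theta[:-1]:
        h = np.maximum(h @ W + b, 0.0)             # ReLU hidden layers
    W, b = theta[-1]
    return 1.0 / (1.0 + np.exp(-(h @ W + b)))      # sigmoid keeps colors bounded

def render_image(theta, height, width):
    """Query the network at every pixel to produce an (H, W, 3) image."""
    return mlp(theta, pixel_coords(height, width)).reshape(height, width, 3)

def mse(theta, target):
    """Reconstruction loss that gradient descent on theta would minimize."""
    height, width, _ = target.shape
    return float(np.mean((render_image(theta, height, width) - target) ** 2))
```

Because the image is stored in θ rather than in a pixel array, `render_image` can be called at any resolution; the cost noted above comes from having to minimize `mse` from scratch for every new target.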

Single View ShapeNet Reconstructions

The goal of view synthesis is to generate a novel view of a scene from a set of reference images. Neural radiance fields (NeRF) accomplish this task by using a neural representation that predicts a color and density for any input 3D location and 2D viewing direction within the scene, along with a differentiable volumetric rendering model to generate new views from that representation. This network is optimized to minimize the residual of re-rendering each of the input reference images from their respective camera poses. In our view synthesis experiments, we use a simplified NeRF model (simple-NeRF) that maintains the same image supervision and volume rendering context. Unlike the original NeRF model, we do not feed in the viewing direction, and we use a single model instead of the two “coarse” and “fine” models used by NeRF.
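The volumetric rendering step can be illustrated with the usual alpha-compositing quadrature along a single camera ray. The sketch below assumes the densities and colors have already been predicted by the network at each sample, and it describes the samples by their interval edges `t_edges`; that convention and the function name are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def composite_ray(sigmas, colors, t_edges):
    """Alpha-composite samples along one ray.

    sigmas:  (S,)   non-negative densities predicted at each sample
    colors:  (S, 3) RGB colors predicted at each sample
    t_edges: (S+1,) increasing distances bounding each sample's interval
    """
    deltas = np.diff(t_edges)                       # length of each interval
    alphas = 1.0 - np.exp(-sigmas * deltas)         # opacity of each interval
    # Transmittance: fraction of light that reaches sample i unoccluded.
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alphas)[:-1]])
    weights = trans * alphas
    rgb = (weights[:, None] * colors).sum(axis=0)
    return rgb, weights
```

Every operation here is differentiable with respect to `sigmas` and `colors`, which is what allows the re-rendering residual against the reference images to be minimized by gradient descent on the network weights.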

The simple-NeRF formulation relies on multi-view consistency for supervision and therefore fails if naively applied to the task of single view reconstruction. However, if the model is trained starting from meta-learned initial weights, it is able to recover 3D geometry. The MV Meta (multi-view) initialization has access to multiple views per object during meta-learning, whereas the SV Meta (single-view) initialization has access to only a single view per object during meta-learning. All methods receive only a single input view during test-time optimization.

Phototourism Exploration

The Phototourism dataset consists of thousands of posed tourist photographs of famous landmarks. Our objective is to use these images to create an underlying representation that can be explored and rendered from novel viewpoints with varying lighting conditions. The meta-training dataset for each landmark consists of thousands of images with varying resolution and intrinsic/extrinsic camera parameters. At test time, we optimize the simple-NeRF to reproduce the appearance of a new image, and then render that simple-NeRF from other viewpoints.
