We propose applying standard meta-learning algorithms to learn the initial weight parameters for coordinate based neural representations based on the underlying class of signals being represented (e.g., images of faces or 3D models of chairs). Despite requiring only a minor change in implementation, using these learned initial weights enables faster convergence during optimization and can serve as a strong prior over the signal class being modeled, resulting in better generalization when only partial observations of a given signal are available.
We propose a learning framework that predicts a continuous neural scene representation conditioned on one or few input images. The oringal NeRF method involves optimizing the representation to every scene independently, requiring many calibrated views and significant compute time. We take a step towards resolving these shortcomings by introducing an architecture that conditions a NeRF on image inputs in a fully convolutional manner. This allows the network to be trained across multiple scenes to learn a scene prior, enabling it to perform novel view synthesis in a feed-forward manner from a sparse set of views (as few as one).
We recover relightable NeRF-like models using neural approximations of expensive visibility integrals, so we can simulate complex volumetric light transport during training.
NeurIPS (2020) SpotlightarXivProject WebsiteCodeVideo
We show that passing input points through a simple Fourier feature mapping enables a multilayer perceptron (MLP) to learn high-frequency functions in low-dimensional problem domains. These results shed light on recent advances in computer vision and graphics that achieve state-of-the-art results by using MLPs to represent complex 3D objects and scenes. Using tools from the neural tangent kernel (NTK) literature, we show that a standard MLP fails to learn high frequencies both in theory and in practice. To overcome this spectral bias, we use a Fourier feature mapping to transform the effective NTK into a stationary kernel with a tunable bandwidth. We suggest an approach for selecting problem-specific Fourier features that greatly improves the performance of MLPs for low-dimensional regression tasks relevant to the computer vision and graphics communities.
ECCV (2020) Oral - Best Paper Honorable MentionarXivProject WebsiteCodeVideo
We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)) and whoseoutput is the volume density and view-dependent emitted radiance athat spatial location. We synthesize views by querying 5D coordinatesalong camera rays and use classic volume rendering techniques to projectthe output colors and densities into an image. Because volume renderingis naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how toeffectively optimize neural radiance fields to render photorealistic novelvews of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and viewsynthesis.
CVPR (2020)arXivProject WebsiteCodeVideo
Printed and digitally displayed photos have the ability to hide imperceptible digital data that can be accessed through internet-connected imaging systems. Another way to think about this is physical photographs that have unique QR codes invisibly embedded within them. This paper presents an architecture, algorithms, and a prototype implementation addressing this vision. Our key technical contribution is StegaStamp, a learned steganographic algorithm to enable robust encoding and decoding of arbitrary hyperlink bitstrings into photos in a manner that approaches perceptual invisibility. StegaStamp comprises a deep neural network that learns an encoding/decoding algorithm robust to image perturbations approximating the space of distortions resulting from real printing and photography. We demonstrates real-time decoding of hyperlinks in photos from in-the-wild videos that contain variation in lighting, shadows, perspective, occlusion and viewing distance. Our prototype system robustly retrieves 56 bit hyperlinks after error correction -- sufficient to embed a unique code within every photo on the internet.
CVPR (2020)arXivProject WebsiteVideo
We present a deep learning solution for estimating the incident illumination at any 3D location within a scene from an input narrow-baseline stereo image pair. Previous approaches for predicting global illumination from images either predict just a single illumination for the entire scene, or separately estimate the illumination at each 3D location without enforcing that the predictions are consistent with the same 3D scene. Instead, we propose a deep learning model that estimates a 3D volumetric RGBA model of a scene, including content outside the observed field of view, and then uses standard volume rendering to estimate the incident illumination at any 3D location within that volume. Our model is trained without any ground truth 3D data and only requires a held-out perspective view near the input stereo pair and a spherical panorama taken within each scene as supervision, as opposed to prior methods for spatially-varying lighting estimation, which require ground truth scene geometry for training. We demonstrate that our method can predict consistent spatially-varying lighting that is convincing enough to plausibly relight and insert highly specular virtual objects into real images.
CHI (2020)arXivProject WebsiteCode
Eye movements provide insight into what parts of an image a viewer finds most salient, interesting, or relevant to the task at hand. Unfortunately, eye tracking data, a commonly-used proxy for attention, is cumbersome to collect. Here we explore an alternative: a comprehensive web-based toolbox for crowdsourcing visual attention. We draw from four main classes of attention-capturing methodologies in the literature. ZoomMaps is a novel zoom-based interface that captures viewing on a mobile phone. CodeCharts is a self-reporting methodology that records points of interest at precise viewing durations. ImportAnnots is an annotation tool for selecting important image regions, and cursor-based BubbleView lets viewers click to deblur a small area. We compare these methodologies using a common analysis framework in order to develop appropriate use cases for each interface. This toolbox and our analyses provide a blueprint for how to gather attention data at scale without an eye tracker.
ICCP (2018)Project WebsiteLocal CopyVideoMIT News
Imaging through fog has important applications in industries such as self-driving cars, augmented driving, air- planes, helicopters, drones and trains. Current solutions are based on radar that suffers from poor resolution (due to the long wavelength), or on time gating that suffers from low signal-to-noise ratio. Here we demonstrate a technique that recovers reflectance and depth of a scene obstructed by dense, dynamic, and heterogeneous fog. For practical use cases in self-driving cars, the imaging system is designed in optical reflection mode with minimal footprint and is based on LIDAR hardware. Specifically, we use a single photon avalanche diode (SPAD) camera that time-tags individual detected photons. A probabilistic computational framework is developed to estimate the fog properties from the measurement itself, and distinguish between background pho- tons reflected from the fog and signal photons reflected from the target.
Vehicles, search and rescue personnel, and endoscopes use flash lights to locate, identify, and view objects in their surroundings. Here we show the first steps of how all these tasks can be done around corners with consumer cameras. We introduce a method that couples traditional geometric understanding and data-driven techniques. To avoid the limitation of large dataset gathering, we train the data-driven models on rendered samples to computationally recover the hidden scene on real data. The method has three independent operating modes: 1) a regression output to localize a hidden object in 2D, 2) an identification output to identify the object type or pose, and 3) a generative network to reconstruct the hidden scene from a new viewpoint.
Nature Photonics Vol. 13 (2018)Project WebsiteNature ArticleVideoMIT News
We demonstrate that the imaging optics of an ultrafast camera (or a depth camera) can be dramatically different from the imaging optics of a conventional photography camera. More specifically, we demonstrate that by folding the optical path in time, one can collapse the conventional photography optics into a compact volume or multiplex various functionalities into a single imaging optics piece without losing spatial or temporal resolution. By using time-folding at different regions of the optical path, we achieve an order of magnitude lens tube compression, ultrafast multi-zoom imaging, and ultrafast multi-spectral imaging. Each demonstration was done with a single image acquisition without moving optical components.
Widely used in news, business, and educational media, infographics are handcrafted to effectively communicate messages about complex and often abstract topics including ‘ways to conserve the environment’ and ‘understanding the financial crisis’. Composed of stylistically and semantically diverse visual and textual elements, infographics pose new challenges for computer vision. While automatic text extraction works well on infographics, computer vision approaches trained on natural images fail to identify the stand-alone visual elements in infographics, or ‘icons’. To bridge this representation gap, we propose a synthetic data generation strategy: we augment background patches in infographics from our Visually29K dataset with Internet-scraped icons which we use as training data for an icon proposal mechanism. Combining our icon proposals with icon classification and text extraction, we present a multi-modal summarization application. Our application takes an infographic as input and automatically produces text tags and visual hashtags that are textually and visually representative of the infographic’s topics respectively.
IEEE Transactions on Computational Imaging (2017)Project WebsiteLocal CopyIEEEMIT News
Traditional cameras require a lens and a mega-pixel sensor to capture images. The lens focuses light from the scene onto the sensor. We demonstrate a new imaging method that is lensless and requires only a single pixel for imaging. Compared to previous single pixel cameras our system allows significantly faster and more efficient acquisition. This is achieved by using ultrafast time-resolved measurement with compressive sensing. The time-resolved sensing adds information to the measurement, thus fewer measurements are needed and the acquisition is faster. Lensless and single pixel imaging computationally resolves major constraints in imaging systems design. Notable applications include imaging in challenging parts of the spectrum (like infrared and THz), and in challenging environments where using a lens is problematic.
Optics Express Vol. 25, 17466-17479 (2017)Project WebsiteLocal CopyOSA
A deep learning method for object classification through scattering media. Traditional techniques to see through scattering media rely on a physical model that has to be precisely calibrated. Computationally overcoming the scattering relies heavily on such physical models, and on the calibration accuracy. Thus, such systems are extremely sensitive to an accurate and lengthy calibration process. Our method trains on synthetic data with variations in calibration parameters that allows the network to learn an invariant model to calibration of lab experiments.