This section gives an overview and a brief history of the technologies we currently use to create RealityMaps.
After a Miner uploads a 20-40 second video in a RealityBlock, it is processed through a set of authenticity checks. Once the checks are complete, the video is sliced into single-image frames. Our systems then estimate the path the Miner took by processing the images through Structure from Motion (SfM) and incorporating the data obtained from the gyroscope and accelerometer. The result, a set of oriented images in 3D space, is projected in each direction and fed through a NeRF algorithm to produce a model of the space.
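The frame-slicing step can be sketched as follows. This is only an illustration: the function name `sample_frame_indices`, the idea of sampling at a fixed target rate, and the example rates are assumptions, not a description of the production pipeline.

```python
from typing import List


def sample_frame_indices(duration_s: float, native_fps: float, target_fps: float) -> List[int]:
    """Pick evenly spaced frame indices from a video (hypothetical helper).

    duration_s: video length in seconds (e.g. 20-40 s for a RealityBlock upload)
    native_fps: frame rate the video was recorded at
    target_fps: how many frames per second we want to keep (assumed parameter)
    """
    if duration_s <= 0 or native_fps <= 0 or target_fps <= 0:
        raise ValueError("duration and frame rates must be positive")
    # We cannot keep more frames than the video contains.
    target_fps = min(target_fps, native_fps)
    total_frames = int(duration_s * native_fps)
    n_samples = int(duration_s * target_fps)
    step = native_fps / target_fps
    # Round to the nearest real frame and clamp to the last valid index.
    return [min(int(round(k * step)), total_frames - 1) for k in range(n_samples)]


# Example: keep 2 frames per second from a 30 s clip recorded at 30 fps.
indices = sample_frame_indices(30.0, 30.0, 2.0)
```

Each selected index would then be decoded into a still image and passed on to pose estimation.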
Currently, each RealityBlock contains its own model, and we also have a meta-model for a set of hexagons that fuses multiple nearby Collections into a single render. In aggregate, these models allow us to create a higher-quality visualization of the scene. The 3D metaverse map previews are renderings from this model. The video preview in the app is a recording of a camera moving through the model environment along a path we selected. We’ve created a few specific paths to make the previews consistent across RealityBlocks.
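To make the idea of a predefined preview path concrete, here is a minimal sketch of one possible path: a circular orbit around the scene center with the camera always looking inward. The orbit shape, the function name `orbit_path`, and its parameters are assumptions for illustration; the actual preview paths may look quite different.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def orbit_path(center: Vec3, radius: float, height: float, n_frames: int) -> List[Tuple[Vec3, Vec3]]:
    """Return (camera_position, unit_view_direction) pairs on a circle around `center`.

    The camera sits `height` above the center and always looks at it.
    """
    if radius <= 0 or n_frames <= 0:
        raise ValueError("radius and n_frames must be positive")
    poses = []
    for i in range(n_frames):
        angle = 2.0 * math.pi * i / n_frames
        position = (
            center[0] + radius * math.cos(angle),
            center[1] + radius * math.sin(angle),
            center[2] + height,
        )
        # The view direction points from the camera toward the scene center.
        to_center = tuple(c - p for c, p in zip(center, position))
        norm = math.sqrt(sum(x * x for x in to_center))
        direction = tuple(x / norm for x in to_center)
        poses.append((position, direction))
    return poses


# Example: an 8-frame orbit of radius 2 m, 1 m above the scene center.
preview_poses = orbit_path((0.0, 0.0, 0.0), 2.0, 1.0, 8)
```

Rendering the model once per pose and stitching the frames together would give a preview video; reusing the same pose list for every RealityBlock is what keeps the previews consistent.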
RealityMaps is a 3D model of the real world built with computer vision. Developing it involves several foundational technologies:
- deep learning: a class of machine learning methods that use artificial neural networks to automatically discover, from raw data, the heuristics needed for feature detection or classification
- point clouds (3D mesh): a point cloud is a set of points in 3D space. When the points are connected to form polygons, the result is a mesh.
- reconstruction: takes a 2D image and estimates the 3D depth of the objects in it. The opposite is photography, which captures the light from a 3D scene and converts it to a 2D image.
- novel view synthesis: generates images from new vantage points given a set of existing viewpoints
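The relationship between photography and reconstruction in the list above can be illustrated with an idealized pinhole camera. This is a simplified sketch: the pinhole model, the function names, and the focal length and image-center values are assumptions chosen for the example, and real reconstruction has to estimate the depth rather than being handed it.

```python
from typing import Tuple


def project(point: Tuple[float, float, float], focal: float, cx: float, cy: float) -> Tuple[float, float]:
    """Photography direction: map a 3D point in camera coordinates to a 2D pixel."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    # Dividing by z is what discards depth: all points on one ray share a pixel.
    return (focal * x / z + cx, focal * y / z + cy)


def back_project(pixel: Tuple[float, float], depth: float, focal: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Reconstruction direction: recover the 3D point once the depth is estimated."""
    u, v = pixel
    return ((u - cx) * depth / focal, (v - cy) * depth / focal, depth)


# Assumed intrinsics: 500 px focal length, image center at (320, 240).
point_3d = (0.5, -0.25, 4.0)
pixel = project(point_3d, 500.0, 320.0, 240.0)            # (382.5, 208.75)
recovered = back_project(pixel, 4.0, 500.0, 320.0, 240.0)  # back to point_3d
```

The pixel alone is not enough to recover the point; the depth value is the missing piece that reconstruction methods have to estimate.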
As digital cameras have become mainstream, researchers have built sophisticated techniques to process and analyze the resulting treasure trove of data. Up until the mid-2010s, these images were processed using “classical” computer vision: algorithms that require hand-crafted parameters to identify target information. These techniques took significant research time to iterate on, which is why mapping as a whole took many years to digitize well and then scale beyond 2D road lines. However, as the research matured, the resulting techniques proved to be very reliable.
For instance, Structure from Motion (SfM) is a class of classical computer vision algorithms that aim to reconstruct the geometry of an environment from multiple images of a space. SfM tools like COLMAP are well-researched, and car-top and dashcam street view companies can use them to estimate distances to buildings, road environments, and outdoor objects.
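One building block of SfM is triangulation: once two camera positions are known, a feature seen in both images can be placed in 3D by intersecting the two viewing rays. The sketch below uses the simple midpoint method, which returns the point halfway between the closest points on the two rays. It is an illustration only, not how COLMAP or any specific tool implements triangulation, and the function name is hypothetical.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def triangulate_midpoint(c1: Vec3, d1: Vec3, c2: Vec3, d2: Vec3) -> Vec3:
    """Estimate a 3D point from two viewing rays (midpoint method).

    c1, c2: camera centers; d1, d2: ray directions toward the same feature.
    """
    w0 = (c1[0] - c2[0], c1[1] - c2[1], c1[2] - c2[2])
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w0), _dot(d2, w0)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; the point cannot be triangulated")
    # Ray parameters of the closest points on each ray.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = tuple(c1[k] + s * d1[k] for k in range(3))
    p2 = tuple(c2[k] + t * d2[k] for k in range(3))
    # With noisy measurements the rays rarely meet, so take the midpoint.
    return tuple((p1[k] + p2[k]) / 2.0 for k in range(3))


# Two cameras 1 m apart, both observing a feature at (1, 2, 5).
point = triangulate_midpoint((0.0, 0.0, 0.0), (1.0, 2.0, 5.0),
                             (1.0, 0.0, 0.0), (0.0, 2.0, 5.0))
```

The distance from either camera center to the triangulated point is the kind of range estimate described above. Rays that are nearly parallel, meaning the cameras barely moved relative to the feature, give unreliable results, which is one reason the capture path matters.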
More recently, computer vision models based on artificial intelligence (AI), and deep learning in particular, have emerged. They’ve proven to be far more generalizable, but they typically require GPUs because of how computationally heavy they are. These models are already commonplace in consumer products. For instance, they create the dog-ear camera effects in Instagram and Snapchat, and they power the background-blur feature in Zoom, Google Meet, and Microsoft Teams.
In the last two years, these developments have led to Neural Radiance Fields (NeRF), which brings these learnings together in a transformational computer vision milestone. NeRF takes multiple images of an environment, each with a known position and orientation, and projects light rays from each capture into the scene. It then generates a model that estimates the RGB color at each 3D coordinate in the scene.
The NeRF authors summarize the method in their paper: “Our algorithm represents a scene using a fully-connected (non-convolutional) deep network, whose input is a single continuous 5D coordinate (spatial location (x, y, z) and viewing direction (θ, φ)) and whose output is the volume density and view-dependent emitted radiance at that spatial location.
We synthesize views by querying 5D coordinates along camera rays and use classic volume rendering techniques to project the output colors and densities into an image. Because volume rendering is naturally differentiable, the only input required to optimize our representation is a set of images with known camera poses. We describe how to effectively optimize neural radiance fields to render photorealistic novel views of scenes with complicated geometry and appearance, and demonstrate results that outperform prior work on neural rendering and view synthesis.”
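The volume rendering step can be sketched in a few lines. The code below composites the densities and colors sampled along a single camera ray into one pixel color, using the standard discrete approximation in which each sample contributes its color weighted by its opacity and by how much light survives the samples in front of it. The function name and the hard-coded sample values are assumptions for illustration; in a real NeRF the densities and colors come from querying the trained network.

```python
import math
from typing import Sequence, Tuple

Color = Tuple[float, float, float]


def composite_ray(densities: Sequence[float], colors: Sequence[Color],
                  deltas: Sequence[float]) -> Tuple[Color, float]:
    """Blend samples along one ray, ordered from near to far.

    densities: volume density (sigma) at each sample
    colors:    RGB radiance at each sample
    deltas:    distance between consecutive samples
    Returns the pixel color and the transmittance left after the last sample.
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    rgb = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        # Opacity of this segment: denser or longer segments block more light.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        for k in range(3):
            rgb[k] += weight * color[k]
        transmittance *= 1.0 - alpha
    return (rgb[0], rgb[1], rgb[2]), transmittance


# Empty space in front of a solid green surface: the pixel comes out green.
pixel, remaining = composite_ray(
    densities=[0.0, 1000.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    deltas=[1.0, 1.0],
)
```

Every operation here is differentiable with respect to the densities and colors, which is why comparing rendered pixels against the captured images is enough to train the network.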