This project will no longer be maintained by Intel.
Intel has ceased development and contributions including, but not limited to, maintenance, bug fixes, new releases, or updates, to this project.
Intel no longer accepts patches to this project.
If you have an ongoing need to use this project, are interested in independently developing it, or would like to maintain patches for the open source software community, please create your own fork of this project.
INV: Understanding Functionalities Of MLPs Layers To Design Efficient Representations For Incremental Neural Videos
Shengze Wang, Alexey Supikov, Joshua Ratcliff, Ronald Azuma
TL;DR: INV is a fast incremental representation for neural videos. Each frame can be trained in under 10 minutes
using vanilla NeRF and still achieve good quality (>28 dB). Storing each frame/NeRF takes only around 300 KB, making
this representation streamable. Moreover, INV can achieve state-of-the-art per-frame image quality and competitive
stability with a smaller training budget than prior SOTA methods.
INV
INV is designed to incrementally generate neural 3D videos frame by frame, and it utilizes knowledge of prior frames to
drastically reduce both training time and storage. Our design is based on two insights:
(1) We discovered that MLPs automatically partition themselves into Structure and Color Layers, which store
structural and color/style information respectively.
(2) We leverage this property to retain and improve upon previously learned information. This allows us to reduce
redundancies in training and storage. We could thus amortize training across frames.
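The amortization idea can be illustrated with a toy sketch. It assumes a model is just a list of per-layer weight lists, and `train_one_frame` is a hypothetical placeholder for the real optimizer; neither name comes from this repository.

```python
import copy

def train_one_frame(model, frame_target, iters):
    """Hypothetical stand-in for real training: moves every weight halfway toward the target per iteration."""
    for _ in range(iters):
        for layer in model:
            for i, w in enumerate(layer):
                layer[i] = w + 0.5 * (frame_target - w)
    return model

def incremental_transfer(frame_targets, init_model, iters_per_frame):
    """Train one model per frame, initializing each from the previous frame's trained model."""
    models = []
    current = copy.deepcopy(init_model)
    for target in frame_targets:
        # Start from a copy of the previous result so earlier models stay untouched.
        current = train_one_frame(copy.deepcopy(current), target, iters_per_frame)
        models.append(current)
    return models

models = incremental_transfer([1.0, 1.1, 1.2], init_model=[[0.0, 0.0], [0.0]], iters_per_frame=4)
```

Because each frame starts close to its solution, later frames need far fewer updates than training from scratch would.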
git clone https://github.com/intel-collab/applications.ai.neural-video.progressive.git INV
cd INV
git checkout INV_finalized
None
To conduct further experiments in 2D, follow (1) to get the required data first.
Incremental Transfer encodes a video by training one MLP per frame. It uses the MLP from a previous frame as the initialization for the later frame. By analyzing the changes in each layer, we found that changes in earlier layers induce structural changes, and changes in later layers induce color/style changes.

First, run Incremental Transfer training by calling:
python 2D_experiments\neural_video.py
This runs according to the settings in 2D_experiments\const.py. The most important arguments are:
is_vid: indicates whether the input is a video file or a sequence of images.
vid_path: either a folder that contains a frames/ folder, or a video file. Results are stored in the parent folder of the video or of the frames/ folder.
per_frame_train_iters: number of iterations for each frame.
start_frame: the starting frame.
img_downsample: the downsampling factor. To accelerate training, use a larger factor (e.g. 4 or 8), but make sure the frame size is divisible by this number.
no_siren_only_mlp: use a basic MLP instead of SIREN. Default True.
use_nerf_pe: use NeRF positional encoding (i.e. sin & cos for xyz separately). Default True. If False, Fourier Features are used, where xyz are encoded together via sin & cos with random frequencies. This option has not been tested for a while.
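The difference between the two encodings selected by use_nerf_pe can be sketched as follows. This follows the commonly used formulations of both encodings; whether the scripts in this repository use exactly these frequencies is an assumption, and the names `nerf_pe`, `fourier_features`, and the frequency scale are illustrative.

```python
import math
import random

def nerf_pe(coords, num_freqs):
    """Axis-aligned encoding: each coordinate is encoded on its own."""
    out = []
    for c in coords:
        for k in range(num_freqs):
            out.append(math.sin((2 ** k) * math.pi * c))
            out.append(math.cos((2 ** k) * math.pi * c))
    return out

def fourier_features(coords, freq_matrix):
    """Joint encoding: every feature mixes all coordinates through one random frequency row."""
    out = []
    for row in freq_matrix:
        phase = 2 * math.pi * sum(b * c for b, c in zip(row, coords))
        out.append(math.sin(phase))
        out.append(math.cos(phase))
    return out

rng = random.Random(0)
# 4 random frequency rows for 2D pixel coordinates; the scale 10.0 is an arbitrary choice.
B = [[rng.gauss(0.0, 10.0) for _ in range(2)] for _ in range(4)]
xy = (0.25, 0.5)
pe = nerf_pe(xy, num_freqs=3)   # 2 coords * 3 freqs * (sin, cos) = 12 values
ff = fourier_features(xy, B)    # 4 rows * (sin, cos) = 8 values
```

In the first function no feature depends on more than one axis, while in the second every feature depends on all of them.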
vid_path/imgs_incremental: renderings from INV.
vid_path/models_incremental: models for each frame. Note that the entire model is stored for every frame, although only the structure layers (1st layer by default) differ.
Structure Swap: structure layers can be plugged directly into a pretrained color layer from a different frame without any training.
Current Observations:
- Positional Encodings are needed to activate structure and color layers.
- SIREN layers are not required, but they accelerate training.
- NeRF P.E. and Fourier Features activate Structure Layers.
Note: They produce different artifacts (e.g. NeRF P.E. gives horizontal/vertical stripes, while FFs show blobs). This is likely because FFs encode xyz together.
To run Structure Swap experiments:
python 2D_experiments\swap_1st_layer.py
The script uses the per-frame models stored in vid_path/models_incremental to swap.
It also uses the same configs (i.e. 2D_experiments\const.py) as before, except that it also uses:
do_color_scheme_transfer: variable in the swap_1st_layer.py script. Should be False. Used later in the 2D Color Transfer section.
base_frame: variable in the swap_1st_layer.py script. The base model/frame.
The script keeps the later layers of the base_frame fixed, and swaps the Structure Layer (1st layer) from later frames into this base frame.
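The swap itself amounts to recombining layers from two trained models. A minimal sketch, assuming each model is a list of layers in forward order; `swap_structure_layers` is a hypothetical helper, not a function from swap_1st_layer.py.

```python
import copy

def swap_structure_layers(base_model, donor_model, n_structure_layers=1):
    """Return a new model: the donor's first n layers followed by the base model's remaining layers."""
    if len(base_model) != len(donor_model):
        raise ValueError("models must have the same number of layers")
    swapped = copy.deepcopy(donor_model[:n_structure_layers])
    swapped += copy.deepcopy(base_model[n_structure_layers:])
    return swapped

# Strings stand in for weight tensors to keep the example readable.
base = [["s_base"], ["c1_base"], ["c2_base"]]
later = [["s_later"], ["c1_later"], ["c2_later"]]
mixed = swap_structure_layers(base, later)
```

No training happens in this step; the result is rendered directly from the recombined layers.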
vid_path/imgs_swap_1st_raw: renderings that result from these swaps. Note that the further a frame is in time from the base frame, the worse the result. This degradation is more prominent in 2D than in 3D.
vid_path/imgs_swap_1st_refined: after swapping, the new structure layers are also optimized, and the results are stored in this folder. The renderings after optimization usually have high quality.
Color Transfer: Mixing color and structure knowledge from different images
python 2D_experiments\color_scheme_transfer.py
Arguments:
base_folder: folder containing the two images.
structure_fn: which image to use as structure information.
color_fn: which image to use as color information.
{base_folder}/imgs_color_transfer_{iter}.png: results after finetuning the Structure Layers for iter iterations on the image structure_fn.
INV can achieve SOTA per-frame quality in less time than prior SOTA methods, using vanilla NeRF and no complex engineering.
INV achieves this by retaining prior knowledge and effectively amortizing training across frames.
During the Warm-Up stage:
INV incrementally trains and stores one NeRF per frame for N frames. After Warm-Up, INV freezes the later layers in the
NeRF model, and these layers are used as the Shared Color Layers.
During the Structure Transfer stage:
The NeRF model in INV has: (1) trainable Per-Frame Structure Layers (PFSL) at the front, and (2) frozen Shared Color
Layers (SCL) at the back. Note that PFSLs are trained and stored frame by frame, whereas the SCL is frozen and shared by all frames.
(Optional) Temporal Weight Compression (TWC)
Batches of 300~600 frames of PFSL weight matrices are compressed together using fpzip. This reduces weight size from
~700KB/frame to ~300KB/frame.
During Streaming, one recovers the whole NeRF for a frame by concatenating the PFSL for that frame with the SCL.
If the weights were compressed with TWC, they are decompressed first.
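The storage layout and the streaming step can be sketched as below. The sketch assumes a model is an ordered list of layers and that the split point is a single layer index; `split_model` and `rebuild_frame_model` are hypothetical names, and strings stand in for weight tensors.

```python
def split_model(model, first_shared_layer):
    """Split a trained model into (per-frame structure layers, shared color layers)."""
    return model[:first_shared_layer], model[first_shared_layer:]

def rebuild_frame_model(pfsl_by_frame, shared_color_layers, frame_idx):
    """Streaming: concatenate one frame's structure layers with the shared color layers."""
    return list(pfsl_by_frame[frame_idx]) + list(shared_color_layers)

# Model at the end of warm-up; its later layers become the shared color layers.
warmup_model = ["L0", "L1", "L2", "L3", "L4"]
_, scl = split_model(warmup_model, first_shared_layer=3)

# After warm-up only the structure layers are stored for each frame.
pfsl_by_frame = {
    120: ["L0@120", "L1@120", "L2@120"],
    121: ["L0@121", "L1@121", "L2@121"],
}
model_121 = rebuild_frame_model(pfsl_by_frame, scl, 121)
```

Only the entries of `pfsl_by_frame` grow with video length; `scl` is stored once.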
Data folder structure:
basedir/
-- frames_{factor}/ # all the frames (e.g. frame0001cam19.png) downsampled by {factor}; if using full-res images, use frames/
-- (if META) poses_bounds.npy # generated from LLFF colmap wrapper
-- (if Little Falls) calibration/ # little falls yaml files
-- (results) META_flame_salmon_1_warmup10k_iter10k_s3/
---- nerf_esti/ # output NeRF rendering and visualizations
---- output models, config used, rendered videos, etc.
To extract frames at a given size:
ffmpeg -i camxx.mp4 -vf scale=1352:1014 frames_2/frame%04dcamxx.png
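For many cameras, the command above can be generated per camera. A small sketch that only builds the argument list and does not run ffmpeg; the input name pattern `cam{id}.mp4` with a two-digit id is an assumption based on the example file names, and `ffmpeg_extract_cmd` is a hypothetical helper.

```python
def ffmpeg_extract_cmd(cam_id, width, height, factor):
    """Build the ffmpeg argument list for one camera, following the frame naming shown above."""
    out_dir = "frames" if factor == 1 else f"frames_{factor}"
    return [
        "ffmpeg",
        "-i", f"cam{cam_id:02d}.mp4",
        "-vf", f"scale={width}:{height}",
        f"{out_dir}/frame%04dcam{cam_id:02d}.png",
    ]

cmd = ffmpeg_extract_cmd(cam_id=19, width=1352, height=1014, factor=2)
command_line = " ".join(cmd)
```

The list form can be passed to `subprocess.run(cmd, check=True)` once the output folder exists.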
cd 3D_experiments
python INV_basic.py --config configs/META_flame_salmon_1_10k.txt
META: sample config in configs/META_flame_salmon_1_10k.txt.
LF: sample config in configs/LF_crystalball.txt.
Arguments:
dataset_type: type of data. META is META, Little Falls is little falls.
basedir & datadir: set both to the same path, the root directory containing the image folder.
is_nerf_baseline: set to True if running the NeRF baseline. Set to False for INV.
i_weights_warmup: iterations per frame during warmup (before freezing/sharing later layers). A longer warmup gives better color layers and better performance.
i_weights: iterations per frame after warmup, during Structure Transfer (with frozen/shared later layers). 10k takes 7.5~8 min.
mid_freeze_start: layers on and after this index are frozen/shared (3 means layers 0, 1, 2 are not frozen).
freeze_start_frame: on and after this frame, later layers are frozen/shared. More difficult scenes (e.g. META day scenes) need more overall training, so with 10k/frame, try freezing after the 120th frame. With 280k/frame, you could freeze as early as the 30th frame. Easier scenes (e.g. META night scenes) need less training; 30 warmup frames are usually enough.
factor: downsampling factor. Determines which image folder is used: if 1, assumes frames/, otherwise assumes frames_{factor}.
near: nearest depth (inverted depth if NDC) at which to start sampling along a ray. META day scenes are ~0.5, night scenes ~0.35.
no_skip_connect: set to True for INV, False for NeRF. The skip connection improves performance at the cost of more Structure Layers, so more layers need to be stored.
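How mid_freeze_start, freeze_start_frame, i_weights_warmup, and i_weights interact can be sketched as a per-frame schedule. This is an illustration of the semantics described in the argument list, not code from INV_basic.py; the function names and the layer count of 8 are assumptions.

```python
def trainable_layers(frame_idx, num_layers, mid_freeze_start, freeze_start_frame):
    """Indices of layers that are still optimized for this frame."""
    if frame_idx < freeze_start_frame:
        return list(range(num_layers))      # warm-up: every layer trains
    return list(range(mid_freeze_start))    # afterwards: only layers before the freeze point

def iters_for_frame(frame_idx, freeze_start_frame, i_weights_warmup, i_weights):
    """Iteration budget for this frame: warm-up budget before freezing, regular budget after."""
    return i_weights_warmup if frame_idx < freeze_start_frame else i_weights

before = trainable_layers(119, num_layers=8, mid_freeze_start=3, freeze_start_frame=120)
after = trainable_layers(120, num_layers=8, mid_freeze_start=3, freeze_start_frame=120)
budget = iters_for_frame(120, freeze_start_frame=120, i_weights_warmup=10000, i_weights=10000)
```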
The complete models (including frozen color layers, and not compressed) are saved in {basedir}/{expname}/
Renderings are saved in {basedir}/{expname}/nerf_esti/
e.g. D:\data\cut_roasted_beef\META_flame_salmon_1_warmup10k_iter10k_s3_freeze120_test\nerf_esti
Split INV uses a separate NeRF to encode static background and thus allow most of the computation to be focused on
dynamic foreground content. As a result, flickering is reduced, and the foreground is of much higher quality.
First, generate dynamic masks by thresholding optical flow maps from methods like SeparableFlow. True pixels indicate dynamic foreground pixels. Store the masks under {basedir}/mask/.
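The thresholding step can be sketched as follows, assuming the flow map is available as a grid of (dx, dy) displacements in pixels. The threshold value and the helper name `dynamic_mask` are illustrative; this repository does not prescribe them here.

```python
import math

def dynamic_mask(flow, threshold):
    """flow[y][x] is a (dx, dy) displacement; a pixel is dynamic when its flow magnitude exceeds the threshold."""
    return [[math.hypot(dx, dy) > threshold for (dx, dy) in row] for row in flow]

flow = [
    [(0.0, 0.0), (3.0, 4.0)],
    [(0.1, 0.1), (1.0, 0.0)],
]
mask = dynamic_mask(flow, threshold=0.5)
```

A lower threshold marks more pixels as foreground, which shifts more work to the dynamic model.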
To run on a sequence, e.g. flame_salmon_1, first extract the static background. Set pretraining_static=True in META_flame_salmon_split_mlp_10k_linux.txt, and run:
cd 3D_experiments
python INV_split.py --config ./configs/META_flame_salmon_split_mlp_10k_linux.txt
The script iterates through all frames and stores the extracted static model in {basedir}/{expname}_static/
Then, encode the dynamic foreground by setting pretraining_static=False and running:
cd 3D_experiments
python INV_split.py --config ./configs/META_flame_salmon_split_mlp_10k_linux.txt
During the first several frames (default 9), the background is also optimized, so training takes slightly longer. Afterwards, it takes around
10 min/frame.
The complete models (including frozen color layers, and not compressed) are saved in {basedir}/{expname}_dynamic/
Renderings for the test view are saved in {basedir}/{expname}_dynamic/nerf_esti
TWC uses the floating-point compression algorithm fpzip to compress a temporal weight matrix from 1.12 MB down to 300 KB per frame. TWC first reshapes the weights of the structure layers into a 2D matrix for each frame. Then, TWC
concatenates the weight matrices of all frames into a 3D matrix. Finally, fpzip compresses this matrix at 16-bit
resolution. To run TWC on the set of saved models, run:
cd 3D_experiments
python fpzip_test.py
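The idea behind TWC can be sketched with the standard library alone. This is not what fpzip_test.py does: here zlib stands in for fpzip, and "16-bit resolution" is approximated by keeping the top 16 bits of each float32, which is an assumption. The point is only that stacking per-frame matrices along time and compressing the whole stack together exploits temporal redundancy.

```python
import struct
import zlib

def keep_top_16_bits(value):
    """Zero the low 16 bits of the float32 representation (a crude stand-in for lossy 16-bit precision)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    (truncated,) = struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))
    return truncated

def compress_weight_stack(weight_stack):
    """weight_stack[frame][row][col]: reduce precision, serialize frame-major, compress as one blob."""
    values = [keep_top_16_bits(w) for frame in weight_stack for row in frame for w in row]
    raw = struct.pack(f"<{len(values)}f", *values)
    return zlib.compress(raw, 9), len(raw)

# Toy stack of 300 frames, each a 2x2 weight matrix that drifts slowly over time.
weight_stack = [[[0.5 + 0.001 * t, -0.25], [1.0, 0.125 * t]] for t in range(300)]
blob, raw_size = compress_weight_stack(weight_stack)
ratio = len(blob) / raw_size
```

With the real fpzip library the compressor also predicts each value from its neighbors in the 3D array, so the achieved ratio will differ from this toy.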
Similar to 2D Structure Swap, the script shows results of replacing `base_frame`'s Structure Layers with those of other frames. This "swap" causes structural/content changes in the resulting renderings. This process assumes pretrained models for both the `base_frame` and the later frames.

NOTE: This works best when the model has converged to good performance. Otherwise, there could be too many artifacts to
see a meaningful visualization. E.g. frame 30.
cd 3D_experiments
python nerf_motion_layer.py --config configs/LF_crystalball.txt
Arguments:
base_frame: the frame whose Structure Layers will be swapped out for Structure Layers from later frames.
first_frame_to_process: first frame whose Structure Layers will be swapped into base_frame.
DO_SWAP_LAYERS: leave True for the experiment to perform the swap.
swap_n_layer: first n layers to swap. Note that NeRF has two heads. The color head is indexed 8-10 here. The density head is 11 and is handled by DO_SWAP_ALPHA.
DO_SWAP_ALPHA: if True, swaps the alpha/sigma/density head.
Renderings are saved in {basedir}/{expname}/swap
E.g.: cam00_frame0004_e0005_raw_swap_0-0_26.269928.png means camera 00, swapping the 0th to 0th layers of frame 5
(i.e. 1st layer only) into frame 4; the resulting rendering gets 26.269928 dB PSNR on frame 5. raw_swap
means no refinement after the swap.
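To collect PSNR values from a folder of such renderings, the file names can be parsed. The pattern below is inferred from the single example name above, so the field names (`base_frame`, `swapped_frame`, and so on) and the assumption that other modes follow the same `<mode>_swap` layout are mine, not the repository's.

```python
import re

SWAP_NAME = re.compile(
    r"cam(?P<cam>\d+)_frame(?P<base>\d+)_e(?P<swapped>\d+)_(?P<mode>[a-z]+)_swap_"
    r"(?P<first>\d+)-(?P<last>\d+)_(?P<psnr>\d+\.\d+)\.png"
)

def parse_swap_name(filename):
    """Parse a swap rendering file name into its fields; returns None if the name does not match."""
    m = SWAP_NAME.fullmatch(filename)
    if m is None:
        return None
    return {
        "camera": int(m["cam"]),
        "base_frame": int(m["base"]),
        "swapped_frame": int(m["swapped"]),
        "mode": m["mode"],
        "layers": (int(m["first"]), int(m["last"])),
        "psnr_db": float(m["psnr"]),
    }

info = parse_swap_name("cam00_frame0004_e0005_raw_swap_0-0_26.269928.png")
```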