Deep Residual Architecture Using Pixel and Feature Cues for View Synthesis and Temporal Interpolation
 

Jinglei Shi, Xiaoran Jiang, Christine Guillemot,
"Deep Residual Architecture Using Pixel and Feature Cues for View Synthesis and Temporal Interpolation", IEEE Trans. on Computational Imaging, accepted, Feb. 2022.
contact: J. Shi, X. Jiang, C. Guillemot

Abstract

In this paper, we propose a deep residual architecture that can synthesize both high-quality angular views in light fields and intermediate temporal frames in classical videos. The proposed framework consists of an optical flow estimator optimized for view synthesis, a trainable feature extractor and a residual convolutional network for pixel-based and feature-based view reconstruction. Fine-tuning the optical flow estimator specifically for the view synthesis task yields scene depth or motion information that is well suited to the targeted problem. In cooperation with the end-to-end trainable encoder, the synthesis block performs both pixel-based and feature-based synthesis through residual connection blocks, and the two synthesized views are fused with the help of a learned soft mask to obtain the final reconstructed view. Experimental results on various datasets show that our method performs favorably against other state-of-the-art (SOTA) methods, with a large gain for light field view synthesis. Furthermore, with minor modifications, our method can also be used for video frame interpolation, generating high-quality frames that compare favorably with those of SOTA interpolation methods.
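To make the soft-mask fusion step concrete, below is a minimal PyTorch sketch of blending a pixel-based and a feature-based reconstruction with a learned per-pixel mask. It is only an illustration of the idea, not the paper's implementation; the module structure and channel sizes are hypothetical.

```python
import torch
import torch.nn as nn

class SoftMaskFusion(nn.Module):
    """Sketch: fuse a pixel-based and a feature-based synthesis with a
    learned soft mask (architecture and sizes are hypothetical)."""
    def __init__(self, in_channels=6):
        super().__init__()
        # Small conv head predicting a single-channel mask in [0, 1].
        self.mask_net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, view_pixel, view_feat):
        # view_pixel, view_feat: (B, 3, H, W) candidate reconstructions.
        mask = self.mask_net(torch.cat([view_pixel, view_feat], dim=1))
        # Per-pixel convex combination of the two candidates.
        return mask * view_pixel + (1.0 - mask) * view_feat
```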

Algorithm overview



Overview of the proposed deep architecture for both light field view synthesis and video frame temporal interpolation. For light field view synthesis, the inputs are 4 sparse light field views (e.g. corner views), with i={tl,tr,bl,br} indexing the source view positions and j indexing the target view position. For video frame temporal interpolation, the inputs are temporally adjacent frames, with i={0,1} indexing the source frame time instants and j=t the target frame instant.
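The pixel-based branch of such a pipeline is typically realized by backward warping each source view to the target position using the estimated flow. The sketch below is an illustrative PyTorch version, under the assumption that the flow maps each target pixel to its source-view coordinates (conventions vary); it is not the exact module used in the paper.

```python
import torch
import torch.nn.functional as F

def backward_warp(src, flow):
    """Warp a source view/frame toward the target position with a flow field.

    src:  (B, C, H, W) source image.
    flow: (B, 2, H, W) displacement from each target pixel to its source
          location, in pixels (assumed convention).
    """
    b, _, h, w = src.shape
    # Base grid of target pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=src.dtype, device=src.device),
        torch.arange(w, dtype=src.dtype, device=src.device),
        indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize to [-1, 1] as expected by grid_sample.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```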


Quantitative assessment for LFVS (center view)


Visual comparison of the reconstruction error maps (center view)

Synthetic light fields

[Error-map comparisons for the synthetic scenes Stilllife, Buddha, MonasRoom, Sideboard, Cotton, Dino and Toy_bricks (shared color bar). Columns: GT, DeepVS, Soft3D, LLFF, EPI, FPFR, Ours.]

Real-world light fields

[Error-map comparisons for the real-world scenes Flower1, Rock and Seahorse. Columns: GT, DeepVS, Soft3D, LLFF, EPI, FPFR, Ours.]

LFVS performance as a function of viewpoint



Average PSNR curves for the different viewpoints. (a) Interpolation. (b) View indices for interpolation (top) and extrapolation (bottom). (c) Extrapolation. Four input views (red slash) are used for DeepVS, EPI, Soft3D, FPFR, FDL and our method, whereas five input views (grey) are used for LLFF.
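For reference, the curves report the standard PSNR measure averaged over the test light fields for each viewpoint; a minimal NumPy computation (assuming 8-bit images) could be:

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """PSNR in dB between a reference and a reconstructed view (8-bit assumed)."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# Per-viewpoint average over a test set (sketch; gt/pred are hypothetical dicts):
# curve[j] = np.mean([psnr(gt[lf][j], pred[lf][j]) for lf in test_set])
```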

Comparison with Neural Radiance Field-based methods


Compression performance analysis

[Figures: compression performance curves for the Cotton and Dino scenes.]

Quantitative assessment for VFI


Visual comparison of the interpolated frames

[Five visual comparisons of interpolated frames. Columns: MEMC-Net, FeFlow, SMSP, Ours, GT.]

Additional results on the Berkeley dataset (frame rate up-converted to 8×)


Videos taken from T.-C. Wang, J.-Y. Zhu, N. K. Kalantari, A. A. Efros, and R. Ramamoorthi, "Light field video capture using a learning-based hybrid imaging system", ACM Transactions on Graphics (TOG), 36(4), 2017. We use the central views of all key frames.
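One simple way to realize the 8× up-conversion, assuming the model can synthesize a frame at an arbitrary instant t in (0, 1) as in the overview above, is to insert seven intermediate frames between each pair of consecutive key frames. In the sketch below, `interpolate_at` is a hypothetical stand-in for the trained network.

```python
def upconvert_8x(key_frames, interpolate_at):
    """Insert 7 intermediate frames between consecutive key frames.

    key_frames:     list of frames (e.g. the central views of the key frames).
    interpolate_at: callable (f0, f1, t) -> frame at instant t in (0, 1);
                    hypothetical stand-in for the trained model.
    """
    out = []
    for f0, f1 in zip(key_frames[:-1], key_frames[1:]):
        out.append(f0)
        out.extend(interpolate_at(f0, f1, k / 8.0) for k in range(1, 8))
    out.append(key_frames[-1])
    return out
```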

Hybrid/Sequence 02


Hybrid/Sequence 03


Hybrid/Sequence 04

References

DeepVS: N. Kalantari, T. Wang, and R. Ramamoorthi, "Learning-based view synthesis for light field cameras", ACM Trans. on Graphics (TOG), 2016.

Soft3D: E. Penner and L. Zhang, "Soft 3D reconstruction for view synthesis", ACM Trans. on Graphics (TOG), 2017.

LLFF: B. Mildenhall, P. Srinivasan, et al., "Local light field fusion: Practical view synthesis with prescriptive sampling guidelines", ACM Trans. on Graphics (TOG), 2019.

EPI: G. Wu, Y. Liu, Q. Dai, and T. Chai, "Learning sheared EPI structure for light field reconstruction", IEEE Trans. on Image Processing (TIP), 2019.

FPFR: J. Shi, X. Jiang, and C. Guillemot, "Learning fused pixel and feature-based view reconstructions for light fields", IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020.

SuperSloMo: H. Jiang, D. Sun, V. Jampani, J. Kautz, et al., "Super SloMo: High quality estimation of multiple intermediate frames for video interpolation", IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.

SepConv: S. Niklaus, L. Mai, and F. Liu, "Video frame interpolation via adaptive separable convolution", IEEE Int. Conf. on Computer Vision (ICCV), 2017.

MEMC-Net: W. Bao, W. Lai, X. Zhang, M. Yang, et al., "MEMC-Net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement", IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 2019.

FeFlow: S. Gui, C. Wang, Q. Chen, and D. Tao, "FeatureFlow: Robust video interpolation via structure-to-texture generation", IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020.

SMSP: S. Niklaus and F. Liu, "Softmax splatting for video frame interpolation", IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020.