Deep Residual Architecture Using Pixel and Feature Cues for View Synthesis and Temporal Interpolation
 

Jinglei Shi, Xiaoran Jiang, Christine Guillemot,
"Deep Residual Architecture Using Pixel and Feature Cues for View Synthesis and Temporal Interpolation", IEEE Trans. on Computational Imaging, accepted, Feb. 2022.
contact: J. Shi, X. Jiang, C. Guillemot

Abstract

In this paper, we propose a deep residual architecture that can synthesize both high-quality angular views in light fields and intermediate temporal frames in classical videos. The proposed framework consists of an optical flow estimator optimized for view synthesis, a trainable feature extractor and a residual convolutional network for pixel-based and feature-based view reconstruction. Fine-tuning the optical flow estimator specifically for the view synthesis task yields scene depth or motion information that is well suited to the targeted problem. In cooperation with the end-to-end trainable encoder, the synthesis block performs both pixel-based and feature-based synthesis through residual connection blocks, and the two synthesized views are fused with the help of a learned soft mask to obtain the final reconstructed view. Experimental results on various datasets show that our method performs favorably against other state-of-the-art (SOTA) methods, with a large gain for light field view synthesis. Furthermore, with minor modifications, our method can also be used for video frame interpolation, generating high-quality frames that compare favorably with those of SOTA interpolation methods.
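To make the soft-mask fusion step concrete, below is a minimal PyTorch sketch of blending a pixel-based and a feature-based reconstruction with a learned per-pixel mask. It is only an illustration of the idea, not the paper's implementation; the module structure and channel sizes are hypothetical.

```python
import torch
import torch.nn as nn

class SoftMaskFusion(nn.Module):
    """Sketch: fuse a pixel-based and a feature-based synthesis with a
    learned soft mask (architecture and sizes are hypothetical)."""
    def __init__(self, in_channels=6):
        super().__init__()
        # Small conv head predicting a single-channel mask in [0, 1].
        self.mask_net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, view_pixel, view_feat):
        # view_pixel, view_feat: (B, 3, H, W) candidate reconstructions.
        mask = self.mask_net(torch.cat([view_pixel, view_feat], dim=1))
        # Per-pixel convex combination of the two candidates.
        return mask * view_pixel + (1.0 - mask) * view_feat
```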

Algorithm overview



Overview of the proposed deep architecture for both light field view synthesis and video frame temporal interpolation. For light field view synthesis, the inputs are 4 sparse light field views (e.g. corner views), with i={tl,tr,bl,br} indexing the source view positions and j indexing the target view position. For video frame temporal interpolation, the inputs are temporally adjacent frames, with i={0,1} indexing the source frame time instants and j=t the target frame instant.
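The pixel-based branch of such a pipeline is typically realized by backward warping each source view to the target position using the estimated flow. The sketch below is an illustrative PyTorch version, under the assumption that the flow maps each target pixel to its source-view coordinates (conventions vary); it is not the exact module used in the paper.

```python
import torch
import torch.nn.functional as F

def backward_warp(src, flow):
    """Warp a source view/frame toward the target position with a flow field.

    src:  (B, C, H, W) source image.
    flow: (B, 2, H, W) displacement from each target pixel to its source
          location, in pixels (assumed convention).
    """
    b, _, h, w = src.shape
    # Base grid of target pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=src.dtype, device=src.device),
        torch.arange(w, dtype=src.dtype, device=src.device),
        indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # Normalize to [-1, 1] as expected by grid_sample.
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(src, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```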


Quantitative assessment for LFVS (center view)


Visual comparison of the reconstruction error maps (center view)

Synthetic light fields

[Error-map comparisons for the synthetic scenes Stilllife, Buddha, MonasRoom, Sideboard, Cotton, Dino and Toy_bricks (shared color bar). Columns: GT, DeepVS, Soft3D, LLFF, EPI, FPFR, Ours.]

Real-world light fields

[Error-map comparisons for the real-world scenes Flower1, Rock and Seahorse. Columns: GT, DeepVS, Soft3D, LLFF, EPI, FPFR, Ours.]

LFVS performance as a function of viewpoint



Average PSNR curves for the different viewpoints. (a) Interpolation. (b) View indices for interpolation (top) and extrapolation (bottom). (c) Extrapolation. Four input views (red slash) are used for DeepVS, EPI, Soft3D, FPFR, FDL and our method, whereas five input views (grey) are used for LLFF.
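For reference, the curves report the standard PSNR measure averaged over the test light fields for each viewpoint; a minimal NumPy computation (assuming 8-bit images) could be:

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """PSNR in dB between a reference and a reconstructed view (8-bit assumed)."""
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# Per-viewpoint average over a test set (sketch; gt/pred are hypothetical dicts):
# curve[j] = np.mean([psnr(gt[lf][j], pred[lf][j]) for lf in test_set])
```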

Comparison with Neural Radiance Field-based methods


Compression performance analysis

[Figures: compression performance curves for the Cotton and Dino scenes.]

Quantitative assessment for VFI


Visual comparison of the interpolated frames

[Five visual comparisons of interpolated frames. Columns: MEMC-Net, FeFlow, SMSP, Ours, GT.]

Additional results on the Berkeley dataset (frame rate up-converted to 8×)


Videos taken from T.-C. Wang, J.-Y. Zhu, N. K. Kalantari, A. A. Efros, and R. Ramamoorthi, "Light field video capture using a learning-based hybrid imaging system", ACM Transactions on Graphics (TOG), 36(4), 2017. We use the central views of all key frames.
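One simple way to realize the 8× up-conversion, assuming the model can synthesize a frame at an arbitrary instant t in (0, 1) as in the overview above, is to insert seven intermediate frames between each pair of consecutive key frames. In the sketch below, `interpolate_at` is a hypothetical stand-in for the trained network.

```python
def upconvert_8x(key_frames, interpolate_at):
    """Insert 7 intermediate frames between consecutive key frames.

    key_frames:     list of frames (e.g. the central views of the key frames).
    interpolate_at: callable (f0, f1, t) -> frame at instant t in (0, 1);
                    hypothetical stand-in for the trained model.
    """
    out = []
    for f0, f1 in zip(key_frames[:-1], key_frames[1:]):
        out.append(f0)
        out.extend(interpolate_at(f0, f1, k / 8.0) for k in range(1, 8))
    out.append(key_frames[-1])
    return out
```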

Hybrid/Sequence 02


Hybrid/Sequence 03


Hybrid/Sequence 04

References

DeepVS: N. Kalantari, T. Wang, and R. Ramamoorthi, "Learning-based view synthesis for light field cameras", ACM Trans. on Graphics (TOG), 2016.

Soft3D: E. Penner and L. Zhang, "Soft 3D reconstruction for view synthesis", ACM Trans. on Graphics (TOG), 2017.

LLFF: B. Mildenhall, P. Srinivasan, et al., "Local light field fusion: Practical view synthesis with prescriptive sampling guidelines", ACM Trans. on Graphics (TOG), 2019.

EPI: G. Wu, Y. Liu, Q. Dai, and T. Chai, "Learning sheared EPI structure for light field reconstruction", IEEE Trans. on Image Processing (TIP), 2019.

FPFR: J. Shi, X. Jiang, and C. Guillemot, "Learning fused pixel and feature-based view reconstructions for light fields", IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020.

SuperSloMo: H. Jiang, D. Sun, V. Jampani, J. Kautz, et al., "Super SloMo: High quality estimation of multiple intermediate frames for video interpolation", IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2018.

SepConv: S. Niklaus, L. Mai, and F. Liu, "Video frame interpolation via adaptive separable convolution", IEEE Int. Conf. on Computer Vision (ICCV), 2017.

MEMC-Net: W. Bao, W. Lai, X. Zhang, M. Yang, et al., "MEMC-Net: Motion estimation and motion compensation driven neural network for video interpolation and enhancement", IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI), 2019.

FeFlow: S. Gui, C. Wang, Q. Chen, and D. Tao, "FeatureFlow: Robust video interpolation via structure-to-texture generation", IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020.

SMSP: S. Niklaus and F. Liu, "Softmax splatting for video frame interpolation", IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020.