Predictive View Generation to Enable Mobile 360-degree and VR Experiences

Xueshi Hou, Sujit Dey
Mobile Systems Design Lab, Center for Wireless Communications, UC San Diego

Jianzhong Zhang, Madhukar Budagavi
Samsung Research America
§ Goal: Enable a wireless, lightweight VR experience
§ Observation: Existing head-mounted displays (HMDs) have limitations
  § Rendering with a tethered PC: not mobile
  § Rendering on a mobile device attached to the HMD: clunky to wear
§ How to make it mobile and portable (wireless and lighter)? Stream only the Field of View (FOV)
Example of a cloud-based solution:
§ Solution: Shift computing tasks (e.g. rendering) to the edge/cloud, and stream videos to the HMD
§ VR head-mounted devices make the requirements much steeper than for cloud/edge-based video streaming
Display Device   Head Motion       Bitrate needed (Mbps)                Acceptable Latency
                 Framerate & QP    Virtual Classroom    Racing Game
                                   1080p     4K         1080p    4K
PC Monitor       QP=20             5.8       14.5       16.6     41.5    100-200ms (for VC), <100ms (for Game)
Oculus           QP=15             10.9      27.3       33.9     84.8    28ms
Oculus           75fps, QP=15      28.2      70.5       39.7     99.3    22ms
For head motion, cloud/edge-based wireless VR requires very high frame rates and bitrates, and must also satisfy ultra-low latency!
[Screenshots: Virtual Classroom, Racing Game]
Experiment setup: 1080p/4K, GOP=30
Note: For a Virtual Classroom with 50 students, the 4K bitrate needed exceeds 3.5 Gbps
§ Possible Method 1: Render the 360-degree video in the cloud, transmit it to the RAN edge, and extract the FOV at the edge based on head motion
  § Advantage: low computation overhead on the edge device
  § Problem: very high (backhaul) data rate
§ Possible Method 2: Render the 360-degree video on the edge device and extract the FOV based on head motion
  § Advantage: theoretically low (backhaul) data rate
  § Problem: restricted to edge devices with very high computation capability
[Diagram: FOV extraction]
§ Solution: Based on head motion prediction, pre-render and stream the predicted FOV in advance from the edge device
§ Advantages:
  § Latency: no rendering/encoding delay, and minimal communication delay with significantly reduced bandwidth
  § Edge can be RAN or local; it can even be a mobile device
[Diagram: (a) Using a Mobile Edge Computing node (MEC): glasses and controller connect to the MEC over a cellular connection. (b) Using a Local Edge Computing node (LEC): glasses and controller connect to the LEC over WiFi/millimeter wave. In both, the edge node performs Predictive FOV Generation and exchanges control and video data with a cloud server.]
§ Motivation: address both the bandwidth and latency challenges
§ Common approach to reduce bandwidth: streaming only the FOV → still cannot address the latency problem
§ System overview for proposed approach:
[Figure: the FOV (~90° x ~90°) within a 360-degree view (90° x 180° shown), and its projection in Euler coordinates]
Predictive View Generation to Enable Mobile 360-degree and VR Experiences:
Early experiments with Samsung Dataset
[Figure: tiling of the 360-degree view (90° x 180° shown) into 30° x 30° tiles; the FOV is ~90° x ~90°, and the viewpoint refers to the center of the FOV]
§ Setup: Samsung Gear VR, sampling frequency f = 5Hz
§ Dataset: head motion traces from over 36,000 viewers for 19 360-degree/VR videos during 7 days
§ Tile options: 12x6 tiles (30°x30°), 18x6 tiles (20°x30°), etc.
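As a small sketch of the tiling above, the code below maps a viewpoint to a tile index on the 12x6 grid of 30°x30° tiles. The function name and the (yaw, pitch) angle convention are illustrative assumptions, not from the talk.

```python
# Sketch (assumed conventions): map a viewpoint given as (yaw, pitch) in
# degrees to a tile index on a 12x6 grid of 30-degree x 30-degree tiles.

def viewpoint_to_tile(yaw, pitch, cols=12, rows=6):
    """yaw in [0, 360), pitch in [-90, 90]; returns a tile index in [0, cols*rows)."""
    tile_w = 360.0 / cols                                # 30 degrees per column
    tile_h = 180.0 / rows                                # 30 degrees per row
    col = int(yaw % 360.0 // tile_w)
    row = int(min((pitch + 90.0) // tile_h, rows - 1))   # clamp pitch = +90
    return row * cols + col

print(viewpoint_to_tile(0.0, -90.0))    # 0  (first tile)
print(viewpoint_to_tile(359.9, 89.9))   # 71 (last tile)
```

The same mapping with an 18x6 grid (20°x30° tiles) follows by setting cols=18.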
[Figure: VR dataset statistics: CDFs of video duration (s) and number of viewers]
§ Over 80% of videos are longer than 100s
§ Around 85% of videos have more than 1,000 viewers
[Figure: boxplot of head motion speed (°/s) versus time (s) for Kong VR, showing min, 25th percentile, median, 75th percentile, and max]
This boxplot shows the head motion speed distribution for over 1,500 viewers during 60s; it illustrates how challenging head motion prediction is, since viewers may change viewing direction quickly as well as frequently.
§ Brighter tiles attract more attention, and viewers are more likely to look at these areas
§ Feasibility of viewpoint prediction: some areas attract more attention than the remaining areas within a 360-degree view
§ Multiple tiles (as many as 11) can have relatively high probabilities (>5%), indicating the difficulty of predicting the viewpoint accurately
[Figure: example of an attention heatmap]
§ The attention heatmap is defined as the series of probabilities that the viewpoint is within each tile, over n viewers during the time period from ts1 to ts2
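The heatmap definition above can be sketched as follows; the trace format (a per-viewer list of tile indices sampled at 5 Hz) and the function name are assumptions for illustration.

```python
# Sketch: compute the attention heatmap -- for each tile, the fraction of
# (viewer, timestamp) viewpoint samples falling in that tile during a window.
from collections import Counter

def attention_heatmap(traces, t_start, t_end, n_tiles=72, freq=5):
    """traces: per-viewer lists of tile indices sampled at `freq` Hz.
    Returns a length-n_tiles list of probabilities over [t_start, t_end)."""
    i0, i1 = int(t_start * freq), int(t_end * freq)
    counts = Counter()
    total = 0
    for trace in traces:
        for tile in trace[i0:i1]:
            counts[tile] += 1
            total += 1
    return [counts[t] / total if total else 0.0 for t in range(n_tiles)]

# Two viewers, each spending half the window on tile 5 and half on tile 6:
heat = attention_heatmap([[5, 5, 6, 6], [5, 6, 5, 6]], 0.0, 0.8, n_tiles=8)
```

Brighter tiles in the rendered heatmap correspond to larger entries of this probability vector.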
§ Goal: predict the viewpoint position (tile) 200ms in advance
§ Model: multi-layer long short-term memory (LSTM) network
§ Input features: tile-based one-hot encoding of the viewpoint trace as a 72x10 matrix (72 tiles, 10 timestamps in 2s)
§ Label for training: whether the viewpoint belongs to each tile, as a 72x1 matrix
§ Output: probability of the viewpoint belonging to each of the 72 tiles
Example: given the viewpoint trace during t ∈ (3, 5] seconds, where is the viewpoint at t = 5.2s (200ms afterwards)?
[Figure: LSTM architecture: viewpoint features feed two layers of LSTM units, followed by a fully connected layer and a softmax layer that outputs the predicted per-tile viewpoint probabilities]
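The input encoding described above can be sketched as follows; the function name is an illustrative assumption.

```python
import numpy as np

# Sketch of the input encoding: a 2 s viewpoint trace sampled at 5 Hz gives
# 10 tile indices, encoded as a 72x10 one-hot matrix (one column per timestamp).

def encode_trace(tile_indices, n_tiles=72):
    x = np.zeros((n_tiles, len(tile_indices)), dtype=np.float32)
    x[tile_indices, np.arange(len(tile_indices))] = 1.0  # one 1 per column
    return x

x = encode_trace([30, 30, 31, 31, 32, 32, 33, 33, 34, 34])
# x has shape (72, 10); each column sums to 1
```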
§ Dataset: head motion traces of 36,000 viewers during 7 days for 19 360-degree/VR videos; trace points are sampled every 200ms
§ Training data: 45,000 head motion sampling traces (each 2s long)
§ Test data: 5,000 head motion sampling traces (from viewers not in the training data)
§ Parameters:
  § first layer: 128 LSTM units; second layer: 128 LSTM units; fully connected layer: 72 nodes
§ We explore four deep learning or classical machine learning models for viewpoint prediction: LSTM, stacked sparse autoencoders (SAE), bootstrap-aggregated decision trees (BT), and weighted k-nearest neighbors (kNN)
  § SAE: two fully connected layers with 100 and 80 nodes respectively; BT: ensembles of 30 bagged decision trees; kNN: 100 nearest neighbors
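The LSTM configuration above (two layers of 128 units, a 72-node fully connected layer, softmax output) can be sketched in PyTorch; this is an illustrative assumption, not the authors' code.

```python
import torch
import torch.nn as nn

# Sketch of the described architecture: two 128-unit LSTM layers over the
# 10-step one-hot viewpoint trace, then a 72-node fully connected layer
# and a softmax over the 72 tiles.

class ViewpointLSTM(nn.Module):
    def __init__(self, n_tiles=72, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_tiles, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, n_tiles)

    def forward(self, x):
        # x: (batch, 10, 72) -- 10 timestamps, one-hot over 72 tiles
        out, _ = self.lstm(x)
        # use the last timestep's hidden state to predict the next viewpoint
        return torch.softmax(self.fc(out[:, -1]), dim=-1)  # (batch, 72)

model = ViewpointLSTM()
probs = model(torch.zeros(1, 10, 72))  # per-tile viewpoint probabilities
```

Training would then minimize cross-entropy against the 72x1 one-hot label of the tile observed 200ms later.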
FOV prediction accuracy:
§ the probability that the actual user view will be within the predicted FOV
§ depends on both the LSTM model accuracy and the FOV generation method, and thus reflects the performance of both
FOV generation
§ Tradeoff between FOV size (bitrate) and FOV prediction accuracy:
  § Prediction accuracy is 100% if the predicted FOV is the whole 360-degree view, but the bitrate is very high
  § By selecting more tiles (i.e. larger m) with high viewpoint probability, we can achieve higher FOV prediction accuracy, but also higher bitrate
§ Use the choice of m to achieve the desired tradeoff between FOV prediction accuracy and bandwidth consumed:
  § Larger m → higher bandwidth but better FOV prediction accuracy
  § Smaller m → lower bandwidth but higher risk of missing the actual FOV
SAE: stacked sparse autoencoders; BT: bootstrap-aggregated decision trees; kNN: weighted k-nearest neighbors
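A minimal sketch of top-m FOV generation and the accuracy metric, assuming each prediction is a per-tile probability vector (the function names are illustrative):

```python
# Sketch: build the predicted FOV from the m tiles with the highest predicted
# probability, then score FOV prediction accuracy as the fraction of samples
# whose actual viewpoint tile lies inside the predicted FOV.

def top_m_tiles(probs, m):
    return sorted(range(len(probs)), key=lambda t: probs[t], reverse=True)[:m]

def fov_accuracy(pred_probs_list, actual_tiles, m):
    hits = sum(actual in top_m_tiles(p, m)
               for p, actual in zip(pred_probs_list, actual_tiles))
    return hits / len(actual_tiles)

probs = [0.0] * 72
probs[10], probs[11], probs[22] = 0.5, 0.3, 0.2
print(top_m_tiles(probs, 2))   # [10, 11]
```

Sweeping m then traces out the accuracy-versus-bandwidth tradeoff described above.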
§ FOV prediction accuracy and pixel savings obtained when selecting different numbers of tiles (i.e. the choice of m) to generate the FOV
§ As the number of tiles m increases, FOV prediction accuracy continuously increases while pixel saving decreases → a tradeoff between FOV prediction accuracy and pixel saving
§ Results below for one strategy show it is possible to achieve high FOV prediction accuracy while significantly reducing bitrate (to a rate comparable to a real-time generated FOV)
§ We fix the FOV prediction accuracy at ~95% and compare the pixel savings achieved by each model

Kong VR
Model   Medium Motion Sequence           High Motion Sequence
        FOV Acc.(%)  Pixel Saving(%)     FOV Acc.(%)  Pixel Saving(%)
SAE     95.0         34.0                95.0         3.9
LSTM    95.5         55.7                95.0         43.7
BT      95.0         14.8                95.2         14.4
kNN     94.8         12.0                95.3         12.0

Model   Fashion Show                  Whale Encounter               Roller Coaster
        FOV Acc.(%)  Pixel Saving(%)  FOV Acc.(%)  Pixel Saving(%)  FOV Acc.(%)  Pixel Saving(%)
SAE     95.4         52.7             95.1         46.8             95.3         29.9
LSTM    95.2         69.7             95.5         66.8             95.2         71.0
BT      95.3         19.1             95.0         18.6             95.2         48.9
kNN     94.9         12.0             95.2         10.3             95.1         21.2
§ We propose a predictive view generation approach to reduce the latency and bandwidth needed to deliver 360-degree videos and cloud/edge-based VR applications, leading to better mobile VR experiences
§ We present a multi-layer LSTM model that can learn general head motion patterns and predict the future viewpoint from past traces
§ Our method shows good results on a real head motion trace dataset and great potential to reduce bandwidth
Future Work:
§ Adaptive streaming using our trained model (tiles with different video quality)
§ 3DoF → 6DoF: view prediction considering body motion (6DoF) and hand motion
[Figure: 3 Degrees of Freedom (3DoF) vs. 6 Degrees of Freedom (6DoF); photo source: Qualcomm]