Predictive View Generation to Enable Mobile 360-degree and VR Experiences



SLIDE 1

Predictive View Generation to Enable Mobile 360-degree and VR Experiences

Xueshi Hou, Sujit Dey
Mobile Systems Design Lab, Center for Wireless Communications, UC San Diego

Jianzhong Zhang, Madhukar Budagavi
Samsung Research America

SLIDE 2

Motivation: Towards a Truly Mobile VR Experience

§ Goal: Enable a wireless and light VR experience
§ Observation: Existing head-mounted displays (HMDs) have limitations
  • Rendering with a tethered PC → not mobile
  • Rendering on a mobile device attached to the HMD → clunky to wear
§ How to make it mobile and portable (wireless and lighter)? Streaming only the Field of View (FOV)
§ Solution: Shift computing tasks (e.g., rendering) to the edge/cloud and stream video to the HMD

Example of a cloud-based solution:
  • 1. Transmit head motion and control to the cloud
  • 2. Field-of-view rendering on the cloud
  • 3. Transmit the rendered video to the VR glasses

SLIDE 3

Challenges of Cloud/Edge-based Wireless VR

§ VR head-mounted devices make the requirements much steeper than cloud/edge-based video streaming
  • User experience is much more sensitive to video artifacts → significantly higher video quality needed
  • Head motion significantly increases latency sensitivity → significantly higher frame rate and bitrate needed

Display Device | Head Motion Framerate & QP | Virtual Classroom Bitrate (Mbps), 1080p / 4K | Racing Game Bitrate (Mbps), 1080p / 4K | Acceptable Latency
PC Monitor     | 45 fps, QP=20 | 5.8 / 14.5  | 16.6 / 41.5 | 100-200 ms (for VC), <100 ms (for Game)
Oculus         | 45 fps, QP=15 | 10.9 / 27.3 | 33.9 / 84.8 | 28 ms
Oculus         | 75 fps, QP=15 | 28.2 / 70.5 | 39.7 / 99.3 | 22 ms

For head motion, cloud/edge-based wireless VR will require a very high frame rate and bitrate, and also needs to satisfy ultra-low latency!

Experiment setup:
  • VR spaces (Virtual Classroom, Racing Game) created using Unity; VR HMD: Oculus Rift DK2; Video: H.264, 1080p/4K, GOP=30

Note: For a Virtual Classroom with 50 students, the bitrate needed for 4K is > 3.5 Gbps.

SLIDE 4

Solution for Ultra-low Latency: Machine Learning Based Predictive Pre-Rendering (FOV Extraction)

§ Possible Method 1: Render the 360-degree video on the cloud, transmit it to the RAN edge, and extract the FOV at the edge depending on head motion
  • Advantage: low computation overhead on the edge device
  • Problem: very high (backhaul) data rate
§ Possible Method 2: Render the 360-degree video on the edge device and extract the FOV depending on head motion
  • Advantage: theoretically low (backhaul) data rate
  • Problem: restricted to edge devices with very high computation capability

SLIDE 5

Solution for Ultra-low Latency: Machine Learning Based Predictive Pre-Rendering

Question: Is it possible to predict head motion?

§ Solution: Based on head motion prediction, pre-render and stream the predicted FOV in advance from the edge device
§ Advantages:
  • Latency: no rendering/encoding delay, minimal communication delay with significantly reduced bandwidth
  • Edge can be RAN or local; can be the mobile device

§ System overview for the proposed approach:
  [System diagrams: the Glasses/Controller connect to an edge node that performs Predictive FOV Generation and exchanges control and video data with the Cloud Server: (a) using a Mobile Edge Computing node (MEC) over a cellular connection, (b) using a Local Edge Computing node (LEC) over WiFi/millimeter wave]

SLIDE 6

Predictive View Generation to Enable Mobile 360-degree and VR Experiences:
Early experiments with Samsung Dataset

§ Motivation: address both the bandwidth and latency challenges
§ Common approach to reduce bandwidth: streaming only the FOV → still cannot address the latency problem
§ System overview for the proposed approach:
  [Figure: an FOV of ~90° x ~90° within a 360-degree view (360° x 180°), shown in projection and in Euler coordinates]
  [System diagrams: (a) using a Mobile Edge Computing node (MEC) over a cellular connection, (b) using a Local Edge Computing node (LEC) over WiFi/millimeter wave; the edge node performs Predictive FOV Generation and exchanges control and video data with the Cloud Server]

SLIDE 7

§ Idea: predictive view generation approach
  • Only the predicted view is extracted (for 360-degree video) or rendered (in the case of VR) and transmitted in advance
  • (viewpoint refers to the center of the FOV)

[Figure: a 360-degree view (360° x 180°) divided into 30° x 30° tiles, with the viewpoint and the ~90° x ~90° FOV around it marked]

SLIDE 8

Predictive View Generation to Enable Mobile 360-degree and VR Experiences:
Early experiments with Samsung Dataset

§ Setup: Samsung Gear VR, sampling frequency f = 5 Hz
§ Dataset: head motion traces from over 36,000 viewers for 19 360-degree/VR videos over 7 days
§ Tile options: 12x6 tiles (30°x30°), 18x6 tiles (20°x30°), etc.
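To make the tiling concrete, here is a minimal sketch (not the authors' code) that maps a viewpoint, given as yaw and pitch in degrees, to one of the 72 tiles on the 12x6 grid of 30°x30° tiles; the angle conventions and the function name viewpoint_to_tile are assumptions.

```python
import numpy as np

# Minimal sketch (not the authors' code): map a viewpoint (yaw, pitch in
# degrees) to a tile index on the 12x6 grid of 30-degree x 30-degree tiles.
# Angle conventions (yaw in [-180, 180), pitch in [-90, 90]) are assumptions.
N_YAW, N_PITCH = 12, 6                        # 12 x 6 = 72 tiles
TILE_W, TILE_H = 360 / N_YAW, 180 / N_PITCH   # 30 deg x 30 deg per tile

def viewpoint_to_tile(yaw_deg, pitch_deg):
    """Return the index (0..71) of the tile containing the viewpoint."""
    col = int(((yaw_deg + 180.0) % 360.0) // TILE_W)                  # 0..11 along yaw
    row = int(np.clip((pitch_deg + 90.0) // TILE_H, 0, N_PITCH - 1))  # 0..5 along pitch
    return row * N_YAW + col

# Example: a viewpoint looking straight ahead at the horizon
# viewpoint_to_tile(0.0, 0.0) -> a tile in the middle row of the grid
```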

VR dataset statistics (CDF plots of video duration and number of viewers):
§ Over 80% of videos are longer than 100 s
§ Around 85% of videos have more than 1,000 viewers

[Boxplot: head motion speed (°/s) versus time (s) in Kong VR, showing min, 25th percentile, median, 75th percentile, and max]

This boxplot shows the head motion speed distribution for over 1,500 viewers during 60 s; it illustrates how challenging head motion prediction is, since viewers may change viewing direction quickly as well as frequently.
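For illustration only, one simple way to derive per-sample head motion speeds like those summarized in the boxplot is sketched below; the exact speed definition used by the authors may differ, and the yaw/pitch trace layout is an assumption.

```python
import numpy as np

# Illustrative sketch (assumed trace layout, not the authors' code): estimate
# head motion speed in degrees/second from a viewpoint trace sampled at
# f = 5 Hz, i.e. one sample every 200 ms.
def head_motion_speed(yaw_deg, pitch_deg, f_hz=5.0):
    """Angular speed between consecutive samples, handling yaw wrap-around."""
    dyaw = np.abs(np.diff(yaw_deg))
    dyaw = np.minimum(dyaw, 360.0 - dyaw)      # shortest way around the circle
    dpitch = np.abs(np.diff(pitch_deg))
    step = np.sqrt(dyaw ** 2 + dpitch ** 2)    # angular displacement per sample
    return step * f_hz                         # degrees per second

# Example usage: speeds = head_motion_speed(trace_yaw, trace_pitch)
```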


SLIDE 9

Example of attention heatmap

§ Attention heatmap: defined as the series of probabilities that the viewpoint falls within each tile, over n viewers, during the time period from cts1 to cts2
§ Brighter tiles attract more attention; viewers are more likely to look at these areas
§ Feasibility of viewpoint prediction: some areas attract more attention than the remaining areas within a 360-degree view
§ Multiple tiles (as many as 11) can have relatively high probabilities (>5%), indicating the difficulty of predicting the viewpoint accurately
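A rough sketch of the heatmap computation implied by the definition above; the per-viewer trace layout (one {timestamp: tile index} dict per viewer) and the function name are assumptions, not the paper's implementation.

```python
import numpy as np

# Rough sketch of the attention heatmap defined on this slide: the probability
# that the viewpoint lies in each of the 72 tiles, over n viewers, within the
# time window [ts1, ts2]. The trace layout is an assumed data format.
def attention_heatmap(viewer_traces, ts1, ts2, n_tiles=72):
    counts = np.zeros(n_tiles)
    total = 0
    for trace in viewer_traces:                # one {timestamp: tile} dict per viewer
        for t, tile in trace.items():
            if ts1 <= t <= ts2:
                counts[tile] += 1
                total += 1
    return counts / max(total, 1)              # per-tile viewing probability
```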


SLIDE 10

Viewpoint Prediction using Deep Learning

§ Goal: predict the viewpoint position (tile) 200 ms in advance
§ Model: multi-layer long short-term memory (LSTM) network
§ Input features: tile-based one-hot encoding of the viewpoint trace as a 72x10 matrix (72 tiles, 10 timestamps over 2 s)
§ Label for training: whether the viewpoint belongs to each tile, as a 72x1 matrix
§ Output: probability of the viewpoint belonging to each of the 72 tiles
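As a sketch of the input/label representation described above (the exact layout is an assumption, not the authors' code), a 2 s viewpoint trace sampled at 5 Hz becomes a 72x10 one-hot matrix and the label a 72x1 one-hot vector:

```python
import numpy as np

# Sketch of the encoding described on this slide (layout is an assumption):
# a 2 s trace sampled at 5 Hz gives 10 tile indices, encoded as a 72x10
# one-hot matrix; the label is a 72x1 one-hot vector for the tile the
# viewpoint occupies 200 ms later.
def encode_sample(trace_tiles, label_tile, n_tiles=72):
    x = np.zeros((n_tiles, len(trace_tiles)))  # 72 x 10 input features
    for t, tile in enumerate(trace_tiles):
        x[tile, t] = 1.0
    y = np.zeros((n_tiles, 1))                 # 72 x 1 training label
    y[label_tile, 0] = 1.0
    return x, y
```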

[Example: given the viewpoint trace during t ∈ (3, 5] s, where is the viewpoint at t = 5.2 s, i.e. 200 ms later?]

SLIDE 11

Viewpoint Prediction using Deep Learning

§ Example: for the viewpoint trace during t ∈ (3, 5) s, the model outputs a probability for each of the 72 tiles

[Figure: example per-tile probabilities, e.g. 0.37, 0.21, 0.11, 0.10, 0.05, 0.04, 0.03, with most remaining tiles < 0.01]
SLIDE 12

Viewpoint Prediction using Deep Learning

[Architecture diagram: viewpoint features → two layers of LSTM units → fully connected layer → softmax layer → predicted viewpoint]

§ Dataset: head motion traces of 36,000 viewers over 7 days for 19 360-degree/VR videos; one trace point every 200 ms
§ Training data: 45,000 head motion sample traces (each 2 s long)
§ Test data: 5,000 head motion sample traces (from viewers not in the training data)
§ Parameters:
  • first layer: 128 LSTM units; second layer: 128 LSTM units; fully connected layer: 72 nodes
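A minimal Keras sketch matching the parameters listed above (two LSTM layers of 128 units each, a 72-node fully connected layer with softmax output). The transposed input shape (10 timestamps x 72 tile features), optimizer, and loss are assumptions, not the authors' exact training setup.

```python
import tensorflow as tf

# Minimal sketch of the two-layer LSTM described on this slide; the input is
# the 2 s viewpoint trace (10 timestamps x 72 one-hot tile features, i.e. the
# 72x10 matrix transposed). Optimizer and loss are assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(128, return_sequences=True, input_shape=(10, 72)),  # first LSTM layer
    tf.keras.layers.LSTM(128),                                               # second LSTM layer
    tf.keras.layers.Dense(72, activation="softmax"),                         # FC + softmax: per-tile probability
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) would then be run on the 45,000 training traces described above.
```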

§ We explore four deep learning or classical machine learning models for viewpoint prediction: LSTM, stacked sparse autoencoders (SAE), bootstrap-aggregated decision trees (BT), and weighted k-nearest neighbors (kNN)
  • SAE: two fully connected layers with 100 and 80 nodes respectively; BT: ensembles of 30 bagged decision trees; kNN: 100 nearest neighbors

SLIDE 13

Predictive View Generation: Accuracy and Bitrate

§ FOV prediction accuracy:
  • the probability that the actual user view will be within the predicted FOV
  • depends on the LSTM model accuracy and the FOV generation method, and thus reflects the performance of both

§ FOV generation method (see the sketch below):
  • Select the m tiles with the highest probabilities predicted by the LSTM model
  • Compose the predicted FOV as the combination of the FOVs for each selected tile
  • Transmit the predicted FOV with high quality while leaving the rest of the tiles blank
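The selection step can be sketched as follows; generate_predicted_fov is an assumed name, and fov_tiles_for_tile (which tiles a ~90°x90° FOV centered on a given tile covers) is a hypothetical helper, not the authors' code.

```python
import numpy as np

# Sketch of the FOV generation step above: keep the m tiles with the highest
# predicted viewpoint probabilities and form the predicted FOV as the union of
# the FOVs centered on those tiles. `fov_tiles_for_tile` (tiles covered by a
# ~90x90 degree FOV centered on a tile) is a hypothetical helper.
def generate_predicted_fov(tile_probs, m, fov_tiles_for_tile):
    """tile_probs: length-72 model output; returns the set of tiles to stream."""
    top_tiles = np.argsort(tile_probs)[::-1][:m]            # m most likely viewpoint tiles
    predicted_fov = set()
    for tile in top_tiles:
        predicted_fov |= set(fov_tiles_for_tile(tile))      # union of per-tile FOVs
    return predicted_fov  # streamed in high quality; remaining tiles left blank
```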

SLIDE 14

Predictive View Generation: Accuracy and Bitrate

§ Tradeoff between FOV size (bitrate) and FOV prediction accuracy:
  • Prediction accuracy is 100% if the predicted FOV is the whole 360-degree view, but at a very high bitrate
  • By selecting more tiles (i.e., larger m) with high viewpoint probability, we can achieve higher FOV prediction accuracy but also higher bitrate

§ Use the choice of m to achieve the desired tradeoff between FOV prediction accuracy and bandwidth consumed (see the sketch below):
  • Larger m → higher bandwidth but better FOV prediction accuracy
  • Smaller m → lower bandwidth but higher risk to FOV prediction accuracy
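One possible way to explore this tradeoff (an illustrative sketch under assumed helpers, not the paper's evaluation code) is to sweep m, measure FOV prediction accuracy and pixel saving on held-out traces, and pick the smallest m meeting a target accuracy:

```python
# Illustrative sweep over m (not the paper's evaluation code): for each m,
# measure FOV prediction accuracy (true viewpoint tile inside the predicted
# FOV) and pixel saving (fraction of the 72 tiles left blank), and return the
# smallest m reaching a target accuracy such as ~95%. `predict_probs` and
# `make_fov` (e.g. the generate_predicted_fov sketch above) are assumed helpers.
def choose_m(test_samples, predict_probs, make_fov, target_acc=0.95, n_tiles=72):
    for m in range(1, n_tiles + 1):
        hits, saving = 0, 0.0
        for features, true_tile in test_samples:        # held-out traces
            fov = make_fov(predict_probs(features), m)  # predicted FOV tiles
            hits += int(true_tile in fov)
            saving += 1.0 - len(fov) / n_tiles
        acc = hits / len(test_samples)
        if acc >= target_acc:
            return m, acc, saving / len(test_samples)
    return n_tiles, 1.0, 0.0
```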

SLIDE 15

Predictive View Generation: Accuracy and Bitrate

§ FOV prediction accuracy and pixel savings obtained when selecting different numbers of tiles (i.e., the choice of m) to generate the FOV

[Plots: FOV prediction accuracy and pixel saving versus number of tiles m, medium motion sequence]

SAE: stacked sparse autoencoders; BT: bootstrap-aggregated decision trees; kNN: weighted k-nearest neighbors

SLIDE 16

Predictive View Generation: Accuracy and Bitrate

§ As the number of tiles m increases, FOV prediction accuracy continuously increases while pixel saving simultaneously decreases → a tradeoff between FOV prediction accuracy and pixel saving

[Plots: FOV prediction accuracy and pixel saving versus number of tiles m, high motion sequence]

SAE: stacked sparse autoencoders; BT: bootstrap-aggregated decision trees; kNN: weighted k-nearest neighbors

SLIDE 17

Predictive View Generation: Accuracy and Bitrate

§ Results below (for one strategy) show it is possible to achieve high FOV prediction accuracy while significantly reducing the bitrate (to a rate comparable to a real-time generated FOV)
§ We set the FOV prediction accuracy to ~95% and compare the pixel savings achieved by each model

Kong VR
Model | Medium Motion Sequence: FOV Acc. (%) / Pixel Saving (%) | High Motion Sequence: FOV Acc. (%) / Pixel Saving (%)
SAE   | 95.0 / 34.0 | 95.0 / 3.9
LSTM  | 95.5 / 55.7 | 95.0 / 43.7
BT    | 95.0 / 14.8 | 95.2 / 14.4
kNN   | 94.8 / 12.0 | 95.3 / 12.0

Model | Fashion Show: FOV Acc. (%) / Pixel Saving (%) | Whale Encounter: FOV Acc. (%) / Pixel Saving (%) | Roller Coaster: FOV Acc. (%) / Pixel Saving (%)
SAE   | 95.4 / 52.7 | 95.1 / 46.8 | 95.3 / 29.9
LSTM  | 95.2 / 69.7 | 95.5 / 66.8 | 95.2 / 71.0
BT    | 95.3 / 19.1 | 95.0 / 18.6 | 95.2 / 48.9
kNN   | 94.9 / 12.0 | 95.2 / 10.3 | 95.1 / 21.2

SAE: stacked sparse autoencoders; BT: bootstrap-aggregated decision trees; kNN: weighted k-nearest neighbors

SLIDE 18

Summary and Future Work

§ We propose a predictive view generation approach to reduce the latency and bandwidth needed to deliver 360-degree videos and cloud/edge-based VR applications, leading to better mobile VR experiences
§ We present a multi-layer LSTM model which can learn general head motion patterns and predict the future viewpoint based on past traces
§ Our method shows good results on a real head motion trace dataset and great potential to reduce bandwidth

Future Work:
§ Adaptive streaming using our trained model (tiles with different video quality)
§ 3DoF → 6DoF: view prediction considering body motion (6DoF) and hand motion

[Illustration: 3 Degrees of Freedom (3DoF) vs. 6 Degrees of Freedom (6DoF); photo source: Qualcomm]

SLIDE 19

Thanks!

Email: dey@ece.ucsd.edu