Learning to Control Complex Human Motions Using Reinforcement Learning
Libin Liu http://libliu.info
DeepMotion Inc http://deepmotion.com
Physics-based Character Animation: Motion Controller → Control Signal → Physics Engine → Character Animation
Examples: [Gang Beasts], [Totally Accurate Battle Simulator]
Designing Controllers for Locomotion
• Hand-crafted control policies [Hodgins et al. 1995]
• Simulating abstract models (SIMBICON, IPM, ZMP, …) [Yin et al. 2007]
• Optimization / policy search [Coros et al. 2010] [Tan et al. 2014]
• Reinforcement learning, actor-critic [Mordatch et al. 2010] [Peng et al. 2017]
Designing Controllers for Complex Motions
Designing Controllers for Complex Motions: Motion Clip → Tracking Controller
Tracking Control for Complex Human Motion: Motion Clip → Open-loop Tracking Control → Feedback Policy → Control Scheduler
Reinforcement Learning
• Feedback Policy: learned with Guided Policy Learning
• Control Scheduler: learned with Deep Q-Learning
Outline
• Constructing open-loop control: SAMCON (Sampling-based Motion Control)
• Guided learning of linear feedback policies
• Learning to schedule control fragments using deep Q-learning
Tracking Control
• PD servo: $\tau = k_p(\hat{q} - q) - k_d\,\dot{q}$, where $\hat{q}$ is the target pose
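A minimal sketch of this PD servo (an illustration only; the gain values and the per-joint vector layout are assumptions, not numbers from the talk):

```python
import numpy as np

def pd_torque(q, q_dot, q_target, kp=300.0, kd=30.0):
    """PD servo: tau = kp * (q_target - q) - kd * q_dot.

    q, q_dot, q_target are per-joint angle vectors; kp and kd are
    illustrative gains.
    """
    return kp * (np.asarray(q_target) - np.asarray(q)) - kd * np.asarray(q_dot)
```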
Mocap Clips as Tracking Target
Correction with Sampling
SAMCON
• SAmpling-based Motion CONtrol [Liu et al. 2010, 2015]
• Motion Clip → open-loop control trajectory
• Particle filtering / Sequential Monte Carlo: repeated sampling stages from the start to the end of the clip
SAMCON: the reference trajectory is divided into short segments of length $\delta t$ along the time axis (figure: state vs. time)
Sampling & Simulation: within each $\delta t$ segment, sample actions (PD-control targets) and simulate them forward from the current states
Resampling: keep the simulated samples that stay close to the reference trajectory and discard the rest
SAMCON Iterations: repeat sampling, simulation, and resampling segment by segment along the reference trajectory
Constructed Open-loop Control Trajectory: the surviving sequence of actions (PD-control targets) forms the open-loop control trajectory
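The following is a schematic sketch of the SAMCON sampling/resampling pass described on the preceding slides. It assumes a black-box simulator `simulate`, a tracking cost `cost`, and segments exposing a `ref_pose`; all of these names are placeholders, not the paper's implementation:

```python
import numpy as np

def samcon_pass(init_state, ref_segments, simulate, cost,
                n_samples=200, n_keep=20, noise_std=0.05, seed=0):
    """One SAMCON pass over the reference trajectory.

    For each delta-t segment: sample perturbed PD targets around the
    segment's reference pose, simulate them forward, and keep (resample)
    the trajectories that track the reference best.

    simulate(state, action, segment) -> next_state
    cost(state, segment)             -> distance to the reference (lower is better)
    segment.ref_pose                 -> reference pose vector for that segment
    """
    rng = np.random.default_rng(seed)
    particles = [(init_state, [])] * n_keep          # (state, actions-so-far)
    for segment in ref_segments:
        candidates = []
        for state, actions in particles:
            for _ in range(max(1, n_samples // n_keep)):
                # Sample a PD-control target near the reference pose.
                action = segment.ref_pose + noise_std * rng.standard_normal(len(segment.ref_pose))
                next_state = simulate(state, action, segment)
                candidates.append((cost(next_state, segment), next_state, actions + [action]))
        # Resampling: keep only the best-tracking samples.
        candidates.sort(key=lambda c: c[0])
        particles = [(s, a) for _, s, a in candidates[:n_keep]]
    return particles[0][1]                           # open-loop control trajectory
```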
Control Reconstruction
Linear Policy
• $\pi:\ \delta a = M\,\delta s + \hat{a}$, where $\delta s = s - \bar{s}$ is the deviation of the simulated state from the reference control trajectory
For Complex Motions
• Uniform segmentation of the control trajectory into control fragments
• A linear feedback policy per control fragment
Control Fragment
• A short control unit $\mathcal{F} : (\delta t, \hat{m}, \pi)$
• $\delta t$: duration, $\approx 0.1$ seconds
• $\hat{m}$: open-loop control segment
• $\pi$: linear feedback policy
Controller
• A chain of control fragments $\mathcal{F}_1, \mathcal{F}_2, \cdots, \mathcal{F}_L$
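A sketch of how a control fragment and its linear feedback policy could be represented; the field names (`delta_t`, `m_hat`, `M`, `s_bar`, `a_hat`) follow the notation above, but the structure itself is illustrative:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class ControlFragment:
    """One control fragment F = (delta_t, m_hat, pi)."""
    delta_t: float        # duration, roughly 0.1 s
    m_hat: np.ndarray     # open-loop control segment (PD targets over the fragment)
    M: np.ndarray         # linear feedback gain matrix
    s_bar: np.ndarray     # reference state feature at the fragment boundary
    a_hat: np.ndarray     # open-loop action offset

    def feedback(self, s: np.ndarray) -> np.ndarray:
        """Linear policy: delta_a = M (s - s_bar) + a_hat."""
        return self.M @ (s - self.s_bar) + self.a_hat

# A controller is a chain of fragments executed one after another:
# controller = [fragment_1, fragment_2, ..., fragment_L]
```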
Guided Learning of Control Policies
• Regression: fit a feedback policy to multiple open-loop solutions
• Guided learning: regenerate the open-loop solutions under the current policy, then refit the policy
Example: Cyclical Motion
• The cycle is segmented into control fragments $\mathcal{F}_1, \dots, \mathcal{F}_4$
• SAMCON provides state-action samples $(s, a)$ for each fragment
Policy Update
• Regress each fragment's linear feedback policy from its sampled state-action pairs
Guided Learning Iterations
• Alternate guided SAMCON (generate new open-loop samples with the current feedback policy) and regression (refit the policy on those samples)
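A high-level sketch of the guided-learning iteration, assuming a `run_guided_samcon` helper that returns per-fragment state/action samples and the `ControlFragment` fields from the earlier sketch; ordinary least squares stands in here for the regression step named on the slide:

```python
import numpy as np

def guided_learning(fragments, run_guided_samcon, n_iterations=10):
    """Alternate guided SAMCON and per-fragment regression.

    run_guided_samcon(fragments) -> list of (fragment_index, s, a) samples
    taken from successful open-loop solutions generated with the current
    feedback policies.
    """
    for _ in range(n_iterations):
        samples = run_guided_samcon(fragments)                 # guided SAMCON
        for i, frag in enumerate(fragments):                   # regression
            S = np.array([s for j, s, _ in samples if j == i])
            A = np.array([a for j, _, a in samples if j == i])
            if len(S) == 0:
                continue
            # Least-squares fit of a = M (s - s_bar) + a_hat.
            X = np.hstack([S - frag.s_bar, np.ones((len(S), 1))])
            W, *_ = np.linalg.lstsq(X, A, rcond=None)
            frag.M, frag.a_hat = W[:-1].T, W[-1]
    return fragments
```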
Control Graph
• A graph whose nodes are control fragments
• Converted from a motion graph
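A minimal sketch of a control graph as an adjacency structure over control fragments (an illustrative data layout, not the talk's code):

```python
class ControlGraph:
    """A graph whose nodes are control fragments.

    `fragments` holds the nodes; `edges[i]` lists the fragments that
    may be executed after fragment i.
    """
    def __init__(self):
        self.fragments = []
        self.edges = {}

    def add_fragment(self, fragment):
        idx = len(self.fragments)
        self.fragments.append(fragment)
        self.edges[idx] = []
        return idx

    def add_transition(self, src, dst):
        self.edges[src].append(dst)
```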
Problem of Fixed Time-Indexed Tracking: with a fixed time index, the simulated state can drift outside the basin of attraction of the scheduled reference
Scheduling: instead, choose a reference whose basin of attraction covers the current simulated state
Scheduling: which control fragment should be executed next?
Deep Q-Learning
• Learns to perform good actions from raw image input using a deep convolutional network [Mnih et al. 2015, DQN]
A Q-Network For Scheduling
• Input state $s$: motion state, environmental state, user command (18~25 DoFs)
• Two fully connected hidden layers of 300 ReLUs each ($\max(0, z)$)
• Output: one Q-value per action; the action set is the control fragments (39~146 actions)
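A sketch of the described Q-network with two hidden layers of 300 ReLUs, written in PyTorch (the framework is my choice for illustration; the dimensions use the upper ends of the ranges quoted on the slide):

```python
import torch
import torch.nn as nn

class SchedulingQNet(nn.Module):
    """Q-network: state features -> one Q-value per control fragment."""

    def __init__(self, state_dim=25, n_actions=146):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, 300), nn.ReLU(),   # 300 ReLUs
            nn.Linear(300, 300), nn.ReLU(),         # 300 ReLUs
            nn.Linear(300, n_actions),              # Q-values
        )

    def forward(self, state):
        return self.layers(state)

# q_values = SchedulingQNet()(torch.randn(1, 25))   # shape (1, 146)
```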
Training Pipeline: exploration/exploitation in simulation → rewards → replay buffer → batch SGD updates of the Q-network
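A compact sketch of the replay-buffer / batch-SGD update in the standard DQN style of [Mnih et al. 2015]; the target network, buffer size, and other hyper-parameters follow that generic recipe rather than the talk's exact settings:

```python
import random
from collections import deque
import torch
import torch.nn.functional as F

replay_buffer = deque(maxlen=100_000)   # stores (s, a, r, s_next) tensors

def dqn_update(q_net, target_net, optimizer, batch_size=64, gamma=0.95):
    """One batch-SGD step on transitions sampled from the replay buffer."""
    if len(replay_buffer) < batch_size:
        return
    s, a, r, s_next = (torch.stack(x) for x in zip(*random.sample(replay_buffer, batch_size)))
    q = q_net(s).gather(1, a.long().view(-1, 1)).squeeze(1)            # Q(s, a)
    with torch.no_grad():
        target = r + gamma * target_net(s_next).max(dim=1).values      # bootstrapped target
    loss = F.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```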
Reward Function: $R = E_{\mathrm{tracking}} + E_{\mathrm{preference}} + E_{\mathrm{feedback}} + E_{\mathrm{task}} + R_0$
Importance of the Reference Sequence: comparison between enforcing the original action sequence and not enforcing it
Tracking Penalty Term: out-of-sequence actions incur a penalty; in-sequence actions do not
Tracking Exploration Strategy
• With probability $\epsilon_1$: select a random action
• With probability $\epsilon_2$: select the in-sequence action
• Otherwise: select the greedy (highest Q-value) action
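A sketch of this exploration strategy; the names `eps_random` and `eps_track` and their values are placeholders for the two probabilities on the slide, with the remaining probability mass going to the greedy action:

```python
import random

def select_action(q_values, in_sequence_action, eps_random=0.1, eps_track=0.2):
    """Exploration biased toward the reference (in-sequence) action.

    q_values: list/array of Q-values, one per control fragment.
    """
    u = random.random()
    if u < eps_random:
        return random.randrange(len(q_values))                        # random action
    if u < eps_random + eps_track:
        return in_sequence_action                                     # in-sequence action
    return max(range(len(q_values)), key=lambda i: q_values[i])       # greedy action
```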
Bongo Board Balancing: action sequence
Effect of Feedback Policy: open-loop control fragments vs. feedback-augmented fragments
Discover New Transitions
Running
Tripping
Skateboarding
Skateboarding
Walking On A Ball
Push-Recovery
Conclusion: Motion Clip → Open-loop Tracking Control → Feedback Policy → Control Scheduler

Libin Liu, Michiel van de Panne, and KangKang Yin. 2016. Guided Learning of Control Graphs for Physics-Based Characters. ACM Trans. Graph. 35, 3, Article 29 (May 2016), 14 pages.
Libin Liu and Jessica Hodgins. 2017. Learning to Schedule Control Fragments for Physics-Based Characters Using Deep Q-Learning. ACM Trans. Graph. 36, 3, Article 29 (June 2017), 14 pages.
Future Work
• Statistical/generative models [Holden et al. 2017]
• Control with raw simulation state and terrain information [Peng et al. 2017, DeepLoco] [Heess et al. 2017]
• Active human-object interaction: basketball, soccer, dancing, boxing, martial arts
Questions? Libin Liu http://libliu.info DeepMotion Inc http://deepmotion.com