

SLIDE 1

CSElab

http://www.cse-lab.ethz.ch

Harnessing Wake Vortices for Efficient Collective Swimming via Deep Reinforcement Learning

Siddhartha Verma

With: Guido Novati and Petros Koumoutsakos

SLIDE 2

Collective Swimming

Credit: Artbeats

  • Hydrodynamic benefit of swimming in groups
  • Are wake vortices exploited by fish for propulsion?
  • But schools evolve dynamically

  • Theoretical work on schooling & formation swimming: Breder (1965), Weihs (1973, 1975), Shaw (1978)
  • Experiments: Abrahams & Colgan (1985), Herskin & Steffensen (1998), Svendsen (2003), Killen et al. (2011)
  • Simulations with pre-assigned, fixed formations: Hemelrijk et al. (2015), Daghooghi & Borazjani (2015), Maertens et al. (2017)
SLIDE 3
THIS TALK: Adaptive Collective Swimming

  • Goal: maximize energy efficiency
  • No positional or formation constraints
  • Autonomous decision-making capability, based on learning from experience

SLIDE 4

The Need for Control

  • Without control, trailing fish may get ejected from the leader’s wake
  • Coordinated swimming through an unsteady flow field requires:
      • the ability to observe the environment
      • the decision to react appropriately
  • The swimmers learn how to interact with the environment
  • HERE: Deep Reinforcement Learning. Goal: energy extraction from the vortex wake
  • Prior work @ CSE Lab: “vanilla” reinforcement learning. Goal: follow the leader (Novati et al., Bioinspir. Biomim. 2017)

SLIDE 5

Reinforcement Learning

  • An agent learns the best action through trial-and-error interaction with the environment
  • Credit assignment:
      • Agent receives feedback
      • Actions have consequences
      • Reward (feedback) is delayed
  • Goal: maximize cumulative future reward
      • Specify what to do, not how to do it

    Q^\pi(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid a_k = \pi(s_k)\ \forall k > t \right]
                    = \mathbb{E}\left[ r_{t+1} + \gamma\, Q^\pi(s_{t+1}, \pi(s_{t+1})) \right] \qquad \text{(Bellman, 1957)}

  • Q-learning
      • POLICY for taking the best ACTION in a given STATE
      • Expected reward updated in previously visited states

Credit: https://www.cs.utexas.edu/~eladlieb/RLRG.html
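A minimal tabular Q-learning sketch of the update described above, in Python. The environment interface (env.reset / env.step) and the hyperparameter values are illustrative assumptions, not the talk's implementation.

    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=500,
                   alpha=0.1, gamma=0.9, eps=0.1):
        # Q-table: expected cumulative future reward for each (state, action) pair
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy: mostly exploit the current estimate, sometimes explore
                a = np.random.randint(n_actions) if np.random.rand() < eps else int(Q[s].argmax())
                s_next, r, done = env.step(a)          # assumed environment API
                # Bellman backup: expected reward updated in previously visited states
                target = r + (0.0 if done else gamma * Q[s_next].max())
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q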

SLIDE 6

Deep Reinforcement Learning

  • V. Mnih et al. "Human-level control through deep reinforcement learning." Nature (2015)

Acting: at each iteration

  • agent is in a state s
  • select action a:
      • greedy: based on max Q(s, a, w)
      • explore: random
  • observe new state s’ and reward r
  • store the tuple { s, a, s’, r } in memory

  • Stable algorithm for training NN surrogates of Q
  • Sample past transitions: experience replay
      • Break correlations in the data
      • Learn from all past policies
  • “Frozen” target Q-network to avoid oscillations

Learning: at each iteration

  • sample a tuple { s, a, s’, r } (or a batch)
  • update with respect to the target computed with the old weights:

    \frac{\partial}{\partial w} \left( r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w) \right)^2

  • periodically update the fixed weights: w^- \leftarrow w
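A minimal Python sketch of this acting/learning loop with experience replay and a frozen target network. The q_net / target_net objects and their predict / update / get_weights / set_weights methods are assumed placeholders for a neural-network surrogate of Q, not the talk's actual code.

    import random
    from collections import deque
    import numpy as np

    replay = deque(maxlen=100_000)          # experience-replay memory of (s, a, s', r, done)

    def act(q_net, s, n_actions, eps=0.1):
        # epsilon-greedy action selection on the current Q-network
        if random.random() < eps:
            return random.randrange(n_actions)          # explore
        return int(np.argmax(q_net.predict(s)))         # greedy: max_a Q(s, a, w)

    def dqn_update(q_net, target_net, batch_size=32, gamma=0.99):
        # sample past transitions: breaks correlations and reuses all past policies
        batch = random.sample(replay, batch_size)
        for s, a, s_next, r, done in batch:
            target = r if done else r + gamma * float(np.max(target_net.predict(s_next)))
            q_net.update(s, a, target)                   # gradient step on (target - Q(s, a, w))^2

    def sync_target(q_net, target_net):
        target_net.set_weights(q_net.get_weights())      # periodic w^- <- w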

SLIDE 7

Actions, States, Reward

States:

  • Orientation relative to the leader: Δx, Δy, θ
  • Current shape of the body (manoeuvre)
  • Time since the previous tail beat: Δt

Actions:

  • Decrease curvature
  • Increase curvature
  • Turn and modulate velocity by controlling body deformation

Reward: based on swimming efficiency
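An illustrative Python encoding of this state/action interface; the field names and the tuple type for the body-shape parameters are assumptions for illustration, not the talk's code.

    from dataclasses import dataclass

    @dataclass
    class FollowerState:
        dx: float           # streamwise displacement relative to the leader (Δx)
        dy: float           # lateral displacement relative to the leader (Δy)
        theta: float        # orientation relative to the leader (θ)
        body_shape: tuple   # current body-deformation (manoeuvre) parameters
        dt_beat: float      # time since the previous tail beat (Δt)

    # discrete action set: modulate the body curvature
    ACTIONS = ("decrease_curvature", "increase_curvature")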

SLIDE 8

After training: Efficiency-maximizing ‘Follower’

  • Smart follower stays in line with the leader
  • Decides on its own the best strategy
  • Free to swim outside the wake’s influence
  • Compared to a solitary swimmer with identical muscle movements
      • Presence/absence of the wake is the only difference
  • Energetics: the smart follower exploits the wake
  • Head synchronised with the lateral flow velocity

            η      Speed   CoT    PDef
    Smart   1.32   1.11    0.64   0.71
    Solo    1      1       1      1

SLIDE 9
  • How does the smart follower’s behaviour evolve during training?
  • Why the peaks in the distribution?

[Histograms: first 10,000 transitions vs. last 10,000 transitions]

SLIDE 10

SLIDE 11

Sequence of events

  • Snapshot when η is maximum
  • Wake vortex (W1) lifts up the boundary layer on the swimmer’s body (L1)
  • Lifted vortex generates a secondary vortex (S1)
  • Secondary vortex: high-speed region => suction due to low pressure
  • Flow-induced force + body deformation determine Pdef (muscle use)
  • Low Pdef values are preferable

SLIDE 12

Implementing the Learned Strategy in 3D

  • Target coordinates: maxima in the velocity correlation
  • PID controller (see the sketch below):
      • Modulate the follower’s undulations (curvature + amplitude)
      • Maintain the specified target position
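A minimal sketch of the PID idea on this slide: a controller that could modulate the follower's undulation (curvature and amplitude) to hold the specified target position. The gains, time step, and error definition are illustrative assumptions, not the talk's values.

    class PID:
        """Proportional-integral-derivative controller acting on a scalar error."""
        def __init__(self, kp, ki, kd, dt):
            self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
            self.integral = 0.0
            self.prev_err = 0.0

        def __call__(self, err):
            self.integral += err * self.dt
            deriv = (err - self.prev_err) / self.dt
            self.prev_err = err
            return self.kp * err + self.ki * self.integral + self.kd * deriv

    # e.g. drive the curvature modulation from the streamwise error to the target point
    curvature_ctrl = PID(kp=1.0, ki=0.1, kd=0.05, dt=0.01)
    # correction = curvature_ctrl(x_target - x_follower)   # applied every control step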
SLIDE 13


SLIDE 14

3D Wake Interactions

  • Wake interactions benefit the follower
  • 11.6% increase in efficiency, 5.3% reduction in CoT
  • Oncoming wake-vortex ring is intercepted
  • Generates a new ‘lifted-vortex’ ring (LR)
  • Similar to the 2D case

[Plot: swimming efficiency η vs. time t for the leader and the follower]

SLIDE 15


11% increase in efficiency for each follower

SLIDE 16

Summary

  • Autonomous swimmer learns to exploit unsteady fluctuations in the velocity field
  • Decides to interact with the wake, even when free to swim clear
  • Large energetic savings, without loss in speed (improvements of 30% in 2D and 11% in 3D)


Swimming via Reinforcement Learning: an effective and robust method for harnessing energy from unsteady flow

NEXT: Energy-efficient swarms of drones?

SLIDE 17

Backup

SLIDE 18

Reacting to an erratic leader

Two fish swimming together in Greece
Two fish swimming together in the Swiss supercomputer

Note: Reward allotted here has no connection to relative displacement

SLIDE 19

Robustness: Responds Effectively to Perturbations

  • Agent never experienced deviations in the leader’s behaviour during training


  • But analogous situations encountered during training (random actions during learning)
  • Agent reacts appropriately to maximise cumulative reward
SLIDE 20

Numerical methods

  • 2D: Remeshed vortex method on a wavelet-based adaptive grid
      • Cost-effective compared to uniform grids
      • Rossinelli et al., J. Comput. Phys. (2015)
  • Brinkman penalization accounts for the fluid-solid interaction
      • Angot et al., Numerische Mathematik (1999)
  • Solve the vorticity form of the incompressible Navier-Stokes equations:

    \frac{\partial \omega}{\partial t} + \underbrace{u \cdot \nabla \omega}_{\text{advection}}
      = \underbrace{\omega \cdot \nabla u}_{0 \text{ in 2D}}
      + \underbrace{\nu \nabla^2 \omega}_{\text{diffusion}}
      + \underbrace{\lambda \nabla \times \left( \chi (u_s - u) \right)}_{\text{penalization}}

  • 3D: Finite differences with pressure projection (Chorin 1968)
      • Rossinelli et al., SC'13 Proc. Int. Conf. High Perf. Comput., Denver, Colorado
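A minimal numpy sketch of one explicit update of this vorticity equation on a uniform periodic grid, only to make the terms concrete (the talk uses remeshed vortex methods on an adaptive grid, so this is not the solver). Array layout (indexed [y, x]), explicit Euler time stepping, and periodic boundaries are assumptions.

    import numpy as np

    def step_vorticity(w, u, v, chi, us, vs, nu, lam, h, dt):
        """One explicit Euler step of dw/dt = -u.grad(w) + nu*lap(w) + lam*curl(chi*(u_s - u)); 2D, so no stretching."""
        d_dx = lambda f: (np.roll(f, -1, axis=1) - np.roll(f, 1, axis=1)) / (2 * h)
        d_dy = lambda f: (np.roll(f, -1, axis=0) - np.roll(f, 1, axis=0)) / (2 * h)

        advection = u * d_dx(w) + v * d_dy(w)
        diffusion = nu * (np.roll(w, -1, 1) + np.roll(w, 1, 1) +
                          np.roll(w, -1, 0) + np.roll(w, 1, 0) - 4 * w) / h**2

        # Brinkman penalization force f = lam * chi * (u_s - u); take its scalar 2D curl
        fx = lam * chi * (us - u)
        fy = lam * chi * (vs - v)
        penalization = d_dx(fy) - d_dy(fx)

        return w + dt * (-advection + diffusion + penalization)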

SLIDE 21

Reinforcement Learning: Reward

Goal #1: learn to stay behind the leader

Reward: vertical displacement

    R_{\Delta y} = 1 - \frac{|\Delta y|}{0.5 L}

  • Failure condition: stray too far or collide with the leader

    R_{\text{end}} = -1

Goal #2: learn to maximise swimming efficiency

Reward: efficiency

    R_\eta = \frac{P_{\text{thrust}}}{P_{\text{thrust}} + \max(P_{\text{def}}, 0)}
           = \frac{T\,|u_{CM}|}{T\,|u_{CM}| + \max\left( \int_{\partial\Omega} F(x) \cdot u_{\text{def}}(x)\, dx,\ 0 \right)}

  (T |u_{CM}| is the thrust power; the surface integral of F · u_def is the deformation power.)
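A small Python sketch of the efficiency reward above, evaluating R_eta = P_thrust / (P_thrust + max(P_def, 0)) from discretized surface data. The array shapes and the panel-wise surface-integral approximation are assumptions for illustration.

    import numpy as np

    def efficiency_reward(thrust, u_cm, surface_force, u_def, dA):
        """thrust: scalar T; u_cm: centre-of-mass velocity vector;
        surface_force, u_def: (N, dim) arrays on N surface panels; dA: panel areas, shape (N,)."""
        p_thrust = thrust * np.linalg.norm(u_cm)                          # T |u_CM|
        p_def = np.sum(np.einsum('ij,ij->i', surface_force, u_def) * dA)  # approx. of the surface integral of F . u_def
        return p_thrust / (p_thrust + max(p_def, 0.0))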

SLIDE 22

Reinforcement Learning: Basic idea

  • Credit assignment: the agent receives feedback


Example: Maze solving
  • State: agent’s position (A)
  • Actions: go U, D, L, R
  • Reward: −1 per step taken, 0 at terminal state

  • An agent learns the best action through trial-and-error interaction with the environment
  • Actions have long-term consequences
  • Reward (feedback) is delayed
  • Goal: maximize cumulative future reward
      • Specify what to do, not how to do it

    Q^\pi(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid a_k = \pi(s_k)\ \forall k > t \right]
                    = \mathbb{E}\left[ r_{t+1} + \gamma\, Q^\pi(s_{t+1}, \pi(s_{t+1})) \right] \qquad \text{(Bellman, 1957)}

  • Now we have a policy
  • Expected reward updated in previously visited states

Credit: https://www.cs.utexas.edu/~eladlieb/RLRG.html
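A minimal gridworld sketch matching this maze example (state = agent position, actions U/D/L/R, reward −1 per step and 0 at the goal), usable with the tabular q_learning sketch given earlier. The 4x4 layout and the goal cell are illustrative assumptions.

    class Maze:
        """Tiny deterministic gridworld; the only walls are the domain boundary."""
        def __init__(self, size=4, goal=(3, 3)):
            self.size, self.goal = size, goal
            self.moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # U, D, L, R

        def reset(self):
            self.pos = (0, 0)
            return self._state()

        def _state(self):
            return self.pos[0] * self.size + self.pos[1]   # flatten (row, col) into a state index

        def step(self, a):
            dr, dc = self.moves[a]
            r = min(max(self.pos[0] + dr, 0), self.size - 1)
            c = min(max(self.pos[1] + dc, 0), self.size - 1)
            self.pos = (r, c)
            done = self.pos == self.goal
            return self._state(), (0.0 if done else -1.0), done

    # e.g. Q = q_learning(Maze(), n_states=16, n_actions=4)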

SLIDE 23

Recurrent Neural Network

[Diagram: three stacked LSTM layers map the state sequence to one Q-value per action, q_n(a_1), …, q_n(a_5)]
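A minimal PyTorch sketch of such a recurrent Q-network: stacked LSTM layers followed by a linear head that outputs one Q-value per action. The layer sizes, number of layers, and the five-action output are illustrative assumptions.

    import torch.nn as nn

    class RecurrentQNet(nn.Module):
        def __init__(self, state_dim, n_actions=5, hidden=64, n_layers=3):
            super().__init__()
            self.lstm = nn.LSTM(state_dim, hidden, num_layers=n_layers, batch_first=True)
            self.head = nn.Linear(hidden, n_actions)

        def forward(self, states, hc=None):
            # states: (batch, time, state_dim); hc: optional recurrent state carried across calls
            out, hc = self.lstm(states, hc)
            q_values = self.head(out[:, -1])        # Q-value for each action at the last time step
            return q_values, hc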

SLIDE 24

A flexible maneuvering model

  • Modified midline kinematics preserves the travelling wave
  • Each action prescribes a point of the control spline

[Diagram: baseline travelling wave of curvature plus a travelling control spline; the “decrease curvature” and “increase curvature” actions shift the spline points prescribed at times c, c+¼, c+½, c+¾, c+1]
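A minimal Python sketch of this idea: a baseline travelling wave of curvature along the midline, plus an additive control spline that the discrete actions nudge up or down. The amplitude envelope, period, and wavelength values are illustrative assumptions, not the talk's kinematic model.

    import numpy as np

    def midline_curvature(s, t, period=1.0, wavelength=1.0, control_spline=None):
        """Curvature k(s, t) along the arclength s in [0, 1] at time t."""
        envelope = 0.1 + 0.4 * s                                          # amplitude grows toward the tail
        k = envelope * np.sin(2 * np.pi * (t / period - s / wavelength))  # baseline travelling wave
        if control_spline is not None:                                    # learned modulation, e.g. a cubic spline
            k = k + control_spline(s)
        return k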

SLIDE 25

Examples

  • Reducing local curvature
  • Increasing local curvature
  • Chain of actions
  • The effect of an action depends on when the action is made

SLIDE 26

Simulation Cost (2D)

  • Production runs (Re = 5000)
      • Domain: [0,1] x [0,1]
      • Resolution: 8192 x 8192
      • 1600 points along the fish midline
      • Running with 24 threads (12 hyper-threaded cores, Piz Daint)
      • 10 tail-beat cycles: 27,000 time steps
      • Approx. 96 core-hours: 1 second/step
  • Wavelet-based adaptive grid: https://github.com/cselab/MRAG-I2D, Rossinelli et al., JCP (2015)
  • Training simulations (lower resolution)
      • Resolution: 2048 x 2048
      • 10 tail-beat cycles: 36 core-hours
      • Learning converges in 150,000 tail-beats
      • 0.54 million core-hours per learning episode
SLIDE 27

Simulation Cost (3D)

  • Production runs (Re = 5000)
      • Domain: [0, 1] x [0, 0.5] x [0, 0.25]
      • Resolution: 6144 x 3072 x 768
      • 600 points along the fish midline
      • Running on 128 nodes, 24 threads each (hybrid MPI + OpenMP, Piz Daint)
      • 10 tail-beat cycles: 21,000 time steps
      • Approx. 37,000 core-hours: 3 seconds/time step
  • Uniform-grid finite-volume solver: Rossinelli et al., SC'13 Proc. Int. Conf. High Perf. Comput., Denver, Colorado
  • Training simulations (lower resolution)
      • Resolution: 2048 x 1024 x 512
      • Expected: 1.2 million core-hours per learning episode