

SLIDE 1

CSElab

http://www.cse-lab.ethz.ch

Harnessing Wake Vortices for Efficient Collective Swimming via Deep Reinforcement Learning

Siddhartha Verma

With: Guido Novati and Petros Koumoutsakos

SLIDE 2

Collective Swimming

Credit: Artbeats

  • Hydrodynamic benefit of swimming in groups
  • Are wake vortices exploited by fish for propulsion?
  • But schools evolve dynamically

  • Theoretical work on schooling & formation swimming: Breder (1965), Weihs (1973, 1975), Shaw (1978)
  • Experiments: Abrahams & Colgan (1985), Herskin & Steffensen (1998), Svendsen (2003), Killen et al. (2011)
  • Simulations with pre-assigned, fixed formations: Hemelrijk et al. (2015), Daghooghi & Borazjani (2015), Maertens et al. (2017)
SLIDE 3
THIS TALK: Adaptive Collective Swimming

  • Goal: maximize energy efficiency
  • No positional or formation constraints
  • Autonomous decision-making capability, based on learning from experience

SLIDE 4

The Need for Control

  • Without control, trailing fish may get ejected from the leader’s wake
  • Coordinated swimming through an unsteady flow field requires:
      • the ability to observe the environment
      • the decision to react appropriately
  • The swimmers learn how to interact with the environment
  • HERE: Deep Reinforcement Learning. Goal: energy extraction from the vortex wake
  • Prior work @ CSE Lab: “vanilla” reinforcement learning. Goal: follow the leader (Novati et al., Bioinspir. Biomim. 2017)

SLIDE 5

Reinforcement Learning

  • An agent learns the best action through trial-and-error interaction with the environment
  • Credit assignment:
      • Agent receives feedback
      • Actions have consequences
      • Reward (feedback) is delayed
  • Goal: maximize cumulative future reward
      • Specify what to do, not how to do it

    Q^\pi(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid a_k = \pi(s_k)\ \forall k > t \right]
                    = \mathbb{E}\left[ r_{t+1} + \gamma\, Q^\pi(s_{t+1}, \pi(s_{t+1})) \right] \qquad \text{(Bellman, 1957)}

  • Q-learning
      • POLICY for taking the best ACTION in a given STATE
      • Expected reward updated in previously visited states

Credit: https://www.cs.utexas.edu/~eladlieb/RLRG.html
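A minimal tabular Q-learning sketch of the update described above, in Python. The environment interface (env.reset / env.step) and the hyperparameter values are illustrative assumptions, not the talk's implementation.

    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=500,
                   alpha=0.1, gamma=0.9, eps=0.1):
        # Q-table: expected cumulative future reward for each (state, action) pair
        Q = np.zeros((n_states, n_actions))
        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                # epsilon-greedy: mostly exploit the current estimate, sometimes explore
                a = np.random.randint(n_actions) if np.random.rand() < eps else int(Q[s].argmax())
                s_next, r, done = env.step(a)          # assumed environment API
                # Bellman backup: expected reward updated in previously visited states
                target = r + (0.0 if done else gamma * Q[s_next].max())
                Q[s, a] += alpha * (target - Q[s, a])
                s = s_next
        return Q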

SLIDE 6

Deep Reinforcement Learning

  • V. Mnih et al. "Human-level control through deep reinforcement learning." Nature (2015)

Acting: at each iteration

  • agent is in a state s
  • select action a:
      • greedy: based on max Q(s, a, w)
      • explore: random
  • observe new state s’ and reward r
  • store the tuple { s, a, s’, r } in memory

  • Stable algorithm for training NN surrogates of Q
  • Sample past transitions: experience replay
      • Break correlations in the data
      • Learn from all past policies
  • “Frozen” target Q-network to avoid oscillations

Learning: at each iteration

  • sample a tuple { s, a, s’, r } (or a batch)
  • update with respect to the target computed with the old weights:

    \frac{\partial}{\partial w} \left( r + \gamma \max_{a'} Q(s', a', w^-) - Q(s, a, w) \right)^2

  • periodically update the fixed weights: w^- \leftarrow w
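A minimal Python sketch of this acting/learning loop with experience replay and a frozen target network. The q_net / target_net objects and their predict / update / get_weights / set_weights methods are assumed placeholders for a neural-network surrogate of Q, not the talk's actual code.

    import random
    from collections import deque
    import numpy as np

    replay = deque(maxlen=100_000)          # experience-replay memory of (s, a, s', r, done)

    def act(q_net, s, n_actions, eps=0.1):
        # epsilon-greedy action selection on the current Q-network
        if random.random() < eps:
            return random.randrange(n_actions)          # explore
        return int(np.argmax(q_net.predict(s)))         # greedy: max_a Q(s, a, w)

    def dqn_update(q_net, target_net, batch_size=32, gamma=0.99):
        # sample past transitions: breaks correlations and reuses all past policies
        batch = random.sample(replay, batch_size)
        for s, a, s_next, r, done in batch:
            target = r if done else r + gamma * float(np.max(target_net.predict(s_next)))
            q_net.update(s, a, target)                   # gradient step on (target - Q(s, a, w))^2

    def sync_target(q_net, target_net):
        target_net.set_weights(q_net.get_weights())      # periodic w^- <- w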

SLIDE 7

Actions, States, Reward

States:

  • Orientation relative to the leader: Δx, Δy, θ
  • Current shape of the body (manoeuvre)
  • Time since the previous tail beat: Δt

Actions:

  • Decrease curvature
  • Increase curvature
  • Turn and modulate velocity by controlling body deformation

Reward: based on swimming efficiency
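An illustrative Python encoding of this state/action interface; the field names and the tuple type for the body-shape parameters are assumptions for illustration, not the talk's code.

    from dataclasses import dataclass

    @dataclass
    class FollowerState:
        dx: float           # streamwise displacement relative to the leader (Δx)
        dy: float           # lateral displacement relative to the leader (Δy)
        theta: float        # orientation relative to the leader (θ)
        body_shape: tuple   # current body-deformation (manoeuvre) parameters
        dt_beat: float      # time since the previous tail beat (Δt)

    # discrete action set: modulate the body curvature
    ACTIONS = ("decrease_curvature", "increase_curvature")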

SLIDE 8

After training: Efficiency-maximizing ‘Follower’

  • Smart follower stays in line with the leader
  • Decides on its own the best strategy
  • Free to swim outside the wake’s influence
  • Compared to a solitary swimmer with identical muscle movements
      • Presence/absence of the wake is the only difference
  • Energetics: the smart follower exploits the wake
  • Head synchronised with the lateral flow velocity

            η      Speed   CoT    PDef
    Smart   1.32   1.11    0.64   0.71
    Solo    1      1       1      1

SLIDE 9
  • How does the smart follower’s behaviour evolve during training?
  • Why the peaks in the distribution?

[Histograms: first 10,000 transitions vs. last 10,000 transitions]

SLIDE 10

SLIDE 11

Sequence of events

  • Snapshot when η is maximum
  • Wake vortex (W1) lifts up the boundary layer on the swimmer’s body (L1)
  • Lifted vortex generates a secondary vortex (S1)
  • Secondary vortex: high-speed region => suction due to low pressure
  • Flow-induced force + body deformation determine Pdef (muscle use)
  • Low Pdef values are preferable

SLIDE 12

Implementing the Learned Strategy in 3D

  • Target coordinates: maxima in the velocity correlation
  • PID controller (see the sketch below):
      • Modulate the follower’s undulations (curvature + amplitude)
      • Maintain the specified target position
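A minimal sketch of the PID idea on this slide: a controller that could modulate the follower's undulation (curvature and amplitude) to hold the specified target position. The gains, time step, and error definition are illustrative assumptions, not the talk's values.

    class PID:
        """Proportional-integral-derivative controller acting on a scalar error."""
        def __init__(self, kp, ki, kd, dt):
            self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
            self.integral = 0.0
            self.prev_err = 0.0

        def __call__(self, err):
            self.integral += err * self.dt
            deriv = (err - self.prev_err) / self.dt
            self.prev_err = err
            return self.kp * err + self.ki * self.integral + self.kd * deriv

    # e.g. drive the curvature modulation from the streamwise error to the target point
    curvature_ctrl = PID(kp=1.0, ki=0.1, kd=0.05, dt=0.01)
    # correction = curvature_ctrl(x_target - x_follower)   # applied every control step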
SLIDE 13


SLIDE 14

3D Wake Interactions

  • Wake interactions benefit the follower
  • 11.6% increase in efficiency, 5.3% reduction in CoT
  • Oncoming wake-vortex ring is intercepted
  • Generates a new ‘lifted-vortex’ ring (LR)
  • Similar to the 2D case

[Plot: swimming efficiency η vs. time t for the leader and the follower]

SLIDE 15


11% increase in efficiency for each follower

SLIDE 16

Summary

  • Autonomous swimmer learns to exploit unsteady fluctuations in the velocity field
  • Decides to interact with the wake, even when free to swim clear
  • Large energetic savings, without loss in speed (improvements of 30% in 2D and 11% in 3D)


Swimming via Reinforcement Learning: an effective and robust method for harnessing energy from unsteady flow

NEXT: Energy-efficient swarms of drones?

SLIDE 17

Backup

SLIDE 18

Reacting to an erratic leader

Two fish swimming together in Greece
Two fish swimming together in the Swiss supercomputer

Note: Reward allotted here has no connection to relative displacement

SLIDE 19

Robustness: Responds Effectively to Perturbations

  • Agent never experienced deviations in the leader’s behaviour during training


  • But analogous situations encountered during training (random actions during learning)
  • Agent reacts appropriately to maximise cumulative reward
SLIDE 20

Numerical methods

  • 2D: Remeshed vortex method on a wavelet-based adaptive grid
      • Cost-effective compared to uniform grids
      • Rossinelli et al., J. Comput. Phys. (2015)
  • Brinkman penalization accounts for the fluid-solid interaction
      • Angot et al., Numerische Mathematik (1999)
  • Solve the vorticity form of the incompressible Navier-Stokes equations:

    \frac{\partial \omega}{\partial t} + \underbrace{u \cdot \nabla \omega}_{\text{advection}}
      = \underbrace{\omega \cdot \nabla u}_{0 \text{ in 2D}}
      + \underbrace{\nu \nabla^2 \omega}_{\text{diffusion}}
      + \underbrace{\lambda \nabla \times \left( \chi (u_s - u) \right)}_{\text{penalization}}

  • 3D: Finite differences with pressure projection (Chorin 1968)
      • Rossinelli et al., SC'13 Proc. Int. Conf. High Perf. Comput., Denver, Colorado
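A minimal numpy sketch of one explicit update of this vorticity equation on a uniform periodic grid, only to make the terms concrete (the talk uses remeshed vortex methods on an adaptive grid, so this is not the solver). Array layout (indexed [y, x]), explicit Euler time stepping, and periodic boundaries are assumptions.

    import numpy as np

    def step_vorticity(w, u, v, chi, us, vs, nu, lam, h, dt):
        """One explicit Euler step of dw/dt = -u.grad(w) + nu*lap(w) + lam*curl(chi*(u_s - u)); 2D, so no stretching."""
        d_dx = lambda f: (np.roll(f, -1, axis=1) - np.roll(f, 1, axis=1)) / (2 * h)
        d_dy = lambda f: (np.roll(f, -1, axis=0) - np.roll(f, 1, axis=0)) / (2 * h)

        advection = u * d_dx(w) + v * d_dy(w)
        diffusion = nu * (np.roll(w, -1, 1) + np.roll(w, 1, 1) +
                          np.roll(w, -1, 0) + np.roll(w, 1, 0) - 4 * w) / h**2

        # Brinkman penalization force f = lam * chi * (u_s - u); take its scalar 2D curl
        fx = lam * chi * (us - u)
        fy = lam * chi * (vs - v)
        penalization = d_dx(fy) - d_dy(fx)

        return w + dt * (-advection + diffusion + penalization)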

SLIDE 21

Reinforcement Learning: Reward

Goal #1: learn to stay behind the leader

Reward: vertical displacement

    R_{\Delta y} = 1 - \frac{|\Delta y|}{0.5 L}

  • Failure condition: stray too far or collide with the leader

    R_{\text{end}} = -1

Goal #2: learn to maximise swimming efficiency

Reward: efficiency

    R_\eta = \frac{P_{\text{thrust}}}{P_{\text{thrust}} + \max(P_{\text{def}}, 0)}
           = \frac{T\,|u_{CM}|}{T\,|u_{CM}| + \max\left( \int_{\partial\Omega} F(x) \cdot u_{\text{def}}(x)\, dx,\ 0 \right)}

  (T |u_{CM}| is the thrust power; the surface integral of F · u_def is the deformation power.)
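A small Python sketch of the efficiency reward above, evaluating R_eta = P_thrust / (P_thrust + max(P_def, 0)) from discretized surface data. The array shapes and the panel-wise surface-integral approximation are assumptions for illustration.

    import numpy as np

    def efficiency_reward(thrust, u_cm, surface_force, u_def, dA):
        """thrust: scalar T; u_cm: centre-of-mass velocity vector;
        surface_force, u_def: (N, dim) arrays on N surface panels; dA: panel areas, shape (N,)."""
        p_thrust = thrust * np.linalg.norm(u_cm)                          # T |u_CM|
        p_def = np.sum(np.einsum('ij,ij->i', surface_force, u_def) * dA)  # approx. of the surface integral of F . u_def
        return p_thrust / (p_thrust + max(p_def, 0.0))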

SLIDE 22

Reinforcement Learning: Basic idea

  • Credit assignment: the agent receives feedback


Example: Maze solving
  • State: agent’s position (A)
  • Actions: go U, D, L, R
  • Reward: −1 per step taken, 0 at terminal state

  • An agent learns the best action through trial-and-error interaction with the environment
  • Actions have long-term consequences
  • Reward (feedback) is delayed
  • Goal: maximize cumulative future reward
      • Specify what to do, not how to do it

    Q^\pi(s_t, a_t) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid a_k = \pi(s_k)\ \forall k > t \right]
                    = \mathbb{E}\left[ r_{t+1} + \gamma\, Q^\pi(s_{t+1}, \pi(s_{t+1})) \right] \qquad \text{(Bellman, 1957)}

  • Now we have a policy
  • Expected reward updated in previously visited states

Credit: https://www.cs.utexas.edu/~eladlieb/RLRG.html
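A minimal gridworld sketch matching this maze example (state = agent position, actions U/D/L/R, reward −1 per step and 0 at the goal), usable with the tabular q_learning sketch given earlier. The 4x4 layout and the goal cell are illustrative assumptions.

    class Maze:
        """Tiny deterministic gridworld; the only walls are the domain boundary."""
        def __init__(self, size=4, goal=(3, 3)):
            self.size, self.goal = size, goal
            self.moves = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # U, D, L, R

        def reset(self):
            self.pos = (0, 0)
            return self._state()

        def _state(self):
            return self.pos[0] * self.size + self.pos[1]   # flatten (row, col) into a state index

        def step(self, a):
            dr, dc = self.moves[a]
            r = min(max(self.pos[0] + dr, 0), self.size - 1)
            c = min(max(self.pos[1] + dc, 0), self.size - 1)
            self.pos = (r, c)
            done = self.pos == self.goal
            return self._state(), (0.0 if done else -1.0), done

    # e.g. Q = q_learning(Maze(), n_states=16, n_actions=4)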

SLIDE 23

Recurrent Neural Network

[Diagram: three stacked LSTM layers map the state sequence to one Q-value per action, q_n(a_1), …, q_n(a_5)]
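A minimal PyTorch sketch of such a recurrent Q-network: stacked LSTM layers followed by a linear head that outputs one Q-value per action. The layer sizes, number of layers, and the five-action output are illustrative assumptions.

    import torch.nn as nn

    class RecurrentQNet(nn.Module):
        def __init__(self, state_dim, n_actions=5, hidden=64, n_layers=3):
            super().__init__()
            self.lstm = nn.LSTM(state_dim, hidden, num_layers=n_layers, batch_first=True)
            self.head = nn.Linear(hidden, n_actions)

        def forward(self, states, hc=None):
            # states: (batch, time, state_dim); hc: optional recurrent state carried across calls
            out, hc = self.lstm(states, hc)
            q_values = self.head(out[:, -1])        # Q-value for each action at the last time step
            return q_values, hc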

SLIDE 24

A flexible maneuvering model

  • Modified midline kinematics preserves the travelling wave
  • Each action prescribes a point of the control spline

[Diagram: baseline travelling wave of curvature plus a travelling control spline; the “decrease curvature” and “increase curvature” actions shift the spline points prescribed at times c, c+¼, c+½, c+¾, c+1]
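A minimal Python sketch of this idea: a baseline travelling wave of curvature along the midline, plus an additive control spline that the discrete actions nudge up or down. The amplitude envelope, period, and wavelength values are illustrative assumptions, not the talk's kinematic model.

    import numpy as np

    def midline_curvature(s, t, period=1.0, wavelength=1.0, control_spline=None):
        """Curvature k(s, t) along the arclength s in [0, 1] at time t."""
        envelope = 0.1 + 0.4 * s                                          # amplitude grows toward the tail
        k = envelope * np.sin(2 * np.pi * (t / period - s / wavelength))  # baseline travelling wave
        if control_spline is not None:                                    # learned modulation, e.g. a cubic spline
            k = k + control_spline(s)
        return k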

SLIDE 25

Examples

  • Reducing local curvature
  • Increasing local curvature
  • Chain of actions
  • The effect of an action depends on when the action is made

SLIDE 26

Simulation Cost (2D)

  • Production runs (Re = 5000)
      • Domain: [0,1] x [0,1]
      • Resolution: 8192 x 8192
      • 1600 points along the fish midline
      • Running with 24 threads (12 hyper-threaded cores, Piz Daint)
      • 10 tail-beat cycles: 27,000 time steps
      • Approx. 96 core-hours: 1 second/step
  • Wavelet-based adaptive grid: https://github.com/cselab/MRAG-I2D, Rossinelli et al., JCP (2015)
  • Training simulations (lower resolution)
      • Resolution: 2048 x 2048
      • 10 tail-beat cycles: 36 core-hours
      • Learning converges in 150,000 tail-beats
      • 0.54 million core-hours per learning episode
SLIDE 27

Simulation Cost (3D)

  • Production runs (Re = 5000)
      • Domain: [0, 1] x [0, 0.5] x [0, 0.25]
      • Resolution: 6144 x 3072 x 768
      • 600 points along the fish midline
      • Running on 128 nodes, 24 threads each (hybrid MPI + OpenMP, Piz Daint)
      • 10 tail-beat cycles: 21,000 time steps
      • Approx. 37,000 core-hours: 3 seconds/time step
  • Uniform-grid finite-volume solver: Rossinelli et al., SC'13 Proc. Int. Conf. High Perf. Comput., Denver, Colorado
  • Training simulations (lower resolution)
      • Resolution: 2048 x 1024 x 512
      • Expected: 1.2 million core-hours per learning episode