SLIDE 1

Branes with Brains

Reinforcement learning in the landscape of intersecting brane worlds

FABIAN RUEHLE (UNIVERSITY OF OXFORD) String_Data 2017, Boston 11/30/2017

Based on [work in progress] with Brent Nelson and Jim Halverson

SLIDE 2
  • Three approaches to machine learning:
  • Supervised Learning:

Train the machine by telling it what to do

  • Unsupervised Learning:

Let the machine train without telling it what to do

  • Reinforcement Learning:

Based on behavioral psychology

Don’t tell the machine exactly what to do but reward “good” and/or punish “bad” actions

AI = reinforcement learning + deep learning (neural networks)

Motivation - ML

[Sutton, Barto ’98 ’17]

[Silver ’16]

SLIDE 3
  • Agents interact with an environment (e.g. string landscape)
  • Each interaction changes the state of the agent, e.g. the dof’s parameterizing the string vacuum
  • Each step is either rewarded (action led to a more realistic vacuum) or punished (action led to a less realistic vacuum)
  • The agent acts with the aim of maximizing its long-term reward
  • Agent repeats actions until it is told to stop (found a realistic vacuum or gave up)

Motivation - RL

SLIDE 4
  • String Theory setup:
  • Intersecting D6-branes on orbifolds of toroidal orientifolds
  • Implementation in Reinforcement Learning (RL)
  • Basic overview
  • Implementing the RL code
  • Modelling the environment
  • Preliminary results
  • Finding consistent solutions
  • Conclusion

Outline

SLIDE 5

String Theory 101

Intersecting D6-branes on orbifolds of toroidal orientifolds

SLIDE 6
  • Have: IIA String Theory in 9D + time with 32 supercharges
  • Want: A theory in 3D + time with 4 supercharges
  • Idea: Make the extra 6D so small that we do not see them
  • How do we do that?
  • 1. Make them compact
  • 2. Make their diameter so small that our experiments cannot detect them
  • Reduce supercharges from 32 to 4:
  • Identify some points with their mirror image

String Theory 101

SLIDE 7
  • Why this setup?
  • Well studied
  • Comparatively simple
  • Number of (well-defined) solutions known to be finite:

Use symmetries to relate different vacua

Combine consistency conditions to rule out combinations

  • BUT: Number of possibilities so large that not a single “interesting” solution could be found despite enormous random scans (estimated at 1:10^9)

  • Seems Taylor-made for big data / AI methods

String Theory 101 - Setup

[Blumenhagen, Gmeiner, Honecker, Lüst, Weigand ’04 ’05; Douglas, Taylor ’07, …] [Ibáñez, Uranga ’12] [Douglas, Taylor ’07]

SLIDE 8
  • How to make a dimension compact? Pacman

String Theory 101 - Compactification

SLIDES 9–11
  • How to make a dimension compact? Pacman

String Theory 101 - Compactification

SLIDE 12

String Theory 101 - Compactification

[Figure: three two-tori with coordinates (x₁, y₁), (x₂, y₂), (x₃, y₃)]

  • Now six compact dimensions, but the idea is too simple
  • The resulting space is too simple (but just by a little bit)
  • Make it a bit more complicated
SLIDE 13

String Theory 101 - Orbifolds

  • Mathematically: T² → T²/ℤ₂ with (x₁, y₁) → (−x₁, −y₁)
  • Resulting object is called an orbifold
  • Need to also orientifold: (x₁, y₁) → (x₁, −y₁) (plus something similar for the string itself)

SLIDE 14

  • Winding numbers (n, m): e.g. (n, m) = (1, 0), (n, m) = (0, 1), (n, m) = (1, 2)
  • Note: Due to orientifold: include both (n, m) and (n, −m)

String Theory 101 - Winding numbers

SLIDE 15
  • D6 brane: our 3D + a line on each torus
  • Can stack multiple D6 branes on top of each other
  • Brane stacks ⇔ tuple (N, n₁, m₁, n₂, m₂, n₃, m₃)

String Theory 101 - D6 branes

[Figure: a brane stack as a line on each torus of 3D × T² × T² × T²]

SLIDE 16

String Theory 101 - Gauge group and particles

  • Observed gauge group: SU(3) × SU(2) × U(1)_Y
  • N D6 branes on top of each other ⇒ U(N)
  • Special cases:
  • N D6 branes parallel to O6-plane ⇒ SO(2N)
  • N D6 branes orthogonal to O6-plane ⇒ Sp(N)
  • Intersection of N-brane and M-brane stack: particles in representation (N, M)_{1,−1}
  • Observed particles in the universe:

3 × (3, 2)_1 + 3 × (3̄, 1)_{−4} + 3 × (3̄, 1)_2 + 4 × (1, 2)_{−3} + 1 × (1, 2)_3 + 3 × (1, 1)_6   (Quarks, Leptons + Higgs)

SLIDE 17
  • Green and yellow intersect in points
  • Note: Counting intersections on the orbifold is a bit more subtle

String Theory 101 - MSSM

[Figure: 3D × T² × T² × T²; per-torus intersections multiply: 3 · 1 · 1 = 3]
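Since the total intersection number is just the product of the per-torus counts, it is easy to compute. A minimal sketch (assumed helper, not code from the talk): on each T², branes with windings (nᵃ, mᵃ) and (nᵇ, mᵇ) intersect nᵃmᵇ − mᵃnᵇ times.

```python
def intersection_number(a, b):
    """Intersection number of stacks a, b = (N, n1, m1, n2, m2, n3, m3)."""
    I = 1
    for i in range(3):
        na, ma = a[1 + 2 * i], a[2 + 2 * i]   # windings of stack a on torus i
        nb, mb = b[1 + 2 * i], b[2 + 2 * i]   # windings of stack b on torus i
        I *= na * mb - ma * nb                # per-torus intersection count
    return I

# Example matching the slide: 3 intersections on the first torus, 1 on each other
print(intersection_number((1, 3, 1, 1, 0, 1, 0), (1, 0, 1, 1, 1, 1, 1)))  # 3 * 1 * 1 = 3
```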

SLIDE 18
  • Tadpole cancellation: Balance energy of D6 and O6:
  • K-Theory: Global consistency:

String Theory 101 - Consistency

$$\sum_{a=1}^{\#\text{stacks}} \begin{pmatrix} N^a\, n^a_1 n^a_2 n^a_3 \\ -N^a\, n^a_1 m^a_2 m^a_3 \\ -N^a\, m^a_1 n^a_2 m^a_3 \\ -N^a\, m^a_1 m^a_2 n^a_3 \end{pmatrix} = \begin{pmatrix} 8 \\ 4 \\ 4 \\ 8 \end{pmatrix}$$

$$\sum_{a=1}^{\#\text{stacks}} \begin{pmatrix} 2N^a\, m^a_1 m^a_2 m^a_3 \\ -N^a\, m^a_1 n^a_2 n^a_3 \\ -N^a\, n^a_1 m^a_2 n^a_3 \\ -2N^a\, n^a_1 n^a_2 m^a_3 \end{pmatrix} \equiv \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix} \pmod{2}$$
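A minimal sketch of these two checks (assumed helpers, not the authors’ code), accumulating the four tadpole contributions of a list of stacks and testing the K-theory sums mod 2:

```python
import numpy as np

def tadpole(stacks):
    """Sum the four tadpole contributions of stacks (N, n1, m1, n2, m2, n3, m3)."""
    total = np.zeros(4, dtype=int)
    for N, n1, m1, n2, m2, n3, m3 in stacks:
        total += np.array([ N * n1 * n2 * n3,
                           -N * n1 * m2 * m3,
                           -N * m1 * n2 * m3,
                           -N * m1 * m2 * n3])
    return total

def tadpole_cancelled(stacks):
    """Tadpole condition: contributions must balance the O6-plane charge (8, 4, 4, 8)."""
    return np.array_equal(tadpole(stacks), np.array([8, 4, 4, 8]))

def k_theory_ok(stacks):
    """K-theory condition: the four sums above must vanish mod 2."""
    total = np.zeros(4, dtype=int)
    for N, n1, m1, n2, m2, n3, m3 in stacks:
        total += np.array([2 * N * m1 * m2 * m3,
                           -N * m1 * n2 * n3,
                           -N * n1 * m2 * n3,
                           -2 * N * n1 * n2 * m3])
    return bool(np.all(total % 2 == 0))
```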

SLIDE 19
  • SUSY (computational control): ∀a = 1, …, #stacks:

$$m^a_1 m^a_2 m^a_3 - j\, m^a_1 n^a_2 n^a_3 - k\, n^a_1 m^a_2 n^a_3 - \ell\, n^a_1 n^a_2 m^a_3 = 0$$

$$n^a_1 n^a_2 n^a_3 - j\, n^a_1 m^a_2 m^a_3 - k\, m^a_1 n^a_2 m^a_3 - \ell\, m^a_1 m^a_2 n^a_3 > 0$$

  • Pheno: SU(3) × SU(2) × U(1) + particles
  • U(1) is massless iff (with T = (T₁, T₂, …, T_k), k = #U(N) stacks):

$$\begin{pmatrix} 2N^1 m^1_1 & 2N^2 m^2_1 & \cdots & 2N^k m^k_1 \\ 2N^1 m^1_2 & 2N^2 m^2_2 & \cdots & 2N^k m^k_2 \\ 2N^1 m^1_3 & 2N^2 m^2_3 & \cdots & 2N^k m^k_3 \end{pmatrix} \cdot \begin{pmatrix} T_1 \\ T_2 \\ \vdots \\ T_k \end{pmatrix} = 0$$

String Theory 101 - Consistency

SLIDE 20
  • State space gigantic
  • Choose a maximal value w_max for winding numbers
  • Let N_B be the number of possible winding number combinations (up to w_max) after symmetry reduction
  • Let N_S be the maximal number of stacks
  • Allows for $\binom{N_B}{N_S}$ combinations
  • Note: Each stack can have N = 1, 2, 3, … branes

String Theory 101 - IIA state space
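To get a feeling for the size (the numbers below are hypothetical illustrations, not estimates from the talk):

```python
import math

# Hypothetical scales, for illustration only: even modest values of N_B and N_S
# give an astronomically large state space.
N_B = 10**5   # hypothetical count of winding combinations after symmetry reduction
N_S = 8       # hypothetical maximal number of stacks
print(f"{math.comb(N_B, N_S):.3e}")   # ~2.5e+35 combinations, before choosing N per stack
```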

SLIDE 21

Reinforcement learning

SLIDE 22
  • At time t, agent in state s_t ∈ S_total
  • Select action a_t from action space A based on policy π : S_total → A
  • Receive reward r_t ∈ ℝ for action a_t based on reward function R : S_total × A → ℝ
  • Transition to the next state s_{t+1}
  • Try to maximize long-term return G_t = Σ_{k≥1} γ^k r_{t+k}, with γ ∈ (0, 1]
  • Keep track of state value v(s) (“how good is the state”)
  • Compute advantage estimate Adv = r − v (“how much better than expected has the action turned out to be”)

Reinforcement learning - Overview
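A minimal sketch (not code from the talk) of these two quantities, following the slide’s conventions:

```python
def discounted_return(rewards, gamma=0.99):
    """G_t from the rewards r_{t+1}, r_{t+2}, ... collected after time t."""
    return sum(gamma**k * r for k, r in enumerate(rewards, start=1))

def advantage(reward, value_estimate):
    """Adv = r - v: how much better the action did than the critic expected."""
    return reward - value_estimate

print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.9**3 * 1.0 = 0.729
```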

SLIDE 23
  • How to maximize future return?
  • Depends on policy π
  • Several approaches
  • Tabular (small state/action spaces): [Sutton, Barto ’98]

Temporal difference learning

SARSA

Q-learning (see the sketch below)

  • Deep RL (large/infinite state/action spaces):

Deep Q-Network [Mnih et al ’15]

Asynchronous advantage actor-critic (A3C) [Mnih et al ’16]

Variations/extensions: Wolpertinger [Dulac-Arnold et al ’16], Rainbow [Hessel et al ’17] → my breakout group on Friday

Reinforcement Learning - Overview
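For orientation, the tabular Q-learning update in its standard form (a generic sketch, not from the talk):

```python
from collections import defaultdict

Q = defaultdict(float)   # Q-table over (state, action) pairs, default value 0

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```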

SLIDE 24

Reinforcement Learning - A3C

[Diagram: a global instance (input → policy/value network) and n workers, each with its own input, policy/value network, and environment]

SLIDE 25
  • Asynchronous: Have n workers explore the environment simultaneously and asynchronously
  • improves training stability (experience of workers separated)
  • improves exploration
  • Advantage: Use advantage estimate to update policy
  • Actor-critic: To maximize return, need to know state or action value and optimize policy
  • Methods like Q-learning focus on the value function
  • Methods like policy-gradient focus on the policy
  • AC: Use value estimate (“critic”) to update policy (“actor”)

Reinforcement Learning - A3C

SLIDE 26

Chainer RL

  • Open AI Gym: Interface between agent (RL) and environment (string landscape) [Brockman et al ’16]
  • We provide the environment
  • We use ChainerRL’s implementation of A3C for the agent

Reinforcement Learning - Implementation

Environment:
  • step: go to new state; return (new_state, reward, done, comment)
  • reset: reset episode; return start_state

make env:
  • make environment
  • specify RL method (A3C, DQN, …)
  • specify policy NN architecture (FF, LSTM, …)
  • specify action space and observation (state) space
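The environment side then only has to implement step and reset. A minimal Gym-style skeleton (class name and internals are assumptions, not the authors’ code):

```python
import gym
from gym import spaces

class BraneEnv(gym.Env):
    """States are lists of brane stacks; actions add/remove/modify stacks."""

    def __init__(self, actions, reward_fn, done_fn):
        self.actions = actions                      # callables: state -> new state
        self.reward_fn = reward_fn                  # e.g. tadpole-based reward
        self.done_fn = done_fn                      # "realistic vacuum found" check
        self.action_space = spaces.Discrete(len(actions))
        self.state = []                             # start from the empty stack list

    def step(self, action):
        self.state = self.actions[action](self.state)  # go to new state
        reward = self.reward_fn(self.state)
        done = self.done_fn(self.state)
        return self.state, reward, done, {}            # (new_state, reward, done, comment)

    def reset(self):
        self.state = []                                # reset episode
        return self.state                              # return start_state
```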

SLIDE 27
  • State space: $s_t = [(N^1, n^1_1, m^1_1, n^1_2, m^1_2, n^1_3, m^1_3), (N^2, n^2_1, \ldots), \ldots]$, with $|S_{\text{total}}| = N_{\text{max}} \binom{N_B}{N_S}$
  • Action space $A = \{N^a \to N^a \pm 1,\ \text{add stack}\ (N, n_1, \ldots),\ \text{remove stack}\ (N, n_1, \ldots)\}$
  • Reward R: Need a notion of “how good a state s_t ∈ S_total is” (see the sketch below)
  • 1. By how much does a set of stacks violate the tadpole?
  • 2. Is a set of stacks fully consistent (Tadpole, K-Theory, SUSY)? (Note: the latter two are binary, hard to define distance)
  • 3. How far is the state from the Standard Model?
  • Missing a group factor of SU(3) × SU(2) × U(1)?
  • Too few Standard Model particles (Q, u, d, L, Hu, Hd, e)?
  • Extra exotics (particles charged under the Standard Model but not observed so far)?

Note: Only works if good states are “close by” in this sense…

Reinforcement learning - Model the environment
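A toy reward combining the three criteria, reusing the helpers sketched above (the overall structure is an assumption; the per-step punishment Σᵢ |8 − Tadpoleᵢ(s)| and the consistency rewards are the values quoted on the results slides below):

```python
def reward(stacks, j=1, k=1, l=1):
    """Toy reward: punish tadpole violation, reward passing consistency checks."""
    r = -float(sum(abs(8 - t) for t in tadpole(stacks)))  # per-step punishment
    if tadpole_cancelled(stacks):
        r += 1e7                                          # rewards (TC, K-Th, SUSY)
    if k_theory_ok(stacks):                               # = (1e7, 1e8, 1e9), as on
        r += 1e8                                          # the full-consistency slide
    if susy_ok(stacks, j, k, l):
        r += 1e9
    return r
```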

SLIDE 28
  • Parameters:
  • 16 or 32 workers (1 CPU, 16–32 threads, 2.6 GHz)
  • Training time of the order of 10h
  • Neural networks for value and policy evaluation:

Feed-forward NN with 2 hidden softmax layers with 200 nodes

RNN with linear (200 nodes) and LSTM layer (128 nodes)

  • Initial state: Empty stack
  • Maximal steps per episode: 10,000 – 250,000
  • 10 evaluation runs every 100,000 steps

Preliminary results
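As a rough picture of the feed-forward variant above, a minimal Chainer sketch with two hidden softmax layers of 200 nodes and separate policy/value heads (my reconstruction of the stated sizes, not the authors’ code):

```python
import chainer
import chainer.functions as F
import chainer.links as L

class FFPolicyValue(chainer.Chain):
    """Two hidden softmax layers (200 nodes each) with policy and value heads."""

    def __init__(self, n_actions, n_hidden=200):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, n_hidden)       # input size inferred lazily
            self.l2 = L.Linear(n_hidden, n_hidden)
            self.pi = L.Linear(n_hidden, n_actions)  # policy head (action logits)
            self.v = L.Linear(n_hidden, 1)           # value head

    def __call__(self, x):
        h = F.softmax(self.l1(x))                    # hidden softmax layer 1
        h = F.softmax(self.l2(h))                    # hidden softmax layer 2
        return F.softmax(self.pi(h)), self.v(h)      # action probabilities, state value
```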

SLIDE 29
  • Tadpole cancellation
  • Maximum of 10,000 steps in an episode
  • Reward for tadpole cancellation: 10⁶
  • Punishment for step: Σᵢ |8 − Tadpoleᵢ(s)|

Preliminary results - Tadpole cancellation

[Plots: mean scores vs. # steps; log(average # steps to solution) vs. log(# steps)]

SLIDE 30

Preliminary results - Tadpole cancellation

  • Tadpole cancellation
  • Maximum of 50,000 steps in an episode
  • Reward for tadpole cancellation: 10⁶
  • Punishment for step: Σᵢ |8 − Tadpoleᵢ(s)|

[Plots: mean scores vs. # steps; log(average # steps to solution) vs. log(# steps)]

SLIDE 31

Preliminary results - Tadpole cancellation

  • Tadpole cancellation
  • Maximum of 250,000 steps in an episode
  • Reward for tadpole cancellation: 10⁶
  • Punishment for step: Σᵢ |8 − Tadpoleᵢ(s)|

[Plots: mean scores vs. # steps; log(average # steps to solution) vs. log(# steps)]

SLIDE 32

Preliminary results - Tadpole+K-Theory+SUSY

  • Full consistency (TC + K-Th + SUSY)
  • Maximum of 10,000 steps in an episode
  • Rewards: (TC, K-Th, SUSY) = (10⁷, 10⁸, 10⁹)
  • Punishment for step: Σᵢ |8 − Tadpoleᵢ(s)|

[Plot: mean scores (10⁶–10⁹, log scale) vs. # steps]

SLIDE 33
  • Reinforcement learning is a very promising approach to AI + ML
  • A3C performs very well for different environments (mostly tested on Atari games)
  • Type II orientifold setup well-suited for landscape analysis
  • Physics well-understood
  • Number of configurations too large to approach with conventional methods → no Standard Model found so far
  • Preliminary results:
  • A3C works well for consistency (Tadpole, K-Theory, SUSY)
  • Getting close to SM

Conclusion

SLIDE 34

Thank you for your attention!