Branes with Brains
Reinforcement learning in the landscape of intersecting brane worlds
FABIAN RUEHLE (UNIVERSITY OF OXFORD) String_Data 2017, Boston 11/30/2017
Based on [work in progress] with Brent Nelson and Jim Halverson
Motivation
✦ Supervised learning: train the machine by telling it what to do
✦ Unsupervised learning: let the machine train without telling it what to do
✦ Reinforcement learning: based on behavioral psychology [Sutton, Barto ’98, ’17]
✦ Don’t tell the machine exactly what to do, but reward “good” and/or punish “bad” actions
✦ AI = reinforcement learning + deep learning (neural networks) [Silver ’16]
Reinforcement learning in the string landscape:
✦ States: the degrees of freedom parameterizing the string vacuum
✦ Actions change the state and are rewarded (the action led to a more realistic vacuum) or punished (the action led to a less realistic vacuum)
✦ The agent maximizes its reward; an episode ends when the agent reaches a realistic vacuum or gives up
Setup: intersecting D6-branes on orbifolds of toroidal orientifolds, wrapping cycles that preserve common supercharges. Realistic vacua are rare and hard to detect:
✦ Use symmetries to relate different vacua
✦ Combine consistency conditions to rule out combinations
⇒ “Interesting” solutions are extremely rare even in enormous random scans (frequency estimated at 1:10^9) [Blumenhagen, Gmeiner, Honecker, Lust, Weigand ’04 ’05; Douglas, Taylor ’07, …] [Ibanez, Uranga ’12]
[Figure: the three two-tori with coordinates (x1, y1), (x2, y2), (x3, y3)]
Orientifold: (x1, y1) → (x1, −y1) (plus something similar for the string itself)
Orbifold T² → T²/Z₂: (x1, y1) → (−x1, −y1)
Winding numbers (n, m) on each torus, e.g. (n, m) = (1, 0), (0, 1), (1, 2). Note: due to the orientifold, include the image brane (n, −m) along with (n, m).
[Figure: a brane wrapping the three two-tori] A stack of D6-branes wrapping a 3D cycle on T² × T² × T² ⇔ (N, n1, m1, n2, m2, n3, m3)
Special cases (D6-branes on top of each other or on top of the orientifold planes) change the gauge group; particles arise in the representations:
✦ U(N): fundamental N
✦ SO(2N), Sp(N): fundamental N
✦ Intersection of two stacks with N and M branes: bifundamental (N, M)(1,−1)
Target gauge group SU(3) × SU(2) × U(1)Y with matter spectrum:
$$3 \times (3, 2)_{1} + 3 \times (\bar{3}, 1)_{-4} + 3 \times (\bar{3}, 1)_{2} + 4 \times (1, 2)_{-3} + 1 \times (1, 2)_{3} + 3 \times (1, 1)_{6}$$
(quarks, leptons + Higgs; obtaining the massless hypercharge U(1)Y is more subtle, see the massless U(1) condition below)
[Figure: branes intersecting on T² × T² × T²] The number of chiral families is the product of the intersection numbers on the three tori, e.g. 3 · 1 · 1 = 3.
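As a minimal illustration, the family number of a pair of stacks can be computed directly. The per-torus formula $I^i_{ab} = n^a_i m^b_i - m^a_i n^b_i$ is the standard intersection number for branes on tori; it is not spelled out on the slide, and the example winding numbers below are invented:

```python
def intersection_number(stack_a, stack_b):
    """Total intersection number of two brane stacks on T^2 x T^2 x T^2.

    Stacks are tuples (N, n1, m1, n2, m2, n3, m3); the total intersection
    number is the product of the per-torus numbers n^a m^b - m^a n^b.
    """
    total = 1
    for i in range(3):
        na, ma = stack_a[1 + 2 * i], stack_a[2 + 2 * i]
        nb, mb = stack_b[1 + 2 * i], stack_b[2 + 2 * i]
        total *= na * mb - ma * nb
    return total

# Example: 3 intersections on the first torus, 1 on each of the others.
a = (3, 3, 1, 1, 0, 1, 0)   # illustrative winding numbers
b = (2, 0, 1, 1, 1, 1, 1)
print(intersection_number(a, b))   # 3 * 1 * 1 = 3
```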
Tadpole cancellation:
$$\sum_{a=1}^{\#\text{stacks}} \begin{pmatrix} N^a\, n_1^a n_2^a n_3^a \\ -N^a\, n_1^a m_2^a m_3^a \\ -N^a\, m_1^a n_2^a m_3^a \\ -N^a\, m_1^a m_2^a n_3^a \end{pmatrix} = \begin{pmatrix} 8 \\ 4 \\ 4 \\ 8 \end{pmatrix}$$
K-theory constraints:
$$\sum_{a=1}^{\#\text{stacks}} \begin{pmatrix} 2N^a\, m_1^a m_2^a m_3^a \\ -N^a\, m_1^a n_2^a n_3^a \\ -N^a\, n_1^a m_2^a n_3^a \\ -2N^a\, n_1^a n_2^a m_3^a \end{pmatrix} \equiv \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix} \mod \begin{pmatrix} 2 \\ 2 \\ 2 \\ 2 \end{pmatrix}$$
SUSY conditions, for all $a = 1, \ldots, \#\text{stacks}$ (with moduli-dependent parameters $j, k, \ell$):
$$m_1^a m_2^a m_3^a - j\, m_1^a n_2^a n_3^a - k\, n_1^a m_2^a n_3^a - \ell\, n_1^a n_2^a m_3^a = 0$$
$$n_1^a n_2^a n_3^a - j\, n_1^a m_2^a m_3^a - k\, m_1^a n_2^a m_3^a - \ell\, m_1^a m_2^a n_3^a > 0$$
Massless U(1) for SU(3) × SU(2) × U(1): the generator $T = (T_1, T_2, \ldots, T_k)$, with $k = \#U(N)$ stacks, must satisfy
$$\begin{pmatrix} 2N^1 m_1^1 & 2N^2 m_1^2 & \cdots & 2N^k m_1^k \\ 2N^1 m_2^1 & 2N^2 m_2^2 & \cdots & 2N^k m_2^k \\ 2N^1 m_3^1 & 2N^2 m_3^2 & \cdots & 2N^k m_3^k \end{pmatrix} \cdot \begin{pmatrix} T_1 \\ T_2 \\ \vdots \\ T_k \end{pmatrix} = 0$$
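A minimal sketch of these checks in Python, assuming stacks are encoded as tuples (N, n1, m1, n2, m2, n3, m3) as above; the function names and the fixed moduli parameters j, k, l are illustrative, not from the talk:

```python
def tadpole(stacks):
    """Four tadpole contributions, summed over all stacks
    (N, n1, m1, n2, m2, n3, m3); must equal (8, 4, 4, 8)."""
    t = [0, 0, 0, 0]
    for (N, n1, m1, n2, m2, n3, m3) in stacks:
        t[0] += N * n1 * n2 * n3
        t[1] -= N * n1 * m2 * m3
        t[2] -= N * m1 * n2 * m3
        t[3] -= N * m1 * m2 * n3
    return t

def k_theory_ok(stacks):
    """K-theory constraints: the four sums must vanish mod 2."""
    k = [0, 0, 0, 0]
    for (N, n1, m1, n2, m2, n3, m3) in stacks:
        k[0] += 2 * N * m1 * m2 * m3
        k[1] -= N * m1 * n2 * n3
        k[2] -= N * n1 * m2 * n3
        k[3] -= 2 * N * n1 * n2 * m3
    return all(x % 2 == 0 for x in k)

def susy_ok(stack, j, k, l):
    """SUSY equality and inequality for a single stack,
    given moduli-dependent parameters j, k, l."""
    _, n1, m1, n2, m2, n3, m3 = stack
    eq = m1 * m2 * m3 - j * m1 * n2 * n3 - k * n1 * m2 * n3 - l * n1 * n2 * m3
    ineq = n1 * n2 * n3 - j * n1 * m2 * m3 - k * m1 * n2 * m3 - l * m1 * m2 * n3
    return eq == 0 and ineq > 0

def consistent(stacks, j, k, l):
    return (tadpole(stacks) == [8, 4, 4, 8]
            and k_theory_ok(stacks)
            and all(susy_ok(s, j, k, l) for s in stacks))
```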
⇒ $\binom{N_B}{N_S}$ combinations (with winding numbers up to $w_{\max}$ and $N = 1, 2, 3, \ldots$) after symmetry reduction
✦ At time step $t$, the agent in state $s_t \in S_{\text{total}}$ chooses an action $a_t \in A$
✦ Policy: $\pi : S_{\text{total}} \to A$
✦ The action $a_t$ produces a reward $r_t \in \mathbb{R}$, given by a reward function $R : S_{\text{total}} \times A \to \mathbb{R}$, and a new state $s_{t+1}$
✦ Return, with discount factor $\gamma \in (0, 1]$: $G_t = \sum_{k=1}^{\infty} \gamma^k r_{t+k}$
✦ Value function $v(s)$: the expected return from state $s$
✦ Advantage $\mathrm{Adv} = r - v$ (“how much better than expected has the action turned out to be”)
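A toy numerical illustration of these definitions; the numbers are arbitrary, and using the truncated return in place of r in the advantage is one common choice, not necessarily the talk’s:

```python
# Toy illustration of return and advantage (all numbers arbitrary).
gamma = 0.9
rewards = [0.0, 1.0, 0.0, 5.0]   # r_{t+1}, r_{t+2}, ... (finite episode)

# G_t = sum_{k=1}^{inf} gamma^k r_{t+k}, truncated to the episode
G_t = sum(gamma ** (k + 1) * r for k, r in enumerate(rewards))

v_s = 2.0                        # value estimate v(s_t)
advantage = G_t - v_s            # "how much better than expected"
print(G_t, advantage)            # approx. 4.0905 and 2.0905
```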
Some RL algorithms [Sutton, Barto ’98]:
✦ Temporal difference learning
✦ SARSA
✦ Q-learning
✦ Deep Q-Network [Mnih et al ’15]
✦ Asynchronous advantage actor-critic (A3C) [Mnih et al ’16]
✦ Variations/extensions: Wolpertinger [Dulac-Arnold et al ’16], Rainbow [Hessel et al ’17]
→ more in my breakout group on Friday
⇒ A3C architecture: one global instance of the policy/value network, plus workers 1, …, n, each with its own policy/value network, input, and environment. The workers run simultaneously and asynchronously to estimate the value and optimize the policy.
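A minimal sketch of this asynchronous update pattern, for illustration only: a real A3C worker computes policy/value gradients from environment rollouts, whereas here a placeholder “gradient” stands in:

```python
import threading

# Shared "global instance" of the parameters (stand-in for the network).
global_params = [0.0] * 8
lock = threading.Lock()

def worker(seed, n_updates=1000):
    """Each worker copies the global parameters, computes a local update,
    and applies it asynchronously to the global instance."""
    for _ in range(n_updates):
        local = list(global_params)                    # sync a local copy
        grad = [1e-4 * (p + seed + 1) for p in local]  # placeholder gradient
        with lock:                                     # apply to global instance
            for i, g in enumerate(grad):
                global_params[i] -= g

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(global_params)
```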
Implementation: ChainerRL on top of OpenAI Gym [Brockman et al ’16]
✦ Environment (string landscape): methods make env, reset, step
✦ To choose: method (A3C, DQN, …), NN architecture (FF, LSTM, …), action space, observation (state) space
(Note: the latter two are binary, which makes it hard to define a distance)
Note: this only works if good states are “close by” in this sense…
State: $s_t = [(N^1, n_1^1, m_1^1, n_2^1, m_2^1, n_3^1, m_3^1), (N^2, n_1^2, \ldots), \ldots]$, with $|S_{\text{total}}| = N_{\max} \binom{N_B}{N_S}$
Actions: $A = \{N^a \to N^a \pm 1,\ \text{add stack } (N, n_1, \ldots),\ \text{remove stack } (N, n_1, \ldots)\}$
Reward $R$ for a state $s_t \in S_{\text{total}}$: scores how well the consistency conditions are satisfied, whether the gauge group SU(3) × SU(2) × U(1) and the spectrum (Q, u, d, L, Hu, Hd, e) are realized, and the number of stacks $N_S$.
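A minimal Gym-style environment skeleton for this search. This is a sketch under assumptions: the class name, the action encoding, and the simple tadpole-only reward shaping are illustrative, not the talk’s actual code; tadpole() is the checker sketched above:

```python
import random

class BraneEnv:
    """Gym-style environment: states are lists of brane stacks
    (N, n1, m1, n2, m2, n3, m3); follows the reset/step convention."""

    def __init__(self, w_max=2, n_max=8):
        self.w_max, self.n_max = w_max, n_max
        self.stacks = []

    def _random_stack(self):
        N = random.randint(1, self.n_max)
        windings = [random.randint(-self.w_max, self.w_max) for _ in range(6)]
        return tuple([N] + windings)

    def reset(self):
        self.stacks = [self._random_stack()]
        return list(self.stacks)

    def _reward(self):
        # Toy shaping: penalize the distance to the tadpole condition
        # (the full reward also scores K-theory, SUSY, gauge group, spectrum).
        t = tadpole(self.stacks)   # checker from the sketch above
        return -sum(abs(x - y) for x, y in zip(t, [8, 4, 4, 8]))

    def step(self, action):
        kind, payload = action
        if kind == "add":
            self.stacks.append(payload)
        elif kind == "remove" and payload in self.stacks:
            self.stacks.remove(payload)
        elif kind == "change_N":
            idx, delta = payload
            s = self.stacks[idx]
            self.stacks[idx] = (max(1, s[0] + delta),) + s[1:]
        reward = self._reward()
        done = reward == 0   # all tadpoles cancel
        return list(self.stacks), reward, done, {}
```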
✦ Feed-forward NN with 2 hidden Softmax layers with 200 nodes each
✦ RNN with a linear layer (200 nodes) and an LSTM layer (128 nodes)
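A sketch of the feed-forward variant in Chainer (the layer sizes come from the slide; everything else, including treating it as a plain policy head, is an assumption):

```python
import chainer
import chainer.functions as F
import chainer.links as L

class FFPolicy(chainer.Chain):
    """Feed-forward net with 2 hidden softmax layers of 200 nodes,
    as listed on the slide; the output head is an assumption."""

    def __init__(self, n_obs, n_actions, n_hidden=200):
        super().__init__()
        with self.init_scope():
            self.l1 = L.Linear(n_obs, n_hidden)
            self.l2 = L.Linear(n_hidden, n_hidden)
            self.out = L.Linear(n_hidden, n_actions)

    def __call__(self, x):
        h = F.softmax(self.l1(x))      # hidden softmax layer 1
        h = F.softmax(self.l2(h))      # hidden softmax layer 2
        return F.softmax(self.out(h))  # action probabilities
```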
Results:
[Plots: mean scores vs. number of steps; log(average number of steps to solution) vs. log(number of steps); tadpole violation $\sum_i |8 - \text{Tadpole}_i(s)|$ vs. number of steps. Training runs over a few million steps, comparable to what is needed for Atari games.]
→ With conventional methods, no Standard Model has been found so far.