Midterm Postmortem - CSE 473: Artificial Intelligence



1. Midterm Postmortem
   CSE 473: Artificial Intelligence - Reinforcement Learning
   Dan Weld, University of Washington
   - It was long, hard...
   - Max: 41
   - Min: 13
   - Mean & Median: 27
   - Final: will include some of the midterm problems
   [Most of these slides were created by Dan Klein and Pieter Abbeel for CS188 Intro to AI at UC Berkeley. All CS188 materials are available at http://ai.berkeley.edu.]

   Office Hour Change (this week)
   - Thurs 10-11am
   - CSE 588
   - (Not Fri)
   Cartoon caption: "Listen Simkins, when I said that you could always come to me with your problems, I meant during office hours!"

   Two Key Ideas
   - Credit assignment problem
   - Exploration-exploitation tradeoff

   Reinforcement Learning
   Diagram: the agent acts on the environment and the environment responds. State: s, Actions: a, Reward: r.
   - Basic idea:
     - Receive feedback in the form of rewards
     - Agent's utility is defined by the reward function
     - Must (learn to) act so as to maximize expected rewards
     - All learning is based on observed samples of outcomes!
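The agent-environment loop on the last slide above is the core of every method in this deck. Below is a minimal Python sketch of that loop, purely for illustration; the env object (with reset() and step(action) returning (next_state, reward, done)) and the RandomAgent are hypothetical stand-ins, not course code. The point is that every (state, action, reward, next_state) sample the loop produces is exactly the kind of observed outcome that learning is based on.

    import random

    class RandomAgent:
        """Placeholder agent: no learning yet, just illustrates the interface."""
        def __init__(self, actions):
            self.actions = actions

        def choose_action(self, state):
            return random.choice(self.actions)

        def observe(self, state, action, reward, next_state):
            # A real RL agent would update its estimates from this sample.
            pass

    def run_episode(env, agent):
        """Run one episode of the agent-environment loop."""
        state = env.reset()
        total_reward = 0.0
        done = False
        while not done:
            action = agent.choose_action(state)          # agent picks an action a
            next_state, reward, done = env.step(action)  # environment returns r and s'
            agent.observe(state, action, reward, next_state)
            total_reward += reward
            state = next_state
        return total_reward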

2. The "Credit Assignment" Problem
   (built up one line at a time across several slides; the ditto marks repeat "I'm in state ...")
   - I'm in state 43, reward = 0, action = 2
   - I'm in state 39, reward = 0, action = 4
   - I'm in state 22, reward = 0, action = 1
   - I'm in state 21, reward = 0, action = 1
   - I'm in state 21, reward = 0, action = 1
   - I'm in state 13, reward = 0, action = 2

3. The "Credit Assignment" Problem (continued)
   - I'm in state 54, reward = 0, action = 2
   - I'm in state 26, reward = 100
   "Yippee! I got to a state with a big reward! But which of my actions along the way actually helped me get there?? This is the Credit Assignment problem."

   Exploration-Exploitation Tradeoff
   - You have visited part of the state space and found a reward of 100
     - Is this the best you can hope for???
   - Exploitation: should I stick with what I know and find a good policy w.r.t. this knowledge?
     - At risk of missing out on a better reward somewhere
   - Exploration: should I look for states with more reward?
     - At risk of wasting time & getting some negative reward
   (One common way to balance the two, epsilon-greedy action selection, is sketched in code after this slide group.)

   Example: Animal Learning
   - RL studied experimentally for more than 60 years in psychology
   - Rewards: food, pain, hunger, drugs, etc.
   - Mechanisms and sophistication debated
   - Example: foraging
     - Bees learn a near-optimal foraging plan in a field of artificial flowers with controlled nectar supplies
     - Bees have a direct neural connection from nectar intake measurement to motor planning area

   Demos
   - http://inst.eecs.berkeley.edu/~ee128/fa11/videos.html

   Example: Backgammon
   - Reward only for win / loss in terminal states, zero otherwise
   - TD-Gammon learns a function approximation to V(s) using a neural network
   - Combined with depth-3 search, one of the top 3 players in the world
   - You could imagine training Pacman this way...
   - ... but it's tricky! (It's also P3)
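As referenced above, epsilon-greedy action selection is one standard way to trade off exploration against exploitation. It is not defined on these particular slides, so treat the following Python sketch as an illustration under that assumption; q_values is a hypothetical dictionary of current action-value estimates keyed by (state, action).

    import random

    def epsilon_greedy(state, actions, q_values, epsilon=0.1):
        """With probability epsilon, explore a random action;
        otherwise exploit the action with the best current estimate."""
        if random.random() < epsilon:
            return random.choice(actions)  # explore: maybe find a better reward somewhere
        # exploit: stick with what currently looks best
        return max(actions, key=lambda a: q_values.get((state, a), 0.0))

A larger epsilon means more exploration (more chance of discovering a better reward) at the cost of more time spent on apparently worse actions, which is exactly the tradeoff the slide describes.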

4. Example: Learning to Walk [Kohl and Stone, ICRA 2004]
   - Slides show the initial gait, a learning trial, and the gait after learning (1K trials)
   - [Video: AIBO WALK - initial] [Video: AIBO WALK - finished]

   Example: Sidewinding [Andrew Ng]
   - [Video: SNAKE - climbStep+sidewin...]

   The Crawler!
   - Video of Demo Crawler Bot [Demo: Crawler Bot (L10D1)] [You, in Proj...]

5. Other Applications
   - Robotic control: helicopter maneuvering, autonomous vehicles
   - Mars rover: path planning, oversubscription planning
   - Elevator planning
   - Game playing: backgammon, tetris, checkers
   - Neuroscience
   - Computational finance, sequential auctions
   - Assisting the elderly in simple tasks
   - Spoken dialog management
   - Communication networks: switching, routing, flow control
   - War planning, evacuation planning

   Reinforcement Learning
   - Still assume a Markov decision process (MDP):
     - A set of states s ∈ S
     - A set of actions (per state) A
     - A model T(s, a, s')
     - A reward function R(s, a, s') & discount γ
   - Still looking for a policy π(s)
   - New twist: we don't know T or R
     - I.e., we don't know which states are good or what the actions do
     - Must actually try actions and states out to learn

   Overview: Offline (MDPs) vs. Online (RL)
   - Offline planning (MDPs): value iteration, policy iteration -> offline solution
   - Online: reinforcement learning -> learning
     - Model-based vs. model-free
     - Passive vs. active

   Passive Reinforcement Learning
   - Simplified task: policy evaluation
     - Input: a fixed policy π(s)
     - You don't know the transitions T(s, a, s')
     - You don't know the rewards R(s, a, s')
     - Goal: learn the state values
   - In this case:
     - Learner is "along for the ride"
     - No choice about what actions to take
     - Just execute the policy and learn from experience
     - This is NOT offline planning! You actually take actions in the world.
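To make the passive setting concrete, here is a small Python sketch (my own illustration, not course code) of an agent that is "along for the ride": it follows a fixed policy (a dict mapping state to action) and merely records the samples it experiences. env is the same hypothetical reset()/step() interface as in the earlier sketch. Everything the learning methods on the next slides need can be computed from these recorded episodes.

    def run_passive_episode(env, policy, max_steps=100):
        """Follow the fixed policy and record (state, action, reward, next_state) samples."""
        episode = []
        state = env.reset()
        for _ in range(max_steps):
            action = policy[state]                       # no choice: just do what pi says
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward, next_state))
            state = next_state
            if done:
                break
        return episode

    def collect_episodes(env, policy, n_episodes=100):
        """Gather training data by running the fixed policy many times."""
        return [run_passive_episode(env, policy) for _ in range(n_episodes)]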

6. Model-Based Learning
   - Model-based idea:
     - Learn an approximate model based on experiences
     - Solve for values as if the learned model were correct
   - Step 1: Learn the empirical MDP model
     - Count outcomes s' for each (s, a)
     - Normalize to give an estimate of T(s, a, s')
     - Discover each R(s, a, s') when we experience (s, a, s')
   - Step 2: Solve the learned MDP
     - For example, use value iteration, as before
   (Both steps, and direct evaluation below, are sketched in code after this slide group.)

   Example: Model-Based Learning
   Input: a fixed policy π on a small gridworld with states A, B, C, D, E. Assume γ = 1.
   Observed episodes (training):
   - Episode 1: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
   - Episode 2: B, east, C, -1;  C, east, D, -1;  D, exit, x, +10
   - Episode 3: E, north, C, -1;  C, east, D, -1;  D, exit, x, +10
   - Episode 4: E, north, C, -1;  C, east, A, -1;  A, exit, x, -10
   Learned model:
   - T(B, east, C) = 1.00
   - T(C, east, D) = 0.75
   - T(C, east, A) = 0.25
   - ...
   - R(B, east, C) = -1
   - R(C, east, D) = -1
   - R(D, exit, x) = +10
   - ...

   Simple Example: Expected Age
   Goal: compute the expected age of CSE 473 students.
   - Known P(A): E[A] = Σ_a P(a) · a
   - Without P(A), instead collect samples [a1, a2, ..., aN]
     - Unknown P(A), "model based": estimate P(a) ≈ num(a) / N, then compute E[A] ≈ Σ_a P(a) · a
       Why does this work? Because eventually you learn the right model.
     - Unknown P(A), "model free": E[A] ≈ (1/N) Σ_i a_i
       Why does this work? Because samples appear with the right frequencies.

   Direct Evaluation
   - Goal: compute values for each state under π
   - Idea: average together observed sample values
     - Act according to π
     - Every time you visit a state, write down what the sum of discounted rewards turned out to be
     - Average those samples
   - This is called direct evaluation
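Both of these ideas fit in a few lines of code. The sketch below is my own illustration, not the course's implementation: it assumes episodes are lists of (state, action, reward, next_state) tuples, as produced by the passive-RL sketch earlier (note the slide's episode lines use the order state, action, next_state, reward). The first function is Step 1 of model-based learning: count outcomes and normalize to estimate T, and record observed rewards as R; Step 2 would then run value iteration on that learned model. The second function is direct evaluation: average the observed discounted return from every visit to each state.

    from collections import defaultdict

    def learn_empirical_mdp(episodes):
        """Step 1 of model-based learning: estimate T(s,a,s') and R(s,a,s') from counts."""
        counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = times seen
        R_hat = {}                                       # observed reward for each (s, a, s')
        for episode in episodes:
            for (s, a, r, s_next) in episode:
                counts[(s, a)][s_next] += 1
                R_hat[(s, a, s_next)] = r
        T_hat = {}
        for (s, a), outcomes in counts.items():
            total = sum(outcomes.values())
            for s_next, n in outcomes.items():
                T_hat[(s, a, s_next)] = n / total        # normalize counts into probabilities
        return T_hat, R_hat                              # Step 2: run value iteration on this model

    def direct_evaluation(episodes, gamma=1.0):
        """Average the discounted return observed from every visit to each state."""
        returns = defaultdict(list)
        for episode in episodes:
            G = 0.0
            for (s, a, r, s_next) in reversed(episode):  # walk backwards to get return-to-go
                G = r + gamma * G
                returns[s].append(G)
        return {s: sum(gs) / len(gs) for s, gs in returns.items()}

Running it on the four episodes from the example slide reproduces the learned model shown there:

    episodes = [
        [("B", "east", -1, "C"), ("C", "east", -1, "D"), ("D", "exit", +10, "x")],
        [("B", "east", -1, "C"), ("C", "east", -1, "D"), ("D", "exit", +10, "x")],
        [("E", "north", -1, "C"), ("C", "east", -1, "D"), ("D", "exit", +10, "x")],
        [("E", "north", -1, "C"), ("C", "east", -1, "A"), ("A", "exit", -10, "x")],
    ]
    T_hat, R_hat = learn_empirical_mdp(episodes)
    # T_hat[("C", "east", "D")] == 0.75 and T_hat[("C", "east", "A")] == 0.25, as on the slide.
    V = direct_evaluation(episodes, gamma=1.0)
    # e.g. V["B"] == 8.0 (both episodes from B earned -1 - 1 + 10) and V["C"] == 4.0.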
