
Machine Learning, Reinforcement Learning



Today's Class (12/6/17): Machine Learning, Reinforcement Learning
AI Class 25 (Ch. 21.1, 20.2–20.2.5, 20.3)
• Machine Learning: A quick retrospective
• Reinforcement Learning: What is it?
• Next time: The EM algorithm; Monte Carlo and Temporal Difference
• Upcoming classes: EM (more); Ethics??; Tournament

Slides drawn from Drs. Tim Finin, Paula Matuszek, Rich Sutton, Andy Barto, and Marie desJardins, with thanks.

Review: What is ML?
• ML is a way to get a computer (in our parlance, a system) to do things without having to explicitly describe what steps to take.
  • By giving it examples (training data)
  • Or by giving it feedback
• It can then look for patterns which explain or predict what happens.
• The learned system of beliefs is called a model.

Review: Architecture of a ML System
• Every machine learning system has four parts (a small sketch follows below):
  1. A representation or model of what is being learned.
  2. An actor: Uses the representation and actually does something.
  3. A critic: Provides feedback.
  4. A learner: Modifies the representation/model, using the feedback.

Review: Representation
• A learning system must have a representation or model of what is being learned.
• This is what changes based on experience.
• In a machine learning system this may be:
  • A mathematical model or formula
  • A set of rules
  • A decision tree
  • Or some other form of information

Review: Formalizing Agents
• Given:
  • A state space S
  • A set of actions a1, …, ak including their results
  • Reward value at the end of each trial (series of actions), which may be positive or negative
• Output:
  • A mapping from states to actions
    • Which is a policy, π
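To make the four-part architecture above concrete, here is a minimal Python sketch of how the pieces interact. The class names (Model, Actor, Critic, Learner) and the run_trial loop are illustrative assumptions, not notation from the slides.

    # A deliberately abstract sketch of the four-part ML architecture:
    # a representation, an actor that uses it, a critic that gives feedback,
    # and a learner that modifies the representation based on that feedback.

    class Model:
        """The representation of what is being learned (rules, a tree, weights, ...)."""
        def predict(self, state):
            return None          # a concrete model would map a state to an action

    class Actor:
        """Uses the representation to actually do something."""
        def act(self, model, state):
            return model.predict(state)

    class Critic:
        """Provides feedback on what the actor did (e.g. +1 success, -1 failure)."""
        def feedback(self, state, action, outcome):
            return outcome

    class Learner:
        """Modifies the representation/model, using the critic's feedback."""
        def update(self, model, state, action, reward):
            pass                 # a concrete learner would adjust the model here

    def run_trial(model, actor, critic, learner, state, outcome):
        """One pass through the loop: act, get feedback, learn."""
        action = actor.act(model, state)
        reward = critic.feedback(state, action, outcome)
        learner.update(model, state, action, reward)
        return reward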

Learning Without a Model
• We saw how to learn a value function and/or a policy from a transition model
• What if we don't have a transition model?
• Idea #1: Build one
  • Explore the environment for a long time
  • Record all transitions
  • Learn the transition model
  • Apply value iteration/policy iteration
  • Slow, requires a lot of exploration, no intermediate learning
• Idea #2: Learn a value function (or policy) directly from interactions with the environment, while exploring

Reinforcement Learning
• We often have an agent which has a task to perform
  • It takes some actions in the world
  • At some later point, it gets feedback on how well it did
  • The agent performs the same task repeatedly
• This problem is called reinforcement learning:
  • The agent gets positive reinforcement for tasks done well
  • And gets negative reinforcement for tasks done poorly
  • It must somehow figure out which actions to take next time

Animals Game
• Human: I'm thinking of an animal.
• Computer: Is it a bird?
• Human: No.
• Computer: Is it a frog?
• Human: No.
• Computer: What is it?
• Human: A mouse.
• Computer: Tell me a question that distinguishes a frog from a mouse.
• Human: Does it have fur?
• Computer: What is the answer for a frog?
• Human: No.

Animals Behind the Scene
• After several rounds, the computer's knowledge is a small binary decision tree: "Is it a bird?" leads (yes) to "Is it a penguin?" and (no) to "Does it have fur?", which in turn leads (yes) to "Is it a mouse?" and (no) to "Is it a frog?"

Reinforcement Learning
• Feedback is at the end, on a series of actions.
• Very early concept in Artificial Intelligence!
• Arthur Samuel's checkers program was a simple reinforcement-based learner, initially developed in 1956.
• In 1962 it beat a human checkers master.
(www-03.ibm.com/ibm/history/ibm100/us/en/icons/ibm700series/impacts/)

Animals Guessing Game Architecture
• This is a simple form of Reinforcement Learning, with all of the parts of the ML architecture (a small sketch follows below):
  • The Representation is a sequence of questions and pairs of yes/no answers (called a binary decision tree).
  • The Actor "walks" the tree, interacting with a human; at each question it chooses whether to follow the "yes" branch or the "no" branch.
  • The Critic is the human player telling the game whether it has guessed correctly.
  • The Learner elicits new questions and adds questions, guesses, and branches to the tree.
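The guessing game above is easy to realize as a learnable binary decision tree. Below is a minimal Python sketch of one round of play: the Actor walks the tree, the human acts as Critic, and a wrong guess triggers the Learner, which splices in a new question. The Node class, the ask helper, and the console prompts are illustrative assumptions, not code from the slides.

    class Node:
        """A question at an internal node, or an animal name at a leaf."""
        def __init__(self, text, yes=None, no=None):
            self.text = text
            self.yes = yes
            self.no = no

        def is_leaf(self):
            return self.yes is None and self.no is None

    def ask(prompt):
        """Critic/human input: returns True for a 'yes' answer."""
        return input(prompt + " (yes/no) ").strip().lower().startswith("y")

    def play(root):
        """Play one round; returns the (possibly grown) root of the tree."""
        node, parent, took_yes = root, None, None
        while not node.is_leaf():                 # Actor: walk the tree
            answer = ask(node.text)
            parent, took_yes = node, answer
            node = node.yes if answer else node.no

        if ask("Is it a " + node.text + "?"):     # Critic: was the guess right?
            print("I guessed it!")
            return root

        # Learner: elicit a new distinguishing question and splice it in.
        animal = input("What is it? ").strip()
        question = input("Tell me a question that distinguishes a " + node.text
                         + " from a " + animal + ": ").strip()
        old_is_yes = ask("What is the answer for a " + node.text + "?")
        new_node = Node(question,
                        yes=node if old_is_yes else Node(animal),
                        no=Node(animal) if old_is_yes else node)
        if parent is None:
            return new_node                       # the old root was a single guess
        if took_yes:
            parent.yes = new_node
        else:
            parent.no = new_node
        return root

    # The tree from "Animals Behind the Scene", after a few rounds of play:
    tree = Node("Is it a bird?",
                yes=Node("penguin"),
                no=Node("Does it have fur?", yes=Node("mouse"), no=Node("frog")))
    # tree = play(tree)   # uncomment to play one interactive round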

Reinforcement Learning (cont.)
• Goal: agent acts in the world to maximize its rewards
• Agent has to figure out what it did that made it get that reward/punishment
  • This is known as the credit assignment problem
• RL can be used to train computers to do many tasks
  • Backgammon and chess playing
  • Job shop scheduling
  • Controlling robot limbs

Simple Example
• Learn to play checkers
  • Two-person game
  • 8x8 board, 12 checkers/side
  • Relatively simple set of rules: http://www.darkfish.com/checkers/rules.html
  • Goal is to eliminate all your opponent's pieces
(Image: https://pixabay.com/en/checker-board-black-game-pattern-29911)

Representing Checkers
• First we need to represent the game
• To completely describe one step in the game you need
  • A representation of the game board
  • A representation of the current pieces
  • A variable which indicates whose turn it is
  • A variable which tells you which side is "black"
• There is no history needed
  • A look at the current board setup gives you a complete picture of the state of the game, which makes it a ___ problem?

Representing Rules
• Second, we need to represent the rules
• Represented as a set of allowable moves given board state
  • If a checker is at row x, column y, and row x+1, column y±1 is empty, it can move there.
  • If a checker is at (x,y), a checker of the opposite color is at (x+1,y+1), and (x+2,y+2) is empty, the checker must move there, and remove the "jumped" checker from play.
• There are additional rules, but all can be expressed in terms of the state of the board and the checkers.
• Each rule includes the outcome of the relevant action in terms of the state.

What Do We Want to Learn?
• Given
  • A description of some state of the game
  • A list of the moves allowed by the rules
• What move should we make?
  • Typically more than one move is possible
  • Need strategies, heuristics, or hints about what move to make
  • This is what we are learning
• We learn from whether the game was won or lost
  • Information to learn from is sometimes called the "training signal"

Simple Checkers Learning
• Can represent some heuristics in the same formalism as the board and rules
  • If there is a legal move that will create a king, take it.
    • If a checker is at (7,y) and (8,y-1) or (8,y+1) is free, move there.
  • If there are two legal moves, choose the one that moves a checker farther toward the top row.
    • If checker(x,y) and checker(p,q) can both move, and x>p, move checker(x,y).
• But then each of these heuristics needs some kind of priority or weight (a small sketch of weighted move selection follows below).
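As a concrete illustration of heuristics with weights, here is a small Python sketch of weighted move selection. The feature tests, the choose_move function, and the particular weights are illustrative assumptions, not the representation used in the slides.

    def creates_king(state, move):
        """Placeholder test: a real version would check whether `move`
        lands a checker on the far row (making a king)."""
        return False

    def advances_toward_king_row(state, move):
        """Placeholder test: a real version would compare the checker's row
        before and after `move`."""
        return False

    # Each heuristic is a (test, weight) pair; the weights are what a
    # learner would later tune up or down based on wins and losses.
    heuristics = [
        (creates_king, 1.0),
        (advances_toward_king_row, 0.5),
    ]

    def choose_move(state, legal_moves, heuristics):
        """Score each legal move by the weighted sum of the heuristics that
        fire, and pick the highest-scoring one."""
        def score(move):
            return sum(weight * float(test(state, move))
                       for test, weight in heuristics)
        return max(legal_moves, key=score)

    # Usage (with a real board representation and move generator):
    # best = choose_move(board, legal_moves(board), heuristics)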

Formalization for RL Agent
• Given:
  • A state space S
  • A set of actions a1, …, ak including their results
  • A set of heuristics for resolving conflict among actions
  • Reward value at the end of each trial (series of actions), which may be positive or negative
• Output:
  • A policy (a mapping from states to preferred actions)

Learning Agent
• The general algorithm for this learning agent is:
  • Observe some state
  • If it is a terminal state
    • Stop
    • If won, increase the weight on all heuristics used
    • If lost, decrease the weight on all heuristics used
  • Otherwise choose an action from those possible in that state, using heuristics to select the preferred action
  • Perform the action

Policy
• A complete mapping from states to actions
  • There must be an action for each state
  • There may be more than one action
  • Not necessarily optimal
• The goal of a learning agent is to tune the policy so that the preferred action is optimal, or at least good
  • Analogous to training a classifier
• Checkers:
  • Trained policy includes all legal actions, with weights
  • "Preferred" actions are weighted up

Approaches
• Learn policy directly: discover a function mapping from states to actions
  • Could be directly learned values
    • Ex: Value of the state which removes the last opponent checker is +1.
  • Or a heuristic function which has itself been trained
• Learn utility values for states (value function)
  • Estimate the value for each state
  • Checkers: How happy am I with this state that turns a man into a king?

Value Function Learning
• A typical approach is:
  • At state S, choose some action A, taking us to new state S1
  • If S1 has a positive value: increase value of A at S
  • If S1 has a negative value: decrease value of A at S
  • If S1 is new, initial value is unknown: value of A unchanged
• One complete learning pass or trial eventually gets to a terminal, deterministic state (e.g., "win" or "lose")
• Repeat until? Convergence? Some performance level?

States and Actions
• The agent knows what state it is in
• It has actions it can perform in each state
• Initially, it doesn't know the value of any of the states
• If the outcome of performing an action at a state is deterministic, then the agent can update the utility value U() of states:
  • U(oldstate) = reward + U(newstate) (a small sketch follows below)
• The agent learns the utility values of states as it works its way through the state space
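To make the update above concrete, here is a minimal Python sketch that applies U(oldstate) = reward + U(newstate) backwards over one completed trial, assuming, as in the slides, that the only nonzero reward arrives at the terminal state. The dictionary-based U and the update_utilities helper are illustrative assumptions, not the slides' exact algorithm.

    def update_utilities(trajectory, terminal_reward, U, step_reward=0.0):
        """Walk one completed trial backwards, applying
        U(oldstate) = reward + U(newstate).
        `trajectory` is the list of states visited, in order; the only nonzero
        reward here is the terminal one (e.g. +1 for a win, -1 for a loss)."""
        U[trajectory[-1]] = terminal_reward
        for old_state, new_state in zip(reversed(trajectory[:-1]),
                                        reversed(trajectory[1:])):
            U[old_state] = step_reward + U.get(new_state, 0.0)
        return U

    # Example: a three-state trial that ends in a win.
    U = {}
    update_utilities(["start", "mid", "win"], terminal_reward=+1.0, U=U)
    # U is now {"win": 1.0, "mid": 1.0, "start": 1.0}: every state on the
    # winning path inherits the terminal reward.

Repeating such trials over many games, and then preferring actions that lead to higher-valued states, is one way to realize the value-function approach described in the slides.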

