SLIDE 1

Learning Systems

Chapter 5
Dr Ahmed Rafea

SLIDE 2

Overview

  • One central element of intelligent behavior is the ability to adapt or learn from experience.
  • Adding learning or adaptive behavior to an intelligent agent elevates it to a higher level of ability.

SLIDE 3

Forms of Learning

  • Rote learning: based on memorization of examples.
  • Parameter or weight adjustment: how to weight the contribution of important decision factors to the answer. This technique is the basis of neural networks.
  • Induction: extraction of important characteristics of the problem to build a model that can be used to predict new situations.
  • Clustering: a way to organize similar patterns into groups.

SLIDE 4

Data Mining

  • All these learning techniques, including induction and clustering, are used in data mining.
  • Data mining is the process of extracting valuable, non-obvious information from large collections of data.
  • The main contribution of data mining is to find patterns that were not known to exist (finding new information or knowledge).
  • Therefore, learning as applied to data mining can be thought of as a way for intelligent agents to automatically discover knowledge rather than having it predefined.

SLIDE 5

Learning Paradigms

  • Supervised learning: relies on a teacher that provides the input data as well as the desired solution (also known as programming by example).
  • Unsupervised learning: depends on the input data only and makes no demand on knowing the solution.
  • Reinforcement learning: a kind of supervised learning used when explicit input/output pairs of training data are not available.

SLIDE 6

Online & Offline Learning

  • Online learning: the agent is sent out to perform its tasks and can adapt after each transaction is processed.
  • Offline learning: involves saving data while the agent is working and using the data later to train the agent.

SLIDE 7

Neural Networks

  • Neural networks are parallel computing models that adapt when presented with training data.
  • They operate in supervised, unsupervised, and reinforcement learning modes.
  • A neural network comprises a set of simple processing units and a set of adaptive, real-valued connection weights.
  • Learning in neural networks is accomplished through the adjustment of the connection weights, as sketched below.
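
A minimal sketch of weight adjustment for a single processing unit, assuming a delta-rule update; the rule, learning rate, and epoch count are illustrative choices, not taken from the slides:

    def train_unit(samples, lr=0.1, epochs=100):
        """Adjust the connection weights of one linear unit toward the
        desired outputs (delta rule). samples: list of (inputs, desired)."""
        n = len(samples[0][0])
        w, b = [0.0] * n, 0.0
        for _ in range(epochs):
            for x, desired in samples:
                actual = sum(wi * xi for wi, xi in zip(w, x)) + b
                error = desired - actual          # how far off the unit is
                for i in range(n):                # nudge each weight
                    w[i] += lr * error * x[i]
                b += lr * error
        return w, b

    # e.g. learn an AND-like mapping of two binary inputs
    w, b = train_unit([((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)])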

SLIDE 8

SLIDE 9

Back Propagation

  • Back propagation is the most popular neural network architecture for supervised learning.
  • It features a feed-forward connection topology, meaning that data flows through the network in a single direction, and uses a technique called the backward propagation of errors to adjust connection weights.
  • The primary applications of back-propagation networks are prediction and classification.

SLIDE 10

SLIDE 11

Back Propagation

This diagram illustrates the three major steps of the training process (sketched in code below):

  • Input data is presented to the input layer of units and flows through the network until it reaches the output units.
  • The difference between the desired and the actual output is computed, producing the network error.
  • This error is then passed backwards through the network to adjust the connection weights.
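
The three steps map directly onto code. Below is a minimal sketch for a tiny one-hidden-layer network; the 2-2-1 layer sizes, sigmoid activation, learning rate, and XOR training data are all illustrative assumptions:

    import math
    import random

    random.seed(0)
    W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]  # input -> hidden
    b1 = [0.0, 0.0]
    W2 = [random.uniform(-1, 1) for _ in range(2)]                      # hidden -> output
    b2 = 0.0
    lr = 0.5

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    def train_step(x, desired):
        global b2
        # Step 1: present the input; data flows forward to the output unit.
        h = [sigmoid(sum(W1[j][i] * x[i] for i in range(2)) + b1[j]) for j in range(2)]
        y = sigmoid(sum(W2[j] * h[j] for j in range(2)) + b2)
        # Step 2: desired minus actual output gives the network error.
        error = desired - y
        # Step 3: pass the error backwards and adjust the connection weights.
        d_out = error * y * (1.0 - y)
        for j in range(2):
            d_hid = d_out * W2[j] * h[j] * (1.0 - h[j])  # error credited to hidden unit j
            W2[j] += lr * d_out * h[j]
            b1[j] += lr * d_hid
            for i in range(2):
                W1[j][i] += lr * d_hid * x[i]
        b2 += lr * d_out

    # e.g. repeated passes over XOR data (may need more passes or a
    # different random seed to converge)
    for _ in range(5000):
        for x, d in (((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)):
            train_step(x, d)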

SLIDE 12

Kohonen Maps

  • Kohonen map networks are unsupervised, single-layer neural networks comprised of an input layer and an output layer.
  • Each time an input vector is presented to the network, its distance to each unit in the output layer is computed using Euclidean distance.
  • Kohonen map networks self-organize and learn to map similar inputs into output units that are in close proximity to each other (see the sketch below).
  • They have become one of the most popular and practical neural network models.
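
A minimal sketch of one presentation step, assuming a one-dimensional row of output units; the unit count, learning rate, and fixed neighborhood radius are illustrative (practical Kohonen maps usually shrink both over time):

    import math
    import random

    random.seed(1)
    N_UNITS, DIM = 10, 2
    units = [[random.random() for _ in range(DIM)] for _ in range(N_UNITS)]

    def euclidean(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def present(x, lr=0.3, radius=2):
        # Compute the input vector's distance to every output unit and
        # find the closest (winning) unit.
        winner = min(range(N_UNITS), key=lambda u: euclidean(units[u], x))
        # Pull the winner and its neighbours toward the input, so nearby
        # units come to respond to similar inputs (self-organization).
        for u in range(max(0, winner - radius), min(N_UNITS, winner + radius + 1)):
            for i in range(DIM):
                units[u][i] += lr * (x[i] - units[u][i])

    for _ in range(500):
        present([random.random(), random.random()])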

SLIDE 13

SLIDE 14

Decision Trees - 1

  • Decision trees can be defined as structures that consist of:
      - Leaf nodes, representing a class, and
      - Decision nodes, where a test is carried out on a single attribute value, with one branch for each possible outcome of the test.
  • Decision trees perform induction on example data sets, generating classifiers and prediction models.
  • A decision tree examines the data set and uses information theory to determine which attribute contains the most information on which to base a decision.

SLIDE 15

A Training Set: “Play/Don’t Play”

No.  Outlook    Temperature  Humidity  Windy  Class
1    sunny      hot          high      false  N
2    sunny      hot          high      true   N
3    overcast   hot          high      false  P
4    rain       mild         high      false  P
5    rain       cool         normal    false  P
6    rain       cool         normal    true   N
7    overcast   cool         normal    true   P
8    sunny      mild         high      false  N
9    sunny      cool         normal    false  P
10   rain       mild         normal    false  P
11   sunny      mild         normal    true   P
12   overcast   mild         high      true   P
13   overcast   hot          normal    false  P
14   rain       mild         high      true   N

SLIDE 16

Decision Tree Derived from Training Set

outlook
    = sunny:    humidity = normal → P, humidity = high → N
    = overcast: P
    = rain:     windy = false → P, windy = true → N
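
The leaf and decision nodes defined on SLIDE 14 can be written down as a small data structure, and the tree above encoded with it; a sketch with illustrative field names:

    from dataclasses import dataclass
    from typing import Dict, Union

    @dataclass
    class Leaf:
        label: str                 # the class this leaf represents

    @dataclass
    class Decision:
        attribute: str             # the single attribute tested at this node
        branches: Dict[str, Union["Decision", Leaf]]  # one branch per outcome

    def classify(node, record):
        """Follow decision nodes until a leaf gives the class."""
        while isinstance(node, Decision):
            node = node.branches[record[node.attribute]]
        return node.label

    # The tree derived from the training set:
    tree = Decision("outlook", {
        "sunny": Decision("humidity", {"normal": Leaf("P"), "high": Leaf("N")}),
        "overcast": Leaf("P"),
        "rain": Decision("windy", {"false": Leaf("P"), "true": Leaf("N")}),
    })

    assert classify(tree, {"outlook": "sunny", "humidity": "normal"}) == "P"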

SLIDE 17

Classification Rule Based on Decision Tree

If    outlook = overcast
   ∨ (outlook = sunny & humidity = normal)
   ∨ (outlook = rain & windy = false)
Then  P

SLIDE 18

ID3 Algorithm

  • Given (1) a set of disjoint target classes (C1, C2, …, Ck) and (2) a set of training data, S, containing objects of more than one class, ID3 uses a series of tests to refine S into subsets that contain objects of only one class.
  • ID3 builds a decision tree, where non-terminal nodes correspond to tests on a single attribute of the data and terminal nodes correspond to classified subsets of the data.
  • Let T be any test on a single attribute. Then T produces a partition {S1, S2, …, Sn} of S based on the outcomes O1, O2, …, On: Si = {x | T(x) = Oi} (a code sketch of this step follows).
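
A sketch of the partition step, where a test is any function from an object to an outcome; the record shape and attribute name in the usage comment are hypothetical:

    def partition(S, T):
        """Split data set S into subsets S_i keyed by outcome O_i = T(x)."""
        subsets = {}
        for x in S:
            subsets.setdefault(T(x), []).append(x)
        return subsets

    # e.g. branching on a single attribute of dict-shaped records:
    # partition(records, lambda x: x["outlook"])  ->  {"sunny": [...], ...}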

SLIDE 19

Tree Structure of Partitioned Objects

[Figure: the root set S is split by test outcomes O1, O2, …, On into the subsets S1, S2, …, Sn.]

SLIDE 20

Information Theory

  • Consider a set of messages M = {m1, m2, …, mn}.
  • Each message mi has probability p(mi) of being received and contains an amount of information I(mi) as follows: I(mi) = −log2 p(mi).
  • The uncertainty (or entropy) of a message set, U(M), is the sum of the information in the possible messages weighted by their probabilities (computed in the sketch below): U(M) = −Σi p(mi) log2 p(mi), for i = 1 to n.
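
Both quantities are one-liners; a sketch, using the usual convention that a zero-probability term contributes nothing to the sum:

    import math

    def information(p):
        """I(m) = -log2 p(m), in bits, for a message of probability p."""
        return -math.log2(p)

    def uncertainty(probs):
        """U(M) = -sum_i p(m_i) log2 p(m_i), skipping p = 0 terms."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # e.g. a fair coin flip carries one bit of uncertainty
    assert abs(uncertainty([0.5, 0.5]) - 1.0) < 1e-9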

SLIDE 21

Building Decision Trees in ID3

Let Ni stand for the number of cases in S that belong to class Ci. Then the probability that a random case c belongs to class Ci is estimated to be:

    p(c ∈ Ci) = Ni / |S|

Thus the amount of information in a message of class Ci is:

    I(c ∈ Ci) = −log2 p(c ∈ Ci) bits

Consider the set of target classes as a message set {C1, C2, …, Ck}. The uncertainty U(S) measures the average amount of information needed to determine the class of a random case, c ∈ S, prior to partitioning by any test. Thus:

    U(S) = Σi p(c ∈ Ci) × I(c ∈ Ci) bits, for i = 1 to k

SLIDE 22

Building Decision Trees in ID3 (contd.)

  • Consider a similar uncertainty measure after S has been partitioned into {S1, S2, …, Sn} by a test T:

        UT(S) = Σi (|Si| / |S|) × U(Si), for i = 1 to n

  • UT(S) measures how much information is still needed to determine the class of a case after the partitioning. Thus ID3 decides which attribute to branch on next by selecting the test T that gains the most information, i.e. the T with maximum GS(T), given below (see the sketch after this list):

        GS(T) = U(S) − UT(S)
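
Putting the last two slides together, a sketch of the selection criterion; it assumes each case is an (attributes, label) pair with attributes stored in a dict, which is an illustrative representation:

    import math
    from collections import Counter

    def u_of(cases):
        """U(S): uncertainty of the class distribution over a case set."""
        n = len(cases)
        counts = Counter(label for _, label in cases)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    def gain(cases, attribute):
        """G_S(T) = U(S) - U_T(S) for the test 'branch on attribute'."""
        subsets = {}
        for case in cases:
            subsets.setdefault(case[0][attribute], []).append(case)
        u_t = sum((len(sub) / len(cases)) * u_of(sub) for sub in subsets.values())
        return u_of(cases) - u_t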

SLIDE 23

“Play/Don’t Play” Example

S = {P, N}. For T = Outlook, {S1, S2, S3} = {sunny, overcast, rain}.

    U(S) = −(p/(p+n)) log2(p/(p+n)) − (n/(p+n)) log2(n/(p+n))
         = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.9403

    U(sunny)    = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971
    U(overcast) = −(4/4) log2(4/4) − (0/4) log2(0/4) = 0
    U(rain)     = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.971

    UOutlook(S) = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.6936

    GS(Outlook) = U(S) − UOutlook(S) = 0.9403 − 0.6936 = 0.2467

SLIDE 24

“Play/Don’t Play” Example (contd.)

Similarly,

    UTemperature(S) = 0.9111
    UHumidity(S)    = 0.7885
    UWindy(S)       = 0.8922

    GS(Temperature) = U(S) − UTemperature(S) = 0.9403 − 0.9111 = 0.0292
    GS(Humidity)    = U(S) − UHumidity(S)    = 0.9403 − 0.7885 = 0.1518
    GS(Windy)       = U(S) − UWindy(S)       = 0.9403 − 0.8922 = 0.0481

Thus T = Outlook has the highest information gain and is chosen as the root.
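
As a check, the gains can be recomputed from the SLIDE 15 table with the gain() sketch from SLIDE 22 (assumed to be in scope); the tuple encoding of the table below is illustrative:

    rows = [
        ("sunny", "hot", "high", False, "N"), ("sunny", "hot", "high", True, "N"),
        ("overcast", "hot", "high", False, "P"), ("rain", "mild", "high", False, "P"),
        ("rain", "cool", "normal", False, "P"), ("rain", "cool", "normal", True, "N"),
        ("overcast", "cool", "normal", True, "P"), ("sunny", "mild", "high", False, "N"),
        ("sunny", "cool", "normal", False, "P"), ("rain", "mild", "normal", False, "P"),
        ("sunny", "mild", "normal", True, "P"), ("overcast", "mild", "high", True, "P"),
        ("overcast", "hot", "normal", False, "P"), ("rain", "mild", "high", True, "N"),
    ]
    attrs = ("outlook", "temperature", "humidity", "windy")
    cases = [(dict(zip(attrs, r[:4])), r[4]) for r in rows]

    for a in attrs:
        print(a, round(gain(cases, a), 4))
    # outlook ≈ 0.2467, temperature ≈ 0.0292, humidity ≈ 0.1518, windy ≈ 0.0481
    # Outlook has the highest gain and becomes the root, as stated above.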