Learning Systems
Chapter 5
Dr Ahmed Rafea

Overview
One central element of intelligent behavior is the ability to adapt or learn from experience.
Adding learning or adaptive behavior to an intelligent agent elevates it to a higher level of ability.
Rote learning: based on memorization of examples.
Parameter or weight adjustment: learning how to weight the contribution of important decision factors to the answer. This technique is the basis of learning in neural networks.
Induction: extraction of important characteristics of the problem to build a model that can be used to predict new situations.
Clustering: a way to organize similar patterns into groups.
All these learning techniques, including induction and clustering, are used in data mining.
Data mining is a process of extracting valuable, non-trivial information from data.
The main contribution of data mining is to find patterns which were not known to exist (finding new information or knowledge).
Therefore, learning, as applied to data mining, can be thought of as a way for intelligent agents to automatically discover knowledge rather than having it predefined.
Supervised learning: relies on a teacher that provides the input data as well as the desired solution (also known as programming by example).
Unsupervised learning: depends on input data only, with no desired solution provided.
Reinforcement learning: a kind of supervised learning used when explicit input/output pairs are not available.
On-line learning means that the agent is sent out to perform its tasks and that it can adapt after each transaction is processed.
Off-line learning involves saving data while the agent is working and using the data later to train the agent.
Neural networks are parallel computing models that adapt when presented with training data.
They operate in supervised, unsupervised, and reinforcement learning modes.
A neural network is comprised of a set of simple processing units and a set of adaptive, real-valued connection weights.
Learning in neural networks is accomplished through the adjustment of the connection weights.
Back propagation is the most popular neural network architecture for supervised learning.
It features a feed-forward connection topology, meaning that data flows through the network in a single direction, and uses a technique called the backward propagation of errors to adjust connection weights.
The primary applications of back-propagation networks are prediction and classification.
This diagram illustrates the three major steps of the training process:
Input data is presented to the input layer of units and flows through the network until it reaches the output layer.
The difference between the desired and the actual output is computed, producing the network error.
This error is then passed backwards through the network to adjust the connection weights.
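The three steps above can be sketched as a minimal back-propagation network in plain Python. This is an illustrative toy (a 2-3-1 sigmoid network trained on XOR); the data set, layer sizes, learning rate, and epoch count are assumptions, not from the slides:

```python
import math
import random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Toy training set (XOR): inputs and desired outputs
DATA = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

N_IN, N_HID, LR = 2, 3, 0.5

# Adaptive, real-valued connection weights (last entry of each row is a bias)
w_hid = [[random.uniform(-1, 1) for _ in range(N_IN + 1)] for _ in range(N_HID)]
w_out = [random.uniform(-1, 1) for _ in range(N_HID + 1)]

def forward(x):
    # Step 1: data flows in a single direction, from input layer to output
    h = [sigmoid(sum(w[i] * x[i] for i in range(N_IN)) + w[-1]) for w in w_hid]
    o = sigmoid(sum(w_out[j] * h[j] for j in range(N_HID)) + w_out[-1])
    return h, o

def train_epoch():
    total_error = 0.0
    for x, target in DATA:
        h, o = forward(x)
        # Step 2: compute the network error (desired minus actual output)
        total_error += (target - o) ** 2
        # Step 3: pass the error backwards and adjust connection weights
        delta_o = (target - o) * o * (1 - o)
        delta_h = [delta_o * w_out[j] * h[j] * (1 - h[j]) for j in range(N_HID)]
        for j in range(N_HID):
            w_out[j] += LR * delta_o * h[j]
        w_out[-1] += LR * delta_o
        for j in range(N_HID):
            for i in range(N_IN):
                w_hid[j][i] += LR * delta_h[j] * x[i]
            w_hid[j][-1] += LR * delta_h[j]
    return total_error

first_error = train_epoch()
for _ in range(5000):
    last_error = train_epoch()
```

The weight updates are the standard delta rule for sigmoid units; repeated epochs drive the squared error down from its initial value.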
Kohonen map networks are unsupervised, single-layer neural networks comprised of an input layer and an output layer.
Each time an input vector is presented to the network, its distance to each unit in the output layer is computed using Euclidean distance.
Kohonen map networks self-organize to map similar inputs into output units that are in close proximity to each other.
They have become one of the most popular and practical neural network models.
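A minimal Kohonen map sketch in plain Python, assuming a 1-D output layer of 10 units and toy 2-D inputs drawn from two clusters (the data, layer size, and decay schedules are illustrative assumptions):

```python
import math
import random

random.seed(1)

# Toy 2-D input vectors drawn from two clusters
inputs = ([(random.gauss(0.2, 0.05), random.gauss(0.2, 0.05)) for _ in range(50)]
          + [(random.gauss(0.8, 0.05), random.gauss(0.8, 0.05)) for _ in range(50)])
random.shuffle(inputs)

N_OUT = 10  # output units arranged along a 1-D line
weights = [[random.random(), random.random()] for _ in range(N_OUT)]

def winner(x):
    # Euclidean distance from the input vector to every output unit
    distances = [math.dist(x, w) for w in weights]
    return distances.index(min(distances))

EPOCHS = 20
for t in range(EPOCHS):
    lr = 0.5 * (1 - t / EPOCHS)                   # decaying learning rate
    radius = max(1, round(3 * (1 - t / EPOCHS)))  # shrinking neighbourhood
    for x in inputs:
        b = winner(x)
        # Move the winning unit and its neighbours towards the input,
        # so similar inputs end up mapped to nearby output units
        for j in range(max(0, b - radius), min(N_OUT, b + radius + 1)):
            for i in range(2):
                weights[j][i] += lr * (x[i] - weights[j][i])
```

After training, inputs from the two clusters win different output units, while near-identical inputs win the same or adjacent units.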
Decision trees can be defined as structures that consist of:
Leaf nodes, representing a class, and
Decision nodes, where a test is to be carried out on a single attribute value, with one branch for each possible outcome of the test.
Decision trees perform induction on example data sets, generating classifiers and prediction models.
A decision tree examines the data set and uses information theory to determine which attribute contains the most information on which to base a decision.
No.  Outlook   Temperature  Humidity  Windy  Class
1    sunny     hot          high      false  N
2    sunny     hot          high      true   N
3    overcast  hot          high      false  P
4    rain      mild         high      false  P
5    rain      cool         normal    false  P
6    rain      cool         normal    true   N
7    overcast  cool         normal    true   P
8    sunny     mild         high      false  N
9    sunny     cool         normal    false  P
10   rain      mild         normal    false  P
11   sunny     mild         normal    true   P
12   overcast  mild         high      true   P
13   overcast  hot          normal    false  P
14   rain      mild         high      true   N
The resulting decision tree (root: outlook):
outlook = sunny → test humidity: high → N, normal → P
outlook = overcast → P
outlook = rain → test windy: true → N, false → P
If outlook = overcast
∨ (outlook = sunny ∧ humidity = normal)
∨ (outlook = rain ∧ windy = false)
Then P
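The tree's decision logic can be written as a small function (a sketch; names are illustrative) and checked against all 14 training cases, which the tree classifies perfectly:

```python
def classify(outlook, humidity, windy):
    # Decision function read off the tree; temperature is never tested
    if outlook == "overcast":
        return "P"
    if outlook == "sunny":
        return "P" if humidity == "normal" else "N"
    # outlook == "rain"
    return "N" if windy else "P"

# The 14 training cases (outlook, humidity, windy, class) from the table
cases = [
    ("sunny", "high", False, "N"), ("sunny", "high", True, "N"),
    ("overcast", "high", False, "P"), ("rain", "high", False, "P"),
    ("rain", "normal", False, "P"), ("rain", "normal", True, "N"),
    ("overcast", "normal", True, "P"), ("sunny", "high", False, "N"),
    ("sunny", "normal", False, "P"), ("rain", "normal", False, "P"),
    ("sunny", "normal", True, "P"), ("overcast", "high", True, "P"),
    ("overcast", "normal", False, "P"), ("rain", "high", True, "N"),
]
assert all(classify(o, h, w) == c for o, h, w, c in cases)
```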
Given (1) a set of disjoint target classes {C1, C2, …, Ck}, and (2) a set of training data, S, containing objects of these classes, ID3 uses a series of tests to refine S into subsets that contain objects of only one class.
ID3 builds a decision tree, where non-terminal nodes correspond to tests on a single attribute of the data, and terminal nodes correspond to classified subsets of the data.
Let T be any test on a single attribute. Then T produces a partition {S1, S2, …, Sn} of S based on the outcomes O1, O2, …, On: Si = {x | T(x) = Oi}
[Diagram: the test T splits S into subsets S1, S2, …, Sn along the branches for outcomes O1, O2, …, On]
Consider a set of messages M = {m1, m2, …, mn}.
Each message mi has probability p(mi) of being received and contains an amount of information I(mi) as follows: I(mi) = −log2 p(mi)
The uncertainty (or entropy) of a message set, U(M), is the sum of the information in the possible messages weighted by their probabilities: U(M) = −Σi p(mi) log2 p(mi), for i = 1 to n
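This definition translates directly into a few lines of Python (a sketch, using the usual convention that 0·log2 0 = 0):

```python
import math

def entropy(probs):
    """U(M) = -sum over i of p(mi) * log2 p(mi), taking 0*log2(0) = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Two equally likely messages carry one bit of uncertainty
one_bit = entropy([0.5, 0.5])
```

For instance, entropy([9/14, 5/14]) reproduces the value U(S) computed for the weather data later in the chapter.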
Let Ni stand for the number of cases in S that belong to class Ci. Then the probability that a random case c belongs to class Ci is estimated to be:
p(c ∈ Ci) = Ni / |S|
Thus the amount of information in a message of class Ci is:
I(c ∈ Ci) = −log2 p(c ∈ Ci) bits
Consider the set of target classes as a message set {C1, C2, …, Ck}. The uncertainty U(S) measures the average amount of information needed to determine the class of a random case, c ∈ S, prior to partitioning by any test. Thus:
U(S) = Σi=1..k p(c ∈ Ci) × I(c ∈ Ci) bits
Consider a similar uncertainty measure after S has been partitioned into {S1, S2, …, Sn} by a test T:
UT(S) = Σi=1..n (|Si| / |S|) × U(Si)
UT(S) measures how much information is still needed to determine the class after the partitioning. Thus ID3 decides what attribute to branch on next by selecting the test T that gains the most information, i.e. maximizes GS(T) given below:
GS(T) = U(S) − UT(S)
The target classes are S = {P, N}, with p = 9 cases of P and n = 5 cases of N, so:
U(S) = −(p/(p+n))log2(p/(p+n)) − (n/(p+n))log2(n/(p+n)) = −(9/14)log2(9/14) − (5/14)log2(5/14) = 0.9403
For T = Outlook, {S1, S2, S3} = {sunny, overcast, rain}:
U(sunny) = −(2/5)log2(2/5) − (3/5)log2(3/5) = 0.971
U(overcast) = −(4/4)log2(4/4) − (0/4)log2(0/4) = 0   (taking 0·log2 0 = 0)
U(rain) = −(3/5)log2(3/5) − (2/5)log2(2/5) = 0.971
UOutlook(S) = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.6936
GS(Outlook) = U(S) − UOutlook(S) = 0.9403 − 0.6936 = 0.2467
Similarly,
UTemperature(S) = 0.9111
UHumidity(S) = 0.7885
UWindy(S) = 0.8922
GS(Temperature) = U(S) − UTemperature(S) = 0.9403 − 0.9111 = 0.0292
GS(Humidity) = U(S) − UHumidity(S) = 0.9403 − 0.7885 = 0.1518
GS(Windy) = U(S) − UWindy(S) = 0.9403 − 0.8922 = 0.0481
Thus T = Outlook has the highest information gain and is chosen as the root.