

SLIDE 1

For Thursday

  • Read chapter 23, sections 1-3
  • Homework:

– Chapter 18, exercise 25, parts a and b only

SLIDE 2

Program 4

  • Any questions?
SLIDE 3

PAC Learning

  • The only reasonable expectation of a learner is that with high probability it learns a close approximation to the target concept.
  • In the PAC model, we specify two small parameters, ε and δ, and require that with probability at least (1 − δ) a system learn a concept with error at most ε.

SLIDE 4

Version Space

  • Bounds on the generalizations of a set of examples

SLIDE 5

Consistent Learners

  • A learner L using a hypothesis space H and training data D is said to be a consistent learner if it always outputs a hypothesis with zero error on D whenever H contains such a hypothesis.
  • By definition, a consistent learner must produce a hypothesis in the version space for H given D.
  • Therefore, to bound the number of examples needed by a consistent learner, we just need to bound the number of examples needed to ensure that the version space contains no hypotheses with unacceptably high error.

SLIDE 6

ε-Exhausted Version Space

  • The version space, VS(H,D), is said to be ε-exhausted iff every hypothesis in it has true error less than or equal to ε.
  • In other words, there are enough training examples to guarantee that any consistent hypothesis has error at most ε.
  • One can never be sure that the version space is ε-exhausted, but one can bound the probability that it is not.
  • Theorem 7.1 (Haussler, 1988): If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples for some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space VS(H,D) is not ε-exhausted is less than or equal to |H|e^(−εm).

SLIDE 7

Sample Complexity Analysis

  • Let δ be an upper bound on the probability of not exhausting the version space. So:

    P(VS(H,D) not ε-exhausted) ≤ |H|e^(−εm) ≤ δ
    e^(−εm) ≤ δ/|H|
    −εm ≤ ln(δ/|H|)              (take logs)
    m ≥ (1/ε) ln(|H|/δ)          (flip inequality)
    m ≥ (1/ε)(ln|H| + ln(1/δ))

SLIDE 8

Sample Complexity Result

  • Therefore, any consistent learner, given at least

    m ≥ (1/ε)(ln|H| + ln(1/δ))

    examples will produce a result that is PAC.
  • Just need to determine the size of a hypothesis space to instantiate this result for learning specific classes of concepts.
  • This gives a sufficient number of examples for PAC learning, but not a necessary number. Several approximations, like that used to bound the probability of a disjunction, make this a gross over-estimate in practice.

SLIDE 9

Sample Complexity of Conjunction Learning

  • Consider conjunctions over n boolean features. There are 3^n of these, since each feature can appear positively, appear negatively, or not appear in a given conjunction. Therefore |H| = 3^n, so a sufficient number of examples to learn a PAC concept is:

    m ≥ (1/ε)(n ln 3 + ln(1/δ))

  • Concrete examples:
    – δ=ε=0.05, n=10 gives 280 examples
    – δ=0.01, ε=0.05, n=10 gives 312 examples
    – δ=ε=0.01, n=10 gives 1,560 examples
    – δ=ε=0.01, n=50 gives 5,954 examples
  • Result holds for any consistent learner.
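The bound above is easy to evaluate directly. A minimal sketch (the function name `pac_sample_bound` is ours, not from the slides) that reproduces the concrete numbers on this slide:

```python
import math

def pac_sample_bound(eps, delta, ln_H):
    """Sufficient number of examples for any consistent learner:
    m >= (1/eps) * (ln|H| + ln(1/delta)).  Worst-case upper bound."""
    return math.ceil((ln_H + math.log(1.0 / delta)) / eps)

# Conjunctions over n boolean features: |H| = 3^n, so ln|H| = n*ln(3).
print(pac_sample_bound(0.05, 0.05, 10 * math.log(3)))   # 280
print(pac_sample_bound(0.05, 0.01, 10 * math.log(3)))   # 312
print(pac_sample_bound(0.01, 0.01, 10 * math.log(3)))   # 1560
print(pac_sample_bound(0.01, 0.01, 50 * math.log(3)))   # 5954
```

Passing ln|H| rather than |H| itself keeps the function usable even when |H| is astronomically large (e.g. |H| = 2^(2^n) on the next slide).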

SLIDE 10

Sample Complexity of Learning Arbitrary Boolean Functions

  • Consider any boolean function over n boolean features, such as the hypothesis space of DNF formulas or decision trees. There are 2^(2^n) of these, so a sufficient number of examples to learn a PAC concept is:

    m ≥ (1/ε)(2^n ln 2 + ln(1/δ))

  • Concrete examples:
    – δ=ε=0.05, n=10 gives 14,256 examples
    – δ=ε=0.05, n=20 gives 14,536,410 examples
    – δ=ε=0.05, n=50 gives 1.561×10^16 examples

SLIDE 11

COLT Conclusions

  • The PAC framework provides a theoretical basis for analyzing the effectiveness of learning algorithms.
  • The sample complexity for any consistent learner using some hypothesis space, H, can be determined from a measure of its expressiveness, |H| or VC(H), quantifying bias and relating it to generalization.
  • If sample complexity is tractable, then the computational complexity of finding a consistent hypothesis in H governs its PAC learnability.
  • Constant factors are more important in sample complexity than in computational complexity, since our ability to gather data is generally not growing exponentially.
  • Experimental results suggest that theoretical sample complexity bounds over-estimate the number of training instances needed in practice, since they are worst-case upper bounds.

SLIDE 12

COLT Conclusions (cont.)

  • Additional results produced for analyzing:
    – Learning with queries
    – Learning with noisy data
    – Average-case sample complexity given assumptions about the data distribution
    – Learning finite automata
    – Learning neural networks
  • Analyzing practical algorithms that use a preference bias is difficult.
  • Some effective practical algorithms motivated by theoretical results:
    – Winnow
    – Boosting
    – Support Vector Machines (SVM)

SLIDE 13

Beyond a Single Learner

  • Ensembles of learners often work better than individual learning algorithms.
  • Several possible ensemble approaches:
    – Ensembles created by using different learning methods and voting
    – Bagging
    – Boosting

SLIDE 14

Bagging

  • Random selections of examples are used to train the various members of the ensemble.
  • Seems to work fairly well, but no real guarantees.
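The idea can be sketched in a few lines; `learn` here stands for any base learning method, and all the names are ours, not from the slides:

```python
import random

def bagging_ensemble(examples, learn, k, rng=None):
    """Train k ensemble members, each on a bootstrap sample of the
    training set (drawn with replacement, same size as the original)."""
    rng = rng or random.Random(0)
    n = len(examples)
    return [learn([examples[rng.randrange(n)] for _ in range(n)])
            for _ in range(k)]

def vote(ensemble, x):
    """Classify x by majority vote over the ensemble's predictions."""
    preds = [h(x) for h in ensemble]
    return max(set(preds), key=preds.count)
```

Because each bootstrap sample omits some examples and repeats others, the members differ slightly, and voting smooths out their individual errors.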

SLIDE 15

Boosting

  • Most used ensemble method.
  • Based on the concept of a weighted training set.
  • Works especially well with weak learners.
  • Start with all weights at 1.
  • Learn a hypothesis from the weighted training set.
  • Increase the weights of all misclassified examples and decrease the weights of all correctly classified examples.
  • Learn a new hypothesis.
  • Repeat.
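The loop above can be sketched directly. This is the generic reweighting scheme the slide describes, not the specific AdaBoost update; `weak_learn` and the reweighting `factor` are illustrative assumptions:

```python
def boost(examples, weak_learn, rounds, factor=2.0):
    """Boosting sketch: keep one weight per example, learn a hypothesis
    from the weighted set, then raise the weights of misclassified
    examples and lower the weights of correct ones.  Repeat."""
    weights = [1.0] * len(examples)          # start with all weights at 1
    ensemble = []
    for _ in range(rounds):
        h = weak_learn(examples, weights)    # hypothesis from weighted set
        for i, (x, t) in enumerate(examples):
            if h(x) != t:
                weights[i] *= factor         # misclassified: increase
            else:
                weights[i] /= factor         # correct: decrease
        ensemble.append(h)
    return ensemble, weights
```

Each round forces the weak learner to concentrate on the examples previous hypotheses got wrong.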
SLIDE 16

Why Neural Networks?

SLIDE 17

Why Neural Networks?

  • Analogy to biological systems, the best examples we have of robust learning systems.
  • Models of biological systems allow us to understand how they learn and adapt.
  • Massive parallelism allows for computational efficiency.
  • Graceful degradation due to distributed representations that spread knowledge representation over large numbers of computational units.
  • Intelligent behavior is an emergent property of large numbers of simple units rather than resulting from explicit symbolically encoded rules.

SLIDE 18

Neural Speed Constraints

  • Neuron “switching time” is on the order of milliseconds, compared to nanoseconds for current transistors.
  • A factor of a million difference in speed.
  • However, biological systems can perform significant cognitive tasks (vision, language understanding) in seconds or tenths of seconds.

SLIDE 19

What That Means

  • Therefore, there is only time for about a hundred serial steps in performing such tasks.
  • Even with limited abilities, current AI systems require orders of magnitude more serial steps.
  • The human brain has approximately 10^11 neurons, each connected on average to 10^4 others, and therefore must exploit massive parallelism.

SLIDE 20

Real Neurons

  • Cells forming the basis of neural tissue:
    – Cell body
    – Dendrites
    – Axon
    – Synaptic terminals
  • The electrical potential across the cell membrane exhibits spikes called action potentials.
  • Originating in the cell body, this spike travels down the axon and causes chemical neurotransmitters to be released at synaptic terminals.
  • This chemical diffuses across the synapse into the dendrites of neighboring cells.

SLIDE 21

Real Neurons (cont.)

  • Synapses can be excitatory or inhibitory.
  • Size of the synaptic terminal influences the strength of the connection.
  • Cells “add up” the incoming chemical messages from all neighboring cells and, if the net positive influence exceeds a threshold, they “fire” and emit an action potential.

SLIDE 22

Model Neuron (Linear Threshold Unit)

  • Neuron modelled by a unit (j) connected by weights, wji, to other units (i).
  • Net input to a unit is defined as:

    netj = Σi wji · oi

  • Output of a unit is a threshold function on the net input:
    – 1 if netj > Tj
    – 0 otherwise

SLIDE 23

Neural Computation

  • McCulloch and Pitts (1943) showed how linear threshold units can be used to compute logical functions.
  • Can build basic logic gates:
    – AND: let all wji be (Tj/n) + ε, where n = number of inputs
    – OR: let all wji be Tj + ε
    – NOT: let one input be a constant 1 with weight Tj + ε, and let the input to be inverted have weight −Tj
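These constructions can be checked with a small threshold unit. A sketch (the specific T and ε values are arbitrary choices; any T > ε > 0 works):

```python
def ltu(weights, threshold):
    """Linear threshold unit: output 1 iff sum_i w_i * o_i > T."""
    return lambda inputs: int(
        sum(w * o for w, o in zip(weights, inputs)) > threshold)

T, eps = 1.0, 0.1
AND = ltu([T / 2 + eps] * 2, T)   # all weights (T/n) + eps, n = 2 inputs
OR  = ltu([T + eps] * 2, T)       # all weights T + eps
NOT = lambda x: ltu([T + eps, -T], T)((1, x))  # constant-1 input, inverted input
```

For AND, all n inputs active gives net input T + nε > T, while any fewer falls short; for OR, a single active input already exceeds T.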
SLIDE 24

Neural Computation (cont.)

  • Can build arbitrary logic circuits, finite-state machines, and computers given these basic gates.
  • Given negated inputs, two layers of linear threshold units can specify any boolean function using a two-layer AND-OR network.

SLIDE 25

Learning

  • Hebb (1949) suggested that if two units are both active (firing), then the weight between them should increase:

    wji = wji + η oj oi

    – η is a constant called the learning rate
    – Supported by physiological evidence

SLIDE 26

Alternate Learning Rule

  • Rosenblatt (1959) suggested that if a target output value is provided for a single neuron with fixed inputs, we can incrementally change the weights to learn to produce these outputs using the perceptron learning rule.
    – Assumes binary-valued inputs/outputs
    – Assumes a single linear threshold unit
    – Assumes input features are detected by fixed networks

SLIDE 27

Perceptron Learning Rule

  • If the target output for output unit j is tj:

    wji = wji + η(tj − oj)oi

  • Equivalent to the intuitive rules:
    – If the output is correct, don't change the weights.
    – If the output is low (oj = 0, tj = 1), increment the weights for all inputs which are 1.
    – If the output is high (oj = 1, tj = 0), decrement the weights for all inputs which are 1.
  • Must also adjust the threshold:

    Tj = Tj − η(tj − oj)

  • Or equivalently, assume there is a weight wj0 = −Tj for an extra input unit 0 that has constant output o0 = 1, so that the threshold is always 0.

SLIDE 28

Perceptron Learning Algorithm

  • Repeatedly iterate through the examples, adjusting weights according to the perceptron learning rule, until all outputs are correct:

    Initialize the weights to all zero (or randomly)
    Until outputs for all training examples are correct
        For each training example, e, do
            Compute the current output oj
            Compare it to the target tj and update the weights
            according to the perceptron learning rule
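A minimal implementation of this loop, using the bias-weight trick from slide 27 (the function names are ours):

```python
def train_perceptron(examples, eta=0.1, max_epochs=100):
    """Perceptron learning: iterate through the examples, applying
    wji += eta * (t - o) * oi, until an epoch produces no errors.
    The threshold is folded in as weight w[0] on a constant input 1."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                       # w[0] plays the role of -T
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            x1 = (1,) + tuple(x)              # prepend the constant input
            o = int(sum(wi * xi for wi, xi in zip(w, x1)) > 0)
            if o != t:
                errors += 1
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x1)]
        if errors == 0:                       # all outputs correct: converged
            break
    return w

def classify(w, x):
    x1 = (1,) + tuple(x)
    return int(sum(wi * xi for wi, xi in zip(w, x1)) > 0)
```

On linearly separable data such as boolean AND, this reaches a perfect epoch within a handful of passes.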

SLIDE 29

Algorithm Notes

  • Each execution of the outer loop is called an epoch.
  • If the output is considered as concept membership and the inputs as binary input features, then this is easily applied to concept learning problems.
  • For multiple-category problems, learn a separate perceptron for each category and assign to the class whose perceptron most exceeds its threshold.
  • When will this algorithm terminate (converge)?
SLIDE 30

Representational Limitations

  • Perceptrons can only represent linear threshold functions, and can therefore only learn data which is linearly separable (positive and negative examples are separable by a hyperplane in n-dimensional space).
  • Cannot represent exclusive-or (xor).
SLIDE 31

Perceptron Learnability

  • A system obviously cannot learn what it cannot represent.
  • Minsky and Papert (1969) demonstrated that many functions, like parity (the n-input generalization of xor), could not be represented.
  • In visual pattern recognition, they assumed that input features are local and extract features within a fixed radius. In that case, no such input features support learning:
    – Symmetry
    – Connectivity
  • These limitations discouraged subsequent research on neural networks.

SLIDE 32

Perceptron Convergence and Cycling Theorems

  • Perceptron Convergence Theorem: If there is a set of weights that is consistent with the training data (i.e. the data is linearly separable), the perceptron learning algorithm will converge (Minsky & Papert, 1969).
  • Perceptron Cycling Theorem: If the training data is not linearly separable, the perceptron learning algorithm will eventually repeat the same set of weights and threshold at the end of some epoch, and therefore enter an infinite loop.

SLIDE 33

Perceptron Learning as Hill Climbing

  • The search space for perceptron learning is the space of possible values for the weights (and threshold).
  • The evaluation metric is the error these weights produce when used to classify the training examples.
  • The perceptron learning algorithm performs a form of hill-climbing (gradient descent), at each point altering the weights slightly in a direction that helps minimize this error.
  • The perceptron convergence theorem guarantees that for the linearly separable case there is only one local minimum and the space is well behaved.

SLIDE 34

Perceptron Performance

  • Can represent and learn conjunctive concepts and M-of-N concepts (true if at least M of a set of N selected binary features are true).
  • Although simple and restrictive, this high-bias algorithm performs quite well on many realistic problems.
  • However, the representational restriction is limiting in many applications.

SLIDE 35

Multi-Layer Neural Networks

  • Multi-layer networks can represent arbitrary functions, but building an effective learning method for such networks was thought to be difficult.
  • Generally, networks are composed of an input layer, hidden layer, and output layer, and activation feeds forward from input to output.
  • Patterns of activation are presented at the inputs and the resulting activation of the outputs is computed.
  • The values of the weights determine the function computed.
  • A network with one hidden layer with a sufficient number of units can represent any boolean function.

SLIDE 36

Basic Problem

  • The general approach to the learning algorithm is to apply gradient descent.
  • However, for the general case, we need to be able to differentiate the function computed by a unit, and the standard threshold function is not differentiable at the threshold.

SLIDE 37

Differentiable Threshold Unit

  • Need some sort of non-linear output function to allow computation of arbitrary functions by multi-layer networks (a multi-layer network of linear units can still only represent a linear function).
  • Solution: use a nonlinear, differentiable output function such as the sigmoid or logistic function:

    oj = 1/(1 + e^−(netj − Tj))

  • Can also use other functions such as tanh or a Gaussian.

SLIDE 38

Error Measure

  • Since there are multiple continuous outputs, we can define an overall error measure:

    E(W) = 1/2 Σd∈D Σk∈K (tkd − okd)^2

    where D is the set of training examples, K is the set of output units, tkd is the target output for the kth unit given input d, and okd is the network output for the kth unit given input d.

SLIDE 39

Gradient Descent

  • The derivative of the output of a sigmoid unit with respect to its net input is:

    ∂oj/∂netj = oj(1 − oj)

  • This can be used to derive a learning rule which performs gradient descent in weight space in an attempt to minimize the error function:

    Δwji = −η (∂E/∂wji)

SLIDE 40

Backpropagation Learning Rule

  • Each weight wji is changed by:

    Δwji = η δj oi

    δj = oj(1 − oj)(tj − oj)          if j is an output unit
    δj = oj(1 − oj) Σk δk wkj         otherwise

    where η is a constant called the learning rate, tj is the correct output for unit j, and δj is an error measure for unit j.

  • First determine the error for the output units, then backpropagate this error layer by layer through the network, changing weights appropriately at each layer.

SLIDE 41

Backpropagation Learning Algorithm

  • Create a three-layer network with N hidden units, fully connecting input units to hidden units and hidden units to output units, with small random weights. Until all examples produce the correct output within ε, or the mean-squared error ceases to decrease (or other termination criteria):

    Begin epoch
        For each example in the training set do:
            Compute the network output for this example.
            Compute the error between this output and the correct output.
            Backpropagate this error and adjust weights to decrease it.
    End epoch

  • Since continuous outputs only approach 0 or 1 in the limit, we must allow for some ε-approximation to learn binary functions.
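A compact sketch of this algorithm for a 2-2-1 sigmoid network learning xor (the network size, learning rate, and epoch count are illustrative choices; thresholds are handled as bias weights):

```python
import math
import random

def train_xor(epochs=5000, eta=0.5, seed=1):
    """Backpropagation sketch: 2 inputs, 2 hidden sigmoid units, 1 output.
    Each unit's weight list starts with a bias term (threshold trick)."""
    rng = random.Random(seed)
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    wh = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
    wo = [rng.uniform(-1, 1) for _ in range(3)]
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    for _ in range(epochs):
        for (x1, x2), t in data:
            h = [sig(w[0] + w[1] * x1 + w[2] * x2) for w in wh]
            o = sig(wo[0] + wo[1] * h[0] + wo[2] * h[1])
            do = o * (1 - o) * (t - o)                # output-unit delta
            dh = [h[j] * (1 - h[j]) * do * wo[j + 1]  # backpropagated deltas
                  for j in range(2)]
            wo = [wo[0] + eta * do,
                  wo[1] + eta * do * h[0],
                  wo[2] + eta * do * h[1]]
            for j in range(2):
                wh[j] = [wh[j][0] + eta * dh[j],
                         wh[j][1] + eta * dh[j] * x1,
                         wh[j][2] + eta * dh[j] * x2]
    def net(x1, x2):
        h = [sig(w[0] + w[1] * x1 + w[2] * x2) for w in wh]
        return sig(wo[0] + wo[1] * h[0] + wo[2] * h[1])
    return net
```

With most random initializations this drives the summed squared error toward 0, though (as the next slide notes) convergence is not guaranteed.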

SLIDE 42

Comments on Training

  • There is no guarantee of convergence; the network may oscillate or reach a local minimum.
  • However, in practice many large networks can be adequately trained on large amounts of data for realistic problems.
  • Many epochs (thousands) may be needed for adequate training; large data sets may require hours or days of CPU time.
  • Termination criteria can be:
    – Fixed number of epochs
    – Threshold on training set error

SLIDE 43

Representational Power

Multi-layer sigmoidal networks are very expressive.

  • Boolean functions: Any boolean function can be represented by a two-layer network by simulating a two-layer AND-OR network. But the number of required hidden units can grow exponentially in the number of inputs.
  • Continuous functions: Any bounded continuous function can be approximated with arbitrarily small error by a two-layer network. Sigmoid functions provide a set of basis functions from which arbitrary functions can be composed, just as any function can be represented by a sum of sine waves in Fourier analysis.
  • Arbitrary functions: Any function can be approximated to arbitrary accuracy by a three-layer network.

SLIDE 44

Sample Learned XOR Network

Hidden unit A represents ¬(X ∧ Y)
Hidden unit B represents ¬(X ∨ Y)
Output O represents: A ∧ ¬B = ¬(X ∧ Y) ∧ (X ∨ Y) = X ⊕ Y

[Network diagram with the learned weight values omitted]
SLIDE 45

Hidden Unit Representations

  • Trained hidden units can be seen as newly constructed features that re-represent the examples so that they are linearly separable.
  • On many real problems, hidden units can end up representing interesting recognizable features such as vowel-detectors, edge-detectors, etc.
  • However, particularly with many hidden units, they become more “distributed” and are hard to interpret.

SLIDE 46

Input/Output Coding

  • Appropriate coding of inputs and outputs can make the learning problem easier and improve generalization.
  • Best to encode each binary feature as a separate input unit, and for multi-valued features to include one binary unit per value, rather than trying to encode input information in fewer units using binary coding or continuous values.

SLIDE 47

I/O Coding (cont.)

  • Continuous inputs can be handled by a single input unit by scaling them between 0 and 1.
  • For disjoint categorization problems, it is best to have one output unit per category rather than encoding n categories into log n bits. Continuous output values then represent certainty in the various categories. Assign test cases to the category with the highest output.
  • Continuous outputs (regression) can also be handled by scaling between 0 and 1.

SLIDE 48

Neural Net Conclusions

  • Learned concepts can be represented by networks of linear threshold units and trained using gradient descent.
  • Analogy to the brain and numerous successful applications have generated significant interest.
  • Generally much slower to train than other learning methods, but explores a rich hypothesis space that seems to work well in many domains.
  • Potential to model biological and cognitive phenomena and increase our understanding of real neural systems.
    – Backprop itself is not very biologically plausible.

SLIDE 49

Natural Language Processing

  • What’s the goal?
SLIDE 50

Communication

  • Communication for the speaker:
    – Intention: Deciding why, when, and what information should be transmitted. May require planning and reasoning about agents' goals and beliefs.
    – Generation: Translating the information to be communicated into a string of words.
    – Synthesis: Output of the string in the desired modality, e.g. text on a screen or speech.

SLIDE 51

Communication (cont.)

  • Communication for the hearer:
    – Perception: Mapping the input modality to a string of words, e.g. optical character recognition or speech recognition.
    – Analysis: Determining the information content of the string.
        • Syntactic interpretation (parsing): Find the correct parse tree showing the phrase structure.
        • Semantic interpretation: Extract the (literal) meaning of the string in some representation, e.g. FOPC.
        • Pragmatic interpretation: Consider the effect of the overall context on the meaning of the sentence.
    – Incorporation: Decide whether or not to believe the content of the string and add it to the KB.

SLIDE 52

Ambiguity

  • Natural language sentences are highly ambiguous and must be disambiguated.

    I saw the man on the hill with the telescope.
    I saw the Grand Canyon flying to LA.
    I saw a jet flying to LA.
    Time flies like an arrow.
    Horse flies like a sugar cube.
    Time runners like a coach.
    Time cars like a Porsche.

SLIDE 53

Syntax

  • Syntax concerns the proper ordering of words and its effect on meaning.

    The dog bit the boy.
    The boy bit the dog.
    * Bit boy the dog the.
    Colorless green ideas sleep furiously.

SLIDE 54

Semantics

  • Semantics concerns the meaning of words, phrases, and sentences. Generally restricted to “literal meaning”:
    – “plant” as a photosynthetic organism
    – “plant” as a manufacturing facility
    – “plant” as the act of sowing

SLIDE 55

Pragmatics

  • Pragmatics concerns the overall communicative and social context and its effect on interpretation.

    – Can you pass the salt?
    – Passerby: Does your dog bite?
      Clouseau: No.
      Passerby: (pets dog) Chomp! I thought you said your dog didn't bite!!
      Clouseau: That, sir, is not my dog!

SLIDE 56

Modular Processing

[Pipeline diagram: sound waves → (speech recognition: acoustic/phonetic) → words → (parsing: syntax) → parse trees → (semantics) → literal meaning → (pragmatics) → meaning]

SLIDE 57

Examples

  • Phonetics
    “grey twine” vs. “great wine”
    “youth in Asia” vs. “euthanasia”
    “yawanna” -> “do you want to”
  • Syntax
    I ate spaghetti with a fork.
    I ate spaghetti with meatballs.

SLIDE 58

More Examples

  • Semantics
    I put the plant in the window.
    Ford put the plant in Mexico.
    The dog is in the pen.
    The ink is in the pen.
  • Pragmatics
    The ham sandwich wants another beer.
    John thinks vanilla.