

  1. For Thursday • Read chapter 23, sections 1-3 • Homework: – Chapter 18, exercise 25, parts a and b only

  2. Program 4 • Any questions?

  3. PAC Learning • The only reasonable expectation of a learner is that with high probability it learns a close approximation to the target concept. • In the PAC model, we specify two small parameters, ε and δ, and require that with probability at least (1 − δ) a system learn a concept with error at most ε.

  4. Version Space • The set of all hypotheses consistent with a set of examples, bounded by the most general and most specific such generalizations.

  5. Consistent Learners • A learner L using a hypothesis space H and training data D is said to be a consistent learner if it always outputs a hypothesis with zero error on D whenever H contains such a hypothesis. • By definition, a consistent learner must produce a hypothesis in the version space for H given D. • Therefore, to bound the number of examples needed by a consistent learner, we just need to bound the number of examples needed to ensure that the version space contains no hypotheses with unacceptably high error.

  6. ε-Exhausted Version Space • The version space, VS_H,D, is said to be ε-exhausted iff every hypothesis in it has true error less than or equal to ε. • In other words, there are enough training examples to guarantee that any consistent hypothesis has error at most ε. • One can never be sure that the version space is ε-exhausted, but one can bound the probability that it is not. • Theorem 7.1 (Haussler, 1988): If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent random examples for some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space VS_H,D is not ε-exhausted is less than or equal to: |H|e^(−εm)

  7. Sample Complexity Analysis • Let δ be an upper bound on the probability of not exhausting the version space. So:
     P(consistent hypothesis in (H, D) is bad) ≤ |H|e^(−εm) ≤ δ
     e^(−εm) ≤ δ/|H|
     −εm ≤ ln(δ/|H|)
     m ≥ −ln(δ/|H|)/ε  (flip inequality)
     m ≥ ln(|H|/δ)/ε
     m ≥ (1/ε)(ln|H| + ln(1/δ))

  8. Sample Complexity Result • Therefore, any consistent learner, given at least: m = (1/ε)(ln|H| + ln(1/δ)) examples will produce a result that is PAC. • Just need to determine the size of a hypothesis space to instantiate this result for learning specific classes of concepts. • This gives a sufficient number of examples for PAC learning, but not a necessary number. Several approximations, like that used to bound the probability of a disjunction, make this a gross over-estimate in practice.
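The bound above is easy to evaluate directly. A minimal sketch (the function name is illustrative; natural logarithms are assumed, and the result is rounded up to a whole number of examples):

```python
import math

def pac_sample_bound(h_size, epsilon, delta):
    # Sufficient (not necessary) number of examples for a consistent learner:
    # m >= (1/epsilon) * (ln|H| + ln(1/delta))
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

# e.g. |H| = 3**10 (conjunctions over 10 boolean features), epsilon = delta = 0.05
print(pac_sample_bound(3 ** 10, 0.05, 0.05))  # → 280
```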

  9. Sample Complexity of Conjunction Learning • Consider conjunctions over n boolean features. There are 3^n of these, since each feature can appear positively, appear negatively, or not appear in a given conjunction. Therefore |H| = 3^n, so a sufficient number of examples to learn a PAC concept is: m = (1/ε)(ln(1/δ) + n ln 3) • Concrete examples: – δ=ε=0.05, n=10 gives 280 examples – δ=0.01, ε=0.05, n=10 gives 312 examples – δ=ε=0.01, n=10 gives 1,560 examples – δ=ε=0.01, n=50 gives 5,954 examples • Result holds for any consistent learner.
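The concrete numbers above can be reproduced from the formula; a sketch (the helper name is illustrative, rounding up to whole examples):

```python
import math

def conjunction_bound(n, epsilon, delta):
    # m >= (1/epsilon) * (ln(1/delta) + n * ln 3), since |H| = 3**n
    return math.ceil((math.log(1 / delta) + n * math.log(3)) / epsilon)

for delta, eps, n in [(0.05, 0.05, 10), (0.01, 0.05, 10),
                      (0.01, 0.01, 10), (0.01, 0.01, 50)]:
    print(n, conjunction_bound(n, eps, delta))
# → 280, 312, 1560, and 5954 examples, matching the list above
```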

  10. Sample Complexity of Learning Arbitrary Boolean Functions • Consider any boolean function over n boolean features, such as the hypothesis space of DNF or decision trees. There are 2^(2^n) of these, so a sufficient number of examples to learn a PAC concept is: m = (1/ε)(ln(1/δ) + 2^n ln 2) • Concrete examples: – δ=ε=0.05, n=10 gives 14,256 examples – δ=ε=0.05, n=20 gives 14,536,410 examples – δ=ε=0.05, n=50 gives 1.561×10^16 examples
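The doubly exponential size of this hypothesis space is what drives the blow-up; a sketch of the same calculation (helper name illustrative):

```python
import math

def boolean_fn_bound(n, epsilon, delta):
    # m >= (1/epsilon) * (ln(1/delta) + 2**n * ln 2), since |H| = 2**(2**n)
    return math.ceil((math.log(1 / delta) + (2 ** n) * math.log(2)) / epsilon)

print(boolean_fn_bound(10, 0.05, 0.05))  # → 14256: already impractical at n = 10
```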

  11. COLT Conclusions • The PAC framework provides a theoretical basis for analyzing the effectiveness of learning algorithms. • The sample complexity for any consistent learner using some hypothesis space, H, can be determined from a measure of its expressiveness, |H| or VC(H), quantifying bias and relating it to generalization. • If sample complexity is tractable, then the computational complexity of finding a consistent hypothesis in H governs its PAC learnability. • Constant factors are more important in sample complexity than in computational complexity, since our ability to gather data is generally not growing exponentially. • Experimental results suggest that theoretical sample complexity bounds over-estimate the number of training instances needed in practice, since they are worst-case upper bounds.

  12. COLT Conclusions (cont) • Additional results produced for analyzing: – Learning with queries – Learning with noisy data – Average case sample complexity given assumptions about the data distribution. – Learning finite automata – Learning neural networks • Analyzing practical algorithms that use a preference bias is difficult. • Some effective practical algorithms motivated by theoretical results: – Winnow – Boosting – Support Vector Machines (SVM)

  13. Beyond a Single Learner • Ensembles of learners work better than individual learning algorithms • Several possible ensemble approaches: – Ensembles created by using different learning methods and voting – Bagging – Boosting

  14. Bagging • Train each member of the ensemble on a different random selection of the training examples. • Seems to work fairly well, but offers no real guarantees.

  15. Boosting • The most widely used ensemble method. • Based on the concept of a weighted training set. • Works especially well with weak learners. • Start with all example weights at 1. • Learn a hypothesis from the weighted examples. • Increase the weights of all misclassified examples and decrease the weights of all correctly classified examples. • Learn a new hypothesis. • Repeat.
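The loop above can be sketched on 1-D data with a trivial threshold "stump" as the weak learner. The doubling/halving weight update and all names here are illustrative simplifications; real boosting algorithms such as AdaBoost derive the reweighting factors from each hypothesis's weighted error:

```python
def learn_stump(xs, ys, ws):
    """Weak learner: pick the 1-D threshold and sign minimizing weighted error."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            h = lambda x, t=t, s=sign: s if x > t else -s
            err = sum(w for x, y, w in zip(xs, ys, ws) if h(x) != y)
            if best is None or err < best[0]:
                best = (err, h)
    return best[1]

def boost(xs, ys, rounds=5):
    ws = [1.0] * len(xs)            # start with all weights at 1
    hyps = []
    for _ in range(rounds):
        h = learn_stump(xs, ys, ws)  # learn a hypothesis from the weighted examples
        hyps.append(h)
        for i, (x, y) in enumerate(zip(xs, ys)):
            ws[i] *= 2.0 if h(x) != y else 0.5  # raise misclassified, lower correct
    # final prediction: vote of the learned hypotheses
    return lambda x: 1 if sum(h(x) for h in hyps) > 0 else -1

ens = boost([1, 2, 3, 4, 5, 6], [-1, -1, -1, 1, 1, 1])
print([ens(x) for x in [1, 2, 3, 4, 5, 6]])  # → [-1, -1, -1, 1, 1, 1]
```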

  16. Why Neural Networks?

  17. Why Neural Networks? • Analogy to biological systems, the best examples we have of robust learning systems. • Models of biological systems allowing us to understand how they learn and adapt. • Massive parallelism that allows for computational efficiency. • Graceful degradation due to distributed representations that spread knowledge representation over large numbers of computational units. • Intelligent behavior is an emergent property from large numbers of simple units rather than resulting from explicit symbolically encoded rules.

  18. Neural Speed Constraints • Neuron “switching time” is on the order of milliseconds compared to nanoseconds for current transistors. • A factor of a million difference in speed. • However, biological systems can perform significant cognitive tasks (vision, language understanding) in seconds or tenths of seconds.

  19. What That Means • Therefore, there is only time for about a hundred serial steps to perform such tasks. • Even with limited abilities, current AI systems require orders of magnitude more serial steps. • The human brain has approximately 10^11 neurons, each connected on average to 10^4 others, so it must exploit massive parallelism.

  20. Real Neurons • Cells forming the basis of neural tissue – Cell body – Dendrites – Axon – Synaptic terminals • The electrical potential across the cell membrane exhibits spikes called action potentials. • Originating in the cell body, this spike travels down the axon and causes chemical neurotransmitters to be released at the synaptic terminals. • This chemical diffuses across the synapse into the dendrites of neighboring cells.

  21. Real Neurons (cont) • Synapses can be excitatory or inhibitory. • Size of the synaptic terminal influences the strength of the connection. • Cells “add up” the incoming chemical messages from all neighboring cells, and if the net positive influence exceeds a threshold, they “fire” and emit an action potential.

  22. Model Neuron (Linear Threshold Unit) • Neuron modelled by a unit (j) connected by weights, w_ji, to other units (i). • Net input to a unit is defined as: net_j = Σ_i w_ji · o_i • Output of a unit is a threshold function on the net input: – 1 if net_j > T_j – 0 otherwise
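A minimal sketch of the unit, assuming the weights and threshold are given (function name illustrative):

```python
def ltu_output(weights, inputs, threshold):
    # net_j = sum_i w_ji * o_i; output is 1 iff net_j exceeds the threshold T_j
    net = sum(w * o for w, o in zip(weights, inputs))
    return 1 if net > threshold else 0

print(ltu_output([0.6, 0.6], [1, 1], 1.0))  # net = 1.2 > 1.0 → 1
print(ltu_output([0.6, 0.6], [1, 0], 1.0))  # net = 0.6 ≤ 1.0 → 0
```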

  23. Neural Computation • McCulloch and Pitts (1943) showed how linear threshold units can be used to compute logical functions. • Can build basic logic gates: – AND: let all w_ji be T_j/n + ε, where n = number of inputs – OR: let all w_ji be T_j + ε – NOT: let one input be a constant 1 with weight T_j + ε, and give the input to be inverted weight −T_j
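The three weight recipes can be checked against their truth tables; the choices T = 1, ε = 0.1, and two inputs are arbitrary illustrative settings:

```python
def ltu(weights, inputs, threshold):
    # linear threshold unit: fire (1) iff the weighted sum exceeds the threshold
    return 1 if sum(w * o for w, o in zip(weights, inputs)) > threshold else 0

T, eps, n = 1.0, 0.1, 2

def AND(a, b):   # all weights T/n + eps
    return ltu([T / n + eps] * n, [a, b], T)

def OR(a, b):    # all weights T + eps
    return ltu([T + eps] * n, [a, b], T)

def NOT(a):      # constant-1 input with weight T + eps; inverted input weight -T
    return ltu([T + eps, -T], [1, a], T)

print([AND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # → [0, 0, 0, 1]
print([OR(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # → [0, 1, 1, 1]
print([NOT(a) for a in (0, 1)])                                  # → [1, 0]
```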

  24. Neural Computation (cont) • Can build arbitrary logic circuits, finite-state machines, and computers given these basic gates. • Given negated inputs, two layers of linear threshold units can specify any boolean function as a two-layer AND-OR network.

  25. Learning • Hebb (1949) suggested if two units are both active (firing) then the weight between them should increase: w_ji = w_ji + η · o_j · o_i – η is a constant called the learning rate – Supported by physiological evidence
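A one-line sketch of the update; η = 0.1 is an arbitrary illustrative choice:

```python
def hebb_update(w, o_j, o_i, eta=0.1):
    # Hebbian rule: w_ji <- w_ji + eta * o_j * o_i
    return w + eta * o_j * o_i

w = 0.5
w = hebb_update(w, 1, 1)  # both units active, so the weight grows
print(w)  # → 0.6
w = hebb_update(w, 1, 0)  # one unit inactive, so the weight is unchanged
print(w)  # → 0.6
```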
