CS 678 Machine Learning Lecture Notes

1 Week 1 - chapter 1 and probability

1.1 General syllabus

What do students know? (prog. lang., stats, math, calculus)

1.2 machine learning

1.2.1 general concepts

• example of predicting basketball players (height and speed)
• detecting patterns or regularities
• application of ML to large databases is data mining
• pattern recognition (face recognition, fingerprint, character, etc.)
• combines math, statistics and computer science

1.2.2 examples of ML

• learning associations
• classification
  – classes
  – discriminant, prediction
  – OCR, face recognition, medical diagnosis, speech recognition
  – knowledge extraction, compression, outlier detection
• regression

1.3 probability

• events, probability and sample space
• axioms
  – $0 \le P(E) \le 1$
  – $P(S) = 1$; example:
    ∗ $E_1$ = {die shows 1}
    ∗ $S = E_1 \cup E_2 \cup E_3 \cup E_4 \cup E_5 \cup E_6$
    ∗ $P(E_2) = 1/6$
    ∗ $P(S) = 1$
  – $P(\cup_i E_i) = \sum_i P(E_i)$ for mutually exclusive events $E_i$
  – $P(E \cup E^c) = P(E) + P(E^c) = 1$
  – $P(E \cup F) = P(E) + P(F) - P(E \cap F)$
• conditional probability:
  – $P(E|F) = P(E \cap F)/P(F)$
  – $P(F|E) = P(E|F)P(F)/P(E)$, Bayes formula (show derivation)
    ∗ $E$ = have lung cancer, $F$ = smoke
    ∗ $P(E) = \frac{\text{people with lung cancer}}{\text{all people}} = .05$
    ∗ $P(F) = \frac{\text{people who smoke}}{\text{all people}} = .50$
    ∗ $P(F|E) = \frac{\text{people who smoke and have lung cancer}}{\text{people who have lung cancer}} = .80$
    ∗ $P(E|F) = \frac{.80 \cdot .05}{.5} = .08$
  – marginals
    ∗ $P(X) = \sum_i P(X|Y_i)P(Y_i)$
    ∗ $E_i$ = first die shows $i$
    ∗ $P(T = 7|E_i) = 1/6$
    ∗ so $P(T = 7) = P(T = 7|E_1)P(E_1) + P(T = 7|E_2)P(E_2) + \ldots = \sum 1/36 = 1/6$
    ∗ also do the same with $P(E_3)$
    ∗ can also be done with continuous distributions
  – $P(E_1|F) = \frac{P(F|E_1)P(E_1)}{P(F)} = \frac{P(F|E_1)P(E_1)}{\sum_i P(F|E_i)P(E_i)}$
  – $P(E \cap F) = P(E)P(F)$ if $E$ and $F$ are independent
    ∗ $P(E|F) = P(E \cap F)/P(F)$
    ∗ $P(E \cap F) = P(E|F)P(F)$
    ∗ so if $E$ and $F$ are independent, $P(E|F) = P(E)$
    ∗ for example, given the first die is 2, $P(\text{die}_2 = 3) = 1/6$
    ∗ independence is THE big assumption in machine learning: i.i.d.
• random variables
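The lung-cancer posterior and the two-dice marginal above are easy to sanity-check in code. A minimal Python sketch, using the numbers given in the notes:

```python
# Bayes formula on the lung-cancer example: P(E|F) = P(F|E) P(E) / P(F)
p_E = 0.05          # P(E): fraction of people with lung cancer
p_F = 0.50          # P(F): fraction of people who smoke
p_F_given_E = 0.80  # P(F|E): fraction of lung-cancer patients who smoke

p_E_given_F = p_F_given_E * p_E / p_F
print(p_E_given_F)  # 0.08

# Marginalization for two dice: P(T=7) = sum_i P(T=7|E_i) P(E_i)
p_T7 = sum((1 / 6) * (1 / 6) for first_die in range(1, 7))
print(p_T7)  # 6 * (1/36) = 1/6 ≈ 0.1667
```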

  – probability distributions
    ∗ $F(a) = P\{X \le a\}$
    ∗ $P\{a < X \le b\} = F(b) - F(a)$
    ∗ $F(a) = \sum_{x \le a} P(x)$ (discrete)
    ∗ $F(a) = \int_{-\infty}^{a} p(x)\,dx$ (continuous)
  – joint distributions
    ∗ $F(x, y) = P\{X \le x, Y \le y\}$
    ∗ $F_X(x) = P\{X \le x, Y \le \infty\}$ marginal (show both the discrete and continuous)
  – conditional distributions: $P_{X|Y}(x|y) = \frac{P\{X = x, Y = y\}}{P\{Y = y\}}$
  – Bayes rule: $P(y|x) = P(x|y)P_Y(y)/P_X(x)$ (posterior = likelihood · prior / evidence)
  – expectation (mean): $E[X] = \sum_i x_i P(x_i)$ or $E[X] = \int x\,p(x)\,dx$
  – variance: $Var(X) = E[(X - \mu)^2] = E[X^2] - \mu^2$
  – distributions
    ∗ binomial
    ∗ multinomial
    ∗ uniform
    ∗ normal
    ∗ others (chi-sq, t, F, etc.)
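As a quick illustration of the CDF, expectation, and variance definitions above, a minimal sketch for a fair die (assuming the die example from the previous section carries over):

```python
# A fair die as a discrete random variable
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

# E[X] = sum_i x_i P(x_i)
mean = sum(x * p for x, p in zip(values, probs))                  # 3.5
# Var(X) = E[X^2] - mu^2
var = sum(x**2 * p for x, p in zip(values, probs)) - mean**2      # ≈ 2.9167
print(mean, var)

# F(a) = sum over x <= a of P(x), so P{a < X <= b} = F(b) - F(a)
def F(a):
    return sum(p for x, p in zip(values, probs) if x <= a)

print(F(4) - F(2))  # P{2 < X <= 4} = 2/6 ≈ 0.3333
```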

2 Week 2 - chapter 2 supervised learning

2.1 learning from examples

• positive, negative examples
• $x = x_1 \ldots x_d$ input representation (just the pertinent attributes)
• $\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}$
• hypothesis $h$, hypothesis class, parameters. $h(x) = 1$ if $h$ classifies $x$ as positive
• empirical error, counting where the classifier does not match the labels in $\mathcal{X}$: $E(h|\mathcal{X}) = \sum_{t=1}^{N} 1(h(x^t) \ne r^t)$
• generalization: most specific hypothesis (S) vs. most general hypothesis (G) (false positives and negatives). Doubt: instances falling between S and G are not certain, so we do not make a decision

2.2 vapnik-chervonenkis dimension

The maximum number of points that can be shattered by the hypothesis class. Draw example with 4 points and rectangles.

2.3 PAC learning

• we want the error to be at most $\epsilon$; for the rectangle, split the error region into 4 strips, each of probability at most $\epsilon/4$
• the probability that all $N$ samples miss one of the strips is at most $4(1 - \epsilon/4)^N$
• given the inequality $(1 - x) \le e^{-x}$, we want to choose $N$ and $\delta$ so that $4e^{-\epsilon N/4} \le \delta$, which leads to
• $N \ge (4/\epsilon)\ln(4/\delta)$
• example: with $\epsilon = .1$ and $\delta = .05$ we need at least $40 \ln 80 \approx 176$ samples (a sketch evaluating this bound appears below)

2.4 noise

Imprecision in recording, labeling mistakes, additional attributes. Question: do you think it is possible to predict with certainty something like "will so-and-so like a particular movie" given all pertinent data? Complex models can be more accurate, but simple models are easier to use, train, and explain, and may generalize better by avoiding overfitting (Occam's razor).

2.5 learning multiple classes

Create a rectangle (a two-class problem) for each class.
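Returning to the PAC bound from section 2.3, it is a one-liner to evaluate for the worked example:

```python
import math

# PAC bound for the axis-aligned rectangle learner: N >= (4/eps) * ln(4/delta)
def pac_sample_size(eps, delta):
    return math.ceil((4 / eps) * math.log(4 / delta))

print(pac_sample_size(0.1, 0.05))  # 176
```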

2.6 regression

• $\mathcal{X} = \{x^t, r^t\}_{t=1}^{N}$ where $r^t \in \Re$
• interpolation: $r^t = f(x^t)$; regression: $r^t = f(x^t) + \epsilon$
• empirical error: $E(g|\mathcal{X}) = \frac{1}{N}\sum_{t=1}^{N}[r^t - g(x^t)]^2$
• if linear: $g(x) = w_1 x_1 + \ldots + w_d x_d + w_0 = \sum_{j=1}^{d} w_j x_j + w_0$
• with one attribute: $g(x) = w_1 x + w_0$
• error function: $E(w_1, w_0|\mathcal{X}) = \sum_{t=1}^{N}[r^t - (w_1 x^t + w_0)]^2$
• taking the partials, setting to zero and solving:
  – $w_1 = \frac{\sum_{t=1}^{N} x^t r^t - N\bar{x}\bar{r}}{\sum_{t=1}^{N}(x^t)^2 - N\bar{x}^2}$
  – $w_0 = \bar{r} - w_1\bar{x}$
• quadratic and higher-order polynomials

2.7 model selection and generalization

• Go over example in table 2.1.
• When the data does not identify a model with certainty, it is an ill-posed problem.
• Inductive bias is the set of assumptions that are adopted.
• Model selection is choosing the right bias.
• Underfitting is when the hypothesis is less complex than the underlying function.
• Overfitting is when the hypothesis is too complex.

2.8 dimensions of supervised ML algorithm (recap)

• model: $g(x|\theta)$
• loss function: $E(\theta|\mathcal{X}) = \sum_{t=1}^{N} L(r^t, g(x^t|\theta))$
• optimization method: $\theta^* = \arg\min_\theta E(\theta|\mathcal{X})$ (see the sketch below, which instantiates all three for the linear model of section 2.6)
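A minimal sketch tying these three dimensions together for the one-attribute linear model of section 2.6: the model is $g(x|\theta) = w_1 x + w_0$, the loss is squared error, and the optimization is the closed-form solution derived above. The data points are made up for illustration:

```python
# Made-up one-dimensional data (x^t, r^t)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
rs = [1.9, 4.1, 6.2, 7.8, 10.1]
N = len(xs)

# Optimization: closed-form least-squares solution for w1, w0
x_bar = sum(xs) / N
r_bar = sum(rs) / N
w1 = (sum(x * r for x, r in zip(xs, rs)) - N * x_bar * r_bar) / (
    sum(x**2 for x in xs) - N * x_bar**2)
w0 = r_bar - w1 * x_bar

# Loss: empirical squared error E(w1, w0 | X)
err = sum((r - (w1 * x + w0))**2 for x, r in zip(xs, rs)) / N
print(w1, w0, err)  # slope ≈ 2, intercept ≈ 0, small residual error
```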

2.9 implementation

• program to find the most specific parameters (see the sketch below)
• program to find the most general parameters
• program to learn for multiple classes
• program to do regression (many packages)
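A sketch of the first exercise, assuming the rectangle hypothesis class from section 2.1: the most specific hypothesis S is the tightest axis-aligned rectangle containing all positive examples. The data and function name here are illustrative:

```python
def most_specific_rectangle(examples):
    """examples: list of ((x1, x2), label) pairs, label 1 = positive."""
    positives = [x for x, label in examples if label == 1]
    if not positives:
        return None  # no positives seen yet: S is the empty hypothesis
    x1s = [p[0] for p in positives]
    x2s = [p[1] for p in positives]
    # S = tightest bounding box around the positive examples
    return (min(x1s), max(x1s), min(x2s), max(x2s))

data = [((2, 3), 1), ((4, 5), 1), ((3, 4), 1), ((8, 1), 0)]
print(most_specific_rectangle(data))  # (2, 4, 3, 5)
```

The most general hypothesis G would instead be the largest rectangle that still excludes all negative examples.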

3 Week 3 - chapter 3 Bayesian decision theory

• observable ($x$) and unobservable ($z$) variables, $x = f(z)$
• choose the most probable event
• estimate $P(X)$ using samples, e.g. for a coin $\hat{p}_0 = \frac{\#\text{heads}}{\#\text{total tosses}}$

3.1 classification

• use the observable variables to predict the class
• choose $C = 1$ if $P(C = 1|x_1, x_2) > .5$
• prob of error is $1 - \max(P(C = 1|x_1, x_2), P(C = 0|x_1, x_2))$
• Bayes rule: $P(C|x) = \frac{p(x|C)P(C)}{p(x)}$
• prior is the probability of the class
• class likelihood is the probability of the data given the class
• evidence is the probability of the data, a normalization constant
• classifier: choose the class with the highest posterior probability: choose $C_i$ if $P(C_i|x) = \max_k P(C_k|x)$
• example: want to predict success of a college applicant given gpa and sat score
• example: predict a patient's reaction (get better, no difference, get worse) given their blood pressure and ethnic background

3.2 losses and risks

We need to weight decisions, as not all decisions have the same consequences.

• let $\alpha_i$ be the action of choosing $C_i$
• and $\lambda_{ik}$ be the loss associated with taking action $\alpha_i$ when the class is really $C_k$
• then the risk of taking action $\alpha_i$ is $R(\alpha_i|x) = \sum_{k=1}^{K} \lambda_{ik} P(C_k|x)$
• zero-one loss is often assumed to simplify things; assigning risks can always be done as a post-processing step
• example: say $P(C_0|x) = .4$ and $P(C_1|x) = .6$, but $\lambda_{00} = 0$, $\lambda_{01} = 10$, $\lambda_{10} = 20$ and $\lambda_{11} = 0$. So
  – $R(\alpha_0|x) = 0 \cdot .4 + 10 \cdot .6 = 6$
  – $R(\alpha_1|x) = 20 \cdot .4 + 0 \cdot .6 = 8$
  – so we take action $\alpha_0$ even though $C_1$ is the more probable class (see the sketch below)
• reject option: create one more $\alpha$ and $\lambda$
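The risk calculation above in code; the posteriors and losses are the ones from the example:

```python
# R(alpha_i | x) = sum_k lambda_ik * P(C_k | x)
posteriors = [0.4, 0.6]        # P(C_0|x), P(C_1|x)
losses = [[0, 10],             # lambda_0k: losses for action alpha_0
          [20, 0]]             # lambda_1k: losses for action alpha_1

risks = [sum(lam * p for lam, p in zip(row, posteriors)) for row in losses]
print(risks)  # [6.0, 8.0]

best_action = min(range(len(risks)), key=lambda i: risks[i])
print(best_action)  # 0: choose alpha_0, even though C_1 is more probable
```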

3.3 discriminant functions

• $g_i(x) = -R(\alpha_i|x)$
• $g_i(x) = p(x|C_i)P(C_i)$ when the zero-one loss function is considered
• show briefly the quadratic discriminant

3.4 utility theory

• utility function: $EU(\alpha_i|x) = \sum_k U_{ik} P(S_k|x)$
• choose $\alpha_i$ if $EU(\alpha_i|x) = \max_j EU(\alpha_j|x)$
• typically defined in monetary terms

3.5 value of information

• assessing the value of additional information (attributes)
• expected utility of the current best action: $EU(x) = \max_i \sum_k U_{ik} P(S_k|x)$
• with a new feature $z$: $EU(x, z) = \max_i \sum_k U_{ik} P(S_k|x, z)$
• if $EU(x, z) > EU(x)$, then $z$ is useful, but only if the utility of the additional feature is more than the cost of observing and processing it

3.6 Bayesian nets

• define probabilistic networks, graphical models and DAGs
• (slides) define causal and diagnostic arcs in a network
• explain $P(R|W) = \frac{P(W|R)P(R)}{P(W)}$
• explain $P(W|S) = P(W|R, S)P(R|S) + P(W|\neg R, S)P(\neg R|S)$
• $P(W) = P(W|R, S)P(R, S) + P(W|R, \neg S)P(R, \neg S) + P(W|\neg R, S)P(\neg R, S) + P(W|\neg R, \neg S)P(\neg R, \neg S)$ (see the sketch after section 3.8)
• explain why $P(S|R, W)$ is less than $P(S|W)$ (explaining away)
• local structure results in storing fewer parameters and makes computations easier
• belief propagation and junction trees are methods for efficiently solving nets
• classification

3.7 influence diagrams

3.8 association rules
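Returning to the marginal $P(W)$ from section 3.6: a sketch that expands the four-term sum for the rain/sprinkler/wet-grass network. The CPT values below are hypothetical, chosen only to make the sum concrete:

```python
# Hypothetical priors; R and S are independent root nodes in this net
p_R, p_S = 0.4, 0.2
# Hypothetical CPT for P(W | R, S)
p_W_given = {(True, True): 0.95, (True, False): 0.90,
             (False, True): 0.90, (False, False): 0.10}

def prior(event, p):
    return p if event else 1 - p

# P(W) = sum over r, s of P(W|r,s) P(r) P(s)
p_W = sum(p_W_given[(r, s)] * prior(r, p_R) * prior(s, p_S)
          for r in (True, False) for s in (True, False))
print(p_W)  # 0.52 with these numbers
```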
