Statistical Natural Language Processing
Machine learning: evaluation

Çağrı Çöltekin
University of Tübingen, Seminar für Sprachwissenschaft
Summer Semester 2017
Measuring success/failure in regression
Root mean squared error (RMSE)
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

[Figure: a fitted regression line with data points $y_1, y_2, y_3$ and their predictions $\hat{y}_1, \hat{y}_2, \hat{y}_3$, showing the residuals.]
- Measures average error in units compatible with the outcome variable
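A minimal sketch of the computation in NumPy (the data here is hypothetical):

```python
import numpy as np

# Hypothetical outcomes and model predictions.
y = np.array([3.1, 4.8, 7.2, 9.9])
y_hat = np.array([3.0, 5.0, 7.0, 10.5])

# RMSE: square root of the mean squared residual.
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
print(f"RMSE = {rmse:.3f}")
```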
Measuring success/failure in regression
Coefficient of determination
$$R^2 = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} = 1 - \left(\frac{\mathrm{RMSE}}{\sigma_y}\right)^2$$

[Figure: a fitted regression line showing, for a data point, the deviations $\hat{y}_i - \bar{y}$ and $y_i - \bar{y}$ around the mean $\bar{y}$.]
- $R^2$ is a standardized measure in the range [0, 1]
- It indicates the proportion of the variance of y explained by x
- For a single predictor, it is the square of the correlation coefficient r
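A sketch of the same computation, reusing the hypothetical data above; note that the equality with $1 - (\mathrm{RMSE}/\sigma_y)^2$ holds for least-squares fits:

```python
import numpy as np

y = np.array([3.1, 4.8, 7.2, 9.9])       # hypothetical outcomes
y_hat = np.array([3.0, 5.0, 7.0, 10.5])  # hypothetical predictions

# R^2 as the variance explained by the model over the total variance.
r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"R^2 = {r2:.3f}")
```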
Measuring success in classification
Accuracy
- In classification, we do not care (much) about the average of the error function
- We are interested in how many of our predictions are correct
- Accuracy measures this directly
$$\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}$$
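A direct implementation over hypothetical gold labels and predictions:

```python
gold = ["pos", "neg", "neg", "pos", "neg"]  # hypothetical gold labels
pred = ["pos", "neg", "pos", "pos", "neg"]  # hypothetical predictions

correct = sum(g == p for g, p in zip(gold, pred))
accuracy = correct / len(gold)
print(f"accuracy = {accuracy:.2f}")  # 4 of 5 correct: 0.80
```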
Accuracy may go wrong
- Think about a ‘dummy’ search engine that always returns
an empty document set (no results found)
- If we have
  – 1 000 000 documents
  – 1000 relevant documents (including the term in the query)
the accuracy is $\frac{999\,000}{1\,000\,000} = 99.90\,\%$
- In general, if the class distribution is skewed, accuracy will be a bad indicator of success
Measuring success in classification
Precision, recall, F-score
$$\text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN}$$

$$F_1\text{-score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

                     true value
                 positive   negative
predicted pos.      TP         FP
predicted neg.      FN         TN
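A small sketch computing all three measures from the confusion counts (the counts are hypothetical; a zero denominator is mapped to 0 by convention):

```python
def prf(tp, fp, fn):
    """Precision, recall and F1-score from confusion counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

print(prf(tp=7, fp=9, fn=3))  # (0.4375, 0.7, 0.538...)
```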
Example: back to the search engine
- We had a ‘dummy’ search engine that returned ‘not relevant’ for all queries
- For a query over
  – 1 000 000 documents
  – 1000 relevant documents
the accuracy is $\frac{999\,000}{1\,000\,000} = 99.90\,\%$, but precision $= 0\,\%$ and recall $= \frac{0}{1000} = 0\,\%$

Precision and recall are asymmetric: the choice of the ‘positive’ class is important.
Classifier evaluation: another example
Consider the following two classifiers:

Classifier 1:                          Classifier 2:
               true value                             true value
            positive  negative                     positive  negative
pred. pos.      7         9            pred. pos.      1         3
pred. neg.      3         1            pred. neg.      9         7

            Classifier 1    Classifier 2
Accuracy    8/20 = 0.40     8/20 = 0.40
Precision   7/16 = 0.44     1/4 = 0.25
Recall      7/10 = 0.70     1/10 = 0.10
F-score     0.54            0.14

Both classifiers have the same accuracy, but their precision, recall and F-score differ.
Multi-class evaluation
- For multi-class problems, it is common to report the average precision/recall/F-score
- For C classes, the averaging can be done in two ways:
$$\text{precision}_M = \frac{1}{C}\sum_{i=1}^{C}\frac{TP_i}{TP_i + FP_i} \qquad \text{recall}_M = \frac{1}{C}\sum_{i=1}^{C}\frac{TP_i}{TP_i + FN_i}$$

$$\text{precision}_\mu = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C}\left(TP_i + FP_i\right)} \qquad \text{recall}_\mu = \frac{\sum_{i=1}^{C} TP_i}{\sum_{i=1}^{C}\left(TP_i + FN_i\right)}$$

(M = macro, µ = micro)
- The averaging can also be useful for binary classification, if there is no natural positive class (both schemes are illustrated in the sketch below)
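A sketch of the two averaging schemes over hypothetical per-class counts; note how the micro average is pulled toward the frequent class:

```python
# (TP_i, FP_i) per class; hypothetical counts for three classes.
counts = [(10, 5), (3, 1), (50, 40)]

# Macro: compute per-class precision, then average.
precision_macro = sum(tp / (tp + fp) for tp, fp in counts) / len(counts)

# Micro: pool all counts first, then compute a single precision.
precision_micro = (sum(tp for tp, _ in counts)
                   / sum(tp + fp for tp, fp in counts))

print(f"macro = {precision_macro:.3f}, micro = {precision_micro:.3f}")
```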
Confusion matrix
- A confusion matrix is often useful for multi-class classification tasks:

             true class
            a     b     c
pred. a    10     3     4
pred. b     2    12     8
pred. c     7     7
- Are the classes balanced?
- What is the accuracy?
- What are the per-class and averaged precision/recall? (see the sketch below)
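A sketch of how the per-class measures fall out of the matrix; the matrix below is hypothetical (the cell missing from the slide is filled in arbitrarily):

```python
import numpy as np

# Rows = predicted class, columns = true class.
cm = np.array([[10,  3,  4],
               [ 2, 12,  8],
               [ 7,  7, 11]])   # the last cell is an arbitrary stand-in

for i, label in enumerate("abc"):
    tp = cm[i, i]
    precision = tp / cm[i, :].sum()  # everything predicted as this class
    recall = tp / cm[:, i].sum()     # everything truly in this class
    print(f"{label}: precision = {precision:.2f}, recall = {recall:.2f}")

accuracy = np.trace(cm) / cm.sum()   # correct predictions on the diagonal
```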
Precision–recall trade-off
- Increasing precision (e.g., by changing a hyperparameter) typically results in decreasing recall
- Precision–recall graphs are useful for picking the right model
- The area under the curve (AUC) is another indication of the success of a classifier
[Figure: a precision–recall curve, with recall on the x-axis and precision on the y-axis, both ranging over [0, 1].]
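A rough sketch of tracing such a curve by sweeping a decision threshold over hypothetical classifier scores (in practice one would use a library such as scikit-learn):

```python
import numpy as np

scores = np.array([0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1])  # hypothetical
gold = np.array([1, 1, 0, 1, 0, 1, 0, 0])                     # 1 = positive

points = []
for t in sorted(scores, reverse=True):
    pred = scores >= t
    tp = np.sum(pred & (gold == 1))
    points.append((tp / (gold == 1).sum(),  # recall
                   tp / pred.sum()))        # precision

recalls, precisions = zip(*sorted(points))
auc = np.trapz(precisions, recalls)  # area under the curve
```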
Performance metrics: a summary
- Accuracy does not reflect the classifier's performance when the class distribution is skewed
- Precision and recall are binary and asymmetric
- For multi-class problems, calculating accuracy is straightforward, but other measures need averaging
- These are just the most common measures: there are more
- You should understand what these metrics measure, and use/report the metric that suits your purpose
Model selection/evaluation
- Our aim is to fit models that are (also) useful outside the training data
- Evaluating a model on the training data is wrong: complex models tend to fit the noise in the training data
- The results should always be tested on a test set that does not overlap with the training data
- The test set is ideally used only once, to evaluate the final model
- Often, we also need to tune the model, e.g., its hyperparameters (such as a regularization constant)
- Tuning has to be done on a separate development set (a typical split is sketched below)
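A minimal sketch of such a three-way split (the 80/10/10 ratios are a common convention, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000                      # hypothetical dataset size
idx = rng.permutation(n)

train_idx = idx[:int(0.8 * n)]              # fit model parameters
dev_idx = idx[int(0.8 * n):int(0.9 * n)]    # tune hyperparameters
test_idx = idx[int(0.9 * n):]               # use once, for final evaluation
```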
Back to polynomial regression
[Figure: scatter plot of the example data, x ranging over 2–10 and y over 200–1000, with the fitted curves below.]
$$y = -221.3 + 109.9x$$
$$y = 45.50 - 3.52x + 12.13x^2$$
$$y = 1445.80 - 3189.13x + 2604.21x^2 - 1026.76x^3 + 218.40x^4 - 25.52x^5 + 1.54x^6$$
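A sketch of fitting such polynomials with NumPy; the data here is generated to stand in for the (unavailable) example data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(2, 10, 30)                         # hypothetical inputs
y = 10 * x**2 + rng.normal(scale=40, size=x.size)  # roughly quadratic data

for degree in (1, 2, 6):
    coefs = np.polyfit(x, y, degree)  # least-squares fit, highest order first
    y_hat = np.polyval(coefs, x)
    rmse = np.sqrt(np.mean((y - y_hat) ** 2))
    print(f"degree {degree}: training RMSE = {rmse:.1f}")
```

The training error necessarily shrinks as the degree grows; the next slide shows why that is not the quantity to optimize.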
Training/test error
[Figure: training and test error as a function of polynomial degree.]
Bias and variance (revisited)
Bias of an estimate is the difference between the value being estimated and the expected value of the estimate:
$$B(\hat{w}) = E[\hat{w}] - w$$
- An unbiased estimator has 0 bias
Variance of an estimate is simply its variance, the expected squared deviation from the mean estimate:
$$\operatorname{var}(\hat{w}) = E\left[\left(\hat{w} - E[\hat{w}]\right)^2\right]$$
Here w stands for the parameters that define the model. The bias–variance relationship is a trade-off: models with low bias result in high variance.
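A small simulation can make the definitions concrete. The sketch below (not from the slides) estimates the bias and variance of the maximum-likelihood variance estimator, which is known to be biased by $-\sigma^2/n$:

```python
import numpy as np

rng = np.random.default_rng(1)
true_var = 4.0                 # the parameter w being estimated
n, trials = 10, 100_000

# ML estimator of the variance (np.var divides by n, not n - 1).
estimates = np.array(
    [np.var(rng.normal(0.0, 2.0, size=n)) for _ in range(trials)])

bias = estimates.mean() - true_var                     # B(w_hat) = E[w_hat] - w
variance = np.mean((estimates - estimates.mean())**2)  # var(w_hat)
print(f"bias = {bias:.3f} (theory: {-true_var/n}), variance = {variance:.3f}")
```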
Some issues with bias and variance
- Overfitting occurs when the model learns the idiosyncrasies of the training data
- Underfitting occurs when the model is not flexible enough for the data at hand
- Complex models tend to overfit, and exhibit high variance
- Simple models tend to show low variance, but are likely to have (high) bias
Cross validation
- To avoid overfitting, we want to tune our models on a development set
- But (labeled) data is valuable
- Cross validation is a technique that uses all the data for both training and tuning, at the cost of some additional effort
- Besides tuning hyperparameters, we may also want to get ‘average’ parameter estimates over multiple folds
- We may also use cross validation during testing
K-fold cross validation
[Figure: the data divided into 5 folds; in each round a different fold serves as the development set (Dev) and the remaining folds as training data (Train).]
- At each fold, we hold out part of the data for evaluation and train the model on the remaining data (see the sketch below)
- Typical values for k are 5 and 10
- In stratified cross validation, each fold contains (approximately) the same proportions of class labels
- A special case, where k is equal to n (the number of data points), is called leave-one-out cross validation
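A minimal sketch of generating the folds by hand (libraries such as scikit-learn provide this as KFold):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Yield (train, dev) index pairs for k-fold cross validation."""
    idx = np.random.default_rng(seed).permutation(n)
    for fold in np.array_split(idx, k):
        yield np.setdiff1d(idx, fold), fold

# 5-fold CV over a hypothetical dataset of 100 items.
for train_idx, dev_idx in kfold_indices(100, 5):
    pass  # fit on train_idx, evaluate on dev_idx
```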
The choice of k in k-fold CV
- Increasing k
  – reduces the bias: the estimates converge to the true value of the measure (e.g., accuracy) in the limit
  – increases the variance: repeated samples produce different parameter estimates
  – is generally computationally more expensive
- 5- or 10-fold cross validation is common practice (and is found to strike a good balance between bias and variance)
Summary
The first principle is that you must not fool yourself, and you are the easiest person to fool. – Richard P. Feynman
- The measures of success in ML systems include
  – RMSE / $R^2$
  – accuracy
  – precision / recall / F-score
- We want models with low bias and low variance
- Evaluating an ML system requires special care:
  – never use your test set during training/development
  – tune your system on a development set
  – cross validation allows efficient use of labeled data
Next (Fri): first graded assignment