SLIDE 1

Statistical Natural Language Processing

Text Classification

Çağrı Çöltekin
University of Tübingen, Seminar für Sprachwissenschaft

Summer Semester 2017

SLIDE 2

Some examples

is it spam?

From: Dr Pius Ayim <mikeabass15@gmail.com> Subject: Dear Friend / Lets work together Dear Friend, My name is Dr. Pius Anyim, former senate president of the Republic Nigeria under regime of Jonathan Good-luck. I am sorry to invade your privacy; but the ongoing ANTI-CORRUPTION GRAFT agenda of the rulling government is a BIG problem that I had to get your contact via a generic search on internet as a result of looking for a reliable person that will help me to retrieve funds I deposited at a financial institute in Europe. …

* Fresh from this morning



SLIDE 5

Some examples

is the customer happy?

I never understood what's the BIG deal behind this album. Yes, the production is wonderfull but the songwriting is childish and rubbish. They definitly can not write great lyrics like Bob Dylan sometimes do. "God Only Know" and "Wouldnt Be nice" are indeed masterpieces...but the rest of the album is background music.

@DB_Bahn mußten sie für den Sauna-Besuch zuzahlen? (German: "@DB_Bahn did you have to pay extra for the sauna visit?")

  • Sentiment analysis is currently one of the most popular applications of text classification



SLIDE 7

Some examples

which language is this text in?

Član 3. Svako ima pravo na život, slobodu i ličnu bezbjednost. ("Article 3. Everyone has the right to life, liberty and security of person.")

  • Detecting the language of a text is often the first step for many NLP applications.
  • Extremely easy for the most part, but tricky for
    – closely related languages
    – text with code-switching


SLIDE 8

More questions

  • Who wrote the book?
  • Find the author’s
    – age
    – gender
    – political party affiliation
    – native language
  • Is the author depressed?
  • What is the proficiency level of a language learner?
  • What grade should a student essay get?
  • What is the diagnosis, given a doctor’s report?
  • Under what category should a product be listed, based on its description?
  • What is the genre of the book?
  • Which department should answer the support email?
  • Is this news about
    – politics
    – sports
    – travel
    – economy
  • Is the web site an institutional or a personal web page?


SLIDE 9

Text classification

  • In many NLP applications we need to classify text documents
  • Documents of interest vary from short messages to complete books
  • The classification task can be binary or multi-class
  • The core part of the solution is a classifier
  • The way we extract features from the documents is important (and interacts with the classification method)


SLIDE 10

Text classification

the definition

  • Given a document, our aim is to classify it into one (or more) of the known classes
  • During prediction
    input: a document
    output: the predicted document class
  • During training
    input: a set of documents with associated labels
    output: a classifier

Essentially, the task is supervised learning (classification).


SLIDE 11

How about a rule-based method?

  • They exist, and are still often used in industry
  • Rule-based approaches are language specific
  • It is difficult to adapt them to new environments

We will stick to statistical / machine learning approaches.


SLIDE 12

Supervised learning

[Diagram: during training, features and labels extracted from the training data feed an ML algorithm, which produces an ML model; during prediction, features extracted from new data are fed to the ML model, which outputs the predicted label]



SLIDE 14

Two important parts

  • How do we represent a document?
    – what features to use? words? characters? both? n-grams of words or characters?
    – what value to assign to each feature?
  • What classification algorithm should we use?

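To make the two parts concrete, here is a minimal sketch assuming scikit-learn is available; the toy documents and labels are invented for illustration:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    docs = ["a good film", "a bad film", "good fun", "bad acting"]
    labels = ["pos", "neg", "pos", "neg"]

    # Part 1: document representation (here, word unigram counts)
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(docs)

    # Part 2: the classification algorithm
    clf = LogisticRegression()
    clf.fit(X, labels)

    print(clf.predict(vectorizer.transform(["a good fun film"])))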

SLIDE 15

Bag of words (BoW) representation

The idea: use the words that occur in the text as features, without paying attention to their order.

The document

It’s a good thing most animated sci-fi movies come from Japan, because “titan a.e.” is proof that Hollywood doesn’t have a clue how to do it. I don’t know what this film is supposed to be about.

BoW representation

how, japan, good, n’t, I, thing, film, what, proof, titan, a, because, ’s, know, does, most, hollywood, is, animated, it, do, sci-fi, a.e., supposed, be, come, clue, to, this, that, from, have, movies, about, “, ”, ,, .


SLIDE 16

Bag of words representation

with binary features

The document

It’s a good thing most animated sci-fi movies come from Japan, because “titan a.e.” is proof that Hollywood doesn’t have a clue how to do it. I don’t know what this film is supposed to be about.

  • If the word is in the document, the feature gets the value 1, otherwise 0
  • The feature vector contains values for all words in our document collection

feature      value
to           1
do           1
a            1
thing        1
have         1
good         1
be           1
clue         1
great        0
pathetic     0
masterpiece  0
…


SLIDE 17

Bag of words representation

with (document) frequencies

The document

It’s a good thing most animated sci-fi movies come from Japan, because “titan a.e.” is proof that Hollywood doesn’t have a clue how to do it. I don’t know what this film is supposed to be about.

  • Use frequencies rather than binary vectors
  • May help in some cases, but
    – effect of document length
    – frequent is not always good

feature      value
to           2
do           2
a            2
thing        1
have         1
good         1
be           1
clue         1
great        0
pathetic     0
masterpiece  0
…


SLIDE 18

Bag of words representation

with relative frequencies

The document

It’s a good thing most animated sci-fi movies come from Japan, because “titan a.e.” is proof that Hollywood doesn’t have a clue how to do it. I don’t know what this film is supposed to be about.

  • Relative frequencies are less sensitive to document length
  • Still, high-frequency words dominate

feature      value
to           0.06
do           0.06
a            0.06
thing        0.03
have         0.03
good         0.03
be           0.03
clue         0.03
great        0.00
pathetic     0.00
masterpiece  0.00
…

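All three weighting schemes above can be computed directly from a token list; a small sketch (whitespace tokenization is a simplification):

    from collections import Counter

    tokens = "it 's a good thing most animated sci-fi movies come from japan".split()

    counts = Counter(tokens)                                   # raw frequencies
    binary = {w: 1 for w in counts}                            # presence/absence
    relfreq = {w: c / len(tokens) for w, c in counts.items()}  # length-normalized

    print(counts["good"], binary["good"], round(relfreq["good"], 3))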

SLIDE 19

tf-idf weighting

  • Intuition:
    – Words that appear multiple times in a document are important/representative for that document
    – Words that appear in many documents are not specific
  • tf-idf uses two components
    tf: term frequency, the frequency of the word in the document
    idf: inverse document frequency, the inverse of the ratio of documents that contain the term
  • Both components are typically normalized

tf-idf(t, d) = C(t, d) / |d| × log(N / n_t)

where C(t, d) is the count of term t in document d, |d| is the document length, n_t is the number of documents containing t, and N is the total number of documents.


SLIDE 20

tf-idf example

tf-idf(t, d) = tf(t, d) × idf(t)

tf-idf(good, d1) = ?
tf-idf(bad, d1) = ?
tf-idf(the, d1) = ?
tf-idf(good, d3) = ?

Document 1 (d1): the 5, good 2, bad 1
Document 2 (d2): the 2, a 2, book 1
Document 3 (d3): the 1, a 2, good 3


SLIDE 21

tf-idf example

tf-idf(t, d) = tf(t, d) × idf(t)

tf-idf(good, d1) = 2/8 × log₂(3/2) ≈ 0.15
tf-idf(bad, d1) = 1/8 × log₂(3/1) ≈ 0.20
tf-idf(the, d1) = 5/8 × log₂(3/3) = 0.00
tf-idf(good, d3) = 3/6 × log₂(3/2) ≈ 0.29

Document 1 (d1): the 5, good 2, bad 1
Document 2 (d2): the 2, a 2, book 1
Document 3 (d3): the 1, a 2, good 3

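The numbers above correspond to a base-2 logarithm; a sketch that reproduces them from the document counts:

    from math import log2

    docs = {                                  # per document: term -> count
        "d1": {"the": 5, "good": 2, "bad": 1},
        "d2": {"the": 2, "a": 2, "book": 1},
        "d3": {"the": 1, "a": 2, "good": 3},
    }
    N = len(docs)

    def tf_idf(t, d):
        tf = docs[d].get(t, 0) / sum(docs[d].values())     # C(t, d) / |d|
        n_t = sum(1 for doc in docs.values() if t in doc)  # docs containing t
        return tf * log2(N / n_t)

    for t, d in [("good", "d1"), ("bad", "d1"), ("the", "d1"), ("good", "d3")]:
        print(t, d, round(tf_idf(t, d), 2))   # matches the slide: 0.15, 0.20, 0.00, 0.29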

SLIDE 22

Some notes on tf-idf

  • tf-idf is a very effective method for term weighting
  • It was originally used for information retrieval, where it brought substantial improvements over other methods
  • There are some alternatives, and many variations
  • But it has been difficult to improve over it (since the 1970s)
  • It is also very effective in text classification when using linear models



SLIDE 24

A document is more than a BoW

The example document for sentiment analysis

It’s a good thing most animated sci-fi movies come from Japan, because “titan a.e.” is proof that Hollywood doesn’t have a clue how to do it. I don’t know what this film is supposed to be about.

  • So far, we have considered documents as simple bags of words
  • BoW representations are surprisingly successful in many fields (IR, spam detection, …)
  • However, word order matters
    – According to a sentiment dictionary, our example contains one positive and one negative word
  • Paying attention to longer sequences allows us to get better results


SLIDE 25

Bag of n-grams

  • Using n-grams rather than words allows us to capture more information in the data
  • We can still use the same weighting methods (tf-idf)
  • It is common practice to use a range of n-grams
  • This results in a large set of features (millions for most practical applications)
  • Data sparsity is a problem for higher-order n-grams

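A sketch of a "range of n-grams" with scikit-learn (assumed available): ngram_range=(1, 2) extracts both unigrams and bigrams, which can then be tf-idf weighted as before (get_feature_names_out requires a recent scikit-learn version):

    from sklearn.feature_extraction.text import TfidfVectorizer

    vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))  # uni- and bigrams
    X = vec.fit_transform(["not really worth seeing",
                           "really worth seeing twice"])
    print(X.shape)
    print(vec.get_feature_names_out())  # includes 'not really', 'worth seeing', ...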

SLIDE 26

The unreasonable effectiveness of character n-grams

The example document

It’s a good thing …

  • For a number of text classification tasks (authorship attribution, language detection), character n-gram features have been found to be effective
  • The idea is to use a range of character n-grams
  • Works for both linear models and ANN (convolutional, RNN) models

feature  value
it       2
t'       1
's       2
s␣       3
␣a       4
a␣       5
␣g       2
it's␣    2
t's␣a    1
's␣a␣    1
s␣a␣g    1
␣a␣go    2
a␣goo    2
…

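A sketch of extracting character n-grams (here n = 2 to 5) with counts, in the spirit of the table above; '␣' makes spaces visible, and the counts differ from the table because only a fragment of the document is used:

    from collections import Counter

    text = "it's a good thing"
    ngrams = Counter(
        text[i:i + n].replace(" ", "␣")
        for n in range(2, 6)                 # n-grams of length 2 to 5
        for i in range(len(text) - n + 1)
    )
    print(ngrams["it"], ngrams["␣a"], ngrams["␣a␣go"])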

SLIDE 27

What about the linguistic features?

  • Linguistic features such as
    – lemmas
    – sequences of POS tags
    – parser output: dependency triplets, or partial trees
    are also used in some tasks
  • It is often difficult to get improvements over simple features
  • It also makes systems more complex and language dependent
  • Linguistic features can be particularly useful if the amount of data is limited
  • They are interesting if the aim is finding linguistic explanations (rather than solving an engineering problem)


SLIDE 28

Preprocessing

  • Some preprocessing steps are common in some tasks
  • Preprocessing includes
    – removing formatting (e.g., HTML tags)
    – tokenization
    – case normalization
    – spelling correction
    – replacing numbers with a special symbol
    – removing punctuation
  • Depending on the task, preprocessing can hurt!

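A sketch of a few of the steps listed above; the order and the regular expressions are illustrative assumptions, and (as noted) some of these steps can hurt depending on the task:

    import re

    def preprocess(text):
        text = re.sub(r"<[^>]+>", " ", text)    # remove HTML-like formatting
        text = text.lower()                     # case normalization
        text = re.sub(r"\d+", "<num>", text)    # replace numbers with a symbol
        text = re.sub(r"[^\w<>\s]", " ", text)  # remove punctuation
        return text.split()                     # naive tokenization

    print(preprocess("<b>Only 3 stars!</b> Not worth $20."))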

SLIDE 29

Feature selection

  • Feature selection is a common step for
    – reducing the size of the feature vectors
    – reducing noise
  • Depending on the task, ‘stopwords’ can also be removed
  • One option is dimensionality reduction (e.g., PCA)
  • Another solution is to use a ‘feature weighting’ method; common methods include
    – Frequency
    – Information Gain
    – χ² (chi-squared)

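A sketch of chi-squared feature selection with scikit-learn (assumed available), keeping the k highest-scoring features; the toy data is invented:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_selection import SelectKBest, chi2

    docs = ["good film", "great film", "bad film", "awful film"]
    y = [1, 1, 0, 0]

    X = CountVectorizer().fit_transform(docs)
    X_sel = SelectKBest(chi2, k=2).fit_transform(X, y)  # keep the 2 best features
    print(X_sel.shape)  # (4, 2)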

SLIDE 30

Choice of (linear) classifiers

  • Once we have our feature vectors, we can use (almost) any classifier
  • Many methods have been used in practice
    – Naive Bayes
    – Decision trees
    – kNN
    – Logistic regression
    – Support vector machines (SVM)
  • SVMs often perform better than the others

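A sketch of the common strong baseline implied above, tf-idf features with a linear SVM, using scikit-learn (assumed); the training data is a stand-in for a real corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    train_docs = ["a fine film", "boring and slow",
                  "wonderful acting", "a waste of time"]
    train_y = ["pos", "neg", "pos", "neg"]

    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
    model.fit(train_docs, train_y)
    print(model.predict(["a wonderful film"]))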

SLIDE 31

Ensemble classifiers

  • When we have multiple sets of features, we can either
    – concatenate the feature vectors and train a single classifier, or
    – train multiple classifiers, one per feature set, and combine their results
  • Ensemble methods combine multiple classifiers
    – we can train a separate classifier for each feature set, and a higher-level classifier to make the final decision
    – or use the output of one (or more) classifiers as an input to another classifier
  • In a number of tasks, ensemble models are reported to perform better

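A sketch of the second option, one classifier per feature set (word and character n-grams) combined by voting; scikit-learn is assumed, and a stacked meta-classifier would use StackingClassifier instead:

    from sklearn.ensemble import VotingClassifier
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    docs = ["a fine film", "boring and slow", "wonderful acting", "a waste of time"]
    y = [1, 0, 1, 0]

    word_clf = make_pipeline(TfidfVectorizer(analyzer="word"),
                             LogisticRegression())
    char_clf = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
                             LogisticRegression())

    ensemble = VotingClassifier([("word", word_clf), ("char", char_clf)],
                                voting="soft")  # average predicted probabilities
    ensemble.fit(docs, y)
    print(ensemble.predict(["wonderful film"]))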

SLIDE 32

Neural networks for text classification

  • As powerful non-linear classifiers, ANNs are also useful in text classification tasks
  • Some of the tricks used for linear classifiers are not necessary for ANNs
    – We do not need to include n-gram features: an ANN is able to combine the effects of individual words
    – ANNs (ideally) can also learn the importance of features
  • Even with single-word features, however, BoW representations are too large for (current) ANNs
  • Common methods for text classification with ANNs include convolutional networks and recurrent networks


SLIDE 33

Embeddings

  • For text classification with ANNs, we use term vectors (instead of document vectors)
  • Embeddings help reduce dimensionality
  • We often want to use ‘task-specific’ embeddings
    – The first layer of the network (conceptually) learns a mapping from one-hot representations to dense representations
    – Task-specific embeddings represent information important for solving the task
    – Initializing embeddings with ‘general purpose’ embeddings is useful in some tasks


SLIDE 34

Bag of embeddings

  • Embeddings are typically used as a first step for convenient learning with other ANN methods (e.g., CNNs or RNNs)
  • A simple (and often surprisingly effective) method is to
    – use a function of the embeddings (e.g., their average) as the document vector
    – use a (multi-layer) classifier on the document vectors
  • This is similar to the BoW approach, except that dense representations are used
  • Simple, but sometimes more effective than more elaborate methods

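A sketch of the averaging idea with numpy; the embedding table below is random for illustration, while in practice it is learned with the network or loaded from pretrained vectors:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab = {"not": 0, "really": 1, "worth": 2, "seeing": 3}
    emb = rng.normal(size=(len(vocab), 50))   # |V| x d embedding table

    def doc_vector(tokens):
        ids = [vocab[t] for t in tokens if t in vocab]
        return emb[ids].mean(axis=0)          # average of the word vectors

    print(doc_vector("not really worth seeing".split()).shape)  # (50,)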

SLIDE 35

Convolutional networks

[Diagram: the input “not really worth seeing” is mapped to word vectors; a convolution produces feature maps, pooling yields features, and the features feed a classifier]


SLIDE 36

Convolutional networks for text classification

  • CNNs have been used successfully in a number of text classification tasks
  • Convolutions learn feature maps from n-grams
  • It is common to use both word- and character-based CNNs
  • Pooling is often performed over the whole document (generally, we are not interested in non-local dependencies)

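A sketch of such a word-level CNN in Keras (TensorFlow assumed; the layer sizes are illustrative, not tuned):

    import numpy as np
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=10000, output_dim=100),  # word vectors
        tf.keras.layers.Conv1D(128, 3, activation="relu"),  # feature maps over 3-grams
        tf.keras.layers.GlobalMaxPooling1D(),               # pool over the whole document
        tf.keras.layers.Dense(1, activation="sigmoid"),     # binary classifier
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

    x = np.array([[4, 7, 2, 9, 0, 0]])  # one padded sequence of word ids
    print(model.predict(x).shape)       # (1, 1)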

SLIDE 37

Recurrent networks

[Diagram: an RNN unrolled over time; inputs x(0), x(1), …, x(t−1), x(t) feed hidden states h(0), h(1), …, h(t−1), h(t), and the output y(t) is read from the final hidden state]

  • Typically, we get a ‘document representation’ from the final hidden-unit activations
  • RNNs can learn both short- and long-distance combinations of features
  • The use of both word and character inputs is common
  • Bidirectional RNNs are often useful

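A sketch of a bidirectional LSTM classifier in the same style (Keras/TensorFlow assumed); the final hidden states of both directions serve as the document representation:

    import numpy as np
    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(input_dim=10000, output_dim=100),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # h(t), both directions
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

    x = np.array([[4, 7, 2, 9]])   # one toy sequence of word ids
    print(model.predict(x).shape)  # (1, 1)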

SLIDE 38

Some final remarks

  • Both linear classifiers and ANNs are currently used successfully in text classification
  • Performance differs based on the task and the amount of data
  • Some variations include
    – assigning documents to multiple classes: multi-label classification
    – hierarchical classification: e.g., the category of a web page in a web directory like Yahoo!
  • In case there is no labeled data, clustering is another option


SLIDE 39

Summary

  • There are numerous applications of text classification in NLP
  • Both linear classifiers and ANNs are used
  • For linear classifiers, term weighting is important
  • For ANNs, embeddings are crucial

Next:
  Wed: Summary, exam discussion/questions
  Fri: Exercises / assignments



SLIDE 41
  • multi-topic/label classification
  • can also help in word sense disambiguation
  • relation to clustering
  • hierarchical classification
  • (old?) benchmark sets: Reuters-21578, 20 Newsgroups, WebKB, OHSUMED, GENOMICS (TREC 05)
