Text Mining Workshop
LUCDH Studium Digitale
A. Brandsen
25-09-2020
Discover the world at Leiden University

Hello!
I am Alex Brandsen, PhD candidate at the Faculty of Archaeology / DSRP
@alex_brandsen

What is Text Mining?
Deriving information from semi-structured or unstructured text collections
• Search engines
• Spam filters
• Translation
• Turnitin
• Customer service

Today’s Topics
• Challenges of text data
• Pre-processing of text
• Tasks involving text data

After this lecture...
• You can list the challenges of processing text data
• You can motivate and describe the most common text preprocessing steps:
  - tokenisation
  - lowercasing
  - stopword removal
• You can conceptually explain text classification as a supervised learning task

Introduction: Challenges

Challenges of Processing Text Data
Can you think of any challenges / problems specific to text data?

Textual Data is Unstructured
Or at best semi-structured.

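Textual Data can be Multi-Lingual
• Dat is best wel een big deal [That is quite a big deal]
• Oh my god, die jurk is fantastisch [Oh my god, that dress is fantastic]
• Me, when lezers meteen aan de slag gaan met iets dat ze op mijn blog hebben gespot: 😮 [Me, when readers immediately get to work with something they spotted on my blog]
• DUO: “Hoi Valentina, we would advise you to delete your tweet with your burgerservicenummer.” [Hi Valentina, we would advise you to delete your tweet with your citizen service number.]
• Waarschijnlijk is dat seksistische beeld gwn zo ingrained into our society [That sexist image is probably just so ingrained into our society]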

Textual Data is Noisy
• Optical Character Recognition (OCR): printed text to digital text
• Special characters
• Spelling errors

Preprocessing

PreProcessing: from Raw Text to Features
[Pipeline diagram. Data stages: Documents, Raw text, Clean text, Tokens, Linguistically annotated text, Features. Processing steps: Gather, Clean, Tokenise, Normalise, Annotation, Feature creation & selection.]

Tokenisation
Cutting a collection of characters (sentence) into tokens (words)
Mr._O'Neill_thinks_that_Germany's_capital_isn’t_busy.
Mr . O'Neill thinks that Germany 's capital is not busy .
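
A minimal sketch of this step in Python, assuming a regex-based splitter is good enough for illustration. The function name `tokenise` and the two clitic rules (`'s` and `n't`) are hypothetical choices made to reproduce the example sentence above; they are not taken from the workshop material, and forms such as "can't" would need extra handling.

```python
import re

# A word (optionally containing one internal apostrophe) or a single
# punctuation character. This pattern is an illustrative assumption.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def tokenise(text):
    """Split text into word and punctuation tokens."""
    text = text.replace("\u2019", "'")  # treat curly apostrophes like straight ones
    tokens = []
    for tok in TOKEN_RE.findall(text):
        lowered = tok.lower()
        if lowered.endswith("n't") and len(tok) > 3:
            # isn't -> is + not (naive: "can't" would become "ca" + "not")
            tokens.extend([tok[:-3], "not"])
        elif lowered.endswith("'s") and len(tok) > 2:
            # Germany's -> Germany + 's
            tokens.extend([tok[:-2], "'s"])
        else:
            tokens.append(tok)
    return tokens

print(" ".join(tokenise("Mr. O'Neill thinks that Germany's capital isn't busy.")))
```

Note that "O'Neill" stays in one piece because its apostrophe is not followed by a bare "s" or preceded by "n"; deciding which apostrophes to split on is exactly the kind of choice that makes tokenisation harder than splitting on spaces.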

Lowercasing
Sometimes useful, depends on application
Mr . O'Neill thinks that Germany 's capital is not busy .
mr . o'neill thinks that germany 's capital is not busy .

Removing Stopwords
Common words aren’t “useful” in analysis:
mr . o'neill thinks that germany 's capital is not busy .
mr . o'neill thinks germany capital busy .
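
Lowercasing and stopword removal can be sketched in a few lines of Python. The stopword list below is a hypothetical, deliberately tiny set chosen to match the example sentence; real stopword lists are much longer and depend on the language and the application.

```python
# Hypothetical mini stopword list, for illustration only.
STOPWORDS = {"that", "is", "not", "'s", "the", "a", "an", "are", "of"}

def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]

def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

tokens = ["Mr", ".", "O'Neill", "thinks", "that", "Germany", "'s",
          "capital", "is", "not", "busy", "."]
filtered = remove_stopwords(lowercase(tokens))
print(" ".join(filtered))
```

The example also shows a risk: dropping "not" turns "is not busy" into "busy", so whether negations belong on the stopword list depends on the task.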

Removing inflections
Lemmatization: use morphological analysis
• Apples → apple
• Studied → study
Stemming: cut off end
• Apples → apple
• Studied → studi
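
The difference between the two approaches can be illustrated with a toy Python sketch. Both functions are hypothetical simplifications: the stemmer only strips two suffixes, and the lemmatiser is a lookup table standing in for real morphological analysis.

```python
# Toy suffix list; real stemmers apply many more rules.
SUFFIXES = ("ed", "s")

def stem(word):
    """Cut off a known suffix, whether or not the result is a real word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical lookup table standing in for morphological analysis.
LEMMAS = {"apples": "apple", "studied": "study"}

def lemmatise(word):
    """Return the dictionary form of a word if it is known."""
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ["Apples", "Studied"]:
    print(w, "stem:", stem(w), "lemma:", lemmatise(w))
```

Stemming is cheap but can produce non-words such as "studi"; lemmatisation returns real dictionary forms but needs linguistic resources.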

Tasks Involving Text Data
• Text classification
  - input is collection of texts, output is label(s) per text
  - spam categorisation
• Sequence labelling
  - input is text, output is sequence of labels
  - named entity tagging
• Sequence-to-sequence learning
  - input is text, output is text
  - machine translation

(Un)supervised Machine Learning
Supervised:
• training data is labelled
• output is labels
• “spam or not?”
Unsupervised:
• training data has no labels
• output is clusters
• “these two documents look alike”
• “this group of texts has the same topic”
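
To make "these two documents look alike" concrete, here is a small Python sketch that compares documents without using any labels. The choice of Jaccard overlap on word sets is an assumption made for illustration; the slides do not name a similarity measure. The three short texts are the email contents from the spam example later in the deck.

```python
def jaccard(tokens_a, tokens_b):
    """Share of distinct words that two documents have in common (0.0 to 1.0)."""
    a, b = set(tokens_a), set(tokens_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

doc1 = "symposium defense june".split()
doc2 = "notas symposium deadline june".split()
doc3 = "request urgent interest urgent".split()

print(jaccard(doc1, doc2))  # two shared words out of five distinct words
print(jaccard(doc1, doc3))  # no shared words
```

A clustering method would group documents with high mutual similarity, without ever being told which ones are spam.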

Classification

Classification
Texts -> labels
• Emails -> spam or not
• News item -> news category
• New article -> relevant to you or not (researchgate / academia)
• Review -> positive or negative

Classification with Machine Learning
Examples for machine learning:
Doc id   Content                                       Class
1        request urgent interest urgent                Spam
2        assistance low interest deposit               Spam
3        symposium defense june                        Ham
4        notas symposium deadline june                 Ham
New unlabeled email, predict class:
5        registration assistance symposium deadline    ?
Per-word evidence: registration = Neutral, assistance = Spam, symposium = Ham, deadline = Ham. Prediction: Ham!
The machine learning algorithm calculates the probability of an email being spam based on how ‘spammy’ the words in the email are (spamminess is calculated by looking at the distribution of words over the spam and ham categories).
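
One common way to turn word distributions into a prediction is a multinomial Naive Bayes classifier. The slides do not name the algorithm, so the Python sketch below is an assumption, as are the function names, the add-one smoothing, and the decision to skip words that never occurred in training (such as "registration"). The training data is the four labelled emails above.

```python
import math
from collections import Counter

TRAIN = [
    ("request urgent interest urgent", "spam"),
    ("assistance low interest deposit", "spam"),
    ("symposium defense june", "ham"),
    ("notas symposium deadline june", "ham"),
]

def train(examples):
    """Count how often each word occurs in each class."""
    word_counts = {}        # class -> Counter of word frequencies
    doc_counts = Counter()  # class -> number of training documents
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary

def predict(text, model):
    """Return the class with the highest log-probability score."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # class prior
        for word in text.split():
            if word not in vocabulary:
                continue  # unseen words are treated as neutral
            # add-one (Laplace) smoothing avoids zero probabilities
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
print(predict("registration assistance symposium deadline", model))  # prints "ham"
```

With this data, "assistance" pulls the score towards spam, but "symposium" and "deadline" outweigh it, which matches the "Ham!" verdict on the slide.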

Precision and Recall
• Precision: How many of the positive predictions were correct?
  - How many predicted spam emails are actually spam?
• Recall: How many of the positive documents did you retrieve?
  - How many of the spam emails did you catch?
• F1: Harmonic mean of precision and recall
  - How well does my model perform in general?
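
These three measures follow directly from counts of true positives, false positives and false negatives. The Python sketch below computes them by hand; the function name and the six example labels are hypothetical and only serve to show the arithmetic.

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1

# Hypothetical labels: 4 real spam emails, 2 real ham emails.
y_true = ["spam", "spam", "spam", "spam", "ham", "ham"]
y_pred = ["spam", "spam", "ham", "ham", "spam", "ham"]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here 2 of the 3 predicted spam emails are really spam (precision 2/3), and 2 of the 4 real spam emails were caught (recall 1/2), giving an F1 of 4/7.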

Sentiment Analysis Example
Let’s look at some statements!
Unicorns are awesome
Finding unicorns is difficult

Sentiment Analysis Example
Tokenize!
Unicorns are awesome
Finding unicorns is difficult

Sentiment Analysis Example
Lower case!
unicorns are awesome
finding unicorns is difficult

Sentiment Analysis Example
Remove stopwords!
unicorns awesome
finding unicorns difficult
(the stopwords "are" and "is" are removed)

Sentiment Analysis Example
Add structure!
                             unicorns  awesome  finding  difficult  sentiment
unicorns awesome             1         1        0        0          :)
unicorns finding difficult   1         0        1        1          :(
Bag of words approach
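
The bag of words table can be built with a few lines of Python. This is a sketch under simple assumptions: the documents are already tokenised, lowercased and stripped of stopwords, each cell records presence (1) or absence (0) rather than a count, and the function names are hypothetical.

```python
def build_vocabulary(docs):
    """Collect distinct words in order of first appearance."""
    return list(dict.fromkeys(word for doc in docs for word in doc))

def bag_of_words(docs, vocabulary):
    """One row per document, one 0/1 column per vocabulary word."""
    return [[1 if word in doc else 0 for word in vocabulary] for doc in docs]

docs = [
    ["unicorns", "awesome"],
    ["finding", "unicorns", "difficult"],
]
vocabulary = build_vocabulary(docs)
matrix = bag_of_words(docs, vocabulary)

print(vocabulary)
for row in matrix:
    print(row)
```

Word order is thrown away in this representation: "finding unicorns difficult" and "unicorns finding difficult" produce the same row, which is why it is called a bag of words.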

After this lecture...
• You can list the challenges of processing text data
• You can motivate and describe the most common text preprocessing steps:
  - tokenisation
  - lowercasing
  - stopword removal
• You can conceptually explain text classification as a supervised learning task

Tutorial

What will we be doing?
1. Get some movie reviews from IMDB
2. Clean the data
3. Look at the most frequent words
4. Sentiment analysis → :) or :(
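
As a rough preview of the "most frequent words" step, word counting can be sketched with Python's standard library. The two review strings are made up for illustration and are not taken from the IMDB data used in the tutorial.

```python
from collections import Counter

# Hypothetical mini reviews, for illustration only.
reviews = [
    "a great movie with a great cast",
    "a boring movie",
]

# Lowercase, split on whitespace, and count every word.
counts = Counter(word for review in reviews for word in review.lower().split())

print(counts.most_common(3))
```

Without stopword removal the article "a" tops the list, which is one reason cleaning comes before counting.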

Most frequent words

Model performance

Most important words for classifier ☹ ☺

Get started!
Go to alexbrandsen.nl/tmtutorial/tutorial.pdf (or download from Kaltura)
If you already have programming experience, you can also do these other tutorials:
• https://github.com/mchesterkadwell/bughunt-analysis
• https://github.com/alexbrandsen/Text-Analysis-for-Humanities-research
(In both tutorials you can click the Binder button to start a Python notebook in your browser, no installation required.)
