
Text Mining Workshop, LUCDH Studium Digitale, A. Brandsen, 25-09-2020



  1. Text Mining Workshop
     LUCDH Studium Digitale
     A. Brandsen
     25-09-2020
     Discover the world at Leiden University

  2. Hello!
     I am Alex Brandsen, PhD candidate at the Faculty of Archaeology / DSRP
     @alex_brandsen


  4. What is Text Mining?
     Deriving information from semi- or unstructured text collections
     • Search engines
     • Spam filters
     • Translation
     • Turnitin
     • Customer service

  5. Today’s Topics
     • Challenges of text data
     • Pre-processing of text
     • Tasks involving text data

  6. After this lecture...
     • You can list the challenges of processing text data
     • You can motivate and describe the most common text pre-processing steps:
       - tokenisation
       - lowercasing
       - stopword removal
     • You can conceptually explain text classification as a supervised learning task

  7. Introduction: Challenges

  8. Challenges of Processing Text Data
     Can you think of any challenges / problems specific to text data?

  9. Textual Data is Unstructured
     Or at best semi-structured

  10. Textual Data can be Multi-Lingual
      • Dat is best wel een big deal
      • Oh my god, die jurk is fantastisch
      • Me, when lezers meteen aan de slag gaan met iets dat ze op mijn blog hebben gespot: 😮
      • DUO: “Hoi Valentina, we would advise you to delete your tweet with your burgerservicenummer.”
      • Waarschijnlijk is dat seksistische beeld gwn zo ingrained into our society

  11. Textual Data is Noisy
      • Optical Character Recognition (OCR): printed text to digital text
      • Special characters
      • Spelling errors

  12. Preprocessing

  13. Preprocessing: from Raw Text to Features
      Steps: clean & gather, tokenise, annotation, normalise, feature creation & selection
      Data along the way: documents, raw text, clean text, tokens, linguistically annotated text, features

  14. Tokenisation
      Cutting a collection of characters (sentence) into tokens (words)
      Mr._O'Neill_thinks_that_Germany's_capital_isn’t_busy.
      Mr . O'Neill thinks that Germany 's capital is not busy .
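The tokenisation on this slide can be approximated with a few regular expressions. This is only a minimal sketch: the function name `tokenise` and its rules are assumptions made for illustration, not the tokeniser used in the workshop, and the contraction rule is deliberately naive (it would turn "can't" into "ca not").

```python
import re

def tokenise(sentence):
    """Split a sentence into word and punctuation tokens (toy rules)."""
    # Treat the curly apostrophe (’) like a straight one (').
    sentence = sentence.replace("\u2019", "'")
    # Expand the negation clitic: "isn't" -> "is not".
    sentence = re.sub(r"(\w+)n't\b", r"\1 not", sentence)
    # Split off the possessive clitic: "Germany's" -> "Germany 's".
    sentence = re.sub(r"(\w)'s\b", r"\1 's", sentence)
    # A token is a clitic 's, a word (possibly with an inner
    # apostrophe, such as O'Neill), or one punctuation character.
    return re.findall(r"'s\b|\w+(?:'\w+)*|[^\w\s]", sentence)

tokens = tokenise("Mr. O'Neill thinks that Germany's capital isn\u2019t busy.")
print(" ".join(tokens))
```

Real tokenisers handle many more cases (abbreviations, URLs, emoji), which is why a library tokeniser is usually preferred over hand-written rules.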

  15. Lowercasing
      Sometimes useful, depends on application
      Mr . O'Neill thinks that Germany 's capital is not busy .
      mr . o'neill thinks that germany 's capital is not busy .
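Why lowercasing "depends on application" can be seen by counting tokens before and after. The token list below is made up for illustration: lowercasing usefully merges "The" and "the", but it also merges "US" (the country) with "us" (the pronoun).

```python
from collections import Counter

# Hypothetical example tokens, not taken from the workshop data.
tokens = ["The", "capital", "of", "the", "US", "welcomes", "us"]

cased_counts = Counter(tokens)
lowercased_counts = Counter(token.lower() for token in tokens)

# "The" and "the" are counted separately before lowercasing, together after.
print(cased_counts["the"], lowercased_counts["the"])
# "US" and "us" become indistinguishable after lowercasing.
print(cased_counts["us"], lowercased_counts["us"])
```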

  16. Removing Stopwords
      Common words aren’t “useful” in analysis:
      mr . o'neill thinks that germany 's capital is not busy .
      mr . o'neill thinks germany capital busy .
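Stopword removal is a simple filter against a word list. In the sketch below, the `STOPWORDS` set is a small hand-picked assumption chosen to reproduce the slide's example; real stopword lists are much longer and differ per language and library.

```python
# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "is", "not", "of", "that", "the", "'s"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword list."""
    return [token for token in tokens if token not in stopwords]

tokens = ["mr", ".", "o'neill", "thinks", "that", "germany", "'s",
          "capital", "is", "not", "busy", "."]
filtered = remove_stopwords(tokens)
print(" ".join(filtered))
```

Note that dropping "not" changes the meaning of the sentence, so the choice of stopword list matters for tasks such as sentiment analysis.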

  17. Removing inflections
      Lemmatization: use morphological analysis
      • Apples → apple
      • Studied → study
      Stemming: cut off end
      • Apples → apple
      • Studied → studi
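The difference between the two approaches can be sketched with toy versions of each. Both functions below are assumptions made for illustration: `toy_stem` only strips two suffixes, and `toy_lemmatise` uses a two-entry lookup table in place of real morphological analysis.

```python
def toy_stem(word):
    """Stemming: cut off a known suffix, even if the result is not a word."""
    word = word.lower()
    for suffix in ("ed", "s"):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Stand-in for morphological analysis: a hypothetical lemma dictionary.
LEMMAS = {"apples": "apple", "studied": "study"}

def toy_lemmatise(word):
    """Lemmatization: map a word form to its dictionary form."""
    word = word.lower()
    return LEMMAS.get(word, word)

for word in ("Apples", "Studied"):
    print(word, "stem:", toy_stem(word), "lemma:", toy_lemmatise(word))
```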

  18. Tasks Involving Text Data
      • Text classification
        - input is collection of texts, output is label(s) per text
        - spam categorisation
      • Sequence labelling
        - input is text, output is sequence of labels
        - named entity tagging
      • Sequence-to-sequence learning
        - input is text, output is text
        - machine translation

  19. (Un)supervised Machine Learning
      Supervised:
      • training data is labelled
      • output is labels
      • “spam or not?”
      Unsupervised:
      • training data has no labels
      • output is clusters
      • “these two documents look alike”
      • “this group of texts has the same topic”
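The unsupervised idea that "these two documents look alike" needs a similarity measure that works without labels. One common choice, assumed here because the slide does not name one, is cosine similarity over word counts. The three documents are invented for illustration.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two word-count vectors (0 = no overlap)."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical documents, already tokenised.
doc_a = "symposium deadline june".split()
doc_b = "symposium registration deadline".split()
doc_c = "urgent deposit request".split()

print(cosine_similarity(doc_a, doc_b))  # shares two of three words
print(cosine_similarity(doc_a, doc_c))  # shares no words
```

A clustering algorithm would group documents whose pairwise similarity is high, without ever seeing a label.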

  20. Classification

  21. Classification
      Texts -> labels
      • Emails -> spam or not
      • News item -> news category
      • New article -> relevant to you or not (researchgate / academia)
      • Review -> positive or negative

  22. Classification with Machine Learning

      Doc id  Content                                     Class
      1       request urgent interest urgent              Spam
      2       assistance low interest deposit             Spam
      3       symposium defense june                      Ham
      4       notas symposium deadline june               Ham
      5       registration assistance symposium deadline  ?

      Docs 1-4: examples for machine learning
      Doc 5: new unlabeled email, predict class
      Per word in doc 5: registration = Neutral, assistance = Spam, symposium = Ham, deadline = Ham → Ham!

      The machine learning algorithm calculates the probability of an email being spam based on how ‘spammy’ the words in the email are (spamminess is calculated by looking at the distribution of words over the spam and ham categories).
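The slide does not name a specific algorithm. A multinomial Naive Bayes classifier with add-one smoothing is one common way to score how "spammy" words are, and it is used here as an assumption. The sketch trains on the four labelled emails from the slide and classifies the fifth; words never seen in training (such as "registration") are skipped, which matches the slide's "Neutral".

```python
import math
from collections import Counter

training = [
    ("request urgent interest urgent", "Spam"),
    ("assistance low interest deposit", "Spam"),
    ("symposium defense june", "Ham"),
    ("notas symposium deadline june", "Ham"),
]

def train(examples):
    """Count documents and words per class."""
    word_counts = {}          # class label -> Counter of words
    doc_counts = Counter()    # class label -> number of documents
    for text, label in examples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocabulary = set().union(*word_counts.values())
    return word_counts, doc_counts, vocabulary

def predict(text, word_counts, doc_counts, vocabulary):
    """Return the most probable class and the log score of every class."""
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        # Start from the log prior: how common is this class?
        score = math.log(doc_counts[label] / total_docs)
        for word in text.split():
            if word not in vocabulary:
                continue  # unseen word: carries no evidence either way
            # Add-one smoothing avoids a zero probability for words
            # that occur only in the other class.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get), scores

model = train(training)
label, scores = predict("registration assistance symposium deadline", *model)
print(label, scores)
```

Log probabilities are summed instead of multiplying raw probabilities, which avoids numerical underflow on longer texts.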

  23. Precision and Recall
      • Precision: How many of the positive predictions were correct?
        - How many predicted spam emails are actually spam?
      • Recall: How many of the positive documents did you retrieve?
        - How many of the spam emails did you catch?
      • F1: Harmonic mean of precision and recall
        - How well does my model perform in general?
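These three measures follow directly from counting true positives, false positives and false negatives. The sketch below computes them for a made-up set of gold labels and predictions; the function name and the example data are assumptions for illustration.

```python
def precision_recall_f1(gold, predicted, positive="spam"):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against division by zero when nothing was predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical evaluation: 4 spam and 4 ham emails.
gold      = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "ham",  "ham",  "spam", "ham", "ham", "ham"]

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here 2 of the 3 predicted spam emails are really spam (precision 2/3), and 2 of the 4 real spam emails were caught (recall 1/2).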

  24. Sentiment Analysis Example
      Let’s look at some statements!
      Unicorns are awesome
      Finding unicorns is difficult

  25. Sentiment Analysis Example
      Tokenize!
      Unicorns are awesome
      Finding unicorns is difficult

  26. Sentiment Analysis Example
      Lower case!
      unicorns are awesome
      finding unicorns is difficult

  27. Sentiment Analysis Example
      Remove stopwords!
      unicorns are awesome
      finding unicorns is difficult

  28. Sentiment Analysis Example
      Add structure!

      Text                        unicorns  awesome  finding  difficult  sentiment
      unicorns awesome            1         1        0        0          positive
      unicorns finding difficult  1         0        1        1          negative

      Bag of words approach
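The bag of words approach can be sketched in a few lines: build a vocabulary, then mark for each text which vocabulary words it contains. The function name `bag_of_words` is an assumption for illustration; the two token lists are the slide's statements after lowercasing and stopword removal.

```python
def bag_of_words(documents):
    """Turn tokenised documents into binary word-presence vectors."""
    # Vocabulary in order of first appearance.
    vocabulary = []
    for tokens in documents:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    # One row per document, one column per vocabulary word.
    vectors = [
        [1 if word in tokens else 0 for word in vocabulary]
        for tokens in documents
    ]
    return vocabulary, vectors

documents = [
    ["unicorns", "awesome"],
    ["finding", "unicorns", "difficult"],
]
vocabulary, vectors = bag_of_words(documents)
print(vocabulary)
for row in vectors:
    print(row)
```

Word order is discarded on purpose: each text becomes a fixed-length row of numbers that a classifier can use as features.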

  29. After this lecture...
      • You can list the challenges of processing text data
      • You can motivate and describe the most common text pre-processing steps:
        - tokenisation
        - lowercasing
        - stopword removal
      • You can conceptually explain text classification as a supervised learning task

  30. Tutorial

  31. What will we be doing?
      1. Get some movie reviews from IMDB
      2. Clean the data
      3. Look at the most frequent words
      4. Sentiment analysis → :) or :(

  32. Most frequent words
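Counting the most frequent words combines the pre-processing steps with a simple counter. The sketch below uses three invented one-line reviews and a tiny stopword list, both assumptions for illustration; the tutorial itself works with IMDB movie reviews.

```python
import re
from collections import Counter

# Made-up reviews standing in for the IMDB data.
reviews = [
    "A great movie with a great cast",
    "A boring movie",
    "Great soundtrack, boring plot",
]
# Small illustrative stopword list (an assumption).
stopwords = {"a", "and", "is", "the", "with"}

word_counts = Counter()
for review in reviews:
    # Lowercase, then keep runs of letters and apostrophes as tokens.
    tokens = re.findall(r"[a-z']+", review.lower())
    word_counts.update(t for t in tokens if t not in stopwords)

top_words = word_counts.most_common(3)
print(top_words)
```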

  33. Model performance

  34. Most important words for classifier ☹ ☺

  35. Get started!
      Go to alexbrandsen.nl/tmtutorial/tutorial.pdf (or download from Kaltura)
      If you already have programming experience, you can also do these other tutorials:
      • https://github.com/mchesterkadwell/bughunt-analysis
      • https://github.com/alexbrandsen/Text-Analysis-for-Humanities-research
      (In both tutorials you can click the Binder button to start a Python notebook in your browser, no installation required.)
