Text Mining in R (tm 101)
ViennaR
Mario Annau, 22.2.2016

Textmining?
• Statistical analysis of textual data
• Use cases include
  – Spam Filtering
  – Search
  – Sentiment Analysis
  – Topic Modelling
  – ...

tm?
• Infrastructure to analyze collections of texts (corpora) in R
• Typical tm pipeline:
  1. Read data from numerous sources into a Corpus
  2. Preprocess data
  3. Create DTM/TDM
  4. Apply model

Contents
• Data Reading
• Data Structures
• Preprocessing Pipeline
  removePunctuation, tolower, removeWords, stripWhitespace, stemDocument
• Examples
• Known Weaknesses and Outlook
• Plans for SentimentAnalysis (tm.plugin.sentiment 2.0)

Data Reading
• tm separates Data Source and Reader (Iterator)
• Supported Data Sources and Readers:

R> tm::getSources()
[1] "DataframeSource" "DirSource"       "URISource"       "VectorSource"
[5] "XMLSource"       "ZipSource"
R> tm::getReaders()
 [1] "readDOC"                 "readPDF"
 [3] "readPlain"               "readRCV1"
 [5] "readRCV1asPlain"         "readReut21578XML"
 [7] "readReut21578XMLasPlain" "readTabular"
 [9] "readTagged"              "readXML"

• e.g. read PDF files from a directory (pattern is a regular expression, not a shell glob):

R> Corpus(DirSource(directory = ".", pattern = "\\.pdf$"), readerControl = list(reader = readPDF, language = "en"))
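The Source/Reader split can be tried without any files on disk. The following is a minimal sketch, assuming the tm package is installed; the two strings are made-up documents, not data from the talk.

```r
library(tm)  # tm depends on NLP, which provides content()

# Two hypothetical in-memory documents (illustration only).
texts <- c("This is awesome and cool!", "Text mining in R is fun.")

# VectorSource treats each vector element as one document;
# VCorpus reads the elements with the default plain-text reader.
corp <- VCorpus(VectorSource(texts))

length(corp)        # number of documents in the corpus
content(corp[[1]])  # text of the first document
```

Swapping VectorSource for DirSource or URISource changes where the documents come from, while the reader decides how each element is parsed.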

Data Structures
• TextDocument (NLP)
• Annotations (NLP)
• Corpus
• DocumentTermMatrix
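To see how these structures relate, a small sketch (assuming tm is installed, with two made-up documents) builds a corpus, pulls out one TextDocument, and derives a DocumentTermMatrix from it:

```r
library(tm)

# Hypothetical toy corpus (illustration only).
corp <- VCorpus(VectorSource(c("first toy document", "second toy document")))

# A corpus behaves like a list of text documents.
doc <- corp[[1]]
inherits(doc, "TextDocument")  # documents extend the NLP TextDocument class
meta(doc, "language")          # per-document metadata

# A DocumentTermMatrix has one row per document and one column per term.
dtm <- DocumentTermMatrix(corp)
inherits(dtm, "DocumentTermMatrix")
dim(dtm)
```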

Preprocessing Pipeline
R> removePunctuation("This is awesome and cool!")
[1] "This is awesome and cool"
R> tolower("This is awesome and cool!")
[1] "this is awesome and cool!"
R> removeWords("This is awesome and cool!", stopwords())
[1] "This awesome cool!"
R> stripWhitespace(removeWords(tolower(removePunctuation("This is awesome and cool!")), stopwords()))
[1] " awesome cool"
R> stemDocument(crude[[1]])
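The same transformations can be applied to a whole corpus with tm_map. This is a sketch under the assumption that tm is installed; the documents are made up, and stemDocument is left out because it additionally needs the SnowballC package.

```r
library(tm)

# Hypothetical toy corpus (illustration only).
corp <- VCorpus(VectorSource(c("This is awesome and cool!",
                               "Whitespace   gets   collapsed.")))

# Each tm_map call applies one transformation to every document.
corp <- tm_map(corp, removePunctuation)
# Base functions such as tolower must be wrapped so that the result
# stays a text document instead of a bare character vector.
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)

# trimws() drops the leading blank left behind by a removed stop word.
cleaned <- trimws(content(corp[[1]]))
cleaned
```

The order matters: lower-casing before removeWords lets the lower-case stop word list also catch "This".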

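Document Term Matrix
R> control = list(
     removePunctuation = TRUE,
     removeNumbers = TRUE,
     tolower = TRUE,
     stopwords = TRUE,   # remove stop words for the document language
     stemming = TRUE)    # stem terms (needs the SnowballC package)
R> # termFreq() only honours its own option names (stopwords, stemming, ...);
R> # whitespace is handled by the tokenizer, so no stripWhitespace option is needed
R> dtm <- DocumentTermMatrix(crude, control=control)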
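To make the result more concrete, the following sketch (assuming tm is installed, with two made-up documents instead of the crude corpus) builds a small DTM and converts it to a dense matrix for inspection:

```r
library(tm)

# Hypothetical toy corpus (illustration only).
corp <- VCorpus(VectorSource(c("Good, good, bad.", "Bad and ugly!")))

# Only lower-casing and punctuation removal are switched on here.
dtm <- DocumentTermMatrix(corp,
                          control = list(tolower = TRUE,
                                         removePunctuation = TRUE))

Terms(dtm)           # the vocabulary, i.e. the column names
m <- as.matrix(dtm)  # dense copy; fine for toy data, costly for large corpora
m
```

Each cell holds how often a term occurs in a document. For real corpora the sparse representation is usually kept and summarised with helpers such as findFreqTerms.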

Calculate Simple Sentiment Score
• We can now use the DTM to calculate sentiment scores based on a dictionary
• e.g.

# dic_gi: dictionary shipped with tm.plugin.sentiment
dtm <- DocumentTermMatrix(crude, control=control)
pos <- tm_term_score(dtm, dic_gi$positive, FUN = slam::row_sums)
neg <- tm_term_score(dtm, dic_gi$negative, FUN = slam::row_sums)
sentiment <- (pos - neg) / (pos + neg)
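The score can be reproduced on toy data without tm.plugin.sentiment. In this sketch (assuming tm is installed) the documents and the two word lists are made up and merely stand in for dic_gi; the guard against a zero denominator is an addition for documents that contain no dictionary terms.

```r
library(tm)

# Hypothetical documents and dictionaries (illustration only).
corp <- VCorpus(VectorSource(c("good great bad",
                               "bad awful",
                               "neutral words only")))
pos_terms <- c("good", "great", "nice")
neg_terms <- c("bad", "awful")

dtm <- DocumentTermMatrix(corp)

# tm_term_score sums, per document, the counts of the given terms;
# terms missing from the DTM (here "nice") are ignored.
pos <- tm_term_score(dtm, pos_terms)
neg <- tm_term_score(dtm, neg_terms)

# Polarity in [-1, 1]; NA where no dictionary term occurs.
total <- pos + neg
sentiment <- ifelse(total > 0, (pos - neg) / total, NA_real_)
sentiment
```

Without the guard, a document with no matching terms would yield NaN from 0 / 0.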

Known Weaknesses
• ?

SentimentAnalysis package
• tm, tm.plugin.sentiment -> bag of words approach with caveats
• syuzhet -> nice collection of techniques, quite different goals
• coreNLP
• Datasets? -> Bing Liu
