Social Media Information Extraction Tutorials Shubhanshu Mishra 1* , - PowerPoint PPT Presentation

Social Media Information Extraction Tutorials Shubhanshu Mishra 1* , NLP Researcher Rezvaneh (Shadi) Rezapour 2 , PhD Candidate Jana Diesner 2 , Associate Professor 1 Twitter, Inc. 2 University of Illinois at Urbana-Champaign (UIUC) *Work presented here was done during my PhD at UIUC Content and views expressed in this tutorial are solely the responsibility of the presenters. https://socialmediaie.github.io/tutorials/IC2S2_2020/

Initial setup • Open google Colab notebook specified at: https://socialmediaie.github.io/tutorials/IC2S2_2020/#software- setup • On Colab click Connect • Then on the Menu click Runtime > Restart and run all • Meanwhile you can also follow the steps on the link above to install SocialMediaIE locally on your machine. • If you face any issues with installation please report an issue at: https://github.com/socialmediaie/SocialMediaIE/issues 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 2

Agenda • Introduction (30 mins) (Shubhanshu and Jana) • Applications of Information Extraction(IE) (30 mins) (Shubhanshu, Jana and Shadi) • Collecting and distributing social media data (20 mins) • Break (10 mins) • Hands on Practice (Shubhanshu) • Improving IE on social media data using machine learning (1 hr) • Conclusion and future direction (20 mins) 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 3

Introduction 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 4

Information extraction https://shubhanshu.com/phd_thesis/ “Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources.” – (Sarawagi, 2008) 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 5

Digital Social Trace Data https://shubhanshu.com/phd_thesis/ Digital Social Trace Data (DSTD) are digital activity traces generated by individuals as part of a social interactions, such as interactions on social media websites like Twitter, Facebook; or in scientific publications. Inspired from Digital Trace Data (Howison et. al, 2011) 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 6

7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 7

Information extraction tasks https://shubhanshu.com/phd_thesis Corpus level Document level Key-phrase extraction Classification Token level Taxonomy • Sentiment construction • Hate Speech • Sarcasm Tagging Disambiguation • Topic • Spam detection • Named entity • Word Sense • Relation Extraction • Part of speech • Entity Linking Topic modelling 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 8

Examples of information extraction for social media text 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 9

Text classification https://github.com/socialmediaie/SocialMediaIE 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 10

Sequence tagging https://github.com/socialmediaie/SocialMediaIE 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 11

Applications of information extraction Index documents by entities DocID Entity Entity type WikiURL 1 Roger Federer Person URL1 2 Facebook Organization URL2 3 Katy Perry Music Artist URL3 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 12

Applications of Information extraction 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 13

Entity mention clustering Washington is a great place. I just visited Washington . Washington was a great president. Washington made some good changes to constitution. 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 14

Visualizing temporal trends in data https://shubhanshu.com/social-comm-temporal-graph/ 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 15

Lexicon-based Approach Utilizes a lexicon to describe or extract information from a textual content, e.g., lexicon-based sentiment analysis to analyze polarity of text • What to consider first: • How is the lexicon created • Scope: • Using MPQA lexicon to study hashtags in Tweets • Domain Adaptation • Fine-tuning of the lexicon to represent the data • Evaluation of the results • Error analysis, hand annotation, close-reading,.. 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 16

Sentiment Analysis, Presidential Election, and Candidates’ Ranking • Aim: • Test whether incorporating prevalent hashtags from a given dataset into a sentiment lexicon improves sentiment prediction accuracy • Method: • Used hashtag-enhanced lexicon-based sentiment analysis to analyze tweets that mention the US Presidential candidates to find the correlation between the candidates' likeability in tweets with the actual voting outcomes in the New York State Presidential Primary election • Domain adapted the MPQA lexicon: • Extracted and annotated top hashtags and added them to the MPQA lexicon Rezapour, R., Wang, L., Abdar, O., & Diesner, J. (2017). Identifying the overlap between election result and candidates’ ranking based on hashtag -enhanced, lexicon-based sentiment analysis. In 2017 IEEE 11th International Conference on Semantic Computing (ICSC). (pp. 93-96). 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 17

Using moral foundations analysis in analyzing social effects • Motivation: “A language is not just words. It’s a culture, a tradition, a unification of a community, a whole history that creates what a community is. It’s all embodied in a language.” (Noam Chomsky) People’s Cognition, Behavior, Attitude, Emotion, and Values Cultural and Personal Values (Internal Stimuli) People’s Everyday Language and Interaction with the World 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 18

Using moral foundations analysis in analyzing social effects (contd.) • Method: • Use Moral Foundations Dictionary (MFD) to extract words with moral weights and use them as features in prediction models • Limitations with MFD: • Number of entries is small and might not capture (all) variations of terms indicative of morality in text data. • Entries are not syntactically disambiguated, which can limit the results, e.g., by capturing false positives. • Safe (noun) -> does not signal morality • Safe (adjective) -> represents care-virtue • Enhanced MFD: • Used wordnet to get synonym, antonym and hypernym of the words and extensively pruned the lexicon Rezapour, R., Shah, S. H., & Diesner, J. (2019). Enhancing the measurement of social effects by capturing morality. In Proceedings of the Tenth Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis (WASSA). Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL). 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 19 Rezapour, Rezvaneh; Diesner, Jana (2019): Expanded Morality Lexicon. University of Illinois at Urbana-Champaign. https://doi.org/10.13012/B2IDB-3805242_V1.1

Cross-cutting Communication in Social Media Poster Session 3 Detecting Characteristics of Cross- cutting Language Networks on Social Media 3:00 PM - 4:00 PM CDT on July 20 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 20

Detecting and Prioritizing Needs during Crisis Events (i.e., COVID19) • Method: • Created a list of needed resources ranked by priority • Extracted phrases and terms closest to the terms “needs” and “supplies” • Extracted sentences that specify who-needs-what resources • Identified sentences where who is the subject and what is the direct object • Selected sentences where the left child of need in the dependency parse tree is a nominal subject (nsubj), and the right child is a direct object (dobj) Sarol, M. J., Dinh, L., Rezapour, R., Chin, C. L., Yang, P., & Diesner, J. (2020). An Empirical Methodology for Detecting and Prioritizing Needs during Crisis Events. arXiv preprint arXiv:2006.01439 . 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 21

Use of Social Media Data for Research • Publicly available online data provides a unique source of rich input for analyzing and studying people, their behavior, and feelings • Availability of different tools from domains such as NLP and ML made it easier for everyone to perform various types of data analysis • Things to consider before using any data: • How the data is it collected • Is the data reusable for your research • Is the data representative enough • Does the data or method answer your research question • How generalizable is the findings? 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 22

Collecting and distributing social media data 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 23

Publicly available social media data • Many researchers make annotated social media data publicly available for academic research . • Good place for benchmarking or evaluating your models. • Many datasets available for text classification. • Few for information extraction via sequence tagging (but still enough) • Varied annotation practices and data scope: • See here: https://socialmediaie.github.io/MetaCorpus/ 7/17/2020 https://socialmediaie.github.io/tutorials/IC2S2_2020/ 24

Social Media Information Extraction Tutorials Shubhanshu Mishra 1* , - PowerPoint PPT Presentation

Social Media Information Extraction Tutorials Shubhanshu Mishra 1* , NLP Researcher Rezvaneh (Shadi) Rezapour 2 , PhD Candidate Jana Diesner 2 , Associate Professor 1 Twitter, Inc. 2 University of Illinois at Urbana-Champaign (UIUC) *Work

Presentation 1 What is social media? Get Media Smart social media 2 What is social media?

Tutorials By Dr Sharon Truter To the Tutorials By Dr Sharon Truter What to expect from the

uf: Minimizing the Coq Extraction TCB Eric Mullen , Stuart Pernsteiner, James Wilcox, Zachary

Social Media Legal Issues Brian C. England Deputy City Attorney Garland, Texas March 7, 2018

Social Media for Mason AGENDA What is Social Media Social Media Strategy Content

Social Media donts What is social media Social media is nothing new Just an extension

Soil Extraction Cell: An Alternative Soil Extraction Cell: An Alternative Method of Soil

Social Media Analytics Ahmed Abbasi University of Virginia 1 Outline Social Media Overview

Declarative Information Extraction Declarative Information Extraction Using Datalog Datalog with

Getting Social What is social media? Why does social media matter? What social media

network science and social science on Twitter mor naaman rutgers SC&I | social media

Social Media Seminar for Development Educators Part 1: Social Media Basics How are these

TexProtects Tutorials Project Partner: TexProtects Project Name: TexProtects Tutorials Team

lear learnr nr: : Inter Interactiv active e R R tutorials tutorials Jiena McLellan Kans

Tutorials 3 tutorials: Day 1: introduction to Bayesian analysis and BAT, basic examples Day 2:

1. Social Media Outline 1.1. What is Social Media? 1.2. Opinion Retrieval 1.3. Feed

Domain-Specific Languages for Composable Editor Plugins LDTA 2009, York, UK Lennart Kats ( me ),

LR Parsing Compiler Design CSE 504 Shift-Reduce Parsing 1 LR Parsers 2 SLR and LR(1) Parsers

Roadshow presentation 1H 2017/2018 - results Belgium, February 2018 P-1213 1H results supported

Observatories for Biodiversity and Protected Areas Management Steve Peedell European

Fourth Quarter 2018 Investor Presentation 2 SAFE HARBOR STATEMENT FORWARD LOOKING STATEMENTS

VASA: Single-chip MPEG-2 422P@HL CODEC LSI with Multi-chip Configuration for Large Scale

Fiscal 1 st Quarter 2020 Results November 6, 2019 DISCLAIMER Forward-Looking Statements and

Combining LSI with other Classifiers to Improve Accuracy of Single-label Text Categorization Ana

Social Media Information Extraction Tutorials Shubhanshu Mishra 1* , - PowerPoint PPT Presentation

Social Media Information Extraction Tutorials Shubhanshu Mishra 1* , NLP Researcher Rezvaneh (Shadi) Rezapour 2 , PhD Candidate Jana Diesner 2 , Associate Professor 1 Twitter, Inc. 2 University of Illinois at Urbana-Champaign (UIUC) *Work

Presentation 1 What is social media? Get Media Smart social media 2 What is social media?

Tutorials By Dr Sharon Truter To the Tutorials By Dr Sharon Truter What to expect from the

uf: Minimizing the Coq Extraction TCB Eric Mullen , Stuart Pernsteiner, James Wilcox, Zachary

Social Media Legal Issues Brian C. England Deputy City Attorney Garland, Texas March 7, 2018

Social Media for Mason AGENDA What is Social Media Social Media Strategy Content

Social Media donts What is social media Social media is nothing new Just an extension

Soil Extraction Cell: An Alternative Soil Extraction Cell: An Alternative Method of Soil

Social Media Analytics Ahmed Abbasi University of Virginia 1 Outline Social Media Overview

Declarative Information Extraction Declarative Information Extraction Using Datalog Datalog with

Getting Social What is social media? Why does social media matter? What social media

network science and social science on Twitter mor naaman rutgers SC&amp;I | social media

Social Media Seminar for Development Educators Part 1: Social Media Basics How are these

TexProtects Tutorials Project Partner: TexProtects Project Name: TexProtects Tutorials Team

lear learnr nr: : Inter Interactiv active e R R tutorials tutorials Jiena McLellan Kans

Tutorials 3 tutorials: Day 1: introduction to Bayesian analysis and BAT, basic examples Day 2:

1. Social Media Outline 1.1. What is Social Media? 1.2. Opinion Retrieval 1.3. Feed

Domain-Specific Languages for Composable Editor Plugins LDTA 2009, York, UK Lennart Kats ( me ),

LR Parsing Compiler Design CSE 504 Shift-Reduce Parsing 1 LR Parsers 2 SLR and LR(1) Parsers

Roadshow presentation 1H 2017/2018 - results Belgium, February 2018 P-1213 1H results supported

Observatories for Biodiversity and Protected Areas Management Steve Peedell European

Fourth Quarter 2018 Investor Presentation 2 SAFE HARBOR STATEMENT FORWARD LOOKING STATEMENTS

VASA: Single-chip MPEG-2 422P@HL CODEC LSI with Multi-chip Configuration for Large Scale

Fiscal 1 st Quarter 2020 Results November 6, 2019 DISCLAIMER Forward-Looking Statements and

Combining LSI with other Classifiers to Improve Accuracy of Single-label Text Categorization Ana

network science and social science on Twitter mor naaman rutgers SC&I | social media