SLIDE 1
From Emotion Analysis and Topic Extraction to Narrative Modeling
Andreea Kremm
Mohammed Ibraaz Syed
About us
- Andreea Kremm
Founder of Netex Group (www.netex.ai)
M.Sc. Psychology (University of Roehampton, London)
Research Interests:
SLIDE 2
SLIDE 3
What are Narratives? How do Narratives affect the Economy?
"This past year has been the most difficult and painful year of my career.
It was excruciating."
Elon Musk, New York Times interview, 08/16/2018 https://www.nytimes.com/2018/08/16/business/elon-musk-interview-tesla.html
SLIDE 4
Musk's narrative effect on Tesla's stock:
https://finance.yahoo.com/chart/TSLA
SLIDE 5
What is Narrative Economics?
SLIDE 6
How do Narratives spread?
- Kermack-McKendrick (1927) mathematical theory of disease epidemics
- SIR model: S = susceptible, I = infected, R = recovered, where N = S + I + R is assumed constant
- Powerful narratives spread, mutate, and propagate like a virus
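As a minimal sketch of the SIR model named above, the standard equations dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I, dR/dt = gamma*I can be integrated with simple Euler steps. The parameter values (beta, gamma, the population of 1,000) and the function name are illustrative assumptions, not figures from the presentation; in the narrative analogy, "infected" would mean people currently retelling a story.

```python
def simulate_sir(s0, i0, r0, beta, gamma, days, dt=0.1):
    """Integrate the SIR equations with forward Euler steps.

    dS/dt = -beta * S * I / N
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I
    """
    n = s0 + i0 + r0  # total population, constant by construction
    s, i, r = float(s0), float(i0), float(r0)
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for step in range(1, steps + 1):
        new_infections = beta * s * i / n * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((step * dt, s, i, r))
    return history


# Hypothetical parameters chosen only for illustration.
history = simulate_sir(s0=990, i0=10, r0=0, beta=0.5, gamma=0.1, days=100)
peak_time, _, peak_infected, _ = max(history, key=lambda row: row[2])
print(f"peak of {peak_infected:.0f} 'infected' at t = {peak_time:.1f}")
```

Every step moves people from S to I and from I to R, so S + I + R stays equal to N throughout, matching the constant-population assumption on the slide.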
SLIDE 7
The Structure of a Narrative
- The Plot (Overcoming the Monster, Rags to Riches, Voyage, Return, Comedy, Tragedy, Rebirth, etc.)
- The Characters (Hero, Villain, Maiden, King, etc.)
- Emotionally engaging
- Take-Away / Lesson, Call to Action
- A good story is easily remembered and gladly retold
SLIDE 8
Narrative Modeling Algorithm
- Analyze a Narrative:
  - Emotion Analysis
  - Entity-Relation Extraction
  - Topic Extraction
  - Subject Modeling
- Insert into an SIR Disease Epidemics Model Equation
- Predicting Narrative Spread and Economic Consequences
SLIDE 9
Emotion Analysis Showcase
- Task: recognize emotions in written English text
- Solution: Bi-LSTM trained as a classifier
- Resources:
  - NRC-EmoLex (National Research Council Canada Word-Emotion Association Lexicon)
  - Facebook's fastText
  - Training dataset: 7,665 emotion-labeled sentences from the Association for the Advancement of Affective Computing (AAAC)
SLIDE 10
Methodology
SLIDE 11
Emotion Analysis Results
- Random accuracy: 20% (Baseline)
- Softmax (word counting) accuracy: 21%
- IBM Watson Tone Analyzer: 39%
- IBM Watson NLU: 58%
- Bi-LSTM with 128 LSTM cells in one layer: 66%
- Bi-LSTM with 32 LSTM cells in four layers: 71%
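The slides do not say how the "Softmax (word counting)" baseline was built. One plausible reading, sketched below, counts lexicon hits per emotion and passes the counts through a softmax. The five emotion names are an assumption (a 20% random baseline is consistent with five balanced classes), and the toy lexicon is hypothetical: it does not contain actual NRC-EmoLex entries.

```python
import math

# Hypothetical toy lexicon for illustration only; NOT real NRC-EmoLex data.
TOY_LEXICON = {
    "happy": "joy", "delighted": "joy",
    "painful": "sadness", "excruciating": "sadness",
    "furious": "anger", "outraged": "anger",
    "afraid": "fear", "terrified": "fear",
    "astonished": "surprise", "unexpected": "surprise",
}
# Assumed label set; the presentation does not list its classes.
EMOTIONS = ["joy", "sadness", "anger", "fear", "surprise"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sentence):
    """Count lexicon hits per emotion and return (label, probabilities)."""
    counts = {emotion: 0 for emotion in EMOTIONS}
    for token in sentence.lower().split():
        emotion = TOY_LEXICON.get(token.strip(".,!?\"'"))
        if emotion:
            counts[emotion] += 1
    probs = softmax([counts[e] for e in EMOTIONS])
    best = max(range(len(EMOTIONS)), key=probs.__getitem__)
    return EMOTIONS[best], dict(zip(EMOTIONS, probs))


def accuracy(predictions, labels):
    """Fraction of predictions that equal the gold labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


label, probs = classify("It was excruciating.")
print(label, round(probs[label], 2))
```

In this sketch a sentence with no lexicon hits yields a uniform distribution over the five labels, so the classifier is effectively guessing for such inputs.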
SLIDE 12
Visualizing Entity Embeddings
SLIDE 13
Challenges and Limitations
- Limited size of the training dataset
- Limited size of NRC-EmoLex
- Single label emotions
- No subject modeling
- No information about the author's context
- Topic was disregarded
SLIDE 14
Topic Extraction Showcase
- Four Key Problems to Solve:
1. Where do we find an appropriate data set of narrative-rich text?
2. How do we pre-process the data to facilitate narrative extraction?
3. How do we estimate the number of narratives (topics)?
4. How do we estimate narrative similarity and model their evolution?
SLIDE 15
Selecting an Appropriate Dataset
- Politicians are often responsible for spreading narratives
- Press releases issued by politicians
- Politicians’ social media accounts
- Social media messages often lack context
- News data is often labeled with categories / topics and related issues
- Social media and news data can complement each other
SLIDE 16
Data and Pre-Processing
- Data sets selected (solution to 1st problem):
- White House Press Briefings from January 20, 2017 onwards
- Tweets by President Donald Trump from January 20, 2017 onwards
- Narrative extraction-specific pre-processing (solution to 2nd problem):
- Pre-existing labels incorporated into document strings
- Summaries of documents also added to their strings
- 2017 data divided into six 2-month time periods:
- Period 1: January – February
- Period 2: March – April
- Period 3: May – June
- Period 4: July – August
- Period 5: September – October
- Period 6: November – December
SLIDE 17
Methodology (1)
- Additional pre-processing
- Stop words removed
- Terms appearing in 90%+ of documents ignored
- Unigrams and bigrams considered
- Conversion into TFIDF matrix to highlight the most important words
- Documents as rows, words as columns
- Entries are based on word counts in each document
- Entries for words occurring in many documents are down-weighted
- Different matrix for each time period
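The pre-processing and weighting steps above can be sketched in plain Python. The stop-word list, the example documents, and the smoothed IDF formula ln((1 + n) / (1 + df)) + 1 are assumptions made for illustration; the slides do not state which library or exact weighting the authors used.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the authors' list).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "is", "for"}


def tokenize(text):
    """Lowercase, drop stop words, and return unigrams plus bigrams."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def tfidf_matrix(documents, max_df=0.9):
    """Return (vocabulary, matrix): documents as rows, terms as columns."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    doc_freq = Counter(term for c in counts for term in c)
    # Ignore terms that appear in 90% or more of the documents.
    vocab = sorted(t for t, df in doc_freq.items() if df / n_docs < max_df)
    # Smoothed inverse document frequency: common terms get smaller weights.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    matrix = [[c[t] * idf[t] for t in vocab] for c in counts]
    return vocab, matrix


# Hypothetical documents, not taken from the briefings or tweets.
docs = [
    "The president nominated a judge to the Supreme Court",
    "The hurricane relief effort is led by the president",
    "The president discussed tax reform and the Supreme Court",
]
vocab, matrix = tfidf_matrix(docs)
print(len(docs), "documents x", len(vocab), "terms")
```

In the workflow described on the slide, a function like this would be called once per two-month period, giving six separate matrices.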
SLIDE 18
Methodology (2)
- Hierarchical (agglomerative) clustering algorithm (solution to 3rd problem):
- HAC used on each of the 6 TFIDF matrices
- Linkage criterion: Ward’s method (minimizes variance of new clusters)
- Cut-off at 70% of the final merge distance used to estimate the optimal number of clusters
- Output:
SLIDE 19
Methodology (3)
- Hierarchical clustering thresholds: # of clusters increases non-linearly
[Figure panels: 2 clusters, 5 clusters, 16 clusters]
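A minimal sketch of the cluster-count estimate, assuming the cut-off means "70% of the height of the final merge". It implements Ward linkage naively from cluster centroids, using the convention height = sqrt(2|A||B| / (|A| + |B|)) times the centroid distance, and counts the merges above the threshold. The function names and the 2-D toy points are hypothetical; real input would be the rows of a TFIDF matrix.

```python
import math


def ward_merge_heights(points):
    """Naive agglomerative clustering with Ward linkage.

    Returns the height of every merge, in the order the merges happen.
    """
    clusters = [([float(x) for x in p], 1) for p in points]  # (centroid, size)
    heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ca, na), (cb, nb) = clusters[a], clusters[b]
                sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
                cost = math.sqrt(2.0 * na * nb / (na + nb) * sq_dist)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        (ca, na), (cb, nb) = clusters[a], clusters[b]
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append((centroid, na + nb))
        heights.append(cost)
    return heights


def estimate_n_clusters(heights, fraction=0.7):
    """Cut the tree at `fraction` of the final merge height."""
    threshold = fraction * heights[-1]
    return 1 + sum(h > threshold for h in heights)


# Hypothetical toy data: three well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
heights = ward_merge_heights(points)
for fraction in (0.7, 0.3, 0.01):
    print(fraction, estimate_n_clusters(heights, fraction))
```

Lowering the fraction can only keep or raise the number of clusters, and the count can jump in uneven steps as the threshold passes groups of merges at similar heights.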
SLIDE 20
Methodology (4)
- Latent Dirichlet Allocation (LDA) algorithm used to extract topics
- Each topic comes with probabilities of generating particular words
- Used separately for each time period (6 times total)
- Cutoff from hierarchical clustering used to determine # of topics
- Sample Outputs of LDA:
- Supreme Court Nomination topic:
- Federal Emergency topic:
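To make "each topic comes with probabilities of generating particular words" concrete, here is a small collapsed Gibbs sampler for LDA in plain Python. It is a teaching sketch, not the authors' implementation (the slides do not name a library or inference method); the hyperparameters alpha and beta, the iteration count, and the toy token lists are assumptions. The number of topics is passed in, as it would be from the hierarchical clustering cut-off.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns one dict per topic mapping each vocabulary word to its probability.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: k for k, w in enumerate(vocab)}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)
    # Resample each token's topic given all other assignments.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for pos, w in enumerate(doc):
                v, z = word_id[w], assignments[d][pos]
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][pos] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return [
        {w: (topic_word[k][word_id[w]] + beta) / (topic_total[k] + n_words * beta)
         for w in vocab}
        for k in range(n_topics)
    ]


# Hypothetical tokenized documents for illustration.
toy_docs = [
    ["court", "judge", "nomination", "senate", "court"],
    ["judge", "senate", "confirmation", "court"],
    ["hurricane", "relief", "emergency", "storm"],
    ["storm", "emergency", "hurricane", "recovery"],
]
topics = lda_gibbs(toy_docs, n_topics=2)
for k, topic in enumerate(topics):
    top = sorted(topic, key=topic.get, reverse=True)[:3]
    print(f"topic {k}: {top}")
```

Each returned topic is a probability distribution over the whole vocabulary, which is the form needed when topics are later compared with a distance measure.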
SLIDE 21
Methodology (5)
- Dissimilarity / distance measures can be used to compare:
- Two points in space (straight line)
- Two points on a sphere (great circle distance)
- Even two probability distributions
- Hellinger Distance used to compare topics (solution to 4th problem):
- Can effectively determine similar topics
- Can be applied to track topic evolution over time
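For two discrete distributions P and Q over the same words, the Hellinger distance is H(P, Q) = (1 / sqrt(2)) * sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2), which ranges from 0 (identical) to 1 (no shared support). A minimal sketch follows; the example probabilities are made up for illustration.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Both inputs are sequences of probabilities over the same ordered outcomes.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


# Hypothetical word distributions of two topics over a shared 3-word vocabulary.
topic_a = [0.6, 0.3, 0.1]
topic_b = [0.5, 0.3, 0.2]
print(round(hellinger(topic_a, topic_b), 3))
```

Because the measure is symmetric and bounded, distances between different pairs of topics can be compared directly.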
SLIDE 22
Key Results (1)
- Estimated # of clusters from HAC led to coherent topics
- Similar topics (through Hellinger distance) could be compared over time to track topic evolution:
[Figure panels: Time period 1 (January & February, 2017); Time period 2 (March & April, 2017); Time period 3 (May & June, 2017)]
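One way such tracking could work, sketched under assumptions: each topic is a word-to-probability mapping, and every topic in one period is linked to its nearest topic in the next period by Hellinger distance. Treating words missing from a topic as probability 0 is an assumption made here so that topics from separately fitted models can be compared; the topic contents below are hypothetical.

```python
import math


def hellinger_dict(p, q):
    """Hellinger distance between two word -> probability mappings.

    Words absent from one mapping are treated as having probability 0.
    """
    words = set(p) | set(q)
    squared = sum(
        (math.sqrt(p.get(w, 0.0)) - math.sqrt(q.get(w, 0.0))) ** 2 for w in words
    )
    return math.sqrt(squared / 2.0)


def match_topics(period_a, period_b):
    """Link each topic in period_a to its closest topic in period_b."""
    matches = []
    for i, topic in enumerate(period_a):
        distances = [hellinger_dict(topic, other) for other in period_b]
        j = min(range(len(distances)), key=distances.__getitem__)
        matches.append((i, j, distances[j]))
    return matches


# Hypothetical topics for two consecutive time periods.
period_1 = [
    {"court": 0.6, "judge": 0.4},
    {"media": 0.7, "news": 0.3},
]
period_2 = [
    {"media": 0.5, "news": 0.3, "fake": 0.2},
    {"court": 0.5, "judge": 0.3, "senate": 0.2},
]
matches = match_topics(period_1, period_2)
for i, j, dist in matches:
    print(f"period 1 topic {i} -> period 2 topic {j} (distance {dist:.2f})")
```

A distance threshold could additionally flag topics with no close successor, for example a topic that disappears after one period; the slides do not say whether the authors did this.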
SLIDE 23
Key Results (2)
- A “Make America Great” topic was generally the most common
- Discovered the Supreme Court nomination process as a major topic in early 2017
- Criticism of the media was a major topic through multiple time periods
- Model was able to distinguish between unique topics:
- Various foreign policy topics
- Natural disasters – Hurricanes Harvey (Aug. 2017) & Maria (Sep. 2017)
SLIDE 24
Conclusions and Limitations
- Different tools can be effectively combined to model narratives
- Can generate quantitative data on narratives and their evolution
- Narrative Economics is a very young field
- Various avenues for future research:
- New data sources
- Alternate pre-processing methods
- Different thresholds / time intervals / other parameter tuning
SLIDE 25
Future Research
- Analyze a Narrative:
  - Emotion Analysis
  - Entity-Relation Extraction
  - Topic Extraction
  - Subject Modeling
- Insert into an SIR Disease Epidemics Model Equation
- Predicting Narrative Spread and Economic Consequences
SLIDE 26
Acknowledgments
www.narrativeeconomics.com
- Naveed Ghaffar, co-founder Narrative Economics (naveedgh@gmail.com)
- Dr. Rashed Iqbal, co-founder Narrative Economics (rashed_iqbal@econ.ucla.edu)
SLIDE 27