[PPT] - Speed Dating/Matchmaking Event September 19, 2014 John Beieler PowerPoint Presentation

SLIDE 1

BDSS IGERT Speed Dating/Matchmaking Event

September 19, 2014

SLIDE 2

John Beieler

PhD Student, Political Science

jub270@psu.edu johnbeieler.org

SLIDE 3

Event Data
Who did what to whom

Python, R
Natural language processing
Forecasting
Political violence

SLIDE 4

Wanghuan Chu

4th-year Ph.D. student in Statistics
Key strength: Statistical modeling

– Nonparametric regressions, mixed-effects/multilevel models, discrete choice models, statistical learning algorithms, causal inference techniques, etc.

Research experience

– Thesis research: Feature screening methods for ultrahigh dimensional longitudinal data.

e.g. Genetic data with 870,000 SNPs from 540 subjects (p >> n)

– 1st IGERT rotation: Causal mediation analysis for clustering data using mixed-effects models, propensity score modeling and inverse probability weighting.

SLIDE 5

Wanghuan Chu

Potential components for the ideal project

– Parallel computing to Big Data (e.g. MapReduce).
– Statistical methodologies at data analytics layer.
– Interesting social science questions to be explored.

Software: R and SAS (MACRO and SQL)
Interested in learning

– New programming language (e.g. Python).
– New methodology and domain knowledge.

SLIDE 6

An Introduction:

Cindy Cook

cmc496@psu.edu

B.S. in Mathematics

Graph Theory
Parallel Computing with MPI:
Recommender Systems

M.S. in Applied Statistics

R, SAS, Stata, C++
Machine Learning
Survival Analysis
Cox Models

Ph.D. in Statistics

No particular advisor or research

SLIDE 7

Research Interests:

Big Data

With statistical applications in the Social Sciences

Python, parallel computing in R, and broadening my overall computing skills

Data that has spatial/temporal trends
Networks on a large scale
Any combination of these

SLIDE 8

Timmy Huynh ■ Sociology & Demography

Advisor: John Iceland
tnh133@psu.edu

Education

B.A., Geography / Economics, The University of Texas at Austin, 2010
M.A., Social Sciences, The University of Chicago, 2011

Research experience (selected)

REU Summer Institute in Minority Group Demography – Austin, TX, 2009
Summer Institute in LGBT Population Health – Boston, MA, 2010
Asian Americans Advancing Justice – Chicago, IL, 2011-2012
Oak Ridge National Laboratory – Oak Ridge, TN, 2012-2013

Research interests

Urban sociology
Spatial demography
Economic geography
Networks
(Geo)Visualization

Skills

Statistics (Stata, SPSS, R)
GIS (ArcGIS, GeoDa, ERDAS)
Programming (Python, JavaScript)

SLIDE 9

Christopher Inkpen

Sociology and Demography

Recent Projects

determinants of student migration
visualizing global migration patterns
assessing impact of recession on internal migration

Tools

Statistical Models: linear regression, GLM, HLM, fixed and random effects, spatial econometrics
Computing: Stata, R, Python, SQL
Mapping: ArcGIS, CartoDB

Broad Interests

global migration patterns
assimilation
population processes

SLIDE 10

Areas to explore

Population estimation and data fusion

Mapping of social networks

SLIDE 11

Department of Human Development and Family Studies

Rachel Koffer

rek183@psu.edu
3rd year Ph.D. student in the Department of Human Development and Family Studies
Concentrations: Individual Development, Methodology
Advisors: Nilam Ram, David Almeida
B.A. Psychology, Economics; Minor: Environmental Studies

Skills I Bring to the Rotation:

SAS, R, LISREL, SPSS, STATA
Statistical skills: General linear, multilevel, structural equation modeling, PCA and Factor Analysis

Skills I Hope to Develop/Improve During the Rotation:

Python, Data visualization, Machine Learning

SLIDE 12

Methodological Interests:

Analysis of:
Intensive longitudinal data (many measurements across short time span)
Multiple time scales (intensive longitudinal data within longer-term data)

Substantive Interests:

Association between daily experiences and well-being.
Effects of daily stressors on daily and long-term affective (mood) and physical well-being.

Potential Interests for Research Rotation:

Machine learning techniques for developmental time series data
Application of interdisciplinary methods to stress concepts


SLIDE 13 September 19, 2014

Fridolin Linder

Department: Political Science (2nd Year PhD)
Fields: Methodology, Comparative Politics, Statistics (Grad. Minor)
Interests: Predictive Modeling/Machine Learning, Text Analysis (Classification, Scaling), Political Representation, Research Design/Causal Inference/Epistemology
Skills: Statistics, R (substantial), Python
Current Projects: Datamining as Exploratory Data Analysis (w/ Zach Jones), Rationalization of candidate choice through misreporting of ideological self-placement (experiment)


SLIDE 14

SLIDE 15

Jonathan K. Nelson 

Department of Geography
jkn128@psu.edu

Abstract: I am a Ph.D. student in the Department of Geography. Prior to coming to Penn State I was a cartographer for National Geographic. I study spatial data representation and explore patterns and relationships in geographic phenomena, using spatial statistics and visual analytics approaches. I am particularly interested in interactive multi-scale visual and data abstraction techniques for making sense of BIG DATA.

My current research rotation is in the GeoVISTA Center and involves leveraging geo-social media data to support crisis management. Other projects I am working on include: a visual analysis of 1200 student maps from a massive open online course titled “Maps and the Geospatial Revolution”; an exploratory analysis of multiscalar effects of the modifiable areal unit problem on cancer diagnosis rates and median income; and a human-pet-computer interaction study that aims to build healthy relationships between pet owners and their dogs using personal visualization and quantification.

Tools I commonly use for carrying out and conveying my research include: Adobe Creative Suite, Avenza MAPublisher, Final Cut Pro; ESRI ArcGIS, GeoDa, R; CSS, HTML, JavaScript D3.

Keywords: spatial data, visualization, cartography, map, scale, aggregation, information design

SLIDE 16

Deeper Learning in Large-Scale Text

INTELLIGENT SYSTEMS LABORATORY APPLIED COGNITIVE SCIENCES LABORATORY

Alexander G. Ororbia II, IST PhD Student

SLIDE 17

What do I do?

Build:

Deep models for learning from Scholarly Big Data

– Multilayer neural networks, learning kernels
– Boltzmann Machines
– Convolutional Networks: text recognition in-the-wild (“Text in the Wild”)

Active Learning Algorithms

– Bayesian Network Lattice for error-correcting Amazon Mechanical Turker annotations (Ororbia et al., 2014, Under Review)

Investigate:

Can deep architectures discover/model inherent hierarchical structure in text?

How can intelligent systems work in tandem with humans to solve complex problems?

Can intelligent tools be built that harvest and organize vast amounts of scholarly data?

– What insights can these same algorithms extract from the data?

SLIDE 18

BDSS-IGERT

Joshua Snoke snoke@psu.edu

About Me:
2nd Year PhD Student, Department of Statistics
1st Year BDSS-IGERT Trainee
B.S. in Mathematics and Economics
Current Research:
Data Privacy, Disclosure Limitation Methods
Synthetic Data for Public Use (in Sociology Studies), Parametric and Non-Parametric Methods

SLIDE 19

BDSS-IGERT

Joshua Snoke snoke@psu.edu

Currently Seeking a Research Project Outside of the Statistics Dept.
Interests and Applications:
Policy, Politics (National and Global)
Social Networks, Relationships
Methodology, Causal Inference, Bayesian Methods
Computational Proficiency:
Significant Experience in R
Some Experience in Python, Java, and SQL

SLIDE 20

Sam Stehle

Geography

Previous activities

Data management for mobile GIS
Matching space-time patterns for political/social comparison
Visual analytic software design/implementation
Text classification of Twitter data
Event data collection/classification
Time series intervention modeling

Methods experience

Java
Python
C++
R
Spatial analysis, GIS
SQL
Time series analysis
Raster/image analysis
Machine learning with Weka

Background

B.S. University of Utah; geography, minor in Computer Science
M.S. Penn State; geography

SLIDE 21

Sam Stehle

Interests for Future Work

Topics

Political geography
Understanding events
Spatio-temporal patterns
Sport
Media representations
Multi-scale issues

Dissertation considerations

Geography/politics of international sport
British Commonwealth Games
Multi-scale spatio-temporal modeling
RSS feed data + geo/social/political context
Data-driven vs. dictionary event classification
Spatio-temporal diffusion patterns

SLIDE 22

Clio Andris (Assistant Professor of GIScience) Dept. of Geography, clio@psu.edu.

Courses: Fall GEOG560: Interpersonal Relationships in Geographic Space, Spring GEOG363: GIS

A System of Systems

SLIDE 23

Network thanks to Paul Hooper, Emory U.

Example 2 Example 3 Example 4

SLIDE 24

MID/DLE: An End to Data Collection

[Diagram labels: Analyze, Collect, Maintain; Researcher, Computer; three Participants, each with repeated Measures]

Tim Brick, HDFS

SLIDE 25

SLIDE 26

[Figure: Estimated Yearly Average of Two Dynamic Latent Physical Integrity Variables, 1950-2010. Axis: Latent Physical Integrity, from More Abuse to More Respect. Series: Dynamic Standard of Accountability; Constant Standard of Accountability.]

[Figure: Dynamic Standard Model Estimates and Constant Standard Model Estimates, one panel per year, 1976-2010. Axes run from More Abuse to More Respect. Disagreement between estimates increases each year.]

Christopher J. Fariss
Respect for Human Rights has Improved Over Time

SLIDE 27

Human Rights Documentation Project

11,715 human rights documents. Eventually I will have more than 20,000 documents. Most are already coded.

Christopher J. Fariss Respect for Human Rights has Improved Over Time

SLIDE 28

Epidemics – The Dynamics of Infectious Disease

> 19,000 participants
158 countries
> 5,000 completed
> 300,000 unique video views
> 8,000 browsed the forums
> 4,200 participating in forums
3,683 forum threads
31,919 forum posts
15,486 forum comments
Novel format for delivering content
Novel format for interacting with learners
The online discussion is an educational resource

Ferrari, BIOL

SLIDE 29

Epidemics – The Dynamics of Infectious Disease

SLIDE 30

Chris Fowler

Assistant Professor of Geography and Demography csfowler@psu.edu

The big question: When cities spend money on stuff (e.g. affordable housing, transport systems, parks)…
…who suffers, who benefits?
…how do those costs and benefits change communities?

The current question: Can we use demographic data at very fine geographic scales to identify signals that relate to the above?

SLIDE 31

The Measure: Multiscale segregation

[Figure: segregation profiles. Axes: More segregated to More Diverse; Smallest Scale to Largest Scale. Example profiles: Completely Segregated, Very Diverse, Segregated.]

SLIDE 32

Multiscale Segregation Profiles: Functional Forms

[Figure axis: More segregated to More diverse]

18,000 cells x 25 scales x 3 census years = 1.35 million data points to study 8 neighborhoods

SLIDE 33

Multiscale segregation is just the start

Improving the measure
Scale and interpolation issues
Restricted Census data
Other variables
Parcel data on housing price
Poverty status
Income
Other cities
Visualization and classification
Interpretation and prediction

SLIDE 34

GeoTxt SensePlace2

Frank Hardisty (hardisty@psu.edu) at the GeoVISTA Center

SLIDE 35

People and Ideas

GeoVISTA Student Affiliates

John Beieler
Jennifer Mason
Jonathan Nelson
Sam Stehle
Josh Stevens

Project Ideas

Interactive NLP
GeoTxt crowd-sourcing
Interactive social graph analysis
Your pet idea bridging social science and geo-science

SLIDE 36

The Psychometrics of College Tests Loken, HDFS

SLIDE 37

Signal and noise in data on body weight

SLIDE 38

Stephen A. Matthews

Professor of Sociology, Anthropology & Demography (courtesy, Geography) Director, Graduate Program in Demography

Research Interests: My research focuses on population health and health inequality. An important part of my work is an interest in conceptual and methodological issues associated with how neighborhoods are defined and their attributes are measured, and the relevance of these definitions and measures to individual behavior and health outcomes.

Proposed Research Project: A friend and colleague, Basile Chaix (Université Pierre et Marie Curie, Paris), has geocoded data on 90,000 places (activity locations) for 6,000 Parisians. For each respondent we know the self-reported boundaries of their neighborhood (VERITAS-RECORD project).

During the project the emphasis would be to develop/refine methods to (a) compare the patterning of location visits to the self-reported neighborhood; (b) compare patterns across individuals residing in the same neighborhood; (c) identify hierarchical use patterns (frequency) among location types; (d) examine the significance of focal locations (e.g., work, home); and (e) determine the optimal/minimal number (and type) of locations reported that offer a useful proxy for the total distribution of locations that an individual visits.

Stephen A. Matthews – sxm27@psu.edu – BDSS Speed-dating Meeting (Fall 2014) Slide 01

SLIDE 39

VERITAS-RECORD = Visualization and Evaluation of Route Itineraries, Travel Destinations, and Activity Spaces – Residential Environment and CORonary Heart Disease.
YouTube video at https://www.youtube.com/watch?v=91x_S2Q-tic


Ideal Skill Sets Required:
a) Good data organizing skills
b) Good communication skills
c) Good documentation skills
d) Solid statistical background
e) Programming skills (for automating repetitive tasks)
f) Patience
g) Mapping and data visualization skills
h) GIS experience, preferably ArcGIS
i) Some familiarity with point pattern analysis, local neighborhood statistics, and density/surface mapping.
j) Some familiarity with activity space and time-geography literature – and willingness to learn more.

Opportunities to be involved in manuscripts to be developed for publication in epidemiology, public health and/or geography-related journals

SLIDE 40

Quantitative interest

Process models

Multivariate continuous time modeling with all driving parameters person-specific

unbalanced/unstructured data
describing change in terms of instantaneous regulation and short- and long-term trends
synchronicity in changes among the longitudinal variables

Bayesian statistics

flexible framework for implementing parameter estimation for highly complex models
focus: sequential updating methods for online inference from streaming data (health monitors, Twitter etc.)
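
As an illustration of what sequential updating from streaming data can look like, here is a minimal sketch. It assumes a deliberately simple conjugate Beta-Bernoulli model, which is not the model described on the slide; the class `BetaPosterior` and the function `run_stream` are hypothetical names introduced only for this example. The point it shows is that each posterior serves as the prior for the next observation, so the stream never has to be stored.

```python
from dataclasses import dataclass


@dataclass
class BetaPosterior:
    """Beta(alpha, beta) belief about a success probability (illustrative model)."""
    alpha: float = 1.0  # prior pseudo-count of successes (uniform prior)
    beta: float = 1.0   # prior pseudo-count of failures

    def update(self, observation: int) -> None:
        """Fold one 0/1 observation in; the old posterior acts as the new prior."""
        if observation not in (0, 1):
            raise ValueError("observation must be 0 or 1")
        self.alpha += observation
        self.beta += 1 - observation

    @property
    def mean(self) -> float:
        """Posterior mean of the success probability."""
        return self.alpha / (self.alpha + self.beta)


def run_stream(stream):
    """Process observations one at a time; return the posterior and the running means."""
    posterior = BetaPosterior()
    means = []
    for y in stream:
        posterior.update(y)
        means.append(posterior.mean)
    return posterior, means
```

For richer, non-conjugate models the same one-observation-at-a-time pattern would typically need approximate methods such as particle filtering.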

bayesian.zitaoravecz.net

Main goal: developing novel multivariate dynamical models that capture psychologically meaningful properties of change over time in terms of latent variables

study individual differences therein, e.g., interventions can be tailored based on these variables

SLIDE 41

Substantive interest

Affective science

regulatory mechanisms in valence and arousal levels
their connection to personality traits

Well-being

subjective (self-reported) wellbeing as multidimensional state
devising measurement instrument

Cognitive process models

describing decision making in terms of latent variables
identifying links between cognitive parameters and individual characteristics

Research tools

translating methodological research into practical tools
free, user-friendly programs for applied researchers

bayesian.zitaoravecz.net

[Figure: Valence vs. Arousal, both axes from 20 to 100]

SLIDE 42

Donna Peuquet

Department of Geography

Research interests:

Geographic knowledge representation
Knowledge discovery
Space-time dynamics

– Visualization
– Geovisual analytics
– Data models
– Computational/statistical modes

Current project: STempo

Provide the capability to:

– quickly reveal temporal and spatio-temporal patterns from large collections of event data

Find previously unknown patterns
Confirm suspected / assumed patterns
Find examples of specific patterns in other locations / times

Using computational + visualization techniques

…and coded events from RSS newsfeeds - GDELT

SLIDE 43

Potential projects:

Add context to visualization and/or analysis
Examine how patterns change over varying contexts
Develop means to identify anomalies, precursor and postcursor events
Develop capability to visually identify repeating patterns/cycles
Develop means to facilitate evaluation of pattern importance

What kinds of hidden significant structure exist in complex space-time behaviors?

SLIDE 44

NILAM RAM HUMAN DEVELOPMENT & FAMILY STUDIES

NUR5@PSU.EDU

IGERT BDSS PSU SEPTEMBER 19, 2014

FINDING MEANING IN THE DATA FOREST

SLIDE 45 stressdirections.com

Data Acquisition
Data Fusion
Data Management
Data Visualization
Data Mining
Data Modeling

In-Vivo Data
In-Silica Data
In-Virtual Data
Real-Time Data
Interactive Data

SLIDE 46

HMM (STATE SEQUENCE) ESTIMATION

REAL TIME ANALYSIS → TIME-AWARE RECOMMENDATIONS

Probabilistic state sequence extracted from 4-state HMM
ID# 103, age 2-mo
Inoculation Paradigm

$$X_{t+1} = A X_t + V_{t+1}$$
$$Y_{t+1} = C X_t + W_{t+1}$$
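
One common way to extract a state sequence from an HMM is the Viterbi algorithm. The slide does not say which estimator was used, so the following is only a sketch under that assumption. It works with a discrete-observation HMM rather than the linear state equations above, and the function name `viterbi` and every probability table passed to it are invented for illustration.

```python
import math


def viterbi(observations, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path for a discrete HMM.

    start_p[i]    : probability of starting in state i
    trans_p[i][j] : probability of moving from state i to state j
    emit_p[i][o]  : probability that state i emits symbol o
    """
    n_states = len(start_p)

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][j] = log-probability of the best path ending in state j at time t
    best = [[log(start_p[j]) + log(emit_p[j][observations[0]])
             for j in range(n_states)]]
    back = []  # back[t][j] = best predecessor of state j at time t + 1
    for obs in observations[1:]:
        prev = best[-1]
        row, pointers = [], []
        for j in range(n_states):
            i_best = max(range(n_states),
                         key=lambda i: prev[i] + log(trans_p[i][j]))
            row.append(prev[i_best] + log(trans_p[i_best][j]) + log(emit_p[j][obs]))
            pointers.append(i_best)
        best.append(row)
        back.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda j: best[-1][j])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]
```

Viterbi returns a single best path; a probabilistic state sequence of the kind named on the slide would more likely come from forward-backward smoothing, which yields a distribution over states at each time point.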

SLIDE 47

CELLULAR AUTOMATA

SIMULATIONS OF COMPLEX EMERGENT BEHAVIOR + INTERACTIVE DATA VIZ → GAMING

dt = .01, R = .2, A = .08, B = 1.5, C = .15, Du = .5, Dv = 20
100x100 grid with periodic boundaries, random uniform initial conditions (0,.1)

$$\frac{\partial u}{\partial t} = R\,f(u,v) + D_u \nabla^2 u$$
$$\frac{\partial v}{\partial t} = R\,g(u,v) + D_v \nabla^2 v$$
$$f(u,v) = A - Bu + \frac{u^2}{v\,(1 + Cu^2)}$$
$$g(u,v) = u^2 - v$$
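
The simulation described above can be sketched as a forward-Euler update on a periodic grid. This is a minimal illustration, not the presenter's implementation. It assumes the reaction terms are f(u,v) = A - Bu + u²/(v(1 + Cu²)) and g(u,v) = u² - v, which is a reading of the slide's garbled equations, and it assumes unit grid spacing with a five-point Laplacian. The helper names and the small positive lower bound on the initial values (used to keep the division by v finite) are also assumptions.

```python
import random

# Parameters as listed on the slide.
DT, R, A, B, C, DU, DV = 0.01, 0.2, 0.08, 1.5, 0.15, 0.5, 20.0
N = 100  # grid is N x N with periodic (wrap-around) boundaries


def laplacian(grid, i, j):
    """Five-point discrete Laplacian at cell (i, j); unit spacing, periodic edges."""
    n = len(grid)
    return (grid[(i - 1) % n][j] + grid[(i + 1) % n][j]
            + grid[i][(j - 1) % n] + grid[i][(j + 1) % n]
            - 4.0 * grid[i][j])


def f(u, v):
    """Reaction term for u (assumed form, see lead-in)."""
    return A - B * u + u * u / (v * (1.0 + C * u * u))


def g(u, v):
    """Reaction term for v (assumed form, see lead-in)."""
    return u * u - v


def step(u, v):
    """Advance both fields by one explicit (forward) Euler step of size DT."""
    n = len(u)
    new_u = [[0.0] * n for _ in range(n)]
    new_v = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            new_u[i][j] = u[i][j] + DT * (R * f(u[i][j], v[i][j]) + DU * laplacian(u, i, j))
            new_v[i][j] = v[i][j] + DT * (R * g(u[i][j], v[i][j]) + DV * laplacian(v, i, j))
    return new_u, new_v


def random_field(n=N, seed=0):
    """Random uniform initial condition; lower bound nudged above 0 (assumption)."""
    rng = random.Random(seed)
    return [[rng.uniform(1e-6, 0.1) for _ in range(n)] for _ in range(n)]
```

A run would call `random_field` twice with different seeds for `u` and `v` and then apply `step` repeatedly. With DT = .01 and Dv = 20 the explicit diffusion update stays inside the usual stability bound for unit grid spacing (DT × Dv = 0.2 ≤ 0.25).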

SLIDE 48

ENSEMBLE METHODS FOR (UN)STRUCTURED DATA

SLIDE 49

ENSEMBLE METHODS FOR (UN)STRUCTURED DATA

SLIDE 50

Data Privacy, Causal Inference, Categorical Data methodologies …

Aleksandra (Sesa) Slavkovic
sesa@psu.edu

Departments of Statistics & Public Health Sciences
Pennsylvania State University

Sep 19, 2014 @ BDSS matching day

SLIDE 51

Privacy in Statistical Databases

[Diagram labels: Agency/Organization/Database; Respondents/Individuals/Organizations; Users; Queries; Answers. Users: Government, Researchers, Businesses, Clinicians, Patients (or) Malicious adversary]

Large collections of personal information
census & survey data
social networks
medical/public health/genomic
web search records, etc.

Collect → Store → Analyze/Share
Cloud computing

SLIDE 52

Privacy Research questions

Research Matters: Privacy in Statistical Databases
Main theme: integrating computer science and statistical approaches to data privacy

– Social, Behavioral & Economic data
– Trade-off between data utility and disclosure risk
– Rigorous privacy definitions (e.g., Differential Privacy)
– Synthetic data
– Privatization of social networks data
– Private Genome-wide association studies
– Privacy with Distributed databases

Image ref: http://www.orgnet.com/email.html
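
To make the trade-off between data utility and disclosure risk concrete, here is a minimal sketch of the Laplace mechanism, a standard way to satisfy differential privacy for numeric queries. It is offered as a textbook illustration and not as a method taken from the slides; the function names `laplace_from_uniform` and `private_count` are hypothetical. A smaller epsilon gives stronger privacy but noisier answers.

```python
import math
import random


def laplace_from_uniform(u, scale):
    """Map u in (0, 1) to a Laplace(0, scale) draw via the inverse CDF."""
    centered = u - 0.5
    sign = 1.0 if centered >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(centered))


def private_count(true_count, epsilon, rng=random):
    """Release a count with epsilon-differential privacy.

    A counting query changes by at most 1 when one person is added or
    removed, so its sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    u = rng.random()
    while u == 0.0:  # avoid log(0) at the boundary
        u = rng.random()
    return true_count + laplace_from_uniform(u, scale=1.0 / epsilon)
```

This sketch ignores floating-point subtleties that matter in deployed systems, so it should be read as a teaching example only.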

SLIDE 53

Other projects

More general categorical data methodologies (Bayesian analysis, algebraic statistics, …) with observational data

– Causal Inference
– Ecological Inference

Statistical Data integration

– Combining data from multiple sources
– Merging big data with probability samples. Can we use information from surveys to help generalize analyses from large-scale administrative/private or organic data?

Population size estimation
Data analysis and methodology with small n, large p problems in two settings

– CSCW and HCII data
– Study of communication and awareness in online collaborative tools
– Questionnaires and log-activity data
– NEW: Neural data and neuroimaging fMRI data analysis modeling language plasticity in bilinguals
– Time and frequency domain analyses

SLIDE 54

Communication-based diffusion

Rachel Smith, Communication Arts & Sciences

1. PEPFAR Namibia
Existing data available for network analysis and HIV-related indicators

– Two-mode networks (persons and community groups)
– Cross-sectional
– 15 communities, ~n=300 in each site

2. ‘Contagious’ messages
Collect and analyze new data

– Track online messages related to Ebola
– Predict a) what types of messages get passed on to another person, and b) what aspects of the message change and remain the same in the ‘retelling’
– Compare the end versions with CDC or WHO stories and advice

SLIDE 55

Promoting Intergenerational Communication through Facebook

Dr. S. Shyam Sundar

College of Communications

Different Use of Facebook among Senior Citizens (N=352)

Jung, E.H. & Sundar, S. S. (2014). Senior Citizens on Facebook: How do they Interact and Why? Paper presented at the 96th annual conference of the Association for Education in Journalism and Mass Communication, Montreal, Canada.

Social bonding → One-to-one communication (e.g., commenting, chatting)
Social bonding & Social bridging → Self-presentation activities (e.g., updating status)
Social bonding & Curiosity → Social surveillance activities (e.g., checking out people’s walls)

Frequency of senior citizens’ participation in Facebook activities (N = 168)

Facebook Activity                        Mean   SD

Stay in touch with friends and family    3.27   1.43
Reunite with old friends                 2.56   1.14
Keep up with others’ activities          2.39   1.14
Comment on others’ postings              2.38   1.17
View or upload photographs               2.33   1.29
Pass the time                            2.15   1.77
Keep up with current events              1.95   1.20
Update my status                         1.88   1.00
Browse profiles                          1.80   1.08
Post items (e.g. news articles)          1.78   .91

Sundar, S. S., Oeldorf-Hirsch, A., Nussbaum, J. F., & Behr, R. A. (2011). Retirees on Facebook: Can online social networking enhance their health and wellness? Proceedings of the 2011 Annual Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA’11), 2287-2292.

SLIDE 56

BIG DATA

LIKES PHOTOS WALL POSTS

COMMENTS

CHATTING

PRIVATE MESSAGES

What are senior citizens doing on Facebook for intergenerational communication?

What’s on your mind?

BIG DATA @ FACEBOOK

What kinds of technology affordances do senior citizens use? With whom are they using it?

What is the relationship between sender and receiver?

Through sentiment analysis, how much social support do they receive from family members on Facebook?

SLIDE 57

SLIDE 58

SLIDE 59

SLIDE 60

SLIDE 61

Leadership and Sentiment Analysis of an Online Cancer Support Community using Computational Text Mining

Kenneth Portier & Greta Greer, American Cancer Society John Yen, Prasenjit Mitra, Kang Zhao, Baojun Qiu, Dinghao Wu, & Cornelia Caragea, The Pennsylvania State University

Introduction

Online communities are an important source of social support for cancer survivors and caregivers. The ACS Cancer Survivors Network (CSN) is the oldest and largest of these, with 160,000+ members. This study used computational text mining analysis of 48,779 threaded discussions (468,000 posts by 27,173 members) to identify emerging community leaders and classify user sentiment over sequential posts.

Methods

Leader Analysis: Posts of 41 recognized CSN leaders and 2366 other users were analyzed. 21 leadership characteristics were scored and used to calibrate single and ensemble classifiers capable of correctly identifying leaders.

Leader Analysis Results

78% and 85% of community leaders are correctly identified with the single best and ensemble classifiers, respectively.

Sentiment Analysis Results

The best-fitting sentiment classifier has an 80% correct classification rate, with 68.8% of posts classified as positive. 75% of negative thread originators subsequently express positive sentiment when at least one reply is received from peers. Probability increases with number of replies. Positive thread initiators are more likely to have positive subsequent sentiment than negative thread originators.

Implications

Early/proactive identification of potential leaders gives community managers the opportunity to encourage the growth of desired leadership qualities and thereby maintain strong peer leadership.

Sentiment analysis results support the hypothesis that online cancer communities like CSN can effectively facilitate peer interactions in a safe, welcoming environment to help members feel more positive about their situation.

Influence Micro level

Network structure, diffusion, and evolution

Macro level

Sentiment influence & Influential users
Members’ publishing behaviors and influence
Information diffusion & the evolution of collaboration networks

Sentiment Analysis: User sentiment was computed through a multi-stage process. 13 lexical/style features were extracted from a training set of 298 randomly-selected posts manually assigned to positive (204) or negative (94) sentiment, then used to calibrate 5 classifiers. Utilizing the best-fit classifier, sentiment level was established for all 468,000 posts, and change in sentiment between users’ initial and subsequent posts examined.

[Diagram: leader classification pipeline. Labels: Full Community; Training Set (Leader, Participant); Community Member Posting Features; Classifiers; Decision (Leader, Participant)]

Contribution features

– The numbers of posts/threads
– The length of posts
– The time span of one’s activities
– …

Centrality features

– A post-reply network among users
– Nodes and Edges
– In/out-degree, Betweenness, PageRank

Semantic features

– Appearance of words with positive/negative sentiment in a user’s posts
– The use of slang and emoticons
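
Two of the centrality features listed above, in/out-degree and PageRank, can be sketched on a post-reply network in a few lines. This is a minimal standard-library illustration, not the poster's code. It assumes edges point from the replier to the author being replied to; the function names, the damping value of 0.85 and the iteration count are all choices made for the example. Betweenness is left out for brevity.

```python
from collections import defaultdict


def degree_features(edges):
    """edges: iterable of (replier, original_poster) pairs from a post-reply network."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    return dict(in_deg), dict(out_deg)


def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a directed graph given as (src, dst) pairs."""
    nodes = sorted({n for e in edges for n in e})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no out-links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / len(nodes) + damping * dangling / len(nodes)
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank
```

In a toy network where users "b" and "c" both reply to "a" and "a" replies to "b", user "a" receives the highest PageRank, which matches the intuition that frequently answered members are central.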

80% use Internet for health-related purposes

Adult Internet users in the U.S.

1 in 4 joins OHCs

[Diagram: sentiment classification pipeline. Labels: Full Community with unknown sentiment; Training Set with assigned sentiment; Community Member Lexical/style Features; Classifier; Decision; Similarity measure]

[Diagram labels: CSN Forum Posts; Labeled Posts; Non-Labeled Posts; Feature Extraction; Model Selection; Sentiment Model; (Post, Pr, Label)]

ROC Area = Area under the Receiver Operating Characteristic Curve
Best is AdaBoost: False positive rate 0.152, False negative rate 0.33
Best Features: Post_Length, #_Negitive_Words, #_Internet_Slang_Words, #_Names_Mentioned, (N#_Pos+1)/(#_Neg+1), PosStrength, NegStrength
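
The false positive and false negative rates quoted above are computed from a classifier's predictions on labeled posts. The following is a minimal sketch of that calculation; the function name `error_rates` and the label strings are hypothetical, and the data in the usage note is made up.

```python
def error_rates(true_labels, predicted_labels, positive="pos"):
    """Return (false_positive_rate, false_negative_rate) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Guard against division by zero when one class is absent.
    fpr = fp / (fp + tn) if fp + tn else 0.0
    fnr = fn / (fn + tp) if fn + tp else 0.0
    return fpr, fnr
```

For example, with true labels `["pos", "pos", "pos", "neg", "neg"]` and predictions `["pos", "neg", "pos", "pos", "neg"]`, one of two negatives is misclassified and one of three positives is missed.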

[Diagram: thread structure. Initial Post (P1), Responding Reply (R1), 1st Self-Reply (P2), …, Responding Reply (Rm), M-th Self-Reply (Pn)]

Sentiment Change Indicator: The difference between the average sentiment of the originator's self-replies and the initial sentiment of the thread originator. The more positive the sentiment of replies from others, the more positive the originator became.

Classifiers: Naïve Bayesian, Logistic Reg, Random Forest, One-Class SVM, Two-Class SVM, Ensemble Classifier
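
The Sentiment Change Indicator defined above reduces to a small calculation per thread. Here is a minimal sketch; it assumes each post already has a numeric sentiment score (for instance a classifier's probability of positive sentiment, which is an assumption, not something the poster specifies), and the function name `sentiment_change` is hypothetical.

```python
def sentiment_change(initial_sentiment, self_reply_sentiments):
    """Average sentiment of the originator's self-replies minus the initial post's sentiment."""
    if not self_reply_sentiments:
        return None  # the originator never posted again in the thread
    average = sum(self_reply_sentiments) / len(self_reply_sentiments)
    return average - initial_sentiment
```

A positive value means the originator's later posts were, on average, more positive than the post that opened the thread.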

SLIDE 62

Topic Discovery Using Discussion Posts in an Online Cancer Community

Kenneth Portier & Greta Greer, American Cancer Society John Yen, Prasenjit Mitra, Siddhartha Banerjee, Mo Yu, & Prakhar Biyani, The Pennsylvania State University Lior Rokach & Nir Ofek, Ben-Gurion University of the Negev

The ACS Cancer Survivors Network (CSN) is the oldest and largest online community for cancer survivors, with 25K unique visits per day.

Question: Are different discussion topics associated with different sentiment changes, a measure of social support, of the thread initiators?

Method: sentiment analysis and topic modeling
Data: CSN breast and colorectal cancer discussion forum posts from 2005-2010

Sentiment Analysis:

User sentiment is computed through a multi-stage process.

13 lexical/style features are extracted from a training set of 298 randomly-selected posts manually assigned to positive (204) or negative (94) sentiment.

The training set is used to calibrate 5 classifiers.

Utilizing the best-fit classifier (80% correct classification rate), sentiment level was established for all 468,000 posts (68.8% Positive).

Estimate the impact of Responding Reply sentiment on sentiment change of thread initiators.

Negative thread initiators are likely to have positive sentiment change.

Sentiment Change Score computed as the difference between the average sentiment of the originator's self-replies and the initial sentiment of the thread originator.

Only one or a couple of days typically span the time between initial post and first follow-up response by the thread initiator.

The sentiment change observed is likely a reaction to the positive sentiment posts from the community.

The more positive the sentiment of replies from others, the more positive the originator became.

Topic Model Analysis:

Topics of thread initiating posts were identified using Modified Latent Dirichlet Allocation (LDA-VEM), which assigns each initiating post the probabilities of belonging to each topic.

Posts were classified to the highest-probability topic.

Analyses of 20 to 50 topics indicated choice of 30 topics as being reasonable for both forums.

Selected word combinations (bi-grams) identified in an initial analysis of the posts were subsequently converted to single words (e.g. “breast cancer” to “breastcancer”) to retain their meaning.

Remaining words were reduced to root form (i.e., stemming).

Terms occurring very often (> 80% of posts) or very seldom (<5 posts) were removed prior to analysis.
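
The preprocessing steps above can be sketched as follows. This is a minimal standard-library illustration rather than the authors' pipeline: the function names are hypothetical, posts are assumed to be already tokenized into lists of lowercase words, and stemming is omitted because it would require an external stemmer. The last helper shows the "highest-probability topic" rule applied to one post's topic probabilities.

```python
from collections import Counter


def merge_bigrams(tokens, bigrams):
    """Join selected adjacent word pairs, e.g. ("breast", "cancer") -> "breastcancer"."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def filter_vocabulary(docs, max_share=0.80, min_docs=5):
    """Drop terms found in more than max_share of posts or in fewer than min_docs posts."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    keep = {term for term, df in doc_freq.items()
            if df >= min_docs and df / len(docs) <= max_share}
    return [[term for term in doc if term in keep] for doc in docs]


def assign_topic(topic_probabilities):
    """Classify a post to its highest-probability topic (index into the list)."""
    return max(range(len(topic_probabilities)), key=topic_probabilities.__getitem__)
```

The default thresholds mirror the cut-offs stated above (more than 80% of posts, fewer than 5 posts); the topic probabilities themselves would come from whatever LDA implementation is used.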

Methods Overview

Breast Cancer Discussion Forum Colorectal Cancer Discussion Forum

Average sentiment change index (and associated 95% confidence intervals) vs. main post topic for the CSN breast and colorectal cancer discussion boards.

High average sentiment change scores indicate that community responses have a positive effect on the emotions of the thread initiators.

Low average sentiment change scores could indicate either that community response has little impact on the initiator’s emotions or (more likely) that the initial post sentiment was high to begin with.

Sentiment change score vs. initial sentiment by topic for the breast and colorectal cancer discussion boards.

Each box is centered on the mean for each topic, with the area representing 95% confidence.

Topics with high average initial post sentiment tend to have lower average sentiment change scores.

Results

The increased understanding of topics and related sentiment supports development of ancillary information to be made available to CSN members to supplement forum discussions.

We envision using these results to create tools

Notify community leaders when posts with low initiating sentiment do not produce adequate community response

Point the initiator to threads where the topic may have been discussed in the recent past.

These improvements can further improve community social support and subsequently members’ quality of life.

Both forums show that pain, medical worries, and treatment side-effect issues initiate with very low (most negative) sentiment and have highest sentiment change.

Breast cancer posts tend to initiate lower sentiment than colon cancer posts while sentiment change tends to be higher.

Conclusions

[Diagram: thread structure. Initial Post (P1); Responding Replies (R1-Rk) To Initial Post; 1st Self-Reply (P2); Responding Replies (R1-Rk) To Self-Reply; M-th Self-Reply (Pn)]

[Flowchart, sentiment analysis: Training Set N=298; Feature Extraction; Classifier Calibration & Selection; Classification of Corpus; Sentiment Change Analysis]

[Flowchart, topic analysis: Breast & Colorectal Cancer Forums; Initial Post Extraction; Initial Post Key Words; Key Phrase Recoding; Common & Unique Word Removal; LDA-VEM Analysis – Topic Identification; Initial Post Topics & Likelihood]

Cancer Survivors Network (CSN) Discussion Board Posts

Sentiment Change