BDSS IGERT Speed Dating/Matchmaking Event
September 19, 2014
John Beieler
PhD Student, Political Science
jub270@psu.edu johnbeieler.org
Event Data: Who did what to whom
Python, R
Natural language processing
Forecasting
Wanghuan Chu
– Nonparametric regressions, mixed-effects/multilevel models, discrete choice models, statistical learning algorithms, causal inference techniques, etc.
– Thesis research: Feature screening methods for ultrahigh dimensional longitudinal data.
– 1st IGERT rotation: Causal mediation analysis for clustering data using mixed-effects models, propensity score modeling and inverse probability weighting.
Wanghuan Chu
– Parallel computing to Big Data (e.g. MapReduce).
– Statistical methodologies at data analytics layer.
– Interesting social science questions to be explored.
– New programming language (e.g. Python).
– New methodology and domain knowledge.
An Introduction:
Cindy Cook
cmc496@psu.edu
B.S. in Mathematics
M.S. in Applied Statistics
Ph.D. in Statistics
Research Interests:
– Big Data
Sciences
broadening my overall computing skills
– Data that has spatial/temporal trends
– Networks on a large scale
– Any combination of these
Timmy Huynh ■ Sociology & Demography
Advisor: John Iceland tnh133@psu.edu
Education
B.A., Geography / Economics, The University of Texas at Austin, 2010
M.A., Social Sciences, The University of Chicago, 2011
Research experience (selected)
REU Summer Institute in Minority Group Demography – Austin, TX, 2009
Summer Institute in LGBT Population Health – Boston, MA, 2010
Asian Americans Advancing Justice – Chicago, IL, 2011-2012
Oak Ridge National Laboratory – Oak Ridge, TN, 2012-2013
Research interests
Urban sociology
Spatial demography
Economic geography
Networks
(Geo)Visualization
Skills
Statistics (Stata, SPSS, R)
GIS (ArcGIS, GeoDa, ERDAS)
Programming (Python, JavaScript)
Christopher Inkpen
Sociology and Demography
Recent Projects
patterns
internal migration
Tools
GLM, HLM, fixed and random effects, spatial econometrics
Broad Interests
Areas to explore
Population estimation and data fusion
Mapping of social networks
Department of Human Development and Family Studies
Rachel Koffer
rek183@psu.edu
3rd year Ph.D. student in the Department of Human Development and Family Studies
Concentrations: Individual Development, Methodology
Advisors: Nilam Ram, David Almeida
B.A. Psychology, Economics; Minor: Environmental Studies
Skills I Bring to the Rotation:
Software: SAS, R, LISREL, SPSS, STATA
Statistical skills: General linear, multilevel, structural equation modeling, PCA and Factor Analysis
Skills I Hope to Develop/Improve During the Rotation:
Python; Data visualization, Machine Learning
Methodological Interests:
Analysis of: Intensive longitudinal data (many measurements across short time span); Multiple time scales (intensive longitudinal data w/in longer-term data);
Substantive Interests:
Association between daily experiences and well-being. Effects of daily stressors on daily and long-term affective (mood) and physical well-being.
Potential Interests for Research Rotation:
Machine learning techniques for developmental time series data
Application of interdisciplinary methods to stress concepts
Fridolin Linder
Department: Political Science (2nd Year PhD)
Fields: Methodology, Comparative Politics, Statistics (Grad. Minor)
Interests: Predictive Modeling/Machine Learning, Text Analysis (Classification, Scaling), Political Representation, Research Design/Causal Inference/Epistemology
Skills: Statistics, R (substantial), Python
Current Projects: Data mining as Exploratory Data Analysis (w/ Zach Jones), Rationalization of candidate choice through misreporting of ideological self-placement (experiment)
Jonathan K. Nelson
Department of Geography jkn128@psu.edu
Abstract: I am a Ph.D. student in the Department of Geography. Prior to coming to Penn State I was a cartographer for National Geographic. I study spatial data representation and explore patterns and relationships in geographic phenomena, using spatial statistics and visual analytics approaches. I am particularly interested in interactive multi-scale visual and data abstraction techniques for making sense of BIG DATA.
My current research rotation is in the GeoVISTA Center and involves leveraging geo-social media data to support crisis management. Other projects I am working on include: a visual analysis of 1200 student maps from a massive open online course titled “Maps and the Geospatial Revolution;” an exploratory analysis on multiscalar effects of the modifiable areal unit problem on cancer diagnosis rates and median income; and a human-pet-computer interaction study that aims to build healthy relationships between pet owners and their dogs using personal visualization and quantification.
Tools I commonly use for carrying out and conveying my research include: Adobe Creative Suite, Avenza MAPublisher, Final Cut Pro; ESRI ArcGIS, GeoDa, R; CSS, HTML, JavaScript D3.
Keywords: spatial data, visualization, cartography, map, scale, aggregation, information design
INTELLIGENT SYSTEMS LABORATORY APPLIED COGNITIVE SCIENCES LABORATORY
Alexander G. Ororbia II, IST PhD Student
What do I do?
Build:
Deep models for learning from Scholarly Big Data
Multilayer neural networks, learning kernels
Boltzmann Machines
Convolutional Networks: text recognition in-the-wild (“Text in the Wild”)
Active Learning Algorithms
Bayesian Network Lattice for error-correcting Amazon Mechanical Turker annotations (Ororbia et al., 2014, Under Review)
Investigate:
Can deep architectures discover/model inherent hierarchical structure in text?
How can intelligent systems work in tandem with humans to solve complex problems?
Can intelligent tools be built that harvest and organize vast amounts of scholarly data?
What insights can these same algorithms extract from the data?
Joshua Snoke snoke@psu.edu
Parametric and Non-Parametric Methods
Joshua Snoke snoke@psu.edu
Geography
Previous activities
political/social comparison
design/implementation
Methods experience
Background
Interests for Future Work
Topics
Dissertation considerations
Clio Andris (Assistant Professor of GIScience) Dept. of Geography, clio@psu.edu.
Courses: Fall GEOG560: Interpersonal Relationships in Geographic Space, Spring GEOG363: GIS
A System of Systems
Network thanks to Paul Hooper, Emory U.
Example 2 Example 3 Example 4
MID/DLE: An End to Data Collection
[Diagram labels: Researcher, Computer, Participant (x3); Analyze, Collect, Measure, Maintain]
Tim Brick, HDFS
Human Rights Documentation Project
11,715 human rights documents. Eventually I will have more than 20,000 documents. Most are already coded.
Christopher J. Fariss
Respect for Human Rights has Improved Over Time
Epidemics – The Dynamics
Ferrari, BIOL
Epidemics – The Dynamics
Chris Fowler
Assistant Professor of Geography and Demography csfowler@psu.edu
The big question: When cities spend money on stuff (e.g. affordable housing, transport systems, parks)…
…who suffers, who benefits?
…how do those costs and benefits change communities?
The current question: Can we use demographic data at very fine geographic scales to identify signals that relate to the above?
[Figure: segregation profiles. Axis labels: More Segregated to More Diverse; Smallest Scale to Largest Scale. Example profiles: Completely Segregated, Very Diverse, Segregated]
The Measure: Multiscale segregation
Multiscale Segregation Profiles: Functional Forms
[Figure axis: More segregated to More diverse]
18,000 cells x 25 scales x 3 census years = 1.35 million data points to study 8 neighborhoods
Multiscale segregation is just the start
GeoTxt SensePlace2
Frank Hardisty (hardisty@psu.edu) at the GeoVISTA Center
People and Ideas
GeoVISTA Student Affiliates
Project Ideas
sourcing
graph analysis
bridging social science and geoscience
The Psychometrics of College Tests
Loken, HDFS
Signal and noise in data on body weight
Stephen A. Matthews
Professor of Sociology, Anthropology & Demography (courtesy, Geography) Director, Graduate Program in Demography
Research Interests: My research focuses on population health and health inequality. An important part of my work is an interest in conceptual and methodological issues associated with how neighborhoods are defined and their attributes are measured, and the relevance of these definitions and measures to individual behavior and health outcomes.
Proposed Research Project: A friend and colleague, Basile Chaix (Université Pierre et Marie Curie, Paris) has geocoded data on 90,000 places (activity locations) for 6,000 Parisians. For each respondent we know the self-reported boundaries of their neighborhood (VERITAS-RECORD project).
During the project the emphasis would be to develop/refine methods to (a) compare the patterning of location visits to the self-reported neighborhood; (b) compare patterns across individuals residing in the same neighborhood; (c) identify hierarchical use patterns (frequency) among location types; (d) examine the significance of focal locations (e.g., work, home); and (e) determine the optimal/minimal number (and type) of locations reported that offer a useful proxy for the total distribution of locations that an individual visits.
Stephen A. Matthews – sxm27@psu.edu – BDSS Speed-dating Meeting (Fall 2014) Slide 01
VERITAS-RECORD = Visualization and Evaluation of Route Itineraries, Travel Destinations, and Activity Spaces – Residential Environment and CORonary Heart Disease. YouTube video at https://www.youtube.com/watch?v=91x_S2Q-tic
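As a rough illustration of task (a) in the proposed project, the sketch below computes the share of a respondent's activity locations that fall inside their self-reported neighborhood boundary. It assumes planar (projected) x/y coordinates and a simple polygon without holes; the function names `point_in_polygon` and `share_inside` are hypothetical and are not part of the VERITAS-RECORD tooling.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices in order; planar coordinates assumed.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def share_inside(locations, neighborhood):
    """Fraction of activity locations that fall inside the neighborhood polygon."""
    if not locations:
        return 0.0
    hits = sum(point_in_polygon(p, neighborhood) for p in locations)
    return hits / len(locations)
```

A low share would suggest that the self-reported neighborhood covers only a small part of where the respondent actually goes. Real geocoded data would normally be handled with a GIS library and proper map projections rather than this hand-rolled test.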
Ideal Skill Sets Required:
a) Good data organizing skills
b) Good communication skills
c) Good documentation skills
d) Solid statistical background
e) Programming skills (for automating repetitive tasks)
f) Patience
g) Mapping and data visualization skills
h) GIS experience, preferably ArcGIS
i) Some familiarity with point pattern analysis, local neighborhood statistics, and density/surface mapping.
j) Some familiarity with activity space and time-geography literature – and willingness to learn more.
Opportunities to be involved in manuscripts to be developed for publication in epidemiology, public health and/or geography-related journals
Quantitative interest
Process models Multivariate continuous time modeling with all driving parameters person-specific
Bayesian statistics
Twitter etc.)
bayesian.zitaoravecz.net
Main goal: developing novel multivariate dynamical models that capture psychologically meaningful properties of change over time in terms of latent variables
study individual differences therein, e.g., interventions can be tailored based on these variables
Substantive interest
Affective science
Well-being
Cognitive process models
and individual characteristics
Research tools
[Figure: Valence and Arousal axes]
Donna Peuquet
Department of Geography
Research interests:
representation
– Visualization – Geovisual analytics – Data models – Computational/statistical modes
Current project: STempo
– quickly reveal temporal and spatio-temporal patterns from large collections of event data
patterns
in other locations / times
techniques
newsfeeds - GDELT
Potential projects:
contexts
and postcursor events
patterns/cycles
pattern importance
What kinds of hidden significant structure exist in complex space-time behaviors?
NILAM RAM HUMAN DEVELOPMENT & FAMILY STUDIES
NUR5@PSU.EDU
IGERT BDSS PSU SEPTEMBER 19, 2014
FINDING MEANING IN
THE DATA FOREST
Data Acquisition, Data Fusion, Data Management, Data Visualization, Data Mining, Data Modeling
In-Vivo Data, In-Silico Data, In-Virtual Data, Real-Time Data, Interactive Data
HMM (STATE SEQUENCE) ESTIMATION
REAL TIME ANALYSIS TIME-AWARE RECOMMENDATIONS
Probabilistic state sequence extracted from 4-state HMM ID# 103, age 2-mo Inoculation Paradigm
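The slide does not say which algorithm was used to extract the state sequence. One common choice for finding the most likely hidden state path in an HMM is the Viterbi algorithm, sketched below in log space. The function name `viterbi` and the argument layout (start, transition, and emission probabilities as nested lists, all strictly positive) are assumptions made for illustration.

```python
import math


def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden state path for a discrete-emission HMM.

    obs      : list of observed symbol indices
    start_p  : start_p[s] = P(first state is s)
    trans_p  : trans_p[p][s] = P(next state is s | current state is p)
    emit_p   : emit_p[s][o] = P(symbol o | state s)
    All probabilities are assumed to be strictly positive.
    """
    n_states = len(start_p)
    # Log-probability of the best path ending in each state after the first symbol
    score = [math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
             for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        new_score, pointers = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + math.log(trans_p[p][s]))
            pointers.append(best_prev)
            new_score.append(score[best_prev]
                             + math.log(trans_p[best_prev][s])
                             + math.log(emit_p[s][o]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

A "probabilistic" state sequence as named on the slide may instead refer to posterior state probabilities from the forward-backward algorithm; the sketch above returns a single best path only.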
$$X_{t+1} = A X_t + V_{t+1}$$
$$Y_{t+1} = C X_t + W_{t+1}$$
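To make the state-space equations concrete, here is a minimal simulation sketch for the scalar case, with X the latent state and Y the observation. Gaussian noise for V and W, the scalar restriction, and the function name `simulate_state_space` are assumptions; the slide gives only the two equations.

```python
import random


def simulate_state_space(a, c, n_steps, x0=0.0, state_sd=1.0, obs_sd=1.0, seed=0):
    """Simulate a scalar linear state-space model.

    X_{t+1} = a * X_t + V_{t+1}   (state equation)
    Y_{t+1} = c * X_t + W_{t+1}   (observation equation, indexed as on the slide)
    V and W are drawn as independent Gaussian noise (an assumption).
    """
    rng = random.Random(seed)
    x = x0
    states, observations = [], []
    for _ in range(n_steps):
        y_next = c * x + rng.gauss(0.0, obs_sd)
        x_next = a * x + rng.gauss(0.0, state_sd)
        observations.append(y_next)
        states.append(x_next)
        x = x_next
    return states, observations
```

Setting both noise standard deviations to zero reduces the model to its deterministic part, which is a quick way to check that the recursion matches the equations.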
CELLULAR AUTOMATA
SIMULATIONS OF COMPLEX EMERGENT BEHAVIOR + INTERACTIVE DATA VIZ GAMING
dt = .01, R = .2, A = .08, B = 1.5, C = .15, Du = .5, Dv = 20 100x100 grid with periodic boundaries, random uniform initial conditions (0,.1)
$$\frac{\partial u}{\partial t} = R\,f(u,v) + D_u \nabla^2 u$$
$$\frac{\partial v}{\partial t} = R\,g(u,v) + D_v \nabla^2 v$$
$$f(u,v) = A - Bu + \frac{u^2}{v\,(1 + Cu^2)}$$
$$g(u,v) = u^2 - v$$
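The sketch below shows one explicit-Euler update for this reaction-diffusion system on a square grid with periodic boundaries, using the parameter values listed on the slide. It assumes the equations read u_t = R f(u,v) + Du ∇²u and v_t = R g(u,v) + Dv ∇²v with f = A − Bu + u²/(v(1 + Cu²)) and g = u² − v; the minus signs are not visible in the extracted slide text and are assumed here from the usual activator-inhibitor form. Grid spacing of 1 and the function names are also assumptions.

```python
import random

# Parameter values as listed on the slide
PARAMS = dict(dt=0.01, R=0.2, A=0.08, B=1.5, C=0.15, Du=0.5, Dv=20.0)


def laplacian(grid, i, j):
    """Five-point Laplacian with periodic boundaries (grid spacing assumed to be 1)."""
    n = len(grid)
    return (grid[(i - 1) % n][j] + grid[(i + 1) % n][j]
            + grid[i][(j - 1) % n] + grid[i][(j + 1) % n]
            - 4.0 * grid[i][j])


def step(u, v, p=PARAMS):
    """Advance both fields by one explicit Euler time step on an n x n grid."""
    n = len(u)
    new_u = [[0.0] * n for _ in range(n)]
    new_v = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            uu, vv = u[i][j], v[i][j]
            safe_v = max(vv, 1e-12)  # guard against division by zero (an added assumption)
            f = p["A"] - p["B"] * uu + uu * uu / (safe_v * (1.0 + p["C"] * uu * uu))
            g = uu * uu - vv
            new_u[i][j] = uu + p["dt"] * (p["R"] * f + p["Du"] * laplacian(u, i, j))
            new_v[i][j] = vv + p["dt"] * (p["R"] * g + p["Dv"] * laplacian(v, i, j))
    return new_u, new_v


def random_grid(n, rng):
    """Random uniform initial conditions on (0, 0.1), as stated on the slide."""
    return [[rng.uniform(0.0, 0.1) for _ in range(n)] for _ in range(n)]
```

With dt = 0.01 and Dv = 20 the diffusion number Dv·dt is 0.2, which is below the 0.25 stability limit for this explicit scheme at unit grid spacing. A pure-Python loop over a 100x100 grid is slow, so an array library would be the practical choice for the full simulation.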
ENSEMBLE METHODS FOR (UN)STRUCTURED DATA
Data Privacy, Causal Inference, Categorical Data methodologies …
Aleksandra (Sesa) Slavkovic sesa@psu.edu
Departments of Statistics & Public Health Sciences
Pennsylvania State University
Sep 19, 2014 @ BDSS matching day
Privacy in Statistical Databases
[Diagram labels: Agency/Organization/Database; Respondents/Individuals/Organizations; Users; Queries; Answers; Government, Researchers, Businesses; Clinicians; Patients (or) Malicious adversary]
Collect → Store → Analyze/Share
Cloud computing
Privacy Research questions
approaches to data privacy
– Social, Behavioral & Economic data
– Trade-off between data utility and disclosure risk
– Rigorous privacy definitions (e.g., Differential Privacy)
– Synthetic data
– Privatization of social networks data
– Private Genome-wide association studies
– Privacy with Distributed databases
Image ref: http://www.orgnet.com/email.html
Other projects
with observational data
– Causal Inference
– Ecological Inference
– Combining data from multiple sources
– Merging big data with probability samples. Can we use information from surveys to help generalize analyses from large-scale administrative/private or organic data?
– CSCW and HCII data
– Study of communication and awareness in online collaborative tools
– Questioners and log-activity data
– NEW: Neural data and neuroimaging fMRI data analysis modeling language plasticity in bilinguals
– Time and frequency domain analyses
Communication-based diffusion
Rachel Smith, Communication Arts & Sciences
network analysis and HIV-related indicators
– Two-mode networks (persons and community groups)
– Cross-sectional
– 15 communities, ~n=300 in each site
data
– Track online messages related to Ebola
– Predict a) what types of messages get passed on to another person, and b) what aspects of the message change and remain the same in the ‘retelling’
– Compare end to CDC or WHO stories and advice
Promoting Intergenerational Communication through Facebook
College of Communications
Different Use of Facebook among Senior Citizens (N=352)
Jung, E.H. & Sundar, S. S. (2014). Senior Citizens on Facebook: How do they Interact and Why? Paper presented at the 96th annual conference of the Association for Education in Journalism and Mass Communication, Montreal, Canada.
Frequency of senior citizens’ participation in Facebook activities (N = 168)

Facebook Activity                        Mean   SD
Stay in touch with friends and family    3.27   1.43
Reunite with                             2.56   1.14
Keep up with                             2.39   1.14
Comment on                               2.38   1.17
View or upload photographs               2.33   1.29
Pass the time                            2.15   1.77
Keep up with current events              1.95   1.20
Update my status                         1.88   1.00
Browse profiles                          1.80   1.08
Post items (e.g. news articles)          1.78   .91
Sundar, S. S., Oeldorf-Hirsch, A., Nussbaum, J. F., & Behr, R. A. (2011). Retirees on Facebook: Can online social networking enhance their health and wellness? Proceedings of the 2011 Annual Conference Extended Abstracts on Human Factors in Computing Systems (CHI EA’11), 2287-2292.
BIG DATA
LIKES PHOTOS WALL POSTS
COMMENTS
CHATTING
PRIVATE MESSAGES
What are senior citizens doing on Facebook for intergenerational communication?
What’s on your mind?
BIG DATA @ FACEBOOK
citizens use? With whom are they using it?
receiver?
support do they receive from family members on Facebook?
Online communities are an important source of social support for cancer survivors and caregivers. The ACS Cancer Survivors Network (CSN) is the oldest and largest
computational text mining analysis of 48,779 threaded discussions (468,000 posts by 27,173 members) to identify emerging community leaders and classify user sentiment over sequential posts.
Leader Analysis: Posts of 41 recognized CSN leaders and 2366 other users were analyzed. 21 leadership characteristics were scored and used to calibrate single and ensemble classifiers capable of correctly identifying leaders.
78% and 85% of community leaders are correctly identified with the single best and ensemble classifiers respectively. The best fitting sentiment classifier has an 80% correct classification rate with 68.8% of posts classified as positive. 75% of negative thread originators subsequently express positive sentiment when at least one reply is received from
thread initiators are more likely to have positive subsequent sentiment than negative thread originators.
gives community managers the opportunity to encourage the growth of desired leadership qualities and thereby maintain strong peer leadership.
that online cancer communities like CSN can effectively facilitate peer interactions in a safe, welcoming environment to help members feel more positive about their situation.
Influence Micro level
Network structure, diffusion, and evolution
Macro level
Sentiment influence & Influential users
Members’ publishing behaviors and influence
Information diffusion & the evolution of collaboration networks
Sentiment Analysis: User sentiment was computed through a multi-stage process. 13 lexical/style features were extracted from a training set of 298 randomly-selected posts manually assigned to positive (204) or negative (94) sentiment, then used to calibrate 5 classifiers. Utilizing the best-fit classifier, sentiment level was established for all 468,000 posts, and change in sentiment between users’ initial and subsequent posts examined.
[Diagram labels: Community Member; Posting Features; Classifiers; Decision; Leader; Participant; Full Community; Training Set; Leader; Participant]
– The numbers of posts/threads
– The length of posts
– The time span of one’s activities
– …
– A post-reply network among users
– Nodes and Edges
– In/out-degree, Betweenness, PageRank
– Appearance of words with positive/negative sentiment in a user’s posts
– The use of slang and emoticons
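The network features listed above can be illustrated with a small sketch that builds a directed post-reply network from (replier, original poster) pairs and computes in-degree, out-degree, and PageRank by power iteration. The edge direction, the damping factor of 0.85, and the function names are assumptions for illustration; betweenness is omitted to keep the sketch short, and the poster does not say which software was used.

```python
def degree_counts(edges):
    """In-degree and out-degree for every node in a directed edge list."""
    in_deg, out_deg = {}, {}
    for src, dst in edges:
        out_deg[src] = out_deg.get(src, 0) + 1
        in_deg[dst] = in_deg.get(dst, 0) + 1
        # Make sure every node appears in both dictionaries
        in_deg.setdefault(src, 0)
        out_deg.setdefault(dst, 0)
    return in_deg, out_deg


def pagerank(edges, damping=0.85, n_iter=100):
    """PageRank by power iteration; edges are (replier, original_poster) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(n_iter):
        # Rank held by nodes without out-links is spread evenly over all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank
```

In this orientation a member whose posts attract many replies ends up with a high in-degree and a high PageRank, which is one plausible signal of influence within the community.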
80% use Internet for health-related purposes
1 in 4 joins OHCs
[Diagram labels: Community Member; Lexical/style Features; Classifier; Decision; Leader; Participant; Full Community with unknown sentiment; Training Set with assigned sentiment]
CSN Forum Posts Labeled Posts Non- Labeled Posts Sentiment Model Model Selection Feature Extraction (Post, Pr, Label)
ROC Area = Area under the Receiver Operating Characteristic Curve
Best is AdaBoost: False positive rate 0.152, False negative rate 0.33
Best Features: Post_Length, #_Negative_Words, #_Internet_Slang_Words, #_Names_Mentioned, (#_Pos+1)/(#_Neg+1), PosStrength, NegStrength
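A few of the listed features can be sketched as follows. The word lists here are tiny hypothetical stand-ins, not the lexicons used in the study, and post length is measured in word tokens, which is an assumption since the poster does not define it. The function name `extract_features` is likewise illustrative.

```python
import re

# Hypothetical word lists for illustration only; the study's lexicons are not given
POSITIVE_WORDS = {"good", "great", "hope", "thanks", "happy"}
NEGATIVE_WORDS = {"bad", "pain", "scared", "worried", "sad"}
SLANG_WORDS = {"lol", "omg", "btw"}


def extract_features(post):
    """Compute a handful of lexical/style features for one forum post."""
    tokens = re.findall(r"[a-z']+", post.lower())
    n_pos = sum(t in POSITIVE_WORDS for t in tokens)
    n_neg = sum(t in NEGATIVE_WORDS for t in tokens)
    n_slang = sum(t in SLANG_WORDS for t in tokens)
    return {
        "post_length": len(tokens),               # length in word tokens (assumed)
        "n_negative_words": n_neg,
        "n_internet_slang_words": n_slang,
        # Add-one smoothing keeps the ratio defined when a post has no negative words
        "pos_neg_ratio": (n_pos + 1) / (n_neg + 1),
    }
```

Feature dictionaries like this one would then be fed to a classifier; the strength features and name mentions from the list above would need additional resources that are not described on the poster.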
Initial Post (P1) Responding Reply (R1) 1st Self-Reply (P2)
M-th Self-Reply (Pn) Responding Reply (Rm)
Sentiment Change Indicator: The difference between the average sentiment of the originators’ self replies and the initial sentiment of the thread originator. The more positive the sentiment of replies from others, the more positive the originator became.
Classifiers: Naïve Bayesian, Logistic Reg, Random Forest, One-Class SVM, Two-Class SVM, Ensemble Classifier
Sentiment Analysis:
a multi-stage process.
from a training set of 298 randomly- selected posts manually assigned to positive (204) or negative (94) sentiment
classifiers.
correct classification rate), sentiment level was established for all 468,000 posts (68.8% Positive).
Reply sentiment on sentiment change of thread initiators.
have positive sentiment change.
difference between the average sentiment of the originators’ self replies and the initial sentiment of the thread originator.
time between initial post and first follow-up response by the thread initiator.
reaction to the positive sentiment posts from the community.
The more positive the sentiment of replies from others, the more positive the originator became.
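The sentiment change indicator described above is simple arithmetic, sketched here under the assumption that each post already has a numeric sentiment score (for example, the classifier's probability of positive sentiment). The function name `sentiment_change` and the choice to return None for threads without self-replies are illustrative.

```python
def sentiment_change(initial_sentiment, self_reply_sentiments):
    """Average sentiment of the originator's self-replies minus the initial post sentiment.

    Returns None when the originator never replied in the thread.
    """
    if not self_reply_sentiments:
        return None
    average = sum(self_reply_sentiments) / len(self_reply_sentiments)
    return average - initial_sentiment
```

A positive value means the originator's later posts were scored as more positive than the post that started the thread.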
Topic Model Analysis:
using Modified Latent Dirichlet Allocation (LDA-VEM), which assigns each initiating post the probabilities of belonging to each topic.
topic.
30 topics as being reasonable for both forums.
identified in an initial analysis of the posts were subsequently converted to single words (e.g. “breast cancer” to “breastcancer”) to retain their meaning.
(i.e., stemming).
very seldom (<5 posts) were removed prior to analysis.
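Two of the preprocessing steps named above, recoding key phrases into single tokens and dropping words that appear in fewer than 5 posts, might look like the sketch below. Stemming and the LDA fit itself are left out. The function names and the lowercase, substring-based phrase matching are assumptions for illustration.

```python
from collections import Counter


def recode_phrases(text, phrases):
    """Join each multi-word key phrase into one token, e.g. 'breast cancer' -> 'breastcancer'."""
    text = text.lower()
    for phrase in phrases:
        text = text.replace(phrase, phrase.replace(" ", ""))
    return text


def drop_rare_words(docs, min_posts=5):
    """Remove words that occur in fewer than min_posts posts.

    docs is a list of token lists, one per post.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per post
    return [[t for t in tokens if doc_freq[t] >= min_posts] for tokens in docs]
```

Recoding phrases before tokenizing keeps terms such as "breastcancer" from being split into two unrelated words when topics are estimated.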
Breast Cancer Discussion Forum Colorectal Cancer Discussion Forum
(and associated 95% confidence intervals) vs main post topic for CSN breast and colorectal cancer discussion boards
scores indicate that community responses have a positive effect
initiators.
scores could indicate either that community response has little impact on the initiator’s emotions or (more likely) that the initial post sentiment was high to begin with.
initial sentiment by topics for the breast and colorectal cancer discussion board.
mean for each topic with the area representing 95% confidence.
initial post sentiment tend to have lower average sentiment change scores.
and treatment side-effect issues initiate with very low (most negative) sentiment and have highest sentiment change.
sentiment than colon cancer posts while sentiment change tends to be higher.
Thread structure: Initial Post (P1); 1st Self-Reply (P2); M-th Self-Reply (Pn); Responding Replies (R1-Rk) to Initial Post; Responding Replies (R1-Rk) to Self-Reply
Sentiment workflow: Training Set (N=298); Feature Extraction; Classifier Calibration & Selection; Classification of Corpus; Sentiment Change Analysis
Topic workflow: Initial Post Extraction (Breast & Colorectal Cancer Forums); Initial Post Key Words; Key Phrase Recoding; Common & Unique Word Removal; LDA-VEM Analysis – Topic Identification; Initial Post Topics & Likelihood
Cancer Survivors Network (CSN) Discussion Board Posts
Sentiment Change