Stream Characterization from Content Allen Gorin Human Language Technology Research U.S. DoD, Fort Meade MD a.gorin@ieee.org

Collaborators Carey Priebe (JHU) John Grothendieck (BBN) Nash Borges Dave Marchette John Conroy Alan McCree Glen Coppersmith Youngser Park Rich Cox Alison Stevens Mike Decerbo Jerry Wright SCC for DIMACS 2 5/3/2010 2

Outline • Motivation • HLT Research Issues • Joint model of content in context • Experiments on speech using Switchboard • Experiments on text using Enron

Environmental Awareness Focus of Attention Peripheral glances

Environmental Awareness: Focus of Attention plus Peripheral ‘Vision’ Lower resolution and lossy compression Enables change and anomaly detection

Coping with Information Overload

Analytic Questions • Is the information environment stable? – describe environment – lossy compression • Did something change? – Where? What?

HLT Research Issues • Focus on stream statistics – Rather than on individual documents – E.g. Language Characterization (McCree) – Classifier output is biased and noisy (Grothendieck) – Piece-wise stationary segments (Wright) • Content has associated meta-data – Better living through content in context – Theory, simulations and experiments – with Priebe, Grothendieck, et al.

Experimental Corpora • Enron corpus of emails – 500K emails over 189 weeks from DoJ/CMU – 184 communicants – 32 topics as defined by LDC • Switchboard corpus of spoken dialogs – 2500 topical dialogs – between pairs of 500 speakers – speaker demographics

Joint model of content in context • Consider a set of communication events $M = \{ z_i = (u_i, v_i, t_i, x_i) \}$ • An event in $M$ is $z_i \in V \times V \times \mathbb{R}_+ \times \mathcal{X}$, with $\mathcal{X}$ the content space – representing (to, from, time, content) • A time window defines a graph with content-attributed edges • Attribution functions $h_V$ and $h_E$ to further color vertices and edges
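The event tuples and the time-window graph described above can be sketched in a few lines. This is a minimal illustration only: the names `Event` and `window_graph` are hypothetical, content is reduced to a plain topic label, and the attribution functions are left out.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    """One communication event z = (u, v, t, x); field names follow the slide."""
    u: str    # first vertex of the event
    v: str    # second vertex of the event
    t: float  # non-negative timestamp
    x: str    # content, simplified here to a topic label (an assumption)


def window_graph(events, t_start, t_end):
    """Build {(u, v): [content, ...]} from events with t_start <= t < t_end."""
    edges = defaultdict(list)
    for e in events:
        if t_start <= e.t < t_end:
            edges[(e.u, e.v)].append(e.x)
    return dict(edges)


# Toy data, invented for illustration.
events = [
    Event("alice", "bob", 0.5, "topic-a"),
    Event("bob", "carol", 1.2, "topic-b"),
    Event("alice", "bob", 1.7, "topic-b"),
    Event("carol", "alice", 3.1, "topic-a"),
]
graph = window_graph(events, 0.0, 2.0)
```

Sliding the window over time yields a time series of such graphs, each with content-attributed edges.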

Examples from Enron Corpus (high-dimensional and heterogeneous features)

SwitchBoard Communications Graph Vertex ~ speakers Edges ~ dialogs

Joint Model of Content and Context via Attributed Graphs • Edge attributes – Content-derived meta-data (a.k.a. meta-content) – E.g. topic id, ASR, turn-taking behavior • Vertex attributes – External meta-data about speaker – E.g. demographics such as age, gender, education, … – Graph-derived meta-data – E.g. vertex degree ~ willingness to communicate

Joint Model of Content and Context • Random Attributed Graph – Provides a joint model of content and context • In Switchboard – Content is an attribute of an edge (dialog) – Consider turn-taking behavior in the dialog – Context is an attribute of the vertices (speakers) – Consider age, education, gender of speakers • Joint model enables inference of – Unobserved demographic distribution – From observed turn-taking behavior

Models of Turn-Taking Behavior • Turn-taking behavior has predictive power – for speaker ID (Jones) – for speaker traits in meeting room data (Lakowski) – for social roles and networks (Pentland) • Joint model of vertex, edge attributes and graph – social correlates of turn-taking behavior – Grothendieck and Borges – experiment to exploit joint distribution – observed meta-content (turn-taking) – estimate unseen demographic distributions

Turn-taking Behavior Model derived from SAD (speech activity detection) • A = active • I = inactive

Semi-Markov Model of Turn-Taking Behavior
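A semi-Markov view of a dialog needs a sequence of joint states together with how long each state is held. The sketch below shows one way to get both from two per-frame SAD tracks. It assumes each track is a string of 'A'/'I' labels of equal length; the function names and the toy tracks are illustrative and do not come from the slides.

```python
from itertools import groupby


def joint_states(sad_a, sad_b):
    """Pair per-frame SAD labels ('A' or 'I') of two speakers into joint states."""
    if len(sad_a) != len(sad_b):
        raise ValueError("SAD tracks must have equal length")
    return [a + b for a, b in zip(sad_a, sad_b)]


def segments(states):
    """Run-length encode a state sequence into (state, duration_in_frames) pairs."""
    return [(state, len(list(run))) for state, run in groupby(states)]


# Toy SAD tracks, invented for illustration.
sad_a = "AAAIIIAAII"
sad_b = "IIAAIIIIAA"
segs = segments(joint_states(sad_a, sad_b))
# segs == [("AI", 2), ("AA", 1), ("IA", 1), ("II", 2), ("AI", 2), ("IA", 2)]
```

The state-to-state transitions and the per-state duration samples in `segs` are the raw material from which transition probabilities and dwell-time distributions could be estimated.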

Latent Classes of Turn-Taking Behavior • Train turn-taking model from Switchboard corpus • First-order partition via divisive clustering – E.g., Style 0 has more and longer II (both silent) – E.g., Style 1 has more and longer AA (both active) • Classify each dialog as style 0 or 1 • Edge attribute (meta-content) • Classify each speaker as having style 0 or 1 • Vertex attribute induced from edge attributes
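The slide does not say how a speaker's style is induced from the styles of that speaker's dialogs. One simple possibility, shown here purely as an assumption, is a majority vote over incident edges, with ties broken toward style 0.

```python
from collections import Counter, defaultdict


def speaker_styles(dialogs):
    """Induce a vertex attribute from edge attributes by majority vote.

    dialogs: iterable of (speaker_1, speaker_2, style) with style in {0, 1}.
    Returns {speaker: style}; ties go to style 0 (an arbitrary choice).
    """
    votes = defaultdict(Counter)
    for s1, s2, style in dialogs:
        votes[s1][style] += 1
        votes[s2][style] += 1
    return {spk: int(count[1] > count[0]) for spk, count in votes.items()}


# Toy dialogs, invented for illustration.
dialogs = [("s1", "s2", 0), ("s1", "s3", 0), ("s2", "s3", 1), ("s3", "s4", 1)]
styles = speaker_styles(dialogs)
# styles == {"s1": 0, "s2": 0, "s3": 1, "s4": 1}
```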

Enriching vertex attributes with edge meta-content and graph meta-data • X = external meta-data on speaker v • Y = conversation turn-taking style • T(Y) = turn-taking style of speaker v • #V = number of conversations including speaker v [Figure: vertex V with external attributes X1, X2, X3, incident-edge styles Y1, Y2, Y3, and derived attributes T(Y) and #V]

Experimental Evaluation • E.g., overall ratio of male:female is 1:1 – speakers with TT style 0 have ratio 2:1 • Have joint distribution of content and context – exploit observed content (turn-taking behavior) – to estimate unobserved context (demographic mix) • Experiment: create speaker sets with mixture proportion v of style 0, for v in [0,1] • Result: across all mixtures v of styles, – predict proportions of age, education, gender, … – yields RMS error ~ 0.1
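One way to read the experiment: if a speaker set contains a fraction v of style-0 speakers, the predicted share of any demographic trait is the v-weighted mix of the two per-style shares. The sketch below works through that arithmetic. It is not necessarily the authors' estimator. The style-0 value comes from the 2:1 ratio quoted above; the style-1 value is a hypothetical placeholder.

```python
import math


def predicted_proportion(v, p_style0, p_style1):
    """Mix two per-style trait proportions with weight v on style 0."""
    if not 0.0 <= v <= 1.0:
        raise ValueError("v must lie in [0, 1]")
    return v * p_style0 + (1.0 - v) * p_style1


def rms_error(predicted, observed):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((a - b) ** 2 for a, b in zip(predicted, observed))
    return math.sqrt(squared / len(predicted))


P_MALE_STYLE0 = 2 / 3  # from the 2:1 male:female ratio quoted for style 0
P_MALE_STYLE1 = 1 / 3  # hypothetical value, not given in the source

mixtures = (0.0, 0.25, 0.5, 0.75, 1.0)
predictions = [predicted_proportion(v, P_MALE_STYLE0, P_MALE_STYLE1) for v in mixtures]
```

With observed proportions for each constructed speaker set in hand, `rms_error(predictions, observed)` would give a figure comparable to the RMS error reported on the slide.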

Classic Problems in DSP • Estimate characteristic parameters – Oppenheim (1975) • To detect a signal in background noise – Van Trees (1968) • Motivates initial focus on change/anomaly detection

Better Living through Content in Context • Information Exploitation = statistical inference • Better = more powerful statistical test – for change/anomaly detection • Some results to date – Theorem that joint can be more powerful – Simulation experiments – Proof-of-concept experiment on Enron Corpus

Time Series of Attributed Graphs Generated from observations of some random attributed graph?

Change detection in a time series of graphs [Figure panels: Homogeneous; Anomalous Chatter Group]

Detecting ‘Signal’ in ‘Noise’ - models and theory • $G$ is a probability distribution over attributed graphs [Figure: noise $G_N(t)$, signal $G_S(t)$, and their combination $G_S(t) + G_N(t)$]

Random Attributed Graphs • Let’s work through an example with a very simple model of content and context • Existence of an edge between two vertices is IID Bernoulli with probability p • Content topic (on each edge) is IID Bernoulli with probability θ • Change detection via testing candidate anomaly (alternative) versus history (null)

Null Hypothesis (noise): an attributed Erdős-Rényi graph • Random graph $ERC(N, p, \theta)$ • $N$ = # vertices in the graph • $p$ = probability of an edge • Each edge labeled with topic 0 or 1, with $\theta$ = probability of topic 1
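Drawing one graph from the null model is a pair of Bernoulli trials per vertex pair: one for the edge, one for its topic. The sketch below assumes an undirected graph without self-loops; the name `sample_erc` and the dictionary representation are illustrative choices.

```python
import random
from itertools import combinations


def sample_erc(n, p, theta, rng):
    """Sample an attributed Erdos-Renyi graph as {(i, j): topic} with i < j.

    Each of the n*(n-1)/2 vertex pairs gets an edge with probability p;
    each edge is labeled topic 1 with probability theta, else topic 0.
    """
    graph = {}
    for i, j in combinations(range(n), 2):
        if rng.random() < p:
            graph[(i, j)] = 1 if rng.random() < theta else 0
    return graph


rng = random.Random(0)           # fixed seed for repeatability
g = sample_erc(50, 0.5, 0.5, rng)
num_edges = len(g)               # number of edges in the draw
num_topic1 = sum(g.values())     # number of edges labeled topic 1
```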

Alternative Hypothesis (noise + signal): an ERC subgraph with different parameters • Random graph $K(N, p, \theta, M, q, \theta')$ • $N$ = # vertices in whole graph • $p$ = prob(edge) in kidney • $\theta$ = topic parameter in kidney • $M$ = # vertices in egg • $q$ = prob(edge) in egg • $\theta'$ = topic parameter in egg
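A draw from the alternative differs from the null only on pairs inside the egg. The sketch below assumes the egg is vertices 0..M-1 and that egg-to-kidney pairs use the kidney parameters; the slide does not state either detail, and `sample_kidney_egg` is a hypothetical name.

```python
import random
from itertools import combinations


def sample_kidney_egg(n, p, theta, m, q, theta_egg, rng):
    """Sample {(i, j): topic} where the first m vertices form the egg.

    Pairs with both endpoints in the egg use (q, theta_egg);
    all other pairs use the kidney parameters (p, theta).
    """
    graph = {}
    for i, j in combinations(range(n), 2):
        both_in_egg = j < m  # i < j, so j < m puts both endpoints in the egg
        edge_p, topic_p = (q, theta_egg) if both_in_egg else (p, theta)
        if rng.random() < edge_p:
            graph[(i, j)] = 1 if rng.random() < topic_p else 0
    return graph


rng = random.Random(0)
g = sample_kidney_egg(n=50, p=0.5, theta=0.5, m=10, q=0.9, theta_egg=0.9, rng=rng)
egg_edges = {pair: topic for pair, topic in g.items() if pair[1] < 10}
```

Setting `m=0` recovers the null model, since no pair then lies inside the egg.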

Theorem A statistical test based on fusion of externals and content can be more powerful than a test based on externals alone or content alone. (Grothendieck and Priebe)

Proof by Construction • $T_G$ = # of graph edges • $T_C$ = # of graph edges attributed with topic 1 • $T = 0.5\,T_G + 0.5\,T_C$ • Test for change from homogeneous null graph: – Power of test based upon $T_G$ is $\beta_G$ – Power of test based upon $T_C$ is $\beta_C$ – Power of test based upon $T$ is $\beta$ • For tests with false alarm rate $\alpha = 0.05$, – gray-scale plot of power difference $\Delta = \beta - \max(\beta_G, \beta_C)$
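The three powers can be approximated by simulation, which is a different route from the analytic construction on the slide. The sketch below assumes a one-sided test that rejects for large values, critical values taken from simulated null draws, the egg placed on vertices 0..m-1, and egg-to-kidney pairs following the kidney parameters. All function names and the parameter values in the example call are illustrative.

```python
import math
import random
from itertools import combinations


def sample_stats(n, p, theta, m, q, theta_egg, rng):
    """Return (T_G, T_C) for one kidney-egg draw; m = 0 gives the null graph."""
    t_g = t_c = 0
    for i, j in combinations(range(n), 2):
        edge_p, topic_p = (q, theta_egg) if j < m else (p, theta)
        if rng.random() < edge_p:
            t_g += 1
            if rng.random() < topic_p:
                t_c += 1
    return t_g, t_c


def critical_value(null_values, alpha):
    """Threshold c such that at most a fraction alpha of null values exceed c."""
    ordered = sorted(null_values)
    index = max(0, math.ceil((1 - alpha) * len(ordered)) - 1)
    return ordered[index]


def estimate_powers(n, p, theta, m, q, theta_egg, trials=300, alpha=0.05, seed=0):
    """Monte Carlo power of tests based on T_G, T_C and T = 0.5*T_G + 0.5*T_C."""
    rng = random.Random(seed)
    null = [sample_stats(n, p, theta, 0, q, theta_egg, rng) for _ in range(trials)]
    alt = [sample_stats(n, p, theta, m, q, theta_egg, rng) for _ in range(trials)]
    statistics = {
        "graph": lambda t_g, t_c: t_g,
        "content": lambda t_g, t_c: t_c,
        "fused": lambda t_g, t_c: 0.5 * t_g + 0.5 * t_c,
    }
    powers = {}
    for name, stat in statistics.items():
        c = critical_value([stat(*s) for s in null], alpha)
        powers[name] = sum(stat(*s) > c for s in alt) / trials
    return powers


powers = estimate_powers(n=30, p=0.5, theta=0.5, m=10, q=0.9, theta_egg=0.9)
delta = powers["fused"] - max(powers["graph"], powers["content"])
```

Sweeping `q` and `theta_egg` over a grid and recording `delta` at each point would produce a simulated counterpart of the gray-scale plot; the estimates carry Monte Carlo noise that shrinks as `trials` grows.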

Power Difference: $\Delta = \beta - \max(\beta_C, \beta_G)$ • $\Delta(\theta', q)$ depends on the parameters of the anomalous chatter group • $p = 0.5$, $\theta = 0.5$ • $q$ = subgraph connectivity • $\theta'$ = subgraph topic • Grayscale = $\Delta(\theta', q)$ [Figure: gray-scale plot of $\Delta$ with axes $\theta'$ and $q$]

Detecting ‘Signal’ in Empirical ‘Noise’ [Figure: $G_N(t)$ from Enron data; $G_S(t)$ from a model; combined $G_S(t) + G_N(t)$]
