
slide-1
SLIDE 1

Social Network Analysis (and More) in Multimedia Indexing: Making Sense of People in Multiparty Recordings

Alessandro Vinciarelli IDIAP Research Institute - CP592 Martigny (Switzerland) e-mail: vincia@idiap.ch

slide-12
SLIDE 12

Outline

  • Part I - Introduction
  • Making sense of people?
  • A one-slide introduction to Social Network Analysis.
  • From SNA to Multimedia Indexing.
  • Part II - Applications
  • The role recognition problem.
  • The story segmentation problem.
  • Part III - What’s Next?
  • Towards Social Signal Processing?
  • The social status recognition problem.
  • Conclusions.

Slide 2 of 38

slide-13
SLIDE 13

Part I Introduction

Slide 3 of 38

slide-14
SLIDE 14

Making Sense of People (I)

One of our most common activities is to make sense of people, i.e. to understand, predict and recall the behavior of persons we know little or even nothing about.

Slide 4 of 38

slide-15
SLIDE 15

Making Sense of People (II)

The domain studying the way we make sense of people is called Social Cognition and relies on two major assumptions:

  • Social Cognition is a form of categorical thinking, i.e. we tend to classify others into predefined categories or stereotypes.
  • Social Cognition is thinking about relationships, i.e. we make sense of people through the relationships they have with others.

Technology has learnt from neurology (neural networks), genetics (genetic algorithms), physiology (speech processing), etc. Why not learn from Social Cognition?

Slide 5 of 38

slide-16
SLIDE 16

What is Social Network Analysis?

In very simple terms, Social Networks are graphs where each node corresponds to an individual and each link corresponds to a relationship. Social Network Analysis (SNA) is a corpus of mathematical techniques, mostly based on graph theory, that extract quantitative measures about social relationships:

  • how central a person is.
  • how close two or more individuals are to each other.
  • how many social groups are present.
  • who belongs to which social group.
  • etc.
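As a minimal sketch of the kind of measures listed above, the following Python fragment computes degree centrality, shortest-path distances and groups on a toy graph. The graph, the names and the choice of connected components as "social groups" are illustrative assumptions, not the measures used in the talk.

```python
from collections import deque

# Toy network (hypothetical): each person maps to the set of people they are linked to.
graph = {
    "anna": {"bob", "carl"},
    "bob": {"anna", "carl"},
    "carl": {"anna", "bob"},
    "dave": {"eve"},
    "eve": {"dave"},
}

def degree_centrality(graph):
    """How central a person is: fraction of the other nodes they are linked to."""
    n = len(graph)
    if n < 2:
        return {v: 0.0 for v in graph}
    return {v: len(neighbours) / (n - 1) for v, neighbours in graph.items()}

def distances_from(graph, source):
    """How close individuals are: shortest-path length from source (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def groups(graph):
    """Groups approximated as connected components (a simplifying assumption)."""
    seen, result = set(), []
    for v in graph:
        if v not in seen:
            component = set(distances_from(graph, v))
            seen |= component
            result.append(component)
    return result
```

On the toy graph, `groups(graph)` separates the three mutually linked people from the isolated pair, which answers both "how many groups" and "who belongs to which group".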

Slide 6 of 38

slide-17
SLIDE 17

From SNA to Multimedia Indexing

[Figure: processing chain in which Signal Processing and Social Network Analysis yield a feature vector x, which Machine Learning maps to a class C.]

If a Social Network is extracted from the signal, then each individual can be represented with a vector of social features. The vector can be mapped into socially relevant high level information. This requires two main operations:

  • Automatic extraction of Social Networks from data.
  • Machine Learning techniques for people classification.

Slide 7 of 38

slide-18
SLIDE 18

Part II.1 Role Recognition

Slide 8 of 38

slide-19
SLIDE 19

The Role Recognition Problem

  • The role recognition problem consists in automatically assigning each individual a role $r$ belonging to a predefined set $R = \{r_1, \ldots, r_{|R|}\}$.
  • The experiments have been performed over corpora of radio programs and the roles are:
      • Anchorman (AM)
      • Second Anchorman (SA)
      • Guest (GT)
      • Interview Participant (IP)
      • Abstract (AB)
      • Meteo (MT)

Slide 9 of 38

slide-20
SLIDE 20

A Role Recognition Approach

[Figure: block diagram with the blocks Speaker Clustering, Social Network Extraction, Social Network Analysis, Speaker Duration Extraction and Speaker Duration Analysis; a speaker sequence (spk1, spk5, ..., spk18, spk1) is mapped to a role sequence (anchorman, guest, ..., anchorman, meteo).]

  • The first step of the process is the application of an unsupervised speaker clustering approach.
  • The segmentation resulting from the first step is used to extract information about:
      • the pattern of social relationships
      • the duration distribution of different speakers
  • The two information sources are then combined into a single classification approach.

Slide 10 of 38

slide-21
SLIDE 21

Social Network Extraction (I)

[Figure: speaker segmentation of a recording along the time axis, t (sec) from 50 to 550, comparing groundtruth, raw segmentation and filtered segmentation; the speaker labels shown are sp2, sp5, sp10, sp12 and sp18.]

Speaker Clustering techniques enable one to split multiparty audio recordings into single speaker segments. The network can be extracted by connecting adjacent speakers.
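The idea of connecting adjacent speakers can be sketched as follows. This is only an illustration: the speaker sequence is made up, and weighting each link by the number of adjacent turns is an assumption that the slides do not state.

```python
from collections import Counter

def extract_network(speaker_sequence):
    """Link every pair of different speakers that appear in consecutive segments.

    Returns a Counter mapping an unordered speaker pair to the number of times
    the two speakers were adjacent (the weighting is an illustrative assumption).
    """
    edges = Counter()
    for left, right in zip(speaker_sequence, speaker_sequence[1:]):
        if left != right:  # two consecutive segments of one speaker add no link
            edges[frozenset((left, right))] += 1
    return edges

# Hypothetical output of the speaker clustering step, in temporal order.
sequence = ["sp2", "sp5", "sp2", "sp18", "sp2", "sp10", "sp12", "sp2"]
network = extract_network(sequence)
```

In this toy sequence sp2 ends up linked to every other speaker, the pattern one might expect from a speaker who manages the conversation.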

Slide 11 of 38

slide-22
SLIDE 22

Social Network Extraction (II)

Speaker clustering is not a perfect process, so the resulting network is noisy, i.e. it contains spurious individuals and spurious relationships.

Slide 12 of 38

slide-23
SLIDE 23

Statistical Foundations (I)

The role recognition problem can be thought of as finding the vector $\vec{r}^*$:

$$\vec{r}^* = \arg\max_{\vec{r} \in R^G} p(\vec{r} \mid Y) \qquad (1)$$

where

  • $R$ is the set of predefined roles;
  • $G$ is the number of speakers $a_i$;
  • $\vec{r} = (r_1, \ldots, r_G)$ is the vector of the speaker roles;
  • $Y = \{\vec{y}_1, \ldots, \vec{y}_G\}$ is the set of the vectors representing the speakers;
  • $\vec{y}_i = (\tau_i, \vec{x}_i)$, where $\tau_i$ is the percentage of time during which speaker $a_i$ talks and $\vec{x}_i$ is the vector of social features of $a_i$.
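A minimal sketch of how the speaking-time component of each speaker vector could be computed from a segmentation. The segment format and the normalization by total speech time are assumptions; the result is a fraction in [0, 1], to be multiplied by 100 for a percentage.

```python
def talk_fractions(segments):
    """Fraction of the total speech time taken by each speaker.

    segments: list of (speaker, start, end) tuples, times in seconds.
    """
    total = sum(end - start for _, start, end in segments)
    fractions = {}
    for speaker, start, end in segments:
        fractions[speaker] = fractions.get(speaker, 0.0) + (end - start) / total
    return fractions

# Hypothetical segmentation of a 100-second recording.
segments = [("a1", 0, 30), ("a2", 30, 50), ("a1", 50, 100)]
tau = talk_fractions(segments)
```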

Slide 13 of 38

slide-24
SLIDE 24

Statistical Foundations (II)

By applying Bayes' theorem and taking into account that $Y$ is constant, the problem can be formulated equivalently as:

$$\vec{r}^* = \arg\max_{\vec{r} \in R^G} p(Y \mid \vec{r})\, p(\vec{r}) \qquad (2)$$

We assume that the roles of the different speakers are statistically independent:

$$\vec{r}^* = \arg\max_{\vec{r} \in R^G} \prod_{i=1}^{G} p(\vec{y}_i \mid r_i)\, p(r_i) \qquad (3)$$

We further assume that $\tau_i$ and $\vec{x}_i$ are statistically independent given the role $r_i$:

$$\vec{r}^* = \arg\max_{\vec{r} \in R^G} \prod_{i=1}^{G} p(\tau_i \mid r_i)\, p(\vec{x}_i \mid r_i)\, p(r_i) \qquad (4)$$
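Because Equation (4) is a product of per-speaker terms, the maximization can be carried out one speaker at a time. The sketch below illustrates this under loud assumptions: the two roles, the Gaussian models, their parameters and the use of a single scalar social feature are all hypothetical and are not taken from the talk.

```python
import math

def log_gaussian(value, mean, std):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * ((value - mean) / std) ** 2 - math.log(std * math.sqrt(2 * math.pi))

# Hypothetical per-role models: prior p(r), and (mean, std) for p(tau | r) and p(x | r).
ROLE_MODELS = {
    "AM": {"prior": 0.3, "tau": (0.40, 0.10), "x": (0.90, 0.20)},
    "GT": {"prior": 0.7, "tau": (0.10, 0.10), "x": (0.30, 0.20)},
}

def recognize_roles(observations, models=ROLE_MODELS):
    """Pick, for each speaker, the role maximizing p(tau|r) p(x|r) p(r) in the log domain.

    observations: list of (tau, x) pairs, one per speaker; x is a scalar here
    for simplicity, whereas the slides use a feature vector.
    """
    roles = []
    for tau, x in observations:
        def score(role):
            m = models[role]
            return (log_gaussian(tau, *m["tau"])
                    + log_gaussian(x, *m["x"])
                    + math.log(m["prior"]))
        roles.append(max(models, key=score))
    return roles
```

Working with log probabilities avoids numerical underflow when many factors are multiplied, and it does not change the arg max.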

Slide 14 of 38

slide-25
SLIDE 25

The Data

The experiments have been performed over two corpora of radio programs. The first (called C1) contains 96 news bulletins for a total of 19 hours and 56 minutes of material; the second (called C2) contains 26 talk-shows for a total of 26 hours of material.

Corpus  AM     SA     GT     IP    AB    MT
C1      41.2%  5.5%   34.8%  4.0%  7.1%  6.3%
C2      17.3%  10.3%  64.9%  0.0%  4.0%  1.7%

The table reports the percentage of data each role accounts for.

Slide 15 of 38

slide-26
SLIDE 26

Results

The results are reported in terms of accuracy, i.e. the percentage of time correctly labeled in terms of role.

Corpus  all   AM    SA    GT    IP    AB    MT
C1      81.1  94.9  1.0   95.8  0.0   58.9  73.4
C2      81.3  70.2  88.3  89.8  18.3  29.7  5.0

The experiments are performed over the whole corpus using a leave-one-out approach.
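The accuracy measure described above, the percentage of time labeled with the correct role, can be sketched as follows. The segment format and the example values are assumptions made for illustration only.

```python
def role_accuracy(segments):
    """Percentage of time whose predicted role matches the true role.

    segments: list of (duration_in_seconds, true_role, predicted_role) tuples.
    """
    total = sum(duration for duration, _, _ in segments)
    correct = sum(duration for duration, true, predicted in segments if true == predicted)
    return 100.0 * correct / total

# Hypothetical labeled recording: the last, short segment is misclassified.
example = [(60, "AM", "AM"), (30, "GT", "GT"), (10, "SA", "AM")]
accuracy = role_accuracy(example)
```

Weighting by duration means that a misclassified speaker who talks briefly costs little, which is consistent with a high overall figure coexisting with very low figures for rare roles.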

Slide 16 of 38

slide-27
SLIDE 27

Part II.2 Story Segmentation

Slide 17 of 38

slide-28
SLIDE 28

The Story Segmentation Problem

  • The identification of semantically coherent segments makes access to the content easier.
  • In the case of broadcast news, the segmentation is performed in terms of stories.

Slide 18 of 38

slide-29
SLIDE 29

The Story Segmentation Approach

[Figure: Speaker Clustering yields a speaker sequence (spk1, spk2, spk1, ..., spk7, spk1, spk5); SNA maps it to a vector sequence (x1, x2, x1, ..., x7, x1, x5); an HMM maps the vectors to a label sequence (h1, h2, h3, ..., hM−2, hM−1, hM).]

  • The main idea of the approach is that people involved in the same story are more likely to interact with each other, thus stories are expected to correspond to social groups.
  • SNA is used to extract feature vectors accounting for the social groups and HMMs are used to map the vector sequences into story sequences.

Slide 19 of 38

slide-30
SLIDE 30

Affiliation Network Extraction (I)

[Figure: bipartite graph connecting Actors to Events.]

An Affiliation Network is a bipartite graph, i.e. a graph with two kinds of nodes: actors and events. Links are allowed only between nodes of different kinds. There are two major approaches to defining the events:

  • Gatherings: meetings, parties, etc.
  • Proximity in time and/or space.

Slide 20 of 38

slide-31
SLIDE 31

Affiliation Network Extraction (II)

[Figure: a sequence of single-speaker segments s1, ..., s7, each attributed to one of the speakers a1, a2, a3 and lasting ∆t1, ..., ∆t7, is split into windows w1, ..., w4; the resulting vectors are x1 = (1,1,1,1), x2 = (0,0,1,1), x3 = (1,1,1,0).]

In the case of broadcast news, the events are defined using proximity in time. Each speaker $a_i$ is represented by a vector $\vec{y}_i = (y_{i1}, \ldots, y_{iN})$, where $N$ is the number of events and $y_{ij} = 1$ when $a_i$ talks during event $e_j$ (and 0 otherwise). The dimension of the vectors $\vec{y}_i$ is reduced using Principal Component Analysis (PCA), and the resulting vectors are the $\vec{x}_i$.
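A minimal sketch of how the binary affiliation vectors could be built when events are time windows. It assumes uniform, non-overlapping windows and a made-up segmentation; the PCA step is left out.

```python
def affiliation_vectors(segments, total_duration, n_windows):
    """Binary speaker-by-event vectors, with events defined as uniform time windows.

    segments: list of (speaker, start, end) tuples, times in seconds.
    Returns a dict mapping each speaker to a list of n_windows values, where
    entry j is 1 if the speaker talks at some point during window j, else 0.
    """
    width = total_duration / n_windows
    vectors = {speaker: [0] * n_windows for speaker, _, _ in segments}
    for speaker, start, end in segments:
        for j in range(n_windows):
            window_start, window_end = j * width, (j + 1) * width
            if start < window_end and end > window_start:  # segment overlaps window j
                vectors[speaker][j] = 1
    return vectors

# Hypothetical 40-second recording split into 4 windows of 10 seconds.
segments = [("a1", 0, 10), ("a3", 10, 25), ("a2", 25, 30), ("a1", 30, 40)]
vectors = affiliation_vectors(segments, total_duration=40, n_windows=4)
```

Speakers who are active in the same windows obtain similar vectors, which is what lets the later stages treat them as members of the same social group.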

Slide 21 of 38

slide-32
SLIDE 32

Statistical Foundations (I)

  • The goal of the story segmentation is to assign each vector $\vec{x}_i$ a label $h_i$, which can be either the number of a story or the anchorman role.
  • The story segmentation problem can be thought of as finding the sequence $H^* = (h_1, \ldots, h_M)$ that maximizes the a-posteriori probability $p(H \mid X)$. By Bayes' theorem, and because $X$ is constant, this is equivalent to:

$$H^* = \arg\max_{H \in \mathcal{H}} p(X \mid H)\, p(H) \qquad (5)$$

where

  • $\mathcal{H}$ is the set of all possible sequences $H$;
  • $X$ is the sequence of the vectors $\vec{x}_i$ observed over the segments;
  • $M$ is the number of single-speaker segments detected at the speaker clustering step.

Slide 22 of 38

slide-33
SLIDE 33

Statistical Foundations (II)

  • The term $p(X \mid H)$ is estimated using a fully connected Hidden Markov Model (HMM) with $S+1$ states, where $S$ is the maximum number of stories that can be observed and the "+1" state is for the anchorman role.
  • The term $p(H)$ is estimated using a tri-gram statistical language model:

$$p(H) = \prod_{k=3}^{M} p(h_k \mid h_{k-1}, h_{k-2}) \qquad (6)$$
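The tri-gram model of Equation (6) can be sketched as a count-based estimate over label sequences. The training sequences, the label names and the add-one smoothing are assumptions made for illustration; the slides do not say how the model was trained or smoothed.

```python
import math
from collections import Counter

def train_trigram(sequences):
    """Count label tri-grams and their two-label contexts in the training sequences."""
    trigrams, contexts = Counter(), Counter()
    for seq in sequences:
        for k in range(2, len(seq)):
            trigrams[(seq[k - 2], seq[k - 1], seq[k])] += 1
            contexts[(seq[k - 2], seq[k - 1])] += 1
    return trigrams, contexts

def log_p_sequence(seq, trigrams, contexts, n_labels, alpha=1.0):
    """Log of the product over k >= 3 of p(h_k | h_{k-1}, h_{k-2}).

    Probabilities use add-alpha smoothing (an assumption, so that unseen
    tri-grams do not get zero probability).
    """
    total = 0.0
    for k in range(2, len(seq)):
        context = (seq[k - 2], seq[k - 1])
        numerator = trigrams[context + (seq[k],)] + alpha
        denominator = contexts[context] + alpha * n_labels
        total += math.log(numerator / denominator)
    return total

# Hypothetical training data: "A" is the anchorman label, "1" and "2" are story numbers.
training = [["A", "1", "1", "A", "2", "2", "A"]] * 3
trigrams, contexts = train_trigram(training)
```

A label sequence that follows the pattern seen in training, such as an anchorman turn followed by two turns of the same story, scores higher than one that alternates in a way never observed.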

Slide 23 of 38

slide-34
SLIDE 34

The Story Segmentation Results

The table reports the purity as a function of the number of windows (win) and the fraction of variance retained.

       variance fraction
win    70%    80%    90%    100%
10     0.74   0.76   0.76   0.78
14     0.74   0.76   0.76   0.77
20     0.75   0.77   0.78   0.79

  • The purity is always around 0.75.
  • The average number of stories detected by the system is 16.5.
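The slides do not define purity. As a rough illustration only, the sketch below uses the common clustering definition (each detected story is credited with its most frequent reference story); the measure actually used in the experiments may differ, for example by weighting segments by duration.

```python
from collections import Counter

def purity(predicted, reference):
    """Common clustering purity (an assumed definition, not confirmed by the slides).

    predicted, reference: equal-length lists giving, for each segment, the
    detected story label and the groundtruth story label.
    """
    by_cluster = {}
    for p, r in zip(predicted, reference):
        by_cluster.setdefault(p, Counter())[r] += 1
    majority = sum(counts.most_common(1)[0][1] for counts in by_cluster.values())
    return majority / len(predicted)

# Hypothetical labels for six segments.
detected = [1, 1, 1, 2, 2, 2]
truth = ["a", "a", "b", "b", "b", "c"]
score = purity(detected, truth)
```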

Slide 24 of 38

slide-35
SLIDE 35

Part III What’s Next?

Slide 25 of 38

slide-36
SLIDE 36

Towards Social Signal Processing?

Human-human communication involves Social Signals, an array of nonverbal behaviors, mostly unconscious, which convey socially relevant information.

  • Vocal Social Signals are often unconscious and cannot be easily controlled: the information they carry is thus reliable, language independent and, to some extent, culture independent.
  • Vocal Social Signals can be analyzed through robust and established Signal Processing techniques.

This is the basis of a potential new domain: Social Signal Processing.

Slide 26 of 38

slide-37
SLIDE 37

The Social Status Recognition Problem

Social psychology tells us that how we say things is as important as what we say:

  • The delivery, i.e. the non-verbal characteristics of the way people talk, conveys important information about social status.
  • Social Signal literature suggests some characteristics that can be extracted through Signal Processing.
  • In the case of broadcast news, the main social statuses are journalist and non-journalist.

The goal of this work is to automatically recognize the status of each speaker using only the voice.

Slide 27 of 38

slide-38
SLIDE 38

A Social Status Recognition Approach

[Figure: block diagram with the blocks Signal Processing, Feature Extraction and Machine Learning, with Social Status as output.]

  • The voice of the speakers is analyzed using common Signal Processing techniques (in particular pitch tracking).
  • The pitch is used to distinguish between voiced and non-voiced segments.
  • Features extracted from pitch variations are fed to a linear classifier.

Slide 28 of 38

slide-39
SLIDE 39

Pitch Tracking

[Figure: "Spectrogram and Pitch Estimate"; axes: Time and Frequency.]

The pitch curve is the average between the frequency at the first peak of the autocorrelation and the first peak of the Fourier transform.
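A rough sketch of the averaging idea for a single analysis frame. It is a simplification under stated assumptions: the strongest peak within an assumed pitch range of 80 to 400 Hz stands in for the "first peak", a naive discrete Fourier transform is used, and the function name, frame length and sampling rate are hypothetical.

```python
import cmath
import math

def estimate_pitch(frame, sample_rate, f_min=80.0, f_max=400.0):
    """Average of an autocorrelation-based and a Fourier-based pitch estimate (Hz)."""
    n = len(frame)

    # Autocorrelation estimate: the lag with the strongest autocorrelation
    # inside the admissible pitch range gives the period.
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)

    def autocorrelation(lag):
        return sum(frame[i] * frame[i + lag] for i in range(n - lag))

    best_lag = max(range(lag_min, lag_max + 1), key=autocorrelation)
    f_autocorrelation = sample_rate / best_lag

    # Fourier estimate: the frequency bin with the largest magnitude
    # inside the same range (naive DFT, adequate for short frames).
    bin_min = math.ceil(f_min * n / sample_rate)
    bin_max = math.floor(f_max * n / sample_rate)

    def magnitude(k):
        return abs(sum(frame[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n)))

    best_bin = max(range(bin_min, bin_max + 1), key=magnitude)
    f_fourier = best_bin * sample_rate / n

    return 0.5 * (f_autocorrelation + f_fourier)

# Synthetic 50 ms frame: a pure 200 Hz tone sampled at 8 kHz.
sample_rate = 8000
frame = [math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(400)]
```

Frames for which no clear peak exists would be treated as non-voiced in a full system; that decision is not modeled here.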

Slide 29 of 38

slide-40
SLIDE 40

Feature Extraction (I)

Consider the set $V = \{v_1, \ldots, v_M\}$ containing the lengths of the voiced segments. The lengths are quantized because the pitch is measured at regular time steps, and $T = \{t_1, \ldots, t_N\}$ is the set of the represented lengths. The relative entropy of the distribution of the elements of $V$ is:

$$H_v = -\frac{\sum_{i=1}^{N} p(t_i) \log p(t_i)}{\log N} \qquad (7)$$

The same process can be applied to the non-voiced segments, leading to an entropy $H_s$. As a result, each intervention can be represented using a feature vector:

$$\vec{x} = (H_s, H_v) \qquad (8)$$
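Equations (7) and (8) can be sketched directly. The sketch assumes that p(t_i) is the fraction of segments whose length equals t_i, and it returns 0 when only one length is represented (where log N would be 0); the example lengths are made up.

```python
import math
from collections import Counter

def relative_entropy(lengths):
    """Entropy of the length distribution divided by log N (Equation 7).

    lengths: quantized segment lengths; N is the number of distinct lengths.
    """
    counts = Counter(lengths)
    n_values = len(counts)
    if n_values < 2:
        return 0.0  # a single represented length carries no uncertainty
    total = len(lengths)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(n_values)

def feature_vector(non_voiced_lengths, voiced_lengths):
    """Feature vector x = (Hs, Hv) of Equation (8)."""
    return (relative_entropy(non_voiced_lengths), relative_entropy(voiced_lengths))

# Hypothetical lengths, expressed in numbers of pitch-tracking steps.
x = feature_vector(non_voiced_lengths=[3, 3, 3], voiced_lengths=[1, 1, 2, 2])
```

Dividing by log N bounds the value between 0 and 1, so speakers whose segment lengths take different numbers of distinct values remain comparable.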

Slide 30 of 38

slide-41
SLIDE 41

Feature Extraction (II)

[Figure: scatter plot "Speaking Mode" in the (x1, x2) plane, with Journalists and Others marked separately.]

After projecting the vectors onto the principal components, the plot shows that, despite some overlap, journalists and non-journalists occupy different regions of the feature space.

Slide 31 of 38

slide-42
SLIDE 42

Classification

  • A Gaussian $\mathcal{N}(\vec{x} \mid \vec{\mu}_s, \Sigma_s)$ is obtained for each of the two classes and, given an unseen feature vector, the classification is performed as follows:

$$s(\vec{x}) = \arg\max_{s \in \{0,1\}} \mathcal{N}(\vec{x} \mid \vec{\mu}_s, \Sigma_s)\, p(s) \qquad (9)$$

    where $s(\vec{x})$ is the class assigned to $\vec{x}$, and $p(s)$ is the a-priori probability of class $s$. The Gaussian parameters are estimated by maximizing the likelihood.
  • The data set is split into two parts that are used alternately as training and test sets.
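The decision rule of Equation (9) can be sketched for two-dimensional feature vectors as follows. The training points, the equal priors and the assignment of labels 0 and 1 to the classes are hypothetical; only the maximum-likelihood estimates and the arg max follow the slide.

```python
import math

def fit_gaussian(points):
    """Maximum-likelihood mean and covariance of 2-D points (division by n)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), (sxx, sxy, syy)

def log_gaussian(x, mean, cov):
    """Log density of a 2-D Gaussian with covariance [[sxx, sxy], [sxy, syy]]."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    mahalanobis = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * mahalanobis - 0.5 * math.log(det) - math.log(2 * math.pi)

def classify(x, models, priors):
    """Class s maximizing N(x | mu_s, Sigma_s) p(s), computed in the log domain."""
    return max(models, key=lambda s: log_gaussian(x, *models[s]) + math.log(priors[s]))

# Hypothetical training vectors (Hs, Hv) for the two classes.
training = {
    0: [(0.50, 0.60), (0.55, 0.70), (0.60, 0.62), (0.52, 0.68)],
    1: [(0.80, 0.85), (0.85, 0.95), (0.90, 0.88), (0.82, 0.92)],
}
models = {s: fit_gaussian(points) for s, points in training.items()}
priors = {0: 0.5, 1: 0.5}
```

To mimic the two-part protocol, one would fit `models` on one half of the data, classify the other half, then swap the halves.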

Slide 32 of 38

slide-43
SLIDE 43

The data

[Figure: "Length Distribution"; x axis: length (seconds), y axis: number of occurrences.]

The data set consists of 686 single speaker segments, 313 journalists and 373 non-journalists. The number of individuals is 330 (234 non-journalists and 96 journalists).

Slide 33 of 38

slide-44
SLIDE 44

The results

[Figure: "Recognition Rate vs Time"; x axis: time (seconds), y axis: recognition rate (%).]

The plot shows the recognition rate as a function of the amount of time extracted from the test set segments.

Slide 34 of 38

slide-45
SLIDE 45

What About Humans?

  • A pool of 16 human assessors listened to 30 randomly selected clips; the total number of judgments is 480.
  • The clips are in French, but the mother tongues of the assessors are English (2 persons), Hindi (5 persons), Chinese (6 persons), Farsi (1 person), Serbian (1 person) and Arabic (1 person).

total   women   men
82.3%   88.0%   79.0%

The performance of the automatic system over the same clips is 73.3%, but the test set is too small to conclude that the difference is statistically significant.

Slide 35 of 38

slide-46
SLIDE 46

Conclusions

  • Social Cognition seems to be a reasonable source of inspiration for multimedia indexing algorithms.
  • Broadcast material offers a good compromise between spontaneous interactions and reasonable constraints.
  • In a comparison with speech recognition, Social Signals seem to play the role of the acoustic features and Social Networks seem to play the role of the language model.
  • So far we have used Social Signals and Social Networks separately, but in the future it may be worth combining the two.
  • We will address more ambitious problems, such as finding who supports whom in discussions, who the best communicators are, and when there are conflicts or cooperation.

Slide 36 of 38

slide-47
SLIDE 47

Thank You!

Slide 37 of 38