Social Data Toby Segaran Author, Programming Collective Intelligence - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Social Data

Toby Segaran Author, Programming Collective Intelligence Data Magnate, Metaweb Technologies

slide-2
SLIDE 2

Data mining?

“Sorting through data* to identify patterns and establish relationships”

* usually a lot of data

slide-3
SLIDE 3

Where and why? Methods and examples

slide-4
SLIDE 4

Where and why?

  • Targeted Advertising
  • Recommendations
  • Search Results
  • Group Discovery
  • Filtering of Documents
  • Theme Extraction
slide-5
SLIDE 5

Google ad

slide-6
SLIDE 6

Facebook ad

slide-7
SLIDE 7

This is strange...

  • Google just has text
  • Facebook knows more about me
  • But it’s taking a few cues...
slide-8
SLIDE 8

Status: “engaged”

slide-9
SLIDE 9

Where and why?

  • Targeted Advertising
  • Recommendations
  • Search Results
  • Group Discovery
  • Filtering of Documents
  • Theme Extraction
slide-10
SLIDE 10

Real Amazon Products

slide-11
SLIDE 11

Netflix Prize

slide-12
SLIDE 12

Strands Contest

slide-13
SLIDE 13

Custom News

slide-14
SLIDE 14

Custom News

slide-15
SLIDE 15

Custom News

slide-16
SLIDE 16

Where and why?

  • Targeted Advertising
  • Recommendations
  • Search Results
  • Group Discovery
  • Filtering of Documents
  • Theme Extraction
slide-17
SLIDE 17

Ranking algorithms

The now-incredibly-famous paper

slide-18
SLIDE 18

Ranking algorithms

slide-19
SLIDE 19

Learning behavior

  • Google begins tracking clicks in 2005
  • MSN search claims neural network
  • AOL Data Scandal

slide-20
SLIDE 20

Where and why?

  • Targeted Advertising
  • Recommendations
  • Search Results
  • Group Discovery
  • Filtering of Documents
  • Theme Extraction
slide-21
SLIDE 21

In Biology

slide-22
SLIDE 22

Page grouping

slide-23
SLIDE 23

News stories

slide-24
SLIDE 24

Where and why?

  • Targeted Advertising
  • Recommendations
  • Search Results
  • Group Discovery
  • Filtering of Documents
  • Theme Extraction
slide-25
SLIDE 25

The obvious: spam

SpamBayes

slide-26
SLIDE 26

Other email uses

slide-27
SLIDE 27

Web documents

“As you add information to Twine, it is automatically tagged so that you and others can find it more easily”

slide-28
SLIDE 28

Where and why?

  • Targeted Advertising
  • Recommendations
  • Search Results
  • Group Discovery
  • Filtering of Documents
  • Theme Extraction
slide-29
SLIDE 29

What is the buzz?

slide-30
SLIDE 30

Customer Community

slide-31
SLIDE 31

Where and why? Methods and examples

slide-32
SLIDE 32

Methods and Examples

  • Bayesian Filtering
  • Distance Metrics
  • Clustering
  • Decision Trees
  • Network Analysis
  • Feature Extraction
slide-33
SLIDE 33

Bayesian Filtering

slide-34
SLIDE 34

Bayesian Filtering

slide-35
SLIDE 35

Bayesian Filtering

slide-36
SLIDE 36

Bayesian Filtering

slide-37
SLIDE 37

Bayesian Filtering

slide-38
SLIDE 38

Bayesian Filtering

school work algorithm

slide-39
SLIDE 39

Bayesian Filtering

school work algorithm v1agra trades associate
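
The word lists on these slides suggest how a Bayesian filter separates mail by the words it contains. Below is a minimal naive Bayes sketch. It assumes that "school work algorithm" is typical of wanted mail and "v1agra trades associate" of spam. That split, the tiny training set, and the names `train` and `classify` are illustrative assumptions; this is not the scoring used by SpamBayes.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Count word occurrences per category and documents per category."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the category with the highest log posterior under naive Bayes."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        # Start from the prior: how common this category is overall.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing so an unseen word does not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training messages built from the words on the slide.
training = [
    ("school work algorithm", "good"),
    ("algorithm work", "good"),
    ("v1agra trades associate", "spam"),
    ("trades v1agra", "spam"),
]
word_counts, doc_counts = train(training)
print(classify("school algorithm", word_counts, doc_counts))
print(classify("v1agra associate", word_counts, doc_counts))
```

Working in log space avoids underflow when a message contains many words.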

slide-40
SLIDE 40

Craigslist personals

slide-41
SLIDE 41

Analysis

Five Cities

W4M Personal Ads

slide-42
SLIDE 42

Results

New York

Mets Lounges Offense Desires Musical Submissive Create Song Oral

Boston

Pink Sox Poetry Intellectually Punk Appreciation Exercise Winter Education

Chicago

Cubs Burbs Bears Girlie Insecure Cheat Importance Blunt Mouth

slide-43
SLIDE 43

Results

Los Angeles

Excellent Vegas Meaningful Star Lame Industry Heat Fitness Entertainment Latino

San Francisco

Tee Employment Picnic STD Tasting Hikes French .com Kayaking Cycling

slide-44
SLIDE 44

Methods and Examples

  • Bayesian Filtering
  • Distance Metrics
  • Clustering
  • Decision Trees
  • Network Analysis
  • Feature Extraction
slide-45
SLIDE 45

Preference distance

Sarah Marshall Leatherheads 3 3 2 3 1 5 2 5

slide-46
SLIDE 46

Preference distance

[Plot: both axes run from 1 to 5]

slide-47
SLIDE 47

Preference distance

[Plot: both axes run from 1 to 5; distances shown: 1 and 2.23]

slide-48
SLIDE 48

For recommendations

[Plot: both axes run from 1 to 5]

Prom Night: 5 (distance 1)
Prom Night: 2 (distance 2.23)
Prediction: ?

slide-49
SLIDE 49

For recommendations

[Plot: both axes run from 1 to 5]

Prom Night: 5
Prom Night: 2
Prediction: 4.1
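
The predicted 4.1 is consistent with a weighted average of the two neighbors' ratings, where each weight is 1/distance (distances 1 and 2.23 from the earlier slide). The slides do not state the weighting, so treat this as one plausible reading. The coordinates passed to `euclidean` are hypothetical, and the weighting assumes every distance is greater than zero.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two rating vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_rating(neighbors):
    """Weighted mean of ratings; neighbors is a list of (distance, rating)."""
    weights = [1.0 / distance for distance, _ in neighbors]
    total = sum(w * rating for w, (_, rating) in zip(weights, neighbors))
    return total / sum(weights)

# Hypothetical points one apart on one movie and two apart on the other:
# the distance is sqrt(5), which the slide shows as 2.23.
print(euclidean((3, 3), (4, 5)))

# Closer neighbor rated Prom Night 5, farther neighbor rated it 2.
print(round(predict_rating([(1.0, 5), (2.23, 2)]), 1))  # 4.1
```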

slide-50
SLIDE 50

Linguistic distance

The Six Degrees Hypothesis Experienced It Is When You Travel

slide-51
SLIDE 51

Linguistic distance

The Six Degrees Hypothesis Experienced It Is When You Travel

Six Degrees Hypothesis Experienced Travel

Six          3
Degrees      3
Hypothesis   1
Experienced  5
Travel       6

slide-52
SLIDE 52

Linguistic distance

“china” “kids” “music” “travel” “yahoo”
Gothamist 3 3 3
GigaOM 6 1 4 2
QuickOnlineTips 2 2 12
O’Reilly Radar 1 3 6 4

slide-53
SLIDE 53

Linguistic distance

“china” “kids” “music” “yahoo”
Gothamist 3 3
GigaOM 6 1 2
Quick Online Tips 2 2 12

Euclidean “as the crow flies”

= 12 (approx)
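
A short sketch of the idea: drop common words, count the rest, and take the Euclidean distance between two count vectors. The stop-word list and the two count vectors are hypothetical; they are not the rows from the slide, so the result is not expected to match the slide's figure.

```python
import math
from collections import Counter

# Hypothetical stop-word list, just enough for the example title.
STOP_WORDS = {"the", "it", "is", "when", "you"}

def word_counts(text):
    """Count the words in text, ignoring case and stop words."""
    return Counter(w for w in text.lower().split() if w not in STOP_WORDS)

def euclidean(counts_a, counts_b):
    """Distance between two word-count vectors; missing words count as 0."""
    vocab = set(counts_a) | set(counts_b)
    return math.sqrt(sum((counts_a[w] - counts_b[w]) ** 2 for w in vocab))

title = "The Six Degrees Hypothesis Experienced It Is When You Travel"
print(sorted(word_counts(title)))

# Hypothetical word counts for two blogs.
blog_a = Counter({"china": 3, "kids": 3})
blog_b = Counter({"china": 6, "music": 1})
print(euclidean(blog_a, blog_b))
```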

slide-54
SLIDE 54

Article/blog similarity

Valleywag - Huffington > Slashdot - Wired

slide-55
SLIDE 55

Methods and Examples

  • Bayesian Filtering
  • Distance Metrics
  • Clustering
  • Decision Trees
  • Network Analysis
  • Feature Extraction
slide-56
SLIDE 56

Hierarchical Clustering

[Plot: both axes run from 1 to 5]

slide-57
SLIDE 57

Hierarchical Clustering

[Plot: both axes run from 1 to 5]

slide-58
SLIDE 58

Hierarchical Clustering

[Plot: both axes run from 1 to 5]

slide-59
SLIDE 59

Hierarchical Clustering

[Plot: both axes run from 1 to 5]
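
The step-by-step plots can be read as agglomerative clustering: find the two closest items, merge them, and repeat until one cluster remains. The sketch below is one simple variant that represents each merged cluster by its size-weighted centroid. The points and the name `hcluster` are hypothetical, and the talk may have used a different linkage rule.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hcluster(points):
    """Repeatedly merge the two closest clusters; return the nested merge tree."""
    # Each cluster is (centroid, size, tree); a leaf's tree is its point index.
    clusters = [(tuple(p), 1, i) for i, p in enumerate(points)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = distance(clusters[i][0], clusters[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        (ca, na, ta), (cb, nb, tb) = clusters[i], clusters[j]
        # The merged cluster sits at the size-weighted average of the two.
        centroid = tuple((x * na + y * nb) / (na + nb) for x, y in zip(ca, cb))
        merged = (centroid, na + nb, (ta, tb))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][2]

# Hypothetical points on a 1-to-5 grid.
points = [(1, 1), (1.5, 1.5), (5, 5), (4.5, 5), (3, 1)]
print(hcluster(points))  # ((2, 3), (4, (0, 1)))
```

The nested tuples are the dendrogram: points 2 and 3 merge first, then 0 and 1, then point 4 joins the 0 and 1 pair.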

slide-60
SLIDE 60

Hierarchical Clustering

slide-61
SLIDE 61

Grouping bloggers

slide-62
SLIDE 62

Grouping bloggers

slide-63
SLIDE 63

Grouping bloggers

slide-64
SLIDE 64

Grouping articles

slide-65
SLIDE 65
slide-66
SLIDE 66
slide-67
SLIDE 67
slide-68
SLIDE 68

Methods and Examples

  • Bayesian Filtering
  • Distance Metrics
  • Clustering
  • Decision Trees
  • Network Analysis
  • Feature Extraction
slide-69
SLIDE 69

Decision Trees

slide-70
SLIDE 70

CART Algorithm

Brand      Type  Life (hrs)
Duracell   C     4
Energizer  C     5
Duracell   AA    2
Energizer  AA    2.2

From any dataset...

slide-71
SLIDE 71

CART Algorithm

Brand      Type  Life (hrs)
Duracell   C     4
Energizer  C     5
Duracell   AA    2
Energizer  AA    2.2

... find the best split ...

Type is C?
  Yes: Avg=4.5
  No:  Avg=2.1

slide-72
SLIDE 72

CART Algorithm

Brand      Type  Life (hrs)
Duracell   C     4
Energizer  C     5
Duracell   AA    2
Energizer  AA    2.2

... and repeat.

Type is C?
  Yes: Duracell?
    Yes: 4
    No:  5
  No: Duracell?
    Yes: 2
    No:  2.2
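
To make "find the best split" concrete, the sketch below tries every "feature equals value" test on the battery table and keeps the one whose two groups have the least spread around their own averages (sum of squared errors, the usual criterion for a regression tree). The function names are illustrative assumptions, not from the talk.

```python
rows = [
    {"brand": "Duracell", "type": "C", "life": 4.0},
    {"brand": "Energizer", "type": "C", "life": 5.0},
    {"brand": "Duracell", "type": "AA", "life": 2.0},
    {"brand": "Energizer", "type": "AA", "life": 2.2},
]

def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared differences from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows, features, target):
    """Return (score, feature, value, avg_yes, avg_no) for the best test."""
    best = None
    for feature in features:
        # Candidate values in order of first appearance; ties keep the first.
        for value in dict.fromkeys(r[feature] for r in rows):
            yes = [r[target] for r in rows if r[feature] == value]
            no = [r[target] for r in rows if r[feature] != value]
            if not yes or not no:
                continue
            score = sse(yes) + sse(no)
            if best is None or score < best[0]:
                best = (score, feature, value, mean(yes), mean(no))
    return best

score, feature, value, avg_yes, avg_no = best_split(rows, ["brand", "type"], "life")
print(feature, value, avg_yes, round(avg_no, 2))  # type C 4.5 2.1
```

Applying `best_split` again to each of the two groups gives the "and repeat" step.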

slide-73
SLIDE 73

Hot or Not

slide-74
SLIDE 74

Hot or Not

slide-75
SLIDE 75

Methods and Examples

  • Bayesian Filtering
  • Distance Metrics
  • Clustering
  • Decision Trees
  • Network Analysis
  • Feature Extraction
slide-76
SLIDE 76

A network

A B C D E F

slide-77
SLIDE 77

PageRank

A    B    C    D    E    F
1.0  1.0  1.0  1.0  1.0  1.0

slide-78
SLIDE 78

PageRank

A    B    C    D    E    F
1.0  1.0  1.0  1.0  1.0  1.0

D = 0.15 + 0.85*E/1 + 0.85*F/2 + 0.85*B/1 = 2.275
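
The update for D can be written as a small function: a page's new score is 0.15 plus 0.85 times the sum, over every page linking to it, of that page's score divided by its number of outgoing links. The function name and argument layout are illustrative assumptions; the link counts (E has 1 outlink, F has 2, B has 1) are read off the formula above.

```python
def pagerank_update(inbound, damping=0.85):
    """New score for one page.

    inbound is a list of (score, outlink_count) pairs, one per page
    that links to the page being updated.
    """
    return (1 - damping) + damping * sum(score / links for score, links in inbound)

# D is linked from E (1 outlink), F (2 outlinks) and B (1 outlink), all at 1.0.
d = pagerank_update([(1.0, 1), (1.0, 2), (1.0, 1)])
print(round(d, 3))  # 2.275
```

Repeating this update for every page, over several passes, gives the successive score tables on the following slides.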

slide-79
SLIDE 79

PageRank

A     B     C     D      E     F
0.58  0.58  1.0   2.275  1.0   0.15

slide-80
SLIDE 80

PageRank

A     B     C     D     E    F
0.58  0.58  2.08  1.56  0.3  0.15

slide-81
SLIDE 81

PageRank

A     B     C     D     E    F
1.03  1.03  1.48  1.56  0.3  0.15

slide-82
SLIDE 82

PageRank

A     B     C     D     E    F
0.78  0.78  1.48  1.34  0.3  0.15

slide-83
SLIDE 83

CI FOO participants

slide-84
SLIDE 84

Science papers

The paper attempts to provide an alternative method for measuring the importance of scientific papers based on the Google's PageRank. The method is a meaningful extension of the common integer counting of citations and is then experimented for bringing PageRank to the citation analysis in a large citation network. It offers a more integrated picture of the publications' influence in a specific field.

Bringing PageRank to the citation analysis

slide-85
SLIDE 85

Clustering coefficient

“How many of each person’s friends are friends with each other?”

slide-86
SLIDE 86

Clustering coefficient

A B C D E F Low clustering coefficient

slide-87
SLIDE 87

Clustering coefficient

A B C D E F High clustering coefficient “small world graph”
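
For one person, the clustering coefficient is the fraction of pairs of their friends who are also friends with each other. A minimal sketch, assuming an undirected graph stored as a dictionary of neighbor sets; the example graph is hypothetical and is not the one drawn on the slides.

```python
from itertools import combinations

def clustering_coefficient(graph, node):
    """Fraction of pairs of node's neighbors that are directly connected."""
    neighbors = graph[node]
    if len(neighbors) < 2:
        return 0.0  # fewer than two friends means there are no pairs to check
    pairs = list(combinations(sorted(neighbors), 2))
    linked = sum(1 for a, b in pairs if b in graph[a])
    return linked / len(pairs)

# Hypothetical friendship graph: A knows B, C and D; only B and C know each other.
graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
print(clustering_coefficient(graph, "A"))  # one linked pair out of three
print(clustering_coefficient(graph, "B"))  # 1.0
```

Averaging this value over all nodes gives a single figure for the whole network.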

slide-88
SLIDE 88
slide-89
SLIDE 89
slide-90
SLIDE 90
slide-91
SLIDE 91

Twitter!

slide-92
SLIDE 92

Twitter!

slide-93
SLIDE 93

Methods and Examples

  • Bayesian Filtering
  • Distance Metrics
  • Clustering
  • Decision Trees
  • Network Analysis
  • Feature Extraction
slide-94
SLIDE 94

Independent Features

slide-95
SLIDE 95

Message boards

slide-96
SLIDE 96

Message boards

slide-97
SLIDE 97

Matrix Factorization

Msg1 Msg2 Msg3 Msg4 Msg5
Gym 1 3 3 1
Calorie 2 4 1 3
Weigh 2 3 1 1
Carbs 1 1 2
Treadmill 3 2 2 2

Msg1 M2 M3 M4 M5
F1 1 2 3
F2 2 1 1 3
F3 1 2

F1 F2 F3
Gym 1 2
Calorie 2 1
Weigh 2 2 1
Carbs 1 3
Treadmill 1 2

Features Matrix Weight Matrix

x

Current Guess

slide-98
SLIDE 98

Matrix Factorization

Msg1 M2 M3 M4 M5
F1 1 2 3
F2 2 1 1 3
F3 1 2

F1 F2 F3
Gym 1 2
Calorie 2 1
Weigh 2 2 1
Carbs 1 3
Treadmill 1 2

Features Matrix Weight Matrix

x

Msg1 Msg2 Msg3 Msg4 Msg5
Gym 2 3
Calorie 2 1 1 3
Weigh 1 2
Carbs 3 2
Treadmill 1 2

Target Result

Msg1 Msg2 Msg3 Msg4 Msg5
Gym 1 3 3 1
Calorie 2 4 1 3
Weigh 2 3 1 1
Carbs 1 1 2
Treadmill 3 2 2 2

Current Guess

slide-99
SLIDE 99

Matrix Factorization

Msg1 M2 M3 M4 M5
F1 2 1
F2 2 1 3
F3 1 1

F1 F2 F3
Gym 1
Calorie 1 1
Weigh 2
Carbs 1
Treadmill 1

Features Matrix Weight Matrix

x

Msg1 Msg2 Msg3 Msg4 Msg5
Gym 2 3
Calorie 2 1 1 3
Weigh 1 2
Carbs 3 2
Treadmill 1 2

Target Result

Msg1 Msg2 Msg3 Msg4 Msg5
Gym 2 3
Calorie 2 1 1 3
Weigh 1 2
Carbs 3 2
Treadmill 1 2

Current Guess
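
The guess-and-improve loop on these slides (multiply two small matrices, compare with the target, adjust, repeat) can be sketched with multiplicative update rules, one common way to do non-negative matrix factorization. The slides do not say which update rule was used, so this is an assumption. The names `w`, `h` and `factorize`, the target matrix, and the choice of two features are all hypothetical.

```python
import random

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def cost(target, guess):
    """Sum of squared differences between the target and the current guess."""
    return sum((x - y) ** 2 for rt, rg in zip(target, guess) for x, y in zip(rt, rg))

def factorize(v, k, iterations=200, seed=0):
    """Find non-negative w (rows x k) and h (k x cols) with w x h close to v."""
    rng = random.Random(seed)
    rows, cols = len(v), len(v[0])
    w = [[rng.random() for _ in range(k)] for _ in range(rows)]
    h = [[rng.random() for _ in range(cols)] for _ in range(k)]
    eps = 1e-9  # guards against division by zero
    for _ in range(iterations):
        # h <- h * (w^T v) / (w^T w h), element by element
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(cols)]
             for i in range(k)]
        # w <- w * (v h^T) / (w h h^T), element by element
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(rows)]
    return w, h

# Hypothetical word-by-message count matrix (rows are words, columns are messages).
v = [
    [2, 0, 3, 0],
    [2, 1, 0, 3],
    [0, 1, 2, 0],
    [3, 0, 0, 2],
]
w, h = factorize(v, k=2)
print(round(cost(v, matmul(w, h)), 3))
```

Because the updates only multiply by non-negative ratios, every entry stays non-negative, which is what lets each feature be read as a group of words that tend to appear together.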

slide-100
SLIDE 100

Interpreting Features

Msg1 M2 M3 M4 M5
F1 2 1
F2 2 1 3
F3 1 1

F1 F2 F3
Gym 1
Calorie 1 1
Weigh 2
Carbs 1
Treadmill 1

Features Matrix Weight Matrix

Theme 1 Theme 2 Theme 3 Gym Calorie Weigh Treadmill Carbs Calorie Msg1 Msg2 Msg3 etc. Theme 1 Theme 2 Theme 3 Theme 3

slide-101
SLIDE 101

Diet & Body themes

Atkins Induction South Beach Carbs Chocolate Black Coffee Olive Broccoli Gym Weights Exercise Running Injured Cook Recipe Fried Home Money Organic Want Best Calories Weight Fats Protein Cholesterol

slide-102
SLIDE 102

Wikipedia people

she her after when father women
series television show which radio bbc
league major baseball season played with
olympics competed won summer medal athlete
university professor received science research born

slide-103
SLIDE 103

We’re just getting started...

slide-104
SLIDE 104

Homepage http://kiwitobes.com
Freebase http://freebase.com

slide-105
SLIDE 105

Questions?