Analyzing Big Data at Twitter (Web 2.0 Expo, 2010) - Kevin Weil, @kevinweil


SLIDE 1

Analyzing Big Data at Twitter

Web 2.0 Expo, 2010
Kevin Weil (@kevinweil)

SLIDE 2

Three Challenges

  • Collecting Data
  • Large-Scale Storage and Analysis
  • Rapid Learning over Big Data
SLIDE 3

My Background

  • Studied Mathematics and Physics at Harvard, Physics at Stanford

  • Tropos Networks (city-wide wireless): GBs of data
  • Cooliris (web media): Hadoop for analytics, TBs of data
  • Twitter: Hadoop, Pig, machine learning, visualization, social graph analysis, PBs of data

SLIDE 4

Three Challenges

  • Collecting Data
  • Large-Scale Storage and Analysis
  • Rapid Learning over Big Data
SLIDES 5-9

Data, Data Everywhere

  • You guys generate a lot of data
  • Anybody want to guess?
  • 12 TB/day (4+ PB/yr)
  • 20,000 CDs
  • 10 million floppy disks
  • 450 GB while I give this talk
SLIDES 10-11

Syslog?

  • Started with syslog-ng
  • As our volume grew, it didn’t scale
  • Resources overwhelmed
  • Lost data
SLIDE 12

Scribe

  • Surprise! FB had same problem, built and open-sourced Scribe

  • Log collection framework over Thrift
  • You “scribe” log lines, with categories
  • It does the rest
SLIDES 13-15

Scribe

  • Runs locally; reliable in network outage

  • Nodes only know downstream writer; hierarchical, scalable

  • Pluggable outputs

[Diagram: front-end (FE) servers -> local Scribe -> aggregator (Agg) nodes -> pluggable outputs (HDFS, files)]

SLIDE 16

Scribe at Twitter

  • Solved our problem, opened new vistas
  • Currently 40 different categories logged from JavaScript, Ruby, Scala, Java, etc

  • We improved logging, monitoring, behavior during failure conditions, writes to HDFS, etc

  • Continuing to work with FB to make it better

http://github.com/traviscrawford/scribe

SLIDE 17

Three Challenges

  • Collecting Data
  • Large-Scale Storage and Analysis
  • Rapid Learning over Big Data
SLIDES 18-21

How Do You Store 12 TB/day?

  • Single machine?
  • What’s hard drive write speed?
  • ~80 MB/s
  • 42 hours to write 12 TB
  • Uh oh.
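
A quick check of that figure, assuming a sustained ~80 MB/s write speed:

    12 TB ÷ 80 MB/s = (12 × 10^12 bytes) / (80 × 10^6 bytes/s) = 150,000 s ≈ 42 hours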
SLIDE 22

Where Do I Put 12 TB/day?

  • Need a cluster of machines
  • ... which adds new layers of complexity
SLIDES 23-24

Hadoop

  • Distributed file system
  • Automatic replication
  • Fault tolerance
  • Transparently read/write across multiple machines
  • MapReduce-based parallel computation
  • Key-value based computation interface allows for wide applicability

  • Fault tolerance, again
SLIDE 25

Hadoop

  • Open source: top-level Apache project
  • Scalable: Y! has a 4000 node cluster
  • Powerful: sorted 1TB random integers in 62 seconds
  • Easy packaging/install: free Cloudera RPMs
SLIDES 26-32

MapReduce Workflow

  • Challenge: how many tweets per user, given tweets table?

  • Input: key=row, value=tweet info
  • Map: output key=user_id, value=1
  • Shuffle: sort by user_id
  • Reduce: for each user_id, sum
  • Output: user_id, tweet count
  • With 2x machines, runs 2x faster

[Diagram: Inputs -> Map tasks -> Shuffle/Sort -> Reduce tasks -> Outputs]


SLIDES 33-34

Two Analysis Challenges

  • Compute mutual followings in Twitter’s interest graph
  • grep, awk? No way.
  • If data is in MySQL... self join on an n-billion row table?

  • n,000,000,000 x n,000,000,000 = ?
  • I don’t know either.
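
Jumping ahead to Pig (introduced later in this talk), here is a minimal sketch of how the mutual-following computation looks once the edges live in HDFS instead of MySQL. The 'follows' input and its (src, dst) schema are hypothetical, for illustration only:

    -- Hypothetical follow-edge data: one (follower, followee) pair per record.
    a = LOAD 'follows' AS (src:long, dst:long);
    b = LOAD 'follows' AS (src:long, dst:long);  -- load twice: a Pig join needs two aliases

    -- A mutual following exists when (u, v) appears in both directions.
    mutual = JOIN a BY (src, dst), b BY (dst, src);
    pairs  = FOREACH mutual GENERATE a::src AS user_a, a::dst AS user_b;
    uniq   = FILTER pairs BY user_a < user_b;    -- keep each unordered pair once

    STORE uniq INTO 'mutual_follows';

Hadoop turns the self-join into a parallel sort-and-merge across the cluster, so it scales with machines rather than blowing up like the n-billion-row MySQL join.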
SLIDE 35

Two Analysis Challenges

  • Large-scale grouping and counting
  • select count(*) from users? maybe.
  • select count(*) from tweets? uh...
  • Imagine joining these two.
  • And grouping.
  • And sorting.
SLIDES 36-37

Back to Hadoop

  • Didn’t we have a cluster of machines?
  • Hadoop makes it easy to distribute the calculation
  • Purpose-built for parallel calculation
  • Just a slight mindset adjustment
  • But a fun one!
SLIDE 38

Analysis at Scale

  • Now we’re rolling
  • Count all tweets: 20+ billion, 5 minutes
  • Parallel network calls to FlockDB to compute interest graph aggregates

  • Run PageRank across users and interest graph
SLIDE 39

But...

  • Analysis typically in Java
  • Single-input, two-stage data flow is rigid
  • Projections, filters: custom code
  • Joins lengthy, error-prone
  • n-stage jobs hard to manage
  • Data exploration requires compilation
SLIDE 40

Three Challenges

  • Collecting Data
  • Large-Scale Storage and Analysis
  • Rapid Learning over Big Data
SLIDE 41

Pig

  • High level language
  • Transformations on sets of records
  • Process data one step at a time

  • Easier than SQL?
SLIDE 42

Why Pig?

Because I bet you can read the following script

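The script image from the original slide is not reproduced in this extraction. As a stand-in, here is a small illustrative Pig script in the same spirit, computing the tweets-per-user count described in the MapReduce section; the 'tweets' input and its schema are hypothetical:

    -- Hypothetical input: one tweet per record, keyed by its author's user_id.
    tweets  = LOAD 'tweets' AS (user_id:long, tweet_id:long, text:chararray);

    by_user = GROUP tweets BY user_id;                  -- the "shuffle by user_id" step
    counts  = FOREACH by_user GENERATE group AS user_id,
                                       COUNT(tweets) AS tweet_count;  -- the "reduce" step
    ranked  = ORDER counts BY tweet_count DESC;

    STORE ranked INTO 'tweets_per_user';

Each statement maps onto one stage of the MapReduce workflow shown earlier, which is what makes the script readable at a glance.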

SLIDE 43

A Real Pig Script

  • Just for fun... the same calculation in Java next
SLIDE 44

No, Seriously.

SLIDES 45-49

Pig Makes it Easy

  • 5% of the code
  • 5% of the dev time
  • Within 20% of the running time
  • Readable, reusable
  • As Pig improves, your calculations run faster
SLIDES 50-52

One Thing I’ve Learned

  • It’s easy to answer questions
  • It’s hard to ask the right questions
  • Value the system that promotes innovation and iteration

  • More minds contributing = more value from your data
SLIDES 53-59

Counting Big Data

  • How many requests per day?
  • Average latency? 95% latency?
  • Response code distribution per hour?
  • Twitter searches per day?
  • Unique users searching, unique queries?
  • Links tweeted per day? By domain?
  • Geographic distribution of all of the above
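
Most of these reduce to group-and-count jobs once the raw logs are in HDFS. As a sketch of the response-code-per-hour question, assuming a hypothetical request-log schema (the field names are illustrative, not Twitter's actual logs):

    -- Hypothetical request log: ISO-8601 timestamp, URL, HTTP status, latency.
    logs    = LOAD 'web_request_logs' AS (ts:chararray, url:chararray, status:int, latency_ms:long);

    -- Bucket by hour by keeping 'YYYY-MM-DDTHH' from the timestamp.
    hourly  = FOREACH logs GENERATE SUBSTRING(ts, 0, 13) AS hour, status;

    grouped = GROUP hourly BY (hour, status);
    dist    = FOREACH grouped GENERATE FLATTEN(group) AS (hour, status),
                                       COUNT(hourly) AS requests;

    STORE dist INTO 'status_codes_by_hour';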
SLIDES 60-65

Correlating Big Data

  • Usage difference for mobile users?
  • ... for users on desktop clients?
  • ... for users of #newtwitter?
  • Cohort analyses
  • What features get users hooked?
  • What features power Twitter users use often?
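
As one concrete shape a cohort analysis can take: bucket users by signup month and compare activity across cohorts. A minimal Pig sketch, with hypothetical 'users' and 'daily_activity' inputs:

    -- Hypothetical inputs.
    users    = LOAD 'users' AS (user_id:long, signup_month:chararray);          -- e.g. '2010-06'
    activity = LOAD 'daily_activity' AS (user_id:long, day:chararray, tweets:int);

    joined   = JOIN users BY user_id, activity BY user_id;
    slim     = FOREACH joined GENERATE users::signup_month AS cohort, activity::tweets AS tweets;

    cohorts  = GROUP slim BY cohort;
    summary  = FOREACH cohorts GENERATE group AS cohort,
                                        COUNT(slim) AS active_user_days,
                                        SUM(slim.tweets) AS total_tweets;

    STORE summary INTO 'activity_by_signup_cohort';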
SLIDES 66-73

Research on Big Data

  • What can we tell from a user’s tweets?
  • ... from the tweets of their followers?
  • ... from the tweets of those they follow?
  • What influences retweets? Depth of the retweet tree?
  • Duplicate detection (spam)
  • Language detection (search)
  • Machine learning
  • Natural language processing
SLIDE 74

Diving Deeper

  • HBase and building products from Hadoop
  • LZO Compression
  • Protocol Buffers and Hadoop
  • Our analytics-related open source: hadoop-lzo, elephant-bird

  • Moving analytics to realtime

http://github.com/kevinweil/hadoop-lzo
http://github.com/kevinweil/elephant-bird

SLIDE 75

Questions?

Follow me at twitter.com/kevinweil
