Programming and Debugging Large-Scale Data Processing Workflows - PowerPoint PPT Presentation



SLIDE 1

Programming and Debugging Large-Scale Data Processing Workflows

Christopher Olston and many others, Yahoo! Research


SLIDE 2

Context

  • Elaborate processing of large data sets, e.g.:
    – web search pre-processing
    – cross-dataset linkage
    – web information extraction

[Diagram: ingestion → storage & processing → serving]


SLIDE 3

Context

Debugging aids:
  • Before: example data generator
  • During: instrumentation framework
  • After: provenance metadata manager

[Diagram: the storage & processing box expanded into layers: scalable file system (e.g. GFS), distributed sorting & hashing (e.g. Map-Reduce), dataflow programming framework (e.g. Pig), workflow manager (e.g. Nova). Legend: "Detail" vs. "Overview", with Inspector Gadget covered in detail.]


SLIDE 4

Pig: A High-Level Dataflow Language and Runtime for Hadoop

Visits = load '/data/visits' as (user, url, time);
Visits = foreach Visits generate user, Canonicalize(url), time;
Pages = load '/data/pages' as (url, pagerank);
VP = join Visits by url, Pages by url;
UserVisits = group VP by user;
Sessions = foreach UserVisits generate flatten(FindSessions(*));
HappyEndings = filter Sessions by BestIsLast(*);
store HappyEndings into '/data/happy_endings';

Web browsing sessions with "happy endings."


SLIDE 5

vs. map-reduce: less code!

[Charts: Hadoop vs. Pig: 1/20 the lines of code; 1/16 the development time (minutes); performs on par with raw Hadoop]

"The [Hofmann PLSA E/M] algorithm was implemented in pig in 30-35 lines of pig-latin statements. Took a lot less compared to what it took in implementing the algorithm in Map-Reduce Java. Exactly that's the reason I wanted to try it out in Pig. It took 3-4 days for me to write it, starting from learning pig."
  – Prasenjit Mukherjee, Mahout project
SLIDE 6

vs. SQL: step-by-step style; lower-level control

"I much prefer writing in Pig [Latin] versus SQL. The step-by-step method of creating a program in Pig [Latin] is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data."
  – Jasmine Novak, Engineer, Yahoo!

"PIG seems to give the necessary parallel programming construct (FOREACH, FLATTEN, COGROUP .. etc) and also give sufficient control back to the programmer (which purely declarative approach like [SQL on top of Map-Reduce] doesn't)."
  – Ricky Ho, Adobe Software
SLIDE 7

Conceptually: A Graph of Data Transformations

Find users who tend to visit "good" pages.

[Dataflow graph: Load Visits(user, url, time) → Transform to (user, Canonicalize(url), time), joined with Load Pages(url, pagerank) on url = url → Group by user → Transform to (user, Average(pagerank) as avgPR) → Filter avgPR > 0.5]


SLIDE 8

Illustrated!

[The same dataflow, with example data shown at each step:
  Load Visits(user, url, time): (Amy, cnn.com, 8am), (Amy, http://www.snails.com, 9am), (Fred, www.snails.com/index.html, 11am)
  Transform to (user, Canonicalize(url), time): (Amy, www.cnn.com, 8am), (Amy, www.snails.com, 9am), (Fred, www.snails.com, 11am)
  Load Pages(url, pagerank): (www.cnn.com, 0.9), (www.snails.com, 0.4)
  Join url = url: (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4), (Fred, www.snails.com, 11am, 0.4)
  Group by user: (Amy, { (Amy, www.cnn.com, 8am, 0.9), (Amy, www.snails.com, 9am, 0.4) }), (Fred, { (Fred, www.snails.com, 11am, 0.4) })
  Transform to (user, Average(pagerank) as avgPR): (Amy, 0.65), (Fred, 0.4)
  Filter avgPR > 0.5: (Amy, 0.65)]

"ILLUSTRATE lets me check the output of my lengthy batch jobs and their custom functions without having to do a lengthy run of a long pipeline. [This feature] enables me to be productive."
  – Russell Jurney, LinkedIn
SLIDE 9

(Naïve Algorithm)

[The same dataflow, with example data produced by a naïve algorithm:
  Load Visits(user, url, time): (Amy, cnn.com, 8am), (Amy, http://www.snails.com, 9am), (Fred, www.snails.com/index.html, 11am)
  Transform to (user, Canonicalize(url), time): (Amy, www.cnn.com, 8am), (Amy, www.snails.com, 9am), (Fred, www.snails.com, 11am)
  Load Pages(url, pagerank): (www.youtube.com, 0.9), (www.frogs.com, 0.4)
  The Pages examples share no URLs with the Visits examples, so no example data is shown for the Join, Group, Transform, and Filter steps.]


SLIDE 10

Pig Project Status

  • Productized at Yahoo (~12-person team)
    – 1000s of jobs/day
    – 70% of Hadoop jobs
  • Open-source (the Apache Pig Project)
  • Offered on Amazon Elastic Map-Reduce
  • Used by LinkedIn, Twitter, Yahoo, ...

SLIDE 11

Next: NOVA

[Recap of the Slide 3 picture (debugging aids and the storage & processing layers: GFS, Map-Reduce, Pig, Nova), with the parts covered so far checked off ✔]


SLIDE 12

Why a Workflow Manager?

  • Modularity: a workflow connects N dataflow modules
    – Written independently, and re-used in other workflows
    – Scheduled independently
  • Optimization: optimize across modules
    – Share read costs among side-by-side modules
    – Pipeline data between end-to-end modules
  • Continuous processing: push new data through
    – Selective re-running
    – Incremental algorithms ("view maintenance")
  • Manageability: help humans keep tabs on execution
    – Alerts
    – Metadata (e.g. data provenance)


SLIDE 13

Example Workflow

[Diagram: an RSS feed of news articles feeds template detection (which maintains news site templates) and template tagging; tagged articles are shingled and de-duped against the shingle hashes seen so far, yielding unique articles. Connections are labeled ALL or NEW, indicating whether a step consumes/produces complete data or only newly arrived data.]


SLIDE 14

Data Passes Through Many Sub-Systems

[Diagram: data (datum X, datum Y) flows through ingestion, GFS, Map-Reduce, Pig, Nova, a low-latency processor, and serving; metadata queries such as "provenance of X?" span these sub-systems.]


SLIDE 15

Ibis Project

  • Benefits:
    – Provide uniform view to users
    – Factor out metadata management code
    – Decouple metadata lifetime from data/subsystem lifetime
  • Challenges:
    – Overhead of shipping metadata
    – Disparate data/processing granularities

[Diagram: the data processing sub-systems send metadata to an integrated metadata manager (Ibis); users pose metadata queries to Ibis and receive answers.]


SLIDE 16

What's Hard About Multi-Granularity Provenance?

  • Inference: Given relationships expressed at one granularity, answer queries about other granularities (the semantics are tricky here!)
  • Efficiency: Implement inference without resorting to materializing everything in terms of the finest granularity (e.g. cells)

SLIDE 17

Next: INSPECTOR GADGET

[Recap of the Slide 3 picture again, with additional parts checked off ✔]


SLIDE 18

Motivated by User Interviews

  • Interviewed 10 Yahoo dataflow programmers (mostly Pig users; some users of other dataflow environments)
  • Asked them how they (wish they could) debug

SLIDE 19

Summary of User Interviews

  # of requests | feature
  7 | crash culprit determination
  5 | row-level integrity alerts
  4 | table-level integrity alerts
  4 | data samples
  3 | data summaries
  3 | memory use monitoring
  3 | backward tracing (provenance)
  2 | forward tracing
  2 | golden data/logic testing
  2 | step-through debugging
  2 | latency alerts
  1 | latency profiling
  1 | overhead profiling
  1 | trial runs


SLIDE 20

Our Approach

  • Goal: a programming framework for adding these behaviors, and others, to Pig
  • Precept: avoid modifying Pig or tampering with data flowing through Pig
  • Approach: perform Pig script rewriting: insert special UDFs that look like no-ops to Pig (see the sketch below)
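
To make the rewriting idea concrete, here is a minimal sketch of what an inserted "no-op" UDF could look like, assuming the standard Pig EvalFunc API. The class name IGObserve, the splice-in statement shown in the comment, and the agent hook are illustrative assumptions, not the actual Inspector Gadget code.

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical no-op UDF: Pig sees an ordinary EvalFunc that returns its input
// unchanged, so inserting it does not alter the dataflow's results.
//
// A script rewriter might splice it between two operators with a statement like
//   X = foreach X generate flatten(IGObserve(*));
// so every record passes through the UDF on its way to the next operator.
public class IGObserve extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) throws IOException {
        // Hand the record to the monitoring agent here (counting, tagging, tracing, ...)
        // without modifying it; the agent hook itself is an assumption and not shown.
        return input;   // returned unchanged, so Pig treats the UDF as a pass-through
    }
}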


SLIDE 21

Pig w/ Inspector Gadget

[Diagram: a Pig dataflow (two loads, join, filter, group, count, store) with an IG agent attached to each operator, all communicating with a central IG coordinator]


SLIDE 22

Example: Crash Culprit Determination

  • Phases 1 to n-1: maintain count lower bounds (agents report record counts)
  • Phase n: maintain last-seen records (agents report the records themselves)

[Diagram: the instrumented dataflow from Slide 21, with IG agents on each operator and the IG coordinator]
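
As a rough illustration of the two phases, here is a sketch in Java of what such an agent could look like. The Agent base class below is a stand-in patterned on the Agent & Coordinator APIs summarized later in the talk; the String record/tag types, the reporting interval, and the way the final phase learns where the crash region starts are all assumptions, not the real Inspector Gadget library.

// Stand-in for the agent hooks named in this talk (init / observeRecord / finish,
// plus sendToCoordinator); the real IG classes and signatures are not shown here.
abstract class Agent {
    abstract void init(String[] args);
    abstract String[] observeRecord(String record, String[] tags);
    abstract void finish();
    void sendToCoordinator(String message) { /* delivered by the IG runtime */ }
}

// Hypothetical crash-culprit agent.
class CrashCulpritAgent extends Agent {
    private static final long REPORT_EVERY = 10_000;  // reporting interval (assumption)
    private boolean lastPhase;     // phases 1..n-1 only count; phase n reports records
    private long suspectStart;     // in phase n: count at which the suspect region begins
    private long count;            // number of records seen at this point in the dataflow

    void init(String[] args) {
        lastPhase = args.length > 0 && "last-phase".equals(args[0]);
        suspectStart = (lastPhase && args.length > 1) ? Long.parseLong(args[1]) : Long.MAX_VALUE;
    }

    String[] observeRecord(String record, String[] tags) {
        count++;
        if (!lastPhase) {
            // Periodic updates give the coordinator a lower bound on how far this
            // operator got, even if the run crashes before finish() is ever called.
            if (count % REPORT_EVERY == 0) sendToCoordinator("count=" + count);
        } else if (count >= suspectStart) {
            // Near the crash point narrowed down by earlier phases: report the records
            // actually seen, so the coordinator can present candidate culprits.
            sendToCoordinator("seen=" + record);
        }
        return tags;   // tags pass through unchanged
    }

    void finish() {
        sendToCoordinator("final-count=" + count);
    }
}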


SLIDE 23

Example: Forward Tracing

[Diagram: the instrumented dataflow again, with tracing instructions supplied to the IG agents; traced records are tracked as they flow through the operators and reported back to the user.]
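
Continuing the same hypothetical sketch (reusing the Agent stand-in from the crash-culprit example above), the agent side of forward tracing might look roughly like this; the tag value and the idea of expressing tracing instructions as a record prefix are assumptions:

import java.util.Arrays;

// Hypothetical forward-tracing agent: start tracing at records that match the user's
// tracing instructions, let the tag travel with derived records, and report every
// tagged record to the coordinator so it can be shown to the user.
class ForwardTracingAgent extends Agent {
    private String tracePrefix;    // e.g. "(Amy,"; how instructions arrive is an assumption

    void init(String[] args) {
        tracePrefix = args.length > 0 ? args[0] : null;
    }

    String[] observeRecord(String record, String[] tags) {
        boolean traced = Arrays.asList(tags).contains("trace");
        if (!traced && tracePrefix != null && record.startsWith(tracePrefix)) {
            tags = Arrays.copyOf(tags, tags.length + 1);
            tags[tags.length - 1] = "trace";      // tag the record; IG's tagging
            traced = true;                        // machinery propagates it downstream
        }
        if (traced) sendToCoordinator("traced: " + record);
        return tags;
    }

    void finish() { }
}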


SLIDE 24

Flow

[Diagram: the end user supplies a dataflow program plus application parameters to an application built on the IG driver library; the driver launches instrumented dataflow run(s) on the dataflow engine runtime, collects the raw result(s), and the application distills them into the result returned to the user.]


SLIDE 25

Agent & Coordinator APIs

Agent Class:
  init(args)
  tags = observeRecord(record, tags)
  receiveMessage(source, message)
  finish()

Coordinator Class:
  init(args)
  receiveMessage(source, message)
  output = finish()

Agent Messaging:
  sendToCoordinator(message)
  sendToAgent(agentId, message)
  sendDownstream(message)
  sendUpstream(message)

Coordinator Messaging:
  sendToAgent(agentId, message)
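
For completeness, a matching coordinator sketch, again with hypothetical signatures patterned on the list above rather than the real IG classes; it simply gathers whatever the agents report and hands it back as the run's output:

import java.util.ArrayList;
import java.util.List;

// Stand-in for the coordinator hooks listed above ("output = finish()"); signatures assumed.
abstract class Coordinator {
    abstract void init(String[] args);
    abstract void receiveMessage(Object source, String message);
    abstract String finish();
    void sendToAgent(String agentId, String message) { /* delivered by the IG runtime */ }
}

// Hypothetical coordinator for the agents sketched earlier: collect their reports
// (counts, candidate crash culprits, traced records) and return them as the output.
class CollectingCoordinator extends Coordinator {
    private final List<String> reports = new ArrayList<>();

    void init(String[] args) { }

    void receiveMessage(Object source, String message) {
        reports.add(source + ": " + message);
    }

    String finish() {
        return String.join("\n", reports);   // returned to the application via the IG driver
    }
}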


SLIDE 26

Applications Developed Using IG

  # of requests | feature | lines of code (Java)
  7 | crash culprit determination | 141
  5 | row-level integrity alerts | 89
  4 | table-level integrity alerts | 99
  4 | data samples | 97
  3 | data summaries | 130
  3 | memory use monitoring | N/A
  3 | backward tracing (provenance) | 237
  2 | forward tracing | 114
  2 | golden data/logic testing | 200
  2 | step-through debugging | N/A
  2 | latency alerts | 168
  1 | latency profiling | 136
  1 | overhead profiling | 124
  1 | trial runs | 93


SLIDE 27

Rest of talk: IG DETAILS

  • Semantics under parallel/distributed execution
  • Messaging & tagging implementation
  • Limitations
  • Performance experiments
  • Related work

SLIDE 28

Parallel/Distributed Execution

[Diagram: the logical dataflow (load, filter, split, group, median, count, store) runs as many parallel instances, divided into stages (e.g. a map stage and a reduce stage), each replicated across partitions of the data.]


SLIDE 29

Messaging Details

  • Semantics:
    Message Request | Semantics
    sendToCoordinator(message) | asynchronous, guaranteed delivery
    sendToAgent(agentId, message) | asynchronous, best-effort delivery
    sendDownstream(message) | "follow the arrows," guaranteed delivery
    sendUpstream(message) | (same-stage only) "invert the arrows," guaranteed delivery
  • Implementation:
    – Within-process: shared memory
    – Cross-process: relay through coordinator (coordinator buffers message for recipients that haven't started yet)


SLIDE 30

Tagging Implementation

  • Uses messaging APIs
  • Within-stage:
    – Leverage "iterator model" synchronous pipeline execution
      1. sendDownstream("tag future outputs with T"); release output record
      2. sendDownstream("stop tagging")
  • Cross-stage:
    – Leverage Pig operator semantics (group-by, cogroup, join, order-by)
    – Group/cogroup: use group key
    – Join/order-by: use all record fields (back-tags dups!)


SLIDE 31

Limitations of the IG Approach

  • Assumes query optimization nonexistent/disabled
  • IG sits on top of Pig, so hard to correlate with lower-level logs/errors
  • Crash/re-start results in a record being seen by agents multiple times
    – Fortunately, all apps we've written can tolerate this, e.g. data only sent in finish(); rely on idempotence
  • Tagging implementation not scalable
  • Tagging implementation relies on Pig details

SLIDE 32

Performance Experiments

  • 15-machine Pig/Hadoop cluster (1G network)
  • Four dataflows over a small web crawl sample (10M URLs):

  Dataflow Program | Early Projection Optimization? | Early Aggregation Optimization? | Number of Map-Reduce Jobs
  Distinct Inlinks | N | N | 1
  Frequent Anchortext | Y | N | 1
  Big Site Count | Y | Y | 1
  Linked By Large | N | Y | 2


SLIDE 33

Dataflow Running Times

[Chart: running time (seconds) for the four dataflows (Distinct Inlinks, Frequent Anchor Text, Big Site Count, Linked by Large) under regular Pig, a no-op IG script rewrite, and several IG applications (DH, DS, FT, LA, LP, RI, TI)]

SLIDE 34

Summary

Debugging aids:
  • Before: example data generator
  • During: instrumentation framework
  • After: provenance metadata manager

[Recap of the Slide 3 picture (GFS, Map-Reduce, Pig, Nova), labeled with the systems discussed in this talk: Pig, Nova, Dataflow Illustrator, Inspector Gadget, Ibis]


SLIDE 35

Related Work

  • Pig: DryadLINQ, Hive, Jaql, Scope, relational query languages
  • Nova: BigTable, CBP, Oozie, Percolator, scientific workflow, incremental view maintenance
  • Dataflow illustrator: [Mannila/Raiha, PODS'86], reverse query processing, constraint databases, hardware verification & model checking
  • Inspector gadget: XTrace, taint tracking, aspect-oriented programming
  • Ibis: Kepler COMAD, ZOOM user views, provenance management for databases & scientific workflows


SLIDE 36

Collaborators

Shubham Chopra, Anish Das Sarma, Alan Gates, Pradeep Kamath, Ravi Kumar, Shravan Narayanamurthy, Olga Natkovich, Benjamin Reed, Santhosh Srinivasan, Utkarsh Srivastava, Andrew Tomkins