SLIDE 1

Dataflow Execution

Craig Knoblock
University of Southern California

This talk is based in part on slides from Greg Barish

SLIDE 2

Outline of talk

  • Introduction
  • Streaming dataflow execution systems
  • Network Query Engines
  • A streaming dataflow plan language
  • Discussion

SLIDE 3

Motivation

  • Problem
      • Information gathering may involve accessing and integrating data from many sources
      • Total time to execute these plans may be large
  • Why?
      • Unpredictable network latencies
      • Varying remote source capabilities
      • Thus, execution is often I/O-bound
  • Complicating factor: binding patterns
      • During execution, many sources cannot be queried until a previous source query has been answered

SLIDE 4

Traditional Approaches

  • Executing information gathering plans
      • Generate a plan
      • Plan typically consists of a partial ordering of the operators
      • Execute the plan based on the given order
  • Operators process all of their input data before transmitting any results to consumer(s)
  • Operators are only as fast as their most latent input
  • Long delays due to the dependencies in the plan

SLIDE 5

Streaming Dataflow Execution Systems

SLIDE 6

Streaming Dataflow

  • Plans consist of a network of operators
      • Each operator is like a function
      • Example: Wrapper, Select, etc.
  • Operators produce and consume data
  • Operators "fire" when any part of any input data becomes available
  • Data routed between operators are relations
      • Zero or more tuples with one or more attributes

[Figure: an example plan built from Wrapper, Select, Join, and Wrapper operators, with its input and output relations]

Input:
  City          State  Max Price
  Santa Monica  CA     200000

Output:
  Address
  100 Main St., Santa Monica, 90292
  520 4th St., Santa Monica, 90292
  2 Ocean Blvd, Venice, 90292
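
The operator model above can be sketched in Python with generators, assuming a relation is modeled as an iterator of dict tuples. The names `wrapper` and `select` and the listing data are hypothetical stand-ins for illustration, not the operators of any system covered in the talk.

```python
from typing import Callable, Dict, Iterable, Iterator

Row = Dict[str, object]  # one tuple: attribute name -> value


def wrapper(rows: Iterable[Row]) -> Iterator[Row]:
    """Stand-in for a source wrapper: yields tuples one at a time."""
    for row in rows:
        yield row


def select(rows: Iterable[Row], keep: Callable[[Row], bool]) -> Iterator[Row]:
    """Emit each matching tuple as soon as it arrives (no buffering)."""
    for row in rows:
        if keep(row):
            yield row


# Hypothetical listing data, loosely modeled on the slide's example.
listings = [
    {"address": "100 Main St.", "city": "Santa Monica", "price": 150000},
    {"address": "9 Hill Rd.", "city": "Santa Monica", "price": 450000},
    {"address": "2 Ocean Blvd", "city": "Venice", "price": 180000},
]

# Wiring operators together forms the plan; nothing runs until tuples are pulled.
plan = select(wrapper(listings), lambda r: r["price"] <= 200000)
results = [r["address"] for r in plan]
print(results)
```

Because each operator is a generator, a downstream operator sees the first tuple before the upstream one has produced the rest, which is the streaming behavior the slide describes.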

SLIDE 7

Dataflow vs. Von Neumann

[Figure: the expression ((a + b) * (c + d)) over inputs a, b, c, d, built from ADD and MUL operations; in the dataflow graph the nodes are labeled "actor" and the edges "arc"]
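
A small Python sketch of data-driven evaluation for ((a + b) * (c + d)), assuming each actor fires once all of its input arcs hold a token. The arc names `ab`, `cd`, and `result` and the `run_dataflow` helper are made up for this illustration.

```python
import operator
from typing import Callable, Dict, List, Tuple

# Each actor: (function, input arc names, output arc name). Names are illustrative.
Actor = Tuple[Callable[[float, float], float], Tuple[str, str], str]

ACTORS: List[Actor] = [
    (operator.mul, ("ab", "cd"), "result"),  # listed first on purpose
    (operator.add, ("a", "b"), "ab"),
    (operator.add, ("c", "d"), "cd"),
]


def run_dataflow(tokens: Dict[str, float]) -> Tuple[float, List[str]]:
    """Fire any actor whose input arcs all hold a token, until none remain."""
    tokens = dict(tokens)
    fired: List[str] = []
    pending = list(ACTORS)
    while pending:
        ready = [a for a in pending if all(arc in tokens for arc in a[1])]
        if not ready:
            raise ValueError("deadlock: missing input tokens")
        for fn, inputs, output in ready:  # these could run in parallel
            tokens[output] = fn(tokens[inputs[0]], tokens[inputs[1]])
            fired.append(output)
        pending = [a for a in pending if a not in ready]
    return tokens["result"], fired


value, order = run_dataflow({"a": 1, "b": 2, "c": 3, "d": 4})
print(value, order)
```

The order of execution comes from data availability rather than from the order in which the actors are listed: both ADD actors are ready in the first round, and MUL fires only after their results arrive.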

SLIDE 8

Parallelism of Streaming Dataflow

  • Dataflow (horizontal parallelism)
      • Decentralized, independent operator execution
      • Enables "maximally parallel" operator execution
      • Also known as the "dataflow limit"
  • Streaming/pipelining (vertical parallelism)
      • Producer emits tuples to consumer ASAP
      • Producer & consumer can process same relation simultaneously
      • Effective because information gathering latencies can be high, even at the tuple level
      • Data often "trickles" out of I/O-bound operators
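
Pipelining between a producer and a consumer can be sketched with two threads and a one-slot queue. This is only an illustration under assumed names (`producer`, `consumer`, `EOS`, `events`); it does not reflect how any of the systems in the talk are implemented.

```python
import queue
import threading
from typing import List, Tuple

EOS = object()  # end-of-stream marker
events: List[Tuple[str, int]] = []  # shared log; list.append is thread-safe in CPython


def producer(out: queue.Queue, n: int) -> None:
    """Emit tuples one at a time instead of building the whole relation."""
    for i in range(n):
        events.append(("emit", i))
        out.put(i)  # blocks while the one-slot buffer is full
    out.put(EOS)


def consumer(inp: queue.Queue, results: List[int]) -> None:
    """Process each tuple as soon as it arrives."""
    while True:
        item = inp.get()
        if item is EOS:
            return
        results.append(item * 10)
        events.append(("consume", item))


channel: queue.Queue = queue.Queue(maxsize=1)
results: List[int] = []
threads = [
    threading.Thread(target=producer, args=(channel, 5)),
    threading.Thread(target=consumer, args=(channel, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

With the one-slot buffer, the consumer must have finished the first tuple before the producer can emit the last one, so both operators work on the same relation at the same time.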

SLIDE 9

Example: The RepInfo Agent

  • INPUT
      • Any street address, e.g., 4767 Admiralty Way, Marina del Rey, CA, 90292
  • OUTPUT
      • Federal reps
          • 2 senators
          • 1 house member
      • For each rep:
          • Recent news
          • Real-time funding information

SLIDE 10

RepInfo Sources

  • Vote-Smart: list of officials

SLIDE 11

RepInfo Sources

  • Vote-Smart: list of officials
  • Yahoo: recent news

SLIDE 12

RepInfo Sources

  • Vote-Smart: list of officials
  • Yahoo: recent news
  • Open Secrets: funding graph

SLIDE 13

OpenSecrets: Navigation + Fetching!

SLIDE 14

OpenSecrets: Navigation + Fetching!

SLIDE 15

OpenSecrets: Navigation + Fetching!

SLIDE 16

OpenSecrets: Navigation + Fetching!

SLIDE 17

RepInfo agent plan

[Figure: dataflow plan for the RepInfo agent]

  • Wrapper Vote-Smart: takes the address, returns all officials
  • Select: keeps senators and house reps
  • Wrapper Yahoo News: returns recent news
  • Wrapper OpenSecrets (names page): returns the member URL
  • Wrapper OpenSecrets (member page): returns the funding URL
  • Wrapper OpenSecrets (funding page): returns the graph URL
  • Join on name: produces the combined results

Sample data shown in the figure:

  • Address: 4676 Admiralty Way, Marina del Rey, CA
  • All officials: George Bush, Dick Cheney, Barbara Boxer, Dianne Feinstein, Jane Harman, James Hahn
  • Senators & house reps: Barbara Boxer, Dianne Feinstein, Jane Harman
  • Recent news:
      • Boxer: Anthrax investigation continues…
      • Boxer: Bay area politicians meet…
      • Feinstein: Bay area politicians meet…
      • Harman: Life in LA is just too sunny…

SLIDE 18

Streaming Dataflow Systems for Network Environments

  • Focus
      • Autonomous data sources on the Internet
      • Unpredictable network latencies
  • Network Query Engines
      • Build plans to support queries
      • Tukwila
      • Telegraph
      • Niagara
  • Agent-based Execution System
      • Support a richer plan language
      • Theseus

SLIDE 19

Network Query Engine -- Tukwila

SLIDE 20

Network Query Engines

  • Focus on supporting streaming XML data
      • Plan is defined by a query on the XML sources
      • XQuery is the emerging standard for XML querying
  • Challenges
      • How to convert XML data into tuples for a streaming dataflow system
      • How to handle queries over graphs
      • How to optimize the query processing
  • Here we focus on how Tukwila handles the first issue [Ives, Halevy, Weld, VLDB Journal, 2002]

SLIDE 21

Example XML Document

SLIDE 22

Graph Representation of XML

SLIDE 23

XML Query and Result

SLIDE 24

Tukwila Architecture

SLIDE 25

Example Query

SLIDE 26

Query Plan

SLIDE 27

X-scan Processing

SLIDE 28

Operators in Tukwila

SLIDE 29

Discussion

  • Tukwila has
      • Operators for streaming data into and out of XML
          • X-scan
          • Output, element, attribute
      • Standard relational operations
          • Select, project, join
          • Sort, aggregate, nest, group, etc.
  • Focuses on the efficient processing of XML queries or streaming data sources

SLIDE 30

A Streaming Dataflow Plan Language

SLIDE 31

Theseus

  • A plan language and execution system for Web-based information integration
      • Expressive enough for monitoring a variety of sources
      • Efficient enough for near-real-time monitoring

[Figure: the Theseus Executor takes a plan and input data]

Plan:

    PLAN myplan {
      INPUT: x
      OUTPUT: y
      BODY {
        Op (x : y)
      }
    }

SLIDE 32

Expressivity

  • Basic relational-style operators
      • Select, Project, Join, Union, etc.
  • Operators for gathering Web data
      • Xwrapper
          • Queries Web source via Fetch agent (returns XML)
      • Xquery, Rel2Xml, and Xml2Rel
          • XML processing utilities
  • Operators for monitoring Web data
      • DbExport, DbQuery, DbAppend, DbUpdate
          • Facilitates the tracking of online data
      • Email, Phone, Fax
          • Facilitates asynchronous notification

SLIDE 33

Expressivity

  • Operators for extensibility
      • Apply: single-row functions (e.g., UPPER)
      • Aggregate: multi-row functions (e.g., SUM)
  • Operators for conditional plan execution
      • Null: Tests and routes data accordingly
  • Subplans and recursion
      • Plans are named and have INPUT & OUTPUT
      • We can use them as operators (subplans) in other plans
      • Subplans make recursion possible
          • Makes it easy to follow an arbitrarily long list of result pages that are each separated by a NEXT page link
      • Subplans encourage modularity & reuse

SLIDE 34

Operators

    Operator (Input1, Input2, … : Output1, Output2, …)
      WAIT: waitInput1, waitInput2, …
      ENABLE: enableInput1, enableInput2, …

  • Data formats
      • Operators pass relations
      • Relations are composed of tuples
      • Each attribute of a tuple can be primitive, relation, or XML object

SLIDE 35

Operator Streaming

  • Operators support stream-oriented processing
      • Firing rule met when any input receives a tuple
      • This enables ASAP processing of data
      • End of data signaled by end-of-stream (EOS)
  • Operators vary on when they can begin output:
      • Union: immediately (i.e., for each input)
      • Minus: after EOS for second input has arrived
      • Email: after EOS for all inputs has arrived
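
The difference in when operators can begin output can be sketched in Python. The `push(port, item)` interface, the `EOS` sentinel, and the arrival data are assumptions made for this example, not the Theseus API. Union is modeled as a bag union that passes every tuple through, while Minus buffers its first input until the second input signals EOS.

```python
from typing import Hashable, List, Set

EOS = object()  # hypothetical end-of-stream marker


class Union:
    """Can emit immediately: every arriving tuple is passed through."""

    def push(self, port: int, item: Hashable) -> List[Hashable]:
        return [] if item is EOS else [item]


class Minus:
    """Emits (input 0 minus input 1); must wait for EOS on input 1."""

    def __init__(self) -> None:
        self.left: List[Hashable] = []
        self.right: Set[Hashable] = set()
        self.right_done = False

    def push(self, port: int, item: Hashable) -> List[Hashable]:
        if port == 1:
            if item is not EOS:
                self.right.add(item)
                return []
            self.right_done = True
            # Flush everything buffered from input 0.
            out = [t for t in self.left if t not in self.right]
            self.left = []
            return out
        if item is EOS:
            return []
        if not self.right_done:
            self.left.append(item)  # cannot decide yet, so buffer
            return []
        return [item] if item not in self.right else []


# Interleaved arrivals as (port, item) pairs; the data is made up.
arrivals = [(0, "a"), (1, "b"), (0, "b"), (1, EOS), (0, "c"), (0, EOS)]

union, minus = Union(), Minus()
union_trace = [union.push(p, x) for p, x in arrivals]
minus_trace = [minus.push(p, x) for p, x in arrivals]
print(union_trace)
print(minus_trace)
```

In the traces, Union produces output on the very first arrival, whereas Minus stays silent until the EOS for its second input arrives and then streams the remaining tuples as they come in.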

SLIDE 36

Wrapper Operator (Xwrapper)

PURPOSE: Extract data from web pages as a relation

  • INPUT:
      • Name: URL prefix of wrapper
      • bind_map: Wrapper binding map
      • bind_dat: Binding tuples
  • OUTPUT:
      • new_rel: Incoming relation joined with new attributes

Example:

    auth =
      USER  PASSWORD
      greg  secret

    wrapper("http://fetch.com?wrapper=foo", "user=$user, pwd=$password", auth : quotes)

    quotes =
      USER  PASSWORD  SYMBOL  PRICE
      greg  secret    ORCL    15.50
      greg  secret    CSCO    21.50
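
A rough Python sketch of how a binding map could drive a wrapper call, assuming the map pairs each source parameter with a `$`-prefixed attribute of the binding tuple. The functions `parse_bind_map`, `xwrapper`, and `fake_fetch` are hypothetical; `fake_fetch` replaces the remote call with canned data taken from the slide's example, and the case-insensitive attribute matching is an assumption.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, str]


def parse_bind_map(bind_map: str) -> Dict[str, str]:
    """Turn 'user=$user, pwd=$password' into {'user': 'user', 'pwd': 'password'}."""
    pairs = (part.strip().split("=", 1) for part in bind_map.split(","))
    return {param.strip(): var.strip().lstrip("$") for param, var in pairs}


def xwrapper(
    fetch: Callable[[Dict[str, str]], List[Row]],
    bind_map: str,
    bind_dat: Iterable[Row],
) -> Iterator[Row]:
    """For each binding tuple, query the source and join the results onto it."""
    mapping = parse_bind_map(bind_map)
    for binding in bind_dat:
        # Attribute names are matched case-insensitively (an assumption).
        lowered = {k.lower(): v for k, v in binding.items()}
        params = {param: lowered[var.lower()] for param, var in mapping.items()}
        for new_attrs in fetch(params):
            yield {**binding, **new_attrs}


def fake_fetch(params: Dict[str, str]) -> List[Row]:
    """Hypothetical stand-in for the remote wrapper call; no network access."""
    if params == {"user": "greg", "pwd": "secret"}:
        return [
            {"SYMBOL": "ORCL", "PRICE": "15.50"},
            {"SYMBOL": "CSCO", "PRICE": "21.50"},
        ]
    return []


auth = [{"USER": "greg", "PASSWORD": "secret"}]
quotes = list(xwrapper(fake_fetch, "user=$user, pwd=$password", auth))
for row in quotes:
    print(row)
```

Each output tuple carries the incoming attributes plus the new ones, which matches the "incoming relation joined with new attributes" description of `new_rel`.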

SLIDE 37

Plans and Subplans

    plan planName {
      input: planInput1, planInput2, …
      output: planOutput1, planOutput2, …
      body {
        operator(opInput1, … : opOutput1, …)
        operator …
        …
      }
    }

  • Plans can be called just like operators (subplans)

SLIDE 38

Example plan: TheaterLoc

[Figure: TheaterLoc plan with input city and the operators WRAPPER Restaurants, WRAPPER Theaters, UNION, WRAPPER Geocoder, and WRAPPER TigerMap]

Sample relation shown in the figure:

    NAME        ADDRESS      CITY    STATE
    Rock        187 Maxella  Venice  CA
    AMC Movies  191 Maxella  Venice  CA
    EOS

SLIDE 39

TheaterLoc Plan

    PLAN theaterloc {
      INPUT: city
      OUTPUT: latlons, map_url
      BODY {
        wrapper ("cuisinenet", "name, addr", city : restaurants)
        wrapper ("yahoo_movies", "name, addr", city : theaters)
        union (restaurants, theaters : addresses)
        wrapper ("geocoder", "name,lat,lon", addresses : latlons)
        wrapper ("tigermap", latlons : map_url)
      }
    }
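
The shape of this plan can be mimicked in Python to show how data moves between the operators. Everything here is a stand-in: the canned tables, the coordinates, and the `maps.example` URL format are invented for illustration, and `itertools.chain` drains its inputs one after the other rather than merging tuples as they become available.

```python
from itertools import chain
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]

# Hypothetical canned source data; real wrappers would query remote sites.
RESTAURANTS = {"Venice": [{"name": "Rock", "addr": "187 Maxella"}]}
THEATERS = {"Venice": [{"name": "AMC Movies", "addr": "191 Maxella"}]}
COORDS = {"187 Maxella": (33.98, -118.44), "191 Maxella": (33.99, -118.44)}


def source_wrapper(table: Dict[str, List[Row]], cities: Iterable[str]) -> Iterator[Row]:
    """Yield the rows a source holds for each requested city."""
    for city in cities:
        yield from table.get(city, [])


def geocoder(addresses: Iterable[Row]) -> Iterator[Row]:
    """Attach made-up coordinates to each address tuple as it streams by."""
    for row in addresses:
        lat, lon = COORDS[str(row["addr"])]
        yield {"name": row["name"], "lat": lat, "lon": lon}


def tigermap(latlons: Iterable[Row]) -> str:
    """Needs every point before it can build one map URL (made-up URL format)."""
    points = ";".join(f"{r['lat']},{r['lon']}" for r in latlons)
    return f"http://maps.example/?points={points}"


def theaterloc(city: str) -> Tuple[List[Row], str]:
    restaurants = source_wrapper(RESTAURANTS, [city])
    theaters = source_wrapper(THEATERS, [city])
    addresses = chain(restaurants, theaters)  # union (sequential here)
    latlons = list(geocoder(addresses))
    return latlons, tigermap(latlons)


latlons, map_url = theaterloc("Venice")
print(latlons)
print(map_url)
```

The two source wrappers do not depend on each other, so a dataflow executor could run them in parallel; the geocoder can start on the first address it receives, while the map step has to wait for the full set of points.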

SLIDE 40

Transactions

  • Enable
      • Concurrent plan access by multiple clients
      • Recursive plan execution
  • Transactions each assigned unique ID
  • Individual transactions can be aborted
  • All transactions are assigned a "time to live"
      • Unprocessed data is garbage collected by Theseus

SLIDE 41

Conditionals and Recursion

  • Conditional outputs are defined by enabling outputs depending on the action results

    Null(inStream : outStreamTrue, outStreamFalse)

  • Plans can be called recursively
      • Termination defined by conditional operators
      • Transactions support recursive calls in same execution environment
      • System provides tail-recursion optimization
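
A short Python sketch of a conditional operator ending a recursive plan that follows next-page links. The semantics assumed for `null` (enable the "true" output when the input is empty, otherwise the "false" output) are a guess for illustration, and the `PAGES` table and both `get_urls` functions are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical site: each page holds some items and an optional next-page link.
PAGES: Dict[str, Tuple[List[str], Optional[str]]] = {
    "/p1": (["house-1", "house-2"], "/p2"),
    "/p2": (["house-3"], "/p3"),
    "/p3": (["house-4"], None),
}


def null(in_stream: List[str]) -> Tuple[Optional[List[str]], Optional[List[str]]]:
    """Assumed semantics: enable exactly one output, (true_out, false_out)."""
    if not in_stream:
        return in_stream, None  # input is empty: the 'true' output is enabled
    return None, in_stream  # input has data: the 'false' output is enabled


def get_urls(page_url: str) -> List[str]:
    """Recursive plan: this page's items plus those of all following pages."""
    items, next_link = PAGES[page_url]
    next_links = [next_link] if next_link is not None else []
    true_out, false_out = null(next_links)
    if true_out is not None:  # no next page, so the recursion terminates
        return items
    return items + get_urls(false_out[0])  # union with the recursive call


def get_urls_iter(page_url: str) -> List[str]:
    """Loop form of the same plan, roughly what a tail-call optimizer achieves."""
    collected: List[str] = []
    current: Optional[str] = page_url
    while current is not None:
        items, current = PAGES[current]
        collected.extend(items)
    return collected


print(get_urls("/p1"))
print(get_urls_iter("/p1"))
```

Python itself does not optimize tail calls, so the loop version is shown only to indicate what such an optimization buys: the same result without a growing chain of nested calls.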

SLIDE 42

Real Estate Plan

[Figure: a new listing (3br, 2bath, 200K) triggers "Send Email Notification"]

SLIDE 43

Real Estate Plan

[Figure: dataflow diagram of the Real Estate Plan, showing the FIND_HOUSES plan and the GET_URLS subplan. Operators shown: WRAPPER (house-list), WRAPPER (house-details), GET_URLS, UNION, NULL (with true and false outputs), SELECT (cond), PROJECT (addr, price), PROJECT (house_url), DISTINCT (next_page_url), FORMAT ("price < %s AND beds = $s"), Email. Data labels: criteria, house results.]

SLIDE 44

Parallel Remote Data Retrievals

[Figure: listings page retrievals and details page retrievals]


SLIDE 45

Discussion

  • Theseus, Tukwila, Telegraph, Niagara are all:
      • Streaming dataflow systems
      • Target network-based query execution
          • Large source latencies
          • Unknown characteristics of sources
      • Focus on techniques for improving the efficiency of plan execution
  • More on this in upcoming class