Dataflow Execution



  1. Dataflow Execution
     Craig Knoblock, University of Southern California
     This talk is based in part on slides from Greg Barish

  2. Outline of talk
     • Introduction
     • Streaming dataflow execution systems
     • Network Query Engines
     • A streaming dataflow plan language
     • Discussion

  3. Motivation
     • Problem
       • Information gathering may involve accessing and integrating data from many sources
       • Total time to execute these plans may be large
     • Why?
       • Unpredictable network latencies
       • Varying remote source capabilities
       • Thus, execution is often I/O-bound
     • Complicating factor: binding patterns
       • During execution, many sources cannot be queried until a previous source query has been answered

  4. Traditional Approaches
     • Executing information gathering plans
       • Generate a plan
         • Plan typically consists of a partial ordering of the operators
       • Execute the plan based on the given order
         • Operators process all of their input data before transmitting any results to consumer(s)
         • Operators are only as fast as their most latent input
       • Long delays due to the dependencies in the plan

  5. Streaming Dataflow Execution Systems

  6. Streaming Dataflow
     • Plans consist of a network of operators
       • Each operator is like a function
       • Examples: Wrapper, Select, etc.
     • Operators produce and consume data
       • Operators "fire" when any part of any input data becomes available
     • Data routed between operators are relations
       • Zero or more tuples with one or more attributes
     [Figure: Input, Plan, Output. The input relation (City, State, Max Price) = (Santa Monica, CA, 200000) enters a plan built from Wrapper, Join, and Select operators. The output relation has an Address attribute with the tuples "100 Main St., Santa Monica, 90292", "520 4th St. Santa Monica, 90292", and "2 Ocean Blvd, Venice, 90292".]
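The firing rule on this slide can be illustrated with Python generators. This is only a minimal sketch and not the implementation of any system named in the talk: `wrapper`, `select`, `run_plan`, the `trace` list, and the prices in `HOUSES` are all hypothetical, invented for illustration. Each operator hands a tuple to its consumer as soon as it has one, so the first result is available before the source has been fully read.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]

# Hypothetical source data; the prices are made up for this sketch.
HOUSES: List[Row] = [
    {"address": "100 Main St.", "city": "Santa Monica", "price": 150000},
    {"address": "520 4th St.", "city": "Santa Monica", "price": 250000},
    {"address": "2 Ocean Blvd", "city": "Venice", "price": 180000},
]


def wrapper(rows: Iterable[Row], trace: List[Tuple[str, str]]) -> Iterator[Row]:
    """Stand-in for a Wrapper operator: emits one tuple at a time."""
    for row in rows:
        trace.append(("emit", str(row["address"])))
        yield row


def select(rows: Iterable[Row], predicate: Callable[[Row], bool]) -> Iterator[Row]:
    """Select operator: forwards each matching tuple as soon as it arrives."""
    for row in rows:
        if predicate(row):
            yield row


def run_plan(max_price: int, trace: List[Tuple[str, str]]) -> Iterator[Row]:
    """Wire Wrapper -> Select into a tiny streaming plan."""
    return select(wrapper(HOUSES, trace), lambda r: r["price"] <= max_price)


trace: List[Tuple[str, str]] = []
first = next(run_plan(200000, trace))
# Only one tuple has left the wrapper when the first result is ready.
print(first["address"], trace)
```

A blocking version would build the full list inside `wrapper` before `select` saw anything, which is the behavior the "Traditional Approaches" slide describes.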

  7. Dataflow vs. Von Neumann
     • Example expression: ((a + b) * (c + d))
     [Figure: the expression drawn as a dataflow graph over inputs a, b, c, d, with ADD and MUL nodes labeled as actors and the edges between them labeled as arcs.]
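The example expression can be evaluated as a small dataflow graph. The sketch below assumes the classical rule that an actor fires once every one of its input arcs carries a token; the graph encoding, the arc names `t1` and `t2`, and `run_dataflow` are hypothetical choices made for illustration, not notation from the talk.

```python
import operator
from typing import Callable, Dict, List, Tuple

# Each actor maps to (function, input arc names, output arc name).
Actor = Tuple[Callable[[int, int], int], Tuple[str, str], str]

GRAPH: Dict[str, Actor] = {
    "ADD1": (operator.add, ("a", "b"), "t1"),
    "ADD2": (operator.add, ("c", "d"), "t2"),
    "MUL": (operator.mul, ("t1", "t2"), "result"),
}


def run_dataflow(graph: Dict[str, Actor], inputs: Dict[str, int]):
    """Fire actors in waves; every actor in one wave could run in parallel."""
    tokens = dict(inputs)
    pending = dict(graph)
    waves: List[List[str]] = []
    while pending:
        # An actor is ready when all of its input arcs hold a token.
        ready = [name for name, (_, ins, _) in pending.items()
                 if all(arc in tokens for arc in ins)]
        if not ready:
            raise ValueError("no actor has all of its inputs")
        for name in ready:
            fn, ins, out = pending.pop(name)
            tokens[out] = fn(*(tokens[arc] for arc in ins))
        waves.append(ready)
    return tokens["result"], waves


result, waves = run_dataflow(GRAPH, {"a": 1, "b": 2, "c": 3, "d": 4})
print(result, waves)  # 21 [['ADD1', 'ADD2'], ['MUL']]
```

The two ADD actors land in the same wave because neither depends on the other, whereas a sequential program would order them arbitrarily.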

  8. Parallelism of Streaming Dataflow
     • Dataflow (horizontal parallelism)
       • Decentralized, independent operator execution
       • Enables "maximally parallel" operator execution
       • Also known as the "dataflow limit"
     • Streaming/pipelining (vertical parallelism)
       • Producer emits tuples to consumer ASAP
       • Producer & consumer can process same relation simultaneously
       • Effective because information gathering latencies can be high, even at the tuple level
       • Data often "trickles" out of I/O-bound operators
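Vertical parallelism can be sketched with two threads joined by queues, so that the producer and the consumer work on the same relation at the same time. This is an illustrative sketch only: the stage names, the `DONE` sentinel, and `run_pipeline` are hypothetical, and the systems discussed in the talk may schedule operators quite differently.

```python
import queue
import threading
from typing import List

DONE = object()  # sentinel marking the end of the stream


def producer(out_q: "queue.Queue", items: List[str]) -> None:
    """Emit each tuple downstream as soon as it is available."""
    for item in items:
        out_q.put(item)
    out_q.put(DONE)


def upper_stage(in_q: "queue.Queue", out_q: "queue.Queue") -> None:
    """Consume tuples while the producer is still emitting them."""
    while True:
        item = in_q.get()
        if item is DONE:
            out_q.put(DONE)
            return
        out_q.put(item.upper())


def run_pipeline(items: List[str]) -> List[str]:
    q1: "queue.Queue" = queue.Queue()
    q2: "queue.Queue" = queue.Queue()
    threads = [
        threading.Thread(target=producer, args=(q1, items)),
        threading.Thread(target=upper_stage, args=(q1, q2)),
    ]
    for t in threads:
        t.start()
    results: List[str] = []
    while True:
        item = q2.get()
        if item is DONE:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results


print(run_pipeline(["boxer", "feinstein", "harman"]))
```

With a single producer and a single consumer per queue, tuple order is preserved, which keeps the sketch deterministic.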

  9. Example: The RepInfo Agent
     • INPUT
       • Any street address, e.g., 4767 Admiralty Way, Marina del Rey, CA, 90292
     • OUTPUT
       • Federal reps
         • 2 senators
         • 1 house member
       • For each rep:
         • Recent news
         • Real-time funding information

  10. RepInfo Sources
     • Vote-Smart: list of officials

  11. RepInfo Sources
     • Vote-Smart: list of officials
     • Yahoo: recent news

  12. RepInfo Sources
     • Vote-Smart: list of officials
     • Yahoo: recent news
     • Open Secrets: funding graph

  13. OpenSecrets: Navigation + Fetching!

  14. OpenSecrets: Navigation + Fetching!

  15. OpenSecrets: Navigation + Fetching!

  16. OpenSecrets: Navigation + Fetching!

  17. RepInfo agent plan
     [Figure: RepInfo agent plan.
      Operators: Wrapper (Vote-Smart), Select, Wrapper (Yahoo News), Wrapper (OpenSecrets names page), Wrapper (OpenSecrets member page), Wrapper (OpenSecrets funding page), Join.
      Data labels: address; senators, house reps; all officials; name; recent news; member URL; funding URL; graph URL; combined results.
      Sample input: 4676 Admiralty Way, Marina del Rey, CA.
      Sample senators & house reps: Barbara Boxer, Dianne Feinstein, Jane Harman.
      Sample all officials: George Bush, Dick Cheney, Barbara Boxer, Dianne Feinstein, Jane Harman, James Hahn.
      Sample recent news: Boxer, "Anthrax investigation continues…"; Boxer, "Bay area politicians meet…"; Feinstein, "Bay area politicians meet…"; Harman, "Life in LA is just too sunny…".]

  18. Streaming Dataflow Systems for Network Environments
     • Focus
       • Autonomous data sources on the Internet
       • Unpredictable network latencies
     • Network Query Engines
       • Build plans to support queries
       • Tukwila
       • Telegraph
       • Niagara
     • Agent-based Execution System
       • Support a richer plan language
       • Theseus

  19. Network Query Engine: Tukwila

  20. Network Query Engines
     • Focus on supporting streaming XML data
     • Plan is defined by a query on the XML sources
       • XQuery is the emerging standard for XML querying
     • Challenges
       • How to convert XML data into tuples for a streaming dataflow system
       • How to handle queries over graphs
       • How to optimize the query processing
     • Here we focus on how Tukwila handles the first issue [Ives, Halevy, Weld, VLDB Journal, 2002]
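The first challenge, turning XML into tuples while the document is still arriving, can be illustrated with the standard-library `xml.etree.ElementTree.iterparse`. This is a generic sketch and not Tukwila's X-scan operator; the sample document, the `stream_tuples` helper, and the tag names are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Optional, Sequence, Tuple

# Hypothetical sample document, invented for this sketch.
XML_DOC = b"""<db>
  <book><title>Book A</title><year>2001</year></book>
  <book><title>Book B</title><year>2002</year></book>
</db>"""


def stream_tuples(
    source: BinaryIO, record_tag: str, fields: Sequence[str]
) -> Iterator[Tuple[Optional[str], ...]]:
    """Yield one tuple per completed record element, without waiting for
    the whole document to be parsed."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            yield tuple(elem.findtext(name) for name in fields)
            elem.clear()  # drop the finished record's children to bound memory


for row in stream_tuples(io.BytesIO(XML_DOC), "book", ("title", "year")):
    print(row)
```

Because `stream_tuples` is a generator, a downstream operator can start consuming the first tuple before later records have been read from the source.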

  21. Example XML Document

  22. Graph Representation of XML

  23. XML Query and Result

  24. Tukwila Architecture

  25. Example Query

  26. Query Plan

  27. X-scan Processing

  28. Operators in Tukwila

  29. Discussion
     • Tukwila has
       • Operators for streaming data into and out of XML
         • X-scan
         • Output, element, attribute
       • Standard relational operations
         • Select, project, join
         • Sort, aggregate, nest, group, etc.
     • Focuses on the efficient processing of XML queries or streaming data sources

  30. A Streaming Dataflow Plan Language

  31. Theseus
     • A plan language and execution system for Web-based information integration
       • Expressive enough for monitoring a variety of sources
       • Efficient enough for near-real-time monitoring
     [Figure: input data and a plan are fed to the Theseus Executor. Example plan: PLAN myplan { INPUT: x OUTPUT: y BODY { Op (x : y) } }]

  32. Expressivity
     • Basic relational-style operators
       • Select, Project, Join, Union, etc.
     • Operators for gathering Web data
       • Xwrapper
         • Queries Web source via Fetch agent (returns XML)
       • Xquery, Rel2Xml, and Xml2Rel
         • XML processing utilities
     • Operators for monitoring Web data
       • DbExport, DbQuery, DbAppend, DbUpdate
         • Facilitates the tracking of online data
       • Email, Phone, Fax
         • Facilitates asynchronous notification

  33. Expressivity
     • Operators for extensibility
       • Apply: single-row functions (e.g., UPPER)
       • Aggregate: multi-row functions (e.g., SUM)
     • Operators for conditional plan execution
       • Null: Tests and routes data accordingly
     • Subplans and recursion
       • Plans are named and have INPUT & OUTPUT
         • We can use them as operators (subplans) in other plans
       • Subplans make recursion possible
         • Makes it easy to follow an arbitrarily long list of result pages that are each separated by a NEXT page link
       • Subplans encourage modularity & reuse
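The recursive-subplan idea for following NEXT links can be sketched as a Python generator that calls itself. This is written in Python rather than in the Theseus plan language, and everything in it is hypothetical: the in-memory `PAGES` table stands in for a Web source, `fetch` for a wrapper, and the `None` check only loosely plays the role of a test that routes data depending on whether a NEXT link exists.

```python
from typing import Dict, Iterator, List, Optional, TypedDict


class Page(TypedDict):
    rows: List[str]
    next: Optional[str]


# Hypothetical paginated source, invented for this sketch.
PAGES: Dict[str, Page] = {
    "page1": {"rows": ["r1", "r2"], "next": "page2"},
    "page2": {"rows": ["r3"], "next": "page3"},
    "page3": {"rows": ["r4", "r5"], "next": None},
}


def fetch(url: str) -> Page:
    """Stand-in for a wrapper that fetches one result page."""
    return PAGES[url]


def get_all_rows(url: str) -> Iterator[str]:
    """A named 'plan' that uses itself as a subplan to follow NEXT links."""
    page = fetch(url)
    yield from page["rows"]  # stream this page's rows immediately
    if page["next"] is not None:  # no NEXT link ends the recursion
        yield from get_all_rows(page["next"])


print(list(get_all_rows("page1")))  # ['r1', 'r2', 'r3', 'r4', 'r5']
```

Rows from the first page are yielded before the second page is fetched, so the recursion does not give up streaming.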

  34. Operators
     operator ( Input1, Input2, … : Output1, Output2, … )
     WAIT: waitInput1, waitInput2, …
     ENABLE: enableInput1, enableInput2, …
     • Data formats
       • Operators pass relations
       • Relations are composed of tuples
       • Each attribute of a tuple can be primitive, relation, or XML object
