Deploying Information Agents on the Web



slide-1
SLIDE 1

Craig Knoblock University of Southern California 1

Deploying Information Agents on the Web

Craig A. Knoblock
University of Southern California and Fetch Technologies

slide-2
SLIDE 2

Acknowledgements

  • Information Agents Research Group
      • Steve Minton, Fetch Tech.
      • Jose Luis Ambite, USC
      • Greg Barish, Fetch Tech.
      • Kristina Lerman, USC
      • Martin Michalowski, USC
      • Ion Muslea, SRI
      • Maria Muslea, USC
      • Sheila Tejada, UNO
      • Snehal Thakkar, USC
      • Rattapoom Tuchinda, USC
  • Electric Elves
      • Hans Chalupsky, USC
      • Yolanda Gil, USC
      • Jean Oh, CMU
      • David V. Pynadath, USC
      • Thomas A. Russ, USC
      • Milind Tambe, USC
  • Funding
      • DARPA
      • AFOSR
      • NSF
      • Microsoft

slide-3
SLIDE 3

Introduction

  • The Web is a tremendous resource, but designed for browsing
  • Sites provide limited capabilities for personalization
  • Few sites are designed to be integrated with others
  • Goal: Develop technology to rapidly construct personal software agents
  • Build agents that can perform retrieval, integration, and monitoring tasks on any online source

slide-4
SLIDE 4

Outline

  • The Electric Elves: Information agents for monitoring travel
  • Wrapping online sources
  • Linking records across sources
  • Efficiently executing agent plans
  • Current and related work
  • Conclusions

slide-5
SLIDE 5

Outline

  • The Electric Elves: Information agents for monitoring travel
  • Wrapping online sources
  • Linking records across sources
  • Efficiently executing agent plans
  • Current and related work
  • Conclusions

slide-6
SLIDE 6

Electric Elves Project

[Chalupsky et al, 2001]

Elves project goal: Apply agent technology to support human organizations

  • Develop software agents that automate routine tasks
  • Enable software agents and humans to work together
  • Support coordination of tasks
  • Applications: Office Elves and Travel Elves

[Figure: WWW, Agent Proxies for People, Information Agents, Ontology-based Matchmakers, GRID]

slide-7
SLIDE 7

Agents for Monitoring Travel

[Ambite et al, 2002]

  • Office Elves created as an application of the Electric Elves
  • Given travel itinerary, generates set of agents for anticipating travel-related failures and opportunities:
      • Price changes
      • Schedule changes
      • Flight delays & cancellations
      • Earlier and close connections
      • Finding the closest restaurant given GPS coordinates

slide-8
SLIDE 8

Travel Assistant

slide-9
SLIDE 9

Monitoring Travel Plans

slide-10
SLIDE 10

Agents Deployed to Monitor Travel Itinerary

[Figure: Travel Itinerary; WWW, Agent Proxies for People, Information Agents, Ontology-based Matchmakers, GRID; sources: Flight Prices & Schedules, Weather, Flight Status, Restaurants]

slide-11
SLIDE 11

Monitoring Agents

  • Flight-Status Agent:
      • Flight delayed message:

Your United Airlines flight 190 has been delayed. It was originally scheduled to depart at 11:45 AM and is now scheduled to depart at 12:30 PM. The new arrival time is 7:59 PM.

      • Flight cancelled message:

Your Delta Air Lines flight 200 has been cancelled.

      • Fax to hotel message:

Attention: Registration Desk
I am sending this message on behalf of David Pynadath, who has a reservation at your hotel. David Pynadath is on United Airlines 190, which is now scheduled to arrive at IAD at 7:59 PM. Since the flight will be arriving late, I would like to request that you indicate this in the reservation so that the room is not given away.

slide-12
SLIDE 12

Monitoring Agents

  • Airfare Agent: Airfare dropped message

The airfare for your American Airlines itinerary (IAD - LAX) dropped to $281.

  • Earlier-Flight Agent: Earlier flights message

The status of your currently scheduled flight is:
# 190 LAX (11:45 AM) - IAD (7:29 PM) 45 minutes Late
If you would like to return earlier, the following United Airlines flights will arrive earlier than your scheduled flights:
# 946 LAX (8:31 AM) - IAD (3:35 PM) 11 minutes Late
# 388 LAX (9:25 AM) - DEN (12:25 PM) 10 minutes Late
# 1534 DEN (1:20 PM) - IAD (6:06 PM) On Time

slide-13
SLIDE 13

Outline

  • The Electric Elves: Information agents for monitoring travel
  • Wrapping online sources
  • Linking records across sources
  • Efficiently executing agent plans
  • Current and related work
  • Conclusions

slide-14
SLIDE 14

Wrappers for Live Access to Online Sources

  • HTML sources turned into agent-friendly sources

[Figure label: Wrapper]

<YAHOO_WEATHER>
  <ROW>
    <TEMP>25</TEMP>
    <OUTLOOK>Sunny</OUTLOOK>
    <HI>32</HI>
    <LO>19</LO>
    <APPARTEMP>25</APPARTEMP>
    <HUMIDITY>35%</HUMIDITY>
    <WIND>E/10 km/h</WIND>
    <VISIBILITY>20 km</VISIBILITY>
    <DEWPOINT>9</DEWPOINT>
    <BAROMETER>959 mb</BAROMETER>
  </ROW>
</YAHOO_WEATHER>

slide-15
SLIDE 15

Extraction Rules

  • Wrapper defined by a set of extraction rules
  • Extraction rule: sequence of landmarks
  • Define both beginning and end of required information on the page

Name: Joel’s <p> Phone: <i> (310) 777-1111 </i><p> Review: …

SkipTo(Phone) SkipTo(<i>) SkipTo(</i>)
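A landmark rule of this kind can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `skip_to` and `extract` are hypothetical, landmarks are matched as literal substrings, and the end of the item is taken to be the first occurrence of the end landmark after the start position. The slide does not spell out how the end rule is applied.

```python
from typing import List, Optional


def skip_to(page: str, landmarks: List[str], start: int = 0) -> Optional[int]:
    """Consume the landmarks left to right.

    Returns the index just past the last landmark, or None if one is missing.
    """
    pos = start
    for landmark in landmarks:
        found = page.find(landmark, pos)
        if found == -1:
            return None
        pos = found + len(landmark)
    return pos


def extract(page: str, start_rule: List[str], end_landmark: str) -> Optional[str]:
    """Return the text between the start rule and the end landmark (assumed semantics)."""
    begin = skip_to(page, start_rule)
    if begin is None:
        return None
    end = page.find(end_landmark, begin)
    if end == -1:
        return None
    return page[begin:end].strip()


# The page fragment from the slide (plain apostrophe used for simplicity).
page = "Name: Joel's <p> Phone: <i> (310) 777-1111 </i><p> Review: ..."
phone = extract(page, ["Phone", "<i>"], "</i>")
```

Here `SkipTo(Phone) SkipTo(<i>)` locates the start of the phone number and `SkipTo(</i>)` bounds its end.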

slide-16
SLIDE 16

Learning the Extraction Rules

[Muslea, Minton, & Knoblock, 01]

  • Hierarchical wrapper induction
      • Decomposes a hard problem into several easier ones
      • Extracts items independently of each other

[Figure: Labeled Pages, GUI, EC Tree, Inductive Learning System, Extraction Rules]

slide-17
SLIDE 17

Example of Rule Induction

Training Examples:

Name: Del Taco <p> Phone (toll free) : <b> ( 800 ) 123-4567 </b><p>Cuisine ...
Name: Burger King <p> Phone : ( 310 ) 987-9876 <p> Cuisine: …

Search Space:

SkipTo( ( ) …
SkipTo(Phone) SkipTo(:) SkipTo( ( ) ...
SkipTo( <b> ( ) ...
SkipTo(Phone) SkipTo( ( ) ...
SkipTo(:) SkipTo( ( )

slide-18
SLIDE 18

Active Learning

  • Problem: May require large number of examples to achieve high accuracy
  • Exploit active learning
      • System selects most informative examples to label
      • Want to achieve 100% accuracy with as few examples as possible

slide-19
SLIDE 19

Which Example to Label Next

Training Examples:

Name: Joel’s <p> Phone: (310) 777-1111 <p>Review: The chef…

SkipTo( Phone: )

Unlabeled Examples:

Name: Kim’s <p> Phone: (213) 757-1111 <p>Review: Korean …
Name: Chez Jean <p> Phone: (310) 666-1111 <p> Review: …
Name: Burger King <p> Phone:(818) 789-1211 <p> Review: ...
Name: Café del Rey <p> Phone: (310) 111-1111 <p> Review: ...
Name: KFC <p> Phone:<b> (800) 111-7171 </b> <p> Review:...

slide-20
SLIDE 20

Multi-view Learning

[Muslea, Minton, Knoblock ’00]

Two ways to find start of the phone number:

Name: KFC <p> Phone: (310) 111-1111 <p> Review: Fried chicken …

SkipTo( Phone: )
BackTo( ( Nmb ) )
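The two views can be illustrated with a small Python sketch. It rests on assumptions: `forward_start` and `backward_start` are hypothetical names, the wildcard `Nmb` is encoded as a run of digits, and the forward view steps over whitespace after the landmark so that both views report the same character position when they agree.

```python
import re
from typing import Optional

# "( Nmb )" is read here as: open paren, a number, close paren (an assumed encoding).
PAREN_NUMBER = re.compile(r"\(\s*\d+\s*\)")


def forward_start(page: str, landmark: str = "Phone:") -> Optional[int]:
    """Forward view: SkipTo(landmark), then step over whitespace."""
    i = page.find(landmark)
    if i == -1:
        return None
    i += len(landmark)
    while i < len(page) and page[i].isspace():
        i += 1
    return i


def backward_start(page: str) -> Optional[int]:
    """Backward view: BackTo((Nmb)), i.e. the last "(number)" met scanning from the end."""
    hits = list(PAREN_NUMBER.finditer(page))
    return hits[-1].start() if hits else None


page = "Name: KFC <p> Phone: (310) 111-1111 <p> Review: Fried chicken ..."
start = forward_start(page)
```

On this page both views land on the opening parenthesis of the phone number, so they agree.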

slide-21
SLIDE 21

Multi-view Learning: Co-Testing

[Figure: RULE 1, RULE 2, labeled data (+ and - examples), unlabeled data]

slide-22
SLIDE 22

Co-Testing for Wrapper Induction

Name: Joel’s <p> Phone: (310) 777-1111 <p>Review: ...
Name: Kim’s <p> Phone: (213) 757-1111 <p>Review: ...

SkipTo( Phone: )
BackTo( (Nmb) )

Name: Chez Jean <p> Phone: (310) 666-1111 <p> Review: ...
Name: KFC <p> Phone:<b> (800) 111-7171 </b> <p> Review:...
Name: Burger King <p> Phone: (818) 789-1211 <p> Review: ...
Name: Café del Rey <p> Phone: (310) 111-1111 <p> Review: ...

slide-23
SLIDE 23

Not All Queries are Equally Informative

SkipTo(Phone:)
BackTo( (Nmb) )

… Phone: (800) 171-1771 <p> Fax: (111) 111-1111 <p> Review: …
… Phone:<i> (800) 555-5555 </i><p> Review: A century ago (1891) …
… Phone: (800) 171-1771 <p> Fax: (111) 111-1111 <p> Review: …

slide-24
SLIDE 24

Weak Views

[Muslea, Minton, Knoblock ’03]

  • Learn “content description” for item to be extracted
      • Too general for extraction
          • ( Nmb ) Nmb – Nmb can’t tell a phone number from a fax number
      • Useful at discriminating among query candidates
  • Learned content descriptions
      • Starts with: ( Nmb )
      • Ends with: Nmb – Nmb
      • Contains: Nmb Punct
      • Length: [6,6]

slide-25
SLIDE 25

Naïve & Aggressive Co-Testing

  • Naïve Co-Testing:
      • Query: randomly chosen contention point
      • Output: rule with fewest mistakes on queries
  • Aggressive Co-Testing:
      • Query: contention point that most violates weak view
      • Output: committee vote (2 rules + weak view)
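The two query strategies can be sketched as follows. This is a rough illustration, not the published algorithm: all function names are hypothetical, a contention point is taken to be a page where the forward and backward rules disagree on the start position, and the weak view is reduced to a single regular expression ("starts with ( Nmb ), ends with Nmb - Nmb"). The amount of violation is simply the number of the two candidates that fail that pattern.

```python
import random
import re
from typing import List, Optional

PAREN_NUMBER = re.compile(r"\(\s*\d+\s*\)")
# Assumed encoding of the weak view: "( Nmb ) Nmb - Nmb".
CONTENT = re.compile(r"\(\s*\d+\s*\)\s*\d+\s*-\s*\d+")


def forward_start(page: str) -> Optional[int]:
    """SkipTo(Phone:), then step over whitespace."""
    i = page.find("Phone:")
    if i == -1:
        return None
    i += len("Phone:")
    while i < len(page) and page[i].isspace():
        i += 1
    return i


def backward_start(page: str) -> Optional[int]:
    """BackTo((Nmb)): start of the last "(number)" on the page."""
    hits = list(PAREN_NUMBER.finditer(page))
    return hits[-1].start() if hits else None


def contention_points(pages: List[str]) -> List[str]:
    """Pages on which the two rules disagree."""
    return [p for p in pages if forward_start(p) != backward_start(p)]


def violations(page: str) -> int:
    """How many of the two candidate starts do not fit the weak view."""
    starts = (forward_start(page), backward_start(page))
    return sum(1 for s in starts if s is None or CONTENT.match(page, s) is None)


def naive_query(pages: List[str], rng: random.Random) -> str:
    """Naive Co-Testing: any contention point, chosen at random."""
    return rng.choice(contention_points(pages))


def aggressive_query(pages: List[str]) -> str:
    """Aggressive Co-Testing: the contention point that most violates the weak view."""
    return max(contention_points(pages), key=violations)


fax_page = "Phone: (800) 171-1771 <p> Fax: (111) 111-1111 <p> Review: ..."
year_page = "Phone:<i> (800) 555-5555 </i><p> Review: A century ago (1891) ..."
plain_page = "Phone: (310) 666-1111 <p> Review: ..."
pages = [fax_page, year_page, plain_page]
query = aggressive_query(pages)
```

On these pages the rules disagree on `fax_page` and `year_page`. On `fax_page` both candidates look like phone numbers, so the weak view cannot separate them; on `year_page` neither candidate fits, which makes it the more informative page to label.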

slide-26
SLIDE 26

Results for Random Sampling

  • 33 most difficult of the 140 tasks from [Kushmerick ’97]

[Figure: number of extraction tasks versus examples to 100% accuracy (3 to 20+); series: Random sampling]

slide-27
SLIDE 27

Results for Active Learning

[Figure: number of extraction tasks versus examples to 100% accuracy (3 to 20+); series: Naïve Co-Testing, Random sampling]

slide-28
SLIDE 28

Results for Active Learning with Weak Views

[Figure: number of extraction tasks versus examples to 100% accuracy (3 to 20+); series: Aggressive Co-Testing, Naïve Co-Testing, Random sampling]

slide-29
SLIDE 29

Outline

  • The Electric Elves: Information agents for monitoring travel
  • Wrapping online sources
  • Linking records across sources
  • Efficiently executing agent plans
  • Current and related work
  • Conclusions

slide-30
SLIDE 30

Record Linkage (Object Consolidation)

Zagat’s Restaurants

Name              Street                     Phone
Art’s Deli        12224 Ventura Boulevard    818-756-4124
Teresa's          80 Montague St.            718-520-2910
Steakhouse The    128 Fremont St.            702-382-1600
Les Celebrites    155 W. 58th St.            212-484-5113

Dept. of Health

Name                    Street                                   Phone
Art’s Delicatessen      12224 Ventura Blvd.                      818/755-4100
Teresa's                103 1st Ave. between 6th and 7th Sts.    212/228-0604
Binion's Coffee Shop    128 Fremont St.                          702/382-1600
Les Celebrites          160 Central Park S                       212/484-5113

slide-31
SLIDE 31

Active Learning to Determine Matched Records

[Tejada, Knoblock, Minton ’01,’02]

  • Learn importance of attributes for matching records

                  Name                  Street                     Phone
Zagat’s           Art’s Deli            12224 Ventura Boulevard    818-756-4124
Dept of Health    Art’s Delicatessen    12224 Ventura Blvd.        818/755-4100

Mapping rules:
Name > .9 & Street > .87 => mapped
Name > .95 & Phone > .96 => mapped
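To make the mapping rules concrete, here is a minimal Python sketch that applies them to per-attribute similarity scores. The thresholds are the ones on the slide; everything else is an assumption for illustration. In particular, the talk does not say which string similarity is used, so `similarity` falls back on `difflib.SequenceMatcher` purely as a stand-in, and the names `attribute_scores` and `is_mapped` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List

# Each rule maps an attribute to the similarity it must exceed (values from the slide).
MAPPING_RULES: List[Dict[str, float]] = [
    {"Name": 0.90, "Street": 0.87},
    {"Name": 0.95, "Phone": 0.96},
]


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; not the measure used in the original work."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def attribute_scores(rec_a: Dict[str, str], rec_b: Dict[str, str]) -> Dict[str, float]:
    """Similarity of two records, attribute by attribute."""
    return {attr: similarity(rec_a[attr], rec_b[attr]) for attr in rec_a}


def is_mapped(scores: Dict[str, float], rules: List[Dict[str, float]] = MAPPING_RULES) -> bool:
    """Two records are mapped if any single rule has all of its thresholds exceeded."""
    return any(all(scores[attr] > t for attr, t in rule.items()) for rule in rules)


zagat = {"Name": "Art's Deli", "Street": "12224 Ventura Boulevard", "Phone": "818-756-4124"}
health = {"Name": "Art's Delicatessen", "Street": "12224 Ventura Blvd.", "Phone": "818/755-4100"}
scores = attribute_scores(zagat, health)
```

Whether the two Art's records come out as mapped depends entirely on the similarity measure plugged in, which is why learning the thresholds together with the attribute weights matters.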

slide-32
SLIDE 32

Mapping Rule Learner

[Figure: Set of Mapped Objects; Choose initial examples; Generate committee of learners; three learners, each with Learn Rules and Classify Examples; Votes; Choose Example; USER; Label]

slide-33
SLIDE 33

Committee Disagreement

  • Chooses an example based on the disagreement of the query committee
  • CPK, California Pizza Kitchen is the most informative example

                                  Committee
Examples                          M1     M2     M3
Art’s Deli, Art’s Delicatessen    Yes    Yes    Yes
CPK, California Pizza Kitchen     Yes    No     Yes
Ca’Brea, La Brea Bakery           No     No     No
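The selection step can be sketched in Python using the votes from the slide. The disagreement score used here (size of the minority vote) is an assumption chosen for simplicity; the slide does not state which measure the system uses, and the function names are hypothetical.

```python
from typing import Dict, List

# Committee votes (M1, M2, M3) for each candidate pair, as shown on the slide.
votes: Dict[str, List[bool]] = {
    "Art's Deli, Art's Delicatessen": [True, True, True],
    "CPK, California Pizza Kitchen": [True, False, True],
    "Ca'Brea, La Brea Bakery": [False, False, False],
}


def disagreement(member_votes: List[bool]) -> int:
    """Size of the minority vote: 0 means the committee is unanimous."""
    yes = sum(member_votes)
    return min(yes, len(member_votes) - yes)


def most_informative(candidates: Dict[str, List[bool]]) -> str:
    """Pick the candidate pair the committee disagrees on most."""
    return max(candidates, key=lambda name: disagreement(candidates[name]))


choice = most_informative(votes)
```

The committee is unanimous on the first and third pairs, so labeling them would teach the learners little; the split vote on the CPK pair is what makes it worth asking the user about.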

slide-34
SLIDE 34

Exploiting Secondary Sources for Record Linkage

[Michalowski, Thakkar, Knoblock ’03]

  • Primary data source may be insufficient to determine mappings
  • Secondary sources can help reduce the uncertainty
  • Examples of secondary sources
      • Geocoder
          • Maps street addresses into lat/long coordinates
      • Business directories
          • Provide company officers and locations
      • Area code updates
          • Provide changes in area codes over time

slide-35
SLIDE 35

Missing Matches

[Figure: Record Linkage; Matched Records]

slide-36
SLIDE 36

Exploiting a Geocoder

[Figure: Record Linkage; Secondary Source; Matched Records]

26 Beach Cafe 26 Washington St. Venice, CA
26 Beach Cafe 26 Washington Boulevard Marina Del Rey, Calif

slide-37
SLIDE 37

Preliminary Results: Secondary Sources

Secondary source reduces the depth of the decision tree that needs to be learned

                                     Without Secondary Source                 With Secondary Source
# Labeled    Total Correct    Precision    Recall    Average      Precision    Recall    Average
Examples     Matches                                 DT Depth                            DT Depth
25           109              51%          33%       5            66%          51%       1
35           109              73%          57%       8            81%          68%       3
50           109              83%          81%       10           85%          85%       3

slide-38
SLIDE 38

Outline

  • The Electric Elves: Information agents for monitoring travel
  • Wrapping online sources
  • Linking records across sources
  • Efficiently executing agent plans
  • Current and related work
  • Conclusions

slide-39
SLIDE 39

Efficiently Executing Agent Plans

  • Problem
      • Information gathering may involve accessing and integrating data from many sources
      • Total time to execute these plans may be large
  • Why?
      • Slow remote sources
      • Unpredictable network latencies
      • Binding patterns
          • Source cannot be queried until a previous query has been answered
  • Result: execution is often I/O-bound

slide-40
SLIDE 40

Theseus Agent Execution System

[Barish & Knoblock, ’02]

  • Plan language and execution system for Web-based information integration
      • Expressive enough for monitoring a variety of sources
      • Efficient enough for real-time monitoring

[Figure: Theseus Executor; inputs: Plan, Input Data]

PLAN myplan {
  INPUT: x
  OUTPUT: y
  BODY {
    Op (x : y)
  }
}

slide-41
SLIDE 41

Streaming Dataflow

  • Plans consist of a network of operators
      • Examples: Wrapper, Select, etc.
  • Operators produce and consume data
  • Operators “fire” upon any input data
  • Data passed as tuples of a relation

[Figure: Plan with operators Wrapper, Select, Join, Wrapper]

Input relation:

City            State    Max Price
Santa Monica    CA       200000

Output relation:

Address
100 Main St., Santa Monica, 90292
520 4th St. Santa Monica, 90292
2 Ocean Blvd, Venice, 90292

slide-42
SLIDE 42

Parallelism in Streaming Dataflow

  • Dataflow
      • Operations scheduled by data availability
      • Independent operations execute in parallel
      • Maximizes horizontal parallelism
      • Example: computing (a*b) + (c*d)
  • Streaming
      • Operations emit data as soon as possible
      • Independent data processed in parallel
      • Maximizes vertical parallelism

[Figure: MUL, MUL, and ADD operators over inputs a, b, c, d; Producer and Consumer]
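The (a*b) + (c*d) example can be mimicked with Python's standard thread pool. This is only an analogy for horizontal parallelism, not how Theseus schedules operators: the two multiplications do not depend on each other, so both are submitted at once, and the addition "fires" once both of its inputs are available. The function name is hypothetical.

```python
import operator
from concurrent.futures import ThreadPoolExecutor


def sum_of_products(a: float, b: float, c: float, d: float) -> float:
    """Compute (a*b) + (c*d), running the two independent MULs concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        left = pool.submit(operator.mul, a, b)    # MUL over a, b
        right = pool.submit(operator.mul, c, d)   # MUL over c, d
        # ADD can only run when both MUL results exist; result() waits for them.
        return operator.add(left.result(), right.result())


total = sum_of_products(2, 3, 4, 5)
```

For arithmetic this small the threads buy nothing; the pattern pays off when each operator is a slow, I/O-bound call such as a remote Web source.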

slide-43
SLIDE 43

CarInfo Agent

  • Agent for recommending used cars:
      • Combine information from
          • Prices of used cars
          • Safety ratings
          • Reviews
  • Example:
      • 2002 Midsize coupe/hatchback
      • $4K-$12K
      • No Oldsmobiles

slide-44
SLIDE 44

The CarInfo agent

  • 1. Locate cars that meet criteria
      • Edmunds.com

slide-45
SLIDE 45

The CarInfo agent

  • 1. Locate cars that meet criteria
      • Edmunds.com
  • 2. Filter out Oldsmobiles

slide-46
SLIDE 46

The CarInfo agent

  • 1. Locate cars that meet criteria
      • Edmunds.com
  • 2. Filter out Oldsmobiles
  • 3. Gather safety reviews for each
      • NHTSA.gov

slide-47
SLIDE 47

The CarInfo agent

  • 1. Locate cars that meet criteria
      • Edmunds.com
  • 2. Filter out Oldsmobiles
  • 3. Gather safety reviews for each
      • NHTSA.gov
  • 4. Gather detailed reviews of each
      • ConsumerGuide.com

slide-48
SLIDE 48

ConsumerGuide Navigation

  • Requires navigating through multiple pages

slide-49
SLIDE 49

Dataflow-style CarInfo agent plan

[Figure: dataflow plan; operators and the data on their edges are listed here]

search criteria: (Midsize coupe/hatchback, $4000 to $12000, 2002)

WRAPPER Edmunds Search: ((Oldsmobile Alero), (Dodge Stratus), (Pontiac Grand Am), (Mercury Cougar))
SELECT maker != "Oldsmobile": ((Dodge Stratus), (Pontiac Grand Am), (Mercury Cougar))
WRAPPER NHTSA Search: (safety reports)
WRAPPER ConsumerGuide Search: ((http://cg.com/summ/20812.htm), other summary review URLs)
WRAPPER ConsumerGuide Summary: ((http://cg.com/full/20812.htm), other full review URLs)
WRAPPER ConsumerGuide Full Review: (car reviews)
JOIN
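A rough Python analogue of this plan uses generators, so that each tuple moves downstream as soon as it is produced. It is a simplified, sequential sketch under assumptions: the wrapper functions are hypothetical stand-ins that return canned data instead of querying Edmunds, NHTSA, or ConsumerGuide, the report and review strings are placeholders, and the real plan runs its wrappers concurrently rather than one call at a time.

```python
from typing import Callable, Dict, Iterable, Iterator

Car = Dict[str, str]


def edmunds_search(criteria: str) -> Iterator[Car]:
    """Stand-in wrapper: yields the cars listed on the slide, one tuple at a time."""
    for maker, model in [("Oldsmobile", "Alero"), ("Dodge", "Stratus"),
                         ("Pontiac", "Grand Am"), ("Mercury", "Cougar")]:
        yield {"maker": maker, "model": model}


def select(stream: Iterable[Car], predicate: Callable[[Car], bool]) -> Iterator[Car]:
    """SELECT: pass along only the tuples that satisfy the predicate."""
    for car in stream:
        if predicate(car):
            yield car


def nhtsa_search(car: Car) -> Dict[str, str]:
    """Stand-in wrapper: placeholder safety report (no real data)."""
    return {"safety": f"safety report for {car['maker']} {car['model']}"}


def consumerguide_review(car: Car) -> Dict[str, str]:
    """Stand-in for the Search, Summary, Full Review chain: placeholder review."""
    return {"review": f"review of {car['maker']} {car['model']}"}


def car_info(criteria: str) -> Iterator[Car]:
    """Join each remaining car with its safety report and its review."""
    cars = select(edmunds_search(criteria), lambda c: c["maker"] != "Oldsmobile")
    for car in cars:
        yield {**car, **nhtsa_search(car), **consumerguide_review(car)}


results = list(car_info("Midsize coupe/hatchback, $4000 to $12000, 2002"))
```

Because `car_info` is a generator, a consumer can start reading the first joined tuple before the remaining cars have been looked up, which is the streaming behaviour the plan relies on.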

slide-50
SLIDE 50

Speculative Execution

[Barish & Knoblock ’02, ’03]

  • Basic idea
      • Exploit idle resources to execute future instructions in advance of when they are normally issued
  • Challenges
      • How to augment plans for speculation
      • How to ensure correctness and fairness
      • How to decide what to speculate on

slide-51
SLIDE 51

How to speculate?

  • General problem
      • Means for issuing and confirming predictions
  • Two new operators
      • Speculate: Makes predictions based on "hints"
      • Confirm: Prevents errant results from exiting plan

[Figure: Speculate operator with hints, answers, predictions/additions, confirmations; Confirm operator with confirmations, probable results, actual results]
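The interplay of the two operators can be sketched in Python. This is an illustrative toy under stated assumptions, not the Theseus implementation: the class and method names are hypothetical, predictions are made by simply remembering the answers last seen for the same hint, and Confirm is modelled as a filter that releases only results derived from confirmed predictions.

```python
from typing import Dict, Hashable, List, Tuple


class Speculate:
    """Predicts answers for a hint from the answers previously seen for it."""

    def __init__(self) -> None:
        self._seen: Dict[Hashable, List[str]] = {}

    def predict(self, hint: Hashable) -> List[str]:
        """Issue predictions immediately, before the real answers arrive."""
        return list(self._seen.get(hint, []))

    def resolve(self, hint: Hashable, predictions: List[str],
                answers: List[str]) -> Tuple[List[str], List[str]]:
        """When the real answers arrive, return (confirmations, additions)."""
        self._seen[hint] = list(answers)
        confirmed = [p for p in predictions if p in answers]
        additions = [a for a in answers if a not in predictions]
        return confirmed, additions


def confirm(probable: List[Tuple[str, str]], confirmed: List[str]) -> List[str]:
    """Let a probable result leave the plan only if its prediction was confirmed."""
    return [result for key, result in probable if key in confirmed]


cars = ["Oldsmobile Alero", "Dodge Stratus", "Pontiac Grand Am", "Mercury Cougar"]
hint = ("2002", "Midsize coupe", "$4000-$12000")

spec = Speculate()
first_predictions = spec.predict(hint)            # nothing known yet
spec.resolve(hint, first_predictions, cars)       # learn from the real answers

predictions = spec.predict(hint)                  # same criteria, same cars
probable = [(car, car + " (safety) (review)") for car in predictions]
# Hypothetical second run in which the first car is no longer returned.
confirmed, additions = spec.resolve(hint, predictions, cars[1:])
final = confirm(probable, confirmed)
```

Downstream work on `predictions` can start while the slow source is still answering; anything built on a wrong prediction is held back by `confirm`, and any `additions` are processed normally.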

slide-52
SLIDE 52

How to speculate?

  • Example: CarInfo
      • Predict cars based on search criteria
      • Makes practical sense:
          • Same criteria yield same cars

[Figure: BEFORE: plan operators J, S, W, W, W, W, W]

slide-53
SLIDE 53

How to speculate?

  • Example: CarInfo
      • Predict cars based on search criteria
      • Makes practical sense:
          • Same criteria yield same cars

[Figure: AFTER: plan operators J, S, W, W, W, W, W with Speculate (hints, predictions/additions, confirmations, answers) and Confirm]

slide-54
SLIDE 54

Detailed example

[Figure: plan operators J, S, W, W, W, W, W with Speculate and Confirm]

2002 Midsize coupe $4000-$12000

Time = 0.0 sec

slide-55
SLIDE 55

Issuing predictions

[Figure: plan operators J, S, W, W, W, W, W with Speculate and Confirm]

Oldsmobile Alero T1
Dodge Stratus T2
Pontiac Grand Am T3
Mercury Cougar T4

Time = 0.1 sec

slide-56
SLIDE 56

Speculative parallelism

[Figure: plan operators J, S, W, W, W, W, W with Speculate and Confirm]

Dodge Stratus T2
Pontiac Grand Am T3
Mercury Cougar T4

Time = 0.2 sec

slide-57
SLIDE 57

Answers to hints

J S W W Speculate W Confirm W W

Oldsmobile Alero, Dodge Stratus, Pontiac Grand Am, Mercury Cougar

Time = 1.0 sec

slide-58
SLIDE 58

Continued processing

J S W W Speculate W Confirm W W

T1, T2, T3, T4

Time = 1.1 sec

Additions (corrections), if any

slide-59
SLIDE 59

Generation of final results

J S W W Speculate W Confirm W W

Dodge Stratus (safety) (review) T2; Pontiac Grand Am (safety) (review) T3; Mercury Cougar (safety) (review) T4

Time = 3.2 sec

slide-60
SLIDE 60

Confirmation of results

J S W W Speculate W Confirm W W

Dodge Stratus (safety) (review); Pontiac Grand Am (safety) (review); Mercury Cougar (safety) (review)

Time = 3.3 sec
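
The walkthrough on the preceding slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Theseus operators: the function names `speculate` and `confirm`, the `learned` lookup table, the placeholder detail strings, and the "real" answer list are all hypothetical. The point it shows is that downstream work starts from predictions, while only verified results are allowed to leave the plan.

```python
def speculate(hint, learned):
    """Issue predictions for a hint; an unknown hint yields no predictions."""
    return list(learned.get(hint, []))


def confirm(predictions, answers, probable_results):
    """Release only results whose predictions were verified by the real answers.

    Returns (actual_results, additions). Additions are answers that were never
    predicted and therefore still need normal (non-speculative) processing.
    """
    answers = list(answers)
    actual_results = {
        p: probable_results[p]
        for p in predictions
        if p in answers and p in probable_results
    }
    additions = [a for a in answers if a not in predictions]
    return actual_results, additions


# Hint and predicted cars taken from the slides.
hint = ("2002", "Midsize coupe", "$4000-$12000")
learned = {hint: ["Oldsmobile Alero", "Dodge Stratus",
                  "Pontiac Grand Am", "Mercury Cougar"]}

predictions = speculate(hint, learned)

# Downstream operators run on the predictions before the slow source answers.
# The detail strings are placeholders for the safety and review lookups.
probable = {car: f"safety + review for {car}" for car in predictions}

# Hypothetical real answer: one prediction was wrong, one car was not predicted.
answers = ["Dodge Stratus", "Pontiac Grand Am", "Mercury Cougar", "Honda Accord"]

actual, additions = confirm(predictions, answers, probable)
print(sorted(actual))
print(additions)
```

In this sketch a wrong prediction costs only wasted speculative work: its probable result is simply never released, which is the safety property Confirm provides.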

slide-61
SLIDE 61

Safety and fairness

  • Safety

  • Confirm blocks predictions (and their results) from exiting plan before verification

  • Fairness

  • CPU

  • Speculative operations use "speculative threads"

  • Lower priority threads

  • Memory and bandwidth

  • Speculative operations allocate "speculative resources"

  • Drawn from "speculative pool" of memory / objects

slide-62
SLIDE 62

Cascading Speculation

  • Use predicted cars to speculate about the ConsumerGuide summary and full URLs

  • Optimistic performance

  • Execution time: max{1.2, 1.4, 1.5, 1.6} = 1.6 sec

  • Speedup over streaming dataflow: (4.2/1.6) = 2.63

W J S W W SPEC CONF SPEC W W SPEC
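
The optimistic timing on this slide can be checked with a few lines of arithmetic. The sketch assumes that 1.2, 1.4, 1.5 and 1.6 are the completion times (in seconds) of the branches that run in parallel under cascading speculation, and that 4.2 sec is the streaming-dataflow time; the variable names are illustrative only.

```python
# Completion times (sec) of the parallel branches, as listed on the slide.
branch_times = [1.2, 1.4, 1.5, 1.6]

# Streaming-dataflow execution time (sec) used in the slide's speedup ratio.
streaming_time = 4.2

# When every prediction is correct, the plan finishes with its slowest branch.
speculative_time = max(branch_times)

# Speedup is the ratio of the two execution times.
speedup = streaming_time / speculative_time

print(f"speculative time: {speculative_time} sec")
print(f"speedup: {speedup:.3f}")
```

The ratio works out to 2.625, which the slide reports as 2.63.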

slide-63
SLIDE 63

Automatic plan transformation

  • Agent plans are automatically modified for speculative execution

  • Successive runs of the plan benefit

  • Even with different input data

  • Leverage Amdahl's Law:

  • Consider optimizing only the most expensive path (MEP)

  • Algorithm continually refines MEP

  • Until overhead of further optimization outweighs benefits
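
One way to picture the MEP idea is to treat the plan as a directed acyclic graph of operators with known costs and to find its costliest path. The sketch below does that with a memoized depth-first search. It is not the transformation algorithm from the talk: the operator names, the per-operator costs, and the function `most_expensive_path` are hypothetical, and costs are assumed to be positive.

```python
from functools import lru_cache


def most_expensive_path(costs, successors):
    """Return (total_cost, path) for the costliest path through a plan DAG.

    costs maps each operator to its own execution time; successors maps an
    operator to the operators that consume its output.
    """

    @lru_cache(maxsize=None)
    def best_from(node):
        best_cost, best_path = 0.0, ()
        for nxt in successors.get(node, ()):
            cost, path = best_from(nxt)
            if cost > best_cost:
                best_cost, best_path = cost, path
        return costs[node] + best_cost, (node,) + best_path

    # The MEP may start at any operator, so take the best over all of them.
    return max(best_from(node) for node in costs)


# Hypothetical plan: one search feeds two lookups that are then joined.
costs = {"search": 1.0, "safety": 2.0, "review": 1.2, "join": 0.1}
successors = {
    "search": ["safety", "review"],
    "safety": ["join"],
    "review": ["join"],
}

total, path = most_expensive_path(costs, successors)
print(total, path)
```

Speculating on the operator that dominates this path shortens the whole plan, after which the path can be recomputed, which is the "continually refines MEP" loop described on the slide.
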
slide-64
SLIDE 64

Learning for Speculative Execution

  • Caching

  • Associate a hint with a predicted value

  • Hint: 2002 Midsize coupe 4K-12K

  • Predicted value: Olds Alero, Dodge Stratus, Pontiac Grand Am, Mercury Cougar

  • Classification

  • Use features of a hint to predict value

  • EXAMPLE: Predicting car list from Edmunds

Decision list

type = SUV : (Nissan Pathfinder, Ford Explorer)
type = Midsize :
:...min <= 10000 : (Olds Alero, Dodge Stratus)
    min > 10000 : (Honda Accord, Toyota Camry)

Cache

Year  Type     Min    Max    Car list
2002  Midsize  8000   15000  (Oldsmobile Alero, Dodge Stratus)
2002  Midsize  7500   14500  (Oldsmobile Alero, Dodge Stratus)
2002  SUV      14000  20000  (Nissan Pathfinder, Ford Explorer)
2001  Midsize  11000  18000  (Honda Accord, Toyota Camry)
2002  SUV      18000  22000  (Nissan Pathfinder, Ford Explorer)
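
The two learning strategies on this slide can be sketched side by side. The cache rows and the decision list are copied from the slide; everything else is an assumption made for illustration, including the function names and the choice to try the cache first and fall back to the decision list for unseen hints.

```python
# Cache rows from the slide: (year, type, min price, max price) -> car list.
CACHE = {
    (2002, "Midsize", 8000, 15000): ("Oldsmobile Alero", "Dodge Stratus"),
    (2002, "Midsize", 7500, 14500): ("Oldsmobile Alero", "Dodge Stratus"),
    (2002, "SUV", 14000, 20000): ("Nissan Pathfinder", "Ford Explorer"),
    (2001, "Midsize", 11000, 18000): ("Honda Accord", "Toyota Camry"),
    (2002, "SUV", 18000, 22000): ("Nissan Pathfinder", "Ford Explorer"),
}


def predict_from_cache(hint):
    """Caching: an exact hint seen before maps straight to its stored value."""
    return CACHE.get(hint)


def predict_from_decision_list(hint):
    """Classification: use features of the hint (type, min price) to predict."""
    _year, car_type, min_price, _max_price = hint
    if car_type == "SUV":
        return ("Nissan Pathfinder", "Ford Explorer")
    if car_type == "Midsize":
        if min_price <= 10000:
            return ("Oldsmobile Alero", "Dodge Stratus")
        return ("Honda Accord", "Toyota Camry")
    return None  # no rule applies, so issue no prediction


def predict(hint):
    """Assumed policy: exact cache hit first, then the decision list."""
    cached = predict_from_cache(hint)
    if cached is not None:
        return cached
    return predict_from_decision_list(hint)


# A hint that is not in the cache still gets a prediction from the classifier.
print(predict((2003, "Midsize", 9000, 12000)))
```

The difference shows up on unseen input: the cache only helps when the exact hint repeats, while the decision list generalizes to new years and price ranges.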

slide-65
SLIDE 65

Learning for Speculative Execution

  • Transduction

  • Transducers are FSMs that translate hints into predictions

Summary URL (hint): http://cg.com/summary/20812.htm
Full review URL (prediction): http://cg.com/full/20812.htm

Transducer diagram (states 1, 2, 3); arc labels as extracted: "http://cg.com/summary/" : ε : COPY, "." :, ε : COPY

To create full review URL:

  • 1. Insert "http://cg.com/full/"
  • 2. Extract & insert the dynamic part of the summary URL (e.g., 20812)
  • 3. Insert ".htm"
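
The three steps above can be written as a small function. This sketch hard-codes the static prefix and suffix from the slide rather than learning a transducer from examples, and the function name `summary_to_full` is hypothetical; it only shows what the learned transducer computes for this one source.

```python
SUMMARY_PREFIX = "http://cg.com/summary/"
FULL_PREFIX = "http://cg.com/full/"
SUFFIX = ".htm"


def summary_to_full(summary_url):
    """Translate a summary URL (the hint) into a predicted full review URL.

    Returns None when the hint does not have the expected shape, so that no
    prediction is issued.
    """
    if not (summary_url.startswith(SUMMARY_PREFIX)
            and summary_url.endswith(SUFFIX)):
        return None
    # Step 2: extract the dynamic part between the static prefix and suffix.
    dynamic = summary_url[len(SUMMARY_PREFIX):-len(SUFFIX)]
    if not dynamic:
        return None
    # Steps 1 and 3: insert the static strings around the copied part.
    return FULL_PREFIX + dynamic + SUFFIX


print(summary_to_full("http://cg.com/summary/20812.htm"))
```

Because the prediction needs only the summary URL, the full review page can be requested as soon as that URL is known (or itself predicted).
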
slide-66
SLIDE 66

Speculation Results: Last Tuple

Bar chart: time to last tuple (ms) for CarInfo, RepInfo, TheaterLoc, FlightStatus, and StockInfo; series: no speculation, 50% correct, 100% correct

slide-67
SLIDE 67

Outline of talk

  • The Electric Elves: Information agents for monitoring travel

  • Wrapping online sources

  • Linking records across sources

  • Efficiently executing agent plans

  • Current and related work

  • Conclusions

slide-68
SLIDE 68

Planning to Compose Web Services

[Thakkar, Knoblock, & Ambite, ’03]

  • Goal: Automatically compose new services from existing web services

  • We developed services that can dynamically compose information producing services

  • Builds on data integration techniques to construct plans

  • Turns the plans into Theseus plans for efficient execution

  • We are extending this work to more complex services that can change the world (side effects)

slide-69
SLIDE 69

Learning to Make Predictions: To Buy or Not To Buy

  • Agents can go beyond gathering and monitoring online sources

  • They can help make decisions by exploiting the wealth of online information

Chart: Price (axis 250 to 2250) vs. Date (12/8/2002 to 1/7/2003) for American Airlines flights 192 & 223, LAX-BOS, departing on Jan. 2 & 9
slide-70
SLIDE 70

Learning to Make Predictions: To Buy or Not To Buy

  • Agents can go beyond gathering and monitoring online sources

  • They can help make decisions by exploiting the wealth of online information

  • Developed a learning system, Hamlet, to predict whether it is better to wait or buy [Etzioni, Knoblock, Tuchinda, Yates, KDD’03]

  • Collected data on airline prices over several months

  • Learned a model of the pricing

  • In our simulation on collected data, Hamlet saved $198,074 out of a possible $320,572 (61.8% of optimal)
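
The percentage quoted for Hamlet follows directly from the two dollar figures on the slide. A quick check, with illustrative variable names:

```python
# Dollar figures from the slide.
saved = 198_074     # savings achieved by Hamlet in the simulation
possible = 320_572  # maximum possible savings (the optimal policy)

fraction_of_optimal = saved / possible
print(f"{fraction_of_optimal:.1%}")
```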

slide-71
SLIDE 71

Related Agent Systems

  • Some notable deployed systems

  • Internet Softbot [Etzioni & Weld, ’94]

  • BargainFinder [Krulwich, ’96]

  • ShopBot [Perkowitz et al. ’96]

  • Warren [Decker et al., ’97]

  • Electric Elves [Chalupsky et al., ’01]

  • and many others…

slide-72
SLIDE 72

Related Work

  • Wrapper learning

  • Supervised [Kushmerick ’97, Hsu & Dung ’98]

  • Unsupervised [Lerman et al. ’01, Crescenzi ’01]

  • Record linkage

  • Learning [Cohen ’00, Sarawagi & Bhamidipaty ’02]

  • Statistics [Winkler ’98]

  • Name matching [Bilenko et al. ’03, Cohen et al. ’03]

  • Efficient plan execution

  • Network query engines [Ives et al. 1999, Naughton et al. 2000, Hellerstein et al. 2001]

  • Agent execution systems [Firby ’94, Myers et al. 1996]

slide-73
SLIDE 73

Outline of talk

  • The Electric Elves: Information agents for monitoring travel

  • Wrapping online sources

  • Linking records across sources

  • Efficiently executing agent plans

  • Current and related work

  • Conclusions

slide-74
SLIDE 74

Conclusions

  • Web provides the ideal environment for developing and testing software agents

  • Noted by Etzioni, AAAI’96 in his talk on Softbots

  • Yet few have seized this opportunity…why?

  • Like robotics, wide variety of hard technical problems

  • With Web Services, the Semantic Web, etc. the infrastructure is improving

  • Great opportunity for AI

  • Ability to demonstrate and test technologies in a real-world setting

  • Opportunity to apply technologies to make a difference in people’s lives

slide-75
SLIDE 75

Conclusions (cont.)

  • Many interesting technical challenges for building software agents:

  • Wrapping online sources

  • Linking records across sites

  • Efficiently executing agent plans

  • Extraction from text documents

  • Aligning ontologies across sources

  • Planning to integrate data sources

  • Learning to improve performance and capabilities

  • Integrating these capabilities in a robust architecture that can:

  • Respond to failures

  • Explain its behavior

  • Communicate appropriately

slide-76
SLIDE 76

More Information

  • My home page: http://www.isi.edu/~knoblock

  • IJCAI’03 Workshop on Information Integration on the Web

  • Proceedings available online (pointer from my homepage)

slide-77
SLIDE 77

The End