CSCI-548: Information Integration on the Web - Craig Knoblock

SLIDE 1

January 05 University of Southern California 1

CSCI-548: Information Integration on the Web

Craig Knoblock University of Southern California

SLIDE 4

Example Applications

SLIDE 5

Integrating Country Information

[Diagram labels: Agent; World Governments; NATO Members; CIA World Factbook (1995, 1996, 1997)]

SLIDE 6

Predicting Flight Delays

[Diagram labels: Agent; Learned Flight Delay Predictor; Historical Flight Data; Historical Weather Data; Yahoo Weather Prediction]

SLIDE 7

Real Estate Notifications

New Listing: 3br 2bath 200K
Send Email Notification

SLIDE 8

TheaterLoc Entertainment Agent

[Diagram labels: Agent; Tiger Map Server; Etak Geocoder; CuisineNet; Zagat; Yahoo Movies; Hollywood.com Trailers]

SLIDE 9

Travel Planning Assistant

SLIDE 10

Geospatial Data Integration

SLIDE 11

WorldInfo Assistant

SLIDE 12

Course Overview

SLIDE 13

XML

XML is widely used as an internet data interchange language
XQuery: a language for manipulating XML documents
In this class I will cover the XQuery language
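
As a rough illustration of what querying an XML document looks like, here is a minimal sketch in Python. It uses the limited XPath subset in the standard library's xml.etree.ElementTree as a stand-in; it is not XQuery. The document, its element names, and the second restaurant's city are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document; element names and values are illustrative only.
DOC = """
<restaurants>
  <restaurant><name>Casablanca Restaurant</name><city>Venice</city></restaurant>
  <restaurant><name>Patina</name><city>Los Angeles</city></restaurant>
</restaurants>
"""

def names_in_city(xml_text, city):
    """Return the names of restaurants whose <city> child equals `city`."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a small XPath subset, including [child='text'] predicates.
    matches = root.findall(f"restaurant[city='{city}']")
    return [r.findtext("name") for r in matches]

print(names_in_city(DOC, "Venice"))
```

The predicate selects elements by the text of a child element, which is roughly the kind of filtering a query language for XML is meant to express declaratively.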

SLIDE 14

Wrappers

NAME    Casablanca Restaurant
STREET  220 Lincoln Boulevard
CITY    Venice
PHONE   (310) 392-5751

SLIDE 15

Wrappers

Turning online sources into structured information

Research Topics

Wrapper Learning
Automatic Wrapper Generation
Wrapper Maintenance

Tools

AgentBuilder
AgentRunner
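
To make "turning online sources into structured information" concrete, here is a minimal hand-written wrapper sketch in Python. It is not how AgentBuilder works; the page markup, the RULES table, and the wrap function are assumptions made for illustration. Only the field names and values come from the Casablanca Restaurant example.

```python
import re

# Hypothetical page fragment; the markup is an assumption for illustration.
PAGE = """
<h1>Casablanca Restaurant</h1>
<p class="addr">220 Lincoln Boulevard, Venice</p>
<p class="tel">(310) 392-5751</p>
"""

# Hand-written extraction rules: one regular expression per output field.
RULES = {
    "NAME": r"<h1>(.*?)</h1>",
    "STREET": r'<p class="addr">(.*?),',
    "CITY": r'<p class="addr">.*?,\s*(.*?)</p>',
    "PHONE": r'<p class="tel">(.*?)</p>',
}

def wrap(page):
    """Apply every rule to the page and return one structured record."""
    record = {}
    for field, pattern in RULES.items():
        m = re.search(pattern, page)
        record[field] = m.group(1).strip() if m else None
    return record

print(wrap(PAGE))
```

Rules like these break as soon as the site changes its layout, which is why wrapper learning and wrapper maintenance are listed as research topics.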

SLIDE 16

Plan Execution

[Diagram: dataflow plan built from Wrapper, Select, and Join operators]

Input: 4676 Admiralty Way Marina del Rey CA

Wrapper: Vote-Smart (address -> all officials)
Select: senators, house reps (all officials -> senators & house reps)
Wrapper: OpenSecrets (names page) -> member URL
Wrapper: OpenSecrets (member page) -> funding URL
Wrapper: OpenSecrets (funding page) -> graph URL
Wrapper: Yahoo News -> recent news
Join: name -> combined results

All officials:
George Bush
Dick Cheney
Barbara Boxer
Dianne Feinstein
Jane Harman
James Hahn

Senators & house reps:
Barbara Boxer
Dianne Feinstein
Jane Harman

Recent news:
Boxer    Anthrax investigation continues…
Boxer    Bay area politicians meet…
Feinstein    Bay area politicians meet…
Harman    Life in LA is just too sunny…

SLIDE 17

Plan Execution

Research Topics

Streaming dataflow execution systems
Optimizing execution systems

⌧Adaptive execution strategies
⌧Speculative Execution

Tools

Theseus agent execution system
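
A streaming dataflow plan can be approximated with Python generators, where each operator pulls tuples from the one before it. The sketch below is not Theseus; the operator names, the "office" field, and its values are assumptions for illustration, and only the people and headlines come from the Plan Execution example.

```python
def source(rows):
    """Stand-in for a wrapper: yields tuples one at a time."""
    for row in rows:
        yield row

def select(stream, predicate):
    """Pass along only the tuples that satisfy the predicate."""
    return (row for row in stream if predicate(row))

def hash_join(left, right, key):
    """Join two streams on `key`; the right side is buffered, the left streams."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    for row in left:
        for match in index.get(row[key], []):
            yield {**row, **match}

# The "office" values are illustrative assumptions, not taken from the slides.
officials = source([
    {"name": "Barbara Boxer", "office": "senator"},
    {"name": "James Hahn", "office": "other"},
    {"name": "Jane Harman", "office": "house rep"},
])
news = source([
    {"name": "Barbara Boxer", "headline": "Anthrax investigation continues"},
    {"name": "Jane Harman", "headline": "Life in LA is just too sunny"},
])

federal = select(officials, lambda r: r["office"] in ("senator", "house rep"))
combined = list(hash_join(federal, news, "name"))
print(combined)
```

Because the select stage is lazy, tuples flow to the join as soon as they are produced, without first materializing the full list of officials.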

SLIDE 18

Data Integration

[Diagram labels: CDW; Yahoo Laptops; Mediator; Mediator; Outlook Server; Timeline Server; Local sources & services; Remote sources & services]

SLIDE 19

Data Integration Systems

Information mediators

Used to automatically select and compose information across sources

Research Topics

⌧Global-as-view vs. Local-as-view integration
⌧Optimizing query plans

Tools

Prometheus information mediator
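
The global-as-view idea can be sketched in a few lines of Python: each relation in the mediated schema is defined as a query over the sources, so a user query is answered by unfolding that definition. Local-as-view goes the other way, describing each source as a view over the mediated schema, which requires query rewriting and is not shown here. This is not Prometheus; the sources, schemas, and data below are invented for illustration.

```python
# Two hypothetical sources with different schemas (data invented for illustration).
SOURCE_A = [{"model": "T40", "price_usd": 1200}]
SOURCE_B = [{"product": "X31", "cost": 1500}]

def laptop():
    """Global-as-view: the mediated relation laptop(model, price) is defined
    as a union of renaming projections over the two sources."""
    for r in SOURCE_A:
        yield {"model": r["model"], "price": r["price_usd"]}
    for r in SOURCE_B:
        yield {"model": r["product"], "price": r["cost"]}

def query_laptops_under(limit):
    """A user query over the mediated schema, answered by unfolding laptop()."""
    return [t for t in laptop() if t["price"] < limit]

print(query_laptops_under(1300))
```

The user never sees the source schemas; adding a source under this scheme means editing the definition of laptop().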

SLIDE 20

Record Linkage

How can the same objects be identified when they are stored in inconsistent text formats?

Zagat’s Restaurant Guide Source:
Art’s Delicatessen
Ca’ Brea
CPK
The Grill
Patina
Philippe’s The Original
The Tillerman

Department of Health Restaurant Source:
Art’s Deli
California Pizza Kitchen
Campanile
Citrus
Grill, The
Philippe The Original
Spago

SLIDE 21

Record Linkage

Align information across sources

Research Topics:

Matching individual attributes
Matching entire records

Tools

Apollo Record Linkage System
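
Matching an individual attribute such as a restaurant name can be sketched with plain string similarity. The code below is not the Apollo system; it is a minimal sketch that uses Python's difflib, and the normalization rule and the 0.6 threshold are assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# Names from the Department of Health source in the example above.
HEALTH = ["Art's Deli", "California Pizza Kitchen", "Campanile", "Citrus",
          "Grill, The", "Philippe The Original", "Spago"]

def normalize(name):
    """Lowercase, move a trailing ', The' to the front, and drop punctuation."""
    name = name.lower().strip()
    m = re.match(r"(.*),\s*the$", name)
    if m:
        name = "the " + m.group(1)
    return re.sub(r"[^a-z0-9 ]", "", name)

def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def best_match(name, candidates, threshold=0.6):
    """Return the most similar candidate, or None if nothing clears the threshold."""
    score, match = max((similarity(name, c), c) for c in candidates)
    return match if score >= threshold else None

for zagat_name in ["The Grill", "Art's Delicatessen", "CPK"]:
    print(zagat_name, "->", best_match(zagat_name, HEALTH))
```

Character similarity handles reordering and truncation, but an abbreviation such as CPK stays unmatched, which is one reason learned matching over entire records is a research topic.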

SLIDE 22

Aligning Schemas and Ontologies

Mediated schema: price, agent-name, agent-phone, office-phone, description

Schema of realestate.com: listed-price, contact-name, contact-phone, office, comments

realestate.com
listed-price  contact-name  contact-phone   office          comments
$250K         James Smith   (305) 729 0831  (305) 616 1822  Fantastic house
$320K         Mike Doan     (617) 253 1429  (617) 112 2315  Great location

homes.com
sold-at  contact-agent   extra-info
$350K    (206) 634 9435  Beautiful yard
$230K    (617) 335 4243  Close to Seattle

If “office” occurs in name => office-phone
If “fantastic” & “great” occur frequently in data instances => description

SLIDE 23

Aligning Schemas and Ontologies

Given two different sources with different schemas, how do we automatically align the information?

Research Topics

Automatic schema alignment based on structure and naming
Automatic alignment based on the source contents
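
The two kinds of evidence, column naming and source contents, can be combined in a small sketch. The code below hard-codes the two rules from the real estate example instead of learning them; the function names and the 0.5 frequency threshold are assumptions for illustration.

```python
# Columns and values from the realestate.com example.
REALESTATE = {
    "listed-price": ["$250K", "$320K"],
    "contact-name": ["James Smith", "Mike Doan"],
    "contact-phone": ["(305) 729 0831", "(617) 253 1429"],
    "office": ["(305) 616 1822", "(617) 112 2315"],
    "comments": ["Fantastic house", "Great location"],
}

DESCRIPTION_WORDS = {"fantastic", "great"}

def match_by_name(column):
    """Name-based rule: a column whose name contains 'office' maps to office-phone."""
    return "office-phone" if "office" in column.lower() else None

def match_by_content(values, min_fraction=0.5):
    """Content-based rule: frequent 'fantastic'/'great' suggests description."""
    hits = sum(any(w in v.lower().split() for w in DESCRIPTION_WORDS) for v in values)
    return "description" if values and hits / len(values) >= min_fraction else None

def align(source):
    """Map each source column to a mediated-schema attribute, or None if unknown."""
    return {col: match_by_name(col) or match_by_content(vals)
            for col, vals in source.items()}

print(align(REALESTATE))
```

Columns that neither rule covers are left as None here; a real matcher would need many more rules or a learned classifier per mediated attribute.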

SLIDE 24

Constraint Integration

SLIDE 25

Constraint Integration Frameworks

Approach to tightly integrating closely related sources

Research:

Constraint propagation and constraint satisfaction techniques

Tools

Heracles constraint integration system
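
Constraint propagation can be illustrated with a tiny all-different example: once a variable has a single possible value, that value is removed from every other variable's candidates. This is not Heracles; the variable names and candidate values below are hypothetical.

```python
def propagate(domains):
    """Repeatedly remove values fixed for one variable from all other variables."""
    domains = {var: set(vals) for var, vals in domains.items()}
    changed = True
    while changed:
        changed = False
        for var, vals in domains.items():
            if len(vals) == 1:
                (fixed,) = vals
                for other, other_vals in domains.items():
                    if other != var and fixed in other_vals:
                        other_vals.discard(fixed)
                        changed = True
    return domains

# Hypothetical buildings with candidate street numbers; no two may share one.
candidates = {"b1": {604}, "b2": {604, 610}, "b3": {610, 616}}
print(propagate(candidates))
```

Fixing b1 forces b2, which in turn forces b3, so a single known value can resolve a chain of ambiguous ones.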

SLIDE 26

Geospatial Data Integration

[Diagram labels: Los Angeles County Assessor’s Site; Property Tax Records; Satellite Image; Terraserver; Census Master Address File; Geocoded Houses; Constraint Satisfaction; Street Vector Data; Corrected Tiger Line Files]

Initial Hypothesis:
610, Palm or 645, Sierra
645, Sierra or 639, Sierra
633, Sierra or 629, Sierra
604 or 642
604 or 610
642, Penn or 636, Penn
630, Penn or 628, Penn
636, Penn or 630, Penn
628, Penn or 624, Penn
624, Penn or 618, Penn
639, Sierra or 633, Sierra
629, Sierra or 623, Sierra

Result After Constraint Satisfaction:
604
610
645, Sierra
642,644,646 Penn
639, Sierra
636,638,640 Penn
630,632,634 Penn
633, Sierra
629, Sierra
628, Penn
624, Penn
623, Sierra

Street Address   City, State      Zipcode
642 Penn St      El Segundo, CA   90245
640 Penn St      El Segundo, CA   90245
636 Penn St      El Segundo, CA   90245
604 Palm Ave     El Segundo, CA   90245
610 Palm Ave     El Segundo, CA   90245
645 Sierra St    El Segundo, CA   90245
639 Sierra St    El Segundo, CA   90245

Address          Latitude     Longitude
642 Penn St      33.923413    -118.409809
640 Penn St      33.923412    -118.409809
636 Penn St      33.923412    -118.409809
604 Palm Ave     33.923414    -118.409809
610 Palm Ave     33.923414    -118.409810
645 Sierra St    33.923413    -118.409810
639 Sierra St    33.923412    -118.409810

Address          # units   Area (sq ft)   Lot size
642 Penn St      3         1793           135.72 * 53.33
604 Palm Ave     1         884            69 * 42
610 Palm Ave     1         756            66 * 42
645 Sierra St    1         1337           120 * 62
639 Sierra St    1         1408           121 * 53.5

Data Extracted from On-line Site

SLIDE 27

Application Areas

Geospatial data integration

Includes satellite imagery, maps, vector data and many related online sources

Biological data integration

Huge number of sources on gene-related information
Many sources available as web services

In this course we will focus on the first application area

SLIDE 28

And other topics

Semantic Web
Data mining from the Web
Information extraction

SLIDE 29

Course Details

SLIDE 30

Where to find me…

Research Associate Professor
Computer Science Department
PHE 416 (Only for office hour after class)

Senior Project Leader
Information Sciences Institute
Marina del Rey
ISI 922 (Office the rest of the time)

SLIDE 31

TA, Grader & Office Hours

Professor: Craig Knoblock (Knoblock@isi.edu)
Office Hours:

⌧Tuesday 5-6pm (PHE 416)
⌧Thursday 3-4pm (ISI 922 or 310-448-8786)

TA: Martin Michalowski (martinm@isi.edu)
Office Hours: Monday 1-2:30pm (SAL 200c)

TA: Anshuman Chakravartty (achakrav@usc.edu)
Office Hours: (all in SAL 200c)

⌧Tue: 11-12:30pm, Wed: 1-2:30pm, Th: 10-11:30pm, Fri: 2-3:30pm

Grader: Junaid Chaudhry (chaudhry@isi.edu)

SLIDE 32

Course Web Pages

Blackboard – totale.usc.edu

Your USC login works on this account
If you are registered for 548, you will have access

All readings, slides, homeworks, etc. will be posted on the site page
Please check for announcements and read the discussion board on a regular basis
All questions should be posted (not emailed!)

If you know the answer to a posted question, please answer it!
But please don’t post answers to homeworks!

SLIDE 33

Prerequisites & Recommendations

Prerequisites

CS561 or CS573 – Introduction to AI
CS585 – Database Systems

Recommended Courses

CS571 – Issues of Programming Language Design
CS573 – Advanced AI

SLIDE 34

Grading

Homework: 24%

8 homework assignments – 3pts each
Must be turned in the week they are due
Partial credit for one week extension only

Course project: 35%
Quizzes: 11% (1pt per quiz)

First or last 10 minutes of every class (don’t be late)
There are no makeups if you miss the quiz

Final Exam: 30%

Final: May 3, 2-4pm (Check for conflicts!)

SLIDE 35

More on Grading

This is a hard class!

Lots of very technical reading – there is no good textbook
Lots of homework
Quizzes every week
Final exam and course projects

I do give B’s and C’s
Grade distribution will be roughly half A’s and B’s (I consider a C a failing grade)
If you get 90pts or more you will definitely get an A

SLIDE 36

Readings

Posted on the site each week

You can read them online or print them

Try to read all required readings before the class in which they are covered
Quizzes may cover material that is only presented in the readings!

SLIDE 37

Slides

Available online by midnight of the day before the lecture
These are not intended as a replacement for the lecture
You can print these out and make notes on them

I suggest you print 6 slides per page to save paper

SLIDE 38

Course Lab – SAL 200c

Microsoft Instructional Lab – SAL 200c
Lab fee: $175
You must pay the fee even if you don’t use the lab – if you don’t think this is fair, don’t take the course
All registered students should have an account
Shared with other courses, so plan ahead
You are encouraged to use your own computers, but you will need Windows 2000 or XP for wrapper tools
TAs will hold office hours in the lab

SLIDE 39

Working Together

Each person must do their own homework

We will check for overlap in homeworks
If we find any plagiarism, all parties lose credit, so:

⌧Don’t share your answers
⌧Don’t leave printouts in the trash with your answers
⌧Don’t give out your password
⌧Don’t copy others (they may have the wrong answer anyway!)

You can ask the TAs for help
You are encouraged to work in pairs on the course project

Both students must participate and present
Expectations are the same for individual and joint projects

SLIDE 40

Cheating

Not tolerated!
No second chances – all infractions will be reported
Examples:

Turning in someone else’s homework
Copying from someone else during a quiz or exam
Doing a project that uses someone else’s work without giving them credit

SLIDE 41

Cell Phone Use

If it makes noise, turn it off in class

SLIDE 42

Quizzes & Exams

The quizzes and exam will cover the material in the lectures and the readings
Format: problems and short answers
If you keep up with the readings and participate in class, the exams won’t be too hard
Timing:

Quizzes: first or last 10 minutes of each class
Final: 2 hours

SLIDE 43

Course Projects

Information integration project based on what you learned in class
Be creative!
An ideal project would be one you could publish a paper about
Four components to this project:

Proposal
Demonstration (presented in SAL 200c)
Presentation (short presentation to the entire class)
Paper (written in the form of a conference paper)

⌧6-8 pages

SLIDE 44

Grading of Projects

Overall: 35%

⌧Proposal 5%
⌧Paper 10%
⌧Demo 5%
⌧Presentation 5%
⌧Applied techniques learned from class 5%
⌧Innovation and creativity 5%

Written proposals: March 1 at 1:50pm (submit online)
Proposal presentations: April 19 and 26
Demos due on date of presentation
Papers due on April 26

SLIDE 45

Project Presentation

Presentations are either posters or PowerPoint presentations to the class
I will determine posters or presentations based on your project proposal
You can still get full credit for a poster, but everyone should try to get a presentation
This is the same thing that happens at conferences

SLIDE 46

Example Projects

A system that took an arbitrary web page and built a map showing all of the locations
A meta comparison shopping engine
A real estate notification agent that emailed a satellite image and map of the property
A new approach for linking records across sources
An improvement on an algorithm that we learn about in class
An empirical evaluation of different approaches to some task

SLIDE 47

When the Course is Over

Directed research (1-2 MS or PhD students)
M.S. Thesis
Summer interns (MS or PhD)
Research Assistantships (1-2 PhD students)

⌧I can also recommend you for positions in other groups

Teaching Assistantships (both PhD & MS)
Recommendation letters (anyone who gets at least an A-)
Positions at related companies

⌧Last year Fetch Technologies hired two students who took the course in the past and made an offer to a third