[PPT] - Interactively Building Geospatial Mashups Craig A. Knoblock PowerPoint Presentation

SLIDE 1


Interactively Building Geospatial Mashups

Craig A. Knoblock University of Southern California

Work in collaboration with Shubham Gupta, Pedro Szekely, and Rattapoom Tuchinda

SLIDE 2

MASHUPS

A website or application that combines content from more than one source into an integrated experience [wikipedia]


a) LA crime map b) zillow.com c) Ski bonk

Combined Data gives new insight / provides new services

Crime Report from different counties

Map
Real Estate Listing
Property Tax
Weather
Snow Report
Snow Resorts

Introduction Approach Evaluation Related Work Conclusion

SLIDE 3

PROBLEM

Most Mashups require significant expertise to create

Demand for creating integrated applications is huge

Every user has their own unique requirements for an integrated application

The number of available sources and the need to integrate their data continue to grow

SLIDE 4

MASHUP BUILDING ISSUES


Data Retrieval: Wrapper

Calibration (source modeling, cleaning): Attribute, Clean

Integration: Combine

Display: Customize Display


SLIDE 5

EXISTING APPROACHES


Goal: Create Mashups without Programming

Creating Mashups without programming does not mean users can avoid understanding programming

Yahoo’s Pipes Widget Paradigm

‐ Widgets (e.g., 43 for Pipes, 300+ for MS) each represent an operation on the data
‐ Locating and learning to customize a widget can be time consuming
‐ Most tools focus on particular issues and ignore others

Can we come up with a framework that addresses all of the issues while still making the Mashup building process easy?


SLIDE 6

KEY CONTRIBUTIONS

A programming by demonstration approach that uses a single table for building a Mashup

An integrated approach that links data extraction, source modeling, data cleaning, and data integration together

A query formulation technique that allows users to specify examples to build complicated queries


SLIDE 7

KEY IDEAS

Focus on data, not operations

– Users are more familiar with data

Leverage existing data

– Help source modeling, cleaning, and data integration

Consolidate as opposed to Divide‐And‐Conquer

– Solving a problem in one issue can help solve another issue
– Interacting within a single spreadsheet platform


SLIDE 8

KARMA USER INTERFACE

Data Source Types Currently Supported by Karma
Various Information Integration Operations
Data Table – Spreadsheet Type Interface


SLIDE 9

INTEGRATION SCENARIO

Evacuation Centers CSV: Extract {EvacCenter_ID, Address, City}
Emergency Coordinator MySQL Database: Extract and Clean {Name, City, Phone No.}
Integrated data {EvacCenter_ID, Address, City, Name, Phone No.}: MAP
Injury statistics in Excel Spreadsheet: Extract {Date, Injuries, Fatalities}, Visualize as chart
Google News Website: Extract {Headlines, Summary, Date, Link}, Visualize as bulleted list

SLIDE 10

RETRIEVING DATA FROM DIVERSE SOURCES

Karma facilitates retrieval of data from structured data sources, such as Excel spreadsheets, MySQL databases and CSV files

Karma also facilitates the extraction of data from semi‐structured data sources such as web pages

CSV Text File, MySQL Database, Excel Spreadsheet, HTML Web Page

SLIDE 11

EXTRACTION BY EXAMPLE

The retrieval of data from structured data sources, such as Excel sheets and CSV files, is done through a drag and drop mechanism

The user is only required to select a sample data element and drop it into Karma’s data table
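As a rough illustration of extraction by example on a structured source, the sketch below takes one sample value and returns the whole column that contains it. It is only an assumption about how such a step could work, not Karma's implementation; the function name `extract_column_by_example` and the sample rows are hypothetical.

```python
import csv
import io


def extract_column_by_example(csv_text, sample):
    """Return (header, values) for the first column that contains `sample`."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    for col in range(len(header)):
        column = [row[col] for row in body]
        if sample in column:
            return header[col], column
    raise ValueError(f"sample {sample!r} not found in any column")


# Hypothetical evacuation-center rows, invented for illustration only.
EVAC_CSV = """EvacCenter_ID,Address,City
E1,100 Main St,Pasadena
E2,25 Oak Ave,Glendale
E3,7 Pine Rd,Burbank
"""

# The "dropped" sample is a single address; the whole column is filled in.
name, values = extract_column_by_example(EVAC_CSV, "25 Oak Ave")
print(name, values)
```

The single dropped value plays the role of the demonstration; everything else in the column is inferred from it.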


SLIDE 12

EXTRACTION FROM THE WEB


Path of the selected example: Tbody/tr[1]/td[2]/a

[Figure: DOM tree (TBODY, tr, td, a, br) of a restaurant listing with entries 1. Japon Bistro (970 E Colora.., Upscale yet affordabl..) and 2. Hokusai (8400 Wilshir., Chic elegance…..)]

Generalized path: Tbody/tr*/td*/a
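The step from Tbody/tr[1]/td[2]/a to Tbody/tr*/td*/a can be sketched as replacing positional indices with wildcards and then testing other node paths against the result. This is a minimal sketch under that assumption; the helper names `generalize` and `matches` are hypothetical and are not taken from Karma.

```python
import re


def generalize(path):
    """Replace every positional index such as tr[1] with a wildcard tr*."""
    return re.sub(r"\[\d+\]", "*", path)


def matches(generalized, concrete):
    """Check whether a concrete indexed path fits a generalized path."""
    gen_steps = generalized.split("/")
    con_steps = concrete.split("/")
    if len(gen_steps) != len(con_steps):
        return False
    for g, c in zip(gen_steps, con_steps):
        tag = re.sub(r"\[\d+\]", "", c)  # strip the index from the concrete step
        if g.endswith("*"):
            if g[:-1] != tag:
                return False
        elif g != c:
            return False
    return True


rule = generalize("Tbody/tr[1]/td[2]/a")
print(rule)
print(matches(rule, "Tbody/tr[2]/td[2]/a"))   # second restaurant link
print(matches(rule, "Tbody/tr[2]/td[2]/br"))  # a line break, not a link
```

With the generalized path, one example link is enough to select the corresponding link in every row.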

SLIDE 13

EXTRACTION FROM THE WEB


[Figure: DOM tree (TBODY, tr, td, a, br) of the restaurant listing with entries 1. Japon Bistro (970 E Colora.., Upscale yet affordab) and 2. Hokusai (8400 Wilshir., Chic elegance…)]

SLIDE 14

EXPLOITING WRAPPER LIBRARIES

Wrapper Library: Karma lists all the available wrappers on the local machine.

SLIDE 15

SOURCE MODELING

Karma automatically generates the semantic types of each attribute to learn the underlying model of the data source

Supervised machine learning techniques are used to generate a set of patterns for each semantic type from training data

Initial Type
Manually label the data with the correct semantic type to train Karma
When new data of the same type is imported, Karma automatically labels it correctly


SLIDE 16

LEARNING SEMANTIC TYPES

:StreetAddress: 4DIG CAPS Rd; 3DIG N CAPS Ave; …
:Email: ALPHA@ALPHA.edu; ALPHA@ALPHA.com; …
:State: CA; 2UPPER; …
:Telephone: (3DIG) 3DIG-4DIG; +1 3DIG 2DIG 4DIG; …

[Diagram: Background knowledge, learn, Patterns, label]

Idea: Learn a model of the content of data and use it to recognize new examples
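To make the pattern idea concrete, the sketch below turns values into coarse token patterns in the spirit of 4DIG and 2UPPER, stores the patterns seen for each labeled type, and labels new values by exact pattern lookup. The token classes, the function names, and the training values are simplifying assumptions for illustration; the learner described on the slides is more general (for example, it can keep literals such as Rd or .edu).

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[A-Za-z]+|\d+|\S")


def to_pattern(value):
    """Map a raw string to a coarse token pattern."""
    out = []
    for tok in TOKEN.findall(value):
        if tok.isdigit():
            out.append(f"{len(tok)}DIG")      # digit run, e.g. 4DIG
        elif tok.isalpha() and tok.isupper():
            out.append(f"{len(tok)}UPPER")    # all-caps run, e.g. 2UPPER
        elif tok.isalpha() and tok[0].isupper():
            out.append("CAPS")                # capitalized word
        elif tok.isalpha():
            out.append("ALPHA")               # other alphabetic word
        else:
            out.append(tok)                   # punctuation kept literally
    return " ".join(out)


def train(labeled):
    """Collect the set of patterns observed for each semantic type."""
    patterns = defaultdict(set)
    for value, semantic_type in labeled:
        patterns[semantic_type].add(to_pattern(value))
    return patterns


def label(value, patterns):
    """Return the first semantic type with a matching pattern, else None."""
    p = to_pattern(value)
    for semantic_type, known in patterns.items():
        if p in known:
            return semantic_type
    return None


# Hypothetical manually labeled training values.
model = train([
    ("4676 Admiralty Way", "StreetAddress"),
    ("CA", "State"),
    ("(310) 555-1234", "Telephone"),
    ("jane@example.edu", "Email"),
])
print(label("1200 Sunset Blvd", model))
print(label("(213) 740-1111", model))
```

A value whose pattern was never seen gets no label, which is where the manual labeling step would come back in.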

SLIDE 17

DATA CLEANING

Karma performs data cleaning by learning transformation rules from examples and applying them to the data

Initial data source
User provides example
Karma learns a transformation rule and applies it to the remaining data
Data source after cleaning


SLIDE 18

DATA CLEANING: PREDEFINED TRANSFORMATIONS


Predefined Rules

Example: 31 Reviews → 31

Subset Rule: (s1 s2 … sk) → (d1 d2 … dt) ∧ (k <= t) ∧ si ∈ {d1, d2, …, dt} ∧ di ≠ dj
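One plausible reading of the subset rule is that the cleaned value keeps a subset of the distinct tokens of the original value. Under that assumption, the sketch below learns which token positions survive from a single example (31 Reviews → 31) and applies the same positions to the remaining values. The function names and the extra sample values are hypothetical; this is not Karma's rule engine.

```python
def learn_subset_rule(before, after):
    """Learn which token positions of `before` survive in `after`."""
    src = before.split()
    dst = after.split()
    if len(set(src)) != len(src):
        raise ValueError("tokens must be distinct for an unambiguous rule")
    if not all(tok in src for tok in dst):
        raise ValueError("output tokens must all come from the input")
    return [src.index(tok) for tok in dst]


def apply_rule(positions, value):
    """Keep only the learned token positions of `value`."""
    tokens = value.split()
    return " ".join(tokens[i] for i in positions if i < len(tokens))


# The user cleans one cell by hand; the rule is applied to the rest.
rule = learn_subset_rule("31 Reviews", "31")
cleaned = [apply_rule(rule, v) for v in ["31 Reviews", "4 Reviews", "128 Reviews"]]
print(cleaned)
```

The distinct-token check mirrors the di ≠ dj condition: with repeated tokens, a single example would not say which occurrence to keep.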

SLIDE 19

DATA INTEGRATION

Karma discovers the related sources by detecting and ranking associations based on the common attribute names and matching semantic types

Karma suggests potential joins between the current data sources in the form of column completions
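A minimal sketch of how column completions could be ranked, assuming each source is described as a mapping from attribute name to semantic type and that a candidate's score is simply the number of shared attribute names plus the number of shared semantic types. The scoring formula, the function name, and all type names other than PR‐Address are assumptions made for illustration.

```python
def rank_column_completions(current, sources):
    """Rank (score, source, new_attribute) suggestions for the current table."""
    suggestions = []
    for name, attrs in sources.items():
        shared_names = set(current) & set(attrs)
        shared_types = set(current.values()) & set(attrs.values())
        score = len(shared_names) + len(shared_types)
        if score == 0:
            continue  # nothing links this source to the current table
        for attr in attrs:
            if attr not in current:
                suggestions.append((score, name, attr))
    # Highest score first, then alphabetical for a stable order.
    suggestions.sort(key=lambda s: (-s[0], s[1], s[2]))
    return suggestions


# Current table: evacuation centers (attribute -> semantic type).
current = {"EvacCenter_ID": "PR-ID", "Address": "PR-Address", "City": "PR-City"}

# Other loaded sources, described the same way.
sources = {
    "EmergencyCoordinator": {"Name": "PR-Name", "City": "PR-City", "Phone No.": "PR-Phone"},
    "InjuryStatistics": {"Date": "PR-Date", "Injuries": "PR-Count", "Fatalities": "PR-Count"},
}

completions = rank_column_completions(current, sources)
print(completions)
```

Only the coordinator source shares an attribute (City) with the current table, so only its remaining columns are offered as completions.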


SLIDE 20

USER SELECTS FROM COLUMN COMPLETIONS

MySQL Database loaded as another source in Karma
Karma suggests the possible column completions in a drop down list
Karma executes the join query once the user selects an option


SLIDE 21

DATA VISUALIZATION

Visualization by demonstration approach

– The user demonstrates to Karma the kind of visualization desired for the data specified through examples using a drag and drop mechanism

SLIDE 22

DATA VISUALIZATION

Karma currently supports four types of visualization formats:

1. Chart Format: Useful for visualizing numerical statistics, time based events etc

2. Paragraph Format: Useful for visualizing descriptive text data such as Wikipedia definitions

SLIDE 23

DATA VISUALIZATION

3. List Format: Useful for visualizing information in a bulleted list such as a list of summarized news articles

4. Table Format: Useful for visualizing information that is best presented in a row‐and‐column format such as numerical values etc


SLIDE 24

RESULTS CAN BE PUBLISHED IN MULTIPLE FORMATS

Karma lets you export your final mashup in a variety of formats:

‐ HTML Page
‐ Database table
‐ KML Layer
‐ XML File
‐ CSV Text File

Different mashup publishing options

SLIDE 25

AUTOMATICALLY FINDS GEOSPATIAL REFERENCES

Final mashup output in HTML web page format:

‐ Karma identifies geospatial information in the current data with the help of geographic semantic types such as PR‐Address, PR‐Latitude etc
‐ The Google geocoding service is used to find the coordinates for a given address
‐ Karma uses the coordinate information to place the markers in the final mashup

Potential geographic information

Options to publish mashup as HTML web page

SLIDE 26

CONSTRUCTS A MAP WITH USER‐DEFINED LAYOUT

Final mashup as an HTML web page:

SLIDE 27

RESULTS CAN BE EXPORTED AS KML

Final mashup output as a KML layer

Options to publish mashup as KML layer
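For readers unfamiliar with KML, the sketch below writes geocoded rows as Placemark elements using only the Python standard library. It shows the general shape of a KML layer, not the exact file Karma produces; the function name `to_kml`, the row keys, and the sample coordinates are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def to_kml(rows):
    """Build a KML document string with one Placemark per row."""
    ET.register_namespace("", KML_NS)  # serialize with a default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for row in rows:
        pm = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(pm, f"{{{KML_NS}}}name").text = row["name"]
        point = ET.SubElement(pm, f"{{{KML_NS}}}Point")
        # KML orders coordinates as longitude,latitude.
        coords = ET.SubElement(point, f"{{{KML_NS}}}coordinates")
        coords.text = f"{row['lon']},{row['lat']}"
    return ET.tostring(kml, encoding="unicode")


# Hypothetical geocoded mashup row.
kml_text = to_kml([{"name": "Evacuation Center E1", "lat": 34.1478, "lon": -118.1445}])
print(kml_text)
```

Saving the returned string to a .kml file gives a layer that a viewer such as Google Earth can open.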


SLIDE 28

KML LAYERS CAN BE OPENED IN GOOGLE EARTH

The generated KML layer can be viewed in GIS software such as Google Earth

SLIDE 29

RESULTS CAN BE STORED IN A DB

The final mashup data can also be saved into a database table by providing the details about the database location, username and password, etc in Karma

SLIDE 30

EVALUATION


Baseline: A combination of Dapper/Pipes
Claims:

1. Users with no programming experience can build all four Mashup types.
2. Karma takes less time to complete each subtask and scales better as the tasks get harder
3. Overall, the user takes less time to build the same Mashup in Karma compared to Dapper/Pipes

Users:

– Programmers (20)
– Non‐programmers (3)

SLIDE 31

EVALUATION: SETUP


Familiarization

‐ Programmers (2 assignments on Dapper/Pipes, DP)
‐ Review Package
‐ 30 minute tutorial

Practice

‐ 2‐3 tasks using Karma

Test (3 tasks)

‐ Programmers: Alternating between Karma vs. DP for each task
‐ Non‐programmers: use only Karma
‐ Screens are recorded using video capture software

5 minute cut off time

SLIDE 32

EVALUATION: TASKS

Claim 1: Users with no programming experience can build all four Mashup types

Claim 2: When the Mashup subtask is difficult, Karma takes less time to complete that subtask

Claim 3: Overall, the user takes less time to build the same Mashup in Karma compared to Dapper/Pipes

Task No. | Mashup Type | Data Extraction | Source Modeling | Data Cleaning | Data Integration
1 | 1 (1 source) | Moderate | Simple | Difficult | N/A
2 | 2,3 (union+form) | Difficult | Simple | Simple | Union (simple)
3 | 4 (join 2 sources) | Simple | Simple | N/A | Join (difficult)

SLIDE 33


Claim 1: Users with no programming experience can build all four Mashup types


SLIDE 34

EVALUATION: NON‐PROGRAMMERS


SLIDE 35


Claim 2: Karma takes less time to complete each subtask


SLIDE 36

EVALUATION: EXTRACTION


As the extraction task gets more difficult, Dapper/Pipes takes longer

– More subjects fail to complete the task (11% for moderate and 25% for difficult)

[Chart legend: Dapper/Pipes, Karma, Karma (programmer)]

SLIDE 37

EVALUATION: SOURCE MODELING


Karma performed worse in task 1 and task 2

– Only a 30 sec difference
– Subjects take time selecting attributes
– The saving will be realized in the data integration step

Karma performed better in task 3 because it can automatically identify the attribute

[Chart legend: Dapper/Pipes, Karma]

SLIDE 38

EVALUATION: DATA CLEANING


Karma performed better in both tasks

When the cleaning task gets harder, more subjects fail in Dapper/Pipes (35% for simple and 83% for hard)

[Chart legend: Dapper/Pipes, Karma]

SLIDE 39

EVALUATION: DATA INTEGRATION


Because of the table structure, subjects can specify union indirectly by dropping data into the right cell

The time spent in the source modeling step allows Karma to suggest the linking source

Dapper/Pipes: 30% fail in the union case and 95% fail in the join case

[Chart legend: Dapper/Pipes, Karma]

SLIDE 40


Claim 3: Overall, the user takes less time to build the same Mashup in Karma compared to Dapper/Pipes


SLIDE 41

EVALUATION: OVERALL


[Chart legend: Dapper/Pipes, Karma]

SLIDE 42

EVALUATION: AVERAGE


[Chart: Dapper/Pipes vs. Karma, with ratio labels 2.22x, 0.67x, 4.16x, 6.49x, 3.32x]

SLIDE 43

RELATED WORK: MASHUP BUILDING TOOLS

1: Extraction, 2: Union, 3: Form‐based Interaction, 4: Join


Require an expert

Mainly focus on extraction / linear; Q/A approach / linear / scalability; Widgets; Fancier UI / more widgets; Fewer widgets / confusion on workflow; Early work, focus on DOM, too basic; Create points on Map; RDF / manually specify data int; Tuple = card, drawing links for relations

SLIDE 44

RELATED WORK: DATA EXTRACTION

Automatic extraction: tables and lists only

– RoadRunner (exploit HTML structure) [Crescenzi et al., 2001]
– Adel (grammar induction to detect rows) [Lerman+ 2001]
– VisualWeb (OCR technique to detect tables) [Gatterbauer+ 2007]

Semi‐Automatic: require more labeled examples

– WIEN (inductive, less expressive than Stalker) [Kushmerick 1997]
– Stalker (Co‐testing) [Muslea+ 1999]
– SoftMealy (finite state transducer) [Hsu 1998]
– WHISK (rigid format, exact delimiter) [Soderland 1998]

DOM: rely on well‐formed HTML and less labeling

– Simile [Huynh+ 2005]
– Dapper
– Interactive Wrapper Generation (ML + prediction on DOM) [Irmak+ 2006]
– PLOW (add natural language) [Allen+ 2007]
– Cards [Dontcheva+ 2007]
– Karma [Tuchinda+ 2008]


SLIDE 45

RELATED WORK: SOURCE MODELING

1:1 mapping, N:M mapping

– Schema‐level match

TranScm [Milo+ 98]
DIKE [Palopoli+ 99]
Artemis [Castano+ 01]
Delta [Clifton+ 97]

– +Instance‐based matcher

SemInt [Li 00]
LSD [Doan 01]
ILA [Etzioni 95]
iMAP [Dhamankar 04]
Clio (interactive) [Ling 01]
Inducing Source Description [Carman 07]

Karma leverages existing techniques to narrow candidate matches

– String Similarities [Cohen+ 2003]


SLIDE 46

RELATED WORK: DATA CLEANING

Commercial Tools: Focus on writing transformations

– ACR/Data, Migration Architect [Chaudhuri+ 1997]

Discrepancy Detection: Use as a stepping stone for record linkage and cleaning systems

– Levenshtein distance [Needleman+ 70]
– Vector based [Baeza‐Yates+ 99]
– EM [Ristad+ 98]
– SVM [Bilenko+ 03]

Record linkage & cleaning systems: Focus on ranking [Winkler 06]

– Fuzzy Match [Chaudhuri+ 03]
– Apollo [Michalowski+ 05]
– Phoebus [Michelson+ 07]
– Potter’s Wheel [Raman+ 01]

Karma

– Gains reference sources through source modeling process
– Provides predefined transformations


SLIDE 47

RELATED WORK: DATA INTEGRATION

Universal Relation: Makes it easier to formulate the query, but users still need to formulate the query [Ullman 1980, 1988]

Query by example: Need to know which data sources to use, and the query may not return results

– QBE [Zloof 1975]

Retrieval by formulation: Need to understand the domain model to formulate a partial description

– Helgon [Fischer 1989], RABBIT [Williams 1982]

Graphical Query Language: Users still need to navigate through sources (graphs)

– Gql [Benzi 1998, Haw 1994, Papantonakis 1988]

Question‐Answering Techniques: Understanding of database operations required

– Agent Wizard [Tuchinda+ 2004]

Interactive Schema/data integration: Understanding of source schema required

– Clio [Ling 01]

Karma is based on Programming by Demonstration [Cypher 2001; Lau 2001]


SLIDE 48

CONCLUSION

Mashups are a fast growing area

– Need an efficient way for casual web users to build them

Contributions

– A PBD approach that uses a single table for building a Mashup
– An integrated approach that solves the various Mashup building issues
– A query formulation technique that allows users to specify examples to build complicated queries

Evaluated the validity of the Karma approach

– Subjects were able to complete Mashup building tasks in Karma
– The overall improvement is at least a factor of 3.5


SLIDE 49

FUTURE WORK

Learn and generalize over the task

– Store the integration plan so that it can be re‐executed on current data

Support the integration of geospatial data types (i.e., vector layers, raster layers)

Improve the techniques for automatic source modeling

Learn new transformations from examples for data cleaning


SLIDE 50

PAPERS

Building geospatial mashups to visualize information for crisis management. Shubham Gupta and Craig A. Knoblock. In Proceedings of the 7th International Conference on Information Systems for Crisis Response and Management, 2010.

Interactive data integration through smart copy & paste. Zachary G. Ives, Craig A. Knoblock, Steven Minton, Marie Jacob, Partha Pratim Talukdar, Rattapoom Tuchinda, Jose Luis Ambite, Maria Muslea, and Cenk Gazen. Fourth Biennial Conference on Innovative Data Systems Research (CIDR), 2009.

Building mashups by example. Rattapoom Tuchinda, Pedro Szekely, and Craig A. Knoblock. In Proceedings of the 2008 International Conference on Intelligent User Interfaces, 2008.

Building data integration queries by demonstration. Rattapoom Tuchinda, Pedro Szekely, and Craig A. Knoblock. In Proceedings of the International Conference on Intelligent User Interfaces, 2007.
