Interactively Building Geospatial Mashups, Craig A. Knoblock (PowerPoint presentation)



SLIDE 1

Interactively Building Geospatial Mashups

Craig A. Knoblock, University of Southern California

Work in collaboration with Shubham Gupta, Pedro Szekely, and Rattapoom Tuchinda

SLIDE 2

MASHUPS

  • A website or application that combines content from more than one source into an integrated experience [Wikipedia]

Examples: a) LA crime map, b) zillow.com, c) Ski bonk

Combined data gives new insight / provides new services

  • Crime reports from different counties
  • Map
  • Real Estate Listing
  • Property Tax
  • Weather
  • Snow Report
  • Snow Resorts

Introduction Approach Evaluation Related Work Conclusion

SLIDE 3

PROBLEM

  • Most Mashups require significant expertise to create
  • Demand for creating integrated applications is huge
  • Every user has their own unique requirements for an integrated application
  • The number of available sources and the need to integrate data continue to grow

SLIDE 4

MASHUP BUILDING ISSUES

[Figure: the mashup building pipeline]

  • Data Retrieval: wrappers
  • Calibration: source modeling (attributes) and cleaning
  • Integration: combine
  • Display: customize display

SLIDE 5

EXISTING APPROACHES

Goal: Create Mashups without Programming

  • Doesn’t translate to not having to understand programming

[Screenshot: Yahoo’s Pipes]

Widget Paradigm

‐ Widgets (i.e., 43 for Pipes, 300+ for MS) each represent an operation on the data
‐ Locating and learning to customize a widget can be time consuming
‐ Most tools focus on particular issues and ignore others

Can we come up with a framework that addresses all of the issues while still making the Mashup building process easy?

SLIDE 6

KEY CONTRIBUTIONS

  • A programming by demonstration approach that uses a single table for building a Mashup
  • An integrated approach that links data extraction, source modeling, data cleaning, and data integration together
  • A query formulation technique that allows users to specify examples to build complicated queries

SLIDE 7

KEY IDEAS

  • Focus on data, not operations
    – Users are more familiar with data
  • Leverage existing data
    – Helps source modeling, cleaning, and data integration
  • Consolidate as opposed to Divide‐And‐Conquer
    – Solving a problem in one issue can help solve another issue
    – Interacting within a single spreadsheet platform

SLIDE 8

KARMA USER INTERFACE

[Screenshot callouts]

  • Data source types currently supported by Karma
  • Various information integration operations
  • Data table: spreadsheet‐type interface

SLIDE 9

INTEGRATION SCENARIO

[Figure: data flow of the integration scenario]

  • Evacuation Centers CSV: Extract {EvacCenter_ID, Address, City}
  • Emergency Coordinator MySQL Database: Extract and Clean {Name, City, Phone No.}
  • Combined {EvacCenter_ID, Address, City, Name, Phone No.}: Map
  • Injury statistics in Excel Spreadsheet: Extract {Date, Injuries, Fatalities}, visualize as chart
  • Google News Website: Extract {Headlines, Summary, Date, Link}, visualize as bulleted list

SLIDE 10

RETRIEVING DATA FROM DIVERSE SOURCES

  • Karma facilitates retrieval of data from structured data sources, such as Excel spreadsheets, MySQL databases and CSV files
  • Karma also facilitates the extraction of data from semi‐structured data sources such as web pages

[Figure labels: CSV Text File, MySQL Database, Excel Spreadsheet, HTML Web Page]

SLIDE 11

EXTRACTION BY EXAMPLE

  • The retrieval of data from structured data sources, such as Excel sheets and CSV files, is done through a drag and drop mechanism
  • The user is only required to select a sample data element and drop it into Karma’s data table

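The idea of extracting a whole column from one dropped example can be sketched as follows. This is only an illustration, not Karma's implementation: the function name `extract_by_example` and the sample rows are hypothetical, and the sketch assumes the CSV has a header row and that the example value appears verbatim in some cell.

```python
import csv
import io


def extract_by_example(csv_text, example):
    """Return (column name, all values in that column) for the first
    column that contains the example value, or None if nothing matches."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return None
    header, body = rows[0], rows[1:]
    for row in body:
        for idx, cell in enumerate(row):
            if cell == example:
                # Generalize from the single example to the whole column.
                return header[idx], [r[idx] for r in body if idx < len(r)]
    return None


# Hypothetical data in the shape of the evacuation-center source.
SAMPLE = "EvacCenter_ID,Address,City\n1,100 Main St,Pasadena\n2,200 Oak Ave,Burbank\n"
result = extract_by_example(SAMPLE, "Burbank")
```

Dropping "Burbank" is enough to pull in every value of the City column, which mirrors the single-example interaction described on the slide.
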
SLIDE 12

EXTRACTION FROM THE WEB

[Figure: DOM tree of a restaurant listing. TBODY contains two tr rows; each row has td cells, and the second td holds an a link followed by br elements. Row 1: "Japon Bistro", "970 E Colora..", "Upscale yet affordabl..". Row 2: "Hokusai", "8400 Wilshir.", "Chic elegance…".]

  • XPath of the selected example: Tbody/tr[1]/td[2]/a
  • Generalized XPath: Tbody/tr*/td*/a

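A minimal sketch of the generalization step shown on this slide, assuming that the wildcard form is obtained simply by dropping the positional indices from the example's path. The helper names `generalize` and `extract` are hypothetical, the markup is a well-formed stand-in for the page in the figure (with the truncated text kept as it appears), and Python's `xml.etree.ElementTree` is used only because it supports this small XPath subset.

```python
import re
import xml.etree.ElementTree as ET

# Stand-in for the restaurant listing in the figure (text kept truncated).
PAGE = """
<tbody>
  <tr><td>1.</td><td><a>Japon Bistro</a><br/>970 E Colora..<br/>Upscale yet affordabl..</td></tr>
  <tr><td>2.</td><td><a>Hokusai</a><br/>8400 Wilshir.<br/>Chic elegance</td></tr>
</tbody>
"""


def generalize(xpath):
    """Drop positional predicates such as [1] so the path matches all siblings."""
    return re.sub(r"\[\d+\]", "", xpath)


def extract(root, xpath):
    """Evaluate the path relative to the root; the first step names the root itself."""
    relative = "/".join(xpath.split("/")[1:])
    return [el.text for el in root.findall(relative)]


root = ET.fromstring(PAGE.strip())
example_path = "Tbody/tr[1]/td[2]/a"
only_example = extract(root, example_path)
all_rows = extract(root, generalize(example_path))
```

The indexed path selects only the example the user pointed at, while the generalized path returns the corresponding element from every row.
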
SLIDE 13

EXTRACTION FROM THE WEB

[Figure: the same DOM tree as on the previous slide, with the tr rows for "Japon Bistro" and "Hokusai"]

SLIDE 14

EXPLOITING WRAPPER LIBRARIES

Wrapper Library: Karma lists all the available wrappers on the local machine.

SLIDE 15

SOURCE MODELING

  • Karma automatically generates the semantic types of each attribute to learn the underlying model of the data source
  • Supervised machine learning techniques are used to generate a set of patterns for each semantic type from training data

Initial type: manually label the data with the correct semantic type to train Karma

When new data of the same type is imported, Karma automatically labels it correctly

SLIDE 16

LEARNING SEMANTIC TYPES

  • :StreetAddress: 4DIG CAPS Rd; 3DIG N CAPS Ave; …
  • :Email: ALPHA@ALPHA.edu; ALPHA@ALPHA.com; …
  • :State: CA; 2UPPER; …
  • :Telephone: (3DIG) 3DIG-4DIG; +1 3DIG 2DIG 4DIG; …

[Figure labels: Background knowledge, learn, Patterns, label]

Idea: Learn a model of the content of data and use it to recognize new examples

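To make the pattern notation concrete, here is a small sketch that applies the patterns listed on this slide to new values. It covers only the labeling step, not the learning step, and the reading of the tokens is an assumption: `nDIG` as n digits, `nUPPER` as n uppercase letters, `CAPS` as a capitalized word, `ALPHA` as a run of letters, and everything else as literal text. The names `compile_pattern` and `label` are hypothetical.

```python
import re

# Patterns copied from the slide; the "…" entries are omitted.
PATTERNS = {
    "StreetAddress": ["4DIG CAPS Rd", "3DIG N CAPS Ave"],
    "Email": ["ALPHA@ALPHA.edu", "ALPHA@ALPHA.com"],
    "State": ["CA", "2UPPER"],
    "Telephone": ["(3DIG) 3DIG-4DIG", "+1 3DIG 2DIG 4DIG"],
}

TOKEN = re.compile(r"(\d+DIG|\d+UPPER|CAPS|ALPHA)")


def token_to_regex(token):
    """Translate one pattern token (assumed meaning) into a regular expression."""
    counted = re.fullmatch(r"(\d+)(DIG|UPPER)", token)
    if counted:
        char_class = r"\d" if counted.group(2) == "DIG" else "[A-Z]"
        return "%s{%s}" % (char_class, counted.group(1))
    if token == "CAPS":
        return "[A-Z][a-z]+"
    if token == "ALPHA":
        return "[A-Za-z]+"
    return re.escape(token)  # literal text such as " Rd" or "@"


def compile_pattern(pattern):
    parts = [p for p in TOKEN.split(pattern) if p]
    return re.compile("".join(token_to_regex(p) for p in parts))


def label(value):
    """Return the first semantic type with a pattern matching the whole value."""
    for sem_type, patterns in PATTERNS.items():
        if any(compile_pattern(p).fullmatch(value) for p in patterns):
            return sem_type
    return None
```

For example, `label("1234 Maple Rd")` falls under `:StreetAddress:` because it matches `4DIG CAPS Rd`, while a value that matches no pattern is left unlabeled for the user.
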
SLIDE 17

DATA CLEANING

  • Karma performs data cleaning by learning transformation rules from examples and applying them

[Figure sequence: initial data source; user provides an example; Karma learns a transformation rule and applies it to the remaining data; data source after cleaning]

SLIDE 18

DATA CLEANING: PREDEFINED TRANSFORMATIONS

Predefined Rules

Example: 31 Reviews → 31

Subset Rule: (s1 s2 … sk) → (d1 d2 … dt) ∧ (k <= t) ∧ si ∈ {d1, d2, …, dt} ∧ di ≠ dj

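One way to read the subset rule is sketched below, under the assumption that s1…sk are the tokens of the cleaned value and d1…dt are the distinct tokens of the original value, so that cleaning keeps a subset of the original tokens. The functions `learn_subset_rule` and `apply_subset_rule` are hypothetical names, and representing a learned rule as a list of kept token positions is a choice made for this sketch only.

```python
def learn_subset_rule(original, cleaned):
    """Return the positions of the original tokens kept in the cleaned value,
    or None when the example does not satisfy the subset conditions."""
    d = original.split()
    s = cleaned.split()
    if len(set(d)) != len(d):           # requires d_i != d_j
        return None
    if len(s) > len(d):                 # requires k <= t
        return None
    if any(tok not in d for tok in s):  # requires s_i in {d_1, ..., d_t}
        return None
    return [d.index(tok) for tok in s]


def apply_subset_rule(positions, value):
    """Apply a learned rule to another value from the same column."""
    tokens = value.split()
    if any(p >= len(tokens) for p in positions):
        return None
    return " ".join(tokens[p] for p in positions)


rule = learn_subset_rule("31 Reviews", "31")
cleaned_rest = [apply_subset_rule(rule, v) for v in ["5 Reviews", "112 Reviews"]]
```

The single example from the slide yields a rule that keeps the first token, which then cleans the remaining values in the column.
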
SLIDE 19

DATA INTEGRATION

  • Karma discovers related sources by detecting and ranking associations based on common attribute names and matching semantic types
  • Karma suggests potential joins between the current data sources in the form of column completions

SLIDE 20

USER SELECTS FROM COLUMN COMPLETIONS

  • MySQL database loaded as another source in Karma
  • Karma suggests the possible column completions in a drop‐down list
  • Karma executes the join query once the user selects an option

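The suggest-then-join interaction can be illustrated with a rough sketch. The scoring used here (number of shared attribute names plus number of shared semantic types) is an assumption made for the example, not Karma's ranking function. The type names other than PR‐Address, the source names, and the rows are all hypothetical.

```python
def suggest_completions(table, sources):
    """table: {attribute: semantic type}; sources: {name: {attribute: semantic type}}.
    Return (source, new column) pairs, best-scoring source first."""
    scored = []
    for name, attrs in sources.items():
        shared_names = set(table) & set(attrs)
        shared_types = set(table.values()) & set(attrs.values())
        score = len(shared_names) + len(shared_types)
        if score == 0:
            continue  # nothing links this source to the current table
        for column in sorted(set(attrs) - set(table)):
            scored.append((score, name, column))
    scored.sort(key=lambda item: (-item[0], item[1], item[2]))
    return [(name, column) for _, name, column in scored]


def join_on(left_rows, right_rows, key):
    """Equi-join two lists of dict rows on a shared key."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]


table = {"EvacCenter_ID": "ID", "Address": "PR-Address", "City": "PR-City"}
sources = {
    "coordinators": {"Name": "PR-Name", "City": "PR-City", "Phone No.": "PR-Phone"},
    "injuries": {"Date": "PR-Date", "Injuries": "Number"},
}
suggestions = suggest_completions(table, sources)

centers = [{"EvacCenter_ID": 1, "Address": "100 Main St", "City": "Pasadena"}]
coordinators = [
    {"Name": "A. Smith", "City": "Pasadena", "Phone No.": "555-0100"},
    {"Name": "B. Lee", "City": "Burbank", "Phone No.": "555-0101"},
]
joined = join_on(centers, coordinators, "City")
```

Only the source that shares an attribute name or semantic type with the current table produces suggestions; once a suggested column is chosen, the join fills in the matching rows.
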
SLIDE 21

DATA VISUALIZATION

  • Visualization by demonstration approach
    – The user demonstrates to Karma the kind of visualization desired for the data, specified through examples using a drag and drop mechanism

SLIDE 22

DATA VISUALIZATION

Karma currently supports four types of visualization formats:

1. Chart Format: Useful for visualizing numerical statistics, time‐based events, etc.
2. Paragraph Format: Useful for visualizing descriptive text data such as Wikipedia definitions

SLIDE 23

DATA VISUALIZATION

3. List Format: Useful for visualizing information in a bulleted list, such as a list of summarized news articles
4. Table Format: Useful for visualizing information that is best presented in a row‐and‐column format, such as numerical values

SLIDE 24

RESULTS CAN BE PUBLISHED IN MULTIPLE FORMATS

  • Karma lets you export your final mashup in a variety of formats:
    ‐ HTML Page
    ‐ Database table
    ‐ KML Layer
    ‐ XML File
    ‐ CSV Text File

[Screenshot: different mashup publishing options]

SLIDE 25

AUTOMATICALLY FINDS GEOSPATIAL REFERENCES

  • Final mashup output in HTML web page format:
    ‐ Karma identifies geospatial information in the current data with the help of geographic semantic types such as PR‐Address, PR‐Latitude, etc.
    ‐ The Google geocoding service is used to find the coordinates for a given address
    ‐ Karma uses the coordinate information to place the markers in the final mashup

[Screenshot callouts: potential geographic information; options to publish mashup as HTML web page]

SLIDE 26

CONSTRUCTS A MAP WITH USER‐DEFINED LAYOUT

  • Final mashup as an HTML web page [screenshot]

SLIDE 27

RESULTS CAN BE EXPORTED AS KML

  • Final mashup output as a KML layer

[Screenshot: options to publish mashup as KML layer]

SLIDE 28

KML LAYERS CAN BE OPENED IN GOOGLE EARTH

The generated KML layer can be viewed in GIS software such as Google Earth

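As an illustration of what such an export involves, the sketch below writes already-geocoded rows as KML placemarks using Python's standard `xml.etree.ElementTree`. It is not Karma's exporter: the function name `rows_to_kml`, the row keys, and the coordinates are hypothetical, and the geocoding step is assumed to have happened beforehand. KML expects coordinates in longitude,latitude order.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def rows_to_kml(rows):
    """Serialize rows with 'name', 'lat' and 'lon' keys as a KML document string."""
    ET.register_namespace("", KML_NS)  # emit KML as the default namespace
    kml = ET.Element("{%s}kml" % KML_NS)
    document = ET.SubElement(kml, "{%s}Document" % KML_NS)
    for row in rows:
        placemark = ET.SubElement(document, "{%s}Placemark" % KML_NS)
        ET.SubElement(placemark, "{%s}name" % KML_NS).text = row["name"]
        point = ET.SubElement(placemark, "{%s}Point" % KML_NS)
        # KML coordinate order is longitude,latitude.
        coords = "%s,%s" % (row["lon"], row["lat"])
        ET.SubElement(point, "{%s}coordinates" % KML_NS).text = coords
    return ET.tostring(kml, encoding="unicode")


# Hypothetical geocoded mashup rows.
rows = [
    {"name": "Evacuation Center 1", "lat": 34.05, "lon": -118.25},
    {"name": "Evacuation Center 2", "lat": 34.15, "lon": -118.14},
]
kml_text = rows_to_kml(rows)
```

Saving `kml_text` to a `.kml` file gives a layer with one marker per row.
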
SLIDE 29

RESULTS CAN BE STORED IN A DB

  • The final mashup data can also be saved into a database table by providing details such as the database location, username, and password in Karma

SLIDE 30

EVALUATION

  • Baseline: A combination of Dapper/Pipes
  • Claims:
    1. Users with no programming experience can build all four Mashup types
    2. Karma takes less time to complete each subtask and scales better as the tasks get harder
    3. Overall, the user takes less time to build the same Mashup in Karma compared to Dapper/Pipes
  • Users:
    – Programmers (20)
    – Non‐programmers (3)

SLIDE 31

EVALUATION: SETUP

  • Familiarization
    ‐ Programmers (2 assignments on DP)
    ‐ Review package
    ‐ 30‐minute tutorial
  • Practice
    ‐ 2‐3 tasks using Karma
  • Test (3 tasks)
    ‐ Programmers: alternating between Karma and DP for each task
    ‐ Non‐programmers: use only Karma
    ‐ Screens are recorded using video capture software

5 minute cut‐off time

SLIDE 32

EVALUATION: TASKS

  • Claim 1: Users with no programming experience can build all four Mashup types
  • Claim 2: When the Mashup subtask is difficult, Karma takes less time to complete that subtask
  • Claim 3: Overall, the user takes less time to build the same Mashup in Karma compared to Dapper/Pipes

Task No. | Mashup Type        | Data Extraction | Source Modeling | Data Cleaning | Data Integration
1        | 1 (1 source)       | Moderate        | Simple          | Difficult     | N/A
2        | 2,3 (union+form)   | Difficult       | Simple          | Simple        | Union (simple)
3        | 4 (join 2 sources) | Simple          | Simple          | N/A           | Join (difficult)

SLIDE 33

Claim 1: Users with no programming experience can build all four Mashup types

SLIDE 34

EVALUATION: NON‐PROGRAMMERS

SLIDE 35

Claim 2: Karma takes less time to complete each subtask

SLIDE 36

EVALUATION: EXTRACTION

  • As the extraction task gets more difficult, Dapper/Pipes
    – takes longer
    – has more subjects failing to complete the task (11% for moderate and 25% for difficult)

[Chart legend: Dapper/Pipes, Karma, Karma (programmer)]

SLIDE 37

EVALUATION: SOURCE MODELING

  • Karma performed worse in task 1 and task 2
    – only 30 sec difference
    – subjects take time selecting attributes
    – the saving will be realized in the data integration step
  • Karma performed better in task 3 because it can automatically identify the attribute

[Chart legend: Dapper/Pipes, Karma]

SLIDE 38

EVALUATION: DATA CLEANING

  • Karma performed better in both tasks
  • When the cleaning task gets harder, more subjects fail in Dapper/Pipes (35% for simple and 83% for hard)

[Chart legend: Dapper/Pipes, Karma]

SLIDE 39

EVALUATION: DATA INTEGRATION

  • Because of the table structure, subjects can specify a union indirectly by dropping data into the right cell
  • The time spent in the source modeling step allows Karma to suggest the linking source
  • Dapper/Pipes: 30% fail in the union case and 95% fail in the join case

[Chart legend: Dapper/Pipes, Karma]

SLIDE 40

Claim 3: Overall, the user takes less time to build the same Mashup in Karma compared to Dapper/Pipes

SLIDE 41

EVALUATION: OVERALL

[Chart legend: Dapper/Pipes, Karma]

SLIDE 42

EVALUATION: AVERAGE

[Chart: Dapper/Pipes vs. Karma; ratio labels on the chart: 2.22x, 0.67x, 4.16x, 6.49x, 3.32x]

SLIDE 43

RELATED WORK: MASHUP BUILDING TOOLS

Mashup types: 1: Extraction, 2: Union, 3: Form‐based Interaction, 4: Join

[Comparison table; only the remarks survive]

  • Require an expert
  • Mainly focus on extraction / linear
  • Q/A approach / linear / scalability
  • Widgets
  • Fancier UI / more widgets
  • Fewer widgets / confusion on workflow
  • Early work. Focus on DOM, too basic
  • Create points on map
  • RDF / manually specify data int
  • Tuple = card. Drawing links for relations

SLIDE 44

RELATED WORK: DATA EXTRACTION

  • Automatic extraction: tables and lists only
    – RoadRunner (exploits HTML structure) [Crescenzi et al., 2001]
    – Adel (grammar induction to detect rows) [Lerman+ 2001]
    – VisualWeb (OCR technique to detect tables) [Gatterbauer+ 2007]
  • Semi‐automatic: require more labeled examples
    – WIEN (inductive, less expressive than Stalker) [Kushmerick 1997]
    – Stalker (co‐testing) [Muslea+ 1999]
    – SoftMealy (finite state transducer) [Hsu 1998]
    – WHISK (rigid format, exact delimiter) [Soderland 1998]
  • DOM: rely on well‐formed HTML and less labeling
    – Simile [Huynh+ 2005]
    – Dapper
    – Interactive Wrapper Generation (ML + prediction on DOM) [Irmak+ 2006]
    – PLOW (adds natural language) [Allen+ 2007]
    – Cards [Dontcheva+ 2007]
    – Karma [Tuchinda+ 2008]

SLIDE 45

RELATED WORK: SOURCE MODELING

  • 1:1 mapping, N:M mapping
    – Schema‐level match
      • TranScm [Milo+ 98]
      • DIKE [Palopoli+ 99]
      • Artemis [Castano+ 01]
      • Delta [Clifton+ 97]
    – +Instance‐based matcher
      • SemInt [Li 00]
      • LSD [Doan 01]
      • ILA [Etzioni 95]
      • iMAP [Dhamankar 04]
      • Clio (interactive) [Ling 01]
      • Inducing Source Description [Carman 07]
  • Karma leverages existing techniques to narrow candidate matches
    – String Similarities [Cohen+ 2003]

SLIDE 46

RELATED WORK: DATA CLEANING

  • Commercial tools: focus on writing transformations
    – ACR/Data, Migration Architect [Chaudhuri+ 1997]
  • Discrepancy detection: used as a stepping stone for record linkage and cleaning systems
    – Levenshtein distance [Needleman+ 70]
    – Vector based [Baeza‐Yates+ 99]
    – EM [Ristad+ 98]
    – SVM [Bilenko+ 03]
  • Record linkage & cleaning systems: focus on ranking [Winkler 06]
    – Fuzzy Match [Chaudhuri+ 03]
    – Apollo [Michalowski+ 05]
    – Phoebus [Michelson+ 07]
    – Potter’s Wheel [Raman+ 01]
  • Karma
    – Gains reference sources through the source modeling process
    – Provides predefined transformations

SLIDE 47

RELATED WORK: DATA INTEGRATION

  • Universal Relation: makes it easier to formulate the query, but users still need to formulate the query [Ullman 1980, 1988]
  • Query by example: need to know which data sources to use, and the query may not return results
    – QBE [Zloof 1975]
  • Retrieval by formulation: need to understand the domain model to formulate a partial description
    – Helgon [Fischer 1989], RABBIT [Williams 1982]
  • Graphical Query Language: users still need to navigate through sources (graphs)
    – Gql [Benzi 1998, Haw 1994, Papantonakis 1988]
  • Question‐Answering Techniques: understanding of database operations required
    – Agent Wizard [Tuchinda+ 2004]
  • Interactive Schema/data integration: understanding of source schema required
    – Clio [Ling 01]
  • Karma is based on Programming by Demonstration [Cypher 2001; Lau 2001]

SLIDE 48

CONCLUSION

  • Mashups are a fast growing area
    – Need an efficient way for casual web users to build them
  • Contributions
    – A PBD approach that uses a single table for building a Mashup
    – An integrated approach that solves the various Mashup building issues
    – A query formulation technique that allows users to specify examples to build complicated queries
  • Evaluated the validity of the Karma approach
    – Subjects were able to complete Mashup building tasks in Karma
    – The overall improvement is at least a factor of 3.5

SLIDE 49

FUTURE WORK

  • Learn and generalize over the task
    – Store the integration plan so that it can be reexecuted on current data
  • Support the integration of geospatial data types (i.e., vector layers, raster layers)
  • Improve the techniques for automatic source modeling
  • Learn new transformations from examples for data cleaning

SLIDE 50

PAPERS

  • Building geospatial mashups to visualize information for crisis management. Shubham Gupta and Craig A. Knoblock. In Proceedings of the 7th International Conference on Information Systems for Crisis Response and Management, 2010.
  • Interactive data integration through smart copy & paste. Zachary G. Ives, Craig A. Knoblock, Steven Minton, Marie Jacob, Partha Pratim Talukdar, Rattapoom Tuchinda, Jose Luis Ambite, Maria Muslea, and Cenk Gazen. Fourth Biennial Conference on Innovative Data Systems Research (CIDR), 2009.
  • Building mashups by example. Rattapoom Tuchinda, Pedro Szekely, and Craig A. Knoblock. In Proceedings of the 2008 International Conference on Intelligent User Interfaces, 2008.
  • Building data integration queries by demonstration. Rattapoom Tuchinda, Pedro Szekely, and Craig A. Knoblock. In Proceedings of the International Conference on Intelligent User Interfaces, 2007.