DIADEM data extraction methodology: Web data as you want it

SLIDE 1

WELCOME

1

DIADEM

domain-centric intelligent automated data extraction methodology

Web data as you want it

SLIDE 2

TEAM

2

SLIDE 3

INTRODUCTION

3

Tim Furche, Stefano Ortona, Cheng Wang

now Facebook

Giorgio Orsi

Poster Session II, № 57: today at 17:15-19:00

Demo paper WaDaR: today at 10:30-12:00

SLIDE 4

What? Data Extraction

HOW: TECHNOLOGY & TEAM

4

ref-code  postcode  bedrooms  bathrooms  available   price
33453     OX2 6AR   3         2          15/10/2013  £1280 pcm
33433     OX4 7DG   2         1          18/04/2013  £995 pcm
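The rows above are still raw strings. A minimal sketch of how one such row might be typed and normalized follows; the `Listing` class and the `parse_row` helper are hypothetical names chosen for illustration and are not part of DIADEM.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Listing:
    """One extracted real-estate record (hypothetical schema)."""
    ref_code: str
    postcode: str
    bedrooms: int
    bathrooms: int
    available: date
    price_gbp_pcm: int  # rent in pounds per calendar month


def parse_row(row: dict) -> Listing:
    """Convert the raw extracted strings of one row into typed values."""
    price = re.fullmatch(r"£(\d[\d,]*) pcm", row["price"])
    if price is None:
        raise ValueError(f"unexpected price format: {row['price']!r}")
    return Listing(
        ref_code=row["ref-code"],
        postcode=row["postcode"],
        bedrooms=int(row["bedrooms"]),
        bathrooms=int(row["bathrooms"]),
        # Dates on the slide use day/month/year order.
        available=datetime.strptime(row["available"], "%d/%m/%Y").date(),
        price_gbp_pcm=int(price.group(1).replace(",", "")),
    )


raw = {"ref-code": "33453", "postcode": "OX2 6AR", "bedrooms": "3",
       "bathrooms": "2", "available": "15/10/2013", "price": "£1280 pcm"}
listing = parse_row(raw)
```

Typing the values at this point makes later comparisons across sites (for example, filtering by price or availability date) independent of how each site formats them.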

SLIDE 5

What? Data Extraction

5

ref-code  postcode  bedrooms  bathrooms  available   price
33453     OX2 6AR   3         2          15/10/2013  £1280 pcm
33433     OX4 7DG   2         1          18/04/2013  £995 pcm

>10000

SLIDE 6

6

“For many kinds of information one has to extract from thousands of sites in order to build a comprehensive database”

– Nilesh Dalvi et al., VLDB 2012

SLIDE 7

Result Summary

7

500-5000 sites for each domain

85-95% perfect-recall wrappers (consistently in all domains)

>96% precision of extracted primary attributes

6 domains (real estate, used cars, locations, electronics, …)

SLIDE 8

DIADEM: Many Domains

DIADEM

8

○ Domains considered from 2014–2015
  ◗ Real estate UK & US
  ◗ Used cars UK & US
  ◗ Products:
      consumer electronics (Singapore, Malaysia)
      fashion (UK)
  ◗ Locations:
      restaurants (chains & open web, US)
      hotels (US)

SLIDE 9

DIADEM: Process

9

[Diagram: Site URL as input; phases Exploration, Induction, Extraction; further labels: Ontology, Record & attribute identification, Form understanding & filling]

SLIDE 10

DIADEM EXAMPLE

10

[Screenshot with callouts 1-2]

SLIDE 11

11

[Screenshot with callouts 1-4]

SLIDE 12

12

[Screenshot with callouts 1-5. Labels: "Up to £250,000"; "iFrame with results <250k"; "Contact Form"]

SLIDE 13

Strong Principles

13

1 ROSeAnn (VLDB’14): entity extraction from text and structure
2 OPAL (WWW’12, VLDBJ’13): form understanding & filling
3 AMBER (under submission): record identification for listing pages
4 OXPath (VLDB’11, VLDBJ’13): extraction language
5 WaDaR (demo @ VLDB’15): joint wrapper and relation repair
6 DIADEM (VLDB’14): world-first accurate, automatic full-site extraction system

SLIDE 14

Control Flow: guarded FST

14

Decision: Which action to take?

[Diagram: guarded FST running from "Stage 1: Init Page" to "Stage 5: Finalize", with numbered elements 1-7. Browser interaction labels: crawler, next link, filling, back, iFrame. Outcomes: success, failure.]

Guarded FSTs: exponentially more succinct than plain FSTs.

SLIDE 15

Control Flow: guarded FST

15

[Diagram: detail of "Stage 1: Page Init" and "Stage 3: Crawling", each with numbered steps 1-4. Labels: field set selection, behavior selection, value selection, field iteration, browser interaction, modification classifier.]

Guarded FSTs as relational transducers: scale to hundreds of states and millions of facts.
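To make the idea of a guarded finite-state transducer concrete, here is a minimal sketch in which each transition carries a guard over the currently known facts and emits an action. The state names, fact names, and actions are illustrative assumptions loosely based on the slide labels; this is not DIADEM's actual control program.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List, Tuple

Facts = FrozenSet[str]


@dataclass(frozen=True)
class Transition:
    source: str
    guard: Callable[[Facts], bool]  # must hold on the current facts
    action: str                     # emitted output, e.g. a browser action
    target: str


def run(transitions: List[Transition], start: str, final: str,
        facts_per_step: Iterable[Facts]) -> Tuple[str, List[str]]:
    """Take the first enabled transition at each step; collect the actions."""
    state, actions = start, []
    for facts in facts_per_step:
        if state == final:
            break
        for t in transitions:
            if t.source == state and t.guard(facts):
                actions.append(t.action)
                state = t.target
                break
        else:
            raise RuntimeError(f"no enabled transition in state {state!r}")
    return state, actions


# Hypothetical control flow: earlier transitions take priority over later ones.
TRANSITIONS = [
    Transition("init", lambda f: True, "load page", "decide"),
    Transition("decide", lambda f: "unfilled_form" in f, "fill form", "decide"),
    Transition("decide", lambda f: "next_link" in f, "follow next link", "decide"),
    Transition("decide", lambda f: True, "finalize", "done"),
]

observed = [frozenset(), frozenset({"unfilled_form"}),
            frozenset({"next_link"}), frozenset()]
final_state, emitted = run(TRANSITIONS, "init", "done", observed)
```

In this sketch a single `decide` state covers every combination of facts, because the guards inspect the fact set directly. A plain FST would need to encode those combinations in its states, which is one way to read the succinctness claim on the slides.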

SLIDE 16

RESULT PAGE PHENOMENOLOGY

16

[Figure: annotated result-page screenshots in GRID and LIST layouts]

Sites shown: innesmackay.com, remax.co.uk, motorclick.co.uk, perrys.co.uk, adzuna.co.uk, girardlettings.co.uk, auto100.co.uk, finders.co.uk

Phenomena annotated: single-level grid; multi-node location; outlier record; optional bathroom; missing record (?); record without price and make; multiple prices; multi-attribute title; interspersed ad; location in title and separate; frequent description attribute; many attributes; structured location; unit of measure

SLIDE 17


17

http://diadem.cs.ox.ac.uk/demo

SLIDE 18

Full-site extraction

DIADEM ANALYSIS

18

                    wrapper effective  wrong or missing data  no data
UK real estate      91%                7%                     2%
Oxford real estate  90%                6%                     4%
ViNTs10             4%                 5%                     91%
UK used cars        93%                4%                     3%
US real estate      90%                5%                     5%

SLIDE 19

Competition: Segmentation

19

[Bar charts: precision and recall on records for DIADEM, ViNTs, DEPTA, and MDR on the RE-RND and UC-RND datasets, axes from 0% to 100%]

Bar values as extracted: 99% 98% 95% 88% 84% 77% 56% 38% 99% 97% 81% 78% 58% 53% 72% 48%

CONCLUSION: Competitors do only a part of the job, and poorly

SLIDE 20

Competition: Attributes

20

[Bar charts: precision and recall on attributes for DIADEM, DEPTA, and RoadRunner on the RE-RND and UC-RND datasets, axes from 0% to 100%]

Bar values as extracted: 95% 97% 84% 83% 48% 42% 95% 96% 58% 74% 60% 65%

CONCLUSION: Competitors do only a part of the job, and poorly

SLIDE 21

Competition: Forms

21

ICQ dataset      HA [14]  ExQ [41]  StatParser [36]  DIADEM [17]
F1 for labeling  92%      96%       96%              98%

only labelling

no classification or filling

SLIDE 22

Performance: Analysis Phase

22

[Plot: RE-FULL dataset, axes "time (minutes)" and "visited pages"]

SLIDE 23

Performance: Extraction Phase

23

[Plot: axes "number of records" and "time (seconds)"]

SLIDE 24

DIADEM extracts from the web as it is

“It's a cruel and random world, but the chaos is all so beautiful.”

– Hiromu Arakawa

SLIDE 25

DIADEM extracts full sites automatically

Form filling Crawling Object extraction Segmentation Alignment Wrapper induction Pagination

SLIDE 26

DIADEM extracts full domains, with no per-site supervision at all

SLIDE 27


27

SLIDE 28

Chain locations

28

○ Following a presentation of DIADEM
  ◗ they didn’t believe that this works
○ We need locations of restaurant chains all over
○ Challenge: what can you do in 2-3 weeks?
  ◗ from a given list of some 300 chains

technology evaluation by a US tech company

SLIDE 29

Chain locations

29

160,000 restaurant chain locations, from over 295 chains including all major chains

85% effective wrappers, all automatically maintained

95% precision of extracted location information

30 days from start to finish, 3-person team

technology evaluation by large US tech company

SLIDE 30

30

PER ATTRIBUTE ACCURACY

[Chart legend: Correct; Wrong; Wrong & empty; Good Scrape, Bad Data; No Benchmark. Measures: Accuracy, Precision]

SLIDE 31

More

31

Demo: http://diadem.cs.ox.ac.uk/vldb15/demo.mp4
Evaluation: http://diadem.cs.ox.ac.uk/evaluation/14/02/
Slides: http://diadem.cs.ox.ac.uk/vldb15/slides.pdf

Selected papers:

Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart, Cheng Wang: DIADEM: Thousands of Websites to a Single Database. PVLDB 7(14): 1845-1856 (2014)

Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Jon Sellers: OXPath: A language for scalable data extraction, automation, and crawling on the deep web. VLDB J. 22(1): 47-72 (2013)

Tim Furche, Georg Gottlob, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, Cheng Wang: Little Knowledge Rules the Web: Domain-Centric Result Page Extraction. RR 2011: 61-76

Jens Lehmann, Tim Furche, Giovanni Grasso, Axel-Cyrille Ngonga Ngomo, Christian Schallhart, Andrew Jon Sellers, Christina Unger, Lorenz Bühmann, Daniel Gerber, Konrad Höffner, David Liu, Sören Auer: DEQA: Deep Web Extraction for Question Answering. International Semantic Web Conference (2) 2012: 131-147

Luying Chen, Stefano Ortona, Giorgio Orsi, Michael Benedikt: ROSeAnn: Reconciling Opinions of Semantic Annotators. PVLDB 6(12): 1238-1241 (2013)

SLIDE 32

Summary

32

You want the location of all the restaurants in the US? … or the price of all the houses in the UK?

amenities, opening times, offered services

hotels, hairdressers, rock concerts

UK, Brazil, Germany, Indonesia, World

terms, features, availability

rental cars, headphones, mortgage loans

1,000,000s of products, businesses, places, and other entities
from 100,000s of restaurant, real estate, used car, … websites
at >95% precision
yielding >75-95% sources with 100% recall
independently verified

Delivered at little human effort with automatic maintenance: 2-3 weeks, once, for any vertical, with just 3 engineers

DIADEM in less than 30 words

▪ automated data extraction effectively covering entire verticals (100k+ sources)
▪ unrivalled performance in extracting entities, including places, people, products
▪ highly disruptive technology with value for even established players

SLIDE 33

Open web locations

33

110,000 sources

                        All sources  Sources with data
Effective wrapper       73.0%        86.9%
Partial data            2.7%         3.2%
DIADEM failure          8.3%         9.9%
No data                 16.0%
  - invalid source      4.8%
  - invalid technology  7.4%
  - invalid location    3.9%
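The "Sources with data" figures appear to be the "All sources" shares renormalized after dropping the 16.0% of sources with no data. The short check below works through that arithmetic; the `renormalize` helper and the dictionary keys are illustrative names, and the interpretation itself is an assumption inferred from the numbers.

```python
# Shares of all 110,000 sources, in percent, as given on the slide.
all_sources = {
    "effective wrapper": 73.0,
    "partial data": 2.7,
    "DIADEM failure": 8.3,
    "no data": 16.0,
}


def renormalize(shares: dict, excluded: str) -> dict:
    """Re-express the remaining shares as percentages of what is left."""
    remaining = 100.0 - shares[excluded]
    return {name: round(100.0 * value / remaining, 1)
            for name, value in shares.items() if name != excluded}


with_data = renormalize(all_sources, "no data")
```

Under this reading, 73.0 / 84.0 gives the 86.9% effective-wrapper rate among sources that have data. The three "no data" reasons sum to 16.1% against the stated 16.0%, presumably because of rounding.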

SLIDE 34

Open web locations

34

Record-level Recall

Only for sources that have at least one result and that are non-chain restaurants

                                              Wrapidity Dataset    Facebook Dataset
                                              #       %            #       %
Total extracted                               81505   100.00%      11596   100.00%
- correctly identified                        78857   96.75%       11150   96.15%
--- superfluous, but correct variant records  2648    3.25%        669     5.77%
    (e.g., different spellings of the same street address)
- incorrect                                   294     0.36%                0.00%
- missing                                     2942    3.61%        446     3.85%

Wrapidity Dataset: sourced from Zagat, Urbanspoon; verified w/ Google search
Facebook Dataset: entities tagged restaurant that do not have complete or any location information in FB graph
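The percentages in the table are consistent with each count being divided by the corresponding "Total extracted" figure. The sketch below recomputes them from the counts on the slide; the `pct` helper and the dictionary keys are illustrative names.

```python
def pct(count: int, total: int) -> float:
    """Share of `count` in `total`, as a percentage with two decimals."""
    return round(100.0 * count / total, 2)


# Counts copied from the slide; the Facebook "incorrect" count is not given.
wrapidity = {"total": 81505, "correct": 78857, "superfluous": 2648,
             "incorrect": 294, "missing": 2942}
facebook = {"total": 11596, "correct": 11150, "superfluous": 669,
            "missing": 446}

wrapidity_pct = {name: pct(count, wrapidity["total"])
                 for name, count in wrapidity.items() if name != "total"}
facebook_pct = {name: pct(count, facebook["total"])
                for name, count in facebook.items() if name != "total"}
```

Recomputing the shares this way is a quick consistency check on a flattened table, since a misread column would show up as a percentage that no longer matches its count.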