DIADEM: data extraction methodology. Web data as you want it



slide-1
SLIDE 1

W E L C O M E

1

DIADEM

domain-centric intelligent automated data extraction methodology

Web data as you want it

slide-2
SLIDE 2

T E A M

2

slide-3
SLIDE 3

I N T R O D U C T I O N

3

Tim Furche
Stefano Ortona
Cheng Wang
now Facebook
Giorgio Orsi

Poster: Session II, № 57, today at 17:15-19:00

Demo paper: WaDaR, today at 10:30-12:00

slide-4
SLIDE 4

What? Data Extraction

H O W: T E C H N O L O G Y & T E A M

4

ref-code  postcode  bedrooms  bathrooms  available   price
33453     OX2 6AR   3         2          15/10/2013  £1280 pcm
33433     OX4 7DG   2         1          18/04/2013  £995 pcm
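As a minimal sketch of what "data extraction" delivers here, the two example rows can be turned into typed records. The `Listing` class, its field names, and the `parse_listing` helper are illustrative assumptions, not part of DIADEM; the date is read as day/month/year and the price as pounds per calendar month, as the example values suggest.

```python
import re
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class Listing:
    """Hypothetical typed form of one extracted real-estate record."""
    ref_code: str
    postcode: str
    bedrooms: int
    bathrooms: int
    available: date
    price_gbp_pcm: int


def parse_listing(ref_code, postcode, bedrooms, bathrooms, available, price):
    """Convert the raw string cells of one row into a Listing."""
    match = re.fullmatch(r"£(\d+) pcm", price)
    if match is None:
        raise ValueError(f"unrecognised price: {price!r}")
    return Listing(
        ref_code=ref_code,
        postcode=postcode,
        bedrooms=int(bedrooms),
        bathrooms=int(bathrooms),
        # Day-first format, e.g. 15/10/2013.
        available=datetime.strptime(available, "%d/%m/%Y").date(),
        price_gbp_pcm=int(match.group(1)),
    )


# The two rows shown on the slide.
rows = [
    ("33453", "OX2 6AR", "3", "2", "15/10/2013", "£1280 pcm"),
    ("33433", "OX4 7DG", "2", "1", "18/04/2013", "£995 pcm"),
]
listings = [parse_listing(*row) for row in rows]
```

Typed fields make later checks, such as filtering by price or availability date, straightforward.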

slide-5
SLIDE 5

What? Data Extraction

H O W: T E C H N O L O G Y & T E A M

5

ref-code  postcode  bedrooms  bathrooms  available   price
33453     OX2 6AR   3         2          15/10/2013  £1280 pcm
33433     OX4 7DG   2         1          18/04/2013  £995 pcm

>10000

slide-6
SLIDE 6

6

“For many kinds of information one has to extract from thousands of sites in order to build a comprehensive database”

– Nilesh Dalvi et al., VLDB 2012

slide-7
SLIDE 7

Result Summary

H O W: T E C H N O L O G Y & T E A M

7

500-5000
Sites for each domain

6
Domains (real estate, used cars, locations, electronics, …)

>96%
Precision of extracted primary attributes

85-95%
Perfect recall wrappers (consistently in all domains)

slide-8
SLIDE 8

DIADEM: Many Domains

D I A D E M

8

○ Domains considered from 2014–2015

◗ Real estate UK & US
◗ Used cars UK & US
◗ Products:

  • consumer electronics (Singapore, Malaysia)
  • fashion (UK)

◗ Locations:

  • restaurant (chains & open web, US)
  • hotels (US)
slide-9
SLIDE 9

DIADEM: Process

D I A D E M

9

[Figure: process diagram. Input: site URL. Phases: exploration, induction, extraction. Components: ontology; record & attribute identification; form understanding & filling.]

slide-10
SLIDE 10

D I A D E M E X A M P L E

10


slide-11
SLIDE 11

D I A D E M E X A M P L E

11


slide-12
SLIDE 12

D I A D E M E X A M P L E

12

[Screenshot callouts: Up to £250,000; iFrame with results <250k; Contact Form]

slide-13
SLIDE 13

Strong Principles

H O W: T E C H N O L O G Y & T E A M

13

1 ROSeAnn (VLDB’14): Entity extraction from text and structure

2 OPAL (WWW’12, VLDBJ’13): Form understanding & filling

3 AMBER (under submission): Record identification for listing pages

4 OXPath (VLDB’11, VLDBJ’13): Extraction language

5 WaDaR (demo @ VLDB’15): Joint wrapper and relation repair

6 DIADEM (VLDB’14): World-first accurate, automatic full-site extraction system

slide-14
SLIDE 14

Control Flow: guarded FST

D I A D E M

14

[Figure: control-flow diagram. Labels: Stage 1: Init Page; Decision: Which action to take?; crawler, next link, filling, back, iFrame; Browser Interaction; success; failure; Stage 5: Finalize.]

Guarded FSTs: exponentially more succinct than plain FSTs.
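To make the idea of a guarded FST concrete, here is a small sketch of a finite-state transducer whose transitions are enabled by predicates (guards) over a set of facts rather than by single input symbols. The class names, state names, fact names, and output strings are all hypothetical; they only echo labels from the slide and are not DIADEM's actual controller.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple

Facts = FrozenSet[str]  # facts currently known about the page


@dataclass(frozen=True)
class Transition:
    source: str
    guard: Callable[[Facts], bool]  # enabled when the guard holds
    output: str                     # action emitted by the transducer
    target: str


class GuardedFST:
    def __init__(self, start: str, transitions: List[Transition]):
        self.start = start
        self.transitions = transitions

    def step(self, state: str, facts: Facts) -> Tuple[str, str]:
        """Fire the first enabled transition leaving `state`."""
        for t in self.transitions:
            if t.source == state and t.guard(facts):
                return t.output, t.target
        raise LookupError(f"no enabled transition from {state!r}")

    def run(self, observations: List[Facts]) -> List[str]:
        """Consume one fact set per step and collect the emitted actions."""
        state, outputs = self.start, []
        for facts in observations:
            output, state = self.step(state, facts)
            outputs.append(output)
        return outputs


# Hypothetical controller: initialise, then pick an action from page facts.
fst = GuardedFST("init", [
    Transition("init", lambda f: True, "load page", "decide"),
    Transition("decide", lambda f: "form" in f, "fill form", "decide"),
    Transition("decide", lambda f: "next_link" in f, "follow next link", "decide"),
    Transition("decide", lambda f: True, "finalize", "final"),
])

actions = fst.run([
    frozenset(),
    frozenset({"form"}),
    frozenset({"next_link"}),
    frozenset(),
])
```

A single guard such as `"form" in f` covers every fact set that contains `form`; a plain FST whose input alphabet is the set of possible fact sets would need a separate transition for each of them, which is one way to read the succinctness claim.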

slide-15
SLIDE 15

Control Flow: guarded FST

D I A D E M

15

[Figure: control-flow diagram. Labels: Stage 1: Page Init; Stage 3: Crawling; field set selection, behavior selection, value selection, field iteration, browser interaction, modification classifier.]

Guarded FSTs as relational transducers: scale to hundreds of states and millions of facts.
slide-16
SLIDE 16

R E S U LT PA G E P H E N O M E N O L O G Y

16

2: Multi-node location
GRID Layout
1: Single-level GRID
innesmackay.com
1: Outlier record
2: Optional bathroom
GRID Layout
remax.co.uk
GRID Layout
1: Missing record (?)
motorclick.co.uk
2: Record without price and make
GRID Layout
perrys.co.uk
1: Multiple prices
2: Multi-attribute title
1: Interspersed ad
LIST Layout
3: Location in title and separate
adzuna.co.uk
LIST Layout
1: Frequent description attribute
girardlettings.co.uk
LIST Layout
1: Multiple prices
auto100.co.uk
2: Multi-attribute title
LIST Layout
1: Many attributes
2: Structured location
3: Unit of measure
finders.co.uk

slide-17
SLIDE 17

H O W: T E C H N O L O G Y & T E A M

17

http://diadem.cs.ox.ac.uk/demo

slide-18
SLIDE 18

Full-site extraction

D I A D E M A N A LY S I S

18

5

                    wrapper effective  wrong or missing data  no data
UK real estate      91%                7%                     2%
Oxford real estate  90%                6%                     4%
ViNTs10             4%                 5%                     91%
UK used cars        93%                4%                     3%
US real estate      90%                5%                     5%

slide-19
SLIDE 19

Competition: Segmentation

D I A D E M A N A LY S I S

19

[Chart: Records. Precision and recall (0% to 100%) on the RE-RND and UC-RND datasets for DIADEM, ViNTs, DEPTA, and MDR. Bar labels in extraction order: 99%, 98%, 95%, 88%, 84%, 77%, 56%, 38%, 99%, 97%, 81%, 78%, 58%, 53%, 72%, 48%.]

Conclusion: Do only a part of the job, and poorly
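For reference, precision and recall as compared on this slide, together with the F1 score that combines them, can be computed from a set of extracted records and a gold-standard set. This is a generic sketch of the standard definitions; the function name and the toy record identifiers are assumptions, and the deck does not describe how the evaluation was implemented.

```python
def precision_recall_f1(extracted: set, gold: set):
    """Standard set-based precision, recall, and F1 (harmonic mean)."""
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: three of four extracted records are in the gold standard,
# and one gold record was missed.
gold = {"r1", "r2", "r3", "r4"}
extracted = {"r1", "r2", "r3", "x9"}
p, r, f1 = precision_recall_f1(extracted, gold)
```

Precision penalises spurious records such as `x9`, while recall penalises missed ones such as `r4`.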

slide-20
SLIDE 20

Competition: Attributes

D I A D E M A N A LY S I S

20

[Chart: Attributes. Precision and recall (0% to 100%) on the RE-RND and UC-RND datasets for DIADEM, DEPTA, and RoadRunner. Bar labels in extraction order: 95%, 97%, 84%, 83%, 48%, 42%, 95%, 96%, 58%, 74%, 60%, 65%.]

Conclusion: Do only a part of the job, and poorly

slide-21
SLIDE 21

Competition: Forms

D I A D E M A N A LY S I S

21

ICQ dataset      HA [14]  ExQ [41]  StatParser [36]  DIADEM [17]
F1 for labeling  92%      96%       96%              98%

Only labelling, no classification or filling

slide-22
SLIDE 22

Performance: Analysis Phase

D I A D E M A N A LY S I S

22

[Plot, RE-FULL dataset. Axes: time (minutes); visited pages.]

slide-23
SLIDE 23

Performance: Extraction Phase

D I A D E M A N A LY S I S

23

[Plot. Axes: number of records; time (seconds).]

slide-24
SLIDE 24

DIADEM extracts from the web as it is

“It's a cruel and random world, but the chaos is all so beautiful.”

– Hiromu Arakawa

slide-25
SLIDE 25

DIADEM extracts full sites automatically

Form filling, crawling, object extraction, segmentation, alignment, wrapper induction, pagination

slide-26
SLIDE 26

DIADEM extracts full domains with no per-site supervision at all

slide-27
SLIDE 27


27

slide-28
SLIDE 28

Chain locations

H O W: T E C H N O L O G Y & T E A M

28

○ Following a presentation of DIADEM

◗ they didn’t believe that this works

○ We need locations of restaurant chains all over
○ Challenge: what can you do in 2-3 weeks?

◗ from a given list of some 300 chains

technology evaluation by a US tech company

slide-29
SLIDE 29

Chain locations

H O W: T E C H N O L O G Y & T E A M

29

160,000

Restaurant chain locations, from over 295 chains including all major chains

85%

Effective wrappers, all automatically maintained

95%

Precision of extracted location information

30

days from start to finish, 3-person team

technology evaluation by large US tech company

slide-30
SLIDE 30

30

PER ATTRIBUTE ACCURACY

attribute            Correct  Wrong  Wrong & empty  Good scrape, bad data / No benchmark  Accuracy  Precision
category             826                            9                                     100.00%   100.00%
city                 829      6                                                           99.28%    99.28%
closed               11       3      3              821                                   78.57%    100.00%
hours                382      446    446            5 / 2                                 46.14%    100.00%
latlong              745      89     88             1                                     89.33%    99.87%
located_in           17       59     59             1 / 758                               22.37%    100.00%
name                 831      4                                                           99.52%    99.52%
phone                709      126    117                                                  84.91%    98.75%
postal_code          803      9      8              23                                    98.89%    99.88%
street_address       803      16     2              16                                    98.05%    98.29%
website_0            818      6                     9 / 2                                 99.27%    99.27%
mean over attributes                                                                      83.30%    99.53%

H O W: T E C H N O L O G Y & T E A M

This evaluation is done by independent, external evaluators on a sample of more than 830 locations.
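The two percentage columns of the per-attribute table can be reproduced from the counts. The formulas below are inferred from the numbers rather than stated on the slide: accuracy appears to be correct / (correct + wrong), and precision appears to exclude the "wrong & empty" cases from the denominator. The function names are illustrative.

```python
def accuracy(correct: int, wrong: int) -> float:
    """Share of benchmarked values that were extracted correctly."""
    return correct / (correct + wrong)


def precision(correct: int, wrong: int, wrong_and_empty: int) -> float:
    """Like accuracy, but ignoring values that are wrong only because
    nothing was extracted (assumed reading of "Wrong & empty")."""
    return correct / (correct + wrong - wrong_and_empty)


# (correct, wrong, wrong & empty) as listed on the slide.
counts = {
    "phone": (709, 126, 117),
    "latlong": (745, 89, 88),
    "street_address": (803, 16, 2),
}

for name, (c, w, e) in counts.items():
    print(f"{name}: accuracy {accuracy(c, w):.2%}, "
          f"precision {precision(c, w, e):.2%}")
```

Under this reading, a low accuracy paired with a high precision (as for `hours` or `located_in`) indicates that the extractor mostly returned nothing rather than returning wrong values.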

slide-31
SLIDE 31

More

D I A D E M

31

Demo: http://diadem.cs.ox.ac.uk/vldb15/demo.mp4
Evaluation: http://diadem.cs.ox.ac.uk/evaluation/14/02/
Slides: http://diadem.cs.ox.ac.uk/vldb15/slides.pdf

Selected papers:

Tim Furche, Georg Gottlob, Giovanni Grasso, Xiaonan Guo, Giorgio Orsi, Christian Schallhart, Cheng Wang: DIADEM: Thousands of Websites to a Single Database. PVLDB 7(14): 1845-1856 (2014)

Tim Furche, Georg Gottlob, Giovanni Grasso, Christian Schallhart, Andrew Jon Sellers: OXPath: A language for scalable data extraction, automation, and crawling on the deep web. VLDB J. 22(1): 47-72 (2013)

Tim Furche, Georg Gottlob, Giovanni Grasso, Giorgio Orsi, Christian Schallhart, Cheng Wang: Little Knowledge Rules the Web: Domain-Centric Result Page Extraction. RR 2011: 61-76

Jens Lehmann, Tim Furche, Giovanni Grasso, Axel-Cyrille Ngonga Ngomo, Christian Schallhart, Andrew Jon Sellers, Christina Unger, Lorenz Bühmann, Daniel Gerber, Konrad Höffner, David Liu, Sören Auer: DEQA: Deep Web Extraction for Question Answering. International Semantic Web Conference (2) 2012: 131-147

Luying Chen, Stefano Ortona, Giorgio Orsi, Michael Benedikt: ROSeAnn: Reconciling Opinions of Semantic Annotators. PVLDB 6(12): 1238-1241 (2013)

slide-32
SLIDE 32

Summary

D I A D E M

32

You want the location of all the restaurants in the US?
… or the price of all the houses in the UK?

amenities, opening times, offered services

hotels, hairdressers, rock concerts

UK, Brazil, Germany, Indonesia, World

terms, features, availability

rental cars, headphones, mortgage loans

from 100,000s restaurant, real estate, used car, … websites

yielding 1,000,000s products, businesses, places, and other entities

at >95% precision

>75-95% sources with 100% recall

Delivered at little human effort with automatic maintenance

2-3 weeks for any vertical once

with just 3 engineers

independently verified

DIADEM in less than 30 words

▪ automated data extraction effectively covering entire verticals (100k+ sources)
▪ unrivalled performance in extracting entities, including places, people, products
▪ highly disruptive technology with value for even established players

… …

slide-33
SLIDE 33

Open web locations

H O W: T E C H N O L O G Y & T E A M

33

110,000 sources

                      All sources  Sources with data
Effective wrapper     73.0%        86.9%
Partial data          2.7%         3.2%
DIADEM failure        8.3%         9.9%
No data               16.0%
  invalid source      4.8%
  invalid technology  7.4%
  invalid location    3.9%
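The "Sources with data" figures are consistent with rescaling the "All sources" figures by the 84.0% of sources that returned any data (100% minus the 16.0% "No data" share). That reading is an inference from the numbers, and the function name below is illustrative.

```python
def share_of_sources_with_data(pct_of_all: float, pct_no_data: float) -> float:
    """Rescale a percentage of all sources to a percentage of the
    sources that returned data."""
    return 100 * pct_of_all / (100 - pct_no_data)


NO_DATA = 16.0  # percent of all sources with no data

effective = share_of_sources_with_data(73.0, NO_DATA)
partial = share_of_sources_with_data(2.7, NO_DATA)
failure = share_of_sources_with_data(8.3, NO_DATA)

print(f"effective {effective:.1f}%, partial {partial:.1f}%, failure {failure:.1f}%")
```

The three rescaled shares sum to 100%, since together they cover all sources that returned data.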

slide-34
SLIDE 34

Open web locations

H O W: T E C H N O L O G Y & T E A M

34

Record-level Recall

Only for sources

  • that have at least one result
  • and that are non-chain restaurants

                                                Wrapidity dataset   Facebook dataset
                                                #       %           #       %
Total extracted                                 81505   100.00%     11596   100.00%
  • - correctly identified                      78857   96.75%      11150   96.15%
  • --- superfluous, but correct variant records 2648    3.25%       669     5.77%
        e.g., different spellings of the same street address
  • - incorrect                                 294     0.36%               0.00%
  • - missing                                   2942    3.61%       446     3.85%

Wrapidity dataset: sourced from Zagat, Urbanspoon; verified w/ Google search
Facebook dataset: entities tagged restaurant that do not have complete or any location information in FB graph