High Precision Web Extraction using Site Knowledge



slide-1
SLIDE 1

High Precision Web Extraction using Site Knowledge


Meghana Kshirsagar
Rajeev Rastogi
Sandeepkumar Satpal
Srinivasan H Sengamedu
Venu Satuluri


slide-2
SLIDE 2

Outline


  • Motivation
  • Problem Definition
  • Proposed Approach
    – Site-Knowledge
    – Segmentation
    – Segment Label Selection
    – Node Label Correction
    – Extensions
  • Experimental Results
  • Conclusions

slide-3
SLIDE 3

Information Extraction: What & Why?


Name          Price    Rating  Num Rating  Resolution  Lens
Canon EOS 5D  2399.99  5       140         12.8        Body Only

slide-4
SLIDE 4

Approaches to Extraction: Wrapper


[Figure: wrapper example with nodes X1 and X2]

slide-5
SLIDE 5

Structural Changes


Nymag.com, Yelp.com
Site-specific training data is required

slide-6
SLIDE 6

IE as a Labeling Problem


Text: Chimichurri Grill, New York, 10036, (212) 586-8655, Phone:
Labels: Name, Address, Address, Noise, Phone


Input: Web Page
Output: Labels for different parts of the page
Labels can be Restaurant Name, Address, Phone, Rating, Noise, etc.


slide-7
SLIDE 7

Features


  • Regex features
    – isAllCapsWord
    – hasTwoContinuousCaps
    – isDay
    – 1-2digitNumber
    – 3digitNumber
    – 4digitNumber
    – 5digitNumber
    – >5digitNumber
    – dashBetweenDigits
    – isAlpha
    – isNumber
  • Node-level features
    – noOfWords>20
    – noOfWords>50
    – noOfWords>100
    – propOfTitleCase<0.2
    – propOfTitleCase>0.8
    – overlapWithPageTitle
    – prefixOverlapWithPageTitle
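
The feature names on this slide could be computed roughly as follows. This is a minimal sketch, not the authors' implementation: the slide gives only the names, so the exact definitions (the day-name pattern, what counts as "overlap" with the page title, the function names `token_features` and `node_features`) are assumptions made for illustration.

```python
import re

# ASSUMPTION: "isDay" matches English day names and common abbreviations.
DAY_RE = re.compile(
    r"^(mon|tues?|wed(nes)?|thu(rs)?|fri|sat(ur)?|sun)(day)?$", re.IGNORECASE
)


def token_features(token: str) -> dict:
    """Regex-style features for a single token (names follow the slide)."""
    digits = len(token) if token.isdigit() else 0
    return {
        "isAllCapsWord": token.isalpha() and token.isupper(),
        "hasTwoContinuousCaps": re.search(r"[A-Z]{2}", token) is not None,
        "isDay": DAY_RE.match(token) is not None,
        "1-2digitNumber": 1 <= digits <= 2,
        "3digitNumber": digits == 3,
        "4digitNumber": digits == 4,
        "5digitNumber": digits == 5,
        ">5digitNumber": digits > 5,
        "dashBetweenDigits": re.search(r"\d-\d", token) is not None,
        "isAlpha": token.isalpha(),
        "isNumber": token.isdigit(),
    }


def node_features(text: str, page_title: str) -> dict:
    """Node-level features for the text of one DOM node."""
    words = text.split()
    title_words = page_title.split()
    n = len(words)
    # Proportion of words written in Title Case.
    prop_title = sum(w.istitle() for w in words) / n if n else 0.0
    title_set = {w.lower() for w in title_words}
    # ASSUMPTION: "overlap" means sharing at least one word with the title;
    # "prefix overlap" means the first words agree.
    overlap = any(w.lower() in title_set for w in words)
    prefix = bool(words and title_words
                  and words[0].lower() == title_words[0].lower())
    return {
        "noOfWords>20": n > 20,
        "noOfWords>50": n > 50,
        "noOfWords>100": n > 100,
        "propOfTitleCase<0.2": prop_title < 0.2,
        "propOfTitleCase>0.8": prop_title > 0.8,
        "overlapWithPageTitle": overlap,
        "prefixOverlapWithPageTitle": prefix,
    }
```

For example, `token_features("10036")` switches on `5digitNumber` and `isNumber`, which is the kind of evidence a model can use for the zip-code part of an address.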


slide-8
SLIDE 8

ML Models for Labeling


  • Classification
  • Sequential Models: HMM, CRF
  • FOPC + uncertainty: Markov Logic Networks


[Figure: sequence model with tokens/features X1 to X6 and labels Y1 to Y6]

slide-9
SLIDE 9

Conditional Random Fields


[Figure: linear chain with tokens X1 to X6 and labels Y1 to Y6]


  • Features are defined over (x,y): f(x,y)
    – f([0-9]*, Phone)
    – f(New York, Address)
  • Conditional random field is a log-linear function over these features
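
To make "log-linear function over these features" concrete, the sketch below scores a label sequence as a weighted sum of active features and normalizes by brute force over all labelings, i.e. p(y|x) is proportional to exp(sum of w_k f_k(x,y)). It is a toy under stated assumptions: the label set, the two feature templates (mirroring f([0-9]*, Phone) and f(New York, Address) from the slide), the transition feature, and all weights are made up for illustration. A real CRF learns the weights and uses dynamic programming instead of enumeration.

```python
import itertools
import math

LABELS = ["Name", "Address", "Phone", "Noise"]

# ASSUMPTION: hand-picked weights; a trained CRF would learn these.
WEIGHTS = {
    ("digits", "Phone"): 2.0,          # like f([0-9]*, Phone)
    ("city_word", "Address"): 2.0,     # like f(New York, Address)
    ("trans", "Address", "Address"): 0.5,
}


def features(prev_label, label, token):
    """Active features f(x, y) at one position of the sequence."""
    feats = [("trans", prev_label, label)]
    if token.replace("-", "").isdigit():
        feats.append(("digits", label))
    if token in {"New", "York"}:
        feats.append(("city_word", label))
    return feats


def score(tokens, labels):
    """Unnormalized log-score: sum of weights of all active features."""
    total, prev = 0.0, "START"
    for tok, lab in zip(tokens, labels):
        total += sum(WEIGHTS.get(f, 0.0) for f in features(prev, lab, tok))
        prev = lab
    return total


def conditional_prob(tokens, labels):
    """p(y | x) = exp(score(x, y)) / Z(x), with Z computed by enumeration."""
    z = sum(math.exp(score(tokens, ys))
            for ys in itertools.product(LABELS, repeat=len(tokens)))
    return math.exp(score(tokens, labels)) / z


def best_labeling(tokens):
    """Highest-scoring label sequence (brute force; fine for tiny inputs)."""
    return max(itertools.product(LABELS, repeat=len(tokens)),
               key=lambda ys: score(tokens, ys))
```

With these weights, `best_labeling(["New", "York", "586-8655"])` picks Address, Address, Phone.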

slide-10
SLIDE 10

Approaches to Extraction: Summary


  • Wrapper
    – High Precision (> 99%)
    – Large editorial requirement
  • Machine Learning Models
    – Low editorial requirements
    – Low precision due to variable site structure and abundance of noise in web pages

slide-11
SLIDE 11

Problem Definition


  • Problem
    – Extract entities from Web pages with high precision (> 99%) and very low editorial requirements
  • Approach
    – Use CRFs for initial labeling
    – Apply Site Knowledge to improve the precision on a small number of pages
    – Construct Wrappers using these labels
  • Site Knowledge
    – Uniqueness: Attributes like Name, Address, Hours are unique per page.
    – Proximity: Attributes describing product/business are close to each other in a page.
    – Sequentiality: Attributes in a site occur in the same sequence in its web pages.

slide-12
SLIDE 12

Our
Approach



slide-13
SLIDE 13

Static Text in Scripted Pages


Static Text

slide-14
SLIDE 14

Segmentation


  • Static Node
    – Same (text,xpath) in majority of pages
  • Segmenting Web page
    – Partition Web page into Segments using Static nodes


Segmented Sequence
  • [ Chimichurri Grill ]
  • [ based on 17 reviews ]
  • { Rating details }
  • { Categories }
  • [ Steakhouses, Argentine ]
  • { Neighbourhoods }
  • [ Theatre district, Kitchen ]
  • [ 603 9th Ave ]
  • [ ….. ]
  • [ (212) 586-8655 ]
  • [ www.chimichurrigrill.com ]
  • { Nearest Transit: }
  • [ 8th Ave ….. ]
  • [ …… ]
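
The two definitions on this slide can be sketched in a few lines. This is an illustration under assumptions, not the authors' code: a page is represented as a list of (text, xpath) pairs in document order, "majority" is taken to mean more than half of the pages, and the function names `find_static_nodes` and `segment_page` are hypothetical. The sample pages are invented toy data.

```python
from collections import Counter


def find_static_nodes(pages, threshold=0.5):
    """Return the (text, xpath) pairs that occur in a majority of pages.

    ASSUMPTION: "majority" means strictly more than `threshold` of pages.
    """
    counts = Counter()
    for page in pages:
        counts.update(set(page))  # count each node once per page
    return {node for node, c in counts.items() if c / len(pages) > threshold}


def segment_page(page, static_nodes):
    """Partition a page into segments, using static nodes as boundaries.

    Each segment is a run of consecutive non-static nodes.
    """
    segments, current = [], []
    for node in page:
        if node in static_nodes:
            if current:
                segments.append(current)
                current = []
        else:
            current.append(node)
    if current:
        segments.append(current)
    return segments


# Toy pages from one hypothetical site (xpaths are made up).
pages = [
    [("Chimichurri Grill", "/h1"), ("Categories", "/div[1]/b"),
     ("Steakhouses, Argentine", "/div[1]/span"),
     ("Neighbourhoods", "/div[2]/b"), ("Theatre district", "/div[2]/span")],
    [("Pizza Place", "/h1"), ("Categories", "/div[1]/b"),
     ("Pizza", "/div[1]/span"),
     ("Neighbourhoods", "/div[2]/b"), ("West Village", "/div[2]/span")],
    [("Sushi Bar", "/h1"), ("Categories", "/div[1]/b"),
     ("Japanese", "/div[1]/span"),
     ("Neighbourhoods", "/div[2]/b"), ("SoHo", "/div[2]/span")],
]
static = find_static_nodes(pages)
segments = segment_page(pages[0], static)
```

Here "Categories" and "Neighbourhoods" come out as static nodes, and the first page splits into three segments, which matches the bracket and brace pattern of the segmented sequence on the slide if braces are read as static nodes.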

slide-15
SLIDE 15

Benefits of Static Text and Segmentation


  • Noise removal (40%)
  • Time required to train a model is less due to small instances
  • Better control on Precision and Recall by controlling number of Noisy segments (10%)
  • Very useful to define context

slide-16
SLIDE 16

Our
Approach



slide-17
SLIDE 17

CRF Labeling


  • Identify attribute labels at segment level
    – seg("address") = e2
    – Use Attribute Uniqueness & Proximity
  • Fix node labels
    – "Noise" -> "Address" in Segment e2
    – Use Sequentiality

slide-18
SLIDE 18

Label Correction


Segment Web Pages -> Label Segments (CRF) -> Select Segment (Uniqueness & Proximity) -> Correct Node Labels (Sequentiality)

slide-19
SLIDE 19

Segment Selection


  • Intra-page Constraint (Site Knowledge)
    – Uniqueness Constraint: Attributes like Name, Address, Hours are unique per page
    – Proximity: Attributes describing product/business are close to each other
  • Intuition is to select the segments which are in close proximity

slide-20
SLIDE 20

Segment Selection


[Figure labels: Segment Selected for Address; Noise]

slide-21
SLIDE 21

Segment Selection


  • For each attribute A, select a single segment seg(A) such that [formula missing in source] is minimum.
  • This problem is NP-hard
  • Heuristic: for each segment e, define weight w_e as [formula missing in source]
  • For each attribute A, choose the segment with minimum weight.
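
The formulas on this slide did not survive extraction, so the sketch below does not reproduce the authors' weight. It only illustrates the shape of the heuristic (give every candidate segment a weight, then pick the minimum-weight segment per attribute) under a loudly assumed definition: the weight of a candidate segment is the sum, over the other attributes, of its distance to the nearest candidate segment of that attribute, with distance measured as the difference in segment position. The names `segment_weights` and `select_segments` are hypothetical.

```python
def segment_weights(candidates):
    """Weight every candidate segment.

    candidates: dict mapping attribute -> list of segment positions (ints)
    that the CRF labeled with that attribute.

    ASSUMED weight (the slide's formula is not recoverable):
    w_e = sum over other attributes B of min distance from e to a
    candidate segment of B.
    """
    weights = {}
    for attr, segs in candidates.items():
        for e in segs:
            w = 0
            for other, other_segs in candidates.items():
                if other != attr and other_segs:
                    w += min(abs(e - o) for o in other_segs)
            weights[(attr, e)] = w
    return weights


def select_segments(candidates):
    """For each attribute, choose the candidate segment of minimum weight."""
    weights = segment_weights(candidates)
    return {
        attr: min(segs, key=lambda e: weights[(attr, e)])
        for attr, segs in candidates.items()
        if segs
    }


# Toy page: "Address" was predicted in segment 3 and, spuriously, in 12.
candidates = {"Name": [0], "Address": [3, 12], "Phone": [4]}
chosen = select_segments(candidates)
```

In the toy input, segment 3 sits next to the Name and Phone segments and gets weight 4, while segment 12 gets weight 20, so segment 3 is selected for Address. This enforces uniqueness (one segment per attribute) and favors proximity.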

slide-22
SLIDE 22

Correct Node Labels


  • Inter-page Constraint (Site Knowledge)
    – Same Template: Since pages are script generated, they follow the same template
  • Label variations across the same segment will be minor and primarily due to
    – Missing or additional nodes in certain segments
    – Incorrectly labeled nodes in some segments
  • Intuition: If the CRF model assigns correct labels in the majority of cases, then applying "wisdom of crowd" helps to correct labels

slide-23
SLIDE 23

Correct Node Labels


[Figure labels: Segment Selected for Address; Noise; Address]
Choose the majority label? Segment alignment is needed.

slide-24
SLIDE 24

Correct Node Labels – Node Alignment


s: (n1, l1, x1), (n2, l2, x2)
s’: (n’1, l’1, x’1), (n’2, l’2, x’2)
Edit operations: del(ni), ins(n’i, l’i, x’i), rep(n’i, li, l’i)


  • 1. Find the min cost edit operation sequence with every other sequence with the same id.
  • 2. For each node, choose the majority operation.
  • 3. If the selected operation is replace, then the label of the node is changed.


[Figure label: "categories"]
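
The three steps above can be sketched as an edit-distance alignment followed by a per-node vote. This is a simplified illustration, not the authors' algorithm: nodes are reduced to (x, label) pairs, all edit costs are 1, two nodes may be matched only if their x values are equal, and the names `align` and `correct_labels` are hypothetical. The slide does not state the actual cost function.

```python
from collections import Counter

INF = float("inf")


def _sub_cost(a, b):
    """Cost of matching node a with node b (ASSUMED unit costs)."""
    if a[0] != b[0]:
        return INF               # different x: cannot be matched
    return 0 if a[1] == b[1] else 1  # same x, different label: replace


def align(s, t):
    """Min-cost edit script from s to t.

    Returns one operation per node of s:
    'keep', 'del', or ('rep', new_label).
    """
    n, m = len(s), len(t)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + 1,                               # del
                cost[i][j - 1] + 1,                               # ins
                cost[i - 1][j - 1] + _sub_cost(s[i - 1], t[j - 1]),
            )
    ops = [None] * n
    i, j = n, m
    while i > 0 or j > 0:        # trace the optimal path backwards
        if (i > 0 and j > 0 and
                cost[i][j] == cost[i - 1][j - 1] + _sub_cost(s[i - 1], t[j - 1])):
            same = s[i - 1][1] == t[j - 1][1]
            ops[i - 1] = "keep" if same else ("rep", t[j - 1][1])
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops[i - 1] = "del"
            i -= 1
        else:
            j -= 1               # insertion; touches no node of s
    return ops


def correct_labels(target, others):
    """Steps 1 to 3: align with every other sequence, vote, relabel."""
    votes = [Counter() for _ in target]
    for other in others:
        for i, op in enumerate(align(target, other)):
            votes[i][op] += 1
    corrected = []
    for (x, label), v in zip(target, votes):
        op = v.most_common(1)[0][0] if v else "keep"
        if isinstance(op, tuple):    # majority operation is replace
            label = op[1]
        corrected.append((x, label))
    return corrected
```

For example, if the same segment on most other pages labels the second node Address while the target page labels it Noise, the majority operation for that node is a replace and its label becomes Address.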


slide-25
SLIDE 25

Extensions


  • Attributes Spanning Segments
    – Cluster the segments
    – Select cluster whose average weight is minimum
  • Missing Static Nodes
    – Insert Static node at appropriate position using Edit Distance

slide-26
SLIDE 26

Experiments


  • Dataset
    – 5 restaurant sites, ~100 pages from each site
    – Attributes: Name, Address, Phone, Hours, Description
    – Attribute order: NAPHD, NHAPD, NAPDH, NAPH
  • Features
    – Lexicon, Regex, Node-level
  • Experiment
    – Learn on four sites, label the fifth
  • Baselines
    – Full-page CRF
    – HCRF
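
"Learn on four sites, label the fifth" reads like a leave-one-site-out protocol. The sketch below shows that split, assuming the held-out site is rotated over all five sites; the slide does not say whether every rotation was run, and the site names are placeholders.

```python
def leave_one_site_out(sites):
    """Yield (train_sites, test_site): train on all sites but one."""
    for held_out in sites:
        train = [s for s in sites if s != held_out]
        yield train, held_out


# Placeholder site names; the slides do not list all five sites.
sites = ["site1", "site2", "site3", "site4", "site5"]
splits = list(leave_one_site_out(sites))
```

Each split trains on four sites and tests on a site the model has never seen, which is what makes the setting free of site-specific training data.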

slide-27
SLIDE 27

Results


[Figure: bar charts of Precision, Recall, and F1 for CRF, NODE, SS, and ED; axis ticks from 0.2 to 1.2]


HCRF
  • 1. Memory and compute-intensive.
  • 2. On a subset of data, the F1 was 0.262 compared to 0.5271 for the proposed approach.

slide-28
SLIDE 28

Conclusions


  • Unsupervised extraction is a challenging problem.
  • The framework proposed in this paper leverages site-knowledge to boost the precision of underlying extraction schemes.
  • When applied to CRF-based extractors, the proposed method boosts both precision and recall.

slide-29
SLIDE 29

shs@yahoo-inc.com


Questions?