High Precision Web Extraction using Site Knowledge

1. High Precision Web Extraction using Site Knowledge
Meghana Kshirsagar, Rajeev Rastogi, Sandeepkumar Satpal, Srinivasan H Sengamedu, Venu Satuluri

2. Outline
• Motivation
• Problem Definition
• Proposed Approach
  – Site Knowledge
  – Segmentation
  – Segment Label Selection
  – Node Label Correction
  – Extensions
• Experimental Results
• Conclusions

3. Information Extraction: What & Why?
Example record extracted from a product page:

Name          | Price   | Rating | Num Ratings | Resolution | Lens
Canon EOS 5D  | 2399.99 | 5      | 140         | 12.8       | Body Only

4. Approaches to Extraction: Wrapper
[Figure: wrapper-based extraction example, with extraction rules X1 and X2]

5. Structural Changes
[Figure: Nymag.com and Yelp.com page layouts]
Site-specific training data is required.

6. IE as a Labeling Problem
Input: Web page
Output: Labels for different parts of the page
Labels can be Restaurant Name, Address, Phone, Rating, Noise, etc.
[Figure: example restaurant page with labeled fields: "Chimichurri Grill" -> Name, "New York" -> Address, "10036" -> Address, "Phone:" -> Noise, "(212) 586-8655" -> Phone]

7. Features
Regex features: isAllCapsWord, hasTwoContinuousCaps, isDay, 1-2digitNumber, 3digitNumber, 4digitNumber, 5digitNumber, >5digitNumber, dashBetweenDigits, isAlpha, isNumber
Node-level features: noOfWords>20, noOfWords>50, noOfWords>100, propOfTitleCase<0.2, propOfTitleCase>0.8, overlapWithPageTitle, prefixOverlapWithPageTitle
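A minimal sketch of how a few of these regex features might be computed over a text node's tokens; the function names and exact patterns are illustrative assumptions, not the authors' implementation:

    import re

    def isAllCapsWord(token):
        # Single all-uppercase word, e.g. "NYC"
        return token.isalpha() and token.isupper()

    def isDay(token):
        # Weekday name, e.g. "Monday" (useful for Hours attributes)
        return token.strip(".,").lower() in {
            "monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"}

    def threeDigitNumber(token):
        # Exactly three digits, e.g. an area code such as "212"
        return re.fullmatch(r"\d{3}", token) is not None

    def dashBetweenDigits(token):
        # Digit groups joined by a dash, e.g. "586-8655"
        return re.fullmatch(r"\d+-\d+", token) is not None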

8. ML Models for Labeling
• Classification
• Sequential Models: HMM, CRF
• FOPC + uncertainty: Markov Logic Networks
[Figure: linear-chain model with labels Y1–Y6 over tokens/features X1–X6]

9. Conditional Random Fields
[Figure: linear chain of labels Y1–Y6 over tokens X1–X6]
• Features are defined over (x, y): f(x, y)
  – f([0-9]*, Phone)
  – f(New York, Address)
• A conditional random field is a log-linear function over these features
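Concretely, the log-linear model the slide refers to is the standard linear-chain CRF: with feature functions f_k, learned weights \lambda_k, and partition function Z(x), the conditional distribution over a label sequence y for a token sequence x is

    p(y \mid x) = \frac{1}{Z(x)} \exp\Big( \sum_{t} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big)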

10. Approaches to Extraction: Summary
• Wrapper
  – High precision (> 99%)
  – Large editorial requirement
• Machine Learning Models
  – Low editorial requirements
  – Low precision due to variable site structure and abundance of noise in web pages

11. Problem Definition
• Problem
  – Extract entities from Web pages with high precision (> 99%) and very low editorial requirements
• Approach
  – Use CRFs for initial labeling
  – Apply site knowledge to improve precision on a small number of pages
  – Construct wrappers using these labels
• Site Knowledge
  – Uniqueness: attributes like Name, Address, Hours are unique per page.
  – Proximity: attributes describing a product/business are close to each other in a page.
  – Sequentiality: attributes in a site occur in the same sequence in its web pages.

12. Our Approach
[Figure: overview of the proposed pipeline]

13. Static Text in Scripted Pages
[Figure: scripted page with the static text highlighted]

14. Segmentation
• Static Node
  – Same (text, xpath) in the majority of pages
• Segmenting a Web page
  – Partition the Web page into segments using static nodes
Segmented sequence:
• [ Chimichuri Gril ]
• [ based on 17 reviews ]
• { Rating details }
• { Categories }
• [ Steakhouses, Argentine ]
• { Neighbourhoods }
• [ Theatre district, Kitchen ]
• [ 603 9th Ave ]
• [ ….. ]
• [ (212) 586-8655 ]
• [ ww.chimichurigril.com ]
• { Nearest Transit: }
• [ 8th Ave ….. ]
• [ …… ]
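A minimal sketch of this segmentation step, assuming each page is given as an ordered list of (text, xpath) leaf nodes; the helper names and the majority threshold are illustrative assumptions:

    from collections import Counter

    def find_static_nodes(pages, threshold=0.5):
        # A (text, xpath) pair is "static" if it occurs in a majority of the site's pages.
        counts = Counter()
        for page in pages:
            counts.update(set(page))            # page: list of (text, xpath) pairs
        return {node for node, c in counts.items() if c > threshold * len(pages)}

    def segment(page, static_nodes):
        # Partition the page's node sequence into segments delimited by static nodes.
        segments, current = [], []
        for node in page:
            if node in static_nodes:
                if current:
                    segments.append(current)
                segments.append([node])         # the static node forms its own segment
                current = []
            else:
                current.append(node)
        if current:
            segments.append(current)
        return segments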

15. Benefits of Static Text and Segmentation
• Noise removal (40%)
• Less time is needed to train a model, since the instances are small
• Better control over precision and recall by controlling the number of noisy segments (10%)
• Very useful for defining context

16. Our Approach
[Figure: overview of the proposed pipeline]

17. CRF Labeling
• Identify attribute labels at the segment level, e.g. seg("address") = e2, using attribute Uniqueness & Proximity
• Fix node labels ("Noise" -> "Address") in segment e2, using Sequentiality

18. Label Correction
Pipeline: Web Pages -> Label Segments (CRF) -> Select Segment (Uniqueness & Proximity) -> Correct Segment Node Labels (Sequentiality)

19. Segment Selection
• Intra-page constraints (Site Knowledge)
  – Uniqueness constraint: attributes like Name, Address, Hours are unique per page
  – Proximity: attributes describing a product/business are close to each other
• Intuition: select the segments which are in close proximity to each other

20. Segment Selection
[Figure: example page showing the segment selected for Address and segments marked as Noise]

21. Segment Selection
• For each attribute A, select a single segment seg(A) such that … is minimum.
• This problem is NP-hard.
• Heuristic: for each segment e, define a weight w_e as …
• For each attribute A, choose the segment with minimum weight.
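The slide's weight formula is not visible in this transcript, so the sketch below uses a stand-in weight (distance from a candidate segment to the nearest other candidate segment); only the overall shape, scoring every candidate segment and keeping the minimum-weight one per attribute, reflects the slide:

    def select_segments(candidates):
        # candidates: attribute -> list of (segment_id, position_in_page)
        all_positions = [pos for segs in candidates.values() for _, pos in segs]

        def weight(pos):
            # Assumed stand-in for w_e: distance to the nearest other candidate segment.
            others = [p for p in all_positions if p != pos]
            return min(abs(pos - p) for p in others) if others else 0

        # Uniqueness: exactly one segment per attribute; Proximity: prefer nearby segments.
        return {attr: min(segs, key=lambda s: weight(s[1]))[0]
                for attr, segs in candidates.items()}

    # Example: the Address candidate at position 6 is chosen over the one at position 19.
    print(select_segments({"Name": [("e1", 0)],
                           "Address": [("e7", 6), ("e20", 19)],
                           "Phone": [("e9", 8)]}))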

22. Correct Node Labels
• Inter-page constraint (Site Knowledge)
  – Same Template: since pages are script-generated, they follow the same template
• Label variations across the same segment will be minor and primarily due to
  – Missing or additional nodes in certain segments
  – Incorrectly labeled nodes in some segments
• Intuition: if the CRF model assigns correct labels in the majority of cases, then applying the "wisdom of the crowd" helps to correct labels

23. Correct Node Labels
[Figure: the segment selected for Address on several pages, with nodes labeled Address or Noise]
Choose the majority label? Segment alignment is needed.

24. Correct Node Labels – Node Alignment
[Figure: aligning the "categories" segment s, with nodes (n_1, l_1, x_1), (n_2, l_2, x_2), …, against segment s', with nodes (n'_1, l'_1, x'_1), (n'_2, l'_2, x'_2), …, via edit operations del(n_i), ins(n'_i, l'_i, x'_i), rep(n'_i, l_i, l'_i)]
1. Find the min-cost edit operation sequence with every other sequence with the same id.
2. For each node, choose the majority operation.
3. If the selected operation is replace, then the label of the node is changed.
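A simplified sketch of the alignment-and-vote idea: segments with the same id from different pages are aligned on their xpath sequences, and each node's label is replaced by the majority label among the nodes it aligns with. This version votes directly on labels rather than on edit operations, and the alignment helper is an assumption, not the authors' exact procedure:

    from collections import Counter
    from difflib import SequenceMatcher

    def correct_labels(segments):
        # segments: one segment per page (same segment id), each a list of (label, xpath).
        ref = segments[0]
        votes = [Counter([label]) for label, _ in ref]     # each node votes for its own label
        ref_xpaths = [x for _, x in ref]
        for other in segments[1:]:
            other_xpaths = [x for _, x in other]
            # Edit-distance-style alignment of the two xpath sequences.
            match = SequenceMatcher(None, ref_xpaths, other_xpaths)
            for i, j, size in match.get_matching_blocks():
                for k in range(size):
                    votes[i + k][other[j + k][0]] += 1      # add the aligned node's label
        return [(votes[i].most_common(1)[0][0], xpath) for i, (_, xpath) in enumerate(ref)]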

25. Extensions
• Attributes spanning segments
  – Cluster the segments
  – Select the cluster whose average weight is minimum
• Missing static nodes
  – Insert the static node at the appropriate position using edit distance

26. Experiments
• Dataset
  – 5 restaurant sites, ~100 pages from each site
  – Attributes: Name, Address, Phone, Hours, Description
  – Attribute orders: NAPHD, NHAPD, NAPDH, NAPH
• Features
  – Lexicon, Regex, Node-level
• Experiment
  – Learn on four sites, label the fifth
• Baselines
  – Full-page CRF
  – HCRF

27. Results
[Charts: Precision, Recall, and F1 for the CRF, NODE, SS, and ED methods]
HCRF baseline:
1. Memory- and compute-intensive.
2. On a subset of the data, its F1 was 0.262, compared to 0.5271 for the proposed approach.

28. Conclusions
• Unsupervised extraction is a challenging problem.
• The framework proposed in this paper leverages site knowledge to boost the precision of underlying extraction schemes.
• When applied to CRF-based extractors, the proposed method boosts both precision and recall.

29. Questions?
shs@yahoo-inc.com
