Search Engine Architecture

CISC489/689-010, Lecture #2
Wednesday, Feb. 11
Ben Carterette
3/17/09


Search Engine Architecture
 • A software architecture consists of software components, the interfaces provided by those components, and the relationships between them
   – describes a system at a particular level of abstraction
 • Architecture of a search engine is determined by 2 requirements
   – effectiveness (quality of results) and efficiency (response time and throughput)


Indexing Process
 [Diagram: indexing process. Components: Text acquisition (crawler, feeds, filter, …); Text transformation (parsing, stopping, stemming, extraction, …); Index creation (document/term stats, weighting, inversion, …). Other labels: Documents (e-mails, web pages, Word docs, news articles, …); Corpus; Accessible data store; Server(s)]

Indexing Process
 • Text acquisition
   – identifies and stores documents for indexing
 • Text transformation
   – transforms documents into index terms or features
 • Index creation
   – takes index terms and creates data structures (indexes) to support fast searching


Query Process
 [Diagram: query process. Labels recovered from the figure: Corpus; Accessible data store; Server(s); Ranking f(Q,D); Evaluation (precision, recall, clicks, …)]

Query Process
 • User interaction
   – supports creation and refinement of query, display of results
 • Ranking
   – uses query and indexes to generate ranked list of documents
 • Evaluation
   – monitors and measures effectiveness and efficiency (primarily offline)
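
Precision and recall, the effectiveness measures named for the evaluation component, can be sketched for a single query as follows. The function name and the document ids are hypothetical, and the relevance judgments are assumed to be given as a set.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: list of document ids returned by the engine
    relevant:  set of document ids judged relevant (hypothetical judgments)
    """
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


p, r = precision_recall(["d1", "d2", "d3", "d4"], {"d1", "d3", "d7"})
```

Here two of the four retrieved documents are relevant and two of the three relevant documents were found, so `p` is 0.5 and `r` is 2/3.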


Details: Text Acquisition
 • Crawler
   – Identifies and acquires documents for search engine
   – Many types: web, enterprise, desktop
   – Web crawlers follow links to find documents
     • Must efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness)
     • Single site crawlers for site search
     • Topical or focused crawlers for vertical search
   – Document crawlers for enterprise and desktop search
     • Follow links and scan directories
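
The link-following behaviour of a web crawler can be illustrated with a breadth-first traversal. This is only a sketch: the `LINKS` dictionary is a hypothetical in-memory stand-in for downloading pages and parsing their links, and a real crawler would also need politeness delays, revisit scheduling for freshness, and error handling.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> pages it links to.
LINKS = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html", "d.html"],
    "c.html": ["a.html"],
    "d.html": [],
}


def crawl(seeds, max_pages=100):
    """Breadth-first crawl: returns pages in the order they were fetched."""
    frontier = deque(seeds)   # pages discovered but not yet fetched
    seen = set(seeds)         # avoids fetching the same page twice
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        fetched.append(url)              # a real crawler would download the page here
        for link in LINKS.get(url, []):  # ...and extract its outgoing links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return fetched


order = crawl(["a.html"])
```

Starting from `a.html`, every page reachable by links is fetched exactly once; `max_pages` caps how far coverage extends.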

Text Acquisition
 • Feeds
   – Real-time streams of documents
     • e.g., web feeds for news, blogs, video, radio, TV
   – RSS is a common standard
     • RSS “reader” can provide new XML documents to search engine
 • Conversion
   – Convert variety of documents into a consistent text plus metadata format
     • e.g., HTML, XML, Word, PDF, etc. → XML
   – Convert text encoding for different languages
     • Using a Unicode standard like UTF-8
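
As a small illustration of the encoding-conversion step, the sketch below re-encodes text from a known source encoding into UTF-8. The helper name `to_utf8` is hypothetical, and the source encoding is assumed to be known in advance; detecting it is a separate problem not covered here.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes in a known source encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# "café" stored as Latin-1 uses one byte for "é"; UTF-8 uses two.
latin1_bytes = "café".encode("latin-1")
utf8_bytes = to_utf8(latin1_bytes, "latin-1")
```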


Text Acquisition
 • Document data store
   – Stores text, metadata, and other related content for documents
     • Metadata is information about a document such as type and creation date
     • Other content includes links, anchor text
   – Provides fast access to document contents for search engine components
     • e.g., result list generation
   – Could use relational database system
     • More typically, a simpler, more efficient storage system is used due to huge numbers of documents

Text Transformation
 • Parser
   – Processing the sequence of text tokens in the document to recognize structural elements
     • e.g., titles, links, headings, etc.
   – Tokenizer recognizes “words” in the text
     • must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
   – Markup languages such as HTML, XML often used to specify structure
     • Tags used to specify document elements
       – E.g., <h2> Overview </h2>
     • Document parser uses syntax of markup language (or other formatting) to identify structure
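
A minimal sketch of a parser and tokenizer working together, using Python's standard `html.parser` module. The class name `StructureParser`, the choice of heading tags, and the tokenization rule (lowercase runs of letters and digits) are assumptions made for illustration; real tokenizers make more careful decisions.

```python
import re
from html.parser import HTMLParser


class StructureParser(HTMLParser):
    """Collects all text tokens, plus the text found inside heading tags."""

    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.open_heading = None  # heading tag currently open, if any
        self.headings = []
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.open_heading = tag

    def handle_endtag(self, tag):
        if tag == self.open_heading:
            self.open_heading = None

    def handle_data(self, data):
        # Simplistic tokenizer: lowercase, split on anything non-alphanumeric.
        words = re.findall(r"[a-z0-9]+", data.lower())
        self.tokens.extend(words)
        if self.open_heading and words:
            self.headings.append(" ".join(words))


parser = StructureParser()
parser.feed("<h2> Overview </h2><p>Search engines don't index raw HTML.</p>")
parser.close()
```

The apostrophe in "don't" splits into the tokens `don` and `t`, which is one of the tokenization issues the slide lists.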


Text Transformation
 • Stopping
   – Remove common words
     • e.g., “and”, “or”, “the”, “in”
   – Some impact on efficiency and effectiveness
   – Can be a problem for some queries
 • Stemming
   – Group words derived from a common stem
     • e.g., “computer”, “computers”, “computing”, “compute”
   – Usually effective, but not for all queries
   – Benefits vary for different languages
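
Stopping and stemming can be sketched together. The stopword set below contains only the four example words from the slide, and the suffix list is a toy chosen so that the slide's examples collapse to one stem; it is not the Porter stemmer or any other published algorithm.

```python
STOPWORDS = {"and", "or", "the", "in"}  # the example words from the slide

# Toy suffix list, checked in order; an assumption for illustration only.
SUFFIXES = ("ing", "ers", "er", "es", "s", "e")


def stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def transform(tokens):
    """Drop stopwords, then stem what remains."""
    return [stem(t) for t in tokens if t not in STOPWORDS]


terms = transform(["the", "computer", "and", "computers", "computing", "compute"])
```

All four content words map to the stem `comput`. A query made up only of stopwords would be reduced to an empty list, which is one way stopping can hurt some queries.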

Text Transformation
 • Link Analysis
   – Makes use of links and anchor text in web pages
   – Link analysis identifies popularity and community information
     • e.g., PageRank
   – Anchor text can significantly enhance the representation of pages pointed to by links
   – Significant impact on web search
     • Less importance in other applications
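
To make the popularity idea concrete, here is a sketch of PageRank computed by repeated iteration over a tiny, hypothetical link graph. The damping factor of 0.85, the fixed iteration count, and the handling of pages without outlinks are common choices assumed here, not details given in the lecture.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to.

    Every link target is assumed to also appear as a key in `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            # A page with no outlinks spreads its rank evenly over all pages.
            targets = outlinks if outlinks else pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical three-page graph: A and B both link to C, and C links back to A.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": ["A"]})
```

C receives links from two pages and ends up with the highest score; B has no incoming links and gets the lowest.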


Text Transformation
 • Information Extraction
   – Identify classes of index terms that are important for some applications
   – e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.
 • Classifier
   – Identifies class-related metadata for documents
     • i.e., assigns labels to documents
     • e.g., topics, reading levels, sentiment, genre
   – Use depends on application

Index Creation
 • Document Statistics
   – Gathers counts and positions of words and other features
   – Used in ranking algorithm
 • Weighting
   – Computes weights for index terms
   – Used in ranking algorithm
   – e.g., tf.idf weight
     • Combination of term frequency in document and inverse document frequency in the collection
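
One common way to combine term frequency and inverse document frequency is sketched below. The specific variant (raw term count multiplied by log(N / df)) is an assumption; the lecture does not say which formulation it has in mind, and many others exist.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: dict doc_id -> list of index terms. Returns doc_id -> {term: weight}."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        tf = Counter(terms)  # term frequency within this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights


docs = {
    "d1": ["search", "engine", "search"],
    "d2": ["engine", "index"],
}
w = tf_idf(docs)
```

In this toy collection "engine" appears in every document, so its weight is zero, while "search" appears twice in `d1` and nowhere else, so it gets the largest weight.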


Index Creation
 • Inversion
   – Core of indexing process
   – Converts document-term information to term-document for indexing
     • Difficult for very large numbers of documents
   – Format of inverted file is designed for fast query processing
     • Must also handle updates
     • Compression used for efficiency
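
Inversion can be sketched as turning a document-to-terms mapping into a term-to-postings mapping. The in-memory dictionary layout and the function name `invert` are assumptions for illustration; a real inverted file is a compressed on-disk structure that also supports updates.

```python
from collections import defaultdict


def invert(docs):
    """Turn doc_id -> [terms] into term -> [(doc_id, [positions]), ...]."""
    index = defaultdict(dict)
    for doc_id in sorted(docs):
        for position, term in enumerate(docs[doc_id]):
            # Record where in the document each term occurs.
            index[term].setdefault(doc_id, []).append(position)
    # Postings for each term are kept sorted by document id.
    return {term: sorted(postings.items()) for term, postings in index.items()}


inverted = invert({
    "d1": ["search", "engine", "search"],
    "d2": ["engine", "index"],
})
```

Looking up a term now gives its postings directly, for example `inverted["engine"]` lists both documents with the positions where the term occurs.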

Index Creation
 • Index Distribution
   – Distributes indexes across multiple computers and/or multiple sites
   – Essential for fast query processing with large numbers of documents
   – Many variations
     • Document distribution, term distribution, replication
   – P2P and distributed IR involve search across multiple sites
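
Document distribution, one of the variations listed, can be sketched by assigning each document to a shard and merging per-shard results at query time. The shards here are plain dictionaries in one process, and the CRC32-based assignment is an illustrative choice; in practice each shard would live on its own machine.

```python
import zlib


def shard_for(doc_id, n_shards):
    """Assign a document to a shard with a stable hash (an illustrative choice)."""
    return zlib.crc32(doc_id.encode("utf-8")) % n_shards


def build_shards(docs, n_shards):
    """docs: doc_id -> list of terms. Each shard holds its own small inverted index."""
    shards = [dict() for _ in range(n_shards)]
    for doc_id, terms in docs.items():
        index = shards[shard_for(doc_id, n_shards)]
        for term in terms:
            index.setdefault(term, set()).add(doc_id)
    return shards


def search(shards, term):
    """Send the query term to every shard and merge the partial results."""
    results = set()
    for index in shards:
        results |= index.get(term, set())
    return sorted(results)


shards = build_shards(
    {"d1": ["search", "engine"], "d2": ["engine"], "d3": ["index"]},
    n_shards=2,
)
hits = search(shards, "engine")
```

Whichever shard each document lands on, the merged result is the same as searching a single index.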


User Interaction
 • Query input
   – Provides interface and parser for query language
   – Most web queries are very simple, other applications may use forms
   – Query language used to describe more complex queries and results of query transformation
     • e.g., Boolean queries, Indri and Galago query languages
     • similar to SQL language used in database applications
     • IR query languages also allow content and structure specifications, but focus on content
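
A Boolean query can be evaluated with set operations over postings. The sketch below assumes the query has already been parsed into nested tuples; that tuple form and the `INDEX` contents are hypothetical and are not the syntax of Indri, Galago, or any other real query language.

```python
INDEX = {  # hypothetical term -> set of document ids
    "search": {"d1", "d3"},
    "engine": {"d1", "d2"},
    "index": {"d2", "d3"},
}


def evaluate(query):
    """Evaluate a parsed query such as ("AND", "search", ("OR", "engine", "index"))."""
    if isinstance(query, str):
        return INDEX.get(query, set())  # a bare term: look up its postings
    operator, *operands = query
    results = [evaluate(q) for q in operands]
    if operator == "AND":
        return set.intersection(*results)
    if operator == "OR":
        return set.union(*results)
    raise ValueError(f"unknown operator: {operator}")


matches = evaluate(("AND", "search", ("OR", "engine", "index")))
```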

User Interaction
 • Query transformation
   – Improves initial query, both before and after initial search
   – Includes text transformation techniques used for documents
   – Spell checking and query suggestion provide alternatives to original query
   – Query expansion and relevance feedback modify the original query with additional terms
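
A very simple form of spell checking replaces unknown query terms with the closest term in the index vocabulary. The sketch uses `difflib.get_close_matches` from the Python standard library; the vocabulary and the similarity cutoff are assumptions, and production spell checkers typically also draw on query logs and term frequencies.

```python
import difflib

# Hypothetical index vocabulary.
VOCABULARY = ["search", "engine", "index", "ranking", "retrieval"]


def suggest(query_terms, cutoff=0.75):
    """Replace each term not in the vocabulary with its closest entry, if any."""
    corrected = []
    for term in query_terms:
        if term in VOCABULARY:
            corrected.append(term)
            continue
        close = difflib.get_close_matches(term, VOCABULARY, n=1, cutoff=cutoff)
        # Keep the original term when nothing is similar enough.
        corrected.append(close[0] if close else term)
    return corrected


suggestion = suggest(["serch", "engnie"])
```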


User Interaction
 • Results output
   – Constructs the display of ranked documents for a query
   – Generates snippets to show how queries match documents
   – Highlights important words and passages
   – Retrieves appropriate advertising in many applications
   – May provide clustering and other visualization tools
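
Snippet generation and highlighting can be sketched as taking a window of words around the first query-term match. The window size, the asterisk markers, and the fallback to the opening words are all illustrative assumptions.

```python
def snippet(text, query_terms, window=3):
    """Return the words around the first query-term match, with matches marked *like this*."""
    words = text.split()
    terms = {t.lower() for t in query_terms}
    # Compare on a normalized form so "Index," still matches "index".
    normalized = [w.strip(".,;:!?").lower() for w in words]
    for i, word in enumerate(normalized):
        if word in terms:
            start, end = max(0, i - window), min(len(words), i + window + 1)
            marked = [
                f"*{w}*" if normalized[j] in terms else w
                for j, w in enumerate(words[start:end], start)
            ]
            return " ".join(marked)
    return " ".join(words[: 2 * window + 1])  # no match: fall back to the opening words


s = snippet(
    "The crawler finds pages and the indexer builds an inverted index for fast search.",
    ["index"],
    window=2,
)
```

For this sentence the result is `an inverted *index* for fast`; "indexer" is not highlighted because matching is on whole words.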

Ranking
 • Scoring
   – Calculates scores for documents using a ranking algorithm
   – Core component of search engine
   – Basic form of score is $\sum_{t} q_t \cdot d_t$
     • $q_t$ and $d_t$ are query and document term weights for term t
   – Many variations of ranking algorithms and retrieval models
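
Assuming the basic score is the sum over terms of the query term weight times the document term weight, scoring and ranking can be sketched as below. The weights and document ids are made-up values, and the function names are hypothetical.

```python
def score(query_weights, doc_weights):
    """Sum q_t * d_t over the terms that appear in both the query and the document."""
    return sum(q * doc_weights[t] for t, q in query_weights.items() if t in doc_weights)


def rank(query_weights, docs):
    """docs: doc_id -> {term: weight}. Returns (doc_id, score) pairs, best first."""
    scored = [(doc_id, score(query_weights, weights)) for doc_id, weights in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


docs = {
    "d1": {"search": 2.0, "engine": 1.0},
    "d2": {"engine": 3.0, "index": 1.0},
}
ranking = rank({"search": 1.0, "engine": 0.5}, docs)
```

Here `d1` scores 1.0 × 2.0 + 0.5 × 1.0 = 2.5 and `d2` scores 0.5 × 3.0 = 1.5, so `d1` is ranked first.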

