search engine architecture

Search Engine Architecture, CISC489/689-010, Lecture #2, Wednesday, Feb. 11



 3/17/09
 Search Engine Architecture
 CISC489/689-010, Lecture #2
 Wednesday, Feb. 11
 Ben Carterette

 Search Engine Architecture
 • A software architecture consists of software components, the interfaces provided by those components, and the relationships between them
   – describes a system at a particular level of abstraction
 • Architecture of a search engine determined by 2 requirements
   – effectiveness (quality of results) and efficiency (response time and throughput)


 Indexing Process
 [Figure: indexing process diagram. Labels: Documents (e-mails, web pages, Word docs, news articles, …); Text acquisition (crawler, feeds, filter, …); Text transformation (parsing, stopping, stemming, extraction, …); Index creation (document/term stats, weighting, inversion, …); Corpus; Accessible data store; Server(s)]
 Indexing Process
 • Text acquisition
   – identifies and stores documents for indexing
 • Text transformation
   – transforms documents into index terms or features
 • Index creation
   – takes index terms and creates data structures (indexes) to support fast searching
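 The three indexing stages can be sketched as a small pipeline. This is only an illustration under simplifying assumptions: the function names (acquire, transform, create_index) are hypothetical, documents are plain strings, and "transformation" is just lowercasing and splitting on whitespace.

```python
from collections import defaultdict

def acquire(sources):
    """Text acquisition: give each raw document an id and store it."""
    return {doc_id: text for doc_id, text in enumerate(sources)}

def transform(text):
    """Text transformation: turn a document into index terms (lowercased words)."""
    return text.lower().split()

def create_index(store):
    """Index creation: map each term to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in store.items():
        for term in transform(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

store = acquire(["Search engines index text", "Index terms support fast search"])
index = create_index(store)
print(index["index"])  # [0, 1]
print(index["fast"])   # [1]
```

 Each stage only depends on the output of the previous one, which is why the stages can be developed and scaled separately.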


 Query Process
 [Figure: query process diagram. Labels: Ranking, f(Q,D); Evaluation (precision, recall, clicks, …); Corpus; Accessible data store; Server(s)]
 Query Process
 • User interaction
   – supports creation and refinement of query, display of results
 • Ranking
   – uses query and indexes to generate ranked list of documents
 • Evaluation
   – monitors and measures effectiveness and efficiency (primarily offline)
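 As a small illustration of offline effectiveness measurement, the sketch below computes precision and recall for one query. It assumes set-based (unranked) judgments; the function name and the document ids are made up for the example.

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant, for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall

# Hypothetical result list and relevance judgments.
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"], relevant=["d2", "d4", "d7"])
print(p)            # 0.5
print(round(r, 3))  # 0.667
```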


 Details: Text Acquisition
 • Crawler
   – Identifies and acquires documents for search engine
   – Many types: web, enterprise, desktop
   – Web crawlers follow links to find documents
     • Must efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness)
     • Single site crawlers for site search
     • Topical or focused crawlers for vertical search
   – Document crawlers for enterprise and desktop search
     • Follow links and scan directories
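 Link following can be sketched as a breadth-first traversal with a frontier queue and a "seen" set. To keep the example runnable without network access, the "web" here is a hypothetical in-memory dictionary, and fetch_links stands in for downloading a page and extracting its links. A real crawler would also need politeness delays, robots.txt handling, and revisit scheduling for freshness.

```python
from collections import deque

# Hypothetical toy "web": each URL maps to the links found on that page.
WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["b.example/news", "a.example/about"],
    "b.example/news": [],
}

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl from a seed URL, visiting each page at most once."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

pages = crawl("a.example/", lambda url: WEB.get(url, []))
print(pages)
# ['a.example/', 'a.example/about', 'b.example/', 'b.example/news']
```

 Restricting fetch_links to URLs on one host would give a single-site crawler; filtering by page topic would give a focused one.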
 Text Acquisition
 • Feeds
   – Real-time streams of documents
     • e.g., web feeds for news, blogs, video, radio, TV
   – RSS is common standard
     • RSS “reader” can provide new XML documents to search engine
 • Conversion
   – Convert variety of documents into a consistent text plus metadata format
     • e.g., HTML, XML, Word, PDF, etc. → XML
   – Convert text encoding for different languages
     • Using a Unicode standard like UTF-8
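 A minimal sketch of a feed "reader" plus conversion step, using only the standard library. The feed contents, URLs, and the text-plus-metadata record layout are assumptions made for the example; the feed is deliberately delivered as Latin-1 bytes so that the final step shows re-encoding to UTF-8.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, delivered as Latin-1 (ISO-8859-1) encoded bytes.
FEED = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    '<rss version="2.0"><channel><title>News</title>'
    '<item><title>Caf\u00e9 opens</title><link>http://example.com/1</link>'
    '<pubDate>Wed, 11 Feb 2009 09:00:00 GMT</pubDate></item>'
    '<item><title>Index updated</title><link>http://example.com/2</link>'
    '<pubDate>Wed, 11 Feb 2009 10:00:00 GMT</pubDate></item>'
    '</channel></rss>'
).encode("latin-1")

def feed_to_documents(feed_bytes):
    """Turn each feed item into a consistent text-plus-metadata record."""
    root = ET.fromstring(feed_bytes)  # the XML parser honors the declared encoding
    docs = []
    for item in root.iter("item"):
        docs.append({
            "text": item.findtext("title", default=""),
            "metadata": {
                "url": item.findtext("link", default=""),
                "date": item.findtext("pubDate", default=""),
            },
        })
    return docs

docs = feed_to_documents(FEED)
utf8_text = docs[0]["text"].encode("utf-8")  # store everything as UTF-8
print(len(docs))   # 2
print(utf8_text)   # b'Caf\xc3\xa9 opens'
```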


 Text Acquisition
 • Document data store
   – Stores text, metadata, and other related content for documents
     • Metadata is information about document such as type and creation date
     • Other content includes links, anchor text
   – Provides fast access to document contents for search engine components
     • e.g., result list generation
   – Could use relational database system
     • More typically, a simpler, more efficient storage system is used due to huge numbers of documents
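 One way to picture a "simpler" store is a key-value map from document id to a serialized record. The class below is a hypothetical in-memory sketch; the JSON record layout and the use of zlib compression are assumptions for illustration, not something the slides specify.

```python
import json
import zlib

class DocumentStore:
    """Hypothetical key-value document store: doc id -> compressed JSON record."""

    def __init__(self):
        self._records = {}

    def put(self, doc_id, text, metadata=None, links=None):
        record = {"text": text, "metadata": metadata or {}, "links": links or []}
        self._records[doc_id] = zlib.compress(json.dumps(record).encode("utf-8"))

    def get(self, doc_id):
        return json.loads(zlib.decompress(self._records[doc_id]).decode("utf-8"))

store = DocumentStore()
store.put(
    "doc1",
    "Search engine architecture overview",
    metadata={"type": "html", "created": "2009-02-11"},
    links=[{"url": "http://example.com/next", "anchor": "next lecture"}],
)
print(store.get("doc1")["metadata"]["type"])  # html
```

 Lookup by id is all that components such as result list generation need, which is why a full relational system is often unnecessary.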
 Text Transformation
 • Parser
   – Processing the sequence of text tokens in the document to recognize structural elements
     • e.g., titles, links, headings, etc.
   – Tokenizer recognizes “words” in the text
     • must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
   – Markup languages such as HTML, XML often used to specify structure
     • Tags used to specify document elements
       – E.g., <h2> Overview </h2>
     • Document parser uses syntax of markup language (or other formatting) to identify structure
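 The sketch below combines a simple tokenizer with a structure-aware parser built on Python's standard html.parser module. The tokenization policy (lowercase everything, keep apostrophes, split on hyphens and other non-alphanumerics) is just one possible set of choices, and the class name StructureParser is hypothetical. Void tags such as <br> are not handled.

```python
import re
from html.parser import HTMLParser

def tokenize(text):
    """Lowercase, then keep runs of letters, digits, and apostrophes as tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

class StructureParser(HTMLParser):
    """Collects tokens per enclosing tag, tracking open tags on a stack."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        field = self.stack[-1] if self.stack else "body"
        self.fields.setdefault(field, []).extend(tokenize(data))

parser = StructureParser()
parser.feed("<title>Search Engines</title><h2> Overview </h2>"
            "<p>It's a state-of-the-art index.</p>")
parser.close()
print(parser.fields["h2"])  # ['overview']
print(parser.fields["p"])   # ["it's", 'a', 'state', 'of', 'the', 'art', 'index']
```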


 Text Transformation
 • Stopping
   – Remove common words
     • e.g., “and”, “or”, “the”, “in”
   – Some impact on efficiency and effectiveness
   – Can be a problem for some queries
 • Stemming
   – Group words derived from a common stem
     • e.g., “computer”, “computers”, “computing”, “compute”
   – Usually effective, but not for all queries
   – Benefits vary for different languages
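 A toy illustration of both steps. The stopword list is the one from the slide; the suffix list and the minimum stem length are assumptions chosen so the slide's example words collapse to one stem. This is a naive suffix stripper, not the Porter stemmer or any other published algorithm.

```python
STOPWORDS = {"and", "or", "the", "in"}
# Hypothetical suffix list, checked longest first.
SUFFIXES = ("ers", "ing", "er", "es", "e", "s")

def stop(tokens):
    """Stopping: drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

def stem(token, min_stem=4):
    """Naive stemming: strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= min_stem:
            return token[: -len(suffix)]
    return token

words = ["computer", "computers", "computing", "compute"]
print({stem(w) for w in words})                       # {'comput'}
print(stop(["the", "index", "in", "search", "engines"]))
# ['index', 'search', 'engines']
```

 A query made up mostly of stopwords would lose most of its terms under this scheme, which is the kind of case where stopping becomes a problem.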
 Text Transformation
 • Link Analysis
   – Makes use of links and anchor text in web pages
   – Link analysis identifies popularity and community information
     • e.g., PageRank
   – Anchor text can significantly enhance the representation of pages pointed to by links
   – Significant impact on web search
     • Less importance in other applications
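 As a rough sketch of how link structure can be turned into a popularity score, here is a basic power-iteration version of PageRank on a three-page graph. The damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from pages without out-links are common conventions assumed here, not details given in the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Page with no out-links: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical link graph: A and B both link to C, C links back to A.
links = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # C
```

 Page C ends up with the highest score because two pages link to it, while B, which nothing links to, keeps only the baseline share.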


 Text Transformation
 • Information Extraction
   – Identify classes of index terms that are important for some applications
   – e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.
 • Classifier
   – Identifies class-related metadata for documents
     • i.e., assigns labels to documents
     • e.g., topics, reading levels, sentiment, genre
   – Use depends on application
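 To make the idea of a recognizer concrete, the sketch below tags one entity class (dates) with a hand-written regular expression. The pattern is a hypothetical, deliberately narrow rule covering two date formats; practical named entity recognizers are usually statistical and handle many more classes.

```python
import re

# Hypothetical pattern: "Feb. 11" / "Feb 11, 2009" style, or numeric "3/17/09" style.
DATE_PATTERN = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.? \d{1,2}(?:, \d{4})?"
    r"|\b\d{1,2}/\d{1,2}/\d{2,4}\b"
)

def extract_dates(text):
    """Return (matched text, class label) pairs for every date-like span."""
    return [(m.group(0), "date") for m in DATE_PATTERN.finditer(text)]

entities = extract_dates("Lecture #2 was given on Feb. 11 and printed 3/17/09.")
print(entities)  # [('Feb. 11', 'date'), ('3/17/09', 'date')]
```

 The tagged spans could then be indexed as a separate class of terms, so that a query can ask for dates specifically.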
 Index Creation
 • Document Statistics
   – Gathers counts and positions of words and other features
   – Used in ranking algorithm
 • Weighting
   – Computes weights for index terms
   – Used in ranking algorithm
   – e.g., tf.idf weight
     • Combination of term frequency in document and inverse document frequency in the collection
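 The sketch below computes one common tf.idf variant: raw term count in the document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. Many other variants exist; this particular formula and the function name tf_idf are assumptions for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # term frequency within this document
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

docs = [
    ["search", "engine", "search"],
    ["engine", "index"],
    ["index", "compression"],
]
w = tf_idf(docs)
print(round(w[0]["search"], 3))  # 2.197
print(round(w[0]["engine"], 3))  # 0.405
```

 "search" outweighs "engine" in the first document because it occurs twice there and appears in fewer documents overall.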


 Index Creation
 • Inversion
   – Core of indexing process
   – Converts document-term information to term-document for indexing
     • Difficult for very large numbers of documents
   – Format of inverted file is designed for fast query processing
     • Must also handle updates
     • Compression used for efficiency
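 Inversion can be sketched as flipping a document-to-terms mapping into a term-to-postings mapping. The version below keeps word positions in each posting and holds everything in memory; the posting layout (doc id, list of positions) is an assumption for the example, and large collections would need disk-based merging and compression instead.

```python
from collections import defaultdict

def invert(doc_terms):
    """Convert {doc_id: [terms]} into {term: [(doc_id, [positions]), ...]}."""
    index = defaultdict(list)
    for doc_id in sorted(doc_terms):  # keep postings ordered by doc id
        positions = defaultdict(list)
        for pos, term in enumerate(doc_terms[doc_id]):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return dict(index)

doc_terms = {
    1: ["fast", "query", "processing"],
    2: ["query", "index", "query"],
}
index = invert(doc_terms)
print(index["query"])  # [(1, [1]), (2, [0, 2])]
```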
 Index Creation
 • Index Distribution
   – Distributes indexes across multiple computers and/or multiple sites
   – Essential for fast query processing with large numbers of documents
   – Many variations
     • Document distribution, term distribution, replication
   – P2P and distributed IR involve search across multiple sites
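 Document distribution can be illustrated by assigning each document to one shard and merging per-shard results at query time. In this sketch the shards are plain in-process dictionaries and documents are assigned by doc id modulo the shard count; both choices are assumptions, standing in for separate machines and a real partitioning policy.

```python
from collections import defaultdict

def build_shards(docs, num_shards):
    """Document distribution: each shard indexes only its own subset of documents."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, terms in docs.items():
        shard = shards[doc_id % num_shards]
        for term in terms:
            shard[term].add(doc_id)
    return shards

def search(shards, term):
    """Send the term to every shard and merge the partial results."""
    results = set()
    for shard in shards:
        results |= shard.get(term, set())
    return sorted(results)

docs = {
    0: ["web", "search"],
    1: ["desktop", "search"],
    2: ["web", "crawler"],
    3: ["enterprise", "search"],
}
shards = build_shards(docs, num_shards=2)
print(search(shards, "search"))  # [0, 1, 3]
print(search(shards, "web"))     # [0, 2]
```

 Term distribution would instead route each term's full posting list to a single shard, so a one-term query touches only one machine.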


 User Interaction
 • Query input
   – Provides interface and parser for query language
   – Most web queries are very simple, other applications may use forms
   – Query language used to describe more complex queries and results of query transformation
     • e.g., Boolean queries, Indri and Galago query languages
     • similar to SQL language used in database applications
     • IR query languages also allow content and structure specifications, but focus on content
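 A tiny Boolean query evaluator gives a feel for what a query-language parser does. The syntax here is a made-up simplification: terms joined by AND / OR, optional NOT before a term, evaluated strictly left to right with no precedence or parentheses. It is not the Indri or Galago query language.

```python
def evaluate(query, index, all_docs):
    """Evaluate 'term (AND|OR) [NOT] term ...' left to right over a term -> doc-id-set index."""
    result = None
    op = "OR"
    negate = False
    for tok in query.split():
        if tok in ("AND", "OR"):
            op = tok
        elif tok == "NOT":
            negate = True
        else:
            postings = set(index.get(tok.lower(), ()))  # copy, so the index is not mutated
            if negate:
                postings = all_docs - postings
                negate = False
            if result is None:
                result = postings
            elif op == "AND":
                result &= postings
            else:
                result |= postings
    return sorted(result or ())

# Hypothetical index: term -> set of document ids.
index = {"search": {1, 2, 3}, "engine": {1, 3}, "desktop": {2}}
all_docs = {1, 2, 3}
print(evaluate("search AND engine", index, all_docs))      # [1, 3]
print(evaluate("search AND NOT engine", index, all_docs))  # [2]
print(evaluate("engine OR desktop", index, all_docs))      # [1, 2, 3]
```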
 User Interaction
 • Query transformation
   – Improves initial query, both before and after initial search
   – Includes text transformation techniques used for documents
   – Spell checking and query suggestion provide alternatives to original query
   – Query expansion and relevance feedback modify the original query with additional terms
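 Two of these transformations can be sketched with the standard library. Spell checking here uses difflib.get_close_matches against a small vocabulary, and expansion uses a hand-written table of related terms. The vocabulary, the related-terms table, and the 0.8 similarity cutoff are all hypothetical; real systems typically derive such data from query logs or the collection itself.

```python
import difflib

VOCABULARY = ["search", "engine", "architecture", "index", "crawler"]
RELATED = {"crawler": ["spider"], "index": ["inverted"]}  # hypothetical expansion table

def suggest(term, vocabulary=VOCABULARY):
    """Spell checking: return the closest vocabulary word, or the term itself if none is close."""
    if term in vocabulary:
        return term
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else term

def expand(terms, related=RELATED):
    """Query expansion: append related terms that are not already in the query."""
    expanded = list(terms)
    for term in terms:
        for extra in related.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded

query = ["serch", "crawler"]
corrected = [suggest(t) for t in query]
print(corrected)          # ['search', 'crawler']
print(expand(corrected))  # ['search', 'crawler', 'spider']
```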


 User Interaction
 • Results output
   – Constructs the display of ranked documents for a query
   – Generates snippets to show how queries match documents
   – Highlights important words and passages
   – Retrieves appropriate advertising in many applications
   – May provide clustering and other visualization tools
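 Snippet generation and highlighting can be sketched as picking the fixed-size window of words that contains the most query terms and marking the matches. The window size, the tie-breaking rule (earliest window wins), and the ** markers used for highlighting are assumptions made for this example.

```python
import re

def snippet(text, query_terms, window=5):
    """Return the window of words with the most query-term hits, with hits marked by **."""
    words = text.split()
    terms = {t.lower() for t in query_terms}

    def is_hit(word):
        # Compare after stripping punctuation and lowercasing.
        return re.sub(r"\W+", "", word).lower() in terms

    hit_positions = [i for i, w in enumerate(words) if is_hit(w)]
    if not hit_positions:
        return " ".join(words[:window])  # fall back to the start of the document
    best_start = max(
        range(len(words)),
        key=lambda s: sum(s <= p < s + window for p in hit_positions),
    )
    chosen = words[best_start:best_start + window]
    return " ".join(f"**{w}**" if is_hit(w) else w for w in chosen)

text = "A search engine uses an inverted index to answer each query quickly"
print(snippet(text, ["inverted", "index"]))
# engine uses an **inverted** **index**
```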
 Ranking
 • Scoring
   – Calculates scores for documents using a ranking algorithm
   – Core component of search engine
   – Basic form of score is $\sum_{t} q_t \cdot d_t$
     • $q_t$ and $d_t$ are query and document term weights for term t
   – Many variations of ranking algorithms and retrieval models
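 A score built from query and document term weights can be sketched as a sum, over the query's terms, of the query weight times the document weight for that term. The weights and document ids below are made up; in practice they would come from a weighting scheme such as tf.idf, and the function names are hypothetical.

```python
def score(query_weights, doc_weights):
    """Sum over query terms of (query weight) * (document weight); missing terms count as 0."""
    return sum(q * doc_weights.get(t, 0.0) for t, q in query_weights.items())

def rank(query_weights, docs):
    """Return ids of documents with a positive score, best first (ties broken by id)."""
    scored = [(score(query_weights, w), doc_id) for doc_id, w in docs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for s, doc_id in ordered if s > 0]

# Hypothetical term weights.
docs = {
    "d1": {"search": 2.0, "engine": 1.0},
    "d2": {"engine": 3.0},
    "d3": {"index": 1.5},
}
query = {"search": 1.0, "engine": 0.5}
print(score(query, docs["d1"]))  # 2.5
print(rank(query, docs))         # ['d1', 'd2']
```

 Only terms that appear in the query contribute, so an inverted index can compute these sums by scanning just the posting lists of the query terms.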

