Search Engines Issues Avi Rappoport Search Tools Consulting Search - - PowerPoint PPT Presentation


SLIDE 1

Search Engines Issues

Avi Rappoport Search Tools Consulting

SLIDE 2

Search Issues

  • Enterprise Search Engines
  • Corporate and institutional sites
  • E-commerce
  • Intranets
  • P2P, Meta search and distributed search
  • CMSs and Search Engines
  • Security and Search
SLIDE 3

P2P Search

  • Address the centralized index problem
  • Everyone serves their content
  • Gnutella and FreeNet (MP3s)
  • OpenCOLA
  • scientific collaborations
  • auctions
  • Does not scale
  • Problems with completeness
  • Privacy issues - what to share?
SLIDE 4

Meta Search

  • Send queries to several sources
  • text search engines
  • databases
  • email
  • Extract text from result
  • Display all together
  • Successful on the Web
  • Problems with “screen scraping”
  • Problems with relevance ranking
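
The meta search flow above can be sketched roughly as follows. This is an illustration, not any product's method: each source is assumed to be a hypothetical callable that returns its own ranked list of (title, url) pairs, and the merge step interleaves by rank position because raw scores from different engines are not comparable.

```python
from itertools import zip_longest


def meta_search(query, sources):
    """Send one query to several sources and display the hits together.

    `sources` maps a source name to a hypothetical callable that takes the
    query and returns a ranked list of (title, url) pairs.
    """
    per_source = []
    for name, search in sources.items():
        # Tag every hit with the source it came from.
        per_source.append([(title, url, name) for title, url in search(query)])

    merged, seen_urls = [], set()
    # Round-robin merge: take rank 1 from every source, then rank 2, etc.
    for tier in zip_longest(*per_source):
        for hit in tier:
            if hit is None:
                continue  # this source ran out of results
            if hit[1] in seen_urls:
                continue  # same URL already returned by another source
            seen_urls.add(hit[1])
            merged.append(hit)
    return merged
```

Interleaving by position is only one way around the relevance ranking problem; it trades precision for simplicity.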
SLIDE 5

Distributed Search

  • Common language for query & response
  • Transport mechanism (HTTP)
  • Basic query syntax
  • Single relevance score range
  • Maybe standard algorithm
  • Results with XML
  • Deal with the “Best Sources” issue
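
As a small sketch of the "single relevance score range" and "results with XML" points: the code below rescales one server's scores to 0..1 and serializes the hits as XML. Min-max scaling and the element names (`results`, `hit`) are assumptions chosen for illustration, not part of any protocol named in these slides.

```python
import xml.etree.ElementTree as ET


def normalize(scores):
    """Rescale raw scores into the shared range 0.0 to 1.0 (min-max scaling)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0 for _ in scores]  # all hits equally relevant
    return [(s - lo) / (hi - lo) for s in scores]


def results_to_xml(source, hits):
    """Serialize (url, raw_score) pairs as XML with normalized scores."""
    root = ET.Element("results", source=source)
    scaled = normalize([score for _, score in hits]) if hits else []
    for (url, _), score in zip(hits, scaled):
        ET.SubElement(root, "hit", url=url, score=f"{score:.2f}")
    return ET.tostring(root, encoding="unicode")
```

A shared range only makes scores comparable in form; servers using different ranking algorithms can still disagree about what a given score means.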
SLIDE 6

Past Implementations

  • Z39.50
  • Pioneer, for better and worse
  • Too complex, never finished
  • Limited to speed of slowest server
  • Harvest
  • Early web system
  • Stanford STARTS & LORE
SLIDE 7

Protocols

  • JXTA
  • Java distributed system at Sun
  • XQuery
  • XML equivalent to SQL
  • no relevance ranking
  • Open Archives Meta data
  • export meta data about collections
  • address “best source” issues
  • Google APIs
SLIDE 8

Current Projects

  • Science.gov
  • Access to public databases
  • Commercial Products
  • Verity Federated Search
  • Intelliseek, translates to SQL
  • Library Systems
  • MuseGlobal
SLIDE 9

Future

  • Centralized search engines will index databases and other silos
  • More meta search
  • Complex databases
  • Integrating library content
  • Distributed search protocols
  • Libraries are pioneers
  • Middleware interpreters
  • Sit between search and dbs
  • Index and search time
SLIDE 10

Search & CMS

  • CMS: Content Management System
  • Related to document management
  • Templates
  • Workflow
  • Editorial accountability
  • Publishing
SLIDE 11

Search & CMS

  • Navigation links are not enough
  • Labels can be confusing
  • Categories often limiting
  • Search allows ad-hoc access
  • Other ways of finding
  • Wide variety in use of language
  • Integrate CMS-generated pages with other content
  • Avoid becoming data silos
SLIDE 12

Improve search

  • Synchronize indexing & publishing
  • Everything is current
  • Only unique pages
  • Duplicate pages a big problem for robots
  • Content only
  • No indexing of navigation text
  • Actual content modification date
  • Web servers often lie
  • Require page titles
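
One way an indexer might enforce "only unique pages" is to fingerprint the content of each page and keep the first URL seen for each fingerprint. The sketch below assumes navigation text has already been stripped and that exact matching after whitespace and case normalization is good enough; near-duplicate detection would need a different technique.

```python
import hashlib


def fingerprint(text):
    """Hash the page content after collapsing whitespace and case."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def unique_pages(pages):
    """Keep the first URL for each distinct content fingerprint.

    `pages` is an iterable of (url, content_text) pairs.
    """
    first_url = {}
    for url, text in pages:
        # setdefault leaves the earlier URL in place for duplicate content.
        first_url.setdefault(fingerprint(text), url)
    return list(first_url.values())
```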
SLIDE 13

Meta Data

  • CMSs simplify meta data entry
  • Use the Dublin Core
  • Automate some meta tags
  • Author, department
  • Language & character set
  • Subject tags
  • Use controlled vocabulary
  • Category "facets"
  • Non-hierarchical attributes
  • Based on content
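
A CMS template could automate some of these tags along the following lines. The sketch uses the common `DC.` prefix convention for embedding Dublin Core elements in HTML meta tags; the record layout and function name are assumptions for illustration.

```python
from html import escape


def dublin_core_tags(record):
    """Render a metadata record as HTML meta tags, one per value.

    `record` maps Dublin Core element names (creator, language, subject, ...)
    to a single value or a list of values.
    """
    lines = []
    for element, value in record.items():
        values = value if isinstance(value, list) else [value]
        for item in values:
            content = escape(str(item), quote=True)
            lines.append(f'<meta name="DC.{element}" content="{content}">')
    return "\n".join(lines)
```

Repeating the `DC.subject` tag for each term keeps controlled-vocabulary entries separate, so a search engine can treat each one as its own field value.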
SLIDE 14

CMSs With Search

  • Commercial
  • Atomz Publish ASP
  • divine Eprise
  • Microsoft Site Server
  • Plumtree
  • Vignette
  • Open Source
  • OpenCMS
  • Red Hat CMS
  • Zope
SLIDE 15

External Search

  • Integrate CMS content
  • Search together with intranet, external content
  • Indexing
  • Robot crawler
  • CMS API for indexing
  • Syndication publishing
  • RSS 1.0
  • ICE
  • Two features for one
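
To illustrate how a syndication feed can double as an indexing feed, the sketch below pulls (link, title) pairs out of an RSS 1.0 document so an indexer knows which pages to fetch. It assumes a well-formed feed that uses the standard RSS 1.0 namespace; the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# RSS 1.0 elements live in this XML namespace.
RSS = "{http://purl.org/rss/1.0/}"


def items_to_index(feed_xml):
    """Return (link, title) pairs for every item in an RSS 1.0 feed."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext(RSS + "link"), item.findtext(RSS + "title"))
        for item in root.iter(RSS + "item")
    ]
```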
SLIDE 16

Search & security

  • Content security
  • Private data types
  • Access control issues
  • Results with teaser content
  • Hiding inaccessible results
SLIDE 17

Types of Private Data

  • Personal Records
  • Financial, legal, health, academic, employment, etc.
  • Special case, very difficult
  • Research and analysis
  • Business discussions
  • Sales proposals
  • Licensed content
  • Personal files and email
SLIDE 18

Protect Privacy

  • Search should never expose private data to public view
  • Use HTTPS encryption in transit
  • Indexer client
  • Serving search results
  • Secure the index file and server against intrusion
SLIDE 19

Access Control

  • Basic Authentication
  • User name and password
  • Lightweight security
  • Indexer can store and issue
  • File-based permissions for users and groups
  • Windows NT Challenge & Response
  • LDAP authorization systems
  • Others...
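
For the Basic Authentication case, "indexer can store and issue" comes down to sending an `Authorization` header with each request. A minimal sketch, with a hypothetical helper name:

```python
import base64


def basic_auth_header(user, password):
    """Build the HTTP Basic Authentication header for a stored credential."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}
```

The credential is only base64-encoded, not encrypted, which is why this counts as lightweight security and should travel over HTTPS.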
SLIDE 20

Indexing access

  • Search indexer
  • Becomes a “user”
  • Member of all relevant groups
  • Indexer must send passwords or certificates
  • Store flag for the protected documents
SLIDE 21

Results as Teasers

  • Show protected documents in search results
  • Among public pages
  • In a separate section
  • Encourage payments or subscriptions
  • Encourage registration
  • Intranets
  • Limited-access databases
  • Other departments
SLIDE 22

Why Restrict?

  • Showing in results is vulnerable to reverse engineering
  • Example: search for “merger”
  • If protected pages are displayed
  • Employee or outsider can search for merger candidates
  • Gleaning information from the existence of results
SLIDE 23

Permissions in Index

  • Store the access permissions
  • Mark for each document in the index
  • Search engine checks before displaying
  • Very fast at retrieval
  • Index must be always current
  • Good with CMS integration
  • Replicate access control functionality
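
The permissions-in-index approach might look like the sketch below: the indexer records which groups may see each document, and the search engine filters hits against the user's groups before display. The group-based model, the names, and the deny-by-default rule for unmarked documents are assumptions for illustration.

```python
def build_acl_index(documents):
    """Store the allowed groups for each document at index time.

    `documents` is an iterable of (doc_id, allowed_groups) pairs.
    """
    return {doc_id: frozenset(groups) for doc_id, groups in documents}


def visible_results(hits, acl_index, user_groups):
    """Drop every hit the user has no group in common with.

    Documents missing from the ACL index are hidden (deny by default).
    """
    groups = set(user_groups)
    return [doc for doc in hits if acl_index.get(doc, frozenset()) & groups]
```

The check is a set intersection per hit, which is why retrieval stays fast, but the stored groups are only as current as the last index update.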

SLIDE 24

Results-Time Check

  • Work with access control system
  • Ask about top batch of results
  • Send user credentials and document info
  • Ask if they’re allowed to see it
  • Always current
  • Can be a bit slow
  • Can perform parallel requests
  • Show results as they come back
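
A results-time check with parallel requests could be sketched as follows. `is_allowed` stands in for a call to the access control system and is a hypothetical callable; the batch size and worker count are arbitrary defaults.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def check_results(user, docs, is_allowed, batch_size=10, workers=4):
    """Yield permitted documents from the top batch as answers come back.

    `is_allowed(user, doc)` is a hypothetical call to the access control
    system that returns True when the user may see the document.
    """
    batch = docs[:batch_size]  # only ask about the top batch of results
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Issue all permission checks in parallel.
        futures = {pool.submit(is_allowed, user, doc): doc for doc in batch}
        for future in as_completed(futures):
            if future.result():
                yield futures[future]
```

Because results are yielded in completion order, the caller needs to re-sort by relevance if display order matters.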
SLIDE 25

Conclusions

  • Meta and distributed search provide access to external content
  • Indexing CMS content can be powerful and timely
  • Search should never expose private data
  • Integrate search with access control

More search info: www.searchtools.com