Apache Solr: the enterprise search platform | Luca Bonesini


SLIDE 1

Luca Bonesini | Titulus User Group, Kion – Bologna, 4 Dec 2013

Apache Solr

The enterprise search platform

SLIDE 2

About me

Luca Bonesini

IT professional, javelin thrower, programmer, bass guitar player, systems administrator, entrepreneur, IT manager, husband, pre-sales engineer, mountain biker, webmaster, father², salesman, chorister, marketing guy

http://www.lucabonesini.it @lbonesini http://it.linkedin.com/in/lucabonesini/ l.bonesini@sourcesense.com +39 366 688 7125

SLIDE 3

Sourcesense

Making sense of Open Source

  • Contributors: Lucene/Solr, Apache Chemistry, Apache Jackrabbit, OpenSSO-Alfresco
  • Committers: Hibernate Search Project, Apache/UIMA project, JBoss GateIn Portal
  • Lead developer: Lucene Infinispan integration

SLIDE 4

Lucene and Solr

What are they?

SLIDE 5

Apache Lucene (core)

Search by ASF

“Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform”.

http://lucene.apache.org/core/

  • Fast and efficient scoring and indexing algorithms
  • Lots of contributions to make common tasks easier: highlighting, spatial, query parsers, benchmarking tools, etc.
  • Most widely deployed search library on the planet

SLIDE 6

Apache Solr

Search by ASF

“Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search”.

Highly reliable, scalable, fault tolerant, distributed indexing, replication, load-balanced querying, automated failover and recovery, centralized configuration.

SLIDE 7

Apache Solr

Search by ASF

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language.

http://lucene.apache.org/solr

  • Access Lucene over HTTP: Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
  • Most programming tasks in Lucene are configuration tasks in Solr
  • Faceting (guided navigation, filters, etc.)
  • Replication and distributed search support
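
To illustrate the idea of reaching Lucene over HTTP, here is a minimal Python sketch that builds a query URL for Solr's `/select` handler and reads the documents out of a JSON response. The host, port, core name (`collection1`), field names and the sample response are assumptions made for illustration and do not come from the talk. The response is a hard-coded string so the example runs without a Solr server.

```python
import json
from urllib.parse import urlencode

def build_select_url(base_url, query, rows=10):
    """Build a URL for Solr's /select request handler, asking for JSON output."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base_url.rstrip("/") + "/select?" + urlencode(params)

def extract_docs(response_text):
    """Return (numFound, docs) from a Solr JSON search response."""
    body = json.loads(response_text)["response"]
    return body["numFound"], body["docs"]

# Hypothetical local core; adjust to your own installation.
url = build_select_url("http://localhost:8983/solr/collection1", "title:solr", rows=5)
print(url)

# Hard-coded sample shaped like a Solr JSON response (invented data).
sample = (
    '{"responseHeader": {"status": 0}, '
    '"response": {"numFound": 1, "start": 0, '
    '"docs": [{"id": "doc1", "title": "Apache Solr"}]}}'
)
found, docs = extract_docs(sample)
print(found, docs[0]["id"])
```

Against a running server, the same URL could be fetched with any HTTP client; the parsing step stays the same.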

SLIDE 8

Enterprise Search

Search in a suit and tie

SLIDE 9

Enterprise Search: what and how.

“Enterprise search is the practice of making content from multiple enterprise-type sources, such as databases and intranets, searchable to a defined audience”. [wikipedia]

Ingestion → Processing and analysis → Indexing → Query parsing → Matching

  • Ingestion: push (integration API) or pull (crawler, connector)
  • Processing and analysis: document types and formats (XML, HTML, Office, etc.) converted to plain text; stemming, lemmatization, synonym expansion, entity extraction, part of speech tagging, tokenization
  • Indexing: dictionary of all unique words in the corpus; ranking; term frequency
  • Query parsing: user query; faceting; paging
  • Matching: query-index comparison; references to source documents
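
The five stages above can be sketched in a few lines of Python. This is a toy illustration, not how Solr or Lucene implement them: the sample documents, the plural-stripping "stemmer" and the term-frequency ranking are all simplifying assumptions.

```python
import re
from collections import defaultdict

def analyze(text):
    """Processing and analysis: lowercase, tokenize, strip a trailing 's' (naive stemming)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def build_index(documents):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Query parsing and matching: rank documents by summed term frequency."""
    scores = defaultdict(int)
    for term in analyze(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Ingestion: in a real system these would arrive via a crawler, connector or API.
documents = {
    "doc1": "Solr indexes documents and makes documents searchable",
    "doc2": "Lucene is a search library",
    "doc3": "Jetty is a servlet container",
}
index = build_index(documents)
results = search(index, "documents search")
print(results)
```

The result is a list of references to source documents (their ids), which is what the matching stage hands back to the caller.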

SLIDE 10

Enterprise Search: what and how.

SLIDE 11

Enterprise Search: what and how.

  • Crawler: an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (also called Web spider, ant, automatic indexer, or Web scutter).
  • Precision/Recall: in pattern recognition and information retrieval, precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved.
  • Stemming: the process of reducing inflected (or sometimes derived) words to their stem, base or root form (e.g., "fishing", "fished", and "fisher" to the root word "fish").
  • Lemmatization: in linguistics, the process of grouping together the different inflected forms of a word so they can be analysed as a single item (e.g., the word "better" has "good" as its lemma).
  • Named-entity recognition (entity extraction): a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
  • Part of speech: a linguistic category of words (or more precisely lexical items), which is generally defined by the syntactic or morphological behaviour of the lexical item in question (e.g., noun and verb).
  • Tokenization: the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.
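
The precision and recall definitions above translate directly into a small calculation. The sketch below is only an illustration; the document ids and the relevance judgements are invented.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and a set of relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # retrieved items that are also relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Invented example: the engine returns four documents, three are truly relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
p, r = precision_recall(retrieved, relevant)
print(p, r)
```

Here two of the four retrieved documents are relevant (precision 2/4) and two of the three relevant documents were retrieved (recall 2/3).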
SLIDE 12

Search and Open Source

SLIDE 13

Enterprise Search: products and vendors

Vendors of proprietary enterprise search software

AskMeNow, Attivio, Concept Searching Limited, Content Analyst Company LLC, Coveo, Dassault Systèmes (acquired Exalead), Denodo, Dieselpoint, Inc., dtSearch Corp., EMC Corp., Exorbyte GmbH, Expert System S.p.A., Exterro, Inc., Fabasoft, Funnelback, Google Search Appliance, HP (acquired Autonomy Corporation which in turn acquired Verity K2 and Ultraseek), IBM (acquired Vivisimo), Inbenta, inter:gator Enterprise Search, ISYS Search Software, MarkLogic, Microsoft (includes Microsoft Search Server, Fast Search & Transfer), Mindbreeze, Neofonie (includes WeFind), Omniture (acquired by Adobe Systems), Open Text Corporation, Oracle Corporation (includes Secure Enterprise Search and Endeca Technologies Inc.), Perception Software, PolySpot, Q-go, Q-Sensei, Recommind, SAP (includes SAP NetWeaver Enterprise Search, Search Services in SAP NetWeaver AS ABAP, and Search and Classification TREX), Sinequa, SLI Systems, Sophia Search Limited, TeraText, X1 Technologies, Inc., ZyLAB Technologies, ZL Technologies

Free and open source enterprise search software

Apache Solr, DataparkSearch, ElasticSearch, ht://Dig, Jumper 2.0, mnoGoSearch, OpenSearchServer, Searchdaimon, Sphinx

Vendors of open source enterprise search software

30 Digits, Apache Software Foundation, LucidWorks, Sematext, Flax

SLIDE 14

Open Source: they do it too.

SLIDE 15

Because Innovation = Bu$ine$$

Open Source, Open Standard

Innovation

OAGi OASIS W3C IETF IEEE ETSI Ecma OGF IEC ISO ITU CENELEC CEN BSI UNI CEI DKE DIN AFNOR GIETS LDTI

Interoperability

SLIDE 16

Solr and Business

SLIDE 17

Solr features

  • Advanced Full-Text Search Capabilities
  • Optimized for High Volume Web Traffic
  • Standards Based Open Interfaces - XML, JSON and HTTP
  • Comprehensive HTML Administration Interfaces
  • Server statistics exposed over JMX for monitoring
  • Linearly scalable, auto index replication, auto failover and recovery
  • Near Real-time indexing
  • Flexible and Adaptable with XML configuration
  • Extensible Plugin Architecture
  • A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys
  • Powerful Extensions to the Lucene Query Language
  • Faceted Search and Filtering
  • Geospatial Search with support for multiple points per document and geo polygons
  • Advanced, Configurable Text Analysis
  • Highly Configurable and User Extensible Caching
  • Performance Optimizations
  • External Configuration via XML
  • An AJAX based administration interface
  • Monitorable Logging
  • Fast near real-time incremental indexing and index replication
  • Highly Scalable Distributed search with sharded index across multiple hosts
  • JSON, XML, CSV/delimited-text, and binary update formats
  • Easy ways to pull in data from databases and XML files from local disk and HTTP sources
  • Rich Document Parsing and Indexing (PDF, Word, HTML, etc) using Apache Tika
  • Apache UIMA integration for configurable metadata extraction
  • Multiple search indices

Related Projects: Apache Hadoop, Apache ManifoldCF, Apache Lucene.Net, Apache Lucy, Apache Mahout, Apache Nutch, Apache OpenNLP, Apache Tika, Apache Zookeeper
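
As one concrete illustration of the "Faceted Search and Filtering" item in the list above, the Python sketch below builds a faceted query URL and turns the facet counts of a JSON response into a dictionary. The parameters `facet`, `facet.field` and `fq` are standard Solr request parameters; the host, core name (`collection1`), field names (`category`, `inStock`) and the sample response are assumptions made for illustration. The parser assumes Solr's default flat JSON layout, in which values and counts alternate in one list.

```python
import json
from urllib.parse import urlencode

def build_facet_query(base_url, query, facet_field, filter_query=None):
    """Build a /select URL that returns facet counts only (rows=0)."""
    params = [("q", query), ("rows", 0), ("wt", "json"),
              ("facet", "true"), ("facet.field", facet_field)]
    if filter_query:
        params.append(("fq", filter_query))  # restricts results without affecting scoring
    return base_url.rstrip("/") + "/select?" + urlencode(params)

def parse_facet_counts(response_text, facet_field):
    """Convert the flat [value, count, value, count, ...] list into a dict."""
    flat = json.loads(response_text)["facet_counts"]["facet_fields"][facet_field]
    return dict(zip(flat[0::2], flat[1::2]))

# Hypothetical core and field names.
url = build_facet_query("http://localhost:8983/solr/collection1",
                        "*:*", "category", filter_query="inStock:true")
print(url)

# Hard-coded sample shaped like the facet section of a Solr response (invented data).
sample = '{"facet_counts": {"facet_fields": {"category": ["books", 12, "music", 7]}}}'
counts = parse_facet_counts(sample, "category")
print(counts)
```

A user interface would typically render `counts` as the guided-navigation links next to the result list.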

SLIDE 18

Search, already a 'commodity'

Search is Everywhere! Keyword search is a commodity. A holistic view of the data and the users is critical. Scalable Search, Discovery and Analytics are the key to unlocking this view of users and data.

Documents, Access, Content, Relationships, User interaction

Traditional

  • Fast, fuzzy text matching across a large document collection
  • De-normalized data, “light” relational
  • Top N problems
  • Key-value (top 1)
  • Recommendations
  • “Good enough” classification, clustering
  • Faceting, slicing and dicing of enumerated data
  • Spatial, spell checking, record linkage, highlighting

  • NoSQL

And:

  • eCommerce
  • Search + Recs + Analysis of users
  • Knowledge Management
  • Financial, transportation, pharma
  • Fraud detection
  • Social media
  • Trend monitoring
  • Information technology
  • Log monitoring, analysis
  • Healthcare
  • DNA Analysis
SLIDE 19

Smart without Search?

SLIDE 20

Solr: who uses it?

Buy.com

SLIDE 21

Beyond Search

SLIDE 22

A success story

SLIDE 39

Happy searching, everyone.

Thank you!

Luca Bonesini

www.sourcesense.com

l.bonesini@sourcesense.com

Tel. +39 366 688 7125

www.lucabonesini.it twitter: @lbonesini skype: lbonesini