[PPT] - Apache Solr, the enterprise search platform - Luca Bonesini

SLIDE 1

Luca Bonesini | Titulus User Group, Kion – Bologna, 4 Dec 2013

Apache Solr

the enterprise search platform

SLIDE 2

Who I am

Computer scientist, javelin thrower, programmer, bass guitar player, systems administrator, entrepreneur, IT manager, husband, pre-sales engineer, mountain biker, webmaster, father (x2), salesman, chorister, marketer

Luca Bonesini

http://www.lucabonesini.it @lbonesini http://it.linkedin.com/in/lucabonesini/ l.bonesini@sourcesense.com +39 366 688 7125

SLIDE 3

Sourcesense

Making sense of Open Source

Contributors: Lucene/Solr, Apache Chemistry, Apache Jackrabbit, OpenSSO-Alfresco
Committers: Hibernate Search Project, Apache/UIMA project, JBoss GateIn Portal
Lead developer: Lucene Infinispan integration

SLIDE 4

Lucene and Solr

What are they?

SLIDE 5

Apache Lucene (core)

Search by ASF

“Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform”.

http://lucene.apache.org/core/
Fast and efficient scoring and indexing algorithms
Lots of contributions to make common tasks easier: highlighting, spatial, query parsers, benchmarking tools, etc.
Most widely deployed search library on the planet

SLIDE 6

Apache Solr

Search by ASF

“Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search”.

Highly reliable, scalable, fault tolerant, distributed indexing, replication, load-balanced querying, automated failover and recovery, centralized configuration.

SLIDE 7

Apache Solr

Search by ASF

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language.

http://lucene.apache.org/solr
Access Lucene over HTTP: Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
Most programming tasks in Lucene are configuration tasks in Solr
Faceting (guided navigation, filters, etc.)
Replication and distributed search support
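
As a rough illustration of the "Access Lucene over HTTP" point, the sketch below only builds the URL of a query against Solr's `/select` request handler using the standard `q`, `rows` and `wt` parameters. The host, port and the core name `documents` are assumptions made for the example and do not come from the slides; nothing is sent over the network.

```python
from urllib.parse import urlencode

def build_select_url(base_url, core, query, rows=10, wt="json"):
    """Build the URL of a query against Solr's /select request handler."""
    params = {"q": query, "rows": rows, "wt": wt}
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"

# Hypothetical local Solr instance and core name (assumptions for illustration)
url = build_select_url("http://localhost:8983/solr", "documents", "title:lucene")
print(url)
```

Any HTTP client in any language can then fetch that URL, which is what makes Solr usable from outside the Java world.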

SLIDE 8

Enterprise Search

Search wearing a tie

SLIDE 9

Enterprise Search: what and how.

“Enterprise search is the practice of making content from multiple enterprise-type sources, such as databases and intranets, searchable to a defined audience”. [wikipedia]

Ingestion → Processing and analysis → Indexing → Query parsing → Matching

Ingestion: push (integration API) or pull (crawler, connector).
Processing and analysis: document types and formats (XML, HTML, Office, etc.) converted to plain text; stemming, lemmatization, synonym expansion, entity extraction, part-of-speech tagging, tokenization.
Indexing: dictionary of all unique words in the corpus; ranking; term frequency.
Query parsing: user query; faceting; paging.
Matching: query-index comparison; references to source documents.
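
The five stages can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how Lucene or Solr are implemented: the documents are invented, analysis is reduced to lowercasing and tokenization, and ranking is a plain sum of term frequencies.

```python
import re
from collections import Counter, defaultdict

def analyze(text):
    """Processing and analysis: lowercase the text and split it into tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Indexing: map every unique term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, freq in Counter(analyze(text)).items():
            index[term][doc_id] = freq
    return index

def search(index, query):
    """Query parsing and matching: score documents by summed term frequency."""
    scores = Counter()
    for term in analyze(query):
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] += freq
    # Highest score first; ties are broken by document id for a stable order
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Ingestion: hypothetical documents, assumed to be already pulled from their sources
documents = {
    "doc1": "Solr is an enterprise search platform",
    "doc2": "Lucene is a search library; Solr uses the Lucene library",
    "doc3": "Enterprise content lives in databases and intranets",
}
index = build_index(documents)
results = search(index, "Lucene search")
print(results)  # references to the matching source documents, best first
```

The result is a list of document ids with scores, which corresponds to the "references to source documents" returned by the matching stage.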

SLIDE 10

Enterprise Search: what and how.

SLIDE 11

Enterprise Search: what and how.

Crawler: an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (also called Web spider, ant, automatic indexer, or Web scutter).

Precision/Recall: in pattern recognition and information retrieval, precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved.

Stemming: the process of reducing inflected (or sometimes derived) words to their stem, base or root form (e.g. "fishing", "fished", and "fisher" reduce to the root word "fish").

Lemmatization: in linguistics, the process of grouping together the different inflected forms of a word so they can be analysed as a single item (e.g. the word "better" has "good" as its lemma).

Named-entity recognition (entity extraction): a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

Part of speech: a linguistic category of words (or more precisely lexical items), which is generally defined by the syntactic or morphological behaviour of the lexical item in question (e.g. noun and verb).

Tokenization: the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.
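
The precision and recall definitions above translate directly into two ratios. The following is a minimal sketch with invented document ids; the function name `precision_recall` and the convention of returning 0.0 for an empty set are assumptions made for the example.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved and relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved instances that are relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical search result and ground truth
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d3", "d5"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved documents are relevant (precision 2/4), and two of the three relevant documents were retrieved (recall 2/3).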

SLIDE 12

Search and Open Source

SLIDE 13

Enterprise Search: products and vendors

Vendors of proprietary enterprise search software

AskMeNow, Attivio, Concept Searching Limited, Content Analyst Company LLC, Coveo, Dassault Systèmes (acquired Exalead), Denodo, Dieselpoint, Inc., dtSearch Corp., EMC Corp., Exorbyte GmbH, Expert System S.p.A., Exterro, Inc., Fabasoft, Funnelback, Google Search Appliance, HP (acquired Autonomy Corporation which in turn acquired Verity K2 and Ultraseek), IBM (acquired Vivisimo), Inbenta, inter:gator Enterprise Search, ISYS Search Software, MarkLogic, Microsoft (includes Microsoft Search Server, Fast Search & Transfer), Mindbreeze, Neofonie (includes WeFind), Omniture (acquired by Adobe Systems), Open Text Corporation, Oracle Corporation (includes Secure Enterprise Search and Endeca Technologies Inc.), Perception Software, PolySpot, Q-go, Q-Sensei, Recommind, SAP (includes SAP NetWeaver Enterprise Search, Search Services in SAP NetWeaver AS ABAP, and Search and Classification TREX), Sinequa, SLI_Systems, Sophia Search Limited, TeraText, X1 Technologies, Inc., ZyLAB Technologies, ZL Technologies

Free and open source enterprise search software

Apache Solr, DataparkSearch, ElasticSearch, ht://Dig, Jumper 2.0, mnoGoSearch, OpenSearchServer, Searchdaimon, Sphinx

Vendors of open source enterprise search software

30 Digits, Apache Software Foundation, LucidWorks, Sematext, Flax

SLIDE 14

Open Source: they do it too.

SLIDE 15

Because Innovation = Bu$ine$$

Open Source, Open Standard

Innovation

OAGi OASIS W3C IETF IEEE ETSI Ecma OGF IEC ISO ITU CENELEC CEN BSI UNI CEI DKE DIN AFNOR GIETS LDTI

Interoperability

SLIDE 16

Solr and Business

SLIDE 17

Solr features

Advanced Full-Text Search Capabilities
Optimized for High Volume Web Traffic
Standards Based Open Interfaces - XML, JSON and HTTP
Comprehensive HTML Administration Interfaces
Server statistics exposed over JMX for monitoring
Linearly scalable, auto index replication, auto failover and recovery
Near Real-time indexing
Flexible and Adaptable with XML configuration
Extensible Plugin Architecture

A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys
Powerful Extensions to the Lucene Query Language
Faceted Search and Filtering
Geospatial Search with support for multiple points per document and geo polygons
Advanced, Configurable Text Analysis
Highly Configurable and User Extensible Caching
Performance Optimizations
External Configuration via XML
An AJAX based administration interface
Monitorable Logging
Fast near real-time incremental indexing and index replication
Highly Scalable Distributed search with sharded index across multiple hosts
JSON, XML, CSV/delimited-text, and binary update formats
Easy ways to pull in data from databases and XML files from local disk and HTTP sources
Rich Document Parsing and Indexing (PDF, Word, HTML, etc) using Apache Tika
Apache UIMA integration for configurable metadata extraction
Multiple search indices

Related Projects: Apache Hadoop, Apache ManifoldCF, Apache Lucene.Net, Apache Lucy, Apache Mahout, Apache Nutch, Apache OpenNLP, Apache Tika, Apache Zookeeper
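
To give an idea of what "Faceted Search and Filtering" in the feature list means, the sketch below counts the values of one field over a result set and then narrows the results by one of those values. It is plain Python over invented records, not Solr's faceting API; the field names `format` and `year` and both helper functions are assumptions made for the example.

```python
from collections import Counter

def facet_counts(documents, field):
    """Count how many documents carry each value of the given field."""
    return Counter(doc[field] for doc in documents if field in doc)

def filter_by(documents, field, value):
    """Keep only the documents whose field equals the selected facet value."""
    return [doc for doc in documents if doc.get(field) == value]

# Hypothetical result set of an earlier query
hits = [
    {"id": 1, "format": "pdf", "year": 2012},
    {"id": 2, "format": "word", "year": 2013},
    {"id": 3, "format": "pdf", "year": 2013},
]
counts = facet_counts(hits, "format")       # values offered to the user as filters
narrowed = filter_by(hits, "format", "pdf")  # the user picks the "pdf" facet
print(dict(counts), [doc["id"] for doc in narrowed])
```

The counts drive the guided navigation shown next to the results, and each selected value becomes a filter on the next query.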

SLIDE 18

Search, already a 'commodity'

Search is Everywhere!
Keyword search is a commodity
Holistic view of the data and the users is critical
Scalable Search, Discovery and Analytics are the key to unlocking this view of users and data

Documents, Access, Content, Relationships, User interaction

Traditional

Fast, fuzzy text matching across a large document collection
De-normalized data, “light” relational
Top N problems
Key-value (top 1)
Recommendations
“Good enough” classification, clustering
Faceting, slicing and dicing of enumerated data
Spatial, spell checking, record linkage, highlighting

NoSQL

And:

eCommerce
Search + Recs + Analysis of users
Knowledge Management
Financial, transportation, pharma
Fraud detection
Social media
Trend monitoring
Information technology
Log monitoring, analysis
Healthcare
DNA Analysis

SLIDE 19

Smart without Search?

SLIDE 20

Solr: who uses it?

Buy.com

SLIDE 21

Beyond Search

SLIDE 22

A success story

SLIDE 23

SLIDE 24

SLIDE 25

SLIDE 26

SLIDE 27

SLIDE 28

SLIDE 29

SLIDE 30

SLIDE 31

SLIDE 32

SLIDE 33

SLIDE 34

SLIDE 35

SLIDE 36

SLIDE 37

SLIDE 38

SLIDE 39

Happy searching, everyone.

Thank you!

Luca Bonesini

www.sourcesense.com

l.bonesini@sourcesense.com

Tel. +39 366 688 7125

www.lucabonesini.it
twitter: @lbonesini
skype: lbonesini