LucaBonesini | Titulus User Group, Kion – Bologna 4/dic/2013
Apache Solr
the enterprise search platform
Who am I
Luca Bonesini
Computer scientist, javelin thrower, programmer, bass guitar player, system administrator, entrepreneur, pre-sales engineer, mountain biker, webmaster, father of two, chorister, marketer
http://www.lucabonesini.it @lbonesini http://it.linkedin.com/in/lucabonesini/ l.bonesini@sourcesense.com +39 366 688 7125
Sourcesense
Making sense of Open Source
Contributors: Lucene/Solr, Apache Chemistry, Apache Jackrabbit, OpenSSO-Alfresco
Committers: Hibernate Search Project, Apache/UIMA project, JBoss GateIn Portal
Lead developer: Lucene Infinispan integration
What are they?
Apache Lucene (core)
Search by ASF
“Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform”.
http://lucene.apache.org/core/
Fast and efficient scoring and indexing algorithms
Lots of contributions to make common tasks easier: highlighting, spatial, query parsers, benchmarking tools, etc.
Most widely deployed search library on the planet
Apache Solr
Search by ASF
“Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search”.
Highly reliable, scalable, fault tolerant: distributed indexing, replication, load-balanced querying, automated failover and recovery, centralized configuration.
Apache Solr
Search by ASF
Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language.
http://lucene.apache.org/solr
Access Lucene over HTTP: Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
Most programming tasks in Lucene are configuration tasks in Solr
Faceting (guided navigation, filters, etc.)
Replication and distributed search support
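To make "access Lucene over HTTP" concrete, here is a minimal Python sketch that only builds the URL of a Solr search request, without contacting any server. The host, the core name `collection1`, and the field names `title`, `author` and `year` are assumptions made for illustration; `q`, `wt`, `rows`, `start`, `facet` and `facet.field` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, facet_fields=(), rows=10, start=0):
    """Return the URL of a Solr /select request asking for a JSON response."""
    params = [
        ("q", query),      # the user query, in Lucene query syntax
        ("wt", "json"),    # response writer: JSON instead of the default XML
        ("rows", rows),    # page size
        ("start", start),  # offset of the first result (paging)
    ]
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may be repeated, once per field to facet on
        params.extend(("facet.field", field) for field in facet_fields)
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical local installation, core and fields.
url = build_select_url(
    "http://localhost:8983/solr", "collection1",
    "title:solr", facet_fields=["author", "year"], rows=5,
)
print(url)
```

Any language with an HTTP client can issue the same request, which is the point of the slide: the search logic lives in the server configuration, not in client code.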
Search in a suit and tie
Enterprise Search: what and how.
“Enterprise search is the practice of making content from multiple enterprise-type sources, such as databases and intranets, searchable to a defined audience”. [Wikipedia]
Ingestion → Processing and analysis → Indexing → Query parsing → Matching
Ingestion: pull or push (integration API, crawler, connector).
Processing and analysis: document types and formats (XML, HTML, Office, etc.) to plain text; stemming, lemmatization, synonym expansion, entity extraction, part-of-speech tagging, tokenization.
Indexing: dictionary of all unique words in the corpus; ranking; term frequency.
Query parsing: user query; faceting; paging.
Matching: query-index comparison; references to source documents.
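The five stages above can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not how Lucene or Solr implement them: the "sources" are an in-memory dict, the stemmer is naive suffix stripping, and ranking is a plain sum of term frequencies.

```python
import re
from collections import defaultdict

# Naive suffix stripping, standing in for a real stemmer (an assumption of this sketch).
SUFFIXES = ("ing", "ed", "er", "s")


def stem(token):
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Processing and analysis: tokenize, lowercase, stem."""
    return [stem(tok) for tok in re.findall(r"[a-z0-9]+", text.lower())]


def build_index(documents):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query):
    """Query parsing and matching: AND over all query terms, ranked by summed term frequency."""
    terms = analyze(query)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    matching = set(postings[0]).intersection(*postings[1:])
    scores = {doc: sum(p[doc] for p in postings) for doc in matching}
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Ingestion: in this sketch the "sources" are just an in-memory dict.
documents = {
    "doc1": "Fishing boats fished near the harbour",
    "doc2": "The fisher sold fish at the market",
    "doc3": "Enterprise search indexes intranet documents",
}
index = build_index(documents)
print(search(index, "fishing"))
print(search(index, "search documents"))
```

Note that the same `analyze` function is applied to documents and to queries; that symmetry is what lets "fishing" match a document that only contains "fisher" and "fish".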
Enterprise Search: what and how.
Web crawler (also called Web spider, ant, automatic indexer, or Web scutter): a program that browses the Web for the purpose of Web indexing.
Precision/Recall: in pattern recognition and information retrieval, precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved.
Stemming: the process of reducing inflected (or sometimes derived) words to their stem, base or root form (e.g. "fishing", "fished", and "fisher" to the root word "fish").
Lemmatization: in linguistics, the process of grouping together the different inflected forms of a word so they can be analysed as a single item (e.g. the word "better" has "good" as its lemma).
Named-entity recognition (entity extraction): a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.
Part of speech: a linguistic category of words (or more precisely lexical items), which is generally defined by the syntactic or morphological behaviour of the lexical item in question (e.g. noun and verb).
Tokenization: the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing.
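As a worked example of the precision and recall definitions above, the following Python sketch computes both for one query. The document identifiers and the relevance judgments are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant retrieved / all retrieved
    recall    = relevant retrieved / all relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list and relevance judgments.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc2", "doc4", "doc7"]
precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved documents are relevant (precision 0.50) and two of the three relevant documents were found (recall about 0.67). Returning more results usually raises recall and lowers precision.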
Enterprise Search: products and vendors
Vendors of proprietary enterprise search software
AskMeNow, Attivio, Concept Searching Limited, Content Analyst Company LLC, Coveo, Dassault Systèmes (acquired Exalead), Denodo, Dieselpoint, Inc., dtSearch Corp., EMC Corp., Exorbyte GmbH, Expert System S.p.A., Exterro, Inc., Fabasoft, Funnelback, Google Search Appliance, HP (acquired Autonomy Corporation which in turn acquired Verity K2 and Ultraseek), IBM (acquired Vivisimo), Inbenta, inter:gator Enterprise Search, ISYS Search Software, MarkLogic, Microsoft (includes Microsoft Search Server, Fast Search & Transfer), Mindbreeze, Neofonie (includes WeFind), Omniture (acquired by Adobe Systems), Open Text Corporation, Oracle Corporation (includes Secure Enterprise Search and Endeca Technologies Inc.), Perception Software, PolySpot, Q-go, Q-Sensei, Recommind, SAP (includes SAP NetWeaver Enterprise Search, Search Services in SAP NetWeaver AS ABAP, and Search and Classification TREX), Sinequa, SLI Systems, Sophia Search Limited, TeraText, X1 Technologies, Inc., ZyLAB Technologies, ZL Technologies
Free and open source enterprise search software
Apache Solr, DataparkSearch, ElasticSearch, ht://Dig, Jumper 2.0, mnoGoSearch, OpenSearchServer, Searchdaimon, Sphinx
Vendors of open source enterprise search software
30 Digits, Apache Software Foundation, LucidWorks, Sematext, Flax
Open Source: they do it too.
Because Innovation = Bu$ine$$
Open Source Open Standard
OAGi OASIS W3C IETF IEEE ETSI Ecma OGF IEC ISO ITU CENELEC CEN BSI UNI CEI DKE DIN AFNOR GIETS LDTI
Interoperability
Solr features
Advanced Full-Text Search Capabilities
Optimized for High Volume Web Traffic
Standards Based Open Interfaces - XML, JSON and HTTP
Comprehensive HTML Administration Interfaces
Server statistics exposed over JMX for monitoring
Linearly scalable, auto index replication, auto failover and recovery
Near Real-time indexing
Flexible and Adaptable with XML configuration
Extensible Plugin Architecture
Geospatial Search with support for multiple points per document and geo polygons
Advanced, Configurable Text Analysis
Highly Configurable and User Extensible Caching
Highly Scalable Distributed search with sharded index across multiple hosts
Easy ways to pull in data from databases and XML files from local disk and HTTP sources
Rich Document Parsing and Indexing (PDF, Word, HTML, etc) using Apache Tika
Related Projects: Apache Hadoop, Apache ManifoldCF, Apache Lucene.Net, Apache Lucy, Apache Mahout, Apache Nutch, Apache OpenNLP, Apache Tika, Apache Zookeeper
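The "open interfaces" feature also covers indexing: documents can be pushed to Solr as JSON over HTTP. The Python sketch below only builds such a request and does not send it. The host, the core name `collection1`, and the `id` and `title` fields are assumptions for illustration, and the `/update` path with a JSON array body reflects my reading of the Solr 4 documentation, so check it against the version you run.

```python
import json
from urllib.request import Request


def build_update_request(base_url, core, documents, commit=True):
    """Build (but do not send) an HTTP POST that adds documents to a Solr core."""
    url = f"{base_url.rstrip('/')}/{core}/update"
    if commit:
        url += "?commit=true"  # ask Solr to commit so the documents become searchable
    body = json.dumps(documents).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")


# Hypothetical installation, core and schema fields.
request = build_update_request(
    "http://localhost:8983/solr", "collection1",
    [{"id": "doc-1", "title": "Apache Solr, the enterprise search platform"}],
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it; that step is left out here so the sketch runs without a Solr server.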
Search, already a 'commodity'
Search is everywhere!
Keyword search is a commodity
Holistic view of the data and the users is critical
Scalable Search, Discovery and Analytics are the key to unlocking this view of users and data
Documents, Access, Content Relationships, User Interaction
Traditional: large document collection, relational, clustering, enumerated data, linkage, highlighting
Smart without Search?
Solr: who uses it
Buy.com
A success story
Happy searching, everyone.
www.sourcesense.com
l.bonesini@sourcesense.com
www.lucabonesini.it
twitter: @lbonesini
skype: lbonesini