Large Scale Search, Discovery and Analytics with Solr, Mahout and Hadoop



SLIDE 1

Search Discover Analyze

Grant Ingersoll, Chief Scientist, Lucid Imagination

Large Scale Search, Discovery and Analytics with Solr, Mahout and Hadoop

SLIDE 2

  • Good keyword search is a commodity and easy to get up and running
  • The bar is raised
  • Relevance is (always will be?) hard
  • Holistic view of the data AND the users is critical

Search is Dead, Long Live Search

Diagram labels: User Interaction, Access, Content Relationships

SLIDE 3

  • Quick Background and Needs
  • Architecture: Abstract, Practical
  • SDA in Practice: Components, Challenges and Lessons Learned
  • Wrap Up

Topics

SLIDE 4

User Needs

  • Real-time, ad hoc access to content
  • Aggressive prioritization based on importance
  • Serendipity
  • Feedback/learning from past

Business Needs

  • Deeper insight into users
  • Leverage existing internal knowledge
  • Cost effective

Why Search, Discovery and Analytics (SDA)?

SLIDE 5

  • Fast, efficient, scalable search
  • Bulk and Near Real Time (NRT) indexing
  • Handle billions of records with sub-second search and faceting
  • Large scale, cost effective storage and processing capabilities
  • Need whole data consumption and analysis
  • Experimentation/sampling tools
  • Distributed in-memory where appropriate
  • NLP and machine learning tools that scale to enhance discovery and analysis

What Do Developers Need for SDA?

SLIDE 6

Abstract -> Practical SDA Architecture

Architecture diagram (labels recovered from the figure):

  • Access (API, UI, Visualization)
  • Provisioning, Monitoring, Infrastructure
  • Computation and Storage: Shards, DFS, Logs
  • Search, Discovery and Analytics
  • Glue
  • Data Mgmt, Service Mgmt, Admin, Content Acquisition, Search, DB, NoSQL, KV, Dist. Process, Experiment Mgmt, Stats Package, Machine Learning, Docs, Access, User Modeling
  • Pig, Mahout, R, GATE, Others

SLIDE 7

Computation and Storage

Solr

  • SolrCloud
  • Document Storage?
  • Document Index

Hadoop

  • WebHDFS
  • Small files are an unnatural act
  • Stores logs, raw files, intermediate files, etc.

HBase

  • User Histories
  • Document Storage?
  • Metric Storage

Challenges

  • Who is the authoritative store? Solr or HBase?
  • Real time vs. batch
  • Where should analysis be done?

SLIDE 8

Three primary concerns

  • Performance/Scaling
  • Relevance
  • Operations: monitoring, failover, etc.

Business typically cares more about relevance; devs care more about performance (and then ops).

Search In Practice

SLIDE 9

  • SolrCloud takes care of distributed indexing and search needs
  • Transaction logs for recovery
  • Automatic leader election, so no more master/worker
  • Have to declare number of shards now, but splitting coming soon
  • Use CloudSolrServer in SolrJ

NRT config tips

  • 1 second soft commits for NRT updates
  • 1 minute hard commits (no searcher reopen)

Search with Solr: Scaling and NRT
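
The NRT commit tips map onto the `autoSoftCommit` and `autoCommit` settings in the `<updateHandler>` section of solrconfig.xml. Below is a minimal sketch that renders such a fragment. The helper name `build_update_handler` and its parameters are illustrative and not from the slides; the millisecond values simply restate the 1 second and 1 minute tips.

```python
import xml.etree.ElementTree as ET


def build_update_handler(soft_commit_ms=1000, hard_commit_ms=60000):
    """Render an <updateHandler> fragment with NRT-friendly commit settings.

    Hypothetical helper for illustration; the element names are the ones
    solrconfig.xml uses for automatic hard and soft commits.
    """
    handler = ET.Element("updateHandler", {"class": "solr.DirectUpdateHandler2"})

    # Hard commit: flush to stable storage, but do not reopen the searcher.
    hard = ET.SubElement(handler, "autoCommit")
    ET.SubElement(hard, "maxTime").text = str(hard_commit_ms)
    ET.SubElement(hard, "openSearcher").text = "false"

    # Soft commit: makes recent updates visible to searches.
    soft = ET.SubElement(handler, "autoSoftCommit")
    ET.SubElement(soft, "maxTime").text = str(soft_commit_ms)
    return handler


xml_text = ET.tostring(build_update_handler(), encoding="unicode")
print(xml_text)
```

The intervals are starting points to tune against indexing load rather than fixed recommendations.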

SLIDE 10

ABT – Always Be Testing

  • Experiment management is critical
  • Top X + random sampling of long tail
  • Click logs

Track Everything!

  • Queries
  • Clicks
  • Displayed documents
  • Mouse/scroll tracking???

Phrases are your friend

Search: Relevance
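
One way to read "Top X + random sampling of long tail" is as a recipe for picking which queries to judge: take the most frequent queries, then add a random sample of the rest. Below is a minimal sketch under that assumption. The function name, parameters and toy log are illustrative, not from the slides.

```python
import random
from collections import Counter


def sample_queries_for_judging(query_log, top_x, tail_sample_size, seed=0):
    """Return (head, tail_sample): the top_x most frequent queries plus a
    random sample of the remaining long-tail queries."""
    counts = Counter(query_log)
    ranked = [query for query, _ in counts.most_common()]
    head = ranked[:top_x]
    tail = ranked[top_x:]
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    tail_sample = rng.sample(tail, min(tail_sample_size, len(tail)))
    return head, tail_sample


# Toy query log: two head queries and four long-tail queries.
log = ["solr"] * 5 + ["mahout"] * 3 + ["hadoop", "pig", "hive", "hbase"]
head, tail_sample = sample_queries_for_judging(log, top_x=2, tail_sample_size=2)
print(head, tail_sample)
```

Sampling the tail keeps the evaluation set from being dominated by the handful of queries that are already well tuned.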

SLIDE 11

Serendipity

  • Related Items
  • Topics
  • Recommendations
  • Did you mean?
  • More Like This
  • Trends
  • Stat. Interesting Phrases

Organization

  • Clustering
  • Named Entities
  • Importance
  • Time Factors
  • Faceting
  • Classification

Data Quality

  • Duplicates
  • Boosts
  • Length
  • Document factor Distributions

Discovery Components

Challenges

  • Many of these are intense calculations or iterative
  • Many are subjective and require a lot of experimentation

SLIDE 12

Mahout’s 3 “C”s provide tools for helping across many aspects of discovery

  • Collaborative Filtering
  • Classification
  • Clustering

Also

  • Collocations (Statistically Interesting Phrases)
  • SVD
  • Others

Challenges

  • High cost to iterative machine learning algorithms
  • Mahout is very command line oriented
  • Some areas less mature

Discovery with Mahout
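
Collocations ("statistically interesting phrases") are commonly scored with Dunning's log-likelihood ratio over a 2x2 table of bigram counts, which is also the statistic Mahout's collocation tooling is built around. Below is a minimal single-machine sketch of that score, not Mahout's distributed job. The function names are illustrative.

```python
import math


def x_log_x(x):
    """x * ln(x), with the convention 0 * ln(0) = 0."""
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy term used by the log-likelihood ratio."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Score how strongly two words co-occur.

    k11: A followed by B      k12: A followed by something else
    k21: something else then B   k22: neither A first nor B second
    """
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating point rounding.
    return max(0.0, 2.0 * (row + col - mat))


independent = log_likelihood_ratio(1, 1, 1, 1)
weak = log_likelihood_ratio(6, 4, 4, 6)
strong = log_likelihood_ratio(10, 0, 0, 10)
print(independent, weak, strong)
```

Higher scores mean the pair occurs together more often than independence would predict, so ranking bigrams by this score surfaces candidate phrases.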

SLIDE 13

  • Plan for running experiments from the beginning across Search and Discovery components
  • Your analytics engine should help!
  • Types of experiments to consider: indexing/analysis, query parsing, scoring formulas, machine learning models, recommendations, many more
  • Make it easy to do A/B testing across all experiments and to compare and contrast the results

Aside: Experiment Management
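
A/B testing across experiments needs a stable way to put each user in a variant. One common approach, sketched here as an assumption rather than anything the slides prescribe, is to hash the experiment name together with the user id. The function name, variant labels and experiment name are hypothetical.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant for a given experiment.

    Including the experiment name in the hash key means the same user can
    land in different buckets for different experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# The same user always gets the same variant within one experiment.
first = assign_variant("user-42", "scoring-formula-v2")
second = assign_variant("user-42", "scoring-formula-v2")
assignments = {assign_variant(f"user-{i}", "scoring-formula-v2") for i in range(1000)}
print(first, second, sorted(assignments))
```

Logging the assigned variant alongside queries and clicks is what lets the analytics side compare results per experiment.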

SLIDE 14

Commonly used components: Solr, R Stats, Hive, Pig, Commercial

Starting with Search and Discovery metrics and analysis gives context into where to make investments for broader analytics

Analytics Components

SLIDE 15

Simple counts

  • Facets
  • Term and document frequencies
  • Clicks

Search and Discovery example metrics

  • Relevance measures like Mean Reciprocal Rank
  • Histograms/drilldowns around number of results
  • Log and navigation analysis

Data cleanliness analysis is helpful for finding potential issues in content

Analytics in Practice
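
Mean Reciprocal Rank (MRR) averages, over all queries, the reciprocal of the rank at which the first relevant result appears (0 when none appears). Below is a minimal sketch. The data layout, a dict of ranked result lists plus a dict of relevant document ids per query, is an assumption for illustration, as are the toy values.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """Compute MRR.

    ranked_results: {query: [doc_id, ...]} in ranked order
    relevant:       {query: set of relevant doc_ids}, e.g. clicked documents
    """
    if not ranked_results:
        return 0.0
    total = 0.0
    for query, results in ranked_results.items():
        wanted = relevant.get(query, set())
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in wanted:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_results)


results = {"q1": ["d3", "d1", "d2"], "q2": ["d5", "d6"], "q3": ["d7"]}
judged = {"q1": {"d1"}, "q2": {"d5"}, "q3": {"d9"}}
mrr = mean_reciprocal_rank(results, judged)
print(mrr)  # (1/2 + 1 + 0) / 3 = 0.5
```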

SLIDE 16

  • Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users
  • Solr + Hadoop + Mahout
  • Design for the big picture when building search-based applications

Wrap

SLIDE 17

  • http://www.lucidimagination.com
  • grant@lucidimagination.com
  • @gsingers

Find me