Large Scale Search, Discovery and Analytics with Solr, Mahout and Hadoop

Oct 29, 2022

SLIDE 1

Large Scale Search, Discovery and Analytics with Solr, Mahout and Hadoop

Search, Discover, Analyze

Grant Ingersoll
Chief Scientist, Lucid Imagination

SLIDE 2

Search is Dead, Long Live Search

Good keyword search is a commodity and easy to get up and running
The bar is raised
Relevance is (always will be?) hard
A holistic view of the data AND the users is critical

[Diagram: User Interaction, Access, Content Relationships]

SLIDE 3

Topics

Quick background and needs
Architecture
  Abstract
  Practical
SDA in practice
  Components
  Challenges and lessons learned
Wrap up

SLIDE 4

Why Search, Discovery and Analytics (SDA)?

User needs
  Real-time, ad hoc access to content
  Aggressive prioritization based on importance
  Serendipity
  Feedback/learning from the past
Business needs
  Deeper insight into users
  Leverage existing internal knowledge
  Cost effective

SLIDE 5

What Do Developers Need for SDA?

Fast, efficient, scalable search
  Bulk and near-real-time indexing
  Handle billions of records with sub-second search and faceting
Large-scale, cost-effective storage and processing capabilities
  Need whole-data consumption and analysis
  Experimentation/sampling tools
  Distributed in-memory where appropriate
NLP and machine learning tools that scale to enhance discovery and analysis

SLIDE 6

Abstract -> Practical SDA Architecture

[Architecture diagram; labels recovered from the slide]
Access (API, UI, Visualization)
Provisioning, Monitoring, Infrastructure
Computation and Storage: Shards, DFS, Logs
Search, Discovery and Analytics
Glue
Data Mgmt, Service Mgmt, Admin, Content Acquisition, Search, DB, NoSQL, KV, Dist. Process, Experiment Mgmt, Stats Package, Machine Learning, Docs, Access, User Modeling
Pig, Mahout, R, GATE, others

SLIDE 7

Computation and Storage

Solr
  SolrCloud
  Document storage?
  Document index
Hadoop
  WebHDFS
  Small files are an unnatural act
  Stores logs, raw files, intermediate files, etc.
HBase
  User histories
  Document storage?
  Metric storage
Challenges
  Who is the authoritative store? Solr or HBase?
  Real time vs. batch
  Where should analysis be done?

SLIDE 8

Search In Practice

Three primary concerns
  Performance/scaling
  Relevance
  Operations: monitoring, failover, etc.
Business typically cares more about relevance
Devs care more about performance (and then ops)

SLIDE 9

Search with Solr: Scaling and NRT

SolrCloud takes care of distributed indexing and search needs
  Transaction logs for recovery
  Automatic leader election, so no more master/worker
  Have to declare the number of shards now, but splitting is coming soon
Use CloudSolrServer in SolrJ
NRT config tips:
  1 second soft commits for NRT updates
  1 minute hard commits (no searcher reopen)
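
The NRT commit tips above correspond to the autoCommit and autoSoftCommit elements of solrconfig.xml. The following is a minimal sketch that only renders that fragment as text. The function name nrt_commit_settings and its parameters are hypothetical; the 1000 ms and 60000 ms defaults are the slide's 1 second and 1 minute values, and tuning them for a real cluster is outside what the slide states.

```python
import xml.etree.ElementTree as ET


def nrt_commit_settings(soft_commit_ms=1000, hard_commit_ms=60000):
    """Render an <updateHandler> fragment for solrconfig.xml as a string.

    Hypothetical helper for illustration; the defaults follow the slide
    (1 second soft commits, 1 minute hard commits without searcher reopen).
    """
    handler = ET.Element("updateHandler", {"class": "solr.DirectUpdateHandler2"})

    # Hard commit: persists index changes, but does not open a new searcher.
    auto_commit = ET.SubElement(handler, "autoCommit")
    ET.SubElement(auto_commit, "maxTime").text = str(hard_commit_ms)
    ET.SubElement(auto_commit, "openSearcher").text = "false"

    # Soft commit: makes recent updates visible to searches.
    auto_soft = ET.SubElement(handler, "autoSoftCommit")
    ET.SubElement(auto_soft, "maxTime").text = str(soft_commit_ms)

    return ET.tostring(handler, encoding="unicode")


if __name__ == "__main__":
    print(nrt_commit_settings())
```

Keeping openSearcher set to false on the hard commit is what the slide calls "no searcher reopen": visibility is driven by the frequent soft commits, while the less frequent hard commits handle durability.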

SLIDE 10

Search: Relevance

ABT: Always Be Testing
  Experiment management is critical
  Top X + random sampling of the long tail
  Click logs
Track everything!
  Queries
  Clicks
  Displayed documents
  Mouse/scroll tracking???
Phrases are your friend
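
One way to read "Top X + random sampling of the long tail" is as a recipe for choosing which queries to judge: take the X most frequent queries from the log, then add a random sample of everything else. The sketch below assumes the query log is just a list of query strings; the function name, parameters and fixed seed are hypothetical and not taken from the talk.

```python
import random
from collections import Counter


def sample_queries_for_judging(query_log, top_x, tail_sample_size, seed=0):
    """Return (head, tail_sample) query lists for relevance testing.

    head: the top_x most frequent distinct queries.
    tail_sample: a random sample of the remaining (long tail) distinct queries.
    The seed is fixed so the same log yields the same test set.
    """
    counts = Counter(query_log)
    ranked = [query for query, _ in counts.most_common()]
    head = ranked[:top_x]
    tail = ranked[top_x:]
    rng = random.Random(seed)
    tail_sample = rng.sample(tail, min(tail_sample_size, len(tail)))
    return head, tail_sample


if __name__ == "__main__":
    log = ["solr"] * 5 + ["hadoop"] * 3 + ["mahout", "pig", "hive", "hbase"]
    head, tail_sample = sample_queries_for_judging(log, top_x=2, tail_sample_size=2)
    print("head:", head)
    print("tail sample:", tail_sample)
```

Sampling the tail matters because the head alone over-represents a few popular queries, while most distinct queries occur only a handful of times.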

SLIDE 11

Discovery Components

Serendipity
  Related items
  Topics
  Recommendations
  Did you mean?
  More Like This
  Trends
  Statistically interesting phrases
Organization
  Clustering
  Named entities
  Importance
  Time factors
  Faceting
  Classification
Data quality
  Duplicates
  Boosts
  Length
  Document factor distributions
Challenges
  Many of these are intense or iterative calculations
  Many are subjective and require a lot of experimentation

SLIDE 12

Discovery with Mahout

Mahout’s 3 “C”s provide tools for helping across many aspects of discovery
  Collaborative filtering
  Classification
  Clustering
Also:
  Collocations (statistically interesting phrases)
  SVD
  Others
Challenges:
  High cost of iterative machine learning algorithms
  Mahout is very command-line oriented
  Some areas are less mature
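
To make "statistically interesting phrases" concrete, the sketch below scores bigrams with Dunning's log-likelihood ratio, a common choice for collocation detection. It is a small in-memory illustration, not Mahout code: the function names are hypothetical, and no claim is made that it matches Mahout's implementation detail for detail.

```python
import math
from collections import Counter


def log_likelihood_ratio(k11, k12, k21, k22):
    """Dunning's G^2 statistic for a 2x2 contingency table.

    k11: A and B together, k12: A without B,
    k21: B without A,     k22: neither A nor B.
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    score = 0.0
    for observed, r, c in cells:
        if observed > 0:  # treat 0 * log(0) as 0
            expected = rows[r] * cols[c] / total
            score += observed * math.log(observed / expected)
    return 2.0 * score


def score_bigrams(tokens):
    """Map each adjacent word pair to its log-likelihood ratio score."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    firsts = Counter(tokens[:-1])   # how often a word starts a bigram
    seconds = Counter(tokens[1:])   # how often a word ends a bigram
    n = len(tokens) - 1             # total number of bigrams
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = firsts[a] - k11
        k21 = seconds[b] - k11
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = log_likelihood_ratio(k11, k12, k21, k22)
    return scores


if __name__ == "__main__":
    text = "new york is big and new york is old and the city is new"
    ranked = sorted(score_bigrams(text.split()).items(), key=lambda kv: -kv[1])
    for pair, score in ranked[:3]:
        print(pair, round(score, 3))
```

A score near zero means the two words co-occur about as often as chance predicts; larger scores mark pairs worth treating as phrases. At the scale the talk describes, the counting step is the part that would be distributed.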

SLIDE 13

Aside: Experiment Management

Plan for running experiments from the beginning across Search and Discovery components
  Your analytics engine should help!
Types of experiments to consider
  Indexing/analysis
  Query parsing
  Scoring formulas
  Machine learning models
  Recommendations, many more
Make it easy to do A/B testing across all experiments and to compare and contrast the results
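
A/B testing across many experiments needs a stable way to decide which variant a user sees. One common approach, shown here as a sketch rather than as the speaker's method, is to hash the experiment name together with the user id. The function name, the experiment name "scoring-v2" and the variant labels are all hypothetical.

```python
import hashlib


def assign_variant(experiment, user_id, variants=("control", "treatment")):
    """Deterministically map a user to one variant of an experiment.

    Including the experiment name in the hash key means a user's bucket in
    one experiment does not determine their bucket in another.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


if __name__ == "__main__":
    for user in ("u1", "u2", "u3"):
        print(user, assign_variant("scoring-v2", user))
```

Because the assignment is a pure function of its inputs, the variant can be recomputed when analyzing click logs instead of being stored with every request.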

SLIDE 14

Analytics Components

Commonly used components
  Solr
  R Stats
  Hive
  Pig
  Commercial
Starting with Search and Discovery metrics and analysis gives context into where to make investments for broader analytics

SLIDE 15

Analytics in Practice

Simple counts:
  Facets
  Term and document frequencies
  Clicks
Search and Discovery example metrics
  Relevance measures like Mean Reciprocal Rank
  Histograms/drilldowns around number of results
  Log and navigation analysis
Data cleanliness analysis is helpful for finding potential issues in content
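
Mean Reciprocal Rank, named above as an example relevance measure, averages 1/rank of the first relevant result over a set of queries, counting a query with no relevant result as 0. A minimal sketch follows, assuming results and judgments are available as plain dictionaries; the data shapes and names are illustrative only.

```python
def mean_reciprocal_rank(ranked_results, relevant):
    """Compute MRR over a set of queries.

    ranked_results: dict mapping query -> list of doc ids, best first.
    relevant: dict mapping query -> set of relevant (e.g. clicked) doc ids.
    """
    if not ranked_results:
        return 0.0
    total = 0.0
    for query, results in ranked_results.items():
        wanted = relevant.get(query, set())
        for rank, doc_id in enumerate(results, start=1):
            if doc_id in wanted:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(ranked_results)


if __name__ == "__main__":
    results = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"], "q3": ["g", "h"]}
    judged = {"q1": {"a"}, "q2": {"e"}, "q3": {"z"}}
    print(mean_reciprocal_rank(results, judged))  # (1 + 1/2 + 0) / 3 = 0.5
```

Comparing this number between two experiment variants gives a single figure for whether relevant results moved up or down the list.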

SLIDE 16

Wrap

Search, Discovery and Analytics, when combined into a single, coherent system, provide powerful insight into both your content and your users
Solr + Hadoop + Mahout
Design for the big picture when building search-based applications

SLIDE 17

Find me

http://www.lucidimagination.com
grant@lucidimagination.com
@gsingers