10 BEST PRACTICES FOR SOLUTION ARCHITECTURES THAT WOULD TAME BIG DATA !!!
BIG DATA BEST PRACTICE-1
USE CASE ! USE CASE ! USE CASE ! (FRAME IT TIGHT)
THE IDEA IN BRIEF
What are the questions at the heart of the problem ?
- Formulate the hypotheses / questions at the heart of the issue
- Distill them into a clear set of hypotheses to be tested
- Remember: Hadoop and associated technology components are a means to an end
- Isolate the $-denting analytical use case
REAL LIFE EXAMPLE : CURATING USE CASE IN TELECOM SECURITY INTELLIGENCE
Business context: What new signals should we listen to in order to prevent adverse events from happening ?
4 data pools
- Netsweeper logs
- Radius logs
- Switch CDR
- MMS logs
2 use cases
- Watch list analysis + network link analysis
- MMS video virality
Have an intensive ½ day cross functional workshop with business to boil down the game changing use case
Is it a “nice to have” use case or a “$ impacting use case” ?
Who is the consumer of the use case ?
How does it help him optimize cost or reduce risk or increase revenue ?
Business backwards and NOT technology forward
BIG DATA BEST PRACTICE-2
IMPACT “AHA” MOMENT IN 60-90 DAYS. START WITH A SKELETAL WORKING SOLUTION (MVP)
THE IDEA IN BRIEF: DELIVER FIRST BIG DATA “AHA” MOMENT IN 60-90 DAYS
- Skeletal MVP: end-to-end implementation that links all architectural components together
- Could be the answer to a previously unanswered question
- Propels the momentum of the big data project
A REAL LIFE EXAMPLE
Industry = OTA
Context: important to improve the look-to-book ratio
Is there a correlation between the response time of a web page and the look-to-book ratio ?
Hadoop cluster + Infobright + Hive jobs ready in 3 weeks
Scaled data and improved dashboard experience over another 3 weeks
Business readout in 6 weeks
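The question in this example can be framed as a simple correlation test. The sketch below is only an illustration: the sample figures are invented, and the look-to-book ratio is assumed here to mean searches per booking (the slide does not define it).

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)

def look_to_book(looks, books):
    """Assumed definition: number of searches (looks) per booking."""
    return looks / books

# Hypothetical hourly buckets: page response time (ms), searches, bookings.
response_ms = [200, 400, 600, 800, 1000]
looks = [1000, 1000, 1000, 1000, 1000]
books = [50, 40, 30, 25, 20]

ratios = [look_to_book(l, b) for l, b in zip(looks, books)]
r = pearson(response_ms, ratios)
print(f"correlation between response time and look-to-book: {r:.3f}")
```

A value of r close to +1 on real data would suggest that slower pages go with more looks per booking. It would not by itself prove causation.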
THEREFORE
Break it into 3 chunks
30 day milestones
60 day milestones
90 day milestones
In 30 days plan to cover functional breadth
Hadoop infrastructure + cluster
Integrate disparate components – data pipeline, Columnar database, machine learning process , Hadoop cluster
Have a small file go from start to end thru the process chain
In 60 days plan to cover scalability
Scale for at least 12 months of data
Tableau / Pentaho
In 90 days plan to cover bells n whistles
Configurators
Alerters
Additional abstraction
Don’t wait for 6-9 months !
BIG DATA BEST PRACTICE-3
ACTIONS NOT INSIGHTS
DATA -> INSIGHTS -> ACTION
Actions are executed in the frontline
Call centre
Mobile
Store channel
Digital channel
Actions could be
Behaviour based discounts
Help close a digital transaction
Serve customized webpage
Take proactive actions
Insights are nice to know. Actions impact $.
THEREFORE
WHAT ACTIONS ARE DRIVEN AS A RESULT OF THESE INSIGHTS ?
HOW ARE WE DISSEMINATING INSIGHTS TO FRONT LINE CHANNELS ?
ASK “SO WHAT” 5 TIMES !!!
BIG DATA BEST PRACTICE-4 :
LISTEN TO UNSTRUCTURED INTELLIGENCE FOR STRONG SIGNALS
REAL LIFE EXAMPLE
Keyword frequency: “Leaks”, “Leakage”, “Noise”, “Sound”, “Vibrations”
Noise / leakage frequency is a better predictor of repeat sales than any other indicator, including marketing spend !!!
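A minimal sketch of how such keyword frequencies could be counted from free text. The keyword list comes from the slide; the sample notes and the function name `keyword_counts` are hypothetical, and a real pipeline would likely add stemming so that "leak", "leaks" and "leakage" are grouped.

```python
import re
from collections import Counter

# Keywords named on the slide, lower-cased for matching.
KEYWORDS = {"leaks", "leakage", "noise", "sound", "vibrations"}

def keyword_counts(texts, keywords=KEYWORDS):
    """Count case-insensitive whole-word occurrences of each keyword."""
    counts = Counter()
    for text in texts:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in keywords:
                counts[token] += 1
    return counts

# Hypothetical unstructured records (e.g. service notes or complaints).
notes = [
    "Customer reports noise and leaks after installation",
    "Leakage near the pump; NOISE again during spin cycle",
]
print(keyword_counts(notes))
```

The resulting counts per period or per product can then be tested as a predictor alongside the structured indicators.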
A REAL LIFE EXAMPLE
XYZ Online Buzz analysis
Business Question
How can we create a strategy to respond to what we are hearing about XYZ's buzz online ?
Statistical Technique
- Text mining
- Visual data exploration
- Hypothesis testing
- Affinity analysis
Insights derived
- Sentiment trends: +/-
- Sentiment benchmark with McDonalds
- Top keywords for XYZ
- Top keywords for McDonalds
- Keyword affinities
Business Action
- Theme specific campaigns
- NPD process
- In-store experience
- Reverse impact of negative buzz
Raw data
- www.yelp.com
- www.twitter.com
WHERE DO CUSTOMERS EXPRESS THEMSELVES ?
Universe of XYZ sentiment data = 5 sources, 5556 posts, 3 years of data
Phase-1 analysis = www.yelp.com, 136 posts, 2 years of data
- Yelp.com: 136 posts
- Epinions.com: 552 posts
- planetfeedback.com: 2854 posts
- Twitter.com: 1500 posts
- Facebook.com: 500 posts
SOURCE = TWITTER.COM
SOURCE = YELP.COM
SOURCE = FACEBOOK.COM
STEP BY STEP SENTIMENT TEXT MINING PROCESS
Input -> Process -> Output
Input
- Blogs
- Customer review sites
- Online consumer forums
- Customer / vendor emails
- Unstructured data from applications
Output
- Inferences
- Customers' sentiments
OVERALL SENTIMENTS DASHBOARD
THEREFORE
- R text mining algorithm
- RHadoop
BIG DATA BEST PRACTICE-5 :
COLUMNAR & IN-MEMORY ARCHITECTURES TO SPEED UP CHAIN OF THOUGHT
HOW TO HANDLE “NEEDLE IN A HAYSTACK” WORKLOADS ?
- Which devices are infected from a malicious attack ?
- What happened on firewall-3 between 3:17 and 3:21 am ?
- How many payment gateway drops happened between 9:47 am and 9:52 am on 15-Nov-2012 ?
Data forensic queries supporting chain of thought
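To make the shape of such a forensic query concrete, here is a small sketch of the payment gateway question as a time-window filter. The event records, field layout and source names are invented for illustration, and the window is assumed to include its start and exclude its end.

```python
from datetime import datetime

# Hypothetical event log: (timestamp, source, kind).
EVENTS = [
    (datetime(2012, 11, 15, 9, 46, 59), "payment-gateway", "drop"),
    (datetime(2012, 11, 15, 9, 47, 30), "payment-gateway", "drop"),
    (datetime(2012, 11, 15, 9, 49, 0), "payment-gateway", "ok"),
    (datetime(2012, 11, 15, 9, 51, 45), "payment-gateway", "drop"),
    (datetime(2012, 11, 15, 9, 52, 0), "payment-gateway", "drop"),
    (datetime(2012, 11, 15, 9, 50, 0), "firewall-3", "drop"),
]

def count_events(events, source, kind, start, end):
    """Count events of one kind from one source with start <= timestamp < end."""
    return sum(
        1
        for ts, src, k in events
        if src == source and k == kind and start <= ts < end
    )

drops = count_events(
    EVENTS,
    source="payment-gateway",
    kind="drop",
    start=datetime(2012, 11, 15, 9, 47),
    end=datetime(2012, 11, 15, 9, 52),
)
print(drops)
```

On a full scan this filter touches every record, which is why these narrow, selective queries are the workloads where storage layout and indexing matter most.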
Columnar DB – Concept in Brief
Id  Name     Designation         Tenure
S1  Prem     Founder             8
S2  Simon    Security Architect  5
S3  Bhavana  Sales Head          6
S4  Ram      CEO                 3
S5  Shyam    Developer           1
Row store layout: S1PremFounder8 S2SimonSecurityArchitect5 S3BhavanaSalesHead6 S4RamCEO3 S5ShyamDeveloper1
Column store layout: S1S2S3S4S5 PremSimonBhavanaRamShyam FounderSecurityArchitectSalesHeadCEODeveloper 85631
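The same idea as a small sketch, using the five records from the slide. The helper name `to_columns` is hypothetical, and plain Python lists only mimic the layout; they say nothing about how Infobright or any other product stores data internally.

```python
HEADER = ("Id", "Name", "Designation", "Tenure")

# Row-oriented layout: one tuple per record.
ROWS = [
    ("S1", "Prem", "Founder", 8),
    ("S2", "Simon", "Security Architect", 5),
    ("S3", "Bhavana", "Sales Head", 6),
    ("S4", "Ram", "CEO", 3),
    ("S5", "Shyam", "Developer", 1),
]

def to_columns(header, rows):
    """Pivot row-oriented records into one list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(header)}

COLUMNS = to_columns(HEADER, ROWS)

# An aggregate over one attribute reads a single contiguous column
# instead of walking every field of every row.
tenures = COLUMNS["Tenure"]
average_tenure = sum(tenures) / len(tenures)
print(COLUMNS["Tenure"], average_tenure)
```

Analytical queries usually touch a few columns across many rows, which is the access pattern the column layout favours.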
- Interactive or real-time query for large datasets = key to analyst productivity (supports chain of thought analysis).
- Chain of thought analysis = explore the data torrent by quickly running off a series of iterative queries, each informed by the last.
- Most solutions aren't fast enough and reduce analytical effectiveness when the user's chain of thought process is interrupted.
IN MEMORY DATABASES !
In-memory DB tools: Dremel at Google, Druid at Metamarkets, Sting at Netflix, Cloudera’s Impala, UC Berkeley AMPLab’s Spark, SAP HANA, Platfora.
THEREFORE
Examine columnar databases and in-memory databases to speed up important query workloads
Download evaluation versions of Actian and Infobright and do a POC
BIG DATA BEST PRACTICE-6 :
THINK 100 X SCALABILITY !!!
HOW TO PLAN FOR 100 X SCALABILITY ?
REAL LIFE EXAMPLE
Industry = Telecom
Business context: National content filtering solution
Events generated per day: 1 billion events
New URLs classified per day: 1 million
Daily log volume: 400 GB average
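A back-of-the-envelope sketch of what 100 x means for the figures above. The function name `scale_plan` is hypothetical, decimal units (1 TB = 1000 GB) are assumed, and the per-second figure is a daily average; real peaks would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def scale_plan(events_per_day, log_gb_per_day, factor=100):
    """Scale daily event count and log volume by a growth factor."""
    scaled_events = events_per_day * factor
    scaled_log_gb = log_gb_per_day * factor
    return {
        "events_per_day": scaled_events,
        "events_per_second": scaled_events / SECONDS_PER_DAY,  # daily average
        "log_tb_per_day": scaled_log_gb / 1000,  # decimal terabytes
    }

# Figures from the telecom example: 1 billion events and 400 GB of logs per day.
baseline = scale_plan(1_000_000_000, 400, factor=1)
target = scale_plan(1_000_000_000, 400, factor=100)

for name, plan in (("today", baseline), ("100 x", target)):
    print(
        f"{name}: {plan['events_per_second']:,.0f} events/s average, "
        f"{plan['log_tb_per_day']:,.1f} TB of logs per day"
    )
```

Running the numbers early shows whether each component (pipeline, cluster, columnar store) has a credible path to the 100 x figure.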
The data torrent reaching the organisation
- Price sensitive search
- Store search
- Ratings based ordering
- Comparator events
- Basket add events
- Payment gateway events
BIG DATA BEST PRACTICE-7 :
DETECT DATA PATTERNS IN REAL TIME !!!
Real time sense making
THE CONTEXT
- Velocity is high
- Decision making window is short
- Cost of not intervening is high
REAL TIME EXAMPLE
Decision window = 8 mins
If a high value customer (decile = 1 on last 36 months revenue)
and intra book interval > threshold
and recency of search < 70
then route to call center channel
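The rule above can be sketched as a plain function evaluated per customer session. The field names, the `"default"` channel and the sample values are assumptions for illustration; the slide does not give the units of the interval or of the recency limit of 70, so none are assumed here.

```python
from dataclasses import dataclass

@dataclass
class Session:
    revenue_decile: int         # 1 = top decile on last 36 months revenue
    intra_book_interval: float  # time since previous booking (unit not given on the slide)
    search_recency: float       # recency of last search (unit not given on the slide)

def route(session, interval_threshold, recency_limit=70):
    """Return the channel for a session according to the slide's rule."""
    if (
        session.revenue_decile == 1
        and session.intra_book_interval > interval_threshold
        and session.search_recency < recency_limit
    ):
        return "call_center"
    return "default"  # hypothetical name for "no intervention"

# Hypothetical sessions and threshold.
high_value = Session(revenue_decile=1, intra_book_interval=120, search_recency=10)
low_value = Session(revenue_decile=4, intra_book_interval=120, search_recency=10)
print(route(high_value, interval_threshold=90), route(low_value, interval_threshold=90))
```

In a real deployment the rule would run inside the streaming layer so that the decision lands within the 8 minute window.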
THEREFORE
Include S4 and other real time analytics components in your big data reference architecture
BIG DATA BEST PRACTICE-8
CAPTOLOGY = PERSUASION THRU TECHNOLOGY
THE BASICS
DESIGN FOR BEHAVIOURAL CHANGE
Persuasion examples
- Persuade users to change channel behaviour (move from desktop to mobile channel)
- Persuade users to advocate to friends
CAPTOLOGY IN ACTION
Captology in Insurance
- Reduce rates each time a person reports his or her exercise behaviour to a group of peers online
Captology in Social
THERE ARE TOO MANY GOOD PRODUCTS HIDDEN BEHIND BAD USER INTERFACES
PRODUCT = INTERFACE
FOR BIZ USER, WHAT LIES UNDER THE HOOD DOES NOT MATTER
BIG DATA BEST PRACTICE-9
STRETCH KEY BIG DATA COMPONENTS TO SEE WHAT BREAKS !
BEST PRACTICE-9 : INTERSECTIONS OF MOVING PARTS ARE THE WEAK LINKS
Big data moving parts
Columnar databases
Hadoop clusters
Advanced visualisation layer
Real time components
Data pipelines
APIs / scrapers to syndicate info
Bridge to existing DW
The intersections can give way as data / user volumes increase
A real life big data architecture
Event loggers
Hbase/Cassandra for high velocity event absorption
Sqoop/Flume for data ingestion
Hadoop cluster for massive data crunching
R for extracting patterns
Columnar database for 10 x lightning retrieval
Tableau for advanced visualisation
S4 for real time analytics
Channel integration components
Hadoop cluster, R predictor ranking, Infobright columnar DB
THEREFORE … WATCH THE FOLLOWING 4 WEAK LINKS
1. Link between operational event streams and Hadoop cluster
2. Link between Hadoop cluster and columnar database
3. Link between columnar database and the visualisation tool
4. Time it takes for the machine learning algorithm to run
HIGH VELOCITY DATA PIPELINE
WHAT'S THE INGESTION RATE IN EVENTS PER SECOND OF THE DATA PIPELINE ?
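One way to answer that question during a stress test is to time a batch of events through the ingestion step. This is a minimal sketch: `measure_ingestion_rate` and the list-backed sink are hypothetical stand-ins for whatever sits between the event loggers and the cluster, and a single-process loop like this only bounds the rate of the code under test, not of a distributed pipeline.

```python
import time

def measure_ingestion_rate(events, sink, clock=time.perf_counter):
    """Push every event into sink and return the observed events per second."""
    start = clock()
    count = 0
    for event in events:
        sink(event)
        count += 1
    elapsed = clock() - start
    return count / elapsed if elapsed > 0 else float("inf")

# Stand-in sink: an in-memory buffer instead of a real pipeline endpoint.
buffer = []
rate = measure_ingestion_rate(({"id": i} for i in range(100_000)), buffer.append)
print(f"ingested {len(buffer)} events at {rate:,.0f} events/s")
```

Repeating the measurement at growing batch sizes shows where the link starts to degrade, which is the point of stretching the components.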
BIG DATA BEST PRACTICE-10
EMBED MACHINE LEARNING PROCESSES TO DETECT PATTERNS
BEST PRACTICE-10 : 7 CORE MACHINE LEARNING BUILDING BLOCKS FOR ORCHESTRATING ANALYTICAL PROCESSES
Collaborative filtering
Apriori
Text mining
A/B testing
Clustering
Scoring models