 
10 Best Practices for Solution Architectures That Would Tame Big Data!!!
Big Data Best Practice-1: Use Case! Use Case! Use Case! (Frame It Tight)
The Idea in Brief
• What are the questions at the heart of the problem?
• Formulate the hypotheses/questions at the heart of the issue, then distill them into a clear set of hypotheses to be tested
• Remember: Hadoop and its associated technology components are a means, not an end
• Isolate the $-denting analytical use case
Real-Life Example: Curating Use Cases in Telecom Security Intelligence
• Business context: what new signals should we listen to in order to prevent adverse events from happening?
• 4 data pools
   • Netsweeper logs
   • RADIUS logs
   • Switch CDRs
   • MMS logs
• 2 use cases
   • Watch list analysis + network link analysis
   • MMS video virality
• Hold an intensive half-day cross-functional workshop with the business to boil down the game-changing use case
• Is it a “nice to have” use case or a “$-impacting” use case?
• Who is the consumer of the use case? How does it help them optimize cost, reduce risk, or increase revenue?
• Work business backwards, NOT technology forward
Big Data Best Practice-2: Impact an “Aha” Moment in 60-90 Days. Start with a Skeletal Working Solution (MVP)
The Idea in Brief: Deliver the First Big Data “Aha” Moment in 60-90 Days
• Skeletal MVP: an end-to-end implementation that links all architectural components together
• Could be the answer to a previously unanswered question
• Propels the momentum of the Big Data project
A Real-Life Example
• Industry = OTA (online travel agency)
• Context: it is important to improve the look-to-book ratio. Is there a correlation between the response time of a web page and the look-to-book ratio?
• Hadoop cluster + Infobright + Hive jobs ready in 3 weeks
• Scaled the data and improved the dashboard experience over another 3 weeks
• Business readout in 6 weeks
Therefore
• Break it into 3 chunks: 30-day, 60-day, and 90-day milestones
• In 30 days, plan to cover functional breadth
   • Hadoop infrastructure + cluster
   • Integrate disparate components: data pipeline, columnar database, machine learning process, Hadoop cluster
   • Have a small file go from start to end through the process chain
• In 60 days, plan to cover scalability
   • Scale for at least 12 months of data
   • Tableau / Pentaho
• In 90 days, plan to cover bells and whistles
   • Configurators
   • Alerters
   • Additional abstraction
• Don’t wait for 6-9 months!
Big Data Best Practice-3: Actions, Not Insights
(Figure labels: Action, Insights, Data)
Best Practice-3: Actions, Not Insights
• Actions are executed in the frontline
   • Call centre
   • Mobile
   • Store channel
   • Digital channel
• Actions could be
   • Behaviour-based discounts
   • Helping to close a digital transaction
   • Serving a customized webpage
   • Taking proactive actions
• Insights are nice to know; actions impact $
Therefore
• What actions are driven as a result of these insights?
• How are we disseminating insights to frontline channels?
• Ask “so what” 5 times!!!
Big Data Best Practice-4: Listen to Unstructured Intelligence for Strong Signals
Real-Life Example
• Keyword frequency: “Leaks”, “Leakage”, “Noise”, “Sound”, “Vibrations”
• Noise / leakage keyword frequency is a better predictor of repeat sales than any other indicator, including marketing spend!!!
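A minimal sketch of the keyword-frequency idea, in Python. The keyword list comes from the slide; the tokenisation rule, the function name keyword_frequency, and the sample posts are assumptions invented for illustration, not the author's actual method or data.

```python
import re
from collections import Counter

# Keywords taken from the slide; exact-token matching is an assumption.
KEYWORDS = ("leaks", "leakage", "noise", "sound", "vibrations")


def keyword_frequency(posts, keywords=KEYWORDS):
    """Count how many times each keyword appears across all posts."""
    wanted = set(keywords)
    counts = Counter()
    for post in posts:
        # Lowercase and keep only letter runs so "Noise!" matches "noise".
        for token in re.findall(r"[a-z]+", post.lower()):
            if token in wanted:
                counts[token] += 1
    return counts


# Hypothetical review snippets, invented for illustration only.
sample_posts = [
    "The unit leaks after a week and the noise is unbearable.",
    "Noise again! Also some vibrations at high speed.",
    "Great product, no sound at all.",
]
freq = keyword_frequency(sample_posts)
```

Tracked per period, counts like these could then be compared against repeat sales. A production version would likely add stemming so that “leak”, “leaks”, and “leaking” fall into one bucket.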
A Real-Life Example: XYZ Online Buzz Analysis
• Business question: how can we create a strategy to respond to what we are hearing about XYZ’s buzz online?
• Raw data: www.yelp.com, www.twitter.com
• Statistical techniques: text mining, visual data exploration, hypothesis testing, affinity analysis
• Insights derived: sentiment trends (+/-), sentiment benchmark with McDonalds, top keywords for XYZ, top keywords for McDonalds, keyword affinities
• Business actions: theme-specific campaigns, NPD process, in-store experience, reversing the impact of negative buzz
Where Do Customers Express Themselves?
• Sources: Yelp.com, Epinions.com, planetfeedback.com, Twitter.com, Facebook.com
• Post counts shown on the slide: 2854, 136, 552, 1500, 500
• Universe of XYZ sentiment data = 5 sources, 5556 posts, 3 years of data
• Phase-1 analysis = www.yelp.com, 136 posts, 2 years of data
Source = Twitter.com
Source = Yelp.com
Source = Facebook.com
Step-by-Step Sentiment Text Mining Process
• Flow: Input, Process, Output
• Input: blogs, customer review sites, customer forums, customer/vendor emails, unstructured data from applications
• Output: inferences, online consumer sentiments
Overall Sentiments Dashboard
Therefore
• R text mining algorithm
• RHadoop
Big Data Best Practice-5: Columnar & In-Memory Architectures to Speed Up Chain of Thought
Which devices are infected by a malicious attack?
How to Handle “Needle in a Haystack” Workloads?
• What happened on firewall-3 between 3:17 and 3:21 am?
• How many payment gateway drops happened between 9:47 am and 9:52 am on 15-Nov-2012?
• Data forensic queries that support a chain of thought
Columnar DB: Concept in Brief
Id   Name      Designation          Tenure
S1   Prem      Founder              8
S2   Simon     Security Architect   5
S3   Bhavana   Sales Head           6
S4   Ram       CEO                  3
S5   Shyam     Developer            1
Row-wise storage: S1 Prem Founder 8 | S2 Simon Security Architect 5 | S3 Bhavana Sales Head 6 | S4 Ram CEO 3 | S5 Shyam Developer 1
Column-wise storage: S1 S2 S3 S4 S5 | Prem Simon Bhavana Ram Shyam | Founder, Security Architect, Sales Head, CEO, Developer | 8 5 6 3 1
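The two layouts can be sketched in a few lines of Python. This is an illustration of the concept only, using the table from the slide; the helper name to_columnar is hypothetical, and real columnar engines add compression and indexing on top of this basic pivot.

```python
# Rows from the slide's table: (Id, Name, Designation, Tenure).
rows = [
    ("S1", "Prem", "Founder", 8),
    ("S2", "Simon", "Security Architect", 5),
    ("S3", "Bhavana", "Sales Head", 6),
    ("S4", "Ram", "CEO", 3),
    ("S5", "Shyam", "Developer", 1),
]
COLUMNS = ("Id", "Name", "Designation", "Tenure")


def to_columnar(rows, columns=COLUMNS):
    """Pivot row tuples into one contiguous list per column."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


columnar = to_columnar(rows)

# Row layout: an aggregate over Tenure still walks every full record.
avg_tenure_row = sum(row[3] for row in rows) / len(rows)

# Column layout: the same aggregate reads only the Tenure list.
avg_tenure_col = sum(columnar["Tenure"]) / len(columnar["Tenure"])
```

Both layouts give the same answer; the difference is how much data has to be read to get it, which is why analytical queries that touch a few columns tend to favour the columnar form.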
In-Memory Databases!
• Interactive or real-time querying of large datasets is key to analyst productivity (it supports chain-of-thought analysis)
• Chain-of-thought analysis = exploring the data torrent by quickly running a series of iterative queries, each informed by the last
• Most solutions aren’t fast enough, and analytical effectiveness drops when the user’s chain of thought is interrupted
In-memory DB tools
• Dremel at Google
• Druid at Metamarkets
• Sting at Netflix
• Cloudera’s Impala
• UC Berkeley AMPLab’s Spark
• SAP HANA
• Platfora
Therefore
• Examine columnar databases and in-memory databases to speed up important query workloads
• Download evaluation versions of Actian and Infobright and do a POC
Big Data Best Practice-6: Think 100x Scalability!!!
How to plan for 100x scalability?
Real-Life Example
• Industry = Telecom
• Business context: national content filtering solution
• Events generated per day: 1 billion events
• New URLs classified per day: 1 million
• Daily log volume: 400Gb average
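A back-of-the-envelope sketch of what 100x means for these numbers, in Python. The event and log figures are from the slide; reading “400Gb” as 400 gigabytes with 1 GB = 10^9 bytes is an assumption, and scale_plan is a hypothetical helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures from the slide.
events_per_day = 1_000_000_000
# Assumption: "400Gb" is read as 400 gigabytes, with 1 GB = 10**9 bytes.
log_bytes_per_day = 400 * 10**9


def scale_plan(factor):
    """Return average load figures at `factor` times today's volume."""
    events = events_per_day * factor
    log_bytes = log_bytes_per_day * factor
    return {
        "events_per_second": events / SECONDS_PER_DAY,
        "log_tb_per_day": log_bytes / 10**12,
        "bytes_per_event": log_bytes / events,
    }


today = scale_plan(1)     # about 11,574 events/s and 0.4 TB of logs per day
target = scale_plan(100)  # about 1.16 million events/s and 40 TB per day
```

These are daily averages; peak-hour rates will be higher, so any sizing exercise should apply its own peak factor on top.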
Big Data Best Practice-7: Detect Data Patterns in Real Time!!!
(Figure labels: the organisation, the data torrent, real-time sense making. Event streams shown: price-sensitive search, ratings-based ordering, store search, basket add, comparator, and payment gateway events)
The Context
• Velocity is high
• Decision-making window is short
• Cost of not intervening is high
Real-Time Example
• Decision window = 8 mins
• Rule: if the customer is high value (decile = 1 on last 36 months revenue) and intra-book interval > threshold and recency of search < 70, then route to the call center channel
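The routing rule above can be sketched as a small Python function. The three conditions and the limit of 70 come from the slide; the Session class, the field names, the channel labels, and the threshold value of 30.0 are hypothetical, since the slide gives neither units nor the actual threshold.

```python
from dataclasses import dataclass


@dataclass
class Session:
    revenue_decile: int         # 1 = top decile on last 36 months revenue
    intra_book_interval: float  # gap since last booking; unit not given in the slide
    search_recency: float       # recency of last search; unit not given in the slide


# Hypothetical value: the slide only says "threshold".
INTRA_BOOK_THRESHOLD = 30.0
# From the slide: recency of search must be below 70.
SEARCH_RECENCY_LIMIT = 70.0


def route(session, threshold=INTRA_BOOK_THRESHOLD):
    """Return the channel a session should be routed to."""
    high_value = session.revenue_decile == 1
    overdue = session.intra_book_interval > threshold
    recently_searched = session.search_recency < SEARCH_RECENCY_LIMIT
    if high_value and overdue and recently_searched:
        return "call_center"
    return "default"


channel = route(Session(revenue_decile=1, intra_book_interval=45.0, search_recency=12.0))
```

In a streaming setup, a rule like this would be evaluated per event inside the 8-minute decision window rather than in a nightly batch.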
Therefore
• Include S4 and other real-time analytics components in your Big Data reference architecture
Big Data Best Practice-8: Captology = Persuasion Through Technology
The Basics
• Captology = persuasion through technology
• Design for behavioural change
• Persuasion examples
   • Persuade users to change channel behaviour (move from the desktop to the mobile channel)
   • Persuade users to advocate to friends
Captology in Action
• Captology in insurance: reduce rates each time a person reports his or her exercise behaviour to a group of peers online
• Captology in social
There Are Too Many Good Products Hidden Behind Bad User Interfaces
• Product = Interface
• For the business user, what lies under the hood does not matter
Big Data Best Practice-9: Stretch Key Big Data Components to See What Breaks!
Best Practice-9: Intersects of Moving Parts Are the Weak Links
• Big Data moving parts
   • Columnar databases
   • Hadoop clusters
   • Advanced visualisation layer
   • Real-time components
   • Data pipelines
   • APIs / scrapers to syndicate info
   • Bridge to existing DW
• The intersects can give way as data / user volumes increase
• A real-life Big Data architecture
   • Event loggers
   • HBase/Cassandra for high-velocity event absorption
   • Sqoop/Flume for data ingestion
   • Hadoop cluster for massive data crunching
   • R for extracting patterns
   • Columnar database for 10x lightning retrieval
   • Tableau for advanced visualisation
   • S4 for real-time analytics
   • Channel integration components
(Figure labels: Infobright columnar DB, R predictor ranking)
Therefore: Watch the Following 4 Weak Links
1. Link between operational event streams and the Hadoop cluster
2. Link between the Hadoop cluster and the columnar database
3. Link between the columnar database and the visualisation tool
4. Time it takes for the machine learning algorithm to run
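One way to watch these links is to time each one separately while increasing the load. The Python sketch below shows only the timing harness, under stated assumptions: the link names are hypothetical labels, and the sum(range(...)) calls are stand-ins for the real transfer or training steps.

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(link_name):
    """Record the wall-clock seconds spent inside the block under link_name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[link_name] = time.perf_counter() - start


def slowest_link(measured):
    """Return the name of the link with the largest recorded duration."""
    return max(measured, key=measured.get)


# Stand-in workloads; replace each body with the real step being measured.
with timed("event_streams_to_hadoop"):
    sum(range(10_000))
with timed("hadoop_to_columnar_db"):
    sum(range(10_000))
with timed("columnar_db_to_visualisation"):
    sum(range(10_000))
with timed("ml_algorithm_runtime"):
    sum(range(10_000))

bottleneck = slowest_link(timings)
```

Repeating the run at 10x and 100x the data volume and comparing the timings shows which link degrades first.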