 
From MUD to MIRE: Managing Inherent Risk in the Enterprise
Peter J. Haas
IBM Almaden Research Center, San Jose, CA
MUD Workshop, September, 2010
The Two Perpetual Questions
• “Where do the probabilities come from?”
• “Who is going to use this stuff in the real world?”
My background in probabilistic DB
RAQA: Resolution-Aware Query Answering for Business Intelligence (Sismanis et al. 2009)
• OLAP querying (datacubes: roll-up, drill-down)
• Uncertainty due to entity resolution
• Bounds on query answers
• Implemented via SQL queries
• Conservative approach

Sum(Sales) group by City, State
City            State   Strict range   Status
San Francisco   CA      [$30,$230]     guaranteed
San Jose        CA      [$70,$200]     non-guaranteed

Sum(Sales) group by State
State   Strict range   Status
CA      [$230,$230]    guaranteed
The MCDB System (with Chris Jermaine & students)
[Figure: MCDB architecture]
• Inputs: schema, VG functions, parameter tables
• Generator: produces the random DB D
• Monte Carlo: d_1, d_2, ..., d_n are i.i.d. samples from the possible-worlds dist’n
• Query Q (e.g., Select SUM(sales) AS t_sales) runs on each sample: Q(d_1), Q(d_2), ..., Q(d_n) are i.i.d. samples from the query-result dist’n of Q(D)
• Estimator and inference: estimates of E[t_sales], Var[t_sales], and the 0.01-quantile q_.01[t_sales]; histogram; error bounds
• Many implementation tricks to ensure acceptable performance
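The Monte Carlo loop on this slide can be sketched in a few lines of Python. This is only an illustration of the idea, not MCDB's implementation: the normal-distribution VG function, the parameter table, and all function names are assumptions made for the example.

```python
import random
import statistics

def vg_normal_sales(rng, mean, std):
    """Hypothetical VG function: draw one uncertain sales value."""
    return rng.gauss(mean, std)

def sample_world(rng, param_table):
    """Instantiate one possible world d_i: one sales value per parameter row."""
    return [vg_normal_sales(rng, mean, std) for (mean, std) in param_table]

def query_total_sales(world):
    """Q: SELECT SUM(sales) AS t_sales, evaluated on one possible world."""
    return sum(world)

def monte_carlo(param_table, n, seed=0):
    """Run Q on n i.i.d. possible worlds and summarize the query-result samples."""
    rng = random.Random(seed)
    results = sorted(
        query_total_sales(sample_world(rng, param_table)) for _ in range(n)
    )
    mean = statistics.fmean(results)
    var = statistics.variance(results)
    q01 = results[int(0.01 * (n - 1))]      # simple empirical 0.01-quantile
    half_width = 1.96 * (var / n) ** 0.5    # approx. 95% error bound on the mean (CLT)
    return {"mean": mean, "var": var, "q01": q01, "ci_half_width": half_width}

# Assumed parameter table: (mean, std) of sales for three rows.
params = [(100.0, 10.0), (250.0, 40.0), (75.0, 5.0)]
est = monte_carlo(params, n=2000)
print(est)
```

The same list of query-result samples also feeds the histogram; only the summary statistics are shown here.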
Query-Result Distributions
[Figure: frequency histograms of four query results. Q1: revenue change. Q2: days until completion (long tail in delivery times). Q3: total supplier cost. Q4: additional profits.]
MC³: MapReduce + MCDB (Xu et al. 2009)
[Figure: software stack]
• Jaql: high-level query language for semi-structured JSON data (www.jaql.org, //code.google.com/p/jaql)
• Map-Reduce (Hadoop): parallel batch processing
• HDFS: distributed file system
• Tricks to manage pseudo-random numbers
Where do the probabilities come from?
Data-Warehouse Uncertainty
• Data integration
  – Source records {John Smith, San Jose}, {John Smith, San Jose}, {John Smith, Los Angeles} go through ETL
  – Result: Name = John Smith, City = (SJ, 0.66), (LA, 0.33)
• Similarity join
  – (Name, City) = (John Smith, LA) joined with (Name, Sales) = (J. Smith, $50K)
  – Result: (City, Sales) = (LA, $50K) ? (0.92)
• Information extraction (Michelakis et al., 2009)
  – System T Hotel Annotator on the text “A lovely thing to behold is Paris Hilton in the Springtime …”
  – Result: Hotels = NY Marriott; Paris Hilton ? (0.20)
  – Text Miner on the message “09/09/2007 Re: system crash. This morning, my ORACLE system on LINUX exploded in a spectacular fireball …”
  – Result: Source = Cust0385, Problem type = (DBMS, 0.8), (OS, 0.2)
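One plausible reading of the data-integration example is that the city probabilities are relative frequencies among the conflicting source records: two of three records say San Jose, one says Los Angeles. The sketch below shows that reading only; the slide does not say how the ETL step assigns probabilities, and `value_distribution` is a hypothetical helper.

```python
from collections import Counter

def value_distribution(candidates):
    """Relative frequency of each candidate value among conflicting records."""
    counts = Counter(candidates)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

# The three source records for John Smith on the slide.
cities = ["San Jose", "San Jose", "Los Angeles"]
dist = value_distribution(cities)
print(dist)  # San Jose gets 2/3 and Los Angeles 1/3 (shown as 0.66 and 0.33 on the slide)
```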
Data-Warehouse Uncertainty – Cont’d
• Measurement uncertainty
  – Sensor reading: Sensor_ID = S23, Temp (F) = 78.32
  – [Figure: density f(t) centered near 78.32]
• Event-time uncertainty
  – System monitor: Event = Buffer overflow, Time = 10/17/2007:18:20:02
  – [Figure: density f(t) over the event time t]
Real-World Challenges with Data-Warehouse Uncertainty
• People don’t like to admit that it exists!
  – Retailers view uncertainty as failure of security, supply chain management
    • IBM research relationship manager for retail
  – Law enforcement
    • Photo ID in meth dealer trial
  – Scientists pretend data is perfect: uncertainty undermines results
    • Hans-Joachim Lenz
  – Database vendors
    • Data “cleaning” products
• Data warehouse may not even exist!
  – Ex: cancer data at medical center
  – Ex: tomato soup supply chain data
Stochastic Predictive Analytics on Big Data
• Uncertain data describes future or hypothetical events
  – Based on complex, fine-grained stochastic model over big data
  – Minimizes denial problem
• Intense recent interest in “business analytics” driven by
  – Need for low risk, quick payback projects (flexibility, low cost, fine data granularity)
  – Technical advances
    • Cloud computing
    • Software as a Service (SaaS)
    • Next generation tools, portals, visualization
• Often with a spreadsheet front end
  – $8 Billion of such tools [Gnatovich06]
  – IBM services pricing
• Lots of prototype activity
  – Fox/GreenPlum [Cohen09 MAD analytics paper]
  – VISA/IBM [Das10 SIGMOD paper]
Ex. 1: Portfolio Values

Customer
CustID       OptionID   NumShares   …
John Smith   23         50          …
…            …          …           …

EuroCallOptions
OptionID   InitVal   …   StrikeP   OVal
23         $2.35     …   $4.00     ?
…          …         …   …         …

SELECT SUM (c.NumShares * o.Val)
FROM Customer c, EuroCallOptions o
WHERE c.OptionID = o.OptionID AND c.CustType = 'Institutional'

OVal: option value one month from now (exercise date)

Modified Black-Scholes model for European call option:
$$ \mathrm{OVal} = \max\bigl(V(t_{\mathrm{final}}) - S,\ 0\bigr), \qquad dV = rV\,dt + a(V)\,V\,dW $$

Simulation approximation (Euler approach):
$$ V(t + \Delta t) = V(t) + rV(t)\,\Delta t + a\bigl(V(t)\bigr)\,V(t)\,\sqrt{\Delta t}\;Z_j $$
where Z_j is a sample from a Normal dist’n.

Also CMOs, etc.
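The Euler approximation for the option value can be sketched as follows. This is a minimal illustration, not the code used in the talk: the interest rate, the number of time steps, and the constant volatility function `constant_vol` are assumed values, and Z is taken to be standard normal. Only InitVal ($2.35) and StrikeP ($4.00) come from the slide.

```python
import math
import random

def simulate_oval(v0, strike, r, a, t_final, n_steps, rng):
    """One Monte Carlo sample of OVal = max(V(t_final) - S, 0) via Euler steps."""
    dt = t_final / n_steps
    v = v0
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)  # Z: standard normal sample
        # Euler update: V(t+dt) = V(t) + r V(t) dt + a(V(t)) V(t) sqrt(dt) Z
        v = v + r * v * dt + a(v) * v * math.sqrt(dt) * z
    return max(v - strike, 0.0)

def constant_vol(v):
    """Assumed volatility function a(V); a constant gives plain Black-Scholes."""
    return 0.8

rng = random.Random(42)
samples = [
    simulate_oval(v0=2.35, strike=4.00, r=0.05, a=constant_vol,
                  t_final=1.0 / 12.0, n_steps=30, rng=rng)
    for _ in range(1000)
]
print(sum(samples) / len(samples))  # Monte Carlo estimate of the expected OVal
```

In the query on the slide, each sampled OVal would be multiplied by NumShares and summed over the selected customers, once per Monte Carlo repetition.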
Ex. 2: Pricing Decisions
[Figure: demand-versus-price curves. Data for all customers gives the global demand distribution (prior); Bayes’ theorem combines it with data for one customer to give the individual demand distribution (posterior).]

CustID     Unit Price   Order Amount
J. Smith   $10.20       500
…          …            …

• Can analyze arbitrary dynamically-defined customer segments when determining effect of price increase
Ex. 3: Individual Click Behavior (EBay)
[Figure: two Markov chains over pages p1 to p4. Click data for all EBay customers gives the global Markov model distribution (Dirichlet prior), with transition probabilities x13, x14, x24, x32, x34. Data for one customer gives the individual Markov model distribution (posterior), with transition probabilities y13, y14, y24, y32, y34.]
• Can analyze arbitrary dynamic customer segments when determining effect of changing EBay pages
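A Dirichlet prior on each row of a Markov transition matrix is conjugate to observed transition counts, so the posterior mean is just prior pseudo-counts plus observed counts, normalized per row. The sketch below illustrates that update under assumed numbers: the pseudo-counts and the click path are invented, and `posterior_transition_probs` is a hypothetical helper, not the model used in the talk.

```python
from collections import Counter

def posterior_transition_probs(prior_pseudocounts, clicks):
    """Posterior-mean transition probabilities under row-wise Dirichlet priors.

    prior_pseudocounts: {(from_page, to_page): alpha} taken from the global model.
    clicks: ordered list of pages visited by one customer.
    Transitions with no prior pseudo-count are ignored in this sketch.
    """
    observed = Counter(zip(clicks, clicks[1:]))
    posterior = {}
    for source in {i for (i, _) in prior_pseudocounts}:
        row = {
            j: alpha + observed[(i, j)]
            for (i, j), alpha in prior_pseudocounts.items()
            if i == source
        }
        total = sum(row.values())
        for j, weight in row.items():
            posterior[(source, j)] = weight / total
    return posterior

# Edges as drawn on the slide (p1->p3, p1->p4, p2->p4, p3->p2, p3->p4);
# the pseudo-count values are assumptions.
prior = {("p1", "p3"): 2, ("p1", "p4"): 2, ("p2", "p4"): 1,
         ("p3", "p2"): 1, ("p3", "p4"): 3}
post = posterior_transition_probs(prior, ["p1", "p3", "p2", "p4"])
print(post[("p1", "p3")])  # (2 + 1) / (2 + 1 + 2) = 0.6
```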
Ex. 4: Clinic-Capacity Risk
[Figure: model pipeline. Inputs: medical data for all customers, pharmacy data for all customers. Models: Cox hazard-rate disease model, stochastic dosage model, clinic-resource demand model.]

CustID       Time period   Resource needed
Jane Smith   June-Sept     ?
…            …             …
MCDB: Improvement of Traditional Analytics Workflow
[Figure: traditional workflow. Data reduction; analyst (PhD) develops model in Arena, R, Matlab, …; model fitting; model application & querying.]
• Data extraction slow and bug-prone
• Hard to re-link model results to DB
• Only coarse-grained modeling
• Hard to deal with data updates
• No encapsulation for user
• Sensitivity, what-if analysis are hard
Goal: Integrate model with database
Where do the probabilities come from?
From stochastic predictive models over big data
Who is going to use this stuff in the real world?
Key Driver: Risk Management
• Ex: Projected sales under micromarketing campaign
    SELECT SUM (s.amount)
    FROM SALES s, CUST c
    WHERE s.ID = c.ID AND c.city = 'Los Angeles'
• Ex: ERP
  – # OS experts for help desk
  – Demand projected from historical text data (2x uncertainty)
  – Provide principled safety factor
• Regulatory pressure
  – Basel II, Solvency II
• Business pressure
  – Ex.: Energy Risk Professionals
[Figure: query-result distribution (probability vs. total LA sales, expected answer marked) and loss distribution (probability vs. loss, expected loss and 5% VaR marked).]
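Given Monte Carlo samples of a loss, the two quantities marked on the loss distribution can be estimated directly. The sketch below assumes the 5% VaR is the loss level exceeded in at most 5% of samples (the 95th percentile); the function name and the toy loss sample are illustrative only.

```python
import math
import statistics

def expected_loss_and_var(losses, tail_prob=0.05):
    """Return (expected loss, VaR), where at most tail_prob of samples exceed VaR."""
    ordered = sorted(losses)
    n = len(ordered)
    # Rank of the (1 - tail_prob) quantile; rounding guards against float noise.
    rank = math.ceil(round((1.0 - tail_prob) * n, 9))
    idx = max(0, min(n - 1, rank - 1))
    return statistics.fmean(ordered), ordered[idx]

# Toy loss sample: the values 1, 2, ..., 100.
losses = list(range(1, 101))
exp_loss, var_5 = expected_loss_and_var(losses)
print(exp_loss, var_5)  # 50.5 95
```

The gap between the expected loss and the VaR is what a "principled safety factor" has to cover.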
Challenge: Decision-makers’ Poor Intuition About Risk
• Flaw of averages (weak form): mean correct, variance ignored
• Flaw of averages (strong form): wrong value of mean, f(E[X]) ≠ E[f(X)]
• Sam Savage’s book (why we underestimate risk)
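A two-point example makes the strong form concrete. The distribution and the function f below are chosen only for illustration: X is 0 or 10 with equal probability, and f(x) = max(x − 5, 0) is the overload above a capacity of 5.

$$ E[X] = \tfrac12 \cdot 0 + \tfrac12 \cdot 10 = 5, \qquad f(E[X]) = \max(5 - 5,\ 0) = 0 $$
$$ E[f(X)] = \tfrac12 \max(0 - 5,\ 0) + \tfrac12 \max(10 - 5,\ 0) = 2.5 $$

Plugging the average input into f predicts no overload at all, while the average overload is 2.5.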
Examples
• Red River (ND) flooding: “Expected to crest at 50 feet”
• Perishable Inventory (Red Lobster)
• U.S. accounting standards (FASB)
• Project completion time: 10 parallel tasks, E[T_i] = 6 mo.
• Data cleansing
• Machine learning
• Trio agg. paper (MUD 2008)
• Basic probability
[Figure: cost ($0 to $800) versus demand (0 to 10), with stock = E[demand] = 5.]
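The project-completion example can be checked by simulation. The slide gives only the number of parallel tasks (10) and the mean duration (6 months); the sketch below additionally assumes independent, exponentially distributed task durations, which is an assumption made for illustration.

```python
import random
import statistics

def project_duration(rng, n_tasks=10, mean_months=6.0):
    """Project finishes when the slowest of n_tasks parallel tasks finishes.

    Assumption: task durations are independent and exponential with the given mean.
    """
    return max(rng.expovariate(1.0 / mean_months) for _ in range(n_tasks))

rng = random.Random(1)
durations = [project_duration(rng) for _ in range(20000)]
avg_duration = statistics.fmean(durations)
share_late = sum(d > 6.0 for d in durations) / len(durations)
print(avg_duration, share_late)
```

Under this assumption the expected completion time is 6 × (1 + 1/2 + … + 1/10) ≈ 17.6 months, roughly three times the 6 months obtained by plugging in the average task duration.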
Probability Management and Interactive Spreadsheets
• DIST 1.1 standard
  – DIST = distribution string
  – IID Monte Carlo (multivariate) samples
  – Compressed, with metadata
• Ensures correct, coherent risk computations throughout enterprise and beyond
  – E.g., Royal Dutch Shell
• “Electricity network” for probability
  – Royal Dutch Shell, Merck Pharmaceutical, Oracle, Wells Fargo Bank, Bessemer Trust, and IBM
• DISTs can be manipulated like numbers
  – Facilitates interactive spreadsheets (demo)
[Figure callout: audit seal of approval]
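The idea that distributions "can be manipulated like numbers" can be sketched with a small class that stores a fixed vector of Monte Carlo samples and applies arithmetic trial by trial. This is not the DIST 1.1 format (no compression or metadata); the class `Dist`, its methods, and the price and demand numbers are hypothetical.

```python
import operator
import random
import statistics

class Dist:
    """Hypothetical stand-in for a distribution string: a vector of samples."""

    def __init__(self, samples):
        self.samples = list(samples)

    def _combine(self, other, op):
        # Trial-by-trial arithmetic keeps dependent quantities coherent.
        if isinstance(other, Dist):
            if len(other.samples) != len(self.samples):
                raise ValueError("sample vectors must have equal length")
            return Dist(op(a, b) for a, b in zip(self.samples, other.samples))
        return Dist(op(a, other) for a in self.samples)

    def __add__(self, other):
        return self._combine(other, operator.add)

    def __sub__(self, other):
        return self._combine(other, operator.sub)

    def __mul__(self, other):
        return self._combine(other, operator.mul)

    def mean(self):
        return statistics.fmean(self.samples)

    def prob(self, predicate):
        """Fraction of trials in which predicate(sample) holds."""
        return sum(1 for s in self.samples if predicate(s)) / len(self.samples)

rng = random.Random(7)
n_trials = 1000
price = Dist(rng.uniform(9.0, 11.0) for _ in range(n_trials))   # assumed inputs
demand = Dist(rng.gauss(500.0, 50.0) for _ in range(n_trials))
revenue = price * demand            # reads like spreadsheet arithmetic
profit = revenue - 4500.0
print(profit.mean(), profit.prob(lambda x: x < 0.0))
```

Because every derived quantity reuses the same trials, `revenue - revenue` is exactly zero in each trial, which is the kind of coherence a shared sample vector is meant to guarantee.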
Demo 1
Demo 2