
SQL + NoSQL + NewSQL + Realtime for Investment Banks



  1. SQL + NoSQL + NewSQL + Realtime for Investment Banks
     - Enterprise Data Problems in Investment Banks
     - "BigData" History and Trend: Driven by Google
     - CAP Theorem for Distributed Computer System
     - Open Source Building Blocks: Hadoop, Solr, Storm...
     - Hypothetical Solution using Lambda Architecture
     - Where "BigData" Industry is Going?

     Charles Cai, Ashwani Roy, 8 March 2013

  2. Presenter: Charles Cai
     - Charles Cai makes a living by designing and implementing trading and risk systems for investment banks.
     - Currently a Chief Front Office Technical Architect in a global energy trading firm.
       - Twitter: @caidong
       - LinkedIn: charlescai

  3. Presenter: Ashwani Roy
     - Ashwani Roy: Masters in Finance student at London Business School and VP at a Tier 1 investment bank.
     - Loves to mix programming and applied mathematics to solve difficult problems in investment banking.
       - Twitter: @Ashwani_Roy
       - LinkedIn: ashwaniroy


  5. Why the Finance Industry Should Care
     - We care because of
       - Compliance requirements
       - Risk management
       - Pricing
       - Rise of machines (e-commerce)
       - Cost cutting
       - BTW: Twitter is also part of market data

  6. Sample Interest Model / Simulations

  7. A quick Monte Carlo Demo
     - Demo: computing this is functional

     Some Terminology
     - PV = present value = cash flows discounted to current time
     - Delta = change in price / change in interest rate
     - Gamma, Vega, Rho, Theta, Vanna ... and other Greeks
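
The PV and Delta definitions above can be sketched in a few lines of Python. This is only an illustration, not the demo from the talk: the flat discount rate, the 3-year bond, the normally distributed rate shocks and all function names are assumptions made for the example.

```python
import random

def present_value(cash_flows, rate):
    """PV: discount (time_in_years, amount) cash flows at a flat annual rate."""
    return sum(amount / (1.0 + rate) ** t for t, amount in cash_flows)

def delta(cash_flows, rate, bump=0.0001):
    """Delta: change in price / change in interest rate (forward difference)."""
    bumped = present_value(cash_flows, rate + bump)
    return (bumped - present_value(cash_flows, rate)) / bump

def simulate_pvs(cash_flows, base_rate, vol, n_paths, seed=42):
    """Monte Carlo: revalue the cash flows under randomly shocked rates."""
    rng = random.Random(seed)
    return [
        present_value(cash_flows, base_rate + rng.gauss(0.0, vol))
        for _ in range(n_paths)
    ]

# Hypothetical trade: 3-year bond, 5% annual coupon, notional 100.
bond = [(1, 5.0), (2, 5.0), (3, 105.0)]

pv = present_value(bond, 0.05)        # priced at par when rate == coupon
bond_delta = delta(bond, 0.05)        # negative: PV falls as rates rise
simulated = simulate_pvs(bond, 0.05, vol=0.01, n_paths=1000)
mean_pv = sum(simulated) / len(simulated)
```

Each pricing call is a pure function of its inputs, which is what makes the simulation easy to spread across many workers.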

  8. Monte Carlo Simulations: Results
     - <results> = func<I,j,k……
     - Parallelize computation with mappers
     - Save results and run reducers
     - Sample results:
       - [trade: 1, curveid: Orig, PV: 100, Delta: 200] (to OLAP)
       - [trade: 1, curveid: Sim1, PV: 100, Delta: 200] (big data)
       - [trade: 1, curveid: Sim2, PV: 99, Delta: 220] (big data)
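
A minimal sketch of the mapper/reducer split described above, in plain Python. The record layout follows the sample results on the slide; the flat-rate "curves", the worst-case reducer and every function name are assumptions for illustration. The built-in `map` stands in for a real cluster of mappers.

```python
def price(trade, curve):
    """Mapper: value one trade on one curve and emit a result record."""
    curve_id, rate = curve
    pv = sum(amount / (1.0 + rate) ** t for t, amount in trade["cash_flows"])
    return {"trade": trade["id"], "curveid": curve_id, "PV": pv}

def run_mappers(trades, curves):
    """Fan out every (trade, curve) pair; each pair is priced independently."""
    pairs = [(trade, curve) for trade in trades for curve in curves]
    return list(map(lambda pair: price(*pair), pairs))

def split_results(records):
    """Route original-curve results to OLAP and simulated ones to big data."""
    olap = [r for r in records if r["curveid"] == "Orig"]
    big_data = [r for r in records if r["curveid"] != "Orig"]
    return olap, big_data

def reduce_worst_pv(records):
    """Reducer: lowest PV per trade across the given records."""
    worst = {}
    for record in records:
        trade_id = record["trade"]
        worst[trade_id] = min(worst.get(trade_id, record["PV"]), record["PV"])
    return worst

# Hypothetical inputs: one trade, one original curve, two simulated curves.
trades = [{"id": 1, "cash_flows": [(1, 5.0), (2, 5.0), (3, 105.0)]}]
curves = [("Orig", 0.05), ("Sim1", 0.045), ("Sim2", 0.055)]

records = run_mappers(trades, curves)
olap, big_data = split_results(records)
worst_case = reduce_worst_pv(big_data)
```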

  9. Compliance
     - Dodd-Frank requires >= five years of records
     - Fast disaster recovery requirements (tape backups are not acceptable)
     - All Bloomberg and other chats must be saved in a quickly reportable form
     - ... many more in Basel 3 and the Dodd-Frank Act

     You need to:
     - get chats for AshwaniRoy@bloomberg.net and ashwaniR@reuters.net
     - from the 5 years of Bloomberg and Reuters logs of a global investment bank, roughly 1 TB (assume 1 MB/day/trader * 220 trading days * 1000 traders * 5 years)
     - for all EURUSD swaps only
     - ... additional filters and aggregation requirements
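
The sizing estimate and the chat filter above can be checked with a short sketch. The volume figures are the slide's own assumptions; the record layout (`user`, `text`), the sample messages and the `filter_chats` helper are hypothetical.

```python
# Sizing: all inputs are the slide's stated assumptions, not measurements.
MB_PER_DAY_PER_TRADER = 1
TRADING_DAYS_PER_YEAR = 220
TRADERS = 1000
YEARS = 5

total_mb = MB_PER_DAY_PER_TRADER * TRADING_DAYS_PER_YEAR * TRADERS * YEARS
total_tb = total_mb / 1_000_000  # decimal units: 1 TB = 1,000,000 MB

def filter_chats(chats, users, keywords):
    """Keep chats sent by one of `users` whose text mentions every keyword."""
    wanted = {user.lower() for user in users}
    return [
        chat for chat in chats
        if chat["user"].lower() in wanted
        and all(word.lower() in chat["text"].lower() for word in keywords)
    ]

# Hypothetical log records.
chats = [
    {"user": "AshwaniRoy@bloomberg.net", "text": "Quote me a 5y EURUSD swap"},
    {"user": "AshwaniRoy@bloomberg.net", "text": "Lunch at 1?"},
    {"user": "someone.else@bloomberg.net", "text": "EURUSD swap axes today"},
    {"user": "ashwaniR@reuters.net", "text": "eurusd SWAP confirmed"},
]

hits = filter_chats(
    chats,
    users=["AshwaniRoy@bloomberg.net", "ashwaniR@reuters.net"],
    keywords=["EURUSD", "swap"],
)
```

The product comes to 1,100,000 MB, or about 1.1 TB, which matches the slide's round figure. A linear scan like this is fine for a sample but not for the full log, which is the motivation for the indexing and batch tools in the later slides.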

  10. Big Data Industry History: Google’s Papers

  11. Google’s Big Data Papers: 2003 - 2006
      - GFS: Google File System (2003)
        - Distributed file system
        - 3 x copies
        - Commodity machines
        - Colossus (2012)
      - MapReduce (2004)
        - Input → Map → Partition → Compare → Shuffle → Sort → Reduce → Output
      - BigTable (2006)
        - Distributed key-value, column-family based database
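
The MapReduce flow listed above (map, shuffle, sort, reduce) can be imitated in memory with a word count. This is a single-process sketch for intuition only, not Hadoop code; the function names and sample lines are made up for the example.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    """Map: turn each input record into zero or more (key, value) pairs."""
    return [pair for record in records for pair in mapper(record)]

def shuffle_and_sort(pairs):
    """Shuffle/sort: order pairs by key and group the values for each key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]

def reduce_phase(grouped, reducer):
    """Reduce: collapse the list of values for each key into one output."""
    return {key: reducer(key, values) for key, values in grouped}

def word_count_mapper(line):
    for word in line.lower().split():
        yield word, 1

def sum_reducer(key, values):
    return sum(values)

lines = ["hadoop storm solr", "storm hadoop", "hadoop"]
counts = reduce_phase(
    shuffle_and_sort(map_phase(lines, word_count_mapper)),
    sum_reducer,
)
```

In a real cluster the map and reduce phases run on many machines and the shuffle moves data over the network; the programming model stays the same.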

  12. Hadoop Distributed File System (HDFS)
      - http://ecomcanada.files.wordpress.com/2012/11/hadoop-architecture.png

  13. Google’s MapReduce Programming Model

  14. Apache HBase: Column Family Distributed K-V Store
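
As a rough mental model of a column-family key-value store, a cell is addressed by row key, column family and column qualifier, and holds timestamped versions. The toy class below illustrates only that data model; it is not the HBase client API, and the class, method names and sample data are hypothetical.

```python
from collections import defaultdict

class ColumnFamilyTable:
    """Toy model: row key -> (family, qualifier) -> list of (timestamp, value)."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed up front
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, family, qualifier, value, timestamp):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][(family, qualifier)].append((timestamp, value))

    def get(self, row_key, family, qualifier):
        """Return the newest version of a cell, or None if it is absent."""
        versions = self.rows.get(row_key, {}).get((family, qualifier), [])
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def scan(self, prefix):
        """Row keys are kept in sorted order, so prefix scans are natural."""
        return sorted(key for key in self.rows if key.startswith(prefix))

# Hypothetical usage: tick data keyed by symbol and date.
table = ColumnFamilyTable(families=["price", "meta"])
table.put("EURUSD#2013-03-08", "price", "close", 1.2990, timestamp=1)
table.put("EURUSD#2013-03-08", "price", "close", 1.3003, timestamp=2)
table.put("EURUSD#2013-03-07", "price", "close", 1.3105, timestamp=1)
table.put("USDJPY#2013-03-08", "price", "close", 96.0, timestamp=1)

latest_close = table.get("EURUSD#2013-03-08", "price", "close")
eurusd_rows = table.scan("EURUSD#")
```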

  15. Google’s Big Data Papers 2: 2010 - now
      - Percolator (2010)
        - Incremental update/compute
        - Built on BigTable
        - Adds transactions, locks, notifications
        - SPFs: “Stream Processing Frameworks” + underlying database
      - Dremel (2010)
        - Online analytics and visualization
        - SQL-like language for structured data
        - Each row is a JSON object in protobuf format
        - Column based
      - Pregel (2010)
        - Scalable graph computing
        - Worker threads → nodes → parallel “superstep” → messages → nodes → Aggregator/Combiners (global statistics)
        - PageRank, shortest path, bipartite matching
      - Related systems: Spanner (2012), BigQuery, F1, Impala, Microsoft Trinity, Tez/Stinger
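
The Pregel "superstep" idea can be sketched with single-source shortest paths, one of the use cases named above. In each superstep every vertex that received messages takes the best one, updates its state if it improved, and sends messages to its neighbours; the run ends when no messages remain. This is a single-process illustration under assumed names and a made-up graph, not Pregel itself.

```python
import math

def pregel_shortest_paths(graph, source):
    """graph: vertex -> {neighbour: positive edge weight}. Returns distances."""
    distance = {vertex: math.inf for vertex in graph}
    inbox = {source: [0.0]}  # messages waiting for the next superstep

    while inbox:  # one loop iteration == one superstep
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                distance[vertex] = best
                # Only vertices whose state improved send new messages.
                for neighbour, weight in graph[vertex].items():
                    outbox.setdefault(neighbour, []).append(best + weight)
        inbox = outbox
    return distance

# Hypothetical weighted graph; every vertex appears as a key.
graph = {
    "A": {"B": 1.0, "C": 4.0},
    "B": {"C": 2.0, "D": 5.0},
    "C": {"D": 1.0},
    "D": {},
}
distances = pregel_shortest_paths(graph, "A")
```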

  16. Unstructured Data: Index/Search Engine
      - GitHub Code Search: 17 TB

  17. Apache Lucene/Solr
      - Open source indexing and search engine
      - 4,000+ enterprise users
        - IBM, HP, Cisco
        - Apple, LinkedIn
        - Wikipedia
        - CNet, Sky
        - Twitter
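
Search engines of this kind are built around an inverted index, a map from each term to the documents that contain it. The sketch below shows the idea in plain Python; it is not the Lucene or Solr API, and the tokenizer, function names and sample documents are assumptions.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Inverted index: term -> set of ids of documents containing the term."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND query: ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical chat snippets.
documents = {
    1: "EURUSD swap traded at 1.30",
    2: "USDJPY swap quote",
    3: "EURUSD spot quote",
}
index = build_index(documents)
eurusd_swaps = search(index, "eurusd swap")
quotes = search(index, "quote")
```

Lookups touch only the postings for the query terms, which is why this structure suits the chat-retrieval requirement far better than scanning raw logs.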

  18. What’s Next for Hadoop? Real-time! Nathan Marz

  19. Some more use cases
      - Save money to save your jobs
      - Save money so your firm can do more
      - E-commerce is the norm...
      - Market sentiment analysis cannot rely on “Bloomberg's sentiment analysis” alone
      - ...

  20. “Lambda Architecture”, Nathan Marz, BackType/Twitter
      - query = func (data, ...)
      - query:
        - Technical analysis ...
        - Alerts ...
        - Join across data sources (e.g. correlation among weather / energy)
        - Curating/cleansing curves ...
        - Derive curves, building models ...
        - Back-testing models ...
        - Visualization of the above!
        - ...
      - func:
        - Excel / VBA
        - Java, C#/F#...
        - MatLab
        - 3rd party ETL tools
        - R
        - ...
      - data:
        - Real-time ticks, events ...
        - Historical (all history data points)
        - Curated/cleansed curves ...
        - Derived curves ...
        - Back-testing models ...
        - ...

  21. Lambda Architecture: query = func (data, ...) [architecture diagram]
      - Batch Layer (Hadoop): All data (HDFS/HBase), Batch recompute, Precompute Views (MapReduce), Batch views (HDFS/Impala)
      - Speed Layer (Storm): New data stream, Process stream, Realtime increment, Realtime views (Apache HBase)
      - Serving Layer: Merge, QFD 1, QFD 2, ... QFD N
      - Consumers: Excel/Apps, Tableau/Spotfire, Visualization, Alerts, COTS Reporting Tools
      - Ad-hoc Analysis/Writeback: Java/C#, R/Clojure, HIVE/PIG, Talend/3rd party, ...
      - RDBMS/DW + Full-text Search + Graph Database: RDBMS, MDX/DW (T+1), Full-text Search, Graph Database
      - Diagram labels: Acquisition; Access / Centralization / Manipulation; Quality / Access / Manipulation; Automation / Aggregation / Centralization; Metadata / Classification / Curation
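
The three layers can be sketched in a few lines: the batch layer recomputes a view from all immutable data, the speed layer keeps an incremental view of events that arrived since the last batch run, and a query merges the two. Plain Python stands in for Hadoop, Storm and HBase here; the traded-volume view, the names and the sample events are assumptions for illustration.

```python
from collections import Counter

def batch_view(master_dataset):
    """Batch layer: recompute total volume per symbol from all stored data."""
    view = Counter()
    for symbol, quantity in master_dataset:
        view[symbol] += quantity
    return view

class SpeedLayer:
    """Speed layer: incremental view of events not yet seen by a batch run."""

    def __init__(self):
        self.view = Counter()

    def on_event(self, event):
        symbol, quantity = event
        self.view[symbol] += quantity

def query(symbol, batch, realtime):
    """Serving layer: query = func(data), merging batch and realtime views."""
    return batch[symbol] + realtime[symbol]

# Immutable master dataset: events are only ever appended, never updated.
master_dataset = (("EURUSD", 100), ("EURUSD", 50), ("USDJPY", 70))
batch = batch_view(master_dataset)

# Events that arrive after the batch view was computed.
new_events = (("EURUSD", 25), ("GBPUSD", 10))
speed = SpeedLayer()
for event in new_events:
    speed.on_event(event)

eurusd_volume = query("EURUSD", batch, speed.view)
gbpusd_volume = query("GBPUSD", batch, speed.view)

# The next batch run absorbs the new events and the speed view can be dropped.
next_batch = batch_view(master_dataset + new_events)
```

Because the master dataset is never modified, a bug in either view can be repaired by fixing the code and recomputing from the raw data.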

  22. Online resources and alternative stacks
      - An Introduction to Data Science.PDF: free e-book on Data Science with R under Creative Commons Licenses
      - Berkeley Data Analytics Stack (Open Source: Mesos for cluster management, Spark/Streaming for cluster computing, Shark-SQL/DW)
      - Learning Statistics with R, Free Big Data Education: Advanced Data Science
      - DataStax Enterprise (Apache C*/Cassandra, Apache Hadoop, Apache Solr…)
      - An example “lambda architecture” for real-time analysis of hashtags using Trident, Hadoop and Splout SQL
      - Nathan Marz (BackType, acquired by Twitter): Big Data Lambda Architecture
      - Open source clustered Lucene: elasticsearch, used by GitHub (17 TB code)

  23. Distributed Computing System: CAP Theorem
      - Consistency: all nodes see the same data at the same time
      - Availability: a guarantee that every request receives a response about whether it was successful or failed
      - Partition tolerance: the system continues to operate despite arbitrary message loss or failure of part of the system
      - https://github.com/thinkaurelius/titan/wiki/Storage-Backend-Overview
      - http://en.wikipedia.org/wiki/CAP_theorem
      - http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed

  24. “Lambda Architecture”: Enterprise Data
      - Volume
        - Data size
        - Retention granular level…
      - Velocity
        - Speed of change
        - Speed of reaction
      - Variety
        - Data sources
        - Data formats (structured/semi-/non-structured…)
      - Value
        - Quality of data
        - Ways to improve data quality
        - Discover hidden business insights

  25. “Lambda Architecture”, Nathan Marz, BackType/Twitter
      - Design principles:
        - Human fault-tolerance
        - Immutability
        - Pre-computation
      - Lambda Architecture:
        - Batch Layer
        - Serving Layer
        - Speed Layer
      - Technology stack:
        - Apache Hadoop/HBase/Cloudera Impala
        - Twitter Storm
