3548 Hypothetical Solution using Lambda Architecture Where BigData - - PowerPoint PPT Presentation



slide-1
SLIDE 1

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

CHARLES CAI, ASHWANI ROY

8 March 2013


¨ Enterprise Data Problems in Investment Banks
¨ “BigData” History and Trend – Driven by Google
¨ CAP Theorem for Distributed Computer System
¨ Open Source Building Blocks: Hadoop, Solr, Storm..
¨ Hypothetical Solution using Lambda Architecture
¨ Where is the “BigData” Industry Going?


slide-2
SLIDE 2

Presenter: Charles Cai

¨ Charles Cai makes a living by designing and implementing trading and risk systems for investment banks.
¨ Currently a Chief Front Office Technical Architect in a global energy trading firm.

¤ Twitter: @caidong
¤ Linkedin: charlescai

slide-3
SLIDE 3

Presenter: Ashwani Roy

¨ Ashwani Roy – Masters in Finance student at London Business School and VP at a Tier 1 investment bank.
¨ Loves to mix programming and applied mathematics to solve difficult problems in investment banking.

¤ Twitter: @Ashwani_Roy
¤ Linkedin: ashwaniroy

slide-4
SLIDE 4


slide-5
SLIDE 5

Why the Finance Industry Should Care

¨ We care because of

¤ Compliance requirements
¤ Risk Management
¤ Pricing
¤ Rise of the Machines (E-commerce)
¤ Cost Cutting
¤ BTW: Twitter is also part of Market Data

slide-6
SLIDE 6

Sample Interest Model / Simulations

slide-7
SLIDE 7

A quick Monte Carlo Demo

¨ Demo – Computing this is functional

Some Terminology

¨ PV = present value = cash flows discounted to current time
¨ Delta = change in price / change in interest rate
¨ Gamma .. Vega .. Rho .. Theta .. Vanna … and other Greeks
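The two terms above can be illustrated with a minimal sketch. This is not the presenters' demo: the flat-rate discounting, the normally distributed rate shocks, the cash flows, and every function name here are assumptions chosen only to show PV and a bump-and-reprice Delta computed as pure functions of their inputs.

```python
import math
import random

def present_value(cash_flows, rate):
    """Discount (time_in_years, amount) cash flows to today at a flat rate."""
    return sum(amount * math.exp(-rate * t) for t, amount in cash_flows)

def delta(cash_flows, rate, bump=0.0001):
    """Change in price per unit change in interest rate (central difference)."""
    up = present_value(cash_flows, rate + bump)
    down = present_value(cash_flows, rate - bump)
    return (up - down) / (2 * bump)

def simulate(cash_flows, base_rate, vol, n_paths, seed=42):
    """Reprice the same trade under n_paths randomly shocked rate scenarios."""
    rng = random.Random(seed)
    results = []
    for i in range(n_paths):
        rate = base_rate + rng.gauss(0.0, vol)  # assumed shock model
        results.append({
            "curveid": f"Sim{i + 1}",
            "PV": present_value(cash_flows, rate),
            "Delta": delta(cash_flows, rate),
        })
    return results

# Hypothetical trade: a 5.0 coupon in one year, 105.0 in two years.
cash_flows = [(1.0, 5.0), (2.0, 105.0)]
base = {"curveid": "Orig",
        "PV": present_value(cash_flows, 0.03),
        "Delta": delta(cash_flows, 0.03)}
sims = simulate(cash_flows, base_rate=0.03, vol=0.005, n_paths=1000)
mean_pv = sum(r["PV"] for r in sims) / len(sims)
```

Because each scenario depends only on its own inputs, the loop in `simulate` can be split across workers without coordination.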
slide-8
SLIDE 8

Monte Carlo Simulations - Results

¨ <results> = func<i, j, k, …>
¨ Parallelize computation with mappers
¨ Save results and run reducers
¨ [ [trade: 1 curveid: Orig PV:100 Delta:200] {to OLAP}
    [trade: 1 curveid: Sim1 PV:100 Delta:200] {big data}
    [trade: 1 curveid: Sim2 PV:99 Delta:220] {big data} ]
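A rough sketch of the mapper/reducer split described above, run in a single process with a thread pool standing in for a cluster. The scenario tuples, the linear pricing rule, and the choice of aggregates (worst PV, mean Delta) are assumptions for illustration; they are tuned only so that trade 1 reproduces the sample records on the slide.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Hypothetical inputs: (trade id, curve id, rate shift in basis points).
SCENARIOS = [
    (1, "Orig", 0), (1, "Sim1", 0), (1, "Sim2", 10),
    (2, "Orig", 0), (2, "Sim1", -10),
]

def mapper(scenario):
    """Price one trade on one curve. The linear rule is a stand-in pricer."""
    trade, curveid, shift_bp = scenario
    pv = 100.0 - 0.1 * shift_bp
    delta = 200.0 + 2.0 * shift_bp
    return trade, {"curveid": curveid, "PV": pv, "Delta": delta}

def reducer(trade, records):
    """Aggregate the simulated curves of one trade into a summary row."""
    sims = [r for r in records if r["curveid"] != "Orig"]
    return {
        "trade": trade,
        "n_sims": len(sims),
        "worst_PV": min(r["PV"] for r in sims),
        "mean_Delta": sum(r["Delta"] for r in sims) / len(sims),
    }

# Map: every scenario is priced independently, so it parallelizes trivially.
with ThreadPoolExecutor(max_workers=4) as pool:
    mapped = list(pool.map(mapper, SCENARIOS))

# Shuffle: group the saved results by trade id.
grouped = defaultdict(list)
for trade, record in mapped:
    grouped[trade].append(record)

# Reduce: one summary per trade.
summary = {trade: reducer(trade, records) for trade, records in grouped.items()}
```

The full per-scenario records are what the slide labels {big data}, while the small per-trade summaries are the kind of result that fits an OLAP store.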

slide-9
SLIDE 9

Compliance

¤ Dodd-Frank requires records to be kept for at least five years
¤ Fast disaster recovery requirements (tape backup is not acceptable)
¤ All Bloomberg and other chats must be saved in a quickly reportable form
¤ … many more in Basel III and the Dodd-Frank Act

You need to

# get chats for AshwaniRoy@bloomberg.net and ashwaniR@reuters.net
# from the 5 years of Bloomberg and Reuters logs of a global investment bank
# of about 1 TB (assume 1 MB/day/trader * 220 trading days * 1000 traders * 5 years)
# for all EURUSD swaps only

… additional filters and aggregation requirements
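The sizing estimate and the filter can be written out as a short sketch. The volume assumptions are the slide's; the record layout (`user`, `product`, `text`) and the sample rows are hypothetical, since the slides do not describe how chats are stored.

```python
# Sizing assumptions taken from the slide.
MB_PER_TRADER_PER_DAY = 1
TRADING_DAYS_PER_YEAR = 220
TRADERS = 1000
YEARS = 5

total_mb = MB_PER_TRADER_PER_DAY * TRADING_DAYS_PER_YEAR * TRADERS * YEARS
total_tb = total_mb / 1_000_000  # decimal units: 1 TB = 1,000,000 MB

# The two accounts named on the slide, compared case-insensitively.
WANTED_USERS = {"ashwaniroy@bloomberg.net", "ashwanir@reuters.net"}

def matches(record):
    """Keep chats by the wanted users that concern EURUSD swaps."""
    return (record["user"].lower() in WANTED_USERS
            and record["product"] == "EURUSD swap")

# Hypothetical records; a real log would be scanned in parallel.
chat_log = [
    {"user": "AshwaniRoy@bloomberg.net", "product": "EURUSD swap", "text": "..."},
    {"user": "ashwaniR@reuters.net", "product": "EURUSD swap", "text": "..."},
    {"user": "AshwaniRoy@bloomberg.net", "product": "USDJPY spot", "text": "..."},
    {"user": "someone@reuters.net", "product": "EURUSD swap", "text": "..."},
]
hits = [record for record in chat_log if matches(record)]
```

The product works out to 1,100,000 MB, or about 1.1 TB, which is the "1 TB" order of magnitude quoted above.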

slide-10
SLIDE 10

Big Data Industry History: Google’s Papers


slide-11
SLIDE 11

Google’s Big Data Papers: 2003 – 2006


GFS – Google File System

  • 2003
  • Distributed file system
  • 3 x copies
  • Commodity machines
  • Colossus (2012)

MapReduce

  • 2004
  • Input → Map → Partition → Compare → Shuffle → Sort → Reduce → Output

BigTable

  • 2006
  • Distributed Key-Value column-family based database
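The MapReduce stages listed above can be traced in a single-process sketch. This is a toy word count, not Hadoop's API: the function names, the byte-sum partitioner, and the two-reducer setup are assumptions made for illustration.

```python
from collections import defaultdict

def map_phase(record):
    """Map: emit a (key, value) pair for every word in one input record."""
    for word in record.split():
        yield word.lower(), 1

def partition(key, n_reducers):
    """Partition: pick a reducer for a key (deterministic byte sum)."""
    return sum(key.encode("utf-8")) % n_reducers

def reduce_phase(key, values):
    """Reduce: combine all values that share one key."""
    return sum(values)

def run_job(records, n_reducers=2):
    # Shuffle: route each mapped pair to its reducer and group by key.
    partitions = [defaultdict(list) for _ in range(n_reducers)]
    for record in records:
        for key, value in map_phase(record):
            partitions[partition(key, n_reducers)][key].append(value)
    # Sort (using the default key comparison), then reduce per partition.
    output = {}
    for part in partitions:
        for key in sorted(part):
            output[key] = reduce_phase(key, part[key])
    return output

counts = run_job(["EUR USD swap", "USD JPY swap"])
```

On a cluster each partition would live on a different machine, which is why the partition function must send equal keys to the same reducer.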

slide-12
SLIDE 12

Hadoop Distributed File System (HDFS)

¨ http://ecomcanada.files.wordpress.com/2012/11/hadoop-architecture.png
slide-13
SLIDE 13

Google’s MapReduce Programming Model


slide-14
SLIDE 14

Apache HBase: Column Family Distributed K-V Store
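As a rough illustration of the data model named in this slide title, the sketch below keeps rows sorted by key, groups columns into families declared up front, and versions each cell by timestamp. The class `ColumnFamilyStore` and its methods are hypothetical and are not the HBase client API.

```python
import time

class ColumnFamilyStore:
    """Toy in-memory model: row key -> (family, qualifier) -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)  # column families are fixed at creation
        self.rows = {}

    def put(self, row, family, qualifier, value, ts=None):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        ts = time.time_ns() if ts is None else ts
        cells = self.rows.setdefault(row, {})
        cells.setdefault((family, qualifier), {})[ts] = value

    def get(self, row, family, qualifier):
        """Return the newest version of a cell, or None if it is absent."""
        versions = self.rows.get(row, {}).get((family, qualifier))
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield row keys in sorted order, start inclusive, stop exclusive."""
        for row in sorted(self.rows):
            if start_row <= row < stop_row:
                yield row

store = ColumnFamilyStore(families=["risk", "static"])
store.put("trade#0001", "risk", "PV", 100, ts=1)
store.put("trade#0001", "risk", "PV", 99, ts=2)   # newer version of the same cell
store.put("trade#0002", "static", "ccy", "EUR", ts=1)
latest_pv = store.get("trade#0001", "risk", "PV")
trade_rows = list(store.scan("trade#0001", "trade#0002"))
```

Sorted row keys are what make range scans cheap, so key design (here a hypothetical `trade#NNNN` prefix) decides which queries are efficient.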

slide-15
SLIDE 15

Google’s Big Data Papers 2: 2010 - now


Percolator

  • 2010
  • Incremental update/compute
  • built on BigTable
  • Adds transactions, locks, notifications
  • SPFs: “Stream Processing Frameworks” + underlying database

Dremel

  • 2010
  • Online analytics and visualization
  • SQL like language for structured data
  • Each row is a JSON object – in protobuf format
  • Column based
  • Spanner (2012), BigQuery, F1

Pregel

  • 2010
  • Scalable graph computing
  • Worker threads → nodes → parallel “superstep” → messages → nodes → Aggregator/Combiners (global statistics)
  • PageRank, shortest path, bipartite matching

Impala Tez/Stinger

Microsoft Trinity
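The Pregel superstep cycle (nodes receive messages, update, and send new messages) can be sketched for single-source shortest path. This is a sequential toy under stated assumptions: the function name, the edge-list format, and the sample graph are hypothetical, and edge weights are assumed non-negative.

```python
import math

def pregel_shortest_paths(edges, source):
    """edges maps a vertex to a list of (neighbour, weight) pairs."""
    vertices = set(edges) | {n for outs in edges.values() for n, _ in outs}
    dist = {v: math.inf for v in vertices}
    inbox = {source: [0.0]}  # messages to be delivered in the next superstep
    supersteps = 0
    while inbox:  # the computation halts once no messages are in flight
        outbox = {}
        # Within a superstep every vertex could run in parallel.
        for vertex, messages in inbox.items():
            best = min(messages)  # combiner: only the smallest offer matters
            if best < dist[vertex]:
                dist[vertex] = best
                for neighbour, weight in edges.get(vertex, []):
                    outbox.setdefault(neighbour, []).append(best + weight)
        inbox = outbox
        supersteps += 1
    return dist, supersteps

# Hypothetical weighted graph.
graph = {
    "A": [("B", 1.0), ("C", 4.0)],
    "B": [("C", 2.0)],
    "C": [("D", 1.0)],
}
distances, steps = pregel_shortest_paths(graph, "A")
```

A vertex that receives no improving message sends nothing, which plays the role of Pregel's "vote to halt".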

slide-16
SLIDE 16

Unstructured Data: Index/Search Engine

¨ GitHub Code Search: 17 TB

slide-17
SLIDE 17

Apache Lucene/Solr

¨ Open Source Indexing and Search Engine
¨ 4,000+ Enterprise users

¤ IBM, HP, Cisco
¤ Apple, LinkedIn
¤ Wikipedia
¤ CNet, Sky
¤ Twitter
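The core data structure behind an indexing and search engine is the inverted index, sketched below in its simplest form. This is not Lucene's or Solr's API: the `InvertedIndex` class, the tokenizer, and the sample documents are assumptions for illustration, and real engines add scoring, analyzers, and on-disk segments.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(set)  # term -> ids of documents containing it

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term (AND search)."""
        terms = tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)

index = InvertedIndex()
index.add(1, "EURUSD swap quote for 5y")
index.add(2, "USDJPY spot quote")
index.add(3, "Please confirm the EURUSD swap")
swap_docs = index.search("eurusd swap")
```

Looking up a term costs one dictionary access regardless of corpus size, which is why this structure suits searches over years of chat or code.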

slide-18
SLIDE 18

What’s Next for Hadoop? Real-time! Nathan Marz

slide-19
SLIDE 19

Some more use cases

¤ Save money to save your jobs
¤ Save money so your firm can do more
¤ E-commerce is the norm…
¤ Market sentiment analysis cannot rely on “Bloomberg's sentiment analysis” alone
¤ .. Add some more

slide-20
SLIDE 20

“Lambda Architecture” – Nathan Marz, BackType/Twitter

¨ query = func (data, ...)


  • Real-time ticks, events…
  • Historical (all history data points)
  • Curated/cleansed curves…
  • Derived curves…
  • Back-testing models…
  • Technical analysis…
  • Alerts…
  • Join across data sources (e.g. correlation among weather / energy)
  • Curating/cleansing curves…
  • Derive curves, building models…
  • Back-testing models…
  • Visualization of the above!
  • Excel / VBA
  • Java, C#/F#...
  • MatLab
  • 3rd party ETL Tools
  • R
slide-21
SLIDE 21

Lambda Architecture: query = func (data, ...)

[Diagram; recoverable labels:]
¨ Acquisition: new data stream
¨ Batch Layer (Hadoop): all data (HDFS/HBase), batch recompute, precompute views (MapReduce), batch views (HDFS/Impala), QFD 1 … QFD N
¨ Speed Layer (Storm): process stream, realtime increment, realtime views (Apache HBase), QFD 1 … QFD N
¨ Serving Layer: RDBMS/DW + full-text search + graph database, merge, MDX/DW (T+1)
¨ Outputs: Tableau/Spotfire, Excel/apps, COTS reporting tools, alerts, visualization, ad-hoc analysis/writeback (Java/C#, R/Clojure, Hive/Pig, Talend/3rd party, ...)
¨ Other labels: metadata / classification / curation; automation / aggregation / centralization; quality / access / manipulation
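The batch/speed/serving split can be sketched in a few lines. This is a minimal in-memory illustration, not the Hadoop/Storm stack from the diagram: the `LambdaStore` class, its method names, and the "total quantity per symbol" query are assumptions.

```python
from collections import Counter

class LambdaStore:
    def __init__(self):
        self.master = []                # append-only master dataset
        self.batch_view = Counter()     # precomputed from all data, but stale
        self.realtime_view = Counter()  # covers only data since the last batch

    def ingest(self, symbol, qty):
        """New data goes to the master dataset and to the speed layer."""
        self.master.append((symbol, qty))
        self.realtime_view[symbol] += qty

    def run_batch(self):
        """Batch layer: recompute the view from scratch over all data."""
        view = Counter()
        for symbol, qty in self.master:
            view[symbol] += qty
        self.batch_view = view
        # The batch view now covers everything, so the increments expire.
        self.realtime_view = Counter()

    def query(self, symbol):
        """Serving layer: merge the batch view with the realtime view."""
        return self.batch_view[symbol] + self.realtime_view[symbol]

store = LambdaStore()
store.ingest("EURUSD", 100)
store.ingest("EURUSD", 50)
before_batch = store.query("EURUSD")   # answered by the speed layer alone
store.run_batch()
store.ingest("EURUSD", 25)
after_batch = store.query("EURUSD")    # batch view plus the newest increment
```

The realtime view only has to be correct for the short window between batch runs, because each batch recompute replaces it.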

slide-22
SLIDE 22

Online resources and alternative stacks

¨ An Introduction to Data Science.PDF – Free e-book on Data Science with R under Creative Commons Licenses
¨ Berkeley Data Analytics Stack (Open Source: Mesos – cluster management, Spark/Streaming – cluster computing, Shark-SQL/DW)
¨ Learning Statistics with R, Free Big Data Education: Advanced Data Science
¨ DataStax Enterprise (Apache C*/Cassandra, Apache Hadoop, Apache Solr…)
¨ An example “lambda architecture” for real-time analysis of hashtags using Trident, Hadoop and Splout SQL
¨ Nathan Marz (BackType, acquired by Twitter) Big Data Lambda Architecture
¨ Open source clustered Lucene: elasticsearch used by GitHub (17 TB code)

slide-23
SLIDE 23

Distributed Computing System: CAP Theorem

https://github.com/thinkaurelius/titan/wiki/Storage-Backend-Overview
http://en.wikipedia.org/wiki/CAP_theorem
http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed

Consistency

  • all nodes see the same data at the same time

Availability

  • a guarantee that every request receives a response about whether it was successful or failed

Partition tolerance

  • the system continues to operate despite arbitrary message loss or failure of part of the system
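The trade-off between these three properties shows up once a network partition occurs. The toy below models two replicas that either refuse writes during a partition (favouring consistency) or accept them locally (favouring availability). The `TwoNodeStore` class and its "CP"/"AP" modes are hypothetical simplifications, not a model of any particular database.

```python
class Unavailable(Exception):
    """Raised when a request is refused in order to preserve consistency."""

class TwoNodeStore:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.replicas = {"a": {}, "b": {}}
        self.partitioned = False  # True while the replicas cannot talk

    def write(self, node, key, value):
        if not self.partitioned:
            for data in self.replicas.values():  # replicate to both nodes
                data[key] = value
        elif self.mode == "CP":
            raise Unavailable("cannot replicate during a partition")
        else:
            self.replicas[node][key] = value     # accept locally, diverge

    def read(self, node, key):
        return self.replicas[node].get(key)

cp = TwoNodeStore("CP")
cp.write("a", "limit", 10)
cp.partitioned = True
try:
    cp.write("a", "limit", 20)
    cp_rejected = False
except Unavailable:
    cp_rejected = True

ap = TwoNodeStore("AP")
ap.write("a", "limit", 10)
ap.partitioned = True
ap.write("a", "limit", 20)
ap_views = (ap.read("a", "limit"), ap.read("b", "limit"))
```

In the CP store both replicas still agree but a request failed; in the AP store every request succeeded but the replicas now disagree until the partition heals.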
slide-24
SLIDE 24

“Lambda Architecture”: Enterprise Data

Value

  • Quality of data
  • Ways to improve data quality
  • Discover hidden business insights

Variety

  • Data sources
  • Data formats (./semi-/non-structured…)

Velocity

  • Speed of change
  • Speed of reaction

Volume

  • Data size
  • Retention granular level…

slide-25
SLIDE 25

“Lambda Architecture” – Nathan Marz, BackType/Twitter

¨ Design Principles:

¤ Human fault-tolerance
¤ Immutability
¤ Pre-computation

¨ Lambda Architecture:

¤ Batch Layer
¤ Serving Layer
¤ Speed Layer

¨ Technology Stack

¤ Apache Hadoop/HBase/Cloudera Impala
¤ Twitter Storm
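How immutability supports human fault-tolerance can be shown with a small sketch: raw events are never modified, views are derived by pure functions, and a buggy view is repaired by recomputing it. The event layout and both function names are hypothetical.

```python
# Immutable master dataset: raw events are only ever appended, never edited.
EVENTS = (
    ("trade-1", "buy", 100.0),
    ("trade-2", "sell", 40.0),
    ("trade-3", "buy", 15.0),
)

def net_position_buggy(events):
    """First release (human error): forgets that sells reduce the position."""
    return sum(qty for _, _, qty in events)

def net_position_fixed(events):
    """Corrected release: buys add to the position, sells subtract."""
    return sum(qty if side == "buy" else -qty for _, side, qty in events)

# Pre-computation: the view is derived from the raw events.
bad_view = net_position_buggy(EVENTS)

# The bug only corrupted a derived view, so the fix is to recompute it
# from the untouched raw events with the corrected function.
good_view = net_position_fixed(EVENTS)
```

Had the buggy code updated stored positions in place, the original trades would be needed to undo the damage; keeping them immutable makes that recovery routine.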