DRIVING INNOVATION THROUGH DATA
ACCELERATING BIG DATA APPLICATION DEVELOPMENT WITH CASCADING
Supreet Oberoi, VP Field Engineering, Concurrent Inc

GET TO KNOW CONCURRENT
Leader in Application Infrastructure for Big Data
• Building enterprise software to simplify Big Data application development and management
Products and Technology
• CASCADING: Open Source. The most widely used application infrastructure for building Big Data apps, with over 175,000 downloads each month
• DRIVEN: Enterprise data application management for Big Data apps
Proven: Simple, Reliable, Robust
• Thousands of enterprises rely on Concurrent to provide their data application infrastructure.
Founded: 2008
HQ: San Francisco, CA
CEO: Gary Nakamura
CTO, Founder: Chris Wensel
www.concurrentinc.com

ENTERPRISE NEEDS FOR DATA APP INFRASTRUCTURE
• Need reliable, reusable tooling to quickly build and consistently deliver data products
• Need the degrees of freedom to solve problems ranging from simple to complex with existing skill sets
• Need the flexibility to easily adapt an application to meet business needs (latency, scale, SLA), without having to rewrite the application
• Need operational visibility for the entire data application lifecycle

CASCADING - DE-FACTO FRAMEWORK FOR DATA APPS
• Standard for enterprise data app development
• Your programming language of choice
• Cascading applications that run on MapReduce will also run on Apache Tez, Spark, Storm, and …
[Diagram: Cascading Apps (SQL, Clojure, Ruby); System Integration (Mainframe, DB / DW, Data Stores, Hadoop); New Fabrics (Tez, Storm, In-Memory)]

WORD COUNT EXAMPLE WITH CASCADING

    import java.util.Properties;

    import cascading.flow.Flow;
    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexSplitGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;
    import cascading.tuple.Fields;

    public class Main {
      public static void main( String[] args ) {
        String docPath = args[ 0 ];
        String wcPath = args[ 1 ];

        // configuration
        Properties properties = new Properties();
        AppProps.setApplicationJarClass( properties, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

        // integration: create source and sink taps
        Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
        Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

        // processing: specify a regex to split "document" text lines into a token stream
        Fields token = new Fields( "token" );
        Fields text = new Fields( "text" );
        RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
        // only returns "token"
        Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

        // determine the word counts
        Pipe wcPipe = new Pipe( "wc", docPipe );
        wcPipe = new GroupBy( wcPipe, token );
        wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

        // connect the taps, pipes, etc., into a flow definition
        FlowDef flowDef = FlowDef.flowDef()
          .setName( "wc" )
          .addSource( docPipe, docTap )
          .addTailSink( wcPipe, wcTap );

        // scheduling: create the Flow
        Flow wcFlow = flowConnector.connect( flowDef ); // <<-- Unit of Work
        wcFlow.complete(); // <<-- Runs jobs on Cluster
      }
    }

SOME COMMON PATTERNS
• Functions
• Filters
• Joins
  ‣ Inner / Outer / Mixed
  ‣ Asymmetrical / Symmetrical
• Merge (Union)
• Grouping
  ‣ Secondary Sorting
  ‣ Unique (Distinct)
• Aggregations
  ‣ Count, Average, etc.
[Diagram: Pipeline, Split, Join, and Merge topologies built from data, function, and filter nodes]

PLUMBING METAPHOR FOR BUILDING DATA FLOWS
• The Cascading processing model is based on a metaphor of flows based on patterns
[Diagram: a Tuple Stream travels from a Source Tap through a Pipe to a Sink Tap]

CASCADING PROCESSING MODEL TERMINOLOGY
Tuple Stream: Series of tuples (data records)
Fields: Representation of the Tuple Stream, used in operations
Pipe: Applies operations to tuples or groups of tuples
Branch: Pipes linked together under a common Pipe name
Pipe Assembly: An interconnected set of pipe branches
Tap: Source or sink for data
Flow: Pipe assembly with taps
Cascade: Multiple flows grouped together and executed as a single process

TUPLE STREAM
• A Tuple represents a set of values.
• Consider a Tuple the same as a database record where every value is a column in that table.
• A "tuple stream" is a set of Tuple instances passed consecutively through a Pipe assembly.
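
The record analogy can be sketched in plain Java. This is only an illustration under stated assumptions: `TupleSketch`, `valueOf`, the field names, and the sample values are all made up here, and Cascading's own tuple classes are not used.

```java
import java.util.List;

// Illustrative stand-ins only; these are not Cascading classes.
final class TupleSketch {

    // Looks up a value in a tuple by field name, like reading one column of a record.
    static Object valueOf(List<String> fields, List<Object> tuple, String fieldName) {
        int position = fields.indexOf(fieldName);
        if (position < 0 || fields.size() != tuple.size()) {
            throw new IllegalArgumentException("unknown field or mismatched tuple: " + fieldName);
        }
        return tuple.get(position);
    }

    public static void main(String[] args) {
        // Field names play the role of column names (hypothetical sample data).
        List<String> fields = List.of("doc_id", "text");
        // A tuple stream: tuples handed on one after another.
        List<List<Object>> stream = List.of(
            List.<Object>of("doc01", "first sample line"),
            List.<Object>of("doc02", "second sample line"));
        for (List<Object> tuple : stream) {
            System.out.println(valueOf(fields, tuple, "doc_id"));
        }
    }
}
```

Each inner list stands for one Tuple, and the shared `fields` list gives its values their names.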

PIPES CAN BE CHAINED TO PERFORM COMPLEX OPERATIONS
• Pipes control the flow of data, applying operations to each Tuple or to groups of Tuples.
• Pipes work on fields of one or more tuples.
• Pipes let you manage a data flow with operations such as:
  - Grouping
  - Joining
  - Filtering
  - Buffering
  - Aggregating
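
As a rough analogy for chaining, the sketch below strings together a function step, a filter step, and a grouping-plus-aggregation step using only the Java standard library. The class and method names are hypothetical and none of this is the Cascading API. It only shows how small operations compose into a larger one.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java analogy only: none of these names are Cascading classes.
final class PipeChainSketch {

    // Function step: applied to every line, emits zero or more tokens.
    static Stream<String> tokenize(Stream<String> lines) {
        return lines.flatMap(line -> Arrays.stream(line.split("[ \\[\\]\\(\\),.]")));
    }

    // Filter step: discards the empty tokens that the split can leave behind.
    static Stream<String> dropEmpty(Stream<String> tokens) {
        return tokens.filter(token -> !token.isEmpty());
    }

    // Grouping + aggregation step: group on the token, then count each group.
    static Map<String, Long> countByToken(Stream<String> tokens) {
        return tokens.collect(
            Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Stream<String> lines = Stream.of("a rose is a rose", "is it (really) a rose.");
        Map<String, Long> counts = countByToken(dropEmpty(tokenize(lines)));
        System.out.println(counts); // prints {a=3, is=2, it=1, really=1, rose=3}
    }
}
```

Because each step takes a stream and returns a stream (or a final result), steps can be reordered or replaced without touching the others, which is the property the chaining of pipes relies on.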

PIPES CAN BE BRANCHED AND MERGED
• Pipe Assemblies are an interconnected set of pipe branches modeled as a DAG (Directed Acyclic Graph).
• Pipe Assemblies can consist of splits and/or merges.
• Pipe assemblies are specified independently of the data source they are to process.
• For a pipe assembly to be executed, it must be bound to data sources and sinks (at which point it becomes a flow).
DAG: a collection of vertices and directed edges, each edge connecting one vertex to another, such that there is no way to start at some vertex v and follow a sequence of edges that eventually loops back to v again.
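
The DAG definition can be checked mechanically. The sketch below is a generic graph test (Kahn's algorithm) written for illustration. It is not how Cascading's planner is implemented, and `DagSketch`, `isAcyclic`, and the vertex names are assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a generic graph check, not Cascading's planner.
final class DagSketch {

    // Returns true when the directed graph has no cycle (Kahn's algorithm).
    // `edges` maps each vertex to the vertices it points to.
    static boolean isAcyclic(Map<String, List<String>> edges) {
        Map<String, Integer> inDegree = new HashMap<>();
        for (Map.Entry<String, List<String>> entry : edges.entrySet()) {
            inDegree.putIfAbsent(entry.getKey(), 0);
            for (String target : entry.getValue()) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }
        // Start from the vertices that nothing points to.
        Deque<String> ready = new ArrayDeque<>();
        inDegree.forEach((vertex, degree) -> {
            if (degree == 0) {
                ready.add(vertex);
            }
        });

        int visited = 0;
        while (!ready.isEmpty()) {
            String vertex = ready.remove();
            visited++;
            for (String target : edges.getOrDefault(vertex, List.of())) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        // Vertices on a cycle never reach in-degree zero, so they are never visited.
        return visited == inDegree.size();
    }

    public static void main(String[] args) {
        // A split followed by a merge is still acyclic.
        Map<String, List<String>> splitAndMerge = Map.of(
            "source", List.of("left", "right"),
            "left", List.of("merge"),
            "right", List.of("merge"));
        // Two vertices pointing at each other form a loop.
        Map<String, List<String>> loop = Map.of(
            "a", List.of("b"),
            "b", List.of("a"));
        System.out.println(isAcyclic(splitAndMerge)); // prints true
        System.out.println(isAcyclic(loop));          // prints false
    }
}
```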

TAPS ABSTRACT INTEGRATION TO THIRD-PARTY SYSTEMS
• Taps provide the ability to read and write data.
• Taps can be shared between flows and can be restricted to being either sources or sinks.
• Taps can be set up to have the actual file identifiers determined when they run.
• Examples of Taps are:
  - File on the local file system
  - File on a Hadoop distributed file system
  - File on Amazon S3

FLOWS CONNECT IT ALL TOGETHER FOR EXECUTION
• Flows consist of pipe assemblies with data sources and sinks.
• Flows contain one or more data sources, a DAG (Directed Acyclic Graph) of pipes, and one or more data sinks.
• Flows are designed to be reusable units of work.
• Flows show the business and programming process.
• A flow is a basic unit of work of arbitrary size.

FLOWS CAN BE CONNECTED INTO A CASCADE
• A Cascade joins together multiple flows.
• Use a Cascade if there are dependencies among the Flows:
  - A Cascade will not execute a flow until all of its data dependencies are satisfied.
  - A Cascade can determine that a Flow does not need to run.
• A CascadeConnector makes a Cascade from Flows.
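
The dependency rule can be illustrated with a small scheduling sketch. It assumes, purely for the example, that each flow is described by the named resources it reads and the one resource it writes. `CascadeSketch`, `FlowSpec`, and `runOrder` are hypothetical names, not Cascading classes, and the sketch covers ordering only, not the decision to skip a flow.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model for illustration; these are not Cascading classes.
final class CascadeSketch {

    // A flow reads some named resources and writes one named resource.
    static final class FlowSpec {
        final String name;
        final Set<String> sources;
        final String sink;

        FlowSpec(String name, Set<String> sources, String sink) {
            this.name = name;
            this.sources = sources;
            this.sink = sink;
        }
    }

    // Orders flows so each one runs only after the flows that produce its sources.
    static List<String> runOrder(List<FlowSpec> flows) {
        Set<String> produced = new HashSet<>();
        for (FlowSpec flow : flows) {
            produced.add(flow.sink);
        }
        Set<String> available = new HashSet<>();
        List<FlowSpec> pending = new ArrayList<>(flows);
        List<String> order = new ArrayList<>();
        while (!pending.isEmpty()) {
            boolean progressed = false;
            for (FlowSpec flow : new ArrayList<>(pending)) {
                // A source is satisfied if no flow writes it (external input)
                // or if the flow that writes it has already been scheduled.
                boolean ready = flow.sources.stream()
                    .allMatch(s -> !produced.contains(s) || available.contains(s));
                if (ready) {
                    order.add(flow.name);
                    available.add(flow.sink);
                    pending.remove(flow);
                    progressed = true;
                }
            }
            if (!progressed) {
                throw new IllegalStateException("flows depend on each other in a cycle");
            }
        }
        return order;
    }

    public static void main(String[] args) {
        // Flows are listed out of order on purpose.
        List<FlowSpec> flows = List.of(
            new FlowSpec("report", Set.of("counts"), "summary"),
            new FlowSpec("count", Set.of("tokens"), "counts"),
            new FlowSpec("tokenize", Set.of("docs"), "tokens"));
        System.out.println(runOrder(flows)); // prints [tokenize, count, report]
    }
}
```

The caller lists flows in any order and the dependencies are derived from matching sink and source names, so adding a flow does not require rewriting a hand-maintained schedule.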

CASCADING RUNTIME FRAMEWORK ABSTRACTS INTEGRATION & COMPUTE FABRIC
• Java API
• Separates business logic from integration
• Testable at every lifecycle stage
• Works with any JVM language
• Many integration adapters
[Diagram: Scripting (Scala, Clojure, JRuby, Jython, Groovy) and Enterprise Java sit on top of Cascading (Processing API, Integration API, Scheduler API, Process Planner, Scheduler), which runs on Apache Hadoop and Data Stores]

CASCADING - INTEGRATION WITH EXTERNAL SYSTEMS
[Diagram: third-party systems connected as Source and Sink]
http://www.cascading.org/extensions/

CASCADING - APP PORTABILITY
• “Write once and deploy on your fabric of choice.”
• The Innovation: Cascading allows data apps to execute on existing and emerging fabrics through its new customizable query planner.
• Cascading 3.0 supports: Local In-Memory, Apache MapReduce and Apache Tez. 1H 2015: Apache Spark and Apache Storm
• Flexibility to meet changing business needs
[Diagram: Enterprise Data Applications run on Cascading over Computation Fabrics: Local In-Memory, MapReduce, Other / Custom]

THE STANDARD FOR DATA APPLICATION DEVELOPMENT
Proven application development framework for building data apps
Application platform that addresses:
• Build data apps that are scale-free: Design principles ensure best practices at any scale
• Systems Integration: Hadoop never lives alone. Easily integrate to existing systems
• Application Portability: Write once, then run on different computation fabrics
• Staffing Bottleneck: Use existing Java, Scala, SQL, modeling skill sets
• Test-Driven Development: Efficiently test code and process local files before deploying on a cluster
• Operational Complexity: Simple - Package up into one jar and hand to operations
www.cascading.org

STRONG ORGANIC GROWTH
• 280K+ downloads / month
• 7000+ Deployments

CASCADING DATA APPLICATIONS
Enterprise IT: Extract Transform Load; Log File Analysis; Systems Integration; Operations Analysis
Marketing / Retail: Mobile, Social, Search Analytics; Funnel Analysis; Revenue Attribution; Customer Experiments; Ad Optimization; Retail Recommenders
Finance: Fraud and Anomaly Detection; Fraud Experiments; Customer Analytics; Insurance Risk Metric
Corporate Apps: HR Analytics; Employee Behavioral Analysis; Customer Support | eCRM; Business Reporting
Consumer / Entertainment: Music Recommendation; Comparison Shopping; Restaurant Rankings; Real Estate; Rental Listings; Travel Search & Forecast
Health / Biotech: Aggregate Metrics For Govt; Person Biometrics; Veterinary Diagnostics; Next-Gen Genomics; Argonomics; Environmental Maps; Data processing of Open Data; Geospatial Indexing
Telecom: Consumer Mobile Apps; Location based services

BUSINESSES DEPEND ON US
• Cascading Java API
• Data normalization and cleansing of search and click-through logs for use by analytics tools, Hive analysts
• Easy to operationalize heavy lifting of data in one framework

BUSINESSES DEPEND ON US
• Cascalog (Clojure)
• Weather pattern modeling to protect growers against loss
• ETL against 20+ datasets daily
• Machine learning to create models
• Purchased by Monsanto for $930M US
