Big Telco, Bigger DW Demands: Moving Towards SQL-on-Hadoop

Keuntae Park
• IT Manager of SK Telecom, South Korea’s largest wireless communications provider
• Work on commercial products (~’12)
  – T-FS: Distributed File System
  – Windows compatible layer on TimOS
  – T-MR: on-demand MapReduce service like E-MR
• Open source activity (’13~)
  – Committer of Apache Tajo project

Overview
• Background
  – Telco requirements
• Before Tajo
  – Commercial product
  – Open source (Hadoop) outsourcing
• After Tajo
  – Issues & solutions
  – Performance
• Win-win between community and company
• Future work

Telco data characteristics
• Huge amount of data
  – 40 TB/day (compressed)
  – 15 PB (estimated, end of 2014)
• Report & OLAP ad-hoc query
  – Filtering
  – Summary
  – BI tools

Requirements - different size, different speed

               | Filtering & aggregation        | Summary                    | Data re-construction | BI report          | Ad-hoc query
Target         | accumulated data for 5 minutes | daily sum of filtered data | entire summary data  | mart data          | summary data
Frequency      | every 5 minutes                | daily or monthly           | non-regularly (rare) | ad-hoc             | ad-hoc
Amount of data | terabytes                      | hundreds of terabytes      | petabytes            | tens of gigabytes  | tens of terabytes
Response time  | within a minute                | within an hour             | no strict deadline   | within two seconds | within an hour

Previous approach - DBMS
• Based on MPP DBMS
• Too expensive
• Not scalable

Previous approach - Hadoop (MapReduce, Hive) + DBMS
• Hadoop + MPP DBMS
• Working (but…)

Still has problems
• Hadoop outsourcing
  – quality of outcome is not good (actually bad)
  – communication overhead
  – hard to reflect requirements on open source
• Data Warehouse and Mart become bigger

Solution - Tajo!!
• It can replace both DBMS and Hadoop
  – High throughput for batch processing
  – Low latency for ad-hoc queries
  – ANSI SQL compatible
• Can do it myself
  – very open community
    • easy to file issues about what I really need
  – fast growing
    • issues are solved very fast

About Tajo
• Tajo (since 2010)
  – Big Data Warehouse System on Hadoop
  – Apache top-level project (entered the ASF in March 2013)
• Features
  – SQL standard compliance
  – Fully distributed SQL query processing
  – HDFS as a primary storage
  – Relational model (will be extended to nested model in the future)
  – ETL as well as low-latency relational query processing (100 ms ~)
• News
  – 0.2-incubating: released November 2013
  – graduation to top-level: April 2014

Tajo logical optimizer
• Cost-based join ordering
• Projection/Filter push down & Duplicated expression removal
[Figure: query plan over Table A (ID, QTY, Date) and Table B (ID, Tax, Price), shown before and after optimization, with aggregation (aggr_sum1, aggr_sum2), GroupBy, Filter (sel_>, sel_<), Join, and Projection nodes]

Tajo progressive optimization
• dynamically adjust number of tasks
• estimate data size at planning time
• check size and adjust plan at execution time
• shuffle intermediate data over workers uniformly
[Figure: input data → execution block → intermediate data (size unknown in advance) → shuffled data → execution block; how many tasks (and workers)?]
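The task-count adjustment on this slide can be sketched roughly as follows. This is only an illustration, not Tajo's actual code: the class `TaskCountPlanner`, the method `numTasks`, and the 128 MB per-task target are assumptions made for the example.

```java
/** Hypothetical sketch: pick the number of tasks for the next execution block. */
final class TaskCountPlanner {
    // Assumed target data volume per task (128 MB); not a Tajo default.
    static final long BYTES_PER_TASK = 128L * 1024 * 1024;

    /**
     * Returns how many tasks to launch for the given data volume,
     * capped by the number of task slots the cluster can offer.
     */
    static int numTasks(long dataBytes, int maxSlots) {
        if (dataBytes < 0 || maxSlots < 1) {
            throw new IllegalArgumentException("dataBytes >= 0 and maxSlots >= 1 required");
        }
        // Ceiling division without overflow; at least one task even for empty input.
        long wanted = dataBytes / BYTES_PER_TASK + (dataBytes % BYTES_PER_TASK == 0 ? 0 : 1);
        wanted = Math.max(1L, wanted);
        return (int) Math.min(wanted, (long) maxSlots);
    }

    public static void main(String[] args) {
        long estimated = 10L * 1024 * 1024 * 1024; // planning-time estimate: 10 GB
        long measured = 1L * 1024 * 1024 * 1024;   // size observed at execution time: 1 GB
        System.out.println("planned tasks:  " + numTasks(estimated, 100));
        System.out.println("adjusted tasks: " + numTasks(measured, 100));
    }
}
```

Calling `numTasks` once with the planning-time estimate and again with the measured intermediate size mirrors the "estimate, then check and adjust" steps listed on the slide.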

Tajo progressive optimization
• dynamically adjust join order or type
[Figure: join plan with two Hash-Join operators; one is annotated as switching to Broadcast-Join]
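A minimal sketch of choosing the join type once the real input sizes are known is shown below. The names `JoinStrategyChooser` and `choose` and the 64 MB threshold are hypothetical, and the rule itself (broadcast when the smaller side is small) is a common heuristic rather than something the slides spell out.

```java
/** Hypothetical sketch: choose a join strategy once input sizes are known. */
final class JoinStrategyChooser {
    enum JoinType { BROADCAST_JOIN, HASH_JOIN }

    // Assumed threshold (64 MB) below which copying a table to every worker is cheap.
    static final long BROADCAST_THRESHOLD_BYTES = 64L * 1024 * 1024;

    /**
     * Picks BROADCAST_JOIN when the smaller input fits under the threshold,
     * otherwise HASH_JOIN (taken here to mean a join that shuffles both inputs by key).
     */
    static JoinType choose(long leftBytes, long rightBytes) {
        long smaller = Math.min(leftBytes, rightBytes);
        return smaller <= BROADCAST_THRESHOLD_BYTES
                ? JoinType.BROADCAST_JOIN
                : JoinType.HASH_JOIN;
    }

    public static void main(String[] args) {
        // Planned with estimates: both sides look large.
        System.out.println(choose(10L << 30, 5L << 30));
        // Re-checked at execution time: the filtered right side turned out to be 8 MB.
        System.out.println(choose(10L << 30, 8L << 20));
    }
}
```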

Tajo - what has improved in the past 9 months?
• Resource Manager
• Scheduler & Storage Manager
• Data types & Functions
• SQL Interface
• Management

Tajo resource manager
• Fine resource allocation
  – TAJO-127: without YARN (Tajo Master; one Tajo Worker acts as a query master, the others as workers)
  – TAJO-275: separating Query Master (Tajo Master, Query Master, Tajo Workers as workers)
  – TAJO-317: elaborate resource allocation (Tajo Workers marked as I/O-intensive or CPU/memory)

Scheduler & Storage manager
• disk-aware scheduling (volume info from HDFS-3672)
[Figure: Tajo Workers, each running several threads, above the Storage Manager]
• TAJO-84: considering disk load balance
• TAJO-178: asynchronous scan
• TAJO-134: text compression (gzip, snappy, lz4, bzip2)
• TAJO-200: RCFile
• TAJO-30: Parquet
• TAJO-435: intermediate file
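One way to picture disk-aware scheduling with load balancing is the sketch below: given the disk volume that holds each pending task's block, start the task whose volume currently has the fewest running scans. The class `DiskAwareScheduler` and its methods are hypothetical, and this greedy rule is an assumption for illustration, not a description of what TAJO-84 implements.

```java
/** Hypothetical sketch: balance scan tasks over a worker's local disk volumes. */
final class DiskAwareScheduler {
    // Number of scan tasks currently running on each local disk volume.
    private final int[] running;

    DiskAwareScheduler(int volumeCount) {
        if (volumeCount < 1) {
            throw new IllegalArgumentException("at least one volume required");
        }
        running = new int[volumeCount];
    }

    /**
     * pendingVolumes[i] is the volume id holding the block that pending task i scans.
     * Returns the index of the task on the least-loaded volume, or -1 if none are pending.
     * Ties go to the earliest pending task.
     */
    int pickNext(int[] pendingVolumes) {
        int best = -1;
        for (int i = 0; i < pendingVolumes.length; i++) {
            if (best < 0 || running[pendingVolumes[i]] < running[pendingVolumes[best]]) {
                best = i;
            }
        }
        return best;
    }

    void taskStarted(int volume) {
        running[volume]++;
    }

    void taskFinished(int volume) {
        if (running[volume] > 0) {
            running[volume]--;
        }
    }

    public static void main(String[] args) {
        DiskAwareScheduler scheduler = new DiskAwareScheduler(3);
        scheduler.taskStarted(0);
        scheduler.taskStarted(0);
        scheduler.taskStarted(1);
        // Volume 2 is idle, so the pending task at index 2 is chosen.
        System.out.println(scheduler.pickNext(new int[] {0, 1, 2, 0}));
    }
}
```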

Functions & data types
• supporting more functions and UDFs
  – before: function1, function2, function3 registered in Tajo Master at startup (class name is coded in source)
  – after: user defined functions, with automatic function registration and runtime function registration
• function description:

    @Description(
      functionName = "to_timestamp",
      description = "Convert UNIX epoch to time stamp",
      example = "> SELECT to_timestamp(1389071574);\n"
          + "2014-01-07 14:12:54",
      returnType = TajoDataTypes.Type.TIMESTAMP,
      paramTypes = {@ParamTypes(paramTypes = {TajoDataTypes.Type.INT4}),
          @ParamTypes(paramTypes = {TajoDataTypes.Type.INT8})}
    )

• TAJO-408: Improve function system
• TAJO-52: standard SQL data types
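The idea of registering functions from their annotations, instead of hard-coding class names, can be sketched with plain Java reflection. Everything below is a stand-in written for illustration: `FunctionInfo`, `StrLen`, and `FunctionRegistry` are hypothetical names, and the annotation is deliberately simpler than the `@Description` shown on the slide.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a function-description annotation (not Tajo's own). */
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface FunctionInfo {
    String functionName();
    String description();
}

/** Example function class that carries its own metadata. */
@FunctionInfo(functionName = "str_len", description = "Length of a string")
final class StrLen {
    static int eval(String s) {
        return s.length();
    }
}

/** Hypothetical registry that reads the SQL-visible name from the annotation. */
final class FunctionRegistry {
    private final Map<String, Class<?>> functions = new HashMap<>();

    /** Registers a class under its annotated name; can be called at startup or later. */
    void register(Class<?> functionClass) {
        FunctionInfo info = functionClass.getAnnotation(FunctionInfo.class);
        if (info == null) {
            throw new IllegalArgumentException(functionClass.getName() + " lacks @FunctionInfo");
        }
        functions.put(info.functionName(), functionClass);
    }

    /** Returns the registered class, or null if the name is unknown. */
    Class<?> lookup(String name) {
        return functions.get(name);
    }

    /** Returns the annotated description, or null if the name is unknown. */
    String describe(String name) {
        Class<?> c = functions.get(name);
        return c == null ? null : c.getAnnotation(FunctionInfo.class).description();
    }

    public static void main(String[] args) {
        FunctionRegistry registry = new FunctionRegistry();
        registry.register(StrLen.class);
        System.out.println(registry.describe("str_len"));
    }
}
```

Because the name and description travel with the class, adding a function needs no change to a central list, which is the contrast the slide draws with "class name is coded in source".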

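JDBC Driver, HCatalog
• TAJO-16, 433: HCatalog (Hive metastore)
• TAJO-176: JDBC Driver
• TAJO-101: HiveQL converter
[Figure: ANSI SQL → SQL parser and HiveQL → HiveQL parser, both producing a Tajo Algebra expression for the Query Master; JDBC connects through the JDBC Driver; HCatalog connects to the Hive metastore]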

Management
• TAJO-239: Improving Web UI
• TAJO-564: Execution block progress
• TAJO-589: Task progress
• TAJO-468: Task detail info
• TAJO-474: Task admin utility
