MARS RAPIDS GPU - PowerPoint PPT Presentation



SLIDE 1

Alibaba Cloud Intelligence, Qin Xuye, He Kaisheng

When Mars Meets RAPIDS: Principles and Practice of Using GPUs to Accelerate Distributed Processing of Massive Data

SLIDE 2

Agenda (CONTENT)

  • Background
  • What Mars+RAPIDS can do
  • How Mars+RAPIDS does it
  • Performance and outlook

SLIDE 3

SLIDE 4

The machine learning lifecycle

Data → data processing / data analysis → feature engineering / model training → model deployment / maintenance / improvement

Trained model: { new data } → { predictions }

Data processing and analysis often takes up 80% of the time.

SLIDE 5

Google Trends (worldwide)

SLIDE 6

The ever-growing data science stack

SLIDE 7

Data Scientist Data Engineer

SLIDE 8

Mars: a parallel and distributed accelerator for NumPy, pandas, and scikit-learn that lets you process more data

SLIDE 9

NumPy:

```python
import numpy as np
from scipy.special import erf


def black_scholes(P, S, T, rate, vol):
    a = np.log(P / S)
    b = T * -rate

    z = T * (vol * vol * 2)
    c = 0.25 * z
    y = 1.0 / np.sqrt(z)

    w1 = (a - b + c) * y
    w2 = (a - b - c) * y

    d1 = 0.5 + 0.5 * erf(w1)
    d2 = 0.5 + 0.5 * erf(w2)

    Se = np.exp(b) * S

    call = P * d1 - Se * d2
    put = call - P + Se

    return call, put


N = 50000000
price = np.random.uniform(10.0, 50.0, N)
strike = np.random.uniform(10.0, 50.0, N)
t = np.random.uniform(1.0, 2.0, N)
print(black_scholes(price, strike, t, 0.1, 0.2))
```

Mars tensor:

```python
import mars.tensor as mt
from mars.tensor.special import erf


def black_scholes(P, S, T, rate, vol):
    a = mt.log(P / S)
    b = T * -rate

    z = T * (vol * vol * 2)
    c = 0.25 * z
    y = 1.0 / mt.sqrt(z)

    w1 = (a - b + c) * y
    w2 = (a - b - c) * y

    d1 = 0.5 + 0.5 * erf(w1)
    d2 = 0.5 + 0.5 * erf(w2)

    Se = mt.exp(b) * S

    call = P * d1 - Se * d2
    put = call - P + Se

    return call, put


N = 50000000
price = mt.random.uniform(10.0, 50.0, N)
strike = mt.random.uniform(10.0, 50.0, N)
t = mt.random.uniform(1.0, 2.0, N)
print(mt.ExecutableTuple(black_scholes(price,
    strike, t, 0.1, 0.2)).execute())
```

NumPy: runtime 11.9 s, peak memory 5479.47 MiB. Mars tensor: runtime 5.48 s, peak memory 1647.85 MiB.

SLIDE 10

Pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100000000, 4),
                  columns=list('abcd'))
print(df.sum())
```

Mars DataFrame:

```python
import mars.tensor as mt
import mars.dataframe as md

df = md.DataFrame(mt.random.rand(100000000, 4),
                  columns=list('abcd'))
print(df.sum().execute())
```

Pandas: runtime 18.7 s, peak memory 3430.29 MiB. Mars DataFrame: runtime 5.25 s, peak memory 2007.92 MiB.

SLIDE 11

Scikit-learn:

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, y = make_blobs(
    n_samples=100000000, n_features=3,
    centers=[[3, 3, 3], [0, 0, 0],
             [1, 1, 1], [2, 2, 2]],
    cluster_std=[0.2, 0.1, 0.2, 0.2],
    random_state=9)
pca = PCA(n_components=3)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)
```

Mars learn:

```python
from sklearn.datasets import make_blobs
from mars.learn.decomposition import PCA

X, y = make_blobs(
    n_samples=100000000, n_features=3,
    centers=[[3, 3, 3], [0, 0, 0],
             [1, 1, 1], [2, 2, 2]],
    cluster_std=[0.2, 0.1, 0.2, 0.2],
    random_state=9)
pca = PCA(n_components=3)
pca.fit(X)
print(pca.explained_variance_ratio_.execute())
print(pca.explained_variance_.execute())
```

Scikit-learn: runtime 19.1 s, peak memory 7314.82 MiB. Mars learn: runtime 12.8 s, peak memory 3814.32 MiB.

SLIDE 12

The machine learning lifecycle (revisited)

Data → data processing / data analysis → feature engineering / model training → model deployment / maintenance / improvement

Trained model: { new data } → { predictions }

Data processing and analysis often takes up 80% of the time.

GPU acceleration is supported for feature engineering / model training and for deployment, but for data processing: GPU???

SLIDE 13

SLIDE 14

NumPy vs. CuPy:

```
In [1]: import numpy as np

In [4]: %%time
   ...: a = np.random.rand(8000, 10)
   ...: _ = ((a[:, np.newaxis, :] - a) ** 2).sum(axis=-1)
   ...:
CPU times: user 17 s, sys: 1.84 s, total: 18.8 s
Wall time: 5.23 s

In [2]: import cupy as cp

In [5]: %%time
   ...: a = cp.random.rand(8000, 10)
   ...: _ = ((a[:, cp.newaxis, :] - a) ** 2).sum(axis=-1)
   ...:
CPU times: user 590 ms, sys: 292 ms, total: 882 ms
Wall time: 880 ms
```

SLIDE 15

pandas vs. RAPIDS cuDF:

```
In [6]: %%time
   ...: import pandas as pd
   ...: ratings = pd.read_csv('ml-20m/ratings.csv')
   ...: ratings.groupby('userId').agg({'rating': ['sum', 'mean', 'max', 'min']})
   ...:
CPU times: user 10.5 s, sys: 1.58 s, total: 12.1 s
Wall time: 18 s

In [7]: %%time
   ...: import cudf
   ...: ratings = cudf.read_csv('ml-20m/ratings.csv')
   ...: ratings.groupby('userId').agg({'rating': ['sum', 'mean', 'max', 'min']})
   ...:
CPU times: user 1.2 s, sys: 409 ms, total: 1.61 s
Wall time: 1.66 s
```

SLIDE 16

Scikit-learn vs. RAPIDS cuML:

```
In [4]: import pandas as pd
In [5]: from sklearn.neighbors import NearestNeighbors

In [6]: %%time
   ...: df = pd.read_csv('data.csv')
   ...: nn = NearestNeighbors(n_neighbors=10)
   ...: nn.fit(df)
   ...: neighbors = nn.kneighbors(df)
   ...:
CPU times: user 3min 34s, sys: 1.73 s, total: 3min 36s
Wall time: 1min 52s

In [1]: import cudf
In [2]: from cuml.neighbors import NearestNeighbors

In [3]: %%time
   ...: df = cudf.read_csv('data.csv')
   ...: nn = NearestNeighbors(n_neighbors=10)
   ...: nn.fit(df)
   ...: neighbors = nn.kneighbors(df)
   ...:
CPU times: user 41.6 s, sys: 2.84 s, total: 44.4 s
Wall time: 17.8 s
```

SLIDE 17

Mars+RAPIDS: process more data, faster

SLIDE 18

Mars tensor: implements about 70% of the commonly used NumPy interface

  • Tensor creation: ones, empty, zeros, ones_like
  • Random sampling: rand, randint, beta, binomial
  • Basic manipulations: astype, transpose, broadcast_to, sort
  • Aggregation: sum, nansum, max, all, mean
  • Indexing: slice, boolean indexing, fancy indexing, newaxis, Ellipsis
  • Discrete Fourier transform
  • Linear algebra: QR, SVD, Cholesky, inv, norm
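Most of the calls listed above behave exactly like their NumPy counterparts, so a sketch in plain NumPy shows how the interface is used; with Mars, the same code runs by swapping `np` for `mt` (`import mars.tensor as mt`) and appending `.execute()` to trigger computation. Everything below is standard NumPy; nothing Mars-specific is assumed.

```python
import numpy as np

# Tensor creation
a = np.ones((4, 4))

# Random sampling (seeded for reproducibility)
rng = np.random.RandomState(42)
r = rng.rand(4, 4)

# Basic manipulation: transpose swaps the axes without copying data
t = a.transpose()

# Aggregation
total = a.sum()

# Indexing: slice, boolean indexing, fancy indexing
row = a[0, :]            # slice: first row
big = r[r > 0.5]         # boolean indexing: values above 0.5
picked = r[[0, 2], :]    # fancy indexing: rows 0 and 2

# Linear algebra: QR decomposition of the random matrix
q, rr = np.linalg.qr(r)

print(total)                    # 16.0
print(np.allclose(q @ rr, r))   # True: Q @ R reconstructs the input
```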
SLIDE 19

Mars DataFrame and Mars learn

  • DataFrame implemented interfaces: https://github.com/mars-project/mars/issues/495
    • DataFrame creation: DataFrame, from_records
    • IO: read_csv
    • Basic arithmetic operations
    • Math operations
    • Indexing: iloc, column selection, set_index
    • Reduction: aggregation
    • Groupby: grouped aggregation
    • merge/join
  • Learn:
    • Decomposition: PCA, TruncatedSVD
    • TensorFlow: run_tensorflow_script; MarsDataset in progress
    • XGBoost: XGBClassifier, XGBRegressor
    • PyTorch: in progress
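The DataFrame API tracks pandas, so a pipeline like the one below ports to Mars by replacing `pd` with `md` (`import mars.dataframe as md`) and calling `.execute()` on each result. The sketch itself is plain pandas with a made-up toy table; it only exercises the reduction, groupby, and merge interfaces listed above.

```python
import pandas as pd

# A tiny illustrative table
df = pd.DataFrame({
    'id': [1, 2, 1, 2, 3],
    'v1': [10, 20, 30, 40, 50],
})

# Reduction and grouped aggregation
total = df['v1'].sum()                      # 150
agg = df.groupby('id').agg({'v1': 'sum'})   # per-id sums

# merge/join against a small lookup table
names = pd.DataFrame({'id': [1, 2, 3], 'name': ['a', 'b', 'c']})
joined = df.merge(names, on='id')

print(total)              # 150
print(agg.loc[1, 'v1'])   # 40
```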
SLIDE 20

Monte Carlo estimation of π

Configurations (scale up on one machine vs. scale out across machines): 24-core CPU, 4 × 24-core CPUs, 1 × Tesla V100, 4 × Tesla V100.

1 × Tesla V100 (gpu=True, n_parallel=1):

```
In [4]: %%time
   ...: a = mt.random.uniform(-1, 1, size=(2000000000, 2), gpu=True)
   ...: print(((mt.linalg.norm(a, axis=1) < 1).sum() * 4 / 2000000000)
   ...:       .execute(n_parallel=1))
   ...:
3.14157076
CPU times: user 2.72 s, sys: 1.27 s, total: 3.99 s
Wall time: 3.98 s
```

24-core CPU:

```
In [3]: %%time
   ...: a = mt.random.uniform(-1, 1, size=(2000000000, 2))
   ...: print(((mt.linalg.norm(a, axis=1) < 1).sum() * 4 / 2000000000)
   ...:       .execute())
   ...:
3.14160312
CPU times: user 3min 31s, sys: 1min 42s, total: 5min 14s
Wall time: 25.8 s
```

4 × Tesla V100 (gpu=True, n_parallel=4):

```
In [4]: %%time
   ...: a = mt.random.uniform(-1, 1, size=(2000000000, 2), gpu=True)
   ...: print(((mt.linalg.norm(a, axis=1) < 1).sum() * 4 / 2000000000)
   ...:       .execute(n_parallel=4))
   ...:
3.14156894
CPU times: user 1.64 s, sys: 918 ms, total: 2.56 s
Wall time: 2.4 s
```

Distributed on 4 × 24-core (remote session):

```
In [4]: from mars.session import new_session
In [5]: new_session('http://192.168.0.111:40002').as_default()

In [6]: %%time
   ...: a = mt.random.uniform(-1, 1, size=(2000000000, 2))
   ...: print(((mt.linalg.norm(a, axis=1) < 1).sum() * 4 / 2000000000)
   ...:       .execute())
   ...:
3.141611406
CPU times: user 12.2 ms, sys: 2.02 ms, total: 14.3 ms
Wall time: 7.66 s
```
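Behind each of these runs is the same simple estimator: sample points uniformly in the square [-1, 1]², and the fraction falling inside the unit circle approaches π/4. A seeded pure-Python sketch, independent of Mars (N is deliberately tiny compared with the 2-billion-sample runs above):

```python
import random

random.seed(0)
N = 100_000

# Count samples whose squared distance from the origin is below 1
inside = sum(
    1 for _ in range(N)
    if random.uniform(-1, 1) ** 2 + random.uniform(-1, 1) ** 2 < 1
)
pi_est = 4 * inside / N
print(pi_est)  # close to 3.14; accuracy improves as N grows
```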

SLIDE 21

How does Mars achieve parallel and distributed execution? Let's look at the design philosophy behind Mars.

SLIDE 22

Philosophy 1: divide and conquer

```
In [1]: import mars.tensor as mt
In [2]: import mars.dataframe as md
In [3]: a = mt.ones((10, 10), chunk_size=5)
In [4]: a[5, 5] = 8
In [5]: df = md.DataFrame(a)
In [6]: s = df.sum()
In [7]: s.execute()
Out[7]:
0    10.0
1    10.0
2    10.0
3    10.0
4    10.0
5    17.0
6    10.0
7    10.0
8    10.0
9    10.0
dtype: float64
```

Coarse-grained computation graph (figure): Ones → TensorData → IndexSetValue (indexes: (5, 5), value: 8) → TensorData → FromTensor → DataFrameData → Sum → SeriesData. The tensor `a`, the DataFrame `df`, and the Series `s` each hold a reference to their data node; the graph distinguishes Tileables, TileableData nodes, and Operands.

SLIDE 23

From coarse-grained to fine-grained computation graph via Tile (figure): tiling expands each coarse-grained operand into per-chunk operations. Four Ones operands produce TensorChunkData (0,0), (1,0), (0,1), and (1,1); IndexSetValue is applied only to chunk (1,1), where the global index (5, 5) becomes (0, 0) within the chunk, still with value 8. A FromTensor per chunk yields four DataFrameChunkData; a Sum per chunk produces partial results, which are combined by Concat operands and reduced by final Sum operands into the SeriesChunkData results.
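The tiled graph boils down to divide and conquer over array blocks. A minimal plain-NumPy sketch of the same `df.sum()` example (10×10 tensor, 5×5 chunks, per-chunk reduction, concat, final reduction); the chunking helper here is illustrative, not a Mars API:

```python
import numpy as np

a = np.ones((10, 10))
a[5, 5] = 8  # same mutation as in the slide

chunk = 5
# Tile: split the 10x10 tensor into four 5x5 chunks, as chunk_size=5 does
chunks = [a[i:i + chunk, j:j + chunk]
          for i in range(0, 10, chunk)
          for j in range(0, 10, chunk)]

# Per-chunk Sum along axis 0: one partial column-sum per chunk
partials = [c.sum(axis=0) for c in chunks]

# Concat partials that cover the same columns, then Sum again
left = partials[0] + partials[2]   # chunks (0,0) and (1,0): columns 0-4
right = partials[1] + partials[3]  # chunks (0,1) and (1,1): columns 5-9
result = np.concatenate([left, right])

print(result)  # [10. 10. 10. 10. 10. 17. 10. 10. 10. 10.]
```

The combined result matches the slide's output: every column sums to 10 except column 5, where the mutated cell lifts the sum to 17.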

SLIDE 24

Fine-grained computation graph before and after operator fusion (figure): before fusion, each chunk runs Ones → FromTensor → Sum (with IndexSetValue applied to chunk (1,1) only, indexes (0, 0), value 8), followed by Concat operands and final Sum operands producing the SeriesChunkData results. The Fuse pass then collapses each chain of chunk operators into a single Compose operand: four Compose nodes yield the per-chunk DataFrameChunkData partials, and two more yield the final SeriesChunkData.
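Fusion replaces a chain of elementwise operands with one Compose node, so intermediate results are never materialized. A toy pure-Python sketch of the idea (the function names are illustrative, not Mars internals):

```python
# Unfused: each operand materializes a full intermediate list
def add_one(xs):
    return [x + 1 for x in xs]

def double(xs):
    return [x * 2 for x in xs]

def unfused(xs):
    # builds one throwaway intermediate list between the two operands
    return double(add_one(xs))

# Fused "Compose": both operations applied in a single pass,
# with no intermediate collection
def fused(xs):
    return [(x + 1) * 2 for x in xs]

data = [1, 2, 3]
print(unfused(data))  # [4, 6, 8]
print(fused(data))    # [4, 6, 8] - same result, one traversal
```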

SLIDE 25

Philosophy 2: standing on the shoulders of giants

| Operand implementation | CPU | GPU | CPU operator fusion | GPU operator fusion |
|---|---|---|---|---|
| Tensor | NumPy | CuPy | NumExpr; JAX (in progress) | CuPy |
| DataFrame | pandas | RAPIDS cuDF | not yet supported | not yet supported |
| Learn | scikit-learn / XGBoost / TensorFlow / PyTorch | RAPIDS cuML / XGBoost / TensorFlow / PyTorch | not yet supported | not yet supported |

SLIDE 26

Philosophy 3: multiple strategies, more intelligence

Example (figure): a DataFrame with columns A, B, C is split into Chunk 0 (rows 0-999), Chunk 1 (rows 1000-1999), and Chunk 2 (rows 2000-2999), and a GroupByAgg with by='B' is applied.

  • Strategy 1, Tree Reduction: run GroupByAgg on each chunk, Concat the partial results, then run a final GroupByAgg.
  • Strategy 2, Shuffle: run GroupByAgg on each chunk, shuffle the partials by key across partitions, then run GroupByAgg per partition.
  • Strategy 3, Adaptive: choose between Tree Reduction and Shuffle per workload, potentially AI-based.
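Both strategies can be sketched in a few lines of plain Python. Tree reduction aggregates each chunk locally and then merges the concatenated partials; shuffle instead routes rows to a reducer by key hash so each reducer owns whole groups. The chunk layout and two-reducer setup below are purely illustrative:

```python
from collections import defaultdict

# Three chunks of (key, value) rows, standing in for the DataFrame chunks
chunks = [
    [('a', 1), ('b', 2)],
    [('a', 3), ('c', 4)],
    [('b', 5), ('c', 6)],
]

def local_agg(rows):
    """GroupByAgg on one partition: sum values per key."""
    out = defaultdict(int)
    for k, v in rows:
        out[k] += v
    return dict(out)

# Strategy 1: tree reduction - aggregate per chunk, then merge the partials
partials = [local_agg(c) for c in chunks]
tree = local_agg([(k, v) for p in partials for k, v in p.items()])

# Strategy 2: shuffle - route each row to a reducer by hash(key),
# then aggregate within each reducer; no cross-reducer merge is needed
reducers = [[], []]
for c in chunks:
    for k, v in c:
        reducers[hash(k) % 2].append((k, v))
shuffled = {}
for r in reducers:
    shuffled.update(local_agg(r))

print(tree)  # {'a': 4, 'b': 7, 'c': 10}
```

Tree reduction shines when the number of distinct keys is small (partials stay tiny); shuffle wins when there are many groups, which is exactly why an adaptive strategy is worth having.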

SLIDE 27

Client and server

REST service:

```python
import mars.tensor as mt
import mars.dataframe as md
from mars.session import new_session

new_session('http://web:12345').as_default()

a = mt.random.rand(10, 10, chunk_size=5)
df = md.DataFrame(a)
print(df.sum().execute())
```

From graph to execution (figure): the client builds the coarse-grained graph (Rand → TensorData → FromTensor → DataFrameData → Sum → SeriesData), serializes it, and submits it to the REST service. The server deserializes the graph and tiles it into a fine-grained chunk graph: per-chunk Rand → FromTensor → Sum, then Concat and a final Sum producing the SeriesChunkData. Chains of chunk operators are fused into Compose operands, and the fused subgraphs are distributed across the schedulers, with Fetch operands moving intermediate chunk data between them. Workers execute the operands on processes (CPUs/GPUs), backed by shared memory / GPU memory and disk / cloud storage.

SLIDE 28

Mars compared with Dask and Spark

| Aspect | Mars | Dask | Spark |
|---|---|---|---|
| API richness | 😮 (NumPy, pandas, scikit-learn) | 😂 (NumPy, pandas, scikit-learn, delayed) | 😂 (SQL/DataFrame, RDD, Spark Streaming, MLlib, GraphX) |
| Community | 😮 | 😂 | 😂 |
| Scheduling | 😮 (good Kubernetes support) | 😂 (fair Kubernetes support; also Yarn and HPC) | 😂 (Kubernetes, Yarn, Mesos) |
| Fine-grained computation graph | 😂 | 😂 | 😮 |
| Runtime optimization (operator fusion) | 😂 | 😮 | 😂 |
| Mutable data | 😂 | 😮 | 😮 |
| Designed for multi-language support | 😂 | 😮 | 😂 |
| Fault tolerance | 😂 (worker-level) | 😮 | 😂 |
| No single point of failure in the distributed master | 😂 | 😮 | 😂 |

SLIDE 29

Mars vs. Dask performance (CPU, single machine)

  • MacBook Pro, 2.2 GHz Intel Core i7, 16 GB RAM
  • Mars 0.3.0a2, Dask 2.2.0

Runtime in seconds (chart; lower is better):

| Benchmark | Dask | Mars |
|---|---|---|
| Black-Scholes | 36.47 | 35 |
| Cholesky | 54.92 | 53 |
| Dot | 83.37 | 31 |
| FFT | 18.10 | 17 |
| Inv | 143.64 | 51 |
| LU | 43.97 | 44 |
| Monte-Carlo-Pi | 48.50 | 36 |
| QR | 24.90 | 22 |
| RNG | 6.04 | 5 |
| SVD | 27.33 | 25 |

SLIDE 30

Mars vs. Dask performance (CPU, distributed)

  • 1 scheduler, 4 workers (24 cores, 80 GB each)
  • One complication: Dask would sometimes crash while we ran the benchmarks

Runtime in seconds (chart; lower is better):

| Benchmark | Dask | Mars |
|---|---|---|
| Black-Scholes | 148.95 | 28 |
| Cholesky | 164.93 | 159 |
| Dot | 2,361.37 | 844 |
| FFT | 343.62 | 269 |
| Inv | 448.64 | 355 |
| LU | 621.78 | 549 |
| Monte-Carlo-Pi | 442.58 | 131 |
| QR | 163.64 | 111 |
| RNG | 6.84 | 6 |
| SVD | 226.57 | 132 |

SLIDE 31

Mars vs. Dask performance (GPU vs. CPU)

  • 500 GB RAM, 96 cores, NVIDIA V100 GPU
  • Queries:
    • groupby1: df.groupby('id1').agg({'v1': 'sum'})
    • groupby2: df.groupby(['id1', 'id2']).agg({'v1': 'sum'})
    • groupby3: df.groupby(['id6']).agg({'v1': 'sum', 'v2': 'sum', 'v3': 'sum'})

Runtime in seconds (charts; lower is better):

Data size 500 MB:

| | groupby1 | groupby2 | groupby3 |
|---|---|---|---|
| Dask-GPU | 1.29 | 1.55 | 2.57 |
| Mars-GPU | 0.95 | 1.20 | 2.25 |
| Dask-CPU | 8.49 | 8.56 | 8.44 |
| Mars-CPU | 4.65 | 7.5 | 4.73 |

Data size 5 GB:

| | groupby1 | groupby2 | groupby3 |
|---|---|---|---|
| Dask-GPU | 11.00 | 12.20 | 11.70 |
| Mars-GPU | 9 | 10 | 9 |
| Dask-CPU | 94 | 95 | 100 |
| Mars-CPU | 24.8 | 41.2 | 39.5 |

Data size 20 GB:

| | groupby1 | groupby2 | groupby3 |
|---|---|---|---|
| Dask-GPU | 45.10 | 49.10 | 50.80 |
| Mars-GPU | 38 | 44 | 36 |
| Dask-CPU | 383 | 385 | 375 |
| Mars-CPU | 82 | 135 | 115 |

SLIDE 32

Mars: current status and outlook

  • pip install pymars
  • Mars repository: https://github.com/mars-project/mars
  • Mars documentation: https://docs.mars-project.io/zh_CN/latest/
  • Dual-version releases
  • Directions:
    • The community is the priority
    • Technology:
      • Roadmap and enhancement proposals: https://github.com/mars-project/mars/issues/537
      • Better GPU support; more efficient execution and data transfer; lower scheduling overhead
      • Richer DataFrame, learn, and tensor interfaces; better interoperability with deep-learning frameworks
      • Optimizing the Mars actors layer for more efficient execution and networking
      • Support for more schedulers

SLIDE 33