Mars + RAPIDS GPU
Contents

- Background
- What Mars + RAPIDS can do
- How Mars + RAPIDS does it
- Performance and outlook
The machine learning lifecycle

Data → data processing / data analysis → feature engineering / model training → model deployment / maintenance / improvement
Trained model: { new data } → { predictions }
Data processing and feature engineering often take up 80% of the time.
[Figure: Google Trends (worldwide)]

The ever-growing data science stack

[Figure: technology stacks for the Data Scientist and Data Engineer roles]
Mars: a parallel and distributed accelerator for NumPy, pandas, and scikit-learn, letting you process more data.
NumPy:

import numpy as np
from scipy.special import erf


def black_scholes(P, S, T, rate, vol):
    a = np.log(P / S)
    b = T * -rate

    z = T * (vol * vol * 2)
    c = 0.25 * z
    y = 1.0 / np.sqrt(z)

    w1 = (a - b + c) * y
    w2 = (a - b - c) * y

    d1 = 0.5 + 0.5 * erf(w1)
    d2 = 0.5 + 0.5 * erf(w2)

    Se = np.exp(b) * S

    call = P * d1 - Se * d2
    put = call - P + Se

    return call, put


N = 50000000
price = np.random.uniform(10.0, 50.0, N)
strike = np.random.uniform(10.0, 50.0, N)
t = np.random.uniform(1.0, 2.0, N)
print(black_scholes(price, strike, t, 0.1, 0.2))

Mars tensor:

import mars.tensor as mt
from mars.tensor.special import erf


def black_scholes(P, S, T, rate, vol):
    a = mt.log(P / S)
    b = T * -rate

    z = T * (vol * vol * 2)
    c = 0.25 * z
    y = 1.0 / mt.sqrt(z)

    w1 = (a - b + c) * y
    w2 = (a - b - c) * y

    d1 = 0.5 + 0.5 * erf(w1)
    d2 = 0.5 + 0.5 * erf(w2)

    Se = mt.exp(b) * S

    call = P * d1 - Se * d2
    put = call - P + Se

    return call, put


N = 50000000
price = mt.random.uniform(10.0, 50.0, N)
strike = mt.random.uniform(10.0, 50.0, N)
t = mt.random.uniform(1.0, 2.0, N)
print(mt.ExecutableTuple(black_scholes(price, strike, t, 0.1, 0.2)).execute())
NumPy: runtime 11.9 s, peak memory 5479.47 MiB. Mars tensor: runtime 5.48 s, peak memory 1647.85 MiB.
pandas:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100000000, 4),
                  columns=list('abcd'))
print(df.sum())

Mars DataFrame:

import mars.tensor as mt
import mars.dataframe as md

df = md.DataFrame(mt.random.rand(100000000, 4),
                  columns=list('abcd'))
print(df.sum().execute())
pandas: runtime 18.7 s, peak memory 3430.29 MiB. Mars DataFrame: runtime 5.25 s, peak memory 2007.92 MiB.
scikit-learn:

from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, y = make_blobs(
    n_samples=100000000, n_features=3,
    centers=[[3, 3, 3], [0, 0, 0],
             [1, 1, 1], [2, 2, 2]],
    cluster_std=[0.2, 0.1, 0.2, 0.2],
    random_state=9)
pca = PCA(n_components=3)
pca.fit(X)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_)

Mars learn:

from sklearn.datasets import make_blobs
from mars.learn.decomposition import PCA

X, y = make_blobs(
    n_samples=100000000, n_features=3,
    centers=[[3, 3, 3], [0, 0, 0],
             [1, 1, 1], [2, 2, 2]],
    cluster_std=[0.2, 0.1, 0.2, 0.2],
    random_state=9)
pca = PCA(n_components=3)
pca.fit(X)
print(pca.explained_variance_ratio_.execute())
print(pca.explained_variance_.execute())
scikit-learn: runtime 19.1 s, peak memory 7314.82 MiB. Mars learn: runtime 12.8 s, peak memory 3814.32 MiB.
The machine learning lifecycle, revisited

Data → data processing / data analysis → feature engineering / model training → model deployment / maintenance / improvement
The data-processing and feature-engineering stages, which often take 80% of the time, both support GPU acceleration. What about the GPU in Mars?
NumPy vs CuPy:

In [1]: import numpy as np

In [4]: %%time
   ...: a = np.random.rand(8000, 10)
   ...: _ = ((a[:, np.newaxis, :] - a) ** 2).sum(axis=-1)
CPU times: user 17 s, sys: 1.84 s, total: 18.8 s
Wall time: 5.23 s

In [2]: import cupy as cp

In [5]: %%time
   ...: a = cp.random.rand(8000, 10)
   ...: _ = ((a[:, cp.newaxis, :] - a) ** 2).sum(axis=-1)
CPU times: user 590 ms, sys: 292 ms, total: 882 ms
Wall time: 880 ms
pandas vs RAPIDS cuDF:

In [6]: %%time
   ...: import pandas as pd
   ...: ratings = pd.read_csv('ml-20m/ratings.csv')
   ...: ratings.groupby('userId').agg({'rating': ['sum', 'mean', 'max', 'min']})
CPU times: user 10.5 s, sys: 1.58 s, total: 12.1 s
Wall time: 18 s

In [7]: %%time
   ...: import cudf
   ...: ratings = cudf.read_csv('ml-20m/ratings.csv')
   ...: ratings.groupby('userId').agg({'rating': ['sum', 'mean', 'max', 'min']})
CPU times: user 1.2 s, sys: 409 ms, total: 1.61 s
Wall time: 1.66 s
scikit-learn vs RAPIDS cuML:

In [4]: import pandas as pd
In [5]: from sklearn.neighbors import NearestNeighbors

In [6]: %%time
   ...: df = pd.read_csv('data.csv')
   ...: nn = NearestNeighbors(n_neighbors=10)
   ...: nn.fit(df)
   ...: neighbors = nn.kneighbors(df)
CPU times: user 3min 34s, sys: 1.73 s, total: 3min 36s
Wall time: 1min 52s

In [1]: import cudf
In [2]: from cuml.neighbors import NearestNeighbors

In [3]: %%time
   ...: df = cudf.read_csv('data.csv')
   ...: nn = NearestNeighbors(n_neighbors=10)
   ...: nn.fit(df)
   ...: neighbors = nn.kneighbors(df)
CPU times: user 41.6 s, sys: 2.84 s, total: 44.4 s
Wall time: 17.8 s
Mars + RAPIDS: process more data, faster.
Mars tensor: implements about 70% of the common NumPy interfaces.
- Tensor creation: ones, empty, zeros, ones_like, …
- Random sampling: rand, randint, beta, binomial, …
- Basic manipulations: astype, transpose, broadcast_to, sort, …
- Aggregation: sum, nansum, max, all, mean, …
- Indexing: slice, boolean indexing, fancy indexing, newaxis, Ellipsis
- Discrete Fourier transform
- Linear algebra: QR, SVD, Cholesky, inv, norm, …
Mars DataFrame and Mars learn

- DataFrame interfaces implemented: https://github.com/mars-project/mars/issues/495
  - DataFrame creation: DataFrame, from_records
  - IO: read_csv
  - Basic arithmetic operations
  - Math operations
  - Indexing: iloc, column selection, set_index
  - Reduction: aggregation
  - Groupby: grouped aggregation
  - merge/join
- Learn:
  - Decomposition: PCA, TruncatedSVD
  - TensorFlow: run_tensorflow_script; MarsDataset in progress
  - XGBoost: XGBClassifier, XGBRegressor
  - PyTorch: in progress
Scale up and scale out: a single 24-core machine vs 4 × 24-core machines; 1 × Tesla V100 vs 4 × Tesla V100.
Monte Carlo estimation of π
Single GPU (n_parallel=1):

In [4]: %%time
   ...: a = mt.random.uniform(-1, 1, size=(2000000000, 2), gpu=True)
   ...: print(((mt.linalg.norm(a, axis=1) < 1).sum() * 4 / 2000000000)
   ...:       .execute(n_parallel=1))
3.14157076
CPU times: user 2.72 s, sys: 1.27 s, total: 3.99 s
Wall time: 3.98 s

Single machine, CPU:

In [3]: %%time
   ...: a = mt.random.uniform(-1, 1, size=(2000000000, 2))
   ...: print(((mt.linalg.norm(a, axis=1) < 1).sum() * 4 / 2000000000).execute())
3.14160312
CPU times: user 3min 31s, sys: 1min 42s, total: 5min 14s
Wall time: 25.8 s

Four GPUs (n_parallel=4):

In [4]: %%time
   ...: a = mt.random.uniform(-1, 1, size=(2000000000, 2), gpu=True)
   ...: print(((mt.linalg.norm(a, axis=1) < 1).sum() * 4 / 2000000000)
   ...:       .execute(n_parallel=4))
3.14156894
CPU times: user 1.64 s, sys: 918 ms, total: 2.56 s
Wall time: 2.4 s

Distributed cluster:

In [4]: from mars.session import new_session
In [5]: new_session('http://192.168.0.111:40002').as_default()
In [6]: %%time
   ...: a = mt.random.uniform(-1, 1, size=(2000000000, 2))
   ...: print(((mt.linalg.norm(a, axis=1) < 1).sum() * 4 / 2000000000).execute())
3.141611406
CPU times: user 12.2 ms, sys: 2.02 ms, total: 14.3 ms
Wall time: 7.66 s
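For reference, the estimator itself is plain NumPy. A minimal sketch, with the sample count cut down from 2 billion so it runs in seconds; `estimate_pi` is a hypothetical helper name, not Mars API:

```python
import numpy as np

def estimate_pi(n, seed=0):
    """Monte Carlo estimate of pi: the fraction of uniform points in the
    [-1, 1] square that fall inside the unit circle, times 4."""
    rng = np.random.default_rng(seed)
    pts = rng.uniform(-1.0, 1.0, size=(n, 2))
    inside = (np.linalg.norm(pts, axis=1) < 1).sum()
    return inside * 4 / n

print(estimate_pi(1_000_000))  # prints a value close to 3.14159
```

The Mars version is the same expression, only with `mt` instead of `np` and an explicit `.execute()`, which is what lets it run chunked on CPUs, GPUs, or a cluster.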
How does Mars achieve parallel and distributed execution? Let's look at the design philosophy behind Mars.
Philosophy 1: divide and conquer
In [1]: import mars.tensor as mt
In [2]: import mars.dataframe as md
In [3]: a = mt.ones((10, 10), chunk_size=5)
In [4]: a[5, 5] = 8
In [5]: df = md.DataFrame(a)
In [6]: s = df.sum()
In [7]: s.execute()
Out[7]:
0    10.0
1    10.0
2    10.0
3    10.0
4    10.0
5    17.0
6    10.0
7    10.0
8    10.0
9    10.0
dtype: float64
[Figure: coarse-grained computation graph. Ones → TensorData → IndexSetValue(indexes=(5, 5), value=8) → TensorData → FromTensor → DataFrameData → Sum → SeriesData. The user-facing objects tensor(a), DataFrame(df), and Series(s) are Tileables; each wraps a TileableData produced by an Operand.]
Tiling turns the coarse-grained graph into a fine-grained chunk graph.

[Figure: after Tile, every coarse operand expands per chunk. Ones produces four TensorChunkData chunks (0,0), (0,1), (1,0), (1,1); IndexSetValue(value=8) applies only to the chunk containing element (5, 5), with chunk-local index (0, 0); FromTensor converts each chunk to a DataFrameChunkData; Sum runs per chunk, partial results are combined by Concat, and a final Sum produces the SeriesChunkData result.]
[Figure: fine-grained graph before and after operator fusion. Chains of chunk operators with no branching are fused into Compose nodes, so each chunk's pipeline runs as a single composed operator before the final Concat/Sum reduction.]
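The divide-and-conquer scheme can be mimicked in plain NumPy. An illustrative sketch, not Mars internals; `tiled_column_sum` is a hypothetical helper that mirrors the per-chunk Sum followed by a combining step in the chunk graph:

```python
import numpy as np

def tiled_column_sum(a, chunk_size):
    """Column sums computed chunk by chunk: a per-chunk Sum produces
    partial results, which are then combined into the final answer."""
    n_rows, n_cols = a.shape
    partials = []
    for i in range(0, n_rows, chunk_size):
        for j in range(0, n_cols, chunk_size):
            chunk = a[i:i + chunk_size, j:j + chunk_size]
            # per-chunk Sum: one partial vector of column sums
            partials.append((j, chunk.sum(axis=0)))
    # combine step: accumulate partials that cover the same columns
    result = np.zeros(n_cols)
    for j, part in partials:
        result[j:j + len(part)] += part
    return result

a = np.ones((10, 10))
a[5, 5] = 8
print(tiled_column_sum(a, chunk_size=5))  # column 5 sums to 17, the rest to 10
```

Because each chunk's partial sum is independent, the per-chunk work can run on different cores, GPUs, or workers; only the small partial results need to be moved and merged.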
Philosophy 2: stand on the shoulders of giants
Operator implementations:

| Module    | CPU                                     | GPU                                    | CPU operator fusion        | GPU operator fusion |
|-----------|-----------------------------------------|----------------------------------------|----------------------------|---------------------|
| Tensor    | NumPy                                   | CuPy                                   | NumExpr, Jax (in progress) | CuPy                |
| DataFrame | pandas                                  | RAPIDS cuDF                            | not yet supported          | not yet supported   |
| Learn     | scikit-learn/XGBoost/TensorFlow/PyTorch | RAPIDS cuML/XGBoost/TensorFlow/PyTorch | not yet supported          | not yet supported   |
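What operator fusion buys can be shown with a toy sketch (illustrative only; Mars delegates the real thing to NumExpr or CuPy as noted above, and `unfused`/`fused` are hypothetical names): an unfused chain materializes one intermediate array per operator, while a fused kernel runs the whole expression per element in a single pass.

```python
import math
import numpy as np

def unfused(x):
    # Each operator allocates a full intermediate array.
    t1 = x * 2
    t2 = t1 + 1
    return np.sqrt(t2)

def fused(x):
    # The whole chain runs per element in one pass, with no
    # intermediate arrays, which is what a fused kernel does.
    out = np.empty_like(x)
    for i, v in enumerate(x.flat):
        out.flat[i] = math.sqrt(v * 2 + 1)
    return out

x = np.arange(4.0)
print(np.allclose(unfused(x), fused(x)))  # True
```

In practice the fused loop is compiled (NumExpr bytecode, a CuPy kernel), so it also runs faster, not just with less memory.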
Philosophy 3: multiple strategies, smarter execution
Take a DataFrame with columns A, B, C split into three chunks (rows 0–999, 1000–1999, 2000–2999) and a GroupByAgg with by='B':

- Strategy 1, tree reduction: each chunk runs its own GroupByAgg, the partial results are combined by Concat, and a final GroupByAgg merges them. Works well when there are few distinct groups.
- Strategy 2, shuffle: each chunk's GroupByAgg output is repartitioned by key so that every downstream GroupByAgg owns a disjoint set of groups. Works better when there are many distinct groups.
- Strategy 3, adaptive: choose between tree reduction and shuffle based on the observed data; possibly AI-based in the future?
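The two base strategies can be sketched in pure Python (an illustrative toy, not Mars code; chunks here are lists of (key, value) pairs and the aggregation is a sum):

```python
from collections import defaultdict

def chunk_agg(chunk):
    """Per-chunk GroupByAgg: sum values by key within one chunk."""
    out = defaultdict(int)
    for key, value in chunk:
        out[key] += value
    return dict(out)

def tree_reduction(chunks):
    """Strategy 1: aggregate each chunk, then merge the small partial
    results. Cheap when there are few distinct keys."""
    merged = defaultdict(int)
    for partial in map(chunk_agg, chunks):
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)

def shuffle(chunks, n_reducers):
    """Strategy 2: repartition rows by hash(key) so each reducer owns a
    disjoint set of keys, then aggregate each partition independently."""
    partitions = [[] for _ in range(n_reducers)]
    for chunk in chunks:
        for key, value in chunk:
            partitions[hash(key) % n_reducers].append((key, value))
    result = {}
    for part in partitions:
        result.update(chunk_agg(part))
    return result

chunks = [[('a', 1), ('b', 2)], [('a', 3), ('c', 4)], [('b', 5)]]
print(tree_reduction(chunks))  # {'a': 4, 'b': 7, 'c': 4}
print(shuffle(chunks, n_reducers=2) == tree_reduction(chunks))  # True
```

Both give the same answer; they differ in cost. Tree reduction moves only the partial aggregates, so it wins when groups are few; shuffle moves the data itself but never re-merges large partials, so it wins when nearly every row is its own group.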
The client talks to the server through a REST service:

import mars.tensor as mt
import mars.dataframe as md
from mars.session import new_session

new_session('http://web:12345').as_default()

a = mt.random.rand(10, 10, chunk_size=5)
df = md.DataFrame(a)
print(df.sum().execute())
[Figure: execution pipeline. The client builds the coarse-grained graph Rand → TensorData → FromTensor → DataFrameData → Sum → SeriesData and serializes it; the server deserializes it, tiles it into a fine-grained chunk graph (per-chunk Rand → FromTensor → Sum, then Concat and a final Sum), fuses chunk operator chains into Compose nodes, and inserts Fetch nodes for chunk data that lives on other workers.]
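The client/server handoff can be sketched as follows. This is a hypothetical JSON encoding for illustration only (Mars uses its own serialization format): the client encodes the coarse-grained graph as operand nodes plus data-flow edges, and the server rebuilds it before tiling.

```python
import json

# Hypothetical encoding of the coarse-grained graph
# Rand -> TensorData -> FromTensor -> DataFrameData -> Sum -> SeriesData.
graph = {
    "nodes": [
        {"id": 0, "op": "Rand", "params": {"shape": [10, 10], "chunk_size": 5}},
        {"id": 1, "op": "FromTensor", "params": {}},
        {"id": 2, "op": "Sum", "params": {"axis": 0}},
    ],
    "edges": [[0, 1], [1, 2]],  # data flows from op 0 to 1 to 2
}

payload = json.dumps(graph)     # client side: serialize, POST to the REST API
received = json.loads(payload)  # server side: deserialize, then tile

print([n["op"] for n in received["nodes"]])  # ['Rand', 'FromTensor', 'Sum']
```

Shipping the graph rather than the data is the key point: the client stays thin (note the ~14 ms of client CPU time in the distributed Monte Carlo run), and all tiling, fusion, and scheduling happen server-side.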
[Figure: the fused chunk graph is partitioned across schedulers; workers execute their operands against local storage tiers: processes (CPUs/GPU), shared memory/GPU memory, and disk/cloud storage.]
Comparing Mars with Dask and Spark
| Aspect | Mars | Dask | Spark |
|--------|------|------|-------|
| API richness | 😮 (NumPy, pandas, scikit-learn) | 😂 (NumPy, pandas, scikit-learn, delayed) | 😂 (SQL/DataFrame, RDD, Spark Streaming, MLlib, GraphX) |
| Community | 😮 | 😂 | 😂 |
| Scheduling | 😮 (good Kubernetes support) | 😂 (fair Kubernetes support; also Yarn, HPC) | 😂 (Kubernetes, Yarn, Mesos) |
| Fine-grained computation graph | 😂 | 😂 | 😮 |
| Runtime optimization (operator fusion) | 😂 | 😮 | 😂 |
| Mutable data | 😂 | 😮 | 😮 |
| Multi-language support by design | 😂 | 😮 | 😂 |
| Fault tolerance | 😂 (worker-level) | 😮 | 😂 |
| No single point of failure for the distributed master | 😂 | 😮 | 😂 |
Mars vs Dask performance (CPU, single machine)

- MacBook Pro, 2.2 GHz Intel Core i7, 16 GB RAM
- Mars 0.3.0a2, Dask 2.2.0
Runtime in seconds (lower is better):

| Benchmark      | Dask   | Mars |
|----------------|--------|------|
| Black-Scholes  | 36.47  | 35   |
| Cholesky       | 54.92  | 53   |
| Dot            | 83.37  | 31   |
| FFT            | 18.10  | 17   |
| Inv            | 143.64 | 51   |
| LU             | 43.97  | 44   |
| Monte-Carlo-Pi | 48.50  | 36   |
| QR             | 24.90  | 22   |
| RNG            | 6.04   | 5    |
| SVD            | 27.33  | 25   |
Mars vs Dask performance (CPU, distributed)

- 1 scheduler, 4 workers (24 cores, 80 GB each)
- Annoyingly, Dask sometimes crashed during our benchmark runs.
Runtime in seconds (lower is better):

| Benchmark      | Dask     | Mars |
|----------------|----------|------|
| Black-Scholes  | 148.95   | 28   |
| Cholesky       | 164.93   | 159  |
| Dot            | 2,361.37 | 844  |
| FFT            | 343.62   | 269  |
| Inv            | 448.64   | 355  |
| LU             | 621.78   | 549  |
| Monte-Carlo-Pi | 442.58   | 131  |
| QR             | 163.64   | 111  |
| RNG            | 6.84     | 6    |
| SVD            | 226.57   | 132  |
Mars vs Dask performance (GPU and CPU)

- 500 GB RAM, 96 cores, NVIDIA V100 GPU
Data size: 5 GB. Runtime in seconds:

|          | groupby1 | groupby2 | groupby3 |
|----------|----------|----------|----------|
| Dask-GPU | 11.00    | 12.20    | 11.70    |
| Mars-GPU | 9        | 10       | 9        |
| Dask-CPU | 94       | 95       | 100      |
| Mars-CPU | 24.8     | 41.2     | 39.5     |
- groupby1: df.groupby('id1').agg({'v1': 'sum'})
- groupby2: df.groupby(['id1', 'id2']).agg({'v1': 'sum'})
- groupby3: df.groupby(['id6']).agg({'v1': 'sum', 'v2': 'sum', 'v3': 'sum'})
Data size: 20 GB. Runtime in seconds:

|          | groupby1 | groupby2 | groupby3 |
|----------|----------|----------|----------|
| Dask-GPU | 45.10    | 49.10    | 50.80    |
| Mars-GPU | 38       | 44       | 36       |
| Dask-CPU | 383      | 385      | 375      |
| Mars-CPU | 82       | 135      | 115      |
Data size: 500 MB. Runtime in seconds:

|          | groupby1 | groupby2 | groupby3 |
|----------|----------|----------|----------|
| Dask-GPU | 1.29     | 1.55     | 2.57     |
| Mars-GPU | 0.95     | 1.20     | 2.25     |
| Dask-CPU | 8.49     | 8.56     | 8.44     |
| Mars-CPU | 4.65     | 7.5      | 4.73     |
- pip install pymars
- Mars source: https://github.com/mars-project/mars
- Mars documentation: https://docs.mars-project.io/zh_CN/latest/
- Dual-version releases
- Directions:
  - The community is the priority
  - Technical:
    - Roadmap and enhancement proposals: https://github.com/mars-project/mars/issues/537
    - Better GPU support; faster execution and data transfer; lower scheduling overhead
    - Richer DataFrame, learn, and tensor interfaces; better interoperability with deep learning frameworks
    - Optimize the Mars actors layer for more efficient execution and networking
    - Support more schedulers