In-Database Machine Learning: Using Gradient Descent and Tensor Algebra

SLIDE 1

Chair III: Database Systems · Professorship for Data Mining and Analytics · Chair XXV: Data Science and Engineering · Department of Informatics, Technical University of Munich

In-Database Machine Learning: Using Gradient Descent and Tensor Algebra

Maximilian E. Schüle, Frédéric Simonis, Thomas Heyenbrock, Alfons Kemper, Stephan Günnemann, Thomas Neumann
Rostock, 4 March 2019

SLIDE 2

What Do Database Systems Need for ML?

Why not use HyPer? Database Systems + Machine Learning

SLIDE 3

What Do Database Systems Need for ML?

Machine learning: data in tensors and a parametrised loss function
Advantages: optimisation problems are solvable in the core of database servers
Goal: make database systems more attractive
What it is: an architectural blueprint for the integration of optimisation models in DBMS
What it is not: a study about the quality of different optimisation problems

HyPer Gradient Descent Tensors

CC BY-SA 3.0, TimothyRias, https://commons.wikimedia.org/w/index.php?curid=14729540

SLIDE 4

What is Gradient Descent?

[Figure: scatter plot of #Rooms (RM) vs. Median Value (MEDV) with a fitted linear regression line used to predict MEDV]

m_{a,b}(rm) = a·rm + b ≈ medv

How to optimise weights? Gradient descent: starting from initial weights, iteratively move towards the optimal weights. How to label data? Apply the model with the optimised weights to the test data (RM → MEDV). [Figure: training data (RM, MEDV) → gradient descent: initial weights → optimal weights → labelling of the test data]

l_{rm,medv}(a,b) = (m_{a,b}(rm) − medv)²
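With this loss, gradient descent repeatedly updates the weights against the gradient. A minimal worked form of the update rule (standard gradient descent with a learning rate γ; the slide itself only names the method):

a ← a − γ · ∂l_{rm,medv}/∂a
b ← b − γ · ∂l_{rm,medv}/∂b

Batch gradient descent sums the gradient over all training tuples before each update; stochastic gradient descent applies the update per tuple.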

SLIDE 5

Approach

HyPer: integration as operators in relational algebra; representation of mathematical functions on relations; concept of pipelines

Gradient Descent: gradient needed, obtained via automatic differentiation

Tensors: representation of tensors, either one relation represents one tensor or an own tensor data type (see the sketch below)

CC BY-SA 3.0, TimothyRias, https://commons.wikimedia.org/w/index.php?curid=14729540
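A minimal sketch of the two tensor representations mentioned above; the table and column names are illustrative and not taken from the slides:

-- one relation represents one tensor: one tuple per entry
create table tensor_a (i int, j int, v float);
insert into tensor_a values (1, 1, 2.0), (1, 2, 0.5), (2, 1, 3.0), (2, 2, 1.0);

-- own tensor data type: the whole tensor is a single attribute value
-- (here via the array type that the later "Tensor Data Type" slide extends)
create table tensors (name text, t float[][]);
insert into tensors values ('a', array[[2.0, 0.5], [3.0, 1.0]]);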

SLIDE 6

Integration in Relational Algebra

Operator Tree: operators for labelling and gradient descent; pipelines (weights/data)
Model / Loss Function: representation of a loss as well as of a model function
Pipelining: integration as a pipeline breaker

λ expressions for the model function m and the loss function l:

m_w(x) = Σ_{i∈m} x_i·w_i ≈ y
l_{x,y}(w) = (m_w(x) − y)²

SLIDE 7

Integration in Relational Algebra: Operator Tree

Labelling: inputs are the test dataset and the optimal weights; the label is the evaluated lambda expression for each tuple

[Figure: operator tree. The gradient descent operator consumes the training data and the initial weights together with a λ loss function; the labelling operator consumes the test data and the calculated weights together with a λ model function.]

Two operators needed:
Gradient descent: optimises the weights of a parametrised loss function; initial weights and training data as input, optimised weights as output; a lambda expression serves as the loss function to be optimised
Labelling: labels predicted values
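Read as an operator tree, the composition can be sketched in relational-algebra style (this notation is ours, not from the slides): the labelling operator consumes the weights produced by gradient descent together with the test data:

Labelling_{λ_model}(TestData, GradientDescent_{λ_loss}(TrainingData, InitialWeights))

The SQL syntax for both operators follows on the next slides.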

SLIDE 8

Integration in Rel. Algebra: Lambda Functions

Lambda expressions: to inject user-defined code into operators

[Figure: an operator with a left and a right input pipeline plus an injected λ code fragment; example: k-means over input points.]

Injected code:
λ(a, b) sqrt((a.x - b.x)^2 + (a.y - b.y)^2)

select * from kmeans((table points),
  λ(a, b) sqrt((a.x - b.x)^2 + (a.y - b.y)^2),
  2)

Euclidean Distance
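The injected code is not fixed to the Euclidean distance. As a sketch, assuming the same kmeans operator signature as above, a Manhattan distance could be injected instead:

select * from kmeans((table points),
  λ(a, b) abs(a.x - b.x) + abs(a.y - b.y),
  2)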

SLIDE 9

Integration in Rel. Algebra: Lambda Functions

Notation / Relations and Lambda Functions in SQL

Weights: n-tuple with m attributes
  w = (w_1, w_2, ..., w_m)      →  W {[w_1, w_2, ..., w_m]}
  x = (x_1, x_2, ..., x_m, y)   →  X {[x_1, x_2, ..., x_m, y]}
Model function
  m_w(x) = Σ_{i∈m} x_i·w_i ≈ y  →  λ(W, X) (W.w_1 * X.x_1 + ... + W.w_m * X.x_m)
Loss function
  l_{x,y}(w) = (m_w(x) − y)²    →  λ(W, X) (W.w_1 * X.x_1 + ... + W.w_m * X.x_m − X.y)^2

Gradient descent in SQL:

create table trainingdata (x float, y float);
create table weights (a float, b float);
insert into trainingdata ...
insert into weights ...
select * from gradientdescent(
  -- loss function as λ expression
  λ(data, weights) (weights.a * d.x + weights.b - d.y)^2,
  -- training set and initial weights
  (select x, y from trainingdata d), (select a, b from weights),
  -- learning rate and max. number of iterations
  0.05, 100);

Labelling in SQL:

create table testdata (x float);
create table weights (a float, b float);
insert into testdata ...
insert into weights ...
select * from labeling(
  -- model function as λ expression
  λ(data, weights) (weights.a * d.x + weights.b),
  -- test set and calculated weights
  (select x from testdata d), (select a, b from weights));

SLIDE 10

Integration in Relational Algebra: Pipelining

Materialising Pipelined Combined

[Figure: three physical plans for the gradient descent operator. Materialising: the main thread runs max. iterations of batch/stochastic GD over the materialised training data and the initial weights. Pipelined: stochastic gradient descent runs inside the sub-pipelines for 1 ... maxIter iterations on the initial weights. Combined: stochastic gradient descent in the sub-pipelines, then max. iterations − 1 rounds of batch/stochastic GD on the training data in the main thread. The resulting weights feed the labelling operator together with the test data.]

SLIDE 11

Integration in Relational Algebra: Pipelining

[Figure: the same three physical plans as on the previous slide: materialising, pipelined and combined.]

Materialising: materialisation of all tuples (parallel/serial); any optimisation method possible; parallelism via parallel_for
Pipelined: no materialisation; stochastic gradient descent only; distribution to pipelines; downside: multiple copies of the operator tree
Combined: first iteration in the pipelines, remaining ones in the main thread

SLIDE 12

Automatic Differentiation for Gradient Descent

Gradient descent needs a gradient: automatic differentiation is necessary. HyPer compiles SQL before execution → the gradient is precompiled and then evaluated for each tuple using placeholders.

[Figure: the original expression tree of the loss function, built from variables (a, b), placeholders (x, p_y) and constants, and the tree derived via the chain rule (the variable being differentiated is set to 1, the constant factor 2 is introduced). Numbered steps 1-5 lead from compilation to execution; the gradient descent operator, consuming training data, initial weights and the λ loss function, is integrated via the produce()/consume() interface.]

Implementation snippet:

auto status = model_optimizer->trainable.train(ValuedNodes{model_gradient->model.placeholders, tensors});
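As a concrete illustration of the precompiled gradient for the running example (standard calculus; the derivation itself is not spelled out on the slide): for the loss l(a, b) = (a·x + b − y)², the chain rule gives

∂l/∂a = 2·(a·x + b − y)·x
∂l/∂b = 2·(a·x + b − y)

so the derived tree only adds the constant 2 and the chain-rule factors to the original tree, while x and y remain placeholders that are filled in per tuple at execution time.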

SLIDE 13

Tensor Data Type

Extension of the PostgreSQL array data type:
transpose
addition/subtraction/scalar product
multiplication (inner tensor product)

Linear regression in SQL with tensors:

select (array_inverse(array_transpose(x) * x)) * (array_transpose(x) * y)
from (select array_agg(x) x
      from (select array[1, x_1, x_2] as x from datapoints) sx) tx,
     (select array_agg(y) y
      from (select array[y] y from datapoints) sy) ty;

Closed-form linear regression: w = (X′ᵀ X′)⁻¹ X′ᵀ y
with w = (w_1, w_2, ..., w_m), x = (x_1, x_2, ..., x_m), y = (y_1, y_2, ..., y_n)

Tensor operations, for T ∈ ℝ^{I_1 × ... × I_m} and U ∈ ℝ^{J_1 × ... × J_n} with I_m = o = J_1:
transpose: (tᵀ)_{i_1 i_2 i_3 ... i_m} = t_{i_2 i_1 i_3 ... i_m}
addition: (t + s)_{i_1 i_2 ... i_m} = t_{i_1 i_2 ... i_m} + s_{i_1 i_2 ... i_m}
inner product: (t · u)_{i_1 ... i_{m−1} j_2 ... j_n} = Σ_{k ∈ [o]} t_{i_1 ... i_{m−1} k} · u_{k j_2 ... j_n}
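A minimal usage sketch for the query above; the datapoints layout is inferred from the subqueries, and array_transpose/array_inverse are the array extensions this slide describes:

create table datapoints (x_1 float, x_2 float, y float);
insert into datapoints values (1.0, 2.0, 5.0), (2.0, 1.0, 6.0), (3.0, 4.0, 11.0);
-- the select statement above then returns a single weight array [w_0, w_1, w_2]:
-- the intercept and one coefficient per attribute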

SLIDE 14

Evaluation

CC BY-SA 2.0, https://flic.kr/p/sJG5

SLIDE 15

Evaluation

Tools: HyPer; MariaDB 10.1.30; PostgreSQL 9.6.8 with MADlib v1.13; TensorFlow 1.3.0; R 3.4.2
Machine: Intel Xeon E5-2660 v2 CPU (20× 2.20 GHz), 256 GB DDR4 RAM, Nvidia GeForce GTX 1050 Ti
Data: Chicago Taxi Rides dataset (10^6 tuples)
Tests: linear regression (2-3 attributes), logistic regression (2 attributes), k-means clustering

CC BY-SA 2.0, https://flic.kr/p/sJG5

SLIDE 16

Evaluation – Runtimes of GD

[Plots: time in s vs. number of tuples (10^1 to 10^6) for simple linear regression, multiple linear regression and logistic regression.]

Runs: 5000 iterations. For the database systems, no time is needed for data loading. HyPer is faster; PostgreSQL and MariaDB (using procedures) are slower.

[Legend: R, TensorFlow, TensorFlow-GPU, MariaDB, PSQL, HyPer, MADlib]

SLIDE 17

Evaluation – Ratio Computation/Loading Time

[Plots: time in s, split into declaration, data loading, TF session and computation, for TensorFlow, TensorFlow-GPU, HyPer and R on linear regression, multiple linear regression, logistic regression and gradient descent; (a) 10^5 tuples, (b) 10^6 tuples.]

Run parameters: 10 iterations, 10^6/10^7 tuples. Most of the time is spent on data loading, which is not necessary when the computation is done inside the database system.

SLIDE 18

Evaluation – Architectures

Evaluation of the architectures: materialising, pipelined, combined
Standard parameters: 10 iterations, 10^6 tuples, one thread
Observations: pipelined is faster, but only allows stochastic GD and needs a fixed number of iterations; all implementations scale; combined plan: low

[Plots: (a) scale tests: time in s vs. number of threads; (b) varying the input size: time vs. number of tuples (10^1 to 10^6); (c) varying the number of iterations: time vs. iterations (10^0 to 10^3); each comparing the materialising, pipelined and combined implementations.]

SLIDE 19

Conclusion

Database systems can take over more computations (tensors + gradient descent).
Aim of the work: saving time by moving ML operations into the core of DBMS; gradient descent and labelling in SQL + lambda expressions; different architectures for gradient descent.
Future work: support of tensor data types; a second view on relations combining SQL and ArrayQL; a generic language for machine learning, either a dedicated language that compiles to SQL or embedding of Python or R in SQL.
