SLIDE 1

AN INTRODUCTION TO DEEP LEARNING FOR ASTRONOMY

Marc Huertas-Company, IAC Winter School 2018

SLIDE 2

REFERENCES

  • Deep Learning: Do-It-Yourself! [Bursuc, Krzakala, Lelarge]
  • DEEPLEARNING.AI [COURSERA, Ng, Bensouda, Katanforoosh]
  • MACHINE LEARNING LECTURES [Keck]
  • EPFL DEEP LEARNING COURSE [Fleuret]

SEVERAL SLIDES AND FIGURES SHOWN HERE ARE INSPIRED BY OR TAKEN FROM OTHER WORKS AND COURSES FOUND ONLINE. Thanks to all of them!

SLIDE 3

SOME PRELIMINARY NOTES

I AM NOT A MACHINE LEARNING RESEARCHER

SLIDE 4

SOME PRELIMINARY NOTES

I AM NOT A MACHINE LEARNING RESEARCHER, ONLY AN ASTRONOMER WHO HAS BEEN USING MACHINE LEARNING FOR THE LAST ~14 YEARS IN MY RESEARCH. THIS LECTURE IS INTENDED TO PROVIDE A GLOBAL UNDERSTANDING OF HOW AI TECHNIQUES WORK AND, ESPECIALLY, HOW TO USE THEM IN YOUR RESEARCH.

SLIDE 5

WHAT ARE WE GOING TO LEARN?

SLIDE 6

A BUNCH OF SOMETIMES CONFUSING TERMS…

WHAT ARE WE GOING TO LEARN?

SLIDE 7

SLIDE 8

SLIDE 9

SLIDE 10

AN AMAZING AMOUNT OF MEDIA ATTENTION

SLIDE 11

AI FEVER?

[Figure: growth over time of PUBLICATIONS (ADS) and CONFERENCES]

SLIDE 12

BEFORE 2012….

CAT? DOG? TASKS TRIVIAL FOR HUMANS REMAINED CHALLENGING FOR COMPUTERS

SLIDE 13

AFTER 2012

IT HAS BECOME TRIVIAL….

SLIDE 14

THIS IS A CHANGE OF PARADIGM!

SLIDE 15

ONE OF THE MAIN REASONS FOR THIS BREAKTHROUGH IS THE AVAILABILITY OF VERY LARGE DATASETS TO LEARN FROM

SLIDE 16

COMBINED WITH THE TECHNOLOGY TO PROCESS ALL THIS DATA

SLIDE 17

ONE OF THE MAIN REASONS FOR THIS BREAKTHROUGH IS THE AVAILABILITY OF VERY LARGE DATASETS TO LEARN FROM

HOWEVER, THERE HAS NOT BEEN A MAJOR REVOLUTIONARY IDEA

SLIDE 18

BASICS OF CLASSICAL MACHINE LEARNING (this is mostly covered by my colleagues)

BASICS OF DEEP LEARNING (BOTH SUPERVISED AND UNSUPERVISED)

HOPING THAT THIS WILL BE USEFUL FOR YOUR RESEARCH! (Apologies in advance for biases toward Extra-Galactic Science + imaging)

WHAT ARE WE GOING TO LEARN?

SLIDE 19

WHY DO WE NEED THESE TOOLS IN ASTRONOMY?

SLIDE 20

WHY DO WE NEED THESE TOOLS IN ASTRONOMY? AS IN MANY OTHER DISCIPLINES, THE BIG-DATA REVOLUTION HAS ARRIVED IN ASTRONOMY TOO

SLIDE 21

BIG-DATA REVOLUTION (“we are here”)

EXTREMELY LARGE IMAGING SURVEYS DELIVERING BILLIONS OF OBJECTS IN 2-5 YEARS

LSST simulation

SLIDE 22

(Thanks to J. Brinchmann)

SLIDE 23

MaNGA Survey

NOT ONLY VOLUME: AN INCREASING COMPLEXITY OF DATA

MUSE@VLT

SLIDE 24

AND ALSO SIMULATIONS!

Ceverino+15

Genel+14

SLIDE 25

PROGRAM FOR THE WEEK

  • PART I: A VERY QUICK INTRODUCTION TO ‘CLASSICAL’ MACHINE LEARNING
      • UNSUPERVISED / SUPERVISED
      • GENERAL STEPS TO “TEACH A MACHINE”
      • “CLASSICAL” CLASSIFIERS
SLIDE 26

PROGRAM FOR THE WEEK

  • PART II: FOCUS ON ‘SHALLOW’ NEURAL NETWORKS
      • PERCEPTRON, NEURON DEFINITION
      • LAYER OF NEURONS, HIDDEN LAYERS
      • ACTIVATION FUNCTIONS
      • OPTIMIZATION [GRADIENT DESCENT, LEARNING RATES]
      • BACKPROPAGATION
SLIDE 27

PROGRAM FOR THE WEEK

  • PART III: CONVOLUTIONAL NEURAL NETWORKS
      • CONVOLUTIONS AS NEURONS
      • CNNs [POOLING, DROPOUT]
      • VANISHING GRADIENT / BATCH NORMALIZATION

SLIDE 28

PROGRAM FOR THE WEEK

  • PART IV: IMAGE-TO-IMAGE NETWORKS + INTRODUCTION TO UNSUPERVISED DEEP LEARNING
      • NETWORKS FOR IMAGE SEGMENTATION
      • AUTO-ENCODERS
      • GENERATIVE ADVERSARIAL NETWORKS
      • ANOMALY DETECTION
SLIDE 29

PROGRAM FOR THE WEEK

  • PART V: SOME PRACTICAL CONSIDERATIONS
      • HOW DO I SET UP MY CNN?
      • HOW LARGE DO TRAINING SETS NEED TO BE?
      • OPTIMIZING YOUR NET: HYPERPARAMETER SEARCH
      • VISUALIZING CNNs [DECONVNETS, INCEPTIONISM, INTEGRATED GRADIENTS]

SLIDE 30

HANDS-ON SESSION

LET’S TRY TO DISCUSS AS MUCH AS POSSIBLE! WE WILL TRY TO IMPLEMENT SOME OF THE THINGS LEARNED. MORE PRECISELY, WE WILL SET UP A DEEP NETWORK TO MEASURE GALAXY ELLIPTICITIES.

SLIDE 31

SOFTWARE REQUIREMENTS

  • PYTHON 3 OR LATER
  • TENSORFLOW FOR DEEP LEARNING
  • KERAS: A HIGH-LEVEL LIBRARY THAT MAKES GPU CODING TRANSPARENT. IT SIMPLIFIES THINGS A LOT AND IS MOST OF THE TIME ENOUGH FOR OUR APPLICATIONS
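As a hedged illustration of what the hands-on setup might look like (my sketch, not the course notebook; the 64×64 single-band stamps and the single regressed quantity are assumptions), a minimal tf.keras regression model:

```python
import numpy as np
from tensorflow import keras

# Small CNN regressing one quantity (e.g. an ellipticity) from an image
model = keras.Sequential([
    keras.layers.Input(shape=(64, 64, 1)),         # postage-stamp image
    keras.layers.Conv2D(16, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Conv2D(32, 3, activation="relu"),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(1),                         # one regressed value
])
model.compile(optimizer="adam", loss="mse")

# Dummy data, only to show the calling convention
x = np.random.rand(32, 64, 64, 1).astype("float32")
y = np.random.rand(32, 1).astype("float32")
model.fit(x, y, epochs=1, batch_size=8)
```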

SLIDE 32

PART I: AN INTRODUCTION TO “CLASSICAL” MACHINE LEARNING

SLIDE 33

THERE IS NO MAGIC IN MACHINE LEARNING, AND IT IS ACTUALLY PRETTY SIMPLE

Liu+18

SLIDE 34

$f_W(\vec{x}) = \vec{y}$

Liu+18

SLIDE 35

$f_W(\vec{x}) = \vec{y}$

LABEL: Q, SF

Liu+18

SLIDE 36

$f_W(\vec{x}) = \vec{y}$

LABEL: Q (0), SF (1); FEATURES: (U-V, V-J)

Liu+18

SLIDE 37

$f_W(\vec{x}) = \vec{y}$

LABEL: Q (0), SF (1); FEATURES: (U-V, V-J)

NETWORK FUNCTION: $\mathrm{sgn}[(U-V) - 0.8\,(V-J) - 0.7]$ (the WEIGHTS are 0.8 and 0.7)

Liu+18

SLIDE 38

$f_W(\vec{x}) = \vec{y}$

LABEL: Q, SF

“CLASSICAL” MACHINE LEARNING: REPLACE THIS BY A GENERAL NON-LINEAR FUNCTION WITH SOME PARAMETERS W: $\mathrm{sgn}[(U-V) - W_1\,(V-J) - W_2]$
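As a toy transcription of this cut (a sketch; the function name is made up, and the sign-to-class mapping is my reading of the slide’s Q/SF labels):

```python
import numpy as np

def uvj_cut(u_v, v_j, w1=0.8, w2=0.7):
    """sgn[(U-V) - w1*(V-J) - w2], with the slide's fixed weights
    w1 = 0.8, w2 = 0.7. "Classical" ML treats w1, w2 as free parameters."""
    return np.sign(u_v - w1 * v_j - w2)

print(uvj_cut(2.0, 1.0))   # +1: above the cut
print(uvj_cut(0.5, 1.0))   # -1: below the cut
```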

SLIDE 39

WHAT DOES MACHINE LEARNING DO?

SUPERVISED (the machine is told what to look for): Classification, Regression
UNSUPERVISED (the machine is NOT told what to look for): Clustering, Generative (deep learning)

SLIDE 40

WHAT DOES MACHINE LEARNING DO?

SUPERVISED (the machine is told what to look for): Classification, Regression [LECTURES BY BIEHL]
UNSUPERVISED (the machine is NOT told what to look for): Clustering, Generative (deep learning) [LECTURES BY BARON]

SLIDE 41

WHAT DOES MACHINE LEARNING DO?

SUPERVISED: Classification, Regression
UNSUPERVISED: Clustering, Generative

DEEP LEARNING

SLIDE 42

LET’S HAVE A LOOK AT SOME EXAMPLES OF DEEP LEARNING APPLIED…

SLIDE 43

MHC+15b

[Figure: confusion matrix, VISUAL DOMINANT CLASS vs AUTO DOMINANT CLASS (SPHEROID, DISK, IRR, PS, Unc); diagonal values 99.8, 96.3, 88.5, 97.1, 93.7; VISUAL / AUTOMATIC: 97 / 99]

“OUR CATS AND DOGS”: GALAXY MORPHOLOGY

CNNs

DEEP LEARNING SOLVES THE PROBLEM OF GALAXY MORPHOLOGICAL CLASSIFICATION?

SLIDE 44

MHC+15b

[Figure: the CNN confusion matrix from SLIDE 43 compared with SVMs (AUTOMATIC, Early-Type / Late-Type): 87, 13, 75, 25]

“OUR CATS AND DOGS”: GALAXY MORPHOLOGY

CNNs vs SVMs

DEEP LEARNING SOLVES THE PROBLEM OF GALAXY MORPHOLOGICAL CLASSIFICATION?

SLIDE 45

CLASSIFICATION: LENS FINDER

Jacobs+17

SLIDE 46

CLASSIFICATION: LENS FINDER

Jacobs+17

Metcalf+18

SLIDE 47

REGRESSION

Hezaveh+17, Nature

REGRESSION ON STRONG-LENS PARAMETERS

SLIDE 48

GENERATIVE MODELS

(UNSUPERVISED)

Margalef, MHC+19

SLIDE 49

GENERATIVE MODELS

(UNSUPERVISED)

Ravanbakhsh+16

Generation of realistic galaxy images

SLIDE 50

GENERATIVE MODELS TO BOOST DISCOVERY

Schlegl+17

SLIDE 51

GENERATIVE MODELS

Schawinski+17

(UNSUPERVISED)

SLIDE 52

SUPERVISED LEARNING

Training set: $(\vec{x}_1, \vec{x}_2, \vec{x}_3, ..., \vec{x}_n)$ with labels $(\vec{y}_1, \vec{y}_2, \vec{y}_3, ..., \vec{y}_n)$

$\vec{x}$: measurements (colors, fluxes, spectral indices…); $\vec{y}$: labels (morphology, object type, transit…)

Given a dataset with known labels (measurements), find a function that can assign (predict) labels for an unlabeled dataset.

SLIDE 53

SUPERVISED LEARNING

Given a dataset with known labels (measurements), find a function that can assign (predict) labels for an unlabeled dataset.

Training set: $(\vec{x}_1, \vec{x}_2, \vec{x}_3, ..., \vec{x}_n)$, $(\vec{y}_1, \vec{y}_2, \vec{y}_3, ..., \vec{y}_n)$

$f_W(\vec{x}) = \vec{y}$ ?

SLIDE 54

SUPERVISED LEARNING

Training set: $(\vec{y}_1, \vec{y}_2, \vec{y}_3, ..., \vec{y}_n)$

$f_W(\vec{x}) = \vec{y}$ ?

Unlabeled set: $(\vec{x}'_1, \vec{x}'_2, \vec{x}'_3, ..., \vec{x}'_n) \rightarrow (\vec{y}'_1, \vec{y}'_2, \vec{y}'_3, ..., \vec{y}'_n)$

SLIDE 55

$(\vec{x}_1, \vec{x}_2, \vec{x}_3, ..., \vec{x}_n)$, $(\vec{y}_1, \vec{y}_2, \vec{y}_3, ..., \vec{y}_n)$, with $\vec{x} \in \mathbb{R}^d$ and $\vec{y} \in \mathbb{R}$ or $\vec{y} \in \mathbb{N}$

GENERAL GOAL: find a (non-linear) function $f_W(\vec{x})$ that outputs the correct class / measurement for a given input object. The number of parameters can be large.

This is translated into a minimization problem: find W such that the prediction error is minimal over all unseen vectors.

SLIDE 56

Different “classical” supervised machine learning methods:

  • CARTS / RANDOM FORESTS (decision trees)
  • SUPPORT VECTOR MACHINES (kernel algorithms)
  • ARTIFICIAL NEURAL NETWORKS (DEEP LEARNING) (this is not classical…)

SLIDE 57

CARTS / RANDOM FORESTS (decision trees), SUPPORT VECTOR MACHINES (kernel algorithms), ARTIFICIAL NEURAL NETWORKS (DEEP LEARNING)

$f_W(\vec{x})$

The differences are in the function that is used.

SLIDE 58

We need two key elements:

  • 1. A LOSS FUNCTION
  • 2. A MINIMIZATION OR OPTIMIZATION ALGORITHM

SLIDE 59

We need two key elements:

  • 1. A LOSS FUNCTION
  • 2. A MINIMIZATION OR OPTIMIZATION ALGORITHM

THIS IS COMMON TO ALL MACHINE LEARNING ALGORITHMS

SLIDE 60
  • 1. DEFINE A LOSS FUNCTION: $\mathrm{loss}(f_W(\cdot), \vec{x}_i, \vec{y}_i)$. For example, the quadratic loss: $(f_W(\vec{x}_i) - \vec{y}_i)^2$
  • 2. MINIMIZE THE EMPIRICAL RISK: $\mathcal{R}_{\mathrm{empirical}}(W) = \frac{1}{N}\sum_{i}^{N}\mathrm{loss}(W, \vec{x}_i, \vec{y}_i)$
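As a concrete toy example of these two ingredients (my sketch, not from the slides; the helper name and data are made up), the empirical risk with a quadratic loss is a few lines of numpy:

```python
import numpy as np

def empirical_risk(f_W, X, Y):
    """Empirical risk with the quadratic loss from the slide:
    R_emp(W) = (1/N) * sum_i (f_W(x_i) - y_i)^2."""
    preds = np.array([f_W(x) for x in X])
    return np.mean((preds - np.asarray(Y)) ** 2)

# Toy "network function": a fixed linear model standing in for f_W
f = lambda x: 2.0 * x + 1.0
X = np.array([0.0, 1.0, 2.0])
Y = np.array([1.0, 3.2, 4.9])
print(empirical_risk(f, X, Y))  # small but non-zero risk over 3 examples
```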

SLIDES 61–63

EMPIRICAL RISK?

$\mathcal{R}_{\mathrm{empirical}}(W) = \frac{1}{N}\sum_{i}^{N}\mathrm{loss}(W, \vec{x}_i, \vec{y}_i)$

WE ARE MINIMIZING WITH RESPECT TO A FINITE NUMBER OF OBSERVED EXAMPLES: THE OBSERVED DATASET, NOT ALL “GALAXIES IN THE UNIVERSE”

SLIDE 64

In practice

TRAINING / VALIDATION / TEST

training set: used to train the classifier (this is where the optimization happens)
validation set: used to monitor performance in real time and check for overfitting
test set: used to estimate the final error on unseen data
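A common sklearn pattern for producing the three subsets (a sketch under assumed proportions, not the lecturer’s code): split off the test set first, then split the remainder into training and validation.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)          # 1000 objects, 5 made-up features
y = np.random.randint(0, 2, 1000)    # binary labels

# 60% / 20% / 20% train / validation / test
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)  # 0.25*0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```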

SLIDE 65

In practice

TRAINING / VALIDATION / TEST

NO CHEATING! NEVER USE THE TRAINING SET TO VALIDATE YOUR ALGORITHM!

SLIDE 66

The procedure used to minimize the risk is called OPTIMIZATION

THERE ARE SEVERAL OPTIMIZATION TECHNIQUES

SLIDE 67

Optimization

THERE ARE SEVERAL OPTIMIZATION TECHNIQUES

THEY DEPEND ON THE MACHINE LEARNING ALGORITHM

SLIDE 68

Optimization

THERE ARE SEVERAL OPTIMIZATION TECHNIQUES

THEY DEPEND ON THE MACHINE LEARNING ALGORITHM

$W_{t+1} = W_t - \lambda \nabla f(W_t)$

($\lambda$: learning rate; $t$: epoch; $W$: weights to be learned)

NEURAL NETWORKS USE GRADIENT DESCENT, AS WE WILL SEE LATER
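A bare-bones illustration of this update rule (my toy example: a 1D quadratic $f(W) = (W - 3)^2$, whose gradient is known in closed form):

```python
# Gradient descent on f(W) = (W - 3)^2, whose gradient is 2 * (W - 3)
grad = lambda W: 2.0 * (W - 3.0)

W = 0.0      # initial weight
lam = 0.1    # learning rate (lambda in the slide)
for t in range(50):            # epochs
    W = W - lam * grad(W)      # W_{t+1} = W_t - lambda * grad f(W_t)
print(W)     # close to the minimum at W = 3
```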

SLIDE 69

CARTS / RANDOM FORESTS (decision trees), SUPPORT VECTOR MACHINES (kernel algorithms), ARTIFICIAL NEURAL NETWORKS (DEEP LEARNING)

$f_W(\vec{x})$

The differences are in the function that is used.

SLIDE 70

HOW TO CHOOSE YOUR CLASSICAL CLASSIFIER?

NO RULE OF THUMB: IT REALLY DEPENDS ON THE APPLICATION

CARTS / RANDOM FORESTS
  ++: easy to interpret (“white box”); little data preparation; handles both numerical and categorical features
  −−: over-complex trees; unstable; biased trees if some classes dominate
  Python: sklearn.ensemble.RandomForestClassifier, sklearn.ensemble.RandomForestRegressor

SVM
  ++: easy to interpret; fast; the kernel trick allows non-linear problems
  −−: not very well suited to multi-class problems
  Python: sklearn.svm.SVC

NN
  ++: seed of deep learning; very efficient with large amounts of data, as we will see
  −−: more difficult to interpret; computing intensive
  Python: sklearn.neural_network.MLPClassifier, sklearn.neural_network.MLPRegressor
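As a quick usage sketch of the first row of that list (toy data, not a real astronomy catalog):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(500, 4)                  # 4 made-up features
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # toy binary label

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))            # mean accuracy on the test set
```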

SLIDE 71


CAN DEPEND ON YOUR MAIN INTEREST

SLIDE 72


ALSO INFLUENCED BY “MAINSTREAM” TRENDS

SLIDE 73

PART II: A FOCUS ON “SHALLOW” NEURAL NETWORKS

SLIDE 74

THE NEURON

INSPIRED BY NEUROSCIENCE?

Credit: Karpathy

SLIDE 75

SLIDE 76

Mark I Perceptron

FIRST IMPLEMENTATION OF A NEURAL NETWORK [Rosenblatt, 1957!]. INTENDED TO BE A MACHINE (NOT AN ALGORITHM). It had an array of 400 photocells, randomly connected to the “neurons”. Weights were encoded in potentiometers, and weight updates during learning were performed by electric motors.

SLIDE 77

TODAY’S ARTIFICIAL NEURON

Pre-activation: $z(\vec{x}) = \vec{W} \cdot \vec{x} + b$

Output: $f(\vec{x}) = g(\vec{W} \cdot \vec{x} + b)$

($\vec{x}$: input; $\vec{W}$: weights; $b$: bias; $g$: activation function)
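In numpy this neuron is a one-liner (a sketch; ReLU is chosen here just as an example of $g$):

```python
import numpy as np

def neuron(x, W, b, g=lambda z: np.maximum(0.0, z)):
    """z = W . x + b (pre-activation), f(x) = g(z) (output)."""
    z = np.dot(W, x) + b
    return g(z)

x = np.array([1.0, 2.0])    # input
W = np.array([0.5, -0.3])   # weights
b = 0.1                     # bias
print(neuron(x, W, b))      # g(0.5 - 0.6 + 0.1) = g(0.0) = 0.0
```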

SLIDE 78

LAYER OF NEURONS

$f(\vec{x}) = g(W\vec{x} + \vec{b})$

SAME IDEA: now W becomes a matrix and b a vector.

SLIDE 79

Hidden Layers of Neurons

INPUT → FIRST LAYER: $z^h(x) = W^h x + b^h$

SLIDE 80

HIDDEN LAYER

ACTIVATION FUNCTION: $h(x) = g(z^h(x)) = g(W^h x + b^h)$

SLIDE 81

OUTPUT LAYER

$z^o(x) = W^o h(x) + b^o$

SLIDE 82

PREDICTION LAYER

$f(x) = \mathrm{softmax}(z^o)$

SLIDE 83

$f_W(\vec{x}) = \vec{y}$

LABEL: Q, SF. REPLACE THIS BY A GENERAL NON-LINEAR FUNCTION WITH SOME PARAMETERS W.

NETWORK FUNCTION: $p = g_3(W_3\,g_2(W_2\,g_1(W_1 \vec{x}_0)))$

“CLASSICAL” MACHINE LEARNING
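A literal numpy transcription of that network function (a sketch: the layer sizes are arbitrary, $g_1$ and $g_2$ are taken to be ReLU and $g_3$ softmax, and biases are omitted as in the slide’s expression):

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)
def softmax(z):
    e = np.exp(z - z.max())   # numerically stable softmax
    return e / e.sum()

x0 = np.array([1.0, -0.5, 0.3])   # input features
W1 = np.random.randn(4, 3)        # first hidden layer
W2 = np.random.randn(4, 4)        # second hidden layer
W3 = np.random.randn(2, 4)        # output layer (2 classes, e.g. Q / SF)

p = softmax(W3 @ relu(W2 @ relu(W1 @ x0)))  # p = g3(W3 g2(W2 g1(W1 x0)))
print(p, p.sum())                 # class probabilities, summing to 1
```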

SLIDE 84

WHY HIDDEN LAYERS?

More hidden layers allow the network to represent increasingly complex functions

Credit: Karpathy

SLIDE 85

SO LET’S GO DEEPER AND DEEPER!

SLIDE 86

SO LET’S GO DEEPER AND DEEPER! YES, BUT… IT IS NOT SO STRAIGHTFORWARD: DEEPER MEANS MORE WEIGHTS, MORE DIFFICULT OPTIMIZATION, AND A RISK OF OVERFITTING…

SLIDE 87

LET’S FIRST EXAMINE IN MORE DETAIL HOW SIMPLE “SHALLOW” NETWORKS WORK

SLIDE 88

ACTIVATION FUNCTIONS?

THEY ADD NON-LINEARITIES TO THE PROCESS

SLIDE 89

ACTIVATION FUNCTIONS

SLIDE 90

ACTIVATION FUNCTIONS

Sigmoid: $f(x) = \frac{1}{1 + e^{-x}}$
ReLU: $f(x) = \max(0, x)$
Tanh: $f(x) = \tanh(x)$
Soft ReLU: $f(x) = \log(1 + e^x)$
Leaky ReLU: $f(x) = \epsilon x + (1 - \epsilon)\max(0, x)$
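Written out in numpy (a sketch; the default $\epsilon$ below is my choice, the slide leaves it free):

```python
import numpy as np

sigmoid    = lambda x: 1.0 / (1.0 + np.exp(-x))
relu       = lambda x: np.maximum(0.0, x)
tanh       = np.tanh
soft_relu  = lambda x: np.log(1.0 + np.exp(x))                       # softplus
leaky_relu = lambda x, eps=0.01: eps * x + (1 - eps) * np.maximum(0.0, x)

x = np.linspace(-2.0, 2.0, 5)
for name, g in [("sigmoid", sigmoid), ("ReLU", relu), ("tanh", tanh),
                ("soft ReLU", soft_relu), ("leaky ReLU", leaky_relu)]:
    print(name, g(x))
```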

SLIDE 91

ACTIVATION FUNCTIONS

+ MANY OTHERS!

SLIDE 92

WHAT IS THE MEANING OF THE ACTIVATION FUNCTION?

Any real function on an interval (a, b) can be approximated with a linear combination of translated and scaled ReLU functions.

SLIDES 93–106

[Step-by-step graphical illustration of the same statement: a function on (a, b) being built up from translated and scaled ReLU units]
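The claim is easy to check numerically. A sketch (my construction, not from the slides): a least-squares fit of a linear combination of translated ReLUs to sin(x) on [0, 2π]:

```python
import numpy as np

relu = lambda x: np.maximum(0.0, x)

def relu_fit(f, a, b, n_knots=20):
    """Fit f on [a, b] with the basis {1, relu(x - k_i)} by least squares."""
    x = np.linspace(a, b, 200)
    knots = np.linspace(a, b, n_knots)
    A = np.column_stack([np.ones_like(x)] + [relu(x - k) for k in knots])
    coef, *_ = np.linalg.lstsq(A, f(x), rcond=None)
    return x, A @ coef

x, approx = relu_fit(np.sin, 0.0, 2.0 * np.pi)
print(np.max(np.abs(approx - np.sin(x))))   # small maximum error
```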

SLIDE 107

SOFTMAX

A generalization of the SIGMOID ACTIVATION:

$\mathrm{softmax}(x) = \frac{e^{x}}{\sum_{i=1}^{n} e^{x_i}}$

THE OUTPUT IS NORMALIZED BETWEEN 0 AND 1, THE COMPONENTS ADD TO 1, AND IT CAN BE INTERPRETED AS A PROBABILITY: $p(Y = c \mid X = x) = \mathrm{softmax}(z(x))_c$

SLIDE 108

SOFTMAX

GENERALLY USED AS THE ACTIVATION OF THE LAST LAYER (we will come back to this later)
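In numpy (a standard numerically stable sketch; subtracting max(z) leaves the result unchanged but avoids overflow in the exponential):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])   # e.g. the last-layer pre-activations z^o
p = softmax(z)
print(p, p.sum())               # components in (0, 1), summing to 1
```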