Polychotomizers: One-Hot Vectors, Softmax, and Cross-Entropy
Mark Hasegawa-Johnson, 3/9/2019. CC-BY 3.0: You are free to share and adapt these slides if you cite the original.
%, " ']
% = degree to which the animal is domesticated, e.g., comes when called
' = size of the animal is domesticated, e.g., in kilograms
"
$ = & '()** 1 ⃗ " ), 0 ≤ # $ ≤ 1
Everybody agrees that one of the two classes is called "class 1," but nobody agrees on what to call the other class.
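To make the dichotomizer concrete, here is a minimal Python sketch. It is my own illustration, not from the slides: it assumes a logistic sigmoid squashes a linear discriminant into [0, 1], and the weights and feature values are made up.

```python
import math

def dichotomizer(x, w, b):
    """Two-class classifier output f(x) = Pr(class 1 | x).

    A logistic sigmoid of a linear discriminant is one common choice;
    it guarantees 0 <= f(x) <= 1.
    """
    a = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-a))

# Made-up weights for the [domestication, size] feature vector:
f = dichotomizer([0.9, 4.0], w=[2.0, -0.5], b=0.0)   # about 0.45
```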
Polychotomizer: given training tokens \vec{x}_1, \vec{x}_2, \ldots, \vec{x}_n, each token has a label vector \vec{y}_i = [y_{i1}, \ldots, y_{iK}] telling which of the K classes it belongs to.
[Figure, repeated across slides: training tokens plotted in the two-dimensional feature space]
Training database: n pairs (\vec{x}_1, \vec{y}_1), (\vec{x}_2, \vec{y}_2), \ldots, (\vec{x}_n, \vec{y}_n), with label vectors \vec{y}_i = [y_{i1}, \ldots, y_{iK}].
[Figure: training tokens plotted in feature space]
%, … , " (]
%, … , " (]
/2% ,
%, ⃗
(, ⃗
*, ⃗
%, ⃗
(, … , ⃗
* ,
+ = [$ +%, … , $ +-]
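A one-hot label vector can be sketched in a few lines of Python (the function name `one_hot` and zero-based class indices are my own conventions, not the slides'):

```python
def one_hot(k, K):
    """One-hot label vector for class k out of K classes (zero-indexed)."""
    return [1.0 if j == k else 0.0 for j in range(K)]

y = one_hot(2, 5)   # token belongs to class 2 of 5
# Exactly one element is 1, and the elements sum to 1.
```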
y_{ik}: the true probability of the k-th class, given to you with the training database. Because the label vectors are one-hot, the true probability is always either 1 or 0.
\hat{y}_{ik}: the probability of the k-th class estimated by the neural net, with 0 \le \hat{y}_{ik} \le 1 (for a dichotomizer, \hat{y}_{i2} = 1 - \hat{y}_{i1}).
Cross-entropy compares the two:

L = -\sum_{k=1}^{K} y_{ik} \ln \hat{y}_{ik}
[Figure: network diagram — inputs feed perceptrons with weights \vec{w}_k, followed by a max]
The softmax function is defined as:

\hat{y}_{ik} = \operatorname{softmax}_k\!\left(\vec{w}\cdot\vec{x}_i\right) = \frac{e^{\vec{w}_k\cdot\vec{x}_i}}{\sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}}

For example, the figure to the right shows \hat{y}_1 = \operatorname{softmax}_1(\vec{w}\cdot\vec{x}). Notice that it's close to 1 (yellow) when \vec{w}_1\cdot\vec{x} = \max_\ell \vec{w}_\ell\cdot\vec{x}, close to zero (blue) when some other activation is the maximum, and takes intermediate values in between.
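A direct implementation of this definition takes a few lines of Python (my own sketch; the max-subtraction trick is a standard numerical-stability measure, not part of the slide's formula):

```python
import numpy as np

def softmax(a):
    """softmax_k(a) = exp(a_k) / sum_l exp(a_l).

    Subtracting max(a) first leaves the result unchanged (it cancels
    between numerator and denominator) but prevents overflow.
    """
    e = np.exp(a - np.max(a))
    return e / e.sum()

a = np.array([2.0, 1.0, 0.1])   # activations w_k . x for K = 3 classes
y_hat = softmax(a)
# y_hat sums to 1; the largest activation gets the largest probability.
```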
To train the network we will need the derivative of \hat{y}_{ik} with respect to each weight w_{jd}. The differentiation uses three rules:

\frac{\partial}{\partial x}\left(\frac{u}{v}\right) = \frac{1}{v}\frac{\partial u}{\partial x} - \frac{u}{v^2}\frac{\partial v}{\partial x}, \qquad \frac{\partial}{\partial x}\, e^{u} = e^{u}\,\frac{\partial u}{\partial x}, \qquad \frac{\partial}{\partial w_{jd}}\left(\vec{w}_j\cdot\vec{x}_i\right) = x_{id}
First, we use the rule \frac{\partial}{\partial x}\left(\frac{u}{v}\right) = \frac{1}{v}\frac{\partial u}{\partial x} - \frac{u}{v^2}\frac{\partial v}{\partial x}:

\hat{y}_{ik} = \operatorname{softmax}_k\!\left(\vec{w}\cdot\vec{x}_i\right) = \frac{e^{\vec{w}_k\cdot\vec{x}_i}}{\sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}}

\frac{\partial \hat{y}_{ik}}{\partial w_{jd}} =
\begin{cases}
\dfrac{1}{\sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}}\, \dfrac{\partial e^{\vec{w}_k\cdot\vec{x}_i}}{\partial w_{jd}} - \dfrac{e^{\vec{w}_k\cdot\vec{x}_i}}{\left(\sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}\right)^2}\, \dfrac{\partial \sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}}{\partial w_{jd}} & j = k \\[2ex]
- \dfrac{e^{\vec{w}_k\cdot\vec{x}_i}}{\left(\sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}\right)^2}\, \dfrac{\partial \sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}}{\partial w_{jd}} & j \ne k
\end{cases}
Next, we use the rule \frac{\partial}{\partial x}\, e^{u} = e^{u}\frac{\partial u}{\partial x}:

\frac{\partial \hat{y}_{ik}}{\partial w_{jd}} =
\begin{cases}
\dfrac{1}{\sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}}\, e^{\vec{w}_k\cdot\vec{x}_i}\, \dfrac{\partial (\vec{w}_k\cdot\vec{x}_i)}{\partial w_{jd}} - \dfrac{e^{\vec{w}_k\cdot\vec{x}_i}}{\left(\sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}\right)^2}\, e^{\vec{w}_k\cdot\vec{x}_i}\, \dfrac{\partial (\vec{w}_k\cdot\vec{x}_i)}{\partial w_{jd}} & j = k \\[2ex]
- \dfrac{e^{\vec{w}_k\cdot\vec{x}_i}}{\left(\sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}\right)^2}\, e^{\vec{w}_j\cdot\vec{x}_i}\, \dfrac{\partial (\vec{w}_j\cdot\vec{x}_i)}{\partial w_{jd}} & j \ne k
\end{cases}
=
\begin{cases}
\left(\dfrac{e^{\vec{w}_k\cdot\vec{x}_i}}{\sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}} - \dfrac{\left(e^{\vec{w}_k\cdot\vec{x}_i}\right)^2}{\left(\sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}\right)^2}\right) \dfrac{\partial (\vec{w}_k\cdot\vec{x}_i)}{\partial w_{jd}} & j = k \\[2ex]
- \dfrac{e^{\vec{w}_k\cdot\vec{x}_i}\, e^{\vec{w}_j\cdot\vec{x}_i}}{\left(\sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}\right)^2}\, \dfrac{\partial (\vec{w}_j\cdot\vec{x}_i)}{\partial w_{jd}} & j \ne k
\end{cases}
Next, we use the rule \frac{\partial (\vec{w}_j\cdot\vec{x}_i)}{\partial w_{jd}} = x_{id}:

\frac{\partial \hat{y}_{ik}}{\partial w_{jd}} =
\begin{cases}
\left(\dfrac{e^{\vec{w}_k\cdot\vec{x}_i}}{\sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}} - \dfrac{\left(e^{\vec{w}_k\cdot\vec{x}_i}\right)^2}{\left(\sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}\right)^2}\right) x_{id} & j = k \\[2ex]
- \dfrac{e^{\vec{w}_k\cdot\vec{x}_i}\, e^{\vec{w}_j\cdot\vec{x}_i}}{\left(\sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}\right)^2}\, x_{id} & j \ne k
\end{cases}
… and simplify, using \hat{y}_{ik} = e^{\vec{w}_k\cdot\vec{x}_i} / \sum_{\ell=1}^{K} e^{\vec{w}_\ell\cdot\vec{x}_i}:

\frac{\partial \hat{y}_{ik}}{\partial w_{jd}} =
\begin{cases}
\left(\hat{y}_{ik} - \hat{y}_{ik}^2\right) x_{id} = \hat{y}_{ik}\left(1 - \hat{y}_{ik}\right) x_{id} & j = k \\[1ex]
- \hat{y}_{ik}\, \hat{y}_{ij}\, x_{id} & j \ne k
\end{cases}
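The case analysis above can be checked numerically: a finite-difference derivative of softmax_k with respect to w_{jd} should match \hat{y}_k(1-\hat{y}_k)x_d when j = k and -\hat{y}_k \hat{y}_j x_d when j \ne k. A sketch (my own construction, with made-up random weights, not from the slides):

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()

rng = np.random.default_rng(0)
K, D = 3, 4                   # K classes, D input features
W = rng.normal(size=(K, D))   # weight vectors w_1 .. w_K as the rows
x = rng.normal(size=D)
y_hat = softmax(W @ x)

# Analytic Jacobian from the case analysis above:
#   d y_hat_k / d w_jd = y_hat_k * (1 - y_hat_k) * x_d   if j == k
#                      = -y_hat_k * y_hat_j * x_d        if j != k
jac = np.zeros((K, K, D))
for k in range(K):
    for j in range(K):
        coeff = y_hat[k] * (1.0 - y_hat[k]) if j == k else -y_hat[k] * y_hat[j]
        jac[k, j] = coeff * x

# Finite-difference check of one entry (k, j, d) with j != k:
eps = 1e-6
k, j, d = 0, 1, 2
W_pert = W.copy()
W_pert[j, d] += eps
numeric = (softmax(W_pert @ x)[k] - y_hat[k]) / eps
```

A useful sanity check falls out of the formula: summing the Jacobian over k gives zero, because the \hat{y}_{ik} always sum to 1 no matter how the weights change.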
"#$ is the probability of the %&' class, estimated by the neural net, in response to the (&' training token
to the -&' class label The dependence of ! "#$ on )*+ for - ≠ % is weird, and people who are learning this for the first time often forget about it. It comes from the denominator of the softmax. ! "#$ = softmax
$
)ℓ 8 ⃗ :
# =
;<=8 ⃗
>?
∑ℓAB
C
;<ℓ8 ⃗
>?
D ! "#$ D)*+ = E ! "#$ − ! "#$
G : #+
−! "#$ ! "#*:
#+
"#* is the probability of the -&' class for the (&' training token
#+ is the value of the ,&' input feature for the (&' training
token
B
G
B
ℓ
Differentiating the cross-entropy L = -\sum_{k=1}^{K} y_{ik} \ln \hat{y}_{ik} and substituting the softmax derivative:

\frac{\partial L}{\partial w_{jd}} = -\sum_{k=1}^{K} \frac{y_{ik}}{\hat{y}_{ik}}\, \frac{\partial \hat{y}_{ik}}{\partial w_{jd}} = -y_{ij}\left(1 - \hat{y}_{ij}\right) x_{id} + \sum_{k \ne j} y_{ik}\, \hat{y}_{ij}\, x_{id} = \left(\hat{y}_{ij} - y_{ij}\right) x_{id}

where the last step uses \sum_{k} y_{ik} = 1. Gradient descent moves each weight opposite its gradient: if \frac{\partial L}{\partial w_{jd}} is positive, the update makes w_{jd} smaller; if \frac{\partial L}{\partial w_{jd}} is negative, the update makes w_{jd} larger.
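Combining the derivative results with the cross-entropy loss gives the per-token gradient (\hat{y}_{ij} - y_{ij})\,x_{id}, a standard result. Here is a sketch of one gradient-descent step using it; the function names, learning rate, and random data are my own, not from the slides:

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def sgd_step(W, x, y, eta=0.05):
    """One gradient-descent step on the cross-entropy of a single token.

    Uses dL/dw_jd = (y_hat_j - y_j) * x_d: rows whose estimated
    probability is too high move away from x; rows whose estimate
    is too low move toward it.
    """
    y_hat = softmax(W @ x)
    return W - eta * np.outer(y_hat - y, x)

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))      # 3 classes, 4 input features
x = rng.normal(size=4)
y = np.array([0.0, 1.0, 0.0])    # one-hot: token belongs to class 1

W_new = sgd_step(W, x, y)
# After the step, the estimated probability of the true class increases.
```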
Compare this to the perceptron learning rule: if a training token \vec{x}_i of class j is misclassified as class m, then the perceptron update is \vec{w}_j = \vec{w}_j + \eta\,\vec{x}_i.
[Figure: network diagram — inputs feed perceptrons with weights \vec{w}_\ell, followed by an argmax or softmax]