CS 224d: Assignment #1


Due date: 4/19 11:59 PM PST (You are allowed to use three (3) late days maximum for this assignment)

These questions require thought, but do not require long answers. Please be as concise as possible. We encourage students to discuss in groups for assignments. However, each student must finish the problem set and programming assignment individually, and must turn in her/his assignment. We ask that you abide by the university Honor Code and that of the Computer Science department, and make sure that all of your submitted work is done by yourself. Please review any additional instructions posted on the assignment page at http://cs224d.stanford.edu/assignments.html. When you are ready to submit, please follow the instructions on the course website.

1 Softmax (10 points)

(a) (5 points) Prove that softmax is invariant to constant offsets in the input, that is, for any input vector x and any constant c,

    softmax(x) = softmax(x + c)

where x + c means adding the constant c to every dimension of x. Remember that

    softmax(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}    (1)

Note: In practice, we make use of this property and choose c = -\max_i x_i when computing softmax probabilities for numerical stability (i.e. subtracting its maximum element from all elements of x).

Solution: Proof. For all dimensions 1 \le i \le \dim(x),

    (softmax(x + c))_i = \frac{\exp(x_i + c)}{\sum_{j=1}^{\dim(x)} \exp(x_j + c)}
                       = \frac{\exp(c)\exp(x_i)}{\exp(c)\sum_{j=1}^{\dim(x)} \exp(x_j)}
                       = \frac{\exp(x_i)}{\sum_{j=1}^{\dim(x)} \exp(x_j)}
                       = (softmax(x))_i.

(b) (5 points) Given an input matrix of N rows and d columns, compute the softmax prediction for each row. Write your implementation in q1_softmax.py. You may test it by executing python q1_softmax.py.

Note: The provided tests are not exhaustive. Later parts of the assignment will reference this code, so it is important to have a correct implementation. Your implementation should also be efficient and vectorized whenever possible. A non-vectorized implementation will not receive full credit!
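For part (b), the following is a minimal sketch of one way a vectorized, numerically stable row-wise softmax could look in NumPy. The function name and interface are assumptions and may not match the starter code in q1_softmax.py; it simply applies the offset c = -max_i x_i from part (a) to each row before exponentiating.

import numpy as np

def softmax(x):
    # Row-wise softmax for an N x d matrix; also handles a single 1-D vector.
    # Subtracting the row maximum uses the invariance proved in part (a)
    # and keeps exp() from overflowing.
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        shifted = x - np.max(x)
        exps = np.exp(shifted)
        return exps / np.sum(exps)
    shifted = x - np.max(x, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)

# Quick check of the offset invariance from part (a):
x = np.array([[1001.0, 1002.0], [3.0, 4.0]])
print(np.allclose(softmax(x), softmax(x - 1000.0)))  # True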

2 Neural Network Basics (30 points)

(a) (3 points) Derive the gradient of the sigmoid function and show that it can be rewritten as a function of the function value (i.e. in some expression where only \sigma(x), but not x, is present). Assume that the input x is a scalar for this question. Recall, the sigmoid function is

    \sigma(x) = \frac{1}{1 + e^{-x}}    (2)

Solution: \sigma'(x) = \sigma(x)(1 - \sigma(x)).

(b) (3 points) Derive the gradient with regard to the inputs of a softmax function when cross entropy loss is used for evaluation, i.e. find the gradients with respect to the softmax input vector \theta, when the prediction is made by \hat{y} = softmax(\theta). Remember that the cross entropy function is

    CE(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)    (3)

where y is the one-hot label vector, and \hat{y} is the predicted probability vector for all classes. (Hint: you might want to consider the fact that many elements of y are zeros, and assume that only the k-th dimension of y is one.)

Solution:

    \frac{\partial CE(y, \hat{y})}{\partial \theta} = \hat{y} - y.

Or equivalently, assuming k is the correct class,

    \frac{\partial CE(y, \hat{y})}{\partial \theta_i} = \hat{y}_i - 1 if i = k, and \hat{y}_i otherwise.

(c) (6 points) Derive the gradients with respect to the inputs x to a one-hidden-layer neural network (that is, find \partial J / \partial x, where J is the cost function for the neural network). The neural network employs a sigmoid activation function for the hidden layer, and softmax for the output layer. Assume the one-hot label vector is y, and cross entropy cost is used. (Feel free to use \sigma'(x) as shorthand for the sigmoid gradient, and feel free to define any variables whenever you see fit.)

(Figure: a one-hidden-layer network mapping input x to hidden layer h to output \hat{y}.)

Recall that the forward propagation is as follows:

    h = sigmoid(x W_1 + b_1)
    \hat{y} = softmax(h W_2 + b_2)

Note that here we are assuming that the input vector (and thus the hidden variables and output probabilities) is a row vector, to be consistent with the programming assignment. When we apply the sigmoid function to a vector, we apply it to each element of that vector. W_i and b_i (i = 1, 2) are the weights and biases, respectively, of the two layers.

Solution: Denote z_2 = h W_2 + b_2 and z_1 = x W_1 + b_1. Then

    \delta_1 = \frac{\partial CE}{\partial z_2} = \hat{y} - y
    \delta_2 = \frac{\partial CE}{\partial h} = \delta_1 \frac{\partial z_2}{\partial h} = \delta_1 W_2^\top
    \delta_3 = \frac{\partial CE}{\partial z_1} = \delta_2 \frac{\partial h}{\partial z_1} = \delta_2 \circ \sigma'(z_1)
    \frac{\partial CE}{\partial x} = \delta_3 \frac{\partial z_1}{\partial x} = \delta_3 W_1^\top

(d) (2 points) How many parameters are there in this neural network, assuming the input is D_x-dimensional, the output is D_y-dimensional, and there are H hidden units?

Solution: (D_x + 1) \cdot H + (H + 1) \cdot D_y.

(e) (4 points) Fill in the implementation of the sigmoid activation function and its gradient in q2_sigmoid.py. Test your implementation using python q2_sigmoid.py. Again, thoroughly test your code, as the provided tests may not be exhaustive.

(f) (4 points) To make debugging easier, we will now implement a gradient checker. Fill in the implementation of gradcheck_naive in q2_gradcheck.py. Test your code using python q2_gradcheck.py.

(g) (8 points) Now, implement the forward and backward passes for a neural network with one sigmoid hidden layer. Fill in your implementation in q2_neural.py. Sanity-check your implementation with python q2_neural.py.
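For parts (c) and (g), the sketch below shows one possible NumPy implementation of the forward pass and of a backward pass that follows the \delta_1, \delta_2, \delta_3 derivation above, under the row-vector convention. The function name, argument order, and the choice to sum the cost over a batch of N rows are assumptions and may differ from the interface expected by q2_neural.py.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(x, y, W1, b1, W2, b2):
    # x: (N, Dx) inputs as rows, y: (N, Dy) one-hot labels.
    # Returns the summed cross-entropy cost and gradients w.r.t. x and all parameters.

    # Forward propagation
    z1 = x.dot(W1) + b1                                  # (N, H)
    h = sigmoid(z1)
    z2 = h.dot(W2) + b2                                  # (N, Dy)
    z2 = z2 - np.max(z2, axis=1, keepdims=True)          # stability offset from 1(a)
    y_hat = np.exp(z2) / np.sum(np.exp(z2), axis=1, keepdims=True)
    cost = -np.sum(y * np.log(y_hat))

    # Backward propagation, following delta_1 .. delta_3 from the solution to 2(c)
    delta1 = y_hat - y                                   # dCE/dz2
    gradW2 = h.T.dot(delta1)
    gradb2 = np.sum(delta1, axis=0)
    delta2 = delta1.dot(W2.T)                            # dCE/dh
    delta3 = delta2 * h * (1.0 - h)                      # dCE/dz1, using sigma'(z) = sigma(z)(1 - sigma(z))
    gradW1 = x.T.dot(delta3)
    gradb1 = np.sum(delta3, axis=0)
    gradx = delta3.dot(W1.T)                             # dCE/dx asked for in 2(c)
    return cost, gradx, gradW1, gradb1, gradW2, gradb2

A numerical gradient checker such as the one requested in part (f) is a good way to verify every gradient returned here.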

3 word2vec (40 points + 5 bonus)

(a) (3 points) Assume you are given a predicted word vector v_c corresponding to the center word c for skip-gram, and word prediction is made with the softmax function found in word2vec models,

    \hat{y}_o = p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{W} \exp(u_w^\top v_c)}    (4)

where w denotes the w-th word and u_w (w = 1, ..., W) are the "output" word vectors for all words in the vocabulary. Assume cross entropy cost is applied to this prediction and that word o is the expected word (the o-th element of the one-hot label vector is one); derive the gradients with respect to v_c.

Hint: It will be helpful to use notation from question 2. For instance, let \hat{y} be the vector of softmax predictions for every word, y the expected (one-hot) word vector, and the loss function

    J_{softmax-CE}(o, v_c, U) = CE(y, \hat{y})    (5)

where U = [u_1, u_2, ..., u_W] is the matrix of all the output vectors. Make sure you state the orientation of your vectors and matrices.

Solution: Let \hat{y} be the column vector of the softmax predictions over words, and y the one-hot label, also a column vector. Then

    \frac{\partial J}{\partial v_c} = U^\top (\hat{y} - y).

Or equivalently,

    \frac{\partial J}{\partial v_c} = -u_o + \sum_{w=1}^{W} \hat{y}_w u_w

(b) (3 points) As in the previous part, derive the gradients for the "output" word vectors u_w (including u_o).

Solution:

    \frac{\partial J}{\partial U} = v_c (\hat{y} - y)^\top

Or equivalently,

    \frac{\partial J}{\partial u_w} = (\hat{y}_w - 1) v_c if w = o, and \hat{y}_w v_c otherwise.

(A short implementation sketch of the gradients from parts (a) and (b) follows part (c) below.)

(c) (6 points) Repeat parts (a) and (b) assuming we are using the negative sampling loss for the predicted vector v_c, and that the expected output word is o. Assume that K negative samples (words) are drawn, and that they are 1, ..., K, respectively, for simplicity of notation (o \notin \{1, ..., K\}). Again, for a given word o, denote its output vector as u_o. The negative sampling loss function in this case is

    J_{neg-sample}(o, v_c, U) = -\log(\sigma(u_o^\top v_c)) - \sum_{k=1}^{K} \log(\sigma(-u_k^\top v_c))    (6)
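As a sanity check for parts (a) and (b), here is a minimal sketch of the softmax-CE gradients in NumPy. It assumes v_c is a 1-D array of length d and that U is stored with the output vectors u_w as rows (a W x d array); the function name and this orientation are assumptions, not the starter code's interface.

import numpy as np

def softmax_ce_word2vec_grads(v_c, o, U):
    # v_c: (d,) center word vector; o: index of the expected word;
    # U: (W, d) matrix whose rows are the output vectors u_w.
    scores = U.dot(v_c)                      # u_w^T v_c for every word w
    scores = scores - np.max(scores)         # offset trick from Problem 1(a)
    y_hat = np.exp(scores) / np.sum(np.exp(scores))
    cost = -np.log(y_hat[o])                 # CE(y, y_hat) with one-hot y at index o

    diff = y_hat.copy()
    diff[o] -= 1.0                           # y_hat - y
    grad_vc = U.T.dot(diff)                  # sum_w (y_hat_w - y_w) u_w, as in part (a)
    grad_U = np.outer(diff, v_c)             # row w is (y_hat_w - y_w) v_c, as in part (b)
    return cost, grad_vc, grad_U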
