CS231n-Gradient of SVM and softmax

9 minute read

Published: November 12, 2019

SVM
Softmax
This is the notes (mainly focus on needed math) while completed Stanford CS231n assignment1.
I omitted most prerequisite definitions and knowledge, so if it is difficult to follow, please first take a look of course videos and notes.
The related lecture note is on the website1.
My codes are not tidy enough, so the following python codes are captured from website1 and website2.

SVM

SVM loss function $L$ Let $x_i=(x_{i,0},\dots,x_{i,D})^T$ be i-th row of $X$, which is $N\times D$ matrix, and $w_j=(w_{0,j},\dots,w_{D,j})^T$ be j-th column of $W$,a $D\times C$ matrix.

\[\begin{aligned} L_i &=\sum\limits_{j \neq y_i} \left[ \max \left(0,w_j^Tx_i-w^T_{y_i}x_i \right)+\Delta \right] \\ L &=\frac{1}{N}\sum_{i} L_i \end{aligned}\]

where $\Delta$ is a margin and $L$ is the loss function which we want to minimize.

First, finding minimum requires its negative gradient direction which we mainly calculate in the following.

\[\begin{aligned} \nabla_{w_j}L_i &=\mathbf{1}(w_j^Tx_i-w^T_{y_i}x_i+\Delta>0)x_i \quad \forall j\neq y_i \\ \nabla_{w_{y_i}}L_i &=-\sum_{k\neq j}\mathbf{1}(w_k^Tx_i-w^T_{y_i}x_i+\Delta>0)x_i \quad \text{for } y_i=j,k\neq j \\ \nabla_{w_{j}}L &=\frac{1}{N}\left(\sum_{i} \mathbf{1}(w_j^Tx_i-w^T_{y_i}x_i+\Delta>0)x_i-\sum_{\{i| y_i=j\}}\sum_{k \neq j}\mathbf{1}(w_k^Tx_i-w^T_{y_i}x_i+\Delta>0)x_i\right)\\ &=\frac{1}{N}\sum_{i} a_{i,j}x_i \end{aligned}\]

Hence, the gradient of L is

\[\begin{align} \frac{dL}{dW} &=\left( \begin{matrix} \nabla_{w_0}L,&\nabla_{w_1}L, & \cdots, & \nabla_{w_C}L \label{chain_rule} \end{matrix} \right) \end{align}\]

Here, we recall some linear algebra, the following $X_i$ are the row vectors.

\[\left(\begin{matrix} a_1,a_2,a_3\\ b_1,b_2,b_3 \end{matrix}\right) \left(\begin{matrix} X_1 \\ X_2 \\ X_3\end{matrix} \right)= \left( \begin{matrix} \sum_{r} a_r X_r \\ \sum_{r} b_r X_r \end{matrix} \right)\]

Therefore, $\eqref{chain_rule}$ can be represented as

\[\left(\frac{dL}{dW}\right)^T= \left( \begin{matrix} \nabla_{w_0}L \\ \nabla_{w_1}L \\ \vdots \\ \nabla_{w_C}L \end{matrix} \right) =\frac{1}{N}AX\]

where

\[A_{ij}=\mathbf{1}(w_j^Tx_i-w^T_{y_i}x_i+\Delta>0)-\mathbf 1(y_i=j)\left(\sum_{k\neq j}\mathbf{1}(w_k^Tx_i-w^T_{y_i}x_i+\Delta>0)\right)\]

The following is the code how to caculate $\frac{dL}{dW}$

def svm_loss_vectorized(W, X, y, reg):
  """
  Structured SVM loss function, vectorized implementation.

  Inputs and outputs are the same as svm_loss_naive.
  """
  
  loss = 0.0
  dW = np.zeros(W.shape) # initialize the gradient as zero

  #######################################
  # TODO:                                                                     #
  # Implement a vectorized version of the structured SVM loss, storing the    #
  # result in loss.                                                           #
  #######################################
  scores = X.dot(W)
  correct_class_score = scores[np.arange(X.shape[0]),y]
  correct_class_score = np.reshape(correct_class_score, (X.shape[0], -1))
  margin = scores - correct_class_score +1
  margin = np.maximum(0, margin)
  margin[np.arange(X.shape[0]),y] = 0
  loss = np.sum(margin) / X.shape[0]
  loss += 0.5 * reg * np.sum(W * W)
  print(loss)
  #######################################
  #                             END OF YOUR CODE                              #
  #######################################


  #######################################
  # TODO:                                                                     #
  # Implement a vectorized version of the gradient for the structured SVM     #
  # loss, storing the result in dW.                                           #
  #                                                                           #
  # Hint: Instead of computing the gradient from scratch, it may be easier    #
  # to reuse some of the intermediate values that you used to compute the     #
  # loss.                                                                     #
  #######################################
  margin[margin > 0] = 1
  row_sum = np.sum(margin, axis=1)                  # 1 by N
  margin[np.arange(X.shape[0]), y] = -row_sum  
  dW += np.dot(X.T, margin)/X.shape[0] + reg * W     # D by C
  print(dW)
  #######################################
  #                             END OF YOUR CODE                              #
  #######################################

  return loss, dW

Softmax

Softmax loss function L

\[\begin{aligned} L_i &=-\log\left(\frac{\exp(w_{y_i}^Tx_i)}{\sum_{j} \exp(w_j^Tx_i)}\right) \\ &=\log\left(\sum_{j} \exp(w_j^Tx_i)\right) -w_{y_i}^Tx_i \end{aligned}\]

Now, we try to calculate $\nabla_{w_j} L$

\[\begin{aligned} \nabla_{w_j} L_i&=\frac{\exp(w_j^Tx_i)}{\sum_{j} \exp(w_j^Tx_i)}x_i \quad \text{for } \quad j\neq y_i \\ \nabla_{w_j} L_i&=\frac{\exp(w_j^Tx_i)}{\sum_{j} \exp(w_j^Tx_i)}x_i-x_i \quad \text{for } \quad j= y_i \end{aligned}\]

Combined above two equations, we conclude that

\[\begin{aligned} \nabla_{w_j} L&=\frac{1}{N}\sum_{i} \left( \frac{\exp(w_j^Tx_i)}{\sum_{j}\exp(w_j^Tx_i)} -\mathbf 1(y_i=j) \right) x_i \end{aligned}\]

Similarly, we got

\[\left(\frac{dL}{dX}\right)^T= \left( \begin{matrix} \nabla_{w_0} L\\ \vdots \\ \nabla_{w_C} L \end{matrix} \right)= \frac{1}{N}AX\]

where

\[A_{ij}=\sum_{i} \left( \frac{\exp(w_j^Tx_i)}{\sum_{j}\exp(w_j^Tx_i)} -\mathbf 1(y_i=j) \right)\]

The following is the code how to calculate $\nabla_{w_j} L$

def softmax_loss_vectorized(W, X, y, reg):
  """
  Softmax loss function, vectorized version.

  Inputs and outputs are the same as softmax_loss_naive.
  """
  # Initialize the loss and gradient to zero.
  loss = 0.0
  dW = np.zeros_like(W)

  #######################################
  # TODO: Compute the softmax loss and its gradient using no explicit loops.  #
  # Store the loss in loss and the gradient in dW. If you are not careful     #
  # here, it is easy to run into numeric instability. Don't forget the        #
  # regularization!                                                           #
  #######################################
  num_train = X.shape[0]
  num_class = W.shape[1]
  scores = X.dot(W)  #N*C
  p = np.zeros_like(W)
  scores_max = np.reshape(np.max(scores, axis=1), (num_train, 1))
  p = np.exp(scores - scores_max) / np.sum(np.exp(scores - scores_max), axis=1, keepdims=True) # N*C
  loss_selector = np.zeros_like(p) #N*C
  loss_selector[np.arange(num_train),y] = 1.0
  loss = - np.sum(loss_selector.dot(np.log(p.T))[0,:])
  dW = -(loss_selector - p).T.dot(X)
  dW = dW.T
  loss /= num_train
  loss += 0.5 * reg * np.sum(W * W)
  dW /= num_train
  dW += reg * W
  #######################################
  #                          END OF YOUR CODE                                 #
  #######################################

  return loss, dW

Now, we have known $\frac{dL}{dW}$ of softmax loss function. How about $\frac{dL}{dX}$ ?

\[\begin{align}\label{softmax_dx} \nabla_{x_i} L&= \frac{1}{N}\left(\sum_{j}\frac{\exp(w_jx_i)}{\sum_{j} \exp(w_jx_i)}w_j-w_{y_i} \right) \end{align}\]

Here, we recall some linear algebra, the following $X_i$ are the row vectors.

\[\begin{align} \left(\begin{matrix} a_1,a_2,a_3-1\\ b_1,b_2,b_3 \end{matrix}\right) \left(\begin{matrix} X_1 \\ X_2 \\ X_3\end{matrix} \right)= \left( \begin{matrix} (\sum_{r} a_r X_r) - X_3\\ \sum_{r} b_r X_r \end{matrix} \right) \label{matrix_form} \end{align}\]

Therefore, comparing $\eqref{softmax_dx}$ and $\eqref{matrix_form}$, we get

\[\frac{dL}{dX}= \left( \begin{matrix} \nabla_{x_1} L \\ \nabla_{x_2} L \\ \vdots \\ \nabla_{x_N} L \end{matrix} \right) =\frac{1}{N}AW^T\]

where

\[\begin{aligned} A_{i,j}&=\frac{\exp(w_j^Tx_i)}{\sum_{j} \exp(w_jx_i)} \quad \text{for} \quad y_i\neq j \\ A_{i,j}&=\frac{\exp(w_j^Tx_i)}{\sum_{j} \exp(w_jx_i)}-1 \quad \text{for} \quad y_i= j \end{aligned}\]

#-------------------------
# when y_i=j , delta_{i,j}=-1
scores = X.dot(W1) + b1
R1 = np.maximum(scores, 0)
scores = R1.dot(W2) + b2  #N*C

exp_scores = np.exp(scores)
row_sum = exp_scores.sum(axis=1).reshape((N, 1))
#norm_scores=scalar matrix A
norm_scores = exp_scores / row_sum 
delta3[np.arange(N), y] -= 1
delta3 += A
#-------------------------

Full code to be seen as following:

import numpy as np
import matplotlib.pyplot as plt


class TwoLayerNet(object):
  """
  A two-layer fully-connected neural network. The net has an input dimension of
  N, a hidden layer dimension of H, and performs classification over C classes.
  We train the network with a softmax loss function and L2 regularization on the
  weight matrices. The network uses a ReLU nonlinearity after the first fully
  connected layer.

  In other words, the network has the following architecture:

  input - fully connected layer - ReLU - fully connected layer - softmax

  The outputs of the second fully-connected layer are the scores for each class.
  """

  def __init__(self, input_size, hidden_size, output_size, std=1e-4):
    """
    Initialize the model. Weights are initialized to small random values and
    biases are initialized to zero. Weights and biases are stored in the
    variable self.params, which is a dictionary with the following keys:

    W1: First layer weights; has shape (D, H)
    b1: First layer biases; has shape (H,)
    W2: Second layer weights; has shape (H, C)
    b2: Second layer biases; has shape (C,)

    Inputs:
    - input_size: The dimension D of the input data.
    - hidden_size: The number of neurons H in the hidden layer.
    - output_size: The number of classes C.
    """
    self.params = {}
    self.params['W1'] = std * np.random.randn(input_size, hidden_size)
    self.params['b1'] = np.zeros(hidden_size)
    self.params['W2'] = std * np.random.randn(hidden_size, output_size)
    self.params['b2'] = np.zeros(output_size)

  def loss(self, X, y=None, reg=0.0):
    """
    Compute the loss and gradients for a two layer fully connected neural
    network.

    Inputs:
    - X: Input data of shape (N, D). Each X[i] is a training sample.
    - y: Vector of training labels. y[i] is the label for X[i], and each y[i] is
      an integer in the range 0 <= y[i] < C. This parameter is optional; if it
      is not passed then we only return scores, and if it is passed then we
      instead return the loss and gradients.
    - reg: Regularization strength.

    Returns:
    If y is None, return a matrix scores of shape (N, C) where scores[i, c] is
    the score for class c on input X[i].

    If y is not None, instead return a tuple of:
    - loss: Loss (data loss and regularization loss) for this batch of training
      samples.
    - grads: Dictionary mapping parameter names to gradients of those parameters
      with respect to the loss function; has the same keys as self.params.
    """
    # Unpack variables from the params dictionary
    W1, b1 = self.params['W1'], self.params['b1']
    W2, b2 = self.params['W2'], self.params['b2']
    N, D = X.shape

    # Compute the forward pass
    scores = None
    #######################################
    # TODO: Perform the forward pass, computing the class scores for the input. #
    # Store the result in the scores variable, which should be an array of      #
    # shape (N, C).                                                             #
    #######################################
    scores = X.dot(W1) + b1
    R1 = np.maximum(scores, 0)
    scores = R1.dot(W2) + b2  #N*C
    #######################################
    #                              END OF YOUR CODE                             #
    #######################################
    
    # If the targets are not given then jump out, we're done
    if y is None:
      return scores

    # Compute the loss
    loss = None
    #######################################
    # TODO: Finish the forward pass, and compute the loss. This should include  #
    # both the data loss and L2 regularization for W1 and W2. Store the result  #
    # in the variable loss, which should be a scalar. Use the Softmax           #
    # classifier loss. So that your results match ours, multiply the            #
    # regularization loss by 0.5                                                #
    #######################################
    exp_scores = np.exp(scores)
    row_sum = exp_scores.sum(axis=1).reshape((N, 1))
    #norm_scores=scalar matrix A
    norm_scores = exp_scores / row_sum 
    data_loss = -1.0/N * np.log(norm_scores[np.arange(N), y]).sum()
    reg_loss = 0.5 * reg * (np.sum(W1*W1) + np.sum(W2*W2))
    loss = data_loss + reg_loss
    #######################################
    #                              END OF YOUR CODE                             #
    #######################################

    # Backward pass: compute gradients
    grads = {}
    #######################################
    # TODO: Compute the backward pass, computing the derivatives of the weights #
    # and biases. Store the results in the grads dictionary. For example,       #
    # grads['W1'] should store the gradient on W1, and be a matrix of same size #
    #######################################
    delta3 = np.zeros_like(norm_scores)    #delta3 = dloss / dz3
    #-------------------------
    # when y_i=j 
    # a{i,j}=a{i,j}-1
    delta3[np.arange(N), y] -= 1
    delta3 += norm_scores
    #-------------------------
    grads['W2'] = R1.T.dot(delta3) / N + reg * W2
    grads['b2'] = np.ones(N).dot(delta3) / N

    da2_dz2 = np.zeros_like(R1)
    da2_dz2[R1>0] = 1
    delta2 = delta3.dot(W2.T) * da2_dz2
    grads['W1'] = X.T.dot(delta2) / N + reg * W1
    grads['b1'] = np.ones(N).dot(delta2) / N
    #######################################
    #                              END OF YOUR CODE          #
    #######################################

    return loss, grads

Share on

Twitter Facebook Google+ LinkedIn

Elsa CHOI

CS231n-Gradient of SVM and softmax

SVM

Softmax

Share on

You May Also Enjoy

Stochastic Processs

紅綠燈

Track Probability

04747 Java