Lesson 1: Linear Classifiers and Gradient Descent¶
Components of a Parametric Learning Algorithm¶
Multiclass linear classifier:
- You can interpret this as 3 independent linear classifiers, one per class.
- Note that you can fold the bias term into the weight matrix by appending it as an extra column and adding a "1" at the end of the input. This reduces the computation to a single matrix-vector multiplication (see the sketch below). Why? Efficiency.
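A minimal NumPy sketch of this bias trick (the dimensions and values below are made up for illustration):

```python
import numpy as np

# Toy setup: 3 classes, 4-dimensional input
W = np.random.randn(3, 4)   # weight matrix, one row per class
b = np.random.randn(3)      # bias vector, one bias per class
x = np.random.randn(4)      # input vector (e.g. flattened pixel values)

# Separate weights and bias: s = Wx + b
scores_separate = W @ x + b

# Bias trick: append b as an extra column of W and a "1" to the input,
# so the scores come from a single matrix-vector multiplication.
W_aug = np.hstack([W, b[:, None]])   # shape (3, 5)
x_aug = np.append(x, 1.0)            # shape (5,)
scores_augmented = W_aug @ x_aug

assert np.allclose(scores_separate, scores_augmented)
```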
Interpreting a Linear Classifier¶
How we can interpret:
- Algebraic: We can view classification as a linear algebra computation. Given the image's pixel values flattened into a vector, each pixel is tied to a weight, plus a per-class bias.
- Visual: We can reshape each class's weight vector back into the shape of the image and visualize it. After optimization, this image looks like an average over all training inputs of that class.
- Geometric: We can view the linear classifier as a linear separator (a hyperplane) in a high-dimensional space.
Limitations:
- Not all classes are linearly separable.
- For example, XOR and the bullseye function are not linearly separable.
Converting Scores to Probabilities¶
Use softmax function to convert scores to probabilities:
$$
s = f(x,W) \\
P(Y=k|X=x)=\frac{e^{s_k}}{\sum_j e^{s_j}}
$$
Steps:
- Given score vector $s$ (score for each class)
- Feed this vector through the softmax function (second equation above).
- For each class $k$, exponentiate the score for class $k$ and divide it by the sum of the exponentials of all the scores.
- This normalizes all of the scores to values between zero and one that sum to one, which fits the definition of a probability (see the sketch below).
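A minimal NumPy sketch of the softmax conversion (the score values are made up; subtracting the max score is a standard trick for numerical stability and does not change the result):

```python
import numpy as np

def softmax(scores):
    shifted = scores - np.max(scores)      # for numerical stability
    exp_scores = np.exp(shifted)           # exponentiate each score
    return exp_scores / exp_scores.sum()   # normalize so the entries sum to one

s = np.array([3.2, 5.1, -1.7])   # hypothetical class scores
probs = softmax(s)
print(probs)        # each entry is between 0 and 1
print(probs.sum())  # 1.0
```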
Example of multiclass SVM loss:
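The first and second equations referenced below are the standard multiclass SVM loss (restated here, since the originals appear on the slide):
$$
L_i = \sum_{j \ne y_i}
\begin{cases}
0 & \text{if } s_{y_i} \ge s_j + 1 \\
s_j - s_{y_i} + 1 & \text{otherwise}
\end{cases}
\\
L_i = \sum_{j \ne y_i} \max(0,\; s_j - s_{y_i} + 1)
$$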
Takeaways:
- The loss is zero if the score for $y_i$ (the ground truth label) is greater than or equal to the score of every other (incorrect) class plus 1 (first equation).
- The goal is to maintain a margin between the score for the ground truth label and the scores of all the other possible categories.
- If the margin is not met, penalize the model by how far it falls short of that margin.
- For each class that is not the ground truth, take the max of 0 and the margin violation (second equation).
- Sum this over all classes that aren't the ground truth, penalizing the model whenever the ground truth score is not larger by the required margin.
- This type of loss is called a hinge loss.
Example of how this is calculated with image class prediction:
- The car score is highest, even though the ground truth is cat (wrong prediction).
- For each incorrect class, take the max of 0 and (incorrect class score - ground truth score + 1).
- In this case, the car score incurs a loss but the frog score does not (since it is lower than the cat score).
- The final loss is the sum of these terms (in this case, 2.9); see the sketch below.
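A minimal NumPy sketch of this calculation, assuming the scores for the cat image are 3.2 (cat), 5.1 (car), and -1.7 (frog); these exact numbers are an assumption, but they reproduce the 2.9 above:

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])   # assumed scores: cat, car, frog
correct = 0                           # index of the ground truth class (cat)

margins = np.maximum(0, scores - scores[correct] + 1)  # hinge term per class
margins[correct] = 0                  # the ground truth class contributes no loss
loss = margins.sum()
print(loss)  # ~2.9: car gives max(0, 5.1 - 3.2 + 1) = 2.9, frog gives 0
```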
Loss for classification¶
Cross-entropy and MLE:
- For multiclass loss, we use the negative log probability of the softmax output:
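Written out using the softmax probability defined above, the per-example cross-entropy loss is:
$$
L_i = -\log P(Y = y_i \mid X = x_i) = -\log \frac{e^{s_{y_i}}}{\sum_j e^{s_j}}
$$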
Example using the image classification task:
Takeaways:
- Given raw scores, we exponentiate them (orange box)
- Then normalize to get probabilities (green box)
- Take the negative log of the probability assigned to the class that matches the ground truth (e.g. "cat")
- We don't take the other probabilities into account for the loss. Why?
- Probabilities inherently induce competition: because they sum to one, raising one class's probability lowers the others.
- The optimization algorithm can therefore both boost the weights for "cat" and decrease the weights for the other classes (see the sketch below).
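A minimal NumPy sketch of this pipeline, reusing the assumed cat/car/frog scores from the hinge-loss example:

```python
import numpy as np

scores = np.array([3.2, 5.1, -1.7])   # assumed raw scores: cat, car, frog
correct = 0                           # ground truth is "cat"

exp_scores = np.exp(scores)               # exponentiate (orange box)
probs = exp_scores / exp_scores.sum()     # normalize to probabilities (green box)
loss = -np.log(probs[correct])            # negative log prob of the ground truth class
print(probs, loss)
```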
Loss for regression / probability¶
Regularization term¶
Regularization encourages choosing simpler models over more complex ones. It is applied as an additional term in the loss function.
L1 regularization adds the norm of the weight matrix to the loss:
$$
L_i = |y_i-Wx_i|^2+|W|
$$
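A minimal NumPy sketch of this regularized loss; the regularization strength `lam` is an assumed hyperparameter (setting it to 1 matches the formula as written):

```python
import numpy as np

def regularized_loss(W, x_i, y_i, lam=1.0):
    data_loss = np.sum((y_i - W @ x_i) ** 2)   # squared-error term |y_i - W x_i|^2
    reg_loss = lam * np.sum(np.abs(W))         # L1 penalty |W| (lam is an assumed knob)
    return data_loss + reg_loss

# Toy example: 3 outputs, 4-dimensional input
W = np.random.randn(3, 4)
x_i = np.random.randn(4)
y_i = np.random.randn(3)
print(regularized_loss(W, x_i, y_i))
```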