Deep Learning (Goodfellow, Bengio, Courville)

Ch.2 - Linear Algebra background

See link for refresher: https://tnakatani.github.io/moocs/coursera_linalg_for_ml/notes.html

Ch.3 - Probability / Info Theory background

Other references:

Why Probability?

Probability theory is used to account for uncertainty. Three main sources come from:

  1. Randomness in the system being modeled - Shuffling of cards, random numbers, subatomic particles all have some level of stochasticity.
  2. Incomplete data - Uncertain outcomes of actions we take, ie. choosing one of three door prizes.
  3. Incomplete modeling - Models may require discarding information, leading to loss in model precision. Examples are discretization, sampling, and compression of data.

Frequentist vs Bayesian Probability:

  • Frequentist - Probability of an event occurring over many tries. Based on observed data.
  • Bayesian - Probability of event given prior beliefs. You change this prior iteratively.

ELI5: "Frequentist looks at lots of tries to guess chance, while Bayesian starts with a guess and changes it using new info to guess chance."

Probability Distributions

Discrete variables and Probability Mass Functions (PMF)

Probability Mass Function ($P$) - Distribution over discrete random variables

Joint Probability Distribution $P(\text{x}=x, \text{y}=y)$ - PMF over more than one variable.

Uniform distribution - Given random variable $x$ and $k$ as its possible states:

$$ P(\text{x} = x_i) = \frac{1}{k} $$

Continuous variables and Probability Density Function (PDF)

Probability Density Function ($p(x)$) - Distribution over continuous random variables

Different property requirements compared to PMF:

  1. $\forall x \in \text{x} \text{ , } p(x) \geq 0$ - Note that $p(x)$ is not required to be $\leq 1$ (unlike a PMF)
  2. $\int p(x)dx=1$ - The probability that $x$ lies in some set $\mathbb{S}$ is the integral of $p(x)$ over $\mathbb{S}$; over the whole domain the integral must be 1.

Marginal Probability

Marginal Probability Distribution - Given set of random variables $\mathbb{S}$, $P$ of its subset.

Probability of $x$ given joint probability of $x$ and $y$:

  1. Use sum rule for discrete variables $$ \forall x \in x \text{ , } P(\text{x}=x) = \sum_y P(\text{x}=x, \text{y}=y) $$
  2. Use integration instead of sum for continuous variables: $$ p(x) = \int p(x,y)dy $$

Conditional Probability

Conditional Probability - $P$ of an event, given some other event happened.

  • Example: Probability that someone is German ($y$) given they speak German fluently ($x$) is high.

Formula:

$$ P(\text{y}=y|\text{x}=x)=\frac{P(\text{y}=y, \text{x}=x)}{P(\text{x}=x)} $$

Chain Rule of Conditional Probabilities

_Reference on chain rule from CS7641_

Any joint probability distributions can be broken down into smaller conditional probabilities. Example using variables $a,b,c$:

$$ \begin{align*} P(a,b,c) &= P(a|b,c)P(b,c) \\ P(b,c) &= P(b|c)P(c) \\ P(a,b,c) &= P(a|b,c)P(b|c)P(c) \end{align*} $$
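
A minimal numpy check of this factorization on a made-up 2x2x2 joint table (the `joint` values below are arbitrary and chosen only so they sum to 1):

# Check the chain rule P(a,b,c) = P(a|b,c) P(b|c) P(c) on a toy joint table.
import numpy as np

joint = np.array([[[0.10, 0.05], [0.15, 0.10]],
                  [[0.20, 0.05], [0.05, 0.30]]])   # axes: a, b, c

p_c = joint.sum(axis=(0, 1))          # P(c)
p_bc = joint.sum(axis=0)              # P(b, c)
p_b_given_c = p_bc / p_c              # P(b|c)
p_a_given_bc = joint / p_bc           # P(a|b,c)

reconstructed = p_a_given_bc * p_b_given_c * p_c   # broadcasts back to shape (2, 2, 2)
assert np.allclose(reconstructed, joint)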

Independence / Conditional Independence

Independent variables

Joint probabilities of independent variables is just the product of their probabilities.

$$ p(x,y) = p(x)p(y) $$

Conditionally independent variables

The joint probability of conditionally independent variables, given the shared variable $z$, is the product of each variable's probability conditioned on $z$.

$$ p(x,y|z) = p(x|z)p(y|z) $$

Notation shorthand:

  1. x, y independent $$x \bot y$$
  2. x, y conditionally independent given z: $$x \bot y|z$$

Expectation, Variance and Covariance

Expectation

Expectation - The average value of $f(x)$ we expect when $x$ is randomly drawn from distribution $P(x)$

  • Discrete: $\sum_x P(x)f(x)$
  • Continuous: $\int p(x)f(x)dx$
  • Shorthand for expectation: $\mathbb{E}[f(x)]$

Variance

Variance: How much the value of a random sample $f(x)$ fluctuates over a prob. distribution.

  • Smaller the variance, we expect all samples to be near $\mathbb{E}$
  • Square root of variance = standard deviation
$$ \text{Var}(f(x)) = \mathbb{E}\left[ \left( f(x) - \mathbb{E}[f(x)] \right)^2 \right] $$

Covariance: How much two values are linearly related to each other

  • If their values often go up and down at the same time, they have positive covariance.
  • If one goes up while the other goes down, they have negative covariance.
  • If they don't seem to follow any pattern together, their covariance is close to zero.

Formula: Expectation of the product of the deviations of $f(x)$ and $g(y)$ from their means (a quick numerical check follows below)

  • Like variance, but the deviations $f(x) - \mathbb{E}[f(x)]$ aren't squared $$ \text{Cov}(f(x),g(y)) = \mathbb{E}[(f(x)-\mathbb{E}[f(x)])\text{ }(g(y)-\mathbb{E}[g(y)])] $$
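
A quick numpy check of this formula against `np.cov` on synthetic data (the generated samples are only for illustration):

# Cov(x, y) = E[(x - E x)(y - E y)], compared against np.cov.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 0.7 * x + rng.normal(scale=0.5, size=10_000)   # positively correlated with x

cov_manual = np.mean((x - x.mean()) * (y - y.mean()))
cov_numpy = np.cov(x, y, bias=True)[0, 1]          # bias=True divides by N, matching the mean above
print(cov_manual, cov_numpy)                        # both close to 0.7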

Common Probability Distributions

Reference: ISYE6501 notes on distributions

  • Covers Bernoulli, Geometric, Poisson, Exponential, Weibull distributions

Multinoulli

Multinoulli distribution: $P$ of random variable $x$ with $k$ discrete states.

  • Similar to multinomial distribution (distribution over vectors in $\{0,\ldots,n\}^k$)

Gaussian (Normal Distribution)

Formula:

  1. $\mu \in \mathbb{R}$ - Mean
  2. $\sigma \in (0, \infty)$ - Standard deviation. $\sigma^2$ = variance.
  3. $x$ - Random variable
$$ \mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp \left (-\frac{1}{2\sigma^2}(x-\mu)^2 \right ) $$

(See book for formula for evaluating PDF)
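
A small sketch evaluating the density formula above directly and comparing it with `scipy.stats.norm` (the values of $\mu$, $\sigma$ and the test points are arbitrary):

# Gaussian PDF from the formula vs. scipy's implementation.
import numpy as np
from scipy.stats import norm

mu, sigma = 1.0, 2.0
x = np.linspace(-5, 7, 5)

pdf_manual = np.sqrt(1.0 / (2 * np.pi * sigma**2)) * np.exp(-(x - mu)**2 / (2 * sigma**2))
pdf_scipy = norm.pdf(x, loc=mu, scale=sigma)
print(np.allclose(pdf_manual, pdf_scipy))   # True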

Why is normal distribution useful?

  1. Central Limit Theorem - Sum of many independent random variables is approximately Gaussian
  2. Out of all distributions with the same variance, the Gaussian encodes the least prior knowledge, i.e. the maximum amount of uncertainty (entropy) over the real numbers.

Multivariate Normal Distribution - Same as Gaussian, but handles multiple variables

  • Imagine a multi-dimensional bell-shaped curve

Formula (new notation):

  1. $\Sigma$ - Covariance matrix of the distribution. Positive definite symmetric matrix
$$ \mathcal{N}(x; \mu, \Sigma) = \sqrt{\frac{1}{(2\pi)^n\det(\Sigma)}} \exp \left (-\frac{1}{2}(x-\mu)^\top \Sigma^{-1}(x - \mu) \right ) $$

(See book for formula for evaluating PDF)

Exponential and Laplace

Exponential Distribution: Sharp peak at $x=0$ (and zero probability for $x<0$), useful in deep learning.

$$ p(x; \lambda) = \lambda \, \mathbf{1}_{x \geq 0} \exp(-\lambda x) $$

Laplace Distribution: Similar to exponential, but can place peak at an arbitrary point $\mu$

$$ \text{Laplace}(x; \mu, \gamma) = \frac{1}{2 \gamma} \exp \left( - \frac{|x-\mu|}{\gamma} \right) $$

Dirac and Empirical Distribution

Dirac delta function ($\delta(x)$) - PDF. Places infinitely high peak at mean, 0 everywhere else.

Empirical Distribution - Used to represent the distribution of the training data; for continuous variables it places a Dirac delta (mass $\frac{1}{m}$) on each of the $m$ training examples.

Mixtures of Distributions

Mixture Distribution: Method to combine distributions.

  • How:
    1. First, choosing the component distribution (one of the $k$ distributions used).
    2. Sampling from that distribution.
  • The selection can be weighted, with each distribution carrying a weight.

Gaussian mixture model (GMM) - Mixture distribution where each component is a Gaussian with different parameters (mean + covariance)

  • Note to self: See CS7641 A3 report for more info on GMMs and its application). That implementation used the Gaussian Mixture Model from scikit-learn
  • Also refer to Expectation Maximization section in my CS7641 Supervised Learning notes

Useful Properties of Common Functions

Sigmoid

Formula: $$\sigma(x)=\frac{1}{1+\exp(-x)}$$

Properties:

  • $\sigma(x)$ goes from (0,1)
  • Squashes extreme values: saturates when its input is very positive or very negative

Softplus

Formula: $$\zeta(x) = \log(1+\exp(x))$$

Properties:

  • Softened version of ReLU

Other topics like Bayes' rule and the technical details of continuous variables are covered in the book

Information Theory

Reference: CS7641 notes

Shannon entropy

Shannon Entropy: The amount of uncertainty over the entire probability distribution.

Formula:

  • $I$ - self-information of an event: $I(x)=-\log P(x)$
$$ \begin{align*} H(\text{x}) &= \mathbb{E}_{x\sim P} [I(x)] \\ &= - \mathbb{E}_{x\sim P}[\log P(x)] \end{align*} $$
  • For a binary random variable, a nearly deterministic distribution (probability close to 0 or 1) has low entropy, while one close to uniform (probability near 0.5) has the highest entropy (see the sketch below).
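
A small sketch of this for a Bernoulli distribution, using `scipy.stats.entropy` (the probe values of p are arbitrary):

# Entropy of Bernoulli(p): highest at p = 0.5, lowest near p = 0 or 1.
import numpy as np
from scipy.stats import entropy

for p in [0.01, 0.25, 0.5, 0.75, 0.99]:
    h = entropy([p, 1 - p], base=2)    # Shannon entropy in bits
    print(f"p={p:.2f}  H={h:.3f} bits")
# p=0.50 gives the maximum of 1 bit; near-deterministic p gives entropy near 0.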

Kullback-Leibler (KL) Divergence

Kullback-Leibler Divergence: Measures how different two probability distributions are.

  • Zero if both distributions are the same ($P=Q$)
  • The higher the value, the greater the information loss.
  • Always non-negative
  • Not symmetric, ordering matters.
$$ \begin{align*} D_{KL}(P\|Q) &= \mathbb{E}_{x\sim P} \left[ \log \frac{P(x)}{Q(x)} \right] \\ &= \mathbb{E}_{x\sim P} [\log P(x) - \log Q(x)] \end{align*} $$

Cross-entropy

Cross-entropy's formula is similar to KL divergence but drops the $\log P(x)$ term; since that term does not depend on $Q$, minimizing cross-entropy with respect to $Q$ is the same as minimizing the KL divergence.

$$ H(P,Q) = -\mathbb{E}_{x\sim P} \log Q(x) $$
In [2]:
# KL Divergence example

from scipy.stats import entropy
from scipy.special import rel_entr, kl_div
import numpy as np

# Define two random probability distributions
# (samples from a Dirichlet always sum to 1)
P = np.random.dirichlet(np.ones(10), size=1)[0]
Q = np.random.dirichlet(np.ones(10), size=1)[0]
print(P)
print(Q)

print("entropy:", entropy(P,Q))
print("rel_entr (sum):", sum(rel_entr(P,Q)))
print("kl_div (sum):", sum(kl_div(P,Q)))
[0.06490079 0.01418882 0.19152213 0.18065904 0.0536485  0.30107653
 0.03642311 0.00056116 0.07041124 0.08660867]
[0.06189551 0.0753746  0.30711239 0.06874022 0.32688886 0.06758686
 0.0250988  0.00463188 0.037181   0.02548989]
entropy: 0.5796242308875327
rel_entr (sum): 0.5796242308875322
kl_div (sum): 0.5796242308875325
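
A small follow-up (assuming the `P`, `Q` and imports from the cell above) showing that cross-entropy decomposes as $H(P,Q) = H(P) + D_{KL}(P\|Q)$, which is why minimizing one with respect to $Q$ is the same as minimizing the other:

# Cross-entropy = entropy of P + KL divergence (all in nats here).
cross_entropy = -np.sum(P * np.log(Q))
decomposed = entropy(P) + sum(rel_entr(P, Q))   # H(P) + D_KL(P||Q)
print("cross-entropy:", cross_entropy)
print("H(P) + KL(P||Q):", decomposed)           # should match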

Structured Probabilistic Models

_Mostly covered in Module 1 notes (link)_

Ch.5 - Machine Learning Basics

_Mostly covered in CS7641 (link)_

Estimators, Bias, Variance

Significance: Statistical methods covered in this section provide us tools to figure out how to train models that generalize well

Estimators

Point Estimation: Finding the parameter value(s) that provide the single "best" prediction of a quantity of interest. This is the more general, high-level concept.

Function Estimation: Finding the most optimal $f(x)$ to predict $y$ while accounting for noise $\epsilon$

Example - For linear regression:

  • Finding the best set of parameters ($W$) is point estimation
  • Finding the best function $\hat{f}$ that accurately maps $x$ to $y$ is function estimation

Bias

Formula:

  • $\hat{\theta}_m$: Estimator
  • $\theta$: True parameter value
$$ \text{bias}(\hat{\theta}_m) = \mathbb{E}(\hat{\theta}_m) - \theta $$

Key terms:

  • Estimator is unbiased if bias is zero
  • Estimator is asymptotically unbiased if the estimator's bias goes away as the sample size increases. Going to infinity, the estimator parameter matches the population parameter.

Examples:

  • An estimator for the Bernoulli distribution is the mean of samples (see formula). This estimator is unbiased (Proven in the book, see there for details). $$ \hat{\theta}_m = \frac{1}{m}\sum^m_{i=1} x^i $$
  • An estimator for the Gaussian distribution's mean is known as the sample mean (see formula). This estimator is also unbiased. $$ \mu_m = \frac{1}{m}\sum^m_{i=1} x^i $$
  • The plain sample variance estimator is biased (see book for formula), but it can be made unbiased with Bessel's correction (divide by $m-1$ instead of $m$); a quick numerical check follows below.
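
A quick numerical check of the sample-variance bias using only numpy (the sample size and trial count are arbitrary):

# Plain sample variance (divide by m) is biased low; dividing by m - 1 is unbiased.
import numpy as np

rng = np.random.default_rng(42)
m, trials = 5, 200_000
samples = rng.normal(size=(trials, m))           # true variance = 1

biased = samples.var(axis=1, ddof=0).mean()      # divides by m
unbiased = samples.var(axis=1, ddof=1).mean()    # divides by m - 1
print(biased, unbiased)                           # ~0.8 (= (m-1)/m) vs ~1.0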

Key Idea: Unbiased estimators are desirable, but are not always the "best" estimators

Variance and Standard Error

Variance: Gives us an understanding of how strongly the estimator is affected by different samples of the data.

  • Goal: Models with low variance

Standard error (SE) of the mean is a way to quantify this uncertainty:

  • $\sigma^2$: True variance of samples
  • $m$: Sample size
  • Approximation gets better with larger $m$
$$ SE(\hat{\theta}_m) = \sqrt{\text{Var} \left[ \frac{1}{m} \sum^{m}_{i=1} x^{(i)} \right]} = \frac{\sigma}{\sqrt{m}} $$

We can then use $SE$ to get the 95% confidence interval (a small numerical example follows): $$ \text{CI@95} = \left(\hat{\theta}_m - 1.96\, SE(\hat{\theta}_m),\ \hat{\theta}_m + 1.96\, SE(\hat{\theta}_m)\right) $$
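A small numerical sketch of the SE and the 95% interval on synthetic data (the sample itself is made up for illustration):

# Standard error of the mean and the corresponding 95% confidence interval.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=400)    # true mean 3, true sigma 2

mean_hat = x.mean()
se = x.std(ddof=1) / np.sqrt(len(x))            # sigma_hat / sqrt(m)
ci_low, ci_high = mean_hat - 1.96 * se, mean_hat + 1.96 * se
print(f"mean={mean_hat:.3f}, SE={se:.3f}, 95% CI=({ci_low:.3f}, {ci_high:.3f})")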

Bias-Variance Tradeoff

Significance: Goal is to find the sweet spot where the model's SE is at the minimum without either underfitting (high bias) or overfitting (high variance).

  • _See ISYE6501 notes for more details link_

Consistency

Consistency: In the ML context, means the estimator converges to the true parameter value (so its bias disappears) as the size of the data grows.

Maximum Likelihood Estimation (MLE)

_See CS7641 notes (link-&-Maximum-Likelihood-Estimation))_

Key Ideas:

  • MLE ties back into information theory by using KL divergence as a measure of the difference between the distribution of the actual data ($P_{data}$) and the probability distribution of the model ($P_{model}$), which we want to minimize.
  • Minimizing the KL divergence is exactly what cross-entropy is doing
  • In software, we use the term "minimizing the cost function".
  • Authors seem to prefer using minimizing KL divergence over negative log-likelihood because KL has a known minimum value of zero.

Conditional Log-Likelihood: Generalizing MLE to estimate a conditional probability ($P(y|x; \theta)$):

Formula for conditional MLE:

  • X: all inputs
  • Y: all targets
$$ \theta_{ML} = \underset{\theta}{\operatorname{argmax}}\ P(Y|X; \theta) $$

(Book then ties MLE to MSE in regression and how they both reach the same optimum (p130))

Bayesian Statistics

_(See CS7641 notes: (link))_

Maximum A Posteriori (MAP) Estimation

_(See CS7641 notes: (link-&-Maximum-Likelihood-Estimation)))_

Challenges Motivating Deep Learning

Curse of Dimensionality

CoD: The more dimensions (ie. features) the data has, the more data we need to generalize accurately; the required amount grows exponentially.

_See CS7641 notes: (link)_

Significance: The exponential advantages offered by DL counter the CoD without requiring exponentially more data.

  • ie. Using DL, $O(2^k)$ functions can be modeled with $O(k)$ examples.
  • Why? Hierarchical feature learning, non-linear activation functions, etc.

Local Constancy and Smoothness Regularization

Smoothness Prior (aka Local Constancy Prior): A prior belief that the function to be learned should not change much within a small region, i.e. neighboring data points should have similar outputs.

  • The crux that traditional ML techniques rely on for model generalization.
  • Example: KNN outputs the mean of the k nearest neighbors of a data point. The number of distinguishable regions cannot grow faster than the number of training examples.
    • Checkerboard example: There is clearly a structure, but trad ML algos can only learn the structure as so far as there are training examples within each distinguishing region.

Significance:

  • Trad ML algos - Each training example only informs learner how to generalize in area immediately near the example.
  • Deep learning techqniques succeed in cases where the notion of smoothness priors fail

Manifold Learning

Manifold: A connected set of points, each associated with a neighborhood of nearby points.

  • So what: Training data can often lie in a low-dimensional manifold even if entire dataset in a higher dimension space (e.g. line of points mostly throughout, but intersections are in 2D - Think line of points in a "Figure 8").
  • Essentially a dimensionality reduction technique to make learning easier.

Manifold Learning: Focus on the most "interesting" subset of inputs, particularly those occurring on the manifold or where the manifold changes (e.g. 1D to 2D)

Manifold Hypothesis: The probability distribution over complex data (e.g. images, text, speech) is highly concentrated near a low-dimensional manifold.

Ch.6 - Deep Feedforward Networks

Gradient-based Learning

  • Unlike linear regression or SVMs, DNNs have no convergence guarantee.
  • DNN is sensitive to the initial parameters.

Max Likelihood

Why use it:

  • Removes burden of designing cost functions for each model.
  • Model $p(y|x)$ automatically determines the cost function: the negative log-likelihood $-\log p(y|x)$

Why use $\log$? It simplifies calculations: it transforms products (from the likelihood) into sums, which are easier to handle and less prone to numerical underflow.

Key point: The gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm

  • Why does the gradient need to be large? The cost function measures how far off the model's prediction is from the label; if its gradient is large, optimization can take larger steps toward the minimum and learn faster.
  • Small gradients mean the surface is essentially flat, ie. the parameters barely move toward the minimum. This often happens with non-linear functions that saturate.
  • Technique: negative log-likelihood pairs well with output units that use $\exp$ (which can saturate), because the $\log$ undoes the effect of the $\exp$.

MSE vs MAE vs Cross-entropy (Ref: ChatGPT):

  • MSE (Mean Squared Error): Measures the average of squared differences between predicted and actual values. Sensitive to large errors due to squaring.
  • MAE (Mean Absolute Error): Measures the average of absolute differences between predicted and actual values. Treats all errors equally regardless of size.
  • Cross-Entropy: Measures the dissimilarity between predicted probabilities and actual outcomes. Used in classification tasks and penalizes larger discrepancies more strongly.

Output Units

Output layer provides additional transformations to actually solve the task (ie. a set of class predictions)

Linear Units for Gaussian Output Distributions

  • No nonlinearity
  • Can learn the mean of a conditional Gaussian distribution

Sigmoid Units for Bernoulli Output Distributions

When to use: Probability distribution over a binary variable.

Key points:

  • Uses sigmoid activation function to convert output $z$ (from linear layer $z = w^T h+b$) into a probability.
  • This $z$ is called the $logit$.
  • Loss function explained in p.178
  • MLE is the preferred method for training sigmoid output units
  • Returns a single number

Softmax Units for Multinoulli Output Distributions

When to use: Probability over n discrete classes

Key points:

  • Returns a vector instead of a single number.
  • Each element of $y_i$ must be between (0,1) and entire vector sums to 1.

Formula:

$$ \text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)} $$

Reformulated to the log

$$ \log \text{softmax}(z)_i = z_i - \log \sum_j \exp(z_j) $$
  • First term ($z_i$) always contributes to cost function
  • First term pushes up, while second term pushes down.
  • If correct answer predicted, these will cancel out to zero, so no contribution to the loss and no weight change.
  • Softmax can saturate like sigmoid. It saturates when the differences between inputs become extreme (a numerically stable implementation is sketched below).
    • "In other words, as the input values become very large or very small, the corresponding output probabilities can become very close to 0 or 1, causing a loss of sensitivity and information in the model's predictions." (ChatGPT)
    • "When the differences between $z_i$ values are extreme, the softmax output $s_i$ for the class with the largest $z_i$ becomes close to 1, and the outputs for other classes become close to 0. This means the model becomes extremely confident about its prediction, potentially disregarding valuable information from other classes." (ChatGPT)

Hidden Units

Rectified Linear Units (ReLU) and Variants

Relu formula: $$ g(z) = \max\{0,z\} $$

Derivative: $$ \frac{d}{dz}g(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z < 0 \\ \text{undefined} & \text{if } z = 0 \end{cases} $$

  • Derivative is undefined at $z = 0$.
    • Why?: A function is differentiable at a point iff the left and right derivatives both exist and are equal.
    • For ReLU at $z=0$, the left derivative is 0 and the right derivative is 1, so the derivative is undefined there.
    • But, still works well for ML because we don't expect training to actually reach a point where the gradient is 0.
  • Tip: Initialize weights with small positive number so relu will still be active for most inputs in the training set to allow derivatives to pass through.

Variations on ReLU

Problem: ReLU's gradient becomes zero if negative or zero. This means corresponding nodes will not get updated during backprop.

Solution: Add a non-zero slope when $z<0$ (a small sketch of these variants follows the examples below).

Examples:

  1. Absolute value rectification: Fixes the slope to $-1$ for $z<0$, giving $g(z)=|z|$
  2. Leaky ReLU: Fixes slope to a small value
  3. Parametric ReLU: Treats slope as a learnable parameter.
  4. Maxout units: Piecewise idea. Divides $z$ into groups of $k$ values, then gets the max element of each group.
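
A small sketch of ReLU and two of the variants above (maxout omitted); the leaky slope of 0.01 is a common choice, not a value from the book:

# ReLU, leaky ReLU, absolute value rectification, and the ReLU subgradient.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):         # small fixed negative slope
    return np.where(z > 0, z, alpha * z)

def abs_rectification(z):              # slope fixed to -1 for z < 0, i.e. |z|
    return np.abs(z)

def relu_grad(z):
    # Subgradient: 1 for z > 0, 0 for z < 0; z == 0 handled by convention.
    return (z > 0).astype(float)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), leaky_relu(z), abs_rectification(z), relu_grad(z), sep="\n")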

Logistic Sigmoid and Hyperbolic Tangent

Sigmoid

$$g(z) = \sigma (z)$$
  • Saturates at extremes, making learning difficult.
  • Don't use in hidden units.

Hyperbolic tangent (tanh)

$$g(z) = \text{tanh}(z)$$
  • Typically works better than sigmoid.

Other Hidden Units

  1. Not using activations at all
  2. Softmax (use as switch?)
  3. Radial basis function (RBF) - difficult to optimize
  4. Softplus - Smoothed rectifier, smooth curve but empirically doesn't work as well as ReLU.
  5. Hard tanh - tanh + rectifier

Architecture Design

Considerations:

  1. Depth of network (number of layers)
  2. Width of network (number of nodes)

Key Point: Deeper models generalize better than shallower models with more parameters per layer (which are more likely to overfit).

  • Why? Deep layers can better learn hierarchical features via composition of simpler functions.

Universal Approximation Properties and Depth

Universal approximation theorem

By Hornik et al, 1989.

"...any continuous function on a closed and bounded subset of $\mathbb{R}^n$ is Borel measurable and therefore may be approximated by a neural network."

Key Point: Guarantees representation of a function, but not the ability to learn the function.

  • Optimization gets harder as network becomes complex.
  • Overfitting. Can represent training data well, but can it generalize?

Back-propagation and Other Differentiation Algorithms

Terminology clarification:

  1. Backpropagation - Computes the gradient
  2. Stochastic Gradient Descent - Performs learning using this gradient.

Computational Graphs

Components:

  • Nodes: Indicates a scalar, vector, matrix, tensor, etc
  • Operation: Simple function of one or more variables (e.g. $f(x,y)=2x+y$)

Chain Rule of Calculus

Chain rule example:

  • $x$: Real number
  • $f$ & $g$: Functions mapping a real number to a real number.
  • Suppose that $y = g(x)$, $z=f(g(x))$

Chain rule for a scalar:

$$ \frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} $$

Chain rule for a vector:

  • $x \in \mathbb{R}^m$
  • $y \in \mathbb{R}^n$
  • $g$ maps $\mathbb{R^m}$ to $\mathbb{R^n}$
  • $f$ maps $\mathbb{R^n}$ to $\mathbb{R}$
  • Suppose that $y = g(x)$, $z=f(g(x))$
$$ \frac{\partial z}{\partial x_i} = \sum_j \frac{\partial z}{\partial y_j} \frac{\partial y_j}{\partial x_i} $$

In vector notation:

$$ \nabla_x z = \left(\frac{\partial y}{\partial x}\right)^\top \nabla_y z $$

where $\frac{\partial y}{\partial x}$ is the $n \times m$ Jacobian matrix of $g$.

The same idea applies to tensors (see book)
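
A small numerical check of the vector chain rule above, comparing the Jacobian-transpose product against finite differences (the functions $g$ and $f$ are hand-picked toy examples):

# grad_x z = (dy/dx)^T grad_y z, verified against central finite differences.
import numpy as np

def g(x):                      # R^3 -> R^2
    return np.array([x[0] * x[1], np.sin(x[2])])

def f(y):                      # R^2 -> R
    return y[0] ** 2 + 3 * y[1]

x = np.array([1.0, 2.0, 0.5])
y = g(x)

jac_g = np.array([[x[1], x[0], 0.0],            # dy/dx, shape (2, 3)
                  [0.0, 0.0, np.cos(x[2])]])
grad_y = np.array([2 * y[0], 3.0])              # dz/dy

grad_x_chain = jac_g.T @ grad_y                 # chain rule

eps = 1e-6
grad_x_fd = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
                      for e in np.eye(3)])
print(np.allclose(grad_x_chain, grad_x_fd, atol=1e-5))   # True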

Recursively Applying the Chain Rule to Obtain Backprop

Key Idea: We can create a forward computation graph (G) and also create another graph for backprop (B).

  • B depends on G, 1 to 1 node mapping
  • B computes in the exact reverse order of G
  • Each node in B computes the derivative $\frac{\partial u^{(n)}}{\partial u^{(i)}}$ associated with forward graph node $u^{(i)}$.
  • Derivatives are calculated using the chain rule with respect to the scalar output $u^{(n)}$
  • The amount of compute required for backprop scales linearly with number of edges in G.

Back-propagation Computation in Fully Connected MLP

Algos:

  1. 6.3 - Forward propagation (single example with regularization at loss)
  2. 6.4 - Back-propagation (regularization applied at each layer "where needed")

Symbolic Representations

Example of symbol-to-symbol approach to computing derivatives:

Example: Backpropagation for MLP Training

Computational graph of forward propagation:

Differentiation outside DL community

  1. Reverse mode accumulation - includes backprop.
    • Good to use when the number of inputs is greater than the number of outputs
  2. Forward mode accumulation - include forward mode differentiation.
    • Good to use when the number of inputs is smaller than the number of outputs.

Higher-order derivatives

  • Can use Hessian, but generally becomes huge. Not advised.
  • Could use Krylov method - set of iterative transforms. Used to get Hessian

History

Notable paradigm shifts:

  1. Parallel distributed processing - allowed backpropagation to be feasible.
  2. Larger datasets - allowed DL models to better generalize.
  3. Replace MSE with cross-entropy loss functions - improved performance of models with sigmoid/softmax outputs.
  4. Replace sigmoid with ReLU - Works better with big models compared to sigmoids.

Chapter 7 - Regularization for Deep Learning

(Note: I am going to change the approach to note taking, moving away from excess note taking to emphasizing knowledge synthesis)

Parameter Norm Penalties

L2 Regularization

L2 regularization adds a penalty term on the weights to the objective. The gradient of the regularized objective becomes:

$$ \nabla_W \tilde{J}(W) = \alpha W + \nabla_W J(W) $$

As $\alpha$ approaches zero, no regularization takes place.

L1 Regularization

L1 regularization is similar to L2 but applies a sign function to the weights in the regularization term:

$$ \nabla_W \tilde{J}(W) = \alpha \, \text{sign}(W) + \nabla_W J(W) $$

This results in sparse parameters (ie. weights), since many are driven to zero. This is often useful for feature selection, as the zeroed params can be discarded (ie. LASSO). A small sketch of both penalized gradient updates follows.
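
A minimal sketch of both penalized gradient updates; `alpha`, the learning rate, and the toy weights/gradient are placeholders:

# Gradient of the regularized objective for L2 (weight decay) and L1 (sparsity).
import numpy as np

def l2_grad(W, grad_J, alpha):
    # gradient of J(W) + (alpha/2)||W||^2  ->  grad_J + alpha * W
    return grad_J + alpha * W

def l1_grad(W, grad_J, alpha):
    # gradient of J(W) + alpha ||W||_1  ->  grad_J + alpha * sign(W)
    return grad_J + alpha * np.sign(W)

W = np.array([0.5, -0.2, 0.0, 1.5])
grad_J = np.array([0.1, 0.1, 0.1, 0.1])
lr, alpha = 0.1, 0.01

print(W - lr * l2_grad(W, grad_J, alpha))   # one L2-regularized update
print(W - lr * l1_grad(W, grad_J, alpha))   # one L1-regularized update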

Norm Penalties as Constrained Optimization

Regularization puts constraints on the parameter weights: the larger the penalty coefficient, the smaller the constraint region, and vice versa. We can also set constraints explicitly rather than including them as a penalty; this ties into the idea of reprojection (take an SGD step, then project the weights back onto the closest point that satisfies the constraints). Research shows that this can be a more stable method of optimization than applying penalties to the whole parameter vector.

Data Augmentation

Adding modulations and noise to existing data creates new data which can be used as additional training data for a NN. Techniques like image transformations (cropping, rotating, shearing) and noise injection are common in image recognition.

Noise Robustness

The section covers adding noise to the weights and to the output targets. For weights, we add noise with very small variance. This ties into the Bayesian notion that the weights themselves are uncertain, and injecting noise reflects that uncertainty. Adding small noise also encourages learning to find regions of parameter space that are relatively robust to such perturbations. For output targets, label smoothing replaces the hard classification targets (ie. 0 and 1) with slightly softened values. This makes sense because we don't expect predicted probabilities to ever reach exactly 0 or 1, so smoothing prevents the model from overfitting by applying larger and larger weights trying to get there.

Semi-supervised Learning

Semi-supervised learning combines the use of unlabeled (unsupervised) and labeled (supervised) examples. The main idea is that unsupervised learning can provide a representation that groups similar examples together. You can combine both by using the generative model ($P(x)$, unsupervised) with the discriminative ($P(y|x)$). Generative provides a prior belief to the supervised learning problem.

Multi-task Learning

The core concept of multi-task learning is that by training a model on different (but related) tasks, it is able to generalize better. The way that it is done is by having a shared "generic" parameter layer, but then downstream parameters intended for different tasks (ie. translation + syntax parsing).

Early-stopping

Early stopping is a simple and effective regularization method to prevent overfitting: stop training once the validation error has not improved for more than N steps (the "patience"). It is unobtrusive since we don't need to change the loss function or the network architecture. After training stops, we can reincorporate the validation data and train again up to the point where early stopping occurred. The book shows that early stopping's regularization effect when fitting a quadratic cost with linear weights is the same as L2 regularization (p.243). A minimal sketch of the loop follows.
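
A minimal sketch of the early-stopping loop; `train_step` and `validation_error` are placeholders for whatever the actual model uses:

# Early stopping with a patience of N steps; keeps the best model seen so far.
import copy

def train_with_early_stopping(model, train_step, validation_error, patience=10, max_steps=10_000):
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    steps_since_improvement = 0
    for step in range(max_steps):
        train_step(model)
        err = validation_error(model)
        if err < best_error:
            best_error = err
            best_model = copy.deepcopy(model)
            steps_since_improvement = 0
        else:
            steps_since_improvement += 1
            if steps_since_improvement >= patience:
                break      # validation error has not improved for `patience` steps
    return best_model, best_error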

Parameter Tying and Parameter Sharing

Parameter sharing is based on the idea that a model trained on similar tasks should share similar parameters. The technique then is to force the weights of various models (or model components) to share the same set of weights. This is effective in reducing memory because we don't need to have unique weights across all parameters. This is effectively used in CNNs with images where model params are trained on different parts of the image but the params are shared. Reduction in memory allows more room for larger network sizes to boot.

Sparse Representations

"Sparsity" in ML terms generally mean some component of the model / data having a lot of zeros. While L1 regularization makes the weights to be sparse, sparse representations takes the input data and applies a function to make it sparse.

  • One thing I don't understand is how they're using the term "activation", ie. "place a penalty on the activations of the units in a neural network".

Bagging

Given several similar NN models trained on similar data, each model will make errors that are partly independent of each other. By averaging the outputs of each model, we can benefit from the generalization power of each. This is done by creating K datasets by sampling from the original dataset with replacement; model $i$ is then trained on dataset $i$, leading to a set of models trained on different distributions of the data.

Dropout

Dropout is a technique in which a random subset of units is masked (ie. multiplied by a 0/1 mask) so that they don't contribute during that update. The rate at which units are dropped is a hyperparameter (the book recommends keeping units with probability 0.8 for input units and 0.5 for hidden units). Because the mask is resampled every step, no extracted feature is permanently destroyed; the model simply becomes less reliant on any single, specifically effective unit (ie. one that identifies a nose really well). This makes it more robust and generalizable. It is very easy to implement and can be applied to almost any NN. However, we need to increase the model size to offset the regularization and also need to train for more iterations. The book ties the idea of dropout to bagging: bagging ensembles inferences over several separately trained models, whereas dropout approximates an ensemble over the exponentially many subnetworks that share weights.

Some technical considerations. When running the forward pass, we scale the non-masked units by 1/(1 - dropout rate) to account for the masked ones. This ensures that the expected value of the activations remains the same. The same mask (and scaling) is applied to the gradients during back-propagation; at inference time no units are dropped. A minimal sketch follows.
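
A minimal inverted-dropout sketch under those assumptions (mask and scale at training time, identity at inference):

# Inverted dropout: mask activations and rescale survivors by 1/keep_prob.
import numpy as np

def dropout_forward(h, keep_prob=0.5, training=True, rng=None):
    if not training:
        return h, None                 # inference: no masking, no scaling
    if rng is None:
        rng = np.random.default_rng()
    mask = (rng.random(h.shape) < keep_prob) / keep_prob   # entries are 0 or 1/keep_prob
    return h * mask, mask

def dropout_backward(grad_out, mask):
    return grad_out * mask             # same mask (and scaling) applied to the gradient

h = np.ones((2, 4))
out, mask = dropout_forward(h, keep_prob=0.8)
print(out)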

Adversarial Training

Adversarial training involves finding examples that are very close (input space-wise) to examples of one label but actually belong to another label. These "hard" examples are deliberately collected / synthesized and provided as training data so that the model becomes more robust near these "manifold" boundaries (particularly when a network learns a nearly linear function).

Tangent Distance / Prop / Manifold Tangent Classifier

Chapter 8 - Optimization for Training Deep Models

How Learning Differs from Pure Optimization

Empirical Risk Minimization

Empirical risk minimization is a way in which we turn an ML problem into an optimization problem. Risk is the expected generalization error, which we want to minimize. How? By replacing the true distribution $p(x,y)$ with the empirical distribution $\hat{p}(x,y)$ and minimizing this empirical risk.

  • If we knew the true distribution, the problem can be solved via optimization algorithms.

Surrogate Loss Functions and Early Stopping

Surrogate loss functions are used when we can't optimize efficiently (e.g. 0-1 loss). Instead we use a surrogate like negative log-likelihood which is differentiable.

Batch and Minibatch Algorithms

Batch gradient descent is where the network goes through the forward/backward pass using the entire dataset. This is typically computationally intractable. The book also notes that larger batches yield less-than-linear returns.

Stochastic (or online) GD is where the network goes through f/b using only one example. This results in noisy updates but this can have a regularizing effect, meaning it can push the model out of local minima. On the other hand, it doesn't take advantage of parallel processing and thus inefficient.

Minibatch GD is where the network takes N samples from the empirical distribution and runs f/b pass. The advantage is that it can benefit from parallel processing without blowing up memory (like batch), and is more stable than stochastic GD.

On sampling: Batches must be randomly sampled, otherwise bias may enter from the ordering of the data. Typically with large datasets, just shuffling once is enough.

Challenges in Neural Network Optimization

Ill-conditioning

Ill-conditioning refers to matrices (e.g. the Hessian) for which a small change in the input leads to a disproportionately large change in the output. In NN terms, small changes to the weights could lead to very different losses.

Local Minima

Neural networks are nonconvex functions, which can have many local minima. However, the book says this is not necessarily a problem. Because of model non-identifiability (you can't rule out all but one setting of the model's parameters), many equivalent models with different parameters have the same cost, so their local minima are equivalent.

The real problem would be many local minima with high cost (shallow minima) while the true minimum is much deeper. However, the literature seems to show that sufficiently large networks usually do not get stuck in such shallow minima.

Plateaus, Saddle Points, and Other Flat Regions

Saddle points are points in a high-dimensional space with a zero gradient. Historically, ML considered local minima the big problem, but for large networks (ie. high-dimensional parameter spaces) saddle points are much more common than local minima.

Why? Gradient descent methods seem to be able to escape saddle points in high-dimensional space. One explanation is that some amount of noise pushes them off the zero-gradient point so they can continue downhill.

Plateaus typically slow down gradient descent.

Good explanation with visuals: https://www.youtube.com/watch?v=fODpu1-lNTw

Cliffs and Exploding Gradients

Cliffs are regions where the gradient changes very sharply. They can be dangerous because taking a large step from there might propel the parameters to a very different spot in parameter space. Gradient clipping prevents this by capping the size of the step (a small sketch follows).
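
A minimal gradient-norm clipping sketch (the threshold and the toy gradient are arbitrary):

# If the gradient norm exceeds a threshold, rescale it so the step stays bounded.
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([30.0, -40.0])              # norm 50, far above the threshold
print(clip_gradient(g, max_norm=5.0))    # rescaled to norm 5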

Long Term Dependencies

Deep computational graphs can pose a problem because the weight matrix essentially is exponentiated by $t$ steps, ie. $W^t$. This causes vanishing gradient (don't know which way to take step) or exploding gradient (unstable steps).

Deep feedforward networks use a different matrix $W$ at each layer, so this typically does not occur there. RNNs reuse the same $W$ at every time step and need to deal with this issue much more.

Skipped:

  • Inexact gradients, p.282
  • Poor correspondence between Local and Global structure, p.283

Basic Algorithms

Stochastic Gradient Descent

Stochastic gradient descent takes mini-batches and calculates gradients over them. It inherently has some noise from the random sampling. The learning rate is a crucial tuning component, and choosing it is more of an art than a science.

Momentum

Momentum keeps a velocity $v$: an exponentially decaying moving average of past negative gradients, scaled by a factor $\alpha$. It is used to speed up learning and smooth out the variance of SGD. When previous gradients point in the same direction, the velocity builds up and we can take larger steps. Common values of $\alpha$ are 0.5, 0.9 and 0.99.

Nesterov momentum is a variation where the gradient is evaluated after the current velocity is applied, which acts as a corrective factor at each step. In the stochastic (SGD) case, however, this method doesn't improve the rate of convergence. A sketch of both updates follows.
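
A minimal sketch of both updates on a toy quadratic objective (the learning rate, $\alpha$ and the objective are illustrative choices, not values from the book):

# SGD with momentum vs. Nesterov momentum, shown as single-step update functions.
import numpy as np

def momentum_step(theta, v, grad_fn, lr=0.01, alpha=0.9):
    v = alpha * v - lr * grad_fn(theta)              # velocity: decaying sum of past gradients
    return theta + v, v

def nesterov_step(theta, v, grad_fn, lr=0.01, alpha=0.9):
    v = alpha * v - lr * grad_fn(theta + alpha * v)  # gradient evaluated after the momentum jump
    return theta + v, v

# Toy objective f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(200):
    theta, v = momentum_step(theta, v, lambda t: t)
print(theta)   # approaches the minimum at the origin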

Parameter Initialization Strategies

Good initialization makes convergence faster, while bad ones slow things down.

The conventional approach is to set the weights by randomly sampling from a Gaussian distribution while keeping the variance small. Biases are conventionally set to zero. There are cases where deviating from this makes sense (more in the book), for example setting a bias to 1 to get a gating effect in an LSTM network.

Weight initialization can also be treated as a hyperparameter that we can tune:

"A good rule of thumb for choosing the initial scales is to look at the range or standard deviation of activations or gradients on a single minibatch of data."

Algorithms with Adaptive Learning Rates

AdaGrad (Duchi et al. 2011)

AdaGrad adapts the LR per parameter by scaling it inversely proportional to the square root of the sum of that parameter's historical squared gradient values. Parameters with large gradients get a rapid decrease in LR, while those with small gradients get a slower decrease. Works well in a gently sloped parameter space. The problem is that the accumulated history can make the learning rate too small prematurely.

RMSProp (Hinton, 2012)

Takes AdaGrad but replaces the accumulated sum of squared gradients with an exponentially weighted moving average. Adds a new hyperparameter (the decay rate) that controls the effective window of the moving average.

Adam (Kingma and Ba, 2014)

Adam ("adaptive moments") is a variant on RMSProp. Difference is that it includes momentum (first and second order). So in addition to magnitude to past gradients, it also considers past momentum. Also includes bias correction for each step to account for moving average.

Choosing the Right Optimization Algorithm

Schaul et al. benchmarked optimizers on a variety of tasks (arxiv), and the ones covered in the book are among the recommended ones. In practice it mostly comes down to whichever one the user is most comfortable tuning.

From lecture: Plain SGD + momentum can generalize better than adaptive methods but requires more tuning. See Luo et al. 2019 for more information.

Approximate Second-Order Methods

Second-order methods use second derivatives to improve optimization.

(Skipped. Starts at 8.2, p.302)

Optimization Strategies and Meta-Algorithms

Batch Normalization

Ioffe and Szegedy, 2015. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".

Consider a batch of activations at some layer. Make each dimension unit gaussian: $$ \hat{x}^{k} = \frac{x^k - E[x^k]}{\sqrt{Var[x^k]}} $$

  • Differentiable
  • Can be applied before or after nonlinearity.
  • Note: At test time we can take running average taken during training. This allows evaluation on a single example.

We can rescale or shift the normalized range (via the learnable $\gamma$ and $\beta$) if we want.

The whole thing:

Backpropagation through BatchNorm in the Second Layer:

Recall that BatchNorm applies the following transformations:

Z₂_normalized = (Z₂ - μ₂) / σ₂
Z₂_batchnorm = γ₂ * Z₂_normalized + β₂

where μ₂ and σ₂ are the mini-batch mean and standard deviation of Z₂. During backprop, the layer receives dLoss/dZ₂_batchnorm from the layer above.

The gradients for the learnable parameters γ₂ and β₂ are then (both summed across the mini-batch dimension):

dLoss/dγ₂ = sum(dLoss/dZ₂_batchnorm * Z₂_normalized)
dLoss/dβ₂ = sum(dLoss/dZ₂_batchnorm)
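
A minimal numpy sketch of the forward pass and these parameter gradients (the shapes, `eps`, and the toy inputs are illustrative):

# BatchNorm for one layer: normalize over the batch, scale/shift with gamma/beta,
# and compute the gradients of gamma and beta given the upstream gradient.
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-5):
    mu = Z.mean(axis=0)
    var = Z.var(axis=0)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta, Z_norm

def batchnorm_param_grads(dout, Z_norm):
    dgamma = np.sum(dout * Z_norm, axis=0)   # sum over the mini-batch dimension
    dbeta = np.sum(dout, axis=0)
    return dgamma, dbeta

Z = np.random.randn(8, 4)                    # (batch, features)
gamma, beta = np.ones(4), np.zeros(4)
out, Z_norm = batchnorm_forward(Z, gamma, beta)
dgamma, dbeta = batchnorm_param_grads(np.ones_like(out), Z_norm)
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))   # ~0 mean, ~1 std per feature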

Skipped

  • Coordinate Descent p.312
  • Polyak Averaging p.313

Supervised Pretraining

Supervised pretraining is a method where we train simpler models first before building the final model for a complex task. It is also called "greedy supervised pretraining", where 'greedy' means breaking a problem up into subcomponents and solving each one in isolation for its best solution, rather than optimizing everything jointly.

This can be done through various methods. You can pretrain each hidden layer as a shallow network, each taking the output of the previous one. You can also feed the outputs of a previously trained, simpler model as extra inputs (alongside the original inputs) to the next model.

Fitnets (arxiv) uses a teacher-student knowledge distillation approach, where the teacher is shallow and wide while student is deep and thin. Student must predict outputs for original inputs but also predict values of middle layer of the teacher network.

Designing Models to Aid Optimization

Most advancements in deep learning are not new optimization algorithms. Instead, they are newer model types that make optimization easier for a given task, ie. better gradient flow. Examples are ReLU, LSTM, RNN.

Continuation methods is a family of strategies to ensure that models are initialized at well-behaved regions of parameter space.

Curriculum learning is a strategy where simpler, prototypical examples of the task are presented first and gradually more difficult examples are presented as inputs.

Chapter 9 - Convolutional Networks

"Convultional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers"

The Convolution Operation