See link for refresher: https://tnakatani.github.io/moocs/coursera_linalg_for_ml/notes.html
Other references:
Probability theory is used to account for uncertainty. Uncertainty comes from three main sources:
Frequentist vs Bayesian Probability:
ELI5: "Frequentist looks at lots of tries to guess chance, while Bayesian starts with a guess and changes it using new info to guess chance."
Probability Mass Function ($P$) - Distribution over discrete random variables
Joint Probability Distribution $P(\text{x}=x, \text{y}=y)$ - PMF over more than one variable.
Uniform distribution - Given random variable $x$ and $k$ as its possible states:
$$ P(\text{x} = x_i) = \frac{1}{k} $$
Probability Density Function ($p(x)$) - Distribution over continuous random variables
Different property requirements compared to PMF:
Marginal Probability Distribution - Given a set of random variables $\mathbb{S}$, the probability distribution over a subset of them.
Probability of $x$ given joint probability of $x$ and $y$:
Conditional Probability - $P$ of an event, given some other event happened.
Formula:
$$ P(\text{y}=y|\text{x}=x)=\frac{P(\text{y}=y, \text{x}=x)}{P(\text{x}=x)} $$
_Reference on chain rule from CS7641_
Any joint probability distribution can be broken down into smaller conditional probabilities. Example using variables $a,b,c$:
$$ \begin{align*} P(a,b,c) &= P(a|b,c)P(b,c) \\ P(b,c) &= P(b|c)P(c) \\ P(a,b,c) &= P(a|b,c)P(b|c)P(c) \end{align*} $$
The joint probability of independent variables is just the product of their probabilities.
$$ p(x,y) = p(x)p(y) $$
The joint probability of conditionally independent variables is the product of each conditional probability given the shared conditioning variable.
$$ p(x,y|z) = p(x|z) p(y|z) $$
Notation shorthand:
Expectation - The mean value of $f(x)$ we expect when $x$ is drawn from the distribution $P(x)$
Variance: How much the value of a random sample $f(x)$ fluctuates over a prob. distribution.
Covariance: How much two values are linearly related to each other
Formula: expectation of the product of the deviations of $x$ and $y$ from their means: $$ \text{Cov}(x, y) = \mathbb{E}\big[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\big] $$
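A small numpy check of this definition (the variables `x` and `y` below are made-up samples, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 2 * x + rng.normal(size=10_000)   # y is linearly related to x

# Covariance as the expectation of the product of deviations from the mean
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))
print(cov_manual)
print(np.cov(x, y, bias=True)[0, 1])  # should closely match
```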
Reference: ISYE6501 notes on distributions
Multinoulli distribution: $P$ of random variable $x$ with $k$ discrete states.
Formula:
(See book for formula for evaluating PDF)
Why is normal distribution useful?
Multivariate Normal Distribution - Same as Gaussian, but handles multiple variables
Formula (new notation):
(See book for formula for evaluating PDF)
Exponential Distribution: Sharp peak at $x=0$, useful for deep learning.
$$ p(x; \lambda) = \lambda \exp(-\lambda x) $$
Laplace Distribution: Similar to the exponential, but can place its peak at an arbitrary point $\mu$
$$ \text{Laplace}(x; \mu, \gamma) = \frac{1}{2 \gamma} \exp \left( - \frac{|x-\mu|}{\gamma} \right) $$
Dirac delta function ($\delta(x)$) - A PDF that places an infinitely high, infinitely narrow peak at a single point (e.g. the mean) and is 0 everywhere else.
Empirical Distribution - Used to represent the distribution observed in the training data over continuous variables, placing a Dirac delta (with mass $1/m$) on each of the $m$ training examples.
Mixture Distribution: A method of combining several component distributions into one.
Gaussian mixture model (GMM) - Mixture distribution where each component is a Gaussian with different parameters (mean + covariance)
Logistic sigmoid formula: $$\sigma(x)=\frac{1}{1+\exp(-x)}$$
Properties:
Softplus formula: $$\log(1+\exp(x))$$
Properties:
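A minimal numpy sketch of the two functions above (logistic sigmoid and softplus); using `np.logaddexp` for softplus is my own choice to avoid overflow, not something prescribed by the book:

```python
import numpy as np

def sigmoid(x):
    # logistic sigmoid: 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    # softplus: log(1 + exp(x)); logaddexp(0, x) computes it without overflow
    return np.logaddexp(0.0, x)

z = np.array([-10.0, 0.0, 10.0])
print(sigmoid(z))   # ~[0, 0.5, 1]
print(softplus(z))  # smooth approximation of max(0, x)
```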
Other topics like Bayes' rule and technical details of continuous variables are covered in the book
Reference: CS7641 notes
Shannon Entropy: The amount of uncertainty over the entire probability distribution.
Formula: $$ H(\text{x}) = -\mathbb{E}_{\text{x} \sim P}[\log P(x)] $$
Kullback-Leibler Divergence: Measures how different two probability distributions are (not a true distance, since it is not symmetric).
Cross-entropy's formula is similar to KL divergence but omits the left term (the one involving only $P$), so minimizing it with respect to $Q$ is equivalent to minimizing the KL divergence.
$$ H(P,Q) = -\mathbb{E}_{x \sim P} \log Q(x) $$
# KL Divergence example
from scipy.stats import entropy
from scipy.special import rel_entr, kl_div
import numpy as np
# Define two probability distributions
# dirichlet sums to 1
P = np.random.dirichlet(np.ones(10).T,size=1)[0]
Q = np.random.dirichlet(np.ones(10).T,size=1)[0]
print(P)
print(Q)
# all three calls compute D_KL(P || Q); the sums should agree with entropy(P, Q)
print("entropy:", entropy(P,Q))
print("rel_entr (sum):", sum(rel_entr(P,Q)))
print("kl_div (sum):", sum(kl_div(P,Q)))
_Mostly covered in CS7641 (link)_
Significance: Statistical methods covered in this section provide us tools to figure out how to train models that generalize well
Point Estimation: Finding the parameter value(s) that provide the single "best" prediction. Formally, a point estimator is just a function of the data.
Function Estimation: Finding the optimal $f(x)$ to predict $y$ while accounting for noise $\epsilon$
Example - For linear regression:
Formula:
Key terms:
Examples:
Key Idea: Unbiased estimators are desirable, but are not always the "best" estimators
Variance: Gives us an understanding of how strongly the estimator is affected by different samples of the data.
Standard error (SE) of the mean is a way to quantify this uncertainty: $$ SE(\hat{\theta}_m) = \frac{\sigma}{\sqrt{m}} $$ where $\sigma$ is the standard deviation of the samples and $m$ is the sample size.
We can then use $SE$ to get the 95% confidence interval: $$ \text{CI@95} = (\hat{\theta}_m - 1.96\, SE(\hat{\theta}_m),\ \hat{\theta}_m + 1.96\, SE(\hat{\theta}_m)) $$
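A quick numpy sketch of the standard error of the mean and the 95% interval above, on a made-up sample:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=100)  # made-up sample

mean_hat = samples.mean()
se = samples.std(ddof=1) / np.sqrt(len(samples))    # standard error of the mean
ci = (mean_hat - 1.96 * se, mean_hat + 1.96 * se)   # 95% confidence interval
print(mean_hat, se, ci)
```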
Significance: Goal is to find the sweet spot where the estimator's error is at its minimum, without either underfitting (high bias) or overfitting (high variance).
Consistency: In an ML context, means the point estimate converges to the true parameter value as the size of the data grows (so the bias shrinks to zero).
_See CS7641 notes (link-&-Maximum-Likelihood-Estimation))_
Key Ideas:
Conditional Log-Likelihood: Generalizing MLE to estimate a conditional probability ($P(y|x; \theta)$):
Formula for conditional MLE: $$ \theta_{ML} = \arg\max_{\theta} \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}; \theta) $$
(Book then ties MLE to MSE in regression and how they both reach the same optimum (p130))
CoD: The more dimensions (i.e. features) the data has, the more data we need to generalize accurately; the required amount grows exponentially with dimension.
_See CS7641 notes: (link)_
Significance: The exponential advantages offered by DL (e.g. distributed representations) counter the CoD without requiring a corresponding explosion in the amount of data.
Smoothness Prior (aka Local Constancy Prior): Bayesian concept; expects the function being learned to give similar outputs for neighboring inputs.
Significance:
Manifold: A connected region of points, each associated with a neighborhood, that locally looks like a lower-dimensional space.
Manifold Learning: Focus on the most "interesting" subset of inputs, particularly those occurring on the manifold or where the manifold changes (e.g. from 1D to 2D)
Manifold Hypothesis: The probability distribution over complex, real-world data (e.g. images, text, speech) is highly concentrated near a low-dimensional manifold.
Why use it:
Why use $\log$? It simplifies calculations: it transforms products (from the likelihood) into sums, which are easier to handle and less prone to numerical underflow.
Key point: Gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm
MSE vs MAE vs Cross-entropy (Ref: ChatGPT):
Output layer provides additional transformations to actually solve the task (ie. a set of class predictions)
When to use: Probability distribution over a binary variable.
Key points:
When to use: Probability over n discrete classes
Key points:
Formula:
$$ \text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_j \exp(z_j)} $$
Reformulated in log space:
$$ \log \text{softmax}(z)_i = z_i - \log \sum_j \exp(z_j) $$
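A small numpy sketch of both forms; subtracting `max(z)` before exponentiating is the usual stability trick implied by the log-sum-exp rewrite (variable names are mine):

```python
import numpy as np

def softmax(z):
    # subtract the max so exp() never overflows; the result is unchanged
    shifted = z - np.max(z)
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

def log_softmax(z):
    shifted = z - np.max(z)
    return shifted - np.log(np.sum(np.exp(shifted)))

z = np.array([1000.0, 1001.0, 1002.0])  # naive exp(z) would overflow
print(softmax(z))
print(log_softmax(z))
```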
ReLU formula: $$ g(z) = \max\{0,z\} $$
Derivative: $$ \frac{d}{dz}g(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases} $$
Problem: ReLU's gradient is zero when the input is negative or zero. This means the corresponding nodes will not get updated during backprop.
Solution: Add a non-zero slope when $z<0$ (see the sketch after this list).
Examples:
Considerations:
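A minimal sketch of ReLU and a leaky variant; the slope `alpha=0.01` is an illustrative default, not a recommendation from the book:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # keeps a small, non-zero gradient (alpha) for z < 0
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))        # [0. 0. 3.]
print(leaky_relu(z))  # [-0.02  0.    3.  ]
```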
Key Point: Deeper models generalize better than shallower models with more parameters per layer (which are more likely to overfit).
By Hornik et al, 1989.
"...any continuous function on a closed and bounded subset of $\mathbb{R}^n$ is Borel measurable and therefore may be approximated by a neural network."
Key Point: Guarantees representation of a function, but not the ability to learn it.
Terminology clarification:
Components:
Chain rule example:
Chain rule for a scalar:
$$ \frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \frac{\partial y}{\partial x} $$
Chain rule for a vector:
In vector notation:
$$ \nabla_x z = \left(\frac{\partial y}{\partial x}\right)^\top \nabla_y z $$
where $\frac{\partial y}{\partial x}$ is the $n \times m$ Jacobian matrix of $g$.
The same idea applies to tensors (see book)
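A small numpy check of the vector chain rule above, using a toy linear `g` and a scalar `f` chosen just for illustration:

```python
import numpy as np

# y = g(x): R^3 -> R^2 (linear), z = f(y): R^2 -> R (toy functions)
A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

x = np.array([0.1, 0.2, 0.3])
y = A @ x                     # g(x)
z = np.sum(y ** 2)            # f(y)

grad_y = 2 * y                # dz/dy
jacobian = A                  # dy/dx of a linear g is just A
grad_x = jacobian.T @ grad_y  # chain rule: (dy/dx)^T (dz/dy)
print(grad_x)
```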
Key Idea: We can create a forward computation graph (G) and also create another graph for backprop (B).
Algos:
Example of symbol-to-symbol approach to computing derivatives:
Computational graph of forward propagation:
Notable paradigm shifts:
(Note: I am going to change the approach to note taking, moving away from excess note taking to emphasizing knowledge synthesis)
L2 regularization applies a regularization term to the weights at each update. The formula applied to the gradient is:
$$ \nabla_W \tilde{J}(W) = \alpha W + \nabla_W J(W) $$
As $\alpha$ approaches zero, no regularization takes place.
L1 regularization is similar to L2 but applies a sign function to the weights in the regularization term:
$$ \nabla_W \tilde{J}(W) = \alpha \,\text{sign}(W) + \nabla_W J(W) $$
This results in sparse parameters (i.e. weights), since many are driven to zero. This is often useful for feature selection, as the zeroed params indicate features that are okay to discard (i.e. LASSO).
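A minimal numpy sketch of one gradient step with each penalty; `alpha`, `lr`, and `grad_loss` are placeholder values:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))
grad_loss = rng.normal(size=(3, 3))  # stand-in for dJ/dW from backprop
alpha, lr = 0.01, 0.1

# L2 (weight decay): gradient gets an extra alpha * W term
W_l2 = W - lr * (alpha * W + grad_loss)

# L1: gradient gets an extra alpha * sign(W) term, pushing weights toward zero
W_l1 = W - lr * (alpha * np.sign(W) + grad_loss)
print(W_l2)
print(W_l1)
```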
Regularization puts constraints on parameter weights: the greater the penalty, the smaller the constraint region, and vice versa. We can also set constraints explicitly rather than including them as a penalty; this ties into the idea of reprojection (take an SGD step, then project the weights onto the closest region that satisfies the constraints). Research shows this can be a more stable optimization method than applying a penalty over all parameter weights.
Adding modulations and noise to existing data creates new data that can be used as additional training data for a NN. Techniques like image transformations (cropping, rotating, shearing) and noise injection are common in image recognition.
The section covers adding noise to the weights and to the output targets. For weights, we add noise with very small variance. This ties into the Bayesian notion that the weights themselves are uncertain; injecting noise encourages learning to settle in regions of parameter space that are relatively robust to small perturbations. For output targets, label smoothing replaces the hard classification targets (i.e. 0 and 1) with slightly softened values. This makes sense because we never expect predicted probabilities to actually reach 0 or 1, so smoothing prevents the model from overfitting by applying larger and larger weights trying to get there.
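A small sketch of label smoothing applied to one-hot targets; the smoothing amount `eps=0.1` is illustrative:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    # replace hard 0/1 targets with eps/(k-1) and 1-eps
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + (1.0 - one_hot) * eps / (k - 1)

y = np.array([[0.0, 0.0, 1.0]])
print(smooth_labels(y))  # [[0.05 0.05 0.9 ]]
```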
Semi-supervised learning combines the use of unlabeled (unsupervised) and labeled (supervised) examples. The main idea is that unsupervised learning can provide a representation that groups similar examples together. You can combine both by using the generative model ($P(x)$, unsupervised) with the discriminative ($P(y|x)$). Generative provides a prior belief to the supervised learning problem.
The core concept of multi-task learning is that by training a model on different (but related) tasks, it is able to generalize better. The way that it is done is by having a shared "generic" parameter layer, but then downstream parameters intended for different tasks (ie. translation + syntax parsing).
Early stopping is a simple and effective regularization method to prevent overfitting by stopping training when the validation error increases for greater than N steps. It is unobtrusive since we don't need to change the loss function or change the network architecture. After training stops, we can reincorporate the validation data to train again up to N steps where early stopping occurred. The book shows that early-stopping's regularization effect for fitting a quadratic function with linear weights is the same as L2 regularization (p.243).
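A sketch of the early-stopping loop described above; `train_step` and `validation_error` are hypothetical stand-ins for a real training setup, and `patience` plays the role of the N steps mentioned:

```python
def train_with_early_stopping(train_step, validation_error, patience=10, max_steps=10_000):
    """Stop when the validation error hasn't improved for `patience` consecutive checks."""
    best_error = float("inf")
    best_step = 0
    steps_without_improvement = 0
    for step in range(max_steps):
        train_step()                     # one optimization step (hypothetical)
        error = validation_error()       # current validation error (hypothetical)
        if error < best_error:
            best_error, best_step = error, step
            steps_without_improvement = 0
            # in practice we would also snapshot the best parameters here
        else:
            steps_without_improvement += 1
            if steps_without_improvement >= patience:
                break
    return best_step, best_error
```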
Parameter sharing is based on the idea that a model trained on similar tasks should share similar parameters. The technique then is to force the weights of various models (or model components) to share the same set of weights. This is effective in reducing memory because we don't need to have unique weights across all parameters. This is effectively used in CNNs with images where model params are trained on different parts of the image but the params are shared. Reduction in memory allows more room for larger network sizes to boot.
"Sparsity" in ML terms generally mean some component of the model / data having a lot of zeros. While L1 regularization makes the weights to be sparse, sparse representations takes the input data and applies a function to make it sparse.
Given several similar NN models trained on similar data, each model will make errors somewhat independently of the others. By averaging the outputs of each model, we benefit from the generalization power of each. This is done by creating k datasets by sampling from the original dataset with replacement; model $i$ is trained on dataset $i$, leading to a set of models trained on different distributions of the data.
Dropout is a technique in which a random set of units is masked (i.e. multiplied by a 0/1 mask) so that a subset of the network does not participate in a given update. The probability of keeping a unit is a hyperparameter (the book suggests 0.8 for input units and 0.5 for hidden units). The advantage is that it doesn't permanently destroy extracted features; it makes the model less reliant on any single, specifically effective unit (e.g. one that identifies a nose really well), which makes it more robust and generalizable. It is very easy to implement and can be applied to almost any NN. However, we need to increase the model size to offset the regularization and also train for more iterations. The book ties dropout to bagging: bagging ensembles inferences over several separately trained models, whereas dropout approximates an ensemble of many subnetworks that share weights.
Some technical considerations: when running the forward pass during training, we scale the non-masked nodes by $1/(1-\text{dropout rate})$ to account for the masked nodes. This ensures that the expected value of the activations remains the same. The mask is kept as-is during back-propagation, but no masking is applied during inference.
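A numpy sketch of the inverted-dropout scaling described above, where `keep_prob` is 1 minus the dropout rate:

```python
import numpy as np

def dropout_forward(activations, keep_prob=0.5, training=True):
    if not training:
        return activations            # no masking or scaling at inference time
    rng = np.random.default_rng()
    mask = rng.random(activations.shape) < keep_prob
    # divide by keep_prob so the expected activation value stays the same
    return activations * mask / keep_prob

h = np.ones((2, 4))
print(dropout_forward(h, keep_prob=0.5))
```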
Adversarial training involves finding examples that are very close (in input space) to examples of one label but actually belong to another. These "hard" examples are deliberately collected or synthesized and provided as training data so the model becomes more robust near these "manifold" boundaries (particularly when a network behaves nearly linearly).
Empirical risk minimization is a way to turn an ML problem into an optimization problem. Risk is the expected generalization error, which we want to minimize. How? By replacing the true distribution $p(x,y)$ with the empirical distribution $\hat{p}(x,y)$ and minimizing this empirical risk.
Surrogate loss functions are used when we can't optimize efficiently (e.g. 0-1 loss). Instead we use a surrogate like negative log-likelihood which is differentiable.
Batch gradient descent is where the network runs the forward/backward pass using the entire dataset. This is typically computationally intractable. The book notes that larger batches also yield less-than-linear returns: the standard error of the gradient estimate only improves with the square root of the batch size.
Stochastic (or online) GD is where the network runs the forward/backward pass using only one example. This results in noisy updates, but the noise can have a regularizing effect and can push the model out of local minima. On the other hand, it doesn't take advantage of parallel processing and is thus inefficient.
Minibatch GD is where the network takes N samples from the empirical distribution and runs f/b pass. The advantage is that it can benefit from parallel processing without blowing up memory (like batch), and is more stable than stochastic GD.
On sampling: Batches must be randomly sampled, otherwise bias may enter from the ordering of the data. Typically with large datasets, just shuffling once is enough.
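A minimal sketch of shuffling once and then iterating over minibatches (array names are illustrative):

```python
import numpy as np

def minibatches(X, y, batch_size=32, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))        # shuffle once to break any ordering bias
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

X, y = np.arange(100).reshape(100, 1), np.arange(100)
for xb, yb in minibatches(X, y, batch_size=32):
    print(xb.shape, yb.shape)
```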
Ill-conditioning refers to matrices where a small change to its values leads to very different outcomes. You can think of this in terms of NN weights, where small changes to it could lead to very different losses.
Neural networks are nonconvex functions, which can have many local minima. However, the book says this is not a particularly big problem. This is because of model identifiability: many equivalent models with different parameter settings behave identically, so we can't expect to rule out all but one "correct" setting of the parameters anyway.
The bigger problem would be many local minima with high cost (shallow minima) while the true minimum is deep. However, the literature seems to show that sufficiently large networks usually do not get stuck in such shallow minima.
Saddle points are points in a high-dimensional space with zero gradient. Historically, ML considered local minima the big problem, but for large networks (i.e. high-dimensional parameter spaces) saddle points are far more common.
Why? Gradient descent methods seem to be able to escape saddle points in high-dimensional space. One explanation is that some amount of noise pushes them off the zero-gradient point so they can continue downhill.
Plateaus typically slow down gradient descent.
Good explanation with visuals: https://www.youtube.com/watch?v=fODpu1-lNTw
Cliffs are where there are large shifts in the gradients. This can be dangerous because taking a large step from there might propel in a very different spot in the gradient space. Gradient clipping prevents this from happening by setting a cutoff point for step sizes.
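A small numpy sketch of norm-based gradient clipping; the threshold value is arbitrary here:

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    # rescale the whole gradient if its norm exceeds the cutoff
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])              # norm = 5
print(clip_by_norm(g, max_norm=1.0))  # rescaled to norm 1
```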
Deep computational graphs can pose a problem because the weight matrix essentially is exponentiated by $t$ steps, ie. $W^t$. This causes vanishing gradient (don't know which way to take step) or exploding gradient (unstable steps).
Deep feedforward networks use a different matrix $W$ at each layer, so these problems typically do not occur there. RNNs, which reuse the same $W$ at every time step, need to deal with this issue more.
Skipped:
Stochastic gradient descent takes mini-batches and calculates gradients over them. It inherently has some noise from the random sampling. The learning rate is a crucial component to tune, and tuning it is more of an art than a science.
Momentum maintains a velocity that is an exponentially decaying moving average of past negative gradients, controlled by a scaling factor $\alpha$. It is used to speed up learning and to smooth over the variance of SGD. When previous gradients point in the same direction, the velocity builds up and we effectively take larger steps. Common values of $\alpha$ are 0.5, 0.9 and 0.99.
Nesterov momentum is a variation where the gradient is evaluated after the current momentum is applied (a lookahead). This serves as a corrective factor at each step. However, in the stochastic (SGD) case this method does not improve the rate of convergence.
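A sketch of the two update rules in their usual formulation (velocity `v`, momentum coefficient `alpha`, learning rate `lr`); the quadratic objective is a toy example:

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, alpha=0.9):
    # v accumulates an exponentially decaying moving average of past gradients
    v = alpha * v - lr * grad(theta)
    return theta + v, v

def nesterov_step(theta, v, grad, lr=0.01, alpha=0.9):
    # evaluate the gradient after a lookahead step in the momentum direction
    v = alpha * v - lr * grad(theta + alpha * v)
    return theta + v, v

grad = lambda theta: 2 * theta          # gradient of f(theta) = theta^2
theta, v = np.array([5.0]), np.array([0.0])
for _ in range(50):
    theta, v = momentum_step(theta, v, grad)
print(theta)  # approaches the minimum at 0
```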
Good initialization makes convergence faster, while bad ones slow things down.
The conventional approach is to set the weights by randomly sampling from a Gaussian distribution while keeping the variance small. Biases are conventionally set to zero. There are cases where deviating from this makes sense (more in the book), for example setting a bias to 1 for a gating effect in an LSTM network.
Weight initialization can also be treated as a hyperparameter that we can tune.
"A good rule of thumb for choosing the initial scales is to look at the range or standard deviation of activations or gradients on a single minibatch of data."
Adagrad adapts the LR per parameter by scaling it inversely proportional to the square root of the accumulated sum of that parameter's historical squared gradients. Parameters with large gradients get a rapid decrease in LR, while those with small gradients get a slower decrease. Works well in a gently sloped parameter space. The problem is that it can make the learning rate too small prematurely.
RMSProp takes AdaGrad but replaces the accumulated sum of squared gradients with an exponentially weighted moving average. This adds a new hyperparameter controlling the length of the moving-average window.
Adam ("adaptive moments") is a variant of RMSProp. The difference is that it also includes momentum, via estimates of the first and second moments of the gradient: in addition to the magnitude of past gradients, it also considers their direction. It also includes a bias correction at each step to account for the moving averages being initialized at zero.
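A compact numpy sketch of the Adam update as it is usually written; the default hyperparameters below are the commonly cited ones, but treat the whole thing as illustrative:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad            # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment (squared gradients)
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the moving averages
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0])
m, v = np.zeros(1), np.zeros(1)
for t in range(1, 101):
    grad = 2 * theta                              # gradient of theta^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # moves toward the minimum at 0, roughly lr per step
```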
Schaul et al. did some research across a variety of tasks (arxiv); the optimizers covered in the book were among the recommended ones. It really comes down to whichever one the user is most comfortable with.
From lecture: Plain SGD + momentum can generalize better than adaptive methods but requires more tuning. See Luo et al. 2019 for more information.
Second-order methods use second derivatives to improve optimization.
(Skipped. Starts at 8.2, p.302)
Ioffe et al. 2015. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".
Consider a batch of activations at some layer. Make each dimension unit gaussian: $$ \hat{x}^{k} = \frac{x^k - E[x^k]}{\sqrt{Var[x^k]}} $$
We can then rescale and shift the normalized values with learnable parameters ($\gamma$, $\beta$) if we want.
The whole thing:
Recall that in BatchNorm, you have the following transformations:
Z₂_normalized = (Z₂ - μ₂) / σ₂
Z₂_batchnorm = γ₂ * Z₂_normalized + β₂
Where μ₂ and σ₂ are the mini-batch mean and standard deviation of Z₂.
You've computed dLoss/dZ₂_batchnorm in the previous step, when back-propagating the gradient from the layer above.
Now, to calculate dLoss/dγ₂ and dLoss/dβ₂, you can use the following formulas:
dLoss/dγ₂ = sum(dLoss/dZ₂_batchnorm * Z₂_normalized) (sum across the mini-batch dimension)
dLoss/dβ₂ = sum(dLoss/dZ₂_batchnorm) (sum across the mini-batch dimension)
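A numpy sketch of the forward transformation and the two parameter gradients above (batch dimension is axis 0; `dL_dout` is a made-up upstream gradient):

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 3))          # mini-batch of 4 examples, 3 features
gamma, beta = np.ones(3), np.zeros(3)
eps = 1e-5

# forward: normalize per feature, then scale and shift
mu, var = Z.mean(axis=0), Z.var(axis=0)
Z_norm = (Z - mu) / np.sqrt(var + eps)
out = gamma * Z_norm + beta

# backward for the learnable parameters (dL_dout is a stand-in upstream gradient)
dL_dout = rng.normal(size=Z.shape)
dgamma = np.sum(dL_dout * Z_norm, axis=0)   # sum across the mini-batch
dbeta = np.sum(dL_dout, axis=0)
print(dgamma, dbeta)
```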
Skipped
Supervised pretraining is a method where we train simple models first before building the final model for a complex task. It is also called "greedy supervised pretraining", where "greedy" means breaking the problem into many subcomponents and solving each for its own best solution.
This can be done through various methods. You can pretrain each hidden layer as a shallow network, each taking the output of the previous one as input. You can also take the outputs of a previous, simpler model and feed them to the next iteration along with the existing inputs.
Fitnets (arxiv) uses a teacher-student knowledge distillation approach, where the teacher is shallow and wide while student is deep and thin. Student must predict outputs for original inputs but also predict values of middle layer of the teacher network.
Most advancements in deep learning are not new optimization algorithms. Instead, they are newer model types that make optimization easier for a given task, e.g. better gradient flow. Examples are ReLU, LSTM, RNN.
Continuation methods is a family of strategies to ensure that models are initialized at well-behaved regions of parameter space.
Curriculum learning is a strategy where simpler, prototypical examples of the task are presented first and gradually more difficult examples are presented as inputs.
"Convultional networks are simply neural networks that use convolution in place of general matrix multiplication in at least one of their layers"