CS7643 Deep Learning - Module 1 (Intro to Neural Networks)

Lesson 1: Linear Classifiers and Gradient Descent

Components of a Parametric Learning Algorithm

Multiclass linear classifier:

  • You can interpret this as 3 independent linear classifiers
  • Note that you can fold the bias term into the weight matrix as an extra column and append a "1" to the end of the input. This reduces everything to a single matrix-vector multiplication. Why? Efficiency.
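A minimal NumPy sketch of this bias trick (sizes hypothetical):

```python
import numpy as np

# Hypothetical sizes: 3 classes, 4-dimensional input.
W = np.random.randn(3, 4)   # weight matrix
b = np.random.randn(3)      # bias vector
x = np.random.randn(4)      # input

# Separate multiply-then-add...
scores = W @ x + b

# ...versus folding the bias into the weights: append b as an extra
# column of W and append a 1 to the end of the input.
W_aug = np.hstack([W, b[:, None]])  # shape (3, 5)
x_aug = np.append(x, 1.0)           # shape (5,)
scores_aug = W_aug @ x_aug          # one matrix-vector multiplication

assert np.allclose(scores, scores_aug)
```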

Interpreting a Linear Classifier

How we can interpret:

  • Algebraic: We can view classification as a linear algebra computation: given a matrix of pixel values, each pixel is tied to a weight, plus a bias.
  • Visual: We can reshape the weight vector back into the shape of the image and visualize it. After optimization, the image looks like an average over all inputs of the class.
  • Geometric: We can view the linear classifier as a linear separator (hyperplane) in a high-dimensional space.

Limitations:

  • Not all classes are linearly separable.
  • XOR and the bullseye function are not linearly separable

Performance Measure for a Classifier

Converting Scores to Probabilities

Use softmax function to convert scores to probabilities:

$$ s = f(x,W) \\ P(Y=k|X=x)=\frac{e^{s_k}}{\sum_j e^{s_j}} $$

Steps:

  1. Given score vector $s$ (score for each class)
  2. Feed this vector through softmax function (2nd equation, right)
  3. For each class $k$, exponentiate the score for class $k$ and divide it by the sum of the exponentiated scores over all classes.
  4. This normalizes all of the scores to values between zero and one that sum to one, which fits the definition of a probability distribution.
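A minimal sketch of these steps in NumPy (the max-subtraction for numerical stability is an addition beyond the notes; it doesn't change the result):

```python
import numpy as np

def softmax(s):
    # Exponentiate each score, then divide by the sum of exponentials.
    # Subtracting the max first avoids overflow without changing the output.
    exp_s = np.exp(s - np.max(s))
    return exp_s / np.sum(exp_s)

scores = np.array([3.2, 5.1, -1.7])  # hypothetical class scores
probs = softmax(scores)              # non-negative, sums to 1.0
```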

Performance Measure

Example of multiclass SVM loss:

Takeaways:

  • Loss is zero if the score for $y_i$ (the ground-truth label) is greater than or equal to the scores of all the other (incorrect) classes plus 1. (First equation)
    • Goal is to maintain a margin between the score for the ground-truth label and all the other possible categories.
  • If not, penalize it by how far it falls short of this margin.
    • For each class that is not the ground truth, take the max of 0 and that class's score minus the ground-truth score plus the margin (second equation)
    • Sum these terms over all classes that aren't the ground truth; the model is penalized whenever the ground-truth score is not bigger by the margin.
  • This type of loss is called a hinge loss.

Example of how this is calculated with image class prediction:

  • Car score is highest (wrong)
  • For each wrong class, take the max of either 0 or (wrong score - ground-truth score + 1)
  • In this case, the car score will incur loss but frog will not (since it's lower than cat).
  • Final loss is the sum of each term (in this case, 2.9)
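The slide itself isn't reproduced here, but the 2.9 total is consistent with the classic cat/car/frog example (scores 3.2, 5.1, -1.7 with ground truth "cat"); a sketch assuming those numbers:

```python
import numpy as np

def multiclass_svm_loss(scores, y, margin=1.0):
    # max(0, s_j - s_y + margin) for every class j...
    terms = np.maximum(0, scores - scores[y] + margin)
    terms[y] = 0  # ...except the ground-truth class itself
    return np.sum(terms)

scores = np.array([3.2, 5.1, -1.7])      # cat, car, frog (assumed)
loss = multiclass_svm_loss(scores, y=0)  # ground truth = cat
# car term: max(0, 5.1 - 3.2 + 1) = 2.9; frog term: 0 -> total 2.9
```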

Loss for classification

Cross-entropy and MLE:

  • For multiclass loss, we use the negative log probability of the softmax output:

Example using the image classification task:

Takeaways:

  1. Given raw scores, we exponentiate them (orange box)
  2. Then normalize to get probabilities (green box)
  3. Take the negative log of the probability assigned to the one that matches ground truth (e.g. "cat")
  4. We don't take the other probabilities into account for the loss. Why?
    1. Probabilities inherently induce competition
    2. The optimization algorithm can both boost the weights for cat and decrease the weights for the other classes.
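A sketch of the full pipeline from raw scores to cross-entropy loss (same hypothetical scores as above):

```python
import numpy as np

def cross_entropy_loss(scores, y):
    # Exponentiate (orange box), normalize to probabilities (green box),
    # then take the negative log of the ground-truth class's probability.
    exp_s = np.exp(scores - np.max(scores))  # stabilized
    probs = exp_s / np.sum(exp_s)
    return -np.log(probs[y])

scores = np.array([3.2, 5.1, -1.7])
loss = cross_entropy_loss(scores, y=0)  # y = index of "cat"
```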

Loss for regression / probability

Regularization term

Regularization encourages choosing simpler models over complex ones. It is added as a term to the loss function:

L1 regularization adds the norm of the weight vector: $$ L_i = |y_i-Wx_i|^2+\lambda|W| $$

Linear Algebra View: Vector and Matrix Sizes

Size of weights and inputs

  • c = number of classes
  • d = dimensionality of input
  • $W$: Size = $[c \times (d+1)]$
  • $x$: Size = $[(d+1) \times 1]$

Dimensionality of derivatives - Conventions

Given a scalar $s \in \mathbb{R}^1$ and vector $v \in \mathbb{R}^m$:

  • Size of the partial derivative of $v$ wrt $s$: $\mathbb{R}^{m \times 1}$
    • ie. a column vector of size $m$
  • Size of the partial derivative of $s$ wrt $v$: $\mathbb{R}^{1 \times m}$
    • ie. a row vector of size $m$

Q: Given 2 vectors, what's the size of $\frac{\partial v^1}{\partial v^2}$?

  • Answer: A Jacobian matrix of size $m \times n$ (for $v^1 \in \mathbb{R}^m$, $v^2 \in \mathbb{R}^n$), containing the partial derivative of each element of $v^1$ wrt each element of $v^2$.

Q: Given a scalar and a matrix, what's the size of $\frac{\partial s}{\partial M}$?

  • Answer: Matrix containing derivative of the scalar wrt each element in the matrix

Q: What is the size of $\frac{\partial L}{\partial W}$?

  • Answer: The derivative of the loss (scalar) wrt the weights (matrix) is a matrix of the same size as $W$, containing the derivative of the loss wrt each element of the matrix.

Jacobians of Batches

Takeaway:

  • Batch size affects the dimensionality of our data
  • Tensors: multidimensional matrices
  • Can be unwieldy - instead, flatten input to vector and get a vector of derivatives.

How is Deep Learning Different?

What makes it different:

  1. Representation learning - takes in raw data rather than processed form (ie. histogram)
    • Conducts feature extraction - automatically pulling features from raw data
  2. Uses neural networks
  3. Can tackle various ML tasks (unsupervised, RL, etc)

What is deep learning:

Features: Trad vs Deep Learning

Features are engineered in traditional ML:

Features are automatically extracted in deep learning:

  • Key is hierarchical compositionality, meaning all data has some hierarchical order which neural networks can represent.

Example of features for image detection:

Building a complicated function

Representation of data is done by composing simple functions into a complex network.

  • This idea is similar to boosting where we employ an ensemble of weak learners.
  • Significance: We can use any differentiable function to construct the network (sin, quadratic, log, etc).

End-to-End Learning

"End-to-end": Learning is applied to entire spectrum, from raw data -> feature extraction -> classification.

  • No handcrafted feature extraction, auto-extracted. Distinction between extracted features and classifier is blurry.
  • Key Idea: Learn a highly separable feature representation that can be easily classified.

Gradient Descent

Derivatives

Algorithm:

Applying batch gradient descent:
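The slide isn't reproduced here; a minimal sketch of the batch update rule $w \leftarrow w - \alpha \nabla_w L$ for a linear model with squared loss (data and learning rate hypothetical):

```python
import numpy as np

X = np.random.randn(100, 3)         # toy inputs
y = X @ np.array([1.0, -2.0, 0.5])  # targets from a known linear rule
w = np.zeros(3)
alpha = 0.1                         # learning rate

for step in range(100):
    preds = X @ w
    # Gradient of mean squared loss L = (1/N) * sum_i (y_i - w^T x_i)^2
    grad = -2.0 / len(X) * X.T @ (y - preds)
    w -= alpha * grad               # gradient descent update
```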

Convergence notes:

Computing Gradients

How to compute $\frac{\partial L}{\partial W_i}$?

  1. Manual differentiation
    • Labor intensive
    • Can't compute closed-form solutions for complex functions
  2. Symbolic differentiation
    • Similar to manual
  3. Numerical differentiation
    • Works for any function (see the finite-difference sketch after this list)
    • Computationally expensive
  4. Automatic differentiation
    • Used by most DL libraries
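A finite-difference sketch of numerical differentiation (illustrative only; the per-parameter cost is why it isn't used for training):

```python
import numpy as np

def numerical_gradient(f, w, eps=1e-5):
    # Central differences: perturb each parameter in turn.
    # Works for any f, but costs two evaluations per parameter.
    grad = np.zeros_like(w)
    for i in range(w.size):
        w[i] += eps
        f_plus = f(w)
        w[i] -= 2 * eps
        f_minus = f(w)
        w[i] += eps  # restore original value
        grad[i] = (f_plus - f_minus) / (2 * eps)
    return grad

# Example: f(w) = ||w||^2 has true gradient 2w.
w = np.array([1.0, -2.0, 3.0])
g = numerical_gradient(lambda v: np.sum(v ** 2), w)  # approx [2, -4, 6]
```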

Manual Differentiation

Derivation of update rule using squared loss:

  • Note on 2nd to last summation on update rule:

    The partial derivative of this summation with respect to $w_j$ causes most of the terms to vanish: when $i$ is not equal to $j$, those weights do not depend on $w_j$, so their derivatives are zero.

(Some more context on getting the partial derivative of update rule above)

Update rule once we add non-linearity (sigmoid) - Gets more complex:

Decomposing a Function

Manual differentiation can get messy. We can decompose the complicated function into modular sub-blocks.

Distributed Representations

Key ideas:

  • No single neuron "encodes" everything
  • Groups of neurons work together as a distributed representation

Distributed representation: Toy example

  • One-hot labels of shapes (left)
  • By distributing characteristics (right), can efficiently cover a wide range of shapes by combining them.

Lesson 2: Neural Networks

Neural Network View of a Linear Classifier

(Think of this as another view in combination with the linear algebra view)

Origins of the term

Output can be modulated by a non-linear function (e.g. sigmoid)

Connecting Many Neurons

Terms:

  • Each input/output is a neuron (node)
  • A linear classifier is called a fully connected layer
  • Connections represented as edges
  • Activation: Output of a particular neuron
  • This will be expanded as we view computation in a NN as a graph

The magic of NN is that we can stack multiple layers together:

From a linear algebra view, a 2-layer NN corresponds to adding another weight matrix (see the sketch below):

We can build deeper networks by adding more layers.

  • A 2-layer network can approximate any continuous function
  • A 3-layer network can theoretically approximate any function
  • However, the number of nodes could grow unreasonably (exponentially or worse) with the complexity of the function.
  • However, we still need to figure out the initial architecture (how many layers, how many nodes). This is not learned.
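A minimal sketch of the linear-algebra view: a 2-layer network is a second weight matrix with a non-linearity in between (sizes hypothetical):

```python
import numpy as np

d, h, c = 4, 8, 3           # input dim, hidden nodes, classes (assumed)
W1 = np.random.randn(h, d)  # first layer's weight matrix
W2 = np.random.randn(c, h)  # second layer's weight matrix
x = np.random.randn(d)

# Without the non-linearity, W2 @ (W1 @ x) collapses to (W2 @ W1) @ x,
# i.e. a single linear classifier. The non-linearity prevents this.
hidden = np.maximum(0, W1 @ x)  # e.g. ReLU
scores = W2 @ hidden
```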

Computation Graphs

Adding even more layers:

Compositionality

  • The world is compositional - we want our model to reflect this.
  • Empirical and theoretical evidence that having this hierarchy makes learning complex functions easier.
  • Note that prior SOTA hand-engineered features often had this compositionality as well.

Computing Gradients in a Complex Function

Problem:

  • We are learning complex models with a significant number of parameters (millions/billions)
  • How do we compute the gradients of the loss (at the end) with respect to internal parameters?
  • Intuitively, want to understand how small changes in weights deep inside are propagated to affect the loss function at the end.

Answer:

  • View the function as a computation graph
  • Specifically, a DAG (directed acyclic graph)
  • Modules must be differentiable to support gradient computations for gradient descent.
  • A training algorithm will process this graph, one module at a time.

Example representation of a function as a graph:

  • Represents an ordering of computations
  • Significance: Tells us how to compute it in reverse (important for backprop)

Backpropagation

Overview of training

Forward pass

Backwards pass

Computing Local Gradients: Example

Use matrix calculus to get derivatives of local gradients:

Computing the gradients of loss:

  • Goal: For a given module, get the partial derivative of the loss wrt its inputs, or the partial derivative of the loss wrt its weights.
  • Problem: Tricky part is that Loss is all the way at the end of the computation graph. How do we compute these terms?
  • Answer: Use the chain rule
    • Chain rule: If you have a particular graph that goes from $x$ to some intermediate variable $y$ to $z$, then $\frac{\partial z}{\partial x}$ is equal to the product of $\frac{\partial z}{\partial y}$ and $\frac{\partial y}{\partial x}$
    • Intuition: If we want to know how $z$ changes when we make a tiny change to $x$, we first need to understand how $y$ changes when we make a small change to $x$, then multiply that by how $z$ changes when $y$ changes.
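A tiny worked instance of this, checked with PyTorch's autograd (values hypothetical):

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2     # intermediate variable
z = 3 * y      # so z = 3x^2

# dz/dx = dz/dy * dy/dx = 3 * 2x = 12 at x = 2
z.backward()
print(x.grad)  # tensor(12.)
```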

Applying Chain Rule

Neural Network Training

Summary:

  1. Forward Pass: Compute loss on mini-batch
  2. Backward Pass: Compute gradients wrt parameters.
    • Each layer receives the upstream gradient: how the loss changes wrt that layer's outputs.
    • Each layer computes the gradient of the loss wrt its parameters, in order to perform gradient descent.
    • It then computes the gradient of the loss wrt its inputs, in order to send this back to the previous layer.
    • Now the previous layer knows how the loss changes if its output changes.
    • This continues back to the first layer. Note that the first layer (the last one processed in the backward pass) doesn't need to compute the gradient of the loss wrt its inputs, since there is no earlier layer to send it to.

Key Idea: Backpropagation is the application of gradient descent to a computation graph via the chain rule.

Backpropagation and Automatic Differentiation

Key Idea:

  • Auto-diff is a generalization of backpropagation.
  • Reverse-mode auto-diff: Iterates from last module backwards, applying the chain rule to a DAG.
  • Decomposes a function into very simple primitive functions (addition, multiplication) where we already know what the derivatives are.
  • This creates a framework where we can just define the computation graph using simple primitives, so we don't need to worry about the backward gradients. In other words, we don't need to write code that actually computes the gradients of the function.

Deep Learning = Differentiable Programming

Computation as a Graph:

  1. Input = Data + Parameters
  2. Output = Loss
  3. Scheduling = Topological ordering
    • ie. which computations have to be done first.

Automatic Differentiation: A family of algorithms for implementing chain-rule on computation graphs.

Example Computation Graph:

Partial derivatives from $a_3$ upstream:

Notes:

  • For nodes with more than one path, you need to sum the gradients from the multiple paths.

Patterns of Gradient Flow

Different operations have different effects on the gradient.

  • Addition operation distributes gradients along all paths.
  • Multiplication acts as a gradient switcher, ie. the gradient flowing to each input is scaled by the value of the other term.
  • Max operation selects which path to push the gradients through. Gradients will only flow through the path that was selected to be the maximal value. This information of the flow must be recorded in the forward pass.

Key Idea: If gradients do not flow backwards properly, learning slows or stops
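A small PyTorch sketch of the three patterns above (values hypothetical):

```python
import torch

a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)
c = torch.tensor(5.0, requires_grad=True)

out = (a + b) * torch.maximum(b, c)
out.backward()

# Addition distributes the upstream gradient to both a and b.
# Multiplication scales it by the value of the other term.
print(a.grad)  # 5.0 = upstream 1 * max(b, c)
# Max routes the gradient only through the winning input (c = 5 > b = 3).
print(c.grad)  # 5.0 = upstream 1 * (a + b)
print(b.grad)  # 5.0, entirely from the addition path; the max path gave 0
```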

Computational Implementation

Key Ideas:

  • Explicitly store computation graph in memory and corresponding gradient functions.
  • Nodes broken down to basic primitive computations - corresponding derivative is known.
  • In the past, you performed NN computation by specifying both the forward function (ie. $W^T x$) and manually writing the backward function for each layer.
  • In auto-diff paradigm, all we need to do is put together a forward computation graph and all of these gradients will be computed for us.

Forward Mode Automatic Differentiation

Key ideas:

  • Different from reverse mode auto-diff, which starts from the output.
  • Forward mode starts from the input and propagates it forward.
  • Complexity is proportional to input size.
  • However, in most ML tasks our input sizes are huge while the outputs (ie. the loss) are small, so forward mode is not common in DL.

Computation Graphs in PyTorch

Key ideas:

  • Modern libraries build the computation graph on the fly as the code executes.
  • This allows backprop to be executed very simply (backward()). You don't need to define separate forward/backward functions like in the past.
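A minimal illustration of this in PyTorch:

```python
import torch

# The graph is recorded on the fly as ordinary Python code runs;
# no separate forward/backward definitions are needed.
w = torch.randn(3, requires_grad=True)
x = torch.randn(3)

loss = (w @ x - 1.0) ** 2  # any differentiable expression
loss.backward()            # autograd walks the recorded graph
print(w.grad)              # dL/dw, computed for us
```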

Power of Automatic Differentiation

Power of auto-diff stems from the idea of Differentiable programming:

  1. Not limited to mathematical functions.
  2. Can employ control flows (if, loops) and backpropagate through these algorithms.
  3. Can be done dynamically so that gradients are computed, then nodes added, and you can repeat this.

Computation Graph Example for Logistic Regression

Linear Classifier: Logistic Regression

Components:

  1. Input: $x \in \mathbb{R}^D$
  2. Binary label: $y \in \{-1, +1\}$
  3. Parameters: $w \in \mathbb{R}^D$
  4. Output prediction (e.g. sigmoid): $$ p(y=1|x) = \frac{1}{1+e^{-w^T x}} $$
  5. Loss (e.g. log loss with regularization): $$ L = \frac{1}{2}\|w\|^2 - \lambda \log p(y|x) $$

Key Idea: A machine learning pipeline (input -> model -> loss) is also a computation graph

  • This means we can use the computed gradients from backprop/auto-diff to update the weights.
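A sketch of these components as a computation graph in PyTorch, following the formulas as written in the notes (the λ placement and the $\{-1,+1\}$ label encoding via $p(y|x) = \sigma(y\,w^T x)$ are assumptions):

```python
import torch

D = 4
x = torch.randn(D)
y = 1.0                                # label in {-1, +1}
w = torch.randn(D, requires_grad=True)
lam = 0.1                              # regularization weight (assumed)

p = torch.sigmoid(y * (w @ x))         # p(y|x); matches the sigmoid above for y = +1
loss = 0.5 * (w ** 2).sum() - lam * torch.log(p)

loss.backward()                        # gradients of the whole graph wrt w
print(w.grad)
```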

Example Gradient Computations

Key Idea: All we need to do is define the forward function, no backwards.

Vectorization and Jacobians of Simple Layers

Key Idea: Chain rule can be computed via a series of operations on scalars, vectors, and matrices.

Example of logistic regression:

Fully Connected (FC) Layer: Forward Function

Example of input/weights/output dimensions of a fully connected layer:

Sizes of Jacobians (gradients):

  • This video is dense. Rewatch it (link to video)
  • Key Ideas:
    • We don't take the partial derivative wrt the entire $W$ matrix, because it would lead to tensors (bad). Note: the partial of a vector wrt a matrix results in a tensor.
    • Instead, take the row-wise partial of $h^l_i$ wrt $w_i$ (the corresponding row of $W$). This results in a sparse structure: each output is only affected by its corresponding weight row; all other entries are zero.

Rectified Linear Unit (ReLU)

ReLU is a substitute for the sigmoid.

  • Provides non-linearity.
  • Better gradient flow than sigmoid (why? see the aside below)
  • Performed element-wise (see figure below) - just applies max operation between 0 and input $h^{l-1}$.
  • No parameters (just doing max operation)

(ReLU and other activation functions):

Jacobian of ReLU

Key Ideas:

  • Gradient of inputs less than zero will be zero. Remember that ReLU takes either max or zero.
  • The full Jacobian of ReLU is large: output x input dimensions ($|h^l| \times |h^{l-1}|$).
    • However, Jacobian of ReLU is sparse - Only diagonal values will have non-zero elements.
    • Why? Output value is only affected by corresponding input value, nothing else.
  • Final Jacobian is:
$$ \frac{\partial h^l}{\partial h^{l-1}} = \begin{cases}1,& \text{if }h^{l-1} \gt 0 \\ 0, & \text{otherwise} \end{cases} $$
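In code, the sparse Jacobian is never materialized; the backward pass just masks the upstream gradient (a sketch, names hypothetical):

```python
import numpy as np

h_prev = np.array([1.5, -2.0, 0.3, -0.1])  # layer input (hypothetical)
h = np.maximum(0, h_prev)                  # ReLU forward

# Backward: instead of building the full diagonal Jacobian,
# zero out the upstream gradient wherever the input was <= 0.
upstream = np.ones_like(h)                 # dL/dh from the next layer
grad_input = upstream * (h_prev > 0)       # dL/dh_prev = [1, 0, 1, 0]
```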

Aside: Why better gradient flow than sigmoid? (ChatGPT)

Rectified Linear Units (ReLU) and sigmoid activation functions behave differently in terms of gradient flow during the backpropagation process in neural networks, which can impact training efficiency. ReLU tends to provide better gradient flow than sigmoid for a few reasons:

  1. Vanishing Gradient Problem: Sigmoid activation functions have a range between 0 and 1, which causes gradients to become very small when the input values are far from 0. This is particularly problematic during backpropagation because small gradients can lead to slow convergence or even complete stagnation of learning. ReLU, on the other hand, has a constant gradient of 1 for positive inputs, which helps mitigate the vanishing gradient problem and allows for faster learning.

  2. Non-linearity and Sparsity: While sigmoid provides non-linearity to the network, ReLU introduces a sparsity aspect. This is because ReLU outputs 0 for negative inputs, effectively deactivating the neuron. This sparsity can make the network more efficient by reducing the number of active neurons and simplifying the representation of data, which can enhance gradient flow through the network.

  3. Efficient Computation: ReLU is computationally more efficient than sigmoid. The sigmoid function requires exponentiation and division operations, which can be more costly in terms of computation compared to the simple thresholding operation of ReLU. This computational efficiency can contribute to faster training times.

  4. Avoiding Saturation: Sigmoid saturates to either 0 or 1 when its inputs are very large or very small, causing the gradients to be close to zero. In such cases, the network's weights don't update effectively, slowing down learning. ReLU does not saturate for positive inputs, allowing gradients to flow more effectively through the network.

  5. Initialization: Initialization techniques, like He initialization, have been specifically designed for ReLU activations. These initialization methods help prevent gradients from becoming too small during the early stages of training, promoting better gradient flow.

However, it's important to note that ReLU isn't without its drawbacks. It can suffer from a problem known as the "dying ReLU" problem, where a large portion of the neurons can become inactive and never recover during training. This issue has led to the development of variants like Leaky ReLU, Parametric ReLU, and Exponential Linear Units (ELU) to address the dying ReLU problem while retaining the benefits of improved gradient flow.

In summary, ReLU generally provides better gradient flow compared to sigmoid due to its sparsity, non-saturation, computational efficiency, and avoidance of the vanishing gradient problem. These factors collectively contribute to faster and more effective training of neural networks.

End Quiz 1


Lesson 3: Optimization of Deep Neural Networks

Optimization of DNN Overview

Depth is important, Why?:

  1. Structures the model to represent a compositional world
  2. Theoretical evidence that it leads to parameter efficiency
  3. Gentle dimensionality reduction

Designing Deep Neural Networks

Design decisions:

  1. Architecture - We want architecture to reflect inherent structure of the data (e.g. RNN vs CNN)
    • Theoretically, you don't need to account for this, but it will make learning faster / require less data.
  2. Data Considerations - Traditional ML techniques like regularization.
  3. Training and Optimization - Initialization and how to reach optimal set of parameters.
  4. ML Considerations - DL networks are overparameterized. Bias / variance becomes even more important.

Architectural Considerations

What modules (layers) to use:

  • In reality, many networks share similar layer architectures.
  • In CV, geometric transformations are more important. Special geometric transform layers are used for this purpose.

How to connect layers:

  • Specific functions that each module computes affect the gradient flow, some can bottleneck the flow.
  • Some layers designed to improve backward flow of gradients.

How to use domain knowledge:

  • Need to add architectural bias to better learn the data (especially if not much data available)
  • The better the architectural biases mirror the reality, the better it will learn.

Example Architectures

  1. Fully Connected: Input converted to vector, extract more and more abstract features from high dimensional raw input data.
    • Typically dimensionality is reduced as we go deeper into the network.
    • NOT well suited for handling images: the parameter count blows up, and the spatial structure in the image is not exploited.
  2. Convolutional NNs: Extract features from small local windows by striding a local window across the entire image.
    • Each window will have features extracted (shapes, corners, circles, etc).
    • Dimensions reduced per layer, then tied to fully connected layers to make predictions.
  3. Recurrent Neural Network: Output of each layer dependent on both the input from the previous layer and the new input (e.g. new word).

Architectural Considerations

Designing the architecture

  • Guided by type of data used
  • Lots of data types already have good architectures
  • Models depend heavily on a good flow of gradients - this allows large steps and convergence.

Linear / Non-linear Modules

  • Many linear layers stacked together have the same representational power as one linear layer (their composition is itself a linear function).
  • Non-linear layers allow for complex transformations, BUT can reduce the flow of gradients.
  • Non-linear Types: Can analyze by:
    1. Viewing min/max of function
    2. Stats (mean/variance)
    3. Forward (from initialization) - does it shrink/blow up weights?
    4. View extreme weight values
    5. Complexity - anything with exponentials/division will be more complex.

Sigmoid:

  • Generally not used for large networks.
  • Why? It saturates at the extremes and contains an exponential. Its output is always positive, so values can blow up in the forward pass.

Tanh:

  • Balanced between -1 and 1, so more robust to weight explosion.
  • Still has most issues of sigmoid.

ReLU:

  • Simple function: Takes max or zero.
  • Pros: No saturation on positive end. Cheap to compute.
  • Cons:
    • Output is 0 if the input is less than 0. Can cause a "dead ReLU", ie. no learning on that weight.
    • BUT, other weights could compensate.
  • Point at zero is non-differentiable.

Leaky ReLU:

  • Min: -inf, Max: inf
  • Negative slope can be a learnable parameter (parametric ReLU)
  • No dead neuron

Selecting a non-linearity:

  • Choose ReLU first; sometimes leaky ReLU can make a big difference.
  • Only use sigmoid if you want to clamp the output to [0, 1]

Initialization

Bad initialization:

  1. Too big - saturates the non-linearity, no learning
  2. Setting weights near zero keeps activations in the non-linearity's "linear regime", giving strong gradients to learn from.
  3. This also limits the use of the model's full capacity - effectively a simpler model at first.

Poor initialization:

  1. Setting weights to a constant value - all gradients become the same.
  2. Deeper networks are more sensitive to bad initialization, because activations get smaller.
    • Larger values lead to saturation.
    • Need good balance.

Rule of thumb: Xavier initialization

  • Sample from a uniform distribution, scaled by the number of inputs to the node
  • The bigger the layer, the smaller the weights, and vice versa.

More from Stanford Lectures

Deep networks can end up with very small activations at the end. If activations are very small, gradients in the backward pass are very small: poor gradient flow.

Initializing weights with large values ends up with unstable updates because neurons become completely saturated. Gradients become all zero.

Xavier initialization keeps the variance of a layer's output the same as the variance of its input.

  • BUT it breaks when using ReLU, because more and more units become deactivated
  • Can counter this by dividing the variance by 2, to account for half of the nodes dying (He initialization; figure 2)
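A sketch of both rules (shown with a normal distribution; the lecture's Xavier version samples from a uniform distribution, and the divide-by-2 fix is usually called He initialization):

```python
import numpy as np

fan_in, fan_out = 512, 256  # example layer sizes

# Xavier: variance ~ 1 / fan_in, so bigger layers get smaller weights.
W_xavier = np.random.randn(fan_out, fan_in) / np.sqrt(fan_in)

# He: double the variance to compensate for ReLU zeroing out
# roughly half of the units.
W_he = np.random.randn(fan_out, fan_in) * np.sqrt(2.0 / fan_in)
```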

Preprocessing, Normalization and Augmentation

Batch normalization is the most common method of normalization.

  • No parameters
  • Differentiable
  • "Scale" and "shift" can be learnable parameters.

Complexities of BN

  • During inference, the mean/variance computed on the training set is used.
  • Sufficient batch sizes must be used to get stable per-batch mean/variance estimates.
  • This is especially an issue with multiple GPUs, which might lead to very small per-GPU batch sizes. There is a PyTorch function to sync the statistics between them all (SyncBatchNorm).

Where to apply BN: Right before non-linearity.

More on BN: https://web.archive.org/web/20201023123942/http://mlexplained.com/2018/11/30/an-overview-of-normalization-methods-in-deep-learning/

  • Difference between batch and layer norm - batch norm computes the average across the batch for each feature, while layer norm computes it across the features for each example (see the sketch below)
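A NumPy sketch of that axis difference (sizes hypothetical; learnable scale/shift omitted):

```python
import numpy as np

x = np.random.randn(32, 64)  # (batch, features)

# Batch norm: statistics per feature, computed across the batch (axis 0).
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

# Layer norm: statistics per example, computed across features (axis 1).
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(
    x.var(axis=1, keepdims=True) + 1e-5)
```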

More from Stanford lectures

Batch Norm:

Optimizers

Issues that impact learning:

  1. Noisy gradient estimates: Updating from a mini-batch of data gives high-variance gradient estimates, which leads to noisy updates and slower convergence.
  2. Saddle points: Points where the gradient is zero but that are not minima (the surface curves up in some directions and down in others). In deep networks these are more common than local minima with high cost.
    • Adding momentum (an exponential moving average of the gradient, acting like velocity) "pushes" the model to continue in its direction through a saddle point.
  3. Ill-conditioned loss surface

The Hessian (matrix of second derivatives) gives information about the curvature of the loss surface. If we can approximate the curvature, we can jump straight toward the local minimum.

  • The condition number is the ratio of the largest to the smallest eigenvalue. It tells us how different the curvature is along different dimensions. When this number is high, the model takes big steps in some directions and small steps in others - not the situation we want to be in.
  • Pro: Doesn't need a learning rate!
  • Con: Too computationally intensive
  • Alternative: L-BFGS (but only works with full batch updates with no stochasticity)

From Stanford lecture:

Momentum:

Nesterov Momentum:

Adagrad & RMSProp:

  • Problem (Adagrad): Step size decays over time as squared gradients accumulate. Not good for non-convex problems.
  • Adagrad isn't used that much in neural networks now.

Adam:
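The slide isn't reproduced here; a sketch of the standard Adam update with textbook-style defaults:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and its square.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    # Bias correction for the zero-initialized averages (t starts at 1).
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Per-parameter step size, scaled by the gradient history.
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```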

Per-parameter Learning Rate

The main benefit from using adaptive learning rates is that you don't need to tune it as much as SGD + momentum. See my notes on Goodfellow's Optimization notes for more details on variations.

Learning Rate Schedules - continuously anneal the learning rate.

  1. Graduate student - look at the loss curve, determine when it has converged, and reduce the learning rate.
  2. Step scheduler - anneal LR every N epoch
  3. Exponential scheduler - same as above but exponential decay
  4. Cosine scheduler - Reduce and increase LR in a cosine cycle

Regularization

Common regularization terms

  1. L1 - encourages sparsity
  2. L2 - encourages small values
  3. Elastic - linear combination of the two
    • However, networks can still end up relying on strong subsets of features.

Dropout

The technique randomly masks nodes with probability p (e.g. 0.5) during training, but at inference we use all nodes (see the sketch after the list below).

Why this works:

  1. The model doesn't rely too much on any single feature
  2. Training $2^n$ networks - each random mask defines a different sub-network, each trained on a subset of inputs (via batches). This gives an "ensembling" effect, where a bunch of weaker learners are combined.
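A sketch of the masking idea ("inverted dropout", which scales at training time so inference can simply use all nodes unchanged):

```python
import numpy as np

def dropout_forward(h, p=0.5, train=True):
    if train:
        # Randomly zero units with probability p; scale survivors by
        # 1/(1-p) so the expected activation matches inference time.
        mask = (np.random.rand(*h.shape) > p) / (1 - p)
        return h * mask
    return h  # inference: all nodes active
```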

Data Augmentation

Idea: Data augmentation transforms existing data to get more training data.

  • Just need to be careful to make changes that keep the label the same.

Types: Flipping, random crop, combining a bunch of cropped images (Cutmix), color jitter, geometric transformation.

  • Cowmix: Apply a cow-pattern mask to the input; the masked region can be filled with noise or another image.
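A typical torchvision pipeline for the label-preserving transforms above (parameters illustrative):

```python
from torchvision import transforms

# Each transform should keep the label the same: a flipped or
# jittered cat is still a cat.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomCrop(32, padding=4),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```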

Process of Training Neural Networks

Sanity Checking:

Loss curves:

  1. Validation curve going down with training - generalization, good
  2. Validation curve goes up with training - overfitting
  3. Training error / bugs - dividing by zero or NaNs, forgetting the log in the log-loss
  4. Validation loss lower than training - this can happen because regularization is not applied at validation time, or because training loss is averaged over the whole epoch whereas validation loss is computed at the end of the epoch, after the model has seen all the training data.

Hyperparameter tuning:

  • Parameters are interdependent. For example, combining batch norm and dropout together may not work as well as independently.
  • Learning rate should change proportionally to batch size.

Loss vs Other Metrics:

  • Can't use non-differentiable metrics (e.g. accuracy/precision/recall) as the loss
  • Relationship between the two can be complex.
  • Example: Cross-entropy vs accuracy. Cross-entropy might appear to hit a floor while accuracy keeps going higher. Why? Even small decreases in loss can raise the probability of the correct class enough to flip the predicted label to the ground truth, increasing accuracy. Small changes can still lead to improvements.

Stanford Lecture - Training Neural Nets I + II

Preprocessing (for images)

Steps (e.g. CIFAR-10, [32, 32, 3] images):

  1. Subtract the mean image (e.g. AlexNet) - a [32, 32, 3] array
  2. Subtract per-channel mean (e.g. VGGNet) - Mean along each RGB channel = 3 numbers
    • PCA / whitening is uncommon for images.
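A NumPy sketch of both options (random data standing in for CIFAR-10):

```python
import numpy as np

train = np.random.rand(1000, 32, 32, 3)  # stand-in for CIFAR-10 images

# AlexNet-style: subtract the mean image (one [32, 32, 3] array).
mean_image = train.mean(axis=0)
train_centered = train - mean_image

# VGGNet-style: subtract the per-channel mean (3 numbers).
channel_mean = train.mean(axis=(0, 1, 2))
train_centered = train - channel_mean
```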