CS7641 - Machine Learning - Supervised Learning

Supervised Learning (SL)

Definition: Take examples of inputs and outputs. Now, given a new input, predict its output.

Classification vs Regression

  • Regression: If output is continuous
  • Classification: If output is a small, discrete set

Classification Learning

Terms

| Term | Definition |
| --- | --- |
| Instances | Input. Vectors of attributes that define your input space. |
| Concept | Function that maps the input space to outputs. A concept is a mapping between objects in a world and membership in a set, which makes it a function. |
| Target Concept | The correct concept that maps instances to outputs. |
| Hypothesis Class | Set of concepts that you're willing to consider in order to find the target concept (e.g. binary or multiclass classification, but not regression output). |
| Sample | Training set. Set of all of our inputs paired with the correct output labels. Foundation for inductive learning. |
| Candidate Concept | A concept that you think is the target concept (e.g. "anyone with curly hair is credit-worthy"). |
| Testing Set | Set of instances used to test a candidate concept. |

SL1: Decision Trees

Representation vs Algorithm

  • Decision trees are a specific representation.
  • The algorithm is the method by which we build a decision tree.

Quiz: Running a decision tree

Work through each instance and give an answer based on the decision tree.

img

Note:

  • These examples are part of a test set.
  • The tree in the image is our candidate concept.
  • Some attributes don't matter and are thus never used.

Decision Tree Algorithm

"20 questions" example: We want to narrow possibilities with each questions asked. This motivates us to ask very general Qs that significantly narrows down possiblities.

Learning Algorithm

  1. Pick the "best" attribute. "Best" is the attribute that splits the training set in half.
  2. Ask question
  3. Follow the answer path.
  4. Go to 1, until you get to an answer.

Quiz: Picking the "best" attribute

img

Notes:

  • Candidates 1 & 2 may be equally bad. 1 creates pretty much the same distribution after split. 2 doesn't split at all.

Expressiveness

Example: Boolean expressions as decision trees

img

Expressiveness per Tree Types

img

| Type | Size (number of nodes) | Description |
| --- | --- | --- |
| n-OR ("any" fn) | linear, $n$ | Easy to solve |
| n-XOR ("odd/even parity" fn)* | $O(2^{n})$ | Hard to solve. No clever way to pick the right attribute (need to look at all attributes). |

Takeaway: We hope that our ML problems are more like n-OR problems than n-XOR problems, because n-XOR problems are hard and expensive.

* If number of attributes that are true is odd, then the output is true.

Expressiveness Continued

img

Q: Exactly how expressive is a decision tree?

Given:

  • n attributes (boolean) - $O(n!)$ (combinatorial) orderings of nodes
  • output is boolean
  • Q1: If mapped to a truth table, how many rows are needed?
  • A1: $2^n$ rows
  • Q2: How many ways are there to fill in the outputs (outputs are boolean)?
  • A2: $2^{2^n}$ (a really big number that grows very fast)

A: The hypothesis space of decision trees is very expressive. We need to search this space cleverly to find the best representation efficiently.

ID3 Algorithm

Loop:
    1. Pick best attribute A
    2. Assign A as decision attribute for Node
    3. For each value of A, create a descendant of Node
    4. Sort training examples to leaves
    5. Stop if examples perfectly classified
    6. Else iterate over leaves

Using Information Gain to Choose Best Attribute

img

Definition: Reduction in randomness over the labels of a set of data, based on knowing the value of a particular attribute.

Entropy: "Average amount of information conveyed by an event, when considering all possible outcomes"

  • Intuition: Measure of the amount of information. Reducing entropy means we are gaining more information. Goal is to maximize information gain by choosing attribute that lowers entropy the most.
  • Example: Fair vs. two-headed coin. A fair coin has a 50/50 chance of heads, giving it 1 bit of entropy. A two-headed coin has a 100% chance of heads, giving it zero entropy (no information).
    • The more of one label you have over another, the lower the entropy.

Formula:

  • $S$: Collection of training examples
  • $A$: Specific attribute
  • $v$: Specific values of an attribute (categorical, continuous)
$$ Gain(S,A) = Entropy(S) - \Sigma_v\frac{|S_v|}{|S|} Entropy(S_v) $$

Formula for entropy: $$ Entropy(S) = -\Sigma_v p(v) \log_2 p(v) $$
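To make the two formulas concrete, here is a minimal sketch (my own illustration, assuming discrete attribute values and log base 2, matching the coin example above):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_v p(v) * log2 p(v) over the label proportions."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v), splitting on attribute index attr."""
    subsets = {}
    for x, y in zip(examples, labels):
        subsets.setdefault(x[attr], []).append(y)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

print(entropy(["H", "T"]))   # 1.0 bit (fair coin)
print(entropy(["H", "H"]))   # 0.0 bits (two-headed coin; Python prints -0.0)
```

ID3's "pick the best attribute" step is then just an argmax of `information_gain` over the remaining attributes.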

ID3: Inductive Bias

Types of bias:

  1. Restriction bias: Hypothesis set that you care about.
    • For DT, includes all possible DT representations.
    • Does NOT include functions outside of DT (quadratic, etc).
  2. Preference bias: What sorts of hypotheses in the hypothesis set ($h \in H$) we prefer.
    • Inductive bias == preference bias

ID3's inductive bias - Types of preferences:

  • Prefers good splits near the top.
  • Prefers representations that return the correct answer over incorrect ones.
  • Prefers shorter trees to longer trees. This comes naturally from preferring good splits near the top.

Decision Trees: Other Considerations

img

Continuous Attributes

Q: How to build DTs with continuous attributes?

  • A: Create binary branching decisions based on ranges of the continuous attribute (e.g. age < 30).

Q: T/F, does it make sense to repeat an attribute along a path in a tree?

  • A1: False for discrete-valued attributes, because there is no information gain in asking the same question again.
  • A2: True for continuous attributes, because we can ask different questions about the same attribute. Nodes further down the branch could split on different ranges.

When to stop

Overfitting: Don't trust the data too much, because there is potential for noise. Overfitting tends to happen when trees get too big.

Solutions

  • Validation: Use cross-validation, or use a held-out validation set to verify whether we want to continue branching.
  • Pruning: Build the full tree first, and then prune it based on how much error the pruning will cost (measured on a validation set).
    • The pruned node's output is the mode (majority label) of the examples beneath it.

How to build DT using regression?

Regression implies continuous outputs. Information gain (defined over discrete labels) doesn't work well with continuous values.

Q: What's the splitting criteria for regression trees?

  • A: You can split on whatever most reduces the variance of the outputs.

Q: What's the output?

  • A: You can use the average of the outputs at the leaf.

Summary

img

SL2: Regression

Regression & Function Approximation

img

Etymology: "Regression to the mean"

  • A slope of less than 1 (e.g. 2/3) shows up in many natural phenomena, like the relation between parents' and children's heights.
  • In ML, "regression" has come to mean using a functional form to approximate a set of data points.

Regression in ML

Finding the best line (measured by least squared error) is achieved through calculus (i.e. taking the derivative of the error and setting it to zero).

Finding the best constant function

img

  • The image above shows how to derive the best constant function for the sum of squared errors: take the derivative and set it to zero to find the minimum (see the short derivation below).
  • In this case, the best constant is the average of all the $y$ values.
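A quick sketch of that derivation (setting the derivative of the squared error to zero):

$$ E(c) = \Sigma_i (y_i - c)^2, \qquad \frac{dE}{dc} = -2\Sigma_i (y_i - c) = 0 \;\Rightarrow\; c = \frac{1}{n}\Sigma_i y_i $$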

Polynomial Regression

Choosing the Order of Polynomial

img

  • The greater the order of polynomial used, the closer we can fit the data at the cost of sensitivity to noise (ie. overfitting).

Math: Polynomial Regression

img

img

  • Each data point $x$ is raised to every power up to the order of the polynomial; stacking these rows builds a matrix.
  • This matrix is then multiplied by the vector of coefficients, one coefficient per exponent.
  • Each row maps to one $y$ response.
  • $w$ is the vector of coefficient weights that gives the best fit (see the sketch below).
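A minimal sketch of that matrix setup using numpy's least-squares solver (the data here is made up for illustration):

```python
import numpy as np

def fit_polynomial(x, y, order):
    """Solve for the coefficient vector w that minimizes the squared error."""
    # Each row is one data point raised to powers 0..order (the design matrix).
    X = np.vander(x, N=order + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w  # w[k] is the coefficient on x^k

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x + 3 * x**2          # hypothetical noise-free data
print(fit_polynomial(x, y, 2))    # approximately [1. 2. 3.]
```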

Errors, and where they come from

Training data contains error that the true function $f$ doesn't account for, so what we observe is $f+\epsilon$.

Types of error sources:

  1. Sensor errors
  2. Malicious data
  3. Transcription error
  4. Unmodeled influences (e.g. fluctuations in housing prices, time of day, buyer's mood)

Cross Validation

Definition: Split the training data into n folds. Hold out one fold as the "test" (validation) data and train on the remaining folds; repeat with each fold and average the error. Choose the model with the lowest average error.
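A small sketch of k-fold cross-validation for picking a polynomial order (the folds, candidate orders, and error metric here are illustrative choices, not the course's exact procedure):

```python
import numpy as np

def cv_pick_order(x, y, orders, k=5, seed=0):
    """Average validation MSE over k folds for each order; return the order with the lowest."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    avg_error = {}
    for order in orders:
        errs = []
        for fold in folds:
            train = np.setdiff1d(np.arange(len(x)), fold)   # train on everything but this fold
            w = np.polyfit(x[train], y[train], order)
            errs.append(np.mean((y[fold] - np.polyval(w, x[fold])) ** 2))
        avg_error[order] = np.mean(errs)
    return min(avg_error, key=avg_error.get), avg_error
```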

Aside - Assumptions of SL:

  • Train/test sets are IID (independent and identically distributed)

Applied to housing example in video:

  1. Gap between train/cross-val error narrows up to 3-4th degree
  2. Higher order polynomials begin overfitting to the training data at the expense of generalization on unseen examples.

img

Other Input Spaces

img

  • Regression can handle discrete, vector or scalar attributes.
  • Handling categorical: Enumerate (OK), or 1-hot encode (Better) - because enumeration gives categories a rank.

Summary

img

SL3: Neural Networks

video

Example of a perceptron (binary classifier with 3 inputs + weights)

img

Quiz: Example inputs and estimated y

img

How powerful is a perceptron unit?

Image: Representation of the activation space given 2 inputs + weights. The unit generates a half-plane.

img

Takeaway: Perceptrons are linear functions that compute a hyperplane decision boundary.

Perceptron Quizzes

Quiz 1: If we focus on $x_1 \in \{0,1\}$ and $x_2 \in \{0,1\}$, what is $y$?

  • A: AND Logic

Quiz 2: If $y$ is an OR function, what would we set $w_1$, $w_2$, $\theta$?

  • A: (Many answers work) $w_1 = 0.5$, $w_2 = 0.5$, $\theta = 0.5$, which lowers the threshold so that either input alone can activate the unit.

Quiz 3: Given a NOT logic (one variable), what should we set $w_1$ and $\theta$?

  • A: $w_1 = -1$ (turns 0 into 0, 1 into -1), and $\theta = 0$, so the zero activation meets the threshold but the negative activation falls below it.

Takeaway: AND/OR/NOT logic can each be expressed as a single perceptron unit; XOR requires a network of units (next quiz).

XOR as Perceptron Network Quiz

img

In [16]:
# Perceptron Quiz answer in code

def xor_network(x1, x2, w1, w2, w_and, theta) -> int:
    # Weighted inputs plus the output of a hidden AND unit, weighted by w_and
    activation = x1*w1 + x2*w2 + bool(x1 and x2)*w_and
    return 1 if activation >= theta else 0

def test_xor(w1, w2, w_and, theta): 
    assert xor_network(0,0, w1, w2, w_and, theta) == 0
    assert xor_network(0,1, w1, w2, w_and, theta) == 1
    assert xor_network(1,0, w1, w2, w_and, theta) == 1
    assert xor_network(1,1, w1, w2, w_and, theta) == 0
    print("Passed all tests")

w_1 = 1
w_2 = 1
w_and = -2
threshold = 1   
test_xor(w_1, w_2, w_and, threshold)
Passed all tests

Answer to XOR question:

  • We essentially want the network to behave like OR logic, except when the AND of the inputs is 1. The previous quiz showed that OR logic can be created by setting $w_1$ and $w_2$ to 1 and $\theta$ to 1.
  • We can therefore add a negative weight to counteract the OR logic by setting $w_\text{and}$ to -2, which cancels out $x_1 w_1 + x_2 w_2$ when both inputs are 1 and gives an activation of 0. img

Perceptron Training

Definition: Given examples, find weights that map inputs to outputs.

2 ways in which perceptron training is done:

  1. Perceptron Rule: Using the threshold ($\theta$)
  2. Gradient Descent / Delta Rule: Using the "unthresholded" value.

Perceptron Rule

Goal: How to set the weights of a single unit so that it matches a training set.

img

Components:

  • $x$: input (single unit for this example)
  • $w_i$: Weight of $i$-th unit
  • $y$: Target
  • $\hat{y}$: Prediction
  • $\eta$: Learning rate. Controls the size of each update.
  • $\theta$: Folded into the weights via a math trick: add a constant input of 1 (a bias) so we don't have to treat the threshold separately.

(1) Weight change formula: Changing weight of $w_i$ by $\Delta w_i$. This is what we iterate on while there is error.

$$ w_i = w_i + \Delta w_i $$

(2) Defining weight change amount - First, calculate $\hat{y}$.

  • If y and $\hat{y}$ are equal, then we don't need to change the weight.
  • If they are not equal, then we get either a positive or negative value.
$$ \hat{y} = (\Sigma_i w_i x_i \ge 0) $$

(3) Determine the weight change based on the difference between $y$ and $\hat{y}$, multiplied by the input value $x_i$ and scaled by the learning rate $\eta$.

$$ \Delta w_i = \eta (y-\hat{y})x_i $$
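Putting the three steps together, a minimal perceptron-rule training loop might look like this (my own sketch; the AND data at the bottom is a made-up check, with the bias folded in as a constant input of 1):

```python
import numpy as np

def perceptron_rule(X, y, eta=0.1, max_iters=100):
    """Train a single threshold unit with the perceptron rule.

    X includes a bias column of 1s (the theta-as-weight trick); y is 0/1.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        errors = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) >= 0 else 0
            w += eta * (y_i - y_hat) * x_i      # delta w_i = eta * (y - y_hat) * x_i
            errors += int(y_i != y_hat)
        if errors == 0:                          # finite convergence if linearly separable
            break
    return w

# Hypothetical example: learn AND (last column is the constant bias input)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w = perceptron_rule(X, y)
print([(1 if np.dot(w, x) >= 0 else 0) for x in X])  # [0, 0, 0, 1]
```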

Linearly Separable & Perceptron Rule

Definition: All examples can be separated by a linear hyperplane. This gets more complicated to visualize with 4+ dimensions.

Perceptron Rule: If we have data that is linearly separable, then the perceptron rule will find this hyperplane.

Q: What if the data is not linearly separable?

  • A: The rule may never converge; in practice you continue until an iteration limit is reached (choosing that limit is not easy).

Gradient Descent

Motivation: More robust to linearly non-separable datasets.

  • Key is to not use thresholding.

img

Components:

  • $\alpha$: Activation, the weighted sum $\Sigma_i w_i x_i$
  • $y$: Target
  • $\hat{y}$: Estimated output, ${\alpha \ge 0}$
  • $E(w)$: Error metric

(1) Calculate the error metric. Goal is to minimize this error.

  • $\frac{1}{2}$ is just a helper to make deriving the partial derivative easier.
$$ E(w) = \frac{1}{2} \Sigma_{(x,y)\in D} (y-\alpha)^2 $$

(2) Find the partial derivative of above.

  • Why? We do this to find the best weights to minimize the error.
$$ \frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \Sigma_{(x,y)\in D}(y-\alpha)^2 $$

(3) We get the result - The derivative of the error $E$ with respect to $w_i$. The result looks pretty similar to the perceptron rule.

$$ \frac{\partial E}{\partial w_i} = \Sigma_{(x,y) \in D} (y-\alpha)(-x_i) $$
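A sketch of batch gradient descent using that derivative (stepping opposite the gradient; the learning rate and epoch count are arbitrary choices of mine):

```python
import numpy as np

def delta_rule(X, y, eta=0.01, epochs=1000):
    """Gradient descent on E(w) = 1/2 * sum (y - a)^2, where a = X @ w is the unthresholded output."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        a = X @ w                 # activations for every example
        grad = -(y - a) @ X       # dE/dw_i = sum (y - a)(-x_i)
        w -= eta * grad           # move opposite the gradient
    return w
```

For classification you would threshold the trained activations afterwards, e.g. $\hat{y} = (Xw \ge 0)$.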

Comparison of learning rules

| Type | Learning Rule | Pros | Cons |
| --- | --- | --- | --- |
| Perceptron Rule | $\Delta w_i = \eta (y-\hat{y})x_i$ | Guaranteed finite convergence | Requires linear separability |
| Gradient Descent | $\Delta w_i = \eta (y-\alpha)x_i$ | More robust to non-linearly separable data | Likely to converge only to a local optimum |

Quiz: Why not do gradient descent on $\hat{y}$?

  • Answer: Because $\hat{y}$ is the thresholded (0/1) output, it is a discontinuous step function of the weights and hence non-differentiable. An example of how this looks on a 2D plot is below.
  • Significance: We can use a sigmoid instead, which bends the non-differentiable, disjointed step into a smooth curve.

img

Sigmoid Function - Differentiable Threshold

img

(1) Sigmoid activation function $$ \sigma(x) = \frac{1}{1+e^{-x}} $$

(2) Derivative of the sigmoid function $$ \sigma'(x) = \sigma(x)(1-\sigma(x)) $$

  • As the sigmoid activation ($\alpha$) reaches infinity, it goes to 1
  • As it approaches negative infinity, it goes to 0
  • At zero, it is $\frac{1}{1+e^{0}}$, which is 0.5.
  • Between about -5 and 5, it creates a smooth curve that makes gradient descent possible (i.e. a continuous, differentiable line).

Extra: How to take the derivative of the sigmoid function. Reference: TDS
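A tiny sketch of the sigmoid and its derivative:

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^-x), a differentiable stand-in for the hard threshold."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067 0.5 0.9933]
```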

The Illustrated Neural Network

Key Points:

  • Each node is a sigmoid unit.
  • Each node is differentiable, so the entire network becomes differentiable.
  • Backpropagation: The act of sending the error information from the output back to the input, while updating the weights of each activation node in between.
    • "Computationally beneficial organization of the chain rule"
  • Local minima issue: High dimensional complex problems can have many local minima, hence the result of gradient descent can get stuck in the local but not global minima.

Advanced Methods to Optimize Weights

Alternatives/supplements to vanilla gradient descent:

  1. Momentum: Overcome noisy gradients (ie. local minima) by adding an additional hyperparameter that controls the amount of history (momentum) to include in the update equation.
    • A large momentum (e.g. 0.9) will mean that the update is strongly influenced by the previous update, whereas a modest momentum (0.2) will mean very little influence
  2. Higher order derivatives: The gradient is the first-order derivative for a multivariate objective function. Instead, we can use the second/third order derivative.
  3. Randomized optimization: Covered in future sections.
  4. Penalty for "complexity": Add penalty term for adding more nodes and layers, and/or large numbers.
    • Larger numbers are more likely to overfit, bc it lets you represent arbitrarily complex functions.

Restriction Bias

Definition: Representational power of the data structure you're using. Tells you the set of hypotheses you're willing to consider.

  • Lower bias == more hypotheses considered and vice versa.
  • Relevant to all supervised learning algorithms (not just neural nets).

Types of neural nets and their representational power:

  1. Perceptron: Linear function. Only considers planes.
  2. Perceptron network: Introduces boolean logic (XOR, etc). Starts capturing nonlinear functions.
  3. Sigmoid: More complex, less restriction
  4. Continuous function: Can be handled by a network with one hidden layer (given enough hidden units).
  5. Arbitrary function: e.g. discontinuous functions. Can be represented with two hidden layers.

Preference Bias

Definition: The algorithm's selection (ie preference) of one representation over another.

Preference bias of neural nets: Small Random values that provide simpler explanations.

- SMALL: Favors small weight values, because networks with small weights have lower complexity. Larger weights tend to overfit.
- RANDOM: Variability. Gives the possibility of exploring outside a local minimum.
- Similar to __Occam's Razor__: "Entities should not be multiplied unnecessarily"
    - The algorithm prefers stopping once it fits the data well, before letting the weights grow larger.

Summary of Neural Nets

  1. Perceptrons - threshold unit
  2. Networks can produce any boolean function.
  3. Perceptron rule - Converges in finite time for linearly separable data.
  4. General differentiable rule - Backpropagation and gradient descent.
  5. Preference / restriction bias of neural networks.

SL4: Instance Based Learning

| Pros | Cons |
| --- | --- |
| Remembers (no catastrophic forgetting) | Cannot generalize |
| Fast (no learning) | Slower at runtime |
| Simple (Occam's Razor) | Easily overfits (remembers noise as well) |
|  | Struggles with ambiguity (e.g. two identical x's with different y's, multiple near-equidistant neighbors with different y values) |

K-Nearest Neighbor Algorithm

Given:

  • Training Data (set of pairs): $D=\{x_i, y_i\}$
  • Distance metric: $d(q, x)$
  • Number of neighbors: $K$
  • Query point (new data): $q$

(1) Find the $K$ nearest neighbors of $q$ $$ NearestNeighbor = \{\, i : d(q,x_i) \text{ is among the } k \text{ smallest} \,\} $$

(2.1) Classification: Return the (weighted) vote of $y_i \in \text{NearestNeighbor}$ (ie. Mode)

(2.2) Regression: Return the (weighted) mean of $y_i \in \text{NearestNeighbor}$
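A minimal KNN sketch following those steps (p-norm distance; ties beyond the k-th neighbor are ignored here, unlike the quiz below, which includes them; the data rows are my reconstruction of the quiz's table and should be treated as illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, q, k, p=2, regression=True):
    """Return the mean (regression) or mode (classification) of the k nearest neighbors of q."""
    dists = np.sum(np.abs(X_train - q) ** p, axis=1) ** (1.0 / p)   # Minkowski / p-norm distance
    nearest = np.argsort(dists)[:k]
    if regression:
        return float(np.mean(y_train[nearest]))
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]                                 # vote / mode

# Data consistent with the target function y = x1^2 + x2 (reconstructed from the quiz)
X = np.array([[1, 6], [2, 4], [3, 7], [6, 8], [7, 1], [8, 4]], dtype=float)
y = np.array([7, 8, 16, 44, 50, 68], dtype=float)
print(knn_predict(X, y, q=np.array([4.0, 2.0]), k=3))   # 42.0, the Euclidean 3-NN average
```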

Quiz 1: Runtime vs Space

Takeaways:

  1. KNN is faster to learn (learning time is constant, space is linear), but slower at querying (query time is $\log n$, query space is constant).
    • If you query $n$ times overall, it will be less efficient than linear regression.
    • Lazy-learner: Essentially only does work at query time.
  2. Linear Regression is slower to learn (learning time is linear, space is constant), but faster at querying (time + space are constant).
    • Eager-learner: Wants to learn right away.

Quiz 2: Testing KNN on Regression

How to calculate Manhattan/Euclidean distance: Both are derived from the Minkowski distance ("p-norm distance"), with different $p$ values:

  • Given:
    • $x$: Input
    • $y$: Target
    • $p$: Order (1=Manhattan, 2=Euclidean)
  • Minkowski distance: $d(x,y) = \sqrt[p]{\sum_{i=1}^{n}|x_i - y_i|^p}$
  • Manhattan distance: $d(x,y) = \Sigma_{i=1}^n |x_i - y_i| $
  • Euclidean distance: $d(x,y) = \sqrt{\Sigma_{i=1}^n|x_i - y_i|^2}$

Using the first row as an example (x=(1,6), y=7):

  • Manhattan distance: $d(q,x) = |1-4| + |6-2| = 3 + 4 = 7$
  • Euclidean distance: $d(q,x) = \sqrt{|1-4|^2 + |6-2|^2} = \sqrt{25} = 5$
    • NOTE: In the video they use the squared Euclidean distance for readability.

Finding 1/3-NN:

  • How to find 1-NN: Find the point(s) with the lowest distance. If more than one ties, average their $y$ values.
  • How to find 3-NN: Average the $y$ values of the points with the 3 lowest distances (including ties).
  • Manhattan:
    • 1-NN: lowest distance = 4, with 2 tied datapoints: (2,4)→8 and (7,1)→50. $y=\mu(8, 50)=29$
    • 3-NN: distances [4, 6], with ties, giving 4 datapoints (checked in the MD column). $y=\mu(8, 16, 50, 68) = 35.5$
  • Euclidean:
    • 1-NN: lowest (squared) distance = 8, 1 datapoint: (2,4)→8. $y=8$
    • 3-NN: (squared) distances [8, 10, 20], 3 datapoints (the ones checked in the image). $y=\mu(8, 50, 68) = 42$

Error:

  • Target function: $y=x_1^2+x_2$
  • Given $q=(4,2)$, $y=18$

Takeaways:

  • KNN, using both Euclidean and Manhattan distance, was unable to predict anywhere near the target.
  • Prediction of KNN is heavily dependent on 1) How many neighbors are used, and 2) The distance metric used.
  • This implies an underlying bias of KNN.

KNN's Preference Bias

  1. Locality: Near points are similar (e.g. real-estate example)
    • Embedded in the distance function. Why would a specific distance function be better?
    • There is always one best distance function for a specific problem (given all else is fixed).
  2. Smoothness: Expects smoothly changing behavior between data points
  3. All features matter equally (e.g. Minkowski distance) - but this is not always the case
    • For example, with $y=x_1^2+x_2$, any change in $x_1$ causes a large change in $y$ compared to $x_2$.

Curse of Dimensionality

Definition: As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.

Example: Number of neighbors needed to cover the same dimension space.

  1. 1D: Each point covers 1/10 of the space
  2. 2D: Each point covers 1/100 of the space
  3. 3D: Each point covers 1/1000 of the space

Some Other KNN Stuff

Distance metric

  • Your choice of distance function matters. Choosing the wrong distance function will yield poor behavior.
  • Euclidean / Manhattan are useful for regression. Why? Because they assume you have continuous, numerical features.
  • Use domain knowledge to decide which distance function makes sense. More info on how to decide in this link.

Number of neighbors + weighted average

  • What if $K=n$? You end up with a constant function, ie. simple average of all $y_i$.
  • Weighted averages: Instead of weighting all neighbors equally, put more weight on those closer to $q$.
  • Locally weighted regression: Instead of averaging, take the data points closest to $q$ and fit a line to them.
  • Takeaway: KNN is useful, because it allows you to take local information and build arbitrary functions/concepts.

Example: Locally weighted linear regression

Summary of Instance-based Leaners

  • Lazy vs eager learners
  • KNN
  • Nearest neighbor: similarity (ie. distance) functions
  • NOTE: The number of neighbors and the similarity function are both ways to capture domain knowledge.
  • Classification vs Regression (KNN can handle both)
  • Averaging (non-weighted vs weighted)
  • Combining local knowledge to fit a regression line.
  • Curse of dimensionality
  • Domain knowledge matters - this guides you to choosing which algorithm will work best.

SL5: Ensemble Learning: Bagging and Boosting

Bagging

Idea: Build up a bunch of rules and combine them until you get something that's good enough.

Pseudocode:

1. Learn over a subset of data (uniform & random) and apply a learner
...(repeat)
2. Combine all the rules (How? Regression: Mean, Classification: Mode)
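A minimal bagging sketch using the housing-style polynomial example (the subset size, learner count, and cubic order are arbitrary choices of mine):

```python
import numpy as np

def bagged_poly_fit(x, y, order=3, n_learners=10, frac=0.6, seed=0):
    """Step 1 (repeated): fit each learner on a uniform random subset of the data."""
    rng = np.random.default_rng(seed)
    n, m = len(x), int(frac * len(x))
    models = []
    for _ in range(n_learners):
        idx = rng.choice(n, size=m, replace=True)       # bootstrap-style random subset
        models.append(np.polyfit(x[idx], y[idx], order))
    return models

def bagged_predict(models, x_query):
    """Step 2: combine the rules with the mean (regression); use the mode for classification."""
    return np.mean([np.polyval(w, x_query) for w in models])
```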

Example: Spam Email

Quiz: Ensemble Output

  • This is essentially the same as running KNN where $K=n$.

Bagging (Bootstrap Aggregation)

Takeaway:

  • Bagging averages out variance/noise in the data by fitting multiple learners to subsets of data.
  • A single higher-order regression will fit all the data, with more chance of overfitting.
  • In this case, the single regression fits the training data better but does worse on the test data; bagging is the reverse, meaning it generalizes better.

Boosting

Pseudocode (high-level):

1. Learn over a subset of data and apply a learner
2. Filter down to the "hardest" examples, i.e. the data points that result in the highest errors.
...(repeat)
3. Combine all the rules (How? Regression: WEIGHTED mean, Classification: WEIGHTED Mode)

Defining "hardest" examples

Classification error = # of mismatches

Given:

  • $x$: Datapoint
  • $h$: Hypothesis (from learner)
  • $c$: Ground truth
  • $D$: Distribution (of examples being drawn, we don't know how they're being drawn though)
$$ Pr_D [h(x) \ne c(x)] $$

"Weak" Leaner

Definition: A learning algorithm that, under any distribution, will find a hypothesis that performs better than chance on the current distribution.

  • Error: No matter what distribution you get ($\forall$), you're always going to get an error rate that is less than a half (by at least some very small number $\epsilon$).
  • Takeaway: You're always learning "something" from the learner (ie. better than chance).

"Does better than chance" expressed more formally: For all distributions, your learner will have an expected error (hypothesis disagreeing with the real target of a single sample)

$$ \forall_D\; Pr_D [h(x) \ne c(x)] \leq 1/2 - \epsilon $$

Quiz: Weighted Error

Quiz: Weak Learners (TODO: Come back to this one - I don't understand it)

  1. Find a distribution over the 4 examples ($x_i$) such that a learning algorithm that has to choose between the 3 hypotheses ($h_i$) will be able to find one that does better than chance (i.e. expected error less than half).
  2. Find a distribution over the 4 examples, a learning algorithm (looking at $x_{1,2,3}$) would not be able to return one of them that does better than chance.

ANSWER:

  1. Good distribution: If we give equal weight to each example, $h_1$ gets 3 of 4 correct, which is better than chance. Its expected error is 1/4.
  2. Evil distribution: I don't understand how they got [1/2, 1/2, 0, 0]?

Some explanation in Slack thread:

One can ask a simple question: given a set of objects, X, what is the probability of seeing any particular object, x, that is drawn from X according to some distribution D? By definition, D, induces a value for each x that is between 0 and 1 inclusive where the sum of all these values is 1.

In our case, each x has some corresponding label; let's call each one y.

A weak learner is simply a learning function that given X, D, Y, and a set of hypotheses, H, will find some hypothesis, h, drawn from H, such that the probability of drawing an x such that h(x) == y is better than chance.

Boosting Algorithm (Detailed)

Pseudocode (left side)

  1. For a given training set, assume that each target has a value of either -1 or +1 (binary class)
  2. Enter a loop for $T$ times ($t$ is each time step)
    1. Construct a distribution over your examples ($D_t$)
    2. Find a weak classifier ($h_{t}(x)$) with a small amount of error ($\epsilon_t$) given a specific $D_t$. (This is where we iteratively learn the error from the previous learners).
      • The error rate $\epsilon_t$ is expected to be smaller than chance (i.e. 50%)
  3. Output a final hypothesis $H$

Constructing the Distribution (right side)

  1. Base case: $D_{1}(i)$ is initialized with a uniform distribution.
    • Why? We don't know whether one example is more important than another, so just assume they're all equal.
  2. At each time step, we construct a new distribution over the examples ($D_{t+1}(i)$):
    • The big equation is how the new distribution is calculated. The important part is that for a binary classification problem, the product $y_i h_t(x_i)$ returns 1 if they agree and -1 if they disagree.
    • Alpha ($\alpha_t$) is a positive constant derived from the error $\epsilon_t$ (the standard choice is $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$).
    • If y and h agree, the distribution for that example ($D(i)$) decreases
      • Math: Agreement ($y_i h_t(x_i) = 1$) is multiplied by alpha and then negated; $e$ raised to a negative exponent gives a number less than 1 ($e^{-n} = 1/e^n$).
    • Denominator $Z_t$ normalizes the result
  3. Significance: For each weak learner, it puts more weight on examples that it gets wrong and less weight on those it gets right.
    1. If there is agreement on one example but at least one other one disagrees, $D_t(i)$ will DECREASE.
    2. If there is disagreement on one example but at least one other one agrees, $D_t(i)$ will INCREASE.
    3. If we assume each weak learner always gets an error rate less than chance, the new weak learner will get right some of the examples that the learner at time $t-1$ was getting wrong.

Constructing the final hypothesis

We generate the final hypothesis by taking the sum of each $\alpha_t$ times its hypothesis $h_t(x)$, then running it through a sign function (which returns the sign of the number: -1, 0, or +1).

Formula for final hypothesis $H$:

$$ \begin{align} H_{final}(x) = sgn(\Sigma_t \alpha_t h_t (x)) \end{align} $$
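Pulling the pieces together, a compact boosting sketch (my own; the `weak_learner` argument is a placeholder for any better-than-chance learner, and $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ is the standard AdaBoost weighting):

```python
import numpy as np

def adaboost(X, y, weak_learner, T=10):
    """y must be in {-1, +1}; weak_learner(X, y, D) returns a function h with h(X) in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                    # D_1: uniform distribution over examples
    ensemble = []
    for _ in range(T):
        h = weak_learner(X, y, D)
        pred = h(X)
        eps = D[pred != y].sum()               # weighted error under D_t
        alpha = 0.5 * np.log((1 - eps) / eps)  # standard AdaBoost weight
        D = D * np.exp(-alpha * y * pred)      # agreement -> smaller weight, disagreement -> larger
        D = D / D.sum()                        # Z_t normalization
        ensemble.append((alpha, h))
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in ensemble))
```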

Boosting Example

Boosting Intuition

  • Each weak learner does better than chance.
  • In the new distribution, examples are weighted differently depending on whether the weak learner at time $t$ classified them correctly: correct ones are weighted down and incorrect ones are weighted up.
  • New learner at $t+1$ will now focus on getting those that the previous learner got wrong.
  • The sum of alpha and hypothesis gives our final hypothesis $H$.
  • There is a property of information gain, that each iteration must always be better than chance w.r.t error.

Boosting Summary

  • Bagging: Subsample bunch of examples and train different learners on each. Merge them with mean or mode (depending on regression/classification).
  • Combining simple learners gives you an ensemble learner that can fit complex data.
  • Boosting is really good.
  • Ensemble learners are agnostic to what type of weak learners to use.
  • Error is related to distribution $D$ for Boosting.
  • Overfitting: In practice, boosting is resilient to overfitting.

SL6: Kernel Methods & SVMs

  • Assumption: Dealing with linearly separable data + binary labels.
  • The middle line explains the data correctly, but is the least committed to the sample set. This helps prevent overfitting.

SVM

Supporting material/notes for this section: My notes from ISYE6501

Linear Classifiers

Given:

  • $y$ - label
  • $w$ - parameters of the hyperplane
  • $b$ - bias
$$ y = w^T x + b $$

Decision Boundaries and Margins:

  • Decision boundary is zero, i.e. $w^T x + b = 0$. (This is the middle line.)
  • Margin hyperplanes (assuming binary classification (labels 1 and -1) + linearly separable data):
    1. Hyperplane for positive class: $w^T x+b=1$
    2. Hyperplane for negative class: $w^T x+b=-1$
  • We want to maximize the margin, ie distance between the two margin planes.

Quiz: Solving for the margin

Q: Solve for the line that is described by their difference (...?)

Answer:

  • Essentially asking for $W^T(x_1 - x_2)$ which is 2 (given class 1 and -1)
  • But the quiz wants the distance between the hyperplanes - with some manipulation, it is $\frac{2}{||w||}$ (bottom right)
  • Explanation: We are solving for the parameters of the two hyperplanes so that we maximize their distance (i.e. the length of the perpendicular between the parallel hyperplanes), with the constraint that all classifications are still correct. This distance is called the margin.
  • tl;dr - SVM's goal is to find a decision boundary that maximizes the margin, subject to the constraint that all points are correctly classified

SVM as an Optimization Problem

Idea: We can turn the $\max \{ \frac{2}{||w||}\}$ problem into a quadratic programming problem.

  • Why? Quadratic programming problems are easier to solve.

Takeaway: All 3 optimization models are solving the same problem, but quadratic is easier to solve:

  1. $\max \{ \frac{2}{||w||}\}$
  2. $\min\{ \frac{1}{2}||w||^2\}$
  3. Bottom $\max \{W(\alpha)\}$ equation in image above (complicated, trust instructors on this one)

SVM Continued: Maximizing alpha, and solving for $w$

Goal: After finding the alphas that maximize the quadratic optimization program, you can solve for the parameters of the hyperplane ($w$). Once we have those, we can solve for $b$ by plugging in $w$ and $x_i$. This gives us $w^Tx+b$.

Some considerations:

  • Most alphas are zeros (assumption given by instructors), which means that $x$'s with zero alphas won't contribute to $w$ at all.
  • Put it another way: Only a few $x$'s (with non-zero alpha) actually contribute to $w$. These data points are the support vectors.

The Kernel Trick

Motivation: We can separate non-linearly separable data points by using the kernel trick.

Math behind the kernel trick

  1. We first transform each data point into a higher dimension, given:
    • $q$: One data point (ie + or -) that contains an x and y coordinate ($q_1$ and $q_2$)
    • $\Phi$: Transform function
    • We can create a triple for each data point: $$ \Phi(q) = (q^2_1, q^2_2, \sqrt{2} q_1 q_2) $$
  2. If we take the dot product of the transformed points, $\Phi(x_i)^T\Phi(x_j)$ (as appears in the optimization program), it simplifies to $(x_i^Tx_j)^2$.

  3. The final outcome $(x_i^Tx_j)^2$ is the equation of a circle, which means we can use it to separate the 2 classes in the original problem.

Significance of the kernel trick

  1. We can swap out the original $x_i^Tx_j$ with non-linear functions. There are various kernels we can use.
  2. Using the trick is computationally efficient, because we're actually NOT converting the data into a higher dimensional space. The expansion above just proves that it works; in practice we use the final expression (i.e. $(x_i^Tx_j)^2$) as a surrogate for the higher-dimensional transformation.
  3. The $x_i^Tx_j$ part is essentially measuring the "similarity" of two data points. We are injecting domain knowledge into the optimization problem about what "similarity" means.
  4. Kernels must follow the Mercer Condition: "Mercer's theorem states that a symmetric, and positive-definite matrix can be represented as a sum of a convergent sequence of product functions."

Summary: SVMs

  • Margins: Large margins tend to avoid overfitting. Hence, we prefer the linear separator that maximizes the margin.
  • We used quadratic programming to find the maximum margin; for non-linearly separable datasets (e.g. a circle), we add the kernel trick.
  • Support vectors are the points from the input data that are needed to define the maximum-margin separator.
  • SVMs are "eager-lazy" learners. They only use a subset of the data; the other data points don't matter (their alphas are zero).
  • Kernel tricks allow us to emulate pushing the data into a higher dimensional space.
  • Kernels represent our domain knowledge.

Boosting & SVMs

Takeaway:

  • With boosting, creating more weak learners increases the confidence of the model's answers, particularly with the harder examples.
  • This essentially maximizes margin between classes, similar to SVMs.

Quiz: When do Boosting overfit?

  • Answer 1: If using NN with lots of params (ie. "strong" learner on its own)
    • Why? The neural net may completely fit the training data, so there is no redistribution of the "hard" examples; boosting will just return the same network over and over.
    • Takeaway: If the underlying learner is built to have overfitting conditions (lots of params), boosting may overfit.
  • Answer 2: Boosting tends to overfit in the presence of pink (uniform) noise.

SL7: Computational Learning Theory

Goals - Find a formal way of addressing:

  1. Defining learning problems.
  2. Show how specific algorithms work / don't work for a problem.
  3. Show whether a problem is fundamentally hard.

Definitions:

  1. Computational complexity - How much effort is needed for a learner to converge to a correct (or almost correct) hypothesis?
  2. Sample complexity (batch) - How many training examples do we need in order for the learner to create a successful hypothesis?
  3. Mistake bounds (online) - How many mistakes can a learner make over an infinite run?

Tools: Use the same tools used for analyzing algorithms in CS.

Resources that matter in computational learning theory:

  1. Time (eg. $O(n \log{n})$)
  2. Space (eg. $O(n^2)$)
  3. Data (i.e. fewer samples needed to learn is better).

Defining Inductive Learning (learning from examples)

  1. $P$ of successful training (eg. $P_{success}=1-\delta$).
  2. Number of examples available for training.
  3. Complexity of hypothesis class
  4. Accuracy to which target concept is approximated (ie. $\epsilon$)
  5. Manner in which training examples presented (eg. batch vs online).
  6. Manner in which training examples selected (see next section)

Selecting Training Examples

How training examples are selected, from a teacher/learner paradigm:

  1. Learners asks questions to teacher
    • "I have $x$, what is $c(x)$?"
  2. Teacher gives examples to help the learner.
    • "Here is $x$, it can be solved by $c(x)$"
  3. Fixed distribution
    • $x$ chosen from $D$ by nature.
  4. Evil - worst distribution.

Quiz: 20 Questions

In the context of the game "20 questions":

  • $H$ is the hypothesis class of all possible people the student can choose (ie. target class).
  • $X$ is the set of questions that could be asked (ie. training examples)

Q1: How many questions does the student need to ask, if Teachers chooses the questions?

  • Answer: Just one, since the teacher knows the answer (i.e. tell the student to ask "Is the person {target class}?").
  • Takeaway: The teacher can provide maximum information gain to the student since they know the answer.

Q2: How many questions must the learner ask before narrowing the possibilities down to 1 person?

  • Answer: $\log_2|H|$, because the goal is to eliminate as many hypotheses as possible with each question. The best the learner can do is halve the space each time, like binary search (which is $O(\log{N})$).

Teacher with Constrained Queries

Given:

  • $X$: $x_i$ of K-bit inputs
  • $H$: Conjunctions of literals or their negations
    • Example: $x_1 \land x_3 \land \neg x_5$ (x1 and x3 and not x5)

Q1: Given the table of examples and hypotheses, figure out the combination of positive/absent/negated to get the hypothesis.

Q2: Harder problem - we don't have access to hypothesis beforehand.

  • Answer: Unless the student gets a 1, you get a lot of useless answers.

Takeaway: A constrained set of queries makes it hard to consistently cut the hypothesis space in half, so realistically it is going to take more than log N time (possibly exponential).

  • Each question must be asked like a data point. We can't ask something like "Is x1 positive?"

Learner with "Mistake Bounds"

Reference: Mitchell, Ch. 7 "Computational Learning Theory" pg. 220.

Key Idea: "How many mistakes will the learner make in its predictions before it learns the target concept?"

Find-S algorithm (covered in lecture and book pg. 221):

Key Idea: Set hypothesis as specific as possible, then generalize as you see more positive examples.

  1. Initialize $h$ to the most specific hypothesis consisting of boolean literals, conjunctions and negations: $x_1 \land \neg x_1 \land x_2 \land \neg x_2 \ldots$
  2. For each positive training instance:
    • Remove from $h$ any literal that is not satisfied by $x$
  3. Output $h$

Takeaway:

  • Converges to a hypothesis that makes no errors, given data is noise-free and concept class is in hypothesis space $C \subseteq H$
  • The algorithm never makes more than $n+1$ mistakes, where n is the number of literals (K in the slide).

Version Spaces

  1. True hypothesis: True concept (c) in hypothesis space H
  2. Training set: S (subset of possible inputs X), combined with the true class for each x
  3. Candidate hypothesis: One hypothesis that is a subset of all hypotheses
  4. Consistent learner: Produces a hypothesis that is consistent with the training data (gets the correct class for every training example).
  5. Version space: Subset of hypotheses from H consistent with the training examples S.

Quiz: Version Spaces

PAC Learning

Error of h

Takeaway: True error penalizes the hypothesis in proportion to the probability (under the distribution $D$) of drawing an example it gets wrong.

  • No penalty for examples the learner would have gotten wrong but that never occur under $D$
  • Only a small penalty for examples that occur rarely.

PAC Formula

Definition: A concept is PAC learnable if the learner can, with relatively high confidence (probability $1-\delta$), return a hypothesis with low error (at most $\epsilon$), in time that is polynomial in $1/\epsilon$, $1/\delta$, and the number of hypotheses.

How to know if problem is PAC learnable

Formula: A problem is PAC learnable if we can draw enough samples S that the version space of S is epsilon-exhausted (with high probability).

  • Epsilon exhaustion occurs IFF every hypothesis in the version space (VS(S)) has low error (i.e. less than epsilon).
  • If any hypothesis in the version space has error greater than epsilon, it is NOT epsilon-exhausted.

Quiz: Epsilon-exhausted VS

Haussler Theorem - Bound True Error

Key formula to find the bound: $$ m \geq \frac{1}{\epsilon} \left(\ln|H| + \ln\frac{1}{\delta}\right) $$

Takeaway: The Haussler theorem gives you a formula (boxed in red) that tells you how many training examples you need in order to meet the epsilon threshold (a small calculator is sketched below).

  • NOTE: This assumes the target concept is in the hypothesis space ($C \subseteq H$). If it isn't (the "agnostic" learning scenario), the result is more or less the same, with a somewhat worse (still polynomial) dependence on epsilon.
  • IMPORTANT: This also assumes $H$ is a finite hypothesis space. How to deal with an infinite hypothesis space is covered in the VC dimensions lecture.
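A one-liner to evaluate the bound (the example values are hypothetical choices of mine; they happen to reproduce the 40 mentioned in the quiz below):

```python
from math import ceil, log

def haussler_bound(h_size, epsilon, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return ceil((log(h_size) + log(1.0 / delta)) / epsilon)

print(haussler_bound(h_size=10, epsilon=0.1, delta=0.2))   # 40
```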

Quiz: PAC-learnable example

Takeaway: The answer (40) is much better than the size of the input space ($2^{10}$, i.e. 1024).

  • Also, this bound is agnostic to the distribution: it still works even if the distribution is not uniform.

Summary - Computational Learning Theory

  • How fundamentally hard is it to learn the problem.
  • Sample complexity: How many samples do we need in order to learn the problem.
  • Types of interactions:
    1. Learner picking the questions
    2. Teacher picking the questions (most efficient in info gain)
    3. Nature picks questions (based on distribution)
  • Mistake bounds: How many mistakes, vs how much data
  • Version spaces + PAC learning: Training vs Test vs True error (from distribution D)
  • Epsilon exhaustion: sample complexity bound ($m \geq \frac{1}{\epsilon} (\ln|H| + \ln\frac{1}{\delta})$)

SL8: VC Dimensions

Reference:

Motivation: In the previous lesson, we covered Haussler's theorem ($m \geq \frac{1}{\epsilon} (\ln|H| + \ln\frac{1}{\delta})$), but its success was limited to a finite hypothesis space (H). How do we handle very large, or infinite, hypothesis spaces?

Goal: Connect insight about VC dimensions to understanding sample complexity.

Infinite Spaces

Many learners actually have infinite hypothesis space, rather than finite:

  • linear separators (infinite number of lines)
  • neural networks (infinite number of weights)
  • DT with continuous inputs (syntactically)

Syntactic vs Semantic Hypothesis Spaces

The lecture mentions the difference between syntactic and semantic hypothesis spaces:

  1. Syntactic - All the ways the hypothesis could be represented. Infinite.
  2. Semantic - Actual different functions practically represented. Finite.

Significance: We can reduce the dimensionality of syntactic hypothesis space (infinite) into a finite hypothesis space. This is the underlying concept of determining VC dimensions.

VC Dimensions

From Mitchell (pg.215)

VC dimension $VC(H)$, of hypothesis space $H$ defined over instance space $X$ is the size of the largest finite subset of $X$ shattered by $H$. If arbitrarily large finite sets of $X$ can be shattered by $H$, then $VC(H) = \infty$

Definition from lecture: What is the largest set of inputs that the hypothesis class can "shatter", ie. label in all possible ways?

Shattering

From Mitchell (pg.214)

A set of instances S is shattered by hypothesis space $H$ IFF for every dichotomy of S there exists some hypothesis in $H$ consistent with this dichotomy.

  • If, for every possible labeling of a set of points, the hypothesis class contains a hypothesis that correctly separates them, it "shatters" those points.

Quiz: VC Dimension of Intervals

Question: Figure out the largest set we can shatter using the given hypotheses, given:

  • $X = \mathbb{R}$
  • $H = \{h(x) = x \in [a,b]\}$

Answer: $VC=2$

  1. $VC \geq 1$: Yes
    • Put bracket around 1 example: [+]
  2. $VC \geq 2$: Yes. All labelings of 2 input points can be captured by an interval:
    • [+ +]
    • [+] -
    • + [-]
    • - - []
  3. $VC \geq 3$? No
    • Cannot capture input points like + - + because the interval has to be contiguous.

Quiz: VC Dimension of Linear Separators

Question: Figure out the largest set we can shatter using the given hypotheses, given:

  • $X = R^2$
  • $H = \{ h(x) = W^Tx\geq\theta \}$

Answer: $VC = 3$

  • Three points can be shattered, but not four: e.g. the + - + - (checkerboard) labeling of four points can't be captured, since a line splits the plane into two contiguous half-planes.

VC Dimensions vs Number of Parameters

Takeaway: VC dimensions often align to the number of parameters in the hypothesis space.

  • Accounts for nonlinear separators. 2D circles = $VC\geq3$.

Quiz: VC Dimension of Convex Polygons

Takeaway: A convex polygon has infinite VC dimension: place any number of points on a circle, and any labeling can be captured by the polygon whose vertices are the positive points.

Sample Complexity & VC Dimension

Given:

  • $m$: Number of samples (training examples)
  • $\epsilon$: Target error rate
  • $\delta$: Failure probability (we want confidence $1-\delta$)
  • $VC(H)$: VC dimension of hypothesis space

We can determine the right sample size to meet the error rate $\epsilon$ and probability $1-\delta$ for an infinite hypothesis class.

$$ m \geq \frac{1}{\epsilon}\left(8\, VC(H) \log_2 \frac{13}{\epsilon} + 4 \log_2 \frac{2}{\delta}\right) $$

This can be simplified to reflect a finite hypothesis class.

$$ m \geq \frac{1}{\epsilon}(\ln{|H|} + \ln{\frac{1}{\delta}}) $$
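The infinite-space bound as code (hypothetical ε and δ of my choosing; VC(H) = 3 is the linear-separator example from above):

```python
from math import ceil, log2

def vc_sample_bound(vc_dim, epsilon, delta):
    """m >= (1/eps) * (8*VC(H)*log2(13/eps) + 4*log2(2/delta))."""
    return ceil((8 * vc_dim * log2(13 / epsilon) + 4 * log2(2 / delta)) / epsilon)

print(vc_sample_bound(vc_dim=3, epsilon=0.1, delta=0.05))   # 1899
```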

VC of finite H

Takeaway: H is PAC-learnable IFF VC dimension is finite. VC dimensions captures the idea of PAC learnability in a single concept.

From dev.to, "The VC of Finite Hypothesis Space":

If we denote the VC dimension of a finite hypothesis space by d, there have to be $2^d$ distinct concepts (as each different labelling can be captured by a different hypothesis in the class) - therefore $2^d$ is less than or equal to the number of hypotheses |H|.

Rearranging, d <= log2(|H|). So a finite hypothesis class gives us a finite VC dimension, and therefore makes things "PAC Learnable". There is a further proof that this goes the other way - H is PAC learnable if and only if the VC dimension is finite. VC dimension captures the notion of PAC learnability in a single concept.

Summary: VC Dimensions

  • VC dimension + shattering.
  • VC relates to hypothesis space parameters.
  • VC relates to finite hypothesis space size.
  • Sample complexity relates to VC dimension.
  • VC computing tricks
  • VC dimension captures the idea of PAC learnability.
  • NOTE: VC dimension is defined for spaces of binary functions

From CMU lecture notes:

Through our discussions of VC theory we have seen that we can improve generalization by controlling the complexity of the concept class H from which we are choosing a hypothesis. We saw that the shatter coefficient and VC dimension were useful measures of complexity because we could bound the performance of a learning algorithm in terms of these quantities and the amount of data we have.

SL9: Bayesian Learning

Reference:

Motivation:

  • Bayes theorem gives us a way to calculate the probability of a hypothesis based on its
    1. Prior probability
    2. Probabilities of observing various data given the hypothesis, and
    3. The observed data itself.

Key Idea: Find the hypothesis $h$ from hypothesis class $H$ that is the most probable given the data $D$. $$ argmax_{h \in H} P(h | D) $$

Bayes Theorem

Mitchell, p156

Bayes theorem ... provides a way to calculate the posterior probability P(h|D), from the prior probability P(h), together with P(D) and P(D|h)

Given:

  • $h$: hypothesis
  • $D$: Data
  • $P(h)$: Prior probability of $h$. Initial probability that hypothesis $h$ holds, before we observe the training data.
    • This reflects our domain knowledge (ex. similarity metric for KNN, feature importance in DT, NN structure, true positive of a test).
    • If we have no domain/background knowledge, we can assign the same prior probability to each candidate hypothesis.
    • Independent of $D$.
  • $P(D)$: Prior probability that $D$ will be observed.
  • $P(D|h)$: Probability of observing $D$ given some world in which hypothesis $h$ holds.
    • ie. Given a set of data $x_i$, what's the probability that I'd see some particular label?
    • Example: Given $h(x) = x \geq 10$, if $x=7$, $P(True) = 0$ and $P(False)=1$.
    • Computing this is much easier than $P(h|D)$, which is why it is used for maximum likelihood estimation.
  • $P(h|D)$: Posterior probability of $h$. Our hypothesis given $D$ (what we want to maximize). Reflects our confidence that $h$ holds after we have seen $D$.
    • Reflects the influence of $D$, unlike $P(h)$
    • Harder to compute

Bayes theorem $$ P(h|D) = \frac{P(D|h)P(h)}{P(D)} $$

Takeaway: Bayes rule allows us to switch $P(h|D)$ to $P(D|h)$.

  • $P(h|D)$ increases with $P(h)$ and with $P(D|h)$

Quiz: Bayes Rule

A man goes to see a doctor. She gives him a lab test. The test returns a correct positive 98% of the time, and a correct negative 97% of the time. The test looks for a rare disease that only 0.8% of the population has. The test comes out positive; does the man have the rare disease?

Given:

  • Prior probability of having the disease: $P(disease) = 0.008$
  • True positive rate of the test: $P(+|disease) = 0.98$
  • True negative rate of the test: $P(-|\neg disease) = 0.97$

We can apply Bayes theorem to get the probability of having the disease given that the test returns positive: $$ P(disease|+) = \frac{P(+|disease)\, P(disease)}{P(+)} \\ = \frac{P(+|disease)\, P(disease)}{P(+|disease)\, P(disease) + (1 - P(-|\neg disease))\,(1 - P(disease))} \\ = \frac{0.98 \times 0.008}{(0.98 \times 0.008) + (0.03 \times (1 - 0.008))} \\ = 0.2085 \\ \approx 21\% $$

ANSWER: Probably not. Even if the man tests positive there is only 21% chance that he has the disease.

Takeaway: The prior probability of the disease $h$ significantly affects the outcome. If prior probability is low, then the test isn't very useful.

  • Context: If you're in a world where everybody gets tested for a rare disease, the prior probability will be low. But if you're in a world where only patients suspected of having the disease get tested, the prior will be higher.
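The same calculation as a small function (values taken from the quiz above):

```python
def posterior_given_positive(prior, true_positive_rate, true_negative_rate):
    """P(disease | +) = P(+|disease)P(disease) / P(+), expanding P(+) over both cases."""
    p_positive = true_positive_rate * prior + (1 - true_negative_rate) * (1 - prior)
    return true_positive_rate * prior / p_positive

print(posterior_given_positive(0.008, 0.98, 0.97))   # ~0.21
```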

Maximum a posteriori (MAP) & Maximum Likelihood Estimation

Both are derived from the Bayes Theorem: For each hypothesis in $H$, calculate $P(h|D)$

$$ P(h|D) = \frac{P(D|h)P(h)}{P(D)} $$

Takeaways:

  • To calculate MAP and MLE, we can drop $P(D)$ in the denominator since it's a constant independent of $h$.
  • If we assume the prior over $h \in H$ is uniform, we can also drop $P(h)$ (this gives MLE).
  • Computationally intensive (need to calculate MLE for each $h$), hence NOT practical.

Maximum a posteriori (MAP)

$$ h_{MAP} \equiv argmax_{h \in H} P(h|D) $$

Maximum Likelihood Estimation

  • MLE is more computationally efficient than MAP
$$ h_{MLE} \equiv argmax_{h \in H} P(D|h) $$

Bayesian Learning in Action

Takeaway: Given a bunch of data, the probability of a particular hypothesis being correct is uniform over all of the hypotheses in the version space (ie consistent with given data).

Quiz: Noisy Data

(TODO: Revisit this lecture - I didn't quite understand it)

Q: Compute the probability of seeing this particular data set in a world where the hypothesis ($h(x)=x$) is true.

Bayesian Learning & Sum of Squared Errors

Takeaways:

  • Gradient descent, linear regression, etc. are justified by deriving the maximum likelihood hypothesis ($h_{ML}$): maximizing the likelihood works out to minimizing the sum of squared errors (SSE).
  • OTOH, whenever you are trying to minimize the SSE, you are assuming that:
    1. The function f(x) is deterministic
    2. f(x) is summed with some amount of noise $\epsilon$
    3. The noise ($\epsilon$) has a Gaussian distribution (IID with mean zero and some variance $\sigma^2$)
  • If any of these assumptions is false, then minimizing SSE to solve the problem is incorrect.

Quiz: Sum of Squared Errors

Q: Given data <$x_i$, $d_i$> and function $d_i=f(x_i) + \epsilon_i$, which one of these $h$ best fits the data:

  1. h(x) = x mod 9
  2. h(x) = x/3
  3. h(x) = mean(d)

ANSWER: x mod 9 (SSE 12), better than linear regression

Minimum Description Lengths

From Mitchell, pg 173

The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the sum of the description lengths of the hypothesis and of the data given the hypothesis. MDL ... provides a way of trading off hypothesis complexity for the number of errors committed by the hypothesis.

  • Prefers simpler hypotheses that net the lowest error over a more complex hypothesis with the same error. This is basically bias-variance trade off.

Takeaway: Derivations based on Bayesian learning gives us a handle on why we make the decision we make when building ML models.

Bayesian Classification (Bayes Optimal Classifier)

Takeaway: So far we learned about finding the best hypothesis, but we actually care about finding the most probable classification of the new instance. To do this, we take the weighted vote of the probabilities of each hypothesis class.

Bayes Optimal Classifier:

$$ argmax_{v_j \in V} \Sigma_{h_i \in H} P(v_j|h_i)P(h_i|D) $$

The most probable classification of the new instance is ... combining the predictions of all hypotheses, weighted by their posterior probabilities. This method maximizes the probability that the new instance is classified correctly given the data, hypothesis space, and priors over the hypotheses. No other classification method using the same hypothesis space and same prior knowledge can outperform this method on average.

Bayes Classifier Quiz:

Summary: Bayesian Learning

Takeaway: Bayesian Learning + Classification gives us a way of talking about optimality and gold standards.

  1. Bayes Rule - Swaps causes & effects $P(h|D)$ ~ $P(D|h)P(h)$
  2. Priors matter.
  3. $h_{MAP}$ and $h_{ML}$ ($h_{ML}$ equals $h_{MAP}$ when the prior is uniform)
  4. Derived rules we've used in previous lessons (SSE, Occam's Razor, bias/variance tradeoff using minimum description length).
  5. Bayes Optimal Classifier - What we really care about is Bayesian "classification" (not the hypothesis)

SL10: Bayesian Inference

Reference:

  • Mitchell, Ch. 6 "Bayesian Learning"
  • Wasserman, Ch.17 "Directed Graphs and Conditional Independence"
  • dev.to
  • George's Notes, section 4.2.

Bayesian Networks - What is it?

From Lecture:

A representation for expressing and manipulating probabilistic quantities over complex spaces.

ChatGPT:

Bayesian networks are probabilistic graphical models that represent the relationships between different random variables and their conditional dependencies. The graph consists of nodes representing the variables and edges representing the dependencies between them. The dependencies are represented probabilistically, using Bayes' theorem, which specifies the probability of a variable given its dependencies.

Mitchell, pg.186

A Bayesian belief network represents the joint probability distribution for a set of variables.

Joint Distribution

Quiz: Joint Distribution

Takeaway:

  • Marginal probability of one variable, e.g. $P(lightning)$: sum the joint probabilities over all values of the other variables.
  • Conditional probability, e.g. $P(lightning|storm)$: the joint probability $P(lightning, storm)$ divided by the marginal $P(storm)$.

Conditional Independence

Definition: From Mitchell, pg.185

Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z.

$$ (\forall{x_i, y_j, z_k})\text{ } P(X=x_i|Y=y_j,Z=z_k) = P(X=x_i|Z=z_k) $$

or simply, $$ P(X|Y,Z)=P(X|Z) $$

Quiz: Conditional Independence

Bayes Nets (aka Graphical Models, Belief Networks)

Quiz: Nodes and Probabilities

Takeaway:

  • There is no necessary causal relationship between the nodes even though they are drawn with arrows; the arrows just show how the probabilities are decomposed (the conditional dependencies).

Quiz: Sampling from the Joint Distribution

Q: In what order should we select ABCDE?

  • ANSWER: It should be sampled in topological order (topological sort): each node is sampled after its parents (see the small sketch below).
  • This only works if there are no cyclic relationships (i.e. the graph is acyclic).
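A tiny sketch of sampling in topological order (ancestral sampling); the three-node chain and its conditional probability values are made up for illustration:

```python
import random

def sample_storm_net(rng):
    """Sample each node after its parents: Storm -> Lightning -> Thunder (hypothetical CPTs)."""
    storm = rng.random() < 0.2
    lightning = rng.random() < (0.6 if storm else 0.05)   # P(lightning | storm)
    thunder = rng.random() < (0.9 if lightning else 0.1)  # P(thunder | lightning)
    return storm, lightning, thunder

rng = random.Random(0)
print([sample_storm_net(rng) for _ in range(3)])
```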

Recovering the Joint Distribution

Takeaway: We can go from the values in the conditional probability tables (in each of the nodes) to computing the probability of any joint combination of variables.

Formula: The joint probability for some assignment to the variables is equal to the product of all the individual values.

Example using the graph in the previous slide: $$ P(A,B,C,D,E) = P(A)P(B)P(C|A,B)P(D|B,C)P(E|C,D) $$

  • If A~E are boolean variables, the number of values we'd need to specify for the joint distribution in the standard representation (ie left side of equation) is $2^5-1$.
  • Conditional probability is more compact, only needing 14 (1, 1, 4, 4, 4).
  • Conditional probability tables, unlike the full joint distribution, only grow with the number of parents each node has.

Quiz: Why Sampling?

Distributions (and sampling from them) are useful for:

  1. Simulating a complex process - If the distribution represents something complex, we may want to simulate that world and act according to those probabilities.
  2. Approximate inference - We want to infer what might be true of the world in different situations.
    • Why approximate? Because exact inference is NP-hard.
  3. Visualization - We want to get a sense of the data, especially when the distribution is complex with many variables. In such cases it's hard for humans to understand the relationships between variables; sampling gives you concrete examples that convey how those relationships work.

Inference Rules

Quiz: Inference by Hand

This is very hard to figure out. Just watch the video and follow along: https://youtu.be/Vt8fRJwvdwA

Naive Bayes

Ref: Mitchell, pg. 177

The Naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the target value. In other words, the assumption is that given the target value of the instance, the probability of observing the conjunction $a_1, a_2, \ldots a_n$ is just the product of the probabilities for the individual attributes.

Naive Bayes Classifier, where $v_{NB}$ represents the predicted target value and $V$ is the set of possible target values:

$$ v_{NB} = argmax_{v_j \in V} P(v_j) \Pi_i P(a_i|v_j) $$

Example from lecture using spam classification:

  • Estimating the joint probability of all the attributes given the target value, $P(a_1, \ldots, a_n|v)$, is not feasible directly; it would need a huge amount of training data. The naive independence assumption lets us estimate each $P(a_i|v)$ separately instead.
  • Using Bayes theorem, we flip from the probability of spam given the attributes to the probability of the attributes given spam (times the prior probability of spam). With the naive assumption, that is just the product of the individual conditional probabilities.

Takeaway: Naive bayes classifier allows us to observe attribute values to infer the class.
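A minimal Naive Bayes sketch for discrete attributes (my own illustration; it includes the smoothing workaround discussed below, and the toy weather data is made up):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal Naive Bayes for discrete attributes, with add-m smoothing."""

    def fit(self, X, y, m=1.0):
        self.m = m
        self.n = len(y)
        self.class_counts = Counter(y)
        self.values = defaultdict(set)         # attribute index -> observed values
        self.counts = defaultdict(Counter)     # (class, attribute index) -> value counts
        for x, v in zip(X, y):
            for i, a in enumerate(x):
                self.values[i].add(a)
                self.counts[(v, i)][a] += 1
        return self

    def predict(self, x):
        """v_NB = argmax_v P(v) * prod_i P(a_i | v), computed in log space."""
        def log_score(v):
            score = math.log(self.class_counts[v] / self.n)
            for i, a in enumerate(x):
                num = self.counts[(v, i)][a] + self.m
                den = self.class_counts[v] + self.m * len(self.values[i])
                score += math.log(num / den)
            return score
        return max(self.class_counts, key=log_score)

# Hypothetical toy data: classify (weather, wind) -> play
X = [("sunny", "weak"), ("rain", "strong"), ("sunny", "strong"), ("rain", "weak")]
y = ["yes", "no", "yes", "no"]
print(NaiveBayes().fit(X, y).predict(("sunny", "weak")))  # "yes"
```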

Why Naive Bayes is "Naive"

1) No Free Lunch: It is "naive" because it's not modeling the interrelationships between attributes.

- Then why does it work? Naive Bayes will give you incorrect probabilities if the attributes are related, but for the purpose of classification (i.e. the argmax) it actually works quite well empirically. For example, if two words are related ("free" + "lunch"), their frequencies may be double-counted, but the effect tends to cancel out across classes (a bit hand-wavy here).

2) We don't have infinite data: $P(a_i|v)$ is calculated by counting how many times $a_i$ occurs for a target value $v$ in the training data (over the counts of target value). If we have limited data, that attribute may have zero instances, which makes the numerator zero. If we rely on the product of all attributes given target value, that probability becomes zero.

- Workaround: Smoothing, ie initializing the counts with a small value so that no attribute is zero.  BUT this does create inductive bias, assuming that all attributes are mildly possible.

Summary: Bayes Nets

  1. Bayesian Network representation of joint probability distributions.
  2. Examples of using networks to compute probabilities
  3. Sampling as a way to do approximate inference.
  4. In general, exact inference is NP-hard.
  5. Naive Bayes links all the Bayesian theory to a ML classification task.
    • NB handles missing attributes well, because all the attributes are linked together by probabilities.

Aside: Naive Bayes Text Classification

Goal: Binary classification task using text documents (e.g. target class = the document is an academic paper)

Steps:

  1. Given a training data set, count how many times each word appears in the document for a given target class. $$ v_{NB} = argmax_{v_j \in \{Academic,NotAcademic\}} P(v_j)P(a_1="our"|v_j)P(a_2="approach"|v_j)\ldots P(a_i="trouble"|v_j) $$
  2. We make an assumption that words occur IID, which means we can disregard the position of the word. This reduces the probability space significantly.
  3. Use the m-estimate with uniform priors and with m equal to the size of the vocabulary.
    • Given $n$ is the total number of word positions in all training examples whose target value is $v$, $n_k$ is the number of times word $w_k$ is found among these $n$ word positions, and $|Vocabulary|$ is the total number of distinct words (and other tokens) found in the training data: $$ \frac{n_k+1}{n+|Vocabulary|} $$

The final algorithm from Mitchell, pg.183:

In [ ]: