CS7641 - Machine Learning - Supervised Learning

Supervised Learning (SL)

Definition: Take examples of inputs and outputs. Now, given a new input, predict its output.

Classification vs Regression

  • Regression: If output is continuous
  • Classification: If output is a small, discrete set

Classification Learning

Terms

| Term | Definition |
| --- | --- |
| Instances | Input. Vectors of attributes that define your input space. |
| Concept | Function that maps the input space to outputs. A concept is a mapping between objects in a world and membership in a set, which makes it a function. |
| Target Concept | The correct concept that maps instances to outputs. |
| Hypothesis Class | Set of concepts that you're willing to consider in order to find the target concept (e.g. binary or multiclass classification, but not regression output). |
| Sample | Training set. Set of all of our inputs paired with the correct output labels. Foundation for inductive learning. |
| Candidate Concept | A concept that you think is the target concept (e.g. "anyone with curly hair is credit-worthy"). |
| Testing Set | Set of instances used to test a candidate concept. |

SL1: Decision Trees

Representation vs Algorithm

  • Decision trees are a specific representation.
  • The algorithm is the method by which we build a decision tree.

Quiz: Running a decision tree

Work through each instance and give an answer based on the decision tree.

img

Note:

  • These examples are part of a test set.
  • The tree in the image is our candidate concept.
  • Some attributes don't matter and are thus never used.

Decision Tree Algorithm

"20 questions" example: We want to narrow possibilities with each questions asked. This motivates us to ask very general Qs that significantly narrows down possiblities.

Learning Algorithm

  1. Pick the "best" attribute. "Best" is the attribute that splits the training set in half.
  2. Ask question
  3. Follow the answer path.
  4. Go to 1, until you get to an answer.

Quiz: Picking the "best" attribute

img

Notes:

  • Candidates 1 & 2 may be equally bad. 1 creates pretty much the same distribution after split. 2 doesn't split at all.

Expressiveness

Example: Boolean expressions as decision trees

img

Expressiveness per Tree Types

img

| Type | Size (number of nodes) | Description |
| --- | --- | --- |
| n-OR ("any" fn) | linear, $n$ | Easy to solve |
| n-XOR ("odd/even parity" fn)* | $O(2^{n})$ | Hard to solve. No clever way to pick the right attribute (need to look at all attributes). |

Takeaway: We hope that our ML problems are more like n-OR problems than n-XOR problems, because n-XOR problems are hard and expensive.

* If number of attributes that are true is odd, then the output is true.

Expressiveness Continued

img

Q: Exactly how expressive is a decision tree?

Given:

  • n attributes (boolean) - $O(n!)$ (combinatorial) orderings of nodes
  • output is boolean
  • Q1: If mapped to a truth table, how many rows are needed?
  • A1: $2^n$ rows
  • Q2: How many ways are there to fill in the outputs (outputs are boolean)?
  • A2: $2^{2^n}$ (a really big number that grows very fast)

A: The hypothesis space of decision trees is very expressive. We need to search this space cleverly to find the best representation efficiently.

ID3 Algorithm

Loop:
    1. Pick best attribute A
    2. Assign A as decision attribute for Node
    3. For each value of A, create a descendant of Node
    4. Sort training examples to leaves
    5. Stop if examples perfectly classified
    6. Else iterate over leaves

Using Information Gain to Choose Best Attribute

img

Definition: Reduction in randomness over the labels of a set of data, based on knowing the value of a particular attribute.

Entropy: "Average amount of information conveyed by an event, when considering all possible outcomes"

  • Intuition: Measure of the amount of information. Reducing entropy means we are gaining more information. Goal is to maximize information gain by choosing attribute that lowers entropy the most.
  • Example: Fair vs. two-headed coin. A fair coin has a 50/50 chance of heads, giving it 1 bit of entropy. A two-headed coin has a 100% chance of heads, giving it zero entropy (no information).
    • The more of one label you have over another, the lower the entropy.

Formula:

  • $S$: Collection of training examples
  • $A$: Specific attribute
  • $v$: Specific values of an attribute (categorical, continuous)
$$ Gain(S,A) = Entropy(S) - \Sigma_v\frac{|S_v|}{|S|} Entropy(S_v) $$

Formula for entropy: $$ Entropy(S) = -\Sigma_v p(v) \log_2 p(v) $$
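To make the two formulas concrete, here is a minimal sketch (my own illustration, assuming discrete attribute values and log base 2, matching the coin example above):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_v p(v) * log2 p(v) over the label proportions."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, attr):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v), splitting on attribute index attr."""
    subsets = {}
    for x, y in zip(examples, labels):
        subsets.setdefault(x[attr], []).append(y)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

print(entropy(["H", "T"]))   # 1.0 bit (fair coin)
print(entropy(["H", "H"]))   # 0.0 bits (two-headed coin; Python prints -0.0)
```

ID3's "pick the best attribute" step is then just an argmax of `information_gain` over the remaining attributes.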

ID3: Inductive Bias

Types of bias:

  1. Restriction bias: Hypothesis set that you care about.
    • For DT, includes all possible DT representations.
    • Does NOT include functions outside of DT (quadratic, etc).
  2. Preference bias: What sorts of hypotheses in the hypothesis set ($h \in H$) we prefer.
    • Inductive bias == preference bias

ID3's inductive bias - Types of preferences:

  • Prefers good splits near the top.
  • Prefers representations that return the correct answer over incorrect ones.
  • Prefers shorter trees to longer trees. This comes naturally from preferring good splits near the top.

Decision Trees: Other Considerations

img

Continuous Attributes

Q: How to build DTs with continuous attributes?

  • A: Create binary branching decisions based on ranges of the continuous attribute (e.g. age < 30).

Q: T/F, does it make sense to repeat an attribute along a path in a tree?

  • A1: False for discrete-valued attributes, because there is no information gain in asking the same question again.
  • A2: True for continuous attributes, because we can ask different questions about the same attribute. Nodes further down the branch could split on different ranges.

When to stop

Overfitting: Don't trust the data too much, because there is potential for noise. Overfitting tends to happen when trees get too big.

Solutions

  • Validation: Use cross-validation, or use a held-out validation set to verify whether we want to continue branching.
  • Pruning: Build the full tree first, and then prune it based on how much error the pruning will cost (measured on a validation set).
    • The pruned node's output is the mode (majority label) of the examples beneath it.

How to build DT using regression?

Regression implies continuous outputs. Information gain (defined over discrete labels) doesn't work well with continuous values.

Q: What's the splitting criteria for regression trees?

  • A: You can split on whatever most reduces the variance of the outputs.

Q: What's the output?

  • A: You can use the average of the outputs at the leaf.

Summary

img

SL2: Regression

Regression & Function Approximation

img

Etymology: "Regression to the mean"

  • A slope of less than 1 (e.g. 2/3) shows up in many natural phenomena, like the relation between parents' and children's heights.
  • In ML, "regression" has come to mean using a functional form to approximate a set of data points.

Regression in ML

Finding the best line (measured by least squared error) is achieved through calculus (i.e. taking the derivative of the error and setting it to zero).

Finding the best constant function

img

  • The image above shows how to derive the best constant function for the sum of squared errors: take the derivative and set it to zero to find the minimum (see the short derivation below).
  • In this case, the best constant is the average of all the $y$ values.
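A quick sketch of that derivation (setting the derivative of the squared error to zero):

$$ E(c) = \Sigma_i (y_i - c)^2, \qquad \frac{dE}{dc} = -2\Sigma_i (y_i - c) = 0 \;\Rightarrow\; c = \frac{1}{n}\Sigma_i y_i $$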

Polynomial Regression

Choosing the Order of Polynomial

img

  • The greater the order of polynomial used, the closer we can fit the data at the cost of sensitivity to noise (ie. overfitting).

Math: Polynomial Regression

img

img

  • Each data point $x$ is raised to every power up to the order of the polynomial; stacking these rows builds a matrix.
  • This matrix is then multiplied by the vector of coefficients, one coefficient per exponent.
  • Each row maps to one $y$ response.
  • $w$ is the vector of coefficient weights that gives the best fit (see the sketch below).
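A minimal sketch of that matrix setup using numpy's least-squares solver (the data here is made up for illustration):

```python
import numpy as np

def fit_polynomial(x, y, order):
    """Solve for the coefficient vector w that minimizes the squared error."""
    # Each row is one data point raised to powers 0..order (the design matrix).
    X = np.vander(x, N=order + 1, increasing=True)
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w  # w[k] is the coefficient on x^k

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x + 3 * x**2          # hypothetical noise-free data
print(fit_polynomial(x, y, 2))    # approximately [1. 2. 3.]
```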

Errors, and where they come from

Training data contains error that the true function $f$ doesn't account for, so what we observe is $f+\epsilon$.

Types of error sources:

  1. Sensor errors
  2. Malicious data
  3. Transcription error
  4. Unmodeled influences (e.g. fluctuations in housing prices, time of day, buyer's mood)

Cross Validation

Definition: Split the training data into n folds. Hold out one fold as the "test" (validation) data and train on the remaining folds; repeat with each fold and average the error. Choose the model with the lowest average error.
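A small sketch of k-fold cross-validation for picking a polynomial order (the folds, candidate orders, and error metric here are illustrative choices, not the course's exact procedure):

```python
import numpy as np

def cv_pick_order(x, y, orders, k=5, seed=0):
    """Average validation MSE over k folds for each order; return the order with the lowest."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    avg_error = {}
    for order in orders:
        errs = []
        for fold in folds:
            train = np.setdiff1d(np.arange(len(x)), fold)   # train on everything but this fold
            w = np.polyfit(x[train], y[train], order)
            errs.append(np.mean((y[fold] - np.polyval(w, x[fold])) ** 2))
        avg_error[order] = np.mean(errs)
    return min(avg_error, key=avg_error.get), avg_error
```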

Aside - Assumptions of SL:

  • Train/test sets are IID (independent and identically distributed)

Applied to housing example in video:

  1. Gap between train/cross-val error narrows up to 3-4th degree
  2. Higher order polynomials begin overfitting to the training data at the expense of generalization on unseen examples.

img

Other Input Spaces

img

  • Regression can handle discrete, vector or scalar attributes.
  • Handling categorical: Enumerate (OK), or 1-hot encode (Better) - because enumeration gives categories a rank.

Summary

img

SL3: Neural Networks

video

Example of a perceptron (binary classifier with 3 inputs + weights)

img

Quiz: Example inputs and estimated y

img

How powerful is a perceptron unit?

Image: Representation of the activation space given 2 inputs + weights. The unit generates a half-plane.

img

Takeaway: Perceptrons are linear functions that compute a hyperplane decision boundary.

Perceptron Quizzes

Quiz 1: If we focus on $x_1 \in \{0,1\}$ and $x_2 \in \{0,1\}$, what is $y$?

  • A: AND Logic

Quiz 2: If $y$ is an OR function, what would we set $w_1$, $w_2$, $\theta$?

  • A: (Many answers work) $w_1 = 0.5$, $w_2 = 0.5$, $\theta = 0.5$, which lowers the threshold so that either input alone can activate the unit.

Quiz 3: Given a NOT logic (one variable), what should we set $w_1$ and $\theta$?

  • A: $w_1 = -1$ (turns 0 into 0, 1 into -1), and $\theta = 0$, so the zero activation meets the threshold but the negative activation falls below it.

Takeaway: AND/OR/NOT logic can each be expressed as a single perceptron unit; XOR requires a network of units (next quiz).

XOR as Perceptron Network Quiz

img

In [16]:
# Perceptron Quiz answer in code

def xor_network(x1, x2, w1, w2, w_and, theta) -> int:
    # Weighted inputs plus the output of a hidden AND unit, weighted by w_and
    activation = x1*w1 + x2*w2 + bool(x1 and x2)*w_and
    return 1 if activation >= theta else 0

def test_xor(w1, w2, w_and, theta): 
    assert xor_network(0,0, w1, w2, w_and, theta) == 0
    assert xor_network(0,1, w1, w2, w_and, theta) == 1
    assert xor_network(1,0, w1, w2, w_and, theta) == 1
    assert xor_network(1,1, w1, w2, w_and, theta) == 0
    print("Passed all tests")

w_1 = 1
w_2 = 1
w_and = -2
threshold = 1   
test_xor(w_1, w_2, w_and, threshold)
Passed all tests

Answer to XOR question:

  • We essentially want the network to behave like OR logic, except when the AND of the inputs is 1. The previous quiz showed that OR logic can be created by setting $w_1$ and $w_2$ to 1 and $\theta$ to 1.
  • We can therefore add a negative weight to counteract the OR logic by setting $w_\text{and}$ to -2, which cancels out $x_1 w_1 + x_2 w_2$ when both inputs are 1 and gives an activation of 0. img

Perceptron Training

Definition: Given examples, find weights that map inputs to outputs.

2 ways in which perceptron training is done:

  1. Perceptron Rule: Using the threshold ($\theta$)
  2. Gradient Descent / Delta Rule: Using the "unthresholded" value.

Perceptron Rule

Goal: How to set the weights of a single unit so that it matches a training set.

img

Components:

  • $x$: input (single unit for this example)
  • $w_i$: Weight of $i$-th unit
  • $y$: Target
  • $\hat{y}$: Prediction
  • $\eta$: Learning rate. Controls the size of each update.
  • $\theta$: Folded into the weights via a math trick: add a constant input of 1 (a bias) so we don't have to treat the threshold separately.

(1) Weight change formula: Changing weight of $w_i$ by $\Delta w_i$. This is what we iterate on while there is error.

$$ w_i = w_i + \Delta w_i $$

(2) Defining weight change amount - First, calculate $\hat{y}$.

  • If y and $\hat{y}$ are equal, then we don't need to change the weight.
  • If they are not equal, then we get either a positive or negative value.
$$ \hat{y} = (\Sigma_i w_i x_i \ge 0) $$

(3) Determine the weight change based on the difference between $y$ and $\hat{y}$, multiplied by the input value $x_i$ and scaled by the learning rate $\eta$.

$$ \Delta w_i = \eta (y-\hat{y})x_i $$
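Putting the three steps together, a minimal perceptron-rule training loop might look like this (my own sketch; the AND data at the bottom is a made-up check, with the bias folded in as a constant input of 1):

```python
import numpy as np

def perceptron_rule(X, y, eta=0.1, max_iters=100):
    """Train a single threshold unit with the perceptron rule.

    X includes a bias column of 1s (the theta-as-weight trick); y is 0/1.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        errors = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) >= 0 else 0
            w += eta * (y_i - y_hat) * x_i      # delta w_i = eta * (y - y_hat) * x_i
            errors += int(y_i != y_hat)
        if errors == 0:                          # finite convergence if linearly separable
            break
    return w

# Hypothetical example: learn AND (last column is the constant bias input)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w = perceptron_rule(X, y)
print([(1 if np.dot(w, x) >= 0 else 0) for x in X])  # [0, 0, 0, 1]
```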

Linearly Separable & Perceptron Rule

Definition: All examples can be separated by a linear hyperplane. This gets more complicated to visualize with 4+ dimensions.

Perceptron Rule: If we have data that is linearly separable, then the perceptron rule will find this hyperplane.

Q: What if the data is not linearly separable?

  • A: The rule may never converge; in practice you continue until an iteration limit is reached (choosing that limit is not easy).

Gradient Descent

Motivation: More robust to linearly non-separable datasets.

  • Key is to not use thresholding.

img

Components:

  • $\alpha$: Activation, the weighted sum $\Sigma_i w_i x_i$
  • $y$: Target
  • $\hat{y}$: Estimated output, ${\alpha \ge 0}$
  • $E(w)$: Error metric

(1) Calculate the error metric. Goal is to minimize this error.

  • $\frac{1}{2}$ is just a helper to make deriving the partial derivative easier.
$$ E(w) = \frac{1}{2} \Sigma_{(x,y)\in D} (y-\alpha)^2 $$

(2) Find the partial derivative of above.

  • Why? We do this to find the best weights to minimize the error.
$$ \frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \frac{1}{2} \Sigma_{(x,y)\in D}(y-\alpha)^2 $$

(3) We get the result - The derivative of the error $E$ with respect to $w_i$. The result looks pretty similar to the perceptron rule.

$$ \frac{\partial E}{\partial w_i} = \Sigma_{(x,y) \in D} (y-\alpha)(-x_i) $$
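A sketch of batch gradient descent using that derivative (stepping opposite the gradient; the learning rate and epoch count are arbitrary choices of mine):

```python
import numpy as np

def delta_rule(X, y, eta=0.01, epochs=1000):
    """Gradient descent on E(w) = 1/2 * sum (y - a)^2, where a = X @ w is the unthresholded output."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        a = X @ w                 # activations for every example
        grad = -(y - a) @ X       # dE/dw_i = sum (y - a)(-x_i)
        w -= eta * grad           # move opposite the gradient
    return w
```

For classification you would threshold the trained activations afterwards, e.g. $\hat{y} = (Xw \ge 0)$.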

Comparison of learning rules

| Type | Learning Rule | Pros | Cons |
| --- | --- | --- | --- |
| Perceptron Rule | $\Delta w_i = \eta (y-\hat{y})x_i$ | Guaranteed finite convergence | Requires linear separability |
| Gradient Descent | $\Delta w_i = \eta (y-\alpha)x_i$ | More robust to non-linearly separable data | Likely to converge only to a local optimum |

Quiz: Why not do gradient descent on $\hat{y}$?

  • Answer: Because $\hat{y}$ is the thresholded (0/1) output, it is a discontinuous step function of the weights and hence non-differentiable. An example of how this looks on a 2D plot is below.
  • Significance: We can use a sigmoid instead, which bends the non-differentiable, disjointed step into a smooth curve.

img

Sigmoid Function - Differentiable Threshold

img

(1) Sigmoid activation function $$ \sigma(x) = \frac{1}{1+e^{-x}} $$

(2) Derivative of the sigmoid function $$ \sigma'(x) = \sigma(x)(1-\sigma(x)) $$

  • As the sigmoid activation ($\alpha$) reaches infinity, it goes to 1
  • As it approaches negative infinity, it goes to 0
  • At zero, it is $\frac{1}{1+e^{0}}$, which is 0.5.
  • Between about -5 and 5, it creates a smooth curve that makes gradient descent possible (i.e. a continuous, differentiable line).

Extra: How to take the derivative of the sigmoid function. Reference: TDS
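A tiny sketch of the sigmoid and its derivative:

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^-x), a differentiable stand-in for the hard threshold."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    """sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(np.array([-5.0, 0.0, 5.0])))  # ~[0.0067 0.5 0.9933]
```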

The Illustrated Neural Network

Key Points:

  • Each node is a sigmoid unit.
  • Each node is differentiable, so the entire network becomes differentiable.
  • Backpropagation: The act of sending the error information from the output back to the input, while updating the weights of each activation node in between.
    • "Computationally beneficial organization of the chain rule"
  • Local minima issue: High dimensional complex problems can have many local minima, hence the result of gradient descent can get stuck in the local but not global minima.

Advanced Methods to Optimize Weights

Alternatives/supplements to vanilla gradient descent:

  1. Momentum: Overcome noisy gradients (ie. local minima) by adding an additional hyperparameter that controls the amount of history (momentum) to include in the update equation.
    • A large momentum (e.g. 0.9) will mean that the update is strongly influenced by the previous update, whereas a modest momentum (0.2) will mean very little influence
  2. Higher order derivatives: The gradient is the first-order derivative for a multivariate objective function. Instead, we can use the second/third order derivative.
  3. Randomized optimization: Covered in future sections.
  4. Penalty for "complexity": Add penalty term for adding more nodes and layers, and/or large numbers.
    • Larger numbers are more likely to overfit, bc it lets you represent arbitrarily complex functions.

Restriction Bias

Definition: Representational power of the data structure you're using. Tells you the set of hypotheses you're willing to consider.

  • Lower bias == more hypotheses considered and vice versa.
  • Relevant to all supervised learning algorithms (not just neural nets).

Types of neural nets and their representational power:

  1. Perceptron: Linear function. Only considers planes.
  2. Perceptron network: Introduces boolean logic (XOR, etc). Starts capturing nonlinear functions.
  3. Sigmoid: More complex, less restriction
  4. Continuous function: Can be handled by a network with one hidden layer (given enough hidden units).
  5. Arbitrary function: e.g. discontinuous functions. Can be represented with two hidden layers.

Preference Bias

Definition: The algorithm's selection (ie preference) of one representation over another.

Preference bias of neural nets: Small Random values that provide simpler explanations.

- SMALL: Favors small weight values, because networks with small weights have lower complexity. Larger weights tend to overfit.
- RANDOM: Variability. Gives the possibility of exploring outside a local minimum.
- Similar to __Occam's Razor__: "Entities should not be multiplied unnecessarily"
    - The algorithm prefers stopping once it fits the data well, before letting the weights grow larger.

Summary of Neural Nets

  1. Perceptrons - threshold unit
  2. Networks can produce any boolean function.
  3. Perceptron rule - Converges in finite time for linearly separable data.
  4. General differentiable rule - Backpropagation and gradient descent.
  5. Preference / restriction bias of neural networks.

SL4: Instance Based Learning

| Pros | Cons |
| --- | --- |
| Remembers (no catastrophic forgetting) | Cannot generalize |
| Fast (no learning) | Slower at runtime |
| Simple (Occam's Razor) | Easily overfits (remembers noise as well) |
|  | Struggles with ambiguity (e.g. two identical x's with different y's, multiple near-equidistant neighbors with different y values) |

K-Nearest Neighbor Algorithm

Given:

  • Training Data (set of pairs): $D=\{x_i, y_i\}$
  • Distance metric: $d(q, x)$
  • Number of neighbors: $K$
  • Query point (new data): $q$

(1) Find the $K$ nearest neighbors of $q$ $$ NearestNeighbor = \{\, i : d(q,x_i) \text{ is among the } k \text{ smallest} \,\} $$

(2.1) Classification: Return the (weighted) vote of $y_i \in \text{NearestNeighbor}$ (ie. Mode)

(2.2) Regression: Return the (weighted) mean of $y_i \in \text{NearestNeighbor}$
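A minimal KNN sketch following those steps (p-norm distance; ties beyond the k-th neighbor are ignored here, unlike the quiz below, which includes them; the data rows are my reconstruction of the quiz's table and should be treated as illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, q, k, p=2, regression=True):
    """Return the mean (regression) or mode (classification) of the k nearest neighbors of q."""
    dists = np.sum(np.abs(X_train - q) ** p, axis=1) ** (1.0 / p)   # Minkowski / p-norm distance
    nearest = np.argsort(dists)[:k]
    if regression:
        return float(np.mean(y_train[nearest]))
    values, counts = np.unique(y_train[nearest], return_counts=True)
    return values[np.argmax(counts)]                                 # vote / mode

# Data consistent with the target function y = x1^2 + x2 (reconstructed from the quiz)
X = np.array([[1, 6], [2, 4], [3, 7], [6, 8], [7, 1], [8, 4]], dtype=float)
y = np.array([7, 8, 16, 44, 50, 68], dtype=float)
print(knn_predict(X, y, q=np.array([4.0, 2.0]), k=3))   # 42.0, the Euclidean 3-NN average
```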

Quiz 1: Runtime vs Space

Takeaways:

  1. KNN is faster to learn (learning time is constant, space is linear), but slower at querying (query time is $\log n$, query space is constant).
    • If you query $n$ times overall, it will be less efficient than linear regression.
    • Lazy-learner: Essentially only does work at query time.
  2. Linear Regression is slower to learn (learning time is linear, space is constant), but faster at querying (time + space are constant).
    • Eager-learner: Wants to learn right away.

Quiz 2: Testing KNN on Regression

How to calculate Manhattan/Euclidean distance: Both are derived from the Minkowski distance ("p-norm distance"), with different $p$ values:

  • Given:
    • $x$: Input
    • $y$: Target
    • $p$: Order (1=Manhattan, 2=Euclidean)
  • Minkowski distance: $d(x,y) = \sqrt[p]{\sum_{i=1}^{n}|x_i - y_i|^p}$
  • Manhattan distance: $d(x,y) = \Sigma_{i=1}^n |x_i - y_i| $
  • Euclidean distance: $d(x,y) = \sqrt{\Sigma_{i=1}^n|x_i - y_i|^2}$

Using the first row as an example (x=(1,6), y=7):

  • Manhattan distance: $d(q,x) = |1-4| + |6-2| = 3 + 4 = 7$
  • Euclidean distance: $d(q,x) = \sqrt{|1-4|^2 + |6-2|^2} = \sqrt{25} = 5$
    • NOTE: In the video they use the squared Euclidean distance for readability.

Finding 1/3-NN:

  • How to find 1-NN: Find the point(s) with the lowest distance. If more than one ties, average their $y$ values.
  • How to find 3-NN: Average the $y$ values of the points with the 3 lowest distances (including ties).
  • Manhattan:
    • 1-NN: lowest distance = 4, with 2 tied datapoints: (2,4)→8 and (7,1)→50. $y=\mu(8, 50)=29$
    • 3-NN: distances [4, 6], with ties, giving 4 datapoints (checked in the MD column). $y=\mu(8, 16, 50, 68) = 35.5$
  • Euclidean:
    • 1-NN: lowest (squared) distance = 8, 1 datapoint: (2,4)→8. $y=8$
    • 3-NN: (squared) distances [8, 10, 20], 3 datapoints (the ones checked in the image). $y=\mu(8, 50, 68) = 42$

Error:

  • Target function: $y=x_1^2+x_2$
  • Given $q=(4,2)$, $y=18$

Takeaways:

  • KNN, using both Euclidean and Manhattan distance, was unable to predict anywhere near the target.
  • Prediction of KNN is heavily dependent on 1) How many neighbors are used, and 2) The distance metric used.
  • This implies an underlying bias of KNN.

KNN's Preference Bias

  1. Locality: Near points are similar (e.g. real-estate example)
    • Embedded in the distance function. Why would a specific distance function be better?
    • There is always one best distance function for a specific problem (given all else is fixed).
  2. Smoothness: Expects smoothly changing behavior between data points
  3. All features matter equally (e.g. Minkowski distance) - but this is not always the case
    • For example, with $y=x_1^2+x_2$, any change in $x_1$ causes a large change in $y$ compared to $x_2$.

Curse of Dimensionality

Definition: As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.

Example: Number of neighbors needed to cover the same dimension space.

  1. 1D: Each point covers 1/10 of the space
  2. 2D: Each point covers 1/100 of the space
  3. 3D: Each point covers 1/1000 of the space

Some Other KNN Stuff

Distance metric

  • Your choice of distance function matters. Choosing the wrong distance function will yield poor behavior.
  • Euclidean / Manhattan are useful for regression. Why? Because they assume you have continuous, numerical features.
  • Use domain knowledge to decide which distance function makes sense. More info on how to decide in this link.

Number of neighbors + weighted average

  • What if $K=n$? You end up with a constant function, ie. simple average of all $y_i$.
  • Weighted averages: Instead of weighting all neighbors equally, put more weight on those closer to $q$.
  • Locally weighted regression: Instead of averaging, take the data points closest to $q$ and fit a line to them.
  • Takeaway: KNN is useful, because it allows you to take local information and build arbitrary functions/concepts.

Example: Locally weighted linear regression

Summary of Instance-based Leaners

  • Lazy vs eager learners
  • KNN
  • Nearest neighbor: similarity (ie. distance) functions
  • NOTE: The number of neighbors and the similarity function are both ways to capture domain knowledge.
  • Classification vs Regression (KNN can handle both)
  • Averaging (non-weighted vs weighted)
  • Combining local knowledge to fit a regression line.
  • Curse of dimensionality
  • Domain knowledge matters - this guides you to choosing which algorithm will work best.

SL5: Ensemble Learning: Bagging and Boosting

Bagging

Idea: Build up a bunch of rules and combine them until you get something that's good enough.

Pseudocode:

1. Learn over a subset of data (uniform & random) and apply a learner
...(repeat)
2. Combine all the rules (How? Regression: Mean, Classification: Mode)
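A minimal bagging sketch using the housing-style polynomial example (the subset size, learner count, and cubic order are arbitrary choices of mine):

```python
import numpy as np

def bagged_poly_fit(x, y, order=3, n_learners=10, frac=0.6, seed=0):
    """Step 1 (repeated): fit each learner on a uniform random subset of the data."""
    rng = np.random.default_rng(seed)
    n, m = len(x), int(frac * len(x))
    models = []
    for _ in range(n_learners):
        idx = rng.choice(n, size=m, replace=True)       # bootstrap-style random subset
        models.append(np.polyfit(x[idx], y[idx], order))
    return models

def bagged_predict(models, x_query):
    """Step 2: combine the rules with the mean (regression); use the mode for classification."""
    return np.mean([np.polyval(w, x_query) for w in models])
```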

Example: Spam Email

Quiz: Ensemble Output

  • This is essentially the same as running KNN where $K=n$.

Bagging (Bootstrap Aggregation)

Takeaway:

  • Bagging averages out variance/noise in the data by fitting multiple learners to subsets of data.
  • A single higher-order regression will fit all the data, with more chance of overfitting.
  • In this case, the single regression fits the training data better but does worse on the test data; bagging is the reverse, meaning it generalizes better.

Boosting

Pseudocode (high-level):

1. Learn over a subset of data and apply a learner
2. Filter down to the "hardest" examples, i.e. the data points that result in the highest errors.
...(repeat)
3. Combine all the rules (How? Regression: WEIGHTED mean, Classification: WEIGHTED Mode)

Defining "hardest" examples

Classification error = # of mismatches

Given:

  • $x$: Datapoint
  • $h$: Hypothesis (from learner)
  • $c$: Ground truth
  • $D$: Distribution (of examples being drawn, we don't know how they're being drawn though)
$$ Pr_D [h(x) \ne c(x)] $$

"Weak" Leaner

Definition: A learning algorithm that, under any distribution, will find a hypothesis that performs better than chance on the current distribution.

  • Error: No matter what distribution you get ($\forall$), you're always going to get an error rate that is less than a half (by at least some very small number $\epsilon$).
  • Takeaway: You're always learning "something" from the learner (ie. better than chance).

"Does better than chance" expressed more formally: For all distributions, your learner will have an expected error (hypothesis disagreeing with the real target of a single sample)

$$ \forall_D\; Pr_D [h(x) \ne c(x)] \leq 1/2 - \epsilon $$

Quiz: Weighted Error

Quiz: Weak Learners (TODO: Come back to this one - I don't understand it)

  1. Find a distribution over the 4 examples ($x_i$) such that a learning algorithm that has to choose between the 3 hypotheses ($h_i$) will be able to find one that does better than chance (i.e. expected error less than half).
  2. Find a distribution over the 4 examples, a learning algorithm (looking at $x_{1,2,3}$) would not be able to return one of them that does better than chance.

ANSWER:

  1. Good distribution: If we give equal weight to each example, $h_1$ gets 3 of 4 correct, which is better than chance. Its expected error is 1/4.
  2. Evil distribution: I don't understand how they got [1/2, 1/2, 0, 0]?

Some explanation in Slack thread:

One can ask a simple question: given a set of objects, X, what is the probability of seeing any particular object, x, that is drawn from X according to some distribution D? By definition, D, induces a value for each x that is between 0 and 1 inclusive where the sum of all these values is 1.

In our case, each x has some corresponding label; let's call each one y.

A weak learner is simply a learning function that given X, D, Y, and a set of hypotheses, H, will find some hypothesis, h, drawn from H, such that the probability of drawing an x such that h(x) == y is better than chance.

Boosting Algorithm (Detailed)

Pseudocode (left side)

  1. For a given training set, assume that each target has a value of either -1 or +1 (binary class)
  2. Enter a loop for $T$ times ($t$ is each time step)
    1. Construct a distribution over your examples ($D_t$)
    2. Find a weak classifier ($h_{t}(x)$) with a small amount of error ($\epsilon_t$) given a specific $D_t$. (This is where we iteratively learn the error from the previous learners).
      • The error rate $\epsilon_t$ is expected to be smaller than chance (i.e. 50%)
  3. Output a final hypothesis $H$

Constructing the Distribution (right side)

  1. Base case: $D_{1}(i)$ is initialized with a uniform distribution.
    • Why? We don't know whether one example is more important than another, so just assume they're all equal.
  2. At each time step, we construct a new distribution over the examples ($D_{t+1}(i)$):
    • The big equation is how the new distribution is calculated. The important part is that for a binary classification problem, the product $y_i h_t(x_i)$ returns 1 if they agree and -1 if they disagree.
    • Alpha ($\alpha_t$) is a positive constant derived from the error $\epsilon_t$ (the standard choice is $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$).
    • If y and h agree, the distribution for that example ($D(i)$) decreases
      • Math: Agreement ($y_i h_t(x_i) = 1$) is multiplied by alpha and then negated; $e$ raised to a negative exponent gives a number less than 1 ($e^{-n} = 1/e^n$).
    • Denominator $Z_t$ normalizes the result
  3. Significance: For each weak learner, it puts more weight on examples that it gets wrong and less weight on those it gets right.
    1. If there is agreement on one example but at least one other one disagrees, $D_t(i)$ will DECREASE.
    2. If there is disagreement on one example but at least one other one agrees, $D_t(i)$ will INCREASE.
    3. If we assume each weak learner always gets an error rate less than chance, the new weak learner will get right some of the examples that the learner at time $t-1$ was getting wrong.

Constructing the final hypothesis

We generate the final hypothesis by taking the sum of each $\alpha_t$ times its hypothesis $h_t(x)$, then running it through a sign function (which returns the sign of the number: -1, 0, or +1).

Formula for final hypothesis $H$:

$$ \begin{align} H_{final}(x) = sgn(\Sigma_t \alpha_t h_t (x)) \end{align} $$
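Pulling the pieces together, a compact boosting sketch (my own; the `weak_learner` argument is a placeholder for any better-than-chance learner, and $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ is the standard AdaBoost weighting):

```python
import numpy as np

def adaboost(X, y, weak_learner, T=10):
    """y must be in {-1, +1}; weak_learner(X, y, D) returns a function h with h(X) in {-1, +1}."""
    n = len(y)
    D = np.full(n, 1.0 / n)                    # D_1: uniform distribution over examples
    ensemble = []
    for _ in range(T):
        h = weak_learner(X, y, D)
        pred = h(X)
        eps = D[pred != y].sum()               # weighted error under D_t
        alpha = 0.5 * np.log((1 - eps) / eps)  # standard AdaBoost weight
        D = D * np.exp(-alpha * y * pred)      # agreement -> smaller weight, disagreement -> larger
        D = D / D.sum()                        # Z_t normalization
        ensemble.append((alpha, h))
    return lambda Xq: np.sign(sum(a * h(Xq) for a, h in ensemble))
```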

Boosting Example

Boosting Intuition

  • Each weak learner does better than chance.
  • In the new distribution, examples are weighted differently depending on whether the weak learner at time $t$ classified them correctly: correct ones are weighted down and incorrect ones are weighted up.
  • New learner at $t+1$ will now focus on getting those that the previous learner got wrong.
  • The sum of alpha and hypothesis gives our final hypothesis $H$.
  • There is a property of information gain, that each iteration must always be better than chance w.r.t error.

Boosting Summary

  • Bagging: Subsample bunch of examples and train different learners on each. Merge them with mean or mode (depending on regression/classification).
  • Combining simple learners gives you an ensemble learner that can fit complex data.
  • Boosting is really good.
  • Ensemble learners are agnostic to what type of weak learners to use.
  • Error is related to distribution $D$ for Boosting.
  • Overfitting: In practice, boosting is resilient to overfitting.

SL6: Kernel Methods & SVMs

  • Assumption: Dealing with linearly separable data + binary labels.
  • The middle line explains the data correctly, but is the least committed to the sample set. This helps prevent overfitting.

SVM

Supporting material/notes for this section: My notes from ISYE6501

Linear Classifiers

Given:

  • $y$ - label
  • $w$ - parameters of the hyperplane
  • $b$ - bias
$$ y = w^T x + b $$

Decision Boundaries and Margins:

  • Decision boundary is zero, i.e. $w^T x + b = 0$. (This is the middle line.)
  • Margin hyperplanes (assuming binary classification (labels 1 and -1) + linearly separable data):
    1. Hyperplane for positive class: $w^T x+b=1$
    2. Hyperplane for negative class: $w^T x+b=-1$
  • We want to maximize the margin, ie distance between the two margin planes.

Quiz: Solving for the margin

Q: Solve for the line that is described by their difference (...?)

Answer:

  • Essentially asking for $W^T(x_1 - x_2)$ which is 2 (given class 1 and -1)
  • But the quiz wants the distance between the hyperplanes - with some manipulation, it is $\frac{2}{||w||}$ (bottom right)
  • Explanation: We are solving for the parameters of the two hyperplanes so that we maximize their distance (i.e. the length of the perpendicular between the parallel hyperplanes), with the constraint that all classifications are still correct. This distance is called the margin.
  • tl;dr - SVM's goal is to find a decision boundary that maximizes the margin, subject to the constraint that all points are correctly classified

SVM as an Optimization Problem

Idea: We can turn the $\max \{ \frac{2}{||w||}\}$ problem into a quadratic programming problem.

  • Why? Quadratic programming problems are easier to solve.

Takeaway: All 3 optimization models are solving the same problem, but quadratic is easier to solve:

  1. $\max \{ \frac{2}{||w||}\}$
  2. $\min\{ \frac{1}{2}||w||^2\}$
  3. Bottom $\max \{W(\alpha)\}$ equation in image above (complicated, trust instructors on this one)

SVM Continued: Maximizing alpha, and solving for $w$

Goal: After finding the alphas that maximize the quadratic optimization program, you can solve for the parameters of the hyperplane ($w$). Once we have those, we can solve for $b$ by plugging in $w$ and $x_i$. This gives us $w^Tx+b$.

Some considerations:

  • Most alphas are zeros (assumption given by instructors), which means that $x$'s with zero alphas won't contribute to $w$ at all.
  • Put it another way: Only a few $x$'s (with non-zero alpha) actually contribute to $w$. These data points are the support vectors.

The Kernel Trick

Motivation: We can separate non-linearly separable data points by using the kernel trick.

Math behind the kernel trick

  1. We first transform each data point into a higher dimension, given:
    • $q$: One data point (ie + or -) that contains an x and y coordinate ($q_1$ and $q_2$)
    • $\Phi$: Transform function
    • We can create a triple for each data point: $$ \Phi(q) = (q^2_1, q^2_2, \sqrt{2} q_1 q_2) $$
  2. If we take the dot product of the transformed points, $\Phi(x_i)^T\Phi(x_j)$ (as appears in the optimization program), it simplifies to $(x_i^Tx_j)^2$.

  3. The final outcome $(x_i^Tx_j)^2$ is the equation of a circle, which means we can use it to separate the 2 classes in the original problem.

Significance of the kernel trick

  1. We can swap out the original $x_i^Tx_j$ with non-linear functions. There are various kernels we can use.
  2. Using the trick is computationally efficient, because we're actually NOT converting the data into a higher dimensional space. The expansion above just proves that it works; in practice we use the final expression (i.e. $(x_i^Tx_j)^2$) as a surrogate for the higher-dimensional transformation.
  3. The $x_i^Tx_j$ part is essentially measuring the "similarity" of two data points. We are injecting domain knowledge into the optimization problem about what "similarity" means.
  4. Kernels must follow the Mercer Condition: "Mercer's theorem states that a symmetric, and positive-definite matrix can be represented as a sum of a convergent sequence of product functions."

Summary: SVMs

  • Margins: Large margins tend to avoid overfitting. Hence, we prefer the linear separator that maximizes the margin.
  • We used quadratic programming to find the maximum margin; for non-linearly separable datasets (e.g. a circle), we add the kernel trick.
  • Support vectors are the points from the input data that are needed to define the maximum-margin separator.
  • SVMs are "eager-lazy" learners. They only use a subset of the data; the other data points don't matter (their alphas are zero).
  • Kernel tricks allow us to emulate pushing the data into a higher dimensional space.
  • Kernels represent our domain knowledge.

Boosting & SVMs

Takeaway:

  • With boosting, creating more weak learners increases the confidence of the model's answers, particularly with the harder examples.
  • This essentially maximizes margin between classes, similar to SVMs.

Quiz: When do Boosting overfit?

  • Answer 1: If using NN with lots of params (ie. "strong" learner on its own)
    • Why? The neural net may completely fit the training data, so there is no redistribution of the "hard" examples; boosting will just return the same network over and over.
    • Takeaway: If the underlying learner is built to have overfitting conditions (lots of params), boosting may overfit.
  • Answer 2: Boosting tends to overfit in the presence of pink (uniform) noise.

SL7: Computational Learning Theory

Goals - Find a formal way of addressing:

  1. Defining learning problems.
  2. Show how specific algorithms work / don't work for a problem.
  3. Show whether a problem is fundamentally hard.

Definitions:

  1. Computational complexity - How much effort is needed for a learner to converge to a correct (or almost correct) hypothesis?
  2. Sample complexity (batch) - How many training examples do we need in order for the learner to create a successful hypothesis?
  3. Mistake bounds (online) - How many mistakes can a learner make over an infinite run?

Tools: Use the same tools used for analyzing algorithms in CS.

Resources that matter in computational learning theory:

  1. Time (eg. $O(n \log{n})$)
  2. Space (eg. $O(n^2)$)
  3. Data (i.e. fewer samples needed to learn is better).

Defining Inductive Learning (learning from examples)

  1. $P$ of successful training (eg. $P_{success}=1-\delta$).
  2. Number of examples available for training.
  3. Complexity of hypothesis class
  4. Accuracy to which target concept is approximated (ie. $\epsilon$)
  5. Manner in which training examples presented (eg. batch vs online).
  6. Manner in which training examples selected (see next section)

Selecting Training Examples

How training examples are selected, from a teacher/learner paradigm:

  1. Learners asks questions to teacher
    • "I have $x$, what is $c(x)$?"
  2. Teacher gives examples to help the learner.
    • "Here is $x$, it can be solved by $c(x)$"
  3. Fixed distribution
    • $x$ chosen from $D$ by nature.
  4. Evil - worst distribution.

Quiz: 20 Questions

In the context of the game "20 questions":

  • $H$ is the hypothesis class of all possible people the student can choose (ie. target class).
  • $X$ is the set of questions that could be asked (ie. training examples)

Q1: How many questions does the student need to ask, if Teachers chooses the questions?

  • Answer: Just one, since the teacher knows the answer (i.e. tell the student to ask "Is the person {target class}?").
  • Takeaway: The teacher can provide maximum information gain to the student since they know the answer.

Q2: How many questions must the learner ask before narrowing the possibilities down to 1 person?

  • Answer: $\log_2|H|$, because the goal is to eliminate as many hypotheses as possible with each question. The best the learner can do is halve the space each time, like binary search (which is $O(\log{N})$).

Teacher with Constrained Queries

Given:

  • $X$: $x_i$ of K-bit inputs
  • $H$: Conjunctions of literals or their negations
    • Example: $x_1 \land x_3 \land \neg x_5$ (x1 and x3 and not x5)

Q1: Given the table of examples and hypotheses, figure out the combination of positive/absent/negated to get the hypothesis.

Q2: Harder problem - we don't have access to hypothesis beforehand.

  • Answer: Unless the student gets a 1, you get a lot of useless answers.

Takeaway: A constrained set of queries makes it hard to consistently cut the hypothesis space in half, so realistically it is going to take more than log N time (possibly exponential).

  • Each question must be asked like a data point. We can't ask something like "Is x1 positive?"

Learner with "Mistake Bounds"

Reference: Mitchell, Ch. 7 "Computational Learning Theory" pg. 220.

Key Idea: "How many mistakes will the learner make in its predictions before it learns the target concept?"

Find-S algorithm (covered in lecture and book pg. 221):

Key Idea: Set hypothesis as specific as possible, then generalize as you see more positive examples.

  1. Initialize $h$ to the most specific hypothesis consisting of boolean literals, conjunctions and negations: $x_1 \land \neg x_1 \land x_2 \land \neg x_2 \ldots$
  2. For each positive training instance:
    • Remove from $h$ any literal that is not satisfied by $x$
  3. Output $h$

Takeaway:

  • Converges to a hypothesis that makes no errors, given data is noise-free and concept class is in hypothesis space $C \subseteq H$
  • The algorithm never makes more than $n+1$ mistakes, where n is the number of literals (K in the slide).

Version Spaces

  1. True hypothesis: True concept (c) in hypothesis space H
  2. Training set: S (subset of possible inputs X), combined with the true class for each x
  3. Candidate hypothesis: One hypothesis that is a subset of all hypotheses
  4. Consistent learner: Produces a hypothesis that is consistent with the training data (gets the correct class for every training example).
  5. Version space: Subset of hypotheses from H consistent with the training examples S.

Quiz: Version Spaces

PAC Learning

Error of h

Takeaway: True error penalizes the hypothesis in proportion to the probability (under the distribution $D$) of drawing an example it gets wrong.

  • No penalty for examples the learner would have gotten wrong but that never occur under $D$
  • Only a small penalty for examples that occur rarely.

PAC Formula

Definition: A concept is PAC learnable if the learner can, with relatively high confidence (probability $1-\delta$), return a hypothesis with low error (at most $\epsilon$), in time that is polynomial in $1/\epsilon$, $1/\delta$, and the number of hypotheses.

How to know if problem is PAC learnable

Formula: A problem is PAC learnable if we can draw enough samples S that the version space of S is epsilon-exhausted (with high probability).

  • Epsilon exhaustion occurs IFF every hypothesis in the version space (VS(S)) has low error (i.e. less than epsilon).
  • If any hypothesis in the version space has error greater than epsilon, it is NOT epsilon-exhausted.

Quiz: Epsilon-exhausted VS

Haussler Theorem - Bound True Error

Key formula to find the bound: $$ m \geq \frac{1}{\epsilon} \left(\ln|H| + \ln\frac{1}{\delta}\right) $$

Takeaway: The Haussler theorem gives you a formula (boxed in red) that tells you how many training examples you need in order to meet the epsilon threshold (a small calculator is sketched below).

  • NOTE: This assumes the target concept is in the hypothesis space ($C \subseteq H$). If it isn't (the "agnostic" learning scenario), the result is more or less the same, with a somewhat worse (still polynomial) dependence on epsilon.
  • IMPORTANT: This also assumes $H$ is a finite hypothesis space. How to deal with an infinite hypothesis space is covered in the VC dimensions lecture.
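A one-liner to evaluate the bound (the example values are hypothetical choices of mine; they happen to reproduce the 40 mentioned in the quiz below):

```python
from math import ceil, log

def haussler_bound(h_size, epsilon, delta):
    """m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return ceil((log(h_size) + log(1.0 / delta)) / epsilon)

print(haussler_bound(h_size=10, epsilon=0.1, delta=0.2))   # 40
```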

Quiz: PAC-learnable example

Takeaway: The answer (40) is much better than the size of the input space ($2^{10}$, i.e. 1024).

  • Also, this bound is agnostic to the distribution: it still works even if the distribution is not uniform.

Summary - Computational Learning Theory

  • How fundamentally hard is it to learn the problem.
  • Sample complexity: How many samples do we need in order to learn the problem.
  • Types of interactions:
    1. Learner picking the questions
    2. Teacher picking the questions (most efficient in info gain)
    3. Nature picks questions (based on distribution)
  • Mistake bounds: How many mistakes, vs how much data
  • Version spaces + PAC learning: Training vs Test vs True error (from distribution D)
  • Epsilon exhaustion: sample complexity bound ($m \geq \frac{1}{\epsilon} (\ln|H| + \ln\frac{1}{\delta})$)

SL8: VC Dimensions

Reference:

Motivation: In the previous lesson, we covered Haussler's theorem ($m \geq \frac{1}{\epsilon} (\ln|H| + \ln\frac{1}{\delta})$), but its success was limited to a finite hypothesis space (H). How do we handle very large, or infinite, hypothesis spaces?

Goal: Connect insight about VC dimensions to understanding sample complexity.

Infinite Spaces

Many learners actually have infinite hypothesis space, rather than finite:

  • linear separators (infinite number of lines)
  • neural networks (infinite number of weights)
  • DT with continuous inputs (syntactically)

Syntactic vs Semantic Hypothesis Spaces

The lecture mentions the difference between syntactic and semantic hypothesis spaces:

  1. Syntactic - All the ways the hypothesis could be represented. Infinite.
  2. Semantic - Actual different functions practically represented. Finite.

Significance: We can reduce the dimensionality of syntactic hypothesis space (infinite) into a finite hypothesis space. This is the underlying concept of determining VC dimensions.

VC Dimensions

From Mitchell (pg.215)

VC dimension $VC(H)$, of hypothesis space $H$ defined over instance space $X$ is the size of the largest finite subset of $X$ shattered by $H$. If arbitrarily large finite sets of $X$ can be shattered by $H$, then $VC(H) = \infty$

Definition from lecture: What is the largest set of inputs that the hypothesis class can "shatter", ie. label in all possible ways?

Shattering

From Mitchell (pg.214)

A set of instances S is shattered by hypothesis space $H$ IFF for every dichotomy of S there exists some hypothesis in $H$ consistent with this dichotomy.

  • If, for every possible labeling of a set of points, the hypothesis class contains a hypothesis that correctly separates them, it "shatters" those points.

Quiz: VC Dimension of Intervals

Question: Figure out the largest set we can shatter using the given hypotheses, given:

  • $X = \mathbb{R}$
  • $H = \{h(x) = x \in [a,b]\}$

Answer: $VC=2$

  1. $VC \geq 1$: Yes
    • Put bracket around 1 example: [+]
  2. $VC \geq 2$: Yes. All labelings of 2 input points can be captured by an interval:
    • [+ +]
    • [+] -
    • + [-]
    • - - []
  3. $VC \geq 3$? No
    • Cannot capture input points like + - + because the interval has to be contiguous.

Quiz: VC Dimension of Linear Separators

Question: Figure out the largest set we can shatter using the given hypotheses, given:

  • $X = R^2$
  • $H = \{ h(x) = W^Tx\geq\theta \}$

Answer: $VC = 3$

  • Three points can be shattered, but not four: e.g. the + - + - (checkerboard) labeling of four points can't be captured, since a line splits the plane into two contiguous half-planes.

VC Dimensions vs Number of Parameters

Takeaway: VC dimensions often align to the number of parameters in the hypothesis space.

  • Accounts for nonlinear separators. 2D circles = $VC\geq3$.

Quiz: VC Dimension of Convex Polygons

Takeaway: A convex polygon has infinite VC dimension: place any number of points on a circle, and any labeling can be captured by the polygon whose vertices are the positive points.

Sample Complexity & VC Dimension

Given:

  • $m$: Number of samples (training examples)
  • $\epsilon$: Target error rate
  • $\delta$: Failure probability (we want confidence $1-\delta$)
  • $VC(H)$: VC dimension of hypothesis space

We can determine the right sample size to meet the error rate $\epsilon$ and probability $1-\delta$ for an infinite hypothesis class.

$$ m \geq \frac{1}{\epsilon}\left(8\, VC(H) \log_2 \frac{13}{\epsilon} + 4 \log_2 \frac{2}{\delta}\right) $$

This can be simplified to reflect a finite hypothesis class.

$$ m \geq \frac{1}{\epsilon}(\ln{|H|} + \ln{\frac{1}{\delta}}) $$
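The infinite-space bound as code (hypothetical ε and δ of my choosing; VC(H) = 3 is the linear-separator example from above):

```python
from math import ceil, log2

def vc_sample_bound(vc_dim, epsilon, delta):
    """m >= (1/eps) * (8*VC(H)*log2(13/eps) + 4*log2(2/delta))."""
    return ceil((8 * vc_dim * log2(13 / epsilon) + 4 * log2(2 / delta)) / epsilon)

print(vc_sample_bound(vc_dim=3, epsilon=0.1, delta=0.05))   # 1899
```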

VC of finite H

Takeaway: H is PAC-learnable IFF VC dimension is finite. VC dimensions captures the idea of PAC learnability in a single concept.

From dev.to, "The VC of Finite Hypothesis Space":

If we denote the VC dimension of a finite hypothesis space by d, there have to be $2^d$ distinct concepts (as each different labelling can be captured by a different hypothesis in the class) - therefore $2^d$ is less than or equal to the number of hypotheses |H|.

Rearranging, d <= log2(|H|). So a finite hypothesis class gives us a finite VC dimension, and therefore makes things "PAC Learnable". There is a further proof that this goes the other way - H is PAC learnable if and only if the VC dimension is finite. VC dimension captures the notion of PAC learnability in a single concept.

Summary: VC Dimensions

  • VC dimension + shattering.
  • VC relates to hypothesis space parameters.
  • VC relates to finite hypothesis space size.
  • Sample complexity relates to VC dimension.
  • VC computing tricks
  • VC dimension captures the idea of PAC learnability.
  • NOTE: VC dimension is defined for spaces of binary functions

From CMU lecture notes:

Through our discussions of VC theory we have seen that we can improve generalization by controlling the complexity of the concept class H from which we are choosing a hypothesis. We saw that the shatter coefficient and VC dimension were useful measures of complexity because we could bound the performance of a learning algorithm in terms of these quantities and the amount of data we have.

SL9: Bayesian Learning

Reference:

Motivation:

  • Bayes theorem gives us a way to calculate the probability of a hypothesis based on its
    1. Prior probability
    2. Probabilities of observing various data given the hypothesis, and
    3. The observed data itself.

Key Idea: Find the hypothesis $h$ from hypothesis class $H$ that is the most probable given the data $D$. $$ argmax_{h \in H} P(h | D) $$

Bayes Theorem

Mitchell, p156

Bayes theorem ... provides a way to calculate the posterior probability P(h|D), from the prior probability P(h), together with P(D) and P(D|h)

Given:

  • $h$: hypothesis
  • $D$: Data
  • $P(h)$: Prior probability of $h$. Initial probability that hypothesis $h$ holds, before we observe the training data.
    • This reflects our domain knowledge (ex. similarity metric for KNN, feature importance in DT, NN structure, true positive of a test).
    • If we have no domain/background knowledge, we can assign the same prior probability to each candidate hypothesis.
    • Independent of $D$.
  • $P(D)$: Prior probability that $D$ will be observed.
  • $P(D|h)$: Probability of observing $D$ given some world in which hypothesis $h$ holds.
    • ie. Given a set of data $x_i$, what's the probability that I'd see some particular label?
    • Example: Given $h(x) = x \geq 10$, if $x=7$, $P(True) = 0$ and $P(False)=1$.
    • Computing this is much easier than $P(h|D)$, which is why it is used for maximum likelihood estimation.
  • $P(h|D)$: Posterior probability of $h$. Our hypothesis given $D$ (what we want to maximize). Reflects our confidence that $h$ holds after we have seen $D$.
    • Reflects the influence of $D$, unlike $P(h)$
    • Harder to compute

Bayes theorem $$ P(h|D) = \frac{P(D|h)P(h)}{P(D)} $$

Takeaway: Bayes rule allows us to switch $P(h|D)$ to $P(D|h)$.

  • $P(h|D)$ increases with $P(h)$ and with $P(D|h)$

Quiz: Bayes Rule

A man goes to see a doctor. She gives him a lab test. The test returns a correct positive 98% of the time, and a correct negative 97% of the time. The test looks for a rare disease that only 0.8% of the population has. The test comes out positive; does the man have the rare disease?

Given:

  • Prior probability of having the disease: $P(disease) = 0.008$
  • True positive rate of the test: $P(+|disease) = 0.98$
  • True negative rate of the test: $P(-|\neg disease) = 0.97$

We can apply Bayes theorem to get the probability of having the disease given that the test returns positive: $$ P(disease|+) = \frac{P(+|disease)\, P(disease)}{P(+)} \\ = \frac{P(+|disease)\, P(disease)}{P(+|disease)\, P(disease) + (1 - P(-|\neg disease))\,(1 - P(disease))} \\ = \frac{0.98 \times 0.008}{(0.98 \times 0.008) + (0.03 \times (1 - 0.008))} \\ = 0.2085 \\ \approx 21\% $$

ANSWER: Probably not. Even if the man tests positive there is only 21% chance that he has the disease.

Takeaway: The prior probability of the disease $h$ significantly affects the outcome. If prior probability is low, then the test isn't very useful.

  • Context: If you're in a world where everybody gets tested for a rare disease, the prior probability will be low. But if you're in a world where only patients suspected of having the disease get tested, the prior will be higher.
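The same calculation as a small function (values taken from the quiz above):

```python
def posterior_given_positive(prior, true_positive_rate, true_negative_rate):
    """P(disease | +) = P(+|disease)P(disease) / P(+), expanding P(+) over both cases."""
    p_positive = true_positive_rate * prior + (1 - true_negative_rate) * (1 - prior)
    return true_positive_rate * prior / p_positive

print(posterior_given_positive(0.008, 0.98, 0.97))   # ~0.21
```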

Maximum a posteriori (MAP) & Maximum Likelihood Estimation

Both are derived from the Bayes Theorem: For each hypothesis in $H$, calculate $P(h|D)$

$$ P(h|D) = \frac{P(D|h)P(h)}{P(D)} $$

Takeaways:

  • To calculate MAP and MLE, we can drop $P(D)$ in the denominator since it's a constant independent of $h$.
  • If we assume the prior over $h \in H$ is uniform, we can also drop $P(h)$ (this gives MLE).
  • Computationally intensive (need to calculate MLE for each $h$), hence NOT practical.

Maximum a posteriori (MAP)

$$ h_{MAP} \equiv argmax_{h \in H} P(h|D) $$

Maximum Likelihood Estimation

  • MLE is more computationally efficient than MAP
$$ h_{MLE} \equiv argmax_{h \in H} P(D|h) $$

Bayesian Learning in Action

Takeaway: Given a bunch of data, the probability of a particular hypothesis being correct is uniform over all of the hypotheses in the version space (ie consistent with given data).

Quiz: Noisy Data

(TODO: Revisit this lecture - I didn't quite understand it)

Q: Compute the probability of seeing this particular data set in a world where the hypothesis ($h(x)=x$) is true.

Bayesian Learning & Sum of Squared Errors

Takeaways:

  • Gradient descent, linear regression, etc. are justified by deriving the maximum likelihood hypothesis ($h_{ML}$): maximizing the likelihood works out to minimizing the sum of squared errors (SSE).
  • OTOH, whenever you are trying to minimize the SSE, you are assuming that:
    1. The function f(x) is deterministic
    2. f(x) is summed with some amount of noise $\epsilon$
    3. The noise ($\epsilon$) has a Gaussian distribution (IID with mean zero and some variance $\sigma^2$)
  • If any of these assumptions is false, then minimizing SSE to solve the problem is incorrect.

Quiz: Sum of Squared Errors

Q: Given data <$x_i$, $d_i$> and function $d_i=f(x_i) + \epsilon_i$, which one of these $h$ best fits the data:

  1. h(x) = x mod 9
  2. h(x) = x/3
  3. h(x) = mean(d)

ANSWER: x mod 9 (SSE 12), better than linear regression

Minimum Description Lengths

From Mitchell, pg 173

The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the sum of the description lengths of the hypothesis and of the data given the hypothesis. MDL ... provides a way of trading off hypothesis complexity for the number of errors committed by the hypothesis.

  • Prefers simpler hypotheses that net the lowest error over a more complex hypothesis with the same error. This is basically bias-variance trade off.

Takeaway: Derivations based on Bayesian learning gives us a handle on why we make the decision we make when building ML models.

Bayesian Classification (Bayes Optimal Classifier)

Takeaway: So far we learned about finding the best hypothesis, but we actually care about finding the most probable classification of the new instance. To do this, we take the weighted vote of the probabilities of each hypothesis class.

Bayes Optimal Classifier:

$$ argmax_{v_j \in V} \Sigma_{h_i \in H} P(v_j|h_i)P(h_i|D) $$

The most probable classification of the new instance is ... combining the predictions of all hypotheses, weighted by their posterior probabilities. This method maximizes the probability that the new instance is classified correctly given the data, hypothesis space, and priors over the hypotheses. No other classification method using the same hypothesis space and same prior knowledge can outperform this method on average.

Bayes Classifier Quiz:

Summary: Bayesian Learning

Takeaway: Bayesian Learning + Classification gives us a way of talking about optimality and gold standards.

  1. Bayes Rule - Swaps causes & effects $P(h|D)$ ~ $P(D|h)P(h)$
  2. Priors matter.
  3. $h_{MAP}$ and $h_{ML}$ ($h_{ML}$ equals $h_{MAP}$ when the prior is uniform)
  4. Derived rules we've used in previous lessons (SSE, Occam's Razor, bias/variance tradeoff using minimum description length).
  5. Bayes Optimal Classifier - What we really care about is Bayesian "classification" (not the hypothesis)

SL10: Bayesian Inference

Reference:

  • Mitchell, Ch. 6 "Bayesian Learning"
  • Wasserman, Ch.17 "Directed Graphs and Conditional Independence"
  • dev.to
  • George's Notes, section 4.2.

Bayesian Networks - What is it?

From Lecture:

A representation for expressing and manipulating probabilistic quantities over complex spaces.

ChatGPT:

Bayesian networks are probabilistic graphical models that represent the relationships between different random variables and their conditional dependencies. The graph consists of nodes representing the variables and edges representing the dependencies between them. The dependencies are represented probabilistically, using Bayes' theorem, which specifies the probability of a variable given its dependencies.

Mitchell, pg.186

A Bayesian belief network represents the joint probability distribution for a set of variables.

Joint Distribution

Quiz: Joint Distribution

Takeaway:

  • Marginal probability of one variable, e.g. $P(lightning)$: sum the joint probabilities over all values of the other variables.
  • Conditional probability, e.g. $P(lightning|storm)$: the joint probability $P(lightning, storm)$ divided by the marginal $P(storm)$.

Conditional Independence

Definition: From Mitchell, pg.185

Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z.

$$ (\forall{x_i, y_j, z_k})\text{ } P(X=x_i|Y=y_j,Z=z_k) = P(X=x_i|Z=z_k) $$

or simply, $$ P(X|Y,Z)=P(X|Z) $$

Quiz: Conditional Independence

Bayes Nets (aka Graphical Models, Belief Networks)

Quiz: Nodes and Probabilities

Takeaway:

  • There is no necessary causal relationship between the nodes even though they are drawn with arrows; the arrows just show how the probabilities are decomposed (the conditional dependencies).

Quiz: Sampling from the Joint Distribution

Q: In what order should we select ABCDE?

  • ANSWER: It should be sampled in topological order (topological sort): each node is sampled after its parents (see the small sketch below).
  • This only works if there are no cyclic relationships (i.e. the graph is acyclic).
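A tiny sketch of sampling in topological order (ancestral sampling); the three-node chain and its conditional probability values are made up for illustration:

```python
import random

def sample_storm_net(rng):
    """Sample each node after its parents: Storm -> Lightning -> Thunder (hypothetical CPTs)."""
    storm = rng.random() < 0.2
    lightning = rng.random() < (0.6 if storm else 0.05)   # P(lightning | storm)
    thunder = rng.random() < (0.9 if lightning else 0.1)  # P(thunder | lightning)
    return storm, lightning, thunder

rng = random.Random(0)
print([sample_storm_net(rng) for _ in range(3)])
```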

Recovering the Joint Distribution

Takeaway: We can go from the values in the conditional probability tables (in each of the nodes) to computing the probability of any joint combination of variables.

Formula: The joint probability for some assignment to the variables is equal to the product of all the individual values.

Example using the graph in the previous slide: $$ P(A,B,C,D,E) = P(A)P(B)P(C|A,B)P(D|B,C)P(E|C,D) $$

  • If A~E are boolean variables, the number of values we'd need to specify for the joint distribution in the standard representation (ie left side of equation) is $2^5-1$.
  • Conditional probability is more compact, only needing 14 (1, 1, 4, 4, 4).
  • Conditional probability tables, unlike the full joint distribution, only grow with the number of parents each node has.

Quiz: Why Sampling?

Distributions (and sampling from them) are useful for:

  1. Simulating a complex process - If the distribution represents something complex, we may want to simulate that world and act according to those probabilities.
  2. Approximate inference - We want to infer what might be true of the world in different situations.
    • Why approximate? Because exact inference is NP-hard.
  3. Visualization - We want to get a sense of the data, especially when the distribution is complex with many variables. In such cases it's hard for humans to understand the relationships between variables; sampling gives you concrete examples that convey how those relationships work.

Inference Rules

Quiz: Inference by Hand

This is very hard to figure out. Just watch the video and follow along: https://youtu.be/Vt8fRJwvdwA

Naive Bayes

Ref: Mitchell, pg. 177

The Naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the target value. In other words, the assumption is that given the target value of the instance, the probability of observing the conjunction $a_1, a_2, \ldots a_n$ is just the product of the probabilities for the individual attributes.

Naive Bayes Classifier, where $v_{NB}$ represents the predicted target value and $V$ is the set of possible target values:

$$ v_{NB} = argmax_{v_j \in V} P(v_j) \Pi_i P(a_i|v_j) $$

Example from lecture using spam classification:

  • Estimating the joint probability of all the attributes given the target value, $P(a_1, \ldots, a_n|v)$, is not feasible directly; it would need a huge amount of training data. The naive independence assumption lets us estimate each $P(a_i|v)$ separately instead.
  • Using Bayes theorem, we flip from the probability of spam given the attributes to the probability of the attributes given spam (times the prior probability of spam). With the naive assumption, that is just the product of the individual conditional probabilities.

Takeaway: Naive bayes classifier allows us to observe attribute values to infer the class.
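A minimal Naive Bayes sketch for discrete attributes (my own illustration; it includes the smoothing workaround discussed below, and the toy weather data is made up):

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Minimal Naive Bayes for discrete attributes, with add-m smoothing."""

    def fit(self, X, y, m=1.0):
        self.m = m
        self.n = len(y)
        self.class_counts = Counter(y)
        self.values = defaultdict(set)         # attribute index -> observed values
        self.counts = defaultdict(Counter)     # (class, attribute index) -> value counts
        for x, v in zip(X, y):
            for i, a in enumerate(x):
                self.values[i].add(a)
                self.counts[(v, i)][a] += 1
        return self

    def predict(self, x):
        """v_NB = argmax_v P(v) * prod_i P(a_i | v), computed in log space."""
        def log_score(v):
            score = math.log(self.class_counts[v] / self.n)
            for i, a in enumerate(x):
                num = self.counts[(v, i)][a] + self.m
                den = self.class_counts[v] + self.m * len(self.values[i])
                score += math.log(num / den)
            return score
        return max(self.class_counts, key=log_score)

# Hypothetical toy data: classify (weather, wind) -> play
X = [("sunny", "weak"), ("rain", "strong"), ("sunny", "strong"), ("rain", "weak")]
y = ["yes", "no", "yes", "no"]
print(NaiveBayes().fit(X, y).predict(("sunny", "weak")))  # "yes"
```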

Why Naive Bayes is "Naive"

1) No Free Lunch: It is "naive" because it's not modeling the interrelationships between attributes.

- Then why does it work? Naive Bayes will give you incorrect probabilities if the attributes are related, but for the purpose of classification (i.e. the argmax) it actually works quite well empirically. For example, if two words are related ("free" + "lunch"), their frequencies may be double-counted, but the effect tends to cancel out across classes (a bit hand-wavy here).

2) We don't have infinite data: $P(a_i|v)$ is calculated by counting how many times $a_i$ occurs for a target value $v$ in the training data (over the counts of target value). If we have limited data, that attribute may have zero instances, which makes the numerator zero. If we rely on the product of all attributes given target value, that probability becomes zero.

- Workaround: Smoothing, ie initializing the counts with a small value so that no attribute is zero.  BUT this does create inductive bias, assuming that all attributes are mildly possible.

Summary: Bayes Nets

  1. Bayesian Network representation of joint probability distributions.
  2. Examples of using networks to compute probabilities
  3. Sampling as a way to do approximate inference.
  4. In general, exact inference is NP-hard.
  5. Naive Bayes links all the Bayesian theory to a ML classification task.
    • NB handles missing attributes well, because all the attributes are linked together by probabilities.

Aside: Naive Bayes Text Classification

Goal: Binary classification task using text documents (e.g. target class = the document is an academic paper)

Steps:

  1. Given a training data set, count how many times each word appears in the document for a given target class. $$ v_{NB} = argmax_{v_j \in \{Academic,NotAcademic\}} P(v_j)P(a_1="our"|v_j)P(a_2="approach"|v_j)\ldots P(a_i="trouble"|v_j) $$
  2. We make an assumption that words occur IID, which means we can disregard the position of the word. This reduces the probability space significantly.
  3. Use the m-estimate with uniform priors and with m equal to the size of the vocabulary.
    • Given $n$ is the total number of word positions in all training examples whose target value is $v$, $n_k$ is the number of times word $w_k$ is found among these $n$ word positions, and $|Vocabulary|$ is the total number of distinct words (and other tokens) found in the training data: $$ \frac{n_k+1}{n+|Vocabulary|} $$

The final algorithm from Mitchell, pg.183:

In [ ]: