Definition: Take examples of inputs and outputs. Now, given a new input, predict its output.
Term | Definition |
---|---|
Instances | Inputs. Vectors of attributes that define your input space. |
Concept | Function. Takes the input space and maps it to an output. A concept is a mapping between objects in a world and membership in a set, which makes it a function. |
Target Concept | The correct concept that maps instances to outputs. |
Hypothesis Class | Set of concepts that you're willing to consider in order to find the target concept (e.g. Binary or multiclass, but not considering regression output). |
Sample | Training set. Set of all of our inputs paired with labels with the correct output. Foundation for inductive learning. |
Candidate Concept | Concept that you think is the target concept. (e.g. Anyone with curly hair is credit-worthy) |
Testing Set | Set of instances to test a candidate concept. |
Work through each instance and give an answer based on the decision tree.
Note:
"20 questions" example: We want to narrow possibilities with each questions asked. This motivates us to ask very general Qs that significantly narrows down possiblities.
Notes:
Example: Boolean expressions as decision trees
Type | Size of Tree (nodes) | Description |
---|---|---|
n-OR ("any" fn) | Linear, $O(n)$ | Easy to solve. |
n-XOR ("odd/even parity" fn)* | Exponential, $O(2^{n})$ | Hard to solve. There is no clever way to pick the right attribute (you need to look at all attributes). |
Takeaway: We hope that our ML problems are more like n-OR problems rather than n-XOR problems because n-XOR problems are hard and expensive.
* If number of attributes that are true is odd, then the output is true.
Q: Exactly how expressive is a decision tree?
Given:
A: The hypothesis space of decision trees is very expressive, so we need to search this space cleverly to find the best representation efficiently.
Loop:
1. Pick best attribute A
2. Assign A as decision attribute for Node
3. For each value A, create a descendant of Node
4. Sort training examples to leaves
5. Stop if examples perfectly classified
6. Else iterate over leaves
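A minimal sketch of this loop as recursive Python (my own illustration, not the lecture's code), assuming categorical attributes stored in dicts and using information gain (defined in the next section) to pick the best attribute; `id3` and the helpers are hypothetical names:

```python
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    return -sum((c / len(labels)) * math.log2(c / len(labels)) for c in counts.values())

def information_gain(examples, attr):
    labels = [y for _, y in examples]
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [y for x, y in examples if x[attr] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, attributes):
    """examples: list of (features: dict, label) pairs."""
    labels = [y for _, y in examples]
    if len(set(labels)) == 1:                 # 5. stop if examples perfectly classified
        return labels[0]
    if not attributes:                        # no attributes left: return majority label
        return Counter(labels).most_common(1)[0][0]
    a = max(attributes, key=lambda attr: information_gain(examples, attr))  # 1. best attribute
    tree = {a: {}}                            # 2. assign A as decision attribute for node
    for value in {x[a] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[a] == value]  # 3-4. branch, sort examples
        tree[a][value] = id3(subset, [b for b in attributes if b != a])  # 6. iterate over leaves
    return tree
```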
Definition: Reduction in randomness (entropy) over the labels of a set of data, gained by knowing the value of a particular attribute.
Entropy: "Average amount of information conveyed by an event, when considering all possible outcomes"
Formula of entropy: $$ -\sum_v p(v) \log_2 p(v) $$
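A quick numeric sanity check of the formula (a sketch; base-2 logs, so entropy is measured in bits):

```python
import math

def entropy(probs):
    # H = -sum_v p(v) * log2 p(v); the 0 * log(0) case is treated as 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin is maximally random
print(entropy([1.0]))        # 0.0 bits: a certain outcome conveys no information
print(entropy([0.9, 0.1]))   # ~0.469 bits: a biased coin is more predictable
```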
Types of bias:
ID3's inductive bias - Types of preferences:
Q: How to build DTs with continuous attributes?
Q: T/F, does it make sense to repeat an attribute along a path in a tree?
Overfitting: Don't trust the data too much because there is potential for noise. This happens when trees are too big.
Solutions
Regression implies continuous outputs. Information gain doesn't work well with continuous values.
Q: What's the splitting criteria for regression trees?
Q: What's the output?
Etymology: "Regression to the mean"
Finding the best line (as measured by least squared error) is achieved through calculus (i.e. setting the derivative of the error to zero).
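For example, for a line $y = wx + b$, setting the derivatives of the squared error with respect to $w$ and $b$ to zero gives a closed form; a minimal numpy sketch on made-up data:

```python
import numpy as np

# Toy data (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.1, 4.9, 7.2])

# Setting d/dw and d/db of sum((y - (w*x + b))**2) to zero yields:
w = ((x * y).mean() - x.mean() * y.mean()) / ((x**2).mean() - x.mean()**2)
b = y.mean() - w * x.mean()
print(w, b)  # best-fit slope and intercept under squared error
```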
Training data has noise that is not part of $f$, so what we actually observe is $f+\epsilon$.
Types of error sources:
Definition: Split the training data into n folds, then hold out one fold as the "test" data and train on the rest. Repeat this for each fold. Choose the model with the lowest average error.
Aside - Assumptions of SL:
- Train/test sets are IID (independent and identically distributed)
Applied to housing example in video:
Example of a perceptron (binary classifier with 3 inputs + weights)
Quiz: Example inputs and estimated y
Image: Representation of activation space given 2 inputs + weights. Generates a halfplane.
Takeaway: Perceptrons are linear functions that compute a hyperplane.
Quiz 1: If we focus on $x_1 \in \{0,1\}$ and $x_2 \in \{0,1\}$, what is $y$?
Quiz 2: If $y$ is an OR function, what would we set $w_1$, $w_2$, $\theta$?
Quiz 3: Given a NOT logic (one variable), what should we set $w_1$ and $\theta$?
Takeaway: AND/OR/NOT logic can each be expressed as a single perceptron unit; XOR requires combining units into a small network (see the code below).
# Perceptron Quiz answer in code

```python
def xor_network(x1, x2, w1, w2, w_and, theta) -> int:
    # Two-layer network: an AND unit feeds (with weight w_and) into the output unit
    activation = x1*w1 + x2*w2 + int(x1 and x2)*w_and
    return 1 if activation >= theta else 0

def test_xor(w1, w2, w_and, theta):
    assert xor_network(0, 0, w1, w2, w_and, theta) == 0
    assert xor_network(0, 1, w1, w2, w_and, theta) == 1
    assert xor_network(1, 0, w1, w2, w_and, theta) == 1
    assert xor_network(1, 1, w1, w2, w_and, theta) == 0
    print("Passed all tests")

# XOR = OR, minus a strong negative contribution when both inputs are on
w_1 = 1
w_2 = 1
w_and = -2
threshold = 1

test_xor(w_1, w_2, w_and, threshold)
```
Answer to XOR question:
Definition: Given examples, find weights that map inputs to outputs.
2 ways in which perceptron training is done:
Goal: How to set the weights of a single unit so that it matches a training set.
Components:
(1) Weight change formula: Change the weight $w_i$ by $\Delta w_i$. This is what we iterate on while there is error.
$$ w_i = w_i + \Delta w_i $$
(2) Defining the weight change amount - First, calculate $\hat{y}$.
(3) Determine the weight change needed based on the distance between $y$ and $\hat{y}$. This is multiplied by the input value $x_i$ and scaled by the learning rate $\eta$.
$$ \Delta w_i = \eta (y-\hat{y})x_i $$
Definition: All examples can be separated by a linear hyperplane. This gets more complicated to visualize in 4+ dimensions.
Perceptron Rule: If we have data that is linearly separable, then the perceptron rule will find this hyperplane.
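A minimal sketch of training with this rule (my own illustrative Python; the threshold is folded in as a bias weight on a constant input):

```python
def train_perceptron(data, eta=0.1, epochs=100):
    """data: list of (x: list of floats, y: 0 or 1). Returns weights, bias at index 0."""
    n = len(data[0][0])
    w = [0.0] * (n + 1)                              # w[0] acts as -theta
    for _ in range(epochs):
        errors = 0
        for x, y in data:
            xb = [1.0] + list(x)                     # prepend constant input for the bias
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, xb)) >= 0 else 0
            if y_hat != y:
                errors += 1
                w = [wi + eta * (y - y_hat) * xi for wi, xi in zip(w, xb)]  # w_i += eta*(y - y_hat)*x_i
        if errors == 0:                              # converged: data fit exactly
            break
    return w

# OR is linearly separable, so the rule is guaranteed to converge on it:
print(train_perceptron([([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]))
```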
Q: What if the data is not linearly separable?
Motivation: More robust to linearly non-separable datasets.
Components:
(1) Calculate the error metric. Goal is to minimize this error.
(2) Find the partial derivative of above.
(3) We get the result - The derivative of the error $E$ with respect to $w_i$. The result looks pretty similar to the perceptron rule.
$$ \frac{\partial E}{\partial w_i} = \sum_{(x,y) \in D} (y-a)(-x_i) $$
Type | Learning Rule | Pros | Cons |
---|---|---|---|
Perceptron Rule | $$\Delta w_i = \eta (y-\hat{y})x_i$$ | Guaranteed finite convergence | Requires linear separability |
Gradient Descent | $$\Delta w_i = \eta (y-a)x_i$$ | More robust to non-linearly separable data | Likely to converge only to a local optimum |
(1) Sigmoid activation function $$ \sigma(x) = \frac{1}{1+e^{-x}} $$
(2) Derivative of the sigmoid function $$ \sigma'(x) = \sigma(x)(1-\sigma(x)) $$
Extra: How to take the derivative of the sigmoid function. Reference: TDS
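A quick numeric check of this identity against a finite-difference estimate (a sketch):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, h = 0.5, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)   # finite-difference derivative
analytic = sigmoid(x) * (1 - sigmoid(x))                # sigma'(x) = sigma(x)(1 - sigma(x))
print(numeric, analytic)                                # both ~0.2350
```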
Key Points:
Alternatives/supplements to vanilla gradient descent:
Definition: Representational power of the data structure you're using. Tells you the set of hypotheses you're willing to consider.
Types of neural nets and their representational power:
Definition: The algorithm's selection (ie preference) of one representation over another.
Preference bias of neural nets: small random values that provide simpler explanations.
- SMALL: Favors small values, because networks with small weights have lower complexity. Larger values tend to overfit.
- RANDOM: Variability. Possibility of exploring outside of local minima.
- Similar to __Occam's Razor__: "Entities should not be multiplied unnecessarily."
- The algorithm prefers stopping once it fits the data well, before letting the weights grow larger.
Pros | Cons |
---|---|
Remembers (no catastrophic forgetting) | Cannot generalize |
Fast (no learning) | Slower at runtime |
Simple (Occam's Razor) | Easily overfits (remembers noise as well) |
 | Struggles with complexity (e.g. two identical x's with different y's, multiple equidistant neighbors with different y values) |
Given:
(1) Find the $K$ nearest neighbors of $q$: $$ NN = \{ i : d(q,x_i)\ k \text{ smallest} \} $$
(2.1) Classification: Return the (weighted) vote of $y_i \in \text{NearestNeighbor}$ (ie. Mode)
(2.2) Regression: Return the (weighted) mean of $y_i \in \text{NearestNeighbor}$
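A minimal sketch of both steps (illustrative Python; `knn` is a hypothetical helper, using Euclidean distance and unweighted votes/means):

```python
import math
from collections import Counter

def knn(q, data, k, mode="regression"):
    """data: list of (x, y) pairs; q: query point. Step 1 finds the k nearest, step 2 combines."""
    d = lambda a, b: math.dist(a, b)                       # Euclidean (p=2 Minkowski)
    neighbors = sorted(data, key=lambda xy: d(q, xy[0]))[:k]
    ys = [y for _, y in neighbors]
    if mode == "classification":
        return Counter(ys).most_common(1)[0][0]            # vote, i.e. mode
    return sum(ys) / len(ys)                               # mean

data = [((2, 4), 8), ((7, 1), 50)]
print(knn((1, 6), data, k=2))   # 29.0: the mean of the two neighbors' y values (8 and 50)
```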
Takeaways:
How to calculate Manhattan/Euclidean distance: Both are derived from the Minkowski distance ("p-norm distance"), with different values of $p$:
Using the first row as an example (x=(1,6), y=7):
Finding 1-NN/3-NN:
- Nearest neighbors: (2,4) → y=8 and (7,1) → y=50, so averaging gives $y=\mu(8, 50)=29$
- Single nearest neighbor: (2,4) → y=8
Error:
Takeaways:
Definition: As the number of features or dimensions grows, the amount of data we need to generalize accurately grows exponentially.
Example: Number of neighbors needed to cover the same dimension space.
Example: Locally weighted linear regression
Idea: Build up a bunch of rules and combine them together until you got something that's good enough.
Pseudocode:
1. Learn over a subset of the data (chosen uniformly at random) and apply a learner
...(repeat)
2. Combine all the rules (How? Regression: Mean, Classification: Mode)
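A minimal sketch of this procedure for regression, assuming a hypothetical `learner(subset)` callable that returns a fitted model:

```python
import random

def bag(data, learner, n_models=10, subset_frac=0.6):
    """Learn each model on a uniform random subset, then combine by the mean (regression)."""
    models = []
    for _ in range(n_models):
        subset = random.sample(data, int(len(data) * subset_frac))  # 1. random subset
        models.append(learner(subset))                              # ...apply a learner
    def combined(x):                                                # 2. combine the rules
        return sum(m(x) for m in models) / len(models)              # regression: mean
    return combined
```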
Example: Spam Email
Takeaway:
Pseudocode (high-level):
1. Learn over a subset of the data and apply a learner
2. Filter down to the "hardest" examples, i.e. the data points that result in the highest errors.
...(repeat)
3. Combine all the rules (How? Regression: WEIGHTED mean, Classification: WEIGHTED Mode)
Classification error = # of mismatches
Given:
Definition: A learning algorithm that, under any distribution, will find a hypothesis that performs better than chance on the current distribution.
"Does better than chance" expressed more formally: For all distributions, your learner will have an expected error (hypothesis disagreeing with the real target of a single sample)
$$ \forall_D\ Pr_D[h(x) \neq c(x)] \leq \frac{1}{2} - \epsilon $$
ANSWER: 1/4. Some explanation in Slack thread:
One can ask a simple question: given a set of objects, X, what is the probability of seeing any particular object, x, that is drawn from X according to some distribution D? By definition, D, induces a value for each x that is between 0 and 1 inclusive where the sum of all these values is 1.
In our case, each x has some corresponding label; let's call each one y.
A weak learner is simply a learning function that given X, D, Y, and a set of hypotheses, H, will find some hypothesis, h, drawn from H, such that the probability of drawing an x such that h(x) == y is better than chance.
We generate the final hypothesis by summing each $\alpha_t$ times its hypothesis $h_t$, then running the sum through a sign function (which returns the sign of the number: -1, 0, or 1).
Formula for final hypothesis $H$:
$$ H_{final}(x) = \text{sgn}\left(\sum_t \alpha_t h_t (x)\right) $$
Supporting material/notes for this section: My notes from ISYE6501
Given:
Q: Solve for the line that is described by their difference (...?)
Answer:
Idea: We can turn the $\max \{ \frac{2}{||w||}\}$ objective into a quadratic programming problem.
Takeaway: All 3 optimization models are solving the same problem, but quadratic is easier to solve:
Goal: After finding an $\alpha$ that maximizes the quadratic optimization program, you can recover the parameters of the hyperplane ($w$). Once we have $w$, we can solve for $b$ by plugging in $w$ and a support vector $x_i$. This gives us $wx+b$.
Some considerations:
Motivation: We can separate non-linearly separable data points by using the kernel trick.
Takeaway:
Goals - Find a formal way of addressing:
Definitions:
Tools: Use same tools for analyzing algorithms in CS.
Resources that matter in computational learning theory:
How training examples are selected, from a teacher/learner paradigm:
In the context of the game "20 questions":
Q1: How many questions does the student need to ask, if the teacher chooses the questions?
Q2: How many questions must the learner ask before it narrows down the possibilities to 1 person?
Given:
Q1: Given the table of examples and hypotheses, figure out the combination of positive/absent/negated to get the hypothesis.
Q2: Harder problem - we don't have access to hypothesis beforehand.
Takeaway: Constrained set of queries make it hard to consistently reduce the hypothesis space in half, so realistically it is going to take more time than log N time (possibly exponential).
Reference: Mitchell, Ch. 7 "Computational Learning Theory" pg. 220.
Key Idea: "How many mistakes will the learner make in its predictions before it learns the target concept?"
Key Idea: Set hypothesis as specific as possible, then generalize as you see more positive examples.
Takeaway:
Takeaway: True error penalizes a hypothesis in proportion to the probability of it being incorrect on a randomly drawn instance.
Definition: A concept is PAC learnable if the learner can, with relatively high confidence ($1-\delta$), return a hypothesis with low error ($\leq \epsilon$), in time polynomial in $1/\epsilon$, $1/\delta$, and the number of hypotheses.
Formula: A problem is PAC learnable if the version space of sample S is epsilon-exhausted with probability at least $1-\delta$.
Key formula to find the bound: $$ m \geq \frac{1}{\epsilon} \left(\ln|H| + \ln\frac{1}{\delta}\right) $$
Takeaway: Haussler's theorem gives you a formula (boxed in red) that tells you how many training examples you need in order to meet the epsilon threshold.
Takeaway: The answer (40) is much better than the size of the input space ($2^{10}$, i.e. 1024).
Reference:
Motivation: In the previous lesson, we covered Haussler's theorem ($m \geq \frac{1}{\epsilon} (\ln|H| + \ln\frac{1}{\delta})$), but it only applies to a finite hypothesis space (H). How do we handle a very large, or infinite, hypothesis space?
Goal: Connect insight about VC dimensions to understanding sample complexity.
Many learners actually have infinite hypothesis space, rather than finite:
The lecture mentions the difference between syntactic vs semantic hypothesis space:
Significance: We can reduce the dimensionality of syntactic hypothesis space (infinite) into a finite hypothesis space. This is the underlying concept of determining VC dimensions.
From Mitchell (pg.215)
VC dimension $VC(H)$, of hypothesis space $H$ defined over instance space $X$ is the size of the largest finite subset of $X$ shattered by $H$. If arbitrarily large finite sets of $X$ can be shattered by $H$, then $VC(H) = \infty$
Definition from lecture: What is the largest set of inputs that the hypothesis class can "shatter", ie. label in all possible ways?
From Mitchell (pg.214)
A set of instances S is shattered by hypothesis space $H$ IFF for every dichotomy of S there exists some hypothesis in $H$ consistent with this dichotomy.
- If a learner can correctly separate a set of points, it "shatters" those points.
Question: Figure out the largest set we can shatter using the given hypotheses, given:
Answer: $VC=2$
How each labeling of one or two points is achieved (brackets mark the interval):
[+]
[+ +]
[+] -
- [+]
- - []
But three points cannot be shattered: the labeling + - + is not achievable, because the interval has to be contiguous.

Question: Figure out the largest set we can shatter using the given hypotheses, given:
Answer: $VC = 3$. Three (non-collinear) points can be shattered, but four cannot: the labeling
+ -
- +
(a + - + - matrix) is not achievable, since a line splits the plane into two contiguous halfplanes.

Takeaway: VC dimensions often align with the number of parameters in the hypothesis space.
Takeaway: A convex polygon has infinite VC dimension, because it can shatter arbitrarily many input points.
Given:
We can determine the right sample size to meet the error rate $\epsilon$ and probability $1-\delta$ for an infinite hypothesis class.
$$ m \geq \frac{1}{\epsilon}\left(8\, VC(H) \log_2 \frac{13}{\epsilon} + 4 \log_2 \frac{2}{\delta}\right) $$
This simplifies for a finite hypothesis class to:
$$ m \geq \frac{1}{\epsilon}\left(\ln{|H|} + \ln{\frac{1}{\delta}}\right) $$
Takeaway: H is PAC-learnable IFF its VC dimension is finite. The VC dimension captures the idea of PAC learnability in a single concept.
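The same kind of arithmetic for the infinite-case bound (a sketch with illustrative values):

```python
import math

def vc_sample_bound(vc, epsilon, delta):
    """m >= (1/eps) * (8 * VC(H) * log2(13/eps) + 4 * log2(2/delta))"""
    return math.ceil((8 * vc * math.log2(13 / epsilon)
                      + 4 * math.log2(2 / delta)) / epsilon)

# Illustrative values: VC = 3 (a line in the plane), 10% error, 95% confidence
print(vc_sample_bound(3, epsilon=0.1, delta=0.05))  # ~1900 samples
```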
From dev.to:
If we denote the VC of a finite hypothesis space by d, there have to be 2^d distinct concepts (as each different labelling can be captured by a different hypothesis in the class) - therefore 2^d is less than or equal to the number of hypotheses |H|.
Rearranging, d <= log2(|H|). So a finite hypothesis class gives us a finite VC dimension, and therefore makes things "PAC Learnable". There is a further proof that this goes the other way - H is PAC Learnable if and only if the VC Dimension is finite. VC Dimension captures the notion of PAC Learnability in a single concept.
From CMU lecture notes:
Through our discussions of VC theory we have seen that we can improve generalization by controlling the complexity of the concept class H from which we are choosing a hypothesis. We saw that the shatter coefficient and VC dimension were useful measures of complexity because we could bound the performance of a learning algorithm in terms of these quantities and the amount of data we have.
Reference:
Motivation:
Key Idea: Find the hypothesis $h$ from hypothesis class $H$ that is the most probable given the data $D$. $$ argmax_{h \in H} P(h | D) $$
Mitchell, p156
Bayes theorem ... provides a way to calculate the posterior probability P(h|D), from the prior probability P(h), together with P(D) and P(D|h)
Given:
Bayes theorem $$ P(h|D) = \frac{P(D|h)P(h)}{P(D)} $$
Takeaway: Bayes rule allows us to switch $P(h|D)$ to $P(D|h)$.
A man goes to see a doctor. She gives him a lab test. The test returns a correct positive 98% of the time, and a correct negative 97% of the time. The test looks for a rare disease that only 0.8% of the population has. The test comes out positive; does the man have the rare disease?
Given:
We can apply the Bayes Theorem to get the probability of having the disease given that the test returns positive: $$ P(disease|pos) = \frac{P(pos|disease)\, P(disease)}{P(pos)} \\ = \frac{P(pos|disease)\, P(disease)}{P(pos|disease)\, P(disease) + P(pos|noDisease)\, P(noDisease)} \\ = \frac{.98 \times 0.008}{(.98 \times 0.008) + (0.03 \times (1 - 0.008))} \\ = 0.2085106383 \\ \approx 21\% $$
ANSWER: Probably not. Even if the man tests positive there is only 21% chance that he has the disease.
Takeaway: The prior probability of the disease $h$ significantly affects the outcome. If prior probability is low, then the test isn't very useful.
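The quiz arithmetic in code (a sketch):

```python
p_disease = 0.008           # prior: 0.8% of the population
p_pos_given_disease = 0.98  # sensitivity (correct positive rate)
p_neg_given_healthy = 0.97  # specificity (correct negative rate)

# Total probability of a positive test, summed over both worlds
p_pos = (p_pos_given_disease * p_disease
         + (1 - p_neg_given_healthy) * (1 - p_disease))

# Bayes rule: P(disease | positive test)
print(p_pos_given_disease * p_disease / p_pos)  # ~0.21, i.e. probably not
```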
Both are derived from the Bayes Theorem: For each hypothesis in $H$, calculate $P(h|D)$
$$ P(h|D) = \frac{P(D|h)P(h)}{P(D)} $$
Takeaways:
Takeaway: Given a bunch of data, the probability of a particular hypothesis being correct is uniform over all of the hypotheses in the version space (ie consistent with given data).
(TODO: Revisit this lecture - I didn't quite understand it)
Q: Compute the probability of seeing this particular data set in a world where the hypothesis ($h(x)=x$) is true.
Takeaways:
Q: Given data <$x_i$, $d_i$> and function $d_i=f(x_i) + \epsilon_i$, which one of these $h$ best fits the data:
ANSWER: x mod 9 (SSE 12), better than linear regression
From Mitchell, pg 173
The Minimum Description Length (MDL) principle recommends choosing the hypothesis that minimizes the sum of (the hypothesis description length and the data description length). MDL ... provides a way of trading off hypothesis complexity for the number of errors committed by the hypothesis.
- Prefers simpler hypotheses that net the lowest error over a more complex hypothesis with the same error. This is basically bias-variance trade off.
Takeaway: Derivations based on Bayesian learning gives us a handle on why we make the decision we make when building ML models.
Takeaway: So far we learned about finding the best hypothesis, but we actually care about finding the most probable classification of the new instance. To do this, we take the weighted vote of the probabilities of each hypothesis class.
Bayes Optimal Classifier:
$$ \mathrm{argmax}_{v_j \in V} \sum_{h_i \in H} P(v_j|h_i)P(h_i|D) $$
The most probable classification of the new instance is ... combining the predictions of all hypotheses, weighted by their posterior probabilities. This method maximizes the probability that the new instance is classified correctly, given the data, hypothesis space, and priors over the hypotheses. No other classification method using the same hypothesis space and same prior knowledge can outperform this method on average.
Bayes Classifier Quiz:
Takeaway: Bayesian Learning + Classification gives us a way of talking about optimality and gold standards.
Reference:
From Lecture:
A representation for representing and manipulating probabilistic quantities over complex spaces.
ChatGPT:
Bayesian networks are probabilistic graphical models that represent the relationships between different random variables and their conditional dependencies. The graph consists of nodes representing the variables and edges representing the dependencies between them. The dependencies are represented probabilistically, using Bayes' theorem, which specifies the probability of a variable given its dependencies.
Mitchell, pg.186
A Bayesian belief network represents the joint probability distribution for a set of variables.
Takeaway:
Definition: From Mitchell, pg.185
Let X, Y, and Z be three discrete-valued random variables. X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given a value for Z:
$$ (\forall{x_i, y_j, z_k})\text{ } P(X=x_i|Y=y_j,Z=z_k) = P(X=x_i|Z=z_k) $$
or simply, $$ P(X|Y,Z)=P(X|Z) $$
Takeaway:
Q: In what order should we select ABCDE?
Topological sorting for a Directed Acyclic Graph (DAG) is a linear ordering of vertices such that for every directed edge (u, v), vertex u comes before v in the ordering.
Takeaway: We can go from the values in the conditional probability tables (in each of the nodes) to computing the probability of any joint combination of variables.
Formula: The joint probability for some assignment to the variables is equal to the product of all the individual values.
Example using the graph in the previous slide: $$ P(A,B,C,D,E) = P(A)P(B)P(C|A,B)P(D|B,C)P(E|C,D) $$
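A small sketch of this product for the $A \ldots E$ network, with made-up conditional probability table values (the real ones are on the lecture slide):

```python
# Hypothetical CPTs, P(var = True | parents); values are made up for illustration
P_C_given_AB = {(True, True): 0.9, (True, False): 0.5, (False, True): 0.4, (False, False): 0.1}
P_D_given_BC = {(True, True): 0.8, (True, False): 0.3, (False, True): 0.6, (False, False): 0.2}
P_E_given_CD = {(True, True): 0.7, (True, False): 0.4, (False, True): 0.5, (False, False): 0.1}

def p(prob_true, value):
    """Probability that a Boolean variable equals `value`, given P(var = True)."""
    return prob_true if value else 1 - prob_true

def joint(a, b, c, d, e):
    # P(A,B,C,D,E) = P(A) P(B) P(C|A,B) P(D|B,C) P(E|C,D)
    return (p(0.3, a) * p(0.6, b)            # P(A), P(B): made-up priors
            * p(P_C_given_AB[(a, b)], c)
            * p(P_D_given_BC[(b, c)], d)
            * p(P_E_given_CD[(c, d)], e))

print(joint(True, True, True, True, True))   # 0.3 * 0.6 * 0.9 * 0.8 * 0.7 = 0.09072
```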
Distributions are useful for:
1) Simulating a complex process - If the distribution represented something complex, we would want to simulate that world and act according to those probabilities.
2) Approximate inference - We want to make an inference of what might be true of the world in different situations.
   - Why approximate? Because finding the exact inference is NP-hard.
3) Visualization - We want to get a sense of the data, especially when the distribution is complex with many variables. In these complex cases, it's hard for humans to understand the relationships between variables. Sampling gives you real examples that give you a sense of how these relationships work.
This is very hard to figure out. Just watch the video and follow along: https://youtu.be/Vt8fRJwvdwA
Ref: Mitchell, pg. 177
The Naive Bayes classifier is based on the simplifying assumption that the attribute values are conditionally independent given the target value. In other words, the assumption is that given the target value of the instance, the probability of observing the conjunction $a_1, a_2, \ldots a_n$ is just the product of the probabilities for the individual attributes.
Naive Bayes Classifier, where $v_{NB}$ represents the predicted target value: $$ v_{NB} = \mathrm{argmax}_{v_j \in V} P(v_j) \prod_i P(a_i | v_j) $$
Example from lecture using spam classification:
Takeaway: Naive bayes classifier allows us to observe attribute values to infer the class.
1) No Free Lunch: It is "naive" because it's not modeling interrelationship between attributes.
- Then why does it work? Naive Bayes will give you incorrect probabilities of attributes if those attributes are related, but for the purpose of classification (ie argmax) it actually works quite well empirically. For example, if two words are related ("free" + "lunch"), those frequencies may be counted more but they also tend to cancel each other out (a bit hand-wavey here).
2) We don't have infinite data: $P(a_i|v)$ is calculated by counting how many times $a_i$ occurs for a target value $v$ in the training data (over the counts of target value). If we have limited data, that attribute may have zero instances, which makes the numerator zero. If we rely on the product of all attributes given target value, that probability becomes zero.
- Workaround: Smoothing, ie initializing the counts with a small value so that no attribute is zero. BUT this does create inductive bias, assuming that all attributes are mildly possible.
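A minimal sketch putting both points together: counting-based estimates of $P(a_i|v)$ with add-one (Laplace) smoothing so no count is zero, scored in log space (illustrative code with hypothetical names):

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples, smoothing=1.0):
    """examples: list of (attributes: tuple, label). Returns a classify(x) function."""
    label_counts = Counter(y for _, y in examples)
    attr_counts = defaultdict(Counter)   # (attribute index, label) -> value counts
    for x, y in examples:
        for i, a in enumerate(x):
            attr_counts[(i, y)][a] += 1
    values = [set(x[i] for x, _ in examples) for i in range(len(examples[0][0]))]

    def classify(x):
        best, best_score = None, -math.inf
        for y, cy in label_counts.items():
            score = math.log(cy / len(examples))              # log P(v)
            for i, a in enumerate(x):
                num = attr_counts[(i, y)][a] + smoothing      # smoothed count: never zero
                den = cy + smoothing * len(values[i])         # matching normalizer
                score += math.log(num / den)                  # + log P(a_i | v)
            if score > best_score:
                best, best_score = y, score
        return best                                           # argmax_v P(v) prod_i P(a_i|v)
    return classify

# Toy spam example: attributes are (contains_free, contains_meeting)
data = [((1, 0), "spam"), ((1, 0), "spam"), ((0, 1), "ham"), ((0, 0), "ham")]
classify = train_naive_bayes(data)
print(classify((1, 0)))  # "spam"
```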
Goal: Binary classification task using text documents (e.g. target class = the document is an academic paper)
Steps:
The final algorithm from Mitchell, pg.183: