Intro to ML
Notes for EECS 445.
Linear Classifier
- feature, labels space: $\bar{x}^{(i)} \in \mathbb{R}^d$, $y^{(i)}$
- Labels -> Supervised Learning.
- Classification Problem: finite, discrete labels.
- $Sn = { (\bar{x}^{(i)}, y^{(i)}) }{i=1}^{n}$
- Linear Decision Boundary
- Overfitting, underfitting
- $\bar{x}^{(i)} = [x{1}^{(i)}, x{2}^{(i)}, \cdots, x_{k}^{(i)}]$
- Hyperplanes in $\mathbb{R}^d$: $\bar{\theta} \cdot \bar{x} + b = 0$.
- $\bar{\theta} \in \mathbb{R}^d$ is called parameter vector, $b \in \mathbb{R}$ is the offset.
- Linear Classifier: $h(\bar{x}; \bar{\theta}) = sgn(\bar{\theta} \cdot \bar{x})$
- Training Error: $En(\bar{\theta}) = \frac{1}{n} \sum{i = 1}^{n} [[ y^{(i)} \neq h(\bar{x}^{(i)}; \bar{\theta}) ]] = \frac{1}{n} \sum_{i = 1}^{n} [ y^{(i)} h(\bar{x}^{(i)}; \bar{\theta}) \leq 0] \in [0, 1]$
- The Preception Algorithm
Perceptron
- Let $\bar{x}^{(i)} \in \mathbb{R}^d$, $\bar{x}^{(i)’} = [1, \bar{x}^{(i)}]^T$, $\bar{\theta}^{(i)’} \in \mathbb{R}^{d+1}$
- $Rn(\bar{\theta}) = \frac{1}{n} \sum{i = 1}^{n} LOSS[ y^{(i)}; h(\bar{x}^{(i)}; \bar{\theta})]$.
- Hinge Loss.
- Gradient Descent (GD), Step size.
Gradient Descend
- GD :$\bar{\theta}^{(k+1)} = \bar{\theta}^{(k)} - \eta \nabla{\bar{\theta}} R_n(\bar{\theta}) |{\bar{\theta} = \bar{\theta}^{(k)}}$.
- Because for Hinge Loss Function, $Rn(\bar{\theta}) = \sum{i = 1}^{n} \max{ 1 - y^{(i)} (\bar{\theta} \cdot \bar{x}^{(i)}), 0 }$.
Stochastic Gradient Descent (SGD)
- $\bar{\theta}^{(k+1)} = \bar{\theta}^{(k)} - \eta \nabla{\bar{\theta}} Loss_h(y^{(i)} (\bar{\theta} \cdot \bar{x}^{(i)})) |{\bar{\theta} = \bar{\theta}^{(k)}}$.
Support Vector Machines
- We want the boundary maximize the minimal distance to any of the training dataset.
- QP: Quadratic Program: the target is quadratic, and the constraints are linear.
- Hard-Margin SVM: We want to maximize $\frac{y^{(i)} (\bar{\theta} \cdot \bar{x}^{(i)})}{||\bar{\theta}||}$, s.t.. $y^{(i)} (\bar{\theta} \cdot \bar{x}^{(i)}) \geq 1$.
- This is equivalent to minimizing $\frac{||\theta||^2}{2}$
- Soft-Margin SVM: allow some “violates”.
- Minimizing $\frac{||\theta||^2}{2} + C \sum \epsilon_i$ s.t.. $y^{(i)} (\bar{\theta} \cdot \bar{x}^{(i)}) \geq 1 - \epsilon_i$
Feature Map
- Linear classifiers in higher dimensional spaces
- For example, a circle can be classified as $sign(\phi(\bar{x})\cdot \bar{\theta})$, where $\phi(\bar{x}) = [x_1^2, x_2^2, x_1, x_2, 1]^T$.
- The circle centering at $(2, 2)$ with radius 1 has $\theta = [1, 1, -4, -4, 7]^T$.
Duality gap
- Primal formulation: $\min {\bar{w}} \max {\bar{\alpha}, \alpha_i \geq 0} L(\bar{w}, \bar{\alpha})$.
- Dual formulation: $\max {\bar{\alpha}, \alpha_i \geq 0} \min {\bar{w}} L(\bar{w}, \bar{\alpha})$.
- The dual form: $\max {\bar{\alpha}, \alpha_i \geq 0} \min {\bar{\theta}} \frac{||\theta||^2}{2} + \sum \alpha_i (1 - y^{(i)} (\bar{\theta} \cdot \bar{x}^{(i)})) $
- Get the min part by taking gradient of $\theta$: $\bar{\theta}^* = \sum \alpha_i y^{(i)} \bar{x}^{(i)}$
- Plug in the formula: $\max _{\bar{\alpha}, \alpha_i \geq 0} \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y^{(i)} y^{(j)} \bar{x}^{(i)} \bar{x}^{(j)} $
- For $\alpha_i > 0$, they are support vectors.
SVM and kernal Trick
- Valid kernal: $K(u, v) = \phi(u) \phi(v)$
- Mercer’s
Regression
- Supervised Learning
- $f(\overline x; \overline \theta; b) = \overline \theta \cdot \overline x + b$
- Care more than the sign
Regularization
Bias and Variance tradeoff
- Low model complexity: high bias
- constant function
- Too high complexity: high variance
- RBF kernel
- Ridge regression
- When $\lambda = 0$, linear regression with squared loss
- When $\lambda = \infty$, constant 0.
Feature Selection
- Filter
- Wrapper
- Embedded
- L1-norm shrinks features
Decision tree
- non linear and interpretble
- Shannon’s entropy
- Boosting (Adaboost)
- $h_M(x) = \sum \alpha_m h(x, theta_m)$
- $\alpha_m$ is the weight of the classifier.
- misclassified rate is called $\epsilon_m$
- reweight the data after each iteration.
- Algorithm:
- split candidate: $\theta = [d (feature), split value, sign (+1 or -1)]$.
- given $\theta$, we can calculate the error rate $\epsilon_m$ and the weight $\alpha_m$.
- Then update the weight of the data and normalize them.
- Extremely case:
- $epsilon_m = \frac{1}{2}$, $\alpha_m = 0$: ignore this classifier.
- $epsilon_m = 0$, $\alpha_m = \infty$: consider this classifier as the only one.