Posted 2024-01-17Updated 2024-02-195 minutes read (About 675 words)

Intro to ML

Notes for EECS 445.

Linear Classifier

feature, labels space: $\bar{x}^{(i)} \in \mathbb{R}^d$, $y^{(i)}$
Labels -> Supervised Learning.
Classification Problem: finite, discrete labels.
$Sn = { (\bar{x}^{(i)}, y^{(i)}) }{i=1}^{n}$
Linear Decision Boundary
Overfitting, underfitting
$\bar{x}^{(i)} = [x{1}^{(i)}, x{2}^{(i)}, \cdots, x_{k}^{(i)}]$
Hyperplanes in $\mathbb{R}^d$: $\bar{\theta} \cdot \bar{x} + b = 0$.
$\bar{\theta} \in \mathbb{R}^d$ is called parameter vector, $b \in \mathbb{R}$ is the offset.
Linear Classifier: $h(\bar{x}; \bar{\theta}) = sgn(\bar{\theta} \cdot \bar{x})$
Training Error: $En(\bar{\theta}) = \frac{1}{n} \sum{i = 1}^{n} [[ y^{(i)} \neq h(\bar{x}^{(i)}; \bar{\theta}) ]] = \frac{1}{n} \sum_{i = 1}^{n} [ y^{(i)} h(\bar{x}^{(i)}; \bar{\theta}) \leq 0] \in [0, 1]$
The Preception Algorithm

Perceptron

Let $\bar{x}^{(i)} \in \mathbb{R}^d$, $\bar{x}^{(i)’} = [1, \bar{x}^{(i)}]^T$, $\bar{\theta}^{(i)’} \in \mathbb{R}^{d+1}$
$Rn(\bar{\theta}) = \frac{1}{n} \sum{i = 1}^{n} LOSS[ y^{(i)}; h(\bar{x}^{(i)}; \bar{\theta})]$.
Hinge Loss.
Gradient Descent (GD), Step size.

Gradient Descend

GD :$\bar{\theta}^{(k+1)} = \bar{\theta}^{(k)} - \eta \nabla{\bar{\theta}} R_n(\bar{\theta}) |{\bar{\theta} = \bar{\theta}^{(k)}}$.
Because for Hinge Loss Function, $Rn(\bar{\theta}) = \sum{i = 1}^{n} \max{ 1 - y^{(i)} (\bar{\theta} \cdot \bar{x}^{(i)}), 0 }$.

Stochastic Gradient Descent (SGD)

$\bar{\theta}^{(k+1)} = \bar{\theta}^{(k)} - \eta \nabla{\bar{\theta}} Loss_h(y^{(i)} (\bar{\theta} \cdot \bar{x}^{(i)})) |{\bar{\theta} = \bar{\theta}^{(k)}}$.

Support Vector Machines

We want the boundary maximize the minimal distance to any of the training dataset.
QP: Quadratic Program: the target is quadratic, and the constraints are linear.
Hard-Margin SVM: We want to maximize $\frac{y^{(i)} (\bar{\theta} \cdot \bar{x}^{(i)})}{||\bar{\theta}||}$, s.t.. $y^{(i)} (\bar{\theta} \cdot \bar{x}^{(i)}) \geq 1$.
- This is equivalent to minimizing $\frac{||\theta||^2}{2}$
Soft-Margin SVM: allow some “violates”.
- Minimizing $\frac{||\theta||^2}{2} + C \sum \epsilon_i$ s.t.. $y^{(i)} (\bar{\theta} \cdot \bar{x}^{(i)}) \geq 1 - \epsilon_i$

Feature Map

Linear classifiers in higher dimensional spaces
- For example, a circle can be classified as $sign(\phi(\bar{x})\cdot \bar{\theta})$, where $\phi(\bar{x}) = [x_1^2, x_2^2, x_1, x_2, 1]^T$.
- The circle centering at $(2, 2)$ with radius 1 has $\theta = [1, 1, -4, -4, 7]^T$.

Duality gap

Primal formulation: $\min {\bar{w}} \max {\bar{\alpha}, \alpha_i \geq 0} L(\bar{w}, \bar{\alpha})$.
Dual formulation: $\max {\bar{\alpha}, \alpha_i \geq 0} \min {\bar{w}} L(\bar{w}, \bar{\alpha})$.
The dual form: $\max {\bar{\alpha}, \alpha_i \geq 0} \min {\bar{\theta}} \frac{||\theta||^2}{2} + \sum \alpha_i (1 - y^{(i)} (\bar{\theta} \cdot \bar{x}^{(i)})) $
- Get the min part by taking gradient of $\theta$: $\bar{\theta}^* = \sum \alpha_i y^{(i)} \bar{x}^{(i)}$
- Plug in the formula: $\max _{\bar{\alpha}, \alpha_i \geq 0} \sum \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y^{(i)} y^{(j)} \bar{x}^{(i)} \bar{x}^{(j)} $
- For $\alpha_i > 0$, they are support vectors.

SVM and kernal Trick

Valid kernal: $K(u, v) = \phi(u) \phi(v)$
Mercer’s

Regression

Supervised Learning
$f(\overline x; \overline \theta; b) = \overline \theta \cdot \overline x + b$
- Care more than the sign

Regularization

Bias and Variance tradeoff

Low model complexity: high bias
- constant function
Too high complexity: high variance
- RBF kernel
Ridge regression
- When $\lambda = 0$, linear regression with squared loss
- When $\lambda = \infty$, constant 0.

Feature Selection

Filter
Wrapper
Embedded
- L1-norm shrinks features

Decision tree

non linear and interpretble
Shannon’s entropy
Boosting (Adaboost)
- $h_M(x) = \sum \alpha_m h(x, theta_m)$
- $\alpha_m$ is the weight of the classifier.
- misclassified rate is called $\epsilon_m$
- reweight the data after each iteration.
- Algorithm:
  - split candidate: $\theta = [d (feature), split value, sign (+1 or -1)]$.
  - given $\theta$, we can calculate the error rate $\epsilon_m$ and the weight $\alpha_m$.
  - Then update the weight of the data and normalize them.
- Extremely case:
  - $epsilon_m = \frac{1}{2}$, $\alpha_m = 0$: ignore this classifier.
  - $epsilon_m = 0$, $\alpha_m = \infty$: consider this classifier as the only one.

Intro to ML

https://keke1022.fun/2024/01/17/eecs445-notes/

Author

Beijie Liu

Posted on

2024-01-17

Updated on

2024-02-19

Licensed under

#Machine Learning, Notes

Intro to ML

Linear Classifier

Perceptron

Gradient Descend

Stochastic Gradient Descent (SGD)

Support Vector Machines

Feature Map

Duality gap

SVM and kernal Trick

Regression

Regularization

Bias and Variance tradeoff

Feature Selection

Decision tree

Author

Posted on

Updated on

Licensed under

Categories

Recents

Archives

Tags