SVM
- SVM
- Binary Classification Problem
- Strengths of SVM
- Structural Risk Minimization (SRM)
- Summary of Notations
- Support Vector Classification (Primal)
- Support Vector Classification (Dual)
- Dual Representation of SVM
- Soft Margin SVM (Nonseparable Case)
- 2-Norm Soft Margin
- 1-Norm Soft Margin
- 1-Norm Soft Margin (Dual)
- Slack Variables for 1-Norm
SVM
- Basically, these are my study notes from the NTCU open course on Machine Learning; I keep the key points here for my own reference.
Binary Classification Problem
- Find a decision function to discriminate between two categories.
- Supervised learning in Machine Learning
- Decision Tree, Neural Network, K-NN, Support Vector Machines, etc.
- Discriminant Analysis in Statistics
- Fisher Linear Discriminant
- Successful applications
- Marketing, Bioinformatics.
Strengths of SVM
- SVM is a powerful tool for data mining
- The SVM classifier is an optimally defined surface
- SVMs have a good geometric interpretation
- SVMs can be trained very efficiently
- Can be extended from linear to nonlinear case
- Typically nonlinear in the input space
- Linear in a high dimensional “feature space”
- Implicitly defined by a kernel function
- Has a sound theoretical foundation
- Based on Statistical Learning Theory
Structural Risk Minimization (SRM)
- The expected risk is less than or equal to the empirical risk (training error) plus the VC confidence bound.
- $\Vert w\Vert_2 \propto \text{VC bound}$
- $\min\Vert w\Vert_2 \iff \min \text{VC bound} \iff \max \text{Margin}=\max \frac{2}{\Vert w\Vert_2} $
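For a concrete (made-up) example of the last relation: if $w=(3,4)^T$, then $\Vert w\Vert_2=5$ and
\[\text{Margin}=\frac{2}{\Vert w\Vert_2}=\frac{2}{5}=0.4,\]
so a smaller $\Vert w\Vert_2$ gives a wider margin.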
Summary of Notations
Let $S=\{(x_1,y_1),\dots,(x_l,y_l)\}$ be a training dataset, represented by the matrices
\[A=\left[ \begin{matrix} x_1^T \\ \vdots \\ x_l^T \end{matrix} \right], \quad D=\left[ \begin{matrix} y_1 & \dots & 0 \\ \vdots & \ddots &\vdots\\ 0 & \dots & y_l \end{matrix} \right]\]
The separation constraints can then be written compactly as
\[D(Aw+\mathbf 1 b)\geq 1\]
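As a minimal sketch of these notations (assuming numpy; the toy data and the candidate $(w,b)$ are my own, not from the course), we can build $A$ and $D$ and check the constraint $D(Aw+\mathbf 1 b)\geq 1$ numerically:

```python
import numpy as np

# Toy training set: rows of A are the points x_i^T, D = diag(y_1, ..., y_l).
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

A = X                    # l x n matrix with x_i^T as rows
D = np.diag(y)           # l x l diagonal label matrix

# Check D(Aw + 1b) >= 1 elementwise for a hypothetical separating (w, b).
w = np.array([1.0, 1.0])
b = -2.0
print(D @ (A @ w + b * np.ones(len(y))) >= 1)   # [True True True True] here
```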
Support Vector Classification (Primal)
The hyperplane $(w,b)$ is determined by solving the minimization problem:
\[\min\limits_{(w,b) \in \mathbb R^{n+1}} \frac{1}{2}\Vert w\Vert_2^2 ,\quad D(Aw+\mathbf 1 b)\geq 1\]
In other words, the objective can be written in standard quadratic form,
\[\frac{1}{2}\Vert w\Vert_2^2 =\frac{1}{2}[w^T, b] \left[ \begin{array}{c|c} \mathbf I & \mathbf 0\\ \hline \mathbf 0 & 0 \end{array}\right] \left[ \begin{matrix} w\\ b \end{matrix} \right]= \frac{1}{2}x^TQx\]
and the constraints as
\[D\left[ \begin{array}{c|c} A & \mathbf 1 \end{array}\right] \left[ \begin{matrix} w\\ b \end{matrix} \right] \geq 1 \quad (g(x) \leq 0)\]
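One way to solve this primal QP numerically is to hand it to a generic convex solver. A minimal sketch, assuming cvxpy is available and reusing the toy `X`, `y` above (not the course's code):

```python
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = X.shape[1]

w = cp.Variable(n)
b = cp.Variable()

# min (1/2)||w||^2  subject to  y_i (x_i^T w + b) >= 1 for all i
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)),
                  [cp.multiply(y, X @ w + b) >= 1])
prob.solve()
print("w* =", w.value, " b* =", b.value)
```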
Support Vector Classification (Dual)
Introduce Lagrange multipliers $\alpha\geq 0$ for the constraints:
\begin{equation} \mathcal{L}(w,b,\alpha)=\frac{1}{2}w^Tw+\alpha^T(\mathbf 1 -D(Aw+\mathbf 1 b)) \label{original} \end{equation}
\[\begin{align} \frac{\partial \mathcal{L}(w,b,\alpha)}{\partial w}&=w-A^TD\alpha=0 \implies w=A^TD\alpha \label{org_tran1}\\ \frac{\partial \mathcal{L}(w,b,\alpha)}{\partial b}&=-\mathbf 1^TD\alpha=0 \implies \sum\limits_{i=1}^l \alpha_i y_i =0 \label{org_tran2} \end{align}\]
Substituting $\eqref{org_tran1}$ and $\eqref{org_tran2}$ into $\eqref{original}$, we have
\[\begin{aligned} \mathcal{L}(w,b,\alpha) &=\frac{1}{2}(A^TD\alpha)^T A^TD\alpha+\alpha^T \mathbf 1 -\alpha^TDAA^TD\alpha \\ &= -\frac{1}{2}\alpha^TDAA^TD\alpha+\alpha^T \mathbf 1 \end{aligned}\]
Hence, the dual of the original mathematical program is:
\[\max\limits_{\alpha\in\mathbb R^l} -\frac{1}{2}\alpha^TDAA^TD\alpha+\mathbf 1^T\alpha\]
subject to
\[\mathbf 1^TD\alpha=0,\quad \alpha \geq0\]
Once we solve for $\alpha$, we get $w=A^TD\alpha$. Also, recalling the KKT optimality conditions, we have
\[\alpha \bot D(Aw+\mathbf 1 b)-\mathbf 1 \iff \alpha_i (y_i(A_i w+b)-1)=0 \quad \forall i\]
Applying the above equation to any point with $\alpha_i>0$, we finally get $b$.
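The same toy problem can be solved through the dual. A sketch (again assuming cvxpy; the support-vector tolerance `1e-6` is my own choice): solve for $\alpha$, recover $w=A^TD\alpha$, and use the points with $\alpha_i>0$ to recover $b$ from the complementarity condition.

```python
import numpy as np
import cvxpy as cp

X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l = len(y)

alpha = cp.Variable(l)
# alpha^T D A A^T D alpha = ||A^T D alpha||^2, written as sum_squares to keep it convex
obj = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, alpha)))
cp.Problem(obj, [y @ alpha == 0, alpha >= 0]).solve()

a = alpha.value
w = X.T @ (a * y)                      # w = A^T D alpha
sv = a > 1e-6                          # support vectors (alpha_i > 0)
b = np.mean(y[sv] - X[sv] @ w)         # from alpha_i (y_i (x_i^T w + b) - 1) = 0
print("alpha* =", a, " w* =", w, " b* =", b)
```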
Dual Representation of SVM
The key to kernel methods:
\[w=A^TD\alpha^*=\sum\limits_{i=1}^l y_i\alpha_i^*A_i^T\]
The hypothesis is determined by $(\alpha^{*},b^{*})$:
\[\begin{aligned} h(x) &=\text{sign}(\langle\, x,A^TD\alpha^*\rangle+b^*)\\ &=\text{sign}(\sum\limits_{i=1}^l y_i\alpha_i^* \langle\, x_i,x\rangle+b^*) \\ &=\text{sign}(\sum\limits_{\alpha_i^*>0}y_i\alpha_i^* \langle\, x_i,x\rangle+b^*) \\ \end{aligned}\]
The last equation shows that the points with $\alpha_i^* =0$ do not affect the model.
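A small sketch of this hypothesis function (continuing the toy dual solution above; `decision` and the tolerance are my own names/choices), which only touches the points with $\alpha_i^*>0$:

```python
import numpy as np

def decision(x_new, X, y, alpha, b, tol=1e-6):
    """h(x) = sign( sum_{alpha_i > 0} y_i alpha_i <x_i, x> + b )."""
    sv = alpha > tol                             # only support vectors contribute
    return np.sign(np.sum(y[sv] * alpha[sv] * (X[sv] @ x_new)) + b)

# e.g. decision(np.array([1.5, 2.0]), X, y, a, b) with the values computed above
```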
Soft Margin SVM (Nonseparable Case)
- If the data are not linearly separable:
- The primal problem is infeasible
- The dual problem is unbounded above
- Introduce a slack variable $\xi_i\geq 0$ for each training point:
- The inequality system is always feasible
2-Norm Soft Margin
\[\min\limits_{w,b,\xi \in \mathbb R^{n+1+l}} \frac{1}{2}\Vert w\Vert_2^2+\frac{C}{2}\Vert \xi\Vert_2^2\]
subject to
\[D(Aw+\mathbf 1 b)+\xi\geq 1\]
1-Norm Soft Margin
\[\min\limits_{w,b,\xi \in \mathbb R^{n+1+l}} \frac{1}{2}\Vert w\Vert_2^2+C\mathbf 1^T\xi\]
subject to
\[D(Aw+\mathbf 1 b)+\xi\geq 1,\quad \xi\geq 0\]
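A minimal sketch of this 1-norm soft margin primal (assuming cvxpy; the toy data now include an overlapping point so that the hard-margin constraints are infeasible, and the value of $C$ is an arbitrary choice of mine):

```python
import numpy as np
import cvxpy as cp

# One negative point placed on the "wrong" side to make the data nonseparable.
X = np.array([[1.0, 2.0], [2.0, 3.0], [-1.0, -1.0], [1.5, 2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
l, n = X.shape
C = 1.0                                          # accuracy/regularization trade-off

w, b, xi = cp.Variable(n), cp.Variable(), cp.Variable(l)
obj = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
cons = [cp.multiply(y, X @ w + b) >= 1 - xi, xi >= 0]
cp.Problem(obj, cons).solve()
print("w* =", w.value, " b* =", b.value, " xi* =", xi.value)
```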
1-Norm Soft Margin (Dual)
\[\mathcal{L}(w,b,\xi,\alpha,\gamma)=\frac{1}{2}w^Tw+C\mathbf 1^T \xi +\alpha^T(\mathbf 1-D(Aw+\mathbf 1 b)-\xi) -\gamma^T\xi\]
where $\alpha\geq 0,\ \gamma\geq 0$.
\[\begin{align} \label{part1} \frac{\partial \mathcal{L}(w,b,\xi,\alpha,\gamma)}{\partial w}&=w-A^TD\alpha=0 \implies w=A^TD\alpha \\ \label{part2} \frac{\partial \mathcal{L}(w,b,\xi,\alpha,\gamma)}{\partial b}&=-\mathbf 1^TD\alpha=0 \implies \sum\limits_{i=1}^l \alpha_i y_i =0 \\ \label{part3} \frac{\partial \mathcal{L}(w,b,\xi,\alpha,\gamma)}{\partial \xi}&=C\mathbf 1-\alpha-\gamma =0 \end{align}\]
Substituting $\eqref{part1}$, $\eqref{part2}$ and $\eqref{part3}$ into $\mathcal{L}(w,b,\xi,\alpha,\gamma)$, we have
\[\begin{aligned} \mathcal{L}(w,b,\xi,\alpha,\gamma) &=\frac{1}{2}(A^TD\alpha)^T A^TD\alpha+\alpha^T \mathbf 1 -\alpha^TDAA^TD\alpha \\ &= -\frac{1}{2}\alpha^TDAA^TD\alpha+\alpha^T \mathbf 1 \end{aligned}\]
subject to
\[\mathbf 1^TD\alpha=0, \quad \alpha+\gamma=C\mathbf 1,\quad \alpha, \gamma\geq 0\]
Hence, written formally, the dual is:
\[\max\limits_{\alpha\in \mathbb R^l}\mathbf 1^T \alpha-\frac{1}{2}\alpha^TDAA^TD\alpha\]
subject to
\[\mathbf 1^TD\alpha=0, \quad \alpha+\gamma=C\mathbf 1,\quad \alpha, \gamma\geq 0 .\]
The corresponding KKT complementarity conditions are
\[0 \leq \alpha \bot D(Aw+\mathbf 1 b)+\xi-\mathbf 1\geq0\] \[0 \leq \xi \bot \gamma \geq 0 \iff 0 \leq \xi \bot (C\mathbf 1-\alpha) \geq 0\]
Slack Variables for 1-Norm
- Non-zero slack can only occur when $\alpha_i^*=C$.
- The trade-off between accuracy and regularization is directly controlled by $C$.
- The points for which $0 < \alpha_i^* < C$ lie on the boundary planes.
- These points will help us to find $b^*$ (see the sketch below).
- $\xi_i>0$ (which forces $\alpha_i^*=C$) does not necessarily mean that $x_i$ is misclassified; that happens only when $\xi_i>1$.
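A sketch of how these observations are used in practice (my own helper names; `alpha` is assumed to come from a 1-norm soft margin dual solver with $0\leq\alpha\leq C\mathbf 1$): $b^*$ is averaged over the points with $0<\alpha_i^*<C$, and the slacks show which points are merely inside the margin ($0<\xi_i\leq 1$) versus misclassified ($\xi_i>1$).

```python
import numpy as np

def recover_w_b(X, y, alpha, C, tol=1e-6):
    """w* = A^T D alpha*; b* from the free support vectors (0 < alpha_i < C)."""
    w = X.T @ (alpha * y)
    free = (alpha > tol) & (alpha < C - tol)     # these points lie on the boundary planes
    b = np.mean(y[free] - X[free] @ w)           # y_i (x_i^T w + b) = 1 on free SVs
    return w, b

def slacks(X, y, w, b):
    """xi_i = max(0, 1 - y_i (x_i^T w + b)); xi_i > 1 means x_i is misclassified."""
    return np.maximum(0.0, 1.0 - y * (X @ w + b))
```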