Binary Classification and Support Vector Machines
In a classification problem, the aim is to categorize the inputs into one of a finite set of classes. Formulated as a supervised learning task, the dataset again consists of input-output pairs, i.e. \(\lbrace(\mathbf{x}_{1}, y_{1}), \dots, (\mathbf{x}_{m}, y_{m})\rbrace\) with \(\mathbf{x}\in \mathbb{R}^n\). However, unlike regression problems, the output \(y\) is a discrete integer representing one of the classes. In a binary classification problem, in other words a problem with only two classes, it is natural to choose \(y\in\{-1, 1\}\).
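To make this setup concrete, the following minimal sketch (not part of the original notes; it uses numpy and arbitrary, hypothetical class means) generates a toy dataset of input-output pairs with labels \(y_i\in\{+1,-1\}\). It is reused in the sketches further below.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical toy data: two Gaussian "blobs" in n = 2 dimensions, one per class.
m_per_class = 50
x_pos = rng.normal(loc=[+2.0, +2.0], scale=1.0, size=(m_per_class, 2))
x_neg = rng.normal(loc=[-2.0, -2.0], scale=1.0, size=(m_per_class, 2))

X = np.vstack([x_pos, x_neg])                                   # inputs x_i in R^n
y = np.hstack([np.ones(m_per_class), -np.ones(m_per_class)])    # labels y_i in {+1, -1}
```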
We introduced linear regression in the previous section as a method for supervised learning when the output is a real number. Here, we will see how we can use the same model for a binary classification task. Looking at the linear model of the regression problem, we first note that geometrically
defines a hyperplane perpendicular to the vector with elements \(\beta_{j\geq1}\). If we fix the norm by requiring \(\sum_{j=1}^n \beta_j^2=1\), then \(f(\mathbf{x}|\boldsymbol \beta)\) measures the (signed) distance of \(\mathbf{x}\) to the hyperplane, with the sign depending on which side of the hyperplane the point \(\mathbf{x}\) lies. To use this model as a classifier, we thus define
which yields values in \(\{+1, -1\}\). If the two classes are (completely) linearly separable, then the goal of the classification is to find a hyperplane that separates the two classes in feature space. Specifically, we look for parameters \(\boldsymbol \beta\) such that
where \(M\) is called the margin. The optimal solution \(\hat{\boldsymbol \beta}\) then maximizes this margin. Note that instead of fixing the norm of \(\beta_{j\geq1}\) and maximizing \(M\), it is customary to minimize \(\sum_{j=1}^n \beta_j^2\), setting \(M=1\) in Eq. (25).
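To make the geometry explicit, here is a minimal sketch of the signed distance \(f(\mathbf{x}|\boldsymbol\beta)\) and the resulting sign classifier. It reuses the toy data X, y from above together with hand-picked, hypothetical (not optimized) parameters: with the weight vector normalized to unit length, \(f\) directly measures the distance to the hyperplane, and for a separating hyperplane all points satisfy \(y_i f(\mathbf{x}_i|\boldsymbol\beta)\geq M\).

```python
import numpy as np

def f(X, beta0, beta):
    """Signed distance of each row of X to the hyperplane beta0 + x . beta = 0,
    with beta rescaled so that sum_j beta_j^2 = 1."""
    beta = beta / np.linalg.norm(beta)
    return beta0 + X @ beta

def classify(X, beta0, beta):
    """Binary classifier y = sign(f(x|beta)), taking values in {+1, -1}."""
    return np.where(f(X, beta0, beta) >= 0, 1.0, -1.0)

# Hypothetical, hand-picked parameters for the toy data above (not an optimal solution).
beta0, beta = 0.0, np.array([1.0, 1.0])
print(classify(X, beta0, beta))

# Smallest signed distance y_i * f(x_i | beta): positive if this hyperplane
# separates the two classes, in which case it is the margin M achieved.
print(np.min(y * f(X, beta0, beta)))
```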
In most cases, the two classes are not completely separable. In order to still find a good classifier, we allow some of the points \(\mathbf{x}_i\) to lie within the margin or even on the wrong side of the hyperplane. For this purpose, we rewrite the optimization constraint Eq. (25) to
We can now define the optimization problem as finding
subject to the constraint Eq. (26). Note that the second term, with hyperparameter \(C\), acts like a regularizer on the slack variables \(\xi_i\), in particular like a lasso regularizer. As we have seen in the example of the previous section, such a regularizer tries to set as many \(\xi_i\) to zero as possible.
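For intuition, note that at the optimum the slack variables satisfy \(\xi_i=\max(0, 1-y_i f(\mathbf{x}_i|\boldsymbol\beta))\), so the constrained problem can equivalently be written as an unconstrained hinge-loss minimization. The following rough subgradient-descent sketch is not from the notes; it reuses the toy data X, y from above and adopts the common convention of a factor \(1/2\) in front of \(\sum_j\beta_j^2\), which only rescales \(C\).

```python
import numpy as np

def objective(beta0, beta, X, y, C):
    # 1/2 * sum_j beta_j^2 + C * sum_i xi_i  with  xi_i = max(0, 1 - y_i f(x_i))
    slack = np.maximum(0.0, 1.0 - y * (beta0 + X @ beta))
    return 0.5 * beta @ beta + C * slack.sum()

def fit_soft_margin(X, y, C=1.0, lr=1e-3, steps=5000):
    """Subgradient descent on the unconstrained soft-margin (hinge-loss) objective."""
    n = X.shape[1]
    beta0, beta = 0.0, np.zeros(n)
    for _ in range(steps):
        margins = y * (beta0 + X @ beta)
        active = margins < 1.0               # points inside the margin or misclassified
        # Subgradient of the objective with respect to beta and beta0
        grad_beta = beta - C * (y[active, None] * X[active]).sum(axis=0)
        grad_beta0 = -C * y[active].sum()
        beta -= lr * grad_beta
        beta0 -= lr * grad_beta0
    return beta0, beta

beta0, beta = fit_soft_margin(X, y, C=1.0)
print(objective(beta0, beta, X, y, C=1.0))
```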
We can solve this constrained minimization problem by introducing Lagrange multipliers \(\alpha_i\) and \(\mu_i\) and solving
which yields the conditions
It is numerically simpler to solve the dual problem
subject to \(\sum_i \alpha_i y_i =0\) and \(0\leq \alpha_i \leq C\) [1]. Using Eq. (29), we can reexpress \(\beta_j\) to find
where the sum only runs over the points \(\mathbf{x}_i\) that lie within the margin, as all other points have \(\alpha_i\equiv0\) [see Eq. (32)]. These points are thus called the support vectors and are marked in Fig. 11 with a circle around them. Finally, note that we can use Eq. (32) again to find \(\beta_0\).
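In practice one rarely solves the dual by hand; a library such as scikit-learn (not used in the notes themselves) solves it internally. The following sketch, again on the toy data X, y from above, exposes the support vectors, the products \(\alpha_i y_i\), and the recovered \(\beta_{j\geq1}\) and \(\beta_0\).

```python
from sklearn.svm import SVC

# Linear soft-margin classifier; scikit-learn solves the dual problem internally.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)        # the points x_i with alpha_i > 0, i.e. the support vectors
print(clf.dual_coef_)              # the products alpha_i * y_i for the support vectors
print(clf.coef_, clf.intercept_)   # beta_{j>=1} and beta_0 recovered from the dual solution
print(clf.predict(X[:5]))          # sign of the resulting decision function
```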
The Kernel trick and support vector machines
We have seen in our discussion of PCA that most data is not linearly separable. However, we have also seen how the kernel trick can help us in such situations. In particular, we have seen how a non-linear function \(\boldsymbol \Phi(\mathbf{x})\), which we first apply to the data \(\mathbf{x}\), can help us separate data that is not linearly separable. Importantly, we never actually need the non-linear function \(\boldsymbol \Phi(\mathbf{x})\) explicitly, but only the kernel. Looking at the dual optimization problem Eq. (30) and the resulting classifier Eq. (31), we see that, as in the case of Kernel PCA, only the kernel \(K(\mathbf{x}, \mathbf{y}) = \boldsymbol \Phi(\mathbf{x})^T\boldsymbol \Phi(\mathbf{y})\) enters, simplifying the problem. This non-linear extension of the binary classifier is called a support vector machine.
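As an illustration (again a sketch, not from the notes), scikit-learn's SVC accepts either a named kernel or a callable \(K\) that returns the Gram matrix, so \(\boldsymbol\Phi\) never has to be constructed explicitly. The data below is a hypothetical example of two classes that are not linearly separable.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data that is not linearly separable:
# class +1 inside a circle of radius 2, class -1 outside it.
rng = np.random.default_rng(seed=1)
X_ring = 2.0 * rng.normal(size=(200, 2))
y_ring = np.where(np.linalg.norm(X_ring, axis=1) < 2.0, 1.0, -1.0)

# Built-in Gaussian (RBF) kernel K(x, y) = exp(-gamma * |x - y|^2)
clf_rbf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X_ring, y_ring)

# Equivalently, any positive-semidefinite kernel can be passed as a callable that
# returns the Gram matrix K(X1, X2); the feature map Phi never appears explicitly.
def gaussian_kernel(X1, X2, gamma=1.0):
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

clf_custom = SVC(kernel=gaussian_kernel, C=1.0).fit(X_ring, y_ring)

print(clf_rbf.score(X_ring, y_ring), clf_custom.score(X_ring, y_ring))   # training accuracies
```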
[1] Note that the constraints for the minimization are not equalities, but actually inequalities. A solution thus has to fulfil the additional Karush-Kuhn-Tucker constraints
(32)\[\begin{split}\begin{aligned} \alpha_i [y_i \tilde{\mathbf{x}}_i^T\boldsymbol \beta - (1-\xi_i)]&= 0,\\ \mu_i\xi_i &= 0,\\ y_i \tilde{\mathbf{x}}_i^T\boldsymbol \beta - (1-\xi_i)&\geq 0. \end{aligned}\end{split}\]