Kernel PCA

PCA performs a linear transformation on the data. However, there are cases where such a transformation is unable to produce any meaningful result. Consider, for instance, the fictitious dataset with \(2\) classes and \(2\) data features shown on the left of Fig. 5. It is clear by eye that the data can be separated well, for instance by the distance of each data point from the origin, but it is equally clear that this distance is not a linear function of the original features. In such cases, it can be helpful to consider a non-linear extension of PCA, known as kernel PCA.
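To make this concrete, a dataset of this kind can be generated directly with NumPy; the radii, noise level, and sample sizes below are arbitrary choices and not necessarily those used to produce Fig. 5.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200  # points per class

# Two classes of points on concentric rings (radii chosen arbitrarily),
# with a little radial noise added.
angles = rng.uniform(0.0, 2.0 * np.pi, size=2 * n)
radii = np.concatenate([np.full(n, 0.3), np.full(n, 1.0)])
radii += 0.05 * rng.standard_normal(2 * n)
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.concatenate([np.ones(n, dtype=int), np.zeros(n, dtype=int)])

# The distance from the origin separates the classes cleanly,
# but no linear function of the two raw features can do so.
print(np.linalg.norm(X[y == 1], axis=1).mean(),   # inner ring
      np.linalg.norm(X[y == 0], axis=1).mean())   # outer ring
```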

The basic idea of this method is to apply a chosen non-linear, vector-valued transformation \(\mathbf{\Phi}(\mathbf{x})\) to the data \(\mathbf{x} \in \mathbb{R}^{n}\), with

(3)\[ \mathbf{\Phi}: \mathbb{R}^{n} \rightarrow \mathbb{R}^{N},\]

which is a map from the original \(n\)-dimensional space (corresponding to the \(n\) original data features) to an \(N\)-dimensional feature space. Kernel PCA then simply amounts to performing standard PCA on the transformed data \(\mathbf{\Phi}(\mathbf{x})\). Here, we will assume that the transformed data is centered, i.e.,

\[\sum_i \mathbf{\Phi}(\mathbf{x}_i) = 0,\]

in order to keep the formulas below simple.
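As a minimal sketch of this idea, suppose we choose the explicit quadratic map \(\mathbf{\Phi}(x_1, x_2) = (x_1^{2}, \sqrt{2}\,x_1 x_2, x_2^{2})\) (an illustrative choice, not one prescribed above, under which the squared distance from the origin becomes a linear function of the new features) and then run standard PCA on the centered, transformed data:

```python
import numpy as np

def phi(X):
    """Explicit quadratic feature map Phi: R^2 -> R^3 (illustrative choice)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def pca_on_transformed_data(X, n_components=2):
    """Standard PCA applied to the transformed data Phi(x)."""
    F = phi(X)
    F = F - F.mean(axis=0)              # center, as assumed in the text
    C = F.T @ F                         # (unnormalized) covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]   # sort by decreasing eigenvalue
    V = eigvecs[:, order[:n_components]]
    return F @ V                        # project onto the leading components
```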

Fig. 5 Kernel PCA versus PCA.

In practice, when \(N\) is large, it is not efficient, or even possible, to perform the transformation \(\mathbf{\Phi}\) explicitly. Instead, we can make use of a method known as the kernel trick. Recall that in standard PCA, the primary aim is to find the eigenvectors and eigenvalues of the covariance matrix \(C\). In the case of kernel PCA, this matrix becomes

\[C = \sum_{i=1}^{m} \mathbf{\Phi}(\mathbf{x}_{i})\mathbf{\Phi}(\mathbf{x}_{i})^T,\]

with the eigenvalue equation

(4)\[\sum_{i=1}^{m} \mathbf{\Phi}(\mathbf{x}_{i})\mathbf{\Phi}(\mathbf{x}_{i})^T \mathbf{v}_{j} = \lambda_{j}\mathbf{v}_{j}.\]

By writing the eigenvectors \(\mathbf{v}_{j}\) as a linear combination of the transformed data points

\[\mathbf{v}_{j} = \sum_{i=1}^{m} a_{ji}\mathbf{\Phi}(\mathbf{x}_{i}),\]

we see that finding the eigenvectors is equivalent to finding the coefficients \(a_{ji}\). (Such an expansion always exists for \(\lambda_{j} \neq 0\), since Eq. (4) itself expresses \(\mathbf{v}_{j}\) as a linear combination of the \(\mathbf{\Phi}(\mathbf{x}_{i})\).) Substituting this form back into Eq. (4), we find

\[\sum_{i=1}^{m} \mathbf{\Phi}(\mathbf{x}_{i})\mathbf{\Phi}(\mathbf{x}_{i})^T \left[ \sum_{l=1}^{m} a_{jl}\mathbf{\Phi}(\mathbf{x}_{l}) \right] = \lambda_{j} \left[ \sum_{l=1}^{m} a_{jl}\mathbf{\Phi}(\mathbf{x}_{l}) \right].\]

Multiplying both sides of this equation from the left by \(\mathbf{\Phi}(\mathbf{x}_{k})^{T}\), we arrive at

(5)\[\begin{split}
\sum_{i=1}^{m} \mathbf{\Phi}(\mathbf{x}_{k})^{T} \mathbf{\Phi}(\mathbf{x}_{i})\mathbf{\Phi}(\mathbf{x}_{i})^T \left[ \sum_{l=1}^{m} a_{jl}\mathbf{\Phi}(\mathbf{x}_{l}) \right] &= \lambda_{j} \mathbf{\Phi}(\mathbf{x}_{k})^{T} \left[ \sum_{l=1}^{m} a_{jl} \mathbf{\Phi}(\mathbf{x}_{l}) \right] \\
\sum_{i=1}^{m} \left[ \mathbf{\Phi}(\mathbf{x}_{k})^{T} \mathbf{\Phi}(\mathbf{x}_{i}) \right] \sum_{l=1}^{m} a_{jl} \left[ \mathbf{\Phi}(\mathbf{x}_{i})^T \mathbf{\Phi}(\mathbf{x}_{l}) \right] &= \lambda_{j} \sum_{l=1}^{m} a_{jl} \left[ \mathbf{\Phi}(\mathbf{x}_{k})^{T} \mathbf{\Phi}(\mathbf{x}_{l}) \right] \\
\sum_{i=1}^{m} K(\mathbf{x}_{k},\mathbf{x}_{i}) \sum_{l=1}^{m} a_{jl} K(\mathbf{x}_{i},\mathbf{x}_{l}) &= \lambda_{j} \sum_{l=1}^{m} a_{jl} K(\mathbf{x}_{k},\mathbf{x}_{l}),
\end{split}\]

where \(K(\mathbf{x},\mathbf{y}) = \mathbf{\Phi}(\mathbf{x})^{T} \mathbf{\Phi}(\mathbf{y})\) is known as the kernel. Thus we see that if we specify the kernel directly, we can avoid explicitly performing the transformation \(\mathbf{\Phi}\). In matrix form, Eq. (5) reads \(K^{2}\mathbf{a}_{j} = \lambda_{j} K \mathbf{a}_{j}\), which simplifies to

\[K \mathbf{a}_{j} = \lambda_{j} \mathbf{a}_{j}.\]

Note that this simplification requires \(\lambda_j \neq 0\), which is the case for the relevant principal components. (If \(\lambda_{j} = 0\), the corresponding eigenvector describes an irrelevant component that would be discarded anyway.) After solving the above equation and obtaining the coefficients \(a_{jl}\), the kernel PCA transformation is then simply given by the overlap with the eigenvectors \(\mathbf{v}_{j}\), i.e.,

\[\mathbf{x} \rightarrow \mathbf{\Phi}(\mathbf{x}) \rightarrow y_{j} = \mathbf{v}_{j}^{T}\mathbf{\Phi}(\mathbf{x}) = \sum_{i=1}^{m} a_{ji} \mathbf{\Phi}(\mathbf{x}_{i})^{T} \mathbf{\Phi}(\mathbf{x}) = \sum_{i=1}^{m} a_{ji} K(\mathbf{x}_{i},\mathbf{x}),\]

where once again the explicit \(\mathbf{\Phi}\) transformation is avoided.
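Putting these steps together, a minimal NumPy sketch of kernel PCA might look as follows. The rescaling of the coefficients \(\mathbf{a}_{j}\) so that the implicit eigenvectors \(\mathbf{v}_{j}\) have unit norm is an implementation detail not spelled out above, and the data are assumed to be centered in feature space.

```python
import numpy as np

def kernel_pca(X, kernel, n_components=2):
    """Kernel PCA via the eigenvalue problem K a_j = lambda_j a_j.

    X       : (m, n) data matrix (assumed centered in feature space)
    kernel  : function K(x, y) returning a scalar
    returns : (m, n_components) projections y_j = sum_i a_ji K(x_i, x)
    """
    m = X.shape[0]
    # Kernel matrix K_kl = K(x_k, x_l)
    K = np.array([[kernel(X[k], X[l]) for l in range(m)] for k in range(m)])

    # Solve K a_j = lambda_j a_j; eigh returns eigenvalues in ascending order,
    # so reverse to put the largest eigenvalues first.
    eigvals, eigvecs = np.linalg.eigh(K)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

    # Keep the leading components (assumed to have lambda_j > 0) and rescale
    # so that v_j = sum_i a_ji Phi(x_i) has unit norm: v_j^T v_j =
    # lambda_j a_j^T a_j = 1 requires dividing a_j by sqrt(lambda_j).
    A = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])

    # Projection of the training points: y_j(x_k) = sum_i a_ji K(x_i, x_k)
    return K @ A
```

For a point \(\mathbf{x}\) outside the training set, the same formula \(y_{j} = \sum_{i} a_{ji} K(\mathbf{x}_{i},\mathbf{x})\) applies, with the kernel evaluated between the training points and \(\mathbf{x}\).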

A common choice for the kernel is the radial basis function (RBF) kernel, defined by

\[K_{\textrm{RBF}}(\mathbf{x},\mathbf{y}) = \exp\left( -\gamma \| \mathbf{x} - \mathbf{y}\|^{2} \right),\]

where \(\gamma\) is a tunable parameter. Using the RBF kernel, we compare the result of kernel PCA with that of standard PCA, as shown on the right of Fig. 5. It is clear that kernel PCA leads to a meaningful separation of the data, while standard PCA fails completely.
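For completeness, a comparison in the spirit of Fig. 5 can be sketched with scikit-learn's PCA and KernelPCA classes; the dataset parameters and the value of \(\gamma\) below are assumptions rather than the settings used for the figure.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two-class "concentric circles" data (parameters assumed)
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Standard (linear) PCA versus kernel PCA with an RBF kernel
X_pca = PCA(n_components=2).fit_transform(X)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)

# The first kernel-PCA component separates the two classes, whereas the
# linear PCA components merely rotate the original, inseparable picture.
print(X_kpca[y == 0, 0].mean(), X_kpca[y == 1, 0].mean())
```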