Support Vector Machine: Kernel Trick; Mercer’s Theorem

And what the heck is the Kernel Trick? From the previous post about support vectors (please check it for the mathematical formulation), we have already seen that the maximization depends only on the dot products of the support vectors. Not only that, the decision rule also depends on the dot product of a support vector with a new sample. That means if we use a mapping function $\phi$ that maps our data into a higher-dimensional space, then the maximization and the decision rule will depend on the dot products of the mapped samples:

$$L = \sum_i \alpha_i - \frac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j \, \phi(x_i) \cdot \phi(x_j)$$

$$\text{decision rule: classify } u \text{ as } + \text{ if } \sum_i \alpha_i y_i \, \phi(x_i) \cdot \phi(u) + b \ge 0$$

And voila! If we have a function $K$ defined as

$$K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j),$$

then we just need to know $K$ and not the mapping function itself. This function is known as the Kernel function, and it removes the complexity of finding the mapping function explicitly. So the Kernel function defines an inner product in the transformed space.

Let us look at some of the most used kernel functions:

- Linear: $K(x, y) = x \cdot y$
- Polynomial: $K(x, y) = (x \cdot y + c)^d$
- Radial Basis Function (Gaussian): $K(x, y) = \exp(-\gamma \lVert x - y \rVert^2)$, with $\gamma = 1/(2\sigma^2)$
- Sigmoid: $K(x, y) = \tanh(\kappa \, x \cdot y + c)$

I have used the Radial Basis Function kernel to plot Figure 2, where the mapping from 2D space to 3D space indeed helps us in classification.

Apart from these pre-defined kernels, what conditions determine which functions can be kernels? This is given by Mercer's theorem. The first condition is rather trivial: the kernel function must be symmetric. As a kernel function is the dot product (inner product) of the mapping function, we can write

$$K(x, y) = \phi(x) \cdot \phi(y) = \phi(y) \cdot \phi(x) = K(y, x).$$

Mercer's theorem gives us the necessary and sufficient condition for a function to be a kernel function. One way to state it: in a finite input space, the function $K$ can be a kernel function if and only if the kernel matrix (also known as the Gram matrix)

$$G_{ij} = K(x_i, x_j)$$

is positive semi-definite. So the Gram matrix merges all the information necessary for the learning algorithm; the data points and the mapping function are fused into the inner product.

Let's see an example of finding the mapping function from the kernel function, and here we will use the Gaussian kernel. For one-dimensional inputs with $\sigma = 1$, expanding the exponential gives

$$K(x, y) = e^{-(x-y)^2/2} = e^{-x^2/2} \, e^{-y^2/2} \, e^{xy} = e^{-x^2/2} \, e^{-y^2/2} \sum_{k=0}^{\infty} \frac{x^k y^k}{k!},$$

so the corresponding mapping is

$$\phi(x) = e^{-x^2/2} \left( 1, \; x, \; \frac{x^2}{\sqrt{2!}}, \; \frac{x^3}{\sqrt{3!}}, \; \dots \right),$$

an infinite-dimensional feature vector. The Gaussian kernel thus corresponds to a mapping we could never compute explicitly, which is exactly why the kernel trick is so useful.

Tuning Parameter

Since we have discussed the non-linear kernels, and especially the Gaussian kernel (or RBF kernel), we finish the discussion with one of the tuning parameters in SVM: gamma. Looking at the RBF kernel, we see that it depends on the Euclidean distance between two points: if two vectors are close, $\lVert x - y \rVert^2$ is small and the kernel value is near 1; if they are far apart, the kernel value decays towards 0. Gamma controls how fast this decay happens. A large gamma means only very nearby points influence each other, which can lead to overfitting; a small gamma gives a smoother decision boundary.
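To make the ideas above concrete, here are a few small numerical sketches in Python (using NumPy and scikit-learn; all data points and parameter values below are made up for illustration). First, the kernel trick itself: for the degree-2 polynomial kernel $K(x, y) = (x \cdot y)^2$ on 2D inputs, the explicit mapping is $\phi(x) = (x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$, and we can check that the kernel gives the same number as the dot product in the mapped space, without ever building $\phi$:

```python
import numpy as np

def phi(x):
    """Explicit feature map for K(x, y) = (x . y)^2 on 2D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def kernel(x, y):
    """Degree-2 polynomial kernel (no constant term)."""
    return np.dot(x, y) ** 2

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

# Both routes give the same value; the kernel never computes phi.
print(np.dot(phi(x), phi(y)))  # 16.0
print(kernel(x, y))            # 16.0
```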
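Next, Mercer's condition: for any finite set of points, the Gram matrix of a valid kernel must be symmetric and positive semi-definite. A quick numerical sanity check for the RBF kernel, on random points with an arbitrarily chosen gamma:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))  # 20 random 2D points
gamma = 0.5                   # arbitrary choice for this check

# Gram matrix of the RBF kernel: G[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
G = np.exp(-gamma * sq_dists)

# Mercer's condition: symmetric, and all eigenvalues non-negative.
eigvals = np.linalg.eigvalsh(G)
print("symmetric:", np.allclose(G, G.T))
print("smallest eigenvalue:", eigvals.min())  # >= 0, up to float rounding
```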
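Finally, the effect of gamma: fitting an RBF-kernel SVM at several gamma values on toy data shows how a larger gamma makes the fit increasingly local. Training accuracy typically climbs towards 1 while the model leans on more support vectors, which is a classic sign of overfitting:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Small gamma: smooth, far-reaching influence. Large gamma: very local influence.
for gamma in (0.1, 1.0, 100.0):
    clf = SVC(kernel="rbf", gamma=gamma).fit(X, y)
    print(f"gamma={gamma:>6}: train accuracy={clf.score(X, y):.3f}, "
          f"support vectors={clf.n_support_.sum()}")
```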
