Last September I went through Caltech’s Learning from Data course offered through EdX. I went through all the lectures and homework except for the final exam.
Finally got around to finishing it this week, after re-reading some forgotten material. All the code written for the problem sets is here. Some notes:
VC Dimension and Generalization
One of the most emphasized points in the course is how to estimate the generalization performance of an algorithm. The VC dimension can be used as a preliminary measure of it: a more complex hypothesis set is more likely to overfit, and given the same amount of training data, may lead to worse generalization. When possible, using a simpler hypothesis set is better; otherwise, use regularization.
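For reference, the VC generalization bound as I remember it from the course: with probability at least \(1 - \delta\),

\[
E_{out}(g) \le E_{in}(g) + \sqrt{\frac{8}{N} \ln \frac{4\, m_{\mathcal{H}}(2N)}{\delta}},
\]

where \(m_{\mathcal{H}}\) is the growth function, which is polynomial in \(N\) whenever the VC dimension is finite.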
Linear Regression
Simple, not very sexy, but surprisingly powerful in many applications. Combined with nonlinear transforms, it can solve many classes of problems. We have many techniques for solving linear problems, so in general it is worth trying to reduce a problem to this class.
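A minimal sketch of the one-step pseudo-inverse solution the course uses, plus a toy quadratic feature transform (the function names and the circular toy data set here are just my own illustration):

```python
import numpy as np

def linear_regression(X, y):
    """One-step linear regression: w = pseudo-inverse(X) @ y."""
    return np.linalg.pinv(X) @ y

def quadratic_transform(X):
    """Example nonlinear transform: (x1, x2) -> (1, x1, x2, x1^2, x2^2, x1*x2)."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

# Toy data: labels depend on a circle, so a raw linear fit struggles,
# but the quadratic transform makes the problem linear in feature space.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.sign(X[:, 0]**2 + X[:, 1]**2 - 0.5)

Z = quadratic_transform(X)
w = linear_regression(Z, y)
preds = np.sign(Z @ w)
print("in-sample accuracy:", np.mean(preds == y))
```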
Validation
The cross-validation error, especially leave-one-out, is an unbiased estimator of the out-of-sample error \(E_{out}\) (strictly, of \(E_{out}\) for a model trained on one fewer data point).
Use validation for model selection, including choosing the regularization parameter.
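A rough sketch of what that looks like, assuming weight-decay-style regularized linear regression and leave-one-out to pick \(\lambda\) (the helper names and candidate \(\lambda\) values are made up for illustration):

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Regularized linear regression: w = (X^T X + lam*I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def loocv_error(X, y, lam):
    """Leave-one-out cross-validation error for a given lambda."""
    N = len(y)
    errors = []
    for i in range(N):
        mask = np.arange(N) != i
        w = fit_ridge(X[mask], y[mask], lam)
        errors.append((X[i] @ w - y[i]) ** 2)
    return np.mean(errors)

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 5))])
y = X @ rng.normal(size=6) + 0.3 * rng.normal(size=50)

# Model selection: pick the lambda with the lowest LOOCV estimate of E_out.
candidates = [0.0, 0.01, 0.1, 1.0, 10.0]
best = min(candidates, key=lambda lam: loocv_error(X, y, lam))
print("selected lambda:", best)
```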
SVM
Very powerful technique. The motivation is simple: find the hyperplane that separates two classes of data with the greatest margin, i.e. the largest distance from the separating plane to the closest points of each class. Solving for it becomes a quadratic programming problem: minimize the norm of the plane’s weights (which is inversely proportional to the margin width), subject to all points being classified correctly (in hard-margin SVM).
This original formulation is called the primal problem.
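Concretely, the hard-margin primal (as I remember it) is

\[
\min_{w,\,b} \ \tfrac{1}{2} w^T w \quad \text{subject to} \quad y_i \left(w^T x_i + b\right) \ge 1, \quad i = 1, \dots, N.
\]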
The same problem can be formulated as a Lagrange multiplier problem, with an additional multiplier \(\alpha_i\) for each data point. This formulation requires minimizing over the weights after maximizing over the multiplier \(\alpha\)s. Under the KKT conditions, we can switch the order: maximize over the multipliers after minimizing over the weights. This lets us write expressions for the weights in terms of the \(\alpha\)s by doing the inner minimization (taking partial derivatives of the Lagrangian and setting them to zero).
Finally, we reach another quadratic programming problem, where the objective is over the \(\alpha\)s, subject to constraints that come out of the minimization over the weights.
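Writing the dual out (from memory, so take the exact form with a grain of salt):

\[
\max_{\alpha} \ \sum_{i=1}^{N} \alpha_i - \tfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, x_i^T x_j
\quad \text{subject to} \quad \alpha_i \ge 0, \quad \sum_{i=1}^{N} \alpha_i y_i = 0,
\]

with \(w = \sum_i \alpha_i y_i x_i\) recovered from the inner minimization.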
The advantage of this dual problem is that the data points appear in the expression only through inner products, allowing us to use the kernel trick: a computationally efficient way of doing a nonlinear transformation into higher-dimensional (even infinite-dimensional) spaces.
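A tiny sketch of why the kernel trick is cheap: for the second-order polynomial kernel \(K(x, x') = (1 + x^T x')^2\), evaluating the kernel directly in the original 2-D space gives the same number as explicitly transforming into the corresponding 6-D feature space and taking the inner product there (the function names below are just for illustration):

```python
import numpy as np

def poly2_kernel(x, xp):
    """Second-order polynomial kernel: K(x, x') = (1 + x.x')^2."""
    return (1 + x @ xp) ** 2

def poly2_features(x):
    """Explicit feature map whose inner product equals the kernel (2-D input)."""
    x1, x2 = x
    return np.array([1, np.sqrt(2)*x1, np.sqrt(2)*x2, x1**2, x2**2, np.sqrt(2)*x1*x2])

x, xp = np.array([0.3, -1.2]), np.array([2.0, 0.5])
print(poly2_kernel(x, xp))                     # kernel evaluated in the original 2-D space
print(poly2_features(x) @ poly2_features(xp))  # same value via the explicit 6-D transform
```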
One of the coolest things about SVM is that, according to a strict VC-dimension analysis, after a nonlinear transformation of the data points into infinite dimensions via the kernel trick, the bound on \(E_{out}\) should blow up. However, because SVM searches for the hyperplane with the fattest separation margin, the effective VC dimension is significantly reduced, and thus \(E_{out}\) stays under control.
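If I remember the lecture correctly, one way the course quantifies this is a leave-one-out style bound that depends only on the number of support vectors, not on the dimension of the transformed space:

\[
\mathbb{E}\left[E_{out}\right] \le \frac{\mathbb{E}\left[\#\text{ of support vectors}\right]}{N - 1}.
\]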
RBF
An RBF network classifier places an RBF on each training data point, and then classifies a test point by summing the weighted RBF values at that point.
Since this is prohibitive when the data set is big, we can instead pick K representative points by finding the centroids of K clusters (via a clustering algorithm such as K-means). The classification problem is then transformed into a linear regression problem, where the basis functions are the RBFs centered at the cluster centers.
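A rough sketch of that pipeline, assuming plain Lloyd's algorithm for the clustering and a Gaussian RBF with a fixed \(\gamma\) (all function names and parameter values here are my own):

```python
import numpy as np

def kmeans(X, K, iters=50, seed=0):
    """Plain Lloyd's algorithm: return K cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute the centers.
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return centers

def rbf_features(X, centers, gamma=1.0):
    """Basis functions: one Gaussian bump per cluster center, plus a bias term."""
    d2 = ((X[:, None] - centers[None]) ** 2).sum(-1)
    return np.column_stack([np.ones(len(X)), np.exp(-gamma * d2)])

def rbf_network_fit(X, y, K=10, gamma=1.0):
    centers = kmeans(X, K)
    Z = rbf_features(X, centers, gamma)
    w = np.linalg.pinv(Z) @ y          # linear regression on the RBF features
    return centers, w

def rbf_network_predict(X, centers, w, gamma=1.0):
    return np.sign(rbf_features(X, centers, gamma) @ w)
```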
An SVM with an RBF kernel still outperforms the regular RBF network classifier most of the time, so why use the latter? I can imagine using it for unsupervised learning: given a cloud of data points with no labels, use clustering + RBFs to derive K labels and then assign them to new samples.