Many successful methods for classification, including AdaBoost, support vector machines and neural networks, can be viewed as convex relaxations of an intractable empirical risk minimization problem. We consider the statistical properties of these large margin classifiers, and in particular what convex optimization problems lead to accurate classifiers. We also study the relationship between these methods and probability models, and characterize when these methods lead to asymptotic estimates of conditional probabilities. For methods that predict using elements of a reproducing kernel Hilbert space, such as support vector machines, we show that there is a tight trade-off between the ability to estimate conditional probabilities and the sparseness of the solution.
Joint work with Mike Jordan, Jon McAuliffe, and Ambuj Tewari.
Research in statistical learning algorithms aims at discovering computer methods to learn from examples so as to generalize well to new cases. Many non-parametric methods have recently been proposed to discover salient features of the unknown data distribution, such as non-linear manifolds. We first show how many of them can be unified under a shared framework, as estimators of particular eigenfunctions.
We then show a fundamental difficulty with such models, when the structure of the density is too complex (or the data too noisy) given the amount and the dimensionality of the data. We propose a generic approach to address this problem based on sharing information about the distribution across large subsets of the data range.
Rademacher averages are important quantities in empirical process theory. They also have nice concentration properties. Recently they have been exploited in statistics in order to obtain risk bounds that can be computed from the data. This allows to design model selection procedures with empirical penalties.
When one considers M-estimators, it has been shown that tighter error bounds can be obtained when the Rademacher averages are localized, that is, when they are computed for a small ball around the best function in the model. We give general results of this type and show in particular some consequences for popular classification algorithms such as Support Vector Machines and related kernel methods.
In statistical binary classification, an unknown i.i.d. source generates a stream of data elements (e.g., vectors of features) whose binary labels must be predicted. A classification system may improve its performance by occasionally asking whether it made a prediction mistake or not. In learning, this process is known as selective sampling. In first part of the talk we will introduce a selective sampling algorithm for simple linear i.i.d. sources and prove a bound on its theoretical performance (mistake probability versus number of queried labels).
In the second part of the talk we will show that a simple randomized variants of the selective sampling algorithm works well even in the case the unknown source generating the data is deterministic (or, in other words, the data is generated adversarially).