ICML 2010 Proceedings: Table of Contents
Invited Application Track

Paper ID: 901 Web-Scale Bayesian Click-Through Rate Prediction for Sponsored Search Advertising in Microsoft’s Bing Search Engine. Thore Graepel (Microsoft Research); Joaquin Quiñonero Candela (Microsoft Research); Thomas Borchert (Microsoft Research); Ralf Herbrich (Microsoft Research). We describe a new Bayesian click-through rate (CTR) prediction algorithm used for Sponsored Search in Microsoft’s Bing search engine. The algorithm is based on a probit regression model that maps discrete or real-valued input features to probabilities. It maintains Gaussian beliefs over the weights of the model and performs Gaussian online updates derived from approximate message passing. Scalability of the algorithm is ensured through a principled weight-pruning procedure and an approximate parallel implementation. We discuss the challenges that arise from evaluating and tuning the predictor within the complex system of sponsored search, where the predictions made by the algorithm determine the composition of future training samples. Finally, we show experimental results from the production system and compare against a calibrated Naïve Bayes algorithm. [Full Paper] [Discussion]
Paper ID: 902 Detecting Large-Scale System Problems by Mining Console Logs. Wei Xu (UC Berkeley); Ling Huang (Intel Labs Berkeley); Armando Fox (UC Berkeley); David Patterson (UC Berkeley); Michael I. Jordan (UC Berkeley). Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, for they often consist of the voluminous intermixing of messages from many
software components written by independent
developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime problems. [Full Paper] [Discussion]
Paper ID: 903 The Role of Machine Learning in Business Optimization. Chid Apte (IBM T. J. Watson Research Center). In a trend that reflects the increasing demand for intelligent applications driven by business data, IBM today is building out a significant number of applications that leverage machine learning technologies to optimize business process decisions. This talk highlights this trend and describes the many different ways in which leading-edge machine learning concepts are being used in business applications developed by IBM for its internal use and for clients. [Full Paper] [Discussion]
Paper ID: 904 Music Plus One and Machine Learning. Christopher Raphael (Indiana University). A system for musical accompaniment is presented in which a computer-driven orchestra follows and learns from a soloist in a concerto-like setting. The system is decomposed into three modules: the first computes a real-time score match using a hidden Markov model; the second generates the output audio by phase-vocoding a pre-existing audio recording; the third provides a link between these two by predicting future timing evolution using a Kalman-filter-like model. Several examples are presented showing the system in action in diverse musical settings. Connections with machine learning are highlighted, showing current weaknesses and possible new directions. [Full Paper] [Discussion]
Paper ID: 905 Climbing the Tower of Babel: Unsupervised Multilingual Learning. Benjamin Snyder (MIT); Regina Barzilay (MIT). For centuries, scholars have explored the deep
links among human languages. In this paper,
we present a class of probabilistic models
that use these links as a form of naturally occurring supervision. [Full Paper] [Discussion]
Paper ID: 906 FAB-MAP: Appearance-Based Place Recognition and Mapping using a Learned Visual Vocabulary Model. Mark Cummins (University of Oxford); Paul Newman (University of Oxford). We present an overview of FAB-MAP, an algorithm for place recognition and mapping developed for infrastructure-free mobile robot navigation in large environments. The system allows a robot to identify when it is revisiting a previously seen location, on the basis of imagery captured by the robot's camera. We outline a complete probabilistic framework for the task, which is applicable even in visually repetitive environments where many locations may appear identical. Our work introduces a number of technical innovations; notably, we demonstrate that place recognition performance can be improved by learning an approximation to the joint distribution over visual elements. We also investigate several principled approaches to making the system robust in visually repetitive environments, and define an efficient bail-out strategy for multi-hypothesis testing to improve system speed. Our model has been shown to substantially outperform standard tf-idf ranking on our task of interest. We demonstrate the system performing reliable online appearance mapping and loop-closure detection over a 1,000 km trajectory, with mean filter update times of 14 ms. [Full Paper] [Discussion]
Paper ID: 907 Discriminative Latent Variable Models for Object Detection. Pedro Felzenszwalb (University of Chicago); Ross Girshick (University of Chicago); David McAllester (Toyota Technological Institute at Chicago); Deva Ramanan (UC Irvine). In this talk, I will discuss recent work by my colleagues and myself on discriminative latent-variable models for object detection. Object recognition is one of the fundamental challenges of computer vision.
We specifically consider the task of localizing and detecting instances of a generic object category, such as people or cars, in cluttered real-world images. Recent benchmark competitions such as the PASCAL Visual Object Challenge suggest that our method is the state-of-the-art system for such tasks. This success, combined with publicly available code that runs orders of magnitude faster than comparable approaches, has turned our system into a standard baseline for contemporary research on object recognition. [Full Paper] [Discussion]
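The click-through model of Paper 901 above pairs probit regression with Gaussian beliefs over feature weights. As a hedged illustration only (this is not the paper's production code; the per-feature belief representation, the noise parameter `beta`, and the function names are assumptions), a minimal sketch of the predictive step for a set of active binary features might look like:

```python
import math

def phi(t):
    """Standard normal CDF, the probit link."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def predict_ctr(means, variances, active, beta=1.0):
    """Predicted click probability for the active binary features,
    given an independent Gaussian belief N(mean, var) on each weight.
    The belief uncertainty widens the effective noise of the probit."""
    total_mean = sum(means[i] for i in active)
    total_var = sum(variances[i] for i in active)
    return phi(total_mean / math.sqrt(beta ** 2 + total_var))
```

With all-zero weight means the sketch predicts 0.5, and positive (negative) aggregate means push the prediction above (below) 0.5, matching the qualitative behavior a probit CTR model should have.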
Paper ID: 16 Large Graph Construction for Scalable Semi-Supervised Learning. Wei Liu (Columbia University); Junfeng He; Shih-Fu Chang. In this paper, we address the scalability issue plaguing graph-based semi-supervised learning via a small number of anchor points which adequately cover the entire point cloud. Critically, these anchor points enable nonparametric regression that predicts the label of each data point as a locally weighted average of the labels on anchor points. Because conventional graph construction is inefficient at large scale, we propose to construct a tractable large graph by coupling anchor-based label prediction and adjacency matrix design. Contrary to the Nyström approximation of adjacency matrices, which results in indefinite graph Laplacians and in turn leads to potentially non-convex optimization over graphs, the proposed graph construction approach, based on a unique idea called the AnchorGraph, provides nonnegative adjacency matrices that guarantee positive semidefinite graph Laplacians. Our approach scales linearly with the data size and in practice usually produces a large sparse graph. Experiments on large datasets demonstrate the significant accuracy improvement and scalability of the proposed approach. [Full Paper] [Discussion]
Paper ID: 23 Boosting Classifiers with Tightened L0-Relaxation Penalties. Noam Goldberg (Technion); Jonathan Eckstein. We propose a novel boosting algorithm which improves on current algorithms for weighted voting classification by striking a better balance between classification accuracy and sparsity of the weight vector. To justify our optimization formulations, we first consider a novel integer linear program as a model for sparse classifier selection, generalizing the minimum disagreement halfspace problem, whose complexity has been investigated in computational learning theory. Specifically, our mixed-integer problem is that of finding a separating hyperplane with minimum empirical error subject to an L0-norm penalty.
We find that common soft-margin linear programming formulations for robust classification are equivalent to a continuous relaxation of our model. Since the initial continuous relaxation is weak, we suggest a tighter relaxation, using novel cutting planes, to better approximate the integer solution. To solve this relaxation, we propose a new boosting algorithm based on linear programming with dynamic generation of variables and constraints. We demonstrate the classification performance of our proposed algorithm with experimental results, and justify our selection of parameters using a minimum description length, compression interpretation of learning. [Full Paper] [Discussion]
Paper ID: 26 Variable Selection in Model-Based Clustering: To Do or To Facilitate. Leonard Poon (HKUST); Nevin Zhang (Hong Kong University of Science & Technology); Tao Chen; Yi Wang. Variable selection for cluster analysis is a difficult problem. The difficulty originates not only from the lack of class information but also from the fact that high-dimensional data are often multifaceted and can be meaningfully clustered in multiple ways. In such a case, the effort to find one subset of attributes that presumably gives the ``best'' clustering may be misguided. It makes more sense to facilitate variable selection by domain experts, that is, to systematically identify the various facets of a data set (each based on a subset of attributes), cluster the data along each one, and present the results to the domain experts for appraisal and selection. In this paper, we propose a generalization of the Gaussian mixture model, show its ability to cluster data along multiple facets, and demonstrate that it is often more reasonable to facilitate variable selection than to perform it.
[Full Paper] [Discussion]
Paper ID: 28 Modeling Interaction via the Principle of Maximum Causal Entropy. Brian Ziebart (Carnegie Mellon University); Drew Bagnell (CMU); Anind Dey (Carnegie Mellon University). The principle of maximum entropy provides a powerful framework for statistical models of joint, conditional, and marginal distributions. However, there are many important distributions with elements of interaction and feedback where its applicability has not been established. This work presents the principle of maximum causal entropy, an approach based on causally conditioned probabilities that can appropriately model the availability and influence of sequentially revealed side information. Using this principle, we derive models for sequential data with revealed information, interaction, and feedback, and demonstrate their applicability for statistically framing inverse optimal control and decision prediction tasks. [Full Paper] [Discussion]
Paper ID: 35 Multi-Task Learning of Gaussian Graphical Models. Jean Honorio (Stony Brook University); Dimitris Samaras (Stony Brook University). We present multi-task structure learning for Gaussian graphical models. We discuss the uniqueness and boundedness of the optimal solution of the maximization problem. A block coordinate descent method leads to a provably convergent algorithm that generates a sequence of positive definite solutions. Thus, we reduce the original problem to a sequence of strictly convex $\ell_\infty$-regularized quadratic minimization subproblems. We further show that this subproblem leads to the continuous quadratic knapsack problem, for which very efficient methods exist. Finally, we show promising results on a dataset that captures the brain function of cocaine-addicted and control subjects under conditions of monetary reward.
[Full Paper] [Discussion]
Paper ID: 45 Spherical Topic Models. Joseph Reisinger (UT Austin); Austin Waters (UT Austin); Bryan Silverthorn (UT Austin); Raymond Mooney (University of Texas at Austin). We introduce the Spherical Admixture Model (SAM), a Bayesian topic model for arbitrary L2-normalized data. SAM maintains the same hierarchical structure as Latent Dirichlet Allocation (LDA), but models documents as points on a high-dimensional spherical manifold, allowing a natural likelihood parameterization in terms of cosine distance. Furthermore, SAM topics are capable of assigning negative weight to terms and, unlike previous models, can model word absence/presence. Performance is evaluated empirically, both subjectively as a topic model using human raters and across several disparate classification tasks from natural language processing and computer vision. [Full Paper] [Discussion]
Paper ID: 52 Feature Selection Using Regularization in Approximate Linear Programs for Markov Decision Processes. Marek Petrik (University of Massachusetts); Gavin Taylor (Duke); Ron Parr (Duke); Shlomo Zilberstein (University of Massachusetts Amherst). Approximate dynamic programming has been used successfully in a large variety of domains, but it relies on a small set of provided approximation features to calculate solutions reliably. Large and rich sets of features can cause the existing algorithms to overfit because of the limited number of samples. We address this shortcoming using L_1 regularization in approximate linear programming. Because the proposed method can automatically select the appropriate richness of features, its performance does not degrade with an increasing number of features. These results rely on new and stronger sampling bounds for regularized approximate linear programs. We also propose a computationally efficient homotopy method. The empirical evaluation of the approach shows that the proposed method performs well on simple MDPs and standard benchmark problems.
[Full Paper] [Discussion]
Paper ID: 76 Multiagent Learning Experiments on Repeated Matrix Games. Bruno Bouzy (Université Paris Descartes); Marc Metivier (Université Paris Descartes). This paper experimentally evaluates multiagent learning algorithms playing repeated matrix games to maximize their cumulative return. Previous work assessed that Q-learning surpassed Nash-based multiagent learning algorithms. Based on all-against-all repeated matrix game tournaments, this paper updates the state of the art of multiagent learning experiments. In a first stage, it shows that M-Qubed, S, and bandit-based algorithms such as UCB are the best algorithms on general-sum games, Exp3 being the best on cooperative games and zero-sum games. In a second stage, our experiments show that two features, forgetting the far past and using recent history with states, improve the learning algorithms. Finally, the best algorithms are two new algorithms, Q-learning and UCB enhanced with the two features, and M-Qubed. [Full Paper] [Discussion]
Paper ID: 77 Probabilistic Backward and Forward Reasoning in Stochastic Relational Worlds. Tobias Lang (TU Berlin); Marc Toussaint (TU Berlin). Inference in graphical models has emerged as a promising technique for planning. A recent approach to decision-theoretic planning in relational domains uses forward inference in dynamic Bayesian networks compiled from learned probabilistic relational rules. Inspired by work in non-relational domains with small state spaces, we derive a backpropagation method for such nets in relational domains, starting from a goal-state mixture distribution. We combine this with forward reasoning in a bidirectional two-filter approach. We perform experiments in a complex 3D simulated desktop environment with an articulated manipulator and realistic physics. Empirical results show that bidirectional probabilistic reasoning can lead to more efficient and accurate planning in comparison to pure forward reasoning.
[Full Paper] [Discussion]
Paper ID: 78 Causal Filter Selection in Microarray Data. Gianluca Bontempi (Université Libre de Bruxelles); Patrick Meyer (ULB). The importance of bringing causality into play when designing feature selection methods is more and more acknowledged in the machine learning community. This paper proposes a filter approach based on information theory which aims to prioritise direct causal relationships in feature selection problems where the ratio between the number of features and the number of samples is high. The approach is based on the notion of interaction, which is shown to be informative about the relevance of an input subset as well as its causal relationship with the target. The resulting filter, called mIMR (min-Interaction Max-Relevance), is compared with state-of-the-art approaches. Classification results on 25 real microarray datasets show that the incorporation of causal aspects into the feature assessment is beneficial both for the resulting accuracy and for stability. A toy example of causal discovery shows the effectiveness of the filter for identifying direct causal relationships. [Full Paper] [Discussion]
Paper ID: 87 A Conditional Random Field for Multi-Instance Learning. Thomas Deselaers (ETH Zurich); Vittorio Ferrari (ETH Zurich). We present MI-CRF, a conditional random field (CRF) model for multiple instance learning (MIL). MI-CRF models bags as nodes in a CRF with instances as their states. It combines discriminative unary instance classifiers and pairwise dissimilarity measures. We show that both forces improve the classification performance. Unlike other approaches, MI-CRF considers all bags jointly during training as well as during testing. This makes it possible to classify test bags in an imputation setup. The parameters of MI-CRF are learned using constraint generation. Furthermore, we show that MI-CRF can incorporate previous MIL algorithms to improve on their results. MI-CRF obtains competitive results on five standard MIL datasets.
[Full Paper] [Discussion]
Paper ID: 99 Supervised Aggregation of Classifiers using Artificial Prediction Markets. Nathan Lay (Florida State University); Adrian Barbu (Florida State University). Prediction markets are used in real life to predict outcomes of interest such as presidential elections. In this work we introduce a mathematical theory of artificial prediction markets for supervised classifier aggregation and probability estimation. We introduce the artificial prediction market as a novel way to aggregate classifiers. We derive the market equations to enforce total budget conservation, show the uniqueness of the market price, and give efficient algorithms for computing it. We show how to train the market participants by updating their budgets using training examples. We introduce classifier specialization as a new differentiating characteristic between classifiers. Finally, we present experiments using random decision rules as specialized classifiers and show that the prediction market consistently outperforms Random Forest on real and synthetic data of varying degrees of difficulty. [Full Paper] [Discussion]
Paper ID: 100 3D Convolutional Neural Networks for Human Action Recognition. Shuiwang Ji (Arizona State University); Wei Xu; Ming Yang; Kai Yu (NEC Research Princeton). We consider the fully automated recognition of actions in uncontrolled environments. Most existing work relies on domain knowledge to construct complex hand-crafted features from inputs. In addition, the environments are usually assumed to be controlled. Convolutional neural networks (CNNs) are a type of deep model that can act directly on raw inputs, thus automating the process of feature construction. However, such models are currently limited to handling 2D inputs. In this paper, we develop a novel 3D CNN model for action recognition.
This model extracts features from both the spatial and temporal dimensions by performing 3D convolutions, thereby capturing the motion information encoded in multiple adjacent frames. The developed model generates multiple channels of information from the input frames, and the final feature representation is obtained by combining information from all channels. We apply the developed model to recognize human actions in real-world environments, and it achieves superior performance without relying on hand-crafted features. [Full Paper] [Discussion]
Paper ID: 107 Asymptotic Analysis of Generative Semi-Supervised Learning. Joshua Dillon (Georgia Institute of Technology); Krishnakumar Balasubramanian (Georgia Institute of Technology); Guy Lebanon (Georgia Tech). Semi-supervised learning has emerged as a popular framework for improving modeling accuracy while controlling labeling cost. Based on an extension of stochastic composite likelihood, we quantify the asymptotic accuracy of generative semi-supervised learning. In doing so, we complement distribution-free analysis by providing an alternative framework to measure the value associated with different labeling policies, and resolve the fundamental question of how much data to label and in what manner. We demonstrate our approach with both simulation studies and real-world experiments using naive Bayes for text classification and MRFs and CRFs for structured prediction in NLP. [Full Paper] [Discussion]
Paper ID: 115 Restricted Boltzmann Machines are Hard to Approximately Evaluate or Simulate. Phil Long (Google); Rocco Servedio (Columbia). Restricted Boltzmann Machines (RBMs) are a type of probability model over the Boolean cube {-1,1}^n that have recently received much attention.
We establish the intractability of two basic computational tasks involving RBMs, even if only a coarse approximation to the correct output is required. We first show that, assuming P != NP, for any fixed positive constant K (which may be arbitrarily large) there is no polynomial-time algorithm for the following problem: given an n-bit input string x and the parameters of an RBM M, output an estimate of the probability assigned to x by M that is accurate to within a multiplicative factor of e^{Kn}. This hardness result holds even if the parameters of M are constrained to be at most psi(n) for any function psi(n) = \omega(n), and if the number of hidden nodes of M is at most n. We then show that, assuming RP != NP, there is no polynomial-time randomized algorithm for the following problem: given the parameters of an RBM M, generate a random example from a probability distribution whose total variation distance from the distribution defined by M is at most 1/12. [Full Paper] [Discussion]
Paper ID: 117 Learning from Noisy Side Information by Generalized Maximum Entropy Model. Tianbao Yang (Michigan State University); Rong Jin (Michigan State University); Anil Jain (Michigan State University). We consider the problem of learning from noisy side information in the form of pairwise constraints. Although many algorithms have been developed to learn from side information, most of them assume perfect pairwise constraints. Because pairwise constraints are often extracted from data sources such as paper citations, they tend to be noisy and inaccurate. In this paper, we introduce a generalization of the maximum entropy model and propose a framework for learning from noisy side information based on the generalized maximum entropy model. Our theoretical analysis shows that, under certain assumptions, the classification model trained from the noisy side information can be very close to the one trained from perfect side information.
Extensive empirical studies verify the effectiveness of the proposed framework. [Full Paper] [Discussion]
Paper ID: 119 Finding Planted Partitions in Nearly Linear Time using Arrested Spectral Clustering. Nader Bshouty (Technion); Phil Long (Google). We describe an algorithm for clustering using a similarity graph. The algorithm (a) runs in O(n log^3 n + m \log n) time on graphs with n vertices and m edges, and (b) with high probability, finds all ``large enough'' clusters in a random graph generated according to the planted partition model. We provide lower bounds implying that our ``large enough'' constraint cannot be improved much, even using a computationally unbounded algorithm. We describe some experiments running the algorithm and a few related algorithms on random graphs with partitions generated using a Chinese Restaurant Process, and some results of applying the algorithm to cluster DBLP titles. [Full Paper] [Discussion]
Paper ID: 123 The Elastic Embedding Algorithm for Dimensionality Reduction. Miguel Carreira-Perpinan (UC Merced). We propose a new dimensionality reduction method, the elastic embedding (EE), that optimises an intuitive, nonlinear objective function of the low-dimensional coordinates of the data. The method reveals a fundamental relation between a spectral method, Laplacian eigenmaps, and a nonlinear method, stochastic neighbour embedding, and shows that EE can be seen as learning both the coordinates and the affinities between data points. We give a homotopy method to train EE, characterise the critical value of the homotopy parameter, and study the method's behaviour. For a fixed homotopy parameter, we give a globally convergent iterative algorithm that is very effective and requires no user parameters. Finally, we give an extension to out-of-sample points. On standard datasets, EE obtains results as good as or better than those of SNE, but more efficiently and robustly.
[Full Paper] [Discussion]
Paper ID: 125 Two-Stage Learning Kernel Algorithms. Corinna Cortes (Google); Mehryar Mohri (Courant Institute, NYU); Afshin Rostamizadeh (Courant Institute, NYU). This paper examines two-stage techniques for learning kernels based on a notion of alignment. It presents a number of novel theoretical, algorithmic, and empirical results for alignment-based techniques. Our results build on previous work by Cristianini et al. (2001), but we adopt a different definition of kernel alignment and significantly extend that work in several directions: we give a novel and simple concentration bound for the alignment between kernel matrices; show the existence of good predictors for kernels with high alignment, both for classification and for regression; give algorithms for learning a maximum-alignment kernel by showing that the problem can be reduced to a simple QP; and report the results of extensive experiments with this alignment-based method in classification and regression tasks, which show an improvement both over the uniform combination of kernels and over other state-of-the-art learning kernel methods. [Full Paper] [Discussion]
Paper ID: 132 Robust Graph Mode Seeking by Graph Shift. Hairong Liu (NUS); Shuicheng Yan (National University of Singapore). In this paper, we study how to robustly compute the modes of a graph, namely the dense subgraphs, which characterize the underlying compact patterns and are thus useful for many applications. We first define the modes based on a graph density function, then propose the graph shift algorithm, which starts from each vertex and iteratively shifts towards the nearest mode of the graph along a certain trajectory. Both theoretical analysis and experiments show that the graph shift algorithm is very efficient and robust, especially when there are large amounts of noise and outliers.
[Full Paper] [Discussion]
Paper ID: 137 Multiscale Wavelets on Trees, Graphs and High Dimensional Data: Theory and Applications to Semi-Supervised Learning. Matan Gavish (Stanford University); Boaz Nadler (Weizmann Institute of Science); Ronald Coifman (Yale University). Harmonic analysis, and in particular the relation between function smoothness and the approximate sparsity of its wavelet coefficients, has played a key role in signal processing and statistical inference for low-dimensional data. In contrast, harmonic analysis has thus far had little impact on modern problems involving high-dimensional data, or data encoded as graphs or networks. The main contribution of this paper is the development of a harmonic analysis approach, including both learning algorithms and supporting theory, applicable to these more general settings. Given data (be it high-dimensional, graph, or network) that is represented by one or more hierarchical trees, we first construct multiscale wavelet-like orthonormal bases on it. Second, we prove that, in analogy to the Euclidean case, function smoothness with respect to a specific metric induced by the tree is equivalent to an exponential rate of coefficient decay, that is, to approximate sparsity. These results readily translate into simple practical algorithms for various learning tasks. We present an application to transductive semi-supervised learning. [Full Paper] [Discussion]
Paper ID: 149 Deep Supervised T-Distributed Embedding. Renqiang Min (DCS, University of Toronto); Laurens van der Maaten (Tilburg University); Zineng Yuan (University of Toronto); Anthony Bonner (University of Toronto); Zhaolei Zhang (University of Toronto). Deep learning has been successfully applied to learn nonlinear feature mappings and to perform dimensionality reduction. In this paper, we present supervised embedding techniques that use a deep neural network to collapse classes.
The network is pretrained using a stack of Restricted Boltzmann Machines (RBMs) and fine-tuned using approaches that try to collapse classes. The fine-tuning is inspired by ideas from Neighborhood Components Analysis (NCA), but it uses a Student t-distribution to model the probability of pairs of data points belonging to the same class in the embedding. We investigate two types of objective functions: deep t-distributed MCML (dt-MCML) and deep t-distributed NCA (dt-NCA). Our experiments on two handwritten digit datasets reveal the strong performance of dt-MCML in supervised parametric data visualization, whereas dt-NCA outperforms alternative techniques when embeddings with more than two or three dimensions are constructed, e.g., to obtain good classification performance. Overall, our results demonstrate the advantage of using a deep architecture and a heavy-tailed t-distribution for measuring pairwise similarities in supervised embedding. [Full Paper] [Discussion]
Paper ID: 168 A Nonparametric Information Theoretic Clustering Algorithm. Lev Faivishevsky (Bar Ilan University); Jacob Goldberger (Bar Ilan University). In this paper we propose a novel clustering algorithm based on maximizing the mutual information between data points and clusters. Unlike previous methods, we neither assume that the data are given in terms of distributions nor impose any parametric model on the within-cluster distribution. Instead, we utilize a nonparametric estimation of the average cluster entropies and search for a clustering that maximizes the estimated mutual information between data points and clusters. The improved performance of the proposed algorithm is demonstrated on several standard datasets.
[Full Paper] [Discussion]
Paper ID: 170 Gaussian Process Change Point Models. Yunus Saatci (University of Cambridge); Ryan Turner; Carl Rasmussen. We combine Bayesian online change point detection with Gaussian processes to create a nonparametric time series model which can handle change points. The model can be used to locate change points in an online manner and, unlike other Bayesian online change point detection algorithms, is applicable when temporal correlations within a regime are expected. We show three variations on how to apply Gaussian processes in the change point context, each with its own advantages. We present methods to reduce the computational burden of these models and demonstrate them on several real-world data sets. [Full Paper] [Discussion]
Paper ID: 175 Dynamical Products of Experts for Modeling Financial Time Series. Yutian Chen (Univ. of California, Irvine); Max Welling (University of California, Irvine). Predicting the ``Value at Risk'' of a portfolio of stocks is of great significance in quantitative finance. We introduce a new class of models, ``dynamical products of experts'', that treats the latent process over volatilities as an inverse Gamma process. We show that our multivariate volatility models significantly outperform all related GARCH and stochastic volatility models in popular use in the quantitative finance community. [Full Paper] [Discussion]
Paper ID: 176 The Margin Perceptron with Unlearning. Constantinos Panagiotakopoulos (Aristotle University of Thessaloniki); Petroula Tsampouka (Aristotle University of Thessaloniki). We introduce into the classical Perceptron algorithm with margin a mechanism of unlearning which, in the course of the regular updates, allows for a reduction of possible contributions from “very well classified” patterns to the weight vector.
The resulting incremental classification algorithm, called Margin Perceptron with Unlearning (MPU), provably converges in a finite number of updates to any desired approximation, chosen before running the algorithm, of either the maximal-margin or the optimal 1-norm soft-margin solution. Moreover, an experimental comparative evaluation involving representative linear Support Vector Machines reveals that the MPU algorithm is very competitive. [Full Paper] [Discussion] Paper ID: 178 Sequential Projection Learning for Hashing with Compact Codes. Jun Wang (Columbia University); Sanjiv Kumar (Google Research NY); Shih-Fu Chang Hashing-based Approximate Nearest Neighbor (ANN) search has attracted much attention due to its fast query time and drastically reduced storage. However, most of the hashing methods either use random projections or extract principal directions from the data to derive hash functions. The resulting embedding suffers from poor discrimination when compact codes are used. In this paper, we propose a novel data-dependent projection learning method such that each hash function is designed to sequentially correct the errors made by the previous one. The proposed method easily adapts to both unsupervised and semi-supervised scenarios and shows significant performance gains over state-of-the-art methods on two large datasets containing up to 1 million points. [Full Paper] [Discussion] Paper ID: 179 Generalization Bounds for Learning Kernels. Corinna Cortes (Google); Mehryar Mohri (Courant Institute, NYU); Afshin Rostamizadeh (Courant Institute, NYU) This paper presents several novel generalization bounds for the problem of learning kernels based on a combinatorial analysis of the Rademacher complexity of the corresponding hypothesis sets. 
Our bound for learning kernels with a convex combination of p base kernels using L_1 regularization admits only a \sqrt{\log p} dependency on the number of kernels, which is *tight* and considerably more favorable than the previous best bound given for the same problem. We also give a novel bound for learning with a nonnegative combination of p base kernels with an L_2 regularization whose dependency on p is also *tight* and only in p^{1/4}. We present similar results for L_q regularization with other values of q, and outline the relevance of our proof techniques to the analysis of the complexity of the class of linear functions. Experiments with a large number of kernels further validate the behavior of the generalization error as a function of p predicted by our bounds. [Full Paper] [Discussion] Paper ID: 180 Modeling Transfer Learning in Human Categorization with the Hierarchical Dirichlet Process. Kevin Canini (University of California); Mikhail Shashkov (University of California, Berkeley); Tom Griffiths (Berkeley) Transfer learning can be described as the distillation of abstract knowledge from one learning domain or task and the reuse of that knowledge in a related domain or task. In categorization settings, transfer learning is the modification by past experience of prior expectations about what types of categories are likely to exist in the world. While transfer learning is an important and active research topic in machine learning, there have been few studies of transfer learning in human categorization. We propose an explanation for transfer learning effects in human categorization, implementing a model from the statistical machine learning literature, the hierarchical Dirichlet process (HDP), to make empirical evaluations of its ability to explain these effects. We present two laboratory experiments which measure the degree to which people engage in transfer learning in a controlled setting, and we compare our model to their performance. 
We find that the HDP provides a good explanation for the transfer learning exhibited by human learners. [Full Paper] [Discussion] Paper ID: 187 Convergence of Least Squares Temporal Difference Methods Under General Conditions. Huizhen Yu (Univ. of Helsinki) We consider approximate policy evaluation for finite state and action Markov decision processes (MDP) in the off-policy learning context and with the simulation-based least squares temporal difference algorithm, LSTD($\lambda$). We establish for the discounted cost criterion that the off-policy LSTD($\lambda$) converges almost surely under mild, minimal conditions. We also analyze other convergence and boundedness properties of the iterates involved in the algorithm, and based on them, we suggest a modification in its practical implementation. Our analysis uses theories of both finite space Markov chains and Markov chains on topological spaces. [Full Paper] [Discussion] Paper ID: 191 Classes of Multiagent Q-learning Dynamics with epsilon-greedy Exploration. Michael Wunder (Rutgers University); Michael Littman (Rutgers University); Monica Babes (Rutgers) Q-learning in single-agent environments is known to converge in the limit given sufficient exploration. The same algorithm has been applied, with some success, in multiagent environments, where traditional analysis techniques break down. Using established dynamical systems methods, we derive and study an idealization of Q-learning in 2-player 2-action repeated general-sum games. In particular, we address the discontinuous case of epsilon-greedy exploration and use it as a proxy for value-based algorithms to highlight a contrast with existing results in policy search. Analogously to previous results for gradient ascent algorithms, we provide a complete catalog of the convergence behavior of the epsilon-greedy Q-learning algorithm by introducing new subclasses of these games. 
We identify two subclasses of Prisoner's Dilemma-like games where the application of Q-learning with epsilon-greedy exploration results in higher-than-Nash payoffs for some initial conditions. [Full Paper] [Discussion] Paper ID: 195 Estimation of (near) low-rank matrices with noise and high-dimensional scaling. Sahand Negahban (UC Berkeley); Martin Wainwright (UC Berkeley) We study an instance of high-dimensional statistical inference in which the goal is to use $N$ noisy observations to estimate a matrix $\Theta^* \in \real^{k \times p}$ that is assumed to be either exactly low rank, or "near" low-rank, meaning that it can be well-approximated by a matrix with low rank. We consider an $M$-estimator based on regularization by the trace or nuclear norm over matrices, and analyze its performance under high-dimensional scaling. We provide non-asymptotic bounds on the Frobenius norm error that hold for a general class of noisy observation models, and apply to both exactly low-rank and approximately low-rank matrices. We then illustrate their consequences for a number of specific learning models, including low-rank multivariate or multi-task regression, system identification in vector autoregressive processes, and recovery of low-rank matrices from random projections. Simulations show excellent agreement with the high-dimensional scaling of the error predicted by our theory. [Full Paper] [Discussion] Paper ID: 196 A Simple Algorithm for Nuclear Norm Regularized Problems. Martin Jaggi (ETH Zürich); Marek Sulovská (ETH Zürich, Switzerland) Optimization problems with a nuclear norm regularization, such as low-norm matrix factorizations, have seen many applications recently. We propose a new approximation algorithm building upon the recent sparse approximate SDP solver of (Hazan, 2008). The experimental efficiency of our method is demonstrated on large matrix completion problems such as the Netflix data set. 
The algorithm comes with strong convergence guarantees, and can be interpreted as a first theoretically justified variant of Simon-Funk-type SVD heuristics. The method is free of tuning parameters and very easy to parallelize. [Full Paper] [Discussion] Paper ID: 197 On Sparse Nonparametric Conditional Covariance Selection. Mladen Kolar (Carnegie Mellon University); Ankur Parikh (Carnegie Mellon University); Eric Xing (CMU) We develop a penalized kernel smoothing method for the problem of selecting nonzero elements of the conditional precision matrix, known as conditional covariance selection. This problem has a key role in many modern applications such as finance and computational biology. However, it has not been properly addressed. Our estimator is derived under minimal assumptions on the underlying probability distribution and works well in the high-dimensional setting. The efficiency of the algorithm is demonstrated on both simulation studies and the analysis of the stock market. [Full Paper] [Discussion] Paper ID: 202 Exploiting Data-Independence for Fast Belief-Propagation. Julian McAuley (NICTA); Tiberio Caetano (NICTA) Maximum a posteriori (MAP) inference in graphical models requires that we maximize the sum of two terms: a data-dependent term, encoding the conditional likelihood of a certain labeling given an observation, and a data-independent term, encoding some prior on labelings. Often, data-dependent factors contain fewer latent variables than data-independent factors; for instance, many grid- and tree-structured models contain only first-order conditionals despite having pairwise priors. In this paper, we note that MAP inference in such models can be made substantially faster by appropriately preprocessing their data-independent terms. Our main result is to show that message-passing in any such pairwise model has an expected-case exponent of only 1.5 on the number of states per node, leading to significant improvements over existing quadratic-time solutions. 
[Full Paper] [Discussion] Paper ID: 207 One-sided Support Vector Regression for Multiclass Cost-sensitive Classification. Han-Hsing Tu (National Taiwan University); Hsuan-Tien Lin (National Taiwan University) We propose a novel approach that reduces cost-sensitive classification to one-sided regression. The approach stores the cost information in the regression labels and encodes the minimum-cost prediction with the one-sided loss. The simple approach is accompanied by a solid theoretical guarantee of error transformation, and can be used to cast any one-sided regression method as a cost-sensitive classification algorithm. To validate the proposed reduction approach, we design a new cost-sensitive classification algorithm by coupling the approach with a variant of the support vector machine (SVM) for one-sided regression. The proposed algorithm can be viewed as a theoretically justified extension of the popular one-versus-all SVM. Experimental results demonstrate that the algorithm is not only superior to the traditional one-versus-all SVM for cost-sensitive classification, but also better than many existing SVM-based cost-sensitive classification algorithms. [Full Paper] [Discussion] Paper ID: 219 OTL: A Framework of Online Transfer Learning. Peilin Zhao (Nanyang Technological University); Steven C.H. Hoi (Nanyang Technological University) In this paper, we investigate a new machine learning framework called ``Online Transfer Learning'' (OTL) that aims to transfer knowledge from some source domain to an online learning task on a target domain. We do not assume the target data follow the same class or generative distribution as the source data, and our key motivation is to improve a supervised online learning task in a target domain by exploiting the knowledge that had been learned from a large amount of training data in source domains. 
OTL is in general challenging since data in the two domains can differ not only in their class distributions but also in their feature representations. As a first attempt at this problem, we propose techniques to address two kinds of OTL tasks: one is to perform OTL in a homogeneous domain, and the other is to perform OTL across heterogeneous domains. We show the mistake bounds of the proposed OTL algorithms, and empirically examine their performance on several challenging OTL tasks. Encouraging results validate the efficacy of our techniques. [Full Paper] [Discussion] Paper ID: 223 SVM Classifier Estimation from Group Probabilities. Stefan Rueping (Fraunhofer IAIS) A learning problem that has only recently gained attention in the machine learning community is that of learning a classifier from group probabilities. It is a learning task that lies somewhere between the well-known tasks of supervised and unsupervised learning, in the sense that for a set of observations we do not know the labels, but for some groups of observations, the frequency distribution of the labels is known. This learning problem has important practical applications, for example in privacy-preserving data mining. This paper presents an approach to learning a classifier from group probabilities based on support vector regression and the idea of inverting a classifier calibration process. A detailed analysis shows that this new approach outperforms existing approaches. [Full Paper] [Discussion] Paper ID: 227 Learning Sparse SVM for Feature Selection on Very High Dimensional Datasets. Mingkui Tan (NTU, Singapore); Li Wang (University of California, San Diego); Ivor Tsang (NTU) A sparse representation of Support Vector Machines (SVMs) with respect to input features is desirable for many applications. In this paper, by introducing a 0-1 control variable for each input feature, the $l_0$-norm Sparse SVM (SSVM) is converted to a mixed integer programming (MIP) problem. 
Rather than directly solving this MIP, we propose an efficient cutting plane algorithm, combined with multiple kernel learning, to solve its convex relaxation. A global convergence proof for our method is also presented. Comprehensive experimental results on one synthetic and 10 real-world datasets show that our proposed method can obtain better or competitive performance compared with existing SVM-based feature selection methods in terms of sparsity and generalization performance. Moreover, our proposed method can effectively handle large-scale and extremely high-dimensional problems. [Full Paper] [Discussion] Paper ID: 233 Total Variation and Cheeger Cuts. Arthur Szlam (NYU); Xavier Bresson (UCLA) In this work, inspired by (Buhler and Hein, 2009), (Strang, 1983), and (Zhang et al., 2009), we give a continuous relaxation of the Cheeger cut problem on a weighted graph. We show that the relaxation is actually equivalent to the original problem. We then describe an algorithm for finding good cuts suggested by the similarities between the energy of the relaxed problem and various well-studied energies in image processing. Finally, we provide experimental validation of the proposed algorithm, demonstrating its efficiency in finding high quality cuts. [Full Paper] [Discussion] Paper ID: 235 Learning Temporal Graphs for Relational Time-Series Analysis. Yan Liu (IBM T. J. Watson Research Center); Alexandru Niculescu-Mizil; Aurelie Lozano (IBM T. J. Watson Research Center); Yong Lu (Harvard University) Learning temporal causal graph structures from multivariate time-series data reveals important dependency relationships between current observations and histories, and provides a better understanding of complex systems. In this paper, we examine learning tasks where one is presented with multiple multivariate time-series, as well as a relational graph among the different time-series. 
We propose an L1-regularized hidden Markov random field regression framework to leverage the information provided by the relational graph and jointly infer more accurate temporal causal structures for all time-series. We test the proposed model on climate modeling and cross-species microarray data analysis applications. [Full Paper] [Discussion] Paper ID: 238 Online Streaming Feature Selection. Xindong Wu (University of Vermont); Kui Yu (Hefei University of Technology); Hao Wang (Hefei University of Technology); Wei Ding (University of Massachusetts Boston) We study an interesting and challenging problem, online streaming feature selection, in which the size of the feature set is unknown, and not all features are available for learning, while the number of observations is kept constant. In this problem, the candidate features arrive one at a time, and the learner's task is to select a “best so far” set of features from the streaming features. Standard feature selection methods cannot perform well in this scenario. Thus, we present a novel framework based on feature relevance. Under this framework, a promising alternative method, Online Streaming Feature Selection (OSFS), is presented to select strongly relevant and non-redundant features online. In addition to OSFS, a faster Fast-OSFS algorithm is proposed to further improve the selection efficiency. Experimental results show that our algorithms achieve more compactness and better accuracy than existing streaming feature selection algorithms on various datasets. [Full Paper] [Discussion] Paper ID: 242 Making Large-Scale Nystrom Approximation Possible. Mu Li (Shanghai Jiao Tong University); James Kwok (Hong Kong University of Science and Technology); Bao-Liang Lu (SJTU) The Nystrom method is an efficient technique for the eigenvalue decomposition of large kernel matrices. However, in order to ensure an accurate approximation, a sufficiently large number of columns has to be sampled. 
On very large data sets, the SVD step on the resultant data submatrix will soon dominate the computations and become prohibitive. In this paper, we propose an accurate and scalable Nystrom scheme that first samples a large column subset from the input matrix, but then only performs an approximate SVD on the inner submatrix by using recent randomized low-rank matrix approximation algorithms. Theoretical analysis shows that the proposed algorithm is as accurate as the standard Nystrom method that directly performs a large SVD on the inner submatrix. On the other hand, its time complexity is only as low as performing a small SVD. Experiments are performed on a number of large-scale data sets for low-rank approximation and spectral embedding. In particular, spectral embedding of the MNIST data set with 3.3 million examples takes less than an hour on a standard PC with 4GB memory. [Full Paper] [Discussion] Paper ID: 246 Particle Filtered MCMC-MLE with Connections to Contrastive Divergence. Arthur Asuncion (UC Irvine); Qiang Liu (UC Irvine); Alex Ihler (UC Irvine); Padhraic Smyth (UC Irvine) Learning undirected graphical models such as Markov random fields is an important machine learning task with applications in many domains. Since it is usually intractable to learn these models exactly, various approximate learning techniques have been developed, such as contrastive divergence (CD) and Markov chain Monte Carlo maximum likelihood estimation (MCMC-MLE). In this paper, we introduce particle filtered MCMC-MLE, which is a sampling-importance-resampling version of MCMC-MLE with additional MCMC rejuvenation steps. We also describe a unified view of (1) MCMC-MLE, (2) our particle filtering approach, and (3) a stochastic approximation procedure known as persistent contrastive divergence. We show how these approaches are related to each other and discuss the relative merits of each approach. 
Empirical results on various undirected models demonstrate that the particle filtering technique we propose in this paper can significantly outperform MCMC-MLE. Furthermore, in certain cases, the proposed technique is faster than persistent CD. [Full Paper] [Discussion] Paper ID: 247 Feature Selection as a One-Player Game. Romaric Gaudel (LRI); Michele Sebag (Université Paris-Sud) This paper formalizes Feature Selection as a Reinforcement Learning problem, leading to a provably optimal though intractable selection policy. As a second contribution, this paper presents an approximation thereof, based on a one-player game approach and relying on the Monte-Carlo tree search UCT (Upper Confidence Tree) proposed by Kocsis and Szepesvari (2006). More precisely, the presented FUSE (Feature Uct SElection) algorithm extends UCT to deal with i) a finite unknown horizon (the target number of relevant features); ii) the huge branching factor of the search tree (the size of the initial feature set). Additionally, a frugal reward function is proposed as a rough but unbiased estimate of the relevance of a feature subset. A proof of concept of FUSE is shown on benchmark data sets. [Full Paper] [Discussion] Paper ID: 248 The Translation-invariant Wishart-Dirichlet Process for Clustering Distance Data. Julia Vogt (University of Basel); Sandhya Prabhakaran (University of Basel); Thomas Fuchs (ETH Zurich); Volker Roth (ETH Zurich) We present a probabilistic model for clustering of objects represented via pairwise dissimilarities. We propose that even if an underlying vectorial representation exists, it is better to work directly with the dissimilarity matrix, hence avoiding unnecessary bias and variance caused by embeddings. By using a Dirichlet process prior we are not obliged to fix the number of clusters in advance. Furthermore, our clustering model is permutation-, scale- and translation-invariant, and it is called the Translation-invariant Wishart-Dirichlet (TIWD) process. 
A highly efficient MCMC sampling algorithm is presented. Experiments show that the TIWD process exhibits several advantages over competing approaches. [Full Paper] [Discussion] Paper ID: 259 Online Prediction with Privacy. Jun Sakuma (University of Tsukuba); Hiromi Arai (University of Tsukuba) In this paper, we consider online prediction from expert advice in a situation where each expert observes its own loss at each time step, while the loss cannot be disclosed to others for reasons of privacy or confidentiality preservation. Our secure exponential weighting scheme enables the exploitation of such private loss values by making use of cryptographic tools. We prove that the regret bound of the secure exponential weighting is the same as, or almost the same as, that of the well-known exponential weighting scheme in the full information model. In addition, we prove theoretically that the secure exponential weighting is privacy-preserving in the sense of secure function evaluation. [Full Paper] [Discussion] Paper ID: 263 Fast boosting using adversarial bandits. Robert Busa-Fekete (CNRS / Université Paris-Sud 11); Balazs Kegl (LAL/LRI Université Paris-Sud, CNRS) In this paper we apply multi-armed bandits (MABs) to improve the computational complexity of AdaBoost. AdaBoost constructs a strong classifier in a stepwise fashion by selecting simple base classifiers and using their weighted ``vote'' to determine the final classification. We model this stepwise base classifier selection as a sequential decision problem, and optimize it with MABs where each arm represents a subset of the base classifier set. The MAB gradually learns the ``usefulness'' of the subsets, and selects one of the subsets in each iteration. AdaBoost then searches only this subset instead of optimizing the base classifier over the whole space. The main improvement of this paper over a previous approach is that we use an adversarial bandit algorithm instead of stochastic bandits. 
This choice allows us to prove a weak-to-strong-learning theorem, which means that the proposed technique remains a boosting algorithm in a formal sense. We demonstrate on benchmark datasets that our technique can achieve a generalization performance similar to standard AdaBoost for a computational cost that is an order of magnitude smaller. [Full Paper] [Discussion] Paper ID: 268 Robust Formulations for Handling Uncertainty in Kernel Matrices. Sahely Bhadra (Indian Institute of Science); Sourangshu Bhattacharya (Yahoo! Research Labs, Bangalore, India); Chiranjib Bhattacharyya (Indian Institute of Science); Aharon Ben-Tal (Technion - Israel Institute of Technology) We study the problem of uncertainty in the entries of the kernel matrix, arising in the SVM formulation. Using chance constraint programming and a novel large deviation inequality, we derive a formulation which is robust to such noise. The resulting formulation applies when the noise is Gaussian or has finite support. The formulation in general is non-convex, but in several cases of interest it reduces to a convex program. The problem of uncertainty in the kernel matrix is motivated by the real-world problem of classifying proteins when the structures are provided with some uncertainty. The formulation derived here naturally incorporates such uncertainty in a principled manner, leading to significant improvements over the state of the art. [Full Paper] [Discussion] Paper ID: 269 Bayesian Multi-Task Reinforcement Learning. Alessandro Lazaric (INRIA); Mohammad Ghavamzadeh (INRIA) We consider the problem of multi-task reinforcement learning where the learner is provided with a set of tasks, for which only a small number of samples can be generated for any given policy. As the number of samples may not be enough to learn an accurate evaluation of the policy, it would be necessary to identify classes of tasks with similar structure and to learn them jointly. 
We consider the case where the tasks share structure in their value functions, and model this by assuming that the value functions are all sampled from a common prior. We adopt the Gaussian process temporal-difference value function model and use a hierarchical Bayesian approach to model the distribution over the value functions. We study two cases, where all the value functions belong to the same class and where they belong to an undefined number of classes. For each case, we present a hierarchical Bayesian model, and derive inference algorithms for (i) joint learning of the value functions, and (ii) efficient transfer of the information gained in (i) to assist learning the value function of a newly observed task. [Full Paper] [Discussion] Paper ID: 275 A New Analysis of Co-Training. Wei Wang (Nanjing University); Zhi-Hua Zhou (Nanjing University) In this paper, we present a new analysis of co-training, a representative paradigm of disagreement-based semi-supervised learning methods. In our analysis the co-training process is viewed as a combinative label propagation over two views; this provides the possibility to bring the graph-based and disagreement-based semi-supervised methods into a unified framework. With the analysis we get some insight that has not been disclosed by previous theoretical studies. In particular, we provide the *sufficient and necessary* condition for co-training to succeed. We also discuss the relationship to previous theoretical results and give some other interesting implications of our results, such as the combination of weight matrices and view split. [Full Paper] [Discussion] Paper ID: 279 Clustering processes. Daniil Ryabko (INRIA) The problem of clustering is considered for the case when each data point is a sample generated by a stationary ergodic process. We propose a very natural asymptotic notion of consistency, and show that simple consistent algorithms exist under most general nonparametric assumptions. 
The notion of consistency is as follows: two samples should be put into the same cluster if and only if they were generated by the same distribution. With this notion of consistency, clustering generalizes such classical statistical problems as homogeneity testing and process classification. We show that, for the case of a known number of clusters, consistency can be achieved under the only assumption that the joint distribution of the data is stationary ergodic (no parametric or Markovian assumptions, no assumptions of independence, neither between nor within the samples). If the number of clusters is unknown, consistency can be achieved under appropriate assumptions on the mixing rates of the processes (again, no parametric or independence assumptions). In both cases we give examples of simple (at most quadratic in each argument) algorithms which are consistent. [Full Paper] [Discussion] Paper ID: 280 COFFIN: A Computational Framework for Linear SVMs. Soeren Sonnenburg (TU Berlin); Vojtech Franc (Czech Technical University) In a variety of applications, kernel machines such as Support Vector Machines (SVMs) have been used with great success, often delivering state-of-the-art results. Using the kernel trick, they work on several domains and even enable heterogeneous data fusion by concatenating feature spaces or multiple kernel learning. Unfortunately, they are not suited for truly large-scale applications since they suffer from the curse of support vectors, i.e., the speed of applying SVMs decays linearly with the number of support vectors. In this paper we develop COFFIN, a new training strategy for linear SVMs that effectively allows the use of on-demand computed kernel feature spaces and virtual examples in the primal. With linear training and prediction effort, this framework leverages SVM applications to truly large-scale problems: As an example, we train SVMs for human splice site recognition involving 50 million examples and sophisticated string kernels. 
Additionally, we learn an SVM-based gender detector on 5 million examples on low-tech hardware and achieve accuracies beyond the state of the art on both tasks. Source code, data sets and scripts are freely available from http://sonnenburgs.de/soeren/coffin . [Full Paper] [Discussion] Paper ID: 284 Multiagent Inductive Learning: an Argumentation-based Approach. Santiago Ontanon (IIIA-CSIC); Enric Plaza (IIIA) Multiagent Inductive Learning is the problem that groups of agents face when they want to perform inductive learning, but the data of interest is distributed among them. This paper focuses on concept learning, and presents A-MAIL, a framework for multiagent induction integrating ideas from inductive learning, case-based reasoning and argumentation. Argumentation is used as a communication framework with which the agents can communicate their inductive inferences to reach shared and agreed-upon concept definitions. We also identify the requirements for learning algorithms to be used in our framework, and propose an algorithm which satisfies them. [Full Paper] [Discussion] Paper ID: 285 Active Risk Estimation. Christoph Sawade (University of Potsdam); Niels Landwehr (University of Potsdam); Steffen Bickel (Nokia gate5 GmbH); Tobias Scheffer (University of Potsdam) We address the problem of evaluating the risk of a given model accurately at minimal labeling costs. This problem occurs in situations in which risk estimates cannot be obtained from held-out training data, because the training data are unavailable or do not reflect the desired test distribution. We study active risk estimation processes in which instances are actively selected by a sampling process from a pool of unlabeled test instances and their labels are queried. We derive the sampling distribution that minimizes the estimation error of the active risk estimator when used to select instances from the pool. An analysis of the distribution that governs the estimator leads to confidence intervals. 
We empirically study conditions under which the active risk estimate is more accurate than a standard risk estimate that draws equally many instances from the test distribution. [Full Paper] [Discussion] Paper ID: 286 Heterogeneous Continuous Dynamic Bayesian Networks with Flexible Structure and Inter-Time Segment Information Sharing. Frank Dondelinger (BioSS); Sophie Lebre (University of Strasbourg); Dirk Husmeier (BioSS) Classical dynamic Bayesian networks (DBNs) are based on the homogeneous Markov assumption and cannot deal with heterogeneity and non-stationarity in temporal processes. Various approaches to relax the homogeneity assumption have recently been proposed. The present paper aims to improve on the shortcomings of three recent versions of heterogeneous DBNs along the following lines: (i) avoiding the need for data discretization, (ii) increasing the flexibility over a time-invariant network structure, (iii) avoiding over-flexibility and overfitting by introducing a regularization scheme based on inter-time segment information sharing. The improved method is evaluated on synthetic data and compared with alternative published methods on gene expression time series from Drosophila melanogaster. [Full Paper] [Discussion] Paper ID: 295 Temporal Difference Bayesian Model Averaging: A Bayesian Perspective on Adapting Lambda. Carlton Downey (Victoria University of Wellington); Scott Sanner (NICTA) Temporal difference (TD) algorithms are attractive for reinforcement learning due to their ease of implementation and use of ``bootstrapped'' return estimates to make efficient use of sampled data. In particular, TD(lambda) methods comprise a family of reinforcement learning algorithms that often yield fast convergence by averaging multiple estimators of the expected return. However, TD(lambda) chooses a very specific way of averaging these estimators based on the fixed parameter lambda, which may not lead to optimal convergence rates in all settings. 
In this paper, we derive an automated Bayesian approach to setting lambda that we call temporal difference Bayesian model averaging (TD-BMA). Empirically, TD-BMA always performs as well as, and often much better than, the best fixed lambda for TD(lambda) (even when performance for different values of lambda varies across problems) without requiring that lambda or any analogous parameter be manually tuned. [Full Paper] [Discussion] Paper ID: 297 Surrogating the surrogate: accelerating Gaussian-process-based global optimization with a mixture cross-entropy algorithm - Rémi Bardenet (LRI); Balazs Kegl (LAL/LRI, Université Paris-Sud, CNRS) In global optimization, when the evaluation of the target function is costly, the usual strategy is to learn a surrogate model for the target function and replace the initial optimization by the optimization of the model. Gaussian processes have been widely used since they provide an elegant way to model the fitness and to deal with the exploration-exploitation trade-off in a principled way. Several empirical criteria have been proposed to drive the model optimization, among which is the well-known Expected Improvement criterion. The major computational bottleneck of these algorithms is the exhaustive grid search used to optimize the highly multimodal merit function. In this paper, we propose a competitive ``adaptive grid'' approach, based on a properly derived cross-entropy optimization algorithm with mixture proposals. Experiments suggest that 1) we outperform the classical single-Gaussian cross-entropy method when the fitness function is highly multimodal, and 2) we improve on standard exhaustive search in GP-based surrogate optimization.
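For reference, the Expected Improvement criterion mentioned in the abstract above has a simple closed form under a Gaussian posterior. A minimal sketch in plain Python (written for minimization; the GP posterior mean and standard deviation at a candidate point are assumed given, and the paper's mixture cross-entropy machinery for maximizing this merit function is not shown):

```python
import math

def expected_improvement(mu, sigma, best):
    """Closed-form Expected Improvement at one candidate point (minimization).

    mu, sigma : Gaussian-process posterior mean / standard deviation there
    best      : lowest function value observed so far (the incumbent)
    """
    imp = best - mu                     # improvement over the incumbent
    if sigma <= 0.0:                    # no posterior uncertainty left
        return max(imp, 0.0)
    z = imp / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return imp * cdf + sigma * pdf
```

At a point whose mean equals the incumbent, EI reduces to sigma times the standard normal density at zero, so it grows with posterior uncertainty; this is the exploration-exploitation trade-off the abstract refers to.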
[Full Paper] [Discussion] Paper ID: 298 Random Spanning Trees and the Prediction of Weighted Graphs - Nicolo Cesa-Bianchi (University of Milan); Claudio Gentile (Universita' dell'Insubria); Fabio Vitale (Universita' di Milano); Giovanni Zappella (Universita' di Milano) We show that the mistake bound for predicting the nodes of an arbitrary weighted graph is characterized (up to logarithmic factors) by the weighted cut size of a random spanning tree of the graph. The cut size is induced by the unknown adversarial labeling of the graph nodes. In deriving our characterization, we obtain a simple randomized algorithm achieving the optimal mistake bound on any weighted graph. Our algorithm draws a random spanning tree of the original graph and then predicts the nodes of this tree in constant amortized time and linear space. Experiments on real-world datasets show that our method compares well to both global (Perceptron) and local (label propagation) methods, while being much faster. [Full Paper] [Discussion] Paper ID: 303 Analysis of a Classification-based Policy Iteration Algorithm - Alessandro Lazaric (INRIA); Mohammad Ghavamzadeh (INRIA); Remi Munos (INRIA) We present a classification-based policy iteration algorithm, called Direct Policy Iteration, and provide its finite-sample analysis. Our results state a performance bound in terms of the number of policy improvement steps, the number of rollouts used in each iteration, the capacity of the considered policy space, and a new capacity measure which indicates how well the policy space can approximate policies that are greedy w.r.t. any of its members. The analysis reveals a trade-off between the estimation and approximation errors in this classification-based policy iteration setting. We also study the consistency of the method when there exists a sequence of policy spaces with increasing capacity.
[Full Paper] [Discussion] Paper ID: 310 Unsupervised Risk Stratification in Clinical Datasets: Identifying Patients at Risk of Rare Outcomes - Zeeshan Syed (University of Michigan); Ilan Rubinfeld (Henry Ford Health System) Most existing algorithms for clinical risk stratification rely on labeled training data. Collecting this data is challenging for clinical conditions where only a small percentage of patients experience adverse outcomes. We propose an unsupervised anomaly detection approach to risk-stratify patients without the need for positively and negatively labeled training examples. High-risk patients are identified without any expert knowledge using a minimum enclosing ball to find cases that lie in sparse regions of the feature space. When evaluated on data from patients admitted with acute coronary syndrome and patients undergoing inpatient surgical procedures, our approach was able to successfully identify individuals at increased risk of adverse endpoints in both populations. In some cases, unsupervised anomaly detection outperformed other machine learning methods that used additional knowledge in the form of labeled examples. [Full Paper] [Discussion] Paper ID: 311 Gaussian Covariance and Scalable Variational Inference - Matthias Seeger (Saarland University and Max Planck Institute for Informatics) We analyze computational aspects of variational approximate inference techniques for sparse linear models, which have to be understood to allow for large-scale applications. Gaussian covariances play a key role, whose approximation is computationally hard. While most previous methods gain scalability by not even representing most posterior dependencies, harmful factorization assumptions can be avoided by employing data-dependent low-rank approximations instead. We provide theoretical and empirical insights into algorithmic and statistical consequences of low-rank covariance approximation errors on decision outcomes in nonlinear sequential Bayesian experimental design.
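The low-rank covariance approximations discussed in the abstract above can be illustrated with their basic building block: the best rank-k approximation of a symmetric covariance matrix in Frobenius norm keeps its top-k eigenpairs (Eckart-Young). A generic numpy sketch, not the paper's variational algorithm:

```python
import numpy as np

def low_rank_covariance(S, k):
    """Best rank-k approximation (Frobenius norm) of a symmetric
    covariance matrix S, built from its k largest eigenpairs.
    Illustrates the kind of low-rank surrogate the abstract discusses."""
    w, V = np.linalg.eigh(S)            # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]       # indices of the top-k eigenvalues
    return (V[:, idx] * w[idx]) @ V[:, idx].T
```

With k equal to the full dimension the approximation is exact; the statistical question studied in the paper is what happens to downstream decisions when k is much smaller.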
[Full Paper] [Discussion] Paper ID: 319 Efficient Learning with Partially Observed Attributes - Nicolo Cesa-Bianchi (University of Milan); Shai Shalev-Shwartz (Hebrew University); Ohad Shamir (HUJI) We describe and analyze efficient algorithms for learning a linear predictor from examples when the learner can only view a few attributes of each training example. This is the case, for example, in medical research, where each patient participating in the experiment is only willing to go through a small number of tests. Our analysis bounds the number of additional examples sufficient to compensate for the lack of full information on each training example. We demonstrate the efficiency of our algorithms by showing that when running on digit recognition data, they obtain a high prediction accuracy even when the learner gets to see only four pixels of each image. [Full Paper] [Discussion] Paper ID: 330 Boosting for Regression Transfer - David Pardoe (University of Texas at Austin); Peter Stone (University of Texas at Austin) The goal of transfer learning is to improve the learning of a new target concept given knowledge of related source concept(s). We introduce the first boosting-based algorithms for transfer learning that apply to regression tasks. First, we describe two existing classification transfer algorithms, ExpBoost and TrAdaBoost, and show how they can be modified for regression. We then introduce extensions of these algorithms that improve performance significantly on controlled experiments in a wide range of test domains. [Full Paper] [Discussion] Paper ID: 331 Label Ranking under Ambiguous Supervision for Learning Semantic Correspondences - Antoine Bordes (LIP6, Paris); Nicolas Usunier (LIP6, Université Pierre et Marie Curie - Paris 6); Jason Weston (Google NYC) This paper studies the problem of learning from ambiguous supervision, focusing on the task of learning semantic correspondences.
A learning problem is said to be ambiguously supervised when, for a given training input, a set of output candidates is provided with no prior knowledge of which one is correct. We propose to tackle this problem by solving a related unambiguous task with a label ranking approach, and show how and why this performs well on the original task, via the method of task-transfer. We apply it to learning to match natural language sentences to a structured representation of their meaning and empirically demonstrate that this competes with the state-of-the-art on two benchmarks. [Full Paper] [Discussion] Paper ID: 333 From Transformation-Based Dimensionality Reduction to Feature Selection - Mahdokht Masaeli (Northeastern University); Glenn Fung; Jennifer Dy (Northeastern University) Many learning applications are characterized by high dimensions. Usually not all of these dimensions are relevant and some are redundant. There are two main approaches to reduce dimensionality: feature selection and feature transformation. When one wishes to keep the original meaning of the features, feature selection is desired. Feature selection and transformation are typically presented separately. In this paper, we introduce a general approach for converting transformation-based methods to feature selection methods through L_1/L_infty regularization. Instead of solving feature selection as a discrete optimization, we relax and formulate the problem as a continuous optimization problem. An additional advantage of our formulation is that our optimization criterion optimizes for feature relevance and redundancy removal automatically. Here, we illustrate how our approach can be utilized to convert linear discriminant analysis (LDA) and the dimensionality reduction version of the Hilbert-Schmidt Independence Criterion (HSIC) to two new feature selection algorithms. Experiments show that our new feature selection methods outperform related state-of-the-art feature selection approaches.
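The L_1/L_infty conversion above can be pictured as follows: after a projection matrix is learned under a row-wise regularizer, each original feature corresponds to one row, rows driven to zero mean the feature is dropped, and the l_inf norm of a row scores the feature. A hypothetical numpy sketch (the matrix `W` and the threshold are placeholders for illustration, not the paper's learned solutions):

```python
import numpy as np

def feature_scores(W):
    """Score each original feature by the l_inf norm of its row in a
    d x k projection matrix W (rows = original features, columns =
    projected dimensions). Under a row-wise L_1/L_infty penalty,
    zeroed rows correspond to discarded features."""
    return np.max(np.abs(W), axis=1)

def select_features(W, eps=1e-8):
    """Indices of features whose projection row is not numerically zero."""
    return np.flatnonzero(feature_scores(W) > eps)
```

The point of the relaxation in the abstract is that these zero rows emerge from a continuous optimization rather than from a combinatorial search over feature subsets.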
[Full Paper] [Discussion] Paper ID: 336 Least-Squares λ Policy Iteration: Bias-Variance Trade-off in Control Problems - Christophe Thiery (LORIA); Bruno Scherrer (LORIA) In the context of large-space MDPs with linear value function approximation, we introduce a new approximate version of λ-Policy Iteration (Bertsekas and Ioffe, 1996), a method that generalizes Value Iteration and Policy Iteration. Our approach, called Least-Squares λ Policy Iteration, generalizes LSPI (Lagoudakis & Parr, 2003), which makes efficient use of training samples compared to classical temporal-difference methods. The motivation of our work is to exploit the λ parameter within the least-squares context, without having to generate new samples at each iteration or to know a model of the MDP. We provide a performance bound that shows the soundness of the algorithm. We show empirically on a simple chain problem and on the Tetris game that this λ parameter acts as a bias-variance trade-off that may improve the convergence and the performance of the policy obtained. [Full Paper] [Discussion] Paper ID: 342 Multiple Non-Redundant Spectral Clustering Views - Donglin Niu (Northeastern University); Jennifer Dy (Northeastern University); Michael Jordan (UC Berkeley) Many clustering algorithms only find one clustering solution. However, data can often be grouped and interpreted in many different ways. This is particularly true in the high-dimensional setting, where different subspaces reveal different possible groupings of the data. Instead of committing to one clustering solution, here we introduce a novel method that can provide several non-redundant clustering solutions to the user. Our approach simultaneously learns non-redundant subspaces that provide multiple views and finds a clustering solution in each view. We achieve this by augmenting a spectral clustering objective function to incorporate dimensionality reduction and multiple views and to penalize for redundancy between the views.
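The spectral clustering objective that the method above augments starts, in its basic single-view form, from eigenvectors of a normalized graph Laplacian. A self-contained two-cluster sketch (numpy only; a sign split on the second eigenvector stands in for k-means, and the multi-view, non-redundancy machinery of the paper is not shown):

```python
import numpy as np

def spectral_bipartition(S):
    """Split the nodes of a similarity graph S (symmetric, nonnegative)
    into two clusters using the second-smallest eigenvector of the
    normalized Laplacian L = I - D^{-1/2} S D^{-1/2}."""
    d = S.sum(axis=1)                       # node degrees
    Dinv = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(S)) - Dinv @ S @ Dinv    # normalized Laplacian
    w, V = np.linalg.eigh(L)                # eigenvalues ascending
    fiedler = V[:, 1]                       # second-smallest eigenvector
    return (fiedler >= 0).astype(int)       # sign split = 2 clusters
```

On a graph with two weakly linked dense groups, the second eigenvector is nearly constant within each group with opposite signs, so the sign split recovers the groups.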
[Full Paper] [Discussion] Paper ID: 344 Large Scale Max-Margin Multi-Label Classification with Priors - Bharath Hariharan (Indian Institute of Technology Delhi); Lihi Zelnik-Manor (Technion); S.V.N. Vishwanathan (Purdue); Manik Varma (Microsoft Research India) We propose a max-margin formulation for the multi-label classification problem, where the goal is to tag a data point with a set of pre-specified labels. Given a set of $L$ labels, a data point can be tagged with any of the $2^L$ possible subsets. The main challenge therefore lies in optimizing over this exponentially large label space subject to label correlations. Existing solutions take either of two approaches. The first assumes, a priori, that there are no label correlations and independently trains a classifier for each label (as is done in the 1-vs-All heuristic). This reduces the problem complexity from exponential to linear, and such methods can scale to large problems. The second approach explicitly models correlations by pairwise label interactions. However, the complexity remains exponential unless one assumes that label correlations are sparse. Furthermore, the learnt correlations reflect the training set biases. We take a middle approach that assumes labels are correlated but does not incorporate pairwise label terms in the prediction function. We show that the complexity can still be reduced from exponential to linear while modeling dense pairwise label correlations. By incorporating correlation priors we can overcome training set biases and improve prediction accuracy. We provide a principled interpretation of the 1-vs-All method and show that it arises as a special case of our formulation. We also develop efficient optimization algorithms that can be orders of magnitude faster than the state-of-the-art.
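The 1-vs-All heuristic referred to above simply trains one independent binary classifier per label, ignoring label correlations. A toy sketch with per-label perceptrons (illustrative only; the paper's max-margin formulation and correlation priors are not implemented here):

```python
import numpy as np

def one_vs_all_train(X, Y, epochs=20):
    """Train one independent binary perceptron per label (1-vs-All).
    X : n x d feature matrix, Y : n x L binary label matrix.
    Returns an L x d weight matrix, one row per label."""
    n, d = X.shape
    L = Y.shape[1]
    W = np.zeros((L, d))
    for _ in range(epochs):
        for i in range(n):
            pred = (X[i] @ W.T > 0).astype(int)   # current predictions
            W += np.outer(Y[i] - pred, X[i])      # perceptron update per label
    return W

def one_vs_all_predict(W, X):
    """Predict the binary label matrix for X with the trained weights."""
    return (X @ W.T > 0).astype(int)
```

Because each label is trained independently, cost is linear in the number of labels; the abstract's point is that this throws away correlation structure unless priors are added back in.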
[Full Paper] [Discussion] Paper ID: 347 Fast Neighborhood Subgraph Pairwise Distance Kernel - Fabrizio Costa (KU Leuven); Kurt De Grave (KU Leuven) We introduce a novel graph kernel called the Neighborhood Subgraph Pairwise Distance Kernel. The kernel decomposes a graph into all pairs of neighborhood subgraphs of small radius at increasing distances. We show that using a fast graph invariant we obtain significant speedups in the Gram matrix computation. Finally, we test the novel kernel on a wide range of chemoinformatics tasks, from antiviral to anti-carcinogenic to toxicological activity prediction, and observe competitive performance when compared against several recent graph kernel methods. [Full Paper] [Discussion] Paper ID: 352 Tree-Guided Group Lasso for Multi-Task Regression with Structured Sparsity - Seyoung Kim (Carnegie Mellon University); Eric Xing (CMU) We consider the problem of learning a sparse multi-task regression, where the structure in the outputs can be represented as a tree with leaf nodes as outputs and internal nodes as clusters of the outputs at multiple granularity. Our goal is to recover the common set of relevant inputs for each output cluster. Assuming that the tree structure is available as prior knowledge, we formulate this problem as a new multi-task regularized regression called tree-guided group lasso. Our structured regularization is based on a group-lasso penalty, where groups are defined with respect to the tree structure. We describe a systematic weighting scheme for the groups in the penalty such that each output variable is penalized in a balanced manner even if the groups overlap. We present an efficient optimization method that can handle a large-scale problem. Using simulated and yeast datasets, we demonstrate that our method shows superior performance in terms of both prediction errors and recovery of true sparsity patterns compared to other methods for multi-task learning.
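For fixed groups and weights, the tree-guided regularizer above reduces to a weighted sum of group norms. A minimal sketch (the groups and weights are assumed to have been derived from the output tree; the paper's balanced weighting scheme is its contribution and is taken as given here):

```python
import numpy as np

def group_lasso_penalty(beta, groups, weights):
    """Group-lasso penalty sum_g w_g * ||beta_g||_2 over (possibly
    overlapping) groups of coefficient indices. In the tree-guided
    setting, each internal tree node contributes one group."""
    beta = np.asarray(beta, dtype=float)
    return sum(w * np.linalg.norm(beta[g]) for g, w in zip(groups, weights))
```

Because the norm of a group is non-differentiable at zero, the penalty drives whole groups of coefficients to zero together, which is what lets related outputs share a common set of relevant inputs.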
[Full Paper] [Discussion] Paper ID: 353 Label Ranking Methods based on the Plackett-Luce Model - Weiwei Cheng (University of Marburg); Krzysztof Dembczynski (Department of Mathematics and Computer Science, University of Marburg, Germany); Eyke Huellermeier (University of Marburg) This paper introduces two new methods for label ranking based on a probabilistic model of ranking data, called the Plackett-Luce (PL) model. The idea of the first method is to use the PL model to fit locally constant probability models in the context of instance-based learning. As opposed to this, the second method estimates a global model in which the PL parameters are represented as functions of the instance. Comparing our methods with previous approaches to label ranking, we find that they offer a number of advantages. Experimentally, we moreover show that they are highly competitive with state-of-the-art methods in terms of predictive accuracy, especially in the case of training data with incomplete ranking information. [Full Paper] [Discussion] Paper ID: 359 A DC Programming Approach for Sparse Eigenvalue Problem - Mamadou Thiao (LMI, INSA of Rouen, France); Tao Pham Dinh (LMI, INSA of Rouen, France); Hoai An Le Thi (LITA - UFR MIM, Metz University, Metz, France) We investigate the sparse eigenvalue problem, which arises in various fields such as machine learning and statistics. Unlike standard approaches relying on approximation of the $l_{0}$-norm, we work with an equivalent reformulation of the problem at hand as a DC program. Our starting point is the eigenvalue problem, to which a constraint for the sparsity requirement is added. The obtained problem is first formulated as a mixed integer program, and exact penalty techniques are used to equivalently transform the resulting problem into a DC program, whose solution is obtained by a customized DCA.
Computational results for sparse principal component analysis are reported, which show the usefulness of our approach; it compares favorably with some related standard methods using approximation of the $l_{0}$-norm. [Full Paper] [Discussion] Paper ID: 366 Submodular Dictionary Selection for Sparse Representation - Andreas Krause (Caltech); Volkan Cevher (EPFL) We develop an efficient learning framework to construct signal dictionaries for sparse representation by selecting the dictionary columns from multiple candidate bases. By sparse, we mean that only a few dictionary elements, compared to the ambient signal dimension, can exactly represent or well-approximate the signals of interest. We formulate both the selection of the dictionary columns and the sparse representation of signals as a joint combinatorial optimization problem. The proposed combinatorial objective maximizes variance reduction over the set of training signals by constraining the size of the dictionary as well as the number of dictionary columns that can be used to represent each signal. We show that if the available dictionary column vectors are incoherent, our objective function satisfies approximate submodularity. We exploit this property to develop SDS-OMP and SDS-MA, two greedy algorithms with approximation guarantees. We also describe how our learning framework enables dictionary selection for structured sparse representations, e.g., where the sparse coefficients occur in restricted patterns. We evaluate our approach on synthetic signals and natural images for representation and inpainting problems. [Full Paper] [Discussion] Paper ID: 370 Deep networks for robust visual recognition - Yichuan Tang (University of Waterloo); Chris Eliasmith (University of Waterloo) Deep Belief Networks (DBNs) are hierarchical generative models which have been used successfully to model high-dimensional visual data. However, they are not robust to common variations such as occlusion and random noise.
We explore two strategies for improving the robustness of DBNs. First, we show that a DBN with sparse connections in the first layer is more robust to variations that are not in the training set. Second, we develop a probabilistic denoising algorithm to determine a subset of the hidden layer nodes to unclamp. We show that this can be applied to any feedforward network classifier with localized first-layer connections. Recognition results after denoising are significantly better than those of standard DBN implementations for various sources of noise. [Full Paper] [Discussion] Paper ID: 371 A Stick-Breaking Construction of the Beta Process - John Paisley (Duke University); Aimee Zaas (Duke Medical Center); Christopher Woods (Duke Medical Center); Geoffrey Ginsburg (Duke Medical Center); Lawrence Carin (Duke University) We present and derive a new stick-breaking construction of the beta process. The construction is closely related to a special case of the stick-breaking construction of the Dirichlet process (Sethuraman, 1994) applied to the beta distribution. We derive an inference procedure that relies on Monte Carlo integration to reduce the number of parameters to be inferred, and present results on synthetic data, the MNIST handwritten digits data set and a time-evolving gene expression data set. [Full Paper] [Discussion] Paper ID: 374 Local Minima Embedding - Minyoung Kim (CMU); Fernando De la Torre (CMU) Dimensionality reduction is a commonly used step in many algorithms for visualization, classification, clustering and modeling. Most dimensionality reduction algorithms find a low-dimensional embedding that preserves the structure of high-dimensional data points. This paper proposes Local Minima Embedding (LME), a technique to find a low-dimensional embedding that preserves the local minima structure of a given objective function. LME provides an embedding that is useful for visualizing and understanding the relation between the original variables that create local minima.
Additionally, the embedding can potentially be used to sample the original function to discover new local minima. The function in the embedded space takes an analytic form, and hence the gradients can be computed analytically. We illustrate the benefits of LME on both synthetic data and real problems in the context of image alignment. To the best of our knowledge, this is the first paper that addresses the problem of finding an embedding that preserves the local minima properties of an objective function. [Full Paper] [Discussion] Paper ID: 376 Risk minimization, probability elicitation, and cost-sensitive SVMs - Hamed Masnadi-Shirazi (UCSD); Nuno Vasconcelos (University of California at San Diego) A new procedure for learning cost-sensitive SVM classifiers is proposed. The SVM hinge loss is extended to the cost-sensitive setting, and the cost-sensitive SVM is derived as the minimizer of the associated risk. The extension of the hinge loss draws on recent connections between risk minimization and probability elicitation. These connections are generalized to cost-sensitive classification, in a manner that guarantees consistency with the cost-sensitive Bayes risk and associated Bayes decision rule. This ensures that optimal decision rules, under the new hinge loss, implement the Bayes-optimal cost-sensitive classification boundary. Minimization of the new hinge loss is shown to be a generalization of the classic SVM optimization problem, and can be solved by identical procedures. The resulting algorithm avoids the shortcomings of previous approaches to cost-sensitive SVM design, and has superior experimental performance. [Full Paper] [Discussion] Paper ID: 378 Continuous-Time Belief Propagation - Tal El-Hay (Hebrew University); Ido Cohn; Nir Friedman (Hebrew University); Raz Kupferman (Hebrew University) Many temporal processes can be naturally modeled as a stochastic system that evolves continuously over time.
The representation language of continuous-time Bayesian networks allows one to succinctly describe multi-component continuous-time stochastic processes. A crucial element in applications of such models is inference. Here we introduce a variational approximation scheme, which is a natural extension of Belief Propagation to continuous-time processes. In this scheme, we view messages as inhomogeneous Markov processes over individual components. This leads to a relatively simple procedure that makes it easy to incorporate adaptive ordinary differential equation (ODE) solvers to perform individual steps. We provide the theoretical foundations for the approximation, and show how it performs on a range of networks. Our results demonstrate that our method is quite accurate on singly connected networks, and provides close approximations in more complex ones. [Full Paper] [Discussion] Paper ID: 384 A Language-based Approach to Measuring Scholarly Impact - Sean Gerrish (Princeton University); David Blei (Princeton University) Identifying the most influential documents in a corpus is an important problem in many fields, from information science and historiography to text summarization and news aggregation. Unfortunately, traditional bibliometrics such as citations are often not available. We propose using changes in the thematic content of documents over time to measure the importance of individual documents within the collection. We describe a dynamic topic model for both quantifying and qualifying the impact of these documents. We validate the model by analyzing three large corpora of scientific articles. Our measurement of a document's impact correlates significantly with its number of citations. [Full Paper] [Discussion] Paper ID: 387 Power Iteration Clustering - Frank Lin (Carnegie Mellon University); William Cohen (CMU) We present a simple and scalable graph clustering method called power iteration clustering (PIC).
PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pairwise similarity matrix of the data. This embedding turns out to be an effective cluster indicator, consistently outperforming widely used spectral methods such as NCut on real datasets. PIC is very fast on large datasets, running over 1,000 times faster than an NCut implementation based on the state-of-the-art IRAM eigenvector computation technique. [Full Paper] [Discussion] Paper ID: 397 The IBP Compound Dirichlet Process and its Application to Focused Topic Modeling - Sinead Williamson (University of Cambridge); Chong Wang (Princeton); Katherine Heller (Gatsby Computational Neuroscience Unit); David Blei (Princeton University) The hierarchical Dirichlet process (HDP) is a Bayesian nonparametric mixed membership model: each data point is modeled with a collection of components of different proportions. Though powerful, the HDP makes the assumption that the probability of a component being exhibited by a data point is positively correlated with its proportion within that data point. This might be an undesirable assumption. For example, in topic modeling, a topic (component) might be rare throughout the corpus but dominant within those documents (data points) where it occurs. We develop the IBP compound Dirichlet process (ICD), a Bayesian nonparametric prior that decouples across-data prevalence and within-data proportion in a mixed membership model. The ICD combines properties from the HDP and the Indian buffet process (IBP), a Bayesian nonparametric prior on binary matrices. The ICD assigns a subset of the shared mixture components to each data point. This subset, the data point's ``focus'', is determined independently from the amount that each of its components contributes. We develop an ICD mixture model for text, the focused topic model (FTM), and show superior performance over the HDP-based topic model.
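The Indian buffet process that the ICD builds on can be sampled with its classic culinary construction: customer i takes each previously sampled dish k with probability m_k / i (where m_k is how many earlier customers took it) and then tries a Poisson(alpha / i) number of new dishes. A numpy sketch of this prior over binary matrices (the ICD's coupling of the IBP with the HDP is not shown):

```python
import numpy as np

def sample_ibp(n, alpha, rng=None):
    """Draw an n x K binary matrix from the Indian buffet process.
    Row i is customer i's dish selections; K is random."""
    rng = np.random.default_rng(rng)
    dishes = []                                 # per-dish counts m_k
    rows = []
    for i in range(1, n + 1):
        row = [rng.random() < m / i for m in dishes]   # revisit old dishes
        for k, taken in enumerate(row):
            if taken:
                dishes[k] += 1
        new = rng.poisson(alpha / i)            # try some new dishes
        dishes.extend([1] * new)
        rows.append(row + [True] * new)
    K = len(dishes)
    Z = np.zeros((n, K), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z
```

Each row of the resulting matrix plays the role of a data point's ``focus'' in the ICD: the finite subset of shared components that the data point is allowed to exhibit.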
[Full Paper] [Discussion] Paper ID: 406 Budgeted Distribution Learning of Belief Net Parameters - Barnabas Poczos (University of Alberta); Liuyang Li; Csaba Szepesvari (University of Alberta); Russell Greiner (University of Alberta) Most learning algorithms assume that a data set is given initially. We address the common situation where data is not available initially, but can be obtained, at a cost. We focus on learning Bayesian belief networks (BNs) over discrete variables. As such BNs are models of probabilistic distributions, we consider the generative challenge of learning the parameters, for a fixed structure, that best match the true distribution. We focus on the budgeted learning setting, where there is a known fixed cost c_i for acquiring the value of the i-th feature for any specified instance, and a known total budget to spend acquiring all information. After formally defining this problem from a Bayesian perspective, we first consider allocation algorithms that must decide, before seeing any results, which features of which instances to probe. We show this is NP-hard, even if all variables are independent, then prove that the greedy allocation algorithm IGA is optimal when the costs are uniform and the features are independent, but can otherwise be suboptimal. We then show that general (non-allocation) policies perform better, and explore the challenges of learning the parameters for general belief networks in this setting, describing conditions for when the obvious round-robin algorithm will, versus will not, work optimally. We also explore the effectiveness of this and various other heuristic algorithms.
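The round-robin allocation mentioned above can be made concrete as a toy baseline: fix an order over (instance, feature) probes and keep probing until the budget is spent. A pure-Python sketch (`costs` and `budget` stand for the known c_i and total budget from the abstract; the cycling order here is one simple choice, not the paper's analysis):

```python
def round_robin_probes(n_instances, costs, budget):
    """Round-robin allocation for budgeted learning: probe each feature
    across all instances in turn until the budget is exhausted.
    costs[j] is the known cost of acquiring feature j for one instance.
    Returns the list of (instance, feature) probes purchased."""
    probes, spent = [], 0.0
    for j in range(len(costs)):            # feature by feature...
        for i in range(n_instances):       # ...across all instances
            if spent + costs[j] > budget:  # next probe would overspend
                return probes
            probes.append((i, j))
            spent += costs[j]
    return probes
```

As an allocation algorithm, this commits to its probes before seeing any results, which is exactly the restriction the paper shows general (non-allocation) policies can beat.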
[Full Paper] [Discussion] Paper ID: 410 Efficient Selection of Multiple Bandit Arms: Theory and Practice - Shivaram Kalyanakrishnan (University of Texas at Austin); Peter Stone (University of Texas at Austin) We consider the general, widely applicable problem of selecting from n real-valued random variables a subset of size m of those with the highest means, based on as few samples as possible. This problem, which we denote Explore-m, is a core aspect of several stochastic optimization algorithms, and has applications in simulation and industrial engineering. The theoretical basis for our work is an extension of a previous formulation using multi-armed bandits that is devoted to identifying just the one best of n random variables (Explore-1). In addition to providing PAC bounds for the general case, we tailor our theoretically grounded approach to work efficiently in practice. Empirical comparisons of the resulting sampling algorithm against state-of-the-art subset selection strategies demonstrate significant gains in sample efficiency. [Full Paper] [Discussion] Paper ID: 412 Gaussian Processes Multiple Instance Learning - Minyoung Kim (CMU); Fernando De la Torre (CMU) This paper proposes a multiple instance learning (MIL) algorithm for Gaussian processes (GP). The GPMIL model inherits two crucial benefits from GP: (i) a principled manner of learning kernel parameters, and (ii) a probabilistic interpretation (e.g., variance in prediction) that is informative for a better understanding of the MIL prediction problem. The bag labeling protocol of the MIL problem, namely the existence of a positive instance in a bag, can be effectively represented by a sigmoid likelihood model through the max function over GP latent variables.
To circumvent the intractability of exact GP inference and learning incurred by the non-continuous max function, we suggest two approximations: first, the softmax approximation; second, the use of witness indicator variables optimized with a deterministic annealing schedule. The effectiveness of GPMIL against other state-of-the-art MIL approaches is demonstrated on several benchmark MIL datasets. [Full Paper] [Discussion] Paper ID: 416 Proximal Methods for Sparse Hierarchical Dictionary Learning - Rodolphe Jenatton (INRIA); Julien Mairal (INRIA); Guillaume Obozinski (INRIA); Francis Bach (INRIA) We propose to combine two approaches for modeling data admitting sparse representations: on the one hand, dictionary learning has proven effective for various signal processing tasks. On the other hand, recent work on structured sparsity provides a natural framework for modeling dependencies between dictionary elements. We thus consider a tree-structured sparse regularization to learn dictionaries embedded in a hierarchy. The involved proximal operator is computable exactly via a primal-dual method, allowing the use of accelerated gradient techniques. Experiments show that for natural image patches, learned dictionary elements organize themselves in such a hierarchical structure, leading to improved performance for restoration tasks. When applied to text documents, our method learns hierarchies of topics, thus providing a competitive alternative to probabilistic topic models. [Full Paper] [Discussion] Paper ID: 420 Conditional Topic Random Fields - Jun Zhu; Eric Xing (CMU) Generative topic models such as LDA are limited by their inability to utilize nontrivial input features to enhance their performance, and many topic models assume that topic assignments of different words are conditionally independent. Some work exists to address the second limitation, but no work exists to address both.
This paper presents a conditional topic random field (CTRF) model, which can use arbitrary non-local features about words and documents and incorporate the Markov dependency between topic assignments of neighboring words. We develop an efficient variational inference algorithm that scales linearly in the number of topics, and a maximum likelihood estimation (MLE) procedure for parameter estimation. For the supervised version of CTRF, we also develop an arguably more discriminative max-margin learning method. We evaluate CTRF on real review rating data and demonstrate the advantages of CTRF over generative competitors, and we show the advantages of max-margin learning over MLE. [Full Paper] [Discussion] Paper ID: 421 On the Consistency of Ranking AlgorithmsJohn Duchi (UC Berkeley); Lester Mackey (UC Berkeley); Michael Jordan (University of California at Berkeley) We present a theoretical analysis of supervised ranking, providing necessary and sufficient conditions for the asymptotic consistency of algorithms based on minimizing a surrogate loss function. We show that many commonly used surrogate losses are inconsistent; surprisingly, we show inconsistency even in low-noise settings. We present a new value-regularized linear loss, establish its consistency under reasonable assumptions on noise, and show that it outperforms conventional ranking losses in a collaborative filtering experiment. [Full Paper] [Discussion] Paper ID: 422 Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental DesignNiranjan Srinivas (California Institute of Technology); Andreas Krause (Caltech); Sham Kakade (Wharton); Matthias Seeger (Saarland University and Max Planck Institute for Informatics) Many applications require optimizing an unknown, noisy function that is expensive to evaluate. We formalize this task as a multi-armed bandit problem, where the payoff function is either sampled from a Gaussian process (GP) or has low RKHS norm.
We resolve the important open problem of deriving regret bounds for this setting, which imply novel convergence rates for GP optimization. We analyze GP-UCB, an intuitive upper-confidence-based algorithm, and bound its cumulative regret in terms of maximal information gain, establishing a novel connection between GP optimization and experimental design. Moreover, by bounding the latter in terms of operator spectra, we obtain explicit sublinear regret bounds for many commonly used covariance functions. In some important cases, our bounds have surprisingly weak dependence on the dimensionality. In our experiments on real sensor data, GP-UCB compares favorably with other heuristic GP optimization approaches. [Full Paper] [Discussion] Paper ID: 429 Implicit Online LearningBrian Kulis (University of California at Berkeley); Peter Bartlett (UC Berkeley) Online learning algorithms have recently risen to prominence due to their strong theoretical guarantees and an increasing number of practical applications for large-scale data analysis problems. In this paper, we analyze a class of online learning algorithms based on fixed potentials and non-linearized losses, which yields algorithms with implicit update rules. We show how to efficiently compute these updates, and we prove regret bounds for the algorithms. We apply our formulation to several special cases where our approach has benefits over existing online learning methods. In particular, we provide improved algorithms and bounds for the online metric learning problem, and show improved robustness for online linear prediction problems. Results over a variety of data sets demonstrate the advantages of our framework. [Full Paper] [Discussion] Paper ID: 432 Rectified Linear Units Improve Restricted Boltzmann MachinesVinod Nair (University of Toronto); Geoffrey Hinton (University of Toronto) Restricted Boltzmann machines were developed using binary stochastic hidden units.
These can be generalized by replacing each binary unit by an infinite number of copies that all have the same weights but have progressively more negative biases. The learning and inference rules for these "Stepped Sigmoid Units" are unchanged. They can be approximated efficiently by noisy, rectified linear units. Compared with binary units, these units learn features that are better for object recognition on the NORB dataset and face verification on the Labeled Faces in the Wild dataset. Unlike binary units, rectified linear units preserve information about relative intensities as information travels through multiple layers of feature detectors. [Full Paper] [Discussion] Paper ID: 433 Budgeted Nonparametric Learning from Data StreamsRyan Gomes (California Institute of Technology); Andreas Krause (Caltech) We consider the problem of extracting informative exemplars from a data stream. Examples of this problem include exemplar-based clustering and nonparametric inference such as Gaussian process regression on massive data sets. We show that these problems require maximization of a submodular function that captures the informativeness of a set of exemplars, over a data stream. We develop an efficient algorithm, Stream Greedy, which is guaranteed to obtain a constant fraction of the value achieved by the optimal solution to this NP-hard optimization problem. We extensively evaluate our algorithm on large real-world data sets. [Full Paper] [Discussion] Paper ID: 436 Interactive Submodular Set CoverAndrew Guillory (University of Washington); Jeff Bilmes (University of Washington, Seattle) We introduce a natural generalization of submodular set cover and exact active learning with a finite hypothesis class (query learning). We call this new problem interactive submodular set cover. Applications include advertising in social networks with hidden information.
We give an approximation guarantee for a novel greedy algorithm and give a hardness of approximation result which matches up to constant factors. We also discuss negative results for simpler approaches and present encouraging early experimental results. [Full Paper] [Discussion] Paper ID: 438 A fast natural Newton methodNicolas Le Roux (Microsoft Cambridge); Andrew Fitzgibbon (Microsoft Research) Nowadays, for many tasks such as object recognition or language modeling, data is plentiful. As such, an important challenge has become to find learning algorithms which can make use of all the available data. In this setting, called ``large-scale learning'' by Bottou and Bousquet (2008), learning and optimization become different, and powerful optimization algorithms are suboptimal learning algorithms. While most efforts are focused on adapting optimization algorithms for learning by efficiently using the information contained in the Hessian, Le Roux et al. (2008) exploited the special structure of the learning problem to achieve faster convergence. In this paper, we investigate a natural way of combining these two directions to yield fast and robust learning algorithms. [Full Paper] [Discussion] Paper ID: 441 Learning Deep Boltzmann Machines using Adaptive MCMCRuslan Salakhutdinov (MIT) When modeling high-dimensional, richly structured data, it is often the case that the distribution defined by the Deep Boltzmann Machine (DBM) has a rough energy landscape with many local minima that are separated by high energy barriers. The commonly used Gibbs sampler tends to get trapped in one local mode, which often results in unstable learning dynamics and leads to poor parameter estimates. In this paper, we concentrate on learning DBMs using adaptive MCMC algorithms. We first show a close connection between Fast PCD and adaptive MCMC. We then develop a Coupled Adaptive Simulated Tempering algorithm that can be used to better explore a highly multimodal energy landscape.
Finally, we demonstrate that the proposed algorithm considerably improves parameter estimates, particularly when learning large-scale DBMs. [Full Paper] [Discussion] Paper ID: 442 Internal Rewards Mitigate Agent BoundednessJonathan Sorg (University of Michigan); Satinder Singh (University of Michigan); Richard Lewis (University of Michigan) Reinforcement learning (RL) research typically develops algorithms for helping an RL agent best achieve its goals, however they came to be defined, while ignoring the relationship of those goals to the goals of the agent designer. We extend agent design to include the meta-optimization problem of selecting internal agent goals (rewards) which optimize the designer's goals. Our claim is that well-designed internal rewards can help improve the performance of RL agents which are computationally bounded in some way (as practical agents are). We present a formal framework for understanding both bounded agents and the meta-optimization problem, and we empirically demonstrate several instances of common agent bounds being mitigated by general internal reward functions. [Full Paper] [Discussion] Paper ID: 446 Learning optimally diverse rankings over large document collectionsAleksandrs Slivkins (Microsoft Research); Filip Radlinski (Microsoft Research); Sreenivas Gollapudi (Microsoft Research) Most learning-to-rank research has assumed that the utility of different documents is independent, which results in learned ranking functions that return redundant results. The few approaches that avoid this have rather unsatisfyingly lacked theoretical foundations, or do not scale. We present a learning-to-rank formulation that optimizes the fraction of satisfied users, with a scalable algorithm that explicitly takes document similarity and ranking context into account. We present theoretical justifications for this approach, as well as a near-optimal algorithm.
Our evaluation adds optimizations that improve empirical performance, and shows that our algorithms learn orders of magnitude more quickly than previous approaches. [Full Paper] [Discussion] Paper ID: 449 Learning Fast Approximations of Sparse CodingKarol Gregor (New York University); Yann LeCun (New York University) In Sparse Coding (SC), input vectors are reconstructed using a sparse linear combination of basis vectors. SC has become a popular method for extracting features from data. For a given input, SC minimizes a quadratic reconstruction error with an L1 penalty term on the code. The process is often too slow for applications such as real-time pattern recognition. We propose two versions of a very fast algorithm that produces approximate estimates of the sparse code that can be used to compute good visual features, or to initialize exact iterative algorithms. The main idea is to train a non-linear, feed-forward predictor with a specific architecture and a fixed depth to produce the best possible approximation of the sparse code. A version of the method, which can be seen as a trainable version of Li and Osher's coordinate descent method, is shown to produce approximate solutions with 10 times less computation than Li and Osher's for the same approximation error. Unlike previous proposals for sparse code predictors, the system allows a kind of approximate explaining away to take place during inference. The resulting predictor is differentiable and can be included in globally trained recognition systems. [Full Paper] [Discussion] Paper ID: 451 Boosted Backpropagation Learning for Training Deep Modular NetworksAlexander Grubb (Carnegie Mellon University); Drew Bagnell (CMU) Divide-and-conquer is key to building sophisticated learning machines: hard problems are solved by composing a network of modules that solve simpler problems (LeCun et al., 1998; Rohde, 2002; Bradley, 2009).
Many such existing systems rely on learning algorithms which are based on simple parametric gradient descent, where the parametrization must be predetermined, or on more specialized per-application algorithms which are usually ad hoc and complicated. We present a novel approach for training generic modular networks that uses two existing techniques: the error propagation strategy of backpropagation and more recent research on descent in spaces of functions (Mason et al., 1999; Scholkopf & Smola, 2001). Combining these two methods of optimization gives a simple algorithm for training heterogeneous networks of functional modules using simple gradient propagation mechanics and established learning algorithms. The resulting separation of concerns between learning individual modules and error propagation mechanics eases implementation, enables a larger class of modular learning strategies, and allows per-module control of complexity/regularization. We derive and demonstrate this functional backpropagation and contrast it with traditional gradient descent in parameter space, observing that in our example domain the method is significantly more robust to local optima. [Full Paper] [Discussion] Paper ID: 453 Convergence, Targeted Optimality, and Safety in Multiagent LearningDoran Chakraborty (University of Texas at Austin); Peter Stone (University of Texas at Austin) This paper introduces a novel multiagent learning algorithm, Convergence with Model Learning and Safety (CMLeS for short), which achieves convergence, targeted optimality against memory-bounded adversaries, and safety, in arbitrary repeated games. The most novel aspect of CMLeS is the manner in which it guarantees (in a PAC sense) targeted optimality against memory-bounded adversaries, via efficient exploration and exploitation. CMLeS is fully implemented and we present empirical results demonstrating its effectiveness.
[Full Paper] [Discussion] Paper ID: 454 Improved Local Coordinate Coding using Local TangentsKai Yu (NEC Research Princeton); Tong Zhang (Rutgers University) Local Coordinate Coding (LCC), introduced in (Yu et al., 2009), is a high-dimensional non-linear learning method that explicitly takes advantage of the geometric structure of the data. Its successful use in the winning system of last year's Pascal image classification challenge (Everingham, 2009) shows that the ability to integrate geometric information is critical for some real-world machine learning applications. This paper further develops the idea of integrating geometry in machine learning by extending the original LCC method to include local tangent directions. These new correction terms lead to better approximation of high-dimensional non-linear functions when the underlying data manifold is locally flat. The method significantly reduces the number of anchor points needed in LCC, which not only reduces computational cost, but also improves prediction performance. Experiments are included to demonstrate that this method is more effective than the original LCC method on some image classification tasks. [Full Paper] [Discussion] Paper ID: 458 Deep learning via Hessian-free optimizationJames Martens (University of Toronto) We develop a 2nd-order optimization method based on the ``Hessian-free'' approach, and apply it to training deep autoencoders. Without using pre-training, we obtain results superior to those reported by Hinton & Salakhutdinov (2006) on the same tasks they considered. Our method is practical, easy to use, scales nicely to very large datasets, and isn't limited in applicability to autoencoders or any specific model class. We also discuss the issue of ``pathological curvature'' as a possible explanation for the difficulty of deep learning, and how 2nd-order optimization, and our method in particular, effectively deals with it.
[Full Paper] [Discussion] Paper ID: 464 Efficient Reinforcement Learning with Multiple Reward Functions for Randomized Controlled Trial AnalysisDaniel Lizotte (University of Michigan); Michael Bowling (University of Alberta); Susan Murphy (University of Michigan) We introduce new, efficient algorithms for value iteration with multiple reward functions and continuous state. We also give an algorithm for finding the set of all non-dominated actions in the continuous state setting. This novel extension is appropriate for environments with continuous or finely discretized states where generalization is required, as is the case for data analysis of randomized controlled trials. [Full Paper] [Discussion] Paper ID: 468 Cognitive Models of Test-Item Effects in Human Category LearningXiaojin Zhu (University of Wisconsin, Madison); Bryan Gibson; Kwang-Sung Jun; Tim Rogers; Joseph Harrison; Chuck Kalish (University of Wisconsin-Madison) Imagine two identical people receive exactly the same training on how to classify certain objects. Perhaps surprisingly, we show that one can then manipulate them into classifying some test items in opposite ways, simply depending on what other test items they are asked to classify (without label feedback). We call this the Test-Item Effect, which can be induced by the order or the distribution of test items. We formulate the Test-Item Effect as online semi-supervised learning, and extend three standard human category learning models to explain it. [Full Paper] [Discussion] Paper ID: 473 Online Learning for Group LassoHaiqin Yang (The Chinese University of Hong Kong); Zenglin Xu; Irwin King; Michael Lyu We develop a novel online learning algorithm for the group lasso in order to efficiently find the important explanatory factors in a grouped manner.
Different from traditional batch-mode group lasso algorithms, which suffer from inefficiency and poor scalability, our proposed algorithm performs in an online mode and scales well: at each iteration one can update the weight vector according to a closed-form solution based on the average of previous subgradients. Therefore, the proposed online algorithm can be very efficient and scalable. This is guaranteed by its low worst-case time complexity and memory cost, both on the order of ${\cal O}(d)$, where $d$ is the number of dimensions. Moreover, in order to achieve more sparsity at both the group level and the individual feature level, we successively extend our online system to efficiently solve a number of variants of sparse group lasso models. We also show that the online system is applicable to other group lasso models, such as the group lasso with overlap and graph lasso. Finally, we demonstrate the merits of our algorithm by experimenting with both synthetic and real-world datasets. [Full Paper] [Discussion] Paper ID: 475 Generalizing Apprenticeship Learning across Hypothesis ClassesThomas Walsh (Rutgers University); Kaushik Subramanian (Rutgers University); Michael Littman (Rutgers University); Carlos Diuk (Princeton University) This paper develops a generalized apprenticeship learning protocol for reinforcement-learning agents with access to a teacher who provides policy traces (transition and reward observations). We characterize sufficient conditions of the underlying models for efficient apprenticeship learning and link these criteria to two established learnability classes (KWIK and Mistake Bound). We then construct efficient apprenticeship-learning algorithms in a number of domains, including two types of relational MDPs. We instantiate our approach in a software agent and a robot agent that learn effectively from a human teacher.
[Full Paper] [Discussion] Paper ID: 481 Projection Penalties: Dimension Reduction without LossYi Zhang (Carnegie Mellon University); Jeff Schneider (Carnegie Mellon University) Dimension reduction is popular for learning predictive models in high-dimensional spaces. It can highlight the relevant part of the feature space and avoid the curse of dimensionality. However, it can also be harmful because any reduction loses information. In this paper, we propose the projection penalty framework to make use of dimension reduction without losing valuable information. Reducing the feature space before learning predictive models can be viewed as restricting the model search to some parameter subspace. The idea of projection penalties is that instead of restricting the search to a parameter subspace, we can search in the full space but penalize the projection distance to this subspace. Dimension reduction is used to guide the search, rather than to restrict it. We propose projection penalties for linear dimension reduction, and then generalize to kernel-based reduction and other non-linear methods. We test projection penalties with various dimension reduction techniques in different prediction tasks, including principal component regression and partial least squares in regression tasks, kernel dimension reduction in face recognition, and latent topic modeling in text classification. Experimental results show that projection penalties are a more effective and reliable way to make use of dimension reduction techniques. [Full Paper] [Discussion] Paper ID: 493 Application of Machine Learning To Epileptic Seizure DetectionAli Shoeb (Massachusetts Institute of Technology); John Guttag (Massachusetts Institute of Technology) We present and evaluate a machine learning approach to constructing patient-specific classifiers that detect the onset of an epileptic seizure through analysis of the scalp EEG, a non-invasive measure of the brain’s electrical activity.
This problem is challenging because the brain’s electrical activity is composed of numerous classes with overlapping characteristics. The key steps involved in realizing a high-performance algorithm included shaping the problem into an appropriate machine learning framework, and identifying the features critical to separating seizure activity from other types of brain activity. When trained on 2 or more seizures per patient and tested on 916 hours of continuous EEG from 24 patients, our algorithm detected 96% of 173 test seizures with a median detection delay of 3 seconds and a median false detection rate of 2 false detections per 24-hour period. We also provide information about how to download the CHB-MIT database, which contains the data used in this study. [Full Paper] [Discussion] Paper ID: 495 Hilbert Space Embeddings of Hidden Markov ModelsLe Song (CMU); Byron Boots (Carnegie Mellon University); Sajid Siddiqi (Google); Geoffrey Gordon (Carnegie Mellon University); Alex Smola (Yahoo! Research) Hidden Markov Models (HMMs) are important tools for modeling sequence data. However, they are restricted to discrete latent states, and are largely restricted to Gaussian and discrete observations. And, learning algorithms for HMMs have predominantly relied on local search heuristics, with the exception of spectral methods such as those described below. We propose a nonparametric HMM that extends traditional HMMs to structured and non-Gaussian continuous distributions. Furthermore, we derive a local-minimum-free kernel spectral algorithm for learning these HMMs. We apply our method to robot vision data, slot car inertial sensor data and audio event classification data, and show that in these applications, embedded HMMs exceed the previous state-of-the-art performance.
[Full Paper] [Discussion] Paper ID: 502 Learning Markov Logic Networks Using Structural MotifsStanley Kok (University of Washington); Pedro Domingos (University of Washington) Markov logic networks (MLNs) use first-order formulas to define features of Markov networks. Current MLN structure learners can only learn short clauses (4-5 literals) due to extreme computational costs, and thus are unable to represent complex regularities in data. To address this problem, we present LSM, the first MLN structure learner capable of efficiently and accurately learning long clauses. LSM is based on the observation that relational data typically contains patterns that are variations of the same structural motifs. By constraining the search for clauses to occur within motifs, LSM can greatly speed up the search and thereby reduce the cost of finding long clauses. LSM uses random walks to identify densely connected objects in data, and groups them and their associated relations into a motif. Our experiments on three real-world datasets show that our approach is 2-5 orders of magnitude faster than the state-of-the-art ones, while achieving the same or better predictive performance. [Full Paper] [Discussion] Paper ID: 504 Metric Learning to RankBrian McFee (UC San Diego); Gert Lanckriet (UCSD) We study metric learning as a problem of information retrieval. We present a general metric learning algorithm, based on the structural SVM framework, to learn a metric such that rankings of data induced by distance from a query can be optimized against various ranking measures, such as AUC, Precision-at-k, MRR, MAP or NDCG. We demonstrate experimental results on standard classification data sets, and a large-scale online dating recommendation problem.
[Full Paper] [Discussion] Paper ID: 505 Transfer Learning for Collective Link Prediction in Multiple Heterogeneous DomainsBin Cao (HKUST); Nathan Liu (HKUST); Qiang Yang Link prediction is a key technique in many applications such as recommender systems, where potential links between users and items need to be predicted. A challenge in link prediction is the data sparsity problem. In this paper, we address this problem by jointly considering multiple heterogeneous link prediction tasks such as predicting links between users and different types of items including books, movies and songs, which we refer to as the collective link prediction (CLP) problem. We propose a nonparametric Bayesian framework for solving the CLP problem, which allows knowledge to be adaptively transferred across heterogeneous tasks while taking into account the similarities between tasks. We learn the inter-task similarity automatically. We also introduce link functions for different tasks to correct their biases and the skewness of the distributions of their link data. We conduct experiments on several real-world datasets and demonstrate significant improvements over several existing state-of-the-art methods. [Full Paper] [Discussion] Paper ID: 518 Implicit Regularization in Variational Bayesian Matrix FactorizationShinichi Nakajima (Nikon Corporation); Masashi Sugiyama (Tokyo Institute of Technology) Matrix factorization into the product of low-rank matrices induces non-identifiability, i.e., the mapping between the target matrix and the factorized matrices is not one-to-one. In this paper, we theoretically investigate the influence of non-identifiability on Bayesian matrix factorization. More specifically, we show that a variational Bayesian method involves a regularization effect even when the prior is non-informative, which is intrinsically different from the maximum a posteriori approach. We also extend our analysis to empirical Bayes scenarios where hyperparameters are also learned from data.
[Full Paper] [Discussion] Paper ID: 520 On learning with kernels for unordered pairsMartial Hue (Mines ParisTech); Jean-Philippe Vert (Mines ParisTech) We propose and analyze two strategies to learn over unordered pairs with kernels, and provide a common theoretical framework to compare them. The strategies are related to methods that were recently investigated to predict edges in biological networks. We show that both strategies differ in their loss function and in the kernels they use. We deduce in particular a smooth interpolation between the two approaches, as well as new ways to learn over unordered pairs. The different approaches are tested on the inference of missing edges in two biological networks. [Full Paper] [Discussion] Paper ID: 521 Robust Subspace Segmentation by Low-Rank RepresentationGuangcan Liu (SJTU); Zhouchen Lin (Visual Computing Group, Microsoft Research Asia); Yong Yu We propose low-rank representation (LRR) to segment data drawn from a union of multiple linear (or affine) subspaces. Given a set of data vectors, LRR seeks the lowest-rank representation among all the candidates that represent all vectors as linear combinations of the bases in a dictionary. Unlike the well-known sparse representation (SR), which computes the sparsest representation of each data vector individually, LRR aims at finding the lowest-rank representation of a collection of vectors jointly. LRR better captures the global structure of data, giving a more effective tool for robust subspace segmentation from corrupted data. Both theoretical and experimental results show that LRR is a promising tool for subspace segmentation.
[Full Paper] [Discussion] Paper ID: 522 Structured Output Learning with Indirect SupervisionMing-Wei Chang (University of Illinois); Vivek Srikumar; Dan Goldwasser (University of Illinois); Dan Roth (University of Illinois at Urbana-Champaign) We present a novel approach for structure prediction that addresses the difficulty of obtaining labeled structures for training. We observe that structured output problems often have a companion learning problem of determining whether a given input possesses a good structure. For example, the companion problem for the part-of-speech (POS) tagging task asks whether a given sequence of words has a corresponding sequence of POS tags that is ``legitimate''. While obtaining direct supervision for structures is difficult and expensive, it is often very easy to obtain indirect supervision from the companion binary decision problem. In this paper, we develop a large-margin framework that jointly learns from both direct and indirect forms of supervision. Our experiments exhibit the significant contribution of the easy-to-get indirect binary supervision on three important NLP structure learning problems. [Full Paper] [Discussion] Paper ID: 523 Bayesian Nonparametric Matrix Factorization for Recorded MusicMatthew Hoffman (Princeton University); David Blei (Princeton University); Perry Cook (Princeton University) Recent research in machine learning has focused on breaking audio spectrograms into separate sources of sound using latent variable decompositions. These methods require that the number of sources be specified in advance, which is not always possible. To address this problem, we develop Gamma Process Nonnegative Matrix Factorization (GaP-NMF), a Bayesian nonparametric approach to decomposing spectrograms. The assumptions behind GaP-NMF are based on research in signal processing regarding the expected distributions of spectrogram data, and GaP-NMF automatically discovers the number of latent sources.
We derive a mean-field variational inference algorithm and evaluate GaP-NMF on both synthetic data and recorded music. [Full Paper] [Discussion] Paper ID: 532 Learning the Linear Dynamical System with ASOSJames Martens (University of Toronto) We develop a new algorithm, based on EM, for learning the Linear Dynamical System model. Called the method of Approximated Second-Order Statistics (ASOS), our approach achieves dramatically superior computational performance over standard EM through its use of approximations, which we justify with both intuitive explanations and rigorous convergence results. In particular, after an inexpensive precomputation phase, the iterations of ASOS can be performed in time independent of the length of the training dataset. [Full Paper] [Discussion] Paper ID: 537 Bottom-Up Learning of Markov Network StructureJesse Davis (University of Washington); Pedro Domingos (University of Washington) The structure of a Markov network is typically learned using top-down search. At each step, the search specializes a feature by conjoining it to the variable or feature that most improves the score. This is inefficient, testing many feature variations with no support in the data, and highly prone to local optima. We propose bottom-up search as an alternative, inspired by the analogous approach in the field of rule induction. Our BLM algorithm starts with each complete training example as a long feature, and repeatedly generalizes a feature to match its $k$ nearest examples by dropping variables. An extensive empirical evaluation demonstrates that BLM is both faster and more accurate than the standard top-down approach, and also outperforms other state-of-the-art methods.
[Full Paper] [Discussion] Paper ID: 540 Simple and Efficient Multiple Kernel Learning By Group LassoZenglin Xu; Rong Jin (Michigan State University); Haiqin Yang (The Chinese University of Hong Kong); Irwin King; Michael Lyu We consider the problem of how to improve the efficiency of Multiple Kernel Learning (MKL). MKL is often regarded as a convex-concave optimization problem, which is convex in the kernel weights and concave in the SVM dual variables. In the literature, an alternating approach is widely used: (1) the minimization over the kernel weights is solved by complicated techniques, such as Semi-infinite Linear Programming, Gradient Descent, or the Level method; (2) the maximization over the SVM dual variables can be solved by standard SVM solvers. However, the minimization over kernel weights in these methods usually depends on specialized solving techniques or commercial software, which limits their efficiency and applicability. In this paper, we formulate a closed-form solution for the optimization of the kernel weights based on the equivalence between group lasso and MKL. Although this equivalence is not our invention, our derived variant of the equivalence not only leads to an efficient algorithm for MKL, but also generalizes the equivalence to one between Lp-MKL (p >= 1, denoting the Lp-norm of the kernel weights) and group lasso, which does not appear in the literature. Therefore, our proposed algorithm provides a unified solution for the whole family of Lp-MKL models. Experiments on multiple data sets show the promising performance of the proposed technique compared with other competitive methods. [Full Paper] [Discussion] Paper ID: 544 Active Learning for Networked DataMustafa Bilgic (University of Maryland at College Park); Lilyana Mihalkova (University of Maryland); Lise Getoor (University of Maryland) We introduce a novel active learning algorithm for classification of network data.
In this setting, training instances are connected by a set of links to form a network, the labels of linked nodes are correlated, and the goal is to exploit these dependencies and accurately label the nodes. This problem arises in many domains, including social and biological network analysis and document classification, and there has been much recent interest in methods that collectively classify the nodes in the network. While in many cases labeled examples are expensive, often network information is available. We show how an active learning algorithm can take advantage of network structure. Our algorithm effectively exploits the links between instances and the interaction between the local and collective aspects of a classifier to improve the accuracy of learning from fewer labeled examples. We experiment with two real-world benchmark collective classification domains, and show that we are able to achieve extremely accurate results even when only a small fraction of the data is labeled. [Full Paper] [Discussion] Paper ID: 546 Model-based reinforcement learning with nearly tight exploration complexity bounds. Istvan Szita (University of Alberta); Csaba Szepesvari (University of Alberta) One might believe that model-based algorithms of reinforcement learning can propagate the obtained experience more quickly, and are able to direct exploration better. As a consequence, fewer exploratory actions should be enough to learn a good policy. Strangely enough, current theoretical results for model-based algorithms do not support this claim: in a finite Markov decision process with N states, the best bounds on the number of exploratory steps necessary are of order O(N^2 log N), in contrast to the O(N log N) bound available for the model-free, delayed Q-learning algorithm. In this paper we show that MoRmax, a modified version of the R-max algorithm, needs to make at most O(N log N) exploratory steps. 
This matches the lower bound up to logarithmic factors, as well as the upper bound of the state-of-the-art model-free algorithm, while our new bound improves the dependence on other problem parameters. [Full Paper] [Discussion] Paper ID: 549 Forgetting Counts: Constant Memory Inference for a Dependent Hierarchical Pitman-Yor Process. Nicholas Bartlett (Columbia University); David Pfau (Columbia University); Frank Wood (Columbia University) We propose a novel dependent hierarchical Pitman-Yor process model for discrete data. An incremental Monte Carlo inference procedure for this model is developed. We show that inference in this model can be performed in constant space and linear time. The model is demonstrated on a discrete sequence prediction task, where it is shown to achieve state-of-the-art sequence prediction performance while using significantly less memory. [Full Paper] [Discussion] Paper ID: 551 Distance Dependent Chinese Restaurant Processes. David Blei (Princeton University); Peter Frazier (Cornell) We develop the distance dependent Chinese restaurant process (CRP), a flexible class of distributions over partitions that allows for non-exchangeability. This class can be used to model dependencies between data in infinite clustering models, including dependencies across time or space. We examine the properties of the distance dependent CRP, discuss its connections to Bayesian nonparametric mixture models, and derive a Gibbs sampler for both observed and mixture settings. We study its performance with time-dependent models and three text corpora. We show that relaxing the assumption of exchangeability with distance dependent CRPs can provide a better fit to sequential data. We also show that its alternative formulation of the traditional CRP leads to a faster-mixing Gibbs sampling algorithm than the one based on the original formulation. [Full Paper] [Discussion] Paper ID: 553 Mixed Membership Matrix Factorization. Lester Mackey (U.C. 
Berkeley); David Weiss (University of Pennsylvania); Michael Jordan (University of California, Berkeley) Discrete mixed membership modeling and continuous latent factor modeling (also known as matrix factorization) are two popular, complementary approaches to dyadic data analysis. In this work, we develop a fully Bayesian framework for integrating the two approaches into unified Mixed Membership Matrix Factorization (M3F) models. We introduce two M3F models, derive Gibbs sampling inference procedures, and validate our methods on the EachMovie, MovieLens, and Netflix Prize collaborative filtering datasets. We find that, even when fitting fewer parameters, the M3F models outperform state-of-the-art latent factor approaches on all benchmarks, yielding the greatest gains in accuracy on sparsely rated, high-variance items. [Full Paper] [Discussion] Paper ID: 554 An Analysis of the Convergence of Graph Laplacians. Daniel Ting (UC Berkeley); Ling Huang (Intel Labs Berkeley); Michael Jordan (University of California, Berkeley) Existing approaches to analyzing the asymptotics of graph Laplacians typically assume a well-behaved kernel function with smoothness assumptions. We remove the smoothness assumption and generalize the analysis of graph Laplacians to include previously unstudied graphs, including k-NN graphs. We also introduce a kernel-free framework to analyze graph constructions with shrinking neighborhoods in general, and apply it to analyze locally linear embedding (LLE). 
We also describe how, for a given limit operator, desirable properties such as a convergent spectrum and sparseness can be achieved by choosing the appropriate graph construction. [Full Paper] [Discussion] Paper ID: 556 A Fast Augmented Lagrangian Algorithm for Learning Low-Rank Matrices. Ryota Tomioka (University of Tokyo); Taiji Suzuki (University of Tokyo); Masashi Sugiyama (Tokyo Institute of Technology); Hisashi Kashima (University of Tokyo) We propose a general and efficient algorithm for learning low-rank matrices. The proposed algorithm converges superlinearly and can keep the matrix to be learned in a compact factorized representation without the need to specify the rank beforehand. Moreover, we show that the framework can be easily generalized to the problem of learning multiple matrices and general spectral regularization. Empirically, we show that we can recover a 10,000x10,000 matrix from 1.2 million observations in about 5 minutes. Furthermore, we show that in a brain-computer interface problem, the proposed method can speed up the optimization by two orders of magnitude against the conventional projected gradient method and produces more reliable solutions. [Full Paper] [Discussion] Paper ID: 562 A scalable trust-region algorithm with application to mixed-norm regression. Dongmin Kim (University of Texas at Austin); Suvrit Sra (MPI for Biological Cybernetics); Inderjit Dhillon (University of Texas at Austin) We present a new algorithm for minimizing a convex loss function subject to regularization. Our framework applies to numerous problems in machine learning and statistics; notably, for sparsity-promoting regularizers such as the l_1 or l_{1,\infty} norms it enables efficient computation of sparse solutions. Our approach is based on the trust-region framework with nonsmooth objectives, which allows us to build on known results to provide a convergence analysis. 
We avoid the computational overheads associated with the conventional Hessian approximation used by trust-region methods by instead using a simple separable quadratic approximation. This approximation also enables the use of proximity operators for tackling nonsmooth regularizers. We illustrate the versatility of our resulting algorithm by specializing it to three mixed-norm regression problems: group lasso [36], group logistic regression [21], and multi-task lasso [19]. We experiment with both synthetic and real-world large-scale data; our method is seen to be competitive, robust, and scalable. [Full Paper] [Discussion] Paper ID: 568 Learning Programs: A Hierarchical Bayesian Approach. Percy Liang (University of California, Berkeley); Michael Jordan (University of California, Berkeley); Dan Klein (UC Berkeley) We are interested in learning programs for multiple related tasks given only a few training examples per task. Since the program for a single task is underdetermined by its data, we introduce a nonparametric hierarchical Bayesian prior over programs which shares statistical strength across multiple tasks. The key challenge is to parametrize this multi-task sharing. For this, we introduce a new representation of programs based on combinatory logic and provide an MCMC algorithm that can perform safe program transformations on this representation to reveal shared inter-program substructures. [Full Paper] [Discussion] Paper ID: 569 Multi-Class Pegasos on a Budget. Zhuang Wang (Temple University); Koby Crammer (Department of Electrical Engineering, The Technion, Israel); Slobodan Vucetic When equipped with kernel functions, online learning algorithms are susceptible to the “curse of kernelization” that causes unbounded growth in the model size. To address this issue, we present a family of budgeted online learning algorithms for multi-class classification which have constant space and time complexity per update. 
Our approach is based on the multi-class version of the popular Pegasos algorithm. It keeps the number of support vectors bounded during learning through budget maintenance. By treating budget maintenance as a source of gradient error, we prove that the gap between the budgeted Pegasos and the optimal solution directly depends on the average model degradation due to budget maintenance. To minimize the model degradation, we study greedy multi-class budget maintenance methods based on removal, projection, and merging of support vectors. Empirical results show that the proposed budgeted online algorithms achieve accuracy comparable to non-budget multi-class kernelized Pegasos while being extremely computationally efficient. [Full Paper] [Discussion] Paper ID: 571 Inverse Optimal Control with Linearly Solvable MDPs. Krishnamurthy Dvijotham (University of Washington); Emanuel Todorov (University of Washington) We present new algorithms for inverse optimal control (or inverse reinforcement learning, IRL) within the framework of linearly solvable MDPs (LMDPs). Unlike most prior IRL algorithms, which recover only the control policy of the expert, we recover the policy, the value function, and the cost function. This is possible because here the cost and value functions are uniquely defined given the policy. Despite these special properties, we can handle a wide variety of problems, such as the grid worlds popular in RL and most of the nonlinear problems arising in robotics and control engineering. Direct comparisons to prior IRL algorithms show that our new algorithms provide more information and are orders of magnitude faster. Indeed, our fastest algorithm is the first inverse algorithm which does not require solving the forward problem; instead it performs unconstrained optimization of a convex and easy-to-compute log-likelihood. 
Our work also sheds light on the recent Maximum Entropy IRL (MaxEnt-IRL) algorithm, which was defined in terms of density estimation and for which the corresponding forward problem was left unspecified. We show that MaxEnt-IRL is inverting an LMDP, using the less efficient of the algorithms derived here. Unlike all prior IRL algorithms, which assume preexisting features, we study feature adaptation and show that such adaptation is essential in continuous state spaces. [Full Paper] [Discussion] Paper ID: 576 Telling cause from effect based on high-dimensional observations. Dominik Janzing (MPI for Biological Cybernetics); Patrik Hoyer; Bernhard Schoelkopf We describe a method for inferring linear causal relations among multidimensional variables. The idea is to use an asymmetry between the distributions of cause and effect that occurs if the covariance matrix of the cause and the structure matrix mapping the cause to the effect are independently chosen. The method applies to both stochastic and deterministic causal relations, provided that the dimensionality is sufficiently high (in some experiments, 5 was enough). It is applicable to Gaussian as well as non-Gaussian data. [Full Paper] [Discussion] Paper ID: 582 Mining Clustering Dimensions. Sajib Dasgupta (University of Texas at Dallas); Vincent Ng (University of Texas at Dallas) Although it is common practice to produce only a single clustering of a dataset, in many cases text documents can be clustered along different dimensions. Unfortunately, not only do traditional clustering algorithms fail to produce multiple clusterings of a dataset, the only clustering they produce may not be the one that the user desires. To address this major limitation, we propose a novel clustering algorithm for inducing multiple clusterings along the important dimensions of a dataset. 
Its ability to reveal the important clustering dimensions of a dataset in an unsupervised manner is particularly appealing for those users who have no idea of how the data can possibly be clustered. We demonstrate its viability on several challenging text classification tasks. [Full Paper] [Discussion] Paper ID: 586 Learning Tree Conditional Random Fields. Joseph Bradley (Carnegie Mellon University); Carlos Guestrin (CMU) We examine maximum spanning tree-based methods for learning the structure of tree Conditional Random Fields (CRFs) P(Y|X). We use edge weights which take advantage of local inputs X and thus scale to large problems. For a general class of edge weights, we give a negative learnability result. However, we demonstrate that two members of the class, local Conditional Mutual Information and Decomposable Conditional Influence, have reasonable theoretical bases and perform very well in practice. On synthetic data and a large-scale fMRI application, our methods outperform existing techniques. [Full Paper] [Discussion] Paper ID: 587 Learning efficiently with approximate inference via dual losses. Ofer Meshi (HUJI); David Sontag (MIT); Tommi Jaakkola; Amir Globerson (Hebrew University) Many structured prediction tasks involve complex models where inference is computationally intractable, but where it can be well approximated using a linear programming relaxation. Previous approaches for learning for structured prediction (e.g., cutting-plane, subgradient, perceptron) repeatedly make predictions for some of the data points. These approaches are computationally demanding because each prediction involves solving a linear program to optimality. We present a scalable algorithm for learning for structured prediction. The main idea is to instead solve the dual of the structured prediction loss. We formulate the learning task as a convex minimization over both the weights and the dual variables corresponding to each data point. 
As a result, we can begin to optimize the weights even before completely solving any of the individual prediction problems. We show how the dual variables can be efficiently optimized using coordinate descent. Our algorithm is competitive with state-of-the-art methods such as stochastic subgradient and cutting-plane. [Full Paper] [Discussion] Paper ID: 588 Approximate Predictive Representations of Partially Observable Systems. Doina Precup (McGill University); Monica Dinculescu (McGill University) We provide a novel view of learning an approximate model of a partially observable environment from data and present a simple implementation of the idea. The learned model abstracts away unnecessary details of the agent's experience and focuses only on making certain predictions of interest. We illustrate our approach empirically in small computational examples, demonstrating the data efficiency of the algorithm. [Full Paper] [Discussion] Paper ID: 589 Bayes Optimal Multilabel Classification via Probabilistic Classifier Chains. Krzysztof Dembczynski (Department of Mathematics and Computer Science, University of Marburg, Germany); Weiwei Cheng (University of Marburg); Eyke Huellermeier (University of Marburg) In the realm of multilabel classification (MLC), it has become an opinio communis that optimal predictive performance can only be achieved by learners that explicitly take label dependence into account. The goal of this paper is to elaborate on this postulate in a critical way. To this end, we formalize and analyze MLC within a probabilistic setting. Thus, it becomes possible to look at the problem from the point of view of risk minimization and Bayes optimal prediction. Moreover, inspired by our probabilistic setting, we propose a new method for MLC that generalizes and outperforms another approach, called classifier chains, that was recently introduced in the literature. 
[Full Paper] [Discussion] Paper ID: 592 Non-Local Contrastive Objectives. David Vickrey (Stanford University); Cliff Chiung-Yu Lin (Stanford University); Daphne Koller (Stanford University) Pseudolikelihood and contrastive divergence are two well-known examples of contrastive methods. These algorithms trade off the probability of the correct label with the probabilities of other “nearby” instantiations. In this paper we explore more general types of contrastive objectives, which trade off the probability of the correct label against an arbitrary set of other instantiations. We prove that a large class of contrastive objectives are consistent with maximum likelihood, even for finite amounts of data. This result generalizes asymptotic consistency for pseudolikelihood. The proof gives significant insight into contrastive objectives, suggesting that they enforce (soft) probability-ratio constraints between pairs of instantiations. Based on this insight, we propose Contrastive Constraint Generation (CCG), an iterative constraint-generation style algorithm that allows us to learn a log-linear model using only MAP inference. We evaluate CCG on a scene classification task, showing that it significantly outperforms pseudolikelihood, contrastive divergence, and a well-known margin-based method. [Full Paper] [Discussion] Paper ID: 593 Constructing States for Reinforcement Learning. M. M. Mahmud (Australian National University) POMDPs are the models of choice for reinforcement learning (RL) tasks where the environment cannot be observed directly. In many applications we need to learn the POMDP structure and parameters from experience, and this is considered to be a difficult problem. In this paper we address this issue by modeling the hidden environment with a novel class of models that are less expressive, but easier to learn and plan with, than POMDPs. 
We call these models deterministic Markov models (DMMs), which are deterministic-probabilistic finite automata from learning theory, extended with actions to the sequential (rather than i.i.d.) setting. Conceptually, we extend the Utile Suffix Memory method of McCallum to handle long-term memory. We describe DMMs, give Bayesian algorithms for learning and planning with them, and also present experimental results for some standard POMDP tasks and for tasks designed to illustrate their efficacy. [Full Paper] [Discussion] Paper ID: 596 Graded Multilabel Classification: The Ordinal Case. Weiwei Cheng (University of Marburg); Krzysztof Dembczynski (Department of Mathematics and Computer Science, University of Marburg, Germany); Eyke Huellermeier (University of Marburg) We propose a generalization of multilabel classification that we refer to as graded multilabel classification. The key idea is that, instead of requesting a yes-no answer to the question of class membership or, say, relevance of a class label for an instance, we allow for a graded membership of an instance, measured on an ordinal scale of membership degrees. This extension is motivated by practical applications in which a graded or partial class membership is natural. Apart from introducing the basic setting, we propose two general strategies for reducing graded multilabel problems to conventional (multilabel) classification problems. Moreover, we address the question of how to extend performance metrics commonly used in multilabel classification to the graded setting, and present first experimental results. [Full Paper] [Discussion] Paper ID: 598 Finite-Sample Analysis of LSTD. Alessandro Lazaric (INRIA); Mohammad Ghavamzadeh (INRIA); Remi Munos (INRIA) In this paper we consider the problem of policy evaluation in reinforcement learning, i.e., learning the value function of a fixed policy, using the least-squares temporal-difference (LSTD) learning algorithm. We report a finite-sample analysis of LSTD. 
We first derive a bound on the performance of the LSTD solution evaluated at the states generated by the Markov chain and used by the algorithm to learn an estimate of the value function. This result is general in the sense that no assumption is made on the existence of a stationary distribution for the Markov chain. We then derive generalization bounds for the case when the Markov chain possesses a stationary distribution and is $\beta$-mixing. [Full Paper] [Discussion] Paper ID: 601 On the Interaction between Norm and Dimensionality: Multiple Regimes in Learning. Percy Liang (University of California, Berkeley); Nathan Srebro (TTI-Chicago) A learning problem might have several measures of complexity (e.g., norm and dimensionality) that affect the generalization error. What is the interaction between these complexities? Dimension-free learning theory bounds and parametric asymptotic analyses each provide a partial picture of the full learning curve. In this paper, we use high-dimensional asymptotics on two classical problems, mean estimation and linear regression, to explore the learning curve more completely. We show that these curves exhibit multiple regimes, where in each regime the excess risk is controlled by a subset of the problem complexities. [Full Paper] [Discussion] Paper ID: 605 Learning Hierarchical Riffle Independent Groupings from Rankings. Jonathan Huang (CMU); Carlos Guestrin (CMU) Riffled independence is a generalized notion of probabilistic independence that has been shown to be naturally applicable to ranked data. In the riffled independence model, one assigns rankings to two disjoint sets of items independently, then, in a second stage, interleaves (or riffles) the two rankings together to form a full ranking, as if by shuffling a deck of cards. Because of this interleaving stage, it is much more difficult to detect riffled independence than ordinary independence. 
In this paper, we provide the first automated method for discovering sets of items which are riffle independent from a training set of rankings. We show that our clustering-like algorithms can be used to discover meaningful latent coalitions from real preference ranking datasets and to learn the structure of hierarchically decomposable models based on riffled independence. [Full Paper] [Discussion] Paper ID: 620 Active Learning for Multi-Task Adaptive Filtering. Abhay Harpale (Carnegie Mellon University); Yiming Yang (Carnegie Mellon University) In this paper, we propose an Active Learning (AL) framework for the Multi-Task Adaptive Filtering (MTAF) problem. Specifically, we explore AL approaches to rapidly improve an MTAF system, based on Dirichlet Process priors, with minimal user/task-level feedback. The proposed AL approaches select instances for delivery with a twofold objective: 1) improve future task-specific system performance based on feedback received on delivered instances for that task; 2) improve the future overall system performance, thereby benefiting other tasks in the system, based on feedback received on delivered instances for a particular task. Current AL approaches focus only on the first objective. To satisfy both goals, we define a new scoring function called Utility Gain to estimate the perceived improvements in task-specific and global models. In our experiments on standard benchmark datasets, we observed that global AL approaches that additionally take into account the potential benefit of feedback to other tasks in the system performed better than the task-specific approach that focused only on the benefit to the current task. 
[Full Paper] [Discussion] Paper ID: 627 Toward Off-Policy Learning Control with Function Approximation. Hamid Maei (University of Alberta); Csaba Szepesvari (University of Alberta); Shalabh Bhatnagar (Indian Institute of Science); Richard Sutton (University of Alberta) We present the first temporal-difference learning algorithm for off-policy control with unrestricted linear function approximation whose per-time-step complexity is linear in the number of features. Our algorithm, Greedy-GQ, is an extension of recent work on gradient temporal-difference learning, which has hitherto been restricted to a prediction (policy evaluation) setting, to a control setting in which the target policy is greedy with respect to a linear approximation to the optimal action-value function. A limitation of our control setting is that we require the behavior policy to be stationary. We call this setting latent learning because the optimal policy, though learned, is not manifest in behavior. Popular off-policy algorithms such as Q-learning are known to be unstable in this setting when used with linear function approximation. [Full Paper] [Discussion] Paper ID: 628 Accelerated dual decomposition for MAP inference. Vladimir Jojic (Stanford University); Stephen Gould (Stanford University); Daphne Koller (Stanford University) Approximate MAP inference in graphical models is an important and challenging problem for many domains, including computer vision, computational biology, and natural language understanding. Current state-of-the-art approaches employ convex relaxations of these problems as surrogate objectives, but only provide weak running time guarantees. In this paper, we develop an approximate inference algorithm that is both efficient and has strong theoretical guarantees. Specifically, our algorithm is guaranteed to converge to an $\epsilon$-accurate solution of the convex relaxation in $O\left(\frac{1}{\epsilon}\right)$ time. 
We demonstrate our approach on synthetic and real-world problems and show that it outperforms current state-of-the-art techniques. [Full Paper] [Discussion] Paper ID: 636 Sparse Gaussian Process Regression via $\ell_1$ Penalization. Feng Yan (Purdue University); Yuan Qi (Purdue University) To handle massive data, a variety of sparse Gaussian Process (GP) methods have been proposed to reduce the computational cost. Many of them essentially map the large dataset into a small set of basis points. A common approach to learning these basis points is evidence maximization. Nevertheless, maximized evidence may lead to overfitting and cause high computational cost for optimization. In this paper, we propose a novel sparse GP regression approach, GPLasso, that explicitly represents the trade-off between approximation quality and model sparsity. GPLasso minimizes an $\ell_1$-penalized KL divergence between the exact and sparse GP posterior processes. Optimizing this convex cost function leads to sparse GP parameters. Furthermore, we use low-rank matrix approximation (for example, incomplete Cholesky factorization) in our optimization procedure to significantly reduce the computational cost. Experimental results demonstrate that, compared with several state-of-the-art sparse GP methods and direct low-rank matrix approximation methods, GPLasso achieves a much improved balance between prediction accuracy and computational cost. [Full Paper] [Discussion] Paper ID: 638 A theoretical analysis of feature pooling in vision algorithms. Y-Lan Boureau (NYU); Jean Ponce (ENS); Yann LeCun (New York University) Many modern visual recognition algorithms incorporate a step of spatial `pooling', where the outputs of several nearby feature detectors are combined into a local or global `bag of features', in a way that preserves task-related information while removing irrelevant details. 
Pooling is used to achieve invariance to image transformations, more compact representations, and better robustness to noise and clutter. Several papers have shown that the details of the pooling operation can greatly influence performance, but studies have so far been purely empirical. In this paper, we show that the reasons underlying the performance of various pooling methods are obscured by several confounding factors, such as the link between the sample cardinality in a spatial pool and the resolution at which low-level features have been extracted. We provide a detailed theoretical analysis of max pooling and average pooling, and give extensive empirical comparisons for object recognition tasks. [Full Paper] [Discussion] Paper ID: 642 Comparing Clusterings in Space. Michael Coen (UW-Madison); Hidayath Ansari (UW-Madison); Nathanael Fillmore (UW-Madison) This paper proposes a new method for comparing clusterings both partitionally and geometrically. Our approach is motivated by the following observation: the vast majority of previous techniques for comparing clusterings are entirely partitional, i.e., they examine assignments of points in set-theoretic terms after they have been partitioned. In doing so, these methods ignore the spatial layout of the data, disregarding the fact that this information is responsible for generating the clusterings to begin with. We demonstrate that this leads to a variety of failure modes: previous comparison techniques often fail to differentiate between significant changes made in the data being clustered. We formulate a new measure for comparing clusterings that combines spatial and partitional information into a single measure using optimization theory. Doing so eliminates pathological conditions in previous approaches. It also simultaneously removes common limitations, such as the requirement that each clustering have the same number of clusters or that they are over identical datasets. 
This approach is stable, easily implemented, and has strong intuitive appeal. [Full Paper] [Discussion] Paper ID: 643 High-Performance Semi-Supervised Learning using Discriminatively Constrained Generative Models. Gregory Druck (University of Massachusetts Amherst); Andrew McCallum (University of Massachusetts Amherst) We develop a semi-supervised learning method that constrains the posterior distribution of latent variables under a generative model to satisfy a rich set of feature expectation constraints estimated with labeled data. This approach encourages the generative model to discover latent structure that is relevant to a prediction task. We estimate parameters with a coordinate ascent algorithm, one step of which involves training a discriminative log-linear model with an embedded generative model. This hybrid model can be used for test-time prediction. Unlike other high-performance semi-supervised methods, the proposed method converges to a local maximum of a single objective function, and affords additional flexibility, for example to use different latent and output spaces. We conduct experiments on three sequence labeling tasks, achieving the best reported results on two of them, and showing promising results on CoNLL-03 NER. [Full Paper] [Discussion] Paper ID: 652 Nonparametric Return Distribution Approximation for Reinforcement Learning. Tetsuro Morimura (IBM Research - Tokyo); Masashi Sugiyama (Tokyo Institute of Technology); Hisashi Kashima (University of Tokyo); Hirotaka Hachiya; Toshiyuki Tanaka Standard Reinforcement Learning (RL) aims to optimize decision-making rules in terms of the expected return. However, especially for risk-management purposes, other criteria such as the expected shortfall are sometimes preferred. Here, we describe a method of approximating the distribution of returns, which allows us to derive various kinds of information about the returns. 
We first show that the Bellman equation, which is a recursive formula for the expected return, can be extended to the cumulative return distribution. Then we derive a nonparametric return distribution estimator with particle smoothing based on this extended Bellman equation. A key aspect of the proposed algorithm is to represent the recursion relation in the extended Bellman equation by a simple replacement procedure of the particles associated with a state by those of the successor state. We show that our algorithm leads to a risk-sensitive RL paradigm. The usefulness of the proposed approach is demonstrated through numerical experiments. [Full Paper] [Discussion] Paper ID: 654 Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view. Bruno Scherrer (LORIA) We investigate projection methods for evaluating a linear approximation of the value function of a policy in a Markov Decision Process context. We consider two popular approaches, the one-step Temporal Difference fix-point computation (TD(0)) and the Bellman Residual (BR) minimization. We describe examples where each method outperforms the other. We highlight a simple relation between the objective functions they minimize, and show that while BR enjoys a performance guarantee, TD(0) does not in general. We then propose a unified view in terms of oblique projections of the Bellman equation, which substantially simplifies and extends the characterization of Schoknecht (2002) and the recent analysis of Yu and Bertsekas (2008). Finally, we describe some simulations suggesting that while the TD(0) solution is usually slightly better than the BR solution, its inherent numerical instability can make it very bad in some cases, and thus worse on average. [Full Paper] [Discussion] 
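The final abstract above contrasts the TD(0) fix point with Bellman Residual minimization for linear value-function approximation. As a minimal sketch of the two criteria, assuming an illustrative two-state chain with a single feature per state and a uniform state weighting (all numerical values here are invented for illustration and are not taken from the paper):

```python
# Toy policy evaluation: TD(0) fix point vs. Bellman Residual (BR)
# minimization with a 1-dimensional linear value approximation.
# The MDP, features, and weighting below are illustrative assumptions.

gamma = 0.9
P = [[0.5, 0.5], [0.1, 0.9]]   # transition matrix of the fixed policy
R = [1.0, 0.0]                 # expected one-step rewards
phi = [1.0, 2.0]               # one feature value per state
d = [0.5, 0.5]                 # state weighting (here: uniform)

# M_i = phi_i - gamma * (P phi)_i appears in both criteria.
Pphi = [sum(P[i][j] * phi[j] for j in range(2)) for i in range(2)]
M = [phi[i] - gamma * Pphi[i] for i in range(2)]

# TD(0) fix point: the Bellman residual is d-orthogonal to the features,
#   sum_i d_i * phi_i * (M_i * theta - R_i) = 0.
theta_td = sum(d[i] * phi[i] * R[i] for i in range(2)) / \
           sum(d[i] * phi[i] * M[i] for i in range(2))

# BR minimization: least squares on the Bellman residual,
#   min_theta sum_i d_i * (M_i * theta - R_i)**2.
theta_br = sum(d[i] * M[i] * R[i] for i in range(2)) / \
           sum(d[i] * M[i] ** 2 for i in range(2))

print(theta_td, theta_br)  # the two criteria give different weights
```

With one-dimensional features both criteria reduce to scalar equations (in general they are linear systems), and on this toy chain the two solutions differ substantially, in line with the abstract's point that neither method dominates the other across problems.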