AISTATS 2017 Program of Events
Best Paper Awards
A Sub-Quadratic Exact Medoid Algorithm
James Newling, Francois Fleuret
Phase Retrieval Meets Statistical Learning Theory: A Flexible Convex Relaxation
Sohail Bahmani, Justin Romberg
Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms
Christian Naesseth, Francisco Ruiz, Scott Linderman, David Blei
19-Apr (Wed)
18:30-20:30 Registration Desk
20-Apr (Thu)
7:30-8:50 Breakfast, Windows on the Green & Chart Room
8:00-10:00 Registration Desk
8:50-9:00 Welcome and award announcement
9:00-10:00 Invited Talk: Csaba Szepesvari. Crystal Ballroom 1, 2
Stochastic linear bandits.
Learning and decision making often conflict: A decision maker who is uncertain about its environment may need to choose actions whose main benefit is to gain information rather than to gain reward. The dilemma between exploration and exploitation is at the heart of bandit problems. In this talk I will focus on the so-called stochastic linear bandits, where the payoff function is assumed to possess a linear structure, an assumption that has proved extremely effective elsewhere in machine learning. Here we look into how the linear structure can be exploited in bandits. I will discuss existing results and open questions. The issues discussed will include limits on the performance of bandit algorithms, how to design efficient and effective algorithms, and how to exploit additional information such as sparsity.
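To make the optimism-under-uncertainty idea concrete, here is a minimal LinUCB-style sketch (a standard algorithm for this setting, not necessarily the one analyzed in the talk); the arm set, noise level, and confidence scale alpha are arbitrary choices for this toy example:

```python
import numpy as np

def linucb(arms, theta_star, T=1000, alpha=1.0, noise=0.1, seed=0):
    """Toy linear bandit: the payoff of arm x is <theta_star, x> + noise,
    with theta_star unknown to the learner. arms is a (K, d) array."""
    rng = np.random.default_rng(seed)
    K, d = arms.shape
    A = np.eye(d)      # regularized Gram matrix: I + sum of x x^T
    b = np.zeros(d)    # sum of reward-weighted features
    total = 0.0
    for _ in range(T):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b  # ridge estimate of theta_star
        # Optimism: play the arm maximizing estimated payoff + confidence width.
        width = np.sqrt(np.einsum('ij,jk,ik->i', arms, A_inv, arms))
        x = arms[np.argmax(arms @ theta_hat + alpha * width)]
        r = x @ theta_star + noise * rng.standard_normal()
        A += np.outer(x, x)
        b += r * x
        total += r
    return total

rng = np.random.default_rng(1)
arms = rng.standard_normal((10, 5))          # 10 arms in R^5
print(linucb(arms, theta_star=rng.standard_normal(5)))
```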
Bio: Csaba Szepesvari is interested in designing principled learning algorithms for agents that learn from and control their environments. He is the author or coauthor of two books on the topic, as well as nearly 200 journal and conference papers. He is best known as the co-inventor of "UCT", a tree-search algorithm that inspired much subsequent work in AI, search and optimization and serves, among other things, as the basis of the search component in AlphaGo, DeepMind's Go program that defeated a top human Go player at the beginning of 2016. His work on UCT was recently recognized by the "Test of Time" award at ECML'2016. Csaba serves as an action editor of the Journal of Machine Learning Research and the Machine Learning Journal. He also served as a co-chair for COLT and ALT and has served as a senior PC member for first-tier AI and machine learning conferences for too many years. Currently Csaba is a Professor at the Department of Computing Science of the University of Alberta and a Principal Investigator of the Alberta Machine Intelligence Institute (AMII). From August, he will be joining DeepMind, London.
10:00-10:30 Coffee Break, Crystal Atrium
10:30-12:10 Online Learning, Crystal Ballroom 1, 2
Session Chair: Csaba Szepesvari
63 Linear Thompson Sampling Revisited
217 Horde of Bandits using Gaussian Markov Random Fields
225 The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits
304 Improved Strongly Adaptive Online Learning using Coin Betting
12:10-2:00 Lunch on your own
1:00-3:00 Registration Desk
2:00-3:40 Nonparametric methods, Crystal Ballroom 1, 2
Session Chair: Byron Boots
97 Poisson intensity estimation with reproducing kernels
249 Attributing Hacks
248 Regression Uncertainty on the Grassmannian
401 Modal-set estimation with an application to clustering
3:40-4:10 Coffee break, Crystal Atrium
4:10-7:00 Poster Session (with light snacks), Crystal Ballroom 3, 4
Poster list:
tP01: 352 Scalable variational inference for super resolution microscopy
tP02: 342 Complementary Sum Sampling for Likelihood Approximation in Large Scale Classification
tP03: 249 Attributing Hacks
tP04: 317 Fairness Constraints: Mechanisms for Fair Classification
tP05: 116 Encrypted accelerated least squares regression
tP06: 294 DP-EM: Differentially Private Expectation Maximization
tP07: 539 Greedy Direction Method of Multiplier for MAP Inference of Large Output Domain
tP08: 328 A Learning Theory of Ranking Aggregation
tP09: 406 Relativistic Monte Carlo
tP10: 119 Gray-box inference for structured Gaussian process models
tP11: 11 Clustering from Multiple Uncertain Experts
tP12: 139 Combinatorial Topic Models using Small-Variance Asymptotics
tP13: 212 High-dimensional Time Series Clustering via Cross-Predictability
tP14: 516 Convergence rate of stochastic k-means
tP15: 183 Fast column generation for atomic norm regularization
tP16: 492 Binary and Multi-Bit Coding for Stable Random Projections
tP17: 30 On the learnability of fully-connected neural networks
tP18: 362 Fast rates with high probability in exp-concave statistical learning
tP19: 526 Learning Graphical Games from Behavioral Data: Sufficient and Necessary Conditions
tP20: 441 Inference Compilation and Universal Probabilistic Programming
tP21: 199 Gradient Boosting on Stochastic Data Streams
tP22: 114 Localized Lasso for High-Dimensional Regression
tP23: 366 Learning with feature feedback: from theory to practice
tP24: 47 Consistent and Efficient Nonparametric Different-Feature Selection
tP25: 148 Rapid Mixing Swendsen-Wang Sampler for Stochastic Partitioned Attractive Models
tP26: 215 Learning Nonparametric Forest Graphical Models with Prior Information
tP27: 301 Efficient Algorithm for Sparse Tensor-variate Gaussian Graphical Models via Gradient Descent
tP28: 383 Belief Propagation in Conditional RBMs for Structured Prediction
tP29: 393 A Fast and Scalable Joint Estimator for Learning Multiple Related Sparse Gaussian Graphical Models
tP30: 422 Learning the Network Structure of Heterogeneous Data via Pairwise Exponential Markov Random Fields
tP31: 99 Generalized Pseudolikelihood Methods for Inverse Covariance Estimation
tP32: 327 A New Class of Private Chi-Square Hypothesis Tests
tP33: 32 An Information-Theoretic Route from Generalization in Expectation to Generalization in Probability
tP34: 405 Local Group Invariant Representations via Orbit Embeddings
tP35: 439 Spatial Decompositions for Large Scale SVMs
tP36: 488 Distributed Adaptive Sampling for Kernel Matrix Approximation
tP37: 97 Poisson intensity estimation with reproducing kernels
tP38: 268 Scalable Learning of Non-Decomposable Objectives
tP39: 414 Fast Classification with Binary Prototypes
tP40: 497 Label Filters for Large Scale Multilabel Classification
tP41: 150 Efficient Rank Aggregation via Lehmer Codes
tP42: 325 A Unified Computational and Statistical Framework for Nonconvex Low-rank Matrix Estimation
tP43: 45 Tensor-Dictionary Learning with Deep Kruskal-Factor Analysis
tP44: 469 Optimal Recovery of Tensor Slices
tP45: 290 Hit-and-Run for Sampling and Planning in Non-Convex Spaces
tP46: 312 Black-box Importance Sampling
tP47: 361 Sequential Graph Matching with Sequential Monte Carlo
tP48: 134 On the Troll-Trust Model for Edge Sign Prediction in Social Networks
tP49: 153 Nonlinear ICA of Temporally Dependent Stationary Sources
tP50: 216 Sparse Randomized Partition Trees for Nearest Neighbor Search
tP51: 341 Structured adaptive and random spinners for fast machine learning computations
tP52: 401 Modal-set estimation with an application to clustering
tP53: 435 Lipschitz Density-Ratios, Structured Data, and Data-driven Tuning
tP54: 138 Online Optimization of Smoothed Piecewise Constant Functions
tP55: 178 Contextual Bandits with Latent Confounders: An NMF Approach
tP56: 200 Online Learning and Blackwell Approachability with Partial Monitoring: Optimal Convergence Rates
tP57: 217 Horde of Bandits using Gaussian Markov Random Fields
tP58: 221 Trading off Rewards and Errors in Multi-Armed Bandits
tP59: 225 The End of Optimism? An Asymptotic Analysis of Finite-Armed Linear Bandits
tP60: 250 Unsupervised Sequential Sensor Acquisition
tP61: 304 Improved Strongly Adaptive Online Learning using Coin Betting
tP62: 35 Nearly Instance Optimal Sample Complexity Bounds for Top-k Arm Selection
tP63: 63 Linear Thompson Sampling Revisited
tP64: 96 Regret Bounds for Lifelong Learning
tP65: 106 Regret Bounds for Transfer Learning in Bayesian Optimisation
tP66: 111 Scaling Submodular Maximization via Pruned Submodularity Graphs
tP67: 23 Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach
tP68: 360 Linear Convergence of Stochastic Frank Wolfe Variants
tP69: 386 Finite-sum Composition Optimization via Variance Reduced Gradient Descent
tP70: 40 Guaranteed Non-convex Optimization: Submodular Maximization over Continuous Domains
tP71: 467 Initialization and Coordinate Optimization for Multi-way Matching
tP72: 52 Less than a Single Pass: Stochastically Controlled Stochastic Gradient
tP73: 521 Scalable Convex Multiple Sequence Alignment via Entropy-Regularized Dual Decomposition
tP74: 193 Exploration-Exploitation in MDPs with Options
tP75: 515 Value-Aware Loss Function for Model-based Reinforcement Learning
tP76: 84 Learning Nash Equilibrium for General-Sum Markov Games from Batch Data
tP77: 324 Frequency Domain Predictive Modelling with Aggregated Data
tP78: 394 Communication-efficient Distributed Sparse Linear Discriminant Analysis
tP79: 402 Compressed Least Squares Regression revisited
tP80: 219 Random projection design for scalable implicit smoothing of randomly observed stochastic processes
tP81: 78 Learning Theory for Conditional Risk Minimization
tP82: 248 Regression Uncertainty on the Grassmannian
tP83: 299 Bayesian Learning and Inference in Recurrent Switching Linear Dynamical Systems
tP84: 55 Learning Time Series Detection Models from Temporally Imprecise Labels
21-Apr (Fri)
7:30-9:00 Breakfast, Windows on the Green & Chart Room
8:00-10:00 Registration Desk
9:00-10:00 Invited Talk: Cynthia Rudin. Crystal Ballroom 1, 2
What Are We Afraid Of?: Computational Hardness vs the Holy Grail of Interpretability in Machine Learning.
Is there always a tradeoff between accuracy and interpretability? This is a very old AI question. Many people have claimed that they have investigated the answer to this question, but it is not clear that these attempts have been truly serious. If we try to investigate this claim by comparing interpretable modeling algorithms (like decision trees - say CART, C4.5) to a black box method that optimizes only accuracy (SVM or neural networks), we will not find the answer. This is not a fulfilling comparison - the methods for producing interpretable models are greedy myopic methods with no global objective, whereas the black box algorithms have global objectives and principled optimization routines. In order to actually answer this question, we would have to compare an "optimal" interpretable model to an optimal black box model. This means we actually need optimality for interpretable models. This, of course, leads to computational hardness, which scares us. On the other hand, we have computing power like never before. So do we truly know what we are afraid of anymore?
In this talk I will discuss algorithms for interpretable machine learning. Some of these algorithms are designed to create certificates of nearness to optimality. I will focus on some of our most recent work, including (1) work on optimal rule list models using customized bounds and data structures (these are an alternative to CART), and (2) work on optimal scoring systems (alternatives to logistic regression + rounding). Further, since we have methods that can produce optimal or near-optimal models, we can use them to produce interesting new forms of interpretable models. These new forms were simply not possible before, since they are almost impossible to produce using traditional techniques (like greedy splitting and pruning). In particular: (3) Falling rule lists, (4) Causal falling rule lists, and (5) Cost-effective treatment regimes. Work on (1) is joint with postdoc Elaine Angelino, students Nicholas Larus-Stone and Daniel Alabi, and colleague Margo Seltzer. Work on (2) is joint with student Berk Ustun. Work on (3) and (4) is joint with students Fulton Wang and Chaofan Chen, and (5) is an AISTATS 2017 paper that is joint work with student Himabindu Lakkaraju.
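As a point of reference for (2), the naive "logistic regression + rounding" heuristic that optimal scoring systems aim to improve on can be sketched in a few lines; the dataset, regularization strength, point scale, and median cutoff below are arbitrary choices for illustration, not the method from the talk:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Fit a sparse logistic regression, then round coefficients to small
# integers so the model becomes a human-scorable point system.
X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)

coef = clf.coef_.ravel()
points = np.round(5 * coef / np.abs(coef).max()).astype(int)  # points in [-5, 5]
scores = X @ points
threshold = np.median(scores)  # crude cutoff; an optimal system tunes this jointly
print("features used:", np.count_nonzero(points))
print("train accuracy:", np.mean((scores > threshold) == y))
```

The gap between such a rounded model and the best achievable integer scoring system is the kind of quantity that certificate-producing optimal methods can bound.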
Bio: Cynthia Rudin is an associate professor of computer science and electrical and computer engineering at Duke University, and directs the Prediction Analysis Lab. Her interests are in machine learning, data mining, applied statistics, and knowledge discovery (Big Data). Her application areas are in energy grid reliability, healthcare, and computational criminology. Previously, Prof. Rudin held positions at MIT, Columbia, and NYU. She holds an undergraduate degree from the University at Buffalo where she received the College of Arts and Sciences Outstanding Senior Award in Sciences and Mathematics, and three separate outstanding senior awards from the departments of physics, music, and mathematics. She received a PhD in applied and computational mathematics from Princeton University. She is the recipient of the 2013 and 2016 INFORMS Innovative Applications in Analytics Awards, an NSF CAREER award, was named as one of the "Top 40 Under 40" by Poets and Quants in 2015, and was named by Businessinsider.com as one of the 12 most impressive professors at MIT in 2015. Work from her lab has won 10 best paper awards in the last 5 years. Her work has been featured in Businessweek, The Wall Street Journal, the New York Times, the Boston Globe, the Times of London, Fox News (Fox & Friends), the Toronto Star, WIRED Science, U.S. News and World Report, Slashdot, CIO magazine, Boston Public Radio, and on the cover of IEEE Computer. She is past chair of the INFORMS Data Mining Section, and is currently chair-elect of the Statistical Learning and Data Science section of the American Statistical Association.
10:00-10:30 Coffee Break, Crystal Atrium
10:30-12:10 Theory, Crystal Ballroom 1, 2
Session Chair: Sanjoy Dasgupta
94 Phase Retrieval Meets Statistical Learning Theory: A Flexible Convex Relaxation
68 A Sub-Quadratic Exact Medoid Algorithm
456 On the Interpretability of Conditional Probability Estimates in the Agnostic Setting
209 Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers
12:10-2:00 Lunch on your own
1:00-3:00 Registration Desk
2:00-3:40 Approximate Inference and MCMC, Crystal Ballroom 1, 2
Session Chair: Simon Lacoste-Julien
51 Annular Augmentation Sampling
101 Removing Phase Transitions from Gibbs Measures
170 Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms
174 Asymptotically exact inference in differentiable generative models
3:40-4:10 Coffee Break, Crystal Atrium
4:10-7:00 Poster Session (with light snacks), Crystal Ballroom 3, 4
Poster list:
fP01: 82 Near-optimal Bayesian Active Learning with Correlated and Noisy Tests
fP02: 9 Large-Scale Data-Dependent Kernel Approximation
fP03: 86 Distance Covariance Analysis
fP04: 228 Rank Aggregation and Prediction with Item Features
fP05: 420 Signal-based Bayesian Seismic Monitoring
fP06: 60 Learning Cost-Effective and Interpretable Treatment Regimes
fP07: 170 Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms
fP08: 174 Asymptotically exact inference in differentiable generative models
fP09: 288 Conjugate-Computation Variational Inference: Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models
fP10: 196 Local Perturb-and-MAP for Structured Prediction
fP11: 51 Annular Augmentation Sampling
fP12: 104 Performance Bounds for Graphical Record Linkage
fP13: 180 Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets
fP14: 273 CPSG-MCMC: Clustering-Based Preprocessing method for Stochastic Gradient MCMC
fP15: 298 On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior
fP16: 345 Learning Optimal Interventions
fP17: 419 Learning Structured Weight Uncertainty in Bayesian Neural Networks
fP18: 429 Discovering and Exploiting Additive Structure for Bayesian Optimization
fP19: 211 Detecting Dependencies in Sparse, Multivariate Databases Using Probabilistic Programming and Non-parametric Bayes
fP20: 416 Prediction Performance After Learning in Gaussian Process Regression
fP21: 129 A Framework for Optimal Matching for Causal Inference
fP22: 523 Robust Causal Estimation in the Large-Sample Limit without Strict Faithfulness
fP23: 182 Least-Squares Log-Density Gradient Clustering for Riemannian Manifolds
fP24: 68 A Sub-Quadratic Exact Medoid Algorithm
fP25: 117 Random Consensus Robust PCA
fP26: 224 Adaptive ADMM with Spectral Penalty Parameter Selection
fP27: 94 Phase Retrieval Meets Statistical Learning Theory: A Flexible Convex Relaxation
fP28: 214 Data Driven Resource Allocation for Distributed Learning
fP29: 278 Comparison-Based Nearest Neighbor Search
fP30: 363 Generalization Error of Invariant Classifiers
fP31: 456 On the Interpretability of Conditional Probability Estimates in the Agnostic Setting
fP32: 141 ConvNets with Smooth Adaptive Activation Functions for Regression
fP33: 404 Diverse Neural Network Learns True Target Functions
fP34: 209 Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers
fP35: 329 Anomaly Detection in Extreme Regions via Empirical MV-sets on the Sphere
fP36: 69 Minimax density estimation for growing dimension
fP37: 540 Scalable Greedy Feature Selection via Weak Submodularity
fP38: 449 Information Projection and Approximate Inference for Structured Sparse Variables
fP39: 227 Dynamic Collaborative Filtering With Compound Poisson Factorization
fP40: 242 Information-theoretic limits of Bayesian network structure learning
fP41: 3 Conditions beyond treewidth for tightness of higher-order LP relaxations
fP42: 347 A Lower Bound on the Partition Function of Attractive Graphical Models in the Continuous Case
fP43: 531 Non-Count Symmetries in Boolean & Multi-Valued Prob. Graphical Models
fP44: 504 Sequential Multiple Hypothesis Testing with Type I Error Control
fP45: 22 Lower Bounds on Active Learning for Graphical Model Selection
fP46: 498 Learning from Conditional Distributions via Dual Embeddings
fP47: 13 Online Nonnegative Matrix Factorization with General Divergences
fP48: 161 Stochastic Difference of Convex Algorithm and its Application to Training Deep Boltzmann Machines
fP49: 384 Sketching Meets Random Projection in the Dual: A Provable Recovery Algorithm for Big and High-dimensional Data
fP50: 417 Communication-Efficient Learning of Deep Networks from Decentralized Data
fP51: 520 Automated Inference with Adaptive Batches
fP52: 190 Bayesian Hybrid Matrix Factorisation for Data Integration
fP53: 192 Co-Occurring Directions Sketching for Approximate Matrix Multiply
fP54: 205 Tensor Decompositions via Two-Mode Higher-Order SVD (HOSVD)
fP55: 442 Active Positive Semidefinite Matrix Completion: Algorithms, Theory and Applications
fP56: 245 Markov Chain Truncation for Doubly-Intractable Inference
fP57: 101 Removing Phase Transitions from Gibbs Measures
fP58: 484 Distribution of Gaussian Process Arc Lengths
fP59: 76 Estimating Density Ridges by Direct Estimation of Density-Derivative-Ratios
fP60: 213 Minimax Approach to Variable Fidelity Data Interpolation
fP61: 132 Stochastic Rank-1 Bandits
fP62: 26 Sparse Accelerated Exponential Weights
fP63: 479 Efficient Online Multiclass Prediction on Graphs via Surrogate Losses
fP64: 124 Frank-Wolfe Algorithms for Saddle Point Problems
fP65: 167 Global Convergence of Non-Convex Gradient Descent for Computing Matrix Squareroot
fP66: 175 Decentralized Collaborative Learning of Personalized Models over Networks
fP67: 20 ASAGA: Asynchronous Parallel SAGA
fP68: 264 A Stochastic Nonconvex Splitting Method for Symmetric Nonnegative Matrix Factorization
fP69: 282 A Unified Optimization View on Generalized Matching Pursuit and Frank-Wolfe
fP70: 284 Faster Coordinate Descent via Adaptive Importance Sampling
fP71: 375 Tracking Objects with Higher Order Interactions via Delayed Column Generation
fP72: 399 Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage
fP73: 367 Optimistic Planning for the Stochastic Knapsack Problem
fP74: 410 Thompson Sampling for Linear-Quadratic Control Problems
fP75: 239 Robust and Efficient Computation of Eigenvectors in a Generalized Spectral Method for Constrained Clustering
fP76: 302 Minimax-optimal semi-supervised regression on unknown manifolds
fP77: 2 Minimax Gaussian Classification & Clustering
fP78: 372 Identifying groups of strongly correlated variables through Smoothed Ordered Weighted L_1-norms
fP79: 494 Spectral Methods for Correlated Topic Models
fP80: 131 Quantifying the accuracy of approximate diffusions and Markov chains
fP81: 267 Hierarchically-partitioned Gaussian Process Approximation
fP82: 459 Linking Micro Event History to Macro Prediction in Point Process Models
fP83: 507 A Maximum Matching Algorithm for Basis Selection in Spectral Learning
7:15-9:00 Dinner Buffet, Panorama Ballroom
22-Apr (Sat)
7:30-9:00 Breakfast, Panorama Ballroom C, D & Terrace
8:00-10:00 Registration Desk
9:00-10:00 Invited Talk: Sanjoy Dasgupta. Panorama Ballroom A, B
Towards a Theory of Interactive Learning.
"Interactive learning" refers to scenarios in which a learning agent (human or machine) engages with an information-bearing agent or system (for instance, a human expert) with the goal of efficiently arriving at a useful model. Examples include: active learning of classifiers; automated teaching systems; augmenting unsupervised learning with interactive post-editing; and so on. In particular, such interaction is a basic mechanism by which we can communicate our needs and preferences to the computers that play an increasing role in our lives.
It would be helpful to have unifying mathematical frameworks that can provide a basis for evaluating interactive schemes, and that supply generic interaction algorithms. I will describe one such mathematical framework, that covers a fairly broad range of situations, and illustrate how it yields algorithms for interactive hierarchical clustering and interactive topic modeling.
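As one minimal instance of the kind of interactive scheme such a framework would cover, consider pool-based active learning with uncertainty sampling, where the learner repeatedly asks an expert for the label of the point it is least sure about; the classifier, seed set, and query budget below are assumptions for this sketch, not part of the talk:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
# Seed set with a few labels from each class; the rest form the unlabeled pool.
labeled = [int(i) for i in np.where(y == 0)[0][:5]] + \
          [int(i) for i in np.where(y == 1)[0][:5]]
pool = [i for i in range(len(X)) if i not in labeled]

clf = LogisticRegression()
for _ in range(40):  # 40 label queries
    clf.fit(X[labeled], y[labeled])
    probs = clf.predict_proba(X[pool])[:, 1]
    # Query the pool point closest to the decision boundary.
    query = pool[int(np.argmin(np.abs(probs - 0.5)))]
    labeled.append(query)  # here a human expert would supply y[query]
    pool.remove(query)

clf.fit(X[labeled], y[labeled])
print(f"accuracy with {len(labeled)} labels:", clf.score(X, y))
```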
Bio: Sanjoy Dasgupta is a Professor in the Department of Computer Science and Engineering at UC San Diego. He received his PhD at UC Berkeley in 2000. He works on algorithms for machine learning, with a focus on unsupervised and interactive learning. He is the author of a textbook, 'Algorithms', with Christos Papadimitriou and Umesh Vazirani. He was program co-chair for the Conference on Learning Theory (COLT) in 2009 and for the International Conference on Machine Learning (ICML) in 2013.
10:00-10:30 Coffee Break, Panorama Foyer
10:30-12:10 Bayesian Methods, Panorama Ballroom A, B
Session Chair: Rebecca Steorts
420 Signal-based Bayesian Seismic Monitoring
180 Fast Bayesian Optimization of Machine Learning Hyperparameters on Large Datasets
82 Near-optimal Bayesian Active Learning with Correlated and Noisy Tests
298 On the Hyperprior Choice for the Global Shrinkage Parameter in the Horseshoe Prior
12:10-1:30 Lunch on your own (note shorter lunch)
1:30-3:10 Large-scale learning, Panorama Ballroom A, B
Session Chair: Pradeep Ravikumar
417 Communication-Efficient Learning of Deep Networks from Decentralized Data
520 Automated Inference with Adaptive Batches
224 Adaptive ADMM with Spectral Penalty Parameter Selection
372 Identifying groups of strongly correlated variables through Smoothed Ordered Weighted L_1-norms
3:10-3:40 Coffee Break, Panorama Foyer
3:40-5:20 Sketching, Panorama Ballroom A, B
Session Chair: Anastasios (Tasos) Kyrillidis
384 Sketching Meets Random Projection in the Dual: A Provable Recovery Algorithm for Big and High-dimensional Data
399 Sketchy Decisions: Convex Low-Rank Matrix Optimization with Optimal Storage
192 Co-Occurring Directions Sketching for Approximate Matrix Multiply
117 Random Consensus Robust PCA