50 things I learned at NIPS 2016

Andreas Stuhlmüller
Ought
Dec 13, 2016 · 14 min read


I learned many things about AI and machine learning at the NIPS 2016 conference. Here are a few that are particularly suited to being communicated in the space of a few sentences.

I’ve attempted to link to the person or people who inspired a particular thought, but there’s a lot of variation in how direct the connection is, and any particular item may not reflect the opinion of the linked person.

5680 people registered for NIPS this year

Applied machine learning

  • What methods win Kaggle competitions? Gradient tree boosting (especially XGBoost) and deep neural nets (especially convolutional nets for images and RNNs for some time series problems). 👤
  • Ensembles add 2–5% in performance over the best individual methods, but also lead to more complex systems, so are often not worth it in practice. 👤
  • Current machine learning techniques work best when training data and real data come from the same distribution. When an algorithm is likely to be deployed in a setting that differs from the training setting, it can be useful to have the test set come from a different distribution than the training set, one that ideally mirrors the shift you expect in the real application data. This way, you get a better sense of how the algorithm does under distribution shift. 👤
  • More specifically, if you have two sources of data—say, a large set of general speech data and a much smaller set of in-car speech data—and you want to build a supervised learner that does well on the small set, Andrew Ng recommends this recipe that involves splitting each of the two sets and then step-by-step reducing each of four kinds of errors. 👤
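To make that recipe concrete, here is a minimal sketch of the splits and the four error gaps for the speech example above. The dataset names, split fractions, and diagnostic comments are my own reading of the recipe, not taken from the talk.

```python
import numpy as np

def split(data, fraction):
    """Randomly split a dataset into two parts."""
    data = np.random.permutation(data)
    n = int(len(data) * fraction)
    return data[:n], data[n:]

# Stand-ins for the two data sources in the example above.
general_speech = np.arange(100_000)   # large, general-domain dataset
in_car_speech = np.arange(5_000)      # small, in-domain dataset we care about

train, train_dev = split(general_speech, 0.95)  # train on general data
dev, test = split(in_car_speech, 0.5)           # evaluate on in-car data

# After training a model on `train`, compare four error rates in sequence:
#   human-level error vs. train error:      avoidable bias (underfitting)
#   train error       vs. train-dev error:  variance (overfitting to train)
#   train-dev error   vs. dev error:        train/test data mismatch
#   dev error         vs. test error:       overfitting to the dev set
```

Each gap points at a different fix: a bigger model for bias, more data or regularization for variance, more in-domain data for mismatch, and a bigger dev set if you have overfit to it.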
The view from our hotel room in the morning

Neural nets

  • Why does deep learning work now, but not 20 years ago, even though many of the core ideas were there? In one sentence: We have more data, more compute, better software engineering, and a few algorithmic innovations (many layers, ReLUs, better initialization and learning rates, dropout, LSTMs). 👤
  • But why does gradient-based optimization work at all in neural nets despite the non-convexity? One possible, partial answer is overprovisioning: There are generally many hidden units, and there are many ways a neural net can approximately implement the desired input-output relationship. You only need to find one. 👤
  • There’s a potentially more biologically plausible alternative to backprop called equilibrium propagation that requires neither an explicit loss function nor gradients. Training works something like this: (1) Clamp the input of the system to some input value. (2) Let the system converge until there is a stable predicted output. (3) Measure some stats within the system. (4) Clamp the output to the true output value. (5) Let the system converge again. (6) Measure the same stats as before. (7) Update the system’s parameters based on the difference in stats. 👤
  • If you take an LSTM and add a “time gate” that controls at what frequency to be open to new input and how long to be open each time, you can have different neurons that learn to look at a sequence with different frequencies, create a “wormhole” for gradients, save compute, and do better on long sequences and when you need to process inputs from multiple sensors that are sampled at different rates. 👤
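For the time-gate idea in the last bullet, here is a rough numpy sketch of how such a gate might look, based on my reading of the Phased LSTM paper. The parameterization (period, phase shift, open ratio, leak) is from memory, and the surrounding LSTM machinery is elided.

```python
import numpy as np

def time_gate(t, period, shift, r_on, alpha=1e-3):
    """Openness k(t) in [0, 1] of a neuron's "time gate" at time t.

    The gate ramps open and closed during a fraction `r_on` of each period
    and is nearly shut (small leak `alpha`) the rest of the time.
    """
    phase = ((t - shift) % period) / period
    return np.where(
        phase < 0.5 * r_on, 2.0 * phase / r_on,
        np.where(phase < r_on, 2.0 - 2.0 * phase / r_on, alpha * phase))

print(time_gate(np.linspace(0.0, 4.0, 9), period=2.0, shift=0.0, r_on=0.2))

# Inside an (elided) LSTM step, the gate blends the proposed new state with
# the old one, so most of the time the old state passes through untouched:
#   c_t = k * c_proposed + (1 - k) * c_prev
#   h_t = k * h_proposed + (1 - k) * h_prev
```

Because neurons with different periods only update occasionally, gradients can flow through long stretches of untouched state, which is the "wormhole" mentioned above.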
The conference venue

Interacting with humans

  • Want to communicate a large dataset (of images, say) to a human using exemplars? One thing you can do is to first find a few prototypes (by minimizing the Maximum Mean Discrepancy between the prototype and data distributions), then add a few particularly atypical instances, chosen where the prototype and data distributions differ most. (A rough sketch appears after this list.) 👤
  • Here’s one approach to making robots less annoying: When the human says, “do x now”, make the robot directly execute the command x. Then use data about when such commands happen to learn what to do when the human doesn’t give commands. 👤
  • If we evaluate people’s questions as they try to figure out where ships are in a game of battleship, we find that people can judge which questions are best (according to expected information gain). But, for the most part, the questions they come up with themselves aren’t the most informative ones. 👤
  • What can you do if you want to elicit people’s true beliefs in a crowdsourcing setting where you don’t have access to the ground truth, such as when you ask “Is this essay well-reasoned?” Yes, you could use the Bayesian Truth Serum, but what if you don’t want to ask subjects difficult meta-level questions? As long as you have multiple independent tasks, you can use the Correlated Agreement Mechanism, which suggests that you reward people when they agree on correlated tasks, and punish people for agreement on uncorrelated tasks. 👤
  • When users interact with machine learning systems, it’s not just the systems that are learning—the users’ model of the system changes as well, but this is mostly neglected. How can we model this co-learning process? 👤
  • It’s often useful to have a human in-the-loop when we build machine learning systems (e.g. so that the system can actively delegate particularly difficult tasks to the human, or ask questions). But we can’t differentiate through human minds (yet), which prevents gradient-based end-to-end optimization of the other components. Is there anything we can do about this? 👤
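Here is a rough sketch of the prototypes-and-criticisms idea from the first bullet in this list, assuming an RBF kernel. It is a simplified, brute-force version of the approach: it drops the diversity term the actual method uses for criticisms, and the greedy search is naive.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """RBF kernel matrix k(x, y) = exp(-gamma * ||x - y||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def mmd2(K_xx, K_xs, K_ss):
    """Squared MMD between the data and the current prototype set."""
    return K_xx.mean() - 2.0 * K_xs.mean() + K_ss.mean()

def prototypes_and_criticisms(X, n_proto=3, n_crit=2, gamma=1.0):
    K = rbf_kernel(X, X, gamma)
    protos = []
    # Greedily add the point that most reduces the MMD between the data
    # distribution and the prototype set.
    for _ in range(n_proto):
        best, best_val = None, np.inf
        for i in range(len(X)):
            if i in protos:
                continue
            S = protos + [i]
            val = mmd2(K, K[:, S], K[np.ix_(S, S)])
            if val < best_val:
                best, best_val = i, val
        protos.append(best)
    # Criticisms: points where data and prototype distributions differ most,
    # i.e. with the largest absolute MMD "witness function" value.
    witness = K.mean(axis=1) - K[:, protos].mean(axis=1)
    order = np.argsort(-np.abs(witness))
    crits = [i for i in order if i not in protos][:n_crit]
    return protos, crits

X = np.random.randn(200, 2)
print(prototypes_and_criticisms(X))
```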
The Sagrada Família from the inside

Bayes in the time of neurons

  • It’s appealing to consider inferring a posterior over neural net parameters instead of searching for a single good parameter setting; in fact, it’s so appealing that the key ideas of Bayesian neural nets were already developed around 1987–1995. See Yarin Gal’s thesis for a brief history. 👤
  • People generally appreciate that this is a difficult task, but it’s easy to forget just how difficult this may be for real problems such as high-res image synthesis: the dimensionality of the parameter space is generally huge, much larger than the dimensionality of the input space, which may already be quite high-dimensional. 👤
  • On the other hand, a version of Stochastic Gradient HMC seems to make Bayesian Neural Nets useful enough for Bayesian Optimization of the hyperparameters of another (non-Bayesian) neural net on real tasks. 👤
  • Speaking of Bayesian Optimization: For short horizons (up to 30 steps), a neural net trained to do black-box optimization can do better than the standard Bayesian Gaussian Process approach and is a lot faster. 👤
  • How can we deal with adversarial examples? One might hope that simply putting a prior on parameters and being Bayesian would do the trick. It doesn’t. Instead, the existence of such examples seems to have something to do with the fact that our models make overly confident, linear extrapolations. So, specific priors might help, but it doesn’t seem obvious how to find ones that prevent such examples and don’t hurt generalization a lot. 👤 👤
Barcelona

Planning & reinforcement learning

  • X isn’t about Y, now for artificial agents as well: Let’s model planners that choose plans not (just) based on whether their actions achieve some immediate goal in the world, but based on how well these plans signal something about the agent (such as the agent’s goals). 👤
  • There are a number of approaches to hierarchical planning, including Hierarchical Abstract Machines, MaxQ, Skills, Dynamic Motion Primitives, and Options. So far, it has been a challenge to learn and benefit from the relevant abstractions within a single task, i.e. in the non-amortized setting. A new Option-Critic architecture seems to do somewhat better than Deep Q-Learning within some individual Atari games, but it doesn’t look like a big win yet. You can read this as an argument against current approaches, as an argument for the amortized setting, or both. 👤
  • Here’s how you might start to approach hierarchical planning with Deep RL: Replace the usual function Policy(State) → Action with a parameterized function Policy(State, Task) → Action. An action can either (a) recurse a level, using the same policy as before, but with a new task vector as input, (b) execute an action in the world, or (c) terminate the subtask and pop up a level. (A structural sketch follows this list.) 👤
  • Experience replay is a bit of a hack. Ultimately, we’ll need something smarter. 👤
  • Value iteration is similar enough to a sequence of convolutions and max-pooling layers that you can emulate an (unrolled) planning computation with a deep network: a value iteration network. This allows neural nets to do planning, e.g. moving from start to goal in grid-world, or navigating a website to answer a query. 👤
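Here is a structural sketch of the Policy(State, Task) → Action idea two bullets up. The control flow (recurse, act, pop) is my own rendering of that description, with a random stand-in where the learned policy would go.

```python
import random

def policy(state, task):
    """Stand-in for a learned Policy(State, Task) -> Action.

    Returns one of:
      ("recurse", subtask)  - descend a level with a new task vector
      ("act", primitive)    - execute a primitive action in the world
      ("done", None)        - terminate this subtask and pop up a level
    """
    # A real implementation would be a neural net; here we just pick randomly.
    return random.choice([
        ("recurse", f"sub({task})"),
        ("act", "primitive_action"),
        ("done", None),
    ])

def execute(env_step, state, task, depth=0, max_depth=3):
    """Run the policy on `task`, recursing into subtasks as needed."""
    while True:
        kind, arg = policy(state, task)
        if kind == "done" or depth >= max_depth:
            return state
        elif kind == "act":
            state = env_step(state, arg)       # take a primitive action
        elif kind == "recurse":
            state = execute(env_step, state, arg, depth + 1, max_depth)

# Toy "environment" whose state is just the list of primitive actions taken.
print(execute(lambda s, a: s + [a], [], task="make_coffee"))
```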
The Uber party

Reinforcement learning, in more depth

  • Which Deep RL methods work best? This review provides a big table comparing a number of state-of-the-art methods on multiple continuous control tasks, but all fail at tasks with hierarchical structure. 👤
  • To make Deep RL work in practice, take a look at these tips & tricks from John Schulman’s Deep RL course. 👤
  • One general lesson is that, as you iteratively improve your policy, it’s important to constrain the KL divergence between the old and new policy to be less than some constant δ. This δ (measured in nats) is better than a fixed step size, since the meaning of the step size changes depending on what the rewards and problem structure look like at different points in training. This is called Trust Region Policy Optimization (or, in a first-order variant, Proximal Policy Optimization), and it matters more as we do more experience replay. 👤 👤
  • If your policy has a small number of parameters (say 20), and sometimes even if it has a moderate number (say 2000), you might be better off using the Cross-Entropy Method than any of the fancy methods above. It works like this: (1) Sample n sets of parameters from some prior that allows for closed-form updating, e.g. a multivariate Gaussian. (2) For each parameter set, compute a noisy score by running your policy on the environment you care about. (3) Take (say) the top 20% of sampled parameter sets, fit a Gaussian distribution to this set, then go to (1) and repeat using this as the new prior. (A sketch follows this list.) 👤
  • For both RL and variational inference, there are two widely known ways of optimizing a policy (or variational distribution) based on sampled sequences of actions and outcomes: (a) the likelihood-ratio estimator, which updates the policy so that action sequences leading to higher scores happen more often, and which doesn’t need gradients of the score; and (b) the pathwise estimator, which adjusts individual actions so that the policy results in a higher score, and which does need those gradients. I previously assumed that, if you can use the pathwise estimator, it’s strictly better; but, for RL, it’s apparently the case that, while pathwise methods may be more sample-efficient, they work less generally due to high bias and don’t scale up as well to very high-dimensional problems. (Really?) (A small numerical comparison of the two estimators follows this list.) 👤
  • Suppose you want to train a neural net policy that can solve a fairly broad class of problems. Here’s one approach: (1) Sample 10 instances of the problem, and solve each of the instances using a problem-specific method, e.g. a method that fits and uses an instance-specific model. (2) Train the neural net to agree with all of the per-instance solutions. But if you’re going to do that, you might do even better by constraining the specific solutions and what the neural net policy would do to be close to each other from the start, fitting both simultaneously. 👤
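Here is a minimal sketch of the Cross-Entropy Method from a couple of bullets up. The "environment" is just a toy quadratic score; in RL, the score would instead come from rolling out the policy with the sampled parameters.

```python
import numpy as np

def cross_entropy_method(score, dim, n_samples=100, elite_frac=0.2, n_iters=50):
    """Maximize a (noisy) score function over parameter vectors using CEM."""
    mean, std = np.zeros(dim), np.ones(dim)          # initial Gaussian "prior"
    for _ in range(n_iters):
        # (1) Sample candidate parameter vectors.
        samples = mean + std * np.random.randn(n_samples, dim)
        # (2) Score each candidate (in RL: run the policy in the environment).
        scores = np.array([score(s) for s in samples])
        # (3) Refit the Gaussian to the top elite_frac of candidates.
        elite = samples[np.argsort(scores)[-int(n_samples * elite_frac):]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean

# Toy "environment": the best parameters are all 3s.
best = cross_entropy_method(lambda p: -np.sum((p - 3.0) ** 2), dim=20)
print(best.round(2))
```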
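And to make the likelihood-ratio vs. pathwise distinction concrete, here is a toy comparison (my own illustration, not from a talk): both estimators approximate the gradient of E[x²] with respect to the mean of a Gaussian "policy", whose true value is 2μ.

```python
import numpy as np

np.random.seed(0)
mu, sigma, n = 1.5, 1.0, 100_000
eps = np.random.randn(n)
x = mu + sigma * eps                      # samples from N(mu, sigma^2)
f = lambda v: v ** 2                      # the "score"; true gradient is 2 * mu

# (a) Likelihood-ratio (score-function) estimator:
#     only needs f(x), not its gradient.
grad_lr = np.mean(f(x) * (x - mu) / sigma ** 2)

# (b) Pathwise (reparameterization) estimator:
#     differentiates through x = mu + sigma * eps, so it needs df/dx = 2x.
grad_pw = np.mean(2 * x)

print(grad_lr, grad_pw, 2 * mu)           # both approach the true value 3.0
# The pathwise estimate typically has much lower variance here,
# but it required access to the gradient of f.
```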
Barcelona

Generative adversarial nets

  • If you have a finite dataset {x¹, x², …}, say of images, and you want to generate more instances of x; or if you have pairs {(x¹, y¹), (x², y²), …} and, given a new x, you would like to predict the corresponding y; then generative adversarial nets may be for you. At least in the domain of images, some of the most impressive results on such problems have been achieved using GANs. 👤
  • How do GANs work? There are two parameterized differentiable functions, a generator G (think “counterfeiter”) and a discriminator D (think “police”). For x chosen from your dataset, we’ll optimize D(x) to be near 1. For x sampled from the generator using input noise z, i.e. x=G(z), we’ll optimize D(x) to be near 0, but at the same time, we’ll optimize G such that D(G(z)) is near 1. If all goes well, we’ll step-by-step optimize the discriminator until it’s really good at distinguishing generated from real data, and at the same time optimize the generator to be really good at sampling data that is indistinguishable from the real thing. (A minimal training loop appears after this list.) 👤
  • If all goes well. So far, GANs are really finicky to train. Here’s a list of hacks that sometimes help. 👤
  • Why are GAN image samples so sharp, whereas variational autoencoder samples aren’t? One hypothesis is that it has something to do with the fact that the loss function for VAEs is the likelihood. But we can make GANs maximize likelihood as well, and GAN samples are still sharp, so this seems less plausible now. The reason probably has more to do with the fact that VAEs typically use a Gaussian likelihood, or perhaps with some other component of the model architecture, such as the particular approximation strategy used (e.g., VAEs optimize a lower bound). 👤 👤 👤
  • Three big open problems for GANs: (1) How do you address the fact that the minimax game between the generator and discriminator may never approach an equilibrium? In other words, how do you build a system using GANs so that you know that it will converge to a good solution? (2) Even if they do converge, current systems still have issues with global structure: they cannot count (e.g. the number of eyes on a dog) and frequently get long-range connections wrong (e.g. they show multiple perspectives as part of the same image). (3) How can we use GANs in discrete settings, such as for generating text? 👤
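Here is a minimal PyTorch sketch of that counterfeiter/police game on a toy 1-D problem (learn to produce samples near 4.0). Every architecture and hyperparameter choice here is arbitrary, and real image GANs need far more care (see the list of hacks above).

```python
import torch
import torch.nn as nn

# Toy setup: learn to generate 1-D "data" drawn from N(4, 1).
real_data = lambda n: torch.randn(n, 1) + 4.0
noise = lambda n: torch.randn(n, 16)

G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))               # counterfeiter
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())  # police

opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(5000):
    x_real, z = real_data(64), noise(64)
    x_fake = G(z)

    # Discriminator: push D(real) toward 1 and D(fake) toward 0.
    d_loss = bce(D(x_real), torch.ones(64, 1)) + \
             bce(D(x_fake.detach()), torch.zeros(64, 1))
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # Generator: push D(G(z)) toward 1, i.e. fool the discriminator.
    g_loss = bce(D(x_fake), torch.ones(64, 1))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()

print(G(noise(1000)).mean().item())   # should drift toward 4.0 if training goes well
```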
The conference venue

Chat bots

  • What methods work best for dialog automation right now? This depends on what exactly the task is, but overall, some form of RNN with extra memory seems to do best, and in particular seems to do better than n-grams and information retrieval methods (such as nearest neighbors and TF-IDF). Candidate architectures include LSTMs, the Hierarchical Recurrent Encoder-Decoder, a version thereof that adds a latent variable and that can perhaps handle ambiguity and uncertainty better, Multiresolution RNNs that attempt to learn some compositional structure, End-to-End Memory Networks, and an improved version of those. My impression is that nothing works really well so far. 👤 👤
  • On the other hand, Kaggle’s Allen AI Science Challenge, which required algorithmic participants to answer multiple-choice questions from a standardized 8th grade science exam, was won using information retrieval methods, not RNNs. 👤
  • In dialog automation, one of the biggest challenges is in building up an accurate picture (or state) that summarizes the dialog so far. 👤
  • At Facebook, people are pursuing multiple approaches to dialog automation, but the main one is to go directly from dialog history to next response, without transparent intermediate state that can be used for training/evaluation. 👤
  • Facebook’s bAbI tasks include a variety of dialog tasks, including transactions (making a restaurant reservation), Q&A, recommendation, and chit-chat. The Ubuntu dataset with almost 1 million tech troubleshooting dialogs is another useful resource. 👤 👤
  • At the moment, some researchers build user simulators to train their dialog systems, but those are difficult to create — a simulator is effectively another dialog system, but it needs to mimic user behavior, and it’s hard to evaluate how well it is doing (in contrast to the dialog system that is being trained, there’s no notion of “task completion”). 👤
  • If you can’t collect huge numbers of dialogs from real users, what can you do? One strategy is to first learn a semantic representation based on other datasets to “create a space in which reasoning can happen”, and then start using this pre-trained system for dialogs. 👤
The view from our hotel room at night

Idea generators

  • Everything is an algorithm: It may be useful to view web experiments in the social sciences more explicitly as algorithms. Among other things, this makes it clearer that experimental design can take inspiration from existing algorithms, as in the case of MCMC with People. See also: If we formalize existing RL approaches such as training in simulation and reward shaping by writing them down as explicit protocol programs, maybe we can make it easier to incrementally improve these protocols. (I did some work on this project.) 👤 👤
  • Take some computation where you usually wouldn’t keep around intermediate states, such as a planning computation (say value iteration, where you only keep your most recent estimate of the value function) or stochastic gradient descent (where you only keep around your current best estimate of the parameters). Now keep around those intermediate states as well, perhaps reifying the unrolled computation in a neural net, and take gradients to optimize the entire computation with respect to some loss function. Instances: Value Iteration Networks, Learning to learn by gradient descent by gradient descent. (A toy example of unrolling SGD appears after this list.) 👤 👤
  • If we can overcome adversarial examples, we can train a neural net by giving it the score for a few prototypes (say, designs for cars and the rating a human designer assigned), and then use gradient descent on the inputs to synthesize exemplars that are better than any of the ones we can imagine. We have a “universal engineering machine”, if you like. (A toy version follows this list.) 👤
  • How can we implement high-level symbolic architectures using biological neural nets? Josh Tenenbaum now calls this “the modern mind-body problem”. 👤
  • Neural nets still contain a lot of discrete structure, e.g. how many neurons there are, how many layers, what activation functions we use, and what’s connected to what. Is there a way to make it all continuous, so that we can run gradient descent on both parameters and structure, with no discrete parts at all? 👤
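Here is a toy PyTorch example of the "keep the intermediate states and differentiate through the whole computation" pattern: unroll a few SGD steps on a simple quadratic, keep every iterate, and backprop through the unrolled trajectory to tune the inner step size. (The actual "learning to learn" paper learns a full RNN optimizer, not just a scalar step size.)

```python
import torch

# Inner problem: a few SGD steps on f(w) = ||w - target||^2.
target = torch.tensor([3.0, -1.0])
log_lr = torch.zeros(1, requires_grad=True)     # meta-parameter: log of the inner step size
meta_opt = torch.optim.Adam([log_lr], lr=0.05)

for meta_step in range(200):
    w = torch.zeros(2, requires_grad=True)      # fresh inner initialization
    trajectory = [w]
    for t in range(5):                          # unroll 5 inner SGD steps...
        loss = ((w - target) ** 2).sum()
        grad, = torch.autograd.grad(loss, w, create_graph=True)
        w = w - torch.exp(log_lr) * grad        # ...keeping every intermediate iterate
        trajectory.append(w)
    # Judge the whole unrolled computation, then backprop through it
    # to improve the inner step size.
    meta_loss = sum(((w_t - target) ** 2).sum() for w_t in trajectory[1:])
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()

print(torch.exp(log_lr).item())   # learned inner step size (0.5 is optimal for this quadratic)
```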
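And here is a toy version of the "universal engineering machine" loop: gradient ascent on the input of a scorer network. The scorer here is random and untrained, which also illustrates the caveat in that bullet, since the loop happily finds inputs that exploit whatever the scorer happens to reward.

```python
import torch
import torch.nn as nn

# Stand-in for a net trained to predict a human designer's rating from a design vector.
scorer = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 1))
for p in scorer.parameters():
    p.requires_grad_(False)                    # optimize the input only, not the scorer

design = torch.randn(1, 10, requires_grad=True)   # start from some initial design
opt = torch.optim.Adam([design], lr=0.1)

for step in range(200):
    opt.zero_grad()
    loss = -scorer(design).mean()              # gradient *ascent* on the predicted score
    loss.backward()
    opt.step()

print(scorer(design).item())                   # higher than the initial design's score
```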
The view from the Passion tower of the Sagrada Família

Tidbits and factoids

  • 20 years ago, Jürgen Schmidhuber’s first submission on LSTMs got rejected from NIPS. 👤
  • For some products at Baidu, the main purpose is to acquire data from users, not revenue. 👤
  • Boston Dynamics doesn’t use any learning in their robots (so far), including the new Spot Mini demoed at NIPS—it’s all manually programmed. 👤
  • For speech recognition, ML algorithms are now benchmarked against teams of humans, not individuals. 👤
  • When Zoubin Ghahramani asked who in the audience knew the PDP volumes, essentially no hands went up and he was sad. 👤

If you notice any mistakes, please email me at andreas@ought.com.
