The Cartpole environment is one of the simplest MDPs. It is extremely low dimensional, with a four-dimensional observation space and only two actions. The goal of this exercise is to implement several RL algorithm in order to get practical experience with such methods.

The small size and simplicity of this environment makes it is possible to run very quick experiments, which is essential when learning the basics.

Start with a simple linear model (that has only four parameters), and use the sign of the weighted sum to choose between the two actions.

The random guessing algorithm: generate 10,000 random configurations of the model's parameters, and pick the one that achieves the best cumulative reward. It is important to choose the distribution over the parameters correctly.
The hill-climbing algorithm: Start with a random setting of the parameters, add a small amount of noise to the parameters, and evaluate the new parameter configuration. If it performs better than the old configuration, discard the old configuration and accept the new one. Repeat this process for some number of iterations. How long does it take to achieve perfect performance?
Policy gradient algorithm: here, instead of choosing the action as a deterministic function of the sign of the weighted sum, make it so that action is chosen randomly, but where the distribution over actions (of which there are two) depends on the numerical output of the inner product. Policy gradient prescribes a principled parameter update rule [1, 2]. Your goal is to implement this algorithm for the simple linear model, and see how long it takes to converge.

What happens to the above algorithm when the policy is a neural network with tens of thousands of parameters?

Notes

This is a simple task that is meant to help newcomers gain practical experience with implementing simple RL algorithms.

Solutions

Results and some intuition behind the algorithms at this post , and here is the code used.

Train a language model on a jokes corpus

Im2Latex

Sequence-to-sequence models with attention have been enormously successful. They made it possible for neural networks to reach new levels of state of the art in machine translation, speech recognition, and syntactic parsing. Thanks to this work, neural network can now consume inputs of arbitrary shapes, and output sequences of variable length, without much effort on the practitioner's side.

Implement an attention model that takes an image of a PDF math formula, and outputs the characters of the LaTeX source that generates the formula.

Getting Started

For a quick start, download a prebuilt dataset or use these tools to build your own dataset. Alternatively, you can proceed manually with the following steps:

Download a large number papers from arXiv. There is a collection of 29,000 arXiv papers that you could get started with. It is likely that this set of 29,000 papers may contain several hundred thousand formulas, which is more than enough for getting started. As the bandwidth of arXiv is limited, it is important to be mindful of their constraints and to not write crawlers to download all the papers of arxiv.
Use a heuristic to find all the LaTeX formulas in the LaTeX source. It can be done by looking for the text that lies between \begin{equation} and \end{equation}. Here is a list of some of the places where equations can appear in latex files. Additional examples can be found here. It is likely that even a simple heuristic for extracting latex formulas should produce in excess of 100,000 equations; if not, keep refining the heuristic.
Compile images of all the formulas. To keep track of the correspondence between the latex formulas and their images, it is easiest to place exactly one formula on each page. Then, when processing the latex file, it is easy to keep track of the pages. Be sure to not render formulas so large that they exceed an entire page. Also, be sure to render the formulas in several fonts.
Train a visual attention sequence-to-sequence model (as in the Show, Attend, and Tell paper, or perhaps a different variant of visual attention) that would take an image of a formula as input, and output the latex source of the formula, one character at a time. A Theano implementation of the Show, Attend, and Tell paper can help you get started. If you wish to implement your model from scratch, TensorFlow can be a good starting point.
It takes some effort to correctly implement a sequence-to-sequence model with attention. To debug your model, we recommend that you start with a toy synthetic OCR problem, where the inputs are long images that are obtained by concatenating sequences of images of MNIST digits, and the labels should be a sequence of their classifications. While this problem can be solved without an attention model, it is useful as a sanity check, to ensure that the implementation is not badly broken.
We recommend trying the Adam optimizer.

Notes

A success here would be a very cool result and could be used to build a useful online tool.

While this is a very non-trivial project, we've marked it with a one-star difficulty rating because we know it's solvable using current methods. It is still very challenging to really do it, as it requires getting several ML components together correctly.

Solutions

Results, data set, code, and a write-up are available at http://lstm.seas.harvard.edu/latex/. The model is trained on the above data sets and uses an extension of the Show, Attend and Tell paper combined with a multi-row LSTM encoder. Code is written in Torch (based on the seq2seq-attn system), and the model is optimized using SGD. Additional experiments are run using the model to generate HTML from small webpages.

Difference of value functions

Improved Q-learning with continuous actions

Q-learning is one of the oldest and general reinforcement learning (RL) algorithms. It works by estimating the long term future expected return for each state-action pair. Essentially, the goal of the Q-learning algorithm is fold the long term outcome of each state-action pair into a single scalar that tells us how good the state-action pair combination is; then, we could maximize our reward by picking the action with the greatest value of the Q-function. The Q-learning algorithm has been the basis of the DQN algorithm that demonstrated that the combination of RL and deep learning is a fruitful one.

Your goal is to create a robust Q-learning implementation that can solve all Gym environments with continuous action spaces without changing hyperparameters.

You may want to use the Normalized Advantage Function (NAF) model as a starting point. It is especially interesting to experiment with variants of the NAF model: for example, try it with a diagonal covariance. It can be also interesting to explore an advantage function that uses the maximum of several quadratics, which is a convenient function because their argmax is easy to compute.

Notes

This project is mainly concerned with reimplementing an existing algorithm. However, there is significant value in obtaining a very robust implementation, and there is a decent chance that new ideas will end up being required to get it working reliably, across multiple tasks.

Reduced-Information Program Learning

A difficult machine learning task is that of program, or algorithm, learning. Examples of simple algorithmic tasks are RepeatCopy and ReversedAddition. Theoretically, recurrent neural networks (RNNs) are Turing-complete (if they have large enough memory and enough timesteps between reading the input and emitting the output), which means that they can model any computable function. In practice, however, it has been difficult to learn algorithms that output the intended answer on every conceivable input (as opposed to outputting the correct answer on the training data distribution only).

The Neural Programmer-Interpreter (NPI) is an example of a program-learning model which uses execution traces as its supervised signal. This is opposed to the notion of program induction, where programs must be inferred from input-output pairs. Using execution traces as a supervisory signal, the NPI learn to solve extremely hard problems and generalize to inputs of greater length. However, most interesting problems do not have execution traces available: if we knew the detailed execution traces, we would probably know how to solve the problem as well.

The challenge is thus to achieve similar results with partial execution traces. A partial execution trace provides a compact high level description of the way in which a particular instance was solved. Given an input, a partial execution trace provides a rough description of how the solution was computed, without going into all the details. Other weak supervision can provided -- such as the input-output pairs. The challenge in this problem is to design a model that is actually capable of learning from partial execution traces, and to show that it can learn algorithmic tasks (such as bubble sort, quick sort, multiplication, various string operations, etc) quickly. It is desirable to develop a model that can solve these problems using the least specific execution traces possible.

The Inverse DRAW model

Investigate an “Inverse DRAW” model.

The DRAW model is a generative model of natural images that operates by making a large number of small contributions to an additive canvas using an attention model. The attention model used by the DRAW model identifies a small area in the image and "writes" to it.

In the inverse DRAW model, there is a stochastic hidden variable and an attention model that reads from these hidden variables. The outputs of the attention model are provided to an LSTM that produces the observation one dimension (or one group of dimensions) at a time. Thus, while the DRAW model uses attention to decide where to write on the output canvas, the inverse DRAW uses attention to choose the latent variable to be used at a given timestep. The Inverse DRAW model can be seen as a Neural Turing Machine generative model that emits one dimension at a time, where the memory is a read-only latent variable.

The Inverse DRAW model is an interesting concept to explore because the dimensionality of the hidden variable is decoupled from the length of the input. In more detail, the Inverse DRAW model is a variational autoencoder, whose p-model emits the observation one dimension at a time, using attention to choose the appropriate latent variable for each visible dimension. There is a fair bit of choice in the architecture of the approximate posterior. A natural choice is to use the same architecture for the posterior, where the observation will be playing the role of the latent variables.

A useful property of the Inverse DRAW model is that its latent variables may operate at a rate that is different from the observation. This is the case because each dimension of the observation gets assigned to one hidden state. If this model were to successfully be made deep, we would get a hierarchy of representation, where each representation is operating at a variable rate, which is trained to be as well-suited as possible for the current dataset.

It would be interesting to apply this model to a text dataset, and to visualize the latent variables, as well as the precise way in which the model assigns words to latent variables.

Notes

It is a hard project as it is not clear that a models like this can be made to work with current techniques. However, it makes success all the more impressive.

The inverse DRAW model may have a cost function that's very difficult to optimize, so expect a struggle.

Solutions

The code and an associated paper for a model implementing a version of the "Inverse DRAW" model is available here.

Multiobjective RL

In reinforcement learning, we often have several rewards that we care about. For example, in robotic locomotion, we want to maximize forward velocity but minimize joint torque and impact with the ground.

The standard practice is to use a reward function that is a weighted sum of these terms. However, it is often difficult to balance the factors to achieve satisfactory performance on all rewards.

Filter methods are algorithms from multi-objective optimization that seek to generate a sequence of points, so that each one is not strictly dominated by a previous one (see Nocedal & Wright, chapter 15.4).

Develop a filter method for RL that jointly optimizes a collection of reward functions, and test it on the Gym MuJoCo environments. Most of these have summed rewards; you would need to inspect the code of the environments to find the individual components.

Related work

There exists some prior work on multiobjective optimization in an RL context. See the following review paper by Roijers et al.

Notes

Filter methods have not been applied to RL much, so there is a lot of uncertainty around the difficulty of the problem.

Multitask RL with continuous actions.

At present, most machine learning algorithms are trained to solve one task and one task only. But we do not necessarily train models on only one task at a time because we believe that it is the best approach in the long term; on the contrary, while we would like to use multitask learning in as many problems as possible, the multitask learning algorithms are not yet at a stage where they provide a robust and a sizeable improvement across a wide range of domains.

This sort of multitask learning should be particularly important in reinforcement learning settings, since in the long run, experience will be very expensive relative to computation and possibly supervised data. For this reason, it is worthwhile to investigate the feasibility of multitask learning using the RL algorithms that have been developed so far.

Thus the goal is to train a single neural network that can simultaneously solve a collection of MuJoCo environments in Gym. The current enviroments are dissimilar enough that it is unlikely that information can be shared between them. Therefore, your job is to create a set of similar environments that will serve as a good testbed for multi-task learning. Some possibilities include (1) bipedally walking with different limb dimensions and masses, (2) reaching with octopus arms that have different numbers of links, (3) using the same robot model for walking, jumping, and standing.

At the end of learning, the trained neural network should be told (via an additional input) which task it's running on, and achieve high cumulative reward on this task. The goal of this problem is to determine whether there is any benefit whatsoever to training a single neural network on multiple environments versus a single one, where we measure benefit via training speed.

We already know that the multitask learning on Atari has been difficult (see the relevant papers). But will multitask learning work better on MuJoCo environments? The goal is to find out.

The most interesting experiment is to train a multitask net of this kind on all but one MuJoCo environment, and then see if the resulting net can be trained more rapidly on a task that it hasn't been trained on. In other words, we hope that this kind of multitask learning can accelerate training of new tasks. If successful, the results can be significant.

Notes

It is a reasonably risky project, since there is a chance that this kind of transfer will be as difficult as it has been for Atari.

Natural Q-learning

Implement and test a natural version of Q-learning, and compare it to regular Q-learning.

Natural Gradient is a promising idea that has been explored in a significant number of papers and settings. Despite its appeal, modern approaches to natural gradient have not been applied to Q-learning with nonlinear function approximation.

The intuition behind natural gradient is the following: we can identify a neural network with its parameters, and use the backpropagation algorithm to slowly change the parameters to minimize the cost function. But we can also think of a neural network as of a high-dimensional manifold in the infinite-dimensional space of all possible functions, and we can, at least conceptually, run gradient descent in function space, subject to the constraint that we stay on the neural network manifold. This approach has the advantage that it does not depend on the specific parameterization used by the neural network; for example, it is known that tanh units and sigmoid units are precisely equivalent in the family of neural networks that they can represent, but their gradients are different. Thus, the choice of sigmoid versus tanh will affect the backpropagation algorithm, but it will not affect the idealized natural gradient, since natural gradient depends entirely on the neural network manifold, and we already established that the neural network manifold is unaffected by the choice of sigmoid versus tanh. If we formalize the notion of natural gradient, we get that the natural gradient direction is obtained by inverting the regular gradient by the Fisher information matrix. The result is still a challenging problem, but it can be addressed in a variety of ways, some of which are discussed in the papers linked above. But the relevant fact about natural gradient is that its behavior is much more stable and benign in a variety of settings (for example, natural gradient is relatively unaffected by the order of the data in the training set, and is highly amenable to data parallelism), which suggests that natural gradient could improve the stability of the Q-learning algorithm as well.

In this project, your goal is to figure out how to meaningfully apply natural gradient to Q-learning, and to compare the results to a good implementation of Q-learning. Thus, the first step of this project is to implement Q-learning. We recommend either staying with discrete domains (such as Atari), or continuous domains, and to use methods similar to Normalized Advantage Function (NAF). The continuous domains are easier to work with because they are of lower dimensionality and are therefore simpler, but NAF can be harder to implement than standard Q-learning.

It would be especially interesting if Natural Q-learning were capable of solving the RAM-Atari tasks.

Notes

This project isn't guaranteed to be solvable: it could be that Q-learning's occasional instability and failure has little to do with whether it is natural or not.

Solutions

NGDQN model, results, and paper trained on a discrete environment available here.

Parallel TRPO

As it is always desirable to train larger models on harder domains, one important area of research is parallelization. Parallelization has played an important role in deep learning, and has been especially successful in reinforcement learning. The successful development of algorithms that parallelize well will make it possible to train larger models faster, which will advance the field.

The goal of this project is to implement the Trust Region Policy Optimization (TRPO) algorithm so that it would use multiple computers to achieve 15x lower wall-clock time than joschu's single-threaded implementation on the MuJoCo or Atari Gym environments. Given that TRPO is a highly stable algorithm that is extremely easy to use, a well-tuned parallel implementation could have a lot of practical significance.

You may worry that in order to solve this problem, you would need access to a large number of computers. However, it is not so, as it is straightforward to simulate a set of parallel computers using a single core.

Make sure your code remains generic and readable.

Notes

It is known that RL algorithms can be parallelized well, so we expect it to be possible to improve upon the basic implementation. What is less obvious is whether it is possible to get 15x speedup using, say, only 20x more nodes.

Solutions

Preliminary paper describing TRPO with parallel actors here , with the implementation avaiable at this repo. Current results are a 3x speedup with when using 4 cores.

Q-learning on the RAM variant of Atari

The Q-learning algorithm had a lot of success on learning to play Atari games when the inputs to the model are pixels. Atari games are designed to run on a computer that has a very small amount of RAM, so it can be interesting to try to learn to play Atari games when the input to the neural network is the RAM state of the Atari emulator. Surprisingly, getting Q-learning to work well when the inputs are RAM states has been unexpectedly challenging.

Thus, your goal is to develop a Q-learning implementation that can solve many Atari games when the input to the neural network is the RAM state, using the same setting of hyperparameters on all tasks. In your experiments, use the Gym Atari environments that are presented in the RAM way, where the inputs are the complete RAM state of the Atari computer.

The hope here is that in order to succeed, you'd need to invent techniques for Q-learning that are generally applicable, which will be useful.

Notes

This project might not be solvable. It would be surprising if it were to turn out that Q-learning would never succeed on the RAM variants of Atari, but there is some chance that it will turn out to be challenging.

Solutions

The preliminary results can be read in the paper and here are the instructions to run the code. The work has been accepted at Computer Games Workshop and will be presented on 9 July 2016 during IJCAI conference in New York. Feel free to pass by, if you're there!

Better sample efficiency for TRPO

Trust Region Policy Optimization (TRPO) is a scalable implementation of second order policy gradient algorithm that is highly effective on both continuous and discrete control problems. One of the strengths of TRPO is that it is relatively easy to set its hyperparameters, as a hyperparameter setting that performs well on one task tends to perform well on many other tasks. But despite its significant advantages, the TRPO algorithm could be more data efficient.

The problem is to modify a good TRPO implementation so that it would converge on all of Gym's MuJoCo environments using 3x less experience, without a degradation in final average reward. Ideally, the new code should use the same setting of the hyperparameters for every problem.

This will be an impressive achievement, and the result will likely be scientifically significant.

When designing the code, you may find the following ideas useful:

Dynamically adjust the size of the mini-batch.
Dynamically adjust the radius of the trust region.
Use past samples with correctly-set importance weights.

Notes

This problem is very hard, as getting an improvement of this magnitude is likely to require new ideas.

Cartpole: for newcomers to RL

Notes

Solutions

Train a language model on a jokes corpus

Im2Latex

Getting Started

Notes

Solutions

Difference of value functions

Notes

Improved Q-learning with continuous actions

Notes

Reduced-Information Program Learning

The Inverse DRAW model

Notes

Solutions

Multiobjective RL

Related work

Notes

Multitask RL with continuous actions.

Notes

Natural Q-learning

Notes

Solutions

Parallel TRPO

Notes

Solutions

Q-learning on the RAM variant of Atari

Notes

Solutions

Better sample efficiency for TRPO

Notes

Description2Code