Why DeepMind’s breakthrough ML algorithm was so tricky to train
Pulling back the curtain on DeepMind's secrets.

In the past decade, Deep Reinforcement Learning has conquered games ranging from Go to StarCraft and Poker. This progress can be traced back to DeepMind’s DQN algorithm [1], the first time Deep Learning and Reinforcement Learning were successfully combined.
First published in 2013, Deep Q-Networks (DQN) learn to take actions that outperform humans in a suite of Atari games, taking only the pixel values of the game as input. Learning such effective behaviour directly from experience is a remarkable achievement. It's the first hint at a viable path towards Artificial General Intelligence - the ability of computers to act intelligently across a wide range of tasks, at a level similar to or greater than that of humans!
Part of the reason DQN was such an impressive advance is that Q-learning - the subtype of Reinforcement Learning used - is very hard to train. To understand why, we'll first look at how Q-Learning works under the hood. We'll then explore why it's so tricky, and dig into how DeepMind trained DQN successfully.
How Q-Learning works
DeepMind’s breakthrough combined neural networks with reinforcement learning (RL) for the first time. An RL agent interacts with an environment, receiving rewards if it takes favourable actions. The goal of an RL algorithm is to maximise the long-term sum of rewards it receives. Specifically, DQN uses a type of RL called Q-learning (hence the name DQN) - we’ll see what this means shortly.
This makes RL incredibly powerful: in principle, any problem that can be framed as a sequential decision-making task can be tackled with RL.
Figure 1: The interaction loop in RL between an agent and the environment. At each timestep, the agent selects an action, which changes the state of the environment. The environment also provides a reward signal to signify whether the agent’s behaviour is favourable or not. RL learns how to select actions to maximize the sum of future rewards.
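To make this interaction loop concrete, here is a minimal Python sketch. The `agent` and `env` objects and their method names are hypothetical (a gym-style interface is assumed for illustration); this is not code from the DQN paper.

```python
# A minimal sketch of the RL interaction loop (hypothetical agent/env interface).
def run_episode(agent, env):
    state = env.reset()                                # start a new episode
    total_reward = 0.0
    done = False
    while not done:
        action = agent.select_action(state)            # agent picks an action
        next_state, reward, done = env.step(action)    # environment responds
        agent.observe(state, action, reward,
                      next_state, done)                # agent learns from the transition
        total_reward += reward
        state = next_state
    return total_reward
```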
However, the future reward from being in a particular game state or taking an action is non-trivial to estimate. One complication is that the effect your actions have on the environment can be uncertain, meaning the rewards you receive are also uncertain. Estimation is especially hard in cases where we don't have all the rules of how the environment works. How can we know what rewards will result from an action when those rewards are uncertain, distant in the future, and the number of possible states is very large? For example, the optimal way to play a level of Super Mario could be to jump at the first frame. But how do you know the value of this action when the reward is all the way at the end of the level?
The approach taken by Q-learning is to learn an 'action-value function', also known as a Q-function. The Q-function assigns a numerical value to every state-action combination: an estimate of the expected future rewards from taking that action in that state. In Q-learning, we learn this estimate by interacting with the environment and updating the Q-values of the state-action pairs we visit. After taking an action, we update the Q-value of that state-action pair using the reward we received and the Q-values of the new state of the environment. By repeating this many times, our estimates move closer to the true Q-values, and we can take actions according to them. A schematic depiction of a Q-function in a small grid-world task, and a comparison to a state-value function (also commonly used), is shown in figure 2.
Figure 2: State-value function and Action-value functions for a 3 x 3 gridworld task.
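To make the update rule concrete, here is a minimal sketch of tabular Q-learning on a tiny gridworld-style problem. The learning rate and discount factor values are illustrative choices, not the paper's settings.

```python
import numpy as np

n_states, n_actions = 9, 4           # e.g. a 3 x 3 gridworld with 4 moves
alpha, gamma = 0.1, 0.99             # learning rate and discount factor (illustrative)
Q = np.zeros((n_states, n_actions))  # the Q-table: one value per state-action pair

def q_learning_update(state, action, reward, next_state, done):
    # Target: reward plus the discounted value of the best action in the next state.
    target = reward if done else reward + gamma * np.max(Q[next_state])
    # Move the current estimate a small step towards the target.
    Q[state, action] += alpha * (target - Q[state, action])
```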
Our estimated Q-function defines a value for every state-action pair. For Atari games, the number of state-action pairs becomes intractably large to store in a simple table. For example, in Breakout, if we take just the positions of the paddle and the ball, on a screen that ranges from 300 to 800 pixels in each dimension, the number of states is between 10⁹ and 10¹¹. And that's before we take into account the Breakout blocks being broken. This vast number of states is why Deep Neural Networks are necessary - we need a powerful way of representing the Q-function.
Combining Q-Learning and neural networks results in an algorithm that looks like this:
Figure 3: Schematic depiction of the Deep Q-Networks algorithm that learned to play Atari games to a superhuman level, based only on pixel values and rewards.
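As a rough illustration of what 'representing the Q-function with a neural network' means, here is a PyTorch sketch of a convolutional Q-network that maps a stack of preprocessed frames to one Q-value per action. The layer sizes follow the commonly cited (2015) DQN architecture rather than necessarily the exact network in [1]; treat them, and the usage snippet, as illustrative.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of 4 preprocessed 84x84 frames to one Q-value per action."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),  # 7x7 feature maps for 84x84 inputs
            nn.Linear(512, n_actions),              # one Q-value per action
        )

    def forward(self, frames):
        return self.net(frames)

# Acting greedily: pick the action with the highest predicted Q-value.
q_net = QNetwork(n_actions=4)
q_values = q_net(torch.zeros(1, 4, 84, 84))   # batch of one stacked observation
action = q_values.argmax(dim=1).item()
```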
Why Deep Q-Networks are hard to train
The combination of Q-Learning and neural networks is in theory very powerful: Q-learning can be applied to any sequential decision-making task, and neural networks can approximate almost any function. If we could crack this combination, it would have a vast range of potential applications, from self-driving cars to robotics.
However, it’s very hard to train this combination of Q-Learning and neural networks. Even after many iterations of taking actions in different states and obtaining rewards, sometimes the performance doesn’t improve. A commonly observed behaviour is that after apparent improvements, the performance drops.
This instability could still be seen after DeepMind released the paper announcing their breakthrough. Let's take a look at the example in figure 4, taken from DeepMind's follow-up paper that improved upon DQN (covered later). We can see that after initially improving in performance, DQN suffers sudden drop-offs. Why does this happen?
Figure 4: Comparison of value estimates (top) and game scores (bottom) for DQN (red) and Double DQN (blue) in different Atari games. Original source: [2]
Noisy updates
The Q-learning algorithm makes each update based on a single step of experience. But updating after just one step of experience can lead to unstable updates due to sampling error - the error that arises whenever we estimate a quantity from a limited sample of data points drawn from a distribution. In this case, we're sampling from the environment we are learning from.
If you train on a sequence of recent data points of experience, the data you see is also highly correlated. Since it typically takes many steps to traverse the state space, the next states you can visit are closely related to the state you are in. Learning from such correlated samples is inefficient - breaking up these correlations makes learning far more effective.
Stabilising DQN: Experience Replay
To mitigate this, DeepMind used a technique called Experience Replay in their DQN algorithm. Here, 'experience' refers to the state, action, reward and next state that the agent observes at a timestep. Experience Replay stores this experience from every timestep in a memory buffer and, at each training step, samples a random batch from it to learn from.
Sampling the training data at random in this way breaks the correlations between data points and therefore reduces the variance of the updates. And because each step of experience can be reused in many weight updates, less data is needed overall.
Figure 5: An overview of how Experience Replay works. Experience from the environment is stored in a replay buffer, which is randomly sampled from and fed into the RL algorithm for learning.
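Here is a minimal sketch of an experience replay buffer in Python. The capacity and batch size are illustrative defaults, not the paper's exact settings.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state, done) transitions and samples random batches."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experience is discarded once full

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the correlation between consecutive timesteps.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```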
Overestimating Q-values
In Q-Learning, the Q-function is used at three points:
- To get the Q-value of the current state-action pair (the value we are updating)
- To select which action in the successor state has the highest Q-value
- To evaluate the Q-value of that successor state-action pair
Combining Q-learning with a neural network naively uses the same network for all three, meaning that if we overestimate the value of a state, this overestimation is transferred to preceding states. This is because Q-learning uses the maximum estimated action value as an estimate of the maximum expected action value. This can lead to learning an incorrect Q-function estimate.
Let's consider an example to get an intuition for this. Imagine a state in which one action is the optimal choice, but its value happens to be underestimated so that it no longer has the highest Q-value for that state. When we update the Q-value of the preceding state, the update uses the maximum of the remaining Q-values - one of which may be overestimated. This is shown in the diagram below, with an overestimated Q-value being selected. During learning, it's normal for value estimates to be imprecise, meaning that overestimations are common.
This wouldn't be an issue if overestimations of Q-values were uniform across states: if all Q-values were shifted similarly, the actions we select would stay the same. However, empirically the overestimations usually aren't uniform. This means the policy resulting from the approximate Q-values won't necessarily converge to the optimal policy.
Figure 6: Why overestimation of Q-values occurs in Q-learning
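A quick numerical illustration of why taking the maximum over noisy estimates produces overestimation (toy numbers chosen for illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

true_q = np.zeros(10)                            # suppose all 10 actions are truly worth 0
noise = rng.normal(0.0, 1.0, size=(10_000, 10))  # unbiased estimation noise
estimates = true_q + noise                       # noisy Q-value estimates

print(estimates.mean())               # ~0: each individual estimate is unbiased
print(estimates.max(axis=1).mean())   # ~1.5: the max over actions is biased upwards
```

Even though every individual estimate is unbiased, the maximum over them is systematically too high - exactly the quantity Q-learning uses in its update target.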
Stabilising DQN: Double DQN
The solution to the overestimation problem is to use a second network. This was proposed in 2015 in a paper called ‘Deep Reinforcement Learning with Double Q-Learning’ [2].
In figure 4, we can see the true values in the top row - they are the flat lines. Comparing them with the DQN estimates, we can see the overestimation of Q-values. We also see that the blue lines - Double DQN - reduce this overestimation. In the scores shown in figure 4, we see that the overestimation leads to instability in DQN: the performance drops after initially increasing. So how does Double DQN achieve this improvement?
The name 'Double DQN' refers to having a second deep neural network. The network being trained (the 'online' network) is still used for action selection when interacting with the environment. The Q-function update, however, needs the Q-value of the successor state - and this is where the second, 'target', network comes in handy. The target network is usually an older copy of the network being trained. In Double DQN, the online network is used to find the action with the maximum Q-value in the successor state, while the target network is used to evaluate that action's Q-value. This evaluated Q-value is a key element of the update step. By decoupling the Q-values used for action selection and action evaluation, we are less likely to pick overestimated values.
The resulting algorithm looks like this:
Figure 7: Double DQN with a target network to prevent Q-learning from converging to an incorrect Q-function
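Putting the selection/evaluation split into code, here is a minimal PyTorch sketch of the Double DQN target computation. The function and argument names are my own, and the discount factor is an illustrative default.

```python
import torch

def double_dqn_targets(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Compute Double DQN targets: select actions with the online network,
    evaluate them with the target network."""
    with torch.no_grad():
        best_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # selection
        next_q = target_net(next_states).gather(1, best_actions).squeeze(1)  # evaluation
        return rewards + gamma * (1.0 - dones.float()) * next_q
```

The key design choice is that `argmax` and `gather` use different networks: if the online network overestimates an action, the target network is unlikely to overestimate the same one, so the resulting target is less biased.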
Wrap-up
We've seen why Q-Learning is hard to train, and how DeepMind developed smart tricks to improve training. We haven't gone all the way into RL theory here - to get more background and a deeper understanding, check out the tutorials in our Introduction to Reinforcement Learning course.
Since DQN came out, other major advances have been made. Famously, DeepMind mastered Go with its AlphaGo algorithm and went on to conquer StarCraft; Deep RL has also been applied to Poker and to reducing energy use. Each required specific adaptations, but the field of Deep Reinforcement Learning as a whole - begun by DQN - has the potential for remarkable generality in the problems it can solve. You just have to know the right tricks to use to train it.
Citations
[1] Mnih et al., 2013. Playing Atari with Deep Reinforcement Learning. https://arxiv.org/pdf/1312.5602.pdf
[2] van Hasselt, Guez and Silver, 2015. Deep Reinforcement Learning with Double Q-learning. https://arxiv.org/pdf/1509.06461.pdf