Reference material 💾

Deep Q Network (DQN)
- Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, et al. “Human-Level Control through Deep Reinforcement Learning.” Nature 518, no. 7540 (February 26, 2015): 529–33. https://doi.org/10.1038/nature14236.
- Hasselt, Hado van, Arthur Guez, and David Silver. “Deep Reinforcement Learning with Double Q-Learning.” arXiv, December 8, 2015. http://arxiv.org/abs/1509.06461. (Double DQN, DDQN)
- Schaul, Tom, John Quan, Ioannis Antonoglou, and David Silver. “Prioritized Experience Replay.” arXiv, February 25, 2016. http://arxiv.org/abs/1511.05952. (DQN with prioritized replay, PDQN)
Deep Deterministic Policy Gradient (DDPG)
- Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. “Continuous Control with Deep Reinforcement Learning.” arXiv, July 5, 2019. http://arxiv.org/abs/1509.02971.
Twin Delayed Deep Deterministic Policy Gradient (TD3)
- Fujimoto, Scott, Herke van Hoof, and David Meger. “Addressing Function Approximation Error in Actor-Critic Methods.” arXiv, October 22, 2018. http://arxiv.org/abs/1802.09477.
- original code: sfujim/TD3
Soft Actor-Critic (SAC)
- Haarnoja, Tuomas, Haoran Tang, Pieter Abbeel, and Sergey Levine. “Reinforcement Learning with Deep Energy-Based Policies.” arXiv, July 21, 2017. http://arxiv.org/abs/1702.08165.
- Haarnoja, Tuomas, Aurick Zhou, Pieter Abbeel, and Sergey Levine. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor.” arXiv, August 8, 2018. http://arxiv.org/abs/1801.01290.
- Haarnoja, Tuomas, Vitchyr Pong, Aurick Zhou, Murtaza Dalal, Pieter Abbeel, and Sergey Levine. “Composable Deep Reinforcement Learning for Robotic Manipulation.” arXiv, March 18, 2018. http://arxiv.org/abs/1803.06773.
- Haarnoja, Tuomas, Aurick Zhou, Kristian Hartikainen, George Tucker, Sehoon Ha, Jie Tan, Vikash Kumar, et al. “Soft Actor-Critic Algorithms and Applications.” arXiv, January 29, 2019. http://arxiv.org/abs/1812.05905.
- original code: haarnoja/sac
Proximal Policy Optimization (PPO)
- Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. “Proximal Policy Optimization Algorithms.” arXiv, August 28, 2017. http://arxiv.org/abs/1707.06347.
- Engstrom, Logan, Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph, and Aleksander Madry. “Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO.” arXiv, May 25, 2020. http://arxiv.org/abs/2005.12729.
- Andrychowicz, Marcin, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphael Marinier, Léonard Hussenot, et al. “What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study.” arXiv, June 10, 2020. http://arxiv.org/abs/2006.05990.
- ICLR22 Blog: The 37 Implementation Details of Proximal Policy Optimization