**Seminar**:

Thursday 10.00-12.00, Seminarraum SR 007/008

PI-Gebäude, Arnimallee 6, 14195 Berlin

Language: English

Machine learning is concerned with finding and generating patterns in data. Although traditionally considered a branch of computer science, it relies heavily on mathematical foundations, and the primary goal of our seminar is to understand these foundations. In doing so, we will put emphasis on the probabilistic viewpoint. This semester we will focus on techniques for finding optimal strategies by means of reinforcement learning. Students are strongly encouraged to implement the methods they learn.

Familiarity with the basics of probability theory is required.

Grades will be based on presentations and active participation.

1. The reinforcement learning problem: environment, agent, reward, return, the Markov property, MDPs, the state-value function V, the action-value function Q, policies, the Bellman equation, examples (Chapter 3)
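For a fixed policy the Bellman equation is a linear system, which a made-up two-state example (not from the book) makes concrete:

```python
import numpy as np

# A made-up two-state MDP with a fixed policy, illustrating the
# Bellman equation V = R + gamma * P V. P[s, s'] is the transition
# probability under the policy, R[s] the expected one-step reward,
# gamma the discount factor.
P = np.array([[0.5, 0.5],
              [0.2, 0.8]])
R = np.array([1.0, 0.0])
gamma = 0.9

# Because the policy is fixed, the Bellman equation is linear and
# can be solved exactly: (I - gamma * P) V = R.
V = np.linalg.solve(np.eye(2) - gamma * P, R)

print(V)  # satisfies V = R + gamma * P @ V componentwise
```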

2. Dynamic programming: iterative policy evaluation, proof of convergence, policy improvement theorem, policy iteration, value iteration (Chapter 4)
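The Bellman optimality backup at the heart of value iteration can be sketched on a made-up 4-state chain (our example, not the book's):

```python
import numpy as np

# Value iteration on a hypothetical 4-state chain: from state s the
# agent moves left or right deterministically; entering terminal
# state 4 yields reward 1, every other transition yields 0.
n_states, gamma, theta = 4, 0.9, 1e-8
V = np.zeros(n_states + 1)  # index n_states is the terminal state

def backup(s, a):
    """One-step lookahead for action a = -1 (left) or +1 (right)."""
    s2 = max(0, s + a) if a < 0 else s + a
    r = 1.0 if s2 == n_states else 0.0
    v_next = 0.0 if s2 == n_states else V[s2]
    return r + gamma * v_next

while True:
    delta = 0.0
    for s in range(n_states):
        v_old = V[s]
        V[s] = max(backup(s, -1), backup(s, +1))  # Bellman optimality backup
        delta = max(delta, abs(V[s] - v_old))
    if delta < theta:  # stop once the sweep no longer changes V
        break

print(V[:n_states])  # optimal values decay geometrically with distance to the goal
```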

3. Monte Carlo methods: policy evaluation, estimation of action values, Monte Carlo control, on-policy and off-policy methods, evaluating one policy while following another (Chapter 5)

4. Temporal-difference (TD) learning: TD vs. Monte Carlo, convergence of TD (Bibl. Remarks 6.1-2), Sarsa (on-policy control), Q-learning (off-policy control) and a proof of its convergence (Bibl. Remarks 6.5, http://users.isr.ist.utl.pt/~mtjspaan/readingGroup/ProofQlearning.pdf), actor-critic methods (Chapter 6)
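Tabular Q-learning fits in a few lines; the 4-state chain below is our own toy example, not from the references:

```python
import random

random.seed(1)

# Q-learning on a made-up 4-state chain: actions 0 (left) and
# 1 (right); entering terminal state 4 pays 1, everything else 0.
n_states, gamma, alpha, eps = 4, 0.9, 0.1, 0.3
Q = [[0.0, 0.0] for _ in range(n_states)]

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else s + 1
    return s2, (1.0 if s2 == n_states else 0.0)

for _ in range(2000):
    s = 0
    while s != n_states:
        # epsilon-greedy behaviour policy
        if random.random() < eps:
            a = random.randrange(2)
        else:
            a = max((0, 1), key=lambda b: Q[s][b])
        s2, r = step(s, a)
        # off-policy TD update: bootstrap from the max over next actions
        target = r if s2 == n_states else r + gamma * max(Q[s2])
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

greedy = [max((0, 1), key=lambda b: Q[s][b]) for s in range(n_states)]
print(greedy)  # the learned greedy policy: always move right
```

Note that the update bootstraps from `max(Q[s2])` regardless of which action the epsilon-greedy behaviour policy actually takes; that mismatch is exactly what makes Q-learning off-policy.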

5. Eligibility traces: n-step TD methods, forward and backward views of TD(λ) and their equivalence (Chapter 7)
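The backward view of TD(λ) maintains a decaying trace per state; a minimal sketch with accumulating traces, again on our made-up chain (random policy, terminal state 4 paying 1):

```python
import random

random.seed(2)

gamma, lam, alpha = 0.9, 0.8, 0.05
V = [0.0] * 5            # V[4] is the terminal state and stays 0

for _ in range(3000):
    s, z = 0, [0.0] * 5  # traces are reset at the start of each episode
    while s != 4:
        s2 = max(0, s + random.choice([-1, 1]))
        r = 1.0 if s2 == 4 else 0.0
        delta = r + gamma * V[s2] - V[s]   # TD error
        z[s] += 1.0                        # accumulating eligibility trace
        for i in range(5):
            V[i] += alpha * delta * z[i]   # credit every recently visited state
            z[i] *= gamma * lam            # decay all traces
        s = s2

print(V[:4])  # estimates increase toward the goal
```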

6. Generalization and function approximation: approximation of the V function, gradient descent, convergence for linear approximation (Tsitsiklis and Van Roy, 1997a), control with function approximation, Baird’s counterexample, Tsitsiklis and Van Roy’s counterexample (Chapter 8).
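Semi-gradient TD(0) with a linear approximation V(s) = w·x(s) can be sketched on the same made-up chain; the two-dimensional feature vector is a hand-picked illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Semi-gradient TD(0) with linear function approximation on a
# hypothetical 4-state chain (random policy, terminal state 4,
# reward 1 on entering it). Not code from the references.
gamma, alpha = 0.9, 0.02
w = np.zeros(2)

def x(s):
    """Hand-chosen 2-dimensional feature vector: bias + normalized position."""
    return np.array([1.0, s / 4.0])

for _ in range(5000):
    s = 0
    while s != 4:
        s2 = max(0, s + rng.choice([-1, 1]))
        r = 1.0 if s2 == 4 else 0.0
        v2 = 0.0 if s2 == 4 else w @ x(s2)
        delta = r + gamma * v2 - w @ x(s)
        w += alpha * delta * x(s)          # gradient of w.x(s) w.r.t. w is x(s)
        s = s2

print([round(float(w @ x(s)), 3) for s in range(4)])
```

It is "semi"-gradient because only the prediction w·x(s), not the bootstrap target, is differentiated; this is the setting in which the cited convergence result of Tsitsiklis and Van Roy applies.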

7. Trust region policy optimization (TRPO) and proximal policy optimization (PPO) (https://arxiv.org/pdf/1502.05477.pdf, https://arxiv.org/pdf/1707.06347.pdf)

8. DQN and Double Q-learning (https://arxiv.org/pdf/1312.5602.pdf, https://www.nature.com/articles/nature14236.pdf and https://arxiv.org/pdf/1509.06461.pdf)

9. Prioritized Experience Replay (https://arxiv.org/pdf/1511.05952.pdf)

10. Reward shaping (https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/NgHaradaRussell-shaping-ICML1999.pdf, also consider this list of problems with badly chosen rewards: https://docs.google.com/spreadsheets/d/e/2PACX-1vRPiprOaC3HsCf5Tuum8bRfzYUiKLRqJmbOoC-32JorNdfyTiRRsR7Ea5eWtvsWzuxo8bjOxCG84dAg/pubhtml)
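The key result of the Ng, Harada and Russell paper is that a potential-based shaping term F(s, s') = γΦ(s') − Φ(s) leaves optimal policies unchanged; the returns differ only by a telescoping constant, which the following numerical check illustrates (the trajectory and potential Φ are made up):

```python
gamma = 0.9
Phi = {0: 0.3, 1: 0.5, 2: 0.8, 3: 1.0, 4: 0.0}               # hypothetical potential
traj = [(0, 1, 0.0), (1, 2, 0.0), (2, 3, 0.0), (3, 4, 1.0)]  # (s, s', reward)

def ret(rewards):
    """Discounted return of a reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

plain = ret([r for _, _, r in traj])
shaped = ret([r + gamma * Phi[s2] - Phi[s] for s, s2, r in traj])

# The shaping terms telescope: the difference between the two
# returns is gamma^T * Phi(s_T) - Phi(s_0), a constant that does
# not depend on the actions taken along the way.
print(plain, shaped, shaped - plain)
```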

The chapters indicated in the topics are from the following monograph:

1. R. Sutton and A. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998. Second edition: http://incompleteideas.net/book/RLbook2018.pdf

Additional recommended article:

2. V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, J. Pineau. An Introduction to Deep Reinforcement Learning. Foundations and Trends in Machine Learning 11 (3-4), 219-354, 2018. https://arxiv.org/pdf/1811.12560.pdf

Further sources are given in parentheses in the corresponding topics.