04.01
Instructor: Yaodong Yang
Topics Covered
- Markov Decision Process
  - 1.1 Stochastic Process
  - 1.2 Markov Process
    - 1.2.1 Definition
    - 1.2.2 Property
  - 1.3 Markov Decision Process (MDP)
  - 1.4 Dynamics of MDP
    - 1.4.1 Supervised & Unsupervised Learning: Model
    - 1.4.2 Reinforcement Learning: Agent
  - 1.5 Occupancy Measure
    - 1.5.1 Policy
    - 1.5.2 Cumulative Reward
  - 1.6 Assumptions of Markov Decision Process
  - 1.7 Stationary Policy
- Reinforcement Learning Based on Dynamic Programming
  - 2.1 MDP Objectives and Policies
  - 2.2 Bellman’s Equation for Value Function
  - 2.3 Optimal Value Function
  - 2.4 Value Iteration & Policy Iteration
    - 2.4.1 Value Iteration
    - 2.4.2 Synchronous vs. Asynchronous Value Iteration
    - 2.4.3 Policy Iteration
    - 2.4.4 Example: Policy Evaluation
    - 2.4.5 Comparison: Value Iteration & Policy Iteration
  - 2.5 Policies in Policy Iteration
    - 2.5.1 Policy Optimization
    - 2.5.2 Proof
- Model-Based Reinforcement Learning
  - 3.1 Learning an MDP Model
    - 3.1.1 Model Learning
    - 3.1.2 Optimization Strategy
  - 3.2 Summary of Markov Decision Process
- Value Function Estimation
  - 4.1 Model-Free Reinforcement Learning
- Monte Carlo Methods
  - 5.1 Monte Carlo Value Estimation
  - 5.2 Incremental Monte Carlo Updates
  - 5.3 Monte Carlo Value Estimation Example
  - 5.4 Heuristic Sampling Example
  - 5.5 Importance Sampling
- Temporal Difference Learning
  - 6.1 Monte Carlo vs. Temporal Difference (MC vs. TD)
  - 6.2 Bias vs. Variance
  - 6.3 Example: Random Walk
  - 6.4 Backup
  - 6.5 Multi-Step Temporal Difference Learning
  - 6.6 SARSA
    - 6.6.1 On-Policy Control
    - 6.6.2 ε-Greedy Policy Improvement