Date: 04.22
Instructor: Yaodong Yang
Topics Covered
- 1. Policy Gradient Methods
  - 1.1 Policy Gradient Theorem
  - 1.2 MLE vs. Policy Gradient
  - 1.3 Actor-Critic Methods
    - 1.3.1 Actor-Critic Training
    - 1.3.2 A2C: Advantage Actor-Critic
  - 1.4 Summary of Policy Gradient Algorithms
  - 1.5 Classification of Deep Policy Gradient Methods
    - 1.5.1 Stochastic Policy Methods (A3C)
    - 1.5.2 Deterministic Policy Methods
    - 1.5.3 Stochastic vs. Deterministic Policy Gradients
    - 1.5.4 Off-Policy Policy Gradient Methods
    - 1.5.5 Off-Policy Stochastic vs. Deterministic Policy Gradients
    - 1.5.6 Deep Deterministic Policy Gradient (DDPG)
    - 1.5.7 Twin Delayed Deep Deterministic Policy Gradient (TD3)
- 2. Trust Region and Proximal Policy Optimization
  - 2.1 Natural Policy Gradient
    - 2.1.1 Covariant Perspective
    - 2.1.2 Optimization Perspective
    - 2.1.3 Compatible Function Approximation
  - 2.2 Connection Between Policy Gradient and Policy Iteration
  - 2.3 Performance Difference Bound
  - 2.4 Total Variation Divergence
  - 2.5 Trust Region Objective
  - 2.6 Monotonic Improvement Guarantee
  - 2.7 Kullback–Leibler (KL) Divergence
  - 2.8 Trust Region Policy Optimization (TRPO)
  - 2.9 Natural Gradient Methods
  - 2.10 Proximal Policy Optimization (PPO)
  - 2.11 Summary
- 3. Direct Preference Optimization (DPO)
  - 3.1 Preference-Based Policy Optimization Without Explicit Reward Functions
  - 3.2 Foundations of Direct Preference Alignment Methods
- 4. Maximum Entropy Reinforcement Learning
  - 4.1 Reinforcement Learning and Probabilistic Graphical Models
  - 4.2 Planning as Inference
  - 4.3 Control as Inference
  - 4.4 Soft Q-Learning
  - 4.5 Soft Policy Improvement Theorem
  - 4.6 Unification of Policy Gradient and Q-Learning
  - 4.7 Soft Actor-Critic (SAC)