Lecture 10 - Deep Policy Gradient Methods

04.22

Instructor: Yaodong Yang

Topics Covered

  1. Policy Gradient Methods
    • 1.1 Policy Gradient Theorem
    • 1.2 MLE vs. Policy Gradient
    • 1.3 Actor-Critic Methods
      • 1.3.1 Actor-Critic Training
      • 1.3.2 A2C: Advantage Actor-Critic
    • 1.4 Summary of Policy Gradient Algorithms
    • 1.5 Classification of Deep Policy Gradient Methods
      • 1.5.1 Stochastic Policy Methods (A3C)
      • 1.5.2 Deterministic Policy Methods
      • 1.5.3 Stochastic vs. Deterministic Policy Gradients
      • 1.5.4 Off-Policy Policy Gradient Methods
      • 1.5.5 Off-Policy Stochastic vs. Deterministic Policy Gradients
      • 1.5.6 Deep Deterministic Policy Gradient (DDPG)
      • 1.5.7 Twin Delayed Deep Deterministic Policy Gradient (TD3)
  2. Trust Region and Proximal Policy Optimization
    • 2.1 Natural Policy Gradient
      • 2.1.1 Covariant Perspective
      • 2.1.2 Optimization Perspective
      • 2.1.3 Compatible Function Approximation
    • 2.2 Connection Between Policy Gradient and Policy Iteration
    • 2.3 Performance Difference Bound
    • 2.4 Total Variation Divergence
    • 2.5 Trust Region Objective
    • 2.6 Monotonic Improvement Guarantee
    • 2.7 Kullback–Leibler (KL) Divergence
    • 2.8 Trust Region Policy Optimization (TRPO)
    • 2.9 Natural Gradient Methods
    • 2.10 Proximal Policy Optimization (PPO)
    • 2.11 Summary
  3. Direct Preference Optimization (DPO)
    • 3.1 Preference-Based Policy Optimization Without Explicit Reward Functions
    • 3.2 Foundations of Direct Preference Alignment Methods
  4. Maximum Entropy Reinforcement Learning
    • 4.1 Reinforcement Learning and Probabilistic Graphical Models
    • 4.2 Planning as Inference
    • 4.3 Control as Inference
    • 4.4 Soft Q-Learning
    • 4.5 Soft Policy Improvement Theorem
    • 4.6 Unification of Policy Gradient and Q-Learning
    • 4.7 Soft Actor-Critic (SAC)