Theory Center Frontier Lecture Series | Live Stream: Is Reinforcement Learning from Human Feedback Harder than Standard Reinforcement Learning?

2023-10-16 | Author: Microsoft Research Asia

The twelfth installment of the Microsoft Research Asia Theory Center Frontier Lecture Series will take place on Wednesday, October 18, 10:30 - 11:30 a.m.

In this installment, we have invited Chi Jin, Assistant Professor in the Department of Electrical and Computer Engineering at Princeton University, to give a talk titled "Is RLHF More Difficult than Standard RL?". Tune in to the "微软科技" (Microsoft Tech) live room on Bilibili!

The Theory Center Frontier Lecture Series is an ongoing series of live lectures hosted by Microsoft Research Asia. It invites researchers working at the frontiers of theory from around the world to present their findings, covering theoretical advances in big data, artificial intelligence, and related fields. Through this series, we hope to explore the latest frontier results in theoretical research together with our audience and to build an active theory research community.

Faculty members and students interested in theoretical research are welcome to attend the lectures and join the community (see below for how to join), to jointly advance theoretical research, strengthen interdisciplinary collaboration, help break through bottlenecks in AI development, and enable substantive progress in computer technology!

Live Stream Information

Live stream: the "微软科技" (Microsoft Tech) live room on Bilibili

https://live.bilibili.com/730

If you would like to interact with the speaker, you are welcome to join the meeting via Teams.

Meeting link: http://z6b.cn/w6RKg

Meeting ID: 249516 746 294

Meeting passcode: RgqRLW

Live stream time: Wednesday, October 18, 10:30 - 11:30 a.m.

Scan the QR code to go directly to the Bilibili live room

Lecture Information

About the speaker:
Chi Jin is an assistant professor in the Department of Electrical and Computer Engineering at Princeton University. He obtained his PhD in Computer Science from the University of California, Berkeley, advised by Michael I. Jordan. His research focuses on theoretical machine learning, with particular emphasis on nonconvex optimization and Reinforcement Learning (RL). In nonconvex optimization, he gave the first proof that a first-order algorithm (stochastic gradient descent) can escape saddle points efficiently. In RL, he provided the first efficient learning guarantees for Q-learning and least-squares value iteration algorithms in settings where exploration is necessary. His work also lays the theoretical foundations for RL with function approximation, multi-agent RL, and partially observable RL.

Talk title:
Is RLHF More Difficult than Standard RL?

Abstract:
Reinforcement learning from Human Feedback (RLHF) learns from preference signals, while standard Reinforcement Learning (RL) directly learns from reward signals. Preferences arguably contain less information than rewards, which makes preference-based RL seemingly more difficult. This work theoretically proves that, for a wide range of preference models, we can solve preference-based RL directly using existing algorithms and techniques for reward-based RL, with small or no extra costs. Specifically, (1) for preferences that are drawn from reward-based probabilistic models, we reduce the problem to robust reward-based RL that can tolerate small errors in rewards; (2) for general arbitrary preferences where the objective is to find the von Neumann winner, we reduce the problem to multiagent reward-based RL which finds Nash equilibria for factored Markov games under a restricted set of policies. The latter case can be further reduced to adversarial MDP when preferences only depend on the final state. We instantiate all reward-based RL subroutines by concrete provable algorithms, and apply our theory to a large class of models including tabular MDPs and MDPs with generic function approximation. We further provide guarantees when K-wise comparisons are available.
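As background (not part of the speaker's materials): the abstract refers to two preference settings that have standard formulations in the preference-based RL and dueling-bandit literature. A minimal sketch is given below, assuming the commonly used Bradley-Terry model as an example of a reward-based probabilistic preference model and the usual definition of the von Neumann winner; the talk's exact formulation may differ.

% A Bradley-Terry-style preference over two trajectories \tau^1 and \tau^2,
% generated from an underlying (unobserved) cumulative reward r(\cdot):
\[
\mathbb{P}\left(\tau^1 \succ \tau^2\right)
  = \sigma\!\left(r(\tau^1) - r(\tau^2)\right)
  = \frac{\exp\left(r(\tau^1)\right)}{\exp\left(r(\tau^1)\right) + \exp\left(r(\tau^2)\right)},
  \qquad \sigma(x) = \frac{1}{1 + e^{-x}}.
\]

% With general (possibly non-reward-based) preferences, a von Neumann winner is a
% (possibly randomized) policy \pi^* that is preferred to every comparator policy
% at least half of the time:
\[
\mathbb{P}\left(\pi^* \succ \pi\right) \ \ge\ \tfrac{1}{2}
  \qquad \text{for every policy } \pi.
\]

In the first setting, preferences carry essentially the same information as a noisily observed reward, which is what makes a reduction to robust reward-based RL plausible; in the second, a von Neumann winner can be viewed as an equilibrium strategy of the two-player zero-sum game induced by the preference probabilities, which matches the abstract's reduction to multiagent reward-based RL.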

Previous Lecture Recap

In the previous lecture, Yang Yuan, Assistant Professor at the Institute for Interdisciplinary Information Sciences, Tsinghua University, gave a talk titled "On the Power of Foundation Models," exploring the capability limits of foundation models under unlimited-resource assumptions from the perspective of category theory.

Lecture replay: https://www.bilibili.com/video/BV1cF411y7fp/

Join the Theory Research Community

You are welcome to scan the QR code to join the theory research community and exchange ideas with other researchers interested in theoretical research; the latest information about the Microsoft Research Asia Theory Center Frontier Lecture Series will also be shared in the group.

[WeChat group QR code]

You can also subscribe to lecture announcements by sending an email with the subject line "Subscribe the Lecture Series" to MSRA.TheoryCenter@outlook.com.

About the Microsoft Research Asia Theory Center

The Microsoft Research Asia Theory Center was officially established in December 2021. By building a hub for international academic exchange and collaboration, it aims to promote the deep integration of theoretical research with big data and artificial intelligence technologies, advance theoretical research, strengthen interdisciplinary collaboration, help break through bottlenecks in AI development, and enable substantive progress in computer technology. The center brings together members from different teams and research backgrounds within Microsoft Research Asia, focusing on fundamental problems in areas including deep learning, reinforcement learning, learning of dynamical systems, and data-driven optimization.

For more information about the Theory Center, please visit https://www.microsoft.com/en-us/research/group/msr-asia-theory-center/
