This is Chapter 4 of *Reinforcement Learning*.

## Policy Evaluation

First consider a known policy $\pi(a|s)$ and the problem of computing $v_\pi(s)$. From the Bellman equation for the state-value function in the previous section, we have
$$v_\pi(s) = \sum_a \pi(a|s) \sum_{s'} \sum_r p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr]$$
If the environment is fully known, one such equation can be written for every state $s$; solving the resulting system of linear equations yields $v_\pi(s)$. Alternatively, we can solve it iteratively: assign an arbitrary initial value $v_0(s)$ to every state, then iterate as follows:
$$v_{k+1}(s) = \sum_a \pi(a|s) \sum_{s'} \sum_r p(s', r \mid s, a)\bigl[r + \gamma v_k(s')\bigr]$$
Clearly $v_k = v_\pi$ is a fixed point of this update. Stated without proof: as the iteration proceeds, $v_k$ converges to $v_\pi$. This algorithm is called Iterative Policy Evaluation. The book gives its pseudocode, shown in the figure below.
[Figure: pseudocode for Iterative Policy Evaluation]
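For concreteness, here is a minimal Python sketch of iterative policy evaluation (a sketch, not the book's pseudocode). The tabular MDP representation is my own assumption: `P[s][a]` is a list of `(prob, next_state, reward)` triples and `policy[s][a]` gives $\pi(a|s)$; terminal states are modelled as absorbing states with zero reward.

```python
import numpy as np

# Assumed (hypothetical) MDP representation:
#   P[s][a]      -> list of (prob, next_state, reward) triples for action a in state s
#   policy[s][a] -> pi(a|s), the probability of choosing action a in state s

def iterative_policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    n_states = len(P)
    V = np.zeros(n_states)                     # arbitrary v_0(s)
    while True:
        V_new = np.zeros(n_states)             # synchronous sweep: v_{k+1} built from v_k
        for s in range(n_states):
            for a, pi_sa in enumerate(policy[s]):
                for prob, s_next, r in P[s][a]:
                    V_new[s] += pi_sa * prob * (r + gamma * V[s_next])
        if np.max(np.abs(V_new - V)) < theta:  # stop once a full sweep barely changes anything
            return V_new
        V = V_new

# Toy 2-state example (my own, for illustration):
# state 0: action 0 -> state 1 with reward 1; action 1 -> stay in state 0 with reward 0
# state 1: absorbing, single action, zero reward
P = [
    [[(1.0, 1, 1.0)], [(1.0, 0, 0.0)]],
    [[(1.0, 1, 0.0)]],
]
policy = [np.array([0.5, 0.5]), np.array([1.0])]
print(iterative_policy_evaluation(P, policy, gamma=0.9))
```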
## Policy Improvement

Suppose the policy $\pi$ is known and policy evaluation has produced the value function $v(s)$ for every state $s$; we now want to improve the policy so as to raise the value function. The idea of Policy Improvement is very intuitive. In state $s$, if every action leads deterministically to a unique next state $s'$, then acting greedily we simply pick the action whose $v(s')$ is largest. More generally, if each action leads to a next state $S_{t+1}$ according to some probability distribution and yields a reward $R_{t+1}$, we still act greedily and pick the action with the largest expected value. Formally:
$$
\begin{aligned}
\pi'(s) &= \arg\max_a q_\pi(s,a) \\
        &= \arg\max_a \mathbb{E}\bigl[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\bigr] \\
        &= \arg\max_a \sum_{s'} \sum_r p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr]
\end{aligned}
$$
It can be shown (see the book) that $v_{\pi'}(s) \ge v_\pi(s)$. If $v_{\pi'}(s) = v_\pi(s)$, then by the Bellman optimality equation both $\pi$ and $\pi'$ are optimal policies.
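As a companion to the evaluation sketch above, the greedy improvement step could look like this. The `P[s][a]` representation and the one-hot policy output are again my own assumptions, not the book's.

```python
import numpy as np

# Greedy improvement: pi'(s) = argmax_a q_pi(s, a), under the assumed
# P[s][a] = [(prob, next_state, reward), ...] representation.

def greedy_policy(P, V, gamma=0.9):
    policy = []
    for s in range(len(P)):
        # q(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma * V(s')]
        q = [sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[s][a])
             for a in range(len(P[s]))]
        pi_s = np.zeros(len(P[s]))
        pi_s[int(np.argmax(q))] = 1.0     # deterministic: all probability on the best action
        policy.append(pi_s)
    return policy
```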
## Policy Iteration

From the previous two sections a natural scheme emerges: start from an arbitrary policy $\pi_0$, run policy evaluation to obtain its value function $v_{\pi_0}(s)$, then run policy improvement to obtain a better policy $\pi_1$, and so on. Repeating this makes the policy better and better, approaching an optimal policy, as shown in the figure below.

[Figure: $\pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} v_{\pi_1} \xrightarrow{I} \cdots \xrightarrow{I} \pi_* \xrightarrow{E} v_*$]
Here $E$ denotes policy evaluation and $I$ denotes policy improvement. This reinforcement learning algorithm is called Policy Iteration. The book gives its pseudocode.
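Putting the two pieces together, a sketch of the policy iteration loop might look as follows. It reuses the hypothetical `iterative_policy_evaluation` and `greedy_policy` functions sketched above and stops when the greedy policy no longer changes.

```python
import numpy as np

# Policy iteration as alternating evaluation (E) and improvement (I).

def policy_iteration(P, gamma=0.9):
    # pi_0: an arbitrary starting policy (here uniform over the actions of each state)
    policy = [np.ones(len(P[s])) / len(P[s]) for s in range(len(P))]
    while True:
        V = iterative_policy_evaluation(P, policy, gamma)  # E: evaluate the current policy
        new_policy = greedy_policy(P, V, gamma)            # I: greedy improvement
        if all(np.array_equal(p, q) for p, q in zip(policy, new_policy)):
            return new_policy, V                           # policy stable, hence optimal
        policy = new_policy
```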
## Value Iteration

One major issue with policy iteration is the policy evaluation step: whether done by solving the linear system or by iteration, it is relatively expensive. An alternative is Value Iteration, which proceeds as follows:
$$
\begin{aligned}
v_{k+1}(s) &= \max_a q_k(s,a) \\
           &= \max_a \mathbb{E}\bigl[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s, A_t = a\bigr] \\
           &= \max_a \sum_{s'} \sum_r p(s', r \mid s, a)\bigl[r + \gamma v_k(s')\bigr]
\end{aligned}
$$
The update looks very similar to policy improvement. The only difference is that policy improvement extracts a better policy from the current state values, whereas value iteration never computes a policy: it uses the current state values directly to update the values of neighbouring states. Seen this way, value iteration skips the policy evaluation step and no longer needs to compute the exact value function of any particular policy. Another way to look at it is that the value iteration update is almost identical to the Bellman optimality equation, so it is again a fixed-point iteration. The proofs for both viewpoints are omitted, but the intuition is easy to grasp. The book also gives pseudocode for value iteration.
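A minimal sketch of value iteration under the same assumed `P[s][a]` representation; it additionally assumes every state has at least one action (terminal states being absorbing with zero reward). An optimal greedy policy can then be read off with the `greedy_policy` sketch above.

```python
import numpy as np

# Value iteration: each sweep applies
#   v_{k+1}(s) = max_a sum_{s', r} p(s', r | s, a) [r + gamma * v_k(s')]

def value_iteration(P, gamma=0.9, theta=1e-8):
    V = np.zeros(len(P))
    while True:
        V_new = np.array([
            max(sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[s][a])
                for a in range(len(P[s])))
            for s in range(len(P))
        ])
        if np.max(np.abs(V_new - V)) < theta:  # values have (numerically) converged
            return V_new
        V = V_new
```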
## Asynchronous Dynamic Programming

The policy iteration and value iteration discussed so far are synchronous: we keep the value function of the current iteration and use it to compute the value function of the next iteration. Sometimes, however, the state space is so large that even a single full sweep is expensive. Asynchronous dynamic programming updates the value function in place and in no fixed order (some states may be updated many times before another state is updated even once). This saves memory and makes real-time interaction possible. However, asynchronous DP may either speed up or slow down convergence; there is no guarantee either way.

## Generalized Policy Iteration

Policy iteration consists of two steps: obtaining the value function of the current policy (policy evaluation), and deriving a better policy from the current value function via the greedy rule (policy improvement). The two steps alternate, as in the policy iteration algorithm, but strict alternation is not required: we do not have to wait for one step to finish completely before starting the other. In value iteration, for example, policy evaluation runs for only a single small step before policy improvement kicks in, and in asynchronous methods the two are interleaved at an even finer granularity. In every case the underlying idea is the same: some interplay of policy evaluation and policy improvement. Setting aside the differences in how each method implements it, they can all be summarized by the diagram of this interplay between evaluation and improvement given in the book.
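To make the in-place idea from the asynchronous DP discussion concrete, here is a sketch of a value-iteration-style sweep that overwrites a single value array immediately instead of keeping a frozen copy from the previous sweep. It is only the simplest asynchronous-style variant, using the same assumed `P[s][a]` representation; the sweep order could just as well be arbitrary.

```python
import numpy as np

# In-place sweep: later updates in the same sweep already see the newest
# values of earlier states, so only one value array is kept.

def in_place_value_iteration(P, gamma=0.9, theta=1e-8):
    V = np.zeros(len(P))
    while True:
        delta = 0.0
        for s in range(len(P)):                # this order is a choice, not a requirement
            v_new = max(sum(prob * (r + gamma * V[s_next]) for prob, s_next, r in P[s][a])
                        for a in range(len(P[s])))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                       # overwrite immediately (no second array)
        if delta < theta:
            return V
```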
## Efficiency of Dynamic Programming

Dynamic programming may not be suitable for extremely large problems, but compared with other methods for solving MDPs it is still efficient: even in the worst case, DP finds an optimal policy in time polynomial in the number of states and actions. DP is sometimes said to suffer from the curse of dimensionality, because the number of states can grow exponentially with the number of state variables. The author's view, however, is that this reflects the intrinsic difficulty of the problem rather than a weakness of DP; even on large problems, DP retains a big advantage over direct search or linear programming. In practice, DP can handle problems with millions of states. Policy iteration and value iteration are both widely used, and each has its strengths. These methods usually converge much faster than their theoretical worst-case rates, especially when started from a good initial point. For even larger problems, asynchronous methods become more attractive.