AlphaGo Zero Study Notes, Part 2

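Continuing from the previous post, this part keeps studying how the new deep neural network is put together.

For easy review, the overview of the new deep neural network is repeated below: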

The new design uses a single deep neural network fθ with parameters θ (which are adjusted continually through training).

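【Network input】19×19×17 binary (0/1) values: the current board state s together with the 7 preceding positions. Of the 17 planes, 8 mark the current player's stones and 8 mark the opponent's stones (the current position plus 7 history steps each). The last plane records which colour is to play: 0 for white, 1 for black.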
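A minimal sketch of how the 19×19×17 input described above could be assembled. The function name `encode_planes`, the board convention (1 = black, -1 = white, 0 = empty) and the interleaved ordering of the planes are assumptions made for illustration, not details taken from this post.

```python
BOARD = 19
HISTORY = 8  # the current position plus 7 previous positions


def empty_plane(fill=0):
    """Return a 19x19 plane filled with a constant value."""
    return [[fill] * BOARD for _ in range(BOARD)]


def encode_planes(history, black_to_play):
    """Build the 17 binary input planes.

    history: list of boards, most recent first. Each board is a 19x19 list
             with 1 = black stone, -1 = white stone, 0 = empty
             (a hypothetical convention for this sketch).
    Returns a list of 17 planes, each a 19x19 list of 0/1 values.
    """
    me = 1 if black_to_play else -1
    planes = []
    for t in range(HISTORY):
        if t < len(history):
            board = history[t]
            own = [[1 if v == me else 0 for v in row] for row in board]
            opp = [[1 if v == -me else 0 for v in row] for row in board]
        else:
            # Steps before the start of the game are encoded as all zeros.
            own, opp = empty_plane(), empty_plane()
        planes.append(own)
        planes.append(opp)
    # Final plane: all 1 if black is to play, all 0 if white is to play.
    planes.append(empty_plane(1 if black_to_play else 0))
    return planes
```

Calling `encode_planes([board], True)` on a game with a single recorded position yields 17 planes in which only the first two planes and the colour plane can contain non-zero entries.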

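【Network output】Two outputs: the move probabilities (362 values) and an evaluation value (in [-1, 1]), written fθ(s) = (p, v).

【Move probabilities p】A vector giving the probability of playing the next move at each possible point, plus the option of passing (361 + 1 = 362 entries); also called the prior probabilities. That is, pa = Pr(a|s), the probability of choosing move a given the current input s.

【Evaluation value v】An estimate of how likely the player about to move is to win, given the input s with its eight positions (I stress the position history here because the network input includes part of the course of the game, not just the current board).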
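To make the shape of fθ(s) = (p, v) concrete, here is a small sketch of the two outputs. It assumes the policy output is a softmax over 362 raw scores and the value output is squashed by tanh into (-1, 1); the names `network_outputs`, `policy_logits` and `value_logit` are hypothetical.

```python
import math

N_MOVES = 19 * 19 + 1  # 361 board points plus "pass" = 362


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def network_outputs(policy_logits, value_logit):
    """Map raw head outputs to (p, v)."""
    if len(policy_logits) != N_MOVES:
        raise ValueError("expected 362 policy logits")
    p = softmax(policy_logits)   # p[a] = Pr(a | s), sums to 1
    v = math.tanh(value_logit)   # scalar strictly between -1 and 1
    return p, v
```

With all logits equal, every move (including pass) gets probability 1/362, which is what an untrained network would be expected to produce on average.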

【Network structure】A convolutional network based on the Residual Network (ResNet, the well-known ImageNet winner), containing 20 or 40 residual blocks, with batch normalisation and rectifier non-linearities.
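The idea of a residual block can be sketched without any deep learning library. In the sketch below, `layer1` and `layer2` are stand-ins for "convolution followed by batch normalisation"; they are plain Python callables on a list of floats, which is a simplification assumed for illustration only.

```python
def relu(xs):
    """Rectifier non-linearity applied element-wise."""
    return [max(0.0, x) for x in xs]


def residual_block(xs, layer1, layer2):
    """Compute relu(x + layer2(relu(layer1(x)))).

    layer1 and layer2 are placeholder callables standing in for
    convolution + batch normalisation.
    """
    hidden = relu(layer1(xs))
    out = layer2(hidden)
    # Skip connection: add the block input back before the final ReLU.
    return relu([a + b for a, b in zip(xs, out)])


def residual_tower(xs, blocks):
    """Apply a sequence of residual blocks, e.g. 20 or 40 of them."""
    for layer1, layer2 in blocks:
        xs = residual_block(xs, layer1, layer2)
    return xs
```

The skip connection is the point of the design: if both layers output zeros, a block passes non-negative activations through unchanged, so stacking many blocks does not by itself degrade the signal.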

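After reading the paper and the original author's article, I found that many pieces of knowledge still did not connect. Further searching turned up good study material from the following two authors:

【专栏】谷歌资深工程师深入浅析AlphaGo Zero与深度强化学习 ([Column] A Senior Google Engineer's Accessible Analysis of AlphaGo Zero and Deep Reinforcement Learning)

AlphaZero实践——中国象棋（附论文翻译） (AlphaZero in Practice: Chinese Chess, with a Translation of the Paper)

Combining the articles by these three authors, my further understanding of the AlphaGo Zero algorithm is as follows: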

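AlphaGo Zero consists of Monte Carlo tree search (MCTS) and a deep neural network. It alternates between using the deep network to evaluate the policy (policy evaluation) and using MCTS to improve the policy (policy improvement).

Even after reading several times how MCTS and the neural network train each other, I am still confused, so for now I record only the parts I do understand.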

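In each state s, an MCTS search is run, guided by the predictions of the deep neural network fθ. The output of the search is π, the probability of playing each possible move in that state (note that π is a vector whose entries are the probabilities produced by the search). From a human point of view, this policy answers the question: looking at the current position, how likely am I to play at each candidate point? In the example in the formula below, the probability of playing at (1,3) is 0.92, so that point is very likely to be chosen as the move.

Note: this figure is taken from 《深入浅出看懂AlphaGo元》 (Understanding AlphaGo Zero in Plain Terms).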
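One way to see where such a probability vector comes from: in the AlphaGo Zero paper, the search probabilities are proportional to the exponentiated visit counts of the root's child moves, π(a) ∝ N(s,a)^(1/τ), where τ is a temperature. The sketch below assumes that rule; the function name `search_probabilities` and the visit counts are made up so that the result matches the 0.92 example.

```python
def search_probabilities(visit_counts, temperature=1.0):
    """Convert MCTS visit counts into search probabilities pi.

    visit_counts: dict mapping a move, e.g. (row, col), to its visit count.
    temperature:  tau > 0; smaller values sharpen the distribution.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    weights = {a: n ** (1.0 / temperature) for a, n in visit_counts.items()}
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()}


# Illustrative (invented) visit counts after 100 simulations.
counts = {(1, 3): 92, (2, 2): 5, (0, 0): 3}
pi = search_probabilities(counts)
```

Here `pi[(1, 3)]` comes out at 0.92 because that move received 92 of the 100 visits. Lowering `temperature` concentrates even more probability on the most visited move.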

The π mentioned in the figure above comes from the following diagram:

Self-play reinforcement learning in AlphaGo Zero

The figure above is taken from the paper and shows the self-play reinforcement learning process in AlphaGo Zero.

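a. Self-play: the program plays a game s1, ..., sT against itself. In each position st, a Monte Carlo tree search (MCTS) αθ is executed using the latest neural network fθ. Moves are selected according to the search probabilities computed by the MCTS, at ∼ πt. The terminal position sT is scored according to the rules of the game to determine the game winner z.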
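Step a can be sketched as two small helpers: one that samples a move at ∼ πt, and one that turns a finished game into training examples once the winner is known. The record layout `(state, pi, player)` and the names `sample_move` and `label_game` are assumptions for this sketch; the sign convention (z seen from the player to move at each step) follows the paper.

```python
import random


def sample_move(pi, rng=random):
    """Sample a move according to the search probabilities pi (a_t ~ pi_t)."""
    moves = list(pi.keys())
    weights = [pi[m] for m in moves]
    return rng.choices(moves, weights=weights, k=1)[0]


def label_game(records, winner):
    """Attach the final outcome z to every position of one self-play game.

    records: list of (state, pi, player), with player = +1 for black
             and -1 for white (hypothetical layout).
    winner:  +1 if black won, -1 if white won.
    Returns a list of (state, pi, z), where z is +1 when the player
    to move at that step went on to win and -1 otherwise.
    """
    return [(state, pi, 1 if player == winner else -1)
            for state, pi, player in records]
```

The labelled triples (st, πt, zt) are exactly what the training step consumes: πt is the target for the policy output and zt is the target for the value output.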

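b. Neural network training in AlphaGo Zero: the network takes the raw board position as its input, passes it through many convolutional layers with parameters θ, and outputs a vector representing a probability distribution over moves and a scalar representing the current player's probability of winning in that position. The parameters θ are updated to maximise the similarity between the policy vector and the search probabilities, and to minimise the error between the predicted winner and the game winner z (see Equation 1). The new parameters are used in the next iteration of self-play (step a).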
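Equation 1 in the paper combines these two goals into one loss: l = (z − v)² − πᵀ log p + c‖θ‖², that is, a squared error on the value, a cross-entropy between the search probabilities and the policy, and an L2 penalty on the weights. The sketch below evaluates that loss for a single training example; the function name, the flat `theta` list and the default `c` are illustrative assumptions.

```python
import math


def alphago_zero_loss(p, v, pi, z, theta=(), c=1e-4):
    """Loss for one example: (z - v)^2 - pi . log(p) + c * ||theta||^2.

    p:     policy vector predicted by the network
    v:     value predicted by the network
    pi:    search probabilities from MCTS (the policy target)
    z:     game outcome from the current player's view (+1 or -1)
    theta: flat list of network weights (hypothetical representation)
    c:     L2 regularisation strength (assumed default)
    """
    value_loss = (z - v) ** 2
    # Cross-entropy; entries with pi = 0 contribute nothing and are skipped.
    policy_loss = -sum(t * math.log(max(q, 1e-12))
                       for t, q in zip(pi, p) if t > 0)
    l2_penalty = c * sum(w * w for w in theta)
    return value_loss + policy_loss + l2_penalty
```

The loss is zero only when v equals z, p puts all its mass where π does, and the regularisation term vanishes. For example, with p = [0.5, 0.5], π = [1, 0], v = 0 and z = 1, it evaluates to 1 + ln 2.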

The paper's original caption for the self-play figure reads:

Self-play reinforcement learning in AlphaGo Zero.

a The program plays a game s1, ..., sT against itself. In each position st, a Monte-Carlo tree search (MCTS) αθ is executed (see Figure 2) using the latest neural network fθ. Moves are selected according to the search probabilities computed by the MCTS, at ∼ πt. The terminal position sT is scored according to the rules of the game to compute the game winner z.

b Neural network training in AlphaGo Zero. The neural network takes the raw board position st as its input, passes it through many convolutional layers with parameters θ, and outputs both a vector pt, representing a probability distribution over moves, and a scalar value vt, representing the probability of the current player winning in position st. The neural network parameters θ are updated so as to maximise the similarity of the policy vector pt to the search probabilities πt, and to minimise the error between the predicted winner vt and the game winner z (see Equation 1). The new parameters are used in the next iteration of self-play a.

To make the structure of the CNN easier to study, a structure diagram is quoted below:
