Backpropagation (the BP Algorithm)

When it comes to neural networks, this picture should look familiar:

::: hljs-center
<img src="https://cos.easydoc.net/17082933/files/kecftgrw.png" style="zoom: 45%;" />
:::

This is the basic structure of a typical three-layer neural network: Layer L1 is the input layer, Layer L2 is the hidden layer, and Layer L3 is the output layer. We have a set of input data $\{x_1,x_2,x_3,...,x_n\}$ and a set of output data $\{y_1,y_2,y_3,...,y_n\}$, and we want the hidden layer to apply some transformation so that feeding the data in produces the output we expect.

If you want the output to equal the original input, you get the most common autoencoder model (Auto-Encoder). You might ask why anyone would want the output to match the input, and what that is good for. It is actually widely used, for example in image recognition and text classification; I will write a separate article on Auto-Encoder, including some of its variants. If the output differs from the original input, you get the ordinary artificial neural network, which maps the raw data to the output we want, and that is today's topic.

This article goes straight to an example and plugs in numbers to demonstrate the backpropagation process; the derivation of the formulas...

Suppose you have a network like this:

::: hljs-center
<img src="https://cos.easydoc.net/17082933/files/kecfw3vo.png" style="zoom: 45%;" />
:::

The first layer is the input layer, containing two neurons i1, i2 and the bias term b1. The second layer is the hidden layer, containing two neurons h1, h2 and the bias term b2. The third layer is the output layer with o1, o2. The wi marked on each line is the weight of the connection between layers, and the activation function defaults to the sigmoid function.

Now we assign initial values, as in the figure below:

::: hljs-center
<img src="https://cos.easydoc.net/17082933/files/kecfxbol.png" style="zoom: 45%;" />
:::

Where:

> **Input data:** i1=0.05, i2=0.10;
>
> **Output data:** o1=0.01, o2=0.99;
>
> **Initial weights:**
> w1=0.15, w2=0.20, w3=0.25, w4=0.30, w5=0.40, w6=0.45, w7=0.50, w8=0.55
>
> **Goal:** given the input data i1, i2 (0.05 and 0.10), make the output as close as possible to the original output o1, o2 (0.01 and 0.99).

The bias terms take the values used in the calculations below: b1=0.35 and b2=0.60.

# 1. Forward propagation

## 1.1 Input layer ----> hidden layer

Compute the weighted input sum of neuron h1:

$\text{net}_{h1}=w_{1} * i_{1}+w_{2} * i_{2}+b_{1} * 1$

$\text{net}_{h1}=0.15 * 0.05+0.2 * 0.1+0.35 * 1=0.3775$

The output of neuron h1, $\text{out}_{h1}$ (the activation function used here is the sigmoid function):

$\text{out}_{h1}=\frac{1}{1+e^{-\text{net}_{h1}}}=\frac{1}{1+e^{-0.3775}}=0.593269992$

In the same way, we can compute the output of neuron h2, $\text{out}_{h2}$:

$\text{out}_{h2}=0.596884378$

## 1.2 Hidden layer ----> output layer

Compute the values of the output-layer neurons o1 and o2:

$\text{net}_{o1}=w_{5} * \text{out}_{h1}+w_{6} * \text{out}_{h2}+b_{2} * 1$

$\text{net}_{o1}=0.4 * 0.593269992+0.45 * 0.596884378+0.6 * 1=1.105905967$

$\text{out}_{o1}=\frac{1}{1+e^{-\text{net}_{o1}}}=\frac{1}{1+e^{-1.105905967}}=0.75136507$

$\text{out}_{o2}=0.772928465$

That completes forward propagation. The output we get is $[0.75136507 , 0.772928465]$, which is still far from the actual values $[0.01 , 0.99]$. Now we propagate the error backward, update the weights, and recompute the output.

# 2. Backpropagation

## 2.1 Computing the total error

Total error (squared error):

$E_{\text{total}}=\sum \frac{1}{2}(\text{target}-\text{output})^{2}$

Since there are two outputs, we compute the errors of o1 and o2 separately; the total error is their sum:

$E_{o1}=\frac{1}{2}\left(\text{target}_{o1}-\text{out}_{o1}\right)^{2}=\frac{1}{2}(0.01-0.75136507)^{2}=0.274811083$

$E_{o2}=0.023560026$

$E_{\text{total}}=E_{o1}+E_{o2}=0.274811083+0.023560026=0.298371109$

## 2.2 Hidden layer ----> output layer weight update

Take the weight w5 as an example. To find out how much w5 contributes to the total error, we take the partial derivative of the total error with respect to w5 (chain rule):

$\frac{\partial E_{\text{total}}}{\partial w_{5}}=\frac{\partial E_{\text{total}}}{\partial \text{out}_{o1}} * \frac{\partial \text{out}_{o1}}{\partial \text{net}_{o1}} * \frac{\partial \text{net}_{o1}}{\partial w_{5}}$

The figure below shows more directly how the error propagates backward:

::: hljs-center
<img src="https://cos.easydoc.net/17082933/files/kecghc3w.png" style="zoom: 45%;" />
:::

**Now we compute the value of each factor in turn:**

**Compute** $\frac{\partial E_{\text{total}}}{\partial \text{out}_{o1}}$:

$E_{\text{total}}=\frac{1}{2}\left(\text{target}_{o1}-\text{out}_{o1}\right)^{2}+\frac{1}{2}\left(\text{target}_{o2}-\text{out}_{o2}\right)^{2}$

$\frac{\partial E_{\text{total}}}{\partial \text{out}_{o1}}=2 * \frac{1}{2}\left(\text{target}_{o1}-\text{out}_{o1}\right)^{2-1} * (-1)+0$

$\frac{\partial E_{\text{total}}}{\partial \text{out}_{o1}}=-\left(\text{target}_{o1}-\text{out}_{o1}\right)=-(0.01-0.75136507)=0.74136507$

**Compute** $\frac{\partial \text{out}_{o1}}{\partial \text{net}_{o1}}$:

$\text{out}_{o1}=\frac{1}{1+e^{-\text{net}_{o1}}}$

$\frac{\partial \text{out}_{o1}}{\partial \text{net}_{o1}}=\text{out}_{o1}\left(1-\text{out}_{o1}\right)=0.75136507(1-0.75136507)=0.186815602$

(This step is simply the derivative of the sigmoid function. It is fairly easy, so you can derive it yourself.)

**Compute** $\frac{\partial \text{net}_{o1}}{\partial w_{5}}$:

$\text{net}_{o1}=w_{5} * \text{out}_{h1}+w_{6} * \text{out}_{h2}+b_{2} * 1$

$\frac{\partial \text{net}_{o1}}{\partial w_{5}}=1 * \text{out}_{h1} * w_{5}^{(1-1)}+0+0=\text{out}_{h1}=0.593269992$

**Finally, multiply the three together:**

$\frac{\partial E_{\text{total}}}{\partial w_{5}}=\frac{\partial E_{\text{total}}}{\partial \text{out}_{o1}} * \frac{\partial \text{out}_{o1}}{\partial \text{net}_{o1}} * \frac{\partial \text{net}_{o1}}{\partial w_{5}}$

$\frac{\partial E_{\text{total}}}{\partial w_{5}}=0.74136507 * 0.186815602 * 0.593269992=0.082167041$

This gives us **the partial derivative of the total error $E_{\text{total}}$ with respect to $w_5$.**

**Looking back at the formulas above, we find:**

$\frac{\partial E_{\text{total}}}{\partial w_{5}}=-\left(\text{target}_{o1}-\text{out}_{o1}\right) * \text{out}_{o1}\left(1-\text{out}_{o1}\right) * \text{out}_{h1}$

For convenience, **we use $\delta_{o1}$ to denote the error of the output layer:**

$\delta_{o1}=\frac{\partial E_{\text{total}}}{\partial \text{out}_{o1}} * \frac{\partial \text{out}_{o1}}{\partial \text{net}_{o1}}=\frac{\partial E_{\text{total}}}{\partial \text{net}_{o1}}$

$\delta_{o1}=-\left(\text{target}_{o1}-\text{out}_{o1}\right) * \text{out}_{o1}\left(1-\text{out}_{o1}\right)$

Therefore, **the partial derivative of the total error E(total) with respect to w5 can be written as:**

$\frac{\partial E_{\text{total}}}{\partial w_{5}}=\delta_{o1} \text{out}_{h1}$

**If the output-layer error is instead defined with the opposite sign, that is, $\delta_{o1}=\left(\text{target}_{o1}-\text{out}_{o1}\right) * \text{out}_{o1}\left(1-\text{out}_{o1}\right)$, it can also be written as:**

$\frac{\partial E_{\text{total}}}{\partial w_{5}}=-\delta_{o1}\text{out}_{h1}$

**Finally, we update the value of w5:**

$w_{5}^{+}=w_{5}-\eta * \frac{\partial E_{\text{total}}}{\partial w_{5}}=0.4-0.5 * 0.082167041=0.35891648$

(Here $\eta$ is the learning rate, which we set to 0.5.)

**In the same way, we can update w6, w7, w8:**

$w_{6}^{+}=0.408666186$

$w_{7}^{+}=0.511301270$

$w_{8}^{+}=0.561370121$

## 2.3 Input layer ----> hidden layer weight update

The method is much the same as above, but one thing changes. When we computed the partial derivative of the total error with respect to w5, the path was out(o1) ----> net(o1) ----> w5. When updating the weights into the hidden layer, the path is out(h1) ----> net(h1) ----> w1, and out(h1) receives error from two places, E(o1) and E(o2), so both must be computed here.

::: hljs-center
<img src="https://cos.easydoc.net/17082933/files/kechekfv.png" style="zoom: 45%;" />
:::

**Compute** $\frac{\partial E_{\text{total}}}{\partial \text{out}_{h1}}$:

$\frac{\partial E_{\text{total}}}{\partial \text{out}_{h1}}=\frac{\partial E_{o1}}{\partial \text{out}_{h1}}+\frac{\partial E_{o2}}{\partial \text{out}_{h1}}$

**First compute** $\frac{\partial E_{o1}}{\partial \text{out}_{h1}}$:

$\frac{\partial E_{o1}}{\partial \text{out}_{h1}}=\frac{\partial E_{o1}}{\partial \text{net}_{o1}} * \frac{\partial \text{net}_{o1}}{\partial \text{out}_{h1}}$

$\frac{\partial E_{o1}}{\partial \text{net}_{o1}}=\frac{\partial E_{o1}}{\partial \text{out}_{o1}} * \frac{\partial \text{out}_{o1}}{\partial \text{net}_{o1}}=0.74136507 * 0.186815602=0.138498562$

$\text{net}_{o1}=w_{5} * \text{out}_{h1}+w_{6} * \text{out}_{h2}+b_{2} * 1$

$\frac{\partial \text{net}_{o1}}{\partial \text{out}_{h1}}=w_{5}=0.40$

$\frac{\partial E_{o1}}{\partial \text{out}_{h1}}=\frac{\partial E_{o1}}{\partial \text{net}_{o1}} * \frac{\partial \text{net}_{o1}}{\partial \text{out}_{h1}}=0.138498562 * 0.40=0.055399425$

**In the same way, we get:**

$\frac{\partial E_{o2}}{\partial \text{out}_{h1}}=-0.019049119$

**Adding the two gives the total:**

$\frac{\partial E_{\text{total}}}{\partial \text{out}_{h1}}=\frac{\partial E_{o1}}{\partial \text{out}_{h1}}+\frac{\partial E_{o2}}{\partial \text{out}_{h1}}=0.055399425+(-0.019049119)=0.036350306$

**Next compute** $\frac{\partial \text{out}_{h1}}{\partial \text{net}_{h1}}$:

$\text{out}_{h1}=\frac{1}{1+e^{-\text{net}_{h1}}}$

$\frac{\partial \text{out}_{h1}}{\partial \text{net}_{h1}}=\text{out}_{h1}\left(1-\text{out}_{h1}\right)=0.59326999(1-0.59326999)=0.241300709$

**Next compute** $\frac{\partial \text{net}_{h1}}{\partial w_{1}}$:

$\text{net}_{h1}=w_{1} * i_{1}+w_{2} * i_{2}+b_{1} * 1$

$\frac{\partial \text{net}_{h1}}{\partial w_{1}}=i_{1}=0.05$

**Finally, multiply the three together:**

$\frac{\partial E_{\text{total}}}{\partial w_{1}}=\frac{\partial E_{\text{total}}}{\partial \text{out}_{h1}} * \frac{\partial \text{out}_{h1}}{\partial \text{net}_{h1}} * \frac{\partial \text{net}_{h1}}{\partial w_{1}}$

$\frac{\partial E_{\text{total}}}{\partial w_{1}}=0.036350306 * 0.241300709 * 0.05=0.000438568$

To **simplify the formula**, we use $\delta_{h1}$ to denote the error of the hidden-layer unit h1:

$\frac{\partial E_{\text{total}}}{\partial w_{1}}=\left(\sum_{o} \frac{\partial E_{\text{total}}}{\partial \text{out}_{o}} * \frac{\partial \text{out}_{o}}{\partial \text{net}_{o}} * \frac{\partial \text{net}_{o}}{\partial \text{out}_{h1}}\right) * \frac{\partial \text{out}_{h1}}{\partial \text{net}_{h1}} * \frac{\partial \text{net}_{h1}}{\partial w_{1}}$

$\frac{\partial E_{\text{total}}}{\partial w_{1}}=\left(\sum_{o} \delta_{o} * w_{ho}\right) * \text{out}_{h1}\left(1-\text{out}_{h1}\right) * i_{1}$

$\frac{\partial E_{\text{total}}}{\partial w_{1}}=\delta_{h1} i_{1}$

**Finally, update the weight w1:**

$w_{1}^{+}=w_{1}-\eta * \frac{\partial E_{\text{total}}}{\partial w_{1}}=0.15-0.5 * 0.000438568=0.149780716$

**In the same way, we can update the weights w2, w3, w4:**

$w_{2}^{+}=0.19956143$

$w_{3}^{+}=0.24975114$

$w_{4}^{+}=0.29950229$
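The forward pass and the weight updates worked through above can be checked with a short script. This is only a minimal sketch under a few assumptions: the function names (`sigmoid`, `forward`, `total_error`, `train_step`) and the dictionary layout are made up for illustration; the wiring of w3, w4, w7, w8 (i1 and i2 into h2, h1 and h2 into o2) is inferred from the values reported for out(h2) and out(o2); and the biases b1, b2 are left unchanged, since the walkthrough only updates w1 to w8.

```python
import math


def sigmoid(x):
    """Sigmoid activation, as used for every neuron in the example."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(w, b1, b2, i1, i2):
    """Forward pass; returns (out_h1, out_h2, out_o1, out_o2)."""
    out_h1 = sigmoid(w["w1"] * i1 + w["w2"] * i2 + b1)
    # Assumed wiring for h2 and o2 (not written out in the text).
    out_h2 = sigmoid(w["w3"] * i1 + w["w4"] * i2 + b1)
    out_o1 = sigmoid(w["w5"] * out_h1 + w["w6"] * out_h2 + b2)
    out_o2 = sigmoid(w["w7"] * out_h1 + w["w8"] * out_h2 + b2)
    return out_h1, out_h2, out_o1, out_o2


def total_error(out_o1, out_o2, t1, t2):
    """Sum of the two squared errors, E_total = E_o1 + E_o2."""
    return 0.5 * (t1 - out_o1) ** 2 + 0.5 * (t2 - out_o2) ** 2


def train_step(w, b1, b2, i1, i2, t1, t2, eta=0.5):
    """One backpropagation step; returns a new weight dict (biases fixed)."""
    out_h1, out_h2, out_o1, out_o2 = forward(w, b1, b2, i1, i2)

    # Output-layer deltas: dE_total / dnet_o
    d_o1 = -(t1 - out_o1) * out_o1 * (1 - out_o1)
    d_o2 = -(t2 - out_o2) * out_o2 * (1 - out_o2)

    # Hidden-layer deltas: each hidden unit collects error from both outputs,
    # using the weights from before the update.
    d_h1 = (d_o1 * w["w5"] + d_o2 * w["w7"]) * out_h1 * (1 - out_h1)
    d_h2 = (d_o1 * w["w6"] + d_o2 * w["w8"]) * out_h2 * (1 - out_h2)

    grads = {
        "w1": d_h1 * i1, "w2": d_h1 * i2,
        "w3": d_h2 * i1, "w4": d_h2 * i2,
        "w5": d_o1 * out_h1, "w6": d_o1 * out_h2,
        "w7": d_o2 * out_h1, "w8": d_o2 * out_h2,
    }
    return {name: w[name] - eta * grads[name] for name in w}


weights = {"w1": 0.15, "w2": 0.20, "w3": 0.25, "w4": 0.30,
           "w5": 0.40, "w6": 0.45, "w7": 0.50, "w8": 0.55}
new_weights = train_step(weights, 0.35, 0.60, 0.05, 0.10, 0.01, 0.99)

# These should be close to the hand-computed w5+ and w1+ above.
print(new_weights["w5"], new_weights["w1"])
```

Passing the returned dictionary back into `train_step` repeats the procedure with the updated weights. The function returns a new dictionary instead of modifying `weights` in place, so every gradient is computed from the old weights, as in the hand calculation.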
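That completes one pass of error backpropagation. We then recompute with the updated weights and keep iterating. In this example, after the first iteration the total error E(total) drops from 0.298371109 to 0.291027924. After 10000 iterations, the total error is 0.000035085 and the output is \[0.015912196, 0.984065734] (the target output is [0.01, 0.99]), which shows that the result is quite good.

---

Reposted from: [一文弄懂神经网络中的反向传播法——BackPropagation](https://www.cnblogs.com/codehome/p/9718611.html)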