机器学习方法(六)：神经网络中的反向传播算法

2019-08-07

× 文章目录

1. 反向传播算法
1. 1.1. 前向传播
2. 1.2. 反向传播
  1. 1.2.1. 输出层—>隐含层
  2. 1.2.2. 隐含层—>输入层
2. Summary
3. Reference

反向传播算法

第一层是输入层，包含两个神经元i1、i2，和截距项b1；第二层是隐含层，包含两个神经元h1、h2和截距项b2，第三层是输出o1、o2，每条线上标的wi是层与层之间连接的权重，激活函数默认为sigmoid函数。

为了更好的进行计算，这里初始化权重weights、偏置项biases和神经网络的输入/输出:

输入数据 i1=0.05，i2=0.10;

输出数据 o1=0.01,o2=0.99;

初始权重 w1=0.15, w2=0.20, w3=0.25, w4=0.30; w5=0.40, w6=0.45, w7=0.50, w8=0.55;

反向传播的目标是优化权重，以便神经网络可以学习如何正确地将任意输入映射到输出。

目标：给定输入数据i1(0.05)和i2(0.10)，使输出尽可能与原始输出o1(0.01)和o2(0.99)接近。

前向传播

首先，让我们看一下在给定的权重和偏差以及输入为0.05和0.10的情况下，神经网络当前的输出。

输入层—>隐含层

计算神经元$h_1$的输入加权和：

$\begin{aligned} net_{h1} &= w_{1} * i_{1}+w_{2} * i_{2}+b_{1} * 1 \\ &= 0.15 * 0.05+0.2 * 0.1+0.35 * 1=0.3775 \end{aligned}$

使用Sigmoid函数计算神经元$h_1$的输出：

$\text out_{h1}=\frac{1}{1+e^{-net_{h 1}}}=\frac{1}{1+e^{-0.3775}}=0.593269992$

同理，可以计算出神经元$h_2$的输出：

$\text out_{h2}=0.596884378$

隐含层—>输出层

计算输出层神经元o1的值：

$\begin{array}{l} \text {net}_{o1}=w_{5} * \text { out }_{h 1}+w_{6} * \text { out }_{h 2}+b_{2} * 1 \\ \text {net}_{o1}=0.4 * 0.593269992+0.45 * 0.596884378+0.6 * 1=1.105905967 \\ \text {out}_{o1}=\frac{1}{1+e^{-n c t_{o 1}}}=\frac{1}{1+e^{-1.105905967}}=0.75136507 \end{array}$

同理，可计算输出层神经元o2的值：

$\text {out}_{o2}=0.772928465$

计算总误差

使用平方误差函数计算每个输出神经元的误差，并对它们求和以得出总误差：

$E_{total}=\sum \frac{1}{2}(\text {target}-\text {output})^{2}$

因为有两个输出，所以分别计算o1的误差：

$E_{o 1}=\frac{1}{2}\left(\text {target}_{o 1}-\text {out}_{o 1}\right)^{2}=\frac{1}{2}(0.01-0.75136507)^{2}=0.274811083$

和计算o2的误差：

$E_{o 2}=0.023560026$

总误差为两者之和：

$E_{total}=E_{o 1}+E_{o 2}=0.274811083+0.023560026=0.298371109$

反向传播

输出层—>隐含层

反向传播的目标是更新网络中的每个权重，以便它们使实际输出更接近目标输出，从而将每个输出神经元和整个网络的误差降到最低。

对于$w_5$来讲，如果我们想知道$w5$对整体误差产生了多少影响，可以用整体误差对$w5$求偏导求出，由求导的链式法则可得：

$\frac{\partial E_{\text {total}}}{\partial w_{5}}=\frac{\partial E_{\text {total}}}{\partial \text {out}_{o1}} * \frac{\partial out_{o1}}{\partial \text {net}_{o1}} * \frac{\partial \text {net}_{o1}}{\partial w_{5}}$

下图可以直观上理解误差是如何反向传播的：

我们需要找出这个方程式中的每一部分：首先，计算总误差相对于输出的梯度$\frac{\partial E_{total}}{\partial out_{o1}}$：

$E_{total}=\frac{1}{2}\left(\text {target}_{o1}-\text {out}_{\text {ol}}\right)^{2}+\frac{1}{2}\left(\text {target }_{o 2}-\text {out}_{o 2}\right)^{2} \\ \begin{aligned} \frac{\partial E_{total}}{\partial out_{o1}} &= 2 * \frac{1}{2}\left(\text {target}_{o1}-\text {out}_{\text {o1}}\right)^{2-1} *-1+0 \\ & = -\left(\text {target}_{o 1}-\text {out}_{\text {o1}}\right) \\ &= -(0.01-0.75136507) \\ &= 0.74136507 \end{aligned}$

在上图神经元中，输出$out_{o1}$与输入$net_{01}$存在以下函数关系：

$out_{o1}=\frac{1}{1+e^{-net_{o1}}}$

由于逻辑函数的偏导数是输出乘以1减去本身，可得：

$\frac{\partial out_{o1}}{ \partial net_{o l}}=out_{o 1}\left(1-out_{o1}\right)=0.75136507(1-0.75136507)=0.186815602$

继续计算$\frac{\partial net_{o1}}{\partial w_{5}}$：

$\begin{array}{l} net_{o1}=w_{5} * \text {out}_{h1}+w_{6} * \text {out}_{h2}+b_{2} * 1 \\ \frac{\partial net_{o1}}{\partial w_{5}}=1 * \text {out}_{h1} * w_{5}^{(1-1)}+0+0=\text {out}_{h 1}=0.593269992\end{array}$

最后，相乘计算$\frac{\partial E_{\text {total}}}{\partial w_{5}}$：

$\begin{aligned} \frac{\partial E_{total}}{\partial w_{5}}&=\frac{\partial E_{total}}{\partial out_{o1}} * \frac{\partial out_{o1}}{\partial net_{o1}} * \frac{\partial net_{o1}}{\partial w_{5}}\\ &= 0.74136507 * 0.186815602 * 0.593269992 \\ &= 0.082167041 \end{aligned}$

由上面可以得知，整体误差的偏导常常以以下形式组合在一起：
$\frac{\partial E_{total}}{\partial w_{5}}=-\left(\text {target}_{o1}-\text {out}_{o1}\right) * \text {out}_{\text {o1}}\left(1-\text {out}_{o1}\right) * \text {out}_{h 1}$
为了表达方便，用$\delta_{o1}$来表示输出层的误差：
$\begin{array}{l}\delta_{o1}=\frac{\partial E_{total}}{\partial out_{\text {o1}}} * \frac{\partial out_{o1}}{\partial net_{o1}}=\frac{\partial E_{total}}{\partial net_{o1}} \\\delta_{o1}=-\left(\text {target}_{o1}-\text {out}_{o1}\right) * \text {out}_{o1}\left(1-\text {out}_{o1}\right)\end{array}$
因此，整体误差$E(total)$对$w5$的偏导公式可以写成：
$\frac{\partial E_{\text {total}}}{\partial w_{5}}=\delta_{o1} \text {out}_{h1}$

最后使用梯度下降来更新$w_5$：

$w_{5}:=w_{5}-\eta * \frac{\partial E_{total}}{\partial w_{5}}=0.4-0.5 * 0.082167041=0.35891648$

有些地方使用$\alpha$表示学习率，这里使用$\eta$来表示。

我们可以重复这个过程来更新其他权重：

$\begin{array}{l}w_{6}:=0.408666186 \\w_{7}:=0.511301270 \\w_{8}:=0.561370121\end{array}$

隐含层—>输入层

下面接着向后传递通过计算$w_1、w_2、w_3、w_4$，首先计算$\frac{\partial E_{\text {total }}}{\partial w_{1}}$：

$\frac{\partial E_{total}}{\partial w_{1}}=\frac{\partial E_{total}}{\partial out_{h1}} * \frac{\partial out_{h 1}}{\partial net_{h1}} * \frac{\partial net_{h1}}{\partial w_{1}}$

我使用与输出层类似的过程，但略有不同，因为每个隐藏层神经元的输出会影响多个输出神经元的输出（$out_{h_1}$会影响$out_{o1}$和$out_{o2}$）。直观上理解：

在隐含层之间的权值更新时，$out(h1)—>net(h1)—>w1$，而$out(h1)$会接受$E(o1)$和$E(o2)$两个地方传来的误差，所以这个地方两个都要计算。

$\frac{\partial E_{total}}{\partial \text {out}_{h1}}=\frac{\partial E_{o1}}{\partial \text {out}_{h1}}+\frac{\partial E_{o2}}{\partial \text {out}_{h1}}$

计算$\frac{\partial E_{o1}}{\partial \text {out}_{h1}}$：

$\begin{aligned} \frac{\partial E_{o1}}{\partial out_{h1}} &= \frac{\partial E_{o1}}{\partial net_{o1}} * \frac{\partial net_{o1}}{\partial out_{h1}}\\ \\ \frac{\partial E_{o1}}{\partial net_{o1}} &= \frac{\partial E_{o1}}{\partial out_{o1}} * \frac{\partial out_{o1}}{\partial net_{o1}}=0.74136507 * 0.186815602=0.138498562 \\ \\ net_{o 1}&=w_{5} * \text { out }_{h 1}+w_{6} * \text { out }_{h 2}+b_{2} * 1 \\ \\ \frac{\partial n e t_{o 1}}{\partial o u t_{h 1}}&=w_{5}=0.40 \\ \\ \frac{\partial E_{o 1}}{\partial o u t_{h 1}}&=\frac{\partial E_{o 1}}{\partial n e t_{o 1}} * \frac{\partial n e t_{01}}{\partial o u t_{h 1}}=0.138498562 * 0.40=0.055399425 \end{aligned}$

按照此过程，同样求出：

$\frac{\partial E_{o 2}}{\partial o u t_{h 1}}=-0.019049119$

得到总值：

$\frac{\partial E_{total}}{\partial out_{h1}}=\frac{\partial E_{o1}}{\partial out_{h1}}+\frac{\partial E_{o2}}{\partial \text {out}_{h1}}=0.055399425+-0.019049119=0.036350306$

现在，我们求出了$\frac{\partial E_{total}}{\partial out_{h1}}$，，还需要弄清楚$\frac{\partial out_{h 1}}{\partial net_{h1}}$以及$ \frac{\partial net_{h1}}{\partial w_{1}}$，其中：

$\begin{aligned} \text {out}_{h1}&=\frac{1}{1+e^{-net_{h1}}} \\ \frac{\partial out_{h1}}{\partial net_{h1}}&=\text {out}_{h1}\left(1-\text {out}_{h1}\right)=0.59326999(1-0.59326999)=0.241300709 \end{aligned}$

计算$h_1$相对于$w_1$输出神经元的总输入的偏导数：

$\begin{array}{l} \text {net}_{h1}=w_{1} * i_{1}+w_{3} * i_{2}+b_{1} * 1 \\ \frac{\partial net_{n1}}{\partial w_{1}}=i_{1}=0.05 \end{array}$

最后可得：

$\begin{aligned} \frac{\partial E_{total}}{\partial out_{h1}}&=\frac{\partial E_{o1}}{\partial out_{h1}}+\frac{\partial E_{o2}}{\partial out_{h1}}\\ &=0.036350306 * 0.241300709 * 0.05=0.000438568 \end{aligned}$

>
>

简化公式，使用$\delta_{h_1}$表示隐含层单元$h1$的误差
$\begin{aligned} \frac{\partial E_{total}}{\partial w_{1}}&=\left(\sum_{o} \frac{\partial E_{total}}{\partial out_{o}} * \frac{\partial out_{o}}{\partial net_{o}} * \frac{\partial net_{o}}{\partial out_{h1}}\right) * \frac{\partial out_{h 1}}{\partial net_{h 1}} * \frac{\partial net_{h 1}}{\partial w_{1}} \\ \frac{\partial E_{total}}{\partial w_{1}}&=\left(\sum_{o} \delta_{o} * w_{ho}\right) * out_{h1}\left(1-out_{h1}\right) * i_{1} \\ \frac{\partial E_{total}}{\partial w_{1}}&=\delta_{h 1} i_{1} \end{aligned}$

更新$w_1$的值：

$w_{1}:=w_{1}-\eta * \frac{\partial E_{total}}{\partial w_{1}}=0.15-0.5 * 0.000438568=0.149780716$

同理，可得：

$\begin{array}{l}w_{2}:=0.19956143 \\w_{3}:=0.24975114 \\w_{4}:=0.29950229\end{array}$

这样误差反向传播法就完成了，最后我们再把更新的权值重新计算，不停地迭代，在这个例子中第一次迭代之后，总误差E(total)由0.298371109下降至0.291027924。迭代10000次后，总误差为0.000035085，输出为0.015912196,0.984065734(原输出为[0.01,0.99]),

最初以0.05和0.1作为输入时，网络上的误差为0.298371109。经过第一轮反向传播后，总误差现在降至0.291027924。看起来似乎不多，但是在重复此过程10,000次之后，总误差降至0.0000351085。最终，当我们以0.05和0.1作为输入进行前向传播时，两个输出神经元生成0.015912196（相对于0.01）和0.984065734（相对于0.99），证明误差传播算法还是有效的。

import random
import math

#
#   参数解释：
#   "pd_" ：作为一个变量前缀的意思是 偏导数
#   "d_" ：作为一个变量前缀的意思是 导数
#   "w_ho" ：隐含层到输出层的权重系数索引
#   "w_ih" ：输入层到隐含层的权重系数的索引

class NeuralNetwork:
    LEARNING_RATE = 0.5

    def __init__(self, num_inputs, num_hidden, num_outputs, hidden_layer_weights=None, hidden_layer_bias=None,
                 output_layer_weights=None, output_layer_bias=None):
        self.num_inputs = num_inputs

        self.hidden_layer = NeuronLayer(num_hidden, hidden_layer_bias)
        self.output_layer = NeuronLayer(num_outputs, output_layer_bias)

        self.init_weights_from_inputs_to_hidden_layer_neurons(hidden_layer_weights)
        self.init_weights_from_hidden_layer_neurons_to_output_layer_neurons(output_layer_weights)

    def init_weights_from_inputs_to_hidden_layer_neurons(self, hidden_layer_weights):
        weight_num = 0
        for h in range(len(self.hidden_layer.neurons)):
            for i in range(self.num_inputs):
                if not hidden_layer_weights:
                    self.hidden_layer.neurons[h].weights.append(random.random())
                else:
                    self.hidden_layer.neurons[h].weights.append(hidden_layer_weights[weight_num])
                weight_num += 1

    def init_weights_from_hidden_layer_neurons_to_output_layer_neurons(self, output_layer_weights):
        weight_num = 0
        for o in range(len(self.output_layer.neurons)):
            for h in range(len(self.hidden_layer.neurons)):
                if not output_layer_weights:
                    self.output_layer.neurons[o].weights.append(random.random())
                else:
                    self.output_layer.neurons[o].weights.append(output_layer_weights[weight_num])
                weight_num += 1

    def inspect(self):
        print('------')
        print('* Inputs: {}'.format(self.num_inputs))
        print('------')
        print('Hidden Layer')
        self.hidden_layer.inspect()
        print('------')
        print('* Output Layer')
        self.output_layer.inspect()
        print('------')

    def feed_forward(self, inputs):
        hidden_layer_outputs = self.hidden_layer.feed_forward(inputs)
        return self.output_layer.feed_forward(hidden_layer_outputs)

    # Uses online learning, ie updating the weights after each training case
    def train(self, training_inputs, training_outputs):
        self.feed_forward(training_inputs)

        # 1. 输出神经元的值
        pd_errors_wrt_output_neuron_total_net_input = [0] * len(self.output_layer.neurons)
        for o in range(len(self.output_layer.neurons)):
            # ∂E/∂zⱼ
            pd_errors_wrt_output_neuron_total_net_input[o] = self.output_layer.neurons[
                o].calculate_pd_error_wrt_total_net_input(training_outputs[o])

        # 2. 隐含神经元的值
        pd_errors_wrt_hidden_neuron_total_net_input = [0] * len(self.hidden_layer.neurons)
        for h in range(len(self.hidden_layer.neurons)):

            # We need to calculate the derivative of the error with respect to the output of each hidden layer neuron
            # dE/dyⱼ = Σ ∂E/∂zⱼ * ∂z/∂yⱼ = Σ ∂E/∂zⱼ * wᵢⱼ
            d_error_wrt_hidden_neuron_output = 0
            for o in range(len(self.output_layer.neurons)):
                d_error_wrt_hidden_neuron_output += pd_errors_wrt_output_neuron_total_net_input[o] * \
                                                    self.output_layer.neurons[o].weights[h]

            # ∂E/∂zⱼ = dE/dyⱼ * ∂zⱼ/∂
            pd_errors_wrt_hidden_neuron_total_net_input[h] = d_error_wrt_hidden_neuron_output * \
                                                             self.hidden_layer.neurons[
                                                                 h].calculate_pd_total_net_input_wrt_input()

        # 3. 更新输出层权重系数
        for o in range(len(self.output_layer.neurons)):
            for w_ho in range(len(self.output_layer.neurons[o].weights)):
                # ∂Eⱼ/∂wᵢⱼ = ∂E/∂zⱼ * ∂zⱼ/∂wᵢⱼ
                pd_error_wrt_weight = pd_errors_wrt_output_neuron_total_net_input[o] * self.output_layer.neurons[
                    o].calculate_pd_total_net_input_wrt_weight(w_ho)

                # Δw = α * ∂Eⱼ/∂wᵢ
                self.output_layer.neurons[o].weights[w_ho] -= self.LEARNING_RATE * pd_error_wrt_weight

        # 4. 更新隐含层权重系数
        for h in range(len(self.hidden_layer.neurons)):
            for w_ih in range(len(self.hidden_layer.neurons[h].weights)):
                # ∂Eⱼ/∂wᵢ = ∂E/∂zⱼ * ∂zⱼ/∂wᵢ
                pd_error_wrt_weight = pd_errors_wrt_hidden_neuron_total_net_input[h] * self.hidden_layer.neurons[
                    h].calculate_pd_total_net_input_wrt_weight(w_ih)

                # Δw = α * ∂Eⱼ/∂wᵢ
                self.hidden_layer.neurons[h].weights[w_ih] -= self.LEARNING_RATE * pd_error_wrt_weight

    def calculate_total_error(self, training_sets):
        total_error = 0
        for t in range(len(training_sets)):
            training_inputs, training_outputs = training_sets[t]
            self.feed_forward(training_inputs)
            for o in range(len(training_outputs)):
                total_error += self.output_layer.neurons[o].calculate_error(training_outputs[o])
        return total_error


class NeuronLayer:
    def __init__(self, num_neurons, bias):

        # 同一层的所有神经元共享一个截距项b
        self.bias = bias if bias else random.random()

        self.neurons = []
        for i in range(num_neurons):
            self.neurons.append(Neuron(self.bias))

    def inspect(self):
        print('Neurons:', len(self.neurons))
        for n in range(len(self.neurons)):
            print(' Neuron', n)
            for w in range(len(self.neurons[n].weights)):
                print('  Weight:', self.neurons[n].weights[w])
            print('  Bias:', self.bias)

    def feed_forward(self, inputs):
        outputs = []
        for neuron in self.neurons:
            outputs.append(neuron.calculate_output(inputs))
        return outputs

    def get_outputs(self):
        outputs = []
        for neuron in self.neurons:
            outputs.append(neuron.output)
        return outputs


class Neuron:
    def __init__(self, bias):
        self.bias = bias
        self.weights = []

    def calculate_output(self, inputs):
        self.inputs = inputs
        self.output = self.squash(self.calculate_total_net_input())
        return self.output

    def calculate_total_net_input(self):
        total = 0
        for i in range(len(self.inputs)):
            total += self.inputs[i] * self.weights[i]
        return total + self.bias

    # Apply the logistic function to squash the output of the neuron
    def squash(self, total_net_input):
        return 1 / (1 + math.exp(-total_net_input))

    # Determine how much the neuron's total input has to change to move closer to the expected output
    #
    # Now that we have the partial derivative of the error with respect to the output (∂E/∂yⱼ) and
    # the derivative of the output with respect to the total net input (dyⱼ/dzⱼ) we can calculate
    # the partial derivative of the error with respect to the total net input.
    # This value is also known as the delta (δ) [1]
    # δ = ∂E/∂zⱼ = ∂E/∂yⱼ * dyⱼ/dzⱼ
    #
    def calculate_pd_error_wrt_total_net_input(self, target_output):
        return self.calculate_pd_error_wrt_output(target_output) * self.calculate_pd_total_net_input_wrt_input();

    # The error for each neuron is calculated by the Mean Square Error method:
    def calculate_error(self, target_output):
        return 0.5 * (target_output - self.output) ** 2

    # 每一个神经元的误差是由平方差公式计算的
    def calculate_pd_error_wrt_output(self, target_output):
        return -(target_output - self.output)

    def calculate_pd_total_net_input_wrt_input(self):
        return self.output * (1 - self.output)

    def calculate_pd_total_net_input_wrt_weight(self, index):
        return self.inputs[index]


###

# 文中的例子

nn = NeuralNetwork(2, 2, 2, hidden_layer_weights=[0.15, 0.2, 0.25, 0.3], hidden_layer_bias=0.35,
                   output_layer_weights=[0.4, 0.45, 0.5, 0.55], output_layer_bias=0.6)
for i in range(10000):
    nn.train([0.05, 0.1], [0.01, 0.99])
    print(i, round(nn.calculate_total_error([[[0.05, 0.1], [0.01, 0.99]]]), 9))

# XOR 另外一个例子:

# training_sets = [
#     [[0, 0], [0]],
#     [[0, 1], [1]],
#     [[1, 0], [1]],
#     [[1, 1], [0]]
# ]

# nn = NeuralNetwork(len(training_sets[0][0]), 5, len(training_sets[0][1]))
# for i in range(10000):
#     training_inputs, training_outputs = random.choice(training_sets)
#     nn.train(training_inputs, training_outputs)
#     print(i, nn.calculate_total_error(training_sets))