Having understood the one-layer net we can now turn our attention to the logical extension, the two-layer net.
To make our observation more general we also include a second ouput node.
Let this network be fully connected with two weight matrices.
We want to minimize the error over all output nodes, so our cost function is now
$$ E = \sum_k \frac{1}{2} (a_k - t_k)^2 $$
The derivation proceeds in the same fashion as before, except that we now have an additional index.
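Before we differentiate, a quick numerical check of this cost function: with, say, outputs $a_1 = 0.8$, $a_2 = 0.3$ and targets $t_1 = 1$, $t_2 = 0$ we get
$$ E = \frac{1}{2}(0.8-1)^2 + \frac{1}{2}(0.3-0)^2 = 0.02 + 0.045 = 0.065 $$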
From the picture it is clear that the error at $z_{k-1}$ (the other output node) does not depend on $w_{jk}$; therefore the sum reduces to a single term.
$$ \frac{\partial E}{\partial w_{jk}} = (a_k-t_k) \frac{\partial}{\partial w_{jk}} (a_k-t_k) $$
The activation is still $a_k = g_k(z_k)$, and the target $t_k$ also does not depend on $w_{jk}$.
$$ \begin{align} \frac{\partial E}{\partial w_{jk}} & = (a_k-t_k) \frac{\partial}{\partial w_{jk}} a_k \\ & = (a_k-t_k) \frac{\partial}{\partial w_{jk}} g_k(z_k) \\ & = (a_k-t_k) g'_k(z_k) \frac{\partial}{\partial w_{jk}} z_k \end{align} $$
We also still have
$$ \begin{align} z_k & = \sum_j g_j(z_j) w_{jk} \\ \frac{\partial z_k}{\partial w_{jk}} & = g_j(z_j) = a_j \end{align} $$
which means that
$$ \frac{\partial E}{\partial w_{jk}} = (a_k-t_k) g'_k(z_k) a_j $$
Looking from the activations $a_j$ in the hidden layer to the errors at the output layer, we have found the gradient; let us define the first two factors as $\delta_k$:
$$ \begin{align} \delta_k & = (a_k-t_k) g'_k(z_k) \\ \frac{\partial E}{\partial w_{jk}} & = \delta_k a_j \end{align} $$
For the learning rule we introduce a small learning rate $l$ and again update the weights in the direction of the negative gradient, since we want to minimize $E$:
$$ w_{jk} \leftarrow w_{jk} - l \delta_k a_j $$
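To see the update rule in action, here is a small numerical sketch with made-up numbers (three hidden activations, two output nodes); the names a_j, W_jk and t_k simply mirror the indices above and are not part of the implementation we write later.

import numpy as np

def g(z):                                   # sigmoid activation, as before
    return 1. / (1. + np.exp(-z))

a_j = np.array([0.2, 0.7, 0.5])             # hidden activations a_j (made up)
W_jk = np.array([[0.1, -0.3, 0.4],          # one row of weights w_jk per output node k
                 [0.2,  0.5, -0.1]])
t_k = np.array([1.0, 0.0])                  # targets t_k
l = 0.01                                    # learning rate

z_k = np.dot(W_jk, a_j)                     # z_k = sum_j a_j w_jk
a_k = g(z_k)                                # a_k = g(z_k)
d_k = (a_k - t_k) * a_k * (1 - a_k)         # delta_k = (a_k - t_k) g'(z_k), with g'(z) = g(z)(1 - g(z))
W_jk = W_jk - l * np.outer(d_k, a_j)        # w_jk <- w_jk - l * delta_k * a_j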
Next we look for the change in $E$ with an input weight $w_{ij}$, starting again from the error function. From the picture above it is now clear that all $z_k$ depend on $w_{ij}$, so the summation does not immediately disappear.
$$ \begin{align} E & = \frac{1}{2} \sum_k (a_k-t_k)^2 \\ \frac{\partial E}{\partial w_{ij}} & = \sum_k (a_k-t_k) \frac{\partial}{\partial w_{ij}} a_k \end{align} $$
With $a_k = g_k(z_k)$ we have
$$ \begin{align} \frac{\partial E}{\partial w_{ij}} & = \sum_k (a_k-t_k) \frac{\partial}{\partial w_{ij}} g_k(z_k) \\ & = \sum_k (a_k-t_k) g'_k(z_k) \frac{\partial}{\partial w_{ij}} z_k \end{align} $$
For $z_k$ we have
$$ \begin{align} z_k & = \sum_j a_j w_{jk} \\ & = \sum_j g_j(z_j) w_{jk} \\ & = \sum_j g_j\left(\sum_i a_i w_{ij}\right) w_{jk} \end{align} $$
This shows all the weights that have an effect on $z_k$.
Note in the picture that $w_{ij}$ has an effect only on the $z_j$ it connects to; in the derivative we can ignore the remaining terms in the sums.
$$ \begin{align} \frac{\partial z_k}{\partial w_{ij}} & = w_{jk} \frac{\partial g_j(z_j)}{\partial w_{ij}} \\ & = w_{jk} g'_j(z_j) \frac{\partial z_j}{\partial w_{ij}} \\ & = w_{jk} g'_j(z_j) \frac{\partial}{\partial w_{ij}} \sum_i a_i w_{ij} \\ & = w_{jk} g'_j(z_j) a_i \end{align} $$
Now we have the derivative for $z_k$; we can go back to $E$ and use our $\delta_k$:
$$ \begin{align} \frac{\partial E}{\partial w_{ij}} & = \sum_k (a_k-t_k) g'_k(z_k) w_{jk} g'_j(z_j) a_i \\ & = g'_j(z_j) a_i \sum_k (a_k-t_k) g'_k(z_k) w_{jk} \\ & = a_i g'_j(z_j) \sum_k \delta_k w_{jk} \end{align} $$
Let us define a $\delta_j$, which is the error backpropagated to layer $j$:
$$ \begin{align} \delta_j & = g'_j(z_j) \sum_k \delta_k w_{jk} \\ \frac{\partial E}{\partial w_{ij}} & = \delta_j a_i \\ w_{ij} & \leftarrow w_{ij} - l \delta_j a_i \end{align} $$
This also works for deep networks with more than two layers: we can calculate the weight gradients at any layer by backpropagating the errors and weighting them with the signal going into that layer, as sketched below.
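As an illustration of that claim (and not the implementation we use below), here is a schematic sketch of the recursion for a single sample x with target t and an arbitrary list Ws of weight matrices, reusing numpy and the sigmoid g from the sketch above; all names are chosen for this sketch only.

def backprop_step(x, t, Ws, l=0.01):
    # forward pass: remember the activation going into each layer
    a = [x]
    for W in Ws:
        a.append(g(np.dot(W, a[-1])))
    # output layer: delta = (a - t) * g'(z), with g'(z) = a (1 - a)
    delta = (a[-1] - t) * a[-1] * (1 - a[-1])
    # walk backwards through the layers
    for n in reversed(range(len(Ws))):
        grad = np.outer(delta, a[n])                           # gradient of E for the weights of layer n
        if n > 0:                                              # backpropagate the error one layer down
            delta = np.dot(Ws[n].T, delta) * a[n] * (1 - a[n])
        Ws[n] = Ws[n] - l * grad                               # update only after delta has been passed on
    return Ws

With a list of two weight matrices this performs exactly the two updates we derived, just for a single sample instead of the whole dataset.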
We will implement the two-layer net using only numpy functions again.
We prepare our data:
The last step below, where the target gets one column per class, is not really necessary for this dataset, but we want to make it clear that both our derivation and the implementation work for more than one output value.
import numpy as np
np.random.seed(1)
data = np.genfromtxt('iris.data', delimiter=',', usecols=(0,1,2,3))
X = data[:100,] - data[:100].mean(axis=0)   # first 100 rows (two classes), features centered
y = np.zeros((len(X),2))                    # one target column per output node
y[:50,0] = 1                                # first class
y[50:,1] = 1                                # second class
print(y[:3,])
print(y[-3:,])
A little helper function to print the names and shapes of variables; very useful for understanding the code.
def shapes(names, values):
    names = names.split(',')
    for i in range(len(names)): print(names[i]+':', values[i].shape)
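For example, with the arrays prepared above, shapes('X,y', (X, y)) prints X: (100, 4) and y: (100, 2).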
Our activation function is still sigmoid:
def f(x):
    return 1. / (1. + np.exp(-x))
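The derivative $g'(z) = g(z)(1-g(z))$ appears in both deltas; written as a helper it would look like this (we do not define it separately below, the expression is spelled out inline in nn2):

def df(x):
    return f(x) * (1. - f(x))   # g'(z) = g(z) (1 - g(z)) for the sigmoid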
The implementation is mostly straightforward and stays very close to the notation above.
def nn2(X, y, l=0.01, epochs=100, h=3):
    if y.ndim==1: y = np.reshape(np.ravel(y), (len(y),1))   # make sure y has an explicit output dimension
    Wij = np.random.rand(h, X.shape[1]) - 0.5               # input-to-hidden weights
    Wjk = np.random.rand(y.shape[1], h) - 0.5               # hidden-to-output weights
    for ep in range(epochs):
        # forward pass
        zj = np.dot(Wij, X.T)
        aj = f(zj)
        zk = np.dot(Wjk, aj)
        ak = f(zk)
        # backward pass: delta_k at the output, delta_j backpropagated to the hidden layer
        # (dj uses the same Wjk as the forward pass, so both deltas are computed before updating)
        dk = (ak - y.T) * f(zk) * (1 - f(zk))
        dj = f(zj) * (1 - f(zj)) * np.dot(dk.T, Wjk).T
        # weight updates in the direction of the negative gradient
        Wjk += np.dot(-l * dk, aj.T)
        Wij += np.dot(-l * dj, X)
        if ep % (epochs//10) == 0:
            print(np.sum(abs(ak-y.T), axis=1)/len(X))
            shapes('X,y,Wij,Wjk,zj,aj,zk,dk,dj',(X,y,Wij,Wjk,zj,aj,zk,dk,dj))
    return Wij, Wjk
Wij, Wjk = nn2(X, y)
Let's try a few values for the number of units in the hidden layer:
Wij, Wjk = nn2(X, y, h=16)
Wij, Wjk = nn2(X, y, h=64)
Let us now apply our code to a larger dataset from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/datasets/bank+marketing
Since this dataset is not as small and easy to inspect as the Iris dataset, we use Pandas for importing and preparing the data:
import pandas as pd
df = pd.read_csv('bank.csv', delimiter=';').sample(frac=1)
print(df)
Pandas offers a lot of useful functions for data analysis, such as
| Function | Purpose |
| --- | --- |
| df.sample(frac=1) | shuffles the rows and avoids problems with sorted data |
| df.describe() | basic statistics for the numeric columns |
| df.colname.value_counts() | frequency of values, especially useful for non-numeric columns |
Take a look at https://pandas.pydata.org/pandas-docs/stable/index.html
df.describe()
df.education.value_counts()
df.y.value_counts()
☆ With a little more coding we can get nice summaries for the non-numeric columns:
from IPython.display import display, HTML, display_html
def displaysbs(*args):
    html = ''
    for x in args:
        html += pd.DataFrame(x).to_html()
    display_html(html.replace('table','table style="display:inline"'), raw=True)
displaysbs(df.job.value_counts(), df.marital.value_counts(), df.education.value_counts(),
df.contact.value_counts(), df.poutcome.value_counts(), df.y.value_counts())
Since some of the columns are not numeric we cannot use the whole data frame directly; here we select some of the numeric ones and apply a simple conversion to the last column.
X = df[['age', 'duration', 'campaign','pdays', 'previous']]
X = (X - X.mean()) / X.std()
y = (df['y']=='yes') * 1
print(X)
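As an aside, if we wanted to feed the non-numeric columns to the network as well, pandas could one-hot encode them with pd.get_dummies; a minimal sketch, not used in what follows:

Xcat = pd.get_dummies(df[['job', 'marital', 'education']]).astype(float)   # one 0/1 column per category
Xfull = pd.concat([X, Xcat], axis=1)                                       # standardized numeric columns plus the encoded ones
print(Xfull.shape)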
Now we are ready to predict the success or failure of the marketing campaigns:
W1, W2 = nn2(X, y)
The error drops quickly from 0.5, but then we get stuck. This is to be expected with such a simple approach.
As a theoretical concept, multilayer feedforward networks with a single hidden layer are universal approximators: by increasing the number of units in the hidden layer we can approximate any continuous function to any desired degree of accuracy.
In machine learning we are usually not so much interested in spectacular learning success on the training set; the real goal is to apply the learned 'concepts' to data not yet seen, and it is this performance that matters. Let us now split off a test set from the training data.
from sklearn.model_selection import train_test_split
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=1)
We define a function that only applies the feedforward steps:
def ffwd(X, y, Wij, Wjk):
    if y.ndim==1: y = np.reshape(np.ravel(y), (len(y),1))
    zj = np.dot(Wij, X.T)
    aj = f(zj)
    zk = np.dot(Wjk, aj)
    ak = f(zk)
    return np.sum(abs(ak-y.T), axis=1)/len(X)
Now we can train the network on the training data and then use the weights on the test data.
Wij, Wjk = nn2(Xtr, ytr)
ffwd(Xte, yte, Wij, Wjk)
As we can see, the performance on the test set is similar to that on the training set.
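The value returned by ffwd is a mean absolute error per output node rather than an accuracy. If we prefer a classification accuracy we can threshold the network output, here at the (arbitrary but common) value of 0.5:

ak = f(np.dot(Wjk, f(np.dot(Wij, Xte.T))))            # forward pass on the test set
pred = (ak.ravel() > 0.5) * 1                         # predicted class per row
print('accuracy:', np.mean(pred == np.ravel(yte)))    # fraction of correct predictions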
Improving the performance on the training set can lead to 'overfitting': the weights are optimized too much for this particular training data.
Wij, Wjk = nn2(Xtr, ytr, l=0.01, h=3, epochs=2000)
ffwd(Xte, yte, Wij, Wjk)
The performance on the test set is worse; the net has not 'generalized'. This problem can be tackled with more advanced architectures.