Two-Layer Neural Net

Having understood the one-layer net, we can now turn our attention to its logical extension, the two-layer net.

To make our observations more general we also include a second output node.

Let this network be fully connected with two weight matrices.

  • $a_i$ .. input activations
  • $z_j$ .. activations in hidden layer
  • $z_k$ .. activations in output layer
  • $w_{ij}$ .. weights to hidden layer
  • $w_{jk}$ .. weights to output layer
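
To make the notation concrete, the two weight matrices might be laid out in numpy as follows; the layer sizes 4, 3 and 2 are arbitrary placeholders, and the convention of one row per target unit matches the implementation further below:

import numpy as np
n_i, n_j, n_k = 4, 3, 2                    # hypothetical layer sizes
Wij = np.random.rand(n_j, n_i) - 0.5       # weights to hidden layer
Wjk = np.random.rand(n_k, n_j) - 0.5       # weights to output layer
print(Wij.shape, Wjk.shape)                # (3, 4) (2, 3)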

We want to minimize the error over all output nodes, so our cost function is now

$$ E = \sum_k \frac{1}{2} (a_k - t_k)^2 $$
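
Written out in numpy, the cost for a single example is just a few operations; the outputs and targets below are made-up placeholder values:

import numpy as np
ak = np.array([0.8, 0.3])        # hypothetical output activations
tk = np.array([1.0, 0.0])        # hypothetical targets
E  = np.sum(0.5 * (ak - tk)**2)  # 0.5*(0.04 + 0.09) = 0.065
print(E)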

Gradient for output layer weights

The derivation proceeds in the same fashion as before, except that we now have an additional index.

From the picture it is clear that the error at the other output nodes does not depend on $w_{jk}$; therefore the sum reduces to the single term for $k$.

$$ \frac{\partial E}{\partial w_{jk}} = (a_k-t_k) \frac{\partial}{\partial w_{jk}} (a_k-t_k)$$

The activation is still $a_k = g_k(z_k)$, and the target $t_k$ does not depend on $w_{jk}$ either.

$$ \begin{align} \frac{\partial E}{\partial w_{jk}} & = (a_k-t_k) \frac{\partial}{\partial w_{jk}} a_k \\ & = (a_k-t_k) \frac{\partial}{\partial w_{jk}} g_k(z_k) \\ & = (a_k-t_k) g'_k(z_k) \frac{\partial}{\partial w_{jk}} z_k \end{align}$$

We also still have

$$ \begin{align} z_k & = \sum_j g_j(z_j) w_{jk} \\ \frac{\partial z_k}{\partial w_{jk}} & = g_j(z_j) = a_j \end{align}$$

which means that

$$ \frac{\partial E}{\partial w_{jk}} = (a_k-t_k) g'_k(z_k) a_j $$

Looking from the activations $a_j$ in the hidden layer to the errors in the output layer we have found the gradient; let us define the first two factors as $\delta_k$:

$$ \begin{align} \delta_k & = (a_k-t_k) g'_k(z_k) \\ \frac{\partial E}{\partial w_{jk}} & = \delta_k a_j \end{align}$$

In our learning rule we introduce a small learning rate $l$ and again update the weights in the direction of the negative gradient, since we want to minimize $E$:

$$ w_{jk} \leftarrow w_{jk} - l \delta_k a_j $$
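
A minimal numpy sketch of this update, assuming sigmoid activations; the hidden activations, weights and targets are made-up placeholders that only mirror the notation above:

import numpy as np
l = 0.01                                   # learning rate
aj = np.array([0.5, 0.2, 0.9])             # hypothetical hidden activations
Wjk = np.random.rand(2, 3) - 0.5           # output-layer weights
zk = np.dot(Wjk, aj)
ak = 1. / (1. + np.exp(-zk))               # sigmoid activation
tk = np.array([1.0, 0.0])                  # hypothetical targets
dk = (ak - tk) * ak * (1 - ak)             # delta_k = (a_k - t_k) g'_k(z_k)
Wjk -= l * np.outer(dk, aj)                # w_jk <- w_jk - l delta_k a_j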

Gradients for hidden layer weights: Backpropagation

We are looking for the change in $E$ with input weight $w_{ij}$, and we start again with the error function.

From the picture above it is now clear that all $z_k$ depend on $w_{ij}$, so the sum over $k$ does not immediately disappear.

$$\begin{align} E & = \frac{1}{2} \sum_k (a_k-t_k)^2 \\ \frac{\partial E}{\partial w_{ij}} & = \sum_k (a_k-t_k) \frac{\partial}{\partial w_{ij}} a_k \end{align}$$

With $a_k = g_k(z_k)$ we have

$$ \begin{align} \frac{\partial E}{\partial w_{ij}} & = \sum_k (a_k-t_k) \frac{\partial}{\partial w_{ij}} g_k(z_k) \\ & = \sum_k (a_k-t_k) g'_k(z_k) \frac{\partial}{\partial w_{ij}} z_k \end{align}$$

For $z_k$ we have

$$\begin{align} z_k & = \sum_j a_j w_{jk} \\ & = \sum_j g_j(z_j) w_{jk} \\ & = \sum_j g_j(\sum_i a_i w_{ij}) w_{jk} \end{align}$$

This shows all the weights with an effect on $z_k$.

Note in the picture that $w_{ij}$ affects only the $z_j$ it connects to; in the derivative the remaining terms of the sum over $j$ therefore vanish.

$$\begin{align} \frac{\partial z_k}{\partial w_{ij}} & = w_{jk} \frac{\partial g_j(z_j)}{\partial w_{ij}} \\ & = w_{jk} g'_j(z_j) \frac{\partial z_j}{\partial w_{ij}} \\ & = w_{jk} g'_j(z_j) \frac{\partial}{\partial w_{ij}} \sum_i a_i w_{ij} \\ & = w_{jk} g'_j(z_j) a_i \end{align}$$

Now we have the derivative of $z_k$ with respect to $w_{ij}$; we can go back to $E$ and use our $\delta_k$:

$$ \begin{align} \frac{\partial E}{\partial w_{ij}} & = \sum_k (a_k-t_k) g'_k(z_k) w_{jk} g'_j(z_j) a_i \\ & = g'_j(z_j) a_i \sum_k (a_k-t_k) g'_k(z_k) w_{jk} \\ & = a_i g'_j(z_j) \sum_k \delta_k w_{jk} \end{align}$$

Let us define a $\delta_j$, which is the error backpropagated to layer $j$:

$$\begin{align} \delta_j & = g'_j(z_j) \sum_k \delta_k w_{jk} \\ \frac{\partial E}{\partial w_{ij}} & = \delta_j a_i \\ w_{ij} & \leftarrow w_{ij} - l \delta_j a_i \end{align}$$

This also works for deeper networks with more than two layers: we can calculate the weight gradients at any layer by backpropagating the errors and weighting them with the signal going into the layer.

$$ \begin{align} \delta_k & = (a_k-t_k) g'_k(z_k) \\ \delta_j & = g'_j(z_j) \sum_k \delta_k w_{jk} \\ w_{jk} & \leftarrow w_{jk} - l \delta_k a_j \\ w_{ij} & \leftarrow w_{ij} - l \delta_j a_i \end{align}$$
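
The four equations above translate almost directly into numpy. Below is a minimal sketch of a single training step for one input vector; the layer sizes, inputs and targets are made-up placeholders, and the sigmoid g matches the activation used in the implementation that follows:

import numpy as np
def g(x):                                  # sigmoid activation
    return 1. / (1. + np.exp(-x))
l  = 0.01                                  # learning rate
ai = np.array([0.1, 0.7, 0.3, 0.4])        # hypothetical input activations
t  = np.array([1.0, 0.0])                  # hypothetical targets
Wij = np.random.rand(3, 4) - 0.5           # weights to hidden layer
Wjk = np.random.rand(2, 3) - 0.5           # weights to output layer
zj = np.dot(Wij, ai); aj = g(zj)           # forward pass
zk = np.dot(Wjk, aj); ak = g(zk)
dk = (ak - t) * g(zk) * (1 - g(zk))        # delta_k
dj = g(zj) * (1 - g(zj)) * np.dot(Wjk.T, dk)   # delta_j, backpropagated error
Wjk -= l * np.outer(dk, aj)                # output-layer update
Wij -= l * np.outer(dj, ai)                # hidden-layer update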

Implementation

We will implement the two-layer net using only numpy functions again.

We prepare our data:

  • read from file, only numeric columns
  • center the values in each column, i.e. subtract the column mean
    • this helps (a bit) with the learning process
  • provide two output values
    • 0/1 for being the first class
    • 0/1 again for being the second class
    • each row in y is either [1 0] or [0 1]

The last step is not really necessary for this dataset, but we want to make it clear that both our derivation and the implementation work for more than one output value.

In [38]:
import numpy as np
np.random.seed(1)

data = np.genfromtxt('iris.data', delimiter=',', usecols=(0,1,2,3))
X = data[:100,] - data[:100].mean(axis=0)
y = np.zeros((len(X),2))
y[:50,0] = 1
y[50:,1] = 1
print(y[:3,])
print(y[-3:,])
[[1. 0.]
 [1. 0.]
 [1. 0.]]
[[0. 1.]
 [0. 1.]
 [0. 1.]]

A little helper function to print the names and shapes of variables; very useful for understanding the code.

In [39]:
def shapes(names, values):
    for name, value in zip(names.split(','), values):
        print(name + ':', value.shape)
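
For example, calling it on the arrays prepared above would print their dimensions (a hypothetical usage, not part of the notebook run):

shapes('X,y', (X, y))     # X: (100, 4)   y: (100, 2)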

Our activation function is still sigmoid:

In [40]:
def f(x):
    return 1. / (1. + np.exp(-x))
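
Its derivative has the convenient form $g'(x) = g(x)(1-g(x))$. The implementation below inlines this expression as f(z) * (1 - f(z)); written as a separate helper it would look like this (shown only for illustration, not used further):

def fprime(x):
    s = f(x)
    return s * (1 - s)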

The implementation is mostly straightforward and very close to the notation above.

  • There is a little reshaping and transposing for the np.dot() products
  • While the derivatives are based on the squared error, the mean absolute error is printed since it is easier to interpret
In [41]:
def nn2(X, y, l=0.01, epochs=100, h=3):
    # make sure y is a 2-d array, even for a single output column
    if y.ndim==1: y = np.reshape(np.ravel(y), (len(y),1))
    # weight matrices, initialized with small random values around zero
    Wij = np.random.rand(h, X.shape[1]) - 0.5
    Wjk = np.random.rand(y.shape[1], h) - 0.5
    for ep in range(epochs):
        # forward pass
        zj = np.dot(Wij, X.T)
        aj = f(zj)
        zk = np.dot(Wjk, aj)
        ak = f(zk)
        # backward pass: delta_k, delta_j and the weight updates
        dk = (ak - y.T) * f(zk) * (1 - f(zk))
        Wjk += np.dot(-l * dk, aj.T)
        dj = f(zj) * (1 - f(zj)) * np.dot(dk.T, Wjk).T
        Wij += np.dot(-l * dj, X)
        # report the mean absolute error ten times during training
        if ep % (epochs/10) == 0:
            print(np.sum(abs(ak-y.T), axis=1)/len(X))
    shapes('X,y,Wij,Wjk,zj,aj,zk,dk,dj',(X,y,Wij,Wjk,zj,aj,zk,dk,dj))
    return Wij, Wjk
In [42]:
Wij, Wjk = nn2(X, y)
[0.50537322 0.50659499]
[0.45841007 0.47926662]
[0.40002783 0.43491979]
[0.33302202 0.37085896]
[0.27488282 0.30696311]
[0.23162413 0.25637168]
[0.20029925 0.21914628]
[0.17713051 0.19171728]
[0.15944754 0.17100117]
[0.14554057 0.15490222]
X: (100, 4)
y: (100, 2)
Wij: (3, 4)
Wjk: (2, 3)
zj: (3, 100)
aj: (3, 100)
zk: (2, 100)
dk: (2, 100)
dj: (3, 100)

Let's try a few values for the number of units in the hidden layer:

In [43]:
Wij, Wjk = nn2(X, y, h=16)
[0.55057775 0.50957025]
[0.44341626 0.39517687]
[0.32894492 0.28845904]
[0.23933737 0.21300404]
[0.18421798 0.16701879]
[0.1501097  0.13807598]
[0.12750762 0.11854014]
[0.11152904 0.1045134 ]
[0.09963648 0.09394189]
[0.09042437 0.08566947]
X: (100, 4)
y: (100, 2)
Wij: (16, 4)
Wjk: (2, 16)
zj: (16, 100)
aj: (16, 100)
zk: (2, 100)
dk: (2, 100)
dj: (16, 100)
In [44]:
Wij, Wjk = nn2(X, y, h=64)
[0.44151191 0.49669589]
[0.20297177 0.21905948]
[0.14040546 0.14616842]
[0.11060284 0.11346334]
[0.09285005 0.09448387]
[0.08090213 0.08189597]
[0.07222539 0.0728406 ]
[0.06558938 0.06596132]
[0.06032045 0.06052682]
[0.05601679 0.05610562]
X: (100, 4)
y: (100, 2)
Wij: (64, 4)
Wjk: (2, 64)
zj: (64, 100)
aj: (64, 100)
zk: (2, 100)
dk: (2, 100)
dj: (64, 100)

The Bank dataset

Let us now apply our code to a larger dataset from the UCI Machine Learning Repository:

https://archive.ics.uci.edu/ml/datasets/bank+marketing

Since this dataset is not as small and easy to survey as the Iris dataset, we use Pandas for importing and preparing the data:

In [2]:
import pandas as pd
df = pd.read_csv('bank.csv', delimiter=';').sample(frac=1)
print(df)
      age           job  marital  education default  balance housing loan  \
2281   58       retired  married  secondary      no      983      no   no   
915    33    management  married  secondary      no     4040     yes   no   
3871   30    unemployed  married  secondary      no      142     yes   no   
3782   50   blue-collar  married  secondary      no     2320     yes   no   
2689   30    management   single   tertiary      no     4213      no   no   
...   ...           ...      ...        ...     ...      ...     ...  ...   
831    47   blue-collar  married  secondary      no     1996      no   no   
1606   41   blue-collar   single    unknown      no     4684      no   no   
1252   38  entrepreneur  married  secondary      no      868      no   no   
1951   46  entrepreneur   single   tertiary      no     1410     yes   no   
1685   31      services  married  secondary      no     1062     yes  yes   

       contact  day month  duration  campaign  pdays  previous poutcome    y  
2281  cellular    5   aug       139         1     -1         0  unknown   no  
915   cellular   20   apr       132         2     -1         0  unknown   no  
3871   unknown   20   may       350         4     -1         0  unknown   no  
3782  cellular    7   apr       310         1     -1         0  unknown   no  
2689  cellular   25   aug       142         4     -1         0  unknown   no  
...        ...  ...   ...       ...       ...    ...       ...      ...  ...  
831    unknown    5   jun       761         2     -1         0  unknown  yes  
1606   unknown   20   jun        30         6     -1         0  unknown   no  
1252  cellular   20   nov       452         2     -1         0  unknown   no  
1951  cellular   19   nov        93         1    189         8  failure   no  
1685  cellular    6   apr       230         1     -1         0  unknown   no  

[4521 rows x 17 columns]

Pandas offers a lot of useful functions for data analysis, such as

  • df.sample(frac=1) .. shuffles the rows, which avoids problems with sorted data
  • df.describe() .. basic statistics for the numeric columns
  • df.colname.value_counts() .. frequency of values, especially useful for non-numeric columns

Take a look at https://pandas.pydata.org/pandas-docs/stable/index.html

In [3]:
df.describe()
Out[3]:
               age       balance          day     duration     campaign        pdays     previous
count  4521.000000   4521.000000  4521.000000  4521.000000  4521.000000  4521.000000  4521.000000
mean     41.170095   1422.657819    15.915284   263.961292     2.793630    39.766645     0.542579
std      10.576211   3009.638142     8.247667   259.856633     3.109807   100.121124     1.693562
min      19.000000  -3313.000000     1.000000     4.000000     1.000000    -1.000000     0.000000
25%      33.000000     69.000000     9.000000   104.000000     1.000000    -1.000000     0.000000
50%      39.000000    444.000000    16.000000   185.000000     2.000000    -1.000000     0.000000
75%      49.000000   1480.000000    21.000000   329.000000     3.000000    -1.000000     0.000000
max      87.000000  71188.000000    31.000000  3025.000000    50.000000   871.000000    25.000000
In [6]:
df.education.value_counts()
Out[6]:
secondary    2306
tertiary     1350
primary       678
unknown       187
Name: education, dtype: int64
In [4]:
df.y.value_counts()
Out[4]:
no     4000
yes     521
Name: y, dtype: int64

☆ With a little more coding we can get nice summaries for the non-numeric columns:

In [14]:
from IPython.display import display, HTML, display_html

def displaysbs(*args):
    html = ''
    for x in args:
        html += pd.DataFrame(x).to_html()
    display_html(html.replace('table','table style="display:inline"'), raw=True)

displaysbs(df.job.value_counts(), df.marital.value_counts(), df.education.value_counts(),
          df.contact.value_counts(), df.poutcome.value_counts(), df.y.value_counts())
job
management 969
blue-collar 946
technician 768
admin. 478
services 417
retired 230
self-employed 183
entrepreneur 168
unemployed 128
housemaid 112
student 84
unknown 38
marital
married 2797
single 1196
divorced 528
education
secondary 2306
tertiary 1350
primary 678
unknown 187
contact
cellular 2896
unknown 1324
telephone 301
poutcome
unknown 3705
failure 490
other 197
success 129
y
no 4000
yes 521

Since some of the columns are not numeric we cannot use the whole data frame directly; here we select some of the numeric ones and apply a simple conversion to the last column.

In [46]:
X = df[['age', 'duration', 'campaign','pdays', 'previous']]
X = (X - X.mean()) / X.std()
y = (df['y']=='yes') * 1
print(X)
           age  duration  campaign     pdays  previous
3203 -0.867049 -0.146086 -0.255202 -0.407173 -0.320377
1041 -0.205186 -0.019092 -0.255202 -0.407173 -0.320377
3671 -0.394290 -0.842623 -0.255202 -0.407173 -0.320377
3256 -0.488842 -0.196113  0.066361  1.420613  3.222451
1251  0.267573 -0.323106 -0.255202 -0.407173 -0.320377
...        ...       ...       ...       ...       ...
2432 -0.772497 -0.542458  0.709488 -0.407173 -0.320377
4361 -1.056153 -0.573244  1.031051 -0.407173 -0.320377
3574 -0.677946 -0.473189 -0.576766  1.480540  0.270094
3182 -0.488842  0.234894  0.387925 -0.407173 -0.320377
1912  1.213091  4.240949 -0.576766  1.101000  0.270094

[4521 rows x 5 columns]

Now we are ready to predict the success or failure of the marketing campaigns:

In [47]:
W1, W2 = nn2(X, y)
[0.4738432]
[0.15457749]
[0.16276356]
[0.16529129]
[0.16569943]
[0.16557418]
[0.16534595]
[0.16511641]
[0.16490748]
[0.1647218]
X: (4521, 5)
y: (4521, 1)
Wij: (3, 5)
Wjk: (1, 3)
zj: (3, 4521)
aj: (3, 4521)
zk: (1, 4521)
dk: (1, 4521)
dj: (3, 4521)

The error drops quickly from about 0.5, but then gets stuck around 0.16. This is to be expected with such a simple approach.

Training and Testing

As a theoretical concept, multilayer feedforward networks with a single hidden layer are universal approximators: by increasing the number of units in the hidden layer we can approximate any continuous function to any desired accuracy.

  • For an actual implementation such as the one above we may run into numerical problems.
  • The speed of a particular learning method and its convergence to the optimal weights are a different subject.

In machine learning we are usually not so much interested in spectacular performance on the training set. The real goal is to apply the learned 'concepts' to data not yet seen; it is this performance that matters. Let us therefore split off a test set from the training data.

from sklearn.model_selection import train_test_split

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, random_state=1)
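
If sklearn is not available, roughly the same split can be done by hand; this sketch assumes the rows were already shuffled by sample(frac=1) above:

n_test = len(X) // 5                          # hold out 20% of the rows
Xtr, Xte = X.iloc[:-n_test], X.iloc[-n_test:]
ytr, yte = y.iloc[:-n_test], y.iloc[-n_test:]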

We define a function that only applies the feedforward steps:

In [48]:
def ffwd(X, y, Wij, Wjk):
    # forward pass only, using previously trained weights
    if y.ndim==1: y = np.reshape(np.ravel(y), (len(y),1))
    zj = np.dot(Wij, X.T)
    aj = f(zj)
    zk = np.dot(Wjk, aj)
    ak = f(zk)
    # mean absolute error per output node
    return np.sum(abs(ak-y.T), axis=1)/len(X)

Now we can train the network on the training data and then use the weights on the test data.

In [49]:
Wij, Wjk = nn2(Xtr, ytr)
ffwd(Xte, yte, Wij, Wjk)
[0.51792324]
[0.16689388]
[0.17026981]
[0.16972359]
[0.16872429]
[0.16785448]
[0.1671597]
[0.16660939]
[0.16616981]
[0.16581433]
X: (3616, 5)
y: (3616, 1)
Wij: (3, 5)
Wjk: (1, 3)
zj: (3, 3616)
aj: (3, 3616)
zk: (1, 3616)
dk: (1, 3616)
dj: (3, 3616)
Out[49]:
array([0.16776508])

As we can see, the error on the test set is similar to the error on the training set.
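
The mean absolute error is somewhat abstract; as a rough sanity check we can also threshold the network outputs and measure the classification accuracy on the test set. This is a sketch reusing f and the weights trained above; the 0.5 threshold is an assumption:

ak = f(np.dot(Wjk, f(np.dot(Wij, Xte.T))))  # forward pass on the test set
pred = (ak > 0.5) * 1                       # threshold the sigmoid outputs
print(np.mean(pred == np.ravel(yte)))       # fraction of correct predictions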

Overfitting

Improving the performance on the training set can lead to 'overfitting': the weights become tuned too closely to this particular training data.
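
One simple way to observe this is to train for increasing numbers of epochs and compare the error on the training set with the error on the test set; the sketch below reuses nn2 and ffwd from above, and the epoch counts are arbitrary choices:

for n in (100, 500, 2000):
    Wij, Wjk = nn2(Xtr, ytr, epochs=n)
    print(n, 'train:', ffwd(Xtr, ytr, Wij, Wjk), 'test:', ffwd(Xte, yte, Wij, Wjk))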

In [69]:
Wij, Wjk = nn2(Xtr, ytr, l=0.01, h=3, epochs=2000)
ffwd(Xte, yte, Wij, Wjk)
[0.53927345]
[0.16410845]
[0.16334025]
[0.16302574]
[0.16286953]
[0.1627811]
[0.1625994]
[0.16228727]
[0.1619644]
[0.16168037]
X: (3616, 5)
y: (3616, 1)
Wij: (3, 5)
Wjk: (1, 3)
zj: (3, 3616)
aj: (3, 3616)
zk: (1, 3616)
dk: (1, 3616)
dj: (3, 3616)
Out[69]:
array([0.16319995])

The error on the test set is now worse than on the training set; the net has not fully 'generalized'. This problem can be tackled with more advanced architectures.
