# Two-Layer Neural Net

Having explored the one-layer net we can now turn our attention to the logical extension, the two-layer net.

To make our observation more general we also use more than one ouput node.

Indices: i, j, k refer to layers: input, hidden, output 

We use the following expressions to refer to certain values, but not
in the sense of variables, e.g. $w_{ij}$ means some element in the weights from
input to hidden layer, not some variable $w$ indexed by $i$ and $j$.

- $x_i$ .. input values
- $z_j$ .. weighted sums in hidden layer
- $f()$.. activation function in hidden layer
- $a_j$ .. actications in hidden layer
- $z_k$ .. weighted sums in output layer
- $g()$ .. activation function in output layer
- $a_k$ .. activations in output layer
- $t_k$ .. target values
- $w_{ij}$ .. weights from input to hidden layer
- $w_{jk}$ .. weights from hidden to output layer

<img src=nn2.jpg>

We want to minimize the error over all output nodes, so our cost function is now

$$ E = \sum_k \frac{1}{2} (a_k - t_k)^2 $$

We only look at the error for a single observation, since an additional sum
for all observations does not contribute anything useful to the derivation.

## Gradient for output layer weights

The derivation starts in a similar fashion to the single-layer net.

From the picture is is clear that $a_{k-1}$ does not
depend on $w_{jk}$, therefore the sum disappears. 

For composite function F() = f(g()) the chain rule gives us F'() = f'(g())g'() 

$$ \frac{\delta E}{\delta w_{jk}}
= (a_k-t_k) \frac{\delta}{\delta w_{jk}} (a_k-t_k)$$

Activation is $a_k = g(z_k)$, and target $t_k$ does not depend on $w_{jk}$.
This leaves us with

$$ \begin{align}
\frac{\delta E}{\delta w_{jk}}
& = (a_k-t_k) \frac{\delta}{\delta w_{jk}} a_k \\
& = (a_k-t_k) \frac{\delta}{\delta w_{jk}} g(z_k) \\
& = (a_k-t_k) g'(z_k) \frac{\delta}{\delta w_{jk}} z_k
\end{align}$$

We also still have

$$ \begin{align}
z_k & = \sum_j f(z_j) w_{jk} \\
\frac{\delta z_k}{\delta w_{jk}} & = f(z_j) = a_j 
\end{align}$$

since all terms in the sum with weights other than $w_{jk}$ are constant with respect
to $w_{jk}$.

This means that

$$ \frac{\delta E}{\delta w_{jk}} =
(a_k-t_k) g'(z_k) a_j $$

Looking from activations $a_j$ in the hidden layers to the errors
in the output layer we have found the gradient; let us define the
first two parts as $d_k$:

$$ \begin{align}
d_k & = (a_k-t_k) g'(z_k) & (1) \\
\frac{\delta E}{\delta w_{jk}} & = d_k a_j
\end{align}$$

In our learning rule we introduce a smal factor $l$ and again proceed
with the weight updates in the direction of the negative gradient,
since we want to minimize $E$:

$$ w_{jk} \leftarrow w_{jk} - l d_k a_j $$

## Gradients for hidden layer weights: Backpropagation

We are looking for the change in $E$ with input weight $w_{ij}$, and we start again with the error function. 

In the picture above we see that all $a_k, a_{k-1}, ..$ depend on $w_{ij}$, 
so the sum does not immediately disappear.

$$\begin{align}
E & = \frac{1}{2} \sum_k (a_k-t_k)^2 \\
\frac{\delta E}{\delta w_{ij}}
& = \sum_k (a_k-t_k) \frac{\delta}{\delta w_{ij}} a_k
\end{align}$$

With $a_k = g(z_k)$ we have

$$ \begin{align}
\frac{\delta E}{\delta w_{ij}} 
& = \sum_k (a_k-t_k) \frac{\delta}{w_{ij}} g(z_k) &  \\
& = \sum_k (a_k-t_k) g'(z_k) \frac{\delta}{\delta w_{ij}} z_k & (2)
\end{align}$$

For $z_k$ we have

$$\begin{align}
z_k & = \sum_j a_j w_{jk} 
 = \sum_j f(z_j) w_{jk} 
%% & = \sum_j f(\sum_i z_i w_{ij}) w_{jk}
\end{align}$$

This shows all the weights with an effect on $a_k$.

Note in the picture that $w_{ij}$ has an effect only on
the $a_j$ it connects to; in the derivative we can ignore 
the other $j$ terms in the sums.

$$\begin{align}
\frac{\delta z_k}{\delta w_{ij}}
& = w_{jk} \frac{\delta f(z_j)}{\delta w_{ij}} \\
& = w_{jk} f'(z_j) \frac{\delta z_j}{\delta w_{ij}} \\
& = w_{jk} f'(z_j) \frac{\delta}{\delta w_{ij}} \sum_i x_i w_{ij} \\
& = w_{jk} f'(z_j) x_i 
\end{align}$$



Now we have the derivative for $z_k$; we can go back to (2)
and use our $d_k$ from (1):

$$ \begin{align}
\frac{\delta E}{\delta w_{ij}} 
& = \sum_k (a_k-t_k) g'_k(z_k) w_{jk} f'(z_j) x_i \\
& = f'(z_j) x_i \sum_k (a_k-t_k) g'_k(z_k) w_{jk} \\
& =  x_i f'(z_j) \sum_k d_k w_{jk}
\end{align}$$

Let us also define a $d_j$ which is the error backpropagated to layer $j$:

$$\begin{align}
d_j & = g'_j(z_j) \sum_k d_k w_{jk} \\
\frac{\delta E}{\delta w_{ij}} & = d_j x_i \\
w_{ij} & \leftarrow w_{ij} - l d_j x_i 
\end{align}$$

Now we can define our learning rule for any 
activation functions $f$ and
$g$:

$$ \begin{align}
d_k & = (a_k-t_k) g'(z_k) \\
d_j & = f'(z_j) \sum_k d_k w_{jk} \\
w_{jk} & \leftarrow w_{jk} - l d_k a_j \\
w_{ij} & \leftarrow w_{ij} - l d_j x_i 
\end{align}$$

Similar procedures work for deep networks with more than two layers:
we can calculate the gradients at any layer by backpropagating
the errors and weigh them with the signal going into the layer.

## Implementation

We will implement the two-layer net using only numpy functions again.

We prepare our Iris data:
- sklearn has the dataset as load_iris()
- standardize the values in each column i.e. subtract the mean
  and divide by std dev; this tends to speed up learning
- we use numpy.eye() to turn the numbered labels into 'dummies' i.e. 1/0 encoding
  
&star; Both our
derivation and the implementation below work for one or more output values.

In [7]:
np.eye(3)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [8]:
import numpy as np
from sklearn.datasets import load_iris
np.random.seed(1)

iris = load_iris()
X = iris.data
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = iris.target
y = np.eye(len(set(y)))[y]

print(y[:3,])
print(y[-3:,])

[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]
[[0. 0. 1.]
 [0. 0. 1.]
 [0. 0. 1.]]


A little helper function to print the names an shapes of variables, very useful to understand the code.

In [9]:
def shapes(names,values):
    names = names.split(',')
    for i in range(len(names)): print(names[i]+':',values[i].shape)

Our activation function is still sigmoid. 

&star; For single-label classification the softmax function would be
more appropriate for the output layer activation
since its output values sum to one and therefore allow 
for interpreting them as propabilities. The sigmoid function
treats the output values independently and
can be used for multi-label classification. We already have
our gradients for the sigmoid, so we simplify the process and
keep this function for both hidden layer and output layer activation.
In terms of performance there is not much difference on this dataset.

In [10]:
def f(x):
    return 1. / (1. + np.exp(-x))

Our error function as above:

In [11]:
def err(a, y):
    return (np.sum(a - y)**2) / 2

Another little helper function for the accuracy, assuming the highest
activation determines the predicted class:

In [12]:
def acc(a, y):
    return np.sum(np.argmax(a, axis=1) == np.argmax(y, axis=1)) / len(y)

&star; Note that errors and accuracy as defined above can diverge
in the printouts below; consider e.g. output activations
[0.1, 0.8, 0.1] and [0.7, 0.8, 0.7] which have the same accuracy but very different errors.

The implementation is mostly straight-forward and very close to the notation above.

- we set the random seed to make sure that 
  - the results are reproducible
  - any differences in results is not due to random numbers, but to changes in data
    or parameters
- there is some transposing for the np.dot() products
- we return the trained weights so we can later use them on a separate test set

In [13]:
def nn2(X, y, l=0.01, epochs=100, h=10, rs=1):
    np.random.seed(rs)
    Wij = np.random.rand(h, X.shape[1]) - 0.5
    Wjk = np.random.rand(y.shape[1], h) - 0.5
    for ep in range(epochs):
        zj = np.dot(Wij, X.T)
        aj = f(zj)
        zk = np.dot(Wjk, aj)
        ak = f(zk)
        dk = (ak - y.T) * ak * (1 - ak)
        dj = aj * (1 - aj) * np.dot(dk.T, Wjk).T
        Wij += np.dot(-l * dj, X)
        Wjk += np.dot(-l * dk, aj.T)
        if ep % (epochs/10) == 0: 
            print('ep: %3d  err: %9.3f  acc: %4.2f' % 
                  (ep, err(ak.T, y), acc(ak.T, y),))
    shapes('X,y,Wij,Wjk,aj,ak',(X,y,Wij,Wjk,aj,ak))
    return Wij, Wjk

In [14]:
Wij, Wjk = nn2(X, y)

ep:   0  err:  1269.981  acc: 0.66
ep:  10  err:   118.868  acc: 0.67
ep:  20  err:    91.760  acc: 0.75
ep:  30  err:    72.672  acc: 0.79
ep:  40  err:    54.419  acc: 0.83
ep:  50  err:    41.090  acc: 0.85
ep:  60  err:    32.024  acc: 0.86
ep:  70  err:    25.770  acc: 0.87
ep:  80  err:    21.272  acc: 0.87
ep:  90  err:    17.890  acc: 0.89
X: (150, 4)
y: (150, 3)
Wij: (10, 4)
Wjk: (3, 10)
aj: (10, 150)
ak: (3, 150)


Let's try a few values for the number of units in the hidden layer:

In [15]:
Wij, Wjk = nn2(X, y, h=20)

ep:   0  err:  6693.564  acc: 0.25
ep:  10  err:    59.681  acc: 0.73
ep:  20  err:    61.872  acc: 0.83
ep:  30  err:    47.110  acc: 0.85
ep:  40  err:    33.814  acc: 0.86
ep:  50  err:    24.580  acc: 0.87
ep:  60  err:    18.371  acc: 0.87
ep:  70  err:    14.109  acc: 0.88
ep:  80  err:    11.097  acc: 0.88
ep:  90  err:     8.911  acc: 0.89
X: (150, 4)
y: (150, 3)
Wij: (20, 4)
Wjk: (3, 20)
aj: (20, 150)
ak: (3, 150)


In [16]:
Wij, Wjk = nn2(X, y, h=200)

ep:   0  err:  4895.415  acc: 0.22
ep:  10  err:     2.182  acc: 0.67
ep:  20  err:  1284.618  acc: 0.85
ep:  30  err:   687.279  acc: 0.86
ep:  40  err:   317.415  acc: 0.78
ep:  50  err:   161.987  acc: 0.83
ep:  60  err:   149.245  acc: 0.85
ep:  70  err:   137.755  acc: 0.85
ep:  80  err:   130.689  acc: 0.86
ep:  90  err:   126.757  acc: 0.86
X: (150, 4)
y: (150, 3)
Wij: (200, 4)
Wjk: (3, 200)
aj: (200, 150)
ak: (3, 150)


## The Bank dataset

Let us now apply our code to a large dataset from the UCI Machine Learning website:

https://archive.ics.uci.edu/ml/datasets/bank+marketing

The data is fairly recent (2012) and "related with direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe a term deposit." 

Download the zip archive and extract bank.csv, not bank-full.csv.

This dataset is not as small and easy to overview as the Iris dataset. 
We use Pandas for importing and preparing the data:

In [17]:
import pandas as pd
df = pd.read_csv('bank.csv', delimiter=';').sample(frac=1)
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
3433,58,management,married,secondary,no,139,no,no,cellular,27,may,188,1,161,2,other,no
1800,44,entrepreneur,married,tertiary,no,0,yes,no,unknown,9,jul,300,1,-1,0,unknown,no
1686,47,technician,married,secondary,no,302,yes,no,unknown,20,jun,89,3,-1,0,unknown,no
1576,48,management,married,secondary,no,117,yes,no,cellular,16,apr,635,1,-1,0,unknown,no
4511,46,blue-collar,married,secondary,no,668,yes,no,unknown,15,may,1263,2,-1,0,unknown,yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
690,55,technician,single,secondary,no,3339,yes,no,unknown,2,jun,63,1,-1,0,unknown,no
3476,33,blue-collar,married,secondary,no,625,yes,no,unknown,28,may,410,1,-1,0,unknown,no
3114,28,services,single,secondary,no,94,yes,no,cellular,16,apr,300,1,-1,0,unknown,no
3106,27,services,divorced,secondary,no,10,yes,no,cellular,22,jul,527,1,-1,0,unknown,no


Pandas offers a lot of useful functions for data preparation and analysis, such as 

| | |
|---|:---|
| df.sample(frac=1)       |shuffles and avoids problems with sorted data |
| df.describe()     |Basic statistics for numeric columns                                                     |
| df.colname.value_counts() |frequency of values, esp. non-numeric columns                      |
| df[0].replace() | eg. replace(['low', 'med', 'high'], [1, 2, 3]) |

Take a look at https://pandas.pydata.org/pandas-docs/stable/index.html

The value counts for the target variable reveal that the
dataset is highly unbalanced:

In [18]:
df.y.value_counts()

no     4000
yes     521
Name: y, dtype: int64

To simplify our analysis we
re-sample the observations.

In [19]:
minobs = min(df.y.value_counts().values)
df = df.groupby('y').sample(n=minobs).sample(frac=1)

df.y.value_counts()

yes    521
no     521
Name: y, dtype: int64

We map the yes/no in the target variable to 'dummies' i.e. [1, 0]
or [0, 1]:

In [20]:
y = np.asarray(pd.get_dummies(df.y))
y

array([[0, 1],
       [1, 0],
       [1, 0],
       ...,
       [0, 1],
       [0, 1],
       [1, 0]], dtype=uint8)

We drop the target column and get dummy variables for the 
remaining non-numeric columns:

In [21]:
df = df.drop(columns=['y'])
df = pd.get_dummies(df)
df

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,job_admin.,job_blue-collar,job_entrepreneur,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
2376,35,152,2,563,1,-1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
532,34,663,20,111,1,-1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
623,40,1509,12,333,1,58,3,0,0,0,...,0,0,1,0,0,0,1,0,0,0
3310,29,2893,3,250,1,-1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
1151,44,205,3,289,1,-1,0,1,0,0,...,0,0,0,1,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1303,42,1519,19,230,1,92,1,0,0,0,...,0,0,0,0,1,0,0,0,1,0
4024,48,2330,4,15,1,-1,0,0,1,0,...,1,0,0,0,0,0,0,0,0,1
1946,42,257,12,955,2,-1,0,0,0,0,...,0,0,1,0,0,0,0,0,0,1
1763,57,2887,21,819,10,-1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,1


Since the values in the numeric columns differ in magnitude we use standard
scaling:

In [22]:
X = (df-df.mean())/df.std()
X

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,job_admin.,job_blue-collar,job_entrepreneur,...,month_jun,month_mar,month_may,month_nov,month_oct,month_sep,poutcome_failure,poutcome_other,poutcome_success,poutcome_unknown
2376,-0.560654,-0.449271,-1.677993,0.501248,-0.559083,-0.488684,-0.420425,-0.357188,-0.465920,-0.169116,...,2.924281,-0.156712,-0.563041,-0.309188,-0.217235,-0.136217,-0.379025,-0.231383,-0.307322,0.597016
532,-0.644833,-0.285764,0.526113,-0.792744,-0.559083,-0.488684,-0.420425,2.796961,-0.465920,-0.169116,...,-0.341636,-0.156712,-0.563041,-0.309188,-0.217235,-0.136217,-0.379025,-0.231383,-0.307322,0.597016
623,-0.139760,-0.015064,-0.453489,-0.157199,-0.559083,0.025739,1.090205,-0.357188,-0.465920,-0.169116,...,-0.341636,-0.156712,1.774365,-0.309188,-0.217235,-0.136217,2.635816,-0.231383,-0.307322,-1.673389
3310,-1.065727,0.427782,-1.555542,-0.394813,-0.559083,-0.488684,-0.420425,-0.357188,-0.465920,-0.169116,...,2.924281,-0.156712,-0.563041,-0.309188,-0.217235,-0.136217,-0.379025,-0.231383,-0.307322,0.597016
1151,0.196956,-0.432313,-1.555542,-0.283163,-0.559083,-0.488684,-0.420425,2.796961,-0.465920,-0.169116,...,-0.341636,-0.156712,-0.563041,3.231179,-0.217235,-0.136217,-0.379025,-0.231383,-0.307322,0.597016
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1303,0.028598,-0.011865,0.403663,-0.452069,-0.559083,0.322186,0.083118,-0.357188,-0.465920,-0.169116,...,-0.341636,-0.156712,-0.563041,-0.309188,4.598902,-0.136217,-0.379025,-0.231383,3.250789,-1.673389
4024,0.533671,0.247636,-1.433092,-1.067574,-0.559083,-0.488684,-0.420425,-0.357188,2.144233,-0.169116,...,2.924281,-0.156712,-0.563041,-0.309188,-0.217235,-0.136217,-0.379025,-0.231383,-0.307322,0.597016
1946,0.028598,-0.415674,-0.453489,1.623470,-0.182749,-0.488684,-0.420425,-0.357188,-0.465920,-0.169116,...,-0.341636,-0.156712,1.774365,-0.309188,-0.217235,-0.136217,-0.379025,-0.231383,-0.307322,0.597016
1763,1.291281,0.425862,0.648564,1.234128,2.827921,-0.488684,-0.420425,-0.357188,2.144233,-0.169116,...,-0.341636,-0.156712,-0.563041,-0.309188,-0.217235,-0.136217,-0.379025,-0.231383,-0.307322,0.597016


Now we are ready to train for predicting 
the success or failure of the marketing campaigns:

In [23]:
W1, W2 = nn2(X, y)

ep:   0  err:  1832.234  acc: 0.50
ep:  10  err:     2.574  acc: 0.79
ep:  20  err:     0.008  acc: 0.83
ep:  30  err:     0.009  acc: 0.84
ep:  40  err:     0.003  acc: 0.85
ep:  50  err:     0.000  acc: 0.85
ep:  60  err:     0.001  acc: 0.85
ep:  70  err:     0.005  acc: 0.86
ep:  80  err:     0.008  acc: 0.87
ep:  90  err:     0.008  acc: 0.87
X: (1042, 51)
y: (1042, 2)
Wij: (10, 51)
Wjk: (2, 10)
aj: (10, 1042)
ak: (2, 1042)


The performance on the training data seems promising, but it does not
tell us much about how the net will perform on data not seen during training.

## Training and Testing

In theory, feed-forward networks with a single hidden layer and a non-linear
activation function  are universal 
function approximators: 
by increasing the number of units in the hidden layer we can approximate any function to any degree of non-zero error. The function is
defined by a set of data points, the training set.

However, there are some practical considerations:

- A given number of hidden units corresponds to a certain minimum error
- This error can only be realized with the corresponding optimal weights
- The time the learning method takes to converge to the optimal weights can
  be prohibitive
- The learning method may not actually arrive at the optimal weights
- The error will be different on data not seen during training

In machine learning we are not so much interested in great learning success on 
the training set. 
We do not need to predict or estimate data in the training set -- those
values are already known.
The real goal is to apply the learned 'concepts' to data not 
yet seen.  

Let us now separate our test set from the training set.

In [24]:
from sklearn.model_selection import train_test_split

Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, 
                                      shuffle=True, random_state=1)

We define a function that only applies the feedforward steps:

In [25]:
def ffwd(X, y, Wij, Wjk):
    if y.ndim==1: y = np.reshape(np.ravel(y), (len(y),1))
    zj = np.dot(Wij, X.T)
    aj = f(zj)
    zk = np.dot(Wjk, aj)
    ak = f(zk)
    return acc(ak.T, y)

Now we can train the network on the training data and then use the weights on the test data.

In [26]:
Wij, Wjk = nn2(Xtr, ytr)
ffwd(Xte, yte, Wij, Wjk)

ep:   0  err:   967.371  acc: 0.51
ep:  10  err:     1.046  acc: 0.74
ep:  20  err:     0.030  acc: 0.83
ep:  30  err:     0.004  acc: 0.84
ep:  40  err:     0.021  acc: 0.86
ep:  50  err:     0.048  acc: 0.87
ep:  60  err:     0.074  acc: 0.87
ep:  70  err:     0.096  acc: 0.88
ep:  80  err:     0.104  acc: 0.88
ep:  90  err:     0.097  acc: 0.89
X: (729, 51)
y: (729, 2)
Wij: (10, 51)
Wjk: (2, 10)
aj: (10, 729)
ak: (2, 729)


0.8019169329073482

As expected, the accuracy on the test set is worse than
on the training set.

## Overfitting

By increasing the number of hidden units we can improve the performance on the 
training set; however, this can lead to 'overfitting': the weights are
adapted too much to the training data, and will not generalize to the
unseen test data.

In [27]:
Wij, Wjk = nn2(Xtr, ytr, epochs=500, h=50)
ffwd(Xte, yte, Wij, Wjk)

ep:   0  err: 25977.461  acc: 0.49
ep:  50  err:  1807.182  acc: 0.88
ep: 100  err:    19.023  acc: 0.91
ep: 150  err:     9.783  acc: 0.95
ep: 200  err:     0.001  acc: 0.97
ep: 250  err:     0.000  acc: 0.98
ep: 300  err:     0.037  acc: 0.98
ep: 350  err:     0.021  acc: 0.99
ep: 400  err:     0.010  acc: 0.99
ep: 450  err:     0.005  acc: 0.99
X: (729, 51)
y: (729, 2)
Wij: (50, 51)
Wjk: (2, 50)
aj: (50, 729)
ak: (2, 729)


0.792332268370607

The accuracy on the training set has improved a lot, but on the
test set it is a little worse.

The number of hidden units suitable to avoid overfitting
depends on the application.

There are some rules of thumb, such as: 


In [28]:
def nhid(nin, nout, nsamp):
    return nsamp / (2 * (nin + nout))

nhid(Xtr.shape[1], 1, Xtr.shape[0])

7.009615384615385

The result is similar to our default values:

In [29]:
Wij, Wjk = nn2(Xtr, ytr, h=7)
ffwd(Xte, yte, Wij, Wjk)

ep:   0  err:  6655.140  acc: 0.48
ep:  10  err:     0.994  acc: 0.70
ep:  20  err:     0.424  acc: 0.79
ep:  30  err:     0.009  acc: 0.84
ep:  40  err:     0.012  acc: 0.85
ep:  50  err:     0.068  acc: 0.85
ep:  60  err:     0.138  acc: 0.86
ep:  70  err:     0.179  acc: 0.87
ep:  80  err:     0.175  acc: 0.88
ep:  90  err:     0.142  acc: 0.88
X: (729, 51)
y: (729, 2)
Wij: (7, 51)
Wjk: (2, 7)
aj: (7, 729)
ak: (2, 729)


0.8115015974440895

Of course, different random seeds produce different results: 

In [30]:
Wij, Wjk = nn2(Xtr, ytr, h=7, rs=7)
ffwd(Xte, yte, Wij, Wjk)

ep:   0  err:   135.293  acc: 0.54
ep:  10  err:     0.349  acc: 0.69
ep:  20  err:     0.278  acc: 0.81
ep:  30  err:     0.120  acc: 0.84
ep:  40  err:     0.040  acc: 0.85
ep:  50  err:     0.004  acc: 0.86
ep:  60  err:     0.000  acc: 0.87
ep:  70  err:     0.003  acc: 0.87
ep:  80  err:     0.006  acc: 0.88
ep:  90  err:     0.006  acc: 0.88
X: (729, 51)
y: (729, 2)
Wij: (7, 51)
Wjk: (2, 7)
aj: (7, 729)
ak: (2, 729)


0.7891373801916933