Connectionist machine learning methods can directly deal with numeric input data, most commonly performing one of these functions:

- predict classes for observations
- estimate other numeric output values

In our sample application here we will learn to predict the default of credit card payments, i.e. classify customers as default vs. non-default.

The UCI site provides a dataset on credit card payments and defaults in Taiwan from 2005. We can use these data to construct reasonably realistic synthetic training data for our robot application.

The following notebook code uses the shell escape ! to run a command-line statement, in this case wget to download the dataset. Note the option -nc, which avoids repeated downloads when the file is already present.

In [107]:

```
!wget -nc https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls
```

The Pandas package can read various formats of data files, including JSON, CSV, and of course XLS. We do not need any proprietary software to read the data file.

The first line contains additional headers which we skip with the skiprows option.

There is a small inconsistency in naming the PAY columns which we correct using the rename() function.

Other than that the data looks good, which we can check here using the head() function:

- all column headers are valid variable names, except for the last, which we have renamed
- all data are numeric, at first glance they are all integer
- no immediately obvious missing values or other problems
- the values cover a wide range in magnitude, we will need to address that

In [108]:

```
import pandas as pd
import numpy as np
np.set_printoptions(formatter={'all':lambda x: '%6.2f' % x})
df = pd.read_excel('default of credit card clients.xls', skiprows=(1))
df = df.rename(columns={'default payment next month':'DEF', 'PAY_0':'PAY_1'})
df.head()
```

Out[108]:

If the Pandas package is missing from your system then you can install it on the command line:

`python3 -m pip install pandas --user`

Many training datasets for classification suffer from a severe imbalance in the number of observations per class. This must be addressed in some manner, otherwise the net can simply learn to predict the most frequent class and still achieve a high accuracy.

Among the various methods to deal with this problem we choose a simple approach:

- choose the number of most frequent classes
- determine the minimum number of observations for each of the most frequent classes
- draw samples of this size from each class

Pandas dataframes have some nifty group-by and sample functions that allow us to draw
samples from the groups formed by the column values. In this fashion we can
get **equal numbers of observations** for all classes, which makes the learning performance
much easier to interpret.

The value counts reveal a strong imbalance in the number of observations per default class:

In [109]:

```
df.value_counts('DEF')
```

Out[109]:

In [110]:

```
minobs = min(df.value_counts('DEF').values)
print(minobs)
```

We re-sample the data to make our interpretations of the machine learning result simpler:

In [111]:

```
df = df.groupby('DEF').sample(n=minobs, random_state=1).sample(frac=1, random_state=1)
df.iloc[:5,:]
```

Out[111]:

Now we have the same number of defaulting and non-defaulting clients in the dataset:

In [112]:

```
print(df['DEF'].value_counts())
```

Since these data are all numeric we can convert them to NumPy arrays which tend to be computationally more efficient than Pandas dataframes, and also somewhat easier to use (when not doing statistics).

We omit the first column (ID) and the last (payment default). To avoid stating all the column names we use the .iloc method of indexing.

In [113]:

```
X = np.asarray(df.iloc[:,1:-1])
```

Numerical data is much easier to handle in machine learning applications when it is scaled to a common order of magnitude; here we divide the values in each column by the column maximum. Another common choice is centering and scaling, i.e. subtracting the mean and dividing by the standard deviation.

We will need the maximum values later when we scale additional input in our application, so this parameter will go into the saved model.

In [114]:

```
Xmax = X.max(0)
```

In [115]:

```
X = X / Xmax
#X = (X - X.mean(0)) / X.std(0)
print(X)
```

Since DEF is a single number in {0,1} for each observation we can let our neural net directly learn this value.

We convert Y to a small integer NumPy array (int8) just to have both X and Y in the same internal format.

In [116]:

```
Y = np.asarray(df['DEF'], dtype='int8')
print(Y)
```

The basic neural net structure contains

- an input layer of $x_i$ values describing an observation, such as a text encoded as bag of words
- a hidden layer of $h_i$ activations computed from the input layer using weights $U$
- an output layer of $o_i$ values computed from the hidden layer with weights $V$

The output $o$ is compared to the values $y$ in the training set. The difference $o-y$ is used in the learning algorithm to update the weights.

In the simple feed-forward model with one hidden layer $h$ we compute $h$ from input $x$ and output $o$ from $h$:

\begin{align} h & = f(U x) \\ o & = g(V h) \end{align}

with $f()$ and $g()$ some non-linear activation functions, such as tanh, sigmoid, or ReLU.

Input $x$ is a vector with $n$ elements, a numeric representation of an observation.

With an additional input $x_{n+1} = 1$ we can omit the bias $b$ and shorten the notation.

The weight matrices $U$ and $V$ are the trainable parameters. The optimization (learning) aims at bringing $o$ close to the training data $y$, i.e. minimizing the errors $y-o$. A cost function is defined based on these errors, and a gradient descent algorithm is commonly used for optimization.
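
The bias trick mentioned above can be verified in a few lines of NumPy; this is a minimal sketch with made-up weights and inputs:

```python
import numpy as np

# hypothetical small example: weights U, bias b, input x
U = np.array([[1.0, 2.0],
              [3.0, 4.0]])
b = np.array([0.5, -0.5])
x = np.array([1.0, 2.0])

# standard form with an explicit bias term
h1 = np.dot(U, x) + b

# augmented form: append the bias as an extra column of U
# and a constant 1 as an extra input element
U_aug = np.hstack([U, b.reshape(-1, 1)])
x_aug = np.append(x, 1.0)
h2 = np.dot(U_aug, x_aug)

print(np.allclose(h1, h2))  # → True
```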

Let's see how this works on a very small subset of the data.

We will only use the payment status and take the x and y values from the first n observations:

In [117]:

```
n = 10
x = np.asarray(df[['PAY_6','PAY_5','PAY_4','PAY_3','PAY_2','PAY_1']][:n])
print(x)
```

In [118]:

```
y = np.asarray(df[['DEF']][:n])[:,0]
print(y)
```

Now we set the number of hidden states and compute some random weights U and V:

In [119]:

```
hid = 5
U = np.random.random((hid, x.shape[1])) * 4 - 2
print(U)
```

In [120]:

```
V = np.random.random((1,hid)) * 4 - 2
print(V)
```

We want to compute f(U x) so we need matrix multiplication and a non-linear function f.

Matrix multiplication can be done with np.dot(), e.g. to get the matrix-vector product for the first observation:

In [121]:

```
print(np.dot(U, x[0]))
```

However, NumPy easily lets us process all observations at once if we transpose x.

Note the first column which is identical to the result above:

In [122]:

```
print(np.dot(U, x.T))
```

For our non-linear function we use the sigmoid as plotted below.

☆ This is the classic choice for artificial neural nets; its simple derivative f'(x) = f(x)(1 - f(x)) makes it very attractive for implementation. Sadly it suffers from the vanishing gradient problem: we can see from the plot that as the absolute value of the input goes beyond a few units the curve becomes very flat, meaning that the gradient goes towards zero. This is a problem for gradient-based optimization. For this reason various other non-linear functions are now more commonly in use, such as the ReLU (rectified linear unit).
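
The derivative stated above can be checked numerically; a small sketch, assuming only NumPy:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# derivative expressed via the function value itself: f'(z) = f(z)(1 - f(z))
def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)

# the gradient peaks at 0.25 for z = 0 and vanishes for large |z|
for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, sigmoid_grad(z))
```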

In [123]:

```
import matplotlib.pyplot as plt
a = np.linspace(-5, 5, 1000)
b = 1 / (1 + np.exp(-a) )
plt.figure(figsize=(8, 3))
plt.plot(a, b)
```

Out[123]:

We apply the non-linear function to the result of the matrix multiplication to get the states h of the hidden layer.

We are still computing all observations at once rather than going through a loop, so the column vectors are the states h for each observation in x:

In [124]:

```
h = 1 / (1 + np.exp(-1 * np.dot(U, x.T)))
print(h)
```

Again we can compute the output for all hidden states in a single statement.

☆ Note that we could have more than one output state so the result is a matrix; however, here there is only one output state, so we get a matrix with one row (and not a vector).

In [125]:

```
o = 1/(1 + np.exp(-1 * np.dot(V, h)))
print(o)
```

Clearly random weights do not achieve anything close to the actual values y (unless we are very, very lucky).

In [126]:

```
print(o-y)
```

We average the squared errors and call the result the **cost function** (here: the mean squared error):

In [127]:

```
c = np.sum((o-y)**2)/n
print(c)
```

By applying some optimization method we can adjust the weights to achieve lower (and maybe even minimal) cost.

- In neural networks this is commonly done using gradient descent
- the derivative of the cost function with respect to the weights is computed
- the weights are adjusted in the direction of lower cost.
- this is done repeatedly, usually for a set number of epochs i.e. runs through the whole data
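
To make these update steps concrete, here is a minimal gradient-descent sketch for the same one-hidden-layer sigmoid net, on made-up toy data; the learning rate and epoch count are assumed values, not tuned:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical toy data standing in for the payment-status subset:
# 10 observations, 6 features, binary targets
x = rng.integers(-1, 3, size=(10, 6)).astype(float)
y = rng.integers(0, 2, size=10).astype(float)
n = len(y)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

hid = 5
U = rng.random((hid, x.shape[1])) * 4 - 2
V = rng.random((1, hid)) * 4 - 2
lr = 0.5                                 # learning rate (assumed value)

h = sigmoid(np.dot(U, x.T))
o = sigmoid(np.dot(V, h))
cost0 = np.sum((o - y)**2) / n           # cost with the initial random weights

for epoch in range(500):
    h = sigmoid(np.dot(U, x.T))          # forward pass
    o = sigmoid(np.dot(V, h))
    # backward pass: chain rule, using f'(z) = f(z)(1 - f(z))
    grad_o = (o - y) * o * (1 - o)
    grad_V = np.dot(grad_o, h.T)
    grad_h = np.dot(V.T, grad_o) * h * (1 - h)
    grad_U = np.dot(grad_h, x)
    V -= lr * grad_V / n                 # step towards lower cost
    U -= lr * grad_U / n

cost = np.sum((o - y)**2) / n
print('cost before:', cost0, 'after:', cost)
```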

Or, we can simply repeat the random process above until we arrive at a satisfactory result: not a practical approach, but interesting to see.

Given a specific input, number of hidden units, and range of weights, the cost cannot go below a certain threshold.

In [128]:

```
c = 1
e = 0
while c > 0.1:
    U = np.random.random((hid, x.shape[1])) * 4 - 2
    V = np.random.random((1, hid)) * 4 - 2
    h = 1 / (1 + np.exp(-1 * np.dot(U, x.T)))
    o = 1 / (1 + np.exp(-1 * np.dot(V, h)))
    c = np.sum((o - y)**2) / n
    e += 1
print('y: ', np.asarray([y]) * 1.0)
print('o: ', o)
print('sq.err:', (o - y)**2)
print('cost: ', c)
print('epochs:', e)
```

Usually in such a situation output values o below 0.5 are taken as 0 and values above 0.5 as 1, which gives us a more practically useful result (payment default or not).

However, depending on the application, the output values can also be interpreted as probabilities.

In [129]:

```
d = np.asarray([int(v>0.5) for v in o[0]])
print('predict: ', d)
print('actual: ', y)
print('accuracy:', np.sum(y==d)/n)
```

Connectionist machine learning experienced a revival in the late 2000s with the advent of deep learning: such models are commonly characterized by

- several hidden layers instead of just one
- stochastic gradient descent: only a small part of the training data is used in minibatches to update the weights; running through all the data once is called an **epoch**
- more elaborate network architectures, such as dropout, max-pooling, and even recurrent designs
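
The minibatch idea can be sketched in a few lines; the data and the batch size here are assumed values for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data: 100 observations with 6 features each
Xd = rng.random((100, 6))
batch_size = 16
n = Xd.shape[0]

# one epoch of minibatch iteration: shuffle, then visit every
# observation exactly once in chunks of batch_size
perm = rng.permutation(n)
batches = []
for start in range(0, n, batch_size):
    idx = perm[start:start + batch_size]
    batches.append(idx)
    # ...a real SGD step would compute gradients on Xd[idx]
    #    and update the weights here...

print('batch sizes:', [len(b) for b in batches])
# → batch sizes: [16, 16, 16, 16, 16, 16, 4]
```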

☆ If you are interested in deep learning then take a look at Tensorflow; the Keras package, which is now part of Tensorflow, makes it easy to construct elaborate deep learning networks. However, since Tensorflow is somewhat cutting edge it is often tricky to get it to work on your computer, especially when you wish to make use of your graphics card for GPU computing; version problems continue to haunt that software, which is why we stick with the much more stable sklearn package: it is just one pip install away and works out of the box on virtually any system. Sadly, the same cannot be said for Tensorflow.

We use a simple feed-forward net as provided by the scikit-learn package, the Multi-Layer Perceptron classifier (MLPClassifier).

We call MLPClassifier() with the max_iter parameter to save computation time here; see the full documentation at https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html for additional parameters.

By default MLPClassifier() uses only one hidden layer with 100 units, however we can easily change to a deep learning structure, e.g. by adding hidden_layer_sizes=(200,50,) to the parameter list.

We split the data into **training** and **testing** set. Common splits are
about 90:10 to 70:30.

We also print the number of observations per label value to check for even distribution after splitting.

- The net adapts its weights on the training set
- then the weights remain fixed and are applied to the testing set

The scores show the performance on the training and testing data.

For the training phase we can find parameters that achieve accuracy close to 1.0, a phenomenon called **overfitting**. However, only the performance on the test data, which was not seen during training, is relevant for practical application.

In [130]:

```
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
print(X.shape, Y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
print('training size:', X_train.shape[0],
      'testing size:', X_test.shape[0],
      'label counts:', np.unique(y_train, return_counts=True)[1])
clf = MLPClassifier(hidden_layer_sizes=(100, 20,),
                    max_iter=50).fit(X_train, y_train)
print('score train:', clf.score(X_train, y_train))
print('score test: ', clf.score(X_test, y_test))
```

The accuracy on the test set is the benchmark for how well the net has 'learned' the concepts in the training data i.e. how well it can 'generalize' to new data.

The accuracy on the test set is usually worse than on the training set.

In a more elaborate setting we would further split into training, validation, and test set, since adjusting the parameters is itself an optimization process.

- training set: the weights are adapted
- validation set: the performance is evaluated, the parameters are adapted, and training starts again
- test set: in the **final step** the performance is recorded without further parameter change

In other words, we are optimizing the weights on the training set and the parameters
on the validation set. The test set is used for the final performance on real-world
data i.e. data the net has not seen in training, and *we have not seen* while adjusting
our parameters.
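
Such a three-way split can be produced by calling train_test_split twice; a sketch with made-up data, splitting 60:20:20 overall:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical data for illustration
Xd = np.random.random((1000, 6))
Yd = np.random.randint(0, 2, size=1000)

# first split off the test set, then split the remainder
# into training and validation sets
X_rest, X_test, y_rest, y_test = train_test_split(
    Xd, Yd, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1)

print(len(X_train), len(X_val), len(X_test))  # → 600 200 200
```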

After training the MLPClassifier gives us the choice of output:

- clf.predict_proba() -- the probability for each class
- clf.predict() -- the (most likely) numeric label

In [131]:

```
print('pred prob: ', clf.predict_proba(X[:1]))
print('pred class:', clf.predict(X[:1]))
```

We save our model in a pickle file so that we can use it again later to process new input. We will also need the Xmax values for scaling new input.

In [132]:

```
import pickle
pickle.dump((clf, Xmax), open('ccd.pkl', 'wb'))
```

Check that the model produces identical results when loaded from the pickle file:

In [133]:

```
clf2, Xmax2 = pickle.load(open('ccd.pkl', 'rb'))
print(clf2.predict_proba(df.iloc[:1,1:-1] / Xmax2))
```

Number of cases per class -- observed (Y) and predicted (p):

In [135]:

```
p = clf.predict(X)
```

In [136]:

```
p.shape
```

Out[136]:

In [137]:

```
sum(p==1), sum(Y==1), sum(p==0), sum(Y==0)
```

Out[137]:
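
Beyond these raw counts, a confusion matrix shows which actual class is mapped to which predicted class; scikit-learn's confusion_matrix computes this directly. A small sketch with hypothetical labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical observed and predicted labels; in the notebook
# these would be Y and p from the cells above
y_obs  = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# rows are the actual classes, columns the predicted classes
print(confusion_matrix(y_obs, y_pred))
# → [[2 1]
#    [1 2]]
```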