Machine Learning for RPA: Numeric Data

Connectionist machine learning methods can directly deal with numeric input data, most commonly performing one of these functions:

  • predict classes for observations
  • estimate other numeric output values

In our sample application we will learn to predict the default of credit card payments, i.e. classify customers into default vs. non-default.

UCI ML Credit Card Defaults Dataset

The UCI site provides a dataset on credit card payments and defaults in Taiwan from 2005. We can use these data to construct reasonably realistic synthetic training data for our robot application.

The following notebook code uses the shell escape ! to run a command line statement, in this case wget for downloading the dataset. Note the option -nc which avoids repeated downloads when the file is already present.

In [107]:
!wget -nc https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls
File ‘default of credit card clients.xls’ already there; not retrieving.

The Pandas package can read various formats of data files, including JSON, CSV, and of course XLS. We do not need any proprietary software to read the data file.

The first line contains additional headers which we skip with the skiprows option.

There is a small inconsistency in naming the PAY columns which we correct using the rename() function.

Other than that the data looks good, which we can check here using the head() function:

  • all column headers are valid variable names, except for the last, which we have renamed
  • all data are numeric, at first glance they are all integer
  • no immediately obvious missing values or other problems
  • the values cover a wide range in magnitude; we will need to address that
In [108]:
import pandas as pd
import numpy as np
np.set_printoptions(formatter={'all':lambda x: '%6.2f' % x})

df = pd.read_excel('default of credit card clients.xls', skiprows=(1))
df = df.rename(columns={'default payment next month':'DEF', 'PAY_0':'PAY_1'})
df.head()
Out[108]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 DEF
0 1 20000 2 2 1 24 2 2 -1 -1 ... 0 0 0 0 689 0 0 0 0 1
1 2 120000 2 2 2 26 -1 2 0 0 ... 3272 3455 3261 0 1000 1000 1000 0 2000 1
2 3 90000 2 2 2 34 0 0 0 0 ... 14331 14948 15549 1518 1500 1000 1000 1000 5000 0
3 4 50000 2 2 1 37 0 0 0 0 ... 28314 28959 29547 2000 2019 1200 1100 1069 1000 0
4 5 50000 1 2 1 57 -1 0 -1 0 ... 20940 19146 19131 2000 36681 10000 9000 689 679 0

5 rows × 25 columns

If the Pandas package is missing from your system then you can install it on the command line:

python3 -m pip install pandas --user

Imbalance in the Number of Observations per Class

Many training datasets for classification suffer from a severe imbalance in the number of observations per class. This must be addressed in some manner, otherwise the net can simply learn to predict the most frequent class and still achieve a high accuracy.

Among the various methods to deal with this problem we choose a simple approach:

  • choose a number of the most frequent classes to keep (here: both)
  • determine the minimum number of observations among these classes
  • draw a sample of this size from each class

Pandas dataframes have some nifty group-by and sample functions that allow us to draw samples from the groups formed by the column values. In this fashion we get equal numbers of observations for all classes, which makes the learning performance much easier to interpret.

The value count reveals a strong imbalance in the number of observations per default class:

In [109]:
df.value_counts('DEF')
Out[109]:
DEF
0    23364
1     6636
dtype: int64
In [110]:
minobs = min(df.value_counts('DEF').values)
print(minobs)
6636

We re-sample the data to make our interpretations of the machine learning result simpler:

In [111]:
df = df.groupby('DEF').sample(n=minobs, random_state=1).sample(frac=1, random_state=1)
df.iloc[:5,:]
Out[111]:
ID LIMIT_BAL SEX EDUCATION MARRIAGE AGE PAY_1 PAY_2 PAY_3 PAY_4 ... BILL_AMT4 BILL_AMT5 BILL_AMT6 PAY_AMT1 PAY_AMT2 PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 DEF
2231 2232 230000 2 1 2 49 2 2 -1 -1 ... 0 2119 0 0 1100 0 2119 0 2916 1
6574 6575 200000 1 1 1 38 -2 -2 -2 -2 ... 4115 1315 0 2658 0 4125 1315 0 0 0
13810 13811 40000 1 1 1 47 2 2 2 2 ... 12595 14386 14005 2000 1000 0 2000 0 2000 0
7934 7935 180000 2 2 2 28 1 2 -1 -1 ... 0 22357 21541 0 1600 0 22357 1500 785 1
7277 7278 150000 2 1 1 44 1 2 2 -1 ... 10195 24232 22000 0 0 10195 24232 0 0 1

5 rows × 25 columns

Now we have the same number of defaulting and non-defaulting clients in the dataset:

In [112]:
print(df['DEF'].value_counts())
1    6636
0    6636
Name: DEF, dtype: int64

Since these data are all numeric we can convert them to NumPy arrays which tend to be computationally more efficient than Pandas dataframes, and also somewhat easier to use (when not doing statistics).

We omit the first column (ID) and the last (payment default). To avoid stating all the column names we use the .iloc method of indexing.

In [113]:
X = np.asarray(df.iloc[:,1:-1])

Scaling

Numerical data is much easier to handle in machine learning applications when it is scaled into a common order of magnitude; here we divide all values in each column by the column maximum. Another common choice is centering and scaling, i.e. subtract the mean and divide by the standard deviation.

We will need the maximum values later when we scale additional input in our application, so this parameter will go into the saved model.

In [114]:
Xmax = X.max(0)
In [115]:
X = X / Xmax
#X = (X - X.mean(0)) / X.std(0)

print(X)
[[  0.29   1.00   0.17 ...   0.00   0.00   0.01]
 [  0.25   0.50   0.17 ...   0.00   0.00   0.00]
 [  0.05   0.50   0.17 ...   0.00   0.00   0.00]
 ...
 [  0.54   1.00   0.17 ...   0.00   0.01   0.00]
 [  0.38   0.50   0.17 ...   0.02   0.03   0.02]
 [  0.04   0.50   0.33 ...   0.00   0.01   0.00]]
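
As an aside, scikit-learn also provides ready-made scalers that store their fitted parameters, which could later be pickled along with the model much like Xmax. This is only a sketch; note that MinMaxScaler uses the column minimum and maximum rather than the maximum alone, so the values differ slightly from ours.

In [ ]:
# Sketch of the same idea with scikit-learn scalers; MinMaxScaler maps each
# column to [0, 1] using (x - min) / (max - min), a slightly different
# convention than dividing by the column maximum alone.
from sklearn.preprocessing import MinMaxScaler, StandardScaler

scaler = MinMaxScaler()      # or StandardScaler() for centering and scaling
X_alt = scaler.fit_transform(np.asarray(df.iloc[:, 1:-1]))
print(X_alt[:2])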

Since DEF is a single number in {0,1} for each observation we can let our neural net directly learn this value.

We convert it to a small-integer NumPy array just to have both X and Y in the same internal format.

In [116]:
Y = np.asarray(df['DEF'], dtype='int8')
print(Y)
[  1.00   0.00   0.00 ...   0.00   1.00   0.00]

Feed-Forward Neural Net

The basic neural net structure contains

  • an input layer of $x_i$ values describing an observation, such as a text encoded as bag of words
  • a hidden layer of $h_i$ activations computed from the input layer using weights $U$
  • an output layer of $o_i$ values computed from the hidden layer with weights $V$

The output $o$ is compared to the values $y$ in the training set. The difference $o-y$ is used in the learning algorithm to update the weights.

In the simple feed-forward model with one hidden layer $h$ we compute $h$ from input $x$ and output $o$ from $h$:

\begin{align} h & = f(U x ) \\ o & = g(V h) \end{align}

with $f()$ and $g()$ some non-linear activation functions, such as tanh, sigmoid, or relu.

Input $x$ is a vector with $n$ elements, a numeric representation of an observation.

With an additional input $x_{n+1} = 1$ we can omit bias $b$ and shorten the notation.
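
As an illustration of this bias trick (a sketch only, not used further below), we simply append a constant column of ones to the input, so that the last column of $U$ takes the role of $b$:

In [ ]:
# Sketch of the bias trick: append a constant 1 to every observation so that
# the last column of U acts as the bias vector b (not used further below).
X1 = np.hstack([X, np.ones((X.shape[0], 1))])
print(X.shape, '->', X1.shape)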

The weight matrices $U$ and $V$ are the trainable parameters. The optimization (learning) aims at approximating $o$ to the training data $y$ i.e. minimizing the errors $y-o$. A cost function is defined based on the errors. A gradient descent algorithm is commonly used for optimization.

Not a Magic Black Box

Let's see how this works on a very small subset of the data.

We will only use the payment status and take the x and y values from the first n observations:

In [117]:
n = 10
x = np.asarray(df[['PAY_6','PAY_5','PAY_4','PAY_3','PAY_2','PAY_1']][:n])
print(x)
[[ -1.00  -1.00  -1.00  -1.00   2.00   2.00]
 [ -2.00  -2.00  -2.00  -2.00  -2.00  -2.00]
 [  2.00   2.00   2.00   2.00   2.00   2.00]
 [  0.00  -1.00  -1.00  -1.00   2.00   1.00]
 [  2.00  -1.00  -1.00   2.00   2.00   1.00]
 [  0.00   0.00   0.00   0.00   0.00   0.00]
 [ -2.00  -2.00  -2.00  -2.00  -2.00  -2.00]
 [  0.00   0.00   0.00   0.00   0.00   0.00]
 [  0.00   0.00   0.00   0.00   0.00   0.00]
 [ -1.00  -1.00  -2.00  -2.00  -2.00   1.00]]
In [118]:
y = np.asarray(df[['DEF']][:n])[:,0]
print(y)
[  1.00   0.00   0.00   1.00   1.00   0.00   0.00   1.00   0.00   0.00]

Now we set the number of hidden states and compute some random weights U and V:

In [119]:
hid = 5
U = np.random.random((hid, x.shape[1])) * 4 - 2
print(U)
[[ -1.12   1.25  -0.57   1.90   1.26  -0.93]
 [ -0.43  -0.96   1.36   0.81   0.78  -0.69]
 [ -0.82  -2.00   0.60   0.76   0.71  -0.37]
 [  0.78   0.43  -1.16   1.44  -1.17  -1.13]
 [  1.97   0.74  -1.03  -0.33  -0.95   0.18]]
In [120]:
V = np.random.random((1,hid)) * 4 - 2
print(V)
[[  1.47  -1.51   0.48   0.33  -1.24]]

We want to compute f(U x) so we need matrix multiplication and a non-linear function f.

Matrix multiplication can be done with np.dot(), e.g. to get the matrix-vector product for the first observation:

In [121]:
print(np.dot(U, x[0]))
[ -0.80  -0.60   2.13  -6.08  -2.89]

However, NumPy easily lets us process all observations at once if we transpose x.

Note the first column which is identical to the result above:

In [122]:
print(np.dot(U, x.T))
[[ -0.80  -3.57   3.57  -0.99   2.48   0.00  -3.57   0.00   0.00  -6.23]
 [ -0.60  -1.73   1.73  -0.34   1.24   0.00  -1.73   0.00   0.00  -5.20]
 [  2.13   2.26  -2.26   1.68   2.31   0.00   2.26   0.00   0.00  -1.68]
 [ -6.08   1.61  -1.61  -4.18   1.70   0.00   1.61   0.00   0.00  -0.56]
 [ -2.89  -1.15   1.15  -1.10   1.85   0.00  -1.15   0.00   0.00   2.10]]

For our non-linear function we use the sigmoid as plotted below.

☆ This is the classic choice for artificial neural nets; its simple derivative f'(x) = f(x)(1 - f(x)) makes it very attractive for implementation. Sadly it suffers from the vanishing gradient problem: we can see from the plot that as the absolute values of the input go beyond several units the curve becomes very flat, meaning that the gradient goes towards zero. This is a problem for gradient-based optimization. For this reason various other non-linear functions are now more commonly in use, such as the ReLU (rectified linear unit).

In [123]:
import matplotlib.pyplot as plt
a = np.linspace(-5, 5, 1000)
b = 1 / (1 + np.exp(-a) )
plt.figure(figsize=(8, 3))
plt.plot(a, b)
Out[123]:
[<matplotlib.lines.Line2D at 0x7f93c0899e48>]
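
For comparison, the ReLU mentioned above can be plotted in the same way (a sketch; it simply clips negative values to zero, so its gradient is 1 for all positive inputs and does not flatten out):

In [ ]:
# Sketch: the ReLU activation max(0, x), plotted over the same range as the
# sigmoid above.
relu = np.maximum(0, a)
plt.figure(figsize=(8, 3))
plt.plot(a, relu)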

We apply the non-linear function to the result of the matrix multiplication to get the states h of the hidden layer.

We are still computing all observations at once rather than going through a loop, so the column vectors are the states h for each observation in x:

In [124]:
h = 1 / (1 + np.exp(-1 * np.dot(U, x.T)))
print(h)
[[  0.31   0.03   0.97   0.27   0.92   0.50   0.03   0.50   0.50   0.00]
 [  0.35   0.15   0.85   0.42   0.78   0.50   0.15   0.50   0.50   0.01]
 [  0.89   0.91   0.09   0.84   0.91   0.50   0.91   0.50   0.50   0.16]
 [  0.00   0.83   0.17   0.02   0.85   0.50   0.83   0.50   0.50   0.36]
 [  0.05   0.24   0.76   0.25   0.86   0.50   0.24   0.50   0.50   0.89]]

Again we can compute the output for all hidden states in a single statement.

☆ Note that we could have more than one output state so the result is a matrix; however, here there is only one output state, so we get a matrix with one row (and not a vector).

In [125]:
o = 1/(1 + np.exp(-1 * np.dot(V, h)))
print(o)
[[  0.57   0.56   0.33   0.47   0.46   0.44   0.56   0.44   0.44   0.29]]

Clearly random weights do not achieve anything close to the actual values y (unless we are very, very lucky).

In [126]:
print(o-y)
[[ -0.43   0.56   0.33  -0.53  -0.54   0.44   0.56  -0.56   0.44   0.29]]

We sum the squared errors, divide by the number of observations, and call this our cost function:

In [127]:
c = np.sum((o-y)**2)/n
print(c)
0.22738725407473867

By applying some optimization method we can adjust the weights to achieve lower (and maybe even minimal) cost.

  • In neural networks this is commonly done using gradient descent
  • the derivative of the cost function with respect to the weights is computed
  • the weights are adjusted in the direction of lower cost
  • this is done repeatedly, usually for a set number of epochs i.e. runs through the whole data (see the sketch below)

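To make this concrete, here is a minimal sketch of gradient descent (backpropagation) for the tiny one-hidden-layer net above; the learning rate lr and the number of epochs are arbitrary choices, and sig() is just the sigmoid used before.

In [ ]:
# A minimal sketch of gradient descent for the one-hidden-layer net above;
# lr (learning rate) and the number of epochs are arbitrary choices.
def sig(z):
    return 1 / (1 + np.exp(-z))

lr = 1.0
U = np.random.random((hid, x.shape[1])) * 4 - 2
V = np.random.random((1, hid)) * 4 - 2

for epoch in range(5000):
    h = sig(np.dot(U, x.T))                  # hidden layer, shape (hid, n)
    o = sig(np.dot(V, h))                    # output layer, shape (1, n)
    d_o = 2 * (o - y) / n * o * (1 - o)      # cost gradient through output sigmoid
    d_h = np.dot(V.T, d_o) * h * (1 - h)     # propagated back to the hidden layer
    V -= lr * np.dot(d_o, h.T)               # update output weights
    U -= lr * np.dot(d_h, x)                 # update hidden weights

print('cost:', np.sum((o - y)**2) / n)
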
Or, we can simply repeat the random process above until we arrive at a satisfactory result: not a practical approach, but interesting to see.

Given a specific input, number of hidden units, and range of weights, the cost cannot go below a certain threshold.

In [128]:
c = 1
e = 0
while c > 0.1:
    U = np.random.random((hid, x.shape[1])) * 4 - 2
    V = np.random.random((1,hid)) * 4 - 2
    h = 1/(1 + np.exp(-1 * np.dot(U, x.T)))
    o = 1/(1 + np.exp(-1 * np.dot(V, h)))
    c = np.sum((o-y)**2)/n
    e += 1
    
print('y:     ', np.asarray([y])*1.0)
print('o:     ', o)
print('sq.err:', (o-y)**2)
print('cost:  ', c)
print('epochs:', e)
y:      [[  1.00   0.00   0.00   1.00   1.00   0.00   0.00   1.00   0.00   0.00]]
o:      [[  0.96   0.18   0.43   0.94   0.89   0.29   0.18   0.29   0.29   0.21]]
sq.err: [[  0.00   0.03   0.19   0.00   0.01   0.08   0.03   0.51   0.08   0.04]]
cost:   0.09835989425654287
epochs: 24657

Usually in such a situation output values o below 0.5 are taken as 0 and those above 0.5 as 1, which gives us a more practically useful result (payment default or not).

However, depending on the application, the output values can also be interpreted as probabilities.

In [129]:
d = np.asarray([int(v>0.5) for v in o[0]])
print('predict: ', d)
print('actual:  ', y)
print('accuracy:', np.sum(y==d)/n)
predict:  [  1.00   0.00   0.00   1.00   1.00   0.00   0.00   0.00   0.00   0.00]
actual:   [  1.00   0.00   0.00   1.00   1.00   0.00   0.00   1.00   0.00   0.00]
accuracy: 0.9

Deep Learning

Connectionist machine learning experienced a revival in the late 2000s with the advent of deep learning: such models are commonly characterized by

  • several hidden layers instead of just one
  • stochastic gradient descent: only a small part of the training data is used in minibatches to update the weights; running through all the data is called one epoch.
  • more elaborate architectural elements are employed, such as dropout, max-pooling, and even recurrent designs

☆ If you are interested in Deep Learning then take a look at Tensorflow; using the Keras package which is now part of Tensorflow makes it easy to construct elaborate deep learning networks. However, since Tensorflow is somewhat cutting edge it is often tricky to get it to work on your computer, especially when you wish to make use of your graphics card in GPU computing; version problems continue to haunt that software, which is why we stick with the much more stable sklearn package: it is just one pip install away and works flawlessly out of the box on any system. Sadly, this is not true for Tensorflow.
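
Purely as an illustration, a minimal Keras sketch of a network comparable to the MLP we build with sklearn below might look like this (assuming a working Tensorflow installation; the layer sizes here are arbitrary):

In [ ]:
# A sketch only, assuming Tensorflow/Keras is installed; mirrors the kind of
# feed-forward net built with sklearn below, with two hidden layers.
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Dense(100, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.Dense(20, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid'),    # probability of default
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit() would then train in minibatches (stochastic gradient descent),
# e.g. model.fit(X_train, y_train, epochs=50, batch_size=32)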

MLPClassifier

We use a simple feed-forward net as provided by the scikit-learn package, the Multi-Layer-Perceptron Classifier.

We call MLPClassifier() with the max_iter parameter to save computation time here; see the full documentation at https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html for additional parameters.

By default MLPClassifier() uses only one hidden layer with 100 units; however, we can easily change to a deep learning structure, e.g. by adding hidden_layer_sizes=(200,50,) to the parameter list.

We split the data into training and testing set. Common splits are about 90:10 to 70:30.

We also print the number of observations per label value to check for even distribution after splitting.

  • The net adapts its weights on the training set
  • then the weights remain fixed and are applied to the testing set

The scores show the performance on the training and testing data.

On the training data we can often find parameters that achieve accuracy close to 1.0; this is known as overfitting. However, only the performance on the test data, which was not seen during training, is relevant for practical application.

In [130]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

print(X.shape, Y.shape)

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
print('training size:', X_train.shape[0], 
      'testing size:', X_test.shape[0],
      'label counts:', np.unique(y_train, return_counts=True)[1])

clf = MLPClassifier(hidden_layer_sizes=(100,20,),
                    max_iter=50).fit(X_train, y_train)

print('score train:', clf.score(X_train, y_train))
print('score test: ', clf.score(X_test, y_test))
(13272, 23) (13272,)
training size: 10617 testing size: 2655 label counts: [5347.00 5270.00]
score train: 0.7246868230196855
score test:  0.7054613935969868
/home/hugo/.local/lib/python3.6/site-packages/sklearn/neural_network/_multilayer_perceptron.py:617: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)

The accuracy on the test set is the benchmark for how well the net has 'learned' the concepts in the training data i.e. how well it can 'generalize' to new data.

The accuracy on the test set is usually worse than on the training set.

In a more elaborate setting we would further split into training, validation, and test set, since adjusting the parameters is itself an optimization process.

  • training set: the weights are adapted
  • validation set: the performance is evaluated, the parameters are adapted, and training starts again
  • test set: in the final step the performance is recorded without further parameter change

In other words, we are optimizing the weights on the training set and the parameters on the validation set. The test set is used for the final performance on real-world data i.e. data the net has not seen in training, and we have not seen while adjusting our parameters.
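
For example, such a three-way split can be done with two calls to train_test_split (a sketch; the 70:15:15 proportions are an arbitrary choice):

In [ ]:
# Sketch of a train/validation/test split (70:15:15, arbitrary proportions):
# the validation set is used for tuning parameters, the test set only once.
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, Y, test_size=0.3, random_state=1)
X_val, X_te, y_val, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=1)
print(X_tr.shape[0], X_val.shape[0], X_te.shape[0])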

After training the MLPClassifier gives us the choice of output:

  • clf.predict_proba() -- the probability for each class
  • clf.predict() -- the (most likely) numeric label
In [131]:
print('pred prob: ', clf.predict_proba(X[:1]))
print('pred class:', clf.predict(X[:1]))
pred prob:  [[  0.31   0.69]]
pred class: [  1.00]

Persistent ML Model

We save our model in a pickle file so that we can use it again later to process new input. We will also need the Xmax values for scaling new input.

In [132]:
import pickle

pickle.dump((clf, Xmax), open('ccd.pkl', 'wb'))

Check that the model produces identical results when loaded from the pickle file:

In [133]:
clf2, Xmax2 = pickle.load(open('ccd.pkl', 'rb'))

print(clf2.predict_proba(df.iloc[:1,1:-1] /Xmax2 ))
[[  0.31   0.69]]

Misc

Number of cases per class -- observed (Y) and predicted (p):

In [135]:
p = clf.predict(X)
In [136]:
p.shape
Out[136]:
(13272,)
In [137]:
sum(p==1), sum(Y==1), sum(p==0), sum(Y==0)
Out[137]:
(5605, 6636, 7667, 6636)
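
The same comparison can be summarized in a confusion matrix (a sketch using sklearn.metrics; rows are the actual classes, columns the predicted ones):

In [ ]:
# Sketch: confusion matrix of actual (Y) vs predicted (p) classes,
# rows = actual, columns = predicted.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(Y, p))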