Machine Learning for RPA: Numeric Data
Connectionist machine learning methods can deal directly with numeric input data, most commonly performing one of these functions:
- predict classes for observations (classification)
- estimate other numeric output values (regression)
In our sample application here we will learn to predict the default of credit card payments, i.e. classify customers as default vs. non-default.
UCI ML Credit Card Defaults Dataset
The UCI site provides a dataset on credit card payments and defaults in Taiwan from 2005. We can use these data to construct reasonably realistic synthetic training data for our robot application.
The following notebook code uses the shell escape ! to run a command-line statement, in this case wget to download the dataset. Note the option -nc, which avoids repeated downloads when the file is already present.
!wget -nc https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls
File ‘default of credit card clients.xls’ already there; not retrieving.
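If wget is not available, the same download-once behavior can be sketched in plain Python using the standard library:

import os
import urllib.request

url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/00350/'
       'default%20of%20credit%20card%20clients.xls')
fname = 'default of credit card clients.xls'
if not os.path.exists(fname):              # mimic the -nc option
    urllib.request.urlretrieve(url, fname)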
The Pandas package can read various formats of data files, including JSON, CSV, and of course XLS. We do not need any proprietary software to read the data file.
The first line contains additional headers which we skip with the skiprows option.
There is a small inconsistency in naming the PAY columns which we correct using the rename() function.
Other than that the data looks good, which we can check here using the head() function:
- all column headers are valid variable names, except for the last, which we have renamed
- all data are numeric; at first glance they are all integers
- no immediately obvious missing values or other problems
- the values cover a wide range in magnitude, which we will need to address
import pandas as pd
import numpy as np

# print numpy arrays with two decimals for readability
np.set_printoptions(formatter={'all': lambda x: '%6.2f' % x})

# skip the extra header line and fix the inconsistent column names
df = pd.read_excel('default of credit card clients.xls', skiprows=1)
df = df.rename(columns={'default payment next month': 'DEF', 'PAY_0': 'PAY_1'})
df.head()
|   | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
| 2 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
| 3 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
| 4 | 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |

5 rows × 25 columns
If the Pandas package is missing from your system then you can install it on the command line:
python3 -m pip install pandas --user
Imbalance in the Number of Observations per Class
Many training datasets for classification suffer from a severe imbalance in the number of observations per class. This must be addressed in some manner, otherwise the net can simply learn to predict the most frequent class and still achieve a high accuracy.
Among the various methods to deal with this problem we choose a simple approach:
- choose the most frequent classes to keep (here: both)
- determine the minimum number of observations among those classes
- draw a sample of that size from each class
Pandas dataframes have some nifty group-by and sample functions that allow us to draw samples from the groups formed by the column values. In this fashion we get equal numbers of observations for all classes, which makes the learning performance much easier to interpret.
The value counts reveal a strong imbalance in the number of observations per default class:
df.value_counts('DEF')
DEF
0    23364
1     6636
dtype: int64
minobs = min(df.value_counts('DEF').values)    # size of the smallest class
print(minobs)
6636
We re-sample the data to make our interpretations of the machine learning result simpler:
# draw minobs rows from each class, then shuffle the combined result
df = df.groupby('DEF').sample(n=minobs, random_state=1).sample(frac=1, random_state=1)
df.iloc[:5,:]
|   | ID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2231 | 2232 | 230000 | 2 | 1 | 2 | 49 | 2 | 2 | -1 | -1 | ... | 0 | 2119 | 0 | 0 | 1100 | 0 | 2119 | 0 | 2916 | 1 |
| 6574 | 6575 | 200000 | 1 | 1 | 1 | 38 | -2 | -2 | -2 | -2 | ... | 4115 | 1315 | 0 | 2658 | 0 | 4125 | 1315 | 0 | 0 | 0 |
| 13810 | 13811 | 40000 | 1 | 1 | 1 | 47 | 2 | 2 | 2 | 2 | ... | 12595 | 14386 | 14005 | 2000 | 1000 | 0 | 2000 | 0 | 2000 | 0 |
| 7934 | 7935 | 180000 | 2 | 2 | 2 | 28 | 1 | 2 | -1 | -1 | ... | 0 | 22357 | 21541 | 0 | 1600 | 0 | 22357 | 1500 | 785 | 1 |
| 7277 | 7278 | 150000 | 2 | 1 | 1 | 44 | 1 | 2 | 2 | -1 | ... | 10195 | 24232 | 22000 | 0 | 0 | 10195 | 24232 | 0 | 0 | 1 |

5 rows × 25 columns
Now we have the same number of defaulting and non-defaulting clients in the dataset:
print(df['DEF'].value_counts())
1    6636
0    6636
Name: DEF, dtype: int64
Since these data are all numeric we can convert them to NumPy arrays, which tend to be computationally more efficient than Pandas dataframes and also somewhat easier to use (when not doing statistics).
We omit the first column (ID) and the last (payment default). To avoid stating all the column names we use the .iloc method of indexing.
X = np.asarray(df.iloc[:,1:-1])
Scaling
Numeric data is much easier to handle in machine learning applications when it is scaled to a common order of magnitude; here we divide all values in each column by the column maximum. Another common choice is centering and scaling, i.e. subtracting the mean and dividing by the standard deviation.
We will need the maximum values later when we scale additional input in our application, so this parameter will go into the saved model.
Xmax = X.max(0)    # column-wise maxima, kept for scaling new input later
X = X / Xmax
# alternative: centering and scaling
# X = (X - X.mean(0)) / X.std(0)
print(X)
[[  0.29   1.00   0.17 ...   0.00   0.00   0.01]
 [  0.25   0.50   0.17 ...   0.00   0.00   0.00]
 [  0.05   0.50   0.17 ...   0.00   0.00   0.00]
 ...
 [  0.54   1.00   0.17 ...   0.00   0.01   0.00]
 [  0.38   0.50   0.17 ...   0.02   0.03   0.02]
 [  0.04   0.50   0.33 ...   0.00   0.01   0.00]]
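As an aside, the centering-and-scaling variant is also available ready-made in scikit-learn. A minimal sketch applied to a raw feature matrix; the fitted scaler would then be saved with the model, just as we save Xmax later:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(X)     # learns per-column mean and std
X_std = scaler.transform(X)          # (X - mean) / std, column by column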
Since DEF is a single number in {0,1} for each observation we can let our neural net directly learn this value.
We convert it to a small-integer NumPy array just to have both X and Y in the same internal format.
Y = np.asarray(df['DEF'], dtype='int8')
print(Y)
[ 1.00 0.00 0.00 ... 0.00 1.00 0.00]
Feed-Forward Neural Net
The basic neural net structure contains
- an input layer of $x_i$ values describing an observation, such as a text encoded as bag of words
- a hidden layer of $h_i$ activations computed from the input layer using weights $U$
- an output layer of $o_i$ values computed from the hidden layer with weights $V$
The output $o$ is compared to the values $y$ in the training set. The difference $o-y$ is used in the learning algorithm to update the weights.
In the simple feed-forward model with one hidden layer $h$ we compute $h$ from input $x$ and output $o$ from $h$:
\begin{align} h & = f(U x ) \\ o & = g(V h) \end{align}
with $f()$ and $g()$ some non-linear activation functions, such as tanh, sigmoid, or relu.
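For reference, these activations are one-liners in NumPy (a quick sketch, independent of the model code below):

def sigmoid(z):              # squashes any input to the range (0, 1)
    return 1 / (1 + np.exp(-z))

def relu(z):                 # zero for negative input, identity otherwise
    return np.maximum(0, z)

# tanh is built in as np.tanh(z) and squashes to (-1, 1)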
Input $x$ is a vector with $n$ elements, a numeric representation of an observation.
With an additional input $x_{n+1} = 1$ we can omit the bias $b$ and shorten the notation.
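A quick illustration of this trick with hypothetical toy values:

# append a constant 1 to every observation; the bias weights then live
# in an extra column of U instead of a separate vector b
xb = np.array([[2.0, -1.0],
               [0.5,  3.0]])                      # two observations, n = 2
xb = np.hstack([xb, np.ones((xb.shape[0], 1))])   # now n + 1 inputs each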
The weight matrices $U$ and $V$ are the trainable parameters. The optimization (learning) aims at approximating $o$ to the training data $y$ i.e. minimizing the errors $y-o$. A cost function is defined based on the errors. A gradient descent algorithm is commonly used for optimization.
Not a Magic Black Box
Let's see how this works on a very small subset of the data.
We will only use the payment status and take the x and y values from the first n observations:
n = 10
x = np.asarray(df[['PAY_6','PAY_5','PAY_4','PAY_3','PAY_2','PAY_1']][:n])
print(x)
[[ -1.00  -1.00  -1.00  -1.00   2.00   2.00]
 [ -2.00  -2.00  -2.00  -2.00  -2.00  -2.00]
 [  2.00   2.00   2.00   2.00   2.00   2.00]
 [  0.00  -1.00  -1.00  -1.00   2.00   1.00]
 [  2.00  -1.00  -1.00   2.00   2.00   1.00]
 [  0.00   0.00   0.00   0.00   0.00   0.00]
 [ -2.00  -2.00  -2.00  -2.00  -2.00  -2.00]
 [  0.00   0.00   0.00   0.00   0.00   0.00]
 [  0.00   0.00   0.00   0.00   0.00   0.00]
 [ -1.00  -1.00  -2.00  -2.00  -2.00   1.00]]
y = np.asarray(df[['DEF']][:n])[:,0]
print(y)
[ 1.00 0.00 0.00 1.00 1.00 0.00 0.00 1.00 0.00 0.00]
Now we set the number of hidden states and compute some random weights U and V:
hid = 5
U = np.random.random((hid, x.shape[1])) * 4 - 2
print(U)
[[ -1.12   1.25  -0.57   1.90   1.26  -0.93]
 [ -0.43  -0.96   1.36   0.81   0.78  -0.69]
 [ -0.82  -2.00   0.60   0.76   0.71  -0.37]
 [  0.78   0.43  -1.16   1.44  -1.17  -1.13]
 [  1.97   0.74  -1.03  -0.33  -0.95   0.18]]
V = np.random.random((1,hid)) * 4 - 2
print(V)
[[ 1.47 -1.51 0.48 0.33 -1.24]]
We want to compute f(U x) so we need matrix multiplication and a non-linear function f.
Matrix multiplication can be done with np.dot(), e.g. to get the matrix-vector product for the first observation:
print(np.dot(U, x[0]))
[ -0.80 -0.60 2.13 -6.08 -2.89]
However, NumPy easily lets us process all observations at once if we transpose x.
Note the first column which is identical to the result above:
print(np.dot(U, x.T))
[[ -0.80  -3.57   3.57  -0.99   2.48   0.00  -3.57   0.00   0.00  -6.23]
 [ -0.60  -1.73   1.73  -0.34   1.24   0.00  -1.73   0.00   0.00  -5.20]
 [  2.13   2.26  -2.26   1.68   2.31   0.00   2.26   0.00   0.00  -1.68]
 [ -6.08   1.61  -1.61  -4.18   1.70   0.00   1.61   0.00   0.00  -0.56]
 [ -2.89  -1.15   1.15  -1.10   1.85   0.00  -1.15   0.00   0.00   2.10]]
For our non-linear function we use the sigmoid as plotted below.
☆ This is the classic choice for artificial neural nets, with the simple derivative f'(x) = f(x)(1 - f(x)). However, we can see from the plot that as the absolute values of the input grow beyond several units the curve becomes very flat, meaning that the gradient goes towards zero. This is a problem for multi-layer networks using error backpropagation: the gradient becomes smaller in each layer, so that the early layers get almost no update at all: the vanishing gradient problem. Various other non-linear functions are now more commonly in use, such as the ReLU (rectified linear unit).
import matplotlib.pyplot as plt
a = np.linspace(-5, 5, 1000)
b = 1 / (1 + np.exp(-a) )
plt.figure(figsize=(8, 3))
plt.plot(a, b)
(Plot: the sigmoid function rises smoothly from 0 to 1 and flattens out for inputs beyond a few units in either direction.)
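We can check this flattening numerically with the a and b arrays from the plot: the derivative f'(a) = f(a)(1 - f(a)) peaks at 0.25 and is already tiny at a = ±5.

fprime = b * (1 - b)             # sigmoid derivative computed from the plot data
print(fprime.max())              # about 0.25, reached near a = 0
print(fprime[0], fprime[-1])     # about 0.0066 at a = -5 and a = +5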
We apply the non-linear function to the result of the matrix multiplication to get the states h of the hidden layer.
We are still computing all observations at once rather than going through a loop, so the column vectors are the states h for each observation in x:
h = 1 / (1 + np.exp(-1 * np.dot(U, x.T)))
print(h)
[[  0.31   0.03   0.97   0.27   0.92   0.50   0.03   0.50   0.50   0.00]
 [  0.35   0.15   0.85   0.42   0.78   0.50   0.15   0.50   0.50   0.01]
 [  0.89   0.91   0.09   0.84   0.91   0.50   0.91   0.50   0.50   0.16]
 [  0.00   0.83   0.17   0.02   0.85   0.50   0.83   0.50   0.50   0.36]
 [  0.05   0.24   0.76   0.25   0.86   0.50   0.24   0.50   0.50   0.89]]
Again we can compute the output for all hidden states in a single statement.
☆ Note that we could have more than one output state so the result is a matrix; however, here there is only one output state, so we get a matrix with one row (and not a vector).
o = 1/(1 + np.exp(-1 * np.dot(V, h)))
print(o)
[[ 0.57 0.56 0.33 0.47 0.46 0.44 0.56 0.44 0.44 0.29]]
Clearly random weights do not achieve anything close to the actual values y (unless we are very, very lucky).
print(o-y)
[[ -0.43 0.56 0.33 -0.53 -0.54 0.44 0.56 -0.56 0.44 0.29]]
We average the squared errors over the n observations and use this as the cost function:
c = np.sum((o-y)**2)/n
print(c)
0.22738725407473867
By applying some optimization method we can adjust the weights to achieve lower (and maybe even minimal) cost. In neural networks this is commonly done using gradient descent (a minimal sketch follows this list):
- the derivative of the cost function with respect to the weights is computed
- the weights are adjusted in the direction of lower cost
- this is done repeatedly, usually for a set number of epochs, i.e. runs through the whole data
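Here is a minimal sketch of such a gradient-descent loop for the toy net above, assuming the sigmoid activations and the mean squared error cost used in this section; the learning rate lr and the number of epochs are arbitrary choices:

lr = 0.5                                           # arbitrary learning rate
for epoch in range(5000):
    h = 1 / (1 + np.exp(-np.dot(U, x.T)))          # forward pass: hidden layer
    o = 1 / (1 + np.exp(-np.dot(V, h)))            # forward pass: output
    delta_o = 2 * (o - y) / n * o * (1 - o)        # error signal at the output
    delta_h = np.dot(V.T, delta_o) * h * (1 - h)   # error backpropagated to h
    V = V - lr * np.dot(delta_o, h.T)              # step down the gradient
    U = U - lr * np.dot(delta_h, x)
print('cost:', np.sum((o - y)**2) / n)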
Or, we can simply repeat the random process above until we arrive at a satisfactory result: not a practical approach, but interesting to see.
Note that for a given input, number of hidden units, and range of weights, the cost cannot fall below a certain threshold.
# repeat random weight guessing until the cost drops below 0.1
c = 1
e = 0
while c > 0.1:
    U = np.random.random((hid, x.shape[1])) * 4 - 2
    V = np.random.random((1, hid)) * 4 - 2
    h = 1 / (1 + np.exp(-1 * np.dot(U, x.T)))
    o = 1 / (1 + np.exp(-1 * np.dot(V, h)))
    c = np.sum((o - y)**2) / n
    e += 1
print('y: ', np.asarray([y])*1.0)
print('o: ', o)
print('sq.err:', (o-y)**2)
print('cost: ', c)
print('epochs:', e)
y:      [[  1.00   0.00   0.00   1.00   1.00   0.00   0.00   1.00   0.00   0.00]]
o:      [[  0.96   0.18   0.43   0.94   0.89   0.29   0.18   0.29   0.29   0.21]]
sq.err: [[  0.00   0.03   0.19   0.00   0.01   0.08   0.03   0.51   0.08   0.04]]
cost:   0.09835989425654287
epochs: 24657
Usually in such a situation output values of o below 0.5 are taken as 0 and values above 0.5 as 1, which gives us a more practically useful result (payment default or not).
However, depending on the application, the output values can also be interpreted as probabilities.
# threshold the output at 0.5 to get class predictions
d = np.asarray([int(v > 0.5) for v in o[0]])
print('predict: ', d)
print('actual: ', y)
print('accuracy:', np.sum(y==d)/n)
predict:  [  1.00   0.00   0.00   1.00   1.00   0.00   0.00   0.00   0.00   0.00]
actual:   [  1.00   0.00   0.00   1.00   1.00   0.00   0.00   1.00   0.00   0.00]
accuracy: 0.9
Deep Learning
Connectionist machine learning experienced a revival in the late 2000s with the advent of deep learning. Such models are commonly characterized by:
- several hidden layers instead of just one
- stochastic gradient descent: only a small part of the training data is used in minibatches to update the weights; running through all the data is called one epoch (sketched below)
- more elaborate network elements and architectures, such as dropout, max-pooling, and even recurrent designs
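As an illustration of the minibatch idea, one epoch could be organized like this sketch; the batch size of 32 is an arbitrary choice, and the actual gradient update is elided:

batch = 32                                  # arbitrary minibatch size
idx = np.random.permutation(len(X))         # shuffle the observations once per epoch
for start in range(0, len(X), batch):
    xb = X[idx[start:start + batch]]        # one minibatch of inputs
    yb = Y[idx[start:start + batch]]        # the matching labels
    # ... compute gradients on (xb, yb) only and update the weights ...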
MLPClassifier
We use a simple feed-forward net as provided by the scikit-learn package, the Multi-Layer Perceptron classifier.
We call MLPClassifier() with the max_iter parameter to save computation time here; see the full documentation at https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html for additional parameters.
By default MLPClassifier() uses only one hidden layer with 100 units; however, we can easily change to a deep learning structure, e.g. by adding hidden_layer_sizes=(200,50,) to the parameter list.
We split the data into training and testing set. Common splits are about 90:10 to 70:30.
We also print the number of observations per label value to check for even distribution after splitting.
- The net adapts its weights on the training set
- then the weights remain fixed and are applied to the testing set
The scores show the performance on the training and testing data.
For the training phase we can often find parameters that achieve accuracy close to 1.0, a phenomenon called overfitting. However, only the performance on the test data, which was not seen during training, is relevant for practical application.
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

print(X.shape, Y.shape)

# hold out 20% of the observations for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
print('training size:', X_train.shape[0],
      'testing size:', X_test.shape[0],
      'label counts:', np.unique(y_train, return_counts=True)[1])

# two hidden layers with 100 and 20 units
clf = MLPClassifier(hidden_layer_sizes=(100, 20,),
                    max_iter=50).fit(X_train, y_train)
print('score train:', clf.score(X_train, y_train))
print('score test: ', clf.score(X_test, y_test))
(13272, 23) (13272,)
training size: 10617 testing size: 2655 label counts: [5347.00 5270.00]
score train: 0.7246868230196855
score test:  0.7054613935969868
/home/hugo/.local/lib/python3.6/site-packages/sklearn/neural_network/_multilayer_perceptron.py:617: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (50) reached and the optimization hasn't converged yet. % self.max_iter, ConvergenceWarning)
The accuracy on the test set is the benchmark for how well the net has 'learned' the concepts in the training data i.e. how well it can 'generalize' to new data.
The accuracy on the test set is usually worse than on the training set.
In a more elaborate setting we would further split into training, validation, and test set, since adjusting the parameters is itself an optimization process.
- training set: the weights are adapted
- validation set: the performance is evaluated, the parameters are adapted, and training starts again
- test set: in the final step the performance is recorded without further parameter change
In other words, we are optimizing the weights on the training set and the parameters on the validation set. The test set is used for the final performance on real-world data i.e. data the net has not seen in training, and we have not seen while adjusting our parameters.
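A sketch of such a three-way split using two calls to train_test_split; the 60:20:20 ratio is an arbitrary choice:

from sklearn.model_selection import train_test_split

# split off 20% for testing, then 25% of the remainder for validation,
# which yields a 60:20:20 split overall
X_tmp, X_test, y_tmp, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=1)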
After training the MLPClassifier gives us the choice of output:
- clf.predict_proba() -- the probability for each class
- clf.predict() -- the (most likely) numeric label
print('pred prob: ', clf.predict_proba(X[:1]))
print('pred class:', clf.predict(X[:1]))
pred prob:  [[  0.31   0.69]]
pred class: [  1.00]
Persistent ML Model
We save our model in a pickle file so that we can use it again later to process new input. We will also need the Xmax values for scaling new input.
import pickle
pickle.dump((clf, Xmax), open('ccd.pkl', 'wb'))
Check that the model produces identical results when loaded from the pickle file:
clf2, Xmax2 = pickle.load(open('ccd.pkl', 'rb'))
print(clf2.predict_proba(df.iloc[:1, 1:-1] / Xmax2))
[[ 0.31 0.69]]
Misc
Number of cases per class -- observed (Y) and predicted (p):
p = clf.predict(X)
p.shape
(13272,)
sum(p==1), sum(Y==1), sum(p==0), sum(Y==0)
(5605, 6636, 7667, 6636)
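For a fuller per-class breakdown we could also compute the confusion matrix with scikit-learn; rows are the actual classes, columns the predicted classes:

from sklearn.metrics import confusion_matrix

print(confusion_matrix(Y, p))    # counts of actual (rows) vs predicted (columns)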