Machine Learning for RPA: Text Data

We wish to automatically determine the correct response given some text as input.

We will use the connectionist machine learning approach here to process text data. Since this ML method cannot deal with text directly, we need to convert the text data into numeric data.

Some of the packages in the import statements below should already be included in your Python distribution, but you will probably have to install the others separately.

In [145]:
import pandas as pd
import numpy as np
import csv
import gzip
import string
from random import random, sample
rs = 2 # random state

If you get errors for missing packages you can install them from within the notebook by using the ! system command escape, e.g.

!python3 -m pip install pandas --user

On the command line: as above, but without the leading !. The --user option avoids problems with permissions; without it the installer tries to write to system directories and needs root permission.

The Consumer Complaints Dataset

Data on customer relationship interactions is difficult to obtain, for legal and business reasons. One of the few large datasets that is freely available is the Consumer Complaints Database, which can be downloaded from the sites below (and probably some others):

This is a huge dataset of complaints about financial products and services, with a total of over 4 million records. The zipped CSV file is over 400 MB. Reading the file with the Pandas read_csv() function should take about 10-30 seconds, depending on your hardware.

Option 1: Download the Whole File (may not work due to quota limit)

This is a big download; it may exceed your disk quota on the lab computers, depending on how much data from other courses you are still storing there. Note that if you are using the GUI file browser to delete files and folders it may be necessary to empty the trash in order to actually make the free space available.

The following command should download the complaints data file (the option -nc means no clobber, i.e. do not overwrite an existing file). Enter it on the command line in a terminal:

wget -nc https://files.consumerfinance.gov/ccdb/complaints.csv.zip

To get an idea of the file contents we read only a few rows:

In [146]:
pd.read_csv('complaints.csv.zip', nrows=5).head()
Out[146]:
Date received Product Sub-product Issue Sub-issue Consumer complaint narrative Company public response Company State ZIP code Tags Consumer consent provided? Submitted via Date sent to company Company response to consumer Timely response? Consumer disputed? Complaint ID
0 2022-04-15 Checking or savings account Checking account Managing an account Deposits and withdrawals NaN NaN UNITED SERVICES AUTOMOBILE ASSOCIATION FL 32812 NaN NaN Referral 2022-04-18 In progress Yes NaN 5462556
1 2022-05-02 Credit reporting, credit repair services, or o... Credit reporting Improper use of your report Credit inquiries on your report that you don't... NaN NaN TRANSUNION INTERMEDIATE HOLDINGS, INC. FL 33713 NaN NaN Web 2022-05-02 In progress Yes NaN 5529881
2 2022-03-16 Credit reporting, credit repair services, or o... Credit reporting Incorrect information on your report Information belongs to someone else NaN Company has responded to the consumer and the ... Experian Information Solutions Inc. NJ 8081 NaN Consent not provided Web 2022-03-16 Closed with explanation Yes NaN 5330688
3 2022-03-16 Credit reporting, credit repair services, or o... Other personal consumer report Incorrect information on your report Information belongs to someone else NaN NaN Experian Information Solutions Inc. FL 34205 NaN NaN Web 2022-03-16 In progress Yes NaN 5329460
4 2022-03-16 Credit reporting, credit repair services, or o... Credit reporting Problem with a credit reporting company's inve... Was not notified of investigation status or re... NaN NaN Experian Information Solutions Inc. VA 20170 NaN NaN Web 2022-03-16 In progress Yes NaN 5330551

We can see that

  • there are many missing values indicated by NaN
  • only a few columns are relevant for our purpose

Problem: Large Input Data and Limited RAM

No matter how powerful your hardware, there will always be problems it struggles with; here are a few tips for dealing with such situations.

We will only load some columns into RAM; if you still run into trouble with memory then add the following option to the read_csv() parameters:

skiprows=lambda i: i>0 and random()>0.1

This will randomly skip 90% of the records and still provide a workable dataset for trying things out.

Another option is to use the nrows parameter to read only the first n rows from the file. This is faster, and fine if the rows in the file are already shuffled. A further advantage is that you always end up with the same first couple of rows, which is good for debugging.

nrows=100000
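
For illustration, here is how these options fit into a complete read_csv() call (a sketch; random() is the function imported at the top, and the i > 0 condition keeps the header row from being skipped):

# read at most 100000 rows, randomly skipping about 90% of the data rows
df_small = pd.read_csv('complaints.csv.zip', nrows=100000,
                       skiprows=lambda i: i > 0 and random() > 0.1)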

Depending on your RAM you might run into memory problems when trying to read the whole file, so we use an incremental method and read the data in smaller chunks.

☆ It is a good idea to work on a small part of the data during development as it speeds up the computation in the machine learning training phase considerably. Once you have everything set up you can try to skip less data and watch the accuracy on the test set improve (hopefully).

The dropna() function removes records with any missing value; we are left with a much smaller dataset.

In [147]:
df_parts = []
rdr = pd.read_csv('complaints.csv.zip', chunksize=100000, nrows=100*1000,
             usecols=[ 'Issue',  
                       'Consumer complaint narrative', 
                       'Company response to consumer', ])
for chunk in rdr:
    chunk.dropna(how='any', inplace=True) 
    df_parts.append(chunk)
# DataFrame.append() was removed in recent Pandas versions, so we collect the
# chunks in a list and concatenate them at the end
df = pd.concat(df_parts, ignore_index=True)

df = df.rename(columns={
    "Company response to consumer": "response",
    "Consumer complaint narrative": "narrative",})

print(df.shape)
(12826, 3)

Option 2: Use Interactive Filter for Download (success guaranteed)

Go to

https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data

and click on 'Filter before you download'

then restrict to 'With Narrative' and Product = 'Credit card'

This should result in about 18000 rows and a download size of about 25 MB (uncompressed). The rest of the code in this notebook works exactly the same as with the full dataset. The results of the machine learning will be a little less impressive, but still perfectly fine for our purpose.

If you do not want to use the interactive filter, the wget command below should also work on the command line in a terminal and download only the filtered data.

wget -O complaints-cc.csv -nc 'https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/?date_received_max=2022-05-19&date_received_min=2011-12-01&field=all&format=csv&has_narrative=true&no_aggs=true&product=Credit%20card&size=18838&sort=created_date_desc'

Once the file is downloaded you can compress it to reduce disk usage: on the command line enter

gzip complaints-cc.csv

The size should go from 25 MB to about 8 MB.

In [148]:
df = pd.read_csv('complaints-cc.csv.gz', usecols=[ 'Issue',  
                       'Consumer complaint narrative', 
                       'Company response to consumer', ])
df.dropna(inplace=True) 
df = df.rename(columns={
    "Company response to consumer": "response",
    "Consumer complaint narrative": "narrative",})
print(df.shape)
(18838, 3)

Exploring the Complaints Dataset

The first thing to do with a Pandas DataFrame is to print the head, i.e. the first 5 rows, just to see what we can expect from this dataset:

In [149]:
df.head()
Out[149]:
Issue narrative response
0 Billing disputes XXXX XXXX, 2015Walmart/Synchrony BankXXXX, GA ... Closed with monetary relief
1 Rewards I open up a credit card with banana republic, ... Closed with explanation
2 Rewards Hi, I applied for the Citi XXXX Honors Visa Si... Closed with explanation
3 Other I looked at my credit report in XXXX. I had a ... Closed with explanation
4 Privacy Hello, Today I received an email from Barclayc... Closed with explanation

As a very rough ballpark figure, we want thousands of observations for machine learning; dozens or even hundreds tend to be insufficient to show any effect at all. Obviously this depends very much on what we are trying to achieve.

In [150]:
df.shape
Out[150]:
(18838, 3)

The value_counts() function is very useful to get an idea of the different values and their frequencies:

In [151]:
df['Issue'].value_counts()[:10]
Out[151]:
Billing disputes                         3102
Other                                    1940
Identity theft / Fraud / Embezzlement    1723
Closing/Cancelling account               1440
Customer service / Customer relations     973
Rewards                                   900
Delinquent account                        834
Advertising and marketing                 818
APR or interest rate                      785
Late fee                                  771
Name: Issue, dtype: int64
In [152]:
df['response'].value_counts()[:10]
Out[152]:
Closed with explanation            12241
Closed with monetary relief         4327
Closed with non-monetary relief     2140
Closed                               105
Untimely response                     25
Name: response, dtype: int64

We define our target column and our classes, keeping the number of observations in each class roughly equal to simplify the interpretation of the machine learning results.

In [153]:
target = 'response'
vc = df[target].value_counts()[:3]
classes = list(vc.keys())
obs = vc.values
[ (i, classes[i], obs[i]) for i in range(len(classes)) ]
Out[153]:
[(0, 'Closed with explanation', 12241),
 (1, 'Closed with monetary relief', 4327),
 (2, 'Closed with non-monetary relief', 2140)]

At this point we can drop all rows with target values other than our chosen classes. Hopefully this will save some memory in situations where we are at the limit of the available RAM; however, garbage collection in the Python interpreter is somewhat unpredictable and may not free the memory immediately.

In [154]:
mask = [x in classes for x in df[target]] 
df = df[mask] 
df.shape
Out[154]:
(18708, 3)
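
If memory is very tight at this point you can also explicitly trigger garbage collection; this is only a hint to the interpreter and does not guarantee that memory is returned to the operating system (a minimal sketch using the standard gc module):

import gc
gc.collect()   # collect unreachable objects now rather than at some later point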

Imbalance in the Number of Observations per Class

Many training datasets for classification suffer from a severe imbalance in the number of observations per class. This must be addressed in some manner, otherwise the net can simply learn to predict the most frequent class and still achieve a high accuracy.

Among the various methods to deal with this problem we choose a simple approach:

  • choose the number of most frequent classes
  • determine the minimum number of observations for each of the most frequent classes
  • draw samples of this size from each class

Pandas dataframes have some nifty group-by and sample functions that allow us to draw samples from the groups formed by the column values. In this fashion we can get equal numbers of observations for all classes, which makes the learning performance much easier to interpret.

In [155]:
minobs = min(obs)
print(minobs)
2140

Now we are ready to use the groupby() and sample() functions:

In [156]:
df = df.groupby(target).sample(n=minobs, random_state=rs)
# shuffle
df = df.sample(frac=1, random_state=rs).reset_index(drop=True)
print(df.shape)
print(df[target].value_counts())
(6420, 3)
Closed with monetary relief        2140
Closed with non-monetary relief    2140
Closed with explanation            2140
Name: response, dtype: int64

Optional: Saving a DataFrame

Another nice feature of the Pandas DataFrame is the to_csv() method. Once you have managed to transform a data frame into just the right format and content you can write it to disk for later processing; to_csv() can even infer the desired compression method from the file name extension, e.g. gzip:

df.to_csv('complaints-cc.csv.gz')
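
Reading such a file back in later works the same way, with read_csv() again inferring the compression from the extension; since to_csv() writes the row index as an extra first column by default, it can be restored with index_col=0 (a small sketch for the file written above):

df2 = pd.read_csv('complaints-cc.csv.gz', index_col=0)   # first column holds the saved index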

Exploring Length of Input

Here is another peek at the narratives:

In [157]:
df['narrative'][:5]
Out[157]:
0    I applied on line in XX/XX/XXXX Looking for a ...
1    I have had a Bank of America credit card for o...
2    Chase Ink credit card advertises XXXX points f...
3    Credit Card was lost prior to business trip. I...
4    Citi, who adminstrates Best Buy 's credit card...
Name: narrative, dtype: object

The mean string length of the narratives gives us an idea of their size:

In [158]:
df['narrative'].str.len().mean()
Out[158]:
1131.4704049844236

The average length of English-language words is about 5 characters (depending, of course, very much on the type of text), which would put the narratives at roughly 220 words each.

Let's see if this is true for our collection, using a very simple (and not completely accurate) method of splitting sentences into words by white space (blanks):

In [159]:
np.mean([ len(lst) for lst in [ x.split() for x in df['narrative'] ] ])
Out[159]:
208.94532710280373

The average narrative contains roughly 200 words.

We now need a way to convert these words into numbers, since our connectionist machine learning method can only work on numeric input: each narrative has to be transformed into a list of numbers. Ideally all lists should have the same length, which will be the length of the input for our neural net.

The Python package sklearn contains a lot of useful text encoding and machine learning code, such as the CountVectorizer, which lets us encode the narratives without having to deal with tokenization and punctuation removal ourselves.

Simple Neural Net Approach for Classification

To illustrate a simple approach in connectionist learning we will predict the target from the narrative by using the bag of words method i.e. encoding the input words by set membership in a vocabulary.

Text Encoding

In the following example two sentences are encoded using a very small vocabulary:

In [160]:
voc = ('cat', 'dog', 'sat', 'mat')
sents = (('the', 'cat', 'sat', 'on', 'the', 'mat'), ('the', 'dog', 'sat'))
[[1 * (w in s) for w in voc] for s in sents]
Out[160]:
[[1, 0, 1, 1], [0, 1, 1, 0]]

We could now set up our vocabulary and encode each narrative using plain Python. However, it is usually not a good idea to re-invent the wheel when existing packages are probably able to do a better job.

The CountVectorizer in the package sklearn provides a convenient way to encode our narratives as bags of words.

  • max_features is the size of the dictionary containing the most frequent words
  • the stop words will not be used for encoding; we add some application-specific patterns to the stop word list here, and we remove 'not' from it since it carries meaning but would otherwise be excluded as a stop word
  • token_pattern allows us to easily exclude all the application-specific words such as references to legal codes: we only accept strings of 3 or more letters a-z
  • ngram_range accepts a (min, max) pair for the number of words in a token sequence, letting us e.g. include two-word sequences among the most frequent entries in the vocabulary
  • binary=True means that we do not encode with the number of occurrences but only yes/no
  • dtype allows us to set a more space-efficient data type than the default int64, resulting in memory savings for problems at the limit of our RAM capacity

These are only some of the many parameters that determine the outcome of the machine learning. Note that we, the developers, are also performing an optimization here, not just the computer: during the development of a solution we choose and change parameters until we arrive at a satisfactory outcome. We will have to remember this fact when we evaluate the performance.

In [161]:
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn.feature_extraction import text 
maxf = 500
vctr = CountVectorizer(
    lowercase=True, 
    binary=True, 
    dtype=np.int8,
    max_features=maxf, 
    #ngram_range=(1,2),
    stop_words=text.ENGLISH_STOP_WORDS.union(('xxxx', 'xxxx/xxxx', 
                                              'xx/xx/xxxx')).difference(('not',)),
    token_pattern=r'\b[a-zA-Z]{3,}\b'
    )
vctr.fit(df['narrative'])
Out[161]:
CountVectorizer(binary=True, dtype=<class 'numpy.int8'>, max_features=500,
                stop_words=frozenset({'a', 'about', 'above', 'across', 'after',
                                      'afterwards', 'again', 'against', 'all',
                                      'almost', 'alone', 'along', 'already',
                                      'also', 'although', 'always', 'am',
                                      'among', 'amongst', 'amoungst', 'amount',
                                      'an', 'and', 'another', 'any', 'anyhow',
                                      'anyone', 'anything', 'anyway',
                                      'anywhere', ...}),
                token_pattern='\\b[a-zA-Z]{3,}\\b')

It is always a good idea to inspect variables wherever possible. Here we take a look at the actual words in our dictionary (or rather a fraction of them).

In [162]:
voc = vctr.vocabulary_

print([ (w, voc[w]) for w in sorted(voc, key=voc.get)][:100])
[('able', 0), ('accept', 1), ('access', 2), ('according', 3), ('account', 4), ('accounts', 5), ('act', 6), ('action', 7), ('activity', 8), ('actually', 9), ('added', 10), ('addition', 11), ('additional', 12), ('address', 13), ('advised', 14), ('agencies', 15), ('agency', 16), ('agent', 17), ('ago', 18), ('agreed', 19), ('agreement', 20), ('allow', 21), ('allowed', 22), ('america', 23), ('american', 24), ('amex', 25), ('annual', 26), ('answer', 27), ('apparently', 28), ('application', 29), ('applied', 30), ('apply', 31), ('applying', 32), ('approved', 33), ('approximately', 34), ('apr', 35), ('ask', 36), ('asked', 37), ('asking', 38), ('assistance', 39), ('assured', 40), ('attached', 41), ('attempt', 42), ('attempted', 43), ('authorized', 44), ('available', 45), ('aware', 46), ('away', 47), ('bad', 48), ('balance', 49), ('balances', 50), ('bank', 51), ('banking', 52), ('based', 53), ('believe', 54), ('benefit', 55), ('best', 56), ('billing', 57), ('bills', 58), ('bonus', 59), ('bureau', 60), ('bureaus', 61), ('business', 62), ('buy', 63), ('called', 64), ('calling', 65), ('calls', 66), ('came', 67), ('cancel', 68), ('cancelled', 69), ('capital', 70), ('card', 71), ('cards', 72), ('care', 73), ('case', 74), ('cash', 75), ('caused', 76), ('cfpb', 77), ('change', 78), ('changed', 79), ('charge', 80), ('charged', 81), ('charges', 82), ('charging', 83), ('chase', 84), ('check', 85), ('checked', 86), ('checking', 87), ('citi', 88), ('citibank', 89), ('claim', 90), ('claimed', 91), ('clear', 92), ('clearly', 93), ('close', 94), ('closed', 95), ('closing', 96), ('collect', 97), ('collection', 98), ('collections', 99)]

To get an idea of what our encoding really does we look at the first narrative in full:

In [163]:
df['narrative'].values[0]
Out[163]:
"I applied on line in XX/XX/XXXX Looking for a lower interest rate or 0 % for a few months it asking would want to do a balance transfer if I was approved. So I gave the information. It said I had been approved but never said the interest rate. Or anything else. I got kicked out of my home. So I never got the credit card to activate or decline the card. A month later the did the balance transfer without my permission. I called spoke to a supervisor names XXXX.Told him I did not want the card because the interest rate was w a y to high. He apoligisted said he would take care of the problem He said he would close the account and I would owe nothing. Now it 's showing on my credit report months later."

To see how the CountVectorizer transforms text into numbers we follow the process for the first narrative. The build_analyzer() function returns a callable that performs the input processing, which allows us to inspect the result of the tokenization:

In [164]:
anlz = vctr.build_analyzer()
toks = anlz(df['narrative'].values[0])
print(toks)
['applied', 'line', 'looking', 'lower', 'rate', 'months', 'asking', 'want', 'balance', 'transfer', 'approved', 'gave', 'information', 'said', 'approved', 'said', 'rate', 'got', 'kicked', 'home', 'got', 'credit', 'card', 'activate', 'decline', 'card', 'month', 'later', 'did', 'balance', 'transfer', 'permission', 'called', 'spoke', 'supervisor', 'names', 'told', 'did', 'not', 'want', 'card', 'rate', 'high', 'apoligisted', 'said', 'care', 'problem', 'said', 'close', 'account', 'owe', 'showing', 'credit', 'report', 'months', 'later']

We can check the result of the transform() function against the vocabulary and the tokens above, at least for the first couple of values.

In [165]:
print(vctr.transform(df['narrative'].values[:1])[0,:100])
  (0, 4)	1
  (0, 30)	1
  (0, 33)	1
  (0, 38)	1
  (0, 49)	1
  (0, 64)	1
  (0, 71)	1
  (0, 73)	1
  (0, 94)	1

The transform() function of the CountVectorizer object encodes the narratives into bag-of-words vectors.

  • the encoder produces a sparse matrix which saves a lot of RAM
  • however, it is not directly compatible with some neural net packages (such as Keras); if we cannot work with sparse matrices then we have to convert to a dense array
    • the Numpy package allows us to specify 8-bit integers as the data type of the newly created dense array; the default is int64
    • using the smaller representation saves RAM and makes it possible to tackle bigger problems within given memory limits
  • sklearn modules work fine with sparse matrices, so there is no need for conversion to dense here

Instead of wasting lots of space on zero values, the sparse matrix X stores only the non-zero values together with their positions. Note that everything still works just like for dense matrices, e.g. shape.

In [166]:
X = vctr.transform(df['narrative'])

#X = np.asarray(vctr.transform(df['narrative']).todense(), dtype='int8')

print(X.shape)
(6420, 500)
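
To make the space saving concrete, we can compare the storage used by the sparse representation with what a dense int8 array of the same shape would need (a sketch; data, indices and indptr are the internal arrays of a SciPy CSR matrix):

sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = X.shape[0] * X.shape[1]   # one byte per entry for a dense int8 array
print(sparse_bytes, dense_bytes)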

The narrative is now encoded as a bag of words, i.e. bits indicating the presence of vocabulary words. Note that

  • the size of the encoding is fixed regardless of the text length
  • order and context of words in the text are lost

For training and testing X and Y are commonly defined as

  • X is the matrix of input encodings, one row for each observation, in our case the bag-of-words
  • Y is the correct output, in our case labels indicating membership in one of the classes

The correct output is encoded as categorical, i.e. labels indicating the class of each observation. Remember that indices in Python start at zero; with three classes the labels are 0, 1, 2.

In [167]:
Y = np.array([ classes.index(x)  for x in df[target] ], dtype='int8')

print(Y[:10])
[1 1 2 1 1 2 1 1 0 0]

X and Y are now Numpy arrays:

  • the number of rows in X is equal to the number of observations, i.e. the size of the whole dataset
  • the number of columns in X is the number of words in the vocabulary
  • Y contains the numeric label for each observation
In [168]:
print(X.shape, Y.shape)
(6420, 500) (6420,)

And now we have our input and output arrays -- this is what the net needs to learn:

In [169]:
X
Out[169]:
<6420x500 sparse matrix of type '<class 'numpy.int8'>'
	with 226492 stored elements in Compressed Sparse Row format>
In [170]:
Y
Out[170]:
array([1, 1, 2, ..., 1, 2, 1], dtype=int8)

MLPClassifier

We use the Multi-Layer-Perceptron Classifier from the sklearn package:

  • basic but useful implementation of a feed-forward neural net
  • the hidden_layer_sizes parameter accepts a tuple of hidden layer sizes, so we can experiment with deep architectures, although in this example no big gains are to be expected

  • Increasing the maximum number of iterations beyond a certain value does not result in better performance on the test set, only on the training set: this situation is known as overfitting.

  • The performance on the test set starts to degrade again with max_iter > 5 (see the sketch after the training cell below).

In [171]:
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
print('training size:', X_train.shape[0], 
      'testing size:', X_test.shape[0],
      'label counts:', np.unique(y_train, return_counts=True)[1])

clf = MLPClassifier(random_state=1, max_iter=5, 
                    hidden_layer_sizes=(100,10,)).fit(X_train, y_train)

print('score train:', clf.score(X_train, y_train))
print('score test: ', clf.score(X_test, y_test))
training size: 5136 testing size: 1284 label counts: [1739 1699 1698]
score train: 0.6999610591900312
score test:  0.5630841121495327
/home/hugo/.local/lib/python3.6/site-packages/sklearn/neural_network/_multilayer_perceptron.py:617: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (5) reached and the optimization hasn't converged yet.
  % self.max_iter, ConvergenceWarning)
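
To observe the overfitting effect mentioned above, one can re-train with different values of max_iter and compare training and test accuracy (a sketch reusing the split from the cell above; expect ConvergenceWarnings for small iteration counts, and the exact scores depend on the random states):

# train the same architecture with increasing iteration limits
for n in (2, 5, 10, 20, 50):
    m = MLPClassifier(random_state=1, max_iter=n,
                      hidden_layer_sizes=(100, 10)).fit(X_train, y_train)
    print(n, round(m.score(X_train, y_train), 3), round(m.score(X_test, y_test), 3))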

Check

Without any training the classifier works with its initial random weight values, i.e. it can only do random guessing and should achieve a score of about 1/n for n classes on the test set. To check this we can set the test_size to something like 0.999 so that we are left with only a few training observations per class, effectively leaving the weights close to their initial random values (see the sketch below).

  • Change the test_size in the cell above and click the Run button to execute the code again.

  • Then, as we decrease the test_size back to about 0.1 or 0.2 the score should increase (although not dramatically).
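
Here is a minimal sketch of this check; with test_size=0.999 only a handful of observations are used for training, so the test score should end up near 1/3 for our three classes (the exact value depends on the random states):

X_tr, X_te, y_tr, y_te = train_test_split(X, Y, test_size=0.999, random_state=1)
clf_chk = MLPClassifier(random_state=1, max_iter=5,
                        hidden_layer_sizes=(100, 10)).fit(X_tr, y_tr)
print('score test:', clf_chk.score(X_te, y_te))   # expect roughly 0.33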

After training the MLPClassifier gives us:

  • clf.predict_proba() -- the probability for each class
  • clf.predict() -- the (most likely) numeric label
In [172]:
print('pred prob: ', clf.predict_proba(X_test[:1]))
print('pred class:', clf.predict(X_test[:1, :]))
pred prob:  [[0.19487806 0.6118888  0.19323314]]
pred class: [1]
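
The numeric label can be mapped back to the human-readable response via the classes list defined earlier, which is what a robotic process would ultimately act on (a small sketch):

pred = clf.predict(X_test[:1])[0]   # numeric label, e.g. 1
print(classes[pred])                # corresponding response text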

Saving and Loading Trained Models

Once we have trained our model on a training dataset we want to save it for future use in robotic applications.

We use the standard Python pickle module. Note that we can pickle any Python object, including a tuple of objects.

In [173]:
inp = vctr.transform(['On my credit report critical info was missing'])
print('pred prob:', clf.predict_proba(inp))
pred prob: [[0.42818171 0.16470074 0.40711755]]
In [174]:
import pickle

pickle.dump((clf, vctr), open('resp.pkl', 'wb'))

Later, when we apply the encoder and classifier model in a practical application, we load the objects from the file.

We should get exactly the same probabilities when predicting from the input above:

In [175]:
clf2, vctr2 = pickle.load(open('resp.pkl', 'rb'))

inp = vctr2.transform(['On my credit report critical info was missing'])
clf2.predict_proba(inp)
Out[175]:
array([[0.42818171, 0.16470074, 0.40711755]])