We wish to automatically determine the correct response given some text as input.
We will use a connectionist machine learning approach to process the text data. Since this method cannot deal with text directly, we first need to convert the text into numeric data.
Some of the packages in the import statements below should already be included in your Python distribution, but you will probably have to install others.
import pandas as pd
import numpy as np
import csv
import gzip
import string
from random import random, sample
rs = 2 # random state
If you get errors for missing packages you can install them from within the notebook by using the ! system command escape, e.g.
!python3 -m pip install pandas --user
On the command line: as above, but without the leading !. The --user option avoids problems with permissions; without it the installer tries to write to system directories and needs root permission.
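The scikit-learn package, which we import as sklearn further below, may also be missing from your installation; if so, the same approach works (the package name on PyPI is scikit-learn, not sklearn):
!python3 -m pip install scikit-learn --user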
Data on customer relationship interactions is difficult to obtain, for legal and business reasons. One of the few large, freely available datasets is the Consumer Complaint Database, which can be downloaded at the following sites (and probably some others):
This is a huge dataset of financial product and service complaints, with a total of over 4 million records. The zipped CSV file is over 400 MB. Reading the file with the Pandas read_csv() function should take about 10-30 seconds, depending on your hardware.
This is a big download; it may exceed your disk quota on the lab computers, depending on how much data from other courses you are still storing there. Note that if you are using the GUI file browser to delete files and folders it may be necessary to empty the trash in order to actually make the free space available.
The following command should download the complaints data file (the option -nc means no clobber, i.e. do not overwrite an existing file). Enter it on the command line in a terminal:
wget -nc https://files.consumerfinance.gov/ccdb/complaints.csv.zip
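If wget is not available on your system, a minimal alternative (a sketch, using the same URL as above) is to download the file from within Python:
# Download from within Python; skip the download if the file already exists (similar to -nc)
import os, urllib.request
url = 'https://files.consumerfinance.gov/ccdb/complaints.csv.zip'
if not os.path.exists('complaints.csv.zip'):
    urllib.request.urlretrieve(url, 'complaints.csv.zip')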
To get an idea of the file contents we read only a few rows:
pd.read_csv('complaints.csv.zip', nrows=5).head()
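If the head output is too wide to read comfortably, a quick way to list just the column names is:
# Print only the column names of the dataset
print(pd.read_csv('complaints.csv.zip', nrows=5).columns.tolist())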
We can see that the file contains a number of columns; for our purposes we only need the issue, the consumer complaint narrative, and the company response to the consumer.
Problem: Large Input Data and Limited RAM
No matter how high-end your hardware, there will always be datasets that it struggles with; here are a few tips for dealing with these situations.
We will only load some columns into RAM; if you still run into trouble with memory then add the following option to the read_csv() parameters:
skiprows=lambda i: i>0 and random()>0.1
This will randomly skip 90% of the records and still provide a workable dataset for trying things out.
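As a sketch, the complete call with random skipping could look like this (row 0 is the header and is never skipped; the random() function was imported above):
# Keep each data row with probability ~0.1
df_small = pd.read_csv('complaints.csv.zip',
                       skiprows=lambda i: i > 0 and random() > 0.1)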
Another option is to use the nrows parameter to read only the first n rows from the file. This is faster, and fine if the rows in the file are already shuffled. Another advantage is that you always end up with the same first couple of rows, which is good for debugging.
nrows=100000
Depending on your RAM you might run into memory problems when trying to read the whole file. We use an incremental method of reading the data in smaller chunks.
☆ It is a good idea to work on a small part of the data during development as it speeds up the computation in the machine learning training phase considerably. Once you have everything set up you can try to skip less data and watch the accuracy on the test set improve (hopefully).
The dropna() function removes records with any missing value; we are left with a much smaller dataset.
chunks = []
rdr = pd.read_csv('complaints.csv.zip', chunksize=100000, nrows=100*1000,
                  usecols=['Issue',
                           'Consumer complaint narrative',
                           'Company response to consumer'])
for chunk in rdr:
    chunk.dropna(how='any', inplace=True)   # drop rows with any missing value
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)   # DataFrame.append() is deprecated in recent Pandas
df = df.rename(columns={
    "Company response to consumer": "response",
    "Consumer complaint narrative": "narrative"})
print(df.shape)
Go to
https://www.consumerfinance.gov/data-research/consumer-complaints/#download-the-data
and click on 'Filter before you download'
then restrict to 'With Narrative' and Product = 'Credit card'
This should result in about 18000 rows and a download size of about 25 MB (uncompressed). The rest of the code in this notebook works exactly the same as with the full dataset. The results of the machine learning will be a little less impressive, but still perfectly fine for our purpose.
If you do not want to use the interactive filter, the wget command below should also work on the command line and download only the filtered data.
wget -O complaints-cc.csv -nc 'https://www.consumerfinance.gov/data-research/consumer-complaints/search/api/v1/?date_received_max=2022-05-19&date_received_min=2011-12-01&field=all&format=csv&has_narrative=true&no_aggs=true&product=Credit%20card&size=18838&sort=created_date_desc'
Once the file is downloaded you can compress it to reduce disk usage: on the command line enter
gzip complaints-cc.csv
The size should go from 25 MB to about 8 MB.
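If you prefer to stay within Python, the gzip module imported above can do the same compression (a sketch; unlike the command-line tool, this leaves the original file in place):
# Compress the CSV file from within Python
import shutil
with open('complaints-cc.csv', 'rb') as fin, gzip.open('complaints-cc.csv.gz', 'wb') as fout:
    shutil.copyfileobj(fin, fout)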
df = pd.read_csv('complaints-cc.csv.gz', usecols=['Issue',
                                                  'Consumer complaint narrative',
                                                  'Company response to consumer'])
df.dropna(inplace=True)
df = df.rename(columns={
    "Company response to consumer": "response",
    "Consumer complaint narrative": "narrative"})
print(df.shape)
The first thing to do with a Pandas DataFrame is to print the head, i.e. the first 5 rows, just to see what we can expect from this dataset:
df.head()
As a very rough ballpark figure, we want thousands of observations for machine learning; dozens or even hundreds tend to be insufficient to show any effect at all. Obviously this depends very much on what we are trying to achieve.
df.shape
The value_counts() function is very useful to get an idea of the different values and their frequencies:
df['Issue'].value_counts()[:10]
df['response'].value_counts()[:10]
We define our target column and our classes, keeping the number of observations in each class roughly equal to simplify the interpretation of the machine learning results.
target = 'response'
vc = df[target].value_counts()[:3]
classes = list(vc.keys())
obs = vc.values
[ (i, classes[i], obs[i]) for i in range(len(classes)) ]
At this point we can drop all rows with values other than our target classes. Hopefully this will save some memory in situations where we are at the limit of the available RAM; however, the garbage collection of the Python interpreter is somewhat unpredictable and may not release the memory immediately.
df = df[df[target].isin(classes)]   # keep only rows belonging to our target classes
df.shape
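If you are close to the memory limit you can ask the interpreter to run its garbage collection explicitly; this may or may not release memory back to the operating system:
# Trigger a garbage collection pass; the return value is the number of objects collected
import gc
gc.collect()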
Many training datasets for classification suffer from a severe imbalance in the number of observations per class. This must be addressed in some manner, otherwise the net can simply learn to predict the most frequent class and still achieve a high accuracy.
Among the various methods to deal with this problem we choose a simple approach:
Pandas dataframes have some nifty group-by and sample functions that allow us to draw samples from the groups formed by the column values. In this fashion we can get equal numbers of observations for all issues which makes the learning performance much easier to interpret.
minobs = min(obs)
print(minobs)
Now we are ready to use the groupby() and sample() functions:
df = df.groupby(target).sample(n=minobs, random_state=rs)
# shuffle
df = df.sample(frac=1, random_state=rs).reset_index(drop=True)
print(df.shape)
print(df[target].value_counts())
Another nice feature of the Pandas DataFrame is the to_csv() method. Once you have transformed a data frame into just the right format and content, you can write it to disk for later processing; to_csv() can even infer the desired compression method from the file name extension, e.g. gzip:
df.to_csv('complaints-cc.csv.gz')
Here is another peek at the narratives:
df['narrative'][:5]
The mean string length of the narratives gives us an idea of their size:
df['narrative'].str.len().mean()
The average length of English-language words is about 5 characters (depending very much on the type of text, of course), so dividing the mean narrative length by 5 suggests roughly 220 words per narrative.
Let's see if this is true for our collection, using a very simple (and not completely accurate) method of splitting sentences into words by white space (blanks):
np.mean([ len(lst) for lst in [ x.split() for x in df['narrative'] ] ])
The average narrative contains roughly 200 words.
We now need a method of converting these words into numbers, since our connectionist machine learning method can only work on numeric input. We need a way to transform each narrative into a list of numbers. Ideally all lists should have the same length, which will be the length of the input for our neural net.
The Python package sklearn contains a lot of useful text encoding and machine learning code, such as the CountVectorizer which allows us to easily encode the narratives without hassles such as tokenizing and punctuation removal.
To illustrate a simple approach in connectionist learning we will predict the target from the narrative by using the bag of words method i.e. encoding the input words by set membership in a vocabulary.
In the following example two sentences are encoded using a very small vocabulary:
voc = ('cat', 'dog', 'sat', 'mat')
sents = (('the', 'cat', 'sat', 'on', 'the', 'mat'), ('the', 'dog', 'sat'))
[[1 * (w in s) for w in voc] for s in sents]
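The result is [[1, 0, 1, 1], [0, 1, 1, 0]]: each sentence becomes a vector with one position per vocabulary word, and words outside the vocabulary (such as 'the' and 'on') are simply ignored.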
We could now set up our vocabulary and encode each narrative using plain Python. However, it is usually not a good idea to re-invent the wheel when existing packages are probably able to do a better job.
The CountVectorizer in the package sklearn provides a convenient way to encode our narratives as bags of words.
These are only some of the many parameters that determine the outcome of the machine learning; note that we, the developers, are also performing an optimization here, not just the computer. During the development of a solution we choose and change parameters until we arrive at a satisfactory outcome. When we evaluate the performance we have to keep this fact in mind.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import text
maxf = 500
vctr = CountVectorizer(
    lowercase=True,
    binary=True,               # record presence/absence rather than counts
    dtype=np.int8,
    max_features=maxf,
    #ngram_range=(1,2),
    # 'xxxx' patterns are redaction masks in the narratives and carry no information,
    # so we treat them as stop words; 'not' is taken out of the stop-word list and kept as a feature
    stop_words=text.ENGLISH_STOP_WORDS.union(('xxxx', 'xxxx/xxxx',
                                               'xx/xx/xxxx')).difference(('not',)),
    token_pattern=r'\b[a-zA-Z]{3,}\b'   # only words of 3 or more letters
)
vctr.fit(df['narrative'])
It's always a good idea to inspect variables wherever possible. Here we take a look at the actual words in our dictionary (or rather a fraction of them).
voc = vctr.vocabulary_
print([ (w, voc[w]) for w in sorted(voc, key=voc.get)][:100])
To get an idea of what our encoding really does we look at the first narrative in full:
df['narrative'].values[0]
To see how the CountVectorizer transforms text to numbers we follow the process for the first narrative. The build_analyzer() function returns a callable to the input processor. This allows us to see the result of the tokenization:
anlz = vctr.build_analyzer()
toks = anlz(df['narrative'].values[0])
print(toks)
We can check the result of the transform() function with the vocabulary and the tokens above, at least for the first couple of values.
print(vctr.transform(df['narrative'].values[:1])[0,:100])
The transform() function of the CountVectorizer object encodes the narratives into bag-of-words vectors.
Instead of wasting lots of space on zero values, the sparse matrix X only stores the non-zero entries together with their positions. Note that it still behaves much like a dense matrix, e.g. it has a shape attribute.
X = vctr.transform(df['narrative'])
#X = np.asarray(vctr.transform(df['narrative']).todense(), dtype='int8')
print(X.shape)
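As a quick sanity check we can compare the number of stored non-zero entries with the full matrix size, using the nnz attribute of the sparse matrix:
# Fraction of non-zero entries in the bag-of-words matrix
print('non-zeros:', X.nnz, 'of', X.shape[0] * X.shape[1],
      'entries =', round(100 * X.nnz / (X.shape[0] * X.shape[1]), 2), '%')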
The narrative is now encoded as a bag of words, i.e. bits indicating the presence of vocabulary words. Note that this encoding discards the order of the words and (with binary=True) also how often each word occurs.
For training and testing, X and Y are commonly defined as the input (features) and the correct output (labels), respectively.
The correct output is encoded as categorical labels indicating the class of each observation. Remember that indices in Python start at zero; with three classes the labels are 0, 1, 2.
Y = np.array([ classes.index(x) for x in df[target] ], dtype='int8')
print(Y[:10])
Y is now a Numpy array and X a sparse SciPy matrix; both report their shapes:
print(X.shape, Y.shape)
And now we have our input and output arrays -- this is what the net needs to learn:
X
Y
We use the Multi-Layer Perceptron classifier from the sklearn package. You can experiment with its parameters, such as the hidden layer sizes or the maximum number of iterations, although in this example no big gains are to be expected.
Increasing the number of maximum iterations beyond a certain value does not result in better performance on the test set, only on the training set: this situation is known as overfitting.
The performance on the test set starts to degrade again with max_iter > 5.
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
print('training size:', X_train.shape[0],
      'testing size:', X_test.shape[0],
      'label counts:', np.unique(y_train, return_counts=True)[1])
clf = MLPClassifier(random_state=1, max_iter=5,
                    hidden_layer_sizes=(100, 10)).fit(X_train, y_train)
print('score train:', clf.score(X_train, y_train))
print('score test: ', clf.score(X_test, y_test))
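To see the overfitting effect described above for yourself, you can re-train the net with increasing max_iter values and compare train and test scores; a small sketch re-using the split from above (convergence warnings for small max_iter values are expected and can be ignored):
# Train several nets that differ only in max_iter and compare their scores
for n in (2, 5, 10, 20, 50):
    m = MLPClassifier(random_state=1, max_iter=n,
                      hidden_layer_sizes=(100, 10)).fit(X_train, y_train)
    print('max_iter:', n,
          'train:', round(m.score(X_train, y_train), 3),
          'test:', round(m.score(X_test, y_test), 3))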
Without any training the classifier works with its initial random weight values, i.e. it can only guess randomly and should achieve a score of about 1/n for n classes on the test set. To check this we can set the test_size to something like 0.999, so that only a few training observations per class remain, effectively leaving the weights close to their initial random values.
Change the test_size and re-run the cell to see the effect.
Then, as we decrease the test_size back to about 0.1 or 0.2 the score should increase (although not dramatically).
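A sketch of this experiment, using new variable names so the split and classifier from above are not overwritten:
# With almost no training data the test score should be close to chance (about 1/3 here)
X_tr, X_te, y_tr, y_te = train_test_split(X, Y, test_size=0.999, random_state=1)
clf_tiny = MLPClassifier(random_state=1, max_iter=5,
                         hidden_layer_sizes=(100, 10)).fit(X_tr, y_tr)
print('score test:', clf_tiny.score(X_te, y_te))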
After training, the MLPClassifier gives us class probabilities and a predicted class for new observations:
print('pred prob: ', clf.predict_proba(X_test[:1]))
print('pred class:', clf.predict(X_test[:1, :]))
Once we have trained our model on a training dataset we want to save it for future use in robotic applications.
We use the standard Python pickle module. Note that we can pickle any Python object, including a tuple of objects.
inp = vctr.transform(['On my credit report critical info was missing'])
print('pred prob:', clf.predict_proba(inp))
import pickle
pickle.dump((clf, vctr), open('resp.pkl', 'wb'))
Later when we apply the encoding and classifier model in practical applications we load the objects from file.
We should get exactly the same probabilities when predicting from the input above:
clf2, vctr2 = pickle.load(open('resp.pkl', 'rb'))
inp = vctr2.transform(['On my credit report critical info was missing'])
clf2.predict_proba(inp)
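A quick check that the reloaded objects reproduce the original predictions:
# Compare the predictions of the original and the reloaded classifier
print(np.allclose(clf.predict_proba(inp), clf2.predict_proba(inp)))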