Case Study: Personal Finance¶
Here we will use the Robot framework and the connectionist machine learning methods in a case study for a consumer credit company.
For obvious reasons training datasets with actual data in this area are not available. However, there are some datasets in related areas that we can combine to create semi-synthetic data for our robots to work on.
We combine the following datasets for our sample application:
- credit card billing and payment data (credit card default dataset)
- verbal communications of customers (consumer complaints dataset; optional)
This will provide a moderately realistic scenario for robot automation.
The credit management application will be designed in Flask with a customer service interface that shows
- past payments and highlights the risk estimated by the machine learning module for an individual customer to default
- ranks cases as initialized by customer complaints based on the default probability
Flask Application for Credit Management¶
The application needs to
- import client records
- create billing and payment records for clients and periods (calendar months):
- billing amounts BILL_AMT[1-6]
- payment amounts PAY_AMT[1-6]
- payment status PAY_[1-6]
- list client data
- list billing and payment data for individual clients
In order to apply our machine learning model we need the financial data for the last 6 months; the credit card dataset provides this data for April-September 2005.
In order so model our application on a realistic scenario we implement a separate payment management application that provides the transactional data (billing, payment, and status) per month but does not include the client master data (credit limit, age..).
Assumptions¶
For this case study let's assume that
- we have full control over the client credit app and its Sqlite DB file
- but not over the billing data: we have to get that via the web interface of the billing app
☆ Note that this type of separation happens quite often in practical settings; e.g. the grades for courses are computed by the teaching support system; however, there is no connection to the administrative system for entering those grades into the student records -- this has to be done by manual file download and import.
Again we rely on Python package Pandas and the Sqlite API; the creation of the montly billing record will also be done in Python script.
The listing will be provided via the application web site. Put the code into a file named e.g. billing.py and run it in a separate terminal.
from flask import Flask, render_template, request
import sqlite3
app = Flask(__name__)
def getconn():
return sqlite3.connect("billing.db")
@app.route("/")
def billing():
return """<h1>Billing App</h1>
<ul>
<li><a href=getbill>Get Billing Data</a></li>
<li><a href=todo>Enter Billing Data </a></li>
</ul>"""
@app.route("/getbill")
def getbill():
return """<h1>Get Billing Data</h1>
<form action=listbill method=POST>
<table>
<tr><td>Client ID:<td><input type=text name=cid>
</table><input type=submit value=OK></form>"""
@app.route('/listbill', methods=['POST'])
def listbill():
cid = request.form['cid']
conn = getconn()
cur = conn.cursor()
rows = cur.execute("select cid, year, month, bill, pay, stat "
+ " from bildat where cid = ? order by year, month", (cid,))
html = "<h3>Billing Data</h3><table>\n"
html += "<tr><th>CID<th>Year<th>Month<th>Bill<th>Pay<th>Status</tr>\n"
for row in rows:
html += ("<tr>"
+ "<td align=right>%d"
+ "<td align=right>%d"
+ "<td align=right>%d"
+ "<td align=right>%.2f"
+ "<td align=right>%.2f"
+ "<td align=right>%d\n") % row
return html + "</table>\n"
conn.close()
app.run(host='localhost', port=8090, debug=True, use_reloader=True)
We run our billing application on the local host while also running the credit app, so we need to use a different port number.
Now we can define some tests:
*** Settings *** Library SeleniumLibrary *** Variables *** ${LOGIN URL} http://localhost:8090/ ${BROWSER} Firefox *** Test Cases *** Valid Home Page Open Browser ${LOGIN URL} Firefox Page Should Contain Billing App Valid Form Click Link //a[@href="getbill"] Page Should Contain Get Billing Data Valid Listing Input Text //input[@name="cid"] 1 Click Element //input[@type="submit"] Page Should Contain Billing Data Element Should Contain //table//tr/td[1] 1 Element Should Contain //table//tr/td[2] 2005 Element Should Contain //table//tr/td[3] 4 # XPath: td after another td with given content Element Should Contain //table//tr/td[preceding::td[text()='3102.00']] 689.00 [Teardown] Close Browser
Import Credit Card Dataset into Flask Application¶
We generate customer records and corresponding records for billing and payment from the credit card default dataset.
!wget -nc https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls
File ‘default of credit card clients.xls’ already there; not retrieving.
import pandas as pd
import numpy as np
df = pd.read_excel('default of credit card clients.xls', skiprows=(1))
df = df.rename(columns={'ID':'CID', 'default payment next month':'DEF', 'PAY_0':'PAY_1'})
df.head()
CID | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_1 | PAY_2 | PAY_3 | PAY_4 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | DEF | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 20000 | 2 | 2 | 1 | 24 | 2 | 2 | -1 | -1 | ... | 0 | 0 | 0 | 0 | 689 | 0 | 0 | 0 | 0 | 1 |
1 | 2 | 120000 | 2 | 2 | 2 | 26 | -1 | 2 | 0 | 0 | ... | 3272 | 3455 | 3261 | 0 | 1000 | 1000 | 1000 | 0 | 2000 | 1 |
2 | 3 | 90000 | 2 | 2 | 2 | 34 | 0 | 0 | 0 | 0 | ... | 14331 | 14948 | 15549 | 1518 | 1500 | 1000 | 1000 | 1000 | 5000 | 0 |
3 | 4 | 50000 | 2 | 2 | 1 | 37 | 0 | 0 | 0 | 0 | ... | 28314 | 28959 | 29547 | 2000 | 2019 | 1200 | 1100 | 1069 | 1000 | 0 |
4 | 5 | 50000 | 1 | 2 | 1 | 57 | -1 | 0 | -1 | 0 | ... | 20940 | 19146 | 19131 | 2000 | 36681 | 10000 | 9000 | 689 | 679 | 0 |
5 rows × 25 columns
Let's check the size of the dataset:
len(df)
30000
Bulk Import Records via Pandas/Sqlite API¶
We could now use our robot interface to insert billing, payment, and status records for each customer: however, since we are looking at many thousands of such records the robot interface is not feasible -- the performance is just nowhere near sufficient. Instead, we
- make a numpy array from the pandas dataframe
- break the records into monthly entries and write to a CSV file
- import the CSV file via Pandas into SQLite
This sounds complicated, but it's actually very fast, and in our Python code we have full control how we create the CSV files.
import sqlite3
data = np.asarray(df)
conn = sqlite3.connect("credit.db")
cur = conn.cursor()
cur.execute("drop table if exists client")
conn.commit()
fout = open("tmpdat.csv", "w")
fout.write("id,lim,sex,edu,mar,age\n")
for x in data:
fout.write("%d,%d,%d,%d,%d,%d\n" % tuple(x[:6]))
fout.close()
pd.read_csv("tmpdat.csv").to_sql("client", conn, index = False)
conn.commit()
conn = sqlite3.connect("billing.db")
cur = conn.cursor()
cur.execute('drop table if exists bildat')
conn.commit()
fout = open('tmpdat.csv', 'w')
fout.write('cid,year,month,bill,pay,stat\n')
for x in data:
for j in range(6):
cid = x[0]
year = 2005
month = 4 + j # start at April
stat = x[6 + j] # skip 6 columns
bill = x[12 + j]
pay = x[18 + j]
fout.write('%d,%d,%d,%d,%d,%d\n' % (cid, year, month, bill, pay, stat))
fout.close()
pd.read_csv('tmpdat.csv').to_sql('bildat', conn, index = False)
conn.commit()
Check DB contents: do we have 6 monthly entries per client?
for row in cur.execute('select * from bildat where cid = 2'):
print(row)
(2, 2005, 4, 2682, 0, -1) (2, 2005, 5, 1725, 1000, 2) (2, 2005, 6, 2682, 1000, 0) (2, 2005, 7, 3272, 1000, 0) (2, 2005, 8, 3455, 0, 0) (2, 2005, 9, 3261, 2000, 2)
Optional: Generate Complaint Cases¶
Here we have the option to create cases for complaints from the consumer complaints dataset and connect them to the credit card user records. When we generate data we are in control of the parameters, e.g. the correlation between types of complaints and pay/default in the client billing data.
We can then apply the text encoding methods from the previous part to the complaints and connect them to the billing data:
- train the pay/default forecast model to include the text data from complaints
- apply the trained model to specific cases
☆ This is left as an exercise for those who desire an additional challenge!
Apply Machine Learning Module to Cases¶
The service application interface uses RPA to query the customer data and integrate the results with the default estimation from the machine learning module.
Given a client ID we want to
- query the Credit App for the data of the client
- query the Billing App for the payment and status data
- apply our machine learning model to predict the default of this client
- show the results in a reasonably nice layout
Let's add the following to our mytools.py library:
def predict_default(cid, billdat):
np.set_printoptions(formatter={'all':lambda x: '%6.2f' % x})
cur = sqlite3.connect("credit.db").cursor()
credat = cur.execute("select lim, sex, edu, mar, age from client where id = "
+ cid).fetchone()
inp = np.asarray([[ float(x) for x in credat ] + [ float(x) for x in billdat ]])
clf, Xmax = pickle.load(open('ccd.pkl', 'rb'))
pred = clf.predict_proba(inp[:1] / Xmax)
fout = open('tmpdat.html', 'w')
fout.write('<p><b>Client ' + cid + '</b>\n'
+ '<p> Client Data: ' + str(credat) + '\n'
+ '<p> Billing Data: ' + str(billdat) + '\n'
+ '<p> Prediction Pay/Default: %f %f\n' % tuple(pred[0]))
fout.close()
Now we are ready to use the Robot Framework for accessing the Billing data of a given client.
To make this flexible with the client ID as a parameter we could e.g.
- dynamically generate Robot code files from a Python script (easy)
- use this Robot Framework code in a larger test file as a module defined in the Keywords section (tricky)
Accessing the BIlling App Data¶
In this case the layout of the billing data is well-known and fixed so we can use indexing in the HTML table.
The FOR IN loop makes the code somewhat more elegant. Do NOT write 6 x 3 almost identical statements using Get WebElement.
Note the multiple spaces between each word in the FOR line; if we only leave one blank between FOR and ${i} this would count as single word in the Robot Framework syntax and produce an error. This really takes some getting used to.
Are you wondering about the nested loop? Compare the format of the training data file with the layout of the billing data web site:
- in the training data we have pay (status) followed by bill amt and pay amt
- in the billing web site we have bill, pay, status
Another typical case of a practical situation: the data does not really need to be processed, just re-arranged.
☆ Be very careful about these situations: errors do not necessarily show immediately.
Append To List lets us create a list of values. Note the somewhat strange initialisation in the Variables section: no value for empty list.
*** Settings *** Library SeleniumLibrary Library Collections Library mytools.py *** Variables *** ${URL} http://localhost:8090/ ${BROWSER} Firefox @{l}= ${cid}= 100 *** Test Cases *** Predict Client Default Open Browser ${URL} Firefox Click Link //a[@href="getbill"] Input Text //input[@name="cid"] ${cid} Click Element //input[@type="submit"] FOR ${j} IN 6 4 5 FOR ${i} IN 2 3 4 5 6 7 ${s}= Get WebElement //table//tr[${i}]/td[${j}] Append To List ${l} ${s.text} END END Predict Default ${cid} ${l} [Teardown] Close Browser
The HTML file contains the result of the prediction by the machine learning module.
from IPython.display import HTML
HTML(filename="tmpdat.html")
Client 100
Client Data: (20000, 1, 2, 1, 38)
Billing Data: ['0', '0', '0', '0', '0', '-1', '17973.00', '19367.00', '19559.00', '18240.00', '17928.00', '150.00', '1699.00', '1460.00', '626.00', '1750.00', '150.00', '0.00']
Prediction Pay/Default: 0.469680 0.530320
☆ We now have robot code to perform the pay/default forecast which we run from the command line, but we could connect this feature to the rest of the application in various ways, such as
- provide a link for the result of the analysis, e.g. add a forecast link in the client listing
- start the robot when the link is clicked, e.g. using os.system() when running on localhost
- include the forecast results directly in the response to the forecast link by reading and returning the content from tmpdat.html
- make the solution multi-user (e.g. filenames based on time, client id, ..)
Your Own Project¶
Now it's your turn to create an inspiring application of robot process automation using machine learning!
Set yourself a challenge!