Automatic generation of reports is a standard task in programming. Here we will look at some very simple methods of creating documents from analysis results; first, we need to choose a format:
Creation of PDF documents is of course possible in Python, but somewhat difficult without using a package that hides the details. Let us first look into some simpler formats to see what's actually happening:
Here is a small RTF document with a few basic formatting commands:
{\rtf
{\fonttbl {\f0 Times New Roman;}}
\f0\fs40\par Header Line
\f0\fs25\par New par using plain, \b bold \b0, and \i italic \i0 text.
\par\par Empty par, end of text.
}
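To try the sample document out, it can be written to a file from Python and opened in a word processor. The following is a minimal sketch; the constant SAMPLE_RTF, the function write_sample_rtf and the file name sample.rtf are hypothetical names chosen for this illustration. A raw string is used so that the backslashes of the RTF control words do not have to be doubled.

```python
from pathlib import Path

# Raw triple-quoted string: backslashes are kept literally, so the RTF
# control words do not need to be written as '\\'.
SAMPLE_RTF = r"""{\rtf
{\fonttbl {\f0 Times New Roman;}}
\f0\fs40\par Header Line
\f0\fs25\par New par using plain, \b bold \b0, and \i italic \i0 text.
\par\par Empty par, end of text.
}
"""


def write_sample_rtf(path="sample.rtf"):
    """Write the sample document to `path` and return it as a Path object."""
    out = Path(path)
    out.write_text(SAMPLE_RTF, encoding="ascii")
    return out
```

Calling write_sample_rtf() creates sample.rtf in the current directory. How the file is rendered depends on the program used to open it.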
Let us create a simple report for a correlation in the UCI Wine dataset:
http://archive.ics.uci.edu/ml/datasets/Wine
This dataset is somewhat confusing, so the description has to be read carefully: the wine.names file lists only 13 attributes, but the description also tells us that the first column is the class, one of three types of wine. We therefore pass 14 column names when reading the file.
import pandas as pd
df = pd.read_csv('wine.data', delimiter=',',
                 names=['class','Alcohol','Malic_acid','Ash','Alcalinity_of_ash',
                        'Magnesium','Total_phenols','Flavanoids','Nonflavanoid_phenols',
                        'Proanthocyanins','Color_intensity','Hue',
                        'OD280_OD315_of_diluted_wines','Proline'])
print(df.iloc[:3,])
from scipy.stats import pearsonr
xname, yname = 'Ash', 'Hue'
r, p = pearsonr(df[xname], df[yname])
print('\nCorrelation ' + xname + '/' + yname + ': ' + str(r) + ' p: ' + str(p))
We can see that this correlation is very weak and not significant. These two statements can easily be automated:
if abs(r) < 0.5: print('The correlation is weak.')
if p > 0.01: print('The correlation is not significant at alpha = 0.01.')
Let us put this into a function so we can use it again later.
To make this component reusable we split the result into several parts, so that the formatting can be added later as required.
We also apply round() to make the output more readable.
def cortxt(df, xname, yname, alpha):
    r, p = pearsonr(df[xname], df[yname])
    direc = 'negative' if r < 0 else 'positive'
    # Compare the absolute value of r with the threshold, not abs() of the comparison.
    stren = 'weak' if abs(r) < 0.5 else 'strong'
    signi = 'significant' if p < alpha else 'non-significant'
    return round(r, 3), round(p, 3), signi + ' ' + stren + ' ' + direc
print(cortxt(df, 'Ash', 'Hue', 0.01))
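The labelling logic can be checked without loading any data by separating it from the computation of r and p. The following is only a sketch: the helper describe_correlation and its strong_at parameter are hypothetical names, and the (r, p) pairs are made-up values chosen to exercise each branch, not results from the wine data. Note that the strength test is written as abs(r) < strong_at, so a strongly negative r is also reported as strong.

```python
def describe_correlation(r, p, alpha=0.01, strong_at=0.5):
    """Return a phrase such as 'significant strong positive'."""
    direction = "negative" if r < 0 else "positive"
    strength = "weak" if abs(r) < strong_at else "strong"
    significance = "significant" if p < alpha else "non-significant"
    return f"{significance} {strength} {direction}"


# Made-up (r, p) pairs, used only to exercise the branches.
examples = [(-0.07, 0.32), (0.86, 0.0001), (-0.8, 0.001)]
for r, p in examples:
    print(r, p, describe_correlation(r, p))
```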
What else can be automated? Basically everything, starting only with the name of the data file. However, then we would need to first find the indices and names of the columns.
Given a fixed format like that of the UCI website, all this could be done (exercise!), but for now let us just automate the task of returning the correlation analysis for two columns in a given file; we have to pass the complete list of column names, separated by commas.
Since the delimiter and the significance level will often stay at their default values, we define them as optional parameters of the function.
def repcor(fn, colnames, xname, yname, delim=',', alpha=0.01):
    # pandas (as pd) and cortxt are available from the cells above.
    # Read the file passed in as fn instead of a hard-coded name;
    # colnames is a single comma-separated string.
    df = pd.read_csv(fn, delimiter=delim, names=colnames.split(','))
    r, p, t = cortxt(df, xname, yname, alpha)
    return r, p, t, alpha
fn = 'wine.data'
cols = 'cl,alco,acid,ash,alcal,magn,tphen,flav,nflav,pac,col,hue,od,pl'
repcor(fn, cols, 'ash', 'hue')
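The call above passes only the four required arguments, so delim and alpha keep their default values. How optional parameters behave can be seen in a toy sketch that needs no data file. The function describe_call is a hypothetical stand-in that only reports the settings it received, and the file name other.data is invented for the illustration.

```python
def describe_call(fn, delim=",", alpha=0.01):
    """Hypothetical stand-in for repcor: report the settings that were received."""
    return f"file={fn} delim={delim!r} alpha={alpha}"


# Defaults are used when the optional arguments are omitted.
print(describe_call("wine.data"))
# A keyword argument overrides one default and leaves the other untouched.
print(describe_call("wine.data", alpha=0.05))
print(describe_call("other.data", delim=";"))
```

Passing optional arguments by keyword, as in alpha=0.05, keeps the call readable and does not depend on the order of the defaults.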
Now we can create output components depending on the required format. To make a very simple RTF report we can use the following code.
The output file output.rtf will be created in the current directory. When run from a Jupyter notebook the current directory is the directory of this notebook.
def rtfcor(fn, cols, xn, yn, outfn='output.rtf'):
    r, p, t, a = repcor(fn, cols, xn, yn)
    # The with-block closes the file even if one of the writes fails.
    with open(outfn, 'w') as f:
        f.write('{\\rtf{\\fonttbl {\\f0 Times New Roman;}}\\f0\\fs40\\par Correlation Report\n')
        f.write('\\f0\\fs25\\par\\par File: ' + fn + '\n')
        f.write('\\par Variables: ' + xn + ', ' + yn + '\n')
        f.write('\\par\\par r: \\b ' + str(r) + '\\b0' + '\n')
        f.write('\\par p: ' + str(p) + '\n')
        f.write('\\par\\par There is a ' + t + ' correlation at alpha = ' + str(a) + '}\n')
rtfcor(fn, cols, 'ash', 'hue')
A similar version for HTML:
def htmlcor(fn, cols, xn, yn, outfn='output.html'):
    r, p, t, a = repcor(fn, cols, xn, yn)
    # The with-block closes the file even if one of the writes fails.
    with open(outfn, 'w') as f:
        f.write('<html><body><h1>Correlation Report</h1>\n')
        f.write('<p> File: ' + fn + '</p>\n')
        f.write('<p> Variables: ' + xn + ', ' + yn + '</p>\n')
        f.write('<p> r: <b>' + str(r) + '</b>\n')
        f.write('<br/> p: ' + str(p) + '</p>\n')
        f.write('<p> There is a ' + t + ' correlation at alpha = ' + str(a) + '</p></body></html>\n')
htmlcor(fn, cols, 'ash', 'hue')
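Both report functions write file and column names into the output unchanged. This works for the names used here, but a name containing characters that the format treats as syntax (backslash and braces in RTF; <, > and & in HTML) could break the document. The following sketch shows one way to guard against this. The helpers rtf_escape and html_paragraph are hypothetical names, html.escape comes from the Python standard library, and the sample strings are invented. Non-ASCII characters in RTF are not handled here.

```python
from html import escape


def rtf_escape(text):
    """Escape the three characters RTF treats as syntax: backslash, { and }."""
    # The backslash is replaced first, so the backslashes added for the
    # braces are not doubled again.
    return (text.replace("\\", "\\\\")
                .replace("{", "\\{")
                .replace("}", "\\}"))


def html_paragraph(label, value):
    """Build one <p> line with label and value escaped for HTML."""
    return f"<p> {escape(label)}: {escape(str(value))}</p>\n"


print(rtf_escape(r"C:\data\wine{1}.data"))
print(html_paragraph("Variables", "a<b, c&d"), end="")
```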
EXERCISES:
Create functions for generating reports for the following:
Add
Provide documentation for your functions:
In a notebook this is very easy to do by adding Markdown cells.
Test your code with various data files!