Generating Reports

Automatic generation of reports is a standard task in programming. Here we will look at some very simple methods of creating documents from analysis results; first, we need to choose a format:

  • PDF
  • RTF
  • HTML
  • LaTeX

Creating PDF documents is of course possible in Python, but somewhat difficult without a package that hides the details. Let us first look at some simpler formats to see what is actually happening.

Here is a small RTF document with a few basic formatting commands:

{\rtf1
{\fonttbl {\f0 Times New Roman;}}
\f0\fs40\par Header Line
\f0\fs25\par New par using plain, \b bold \b0, and \i italic \i0 text.
\par\par Empty par, end of text.
} 
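To try this out, the sample can be written to a file from Python and opened in a word processor. This is only a minimal sketch: the file name sample.rtf is an arbitrary choice, and the header is written as \rtf1, the form with the version number that RTF readers expect. A raw string is used so that the backslashes reach the file unchanged.

```python
# Raw string literal: backslashes are kept exactly as typed.
rtf_sample = r"""{\rtf1
{\fonttbl {\f0 Times New Roman;}}
\f0\fs40\par Header Line
\f0\fs25\par New par using plain, \b bold \b0, and \i italic \i0 text.
\par\par Empty par, end of text.
}
"""

# 'sample.rtf' is a hypothetical file name, created in the current directory.
with open('sample.rtf', 'w') as f:
    f.write(rtf_sample)
```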

Let us create a simple report for a correlation in the UCI Wine dataset:

http://archive.ics.uci.edu/ml/datasets/Wine

The description of this dataset has to be read carefully: the wine.names file lists only 13 attributes, but the data file has 14 columns, because the first column holds the class, one of three types of wine.

  • Since we are dealing with a pandas DataFrame, selecting rows by position needs the iloc notation.
  • We can assign several values at once (as in xname, yname = ...), a very convenient Python feature.
  • By using the names of the columns both for the data selection and for the reporting, we avoid inconsistencies when the code is changed later.
In [58]:
import pandas as pd
df = pd.read_csv('wine.data', delimiter=',', 
                 names=['class','Alcohol','Malic_acid','Ash','Alcalinity_of_ash',
                        'Magnesium','Total_phenols','Flavanoids','Nonflavanoid_phenols',
                        'Proanthocyanins','Color_intensity','Hue',
                        'OD280_OD315_of_diluted_wines','Proline'])         
print(df.iloc[:3,])

from scipy.stats import pearsonr

xname, yname = 'Ash', 'Hue'
r, p = pearsonr(df[xname], df[yname])

print('\nCorrelation ' + xname + '/' + yname + ': ' + str(r) + ' p: ' + str(p))
   class  Alcohol  Malic_acid   Ash  Alcalinity_of_ash  Magnesium  \
0      1    14.23        1.71  2.43               15.6        127   
1      1    13.20        1.78  2.14               11.2        100   
2      1    13.16        2.36  2.67               18.6        101   

   Total_phenols  Flavanoids  Nonflavanoid_phenols  Proanthocyanins  \
0           2.80        3.06                  0.28             2.29   
1           2.65        2.76                  0.26             1.28   
2           2.80        3.24                  0.30             2.81   

   Color_intensity   Hue  OD280_OD315_of_diluted_wines  Proline  
0             5.64  1.04                          3.92     1065  
1             4.38  1.05                          3.40     1050  
2             5.68  1.03                          3.17     1185  

Correlation Ash/Hue: -0.07466688903277302 p: 0.3219074697966556

We can see that this correlation is very weak and not significant. These two statements can easily be automated:

In [59]:
if abs(r) < 0.5: print('The correlation is weak.')
if p > 0.01: print('The correlation is not significant at alpha = 0.01.')
The correlation is weak.
The correlation is not significant at alpha = 0.01.

Let us put this into a function so we can use it again later.

To make this component reusable we split the result into several parts, so we can then add the formatting as required.

We also add the round() function to make the output more readable.

In [60]:
def cortxt(df, xname, yname, alpha):
    r, p = pearsonr(df[xname], df[yname])
    direc = 'negative' if r < 0 else 'positive'
    stren = 'weak' if abs(r) < 0.5 else 'strong'
    signi = 'significant' if p < alpha else 'non-significant'
    return round(r, 3), round(p, 3), signi + ' ' + stren + ' ' + direc 
    
print(cortxt(df, 'Ash', 'Hue', 0.01))
(-0.075, 0.322, 'non-significant weak negative')
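The strength label should depend only on the magnitude of r, so a strongly negative correlation must also come out as 'strong'. The following sketch isolates that rule in a hypothetical helper, strength_label, which is not part of the code above; the threshold of 0.5 is the one used in this section. The point to check is that abs() wraps r itself and not the whole comparison.

```python
def strength_label(r, threshold=0.5):
    """Classify a correlation coefficient by its magnitude only."""
    # abs() is applied to r, so the sign of the correlation does not matter
    return 'weak' if abs(r) < threshold else 'strong'

# A few hand-picked values, including a strongly negative one
for r in (-0.9, -0.075, 0.3, 0.7):
    print(r, strength_label(r))
```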

What else can be automated? Basically everything, starting only with the name of the data file. However, then we would need to first find the indices and names of the columns.

Given a fixed format like that of the UCI website, all this could be done (exercise!). For now, let us just automate the task of returning the correlation analysis for two columns in a given file; we have to pass the complete list of column names as a single string, separated by commas.

Since the delimiter and the significance level will often stay at their usual values, we define them as optional parameters with default values.

In [61]:
def repcor(fn, colnames, xname, yname, delim=',', alpha=0.01):
    import pandas as pd
    df = pd.read_csv(fn, delimiter=delim, names=colnames.split(','))
    from scipy.stats import pearsonr
    r, p, t = cortxt(df, xname, yname, alpha)
    return r, p, t, alpha

fn = 'wine.data'
cols = 'cl,alco,acid,ash,alcal,magn,tphen,flav,nflav,pac,col,hue,od,pl'
repcor(fn, cols, 'ash', 'hue')
Out[61]:
(-0.075, 0.322, 'non-significant weak negative', 0.01)

Now we can create output components depending on the required format. To make a very simple RTF report we can use the following code.

The output file output.rtf will be created in the current directory. When run from a Jupyter notebook the current directory is the directory of this notebook.
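If it is unclear where the file will end up, the full path can be printed before anything is written. This is a small sketch using pathlib from the standard library; the variable names are only illustrative.

```python
from pathlib import Path

outfn = 'output.rtf'            # default output name used in this section
# A relative file name is resolved against the current working directory.
target = Path.cwd() / outfn
print(target)
```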

In [62]:
def rtfcor(fn, cols, xn, yn, outfn='output.rtf'):
    r, p, t, a = repcor(fn, cols, xn, yn)
    # the with statement closes the file even if one of the writes fails
    with open(outfn, 'w') as f:
        f.write('{\\rtf1{\\fonttbl {\\f0 Times New Roman;}}\\f0\\fs40\\par Correlation Report\n')
        f.write('\\f0\\fs25\\par\\par File: ' + fn + '\n')
        f.write('\\par Variables: ' + xn + ', ' + yn + '\n')
        f.write('\\par\\par r: \\b ' + str(r) + '\\b0' + '\n')
        f.write('\\par p: ' + str(p) + '\n')
        f.write('\\par\\par There is a ' + t + ' correlation at alpha = ' + str(a) + '}\n')
    
rtfcor(fn, cols, 'ash', 'hue')
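One caveat: backslashes and braces have a special meaning in RTF, so a file name such as a Windows path would be misread as formatting commands if it is written into the report unchanged. A minimal sketch of a helper that escapes these three characters follows; rtf_escape is a hypothetical name, and non-ASCII characters are not handled here.

```python
def rtf_escape(text):
    """Escape the characters that have a special meaning in RTF."""
    # The backslash is replaced first; otherwise the backslashes added
    # for the braces would themselves be escaped a second time.
    return (text.replace('\\', '\\\\')
                .replace('{', '\\{')
                .replace('}', '\\}'))

print(rtf_escape(r'C:\data\wine.data'))
```

Inside a report function, values like the file name could be passed through such a helper before they are written.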

A similar version for HTML:

In [63]:
def htmlcor(fn, cols, xn, yn, outfn='output.html'):
    r, p, t, a = repcor(fn, cols, xn, yn)
    # the with statement closes the file even if one of the writes fails
    with open(outfn, 'w') as f:
        f.write('<html><body><h1>Correlation Report</h1>\n')
        f.write('<p> File: ' + fn + '</p>\n')
        f.write('<p> Variables: ' + xn + ', ' + yn + '</p>\n')
        f.write('<p> r: <b>' + str(r) + '</b>\n')
        f.write('<br/> p: ' + str(p) + '</p>\n')
        f.write('<p> There is a ' + t + ' correlation at alpha = ' + str(a) + '</p></body></html>\n')
    
htmlcor(fn, cols, 'ash', 'hue')
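File or column names may contain characters such as < or & that have a special meaning in HTML. The standard library function html.escape takes care of these; the sketch below wraps it in a hypothetical helper, html_par, that builds one paragraph of the report.

```python
import html

def html_par(label, value):
    """Return one <p> element with the value escaped for HTML."""
    # html.escape replaces &, < and > (and quotes) by HTML entities
    return '<p> ' + label + ': ' + html.escape(str(value)) + '</p>\n'

print(html_par('Variables', 'a<b, c&d'))
```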

EXERCISES:

Create functions for generating reports for the following:

  • KMeans clustering
  • Linear Model

Add

  • options for parameter selection
  • meaningful default values
  • more formats: LaTeX, PDF, ...

Provide documentation for your functions:

  • how are they to be used
  • what are the assumptions

In a notebook this is very easy: just add Markdown cells.

Test your code with various data files!