A Little More Advanced Python Concepts

Files

Data resides as files on disks, SSDs, USB sticks, SD cards, and other types of storage. Fortunately, the type of storage does not matter to us: all we need is the filename to read the data.

The simplest case is a text file in the current directory:

for line in open(fn):
   print(line)
  • the open() function opens a file, in this case for reading
  • the loop iterates over the lines from the file

Also useful:

  • open(fn).readlines() returns a list of the lines
  • open(fn).read() returns the contents of the file as a string
  • len() returns the length of a string or list
  • split() returns a list of the words in a string
    • without an argument it separates at whitespace, i.e. blank, tab, newline
    • split(delim) separates at the given delimiter, e.g. split(',')
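
The calls listed above can be tried on a small file. The following is a minimal sketch: the file name sample.txt and its content are made up for illustration, and the with statement (not used in the text above) is added here only so that the files are closed again.

```python
import os
import tempfile

# Create a small sample file so that the example is self-contained.
# The name "sample.txt" and its content are hypothetical.
with tempfile.TemporaryDirectory() as tmpdir:
    fn = os.path.join(tmpdir, "sample.txt")
    with open(fn, "w") as f:
        f.write("alpha beta\ngamma,delta\n")

    with open(fn) as f:
        lines = f.readlines()   # list of lines, each still ending in '\n'
    with open(fn) as f:
        text = f.read()         # whole file content as one string

n_lines = len(lines)                   # length of a list
n_chars = len(text)                    # length of a string
words = text.split()                   # separated at whitespace
fields = lines[1].strip().split(",")   # separated at the comma

print(n_lines, n_chars)
print(words)
print(fields)
```

Note that split() without an argument does not separate at the comma, so 'gamma,delta' stays one word.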

Encodings

A file is just a sequence of bytes. The application determines the meaning of those bytes.

When we work with text files we need a mapping from bytes to characters: an encoding.

The number of bytes for each character depends on the encoding:

  • In the ASCII code the characters in the English language and a number of other frequently used symbols are encoded in one byte.
  • In ISO-8859-1 (also known as Latin-1) the ASCII characters and many characters used in Western European languages are also encoded in one byte.
  • In UTF-8 the ASCII characters are encoded in one byte, and the other Unicode characters (well over 100,000) are encoded in two, three, or four bytes.

UTF-8 has become the standard and most common encoding. However, many other encodings are still in use.

Note that for the ASCII characters all three encodings are identical; therefore, files containing only ASCII characters rarely cause encoding-related problems.
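
The difference between the encodings can be checked directly with str.encode(). This is a small sketch; the sample word "Grüße" is chosen here for illustration and is written with escape sequences so that the code itself contains only ASCII.

```python
# "Gr\u00fc\u00dfe" is the German word with u-umlaut and sharp s:
# five characters, two of them outside ASCII.
text = "Gr\u00fc\u00dfe"

latin1_bytes = text.encode("latin-1")   # one byte per character
utf8_bytes = text.encode("utf-8")       # two bytes for each non-ASCII character here

print(len(text), "characters")
print(len(latin1_bytes), "bytes in Latin-1")
print(len(utf8_bytes), "bytes in UTF-8")

# For pure ASCII text all three encodings produce the same bytes.
ascii_text = "Hello"
same_bytes = (ascii_text.encode("ascii")
              == ascii_text.encode("latin-1")
              == ascii_text.encode("utf-8"))
print(same_bytes)
```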

ASCII also contains a number of control characters that have no visible representation on screen or in print, e.g. EOT (end of transmission) and BEL (bell or beep). Ideally, a data file is plain text, i.e. it contains only printable ASCII characters and whitespace (blank, tab, newline).

Some useful Linux command line tools:

  • ls -l myfile.txt: shows the number of bytes in a file (among other things).
  • file myfile.txt: determines the type of content and encoding.
  • wc myfile.txt: counts lines, words, and bytes.

The following Python code illustrates some approaches for counting bytes in a file:

  • len( open(filename).read() )
  • len(bytes(open(filename).read(), 'utf-8'))
  • len( open(filename, "rb").read() )

The three statements behave differently:

  • In the first statement the file is read and the bytes are interpreted in the default encoding, which is usually UTF-8, resulting in a string; the length of the string is returned. If the file contains only ASCII characters then that number is equal to the number of bytes. If the file contains characters such as German umlauts then the number of characters is smaller than the number of bytes, and the value is not the number of bytes in the file.
  • In the second statement the input process is reversed: the string is converted back to bytes using the UTF-8 encoding. The result is the number of bytes in the file only if the file is actually encoded in UTF-8.
  • In the third statement the file content is read in binary mode, i.e. no encodings are involved. The result is a bytes object whose length is returned. This is the best solution since it works with any file content.
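
The three approaches can be compared on a file with known content. In this sketch the file name umlauts.txt and its content are made up for illustration, the encoding is passed explicitly instead of relying on the default, and os.path.getsize() is added as a fourth way that asks the operating system for the size.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    filename = os.path.join(tmpdir, "umlauts.txt")   # hypothetical file name

    # Write known bytes: a word with u-umlaut and sharp s plus a newline,
    # encoded as UTF-8.
    with open(filename, "wb") as f:
        f.write("Gr\u00fc\u00dfe\n".encode("utf-8"))

    # 1. Decode to a string and count characters.
    with open(filename, encoding="utf-8") as f:
        n_chars = len(f.read())

    # 2. Decode, then encode again as UTF-8 and count bytes.
    with open(filename, encoding="utf-8") as f:
        n_utf8 = len(bytes(f.read(), "utf-8"))

    # 3. Read the raw bytes; no encoding is involved.
    with open(filename, "rb") as f:
        n_bytes = len(f.read())

    # 4. Ask the operating system, as ls -l does.
    n_os = os.path.getsize(filename)

print(n_chars, n_utf8, n_bytes, n_os)
```

Only the first value differs, because the two non-ASCII characters take two bytes each in UTF-8.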

Example:

grep - search for string in file and print lines where found:

  ...
  for line in open(filename):
    if string in line:
      print(line.rstrip())


rstrip() removes whitespace at the end of the line, including the newline character; without it print() would output an empty line after each match.
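
The search loop can also be wrapped in a function that returns the matching lines instead of printing them. This is only a sketch: the function name grep_lines, the file name notes.txt, and the sample content are assumptions made for this example.

```python
import os
import tempfile


def grep_lines(filename, string):
    """Return the lines of a text file that contain string, without trailing whitespace."""
    matches = []
    with open(filename) as f:
        for line in f:
            if string in line:
                matches.append(line.rstrip())
    return matches


# Try it on a small, made-up sample file.
with tempfile.TemporaryDirectory() as tmpdir:
    sample = os.path.join(tmpdir, "notes.txt")
    with open(sample, "w") as f:
        f.write("alpha beta\ngamma delta  \ndelta epsilon\n")
    matches = grep_lines(sample, "delta")

for m in matches:
    print(m)
```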

Reading text files that contain data

Many files are in some sort of binary format that cannot easily be read with anything except the software for which this format was developed. Proprietary software often uses this approach to force users into using only the products of that particular company.

Fortunately, many people now understand the folly of that approach, and increasingly data files are in a format that uses text to represent numbers in their printed and human-readable form.

Among the various approaches, simple files with one line per observation are particularly easy to read. The values in the columns are either

  • fixed-length, padded with spaces or tabs, or
  • separated by a delimiter character

Among the latter the CSV file format (comma-separated values) is a simple de-facto standard; however, it is actually more of a guideline. A CSV file

  • does not necessarily use comma as separator
  • may or may not have one or more header lines
  • may contain empty lines
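
Python's standard csv module can cope with these variations. The following sketch reads a made-up sample that uses a semicolon as separator, has one header line, and contains an empty line; the data and the column names are invented for illustration.

```python
import csv
import io

# Made-up sample: semicolon-separated, one header line, one empty line.
raw = "length;width;species\n5.1;3.5;setosa\n\n4.9;3.0;setosa\n"

reader = csv.reader(io.StringIO(raw), delimiter=";")
header = next(reader)                    # first line holds the column names
rows = [row for row in reader if row]    # an empty line yields an empty list

print(header)
print(rows)
```

All values arrive as strings; converting the numeric columns is a separate step.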

The Iris dataset

Fisher's Iris data from 1936 is the best-known pattern recognition dataset and provides excellent possibilities for study since it is small but still interesting enough for various statistical and machine learning methods.

Download the iris data file from the UCI website:

https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

Put the file into your current directory. Use your favorite editor and examine the content:

  • comma separated values
  • no header line
  • blank line at the end
  • last column not numeric

This is just a little tricky. We could start writing our own solution:

In [27]:
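filename = 'iris.data'  # or sys.argv[1]
data = []
for line in open(filename):
    line = line.strip()
    if len(line) == 0:
        continue  # skip empty lines, e.g. the one at the end of the file
    row = []
    for field in line.split(','):
        # keep numbers as float, everything else as string
        try:
            row.append(float(field))
        except ValueError:
            row.append(field)
    data.append(row)
print(data[:10])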
  
[[5.1, 3.5, 1.4, 0.2, 'Iris-setosa'], [4.9, 3.0, 1.4, 0.2, 'Iris-setosa'], [4.7, 3.2, 1.3, 0.2, 'Iris-setosa'], [4.6, 3.1, 1.5, 0.2, 'Iris-setosa'], [5.0, 3.6, 1.4, 0.2, 'Iris-setosa'], [5.4, 3.9, 1.7, 0.4, 'Iris-setosa'], [4.6, 3.4, 1.4, 0.3, 'Iris-setosa'], [5.0, 3.4, 1.5, 0.2, 'Iris-setosa'], [4.4, 2.9, 1.4, 0.2, 'Iris-setosa'], [4.9, 3.1, 1.5, 0.1, 'Iris-setosa']]

Idea: try to convert each string to float; if that fails, append the string itself.

Variation: select (numeric!) columns to import from CSV file

  • cols = [ int(s) for s in sys.argv[2:] ]

  • splt = line.strip().split(',')

  • row = [ float(splt[i]) for i in cols ]
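
Put together, the three statements could look like the sketch below. The function name read_columns is an assumption for this example, the column numbers are passed as a list instead of being taken from sys.argv, and two made-up lines in the Iris format stand in for the file.

```python
import io


def read_columns(lines, cols, delimiter=","):
    """Return the selected (numeric) columns of each non-empty line as floats."""
    data = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        splt = line.split(delimiter)
        data.append([float(splt[i]) for i in cols])
    return data


# Two sample lines in the Iris format, followed by an empty line.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "\n"
)
selected = read_columns(sample, [0, 2])
print(selected)
```

Selecting a non-numeric column such as column 4 raises a ValueError here, which is why the columns have to be chosen with care.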

This gets complicated quickly.

Let us not reinvent the wheel!

Use numpy.genfromtxt():

The last column contains the name of the species as a string. For the moment, let us simply ignore it.

In [28]:
import numpy as np

fn = 'iris.data'
data = np.genfromtxt(fn, delimiter=',', usecols=(0,1,2,3))
print(data.shape)
print(data[:10,])
(150, 4)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]]
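
If the species names are needed later, they can be read in a second call with dtype=str. This is a sketch under the assumption of a reasonably recent NumPy version; three made-up lines in the Iris format are read from a string so that the example runs without the data file.

```python
import io

import numpy as np

# Three sample lines in the Iris format, one per species.
sample = (
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# Numeric columns as a float array.
values = np.genfromtxt(io.StringIO(sample), delimiter=",", usecols=(0, 1, 2, 3))

# Last column as an array of strings.
species = np.genfromtxt(io.StringIO(sample), delimiter=",", usecols=4, dtype=str)

print(values.shape)
print(species)
```

With the real file, passing the file name instead of the io.StringIO object should work the same way.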

Toolkits

  • matplotlib: plotting
  • scikit-learn: various machine learning algorithms
  • statsmodels: statistical modelling

If not already installed:

  • Package manager provided by operating system, e.g. apt for Debian-based systems
    • needs root permissions
    • only provides packages in the OS distribution
  • pip3 for Python installation
    • no root permissions needed for user install (in home directory)
    • can install packages not available in OS distribution
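
Whether a package is already installed can be checked from Python itself. The helper name is_installed is made up for this sketch; note that scikit-learn is imported under the module name sklearn.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


for name in ("matplotlib", "sklearn", "statsmodels"):
    print(name, "is installed" if is_installed(name) else "is missing")
```

The check only looks the module up and does not import it, so it is quick even for large packages.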

Plotting

A nice plot can make us see patterns in the data. It is usually a good idea to plot data before doing any other type of analysis.

pip3 install matplotlib --user

If you are using a conda installation of Python, use the conda command to install packages. It is generally not a good idea to mix pip and conda; this can result in conflicting or broken package installations.

The plot() command in the matplotlib package has lots of options; study them at your leisure, e.g. at

https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html

Having read in our Iris data we can do a simple plot on the first two columns.

In [29]:
# read iris data if not already done so
import matplotlib.pyplot as plt

x = data[:50,0]
y = data[:50,1]
plt.plot(x, y, 'o', c='blue')
plt.show()
    

If you studied the data using your editor you probably realized that there are three species of iris flowers, each with 50 observations.

Because of this we can plot the individual species with different colors:

In [30]:
plt.plot(data[:50,0], data[:50,1], 'o', c='blue')
plt.plot(data[50:100,0], data[50:100,1], 'o', c='red')
plt.plot(data[100:,0], data[100:,1], 'o', c='green')
plt.show()

Exercises:

  • The statements above are not very elegant. Can you come up with a better solution, probably involving a loop?

  • Find the docs for matplotlib and add some nice features to the plots, such as labels and titles

  • Can you put several plots into one figure, e.g. x/y from first/second col and next to that first/third col?

  • Having managed that, it should be easy to put all combinations of x and y into a single (big) figure.

  • Find other data files from the UCI website and plot them in a nice way.