Data resides as files on disks, SSDs, USB sticks, SD cards, and other types of storage. Fortunately, the type of storage does not matter to us: all we need is the filename to read the data.
The simplest case is a text file in the current directory:
for line in open(fn):        # fn holds the name of the file
    print(line, end='')      # each line already ends with a newline
Also useful:
A file is just a sequence of bytes. The application determines the meaning of those bytes.
When we work with text files we need a mapping from bytes to characters: an encoding.
The number of bytes for each character depends on the encoding:
UTF-8 has become the standard and most common encoding. However, many other encodings are still in use.
Note that for the ASCII characters all three encodings are identical; therefore, files containing only ASCII characters rarely cause encoding-related problems.
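To see how the number of bytes per character depends on the encoding, one can encode a few characters and compare the lengths. The sketch below assumes ASCII, Latin-1 (ISO 8859-1) and UTF-8 as the three encodings; that choice, and the helper name encoded_length, are assumptions made for illustration.

```python
# Compare how many bytes a character needs in different encodings.
# Assumption: ASCII, Latin-1 and UTF-8 are picked here for illustration.
def encoded_length(ch, encoding):
    """Return the number of bytes for ch, or None if ch cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

# 'A' is ASCII, '\u00e9' is e with acute accent, '\u20ac' is the euro sign
for ch in ['A', '\u00e9', '\u20ac']:
    sizes = {enc: encoded_length(ch, enc) for enc in ('ascii', 'latin-1', 'utf-8')}
    print(repr(ch), sizes)
```

The ASCII letter needs one byte in all three encodings, while the other two characters need a different number of bytes or cannot be encoded at all, depending on the encoding.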
ASCII also contains a number of control characters that have no visual representation for display or printing, e.g., EOT (end of transmission) and BEL (bell or beep). Ideally, a data file is plain text, i.e., it contains only printable ASCII characters and whitespace (blank, tab, newline). Some useful Linux command line tools:
The following Python code illustrates some approaches for counting bytes in a file:
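A minimal sketch of such approaches, assuming a small hypothetical file example.txt that the code writes itself. The function names count_bytes and byte_histogram are illustrative and not taken from the original listing.

```python
import os
from collections import Counter

def count_bytes(filename):
    """Count the bytes in a file in two ways and check that they agree."""
    size_from_os = os.path.getsize(filename)   # ask the file system
    with open(filename, 'rb') as f:            # 'rb': raw bytes, no decoding
        content = f.read()
    assert size_from_os == len(content)
    return len(content)

def byte_histogram(filename):
    """Return a Counter mapping each byte value (0..255) to its frequency."""
    with open(filename, 'rb') as f:
        return Counter(f.read())               # iterating bytes yields ints

# Hypothetical demo file: 5 characters, one of them outside ASCII
with open('example.txt', 'wb') as f:
    f.write('caf\u00e9\n'.encode('utf-8'))

print(count_bytes('example.txt'))              # 6 bytes for 5 characters
print(byte_histogram('example.txt'))
```

The histogram is a quick way to spot control characters or bytes above 127, which indicate that a file is not plain ASCII text.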
Example:
grep - search for string in file and print lines where found:
...
for line in open(filename):
    if string in line:
        print(line.rstrip())
rstrip() removes whitespace at the end of the line
Many files are in some sort of binary format that cannot easily be read with anything except the software for which this format was developed. Proprietary software often uses this approach to force users into using only the products of that particular company.
Fortunately, many people now understand the folly of that approach, and increasingly data files are in a format that uses text to represent numbers in their printed and human-readable form.
Among the various approaches, simple files with one line per observation are particularly easy to read: values in columns
Among the latter, the CSV (comma-separated values) file format is a simple de facto standard; however, it is actually more of a guideline. A CSV file
Fisher's Iris data from 1936 is perhaps the best-known pattern recognition dataset. It is well suited for study because it is small but still interesting enough for various statistical and machine learning methods.
Download the iris data file from the UCI website:
https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Put the file into your current directory. Use your favorite editor and examine the content:
This is just a little tricky. We could start writing our own solution:
filename = 'iris.data'  # or sys.argv[1]
data = []
for line in open(filename):
    line = line.strip()
    if len(line) == 0:          # skip empty lines
        continue
    row = []
    for field in line.split(','):
        try:
            row.append(float(field))    # numeric value
        except ValueError:
            row.append(field)           # not a number: keep the string
    data.append(row)
print(data[:10])
Idea: Try to convert string to float, if that fails append string
Variation: select (numeric!) columns to import from CSV file
import sys
cols = [int(s) for s in sys.argv[2:]]   # column indices from the command line
# inside the loop over the lines of the file:
splt = line.strip().split(',')
row = [float(splt[i]) for i in cols]
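Put together, the variation might look like the following sketch. The function name read_columns and its parameters are hypothetical, and the sample lines merely follow the layout of iris.data so that the code runs without the downloaded file.

```python
def read_columns(lines, cols, delimiter=','):
    """Return a list of rows holding only the selected numeric columns.

    lines: any iterable of text lines, e.g. an open file
    cols:  list of zero-based column indices to keep
    """
    data = []
    for line in lines:
        line = line.strip()
        if not line:                    # skip empty lines
            continue
        fields = line.split(delimiter)
        data.append([float(fields[i]) for i in cols])
    return data

# Sample lines in the layout of iris.data, used here instead of the real file
sample = ['5.1,3.5,1.4,0.2,Iris-setosa\n', '\n', '4.9,3.0,1.4,0.2,Iris-setosa\n']
print(read_columns(sample, [0, 2]))
```

With the real file, the call would be read_columns(open('iris.data'), cols). Selecting the last column would raise a ValueError, because the species name cannot be converted to a float.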
Complicated..
Let us not reinvent the wheel!
Use numpy.genfromtxt():
The last column contains the name of the species as a string. For the moment, let us simply ignore it.
import numpy as np
fn = 'iris.data'
data = np.genfromtxt(fn, delimiter=',', usecols=(0,1,2,3))
print(data.shape)
print(data[:10,])
A nice plot can make us see patterns in the data. It is usually a good idea to plot data before doing any other type of analysis.
If not already installed:
pip3 install matplotlib --user
If you are using a conda installation of Python, use the conda command to install the package. It is generally not a good idea to mix pip and conda; this can result in strange and undesirable situations.
The plot() command in the matplotlib package has lots of options. Study them at your leisure, e.g. at
https://matplotlib.org/3.2.1/api/_as_gen/matplotlib.pyplot.plot.html
Having read in our Iris data we can do a simple plot on the first two columns.
# read iris data if not already done so
import matplotlib.pyplot as plt
x = data[:50,0]
y = data[:50,1]
plt.plot(x, y, 'o', c='blue')
plt.show()
If you studied the data using your editor you probably realized that there are three species of iris flowers, each with 50 observations.
Because of this we can plot the individual species with different colors:
plt.plot(data[:50,0], data[:50,1], 'o', c='blue')
plt.plot(data[50:100,0], data[50:100,1], 'o', c='red')
plt.plot(data[100:,0], data[100:,1], 'o', c='green')
plt.show()
Exercises:
The statements above are not very elegant. Can you come up with a better solution, probably involving a loop?
Find the docs for matplotlib and add some nice features to the plots, such as labels and titles.
Can you put several plots into one figure, e.g. x/y from first/second col and next to that first/third col?
Having managed that, it should be easy to put all combinations of x and y into a single (big) figure.
Find other data files from the UCI website and plot them in a nice way.