Programming Project: Clustering Iris

Clustering is a suitable method for exploring structure in a dataset. The Iris dataset serves as a splendid example with its four variables of petal and sepal length and width. In this dataset we already know that each observation belongs to one of three classes. However, by simply ignoring the last column containing the actual species we can simulate exploration.

In [44]:
import numpy as np
import matplotlib.pyplot as plt

data = np.genfromtxt('iris.data', usecols=(0,1,2,3), delimiter=',')
print(data.shape)
print(data[:3,])
(150, 4)
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]]

Separating the first two classes

After reading in the data we concentrate our efforts on the first 100 observations. We also use only the first two columns of the data since this makes it easy to draw everything.

In [45]:
X = data[:100,:2]

The first two species of Iris flowers are easily separated. In fact, they are linearly separable, i.e. we can draw a straight line that completely separates the two classes.

In [46]:
plt.plot(X[:50,0], X[:50,1], 'o', c='b')
plt.plot(X[50:100,0], X[50:100,1], 'o', c='r')
plt.plot([4.5, 6.0], [2.1, 3.9], linewidth=3)
plt.show()

To simulate exploration we ignore our knowledge of the dataset and plot all observations in the same color; we know that there are two clusters, but we want to see how we can discover them.

In [47]:
plt.plot(X[:,0], X[:,1], 'o')
plt.show()

The plot above simulates a situation where we do not yet know proper clusters for the data. However, if we have a hunch that there are two clusters, we can use the KMeans clustering method to test our idea.

The KMeans clustering method is simple:

  1. start with random centers for the number of clusters specified
  2. assign each data point to the closest center
  3. for each cluster compute the new center as the average of the points assigned to that cluster
  4. repeat from step 2 until the cluster centers no longer change, or a maximum number of steps is reached
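
Before turning to the library version, here is a minimal sketch of these four steps in plain NumPy. It is only an illustration, not the implementation used by scikit-learn below; the function name and parameters are made up for this example, and a real implementation would also deal with clusters that end up empty.

import numpy as np

def kmeans_sketch(X, nc=2, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick random observations as the initial centers
    centers = X[rng.choice(len(X), size=nc, replace=False)]
    for _ in range(max_iter):
        # step 2: assign each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recompute each center as the mean of its assigned points
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(nc)])
        # step 4: stop as soon as the centers no longer change
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels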

To run the code below you may need to install the scikit-learn package (imported as sklearn): on the command line enter

pip3 install scikit-learn --user

We will use the following code again, so let's define a function. We use a default value for the number of clusters, i.e. when we do not specify this parameter the default value of 2 will be used.

We also print the cluster centers and the labels for the observations, even though these values are available in the result that is passed back in the return statement. However, when calling the function we want to see these values anyway, and printing them here will save us a few print statements later.

In [48]:
from sklearn.cluster import KMeans

def clust(dat, nc=2):
    # fit KMeans with a fixed random state so the result is reproducible
    km = KMeans(n_clusters=nc, random_state=0).fit(dat)
    print(km.cluster_centers_)
    l = km.labels_
    # print the first and the last len/nc labels so we can compare the blocks
    print(l[:int(len(l)/nc)])
    print(l[-int(len(l)/nc):])
    return km
    
km = clust(X)
[[6.02666667 2.79111111]
 [5.01636364 3.34181818]]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1]
[0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 1 0 0 0 0 1 0]
  • We create a KMeans object by specifying the number of clusters.
  • We also set the random state, otherwise the result can change on each run, which would be confusing here.
  • The fit() function takes the Iris data and performs the steps listed above. Typically the algorithm converges quickly, even for large datasets.
  • We print the final list of cluster centers.
  • The labels show the point assignments. The values correspond to the order of the centers in the list.
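
The fitted KMeans object exposes a few more attributes that can be useful for a quick look at how the fit went; a small sketch, using the km returned by clust(X) above:

print(km.n_iter_)    # number of iterations until the centers stopped changing
print(km.inertia_)   # sum of squared distances of the points to their closest center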

The labels show two things:

  • the first half of the data points have been assigned to the second cluster, and most of the remaining points to the first. This order can change with a different initial random state.
  • a few points break the regularity; from our knowledge of this dataset we can already assume that they have been assigned to the wrong cluster considering their actual classes -- which are unknown to this KMeans procedure!

The last point is crucial: we did not pass the actual class data to the KMeans procedure (and the order of the observations is irrelevant to KMeans), therefore we should not be surprised that the result is not perfect from our point of view.
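
Since we do know the actual classes here, we can quickly count how many of the last 50 observations received label 1, the label that otherwise marks the first species. A small check, using the km returned above:

print(np.sum(km.labels_[50:] == 1))   # 5 in the run shown above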

Let us print the cluster centers together with the observations, using the labels for colors:

In [49]:
# color each observation by its cluster label: 0 -> red, 1 -> blue
for i in range(len(X)):
    plt.plot(X[i,0], X[i,1], 'o', color='rb'[km.labels_[i]])
# mark the two cluster centers with stars
c = km.cluster_centers_
plt.plot(c[:,0], c[:,1], '*')
plt.show()

Compare with the data plotted using our knowledge of the actual classes i.e. species of Iris flowers:

In [50]:
plt.plot(X[:50,0], X[:50,1], 'o', c='b')
plt.plot(X[50:,0], X[50:,1], 'o', c='r')
plt.plot(c[:,0], c[:,1], '*')
plt.show()

It is fairly obvious that the five points assigned to the 'wrong' cluster are actually closer to the blue center than to the red.
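
We can verify this by computing the distance of one of those points to both centers. A short sketch, using km and X from above and picking index 57, one of the observations that received label 1 in the run shown:

p = X[57]
d = np.linalg.norm(km.cluster_centers_ - p, axis=1)
print(d, d.argmin())   # the smaller distance tells us which center the point was assigned to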

Using more than two dimensions

So far we used the first two columns of our Iris data. This has the advantage of simple plotting: with just two columns we can directly interpret them as dimensions for points in a figure.

However, there is no reason for such a limitation in the KMeans algorithm. We can easily use all four columns of our Iris data:

In [51]:
km = clust(data[:100,])
[[5.936 2.77  4.26  1.326]
 [5.006 3.418 1.464 0.244]]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]

Everything in the KMeans algorithm works in exactly the same way. Naturally, the cluster centers now have 4 dimensions. Note that the values in the first two columns are similar but not identical to the previous results.

For the plotting we could choose any combination of columns, but just by looking at the labels we can see that a plot is hardly necessary: the separation is now perfect. With the additional information provided by the remaining columns, the KMeans procedure has divided the observations flawlessly from our point of view, thereby providing a means to identify Iris flowers automatically and reliably using only the measurements of their petals and sepals. Sadly, this only works for the first two species in the dataset; on the other hand, that is one of the reasons why this dataset is such a timeless classic in statistics and machine learning.
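
A quick way to confirm the perfect separation without a plot is to check that the labels are constant within each half and differ between the halves; again just a small check with the km returned above:

l = km.labels_
print(np.all(l[:50] == l[0]), np.all(l[50:] == l[50]), l[0] != l[50])   # all three should be True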

Separating the second and third classes

When we look at the second and third species in the dataset we find that there is an obvious overlap in the data points:

In [52]:
plt.plot(data[50:100,0], data[50:100,1], 'o', c='r')
plt.plot(data[100:,0], data[100:,1], 'o', c='g')
plt.show()

It should come as no surprise that our clustering function does not perform flawlessly on this part of the dataset:

In [53]:
km = clust(data[50:,])
[[5.9016129  2.7483871  4.39354839 1.43387097]
 [6.85       3.07368421 5.74210526 2.07105263]]
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0]
[1 0 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 0 1 1 1 1 1 0 1 1 1
 1 0 1 1 1 0 1 1 1 0 1 1 0]
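
To quantify the mix-up we can tally the labels in each half of the result; a quick check with the km just returned:

print(np.bincount(km.labels_[:50]))   # second species: mostly label 0
print(np.bincount(km.labels_[50:]))   # third species: mostly label 1, with quite a few 0s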

Using all observations and columns

Now we can tackle the full dataset of Iris flowers. There are three classes:

In [54]:
plt.plot(data[:50,0], data[:50,1], 'o', c='b')
plt.plot(data[50:100,0], data[50:100,1], 'o', c='r')
plt.plot(data[100:,0], data[100:,1], 'o', c='g')
plt.show()

This time we need to provide the KMeans procedure with the number of clusters explicitly, since the default value in our function is 2.

In [55]:
km = clust(data, nc=3)
[[6.85       3.07368421 5.74210526 2.07105263]
 [5.006      3.418      1.464      0.244     ]
 [5.9016129  2.7483871  4.39354839 1.43387097]]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1]
[0 2 0 0 0 0 2 0 0 0 0 0 0 2 2 0 0 0 0 2 0 2 0 2 0 0 2 2 0 0 0 0 0 2 0 0 0
 0 2 0 0 0 2 0 0 0 2 0 0 2]

From looking at the labels we can see that one cluster (label 1) cleanly identifies the first species of Iris, while the other two clusters do not cleanly separate the second and third species.

Note that the first 50 labels correspond to the first 50 observations in the dataset, i.e. the first species of Iris, while the last 50 labels correspond to the last 50 observations, i.e. the third species.

The label values of 0, 1, and 2 correspond to the cluster centers in the result.
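
To see how the three cluster labels line up with the three species, we can tally the labels within each block of 50 observations; a short sketch using the km from the cell above:

for i in range(3):
    print(np.bincount(km.labels_[i*50:(i+1)*50], minlength=3))   # label counts for species i+1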

Predicting

While clustering alone provides lots of interesting applications, the KMeans algorithm can also be used for prediction. For this purpose we split our data into training and testing sets: if we treat KMeans as a machine learning method, we do not want to evaluate it on data that it has already seen during training, but on new observations.

We know that there are 50 observations for each Iris species. We take the first 40 of each for training, and the last 10 of each species for testing.

Indexing with multiple ranges is a little tricky in Python, but NumPy has a handy little helper for that.
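
The helper is np.r_, which concatenates slices into a single index array; for example:

print(np.r_[0:3, 10:12])   # [ 0  1  2 10 11]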

In [56]:
tr = data[np.r_[0:40,50:90,100:140]]
te = data[np.r_[40:50,90:100,140:150]]

This must leave us with a total of 40 * 3 = 120 observations for training, and 30 for testing:

In [57]:
print(tr.shape, te.shape)
(120, 4) (30, 4)

Now we can run KMeans on the training data only:

In [58]:
km = clust(tr, nc=3)
[[6.89677419 3.05483871 5.80967742 2.01935484]
 [5.0375     3.44       1.4625     0.2325    ]
 [5.94897959 2.75306122 4.42653061 1.44897959]]
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1]
[0 2 0 0 0 0 2 0 0 0 0 0 0 2 2 0 0 0 0 2 0 2 0 2 0 0 2 2 0 0 0 0 0 2 0 0 0
 0 2 0]

For the prediction we run several statements to make the output more readable:

In [59]:
print(km.predict(te[:10]))
print(km.predict(te[10:20]))
print(km.predict(te[20:]))
[1 1 1 1 1 1 1 1 1 1]
[2 2 2 2 2 2 2 2 2 2]
[0 0 2 0 0 0 2 0 0 2]

The first and second classes are identified perfectly, but the third is not.
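
As a rough summary we can compare the predictions for the whole test set with the species we expect, using the label-to-species correspondence seen above (1 for the first, 2 for the second, and 0 for the third species in this run):

expected = np.repeat([1, 2, 0], 10)
print(np.mean(km.predict(te) == expected))   # 0.9 in the run shown above, i.e. 27 of 30 test observations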