Using KNN on Iris Data

Posted on Wed 29 August 2018 in machine_learning

Preface

Today's blog focuses on the KNN (K-Nearest Neighbors) algorithm. As mentioned in the previous blog, I have been making an effort to become more comfortable with machine learning, which means getting a better grasp on the various ML algorithms. Funnily enough, after starting this ML journey and posting the previous blog, Humble Bundle released a machine learning book bundle. This specific blog is heavily influenced by one of the books from that bundle, Introduction to Machine Learning with Python.

Side Note

I have integrated Jupyter Notebooks into the blog. This was accomplished fairly simply using pelican-ipynb. The only downside is that I had to change the website theme: as much as I loved pelican-blue, it was conflicting with the Jupyter CSS, and after attempting to debug it myself for 30+ minutes I decided to just switch the blog theme to my second favorite, Flex.

Tools Used

Jupyter Notebook, NumPy, pandas, and scikit-learn.

Let's Begin

First off, we need to make our imports: NumPy for any math calculations and pandas for manipulating the data. We will pull in sklearn tools as needed throughout.

In [1]:
import numpy as np
import pandas as pd

Load the Data

Instead of using that messy boxing data from the previous blog, we'll use the classic Iris dataset that is included in sklearn's datasets.

Loading the iris dataset returns a Bunch object (similar to a dictionary).

In [2]:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

The main thing to keep in mind with the Bunch object is that it does not just give you the raw data. It also contains a bunch of information about it.

In [3]:
print("Keys of iris_dataset: \n{}".format(iris_dataset.keys()))
Keys of iris_dataset: 
dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names'])

You'll notice that it has a data key, which contains the values that will be used for learning. The rest of the keys are just labels and info for us humans to better understand the data.

In [4]:
print("Target names: {}".format(iris_dataset['target_names'])) # Contains the classes we wish to predict
Target names: ['setosa' 'versicolor' 'virginica']
In [5]:
print("Feature names: \n{}".format(iris_dataset['feature_names'])) # Contains descriptions of each feature
Feature names: 
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
In [6]:
print("Shape of data: {}".format(iris_dataset['data'].shape)) # data contains 150 flowers(samples), with 4 features each
Shape of data: (150, 4)
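
As a quick sanity check (not something from the book, just a peek at the raw arrays), we can print the first few samples alongside their labels, mapping the integer targets back to the class names:

first_samples = iris_dataset['data'][:3]   # first three rows of measurements
first_targets = iris_dataset['target'][:3] # their integer class labels
print(first_samples)
print([iris_dataset['target_names'][t] for t in first_targets])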

Split the Data

Now we will use sklearn to split the data into training and testing sets. By default, train_test_split holds out 25% of the samples for testing.

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(iris_dataset['data'], iris_dataset['target'], random_state=0) # fixing random_state makes the shuffle deterministic, so we get the same split every time
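
To confirm the split behaved as expected (a quick check, not part of the original notebook), we can look at the shapes of the resulting arrays; with 150 samples, roughly 75% land in training and 25% in testing:

print("X_train shape: {}".format(X_train.shape)) # (112, 4)
print("X_test shape: {}".format(X_test.shape))   # (38, 4)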

Visualize the Data

Before we can use pandas for plotting, we must first convert our training data into a pandas DataFrame.

In [8]:
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset.feature_names)

Then we can create a scatter matrix from the DataFrame that illustrates the relationship between every pair of features, with points colored by class label. This helps us gauge how well the features cluster and distinguish the different classes. If the classes look like they could be separated by a line, that is usually a good sign.

In [10]:
pd.plotting.scatter_matrix(iris_dataframe, c=Y_train, figsize=(15, 15),
                          marker='o', hist_kwds={'bins':20}, s=60,
                          alpha=.8)
[Output: a 4x4 scatter matrix of the four iris features, with points colored by class.]

From the plots, we can see the 3 flower classes are well separated. This is good news: it means our machine learning model should be able to separate them too.

Fitting the KNN Classifier

Now to set up our classifier. We will use KNN.

KNN is a very simple algorithm to understand but still very effective. You basically store all of the training data points. Whenever a new data point is introduced, you find the training data point closest to it and classify the new point as that neighbor's class. That describes the case of a single nearest neighbor. The K in KNN is the number of neighbors considered: you can look at the K closest training points instead, and assign the class that the majority of those neighbors belong to.
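
Just to illustrate the idea (this sketch is mine, not from the book, and assumes the X_train/Y_train arrays defined above), a bare-bones 1-nearest-neighbor classifier can be written in a few lines of NumPy:

def predict_one_nn(X_train, Y_train, new_point):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - new_point) ** 2).sum(axis=1))
    # The new point gets the class of its closest training point;
    # for k > 1 you would take np.argsort(distances)[:k] and vote
    return Y_train[np.argmin(distances)]

In practice we will let sklearn handle this, since it gives us efficient neighbor searches and the majority vote for free.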

In [11]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
In [12]:
knn.fit(X_train, Y_train)
Out[12]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=1, p=2,
           weights='uniform')

Evaluate the Model

We can use the KNN score method to check the overall accuracy of the model.

In [13]:
print("Test set score: {:.2f}".format(knn.score(X_test, Y_test)))
Test set score: 0.97
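
Equivalently (a sketch of my own, not from the original notebook), the accuracy can be computed by hand from the predictions, and we can ask the model about a single made-up flower; the measurements below are invented purely for illustration:

Y_pred = knn.predict(X_test)
print("Manual accuracy: {:.2f}".format(np.mean(Y_pred == Y_test)))

X_new = np.array([[5.0, 2.9, 1.0, 0.2]]) # hypothetical sepal/petal measurements in cm
print("Predicted class: {}".format(iris_dataset['target_names'][knn.predict(X_new)][0]))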

Conclusion

97%? That is exceptional! It means the model correctly classified 97% of the irises in the test set. That's more effective than most humans would be, and it is accomplished in an instant once the model is given the raw measurements.

If this blog did not fully make sense, definitely check out Introduction to Machine Learning with Python. This is just a dumbed-down version of what they cover in the first chapter.