# KNN¶

In this assignment you will write code that performs nested cross-validation and the k-nearest neighbour (kNN) algorithm, estimates distances between data samples, builds confusion matrices, and calculates performance metrics.

## 1. Exploratory Data Analysis¶

The first part of tackling any ML problem is visualising the data in order to understand some of the properties of the problem at hand. The code below loads the iris dataset for you. With only 4 features (sepal length, sepal width, petal length, and petal width) and 3 classes, it is possible to use scatter plots to visualise the interactions between different pairings of features. An example of how this visualisation might look is shown below:
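If the loading cell is not available, one common route is scikit-learn's `load_iris` (assumed available from the labs); this is a minimal sketch, not necessarily the exact cell provided:

```python
import numpy as np
from sklearn.datasets import load_iris  # scikit-learn assumed available from the labs

iris = load_iris()
X = iris.data    # shape (150, 4): sepal length/width, petal length/width
y = iris.target  # shape (150,): class labels 0, 1, 2

print(X.shape, y.shape, np.unique(y))
```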

Your first task, drawing on the labs, is to recreate this 4x4 grid, with each off-diagonal subplot showing the interaction between two features, and each of the three classes shown in a different colour. Each on-diagonal subplot (representing a single feature) should show a histogram of that feature.

You should create a function that, given data X and labels y, plots this 4x4 grid. The function should be invoked as,

myplotGrid(X,y)



where X is your training data and y are the labels.
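One way this function might be structured (a sketch only — figure size, colours, and the legend placement are free choices, and the `Agg` backend line is just for headless runs):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris

def myplotGrid(X, y):
    """4x4 grid: histograms on the diagonal, pairwise scatter plots elsewhere."""
    n = X.shape[1]
    fig, axes = plt.subplots(n, n, figsize=(10, 10))
    for i in range(n):
        for j in range(n):
            ax = axes[i, j]
            for label in np.unique(y):
                mask = y == label
                if i == j:
                    # diagonal: histogram of feature i, one colour per class
                    ax.hist(X[mask, i], alpha=0.5, label=str(label))
                else:
                    # off-diagonal: feature j vs feature i, one colour per class
                    ax.scatter(X[mask, j], X[mask, i], s=10, alpha=0.6,
                               label=str(label))
    axes[0, 0].legend(title="class")
    fig.tight_layout()
    return fig

iris = load_iris()
fig = myplotGrid(iris.data, iris.target)
```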

### 1.2. Exploratory Data Analysis under noise¶

When data are collected under real-world settings (e.g., from webcams or other sensors) they usually contain some amount of noise that makes classification more challenging. In the cell below, invoke your exploratory data analysis function above on a noisy version of your data X.

Try to perturb your data with some Gaussian noise,

# initialize random seed to replicate results over different runs
np.random.seed(mySeed)
XN=X+np.random.normal(0,0.5,X.shape)



and then invoke

myplotGrid(XN,y)



## 2. Implementing kNN¶

In the cell below, develop your own code for performing k-Nearest Neighbour classification. You may use the scikit-learn k-NN implementation from the labs as a guide -- and as a way of verifying your results -- but it is important that your implementation does not use any libraries other than the basic numpy and matplotlib functions.

Define a function that performs k-NN given a set of data. Your function should be invoked similarly to:

    y_ = mykNN(X,y,X_,options)



where X is your training data, y is your training outputs, X_ is your testing data, and y_ is your predicted outputs for X_. The options argument (which can be a list or a set of separate arguments, depending on how you choose to implement the function) should contain at least the number of neighbours to consider and the distance function employed.

Hint: it helps to break the problem into various sub-problems, implemented as helper functions. For example, you might want to implement a separate function for calculating the distance between two vectors, and another function that finds the nearest neighbour(s) to a given vector.
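As a sketch of how these helpers might fit together (the function and parameter names here are illustrative, not prescribed by the assignment — only `mykNN` is named above):

```python
import numpy as np

def euclidean(a, b):
    """Euclidean (L2) distance between two vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    """Manhattan (L1) distance between two vectors."""
    return np.sum(np.abs(a - b))

def mykNN(X, y, X_, k=3, distance=euclidean):
    """Predict a label for each row of X_ by majority vote among its
    k nearest neighbours in the training data (X, y)."""
    X, y, X_ = np.asarray(X), np.asarray(y), np.asarray(X_)
    predictions = []
    for x in X_:
        # distance from x to every training sample
        dists = np.array([distance(x, xt) for xt in X])
        # indices of the k closest training samples
        nearest = np.argsort(dists)[:k]
        # majority vote over the neighbours' labels
        labels, counts = np.unique(y[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])
    return np.array(predictions)
```

Note that `np.argmax` breaks voting ties in favour of the smaller label; a more careful implementation might break ties by distance instead.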

## 3. Nested Cross-validation using your implementation of KNN¶

In the cell below, develop your own code for performing 5-fold nested cross-validation along with your implementation of k-NN above. Within this you should write code that evaluates classifier performance. You must write your own code -- the scikit-learn module may only be used for verification purposes.

Your code for nested cross-validation should invoke your kNN function (see above). Your cross-validation function should be invoked similarly to:

accuracies_fold, best_parameters_fold = myNestedCrossVal(X,y,5,list(range(1,11)),['euclidean','manhattan'],mySeed)



where X is your data matrix (containing all samples and features for each sample), y are your known output labels, 5 is the number of folds, list(range(1,11)) evaluates the neighbour parameter from 1 to 10, and ['euclidean','manhattan'] evaluates the two distances on the validation sets. mySeed is simply a random seed to enable us to replicate your results. The outputs could be a list of accuracy values, one per fold, and a list of the corresponding parameter tuples (distance, k) used to calculate them.

Notes:

• you should perform nested cross-validation on both your original data X and the data perturbed by noise as shown in the cells above (XN)
• you should implement/validate at least two distance functions
• you should evaluate the number of neighbours from 1 to 10
• your function should return a list of accuracies per fold, and a list of the corresponding parameters
• for each fold, your function should print:
  • the accuracy per distinct set of parameters on the validation set
  • the best set of parameters for the fold after validation
  • the confusion matrix per fold (on the testing set)
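The outer/inner loop structure might be sketched as follows. This is one possible design under stated assumptions: the inner loop uses (n_folds - 1) validation folds, ties are broken by first-seen parameters, and the per-fold printing required above is omitted for brevity. The helper `_knn_predict` stands in for your `mykNN` function.

```python
import numpy as np

def _euclidean(A, x):
    return np.sqrt(((A - x) ** 2).sum(axis=1))

def _manhattan(A, x):
    return np.abs(A - x).sum(axis=1)

_DISTANCES = {'euclidean': _euclidean, 'manhattan': _manhattan}

def _knn_predict(Xtr, ytr, Xte, k, dist):
    """Minimal kNN by majority vote (stands in for mykNN above)."""
    preds = []
    for x in Xte:
        nearest = np.argsort(dist(Xtr, x))[:k]
        labels, counts = np.unique(ytr[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def myNestedCrossVal(X, y, n_folds, ks, dist_names, seed):
    X, y = np.asarray(X), np.asarray(y)
    rng = np.random.RandomState(seed)  # reproducible shuffle
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    accuracies, best_params = [], []
    for f in range(n_folds):
        test_idx = folds[f]
        trainval = np.concatenate([folds[g] for g in range(n_folds) if g != f])
        inner = np.array_split(trainval, n_folds - 1)  # inner validation folds
        best_acc, best = -1.0, None
        for dname in dist_names:  # grid search over parameter combinations
            for k in ks:
                scores = []
                for v in range(len(inner)):
                    val = inner[v]
                    tr = np.concatenate(
                        [inner[w] for w in range(len(inner)) if w != v])
                    pred = _knn_predict(X[tr], y[tr], X[val], k,
                                        _DISTANCES[dname])
                    scores.append((pred == y[val]).mean())
                if np.mean(scores) > best_acc:
                    best_acc, best = np.mean(scores), (dname, k)
        # retrain on all train+validation data with the winning parameters
        dname, k = best
        pred = _knn_predict(X[trainval], y[trainval], X[test_idx], k,
                            _DISTANCES[dname])
        accuracies.append((pred == y[test_idx]).mean())
        best_params.append(best)
    return accuracies, best_params
```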

### 3.2. Summary of results¶

Using your results from above, fill out the following table using the clean data:

SEE ABOVE

Now fill out the following table using the noisy data:

SEE ABOVE

### 3.3. Confusion matrix summary¶

Summarise the overall results of your nested cross-validation evaluation of your k-NN algorithm using two summary confusion matrices (one for the noisy data, one for the clean data). You might want to adapt your code above to also return a list of confusion matrices (one for each fold), e.g.

accuracies_fold, best_parameters_fold, confusion_matrix_fold = myNestedCrossValConf(X,y,5,list(range(1,11)),['euclidean','manhattan'],mySeed)



Then write a function to print the two matrices below. Make sure you label the matrices so that they are readable. You might also show class-relative precision and recall.
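A minimal sketch of such helpers (the names `myConfusionMatrix` and `printConfusion` are illustrative, not prescribed; rows are true classes, columns are predicted):

```python
import numpy as np

def myConfusionMatrix(y_true, y_pred, labels):
    """Rows = true class, columns = predicted class."""
    index = {lab: i for i, lab in enumerate(labels)}
    cm = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[index[t], index[p]] += 1
    return cm

def printConfusion(cm, labels):
    """Print a labelled confusion matrix plus per-class precision/recall."""
    print("true\\pred " + " ".join(f"{str(l):>8}" for l in labels))
    for i, lab in enumerate(labels):
        print(f"{str(lab):>9} " + " ".join(f"{v:8d}" for v in cm[i]))
    for i, lab in enumerate(labels):
        tp = cm[i, i]
        col, row = cm[:, i].sum(), cm[i].sum()
        precision = tp / col if col else 0.0  # of predicted lab, fraction correct
        recall = tp / row if row else 0.0     # of true lab, fraction recovered
        print(f"class {lab}: precision={precision:.2f} recall={recall:.2f}")
```

Per-fold matrices can then be summed element-wise to give the two summary matrices (one for clean data, one for noisy).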

## 4. Questions¶

Now answer the following questions as fully as you can. The answers should be based on your implementation above. Write your answers in new Markdown cells below each question.

### Q1. Choice of parameters¶

Do the best parameters change per fold? Is there one parameter choice that is best regardless of the data used?

Yes, the value of k varies from fold to fold, although the distance type stays the same apart from one noisy round where Manhattan came out better.

### Q2. Clean vs. noisy¶

Does the best parameter choice change depending on whether we use clean or noisy data? (Answer for both distance function and number of neighbours.)