This is a working notebook for common questions/information about the Machine Learning modules and hives. While writing this, I've gone through the modules again and tried to answer any questions I think might come up, so you won't have to fiddle around with Google on the day - of course, I'll probably miss things, so feel free to add stuff!
This module has a few aims:
It's organised as one algorithm per notebook, in a very simple format: an explanation of the algorithm, example code, and a project that is slightly more challenging than the example but shouldn't be unreasonable given that the person doing it has a solid foundation in Python. Naturally, I think the first questions people will ask are:
Machine learning and statistics are similar areas, and often use the same methods to find an answer (for example, regression), but the difference comes from their implementation in Python - machine learning libraries such as scikit-learn will build a model object we can make predictions from, whereas stats libraries like scipy and statsmodels are more useful for getting statistical summaries like test statistics and p-values. In practice, it depends on the application - if you are running an experiment and need to see the correlation between two variables, you will probably want to use a statistics package, but if you are building a model to predict where earthquakes might hit, you will probably want to use machine learning.
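If anyone wants to see the contrast concretely, here's a minimal sketch (the data is made up): scikit-learn gives you a model object you predict from, while scipy gives you a statistical summary of the same relationship.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Toy data with a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Machine learning style: fit a model object, then use it to predict new points
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(model.predict([[6.0]]))

# Statistics style: get a summary of the relationship (slope, p-value, etc.)
result = stats.linregress(x, y)
print(result.slope, result.pvalue)
```

Same underlying regression, two very different interfaces - which one you want depends on whether you're after predictions or a summary.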
A lot of this depends on what you are trying to achieve and what data you have - do we want to categorise data into sets, or do we want to find a continuous relationship between variables? Do we need the results to be fast or accurate? Do we have a labelled training set we can fit the model to? All of these questions help us decide which model to use (sometimes the answer isn't just to slam it into a neural network!)
This is where knowledge of your field comes in - for example, if you are trying to cluster population centers, you will pick k values that make sense for the area (the number of cities, etc.).
Get them to print everything the make_blobs function gives us - it returns a tuple, with the points in the first index, plus some stuff we don't need (in fact, the second element is the list which gives a correspondence from each point to the blob it belongs to!). For our example, we don't want the other stuff, so we just index it out.
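A quick sketch of what that looks like (the parameters here are illustrative, not necessarily the ones in the notebook):

```python
from sklearn.datasets import make_blobs

# make_blobs returns a tuple: (points, labels)
points, labels = make_blobs(n_samples=10, centers=3, random_state=0)
print(points.shape)   # one row per point, two columns (x, y)
print(labels)         # which blob each point came from - we ignore this here

# Equivalent to the indexing approach: keep only the points
data = make_blobs(n_samples=10, centers=3, random_state=0)[0]
```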
Our data is a 2-column matrix - each row is a point which we want to plot. Since matplotlib wants a list of x coordinates and then a list of y coordinates, we need to reformat the data before we can use the plot function on it. Indexing in numpy works by giving what is basically a list of slices - since we want every row we put the empty slice ":" in the first spot, and "0" in the second spot because we only want the first column. Similarly, we use [:,1] for the y values because we want every row, but only the second value from each row.
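A tiny worked example of that slicing, in case anyone wants to see it on numbers they can check by eye:

```python
import numpy as np

# Each row is an (x, y) point; matplotlib wants all the xs, then all the ys
data = np.array([[1, 4],
                 [2, 5],
                 [3, 6]])

xs = data[:, 0]   # every row, first column
ys = data[:, 1]   # every row, second column
# plt.scatter(xs, ys) would then plot the three points
```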
Also consider the native Python equivalent - how you would do it using indexing and/or for loops. It's much more of a pain!
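For comparison, the plain-Python version of the same column extraction might look like this - workable, but you have to walk the rows yourself:

```python
# Same data as a plain list of lists - no numpy slicing available
data = [[1, 4], [2, 5], [3, 6]]

xs = [row[0] for row in data]   # first value from each row
ys = [row[1] for row in data]   # second value from each row
```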
This is a product of object-oriented programming - we actually initialise a k-means model for 4 centers using the KMeans() constructor, and then fit it to our data using the fit method. This is a method on the model object - not a standalone function - so the model "changes itself" to fit our data. What's important to realise is that all the information we need is stored inside the model object - we don't get it by running functions on it!
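A sketch of that fit-then-inspect pattern (the blob parameters are made up for the example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

data = make_blobs(n_samples=200, centers=4, random_state=0)[0]

model = KMeans(n_clusters=4, n_init=10, random_state=0)
model.fit(data)   # the model "changes itself" to fit the data

# Everything we need now lives on the object as attributes:
print(model.cluster_centers_)   # the 4 centres it found
print(model.labels_[:10])       # which cluster each point was assigned to
```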
The great thing about the algorithm is that it's generalised - we don't care what these points represent (see project 2), or how many dimensions our data points have. The reason this is a 4-dimensional example is to remove the crutch of geometric representation - you have to trust in the code!
The solution for this Mini Project is pretty much identical to the example but backwards - we want to use the elbow method first to find a suitable k, and then build a model using it. People who are struggling might try to guess values of k - try to get these people to think about why that might not be a good idea (the fit always gets tighter for larger k, but that's not the point of the model! We don't want to overfit). People who are excelling might wonder if there's a better way of finding a k-value suitable for our data. In fact, there is a better algorithm for when we don't know our k-value, called DBSCAN - but I haven't written this guide yet!
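If anyone gets stuck, the elbow method can be sketched like this: fit a model for each candidate k and record its inertia (the within-cluster sum of squares), which always shrinks as k grows - we look for the "elbow" where it stops shrinking quickly. (Data and range of k here are illustrative.)

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

data = make_blobs(n_samples=300, centers=4, random_state=0)[0]

inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias.append(model.inertia_)   # within-cluster sum of squares

# plt.plot(range(1, 10), inertias) would show a sharp bend near the true k
```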
Explain how RGB values work - everything else pretty much follows from here. matplotlib's imread will load a picture as a numpy array of RGB values.
If we want to quantise for 16 colours - what does this mean?
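Quantising to 16 colours means: cluster all the pixels' RGB values into 16 groups, then repaint every pixel with its group's centre colour. Here's a sketch on a small random "image" standing in for a real picture (the image size and data are made up for the example):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # fake 32x32 RGB image, values in [0, 1]

pixels = image.reshape(-1, 3)     # one row per pixel
model = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)

# Rebuild the image using only the 16 centre colours
quantised = model.cluster_centers_[model.labels_].reshape(image.shape)
print(len(np.unique(quantised.reshape(-1, 3), axis=0)))   # distinct colours left
```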
Most of the questions above also apply to this guide - I decided to move a bit faster here because I'm expecting people to be more comfortable with numpy and pandas by this point. The most confusion will come from the illustration of KNN - I think people will be able to understand the idea but might need some help understanding certain bits of the code. The idea is that we set upper and lower bounds for x and y based on our data, with a buffer of 1 so the graph has a bit of space, then set up a meshgrid, which is basically a 2-dimensional matrix representing every point on the graph. We then need to format this a little to get it into (x, y) pairs, one for each point on the graph. Then we predict every one of these points and plot them behind our actual data to show the boundary.
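For anyone explaining this at a whiteboard, the whole meshgrid trick can be sketched like so (the data, step size, and variable names are illustrative, not the notebook's own):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Bounds with a buffer of 1, then a grid covering every point on the graph
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

# Reformat the grid into (x, y) pairs, one per grid point, and predict them all
grid_points = np.c_[xx.ravel(), yy.ravel()]
Z = model.predict(grid_points).reshape(xx.shape)
# plt.contourf(xx, yy, Z, alpha=0.3) then plt.scatter(...) shows the boundary
```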
First priority is data - if you need a linear separation between your data, KNN isn't going to give you that. Other than that, SVM is a much slower algorithm to train than KNN, so if we don't need to be accurate but need to be fast, then SVM might not be appropriate.
Gives us a fast model that we as humans can use to classify data even if it's multidimensional. Great when speed and explanation are needed over a "black box" and accuracy!
Honestly, this took me the best part of 4 hours to do. Get them to google it and figure it out for their system - knowing how these events run, nothing will work the same on my system as on anyone else's!