This is a working notebook for common questions/information about the Machine Learning modules and hives. While writing this, I've gone through the modules again and tried to answer any questions I think might come up, so you won't have to fiddle around with Google on the day - of course, I'll probably miss things, so feel free to add stuff!
This module has a few aims:
It's organised as one algorithm per notebook, in a very simple format: an explanation of the algorithm, example code, and a project that is slightly more challenging than the example but shouldn't be unreasonable given that the person doing it has a solid foundation in Python. Naturally, I think the first questions people will ask are:
Machine learning and statistics are similar areas, and often use the same methods to find an answer (for example, regression), but the difference comes from their implementation in Python - machine learning libraries such as scikit-learn will build a model object we can make predictions from, whereas stats libraries like scipy and statsmodels are more useful for getting statistical summaries like test statistics and p-values. In practice, it depends on the application - if you are running an experiment and need to see the correlation between two variables, you will probably want to use a statistics package, but if you are building a model to predict where earthquakes might hit, you will probably want to use machine learning.
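If anyone wants to see the contrast concretely, here's a minimal sketch (the data is made up): scikit-learn gives you a model object you predict from, while scipy gives you a statistical summary of the same relationship.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Toy data with a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Machine learning style: fit a model object, then use it to predict new points
model = LinearRegression().fit(x.reshape(-1, 1), y)
print(model.predict([[6.0]]))

# Statistics style: get a summary of the relationship (slope, p-value, etc.)
result = stats.linregress(x, y)
print(result.slope, result.pvalue)
```

Same underlying regression, two very different interfaces - which one you want depends on whether you're after predictions or a summary.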
A lot of this depends on what you are trying to achieve and what data you have - do we want to categorise data into sets, or do we want to find a continuous relationship between variables? Do we need the results to be fast or accurate? Do we have a labelled training set we can fit the model to? All of these questions help us decide which model to use (sometimes the answer isn't just to slam it into a neural network!)
This is where knowledge of your field comes in - for example, if you are trying to cluster population centers, you will pick k values that make sense for the area (the number of cities, etc.).
Get them to print everything the make_blobs function gives us - it returns a tuple, with the points in the first index, plus some stuff we don't need (in fact, the second element is the list which gives a correspondence from each point to the blob it belongs to!). For our example, we don't want the other stuff, so we just index it out.
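A quick sketch of what that looks like (the parameters here are illustrative, not necessarily the ones in the notebook):

```python
from sklearn.datasets import make_blobs

# make_blobs returns a tuple: (points, labels)
points, labels = make_blobs(n_samples=10, centers=3, random_state=0)
print(points.shape)   # one row per point, two columns (x, y)
print(labels)         # which blob each point came from - we ignore this here

# Equivalent to the indexing approach: keep only the points
data = make_blobs(n_samples=10, centers=3, random_state=0)[0]
```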
Our data is a 2-column matrix - each row is a point which we want to plot. Since matplotlib wants a list of x coordinates and then a list of y coordinates, we need to reformat the data before we can use the plot function on it. Indexing in numpy works by giving what is basically a list of slices - since we want every row we put the empty slice ":" in the first spot, and "0" in the second spot because we only want the first column. Similarly, we use [:,1] for the y values because we want every row, but only the second value from each row.
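A tiny worked example of that slicing, in case anyone wants to see it on numbers they can check by eye:

```python
import numpy as np

# Each row is an (x, y) point; matplotlib wants all the xs, then all the ys
data = np.array([[1, 4],
                 [2, 5],
                 [3, 6]])

xs = data[:, 0]   # every row, first column
ys = data[:, 1]   # every row, second column
# plt.scatter(xs, ys) would then plot the three points
```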
Also consider the native Python equivalent - how you would do it using indexing and/or for loops. It's much more of a pain!
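For comparison, the plain-Python version of the same column extraction might look like this - workable, but you have to walk the rows yourself:

```python
# Same data as a plain list of lists - no numpy slicing available
data = [[1, 4], [2, 5], [3, 6]]

xs = [row[0] for row in data]   # first value from each row
ys = [row[1] for row in data]   # second value from each row
```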
This is a product of object-oriented programming - we actually initialise a k-means model for 4 centers using the KMeans() constructor, and then fit it to our data using the fit method. This is a method on the model object - not a standalone function - so the model "changes itself" to fit our data. What's important to realise is that all the information we need is stored inside the model object - we don't get it by running functions on it!
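A sketch of that fit-then-inspect pattern (the blob parameters are made up for the example):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

data = make_blobs(n_samples=200, centers=4, random_state=0)[0]

model = KMeans(n_clusters=4, n_init=10, random_state=0)
model.fit(data)   # the model "changes itself" to fit the data

# Everything we need now lives on the object as attributes:
print(model.cluster_centers_)   # the 4 centres it found
print(model.labels_[:10])       # which cluster each point was assigned to
```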
The great thing about the algorithm is that it's generalised - we don't care what these points represent (see project 2), or how many dimensions our data points have. The reason this is a 4-dimensional example is to remove the crutch of geometric representation - you have to trust in the code!
The solution for this Mini Project is pretty much identical to the example but backwards - we want to use the elbow method first to find a suitable k, and then build a model using it. People who are struggling might try to guess values of k - try to get these people to think about why that might not be a good idea (the fit always gets tighter for larger k, but that's not the point of the model! We don't want to overfit). People who are excelling might wonder if there's a better way of finding a k-value suitable for our data. In fact, there is a better algorithm for when we don't know our k-value, called DBSCAN - but I haven't written this guide yet!
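If anyone gets stuck, the elbow method can be sketched like this: fit a model for each candidate k and record its inertia (the within-cluster sum of squares), which always shrinks as k grows - we look for the "elbow" where it stops shrinking quickly. (Data and range of k here are illustrative.)

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

data = make_blobs(n_samples=300, centers=4, random_state=0)[0]

inertias = []
for k in range(1, 10):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
    inertias.append(model.inertia_)   # within-cluster sum of squares

# plt.plot(range(1, 10), inertias) would show a sharp bend near the true k
```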
Explain how RGB values work - everything else pretty much follows from here. matplotlib's imread will load a picture as a numpy array of RGB values.
If we want to quantise for 16 colours - what does this mean?
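Quantising to 16 colours means: cluster all the pixels' RGB values into 16 groups, then repaint every pixel with its group's centre colour. Here's a sketch on a small random "image" standing in for a real picture (the image size and data are made up for the example):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # fake 32x32 RGB image, values in [0, 1]

pixels = image.reshape(-1, 3)     # one row per pixel
model = KMeans(n_clusters=16, n_init=10, random_state=0).fit(pixels)

# Rebuild the image using only the 16 centre colours
quantised = model.cluster_centers_[model.labels_].reshape(image.shape)
print(len(np.unique(quantised.reshape(-1, 3), axis=0)))   # distinct colours left
```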
Most of the questions above also apply to this guide - I decided to move a bit faster here because I'm expecting people to be more comfortable with numpy and pandas by this point. The most confusion will come from the illustration of KNN - I think people will be able to understand the idea but might need some help understanding certain bits of the code. The idea is that we set upper and lower bounds for x and y based on our data, with a buffer of 1 so the graph has a bit of space, then set up a meshgrid, which is basically a 2-dimensional matrix representing every point on the graph. We then need to format this a little to get it into (x, y) pairs, one for each point on the graph. Then we predict every one of these points and plot them behind our actual data to show the boundary.
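For anyone explaining this at a whiteboard, the whole meshgrid trick can be sketched like so (the data, step size, and variable names are illustrative, not the notebook's own):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
model = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Bounds with a buffer of 1, then a grid covering every point on the graph
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                     np.arange(y_min, y_max, 0.1))

# Reformat the grid into (x, y) pairs, one per grid point, and predict them all
grid_points = np.c_[xx.ravel(), yy.ravel()]
Z = model.predict(grid_points).reshape(xx.shape)
# plt.contourf(xx, yy, Z, alpha=0.3) then plt.scatter(...) shows the boundary
```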
First priority is data - if you need a linear separation between your data, KNN isn't going to give you that. Other than that, SVM is a much slower algorithm to train than KNN, so if we don't need to be accurate but need to be fast, then SVM might not be appropriate.
Gives us a fast model that we as humans can use to classify data even if it's multidimensional. Great when speed and explanation are needed over a "black box" and accuracy!
Honestly, this took me the best part of 4 hours to do. Get them to google it and figure it out for their system - knowing how these events run, nothing will work the same on my system as on anyone else's!