- Import the tab-separated dataset 'Datasets/Forams/Forams.txt' into a pandas DataFrame.
- This dataset has several variables that describe the morphology of two different foram species: 'Ø [mm]', 'Perc [%]' and 'Growth [%]'. Ø is a measure of size that sedimentologists use. They often prefer $\phi$ to grain size, where $\phi$ is related to grain size by
$\phi = -\log_2 D$, where $D$ is the grain size in mm.
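Here is a helpful guide to understanding sedimentological grain size scales:

| $\phi$ range | Size class |
|---|---|
| −6 to −8 | Cobble |
| −5 to −6 | Very coarse gravel |
| −4 to −5 | Coarse gravel |
| −3 to −4 | Medium gravel |
| −2 to −3 | Fine gravel |
| −1 to −2 | Very fine gravel |
| 0 to −1 | Very coarse sand |
| 1 to 0 | Coarse sand |
| 2 to 1 | Medium sand |
| 3 to 2 | Fine sand |
| 4 to 3 | Very fine sand |
| 8 to 4 | Silt |
| 10 to 8 | Clay |
| 20 to 10 | Colloid |
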
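
As a quick check on the $\phi$ relation given above, the conversion can be sketched in a few lines. The function names `diameter_to_phi` and `phi_to_diameter` and the sample diameters are illustrative assumptions, not part of the dataset.

```python
import numpy as np

def diameter_to_phi(d_mm):
    """Convert grain diameter in mm to the phi scale: phi = -log2(D)."""
    return -np.log2(np.asarray(d_mm, dtype=float))

def phi_to_diameter(phi):
    """Invert the phi scale back to a diameter in mm: D = 2**(-phi)."""
    return 2.0 ** (-np.asarray(phi, dtype=float))

# Hypothetical sample diameters in mm (not taken from the foram dataset)
diameters = np.array([0.125, 0.5, 1.0, 4.0])
phis = diameter_to_phi(diameters)
print(phis)
```

Note that larger grains have smaller (more negative) $\phi$, which is why the ranges in the guide run "backwards".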
- Plot the three parameters against one another in a pairplot and color the points by Species.
- Randomize the dataset and make the first half into a training dataset.
- Use the scikit-learn algorithm GaussianNB() to create a model for your data. Here you want to use the Species designations as your training values.
- Classify the second half of your dataset using the model you made and plot the classified datapoints in a pairplot.
- How does this compare with just using the 'Species' to set the hue? Did the paleontologists do a good job in picking their species?
- Perform a principal component analysis on the original DataFrame to get 2 components. Display your data in the coordinate system of these components using sns.scatterplot. Color by Species.
- Use GaussianNB() to classify these data as before, but this time use your PCA_1 to classify. The two PCAs are not independent as they sum to unity, so you can use either one. Based on your pairplot, place forams into one of two clusters: cluster_1, cluster_2. Then use the cluster designation as your Y in GaussianNB().
- Plot your classified data. Does this do a better job than in part 1? Does this also mean that we don't need paleontologists?
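
The steps above could be sketched roughly as follows. This is only one possible approach, and it makes several assumptions: the synthetic table stands in for the real file (which would be read with `pd.read_csv('Datasets/Forams/Forams.txt', sep='\t')`), the species names 'A' and 'B' and their means are invented, the 'Predicted' and 'Cluster' column names are hypothetical, and the split at PCA_1 = 0 replaces the cluster boundary you would pick by eye from your own plot.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

# Hypothetical stand-in data (assumption): two well-separated synthetic species
rng = np.random.default_rng(0)
n = 100
features = ['Ø [mm]', 'Perc [%]', 'Growth [%]']
species_a = rng.normal(loc=[0.3, 40.0, 10.0], scale=[0.05, 5.0, 2.0], size=(n, 3))
species_b = rng.normal(loc=[0.6, 60.0, 20.0], scale=[0.05, 5.0, 2.0], size=(n, 3))
df = pd.DataFrame(np.vstack([species_a, species_b]), columns=features)
df['Species'] = ['A'] * n + ['B'] * n

# Part 1: randomize the rows, then use the first half for training
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
half = len(shuffled) // 2
train, test = shuffled.iloc[:half], shuffled.iloc[half:].copy()

# Fit Gaussian naive Bayes with Species as the training labels
model = GaussianNB()
model.fit(train[features], train['Species'])
test['Predicted'] = model.predict(test[features])
accuracy = (test['Predicted'] == test['Species']).mean()

# Part 2: project the three variables onto two principal components
pca = PCA(n_components=2)
components = pca.fit_transform(shuffled[features])
shuffled['PCA_1'] = components[:, 0]
shuffled['PCA_2'] = components[:, 1]

# Assumed cluster boundary at PCA_1 = 0; in the exercise this is chosen from the plot
shuffled['Cluster'] = np.where(shuffled['PCA_1'] > 0, 'cluster_1', 'cluster_2')
pca_train, pca_test = shuffled.iloc[:half], shuffled.iloc[half:].copy()

# Classify on PCA_1 alone, using the cluster designation as Y
pca_model = GaussianNB().fit(pca_train[['PCA_1']], pca_train['Cluster'])
pca_test['Predicted'] = pca_model.predict(pca_test[['PCA_1']])
agreement = (pca_test['Predicted'] == pca_test['Cluster']).mean()
```

With seaborn imported as `sns`, `sns.pairplot(test, hue='Predicted')` and `sns.scatterplot(data=shuffled, x='PCA_1', y='PCA_2', hue='Species')` would then give the plots the exercise asks for. The variables are not rescaled before the PCA here, so the component directions are dominated by whichever column has the largest variance.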
3. K-means Clustering
- Use the scikit-learn KMeans algorithm on your data, using the principal components as your dimensions.
- Plot your clustered data. Did the algorithm work?
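
A minimal sketch of this step, again on invented stand-in data rather than the real foram table, might look like the following. The choice of two clusters, the `random_state` values, and the synthetic means are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical stand-in for the three morphology columns (assumption)
rng = np.random.default_rng(1)
n = 100
group_a = rng.normal(loc=[0.3, 40.0, 10.0], scale=[0.05, 5.0, 2.0], size=(n, 3))
group_b = rng.normal(loc=[0.6, 60.0, 20.0], scale=[0.05, 5.0, 2.0], size=(n, 3))
X = np.vstack([group_a, group_b])
truth = np.array([0] * n + [1] * n)

# Reduce to two principal components, then cluster in that coordinate system
components = PCA(n_components=2).fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(components)

# K-means labels are arbitrary, so compare with the truth up to a label swap
match = (labels == truth).mean()
agreement = max(match, 1.0 - match)
print(f"Agreement with the synthetic groups: {agreement:.2f}")
```

Because K-means never sees the Species column, comparing its labels with the species designations afterwards is one way to judge whether the algorithm "worked".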