Practice Problems

Lecture 24

Answer each number in a separate cell

Rename the notebook with your lastName and the lecture

ex. CychB_24

Turn this notebook in on triton-ed by the end of class

1. Classification

  • Import the tab separated dataset 'Datasets/Forams/Forams.txt' into a pandas dataframe.
  • This dataset has several variables that describe the morphology of 2 different foram species: 'Ø [mm]', 'Perc [%]' and 'Growth [%]'. Ø is a measure of size that sedimentologists use. They prefer to use $\phi$ instead of grain size; $\phi$ is related to grain size by:

$\phi = -\log_2 D$, where $D$ is the grain size in mm.

Here is a helpful guide to understanding sedimentological grain size scales:

| $\phi$ scale | Size range | Wentworth class | Other names |
|---|---|---|---|
| <−8 | >256 mm | Boulder | |
| −6 to −8 | 64–256 mm | Cobble | |
| −5 to −6 | 32–64 mm | Very coarse gravel | Pebble |
| −4 to −5 | 16–32 mm | Coarse gravel | Pebble |
| −3 to −4 | 8–16 mm | Medium gravel | Pebble |
| −2 to −3 | 4–8 mm | Fine gravel | Pebble |
| −1 to −2 | 2–4 mm | Very fine gravel | Granule |
| 0 to −1 | 1–2 mm | Very coarse sand | |
| 1 to 0 | 0.5–1 mm | Coarse sand | |
| 2 to 1 | 0.25–0.5 mm | Medium sand | |
| 3 to 2 | 125–250 µm | Fine sand | |
| 4 to 3 | 62.5–125 µm | Very fine sand | |
| 8 to 4 | 3.9–62.5 µm | Silt | Mud |
| 10 to 8 | 0.98–3.9 µm | Clay | Mud |
| 20 to 10 | 0.95–977 nm | Colloid | Mud |
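As a quick check of the $\phi$ formula against the grain size scale, here is a minimal sketch of the conversion in both directions. The function names are hypothetical and chosen only for illustration.

```python
import math


def grain_size_to_phi(d_mm):
    """Return phi = -log2(D) for a grain size D given in millimetres."""
    if d_mm <= 0:
        raise ValueError("grain size must be positive")
    return -math.log2(d_mm)


def phi_to_grain_size(phi):
    """Invert the relation: D = 2**(-phi), in millimetres."""
    return 2.0 ** (-phi)


# 0.25 mm is the boundary between fine and medium sand in the scale above.
print(grain_size_to_phi(0.25))  # 2.0
# 4 mm is the boundary between very fine and fine gravel.
print(grain_size_to_phi(4.0))   # -2.0
# phi = 3 corresponds to 0.125 mm, i.e. 125 µm.
print(phi_to_grain_size(3))     # 0.125
```

Note that larger grains have smaller (more negative) $\phi$ values, which is why the scale runs "backwards" relative to size.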

Plot the three parameters against one another in a pairplot and color the points by Species.

  • Randomize the dataset and make the first half into a training dataset.
  • Use the scikit-learn algorithm GaussianNB() to create a model for your data. Use the Species designations as your training labels.
  • Classify the second half of your dataset using the model you made and plot the classified datapoints in a pairplot.
  • How does this compare with just using the 'Species' to set the hue? Did the paleontologists do a good job in picking their species?
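The steps in part 1 can be sketched roughly as follows. Since the Forams file is not reproduced here, the sketch runs on a synthetic stand-in: the species names `species_A` and `species_B`, the numeric values, and all helper function names are hypothetical. It also assumes the label column is called 'Species' and that the file has a header row.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

FEATURES = ['Ø [mm]', 'Perc [%]', 'Growth [%]']


def load_forams(path='Datasets/Forams/Forams.txt'):
    """Read the tab-separated file (assumes a header row)."""
    return pd.read_csv(path, sep='\t')


def make_synthetic_forams(n_per_species=50, seed=0):
    """Made-up stand-in data with the same column names (not real measurements)."""
    rng = np.random.default_rng(seed)
    a = rng.normal([0.3, 40.0, 10.0], [0.03, 3.0, 1.0], size=(n_per_species, 3))
    b = rng.normal([0.6, 60.0, 20.0], [0.05, 4.0, 2.0], size=(n_per_species, 3))
    df = pd.DataFrame(np.vstack([a, b]), columns=FEATURES)
    df['Species'] = ['species_A'] * n_per_species + ['species_B'] * n_per_species
    return df


def split_and_classify(df, seed=0):
    """Shuffle, train GaussianNB on the first half, classify the second half."""
    shuffled = df.sample(frac=1, random_state=seed).reset_index(drop=True)
    half = len(shuffled) // 2
    train = shuffled.iloc[:half]
    test = shuffled.iloc[half:].copy()
    model = GaussianNB().fit(train[FEATURES], train['Species'])
    test['Predicted'] = model.predict(test[FEATURES])
    return train, test


def plot_pairs(df, hue):
    """Pairplot of the three parameters, colored by the given column."""
    import seaborn as sns  # imported here so the rest runs without seaborn
    return sns.pairplot(df, vars=FEATURES, hue=hue)


forams = make_synthetic_forams()  # in the notebook: forams = load_forams()
train, test = split_and_classify(forams)
accuracy = (test['Predicted'] == test['Species']).mean()
print(f"Fraction of test forams assigned to their labeled species: {accuracy:.2f}")
```

Calling `plot_pairs(test, hue='Predicted')` next to `plot_pairs(test, hue='Species')` gives the visual comparison the last bullet asks about.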

2. PCA

  • Perform a principal component analysis on the original DataFrame to get 2 components. Display your data in the coordinate system of these components using sns.scatterplot. Color by Species.
  • Use GaussianNB() to classify these data as before, but this time use your PCA_1 to classify. The two PCAs are not independent, as they sum to unity, so you can use either one. Based on your pairplot, place the forams into one of two clusters: cluster_1 or cluster_2. Then use the cluster designation as your Y in GaussianNB().
  • Plot your classified data. Does this do a better job than in part 1? Does this also mean that we don't need paleontologists??
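One possible shape for part 2 is sketched below, again on synthetic stand-in data with hypothetical species names and values. The cutoff used to assign `cluster_1` and `cluster_2` (PCA_1 greater than 0) is an assumption that suits this made-up data; with the real dataset the cutoff should be read off your own plot. The column names `PCA_1`, `PCA_2`, `Cluster` and `Predicted` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB

FEATURES = ['Ø [mm]', 'Perc [%]', 'Growth [%]']

# Synthetic stand-in for the Forams data (made-up values, not real measurements).
rng = np.random.default_rng(1)
n = 50
values = np.vstack([
    rng.normal([0.3, 40.0, 10.0], [0.03, 3.0, 1.0], size=(n, 3)),
    rng.normal([0.6, 70.0, 25.0], [0.05, 4.0, 2.0], size=(n, 3)),
])
forams = pd.DataFrame(values, columns=FEATURES)
forams['Species'] = ['species_A'] * n + ['species_B'] * n

# Project the three parameters onto two principal components.
pca = PCA(n_components=2)
scores = pca.fit_transform(forams[FEATURES])
forams['PCA_1'] = scores[:, 0]
forams['PCA_2'] = scores[:, 1]

# Assumed cutoff: PCA centers the data, so two equal-sized, well-separated
# groups fall on either side of PCA_1 = 0 in this synthetic example.
forams['Cluster'] = np.where(forams['PCA_1'] > 0, 'cluster_1', 'cluster_2')

# Shuffle, train on the first half using PCA_1 only, classify the second half.
shuffled = forams.sample(frac=1, random_state=1).reset_index(drop=True)
half = len(shuffled) // 2
train = shuffled.iloc[:half]
test = shuffled.iloc[half:].copy()
model = GaussianNB().fit(train[['PCA_1']], train['Cluster'])
test['Predicted'] = model.predict(test[['PCA_1']])


def plot_components(df, hue):
    """Scatter the data in principal-component coordinates."""
    import seaborn as sns  # imported here so the rest runs without seaborn
    return sns.scatterplot(data=df, x='PCA_1', y='PCA_2', hue=hue)


print(pd.crosstab(test['Predicted'], test['Species']))
```

The crosstab shows how the predicted clusters line up with the labeled species, which is one way to compare this result with part 1.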

3. K-means Clustering

  • Use the scikit-learn KMeans algorithm on your data, using the principal components as your dimensions.
  • Plot your clustered data. Did the algorithm work?
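A minimal sketch of part 3 follows, once more on synthetic stand-in data with hypothetical species names and values. It assumes two clusters (one per species); the `KMeans` column name and the agreement measure are illustrative choices, not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

FEATURES = ['Ø [mm]', 'Perc [%]', 'Growth [%]']

# Synthetic stand-in for the Forams data (made-up values, not real measurements).
rng = np.random.default_rng(2)
n = 50
values = np.vstack([
    rng.normal([0.3, 40.0, 10.0], [0.03, 3.0, 1.0], size=(n, 3)),
    rng.normal([0.6, 70.0, 25.0], [0.05, 4.0, 2.0], size=(n, 3)),
])
forams = pd.DataFrame(values, columns=FEATURES)
forams['Species'] = ['species_A'] * n + ['species_B'] * n

# Use the two principal components as the clustering dimensions.
scores = PCA(n_components=2).fit_transform(forams[FEATURES])
forams['PCA_1'] = scores[:, 0]
forams['PCA_2'] = scores[:, 1]

# Assumption: two clusters, one per species. KMeans never sees the labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
forams['KMeans'] = kmeans.fit_predict(forams[['PCA_1', 'PCA_2']])

# Cluster numbers are arbitrary, so compare with Species through a crosstab
# and count the dominant species in each cluster.
table = pd.crosstab(forams['KMeans'], forams['Species'])
agreement = table.max(axis=1).sum() / len(forams)


def plot_clusters(df):
    """Scatter the principal components, colored by KMeans cluster."""
    import seaborn as sns  # imported here so the rest runs without seaborn
    return sns.scatterplot(data=df, x='PCA_1', y='PCA_2', hue='KMeans')


print(table)
print(f"Fraction of forams in a cluster dominated by their species: {agreement:.2f}")
```

Because K-means is unsupervised, a high agreement here would suggest the species split is recoverable from morphology alone; a low one would not by itself show that the labels are wrong.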