This notebook serves as an introduction to the data set: it details the features, provides a statistical summary, and offers some initial insights through a "first pass" Exploratory Data Analysis. Please note that all code is in R.
Package loading:
require(pacman) # package manager
pacman::p_load( # This is equivalent to library (+install.packages if needed)
# File path constructor (auto root dir determination)
here,
# Read in data
readr,
# Data wrangling
dplyr, forcats, tidyr, purrr,
# Machine learning and statistical generation for convenience
mlr,
# Plotting
ggplot2, ggcorrplot, viridis, ggthemes
)
# Set theme for ggplot2 output
theme_set(theme_few())
Loading required package: pacman
Load the data set:
data <- read_tsv(here::here("Data","Data.txt")) # automatic type inference for the underlying data
Parsed with column specification: cols( .default = col_double(), Label = col_character(), Ident = col_character(), Status = col_character() ) See spec(...) for full column specifications.
How many observations and features (respectively) have been read in?
data %>% dim()
Not every column in the data set is a feature in the machine-learning sense. Some of the features explained below are prefixed with * because they mostly pertain to thumbnail extraction; others are prefixed with -, as they are relevant to Marine Scotland's workflow processes but not to the classification task itself.
Without further ado, here's the description for each column in the data set:
- !Item: Line identifier from source (pid) files.
- Label: Scan identifier (used to link back to scanning metadata).
* X: X coordinate of the centre of gravity of the particle.
* Y: Y coordinate of the centre of gravity of the particle.
* XM: X coordinate of the centre of gravity of the grey level in the particle.
* YM: Y coordinate of the centre of gravity of the grey level in the particle.
* BX: X coordinate of the top left point of the smallest rectangle enclosing the particle (used to extract thumbnails).
* BY: Y coordinate of the top left point of the smallest rectangle enclosing the particle (used to extract thumbnails).
* Width: Width of the smallest rectangle enclosing the particle (used to extract thumbnails).
* Height: Height of the smallest rectangle enclosing the particle (used to extract thumbnails).
* Angle: Angle between the primary axis and a line parallel to the x-axis of the image (used to get particle positioning).
* XStart: X coordinate of the top left point of the image (used to locate particle).
* YStart: Y coordinate of the top left point of the image (used to locate particle).
- XMg5: Unused variable, set to 0 in the data set.
- YMg5: Unused variable, set to 0 in the data set.
- Compentropy: Unused variable, set to 0 in the data set.
- Compmean: as above.
- Compslope: as above.
- CompM1: as above.
- CompM2: as above.
- CompM3: as above.
- Tag: Logical flag for workflow processes to signify whether a particle is taggable (not to be used for classification).
- Status: Whether the entry is part of the learning data set or not. All observations in the data set are part of the learning set, ergo the only value present is Learning.

While the features X, Y, XM and YM refer to the position in the image of the centre of gravity of the object (absolute, and adjusted by grey-value concentration, respectively) and thus do not make sense to use directly for classification, they are kept at this time: as coordinates, they remain useful for computing derived features.
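As a quick sanity check (not part of the original notebook), one can confirm that the unused workflow columns described above are indeed constant before dropping them, using the already-loaded dplyr:

```r
# Status should contain the single value "Learning"
data %>% count(Status)

# The Comp* columns are documented as unused; verify they are identically zero
data %>% summarise_at(vars(starts_with("Comp")), ~ all(. == 0))
```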
There are a number of possible derived features that may be of interest. These are:
- Mean_exc: the average grey value excluding holes within the particle, defined as Mean_exc = IntDen/Area_exc.
- ESD: the Equivalent Spherical Diameter, defined as ESD = $2 \times \sqrt{\textrm{Area}/\pi}$.
- Elongation: also known as "ellipse" elongation, this refers to the ratio of the major and minor axes of the best-fitting ellipse, defined as Elongation = Major/Minor.
- Range: the range of grey values within the particle, defined as Range = Max - Min.
- MeanPos: describes where the mean of the grey values "sits" in relation to the extremes of the grey values' distribution. Defined as MeanPos = (Max - Mean)/Range.
- CentroidsD: describes the distance between the geometric centre of the particle and the centre of mass (via the grey values' distribution). Defined with the usual distance formula: CentroidsD = $\sqrt{(XM-X)^2 + (YM-Y)^2}$.
- CV: the coefficient of variation (also known as the relative standard deviation), expressed as a percentage. Defined as CV = $100\times(\textrm{StdDev}/\textrm{Mean})$.
- SR: defined as SR = $100\times(\textrm{StdDev}/\textrm{Range})$, expresses as a percentage the relationship of the standard deviation to the range.
- PerimAreaexc: the ratio of the perimeter (outside boundary of the particle) to its area excluding any holes, defined as PerimAreaexc = Perim/Area_exc.
- FeretAreaexc: as above, but using Feret's diameter instead of the perimeter.
- PerimFeret: defined as the ratio of the perimeter to Feret's diameter (Perim/Feret).
- PerimMaj: as above, but using the primary axis (Perim/Major).
- Circexc: as circularity, but using the area excluding holes, defined as Circexc = $(4\pi \times \textrm{Area\_exc})/\textrm{Perim}^2$.
- CDexc: defined as $\textrm{CentroidsD}^2/\textrm{Area\_exc}$; this is similar to circularity in that if the result of the division is $4/\pi$, then it satisfies the area equation for a circle (Area = $\pi/4 \times \textrm{Diameter}^2$). This uses the distance between the centre of mass and the geometrical centre, as defined previously.

Furthermore, the Ident column contains the target label, which in R needs to be encoded as a factor.
Based on the above:
data <- data %>%
mutate(Mean_exc = IntDen / Area_exc, #create derived features
ESD = 2 * sqrt(Area / pi),
Elongation = Major / Minor,
Range = Max - Min,
MeanPos = (Max - Mean)/Range,
CentroidsD = sqrt((XM - X)^2 + (YM - Y)^2),
CV = 100 * (StdDev / Mean),
SR = 100 * (StdDev / Range),
PerimAreaexc = Perim. / Area_exc,
FeretAreaexc = Feret / Area_exc,
PerimFeret = Perim. / Feret,
PerimMaj = Perim. / Major,
Circexc = (4*pi*Area_exc) / Perim.^2,
CDexc = (CentroidsD^2)/Area_exc
) %>%
select(-c( #remove unneeded features
X, Y, XM, YM, `!Item`, Label,
BX, BY, Width, XMg5, YMg5,
Height, Angle, XStart, YStart,
Status, Compentropy, Compmean,
Compslope, CompM1,CompM2, CompM3,
Tag
)) %>%
mutate(Ident = as.factor(Ident)) %>% #convert to factor
select(Ident, everything()) #re-arrange so that ident is the first column
This leaves us with 49 columns, i.e. 48 features and the target label:
data %>% names()
For the following table, please note that the dispersion (disp) and median absolute deviation (mad) columns have been dropped:
data %>% mlr::summarizeColumns() %>% select(-disp, -mad) %>% knitr::kable(digits = 2)
|name |type | na| mean| median| min| max| nlevs|
|:------------|:-------|--:|----------:|---------:|---------:|------------:|-----:|
|Ident |factor | 0| NA| NA| 52.00| 376.00| 24|
|Area |numeric | 0| 37980.36| 4015.00| 631.00| 884694.00| 0|
|Mean |numeric | 0| 206.78| 206.79| 120.17| 249.81| 0|
|StdDev |numeric | 0| 30.04| 31.95| 0.75| 76.87| 0|
|Mode |numeric | 0| 233.31| 249.00| 0.00| 255.00| 0|
|Min |numeric | 0| 116.32| 112.00| 0.00| 247.00| 0|
|Max |numeric | 0| 252.53| 253.00| 249.00| 255.00| 0|
|Perim. |numeric | 0| 1550.69| 594.38| 94.08| 68099.09| 0|
|Major |numeric | 0| 255.00| 102.36| 29.88| 2287.11| 0|
|Minor |numeric | 0| 100.49| 51.65| 5.52| 940.29| 0|
|Circ. |numeric | 0| 0.27| 0.17| 0.00| 0.92| 0|
|Feret |numeric | 0| 324.33| 128.79| 30.81| 6482.91| 0|
|IntDen |numeric | 0| 7411707.61| 787846.00| 127810.00| 183715502.00| 0|
|Median |numeric | 0| 208.47| 212.00| 97.00| 251.00| 0|
|Skew |numeric | 0| -0.48| -0.35| -11.66| 2.09| 0|
|Kurt |numeric | 0| 0.59| -0.38| -1.74| 175.78| 0|
|%Area |numeric | 0| 3.24| 0.35| 0.00| 58.70| 0|
|Area_exc |numeric | 0| 36440.82| 3920.00| 476.00| 867595.00| 0|
|Fractal |numeric | 0| 1.20| 1.19| 0.98| 1.79| 0|
|Skelarea |numeric | 0| 3051.11| 477.50| 2.00| 197716.00| 0|
|Slope |numeric | 0| 4.81| 0.62| 0.06| 208.65| 0|
|Histcum1 |numeric | 0| 183.61| 183.00| 58.00| 248.00| 0|
|Histcum2 |numeric | 0| 206.92| 211.00| 96.00| 248.00| 0|
|Histcum3 |numeric | 0| 231.44| 237.00| 146.00| 249.00| 0|
|Nb1 |numeric | 0| 6.57| 2.00| 0.00| 125.00| 0|
|Nb2 |numeric | 0| 6.04| 1.00| 0.00| 244.00| 0|
|Nb3 |numeric | 0| 4.82| 1.00| 0.00| 333.00| 0|
|Symetrieh |numeric | 0| 5.71| 3.12| 1.58| 178.42| 0|
|Symetriev |numeric | 0| 5.73| 3.17| 1.63| 178.44| 0|
|Symetriehc |numeric | 0| 5.82| 3.09| 1.53| 178.42| 0|
|Symetrievc |numeric | 0| 5.85| 3.13| 1.61| 178.44| 0|
|Convperim |numeric | 0| 968.14| 396.00| 110.00| 15108.00| 0|
|Convarea |numeric | 0| 60638.08| 5612.50| 660.00| 5477284.00| 0|
|Fcons |numeric | 0| 87.61| 88.07| 0.00| 3399.12| 0|
|ThickR |numeric | 0| 2.92| 2.37| 0.15| 44.27| 0|
|Mean_exc |numeric | 0| 216.53| 208.34| 120.34| 593.29| 0|
|ESD |numeric | 0| 150.50| 71.50| 28.34| 1061.33| 0|
|Elongation |numeric | 0| 3.33| 1.97| 1.00| 70.72| 0|
|Range |numeric | 0| 136.22| 141.00| 6.00| 255.00| 0|
|MeanPos |numeric | 0| 0.35| 0.35| 0.06| 0.74| 0|
|CentroidsD |numeric | 0| 3.24| 0.93| 0.00| 71.19| 0|
|CV |numeric | 0| 15.81| 15.50| 0.30| 60.35| 0|
|SR |numeric | 0| 21.42| 21.57| 4.75| 35.31| 0|
|PerimAreaexc |numeric | 0| 0.17| 0.11| 0.01| 1.40| 0|
|FeretAreaexc |numeric | 0| 0.04| 0.03| 0.00| 0.36| 0|
|PerimFeret |numeric | 0| 4.52| 3.85| 2.11| 34.24| 0|
|PerimMaj |numeric | 0| 5.92| 4.74| 1.98| 45.66| 0|
|Circexc |numeric | 0| 0.26| 0.17| 0.00| 0.92| 0|
|CDexc |numeric | 0| 0.00| 0.00| 0.00| 0.13| 0|
There are no missing values, but the features are on markedly different scales.
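Should a scale-sensitive model (e.g. k-NN or an SVM) be used later, the numeric features would need standardising first. A minimal sketch using the already-loaded dplyr (this centring/scaling step is not part of the original workflow):

```r
# Standardise every numeric column to zero mean and unit variance;
# as.numeric() strips the matrix attributes that scale() attaches
dataScaled <- data %>%
    mutate_if(is.numeric, ~ as.numeric(scale(.)))
```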
Most of the features are not normally distributed, as can be seen below:
options(repr.plot.width=20, repr.plot.height=15) # Specify plot dimensions for jupyter notebook
data %>%
keep(is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_density() +
theme(text = element_text(size = 14))
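The visual impression can be backed up by a formal test; for instance, a Shapiro-Wilk test on each numeric column (not run in the original notebook, and note that shapiro.test() only accepts between 3 and 5000 observations):

```r
# Shapiro-Wilk p-value per numeric feature;
# small p-values indicate departure from normality
data %>%
    keep(is.numeric) %>%
    map_dbl(~ shapiro.test(.x)$p.value) %>%
    sort()
```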
First, plot the full correlation matrix, hiding correlations that are not statistically significant; weak correlations are filtered out in a subsequent step.
data %>%
select(- Ident) %>%
cor() %>% # calculate the correlation matrix
ggcorrplot::ggcorrplot(., p.mat = ggcorrplot::cor_pmat(data[,2:ncol(data)]),
sig.level = 0.05, # significance level threshold
insig = "blank",
method = "circle", type = "full",
show.diag = FALSE, lab = FALSE,
colors = viridis::plasma(3)
) +
ggtitle("Correlation Matrix")
All squares are filled, i.e. all correlations are significant at the level set (0.05).
Filter out weak correlations by arbitrarily setting a threshold, in this case, |correlation|>=0.8.
corrInteresting <- data %>% select(-Ident) %>% cor() # calculate correlation matrix
corrInteresting[abs(corrInteresting) <= 0.8] <- NA # assign NAs to the weaker correlations
corrInteresting %>% # plot
ggcorrplot::ggcorrplot(., method = "square", type = "lower", lab = TRUE,
colors = viridis::plasma(3)
) +
ggtitle("Correlation Matrix", subtitle = "|Correlations| >= 0.8")
# clean up
rm(corrInteresting)
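The strongly correlated pairs above suggest redundancy among the features. One common remedy, not applied in this notebook, is caret::findCorrelation(), which proposes columns to drop so that no remaining pairwise |correlation| exceeds a cutoff. A sketch, assuming the caret package is installed:

```r
# Suggest features to drop so remaining pairwise |correlations| stay below 0.8
corrMat <- data %>% select(-Ident) %>% cor()
caret::findCorrelation(corrMat, cutoff = 0.8, names = TRUE)
```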
The Ident column codifies the target label for prediction. The possible values and their respective number of cases are presented in the graph below.

The labels can be grouped into non-biological (of no domain importance) and biological (of domain significance). The non-biological labels are badfocus, detritus, Fiber, grey_line and grey_surface; the remaining labels are biological.
Let's plot the number of observations for each class:
options(repr.plot.width=15, repr.plot.height=10) # Specify plot dimensions for jupyter notebook
# Create grouping of certain classes not being of biological importance:
nonBio <- c("badfocus", "detritus", "Fiber", "grey_line", "grey_surface")
# Plot no. of observations for each class
data %>% group_by(Ident) %>% summarise(cases = n()) %>%
mutate(category =
if_else(Ident %in% nonBio, "non-Biological", "Biological")
) %>%
mutate_if(is.character, as.factor) %>% # convert characters to factors
ggplot(aes(fct_reorder(Ident, cases),
cases, fill = category)) +
geom_col() +
geom_label(aes(label = cases, colour = category),
fill = "white", show.legend = F) +
theme(legend.position = c(0.8,0.5), text = element_text(size=18),
legend.background = element_rect(fill = "white", colour = "#cccccc")) +
ylim(0, 400) +
labs(x = "", fill = "Category: ", y = "No. of observations",
title = "Distribution of the data set",
subtitle = "Across target labels and grouped by category"
) +
coord_flip()
# Clean up
rm(nonBio)
Clearly, this is an imbalanced data set.
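The imbalance can be quantified directly: the summary table earlier shows the largest class with 376 observations and the smallest with 52, roughly a 7:1 ratio. A short check (not in the original notebook):

```r
# Ratio of the most to the least frequent class
classCounts <- data %>% count(Ident)
max(classCounts$n) / min(classCounts$n)
```

How to address the imbalance (e.g. stratified resampling or class weights) is left to the modelling stage.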