Predict the age of abalone from physical measurements
The age of an abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings under a microscope, a tedious and time-consuming task. Other measurements, which are easier to obtain, can be used to predict the age. Further information, such as weather patterns and location (and hence food availability), may be required to fully solve the problem.
From the original data, examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous attributes were scaled for use with an ANN (by dividing by 200).
To practice building a prediction model using machine learning techniques.
The model could be used for other problems that require prediction analysis.
Given below are the attribute name, attribute type, measurement unit, and a brief description. The number of rings is the value to predict, either as a continuous value or as a classification problem.
Name | Data Type | Meas. | Description |
---|---|---|---|
Sex | nominal | -- | M, F, and I (infant) |
Length | continuous | mm | Longest shell measurement |
Diameter | continuous | mm | Perpendicular to length |
Height | continuous | mm | With meat in shell |
Whole weight | continuous | grams | whole abalone |
Shucked weight | continuous | grams | weight of meat |
Viscera weight | continuous | grams | gut weight (after bleeding) |
Shell weight | continuous | grams | after being dried |
Rings | integer | -- | +1.5 gives the age in years |
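The summary statistics further down include Male, Female, and Infant indicator columns, which suggests the nominal Sex attribute was one-hot encoded before modeling. A minimal sketch of that step (the variable names `raw` and `data` are assumptions):

```python
import pandas as pd

# Hypothetical raw frame with the nominal Sex column
raw = pd.DataFrame({"Sex": ["M", "F", "I", "M"],
                    "Length": [0.455, 0.530, 0.330, 0.440]})

# One-hot encode Sex into Male / Female / Infant indicator columns
dummies = pd.get_dummies(raw["Sex"]).rename(
    columns={"M": "Male", "F": "Female", "I": "Infant"})
data = pd.concat([raw.drop(columns="Sex"), dummies], axis=1)
```

With this encoding each row carries exactly one of the three indicators, matching the columns shown in `data.describe()` below.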
# box and whisker plots (the layout grid must have at least as many cells as columns; 3x4 fits the 11 columns)
import matplotlib.pyplot as plt
data.plot(kind='box', subplots=True, layout=(3,4), figsize=(12,11), sharex=False, sharey=False)
plt.show()
# histograms
data.hist(figsize=(12,8))
plt.show()
# scatter plot matrix
from pandas.plotting import scatter_matrix
scatter_matrix(data, figsize=(12,8))
plt.show()
data.describe()
  | Length | Diameter | Height | Whole_Weight | Shucked_Weight | Viscera_Weight | Shell_Weight | Rings | Male | Female | Infant |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 4177.000000 | 4177.000000 | 4177.000000 | 4177.000000 | 4177.000000 | 4177.000000 | 4177.000000 | 4177.000000 | 4177.000000 | 4177.000000 | 4177.000000 |
mean | 0.523992 | 0.407881 | 0.139516 | 0.828742 | 0.359367 | 0.180594 | 0.238831 | 9.933684 | 0.365813 | 0.312904 | 0.321283 |
std | 0.120093 | 0.099240 | 0.041827 | 0.490389 | 0.221963 | 0.109614 | 0.139203 | 3.224169 | 0.481715 | 0.463731 | 0.467025 |
min | 0.075000 | 0.055000 | 0.000000 | 0.002000 | 0.001000 | 0.000500 | 0.001500 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.450000 | 0.350000 | 0.115000 | 0.441500 | 0.186000 | 0.093500 | 0.130000 | 8.000000 | 0.000000 | 0.000000 | 0.000000 |
50% | 0.545000 | 0.425000 | 0.140000 | 0.799500 | 0.336000 | 0.171000 | 0.234000 | 9.000000 | 0.000000 | 0.000000 | 0.000000 |
75% | 0.615000 | 0.480000 | 0.165000 | 1.153000 | 0.502000 | 0.253000 | 0.329000 | 11.000000 | 1.000000 | 1.000000 | 1.000000 |
max | 0.815000 | 0.650000 | 1.130000 | 2.825500 | 1.488000 | 0.760000 | 1.005000 | 29.000000 | 1.000000 | 1.000000 | 1.000000 |
We don’t know in advance which algorithms will do well on this problem or which configurations to use. The plots suggest that several attributes have a roughly linear relationship with the number of rings, so we expect generally good results.
Let’s evaluate 3 different algorithms:
1. Multiple Regression
2. Principal Component Analysis (PCA)
3. Neural Networks
Let’s build and evaluate our models:
# Display chart
plot_yyhat(y_test, y_pred)
Next Steps: Improve the model (and reduce the chance of overfitting) by cutting the explanatory factors down to only those that are necessary; if overfitting is present, this may improve prediction accuracy as well.
# Display chart
plot_yyhat(y_test, y_pred2)
# Display chart
plot_yyhat(y_test, y_pred3)
# Chart Results
plot_yyhat(y_test,y_pred4)
# Chart Results
plot_yyhat(y_test,y_pred5)
The best MAE achieved was 1.52, using a multilayer perceptron with 2 hidden layers ([10, 5]), an alpha of 0.01, a learning rate of 0.01, and a *logistic* activation function. Compare this to the result obtained during Part 1, which achieved an MAE of 1.639.
Despite the higher accuracy of the neural network, there are several disadvantages to neural network modeling, including the difficulty of hyperparameter tuning. Notice that we simply used a trial-and-error process to select the hyperparameters. There are very few hard-and-fast rules for selecting appropriate hyperparameters, and it can often be more of an art than a science.
Although cross-validation is meant to guard against overfitting, sometimes it is necessary to manually examine models for potential overfitting issues. Being able to quickly and easily interpret models can be a significant advantage in many cases. Of course, if all you need is accuracy, neural networks might be a great choice.
And after all this work, did we actually solve the core problem? Personally, I would say no. An MAE of ~1.5 rings is not accurate enough for researchers to dispense with measuring the rings manually. Unfortunately, I would recommend keeping the manual ring-counting process in this case.