Machine Learning Modeling of Abalone Data

by Wilfred Morgan



Problem Definition

Description

Informal problem definition:

Predict the age of abalone from physical measurements

Formal problem definition:

Predicting the age of abalone from physical measurements. The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope -- a boring and time-consuming task. Other measurements, which are easier to obtain, are used to predict the age. Further information, such as weather patterns and location (hence food availability), may be required to solve the problem.

From the original data, examples with missing values were removed (the majority having the predicted value missing), and the ranges of the continuous values were scaled for use with an artificial neural network (ANN) by dividing by 200.
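
Because the continuous values were divided by 200, a value in the file can be mapped back to its original unit by multiplying by 200. The sketch below assumes the factor applies uniformly to every continuous column, as the description states; `SCALE_FACTOR` and `to_original_units` are illustrative names, not part of the dataset documentation.

```python
SCALE_FACTOR = 200  # from the dataset description: continuous values were divided by 200


def to_original_units(scaled_value, factor=SCALE_FACTOR):
    """Undo the scaling by multiplying back by the same factor."""
    return scaled_value * factor


# A scaled length of 0.455 corresponds to roughly 91 mm.
length_mm = to_original_units(0.455)
print(round(length_mm, 1))
```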

Problem assumptions:

  • Dataset contains accurate observations
  • No missing values in the dataset
  • Data format is already prepared for importing and analysis

Similar problems:

  • Iris Plant Class Prediction Model

Motivation

Description:

To practice building a prediction model using machine learning techniques.

Solution Benefits (model predictions)

  • Gain a deeper understanding of machine learning techniques.
  • Obtain more experience with Python and data science projects.
  • Build a model that can be adapted to other projects.

Solution Application

The model could be used for other problems that require prediction analysis.

Manual Solution

How is the problem currently solved

The age of abalone is determined by cutting the shell through the cone, staining it, and counting the number of rings through a microscope.

How would a subject matter expert make manual predictions

Predict the age based on the size of the abalone.

How a programmer might hand code a solution

Write a program that estimates the age based on the overall size of the abalone. It would return an age range based on the size.
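
A hand-coded rule of this kind might look like the sketch below. The length cut-offs and age ranges are hypothetical placeholders chosen only for illustration; they are not derived from the data or from expert knowledge.

```python
# Hypothetical (upper length bound, age range in years) pairs; illustrative only.
SIZE_RULES = [
    (0.30, (1, 5)),
    (0.50, (5, 9)),
    (0.65, (9, 13)),
]
OLDEST_RANGE = (13, 30)  # fallback for anything larger; also hypothetical


def estimate_age_range(length):
    """Return a (min_age, max_age) tuple for a scaled shell length."""
    if length <= 0:
        raise ValueError("length must be positive")
    for upper_bound, age_range in SIZE_RULES:
        if length < upper_bound:
            return age_range
    return OLDEST_RANGE


print(estimate_age_range(0.455))
```

A lookup table keeps the rule easy to audit, but it ignores every measurement except length, which is the main limitation a learned model is meant to address.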


Exploratory Data Analysis

Basic Information

  • Number of instances: 4177
  • Number of attributes: 8 (plus the Rings column used to derive the target)
  • Target variable: Age (age in years = Rings + 1.5)

Attribute information:

Each attribute is listed with its name, data type, measurement unit, and a brief description. The number of rings is the value to predict, either as a continuous value or as a class label.

Name            Data Type   Meas.   Description
--------------  ----------  ------  ----------------------------
Sex             nominal             M, F, and I (infant)
Length          continuous  mm      Longest shell measurement
Diameter        continuous  mm      perpendicular to length
Height          continuous  mm      with meat in shell
Whole weight    continuous  grams   whole abalone
Shucked weight  continuous  grams   weight of meat
Viscera weight  continuous  grams   gut weight (after bleeding)
Shell weight    continuous  grams   after being dried
Rings           integer             +1.5 gives the age in years
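
The Rings row implies a direct conversion from ring count to age, and the two problem framings differ only in how the target is encoded. The sketch below uses hypothetical helper names and an arbitrary three-way grouping for the classification case; the bin edges are illustrative and are not taken from the dataset documentation.

```python
def rings_to_age(rings):
    """Age in years is the ring count plus 1.5, per the attribute table."""
    return rings + 1.5


def rings_to_class(rings, young_max=8, adult_max=10):
    """Hypothetical grouping for treating the task as classification."""
    if rings <= young_max:
        return "young"
    if rings <= adult_max:
        return "adult"
    return "old"


print(rings_to_age(15))
print(rings_to_class(15))
```
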
In [1]:
# Import libraries
import pandas as pd
import numpy as np
In [2]:
# Import Data
data_location = "/Users/wmemorgan/Google Drive/Computer_Data_Science_Lab/abalone/data/02_prepared_data/abalone.data"
column_names = ['Sex','Length','Diameter','Height','Whole_Weight',
                'Shucked_Weight','Viscera_Weight','Shell_Weight','Rings']
data = pd.read_csv(data_location, names=column_names)

Display Sample Data

In [3]:
#Verify number of observations
len(data)
Out[3]:
4177
In [4]:
# Shape
print(data.shape)
(4177, 9)

Peek at the Data

In [5]:
data.head()
Out[5]:
   Sex  Length  Diameter  Height  Whole_Weight  Shucked_Weight  Viscera_Weight  Shell_Weight  Rings
0    M   0.455     0.365   0.095        0.5140          0.2245          0.1010         0.150     15
1    M   0.350     0.265   0.090        0.2255          0.0995          0.0485         0.070      7
2    F   0.530     0.420   0.135        0.6770          0.2565          0.1415         0.210      9
3    M   0.440     0.365   0.125        0.5160          0.2155          0.1140         0.155     10
4    I   0.330     0.255   0.080        0.2050          0.0895          0.0395         0.055      7
In [6]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4177 entries, 0 to 4176
Data columns (total 9 columns):
Sex               4177 non-null object
Length            4177 non-null float64
Diameter          4177 non-null float64
Height            4177 non-null float64
Whole_Weight      4177 non-null float64
Shucked_Weight    4177 non-null float64
Viscera_Weight    4177 non-null float64
Shell_Weight      4177 non-null float64
Rings             4177 non-null int64
dtypes: float64(7), int64(1), object(1)
memory usage: 293.8+ KB
In [7]:
data.describe()
Out[7]:
            Length     Diameter       Height  Whole_Weight  Shucked_Weight  Viscera_Weight  Shell_Weight        Rings
count  4177.000000  4177.000000  4177.000000   4177.000000     4177.000000     4177.000000   4177.000000  4177.000000
mean      0.523992     0.407881     0.139516      0.828742        0.359367        0.180594      0.238831     9.933684
std       0.120093     0.099240     0.041827      0.490389        0.221963        0.109614      0.139203     3.224169
min       0.075000     0.055000     0.000000      0.002000        0.001000        0.000500      0.001500     1.000000
25%       0.450000     0.350000     0.115000      0.441500        0.186000        0.093500      0.130000     8.000000
50%       0.545000     0.425000     0.140000      0.799500        0.336000        0.171000      0.234000     9.000000
75%       0.615000     0.480000     0.165000      1.153000        0.502000        0.253000      0.329000    11.000000
max       0.815000     0.650000     1.130000      2.825500        1.488000        0.760000      1.005000    29.000000
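
The summary reports a minimum Height of 0 and a maximum of 1.13 against a 75th percentile of 0.165, which is worth checking against the assumption that the observations are accurate. One possible way to list such rows is sketched below. `suspect_heights` is a hypothetical helper, the 1.5 × IQR fence is a common convention rather than anything prescribed by the dataset, and the small stand-in frame holds three heights from the sample rows plus the reported minimum and maximum.

```python
import pandas as pd


def suspect_heights(df, column="Height", iqr_multiplier=1.5):
    """Return rows whose value is non-positive or above the upper IQR fence."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    upper_fence = q3 + iqr_multiplier * (q3 - q1)
    mask = (df[column] <= 0) | (df[column] > upper_fence)
    return df[mask]


# Stand-in frame: three heights from data.head() plus the reported min and max.
sample = pd.DataFrame({"Height": [0.095, 0.090, 0.135, 0.0, 1.13]})
flagged = suspect_heights(sample)
print(flagged)
```

In the notebook itself the same helper could be called as `suspect_heights(data)`.
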
In [8]:
# Class distribution by Sex (M, F, and I for infant)
print(data.groupby('Sex').size())
Sex
F    1307
I    1342
M    1528
dtype: int64
In [9]:
# Class distribution by Rings
print(data.groupby('Rings').size())
Rings
1       1
2       1
3      15
4      57
5     115
6     259
7     391
8     568
9     689
10    634
11    487
12    267
13    203
14    126
15    103
16     67
17     58
18     42
19     32
20     26
21     14
22      6
23      9
24      2
25      1
26      1
27      2
29      1
dtype: int64
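
If Rings is treated as a class label, several classes have very few examples, which may matter when the task is framed as classification. The sketch below works directly from the counts printed above; the cut-off of 10 samples for calling a class "rare" is a hypothetical choice made for illustration.

```python
# Ring counts copied from the groupby output above.
ring_counts = {
    1: 1, 2: 1, 3: 15, 4: 57, 5: 115, 6: 259, 7: 391, 8: 568, 9: 689,
    10: 634, 11: 487, 12: 267, 13: 203, 14: 126, 15: 103, 16: 67, 17: 58,
    18: 42, 19: 32, 20: 26, 21: 14, 22: 6, 23: 9, 24: 2, 25: 1, 26: 1,
    27: 2, 29: 1,
}

MIN_SAMPLES = 10  # hypothetical cut-off for calling a class "rare"

total = sum(ring_counts.values())
rare_classes = [rings for rings, n in ring_counts.items() if n < MIN_SAMPLES]
rare_samples = sum(ring_counts[r] for r in rare_classes)

print(total, rare_classes, rare_samples)
```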

Initial Visualization

In [10]:
# Import libraries
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

Univariate Plots

In [11]:
# box and whisker plots
data.plot(kind='box', subplots=True, layout=(2,4), figsize=(12,11), sharex=False, sharey=False)
plt.show()
In [12]:
# histograms
data.hist(figsize=(12,8))
plt.show()

Multivariate Plots

In [13]:
# pair plot of the numeric features, colored by ring count
sns.pairplot(data=data, hue="Rings")
plt.show()
In [33]:
# scatter plot matrix
from pandas.plotting import scatter_matrix
scatter_matrix(data, figsize=(12,8))

plt.show()