# 데이터 다루기 (Data Handling) - Iris Data Set¶

## 1. Iris(붓꽃) Data Set 개요¶

• 참고: https://archive.ics.uci.edu/ml/datasets/Iris
• 데이터 원본: https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
• 데이터 제목: Iris Plants Database
• 원작자 정보
• Creator: R.A. Fisher
• Donor: Michael Marshall (MARSHALL%[email protected])
• Date: July, 1988
• 사례 개수: 150
• 각 붓꽃 종류당 50개씩
• 4개의 독립 변수(Feature)
• a: 꽃받침 길이 (sepal length in cm)
• b: 꽃받침 넓이 (sepal width in cm)
• c: 꽃잎 길이 (petal length in cm)
• d: 꽃이 넓이 (petal width in cm)
• 3부류의 백합 종류 (Image Source: wikipedia)
Iris SetosaIris VersicolorIris Virginica
• 참고 (Image Source: wikipedia)
• Sepal(꽃받침), Petal(꽃잎)

## 2. 데이터 로딩하기¶

In [1]:
import urllib2
from scipy import stats
from pandas import Series, DataFrame
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

path = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
raw_csv = urllib2.urlopen(path)
feature_names = ('sepal length', 'sepal width', 'petal length', 'petal width')
all_names = feature_names + ('class',)
df = pd.read_csv(raw_csv, names=all_names)

/Users/yhhan/anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:273: UserWarning: Matplotlib is building the font cache using fc-list. This may take a moment.
warnings.warn('Matplotlib is building the font cache using fc-list. This may take a moment.')

In [2]:
df

Out[2]:
sepal length sepal width petal length petal width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
11 4.8 3.4 1.6 0.2 Iris-setosa
12 4.8 3.0 1.4 0.1 Iris-setosa
13 4.3 3.0 1.1 0.1 Iris-setosa
14 5.8 4.0 1.2 0.2 Iris-setosa
15 5.7 4.4 1.5 0.4 Iris-setosa
16 5.4 3.9 1.3 0.4 Iris-setosa
17 5.1 3.5 1.4 0.3 Iris-setosa
18 5.7 3.8 1.7 0.3 Iris-setosa
19 5.1 3.8 1.5 0.3 Iris-setosa
20 5.4 3.4 1.7 0.2 Iris-setosa
21 5.1 3.7 1.5 0.4 Iris-setosa
22 4.6 3.6 1.0 0.2 Iris-setosa
23 5.1 3.3 1.7 0.5 Iris-setosa
24 4.8 3.4 1.9 0.2 Iris-setosa
25 5.0 3.0 1.6 0.2 Iris-setosa
26 5.0 3.4 1.6 0.4 Iris-setosa
27 5.2 3.5 1.5 0.2 Iris-setosa
28 5.2 3.4 1.4 0.2 Iris-setosa
29 4.7 3.2 1.6 0.2 Iris-setosa
... ... ... ... ... ...
120 6.9 3.2 5.7 2.3 Iris-virginica
121 5.6 2.8 4.9 2.0 Iris-virginica
122 7.7 2.8 6.7 2.0 Iris-virginica
123 6.3 2.7 4.9 1.8 Iris-virginica
124 6.7 3.3 5.7 2.1 Iris-virginica
125 7.2 3.2 6.0 1.8 Iris-virginica
126 6.2 2.8 4.8 1.8 Iris-virginica
127 6.1 3.0 4.9 1.8 Iris-virginica
128 6.4 2.8 5.6 2.1 Iris-virginica
129 7.2 3.0 5.8 1.6 Iris-virginica
130 7.4 2.8 6.1 1.9 Iris-virginica
131 7.9 3.8 6.4 2.0 Iris-virginica
132 6.4 2.8 5.6 2.2 Iris-virginica
133 6.3 2.8 5.1 1.5 Iris-virginica
134 6.1 2.6 5.6 1.4 Iris-virginica
135 7.7 3.0 6.1 2.3 Iris-virginica
136 6.3 3.4 5.6 2.4 Iris-virginica
137 6.4 3.1 5.5 1.8 Iris-virginica
138 6.0 3.0 4.8 1.8 Iris-virginica
139 6.9 3.1 5.4 2.1 Iris-virginica
140 6.7 3.1 5.6 2.4 Iris-virginica
141 6.9 3.1 5.1 2.3 Iris-virginica
142 5.8 2.7 5.1 1.9 Iris-virginica
143 6.8 3.2 5.9 2.3 Iris-virginica
144 6.7 3.3 5.7 2.5 Iris-virginica
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 5 columns

• data frame의 describe()를 통해 기본적인 통계치 알아보기
In [3]:
df.describe()

Out[3]:
sepal length sepal width petal length petal width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
In [4]:
iris_names = ('Iris-setosa', 'Iris-versicolor', 'Iris-virginica')
df_group = df.groupby('class')['class']
print df_group.count()

Iris_Se_Sub_Df = df[df['class'] == iris_names[0]]
Iris_Ve_Sub_Df = df[df['class'] == iris_names[1]]
Iris_Vi_Sub_Df = df[df['class'] == iris_names[2]]
print
print Iris_Se_Sub_Df

class
Iris-setosa        50
Iris-versicolor    50
Iris-virginica     50
Name: class, dtype: int64

sepal length  sepal width  petal length  petal width        class
0            5.1          3.5           1.4          0.2  Iris-setosa
1            4.9          3.0           1.4          0.2  Iris-setosa
2            4.7          3.2           1.3          0.2  Iris-setosa
3            4.6          3.1           1.5          0.2  Iris-setosa
4            5.0          3.6           1.4          0.2  Iris-setosa
5            5.4          3.9           1.7          0.4  Iris-setosa
6            4.6          3.4           1.4          0.3  Iris-setosa
7            5.0          3.4           1.5          0.2  Iris-setosa
8            4.4          2.9           1.4          0.2  Iris-setosa
9            4.9          3.1           1.5          0.1  Iris-setosa
10           5.4          3.7           1.5          0.2  Iris-setosa
11           4.8          3.4           1.6          0.2  Iris-setosa
12           4.8          3.0           1.4          0.1  Iris-setosa
13           4.3          3.0           1.1          0.1  Iris-setosa
14           5.8          4.0           1.2          0.2  Iris-setosa
15           5.7          4.4           1.5          0.4  Iris-setosa
16           5.4          3.9           1.3          0.4  Iris-setosa
17           5.1          3.5           1.4          0.3  Iris-setosa
18           5.7          3.8           1.7          0.3  Iris-setosa
19           5.1          3.8           1.5          0.3  Iris-setosa
20           5.4          3.4           1.7          0.2  Iris-setosa
21           5.1          3.7           1.5          0.4  Iris-setosa
22           4.6          3.6           1.0          0.2  Iris-setosa
23           5.1          3.3           1.7          0.5  Iris-setosa
24           4.8          3.4           1.9          0.2  Iris-setosa
25           5.0          3.0           1.6          0.2  Iris-setosa
26           5.0          3.4           1.6          0.4  Iris-setosa
27           5.2          3.5           1.5          0.2  Iris-setosa
28           5.2          3.4           1.4          0.2  Iris-setosa
29           4.7          3.2           1.6          0.2  Iris-setosa
30           4.8          3.1           1.6          0.2  Iris-setosa
31           5.4          3.4           1.5          0.4  Iris-setosa
32           5.2          4.1           1.5          0.1  Iris-setosa
33           5.5          4.2           1.4          0.2  Iris-setosa
34           4.9          3.1           1.5          0.1  Iris-setosa
35           5.0          3.2           1.2          0.2  Iris-setosa
36           5.5          3.5           1.3          0.2  Iris-setosa
37           4.9          3.1           1.5          0.1  Iris-setosa
38           4.4          3.0           1.3          0.2  Iris-setosa
39           5.1          3.4           1.5          0.2  Iris-setosa
40           5.0          3.5           1.3          0.3  Iris-setosa
41           4.5          2.3           1.3          0.3  Iris-setosa
42           4.4          3.2           1.3          0.2  Iris-setosa
43           5.0          3.5           1.6          0.6  Iris-setosa
44           5.1          3.8           1.9          0.4  Iris-setosa
45           4.8          3.0           1.4          0.3  Iris-setosa
46           5.1          3.8           1.6          0.2  Iris-setosa
47           4.6          3.2           1.4          0.2  Iris-setosa
48           5.3          3.7           1.5          0.2  Iris-setosa
49           5.0          3.3           1.4          0.2  Iris-setosa


## 3. 탐색적 자료 분석 (Exploratory data analysis)¶

• Sepal length (a) and Sepal width (b)
• Sepal length (a) and Petal length (c)
• Sepal length (a) and Petal width (d)
• Sepal width (b) and Petal length (c)
• Sepal width (b) and Petal width (d)
• Petal length (c) and Petal width (d)
In [5]:
unit_str = ' (cm)'
options = {
0: {
'data_x': feature_names[0],
'data_y': feature_names[1],
'label_x': feature_names[0] + unit_str,
'label_y': feature_names[1] + unit_str,
'ylim_min': 1.5,
'ylim_max': 5.0
},
1: {
'data_x': feature_names[0],
'data_y': feature_names[2],
'label_x': feature_names[0] + unit_str,
'label_y': feature_names[2] + unit_str,
'ylim_min': 0.0,
'ylim_max': 9.0
},
2: {
'data_x': feature_names[0],
'data_y': feature_names[3],
'label_x': feature_names[0] + unit_str,
'label_y': feature_names[3] + unit_str,
'ylim_min': -0.5,
'ylim_max': 3.5
},
3: {
'data_x': feature_names[1],
'data_y': feature_names[2],
'label_x': feature_names[1] + unit_str,
'label_y': feature_names[2] + unit_str,
'ylim_min': 0.0,
'ylim_max': 9.0
},
4: {
'data_x': feature_names[1],
'data_y': feature_names[3],
'label_x': feature_names[1] + unit_str,
'label_y': feature_names[3] + unit_str,
'ylim_min': 0.0,
'ylim_max': 3.5
},
5: {
'data_x': feature_names[2],
'data_y': feature_names[3],
'label_x': feature_names[2] + unit_str,
'label_y': feature_names[3] + unit_str,
'ylim_min': 0.0,
'ylim_max': 3.5
}
}
ax = []
fig = plt.figure(figsize=(17, 12))
for i in range(0,6):

for i in range(0,6):
se = ax[i].scatter(Iris_Se_Sub_Df[options[i]['data_x']], Iris_Se_Sub_Df[options[i]['data_y']], color='red')
ve = ax[i].scatter(Iris_Ve_Sub_Df[options[i]['data_x']], Iris_Ve_Sub_Df[options[i]['data_y']], color='blue')
vi = ax[i].scatter(Iris_Vi_Sub_Df[options[i]['data_x']], Iris_Vi_Sub_Df[options[i]['data_y']], color='green')
ax[i].set_xlabel(options[i]['label_x'])
ax[i].set_ylabel(options[i]['label_y'])
ax[i].set_ylim([options[i]['ylim_min'], options[i]['ylim_max']])
ax[i].legend((se, ve, vi), iris_names)

In [6]:
df2 = df.ix[:,0:4]
df2[0:5]

Out[6]:
sepal length sepal width petal length petal width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
In [21]:
from pandas.tools.plotting import scatter_matrix

_ = scatter_matrix(df2, figsize=(9,9), diagonal='kde')

• KDE: Kernel Density Estimattion (위 그래프의 대각선에 있는 Line 그래프 4개)
• 해당 변수가 가지는 값의 범위와 그 값을 가지는 정도 (히스토그램에 대한 Smoothing 그래프)
• 자세한 설명: http://darkpgmr.tistory.com/147
In [23]:
stats = {}
for i in range(0,4):
stats[i] = {}
stats[i]['mean'] = (Iris_Se_Sub_Df[feature_names[i]].mean(),
Iris_Ve_Sub_Df[feature_names[i]].mean(),
Iris_Vi_Sub_Df[feature_names[i]].mean())
stats[i]['std'] = (Iris_Se_Sub_Df[feature_names[i]].std(),
Iris_Ve_Sub_Df[feature_names[i]].std(),
Iris_Vi_Sub_Df[feature_names[i]].std())

ind = Series([0.5, 1.5, 2.5])
width = 0.5
fig = plt.figure(figsize=(20, 5))
ay = []
for i in range(0,4):

for i in range(0,4):
ay[i].bar(ind, stats[i]['mean'], 0.5, color='magenta', yerr=stats[i]['std'])
ay[i].set_xlim([0, 3.5])
ay[i].set_ylabel('Mean of ' + feature_names[i])
ay[i].set_xticks(ind + width/2)
ay[i].set_xticklabels(iris_names)

• box plot
In [85]:
_ = df2.boxplot()

/Users/yhhan/anaconda/lib/python2.7/site-packages/ipykernel/__main__.py:1: FutureWarning:
The default value for 'return_type' will change to 'axes' in a future release.
To use the future behavior now, set return_type='axes'.
To keep the previous behavior and silence this warning, set return_type='dict'.
if __name__ == '__main__':


## 4. scikit 활용한 데이터 탐색¶

In [9]:
from sklearn.datasets import load_iris
from sklearn import tree
print type(iris)

<class 'sklearn.datasets.base.Bunch'>

In [10]:
iris.keys()

Out[10]:
['target_names', 'data', 'target', 'DESCR', 'feature_names']
In [11]:
iris.target_names

Out[11]:
array(['setosa', 'versicolor', 'virginica'],
dtype='|S10')
In [12]:
iris.feature_names

Out[12]:
['sepal length (cm)',
'sepal width (cm)',
'petal length (cm)',
'petal width (cm)']
In [13]:
print iris.DESCR

Iris Plants Database

Notes
-----
Data Set Characteristics:
:Number of Instances: 150 (50 in each of three classes)
:Number of Attributes: 4 numeric, predictive attributes and the class
:Attribute Information:
- sepal length in cm
- sepal width in cm
- petal length in cm
- petal width in cm
- class:
- Iris-Setosa
- Iris-Versicolour
- Iris-Virginica
:Summary Statistics:

============== ==== ==== ======= ===== ====================
Min  Max   Mean    SD   Class Correlation
============== ==== ==== ======= ===== ====================
sepal length:   4.3  7.9   5.84   0.83    0.7826
sepal width:    2.0  4.4   3.05   0.43   -0.4194
petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
============== ==== ==== ======= ===== ====================

:Missing Attribute Values: None
:Class Distribution: 33.3% for each of 3 classes.
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%[email protected])
:Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
- Fisher,R.A. "The use of multiple measurements in taxonomic problems"
Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
Mathematical Statistics" (John Wiley, NY, 1950).
- Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
(Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
- Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
Structure and Classification Rule for Recognition in Partially Exposed
Environments".  IEEE Transactions on Pattern Analysis and Machine
Intelligence, Vol. PAMI-2, No. 1, 67-71.
- Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
on Information Theory, May 1972, 431-433.
- See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
conceptual clustering system finds 3 classes in the data.
- Many, many more ...


In [14]:
iris.data[0:5]

Out[14]:
array([[ 5.1,  3.5,  1.4,  0.2],
[ 4.9,  3. ,  1.4,  0.2],
[ 4.7,  3.2,  1.3,  0.2],
[ 4.6,  3.1,  1.5,  0.2],
[ 5. ,  3.6,  1.4,  0.2]])
In [15]:
iris.target[0:5]

Out[15]:
array([0, 0, 0, 0, 0])
In [16]:
iris.data[50:55]

Out[16]:
array([[ 7. ,  3.2,  4.7,  1.4],
[ 6.4,  3.2,  4.5,  1.5],
[ 6.9,  3.1,  4.9,  1.5],
[ 5.5,  2.3,  4. ,  1.3],
[ 6.5,  2.8,  4.6,  1.5]])
In [17]:
iris.target[50:55]

Out[17]:
array([1, 1, 1, 1, 1])

## 5. Spark을 활용한 데이터 탐색¶

In [5]:
import findspark

findspark.init()
from pyspark import SparkContext, SparkFiles, SQLContext

if not 'sc' in locals():
sc = SparkContext()

import urllib
_ = urllib.urlretrieve ("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", "iris.data")
data_file = "./iris.data"
raw_data = sc.textFile(data_file)

# num of parallel cores
print sc.defaultParallelism

print raw_data.count()

raw_data = sc.textFile(data_file).filter(lambda x: x != '')
print raw_data.count()

print raw_data.take(5)

2
151
150
[u'5.1,3.5,1.4,0.2,Iris-setosa', u'4.9,3.0,1.4,0.2,Iris-setosa', u'4.7,3.2,1.3,0.2,Iris-setosa', u'4.6,3.1,1.5,0.2,Iris-setosa', u'5.0,3.6,1.4,0.2,Iris-setosa']

In [60]:
def parse_raw_data(line):
line_split = line.split(",")[0:4]
return np.array([float(x) for x in line_split])

vector_data = raw_data.map(parse_raw_data)
print vector_data.take(5)

[array([ 5.1,  3.5,  1.4,  0.2]), array([ 4.9,  3. ,  1.4,  0.2]), array([ 4.7,  3.2,  1.3,  0.2]), array([ 4.6,  3.1,  1.5,  0.2]), array([ 5. ,  3.6,  1.4,  0.2])]

• Summary statistics using Spark ML
In [74]:
from pyspark.mllib.stat import Statistics
from math import sqrt

# Compute column summary statistics.
summary = Statistics.colStats(vector_data)

print "Statistics:"
for i in range(4):
print " Mean - {}: {}".format(feature_names[i], round(summary.mean()[i],3))
print " St. Dev - {}: {}".format(feature_names[i], round(sqrt(summary.variance()[i]),3))
print " Max value - {}: {}".format(feature_names[i], round(summary.max()[i],3))
print " Min value - {}: {}".format(feature_names[i], round(summary.min()[i],3))
print " Number of non-zero values - {}: {}".format(feature_names[i], summary.numNonzeros()[i])
print

Statistics:
Mean - sepal length: 5.843
St. Dev - sepal length: 0.828
Max value - sepal length: 7.9
Min value - sepal length: 4.3
Number of non-zero values - sepal length: 150.0

Mean - sepal width: 3.054
St. Dev - sepal width: 0.434
Max value - sepal width: 4.4
Min value - sepal width: 2.0
Number of non-zero values - sepal width: 150.0

Mean - petal length: 3.759
St. Dev - petal length: 1.764
Max value - petal length: 6.9
Min value - petal length: 1.0
Number of non-zero values - petal length: 150.0

Mean - petal width: 1.199
St. Dev - petal width: 0.763
Max value - petal width: 2.5
Min value - petal width: 0.1
Number of non-zero values - petal width: 150.0


• Correlation
In [75]:
from pyspark.mllib.stat import Statistics
correlation_matrix = Statistics.corr(vector_data, method="spearman")

In [76]:
print type(correlation_matrix)

<type 'numpy.ndarray'>

In [77]:
print pd.DataFrame(correlation_matrix)

          0         1         2         3
0  1.000000 -0.159457  0.881386  0.834421
1 -0.159457  1.000000 -0.303421 -0.277511
2  0.881386 -0.303421  1.000000  0.936003
3  0.834421 -0.277511  0.936003  1.000000