The Iris dataset and pandas

Python Data Analysis Library

https://pandas.pydata.org/

The pandas website.


Wes McKinney: pandas in 10 minutes | Walkthrough

https://www.youtube.com/watch?v=_T8LGqJtuGc

Video by the creator of pandas.


Python for Data Analysis notebooks

https://github.com/wesm/pydata-book

Materials and IPython notebooks for "Python for Data Analysis" by Wes McKinney, published by O'Reilly Media.


10 Minutes to pandas

http://pandas.pydata.org/pandas-docs/stable/10min.html

Official pandas tutorial.


UC Irvine Machine Learning Repository: Iris Data Set

https://archive.ics.uci.edu/ml/datasets/iris

About the Iris data set from UC Irvine's machine learning repository.

Loading data

In [1]:
# Import pandas.
import pandas as pd
In [2]:
# Load the iris data set from a URL.
df = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")
In [3]:
df
Out[3]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
7 5.0 3.4 1.5 0.2 setosa
8 4.4 2.9 1.4 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa
10 5.4 3.7 1.5 0.2 setosa
11 4.8 3.4 1.6 0.2 setosa
12 4.8 3.0 1.4 0.1 setosa
13 4.3 3.0 1.1 0.1 setosa
14 5.8 4.0 1.2 0.2 setosa
15 5.7 4.4 1.5 0.4 setosa
16 5.4 3.9 1.3 0.4 setosa
17 5.1 3.5 1.4 0.3 setosa
18 5.7 3.8 1.7 0.3 setosa
19 5.1 3.8 1.5 0.3 setosa
20 5.4 3.4 1.7 0.2 setosa
21 5.1 3.7 1.5 0.4 setosa
22 4.6 3.6 1.0 0.2 setosa
23 5.1 3.3 1.7 0.5 setosa
24 4.8 3.4 1.9 0.2 setosa
25 5.0 3.0 1.6 0.2 setosa
26 5.0 3.4 1.6 0.4 setosa
27 5.2 3.5 1.5 0.2 setosa
28 5.2 3.4 1.4 0.2 setosa
29 4.7 3.2 1.6 0.2 setosa
... ... ... ... ... ...
120 6.9 3.2 5.7 2.3 virginica
121 5.6 2.8 4.9 2.0 virginica
122 7.7 2.8 6.7 2.0 virginica
123 6.3 2.7 4.9 1.8 virginica
124 6.7 3.3 5.7 2.1 virginica
125 7.2 3.2 6.0 1.8 virginica
126 6.2 2.8 4.8 1.8 virginica
127 6.1 3.0 4.9 1.8 virginica
128 6.4 2.8 5.6 2.1 virginica
129 7.2 3.0 5.8 1.6 virginica
130 7.4 2.8 6.1 1.9 virginica
131 7.9 3.8 6.4 2.0 virginica
132 6.4 2.8 5.6 2.2 virginica
133 6.3 2.8 5.1 1.5 virginica
134 6.1 2.6 5.6 1.4 virginica
135 7.7 3.0 6.1 2.3 virginica
136 6.3 3.4 5.6 2.4 virginica
137 6.4 3.1 5.5 1.8 virginica
138 6.0 3.0 4.8 1.8 virginica
139 6.9 3.1 5.4 2.1 virginica
140 6.7 3.1 5.6 2.4 virginica
141 6.9 3.1 5.1 2.3 virginica
142 5.8 2.7 5.1 1.9 virginica
143 6.8 3.2 5.9 2.3 virginica
144 6.7 3.3 5.7 2.5 virginica
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns
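
The load above needs network access. As a minimal sketch of the same `read_csv` call working offline, the block below parses a six-row excerpt, copied from the output above, out of an in-memory string. The names `csv_text` and `sample` are illustrative and not part of the original notebook, and `io.StringIO` simply stands in for the URL.

```python
import io

import pandas as pd

# Six rows copied from the output above; the full file has 150 rows.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.2,3.4,5.4,2.3,virginica
5.9,3.0,5.1,1.8,virginica
"""

# read_csv accepts any file-like object, not only a path or URL.
sample = pd.read_csv(io.StringIO(csv_text))

# Quick checks on what was loaded: dimensions and per-column types.
print(sample.shape)
print(sample.dtypes)
```

Checking `shape` and `dtypes` straight after loading is a cheap way to confirm that the header row and the numeric columns were parsed as expected.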


Selecting rows and columns

In [4]:
# Select one column by name; the result is a Series.
df['species']
Out[4]:
0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
5         setosa
6         setosa
7         setosa
8         setosa
9         setosa
10        setosa
11        setosa
12        setosa
13        setosa
14        setosa
15        setosa
16        setosa
17        setosa
18        setosa
19        setosa
20        setosa
21        setosa
22        setosa
23        setosa
24        setosa
25        setosa
26        setosa
27        setosa
28        setosa
29        setosa
         ...    
120    virginica
121    virginica
122    virginica
123    virginica
124    virginica
125    virginica
126    virginica
127    virginica
128    virginica
129    virginica
130    virginica
131    virginica
132    virginica
133    virginica
134    virginica
135    virginica
136    virginica
137    virginica
138    virginica
139    virginica
140    virginica
141    virginica
142    virginica
143    virginica
144    virginica
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: species, Length: 150, dtype: object
In [5]:
# Select several columns with a list of names; the result is a DataFrame.
df[['petal_length', 'species']]
Out[5]:
petal_length species
0 1.4 setosa
1 1.4 setosa
2 1.3 setosa
3 1.5 setosa
4 1.4 setosa
5 1.7 setosa
6 1.4 setosa
7 1.5 setosa
8 1.4 setosa
9 1.5 setosa
10 1.5 setosa
11 1.6 setosa
12 1.4 setosa
13 1.1 setosa
14 1.2 setosa
15 1.5 setosa
16 1.3 setosa
17 1.4 setosa
18 1.7 setosa
19 1.5 setosa
20 1.7 setosa
21 1.5 setosa
22 1.0 setosa
23 1.7 setosa
24 1.9 setosa
25 1.6 setosa
26 1.6 setosa
27 1.5 setosa
28 1.4 setosa
29 1.6 setosa
... ... ...
120 5.7 virginica
121 4.9 virginica
122 6.7 virginica
123 4.9 virginica
124 5.7 virginica
125 6.0 virginica
126 4.8 virginica
127 4.9 virginica
128 5.6 virginica
129 5.8 virginica
130 6.1 virginica
131 6.4 virginica
132 5.6 virginica
133 5.1 virginica
134 5.6 virginica
135 6.1 virginica
136 5.6 virginica
137 5.5 virginica
138 4.8 virginica
139 5.4 virginica
140 5.6 virginica
141 5.1 virginica
142 5.1 virginica
143 5.9 virginica
144 5.7 virginica
145 5.2 virginica
146 5.0 virginica
147 5.2 virginica
148 5.4 virginica
149 5.1 virginica

150 rows × 2 columns

In [6]:
# Slice rows by position: rows 2 to 5 (the end, 6, is excluded).
df[2:6]
Out[6]:
sepal_length sepal_width petal_length petal_width species
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
In [7]:
# Select two columns, then rows 2 to 5 by position.
df[['petal_length', 'species']][2:6]
Out[7]:
petal_length species
2 1.3 setosa
3 1.5 setosa
4 1.4 setosa
5 1.7 setosa
In [8]:
# Slice rows by label with .loc: both endpoints, 2 and 6, are included.
df.loc[2:6]
Out[8]:
sepal_length sepal_width petal_length petal_width species
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
6 4.6 3.4 1.4 0.3 setosa
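
The two slices above return different numbers of rows even though both are written `2:6`. The sketch below reproduces that on a small made-up frame (`demo` and its `value` column are hypothetical), so the difference between positional and label-based slicing can be checked without the Iris data.

```python
import pandas as pd

# A made-up frame with the default integer index 0..9.
demo = pd.DataFrame({"value": range(10, 20)})

plain_slice = demo[2:6]       # positional: rows 2, 3, 4, 5
by_position = demo.iloc[2:6]  # positional: rows 2, 3, 4, 5
by_label = demo.loc[2:6]      # label-based: rows 2 through 6, end included

print(len(plain_slice), len(by_position), len(by_label))  # prints: 4 4 5
```

With the default index the labels and positions coincide, which is why the only visible difference is the extra final row from `.loc`.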
In [9]:
# All rows, one column by label; the result is a Series.
df.loc[:, 'species']
Out[9]:
0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
5         setosa
6         setosa
7         setosa
8         setosa
9         setosa
10        setosa
11        setosa
12        setosa
13        setosa
14        setosa
15        setosa
16        setosa
17        setosa
18        setosa
19        setosa
20        setosa
21        setosa
22        setosa
23        setosa
24        setosa
25        setosa
26        setosa
27        setosa
28        setosa
29        setosa
         ...    
120    virginica
121    virginica
122    virginica
123    virginica
124    virginica
125    virginica
126    virginica
127    virginica
128    virginica
129    virginica
130    virginica
131    virginica
132    virginica
133    virginica
134    virginica
135    virginica
136    virginica
137    virginica
138    virginica
139    virginica
140    virginica
141    virginica
142    virginica
143    virginica
144    virginica
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: species, Length: 150, dtype: object
In [10]:
# All rows, a list of column labels; the result is a DataFrame.
df.loc[:, ['sepal_length', 'species']]
Out[10]:
sepal_length species
0 5.1 setosa
1 4.9 setosa
2 4.7 setosa
3 4.6 setosa
4 5.0 setosa
5 5.4 setosa
6 4.6 setosa
7 5.0 setosa
8 4.4 setosa
9 4.9 setosa
10 5.4 setosa
11 4.8 setosa
12 4.8 setosa
13 4.3 setosa
14 5.8 setosa
15 5.7 setosa
16 5.4 setosa
17 5.1 setosa
18 5.7 setosa
19 5.1 setosa
20 5.4 setosa
21 5.1 setosa
22 4.6 setosa
23 5.1 setosa
24 4.8 setosa
25 5.0 setosa
26 5.0 setosa
27 5.2 setosa
28 5.2 setosa
29 4.7 setosa
... ... ...
120 6.9 virginica
121 5.6 virginica
122 7.7 virginica
123 6.3 virginica
124 6.7 virginica
125 7.2 virginica
126 6.2 virginica
127 6.1 virginica
128 6.4 virginica
129 7.2 virginica
130 7.4 virginica
131 7.9 virginica
132 6.4 virginica
133 6.3 virginica
134 6.1 virginica
135 7.7 virginica
136 6.3 virginica
137 6.4 virginica
138 6.0 virginica
139 6.9 virginica
140 6.7 virginica
141 6.9 virginica
142 5.8 virginica
143 6.8 virginica
144 6.7 virginica
145 6.7 virginica
146 6.3 virginica
147 6.5 virginica
148 6.2 virginica
149 5.9 virginica

150 rows × 2 columns

In [11]:
# Rows labelled 2 to 6 (inclusive) and two columns in one .loc call.
df.loc[2:6, ['sepal_length', 'species']]
Out[11]:
sepal_length species
2 4.7 setosa
3 4.6 setosa
4 5.0 setosa
5 5.4 setosa
6 4.6 setosa
In [12]:
# Select the row at position 2; a single row comes back as a Series.
df.iloc[2]
Out[12]:
sepal_length       4.7
sepal_width        3.2
petal_length       1.3
petal_width        0.2
species         setosa
Name: 2, dtype: object
In [13]:
# Rows at positions 2 and 3, column at position 1 (sepal_width).
df.iloc[2:4, 1]
Out[13]:
2    3.2
3    3.1
Name: sepal_width, dtype: float64
In [14]:
# Access a single value by row label and column name.
df.at[3, 'species']
Out[14]:
'setosa'
In [15]:
# Every second row from position 1 up to (not including) position 10.
df.iloc[1:10:2]
Out[15]:
sepal_length sepal_width petal_length petal_width species
1 4.9 3.0 1.4 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
5 5.4 3.9 1.7 0.4 setosa
7 5.0 3.4 1.5 0.2 setosa
9 4.9 3.1 1.5 0.1 setosa

Boolean selection

In [16]:
# Compare every species value with 'setosa'; the result is a boolean Series.
df.loc[:, 'species'] == 'setosa'
Out[16]:
0       True
1       True
2       True
3       True
4       True
5       True
6       True
7       True
8       True
9       True
10      True
11      True
12      True
13      True
14      True
15      True
16      True
17      True
18      True
19      True
20      True
21      True
22      True
23      True
24      True
25      True
26      True
27      True
28      True
29      True
       ...  
120    False
121    False
122    False
123    False
124    False
125    False
126    False
127    False
128    False
129    False
130    False
131    False
132    False
133    False
134    False
135    False
136    False
137    False
138    False
139    False
140    False
141    False
142    False
143    False
144    False
145    False
146    False
147    False
148    False
149    False
Name: species, Length: 150, dtype: bool
In [17]:
# Use a boolean Series as a mask to keep only the versicolor rows.
df.loc[df.loc[:, 'species'] == 'versicolor']
Out[17]:
sepal_length sepal_width petal_length petal_width species
50 7.0 3.2 4.7 1.4 versicolor
51 6.4 3.2 4.5 1.5 versicolor
52 6.9 3.1 4.9 1.5 versicolor
53 5.5 2.3 4.0 1.3 versicolor
54 6.5 2.8 4.6 1.5 versicolor
55 5.7 2.8 4.5 1.3 versicolor
56 6.3 3.3 4.7 1.6 versicolor
57 4.9 2.4 3.3 1.0 versicolor
58 6.6 2.9 4.6 1.3 versicolor
59 5.2 2.7 3.9 1.4 versicolor
60 5.0 2.0 3.5 1.0 versicolor
61 5.9 3.0 4.2 1.5 versicolor
62 6.0 2.2 4.0 1.0 versicolor
63 6.1 2.9 4.7 1.4 versicolor
64 5.6 2.9 3.6 1.3 versicolor
65 6.7 3.1 4.4 1.4 versicolor
66 5.6 3.0 4.5 1.5 versicolor
67 5.8 2.7 4.1 1.0 versicolor
68 6.2 2.2 4.5 1.5 versicolor
69 5.6 2.5 3.9 1.1 versicolor
70 5.9 3.2 4.8 1.8 versicolor
71 6.1 2.8 4.0 1.3 versicolor
72 6.3 2.5 4.9 1.5 versicolor
73 6.1 2.8 4.7 1.2 versicolor
74 6.4 2.9 4.3 1.3 versicolor
75 6.6 3.0 4.4 1.4 versicolor
76 6.8 2.8 4.8 1.4 versicolor
77 6.7 3.0 5.0 1.7 versicolor
78 6.0 2.9 4.5 1.5 versicolor
79 5.7 2.6 3.5 1.0 versicolor
80 5.5 2.4 3.8 1.1 versicolor
81 5.5 2.4 3.7 1.0 versicolor
82 5.8 2.7 3.9 1.2 versicolor
83 6.0 2.7 5.1 1.6 versicolor
84 5.4 3.0 4.5 1.5 versicolor
85 6.0 3.4 4.5 1.6 versicolor
86 6.7 3.1 4.7 1.5 versicolor
87 6.3 2.3 4.4 1.3 versicolor
88 5.6 3.0 4.1 1.3 versicolor
89 5.5 2.5 4.0 1.3 versicolor
90 5.5 2.6 4.4 1.2 versicolor
91 6.1 3.0 4.6 1.4 versicolor
92 5.8 2.6 4.0 1.2 versicolor
93 5.0 2.3 3.3 1.0 versicolor
94 5.6 2.7 4.2 1.3 versicolor
95 5.7 3.0 4.2 1.2 versicolor
96 5.7 2.9 4.2 1.3 versicolor
97 6.2 2.9 4.3 1.3 versicolor
98 5.1 2.5 3.0 1.1 versicolor
99 5.7 2.8 4.1 1.3 versicolor
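
A mask like the one above can also be combined with other conditions. The following is a minimal sketch on five rows taken from the outputs in this notebook; the names `demo`, `is_versicolor` and `long_petal`, and the 4.8 threshold, are illustrative choices rather than anything from the original cells.

```python
import pandas as pd

# Five rows (petal_length and species only) taken from the outputs above.
demo = pd.DataFrame({
    "petal_length": [1.4, 4.7, 4.9, 5.1, 6.0],
    "species": ["setosa", "versicolor", "versicolor", "virginica", "virginica"],
})

is_versicolor = demo["species"] == "versicolor"
long_petal = demo["petal_length"] > 4.8

# & is element-wise AND, | is element-wise OR.
# When comparisons are written inline, wrap each one in parentheses.
both = demo.loc[is_versicolor & long_petal]
either = demo.loc[is_versicolor | long_petal]

# isin tests membership in a list of values.
not_setosa = demo.loc[demo["species"].isin(["versicolor", "virginica"])]

print(both)
print(len(either), len(not_setosa))
```

Python's `and` and `or` do not work element-wise on a Series, which is why the operators `&` and `|` are used here.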
In [18]:
# Keep the versicolor rows in x; the original row labels (50 to 99) are preserved.
x = df.loc[df.loc[:, 'species'] == 'versicolor']
In [19]:
# .loc looks up the row labelled 51.
x.loc[51]
Out[19]:
sepal_length           6.4
sepal_width            3.2
petal_length           4.5
petal_width            1.5
species         versicolor
Name: 51, dtype: object
In [20]:
# .iloc looks up the row at position 1, which is the same row (label 51).
x.iloc[1]
Out[20]:
sepal_length           6.4
sepal_width            3.2
petal_length           4.5
petal_width            1.5
species         versicolor
Name: 51, dtype: object
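
The last two cells return the same row because filtering keeps the original index labels. The sketch below shows this on four rows copied from the data (labels 0, 1, 50 and 51); `demo`, `subset` and `renumbered` are hypothetical names, and `reset_index(drop=True)` is one optional way to renumber the labels from zero.

```python
import pandas as pd

# Four rows from the data, keeping their original labels.
demo = pd.DataFrame(
    {
        "petal_length": [1.4, 1.4, 4.7, 4.5],
        "species": ["setosa", "setosa", "versicolor", "versicolor"],
    },
    index=[0, 1, 50, 51],
)

subset = demo.loc[demo["species"] == "versicolor"]

print(list(subset.index))               # prints: [50, 51]
print(subset.loc[51, "petal_length"])   # by label: 4.5
print(subset.iloc[1, 0])                # by position: 4.5

# Optionally renumber the rows 0, 1, ... and discard the old labels.
renumbered = subset.reset_index(drop=True)
print(list(renumbered.index))           # prints: [0, 1]
```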

Summary statistics

In [21]:
# Show the first five rows.
df.head()
Out[21]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [22]:
# Show the last five rows.
df.tail()
Out[22]:
sepal_length sepal_width petal_length petal_width species
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
In [23]:
# Count, mean, standard deviation, min, quartiles and max of each numeric column.
df.describe()
Out[23]:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
In [24]:
# Summary statistics for the versicolor rows only.
df.loc[df.loc[:, 'species'] == 'versicolor'].describe()
Out[24]:
sepal_length sepal_width petal_length petal_width
count 50.000000 50.000000 50.000000 50.000000
mean 5.936000 2.770000 4.260000 1.326000
std 0.516171 0.313798 0.469911 0.197753
min 4.900000 2.000000 3.000000 1.000000
25% 5.600000 2.525000 4.000000 1.200000
50% 5.900000 2.800000 4.350000 1.300000
75% 6.300000 3.000000 4.600000 1.500000
max 7.000000 3.400000 5.100000 1.800000
In [25]:
# Summary statistics for the setosa rows only.
df.loc[df.loc[:, 'species'] == 'setosa'].describe()
Out[25]:
sepal_length sepal_width petal_length petal_width
count 50.00000 50.000000 50.000000 50.00000
mean 5.00600 3.418000 1.464000 0.24400
std 0.35249 0.381024 0.173511 0.10721
min 4.30000 2.300000 1.000000 0.10000
25% 4.80000 3.125000 1.400000 0.20000
50% 5.00000 3.400000 1.500000 0.20000
75% 5.20000 3.675000 1.575000 0.30000
max 5.80000 4.400000 1.900000 0.60000
In [26]:
# Summary statistics for the virginica rows only.
df.loc[df.loc[:, 'species'] == 'virginica'].describe()
Out[26]:
sepal_length sepal_width petal_length petal_width
count 50.00000 50.000000 50.000000 50.00000
mean 6.58800 2.974000 5.552000 2.02600
std 0.63588 0.322497 0.551895 0.27465
min 4.90000 2.200000 4.500000 1.40000
25% 6.22500 2.800000 5.100000 1.80000
50% 6.50000 3.000000 5.550000 2.00000
75% 6.90000 3.175000 5.875000 2.30000
max 7.90000 3.800000 6.900000 2.50000
In [27]:
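# Mean of each numeric column. numeric_only=True skips the species column;
# pandas 2.0 and later raise a TypeError without it.
df.mean(numeric_only=True)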
Out[27]:
sepal_length    5.843333
sepal_width     3.054000
petal_length    3.758667
petal_width     1.198667
dtype: float64
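
The per-species summaries above repeat the same filter three times. One alternative, not used in the original notebook, is `groupby`, which computes the statistics for every species in a single call. The sketch below assumes a six-row excerpt of the data with two of the numeric columns; `demo`, `numeric_cols`, `species_means` and `species_summary` are illustrative names.

```python
import pandas as pd

# Two rows per species, copied from the outputs above.
demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.2, 5.9],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 5.4, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

numeric_cols = ["sepal_length", "petal_length"]

# One row per species, one column per measurement.
species_means = demo.groupby("species")[numeric_cols].mean()

# describe() per group: columns become (measurement, statistic) pairs.
species_summary = demo.groupby("species")[numeric_cols].describe()

print(species_means)
print(species_summary.loc["setosa", ("petal_length", "count")])
```

On the full data set the same two calls would apply to `df` directly, with all four measurement columns listed in `numeric_cols`.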

Plots

In [28]:
# Import seaborn for statistical plots.
import seaborn as sns
In [29]:
# Scatter plots of every pair of numeric columns, coloured by species.
sns.pairplot(df, hue='species')
Out[29]:
<seaborn.axisgrid.PairGrid at 0x1d5fb084f28>

End