Loading a CSV file

In [1]:
# Import the pandas library
import pandas as pd

# Load iris.csv
df = pd.read_csv("https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv")

df
Out[1]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
... ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

150 rows × 5 columns
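read_csv is not limited to URLs: it also accepts a local path or a file-like object. A minimal sketch, assuming a hypothetical four-row CSV string in place of the remote iris.csv so that it runs without network access (the names csv_text, sample and sample_subset are illustrative only):

```python
import io

import pandas as pd

# Hypothetical stand-in for iris.csv: a few rows held in memory
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.3,3.3,6.0,2.5,virginica
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

# usecols restricts parsing to the listed columns
sample_subset = pd.read_csv(io.StringIO(csv_text), usecols=["sepal_length", "species"])

print(sample.shape)
print(list(sample_subset.columns))
```

Reading from an in-memory string like this is also a convenient way to try out parsing options before pointing the call at a large file.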

Basic information and summary statistics

In [2]:
# Basic information about the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
In [3]:
# Display summary statistics
df.describe()
Out[3]:
sepal_length sepal_width petal_length petal_width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
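Each row of the describe() table corresponds to a dedicated method: count(), mean(), std() (the sample standard deviation, ddof=1), min(), quantile() and max(). A small sketch that rebuilds the table by hand, assuming a hypothetical five-value Series rather than the real iris column:

```python
import pandas as pd

# Hypothetical values, not the actual iris data
s = pd.Series([4.3, 5.1, 5.8, 6.4, 7.9], name="sepal_length")

summary = s.describe()

# Rebuild the same rows with the individual methods
manual = pd.Series({
    "count": s.count(),
    "mean": s.mean(),
    "std": s.std(),              # sample standard deviation (ddof=1)
    "min": s.min(),
    "25%": s.quantile(0.25),
    "50%": s.quantile(0.50),     # the median
    "75%": s.quantile(0.75),
    "max": s.max(),
})

# Show both side by side
print(pd.DataFrame({"describe": summary, "manual": manual}))
```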

Retrieving information

Basic information

In [4]:
# Display the shape of the data (rows, columns)
df.shape
Out[4]:
(150, 5)
In [5]:
# Display the column labels
df.columns
Out[5]:
Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')
In [6]:
# Display the data type of each column
df.dtypes
Out[6]:
sepal_length    float64
sepal_width     float64
petal_length    float64
petal_width     float64
species          object
dtype: object

Partial display

In [ ]:
# Display part of the data (first 5 rows)
df.head()
Out[ ]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [ ]:
# Display part of the data (last 5 rows)
df.tail()
Out[ ]:
sepal_length sepal_width petal_length petal_width species
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica

Slicing specific rows and columns

In [15]:
# Extract a single value by row position and column position (iloc)
df.iloc[1, 1]
Out[15]:
3.0
In [16]:
# Extract a row by position (iloc)
df.iloc[1]
Out[16]:
sepal_length       4.9
sepal_width          3
petal_length       1.4
petal_width        0.2
species         setosa
Name: 1, dtype: object
In [17]:
# Extract a column by position (iloc; ":" means "take everything")
df.iloc[:, 1]
Out[17]:
0      3.5
1      3.0
2      3.2
3      3.1
4      3.6
      ... 
145    3.0
146    2.5
147    3.0
148    3.4
149    3.0
Name: sepal_width, Length: 150, dtype: float64
In [22]:
# Extract a single value by index label and column name (loc)
df.loc[1, 'sepal_width']
Out[22]:
3.0
In [20]:
# Extract a column by name (loc; ":" means "take everything")
df.loc[:, 'sepal_width']
Out[20]:
0      3.5
1      3.0
2      3.2
3      3.1
4      3.6
      ... 
145    3.0
146    2.5
147    3.0
148    3.4
149    3.0
Name: sepal_width, Length: 150, dtype: float64
In [7]:
# Extract a column, method 2 (index directly with the column name)
df_species = df['species']
df_species
Out[7]:
0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...    
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: species, Length: 150, dtype: object
In [24]:
# Extract multiple rows (loc; a label slice includes the end label, so 2:3 returns rows 2 and 3)
df_select_index = df.loc[2:3]
df_select_index
Out[24]:
sepal_length sepal_width petal_length petal_width species
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
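In the iris frame the index labels happen to equal the row positions, which hides the difference between loc and iloc. A minimal sketch, assuming a hypothetical frame (df_demo) whose labels start at 10, makes the two slicing rules visible:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions
df_demo = pd.DataFrame(
    {"sepal_width": [3.5, 3.0, 3.2, 3.1]},
    index=[10, 11, 12, 13],
)

# loc slices by label, and the end label is included
by_label = df_demo.loc[11:12]

# iloc slices by position, and the end position is excluded
by_position = df_demo.iloc[1:2]

print(by_label)
print(by_position)
```

Here loc[11:12] returns two rows while iloc[1:2] returns only one, even though both slices start at the same row.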
In [25]:
# Extract multiple columns (pass a list of column names)
df_select_column = df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
df_select_column
Out[25]:
sepal_length sepal_width petal_length petal_width
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
... ... ... ... ...
145 6.7 3.0 5.2 2.3
146 6.3 2.5 5.0 1.9
147 6.5 3.0 5.2 2.0
148 6.2 3.4 5.4 2.3
149 5.9 3.0 5.1 1.8

150 rows × 4 columns

In [ ]:
# Select rows matching a condition on a categorical column, method 1 (boolean indexing)
df_species_setosa = df[df['species'] == 'setosa']
df_species_setosa.head()
Out[ ]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [ ]:
# Select rows matching a condition on a categorical column, method 2 (query)
df_species_setosa2 = df.query("species == 'setosa'")
df_species_setosa2.head()
Out[ ]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [13]:
# Select rows matching a condition on a numeric column
df_sepal_length = df[df['sepal_length'] > 7.0]
df_sepal_length
Out[13]:
sepal_length sepal_width petal_length petal_width species
102 7.1 3.0 5.9 2.1 virginica
105 7.6 3.0 6.6 2.1 virginica
107 7.3 2.9 6.3 1.8 virginica
109 7.2 3.6 6.1 2.5 virginica
117 7.7 3.8 6.7 2.2 virginica
118 7.7 2.6 6.9 2.3 virginica
122 7.7 2.8 6.7 2.0 virginica
125 7.2 3.2 6.0 1.8 virginica
129 7.2 3.0 5.8 1.6 virginica
130 7.4 2.8 6.1 1.9 virginica
131 7.9 3.8 6.4 2.0 virginica
135 7.7 3.0 6.1 2.3 virginica
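Conditions on categorical and numeric columns can be combined. A short sketch, assuming a hypothetical five-row frame (df_demo) with the same column names as the iris data:

```python
import pandas as pd

# Hypothetical rows using the iris column names
df_demo = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3, 7.7, 7.2],
    "species": ["setosa", "versicolor", "virginica", "virginica", "virginica"],
})

# Combine boolean masks with & (and), | (or), ~ (not).
# Each comparison needs its own parentheses because & binds tighter than > and ==.
long_virginica = df_demo[
    (df_demo["sepal_length"] > 7.0) & (df_demo["species"] == "virginica")
]

# The same filter written with query()
long_virginica_q = df_demo.query("sepal_length > 7.0 and species == 'virginica'")

# isin() tests membership in a list of values
not_setosa = df_demo[df_demo["species"].isin(["versicolor", "virginica"])]

print(long_virginica)
print(not_setosa)
```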

Handling missing data

In [ ]:
# Count missing values per column (this sample dataset has none to begin with)
df.isnull().sum()
Out[ ]:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
In [ ]:
# Drop rows that contain at least one missing value
df = df.dropna(how='any')
# Check the state again
df.isnull().sum()
Out[ ]:
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
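Because the iris data has no missing values, the calls above leave it unchanged. A minimal sketch, assuming a hypothetical frame (df_demo) with a few deliberately missing entries, shows what the same methods do when there is something to handle:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberately missing entries
df_demo = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 4.7, 4.6],
    "sepal_width": [3.5, 3.0, np.nan, 3.1],
    "species": ["setosa", "setosa", None, "setosa"],
})

# Number of missing values in each column
missing_per_column = df_demo.isnull().sum()

# how='any' drops a row if at least one value is missing
dropped_any = df_demo.dropna(how="any")

# how='all' drops a row only if every value is missing
dropped_all = df_demo.dropna(how="all")

# Alternative to dropping: fill one column with its mean
filled = df_demo.fillna({"sepal_length": df_demo["sepal_length"].mean()})

print(missing_per_column)
print(dropped_any)
```

Whether to drop or fill depends on the analysis; filling with the mean is only one of several possible choices.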

Aggregating data

In [ ]:
# Unique values in a column (unique())
df_species_unique = df['species'].unique()
df_species_unique
Out[ ]:
array(['setosa', 'versicolor', 'virginica'], dtype=object)
In [ ]:
# Number of rows per category in a column, method 1 (value_counts())
df_species_counts = df['species'].value_counts()
df_species_counts
Out[ ]:
versicolor    50
virginica     50
setosa        50
Name: species, dtype: int64
In [ ]:
# Number of non-null values per category, method 2 (groupby(), count())
df_species_counts2 = df.groupby('species').count()
df_species_counts2
Out[ ]:
species sepal_length sepal_width petal_length petal_width
setosa 50 50 50 50
versicolor 50 50 50 50
virginica 50 50 50 50
In [ ]:
# Mean of each column per category (groupby(), mean())
df_species_mean = df.groupby('species').mean()
df_species_mean
Out[ ]:
species sepal_length sepal_width petal_length petal_width
setosa 5.006 3.418 1.464 0.244
versicolor 5.936 2.770 4.260 1.326
virginica 6.588 2.974 5.552 2.026
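groupby() can also compute several statistics in one pass. A short sketch, assuming a hypothetical four-row frame (df_demo); size() and agg() are shown as alternatives to count() and mean(), not as the method used above:

```python
import pandas as pd

# Hypothetical rows using the iris column names
df_demo = pd.DataFrame({
    "species": ["setosa", "setosa", "virginica", "virginica"],
    "sepal_length": [5.0, 5.2, 6.5, 6.7],
    "petal_length": [1.4, 1.6, 5.5, 5.7],
})

# size() returns one row count per group (count() returns one per column)
group_sizes = df_demo.groupby("species").size()

# agg() applies several aggregations to the selected column at once
stats = df_demo.groupby("species")["sepal_length"].agg(["mean", "min", "max"])

# numeric_only=True skips non-numeric columns if any remain after grouping
means = df_demo.groupby("species").mean(numeric_only=True)

print(group_sizes)
print(stats)
```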

Sorting

In [ ]:
# Sort the data by sepal_length (descending)
df_sepal_length_sort = df.sort_values('sepal_length', ascending=False)
df_sepal_length_sort.head()
Out[ ]:
sepal_length sepal_width petal_length petal_width species
131 7.9 3.8 6.4 2.0 virginica
135 7.7 3.0 6.1 2.3 virginica
122 7.7 2.8 6.7 2.0 virginica
117 7.7 3.8 6.7 2.2 virginica
118 7.7 2.6 6.9 2.3 virginica
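Several rows in the output above share the value 7.7, and with a single sort key their relative order is not guaranteed. A minimal sketch, assuming a hypothetical four-row frame (df_demo), adds a second key to break ties:

```python
import pandas as pd

# Hypothetical rows with tied sepal_length values
df_demo = pd.DataFrame(
    {
        "sepal_length": [7.7, 7.9, 7.7, 7.7],
        "petal_length": [6.7, 6.4, 6.9, 6.1],
    },
    index=[117, 131, 118, 135],
)

# Sort by sepal_length descending, then by petal_length ascending within ties
sorted_demo = df_demo.sort_values(
    ["sepal_length", "petal_length"], ascending=[False, True]
)

# reset_index(drop=True) renumbers the rows 0..n-1 after sorting
renumbered = sorted_demo.reset_index(drop=True)

print(sorted_demo)
```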

Visualizing data

In [ ]:
# Display a histogram of each numeric column
df.hist(figsize=(9, 6))
Out[ ]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f596ede1eb8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f596ed81c50>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f596edb3eb8>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f596ed71160>]],
      dtype=object)
In [ ]:
# Display a line plot
df.plot(figsize=(9, 6))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f597069c160>
In [ ]:
# Display each column in its own subplot
df.plot(subplots=True, figsize=(9, 6))
Out[ ]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f596eceb860>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f596ec0a7f0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f596ebbb7f0>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f596ebf17f0>],
      dtype=object)
In [ ]:
# Specify the columns to plot with the x and y arguments
df.plot(x='sepal_length', y='sepal_width', figsize=(9, 6))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f596eab12e8>
In [ ]:
# Display the first 5 rows as a stacked bar chart
df[:5].plot.bar(stacked=True, figsize=(9, 6))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f596ea2f518>
In [ ]:
# Display a scatter plot
df.plot(kind='scatter', x='sepal_length', y='petal_length', alpha=0.5, figsize=(9, 6))
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f596e8cb518>
In [ ]:
# To overlay several plots, pass the AxesSubplot returned by the first plot() call
# as the ax argument of each additional plot() call
ax = df.plot(kind='scatter', x='sepal_length', y='petal_length', alpha=0.5, figsize=(9, 6))
df.plot(kind='scatter', x='sepal_length', y='sepal_width', marker='s', c='g', s=50, alpha=0.5, ax=ax)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f596e6f3518>
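The same ax mechanism can be used to draw one scatter series per species on shared axes. A rough sketch, assuming a hypothetical four-row frame (df_demo) and an arbitrary color mapping (colors); the Agg backend line is only there so the sketch runs without a display and should be dropped inside a notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line inside a notebook
import pandas as pd

# Hypothetical rows using the iris column names
df_demo = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 6.5],
    "petal_length": [1.4, 1.4, 6.0, 5.8],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

# Arbitrary color per species (an assumption for illustration)
colors = {"setosa": "b", "virginica": "r"}

ax = None
for name, group in df_demo.groupby("species"):
    # The first call creates the Axes (ax=None); later calls draw onto it
    ax = group.plot(
        kind="scatter", x="sepal_length", y="petal_length",
        c=colors[name], label=name, alpha=0.5, ax=ax,
    )
```

Each pass through the loop adds one scatter collection to the same Axes, so the species appear together in a single figure.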
In [ ]:
# The plot.scatter method is an equivalent alternative
df.plot.scatter(x='sepal_length', y='petal_length', alpha=0.5)
Out[ ]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f596e658748>

Reference: loading from the datasets bundled with sklearn

In [ ]:
# Load the iris dataset
from sklearn.datasets import load_iris

# Display basic information
iris = load_iris()
print(iris.data.shape)
print(iris.feature_names)
print(iris.target.shape)
print(iris.target_names)

print('*****************************************')

# Display the dataset description
print(iris.DESCR)
(150, 4)
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
(150,)
['setosa' 'versicolor' 'virginica']
*****************************************
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%[email protected])
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...
In [ ]:
# Import the pandas library
import pandas as pd

# Put the iris data into a DataFrame; replace the target codes 0, 1, 2 with the species names
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = iris.target_names[iris.target]

df.head()
Out[ ]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
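The expression iris.target_names[iris.target] relies on NumPy integer-array indexing: each code in target selects the name at that position. A minimal sketch, assuming two small hypothetical arrays in place of iris.target_names and iris.target so that it runs without sklearn:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for iris.target_names and iris.target
target_names = np.array(["setosa", "versicolor", "virginica"])
target = np.array([0, 0, 1, 2, 1])

# Indexing an array with an integer array picks one name per code
labels = target_names[target]

# An equivalent pandas route that keeps the result as a categorical
labels_cat = pd.Categorical.from_codes(target, categories=target_names)

print(labels)
print(labels_cat)
```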
In [ ]:
# Restrict the target to two species (e.g. for binary classification)
df = df[ (df['target']=='versicolor') | (df['target']=='virginica') ]

df.head()
Out[ ]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
50 7.0 3.2 4.7 1.4 versicolor
51 6.4 3.2 4.5 1.5 versicolor
52 6.9 3.1 4.9 1.5 versicolor
53 5.5 2.3 4.0 1.3 versicolor
54 6.5 2.8 4.6 1.5 versicolor
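After this filter the index still starts at 50 and the target is still text. A short sketch of one possible follow-up, assuming a hypothetical five-row frame (df_demo) and an arbitrary 0/1 encoding; isin() is used here as an alternative spelling of the two-condition filter:

```python
import pandas as pd

# Hypothetical rows using the sklearn-style column names
df_demo = pd.DataFrame({
    "petal length (cm)": [1.4, 4.7, 4.5, 6.0, 5.1],
    "target": ["setosa", "versicolor", "versicolor", "virginica", "virginica"],
})

# Keep only the two species of interest; copy() avoids modifying a view of df_demo
binary = df_demo[df_demo["target"].isin(["versicolor", "virginica"])].copy()

# Renumber the rows from 0 so positions and labels agree again
binary = binary.reset_index(drop=True)

# Encode the species as 0/1 (the mapping itself is an arbitrary choice)
binary["label"] = binary["target"].map({"versicolor": 0, "virginica": 1})

print(binary)
```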