import numpy as np
import pandas as pd
from pandas import DataFrame, Series
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df2 = DataFrame({'key': ['a', 'b', 'd'],
'data2': range(3)})
df1
data1 | key | |
---|---|---|
0 | 0 | b |
1 | 1 | b |
2 | 2 | a |
3 | 3 | c |
4 | 4 | a |
5 | 5 | a |
6 | 6 | b |
df2
data2 | key | |
---|---|---|
0 | 0 | a |
1 | 1 | b |
2 | 2 | d |
pd.merge(df1, df2)
data1 | key | data2 | |
---|---|---|---|
0 | 0 | b | 1 |
1 | 1 | b | 1 |
2 | 6 | b | 1 |
3 | 2 | a | 0 |
4 | 4 | a | 0 |
5 | 5 | a | 0 |
이 예에서는 양쪽 DataFrame에 공통으로 존재하는 key 칼럼이 자동으로 조인 키로 사용된다 (data1, key / data2, key 중 key가 양쪽에 겹침)
이렇게도 작동한다고 예를 들었지만 명시적으로 지정하는 습관을 들이자
df1에서 c와 df2에서 d는 교차하는게 없어서 값 누락
상식적으로 기본 전제는 교차하는 것이다. 왜냐하면 교차도 안하는데 출력해주면 나중에 문제가 될 수 있으니 이렇게 한듯하다. 교차 안되는건 옵션으로 표현해 줄 수 있을 것이다. 왜냐고? 그게 안전빵이니까
pd.merge(df1, df2, on='key')
data1 | key | data2 | |
---|---|---|---|
0 | 0 | b | 1 |
1 | 1 | b | 1 |
2 | 6 | b | 1 |
3 | 2 | a | 0 |
4 | 4 | a | 0 |
5 | 5 | a | 0 |
df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
'data1': range(7)})
df4 = DataFrame({'rkey': ['a', 'b', 'd'],
'data2': range(3)})
df3
data1 | lkey | |
---|---|---|
0 | 0 | b |
1 | 1 | b |
2 | 2 | a |
3 | 3 | c |
4 | 4 | a |
5 | 5 | a |
6 | 6 | b |
df4
data2 | rkey | |
---|---|---|
0 | 0 | a |
1 | 1 | b |
2 | 2 | d |
# left, right 둘 다 같이 써줘야 한다.
# 하나만 써주면 어떻게 될까? 이럴땐 Test 해보는게 짱!
pd.merge(df3, df4, left_on='lkey', right_on='rkey')
data1 | lkey | data2 | rkey | |
---|---|---|---|---|
0 | 0 | b | 1 | b |
1 | 1 | b | 1 | b |
2 | 6 | b | 1 | b |
3 | 2 | a | 0 | a |
4 | 4 | a | 0 | a |
5 | 5 | a | 0 | a |
# TypeError!
# 소스를 보면 right_on == left_on 비교하는 곳이 있다.
# 당연히 right_on은 명시하지 않았으니 NoneType으로 넘어가서 비교가 안된다!
# 아 그 전에 len 함수에서 NoneType의 길이를 재려고 하니 오류가 나는 것임!
pd.merge(df3, df4, left_on='lkey')
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-16-20cc39e49f9a> in <module>() 3 # 당연히 right_on은 명시하지 않았으니 NoneType으로 넘어가서 비교가 안된다! 4 # 아 그 전에 len 함수에서 NoneType의 길이를 재려고 하니 오류가 나는 것임! ----> 5 pd.merge(df3, df4, left_on='lkey') /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy) 34 right_on=right_on, left_index=left_index, 35 right_index=right_index, sort=sort, suffixes=suffixes, ---> 36 copy=copy) 37 return op.get_result() 38 if __debug__: /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy) 178 (self.left_join_keys, 179 self.right_join_keys, --> 180 self.join_names) = self._get_merge_keys() 181 182 def get_result(self): /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in _get_merge_keys(self) 296 left_keys, right_keys 297 """ --> 298 self._validate_specification() 299 300 left_keys = [] /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in _validate_specification(self) 414 raise AssertionError() 415 self.left_on = [None] * n --> 416 if not ((len(self.right_on) == len(self.left_on))): 417 raise AssertionError() 418 TypeError: object of type 'NoneType' has no len()
# None은 당연히 길이가 없지.
len(None)
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-17-a613272357ac> in <module>() 1 # None은 당연히 길이가 없지. ----> 2 len(None) TypeError: object of type 'NoneType' has no len()
pd.merge(df1, df2, how='outer')
data1 | key | data2 | |
---|---|---|---|
0 | 0 | b | 1 |
1 | 1 | b | 1 |
2 | 6 | b | 1 |
3 | 2 | a | 0 |
4 | 4 | a | 0 |
5 | 5 | a | 0 |
6 | 3 | c | NaN |
7 | NaN | d | 2 |
# left에만 있는 c까지 포함이 된 것을 확인할 수 있다.
pd.merge(df1, df2, how='left')
data1 | key | data2 | |
---|---|---|---|
0 | 0 | b | 1 |
1 | 1 | b | 1 |
2 | 6 | b | 1 |
3 | 2 | a | 0 |
4 | 4 | a | 0 |
5 | 5 | a | 0 |
6 | 3 | c | NaN |
# right에만 있는 d까지 포함
pd.merge(df1, df2, how='right')
data1 | key | data2 | |
---|---|---|---|
0 | 0 | b | 1 |
1 | 1 | b | 1 |
2 | 6 | b | 1 |
3 | 2 | a | 0 |
4 | 4 | a | 0 |
5 | 5 | a | 0 |
6 | NaN | d | 2 |
df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
'data1': range(6)})
df2 = DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
'data2': range(5)})
df1
data1 | key | |
---|---|---|
0 | 0 | b |
1 | 1 | b |
2 | 2 | a |
3 | 3 | c |
4 | 4 | a |
5 | 5 | b |
df2
data2 | key | |
---|---|---|
0 | 0 | a |
1 | 1 | b |
2 | 2 | a |
3 | 3 | b |
4 | 4 | d |
pd.merge(df1, df2, on='key', how='left')
data1 | key | data2 | |
---|---|---|---|
0 | 0 | b | 1 |
1 | 0 | b | 3 |
2 | 1 | b | 1 |
3 | 1 | b | 3 |
4 | 5 | b | 1 |
5 | 5 | b | 3 |
6 | 2 | a | 0 |
7 | 2 | a | 2 |
8 | 4 | a | 0 |
9 | 4 | a | 2 |
10 | 3 | c | NaN |
# how parameter default is inner
pd.merge(df1, df2, how='inner')
data1 | key | data2 | |
---|---|---|---|
0 | 0 | b | 1 |
1 | 0 | b | 3 |
2 | 1 | b | 1 |
3 | 1 | b | 3 |
4 | 5 | b | 1 |
5 | 5 | b | 3 |
6 | 2 | a | 0 |
7 | 2 | a | 2 |
8 | 4 | a | 0 |
9 | 4 | a | 2 |
pd.merge(df1, df2)
data1 | key | data2 | |
---|---|---|---|
0 | 0 | b | 1 |
1 | 0 | b | 3 |
2 | 1 | b | 1 |
3 | 1 | b | 3 |
4 | 5 | b | 1 |
5 | 5 | b | 3 |
6 | 2 | a | 0 |
7 | 2 | a | 2 |
8 | 4 | a | 0 |
9 | 4 | a | 2 |
pd.merge?
how : {'left', 'right', 'outer', 'inner'}, default 'inner'
left = DataFrame({'key1': ['foo', 'foo', 'bar'],
'key2': ['one', 'two', 'one'],
'lval': [1, 2, 3]})
right = DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
'key2': ['one', 'one', 'one', 'two'],
'rval': [4, 5, 6, 7]})
left
key1 | key2 | lval | |
---|---|---|---|
0 | foo | one | 1 |
1 | foo | two | 2 |
2 | bar | one | 3 |
right
key1 | key2 | rval | |
---|---|---|---|
0 | foo | one | 4 |
1 | foo | one | 5 |
2 | bar | one | 6 |
3 | bar | two | 7 |
pd.merge(left, right, on=['key1', 'key2'], how='outer')
key1 | key2 | lval | rval | |
---|---|---|---|---|
0 | foo | one | 1 | 4 |
1 | foo | one | 1 | 5 |
2 | foo | two | 2 | NaN |
3 | bar | one | 3 | 6 |
4 | bar | two | NaN | 7 |
pd.merge(left, right, on='key1')
key1 | key2_x | lval | key2_y | rval | |
---|---|---|---|---|---|
0 | foo | one | 1 | one | 4 |
1 | foo | one | 1 | one | 5 |
2 | foo | two | 2 | one | 4 |
3 | foo | two | 2 | one | 5 |
4 | bar | one | 3 | one | 6 |
5 | bar | one | 3 | two | 7 |
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))
key1 | key2_left | lval | key2_right | rval | |
---|---|---|---|---|---|
0 | foo | one | 1 | one | 4 |
1 | foo | one | 1 | one | 5 |
2 | foo | two | 2 | one | 4 |
3 | foo | two | 2 | one | 5 |
4 | bar | one | 3 | one | 6 |
5 | bar | one | 3 | two | 7 |
인자 | 설명 |
---|---|
left | 머지하려는 DataFrame 중 왼쪽에 위치한 DataFrame |
right | 머지하려는 DataFrame 중 오른쪽에 위치한 DataFrame |
how | 조인방법. 'inner', 'outer', 'left', 'right'. 기본값은 inner |
on | 조인하려는 칼럼 이름. 반드시 두 DataFrame 객체 모두에 있는 이름이어야 한다. 만약 명시되지 않고 다른 조인 키도 주어지지 않으면 left와 right에서 공통되는 칼럼을 조인 키로 사용한다. |
left_on | 조인 키로 사용할 left DataFrame의 칼럼 |
right_on | 조인 키로 사용할 right DataFrame의 칼럼 |
left_index | 조인 키로 사용할 left DataFrame의 색인 로우(다중 색인일 경우의 키) |
right_index | 조인 키로 사용할 right DataFrame의 색인 로우(다중 색인일 경우의 키) |
sort | 조인 키에 따라 병합된 데이터를 사전 순으로 정렬. 기본값은 True. 대용량 데이터의 경우 False라면 성능상의 이득을 얻을 수도 있다. |
suffixes | 칼럼 이름이 겹칠 경우 각 칼럼 이름 뒤에 붙일 문자열의 튜플. 기본값은 ('_x', '_y'). 만약 'data'라는 칼럼 이름이 양쪽 DataFrame에 같이 존재하면 결과에서는 'data_x', 'data_y'로 나타난다. |
copy | False일 경우 예외적인 경우에 결과로 데이터가 복사되지 않도록 한다. 기본값은 항상 복사가 이루어진다. |
left1 = DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
'value': range(6)})
right1 = DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])
left1
key | value | |
---|---|---|
0 | a | 0 |
1 | b | 1 |
2 | a | 2 |
3 | a | 3 |
4 | b | 4 |
5 | c | 5 |
right1
group_val | |
---|---|
a | 3.5 |
b | 7.0 |
# left_on으로 어떤 컬럼이 조인 키로 사용될지 결정
# right_index로 어떤 인덱스가 조인 키로 사용될지 결정
pd.merge(left1, right1, left_on='key', right_index=True)
key | value | group_val | |
---|---|---|---|
0 | a | 0 | 3.5 |
2 | a | 2 | 3.5 |
3 | a | 3 | 3.5 |
1 | b | 1 | 7.0 |
4 | b | 4 | 7.0 |
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
key | value | group_val | |
---|---|---|---|
0 | a | 0 | 3.5 |
2 | a | 2 | 3.5 |
3 | a | 3 | 3.5 |
1 | b | 1 | 7.0 |
4 | b | 4 | 7.0 |
5 | c | 5 | NaN |
# right_on을 입력하지 않으니 당연히 Error!
pd.merge(left1, right1, left_on='key')
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) <ipython-input-41-273a0ff3019a> in <module>() 1 # right_on을 입력하지 않으니 당연히 Error! ----> 2 pd.merge(left1, right1, left_on='key') /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy) 34 right_on=right_on, left_index=left_index, 35 right_index=right_index, sort=sort, suffixes=suffixes, ---> 36 copy=copy) 37 return op.get_result() 38 if __debug__: /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy) 178 (self.left_join_keys, 179 self.right_join_keys, --> 180 self.join_names) = self._get_merge_keys() 181 182 def get_result(self): /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in _get_merge_keys(self) 296 left_keys, right_keys 297 """ --> 298 self._validate_specification() 299 300 left_keys = [] /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in _validate_specification(self) 414 raise AssertionError() 415 self.left_on = [None] * n --> 416 if not ((len(self.right_on) == len(self.left_on))): 417 raise AssertionError() 418 TypeError: object of type 'NoneType' has no len()
# right_on에 group_val을 입력하면 교차하는게 하나도 없으니 아무것도 표시하지 않음
pd.merge(left1, right1, left_on='key', right_on='group_val')
Int64Index([], dtype=int64) | Empty DataFrame |
pd.merge(left1, right1, left_on='key', right_on='group_val', how='outer')
key | value | group_val | |
---|---|---|---|
0 | a | 0 | NaN |
1 | a | 2 | NaN |
2 | a | 3 | NaN |
3 | b | 1 | NaN |
4 | b | 4 | NaN |
5 | c | 5 | NaN |
6 | NaN | NaN | 3.5 |
7 | NaN | NaN | 7.0 |
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')
key | value | group_val | |
---|---|---|---|
0 | a | 0 | 3.5 |
2 | a | 2 | 3.5 |
3 | a | 3 | 3.5 |
1 | b | 1 | 7.0 |
4 | b | 4 | 7.0 |
5 | c | 5 | NaN |
lefth = DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'key2': [2000, 2001, 2002, 2001, 2002],
'data': np.arange(5.)})
righth = DataFrame(np.arange(12).reshape((6, 2)),
index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
[2001, 2000, 2000, 2000, 2001, 2002]],
columns=['event1', 'event2'])
lefth
data | key1 | key2 | |
---|---|---|---|
0 | 0 | Ohio | 2000 |
1 | 1 | Ohio | 2001 |
2 | 2 | Ohio | 2002 |
3 | 3 | Nevada | 2001 |
4 | 4 | Nevada | 2002 |
righth
event1 | event2 | ||
---|---|---|---|
Nevada | 2001 | 0 | 1 |
2000 | 2 | 3 | |
Ohio | 2000 | 4 | 5 |
2000 | 6 | 7 | |
2001 | 8 | 9 | |
2002 | 10 | 11 |
# key1, key2와 right_index가 같으니 됨
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)
data | key1 | key2 | event1 | event2 | |
---|---|---|---|---|---|
0 | 0 | Ohio | 2000 | 4 | 5 |
0 | 0 | Ohio | 2000 | 6 | 7 |
1 | 1 | Ohio | 2001 | 8 | 9 |
2 | 2 | Ohio | 2002 | 10 | 11 |
3 | 3 | Nevada | 2001 | 0 | 1 |
# left_on 키 개수(1개)가 right 색인의 레벨 수(2개)와 맞지 않아서 AssertionError!
pd.merge(lefth, righth, left_on=['key1'], right_index=True)
--------------------------------------------------------------------------- AssertionError Traceback (most recent call last) <ipython-input-179-83621148f53a> in <module>() 1 # key1과 right_index는 다르다. key2에 해당하는 연도가 없어서 Error! ----> 2 pd.merge(lefth, righth, left_on=['key1'], right_index=True) /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy) 34 right_on=right_on, left_index=left_index, 35 right_index=right_index, sort=sort, suffixes=suffixes, ---> 36 copy=copy) 37 return op.get_result() 38 if __debug__: /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in __init__(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy) 178 (self.left_join_keys, 179 self.right_join_keys, --> 180 self.join_names) = self._get_merge_keys() 181 182 def get_result(self): /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in _get_merge_keys(self) 296 left_keys, right_keys 297 """ --> 298 self._validate_specification() 299 300 left_keys = [] /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in _validate_specification(self) 406 if self.right_index: 407 if not ((len(self.left_on) == self.right.index.nlevels)): --> 408 raise AssertionError() 409 self.right_on = [None] * n 410 elif self.right_on is not None: AssertionError:
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True, how='outer')
data | key1 | key2 | event1 | event2 | |
---|---|---|---|---|---|
0 | 0 | Ohio | 2000 | 4 | 5 |
0 | 0 | Ohio | 2000 | 6 | 7 |
1 | 1 | Ohio | 2001 | 8 | 9 |
2 | 2 | Ohio | 2002 | 10 | 11 |
3 | 3 | Nevada | 2001 | 0 | 1 |
4 | 4 | Nevada | 2002 | NaN | NaN |
4 | NaN | Nevada | 2000 | 2 | 3 |
left2 = DataFrame([[1., 2.], [3., 4.], [5., 6.]], index=['a', 'c', 'e'],
columns=['Ohio', 'Nevada'])
right2 = DataFrame([[7., 8.,], [9., 10.], [11., 12.], [13, 14]],
index=['b', 'c', 'd', 'e'], columns=['Missouri', 'Alabama'])
left2
Ohio | Nevada | |
---|---|---|
a | 1 | 2 |
c | 3 | 4 |
e | 5 | 6 |
right2
Missouri | Alabama | |
---|---|---|
b | 7 | 8 |
c | 9 | 10 |
d | 11 | 12 |
e | 13 | 14 |
pd.merge(left2, right2, how='outer', left_index=True, right_index=True)
Ohio | Nevada | Missouri | Alabama | |
---|---|---|---|---|
a | 1 | 2 | NaN | NaN |
b | NaN | NaN | 7 | 8 |
c | 3 | 4 | 9 | 10 |
d | NaN | NaN | 11 | 12 |
e | 5 | 6 | 13 | 14 |
pd.merge(left2, right2, left_index=True, right_index=True)
Ohio | Nevada | Missouri | Alabama | |
---|---|---|---|---|
c | 3 | 4 | 9 | 10 |
e | 5 | 6 | 13 | 14 |
# 위와 똑같은 결과인데 join을 사용하면 훨씬 깔끔하다
left2.join(right2, how='outer')
Ohio | Nevada | Missouri | Alabama | |
---|---|---|---|---|
a | 1 | 2 | NaN | NaN |
b | NaN | NaN | 7 | 8 |
c | 3 | 4 | 9 | 10 |
d | NaN | NaN | 11 | 12 |
e | 5 | 6 | 13 | 14 |
# 색인 기준으로 머지할 때 사용!
left2.join(right2, how='inner')
Ohio | Nevada | Missouri | Alabama | |
---|---|---|---|---|
c | 3 | 4 | 9 | 10 |
e | 5 | 6 | 13 | 14 |
left1
key | value | |
---|---|---|
0 | a | 0 |
1 | b | 1 |
2 | a | 2 |
3 | a | 3 |
4 | b | 4 |
5 | c | 5 |
right1
group_val | |
---|---|
a | 3.5 |
b | 7.0 |
# 왼쪽 우선 조인이라 c가 있음
left1.join(right1, on='key')
key | value | group_val | |
---|---|---|---|
0 | a | 0 | 3.5 |
1 | b | 1 | 7.0 |
2 | a | 2 | 3.5 |
3 | a | 3 | 3.5 |
4 | b | 4 | 7.0 |
5 | c | 5 | NaN |
another = DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
index=['a', 'c', 'e', 'f'], columns=['New York', 'Oregon'])
left2
Ohio | Nevada | |
---|---|---|
a | 1 | 2 |
c | 3 | 4 |
e | 5 | 6 |
right2
Missouri | Alabama | |
---|---|---|
b | 7 | 8 |
c | 9 | 10 |
d | 11 | 12 |
e | 13 | 14 |
another
New York | Oregon | |
---|---|---|
a | 7 | 8 |
c | 9 | 10 |
e | 11 | 12 |
f | 16 | 17 |
left2.join([right2, another])
Ohio | Nevada | Missouri | Alabama | New York | Oregon | |
---|---|---|---|---|---|---|
a | 1 | 2 | NaN | NaN | 7 | 8 |
c | 3 | 4 | 9 | 10 | 9 | 10 |
e | 5 | 6 | 13 | 14 | 11 | 12 |
left2
Ohio | Nevada | |
---|---|---|
a | 1 | 2 |
c | 3 | 4 |
e | 5 | 6 |
right2
Missouri | Alabama | |
---|---|---|
b | 7 | 8 |
c | 9 | 10 |
d | 11 | 12 |
e | 13 | 14 |
another
New York | Oregon | |
---|---|---|
a | 7 | 8 |
c | 9 | 10 |
e | 11 | 12 |
f | 16 | 17 |
left2.join([right2, another], how='outer')
Ohio | Nevada | Missouri | Alabama | New York | Oregon | |
---|---|---|---|---|---|---|
a | 1 | 2 | NaN | NaN | 7 | 8 |
b | NaN | NaN | 7 | 8 | NaN | NaN |
c | 3 | 4 | 9 | 10 | 9 | 10 |
d | NaN | NaN | 11 | 12 | NaN | NaN |
e | 5 | 6 | 13 | 14 | 11 | 12 |
f | NaN | NaN | NaN | NaN | 16 | 17 |
arr = np.arange(12).reshape((3, 4))
arr
array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]])
type(arr)
numpy.ndarray
# axis = column
np.concatenate([arr, arr], axis=1)
array([[ 0, 1, 2, 3, 0, 1, 2, 3], [ 4, 5, 6, 7, 4, 5, 6, 7], [ 8, 9, 10, 11, 8, 9, 10, 11]])
np.concatenate([arr, arr], axis=0)
array([[ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11], [ 0, 1, 2, 3], [ 4, 5, 6, 7], [ 8, 9, 10, 11]])
s1 = Series([0, 1], index=['a', 'b'])
s2 = Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = Series([5, 6], index=['f', 'g'])
s1
a 0 b 1 dtype: int64
s2
c 2 d 3 e 4 dtype: int64
s3
f 5 g 6 dtype: int64
pd.concat([s1, s2, s3])
a 0 b 1 c 2 d 3 e 4 f 5 g 6 dtype: int64
# Series인 것을 확인할 수 있다.
type( pd.concat([s1, s2, s3]) )
pandas.core.series.Series
pd.concat([s1, s2, s3], axis=1)
0 | 1 | 2 | |
---|---|---|---|
a | 0 | NaN | NaN |
b | 1 | NaN | NaN |
c | NaN | 2 | NaN |
d | NaN | 3 | NaN |
e | NaN | 4 | NaN |
f | NaN | NaN | 5 |
g | NaN | NaN | 6 |
# DataFrame인 것을 확인할 수 있다
type( pd.concat([s1, s2, s3], axis=1) )
pandas.core.frame.DataFrame
s4 = pd.concat([s1 * 5, s3])
s1
a 0 b 1 dtype: int64
s1 * 5
a 0 b 5 dtype: int64
s3
f 5 g 6 dtype: int64
s4
a 0 b 5 f 5 g 6 dtype: int64
pd.concat([s1, s4], axis=1)
0 | 1 | |
---|---|---|
a | 0 | 0 |
b | 1 | 5 |
f | NaN | 5 |
g | NaN | 6 |
pd.concat([s1, s4], axis=1, join='inner')
0 | 1 | |
---|---|---|
a | 0 | 0 |
b | 1 | 5 |
pd.concat([s1, s4], axis=1, join_axes=[['a', 'c', 'b', 'e']])
0 | 1 | |
---|---|---|
a | 0 | 0 |
c | NaN | NaN |
b | 1 | 5 |
e | NaN | NaN |
result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])
s1
a 0 b 1 dtype: int64
s2
c 2 d 3 e 4 dtype: int64
s3
f 5 g 6 dtype: int64
# keys에 의해 첫 번째 s1 = one, 두 번째 s1 = two, s3 = three로 할당 (s2가 아니라 s1이 두 번 사용되었다)
result
one a 0 b 1 two a 0 b 1 three f 5 g 6 dtype: int64
# unstack 함수에 대한 자세한 내용은 나중에 알아보자
result.unstack()
a | b | f | g | |
---|---|---|---|---|
one | 0 | 1 | NaN | NaN |
two | 0 | 1 | NaN | NaN |
three | NaN | NaN | 5 | 6 |
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])
one | two | three | |
---|---|---|---|
a | 0 | NaN | NaN |
b | 1 | NaN | NaN |
c | NaN | 2 | NaN |
d | NaN | 3 | NaN |
e | NaN | 4 | NaN |
f | NaN | NaN | 5 |
g | NaN | NaN | 6 |
df1 = DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
columns=['one', 'two'])
df2 = DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
columns=['three', 'four'])
df1
one | two | |
---|---|---|
a | 0 | 1 |
b | 2 | 3 |
c | 4 | 5 |
df2
three | four | |
---|---|---|
a | 5 | 6 |
c | 7 | 8 |
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])
level1 | level2 | |||
---|---|---|---|---|
one | two | three | four | |
a | 0 | 1 | 5 | 6 |
b | 2 | 3 | NaN | NaN |
c | 4 | 5 | 7 | 8 |
pd.concat({'level1': df1, 'level2': df2}, axis=1)
level1 | level2 | |||
---|---|---|---|---|
one | two | three | four | |
a | 0 | 1 | 5 | 6 |
b | 2 | 3 | NaN | NaN |
c | 4 | 5 | 7 | 8 |
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
names=['upper', 'lower'])
upper | level1 | level2 | ||
---|---|---|---|---|
lower | one | two | three | four |
a | 0 | 1 | 5 | 6 |
b | 2 | 3 | NaN | NaN |
c | 4 | 5 | 7 | 8 |
# Concatenate pandas objects along a particular axis with optional set logic
# along the other axes. Can also add a layer of hierarchical indexing on the
# concatenation axis, which may be useful if the labels are the same (or
# overlapping) on the passed axis number
pd.concat?
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
names=['upper'])
upper | level1 | level2 | ||
---|---|---|---|---|
one | two | three | four | |
a | 0 | 1 | 5 | 6 |
b | 2 | 3 | NaN | NaN |
c | 4 | 5 | 7 | 8 |
# names의 길이는 생성되는 계층 색인의 레벨 수와 같아야 한다. 현재 df에서는 2개.
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
names=['upper', 'lower', 'test'])
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-212-d81fd2f4ba86> in <module>() 1 pd.concat([df1, df2], axis=1, keys=['level1', 'level2'], ----> 2 names=['upper', 'lower', 'test']) /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in concat(objs, axis, join, join_axes, ignore_index, keys, levels, names, verify_integrity) 883 ignore_index=ignore_index, join=join, 884 keys=keys, levels=levels, names=names, --> 885 verify_integrity=verify_integrity) 886 return op.get_result() 887 /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in __init__(self, objs, axis, join, join_axes, keys, levels, names, ignore_index, verify_integrity) 958 self.verify_integrity = verify_integrity 959 --> 960 self.new_axes = self._get_new_axes() 961 962 def get_result(self): /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in _get_new_axes(self) 1160 concat_axis = None 1161 else: -> 1162 concat_axis = self._get_concat_axis() 1163 1164 new_axes[self.axis] = concat_axis /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in _get_concat_axis(self) 1195 else: 1196 concat_axis = _make_concat_multiindex(indexes, self.keys, -> 1197 self.levels, self.names) 1198 1199 self._maybe_check_integrity(concat_axis) /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/merge.pyc in _make_concat_multiindex(indexes, keys, levels, names) 1274 names = names + _get_consensus_names(indexes) 1275 -> 1276 return MultiIndex(levels=levels, labels=label_list, names=names) 1277 1278 new_index = indexes[0] /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/core/index.pyc in __new__(cls, levels, labels, 
sortorder, names, copy) 1599 1600 if names is not None: -> 1601 subarr._set_names(names) 1602 1603 /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/core/index.pyc in _set_names(self, values) 1783 if len(values) != self.nlevels: 1784 raise ValueError('Length of names (%d) must be same as level ' -> 1785 '(%d)' % (len(values),self.nlevels)) 1786 # set the name 1787 for name, level in zip(values, self.levels): ValueError: Length of names (4) must be same as level (2)
df1 = DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])
df1
a | b | c | d | |
---|---|---|---|---|
0 | 0.778910 | 0.311274 | 1.141611 | -2.813149 |
1 | -0.360643 | -0.794361 | 2.023458 | -0.449211 |
2 | -1.677890 | -0.345416 | 0.718920 | 0.171574 |
df2
b | d | a | |
---|---|---|---|
0 | -2.771987 | 1.315566 | 1.019023 |
1 | -0.435467 | -1.595407 | -0.038993 |
# index 부분이 하나로 합쳐짐
pd.concat([df1, df2], ignore_index=True)
a | b | c | d | |
---|---|---|---|---|
0 | 0.778910 | 0.311274 | 1.141611 | -2.813149 |
1 | -0.360643 | -0.794361 | 2.023458 | -0.449211 |
2 | -1.677890 | -0.345416 | 0.718920 | 0.171574 |
3 | 1.019023 | -2.771987 | NaN | 1.315566 |
4 | -0.038993 | -0.435467 | NaN | -1.595407 |
pd.concat([df1, df2])
a | b | c | d | |
---|---|---|---|---|
0 | 0.778910 | 0.311274 | 1.141611 | -2.813149 |
1 | -0.360643 | -0.794361 | 2.023458 | -0.449211 |
2 | -1.677890 | -0.345416 | 0.718920 | 0.171574 |
0 | 1.019023 | -2.771987 | NaN | 1.315566 |
1 | -0.038993 | -0.435467 | NaN | -1.595407 |
인자 | 설명 |
---|---|
objs | 이어붙일 pandas 객체의 사전이나 리스트. 필수 인자 |
axis | 이어붙일 축 방향. 기본값은 0 |
join | 조인 방식. 'inner'(내부조인, 교집합)와 'outer'(외부조인, 합집합)가 있으며 기본값은 'outer' |
join_axes | 합집합/교집합을 수행하는 대신 다른 n-1 축으로 사용할 색인을 지정한다. |
keys | 이어붙일 객체나 이어붙인 축에 대한 계층 색인을 생성하는 데 연관된 값이다. 리스트나 임의의 값이 들어있는 배열, 튜플의 배열 또는 배열의 리스트(levels 옵션에 다차원 배열이 넘어온 경우)가 될 수 있다. |
levels | 계층 색인 레벨로 사용할 색인을 지정한다. keys가 넘어온 경우 여러 개의 색인을 지정한다. |
names | keys나 levels 혹은 둘 다 있을 경우, 생성된 계층 레벨을 위한 이름 |
verify_integrity | 이어붙인 객체에 중복되는 축이 있는지 검사하고 있다면 예외를 발생시킨다. 기본값은 False로, 중복을 허용한다. |
ignore_index | 이어붙인 축의 색인을 유지하지 않고 range(total_length)로 새로운 색인을 생성한다. |
a = Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
index=['f', 'e', 'd', 'c', 'b', 'a'])
b = Series(np.arange(len(a), dtype=np.float64),
index=['f', 'e', 'd', 'c', 'b', 'a'])
b[-1] = np.nan
a
f NaN e 2.5 d NaN c 3.5 b 4.5 a NaN dtype: float64
np.arange(len(a))
array([0, 1, 2, 3, 4, 5])
b
f 0 e 1 d 2 c 3 b 4 a NaN dtype: float64
# 오랜만에 나와서 np.where가 어떤 기능인지 까먹었었다!
# pd.isnull(a)는 null인 원소에 대해 True를 돌려준다.
# 삼항 연산자처럼 조건이 True(즉 a가 null)이면 b값을 대입하고
# 조건이 False(즉 a가 null이 아니면)면 a값을 대입한다.
np.where(pd.isnull(a), b, a)
array([ 0. , 2.5, 2. , 3.5, 4.5, nan])
pd.isnull(a)
f True e False d True c False b False a True dtype: bool
b[:-2]
f 0 e 1 d 2 c 3 dtype: float64
a[2:]
d NaN c 3.5 b 4.5 a NaN dtype: float64
b[:-2].combine_first(a[2:])
a NaN b 4.5 c 3.0 d 2.0 e 1.0 f 0.0 dtype: float64
b[:-2].combine_first
<bound method Series.combine_first of f 0 e 1 d 2 c 3 dtype: float64>
b[:-2]
f 0 e 1 d 2 c 3 dtype: float64
b.combine_first(a)
f 0 e 1 d 2 c 3 b 4 a NaN dtype: float64
# a의 값이 null(NaN)일 때만 b 값을 참조한다.
# a를 우선 사용하되 null이라면 b값 참조
a.combine_first(b)
f 0.0 e 2.5 d 2.0 c 3.5 b 4.5 a NaN dtype: float64
a
f NaN e 2.5 d NaN c 3.5 b 4.5 a NaN dtype: float64
df1 = DataFrame({'a': [1., np.nan, 5., np.nan],
'b': [np.nan, 2., np.nan, 6.],
'c': range(2, 18, 4)})
df2 = DataFrame({'a': [5., 4., np.nan, 3., 7.],
'b': [np.nan, 3., 4., 6., 8.]})
df1
a | b | c | |
---|---|---|---|
0 | 1 | NaN | 2 |
1 | NaN | 2 | 6 |
2 | 5 | NaN | 10 |
3 | NaN | 6 | 14 |
df2
a | b | |
---|---|---|
0 | 5 | NaN |
1 | 4 | 3 |
2 | NaN | 4 |
3 | 3 | 6 |
4 | 7 | 8 |
df1.combine_first(df2)
a | b | c | |
---|---|---|---|
0 | 1 | NaN | 2 |
1 | 4 | 2 | 6 |
2 | 5 | 4 | 10 |
3 | 3 | 6 | 14 |
4 | 7 | 8 | NaN |
# 문자열이 담긴 배열을 로우와 칼럼의 색인으로하는 작은 DataFrame
data = DataFrame(np.arange(6).reshape((2, 3)),
index=pd.Index(['Ohio', 'Colorado'], name='state'),
columns=pd.Index(['one', 'two', 'three'], name='number'))
data
number | one | two | three |
---|---|---|---|
state | |||
Ohio | 0 | 1 | 2 |
Colorado | 3 | 4 | 5 |
# 문자열이 담긴 배열을 로우와 칼럼의 색인으로하는 작은 DataFrame
# pd.Index로 index를 설정 후에 name을 입력하기 위해서 저렇게 한 것.
# 기본적으로는 columns에 list만 넘겨도 생성 됨
data2 = DataFrame(np.arange(6).reshape((2, 3)),
index=pd.Index(['Ohio', 'Colorado'], name='state'),
columns=['one', 'two', 'three'])
data2
one | two | three | |
---|---|---|---|
state | |||
Ohio | 0 | 1 | 2 |
Colorado | 3 | 4 | 5 |
result = data.stack()
result
state number Ohio one 0 two 1 three 2 Colorado one 3 two 4 three 5 dtype: int64
type(result)
pandas.core.series.Series
# 보통 가장 안쪽에 있는 one, two, three부터 컬럼으로 끄집어 냄
result.unstack()
number | one | two | three |
---|---|---|---|
state | |||
Ohio | 0 | 1 | 2 |
Colorado | 3 | 4 | 5 |
type( result.unstack() )
pandas.core.frame.DataFrame
result.unstack(0)
state | Ohio | Colorado |
---|---|---|
number | ||
one | 0 | 3 |
two | 1 | 4 |
three | 2 | 5 |
result.unstack('state')
state | Ohio | Colorado |
---|---|---|
number | ||
one | 0 | 3 |
two | 1 | 4 |
three | 2 | 5 |
result.unstack(1)
number | one | two | three |
---|---|---|---|
state | |||
Ohio | 0 | 1 | 2 |
Colorado | 3 | 4 | 5 |
result.unstack('number')
number | one | two | three |
---|---|---|---|
state | |||
Ohio | 0 | 1 | 2 |
Colorado | 3 | 4 | 5 |
s1 = Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s2 = Series([4, 5, 6], index=['c', 'd', 'e'])
data2 = pd.concat([s1, s2], keys=['one', 'two'])
s1
a 0 b 1 c 2 d 3 dtype: int64
s2
c 4 d 5 e 6 dtype: int64
data3 = pd.concat([s1, s2])
data3
a 0 b 1 c 2 d 3 c 4 d 5 e 6 dtype: int64
# index로 s1을 one, s2를 two로 할당
data2
one a 0 b 1 c 2 d 3 two c 4 d 5 e 6 dtype: int64
data2.unstack()
a | b | c | d | e | |
---|---|---|---|---|---|
one | 0 | 1 | 2 | 3 | NaN |
two | NaN | NaN | 4 | 5 | 6 |
# 바꿨다 원상복구... 근데 null값 포함하지 않는다는 게 point!
data2.unstack().stack()
one a 0 b 1 c 2 d 3 two c 4 d 5 e 6 dtype: float64
data2
one a 0 b 1 c 2 d 3 two c 4 d 5 e 6 dtype: int64
data2.unstack().stack(dropna=False)
one a 0 b 1 c 2 d 3 e NaN two a NaN b NaN c 4 d 5 e 6 dtype: float64
df = DataFrame({'left': result, 'right': result + 5},
columns=pd.Index(['left', 'right'], name='side'))
df
side | left | right | |
---|---|---|---|
state | number | ||
Ohio | one | 0 | 5 |
two | 1 | 6 | |
three | 2 | 7 | |
Colorado | one | 3 | 8 |
two | 4 | 9 | |
three | 5 | 10 |
result
state number Ohio one 0 two 1 three 2 Colorado one 3 two 4 three 5 dtype: int64
# unstack 할 때 결과에서 가장 낮은 레벨. 즉, side의 하위로 갔다.
df.unstack('state')
side | left | right | ||
---|---|---|---|---|
state | Ohio | Colorado | Ohio | Colorado |
number | ||||
one | 0 | 3 | 5 | 8 |
two | 1 | 4 | 6 | 9 |
three | 2 | 5 | 7 | 10 |
# side도 number의 하위로 갔다.
df.unstack('state').stack('side')
state | Ohio | Colorado | |
---|---|---|---|
number | side | ||
one | left | 0 | 3 |
right | 5 | 8 | |
two | left | 1 | 4 |
right | 6 | 9 | |
three | left | 2 | 5 |
right | 7 | 10 |
date, item, value
1959-03-31 00:00:00, realgdp, 2710.349
1959-03-31 00:00:00, infl, 0.000
1959-03-31 00:00:00, unemp, 5.800
1959-06-30 00:00:00, realgdp, 2778.801
1959-06-30 00:00:00, infl, 2.340
1959-06-30 00:00:00, unemp, 5.100
1959-09-30 00:00:00, realgdp, 2775.488
1959-09-30 00:00:00, infl, 2.740
1959-09-30 00:00:00, unemp, 5.300
1959-12-31 00:00:00, realgdp, 2785.204
%%writefile ch07/pivot.csv
date, item, value
1959-03-31 00:00:00, realgdp, 2710.349
1959-03-31 00:00:00, infl, 0.000
1959-03-31 00:00:00, unemp, 5.800
1959-06-30 00:00:00, realgdp, 2778.801
1959-06-30 00:00:00, infl, 2.340
1959-06-30 00:00:00, unemp, 5.100
1959-09-30 00:00:00, realgdp, 2775.488
1959-09-30 00:00:00, infl, 2.740
1959-09-30 00:00:00, unemp, 5.300
1959-12-31 00:00:00, realgdp, 2785.204
Overwriting ch07/pivot.csv
# header를 0으로 설정하겠다는건데.. 기본값.
# 계속 pivot 할 때 에러가 난다.
# names로 꼭 설정해야 하나보다. pivot이 인식하는 것은 names로 setting된 값인듯.
# 열과 행을 바꿔야 하기 때문에 이미 인식하고 있는 값이 없으면 error 뱉는것 같다.
ldata = pd.read_csv('ch07/pivot.csv', header=0)
# 이것 말고는 딱히 다른게 생각나지 않는다.
# csv 파일에서 1번째 줄인 header를 지우면 어떤 컬럼인지 모르니까 차라리 skiprows=1 을 해주는게 낫겠다.
# 6장 처음 부분에 read_csv 옵션들이 있으니 참고
ldata = pd.read_csv('ch07/pivot.csv', skiprows=1, names=['date', 'item', 'value'])
ldata[:10]
date | item | value | |
---|---|---|---|
0 | 1959-03-31 00:00:00 | realgdp | 2710.349 |
1 | 1959-03-31 00:00:00 | infl | 0.000 |
2 | 1959-03-31 00:00:00 | unemp | 5.800 |
3 | 1959-06-30 00:00:00 | realgdp | 2778.801 |
4 | 1959-06-30 00:00:00 | infl | 2.340 |
5 | 1959-06-30 00:00:00 | unemp | 5.100 |
6 | 1959-09-30 00:00:00 | realgdp | 2775.488 |
7 | 1959-09-30 00:00:00 | infl | 2.740 |
8 | 1959-09-30 00:00:00 | unemp | 5.300 |
9 | 1959-12-31 00:00:00 | realgdp | 2785.204 |
type(ldata)
pandas.core.frame.DataFrame
# 1번째 인자: 로우 색인으로 사용될 칼럼 이름
# 2번째 인자: 칼럼 색인으로 사용될 칼럼 이름
# 3번째 인자: DataFrame에 채워 넣을 값을 담고 있는 칼럼
pivoted = ldata.pivot('date', 'item', 'value')
pivoted.head()
item | infl | realgdp | unemp |
---|---|---|---|
date | |||
1959-03-31 00:00:00 | 0.00 | 2710.349 | 5.8 |
1959-06-30 00:00:00 | 2.34 | 2778.801 | 5.1 |
1959-09-30 00:00:00 | 2.74 | 2775.488 | 5.3 |
1959-12-31 00:00:00 | NaN | 2785.204 | NaN |
ldata['value2'] = np.random.randn(len(ldata))
ldata[:10]
date | item | value | value2 | |
---|---|---|---|---|
0 | 1959-03-31 00:00:00 | realgdp | 2710.349 | -1.758243 |
1 | 1959-03-31 00:00:00 | infl | 0.000 | -1.163026 |
2 | 1959-03-31 00:00:00 | unemp | 5.800 | -0.997308 |
3 | 1959-06-30 00:00:00 | realgdp | 2778.801 | -1.407402 |
4 | 1959-06-30 00:00:00 | infl | 2.340 | 1.698340 |
5 | 1959-06-30 00:00:00 | unemp | 5.100 | 0.309916 |
6 | 1959-09-30 00:00:00 | realgdp | 2775.488 | 0.759469 |
7 | 1959-09-30 00:00:00 | infl | 2.740 | -0.376500 |
8 | 1959-09-30 00:00:00 | unemp | 5.300 | -0.288409 |
9 | 1959-12-31 00:00:00 | realgdp | 2785.204 | -2.492189 |
# 3번째 인자 생략하면 계층적 색인으로 보여줌
pivoted = ldata.pivot('date', 'item')
pivoted[:5]
value | value2 | |||||
---|---|---|---|---|---|---|
item | infl | realgdp | unemp | infl | realgdp | unemp |
date | ||||||
1959-03-31 00:00:00 | 0.00 | 2710.349 | 5.8 | -1.163026 | -1.758243 | -0.997308 |
1959-06-30 00:00:00 | 2.34 | 2778.801 | 5.1 | 1.698340 | -1.407402 | 0.309916 |
1959-09-30 00:00:00 | 2.74 | 2775.488 | 5.3 | -0.376500 | 0.759469 | -0.288409 |
1959-12-31 00:00:00 | NaN | 2785.204 | NaN | NaN | -2.492189 | NaN |
# Wow! Simple sentence!
# 계층적 색인으로 되어 있는 것을 ['value']로 좁혀주고 5개만 보여준다.
pivoted['value'][:5]
item | infl | realgdp | unemp |
---|---|---|---|
date | |||
1959-03-31 00:00:00 | 0.00 | 2710.349 | 5.8 |
1959-06-30 00:00:00 | 2.34 | 2778.801 | 5.1 |
1959-09-30 00:00:00 | 2.74 | 2775.488 | 5.3 |
1959-12-31 00:00:00 | NaN | 2785.204 | NaN |
unstacked = ldata.set_index(['date', 'item']).unstack('item')
unstacked[:7]
value | value2 | |||||
---|---|---|---|---|---|---|
item | infl | realgdp | unemp | infl | realgdp | unemp |
date | ||||||
1959-03-31 00:00:00 | 0.00 | 2710.349 | 5.8 | -1.163026 | -1.758243 | -0.997308 |
1959-06-30 00:00:00 | 2.34 | 2778.801 | 5.1 | 1.698340 | -1.407402 | 0.309916 |
1959-09-30 00:00:00 | 2.74 | 2775.488 | 5.3 | -0.376500 | 0.759469 | -0.288409 |
1959-12-31 00:00:00 | NaN | 2785.204 | NaN | NaN | -2.492189 | NaN |
# 위에서 unstack을 하니 item 로우가 컬럼으로 이동했다.
unstacked = ldata.set_index(['date', 'item'])
# date, item이 로우
# value, value2가 열
unstacked[:7]
value | value2 | ||
---|---|---|---|
date | item | ||
1959-03-31 00:00:00 | realgdp | 2710.349 | -1.758243 |
infl | 0.000 | -1.163026 | |
unemp | 5.800 | -0.997308 | |
1959-06-30 00:00:00 | realgdp | 2778.801 | -1.407402 |
infl | 2.340 | 1.698340 | |
unemp | 5.100 | 0.309916 | |
1959-09-30 00:00:00 | realgdp | 2775.488 | 0.759469 |
# set_index가 로우 설정
# set_index로 설정되지 않은 것들은 모두 열로 이동
unstacked = ldata.set_index(['date'])
unstacked
item | value | value2 | |
---|---|---|---|
date | |||
1959-03-31 00:00:00 | realgdp | 2710.349 | -1.758243 |
1959-03-31 00:00:00 | infl | 0.000 | -1.163026 |
1959-03-31 00:00:00 | unemp | 5.800 | -0.997308 |
1959-06-30 00:00:00 | realgdp | 2778.801 | -1.407402 |
1959-06-30 00:00:00 | infl | 2.340 | 1.698340 |
1959-06-30 00:00:00 | unemp | 5.100 | 0.309916 |
1959-09-30 00:00:00 | realgdp | 2775.488 | 0.759469 |
1959-09-30 00:00:00 | infl | 2.740 | -0.376500 |
1959-09-30 00:00:00 | unemp | 5.300 | -0.288409 |
1959-12-31 00:00:00 | realgdp | 2785.204 | -2.492189 |
data = DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
'k2': [1, 1, 2, 3, 3, 4, 4]})
data
k1 | k2 | |
---|---|---|
0 | one | 1 |
1 | one | 1 |
2 | one | 2 |
3 | two | 3 |
4 | two | 3 |
5 | two | 4 |
6 | two | 4 |
# 이 구조가 잘 이해 안되면 타이핑 해보세요.
# 파이썬 인터프리터가 답을 알려줍니다.
['one'] * 3 + ['two'] * 4
['one', 'one', 'one', 'two', 'two', 'two', 'two']
data.duplicated()
0 False 1 True 2 False 3 False 4 True 5 False 6 True dtype: bool
data2 = DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
'k2': [1, 1, 2, 3, 3, 3, 4]})
# 2개 열이 모두 같아야 중복으로 인정 됨
data2.duplicated()
0 False 1 True 2 False 3 False 4 True 5 True 6 False dtype: bool
# duplicated 배열이 False인 DataFrame 반환
data.drop_duplicates()
k1 | k2 | |
---|---|---|
0 | one | 1 |
2 | one | 2 |
3 | two | 3 |
5 | two | 4 |
data['v1'] = range(7)
data
k1 | k2 | v1 | |
---|---|---|---|
0 | one | 1 | 0 |
1 | one | 1 | 1 |
2 | one | 2 | 2 |
3 | two | 3 | 3 |
4 | two | 3 | 4 |
5 | two | 4 | 5 |
6 | two | 4 | 6 |
# 중복 여부를 k1 컬럼만 본다는 이야기
# one, two만 남을 수 밖에 없음
data.drop_duplicates(['k1'])
k1 | k2 | v1 | |
---|---|---|---|
0 | one | 1 | 0 |
3 | two | 3 | 3 |
data.drop_duplicates(['k1'], take_last=True)
k1 | k2 | v1 | |
---|---|---|---|
2 | one | 2 | 2 |
6 | two | 4 | 6 |
data.drop_duplicates(['k1'], take_last=False)
k1 | k2 | v1 | |
---|---|---|---|
0 | one | 1 | 0 |
3 | two | 3 | 3 |
data.drop_duplicates(['k1', 'k2'])
k1 | k2 | v1 | |
---|---|---|---|
0 | one | 1 | 0 |
2 | one | 2 | 2 |
3 | two | 3 | 3 |
5 | two | 4 | 5 |
# take_last=False는 5를 선택
# take_last=True는 6을 선택
data.drop_duplicates(['k1', 'k2'], take_last=True)
k1 | k2 | v1 | |
---|---|---|---|
1 | one | 1 | 1 |
2 | one | 2 | 2 |
4 | two | 3 | 4 |
6 | two | 4 | 6 |
data.drop_duplicates?
Type: instancemethod
String form:
<bound method DataFrame.drop_duplicates of k1 k2 v1
0 one 1 0
1 one 1 1
2 one 2 2
3 two 3 3
4 two 3 4
5 two 4 5
6 two 4 6>
File: /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/core/frame.py
Definition: data.drop_duplicates(self, cols=None, take_last=False, inplace=False)
Docstring:
Return DataFrame with duplicate rows removed, optionally only
considering certain columns
Parameters
----------
cols : column label or sequence of labels, optional
Only consider certain columns for identifying duplicates, by
default use all of the columns
take_last : boolean, default False
Take the last observed row in a row. Defaults to the first row
inplace : boolean, default False
Whether to drop duplicates in place or to return a copy
Returns
-------
deduplicated : DataFrame
data = DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
'corned beef', 'Bacon', 'pastrami',
'honey ham', 'nova lox'],
'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
food | ounces | |
---|---|---|
0 | bacon | 4.0 |
1 | pulled pork | 3.0 |
2 | bacon | 12.0 |
3 | Pastrami | 6.0 |
4 | corned beef | 7.5 |
5 | Bacon | 8.0 |
6 | pastrami | 3.0 |
7 | honey ham | 5.0 |
8 | nova lox | 6.0 |
meat_to_animal = {
'bacon': 'pig',
'pulled pork': 'pig',
'pastrami': 'cow',
'corned beef': 'cow',
'honey ham': 'pig',
'nova lox': 'salmon'
}
# 먼저 data['food']에 어떤 데이터가 있는지 확인
data['food']
0 bacon 1 pulled pork 2 bacon 3 Pastrami 4 corned beef 5 Bacon 6 pastrami 7 honey ham 8 nova lox Name: food, dtype: object
data['food'].map?
map 1번째 인자로 함수를 넘기고, 2번째 인자로 연속된 형태의(리스트, 튜플 같은) 자료형을 넘기면 된다.
Type: builtin_function_or_method
String form: <built-in function map>
Namespace: Python builtin
Docstring:
map(function, sequence[, sequence, ...]) -> list
Return a list of the results of applying the function to the items of
the argument sequence(s). If more than one sequence is given, the
function is called with an argument list consisting of the corresponding
item of each sequence, substituting None for missing values when not all
sequences have the same length. If the function is None, return a list of
the items of the sequence (or a list of tuples if more than one sequence).
data['food'].map
# data['food']에 map 함수를 적용하는데 소문자로 모두 변경하는 함수 적용
data['food'].map(str.lower)
0 bacon 1 pulled pork 2 bacon 3 pastrami 4 corned beef 5 bacon 6 pastrami 7 honey ham 8 nova lox Name: food, dtype: object
# 또 map으로 meat_to_animal dictionary를 넘긴다.
# bacon -> pig로, pastrami -> cow로 변경
data['food'].map(str.lower).map(meat_to_animal)
0 pig 1 pig 2 pig 3 cow 4 cow 5 pig 6 cow 7 pig 8 salmon Name: food, dtype: object
# 변경된 것들을 animal 열을 새로 생성하고 여기에 대입
data['animal'] = data['food'].map(str.lower).map(meat_to_animal)
# food열의 대∙소문자는 안 바뀐 것을 알 수 있다.
# 대∙소문자 변환한 건 animal 필드에 적용하기 위한 것임을 확인할 수 있다.
data
food | ounces | animal | |
---|---|---|---|
0 | bacon | 4.0 | pig |
1 | pulled pork | 3.0 | pig |
2 | bacon | 12.0 | pig |
3 | Pastrami | 6.0 | cow |
4 | corned beef | 7.5 | cow |
5 | Bacon | 8.0 | pig |
6 | pastrami | 3.0 | cow |
7 | honey ham | 5.0 | pig |
8 | nova lox | 6.0 | salmon |
data['food'].map(lambda x: meat_to_animal[x.lower()])
0 pig 1 pig 2 pig 3 cow 4 cow 5 pig 6 cow 7 pig 8 salmon Name: food, dtype: object
data = Series([1., -999., 2., -999., -1000., 3.])
data
0 1 1 -999 2 2 3 -999 4 -1000 5 3 dtype: float64
data.replace(-999, np.nan)
0 1 1 NaN 2 2 3 NaN 4 -1000 5 3 dtype: float64
# 정말 직관적으로 잘만든것 같다.
# 1개 이상을 변경하려면 list로 넘기면 모두 알아서 변경해준다.
data.replace([-999, -1000], np.nan)
0 1 1 NaN 2 2 3 NaN 4 NaN 5 3 dtype: float64
# 변경할 값만 list로 넘기라는 법 없음
# 변경하고 싶은 값도 list로 넘기면 순서에 맞게끔 변경해 줌
data.replace([-999, -1000], [999, 1000])
0 1 1 999 2 2 3 999 4 1000 5 3 dtype: float64
data.replace([-999, -1000], [np.nan, 0])
0 1 1 NaN 2 2 3 NaN 4 0 5 3 dtype: float64
# 사전으로 넘겨도 된다.
# 정말 편리하게 만들었다.
data.replace({-999: np.nan, -1000: 0})
0 1 1 NaN 2 2 3 NaN 4 0 5 3 dtype: float64
data = DataFrame(np.arange(12).reshape((3, 4)),
index = ['Ohio', 'Colorado', 'New York'],
columns=['one', 'two', 'three', 'four'])
data
one | two | three | four | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
New York | 8 | 9 | 10 | 11 |
data.index.map(str.upper)
array(['OHIO', 'COLORADO', 'NEW YORK'], dtype=object)
data.index = data.index.map(str.upper)
data
one | two | three | four | |
---|---|---|---|---|
OHIO | 0 | 1 | 2 | 3 |
COLORADO | 4 | 5 | 6 | 7 |
NEW YORK | 8 | 9 | 10 | 11 |
data.rename(index=str.title, columns=str.upper)
ONE | TWO | THREE | FOUR | |
---|---|---|---|---|
Ohio | 0 | 1 | 2 | 3 |
Colorado | 4 | 5 | 6 | 7 |
New York | 8 | 9 | 10 | 11 |
# 위에서 index를 대문자로 바꿔줬다.
# rename을 사용해서 원본은 바뀌지 않았다.
data
one | two | three | four | |
---|---|---|---|---|
OHIO | 0 | 1 | 2 | 3 |
COLORADO | 4 | 5 | 6 | 7 |
NEW YORK | 8 | 9 | 10 | 11 |
data.rename(index={'OHIO': 'INDIANA'},
columns={'three': 'peekaboo'})
one | two | peekaboo | four | |
---|---|---|---|---|
INDIANA | 0 | 1 | 2 | 3 |
COLORADO | 4 | 5 | 6 | 7 |
NEW YORK | 8 | 9 | 10 | 11 |
data
one | two | three | four | |
---|---|---|---|---|
OHIO | 0 | 1 | 2 | 3 |
COLORADO | 4 | 5 | 6 | 7 |
NEW YORK | 8 | 9 | 10 | 11 |
# 항상 DataFrame의 참조를 반환한다.
_ = data.rename(index={'OHIO': 'INDIANA'}, inplace=True)
# OHIO -> INDIANA로 바뀌었다.
data
one | two | three | four | |
---|---|---|---|---|
INDIANA | 0 | 1 | 2 | 3 |
COLORADO | 4 | 5 | 6 | 7 |
NEW YORK | 8 | 9 | 10 | 11 |
ages = [20, 22, 25, 27, 21, 23, 27, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
# 1번째 인자: 나눌 list
# 2번째 인자: 나눌 기준
cats = pd.cut(ages, bins)
pd.cut?
Type: function
String form: <function cut at 0x109e2acf8>
File: /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/tools/tile.py
Definition: pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
Docstring:
Return indices of half-open bins to which each value of `x` belongs.
Parameters
----------
x : array-like
Input array to be binned. It has to be 1-dimensional.
bins : int or sequence of scalars
If `bins` is an int, it defines the number of equal-width bins in the
range of `x`. However, in this case, the range of `x` is extended
by .1% on each side to include the min or max values of `x`. If
`bins` is a sequence it defines the bin edges allowing for
non-uniform bin width. No extension of the range of `x` is done in
this case.
right : bool, optional
Indicates whether the bins include the rightmost edge or not. If
right == True (the default), then the bins [1,2,3,4] indicate
(1,2], (2,3], (3,4].
labels : array or boolean, default None
Labels to use for bin edges, or False to return integer bin labels
retbins : bool, optional
Whether to return the bins or not. Can be useful if bins is given
as a scalar.
Returns
-------
out : Categorical or array of integers if labels is False
bins : ndarray of floats
Returned only if `retbins` is True.
Notes
-----
The `cut` function can be useful for going from a continuous variable to
a categorical variable. For example, `cut` could convert ages to groups
of age ranges.
Any NA values will be NA in the result. Out of bounds values will be NA in
the resulting Categorical object
Examples
--------
>>> cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
(array([(0.191, 3.367], (0.191, 3.367], (0.191, 3.367], (3.367, 6.533],
(6.533, 9.7], (0.191, 3.367]], dtype=object),
array([ 0.1905 , 3.36666667, 6.53333333, 9.7 ]))
>>> cut(np.ones(5), 4, labels=False)
array([2, 2, 2, 2, 2])
cats
Categorical: [(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], (18, 25], (25, 35], (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]] Levels (4): Index(['(18, 25]', '(25, 35]', '(35, 60]', '(60, 100]'], dtype=object)
cats.labels
array([0, 0, 0, 1, 0, 0, 1, 1, 3, 2, 2, 1])
cats.levels
Index([u'(18, 25]', u'(25, 35]', u'(35, 60]', u'(60, 100]'], dtype=object)
pd.value_counts(cats)
(18, 25] 5 (25, 35] 4 (35, 60] 2 (60, 100] 1 dtype: int64
pd.cut(ages, [18, 26, 36, 61, 100], right=False)
Categorical: [[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), [18, 26), [26, 36), [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)] Levels (4): Index(['[18, 26)', '[26, 36)', '[36, 61)', '[61, 100)'], dtype=object)
bins
[18, 25, 35, 60, 100]
group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names)
Categorical: [Youth, Youth, Youth, YoungAdult, Youth, Youth, YoungAdult, YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult] Levels (4): Index(['Youth', 'YoungAdult', 'MiddleAged', 'Senior'], dtype=object)
data = np.random.randn(20)
data
array([ 0.21900547, 0.8840372 , -0.84813527, 1.79066963, 1.20818628, 0.50816038, 0.5384968 , -0.46679507, 0.20554364, 0.40605808, 1.38356295, -0.0487796 , -1.86564376, 0.0567211 , 0.00651676, -0.93895591, -0.74396268, 0.2366114 , 0.62541814, -1.38886959])
# 1번째 인자: data, 2번째 인자: 그룹의 갯수, precision: 구간 경계값 표시 소수점 자릿수
pd.cut(data, 4, precision=2)
Categorical: [(-0.037, 0.88], (0.88, 1.79], (-0.95, -0.037], (0.88, 1.79], (0.88, 1.79], (-0.037, 0.88], (-0.037, 0.88], (-0.95, -0.037], (-0.037, 0.88], (-0.037, 0.88], (0.88, 1.79], (-0.95, -0.037], (-1.87, -0.95], (-0.037, 0.88], (-0.037, 0.88], (-0.95, -0.037], (-0.95, -0.037], (-0.037, 0.88], (-0.037, 0.88], (-1.87, -0.95]] Levels (4): Index(['(-1.87, -0.95]', '(-0.95, -0.037]', '(-0.037, 0.88]', '(0.88, 1.79]'], dtype=object)
data = np.random.randn(1000) # Normally distributed
cats = pd.qcut(data, 4) # Cut into quartiles
cats
Categorical: [(-0.608, 0.0816], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], (-0.608, 0.0816], [-3.176, -0.608], (0.0816, 0.744], (0.0816, 0.744], (-0.608, 0.0816], (0.744, 2.956], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], (0.0816, 0.744], (-0.608, 0.0816], (0.744, 2.956], (-0.608, 0.0816], (-0.608, 0.0816], [-3.176, -0.608], (0.0816, 0.744], (0.744, 2.956], (-0.608, 0.0816], (-0.608, 0.0816], [-3.176, -0.608], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], (0.744, 2.956], (-0.608, 0.0816], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], (-0.608, 0.0816], [-3.176, -0.608], (0.0816, 0.744], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], (0.744, 2.956], [-3.176, -0.608], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 
0.0816], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], (0.0816, 0.744], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], (0.744, 2.956], [-3.176, -0.608], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], (0.744, 2.956], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], (-0.608, 0.0816], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], (0.744, 2.956], (0.744, 2.956], (-0.608, 0.0816], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], [-3.176, -0.608], (0.0816, 0.744], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], (0.744, 2.956], [-3.176, -0.608], (0.0816, 0.744], (0.744, 2.956], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], (-0.608, 0.0816], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], [-3.176, 
-0.608], [-3.176, -0.608], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], [-3.176, -0.608], (-0.608, 0.0816], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (0.0816, 0.744], (-0.608, 0.0816], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], [-3.176, -0.608], (0.0816, 0.744], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], (-0.608, 0.0816], [-3.176, -0.608], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], [-3.176, -0.608], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], [-3.176, -0.608], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], [-3.176, -0.608], (0.744, 2.956], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], [-3.176, -0.608], (0.744, 2.956], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], (0.744, 2.956], (0.744, 
2.956], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], (0.744, 2.956], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], (-0.608, 0.0816], [-3.176, -0.608], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], [-3.176, -0.608], (0.0816, 0.744], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], [-3.176, -0.608], (0.744, 2.956], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], (0.744, 2.956], (-0.608, 0.0816], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], [-3.176, -0.608], (0.744, 2.956], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], [-3.176, -0.608], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], (0.744, 2.956], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], [-3.176, -0.608], [-3.176, -0.608], 
[-3.176, -0.608], (-0.608, 0.0816], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], (-0.608, 0.0816], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], (0.744, 2.956], (0.744, 2.956], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], (0.744, 2.956], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], (0.0816, 0.744], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (-0.608, 0.0816], (-0.608, 0.0816], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], (0.744, 2.956], [-3.176, -0.608], (0.0816, 0.744], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], [-3.176, -0.608], (0.744, 2.956], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], [-3.176, -0.608], 
(-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], (0.744, 2.956], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], (-0.608, 0.0816], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], [-3.176, -0.608], (0.744, 2.956], (-0.608, 0.0816], [-3.176, -0.608], (0.744, 2.956], [-3.176, -0.608], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], [-3.176, -0.608], (-0.608, 0.0816], (0.0816, 0.744], (-0.608, 0.0816], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], [-3.176, -0.608], [-3.176, -0.608], (0.744, 2.956], (-0.608, 0.0816], (0.0816, 0.744], (0.744, 2.956], 
(-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], (0.744, 2.956], [-3.176, -0.608], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], (-0.608, 0.0816], [-3.176, -0.608], (0.744, 2.956], (-0.608, 0.0816], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (-0.608, 0.0816], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (-0.608, 0.0816], [-3.176, -0.608], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], (-0.608, 0.0816], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], (-0.608, 0.0816], (0.0816, 0.744], (0.744, 2.956], (0.744, 2.956], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], [-3.176, -0.608], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], (0.744, 2.956], [-3.176, -0.608], [-3.176, -0.608], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], (-0.608, 0.0816], (-0.608, 0.0816], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], (0.744, 2.956], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 
2.956], [-3.176, -0.608], [-3.176, -0.608], (0.744, 2.956], (-0.608, 0.0816], (-0.608, 0.0816], [-3.176, -0.608], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], [-3.176, -0.608], (0.744, 2.956], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], (0.744, 2.956], (0.744, 2.956], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], [-3.176, -0.608], (0.744, 2.956], (-0.608, 0.0816], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], (-0.608, 0.0816], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], (-0.608, 0.0816], [-3.176, -0.608], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], (-0.608, 0.0816], [-3.176, -0.608], (0.0816, 0.744], (0.744, 2.956], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], (-0.608, 0.0816], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], [-3.176, -0.608], (0.744, 2.956], [-3.176, -0.608], (0.744, 2.956], (0.744, 2.956], [-3.176, -0.608], (0.0816, 0.744], (0.0816, 0.744], (-0.608, 0.0816], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], 
(0.744, 2.956], (0.0816, 0.744], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], [-3.176, -0.608], (0.744, 2.956], (0.744, 2.956], (0.744, 2.956], (0.0816, 0.744], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], [-3.176, -0.608], [-3.176, -0.608], (0.0816, 0.744], (0.744, 2.956], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (0.0816, 0.744], (0.0816, 0.744], (0.744, 2.956], (0.744, 2.956], (0.0816, 0.744], [-3.176, -0.608], (0.0816, 0.744], (0.744, 2.956], [-3.176, -0.608], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (0.744, 2.956], [-3.176, -0.608], [-3.176, -0.608], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], (0.744, 2.956], (0.744, 2.956], [-3.176, -0.608], [-3.176, -0.608], (0.744, 2.956], (-0.608, 0.0816], (0.744, 2.956], (0.0816, 0.744], (0.744, 2.956], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], (0.744, 2.956], [-3.176, -0.608], (0.0816, 0.744], (-0.608, 0.0816], (-0.608, 0.0816], [-3.176, -0.608], (0.744, 2.956], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956], (0.0816, 0.744], (-0.608, 0.0816], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], [-3.176, -0.608], (-0.608, 0.0816], [-3.176, -0.608], (-0.608, 0.0816], (0.744, 2.956], (0.744, 2.956]] Levels (4): Index(['[-3.176, -0.608]', '(-0.608, 0.0816]', '(0.0816, 0.744]', '(0.744, 2.956]'], dtype=object)
pd.value_counts(cats)
(-0.608, 0.0816] 250 (0.744, 2.956] 250 [-3.176, -0.608] 250 (0.0816, 0.744] 250 dtype: int64
data2 = pd.cut(data, 4, precision=2)
# qcut과 비교를 위해
pd.value_counts(data2)
(-0.11, 1.42] 496 (-1.64, -0.11] 375 (1.42, 2.96] 83 (-3.18, -1.64] 46 dtype: int64
cats2 = pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
cats2
Categorical: [(-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (1.361, 2.956], (1.361, 2.956], [-3.176, -1.259], (-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], [-3.176, -1.259], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], [-3.176, -1.259], (-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], 
(-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], [-3.176, -1.259], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], 
(0.0816, 1.361], [-3.176, -1.259], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (1.361, 2.956], [-3.176, -1.259], (0.0816, 1.361], (1.361, 2.956], (1.361, 2.956], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 
0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (1.361, 2.956], (1.361, 2.956], [-3.176, -1.259], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], [-3.176, -1.259], (-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], [-3.176, -1.259], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], [-3.176, -1.259], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 
0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (0.0816, 1.361], [-3.176, -1.259], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], [-3.176, -1.259], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], [-3.176, -1.259], (1.361, 2.956], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 
0.0816], [-3.176, -1.259], (0.0816, 1.361], [-3.176, -1.259], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], [-3.176, -1.259], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], [-3.176, -1.259], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], (-1.259, 0.0816], (1.361, 2.956], (0.0816, 1.361], [-3.176, -1.259], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], 
(-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], [-3.176, -1.259], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], [-3.176, -1.259], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 
0.0816], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], [-3.176, -1.259], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], [-3.176, -1.259], (-1.259, 0.0816], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], (1.361, 2.956], (-1.259, 0.0816], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (1.361, 2.956], [-3.176, -1.259], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (1.361, 2.956], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], 
(-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], (1.361, 2.956], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], (1.361, 2.956], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], (0.0816, 1.361], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], [-3.176, -1.259], [-3.176, -1.259], (-1.259, 0.0816], [-3.176, -1.259], (1.361, 2.956], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], (-1.259, 0.0816], (1.361, 2.956], (0.0816, 1.361], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], (0.0816, 1.361], [-3.176, -1.259], (0.0816, 1.361], (-1.259, 0.0816], (-1.259, 0.0816], [-3.176, -1.259], (0.0816, 1.361], (1.361, 2.956], (-1.259, 0.0816], (-1.259, 0.0816], (0.0816, 1.361], (1.361, 2.956], (0.0816, 1.361], (-1.259, 0.0816], [-3.176, -1.259], (-1.259, 0.0816], (1.361, 2.956], [-3.176, -1.259], (-1.259, 0.0816], (-1.259, 0.0816], (-1.259, 0.0816], (1.361, 2.956], (1.361, 2.956]] Levels (4): Index(['[-3.176, -1.259]', '(-1.259, 0.0816]', '(0.0816, 1.361]', '(1.361, 2.956]'], dtype=object)
pd.value_counts(cats2)
(0.0816, 1.361] 400 (-1.259, 0.0816] 400 [-3.176, -1.259] 100 (1.361, 2.956] 100 dtype: int64
np.random.seed?
np.random.seed(12345)
data = DataFrame(np.random.randn(1000, 4))
data.describe()
0 | 1 | 2 | 3 | |
---|---|---|---|---|
count | 1000.000000 | 1000.000000 | 1000.000000 | 1000.000000 |
mean | -0.067684 | 0.067924 | 0.025598 | -0.002298 |
std | 0.998035 | 0.992106 | 1.006835 | 0.996794 |
min | -3.428254 | -3.548824 | -3.184377 | -3.745356 |
25% | -0.774890 | -0.591841 | -0.641675 | -0.644144 |
50% | -0.116401 | 0.101143 | 0.002073 | -0.013611 |
75% | 0.616366 | 0.780282 | 0.680391 | 0.654328 |
max | 3.366626 | 2.653656 | 3.260383 | 3.927528 |
col = data[3]
col[np.abs(col) > 3]
97 3.927528 305 -3.399312 400 -3.745356 Name: 3, dtype: float64
data[np.abs(data) > 3]
<class 'pandas.core.frame.DataFrame'> Int64Index: 1000 entries, 0 to 999 Data columns (total 4 columns): 0 2 non-null values 1 1 non-null values 2 5 non-null values 3 3 non-null values dtypes: float64(4)
data[(np.abs(data) > 3)]
<class 'pandas.core.frame.DataFrame'> Int64Index: 1000 entries, 0 to 999 Data columns (total 4 columns): 0 2 non-null values 1 1 non-null values 2 5 non-null values 3 3 non-null values dtypes: float64(4)
# Select every row containing at least one value with |x| > 3:
# any(1) reduces the boolean frame along axis=1 (across the columns).
data[(np.abs(data) > 3).any(1)]
0 | 1 | 2 | 3 | |
---|---|---|---|---|
5 | -0.539741 | 0.476985 | 3.248944 | -1.021228 |
97 | -0.774363 | 0.552936 | 0.106061 | 3.927528 |
102 | -0.655054 | -0.565230 | 3.176873 | 0.959533 |
305 | -2.315555 | 0.457246 | -0.025907 | -3.399312 |
324 | 0.050188 | 1.951312 | 3.260383 | 0.963301 |
400 | 0.146326 | 0.508391 | -0.196713 | -3.745356 |
499 | -0.293333 | -0.242459 | -3.056990 | 1.918403 |
523 | -3.428254 | -0.296336 | -0.439938 | -0.867165 |
586 | 0.275144 | 1.179227 | -3.184377 | 1.369891 |
808 | -0.362528 | -3.548824 | 1.553205 | -2.186301 |
900 | 3.366626 | -2.372214 | 0.851010 | 1.332846 |
# np.sign extracts only the sign of each value: -1, 0, or 1.
np.sign(data[2])
0 -1 1 1 2 1 3 1 4 1 5 1 6 1 7 -1 8 -1 9 -1 10 -1 11 1 12 1 13 1 14 1 ... 985 1 986 1 987 1 988 -1 989 1 990 -1 991 1 992 -1 993 -1 994 1 995 -1 996 -1 997 1 998 -1 999 1 Name: 2, Length: 1000, dtype: float64
# Cap outliers: every value with |x| > 3 is replaced by +3 or -3,
# preserving the original sign via np.sign (elementwise -1/0/+1).
# Afterwards no row exceeds 3 in magnitude, as the >= 3 check below shows.
data[np.abs(data) > 3] = np.sign(data) * 3
data[(np.abs(data) >= 3).any(1)]
0 | 1 | 2 | 3 | |
---|---|---|---|---|
5 | -0.539741 | 0.476985 | 3.000000 | -1.021228 |
97 | -0.774363 | 0.552936 | 0.106061 | 3.000000 |
102 | -0.655054 | -0.565230 | 3.000000 | 0.959533 |
305 | -2.315555 | 0.457246 | -0.025907 | -3.000000 |
324 | 0.050188 | 1.951312 | 3.000000 | 0.963301 |
400 | 0.146326 | 0.508391 | -0.196713 | -3.000000 |
499 | -0.293333 | -0.242459 | -3.000000 | 1.918403 |
523 | -3.000000 | -0.296336 | -0.439938 | -0.867165 |
586 | 0.275144 | 1.179227 | -3.000000 | 1.369891 |
808 | -0.362528 | -3.000000 | 1.553205 | -2.186301 |
900 | 3.000000 | -2.372214 | 0.851010 | 1.332846 |
(np.abs(data) > 3).any?
Type: function
String form: <function any at 0x102729668>
File: /Library/Python/2.7/site-packages/numpy-1.9.0.dev_c50e60d-py2.7-macosx-10.8-x86_64.egg/numpy/core/fromnumeric.py
Definition: any(a, axis=None, out=None, keepdims=False)
Docstring:
Test whether any array element along a given axis evaluates to True.
Returns single boolean unless `axis` is not ``None``
Parameters
----------
a : array_like
Input array or object that can be converted to an array.
axis : None or int or tuple of ints, optional
Axis or axes along which a logical OR reduction is performed.
The default (`axis` = `None`) is perform a logical OR over all
the dimensions of the input array. `axis` may be negative, in
which case it counts from the last to the first axis.
.. versionadded:: 1.7.0
If this is a tuple of ints, a reduction is performed on multiple
axes, instead of a single axis or all the axes as before.
out : ndarray, optional
Alternate output array in which to place the result. It must have
the same shape as the expected output and its type is preserved
(e.g., if it is of type float, then it will remain so, returning
1.0 for True and 0.0 for False, regardless of the type of `a`).
See `doc.ufuncs` (Section "Output arguments") for details.
keepdims : bool, optional
If this is set to True, the axes which are reduced are left
in the result as dimensions with size one. With this option,
the result will broadcast correctly against the original `arr`.
Returns
-------
any : bool or ndarray
A new boolean or `ndarray` is returned unless `out` is specified,
in which case a reference to `out` is returned.
See Also
--------
ndarray.any : equivalent method
all : Test whether all elements along a given axis evaluate to True.
Notes
-----
Not a Number (NaN), positive infinity and negative infinity evaluate
to `True` because these are not equal to zero.
Examples
--------
>>> np.any([[True, False], [True, True]])
True
>>> np.any([[True, False], [False, False]], axis=0)
array([ True, False], dtype=bool)
>>> np.any([-1, 0, 5])
True
>>> np.any(np.nan)
True
>>> o=np.array([False])
>>> z=np.any([-1, 4, 5], out=o)
>>> z, o
(array([ True], dtype=bool), array([ True], dtype=bool))
>>> # Check now that z is a reference to o
>>> z is o
True
>>> id(z), id(o) # identity of z and o # doctest: +SKIP
(191614240, 191614240)
data[(np.abs(data) > 3).any()]
/Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/core/frame.py:1853: UserWarning: Boolean Series key will be reindexed to match DataFrame index. "DataFrame index.", UserWarning)
--------------------------------------------------------------------------- IndexingError Traceback (most recent call last) <ipython-input-480-4947ca346c8a> in <module>() ----> 1 data[(np.abs(data) > 3).any()] /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/core/frame.pyc in __getitem__(self, key) 1821 if isinstance(key, (Series, np.ndarray, list)): 1822 # either boolean or fancy integer index -> 1823 return self._getitem_array(key) 1824 elif isinstance(key, DataFrame): 1825 return self._getitem_frame(key) /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/core/frame.pyc in _getitem_array(self, key) 1857 # _check_bool_indexer will throw exception if Series key cannot 1858 # be reindexed to match DataFrame rows -> 1859 key = _check_bool_indexer(self.index, key) 1860 indexer = key.nonzero()[0] 1861 return self.take(indexer, axis=0, convert=False) /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/core/indexing.pyc in _check_bool_indexer(ax, key) 1213 mask = com.isnull(result.values) 1214 if mask.any(): -> 1215 raise IndexingError('Unalignable boolean Series key provided') 1216 1217 result = result.astype(bool).values IndexingError: Unalignable boolean Series key provided
data[(np.abs(data) > 3).any(0)]
--------------------------------------------------------------------------- IndexingError Traceback (most recent call last) <ipython-input-481-ccc73e6802f0> in <module>() ----> 1 data[(np.abs(data) > 3).any(0)] /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/core/frame.pyc in __getitem__(self, key) 1821 if isinstance(key, (Series, np.ndarray, list)): 1822 # either boolean or fancy integer index -> 1823 return self._getitem_array(key) 1824 elif isinstance(key, DataFrame): 1825 return self._getitem_frame(key) /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/core/frame.pyc in _getitem_array(self, key) 1857 # _check_bool_indexer will throw exception if Series key cannot 1858 # be reindexed to match DataFrame rows -> 1859 key = _check_bool_indexer(self.index, key) 1860 indexer = key.nonzero()[0] 1861 return self.take(indexer, axis=0, convert=False) /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/core/indexing.pyc in _check_bool_indexer(ax, key) 1213 mask = com.isnull(result.values) 1214 if mask.any(): -> 1215 raise IndexingError('Unalignable boolean Series key provided') 1216 1217 result = result.astype(bool).values IndexingError: Unalignable boolean Series key provided
data[(np.abs(data) > 3).any(2)]
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-485-d9988aeb394e> in <module>() ----> 1 data[(np.abs(data) > 3).any(2)] /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/core/frame.pyc in any(self, axis, bool_only, skipna, level) 4087 skipna=skipna) 4088 return self._reduce(nanops.nanany, axis=axis, skipna=skipna, -> 4089 numeric_only=bool_only, filter_type='bool') 4090 4091 def all(self, axis=0, bool_only=None, skipna=True, level=None): /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/core/frame.pyc in _reduce(self, op, axis, skipna, numeric_only, filter_type, **kwds) 4276 def _reduce(self, op, axis=0, skipna=True, numeric_only=None, 4277 filter_type=None, **kwds): -> 4278 axis = self._get_axis_number(axis) 4279 f = lambda x: op(x, axis=axis, skipna=skipna, **kwds) 4280 labels = self._get_agg_axis(axis) /Library/Python/2.7/site-packages/pandas-0.12.0_307_g3a2fe0b-py2.7-macosx-10.8-intel.egg/pandas/core/generic.pyc in _get_axis_number(self, axis) 220 except: 221 pass --> 222 raise ValueError('No axis named %s' % axis) 223 224 def _get_axis_name(self, axis): ValueError: No axis named 2
np.any([[True, False], [True, True]])
True
np.any([[True, False], [True, True]], axis=0)
array([ True, True], dtype=bool)
np.any([[True, False], [False, False]], axis=0)
array([ True, False], dtype=bool)
df = DataFrame(np.arange(5 * 4).reshape(5, 4))
sampler = np.random.permutation(5)
sampler
array([1, 0, 2, 3, 4])
df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
1 | 4 | 5 | 6 | 7 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
df.take(sampler)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
1 | 4 | 5 | 6 | 7 |
0 | 0 | 1 | 2 | 3 |
2 | 8 | 9 | 10 | 11 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
# 1. Pass the frame's length to get a random permutation of its row positions.
# 2. df.take reorders the rows by those positions.
# 3. The frame has 5 rows, but [:3] keeps only the first three shuffled
#    positions — i.e. a random sample of 3 rows without replacement.
df.take(np.random.permutation(len(df))[:3])
0 | 1 | 2 | 3 | |
---|---|---|---|---|
4 | 16 | 17 | 18 | 19 |
2 | 8 | 9 | 10 | 11 |
0 | 0 | 1 | 2 | 3 |
df.take(np.random.permutation(len(df)))
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 0 | 1 | 2 | 3 |
3 | 12 | 13 | 14 | 15 |
4 | 16 | 17 | 18 | 19 |
2 | 8 | 9 | 10 | 11 |
1 | 4 | 5 | 6 | 7 |
# The ordering changes on every call:
# 1. pass the frame's length to draw a fresh random permutation each time.
np.random.permutation(len(df))
array([0, 3, 2, 4, 1])
np.random.permutation(len(df))
array([4, 1, 3, 0, 2])
np.random.permutation(5)
array([0, 3, 4, 1, 2])
bag = np.array([5, 7, -1, 6, 4])
# 1. Draw 10 random integers uniformly in [0, len(bag)) — positions are
#    sampled WITH replacement, so the same index can repeat.
sampler = np.random.randint(0, len(bag), size=10)
sampler
array([1, 3, 2, 4, 2, 2, 1, 0, 0, 3])
draws = bag.take(sampler)
# take() gathers bag's elements at the positions listed in sampler:
# sampler[0] == 1 -> bag[1] == 7
# sampler[1] == 3 -> bag[3] == 6
draws
array([ 7, 6, -1, 4, -1, -1, 7, 5, 5, 6])
# Small frame for the get_dummies examples: 'key' holds the categorical
# values that will be expanded into indicator (dummy) columns below.
df = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
'data1': range(6)})
df
data1 | key | |
---|---|---|
0 | 0 | b |
1 | 1 | b |
2 | 2 | a |
3 | 3 | c |
4 | 4 | a |
5 | 5 | b |
# One indicator column per distinct value of df['key']: each row gets 1 in
# the column matching its key and 0 in all the others.
pd.get_dummies(df['key'])
a | b | c | |
---|---|---|---|
0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 1 |
4 | 1 | 0 | 0 |
5 | 0 | 1 | 0 |
df['key']
0 b 1 b 2 a 3 c 4 a 5 b Name: key, dtype: object
dummies = pd.get_dummies(df['key'], prefix='key')
dummies
key_a | key_b | key_c | |
---|---|---|---|
0 | 0 | 1 | 0 |
1 | 0 | 1 | 0 |
2 | 1 | 0 | 0 |
3 | 0 | 0 | 1 |
4 | 1 | 0 | 0 |
5 | 0 | 1 | 0 |
df_with_dummy = df[['data1']].join(dummies)
df_with_dummy
data1 | key_a | key_b | key_c | |
---|---|---|---|---|
0 | 0 | 0 | 1 | 0 |
1 | 1 | 0 | 1 | 0 |
2 | 2 | 1 | 0 | 0 |
3 | 3 | 0 | 0 | 1 |
4 | 4 | 1 | 0 | 0 |
5 | 5 | 0 | 1 | 0 |
mnames = ['movie_id', 'title', 'genres']
# '::'-separated MovieLens file with no header row; supply the column names.
movies = pd.read_table('ch07/movielens/movies.dat', sep='::',
header=None, names=mnames)
movies[:10]
movie_id | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Animation|Children's|Comedy |
1 | 2 | Jumanji (1995) | Adventure|Children's|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
5 | 6 | Heat (1995) | Action|Crime|Thriller |
6 | 7 | Sabrina (1995) | Comedy|Romance |
7 | 8 | Tom and Huck (1995) | Adventure|Children's |
8 | 9 | Sudden Death (1995) | Action |
9 | 10 | GoldenEye (1995) | Action|Adventure|Thriller |
# 이걸 Unique 하게 만들어야 한다.
movies.genres
0 Animation|Children's|Comedy 1 Adventure|Children's|Fantasy 2 Comedy|Romance 3 Comedy|Drama 4 Comedy 5 Action|Crime|Thriller 6 Comedy|Romance 7 Adventure|Children's 8 Action 9 Action|Adventure|Thriller 10 Comedy|Drama|Romance 11 Comedy|Horror 12 Animation|Children's 13 Drama 14 Action|Adventure|Romance ... 3868 Horror 3869 Horror 3870 Horror 3871 Horror 3872 Horror 3873 Comedy 3874 Comedy|Drama 3875 Adventure|Animation|Children's 3876 Action|Drama|Thriller 3877 Thriller 3878 Comedy 3879 Drama 3880 Drama 3881 Drama 3882 Drama|Thriller Name: genres, Length: 3883, dtype: object
# 1. Take each genre string x from movies.genres.
# 2. x.split('|') breaks it into the individual genre names.
# 3. set() removes any duplicates within one movie's list.
# 4. The result is a generator yielding one set of genres per movie.
genre_iter = (set(x.split('|')) for x in movies.genres)
genre_iter
<generator object <genexpr> at 0x10a0efaa0>
set.union?
Type: method_descriptor
String form: <method 'union' of 'set' objects>
Namespace: Python builtin
Docstring:
Return the union of sets as a new set.
(i.e. all elements that are in either set.)
# The * unpacks the generator so set.union receives every per-movie set as
# a separate positional argument and merges them into a single set of all
# genres seen anywhere in the data; sorted() then orders them alphabetically.
# NOTE: the generator is consumed here — it cannot be iterated a second time.
genres = sorted(set.union(*genre_iter))
genres
['Action', 'Adventure', 'Animation', "Children's", 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
dummies = DataFrame(np.zeros((len(movies), len(genres))), columns=genres)
dummies
<class 'pandas.core.frame.DataFrame'> Int64Index: 3883 entries, 0 to 3882 Data columns (total 18 columns): Action 3883 non-null values Adventure 3883 non-null values Animation 3883 non-null values Children's 3883 non-null values Comedy 3883 non-null values Crime 3883 non-null values Documentary 3883 non-null values Drama 3883 non-null values Fantasy 3883 non-null values Film-Noir 3883 non-null values Horror 3883 non-null values Musical 3883 non-null values Mystery 3883 non-null values Romance 3883 non-null values Sci-Fi 3883 non-null values Thriller 3883 non-null values War 3883 non-null values Western 3883 non-null values dtypes: float64(18)
dummies[:10]
Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
# Fill the indicator frame: for each movie, split its '|'-separated genre
# string and set those genre columns to 1 in the movie's row.
# FIX: .ix was deprecated in pandas 0.20 and removed in 1.0; .loc is the
# supported label-based indexer and behaves identically for this assignment
# (it is available since pandas 0.11, so this stays compatible here).
for i, gen in enumerate(movies.genres):
    dummies.loc[i, gen.split('|')] = 1
dummies[:10]
Action | Adventure | Animation | Children's | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
6 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
7 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
movies_windic = movies.join(dummies.add_prefix('Genre_'))
movies_windic.ix[0]
movie_id 1 title Toy Story (1995) genres Animation|Children's|Comedy Genre_Action 0 Genre_Adventure 0 Genre_Animation 1 Genre_Children's 1 Genre_Comedy 1 Genre_Crime 0 Genre_Documentary 0 Genre_Drama 0 Genre_Fantasy 0 Genre_Film-Noir 0 Genre_Horror 0 Genre_Musical 0 Genre_Mystery 0 Genre_Romance 0 Genre_Sci-Fi 0 Genre_Thriller 0 Genre_War 0 Genre_Western 0 Name: 0, dtype: object
movies_windic.ix[1]
movie_id 2 title Jumanji (1995) genres Adventure|Children's|Fantasy Genre_Action 0 Genre_Adventure 1 Genre_Animation 0 Genre_Children's 1 Genre_Comedy 0 Genre_Crime 0 Genre_Documentary 0 Genre_Drama 0 Genre_Fantasy 1 Genre_Film-Noir 0 Genre_Horror 0 Genre_Musical 0 Genre_Mystery 0 Genre_Romance 0 Genre_Sci-Fi 0 Genre_Thriller 0 Genre_War 0 Genre_Western 0 Name: 1, dtype: object
데이터가 이보다 더 크다면 다중 멤버십을 갖는 표시용 변수를 이 방법으로 생성하는 것은 그다지 빠른 방법은 아니다. 빠르게 처리하고 싶다면 DataFrame의 내부를 조작하는 저수준의 함수를 작성해서 사용해야 한다.
get_dummies나 cut 같은 이산 함수를 잘 조합하면 통계 애플리케이션에서 유용하게 사용
values = np.random.rand(10)
values
array([ 0.17582534, 0.18368382, 0.93694953, 0.108632 , 0.46920334, 0.77175794, 0.58726166, 0.38948574, 0.76265434, 0.19847269])
bins = [0, 0.2, 0.4, 0.6, 0.8, 1]
pd.get_dummies(pd.cut(values, bins))
(0, 0.2] | (0.2, 0.4] | (0.4, 0.6] | (0.6, 0.8] | (0.8, 1] | |
---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 1 |
3 | 1 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 1 | 0 | 0 |
5 | 0 | 0 | 0 | 1 | 0 |
6 | 0 | 0 | 1 | 0 | 0 |
7 | 0 | 1 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 1 | 0 |
9 | 1 | 0 | 0 | 0 | 0 |
pd.cut(values, bins)
Categorical: [(0, 0.2], (0, 0.2], (0.8, 1], (0, 0.2], (0.4, 0.6], (0.6, 0.8], (0.4, 0.6], (0.2, 0.4], (0.6, 0.8], (0, 0.2]] Levels (5): Index(['(0, 0.2]', '(0.2, 0.4]', '(0.4, 0.6]', '(0.6, 0.8]', '(0.8, 1]'], dtype=object)
val = 'a,b, guido'
# 쉼표로 구분된 문자열 분리
val.split(',')
['a', 'b', ' guido']
pieces = [x.strip() for x in val.split(',')]
pieces
['a', 'b', 'guido']
first, second, third = pieces
first
'a'
second
'b'
third
'guido'
first + '::' + second + '::' + third
'a::b::guido'
# '합칠문자열'.join(합칠리스트)
'::'.join(pieces)
'a::b::guido'
'guido' in val
True
val.index(',')
1
val.find(':')
-1
val
'a,b, guido'
val.index(':')
--------------------------------------------------------------------------- ValueError Traceback (most recent call last) <ipython-input-602-280f8b2856ce> in <module>() ----> 1 val.index(':') ValueError: substring not found
# count는 특정 부분 문자열이 몇 건 발견되었는지 반환
val.count(',')
2
val.replace(',', '::')
'a::b:: guido'
val.replace(',', '')
'ab guido'
인자 | 설명 |
---|---|
count | 문자열에서 겹치지 않는 부분 문자열의 개수를 반환 |
endswith, startswith | 문자열이 주어진 접미사(접두사)로 끝날 경우 True를 반환 |
join | 문자열을 구분자로 하여 다른 문자열을 순서대로 이어붙인다. |
index | 부분 문자열의 첫 번째 글자의 위치를 반환한다. 부분 문자열이 없다면 ValueError 예외가 발생한다. |
find | 첫 번째 부분 문자열의 첫 번째 글자의 위치를 반환한다. index와 유사하지만 부분 문자열이 없을 경우 -1을 반환한다. |
rfind | 마지막 부분 문자열의 첫 번째 글자의 위치를 반환한다. 부분 문자열이 없다면 -1을 반환한다. |
replace | 문자열을 다른 문자열로 치환한다. |
strip, rstrip, lstrip | 개행문자를 포함한 공백문자를 제거한다. lstrip은 문자열의 시작 부분에 있는 공백 문자만을 제거하고, rstrip은 문자열의 마지막 부분에 있는 공백문자만을 제거한다. |
split | 문자열을 구분자를 기준으로 부분 문자열의 리스트로 분리한다. |
lower, upper | 각각 알파벳 문자를 소문자 혹은 대문자로 변환한다. |
ljust, rjust | 문자열을 각각 왼쪽 혹은 오른쪽으로 정렬하고 주어진 길이에서 문자열의 길이를 제외한 곳은 공백문자를 채워 넣어 주어진 길이를 가지는 문자열을 반환한다. |
import re
text = 'foo bar\t baz \tqux'
text
'foo bar\t baz \tqux'
re.split('\s+', text)
['foo', 'bar', 'baz', 'qux']
regex = re.compile('\s+')
regex.split(text)
['foo', 'bar', 'baz', 'qux']
regex.findall(text)
[' ', '\t ', ' \t']
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
# Email pattern: local part, '@', domain, '.', then a 2-4 letter TLD.
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
# This compile step is omitted in the book's text
regex = re.compile(pattern, re.I) # re.IGNORECASE makes the regex case-insensitive.
regex.findall(text)
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
m = regex.search(text)
m
<_sre.SRE_Match at 0x109dbe308>
text[m.start():m.end()]
'dave@google.com'
# 이 문서의 처음은 Dave로 시작하기 때문에 regex와 일치하지 않음
print regex.match(text)
None
print regex.sub('REDACTED', text)
Dave REDACTED Steve REDACTED Rob REDACTED Ryan REDACTED
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
m = regex.match('wesm@bright.net')
type(m)
_sre.SRE_Match
m.groups()
('wesm', 'bright', 'net')
m.group(0)
'wesm@bright.net'
m.group(1)
'wesm'
m.group(2)
'bright'
m.group(3)
'net'
regex.findall(text)
[('dave', 'google', 'com'), ('steve', 'gmail', 'com'), ('rob', 'gmail', 'com'), ('ryan', 'yahoo', 'com')]
print regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text)
Dave Username: dave, Domain: google, Suffix: com Steve Username: steve, Domain: gmail, Suffix: com Rob Username: rob, Domain: gmail, Suffix: com Ryan Username: ryan, Domain: yahoo, Suffix: com
# Same email pattern, now with named capture groups (username/domain/suffix)
# so matches can be inspected via groupdict(); re.VERBOSE allows the
# multi-line, whitespace-laden pattern layout.
regex = re.compile(r'''
(?P<username>[A-Z0-9._%+-]+)
@
(?P<domain>[A-Z0-9.-]+)
\.
(?P<suffix>[A-Z]{2,4})''', flags=re.IGNORECASE | re.VERBOSE)
m = regex.match('wesm@bright.net')
m.groupdict()
{'domain': 'bright', 'suffix': 'net', 'username': 'wesm'}
인자 | 설명 |
---|---|
findall, finditer | 문자열에서 겹치지 않는, 발견된 모든 패턴을 반환한다. findall 메서드는 모든 패턴의 리스트를 반환하고, finditer 메서드는 이터레이터를 통해 하나씩 반환한다. |
match | 문자열의 시작점부터 패턴을 찾고 선택적으로 패턴 컴포넌트를 그룹으로 나눈다. 일치하는 패턴이 있다면 match 객체를 반환하고 그렇지 않을 경우 None을 반환한다. |
search | 문자열에서 패턴과 일치하는 내용을 검색하고 match 객체를 반환한다. match 메서드와는 달리 시작부터 일치하는 내용만 찾지 않고 문자열 어디든 일치하는 내용이 있다면 반환한다. |
split | 문자열에서 패턴과 일치하는 부분을 쪼갠다. |
sub, subn | 문자열에서 일치하는 모든 패턴(sub) 혹은 처음 n개의 패턴(subn)을 대체 표현으로 치환한다. 대체 표현 문자열은 \1, \2, ... 같은 기호를 사용해서 매치 그룹의 요소를 참조한다. |
# A small Series of email addresses keyed by first name, with one missing
# entry (Wes) to demonstrate how vectorized .str methods propagate NaN.
records = {'Dave': 'dave@google.com',
           'Steve': 'steve@gmail.com',
           'Rob': 'rob@gmail.com',
           'Wes': np.nan}
data = Series(records)
data
Dave dave@google.com Rob rob@gmail.com Steve steve@gmail.com Wes NaN dtype: object
data.isnull()
Dave False Rob False Steve False Wes True dtype: bool
data.str.contains('gmail')
Dave False Rob True Steve True Wes NaN dtype: object
pattern
'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\\.([A-Z]{2,4})'
data.str.findall(pattern, flags=re.IGNORECASE)
Dave [(dave, google, com)] Rob [(rob, gmail, com)] Steve [(steve, gmail, com)] Wes NaN dtype: object
matches = data.str.match(pattern, flags=re.IGNORECASE)
matches
Dave (dave, google, com) Rob (rob, gmail, com) Steve (steve, gmail, com) Wes NaN dtype: object
matches.str.get(1)
Dave google Rob gmail Steve gmail Wes NaN dtype: object
matches.str[0]
Dave dave Rob rob Steve steve Wes NaN dtype: object
matches.str.get(2)
Dave com Rob com Steve com Wes NaN dtype: object
matches.str.get(0)
Dave dave Rob rob Steve steve Wes NaN dtype: object
matches.str[1]
Dave google Rob gmail Steve gmail Wes NaN dtype: object
matches.str[2]
Dave com Rob com Steve com Wes NaN dtype: object
data.str[:5]
Dave dave@ Rob rob@g Steve steve Wes NaN dtype: object
메서드 | 설명 |
---|---|
cat | 선택적인 구분자와 함께 요소별로 문자열을 이어 붙인다. |
contains | 문자열이 패턴이나 정규표현식을 포함하는지를 나타내는 불리언 배열을 반환한다. |
count | 일치하는 패턴의 개수를 반환한다. |
endswith, startswith | 각 요소에 대해 x.endswith(pattern), x.startswith(pattern)와 동일한 동작을 한다. |
findall | 각 문자열에 대해 일치하는 패턴/정규표현식의 전체 목록을 구한다. |
get | i번째 요소를 반환한다. |
join | Series의 각 요소를 주어진 구분자로 연결한다. |
len | 각 문자열의 길이를 구한다. |
lower, upper | 대∙소문자로 변환한다. 각 요소에 대한 x.lower(), x.upper()와 같다. |
match | 주어진 정규표현식으로 각 요소에 대한 re.match를 수행하여 일치하는 그룹을 리스트로 반환한다. |
pad | 문자열의 좌, 우 혹은 양쪽에 공백을 추가한다. |
center | pad(side='both')와 동일하다. |
repeat | 값을 복사한다. 예를 들어 s.str.repeat(3)는 각 문자열에 대한 x*3과 동일하다. |
replace | 패턴/정규표현식과 일치하는 내용을 다른 문자열로 치환한다. |
slice | Series 안에 있는 각 문자열을 자른다. |
split | 정규표현식 혹은 구분자로 문자열을 나눈다. |
strip, rstrip, lstrip | 각 요소의 개행문자를 포함한 공백문자를 제거한다. |
{
"id": 21441,
"description": "KENTUCKY FRIED CHICKEN, Fried Chicken, EXTRA CRISPY, Wing, meat and skin with breading",
"tags": ["KFC"],
"manufacturer": "Kentucky Fried Chicken",
"group": "Fast Foods",
"portions": [
{
"amount": 1,
"unit": "wing, with skin",
"grams": 68.0
},
...
],
"nutrients": [
{
"value": 20.8,
"units": "g",
"description": "Protein",
"group": "Composition"
},
...
]
}
import json
db = json.load(open('ch07/foods-2011-10-03.json'))
len(db)
6636
db[0].keys()
[u'portions', u'description', u'tags', u'nutrients', u'group', u'id', u'manufacturer']
db[0]['nutrients'][0]
{u'description': u'Protein', u'group': u'Composition', u'units': u'g', u'value': 25.18}
db[0]['nutrients'][1]
{u'description': u'Total lipid (fat)', u'group': u'Composition', u'units': u'g', u'value': 29.2}
nutrients = DataFrame(db[0]['nutrients'])
nutrients[:7]
description | group | units | value | |
---|---|---|---|---|
0 | Protein | Composition | g | 25.18 |
1 | Total lipid (fat) | Composition | g | 29.20 |
2 | Carbohydrate, by difference | Composition | g | 3.06 |
3 | Ash | Other | g | 3.28 |
4 | Energy | Energy | kcal | 376.00 |
5 | Water | Composition | g | 39.28 |
6 | Energy | Energy | kJ | 1573.00 |
len(nutrients)
162
info_keys = ['description', 'group', 'id', 'manufacturer']
info = DataFrame(db, columns=info_keys)
info[:5]
description | group | id | manufacturer | |
---|---|---|---|---|
0 | Cheese, caraway | Dairy and Egg Products | 1008 | |
1 | Cheese, cheddar | Dairy and Egg Products | 1009 | |
2 | Cheese, edam | Dairy and Egg Products | 1018 | |
3 | Cheese, feta | Dairy and Egg Products | 1019 | |
4 | Cheese, mozzarella, part skim milk | Dairy and Egg Products | 1028 |
len(info)
6636
info
<class 'pandas.core.frame.DataFrame'> Int64Index: 6636 entries, 0 to 6635 Data columns (total 4 columns): description 6636 non-null values group 6636 non-null values id 6636 non-null values manufacturer 5195 non-null values dtypes: int64(1), object(3)
pd.value_counts(info.group)[:10]
Vegetables and Vegetable Products 812 Beef Products 618 Baked Products 496 Breakfast Cereals 403 Legumes and Legume Products 365 Fast Foods 365 Lamb, Veal, and Game Products 345 Sweets 341 Fruits and Fruit Juices 328 Pork Products 328 dtype: int64
# Flatten the per-food nutrient lists into one tall DataFrame.  Each
# record's nutrient rows get tagged with the parent food's id so they can
# later be merged back against the food metadata table.
pieces = []
for record in db:
    piece = DataFrame(record['nutrients'])
    piece['id'] = record['id']
    pieces.append(piece)
nutrients = pd.concat(pieces, ignore_index=True)
db[0]['nutrients']
[{u'description': u'Protein', u'group': u'Composition', u'units': u'g', u'value': 25.18}, {u'description': u'Total lipid (fat)', u'group': u'Composition', u'units': u'g', u'value': 29.2}, {u'description': u'Carbohydrate, by difference', u'group': u'Composition', u'units': u'g', u'value': 3.06}, {u'description': u'Ash', u'group': u'Other', u'units': u'g', u'value': 3.28}, {u'description': u'Energy', u'group': u'Energy', u'units': u'kcal', u'value': 376.0}, {u'description': u'Water', u'group': u'Composition', u'units': u'g', u'value': 39.28}, {u'description': u'Energy', u'group': u'Energy', u'units': u'kJ', u'value': 1573.0}, {u'description': u'Fiber, total dietary', u'group': u'Composition', u'units': u'g', u'value': 0.0}, {u'description': u'Calcium, Ca', u'group': u'Elements', u'units': u'mg', u'value': 673.0}, {u'description': u'Iron, Fe', u'group': u'Elements', u'units': u'mg', u'value': 0.64}, {u'description': u'Magnesium, Mg', u'group': u'Elements', u'units': u'mg', u'value': 22.0}, {u'description': u'Phosphorus, P', u'group': u'Elements', u'units': u'mg', u'value': 490.0}, {u'description': u'Potassium, K', u'group': u'Elements', u'units': u'mg', u'value': 93.0}, {u'description': u'Sodium, Na', u'group': u'Elements', u'units': u'mg', u'value': 690.0}, {u'description': u'Zinc, Zn', u'group': u'Elements', u'units': u'mg', u'value': 2.94}, {u'description': u'Copper, Cu', u'group': u'Elements', u'units': u'mg', u'value': 0.024}, {u'description': u'Manganese, Mn', u'group': u'Elements', u'units': u'mg', u'value': 0.021}, {u'description': u'Selenium, Se', u'group': u'Elements', u'units': u'mcg', u'value': 14.5}, {u'description': u'Vitamin A, IU', u'group': u'Vitamins', u'units': u'IU', u'value': 1054.0}, {u'description': u'Retinol', u'group': u'Vitamins', u'units': u'mcg', u'value': 262.0}, {u'description': u'Vitamin A, RAE', u'group': u'Vitamins', u'units': u'mcg_RAE', u'value': 271.0}, {u'description': u'Vitamin C, total ascorbic acid', u'group': u'Vitamins', 
u'units': u'mg', u'value': 0.0}, {u'description': u'Thiamin', u'group': u'Vitamins', u'units': u'mg', u'value': 0.031}, {u'description': u'Riboflavin', u'group': u'Vitamins', u'units': u'mg', u'value': 0.45}, {u'description': u'Niacin', u'group': u'Vitamins', u'units': u'mg', u'value': 0.18}, {u'description': u'Pantothenic acid', u'group': u'Vitamins', u'units': u'mg', u'value': 0.19}, {u'description': u'Vitamin B-6', u'group': u'Vitamins', u'units': u'mg', u'value': 0.074}, {u'description': u'Folate, total', u'group': u'Vitamins', u'units': u'mcg', u'value': 18.0}, {u'description': u'Vitamin B-12', u'group': u'Vitamins', u'units': u'mcg', u'value': 0.27}, {u'description': u'Folic acid', u'group': u'Vitamins', u'units': u'mcg', u'value': 0.0}, {u'description': u'Folate, food', u'group': u'Vitamins', u'units': u'mcg', u'value': 18.0}, {u'description': u'Folate, DFE', u'group': u'Vitamins', u'units': u'mcg_DFE', u'value': 18.0}, {u'description': u'Cholesterol', u'group': u'Other', u'units': u'mg', u'value': 93.0}, {u'description': u'Fatty acids, total saturated', u'group': u'Other', u'units': u'g', u'value': 18.584}, {u'description': u'Fatty acids, total monounsaturated', u'group': u'Other', u'units': u'g', u'value': 8.275}, {u'description': u'Fatty acids, total polyunsaturated', u'group': u'Other', u'units': u'g', u'value': 0.83}, {u'description': u'Tryptophan', u'group': u'Amino Acids', u'units': u'g', u'value': 0.324}, {u'description': u'Threonine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.896}, {u'description': u'Isoleucine', u'group': u'Amino Acids', u'units': u'g', u'value': 1.563}, {u'description': u'Leucine', u'group': u'Amino Acids', u'units': u'g', u'value': 2.412}, {u'description': u'Lysine', u'group': u'Amino Acids', u'units': u'g', u'value': 2.095}, {u'description': u'Methionine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.659}, {u'description': u'Cystine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.126}, 
{u'description': u'Phenylalanine', u'group': u'Amino Acids', u'units': u'g', u'value': 1.326}, {u'description': u'Tyrosine', u'group': u'Amino Acids', u'units': u'g', u'value': 1.216}, {u'description': u'Valine', u'group': u'Amino Acids', u'units': u'g', u'value': 1.682}, {u'description': u'Arginine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.952}, {u'description': u'Histidine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.884}, {u'description': u'Alanine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.711}, {u'description': u'Aspartic acid', u'group': u'Amino Acids', u'units': u'g', u'value': 1.618}, {u'description': u'Glutamic acid', u'group': u'Amino Acids', u'units': u'g', u'value': 6.16}, {u'description': u'Glycine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.439}, {u'description': u'Proline', u'group': u'Amino Acids', u'units': u'g', u'value': 2.838}, {u'description': u'Serine', u'group': u'Amino Acids', u'units': u'g', u'value': 1.472}, {u'description': u'Protein', u'group': u'Composition', u'units': u'g', u'value': 25.18}, {u'description': u'Total lipid (fat)', u'group': u'Composition', u'units': u'g', u'value': 29.2}, {u'description': u'Carbohydrate, by difference', u'group': u'Composition', u'units': u'g', u'value': 3.06}, {u'description': u'Ash', u'group': u'Other', u'units': u'g', u'value': 3.28}, {u'description': u'Energy', u'group': u'Energy', u'units': u'kcal', u'value': 376.0}, {u'description': u'Water', u'group': u'Composition', u'units': u'g', u'value': 39.28}, {u'description': u'Energy', u'group': u'Energy', u'units': u'kJ', u'value': 1573.0}, {u'description': u'Fiber, total dietary', u'group': u'Composition', u'units': u'g', u'value': 0.0}, {u'description': u'Calcium, Ca', u'group': u'Elements', u'units': u'mg', u'value': 673.0}, {u'description': u'Iron, Fe', u'group': u'Elements', u'units': u'mg', u'value': 0.64}, {u'description': u'Magnesium, Mg', u'group': u'Elements', u'units': u'mg', u'value': 22.0}, 
{u'description': u'Phosphorus, P', u'group': u'Elements', u'units': u'mg', u'value': 490.0}, {u'description': u'Potassium, K', u'group': u'Elements', u'units': u'mg', u'value': 93.0}, {u'description': u'Sodium, Na', u'group': u'Elements', u'units': u'mg', u'value': 690.0}, {u'description': u'Zinc, Zn', u'group': u'Elements', u'units': u'mg', u'value': 2.94}, {u'description': u'Copper, Cu', u'group': u'Elements', u'units': u'mg', u'value': 0.024}, {u'description': u'Manganese, Mn', u'group': u'Elements', u'units': u'mg', u'value': 0.021}, {u'description': u'Selenium, Se', u'group': u'Elements', u'units': u'mcg', u'value': 14.5}, {u'description': u'Vitamin A, IU', u'group': u'Vitamins', u'units': u'IU', u'value': 1054.0}, {u'description': u'Retinol', u'group': u'Vitamins', u'units': u'mcg', u'value': 262.0}, {u'description': u'Vitamin A, RAE', u'group': u'Vitamins', u'units': u'mcg_RAE', u'value': 271.0}, {u'description': u'Vitamin C, total ascorbic acid', u'group': u'Vitamins', u'units': u'mg', u'value': 0.0}, {u'description': u'Thiamin', u'group': u'Vitamins', u'units': u'mg', u'value': 0.031}, {u'description': u'Riboflavin', u'group': u'Vitamins', u'units': u'mg', u'value': 0.45}, {u'description': u'Niacin', u'group': u'Vitamins', u'units': u'mg', u'value': 0.18}, {u'description': u'Pantothenic acid', u'group': u'Vitamins', u'units': u'mg', u'value': 0.19}, {u'description': u'Vitamin B-6', u'group': u'Vitamins', u'units': u'mg', u'value': 0.074}, {u'description': u'Folate, total', u'group': u'Vitamins', u'units': u'mcg', u'value': 18.0}, {u'description': u'Vitamin B-12', u'group': u'Vitamins', u'units': u'mcg', u'value': 0.27}, {u'description': u'Folic acid', u'group': u'Vitamins', u'units': u'mcg', u'value': 0.0}, {u'description': u'Folate, food', u'group': u'Vitamins', u'units': u'mcg', u'value': 18.0}, {u'description': u'Folate, DFE', u'group': u'Vitamins', u'units': u'mcg_DFE', u'value': 18.0}, {u'description': u'Tryptophan', u'group': u'Amino Acids', 
u'units': u'g', u'value': 0.324}, {u'description': u'Threonine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.896}, {u'description': u'Isoleucine', u'group': u'Amino Acids', u'units': u'g', u'value': 1.563}, {u'description': u'Leucine', u'group': u'Amino Acids', u'units': u'g', u'value': 2.412}, {u'description': u'Lysine', u'group': u'Amino Acids', u'units': u'g', u'value': 2.095}, {u'description': u'Methionine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.659}, {u'description': u'Cystine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.126}, {u'description': u'Phenylalanine', u'group': u'Amino Acids', u'units': u'g', u'value': 1.326}, {u'description': u'Tyrosine', u'group': u'Amino Acids', u'units': u'g', u'value': 1.216}, {u'description': u'Valine', u'group': u'Amino Acids', u'units': u'g', u'value': 1.682}, {u'description': u'Arginine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.952}, {u'description': u'Histidine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.884}, {u'description': u'Alanine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.711}, {u'description': u'Aspartic acid', u'group': u'Amino Acids', u'units': u'g', u'value': 1.618}, {u'description': u'Glutamic acid', u'group': u'Amino Acids', u'units': u'g', u'value': 6.16}, {u'description': u'Glycine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.439}, {u'description': u'Proline', u'group': u'Amino Acids', u'units': u'g', u'value': 2.838}, {u'description': u'Serine', u'group': u'Amino Acids', u'units': u'g', u'value': 1.472}, {u'description': u'Cholesterol', u'group': u'Other', u'units': u'mg', u'value': 93.0}, {u'description': u'Fatty acids, total saturated', u'group': u'Other', u'units': u'g', u'value': 18.584}, {u'description': u'Fatty acids, total monounsaturated', u'group': u'Other', u'units': u'g', u'value': 8.275}, {u'description': u'Fatty acids, total polyunsaturated', u'group': u'Other', u'units': u'g', u'value': 0.83}, 
{u'description': u'Protein', u'group': u'Composition', u'units': u'g', u'value': 25.18}, {u'description': u'Total lipid (fat)', u'group': u'Composition', u'units': u'g', u'value': 29.2}, {u'description': u'Carbohydrate, by difference', u'group': u'Composition', u'units': u'g', u'value': 3.06}, {u'description': u'Ash', u'group': u'Other', u'units': u'g', u'value': 3.28}, {u'description': u'Energy', u'group': u'Energy', u'units': u'kcal', u'value': 376.0}, {u'description': u'Water', u'group': u'Composition', u'units': u'g', u'value': 39.28}, {u'description': u'Energy', u'group': u'Energy', u'units': u'kJ', u'value': 1573.0}, {u'description': u'Fiber, total dietary', u'group': u'Composition', u'units': u'g', u'value': 0.0}, {u'description': u'Calcium, Ca', u'group': u'Elements', u'units': u'mg', u'value': 673.0}, {u'description': u'Iron, Fe', u'group': u'Elements', u'units': u'mg', u'value': 0.64}, {u'description': u'Magnesium, Mg', u'group': u'Elements', u'units': u'mg', u'value': 22.0}, {u'description': u'Phosphorus, P', u'group': u'Elements', u'units': u'mg', u'value': 490.0}, {u'description': u'Potassium, K', u'group': u'Elements', u'units': u'mg', u'value': 93.0}, {u'description': u'Sodium, Na', u'group': u'Elements', u'units': u'mg', u'value': 690.0}, {u'description': u'Zinc, Zn', u'group': u'Elements', u'units': u'mg', u'value': 2.94}, {u'description': u'Copper, Cu', u'group': u'Elements', u'units': u'mg', u'value': 0.024}, {u'description': u'Manganese, Mn', u'group': u'Elements', u'units': u'mg', u'value': 0.021}, {u'description': u'Selenium, Se', u'group': u'Elements', u'units': u'mcg', u'value': 14.5}, {u'description': u'Vitamin A, IU', u'group': u'Vitamins', u'units': u'IU', u'value': 1054.0}, {u'description': u'Retinol', u'group': u'Vitamins', u'units': u'mcg', u'value': 262.0}, {u'description': u'Vitamin A, RAE', u'group': u'Vitamins', u'units': u'mcg_RAE', u'value': 271.0}, {u'description': u'Vitamin C, total ascorbic acid', u'group': u'Vitamins', 
u'units': u'mg', u'value': 0.0}, {u'description': u'Thiamin', u'group': u'Vitamins', u'units': u'mg', u'value': 0.031}, {u'description': u'Riboflavin', u'group': u'Vitamins', u'units': u'mg', u'value': 0.45}, {u'description': u'Niacin', u'group': u'Vitamins', u'units': u'mg', u'value': 0.18}, {u'description': u'Pantothenic acid', u'group': u'Vitamins', u'units': u'mg', u'value': 0.19}, {u'description': u'Vitamin B-6', u'group': u'Vitamins', u'units': u'mg', u'value': 0.074}, {u'description': u'Folate, total', u'group': u'Vitamins', u'units': u'mcg', u'value': 18.0}, {u'description': u'Vitamin B-12', u'group': u'Vitamins', u'units': u'mcg', u'value': 0.27}, {u'description': u'Folic acid', u'group': u'Vitamins', u'units': u'mcg', u'value': 0.0}, {u'description': u'Folate, food', u'group': u'Vitamins', u'units': u'mcg', u'value': 18.0}, {u'description': u'Folate, DFE', u'group': u'Vitamins', u'units': u'mcg_DFE', u'value': 18.0}, {u'description': u'Tryptophan', u'group': u'Amino Acids', u'units': u'g', u'value': 0.324}, {u'description': u'Threonine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.896}, {u'description': u'Isoleucine', u'group': u'Amino Acids', u'units': u'g', u'value': 1.563}, {u'description': u'Leucine', u'group': u'Amino Acids', u'units': u'g', u'value': 2.412}, {u'description': u'Lysine', u'group': u'Amino Acids', u'units': u'g', u'value': 2.095}, {u'description': u'Methionine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.659}, {u'description': u'Cystine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.126}, {u'description': u'Phenylalanine', u'group': u'Amino Acids', u'units': u'g', u'value': 1.326}, {u'description': u'Tyrosine', u'group': u'Amino Acids', u'units': u'g', u'value': 1.216}, {u'description': u'Valine', u'group': u'Amino Acids', u'units': u'g', u'value': 1.682}, {u'description': u'Arginine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.952}, {u'description': u'Histidine', u'group': u'Amino Acids', 
u'units': u'g', u'value': 0.884}, {u'description': u'Alanine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.711}, {u'description': u'Aspartic acid', u'group': u'Amino Acids', u'units': u'g', u'value': 1.618}, {u'description': u'Glutamic acid', u'group': u'Amino Acids', u'units': u'g', u'value': 6.16}, {u'description': u'Glycine', u'group': u'Amino Acids', u'units': u'g', u'value': 0.439}, {u'description': u'Proline', u'group': u'Amino Acids', u'units': u'g', u'value': 2.838}, {u'description': u'Serine', u'group': u'Amino Acids', u'units': u'g', u'value': 1.472}, {u'description': u'Cholesterol', u'group': u'Other', u'units': u'mg', u'value': 93.0}, {u'description': u'Fatty acids, total saturated', u'group': u'Other', u'units': u'g', u'value': 18.584}, {u'description': u'Fatty acids, total monounsaturated', u'group': u'Other', u'units': u'g', u'value': 8.275}, {u'description': u'Fatty acids, total polyunsaturated', u'group': u'Other', u'units': u'g', u'value': 0.83}]
nutrients
<class 'pandas.core.frame.DataFrame'> Int64Index: 389355 entries, 0 to 389354 Data columns (total 5 columns): description 389355 non-null values group 389355 non-null values units 389355 non-null values value 389355 non-null values id 389355 non-null values dtypes: float64(1), int64(1), object(3)
nutrients[:5]
description | group | units | value | id | |
---|---|---|---|---|---|
0 | Protein | Composition | g | 25.18 | 1008 |
1 | Total lipid (fat) | Composition | g | 29.20 | 1008 |
2 | Carbohydrate, by difference | Composition | g | 3.06 | 1008 |
3 | Ash | Other | g | 3.28 | 1008 |
4 | Energy | Energy | kcal | 376.00 | 1008 |
nutrients.duplicated().sum()
14179
nutrients = nutrients.drop_duplicates()
nutrients.duplicated().sum()
0
nutrients
<class 'pandas.core.frame.DataFrame'> Int64Index: 375176 entries, 0 to 389354 Data columns (total 5 columns): description 375176 non-null values group 375176 non-null values units 375176 non-null values value 375176 non-null values id 375176 non-null values dtypes: float64(1), int64(1), object(3)
# Rename the generic 'description'/'group' columns to food-specific names
# so they won't collide with the nutrient table's columns after merging.
col_mapping = {'description': 'food', 'group': 'fgroup'}
info = info.rename(columns=col_mapping, copy=False)
info
<class 'pandas.core.frame.DataFrame'> Int64Index: 6636 entries, 0 to 6635 Data columns (total 4 columns): food 6636 non-null values fgroup 6636 non-null values id 6636 non-null values manufacturer 5195 non-null values dtypes: int64(1), object(3)
# Same renaming trick for the nutrient table: give it distinct column
# names ahead of the merge with the food metadata.
col_mapping = {'description': 'nutrient', 'group': 'nutgroup'}
nutrients = nutrients.rename(columns=col_mapping, copy=False)
nutrients
<class 'pandas.core.frame.DataFrame'> Int64Index: 375176 entries, 0 to 389354 Data columns (total 5 columns): nutrient 375176 non-null values nutgroup 375176 non-null values units 375176 non-null values value 375176 non-null values id 375176 non-null values dtypes: float64(1), int64(1), object(3)
ndata = pd.merge(nutrients, info, on='id', how='outer')
ndata
<class 'pandas.core.frame.DataFrame'> Int64Index: 375176 entries, 0 to 375175 Data columns (total 8 columns): nutrient 375176 non-null values nutgroup 375176 non-null values units 375176 non-null values value 375176 non-null values id 375176 non-null values food 375176 non-null values fgroup 375176 non-null values manufacturer 293054 non-null values dtypes: float64(1), int64(1), object(6)
ndata[:5]
nutrient | nutgroup | units | value | id | food | fgroup | manufacturer | |
---|---|---|---|---|---|---|---|---|
0 | Protein | Composition | g | 25.18 | 1008 | Cheese, caraway | Dairy and Egg Products | |
1 | Total lipid (fat) | Composition | g | 29.20 | 1008 | Cheese, caraway | Dairy and Egg Products | |
2 | Carbohydrate, by difference | Composition | g | 3.06 | 1008 | Cheese, caraway | Dairy and Egg Products | |
3 | Ash | Other | g | 3.28 | 1008 | Cheese, caraway | Dairy and Egg Products | |
4 | Energy | Energy | kcal | 376.00 | 1008 | Cheese, caraway | Dairy and Egg Products |
ndata.ix[30000]
nutrient Glycine nutgroup Amino Acids units g value 0.04 id 6158 food Soup, tomato bisque, canned, condensed fgroup Soups, Sauces, and Gravies manufacturer Name: 30000, dtype: object
result = ndata.groupby(['nutrient', 'fgroup'])['value'].quantile(0.5)
result
nutrient fgroup Adjusted Protein Sweets 12.900 Vegetables and Vegetable Products 2.180 Alanine Baby Foods 0.085 Baked Products 0.248 Beef Products 1.550 Beverages 0.003 Breakfast Cereals 0.311 Cereal Grains and Pasta 0.373 Dairy and Egg Products 0.271 Ethnic Foods 1.290 Fast Foods 0.514 Fats and Oils 0.000 Finfish and Shellfish Products 1.218 Fruits and Fruit Juices 0.027 Lamb, Veal, and Game Products 1.408 ... Zinc, Zn Finfish and Shellfish Products 0.67 Fruits and Fruit Juices 0.10 Lamb, Veal, and Game Products 3.94 Legumes and Legume Products 1.14 Meals, Entrees, and Sidedishes 0.63 Nut and Seed Products 3.29 Pork Products 2.32 Poultry Products 2.50 Restaurant Foods 0.80 Sausages and Luncheon Meats 2.13 Snacks 1.47 Soups, Sauces, and Gravies 0.20 Spices and Herbs 2.75 Sweets 0.36 Vegetables and Vegetable Products 0.33 Length: 2246, dtype: float64
result['Zinc, Zn'].order().plot(kind='barh')
<matplotlib.axes._subplots.AxesSubplot at 0x122f41310>
# For every (nutrient group, nutrient) pair, pull out the single food row
# carrying the extreme 'value'.
by_nutrient = ndata.groupby(['nutgroup', 'nutrient'])

# PEP 8 (E731): named extractors are plain functions, not lambdas bound to
# names — easier to document and to read in tracebacks.
def get_maximum(x):
    """Return the row of group *x* with the largest 'value'."""
    return x.xs(x.value.idxmax())

def get_minimum(x):
    """Return the row of group *x* with the smallest 'value'."""
    return x.xs(x.value.idxmin())

max_foods = by_nutrient.apply(get_maximum)[['value', 'food']]
# Truncate long food descriptions so the table prints compactly.
max_foods.food = max_foods.food.str[:50]
max_foods
<class 'pandas.core.frame.DataFrame'> MultiIndex: 94 entries, (Amino Acids, Alanine) to (Vitamins, Vitamin K (phylloquinone)) Data columns (total 2 columns): value 94 non-null values food 94 non-null values dtypes: float64(1), object(1)
max_foods.ix['Amino Acids']['food']
nutrient Alanine Gelatins, dry powder, unsweetened Arginine Seeds, sesame flour, low-fat Aspartic acid Soy protein isolate Cystine Seeds, cottonseed flour, low fat (glandless) Glutamic acid Soy protein isolate Glycine Gelatins, dry powder, unsweetened Histidine Whale, beluga, meat, dried (Alaska Native) Hydroxyproline KENTUCKY FRIED CHICKEN, Fried Chicken, ORIGINA... Isoleucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE... Leucine Soy protein isolate, PROTEIN TECHNOLOGIES INTE... Lysine Seal, bearded (Oogruk), meat, dried (Alaska Na... Methionine Fish, cod, Atlantic, dried and salted Phenylalanine Soy protein isolate, PROTEIN TECHNOLOGIES INTE... Proline Gelatins, dry powder, unsweetened Serine Soy protein isolate, PROTEIN TECHNOLOGIES INTE... Threonine Soy protein isolate, PROTEIN TECHNOLOGIES INTE... Tryptophan Sea lion, Steller, meat with fat (Alaska Native) Tyrosine Soy protein isolate, PROTEIN TECHNOLOGIES INTE... Valine Soy protein isolate, PROTEIN TECHNOLOGIES INTE... Name: food, dtype: object