#!/usr/bin/env python
# coding: utf-8

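# # CHAPTER 12 Advanced pandas
# 
# # 12.1 Categorical Data
# 
# This section introduces the pandas Categorical type.
# 
# # 1 Background and Motivation
# 
# A column in a table often contains repeated instances of a small set of distinct values. We can use unique and value_counts to extract the distinct values from an array and compute their frequencies: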

# In[1]:


import numpy as np
import pandas as pd


# In[2]:


values = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)
values


# In[3]:


pd.unique(values)


# In[4]:


values.value_counts()


# For data with many repeated values, a better approach is to store the distinct values in a dimension table and use integer keys that refer to the rows of that table:

# In[6]:


values = pd.Series([0, 1, 0, 0] * 2)
values


# In[7]:


dim = pd.Series(['apple', 'orange'])
dim


# Use the take method to restore the original Series of strings:

# In[8]:


dim.take(values)
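

# The integer keys and the dimension table above were typed by hand. As a
# minimal sketch, the same dictionary-encoded form can be derived from the raw
# strings with pd.factorize. The names raw, codes, uniques and restored are
# illustrative and not part of the original example.

# In[ ]:


import pandas as pd

raw = pd.Series(['apple', 'orange', 'apple', 'apple'] * 2)

# factorize returns integer codes plus the distinct values in order of appearance
codes, uniques = pd.factorize(raw)

# take rebuilds the original strings from the dimension table and the codes
restored = pd.Series(uniques).take(codes).reset_index(drop=True)
restored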


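# This integer-based representation is called the categorical or dictionary-encoded representation. The array of distinct values can be called the categories, the dictionary, or the levels of the data. In this book we use the terms categorical and categories. The integer values that reference the categories are called category codes, or simply codes.
# 
# # 2 Categorical Type in pandas
# 
# pandas has a Categorical type for holding data that uses this integer-based categorical representation. Consider the following example: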

# In[9]:


fruits = ['apple', 'orange', 'apple', 'apple'] * 2


# In[10]:


N = len(fruits)


# In[12]:


df = pd.DataFrame({'fruit': fruits,
                   'basket_id': np.arange(N),
                   'count': np.random.randint(3, 15, size=N),
                   'weight': np.random.uniform(0, 4, size=N)},
                  columns=['basket_id', 'fruit', 'count', 'weight'])
df


# Here, df['fruit'] is an array of Python string objects. We can convert it to categorical:

# In[13]:


fruits_cat = df['fruit'].astype('category')
fruits_cat


# The values of fruits_cat are not a NumPy array but an instance of pandas.Categorical:

# In[16]:


c = fruits_cat.values
type(c)


# The Categorical object has categories and codes attributes:

# In[17]:


c.categories


# In[18]:


c.codes
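

# The categories and the codes together fully describe the data: indexing the
# categories with the codes gives back the original values. A small sketch,
# using a hypothetical standalone categorical named fruit_cat:

# In[ ]:


import pandas as pd

fruit_cat = pd.Categorical(['apple', 'orange', 'apple', 'apple'] * 2)

# categories is an Index of distinct values; codes holds one integer per element
decoded = fruit_cat.categories.take(fruit_cat.codes)
list(decoded)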


# You can convert a DataFrame column to categorical by assigning the converted result back to it:

# In[19]:


df['fruit'] = df['fruit'].astype('category')
df.fruit


# You can also create a pandas.Categorical directly from other kinds of Python sequences:

# In[20]:


my_categories = pd.Categorical(['foo', 'bar', 'baz', 'foo', 'bar'])
my_categories


# If you already have categorical encoded data from another source, you can use the from_codes constructor:

# In[21]:


categories = ['foo', 'bar', 'baz']


# In[22]:


codes = [0, 1, 2, 0, 0, 1]


# In[23]:


my_cats_2 = pd.Categorical.from_codes(codes, categories)
my_cats_2


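# Unless explicitly specified, categories are assumed to have no particular ordering, so the categories array may come out in a different order depending on the input data. When using from_codes or any of the other constructors, you can indicate that the categories have a meaningful ordering: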

# In[24]:


ordered_cat = pd.Categorical.from_codes(codes, categories, 
                                        ordered=True)
ordered_cat


# In the output, `[foo < bar < baz]` indicates that foo precedes bar in the ordering, and so on. An unordered categorical instance can be made ordered with as_ordered:

# In[25]:


my_cats_2.as_ordered()
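

# Once a categorical is ordered, comparisons, min/max and sorting follow the
# category order rather than alphabetical order. A minimal sketch; the sizes
# data and its low/medium/high categories are made up for illustration.

# In[ ]:


import pandas as pd

sizes = pd.Categorical(['medium', 'low', 'high', 'low'],
                       categories=['low', 'medium', 'high'],
                       ordered=True)

smallest = sizes.min()             # smallest value according to the category order
above_low = sizes > 'low'          # elementwise comparison against a category
sorted_sizes = pd.Series(sizes).sort_values()  # sorts by category order
sorted_sizes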


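# As a last note, categorical data need not be strings: the categories can consist of any immutable value types.
# 
# # 3 Computations with Categoricals
# 
# A Categorical generally behaves the same way as the non-encoded version (such as an array of strings). Some parts of pandas, such as the groupby function, perform better when working with categoricals, and some functions can also make use of the ordered flag.
# 
# Consider some random numeric data, binned with pandas.qcut. The result is a pandas.Categorical; we used pandas.cut earlier in the book but glossed over the details of how categoricals work: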

# In[26]:


np.random.seed(12345)


# In[27]:


draws = np.random.randn(1000)


# In[28]:


draws[:5]


# Compute a quartile binning of this data:

# In[29]:


bins = pd.qcut(draws, 4)
bins


# The exact sample quartiles are less readable than quartile names, so we pass labels to qcut:

# In[30]:


bins = pd.qcut(draws, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
bins


# In[31]:


bins.codes[:10]


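# The labeled bins categorical does not contain information about the bin edges in the data, so we can use groupby to extract some summary statistics: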

# In[32]:


bins = pd.Series(bins, name='quartile')


# In[34]:


results = (pd.Series(draws)
           .groupby(bins)
           .agg(['count', 'min', 'max'])
           .reset_index())
results
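

# Besides groupby, the bin edges can be returned by qcut itself through its
# retbins argument. A minimal sketch on a separate random sample; the seed and
# the names sample, labeled and edges are assumptions made for illustration.

# In[ ]:


import numpy as np
import pandas as pd

rng = np.random.default_rng(0)       # hypothetical seed, unrelated to the one above
sample = rng.standard_normal(1000)

# With retbins=True, qcut returns the categorical together with the bin edges
labeled, edges = pd.qcut(sample, 4, labels=['Q1', 'Q2', 'Q3', 'Q4'],
                         retbins=True)
edges                                # 5 edges delimit the 4 quartile bins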


# The quartile column in the result retains the original categorical information from bins, including the ordering:

# In[35]:


results['quartile']


# ### Better performance with categoricals
# 
# Converting to categorical can improve performance. A DataFrame column stored as categorical also uses significantly less memory. Consider a Series with 10 million elements and only a few distinct categories:

# In[36]:


N = 10000000
draws = pd.Series(np.random.randn(N))
labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (N // 4))


# Convert labels to categorical:

# In[37]:


categories = labels.astype('category')


# labels uses significantly more memory than categories:

# In[38]:


labels.memory_usage()


# In[39]:


categories.memory_usage()
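

# A caveat when reading these numbers: for strings stored as Python objects,
# memory_usage() counts only the pointers unless deep=True is passed, so the
# real gap can be larger than it looks. A minimal sketch on a smaller,
# hypothetical Series (small_labels), forced to object dtype so the comparison
# does not depend on the default string dtype of the installed pandas version.

# In[ ]:


import pandas as pd

small_n = 100_000
small_labels = pd.Series(['foo', 'bar', 'baz', 'qux'] * (small_n // 4),
                         dtype=object)
small_cats = small_labels.astype('category')

shallow = small_labels.memory_usage()          # pointers only
deep = small_labels.memory_usage(deep=True)    # also counts the string objects
cat_deep = small_cats.memory_usage(deep=True)  # codes plus the few categories
shallow, deep, cat_deep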


# The conversion to category is not free, of course, but it is a one-time cost:

# In[40]:


get_ipython().run_line_magic('time', "_ = labels.astype('category')")


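# GroupBy operations on categoricals can be significantly faster because the underlying algorithms work on the integer-based codes array instead of an array of strings.
# 
# # 4 Categorical Methods
# 
# A Series containing categorical data has several special methods, similar to the Series.str specialized string methods. These also give convenient access to the categories and codes: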

# In[41]:


s = pd.Series(['a', 'b', 'c', 'd'] * 2)


# In[42]:


cat_s = s.astype('category')
cat_s


# The special attribute cat provides access to the categorical methods:

# In[43]:


cat_s.cat.codes


# In[44]:


cat_s.cat.categories


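# Suppose we know that the actual set of categories for this data extends beyond the four values observed so far. We can use the set_categories method to change them: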

# In[45]:


actual_categories = ['a', 'b', 'c', 'd', 'e']
cat_s2 = cat_s.cat.set_categories(actual_categories)
cat_s2


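# The data itself appears unchanged, but the new categories are reflected in operations that use them. For example, value_counts respects the categories, if present: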

# In[46]:


cat_s.value_counts()


# In[47]:


cat_s2.value_counts()


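# In large datasets, categoricals are often used as a convenient tool for saving memory and improving performance. After you filter a large DataFrame or Series, many of the categories may no longer appear in the data. We can use the remove_unused_categories method to trim the unobserved categories: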

# In[48]:


cat_s3 = cat_s[cat_s.isin(['a', 'b'])]
cat_s3


# In[49]:


cat_s3.cat.remove_unused_categories()


# Some of the categorical methods are listed below:
# 
# ![](http://oydgk2hgw.bkt.clouddn.com/pydata-book/kbedp.png)
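# 
# As a hedged illustration of a few of these methods, the following sketch
# applies add_categories, rename_categories, remove_categories and
# reorder_categories to a small, hypothetical Series named letters. Each call
# returns a new Series and leaves letters unchanged.

# In[ ]:


import pandas as pd

letters = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

added = letters.cat.add_categories(['e'])            # append a new, unobserved category
renamed = letters.cat.rename_categories({'a': 'A'})  # relabel a category via a mapping
removed = letters.cat.remove_categories(['d'])       # values equal to 'd' become NaN
reordered = letters.cat.reorder_categories(['d', 'c', 'b', 'a'],
                                           ordered=True)
reordered

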
# 
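# ### Creating dummy variables for modeling
# 
# When using statistics or machine learning tools, you often need to transform categorical data into dummy variables, also known as one-hot encoding. This means creating a DataFrame with one column per distinct category, where each column contains 1 for occurrences of that category and 0 otherwise.
# 
# Consider the previous example: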

# In[50]:


cat_s = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')


# As mentioned in Chapter 7, the pandas.get_dummies function converts this one-dimensional categorical data into a DataFrame containing the dummy variables:

# In[51]:


pd.get_dummies(cat_s)
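

# Depending on the pandas version, get_dummies may return boolean columns
# rather than 1s and 0s. As a minimal sketch, passing dtype=int requests
# integer indicators, and prefix labels the generated columns. The Series name
# letters and the prefix 'category' are illustrative choices.

# In[ ]:


import pandas as pd

letters = pd.Series(['a', 'b', 'c', 'd'] * 2, dtype='category')

# One integer indicator column per category, named category_a ... category_d
dummies = pd.get_dummies(letters, prefix='category', dtype=int)
dummies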