Pandas Data Types

Accompanying the PB Python article here

In [1]:
import pandas as pd
import numpy as np
In [2]:
df = pd.read_csv("https://github.com/chris1610/pbpython/blob/master/data/sales_data_types.csv?raw=True")
In [3]:
df
Out[3]:
Customer Number Customer Name 2016 2017 Percent Growth Jan Units Month Day Year Active
0 10002.0 Quest Industries $125,000.00 $162500.00 30.00% 500 1 10 2015 Y
1 552278.0 Smith Plumbing $920,000.00 $101,2000.00 10.00% 700 6 15 2014 Y
2 23477.0 ACME Industrial $50,000.00 $62500.00 25.00% 125 3 29 2016 Y
3 24900.0 Brekke LTD $350,000.00 $490000.00 4.00% 75 10 27 2015 Y
4 651029.0 Harbor Co $15,000.00 $12750.00 -15.00% Closed 2 2 2014 N

Use df.info and df.dtypes to look at the types that pandas automatically infers based on the data

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 10 columns):
Customer Number    5 non-null float64
Customer Name      5 non-null object
2016               5 non-null object
2017               5 non-null object
Percent Growth     5 non-null object
Jan Units          5 non-null object
Month              5 non-null int64
Day                5 non-null int64
Year               5 non-null int64
Active             5 non-null object
dtypes: float64(1), int64(3), object(6)
memory usage: 480.0+ bytes

df.dtypes

Since the 2016 and 2017 columns were read in as objects, trying to add the values will result in string concatenation not numerical addition

In [5]:
df['2016'] + df['2017']
Out[5]:
0      $125,000.00$162500.00
1    $920,000.00$101,2000.00
2        $50,000.00$62500.00
3      $350,000.00$490000.00
4        $15,000.00$12750.00
dtype: object

The simplest way to to convert to a type is using astype.

We can apply it to the customer number first.

In [6]:
df['Customer Number'].astype('int')
Out[6]:
0     10002
1    552278
2     23477
3     24900
4    651029
Name: Customer Number, dtype: int64

The code above does not alter the original dataframe

In [7]:
df.dtypes
Out[7]:
Customer Number    float64
Customer Name       object
2016                object
2017                object
Percent Growth      object
Jan Units           object
Month                int64
Day                  int64
Year                 int64
Active              object
dtype: object

Assign the new integer customer number back to the original frame and check the type

In [8]:
df["Customer Number"] = df['Customer Number'].astype('int')
df.dtypes
Out[8]:
Customer Number     int64
Customer Name      object
2016               object
2017               object
Percent Growth     object
Jan Units          object
Month               int64
Day                 int64
Year                int64
Active             object
dtype: object
In [9]:
df
Out[9]:
Customer Number Customer Name 2016 2017 Percent Growth Jan Units Month Day Year Active
0 10002 Quest Industries $125,000.00 $162500.00 30.00% 500 1 10 2015 Y
1 552278 Smith Plumbing $920,000.00 $101,2000.00 10.00% 700 6 15 2014 Y
2 23477 ACME Industrial $50,000.00 $62500.00 25.00% 125 3 29 2016 Y
3 24900 Brekke LTD $350,000.00 $490000.00 4.00% 75 10 27 2015 Y
4 651029 Harbor Co $15,000.00 $12750.00 -15.00% Closed 2 2 2014 N

The data all looks good for the Customer Number.

If we try to convert the Jan Units column, we will get an error.

In [10]:
df['Jan Units'].astype('int')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-10-31333711e4a4> in <module>()
----> 1 df['Jan Units'].astype('int')

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   4002         # else, only a single dtype is given
   4003         new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 4004                                      **kwargs)
   4005         return self._constructor(new_data).__finalize__(self)
   4006 

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in astype(self, dtype, **kwargs)
   3460 
   3461     def astype(self, dtype, **kwargs):
-> 3462         return self.apply('astype', dtype=dtype, **kwargs)
   3463 
   3464     def convert(self, **kwargs):

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3327 
   3328             kwargs['mgr'] = self
-> 3329             applied = getattr(b, f)(**kwargs)
   3330             result_blocks = _extend_blocks(applied, result_blocks)
   3331 

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in astype(self, dtype, copy, errors, values, **kwargs)
    542     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
    543         return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 544                             **kwargs)
    545 
    546     def _astype(self, dtype, copy=False, errors='raise', values=None,

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, **kwargs)
    623 
    624                 # _astype_nansafe works fine with 1-d only
--> 625                 values = astype_nansafe(values.ravel(), dtype, copy=True)
    626                 values = values.reshape(self.shape)
    627 

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy)
    690     elif arr.dtype == np.object_ and np.issubdtype(dtype.type, np.integer):
    691         # work around NumPy brokenness, #1987
--> 692         return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
    693 
    694     if dtype.name in ("datetime64", "timedelta64"):

pandas/_libs/lib.pyx in pandas._libs.lib.astype_intsafe()

pandas/_libs/src/util.pxd in util.set_value_at_unsafe()

ValueError: invalid literal for int() with base 10: 'Closed'

In a similar manner we get an error if we try to convert the sales column

In [11]:
df['2016'].astype('float')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-999869d577b0> in <module>()
----> 1 df['2016'].astype('float')

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    116                 else:
    117                     kwargs[new_arg_name] = new_arg_value
--> 118             return func(*args, **kwargs)
    119         return wrapper
    120     return _deprecate_kwarg

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/generic.py in astype(self, dtype, copy, errors, **kwargs)
   4002         # else, only a single dtype is given
   4003         new_data = self._data.astype(dtype=dtype, copy=copy, errors=errors,
-> 4004                                      **kwargs)
   4005         return self._constructor(new_data).__finalize__(self)
   4006 

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in astype(self, dtype, **kwargs)
   3460 
   3461     def astype(self, dtype, **kwargs):
-> 3462         return self.apply('astype', dtype=dtype, **kwargs)
   3463 
   3464     def convert(self, **kwargs):

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
   3327 
   3328             kwargs['mgr'] = self
-> 3329             applied = getattr(b, f)(**kwargs)
   3330             result_blocks = _extend_blocks(applied, result_blocks)
   3331 

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in astype(self, dtype, copy, errors, values, **kwargs)
    542     def astype(self, dtype, copy=False, errors='raise', values=None, **kwargs):
    543         return self._astype(dtype, copy=copy, errors=errors, values=values,
--> 544                             **kwargs)
    545 
    546     def _astype(self, dtype, copy=False, errors='raise', values=None,

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/internals.py in _astype(self, dtype, copy, errors, values, klass, mgr, **kwargs)
    623 
    624                 # _astype_nansafe works fine with 1-d only
--> 625                 values = astype_nansafe(values.ravel(), dtype, copy=True)
    626                 values = values.reshape(self.shape)
    627 

~/miniconda3/envs/pbp3/lib/python3.6/site-packages/pandas/core/dtypes/cast.py in astype_nansafe(arr, dtype, copy)
    701 
    702     if copy:
--> 703         return arr.astype(dtype)
    704     return arr.view(dtype)
    705 

ValueError: could not convert string to float: '$15,000.00'

We can try to use astype with a bool type but that does not give expected results

In [12]:
df['Active'].astype('bool')
Out[12]:
0    True
1    True
2    True
3    True
4    True
Name: Active, dtype: bool
In [13]:
# astype can take a dictionary of column names and data types
df.astype({'Customer Number': 'int', 'Customer Name': 'str'}).dtypes
Out[13]:
Customer Number     int64
Customer Name      object
2016               object
2017               object
Percent Growth     object
Jan Units          object
Month               int64
Day                 int64
Year                int64
Active             object
dtype: object

In order to convert the currency and percentages, we need to use custom functions

In [14]:
def convert_currency(val):
    """
    Convert the string number value to a float
     - Remove $
     - Remove commas
     - Convert to float type
    """
    new_val = val.replace(',','').replace('$', '')
    return float(new_val)
In [15]:
def convert_percent(val):
    """
    Convert the percentage string to an actual floating point percent
    """
    new_val = val.replace('%', '')
    return float(new_val) / 100

Use apply to convert the 2016 and 2017 columns to floating point numbers

In [16]:
df['2016'].apply(convert_currency)
Out[16]:
0    125000.0
1    920000.0
2     50000.0
3    350000.0
4     15000.0
Name: 2016, dtype: float64
In [17]:
df['2017'].apply(convert_currency)
Out[17]:
0     162500.0
1    1012000.0
2      62500.0
3     490000.0
4      12750.0
Name: 2017, dtype: float64

We could use a lambda function as well but it may be more difficult for new users to understand

In [18]:
df['2016'].apply(lambda x: x.replace('$', '').replace(',', '')).astype('float')
Out[18]:
0    125000.0
1    920000.0
2     50000.0
3    350000.0
4     15000.0
Name: 2016, dtype: float64
In [19]:
# Assign the converted values back to the columns
df['2016'] = df['2016'].apply(convert_currency)
df['2017'] = df['2017'].apply(convert_currency)

Use a lambda function to convert the percentage strings to numbers

In [20]:
df['Percent Growth'].apply(lambda x: x.replace('%', '')).astype('float') / 100
Out[20]:
0    0.30
1    0.10
2    0.25
3    0.04
4   -0.15
Name: Percent Growth, dtype: float64
In [21]:
df['Percent Growth'] = df['Percent Growth'].apply(convert_percent)
In [22]:
df.dtypes
Out[22]:
Customer Number      int64
Customer Name       object
2016               float64
2017               float64
Percent Growth     float64
Jan Units           object
Month                int64
Day                  int64
Year                 int64
Active              object
dtype: object
In [23]:
# Let's look at the data so far
df
Out[23]:
Customer Number Customer Name 2016 2017 Percent Growth Jan Units Month Day Year Active
0 10002 Quest Industries 125000.0 162500.0 0.30 500 1 10 2015 Y
1 552278 Smith Plumbing 920000.0 1012000.0 0.10 700 6 15 2014 Y
2 23477 ACME Industrial 50000.0 62500.0 0.25 125 3 29 2016 Y
3 24900 Brekke LTD 350000.0 490000.0 0.04 75 10 27 2015 Y
4 651029 Harbor Co 15000.0 12750.0 -0.15 Closed 2 2 2014 N

pd.to_numeric is another option for handling column conversions when invalid values are included

In [24]:
pd.to_numeric(df['Jan Units'], errors='coerce')
Out[24]:
0    500.0
1    700.0
2    125.0
3     75.0
4      NaN
Name: Jan Units, dtype: float64
In [25]:
# Fill in the NaN with 0
pd.to_numeric(df['Jan Units'], errors='coerce').fillna(0)
Out[25]:
0    500.0
1    700.0
2    125.0
3     75.0
4      0.0
Name: Jan Units, dtype: float64

Make sure to populate the original column of data

In [26]:
df["Jan Units"] = pd.to_numeric(df['Jan Units'], errors='coerce').fillna(0)

pd.to_datetime is very useful for working with date conversions

In [27]:
pd.to_datetime(df[['Month', 'Day', 'Year']])
Out[27]:
0   2015-01-10
1   2014-06-15
2   2016-03-29
3   2015-10-27
4   2014-02-02
dtype: datetime64[ns]
In [28]:
df["Start_Date"] = pd.to_datetime(df[['Month', 'Day', 'Year']])
In [29]:
# Check out the dataframe
df
Out[29]:
Customer Number Customer Name 2016 2017 Percent Growth Jan Units Month Day Year Active Start_Date
0 10002 Quest Industries 125000.0 162500.0 0.30 500.0 1 10 2015 Y 2015-01-10
1 552278 Smith Plumbing 920000.0 1012000.0 0.10 700.0 6 15 2014 Y 2014-06-15
2 23477 ACME Industrial 50000.0 62500.0 0.25 125.0 3 29 2016 Y 2016-03-29
3 24900 Brekke LTD 350000.0 490000.0 0.04 75.0 10 27 2015 Y 2015-10-27
4 651029 Harbor Co 15000.0 12750.0 -0.15 0.0 2 2 2014 N 2014-02-02

Use np.where to convert the active column to a boolean

In [30]:
df["Active"] = np.where(df["Active"] == "Y", True, False)
In [31]:
df
Out[31]:
Customer Number Customer Name 2016 2017 Percent Growth Jan Units Month Day Year Active Start_Date
0 10002 Quest Industries 125000.0 162500.0 0.30 500.0 1 10 2015 True 2015-01-10
1 552278 Smith Plumbing 920000.0 1012000.0 0.10 700.0 6 15 2014 True 2014-06-15
2 23477 ACME Industrial 50000.0 62500.0 0.25 125.0 3 29 2016 True 2016-03-29
3 24900 Brekke LTD 350000.0 490000.0 0.04 75.0 10 27 2015 True 2015-10-27
4 651029 Harbor Co 15000.0 12750.0 -0.15 0.0 2 2 2014 False 2014-02-02
In [32]:
df.dtypes
Out[32]:
Customer Number             int64
Customer Name              object
2016                      float64
2017                      float64
Percent Growth            float64
Jan Units                 float64
Month                       int64
Day                         int64
Year                        int64
Active                       bool
Start_Date         datetime64[ns]
dtype: object

Many of the examples shown above can be used when reading in data using dtypes or converters arguments

In [33]:
df_2 = pd.read_csv("https://github.com/chris1610/pbpython/blob/master/data/sales_data_types.csv?raw=True", 
                   dtype={'Customer Number':'int'},
                   converters={'2016':convert_currency,
                               '2017': convert_currency,
                               'Percent Growth': convert_percent,
                               'Jan Units': lambda x: pd.to_numeric(x, errors='coerce'),
                               'Active': lambda x: np.where(x == "Y", True, False)
                              })
In [34]:
df_2.dtypes
Out[34]:
Customer Number      int64
Customer Name       object
2016               float64
2017               float64
Percent Growth     float64
Jan Units          float64
Month                int64
Day                  int64
Year                 int64
Active              object
dtype: object
In [35]:
df_2
Out[35]:
Customer Number Customer Name 2016 2017 Percent Growth Jan Units Month Day Year Active
0 10002 Quest Industries 125000.0 162500.0 0.30 500.0 1 10 2015 True
1 552278 Smith Plumbing 920000.0 1012000.0 0.10 700.0 6 15 2014 True
2 23477 ACME Industrial 50000.0 62500.0 0.25 125.0 3 29 2016 True
3 24900 Brekke LTD 350000.0 490000.0 0.04 75.0 10 27 2015 True
4 651029 Harbor Co 15000.0 12750.0 -0.15 NaN 2 2 2014 False
In [36]:
# This can not be applied at the time the data is read in
df_2["Start_Date"] = pd.to_datetime(df_2[['Month', 'Day', 'Year']])
In [37]:
df_2
Out[37]:
Customer Number Customer Name 2016 2017 Percent Growth Jan Units Month Day Year Active Start_Date
0 10002 Quest Industries 125000.0 162500.0 0.30 500.0 1 10 2015 True 2015-01-10
1 552278 Smith Plumbing 920000.0 1012000.0 0.10 700.0 6 15 2014 True 2014-06-15
2 23477 ACME Industrial 50000.0 62500.0 0.25 125.0 3 29 2016 True 2016-03-29
3 24900 Brekke LTD 350000.0 490000.0 0.04 75.0 10 27 2015 True 2015-10-27
4 651029 Harbor Co 15000.0 12750.0 -0.15 NaN 2 2 2014 False 2014-02-02