import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from datetime import datetime, date
plt.style.use('ggplot')
# Loading the Transactions Data from the excel file
trans = pd.read_excel('Raw_data.xlsx' , sheet_name='Transactions')
# Checking first 5 records from Transactions Data
trans.head(5)
  | transaction_id | product_id | customer_id | transaction_date | online_order | order_status | brand | product_line | product_class | product_size | list_price | standard_cost | product_first_sold_date
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 2 | 2950 | 2017-02-25 | 0.0 | Approved | Solex | Standard | medium | medium | 71.49 | 53.62 | 41245.0 |
1 | 2 | 3 | 3120 | 2017-05-21 | 1.0 | Approved | Trek Bicycles | Standard | medium | large | 2091.47 | 388.92 | 41701.0 |
2 | 3 | 37 | 402 | 2017-10-16 | 0.0 | Approved | OHM Cycles | Standard | low | medium | 1793.43 | 248.82 | 36361.0 |
3 | 4 | 88 | 3135 | 2017-08-31 | 0.0 | Approved | Norco Bicycles | Standard | medium | medium | 1198.46 | 381.10 | 36145.0 |
4 | 5 | 78 | 787 | 2017-10-01 | 1.0 | Approved | Giant Bicycles | Standard | medium | large | 1765.30 | 709.48 | 42226.0 |
# Information of columns and data-types of Transactions Data.
trans.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 13 columns):
transaction_id             20000 non-null int64
product_id                 20000 non-null int64
customer_id                20000 non-null int64
transaction_date           20000 non-null datetime64[ns]
online_order               19640 non-null float64
order_status               20000 non-null object
brand                      19803 non-null object
product_line               19803 non-null object
product_class              19803 non-null object
product_size               19803 non-null object
list_price                 20000 non-null float64
standard_cost              19803 non-null float64
product_first_sold_date    19803 non-null float64
dtypes: datetime64[ns](1), float64(4), int64(3), object(5)
memory usage: 2.0+ MB
The product_first_sold_date column is not stored as a datetime. Per the info() output it is a float64 (it looks like an Excel day serial), so it should be converted to datetime format.
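Values such as 41245.0 suggest Excel day serials. Assuming the standard Windows Excel epoch, the conversion can be sketched with `pd.to_datetime` and `origin='1899-12-30'` (that offset also absorbs Excel's fictitious 1900 leap day); the sample serials below are taken from the head of the table:

```python
import pandas as pd

# Sample product_first_sold_date values from the rows shown above,
# assumed to be Excel day serials counted from the 1900 date system
serials = pd.Series([41245.0, 41701.0, 36361.0])

# origin='1899-12-30' maps Excel serial 1 to 1900-01-01 while
# compensating for Excel's non-existent 1900-02-29
first_sold = pd.to_datetime(serials, unit='D', origin='1899-12-30')
print(first_sold)
```

On the real data this would be `trans['product_first_sold_date'] = pd.to_datetime(trans['product_first_sold_date'], unit='D', origin='1899-12-30')`, assuming the serials are indeed Excel dates.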
print("Total records (rows) in the dataset : {}".format(trans.shape[0]))
print("Total columns (features) in the dataset : {}".format(trans.shape[1]))
Total records (rows) in the dataset : 20000
Total columns (features) in the dataset : 13
# select numeric columns
df_numeric = trans.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
print("The numeric columns are :")
print(numeric_cols)
# select non-numeric columns
df_non_numeric = trans.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
print("The non-numeric columns are :")
print(non_numeric_cols)
The numeric columns are :
['transaction_id' 'product_id' 'customer_id' 'online_order' 'list_price'
 'standard_cost' 'product_first_sold_date']
The non-numeric columns are :
['transaction_date' 'order_status' 'brand' 'product_line' 'product_class'
 'product_size']
Checking for the presence of missing values in the dataset. If a feature has missing values then, depending on the situation, it may either be dropped (when a major share of its data is missing) or have an appropriate value imputed.
# Total number of missing values
trans.isnull().sum()
transaction_id               0
product_id                   0
customer_id                  0
transaction_date             0
online_order               360
order_status                 0
brand                      197
product_line               197
product_class              197
product_size               197
list_price                   0
standard_cost              197
product_first_sold_date    197
dtype: int64
# Percentage of missing values
trans.isnull().mean()*100
transaction_id             0.000
product_id                 0.000
customer_id                0.000
transaction_date           0.000
online_order               1.800
order_status               0.000
brand                      0.985
product_line               0.985
product_class              0.985
product_size               0.985
list_price                 0.000
standard_cost              0.985
product_first_sold_date    0.985
dtype: float64
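The two outputs above (counts and percentages) can be combined into a single summary table. A minimal sketch of such a helper, demonstrated on a toy frame standing in for the Transactions data:

```python
import pandas as pd
import numpy as np

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Combine missing-value counts and percentages into one table."""
    report = pd.DataFrame({
        'missing_count': df.isnull().sum(),
        'missing_pct': (df.isnull().mean() * 100).round(3),
    })
    # Keep only columns that actually have gaps, worst first
    return report[report['missing_count'] > 0].sort_values(
        'missing_count', ascending=False)

# Toy frame standing in for the Transactions data
demo = pd.DataFrame({'a': [1, np.nan, 3], 'b': ['x', 'y', 'z']})
print(missing_report(demo))
```

Called as `missing_report(trans)` this would reproduce the non-zero rows of both outputs in one view.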
Here it is observed that the columns online_order, brand, product_line, product_class, product_size, standard_cost and product_first_sold_date have missing values.
Since only 1.8% of the records are missing online_order, we can perform mode imputation for this categorical column.
trans[trans['online_order'].isnull()]
  | transaction_id | product_id | customer_id | transaction_date | online_order | order_status | brand | product_line | product_class | product_size | list_price | standard_cost | product_first_sold_date
---|---|---|---|---|---|---|---|---|---|---|---|---|---
97 | 98 | 49 | 333 | 2017-06-23 | NaN | Approved | Trek Bicycles | Road | medium | medium | 533.51 | 400.13 | 37823.0 |
166 | 167 | 90 | 3177 | 2017-04-26 | NaN | Approved | Norco Bicycles | Standard | low | medium | 363.01 | 290.41 | 38482.0 |
169 | 170 | 6 | 404 | 2017-10-16 | NaN | Approved | OHM Cycles | Standard | high | medium | 227.88 | 136.73 | 37838.0 |
250 | 251 | 63 | 1967 | 2017-04-11 | NaN | Approved | Solex | Standard | medium | medium | 1483.20 | 99.59 | 42145.0 |
300 | 301 | 78 | 2530 | 2017-03-24 | NaN | Approved | Giant Bicycles | Standard | medium | large | 1765.30 | 709.48 | 35455.0 |
336 | 337 | 82 | 1615 | 2017-10-30 | NaN | Approved | Norco Bicycles | Standard | high | medium | 1148.64 | 689.18 | 41533.0 |
342 | 343 | 61 | 1478 | 2017-03-11 | NaN | Approved | Norco Bicycles | Standard | medium | small | 586.45 | 521.94 | 33429.0 |
398 | 399 | 83 | 1306 | 2017-12-18 | NaN | Approved | Solex | Touring | medium | large | 2083.94 | 675.03 | 38206.0 |
476 | 477 | 73 | 367 | 2017-02-12 | NaN | Approved | Solex | Standard | medium | medium | 1945.43 | 333.18 | 38859.0 |
529 | 530 | 62 | 1202 | 2017-03-31 | NaN | Approved | Solex | Standard | medium | medium | 478.16 | 298.72 | 34143.0 |
558 | 559 | 20 | 3104 | 2017-03-21 | NaN | Approved | Trek Bicycles | Standard | medium | small | 1775.81 | 1580.47 | 33455.0 |
576 | 577 | 1 | 149 | 2017-05-04 | NaN | Approved | Giant Bicycles | Standard | medium | medium | 1403.50 | 954.82 | 40649.0 |
760 | 761 | 4 | 2428 | 2017-12-20 | NaN | Approved | Giant Bicycles | Standard | high | medium | 1129.13 | 677.48 | 33549.0 |
769 | 770 | 4 | 2414 | 2017-09-10 | NaN | Approved | Giant Bicycles | Standard | high | medium | 1129.13 | 677.48 | 38258.0 |
843 | 844 | 40 | 3412 | 2017-07-09 | NaN | Approved | OHM Cycles | Standard | high | medium | 1458.17 | 874.90 | 36498.0 |
914 | 915 | 39 | 2756 | 2017-03-16 | NaN | Approved | Giant Bicycles | Standard | medium | large | 1812.75 | 582.48 | 40336.0 |
917 | 918 | 35 | 2281 | 2017-04-30 | NaN | Approved | Giant Bicycles | Standard | medium | medium | 1403.50 | 954.82 | 42688.0 |
1060 | 1061 | 63 | 1342 | 2017-07-14 | NaN | Approved | Solex | Standard | medium | medium | 1483.20 | 99.59 | 34996.0 |
1071 | 1072 | 36 | 1505 | 2017-11-09 | NaN | Approved | Solex | Standard | low | medium | 945.04 | 507.58 | 40784.0 |
1159 | 1160 | 73 | 1722 | 2017-12-15 | NaN | Approved | Solex | Standard | medium | medium | 1945.43 | 333.18 | 37499.0 |
1192 | 1193 | 71 | 79 | 2017-11-17 | NaN | Approved | Solex | Standard | high | large | 1842.92 | 1105.75 | 34996.0 |
1200 | 1201 | 84 | 3241 | 2017-05-25 | NaN | Approved | Trek Bicycles | Road | medium | medium | 290.62 | 215.14 | 38206.0 |
1354 | 1355 | 92 | 1864 | 2017-02-28 | NaN | Approved | WeareA2B | Standard | medium | small | 1415.01 | 1259.36 | 37626.0 |
1361 | 1362 | 32 | 1395 | 2017-01-27 | NaN | Approved | Giant Bicycles | Standard | high | medium | 1179.00 | 707.40 | 38482.0 |
1371 | 1372 | 17 | 3254 | 2017-04-19 | NaN | Approved | Solex | Standard | high | medium | 1024.66 | 614.80 | 35378.0 |
1381 | 1382 | 96 | 1393 | 2017-12-29 | NaN | Approved | WeareA2B | Road | low | small | 1172.78 | 1043.77 | 37539.0 |
1523 | 1524 | 31 | 2151 | 2017-05-05 | NaN | Approved | Giant Bicycles | Standard | medium | medium | 230.91 | 173.18 | 39031.0 |
1552 | 1553 | 62 | 842 | 2017-10-23 | NaN | Approved | Solex | Standard | high | medium | 1024.66 | 614.80 | 35378.0 |
1555 | 1556 | 73 | 2776 | 2017-10-30 | NaN | Approved | Solex | Standard | medium | medium | 1945.43 | 333.18 | 37499.0 |
1621 | 1622 | 0 | 736 | 2017-09-07 | NaN | Approved | WeareA2B | Standard | medium | small | 175.89 | 131.92 | 37668.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18068 | 18069 | 26 | 788 | 2017-07-14 | NaN | Approved | WeareA2B | Standard | medium | medium | 1992.93 | 762.63 | 34115.0 |
18085 | 18086 | 48 | 2813 | 2017-04-07 | NaN | Approved | WeareA2B | Standard | medium | medium | 1762.96 | 950.52 | 41064.0 |
18335 | 18336 | 95 | 2958 | 2017-06-04 | NaN | Approved | Giant Bicycles | Standard | medium | large | 569.56 | 528.43 | 37874.0 |
18471 | 18472 | 81 | 176 | 2017-03-06 | NaN | Approved | Norco Bicycles | Standard | medium | small | 586.45 | 521.94 | 33429.0 |
18501 | 18502 | 19 | 2195 | 2017-11-02 | NaN | Approved | OHM Cycles | Road | high | large | 12.01 | 7.21 | 39880.0 |
18606 | 18607 | 43 | 2138 | 2017-12-01 | NaN | Approved | Norco Bicycles | Standard | medium | medium | 1555.58 | 818.01 | 37873.0 |
18624 | 18625 | 85 | 1133 | 2017-06-17 | NaN | Approved | WeareA2B | Standard | medium | medium | 752.64 | 205.36 | 42218.0 |
18689 | 18690 | 23 | 2987 | 2017-12-13 | NaN | Approved | Norco Bicycles | Standard | medium | medium | 1198.46 | 381.10 | 41701.0 |
18711 | 18712 | 78 | 665 | 2017-09-22 | NaN | Approved | Giant Bicycles | Standard | medium | large | 1765.30 | 709.48 | 38193.0 |
18722 | 18723 | 65 | 1597 | 2017-11-21 | NaN | Approved | WeareA2B | Standard | medium | medium | 1807.45 | 778.69 | 42145.0 |
18751 | 18752 | 22 | 2200 | 2017-10-12 | NaN | Approved | WeareA2B | Standard | medium | medium | 60.34 | 45.26 | 33455.0 |
18754 | 18755 | 55 | 403 | 2017-01-30 | NaN | Approved | Trek Bicycles | Road | medium | large | 1894.19 | 598.76 | 35707.0 |
18821 | 18822 | 30 | 3464 | 2017-12-13 | NaN | Approved | Solex | Standard | high | medium | 748.17 | 448.90 | 37337.0 |
18846 | 18847 | 57 | 50 | 2017-09-20 | NaN | Approved | WeareA2B | Touring | medium | large | 1890.39 | 260.14 | 34170.0 |
18982 | 18983 | 21 | 743 | 2017-01-03 | NaN | Approved | Solex | Standard | medium | large | 1071.23 | 380.74 | 34244.0 |
19024 | 19025 | 36 | 1937 | 2017-08-04 | NaN | Approved | Solex | Standard | low | medium | 945.04 | 507.58 | 35052.0 |
19082 | 19083 | 90 | 1204 | 2017-12-06 | NaN | Approved | Norco Bicycles | Standard | low | medium | 363.01 | 290.41 | 36367.0 |
19100 | 19101 | 8 | 2713 | 2017-10-21 | NaN | Approved | Solex | Road | medium | small | 1703.52 | 1516.13 | 38216.0 |
19271 | 19272 | 91 | 3017 | 2017-07-07 | NaN | Approved | Solex | Standard | medium | medium | 100.35 | 75.26 | 36367.0 |
19282 | 19283 | 55 | 1364 | 2017-08-02 | NaN | Approved | Trek Bicycles | Road | medium | large | 1894.19 | 598.76 | 37823.0 |
19303 | 19304 | 35 | 1776 | 2017-08-22 | NaN | Approved | Trek Bicycles | Standard | low | medium | 1057.51 | 154.40 | 34527.0 |
19368 | 19369 | 66 | 3223 | 2017-12-16 | NaN | Approved | Giant Bicycles | Road | low | small | 590.26 | 525.33 | 38647.0 |
19369 | 19370 | 91 | 1107 | 2017-09-02 | NaN | Approved | WeareA2B | Standard | low | medium | 642.31 | 513.85 | 41922.0 |
19433 | 19434 | 86 | 1557 | 2017-03-10 | NaN | Approved | OHM Cycles | Standard | medium | medium | 235.63 | 125.07 | 38206.0 |
19495 | 19496 | 2 | 3028 | 2017-11-14 | NaN | Approved | Giant Bicycles | Road | low | small | 590.26 | 525.33 | 42710.0 |
19514 | 19515 | 51 | 690 | 2017-01-22 | NaN | Approved | OHM Cycles | Standard | high | medium | 2005.66 | 1203.40 | 37823.0 |
19573 | 19574 | 18 | 1735 | 2017-01-15 | NaN | Approved | Solex | Standard | medium | medium | 575.27 | 431.45 | 41345.0 |
19580 | 19581 | 49 | 1933 | 2017-10-12 | NaN | Approved | Trek Bicycles | Road | medium | medium | 533.51 | 400.13 | 41064.0 |
19635 | 19636 | 98 | 1389 | 2017-07-26 | NaN | Approved | Trek Bicycles | Standard | high | medium | 358.39 | 215.03 | 38002.0 |
19843 | 19844 | 3 | 8 | 2017-01-28 | NaN | Approved | Trek Bicycles | Standard | medium | large | 2091.47 | 388.92 | 37823.0 |
360 rows × 13 columns
most_freq_online_mode = trans['online_order'].mode()
most_freq_online_mode
0    1.0
dtype: float64
Since 1.0 is the most frequent value of online_order, we perform mode imputation on this categorical column.
# Impute the mode (1.0); plain assignment avoids chained-inplace warnings in newer pandas
trans['online_order'] = trans['online_order'].fillna(1.0)
trans['online_order'].isnull().sum()
0
There are now no missing values in the online_order column.
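The imputation above hard-codes the value 1.0; a more general pattern reads the mode programmatically, so the code stays correct if the data (and hence the mode) changes. A sketch on a small stand-in Series:

```python
import pandas as pd
import numpy as np

# Stand-in for trans['online_order']: a 0/1 flag with gaps
s = pd.Series([1.0, 0.0, 1.0, np.nan, 1.0, np.nan])

# mode() returns a Series (there can be ties); take the first entry
fill_value = s.mode()[0]
s_filled = s.fillna(fill_value)
print(fill_value, s_filled.isnull().sum())
```

On the real data this would read `trans['online_order'] = trans['online_order'].fillna(trans['online_order'].mode()[0])`.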
It is observed that whenever brand is null, the other columns with missing values, viz. product_line, product_class, product_size, standard_cost and product_first_sold_date, are also null. These null records comprise about 1% of the dataset, hence we can drop these rows.
trans[trans['brand'].isnull()][['brand', 'product_line', 'product_class', 'product_size',
'standard_cost', 'product_first_sold_date']].drop_duplicates()
  | brand | product_line | product_class | product_size | standard_cost | product_first_sold_date
---|---|---|---|---|---|---
136 | NaN | NaN | NaN | NaN | NaN | NaN |
trans[trans['brand'].isnull()][['brand', 'product_line', 'product_class', 'product_size',
'standard_cost', 'product_first_sold_date']].shape[0]
197
records_to_drop = trans[trans['brand'].isnull()][['brand', 'product_line', 'product_class', 'product_size',
'standard_cost', 'product_first_sold_date']].index
records_to_drop
Int64Index([ 136, 159, 366, 406, 676, 780, 1003, 1130, 1196, 1282, ... 19045, 19132, 19176, 19196, 19205, 19340, 19383, 19793, 19859, 19871], dtype='int64', length=197)
# axis is redundant when dropping by the index keyword
trans.drop(index=records_to_drop, inplace=True)
Now there are no missing values in the Transactions dataset.
trans.isnull().sum()
transaction_id             0
product_id                 0
customer_id                0
transaction_date           0
online_order               0
order_status               0
brand                      0
product_line               0
product_class              0
product_size               0
list_price                 0
standard_cost              0
product_first_sold_date    0
dtype: int64
print("Total records after removing Missing Values: {}".format(trans.shape[0]))
Total records after removing Missing Values: 19803
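Because the six product columns are null together, the index-based drop above can equivalently be expressed with `dropna(subset=...)` on the brand column alone. A sketch on a toy frame:

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the pattern: product columns are null together
df = pd.DataFrame({
    'transaction_id': [1, 2, 3],
    'brand': ['Solex', np.nan, 'OHM Cycles'],
    'standard_cost': [53.62, np.nan, 248.82],
})

# Dropping on 'brand' alone removes exactly the rows where the whole
# product-attribute group is missing
cleaned = df.dropna(subset=['brand'])
print(len(cleaned))
```

On the real data, `trans = trans.dropna(subset=['brand'])` would leave the same 19803 rows as the index-based drop.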
The Profit column is the difference between the list price and the standard cost of a product.
trans['Profit'] = trans['list_price']-trans['standard_cost']
# Distribution of the Profit column
plt.figure(figsize=(20,8))
sns.distplot(trans['Profit'])
<matplotlib.axes._subplots.AxesSubplot at 0x289cd29bc50>
We will check whether inconsistent or typo-error data is present in the categorical columns.
The columns to be checked are 'online_order', 'order_status', 'product_line', 'product_class', 'product_size' and 'brand'.
There is no inconsistent data in online_order column.
trans['online_order'].value_counts()
1.0    10097
0.0     9706
Name: online_order, dtype: int64
There is no inconsistent data in order_status column.
trans['order_status'].value_counts()
Approved     19625
Cancelled      178
Name: order_status, dtype: int64
trans[['order_status', 'online_order']].drop_duplicates()
  | order_status | online_order
---|---|---
0 | Approved | 0.0 |
1 | Approved | 1.0 |
42 | Cancelled | 0.0 |
254 | Cancelled | 1.0 |
There is no inconsistent data in product_line column.
trans['product_line'].value_counts()
Standard    14176
Road         3970
Touring      1234
Mountain      423
Name: product_line, dtype: int64
There is no inconsistent data in product_class column.
trans['product_class'].value_counts()
medium    13826
high       3013
low        2964
Name: product_class, dtype: int64
There is no inconsistent data in product_size column.
trans['product_size'].value_counts()
medium    12990
large      3976
small      2837
Name: product_size, dtype: int64
There is no inconsistent data in brand column.
trans['brand'].value_counts()
Solex             4253
Giant Bicycles    3312
WeareA2B          3295
OHM Cycles        3043
Trek Bicycles     2990
Norco Bicycles    2910
Name: brand, dtype: int64
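Rather than inspecting value_counts() column by column as above, a short loop can print all categorical columns in one pass; typos would show up as unexpected extra categories. A sketch on a toy frame:

```python
import pandas as pd

# Toy frame standing in for the Transactions data
df = pd.DataFrame({
    'order_status': ['Approved', 'Approved', 'Cancelled'],
    'brand': ['Solex', 'Solex', 'OHM Cycles'],
})

cat_cols = ['order_status', 'brand']
for col in cat_cols:
    # Unique labels per column; a typo such as 'Aproved' would appear
    # as an additional low-frequency category
    print(col, '->', df[col].value_counts().to_dict())
```

On the real data `cat_cols` would be the six columns listed above.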
We need to ensure there is no duplication of records in the dataset, since duplicates would distort the analysis. If there are duplicate rows then we need to drop such records.
To check for duplicates we first remove the primary key column of the dataset and then apply the drop_duplicates() function provided by pandas.
trans_dedupped = trans.drop('transaction_id', axis=1).drop_duplicates()
print("Number of records after removing transaction_id (pk) and duplicates : {}".format(trans_dedupped.shape[0]))
print("Number of records in original dataset : {}".format(trans.shape[0]))
Number of records after removing transaction_id (pk) and duplicates : 19803
Number of records in original dataset : 19803
Since both numbers are the same, there are no duplicate records in the dataset.
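The same check can be done without building a second frame by counting with `duplicated()`. A sketch on a toy frame where rows 0 and 2 are identical once the surrogate key is removed:

```python
import pandas as pd

df = pd.DataFrame({
    'transaction_id': [1, 2, 3],          # unique surrogate key
    'product_id': [2, 3, 2],
    'list_price': [71.49, 2091.47, 71.49],
})

# Count duplicate rows while ignoring the primary key column
n_dupes = df.drop(columns='transaction_id').duplicated().sum()
print(n_dupes)
```

On the real data, `trans.drop(columns='transaction_id').duplicated().sum()` returning 0 is equivalent to the shape comparison above.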
The Transactions dataset is now clean, so we export it to a CSV file and continue the Customer Segments analysis by joining it to the other tables.
trans.to_csv('Transactions_Cleaned.csv', index=False)