import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from datetime import datetime, date
plt.style.use('ggplot')
# Loading the Transactions Data from the excel file
trans = pd.read_excel('Raw_data.xlsx' , sheet_name='Transactions')
# Checking first 5 records from Transactions Data
trans.head(5)
  | transaction_id | product_id | customer_id | transaction_date | online_order | order_status | brand | product_line | product_class | product_size | list_price | standard_cost | product_first_sold_date
---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 2 | 2950 | 2017-02-25 | 0.0 | Approved | Solex | Standard | medium | medium | 71.49 | 53.62 | 41245.0 |
1 | 2 | 3 | 3120 | 2017-05-21 | 1.0 | Approved | Trek Bicycles | Standard | medium | large | 2091.47 | 388.92 | 41701.0 |
2 | 3 | 37 | 402 | 2017-10-16 | 0.0 | Approved | OHM Cycles | Standard | low | medium | 1793.43 | 248.82 | 36361.0 |
3 | 4 | 88 | 3135 | 2017-08-31 | 0.0 | Approved | Norco Bicycles | Standard | medium | medium | 1198.46 | 381.10 | 36145.0 |
4 | 5 | 78 | 787 | 2017-10-01 | 1.0 | Approved | Giant Bicycles | Standard | medium | large | 1765.30 | 709.48 | 42226.0 |
# Information of columns and data-types of Transactions Data.
trans.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 13 columns):
transaction_id             20000 non-null int64
product_id                 20000 non-null int64
customer_id                20000 non-null int64
transaction_date           20000 non-null datetime64[ns]
online_order               19640 non-null float64
order_status               20000 non-null object
brand                      19803 non-null object
product_line               19803 non-null object
product_class              19803 non-null object
product_size               19803 non-null object
list_price                 20000 non-null float64
standard_cost              19803 non-null float64
product_first_sold_date    19803 non-null float64
dtypes: datetime64[ns](1), float64(4), int64(3), object(5)
memory usage: 2.0+ MB
The product_first_sold_date column is not stored as a datetime. Per the info() output it is a float64 (it looks like an Excel day serial), so it should be converted to datetime format.
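Values such as 41245.0 suggest Excel day serials. Assuming the standard Windows Excel epoch, the conversion can be sketched with `pd.to_datetime` and `origin='1899-12-30'` (that offset also absorbs Excel's fictitious 1900 leap day); the sample serials below are taken from the head of the table:

```python
import pandas as pd

# Sample product_first_sold_date values from the rows shown above,
# assumed to be Excel day serials counted from the 1900 date system
serials = pd.Series([41245.0, 41701.0, 36361.0])

# origin='1899-12-30' maps Excel serial 1 to 1900-01-01 while
# compensating for Excel's non-existent 1900-02-29
first_sold = pd.to_datetime(serials, unit='D', origin='1899-12-30')
print(first_sold)
```

On the real data this would be `trans['product_first_sold_date'] = pd.to_datetime(trans['product_first_sold_date'], unit='D', origin='1899-12-30')`, assuming the serials are indeed Excel dates.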
print("Total records (rows) in the dataset : {}".format(trans.shape[0]))
print("Total columns (features) in the dataset : {}".format(trans.shape[1]))
Total records (rows) in the dataset : 20000
Total columns (features) in the dataset : 13
# select numeric columns
df_numeric = trans.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
print("The numeric columns are :")
print(numeric_cols)
# select non-numeric columns
df_non_numeric = trans.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
print("The non-numeric columns are :")
print(non_numeric_cols)
The numeric columns are :
['transaction_id' 'product_id' 'customer_id' 'online_order' 'list_price'
 'standard_cost' 'product_first_sold_date']
The non-numeric columns are :
['transaction_date' 'order_status' 'brand' 'product_line' 'product_class'
 'product_size']
Checking for the presence of missing values in the dataset. If a feature has missing values then, depending on the situation, it may either be dropped (when a major share of its data is missing) or have an appropriate value imputed.
# Total number of missing values
trans.isnull().sum()
transaction_id               0
product_id                   0
customer_id                  0
transaction_date             0
online_order               360
order_status                 0
brand                      197
product_line               197
product_class              197
product_size               197
list_price                   0
standard_cost              197
product_first_sold_date    197
dtype: int64
# Percentage of missing values
trans.isnull().mean()*100
transaction_id             0.000
product_id                 0.000
customer_id                0.000
transaction_date           0.000
online_order               1.800
order_status               0.000
brand                      0.985
product_line               0.985
product_class              0.985
product_size               0.985
list_price                 0.000
standard_cost              0.985
product_first_sold_date    0.985
dtype: float64
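The two outputs above (counts and percentages) can be combined into a single summary table. A minimal sketch of such a helper, demonstrated on a toy frame standing in for the Transactions data:

```python
import pandas as pd
import numpy as np

def missing_report(df: pd.DataFrame) -> pd.DataFrame:
    """Combine missing-value counts and percentages into one table."""
    report = pd.DataFrame({
        'missing_count': df.isnull().sum(),
        'missing_pct': (df.isnull().mean() * 100).round(3),
    })
    # Keep only columns that actually have gaps, worst first
    return report[report['missing_count'] > 0].sort_values(
        'missing_count', ascending=False)

# Toy frame standing in for the Transactions data
demo = pd.DataFrame({'a': [1, np.nan, 3], 'b': ['x', 'y', 'z']})
print(missing_report(demo))
```

Called as `missing_report(trans)` this would reproduce the non-zero rows of both outputs in one view.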
Here it is observed that the columns online_order, brand, product_line, product_class, product_size, standard_cost and product_first_sold_date have missing values.
Since only 1.8% of the records are missing online_order, we can perform mode imputation for this categorical column.
trans[trans['online_order'].isnull()]
  | transaction_id | product_id | customer_id | transaction_date | online_order | order_status | brand | product_line | product_class | product_size | list_price | standard_cost | product_first_sold_date
---|---|---|---|---|---|---|---|---|---|---|---|---|---
97 | 98 | 49 | 333 | 2017-06-23 | NaN | Approved | Trek Bicycles | Road | medium | medium | 533.51 | 400.13 | 37823.0 |
166 | 167 | 90 | 3177 | 2017-04-26 | NaN | Approved | Norco Bicycles | Standard | low | medium | 363.01 | 290.41 | 38482.0 |
169 | 170 | 6 | 404 | 2017-10-16 | NaN | Approved | OHM Cycles | Standard | high | medium | 227.88 | 136.73 | 37838.0 |
250 | 251 | 63 | 1967 | 2017-04-11 | NaN | Approved | Solex | Standard | medium | medium | 1483.20 | 99.59 | 42145.0 |
300 | 301 | 78 | 2530 | 2017-03-24 | NaN | Approved | Giant Bicycles | Standard | medium | large | 1765.30 | 709.48 | 35455.0 |
336 | 337 | 82 | 1615 | 2017-10-30 | NaN | Approved | Norco Bicycles | Standard | high | medium | 1148.64 | 689.18 | 41533.0 |
342 | 343 | 61 | 1478 | 2017-03-11 | NaN | Approved | Norco Bicycles | Standard | medium | small | 586.45 | 521.94 | 33429.0 |
398 | 399 | 83 | 1306 | 2017-12-18 | NaN | Approved | Solex | Touring | medium | large | 2083.94 | 675.03 | 38206.0 |
476 | 477 | 73 | 367 | 2017-02-12 | NaN | Approved | Solex | Standard | medium | medium | 1945.43 | 333.18 | 38859.0 |
529 | 530 | 62 | 1202 | 2017-03-31 | NaN | Approved | Solex | Standard | medium | medium | 478.16 | 298.72 | 34143.0 |
558 | 559 | 20 | 3104 | 2017-03-21 | NaN | Approved | Trek Bicycles | Standard | medium | small | 1775.81 | 1580.47 | 33455.0 |
576 | 577 | 1 | 149 | 2017-05-04 | NaN | Approved | Giant Bicycles | Standard | medium | medium | 1403.50 | 954.82 | 40649.0 |
760 | 761 | 4 | 2428 | 2017-12-20 | NaN | Approved | Giant Bicycles | Standard | high | medium | 1129.13 | 677.48 | 33549.0 |
769 | 770 | 4 | 2414 | 2017-09-10 | NaN | Approved | Giant Bicycles | Standard | high | medium | 1129.13 | 677.48 | 38258.0 |
843 | 844 | 40 | 3412 | 2017-07-09 | NaN | Approved | OHM Cycles | Standard | high | medium | 1458.17 | 874.90 | 36498.0 |
914 | 915 | 39 | 2756 | 2017-03-16 | NaN | Approved | Giant Bicycles | Standard | medium | large | 1812.75 | 582.48 | 40336.0 |
917 | 918 | 35 | 2281 | 2017-04-30 | NaN | Approved | Giant Bicycles | Standard | medium | medium | 1403.50 | 954.82 | 42688.0 |
1060 | 1061 | 63 | 1342 | 2017-07-14 | NaN | Approved | Solex | Standard | medium | medium | 1483.20 | 99.59 | 34996.0 |
1071 | 1072 | 36 | 1505 | 2017-11-09 | NaN | Approved | Solex | Standard | low | medium | 945.04 | 507.58 | 40784.0 |
1159 | 1160 | 73 | 1722 | 2017-12-15 | NaN | Approved | Solex | Standard | medium | medium | 1945.43 | 333.18 | 37499.0 |
1192 | 1193 | 71 | 79 | 2017-11-17 | NaN | Approved | Solex | Standard | high | large | 1842.92 | 1105.75 | 34996.0 |
1200 | 1201 | 84 | 3241 | 2017-05-25 | NaN | Approved | Trek Bicycles | Road | medium | medium | 290.62 | 215.14 | 38206.0 |
1354 | 1355 | 92 | 1864 | 2017-02-28 | NaN | Approved | WeareA2B | Standard | medium | small | 1415.01 | 1259.36 | 37626.0 |
1361 | 1362 | 32 | 1395 | 2017-01-27 | NaN | Approved | Giant Bicycles | Standard | high | medium | 1179.00 | 707.40 | 38482.0 |
1371 | 1372 | 17 | 3254 | 2017-04-19 | NaN | Approved | Solex | Standard | high | medium | 1024.66 | 614.80 | 35378.0 |
1381 | 1382 | 96 | 1393 | 2017-12-29 | NaN | Approved | WeareA2B | Road | low | small | 1172.78 | 1043.77 | 37539.0 |
1523 | 1524 | 31 | 2151 | 2017-05-05 | NaN | Approved | Giant Bicycles | Standard | medium | medium | 230.91 | 173.18 | 39031.0 |
1552 | 1553 | 62 | 842 | 2017-10-23 | NaN | Approved | Solex | Standard | high | medium | 1024.66 | 614.80 | 35378.0 |
1555 | 1556 | 73 | 2776 | 2017-10-30 | NaN | Approved | Solex | Standard | medium | medium | 1945.43 | 333.18 | 37499.0 |
1621 | 1622 | 0 | 736 | 2017-09-07 | NaN | Approved | WeareA2B | Standard | medium | small | 175.89 | 131.92 | 37668.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
18068 | 18069 | 26 | 788 | 2017-07-14 | NaN | Approved | WeareA2B | Standard | medium | medium | 1992.93 | 762.63 | 34115.0 |
18085 | 18086 | 48 | 2813 | 2017-04-07 | NaN | Approved | WeareA2B | Standard | medium | medium | 1762.96 | 950.52 | 41064.0 |
18335 | 18336 | 95 | 2958 | 2017-06-04 | NaN | Approved | Giant Bicycles | Standard | medium | large | 569.56 | 528.43 | 37874.0 |
18471 | 18472 | 81 | 176 | 2017-03-06 | NaN | Approved | Norco Bicycles | Standard | medium | small | 586.45 | 521.94 | 33429.0 |
18501 | 18502 | 19 | 2195 | 2017-11-02 | NaN | Approved | OHM Cycles | Road | high | large | 12.01 | 7.21 | 39880.0 |
18606 | 18607 | 43 | 2138 | 2017-12-01 | NaN | Approved | Norco Bicycles | Standard | medium | medium | 1555.58 | 818.01 | 37873.0 |
18624 | 18625 | 85 | 1133 | 2017-06-17 | NaN | Approved | WeareA2B | Standard | medium | medium | 752.64 | 205.36 | 42218.0 |
18689 | 18690 | 23 | 2987 | 2017-12-13 | NaN | Approved | Norco Bicycles | Standard | medium | medium | 1198.46 | 381.10 | 41701.0 |
18711 | 18712 | 78 | 665 | 2017-09-22 | NaN | Approved | Giant Bicycles | Standard | medium | large | 1765.30 | 709.48 | 38193.0 |
18722 | 18723 | 65 | 1597 | 2017-11-21 | NaN | Approved | WeareA2B | Standard | medium | medium | 1807.45 | 778.69 | 42145.0 |
18751 | 18752 | 22 | 2200 | 2017-10-12 | NaN | Approved | WeareA2B | Standard | medium | medium | 60.34 | 45.26 | 33455.0 |
18754 | 18755 | 55 | 403 | 2017-01-30 | NaN | Approved | Trek Bicycles | Road | medium | large | 1894.19 | 598.76 | 35707.0 |
18821 | 18822 | 30 | 3464 | 2017-12-13 | NaN | Approved | Solex | Standard | high | medium | 748.17 | 448.90 | 37337.0 |
18846 | 18847 | 57 | 50 | 2017-09-20 | NaN | Approved | WeareA2B | Touring | medium | large | 1890.39 | 260.14 | 34170.0 |
18982 | 18983 | 21 | 743 | 2017-01-03 | NaN | Approved | Solex | Standard | medium | large | 1071.23 | 380.74 | 34244.0 |
19024 | 19025 | 36 | 1937 | 2017-08-04 | NaN | Approved | Solex | Standard | low | medium | 945.04 | 507.58 | 35052.0 |
19082 | 19083 | 90 | 1204 | 2017-12-06 | NaN | Approved | Norco Bicycles | Standard | low | medium | 363.01 | 290.41 | 36367.0 |
19100 | 19101 | 8 | 2713 | 2017-10-21 | NaN | Approved | Solex | Road | medium | small | 1703.52 | 1516.13 | 38216.0 |
19271 | 19272 | 91 | 3017 | 2017-07-07 | NaN | Approved | Solex | Standard | medium | medium | 100.35 | 75.26 | 36367.0 |
19282 | 19283 | 55 | 1364 | 2017-08-02 | NaN | Approved | Trek Bicycles | Road | medium | large | 1894.19 | 598.76 | 37823.0 |
19303 | 19304 | 35 | 1776 | 2017-08-22 | NaN | Approved | Trek Bicycles | Standard | low | medium | 1057.51 | 154.40 | 34527.0 |
19368 | 19369 | 66 | 3223 | 2017-12-16 | NaN | Approved | Giant Bicycles | Road | low | small | 590.26 | 525.33 | 38647.0 |
19369 | 19370 | 91 | 1107 | 2017-09-02 | NaN | Approved | WeareA2B | Standard | low | medium | 642.31 | 513.85 | 41922.0 |
19433 | 19434 | 86 | 1557 | 2017-03-10 | NaN | Approved | OHM Cycles | Standard | medium | medium | 235.63 | 125.07 | 38206.0 |
19495 | 19496 | 2 | 3028 | 2017-11-14 | NaN | Approved | Giant Bicycles | Road | low | small | 590.26 | 525.33 | 42710.0 |
19514 | 19515 | 51 | 690 | 2017-01-22 | NaN | Approved | OHM Cycles | Standard | high | medium | 2005.66 | 1203.40 | 37823.0 |
19573 | 19574 | 18 | 1735 | 2017-01-15 | NaN | Approved | Solex | Standard | medium | medium | 575.27 | 431.45 | 41345.0 |
19580 | 19581 | 49 | 1933 | 2017-10-12 | NaN | Approved | Trek Bicycles | Road | medium | medium | 533.51 | 400.13 | 41064.0 |
19635 | 19636 | 98 | 1389 | 2017-07-26 | NaN | Approved | Trek Bicycles | Standard | high | medium | 358.39 | 215.03 | 38002.0 |
19843 | 19844 | 3 | 8 | 2017-01-28 | NaN | Approved | Trek Bicycles | Standard | medium | large | 2091.47 | 388.92 | 37823.0 |
360 rows × 13 columns
most_freq_online_mode = trans['online_order'].mode()
most_freq_online_mode
0    1.0
dtype: float64
Since 1.0 is the most frequent value of online_order, we perform mode imputation on this categorical column.
# Impute the mode (1.0); plain assignment avoids chained-inplace warnings in newer pandas
trans['online_order'] = trans['online_order'].fillna(1.0)
trans['online_order'].isnull().sum()
0
There are now no missing values in the online_order column.
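The imputation above hard-codes the value 1.0; a more general pattern reads the mode programmatically, so the code stays correct if the data (and hence the mode) changes. A sketch on a small stand-in Series:

```python
import pandas as pd
import numpy as np

# Stand-in for trans['online_order']: a 0/1 flag with gaps
s = pd.Series([1.0, 0.0, 1.0, np.nan, 1.0, np.nan])

# mode() returns a Series (there can be ties); take the first entry
fill_value = s.mode()[0]
s_filled = s.fillna(fill_value)
print(fill_value, s_filled.isnull().sum())
```

On the real data this would read `trans['online_order'] = trans['online_order'].fillna(trans['online_order'].mode()[0])`.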
It is observed that whenever brand is null, the other columns with missing values, viz. product_line, product_class, product_size, standard_cost and product_first_sold_date, are also null. These null records comprise about 1% of the dataset, hence we can drop these rows.
trans[trans['brand'].isnull()][['brand', 'product_line', 'product_class', 'product_size',
'standard_cost', 'product_first_sold_date']].drop_duplicates()
  | brand | product_line | product_class | product_size | standard_cost | product_first_sold_date
---|---|---|---|---|---|---
136 | NaN | NaN | NaN | NaN | NaN | NaN |
trans[trans['brand'].isnull()][['brand', 'product_line', 'product_class', 'product_size',
'standard_cost', 'product_first_sold_date']].shape[0]
197
records_to_drop = trans[trans['brand'].isnull()][['brand', 'product_line', 'product_class', 'product_size',
'standard_cost', 'product_first_sold_date']].index
records_to_drop
Int64Index([ 136, 159, 366, 406, 676, 780, 1003, 1130, 1196, 1282, ... 19045, 19132, 19176, 19196, 19205, 19340, 19383, 19793, 19859, 19871], dtype='int64', length=197)
# axis is redundant when dropping by the index keyword
trans.drop(index=records_to_drop, inplace=True)
Now there are no missing values in the Transactions dataset.
trans.isnull().sum()
transaction_id             0
product_id                 0
customer_id                0
transaction_date           0
online_order               0
order_status               0
brand                      0
product_line               0
product_class              0
product_size               0
list_price                 0
standard_cost              0
product_first_sold_date    0
dtype: int64
print("Total records after removing Missing Values: {}".format(trans.shape[0]))
Total records after removing Missing Values: 19803
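Because the six product columns are null together, the index-based drop above can equivalently be expressed with `dropna(subset=...)` on the brand column alone. A sketch on a toy frame:

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the pattern: product columns are null together
df = pd.DataFrame({
    'transaction_id': [1, 2, 3],
    'brand': ['Solex', np.nan, 'OHM Cycles'],
    'standard_cost': [53.62, np.nan, 248.82],
})

# Dropping on 'brand' alone removes exactly the rows where the whole
# product-attribute group is missing
cleaned = df.dropna(subset=['brand'])
print(len(cleaned))
```

On the real data, `trans = trans.dropna(subset=['brand'])` would leave the same 19803 rows as the index-based drop.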
The Profit column is the difference between the list price and the standard cost of a product.
trans['Profit'] = trans['list_price']-trans['standard_cost']
# Distribution of the Profit column
plt.figure(figsize=(20,8))
sns.distplot(trans['Profit'])
<matplotlib.axes._subplots.AxesSubplot at 0x289cd29bc50>
We will check whether inconsistent or typo-error data is present in the categorical columns.
The columns to be checked are 'online_order', 'order_status', 'product_line', 'product_class', 'product_size' and 'brand'.
There is no inconsistent data in online_order column.
trans['online_order'].value_counts()
1.0    10097
0.0     9706
Name: online_order, dtype: int64
There is no inconsistent data in order_status column.
trans['order_status'].value_counts()
Approved     19625
Cancelled      178
Name: order_status, dtype: int64
trans[['order_status', 'online_order']].drop_duplicates()
  | order_status | online_order
---|---|---
0 | Approved | 0.0 |
1 | Approved | 1.0 |
42 | Cancelled | 0.0 |
254 | Cancelled | 1.0 |
There is no inconsistent data in product_line column.
trans['product_line'].value_counts()
Standard    14176
Road         3970
Touring      1234
Mountain      423
Name: product_line, dtype: int64
There is no inconsistent data in product_class column.
trans['product_class'].value_counts()
medium    13826
high       3013
low        2964
Name: product_class, dtype: int64
There is no inconsistent data in product_size column.
trans['product_size'].value_counts()
medium    12990
large      3976
small      2837
Name: product_size, dtype: int64
There is no inconsistent data in brand column.
trans['brand'].value_counts()
Solex             4253
Giant Bicycles    3312
WeareA2B          3295
OHM Cycles        3043
Trek Bicycles     2990
Norco Bicycles    2910
Name: brand, dtype: int64
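Rather than inspecting value_counts() column by column as above, a short loop can print all categorical columns in one pass; typos would show up as unexpected extra categories. A sketch on a toy frame:

```python
import pandas as pd

# Toy frame standing in for the Transactions data
df = pd.DataFrame({
    'order_status': ['Approved', 'Approved', 'Cancelled'],
    'brand': ['Solex', 'Solex', 'OHM Cycles'],
})

cat_cols = ['order_status', 'brand']
for col in cat_cols:
    # Unique labels per column; a typo such as 'Aproved' would appear
    # as an additional low-frequency category
    print(col, '->', df[col].value_counts().to_dict())
```

On the real data `cat_cols` would be the six columns listed above.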
We need to ensure there is no duplication of records in the dataset, since duplicates would distort the analysis. If there are duplicate rows then we need to drop such records.
To check for duplicates we first remove the primary key column of the dataset and then apply the drop_duplicates() function provided by pandas.
trans_dedupped = trans.drop('transaction_id', axis=1).drop_duplicates()
print("Number of records after removing transaction_id (pk) and duplicates : {}".format(trans_dedupped.shape[0]))
print("Number of records in original dataset : {}".format(trans.shape[0]))
Number of records after removing transaction_id (pk) and duplicates : 19803
Number of records in original dataset : 19803
Since both numbers are the same, there are no duplicate records in the dataset.
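The same check can be done without building a second frame by counting with `duplicated()`. A sketch on a toy frame where rows 0 and 2 are identical once the surrogate key is removed:

```python
import pandas as pd

df = pd.DataFrame({
    'transaction_id': [1, 2, 3],          # unique surrogate key
    'product_id': [2, 3, 2],
    'list_price': [71.49, 2091.47, 71.49],
})

# Count duplicate rows while ignoring the primary key column
n_dupes = df.drop(columns='transaction_id').duplicated().sum()
print(n_dupes)
```

On the real data, `trans.drop(columns='transaction_id').duplicated().sum()` returning 0 is equivalent to the shape comparison above.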
The Transactions dataset is now clean, so we export it to a CSV file and continue the Customer Segments analysis by joining it to the other tables.
trans.to_csv('Transactions_Cleaned.csv', index=False)