import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from datetime import datetime, date
plt.style.use('ggplot')
# Loading the Customer Demographics data from the Excel file
cust_demo = pd.read_excel('Raw_data.xlsx' , sheet_name='CustomerDemographic')
# Checking first 5 records from Customer Demographics Data
cust_demo.head(5)
| | customer_id | first_name | last_name | gender | past_3_years_bike_related_purchases | DOB | job_title | job_industry_category | wealth_segment | deceased_indicator | default | owns_car | tenure |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Laraine | Medendorp | F | 93 | 1953-10-12 | Executive Secretary | Health | Mass Customer | N | "' | Yes | 11.0 |
1 | 2 | Eli | Bockman | Male | 81 | 1980-12-16 | Administrative Officer | Financial Services | Mass Customer | N | <script>alert('hi')</script> | Yes | 16.0 |
2 | 3 | Arlin | Dearle | Male | 61 | 1954-01-20 | Recruiting Manager | Property | Mass Customer | N | 2018-02-01 00:00:00 | Yes | 15.0 |
3 | 4 | Talbot | NaN | Male | 33 | 1961-10-03 | NaN | IT | Mass Customer | N | () { _; } >_[(())] { touch /tmp/blns.shellsh... | No | 7.0 |
4 | 5 | Sheila-kathryn | Calton | Female | 56 | 1977-05-13 | Senior Editor | NaN | Affluent Customer | N | NIL | Yes | 8.0 |
# Information of columns and data-types of Customer Demographics Data.
cust_demo.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 13 columns):
customer_id                            4000 non-null int64
first_name                             4000 non-null object
last_name                              3875 non-null object
gender                                 4000 non-null object
past_3_years_bike_related_purchases    4000 non-null int64
DOB                                    3913 non-null datetime64[ns]
job_title                              3494 non-null object
job_industry_category                  3344 non-null object
wealth_segment                         4000 non-null object
deceased_indicator                     4000 non-null object
default                                3698 non-null object
owns_car                               4000 non-null object
tenure                                 3913 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(2), object(9)
memory usage: 406.3+ KB
The data types of the columns look fine. However, default is an irrelevant column and should be dropped from the dataset. Let's check the data quality and apply data cleaning wherever applicable before performing any analysis.
print("Total records (rows) in the dataset : {}".format(cust_demo.shape[0]))
print("Total columns (features) in the dataset : {}".format(cust_demo.shape[1]))
Total records (rows) in the dataset : 4000
Total columns (features) in the dataset : 13
# select numeric columns
df_numeric = cust_demo.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
print("The numeric columns are : {}".format(numeric_cols))
# select non-numeric columns
df_non_numeric = cust_demo.select_dtypes(exclude=[np.number])
non_numeric_cols = df_non_numeric.columns.values
print("The non-numeric columns are : {}".format(non_numeric_cols))
The numeric columns are : ['customer_id' 'past_3_years_bike_related_purchases' 'tenure']
The non-numeric columns are : ['first_name' 'last_name' 'gender' 'DOB' 'job_title' 'job_industry_category' 'wealth_segment' 'deceased_indicator' 'default' 'owns_car']
default is an irrelevant column, so it should be dropped.
# Dropping the default column
cust_demo.drop(columns=['default'], inplace=True)
Checking for missing values in the dataset. If a feature has missing values then, depending on the situation, the feature may either be dropped (when a large share of its data is missing) or have an appropriate value imputed in place of the missing entries.
# Total number of missing values
cust_demo.isnull().sum()
customer_id                              0
first_name                               0
last_name                              125
gender                                   0
past_3_years_bike_related_purchases      0
DOB                                     87
job_title                              506
job_industry_category                  656
wealth_segment                           0
deceased_indicator                       0
owns_car                                 0
tenure                                  87
dtype: int64
# Percentage of missing values
cust_demo.isnull().mean()*100
customer_id                             0.000
first_name                              0.000
last_name                               3.125
gender                                  0.000
past_3_years_bike_related_purchases     0.000
DOB                                     2.175
job_title                              12.650
job_industry_category                  16.400
wealth_segment                          0.000
deceased_indicator                      0.000
owns_car                                0.000
tenure                                  2.175
dtype: float64
Here it is observed that the columns last_name, DOB, job_title, job_industry_category and tenure have missing values.
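The two checks above (counts and percentages) can be combined into one table that lists only the affected columns. The following is a minimal sketch on a made-up toy frame; the helper name `missing_summary` and the toy values are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing counts and percentages for columns that have gaps."""
    summary = pd.DataFrame({
        "missing_count": df.isnull().sum(),
        "missing_pct": df.isnull().mean() * 100,
    })
    # Keep only the columns that actually contain missing values
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_pct", ascending=False)


# Toy frame standing in for cust_demo (values are made up)
toy = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "last_name": ["Medendorp", None, "Dearle", None],
    "tenure": [11.0, 16.0, np.nan, 7.0],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(cust_demo)` on the real data should give the same numbers as the two cells above, restricted to the columns with at least one missing value.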
# Checking for the presence of first name and customer id in records where last name is missing.
cust_demo[cust_demo['last_name'].isnull()][['first_name', 'customer_id']].isnull().sum()
first_name     0
customer_id    0
dtype: int64
Since all customers have a customer_id and a first name, every customer is identifiable. Hence it is acceptable for a record to have no last name. We fill null last names with "None".
# Fetching records where last name is missing.
cust_demo[cust_demo['last_name'].isnull()]
| | customer_id | first_name | last_name | gender | past_3_years_bike_related_purchases | DOB | job_title | job_industry_category | wealth_segment | deceased_indicator | owns_car | tenure |
---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 4 | Talbot | NaN | Male | 33 | 1961-10-03 | NaN | IT | Mass Customer | N | No | 7.0 |
66 | 67 | Vernon | NaN | Male | 67 | 1960-06-14 | Web Developer II | Retail | Mass Customer | N | No | 18.0 |
105 | 106 | Glyn | NaN | Male | 54 | 1966-07-03 | Software Test Engineer III | Health | High Net Worth | N | Yes | 18.0 |
138 | 139 | Gar | NaN | Male | 1 | 1964-07-28 | Operator | Telecommunications | Affluent Customer | N | No | 4.0 |
196 | 197 | Avis | NaN | Female | 32 | 1977-01-27 | NaN | NaN | High Net Worth | N | No | 5.0 |
210 | 211 | Beitris | NaN | Female | 6 | 1974-03-04 | VP Marketing | Manufacturing | Mass Customer | N | Yes | 5.0 |
249 | 250 | Kristofer | NaN | Male | 53 | 1988-04-15 | Legal Assistant | Health | Mass Customer | N | Yes | 13.0 |
250 | 251 | Mala | NaN | Female | 88 | 1977-12-24 | VP Sales | Financial Services | Affluent Customer | N | Yes | 10.0 |
256 | 257 | Marissa | NaN | Female | 70 | 1966-02-08 | Sales Associate | Manufacturing | Affluent Customer | N | Yes | 19.0 |
274 | 275 | Dud | NaN | Male | 7 | 1955-07-27 | VP Sales | Health | High Net Worth | N | No | 13.0 |
355 | 356 | Nichole | NaN | Female | 10 | 1975-03-30 | Librarian | Entertainment | High Net Worth | N | No | 5.0 |
459 | 460 | Illa | NaN | Female | 0 | 1986-01-23 | Electrical Engineer | Manufacturing | Affluent Customer | N | Yes | 16.0 |
474 | 475 | Vernor | NaN | Male | 0 | 1996-11-14 | Nuclear Power Engineer | Manufacturing | Affluent Customer | N | No | 1.0 |
493 | 494 | Gaby | NaN | Male | 33 | 1975-06-02 | Design Engineer | Manufacturing | Mass Customer | N | No | 9.0 |
513 | 514 | Trent | NaN | Male | 9 | 1996-06-20 | Associate Professor | Financial Services | Mass Customer | N | Yes | 4.0 |
525 | 526 | Ardelle | NaN | U | 9 | NaT | Social Worker | Health | Mass Customer | N | Yes | NaN |
656 | 657 | Hoyt | NaN | Male | 66 | 1993-02-18 | Safety Technician II | Manufacturing | Affluent Customer | N | No | 10.0 |
659 | 660 | Stormi | NaN | Female | 82 | 1995-07-29 | Geological Engineer | Manufacturing | High Net Worth | N | No | 6.0 |
675 | 676 | Curtis | NaN | Male | 51 | 1968-05-19 | Senior Editor | NaN | High Net Worth | N | Yes | 14.0 |
683 | 684 | Malvin | NaN | Male | 88 | 1987-07-03 | Desktop Support Technician | Financial Services | Mass Customer | N | No | 14.0 |
689 | 690 | Lindsey | NaN | Male | 95 | 1987-03-27 | Assistant Professor | NaN | Affluent Customer | N | Yes | 17.0 |
702 | 703 | Ethelda | NaN | Female | 66 | 1966-10-31 | NaN | Property | Mass Customer | N | No | 15.0 |
743 | 744 | Heinrik | NaN | Male | 54 | 1977-08-30 | Graphic Designer | Manufacturing | Affluent Customer | N | Yes | 14.0 |
779 | 780 | Kim | NaN | Female | 24 | 1973-10-12 | Professor | Financial Services | Mass Customer | N | No | 20.0 |
789 | 790 | Yvonne | NaN | Female | 22 | 1968-03-24 | Senior Editor | NaN | Affluent Customer | N | No | 15.0 |
856 | 857 | Theo | NaN | Female | 15 | 1964-08-14 | General Manager | NaN | High Net Worth | N | No | 4.0 |
859 | 860 | Ida | NaN | Female | 80 | 1980-08-12 | NaN | NaN | High Net Worth | N | Yes | 7.0 |
915 | 916 | Joycelin | NaN | Female | 18 | 1991-06-18 | Recruiter | NaN | Affluent Customer | N | No | 8.0 |
926 | 927 | Jarret | NaN | Male | 25 | 1966-02-19 | Cost Accountant | Financial Services | Mass Customer | N | Yes | 18.0 |
937 | 938 | Corabelle | NaN | Female | 18 | 1996-04-06 | Technical Writer | Retail | Mass Customer | N | No | 7.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3179 | 3180 | Gage | NaN | Male | 96 | 1974-06-14 | Business Systems Development Analyst | IT | Mass Customer | N | Yes | 19.0 |
3187 | 3188 | Boyd | NaN | Male | 94 | 1999-07-07 | Actuary | Financial Services | Mass Customer | N | No | 1.0 |
3199 | 3200 | Marna | NaN | Female | 51 | 1995-11-03 | Environmental Tech | Manufacturing | Mass Customer | N | No | 1.0 |
3258 | 3259 | Rabi | NaN | Male | 74 | 1953-11-04 | Quality Control Specialist | NaN | High Net Worth | N | No | 10.0 |
3318 | 3319 | Erda | NaN | Female | 67 | 1966-04-04 | NaN | Financial Services | Affluent Customer | N | Yes | 19.0 |
3320 | 3321 | Ives | NaN | Male | 38 | 1980-05-10 | Software Test Engineer I | NaN | High Net Worth | N | Yes | 14.0 |
3323 | 3324 | Sholom | NaN | Male | 32 | 1973-07-11 | Research Nurse | Health | Mass Customer | N | Yes | 10.0 |
3324 | 3325 | Sylas | NaN | Male | 80 | 1996-10-08 | Database Administrator IV | Manufacturing | High Net Worth | N | No | 1.0 |
3346 | 3347 | Nichols | NaN | Male | 99 | 1985-11-08 | Computer Systems Analyst II | Entertainment | High Net Worth | N | Yes | 18.0 |
3363 | 3364 | Trueman | NaN | Male | 77 | 1993-08-19 | Engineer IV | Manufacturing | Mass Customer | N | Yes | 3.0 |
3384 | 3385 | Ronda | NaN | Female | 23 | 1975-02-10 | Systems Administrator III | Argiculture | Mass Customer | N | No | 9.0 |
3396 | 3397 | Melisande | NaN | Female | 70 | 1985-08-19 | Product Engineer | IT | Mass Customer | N | No | 11.0 |
3400 | 3401 | Cristie | NaN | Female | 92 | 1993-07-28 | Tax Accountant | Telecommunications | Mass Customer | N | Yes | 4.0 |
3442 | 3443 | Fran | NaN | Male | 11 | 1995-04-12 | Technical Writer | NaN | Mass Customer | N | Yes | 5.0 |
3444 | 3445 | Craggy | NaN | Male | 62 | 1966-06-23 | Database Administrator I | Financial Services | Affluent Customer | N | Yes | 11.0 |
3446 | 3447 | Linell | NaN | Female | 43 | 1977-11-23 | NaN | Financial Services | High Net Worth | N | No | 17.0 |
3479 | 3480 | Jarib | NaN | Male | 30 | 1959-06-24 | NaN | NaN | Mass Customer | N | No | 20.0 |
3554 | 3555 | Latashia | NaN | Female | 96 | 1976-02-26 | Programmer Analyst II | Manufacturing | Mass Customer | N | No | 21.0 |
3596 | 3597 | Giorgi | NaN | Male | 71 | 1954-06-16 | Analog Circuit Design manager | Property | Affluent Customer | N | Yes | 16.0 |
3623 | 3624 | Lenka | NaN | Female | 54 | 1984-10-16 | Cost Accountant | Financial Services | Mass Customer | N | Yes | 7.0 |
3634 | 3635 | Elset | NaN | Female | 51 | 1977-07-06 | VP Marketing | Retail | High Net Worth | N | No | 9.0 |
3650 | 3651 | Baxie | NaN | Male | 91 | 1999-11-15 | Human Resources Assistant I | Manufacturing | Mass Customer | N | No | 2.0 |
3717 | 3718 | Damiano | NaN | U | 22 | NaT | Geologist IV | IT | Mass Customer | N | Yes | NaN |
3755 | 3756 | Barry | NaN | Male | 22 | 1977-07-08 | NaN | NaN | Affluent Customer | N | No | 10.0 |
3816 | 3817 | Tuckie | NaN | Male | 65 | 1957-05-02 | VP Product Management | Manufacturing | High Net Worth | N | No | 13.0 |
3884 | 3885 | Asher | NaN | Male | 55 | 1978-06-17 | Actuary | Financial Services | Mass Customer | N | Yes | 8.0 |
3915 | 3916 | Myrtia | NaN | Female | 31 | 1958-10-17 | NaN | Retail | Affluent Customer | N | Yes | 17.0 |
3926 | 3927 | Conway | NaN | Male | 29 | 1978-01-07 | Electrical Engineer | Manufacturing | Mass Customer | N | Yes | 7.0 |
3961 | 3962 | Benoit | NaN | Male | 17 | 1977-10-06 | Project Manager | Argiculture | High Net Worth | N | Yes | 14.0 |
3998 | 3999 | Patrizius | NaN | Male | 11 | 1973-10-24 | NaN | Manufacturing | Affluent Customer | N | Yes | 10.0 |
125 rows × 12 columns
cust_demo['last_name'] = cust_demo['last_name'].fillna('None')
cust_demo['last_name'].isnull().sum()
0
The last_name column now has no missing values.
cust_demo[cust_demo['DOB'].isnull()]
| | customer_id | first_name | last_name | gender | past_3_years_bike_related_purchases | DOB | job_title | job_industry_category | wealth_segment | deceased_indicator | owns_car | tenure |
---|---|---|---|---|---|---|---|---|---|---|---|---|
143 | 144 | Jory | Barrabeale | U | 71 | NaT | Environmental Tech | IT | Mass Customer | N | No | NaN |
167 | 168 | Reggie | Broggetti | U | 8 | NaT | General Manager | IT | Affluent Customer | N | Yes | NaN |
266 | 267 | Edgar | Buckler | U | 53 | NaT | NaN | IT | High Net Worth | N | No | NaN |
289 | 290 | Giorgio | Kevane | U | 42 | NaT | Senior Sales Associate | IT | Mass Customer | N | No | NaN |
450 | 451 | Marlow | Flowerdew | U | 37 | NaT | Quality Control Specialist | IT | High Net Worth | N | No | NaN |
452 | 453 | Cornelius | Yarmouth | U | 81 | NaT | Assistant Professor | IT | High Net Worth | N | No | NaN |
453 | 454 | Eugenie | Domenc | U | 58 | NaT | Research Nurse | Health | Affluent Customer | N | Yes | NaN |
479 | 480 | Darelle | Ive | U | 67 | NaT | Registered Nurse | Health | Mass Customer | N | Yes | NaN |
512 | 513 | Kienan | Soar | U | 30 | NaT | Tax Accountant | IT | Mass Customer | N | No | NaN |
525 | 526 | Ardelle | None | U | 9 | NaT | Social Worker | Health | Mass Customer | N | Yes | NaN |
547 | 548 | Georgie | Cudbertson | U | 84 | NaT | NaN | IT | High Net Worth | N | Yes | NaN |
581 | 582 | Rhoda | McKeown | U | 21 | NaT | Staff Scientist | IT | Affluent Customer | N | No | NaN |
598 | 599 | Ernestus | Cruden | U | 48 | NaT | Senior Financial Analyst | Financial Services | Mass Customer | N | Yes | NaN |
679 | 680 | Gay | Pickersgill | U | 22 | NaT | NaN | IT | High Net Worth | N | Yes | NaN |
684 | 685 | Booth | Birkin | U | 28 | NaT | Senior Developer | IT | Mass Customer | N | No | NaN |
798 | 799 | Harland | Spilisy | U | 39 | NaT | Programmer I | IT | Mass Customer | N | Yes | NaN |
838 | 839 | Charis | Greaves | U | 14 | NaT | Structural Analysis Engineer | IT | Mass Customer | N | Yes | NaN |
882 | 883 | Lolita | Bennie | U | 73 | NaT | Recruiter | IT | Mass Customer | N | Yes | NaN |
891 | 892 | Conroy | Healy | U | 22 | NaT | Office Assistant II | IT | Mass Customer | N | Yes | NaN |
949 | 950 | Bret | Ivakhnov | U | 24 | NaT | Recruiter | IT | High Net Worth | N | Yes | NaN |
974 | 975 | Goldarina | Rzehorz | U | 26 | NaT | Automation Specialist IV | IT | Mass Customer | N | No | NaN |
982 | 983 | Shaylyn | Riggs | U | 49 | NaT | NaN | IT | Affluent Customer | N | No | NaN |
995 | 996 | Aura | Bemlott | U | 67 | NaT | Assistant Manager | IT | Mass Customer | N | Yes | NaN |
1037 | 1038 | Fraser | Acome | U | 57 | NaT | Engineer I | Manufacturing | Mass Customer | N | Yes | NaN |
1043 | 1044 | Frederico | Whilder | U | 4 | NaT | Food Chemist | Health | High Net Worth | N | No | NaN |
1081 | 1082 | Guinevere | Kelby | U | 90 | NaT | Financial Analyst | Financial Services | Mass Customer | N | Yes | NaN |
1173 | 1174 | Shellysheldon | Gooderridge | U | 9 | NaT | Executive Secretary | IT | Mass Customer | N | No | NaN |
1209 | 1210 | Shandie | Sprigg | U | 81 | NaT | Programmer II | IT | Mass Customer | N | No | NaN |
1243 | 1244 | Glenn | Tinham | U | 80 | NaT | Financial Analyst | Financial Services | Mass Customer | N | Yes | NaN |
1350 | 1351 | Lorettalorna | None | U | 32 | NaT | Office Assistant IV | IT | High Net Worth | N | No | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
2695 | 2696 | Isabelle | Bursnoll | U | 42 | NaT | Social Worker | Health | Mass Customer | N | Yes | NaN |
2696 | 2697 | Klarika | Yerby | U | 70 | NaT | Legal Assistant | IT | High Net Worth | N | No | NaN |
2853 | 2854 | Vikky | Dyde | U | 49 | NaT | Project Manager | IT | High Net Worth | N | Yes | NaN |
2919 | 2920 | Casar | Ritchley | U | 0 | NaT | Business Systems Development Analyst | IT | Mass Customer | N | Yes | NaN |
2962 | 2963 | Christin | Fricke | U | 17 | NaT | Safety Technician II | IT | Affluent Customer | N | Yes | NaN |
2998 | 2999 | Rinaldo | Diggin | U | 28 | NaT | Business Systems Development Analyst | IT | Affluent Customer | N | Yes | NaN |
3011 | 3012 | Devland | Probart | U | 81 | NaT | Technical Writer | IT | Mass Customer | N | Yes | NaN |
3085 | 3086 | Pieter | Gadesby | U | 18 | NaT | Biostatistician I | IT | High Net Worth | N | No | NaN |
3150 | 3151 | Thorn | Choffin | U | 20 | NaT | Senior Developer | IT | Affluent Customer | N | Yes | NaN |
3221 | 3222 | Caralie | Sellors | U | 40 | NaT | Senior Editor | IT | Affluent Customer | N | No | NaN |
3222 | 3223 | Tiffi | Wortt | U | 44 | NaT | Database Administrator III | IT | Mass Customer | N | Yes | NaN |
3254 | 3255 | Sutherlan | Truin | U | 47 | NaT | Engineer IV | IT | High Net Worth | N | No | NaN |
3287 | 3288 | Fair | Dewen | U | 47 | NaT | Engineer III | IT | High Net Worth | N | No | NaN |
3297 | 3298 | Christine | Baignard | U | 1 | NaT | VP Quality Control | IT | Affluent Customer | N | Yes | NaN |
3311 | 3312 | Franky | Nanninini | U | 49 | NaT | Administrative Officer | IT | High Net Worth | N | No | NaN |
3321 | 3322 | Hew | Sworder | U | 24 | NaT | Financial Analyst | Financial Services | Affluent Customer | N | Yes | NaN |
3342 | 3343 | Cristabel | Bim | U | 3 | NaT | Recruiter | IT | Mass Customer | N | Yes | NaN |
3364 | 3365 | Karlens | Chaffyn | U | 29 | NaT | Engineer III | IT | Mass Customer | N | No | NaN |
3472 | 3473 | Sanderson | Alloway | U | 34 | NaT | Analog Circuit Design manager | IT | Mass Customer | N | No | NaN |
3509 | 3510 | Jemima | Izaac | U | 48 | NaT | Safety Technician II | IT | Affluent Customer | N | Yes | NaN |
3512 | 3513 | Enriqueta | Waterhowse | U | 80 | NaT | Internal Auditor | IT | Affluent Customer | N | Yes | NaN |
3564 | 3565 | Charyl | Pottiphar | U | 14 | NaT | Structural Engineer | IT | High Net Worth | N | Yes | NaN |
3653 | 3654 | Kenyon | Paddefield | U | 78 | NaT | Electrical Engineer | Manufacturing | Mass Customer | N | No | NaN |
3717 | 3718 | Damiano | None | U | 22 | NaT | Geologist IV | IT | Mass Customer | N | Yes | NaN |
3726 | 3727 | Eba | Youle | U | 65 | NaT | Assistant Professor | IT | Mass Customer | N | No | NaN |
3778 | 3779 | Ulick | Daspar | U | 68 | NaT | NaN | IT | Affluent Customer | N | No | NaN |
3882 | 3883 | Nissa | Conrad | U | 35 | NaT | Legal Assistant | IT | Mass Customer | N | No | NaN |
3930 | 3931 | Kylie | Epine | U | 19 | NaT | NaN | IT | High Net Worth | N | Yes | NaN |
3934 | 3935 | Teodor | Alfonsini | U | 72 | NaT | NaN | IT | High Net Worth | N | Yes | NaN |
3997 | 3998 | Sarene | Woolley | U | 60 | NaT | Assistant Manager | IT | High Net Worth | N | No | NaN |
87 rows × 12 columns
round(cust_demo['DOB'].isnull().mean()*100)
2.0
Since only about 2% of the records (less than 5%) have a null date of birth, we can remove the records where the date of birth is null.
dob_index_drop = cust_demo[cust_demo['DOB'].isnull()].index
dob_index_drop
Int64Index([ 143, 167, 266, 289, 450, 452, 453, 479, 512, 525, 547, 581, 598, 679, 684, 798, 838, 882, 891, 949, 974, 982, 995, 1037, 1043, 1081, 1173, 1209, 1243, 1350, 1476, 1508, 1582, 1627, 1682, 1739, 1772, 1779, 1805, 1917, 1937, 1989, 1999, 2020, 2068, 2164, 2204, 2251, 2294, 2334, 2340, 2413, 2425, 2468, 2539, 2641, 2646, 2695, 2696, 2853, 2919, 2962, 2998, 3011, 3085, 3150, 3221, 3222, 3254, 3287, 3297, 3311, 3321, 3342, 3364, 3472, 3509, 3512, 3564, 3653, 3717, 3726, 3778, 3882, 3930, 3934, 3997], dtype='int64')
cust_demo.drop(index=dob_index_drop, inplace=True)
cust_demo['DOB'].isnull().sum()
0
The DOB column now has no missing values.
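The index-based drop used for DOB can also be written with `DataFrame.dropna` and its `subset` parameter, which removes rows that are null in the listed columns only. This is a sketch on a made-up toy frame, shown as an alternative formulation rather than as the notebook's own method.

```python
import pandas as pd

# Toy frame standing in for cust_demo (values are made up)
toy = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "DOB": pd.to_datetime(["1953-10-12", None, "1954-01-20"]),
})

# Drop only the rows whose DOB is missing; other columns are not inspected
cleaned = toy.dropna(subset=["DOB"])
print(cleaned["DOB"].isnull().sum())
```

Unlike the in-place drop, this returns a new frame, so the original stays available for later comparison.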
# Function to calculate the age as of today based on the DOB of the customer.
def age(born):
    today = date.today()
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))

cust_demo['Age'] = cust_demo['DOB'].apply(age)
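Because `age` uses `date.today()`, the Age column changes depending on when the notebook is run. A minimal sketch of a reproducible variant follows; the name `age_as_of` and the reference date are assumptions chosen only for illustration.

```python
from datetime import date

import pandas as pd


def age_as_of(born, as_of):
    """Age in completed years on the date `as_of`."""
    # True (counted as 1) when the birthday has not yet occurred in as_of's year
    birthday_pending = (as_of.month, as_of.day) < (born.month, born.day)
    return as_of.year - born.year - birthday_pending


# Hypothetical fixed reference date, not taken from the original notebook
reference_date = date(2021, 5, 15)

dob = pd.Series(pd.to_datetime(["1953-10-12", "1977-05-13"]))
ages = dob.apply(lambda d: age_as_of(d, reference_date))
print(ages.tolist())
```

Pinning the reference date keeps the age statistics and any age-based filters stable across reruns.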
# Viz to find out the Age Distribution
plt.figure(figsize=(20,8))
sns.distplot(cust_demo['Age'], kde=False, bins=50)
<matplotlib.axes._subplots.AxesSubplot at 0x1ef4c9deb00>
Statistics of the Age column
cust_demo['Age'].describe()
count    3913.000000
mean       43.346026
std        12.803129
min        19.000000
25%        34.000000
50%        43.000000
75%        53.000000
max       177.000000
Name: Age, dtype: float64
Here we find there is only 1 customer with an age of 177. Clearly this is an outlier since the 75th percentile of Age is 53.
cust_demo[cust_demo['Age'] > 100]
| | customer_id | first_name | last_name | gender | past_3_years_bike_related_purchases | DOB | job_title | job_industry_category | wealth_segment | deceased_indicator | owns_car | tenure | Age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
33 | 34 | Jephthah | Bachmann | U | 59 | 1843-12-21 | Legal Assistant | IT | Affluent Customer | N | No | 20.0 | 177 |
This customer has a recorded age of 177 (DOB 1843-12-21), which is an outlier and almost certainly a data entry error. Hence we remove this record.
age_index_drop = cust_demo[cust_demo['Age']>100].index
cust_demo.drop(index=age_index_drop, inplace=True)
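The same removal can be expressed as a boolean mask that keeps plausible ages. The sketch below also computes an upper fence with Tukey's 1.5 × IQR rule from the quartiles reported by `describe()` (34 and 53). The IQR rule is not used in the original notebook and is shown only for comparison; the toy frame is made up.

```python
import pandas as pd

MAX_PLAUSIBLE_AGE = 100  # same cut-off as the filter in the notebook

# Toy frame standing in for cust_demo (values are made up)
toy = pd.DataFrame({"customer_id": [1, 34, 5], "Age": [67, 177, 44]})

# Keep only rows with a plausible age instead of dropping by index
plausible = toy[toy["Age"] <= MAX_PLAUSIBLE_AGE]

# Tukey's rule with the quartiles from the describe() output above
q1, q3 = 34.0, 53.0
upper_fence = q3 + 1.5 * (q3 - q1)
print(plausible["customer_id"].tolist(), upper_fence)
```

The fence is stricter than the fixed cut-off of 100 and would also flag ages that are perfectly valid, which is one reason to prefer a domain-based threshold here.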
Wherever the date of birth was null, the tenure was also null. Hence removing the records with a null DOB from the dataframe also removed the null tenures.
cust_demo['tenure'].isnull().sum()
0
There are no missing values in the tenure column.
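The claim that DOB and tenure are missing in the same rows can be checked directly by comparing the two null masks. This is a sketch on a made-up toy frame; on the real data it would need to run on the frame as loaded, before the null DOB rows are dropped.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the unmodified cust_demo (values are made up)
toy = pd.DataFrame({
    "DOB": pd.to_datetime(["1953-10-12", None, "1954-01-20", None]),
    "tenure": [11.0, np.nan, 15.0, np.nan],
})

dob_missing = toy["DOB"].isnull()
tenure_missing = toy["tenure"].isnull()

# True only if every row is missing in both columns or in neither
same_rows = bool((dob_missing == tenure_missing).all())
print(same_rows)
```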
# Fetching records where Job Title is missing.
cust_demo[cust_demo['job_title'].isnull()]
| | customer_id | first_name | last_name | gender | past_3_years_bike_related_purchases | DOB | job_title | job_industry_category | wealth_segment | deceased_indicator | owns_car | tenure | Age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | 4 | Talbot | None | Male | 33 | 1961-10-03 | NaN | IT | Mass Customer | N | No | 7.0 | 59 |
5 | 6 | Curr | Duckhouse | Male | 35 | 1966-09-16 | NaN | Retail | High Net Worth | N | Yes | 13.0 | 54 |
6 | 7 | Fina | Merali | Female | 6 | 1976-02-23 | NaN | Financial Services | Affluent Customer | N | Yes | 11.0 | 45 |
10 | 11 | Uriah | Bisatt | Male | 99 | 1954-04-30 | NaN | Property | Mass Customer | N | No | 9.0 | 67 |
21 | 22 | Deeanne | Durtnell | Female | 79 | 1962-12-10 | NaN | IT | Mass Customer | N | No | 11.0 | 58 |
22 | 23 | Olav | Polak | Male | 43 | 1995-02-10 | NaN | NaN | High Net Worth | N | Yes | 1.0 | 26 |
29 | 30 | Darrick | Helleckas | Male | 18 | 1961-10-18 | NaN | IT | Affluent Customer | N | Yes | 6.0 | 59 |
45 | 46 | Kaila | Allin | Female | 98 | 1972-02-26 | NaN | NaN | Affluent Customer | N | Yes | 15.0 | 49 |
51 | 52 | Curran | Bentson | Male | 57 | 1988-06-22 | NaN | Financial Services | Mass Customer | N | Yes | 13.0 | 32 |
59 | 60 | Nadiya | Champerlen | Female | 18 | 1970-02-04 | NaN | Manufacturing | Mass Customer | N | No | 10.0 | 51 |
61 | 62 | Sorcha | Roggers | Female | 38 | 1979-07-06 | NaN | IT | Mass Customer | N | Yes | 22.0 | 41 |
73 | 74 | Pansy | Kiddie | Female | 94 | 1969-06-19 | NaN | NaN | Mass Customer | N | Yes | 6.0 | 51 |
80 | 81 | Bee | Blazewicz | Female | 58 | 1986-09-04 | NaN | Health | High Net Worth | N | No | 13.0 | 34 |
107 | 108 | Kayle | Mingaud | Female | 4 | 1994-03-14 | NaN | NaN | High Net Worth | N | No | 3.0 | 27 |
109 | 110 | Sascha | St. Quintin | Male | 23 | 2000-07-31 | NaN | Financial Services | Affluent Customer | N | No | 1.0 | 20 |
160 | 161 | Tadd | Bloss | Male | 49 | 1976-01-21 | NaN | NaN | Mass Customer | N | No | 16.0 | 45 |
166 | 167 | Nathalie | Tideswell | Female | 95 | 1969-10-27 | NaN | Health | High Net Worth | N | Yes | 17.0 | 51 |
177 | 178 | Matthieu | Bertelmot | Male | 2 | 1967-04-03 | NaN | NaN | Affluent Customer | N | No | 8.0 | 54 |
184 | 185 | Crosby | Walcot | Male | 80 | 1979-12-13 | NaN | Property | Mass Customer | N | Yes | 13.0 | 41 |
196 | 197 | Avis | None | Female | 32 | 1977-01-27 | NaN | NaN | High Net Worth | N | No | 5.0 | 44 |
206 | 207 | Adena | Whyman | Female | 9 | 1994-08-10 | NaN | NaN | Mass Customer | N | No | 7.0 | 26 |
216 | 217 | Jeralee | Quartly | Female | 63 | 1979-12-09 | NaN | Manufacturing | High Net Worth | N | No | 16.0 | 41 |
228 | 229 | Vaughn | Lambis | Male | 30 | 1966-03-06 | NaN | Property | High Net Worth | N | No | 19.0 | 55 |
243 | 244 | Germayne | Sperry | Male | 57 | 1974-11-25 | NaN | Retail | Affluent Customer | N | No | 8.0 | 46 |
261 | 262 | Cordie | Petrelli | Male | 97 | 1977-12-23 | NaN | Health | High Net Worth | N | Yes | 10.0 | 43 |
275 | 276 | Goldi | Dwine | Female | 47 | 1990-03-25 | NaN | Financial Services | Mass Customer | N | No | 22.0 | 31 |
287 | 288 | Ebenezer | Seedman | Male | 71 | 1985-09-08 | NaN | Manufacturing | High Net Worth | N | No | 9.0 | 35 |
295 | 296 | Marshal | Rathbone | Male | 34 | 1972-06-19 | NaN | Health | High Net Worth | N | Yes | 17.0 | 48 |
301 | 302 | Laurice | Colgrave | Female | 32 | 1977-03-27 | NaN | Health | Mass Customer | N | No | 13.0 | 44 |
318 | 319 | Madelle | Matteris | Female | 32 | 1971-10-11 | NaN | Retail | Mass Customer | N | Yes | 14.0 | 49 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3797 | 3798 | Yorker | Dennison | Male | 13 | 1968-02-22 | NaN | Manufacturing | Mass Customer | N | Yes | 17.0 | 53 |
3803 | 3804 | Andria | Keays | Female | 23 | 1986-08-21 | NaN | Manufacturing | Mass Customer | N | Yes | 4.0 | 34 |
3805 | 3806 | Ado | Gailor | Male | 1 | 1954-02-08 | NaN | Property | Mass Customer | N | No | 7.0 | 67 |
3810 | 3811 | Etta | Leele | Female | 60 | 1997-03-19 | NaN | Financial Services | High Net Worth | N | No | 4.0 | 24 |
3821 | 3822 | Conny | Speechley | Male | 37 | 1959-03-09 | NaN | Manufacturing | High Net Worth | N | Yes | 18.0 | 62 |
3823 | 3824 | Giffard | Stollman | Male | 33 | 1994-11-21 | NaN | Property | Mass Customer | N | No | 3.0 | 26 |
3825 | 3826 | Marlow | Balffye | Male | 33 | 1978-09-25 | NaN | Health | Mass Customer | N | No | 7.0 | 42 |
3826 | 3827 | Cherida | Whyffen | Female | 10 | 1976-09-05 | NaN | Retail | Affluent Customer | N | No | 8.0 | 44 |
3839 | 3840 | Marc | Torrans | Male | 27 | 1962-09-30 | NaN | NaN | High Net Worth | N | No | 5.0 | 58 |
3843 | 3844 | Clotilda | Oret | Female | 87 | 1987-12-06 | NaN | Manufacturing | Affluent Customer | N | No | 15.0 | 33 |
3864 | 3865 | Urbanus | Fuxman | Male | 49 | 1978-03-15 | NaN | Manufacturing | Mass Customer | N | Yes | 11.0 | 43 |
3880 | 3881 | Olivie | Nazair | Female | 50 | 1971-01-12 | NaN | Financial Services | Affluent Customer | N | No | 18.0 | 50 |
3892 | 3893 | Hadria | Moles | Female | 7 | 1996-11-18 | NaN | NaN | High Net Worth | N | Yes | 4.0 | 24 |
3908 | 3909 | Micheil | McGeorge | Male | 1 | 1987-10-04 | NaN | Manufacturing | High Net Worth | N | Yes | 18.0 | 33 |
3915 | 3916 | Myrtia | None | Female | 31 | 1958-10-17 | NaN | Retail | Affluent Customer | N | Yes | 17.0 | 62 |
3927 | 3928 | Kristin | Way | Female | 71 | 1982-04-16 | NaN | Property | Affluent Customer | N | Yes | 6.0 | 39 |
3928 | 3929 | Jacqui | Fortnam | Female | 50 | 1989-10-18 | NaN | NaN | Affluent Customer | N | Yes | 10.0 | 31 |
3929 | 3930 | Blancha | Baldi | Female | 43 | 1988-01-06 | NaN | Financial Services | High Net Worth | N | No | 22.0 | 33 |
3932 | 3933 | Chiarra | Cops | Female | 65 | 1983-07-05 | NaN | NaN | High Net Worth | N | Yes | 10.0 | 37 |
3938 | 3939 | Georges | Dumbelton | Male | 67 | 1981-06-25 | NaN | Manufacturing | Affluent Customer | N | No | 15.0 | 39 |
3944 | 3945 | Lazarus | Donaghy | Male | 77 | 1994-10-21 | NaN | Retail | High Net Worth | N | No | 7.0 | 26 |
3945 | 3946 | Wylie | FitzGilbert | Male | 85 | 1960-06-23 | NaN | Retail | High Net Worth | N | Yes | 10.0 | 60 |
3951 | 3952 | Di | Borsnall | Female | 96 | 1968-05-09 | NaN | Manufacturing | Affluent Customer | N | No | 10.0 | 53 |
3958 | 3959 | Dannie | Sowray | Male | 76 | 1992-12-07 | NaN | NaN | Mass Customer | N | No | 3.0 | 28 |
3959 | 3960 | Hobart | Burgan | Male | 6 | 2000-03-16 | NaN | Property | Mass Customer | N | No | 1.0 | 21 |
3967 | 3968 | Alexandra | Kroch | Female | 99 | 1977-12-22 | NaN | Property | High Net Worth | N | No | 22.0 | 43 |
3971 | 3972 | Maribelle | Schaffel | Female | 6 | 1979-03-28 | NaN | Retail | Mass Customer | N | No | 8.0 | 42 |
3978 | 3979 | Kleon | Adam | Male | 67 | 1974-07-13 | NaN | Financial Services | Mass Customer | N | Yes | 18.0 | 46 |
3986 | 3987 | Beckie | Wakeham | Female | 18 | 1964-05-29 | NaN | Argiculture | Mass Customer | N | No | 7.0 | 56 |
3998 | 3999 | Patrizius | None | Male | 11 | 1973-10-24 | NaN | Manufacturing | Affluent Customer | N | Yes | 10.0 | 47 |
497 rows × 13 columns
Since about 13% of the job titles are missing, we replace the null values with "Missing" rather than dropping the records.
cust_demo['job_title'] = cust_demo['job_title'].fillna('Missing')
cust_demo['job_title'].isnull().sum()
0
The job_title column now has no missing values.
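Filling with a sentinel label keeps the affected rows and makes the gap visible as its own category. A minimal sketch on made-up toy data, using assignment instead of `inplace=True` on a selected column:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for cust_demo (values are made up)
toy = pd.DataFrame({
    "job_title": ["Senior Editor", np.nan, "Recruiter", np.nan],
})

# Assign the filled column back so the change is applied to the frame itself
toy["job_title"] = toy["job_title"].fillna("Missing")

counts = toy["job_title"].value_counts()
print(counts)
```

The "Missing" label then shows up in `value_counts()` and in any group-by, so its share can be tracked alongside the real job titles.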
cust_demo[cust_demo['job_industry_category'].isnull()]
| | customer_id | first_name | last_name | gender | past_3_years_bike_related_purchases | DOB | job_title | job_industry_category | wealth_segment | deceased_indicator | owns_car | tenure | Age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4 | 5 | Sheila-kathryn | Calton | Female | 56 | 1977-05-13 | Senior Editor | NaN | Affluent Customer | N | Yes | 8.0 | 44 |
7 | 8 | Rod | Inder | Male | 31 | 1962-03-30 | Media Manager I | NaN | Mass Customer | N | No | 7.0 | 59 |
15 | 16 | Harlin | Parr | Male | 38 | 1977-02-27 | Media Manager IV | NaN | Mass Customer | N | Yes | 18.0 | 44 |
16 | 17 | Heath | Faraday | Male | 57 | 1962-03-19 | Sales Associate | NaN | Affluent Customer | N | Yes | 15.0 | 59 |
17 | 18 | Marjie | Neasham | Female | 79 | 1967-07-06 | Professor | NaN | Affluent Customer | N | No | 11.0 | 53 |
22 | 23 | Olav | Polak | Male | 43 | 1995-02-10 | Missing | NaN | High Net Worth | N | Yes | 1.0 | 26 |
32 | 33 | Ernst | Hacon | Male | 44 | 1957-06-25 | Product Engineer | NaN | Affluent Customer | N | Yes | 11.0 | 63 |
35 | 36 | Lurette | Stonnell | Female | 33 | 1977-11-09 | VP Quality Control | NaN | Affluent Customer | N | No | 22.0 | 43 |
45 | 46 | Kaila | Allin | Female | 98 | 1972-02-26 | Missing | NaN | Affluent Customer | N | Yes | 15.0 | 49 |
47 | 48 | Rebbecca | Casone | Female | 46 | 1975-08-15 | Biostatistician II | NaN | Mass Customer | N | Yes | 8.0 | 45 |
48 | 49 | Nolly | Ownsworth | Male | 63 | 1994-01-26 | VP Quality Control | NaN | Affluent Customer | N | No | 1.0 | 27 |
56 | 57 | Abba | Masedon | M | 87 | 1988-06-13 | Chief Design Engineer | NaN | Mass Customer | N | Yes | 13.0 | 32 |
58 | 59 | Niki | Heathcote | Male | 60 | 2000-02-08 | Physical Therapy Assistant | NaN | High Net Worth | N | No | 3.0 | 21 |
67 | 68 | Dahlia | Eddoes | Female | 37 | 1974-04-21 | Information Systems Manager | NaN | Affluent Customer | N | No | 9.0 | 47 |
68 | 69 | Heidi | Milner | Female | 16 | 1969-06-22 | Web Developer II | NaN | Mass Customer | N | No | 6.0 | 51 |
72 | 73 | Minette | Worters | Female | 16 | 1960-05-27 | Teacher | NaN | Affluent Customer | N | Yes | 5.0 | 60 |
73 | 74 | Pansy | Kiddie | Female | 94 | 1969-06-19 | Missing | NaN | Mass Customer | N | Yes | 6.0 | 51 |
83 | 84 | Rich | Mathiasen | Male | 78 | 1958-02-07 | Accountant III | NaN | Mass Customer | N | Yes | 14.0 | 63 |
84 | 85 | Kane | Tixall | Male | 1 | 1958-05-21 | Analyst Programmer | NaN | Mass Customer | N | No | 8.0 | 62 |
107 | 108 | Kayle | Mingaud | Female | 4 | 1994-03-14 | Missing | NaN | High Net Worth | N | No | 3.0 | 27 |
108 | 109 | Cody | Blabey | Male | 16 | 1978-12-11 | Marketing Assistant | NaN | Affluent Customer | N | Yes | 4.0 | 42 |
110 | 111 | Cele | Evason | Female | 65 | 1993-08-29 | Analyst Programmer | NaN | Mass Customer | N | No | 2.0 | 27 |
112 | 113 | Gage | Nickless | Male | 67 | 1956-05-06 | Staff Scientist | NaN | Mass Customer | N | No | 20.0 | 65 |
117 | 118 | Prentice | Pearmain | Male | 43 | 1959-11-12 | Budget/Accounting Analyst IV | NaN | High Net Worth | N | No | 19.0 | 61 |
118 | 119 | Willey | Chastanet | Male | 9 | 1981-12-04 | Associate Professor | NaN | High Net Worth | N | Yes | 9.0 | 39 |
147 | 148 | Jaquith | Maffey | Female | 69 | 1981-05-08 | Programmer Analyst III | NaN | Mass Customer | N | Yes | 5.0 | 40 |
153 | 154 | Faydra | Dulieu | Female | 90 | 1958-02-13 | Junior Executive | NaN | Mass Customer | N | No | 11.0 | 63 |
157 | 158 | Hamlin | Odams | Male | 99 | 1984-09-03 | Internal Auditor | NaN | Affluent Customer | N | No | 5.0 | 36 |
160 | 161 | Tadd | Bloss | Male | 49 | 1976-01-21 | Missing | NaN | Mass Customer | N | No | 16.0 | 45 |
177 | 178 | Matthieu | Bertelmot | Male | 2 | 1967-04-03 | Missing | NaN | Affluent Customer | N | No | 8.0 | 54 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3851 | 3852 | Zerk | Merrien | Male | 44 | 1982-02-04 | Help Desk Operator | NaN | Mass Customer | N | No | 4.0 | 39 |
3852 | 3853 | Kerri | Marrington | Female | 91 | 1975-06-26 | Accounting Assistant IV | NaN | Mass Customer | N | Yes | 19.0 | 45 |
3854 | 3855 | Brnaby | Doughtery | Male | 89 | 1965-02-26 | General Manager | NaN | Mass Customer | N | No | 16.0 | 56 |
3859 | 3860 | Sheila-kathryn | Conklin | Female | 14 | 1986-04-05 | Mechanical Systems Engineer | NaN | Affluent Customer | N | Yes | 13.0 | 35 |
3863 | 3864 | Ilyssa | Piaggia | Female | 23 | 1963-08-27 | Help Desk Technician | NaN | Mass Customer | N | Yes | 10.0 | 57 |
3870 | 3871 | Magda | Shugg | Female | 80 | 1983-11-13 | Recruiting Manager | NaN | Mass Customer | N | No | 4.0 | 37 |
3876 | 3877 | Georgine | Poutress | Female | 55 | 1971-01-28 | Account Coordinator | NaN | High Net Worth | N | No | 11.0 | 50 |
3877 | 3878 | Waldon | Digges | Male | 99 | 1978-02-24 | Programmer III | NaN | Mass Customer | N | No | 9.0 | 43 |
3878 | 3879 | Vin | Attack | Male | 74 | 1979-08-28 | Payment Adjustment Coordinator | NaN | High Net Worth | N | No | 19.0 | 41 |
3886 | 3887 | Dulcie | Nealon | Female | 66 | 1964-07-16 | Computer Systems Analyst IV | NaN | Affluent Customer | N | No | 7.0 | 56 |
3891 | 3892 | Roma | Finlater | Male | 19 | 1978-01-29 | Staff Scientist | NaN | Mass Customer | N | Yes | 15.0 | 43 |
3892 | 3893 | Hadria | Moles | Female | 7 | 1996-11-18 | Missing | NaN | High Net Worth | N | Yes | 4.0 | 24 |
3895 | 3896 | Perla | Blakiston | Female | 3 | 1979-10-15 | Tax Accountant | NaN | Mass Customer | N | Yes | 13.0 | 41 |
3902 | 3903 | Dayna | Cawthera | Female | 69 | 1981-02-13 | Research Assistant III | NaN | Mass Customer | N | Yes | 17.0 | 40 |
3906 | 3907 | Adriana | Heam | Female | 8 | 1996-01-11 | Technical Writer | NaN | High Net Worth | N | Yes | 5.0 | 25 |
3910 | 3911 | Valeda | Ezele | Female | 81 | 1954-05-25 | Recruiting Manager | NaN | Mass Customer | N | No | 5.0 | 66 |
3917 | 3918 | Rosalia | Skedge | Female | 52 | 1977-07-05 | Junior Executive | NaN | High Net Worth | N | No | 18.0 | 43 |
3924 | 3925 | Cally | Chaim | Female | 81 | 1978-11-25 | Statistician I | NaN | High Net Worth | N | No | 7.0 | 42 |
3928 | 3929 | Jacqui | Fortnam | Female | 50 | 1989-10-18 | Missing | NaN | Affluent Customer | N | Yes | 10.0 | 31 |
3932 | 3933 | Chiarra | Cops | Female | 65 | 1983-07-05 | Missing | NaN | High Net Worth | N | Yes | 10.0 | 37 |
3946 | 3947 | Tanitansy | McTrustam | Female | 26 | 1970-05-12 | GIS Technical Architect | NaN | Mass Customer | N | No | 12.0 | 51 |
3950 | 3951 | Ephrem | Hollerin | Male | 39 | 1975-02-10 | Quality Control Specialist | NaN | Affluent Customer | N | Yes | 9.0 | 46 |
3956 | 3957 | Bernice | Scotchforth | Female | 4 | 1978-07-20 | Business Systems Development Analyst | NaN | High Net Worth | N | Yes | 14.0 | 42 |
3958 | 3959 | Dannie | Sowray | Male | 76 | 1992-12-07 | Missing | NaN | Mass Customer | N | No | 3.0 | 28 |
3962 | 3963 | Ardelle | Dasent | Female | 10 | 1954-08-22 | Software Test Engineer II | NaN | Mass Customer | N | No | 13.0 | 66 |
3965 | 3966 | Astrix | Sigward | Female | 53 | 1968-09-15 | Geologist I | NaN | Mass Customer | N | Yes | 11.0 | 52 |
3973 | 3974 | Misha | Ranklin | Female | 82 | 1961-02-11 | Technical Writer | NaN | Affluent Customer | N | Yes | 9.0 | 60 |
3975 | 3976 | Gretel | Chrystal | Female | 0 | 1957-11-20 | Internal Auditor | NaN | Affluent Customer | N | Yes | 13.0 | 63 |
3982 | 3983 | Jarred | Lyste | Male | 19 | 1965-04-21 | Graphic Designer | NaN | Mass Customer | N | Yes | 9.0 | 56 |
3999 | 4000 | Kippy | Oldland | Male | 76 | 1991-11-05 | Software Engineer IV | NaN | Affluent Customer | N | No | 11.0 | 29 |
656 rows × 13 columns
Since about 16% of the Job Industry Category values are missing, we will replace the null values with 'Missing' instead of dropping those rows.
cust_demo['job_industry_category'] = cust_demo['job_industry_category'].fillna('Missing')
cust_demo['job_industry_category'].isnull().sum()
0
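The missing-value percentage quoted above can be computed directly from the column. The following is a minimal sketch on a small made-up frame (the `toy` DataFrame and its values are assumptions for illustration, not the real `cust_demo` data). It also assigns the filled column back, which is a commonly recommended alternative to calling `fillna` with `inplace=True` on a selected column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for cust_demo; the values are invented for illustration only.
toy = pd.DataFrame({
    "job_industry_category": ["Health", np.nan, "IT", np.nan, "Property"],
})

# Share of missing values in the column, expressed as a percentage of all rows.
missing_pct = toy["job_industry_category"].isnull().mean() * 100

# Fill the gaps with a placeholder label and assign the result back.
toy["job_industry_category"] = toy["job_industry_category"].fillna("Missing")

# Count the nulls that remain after filling.
remaining_nulls = int(toy["job_industry_category"].isnull().sum())
print(missing_pct, remaining_nulls)
```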
Finally, there are no missing values left in the dataset.
cust_demo.isnull().sum()
customer_id 0 first_name 0 last_name 0 gender 0 past_3_years_bike_related_purchases 0 DOB 0 job_title 0 job_industry_category 0 wealth_segment 0 deceased_indicator 0 owns_car 0 tenure 0 Age 0 dtype: int64
print("Total records after removing Missing Values: {}".format(cust_demo.shape[0]))
Total records after removing Missing Values: 3912
Next, we check whether the categorical columns contain inconsistent values or typos.
The columns to be checked are 'gender', 'wealth_segment', 'deceased_indicator' and 'owns_car'.
cust_demo['gender'].value_counts()
Female 2037 Male 1872 F 1 M 1 Femal 1 Name: gender, dtype: int64
The gender column contains inconsistent values caused by abbreviations and typos. 'M' will be replaced with 'Male', and both 'F' and 'Femal' will be replaced with 'Female'.
def replace_gender_names(gender):
    # Standardise gender values to 'Male' and 'Female'
    if gender == 'M':
        return 'Male'
    elif gender == 'F':
        return 'Female'
    elif gender == 'Femal':
        return 'Female'
    else:
        # Leave values that are already standard unchanged
        return gender
cust_demo['gender'] = cust_demo['gender'].apply(replace_gender_names)
cust_demo['gender'].value_counts()
Female 2039 Male 1873 Name: gender, dtype: int64
The inconsistent values and typos in the gender column have been corrected.
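The same standardisation can also be written without a helper function by passing a mapping to `Series.replace`. This is a minimal sketch on a made-up Series (the `gender` Series and the `gender_map` name are assumptions for illustration); values that are not keys in the mapping are left unchanged.

```python
import pandas as pd

# Toy gender values, including the abbreviations and typo seen above.
gender = pd.Series(["Male", "F", "Femal", "M", "Female"])

# Map each non-standard spelling to its standard label.
gender_map = {"M": "Male", "F": "Female", "Femal": "Female"}

# Exact-match replacement; entries not listed in the mapping pass through as-is.
cleaned = gender.replace(gender_map)
counts = cleaned.value_counts()
print(counts)
```

A mapping keeps all corrections in one place, which makes it easier to extend if further spellings turn up.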
There is no inconsistent data in the wealth_segment column.
cust_demo['wealth_segment'].value_counts()
Mass Customer 1954 High Net Worth 996 Affluent Customer 962 Name: wealth_segment, dtype: int64
There is no inconsistent data in the deceased_indicator column.
cust_demo['deceased_indicator'].value_counts()
N 3910 Y 2 Name: deceased_indicator, dtype: int64
There is no inconsistent data in the owns_car column.
cust_demo['owns_car'].value_counts()
Yes 1974 No 1938 Name: owns_car, dtype: int64
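Instead of inspecting each `value_counts()` output by eye, the checks on the categorical columns could be automated by comparing the observed values against an allowed set. The sketch below uses a made-up frame, and the `expected` dictionary is an assumption based on the category labels seen in the outputs above.

```python
import pandas as pd

# Toy frame with one deliberate typo in the gender column.
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Femal"],
    "owns_car": ["Yes", "No", "Yes"],
})

# Allowed labels per column (assumed from the value counts shown above).
expected = {
    "gender": {"Male", "Female"},
    "owns_car": {"Yes", "No"},
}

# For each column, collect the observed values that are not in the allowed set.
unexpected = {
    col: sorted(set(toy[col].dropna()) - allowed)
    for col, allowed in expected.items()
}
print(unexpected)
```

An empty list for a column means every value in it matches one of the allowed labels.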
We need to ensure that the dataset contains no duplicate records, since duplicates lower data quality and can distort the analysis. Any duplicate rows must be dropped.
To check for duplicates, we first drop the primary key column (customer_id), because it is unique for every row, and then apply the drop_duplicates() function provided by pandas.
cust_demo_dedupped = cust_demo.drop('customer_id', axis=1).drop_duplicates()
print("Number of records after removing customer_id (pk), duplicates : {}".format(cust_demo_dedupped.shape[0]))
print("Number of records in original dataset : {}".format(cust_demo.shape[0]))
Number of records after removing customer_id (pk), duplicates : 3912 Number of records in original dataset : 3912
Since both numbers are the same, there are no duplicate records in the dataset.
The Customer Demographics dataset is now clean, so we can export it to a CSV file and continue the customer segment analysis by joining it to the other tables.
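If duplicates were present, it would be useful to see which rows they are rather than only comparing counts. One possible approach is `DataFrame.duplicated` with a `subset` that excludes the key column. This is a sketch on a made-up frame (the `toy` data is an assumption for illustration).

```python
import pandas as pd

# Toy frame: rows 0 and 2 are identical apart from the key column.
toy = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "first_name": ["Ann", "Bob", "Ann"],
    "owns_car": ["Yes", "No", "Yes"],
})

# Compare rows on every column except the primary key.
non_key_cols = [c for c in toy.columns if c != "customer_id"]

# keep=False marks every member of a duplicate group, so all copies are visible.
dup_mask = toy.duplicated(subset=non_key_cols, keep=False)
duplicate_rows = toy[dup_mask]

# With the default keep='first', only the repeated copies are counted.
n_duplicates = int(toy.duplicated(subset=non_key_cols).sum())
print(duplicate_rows)
print(n_duplicates)
```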
cust_demo.to_csv('CustomerDemographic_Cleaned.csv', index=False)
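One thing to keep in mind when the cleaned file is loaded again is that CSV stores dates as plain text, so a datetime column such as DOB needs to be parsed on read. The sketch below shows a round trip on a made-up frame, using an in-memory buffer in place of a file on disk (both are assumptions for illustration).

```python
import io

import pandas as pd

# Toy frame with a datetime column, standing in for the cleaned dataset.
toy = pd.DataFrame({
    "customer_id": [1, 2],
    "DOB": pd.to_datetime(["1953-10-12", "1980-12-16"]),
})

# Write to an in-memory buffer instead of a file, without the index column.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype that CSV text cannot carry.
reloaded = pd.read_csv(buffer, parse_dates=["DOB"])
print(reloaded.dtypes)
```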