Column attributes:
dateCrawled
- When this ad was first crawled. All field values are taken from this date.
name
- Name of the car.
seller
- Whether the seller is private or a dealer.
offerType
- The type of listing.
price
- The price on the ad to sell the car.
abtest
- Whether the listing is included in an A/B test.
vehicleType
- The vehicle type.
yearOfRegistration
- The year in which the car was first registered.
gearbox
- The transmission type.
powerPS
- The power of the car in PS.
model
- The car model name.
kilometer
- How many kilometers the car has driven.
monthOfRegistration
- The month in which the car was first registered.
fuelType
- What type of fuel the car uses.
brand
- The brand of the car.
notRepairedDamage
- Whether the car has damage which is not yet repaired.
dateCreated
- The date on which the eBay listing was created.
nrOfPictures
- The number of pictures in the ad.
postalCode
- The postal code for the location of the vehicle.
lastSeenOnline
- When the crawler last saw this ad online.
AIM: To use the pandas library to clean and analyze the dataset.
# importing the required libraries:
import numpy as np
import pandas as pd
# reading the file into a dataframe using pandas:
autos = pd.read_csv("autos.csv", encoding= 'Latin-1')
# peeking at the data:
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
# preliminary analysis using pandas:
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
The column names use camelCase instead of snake_case. Since snake_case is easier to read and work with, let's convert the column names to snake_case.
# analysing the columns:
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# defining a function to change the column-names:
def col_clean(col):
    # special-case renames first:
    col = col.replace("yearOfRegistration","registration_year")
    col = col.replace("monthOfRegistration","registration_month")
    col = col.replace("notRepairedDamage","unrepaired_damage")
    col = col.replace("dateCreated","ad_created")
    # generic camelCase -> snake_case conversion:
    name = [col[0].lower()]
    for c in col[1:]:
        if c in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
            name.append("_")
            name.append(c.lower())
        else:
            name.append(c)
    return ''.join(name)
## checking the function:
print(col_clean('fuelType'))
fuel_type
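As an aside, the character-by-character loop in col_clean can also be written as a one-line regular expression. This is only a sketch of the generic camelCase-to-snake_case step; the special-case renames at the top of col_clean would still be applied separately:

```python
import re

def camel_to_snake(col):
    # insert an underscore before every capital letter (except at the
    # start of the string), then lowercase the whole name
    return re.sub(r'(?<!^)(?=[A-Z])', '_', col).lower()

print(camel_to_snake('fuelType'))   # fuel_type
print(camel_to_snake('powerPS'))    # power_p_s
```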
# changing the col names using our function:
new_cols = []
for c in autos.columns:
    new_cols.append(col_clean(c))
# checking the new column names:
print(new_cols)
['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_p_s', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen']
# reassigning the new col back to our dataset:
autos.columns = new_cols
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): date_crawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offer_type 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicle_type 44905 non-null object registration_year 50000 non-null int64 gearbox 47320 non-null object power_p_s 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object registration_month 50000 non-null int64 fuel_type 45518 non-null object brand 50000 non-null object unrepaired_damage 40171 non-null object ad_created 50000 non-null object nr_of_pictures 50000 non-null int64 postal_code 50000 non-null int64 last_seen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
**Basic data exploration:**
Now let's do some basic data exploration to determine what other cleaning tasks need to be done. Initially, we'll look for text columns where almost all values are the same (candidates for dropping) and numeric data stored as text that needs to be cleaned and converted.
# basic dataset analysis:
autos.describe(include='all') # we use (include = 'all') to get the numerical and non numerical columns
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-23 19:38:20 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Let's further analyse the two numeric-looking columns that are currently stored as text, 'price' and 'odometer':
## the price column :
autos['price'].value_counts()
$0 1421 $500 781 $1,500 734 $2,500 643 $1,200 639 $1,000 639 $600 531 $3,500 498 $800 498 $2,000 460 $999 434 $750 433 $900 420 $650 419 $850 410 $700 395 $4,500 394 $300 384 $2,200 382 $950 379 $1,100 376 $1,300 371 $3,000 365 $550 356 $1,800 355 $5,500 340 $1,250 335 $350 335 $1,600 327 $1,999 322 ... $9,577 1 $5,685 1 $19,970 1 $26,150 1 $10,498 1 $2,989 1 $7,825 1 $9,725 1 $554 1 $8,875 1 $20,799 1 $11,140 1 $295,000 1 $185 1 $33,777 1 $12,395 1 $6,155 1 $20,985 1 $12,898 1 $219 1 $251 1 $410 1 $2,004 1 $18,890 1 $25,850 1 $11,360 1 $2,986 1 $24,444 1 $3,620 1 $6,202 1 Name: price, Length: 2357, dtype: int64
## the odometer column:
autos['odometer'].value_counts()
150,000km 32424 125,000km 5170 100,000km 2169 90,000km 1757 80,000km 1436 70,000km 1230 60,000km 1164 50,000km 1027 5,000km 967 40,000km 819 30,000km 789 20,000km 784 10,000km 264 Name: odometer, dtype: int64
## removing the '$' from price values and 'km' from the odometer readings and converting them to numeric columns:
autos['price'] = (autos['price']
                  .str.replace("$","")
                  .str.replace(",","")
                  .astype(int)
                  )
autos['odometer'] = (autos['odometer']
                     .str.replace("km","")
                     .str.replace(",","")
                     .astype(int)
                     )
## changing the column name to 'odometer_km':
autos.rename(columns={"odometer":"odometer_km"},inplace = True)
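One caveat with the cleanup above: the default value of the regex parameter of Series.str.replace has changed across pandas versions (historically True, False since pandas 2.0), and with regex=True the pattern "$" is an end-of-string anchor rather than a literal dollar sign. Passing regex=False explicitly makes the literal replacement unambiguous; a self-contained sketch on demo data:

```python
import pandas as pd

demo = pd.DataFrame({"price": ["$5,000", "$8,500"],
                     "odometer": ["150,000km", "70,000km"]})

# literal (non-regex) replacement of the symbols, then cast to int
demo["price"] = (demo["price"]
                 .str.replace("$", "", regex=False)
                 .str.replace(",", "", regex=False)
                 .astype(int))
demo["odometer"] = (demo["odometer"]
                    .str.replace("km", "", regex=False)
                    .str.replace(",", "", regex=False)
                    .astype(int))

print(demo["price"].tolist())     # [5000, 8500]
print(demo["odometer"].tolist())  # [150000, 70000]
```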
Analysing some of the remaining columns, namely 'seller', 'offer_type' and 'nr_of_pictures':
autos['seller'].value_counts(dropna=False)
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos['offer_type'].value_counts(dropna=False)
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
autos['nr_of_pictures'].value_counts(dropna=False)
0 50000 Name: nr_of_pictures, dtype: int64
We see that each of the columns above is dominated by a single value (or contains only one value), so they carry almost no information; let's drop them before further analysis:
autos = autos.drop(['seller', 'offer_type','nr_of_pictures'], axis=1)
**Exploring the 'price' and 'odometer_km' columns for outliers:**
## exploring the price and odometer_km columns to further look for outliers in the data entries:
autos['price'].unique().shape
(2357,)
autos['price'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price, dtype: float64
autos['price'].value_counts().sort_index().head(5)
0 1421 1 156 2 3 3 1 5 2 Name: price, dtype: int64
autos['price'].value_counts().sort_index().tail(10)
999990 1 999999 2 1234566 1 1300000 1 3890000 1 10000000 1 11111111 2 12345678 3 27322222 1 99999999 1 Name: price, dtype: int64
We see that some listings are priced far above any realistic market value, and 1,421 listings are priced at $0. These are outliers that we should remove before further analysis.
## Removing the outliers in the price column of the dataset:
autos = autos.loc[autos['price'].between(1,4500000),:]
Let's check if the outliers in the 'price' column have been removed:
autos['price'].value_counts().sort_index().tail(10)
265000 1 295000 1 299000 1 345000 1 350000 1 999990 1 999999 2 1234566 1 1300000 1 3890000 1 Name: price, dtype: int64
We see that the maximum vehicle price in the dataset has now been reduced to ~3.89 million, which is far more realistic for an open-market listing.
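The cutoff of 4,500,000 used above is a judgment call. A common, more mechanical alternative (shown here only as a sketch on toy data, not what was applied to autos) is the 1.5×IQR rule:

```python
import pandas as pd

# toy prices standing in for the real column; the last value is an outlier
prices = pd.Series([0, 500, 1200, 2950, 7200, 12000, 99999999])

q1, q3 = prices.quantile(0.25), prices.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# keep only values within [q1 - 1.5*IQR, q3 + 1.5*IQR]
kept = prices[prices.between(lower, upper)]
print(kept.tolist())  # [0, 500, 1200, 2950, 7200, 12000]
```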
## exploring the odometer_km column for outliers:
autos['odometer_km'].value_counts().sort_index().tail(10)
30000 780 40000 815 50000 1013 60000 1155 70000 1217 80000 1415 90000 1734 100000 2115 125000 5058 150000 31417 Name: odometer_km, dtype: int64
We see that the odometer column has reasonable values.
# basic information about the columns:
autos[['date_crawled','last_seen', 'ad_created', 'registration_month' , 'registration_year']].info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 48571 entries, 0 to 49999 Data columns (total 5 columns): date_crawled 48571 non-null object last_seen 48571 non-null object ad_created 48571 non-null object registration_month 48571 non-null int64 registration_year 48571 non-null int64 dtypes: int64(2), object(3) memory usage: 2.2+ MB
We see that the 'date_crawled', 'last_seen' and 'ad_created' columns are all stored as strings (object dtype) by pandas. To analyse them quantitatively, we'll extract their date components and look at their distributions. The other two columns are already numeric.
autos[['date_crawled','last_seen', 'ad_created', 'registration_month' , 'registration_year']].head(5)
date_crawled | last_seen | ad_created | registration_month | registration_year | |
---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-04-06 06:45:54 | 2016-03-26 00:00:00 | 3 | 2004 |
1 | 2016-04-04 13:38:56 | 2016-04-06 14:45:08 | 2016-04-04 00:00:00 | 6 | 1997 |
2 | 2016-03-26 18:57:24 | 2016-04-06 20:15:37 | 2016-03-26 00:00:00 | 7 | 2009 |
3 | 2016-03-12 16:58:10 | 2016-03-15 03:16:28 | 2016-03-12 00:00:00 | 6 | 2007 |
4 | 2016-04-01 14:38:50 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 7 | 2003 |
The 'date_crawled', 'last_seen' and 'ad_created' columns end with a time-of-day component, which we can strip before further analysis:
## selecting only the first 10 characters from the str-time columns:
autos['date_crawled'] = autos['date_crawled'].str[:10]
autos['last_seen'] = autos['last_seen'].str[:10]
autos['ad_created'] = autos['ad_created'].str[:10]
autos[['date_crawled','last_seen', 'ad_created', 'registration_month' , 'registration_year']].head(5)
date_crawled | last_seen | ad_created | registration_month | registration_year | |
---|---|---|---|---|---|
0 | 2016-03-26 | 2016-04-06 | 2016-03-26 | 3 | 2004 |
1 | 2016-04-04 | 2016-04-06 | 2016-04-04 | 6 | 1997 |
2 | 2016-03-26 | 2016-04-06 | 2016-03-26 | 7 | 2009 |
3 | 2016-03-12 | 2016-03-15 | 2016-03-12 | 6 | 2007 |
4 | 2016-04-01 | 2016-04-01 | 2016-04-01 | 7 | 2003 |
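Slicing off the first 10 characters keeps these columns as strings. An alternative sketch converts them to real datetime values with pd.to_datetime, which also enables the .dt accessors for later date arithmetic:

```python
import pandas as pd

demo = pd.DataFrame({"date_crawled": ["2016-03-26 17:47:46",
                                      "2016-04-04 13:38:56"]})

# parse the full timestamp, then keep only the calendar date
demo["date_crawled"] = pd.to_datetime(demo["date_crawled"]).dt.date

print([str(d) for d in demo["date_crawled"]])  # ['2016-03-26', '2016-04-04']
```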
# for 'date_crawled' column:
autos['date_crawled'].value_counts(normalize =True, dropna = False)
## To include missing values in the distribution and to use percentages instead of counts
2016-04-03 0.038603 2016-03-20 0.037883 2016-03-21 0.037389 2016-03-12 0.036915 2016-03-14 0.036544 2016-04-04 0.036524 2016-03-07 0.036009 2016-04-02 0.035474 2016-03-28 0.034856 2016-03-19 0.034774 2016-03-15 0.034280 2016-03-29 0.034115 2016-04-01 0.033683 2016-03-30 0.033683 2016-03-08 0.033291 2016-03-09 0.033086 2016-03-22 0.033003 2016-03-11 0.032571 2016-03-23 0.032221 2016-03-26 0.032200 2016-03-10 0.032180 2016-03-31 0.031830 2016-03-17 0.031644 2016-03-25 0.031603 2016-03-27 0.031089 2016-03-16 0.029606 2016-03-24 0.029338 2016-03-05 0.025324 2016-03-13 0.015668 2016-03-06 0.014041 2016-04-05 0.013094 2016-03-18 0.012909 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64
The date_crawled column records the date on which the crawler first saw each ad. The distribution above shows the share of listings crawled on each day: the site was crawled roughly daily over about a month (early March to early April 2016), with each day accounting for a fairly even ~3% of the listings.
# for the 'ad_created' column:
autos['ad_created'].value_counts(normalize =True, dropna = False)
2016-04-03 0.038850 2016-03-20 0.037944 2016-03-21 0.037594 2016-04-04 0.036894 2016-03-12 0.036750 2016-03-14 0.035186 2016-04-02 0.035144 2016-03-28 0.034980 2016-03-07 0.034733 2016-03-29 0.034053 2016-03-15 0.034012 2016-03-19 0.033683 2016-04-01 0.033683 2016-03-30 0.033497 2016-03-08 0.033312 2016-03-09 0.033147 2016-03-11 0.032900 2016-03-22 0.032818 2016-03-26 0.032262 2016-03-23 0.032056 2016-03-10 0.031891 2016-03-31 0.031871 2016-03-25 0.031747 2016-03-17 0.031294 2016-03-27 0.030986 2016-03-16 0.030121 2016-03-24 0.029277 2016-03-05 0.022894 2016-03-13 0.017006 2016-03-06 0.015318 ... 2016-02-14 0.000041 2016-02-05 0.000041 2016-02-20 0.000041 2016-02-12 0.000041 2016-01-10 0.000041 2016-02-24 0.000041 2016-02-02 0.000041 2016-02-18 0.000041 2016-02-26 0.000041 2016-01-13 0.000021 2016-02-16 0.000021 2015-12-30 0.000021 2016-02-08 0.000021 2016-01-29 0.000021 2016-02-11 0.000021 2016-02-09 0.000021 2016-01-07 0.000021 2016-02-22 0.000021 2016-01-22 0.000021 2015-06-11 0.000021 2016-02-17 0.000021 2015-09-09 0.000021 2016-01-16 0.000021 2016-01-03 0.000021 2015-12-05 0.000021 2016-01-14 0.000021 2015-08-10 0.000021 2016-02-01 0.000021 2016-02-07 0.000021 2015-11-10 0.000021 Name: ad_created, Length: 76, dtype: float64
# for the 'last_seen' column:
autos['last_seen'].value_counts(normalize =True, dropna = False)
2016-04-06 0.221799 2016-04-07 0.131931 2016-04-05 0.124786 2016-03-17 0.028083 2016-04-03 0.025200 2016-04-02 0.024933 2016-03-30 0.024768 2016-04-04 0.024480 2016-03-12 0.023780 2016-03-31 0.023780 2016-04-01 0.022791 2016-03-29 0.022359 2016-03-22 0.021371 2016-03-28 0.020877 2016-03-20 0.020650 2016-03-21 0.020630 2016-03-24 0.019765 2016-03-25 0.019209 2016-03-23 0.018530 2016-03-26 0.016800 2016-03-16 0.016450 2016-03-15 0.015874 2016-03-19 0.015832 2016-03-27 0.015647 2016-03-14 0.012600 2016-03-11 0.012374 2016-03-10 0.010665 2016-03-09 0.009594 2016-03-13 0.008894 2016-03-08 0.007412 2016-03-18 0.007350 2016-03-07 0.005394 2016-03-06 0.004324 2016-03-05 0.001071 Name: last_seen, dtype: float64
The last_seen column records the date on which the crawler last saw a listing online, which can serve as a rough proxy for when a car was sold or the ad removed. The last three crawl days (2016-04-05 to 2016-04-07) account for a disproportionate share (~48%) of the values, but this reflects the end of the crawl period rather than a spike in sales. Note that date_crawled, ad_created and last_seen all fall within March-April 2016 because that is when the crawl ran; the data covers a single crawl window, not seasonal patterns in car sales.
# exploring the registration month and year columns:
autos[['registration_month', 'registration_year']].describe()
registration_month | registration_year | |
---|---|---|
count | 48571.000000 | 48571.000000 |
mean | 5.782319 | 2004.754257 |
std | 3.685700 | 88.638548 |
min | 0.000000 | 1000.000000 |
25% | 3.000000 | 1999.000000 |
50% | 6.000000 | 2004.000000 |
75% | 9.000000 | 2008.000000 |
max | 12.000000 | 9999.000000 |
We need to limit the registration year to the range 1900-2016, since the data was crawled in 2016 and a car cannot be registered after it was listed. The minimum of '1000' and maximum of '9999' seen above are clearly invalid and need to be removed.
# limiting the registration year in our data:
autos = autos.loc[autos['registration_year'].between(1900,2016),:]
[autos['registration_year'].value_counts(dropna = False) > 1900]
[2000 True 2005 True 1999 True 2004 True 2003 True 2006 True 2001 True 2002 True 1998 True 2007 True 2008 True 2009 True 1997 True 2011 False 2010 False 1996 False 2012 False 1995 False 2016 False 2013 False 2014 False 1994 False 1993 False 2015 False 1992 False 1990 False 1991 False 1989 False 1988 False 1985 False ... 1966 False 1976 False 1969 False 1975 False 1965 False 1964 False 1963 False 1959 False 1961 False 1910 False 1956 False 1958 False 1937 False 1962 False 1950 False 1954 False 1941 False 1951 False 1934 False 1957 False 1955 False 1953 False 1943 False 1929 False 1939 False 1938 False 1948 False 1927 False 1931 False 1952 False Name: registration_year, Length: 78, dtype: bool]
The period 1997-2009 appears to be the boom period for the used-car market in this dataset, with more than 1900 listed cars per registration year during that span.
Registration month cannot have a minimum value of 0, since months are numbered from 1 (January) to 12 (December); a value of 0 most likely marks a missing entry.
## dropping the rows whose registration month falls outside the valid range (1-12):
autos = autos.loc[autos['registration_month'].between(1,12),:]
## final exploration of the columns:
autos[['registration_month', 'registration_year']].describe()
registration_month | registration_year | |
---|---|---|
count | 42661.000000 | 42661.000000 |
mean | 6.377113 | 2003.194815 |
std | 3.352061 | 6.941008 |
min | 1.000000 | 1927.000000 |
25% | 3.000000 | 1999.000000 |
50% | 6.000000 | 2004.000000 |
75% | 9.000000 | 2008.000000 |
max | 12.000000 | 2016.000000 |
## Exploring the changes in the dataset that we made already:
autos.head(10)
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_p_s | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 | 79588 | 2016-04-06 |
1 | 2016-04-04 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 | 71034 | 2016-04-06 |
2 | 2016-03-26 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 | 35394 | 2016-04-06 |
3 | 2016-03-12 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000 | 6 | benzin | smart | nein | 2016-03-12 | 33729 | 2016-03-15 |
4 | 2016-04-01 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | kombi | 2003 | manuell | 0 | focus | 150000 | 7 | benzin | ford | nein | 2016-04-01 | 39218 | 2016-04-01 |
5 | 2016-03-21 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | 7900 | test | bus | 2006 | automatik | 150 | voyager | 150000 | 4 | diesel | chrysler | NaN | 2016-03-21 | 22962 | 2016-04-06 |
6 | 2016-03-20 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | 300 | test | limousine | 1995 | manuell | 90 | golf | 150000 | 8 | benzin | volkswagen | NaN | 2016-03-20 | 31535 | 2016-03-23 |
7 | 2016-03-16 | Golf_IV_1.9_TDI_90PS | 1990 | control | limousine | 1998 | manuell | 90 | golf | 150000 | 12 | diesel | volkswagen | nein | 2016-03-16 | 53474 | 2016-04-07 |
8 | 2016-03-22 | Seat_Arosa | 250 | test | NaN | 2000 | manuell | 0 | arosa | 150000 | 10 | NaN | seat | nein | 2016-03-22 | 7426 | 2016-03-26 |
9 | 2016-03-16 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | 590 | control | bus | 1997 | manuell | 90 | megane | 150000 | 7 | benzin | renault | nein | 2016-03-16 | 15749 | 2016-04-06 |
autos['vehicle_type'].value_counts(dropna =False)
limousine 11886 kleinwagen 9695 kombi 8340 bus 3838 cabrio 2849 coupe 2289 suv 1894 NaN 1538 andere 332 Name: vehicle_type, dtype: int64
We see that the values in the vehicle_type column are in German. Let's translate them into English to make the dataset easier to read.
## converting the vehicle_type into english:
autos.loc[autos['vehicle_type'] == 'kleinwagen' , 'vehicle_type'] = 'compact_car'
autos.loc[autos['vehicle_type'] == 'kombi' , 'vehicle_type'] = 'station_wagon'
autos.loc[autos['vehicle_type'] == 'cabrio' , 'vehicle_type'] = 'cabriolet'
autos.loc[autos['vehicle_type'] == 'andere' , 'vehicle_type'] = 'others'
## checking the value counts in english:
autos['vehicle_type'].value_counts(dropna =False)
limousine 11886 compact_car 9695 station_wagon 8340 bus 3838 cabriolet 2849 coupe 2289 suv 1894 NaN 1538 others 332 Name: vehicle_type, dtype: int64
*We see that 'limousine', which in German simply means a sedan (not a stretch limousine), is the most common vehicle_type listed for sale on the website.*
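The repeated .loc assignments used for the translation can also be expressed as one dictionary lookup with Series.replace, which leaves unmapped values (like 'limousine') untouched; a sketch on demo data:

```python
import pandas as pd

translations = {"kleinwagen": "compact_car",
                "kombi": "station_wagon",
                "cabrio": "cabriolet",
                "andere": "others"}

demo = pd.Series(["kleinwagen", "kombi", "limousine", "cabrio"])

# values not present in the dict pass through unchanged
demo = demo.replace(translations)
print(demo.tolist())  # ['compact_car', 'station_wagon', 'limousine', 'cabriolet']
```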
## exploring the 'gearbox' column:
autos['gearbox'].value_counts(dropna = False)
manuell 32086 automatik 9384 NaN 1191 Name: gearbox, dtype: int64
We see that the gearbox values are also in German; let's translate them into English as well:
autos.loc[autos['gearbox'] == 'manuell' , 'gearbox'] = 'manual'
autos.loc[autos['gearbox'] == 'automatik' , 'gearbox'] = 'automatic'
autos['gearbox'].value_counts(normalize = True ,dropna = False)
manual 0.752116 automatic 0.219967 NaN 0.027918 Name: gearbox, dtype: float64
*We see that over 75% of the listings on the website include a 'Manual gearbox', which is more commonly known as a 'stick-shift'.*
Exploring the 'power_p_s' column:
autos['power_p_s'].describe()
count 42661.000000 mean 122.413164 std 191.183734 min 0.000000 25% 75.000000 50% 110.000000 75% 150.000000 max 17700.000000 Name: power_p_s, dtype: float64
There are major outliers in the power_p_s column: a vehicle cannot have 0 PS, and a maximum of 17,700 PS is far beyond any production car. We should remove such outliers from our data:
autos['power_p_s'].value_counts(dropna = False)
0 2979 75 2663 60 1863 150 1783 140 1742 101 1534 116 1510 90 1478 170 1375 105 1296 136 852 125 833 163 799 102 785 143 694 122 664 131 661 54 655 110 610 109 575 177 521 80 515 50 497 58 460 120 456 69 454 184 362 95 353 68 350 115 347 ... 1000 1 2729 1 168 1 3750 1 999 1 455 1 678 1 454 1 30 1 587 1 304 1 1003 1 8404 1 431 1 6045 1 157 1 1998 1 430 1 398 1 253 1 14 1 3500 1 365 1 317 1 17700 1 460 1 268 1 236 1 1771 1 16312 1 Name: power_p_s, Length: 426, dtype: int64
A quick search suggests that compact cars sit at the low end with an average of around 70 PS, while the most powerful production cars reach roughly 1500 PS. Let's limit the power range to 70-1500 to keep the dataset realistic.
## selecting only the vehicles within the power range of (70-1500):
autos = autos.loc[autos['power_p_s'].between(70,1500),:]
autos['power_p_s'].describe()
count 33658.000000 mean 140.840305 std 65.528226 min 70.000000 25% 101.000000 50% 125.000000 75% 163.000000 max 1405.000000 Name: power_p_s, dtype: float64
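As a sanity check, it is worth knowing how much data a filter like this discards. Using the row counts from the two describe() outputs above (42,661 before the filter, 33,658 after), the power filter drops roughly a fifth of the remaining listings:

```python
# row counts taken from the describe() outputs before and after the filter
before, after = 42661, 33658

dropped_pct = 100 * (before - after) / before
print(round(dropped_pct, 1))  # 21.1
```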
Exploring the 'unrepaired_damage' column:
autos['unrepaired_damage'].value_counts(dropna= False)
nein 26987 NaN 3747 ja 2924 Name: unrepaired_damage, dtype: int64
Since all of the listings are in German, let's convert them to English for relevance:
autos.loc[autos['unrepaired_damage'] == 'nein' , 'unrepaired_damage'] = 'No'
autos.loc[autos['unrepaired_damage'] == 'ja' , 'unrepaired_damage'] = 'Yes'
autos['unrepaired_damage'].value_counts(normalize = True,dropna= False)
No 0.801800 NaN 0.111326 Yes 0.086874 Name: unrepaired_damage, dtype: float64
We see that over 80% of the listed cars have no unrepaired damage; only about 8.7% are listed with unrepaired damage (the remaining ~11% are unspecified). Given that these vehicles are used, some share of damaged listings is to be expected.
**Now that we have cleaned the dataset, let's move on to drawing inferences from our data.**
**Aggregating the brand column:**
## Let's first see the number of car brands in the dataset:
print(autos['brand'].value_counts(normalize = True ,dropna = False)*100)
print('\n')
print('There are {} unique car brands in the dataset.'.format(len(autos['brand'].value_counts(dropna = False))))
volkswagen 20.004159 bmw 13.559926 mercedes_benz 11.572286 audi 10.571038 opel 9.341018 ford 6.173867 renault 3.693030 peugeot 2.757145 skoda 1.744013 mazda 1.690534 seat 1.550894 fiat 1.437994 nissan 1.283499 toyota 1.253788 citroen 1.143859 mini 1.129004 volvo 1.117119 hyundai 0.965595 mitsubishi 0.926971 honda 0.915087 alfa_romeo 0.799216 porsche 0.775447 sonstige_autos 0.754650 kia 0.736823 smart 0.695228 chevrolet 0.540733 suzuki 0.540733 chrysler 0.410007 dacia 0.320875 jeep 0.285222 land_rover 0.249569 subaru 0.216888 saab 0.205003 jaguar 0.184206 rover 0.136669 daihatsu 0.101016 daewoo 0.098045 lancia 0.074277 lada 0.044566 Name: brand, dtype: float64 There are 39 unique car brands in the dataset.
Let's aggregate the car brands by the mean price at which they are listed on the website. We also calculated that about 80% of the listed cars had no unrepaired damage; let's use that split to further check whether 'brand value' affects the listed prices for the 10 most popular brands in the dataset.
# empty dictionaries to store the values:
car_brand_prices = {}
car_brand_prices_undam = {}
car_brand_prices_dam = {}
car_brands = autos['brand'].value_counts(dropna = False).head(10).index
## looping over the dataset to aggregate and find the mean price per brand:
for brand in car_brands:
    brand_col = autos.loc[autos['brand'] == brand,:]
    brand_col_dam = brand_col.loc[brand_col['unrepaired_damage'] == 'Yes']
    brand_col_undam = brand_col.loc[brand_col['unrepaired_damage'] == 'No']
    # to find mean price:
    mean_price = brand_col['price'].sum() / len(brand_col)
    mean_price_dam = brand_col_dam['price'].sum() / len(brand_col_dam)
    mean_price_undam = brand_col_undam['price'].sum() / len(brand_col_undam)
    # adding to their respective dictionaries:
    car_brand_prices[brand] = mean_price
    car_brand_prices_undam[brand] = mean_price_undam
    car_brand_prices_dam[brand] = mean_price_dam
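The explicit loop above is essentially a group-wise mean; pandas can do the same aggregation in one line with groupby (the real call would be autos.groupby('brand')['price'].mean()). A sketch on demo data:

```python
import pandas as pd

demo = pd.DataFrame({"brand": ["bmw", "bmw", "audi"],
                     "price": [9000, 11000, 10000]})

# mean price per brand, computed group-wise in one step
mean_price_by_brand = demo.groupby("brand")["price"].mean()
print(mean_price_by_brand.to_dict())  # {'audi': 10000.0, 'bmw': 10000.0}
```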
# Ranking car brands by average price:
print("The car brands and their average prices are given as:" + '\n')
price_list = []
for brand, price in car_brand_prices.items():
    price_list.append((price, brand))
price_list.sort(reverse=True)
for price, brand in price_list:
    print(brand + ' : ' + str(price))
avg_top_3 = 0
for price, brand in price_list[:3]:
    avg_top_3 = avg_top_3 + price
avg_top_3 = avg_top_3 / 3
print("The average price of the top three car brands : {}".format(int(avg_top_3)))
The car brands and their average prices are given as: audi : 10061.087408656549 mercedes_benz : 9145.857509627727 bmw : 9115.98773006135 skoda : 7261.8313458262355 volkswagen : 7121.500816872122 ford : 5255.128970163619 mazda : 4686.669595782074 opel : 3884.320292620865 peugeot : 3802.855603448276 renault : 3350.540627514079 The average price of the top three car brands : 9440
# Ranking car brands by average price of damaged cars:
print("The car brands and their average prices of damaged cars given as:" + '\n')
price_list_dam = []
for brand, price in car_brand_prices_dam.items():
    price_list_dam.append((price, brand))
price_list_dam.sort(reverse=True)
for price, brand in price_list_dam:
    print(brand + ' : ' + str(price))
avg_top_3 = 0
for price, brand in price_list_dam[:3]:
    avg_top_3 = avg_top_3 + price
avg_top_3 = avg_top_3 / 3
print("The average price of the top three car brands in the damaged category : {}".format(int(avg_top_3)))
The car brands and their average prices of damaged cars given as: volkswagen : 5076.759656652361 skoda : 4467.222222222223 mercedes_benz : 4252.318021201413 bmw : 3846.87125748503 audi : 3518.924187725632 ford : 1741.9076305220883 opel : 1712.28125 peugeot : 1697.137254901961 renault : 1586.291390728477 mazda : 1552.72131147541 The average price of the top three car brands in the damaged category : 4598
# Ranking car brands by average price of undamaged cars:
print("The car brands and their average prices of undamaged cars are given as:" + '\n')
price_list_undam = []
for brand, price in car_brand_prices_undam.items():
    price_list_undam.append((price, brand))
price_list_undam.sort(reverse=True)
for price, brand in price_list_undam:
    print(brand + ' : ' + str(price))
avg_top_3 = 0
for price, brand in price_list_undam[:3]:
    avg_top_3 = avg_top_3 + price
avg_top_3 = avg_top_3 / 3
print("The average price of the top three car brands in the undamaged category is : {}".format(int(avg_top_3)))
The car brands and their average prices of undamaged cars are given as:

audi : 11222.07032057911
mercedes_benz : 10000.280860349127
bmw : 9667.33493846977
skoda : 7693.350806451613
volkswagen : 7590.761508596783
ford : 6222.325492689129
mazda : 5486.579676674365
opel : 4423.779074614423
peugeot : 4261.455295735901
renault : 3888.2448979591836
The average price of the top three car brands in the undamaged category is : 10296
From the price comparison above we can draw certain conclusions:

* Audi, Mercedes-Benz and BMW are the costliest brands, with an average price of 9440 overall and 10296 in the undamaged category.
* Volkswagen, Skoda and Mercedes-Benz hold the top three in the damaged-cars category, with an average price of 4598.
* Opel, Peugeot and Renault are the cheapest car brands in the dataset.

*** Since Mercedes-Benz features in the top three across all the categories, it is safe to say that it has a higher brand value, or that it holds its resale value the best amongst the given brands. However, we should also note that the dataset is from a German website and Mercedes-Benz is itself a German brand, which might affect its overall resale value. ***
AGGREGATING the odometer_km column:

## looping over the dataset to aggregate and find the mean mileage per brand:
car_brand_mil = {}
for brand in car_brands:
    brand_col = autos.loc[autos['brand'] == brand, :]
    # to find mean mileage:
    mean_mil = brand_col['odometer_km'].sum() / len(brand_col)
    # adding to the dictionary:
    car_brand_mil[brand] = mean_mil
print("The car brands and their average mileages are given as:" + '\n')
mil_list = []
for brand, mil in car_brand_mil.items():
    mil_list.append((mil, brand))
mil_list.sort(reverse=True)
for mil, brand in mil_list:
    print(brand + ' : ' + str(mil))
The car brands and their average mileages are given as:

bmw : 132300.61349693252
mercedes_benz : 130513.47881899872
audi : 127922.9904440697
opel : 127188.2951653944
volkswagen : 127079.31085697311
ford : 124983.15688161695
renault : 124883.34674175382
peugeot : 124364.22413793103
mazda : 122996.48506151143
skoda : 112827.93867120954
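The per-brand loop above is exactly what pandas `groupby` does in a single step. A minimal sketch on a toy DataFrame (the rows and numbers are invented for illustration, not taken from the dataset):

```python
import pandas as pd

# toy rows standing in for the autos DataFrame (values are made up)
toy = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'audi', 'audi'],
    'odometer_km': [150000, 120000, 125000, 100000],
})

# groupby computes the per-brand mean in one step, replacing the explicit loop
mean_mil = toy.groupby('brand')['odometer_km'].mean().sort_values(ascending=False)
print(mean_mil)
```

`mean()` per group is equivalent to the `sum() / len()` computation in the loop, and `sort_values` replaces the tuple-list sort.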
*The top three car brands with the highest mean mileages are BMW, Mercedes-Benz and Audi.*
## creating a new dataframe object to check the correlation between mean prices and mean mileage:
cars = {'mean_prices': car_brand_prices, 'mean_mileage': car_brand_mil}
df_Car = pd.DataFrame(data=cars)
df_Car
| | mean_mileage | mean_prices |
|---|---|---|
| audi | 127922.990444 | 10061.087409 |
| bmw | 132300.613497 | 9115.987730 |
| ford | 124983.156882 | 5255.128970 |
| mazda | 122996.485062 | 4686.669596 |
| mercedes_benz | 130513.478819 | 9145.857510 |
| opel | 127188.295165 | 3884.320293 |
| peugeot | 124364.224138 | 3802.855603 |
| renault | 124883.346742 | 3350.540628 |
| skoda | 112827.938671 | 7261.831346 |
| volkswagen | 127079.310857 | 7121.500817 |
df_Car.corr()
| | mean_mileage | mean_prices |
|---|---|---|
| mean_mileage | 1.000000 | 0.309558 |
| mean_prices | 0.309558 | 1.000000 |
*** We see that the mean prices and mean mileage of the car brands are only weakly related, with a correlation coefficient of 0.309. ***
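The same correlation can be computed directly between two Series; pandas aligns them on their shared index before computing Pearson's r. A minimal sketch with hypothetical per-brand aggregates (keys and values are for illustration only):

```python
import pandas as pd

# hypothetical per-brand aggregates (values are illustrative, not from the dataset)
mean_prices = pd.Series({'audi': 1.0, 'bmw': 2.0, 'opel': 3.0})
mean_mileage = pd.Series({'audi': 2.0, 'bmw': 4.0, 'opel': 6.0})

# Series.corr aligns the two series on their index and returns Pearson's r
r = mean_prices.corr(mean_mileage)
```

Because the toy values are perfectly linear, r here is exactly 1.0; on the real aggregates this reproduces the 0.309558 seen in `df_Car.corr()`.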
Let's find the top 10 most common brand/model combinations in our dataset:
# finding the top 10 most common car models from our dataset:
autos['brand_model'] = autos['brand'] + '-' + autos['model']
# value_counts drops NaN by default, so head(10) gives the ten real combinations
bmods = autos['brand_model'].value_counts().head(10).index
brand_model = {}
print("The top 10 most commonly listed brand-model combinations with their number of listings are: " + '\n')
for bmod in bmods:
    car_model = autos.loc[autos['brand_model'] == bmod]
    count = len(car_model)
    brand_model[bmod] = count
mod_list = []
for mod, count in brand_model.items():
    mod_list.append((count, mod))
mod_list.sort(reverse=True)
for count, mod in mod_list:
    print(str(mod) + ' : ' + str(count))
The top 10 most commonly listed brand-model combinations with their number of listings are:

volkswagen-golf : 3001
bmw-3er : 2307
volkswagen-passat : 1170
opel-astra : 1099
audi-a4 : 1089
bmw-5er : 1032
mercedes_benz-c_klasse : 1012
mercedes_benz-e_klasse : 848
audi-a3 : 732
audi-a6 : 718
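Note that the counting loop recomputes what `value_counts` already returned; the counts can be read straight off the Series. A minimal sketch on toy data (the listings below are invented for illustration):

```python
import pandas as pd

# toy listings standing in for autos['brand_model'] (entries invented)
brand_model = pd.Series(['volkswagen-golf', 'volkswagen-golf', 'bmw-3er',
                         'volkswagen-golf', 'bmw-3er'])

# value_counts already returns the counts sorted in descending order
top = brand_model.value_counts()
for combo, n in top.items():
    print(combo + ' : ' + str(n))
```

On the real data, `autos['brand_model'].value_counts().head(10)` would print the same ranking in one pass, with no dictionary or tuple sort.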
*We see that volkswagen-golf, bmw-3er and volkswagen-passat are the most popular brand-model listings on the website.*
We explored the data from 'eBay Kleinanzeigen' and, after thorough cleaning of the dataset, we can draw the following conclusions from our analysis:

* From the ad_created, last_seen and date_crawled columns we see that the website is very active, and these columns show their highest activity during the months of March-April, hence that period seems to be popular for car sales.
* … vehicle_type listed for sale on the website.
* Volkswagen is the most popular car brand listed on the website, taking up around 20% of the listings.
* Mercedes-Benz is the costliest car brand listed on the website.
* BMW has the highest mean mileage amongst the brands listed on the website.