In this project, I will work with a dataset of used cars from eBay Kleinanzeigen, the classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle by user orgesleka. The original dataset is no longer available on Kaggle, but it can be found here.
The project's aim is to clean the data and analyze the included used car listings.
The data dictionary provided with the data is as follows:
Column name | Description |
---|---|
dateCrawled | When this ad was first crawled. All field-values are taken from this date |
name | Name of the car |
seller | Whether the seller is private or a dealer |
offerType | The type of listing |
price | The price on the ad to sell the car |
abtest | Whether the listing is included in an A/B test |
vehicleType | The vehicle Type |
yearOfRegistration | The year in which the car was first registered |
gearbox | The transmission type |
powerPS | The power of the car in PS |
model | The car model name |
odometer | How many kilometers the car has driven |
monthOfRegistration | The month in which the car was first registered |
fuelType | What type of fuel the car uses |
brand | The brand of the car |
notRepairedDamage | If the car has a damage which is not yet repaired |
dateCreated | The date on which the eBay listing was created |
nrOfPictures | The number of pictures in the ad |
postalCode | The postal code for the location of the vehicle |
lastSeen | When the crawler saw this ad last online |
I will start by importing the libraries I need and reading in the data file.
#Importing the essential libraries
import pandas as pd
import numpy as np
#Reading in the file; it is not UTF-8 encoded, so we specify Latin-1
autos = pd.read_csv('autos.csv', encoding='Latin-1')
#Exploring basic info on the dataset
print(autos.info())
print(autos.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64
 18  postalCode           50000 non-null  int64
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
None
           dateCrawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...

   seller offerType   price   abtest vehicleType  yearOfRegistration  \
0  privat   Angebot  $5,000  control         bus                2004
1  privat   Angebot  $8,500  control   limousine                1997
2  privat   Angebot  $8,990     test   limousine                2009
3  privat   Angebot  $4,350  control  kleinwagen                2007
4  privat   Angebot  $1,350     test       kombi                2003

     gearbox  powerPS   model   odometer  monthOfRegistration fuelType  \
0    manuell      158  andere  150,000km                    3      lpg
1  automatik      286     7er  150,000km                    6   benzin
2    manuell      102    golf   70,000km                    7   benzin
3  automatik       71  fortwo   70,000km                    6   benzin
4    manuell        0   focus  150,000km                    7   benzin

        brand notRepairedDamage           dateCreated  nrOfPictures  \
0     peugeot              nein  2016-03-26 00:00:00             0
1         bmw              nein  2016-04-04 00:00:00             0
2  volkswagen              nein  2016-03-26 00:00:00             0
3       smart              nein  2016-03-12 00:00:00             0
4        ford              nein  2016-04-01 00:00:00             0

   postalCode             lastSeen
0       79588  2016-04-06 06:45:54
1       71034  2016-04-06 14:45:08
2       35394  2016-04-06 20:15:37
3       33729  2016-03-15 03:16:28
4       39218  2016-04-01 14:38:50
#Creating a function to rename the columns from camelCase to snake_case
def clean_col(col):
    col = col.replace("yearOfRegistration", "registration_year")
    col = col.replace("monthOfRegistration", "registration_month")
    col = col.replace("notRepairedDamage", "unrepaired_damage")
    col = col.replace("dateCreated", "ad_created")
    col = col.replace("offerType", "offer_type")
    col = col.replace("vehicleType", "vehicle_type")
    col = col.replace("powerPS", "power_ps")
    col = col.replace("fuelType", "fuel_type")
    col = col.replace("nrOfPictures", "nr_of_pictures")
    col = col.replace("postalCode", "postal_code")
    col = col.replace("lastSeen", "last_seen")
    col = col.replace("dateCrawled", "date_crawled")
    return col
#Using the function to build a list of cleaned column names
new_columns = []
for column in autos.columns:
    column = clean_col(column)
    new_columns.append(column)
#Assigning the modified column names
autos.columns = new_columns
#Checking the result
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
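As an aside, the same renaming can be done in a single call by passing a mapping dictionary to DataFrame.rename. A minimal sketch, equivalent to the loop above (not part of the original code):
#Sketch: renaming all columns at once from a mapping dictionary
col_map = {
    "dateCrawled": "date_crawled",
    "offerType": "offer_type",
    "vehicleType": "vehicle_type",
    "yearOfRegistration": "registration_year",
    "powerPS": "power_ps",
    "monthOfRegistration": "registration_month",
    "fuelType": "fuel_type",
    "notRepairedDamage": "unrepaired_damage",
    "dateCreated": "ad_created",
    "nrOfPictures": "nr_of_pictures",
    "postalCode": "postal_code",
    "lastSeen": "last_seen",
}
autos = autos.rename(columns=col_map)  #columns absent from the map are kept as-is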
Checking the data further reveals a second issue: several columns have almost no variation in their values. This can be seen in the basic statistics for each column, which I print below.
#Describing both the numeric and categorical columns
autos.describe(include = "all")
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-05 16:57:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
We can see that the nr_of_pictures column seems to have only one unique value. In addition, five columns have only two unique values each. We will check whether, for any of them, one of the two values is so rare that the column can be dropped. Below I print the value counts for each of these columns.
print(autos["nr_of_pictures"].value_counts())
print(autos["abtest"].value_counts())
print(autos["seller"].value_counts())
print(autos["offer_type"].value_counts())
print(autos["gearbox"].value_counts())
print(autos["unrepaired_damage"].value_counts())
0    50000
Name: nr_of_pictures, dtype: int64
test       25756
control    24244
Name: abtest, dtype: int64
privat        49999
gewerblich        1
Name: seller, dtype: int64
Angebot    49999
Gesuch         1
Name: offer_type, dtype: int64
manuell      36993
automatik    10327
Name: gearbox, dtype: int64
nein    35232
ja       4939
Name: unrepaired_damage, dtype: int64
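As a side note, the six checks above can be collapsed into a loop; a small sketch over the same columns:
#Sketch: the same value-count checks in a single loop
for col in ["nr_of_pictures", "abtest", "seller",
            "offer_type", "gearbox", "unrepaired_damage"]:
    print(autos[col].value_counts(), "\n")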
Two of the columns we checked, offer_type and seller, have the same value in all but one row. This means they do not offer any substantial criterion for analyzing our data. We will drop both columns, together with nr_of_pictures.
autos = autos.drop(["nr_of_pictures","seller", "offer_type"], axis=1)
A third issue is visible from the column descriptions printed above: the price and odometer columns contain commas and "$" and "km" signs that prevent conversion to numeric values. We will have to remove these characters and convert the values to integers. To keep the values in these columns understandable, we will add measurement units to the column names.
#Removing odd characters, converting and renaming the price column
autos["price"] = (autos["price"].str.replace("$", "", regex=False)
                  .str.replace(",", "", regex=False).astype(int)
                  )
autos.rename({"price": "price_usd"}, axis=1, inplace=True)
#Removing odd characters, converting and renaming the odometer column
autos["odometer"] = (autos["odometer"].str.replace(",", "", regex=False)
                     .str.replace("km", "", regex=False).astype(int)
                     )
autos.rename({"odometer": "odometer_km"}, axis=1, inplace=True)
#Checking the result
print(autos.head())
          date_crawled                                               name  \
0  2016-03-26 17:47:46                   Peugeot_807_160_NAVTECH_ON_BOARD
1  2016-04-04 13:38:56         BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik
2  2016-03-26 18:57:24                         Volkswagen_Golf_1.6_United
3  2016-03-12 16:58:10  Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan...
4  2016-04-01 14:38:50  Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg...

   price_usd   abtest vehicle_type  registration_year    gearbox  power_ps  \
0       5000  control          bus               2004    manuell       158
1       8500  control    limousine               1997  automatik       286
2       8990     test    limousine               2009    manuell       102
3       4350  control   kleinwagen               2007  automatik        71
4       1350     test        kombi               2003    manuell         0

    model  odometer_km  registration_month fuel_type       brand  \
0  andere       150000                   3       lpg     peugeot
1     7er       150000                   6    benzin         bmw
2    golf        70000                   7    benzin  volkswagen
3  fortwo        70000                   6    benzin       smart
4   focus       150000                   7    benzin        ford

  unrepaired_damage           ad_created  postal_code            last_seen
0              nein  2016-03-26 00:00:00        79588  2016-04-06 06:45:54
1              nein  2016-04-04 00:00:00        71034  2016-04-06 14:45:08
2              nein  2016-03-26 00:00:00        35394  2016-04-06 20:15:37
3              nein  2016-03-12 00:00:00        33729  2016-03-15 03:16:28
4              nein  2016-04-01 00:00:00        39218  2016-04-01 14:38:50
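For reference, the two replace calls per column can be collapsed into one regex pass. A minimal sketch on toy data (the sample values below are illustrative, not taken from the dataset):
#Sketch: one-pass cleanup with a regex instead of two literal replaces
sample_price = pd.Series(["$5,000", "$8,990"])
print(sample_price.str.replace(r"[$,]", "", regex=True).astype(int))
sample_km = pd.Series(["150,000km", "70,000km"])
print(sample_km.str.replace(r",|km", "", regex=True).astype(int))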
We will now look further into the price_usd and odometer_km columns to check for any unrealistic values.
#How many unique values
autos["price_usd"].unique().shape
(2357,)
#Descriptive stats
autos["price_usd"].describe()
count    5.000000e+04
mean     9.840044e+03
std      4.811044e+05
min      0.000000e+00
25%      1.100000e+03
50%      2.950000e+03
75%      7.200000e+03
max      1.000000e+08
Name: price_usd, dtype: float64
#15 highest values with their frequencies
autos["price_usd"].value_counts().sort_index(ascending=False).head(15)
99999999    1
27322222    1
12345678    3
11111111    2
10000000    1
3890000     1
1300000     1
1234566     1
999999      2
999990      1
350000      1
345000      1
299000      1
295000      1
265000      1
Name: price_usd, dtype: int64
Several values in the price column are above USD 10 million and were apparently entered at random (e.g., 12345678). The highest plausible price in the list is USD 350,000. We will cut off all prices above this number. But first, let's check the lowest prices.
autos["price_usd"].value_counts().sort_index(ascending=True).head(15)
0     1421
1      156
2        3
3        1
5        2
8        1
9        1
10       7
11       2
12       3
13       2
14       1
15       2
17       3
18       1
Name: price_usd, dtype: int64
There are 1421 entries where the price equals 0. Since eBay is an auction site where bidding starts at USD 1, a price of zero doesn't make sense, so we'll remove these rows.
#Only keep the prices between 1 and 350 000
autos = autos[autos["price_usd"].between(1,350000)]
#Check
autos["price_usd"].value_counts().sort_index(ascending=False)
350000      1
345000      1
299000      1
295000      1
265000      1
         ...
8           1
5           2
3           1
2           3
1         156
Name: price_usd, Length: 2346, dtype: int64
#Check the whole column again
autos["price_usd"].describe()
count     48565.000000
mean       5888.935591
std        9059.854754
min           1.000000
25%        1200.000000
50%        3000.000000
75%        7490.000000
max      350000.000000
Name: price_usd, dtype: float64
Now we'll look at the odometer column in the same way.
autos["odometer_km"].describe()
count     48565.000000
mean     125770.101925
std       39788.636804
min        5000.000000
25%      125000.000000
50%      150000.000000
75%      150000.000000
max      150000.000000
Name: odometer_km, dtype: float64
The min and max values seem reasonable. The mean is about 125,770 km, so high-mileage cars predominate. Let's check whether anything is wrong with the value frequencies.
autos["odometer_km"].value_counts(ascending=False)
150000    31414
125000     5057
100000     2115
90000      1734
80000      1415
70000      1217
60000      1155
50000      1012
5000        836
40000       815
30000       780
20000       762
10000       253
Name: odometer_km, dtype: int64
autos["odometer_km"].value_counts(ascending=True).head(15)
10000       253
20000       762
30000       780
40000       815
5000        836
50000      1012
60000      1155
70000      1217
80000      1415
90000      1734
100000     2115
125000     5057
150000    31414
Name: odometer_km, dtype: int64
Our price_usd and odometer_km columns are in order now, so we can move on. Two main issues remain: the date columns (date_crawled, ad_created, last_seen) are stored as strings, and the registration_year column contains implausible values. We will first look at the three string columns.
autos[["date_crawled","ad_created","last_seen"]][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
We can see that the first 10 characters of each cell represent the date. To check the distribution of values by date, we will extract these characters and create frequency tables, in percentages, of the ads crawled, created and last seen on each day.
(autos["date_crawled"]
.str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index()
)
2016-03-05    0.025327
2016-03-06    0.014043
2016-03-07    0.036014
2016-03-08    0.033296
2016-03-09    0.033090
2016-03-10    0.032184
2016-03-11    0.032575
2016-03-12    0.036920
2016-03-13    0.015670
2016-03-14    0.036549
2016-03-15    0.034284
2016-03-16    0.029610
2016-03-17    0.031628
2016-03-18    0.012911
2016-03-19    0.034778
2016-03-20    0.037887
2016-03-21    0.037373
2016-03-22    0.032987
2016-03-23    0.032225
2016-03-24    0.029342
2016-03-25    0.031607
2016-03-26    0.032204
2016-03-27    0.031092
2016-03-28    0.034860
2016-03-29    0.034099
2016-03-30    0.033687
2016-03-31    0.031834
2016-04-01    0.033687
2016-04-02    0.035478
2016-04-03    0.038608
2016-04-04    0.036487
2016-04-05    0.013096
2016-04-06    0.003171
2016-04-07    0.001400
Name: date_crawled, dtype: float64
The distribution seems reasonable, with fewer entries crawled on the last few days, probably because the crawl was winding down.
(autos["ad_created"]
.str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index()
)
2015-06-11    0.000021
2015-08-10    0.000021
2015-09-09    0.000021
2015-11-10    0.000021
2015-12-05    0.000021
                ...
2016-04-03    0.038855
2016-04-04    0.036858
2016-04-05    0.011819
2016-04-06    0.003253
2016-04-07    0.001256
Name: ad_created, Length: 76, dtype: float64
(autos["last_seen"]
.str[:10]
.value_counts(normalize=True, dropna=False)
.sort_index()
)
2016-03-05    0.001071
2016-03-06    0.004324
2016-03-07    0.005395
2016-03-08    0.007413
2016-03-09    0.009595
2016-03-10    0.010666
2016-03-11    0.012375
2016-03-12    0.023783
2016-03-13    0.008895
2016-03-14    0.012602
2016-03-15    0.015876
2016-03-16    0.016452
2016-03-17    0.028086
2016-03-18    0.007351
2016-03-19    0.015834
2016-03-20    0.020653
2016-03-21    0.020632
2016-03-22    0.021373
2016-03-23    0.018532
2016-03-24    0.019767
2016-03-25    0.019211
2016-03-26    0.016802
2016-03-27    0.015649
2016-03-28    0.020859
2016-03-29    0.022341
2016-03-30    0.024771
2016-03-31    0.023783
2016-04-01    0.022794
2016-04-02    0.024915
2016-04-03    0.025203
2016-04-04    0.024483
2016-04-05    0.124761
2016-04-06    0.221806
2016-04-07    0.131947
Name: last_seen, dtype: float64
Nothing looks wrong with these distributions. The only thing left to do is convert these dates into numeric data. We'll do that by deleting the dashes between the numbers and changing the data type to integer.
autos["date_crawled"] = (autos["date_crawled"]
.str.split()
.str[0]
.str.replace("-","")
.astype(int)
)
print("date_crawled", "\n", autos["date_crawled"].head(10))
autos["ad_created"] = (autos["ad_created"]
.str.split()
.str[0]
.str.replace("-","")
.astype(int)
)
print("ad_created", "\n", autos["ad_created"].head(10))
autos["last_seen"] = (autos["last_seen"]
.str.split()
.str[0]
.str.replace("-","")
.astype(int)
)
print("last_seen", "\n", autos["last_seen"].head(10))
date_crawled
0    20160326
1    20160404
2    20160326
3    20160312
4    20160401
5    20160321
6    20160320
7    20160316
8    20160322
9    20160316
Name: date_crawled, dtype: int64
ad_created
0    20160326
1    20160404
2    20160326
3    20160312
4    20160401
5    20160321
6    20160320
7    20160316
8    20160322
9    20160316
Name: ad_created, dtype: int64
last_seen
0    20160406
1    20160406
2    20160406
3    20160315
4    20160401
5    20160406
6    20160323
7    20160407
8    20160326
9    20160406
Name: last_seen, dtype: int64
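Worth noting, although I won't use it here: pd.to_datetime parses the full timestamps into real datetime values, which support date arithmetic that YYYYMMDD integers do not. A toy sketch:
#Sketch: parsing timestamps with pd.to_datetime instead of integer codes
ts = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])
print(pd.to_datetime(ts).dt.date)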
Now to the issues with the registration_year column, where we noticed implausible min (1000) and max (9999) values. Let's check again.
autos["registration_year"].describe()
count    48565.000000
mean      2004.755421
std         88.643887
min       1000.000000
25%       1999.000000
50%       2004.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
#Looking closer at the 15 greatest values
(autos["registration_year"]
.value_counts()
.sort_index(ascending=False)
.head(15)
)
9999       3
9000       1
8888       1
6200       1
5911       1
5000       4
4800       1
4500       1
4100       1
2800       1
2019       2
2018     470
2017    1392
2016    1220
2015     392
Name: registration_year, dtype: int64
It is clear that any registration year above 2016, the year the listings were crawled, must be an arbitrary entry, and those rows need to be removed. Now for the lowest values.
(autos["registration_year"]
.value_counts()
.sort_index(ascending=True)
.head(15)
)
1000    1
1001    1
1111    1
1800    2
1910    5
1927    1
1929    1
1931    1
1934    2
1937    4
1938    1
1939    1
1941    2
1943    1
1948    1
Name: registration_year, dtype: int64
A registration year of 1910 might still be plausible for a very, very old car. Years of 1800 and below are impossible, though, since they predate the automobile. These rows must be removed as well.
#Only keep the years between 1910 and 2016
autos = autos[autos["registration_year"].between(1910,2016)]
#Checking the resulting distribution
print(autos["registration_year"].value_counts(normalize=True))
print("\n")
#Checking the resulting column
print(autos["registration_year"].describe())
2000    0.067608
2005    0.062895
1999    0.062060
2004    0.057904
2003    0.057818
          ...
1929    0.000021
1931    0.000021
1938    0.000021
1939    0.000021
1952    0.000021
Name: registration_year, Length: 78, dtype: float64

count    46681.000000
mean      2002.910756
std          7.185103
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64
We went from 48565 entries to 46681, losing around 4% of them. That is not a significant amount, so we will proceed with the reduced dataset, now free of extreme values skewing our calculations.
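As a quick sanity check on that percentage:
#Verifying the share of entries lost to the year filter
print("{:.1%}".format(1 - 46681/48565))  #prints 3.9%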
Some of the categorical data in the dataset is in German. To make my analysis accessible to English-language readers, I'll translate it.
First, let's display the unique values of the columns we need to translate.
print(autos["vehicle_type"].unique())
print(autos["gearbox"].unique())
print(autos["fuel_type"].unique())
print(autos["unrepaired_damage"].unique())
['bus' 'limousine' 'kleinwagen' 'kombi' nan 'coupe' 'suv' 'cabrio' 'andere']
['manuell' 'automatik' nan]
['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere']
['nein' nan 'ja']
#Creating translation dictionaries
veh_type = {
    "bus": "bus",
    "limousine": "limousine",
    "kleinwagen": "small car",
    "kombi": "station wagon",
    "coupe": "coupe",
    "suv": "suv",
    "cabrio": "cabrio",
    "andere": "other"
}
gearbox_tr = {
    "manuell": "manual",
    "automatik": "automatic"
}
fuel_tr = {
    "lpg": "lpg",
    "diesel": "diesel",
    "cng": "cng",
    "benzin": "gasoline",
    "hybrid": "hybrid",
    "elektro": "electric",
    "andere": "other"
}
damage_tr = {
    "nein": "no",
    "ja": "yes"
}
#Mapping the translations into the columns
autos["vehicle_type"] = autos["vehicle_type"].map(veh_type)
autos["gearbox"] = autos["gearbox"].map(gearbox_tr)
autos["fuel_type"] = autos["fuel_type"].map(fuel_tr)
autos["unrepaired_damage"] = autos["unrepaired_damage"].map(damage_tr)
#Checking
print(autos["vehicle_type"].unique())
print(autos["gearbox"].unique())
print(autos["fuel_type"].unique())
print(autos["unrepaired_damage"].unique())
['bus' 'limousine' 'small car' 'station wagon' nan 'coupe' 'suv' 'cabrio' 'other']
['manual' 'automatic' nan]
['lpg' 'gasoline' 'diesel' nan 'cng' 'hybrid' 'electric' 'other']
['no' nan 'yes']
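One detail worth flagging: Series.map sends any value missing from the mapping to NaN, which is why identity entries such as "bus": "bus" are included in the dictionaries above. Series.replace, by contrast, leaves unmapped values untouched. A toy sketch of the difference:
#Sketch: map() turns unmapped values into NaN, replace() keeps them
s = pd.Series(["manuell", "automatik"])
print(s.map({"manuell": "manual"}))      #automatik becomes NaN
print(s.replace({"manuell": "manual"}))  #automatik is kept as-is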
Below we will find out:
  * which car brands are the most popular;
  * what the mean price and mean mileage are for each popular brand;
  * which model is the most common for each popular brand;
  * how much cheaper cars with unrepaired damage are.
#Showing top 10 brands in the dataset
popular_brands = (autos["brand"]
.value_counts(normalize=True)
.head(10)
.index
)
print(popular_brands)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat'], dtype='object')
The five most popular brands are German, which is not surprising for the German eBay site. Now we will calculate the mean price for each of the ten popular brands.
#Computing mean price by brand
mean_price_by_brand = {}
for b in popular_brands:
    selected_rows = autos[autos["brand"] == b]
    mean_price = selected_rows["price_usd"].mean()
    mean_price_by_brand[b] = mean_price
for i in mean_price_by_brand:
    print(i, ":", mean_price_by_brand[i])
volkswagen : 5402.410261610221
bmw : 8332.820517811953
opel : 2975.2419354838707
mercedes_benz : 8628.450366422385
audi : 9336.687453600594
ford : 3749.4695065890287
renault : 2474.8646069968195
peugeot : 3094.0172290021537
fiat : 2813.748538011696
seat : 4397.230949589683
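The same aggregation is a one-liner with groupby; a sketch equivalent to the loop above:
#Sketch: mean price per popular brand via groupby
print(autos.groupby("brand")["price_usd"].mean().loc[popular_brands])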
Now that we have mean prices by brand, we can group the car brands into expensive, medium and cheap tiers, anchoring the cut-offs on the descriptive statistics we pulled for the price_usd column above.
#Grouping brands by mean price
expensive_brands = []
medium_low_brands = []
medium_upper_brands = []
low_cost_brands = []
for brand in mean_price_by_brand:
    if mean_price_by_brand[brand] <= 3000:
        low_cost_brands.append(brand)
    elif 3000 < mean_price_by_brand[brand] <= 5000:
        medium_low_brands.append(brand)
    elif 5000 < mean_price_by_brand[brand] <= 8000:
        medium_upper_brands.append(brand)
    elif 8000 < mean_price_by_brand[brand]:
        expensive_brands.append(brand)
print("Expensive car brands: ", expensive_brands)
print("\n")
print("Medium-high cost brands: ", medium_upper_brands)
print("\n")
print("Medium-low cost brands: ", medium_low_brands)
print("\n")
print("Low cost brands: ", low_cost_brands)
Expensive car brands:  ['bmw', 'mercedes_benz', 'audi']

Medium-high cost brands:  ['volkswagen']

Medium-low cost brands:  ['ford', 'peugeot', 'seat']

Low cost brands:  ['opel', 'renault', 'fiat']
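As an aside, pd.cut can perform the same binning more compactly; a sketch using the same break points as the if/elif chain above:
#Sketch: binning the brand means with pd.cut instead of an if/elif chain
brand_bins = pd.cut(pd.Series(mean_price_by_brand),
                    [0, 3000, 5000, 8000, float("inf")],
                    labels=["low cost", "medium-low", "medium-high", "expensive"])
print(brand_bins.sort_values())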
The most popular brand, Volkswagen, is in the middle price category, which seems reasonable. Three of the five most popular brands, BMW, Mercedes-Benz and Audi, also turn out to be among the most expensive ones. This is probably because cars of these brands are of better quality and last longer, so they are more likely to be resold after use. Let's check: we will now calculate the mean mileage for the most popular brands and see whether it relates to the price distribution.
#Creating a dictionary with mean mileage by brand
mean_km_brand = {}
for b in popular_brands:
    selected_rows = autos[autos["brand"] == b]
    mean_km = selected_rows["odometer_km"].mean()
    mean_km_brand[b] = mean_km
for i in mean_km_brand:
    print(i, ":", mean_km_brand[i])
volkswagen : 128707.15879132022
bmw : 132572.51313996495
opel : 129310.0358422939
mercedes_benz : 130788.36331334666
audi : 129157.38678544914
ford : 124266.01287159056
renault : 128071.33121308497
peugeot : 127153.62526920316
fiat : 117121.9715956558
seat : 121131.30128956624
Below we will create a dataframe with mean price and mean mileage columns in order to make comparison easier.
#Converting price and mileage dictionaries into series
price_series = pd.Series(mean_price_by_brand)
km_series = pd.Series(mean_km_brand)
#Creating dataframe from price series
price_km_df = pd.DataFrame(price_series, columns = ["mean_price"])
#Adding odometer column to the dataframe
price_km_df["odometer_km"] = km_series
#Checking the result
print(price_km_df)
                mean_price    odometer_km
volkswagen     5402.410262  128707.158791
bmw            8332.820518  132572.513140
opel           2975.241935  129310.035842
mercedes_benz  8628.450366  130788.363313
audi           9336.687454  129157.386785
ford           3749.469507  124266.012872
renault        2474.864607  128071.331213
peugeot        3094.017229  127153.625269
fiat           2813.748538  117121.971596
seat           4397.230950  121131.301290
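As a side note, the same table can be built in a single constructor call; a sketch equivalent to the three steps above:
#Sketch: building the comparison dataframe in one step from both series
print(pd.DataFrame({"mean_price": price_series, "odometer_km": km_series}))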
The table alone is still hard to interpret. Let's try the same grouping method we used on prices and see whether the groups coincide.
#Grouping brands by mean mileage
high_mileage = []
medium_mileage = []
low_mileage = []
for brand in mean_km_brand:
    if mean_km_brand[brand] <= 125000:
        low_mileage.append(brand)
    elif 125000 < mean_km_brand[brand] <= 130000:
        medium_mileage.append(brand)
    elif 130000 < mean_km_brand[brand]:
        high_mileage.append(brand)
print("High mileage brands: ", high_mileage)
print("\n")
print("Medium mileage brands: ", medium_mileage)
print("\n")
print("Low mileage brands: ", low_mileage)
High mileage brands:  ['bmw', 'mercedes_benz']

Medium mileage brands:  ['volkswagen', 'opel', 'audi', 'renault', 'peugeot']

Low mileage brands:  ['ford', 'fiat', 'seat']
Our hypothesis is largely confirmed: the two most expensive brands, BMW and Mercedes-Benz, also have the highest mean mileage, although Audi falls into the medium-mileage group. As I speculated above, this might be because expensive cars are more reliable and serve longer, so they can be resold several times without a significant loss in value.
Below I will calculate the most popular model for each of the popular car brands.
brand_model = {}
for b in popular_brands:
    selected_rows = autos[autos["brand"] == b]
    top_model = selected_rows["model"].value_counts().index[0]
    brand_model[b] = top_model
for i in brand_model:
    print(i, ":", brand_model[i])
volkswagen : golf
bmw : 3er
opel : corsa
mercedes_benz : c_klasse
audi : a4
ford : focus
renault : twingo
peugeot : 2_reihe
fiat : punto
seat : ibiza
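For reference, a groupby version of the same lookup; value_counts().idxmax() picks the most frequent model within each brand:
#Sketch: most frequent model per popular brand via groupby
top_models = (autos[autos["brand"].isin(popular_brands)]
              .groupby("brand")["model"]
              .agg(lambda models: models.value_counts().idxmax())
              )
print(top_models)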
How much cheaper are cars with damage than their non-damaged counterparts? Let's find out. I will calculate the average price for cars with and without unrepaired damage.
damaged = autos["unrepaired_damage"].dropna().unique()
mean_price_damage = {}
for i in damaged:
    selected_rows = autos[autos["unrepaired_damage"] == i]
    mean_price = selected_rows["price_usd"].mean()
    mean_price_damage[i] = mean_price
for i in mean_price_damage:
    print(i, ":", mean_price_damage[i])
no : 7164.033102796004
yes : 2241.146035242291
price_difference = 100 - ((mean_price_damage["yes"]/mean_price_damage["no"])*100)
print("The damaged cars are", "{:.2f}".format(price_difference), "% cheaper than the undamaged ones.")
The damaged cars are 68.72 % cheaper than the undamaged ones.
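For completeness, groupby yields the same two means directly; a sketch:
#Sketch: mean price by damage status via groupby
print(autos.groupby("unrepaired_damage")["price_usd"].mean())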
I have cleaned the autos dataset by resolving six different issues: inconsistent column names, near-constant columns, numeric data stored as text, date columns stored as strings, and unrealistic prices and registration years. The analysis of the dataset allows me to make the following conclusions:
  * Volkswagen is the most popular brand, and the five most popular brands are all German.
  * BMW, Mercedes-Benz and Audi are among both the most popular and the most expensive brands, and the expensive brands also tend to have higher mileage.
  * Cars with unrepaired damage are on average about 69% cheaper than their undamaged counterparts.