In this guided project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The dataset was originally scraped and uploaded to Kaggle, but in this project we're going to use a sample dataset with 50,000 data points that was prepared by dataquest.
The data dictionary provided with data is as follows:
column | description |
---|---|
dateCrawled | When this ad was first crawled. All field-values are taken from this date. |
name | Name of the car. |
seller | Whether the seller is private or a dealer. |
offerType | The type of listing |
price | The price on the ad to sell the car. |
abtest | Whether the listing is included in an A/B test. |
vehicleType | The vehicle Type. |
yearOfRegistration | The year in which the car was first registered. |
gearbox | The transmission type. |
powerPS | The power of the car in PS. |
model | The car model name. |
kilometer | How many kilometers the car has driven. |
monthOfRegistration | The month in which the car was first registered. |
fuelType | What type of fuel the car uses. |
brand | The brand of the car. |
notRepairedDamage | If the car has a damage which is not yet repaired. |
dateCreated | The date on which the eBay listing was created. |
nrOfPictures | The number of pictures in the ad. |
postalCode | The postal code for the location of the vehicle. |
lastSeenOnline | When the crawler saw this ad last online. |
The aim of this project is to clean the data and analyze the included used car listings. Also, we want to explore some of the unique benefits jupyter notebook provides for pandas. And of course, we want to have fun doing it!
So let's get started. As usual, we'll start by importing libraries, reading the dataset, and doing some exploration
#Import pandas and NumPy libraries
import pandas as pd
import numpy as np
#Read autos.csv into pandas
autos = pd.read_csv('autos.csv', encoding = 'Latin-1')
#The default encoding = 'utf-8' doesn't work, so we had to try another encoding. In this case we use 'Latin-1'
# A neat feature of jupyter notebook is its ability
#to render the first few and last few values of any pandas object.
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info() # print information about the autos dataframe
autos.head() #print first few rows
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
* note: we know this by looking at autos.info(), RangeIndex is 50,000 entries, and column with the least amount of non-null count is notRepairedDamage,
with 40,171.
So what are waiting for? Let's clean the dataset!
Let's convert the column names from camelcase to snakecase and reword some of the column names based on the data dictionary to be more descriptive.
#print an array of the existing column names.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
#rename the columns
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
'last_seen']
#verify the change
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
As we can see, now the column names are on snake_case format. Woohoo! Next!
It is a shame that I don't speak German. I need to take some class to understand some of the words in this dataset. :(
But luckily, there is google translate! So by using google translate, I am going to translate all the German words in this dataset to English
autos["seller"]
print('-Before-')
print(autos["seller"].unique()) #print unique values
autos["seller"] = (autos["seller"].str.replace('privat', 'private')
.str.replace('gewerblich', 'commercial')
)
print('\n')
print('-After-')
print(autos["seller"].unique()) #verify the change
-Before- ['privat' 'gewerblich'] -After- ['private' 'commercial']
autos["offer_type"]
print('-Before-')
print(autos["offer_type"].unique()) #print unique values
autos["offer_type"] = (autos["offer_type"].str.replace('Angebot', 'offer')
.str.replace('Gesuch', 'request')
)
print('\n')
print('-After-')
print(autos["offer_type"].unique()) #verify the change
-Before- ['Angebot' 'Gesuch'] -After- ['offer' 'request']
autos["vehicle_type"]
print('-Before-')
print(autos["vehicle_type"].unique()) #print unique values
autos["vehicle_type"] = (autos["vehicle_type"].str.replace('kleinwagen', 'small car')
.str.replace('kombi', 'wagon')
.str.replace('cabrio', 'convertible')
.str.replace('andere', 'other')
)
print('\n')
print('-After-')
print(autos["vehicle_type"].unique()) #verify the change
-Before- ['bus' 'limousine' 'kleinwagen' 'kombi' nan 'coupe' 'suv' 'cabrio' 'andere'] -After- ['bus' 'limousine' 'small car' 'wagon' nan 'coupe' 'suv' 'convertible' 'other']
autos["gearbox"]
print('-Before-')
print(autos["gearbox"].unique()) #print unique values
autos["gearbox"] = (autos["gearbox"].str.replace('manuell', 'manual')
.str.replace('automatik', 'automatic')
.str.replace('cabrio', 'convertible')
.str.replace('andere', 'other')
)
print('\n')
print('-After-')
print(autos["gearbox"].unique()) #verify the change
-Before- ['manuell' 'automatik' nan] -After- ['manual' 'automatic' nan]
autos["unrepaired_damage"]
print('-Before-')
print(autos["unrepaired_damage"].unique()) #print unique values
autos["unrepaired_damage"] = (autos["unrepaired_damage"].str.replace('nein', 'no')
.str.replace('ja', 'yes')
)
print('\n')
print('-After-')
print(autos["unrepaired_damage"].unique()) #verify the change
-Before- ['nein' nan 'ja'] -After- ['no' nan 'yes']
autos["fuel_type"]
print('-Before-')
print(autos["fuel_type"].unique()) #print unique values
autos["fuel_type"] = (autos["fuel_type"].str.replace('benzin', 'gas')
.str.replace('elektro', 'electric')
.str.replace('andere', 'other')
)
print('\n')
print('-After-')
print(autos["fuel_type"].unique()) #verify the change
-Before- ['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere'] -After- ['lpg' 'gas' 'diesel' nan 'cng' 'hybrid' 'electric' 'other']
Alright! So we have translated all the German words!
Now we have a better understanding of the dataset, which will help us when we explore the dataset even further, and do our analysis
Onto the next one!
By looking at the table above, we can see that odometer and price columns are numeric values stored as text (we can see that on the row 'top' above, price has '$ and'odometer has 'km').
We don't want odometer and price columns to be string, because we won't be able to sort it. So you know what? let's clean those columns! Remove the non-numeric characters and convert the data types of those two columns to int.
print(autos.dtypes) #verify data types
date_crawled object name object seller object offer_type object price object ab_test object vehicle_type object registration_year int64 gearbox object power_ps int64 model object odometer object registration_month int64 fuel_type object brand object unrepaired_damage object ad_created object nr_of_pictures int64 postal_code int64 last_seen object dtype: object
autos["price"].unique() #print unique values
array(['$5,000', '$8,500', '$8,990', ..., '$385', '$22,200', '$16,995'], dtype=object)
autos["odometer"].unique() #print unique values
array(['150,000km', '70,000km', '50,000km', '80,000km', '10,000km', '30,000km', '125,000km', '90,000km', '20,000km', '60,000km', '5,000km', '100,000km', '40,000km'], dtype=object)
#replace non-numeric values
autos["price"] = (autos["price"].str.replace('$','')
.str.replace(',','')
.astype(int)
)
autos["price"].head() #verify change
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price, dtype: int64
autos["odometer"] = (autos["odometer"].str.replace('km','')
.str.replace(',','')
.astype(int)
)
autos["odometer"].head() #verify change
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer, dtype: int64
#rename columns
autos.rename({"odometer": "odometer_km"}, axis = 1, inplace = True)
autos.rename({"price": "price_usd"}, axis = 1, inplace = True)
autos.describe(include = 'all') #verify the changes
date_crawled | name | seller | offer_type | price_usd | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 5.000000e+04 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000.000000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | NaN | 2 | 8 | NaN | 2 | NaN | 245 | NaN | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-23 18:39:34 | Ford_Fiesta | private | offer | NaN | test | limousine | NaN | manual | NaN | golf | NaN | NaN | gas | volkswagen | no | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | NaN | 25756 | 12859 | NaN | 36993 | NaN | 4024 | NaN | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | 9.840044e+03 | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | 125732.700000 | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | 4.811044e+05 | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | 40042.211706 | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | 0.000000e+00 | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | 1.100000e+03 | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | 2.950000e+03 | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | 7.200000e+03 | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | 1.000000e+08 | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Up until this point, we've done the following:
Let's continue exploring and cleaning the dataset, specifically looking for data that doesn't look right. For now, we'll continue working on the odometer_km and price columns.
We're going to analyze the columns using minimum and maximum values and look for outliers that we might want to remove.
autos["price_usd"]
print(autos["price_usd"].unique().shape) #to see how many unique values
print('\n')
print(autos["price_usd"].describe()) #to view min/max/median/mean etc
(2357,) count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_usd, dtype: float64
# count unique values of the 50 highest prices
autos["price_usd"].value_counts().sort_index(ascending= False).head(50)
#note: Series.value_counts() returns a series
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 194000 1 190000 1 180000 1 175000 1 169999 1 169000 1 163991 1 163500 1 155000 1 151990 1 145000 1 139997 1 137999 1 135000 1 130000 1 129000 1 128000 1 120000 2 119900 1 119500 1 116000 1 115991 1 115000 1 114400 1 109999 1 105000 2 104900 1 99900 2 99000 2 98500 1 Name: price_usd, dtype: int64
# count unique values of the 50 lowest prices
autos["price_usd"].value_counts().sort_index(ascending= False).tail(50)
140 9 139 1 135 1 130 15 129 1 125 8 122 1 120 39 117 1 115 2 111 2 110 3 100 134 99 19 90 5 89 1 80 15 79 1 75 5 70 10 66 1 65 5 60 9 59 1 55 2 50 49 49 4 47 1 45 4 40 6 35 1 30 7 29 1 25 5 20 4 18 1 17 3 15 2 14 1 13 2 12 3 11 2 10 7 9 1 8 1 5 2 3 1 2 3 1 156 0 1421 Name: price_usd, dtype: int64
If we look at the two series above, it appears that the outliers are listings with price \$0, and everything above \$350,000.
We want to remove \$0, because giving a car to strangers for free doesn't seem realistic at this day and age, right? :P
Also, we want to remove rows with prices > \$350,000, because after \$350,000 the prices that follows jumped significantly higher, which also doesn't seeem natural.
*note: cars with really low price (e.g. below \$10) makes sense, because eBay is an aution site. So we're going to keep it. Most likely the really low prices are just the starting bid.
#removing outliers
autos = autos[autos["price_usd"].between(1, 350000)]
#verify change
print('-head-')
print(autos["price_usd"].value_counts().sort_index(ascending= False).head())
print('\n')
print('-tail-')
print(autos["price_usd"].value_counts().sort_index(ascending= False).tail())
print('\n')
print('-describe-')
print(autos["price_usd"].describe())
-head- 350000 1 345000 1 299000 1 295000 1 265000 1 Name: price_usd, dtype: int64 -tail- 8 1 5 2 3 1 2 3 1 156 Name: price_usd, dtype: int64 -describe- count 48565.000000 mean 5888.935591 std 9059.854754 min 1.000000 25% 1200.000000 50% 3000.000000 75% 7490.000000 max 350000.000000 Name: price_usd, dtype: float64
Outliers on price_usd column has been removed. Yay! Now the column only contains car with price between \$1 and \$350,000
Let's move on!
autos["odometer_km"]
print('-unique-')
print(autos["odometer_km"].unique().shape)
print('\n')
print('-desribe-')
print(autos["odometer_km"].describe())
-unique- (13,) -desribe- count 48565.000000 mean 125770.101925 std 39788.636804 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
print(autos["odometer_km"].value_counts().sort_index(ascending= False))
150000 31414 125000 5057 100000 2115 90000 1734 80000 1415 70000 1217 60000 1155 50000 1012 40000 815 30000 780 20000 762 10000 253 5000 836 Name: odometer_km, dtype: int64
There doesn't seem to be any outlier in odometer_km column. Everything is good. So we'll just leave it at that. Next!
Let's now move on to the date columns and understand the date range the data covers.
There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:
date_crawled
: added by the crawlerlast_seen
: added by the crawlerad_created
: from the websiteregistration_month
: from the websiteregistration_year
: from the websiteRight now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are represented as numeric values, so we can use methods like Series.describe() to understand the distribution without any extra data processing.
Let's first understand how the values in the three string columns are formatted. These columns all represent full timestamp values:
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
We notice that the first 10 characters represent the date (e.g. 2016-03-12). To understand the date range, we can extract just the date values, use Series.value_counts() to generate a distribution, and then sort by the index.
Let's extract the date values! To select the first 10 characters in each column, we can use Series.str[:10]:
dateof_date_crawled
dateof_date_crawled = autos['date_crawled'].str[:10]
print('-dateof_date_crawled-')
print(dateof_date_crawled)
-dateof_date_crawled- 0 2016-03-26 1 2016-04-04 2 2016-03-26 3 2016-03-12 4 2016-04-01 ... 49995 2016-03-27 49996 2016-03-28 49997 2016-04-02 49998 2016-03-08 49999 2016-03-14 Name: date_crawled, Length: 48565, dtype: object
dateof_ad_created
dateof_ad_created = autos['ad_created'].str[:10]
print('-dateof_ad_created-')
print(dateof_ad_created)
-dateof_ad_created- 0 2016-03-26 1 2016-04-04 2 2016-03-26 3 2016-03-12 4 2016-04-01 ... 49995 2016-03-27 49996 2016-03-28 49997 2016-04-02 49998 2016-03-08 49999 2016-03-13 Name: ad_created, Length: 48565, dtype: object
dateof_last_seen
dateof_last_seen = autos['last_seen'].str[:10]
print('-dateof_last_seen-')
print(dateof_last_seen)
-dateof_last_seen- 0 2016-04-06 1 2016-04-06 2 2016-04-06 3 2016-03-15 4 2016-04-01 ... 49995 2016-04-01 49996 2016-04-02 49997 2016-04-04 49998 2016-04-05 49999 2016-04-06 Name: last_seen, Length: 48565, dtype: object
Almost done with data cleaning, almost! But for now, let's explore the dates and registration_year column even further to understand the data distribution
dateof_date_crawled
#sort_index() ranks date from in ascending order (earliest to latest)
dateof_date_crawled.value_counts(normalize=True, dropna=False).sort_index
#Series.value_counts(normalize=True, dropna=False) is used
#to include missing values in the distribution
#and to use percentages instead of counts
<bound method Series.sort_index of 2016-04-03 0.038608 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-12 0.036920 2016-03-14 0.036549 2016-04-04 0.036487 2016-03-07 0.036014 2016-04-02 0.035478 2016-03-28 0.034860 2016-03-19 0.034778 2016-03-15 0.034284 2016-03-29 0.034099 2016-04-01 0.033687 2016-03-30 0.033687 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-22 0.032987 2016-03-11 0.032575 2016-03-23 0.032225 2016-03-26 0.032204 2016-03-10 0.032184 2016-03-31 0.031834 2016-03-17 0.031628 2016-03-25 0.031607 2016-03-27 0.031092 2016-03-16 0.029610 2016-03-24 0.029342 2016-03-05 0.025327 2016-03-13 0.015670 2016-03-06 0.014043 2016-04-05 0.013096 2016-03-18 0.012911 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64>
2016-03-20 makes up most of the data in the date_crawled column, with 3.8%
dateof_ad_created
dateof_ad_created.value_counts(normalize=True, dropna=False).sort_index
<bound method Series.sort_index of 2016-04-03 0.038855 2016-03-20 0.037949 2016-03-21 0.037579 2016-04-04 0.036858 2016-03-12 0.036755 ... 2015-09-09 0.000021 2016-02-09 0.000021 2016-01-29 0.000021 2016-02-17 0.000021 2016-01-22 0.000021 Name: ad_created, Length: 76, dtype: float64>
2016-03-20 makes up most of the data in the ad_created column, with 3.8%
dateof_last_seen
dateof_last_seen.value_counts(normalize=True, dropna=False).sort_index
<bound method Series.sort_index of 2016-04-06 0.221806 2016-04-07 0.131947 2016-04-05 0.124761 2016-03-17 0.028086 2016-04-03 0.025203 2016-04-02 0.024915 2016-03-30 0.024771 2016-04-04 0.024483 2016-03-31 0.023783 2016-03-12 0.023783 2016-04-01 0.022794 2016-03-29 0.022341 2016-03-22 0.021373 2016-03-28 0.020859 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-23 0.018532 2016-03-26 0.016802 2016-03-16 0.016452 2016-03-15 0.015876 2016-03-19 0.015834 2016-03-27 0.015649 2016-03-14 0.012602 2016-03-11 0.012375 2016-03-10 0.010666 2016-03-09 0.009595 2016-03-13 0.008895 2016-03-08 0.007413 2016-03-18 0.007351 2016-03-07 0.005395 2016-03-06 0.004324 2016-03-05 0.001071 Name: last_seen, dtype: float64>
2016-04-07 makes up most of the data in the last_seen column, with 13.2%
All is good on the date columns. Next!
autos['registration_year'].describe()
count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The minimum value is 1000 ?! Which means the car was produced and registered in the year 1000?! Really ? Car didn't even exist yet.
*Correct me if I'm wrong, but I don't even think there is any road that was built for cars at that time. So this definitely doesn't make sense.
The maximum value is 9999 -___-' which makes even less sense.
*I mean this dataset is from 2016. How could it be that there is a used car that comes from the future? But you know what, let's not dwell on this even further.
It's time to clean the data!
So since there are odd values in the dataset, we will remove a few rows:
Since this dataset is from 2016, then we will only include cars up until 2016.
I did a little googling for "the first car ever", and turns out the birth certificate of automobile is at year 1886. The car is called Benz Patent Motor Car, model no. 1, which was invented by Carl Benz. This car was not sold to general public at the time of the invention, but maybe there is someone that owns this particular car and wants to sell it on Ebay. Who knows? So without overthinking, let's set the earliest year to 1886
TL;DR: we'll only incude cars with registration year between 1886 and 2016
Before we remove the rows, let's check the relative frequency of cars with a registration year that fall outside of the 1886 - 2016 interval and see if it's safe to remove those rows entirely, or if we need more custom logic..
#check relative frequency using boolean filtering
(~autos["registration_year"].between(1886, 2016)).value_counts(normalize=True, dropna=False)
False 0.961207 True 0.038793 Name: registration_year, dtype: float64
True = 0.039, which means that cars with a registration year that fall outside of the 1886 - 2016 interval only make up 3.9% of the total rows. This is only a small percentage, so we can remove them.
print('-Before-')
print(autos["registration_year"].describe()) #before change
print('\n')
autos = autos[autos["registration_year"].between(1886, 2016)] #removing odd rows
print('-After-')
print(autos["registration_year"].describe()) #verify change
-Before- count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64 -After- count 46681.000000 mean 2002.910756 std 7.185103 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
#calculate the distribution of the remaining values
print('-head-')
print(autos["registration_year"].value_counts(normalize=True).head(10).sort_index)
print('\n')
print('-tail-')
print(autos["registration_year"].value_counts(normalize=True).tail(10).sort_index)
-head- <bound method Series.sort_index of 2000 0.067608 2005 0.062895 1999 0.062060 2004 0.057904 2003 0.057818 2006 0.057197 2001 0.056468 2002 0.053255 1998 0.050620 2007 0.048778 Name: registration_year, dtype: float64> -tail- <bound method Series.sort_index of 1955 0.000043 1953 0.000021 1943 0.000021 1929 0.000021 1939 0.000021 1938 0.000021 1948 0.000021 1927 0.000021 1931 0.000021 1952 0.000021 Name: registration_year, dtype: float64>
So based on the data distribution we can see that in this dataset:
We have cleaned the dataset, and we can start doing more analysis in this dataset. However, before we move on let's take a small break and do another reflection on what we have done in this project. So... have done the following:
We did most of our chore (data cleaning), now the super fun part begins! We'll start analyzing data and creating insights! Oh I'm excited :D
One of the analysis techniques we learned in this course is aggregation. When working with data on cars, it's natural to explore variations across different car brands. We can use aggregation to understand the brand column.
How do we do aggregation? Here are the steps:
#Identify the unique values we want to aggregate by
brands_freq = autos["brand"].value_counts(normalize=True, dropna=False)
brands_freq
#We explore the unique values in the brand column, and calculate the relative frequency
volkswagen 0.211264 bmw 0.110045 opel 0.107581 mercedes_benz 0.096463 audi 0.086566 ford 0.069900 renault 0.047150 peugeot 0.029841 fiat 0.025642 seat 0.018273 skoda 0.016409 nissan 0.015274 mazda 0.015188 smart 0.014160 citroen 0.014010 toyota 0.012703 hyundai 0.010025 sonstige_autos 0.009811 volvo 0.009147 mini 0.008762 mitsubishi 0.008226 honda 0.007840 kia 0.007069 alfa_romeo 0.006641 porsche 0.006127 suzuki 0.005934 chevrolet 0.005698 chrysler 0.003513 dacia 0.002635 daihatsu 0.002506 jeep 0.002271 subaru 0.002142 land_rover 0.002099 saab 0.001649 jaguar 0.001564 daewoo 0.001500 trabant 0.001392 rover 0.001328 lancia 0.001071 lada 0.000578 Name: brand, dtype: float64
#Unique values identified
#We only select top brands, which has relative frequency > 5%
top_brands = brands_freq[brands_freq > .05].index
top_brands
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
top_brands_price
top_brands_price = {} #Create an empty dictionary to store our aggregate data
for b in top_brands: # Loop over the unique values, and for each:
selected_rows = autos[autos["brand"] == b ] #Subset the dataframe by the unique values using boolean comparison
mean_price = selected_rows["price_usd"].mean() #Calculate the mean of whichever column we're interested in
top_brands_price[b] = int(mean_price) #Assign the val/mean to the dict as k/v
#print in descending order
sorted_top_brands_price = sorted(top_brands_price.items() , reverse=True, key=lambda x: x[1])
for element in sorted_top_brands_price :
print(element[0] , " :" , element[1] )
audi : 9336 mercedes_benz : 8628 bmw : 8332 volkswagen : 5402 ford : 3749 opel : 2975
top_brands_mileage
#using the same aggregation methods as above
top_brands_mileage = {}
for b in top_brands:
selected_rows = autos[autos["brand"] == b ]
mean_mileage = selected_rows["odometer_km"].mean()
top_brands_mileage[b] = int(mean_mileage)
sorted_top_brands_mileage = sorted(top_brands_mileage.items() , reverse=True, key=lambda x: x[1])
for element in sorted_top_brands_mileage :
print(element[0] , " :" , element[1] )
bmw : 132572 mercedes_benz : 130788 opel : 129310 audi : 129157 volkswagen : 128707 ford : 124266
Alright! So we have figured out the average price and mileage for the top brands using aggregation, next we want to analyze the aggregated data by converting dictionary --> series objects using pandas series constructor
tbp_series
(top brands price)
tbp_series = pd.Series(top_brands_price).sort_values(ascending=False)
print(tbp_series)
audi 9336 mercedes_benz 8628 bmw 8332 volkswagen 5402 ford 3749 opel 2975 dtype: int64
tbm_series
(top brands mileage)
tbm_series = pd.Series(top_brands_mileage).sort_values(ascending=False)
print(tbm_series)
bmw 132572 mercedes_benz 130788 opel 129310 audi 129157 volkswagen 128707 ford 124266 dtype: int64
We have convered both dictionaries to series objects using series constructor. The next step is to create a dataframe from both series objects using dataframe constructor so we can easily compare the top brand's mean price & mean mileage
brand_info = pd.DataFrame(tbp_series, columns = ['mean_price']) #create a dataframe from tbp_series
brand_info['mean_mileage'] = tbm_series #assign tbm_series to a new column
brand_info #pretty print
mean_price | mean_mileage | |
---|---|---|
audi | 9336 | 129157 |
mercedes_benz | 8628 | 130788 |
bmw | 8332 | 132572 |
volkswagen | 5402 | 128707 |
ford | 3749 | 124266 |
opel | 2975 | 129310 |
If we want to buy a car just by considering the price and mileage, then I'd say Ford has the best value, because it is the second least expensive car, and has the least amount of mileage on average.
When looking at the table above, it seems like there is no direct correlation between price and mileage for top brands. For example, Audi is the most expensive car, and yet it is ranked fourth in mileage; Merceedes Benz is the second most expensive car, and it has second most amount of mileage.
But really? There is no correlation at all? Maybe since the top brands only make up of 5% of the dataset, then analyzing only the top brands don't do whole dataset justice . I have an idea, let's analyze the whole dataset. We can split the odometer_km (mileage) into groups, and use aggregation to see if average prices follows any patterns based on the mileage.
Let's do this! I am curious
Call me stubborn, but I really don't believe that there is no correlation at all between mileage and price.
Let's use aggregation to analyze the mileage and price columns for the whole dataset. See if we get a different result. I wonder...
#using aggregation
mileage_price = {}
for mi in autos["odometer_km"].unique():
selected_rows = autos[autos["odometer_km"] == mi ] #boolean comparison
mean_price = selected_rows["price_usd"].mean()
mileage_price[mi] = mean_price
sorted_mileage_price = sorted(mileage_price.items() , reverse=True, key=lambda x: x[1])
for element in sorted_mileage_price :
print(element[0] , " :" , element[1] )
10000 : 20550.867219917014 20000 : 18448.477088948788 30000 : 16608.836842105262 40000 : 15499.568381430365 50000 : 13812.173212487412 60000 : 12385.004432624113 70000 : 10927.182813816344 80000 : 9721.947636363637 5000 : 8873.51592356688 90000 : 8465.02510460251 100000 : 8132.697278911564 125000 : 6214.0220300597075 150000 : 3767.9271065314942
Tadaa! See?! It appears that the higher the price, the lower the mileage, and vice versa. Let's try another method just to make our analysis credible. This time, we are going to split the odometer_km into groups, and use aggregation to see if average prices follows any patterns based on the mileage.
#splitting odomoter_km into 3 groups
mileage_groups = autos["odometer_km"].value_counts(bins = 3).sort_index(ascending= False)
mileage_groups
(101666.667, 150000.0] 34942 (53333.333, 101666.667] 7421 (4854.999, 53333.333] 4318 Name: odometer_km, dtype: int64
#calculate the mean price of mileage_groups
#using boolean comparison
print('-(101666.667, 150000.0]-')
print('Price = $' + str(round(autos[autos["odometer_km"] >= 101666.667]["price_usd"].mean() )) ) #group 1
print('\n')
print('-(53333.333, 101666.667]-')
print('Price = $' + str(round(autos[autos["odometer_km"].between(53333.333 , 101666.667)]["price_usd"].mean() ))) #group 2
print('\n')
print('-(4854.999, 53333.333]-')
print('Price = $' + str(round(autos[autos["odometer_km"] <= 53333.333]["price_usd"].mean() ))) #group 3
-(101666.667, 150000.0]- Price = $4108 -(53333.333, 101666.667]- Price = $9595 -(4854.999, 53333.333]- Price = $14891
Based on the result above, we get the following:
mileage_groups | price | |
---|---|---|
(101666.667, 150000.0] | \$4,108 | |
(53333.333, 101666.667] | \$9,595 | |
(4854.999, 53333.333] | \$14,891 |
Wooh! Good that we did this. Turns out if we analyze the whole dataset, we find there is a correlation between price and mileage. As we all can see from the table, the group of cars with higher mileage has lower price, and vice versa.
Alright alright, so what else can we analyze? let's check the dataset again
autos.head(5) #print the first 5 columns to check what to do next
date_crawled | name | seller | offer_type | price_usd | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | private | offer | 5000 | control | bus | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | no | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | private | offer | 8500 | control | limousine | 1997 | automatic | 286 | 7er | 150000 | 6 | gas | bmw | no | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | private | offer | 8990 | test | limousine | 2009 | manual | 102 | golf | 70000 | 7 | gas | volkswagen | no | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | private | offer | 4350 | control | small car | 2007 | automatic | 71 | fortwo | 70000 | 6 | gas | smart | no | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | private | offer | 1350 | test | wagon | 2003 | manual | 0 | focus | 150000 | 7 | gas | ford | no | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
So in our analysis we have done the following:
Let's do a few more analysis to get some more insights!
Let's find out the most common brand/ model combinations by using groupby( ) on brands and the model column
#we could have used aggregation, but let's try groupby() just for fun (and it's simpler anyway)
#using groupby:
#A groupby operation involves some combination of splitting the object,
#applying a function, and combining the results.
#This can be used to group large amounts of data and compute operations on these groups.
common_brands_model = autos.groupby(["brand","model"]).size().sort_values(ascending=False).head()
common_brands_model
brand model volkswagen golf 3707 bmw 3er 2615 volkswagen polo 1609 opel corsa 1592 volkswagen passat 1349 dtype: int64
Oh wow! The most common brand & model combination is volkswagen golf, with 3707 entries. Interesting.
Earlier when analyzing the brand column we managed to get top_brands, which consists of car brands that have a relative frequency > 5% in this dataset. Let's find the most common model of each brands! In this case we're going to use aggregation.
top_brands.unique() #just to remind us which brands are included in top_brands
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
#aggregation
top_brands_model = {}
for b in top_brands:
selected_rows = autos[autos["brand"] == b ] #boolean comparison
model = selected_rows["model"]
top_brands_model[b] = model
print(str(b) + ': ' + str(top_brands_model[b].value_counts()[[0]]))
print('\n')
volkswagen: golf 3707 Name: model, dtype: int64 bmw: 3er 2615 Name: model, dtype: int64 opel: corsa 1592 Name: model, dtype: int64 mercedes_benz: c_klasse 1136 Name: model, dtype: int64 audi: a4 1231 Name: model, dtype: int64 ford: focus 762 Name: model, dtype: int64
Ford focus is the most popular model of Ford brand, but compared to the popular model of other top brands it has the lowest number of entries. This is not surprising, since Ford is the only American top brand while the other brands in top brands are from German. Note that this dataset is obtained from German ebay, so the result makes sense.
Let's find out the most expensive cars in this dataset using groupby( ) on brands and the model column
#top 5 most expensive used cars
brand_model_price = autos.groupby(["brand","model"])["price_usd"].max().sort_values(ascending=False).head()
brand_model_price
brand model porsche 911 350000 bmw z_reihe 259000 x_reihe 250000 mercedes_benz andere 180000 audi andere 175000 Name: price_usd, dtype: int64
Ok, so the most expensive cars in this dataset is porsche 911, with $350,000. One day I'll buy it! ;D
Sooo we are almost done with our analysis. We have obtained a lot of insights, but let's do a few more just for fun!
Let's find out the the price difference between cars with damage vs non-damaged
#The the average prices of cars with damage vs non-damaged**
autos.groupby(["unrepaired_damage"])["price_usd"].mean()
unrepaired_damage no 7164.033103 yes 2241.146035 Name: price_usd, dtype: float64
Makes total sense. The price of cars without unrepaired damage is higher on average, with $7,164 Next, let's calculate the difference
#Calculating the difference: yes - no
autos.groupby(["unrepaired_damage"])["price_usd"].mean()["yes"] - autos.groupby(["unrepaired_damage"])["price_usd"].mean()["no"]
-4922.887067553713
Alright so without using calculator we know that the difference is $4,922... neat.
Let's use groupby to find the the most expensive vehicle types on average
autos.groupby(["vehicle_type"])["price_usd"].mean().sort_values(ascending=False).head()
vehicle_type suv 13407.183715 coupe 11731.207961 convertible 10786.965186 bus 6667.593649 limousine 5958.511510 Name: price_usd, dtype: float64
SUV? This is actually unexpected, but ok!
Of course using common sense we all guess that automatic should be more expensive, but let's check...
#Calculate the average prices
autos.groupby(["gearbox"])["price_usd"].mean()
gearbox automatic 10972.718547 manual 4716.709175 Name: price_usd, dtype: float64
#Calculate the difference: automatic - manual
autos.groupby(["gearbox"])["price_usd"].mean()["automatic"] - autos.groupby(["gearbox"])["price_usd"].mean()["manual"]
6256.009372369583
Our guess is right, automatic is $6,256 more expensive!
This is going to be our last analysis in this project! Let's find out the average prices of each fuel types and sort it in descending order
autos.groupby(["fuel_type"])["price_usd"].mean().sort_values(ascending=False)
fuel_type electric 24716.368421 hybrid 14346.027027 diesel 8583.889467 gas 5055.093973 cng 4861.478873 lpg 4328.140216 other 2973.866667 Name: price_usd, dtype: float64
Electric cars are the most expensive, which is not surprising at all
Before we jump into conclusion, let's recap on what we have done for our data analysis
Data Analysis:
Well, that was fun wasn't it? We have explored, cleaned and analyzed the dataset, and now it's time present our insights! ;D