In this project, we'll work with a dataset of used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset can be found here: data.
Let's start by importing the libraries we need and reading the dataset into pandas.
import numpy as np
import pandas as pd
autos = pd.read_csv('autos_dq.csv', encoding='Windows-1252')
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 4.8+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
A few initial observations: the column names use camelCase rather than Python's preferred snake_case, and several columns (such as price and odometer) store numeric data as text.
We will now convert column names from camelcase to snakecase and reword some of the column names to be more descriptive.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
autos.rename({
'dateCrawled':'date_crawled',
'offerType':'offer_type',
'vehicleType':'vehicle_type',
'yearOfRegistration':'registration_year',
'powerPS':'power_ps',
'monthOfRegistration':'registration_month',
'fuelType':'fuel_type',
'notRepairedDamage':'unrepaired_damage',
'dateCreated':'ad_created',
'nrOfPictures':'nr_of_pictures',
'postalCode':'postal_code',
'lastSeen':'last_seen'
}, axis=1, inplace=True)
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
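As a side note, the renaming could also be done programmatically with a regular expression instead of an explicit mapping. A minimal sketch (the function name and demo frame are illustrative, not part of the project): note that a naive regex turns consecutive capitals like powerPS into power_p_s rather than power_ps, which is one reason an explicit map is safer here.

```python
import re

import pandas as pd

def camel_to_snake(name: str) -> str:
    # Insert an underscore before each uppercase letter (except at the
    # start of the string), then lowercase everything.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

# Hypothetical mini-frame with a subset of the original column names
demo = pd.DataFrame(columns=['dateCrawled', 'offerType', 'powerPS'])
demo.columns = [camel_to_snake(c) for c in demo.columns]
```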
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-02 11:37:04 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
After this initial review, a few columns contain mostly a single value and can likely be dropped from further analysis: seller and offer_type. (The nr_of_pictures column is also constant, all zeros, and could be dropped as well.)
autos = autos.drop(columns = ['seller', 'offer_type'])
We also investigate a few more columns.
autos.model.value_counts()
golf 4024 andere 3528 3er 2761 polo 1757 corsa 1735 ... kalina 2 i3 1 b_max 1 200 1 rangerover 1 Name: model, Length: 245, dtype: int64
autos.brand.value_counts()
volkswagen 10687 opel 5461 bmw 5429 mercedes_benz 4734 audi 4283 ford 3479 renault 2404 peugeot 1456 fiat 1308 seat 941 skoda 786 mazda 757 nissan 754 citroen 701 smart 701 toyota 617 sonstige_autos 546 hyundai 488 volvo 457 mini 424 mitsubishi 406 honda 399 kia 356 alfa_romeo 329 porsche 294 suzuki 293 chevrolet 283 chrysler 181 dacia 129 daihatsu 128 jeep 110 subaru 109 land_rover 99 saab 80 daewoo 79 trabant 78 jaguar 77 rover 69 lancia 57 lada 31 Name: brand, dtype: int64
The data in the brand and model columns seems clean. Now let's explore the price column.
autos.price.describe()
count 50000 unique 2357 top $0 freq 1421 Name: price, dtype: object
autos.price.value_counts()
$0 1421 $500 781 $1,500 734 $2,500 643 $1,000 639 ... $3,425 1 $3,129 1 $1,960 1 $369 1 $14,790 1 Name: price, Length: 2357, dtype: int64
The price column stores its values as text. Let's remove the non-numeric characters (the $ sign and the thousands separator), convert the result to integers, and rename the column to reflect the currency.
autos["price"] = autos["price"].str.replace('$', '', regex=False)
autos["price"] = autos["price"].str.replace(',', '', regex=False).astype(int)
autos.rename({"price":"price_usd"}, axis=1, inplace=True)
The same applies to the odometer column. Let's quickly repeat the operation for this column and also change its name.
autos.odometer.value_counts()
150,000km 32424 125,000km 5170 100,000km 2169 90,000km 1757 80,000km 1436 70,000km 1230 60,000km 1164 50,000km 1027 5,000km 967 40,000km 819 30,000km 789 20,000km 784 10,000km 264 Name: odometer, dtype: int64
autos["odometer"] = autos["odometer"].str.replace('km', '', regex=False)
autos["odometer"] = autos["odometer"].str.replace(',', '', regex=False).astype(int)
autos.rename({"odometer":"odometer_km"}, axis=1, inplace=True)
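Since price and odometer needed the same two-step cleanup, it could be factored into a small helper. A sketch (the function name and demo series are illustrative assumptions); note regex=False, so the $ sign is treated literally rather than as a regex end-of-string anchor:

```python
import pandas as pd

def strip_to_int(series: pd.Series, chars: str) -> pd.Series:
    # Remove each of the given characters literally, then cast to int.
    for ch in chars:
        series = series.str.replace(ch, '', regex=False)
    return series.astype(int)

# Hypothetical sample values in the dataset's price format
demo = pd.Series(['$5,000', '$8,500'])
cleaned = strip_to_int(demo, '$,')
```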
Let's continue exploring the data, specifically looking for data that doesn't look right. We'll analyze the odometer_km and price_usd columns and look for the outliers.
autos.price_usd.value_counts().sort_index(ascending=False).head(20)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 Name: price_usd, dtype: int64
autos.price_usd.value_counts().sort_index(ascending=True).head(30)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 40 6 45 4 47 1 49 4 50 49 55 2 59 1 60 9 65 5 66 1 Name: price_usd, dtype: int64
autos.loc[autos['price_usd'] == 350000]
date_crawled | name | price_usd | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
36818 | 2016-03-27 18:37:37 | Porsche_991 | 350000 | control | coupe | 2016 | manuell | 500 | 911 | 5000 | 3 | benzin | porsche | nein | 2016-03-27 00:00:00 | 0 | 70499 | 2016-03-27 18:37:37 |
It seems the price column includes many outliers that should be removed. Entries priced in the tens of millions of dollars are clearly unrealistic, so we'll keep only prices below $350,000 (we checked the $350,000 and $345,000 listings, a Porsche and a Rolls-Royce, which were the last ones that still looked rational).
The same applies to any car listed for $50 or less. Let's remove both groups!
autos = autos[(autos["price_usd"] > 50) & (autos["price_usd"] < 350000)]
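Instead of hand-picked bounds, outliers could also be trimmed with a quantile-based rule. A sketch with hypothetical price values (the cutoffs 1% and 99% are an assumption, not what the project used):

```python
import pandas as pd

# Hypothetical prices with implausible values at both ends
prices = pd.Series([0, 1, 50, 500, 1500, 2500, 5000, 99999999])

# Keep only values between the 1st and 99th percentiles (inclusive)
low, high = prices.quantile([0.01, 0.99])
trimmed = prices[prices.between(low, high)]
```

A fixed-bound filter is easier to justify here, though, because the sanity limits ($50 and $350,000) come from inspecting actual listings rather than from the distribution itself.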
Now let's do the same exercise with the odometer_km column.
autos.odometer_km.value_counts().sort_index(ascending=False).head(20)
150000 31257 125000 5045 100000 2103 90000 1733 80000 1413 70000 1214 60000 1153 50000 1011 40000 815 30000 777 20000 757 10000 248 5000 771 Name: odometer_km, dtype: int64
autos.odometer_km.value_counts().sort_index(ascending=True).head(30)
5000 771 10000 248 20000 757 30000 777 40000 815 50000 1011 60000 1153 70000 1214 80000 1413 90000 1733 100000 2103 125000 5045 150000 31257 Name: odometer_km, dtype: int64
It seems the odometer_km column does not need further cleaning: the values look plausible and contain no obvious outliers.
autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 48297 entries, 0 to 49999 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 48297 non-null object 1 name 48297 non-null object 2 price_usd 48297 non-null int32 3 abtest 48297 non-null object 4 vehicle_type 43842 non-null object 5 registration_year 48297 non-null int64 6 gearbox 46061 non-null object 7 power_ps 48297 non-null int64 8 model 45889 non-null object 9 odometer_km 48297 non-null int32 10 registration_month 48297 non-null int64 11 fuel_type 44390 non-null object 12 brand 48297 non-null object 13 unrepaired_damage 39373 non-null object 14 ad_created 48297 non-null object 15 nr_of_pictures 48297 non-null int64 16 postal_code 48297 non-null int64 17 last_seen 48297 non-null object dtypes: int32(2), int64(5), object(11) memory usage: 4.6+ MB
There are 5 columns that should represent date values. Some of these columns were created by the crawler, some came from the website itself. We can differentiate by referring to the data dictionary:

- date_crawled: added by the crawler
- last_seen: added by the crawler
- ad_created: from the website
- registration_month: from the website
- registration_year: from the website

Right now, the date_crawled, last_seen, and ad_created columns are all identified as string values by pandas. Because these three columns are represented as strings, we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are represented as numeric values, so we can use methods like Series.describe() to understand the distribution without any extra data processing.
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025364 2016-03-06 0.014080 2016-03-07 0.036006 2016-03-08 0.033253 2016-03-09 0.033046 2016-03-10 0.032238 2016-03-11 0.032569 2016-03-12 0.036980 2016-03-13 0.015674 2016-03-14 0.036648 2016-03-15 0.034288 2016-03-16 0.029464 2016-03-17 0.031534 2016-03-18 0.012879 2016-03-19 0.034723 2016-03-20 0.037787 2016-03-21 0.037269 2016-03-22 0.032921 2016-03-23 0.032300 2016-03-24 0.029464 2016-03-25 0.031534 2016-03-26 0.032279 2016-03-27 0.031079 2016-03-28 0.034930 2016-03-29 0.034122 2016-03-30 0.033687 2016-03-31 0.031865 2016-04-01 0.033750 2016-04-02 0.035572 2016-04-03 0.038553 2016-04-04 0.036524 2016-04-05 0.013065 2016-04-06 0.003168 2016-04-07 0.001387 Name: date_crawled, dtype: float64
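The string-slicing approach works because the timestamps share a fixed format. Alternatively, the columns could be parsed into real datetimes first; a sketch with a hypothetical two-row sample in the dataset's format:

```python
import pandas as pd

# Hypothetical sample of crawl timestamps in the dataset's format
crawled = pd.Series(['2016-03-26 17:47:46', '2016-04-04 13:38:56'])

# Parse to datetimes, keep only the calendar date, then compute shares
dates = pd.to_datetime(crawled).dt.date
share = dates.value_counts(normalize=True).sort_index()
```

Parsed datetimes would also allow arithmetic, such as computing how long each ad stayed online from date_crawled to last_seen.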
The most popular crawl date was 2016-04-03, with a 3.86% share.
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 ... 2016-04-03 0.038802 2016-04-04 0.036897 2016-04-05 0.011781 2016-04-06 0.003251 2016-04-07 0.001242 Name: ad_created, Length: 76, dtype: float64
The most popular ad-creation date was 2016-04-03, with a 3.88% share.
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001077 2016-03-06 0.004327 2016-03-07 0.005425 2016-03-08 0.007309 2016-03-09 0.009607 2016-03-10 0.010622 2016-03-11 0.012402 2016-03-12 0.023749 2016-03-13 0.008883 2016-03-14 0.012651 2016-03-15 0.015881 2016-03-16 0.016440 2016-03-17 0.028076 2016-03-18 0.007330 2016-03-19 0.015777 2016-03-20 0.020643 2016-03-21 0.020581 2016-03-22 0.021368 2016-03-23 0.018593 2016-03-24 0.019794 2016-03-25 0.019173 2016-03-26 0.016688 2016-03-27 0.015529 2016-03-28 0.020829 2016-03-29 0.022320 2016-03-30 0.024722 2016-03-31 0.023811 2016-04-01 0.022879 2016-04-02 0.024908 2016-04-03 0.025115 2016-04-04 0.024515 2016-04-05 0.125018 2016-04-06 0.221794 2016-04-07 0.132161 Name: last_seen, dtype: float64
The most common last-seen date was 2016-04-06, with a 22.2% share. The last three days of the crawl account for nearly half of all last-seen values, which likely reflects the end of the crawling period rather than a spike in sales.
After exploring ad crawl, ad creation and last seen columns, let's explore registration_year column.
autos.registration_year.describe()
count 48297.000000 mean 2004.724807 std 87.831495 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The registration_year column includes some odd values, with minimum value at 1000 (before cars were invented) and maximum value 9999 (many years into the future).
Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. Determining the earliest valid year is more difficult. Realistically, it could be somewhere in the first few decades of the 1900s.
autos = autos[autos['registration_year'].between(1900,2016)]
autos.registration_year.describe()
count 46423.000000 mean 2002.935592 std 7.127037 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
autos.registration_year.value_counts(normalize=True)
2000 0.067079 2005 0.062835 1999 0.062060 2004 0.058139 2003 0.058053 ... 1938 0.000022 1939 0.000022 1953 0.000022 1943 0.000022 1952 0.000022 Name: registration_year, Length: 78, dtype: float64
After further cleaning and removing any cars claimed to be registered before 1900 or after 2016, we have 46423 remaining cars.
The most common registration year is 2000, covering 6.7% of the cars.
Now let's explore variations across different car brands using aggregation.
autos['brand'].describe()
count 46423 unique 40 top volkswagen freq 9809 Name: brand, dtype: object
autos['brand'].value_counts()
volkswagen 9809 bmw 5109 opel 4981 mercedes_benz 4484 audi 4024 ford 3244 renault 2187 peugeot 1388 fiat 1194 seat 848 skoda 763 nissan 712 mazda 709 smart 658 citroen 653 toyota 593 hyundai 465 sonstige_autos 443 volvo 423 mini 408 mitsubishi 380 honda 365 kia 329 alfa_romeo 309 porsche 280 suzuki 275 chevrolet 264 chrysler 164 dacia 123 daihatsu 117 jeep 106 subaru 98 land_rover 98 saab 77 jaguar 72 daewoo 69 trabant 64 rover 62 lancia 49 lada 27 Name: brand, dtype: int64
We will select the 10 most common brands for further analysis.
top_brands = autos['brand'].value_counts().index[:10]
top_brands
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat'], dtype='object')
Now we will aggregate the data using a loop and a dictionary. The process: loop over the selected brands, subset the dataframe to each brand's rows, compute the mean of the column of interest, and store the result in a dictionary keyed by brand.
We will use this process to calculate the mean price for the top 10 brands in the dataset.
mean_price_brand = {}
for b in top_brands:
brand_row = autos[autos['brand']==b]
mean = brand_row['price_usd'].mean()
mean_price_brand[b]=mean
mean_price_brand
{'volkswagen': 5431.494036089306, 'bmw': 8378.434918770796, 'opel': 2999.6338084721942, 'mercedes_benz': 8664.980151650312, 'audi': 9376.088469184891, 'ford': 3771.270961775586, 'renault': 2490.5665294924556, 'peugeot': 3105.0886167146973, 'fiat': 2820.7747068676717, 'seat': 4423.116745283019}
dict(sorted(mean_price_brand.items(), key=lambda item: item[1])) #sorting the dictionary by value
{'renault': 2490.5665294924556, 'fiat': 2820.7747068676717, 'opel': 2999.6338084721942, 'peugeot': 3105.0886167146973, 'ford': 3771.270961775586, 'seat': 4423.116745283019, 'volkswagen': 5431.494036089306, 'bmw': 8378.434918770796, 'mercedes_benz': 8664.980151650312, 'audi': 9376.088469184891}
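The same aggregation can be done without an explicit loop using pandas groupby. A sketch with a hypothetical mini-dataset (the demo values are made up for illustration):

```python
import pandas as pd

# Hypothetical mini-dataset with the same column names
demo = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'opel', 'opel'],
    'price_usd': [8000, 9000, 2500, 3500],
})

# Group rows by brand, average the price, and sort descending
mean_price = demo.groupby('brand')['price_usd'].mean().sort_values(ascending=False)
```

On the real dataset, `autos.groupby('brand')['price_usd'].mean()` would produce the same numbers as the dictionary loop, already as a Series with the brand as its index.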
We aggregated across brands to understand the mean price of the cars. Among the top 10 (most frequent) brands, audi has the highest average price (9,376 USD), followed by mercedes_benz (8,664 USD) and bmw (8,378 USD).
ford and seat were less expensive, with volkswagen in between.
For the top 10 brands, let's use aggregation to compute the average mileage and check whether there is any visible link with the mean price. To do so, we'll convert both dictionaries into series objects, combine them into a single dataframe (with a shared index), and display the dataframe directly.
autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 46423 entries, 0 to 49999 Data columns (total 18 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 46423 non-null object 1 name 46423 non-null object 2 price_usd 46423 non-null int32 3 abtest 46423 non-null object 4 vehicle_type 43840 non-null object 5 registration_year 46423 non-null int64 6 gearbox 44418 non-null object 7 power_ps 46423 non-null int64 8 model 44279 non-null object 9 odometer_km 46423 non-null int32 10 registration_month 46423 non-null int64 11 fuel_type 43224 non-null object 12 brand 46423 non-null object 13 unrepaired_damage 38287 non-null object 14 ad_created 46423 non-null object 15 nr_of_pictures 46423 non-null int64 16 postal_code 46423 non-null int64 17 last_seen 46423 non-null object dtypes: int32(2), int64(5), object(11) memory usage: 4.4+ MB
mean_mileage_brand = {}
for b in top_brands:
brand_row = autos[autos['brand']==b]
mean = brand_row['odometer_km'].mean()
mean_mileage_brand[b]=mean
mean_mileage_brand
{'volkswagen': 128781.73106330921, 'bmw': 132673.71305539244, 'opel': 129415.77996386267, 'mercedes_benz': 130951.15967885817, 'audi': 129249.50298210735, 'ford': 124257.09001233046, 'renault': 128331.04709647919, 'peugeot': 127193.80403458214, 'fiat': 117081.23953098827, 'seat': 121303.06603773584}
We calculated the mean mileage for the top 10 brands and stored the results in a dictionary. Now we will convert both dictionaries into series objects, using the series constructor.
mpb_series = pd.Series(mean_price_brand)
print(mpb_series)
volkswagen 5431.494036 bmw 8378.434919 opel 2999.633808 mercedes_benz 8664.980152 audi 9376.088469 ford 3771.270962 renault 2490.566529 peugeot 3105.088617 fiat 2820.774707 seat 4423.116745 dtype: float64
mmb_series = pd.Series(mean_mileage_brand)
print(mmb_series)
volkswagen 128781.731063 bmw 132673.713055 opel 129415.779964 mercedes_benz 130951.159679 audi 129249.502982 ford 124257.090012 renault 128331.047096 peugeot 127193.804035 fiat 117081.239531 seat 121303.066038 dtype: float64
Now we will create a dataframe from the first series object using the dataframe constructor. Then we will assign the other series as a new column in this dataframe.
price_miles_df = pd.DataFrame(mpb_series, columns=['mean_price'])
price_miles_df['mean_mileage'] = mmb_series
price_miles_df
mean_price | mean_mileage | |
---|---|---|
volkswagen | 5431.494036 | 128781.731063 |
bmw | 8378.434919 | 132673.713055 |
opel | 2999.633808 | 129415.779964 |
mercedes_benz | 8664.980152 | 130951.159679 |
audi | 9376.088469 | 129249.502982 |
ford | 3771.270962 | 124257.090012 |
renault | 2490.566529 | 128331.047096 |
peugeot | 3105.088617 | 127193.804035 |
fiat | 2820.774707 | 117081.239531 |
seat | 4423.116745 | 121303.066038 |
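The two-step construction above (DataFrame from one series, then assigning the other) can also be collapsed into a single constructor call. A sketch with a two-brand excerpt (the rounded values are taken from the outputs above for illustration only):

```python
import pandas as pd

# Two-brand excerpt of the aggregated results (rounded, illustrative)
mean_price = pd.Series({'bmw': 8378.4, 'opel': 2999.6})
mean_mileage = pd.Series({'bmw': 132673.7, 'opel': 129415.8})

# Both series share the brand index, so the constructor aligns them by label
df = pd.DataFrame({'mean_price': mean_price, 'mean_mileage': mean_mileage})
```

Because alignment is done by index label, the series don't even need to be in the same order for the rows to match up correctly.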
price_miles_df.describe()
mean_price | mean_mileage | |
---|---|---|
count | 10.000000 | 10.000000 |
mean | 5146.144894 | 126923.813346 |
std | 2676.018160 | 4737.382400 |
min | 2490.566529 | 117081.239531 |
25% | 3025.997511 | 124991.268518 |
50% | 4097.193854 | 128556.389080 |
75% | 7641.699698 | 129374.210718 |
max | 9376.088469 | 132673.713055 |
price_miles_df['mean_price'].corr(price_miles_df['mean_mileage'])
0.5618561436226072
There are no huge differences in average mileage across the top 10 brands, with the standard deviation below 5,000 km. The correlation between mean mileage and mean price is positive but only moderate (about 0.56), so mileage differences do not clearly explain the price differences.
In this project, we mainly performed data cleaning steps on the eBay cars dataset. As for the analysis itself, among the top 10 brands, three stood out with the highest average prices (audi, mercedes_benz, and bmw). Mean mileage varied little across brands and did not clearly drive the average price of the cars.