The aim of this project is to practice data cleaning and analysis using the included used car listings dataset. For our practise we will be attempting to find out the best cars to purchase from the german ebay website.
The dataset I'll be working with is a dataset of used cars from eBay Kleinanzeigen, a classified section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle but I'll be making use of a modified version that's been provided to me by dataquest.
The original dataset can be found along with information on this dataset by following the link.
After our analysis, we discovered that the Volkswagen was the most popular brand, the sonstige_autos was the most expensive on average and the volvo gave the most mileage. We also noticed a few brands, like the BMW and Mercedez Benz, which were expensive yet popular among buyers. We deduced that the reason for this was that they provided value which was worth their high price.
We concluded that the best brand to buy would be between a Volkwagen or volvo. If we chose to go for a more classy brand, a BMW or a Mercedez Benz would be better options.
Before exploring the dataset, there are some things we need to do first such as handle missing values, convert categorical data to numerical data and so on. All this is to ensure we get as accurate an insight as possible from our data. Let's take a look at our data
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
From the above code cell, We see that we have 20 columns in our dataset. Most of the columns have 50000 entries with a quater of them (five of them to be exact) having missing values of about 2000 to almost 9000 of them. These missing values will need to be cleaned out. We'll also need to change the column names to something that's a bit easier for us to work with. Let's begin shall we.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
changes={'yearOfRegistration': 'registration_year',
'monthOfRegistration': 'registration_month',
'notRepairedDamage': 'unrepaired_damage',
'dateCreated': 'ad_created', 'offerType': 'offer_type',
'vehicleType': 'vehicle_type', 'powerPS': 'power_ps',
'fuelType': 'fuel_type', 'nrOfPictures': 'no_of_pics',
'postalCode': 'postal_code', 'lastSeen': 'last_seen',
'dateCrawled': 'date_crawled'}
autos.rename(changes, axis = 1, inplace=True)
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'no_of_pics', 'postal_code', 'last_seen'], dtype='object')
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | no_of_pics | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Great! Now our column names look better. All letters are in lower case and al words have been separated using an underscore _
. This type of casing is known as snake case
. We write in snake case to ease our typing. This way we don't have to alternate between the uppercase and lowercase letters.
Now we to find out what kind of data we're looking at and what we need to clean.
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | no_of_pics | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-21 20:37:19 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
We find that we have two alpha-numeric columns (price
, odometer
) which need to be cleaned to a numeric type, three date columns (date_crawled
, ad_created
, last_seen
) which will be properly cleaned later five numeric columns (registration_year
, regestration_month
, no_of_pics
, postal_code
) and the rest of the columns are text columns.
We'll need to investigate the date columns some more to see how much insight we can gain about the timings of some events related to the cars. The no_of_pics
column has only one value for all its rows so we'll drop that column.
We'll start our cleaning process by dealing with the alpha-numeric data first. We'll remove all the letters and symbols in the data and convert the columns to either an integer or a float type whichever is best suited for the column.
autos['price'] = autos['price'].str.replace('$','')
autos['price'] = autos['price'].str.replace(',', '')
autos['price'] = autos['price'].astype(int)
autos['odometer'] = autos['odometer'].str.replace('km', '')
autos['odometer'] = autos['odometer'].str.replace(',', '')
autos['odometer'] = autos['odometer'].astype(int)
autos.rename({'odometer': 'odometer_km', 'price': 'price_dollars'}, axis = 1, inplace=True)
Now that that's done let's take a look at the price
column first.
autos['price_dollars'].unique().shape
(2357,)
There are 2357 unique prices in price_dollars
. Lets investigate some more to find out more about these values
autos['price_dollars'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_dollars, dtype: float64
autos['price_dollars'].value_counts(ascending=False).head(6)
0 1421 500 781 1500 734 2500 643 1000 639 1200 639 Name: price_dollars, dtype: int64
autos['price_dollars'].value_counts(ascending=False).tail(6)
6495 1 20790 1 8970 1 846 1 2895 1 33980 1 Name: price_dollars, dtype: int64
autos.loc[autos['price_dollars'] == 99999999, "name"].shape
(1,)
Upon further investigation, we notice the following:
price
is 0 dollars indicating that the vehicle is free.We also notice that the 0 dollar price has the highest frequency. This seems strange as it implies that quite a number of cars are free. I did some google searching and I discovered that the least price for a used care on ebay 4000 dollars indicating that no car should have a price of less than 4000 dollars.
Lets do some investigating. I decided to start my investigation by finding out the number of all the cars with a price of 0 dollars and an odometer reading of more than 100000km.
autos.loc[(autos['price_dollars'] == 0) & (autos['odometer_km'] > 10000), "price_dollars"].shape
(1280,)
autos.loc[(autos['price_dollars'] == 0) & (autos['odometer_km'] <= 10000), :]
date_crawled | name | seller | offer_type | price_dollars | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | no_of_pics | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
71 | 2016-03-28 19:39:35 | Suche_Opel_Astra_F__Corsa_oder_Kadett_E_mit_Re... | privat | Angebot | 0 | control | NaN | 1990 | manuell | 0 | NaN | 5000 | 0 | benzin | opel | NaN | 2016-03-28 00:00:00 | 0 | 4552 | 2016-04-07 01:45:48 |
167 | 2016-04-02 19:43:45 | Suche_VW_Multivan_Innenausstattung_Set_oder_TE... | privat | Angebot | 0 | control | NaN | 2011 | NaN | 0 | transporter | 5000 | 0 | NaN | volkswagen | NaN | 2016-04-02 00:00:00 | 0 | 64739 | 2016-04-06 19:45:08 |
226 | 2016-03-25 23:52:12 | Porsche_911_S_Targa__67er_SWB | privat | Angebot | 0 | control | cabrio | 1967 | manuell | 160 | 911 | 5000 | 12 | benzin | porsche | nein | 2016-03-25 00:00:00 | 0 | 44575 | 2016-04-05 14:46:39 |
259 | 2016-04-03 23:49:58 | guenstiges_Auto_/_auch_defekt | privat | Angebot | 0 | control | NaN | 2000 | NaN | 0 | NaN | 5000 | 6 | NaN | sonstige_autos | NaN | 2016-04-03 00:00:00 | 0 | 89269 | 2016-04-06 07:16:22 |
301 | 2016-03-08 20:37:59 | Kaufe_alle_Autos_bietet_an | privat | Angebot | 0 | control | NaN | 1990 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 2016-03-08 00:00:00 | 0 | 13589 | 2016-04-05 18:44:43 |
418 | 2016-03-29 14:43:24 | Fiat_SCUDO_8_Sitzer_Bus__Diesel_JTD__80_KW | privat | Angebot | 0 | test | bus | 2003 | manuell | 80 | andere | 5000 | 5 | diesel | fiat | nein | 2016-03-29 00:00:00 | 0 | 35315 | 2016-04-05 23:47:14 |
430 | 2016-03-18 23:52:40 | Winterraeder_FORD | privat | Angebot | 0 | test | NaN | 2007 | NaN | 0 | focus | 5000 | 0 | NaN | ford | NaN | 2016-03-18 00:00:00 | 0 | 40549 | 2016-03-19 06:47:06 |
1937 | 2016-03-19 08:51:48 | Vw_polo_1_9tdi | privat | Angebot | 0 | test | kombi | 2001 | manuell | 120 | polo | 5000 | 0 | NaN | volkswagen | NaN | 2016-03-19 00:00:00 | 0 | 4720 | 2016-04-06 07:46:03 |
2360 | 2016-04-04 16:44:31 | Polo_86c_3f__g40_g60__vr6 | privat | Angebot | 0 | test | NaN | 1995 | NaN | 0 | NaN | 5000 | 0 | NaN | volkswagen | NaN | 2016-04-04 00:00:00 | 0 | 26529 | 2016-04-06 18:17:16 |
2466 | 2016-03-20 17:57:49 | Auto_Haus_Kanaan_An&Verkauf_Gebrauchtwagen | privat | Angebot | 0 | test | NaN | 2015 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 2016-03-20 00:00:00 | 0 | 55234 | 2016-03-21 21:17:30 |
2481 | 2016-04-04 11:47:53 | Triton_Supermoto_400_ccm_Tuev_9/17_!_Verkauf_/... | privat | Angebot | 0 | control | NaN | 2017 | automatik | 0 | NaN | 10000 | 2 | benzin | sonstige_autos | NaN | 2016-04-04 00:00:00 | 0 | 58708 | 2016-04-06 12:46:49 |
2813 | 2016-03-20 09:51:34 | Verkaufe_Fiat_Punto_Felgen | privat | Angebot | 0 | control | NaN | 1990 | NaN | 0 | punto | 5000 | 0 | NaN | fiat | NaN | 2016-03-20 00:00:00 | 0 | 80999 | 2016-04-06 05:45:13 |
2875 | 2016-04-04 08:53:00 | VW_Polo_3F_Motorsport | privat | Angebot | 0 | control | coupe | 1991 | manuell | 75 | polo | 5000 | 1 | benzin | volkswagen | nein | 2016-04-04 00:00:00 | 0 | 54516 | 2016-04-06 10:45:56 |
3308 | 2016-03-28 20:58:50 | Suche_dringend_ein_Kleinwagen | privat | Angebot | 0 | control | NaN | 2000 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 2016-03-28 00:00:00 | 0 | 44532 | 2016-04-03 02:20:29 |
3394 | 2016-04-01 18:52:13 | Opel_omega_b_schlachtauto_zu_verschenken | privat | Angebot | 0 | control | NaN | 2000 | automatik | 211 | NaN | 5000 | 0 | benzin | opel | NaN | 2016-04-01 00:00:00 | 0 | 84189 | 2016-04-03 15:46:14 |
3422 | 2016-03-28 18:56:23 | Audi_A6_3.0Tdi_Gewinde___20zoll | privat | Angebot | 0 | control | NaN | 2006 | NaN | 0 | a6 | 5000 | 0 | NaN | audi | NaN | 2016-03-28 00:00:00 | 0 | 26632 | 2016-04-04 22:18:49 |
3643 | 2016-03-27 00:49:35 | Single_frame_golf_4 | privat | Angebot | 0 | test | NaN | 2000 | NaN | 0 | golf | 5000 | 0 | NaN | volkswagen | NaN | 2016-03-26 00:00:00 | 0 | 6528 | 2016-04-07 03:15:23 |
3928 | 2016-03-07 21:45:42 | Oldtimer_GAZ_M_21_Wolga_viele_Ersatzteile_Youn... | privat | Angebot | 0 | control | limousine | 1960 | manuell | 0 | NaN | 5000 | 1 | benzin | sonstige_autos | NaN | 2016-03-07 00:00:00 | 0 | 98630 | 2016-04-06 01:15:19 |
4079 | 2016-03-07 18:38:53 | Opel_Zafira_CNG_zu_verkaufen! | privat | Angebot | 0 | control | NaN | 2016 | NaN | 0 | zafira | 5000 | 0 | cng | opel | ja | 2016-03-07 00:00:00 | 0 | 44135 | 2016-03-08 16:46:52 |
4111 | 2016-03-12 08:53:03 | Brauch_ein_neues_Autos_habt_ihr_was_zum_anbiet... | privat | Angebot | 0 | control | NaN | 2000 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 2016-03-12 00:00:00 | 0 | 80809 | 2016-03-21 06:17:14 |
4509 | 2016-03-22 01:00:15 | Fahrzeuglackierer | privat | Angebot | 0 | control | NaN | 2015 | NaN | 0 | NaN | 5000 | 0 | benzin | sonstige_autos | NaN | 2016-03-22 00:00:00 | 0 | 38440 | 2016-03-26 08:18:27 |
6308 | 2016-03-12 17:38:17 | Kaufe_Autos_jeglicher | privat | Angebot | 0 | test | NaN | 9996 | NaN | 0 | NaN | 10000 | 0 | NaN | sonstige_autos | NaN | 2016-03-12 00:00:00 | 0 | 21244 | 2016-03-12 17:38:17 |
7172 | 2016-03-14 20:46:25 | 2_Ford_Escort_Cabrio_Paket_Preis | privat | Angebot | 0 | control | NaN | 1990 | NaN | 0 | escort | 5000 | 3 | NaN | ford | NaN | 2016-03-14 00:00:00 | 0 | 47804 | 2016-03-14 20:46:25 |
7234 | 2016-03-31 07:54:02 | Renault_Clio_Grandtour_Dynamique_TCe_90_eco2 | privat | Angebot | 0 | test | kombi | 2014 | manuell | 90 | NaN | 10000 | 10 | benzin | renault | nein | 2016-03-31 00:00:00 | 0 | 1640 | 2016-04-06 02:17:46 |
7266 | 2016-03-17 19:53:31 | Tauschen_golf3 | privat | Angebot | 0 | control | NaN | 2016 | NaN | 0 | golf | 5000 | 0 | benzin | volkswagen | nein | 2016-03-17 00:00:00 | 0 | 16515 | 2016-03-20 22:44:57 |
7499 | 2016-03-13 16:37:44 | Karosserie__opel_Kadett_c_Limo | privat | Angebot | 0 | test | limousine | 1978 | NaN | 0 | kadett | 5000 | 0 | NaN | opel | NaN | 2016-03-13 00:00:00 | 0 | 93413 | 2016-03-31 21:47:16 |
7512 | 2016-03-28 16:50:43 | Bmw_750i_Tausch_oder_Angebot | privat | Angebot | 0 | test | limousine | 1995 | automatik | 326 | 7er | 5000 | 0 | NaN | bmw | nein | 2016-03-28 00:00:00 | 0 | 23826 | 2016-04-06 22:17:41 |
7672 | 2016-03-13 01:57:22 | FORD_FOCUS_WENIG_KILOMETER | privat | Angebot | 0 | control | NaN | 2018 | NaN | 0 | focus | 5000 | 12 | diesel | ford | nein | 2016-03-13 00:00:00 | 0 | 52385 | 2016-03-14 21:46:28 |
7808 | 2016-03-09 12:54:08 | Koppel_Autos_Motoren_und_Getriebe_laufen | privat | Angebot | 0 | control | NaN | 1980 | NaN | 0 | escort | 10000 | 0 | NaN | ford | NaN | 2016-03-09 00:00:00 | 0 | 25554 | 2016-03-13 00:17:30 |
8438 | 2016-03-14 21:37:21 | Jaguar_xj_40_Daimler_Original_Baujahr_1994 | privat | Angebot | 0 | test | NaN | 1995 | automatik | 0 | NaN | 5000 | 3 | NaN | jaguar | nein | 2016-03-14 00:00:00 | 0 | 12099 | 2016-03-25 11:45:38 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
39594 | 2016-03-11 23:40:30 | fiat_scudo__fuer_Bastler__laeuft_auf_3_zylinde... | privat | Angebot | 0 | test | bus | 2000 | manuell | 69 | andere | 5000 | 5 | diesel | fiat | NaN | 2016-03-11 00:00:00 | 0 | 21107 | 2016-03-17 15:17:51 |
39604 | 2016-03-23 21:51:40 | Kaefer_1303__Projekt_mit_Subaru_EJ22_Motor_eve... | privat | Angebot | 0 | test | limousine | 1973 | manuell | 136 | kaefer | 5000 | 0 | benzin | volkswagen | NaN | 2016-03-23 00:00:00 | 0 | 9131 | 2016-04-06 04:15:27 |
39635 | 2016-03-24 13:51:55 | SUCHE_fuer_TT8N | privat | Angebot | 0 | control | coupe | 2000 | manuell | 0 | tt | 5000 | 12 | benzin | audi | nein | 2016-03-24 00:00:00 | 0 | 61169 | 2016-04-05 23:46:41 |
40264 | 2016-03-27 11:49:32 | Bmw_316_i_Schlachtfesst_austauschmotor | privat | Angebot | 0 | test | NaN | 2000 | NaN | 0 | 3er | 5000 | 0 | NaN | bmw | NaN | 2016-03-27 00:00:00 | 0 | 87766 | 2016-04-07 07:16:08 |
40324 | 2016-03-29 00:57:04 | Golf_5_Bj.2007 | privat | Angebot | 0 | test | limousine | 2007 | manuell | 140 | golf | 5000 | 6 | benzin | volkswagen | NaN | 2016-03-28 00:00:00 | 0 | 12435 | 2016-04-05 13:45:15 |
40568 | 2016-03-25 22:53:28 | Opel_Vectra_B | privat | Angebot | 0 | control | NaN | 2017 | automatik | 136 | vectra | 5000 | 12 | NaN | opel | nein | 2016-03-25 00:00:00 | 0 | 25554 | 2016-04-05 15:45:46 |
40955 | 2016-03-06 00:54:53 | Zer_Guten_Auto_800_ | privat | Angebot | 0 | control | NaN | 2016 | NaN | 0 | NaN | 5000 | 0 | NaN | chevrolet | NaN | 2016-03-06 00:00:00 | 0 | 28325 | 2016-03-06 00:54:53 |
41251 | 2016-03-24 20:00:29 | Tauschen_e36_cabrio | privat | Angebot | 0 | control | cabrio | 1995 | manuell | 116 | NaN | 5000 | 3 | lpg | bmw | nein | 2016-03-24 00:00:00 | 0 | 26897 | 2016-03-29 06:47:12 |
42181 | 2016-03-27 19:50:53 | SAMSUNG_55_3D_Tv_und_Soundbar_gegen_Auto | privat | Angebot | 0 | test | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 2016-03-27 00:00:00 | 0 | 57080 | 2016-04-06 01:15:30 |
42851 | 2016-03-26 22:54:00 | Mercedes_Benz._____________________Kabine_actr... | privat | Angebot | 0 | test | NaN | 2005 | NaN | 0 | andere | 5000 | 0 | NaN | mercedes_benz | NaN | 2016-03-26 00:00:00 | 0 | 46240 | 2016-04-07 01:45:36 |
43224 | 2016-03-18 23:48:59 | Golf_3_in_Teile__..._einfach_Anfragen__4_Tuerer | privat | Angebot | 0 | test | NaN | 2000 | NaN | 60 | NaN | 5000 | 0 | NaN | volkswagen | NaN | 2016-03-18 00:00:00 | 0 | 32312 | 2016-04-06 00:46:08 |
43481 | 2016-03-20 10:59:19 | Holzkoepfe/Arschmaden | privat | Angebot | 0 | control | NaN | 2000 | NaN | 0 | NaN | 5000 | 8 | NaN | sonstige_autos | NaN | 2016-03-20 00:00:00 | 0 | 23845 | 2016-03-20 10:59:19 |
43492 | 2016-04-03 13:52:33 | Opel_Astra_G_mit_Klima_elektrische_fenster | privat | Angebot | 0 | control | NaN | 2000 | NaN | 101 | astra | 5000 | 0 | NaN | opel | NaN | 2016-04-03 00:00:00 | 0 | 47169 | 2016-04-05 12:46:36 |
44109 | 2016-03-13 19:37:21 | BMW_e30_Sammlungsaufloesung | privat | Angebot | 0 | control | NaN | 1990 | NaN | 0 | 3er | 5000 | 0 | NaN | bmw | NaN | 2016-03-13 00:00:00 | 0 | 97456 | 2016-03-13 19:37:21 |
44195 | 2016-03-11 12:50:39 | whatsapp_gruppe_Ford_focus | privat | Angebot | 0 | test | NaN | 2015 | NaN | 0 | focus | 5000 | 0 | NaN | ford | NaN | 2016-03-11 00:00:00 | 0 | 72770 | 2016-04-07 06:45:56 |
44415 | 2016-03-10 22:55:34 | Auffahrrampe_/_Ausstellungsrampe | privat | Angebot | 0 | control | NaN | 2013 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 2016-03-10 00:00:00 | 0 | 31789 | 2016-03-18 04:15:19 |
44624 | 2016-03-20 13:44:18 | Opel_zafira__selection!_2.2_benzin__lpg._Defect | privat | Angebot | 0 | test | bus | 2001 | automatik | 150 | NaN | 5000 | 10 | lpg | opel | ja | 2016-03-20 00:00:00 | 0 | 63452 | 2016-03-26 22:47:07 |
46213 | 2016-04-02 13:47:16 | Bellier_Vario | privat | Angebot | 0 | test | kleinwagen | 1910 | NaN | 0 | NaN | 5000 | 1 | andere | sonstige_autos | NaN | 2016-04-02 00:00:00 | 0 | 93105 | 2016-04-04 11:16:30 |
46220 | 2016-03-29 04:03:36 | Tausch_oder_Verkauf_Renault_twingo | privat | Angebot | 0 | test | NaN | 2000 | manuell | 0 | twingo | 5000 | 0 | benzin | renault | NaN | 2016-03-29 00:00:00 | 0 | 71159 | 2016-04-05 17:25:59 |
46366 | 2016-03-05 22:54:03 | Lackierer_sucht_Auftraege | privat | Angebot | 0 | control | NaN | 1960 | NaN | 0 | NaN | 10000 | 0 | NaN | sonstige_autos | NaN | 2016-03-05 00:00:00 | 0 | 72622 | 2016-03-06 05:43:10 |
46665 | 2016-03-24 13:50:22 | FIAT_COUPE_175_ERSATZTEILE | privat | Angebot | 0 | control | coupe | 1994 | manuell | 139 | andere | 5000 | 0 | benzin | fiat | NaN | 2016-03-24 00:00:00 | 0 | 58553 | 2016-04-07 05:17:26 |
47142 | 2016-03-19 08:37:44 | GESUCHT_WIRD_UNBEDINGT_EIN_VOLKSWAGEN_TOURAN_S... | privat | Angebot | 0 | test | bus | 2003 | manuell | 101 | touran | 5000 | 0 | diesel | volkswagen | ja | 2016-03-19 00:00:00 | 0 | 27753 | 2016-04-06 07:17:47 |
47280 | 2016-04-05 11:43:12 | PKW_gesucht___01793917553 | privat | Angebot | 0 | test | NaN | 2005 | manuell | 0 | NaN | 5000 | 0 | benzin | volkswagen | NaN | 2016-04-05 00:00:00 | 0 | 8066 | 2016-04-05 11:43:12 |
47310 | 2016-03-14 04:55:34 | Vw_golf_3_gt | privat | Angebot | 0 | control | NaN | 2016 | manuell | 90 | golf | 5000 | 12 | NaN | volkswagen | NaN | 2016-03-14 00:00:00 | 0 | 99955 | 2016-04-07 06:45:52 |
47368 | 2016-04-01 23:44:14 | ZU_VERSCHENKEN_AUTO | privat | Angebot | 0 | control | kleinwagen | 1998 | manuell | 60 | polo | 5000 | 8 | benzin | volkswagen | NaN | 2016-04-01 00:00:00 | 0 | 69469 | 2016-04-01 23:44:14 |
48193 | 2016-04-04 01:47:49 | Mein_Auto_fuer_Ihre_Werbung | privat | Angebot | 0 | control | NaN | 2005 | NaN | 0 | NaN | 5000 | 0 | NaN | volkswagen | NaN | 2016-04-03 00:00:00 | 0 | 7549 | 2016-04-06 08:16:17 |
48290 | 2016-04-03 17:41:30 | Suche_mk2_Fahrer | privat | Angebot | 0 | test | NaN | 2009 | NaN | 0 | NaN | 5000 | 0 | NaN | ford | NaN | 2016-04-03 00:00:00 | 0 | 45711 | 2016-04-05 17:26:42 |
48636 | 2016-03-27 00:52:29 | Audi_A4_gemacht_!!! | privat | Angebot | 0 | test | limousine | 2000 | manuell | 368 | a4 | 10000 | 7 | benzin | audi | nein | 2016-03-26 00:00:00 | 0 | 99734 | 2016-03-27 08:43:42 |
48985 | 2016-03-27 08:55:25 | Stockcar_Autos_Peugeot | privat | Angebot | 0 | test | NaN | 2007 | NaN | 60 | NaN | 5000 | 0 | NaN | peugeot | NaN | 2016-03-27 00:00:00 | 0 | 17153 | 2016-04-07 05:17:18 |
49496 | 2016-03-26 13:55:28 | Bmw_e39_520 | privat | Angebot | 0 | control | limousine | 1998 | manuell | 0 | NaN | 5000 | 0 | NaN | bmw | NaN | 2016-03-26 00:00:00 | 0 | 26188 | 2016-03-26 13:55:28 |
141 rows × 20 columns
After my investigation I noticed that there were a mix of fully-repaired vehicles, vehicles with unrepaired damages and those with an unknown
status. I also noticed Vehicles with registration years ranging from the 1900s to 2016 - vehicles from anytime in the 2000s shouldn't be free - and they were all private vehicles.
Since we have no way of confirming the correct data for these rows, We will be removing all the rows with a price of zero dollars and 99999999 dollars (the outlier) since it is safe to do so.
autos = autos[(autos['price_dollars'].between(1, 6000000))]
autos.shape
(48578, 20)
Odometer
column¶Now we move to the odometer column.
autos['odometer_km'].unique().shape
(13,)
autos['odometer_km'].unique()
array([150000, 70000, 50000, 80000, 10000, 30000, 125000, 90000, 20000, 60000, 5000, 100000, 40000])
autos['odometer_km'].describe()
count 48578.000000 mean 125765.675820 std 39793.193823 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['odometer_km'].value_counts(ascending=False).head(6)
150000 31421 125000 5058 100000 2116 90000 1734 80000 1415 70000 1217 Name: odometer_km, dtype: int64
autos['odometer_km'].value_counts(ascending=False).tail(6)
50000 1014 5000 837 40000 816 30000 780 20000 762 10000 253 Name: odometer_km, dtype: int64
From the analysis we notice the following:
autos.loc[autos['odometer_km'] == 5000, "brand"].value_counts().head()
volkswagen 161 opel 106 bmw 81 mercedes_benz 64 ford 56 Name: brand, dtype: int64
autos.loc[autos['odometer_km'] == 5000, "price_dollars"].describe()
count 8.370000e+02 mean 1.316806e+04 std 1.366148e+05 min 1.000000e+00 25% 4.500000e+02 50% 1.600000e+03 75% 7.499000e+03 max 3.890000e+06 Name: price_dollars, dtype: float64
autos.loc[(autos['odometer_km'] == 5000) & (autos['price_dollars'] == 3890000), "name"]
47634 Ferrari_FXX Name: name, dtype: object
autos.loc[(autos['odometer_km'] == 5000) & (autos['price_dollars'] == 1), ["name","registeration_year","unrepaired_damage"]]
/dataquest/system/env/python3/lib/python3.4/site-packages/ipykernel/__main__.py:1: FutureWarning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative. See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike if __name__ == '__main__': /dataquest/system/env/python3/lib/python3.4/site-packages/pandas/core/indexing.py:1367: FutureWarning: Passing list-likes to .loc or [] with any missing label will raise KeyError in the future, you can use .reindex() as an alternative. See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike return self._getitem_tuple(key)
name | registeration_year | unrepaired_damage | |
---|---|---|---|
344 | Verkaufe_hier_mein_mein_schoener_honda_crx_in_... | NaN | NaN |
1422 | zerlege_Astra_f_teile_guenstig | NaN | NaN |
1987 | Audi_A4_B5_Ersatzteile | NaN | NaN |
2892 | Suche_Opel_Omega_gerne_DTI_ab_BJ_2000 | NaN | NaN |
3679 | Suche_Auto | NaN | NaN |
4496 | RENAULT__MEGAN__GRANTOUR____bj_2001____TAUSCHE... | NaN | nein |
5246 | Autoaufbereitung | NaN | NaN |
5883 | Suche_ein_Auto | NaN | NaN |
7442 | Suche_Mercedes_280_SE_w108_Karosserie_oder_zum... | NaN | NaN |
9643 | Golf_2_gti_Umbau_1_8_L_107_PS | NaN | nein |
13679 | Suche_Audi_80_Cabrio | NaN | NaN |
17984 | Oldtimer_Chevrolet_Malibu_5.7l_V8_Bj.:1978 | NaN | NaN |
18569 | Suche_Leasing_Audi_A4_Avant__A6_oder_Q5__SQ5_P... | NaN | NaN |
24362 | Suche_Mercedes_CLS_320_CDI | NaN | NaN |
27171 | Wartburg_Scheunenfund_Teilespender | NaN | NaN |
32819 | Suche_VW_/_Audi | NaN | NaN |
36652 | WER_WÜRDE_MICH_SEIN_AUTO_AUF_RATEN_VERKAUFEN | NaN | NaN |
41577 | Alfa_Romeo_156_Schlachtfest_2.5_V6_24V_428/B_B... | NaN | NaN |
43623 | 1.6__60_kW_90ps | NaN | ja |
47087 | W124_SCHLACHTFEST_|_Kombi_|_1995_|_Aktiv_Bass | NaN | NaN |
49845 | Schlachte_VW_Sharan_vr6_Automatik___no_GTI_16V... | NaN | nein |
As expected, the most expensive car with a 5000km odometer reading is a Ferrari_FXX which is worth 3890000 dollars.
Some other things I noticed include that:
Now let's look at the cars with the most mileage
autos.loc[autos['odometer_km'] == 150000, "brand"].value_counts().head()
volkswagen 7224 bmw 3881 opel 3589 mercedes_benz 3340 audi 2994 Name: brand, dtype: int64
autos.loc[autos['odometer_km'] == 150000, "name"].value_counts().head()
BMW_316i 65 BMW_320i 57 Volkswagen_Golf_1.4 53 Opel_Corsa 49 BMW_318i 49 Name: name, dtype: int64
autos.loc[autos['odometer_km'] == 150000, "price_dollars"].describe()
count 3.142100e+04 mean 5.362514e+03 std 1.363871e+05 min 1.000000e+00 25% 9.990000e+02 50% 2.190000e+03 75% 4.700000e+03 max 1.234568e+07 Name: price_dollars, dtype: float64
autos.loc[(autos['odometer_km'] == 150000) & (autos['price_dollars'] == 12345678), "brand"]
27371 fiat 39377 volvo 47598 opel Name: brand, dtype: object
After analysis, we noticed that of all the cars with a mileage of 150000km:
The highest seems strange. Typically prices within the millions range are rounded up to the nearest hundred thousands or higher. Since they are only three rows, its safe enough for us to remove them without losing any meaningful insights in from our data.
autos = autos.loc[~(autos['price_dollars'] == 12345678), :]
autos.shape
(48575, 20)
Apart from the weird highest price I mentioned earlier, nothing else seems to be out of the ordinary with the odometer
column. Lets move on to the next thing.
There are three date columns in our data:
We begin by extracting the date and time of each column to separate columns before performing our analysis. We will discard the original columns after this since we have no need for them. To do this, well create a function to avoid repeating our code multiple times.
def extract_date_time(column):
extract_date = autos[column].str[:10]
extract_time = autos[column].str[10:]
autos.drop(column, axis = 1)
return extract_date, extract_time
autos['date_crawled_date'], autos['date_crawled_time'] = extract_date_time('date_crawled')
autos['ad_created_date'], autos['ad_created_time'] = extract_date_time('ad_created')
autos['last_seen_date'], autos['last_seen_time'] = extract_date_time('last_seen')
Now that we've extracted the date and time of each column, we can begin our analysis.
autos['date_crawled_date'].value_counts(normalize=True, dropna=False)
2016-04-03 0.038600 2016-03-20 0.037880 2016-03-21 0.037406 2016-03-12 0.036933 2016-03-14 0.036541 2016-04-04 0.036521 2016-03-07 0.036006 2016-04-02 0.035471 2016-03-28 0.034853 2016-03-19 0.034771 2016-03-15 0.034277 2016-03-29 0.034133 2016-04-01 0.033680 2016-03-30 0.033680 2016-03-08 0.033309 2016-03-09 0.033083 2016-03-22 0.033001 2016-03-11 0.032568 2016-03-23 0.032218 2016-03-26 0.032198 2016-03-10 0.032177 2016-03-31 0.031827 2016-03-17 0.031642 2016-03-25 0.031601 2016-03-27 0.031086 2016-03-16 0.029604 2016-03-24 0.029336 2016-03-05 0.025322 2016-03-13 0.015666 2016-03-06 0.014040 2016-04-05 0.013093 2016-03-18 0.012908 2016-04-06 0.003170 2016-04-07 0.001400 Name: date_crawled_date, dtype: float64
We can observe the following:
autos['ad_created_date'].value_counts(normalize=True, dropna=False)
2016-04-03 0.038847 2016-03-20 0.037941 2016-03-21 0.037612 2016-04-04 0.036891 2016-03-12 0.036768 2016-03-14 0.035183 2016-04-02 0.035142 2016-03-28 0.034977 2016-03-07 0.034730 2016-03-29 0.034071 2016-03-15 0.034009 2016-03-19 0.033680 2016-04-01 0.033680 2016-03-30 0.033495 2016-03-08 0.033330 2016-03-09 0.033145 2016-03-11 0.032898 2016-03-22 0.032815 2016-03-26 0.032259 2016-03-23 0.032054 2016-03-10 0.031889 2016-03-31 0.031868 2016-03-25 0.031745 2016-03-17 0.031292 2016-03-27 0.030983 2016-03-16 0.030118 2016-03-24 0.029274 2016-03-05 0.022892 2016-03-13 0.017005 2016-03-06 0.015317 ... 2016-02-26 0.000041 2016-02-20 0.000041 2016-02-18 0.000041 2016-02-02 0.000041 2016-02-05 0.000041 2016-01-10 0.000041 2016-02-24 0.000041 2016-02-12 0.000041 2016-02-14 0.000041 2015-12-05 0.000021 2015-12-30 0.000021 2016-02-07 0.000021 2016-01-07 0.000021 2015-08-10 0.000021 2016-02-16 0.000021 2016-02-17 0.000021 2015-09-09 0.000021 2016-02-09 0.000021 2016-02-11 0.000021 2016-01-22 0.000021 2015-06-11 0.000021 2015-11-10 0.000021 2016-01-03 0.000021 2016-02-08 0.000021 2016-02-22 0.000021 2016-01-13 0.000021 2016-01-16 0.000021 2016-01-29 0.000021 2016-02-01 0.000021 2016-01-14 0.000021 Name: ad_created_date, Length: 76, dtype: float64
In the ad_created_date
column, we notice that:
autos['last_seen_date'].value_counts(normalize=True, dropna=False)
2016-04-06 0.221801 2016-04-07 0.131920 2016-04-05 0.124776 2016-03-17 0.028080 2016-04-03 0.025198 2016-04-02 0.024931 2016-03-30 0.024766 2016-04-04 0.024478 2016-03-12 0.023798 2016-03-31 0.023778 2016-04-01 0.022790 2016-03-29 0.022357 2016-03-22 0.021369 2016-03-28 0.020875 2016-03-21 0.020648 2016-03-20 0.020648 2016-03-24 0.019763 2016-03-25 0.019207 2016-03-23 0.018528 2016-03-26 0.016799 2016-03-16 0.016449 2016-03-15 0.015872 2016-03-19 0.015831 2016-03-27 0.015646 2016-03-14 0.012599 2016-03-11 0.012373 2016-03-10 0.010664 2016-03-09 0.009593 2016-03-13 0.008893 2016-03-08 0.007432 2016-03-18 0.007349 2016-03-07 0.005394 2016-03-06 0.004323 2016-03-05 0.001071 Name: last_seen_date, dtype: float64
The crawler was last seen in 2016, sometime between most of March and the start of April 2016
autos['registration_year'].describe()
count 48575.000000 mean 2004.752671 std 88.635269 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
Here we observe that:
These are unrealistic because the ads were crawled in 2016 so its impossible for any car to have been registered after 2016. The registration year of 1000 is also impossible because at that time, cars had not been invented yet. We may have to remove rows like these which contain unrealistic date stamps but lets find out how many they are to see if its safe to do so. To do this we set our upper and lower bounds to be 2016 and 1940 respectively.
outlier_count = autos.loc[~(autos['registration_year'].between(1940, 2016)), 'registration_year'].count()
outlier_count
1900
There are 1900 cars that were registered outside of 1900 to 2016. Considering how many rows we have dropped till now, It's I think it's okay to remove these ones outside our bounds.
autos = autos.loc[autos['registration_year'].between(1940, 2016), :]
autos['registration_year'].value_counts(normalize=True)
2000 0.067616 2005 0.062903 1999 0.062089 2004 0.057911 2003 0.057825 2006 0.057225 2001 0.056476 2002 0.053262 1998 0.050627 2007 0.048784 2008 0.047456 2009 0.044692 1997 0.041800 2011 0.034772 2010 0.034044 1996 0.029416 2012 0.028066 1995 0.026310 2016 0.026138 2013 0.017204 2014 0.014226 1994 0.013476 1993 0.009106 2015 0.008399 1992 0.007949 1990 0.007434 1991 0.007263 1989 0.003728 1988 0.002892 1985 0.002057 ... 1981 0.000621 1968 0.000557 1967 0.000557 1971 0.000557 1973 0.000514 1960 0.000514 1974 0.000514 1966 0.000471 1977 0.000471 1976 0.000450 1969 0.000407 1975 0.000386 1965 0.000364 1964 0.000257 1963 0.000171 1961 0.000129 1959 0.000129 1962 0.000086 1956 0.000086 1958 0.000086 1950 0.000064 1957 0.000043 1955 0.000043 1954 0.000043 1951 0.000043 1941 0.000043 1948 0.000021 1943 0.000021 1953 0.000021 1952 0.000021 Name: registration_year, Length: 70, dtype: float64
Now that we've cleaned our data, we can assert the following conclusions:
Great! We are finally done exploring all the date columns in out dataset for insights. We will move on to the next column, the brand
column.
brand
column¶To explore this column we will start by getting an aggregate of the unique values of this column. Because of how many unique brands are in the dataset, We will aggregate on the top 20 most common brands.
We will get this by using the ASeries.value_counts()
method available in pandas to count the
brand_count = autos['brand'].value_counts().head(20)
brand_count
volkswagen 9865 bmw 5137 opel 5019 mercedes_benz 4502 audi 4041 ford 3262 renault 2200 peugeot 1393 fiat 1197 seat 853 skoda 766 nissan 713 mazda 709 smart 661 citroen 655 toyota 593 hyundai 468 sonstige_autos 455 volvo 427 mini 409 Name: brand, dtype: int64
The most commonly sold brand is the volkswagen. The BMW and opel are a second and third respectively but there is a significantly lesser number of them than the volkswagen. The mini and volvo are the rarest brands of the top 20 with both less than 430 in number.
Not surprisingly, posh cars like the ferarri dont appear on the top twenty list. The ford appears to be one quite common while nissan is a closer to the bottom of the list. This could be as a result of the german's preference in cars or a nissan might be more expensive than a ford. Lets find out by taking a look at the mean price of the cars on this list.
To do this, We'll create a function to sort and display, in a descending order, the mean price dictionary we will create for our analysis.
def display_table(dataset, index = 0, is_dataset = True):
"""
Displays frequency table of any column in the dataset in descending order.
Dataset can be a frequency dictionary or a dataset.
INPUT:
dataset = The dataset or frequency table to be examined
index = index of column whose frequency table we are after. Default = 0
is_dataset = A boolean. True if dataset is a dataset else: false. Default = True
OUTPUT:
table_sorted = Frequency table of column corresponding to index sorted in descending order.
"""
if is_dataset:
table = freq_table(dataset, index)
else:
table = dataset
table_val_sorted = sorted(table.values(), reverse = True)
for value in table_val_sorted:
for key in table:
if table[key] == value:
print(key, ':', value)
brand_unique = autos['brand'].unique()
brand_price_mean = {}
for brand in brand_count.index:
brand_price_mean[brand] = autos.loc[autos['brand'] == brand, 'price_dollars'].mean()
display_table(brand_price_mean, is_dataset=False)
sonstige_autos : 45644.97802197802 citroen : 45486.68549618321 mini : 10613.459657701711 audi : 9336.687453600594 mercedes_benz : 8629.300755219901 bmw : 8570.910064239828 ford : 7447.770999386879 volkswagen : 6729.81956411556 skoda : 6368.0 hyundai : 5365.254273504273 toyota : 5167.091062394604 volvo : 4946.501170960188 nissan : 4743.40252454418 seat : 4397.230949589683 mazda : 4112.596614950635 smart : 3580.2239031770046 peugeot : 3094.0172290021537 opel : 2976.2472604104405 fiat : 2813.748538011696 renault : 2475.7172727272728
First we notice that the volkswagen (our most common brand) and ford (a common brand) are somewhere in the middle of the list. Indicating that the average brand price is a bit on the average side of the top 20 brands.
The opel is at the bottom of the price list indicating that its popularity may be largly due to its affordability. We also notice that the sonstige_autos and citroen are at the top of the list. Not surprising considering that they are luxury cars.
We also notice that the BMW and Mercedes Benz are quite expensive yet they are among the top 4 most common brands. This could be because they offer better mileage, last longer than most other brands or offer some other advantages over their competition that make them worth their price.
Now let's look at the mean mileage of these brands.
brand_mean_mileage = {}
for brand in brand_count.index:
brand_mean_mileage[brand] = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()
display_table(brand_mean_mileage, is_dataset=False)
volvo : 138067.9156908665 bmw : 132600.73973136072 mercedes_benz : 130816.30386494892 opel : 129336.5212193664 audi : 129157.38678544914 volkswagen : 128713.6340598074 renault : 128127.27272727272 peugeot : 127153.62526920316 mazda : 124464.03385049365 ford : 124351.62477007971 seat : 121131.30128956624 citroen : 119572.51908396947 nissan : 118330.99579242637 fiat : 117121.9715956558 toyota : 115944.35075885328 skoda : 110848.5639686684 hyundai : 106442.30769230769 smart : 99326.77760968229 sonstige_autos : 90219.78021978022 mini : 88105.13447432763
On this list we notice that the Volvo has the highest average mileage. The sonstige_autos (our most expensive brand offers one of the lowest mileages. This could explain why its ranked so low among the top 20 most common brands. The bmw, opel and Mercedes benz all have one of the highest mileages. They are topped only by Volvo. It's worth noting that despite the high price, the BMW and Mercedez were one of the most common brands. Ford is at the middle of the list, it is also fairly expensive and is also a fairly common brand. And Volkwagen (our most common brand) has a high average mileage and its quite affordable.
From what I gather, This signifies a trend, the most important factor in picking a brand is the mileage. The better the mileage a brand provides, the more likely the brand is to be bought. Other factors that affect the likelihood could be buyer tastes, the price also has some importance but, as in the case of the BMW and Mercedes Benz, they can be overlooked if the benefits the brand provides are worth it.
Let's take a closer look at the mean mileages of brands in the top 8 of the mean price list. We could find some connections between the two properties.
mean_price_series = pd.Series(brand_price_mean)
mean_mileage_series = pd.Series(brand_mean_mileage)
mean_stats_df = pd.DataFrame(mean_price_series, columns = ['mean_price'])
mean_stats_df['mean_mileage'] = mean_mileage_series
mean_stats_df['brand_count'] = brand_count
mean_stats_df
mean_price | mean_mileage | brand_count | |
---|---|---|---|
audi | 9336.687454 | 129157.386785 | 4041 |
bmw | 8570.910064 | 132600.739731 | 5137 |
citroen | 45486.685496 | 119572.519084 | 655 |
fiat | 2813.748538 | 117121.971596 | 1197 |
ford | 7447.770999 | 124351.624770 | 3262 |
hyundai | 5365.254274 | 106442.307692 | 468 |
mazda | 4112.596615 | 124464.033850 | 709 |
mercedes_benz | 8629.300755 | 130816.303865 | 4502 |
mini | 10613.459658 | 88105.134474 | 409 |
nissan | 4743.402525 | 118330.995792 | 713 |
opel | 2976.247260 | 129336.521219 | 5019 |
peugeot | 3094.017229 | 127153.625269 | 1393 |
renault | 2475.717273 | 128127.272727 | 2200 |
seat | 4397.230950 | 121131.301290 | 853 |
skoda | 6368.000000 | 110848.563969 | 766 |
smart | 3580.223903 | 99326.777610 | 661 |
sonstige_autos | 45644.978022 | 90219.780220 | 455 |
toyota | 5167.091062 | 115944.350759 | 593 |
volkswagen | 6729.819564 | 128713.634060 | 9865 |
volvo | 4946.501171 | 138067.915691 | 427 |
Sorting our new dataframe by mean price
in descending order we will be able to easily find our top 8 brands with the highest prices and their corresponding mileages
mean_stats_df.sort_values('mean_price', axis=0, ascending=False)
mean_price | mean_mileage | brand_count | |
---|---|---|---|
sonstige_autos | 45644.978022 | 90219.780220 | 455 |
citroen | 45486.685496 | 119572.519084 | 655 |
mini | 10613.459658 | 88105.134474 | 409 |
audi | 9336.687454 | 129157.386785 | 4041 |
mercedes_benz | 8629.300755 | 130816.303865 | 4502 |
bmw | 8570.910064 | 132600.739731 | 5137 |
ford | 7447.770999 | 124351.624770 | 3262 |
volkswagen | 6729.819564 | 128713.634060 | 9865 |
skoda | 6368.000000 | 110848.563969 | 766 |
hyundai | 5365.254274 | 106442.307692 | 468 |
toyota | 5167.091062 | 115944.350759 | 593 |
volvo | 4946.501171 | 138067.915691 | 427 |
nissan | 4743.402525 | 118330.995792 | 713 |
seat | 4397.230950 | 121131.301290 | 853 |
mazda | 4112.596615 | 124464.033850 | 709 |
smart | 3580.223903 | 99326.777610 | 661 |
peugeot | 3094.017229 | 127153.625269 | 1393 |
opel | 2976.247260 | 129336.521219 | 5019 |
fiat | 2813.748538 | 117121.971596 | 1197 |
renault | 2475.717273 | 128127.272727 | 2200 |
Apart from the sonstige_autos and mini, all the brands in the top 8 of the mean price list have a mean mileage
above 100000km. The brands with average mileages
greater than 120000km have a high brand count
values. This indicated that a car is most likely to be sold if it has a high mileage even if the car may be a bit expensive.
In this project, we anayzed a modified version of the Ebay car sales dataset provided on kaggle to find out the best cars to purchase and reached the following conclusions:
We also noticed a few brands that like the BMW and mercedez which were expensive yet popular among buyers. We deduced that the reason for this was that they provided value which was worth their price.
We concluded that the best brand to buy would be between a volkwagen or volvo. If we chose to go for a mor classy brand, a BMW or a Mercedez Benz would be better options.