Dataset contains information on used cars from eBay Kleinanzeigen, a classifieds section of the German eBay website. The dataset was originally scraped and uploaded to Kaggle by user orgesleka.
The goal is to clean the data and analyze trends in the included car listings.
Dataframe column description:
dateCrawled : when this ad was first crawled; all field values are taken from this date
name : "name" of the car
seller : private or dealer
offerType
price : the price on the ad to sell the car
abtest
vehicleType
yearOfRegistration : the year in which the car was first registered
gearbox
powerPS : power of the car in PS
model
kilometer : how many kilometers the car has been driven
monthOfRegistration : the month in which the car was first registered
fuelType
brand
notRepairedDamage : whether the car has damage that is not yet repaired
dateCreated : the date on which the ad was created on eBay
nrOfPictures : number of pictures in the ad
postalCode
lastSeenOnline : when the crawler last saw this ad online
import pandas as pd
import numpy as np
# Read data
autos = pd.read_csv('autos.csv', encoding='Latin-1')
# Inspecting the dataframe
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
There are 50,000 entries and 20 columns. Some columns like vehicleType, gearbox, and model contain null values, so these columns need to be investigated individually. Most data types are int64 or object; the object columns hold strings, mostly categorical values and datetime information.
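A quick way to see which columns need attention is to count the nulls directly. A minimal sketch on a toy frame standing in for `autos` (on the real data this would simply be `autos.isnull().sum()`):

```python
import pandas as pd

# Toy frame standing in for `autos` (column names are illustrative)
toy = pd.DataFrame({
    'vehicleType': ['bus', None, 'limousine'],
    'gearbox': ['manuell', 'automatik', None],
    'powerPS': [158, 286, 102],
})

# Null count per column, largest first
null_counts = toy.isnull().sum().sort_values(ascending=False)
print(null_counts)
```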
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
Let's convert the column names from camelCase to snake_case and reword some of them, based on the data dictionary, to be more descriptive and in line with Python convention.
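As an alternative to listing the new names by hand, the mechanical camelCase-to-snake_case part can be automated with a regex; a sketch (the descriptive renames like yearOfRegistration to registration_year would still be done manually, and abbreviation runs like `powerPS` would need special handling):

```python
import re

def camel_to_snake(name):
    """Insert an underscore before each uppercase letter, then lowercase."""
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

print(camel_to_snake('dateCrawled'))  # date_crawled
print(camel_to_snake('offerType'))    # offer_type
```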
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'num_of_pictures', 'postal_code',
'last_seen']
Let's verify that the column names have been changed to snake_case.
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'num_of_pictures', 'postal_code', 'last_seen'], dtype='object')
autos.describe(include='all')  # include='all' describes the object columns as well as the numeric ones
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-04-02 15:49:30 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Columns like seller and offer_type have the same value repeated almost everywhere: 49,999 of the 50,000 entries share a single value. Therefore, we can eliminate them from our analysis at this point. The num_of_pictures column contains only zeros.
autos['seller'].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos['offer_type'].value_counts()
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
autos['num_of_pictures'].value_counts()
0 50000 Name: num_of_pictures, dtype: int64
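Rather than checking each suspect column by hand, columns whose most frequent value covers nearly every row can be flagged programmatically; a sketch on toy data (the 90% threshold is an arbitrary choice):

```python
import pandas as pd

# Toy frame mimicking the near-constant columns in `autos`
toy = pd.DataFrame({
    'seller': ['privat'] * 9 + ['gewerblich'],
    'num_of_pictures': [0] * 10,
    'brand': ['bmw', 'audi'] * 5,
})

# Flag columns whose most frequent value covers at least 90% of rows
near_constant = [col for col in toy.columns
                 if toy[col].value_counts(normalize=True).iloc[0] >= 0.9]
print(near_constant)  # ['seller', 'num_of_pictures']
```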
Let's explore the price and odometer columns to see if numeric data is stored as text.
autos['price'].head()  # prices are listed in dollars; once we strip the dollar sign,
# we should record the unit in the column name so that information isn't lost
0 $5,000 1 $8,500 2 $8,990 3 $4,350 4 $1,350 Name: price, dtype: object
autos['odometer'].head()
0 150,000km 1 150,000km 2 70,000km 3 70,000km 4 150,000km Name: odometer, dtype: object
As we can see both columns have numeric data as object data type. We need to clean it and convert it to numeric type.
autos['price'] = (autos['price']
                  .str.replace('$', '', regex=False)  # regex=False: treat '$' literally, not as a regex anchor
                  .str.replace(',', '', regex=False)
                  .astype(int))
# Let's rename `price` to `price_dollar`
autos.rename({'price': 'price_dollar'}, axis=1, inplace=True)
autos['price_dollar'].head()
0 5000 1 8500 2 8990 3 4350 4 1350 Name: price_dollar, dtype: int64
autos['odometer'] = (autos['odometer']
                     .str.replace(',', '', regex=False)
                     .str.replace('km', '', regex=False)
                     .astype(int))
# Let's rename `odometer` to `odometer_km`
autos.rename({'odometer': 'odometer_km'}, axis=1, inplace=True)
autos['odometer_km'].head()
0 150000 1 150000 2 70000 3 70000 4 150000 Name: odometer_km, dtype: int64
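The same cleanup can also be done in one pass by stripping every non-digit character; a sketch on a toy Series, since the pattern works for both the price and odometer columns:

```python
import pandas as pd

# Toy values mimicking the raw odometer strings
odometer = pd.Series(['150,000km', '70,000km', '5,000km'])

# Remove everything that is not a digit, then cast to int
cleaned = odometer.str.replace(r'\D', '', regex=True).astype(int)
print(cleaned.tolist())  # [150000, 70000, 5000]
```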
autos['odometer_km'].unique().shape
(13,)
There are 13 unique values in the odometer_km column.
autos['odometer_km'].describe()
count 50000.000000 mean 125732.700000 std 40042.211706 min 5000.000000 25% 125000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
autos['odometer_km'].value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
150,000 km is the most common odometer reading; both the median and the 75th percentile sit at 150,000 km. The data does not seem unrealistically high or low, since used cars tend to have high odometer readings.
Let's check the price column.
autos['price_dollar'].unique().shape
(2357,)
There are 2,357 unique entries in this column.
autos['price_dollar'].describe()
count 5.000000e+04 mean 9.840044e+03 std 4.811044e+05 min 0.000000e+00 25% 1.100000e+03 50% 2.950000e+03 75% 7.200000e+03 max 1.000000e+08 Name: price_dollar, dtype: float64
The maximum price is 100,000,000 dollars, which is unrealistic, and the minimum is 0. Up to 75% of car prices are at or below 7,200 dollars, while the mean sits at 9,840 dollars.
price_filter = autos["price_dollar"] > 1000000
weird_values = autos.loc[price_filter, ['name', 'price_dollar']]
print(weird_values)
name price_dollar 2897 Escort_MK_1_Hundeknochen_zum_umbauen_auf_RS_2000 11111111 7814 Ferrari_F40 1300000 11137 suche_maserati_3200_gt_Zustand_unwichtig_laufe... 10000000 22947 Bmw_530d_zum_ausschlachten 1234566 24384 Schlachte_Golf_3_gt_tdi 11111111 27371 Fiat_Punto 12345678 39377 Tausche_volvo_v40_gegen_van 12345678 39705 Tausch_gegen_gleichwertiges 99999999 42221 Leasinguebernahme 27322222 47598 Opel_Vectra_B_1_6i_16V_Facelift_Tuning_Showcar... 12345678 47634 Ferrari_FXX 3890000
We can see that some of the prices, for vehicles that are not even in the luxury category, run into the millions of dollars. To address this discrepancy, let's filter the results to keep only entries priced between 500 and 200,000 dollars.
autos = autos[autos["price_dollar"].between(500, 200000)]
autos["price_dollar"].describe()
count 45089.000000 mean 6271.147863 std 8464.802153 min 500.000000 25% 1500.000000 50% 3500.000000 75% 7900.000000 max 198000.000000 Name: price_dollar, dtype: float64
Now we can see that the mean was skewed by unrealistically high and low values. With appropriate filter bounds it has been normalized to some extent: the mean car price is 6,271 dollars and the maximum is 198,000 dollars.
Let's remove the unwanted columns whose redundant entries don't add any value to our analysis.
autos = autos.drop(["seller", "offer_type", "num_of_pictures"], axis=1)
autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 45089 entries, 0 to 49999 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 45089 non-null object 1 name 45089 non-null object 2 price_dollar 45089 non-null int64 3 abtest 45089 non-null object 4 vehicle_type 41415 non-null object 5 registration_year 45089 non-null int64 6 gearbox 43276 non-null object 7 power_ps 45089 non-null int64 8 model 43025 non-null object 9 odometer_km 45089 non-null int64 10 registration_month 45089 non-null int64 11 fuel_type 41891 non-null object 12 brand 45089 non-null object 13 unrepaired_damage 37573 non-null object 14 ad_created 45089 non-null object 15 postal_code 45089 non-null int64 16 last_seen 45089 non-null object dtypes: int64(6), object(11) memory usage: 6.2+ MB
autos.iloc[1]  # just to get an idea of what one entry looks like
date_crawled 2016-04-04 13:38:56 name BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik price_dollar 8500 abtest control vehicle_type limousine registration_year 1997 gearbox automatik power_ps 286 model 7er odometer_km 150000 registration_month 6 fuel_type benzin brand bmw unrepaired_damage nein ad_created 2016-04-04 00:00:00 postal_code 71034 last_seen 2016-04-06 14:45:08 Name: 1, dtype: object
Let's analyze the date-associated columns.
dates_first_five = autos[['date_crawled',
'ad_created',
'last_seen']][0:5]
print(dates_first_five)
date_crawled ad_created last_seen 0 2016-03-26 17:47:46 2016-03-26 00:00:00 2016-04-06 06:45:54 1 2016-04-04 13:38:56 2016-04-04 00:00:00 2016-04-06 14:45:08 2 2016-03-26 18:57:24 2016-03-26 00:00:00 2016-04-06 20:15:37 3 2016-03-12 16:58:10 2016-03-12 00:00:00 2016-03-15 03:16:28 4 2016-04-01 14:38:50 2016-04-01 00:00:00 2016-04-01 14:38:50
dates_first_five.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5 entries, 0 to 4 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 5 non-null object 1 ad_created 5 non-null object 2 last_seen 5 non-null object dtypes: object(3) memory usage: 160.0+ bytes
We can see that the data type for the above dataframe is object (i.e. string), so we can use string methods for further analysis.
dates = dates_first_five['date_crawled'].str[:10]
print(dates)
0 2016-03-26 1 2016-04-04 2 2016-03-26 3 2016-03-12 4 2016-04-01 Name: date_crawled, dtype: object
Let's consider the date_crawled column.
date_crawled = (autos['date_crawled']
                .str[:10]
                .str.replace('-', '')
                .astype(int))
ad_created = (autos['ad_created']
              .str[:10]
              .str.replace('-', '')
              .astype(int))
last_seen = (autos['last_seen']
             .str[:10]
             .str.replace('-', '')
             .astype(int))
Now, we are going to sort dates in ascending order and count relative frequencies (percentages).
date_crawled_relative_freq = date_crawled.value_counts(normalize=True, dropna=False).sort_index()
ad_created_relative_freq = ad_created.value_counts(normalize=True, dropna=False).sort_index()
last_seen_relative_freq = last_seen.value_counts(normalize=True, dropna=False).sort_index()
print(date_crawled_relative_freq)
20160305 0.025572 20160306 0.014128 20160307 0.036195 20160308 0.033179 20160309 0.032913 20160310 0.032713 20160311 0.033024 20160312 0.037326 20160313 0.015525 20160314 0.036306 20160315 0.034022 20160316 0.029364 20160317 0.031161 20160318 0.012886 20160319 0.034753 20160320 0.038080 20160321 0.037725 20160322 0.033024 20160323 0.032380 20160324 0.028987 20160325 0.031094 20160326 0.032647 20160327 0.031161 20160328 0.034798 20160329 0.033268 20160330 0.033312 20160331 0.031671 20160401 0.033911 20160402 0.035774 20160403 0.038812 20160404 0.036616 20160405 0.013152 20160406 0.003172 20160407 0.001353 Name: date_crawled, dtype: float64
We can see that the data was crawled from March 5, 2016 to April 7, 2016, with roughly uniform relative frequencies across days.
print(ad_created_relative_freq)
20150611 0.000022 20150810 0.000022 20150909 0.000022 20151110 0.000022 20151205 0.000022 ... 20160403 0.039034 20160404 0.036994 20160405 0.011888 20160406 0.003260 20160407 0.001198 Name: ad_created, Length: 76, dtype: float64
eBay advertisements were created from June 11, 2015 until April 7, 2016. Ads were generally created slightly before they were crawled.
print(last_seen_relative_freq)
20160305 0.001087 20160306 0.004170 20160307 0.005212 20160308 0.007008 20160309 0.009470 20160310 0.010291 20160311 0.012043 20160312 0.023908 20160313 0.008871 20160314 0.012287 20160315 0.015680 20160316 0.016168 20160317 0.027679 20160318 0.007408 20160319 0.015414 20160320 0.020426 20160321 0.020670 20160322 0.021247 20160323 0.018408 20160324 0.019539 20160325 0.018585 20160326 0.016479 20160327 0.015436 20160328 0.020515 20160329 0.021358 20160330 0.024152 20160331 0.023443 20160401 0.022866 20160402 0.024884 20160403 0.024928 20160404 0.024307 20160405 0.126594 20160406 0.225310 20160407 0.134157 Name: last_seen, dtype: float64
We can see that last_seen entries run from March 5, 2016 to April 7, 2016. There is a significant jump in relative frequency over the last three days, April 5-7, 2016; together they account for close to 0.48 (48%) of all entries, which likely reflects the end of the crawl rather than a surge in sales.
Let's analyze registration_year, the year in which each car was first registered.
reg_year = autos['registration_year']
reg_year.describe()
count 45089.000000 mean 2005.063741 std 89.659809 min 1000.000000 25% 2000.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
A crude analysis shows a mean of 2005, with a minimum of 1000 (I don't think we even had horse carts then, but I might be wrong) and a maximum of 9999 (too far in the future).
autos_reg_year_cleaned = autos[reg_year.between(1916,2017)]
reg_year_cleaned = autos_reg_year_cleaned['registration_year']
reg_year_cleaned.value_counts(normalize =True).head(25)
2005 0.064202 2000 0.060862 2004 0.060145 2006 0.059764 2003 0.059562 1999 0.057657 2001 0.056468 2002 0.054227 2007 0.050887 2008 0.049429 2009 0.046583 1998 0.045417 2011 0.036248 2010 0.035553 1997 0.033895 2012 0.029277 2017 0.029052 1996 0.024031 2016 0.021251 1995 0.019279 2013 0.017866 2014 0.014616 1994 0.011209 2015 0.008115 1993 0.007622 Name: registration_year, dtype: float64
We can clearly see that the 25 most common registration years account for close to 95% of the values. We can use the sum() method to confirm it.
reg_year_cleaned.value_counts(normalize =True).head(25).sum()
0.9532157188011388
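The same check can be made cumulative by sorting the frequencies by year and applying cumsum, which shows how coverage builds up over time; a sketch on toy years:

```python
import pandas as pd

# Toy registration years standing in for reg_year_cleaned
years = pd.Series([1999, 2000, 2000, 2004, 2005, 2005, 2005, 2008])

# Per-year relative frequency, then a running total in chronological order
cumulative = years.value_counts(normalize=True).sort_index().cumsum()
print(cumulative)
```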
Let's first explore the brand column.
autos['brand'].value_counts(normalize =True)
volkswagen 0.213666 bmw 0.113331 opel 0.101621 mercedes_benz 0.101533 audi 0.089711 ford 0.064894 renault 0.044689 peugeot 0.028965 fiat 0.023997 seat 0.018585 skoda 0.016789 smart 0.015237 mazda 0.015059 nissan 0.015037 citroen 0.014261 toyota 0.013285 hyundai 0.010069 sonstige_autos 0.009581 volvo 0.009248 mini 0.009204 honda 0.007940 mitsubishi 0.007696 kia 0.007341 alfa_romeo 0.006565 porsche 0.006077 chevrolet 0.005988 suzuki 0.005855 chrysler 0.003571 dacia 0.002861 jeep 0.002395 land_rover 0.002196 daihatsu 0.002173 subaru 0.001974 saab 0.001641 jaguar 0.001575 daewoo 0.001353 rover 0.001264 trabant 0.001109 lancia 0.001042 lada 0.000621 Name: brand, dtype: float64
Let's look at the top 20 results.
top_20 = autos['brand'].value_counts(normalize=True).head(20)
top_20.index  # gives us a list-like index to iterate over
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat', 'seat', 'skoda', 'smart', 'mazda', 'nissan', 'citroen', 'toyota', 'hyundai', 'sonstige_autos', 'volvo', 'mini'], dtype='object')
# Create an empty dictionary to store values: key - brand name, value - mean price
brand_prices = {}
for brand in top_20.index:
    filter_brand = autos['brand'] == brand  # boolean mask for the current brand
    mean_price = autos[filter_brand]['price_dollar'].mean()
    brand_prices[brand] = mean_price  # store the result in our dictionary
for k, v in brand_prices.items():
    print("{}'s average price is {:,} dollars".format(k, v) + '\n')
volkswagen's average price is 5,701.241955574009 dollars bmw's average price is 8,418.535029354207 dollars opel's average price is 3,348.936054124836 dollars mercedes_benz's average price is 8,670.394058540847 dollars audi's average price is 9,484.499876390606 dollars ford's average price is 4,267.123376623376 dollars renault's average price is 2,760.8114143920598 dollars peugeot's average price is 3,329.3254211332314 dollars fiat's average price is 3,211.992606284658 dollars seat's average price is 4,709.527446300716 dollars skoda's average price is 6,539.7107001321 dollars smart's average price is 3,552.050946142649 dollars mazda's average price is 4,391.736377025037 dollars nissan's average price is 5,077.141592920354 dollars citroen's average price is 3,994.679626749611 dollars toyota's average price is 5,244.499165275459 dollars hyundai's average price is 5,695.464757709251 dollars sonstige_autos's average price is 12,432.61574074074 dollars volvo's average price is 5,110.575539568345 dollars mini's average price is 10,616.96626506024 dollars
Above we have obtained the average prices for the top 20 brands in our dataframe.
Let's see how the top 20 brands trend with respect to odometer readings, i.e. mileage.
brand_mileage = {}
for brand in top_20.index:
    filter_brand = autos['brand'] == brand
    mean_mileage = autos[filter_brand]['odometer_km'].mean()
    brand_mileage[brand] = mean_mileage
for k, v in brand_mileage.items():
    print("{}'s average mileage is {:,.2f} km".format(k, v) + '\n')
volkswagen's average mileage is 128,459.10 km bmw's average mileage is 133,017.61 km opel's average mileage is 128,120.91 km mercedes_benz's average mileage is 131,021.19 km audi's average mileage is 129,283.07 km ford's average mileage is 123,646.62 km renault's average mileage is 126,493.80 km peugeot's average mileage is 126,271.06 km fiat's average mileage is 115,032.35 km seat's average mileage is 120,763.72 km skoda's average mileage is 111,010.57 km smart's average mileage is 100,705.97 km mazda's average mileage is 124,219.44 km nissan's average mileage is 117,352.51 km citroen's average mileage is 118,779.16 km toyota's average mileage is 115,843.07 km hyundai's average mileage is 105,253.30 km sonstige_autos's average mileage is 90,914.35 km volvo's average mileage is 138,992.81 km mini's average mileage is 89,469.88 km
# We can use the pd.Series constructor to create Series from our dictionaries
brand_mean_prices_series = pd.Series(brand_prices)
brand_mean_mileage = pd.Series(brand_mileage)
df = pd.DataFrame(brand_mean_prices_series, columns=['mean_price_dollars'])
df['mean_mileage_km'] = brand_mean_mileage  # add the mileage values to our dataframe
print(df)
mean_price_dollars mean_mileage_km volkswagen 5701.241956 128459.103176 bmw 8418.535029 133017.612524 opel 3348.936054 128120.907900 mercedes_benz 8670.394059 131021.188292 audi 9484.499876 129283.065513 ford 4267.123377 123646.616541 renault 2760.811414 126493.796526 peugeot 3329.325421 126271.056662 fiat 3211.992606 115032.347505 seat 4709.527446 120763.723150 skoda 6539.710700 111010.568032 smart 3552.050946 100705.967977 mazda 4391.736377 124219.440353 nissan 5077.141593 117352.507375 citroen 3994.679627 118779.160187 toyota 5244.499165 115843.071786 hyundai 5695.464758 105253.303965 sonstige_autos 12432.615741 90914.351852 volvo 5110.575540 138992.805755 mini 10616.966265 89469.879518
We can see a pattern in the data: for most brands, a higher mean price goes with higher mean mileage, e.g. bmw, audi, and mercedes_benz, with a few exceptions like mini, where the mean price is high but the mean mileage is quite low.
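The two dictionary-building loops above can be replaced by a single groupby aggregation, which computes both means per brand in one step; a sketch on toy rows:

```python
import pandas as pd

# Toy rows standing in for the filtered `autos` frame
toy = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'audi', 'audi'],
    'price_dollar': [8000, 9000, 9500, 10500],
    'odometer_km': [140000, 120000, 130000, 128000],
})

# Mean price and mean mileage per brand in a single step
agg = toy.groupby('brand')[['price_dollar', 'odometer_km']].mean()
print(agg)
```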
Let's look at cars with unrepaired damage, first replacing nein with no and ja with yes.
autos['unrepaired_damage'] = (autos['unrepaired_damage']
                              .str.replace('nein', 'no')
                              .str.replace('ja', 'yes'))
damaged_cars = autos['unrepaired_damage'] == 'yes'
damaged = autos[damaged_cars]
price_unrepaired = {}
for brand in top_20.index:
    damaged_cars = damaged['brand'] == brand  # rows whose brand matches the current iteration
    mean_price = damaged[damaged_cars]['price_dollar'].mean()  # filter with the mask and take the mean
    price_unrepaired[brand] = mean_price
unrepaired_mean_price = pd.Series(price_unrepaired)
df['damaged_mean_price'] = unrepaired_mean_price  # add the damaged_mean_price column to our aggregated dataframe
print(df)
mean_price_dollars mean_mileage_km damaged_mean_price volkswagen 5701.241956 128459.103176 2665.754003 bmw 8418.535029 133017.612524 3886.117021 opel 3348.936054 128120.907900 1842.817536 mercedes_benz 8670.394059 131021.188292 4113.140244 audi 9484.499876 129283.065513 3608.470968 ford 4267.123377 123646.616541 1992.865455 renault 2760.811414 126493.796526 1524.241379 peugeot 3329.325421 126271.056662 1783.381295 fiat 3211.992606 115032.347505 1703.788889 seat 4709.527446 120763.723150 2199.059701 skoda 6539.710700 111010.568032 4012.016393 smart 3552.050946 100705.967977 1462.521739 mazda 4391.736377 124219.440353 1731.426471 nissan 5077.141593 117352.507375 2335.671053 citroen 3994.679627 118779.160187 2316.807018 toyota 5244.499165 115843.071786 3707.871795 hyundai 5695.464758 105253.303965 2921.869565 sonstige_autos 12432.615741 90914.351852 6978.866667 volvo 5110.575540 138992.805755 1937.813953 mini 10616.966265 89469.879518 4645.384615
We can see that the mean price for cars with unrepaired damage is roughly 50% lower than the brand's overall mean price.
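The roughly-50% claim can be made explicit by adding a ratio column to the aggregated dataframe; a sketch rebuilt from two toy rows mirroring the table above (column name `damage_ratio` is a made-up choice):

```python
import pandas as pd

# Toy values mirroring two rows of the aggregated dataframe
df = pd.DataFrame({
    'mean_price_dollars': [5701.24, 8418.54],
    'damaged_mean_price': [2665.75, 3886.12],
}, index=['volkswagen', 'bmw'])

# Damaged mean price as a fraction of the overall mean price
df['damage_ratio'] = df['damaged_mean_price'] / df['mean_price_dollars']
print(df['damage_ratio'].round(2))
```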
volk = autos['brand'] == 'volkswagen'  # boolean mask for the Volkswagen brand
volks_model = autos[volk]
name = volks_model['name']  # names of the specific models
name.value_counts(normalize=True).head(20)  # top 20 relative frequencies of model names within the Volkswagen brand
Volkswagen_Golf_1.4 0.007785 Volkswagen_Polo 0.007370 Volkswagen_Golf 0.005917 Volkswagen_Polo_1.2 0.004982 Volkswagen_Golf_1.6 0.004879 Volkswagen_Passat_Variant_1.9_TDI 0.004048 Volkswagen_Golf_1.9_TDI 0.003322 Volkswagen_Lupo_1.0 0.003010 Volkswagen_Golf_1.4_Trendline 0.002699 Volkswagen_Beetle 0.002491 Volkswagen_Golf_2.0_GTI 0.002491 Volkswagen_Passat 0.002284 Volkswagen_Polo_1.2_Comfortline 0.002180 Volkswagen_Polo_1.2_Trendline 0.002180 Volkswagen_Polo_1.4 0.002180 Volkswagen_Passat_Variant_1.6 0.001972 Volkswagen_Passat_Variant_2.0_TDI_DPF_Comfortline 0.001972 Volkswagen_Fox_1.2 0.001868 Volkswagen_Polo_60_Servo 0.001868 Volkswagen_Golf_Variant_1.9_TDI 0.001868 Name: name, dtype: float64
Volkswagen is the brand with the most entries, and the data is dominated by different versions of the Volkswagen Golf and Polo.
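Since the name column packs brand and model into one underscore-separated string, a rough model token can be pulled out with str.split; a sketch (taking the second token is a heuristic, not a guaranteed model field, and multi-word brands would break it):

```python
import pandas as pd

# Sample names in the same underscore-separated format as the name column
names = pd.Series(['Volkswagen_Golf_1.4',
                   'Volkswagen_Polo_1.2',
                   'Volkswagen_Passat'])

# Heuristic: the token right after the brand is usually the model
models = names.str.split('_').str[1]
print(models.tolist())  # ['Golf', 'Polo', 'Passat']
```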
The data shows a positive correlation between price and mileage: more expensive brands like mercedes_benz, audi, and bmw tend to have higher odometer readings.
The most common brand is Volkswagen, with models like the Golf and Polo dominating the entries.
Prices of cars with unrepaired damage are almost 50% lower than the overall average price.
Up to 75% of the car listings have a mileage of 150,000 km, which is understandable considering these are used cars.
Close to 48% of the entries were last seen online during April 5-7, 2016.
Close to 95% of the cars fall within the 25 most common registration years.