In this project I will be working with a dataset of used-car listings from eBay Kleinanzeigen, the classifieds section of the German eBay website. The full dataset is publicly available; I will be using a sample of it with 50,000 rows and 20 columns, described in the autos.info() output below.
The aim of this project is to clean the data and analyze the included used-car listings.
# Importing libraries
import pandas as pd
import numpy as np

# Read the CSV file; Latin-1 encoding is needed for the German text
autos = pd.read_csv('autos.csv', encoding='Latin-1')
print(autos.info())
print(autos.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64
 18  postalCode           50000 non-null  int64
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
None

dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
In the output above I can see that five columns have incomplete data: vehicleType, gearbox, model, fuelType, and notRepairedDamage.
Since the listings come from a German website, some values will need translating. For example, the seller column indicates whether the seller is 'private' or a 'dealer', but the data shows 'privat' instead of 'private'. Similarly, the notRepairedDamage column shows 'nein' instead of 'no'. These details will have to be taken care of during the analysis.
# Change column names from camelcase to snakecase
print(autos.columns)
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
                 'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
                 'odometer', 'registration_month', 'fuel_type', 'brand',
                 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
                 'last_seen']
autos.head()
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gear_box | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
The column names have been converted from camelcase to snakecase, and a few have been reworded (e.g. yearOfRegistration to registration_year, notRepairedDamage to unrepaired_damage) to make them more descriptive.
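The renaming above is done by hand. A mechanical camelcase-to-snakecase conversion could be sketched with a regular expression, though it cannot reproduce reworded names like registration_year, and runs of capitals such as powerPS would need extra handling:

```python
import re

def to_snake_case(name):
    # Insert an underscore before each capital letter (except at the start),
    # then lowercase the whole string.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

cols = ['dateCrawled', 'offerType', 'vehicleType']
print([to_snake_case(c) for c in cols])  # ['date_crawled', 'offer_type', 'vehicle_type']
```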
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gear_box | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-30 19:48:02 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
I have made the following observations after looking at the table above:

- price and odometer are stored as text and need to be converted to numeric types.
- seller and offer_type have the same value in 49,999 of the 50,000 rows, so they carry almost no information.
- nr_of_pictures is 0 for every row.
- registration_year has implausible minimum (1000) and maximum (9999) values.
- registration_month has a minimum of 0, which is not a valid month.
# Fixing column odometer: strip 'km' and the thousands separator, convert to int
odometer = autos['odometer'].str.replace('km', '', regex=False)
odometer = odometer.str.replace(',', '', regex=False)
autos['odometer'] = odometer.astype(int)
autos.rename({'odometer': 'odometer_km'}, axis=1, inplace=True)
# Fixing column price: strip '$' and the thousands separator, convert to int.
# regex=False treats '$' as a literal character rather than a regex end anchor.
price = autos['price'].str.replace('$', '', regex=False)
price = price.str.replace(',', '', regex=False)
autos['price'] = price.astype(int)
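Depending on the pandas version, str.replace may interpret its pattern as a regular expression, in which case '$' is an end-of-string anchor rather than a literal dollar sign; passing regex=False avoids that pitfall. A sketch of the same cleanup on a toy Series:

```python
import pandas as pd

prices = pd.Series(['$5,000', '$8,500', '$0'])
cleaned = (prices.str.replace('$', '', regex=False)
                 .str.replace(',', '', regex=False))
cleaned = pd.to_numeric(cleaned).astype(int)
print(cleaned.tolist())  # [5000, 8500, 0]
```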
# Fixing columns vehicle_type, gear_box, unrepaired_damage:
# translating values from German to English
autos['vehicle_type'].value_counts(normalize=True, dropna=False)
limousine     0.25718
kleinwagen    0.21644
kombi         0.18254
NaN           0.10190
bus           0.08186
cabrio        0.06122
coupe         0.05074
suv           0.03972
andere        0.00840
Name: vehicle_type, dtype: float64
# Series.map leaves values missing from the dict (including real NaN) as NaN,
# so a 'NaN' string key would have no effect and is omitted.
vehicle_mapping = {'limousine': 'limousine', 'kleinwagen': 'small car',
                   'kombi': 'station wagon', 'bus': 'bus', 'cabrio': 'convertible',
                   'coupe': 'coupe', 'suv': 'suv', 'andere': 'other'}
autos['vehicle_type'] = autos['vehicle_type'].map(vehicle_mapping)
gb_mapping = {'automatik': 'automatic', 'manuell': 'manual'}
autos['gear_box'] = autos['gear_box'].map(gb_mapping)
autos['unrepaired_damage'].value_counts(normalize=True, dropna=False)
upd_mapping = {'nein': 'no', 'ja': 'yes'}
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(upd_mapping)
Columns gear_box, vehicle_type, and unrepaired_damage have been translated from German to English.
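Because map leaves unmapped values (including real NaN) as NaN, missing values can afterwards be labeled explicitly with fillna. A sketch on toy data:

```python
import numpy as np
import pandas as pd

damage = pd.Series(['nein', 'ja', np.nan])
# Translate known values; anything unmapped (here the NaN) stays NaN,
# then fillna gives it an explicit label.
translated = damage.map({'nein': 'no', 'ja': 'yes'}).fillna('unknown')
print(translated.tolist())  # ['no', 'yes', 'unknown']
```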
#Removing unwanted columns
autos=autos.drop(['seller','offer_type','nr_of_pictures'],axis=1)
Columns seller, offer_type, and nr_of_pictures have been dropped because almost all of the values in these columns were the same.
# Exploring the price column for outliers
price = autos['price']
price.unique().shape
price.describe()
price.value_counts().sort_index(ascending=False)

# Keep listings priced between $0 and $400,000; higher prices are implausible outliers
autos = autos[(autos['price'] >= 0) & (autos['price'] <= 400000)]
autos.head()
date_crawled | name | price | abtest | vehicle_type | registration_year | gear_box | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | no | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatic | 286 | 7er | 150000 | 6 | benzin | bmw | no | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manual | 102 | golf | 70000 | 7 | benzin | volkswagen | no | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | small car | 2007 | automatic | 71 | fortwo | 70000 | 6 | benzin | smart | no | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | station wagon | 2003 | manual | 0 | focus | 150000 | 7 | benzin | ford | no | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
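As a cross-check on fixed cutoffs like the $0-$400,000 range used above, the bounds could instead be derived from quantiles. A sketch on a toy price Series (the toy values and the 1%/99% percentile choice are illustrative only, not the cutoffs used in this project):

```python
import pandas as pd

prices = pd.Series([0, 100, 1200, 5000, 9999999])
# Take the 1st and 99th percentiles as lower/upper bounds
low, high = prices.quantile([0.01, 0.99])
kept = prices[prices.between(low, high)]
print(len(kept))  # the extreme low and high values are dropped
```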
odometer=autos['odometer_km']
odometer.unique().shape
odometer.value_counts().sort_index(ascending=False)
150000    32416
125000     5169
100000     2168
90000      1757
80000      1436
70000      1230
60000      1164
50000      1025
40000       818
30000       789
20000       784
10000       264
5000        966
Name: odometer_km, dtype: int64
Observations:

This dataset has five date columns: date_crawled, ad_created, last_seen, registration_month, and registration_year. The first three are stored as strings, whereas registration_month and registration_year are integers. Let's start by exploring the three string columns.
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
The date format is a timestamp (Y-m-d h:m:s). Let's look deeper to get an idea of the date range.
autos['date_crawled'].str[:10].value_counts(normalize=True,dropna=False).sort_values(ascending=True)
2016-04-07    0.001420
2016-04-06    0.003181
2016-03-18    0.013064
2016-04-05    0.013104
2016-03-06    0.013944
2016-03-13    0.015564
2016-03-05    0.025387
2016-03-24    0.029108
2016-03-16    0.029508
2016-03-27    0.031049
2016-03-17    0.031509
2016-03-25    0.031749
2016-03-31    0.031909
2016-03-10    0.032129
2016-03-23    0.032389
2016-03-26    0.032489
2016-03-11    0.032489
2016-03-22    0.032909
2016-03-09    0.033209
2016-03-08    0.033269
2016-03-30    0.033629
2016-04-01    0.033809
2016-03-15    0.033990
2016-03-29    0.034150
2016-03-28    0.034850
2016-03-19    0.034910
2016-04-02    0.035410
2016-03-07    0.035970
2016-04-04    0.036490
2016-03-14    0.036630
2016-03-12    0.036770
2016-03-21    0.037490
2016-03-20    0.037831
2016-04-03    0.038691
Name: date_crawled, dtype: float64
The data was crawled between March and April 2016, and the distribution across days is roughly uniform.
autos['ad_created'].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True)
2015-06-11    0.000020
2015-08-10    0.000020
2015-09-09    0.000020
2015-11-10    0.000020
2015-12-05    0.000020
...
2016-04-03    0.038931
2016-04-04    0.036850
2016-04-05    0.011843
2016-04-06    0.003261
2016-04-07    0.001280
Name: ad_created, Length: 76, dtype: float64
This column covers a much longer period than date_crawled: ads were created on the website between June 2015 and April 2016, although the vast majority fall within the crawl period.
autos['last_seen'].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True)
2016-03-05    0.001080
2016-03-06    0.004421
2016-03-07    0.005362
2016-03-08    0.007582
2016-03-09    0.009843
2016-03-10    0.010763
2016-03-11    0.012524
2016-03-12    0.023807
2016-03-13    0.008983
2016-03-14    0.012804
2016-03-15    0.015884
2016-03-16    0.016445
2016-03-17    0.027928
2016-03-18    0.007422
2016-03-19    0.015744
2016-03-20    0.020706
2016-03-21    0.020726
2016-03-22    0.021586
2016-03-23    0.018585
2016-03-24    0.019565
2016-03-25    0.019205
2016-03-26    0.016965
2016-03-27    0.016024
2016-03-28    0.020846
2016-03-29    0.022326
2016-03-30    0.024847
2016-03-31    0.023827
2016-04-01    0.023106
2016-04-02    0.024887
2016-04-03    0.025367
2016-04-04    0.024627
2016-04-05    0.124275
2016-04-06    0.220982
2016-04-07    0.130957
Name: last_seen, dtype: float64
It seems that the crawler was fetching data from the site between March and April 2016; this is confirmed by both date_crawled and last_seen. The distribution of last_seen is roughly uniform, except for the last three days, which show 6-10x more listings than a typical day. A genuine 6-10x spike in sales is highly unlikely, so these values almost certainly reflect the crawl period ending (the crawler recorded every still-active listing as last seen on its final runs) rather than a surge in sales.
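The size of the spike can be quantified by summing the shares of the final three crawl dates. A sketch on a toy last_seen Series (four made-up timestamps for illustration):

```python
import pandas as pd

last_seen = pd.Series(['2016-04-05 10:00:00', '2016-04-06 11:00:00',
                       '2016-04-07 12:00:00', '2016-03-20 09:00:00'])
# Take the date part of each timestamp and compute each date's share
dates = last_seen.str[:10]
shares = dates.value_counts(normalize=True)
tail_share = shares.loc[['2016-04-05', '2016-04-06', '2016-04-07']].sum()
print(round(tail_share, 2))  # 0.75
```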
autos['registration_month'].describe()
count    49986.000000
mean         5.723723
std          3.711839
min          0.000000
25%          3.000000
50%          6.000000
75%          9.000000
max         12.000000
Name: registration_month, dtype: float64
The minimum month value is 0, which is not a valid month and suggests the registration month is missing for those cars. The maximum is 12, and registrations are spread across the entire year.
autos['registration_year'].describe()
count    49986.000000
mean      2005.075721
std        105.727161
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
The registration_year column indicates the age of a car, but the analysis above shows it contains odd values: the minimum year is 1000, long before cars existed, and the maximum is 9999, far in the future. These must be data-entry errors, so the column needs to be cleaned before analysis.
The highest acceptable value for registration_year is 2016, since the data was compiled in 2016. After some googling I found that Germany started registering number plates in 1896, so 1896 can serve as the lowest acceptable value.
autos=autos[autos['registration_year'].between(1896,2016)]
autos['registration_year'].value_counts(normalize=True,dropna=False).sort_index(ascending=True)
1910    0.000187
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000042
...
2012    0.027553
2013    0.016786
2014    0.013850
2015    0.008310
2016    0.027408
Name: registration_year, Length: 78, dtype: float64
autos['registration_year'].describe()
count    48016.000000
mean      2002.806002
std          7.306212
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64
About 75% of the cars were registered in 1999 or later, and half of all cars in 2003 or later.
autos['brand'].value_counts(normalize=True)*100
volkswagen        21.211679
bmw               11.002582
opel              10.817228
mercedes_benz      9.536405
audi               8.640870
ford               6.976841
renault            4.735921
peugeot            2.953182
fiat               2.586638
seat               1.818144
skoda              1.603632
mazda              1.514079
nissan             1.509913
citroen            1.391203
smart              1.391203
toyota             1.247501
sonstige_autos     1.089220
hyundai            0.985088
volvo              0.924692
mini               0.864295
mitsubishi         0.814312
honda              0.785155
kia                0.710180
alfa_romeo         0.662279
porsche            0.610213
suzuki             0.591470
chevrolet          0.570643
chrysler           0.366544
dacia              0.256165
daihatsu           0.256165
jeep               0.224925
subaru             0.218677
land_rover         0.204099
saab               0.160363
jaguar             0.158281
trabant            0.156198
daewoo             0.149950
rover              0.135372
lancia             0.108297
lada               0.060397
Name: brand, dtype: float64
More than 60% of the cars are from German brands, with Volkswagen at the top (about 21% on its own). Japanese brands account for roughly 7% of the listings, and the rest of the market is split among American, French, Italian, Korean, and other manufacturers.
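To aggregate listings by country of origin, each brand can be mapped to a country first. A minimal sketch with a hypothetical, non-exhaustive origin mapping (the mapping and toy data are illustrative, not part of the project's dataset):

```python
import pandas as pd

# Hypothetical origin mapping for a few brands (illustrative, not exhaustive)
origin = {'volkswagen': 'Germany', 'bmw': 'Germany', 'mazda': 'Japan',
          'ford': 'USA', 'renault': 'France'}
brands = pd.Series(['volkswagen', 'bmw', 'mazda', 'volkswagen'])
by_country = brands.map(origin).value_counts(normalize=True)
print(by_country['Germany'])  # 0.75
```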
brands=autos['brand'].value_counts(normalize=True)*100
brands_label=brands[brands>=4].index
print(brands_label)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault'], dtype='object')
brand_mean_prices = {}
for label in brands_label:
    # Select the price column for this brand and take its mean
    brand_prices = autos.loc[autos['brand'] == label, 'price']
    brand_mean_prices[label] = int(brand_prices.mean())
brand_mean_prices
{'volkswagen': 5231, 'bmw': 8102, 'opel': 2876, 'mercedes_benz': 8485, 'audi': 9093, 'ford': 3652, 'renault': 2395}
According to the above analysis:
brand_mean_mileage = {}
for label in brands_label:
    # Select the odometer column for this brand and take its mean
    brand_miles = autos.loc[autos['brand'] == label, 'odometer_km']
    brand_mean_mileage[label] = int(brand_miles.mean())
brand_mean_mileage
{'volkswagen': 128724, 'bmw': 132431, 'opel': 129223, 'mercedes_benz': 130856, 'audi': 129287, 'ford': 124068, 'renault': 128183}
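Both aggregation loops above can alternatively be collapsed into a single groupby. A sketch on a toy frame (the values here are made up for illustration):

```python
import pandas as pd

autos_toy = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'ford'],
    'price': [8000, 8200, 3600],
    'odometer_km': [130000, 135000, 124000],
})
# One pass computes the mean of both columns for every brand
agg = autos_toy.groupby('brand')[['price', 'odometer_km']].mean()
print(agg.loc['bmw', 'price'])  # 8100.0
```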
bmp_series=pd.Series(brand_mean_prices)
bmm_series=pd.Series(brand_mean_mileage)
df=pd.DataFrame(bmp_series.sort_index(ascending=False),columns=['mean_price'])
df['mean_mileage']=bmm_series.sort_index(ascending=False)
df
mean_price | mean_mileage | |
---|---|---|
volkswagen | 5231 | 128724 |
renault | 2395 | 128183 |
opel | 2876 | 129223 |
mercedes_benz | 8485 | 130856 |
ford | 3652 | 124068 |
bmw | 8102 | 132431 |
audi | 9093 | 129287 |
The mean mileages of these top brands are close together, within roughly 7% of each other, so mileage does not explain the price differences. There is also no consistent inverse relationship between mean price and mean mileage: Mercedes-Benz has a higher mean price and lower mean mileage than BMW, but Audi, the most expensive brand here, has lower mean mileage than both.
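This can be checked numerically: using the mean values from the table above, the linear correlation between mean price and mean mileage turns out to be positive rather than negative.

```python
import pandas as pd

# Mean price and mean mileage per brand, taken from the table above
df_brands = pd.DataFrame({
    'mean_price':   [5231, 2395, 2876, 8485, 3652, 8102, 9093],
    'mean_mileage': [128724, 128183, 129223, 130856, 124068, 132431, 129287],
}, index=['volkswagen', 'renault', 'opel', 'mercedes_benz', 'ford', 'bmw', 'audi'])
corr = df_brands['mean_price'].corr(df_brands['mean_mileage'])
print(round(corr, 2))
```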
brand_count={}
for label in brands_label:
brand_name=autos[autos['brand']==label]
count=int(brand_name['brand'].value_counts())
brand_count[label]=count
bc_series=pd.Series(brand_count)
df['brand_num_cars']=bc_series
common_model={}
for label in brands_label:
model=autos[autos['brand']==label]
cm_model=model['model'].value_counts().sort_values(ascending=False).index
max_model=cm_model[0]
common_model[label]=max_model
cm_series=pd.Series(common_model)
df['popular_model']=cm_series
df
mean_price | mean_mileage | brand_num_cars | popular_model | |
---|---|---|---|---|
volkswagen | 5231 | 128724 | 10185 | golf |
renault | 2395 | 128183 | 2274 | twingo |
opel | 2876 | 129223 | 5194 | corsa |
mercedes_benz | 8485 | 130856 | 4579 | c_klasse |
ford | 3652 | 124068 | 3350 | focus |
bmw | 8102 | 132431 | 5283 | 3er |
audi | 9093 | 129287 | 4149 | a4 |
We already know that Volkswagen is the most popular brand, and its most popular model on eBay is the golf.
print(autos['odometer_km'].value_counts().sort_values(ascending=False))
150000    31023
125000     4959
100000     2109
90000      1696
80000      1396
70000      1199
60000      1137
50000      1006
5000        910
40000       800
30000       769
20000       763
10000       249
Name: odometer_km, dtype: int64
autos.groupby(by='odometer_km')['price'].mean()
odometer_km
5000       7654.626374
10000     19890.598394
20000     17940.720839
30000     16414.455137
40000     15441.445000
50000     13633.685885
60000     12286.970097
70000     10817.819850
80000      9575.700573
90000      8350.228184
100000     7936.031769
125000     6086.207905
150000     3654.001451
Name: price, dtype: float64
Price and odometer reading have an inverse relationship, as the analysis above shows: from the 10,000 km group, where the mean price is $19,890, mean price falls steadily to $3,654 at 150,000 km. The 5,000 km group is an exception with a surprisingly low mean price, possibly due to damaged or incorrectly listed cars.
autos.groupby('unrepaired_damage')['price'].mean()
unrepaired_damage
no     7077.225496
yes    2126.840092
Name: price, dtype: float64
On average, non-damaged cars sell for about $4,950 more than cars with unrepaired damage.
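The price gap follows directly from the group means above:

```python
import pandas as pd

# Group means copied from the output above
means = pd.Series({'no': 7077.225496, 'yes': 2126.840092})
gap = means['no'] - means['yes']
print(round(gap))  # 4950
```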
In this project we found that more than 60% of the listings come from German car manufacturers; among them Volkswagen has the most cars (10,185) at a reasonable mean price of $5,231. Among the top brands, mean mileage is tightly clustered and does not explain the price differences, but when the data is split into odometer groups, mean price clearly falls as mileage increases. Finally, cars with unrepaired damage sell on average for about $4,950 less than non-damaged cars.