In this project I will be working with a dataset of used-car listings from eBay Kleinanzeigen, the classifieds section of the German eBay website. The full dataset is publicly available; I will be using a sample of it with 50,000 rows and 20 columns, described in the autos.info() output below.
The aim of this project is to clean the data and analyze the included used-car listings.
# Importing libraries
import pandas as pd
import numpy as np

# Read the CSV file; Latin-1 encoding is needed for the German text
autos = pd.read_csv('autos.csv', encoding='Latin-1')
print(autos.info())
print(autos.head())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64
 18  postalCode           50000 non-null  int64
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
None

dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
In the output above I can see that five columns have incomplete data: vehicleType, gearbox, model, fuelType, and notRepairedDamage.
Since the listings come from a German website, some values will need translating. For example, the seller column indicates whether the seller is 'private' or a 'dealer', but the data shows 'privat' instead of 'private'. Similarly, the notRepairedDamage column shows 'nein' instead of 'no'. These details will have to be taken care of during the analysis.
# Change column names from camelcase to snakecase
print(autos.columns)
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest',
                 'vehicle_type', 'registration_year', 'gear_box', 'power_ps', 'model',
                 'odometer', 'registration_month', 'fuel_type', 'brand',
                 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code',
                 'last_seen']
autos.head()
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gear_box | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
The column names have been converted from camelcase to snakecase, and a few have been reworded (e.g. yearOfRegistration to registration_year, notRepairedDamage to unrepaired_damage) to make them more descriptive.
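The renaming above is done by hand. A mechanical camelcase-to-snakecase conversion could be sketched with a regular expression, though it cannot reproduce reworded names like registration_year, and runs of capitals such as powerPS would need extra handling:

```python
import re

def to_snake_case(name):
    # Insert an underscore before each capital letter (except at the start),
    # then lowercase the whole string.
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

cols = ['dateCrawled', 'offerType', 'vehicleType']
print([to_snake_case(c) for c in cols])  # ['date_crawled', 'offer_type', 'vehicle_type']
```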
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gear_box | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-30 19:48:02 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
I have made the following observations after looking at the table above:

- price and odometer are stored as text and need to be converted to numeric types.
- seller and offer_type have the same value in 49,999 of the 50,000 rows, so they carry almost no information.
- nr_of_pictures is 0 for every row.
- registration_year has implausible minimum (1000) and maximum (9999) values.
- registration_month has a minimum of 0, which is not a valid month.
# Fixing column odometer: strip 'km' and the thousands separator, convert to int
odometer = autos['odometer'].str.replace('km', '', regex=False)
odometer = odometer.str.replace(',', '', regex=False)
autos['odometer'] = odometer.astype(int)
autos.rename({'odometer': 'odometer_km'}, axis=1, inplace=True)
# Fixing column price: strip '$' and the thousands separator, convert to int.
# regex=False treats '$' as a literal character rather than a regex end anchor.
price = autos['price'].str.replace('$', '', regex=False)
price = price.str.replace(',', '', regex=False)
autos['price'] = price.astype(int)
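Depending on the pandas version, str.replace may interpret its pattern as a regular expression, in which case '$' is an end-of-string anchor rather than a literal dollar sign; passing regex=False avoids that pitfall. A sketch of the same cleanup on a toy Series:

```python
import pandas as pd

prices = pd.Series(['$5,000', '$8,500', '$0'])
cleaned = (prices.str.replace('$', '', regex=False)
                 .str.replace(',', '', regex=False))
cleaned = pd.to_numeric(cleaned).astype(int)
print(cleaned.tolist())  # [5000, 8500, 0]
```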
# Fixing columns vehicle_type, gear_box, unrepaired_damage:
# translating values from German to English
autos['vehicle_type'].value_counts(normalize=True, dropna=False)
limousine     0.25718
kleinwagen    0.21644
kombi         0.18254
NaN           0.10190
bus           0.08186
cabrio        0.06122
coupe         0.05074
suv           0.03972
andere        0.00840
Name: vehicle_type, dtype: float64
# Series.map leaves values missing from the dict (including real NaN) as NaN,
# so a 'NaN' string key would have no effect and is omitted.
vehicle_mapping = {'limousine': 'limousine', 'kleinwagen': 'small car',
                   'kombi': 'station wagon', 'bus': 'bus', 'cabrio': 'convertible',
                   'coupe': 'coupe', 'suv': 'suv', 'andere': 'other'}
autos['vehicle_type'] = autos['vehicle_type'].map(vehicle_mapping)
gb_mapping = {'automatik': 'automatic', 'manuell': 'manual'}
autos['gear_box'] = autos['gear_box'].map(gb_mapping)
autos['unrepaired_damage'].value_counts(normalize=True, dropna=False)
upd_mapping = {'nein': 'no', 'ja': 'yes'}
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(upd_mapping)
Columns gear_box, vehicle_type, and unrepaired_damage have been translated from German to English.
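Because map leaves unmapped values (including real NaN) as NaN, missing values can afterwards be labeled explicitly with fillna. A sketch on toy data:

```python
import numpy as np
import pandas as pd

damage = pd.Series(['nein', 'ja', np.nan])
# Translate known values; anything unmapped (here the NaN) stays NaN,
# then fillna gives it an explicit label.
translated = damage.map({'nein': 'no', 'ja': 'yes'}).fillna('unknown')
print(translated.tolist())  # ['no', 'yes', 'unknown']
```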
#Removing unwanted columns
autos=autos.drop(['seller','offer_type','nr_of_pictures'],axis=1)
Columns seller, offer_type, and nr_of_pictures have been dropped because almost all of the values in these columns were the same.
# Exploring the price column for outliers
price = autos['price']
price.unique().shape
price.describe()
price.value_counts().sort_index(ascending=False)

# Keep listings priced between $0 and $400,000; higher prices are implausible outliers
autos = autos[(autos['price'] >= 0) & (autos['price'] <= 400000)]
autos.head()
date_crawled | name | price | abtest | vehicle_type | registration_year | gear_box | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manual | 158 | andere | 150000 | 3 | lpg | peugeot | no | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatic | 286 | 7er | 150000 | 6 | benzin | bmw | no | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manual | 102 | golf | 70000 | 7 | benzin | volkswagen | no | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | 4350 | control | small car | 2007 | automatic | 71 | fortwo | 70000 | 6 | benzin | smart | no | 2016-03-12 00:00:00 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | 1350 | test | station wagon | 2003 | manual | 0 | focus | 150000 | 7 | benzin | ford | no | 2016-04-01 00:00:00 | 39218 | 2016-04-01 14:38:50 |
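As a cross-check on fixed cutoffs like the $0-$400,000 range used above, the bounds could instead be derived from quantiles. A sketch on a toy price Series (the toy values and the 1%/99% percentile choice are illustrative only, not the cutoffs used in this project):

```python
import pandas as pd

prices = pd.Series([0, 100, 1200, 5000, 9999999])
# Take the 1st and 99th percentiles as lower/upper bounds
low, high = prices.quantile([0.01, 0.99])
kept = prices[prices.between(low, high)]
print(len(kept))  # the extreme low and high values are dropped
```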
odometer=autos['odometer_km']
odometer.unique().shape
odometer.value_counts().sort_index(ascending=False)
150000    32416
125000     5169
100000     2168
90000      1757
80000      1436
70000      1230
60000      1164
50000      1025
40000       818
30000       789
20000       784
10000       264
5000        966
Name: odometer_km, dtype: int64
Observations:

This dataset has five date columns: date_crawled, ad_created, last_seen, registration_month, and registration_year. The first three are stored as strings, whereas registration_month and registration_year are integers. Let's start by exploring the three string columns.
autos[['date_crawled','ad_created','last_seen']][0:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
The date format is a timestamp (Y-m-d h:m:s). Let's look deeper to get an idea of the date range.
autos['date_crawled'].str[:10].value_counts(normalize=True,dropna=False).sort_values(ascending=True)
2016-04-07    0.001420
2016-04-06    0.003181
2016-03-18    0.013064
2016-04-05    0.013104
2016-03-06    0.013944
2016-03-13    0.015564
2016-03-05    0.025387
2016-03-24    0.029108
2016-03-16    0.029508
2016-03-27    0.031049
2016-03-17    0.031509
2016-03-25    0.031749
2016-03-31    0.031909
2016-03-10    0.032129
2016-03-23    0.032389
2016-03-26    0.032489
2016-03-11    0.032489
2016-03-22    0.032909
2016-03-09    0.033209
2016-03-08    0.033269
2016-03-30    0.033629
2016-04-01    0.033809
2016-03-15    0.033990
2016-03-29    0.034150
2016-03-28    0.034850
2016-03-19    0.034910
2016-04-02    0.035410
2016-03-07    0.035970
2016-04-04    0.036490
2016-03-14    0.036630
2016-03-12    0.036770
2016-03-21    0.037490
2016-03-20    0.037831
2016-04-03    0.038691
Name: date_crawled, dtype: float64
The data was crawled between March and April 2016, and the distribution across days is roughly uniform.
autos['ad_created'].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True)
2015-06-11    0.000020
2015-08-10    0.000020
2015-09-09    0.000020
2015-11-10    0.000020
2015-12-05    0.000020
...
2016-04-03    0.038931
2016-04-04    0.036850
2016-04-05    0.011843
2016-04-06    0.003261
2016-04-07    0.001280
Name: ad_created, Length: 76, dtype: float64
This column covers a much longer period than date_crawled: ads were created on the website between June 2015 and April 2016, although the vast majority fall within the crawl period.
autos['last_seen'].str[:10].value_counts(normalize=True,dropna=False).sort_index(ascending=True)
2016-03-05    0.001080
2016-03-06    0.004421
2016-03-07    0.005362
2016-03-08    0.007582
2016-03-09    0.009843
2016-03-10    0.010763
2016-03-11    0.012524
2016-03-12    0.023807
2016-03-13    0.008983
2016-03-14    0.012804
2016-03-15    0.015884
2016-03-16    0.016445
2016-03-17    0.027928
2016-03-18    0.007422
2016-03-19    0.015744
2016-03-20    0.020706
2016-03-21    0.020726
2016-03-22    0.021586
2016-03-23    0.018585
2016-03-24    0.019565
2016-03-25    0.019205
2016-03-26    0.016965
2016-03-27    0.016024
2016-03-28    0.020846
2016-03-29    0.022326
2016-03-30    0.024847
2016-03-31    0.023827
2016-04-01    0.023106
2016-04-02    0.024887
2016-04-03    0.025367
2016-04-04    0.024627
2016-04-05    0.124275
2016-04-06    0.220982
2016-04-07    0.130957
Name: last_seen, dtype: float64
It seems that the crawler was fetching data from the site between March and April 2016; this is confirmed by both date_crawled and last_seen. The distribution of last_seen is roughly uniform, except for the last three days, which show 6-10x more listings than a typical day. A genuine 6-10x spike in sales is highly unlikely, so these values almost certainly reflect the crawl period ending (the crawler recorded every still-active listing as last seen on its final runs) rather than a surge in sales.
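The size of the spike can be quantified by summing the shares of the final three crawl dates. A sketch on a toy last_seen Series (four made-up timestamps for illustration):

```python
import pandas as pd

last_seen = pd.Series(['2016-04-05 10:00:00', '2016-04-06 11:00:00',
                       '2016-04-07 12:00:00', '2016-03-20 09:00:00'])
# Take the date part of each timestamp and compute each date's share
dates = last_seen.str[:10]
shares = dates.value_counts(normalize=True)
tail_share = shares.loc[['2016-04-05', '2016-04-06', '2016-04-07']].sum()
print(round(tail_share, 2))  # 0.75
```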
autos['registration_month'].describe()
count    49986.000000
mean         5.723723
std          3.711839
min          0.000000
25%          3.000000
50%          6.000000
75%          9.000000
max         12.000000
Name: registration_month, dtype: float64
The minimum month value is 0, which is not a valid month and suggests the registration month is missing for those cars. The maximum is 12, and registrations are spread across the entire year.
autos['registration_year'].describe()
count    49986.000000
mean      2005.075721
std        105.727161
min       1000.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       9999.000000
Name: registration_year, dtype: float64
The registration_year column indicates the age of a car, but the analysis above shows it contains odd values: the minimum year is 1000, long before cars existed, and the maximum is 9999, far in the future. These must be data-entry errors, so the column needs to be cleaned before analysis.
The highest acceptable value for registration_year is 2016, since the data was compiled in 2016. After some googling I found that Germany started registering number plates in 1896, so 1896 can serve as the lowest acceptable value.
autos=autos[autos['registration_year'].between(1896,2016)]
autos['registration_year'].value_counts(normalize=True,dropna=False).sort_index(ascending=True)
1910    0.000187
1927    0.000021
1929    0.000021
1931    0.000021
1934    0.000042
...
2012    0.027553
2013    0.016786
2014    0.013850
2015    0.008310
2016    0.027408
Name: registration_year, Length: 78, dtype: float64
autos['registration_year'].describe()
count    48016.000000
mean      2002.806002
std          7.306212
min       1910.000000
25%       1999.000000
50%       2003.000000
75%       2008.000000
max       2016.000000
Name: registration_year, dtype: float64
About 75% of the cars were registered in 1999 or later, and half of all cars in 2003 or later.
autos['brand'].value_counts(normalize=True)*100
volkswagen        21.211679
bmw               11.002582
opel              10.817228
mercedes_benz      9.536405
audi               8.640870
ford               6.976841
renault            4.735921
peugeot            2.953182
fiat               2.586638
seat               1.818144
skoda              1.603632
mazda              1.514079
nissan             1.509913
citroen            1.391203
smart              1.391203
toyota             1.247501
sonstige_autos     1.089220
hyundai            0.985088
volvo              0.924692
mini               0.864295
mitsubishi         0.814312
honda              0.785155
kia                0.710180
alfa_romeo         0.662279
porsche            0.610213
suzuki             0.591470
chevrolet          0.570643
chrysler           0.366544
dacia              0.256165
daihatsu           0.256165
jeep               0.224925
subaru             0.218677
land_rover         0.204099
saab               0.160363
jaguar             0.158281
trabant            0.156198
daewoo             0.149950
rover              0.135372
lancia             0.108297
lada               0.060397
Name: brand, dtype: float64
More than 60% of the cars are from German brands, with Volkswagen at the top (about 21% on its own). Japanese brands account for roughly 7% of the listings, and the rest of the market is split among American, French, Italian, Korean, and other manufacturers.
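To aggregate listings by country of origin, each brand can be mapped to a country first. A minimal sketch with a hypothetical, non-exhaustive origin mapping (the mapping and toy data are illustrative, not part of the project's dataset):

```python
import pandas as pd

# Hypothetical origin mapping for a few brands (illustrative, not exhaustive)
origin = {'volkswagen': 'Germany', 'bmw': 'Germany', 'mazda': 'Japan',
          'ford': 'USA', 'renault': 'France'}
brands = pd.Series(['volkswagen', 'bmw', 'mazda', 'volkswagen'])
by_country = brands.map(origin).value_counts(normalize=True)
print(by_country['Germany'])  # 0.75
```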
brands=autos['brand'].value_counts(normalize=True)*100
brands_label=brands[brands>=4].index
print(brands_label)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault'], dtype='object')
brand_mean_prices = {}
for label in brands_label:
    # Select the price column for this brand and take its mean
    brand_prices = autos.loc[autos['brand'] == label, 'price']
    brand_mean_prices[label] = int(brand_prices.mean())
brand_mean_prices
{'volkswagen': 5231, 'bmw': 8102, 'opel': 2876, 'mercedes_benz': 8485, 'audi': 9093, 'ford': 3652, 'renault': 2395}
According to the above analysis:
brand_mean_mileage = {}
for label in brands_label:
    # Select the odometer column for this brand and take its mean
    brand_miles = autos.loc[autos['brand'] == label, 'odometer_km']
    brand_mean_mileage[label] = int(brand_miles.mean())
brand_mean_mileage
{'volkswagen': 128724, 'bmw': 132431, 'opel': 129223, 'mercedes_benz': 130856, 'audi': 129287, 'ford': 124068, 'renault': 128183}
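Both aggregation loops above can alternatively be collapsed into a single groupby. A sketch on a toy frame (the values here are made up for illustration):

```python
import pandas as pd

autos_toy = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'ford'],
    'price': [8000, 8200, 3600],
    'odometer_km': [130000, 135000, 124000],
})
# One pass computes the mean of both columns for every brand
agg = autos_toy.groupby('brand')[['price', 'odometer_km']].mean()
print(agg.loc['bmw', 'price'])  # 8100.0
```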
bmp_series=pd.Series(brand_mean_prices)
bmm_series=pd.Series(brand_mean_mileage)
df=pd.DataFrame(bmp_series.sort_index(ascending=False),columns=['mean_price'])
df['mean_mileage']=bmm_series.sort_index(ascending=False)
df
mean_price | mean_mileage | |
---|---|---|
volkswagen | 5231 | 128724 |
renault | 2395 | 128183 |
opel | 2876 | 129223 |
mercedes_benz | 8485 | 130856 |
ford | 3652 | 124068 |
bmw | 8102 | 132431 |
audi | 9093 | 129287 |
The mean mileages of these top brands are close together, within roughly 7% of each other, so mileage does not explain the price differences. There is also no consistent inverse relationship between mean price and mean mileage: Mercedes-Benz has a higher mean price and lower mean mileage than BMW, but Audi, the most expensive brand here, has lower mean mileage than both.
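This can be checked numerically: using the mean values from the table above, the linear correlation between mean price and mean mileage turns out to be positive rather than negative.

```python
import pandas as pd

# Mean price and mean mileage per brand, taken from the table above
df_brands = pd.DataFrame({
    'mean_price':   [5231, 2395, 2876, 8485, 3652, 8102, 9093],
    'mean_mileage': [128724, 128183, 129223, 130856, 124068, 132431, 129287],
}, index=['volkswagen', 'renault', 'opel', 'mercedes_benz', 'ford', 'bmw', 'audi'])
corr = df_brands['mean_price'].corr(df_brands['mean_mileage'])
print(round(corr, 2))
```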
brand_count={}
for label in brands_label:
brand_name=autos[autos['brand']==label]
count=int(brand_name['brand'].value_counts())
brand_count[label]=count
bc_series=pd.Series(brand_count)
df['brand_num_cars']=bc_series
common_model={}
for label in brands_label:
model=autos[autos['brand']==label]
cm_model=model['model'].value_counts().sort_values(ascending=False).index
max_model=cm_model[0]
common_model[label]=max_model
cm_series=pd.Series(common_model)
df['popular_model']=cm_series
df
mean_price | mean_mileage | brand_num_cars | popular_model | |
---|---|---|---|---|
volkswagen | 5231 | 128724 | 10185 | golf |
renault | 2395 | 128183 | 2274 | twingo |
opel | 2876 | 129223 | 5194 | corsa |
mercedes_benz | 8485 | 130856 | 4579 | c_klasse |
ford | 3652 | 124068 | 3350 | focus |
bmw | 8102 | 132431 | 5283 | 3er |
audi | 9093 | 129287 | 4149 | a4 |
We already know that Volkswagen is the most popular brand, and its most popular model on eBay is the golf.
print(autos['odometer_km'].value_counts().sort_values(ascending=False))
150000    31023
125000     4959
100000     2109
90000      1696
80000      1396
70000      1199
60000      1137
50000      1006
5000        910
40000       800
30000       769
20000       763
10000       249
Name: odometer_km, dtype: int64
autos.groupby(by='odometer_km')['price'].mean()
odometer_km
5000       7654.626374
10000     19890.598394
20000     17940.720839
30000     16414.455137
40000     15441.445000
50000     13633.685885
60000     12286.970097
70000     10817.819850
80000      9575.700573
90000      8350.228184
100000     7936.031769
125000     6086.207905
150000     3654.001451
Name: price, dtype: float64
Price and odometer reading have an inverse relationship, as the analysis above shows: from the 10,000 km group, where the mean price is $19,890, mean price falls steadily to $3,654 at 150,000 km. The 5,000 km group is an exception with a surprisingly low mean price, possibly due to damaged or incorrectly listed cars.
autos.groupby('unrepaired_damage')['price'].mean()
unrepaired_damage
no     7077.225496
yes    2126.840092
Name: price, dtype: float64
On average, non-damaged cars sell for about $4,950 more than cars with unrepaired damage.
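The price gap follows directly from the group means above:

```python
import pandas as pd

# Group means copied from the output above
means = pd.Series({'no': 7077.225496, 'yes': 2126.840092})
gap = means['no'] - means['yes']
print(round(gap))  # 4950
```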
In this project we found that more than 60% of the listings come from German car manufacturers; among them Volkswagen has the most cars (10,185) at a reasonable mean price of $5,231. Among the top brands, mean mileage is tightly clustered and does not explain the price differences, but when the data is split into odometer groups, mean price clearly falls as mileage increases. Finally, cars with unrepaired damage sell on average for about $4,950 less than non-damaged cars.