Let's open our dataset and explore its size, along with the names and types of the major variables.
# Import pandas and numpy packages
import pandas as pd
import numpy as np
# Open the original dataset with pandas package
autos = pd.read_csv('/Users/mac/downloads/autos.csv',encoding = 'Latin-1')
autos.info() # get info about variables' types and the size of our dataset
autos.head() # inspect first five rows
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-null  object
 16  dateCreated          371528 non-null  object
 17  nrOfPictures         371528 non-null  int64 
 18  postalCode           371528 non-null  int64 
 19  lastSeen             371528 non-null  object
dtypes: int64(7), object(13)
memory usage: 56.7+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | kilometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-24 11:52:17 | Golf_3_1.6 | privat | Angebot | 480 | test | NaN | 1993 | manuell | 0 | golf | 150000 | 0 | benzin | volkswagen | NaN | 2016-03-24 00:00:00 | 0 | 70435 | 2016-04-07 03:16:57 |
1 | 2016-03-24 10:58:45 | A5_Sportback_2.7_Tdi | privat | Angebot | 18300 | test | coupe | 2011 | manuell | 190 | NaN | 125000 | 5 | diesel | audi | ja | 2016-03-24 00:00:00 | 0 | 66954 | 2016-04-07 01:46:50 |
2 | 2016-03-14 12:52:21 | Jeep_Grand_Cherokee_"Overland" | privat | Angebot | 9800 | test | suv | 2004 | automatik | 163 | grand | 125000 | 8 | diesel | jeep | NaN | 2016-03-14 00:00:00 | 0 | 90480 | 2016-04-05 12:47:46 |
3 | 2016-03-17 16:54:04 | GOLF_4_1_4__3TÜRER | privat | Angebot | 1500 | test | kleinwagen | 2001 | manuell | 75 | golf | 150000 | 6 | benzin | volkswagen | nein | 2016-03-17 00:00:00 | 0 | 91074 | 2016-03-17 17:40:17 |
4 | 2016-03-31 17:25:20 | Skoda_Fabia_1.4_TDI_PD_Classic | privat | Angebot | 3600 | test | kleinwagen | 2008 | manuell | 69 | fabia | 90000 | 7 | diesel | skoda | nein | 2016-03-31 00:00:00 | 0 | 60437 | 2016-04-06 10:17:21 |
From the tables above we can infer the following about our dataset: the variables of major importance for our further investigation are price, car model, car brand, kilometer (mileage), the date a classified was created and the date it was last seen. For convenience, let's convert the camelCase column names to snake_case.
# Create a function to turn camelCase names into snake_case
def change_case(name):
    return ''.join(['_' + c.lower() if c.isupper()
                    else c for c in name]).lstrip('_')
# Apply the function to the column names of our dataset
column_names = []  # list with snake_case names
for column_name in autos.columns:
    column_names.append(change_case(column_name))
autos.columns = column_names  # rename columns
# Check the results
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   date_crawled           371528 non-null  object
 1   name                   371528 non-null  object
 2   seller                 371528 non-null  object
 3   offer_type             371528 non-null  object
 4   price                  371528 non-null  int64 
 5   abtest                 371528 non-null  object
 6   vehicle_type           333659 non-null  object
 7   year_of_registration   371528 non-null  int64 
 8   gearbox                351319 non-null  object
 9   power_p_s              371528 non-null  int64 
 10  model                  351044 non-null  object
 11  kilometer              371528 non-null  int64 
 12  month_of_registration  371528 non-null  int64 
 13  fuel_type              338142 non-null  object
 14  brand                  371528 non-null  object
 15  not_repaired_damage    299468 non-null  object
 16  date_created           371528 non-null  object
 17  nr_of_pictures         371528 non-null  int64 
 18  postal_code            371528 non-null  int64 
 19  last_seen              371528 non-null  object
dtypes: int64(7), object(13)
memory usage: 56.7+ MB
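As a quick sanity check, the conversion logic of the helper defined above can be exercised in isolation on a few sample column names (a standalone sketch re-defining the same function):

```python
# Standalone sanity check of the camelCase -> snake_case helper defined above.
def change_case(name):
    return ''.join(['_' + c.lower() if c.isupper() else c for c in name]).lstrip('_')

examples = ['dateCrawled', 'yearOfRegistration', 'powerPS', 'price']
converted = [change_case(e) for e in examples]
print(converted)  # ['date_crawled', 'year_of_registration', 'power_p_s', 'price']
```

Note that consecutive capitals each get their own underscore, which is why 'powerPS' becomes 'power_p_s' in the renamed columns.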
Next, let's look at descriptive statistics for our major variables.
# Get descriptive statistics for all variables
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | year_of_registration | gearbox | power_p_s | model | kilometer | month_of_registration | fuel_type | brand | not_repaired_damage | date_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 371528 | 371528 | 371528 | 371528 | 3.715280e+05 | 371528 | 333659 | 371528.000000 | 351319 | 371528.000000 | 351044 | 371528.000000 | 371528.000000 | 338142 | 371528 | 299468 | 371528 | 371528.0 | 371528.00000 | 371528 |
unique | 280500 | 233531 | 2 | 2 | NaN | 2 | 8 | NaN | 2 | NaN | 251 | NaN | NaN | 7 | 40 | 2 | 114 | NaN | NaN | 182806 |
top | 2016-03-24 14:49:47 | Ford_Fiesta | privat | Angebot | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-06 13:45:54 |
freq | 7 | 657 | 371525 | 371516 | NaN | 192585 | 95894 | NaN | 274214 | NaN | 30070 | NaN | NaN | 223857 | 79640 | 263182 | 14450 | NaN | NaN | 17 |
mean | NaN | NaN | NaN | NaN | 1.729514e+04 | NaN | NaN | 2004.577997 | NaN | 115.549477 | NaN | 125618.688228 | 5.734445 | NaN | NaN | NaN | NaN | 0.0 | 50820.66764 | NaN |
std | NaN | NaN | NaN | NaN | 3.587954e+06 | NaN | NaN | 92.866598 | NaN | 192.139578 | NaN | 40112.337051 | 3.712412 | NaN | NaN | NaN | NaN | 0.0 | 25799.08247 | NaN |
min | NaN | NaN | NaN | NaN | 0.000000e+00 | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.00000 | NaN |
25% | NaN | NaN | NaN | NaN | 1.150000e+03 | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30459.00000 | NaN |
50% | NaN | NaN | NaN | NaN | 2.950000e+03 | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49610.00000 | NaN |
75% | NaN | NaN | NaN | NaN | 7.200000e+03 | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71546.00000 | NaN |
max | NaN | NaN | NaN | NaN | 2.147484e+09 | NaN | NaN | 9999.000000 | NaN | 20000.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.00000 | NaN |
The summary table above already reveals several anomalies worth checking, starting with the 'nr_of_pictures' column, whose mean and standard deviation are both zero.
# Check what values are contained in the 'nr_of_pictures' column
autos['nr_of_pictures'].value_counts()
0    371528
Name: nr_of_pictures, dtype: int64
The output shows only zeros, meaning no pictures are recorded in the dataset, so this column is useless for further analysis. We drop it along with the 'seller' and 'offer_type' columns, which are near-constant (371525 of 371528 listings are 'privat', and 371516 are 'Angebot') and therefore irrelevant.
# Drop columns with identical values
autos = autos.drop(['seller','offer_type','nr_of_pictures'],axis=1)
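The near-useless columns were identified by eye here; as a sketch, `nunique` can find truly constant columns programmatically (toy frame for illustration, not the real 'autos' data):

```python
import pandas as pd

# Sketch: constant columns carry no information and can be found with nunique.
# Toy data for illustration only.
toy = pd.DataFrame({
    'price': [480, 18300, 9800],
    'seller': ['privat', 'privat', 'privat'],
    'nr_of_pictures': [0, 0, 0],
})
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]
toy = toy.drop(columns=constant_cols)
print(constant_cols, list(toy.columns))
```

In the real data 'seller' and 'offer_type' are not perfectly constant, so a small tolerance (or a manual check as above) is still needed.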
We proceed by checking the 'kilometer' and 'price' columns, which are of particular interest to us.
# show the distribution of mileage and prices
print(autos['kilometer'].value_counts())
print(autos['price'].value_counts().sort_index(ascending=True).head(10))
print(autos['price'].value_counts().sort_index(ascending=False).head(10))
print(autos['price'].describe())
150000    240797
125000     38067
100000     15920
90000      12523
80000      11053
70000       9773
60000       8669
50000       7615
5000        7069
40000       6376
30000       6041
20000       5676
10000       1949
Name: kilometer, dtype: int64

0     10778
1      1189
2        12
3         8
4         1
5        26
7         3
8         9
9         8
10       84
Name: price, dtype: int64

2147483647     1
99999999      15
99000000       1
74185296       1
32545461       1
27322222       1
14000500       1
12345678       9
11111111      10
10010011       1
Name: price, dtype: int64

count    3.715280e+05
mean     1.729514e+04
std      3.587954e+06
min      0.000000e+00
25%      1.150000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.147484e+09
Name: price, dtype: float64
The 'kilometer' column doesn't appear to contain outliers; its values are grouped into rounded ranges, so the number of unique values is small. We may also conclude that the majority of cars in our dataset have quite high mileage, i.e. they are most likely fairly old.
However, there are outliers in the 'price' column: we observe both extremely high and extremely low values. While extremely low values are plausible as starting prices in car sale auctions, extremely high values are most likely data-entry mistakes. Therefore, we get rid of them by restricting prices to the plausible range of 0 - 350000.
# Keep only rows with price values within the 0 - 350000 range
# using a boolean mask
autos = autos[autos['price'].between(0, 350000)]
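The filtering relies on `Series.between`, which is inclusive on both ends; a minimal sketch on toy prices:

```python
import pandas as pd

# Minimal sketch of Series.between on toy prices: the range is inclusive on
# both ends, so 0 survives while the implausible extremes fall away.
prices = pd.Series([0, 1150, 2950, 99999999, 2147483647])
filtered = prices[prices.between(0, 350000)]
print(filtered.tolist())  # [0, 1150, 2950]
```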
Next, let's explore the date variables in our dataset.
# Show the distribution of time points when classifieds were last seen on the eBay website
print(autos['last_seen'].value_counts(normalize=True,dropna=False).sort_values(ascending=False).head(10))
print(autos['last_seen'].value_counts(normalize=True,dropna=False).sort_values(ascending=True).head(10))
2016-04-06 13:45:54    0.000046
2016-04-07 06:45:59    0.000046
2016-04-06 01:15:23    0.000043
2016-04-07 00:45:17    0.000043
2016-04-06 09:17:58    0.000043
2016-04-07 05:44:30    0.000043
2016-04-06 07:45:46    0.000043
2016-04-07 10:46:23    0.000043
2016-04-07 09:46:25    0.000043
2016-04-06 09:46:00    0.000043
Name: last_seen, dtype: float64

2016-03-12 10:37:56    0.000003
2016-03-16 22:16:37    0.000003
2016-03-09 20:47:38    0.000003
2016-03-30 23:46:29    0.000003
2016-03-25 22:45:53    0.000003
2016-04-04 10:16:04    0.000003
2016-03-11 06:41:34    0.000003
2016-03-23 19:57:37    0.000003
2016-03-20 06:44:24    0.000003
2016-03-28 07:03:20    0.000003
Name: last_seen, dtype: float64
# Show descriptive statistics
print(autos['last_seen'].describe())
count                  371413
unique                 182756
top       2016-04-06 13:45:54
freq                       17
Name: last_seen, dtype: object
From the results above we can see that the distribution of last-seen times is fairly uniform over the date range in our dataset. Even though some days appear comparatively often, the share of unique values is high enough to assume there is no particular time at which the bulk of car sales occurred. It is also worth noting that all observations in the 'last_seen' column fall within two months of 2016, March and April, which is the date range covered by our dataset.
# Show the distribution of time points when classifieds were created
print(autos['date_created'].value_counts(normalize=True,dropna=False).sort_values(ascending=False).head(10))
print(autos['date_created'].value_counts(normalize=True,dropna=False).sort_values(ascending=True).head(10))
2016-04-03 00:00:00    0.038887
2016-04-04 00:00:00    0.037745
2016-03-20 00:00:00    0.036463
2016-03-12 00:00:00    0.036011
2016-03-21 00:00:00    0.035798
2016-03-14 00:00:00    0.035222
2016-03-28 00:00:00    0.035179
2016-04-02 00:00:00    0.034966
2016-03-07 00:00:00    0.034517
2016-03-19 00:00:00    0.034326
Name: date_created, dtype: float64

2015-09-09 00:00:00    0.000003
2015-11-12 00:00:00    0.000003
2015-06-18 00:00:00    0.000003
2016-01-15 00:00:00    0.000003
2016-01-06 00:00:00    0.000003
2015-10-14 00:00:00    0.000003
2015-08-10 00:00:00    0.000003
2015-11-13 00:00:00    0.000003
2015-11-02 00:00:00    0.000003
2015-11-10 00:00:00    0.000003
Name: date_created, dtype: float64
# Show descriptive statistics
print(autos['date_created'].describe())
count                  371413
unique                    114
top       2016-04-03 00:00:00
freq                    14443
Name: date_created, dtype: object
A slightly different situation can be observed in the dates when advertisements were created. Most of them were created on certain days in March and April 2016, and a far smaller share of classifieds was created before 2016. Therefore, we may conclude that the bulk of the observations in this column also occurred in spring 2016.
# Show the distribution of years when cars in the dataset were registered
print(autos['year_of_registration'].value_counts(normalize=True,dropna=False).sort_values(ascending=False).head(10))
print(autos['year_of_registration'].value_counts(normalize=True,dropna=False).sort_values(ascending=True).head(10))
2000    0.066072
1999    0.061290
2005    0.060073
2006    0.054457
2001    0.054425
2003    0.053498
2004    0.053156
2002    0.051659
1998    0.048324
2007    0.047580
Name: year_of_registration, dtype: float64

8455    0.000003
3700    0.000003
2290    0.000003
8200    0.000003
8500    0.000003
9229    0.000003
6500    0.000003
2066    0.000003
5900    0.000003
1039    0.000003
Name: year_of_registration, dtype: float64
# Show descriptive statistics
print(autos['year_of_registration'].describe())
count    371413.000000
mean       2004.561152
std          91.937676
min        1000.000000
25%        1999.000000
50%        2003.000000
75%        2008.000000
max        9999.000000
Name: year_of_registration, dtype: float64
The registration-year statistics above contain implausible values: a minimum of 1000 and a maximum of 9999. We should therefore remove these obviously wrong rows from our dataset. We set the plausible range to 1900 - 2016, since all the listings were created by 2016.
# Check the share of wrong rows in the total
print('Percentage share of wrong rows:', (~autos['year_of_registration'].between(1900, 2016)).sum() / autos.shape[0] * 100, '(%)')
# Keep only rows with registration years within the 1900 - 2016 range
# using a boolean mask
autos = autos[autos['year_of_registration'].between(1900, 2016)]
Percentage share of wrong rows: 3.9683586734982352 (%)
Only about 4% of the rows are obviously incorrect, so removing them does not meaningfully constrain our analysis.
Let's now look at the distribution of the remaining registration years.
# Calculate percentage share distribution of years in our dataset
autos['year_of_registration'].value_counts(normalize=True)
2000    0.068802
1999    0.063823
2005    0.062556
2006    0.056707
2001    0.056674
          ...   
1925    0.000003
1920    0.000003
1919    0.000003
1915    0.000003
1911    0.000003
Name: year_of_registration, Length: 97, dtype: float64
From the results above we can observe that the most common registration years in our dataset fall in the 1990s or later. Years from the beginning of the 20th century, on the other hand, are rare, which nonetheless shows that even very old cars are occasionally sold through eBay.
Let's explore further the variety of car brands sold, as recorded in our dataset.
# Calculate shares of unique values in the 'brand' column of our dataset
autos['brand'].value_counts(normalize=True)
volkswagen        0.212396
bmw               0.109697
opel              0.107092
mercedes_benz     0.095964
audi              0.089356
ford              0.068856
renault           0.047573
peugeot           0.029868
fiat              0.025763
seat              0.018636
skoda             0.015412
mazda             0.015350
smart             0.014108
citroen           0.013875
nissan            0.013573
toyota            0.012751
sonstige_autos    0.010584
hyundai           0.009833
mini              0.009213
volvo             0.009132
mitsubishi        0.008268
honda             0.007590
kia               0.006880
alfa_romeo        0.006348
suzuki            0.006317
porsche           0.006101
chevrolet         0.005013
chrysler          0.003942
dacia             0.002450
daihatsu          0.002187
jeep              0.002184
land_rover        0.002136
subaru            0.002125
jaguar            0.001719
trabant           0.001618
saab              0.001452
daewoo            0.001438
rover             0.001298
lancia            0.001292
lada              0.000611
Name: brand, dtype: float64
It is not surprising that German cars constitute the majority of brands, with Volkswagen the clear leader of the list at more than 21% of the total turnover. It is followed by the three major German premium brands and Opel, with roughly equal shares of around 10% each.
Let's consider further only car brands that comprise at least 1% of all sales in our dataset. Brands below this threshold are dropped from further analysis as irrelevant.
# Create a series containing shares of car brands in total turnover
brands = autos['brand'].value_counts(normalize=True)
popular_brands = brands[brands > 0.01].index  # keep only the index, i.e. the list
                                              # of brands we are interested in
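The thresholding pattern can be sketched on toy data to make the `value_counts(normalize=True)` plus boolean-indexing combination explicit:

```python
import pandas as pd

# Sketch of the thresholding pattern on toy data: value_counts(normalize=True)
# returns shares summing to 1, and boolean indexing keeps brands above 1%.
toy_brands = pd.Series(['volkswagen'] * 60 + ['bmw'] * 39 + ['lada'] * 1)
shares = toy_brands.value_counts(normalize=True)
popular = shares[shares > 0.01].index.tolist()
print(popular)  # 'lada' sits exactly at 1% and is excluded by the strict >
```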
Now that we have the list of popular car brands being sold in Germany, it is possible to calculate their mean prices, mileage, horsepower, year of registration and the average number of days these car brands were on sale - i.e. how fast they were sold.
# Create a dictionary containing average prices for the popular car brands
mean_prices = {}
for i in popular_brands:
    mean_prices[i] = int(autos[autos['brand'] == i]['price'].mean())
# check results
print(mean_prices)
{'volkswagen': 5231, 'bmw': 8224, 'opel': 2870, 'mercedes_benz': 8387, 'audi': 8849, 'ford': 3595, 'renault': 2366, 'peugeot': 3206, 'fiat': 2803, 'seat': 4420, 'skoda': 6460, 'mazda': 3970, 'smart': 3586, 'citroen': 3649, 'nissan': 4585, 'toyota': 5263, 'sonstige_autos': 12456}
# Repeat the same procedure for average mileage
mean_mileage = {}
for i in popular_brands:
    mean_mileage[i] = int(autos[autos['brand'] == i]['kilometer'].mean())
print(mean_mileage)
{'volkswagen': 128338, 'bmw': 132666, 'opel': 128756, 'mercedes_benz': 130585, 'audi': 129498, 'ford': 123617, 'renault': 127875, 'peugeot': 124599, 'fiat': 116523, 'seat': 120916, 'skoda': 113720, 'mazda': 125945, 'smart': 99684, 'citroen': 120313, 'nissan': 119543, 'toyota': 117337, 'sonstige_autos': 86854}
# Repeat the same procedure for the cars' average registration year
mean_age = {}
for i in popular_brands:
    mean_age[i] = int(autos[autos['brand'] == i]['year_of_registration'].mean())
print(mean_age)
{'volkswagen': 2002, 'bmw': 2002, 'opel': 2002, 'mercedes_benz': 2001, 'audi': 2003, 'ford': 2002, 'renault': 2002, 'peugeot': 2003, 'fiat': 2002, 'seat': 2004, 'skoda': 2006, 'mazda': 2002, 'smart': 2005, 'citroen': 2003, 'nissan': 2003, 'toyota': 2004, 'sonstige_autos': 1993}
# Repeat the same procedure for the cars' average horsepower
mean_power = {}
for i in popular_brands:
    mean_power[i] = int(autos[autos['brand'] == i]['power_p_s'].mean())
print(mean_power)
{'volkswagen': 98, 'bmw': 166, 'opel': 92, 'mercedes_benz': 154, 'audi': 157, 'ford': 96, 'renault': 79, 'peugeot': 91, 'fiat': 72, 'seat': 92, 'skoda': 100, 'mazda': 106, 'smart': 61, 'citroen': 88, 'nissan': 95, 'toyota': 102, 'sonstige_autos': 125}
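As a design note, the four dictionary-building loops above could be collapsed into a single `groupby` aggregation; a sketch on toy data (column names mirror the cleaned dataset):

```python
import pandas as pd

# Sketch: per-brand averages via one groupby instead of per-column loops.
# Toy data for illustration only.
toy = pd.DataFrame({
    'brand': ['volkswagen', 'volkswagen', 'bmw', 'bmw'],
    'price': [5000, 6000, 8000, 9000],
    'kilometer': [150000, 125000, 150000, 100000],
})
means = toy.groupby('brand')[['price', 'kilometer']].mean()
print(means)
```

On the real frame, `autos.groupby('brand')[['price', 'kilometer', 'power_p_s', 'year_of_registration']].mean()` would produce all four averages at once, after which the result could be filtered to `popular_brands`.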
Let's now create a new column reflecting how long a car was on sale. For this purpose we use the date each classified was created and the date it was last seen; we treat these two dates as proxies for the start and the end of a sale.
# Turn the two columns of interest into datetime format
last_seen = pd.to_datetime(autos['last_seen'])
date_created = pd.to_datetime(autos['date_created'])
# Drop the time of day; .dt.normalize() keeps the datetime64 dtype
# (unlike .dt.date, which returns plain objects and breaks the .dt accessor below)
last_seen = last_seen.dt.normalize()
date_created = date_created.dt.normalize()
# calculate time on sale
days_on_sale = last_seen - date_created
days_on_sale = days_on_sale.dt.days  # keep only the whole number of days
# get descriptive statistics
days_on_sale.describe()
count    356674.000000
mean          8.870470
std           8.730316
min           0.000000
25%           2.000000
50%           6.000000
75%          14.000000
max         759.000000
dtype: float64
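The date arithmetic can be checked independently of the dataset. In this sketch (timestamps modeled on the first listing), `.dt.normalize()` zeroes the time of day while keeping the datetime64 dtype, so the subtraction yields a timedelta Series whose `.dt.days` gives whole days:

```python
import pandas as pd

# Standalone sketch of the time-on-sale calculation on two single-element Series.
created = pd.to_datetime(pd.Series(['2016-03-24 11:52:17']))
seen = pd.to_datetime(pd.Series(['2016-04-07 03:16:57']))
days = (seen.dt.normalize() - created.dt.normalize()).dt.days
print(days.iloc[0])  # 14 calendar days between 2016-03-24 and 2016-04-07
```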
# Check the distribution of days when classifieds were created
date_created.value_counts(normalize=True).head(10)
2016-04-03    0.038971
2016-04-04    0.037796
2016-03-20    0.036538
2016-03-12    0.036053
2016-03-21    0.035825
2016-03-28    0.035144
2016-03-14    0.035116
2016-04-02    0.034923
2016-03-07    0.034642
2016-03-19    0.034289
Name: date_created, dtype: float64
# Check the distribution of days when classifieds were last seen
last_seen.value_counts(normalize=True).head(10)
2016-04-06    0.219467
2016-04-07    0.130752
2016-04-05    0.126757
2016-03-17    0.028544
2016-04-04    0.025561
2016-04-03    0.025197
2016-04-02    0.024793
2016-03-31    0.024165
2016-04-01    0.023941
2016-03-30    0.023758
Name: last_seen, dtype: float64
From the above results we note that on average it took almost 9 days for a car to be sold, and around 75% of cars in the dataset were sold within 14 days, as measured by how long the respective advertisements were public. It is also important to note that almost half of the last-seen observations occurred on just three days: 5, 6 and 7 April 2016. This is likely an artifact of when the data collection ended rather than of actual sales, so these dates are at best rough approximations of the time cars were sold. The distribution of creation dates, on the other hand, is uniform and doesn't present a similar conundrum at first sight.
To proceed further we should calculate the average values of most important variables for each car brand.
# Create a dictionary containing the cars' average number of days on sale
autos['days_on_sale'] = days_on_sale.astype(int)
mean_days_on_sale = {}
for i in popular_brands:
    mean_days_on_sale[i] = round(autos[autos['brand'] == i]['days_on_sale'].mean(), 1)
# check results
print(mean_days_on_sale)
{'volkswagen': 8.6, 'bmw': 9.2, 'opel': 8.2, 'mercedes_benz': 9.1, 'audi': 9.3, 'ford': 8.4, 'renault': 8.4, 'peugeot': 8.7, 'fiat': 8.9, 'seat': 9.1, 'skoda': 9.2, 'mazda': 8.9, 'smart': 9.1, 'citroen': 9.0, 'nissan': 8.6, 'toyota': 8.8, 'sonstige_autos': 9.9}
# Assemble the output table with one column per variable of interest
popular_brands_table = pd.DataFrame({
    'Price': pd.Series(mean_prices),
    'Mileage (km)': pd.Series(mean_mileage),
    'Days on Sale': pd.Series(mean_days_on_sale),
    'Registration Year': pd.Series(mean_age),
    'Horse Power': pd.Series(mean_power),
})
# check output table with averages
popular_brands_table
Price | Mileage (km) | Days on Sale | Registration Year | Horse Power | |
---|---|---|---|---|---|
volkswagen | 5231 | 128338 | 8.6 | 2002 | 98 |
bmw | 8224 | 132666 | 9.2 | 2002 | 166 |
opel | 2870 | 128756 | 8.2 | 2002 | 92 |
mercedes_benz | 8387 | 130585 | 9.1 | 2001 | 154 |
audi | 8849 | 129498 | 9.3 | 2003 | 157 |
ford | 3595 | 123617 | 8.4 | 2002 | 96 |
renault | 2366 | 127875 | 8.4 | 2002 | 79 |
peugeot | 3206 | 124599 | 8.7 | 2003 | 91 |
fiat | 2803 | 116523 | 8.9 | 2002 | 72 |
seat | 4420 | 120916 | 9.1 | 2004 | 92 |
skoda | 6460 | 113720 | 9.2 | 2006 | 100 |
mazda | 3970 | 125945 | 8.9 | 2002 | 106 |
smart | 3586 | 99684 | 9.1 | 2005 | 61 |
citroen | 3649 | 120313 | 9.0 | 2003 | 88 |
nissan | 4585 | 119543 | 8.6 | 2003 | 95 |
toyota | 5263 | 117337 | 8.8 | 2004 | 102 |
sonstige_autos | 12456 | 86854 | 9.9 | 1993 | 125 |
Apart from the catch-all 'sonstige_autos' ('other cars') category, the table above shows that the German premium brands (BMW, Mercedes-Benz, Audi) command the highest average prices and horsepower, while Opel and Renault are the cheapest; the average time on sale differs only slightly across brands, from 8.2 to 9.9 days.
Let's check next which models of the most popular brands are sold the most.
# Calculate shares of different models within the car brand
volkswagen = (autos[autos['brand'] == 'volkswagen']['model']
              .value_counts(normalize=True)  # NaNs dropped and values sorted by default
              .to_frame('Volkswagen'))
volkswagen[:5]
Volkswagen | |
---|---|
golf | 0.391313 |
polo | 0.171710 |
passat | 0.137679 |
transporter | 0.073857 |
touran | 0.045843 |
# Calculate shares of different models within the car brand
bmw = (autos[autos['brand'] == 'bmw']['model']
       .value_counts(normalize=True)
       .to_frame('BMW'))
bmw[:5]
BMW | |
---|---|
3er | 0.535301 |
5er | 0.224450 |
1er | 0.099955 |
x_reihe | 0.060711 |
7er | 0.029567 |
# Calculate shares of different models within the car brand
opel = (autos[autos['brand'] == 'opel']['model']
        .value_counts(normalize=True)
        .to_frame('Opel'))
opel[:5]
Opel | |
---|---|
corsa | 0.326900 |
astra | 0.283509 |
vectra | 0.112161 |
zafira | 0.078214 |
omega | 0.037697 |
# Calculate shares of different models within the car brand
mercedes_benz = (autos[autos['brand'] == 'mercedes_benz']['model']
                 .value_counts(normalize=True)
                 .to_frame('Mercedes-Benz'))
mercedes_benz[:5]
Mercedes-Benz | |
---|---|
c_klasse | 0.258783 |
e_klasse | 0.223711 |
a_klasse | 0.121906 |
andere | 0.099088 |
clk | 0.054389 |
# Calculate shares of different models within the car brand
audi = (autos[autos['brand'] == 'audi']['model']
        .value_counts(normalize=True)
        .to_frame('Audi'))
audi[:5]
Audi | |
---|---|
a4 | 0.327256 |
a3 | 0.203204 |
a6 | 0.192134 |
andere | 0.051184 |
80 | 0.050760 |
# Calculate shares of different models within the car brand
ford = (autos[autos['brand'] == 'ford']['model']
        .value_counts(normalize=True)
        .to_frame('Ford'))
ford[:5]
Ford | |
---|---|
focus | 0.241287 |
fiesta | 0.234774 |
mondeo | 0.149129 |
ka | 0.109964 |
andere | 0.062553 |
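The six near-identical blocks above invite a small helper; a sketch with a hypothetical `model_shares` function, using toy data in place of the real 'autos' frame:

```python
import pandas as pd

# Sketch of a reusable helper for the repeated model-share blocks.
# model_shares is a hypothetical name; toy data stands in for 'autos'.
def model_shares(df, brand, label, top=5):
    shares = df.loc[df['brand'] == brand, 'model'].dropna().value_counts(normalize=True)
    return shares.head(top).to_frame(label)

toy = pd.DataFrame({
    'brand': ['volkswagen', 'volkswagen', 'volkswagen', 'bmw'],
    'model': ['golf', 'golf', 'polo', '3er'],
})
vw_table = model_shares(toy, 'volkswagen', 'Volkswagen')
print(vw_table)
```

On the real frame, `model_shares(autos, 'volkswagen', 'Volkswagen')` would reproduce the Volkswagen table above in one call.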
The tables above show that each of the considered brands has one or two dominant models: the Volkswagen Golf, BMW 3 Series, Opel Corsa, Mercedes-Benz C-Class, Audi A4 and Ford Focus are traded the most.
In this project we attempted to figure out which types of cars are most likely to be sold and at what average price level. Using the dataset of used-car listings from the German eBay classifieds section, we found that Volkswagen, Ford and Opel are very popular brands that can be bought and sold most easily, judging by their average time on sale. Their prices are generally lower than those of German premium brands such as BMW, Audi and Mercedes-Benz, even though their mileage is also slightly lower on average. This might be explained by the higher reliability, horsepower and prestige of the premium group.
All in all, we would suggest two major options to a car dealer operating in Germany. The first is to focus on cheaper models that sell the fastest: Volkswagen Golf, Opel Corsa, Opel Astra, Ford Focus and Ford Fiesta. The second is to deal in premium brands, whose prices appear less affected by mileage and possibly other negative factors: BMW 3 Series, Mercedes-Benz C-Class, Mercedes-Benz E-Class and Audi A4. However, it should be kept in mind that they take slightly longer (about a day more on average) to be sold.
The information on car model turnover might be of interest to other stakeholders as well. For example, businesses selling auto parts and services could use knowledge of which cars are driven the most to better meet drivers' demand. Car buyers and sellers who want to know the market pricing of various models may also use the observed average prices of sold cars as a benchmark for their own transactions.