Let's open our dataset and explore its size, along with the names and types of the major variables.
# Import pandas and numpy packages
import pandas as pd
import numpy as np
# Open the original dataset with pandas package
autos = pd.read_csv('/Users/mac/downloads/autos.csv',encoding = 'Latin-1')
autos.info() # get info about variables' types and the size of our dataset
autos.head() # inspect first five rows
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column               Non-Null Count   Dtype 
---  ------               --------------   ----- 
 0   dateCrawled          371528 non-null  object
 1   name                 371528 non-null  object
 2   seller               371528 non-null  object
 3   offerType            371528 non-null  object
 4   price                371528 non-null  int64 
 5   abtest               371528 non-null  object
 6   vehicleType          333659 non-null  object
 7   yearOfRegistration   371528 non-null  int64 
 8   gearbox              351319 non-null  object
 9   powerPS              371528 non-null  int64 
 10  model                351044 non-null  object
 11  kilometer            371528 non-null  int64 
 12  monthOfRegistration  371528 non-null  int64 
 13  fuelType             338142 non-null  object
 14  brand                371528 non-null  object
 15  notRepairedDamage    299468 non-null  object
 16  dateCreated          371528 non-null  object
 17  nrOfPictures         371528 non-null  int64 
 18  postalCode           371528 non-null  int64 
 19  lastSeen             371528 non-null  object
dtypes: int64(7), object(13)
memory usage: 56.7+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | kilometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-24 11:52:17 | Golf_3_1.6 | privat | Angebot | 480 | test | NaN | 1993 | manuell | 0 | golf | 150000 | 0 | benzin | volkswagen | NaN | 2016-03-24 00:00:00 | 0 | 70435 | 2016-04-07 03:16:57 |
1 | 2016-03-24 10:58:45 | A5_Sportback_2.7_Tdi | privat | Angebot | 18300 | test | coupe | 2011 | manuell | 190 | NaN | 125000 | 5 | diesel | audi | ja | 2016-03-24 00:00:00 | 0 | 66954 | 2016-04-07 01:46:50 |
2 | 2016-03-14 12:52:21 | Jeep_Grand_Cherokee_"Overland" | privat | Angebot | 9800 | test | suv | 2004 | automatik | 163 | grand | 125000 | 8 | diesel | jeep | NaN | 2016-03-14 00:00:00 | 0 | 90480 | 2016-04-05 12:47:46 |
3 | 2016-03-17 16:54:04 | GOLF_4_1_4__3TÜRER | privat | Angebot | 1500 | test | kleinwagen | 2001 | manuell | 75 | golf | 150000 | 6 | benzin | volkswagen | nein | 2016-03-17 00:00:00 | 0 | 91074 | 2016-03-17 17:40:17 |
4 | 2016-03-31 17:25:20 | Skoda_Fabia_1.4_TDI_PD_Classic | privat | Angebot | 3600 | test | kleinwagen | 2008 | manuell | 69 | fabia | 90000 | 7 | diesel | skoda | nein | 2016-03-31 00:00:00 | 0 | 60437 | 2016-04-06 10:17:21 |
From the tables above we can infer the following about our dataset: the variables of major importance for our further investigation are price, car model, car brand, kilometer (mileage), the date a classified was created and the date it was last seen. For convenience, let's convert the camelCase column names to snake_case.
# Create a function to turn camelCase names into snake_case
def change_case(name):
    return ''.join(['_' + c.lower() if c.isupper()
                    else c for c in name]).lstrip('_')
# Apply the function to the column names of our dataset
column_names = []  # list with snake_case names
for column_name in autos.columns:
    column_names.append(change_case(column_name))
autos.columns = column_names  # rename columns
# Check the results
autos.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 371528 entries, 0 to 371527
Data columns (total 20 columns):
 #   Column                 Non-Null Count   Dtype 
---  ------                 --------------   ----- 
 0   date_crawled           371528 non-null  object
 1   name                   371528 non-null  object
 2   seller                 371528 non-null  object
 3   offer_type             371528 non-null  object
 4   price                  371528 non-null  int64 
 5   abtest                 371528 non-null  object
 6   vehicle_type           333659 non-null  object
 7   year_of_registration   371528 non-null  int64 
 8   gearbox                351319 non-null  object
 9   power_p_s              371528 non-null  int64 
 10  model                  351044 non-null  object
 11  kilometer              371528 non-null  int64 
 12  month_of_registration  371528 non-null  int64 
 13  fuel_type              338142 non-null  object
 14  brand                  371528 non-null  object
 15  not_repaired_damage    299468 non-null  object
 16  date_created           371528 non-null  object
 17  nr_of_pictures         371528 non-null  int64 
 18  postal_code            371528 non-null  int64 
 19  last_seen              371528 non-null  object
dtypes: int64(7), object(13)
memory usage: 56.7+ MB
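As a quick sanity check, the conversion logic of the helper defined above can be exercised in isolation on a few sample column names (a standalone sketch re-defining the same function):

```python
# Standalone sanity check of the camelCase -> snake_case helper defined above.
def change_case(name):
    return ''.join(['_' + c.lower() if c.isupper() else c for c in name]).lstrip('_')

examples = ['dateCrawled', 'yearOfRegistration', 'powerPS', 'price']
converted = [change_case(e) for e in examples]
print(converted)  # ['date_crawled', 'year_of_registration', 'power_p_s', 'price']
```

Note that consecutive capitals each get their own underscore, which is why 'powerPS' becomes 'power_p_s' in the renamed columns.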
Next, let's look at descriptive statistics for our major variables.
# Get descriptive statistics for all variables
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | year_of_registration | gearbox | power_p_s | model | kilometer | month_of_registration | fuel_type | brand | not_repaired_damage | date_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 371528 | 371528 | 371528 | 371528 | 3.715280e+05 | 371528 | 333659 | 371528.000000 | 351319 | 371528.000000 | 351044 | 371528.000000 | 371528.000000 | 338142 | 371528 | 299468 | 371528 | 371528.0 | 371528.00000 | 371528 |
unique | 280500 | 233531 | 2 | 2 | NaN | 2 | 8 | NaN | 2 | NaN | 251 | NaN | NaN | 7 | 40 | 2 | 114 | NaN | NaN | 182806 |
top | 2016-03-24 14:49:47 | Ford_Fiesta | privat | Angebot | NaN | test | limousine | NaN | manuell | NaN | golf | NaN | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-06 13:45:54 |
freq | 7 | 657 | 371525 | 371516 | NaN | 192585 | 95894 | NaN | 274214 | NaN | 30070 | NaN | NaN | 223857 | 79640 | 263182 | 14450 | NaN | NaN | 17 |
mean | NaN | NaN | NaN | NaN | 1.729514e+04 | NaN | NaN | 2004.577997 | NaN | 115.549477 | NaN | 125618.688228 | 5.734445 | NaN | NaN | NaN | NaN | 0.0 | 50820.66764 | NaN |
std | NaN | NaN | NaN | NaN | 3.587954e+06 | NaN | NaN | 92.866598 | NaN | 192.139578 | NaN | 40112.337051 | 3.712412 | NaN | NaN | NaN | NaN | 0.0 | 25799.08247 | NaN |
min | NaN | NaN | NaN | NaN | 0.000000e+00 | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | 5000.000000 | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.00000 | NaN |
25% | NaN | NaN | NaN | NaN | 1.150000e+03 | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | 125000.000000 | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30459.00000 | NaN |
50% | NaN | NaN | NaN | NaN | 2.950000e+03 | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | 150000.000000 | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49610.00000 | NaN |
75% | NaN | NaN | NaN | NaN | 7.200000e+03 | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | 150000.000000 | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71546.00000 | NaN |
max | NaN | NaN | NaN | NaN | 2.147484e+09 | NaN | NaN | 9999.000000 | NaN | 20000.000000 | NaN | 150000.000000 | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.00000 | NaN |
The summary table above already reveals several anomalies worth checking, starting with the 'nr_of_pictures' column, whose mean and standard deviation are both zero.
# Check what values are contained in the 'nr_of_pictures' column
autos['nr_of_pictures'].value_counts()
0    371528
Name: nr_of_pictures, dtype: int64
The output shows only zeros, meaning no pictures are recorded in the dataset, so this column is useless for further analysis. We drop it along with the 'seller' and 'offer_type' columns, which are near-constant (371525 of 371528 listings are 'privat', and 371516 are 'Angebot') and therefore irrelevant.
# Drop columns with identical values
autos = autos.drop(['seller','offer_type','nr_of_pictures'],axis=1)
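The near-useless columns were identified by eye here; as a sketch, `nunique` can find truly constant columns programmatically (toy frame for illustration, not the real 'autos' data):

```python
import pandas as pd

# Sketch: constant columns carry no information and can be found with nunique.
# Toy data for illustration only.
toy = pd.DataFrame({
    'price': [480, 18300, 9800],
    'seller': ['privat', 'privat', 'privat'],
    'nr_of_pictures': [0, 0, 0],
})
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]
toy = toy.drop(columns=constant_cols)
print(constant_cols, list(toy.columns))
```

In the real data 'seller' and 'offer_type' are not perfectly constant, so a small tolerance (or a manual check as above) is still needed.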
We proceed by checking the 'kilometer' and 'price' columns, which are of particular interest to us.
# show the distribution of mileage and prices
print(autos['kilometer'].value_counts())
print(autos['price'].value_counts().sort_index(ascending=True).head(10))
print(autos['price'].value_counts().sort_index(ascending=False).head(10))
print(autos['price'].describe())
150000    240797
125000     38067
100000     15920
90000      12523
80000      11053
70000       9773
60000       8669
50000       7615
5000        7069
40000       6376
30000       6041
20000       5676
10000       1949
Name: kilometer, dtype: int64

0     10778
1      1189
2        12
3         8
4         1
5        26
7         3
8         9
9         8
10       84
Name: price, dtype: int64

2147483647     1
99999999      15
99000000       1
74185296       1
32545461       1
27322222       1
14000500       1
12345678       9
11111111      10
10010011       1
Name: price, dtype: int64

count    3.715280e+05
mean     1.729514e+04
std      3.587954e+06
min      0.000000e+00
25%      1.150000e+03
50%      2.950000e+03
75%      7.200000e+03
max      2.147484e+09
Name: price, dtype: float64
The 'kilometer' column doesn't appear to contain outliers; its values are grouped into rounded ranges, so the number of unique values is small. We may also conclude that the majority of cars in our dataset have quite high mileage, i.e. they are most likely fairly old.
However, there are outliers in the 'price' column: we observe both extremely high and extremely low values. While extremely low values are plausible as starting prices in car sale auctions, extremely high values are most likely data-entry mistakes. Therefore, we get rid of them by restricting prices to the plausible range of 0 - 350000.
# Keep only rows with price values within the 0 - 350000 range
# using a boolean mask
autos = autos[autos['price'].between(0, 350000)]
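The filtering relies on `Series.between`, which is inclusive on both ends; a minimal sketch on toy prices:

```python
import pandas as pd

# Minimal sketch of Series.between on toy prices: the range is inclusive on
# both ends, so 0 survives while the implausible extremes fall away.
prices = pd.Series([0, 1150, 2950, 99999999, 2147483647])
filtered = prices[prices.between(0, 350000)]
print(filtered.tolist())  # [0, 1150, 2950]
```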
Next, let's explore the date variables in our dataset.
# Show the distribution of time points when classifieds were last seen on the eBay website
print(autos['last_seen'].value_counts(normalize=True,dropna=False).sort_values(ascending=False).head(10))
print(autos['last_seen'].value_counts(normalize=True,dropna=False).sort_values(ascending=True).head(10))
2016-04-06 13:45:54    0.000046
2016-04-07 06:45:59    0.000046
2016-04-06 01:15:23    0.000043
2016-04-07 00:45:17    0.000043
2016-04-06 09:17:58    0.000043
2016-04-07 05:44:30    0.000043
2016-04-06 07:45:46    0.000043
2016-04-07 10:46:23    0.000043
2016-04-07 09:46:25    0.000043
2016-04-06 09:46:00    0.000043
Name: last_seen, dtype: float64

2016-03-12 10:37:56    0.000003
2016-03-16 22:16:37    0.000003
2016-03-09 20:47:38    0.000003
2016-03-30 23:46:29    0.000003
2016-03-25 22:45:53    0.000003
2016-04-04 10:16:04    0.000003
2016-03-11 06:41:34    0.000003
2016-03-23 19:57:37    0.000003
2016-03-20 06:44:24    0.000003
2016-03-28 07:03:20    0.000003
Name: last_seen, dtype: float64
# Show descriptive statistics
print(autos['last_seen'].describe())
count                  371413
unique                 182756
top       2016-04-06 13:45:54
freq                       17
Name: last_seen, dtype: object
From the results above we can see that the distribution of last-seen times is fairly uniform over the date range in our dataset. Even though some days appear comparatively often, the share of unique values is high enough to assume there is no particular time at which the bulk of car sales occurred. It is also worth noting that all observations in the 'last_seen' column fall within two months of 2016, March and April, which is the date range covered by our dataset.
# Show the distribution of time points when classifieds were created
print(autos['date_created'].value_counts(normalize=True,dropna=False).sort_values(ascending=False).head(10))
print(autos['date_created'].value_counts(normalize=True,dropna=False).sort_values(ascending=True).head(10))
2016-04-03 00:00:00    0.038887
2016-04-04 00:00:00    0.037745
2016-03-20 00:00:00    0.036463
2016-03-12 00:00:00    0.036011
2016-03-21 00:00:00    0.035798
2016-03-14 00:00:00    0.035222
2016-03-28 00:00:00    0.035179
2016-04-02 00:00:00    0.034966
2016-03-07 00:00:00    0.034517
2016-03-19 00:00:00    0.034326
Name: date_created, dtype: float64

2015-09-09 00:00:00    0.000003
2015-11-12 00:00:00    0.000003
2015-06-18 00:00:00    0.000003
2016-01-15 00:00:00    0.000003
2016-01-06 00:00:00    0.000003
2015-10-14 00:00:00    0.000003
2015-08-10 00:00:00    0.000003
2015-11-13 00:00:00    0.000003
2015-11-02 00:00:00    0.000003
2015-11-10 00:00:00    0.000003
Name: date_created, dtype: float64
# Show descriptive statistics
print(autos['date_created'].describe())
count                  371413
unique                    114
top       2016-04-03 00:00:00
freq                    14443
Name: date_created, dtype: object
A slightly different situation can be observed in the dates when advertisements were created. Most of them were created on certain days in March and April 2016, and a far smaller share of classifieds was created before 2016. Therefore, we may conclude that the bulk of the observations in this column also occurred in spring 2016.
# Show the distribution of years when cars in the dataset were registered
print(autos['year_of_registration'].value_counts(normalize=True,dropna=False).sort_values(ascending=False).head(10))
print(autos['year_of_registration'].value_counts(normalize=True,dropna=False).sort_values(ascending=True).head(10))
2000    0.066072
1999    0.061290
2005    0.060073
2006    0.054457
2001    0.054425
2003    0.053498
2004    0.053156
2002    0.051659
1998    0.048324
2007    0.047580
Name: year_of_registration, dtype: float64

8455    0.000003
3700    0.000003
2290    0.000003
8200    0.000003
8500    0.000003
9229    0.000003
6500    0.000003
2066    0.000003
5900    0.000003
1039    0.000003
Name: year_of_registration, dtype: float64
# Show descriptive statistics
print(autos['year_of_registration'].describe())
count    371413.000000
mean       2004.561152
std          91.937676
min        1000.000000
25%        1999.000000
50%        2003.000000
75%        2008.000000
max        9999.000000
Name: year_of_registration, dtype: float64
The registration-year statistics above contain implausible values: a minimum of 1000 and a maximum of 9999. We should therefore remove these obviously wrong rows from our dataset. We set the plausible range to 1900 - 2016, since all the listings were created by 2016.
# Check the share of wrong rows in the total
print('Percentage share of wrong rows:', (~autos['year_of_registration'].between(1900, 2016)).sum() / autos.shape[0] * 100, '(%)')
# Keep only rows with registration years within the 1900 - 2016 range
# using a boolean mask
autos = autos[autos['year_of_registration'].between(1900, 2016)]
Percentage share of wrong rows: 3.9683586734982352 (%)
Only about 4% of the rows are obviously incorrect, so removing them does not meaningfully constrain our analysis.
Let's now look at the distribution of the remaining registration years.
# Calculate percentage share distribution of years in our dataset
autos['year_of_registration'].value_counts(normalize=True)
2000    0.068802
1999    0.063823
2005    0.062556
2006    0.056707
2001    0.056674
          ...   
1925    0.000003
1920    0.000003
1919    0.000003
1915    0.000003
1911    0.000003
Name: year_of_registration, Length: 97, dtype: float64
From the results above we can observe that the most common registration years in our dataset fall in the 1990s or later. Years from the beginning of the 20th century, on the other hand, are rare, which nonetheless shows that even very old cars are occasionally sold through eBay.
Let's explore further the variety of car brands sold, as recorded in our dataset.
# Calculate shares of unique values in the 'brand' column of our dataset
autos['brand'].value_counts(normalize=True)
volkswagen        0.212396
bmw               0.109697
opel              0.107092
mercedes_benz     0.095964
audi              0.089356
ford              0.068856
renault           0.047573
peugeot           0.029868
fiat              0.025763
seat              0.018636
skoda             0.015412
mazda             0.015350
smart             0.014108
citroen           0.013875
nissan            0.013573
toyota            0.012751
sonstige_autos    0.010584
hyundai           0.009833
mini              0.009213
volvo             0.009132
mitsubishi        0.008268
honda             0.007590
kia               0.006880
alfa_romeo        0.006348
suzuki            0.006317
porsche           0.006101
chevrolet         0.005013
chrysler          0.003942
dacia             0.002450
daihatsu          0.002187
jeep              0.002184
land_rover        0.002136
subaru            0.002125
jaguar            0.001719
trabant           0.001618
saab              0.001452
daewoo            0.001438
rover             0.001298
lancia            0.001292
lada              0.000611
Name: brand, dtype: float64
It is not surprising that German cars constitute the majority of brands, with Volkswagen the clear leader of the list at more than 21% of the total turnover. It is followed by the three major German premium brands and Opel, with roughly equal shares of around 10% each.
Let's consider further only car brands that comprise at least 1% of all sales in our dataset. Brands below this threshold are dropped from further analysis as irrelevant.
# Create a series containing shares of car brands in total turnover
brands = autos['brand'].value_counts(normalize=True)
popular_brands = brands[brands > 0.01].index  # keep only the index, i.e. the list
                                              # of brands we are interested in
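The thresholding pattern can be sketched on toy data to make the `value_counts(normalize=True)` plus boolean-indexing combination explicit:

```python
import pandas as pd

# Sketch of the thresholding pattern on toy data: value_counts(normalize=True)
# returns shares summing to 1, and boolean indexing keeps brands above 1%.
toy_brands = pd.Series(['volkswagen'] * 60 + ['bmw'] * 39 + ['lada'] * 1)
shares = toy_brands.value_counts(normalize=True)
popular = shares[shares > 0.01].index.tolist()
print(popular)  # 'lada' sits exactly at 1% and is excluded by the strict >
```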
Now that we have the list of popular car brands being sold in Germany, it is possible to calculate their mean prices, mileage, horsepower, year of registration and the average number of days these car brands were on sale - i.e. how fast they were sold.
# Create a dictionary containing average prices for the popular car brands
mean_prices = {}
for i in popular_brands:
    mean_prices[i] = int(autos[autos['brand'] == i]['price'].mean())
# check results
print(mean_prices)
{'volkswagen': 5231, 'bmw': 8224, 'opel': 2870, 'mercedes_benz': 8387, 'audi': 8849, 'ford': 3595, 'renault': 2366, 'peugeot': 3206, 'fiat': 2803, 'seat': 4420, 'skoda': 6460, 'mazda': 3970, 'smart': 3586, 'citroen': 3649, 'nissan': 4585, 'toyota': 5263, 'sonstige_autos': 12456}
# Repeat the same procedure for average mileage
mean_mileage = {}
for i in popular_brands:
    mean_mileage[i] = int(autos[autos['brand'] == i]['kilometer'].mean())
print(mean_mileage)
{'volkswagen': 128338, 'bmw': 132666, 'opel': 128756, 'mercedes_benz': 130585, 'audi': 129498, 'ford': 123617, 'renault': 127875, 'peugeot': 124599, 'fiat': 116523, 'seat': 120916, 'skoda': 113720, 'mazda': 125945, 'smart': 99684, 'citroen': 120313, 'nissan': 119543, 'toyota': 117337, 'sonstige_autos': 86854}
# Repeat the same procedure for the cars' average registration year
mean_age = {}
for i in popular_brands:
    mean_age[i] = int(autos[autos['brand'] == i]['year_of_registration'].mean())
print(mean_age)
{'volkswagen': 2002, 'bmw': 2002, 'opel': 2002, 'mercedes_benz': 2001, 'audi': 2003, 'ford': 2002, 'renault': 2002, 'peugeot': 2003, 'fiat': 2002, 'seat': 2004, 'skoda': 2006, 'mazda': 2002, 'smart': 2005, 'citroen': 2003, 'nissan': 2003, 'toyota': 2004, 'sonstige_autos': 1993}
# Repeat the same procedure for the cars' average horsepower
mean_power = {}
for i in popular_brands:
    mean_power[i] = int(autos[autos['brand'] == i]['power_p_s'].mean())
print(mean_power)
{'volkswagen': 98, 'bmw': 166, 'opel': 92, 'mercedes_benz': 154, 'audi': 157, 'ford': 96, 'renault': 79, 'peugeot': 91, 'fiat': 72, 'seat': 92, 'skoda': 100, 'mazda': 106, 'smart': 61, 'citroen': 88, 'nissan': 95, 'toyota': 102, 'sonstige_autos': 125}
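As a design note, the four dictionary-building loops above could be collapsed into a single `groupby` aggregation; a sketch on toy data (column names mirror the cleaned dataset):

```python
import pandas as pd

# Sketch: per-brand averages via one groupby instead of per-column loops.
# Toy data for illustration only.
toy = pd.DataFrame({
    'brand': ['volkswagen', 'volkswagen', 'bmw', 'bmw'],
    'price': [5000, 6000, 8000, 9000],
    'kilometer': [150000, 125000, 150000, 100000],
})
means = toy.groupby('brand')[['price', 'kilometer']].mean()
print(means)
```

On the real frame, `autos.groupby('brand')[['price', 'kilometer', 'power_p_s', 'year_of_registration']].mean()` would produce all four averages at once, after which the result could be filtered to `popular_brands`.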
Let's now create a new column reflecting how long a car was on sale. For this purpose we use the date each classified was created and the date it was last seen; we treat these two dates as proxies for the start and the end of a sale.
# Turn the two columns of interest into datetime format
last_seen = pd.to_datetime(autos['last_seen'])
date_created = pd.to_datetime(autos['date_created'])
# Drop the time of day; .dt.normalize() keeps the datetime64 dtype
# (unlike .dt.date, which returns plain objects and breaks the .dt accessor below)
last_seen = last_seen.dt.normalize()
date_created = date_created.dt.normalize()
# calculate time on sale
days_on_sale = last_seen - date_created
days_on_sale = days_on_sale.dt.days  # keep only the whole number of days
# get descriptive statistics
days_on_sale.describe()
count    356674.000000
mean          8.870470
std           8.730316
min           0.000000
25%           2.000000
50%           6.000000
75%          14.000000
max         759.000000
dtype: float64
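The date arithmetic can be checked independently of the dataset. In this sketch (timestamps modeled on the first listing), `.dt.normalize()` zeroes the time of day while keeping the datetime64 dtype, so the subtraction yields a timedelta Series whose `.dt.days` gives whole days:

```python
import pandas as pd

# Standalone sketch of the time-on-sale calculation on two single-element Series.
created = pd.to_datetime(pd.Series(['2016-03-24 11:52:17']))
seen = pd.to_datetime(pd.Series(['2016-04-07 03:16:57']))
days = (seen.dt.normalize() - created.dt.normalize()).dt.days
print(days.iloc[0])  # 14 calendar days between 2016-03-24 and 2016-04-07
```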
# Check the distribution of days when classifieds were created
date_created.value_counts(normalize=True).head(10)
2016-04-03    0.038971
2016-04-04    0.037796
2016-03-20    0.036538
2016-03-12    0.036053
2016-03-21    0.035825
2016-03-28    0.035144
2016-03-14    0.035116
2016-04-02    0.034923
2016-03-07    0.034642
2016-03-19    0.034289
Name: date_created, dtype: float64
# Check the distribution of days when classifieds were last seen
last_seen.value_counts(normalize=True).head(10)
2016-04-06    0.219467
2016-04-07    0.130752
2016-04-05    0.126757
2016-03-17    0.028544
2016-04-04    0.025561
2016-04-03    0.025197
2016-04-02    0.024793
2016-03-31    0.024165
2016-04-01    0.023941
2016-03-30    0.023758
Name: last_seen, dtype: float64
From the above results we note that on average it took almost 9 days for a car to be sold, and around 75% of cars in the dataset were sold within 14 days, as measured by how long the respective advertisements were public. It is also important to note that almost half of the last-seen observations occurred on just three days: 5, 6 and 7 April 2016. This is likely an artifact of when the data collection ended rather than of actual sales, so these dates are at best rough approximations of the time cars were sold. The distribution of creation dates, on the other hand, is uniform and doesn't present a similar conundrum at first sight.
To proceed further we should calculate the average values of most important variables for each car brand.
# Create a dictionary containing the cars' average number of days on sale
autos['days_on_sale'] = days_on_sale.astype(int)
mean_days_on_sale = {}
for i in popular_brands:
    mean_days_on_sale[i] = round(autos[autos['brand'] == i]['days_on_sale'].mean(), 1)
# check results
print(mean_days_on_sale)
{'volkswagen': 8.6, 'bmw': 9.2, 'opel': 8.2, 'mercedes_benz': 9.1, 'audi': 9.3, 'ford': 8.4, 'renault': 8.4, 'peugeot': 8.7, 'fiat': 8.9, 'seat': 9.1, 'skoda': 9.2, 'mazda': 8.9, 'smart': 9.1, 'citroen': 9.0, 'nissan': 8.6, 'toyota': 8.8, 'sonstige_autos': 9.9}
# Assemble the output table with one column per variable of interest
popular_brands_table = pd.DataFrame({
    'Price': pd.Series(mean_prices),
    'Mileage (km)': pd.Series(mean_mileage),
    'Days on Sale': pd.Series(mean_days_on_sale),
    'Registration Year': pd.Series(mean_age),
    'Horse Power': pd.Series(mean_power),
})
# check output table with averages
popular_brands_table
Price | Mileage (km) | Days on Sale | Registration Year | Horse Power | |
---|---|---|---|---|---|
volkswagen | 5231 | 128338 | 8.6 | 2002 | 98 |
bmw | 8224 | 132666 | 9.2 | 2002 | 166 |
opel | 2870 | 128756 | 8.2 | 2002 | 92 |
mercedes_benz | 8387 | 130585 | 9.1 | 2001 | 154 |
audi | 8849 | 129498 | 9.3 | 2003 | 157 |
ford | 3595 | 123617 | 8.4 | 2002 | 96 |
renault | 2366 | 127875 | 8.4 | 2002 | 79 |
peugeot | 3206 | 124599 | 8.7 | 2003 | 91 |
fiat | 2803 | 116523 | 8.9 | 2002 | 72 |
seat | 4420 | 120916 | 9.1 | 2004 | 92 |
skoda | 6460 | 113720 | 9.2 | 2006 | 100 |
mazda | 3970 | 125945 | 8.9 | 2002 | 106 |
smart | 3586 | 99684 | 9.1 | 2005 | 61 |
citroen | 3649 | 120313 | 9.0 | 2003 | 88 |
nissan | 4585 | 119543 | 8.6 | 2003 | 95 |
toyota | 5263 | 117337 | 8.8 | 2004 | 102 |
sonstige_autos | 12456 | 86854 | 9.9 | 1993 | 125 |
Apart from the catch-all 'sonstige_autos' ('other cars') category, the table above shows that the German premium brands (BMW, Mercedes-Benz, Audi) command the highest average prices and horsepower, while Opel and Renault are the cheapest; the average time on sale differs only slightly across brands, from 8.2 to 9.9 days.
Let's check next which models of the most popular brands are sold the most.
# Calculate shares of different models within the car brand
volkswagen = (autos[autos['brand'] == 'volkswagen']['model']
              .value_counts(normalize=True)  # NaNs dropped and values sorted by default
              .to_frame('Volkswagen'))
volkswagen[:5]
Volkswagen | |
---|---|
golf | 0.391313 |
polo | 0.171710 |
passat | 0.137679 |
transporter | 0.073857 |
touran | 0.045843 |
# Calculate shares of different models within the car brand
bmw = (autos[autos['brand'] == 'bmw']['model']
       .value_counts(normalize=True)
       .to_frame('BMW'))
bmw[:5]
BMW | |
---|---|
3er | 0.535301 |
5er | 0.224450 |
1er | 0.099955 |
x_reihe | 0.060711 |
7er | 0.029567 |
# Calculate shares of different models within the car brand
opel = (autos[autos['brand'] == 'opel']['model']
        .value_counts(normalize=True)
        .to_frame('Opel'))
opel[:5]
Opel | |
---|---|
corsa | 0.326900 |
astra | 0.283509 |
vectra | 0.112161 |
zafira | 0.078214 |
omega | 0.037697 |
# Calculate shares of different models within the car brand
mercedes_benz = (autos[autos['brand'] == 'mercedes_benz']['model']
                 .value_counts(normalize=True)
                 .to_frame('Mercedes-Benz'))
mercedes_benz[:5]
Mercedes-Benz | |
---|---|
c_klasse | 0.258783 |
e_klasse | 0.223711 |
a_klasse | 0.121906 |
andere | 0.099088 |
clk | 0.054389 |
# Calculate shares of different models within the car brand
audi = (autos[autos['brand'] == 'audi']['model']
        .value_counts(normalize=True)
        .to_frame('Audi'))
audi[:5]
Audi | |
---|---|
a4 | 0.327256 |
a3 | 0.203204 |
a6 | 0.192134 |
andere | 0.051184 |
80 | 0.050760 |
# Calculate shares of different models within the car brand
ford = (autos[autos['brand'] == 'ford']['model']
        .value_counts(normalize=True)
        .to_frame('Ford'))
ford[:5]
Ford | |
---|---|
focus | 0.241287 |
fiesta | 0.234774 |
mondeo | 0.149129 |
ka | 0.109964 |
andere | 0.062553 |
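The six near-identical blocks above invite a small helper; a sketch with a hypothetical `model_shares` function, using toy data in place of the real 'autos' frame:

```python
import pandas as pd

# Sketch of a reusable helper for the repeated model-share blocks.
# model_shares is a hypothetical name; toy data stands in for 'autos'.
def model_shares(df, brand, label, top=5):
    shares = df.loc[df['brand'] == brand, 'model'].dropna().value_counts(normalize=True)
    return shares.head(top).to_frame(label)

toy = pd.DataFrame({
    'brand': ['volkswagen', 'volkswagen', 'volkswagen', 'bmw'],
    'model': ['golf', 'golf', 'polo', '3er'],
})
vw_table = model_shares(toy, 'volkswagen', 'Volkswagen')
print(vw_table)
```

On the real frame, `model_shares(autos, 'volkswagen', 'Volkswagen')` would reproduce the Volkswagen table above in one call.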
The tables above show that each of the considered brands has one or two dominant models: the Volkswagen Golf, BMW 3 Series, Opel Corsa, Mercedes-Benz C-Class, Audi A4 and Ford Focus are traded the most.
In this project we attempted to figure out which types of cars are most likely to be sold and at what average price level. Using the dataset of used-car listings from the German eBay classifieds section, we found that Volkswagen, Ford and Opel are very popular brands that can be bought and sold most easily, judging by their average time on sale. Their prices are generally lower than those of German premium brands such as BMW, Audi and Mercedes-Benz, even though their mileage is also slightly lower on average. This might be explained by the higher reliability, horsepower and prestige of the premium group.
All in all, we would suggest two major options to a car dealer operating in Germany. The first is to focus on cheaper models that sell the fastest: Volkswagen Golf, Opel Corsa, Opel Astra, Ford Focus and Ford Fiesta. The second is to deal in premium brands, whose prices appear less affected by mileage and possibly other negative factors: BMW 3 Series, Mercedes-Benz C-Class, Mercedes-Benz E-Class and Audi A4. However, it should be kept in mind that they take slightly longer (about a day more on average) to be sold.
The information on car model turnover might be of interest to other stakeholders as well. For example, businesses selling auto parts and services could use knowledge of which cars are driven the most to better meet drivers' demand. Car buyers and sellers who want to know the market pricing of various models may also use the observed average prices of sold cars as a benchmark for their own transactions.