In this project I work with a dataset containing information on eBay used car listings in Germany. I will explore the dataset, find the most popular car brands, and look for patterns relating each popular brand's average price to its other characteristics: average distance covered, presence of unrepaired damage, and average age of the cars. In other words, I will analyse whether the average price of a brand moves together with the brand's other average characteristics. This should also help to understand whether common expectations of buyers and/or sellers about used car prices can be trusted.
You can download the dataset here.
Pandas is the main module I use for the analysis. Besides that, I imported NumPy and defined a print_2n function, which prints 2 empty lines before each printed object. Numbers in tables will be printed with 3 digits after the decimal point.
import pandas as pd
import numpy as np
pd.set_option('display.float_format', lambda x: '%.3f' % x)
def print_2n(*args):
    for arg in args:
        print("\n" * 2, arg)
autos = pd.read_csv('autos.csv', encoding="Latin-1")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26.3.16 17:47 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 26.3.16 00:00 | 0 | 79588 | 6.4.16 06:45 |
1 | 4.4.16 13:38 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 4.4.16 00:00 | 0 | 71034 | 6.4.16 14:45 |
2 | 26.3.16 18:57 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 26.3.16 00:00 | 0 | 35394 | 6.4.16 20:15 |
3 | 12.3.16 16:58 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 12.3.16 00:00 | 0 | 33729 | 15.3.16 03:16 |
4 | 1.4.16 14:38 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 1.4.16 00:00 | 0 | 39218 | 1.4.16 14:38 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 27.3.16 14:38 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 27.3.16 00:00 | 0 | 82131 | 1.4.16 13:47 |
49996 | 28.3.16 10:50 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 28.3.16 00:00 | 0 | 44807 | 2.4.16 14:18 |
49997 | 2.4.16 14:44 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2.4.16 00:00 | 0 | 73430 | 4.4.16 11:47 |
49998 | 8.3.16 19:25 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 8.3.16 00:00 | 0 | 35683 | 5.4.16 16:45 |
49999 | 14.3.16 00:42 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 13.3.16 00:00 | 0 | 45897 | 6.4.16 21:18 |
50000 rows × 20 columns
As we can see from the output above, there are 50,000 rows and 20 columns in the dataset.
In order to see all columns, their types and the number of non-null values in each column, I use the info() method. In addition, printing the first 5 rows helps to understand what kind of data we are dealing with.
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   dateCrawled          50000 non-null  object
 1   name                 50000 non-null  object
 2   seller               50000 non-null  object
 3   offerType            50000 non-null  object
 4   price                50000 non-null  object
 5   abtest               50000 non-null  object
 6   vehicleType          44905 non-null  object
 7   yearOfRegistration   50000 non-null  int64
 8   gearbox              47320 non-null  object
 9   powerPS              50000 non-null  int64
 10  model                47242 non-null  object
 11  odometer             50000 non-null  object
 12  monthOfRegistration  50000 non-null  int64
 13  fuelType             45518 non-null  object
 14  brand                50000 non-null  object
 15  notRepairedDamage    40171 non-null  object
 16  dateCreated          50000 non-null  object
 17  nrOfPictures         50000 non-null  int64
 18  postalCode           50000 non-null  int64
 19  lastSeen             50000 non-null  object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26.3.16 17:47 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 26.3.16 00:00 | 0 | 79588 | 6.4.16 06:45 |
1 | 4.4.16 13:38 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 4.4.16 00:00 | 0 | 71034 | 6.4.16 14:45 |
2 | 26.3.16 18:57 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 26.3.16 00:00 | 0 | 35394 | 6.4.16 20:15 |
3 | 12.3.16 16:58 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 12.3.16 00:00 | 0 | 33729 | 15.3.16 03:16 |
4 | 1.4.16 14:38 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 1.4.16 00:00 | 0 | 39218 | 1.4.16 14:38 |
Only 5 of the 20 columns are stored as integer types; most columns are stored as strings.
The majority of columns contain no null values. In the few columns that do, the share of null values is less than 20% per column.
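These null shares can be verified directly with isnull().mean(). A minimal sketch on a small stand-in frame (the real autos frame has 50,000 rows; the column names match the dataset, the values here are illustrative):

```python
import pandas as pd

# small stand-in for the full autos frame
sample = pd.DataFrame({
    'vehicleType': ['bus', None, 'kombi', 'limousine'],
    'gearbox': ['manuell', 'automatik', None, 'manuell'],
    'brand': ['peugeot', 'bmw', 'volkswagen', 'smart'],
})

# isnull() gives a boolean frame; the column-wise mean is the null share
null_share = sample.isnull().mean() * 100
print(null_share)
```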
Below I print the column names. As the output shows, the column labels are in camelcase and some are unnecessarily long. I rename them to snakecase and shorten some labels. In addition, I append 'km' to the 'odometer' label and 'usd' to the 'price' label. This unit information ('km' and '$') is currently embedded in every value of the corresponding columns, as you can see in the outputs above, but I will strip it later when converting those columns to a numeric type. Keeping the units in the column labels preserves the information that prices are in US dollars and covered distances are in kilometers.
print(autos.columns)
col_dict = {'dateCrawled': 'date_crawled', 'name': 'name', 'seller': 'seller',
'offerType': 'offer_type', 'price': 'price_usd', 'abtest': 'ab_test',
'vehicleType': 'vehicle_type', 'yearOfRegistration': 'registration_year',
'gearbox': 'gearbox', 'powerPS': 'power_ps', 'model': 'model',
'odometer': 'odometer_km', 'monthOfRegistration': 'registration_month',
'fuelType': 'fuel_type', 'brand': 'brand','notRepairedDamage': 'unrepaired_damage',
'dateCreated': 'ad_created', 'nrOfPictures': 'nr_of_pictures',
'postalCode': 'postal_code','lastSeen': 'last_seen'}
autos.rename(columns=col_dict, inplace=True)
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price_usd', 'ab_test', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
As mentioned above, I will remove the special characters from the 'odometer_km' and 'price_usd' columns and then convert them to numeric types. The main later use of the converted kilometer and price values is the calculation of averages for the most frequently listed brands in the dataset.
# '$' is a regex metacharacter, so regex=False is needed to strip it literally
autos['odometer_km'] = autos['odometer_km'].str.replace('km', '', regex=False)\
                                           .str.replace(',', '', regex=False).astype(float)
autos['price_usd'] = autos['price_usd'].str.replace('$', '', regex=False)\
                                       .str.replace(',', '', regex=False).astype(float)
autos.head(10)
date_crawled | name | seller | offer_type | price_usd | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26.3.16 17:47 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000.000 | control | bus | 2004 | manuell | 158 | andere | 150000.000 | 3 | lpg | peugeot | nein | 26.3.16 00:00 | 0 | 79588 | 6.4.16 06:45 |
1 | 4.4.16 13:38 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500.000 | control | limousine | 1997 | automatik | 286 | 7er | 150000.000 | 6 | benzin | bmw | nein | 4.4.16 00:00 | 0 | 71034 | 6.4.16 14:45 |
2 | 26.3.16 18:57 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990.000 | test | limousine | 2009 | manuell | 102 | golf | 70000.000 | 7 | benzin | volkswagen | nein | 26.3.16 00:00 | 0 | 35394 | 6.4.16 20:15 |
3 | 12.3.16 16:58 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350.000 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70000.000 | 6 | benzin | smart | nein | 12.3.16 00:00 | 0 | 33729 | 15.3.16 03:16 |
4 | 1.4.16 14:38 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350.000 | test | kombi | 2003 | manuell | 0 | focus | 150000.000 | 7 | benzin | ford | nein | 1.4.16 00:00 | 0 | 39218 | 1.4.16 14:38 |
5 | 21.3.16 13:47 | Chrysler_Grand_Voyager_2.8_CRD_Aut.Limited_Sto... | privat | Angebot | 7900.000 | test | bus | 2006 | automatik | 150 | voyager | 150000.000 | 4 | diesel | chrysler | NaN | 21.3.16 00:00 | 0 | 22962 | 6.4.16 09:45 |
6 | 20.3.16 17:55 | VW_Golf_III_GT_Special_Electronic_Green_Metall... | privat | Angebot | 300.000 | test | limousine | 1995 | manuell | 90 | golf | 150000.000 | 8 | benzin | volkswagen | NaN | 20.3.16 00:00 | 0 | 31535 | 23.3.16 02:48 |
7 | 16.3.16 18:55 | Golf_IV_1.9_TDI_90PS | privat | Angebot | 1990.000 | control | limousine | 1998 | manuell | 90 | golf | 150000.000 | 12 | diesel | volkswagen | nein | 16.3.16 00:00 | 0 | 53474 | 7.4.16 03:17 |
8 | 22.3.16 16:51 | Seat_Arosa | privat | Angebot | 250.000 | test | NaN | 2000 | manuell | 0 | arosa | 150000.000 | 10 | NaN | seat | nein | 22.3.16 00:00 | 0 | 7426 | 26.3.16 18:18 |
9 | 16.3.16 13:47 | Renault_Megane_Scenic_1.6e_RT_Klimaanlage | privat | Angebot | 590.000 | control | bus | 1997 | manuell | 90 | megane | 150000.000 | 7 | benzin | renault | nein | 16.3.16 00:00 | 0 | 15749 | 6.4.16 10:46 |
There are 2 columns with German words as values which I need to change: I will map the German words for 'yes' and 'no' in the 'unrepaired_damage' column to 1's and 0's, and replace the German transmission types in the 'gearbox' column with their English counterparts.
# leaving nan in its place
damage_dict = {'nein': 0, 'ja': 1}
autos['unrepaired_damage'] = autos['unrepaired_damage'].map(damage_dict)
gear_dict = {'automatik' : 'automatic', 'manuell': 'manual'}
autos['gearbox'] = autos['gearbox'].map(gear_dict)
print_2n(autos['unrepaired_damage'].value_counts(dropna=False),\
autos['gearbox'].value_counts(dropna=False))
0.000    35232
NaN       9829
1.000     4939
Name: unrepaired_damage, dtype: int64

manual       36993
automatic    10327
NaN           2680
Name: gearbox, dtype: int64
I have already removed the special characters from the price column, moved the unit information into the column label, and converted the column to a numeric type. As the exploring data section showed, it contains no null values. But before using this column in any analysis, we should check whether it contains outliers or unrealistic data. Since we are analysing the used car market, extremely expensive cars seem unrelated to the topic: very high prices can come from rare old models, mistakes in the dataset, almost new cars, and so on, and in any case such prices cannot be corrected on an objective basis at this step. There are also implausibly cheap cars in the dataset. I will check the descriptive statistics, the top 10 and the bottom 10 values to better understand the extreme prices.
desc_stat_price = autos['price_usd'].describe()
print(desc_stat_price)
top_10_price = autos['price_usd'].value_counts().sort_index(ascending=False).head(10)
bottom_10_price = autos['price_usd'].value_counts().sort_index(ascending=False).tail(10)
print_2n(top_10_price, bottom_10_price)
count       50000.000
mean         9840.044
std        481104.381
min             0.000
25%          1100.000
50%          2950.000
75%          7200.000
max      99999999.000
Name: price_usd, dtype: float64

99999999.000    1
27322222.000    1
12345678.000    3
11111111.000    2
10000000.000    1
3890000.000     1
1300000.000     1
1234566.000     1
999999.000      2
999990.000      1
Name: price_usd, dtype: int64

12.000       3
11.000       2
10.000       7
9.000        1
8.000        1
5.000        2
3.000        1
2.000        3
1.000      156
0.000     1421
Name: price_usd, dtype: int64
The descriptive statistics in the output above show that the maximum price is unexpectedly high (almost 100 million USD) and the minimum unexpectedly low (0 USD). The bottom 10 prices are below 15 dollars, and the top 10 range from roughly 1 million to 100 million dollars. Therefore I should set boundaries, and a range between 100 USD and 100,000 USD seems sensible. Below I check the number of cars that fall outside this range.
unreal_prices = autos[(autos['price_usd'] >= 100000) |\
(autos['price_usd'] <= 100)].shape[0]
print('\n'*2, '# of rows with unrealistic prices (at most 100 or at least 100k):',
      unreal_prices)
# of rows with unrealistic prices (at most 100 or at least 100k): 1949
In the cell below I drop those rows to eliminate the effect of the price outliers and check the descriptive statistics again. Note that between() is inclusive on both ends, so listings priced at exactly 100 or 100,000 USD survive the filter; that is why slightly more than 50,000 minus 1,949 rows remain.
# .copy() gives an independent frame and avoids SettingWithCopyWarning
# on later column assignments
autos = autos[autos['price_usd'].between(100, 100000)].copy()
desc_stat_price = autos['price_usd'].describe()
print_2n(desc_stat_price)
count   48185.000
mean     5796.100
std      7525.532
min       100.000
25%      1250.000
50%      3000.000
75%      7499.000
max     99900.000
Name: price_usd, dtype: float64
I use the same methods as above to check for outliers in the 'odometer_km' column, which holds the distance covered by each car. This column also contains no null values (see the exploring data section).
# checking descriptive statistics and unrealisticly high and low
# covered distance by car
desc_stat_odo = autos['odometer_km'].describe()
print_2n(desc_stat_odo)
top_10_odo = autos['odometer_km'].value_counts().sort_index(ascending=False).head(10)
bottom_10_odo = autos['odometer_km'].value_counts().sort_index(ascending=False).tail(10)
print_2n(top_10_odo, bottom_10_odo)
count    48185.000
mean    125986.718
std      39467.457
min       5000.000
25%     125000.000
50%     150000.000
75%     150000.000
max     150000.000
Name: odometer_km, dtype: float64

150000.000    31209
125000.000     5035
100000.000     2099
90000.000      1733
80000.000      1411
70000.000      1214
60000.000      1151
50000.000      1007
40000.000       813
30000.000       771
Name: odometer_km, dtype: int64

90000.000    1733
80000.000    1411
70000.000    1214
60000.000    1151
50000.000    1007
40000.000     813
30000.000     771
20000.000     749
10000.000     242
5000.000      751
Name: odometer_km, dtype: int64
The minimum (5,000 km) and maximum (150,000 km) values in the 'odometer_km' column make sense. The top 10 and bottom 10 values also seem realistic and useful for the later aggregation of averages. (The round figures suggest that odometer readings were binned into fixed intervals when the listings were created.)
There are 5 columns containing date and/or time data. To understand them better, the cell below shows the data types and descriptive statistics for all 5 columns.
date_cols = ['registration_month', 'registration_year', 'date_crawled',\
'ad_created' , 'last_seen']
print(autos[date_cols].dtypes)
autos[date_cols].describe(include='all')
registration_month     int64
registration_year      int64
date_crawled          object
ad_created            object
last_seen             object
dtype: object
registration_month | registration_year | date_crawled | ad_created | last_seen | |
---|---|---|---|---|---|
count | 48185.000 | 48185.000 | 48185 | 48185 | 48185 |
unique | nan | nan | 11994 | 76 | 10226 |
top | nan | nan | 28.3.16 14:49 | 3.4.16 00:00 | 6.4.16 02:45 |
freq | nan | nan | 15 | 1872 | 111 |
mean | 5.802 | 2004.730 | NaN | NaN | NaN |
std | 3.678 | 87.932 | NaN | NaN | NaN |
min | 0.000 | 1000.000 | NaN | NaN | NaN |
25% | 3.000 | 1999.000 | NaN | NaN | NaN |
50% | 6.000 | 2004.000 | NaN | NaN | NaN |
75% | 9.000 | 2008.000 | NaN | NaN | NaN |
max | 12.000 | 9999.000 | NaN | NaN | NaN |
From the output above it is visible that the 'registration_month' and 'registration_year' columns are stored as numeric types, while the remaining 3 columns are strings. To see the format in which the values of all 5 date columns are represented, I split the column labels into 2 lists by data type. Then I print the top 5 rows of both to see how the date-related information is stored.
str_date_cols = ['date_crawled','ad_created' , 'last_seen']
int_date_cols = ['registration_year', 'registration_month']
print_2n(autos[str_date_cols].head(), autos[int_date_cols].head())
    date_crawled     ad_created      last_seen
0  26.3.16 17:47  26.3.16 00:00   6.4.16 06:45
1   4.4.16 13:38   4.4.16 00:00   6.4.16 14:45
2  26.3.16 18:57  26.3.16 00:00   6.4.16 20:15
3  12.3.16 16:58  12.3.16 00:00  15.3.16 03:16
4   1.4.16 14:38   1.4.16 00:00   1.4.16 14:38

   registration_year  registration_month
0               2004                   3
1               1997                   6
2               2009                   7
3               2007                   6
4               2003                   7
As we can see, the 3 columns stored as strings contain dates and times; later, dates or times can be extracted from them for analysis. The other 2 date columns hold the registration year and month as numbers and can therefore be used for further analysis without any change of data type.
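As a sketch of that later extraction (assuming the 'day.month.2-digit-year hour:minute' format seen in the output above), the string columns parse cleanly with pd.to_datetime and an explicit format; the sample values below are taken from the printed rows:

```python
import pandas as pd

# two sample values in the same format as 'date_crawled' / 'last_seen'
last_seen = pd.Series(['6.4.16 06:45', '15.3.16 03:16'])

# an explicit format avoids day/month ambiguity in dates like 6.4.16
parsed = pd.to_datetime(last_seen, format='%d.%m.%y %H:%M')
print(parsed.dt.month.tolist())  # months: April and March
```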
Now that we have seen what data is stored in the date columns, we can pay closer attention to the 'last_seen', 'registration_year' and 'registration_month' columns. There are 2 reasons to look closer at them: the descriptive statistics above show impossible registration values (years ranging from 1000 to 9999, months starting from 0), and the string date columns can tell us the period the dataset covers.
To clean the registration columns, I will first check the distribution of the 'last_seen' column; this helps to establish a realistic upper limit for the registration year. I will also check the distributions of the other 2 columns stored as strings ('date_crawled', 'ad_created').
for col in str_date_cols:
    date = autos[col].str.split().str[0]
    month_year = date.str.split('.').str[1] + "-" + \
                 date.str.split('.').str[2]
    autos[f'{col}_month_year'] = month_year
    month_year_dist = month_year.value_counts(normalize=True, dropna=False)\
                                .sort_values(ascending=False) * 100
    print_2n(f"Distribution for '{col}_month_year' column:", month_year_dist)
Distribution for 'date_crawled_month_year' column:
3-16   83.792
4-16   16.208
Name: date_crawled, dtype: float64

Distribution for 'ad_created_month_year' column:
3-16     83.734
4-16     16.103
2-16      0.127
1-16      0.025
12-15     0.004
6-15      0.002
11-15     0.002
9-15      0.002
8-15      0.002
Name: ad_created, dtype: float64

Distribution for 'last_seen_month_year' column:
4-16   57.651
3-16   42.349
Name: last_seen, dtype: float64
The output shows dates in the format 'month number-2 digit year' (e.g. 4-16 is April 2016). The distributions of the 'last_seen' and 'date_crawled' columns contain only 2 values: 3-16 (March 2016) and 4-16 (April 2016). The 'ad_created' distribution also shows no date later than 4-16. These facts tell us that the dataset captures the situation on the German eBay used car market as of March and April 2016.
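One caveat: the 'm-yy' labels above sort as strings, not chronologically. A small sketch (using labels taken from the output above) of converting them with an explicit format so that they order correctly:

```python
import pandas as pd

# month-year labels as produced by the loop above
labels = pd.Series(['4-16', '3-16', '12-15'])

# '%m-%y' reads the month number and the 2-digit year
parsed = pd.to_datetime(labels, format='%m-%y')
print(parsed.sort_values().dt.strftime('%B %Y').tolist())
```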
The observations above make it clear that a registration year later than 2016 is an error and cannot be realistic. Moreover, rows where the registration year is 2016 and the registration month is the same as or later than the month in the 'last_seen' column contain false dates and should be removed.
autos['last_seen_month'] = autos['last_seen'].str.split().str[0]\
                                             .str.split(".").str[1].astype(int)
autos = autos[~((autos['registration_year'] == 2016) &\
(autos['registration_month'] >= autos['last_seen_month']))]
autos.to_csv('autos_cleaned.csv', index=False)
autos
date_crawled | name | seller | offer_type | price_usd | ab_test | vehicle_type | registration_year | gearbox | power_ps | ... | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | date_crawled_month_year | ad_created_month_year | last_seen_month_year | last_seen_month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 26.3.16 17:47 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | 5000.000 | control | bus | 2004 | manual | 158 | ... | peugeot | 0.000 | 26.3.16 00:00 | 0 | 79588 | 6.4.16 06:45 | 3-16 | 3-16 | 4-16 | 4 |
1 | 4.4.16 13:38 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | 8500.000 | control | limousine | 1997 | automatic | 286 | ... | bmw | 0.000 | 4.4.16 00:00 | 0 | 71034 | 6.4.16 14:45 | 4-16 | 4-16 | 4-16 | 4 |
2 | 26.3.16 18:57 | Volkswagen_Golf_1.6_United | privat | Angebot | 8990.000 | test | limousine | 2009 | manual | 102 | ... | volkswagen | 0.000 | 26.3.16 00:00 | 0 | 35394 | 6.4.16 20:15 | 3-16 | 3-16 | 4-16 | 4 |
3 | 12.3.16 16:58 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | 4350.000 | control | kleinwagen | 2007 | automatic | 71 | ... | smart | 0.000 | 12.3.16 00:00 | 0 | 33729 | 15.3.16 03:16 | 3-16 | 3-16 | 3-16 | 3 |
4 | 1.4.16 14:38 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | 1350.000 | test | kombi | 2003 | manual | 0 | ... | ford | 0.000 | 1.4.16 00:00 | 0 | 39218 | 1.4.16 14:38 | 4-16 | 4-16 | 4-16 | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 27.3.16 14:38 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | 24900.000 | control | limousine | 2011 | automatic | 239 | ... | audi | 0.000 | 27.3.16 00:00 | 0 | 82131 | 1.4.16 13:47 | 3-16 | 3-16 | 4-16 | 4 |
49996 | 28.3.16 10:50 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | 1980.000 | control | cabrio | 1996 | manual | 75 | ... | opel | 0.000 | 28.3.16 00:00 | 0 | 44807 | 2.4.16 14:18 | 3-16 | 3-16 | 4-16 | 4 |
49997 | 2.4.16 14:44 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | 13200.000 | test | cabrio | 2014 | automatic | 69 | ... | fiat | 0.000 | 2.4.16 00:00 | 0 | 73430 | 4.4.16 11:47 | 4-16 | 4-16 | 4-16 | 4 |
49998 | 8.3.16 19:25 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | 22900.000 | control | kombi | 2013 | manual | 150 | ... | audi | 0.000 | 8.3.16 00:00 | 0 | 35683 | 5.4.16 16:45 | 3-16 | 3-16 | 4-16 | 4 |
49999 | 14.3.16 00:42 | Opel_Vectra_1.6_16V | privat | Angebot | 1250.000 | control | limousine | 1996 | manual | 101 | ... | opel | 0.000 | 13.3.16 00:00 | 0 | 45897 | 6.4.16 21:18 | 3-16 | 3-16 | 4-16 | 4 |
47440 rows × 24 columns
For the lower limit, let's first consider 1950. I need to see how many cars were registered before that year and decide whether the number is small enough to delete the corresponding rows.
too_old_cars = autos[autos['registration_year'] < 1950].shape[0]
print(f'# of too old cars = {too_old_cars}')
# of too old cars = 22
Since only 22 cars were registered before 1950, removing them will not cause significant data loss, so they can be dropped from the dataset.
# removing cars registered before 1950 and after 2016
autos = autos[autos['registration_year'].between(1950, 2016)]
As a result of all the operations above I obtained cleaned data and saved it to a new csv file ('autos_cleaned.csv'). I will use the cleaned data to analyse the used car market on German eBay based on the most common brands, their average mileage, percentage of unrepaired damage, gearbox type and average age. As mentioned before, I will try to find common patterns between the average price and the other characteristics of popular brands.
First, in order to analyse the most frequent brands in our dataset, we need detailed information on the brands. Then I will create a new dataset containing only the popular brands and aggregated values for them. In the cell below I check the unique brands, the descriptive statistics of the 'brand' column, and the distribution of brands in percent, in descending order.
unique_brands = autos['brand'].unique()
stat_brands = autos['brand'].describe()
brands_dist = autos['brand'].value_counts(normalize=True, dropna=False)*100
print_2n(unique_brands, stat_brands, brands_dist)
['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault'
 'mercedes_benz' 'audi' 'sonstige_autos' 'opel' 'mazda' 'porsche' 'mini'
 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar'
 'fiat' 'skoda' 'subaru' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda'
 'daewoo' 'suzuki' 'trabant' 'land_rover' 'alfa_romeo' 'lada' 'rover'
 'daihatsu' 'lancia']

count          45551
unique            40
top       volkswagen
freq            9618
Name: brand, dtype: object

volkswagen       21.115
bmw              11.093
opel             10.667
mercedes_benz     9.708
audi              8.724
ford              6.968
renault           4.702
peugeot           2.975
fiat              2.553
seat              1.811
skoda             1.655
nissan            1.528
mazda             1.521
smart             1.416
citroen           1.401
toyota            1.284
hyundai           1.001
sonstige_autos    0.933
volvo             0.920
mini              0.883
mitsubishi        0.819
honda             0.788
kia               0.716
alfa_romeo        0.667
suzuki            0.593
chevrolet         0.571
porsche           0.551
chrysler          0.351
dacia             0.266
daihatsu          0.252
jeep              0.231
land_rover        0.213
subaru            0.211
saab              0.167
jaguar            0.156
daewoo            0.151
trabant           0.138
rover             0.134
lancia            0.108
lada              0.059
Name: brand, dtype: float64
There are 40 unique brands in the dataset, with Volkswagen being the most frequent; the full list is shown above. Since I plan to analyse patterns among popular brands, brands represented by very few cars will not be included in the new dataset. The threshold I set for a brand to qualify as popular is 1 percent: if a brand's share in the distribution above is below 1%, it is left out. From the distribution we can see that the last qualifying brand is 'hyundai' and the first one not qualifying is 'sonstige_autos' (German for 'other cars').
The next step is to create the new dataset with the popular brands only. Each brand is represented by one row, with the brand names as index values. I also create the first column, containing the average price for each brand.
# selecting brands with a share of 1 percent or more
cmmn_brands = brands_dist[brands_dist >= 1]

# aggregating the average price for each common brand:
# collect brand -> avg price pairs in a dict, build a Series
# from the dict (keys become the index, values the data),
# sort it in descending order, and wrap it in a DataFrame
brands_price_dict = {}
for brand in cmmn_brands.index:
avg_price = autos.loc[autos['brand'] == brand, 'price_usd'].mean()
brands_price_dict[brand] = avg_price
cmmn_brands_avg_price = pd.Series(brands_price_dict).sort_values(ascending=False)
cmmn_brands_df = pd.DataFrame(cmmn_brands_avg_price, columns=['avg_price'])
cmmn_brands_df.index.name = 'brand'
cmmn_brands_df
avg_price | |
---|---|
brand | |
audi | 9379.054 |
mercedes_benz | 8627.313 |
bmw | 8293.899 |
skoda | 6424.845 |
hyundai | 5484.410 |
volkswagen | 5482.053 |
toyota | 5200.318 |
nissan | 4803.099 |
seat | 4479.248 |
mazda | 4178.258 |
citroen | 3841.723 |
ford | 3774.345 |
smart | 3614.164 |
peugeot | 3129.524 |
opel | 3026.059 |
fiat | 2857.802 |
renault | 2512.908 |
Now we have a new dataset called 'cmmn_brands_df'. In the next 6 cells I will add the remaining columns to it.
Adding 'avg_km_covered' column:
Values in this column represent the average distance, in kilometers, covered by the cars of each brand.
# aggregating the avg covered distance for each brand in the
# common brands df and adding it as a series to the same df
brands_avg_km = {}
for brand in cmmn_brands_df.index:
avg_mlg = autos.loc[autos['brand'] == brand, 'odometer_km'].mean()
brands_avg_km[brand] = avg_mlg
cmmn_brands_df['avg_km_covered'] = pd.Series(brands_avg_km)
cmmn_brands_df
avg_price | avg_km_covered | |
---|---|---|
brand | ||
audi | 9379.054 | 129266.482 |
mercedes_benz | 8627.313 | 131007.463 |
bmw | 8293.899 | 132673.659 |
skoda | 6424.845 | 110835.544 |
hyundai | 5484.410 | 106293.860 |
volkswagen | 5482.053 | 128614.057 |
toyota | 5200.318 | 115777.778 |
nissan | 4803.099 | 118520.115 |
seat | 4479.248 | 121321.212 |
mazda | 4178.258 | 124206.349 |
citroen | 3841.723 | 119184.953 |
ford | 3774.345 | 124188.721 |
smart | 3614.164 | 99263.566 |
peugeot | 3129.524 | 126974.170 |
opel | 3026.059 | 129320.848 |
fiat | 2857.802 | 116822.872 |
renault | 2512.908 | 128204.949 |
Adding 'damage_info_%' column:
Values in this column represent the percentage of cars of each brand for which there is information on whether the car has unrepaired damage or not.
In the next cells I will also add the percentage of cars with unrepaired damage and the percentage of undamaged cars. But as we saw earlier, the corresponding column contains null values. Therefore it is crucial to check that the majority of cars carry information about the presence of unrepaired damage.
damage_info_dict = {}
for brand in cmmn_brands_df.index:
count_brands = autos[autos['brand'] == brand].shape[0]
count_na = autos.loc[(autos['brand'] == brand) &\
(autos['unrepaired_damage'].isnull())].shape[0]
share_dmg = 100 - (count_na / count_brands ) * 100
damage_info_dict[brand] = share_dmg
cmmn_brands_df['damage_info_%'] = pd.Series(damage_info_dict)
cmmn_brands_df
avg_price | avg_km_covered | damage_info_% | |
---|---|---|---|
brand | |||
audi | 9379.054 | 129266.482 | 84.625 |
mercedes_benz | 8627.313 | 131007.463 | 85.052 |
bmw | 8293.899 | 132673.659 | 85.197 |
skoda | 6424.845 | 110835.544 | 88.859 |
hyundai | 5484.410 | 106293.860 | 88.158 |
volkswagen | 5482.053 | 128614.057 | 81.004 |
toyota | 5200.318 | 115777.778 | 88.547 |
nissan | 4803.099 | 118520.115 | 85.057 |
seat | 4479.248 | 121321.212 | 84.606 |
mazda | 4178.258 | 124206.349 | 81.818 |
citroen | 3841.723 | 119184.953 | 81.818 |
ford | 3774.345 | 124188.721 | 80.466 |
smart | 3614.164 | 99263.566 | 83.876 |
peugeot | 3129.524 | 126974.170 | 84.207 |
opel | 3026.059 | 129320.848 | 80.078 |
fiat | 2857.802 | 116822.872 | 77.988 |
renault | 2512.908 | 128204.949 | 79.458 |
The output above shows that for every brand more than 75% of the cars have information about the presence of unrepaired damage. Since a significant majority of cars carry this information, we can rely on an analysis based on the percentage of undamaged cars or cars with unrepaired damage.
Adding 'unrep_damage_%' column:
Values in this column represent the percentage of cars with unrepaired damage for each brand.
Adding 'healthy_car_%' column:
Values in this column represent the percentage of cars with no unrepaired damage for each brand.
unrep_damage_dict = {}
for brand in cmmn_brands_df.index:
    # mean of the 0/1 damage flag over rows that carry information
    damaged_share = autos.loc[(autos['brand'] == brand) &
                              (autos['unrepaired_damage'].notnull()),
                              'unrepaired_damage'].mean() * 100
    unrep_damage_dict[brand] = damaged_share
cmmn_brands_df['unrep_damage_%'] = pd.Series(unrep_damage_dict)
cmmn_brands_df['healthy_car_%'] = (cmmn_brands_df['damage_info_%']
                                   - cmmn_brands_df['unrep_damage_%'])
cmmn_brands_df
avg_price | avg_km_covered | damage_info_% | unrep_damage_% | healthy_car_% | |
---|---|---|---|---|---|
brand | |||||
audi | 9379.054 | 129266.482 | 84.625 | 9.664 | 74.961 |
mercedes_benz | 8627.313 | 131007.463 | 85.052 | 8.881 | 76.171 |
bmw | 8293.899 | 132673.659 | 85.197 | 9.175 | 76.022 |
skoda | 6424.845 | 110835.544 | 88.859 | 9.701 | 79.158 |
hyundai | 5484.410 | 106293.860 | 88.158 | 13.930 | 74.228 |
volkswagen | 5482.053 | 128614.057 | 81.004 | 10.602 | 70.402 |
toyota | 5200.318 | 115777.778 | 88.547 | 7.915 | 80.632 |
nissan | 4803.099 | 118520.115 | 85.057 | 15.372 | 69.686 |
seat | 4479.248 | 121321.212 | 84.606 | 11.461 | 73.145 |
mazda | 4178.258 | 124206.349 | 81.818 | 14.286 | 67.532 |
citroen | 3841.723 | 119184.953 | 81.818 | 12.452 | 69.366 |
ford | 3774.345 | 124188.721 | 80.466 | 15.427 | 65.040 |
smart | 3614.164 | 99263.566 | 83.876 | 8.133 | 75.743 |
peugeot | 3129.524 | 126974.170 | 84.207 | 15.688 | 68.519 |
opel | 3026.059 | 129320.848 | 80.078 | 14.418 | 65.660 |
fiat | 2857.802 | 116822.872 | 77.988 | 14.002 | 63.986 |
renault | 2512.908 | 128204.949 | 79.458 | 15.805 | 63.654 |
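Both damage columns can likewise be derived in one groupby pass. A sketch on toy data (all values hypothetical, assuming 'unrepaired_damage' is a 0/1 flag with nulls, as in the notebook):

```python
import pandas as pd

# Toy stand-in for 'autos'
autos_demo = pd.DataFrame({
    'brand': ['audi'] * 4 + ['bmw'] * 4,
    'unrepaired_damage': [0, 1, 0, None, 0, 0, None, None],
})

g = autos_demo.groupby('brand')['unrepaired_damage']
damage_info = g.apply(lambda s: s.notnull().mean() * 100)  # share with info
unrep_damage = g.mean() * 100         # mean of the 0/1 flag skips NaN by default
healthy = damage_info - unrep_damage  # info share minus damaged share
```

Note that `.mean()` ignoring NaN by default is exactly what the explicit `.notnull()` filter in the loop achieves.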
Adding 'automatic_%' column:
Values in this column represent the percentage of cars with an automatic gearbox for each brand.
Adding 'manual_%' column:
Values in this column represent the percentage of cars with a manual gearbox for each brand.
automatic_dict = {}
manual_dict = {}
for brand in cmmn_brands_df.index:
    # one value_counts call per brand; dropna=False keeps missing
    # gearbox values in the denominator
    gearbox_shares = autos.loc[autos['brand'] == brand, 'gearbox']\
        .value_counts(normalize=True, dropna=False) * 100
    automatic_dict[brand] = gearbox_shares['automatic']
    manual_dict[brand] = gearbox_shares['manual']
cmmn_brands_df['automatic_%'] = pd.Series(automatic_dict)
cmmn_brands_df['manual_%'] = pd.Series(manual_dict)
cmmn_brands_df
avg_price | avg_km_covered | damage_info_% | unrep_damage_% | healthy_car_% | automatic_% | manual_% | |
---|---|---|---|---|---|---|---|
brand | |||||||
audi | 9379.054 | 129266.482 | 84.625 | 9.664 | 74.961 | 33.065 | 63.538 |
mercedes_benz | 8627.313 | 131007.463 | 85.052 | 8.881 | 76.171 | 55.450 | 41.633 |
bmw | 8293.899 | 132673.659 | 85.197 | 9.175 | 76.022 | 33.742 | 63.606 |
skoda | 6424.845 | 110835.544 | 88.859 | 9.701 | 79.158 | 11.538 | 85.942 |
hyundai | 5484.410 | 106293.860 | 88.158 | 13.930 | 74.228 | 12.719 | 84.211 |
volkswagen | 5482.053 | 128614.057 | 81.004 | 10.602 | 70.402 | 13.329 | 82.086 |
toyota | 5200.318 | 115777.778 | 88.547 | 7.915 | 80.632 | 12.137 | 84.274 |
nissan | 4803.099 | 118520.115 | 85.057 | 15.372 | 69.686 | 10.920 | 85.201 |
seat | 4479.248 | 121321.212 | 84.606 | 11.461 | 73.145 | 3.879 | 92.606 |
mazda | 4178.258 | 124206.349 | 81.818 | 14.286 | 67.532 | 6.061 | 90.909 |
citroen | 3841.723 | 119184.953 | 81.818 | 12.452 | 69.366 | 13.323 | 81.191 |
ford | 3774.345 | 124188.721 | 80.466 | 15.427 | 65.040 | 8.034 | 87.492 |
smart | 3614.164 | 99263.566 | 83.876 | 8.133 | 75.743 | 73.178 | 11.628 |
peugeot | 3129.524 | 126974.170 | 84.207 | 15.688 | 68.519 | 5.461 | 90.849 |
opel | 3026.059 | 129320.848 | 80.078 | 14.418 | 65.660 | 10.249 | 85.120 |
fiat | 2857.802 | 116822.872 | 77.988 | 14.002 | 63.986 | 2.408 | 92.347 |
renault | 2512.908 | 128204.949 | 79.458 | 15.805 | 63.654 | 6.676 | 88.562 |
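The per-brand loop can again be avoided with a single groupby. A sketch on toy data (hypothetical rows; the 'automatic'/'manual' labels are assumed to have been translated earlier in the notebook):

```python
import pandas as pd

# Toy stand-in for 'autos'
autos_demo = pd.DataFrame({
    'brand': ['audi', 'audi', 'audi', 'bmw'],
    'gearbox': ['automatic', 'manual', None, 'manual'],
})

# Fill missing gearbox values so they stay in the denominator (same effect
# as dropna=False), then compute per-brand shares in one pass
shares = (autos_demo.assign(gearbox=autos_demo['gearbox'].fillna('missing'))
          .groupby('brand')['gearbox']
          .value_counts(normalize=True)
          .unstack(fill_value=0) * 100)
```

`shares` is a brands-by-gearbox-type table whose 'automatic' and 'manual' columns match the loop's output.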
Adding 'vehicle_count' column:
Values in this column represent the number of cars for each brand.
count_dict = {}
for brand in cmmn_brands_df.index:
    vehicle_count = autos[autos['brand'] == brand].shape[0]
    count_dict[brand] = vehicle_count
cmmn_brands_df.insert(0, 'vehicle_count', pd.Series(count_dict))
cmmn_brands_df
vehicle_count | avg_price | avg_km_covered | damage_info_% | unrep_damage_% | healthy_car_% | automatic_% | manual_% | |
---|---|---|---|---|---|---|---|---|
brand | ||||||||
audi | 3974 | 9379.054 | 129266.482 | 84.625 | 9.664 | 74.961 | 33.065 | 63.538 |
mercedes_benz | 4422 | 8627.313 | 131007.463 | 85.052 | 8.881 | 76.171 | 55.450 | 41.633 |
bmw | 5053 | 8293.899 | 132673.659 | 85.197 | 9.175 | 76.022 | 33.742 | 63.606 |
skoda | 754 | 6424.845 | 110835.544 | 88.859 | 9.701 | 79.158 | 11.538 | 85.942 |
hyundai | 456 | 5484.410 | 106293.860 | 88.158 | 13.930 | 74.228 | 12.719 | 84.211 |
volkswagen | 9618 | 5482.053 | 128614.057 | 81.004 | 10.602 | 70.402 | 13.329 | 82.086 |
toyota | 585 | 5200.318 | 115777.778 | 88.547 | 7.915 | 80.632 | 12.137 | 84.274 |
nissan | 696 | 4803.099 | 118520.115 | 85.057 | 15.372 | 69.686 | 10.920 | 85.201 |
seat | 825 | 4479.248 | 121321.212 | 84.606 | 11.461 | 73.145 | 3.879 | 92.606 |
mazda | 693 | 4178.258 | 124206.349 | 81.818 | 14.286 | 67.532 | 6.061 | 90.909 |
citroen | 638 | 3841.723 | 119184.953 | 81.818 | 12.452 | 69.366 | 13.323 | 81.191 |
ford | 3174 | 3774.345 | 124188.721 | 80.466 | 15.427 | 65.040 | 8.034 | 87.492 |
smart | 645 | 3614.164 | 99263.566 | 83.876 | 8.133 | 75.743 | 73.178 | 11.628 |
peugeot | 1355 | 3129.524 | 126974.170 | 84.207 | 15.688 | 68.519 | 5.461 | 90.849 |
opel | 4859 | 3026.059 | 129320.848 | 80.078 | 14.418 | 65.660 | 10.249 | 85.120 |
fiat | 1163 | 2857.802 | 116822.872 | 77.988 | 14.002 | 63.986 | 2.408 | 92.347 |
renault | 2142 | 2512.908 | 128204.949 | 79.458 | 15.805 | 63.654 | 6.676 | 88.562 |
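The counting loop above is equivalent to one `value_counts` call. A toy sketch (hypothetical values):

```python
import pandas as pd

# Toy brand column; value_counts returns the number of rows per brand at once
brands = pd.Series(['audi', 'bmw', 'audi', 'audi', 'bmw'])
counts = brands.value_counts()
```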
Adding 'avg_age' column:
Values in this column represent the average age of cars for each brand.
# as there are '0' values in registration month column I replaced them with 1
reg_month_0to1 = autos['registration_month'].replace(0, 1).astype(str)
# combining values in registration month and registration year columns
m_y_reg = reg_month_0to1 + '-' + autos['registration_year'].astype(str)
# converting above combination to date values
date_m_y_reg = pd.to_datetime(m_y_reg, format='%m-%Y')
# finding latest date from last seen column
latest_date = pd.to_datetime(autos['last_seen'].str.split().str[0], format='%d.%m.%y').max()
# finding age of each car in days
diff = latest_date - date_m_y_reg
# converting age from days to years and adding to autos dataset
# (.dt.days gives the total day count of each timedelta directly)
autos['vehicle_age'] = diff.dt.days / 365
# finding average age of cars by brand and adding it to cmmn_brands_df dataset
age_dict = {}
for brand in cmmn_brands_df.index:
    avg = autos.loc[autos['brand'] == brand, 'vehicle_age'].mean()
    age_dict[brand] = avg
cmmn_brands_df.insert(1, 'avg_age', pd.Series(age_dict))
#cmmn_brands_df.to_csv('autos_brands.csv')
cmmn_brands_df
C:\Users\Asus\anaconda3\lib\site-packages\ipykernel_launcher.py:12: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if sys.path[0] == '':
vehicle_count | avg_age | avg_price | avg_km_covered | damage_info_% | unrep_damage_% | healthy_car_% | automatic_% | manual_% | |
---|---|---|---|---|---|---|---|---|---|
brand | |||||||||
audi | 3974 | 11.815 | 9379.054 | 129266.482 | 84.625 | 9.664 | 74.961 | 33.065 | 63.538 |
mercedes_benz | 4422 | 13.924 | 8627.313 | 131007.463 | 85.052 | 8.881 | 76.171 | 55.450 | 41.633 |
bmw | 5053 | 12.921 | 8293.899 | 132673.659 | 85.197 | 9.175 | 76.022 | 33.742 | 63.606 |
skoda | 754 | 9.385 | 6424.845 | 110835.544 | 88.859 | 9.701 | 79.158 | 11.538 | 85.942 |
hyundai | 456 | 8.780 | 5484.410 | 106293.860 | 88.158 | 13.930 | 74.228 | 12.719 | 84.211 |
volkswagen | 9618 | 13.636 | 5482.053 | 128614.057 | 81.004 | 10.602 | 70.402 | 13.329 | 82.086 |
toyota | 585 | 11.670 | 5200.318 | 115777.778 | 88.547 | 7.915 | 80.632 | 12.137 | 84.274 |
nissan | 696 | 12.687 | 4803.099 | 118520.115 | 85.057 | 15.372 | 69.686 | 10.920 | 85.201 |
seat | 825 | 11.593 | 4479.248 | 121321.212 | 84.606 | 11.461 | 73.145 | 3.879 | 92.606 |
mazda | 693 | 13.174 | 4178.258 | 124206.349 | 81.818 | 14.286 | 67.532 | 6.061 | 90.909 |
citroen | 638 | 11.612 | 3841.723 | 119184.953 | 81.818 | 12.452 | 69.366 | 13.323 | 81.191 |
ford | 3174 | 13.258 | 3774.345 | 124188.721 | 80.466 | 15.427 | 65.040 | 8.034 | 87.492 |
smart | 645 | 10.611 | 3614.164 | 99263.566 | 83.876 | 8.133 | 75.743 | 73.178 | 11.628 |
peugeot | 1355 | 12.357 | 3129.524 | 126974.170 | 84.207 | 15.688 | 68.519 | 5.461 | 90.849 |
opel | 4859 | 13.900 | 3026.059 | 129320.848 | 80.078 | 14.418 | 65.660 | 10.249 | 85.120 |
fiat | 1163 | 12.877 | 2857.802 | 116822.872 | 77.988 | 14.002 | 63.986 | 2.408 | 92.347 |
renault | 2142 | 13.163 | 2512.908 | 128204.949 | 79.458 | 15.805 | 63.654 | 6.676 | 88.562 |
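The SettingWithCopyWarning above typically appears when `autos` is itself a filtered slice of an earlier DataFrame. A sketch of the same age computation on hypothetical rows, assigning to an explicit `.copy()` so the warning cannot occur:

```python
import pandas as pd

# Hypothetical rows mimicking the relevant 'autos' columns
autos_demo = pd.DataFrame({
    'registration_month': [3, 0, 7],
    'registration_year': [2004, 1997, 2009],
    'last_seen': ['06.04.16 06:45', '06.04.16 14:45', '06.04.16 20:15'],
})

# replace month 0 with 1, build a month-year string, parse it as a date
month = autos_demo['registration_month'].replace(0, 1)
reg_date = pd.to_datetime(month.astype(str) + '-'
                          + autos_demo['registration_year'].astype(str),
                          format='%m-%Y')
# latest date in the last_seen column
latest = pd.to_datetime(autos_demo['last_seen'].str.split().str[0],
                        format='%d.%m.%y').max()
# .copy() guarantees we write to an independent frame, not a view of a slice;
# .dt.days gives the full day count without going through .dt.components
autos_demo = autos_demo.copy()
autos_demo['vehicle_age'] = (latest - reg_date).dt.days / 365
```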
Now we have the dataset 'cmmn_brands_df' ready for analysis. I will use it to look for similar patterns and trends between average price and the data in the other columns.
At this step of my training I will not use statistical methods, as they haven't been covered in this course yet, although I have some knowledge from university. The techniques used below are simple, but some explanation of the logic behind the analysis is required. First, I will create a function named 'compare_around_mean' to see if there is a pattern between two columns. Then I will use it several times to check whether there is a pattern between the 'avg_price' column and four other columns.
def compare_around_mean(a_df, x_col, y_col):
    print(f"'{x_col}' column values in descending order:\n"
          f"{a_df[x_col].sort_values(ascending=False)}")
    x_mean = a_df[x_col].mean()
    # mean of y for rows below / at-or-above the mean of x
    bottomx_y_avg = a_df.loc[a_df[x_col] < x_mean, y_col].mean()
    topx_y_avg = a_df.loc[a_df[x_col] >= x_mean, y_col].mean()
    print_2n(f"Mean of '{y_col}' for below average values in '{x_col}': {bottomx_y_avg}")
    print_2n(f"Mean of '{y_col}' for above average values in '{x_col}': {topx_y_avg}")
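To make the split concrete, here is the same logic applied to a tiny made-up frame (all numbers hypothetical):

```python
import pandas as pd

# Toy frame: 'x' plays the role of the sorted column, 'y' the compared one
demo = pd.DataFrame({'x': [1, 2, 3, 10], 'y': [100, 200, 300, 400]})
x_mean = demo['x'].mean()                           # mean of x is 4.0
below = demo.loc[demo['x'] < x_mean, 'y'].mean()    # mean of y where x < 4
above = demo.loc[demo['x'] >= x_mean, 'y'].mean()   # mean of y where x >= 4
```

Here the single outlier row (x = 10) ends up alone in the "above average" group, which illustrates why this split is only a rough signal, not a statistical test.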
I will explain what the function does and the logic behind it using the example of the 'avg_price' and 'manual_%' columns below. The 'manual_%' column represents the share of cars with a manual gearbox for each brand.
In the cell below we can see that the average price for brands with a below-average share of manual cars (7478.61 USD) is much higher than for those with an above-average share (4245.74 USD). This pattern confirms the generally known fact that a car with a manual gearbox is usually cheaper. There is only one concern with this conclusion in our case: the share (and number) of cars with an automatic gearbox is really low relative to cars with a manual gearbox. Therefore the difference between average prices could change significantly once more cars with automatic gearboxes are added to the data.
compare_around_mean(cmmn_brands_df, 'manual_%', 'avg_price')
'manual_%' column values in descending order:
brand
seat             92.606
fiat             92.347
mazda            90.909
peugeot          90.849
renault          88.562
ford             87.492
skoda            85.942
nissan           85.201
opel             85.120
toyota           84.274
hyundai          84.211
volkswagen       82.086
citroen          81.191
bmw              63.606
audi             63.538
mercedes_benz    41.633
smart            11.628
Name: manual_%, dtype: float64

Mean of 'avg_price' for below average values in 'manual_%': 7478.6076593315465

Mean of 'avg_price' for above average values in 'manual_%': 4245.7378633367825
Before using the 'compare_around_mean' function below, I added a new column to the dataset: 'damage/age'. This column represents the ratio of the share of cars with unrepaired damage to the average age of cars for each brand. It shows, roughly, what share of a brand's cars acquires unrepaired damage each year. Ideally, the higher the ratio, the worse the condition of the cars should be, so the price should also be lower compared to brands with a lower ratio. And as we can see from the output of the cell below, the average price for brands with a lower ratio is higher by approximately 60 percent.
cmmn_brands_df['damage/age'] = cmmn_brands_df['unrep_damage_%'] / cmmn_brands_df['avg_age']
compare_around_mean(cmmn_brands_df, 'damage/age', 'avg_price')
'damage/age' column values in descending order:
brand
hyundai          1.587
peugeot          1.270
nissan           1.212
renault          1.201
ford             1.164
fiat             1.087
mazda            1.084
citroen          1.072
opel             1.037
skoda            1.034
seat             0.989
audi             0.818
volkswagen       0.778
smart            0.767
bmw              0.710
toyota           0.678
mercedes_benz    0.638
Name: damage/age, dtype: float64

Mean of 'avg_price' for below average values in 'damage/age': 6439.435683507786

Mean of 'avg_price' for above average values in 'damage/age': 4003.2973076149838
Another column we have to check for a relation between price and damage is the 'healthy_car_%' column. The cell below confirms the finding of the previous cell: the higher the share of healthy cars (no unrepaired damage), the higher the average price.
compare_around_mean(cmmn_brands_df, 'healthy_car_%', 'avg_price')
'healthy_car_%' column values in descending order:
brand
toyota           80.632
skoda            79.158
mercedes_benz    76.171
bmw              76.022
smart            75.743
audi             74.961
hyundai          74.228
seat             73.145
volkswagen       70.402
nissan           69.686
citroen          69.366
peugeot          68.519
mazda            67.532
opel             65.660
ford             65.040
fiat             63.986
renault          63.654
Name: healthy_car_%, dtype: float64

Mean of 'avg_price' for below average values in 'healthy_car_%': 3733.974541611802

Mean of 'avg_price' for above average values in 'healthy_car_%': 6437.906498274766
The last column I will check against the 'avg_price' column is 'avg_km_covered'. The natural expectation is: the higher the distance covered, the lower the price. From the output of the cell it is visible that the average price for brands with above-average covered distance is lower by ~790 USD. This confirms the natural expectation, but the difference was larger in the previous cells. Therefore the gap between average prices here indicates a weaker link between covered distance and price.
compare_around_mean(cmmn_brands_df, 'avg_km_covered', 'avg_price')
'avg_km_covered' column values in descending order:
brand
bmw              132673.659
mercedes_benz    131007.463
opel             129320.848
audi             129266.482
volkswagen       128614.057
renault          128204.949
peugeot          126974.170
mazda            124206.349
ford             124188.721
seat             121321.212
citroen          119184.953
nissan           118520.115
fiat             116822.872
toyota           115777.778
skoda            110835.544
hyundai          106293.860
smart             99263.566
Name: avg_km_covered, dtype: float64

Mean of 'avg_price' for below average values in 'avg_km_covered': 4588.201204252344

Mean of 'avg_price' for above average values in 'avg_km_covered': 5378.157025187288
Even before looking at any data, anybody planning to set the price of a used car has some expectation of what affects the price and how. And if you want to buy one, you probably have some price in mind depending on the condition of the car. In this project I analysed average prices of popular brands and, in the trends & patterns section, found supporting indications for four of our natural expectations about the prices of cars belonging to a popular brand.
Basically, the average price for a car of a certain brand is higher if cars of that brand:
- have a lower share of manual gearboxes,
- have a lower ratio of unrepaired damage to average age,
- have a higher share of healthy (undamaged) cars,
- have covered a below-average distance.
The method I used here is simple and does not involve proper statistical methods. The reason is simple: at this stage of my training I tried to use only concepts already covered in the course. Therefore the method on its own is not enough to draw firm conclusions, but it can serve as additional evidence.
There are some parts of code in this project that could be deleted, for example new columns added to the original dataset that I did not use later. I decided to leave those parts in. The reason is that I am in my training phase: I want to type as much code as possible to memorize the concepts and syntax better, and keeping everything here means that if I need to come back and refresh my knowledge, it will be easy to find what I learned.