The given dataset contains information about used cars from eBay Kleinanzeigen, the classifieds section of the German eBay website. The dataset was originally scraped and uploaded by another user and was later modified and deliberately 'dirtied', so the data needs to be cleaned before working with it. The original dataset is no longer available on Kaggle, but it can be found here.
The aim of this project is to analyze the included used car listings and draw some conclusions from the resulting dataframe.
# Importing necessary modules - pandas and numpy
import pandas as pd
import numpy as np
# Reading the 'autos.csv' file as 'autos'
autos = pd.read_csv('autos.csv', encoding='Latin-1')
# Only first 5 rows are returned as currently there's no need to examine all the rows
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
# Obtaining information about the dataframe
autos.info()
# Checking how many rows are present in the dataframe
print('\nnumber of rows in df:', autos.shape[0])
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 20 columns):
dateCrawled            50000 non-null object
name                   50000 non-null object
seller                 50000 non-null object
offerType              50000 non-null object
price                  50000 non-null object
abtest                 50000 non-null object
vehicleType            44905 non-null object
yearOfRegistration     50000 non-null int64
gearbox                47320 non-null object
powerPS                50000 non-null int64
model                  47242 non-null object
odometer               50000 non-null object
monthOfRegistration    50000 non-null int64
fuelType               45518 non-null object
brand                  50000 non-null object
notRepairedDamage      40171 non-null object
dateCreated            50000 non-null object
nrOfPictures           50000 non-null int64
postalCode             50000 non-null int64
lastSeen               50000 non-null object
dtypes: int64(5), object(15)
memory usage: 7.6+ MB

number of rows in df: 50000
Most columns are complete, but five of them ('vehicleType', 'gearbox', 'model', 'fuelType', 'notRepairedDamage') contain missing values: their non-null counts are below 50000, the total number of rows. There are 5 integer and 15 object columns, which makes 20 columns in total.
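The exact number of missing values per column can also be obtained directly with `isnull().sum()`. A minimal sketch on a small synthetic frame (the column names and values are illustrative, not the real dataset):

```python
import pandas as pd

# Synthetic stand-in for a slice of the autos dataframe
sample = pd.DataFrame({
    'vehicleType': ['bus', None, 'limousine'],
    'gearbox': ['manuell', 'automatik', None],
    'price': ['$5,000', '$8,500', '$8,990'],
})

# Count NaN values per column in one call
null_counts = sample.isnull().sum()
print(null_counts)
```

On the real dataframe, `autos.isnull().sum()` would reproduce the per-column gaps that `info()` only shows indirectly via non-null counts.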
# Getting an array of existing column names
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# Editing column names and converting them from camelcase
autos.rename({'dateCrawled':'date_crawled', 'offerType':'offer_type', 'vehicleType':'vehicle_type', 'yearOfRegistration':'registration_year', 'powerPS':'power_ps', 'monthOfRegistration':'registration_month', 'fuelType':'fuel_type', 'notRepairedDamage':'unrepaired_damage', 'dateCreated':'ad_created', 'nrOfPictures':'nr_of_pictures', 'postalCode':'postal_code', 'lastSeen':'last_seen'}, axis=1, inplace=True)
# Making sure that the columns have been converted to snakecase
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'nr_of_pictures', 'postal_code', 'last_seen'], dtype='object')
# Making sure that the converted columns are now in the dataframe and haven't altered the rest of the dataframe structure
autos.head()
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Column labels were converted from camelCase to snake_case for consistency with Python naming conventions; in addition, some column names ('yearOfRegistration', 'monthOfRegistration', 'notRepairedDamage', 'dateCreated') were reworded to make them easier to understand and to avoid the confusion they could cause.
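The explicit mapping above is the safest approach, but the purely mechanical part of the camelCase-to-snake_case conversion can also be expressed as a small regex helper (a sketch, not part of the original notebook; names that need rewording, such as 'yearOfRegistration' or the all-caps suffix in 'powerPS', would still require the manual mapping):

```python
import re

def camel_to_snake(name):
    # Insert an underscore before each uppercase letter (except at the
    # start of the string), then lowercase the whole name
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).lower()

print(camel_to_snake('dateCrawled'))   # date_crawled
print(camel_to_snake('nrOfPictures'))  # nr_of_pictures
```

A hybrid approach would apply the helper to every column and then overwrite only the special cases from a small dictionary.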
# Looking at descriptive statistics for all columns
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | nr_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-12 16:06:22 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
The statistics above show that some columns carry almost no information: 'seller' and 'offer_type' each take the same value in 49999 of the 50000 rows, and 'nr_of_pictures' is 0 everywhere. These three columns can therefore be dropped:
# Removing the 'nr_of_pictures', 'seller', 'offer_type' columns
autos.drop(columns = ['nr_of_pictures','seller','offer_type'], inplace=True)
# Making sure that the mentioned columns have been removed
autos.shape[1]
17
# Understanding what non-numeric characters are present in the 'price' column
autos['price'].head()
0    $5,000
1    $8,500
2    $8,990
3    $4,350
4    $1,350
Name: price, dtype: object
# Removing non-numeric characters (',' and '$') from the 'price' column
# and converting it to a numeric dtype; regex=False treats '$' as a
# literal character rather than a regex end-of-string anchor
autos['price'] = (autos['price']
                  .str.replace(',', '', regex=False)
                  .str.replace('$', '', regex=False)
                  .astype('int64'))
# Making sure that non-numeric values have been removed
autos['price'].head()
0    5000
1    8500
2    8990
3    4350
4    1350
Name: price, dtype: int64
# Understanding what non-numeric characters are present in the 'odometer' column
autos['odometer'].head()
0    150,000km
1    150,000km
2     70,000km
3     70,000km
4    150,000km
Name: odometer, dtype: object
# Removing non-numeric characters ('km' and ',') from the 'odometer' column
# and converting it to a numeric dtype
autos['odometer'] = (autos['odometer']
                     .str.replace('km', '', regex=False)
                     .str.replace(',', '', regex=False)
                     .astype('int64'))
# Making sure that non-numeric values have been removed
autos['odometer'].head()
0    150000
1    150000
2     70000
3     70000
4    150000
Name: odometer, dtype: int64
# Renaming the column from 'odometer' to 'odometer_km'
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
# Making sure that the column has been renamed
autos.columns
Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen'], dtype='object')
Now we can find out how many unique values each numeric column has. A column that takes the same value in (almost) all rows carries no useful information and would be a candidate for removal from the dataframe.
# Calculating the number of unique values for each numeric column
print('Unique values for numeric columns:')
print('registration_year:', *autos['registration_year'].unique().shape)
print('power_ps:', *autos['power_ps'].unique().shape)
print('registration_month:', *autos['registration_month'].unique().shape)
print('postal_code:', *autos['postal_code'].unique().shape)
Unique values for numeric columns:
registration_year: 97
power_ps: 448
registration_month: 13
postal_code: 7014
As we can see, all four numeric columns have plenty of unique values, which confirms that they carry meaningful variation and should not be dropped.
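The same check can be written in a single call with `nunique()`, which counts distinct values per column. A sketch on a synthetic frame (the numbers are illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    'registration_year': [1999, 2004, 2004, 2010],
    'power_ps': [75, 90, 90, 75],
    'registration_month': [1, 6, 6, 12],
})

# nunique() returns the count of distinct values for each column at once
unique_counts = sample[['registration_year', 'power_ps', 'registration_month']].nunique()
print(unique_counts)
```

On the real data, `autos[['registration_year', 'power_ps', 'registration_month', 'postal_code']].nunique()` would replace the four separate print statements.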
Before removing the outliers from the 'odometer_km' and 'price' columns, we record their descriptive statistics (outliers included) so that they can later be compared against the cleaned versions of the same columns.
# Observing the values present in 'odometer_km' and 'price' columns before the outliers are removed
outliers_table = autos[['odometer_km', 'price']].describe()
outliers_table
odometer_km | price | |
---|---|---|
count | 50000.000000 | 5.000000e+04 |
mean | 125732.700000 | 9.840044e+03 |
std | 40042.211706 | 4.811044e+05 |
min | 5000.000000 | 0.000000e+00 |
25% | 125000.000000 | 1.100000e+03 |
50% | 150000.000000 | 2.950000e+03 |
75% | 150000.000000 | 7.200000e+03 |
max | 150000.000000 | 1.000000e+08 |
Now, to remove any outliers, we identify them with the 1.5×IQR rule. The rule keeps only the values inside the interval [Q1 - 1.5×IQR, Q3 + 1.5×IQR], where Q1 is the lower quartile (the 0.25 quantile), Q3 is the upper quartile (the 0.75 quantile), and IQR = Q3 - Q1. Quantiles can be computed with Series.quantile(q), where q ∈ [0, 1].
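Since the same filter is applied to more than one column, the rule can be wrapped in a small reusable helper (a sketch, not part of the original notebook; the demo frame is synthetic):

```python
import pandas as pd

def remove_iqr_outliers(df, column, k=1.5):
    """Keep only rows whose `column` value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    return df[df[column].between(q1 - k * iqr, q3 + k * iqr)]

# Tiny demo: 1000 falls far outside the IQR fence of the other values
demo = pd.DataFrame({'price': [100, 110, 120, 130, 140, 1000]})
filtered = remove_iqr_outliers(demo, 'price')
print(len(filtered))  # 5 rows remain; the extreme row is dropped
```

With this helper, the two filtering steps below reduce to `remove_iqr_outliers(autos, 'odometer_km')` and `remove_iqr_outliers(autos, 'price')`.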
# Removing the outliers from the 'odometer_km' column
odometer_no_outliers = autos[autos['odometer_km'].between(autos['odometer_km'].quantile(0.25)-1.5*(autos['odometer_km'].quantile(0.75)-autos['odometer_km'].quantile(0.25)),
autos['odometer_km'].quantile(0.75)+1.5*(autos['odometer_km'].quantile(0.75)-autos['odometer_km'].quantile(0.25)))]
# Printing dataframe with and without outliers
print('df with potential outliers:', *autos['odometer_km'].shape)
print('df without outliers:', *odometer_no_outliers['odometer_km'].shape)
df with potential outliers: 50000 df without outliers: 41520
The 'odometer_km' column did contain outliers: the filtered dataframe has 41520 rows, fewer than the original 50000. We can therefore replace 'autos' with 'odometer_no_outliers', which is the same dataframe minus the rows with outliers in the 'odometer_km' column.
# Removing the rows with the outliers in the 'odometer_km' column
autos = odometer_no_outliers
# Making sure that the number of rows in the new 'autos' version is less than 50000
autos.shape[0]
41520
# Removing the outliers from the 'price' column
price_no_outliers = autos[autos['price'].between(autos['price'].quantile(0.25)-1.5*(autos['price'].quantile(0.75)-autos['price'].quantile(0.25)),
autos['price'].quantile(0.75)+1.5*(autos['price'].quantile(0.75)-autos['price'].quantile(0.25)))]
# Printing dataframe with and without outliers
print('df with potential outliers:', *autos['price'].shape)
print('df without outliers:', *price_no_outliers['price'].shape)
df with potential outliers: 41520 df without outliers: 38386
The 'price' column also contained outliers: the filtered dataframe has 38386 rows, fewer than the previous 41520. We can therefore replace 'autos' with 'price_no_outliers', which is the same dataframe minus the rows with outliers in the 'price' column.
# Removing the rows with the outliers in the 'price' column
autos = price_no_outliers
# Making sure that the number of rows in the new 'autos' version is less than 41520
autos.shape[0]
38386
Now we can compare the values of the columns with outliers included with the values of the same columns without outliers:
# Printing the table with outliers
outliers_table
odometer_km | price | |
---|---|---|
count | 50000.000000 | 5.000000e+04 |
mean | 125732.700000 | 9.840044e+03 |
std | 40042.211706 | 4.811044e+05 |
min | 5000.000000 | 0.000000e+00 |
25% | 125000.000000 | 1.100000e+03 |
50% | 150000.000000 | 2.950000e+03 |
75% | 150000.000000 | 7.200000e+03 |
max | 150000.000000 | 1.000000e+08 |
# Observing the values left in 'odometer_km' and 'price' after removing outliers
autos[['odometer_km', 'price']].describe()
odometer_km | price | |
---|---|---|
count | 38386.000000 | 38386.000000 |
mean | 142674.412546 | 3127.445136 |
std | 16147.477812 | 2872.206674 |
min | 90000.000000 | 0.000000 |
25% | 150000.000000 | 950.000000 |
50% | 150000.000000 | 2131.000000 |
75% | 150000.000000 | 4500.000000 |
max | 150000.000000 | 12250.000000 |
After the outlier removal, 'odometer_km' changed noticeably. Its smallest values were the outliers, so dropping them reduced the standard deviation (16147 instead of 40042) and raised the minimum value (90000 instead of 5000).
The 'price' column changed even more significantly: its standard deviation fell from about 481104 to 2872, and its maximum dropped from 10^8 (100 million) to 12250.
Three columns ('date_crawled', 'ad_created', and 'last_seen') characterize the lifecycle of each listing. Analyzing them gives a brief picture of when the ads were discovered by the crawler, when they were created, and when they went inactive.
# Calculating the distribution of values in 'date_crawled'
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.024697 2016-03-06 0.013911 2016-03-07 0.036628 2016-03-08 0.034414 2016-03-09 0.033736 2016-03-10 0.032642 2016-03-11 0.032017 2016-03-12 0.037540 2016-03-13 0.015058 2016-03-14 0.037383 2016-03-15 0.034127 2016-03-16 0.030037 2016-03-17 0.032121 2016-03-18 0.013234 2016-03-19 0.033398 2016-03-20 0.037514 2016-03-21 0.037488 2016-03-22 0.032564 2016-03-23 0.032538 2016-03-24 0.028630 2016-03-25 0.032199 2016-03-26 0.032069 2016-03-27 0.030819 2016-03-28 0.034674 2016-03-29 0.033867 2016-03-30 0.034127 2016-03-31 0.031991 2016-04-01 0.033267 2016-04-02 0.034492 2016-04-03 0.037800 2016-04-04 0.036862 2016-04-05 0.013442 2016-04-06 0.003569 2016-04-07 0.001146 Name: date_crawled, dtype: float64
All listings were crawled in 2016, within a window of roughly one month: most of them throughout March, and the remainder in the first week of April.
# Calculating the distribution of values in 'ad_created'
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-12-05 0.000026 2015-12-30 0.000026 2016-01-07 0.000026 2016-01-10 0.000026 2016-01-13 0.000026 2016-01-16 0.000026 2016-01-22 0.000026 2016-01-27 0.000052 2016-01-29 0.000026 2016-02-01 0.000026 2016-02-02 0.000052 2016-02-05 0.000026 2016-02-08 0.000026 2016-02-09 0.000026 2016-02-11 0.000026 2016-02-12 0.000078 2016-02-16 0.000026 2016-02-18 0.000052 2016-02-19 0.000026 2016-02-20 0.000052 2016-02-21 0.000078 2016-02-22 0.000026 2016-02-23 0.000026 2016-02-25 0.000052 2016-02-26 0.000026 2016-02-27 0.000130 2016-02-28 0.000208 2016-02-29 0.000130 2016-03-01 0.000130 2016-03-02 0.000078 ... 2016-03-09 0.033632 2016-03-10 0.032434 2016-03-11 0.032329 2016-03-12 0.037436 2016-03-13 0.016490 2016-03-14 0.035820 2016-03-15 0.033919 2016-03-16 0.030480 2016-03-17 0.031835 2016-03-18 0.013989 2016-03-19 0.032382 2016-03-20 0.037462 2016-03-21 0.037644 2016-03-22 0.032642 2016-03-23 0.032356 2016-03-24 0.028474 2016-03-25 0.032382 2016-03-26 0.032095 2016-03-27 0.030714 2016-03-28 0.034648 2016-03-29 0.034023 2016-03-30 0.033893 2016-03-31 0.031861 2016-04-01 0.033424 2016-04-02 0.034075 2016-04-03 0.038087 2016-04-04 0.037488 2016-04-05 0.012010 2016-04-06 0.003595 2016-04-07 0.001042 Name: ad_created, Length: 66, dtype: float64
The earliest ad in the data was created on 2015-12-05 and the latest on 2016-04-07, a span of about four months. The overwhelming majority of ads were created in March and early April of 2016; dates before March account for only a tiny fraction of the listings.
# Calculating the distribution of values in 'last_seen'
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001094 2016-03-06 0.005002 2016-03-07 0.006174 2016-03-08 0.008701 2016-03-09 0.010707 2016-03-10 0.011384 2016-03-11 0.014120 2016-03-12 0.026051 2016-03-13 0.009821 2016-03-14 0.013755 2016-03-15 0.016985 2016-03-16 0.017871 2016-03-17 0.030141 2016-03-18 0.007685 2016-03-19 0.017090 2016-03-20 0.022300 2016-03-21 0.022091 2016-03-22 0.022638 2016-03-23 0.019434 2016-03-24 0.021049 2016-03-25 0.020320 2016-03-26 0.017845 2016-03-27 0.017064 2016-03-28 0.022326 2016-03-29 0.023368 2016-03-30 0.025712 2016-03-31 0.024879 2016-04-01 0.024488 2016-04-02 0.025817 2016-04-03 0.025634 2016-04-04 0.025426 2016-04-05 0.116944 2016-04-06 0.205309 2016-04-07 0.120773 Name: last_seen, dtype: float64
Every listing was last seen between 2016-03-05 and 2016-04-07, i.e. within roughly a month. The spike over the final three days (about 45% of all listings) marks the end of the crawling period rather than a burst of sales: after 2016-04-07 the crawler stopped, so every ad still active at that point received one of those dates as its 'last seen' value.
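Slicing the first ten characters of each string works, but converting these columns to proper datetimes makes range queries and date arithmetic simpler. A sketch with synthetic timestamps in the same format as the dataset:

```python
import pandas as pd

sample = pd.DataFrame({
    'date_crawled': ['2016-03-26 17:47:46', '2016-04-04 13:38:56'],
    'last_seen': ['2016-04-06 06:45:54', '2016-04-06 14:45:08'],
})

# Parse the string columns into datetime64 columns
for col in ['date_crawled', 'last_seen']:
    sample[col] = pd.to_datetime(sample[col])

# With real datetimes, the crawl window is a simple min/max difference
span = sample['date_crawled'].max() - sample['date_crawled'].min()
print(span.days)  # 8 days between the two synthetic crawl dates
```

On the real data, `autos['date_crawled'].dt.date.value_counts(normalize=True).sort_index()` would reproduce the daily distributions computed above via string slicing.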
# Describing the distribution of 'registration_year'
autos['registration_year'].describe()
count    38386.000000
mean      2002.499557
std         36.331672
min       1910.000000
25%       1999.000000
50%       2002.000000
75%       2006.000000
max       9000.000000
Name: registration_year, dtype: float64
The describe output reveals impossible values at both ends: the minimum registration year is 1910 and the maximum is 9000, while most of the values lie between 1999 (Q1) and 2006 (Q3).
Values before 1903 will be treated as invalid, since motor vehicle registration only became established around that year, and 2020 is taken as the upper limit. (Strictly speaking, any registration year after 2016, when the data was crawled, is already suspect, but the wider bound is kept here.) This gives an acceptance interval of [1903, 2020]; rows whose 'registration_year' falls outside it must be removed.
autos['registration_year'][autos['registration_year'].between(1903,2020)].value_counts(normalize=True)
2000 0.080839 1999 0.074144 2005 0.068699 2001 0.065963 2003 0.065911 2004 0.064635 2002 0.061456 1998 0.060049 2006 0.057965 1997 0.050410 2007 0.044835 2008 0.036889 1996 0.035300 2017 0.032461 1995 0.031002 2016 0.029725 2009 0.027354 1994 0.015553 2010 0.014771 2018 0.010812 1993 0.010655 2011 0.009222 1992 0.008858 1990 0.008363 1991 0.008180 2012 0.004012 1989 0.003960 1988 0.002918 1985 0.001902 1987 0.001615 ... 1970 0.000391 1960 0.000391 1973 0.000391 1972 0.000365 1976 0.000339 1977 0.000287 1975 0.000208 2014 0.000208 1965 0.000182 1968 0.000156 1974 0.000156 2015 0.000130 1966 0.000130 1967 0.000104 1971 0.000104 1969 0.000104 1910 0.000052 1956 0.000052 1961 0.000052 1958 0.000052 1964 0.000052 1963 0.000052 2019 0.000052 1962 0.000026 1937 0.000026 1934 0.000026 1959 0.000026 1954 0.000026 1953 0.000026 1950 0.000026 Name: registration_year, Length: 69, dtype: float64
The largest numbers of annual registrations fall in the 2000s and the late 1990s; the smallest belong to the first half of the previous century (the 1930s through the 1950s).
# Removing the rows where the 'registration_year' values are outside the interval
autos = autos.loc[autos['registration_year'].between(1903,2020)]
Next we pick the 15 most commonly listed brands; among them, 'volkswagen' appears most often and 'toyota' least often.
# Picking 15 most common values in the 'brand' column
autos['brand'].value_counts().head(15)
volkswagen 8588 opel 4700 bmw 4052 mercedes_benz 3434 audi 3045 ford 2790 renault 2071 peugeot 1229 fiat 1007 seat 743 mazda 624 nissan 568 citroen 546 skoda 525 toyota 457 Name: brand, dtype: int64
# Creating a dictionary for mean price
aggregate_price = {}
top_brands = autos['brand'].value_counts().head(15).index
for brand in top_brands:
sel_row = autos[autos['brand']==brand]
mean_row = sel_row['price'].mean()
aggregate_price[brand] = mean_row
aggregate_price
{'audi': 4322.562889983579, 'bmw': 4476.98198420533, 'citroen': 2698.168498168498, 'fiat': 1843.119165839126, 'ford': 2243.584946236559, 'mazda': 2677.7596153846152, 'mercedes_benz': 4145.680838672102, 'nissan': 2648.906690140845, 'opel': 2096.4821276595744, 'peugeot': 2303.6102522375913, 'renault': 1713.7469821342347, 'seat': 2730.265141318977, 'skoda': 4192.601904761905, 'toyota': 3775.301969365427, 'volkswagen': 3180.179902189101}
# Sorting the dictionary by its values in descending order
sorted(aggregate_price.items(), key=lambda x: x[1], reverse=True)
[('bmw', 4476.98198420533), ('audi', 4322.562889983579), ('skoda', 4192.601904761905), ('mercedes_benz', 4145.680838672102), ('toyota', 3775.301969365427), ('volkswagen', 3180.179902189101), ('seat', 2730.265141318977), ('citroen', 2698.168498168498), ('mazda', 2677.7596153846152), ('nissan', 2648.906690140845), ('peugeot', 2303.6102522375913), ('ford', 2243.584946236559), ('opel', 2096.4821276595744), ('fiat', 1843.119165839126), ('renault', 1713.7469821342347)]
'bmw' has the highest mean price among the top 15 brands, followed by 'audi' and 'skoda'. The lowest mean price among these brands belongs to 'renault'.
# Creating a dictionary for mean mileage
aggregate_mileage = {}
for brand in autos['brand'].value_counts().head(15).index:
sel_row = autos[autos['brand'] == brand]
mean_mileage = sel_row['odometer_km'].mean()
aggregate_mileage[brand] = mean_mileage
aggregate_mileage
{'audi': 146834.15435139573, 'bmw': 145826.75222112538, 'citroen': 138095.2380952381, 'fiat': 137120.1588877855, 'ford': 140992.83154121865, 'mazda': 141153.84615384616, 'mercedes_benz': 145789.16715200932, 'nissan': 138345.0704225352, 'opel': 142320.21276595743, 'peugeot': 140537.02196908055, 'renault': 141144.37469821342, 'seat': 140343.20323014804, 'skoda': 137285.7142857143, 'toyota': 137133.47921225382, 'volkswagen': 144168.60735910572}
# Creating series for mean price
bmp_series = pd.Series(aggregate_price)
bmp_series
audi              4322.562890
bmw               4476.981984
citroen           2698.168498
fiat              1843.119166
ford              2243.584946
mazda             2677.759615
mercedes_benz     4145.680839
nissan            2648.906690
opel              2096.482128
peugeot           2303.610252
renault           1713.746982
seat              2730.265141
skoda             4192.601905
toyota            3775.301969
volkswagen        3180.179902
dtype: float64
# Creating series for mean mileage
bmm_series = pd.Series(aggregate_mileage)
bmm_series
audi             146834.154351
bmw              145826.752221
citroen          138095.238095
fiat             137120.158888
ford             140992.831541
mazda            141153.846154
mercedes_benz    145789.167152
nissan           138345.070423
opel             142320.212766
peugeot          140537.021969
renault          141144.374698
seat             140343.203230
skoda            137285.714286
toyota           137133.479212
volkswagen       144168.607359
dtype: float64
# Assigning the values of mean price to df
df = pd.DataFrame(bmp_series, columns=['mean_price'])
df
mean_price | |
---|---|
audi | 4322.562890 |
bmw | 4476.981984 |
citroen | 2698.168498 |
fiat | 1843.119166 |
ford | 2243.584946 |
mazda | 2677.759615 |
mercedes_benz | 4145.680839 |
nissan | 2648.906690 |
opel | 2096.482128 |
peugeot | 2303.610252 |
renault | 1713.746982 |
seat | 2730.265141 |
skoda | 4192.601905 |
toyota | 3775.301969 |
volkswagen | 3180.179902 |
# Inserting the values of mean mileage as a new column
df.insert(1, 'mean_mileage', bmm_series)
df
mean_price | mean_mileage | |
---|---|---|
audi | 4322.562890 | 146834.154351 |
bmw | 4476.981984 | 145826.752221 |
citroen | 2698.168498 | 138095.238095 |
fiat | 1843.119166 | 137120.158888 |
ford | 2243.584946 | 140992.831541 |
mazda | 2677.759615 | 141153.846154 |
mercedes_benz | 4145.680839 | 145789.167152 |
nissan | 2648.906690 | 138345.070423 |
opel | 2096.482128 | 142320.212766 |
peugeot | 2303.610252 | 140537.021969 |
renault | 1713.746982 | 141144.374698 |
seat | 2730.265141 | 140343.203230 |
skoda | 4192.601905 | 137285.714286 |
toyota | 3775.301969 | 137133.479212 |
volkswagen | 3180.179902 | 144168.607359 |
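The dictionary loops and manual Series construction above can be collapsed into a single `groupby` aggregation, which produces both mean columns in one step. A sketch on a small synthetic frame:

```python
import pandas as pd

sample = pd.DataFrame({
    'brand': ['bmw', 'bmw', 'audi', 'audi'],
    'price': [5000, 4000, 6000, 2000],
    'odometer_km': [150000, 125000, 100000, 150000],
})

# Named aggregation: one groupby call yields both mean columns at once
agg = (sample.groupby('brand')
             .agg(mean_price=('price', 'mean'),
                  mean_mileage=('odometer_km', 'mean')))
print(agg)
```

On the real data, restricting to the top 15 brands first (e.g. `autos[autos['brand'].isin(top_brands)]`) and then grouping would reproduce the table above in two lines.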
# Sorting the dataframe by mean price in descending order
df.sort_values('mean_price', ascending=False)
mean_price | mean_mileage | |
---|---|---|
bmw | 4476.981984 | 145826.752221 |
audi | 4322.562890 | 146834.154351 |
skoda | 4192.601905 | 137285.714286 |
mercedes_benz | 4145.680839 | 145789.167152 |
toyota | 3775.301969 | 137133.479212 |
volkswagen | 3180.179902 | 144168.607359 |
seat | 2730.265141 | 140343.203230 |
citroen | 2698.168498 | 138095.238095 |
mazda | 2677.759615 | 141153.846154 |
nissan | 2648.906690 | 138345.070423 |
peugeot | 2303.610252 | 140537.021969 |
ford | 2243.584946 | 140992.831541 |
opel | 2096.482128 | 142320.212766 |
fiat | 1843.119166 | 137120.158888 |
renault | 1713.746982 | 141144.374698 |
# Calculating correlation between the two columns
df['mean_price'].corr(df['mean_mileage'])
0.43168914135180714
The correlation between 'mean_price' and 'mean_mileage' is about 0.43, a weak-to-moderate positive relationship. It is too weak to conclude that changes in one of these brand-level averages would be reflected in the other, and in any case a correlation computed on 15 brand averages says little about individual listings.
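`Series.corr` computes the Pearson coefficient by default; the same number can be cross-checked with numpy's `corrcoef`. A sketch on a few synthetic brand-level values (not the real aggregates):

```python
import numpy as np
import pandas as pd

# Illustrative brand-level means, not the real aggregation results
price = pd.Series([4477.0, 4322.0, 2096.0, 1713.0])
mileage = pd.Series([145826.0, 146834.0, 142320.0, 141144.0])

# Series.corr defaults to the Pearson correlation coefficient
r_pandas = price.corr(mileage)
# np.corrcoef returns the full 2x2 correlation matrix
r_numpy = np.corrcoef(price, mileage)[0, 1]
print(r_pandas, r_numpy)  # the two values agree
```

Both calls implement the same formula, so they should match to floating-point precision.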