The dataset we'll work with for this project contains used car listings from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The dataset was originally scraped and uploaded to Kaggle by user orgesleka. The original dataset isn't available on Kaggle anymore, but you can find it here.
The aim of this project is to clean the data and analyze the included used car listings.
We will import the pandas and NumPy libraries to explore the dataset and get information about it.
# Import libraries to read the dataset
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding = "Latin-1")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
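The Latin-1 encoding is needed because the default UTF-8 read fails on the German characters in the listings. As a sketch (the helper name `read_with_fallback` is ours, not part of pandas), a small fallback loop can pick a working encoding:

```python
import pandas as pd

def read_with_fallback(path, encodings=("utf-8", "latin-1")):
    # Try each encoding in turn; Latin-1 can decode any byte sequence,
    # so it works as a last-resort catch-all.
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")
```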
# Use the info() method to get information about the data
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
#Use the head() method to view the first few rows of the dataset
autos.head()
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
From exploring the dataset, we can make the following observations:

- Most columns are stored as strings (object dtype), including price and odometer, which actually hold numeric data.
- The vehicleType, gearbox, model, fuelType and notRepairedDamage columns contain null values.
- The column names use camelcase instead of Python's preferred snakecase.

Next we will convert the column names from camelcase to snakecase and reword some of the column names to be more descriptive.
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
# Use the DataFrame.columns attribute to view the existing column names,
# then rename the columns, transforming camelcase to snakecase
old_column_names = autos.columns
new_column_names = ['date_crawled',
'name',
'seller',
'offer_type',
'price',
'abtest',
'vehicle_type',
'registration_year',
'gearbox',
'power_ps',
'model',
'odometer',
'registration_month',
'fuel_type',
'brand',
'unrepaired_damage',
'ad_created',
'num_of_pictures',
'postal_code',
'last_seen']
# Assign the new column names to the dataframe using the DataFrame.columns attribute
autos.columns = new_column_names
autos.head(2)
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
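Instead of writing out the new names by hand, most of the camelcase-to-snakecase conversion could be automated with a small regex helper. This is a sketch of our own, not part of the project: names like powerPS and nrOfPictures would still need manual overrides to become power_ps and num_of_pictures.

```python
import re

def camel_to_snake(name):
    # Insert an underscore before each uppercase letter, then lowercase everything
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

print(camel_to_snake("dateCrawled"))         # date_crawled
print(camel_to_snake("yearOfRegistration"))  # year_of_registration
```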
We will continue with some basic exploration to figure out where further data cleaning is needed. We will look for:

- text columns where almost all of the values are the same, as these can often be dropped since they hold little useful information for analysis, and
- numeric data stored as text that needs to be cleaned and converted.

# We will look at the descriptive statistics of our dataset
autos.describe(include='all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_of_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-27 22:55:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
From the described dataset we can notice that:

- The vehicle_type, gearbox, model, fuel_type and unrepaired_damage columns have counts less than 50000, so they likely contain missing (NaN) values.
- The seller and offer_type columns have 2 unique entries each, but 'privat' and 'Angebot' account for 49999 of the 50000 entries in their respective columns. These columns can be dropped.
- The price column returns NaN for the numeric summary statistics, suggesting the prices are not stored in a numeric format.
- The num_of_pictures column seems to contain only 0 values. It can be dropped.
- The registration_year and registration_month columns are stored as int instead of a datetime format.
- The postal_code column can be converted to strings, as postal codes are not true numeric values.
- The date_crawled, ad_created and last_seen columns appear to be in string format.
- The odometer column is stored as text and should be converted to a numeric data type.

# Investigating the seller column further
autos['seller'].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
# Investigating the offer_type column further
autos['offer_type'].value_counts()
Angebot 49999 Gesuch 1 Name: offer_type, dtype: int64
# Investigating the num_of_pictures column further
autos['num_of_pictures'].value_counts()
0 50000 Name: num_of_pictures, dtype: int64
From the exploration above, the seller, offer_type and num_of_pictures columns can be dropped: the first two hold a single value in all but one row, and the last holds the same value (0) in every row.
# Converting the price column to a numeric type
# regex=False makes str.replace treat "$" literally (it is a regex metacharacter)
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(int)
# Converting the odometer column to a numeric type
autos["odometer"] = autos["odometer"].str.replace("km", "", regex=False).str.replace(",", "", regex=False).astype(int)
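The chained replace calls could also be collapsed into a single regex that strips every non-digit character in one pass. A sketch on small sample series:

```python
import pandas as pd

prices = pd.Series(["$5,000", "$8,500", "$0"])
odometers = pd.Series(["150,000km", "70,000km", "5,000km"])

# \D matches any non-digit, so '$', ',' and 'km' are all removed at once
clean_prices = prices.str.replace(r"\D", "", regex=True).astype(int)
clean_odometers = odometers.str.replace(r"\D", "", regex=True).astype(int)
```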
# Renaming the odometer column to a more descriptive name
autos.rename({'odometer':'odometer_km'}, axis=1, inplace=True)
# Removing unwanted columns
autos.drop(['seller','offer_type', 'num_of_pictures'],
axis=1, inplace=True)
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 50000 non-null object 1 name 50000 non-null object 2 price 50000 non-null int64 3 abtest 50000 non-null object 4 vehicle_type 44905 non-null object 5 registration_year 50000 non-null int64 6 gearbox 47320 non-null object 7 power_ps 50000 non-null int64 8 model 47242 non-null object 9 odometer_km 50000 non-null int64 10 registration_month 50000 non-null int64 11 fuel_type 45518 non-null object 12 brand 50000 non-null object 13 unrepaired_damage 40171 non-null object 14 ad_created 50000 non-null object 15 postal_code 50000 non-null int64 16 last_seen 50000 non-null object dtypes: int64(6), object(11) memory usage: 6.5+ MB
We'll continue exploring the data, specifically looking for values that don't look right. We'll begin by analyzing the odometer_km and price columns. For each, we will check how many unique values there are, view the summary statistics, and remove any rows with unrealistic values.
# Use Series.unique().shape to see how many unique values each column has
print('The shape of odometer_km column is:', autos['odometer_km'].unique().shape)
print('The shape of price column is:', autos['price'].unique().shape)
The shape of odometer_km column is: (13,) The shape of price column is: (2357,)
# Use Series.describe() to view the summary statistics
autos[['odometer_km','price']].describe()
odometer_km | price | |
---|---|---|
count | 50000.000000 | 5.000000e+04 |
mean | 125732.700000 | 9.840044e+03 |
std | 40042.211706 | 4.811044e+05 |
min | 5000.000000 | 0.000000e+00 |
25% | 125000.000000 | 1.100000e+03 |
50% | 150000.000000 | 2.950000e+03 |
75% | 150000.000000 | 7.200000e+03 |
max | 150000.000000 | 1.000000e+08 |
The price column has a minimum value of 0, which looks like an outlier since cars are not sold for $0. Similarly, it is unlikely for used cars to sell for as much as $100,000,000.
The minimum odometer_km value of 5,000 is far below the lower quartile of 125,000, so we will inspect the distribution to see whether it is an outlier.
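Before filtering, the reasoning above can be quantified by counting how many values fall outside a plausible range. A sketch on a small sample series (the real check would run on autos['price']):

```python
import pandas as pd

prices = pd.Series([0, 1350, 4350, 99_999_999, 350_000])

# between() is inclusive on both ends
plausible = prices.between(0, 350_000)
n_outliers = (~plausible).sum()
```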
# Exploring the odometer_km column
# With Series.value_counts() to see unique values
autos['odometer_km'].value_counts(ascending=False).head(13)
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
The data for the odometer column looks realistic and needs no further modification at this point.
We will cap the maximum price at 350,000; values above this seem unrealistic.
# Exploring the price column
# With Series.value_counts() to see unique values
autos['price'].value_counts(ascending=False).head(20)
0 1421 500 781 1500 734 2500 643 1000 639 1200 639 600 531 800 498 3500 498 2000 460 999 434 750 433 900 420 650 419 850 410 700 395 4500 394 300 384 2200 382 950 379 Name: price, dtype: int64
We will not remove the listings priced at 0. With 1,421 entries they form a noticeable share of the dataset and may be useful as we go along.
autos = autos[autos['price'].between(0,350000)]
autos['price'].describe()
count 49986.000000 mean 5721.525167 std 8983.617820 min 0.000000 25% 1100.000000 50% 2950.000000 75% 7200.000000 max 350000.000000 Name: price, dtype: float64
There are 5 columns that should represent date values. Some of these columns were created by the crawler, while others came from the website itself:

- date_crawled: added by the crawler
- last_seen: added by the crawler
- ad_created: from the website
- registration_month: from the website
- registration_year: from the website

The date_crawled, last_seen, and ad_created columns are stored as strings; we need to convert the data into a numerical representation so we can understand it quantitatively. The other two columns are already in a numeric format, so methods like Series.describe() can help us understand the distribution without further data processing.
# First, a closer look at how the values in the three string columns are formatted
autos[['date_crawled','ad_created','last_seen']][:5]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | 2016-03-12 00:00:00 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 2016-04-01 14:38:50 |
You'll notice that the first 10 characters represent the date, formatted as year-month-day (e.g. 2016-03-12), and the last 8 characters represent the time as hour, minute and second (e.g. 14:38:50).
Going forward, we will extract just the date values to understand the date range.
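An alternative to string slicing is to parse these columns with pd.to_datetime, which yields proper datetime values instead of strings. A sketch on a small sample:

```python
import pandas as pd

crawled = pd.Series(["2016-03-26 17:47:46", "2016-04-04 13:38:56"])

# Parsing gives datetime64 values; .dt.date keeps just the calendar date
parsed = pd.to_datetime(crawled)
dates = parsed.dt.date
```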
# Select the first characters with series.str[]
# Generate a distribution, and then sort by the index.
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025387 2016-03-06 0.013944 2016-03-07 0.035970 2016-03-08 0.033269 2016-03-09 0.033209 2016-03-10 0.032129 2016-03-11 0.032489 2016-03-12 0.036770 2016-03-13 0.015564 2016-03-14 0.036630 2016-03-15 0.033990 2016-03-16 0.029508 2016-03-17 0.031509 2016-03-18 0.013064 2016-03-19 0.034910 2016-03-20 0.037831 2016-03-21 0.037490 2016-03-22 0.032909 2016-03-23 0.032389 2016-03-24 0.029108 2016-03-25 0.031749 2016-03-26 0.032489 2016-03-27 0.031049 2016-03-28 0.034850 2016-03-29 0.034150 2016-03-30 0.033629 2016-03-31 0.031909 2016-04-01 0.033809 2016-04-02 0.035410 2016-04-03 0.038691 2016-04-04 0.036490 2016-04-05 0.013104 2016-04-06 0.003181 2016-04-07 0.001420 Name: date_crawled, dtype: float64
The date_crawled values range between 2016-03-05 and 2016-04-07.
# Select the first characters with series.str[]
# Generate a distribution, and then sort by the index.
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11 0.000020 2015-08-10 0.000020 2015-09-09 0.000020 2015-11-10 0.000020 2015-12-05 0.000020 ... 2016-04-03 0.038931 2016-04-04 0.036850 2016-04-05 0.011843 2016-04-06 0.003261 2016-04-07 0.001280 Name: ad_created, Length: 76, dtype: float64
Ads were created between 2015-06-11 and 2016-04-07, with the largest share of ads created on 2016-04-03.
# Select the first characters with series.str[]
# Generate a distribution, and then sort by the index.
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.001080 2016-03-06 0.004421 2016-03-07 0.005362 2016-03-08 0.007582 2016-03-09 0.009843 2016-03-10 0.010763 2016-03-11 0.012524 2016-03-12 0.023807 2016-03-13 0.008983 2016-03-14 0.012804 2016-03-15 0.015884 2016-03-16 0.016445 2016-03-17 0.027928 2016-03-18 0.007422 2016-03-19 0.015744 2016-03-20 0.020706 2016-03-21 0.020726 2016-03-22 0.021586 2016-03-23 0.018585 2016-03-24 0.019565 2016-03-25 0.019205 2016-03-26 0.016965 2016-03-27 0.016024 2016-03-28 0.020846 2016-03-29 0.022326 2016-03-30 0.024847 2016-03-31 0.023827 2016-04-01 0.023106 2016-04-02 0.024887 2016-04-03 0.025367 2016-04-04 0.024627 2016-04-05 0.124275 2016-04-06 0.220982 2016-04-07 0.130957 Name: last_seen, dtype: float64
# We will view the distribution of registration_year
autos['registration_year'].describe()
count 49986.000000 mean 2005.075721 std 105.727161 min 1000.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
The minimum (1000) and maximum (9999) values of the registration_year column appear unrealistic and are likely data-entry errors.
# Generate a distribution, and then sort by the index.
registration_year_list = autos['registration_year'].value_counts(normalize=True, dropna=False, ascending=True).sort_index()
print(registration_year_list.head(20))
print('\n')
print(registration_year_list.tail(20))
1000 0.00002 1001 0.00002 1111 0.00002 1500 0.00002 1800 0.00004 1910 0.00018 1927 0.00002 1929 0.00002 1931 0.00002 1934 0.00004 1937 0.00008 1938 0.00002 1939 0.00002 1941 0.00004 1943 0.00002 1948 0.00002 1950 0.00006 1951 0.00004 1952 0.00002 1953 0.00002 Name: registration_year, dtype: float64 2011 0.032689 2012 0.026467 2013 0.016125 2014 0.013304 2015 0.007982 2016 0.026327 2017 0.029048 2018 0.009823 2019 0.000060 2800 0.000020 4100 0.000020 4500 0.000020 4800 0.000020 5000 0.000080 5911 0.000020 6200 0.000020 8888 0.000020 9000 0.000040 9996 0.000020 9999 0.000080 Name: registration_year, dtype: float64
From the exploration of the registration_year column, we observed that it contains:

- unrealistically early years such as 1000, 1111 and 1800, and
- unrealistically late years such as 2800, 9000 and 9999.

Because a car can't be first registered after the listing was seen, any vehicle with a registration year above 2016 is definitely inaccurate. At the other end, the earliest plausible years lie somewhere in the first few decades of the 1900s, but since pre-1990 listings make up only a tiny fraction of the data, we will keep registration years from 1990 (as earliest) to 2016.
# Selecting registration years from 1990 to 2016
autos = autos[autos['registration_year'].between(1990,2016)]
autos.describe()
price | registration_year | power_ps | odometer_km | registration_month | postal_code | |
---|---|---|---|---|---|---|
count | 46628.000000 | 46628.000000 | 46628.000000 | 46628.000000 | 46628.000000 | 46628.000000 |
mean | 5753.910269 | 2003.518036 | 117.910097 | 126136.119928 | 5.801021 | 50921.426782 |
std | 8908.747513 | 5.801633 | 197.405225 | 39567.381126 | 3.694594 | 25757.597852 |
min | 0.000000 | 1990.000000 | 0.000000 | 5000.000000 | 0.000000 | 1067.000000 |
25% | 1149.000000 | 1999.000000 | 75.000000 | 125000.000000 | 3.000000 | 30519.000000 |
50% | 2990.000000 | 2003.000000 | 109.000000 | 150000.000000 | 6.000000 | 49671.000000 |
75% | 7300.000000 | 2008.000000 | 150.000000 | 150000.000000 | 9.000000 | 71636.000000 |
max | 350000.000000 | 2016.000000 | 17700.000000 | 150000.000000 | 12.000000 | 99998.000000 |
autos['registration_year'].value_counts(normalize=True, ascending=True)
1991 0.007635 1992 0.008364 1990 0.008471 2015 0.008557 1993 0.009544 1994 0.014155 2014 0.014262 2013 0.017286 1995 0.028138 2016 0.028223 2012 0.028374 1996 0.030969 2010 0.034250 2011 0.035043 1997 0.043493 2009 0.044973 2008 0.047847 2007 0.049412 1998 0.052608 2002 0.054324 2001 0.057948 2006 0.058055 2003 0.058484 2004 0.058699 1999 0.064296 2005 0.064661 2000 0.071931 Name: registration_year, dtype: float64
Listings with registration years earlier than 1990 or later than 2016 have been removed. The distribution peaks around 2000, with most registrations falling between the late 1990s and the mid-2000s.
We can use aggregation to understand the brand column and the variation it contains. For this we will:

- identify the unique values we want to aggregate by,
- create an empty dictionary to store our aggregate data, and
- loop over the unique values, and for each: select the rows with that value, compute the mean of the column of interest, and store the result in the dictionary.
autos.head(3)
date_crawled | name | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer_km | registration_month | fuel_type | brand | unrepaired_damage | ad_created | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | 5000 | control | bus | 2004 | manuell | 158 | andere | 150000 | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | 8500 | control | limousine | 1997 | automatik | 286 | 7er | 150000 | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | 8990 | test | limousine | 2009 | manuell | 102 | golf | 70000 | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 35394 | 2016-04-06 20:15:37 |
autos['brand'].describe()
count 46628 unique 40 top volkswagen freq 9874 Name: brand, dtype: object
autos_brand = autos['brand'].value_counts(normalize=True)
unique_brands = autos['brand'].unique()
print(autos_brand)
print('\n')
print('These are the unique brands in the data brand column:', '\n', unique_brands)
volkswagen 0.211761 bmw 0.111221 opel 0.109398 mercedes_benz 0.092670 audi 0.087866 ford 0.069851 renault 0.048426 peugeot 0.030389 fiat 0.025864 seat 0.018701 skoda 0.016406 mazda 0.015484 nissan 0.015356 smart 0.014326 citroen 0.013919 toyota 0.012632 hyundai 0.010144 volvo 0.009029 mini 0.008793 mitsubishi 0.008343 sonstige_autos 0.008042 honda 0.007871 kia 0.007313 alfa_romeo 0.006391 suzuki 0.006026 porsche 0.005404 chevrolet 0.005104 chrysler 0.003667 dacia 0.002638 daihatsu 0.002595 jeep 0.002230 subaru 0.002166 land_rover 0.002016 saab 0.001566 jaguar 0.001544 daewoo 0.001544 rover 0.001394 lancia 0.001072 lada 0.000493 trabant 0.000343 Name: brand, dtype: float64 These are the unique brands in the data brand column: ['peugeot' 'bmw' 'volkswagen' 'smart' 'ford' 'chrysler' 'seat' 'renault' 'mercedes_benz' 'audi' 'opel' 'mazda' 'porsche' 'mini' 'toyota' 'dacia' 'nissan' 'jeep' 'saab' 'volvo' 'mitsubishi' 'jaguar' 'fiat' 'skoda' 'subaru' 'sonstige_autos' 'kia' 'citroen' 'chevrolet' 'hyundai' 'honda' 'daewoo' 'suzuki' 'land_rover' 'alfa_romeo' 'rover' 'trabant' 'daihatsu' 'lancia' 'lada']
Volkswagen is the most popular brand, accounting for about 21 percent of all listings, while Trabant is the least common. Let's explore the brands that each account for more than 5% of listings.
# Select the brands for the aggregation
selected_brands = autos_brand[autos_brand > 0.05].index
print(selected_brands)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
# Compute the mean price for each of the selected brands
bprice_means = {}
for brand in selected_brands:
    brand_select = autos[autos['brand'] == brand]
    avg_brand = brand_select['price'].mean()
    bprice_means[brand] = int(avg_brand)
bprice_means
{'volkswagen': 5233, 'bmw': 8135, 'opel': 2859, 'mercedes_benz': 8447, 'audi': 9181, 'ford': 3362}
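The explicit loop mirrors how aggregation works step by step; pandas' groupby can produce the same result in one line. A sketch on a hypothetical mini-frame (against the full data, the call would be autos.groupby('brand')['price'].mean()):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the full autos dataframe
df = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford"],
    "price": [8000, 8270, 3300, 3424],
})

# Group rows by brand and take the mean price within each group
mean_prices = df.groupby("brand")["price"].mean()
```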
From the price means we can group the car brands that each contribute over 5% of listings into 3 distinct price categories:

- Audi, Mercedes-Benz and BMW are the most expensive,
- Volkswagen is mid-priced, and
- Ford and Opel are the least expensive.
For the top 6 brands, we'll use aggregation to understand the average mileage for those cars and see if there is any visible link with the mean price.
autos.columns
Index(['date_crawled', 'name', 'price', 'abtest', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer_km', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen'], dtype='object')
# Compute the mean mileage for each of the selected brands
mean_milage = {}
for brand in selected_brands:
    brand_mile = autos[autos['brand'] == brand]
    milage_mean = brand_mile['odometer_km'].mean()
    mean_milage[brand] = int(milage_mean)
print(mean_milage)
{'volkswagen': 129127, 'bmw': 132561, 'opel': 129836, 'mercedes_benz': 130925, 'audi': 129090, 'ford': 125164}
# Convert the brand price means and mean mileage dictionaries to series
bpm_series = pd.Series(bprice_means)
mean_mileage = pd.Series(mean_milage)
# Create a single-column dataframe from the price mean series
bpm_df = pd.DataFrame(bpm_series, columns=['price_mean'])
# Assign the mean mileage as a new column of the dataframe
bpm_df['mean_mileage'] = mean_mileage
# Display the dataframe (bpm_df) sorted by the price column
bpm_df.sort_values(by='price_mean', ascending=False)
price_mean | mean_mileage | |
---|---|---|
audi | 9181 | 129090 |
mercedes_benz | 8447 | 130925 |
bmw | 8135 | 132561 |
volkswagen | 5233 | 129127 |
ford | 3362 | 125164 |
opel | 2859 | 129836 |
From the analysis, we discovered three distinct classes of car brands based on their mean prices, with Audi the most expensive among the top six. However, the analysis also showed no clear relationship between the mean prices of the top brands and the average mileage their cars had covered: mean mileage is similar (roughly 125,000 to 133,000 km) across all six brands regardless of price.
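That conclusion can be sanity-checked with a Pearson correlation on the aggregated values, though with only six brands the coefficient is too noisy to be strong evidence either way. A sketch rebuilding the means from the table above (means_df is our own name):

```python
import pandas as pd

# Mean price and mean mileage per top brand, as computed above
means_df = pd.DataFrame({
    "price_mean":   [9181, 8447, 8135, 5233, 3362, 2859],
    "mean_mileage": [129090, 130925, 132561, 129127, 125164, 129836],
}, index=["audi", "mercedes_benz", "bmw", "volkswagen", "ford", "opel"])

# Pearson correlation between the two aggregated columns
corr = means_df["price_mean"].corr(means_df["mean_mileage"])
```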