Welcome! This is my third data analysis project and in this project we will analyze data from eBay Kleinanzeigen, which is a classified section of the German eBay website. The objective is to clean the data set and then analyze the car listings.
The complete (and cleaned) data set can be found here.
To get familiar with the data set, here are the columns and their descriptions:

dateCrawled - When this ad was first crawled. All field values are taken from this date.
name - Name of the car.
seller - Whether the seller is private or a dealer.
offerType - The type of listing.
price - The price on the ad to sell the car.
abtest - Whether the listing is included in an A/B test.
vehicleType - The vehicle type.
yearOfRegistration - The year in which the car was first registered.
gearbox - The transmission type.
powerPS - The power of the car in PS.
model - The car model name.
kilometer - How many kilometers the car has driven.
monthOfRegistration - The month in which the car was first registered.
fuelType - What type of fuel the car uses.
brand - The brand of the car.
notRepairedDamage - Whether the car has damage that is not yet repaired.
dateCreated - The date on which the eBay listing was created.
nrOfPictures - The number of pictures in the ad.
postalCode - The postal code for the location of the vehicle.
lastSeen - When the crawler last saw this ad online.

After importing the data set with the read_csv method from the pandas library, we can start to identify the data contained in the CSV.
The first 5 and last 5 rows are as shown below.
import numpy as np
import pandas as pd
autos = pd.read_csv('/Catharine/DataSets/autos.csv', encoding = "Latin-1")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___TÜV_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
From the output above, we see that there are 20 columns, as described in the introduction, and that several columns contain null values. Another detail is that some columns we would expect to be integers or floats, such as price and odometer, are actually object (string) types. Those will have to be converted as we clean the data.
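As a quick complement to info(), the null counts can be computed directly with isnull().sum(). A minimal sketch on a toy frame (the column names mirror the data set; the values are made up):

```python
import pandas as pd

# Toy frame mimicking two of the sparse columns seen in autos.info()
df = pd.DataFrame({
    "vehicleType": ["bus", None, "limousine", None],
    "gearbox": ["manuell", "automatik", None, "manuell"],
})

null_counts = df.isnull().sum()     # nulls per column
null_share = null_counts / len(df)  # as a fraction of all rows
print(null_counts)
print(null_share)
```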
One of the first things we can work on is cleaning and standardizing the column names. They are currently in camelCase, but we will change them to snake_case (for example, dateCrawled will become date_crawled) to follow the Python community standard.
print(autos.columns)
autos.rename({"yearOfRegistration": "registration_year",
              "monthOfRegistration": "registration_month",
              "notRepairedDamage": "unrepaired_damage",
              "dateCreated": "ad_created",
              "dateCrawled": "date_crawled",
              "offerType": "offer_type",
              "vehicleType": "vehicle_type",
              "powerPS": "power_ps",
              "fuelType": "fuel_type",
              "nrOfPictures": "n_pictures",
              "postalCode": "postal_code",
              "lastSeen": "last_seen"}, axis=1, inplace=True)
autos.head()
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | n_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
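As an aside, most of these renames could be generated with a small regex that converts camelCase to snake_case. A sketch (note that names like yearOfRegistration or nrOfPictures would still need the manual overrides used above):

```python
import re

def camel_to_snake(name):
    # Insert an underscore before every uppercase letter (except at the
    # start), then lowercase the result
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

for col in ["dateCrawled", "offerType", "vehicleType", "yearOfRegistration"]:
    print(col, "->", camel_to_snake(col))
# yearOfRegistration becomes year_of_registration, which we map
# to registration_year by hand instead
```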
Next, we will analyze the columns to determine whether there are any more cleaning tasks to be done. Here we will use df.describe() to check for anything that needs to be cleaned.
autos.describe(include = 'all')
date_crawled | name | seller | offer_type | price | abtest | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | n_pictures | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-09 11:54:38 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
As foreseen, the price and odometer columns will have to be converted to floats and integers, respectively. Let's start with the price column.
print(autos["price"])
0 $5,000 1 $8,500 2 $8,990 3 $4,350 4 $1,350 ... 49995 $24,900 49996 $1,980 49997 $13,200 49998 $22,900 49999 $1,250 Name: price, Length: 50000, dtype: object
As seen above, we have to remove the "$" and "," symbols. Even though no cents are shown, we'll convert the values to float.
autos["price"] = autos["price"].str.replace("$", "", regex=False)
autos["price"] = autos["price"].str.replace(",", "", regex=False)
autos["price"] = autos["price"].astype(float)
print(autos["price"])
0 5000.0 1 8500.0 2 8990.0 3 4350.0 4 1350.0 ... 49995 24900.0 49996 1980.0 49997 13200.0 49998 22900.0 49999 1250.0 Name: price, Length: 50000, dtype: float64
Now let's clean and transform the odometer column. We have to remove the "," and the "km". However, so we don't lose the unit information, let's also rename the column from odometer to odometer_km.
autos["odometer"] = autos["odometer"].str.replace("km", "", regex=False)
autos["odometer"] = autos["odometer"].str.replace(",", "", regex=False)
autos["odometer"] = autos["odometer"].astype(int)
autos.rename({"odometer": "odometer_km"}, axis = 1, inplace = True)
print(autos["odometer_km"])
0 150000 1 150000 2 70000 3 70000 4 150000 ... 49995 100000 49996 150000 49997 5000 49998 40000 49999 150000 Name: odometer_km, Length: 50000, dtype: int32
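The price and odometer cleaning steps follow the same recipe (strip substrings, cast), so they could be wrapped in a small helper. A sketch on toy values rather than the full frame:

```python
import pandas as pd

def strip_to_number(series, substrings, dtype=float):
    # Remove each literal substring, then cast the result to a numeric dtype
    for s in substrings:
        series = series.str.replace(s, "", regex=False)
    return series.astype(dtype)

prices = pd.Series(["$5,000", "$8,500"])
odometer = pd.Series(["150,000km", "70,000km"])
print(strip_to_number(prices, ["$", ","]))                # 5000.0, 8500.0
print(strip_to_number(odometer, ["km", ","], dtype=int))  # 150000, 70000
```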
Now let's look more closely at the data to see if there are any outliers that should be removed.
### Closer look at odometer values
odometer_unique = autos["odometer_km"].unique().shape[0]
odometer_bottom = autos["odometer_km"].value_counts().sort_index(ascending = True).head()
odometer_top = autos["odometer_km"].value_counts().sort_index(ascending = False).head()
print("There are {} unique values".format(odometer_unique))
print("\n")
print("The 5 lowest values are:\n{}".format(odometer_bottom))
print("\n")
print("The 5 highest values are:\n{}".format(odometer_top))
print("\n")
autos["odometer_km"].describe()
There are 13 unique values The 5 lowest values are: 5000 772 10000 238 20000 739 30000 760 40000 794 Name: odometer_km, dtype: int64 The 5 highest values are: 150000 30079 125000 4856 100000 2052 90000 1670 80000 1374 Name: odometer_km, dtype: int64
count 46634.000000 mean 125654.243685 std 39784.008504 min 5000.000000 25% 100000.000000 50% 150000.000000 75% 150000.000000 max 150000.000000 Name: odometer_km, dtype: float64
The odometer values seem reasonable. Let's check out the price values.
### Closer look at price values
price_unique = autos["price"].unique().shape[0]
price_bottom = autos["price"].value_counts().sort_index(ascending = True).head(10)
price_top = autos["price"].value_counts().sort_index(ascending = False).head(10)
print("There are {} unique values".format(price_unique))
print("\n")
print("The 10 lowest values are:\n{}".format(price_bottom))
print("\n")
print("The 10 highest values are:\n{}".format(price_top))
print("\n")
autos["price"].describe()
There are 2323 unique values The 10 lowest values are: 1.0 148 2.0 1 3.0 1 5.0 2 8.0 1 9.0 1 10.0 6 11.0 2 12.0 3 13.0 2 Name: price, dtype: int64 The 10 highest values are: 350000.0 1 345000.0 1 299000.0 1 295000.0 1 265000.0 1 259000.0 1 250000.0 1 220000.0 1 198000.0 1 197000.0 1 Name: price, dtype: int64
count 46634.000000 mean 5965.964875 std 9143.505580 min 1.000000 25% 1250.000000 50% 3100.000000 75% 7500.000000 max 350000.000000 Name: price, dtype: float64
From the information above, we can see that the price values have many outliers. There are 1,421 cars listed at \$0.00, as well as some implausibly high values such as 12345678. Since the cars at \$0.00 only represent about 2% of the data set, we will remove them.
For this analysis we will only consider cars between \$1.00 and \$350,000.00.
autos = autos[autos["price"].between(1, 350000)]
autos["price"].describe()
count 48565.000000 mean 5888.935591 std 9059.854754 min 1.000000 25% 1200.000000 50% 3000.000000 75% 7490.000000 max 350000.000000 Name: price, dtype: float64
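A detail worth knowing: Series.between is inclusive on both ends by default, which determines exactly which prices survive the filter above. A minimal sketch:

```python
import pandas as pd

s = pd.Series([0, 1, 350000, 500000])
# between() includes both endpoints by default
mask = s.between(1, 350000)
print(mask.tolist())  # [False, True, True, False]
```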
In this data set we have 5 columns that hold date values:

date_crawled - When this ad was first crawled. All field values are taken from this date.
ad_created - The date on which the eBay listing was created.
last_seen - When the crawler last saw this ad online.
registration_year - The year in which the car was first registered.
registration_month - The month in which the car was first registered.

Let's analyze the distribution of all the dates.
autos.iloc[0:5][["date_crawled","last_seen", "ad_created", "registration_month", "registration_year"]]
date_crawled | last_seen | ad_created | registration_month | registration_year | |
---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-04-06 06:45:54 | 2016-03-26 00:00:00 | 3 | 2004 |
1 | 2016-04-04 13:38:56 | 2016-04-06 14:45:08 | 2016-04-04 00:00:00 | 6 | 1997 |
2 | 2016-03-26 18:57:24 | 2016-04-06 20:15:37 | 2016-03-26 00:00:00 | 7 | 2009 |
3 | 2016-03-12 16:58:10 | 2016-03-15 03:16:28 | 2016-03-12 00:00:00 | 6 | 2007 |
4 | 2016-04-01 14:38:50 | 2016-04-01 14:38:50 | 2016-04-01 00:00:00 | 7 | 2003 |
autos.iloc[0:5][["date_crawled","last_seen", "ad_created", "registration_month", "registration_year"]].info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 5 entries, 0 to 4 Data columns (total 5 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date_crawled 5 non-null object 1 last_seen 5 non-null object 2 ad_created 5 non-null object 3 registration_month 5 non-null int64 4 registration_year 5 non-null int64 dtypes: int64(2), object(3) memory usage: 240.0+ bytes
The date_crawled, last_seen, and ad_created columns are all strings, with the date represented in the first 10 characters. Let's analyze the distribution of those three columns by day.
### Distribution of "date_crawled"
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2016-03-05 0.025327 2016-03-06 0.014043 2016-03-07 0.036014 2016-03-08 0.033296 2016-03-09 0.033090 2016-03-10 0.032184 2016-03-11 0.032575 2016-03-12 0.036920 2016-03-13 0.015670 2016-03-14 0.036549 2016-03-15 0.034284 2016-03-16 0.029610 2016-03-17 0.031628 2016-03-18 0.012911 2016-03-19 0.034778 2016-03-20 0.037887 2016-03-21 0.037373 2016-03-22 0.032987 2016-03-23 0.032225 2016-03-24 0.029342 2016-03-25 0.031607 2016-03-26 0.032204 2016-03-27 0.031092 2016-03-28 0.034860 2016-03-29 0.034099 2016-03-30 0.033687 2016-03-31 0.031834 2016-04-01 0.033687 2016-04-02 0.035478 2016-04-03 0.038608 2016-04-04 0.036487 2016-04-05 0.013096 2016-04-06 0.003171 2016-04-07 0.001400 Name: date_crawled, dtype: float64
The date_crawled column contains dates from March 5th to April 7th of 2016. The distribution seems fairly uniform, since there isn't much difference between days.
autos['date_crawled'].str[:10].value_counts(normalize=True, dropna=False).sort_values()
2016-04-07 0.001400 2016-04-06 0.003171 2016-03-18 0.012911 2016-04-05 0.013096 2016-03-06 0.014043 2016-03-13 0.015670 2016-03-05 0.025327 2016-03-24 0.029342 2016-03-16 0.029610 2016-03-27 0.031092 2016-03-25 0.031607 2016-03-17 0.031628 2016-03-31 0.031834 2016-03-10 0.032184 2016-03-26 0.032204 2016-03-23 0.032225 2016-03-11 0.032575 2016-03-22 0.032987 2016-03-09 0.033090 2016-03-08 0.033296 2016-03-30 0.033687 2016-04-01 0.033687 2016-03-29 0.034099 2016-03-15 0.034284 2016-03-19 0.034778 2016-03-28 0.034860 2016-04-02 0.035478 2016-03-07 0.036014 2016-04-04 0.036487 2016-03-14 0.036549 2016-03-12 0.036920 2016-03-21 0.037373 2016-03-20 0.037887 2016-04-03 0.038608 Name: date_crawled, dtype: float64
If we sort by percentage, we can see that the smallest daily share is about 0.14% and the greatest is about 3.9%.
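Slicing the first ten characters works because the timestamps are ISO-formatted strings; an alternative is to parse them with pd.to_datetime and use the .dt accessor. A sketch with a few toy timestamps:

```python
import pandas as pd

crawled = pd.Series(["2016-03-26 17:47:46",
                     "2016-03-26 18:57:24",
                     "2016-04-04 13:38:56"])
# Parse to datetimes, keep only the date part, then compute daily shares
daily = pd.to_datetime(crawled).dt.date.value_counts(normalize=True).sort_index()
print(daily)
```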
### Distribution of "last_seen"
autos['last_seen'].str[:10].value_counts(normalize=True, dropna=False).sort_values()
2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-18 0.007351 2016-03-08 0.007413 2016-03-13 0.008895 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-14 0.012602 2016-03-27 0.015649 2016-03-19 0.015834 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-26 0.016802 2016-03-23 0.018532 2016-03-25 0.019211 2016-03-24 0.019767 2016-03-21 0.020632 2016-03-20 0.020653 2016-03-28 0.020859 2016-03-22 0.021373 2016-03-29 0.022341 2016-04-01 0.022794 2016-03-31 0.023783 2016-03-12 0.023783 2016-04-04 0.024483 2016-03-30 0.024771 2016-04-02 0.024915 2016-04-03 0.025203 2016-03-17 0.028086 2016-04-05 0.124761 2016-04-07 0.131947 2016-04-06 0.221806 Name: last_seen, dtype: float64
As described in the previous sections, the last_seen date represents the last day the ad was seen on the website. From the distribution above, we can see that the bulk of those ads were last seen on April 5th, 6th, and 7th of 2016. From this information alone, we can't determine what caused that spike.
### Distribution of "ad_created"
autos['ad_created'].str[:10].value_counts(normalize=True, dropna=False).sort_index()
2015-06-11 0.000021 2015-08-10 0.000021 2015-09-09 0.000021 2015-11-10 0.000021 2015-12-05 0.000021 ... 2016-04-03 0.038855 2016-04-04 0.036858 2016-04-05 0.011819 2016-04-06 0.003253 2016-04-07 0.001256 Name: ad_created, Length: 76, dtype: float64
The ad_created dates range from June 11th, 2015 to April 7th, 2016.
autos['registration_year'].describe()
count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
As the summary above shows, the data set has some incorrect values for registration_year: the oldest registration year is 1000 and the newest is 9999, both impossible. It would be ideal to remove these rows, but first let's see how much of the data set those values actually represent.
(~autos["registration_year"].between(1960, 2016)).sum()/autos.shape[0]
0.0397611448574076
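The expression above negates the boolean mask from between() with ~ to count the rows outside the window, then divides by the row count. A minimal sketch of the same pattern on toy years:

```python
import pandas as pd

years = pd.Series([1000, 1999, 2016, 9999])
outside = (~years.between(1960, 2016)).sum()  # rows outside the window
print(outside / len(years))  # 0.5
```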
Registration years before 1960 and after 2016 represent only about 4% of the data, so we will remove those rows.
autos = autos[autos["registration_year"].between(1960, 2016)]
autos["registration_year"].describe()
count 46634.000000 mean 2002.969400 std 6.930433 min 1960.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
autos["registration_year"].value_counts(normalize=True)
2000 0.067676 2005 0.062958 1999 0.062122 2004 0.057962 2003 0.057876 2006 0.057254 2001 0.056525 2002 0.053309 1998 0.050671 2007 0.048827 2008 0.047498 2009 0.044710 1997 0.041836 2011 0.034803 2010 0.034074 1996 0.029442 2012 0.028091 1995 0.026311 2016 0.026161 2013 0.017219 2014 0.014217 1994 0.013488 1993 0.009114 2015 0.008406 1992 0.007934 1990 0.007441 1991 0.007269 1989 0.003731 1988 0.002895 1985 0.002037 1980 0.001823 1986 0.001544 1987 0.001544 1984 0.001094 1983 0.001094 1978 0.000944 1982 0.000879 1970 0.000815 1979 0.000729 1972 0.000708 1981 0.000600 1968 0.000558 1967 0.000558 1971 0.000558 1974 0.000515 1960 0.000493 1973 0.000493 1966 0.000472 1977 0.000472 1976 0.000450 1969 0.000407 1975 0.000386 1965 0.000365 1964 0.000257 1963 0.000172 1961 0.000129 1962 0.000086 Name: registration_year, dtype: float64
It appears that the majority of cars were registered after 2000.
Next, let's take a look at the car brands to determine which brand is most common in the listings, what the price range and mileage look like, and whether there might be a pattern between them.
autos["brand"].describe()
count 46634 unique 40 top volkswagen freq 9860 Name: brand, dtype: object
autos["brand"].value_counts(normalize=True)
volkswagen 0.211434 bmw 0.110134 opel 0.107604 mercedes_benz 0.096389 audi 0.086654 ford 0.069842 renault 0.047176 peugeot 0.029871 fiat 0.025668 seat 0.018270 skoda 0.016404 nissan 0.015289 mazda 0.015203 smart 0.014174 citroen 0.014003 toyota 0.012716 hyundai 0.010036 sonstige_autos 0.009457 volvo 0.009156 mini 0.008770 mitsubishi 0.008234 honda 0.007848 kia 0.007076 alfa_romeo 0.006626 porsche 0.006111 suzuki 0.005940 chevrolet 0.005683 chrysler 0.003517 dacia 0.002638 daihatsu 0.002509 jeep 0.002273 subaru 0.002144 land_rover 0.002080 saab 0.001651 jaguar 0.001565 daewoo 0.001501 trabant 0.001372 rover 0.001330 lancia 0.001072 lada 0.000579 Name: brand, dtype: float64
There are a total of 40 unique car brands, with Volkswagen the most frequent at about 21% of the listings. One thing the top brands have in common is that they are German.
Let's continue exploring, but only with the brands that represent at least 5% of the listings. First, let's select those brands.
brand_counts = autos["brand"].value_counts(normalize=True)
common_brands = brand_counts[brand_counts > 0.05].index
print(common_brands)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
Now let's discover the mean prices of the cars with those brands.
brand_prices = {}
for b in common_brands:
brand_only = autos[autos["brand"] == b]
mean_price = brand_only["price"].mean()
brand_prices[b] = int(mean_price)
brand_prices
{'volkswagen': 5398, 'bmw': 8332, 'opel': 2974, 'mercedes_benz': 8565, 'audi': 9336, 'ford': 3713}
According to the mean prices, BMW, Mercedes-Benz, and Audi have the most expensive listings, while Ford and Opel have the cheapest. Volkswagen sits in between.
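As an alternative to the loop, the same means can come from a single groupby. A sketch with made-up prices for two real brands:

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford"],
    "price": [8000.0, 9000.0, 3500.0, 4000.0],
})
# Mean price per brand in one call
print(toy.groupby("brand")["price"].mean())  # bmw 8500.0, ford 3750.0
```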
Now, let's do the same with the mileage information.
Again, only considering the brands that represent at least 5% of the listing.
brand_mileages = {}
for b in common_brands:
brand_only = autos[autos["brand"] == b]
mean_mileage = brand_only["odometer_km"].mean()
brand_mileages[b] = int(mean_mileage)
brand_mileages
{'volkswagen': 128707, 'bmw': 132597, 'opel': 129342, 'mercedes_benz': 130919, 'audi': 129157, 'ford': 124399}
Let's see if there is a pattern between the price and mileage of the most common brands. To do that, we will create a data frame with the following columns: brand (index), mean_price, and mean_mileage.
s_brand_mileages = pd.Series(brand_mileages)  # series of mean mileages
s_brand_prices = pd.Series(brand_prices)  # series of mean prices
brand_prices_mileage = pd.DataFrame(s_brand_prices, columns=['mean_price'])
brand_prices_mileage["mean_mileage"] = s_brand_mileages  # add the mileage column
brand_prices_mileage.sort_values("mean_mileage")
mean_price | mean_mileage | |
---|---|---|
ford | 3713 | 124399 |
volkswagen | 5398 | 128707 |
audi | 9336 | 129157 |
opel | 2974 | 129342 |
mercedes_benz | 8565 | 130919 |
bmw | 8332 | 132597 |
There doesn't seem to be much variation in mean mileage between the cheaper and the more expensive brands.
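The two dictionaries and the manual DataFrame assembly above could also be collapsed into one groupby().agg call. A sketch with toy numbers for two of the common brands:

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "ford", "ford"],
    "price": [8000.0, 9000.0, 3500.0, 4000.0],
    "odometer_km": [150000, 125000, 100000, 150000],
})
# Named aggregation builds both mean columns at once
summary = (toy.groupby("brand")
              .agg(mean_price=("price", "mean"),
                   mean_mileage=("odometer_km", "mean"))
              .sort_values("mean_mileage"))
print(summary)
```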
In this project we cleaned the data set by adjusting column names and types and removing outliers. We also analyzed the dates, prices, and mileage. We found no apparent pattern between the prices and mileages of the most common brands in the data set.
Thanks for reading!