In this project we will work with a dataset of used-car listings from eBay Kleinanzeigen, the classifieds section of the German eBay website. The original dataset has over 370,000 rows and was scraped and uploaded to Kaggle. For simplicity, we will work with a subset of 50,000 rows.
Below is a data dictionary describing what each column represents:
Column | Description
---|---
dateCrawled | When the ad was first crawled
name | Name of the car
seller | Whether the seller is private or a dealer
offerType | The type of listing
price | Listed price to sell the car
abtest | Whether the listing is included in an A/B test
vehicleType | The vehicle type
yearOfRegistration | The year in which the car was first registered
gearbox | Transmission type
powerPS | The power of the car in PS
model | The car model name
kilometer | How many kilometers the car has driven
monthOfRegistration | The month in which the car was first registered
fuelType | The type of fuel the car uses
brand | Brand of the car
notRepairedDamage | Whether the car has damage that has not yet been repaired
dateCreated | The date the listing was created on eBay
nrOfPictures | The number of pictures in the ad
postalCode | The postal code for the location of the vehicle
lastSeen | When the crawler last saw this ad online
The aim of this project is to clean the data to make it easier to work with. We will then analyze different parts of the data to highlight relationships between categories such as brand, price, mileage, damage, and fuel type.
# Importing the libraries we need and reading in the CSV file
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv", encoding = "Windows-1252")
# Getting some basic information about the data set
autos.info()
autos.head()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): dateCrawled 50000 non-null object name 50000 non-null object seller 50000 non-null object offerType 50000 non-null object price 50000 non-null object abtest 50000 non-null object vehicleType 44905 non-null object yearOfRegistration 50000 non-null int64 gearbox 47320 non-null object powerPS 50000 non-null int64 model 47242 non-null object odometer 50000 non-null object monthOfRegistration 50000 non-null int64 fuelType 45518 non-null object brand 50000 non-null object notRepairedDamage 40171 non-null object dateCreated 50000 non-null object nrOfPictures 50000 non-null int64 postalCode 50000 non-null int64 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_TÜV_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
Initial Observations
Looking at the information above, we see that the autos dataframe does not have many null values. Only five columns have missing values, and the column with the most (vehicleType) is only about 10% null. The columns are stored either as integers or as strings. Some columns (e.g. dateCrawled, lastSeen) contain dates stored as strings, and others (e.g. price, odometer) should be numeric but are stored as strings because their values mix numeric and non-numeric characters. The column names also use camelCase rather than the Python-preferred snake_case.
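The per-column share of missing values can be checked directly rather than read off `info()`. A minimal sketch, using a toy frame standing in for `autos` (the real data comes from autos.csv):

```python
import pandas as pd

# Toy stand-in for the autos dataframe
df = pd.DataFrame({
    "vehicleType": ["bus", None, "kombi", None],
    "price": ["$5,000", "$8,500", "$300", "$590"],
})

# Fraction of missing values per column, largest first
null_share = df.isnull().sum() / len(df)
print(null_share.sort_values(ascending=False))
```

On the real data this shows unrepaired_damage and vehicleType near the top, at roughly 20% and 10% respectively.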
autos.columns
Index(['dateCrawled', 'name', 'seller', 'offerType', 'price', 'abtest', 'vehicleType', 'yearOfRegistration', 'gearbox', 'powerPS', 'model', 'odometer', 'monthOfRegistration', 'fuelType', 'brand', 'notRepairedDamage', 'dateCreated', 'nrOfPictures', 'postalCode', 'lastSeen'], dtype='object')
We will clean the column names by converting them from camelCase to snake_case, and by renaming a few of them to be more descriptive (e.g. yearOfRegistration becomes registration_year, notRepairedDamage becomes unrepaired_damage). We could use a loop to make these changes, but since the number of columns is small, we will make them manually.
autos.columns = ['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test',
'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model',
'odometer', 'registration_month', 'fuel_type', 'brand',
'unrepaired_damage', 'ad_created', 'num_of_pics', 'postal_code',
'last_seen']
autos.columns
Index(['date_crawled', 'name', 'seller', 'offer_type', 'price', 'ab_test', 'vehicle_type', 'registration_year', 'gearbox', 'power_ps', 'model', 'odometer', 'registration_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'num_of_pics', 'postal_code', 'last_seen'], dtype='object')
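For reference, the programmatic alternative mentioned above could look like the sketch below. Note that a regex-based converter only handles the case change; it would not produce semantic renames like yearOfRegistration → registration_year, and consecutive capitals (powerPS) come out awkwardly:

```python
import re

def camel_to_snake(name):
    # Insert an underscore before each capital letter, then lowercase.
    # Limitation: runs of capitals like "PS" become "p_s".
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

print(camel_to_snake("yearOfRegistration"))  # year_of_registration
```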
autos.describe(include = "all")
date_crawled | name | seller | offer_type | price | ab_test | vehicle_type | registration_year | gearbox | power_ps | model | odometer | registration_month | fuel_type | brand | unrepaired_damage | ad_created | num_of_pics | postal_code | last_seen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-21 20:37:19 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
Observations

From the summary above, the seller and offer_type columns contain the same value in 49,999 of the 50,000 rows, and num_of_pics is 0 in every row. These columns carry almost no information, so we will drop them.
autos = autos.drop(["seller", "offer_type", "num_of_pics"], axis = 1)
autos.shape #there are now 17 columns remaining from the original 20
(50000, 17)
The price and odometer columns should hold numeric values, but they are currently stored as strings. Let's see what these values look like:
print(autos["price"].head(10))
print("\n")
print(autos["odometer"].head(10))
0 $5,000 1 $8,500 2 $8,990 3 $4,350 4 $1,350 5 $7,900 6 $300 7 $1,990 8 $250 9 $590 Name: price, dtype: object 0 150,000km 1 150,000km 2 70,000km 3 70,000km 4 150,000km 5 150,000km 6 150,000km 7 150,000km 8 150,000km 9 150,000km Name: odometer, dtype: object
After printing a few values in each column, we see that the non-numeric characters in "price" are "$" and ",", and in "odometer" they are "," and "km". We will clean the data by removing these characters, converting the values from strings to integers, and renaming the columns to include their units of measurement.
# Cleaning the "price" column (regex=False ensures "$" is treated as a literal character)
autos["price"] = autos["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(int)
# Cleaning the "odometer" column
autos["odometer"] = autos["odometer"].str.replace(",", "", regex=False).str.replace("km", "", regex=False).astype(int)
#Renaming these columns to reference units of measurement
autos.rename({"price" : "price_dollar", "odometer" : "odometer_km"}, axis = 1, inplace = True)
autos[["price_dollar", "odometer_km"]][:3]
price_dollar | odometer_km | |
---|---|---|
0 | 5000 | 150000 |
1 | 8500 | 150000 |
2 | 8990 | 70000 |
As we look at the first few rows of our manipulated data, we see that our changes have been applied correctly.
autos[["price_dollar", "odometer_km"]].describe()
price_dollar | odometer_km | |
---|---|---|
count | 5.000000e+04 | 50000.000000 |
mean | 9.840044e+03 | 125732.700000 |
std | 4.811044e+05 | 40042.211706 |
min | 0.000000e+00 | 5000.000000 |
25% | 1.100000e+03 | 125000.000000 |
50% | 2.950000e+03 | 150000.000000 |
75% | 7.200000e+03 | 150000.000000 |
max | 1.000000e+08 | 150000.000000 |
Note that neither column contained any NaN values in the initial autos.describe() output above, so converting the strings to numbers did not have to account for missing values.
autos["odometer_km"].value_counts()
150000 32424 125000 5170 100000 2169 90000 1757 80000 1436 70000 1230 60000 1164 50000 1027 5000 967 40000 819 30000 789 20000 784 10000 264 Name: odometer_km, dtype: int64
Observation: All odometer readings are rounded to the nearest thousand. It appears that when these cars were listed, sellers had to choose the odometer reading from a fixed set of ranges rather than entering an exact value.
autos["price_dollar"].value_counts().sort_index().head(30)
0 1421 1 156 2 3 3 1 5 2 8 1 9 1 10 7 11 2 12 3 13 2 14 1 15 2 17 3 18 1 20 4 25 5 29 1 30 7 35 1 40 6 45 4 47 1 49 4 50 49 55 2 59 1 60 9 65 5 66 1 Name: price_dollar, dtype: int64
There are plenty of listings for cars under $100, and 1,400+ listings at $0. We will remove all the $0 listings, since no one would actually sell a car for nothing. We will cautiously keep the other low values, because legitimate listings can start as low as $1.
autos["price_dollar"].value_counts().sort_index(ascending = False).head(30)
99999999 1 27322222 1 12345678 3 11111111 2 10000000 1 3890000 1 1300000 1 1234566 1 999999 2 999990 1 350000 1 345000 1 299000 1 295000 1 265000 1 259000 1 250000 1 220000 1 198000 1 197000 1 194000 1 190000 1 180000 1 175000 1 169999 1 169000 1 163991 1 163500 1 155000 1 151990 1 Name: price_dollar, dtype: int64
Looking at the highest-priced cars, we see a large gap between $350,000 and the next price, $999,990. Although the data contains listings above $1 million, these look implausible, so we will remove all cars priced above this gap (we use $389,000 as the cutoff, which falls inside the gap) because they are outliers.
# Removing rows with price values = $0 or greater than $389,000
autos = autos[autos["price_dollar"].between(1, 389000)]
autos["price_dollar"].describe()
count 48565.000000 mean 5888.935591 std 9059.854754 min 1.000000 25% 1200.000000 50% 3000.000000 75% 7490.000000 max 350000.000000 Name: price_dollar, dtype: float64
Observation: We have removed approximately 1,500 rows and the mean has decreased by about $4,000.
These are the columns that store date/time information:

  * date_crawled (stored as a string)
  * registration_year (stored as an int)
  * registration_month (stored as an int)
  * ad_created (stored as a string)
  * last_seen (stored as a string)

We will look at each of these columns in more depth.
# Analyzing date columns stored as strings
autos[["date_crawled", "ad_created", "last_seen"]][:3]
date_crawled | ad_created | last_seen | |
---|---|---|---|
0 | 2016-03-26 17:47:46 | 2016-03-26 00:00:00 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | 2016-04-04 00:00:00 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | 2016-03-26 00:00:00 | 2016-04-06 20:15:37 |
Looking at the first three rows, each column stores the date in two parts: the calendar date and the time of day. Upon further inspection, we see that the date portion is the first 10 characters of the string. We will analyze each column by counting how many times each date occurs in the data.
autos["date_crawled"].str[:10].value_counts().sort_index()
2016-03-05 1230 2016-03-06 682 2016-03-07 1749 2016-03-08 1617 2016-03-09 1607 2016-03-10 1563 2016-03-11 1582 2016-03-12 1793 2016-03-13 761 2016-03-14 1775 2016-03-15 1665 2016-03-16 1438 2016-03-17 1536 2016-03-18 627 2016-03-19 1689 2016-03-20 1840 2016-03-21 1815 2016-03-22 1602 2016-03-23 1565 2016-03-24 1425 2016-03-25 1535 2016-03-26 1564 2016-03-27 1510 2016-03-28 1693 2016-03-29 1656 2016-03-30 1636 2016-03-31 1546 2016-04-01 1636 2016-04-02 1723 2016-04-03 1875 2016-04-04 1772 2016-04-05 636 2016-04-06 154 2016-04-07 68 Name: date_crawled, dtype: int64
Observation: The website was crawled daily over a 34-day period between March and April 2016. With a few exceptions, the number of listings crawled each day is roughly uniform.
autos["ad_created"].str[:10].value_counts().shape
(76,)
# We see that there are 76 unique entries in ad_created. Let's display all instances:
autos["ad_created"].str[:10].value_counts().sort_index()[:40]
2015-06-11 1 2015-08-10 1 2015-09-09 1 2015-11-10 1 2015-12-05 1 2015-12-30 1 2016-01-03 1 2016-01-07 1 2016-01-10 2 2016-01-13 1 2016-01-14 1 2016-01-16 1 2016-01-22 1 2016-01-27 3 2016-01-29 1 2016-02-01 1 2016-02-02 2 2016-02-05 2 2016-02-07 1 2016-02-08 1 2016-02-09 1 2016-02-11 1 2016-02-12 2 2016-02-14 2 2016-02-16 1 2016-02-17 1 2016-02-18 2 2016-02-19 3 2016-02-20 2 2016-02-21 3 2016-02-22 1 2016-02-23 4 2016-02-24 2 2016-02-25 3 2016-02-26 2 2016-02-27 6 2016-02-28 10 2016-02-29 8 2016-03-01 5 2016-03-02 5 Name: ad_created, dtype: int64
autos["ad_created"].str[:10].value_counts().sort_index()[40:76]
2016-03-03 42 2016-03-04 72 2016-03-05 1112 2016-03-06 744 2016-03-07 1687 2016-03-08 1618 2016-03-09 1610 2016-03-10 1549 2016-03-11 1598 2016-03-12 1785 2016-03-13 826 2016-03-14 1709 2016-03-15 1652 2016-03-16 1463 2016-03-17 1519 2016-03-18 660 2016-03-19 1636 2016-03-20 1843 2016-03-21 1825 2016-03-22 1593 2016-03-23 1557 2016-03-24 1422 2016-03-25 1542 2016-03-26 1567 2016-03-27 1505 2016-03-28 1699 2016-03-29 1653 2016-03-30 1627 2016-03-31 1548 2016-04-01 1636 2016-04-02 1707 2016-04-03 1887 2016-04-04 1790 2016-04-05 574 2016-04-06 158 2016-04-07 61 Name: ad_created, dtype: int64
Observation: Ads were created between June 2015 and April 2016 (roughly a 10-month period). The large majority of ads were created in March and April 2016.
print(autos["last_seen"].str[:10].value_counts().sort_index())
print("\n")
print(autos["last_seen"].str[:10].value_counts(normalize = True, dropna = False).sort_index())
2016-03-05 52 2016-03-06 210 2016-03-07 262 2016-03-08 360 2016-03-09 466 2016-03-10 518 2016-03-11 601 2016-03-12 1155 2016-03-13 432 2016-03-14 612 2016-03-15 771 2016-03-16 799 2016-03-17 1364 2016-03-18 357 2016-03-19 769 2016-03-20 1003 2016-03-21 1002 2016-03-22 1038 2016-03-23 900 2016-03-24 960 2016-03-25 933 2016-03-26 816 2016-03-27 760 2016-03-28 1013 2016-03-29 1085 2016-03-30 1203 2016-03-31 1155 2016-04-01 1107 2016-04-02 1210 2016-04-03 1224 2016-04-04 1189 2016-04-05 6059 2016-04-06 10772 2016-04-07 6408 Name: last_seen, dtype: int64 2016-03-05 0.001071 2016-03-06 0.004324 2016-03-07 0.005395 2016-03-08 0.007413 2016-03-09 0.009595 2016-03-10 0.010666 2016-03-11 0.012375 2016-03-12 0.023783 2016-03-13 0.008895 2016-03-14 0.012602 2016-03-15 0.015876 2016-03-16 0.016452 2016-03-17 0.028086 2016-03-18 0.007351 2016-03-19 0.015834 2016-03-20 0.020653 2016-03-21 0.020632 2016-03-22 0.021373 2016-03-23 0.018532 2016-03-24 0.019767 2016-03-25 0.019211 2016-03-26 0.016802 2016-03-27 0.015649 2016-03-28 0.020859 2016-03-29 0.022341 2016-03-30 0.024771 2016-03-31 0.023783 2016-04-01 0.022794 2016-04-02 0.024915 2016-04-03 0.025203 2016-04-04 0.024483 2016-04-05 0.124761 2016-04-06 0.221806 2016-04-07 0.131947 Name: last_seen, dtype: float64
Observations: Looking at "last_seen", nearly half the ads were last seen on the final three days of the crawl (April 5-7). This does not necessarily mean that most cars were sold on those days; rather, it reflects that the crawl ended then, so any ad still live at that point was "last seen" on one of those dates.
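So far we have compared dates by slicing the first 10 characters of the strings. If we needed real date arithmetic (e.g. how long an ad stayed live), the columns could be parsed with `pd.to_datetime`. A minimal sketch on two sample values in the same format:

```python
import pandas as pd

# Sample values in the same format as date_crawled / last_seen
s = pd.Series(["2016-04-06 06:45:54", "2016-03-26 17:47:46"])
dt = pd.to_datetime(s)

# Once parsed, date components and differences come for free
print(dt.dt.date)
print(dt.max() - dt.min())  # 10 days 12:58:08
```

Applied to the real columns, this would let us compute per-listing lifetimes as `last_seen - ad_created`.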
print(autos["registration_month"].value_counts())
print("\n")
print(autos["registration_year"].describe())
3 5003 0 4480 6 4271 4 4036 5 4031 7 3857 10 3588 12 3374 9 3330 11 3313 1 3219 8 3126 2 2937 Name: registration_month, dtype: int64 count 48565.000000 mean 2004.755421 std 88.643887 min 1000.000000 25% 1999.000000 50% 2004.000000 75% 2008.000000 max 9999.000000 Name: registration_year, dtype: float64
Observations:
For "registration_month", we expect 12 unique values but instead see 13 (month 0 is the 13th). Although there are 4,480 rows with month 0, we will not remove them, because doing so would discard almost 10% of the rows. We will simply note this anomaly.
For "registration_year", the minimum value is 1000 and the maximum is 9999. Cars did not exist before the late 19th century, so registration years before 1900 do not make sense. Additionally, the data was crawled in 2016, so registration years after 2016 do not make sense either.
We need to determine how many rows contain cars registered outside this range. For simplicity's sake, we will use 1900 as the "base year" for registration, because cars were beginning to be mass-produced around that time. To dig further into the data, we could look at when each individual model was first produced and compare those years against the registration year to catch additional anomalies.
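That per-model check could be sketched as follows. The production years below are purely illustrative placeholders, not verified history, and the lookup table is my own invention for the example:

```python
import pandas as pd

# Hypothetical first-production years for a few models
# (illustrative values only -- not verified against real production history)
first_produced = {"golf": 1974, "fortwo": 1998, "7er": 1977}

df = pd.DataFrame({
    "model": ["golf", "fortwo", "golf"],
    "registration_year": [1970, 2007, 2009],
})

# Flag rows registered before the model could have existed
df["too_early"] = df["registration_year"] < df["model"].map(first_produced)
print(df)
```

With a real production-year table, the same `map` and comparison would flag implausible registrations model by model.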
Let's first count the how many rows fall outside our scope for "registration_year":
# Counting the rows where the registration year is before 1900 or after 2016
(~autos["registration_year"].between(1900, 2016)).sum()
1884
There are 1,884 rows outside the range we want, which is (1884/48565), or about 4%, of the rows. Since this is a small percentage, we will remove these rows.
autos = autos[autos["registration_year"].between(1900, 2016)]
#Checking to see if rows have been removed
autos["registration_year"].describe()
count 46681.000000 mean 2002.910756 std 7.185103 min 1910.000000 25% 1999.000000 50% 2003.000000 75% 2008.000000 max 2016.000000 Name: registration_year, dtype: float64
autos["registration_year"].value_counts().sort_index(ascending = False)
2016 1220 2015 392 2014 663 2013 803 2012 1310 2011 1623 2010 1589 2009 2085 2008 2215 2007 2277 2006 2670 2005 2936 2004 2703 2003 2699 2002 2486 2001 2636 2000 3156 1999 2897 1998 2363 1997 1951 1996 1373 1995 1227 1994 629 1993 425 1992 370 1991 339 1990 347 1989 174 1988 135 1987 72 ... 1968 26 1967 26 1966 22 1965 17 1964 12 1963 8 1962 4 1961 6 1960 23 1959 6 1958 4 1957 2 1956 4 1955 2 1954 2 1953 1 1952 1 1951 2 1950 3 1948 1 1943 1 1941 2 1939 1 1938 1 1937 4 1934 2 1931 1 1929 1 1927 1 1910 5 Name: registration_year, Length: 78, dtype: int64
Observation: Most of the cars were registered between 1990 and 2016.
autos["brand"].describe()
count 46681 unique 40 top volkswagen freq 9862 Name: brand, dtype: object
autos["brand"].value_counts(normalize = True)
volkswagen 0.211264 bmw 0.110045 opel 0.107581 mercedes_benz 0.096463 audi 0.086566 ford 0.069900 renault 0.047150 peugeot 0.029841 fiat 0.025642 seat 0.018273 skoda 0.016409 nissan 0.015274 mazda 0.015188 smart 0.014160 citroen 0.014010 toyota 0.012703 hyundai 0.010025 sonstige_autos 0.009811 volvo 0.009147 mini 0.008762 mitsubishi 0.008226 honda 0.007840 kia 0.007069 alfa_romeo 0.006641 porsche 0.006127 suzuki 0.005934 chevrolet 0.005698 chrysler 0.003513 dacia 0.002635 daihatsu 0.002506 jeep 0.002271 subaru 0.002142 land_rover 0.002099 saab 0.001649 jaguar 0.001564 daewoo 0.001500 trabant 0.001392 rover 0.001328 lancia 0.001071 lada 0.000578 Name: brand, dtype: float64
There are 40 brands in our data, but 31 of them constitute less than 2% of the total values. We will concentrate our analysis on the remaining 9 brands. From observation, Volkswagen has the highest percentage of listings (21%) and the top 4 brands (Volkswagen, BMW, Opel, Mercedes_Benz) represent more than 50% of the listings.
#Selecting brands that comprise more than 2% of total listings
brand_percentages = autos["brand"].value_counts(normalize = True)
top_two_percent = brand_percentages[brand_percentages > 0.02].index
print(top_two_percent)
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford', 'renault', 'peugeot', 'fiat'], dtype='object')
# Assigning brands and the mean prices of those brands to a dictionary
brand_means = {}
for element in top_two_percent:
brand = autos[autos["brand"] == element]
mean = brand["price_dollar"].mean()
brand_means[element] = int(mean)
print(brand_means)
{'peugeot': 3094, 'opel': 2975, 'fiat': 2813, 'bmw': 8332, 'audi': 9336, 'volkswagen': 5402, 'mercedes_benz': 8628, 'ford': 3749, 'renault': 2474}
Looking at our dictionary of mean prices, we can make a few observations: Audi, Mercedes-Benz, and BMW are clearly the most expensive brands (mean prices between roughly $8,300 and $9,300); Volkswagen sits in the middle at about $5,400; and Ford, Peugeot, Opel, Fiat, and Renault are the cheapest, at roughly $2,500 to $3,700.
# Using a similar process to collect the mean odometer readings for the same brands
brand_odometer_means = {}
for element in top_two_percent:
brand = autos[autos["brand"] == element]
mean = brand["odometer_km"].mean()
brand_odometer_means[element] = int(mean)
print(brand_odometer_means)
{'peugeot': 127153, 'opel': 129310, 'fiat': 117121, 'bmw': 132572, 'audi': 129157, 'volkswagen': 128707, 'mercedes_benz': 130788, 'ford': 124266, 'renault': 128071}
# We will convert the two dictionaries we just created into Series and combine them into one dataframe
brand_mean_series = pd.Series(brand_means).sort_values(ascending = False)
brand_odom_series = pd.Series(brand_odometer_means)
brand_odom_df = pd.DataFrame(brand_mean_series, columns = ["mean_price ($)"])
brand_odom_df["mean_distance (km)"] = brand_odom_series
brand_odom_df
mean_price ($) | mean_distance (km) | |
---|---|---|
audi | 9336 | 129157 |
mercedes_benz | 8628 | 130788 |
bmw | 8332 | 132572 |
volkswagen | 5402 | 128707 |
ford | 3749 | 124266 |
peugeot | 3094 | 127153 |
opel | 2975 | 129310 |
fiat | 2813 | 117121 |
renault | 2474 | 128071 |
Observation: Looking at our dataframe, a more expensive brand does not necessarily imply a larger mean lifetime distance traveled. For example, Opel has a slightly higher mean distance than Audi but costs about $6,000 less on average. There is no strong correlation between these two quantities.
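The two dictionary-building loops above can also be expressed with a single `groupby`, and the weak price/mileage relationship can be quantified with a correlation coefficient. A sketch on toy stand-in data (the real analysis would use the cleaned `autos` frame):

```python
import pandas as pd

# Toy stand-in for the cleaned autos frame
df = pd.DataFrame({
    "brand": ["audi", "audi", "opel", "opel", "fiat"],
    "price_dollar": [9000, 9600, 3000, 2900, 2800],
    "odometer_km": [130000, 128000, 129000, 130000, 117000],
})

# One groupby replaces the two dictionary-building loops
summary = df.groupby("brand")[["price_dollar", "odometer_km"]].mean()
print(summary)

# Pearson correlation between brand-level mean price and mean mileage
print(summary["price_dollar"].corr(summary["odometer_km"]))
```

On the real nine-brand table, a correlation near zero would confirm the "no strong correlation" observation numerically.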
Next, we will explore how unrepaired damage affects the listed price.
autos["unrepaired_damage"].value_counts(dropna = False)
nein 33834 NaN 8307 ja 4540 Name: unrepaired_damage, dtype: int64
We see that a significant number of rows have NaN values; we will ignore these rows in our calculations. Also, the values in this column are in German, so we will translate them to English to avoid any confusion (nein = no, ja = yes).
# Translating from German to English in "unrepaired_damage" column
autos["unrepaired_damage"] = autos["unrepaired_damage"].str.replace("nein", "no").str.replace("ja", "yes")
autos["unrepaired_damage"].value_counts(dropna = False)
no 33834 NaN 8307 yes 4540 Name: unrepaired_damage, dtype: int64
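The chained `str.replace` calls work here, but `Series.map` with a dictionary expresses the same translation more directly; note that any value not in the dictionary (including NaN) comes out as NaN. A quick sketch:

```python
import pandas as pd

s = pd.Series(["nein", "ja", None, "nein"])

# map with a dict translates known values; missing/unknown values become NaN
translated = s.map({"nein": "no", "ja": "yes"})
print(translated.value_counts(dropna=False))
```

For a column like this with only two known values plus NaN, the two approaches give the same result.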
# Aggregating dictionary with mean price of unrepaired vehicles for each brand we studied before
# We will use the same brands from "top_two_percent"
unrepaired_mean_dict = {}
for element in top_two_percent:
unrepaired_element = autos[(autos["brand"] == element) & (autos["unrepaired_damage"] == "yes")]
unrepaired_mean_price = unrepaired_element["price_dollar"].mean()
unrepaired_mean_dict[element] = int(unrepaired_mean_price)
print(unrepaired_mean_dict)
{'peugeot': 1372, 'opel': 1367, 'fiat': 1146, 'bmw': 3512, 'audi': 3324, 'volkswagen': 2179, 'mercedes_benz': 3921, 'ford': 1375, 'renault': 1145}
# Aggregating a corresponding dictionary for repaired vehicles
repaired_mean_dict = {}
for element in top_two_percent:
repaired_element = autos[(autos["brand"] == element) & (autos["unrepaired_damage"] == "no")]
repaired_mean_price = repaired_element["price_dollar"].mean()
repaired_mean_dict[element] = int(repaired_mean_price)
print(repaired_mean_dict)
{'peugeot': 3674, 'opel': 3660, 'fiat': 3446, 'bmw': 9437, 'audi': 10914, 'volkswagen': 6469, 'mercedes_benz': 9798, 'ford': 4660, 'renault': 3103}
#Creating a dataframe putting these dictionaries together
repaired_mean_series = pd.Series(repaired_mean_dict).sort_values(ascending = False)
unrepaired_mean_series = pd.Series(unrepaired_mean_dict)
damaged_df = pd.DataFrame(repaired_mean_series, columns = ["repaired_mean_price ($)"])
damaged_df["unrepaired_mean_price ($)"] = unrepaired_mean_series
damaged_df["difference ($)"] = damaged_df["repaired_mean_price ($)"] - damaged_df["unrepaired_mean_price ($)"]
damaged_df
repaired_mean_price ($) | unrepaired_mean_price ($) | difference ($) | |
---|---|---|---|
audi | 10914 | 3324 | 7590 |
mercedes_benz | 9798 | 3921 | 5877 |
bmw | 9437 | 3512 | 5925 |
volkswagen | 6469 | 2179 | 4290 |
ford | 4660 | 1375 | 3285 |
peugeot | 3674 | 1372 | 2302 |
opel | 3660 | 1367 | 2293 |
fiat | 3446 | 1146 | 2300 |
renault | 3103 | 1145 | 1958 |
Observation: Cars without unrepaired damage sell for considerably more than cars with unrepaired damage, and the gap is largest for the more expensive brands (Audi, Mercedes-Benz, BMW). From a seller's standpoint, it would often make sense to repair a car before listing it on this site, keeping the cost of the repair in mind to make sure that route is actually more profitable.
We will use a similar method to the ones above to explore the relationship between fuel type, price, and mileage.
autos["fuel_type"].value_counts(dropna = False)
benzin 28540 diesel 14032 NaN 3318 lpg 649 cng 71 hybrid 37 elektro 19 andere 15 Name: fuel_type, dtype: int64
For our analysis, we will only look at the benzin (i.e. regular gasoline) and diesel fuel types, because together they make up the large majority of our data.
benzin_price_dict = {}
diesel_price_dict = {}
benzin_odom_dict = {}
diesel_odom_dict = {}
for element in top_two_percent:
benzin = autos[(autos["brand"] == element) & (autos["fuel_type"] == "benzin")]
diesel = autos[(autos["brand"] == element) & (autos["fuel_type"] == "diesel")]
benzin_mean_price = benzin["price_dollar"].mean()
benzin_mean_odom = benzin["odometer_km"].mean()
diesel_mean_price = diesel["price_dollar"].mean()
diesel_mean_odom = diesel["odometer_km"].mean()
benzin_price_dict[element] = int(benzin_mean_price)
benzin_odom_dict[element] = int(benzin_mean_odom)
diesel_price_dict[element] = int(diesel_mean_price)
diesel_odom_dict[element] = int(diesel_mean_odom)
diesel_price_series = pd.Series(diesel_price_dict).sort_values(ascending = False)
benzin_price_series = pd.Series(benzin_price_dict)
diesel_odom_series = pd.Series(diesel_odom_dict)
benzin_odom_series = pd.Series(benzin_odom_dict)
fuel_df = pd.DataFrame(diesel_price_series, columns = ["mean_diesel_price ($)"])
fuel_df["mean_benzin_price ($)"] = benzin_price_series
fuel_df["price_difference ($)"] = fuel_df["mean_diesel_price ($)"] - fuel_df["mean_benzin_price ($)"]
fuel_df["mean_diesel_odom (km)"] = diesel_odom_series
fuel_df["mean_benzin_odom (km)"] = benzin_odom_series
fuel_df["odom_difference (km)"] = fuel_df["mean_diesel_odom (km)"] - fuel_df["mean_benzin_odom (km)"]
fuel_df
mean_diesel_price ($) | mean_benzin_price ($) | price_difference ($) | mean_diesel_odom (km) | mean_benzin_odom (km) | odom_difference (km) | |
---|---|---|---|---|---|---|
bmw | 11801 | 6509 | 5292 | 131284 | 133330 | -2046 |
audi | 11315 | 8150 | 3165 | 131400 | 125989 | 5411 |
mercedes_benz | 9672 | 8226 | 1446 | 136019 | 126316 | 9703 |
volkswagen | 8090 | 4087 | 4003 | 131257 | 126463 | 4794 |
ford | 5109 | 3393 | 1716 | 133942 | 119572 | 14370 |
opel | 5010 | 2605 | 2405 | 134705 | 127573 | 7132 |
peugeot | 4480 | 2870 | 1610 | 133741 | 124314 | 9427 |
renault | 4073 | 2166 | 1907 | 135926 | 126083 | 9843 |
fiat | 4048 | 2687 | 1361 | 133216 | 112883 | 20333 |
Observations: In terms of average price, diesel vehicles are more expensive than their regular-gasoline counterparts for every brand. At the same time, for every brand except BMW, diesel vehicles also show higher average odometer readings. For some brands these differences are striking: diesel Fiats cost about $1,400 more than gasoline Fiats but have driven about 20,000 km farther on average, and Fords show a similar pattern. A buyer might therefore favor the diesel versions of these brands for the extra distance they tend to cover. BMW is the exception: its diesels cost more on average yet show lower average mileage, so a gasoline BMW may be the better value.
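The four dictionary-building loops above can be collapsed into one `pivot_table` call, which computes every brand-by-fuel-type mean at once. A sketch on toy stand-in rows:

```python
import pandas as pd

# Toy stand-in rows for the cleaned data
df = pd.DataFrame({
    "brand": ["bmw", "bmw", "fiat", "fiat"],
    "fuel_type": ["diesel", "benzin", "diesel", "benzin"],
    "price_dollar": [11800, 6500, 4000, 2700],
    "odometer_km": [131000, 133000, 133000, 113000],
})

# One pivot_table replaces the four dictionary-building loops:
# rows are brands, columns are (measure, fuel_type) pairs
pivot = df.pivot_table(index="brand", columns="fuel_type",
                       values=["price_dollar", "odometer_km"],
                       aggfunc="mean")
print(pivot)
```

The price and odometer differences could then be added as extra columns, just as in the manual version.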
autos[autos["brand"] == "volkswagen"]["model"].value_counts()
golf 3707 polo 1609 passat 1349 transporter 674 touran 433 lupo 322 sharan 222 caddy 204 beetle 123 tiguan 118 bora 100 andere 96 touareg 94 scirocco 85 fox 82 eos 66 kaefer 57 up 51 jetta 38 phaeton 31 cc 18 amarok 6 Name: model, dtype: int64
Next, we will compute the mean price and odometer reading for each Volkswagen model to see what conclusions we can draw. We will only use models with at least 100 listings, so that each model has a reasonably large sample size.
volkswagen_df = autos[autos["brand"] == "volkswagen"]
volkswagen_models = volkswagen_df["model"].value_counts()
volkswagen_100_plus = volkswagen_models[volkswagen_models >= 100].index
volks_mean_price_dict = {}
volks_mean_odom_dict = {}
for element in volkswagen_100_plus:
brand = volkswagen_df[volkswagen_df["model"] == element]
mean_price = brand["price_dollar"].mean()
mean_odom = brand["odometer_km"].mean()
volks_mean_price_dict[element] = int(mean_price)
volks_mean_odom_dict[element] = int(mean_odom)
volks_mean_price_series = pd.Series(volks_mean_price_dict).sort_values(ascending = False)
volks_mean_odom_series = pd.Series(volks_mean_odom_dict)
volkswagen_models_df = pd.DataFrame(volks_mean_price_series, columns = ["mean_price ($)"])
volkswagen_models_df["mean_distance (km)"] = volks_mean_odom_series
volkswagen_models_df["km_per_dollar"] = volkswagen_models_df["mean_distance (km)"] / volkswagen_models_df["mean_price ($)"]
volkswagen_models_df
mean_price ($) | mean_distance (km) | km_per_dollar | |
---|---|---|---|
tiguan | 17809 | 88644 | 4.977483 |
transporter | 10374 | 135445 | 13.056198 |
touran | 7914 | 129284 | 16.336113 |
caddy | 7021 | 112622 | 16.040735 |
beetle | 6951 | 114146 | 16.421522 |
sharan | 5783 | 133040 | 23.005361 |
golf | 5081 | 128048 | 25.201338 |
passat | 4948 | 139725 | 28.238682 |
polo | 2593 | 127445 | 49.149634 |
bora | 2292 | 144700 | 63.132635 |
lupo | 1401 | 138214 | 98.653819 |
Observations: The cheapest Volkswagen models (Lupo, Bora, and Polo) have km_per_dollar values much greater than the other models, and the ratio falls steadily as models get more expensive. If you are looking for a Volkswagen that is cost-effective and has plenty of life left in it, these models are worth considering.
You can perform a similar analysis with the other brands to see if a similar trend exists.
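To repeat the analysis for another brand, the steps above can be wrapped in a small helper (the function name and signature here are my own, not from the original notebook):

```python
import pandas as pd

def model_summary(autos, brand, min_listings=100):
    """Mean price, mean mileage, and km-per-dollar for each model of
    `brand` with at least `min_listings` rows, sorted by price."""
    rows = autos[autos["brand"] == brand]
    counts = rows["model"].value_counts()
    keep = counts[counts >= min_listings].index
    summary = (rows[rows["model"].isin(keep)]
               .groupby("model")[["price_dollar", "odometer_km"]].mean())
    summary["km_per_dollar"] = summary["odometer_km"] / summary["price_dollar"]
    return summary.sort_values("price_dollar", ascending=False)

# Toy usage with a tiny stand-in frame
toy = pd.DataFrame({
    "brand": ["bmw"] * 4,
    "model": ["3er", "3er", "5er", "5er"],
    "price_dollar": [6000, 8000, 10000, 12000],
    "odometer_km": [140000, 120000, 130000, 110000],
})
print(model_summary(toy, "bmw", min_listings=2))
```

On the real data, `model_summary(autos, "bmw")` would produce the BMW counterpart of the Volkswagen table above.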