eBay Motors is one of the world's largest marketplaces for trading used cars. The dataset of eBay used cars is analysed in this project. The dataset includes used cars advertisements listed in Germany between 11th June 2015 and 7th April 2016. This dataset is taken from from eBay Kleinanzeigen, a classifieds section of the German eBay website.
The original dataset can be found at https://data.world/data-society/used-cars-data. For the purpose of this project, the original dataset was reduced to 50,000 listings.The reduced dataset used for this project is saved as "autos.csv".
In the data cleaning process below, the listings are further reduced to 44,870. Among others, rows with vehicles registerd before 1986 (vehicles categorised as antique in Germany) are removed.
The top 6 brand-model combination have, among others, 2 things in common. More than 2/3 of these listings have their 'gearbox' listed as 'manual' and majority of these manually driven vahicles were driven for 150,000km.
Below, the data is analysed for vehicles registered between 1986 and 2016. The findings are produced in Part E of this project (below).
--> 5. Analyse the data to:
5.1. Calculate the distribution based on the column: 'reg_year'.
5.2. Calculate the distribution based on the columns: 'date_crawled', 'ad_created' and 'last_seen'.
5.3 Select brand and aggregating mean price.
5.4. Calculate the mean mileage and mean price for each of the top brands.
5.5. Find the most common brand/model combinations
5.6. Find out if the average prices follows any patterns based on the mileage.
5.7. Find out how much cheaper are cars with damage than their non-damaged counterparts.
---> 1. How the original dataset is organised
1.1. Observation
---> 2. Rows & Columns
2.1. Review of unique values returned as NaN.
2.2. Review of columns with only 2 unique values.
2.3. Convert datetype of 'price' and 'odometer' columns from object to integer.
2.4. Translate non-English word to English words.
2.5. Chang the use camelcase to snakecase in the names of columns and reorganising the columns.
---> 3. Quick Review of the organised dataset.
---> 4. Data entries
4.1. Remove data for antique vehicles from the columns: 'reg_year'.
4.2. Remove inaccurate entries in the column: 'reg_year'.
4.3. Review data entry for columns: 'reg_month'.
4.4. Check for outliers in the column: 'adometer_km'.
4.5. Check for outliers in the column: 'price_$'.
import pandas as pd
import numpy as np
autos = pd.read_csv("autos.csv")
autos
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2016-03-26 17:47:46 | Peugeot_807_160_NAVTECH_ON_BOARD | privat | Angebot | $5,000 | control | bus | 2004 | manuell | 158 | andere | 150,000km | 3 | lpg | peugeot | nein | 2016-03-26 00:00:00 | 0 | 79588 | 2016-04-06 06:45:54 |
1 | 2016-04-04 13:38:56 | BMW_740i_4_4_Liter_HAMANN_UMBAU_Mega_Optik | privat | Angebot | $8,500 | control | limousine | 1997 | automatik | 286 | 7er | 150,000km | 6 | benzin | bmw | nein | 2016-04-04 00:00:00 | 0 | 71034 | 2016-04-06 14:45:08 |
2 | 2016-03-26 18:57:24 | Volkswagen_Golf_1.6_United | privat | Angebot | $8,990 | test | limousine | 2009 | manuell | 102 | golf | 70,000km | 7 | benzin | volkswagen | nein | 2016-03-26 00:00:00 | 0 | 35394 | 2016-04-06 20:15:37 |
3 | 2016-03-12 16:58:10 | Smart_smart_fortwo_coupe_softouch/F1/Klima/Pan... | privat | Angebot | $4,350 | control | kleinwagen | 2007 | automatik | 71 | fortwo | 70,000km | 6 | benzin | smart | nein | 2016-03-12 00:00:00 | 0 | 33729 | 2016-03-15 03:16:28 |
4 | 2016-04-01 14:38:50 | Ford_Focus_1_6_Benzin_T�V_neu_ist_sehr_gepfleg... | privat | Angebot | $1,350 | test | kombi | 2003 | manuell | 0 | focus | 150,000km | 7 | benzin | ford | nein | 2016-04-01 00:00:00 | 0 | 39218 | 2016-04-01 14:38:50 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
49995 | 2016-03-27 14:38:19 | Audi_Q5_3.0_TDI_qu._S_tr.__Navi__Panorama__Xenon | privat | Angebot | $24,900 | control | limousine | 2011 | automatik | 239 | q5 | 100,000km | 1 | diesel | audi | nein | 2016-03-27 00:00:00 | 0 | 82131 | 2016-04-01 13:47:40 |
49996 | 2016-03-28 10:50:25 | Opel_Astra_F_Cabrio_Bertone_Edition___T�V_neu+... | privat | Angebot | $1,980 | control | cabrio | 1996 | manuell | 75 | astra | 150,000km | 5 | benzin | opel | nein | 2016-03-28 00:00:00 | 0 | 44807 | 2016-04-02 14:18:02 |
49997 | 2016-04-02 14:44:48 | Fiat_500_C_1.2_Dualogic_Lounge | privat | Angebot | $13,200 | test | cabrio | 2014 | automatik | 69 | 500 | 5,000km | 11 | benzin | fiat | nein | 2016-04-02 00:00:00 | 0 | 73430 | 2016-04-04 11:47:27 |
49998 | 2016-03-08 19:25:42 | Audi_A3_2.0_TDI_Sportback_Ambition | privat | Angebot | $22,900 | control | kombi | 2013 | manuell | 150 | a3 | 40,000km | 11 | diesel | audi | nein | 2016-03-08 00:00:00 | 0 | 35683 | 2016-04-05 16:45:07 |
49999 | 2016-03-14 00:42:12 | Opel_Vectra_1.6_16V | privat | Angebot | $1,250 | control | limousine | 1996 | manuell | 101 | vectra | 150,000km | 1 | benzin | opel | nein | 2016-03-13 00:00:00 | 0 | 45897 | 2016-04-06 21:18:48 |
50000 rows × 20 columns
autos.describe(include="all")
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | nrOfPictures | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 50000 | 50000 | 50000 | 50000 | 50000 | 50000 | 44905 | 50000.000000 | 47320 | 50000.000000 | 47242 | 50000 | 50000.000000 | 45518 | 50000 | 40171 | 50000 | 50000.0 | 50000.000000 | 50000 |
unique | 48213 | 38754 | 2 | 2 | 2357 | 2 | 8 | NaN | 2 | NaN | 245 | 13 | NaN | 7 | 40 | 2 | 76 | NaN | NaN | 39481 |
top | 2016-03-27 22:55:05 | Ford_Fiesta | privat | Angebot | $0 | test | limousine | NaN | manuell | NaN | golf | 150,000km | NaN | benzin | volkswagen | nein | 2016-04-03 00:00:00 | NaN | NaN | 2016-04-07 06:17:27 |
freq | 3 | 78 | 49999 | 49999 | 1421 | 25756 | 12859 | NaN | 36993 | NaN | 4024 | 32424 | NaN | 30107 | 10687 | 35232 | 1946 | NaN | NaN | 8 |
mean | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2005.073280 | NaN | 116.355920 | NaN | NaN | 5.723360 | NaN | NaN | NaN | NaN | 0.0 | 50813.627300 | NaN |
std | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 105.712813 | NaN | 209.216627 | NaN | NaN | 3.711984 | NaN | NaN | NaN | NaN | 0.0 | 25779.747957 | NaN |
min | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.000000 | NaN | 0.000000 | NaN | NaN | 0.000000 | NaN | NaN | NaN | NaN | 0.0 | 1067.000000 | NaN |
25% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1999.000000 | NaN | 70.000000 | NaN | NaN | 3.000000 | NaN | NaN | NaN | NaN | 0.0 | 30451.000000 | NaN |
50% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2003.000000 | NaN | 105.000000 | NaN | NaN | 6.000000 | NaN | NaN | NaN | NaN | 0.0 | 49577.000000 | NaN |
75% | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2008.000000 | NaN | 150.000000 | NaN | NaN | 9.000000 | NaN | NaN | NaN | NaN | 0.0 | 71540.000000 | NaN |
max | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9999.000000 | NaN | 17700.000000 | NaN | NaN | 12.000000 | NaN | NaN | NaN | NaN | 0.0 | 99998.000000 | NaN |
autos.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 50000 entries, 0 to 49999 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 dateCrawled 50000 non-null object 1 name 50000 non-null object 2 seller 50000 non-null object 3 offerType 50000 non-null object 4 price 50000 non-null object 5 abtest 50000 non-null object 6 vehicleType 44905 non-null object 7 yearOfRegistration 50000 non-null int64 8 gearbox 47320 non-null object 9 powerPS 50000 non-null int64 10 model 47242 non-null object 11 odometer 50000 non-null object 12 monthOfRegistration 50000 non-null int64 13 fuelType 45518 non-null object 14 brand 50000 non-null object 15 notRepairedDamage 40171 non-null object 16 dateCreated 50000 non-null object 17 nrOfPictures 50000 non-null int64 18 postalCode 50000 non-null int64 19 lastSeen 50000 non-null object dtypes: int64(5), object(15) memory usage: 7.6+ MB
There are 5 values returned as 'NaN' for number of unique values. They are 'yearOfRegistration', 'PowerPS', 'monthOfRegistration', 'nrOfPictures and 'postalCode'.
A further analysis of these rows is required to find out the number of entries and whether they would add any value to the purpose of this project.
There are 5 columns with only 2 unique values. They are 'seller', 'offerType', 'abtest', 'gearbox', and 'notRepairedDamages'.
A further analysis of these rows is required to find out the number of entries attributed to these 2 unique values.
The 'price' and 'odometer' columns' datatypes are 'object'. These can be changed to integar type.
Columns 'fuelType', 'gearbox', 'NotRepairedDamage' includes words that are not in English language. These words can be translated to English.
The columns' names use camelcase instead of snakecase. Python code conventions require that snakecase be used. This would make the code more readable to others using the same programming language and would avoid confusions (for example, with classes which, as a code convention, uses capitalised first alphabets).
This can be changed to Python's preferred style - snakecase. The columns can also be rearranged.
autos["yearOfRegistration"].value_counts()
2000 3354 2005 3015 1999 3000 2004 2737 2003 2727 ... 4500 1 1931 1 1939 1 5911 1 1952 1 Name: yearOfRegistration, Length: 97, dtype: int64
autos["powerPS"].value_counts()
0 5500 75 3171 60 2195 150 2046 140 1884 ... 455 1 1016 1 952 1 696 1 16312 1 Name: powerPS, Length: 448, dtype: int64
autos["monthOfRegistration"].value_counts()
0 5075 3 5071 6 4368 5 4107 4 4102 7 3949 10 3651 12 3447 9 3389 11 3360 1 3282 8 3191 2 3008 Name: monthOfRegistration, dtype: int64
autos["nrOfPictures"].value_counts()
0 50000 Name: nrOfPictures, dtype: int64
autos["postalCode"].value_counts()
10115 109 65428 104 66333 54 45888 50 44145 48 ... 71576 1 76776 1 76872 1 91233 1 67585 1 Name: postalCode, Length: 7014, dtype: int64
autos.drop(columns=["nrOfPictures"], inplace=True)
autos["seller"].value_counts()
privat 49999 gewerblich 1 Name: seller, dtype: int64
autos["offerType"].value_counts()
Angebot 49999 Gesuch 1 Name: offerType, dtype: int64
autos["abtest"].value_counts()
test 25756 control 24244 Name: abtest, dtype: int64
autos["gearbox"].value_counts()
manuell 36993 automatik 10327 Name: gearbox, dtype: int64
autos["notRepairedDamage"].value_counts()
nein 35232 ja 4939 Name: notRepairedDamage, dtype: int64
autos[autos["seller"] == "gewerblich"]
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7738 | 2016-03-15 18:06:22 | Verkaufe_mehrere_Fahrzeuge_zum_Verschrotten | gewerblich | Angebot | $100 | control | kombi | 2000 | manuell | 0 | megane | 150,000km | 8 | benzin | renault | NaN | 2016-03-15 00:00:00 | 65232 | 2016-04-06 17:15:37 |
autos[autos["offerType"] == "Gesuch"]
dateCrawled | name | seller | offerType | price | abtest | vehicleType | yearOfRegistration | gearbox | powerPS | model | odometer | monthOfRegistration | fuelType | brand | notRepairedDamage | dateCreated | postalCode | lastSeen | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17541 | 2016-04-03 15:48:33 | Suche_VW_T5_Multivan | privat | Gesuch | $0 | test | bus | 2005 | NaN | 0 | transporter | 150,000km | 0 | NaN | volkswagen | NaN | 2016-04-03 00:00:00 | 29690 | 2016-04-05 15:16:06 |
autos.drop(index=[7738, 17541], inplace=True)
print(autos["seller"].unique())
print(autos["offerType"].unique())
['privat'] ['Angebot']
autos.drop(columns=["seller"], inplace=True)
autos.drop(columns=["offerType"], inplace=True)
autos["price"].head(3)
0 $5,000 1 $8,500 2 $8,990 Name: price, dtype: object
autos["odometer"].head(3)
0 150,000km 1 150,000km 2 70,000km Name: odometer, dtype: object
### a function (clean_col) is created to remove non-numeric values ###
def clean_col(col):
col = col.str.replace(",", "", regex=True)
col = col.str.replace("$", "", regex=True)
col = col.str.replace("km", "", regex=True)
return col
### clean_col function is called to remove non-numeric values from 'price' and 'odometer' columns ###
autos["price"] = clean_col(autos["price"])
autos["odometer"] = clean_col(autos["odometer"])
### the 'price' and 'odometer columns are converted to integer type'
autos["price"] = autos["price"].astype(int)
autos["odometer"] = autos["odometer"].astype(int)
autos[["price", "odometer"]].head(3)
price | odometer | |
---|---|---|
0 | 5000 | 150000 |
1 | 8500 | 150000 |
2 | 8990 | 70000 |
### extracting unique values from the 3 columns which has data entries not in English language ###
print("Column - 'fuelType':", autos["fuelType"].unique())
print("Column - 'gearbox':", autos["gearbox"].unique())
print("Column - 'NotRepairedDamage':", autos["notRepairedDamage"].unique())
Column - 'fuelType': ['lpg' 'benzin' 'diesel' nan 'cng' 'hybrid' 'elektro' 'andere'] Column - 'gearbox': ['manuell' 'automatik' nan] Column - 'NotRepairedDamage': ['nein' nan 'ja']
### using the map() function to assign the English translation to these 3 columns ###
autos["fuelType"] = autos["fuelType"].map({
'lpg':"lpg",
'benzin':"petrol",
'diesel':"diesel",
'cng':"cng",
'hybrid':"hybrid",
'elektro':"electric",
'andere':"other"
})
autos["gearbox"] = autos["gearbox"].map({'manuell':"manual",'automatik':"automatic"})
autos["notRepairedDamage"] = autos["notRepairedDamage"].map({'nein':"no",'ja':"yes"})
### changing names style from camelcase to snakecase###
autos.columns = ['date_crawled', 'name', 'price_$', 'ab_test', 'vehicle_type', 'reg_year', 'gearbox', 'power_ps', 'model',
'odometer_km', 'reg_month', 'fuel_type', 'brand', 'unrepaired_damage', 'ad_created', 'postal_code', 'last_seen']
### rearranging columns ###
autos = autos[['brand', 'model', 'name', 'vehicle_type', 'price_$', 'odometer_km', 'reg_year', 'reg_month', 'gearbox',
'power_ps', 'fuel_type', 'ab_test', 'unrepaired_damage', 'postal_code', 'ad_created', 'date_crawled',
'last_seen']]
autos.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 49998 entries, 0 to 49999 Data columns (total 17 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 brand 49998 non-null object 1 model 47240 non-null object 2 name 49998 non-null object 3 vehicle_type 44903 non-null object 4 price_$ 49998 non-null int32 5 odometer_km 49998 non-null int32 6 reg_year 49998 non-null int64 7 reg_month 49998 non-null int64 8 gearbox 47319 non-null object 9 power_ps 49998 non-null int64 10 fuel_type 45517 non-null object 11 ab_test 49998 non-null object 12 unrepaired_damage 40171 non-null object 13 postal_code 49998 non-null int64 14 ad_created 49998 non-null object 15 date_crawled 49998 non-null object 16 last_seen 49998 non-null object dtypes: int32(2), int64(4), object(11) memory usage: 6.5+ MB
0 brand: The brand of the car.
1 model: The car model name.
2 name: Name of the car.
3 vehicle_type: The vehicle type.
4 price_$: The price on the advertisement to sell the car.
5 odometer_km: How many kilometers the car has been driven.
6 reg_year: The year in which the car was first registered.
7 reg_month: The month in which the car was first registered.
8 gearbox: The transmission type.
9 power_ps: The power of the car in PS.
10 fuel_type: What type of fuel the car uses.
11 ab_test: Whether the listing is included in an A/B test.
12 unrepaired_damage: If the car has a damage which is not yet repaired.
13 postal_code: The postal code for the location of the vehicle.
14 ad_created: The date on which the eBay listing was created.
15 date_crawled: When this advertisement was first crawled. All field-values are taken from this date.
16 last_seen: When the crawler saw this advertisement last online.
Rows: (Initial Count: 50,000) (Previous Count:50,000) (Current Count: 49,998)
Part 2.2: 'seller' with entry for "gewerblich"
Part 2.2: 'offerType' with entry for "Gesuch".
Columns: (Initial Count: 20) (Previous Count:20) (Current Count: 17)
Part 2.1: ''nr_of_pictures' with one unique value: '0'.
Part 2.2: 'seller' left with one unique value: 'privat'.
Part 2.2: 'offerType' left with one unique value: 'Angebot'
autos.loc[autos["reg_year"]<1986].shape
(924, 17)
### from the dataset, removing antique vehicles ###
autos = autos.drop(autos.loc[autos["reg_year"]<1986].index)
len(autos[autos["reg_year"]>2016])
1966
### from the dataset, removing rows with vehicle 'reg_year' beyond 2016. ###
autos = autos.drop(autos[autos["reg_year"]>2016].index)
### checking for vehicles with 'reg_year' year 2016 and 'reg_month' beyond April ###
reg_month_above_4 = autos[(autos["reg_month"]==2016) & (autos["reg_month"] >4)]
print("Number of listings with registration month indicated beyond April:",
len(reg_month_above_4))
Number of listings with registration month indicated beyond April: 0
### printing non-null values, unique values and listing counts for the column: 'reg_month' ###
print("reg_month 49998 non-null int64")
print("Unique entries for 'reg_month' is", len(autos["reg_month"].unique()))
autos["reg_month"].value_counts()
reg_month 49998 non-null int64 Unique entries for 'reg_month' is 13
3 4810 0 4383 6 4128 5 3887 4 3880 7 3684 10 3519 12 3305 9 3261 11 3247 1 3076 8 3049 2 2879 Name: reg_month, dtype: int64
### printing non-null values, unique, min and max values for 'odometer_km' column ###
print("odometer_km 49998 non-null int64")
print("Unique entries:", len(autos["odometer_km"].unique()))
print("Minimum value:", autos["odometer_km"].min())
print("Maximum value:", autos["odometer_km"].max())
### extracting value_count() ###
autos["odometer_km"].value_counts().sort_values()
odometer_km 49998 non-null int64 Unique entries: 13 Minimum value: 5000 Maximum value: 150000
10000 229 20000 737 30000 746 40000 779 5000 805 50000 974 60000 1091 70000 1153 80000 1353 90000 1654 100000 2008 125000 4880 150000 30699 Name: odometer_km, dtype: int64
### counting entries by registration year for cars driven 5,000km ###
print(autos.loc[autos["odometer_km"]==5000, ["reg_year"]].value_counts().head(15))
print('\n')
### counting entries by registration year for cars driven 10,000km ###
print(autos.loc[autos["odometer_km"]==10000, ["reg_year"]].value_counts().head(15))
print('\n')
### counting entries by registration year for cars driven 20,000km ###
print(autos.loc[autos["odometer_km"]==20000, ["reg_year"]].value_counts().head(15))
reg_year 2000 107 2015 89 2016 86 1995 50 2005 48 1998 38 2002 35 1999 35 2001 35 1990 35 1997 32 2004 24 2014 23 1996 20 2007 19 dtype: int64 reg_year 2015 113 2014 41 2013 17 2011 6 2010 6 2009 6 2012 5 2000 4 1986 3 1988 3 1995 3 2005 2 2007 2 2006 2 2016 2 dtype: int64 reg_year 2014 188 2015 149 2013 74 2012 70 2000 24 2011 23 2010 21 2005 19 2009 17 1999 15 2007 14 2001 14 1998 12 1995 11 2002 10 dtype: int64
### printing number of rows and columns for ###
### vehicles driven 5,000km and are registered before 2014 ###
print("Vehicles driven 5,00 km and registered before 2014 - rows & columns:",
autos[(autos["odometer_km"] == 5000) & (autos["reg_year"] < 2014)].shape)
### vehicles driven 10,000km and are registered before 2009 ###
print("Vehicles driven 10,00 km and registered before 2009 - rows & columns:",
autos[(autos["odometer_km"] == 10000) & (autos["reg_year"] < 2009)].shape)
### vehicles driven 20,000km and are registered before 2009 ###
print("Vehicles driven 20,00 km and registered before 2009 - rows & columns:",
autos[(autos["odometer_km"] == 20000) & (autos["reg_year"] < 2009)].shape)
Vehicles driven 5,00 km and registered before 2014 - rows & columns: (607, 17) Vehicles driven 10,00 km and registered before 2009 - rows & columns: (33, 17) Vehicles driven 20,00 km and registered before 2009 - rows & columns: (187, 17)
### from the dataset, removing rows with vehicles driven 5,000km and are registered before 2014 ###
autos = autos.drop(autos[(autos["odometer_km"] <= 5000) & (autos["reg_year"] < 2014)].index)
### from the dataset, removing rows with vehicles driven 10,000km and are registered before 2009 ###
autos = autos.drop(autos[(autos["odometer_km"] <= 10000) & (autos["reg_year"] < 2009)].index)
### from the dataset, removing rows with vehicles driven 20,000km and are registered before 2009 ###
autos = autos.drop(autos[(autos["odometer_km"] <= 20000) & (autos["reg_year"] < 2009)].index)
### printing non-null values, unique, min and max values for 'price_$' column ###
print("price_$ 49998 non-null int64")
print("Unique entries:", len(autos["price_$"].unique()))
print("Minimum value:", autos["price_$"].min())
print("Maximum value:", autos["price_$"].max())
price_$ 49998 non-null int64 Unique entries: 2301 Minimum value: 0 Maximum value: 99999999
### extracting number of rows with price between $0 - $99 and assigning it to a variable: price_below_100 ###
price_below_100 = autos.loc[(autos["price_$"] >=0) & (autos["price_$"] <= 99), "price_$"]
print("Number of entries with price between $0-99:", len(price_below_100))
### counting values in the variable: below_100 ###
print(price_below_100.value_counts().sort_values(ascending=False).head(10))
Number of entries with price between $0-99: 1404 0 1149 1 124 50 38 99 18 80 12 70 7 60 6 90 5 75 5 40 3 Name: price_$, dtype: int64
### extracting number of rows with price between $100 and assigning it to a variable: price_equal_100 ###
price_equal_100 = autos[autos["price_$"] == 100]
print("Number of entries with price equal to $100:", len(price_equal_100))
Number of entries with price equal to $100: 117
### from the dataset, removing rows with vehicle price below $100 ###
autos = autos.drop(autos[autos["price_$"] <= 99].index)
### extracting number of rows with price above 500k and assigning it to a variable: price_above_500k ###
price_above_500k = autos.loc[autos["price_$"] > 500000,['brand', 'model', 'vehicle_type', 'reg_year', 'price_$']]
print("Number of rows with price above $500,000:", len(price_above_500k))
print('\n')
print(price_above_500k)
Number of rows with price above $500,000: 7 brand model vehicle_type reg_year price_$ 514 ford focus kombi 2009 999999 7814 sonstige_autos NaN coupe 1992 1300000 22947 bmw NaN kombi 1999 1234566 24384 volkswagen NaN NaN 1995 11111111 39705 mercedes_benz s_klasse limousine 1999 99999999 42221 citroen c4 limousine 2014 27322222 47598 opel vectra limousine 2001 12345678
### from the dataset, removing rows with vehicle price above $500,000 ###
autos = autos.drop(autos[autos["price_$"] > 500000].index)
Rows: (Initial Count: 50,000) (Previous Count:50,000) (Current Count: 49,998)
Part 2.2: 'seller' with entry for "gewerblich"
Part 2.2: 'offerType' with entry for "Gesuch".
Rows: (Initial Count: 50,000)(Previous Count:49,998) (Current Count: 44,870)
Part 4.1: antique vehicles (registered before 1986): 924 rows.
Part 4.2: vehicle registration year indicated beyond 2016: 1,966 rows.
Part 4.4: vehicles driven 5,00 km and registered before 2014: 607 rows.
Part 4.4: vehicles driven 10,00 km and registered before 2009: 33 rows.
Part 4.4: vehicles driven 20,00 km and registered before 2009: 187 rows.
Part 4.5: vehicle priced between $0-99: 1,404 rows.
Part 4.5: vehicles prices above 500,000: 7 rows
Columns: No change (Initial Count: 20)(Previous Count:17) (Current Count: 17)
print("Number of rows and columns in the dataframe:", autos.shape)
### checking for accuracy of the number of rows ###
print(len(autos) == 49998 - (924 + 1966 + 607 + 33 + 187 + 1404 + 7))
Number of rows and columns in the dataframe: (44870, 17) True
### counting number of listings based on the registration year and converting them to percentage ###
distr_ = autos["reg_year"].value_counts(normalize=True).sort_values(ascending=False)*100
distr_
2000 6.739470 2005 6.369512 1999 6.307109 2003 5.961667 2004 5.946066 2006 5.894807 2001 5.754402 2002 5.424560 1998 5.110319 2007 5.001114 2008 4.891910 2009 4.622242 1997 4.214397 2011 3.592601 2010 3.521284 1996 2.966347 2012 2.903945 2016 2.678850 1995 2.560731 2013 1.769556 2014 1.475373 1994 1.346111 1993 0.922665 2015 0.846891 1992 0.793403 1991 0.744373 1990 0.693113 1989 0.372186 1988 0.276354 1987 0.153778 1986 0.144863 Name: reg_year, dtype: float64
### extracting percentage of vehicles that are 5 years old or younger ###
distr_above_2010 = distr_[distr_.index>2010]
print("Vehicles 5 years old or younger",
sum(distr_above_2010),
"%")
### extracting percentage of vehicles that are 5-9 years old ###
distr_btw_2006_2010 = distr_[(distr_.index>=2006) & (distr_.index<=2010)]
print("Vehicles 5-9 years old:",
sum(distr_btw_2006_2010),
"%")
### extracting percentage of vehicles that are 10 years old or older ###
distr_below_2006 = distr_[distr_.index<2006]
print("Vehicles 10 years old or older:",
sum(distr_below_2006),
"%")
Vehicles 5 years old or younger 13.267216402941832 % Vehicles 5-9 years old: 23.931357254290173 % Vehicles 10 years old or older: 62.801426342767996 %
### extracting percentage of vehicles that are 9 years old or older ###
print("Vehicles 9 years old or older:",
sum(distr_[distr_.index<=2006]),
"%")
### extracting percentage of vehicles that are 20 years old or older ###
print("Vehicles 20 years old or older:",
sum(distr_[distr_.index<1996]),
"%")
Vehicles 9 years old or older: 68.69623356362827 % Vehicles 20 years old or older: 8.00757744595498 %
### For the 'ad_created', 'date_craweled' and 'last_seen':###
### - extracting the years and months only using the str[] function;###
### - spliting the years and months to 2 seperate columns; and###
### - converting them to integer ###
ad_year = autos["ad_created"].str[:7].str.split("-", expand=True,).astype(int)
ad_year.columns = ["yr_created", "month_created"]
ad_year[["yr_crawled","month_crawled"]]= autos["date_crawled"].str[:7].str.split("-", expand=True,).astype(int)
ad_year[["yr_seen","month_seen"]]= autos["last_seen"].str[:7].str.split("-", expand=True,).astype(int)
print(ad_year.info())
ad_year.head()
<class 'pandas.core.frame.DataFrame'> Int64Index: 44870 entries, 0 to 49999 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 yr_created 44870 non-null int32 1 month_created 44870 non-null int32 2 yr_crawled 44870 non-null int32 3 month_crawled 44870 non-null int32 4 yr_seen 44870 non-null int32 5 month_seen 44870 non-null int32 dtypes: int32(6) memory usage: 1.4 MB None
yr_created | month_created | yr_crawled | month_crawled | yr_seen | month_seen | |
---|---|---|---|---|---|---|
0 | 2016 | 3 | 2016 | 3 | 2016 | 4 |
1 | 2016 | 4 | 2016 | 4 | 2016 | 4 |
2 | 2016 | 3 | 2016 | 3 | 2016 | 4 |
3 | 2016 | 3 | 2016 | 3 | 2016 | 3 |
4 | 2016 | 4 | 2016 | 4 | 2016 | 4 |
### For 'ad_created', 'date_craweled' and 'last_seen', ###
### - counting listings by year ###
### - counting listings by month. ###
print(ad_year[["yr_created", "month_created"]].value_counts())
print('\n')
print(ad_year[["yr_crawled", "month_crawled"]].value_counts())
print('\n')
print(ad_year[["yr_seen", "month_seen"]].value_counts())
yr_created month_created 2016 3 37540 4 7256 2 60 1 8 2015 12 2 6 1 8 1 9 1 11 1 dtype: int64 yr_crawled month_crawled 2016 3 37566 4 7304 dtype: int64 yr_seen month_seen 2016 4 25910 3 18960 dtype: int64
### counting number of listings based on the brand and converting them to percentage ###
brand = autos["brand"].value_counts(normalize=True)*100
brand
volkswagen 21.201248 bmw 11.123245 opel 10.719857 mercedes_benz 9.538667 audi 8.872298 ford 6.948964 renault 4.773791 peugeot 3.046579 fiat 2.527301 seat 1.863160 skoda 1.664809 nissan 1.555605 mazda 1.542233 smart 1.446401 citroen 1.415199 toyota 1.290394 hyundai 1.029641 mini 0.900379 volvo 0.900379 mitsubishi 0.826833 honda 0.788946 kia 0.726543 alfa_romeo 0.641854 sonstige_autos 0.639626 suzuki 0.599510 porsche 0.554936 chevrolet 0.517049 chrysler 0.347671 dacia 0.274125 daihatsu 0.247381 jeep 0.229552 subaru 0.218409 land_rover 0.207265 saab 0.167150 jaguar 0.151549 daewoo 0.147092 rover 0.138177 lancia 0.102518 trabant 0.066860 lada 0.046802 Name: brand, dtype: float64
### extracting brand that has more than 5% listings in total ###
brand_listing_more_5perc = brand[brand > 5].index
brand_listing_more_5perc
Index(['volkswagen', 'bmw', 'opel', 'mercedes_benz', 'audi', 'ford'], dtype='object')
### create empty dictionary ###
dic_brand_avg_price = {}
### loop over each brand that has more than 5% listings ###
### assign each brand to the empty dictionary as the key ###
### for each brands, calculate the average price ###
### assign the average price for that brand to the emptry dictionary as the value ###
for brand in brand_listing_more_5perc:
avg_price = autos.loc[autos["brand"] == brand, "price_$"].mean()
dic_brand_avg_price[brand] = int(avg_price)
dic_brand_avg_price
{'volkswagen': 5436, 'bmw': 8375, 'opel': 3001, 'mercedes_benz': 8556, 'audi': 9419, 'ford': 3490}
### create empty dictionary ###
dic_brand_avg_mileage = {}
### loop over each brand in dic_brand_avg_price ###
### assign each brand to the empty dictionary as the key ###
### for each brands, calculate the average mileage ###
### assign the average mileage for that brand to the emptry dictionary as the value ###
for brand in dic_brand_avg_price:
avg_mileage = autos.loc[autos["brand"]==brand, "odometer_km"].mean()
dic_brand_avg_mileage[brand] = int(avg_mileage)
dic_brand_avg_mileage
{'volkswagen': 130877, 'bmw': 134713, 'opel': 132144, 'mercedes_benz': 132517, 'audi': 130106, 'ford': 127134}
### assigning dic_brand_avg_price to a variable: brand_avg_price ###
brand_avg_price = pd.Series(dic_brand_avg_price).sort_values(ascending=False)
### assigning dic_brand_avg_mileage to a variable: brand_avg_mileage ###
brand_avg_mileage = pd.Series(dic_brand_avg_mileage).sort_values(ascending=False)
### converting brand_avg_price to dataframe ###
brand_price_mileage = pd.DataFrame(brand_avg_price, columns = ['avg_price'])
### to that dataframe, adding avg_mileage column ###
brand_price_mileage["avg_mileage"] = brand_avg_mileage
brand_price_mileage
avg_price | avg_mileage | |
---|---|---|
audi | 9419 | 130106 |
mercedes_benz | 8556 | 132517 |
bmw | 8375 | 134713 |
volkswagen | 5436 | 130877 |
ford | 3490 | 127134 |
opel | 3001 | 132144 |
### using groupby() function to count the listings by brand and model and assigning it to a variable: 'brand_model' ###
brand_model = autos.groupby(["brand","model"]).size().nlargest(6)
brand_model
brand model volkswagen golf 3613 bmw 3er 2545 volkswagen polo 1558 opel corsa 1541 volkswagen passat 1326 opel astra 1319 dtype: int64
### using groupby() function to count the listings by 'brand', 'model' and 'gearbox' and ###
### assigning it to a variable: brand_model_gear ###
brand_model_gearbox = autos.groupby(["brand","model", "gearbox"]).size().sort_values(ascending=False).head(6)
### calculating the percentage of 'brand_model_gearbox' against 'brand_model' ###
brand_model_gearbox_perc = (brand_model_gearbox/brand_model) * 100
brand_model_gearbox_perc.sort_values(ascending=False).astype(int)
brand model gearbox volkswagen polo manual 92 opel corsa manual 91 astra manual 89 volkswagen golf manual 84 bmw 3er manual 76 volkswagen passat manual 75 dtype: int32
### using groupby() function to count the listings by 'brand', 'model', 'gearbox' and 'adometer_km' and ###
### assigning it to a variable: brand_model_gearbox_mileage ###
brand_model_gearbox_mileage =autos.groupby(["brand","model", "gearbox", "odometer_km"]).size().sort_values(ascending=False).head(6)
### calculating the percentage of 'brand_model_gearbox_mileage' against 'brand_model_gearbox' ###
brand_model_gearbox_mileage_per = (brand_model_gearbox_mileage/brand_model_gearbox) * 100
brand_model_gearbox_mileage_per.sort_values(ascending=False).astype(int)
brand model gearbox odometer_km volkswagen passat manual 150000 85 bmw 3er manual 150000 84 volkswagen golf manual 150000 71 opel astra manual 150000 70 volkswagen polo manual 150000 66 opel corsa manual 150000 65 dtype: int32
### counting number of listings based on 'odometer_km' and assigning it to varaible: mileage ###
mileage = autos["odometer_km"].value_counts()
mileage
150000 29590 125000 4763 100000 1960 90000 1631 80000 1330 70000 1139 60000 1081 50000 959 40000 775 30000 736 20000 547 10000 192 5000 167 Name: odometer_km, dtype: int64
### extracting the index of mileage above and assigning them to a variable: mileage_ ###
mileage_ = mileage.index
### creating emptry dictionary ###
dic_mileage_avg_price = {}
### loop over each mileage / index and assigning it to the empty dictionary as the key ###
### calculate average price of all listings based on each mileage/index ###
### assigning the average price to the dictionary as the value ###
for mile in mileage_:
avg_price = autos.loc[autos["odometer_km"] == mile, "price_$"].mean()
dic_mileage_avg_price[mile] = int(avg_price)
### extracting the dictionary keys and value and presenting them in a new dataframe format ###
mileage_avg_price = pd.Series(dic_mileage_avg_price).sort_values(ascending=False)
mileage_avg_price = pd.DataFrame(mileage_avg_price, columns = ["avg_price"])
mileage_avg_price
avg_price | |
---|---|
5000 | 28708 |
10000 | 24005 |
20000 | 21884 |
30000 | 16710 |
40000 | 15617 |
50000 | 13823 |
60000 | 12350 |
70000 | 10946 |
80000 | 9768 |
90000 | 8373 |
100000 | 7900 |
125000 | 6213 |
150000 | 3742 |
### using groupby() function to calculate the average difference in price ###
autos.groupby('unrepaired_damage')['price_$'].mean().astype(int)
unrepaired_damage no 7085 yes 2252 Name: price_$, dtype: int32